Alerting and Failure Response
(Section: Technology and Infrastructure)
Brief Summary
Good alerting signals a violation of user value, not just a "red metric." For iGaming, what matters are SLO gates (latency, availability, payment conversion, Time-to-Wallet), multi-burn rules, clear on-call and escalation roles, ChatOps, and runbooks. The goal is to spot a deviation quickly, notify the people who can fix it, and capture the knowledge so that the next reaction is faster and cheaper.
1) The Basics: From Metrics to Action
SLI → SLO → Alert: measured quality → target level → "the error budget is burning" condition.
Severity (SEV): SEV1 - critical (revenue/GGR at risk), SEV2 - serious, SEV3 - moderate, SEV4 - minor.
Impact/Urgency: who is affected (everyone/region/tenant/channel) and how urgent it is (TTW↑, p99↑, error rate↑).
Actionability: every alert maps to a specific action (runbook + owner).
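To make actionability concrete, here is a minimal sketch of an alert rule that carries the severity, owner, and runbook that turn a firing condition into an action; the recording-rule name, team, and URL are illustrative placeholders, not definitions used elsewhere in this section.

```yaml
# Illustrative only: the recording rule, owner, and URL are placeholders.
- alert: DepositLatencyHigh
  expr: job:deposit_latency:p95_5m > 0.25   # assumes such a recording rule exists
  for: 10m
  labels:
    severity: "page"          # page → SEV1/SEV2, ticket → SEV3/SEV4
    service: "payments-api"
    owner: "payments-team"
  annotations:
    summary: "Deposit p95 latency above 250 ms"
    runbook: "https://runbooks/payments/latency"
```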
2) Signal taxonomy
Tech SLO: API p95/p99 latency, error rate, saturation (CPU/IO/GPU), queue lag.
Business SLO: payment conversion (attempt→success), Time-to-Wallet (TTW), bet placement success, game launch success.
Payment routes: PSP-specific metrics (timeout/decline spikes).
Front/mobile: RUM metrics (LCP/INP), crash rate, scenario synthetics (login/deposit/bet/withdrawal).
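For the scenario synthetics mentioned above, a common pattern is to page when a critical user journey keeps failing its probe. A minimal sketch, assuming a blackbox-exporter-style `probe_success` metric with an illustrative `scenario` label:

```yaml
# Page when the synthetic deposit journey has been mostly failing for 5 minutes.
# `probe_success` and the `scenario` label are assumptions about the synthetics
# exporter; adjust to whatever the probes actually expose.
- alert: SyntheticDepositJourneyFailing
  expr: avg_over_time(probe_success{job="synthetics", scenario="deposit"}[5m]) < 0.5
  for: 5m
  labels: { severity: "page", service: "payments-api" }
  annotations:
    summary: "Synthetic deposit journey failing"
    runbook: "https://runbooks/payments/synthetics"
```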
3) Alerting policy: SLO and burn-rate
SLI/SLO Examples
payments-api availability ≥ 99.9% / 30d
p95 of `/deposit` ≤ 250 ms / 30d
payments attempt→success conversion ≥ baseline − 0.3% / 24h
TTW p95 ≤ 3 min / 24h
Multi-window / Multi-burn (PromQL idea)
Fast burn: the error budget burns 5-10× faster than normal (page within 5-15 minutes).
Slow burn: gradual budget burn (ticket + analysis within 1-3 hours).
```yaml
# API success-ratio proxy metric (recording rule, computed in advance)
- record: job:http:success_ratio
  expr: |
    sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
    /
    sum by (job) (rate(http_requests_total[5m]))

# Fast burn (99.9% SLO)
- alert: PaymentsSLOFastBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > 14 * (1 - 0.999)
  for: 10m
  labels: { severity: "page", service: "payments-api" }
  annotations:
    summary: "SLO fast burn (payments-api)"
    runbook: "https://runbooks/payments/slo"

# Slow burn
- alert: PaymentsSLOSlowBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > 6 * (1 - 0.999)
  for: 1h
  labels: { severity: "ticket", service: "payments-api" }
```
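The rules above each use a single evaluation window. The classic multi-window variant requires the same burn rate on both a short and a long window, which filters out brief blips; a sketch, under the assumption that a second 1-hour recording rule (not part of the rules above) is added:

```yaml
# Longer-window success ratio (an assumed addition to the recording rules above)
- record: job:http:success_ratio_1h
  expr: |
    sum by (job) (rate(http_requests_total{status=~"2..|3.."}[1h]))
    /
    sum by (job) (rate(http_requests_total[1h]))

# Fast burn fires only when both the 5m and the 1h window burn at ~14x
- alert: PaymentsSLOFastBurnMultiWindow
  expr: |
    (1 - job:http:success_ratio{job="payments-api"}) > 14 * (1 - 0.999)
    and
    (1 - job:http:success_ratio_1h{job="payments-api"}) > 14 * (1 - 0.999)
  labels: { severity: "page", service: "payments-api" }
  annotations:
    summary: "SLO fast burn confirmed on 5m and 1h windows (payments-api)"
    runbook: "https://runbooks/payments/slo"
```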
4) Noise reduction and signal quality
The right source of truth: alert on aggregates (recording rules), not on heavy "raw" expressions.
Deduplication: Alertmanager groups by 'service/region/severity'.
Hierarchy: alert first on business/SLI signals; technical metrics sit below as diagnostics.
Suppression: during planned maintenance/releases (annotations) and during upstream incidents (see the sketch after this list).
Cardinality: do not use 'user_id'/'session_id' in alert labels.
Test alerts: regular "training" triggers (verifying channels, roles, and runbook links).
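As one way to implement the suppression item, recent Alertmanager versions support mute time intervals; a sketch with an illustrative weekly maintenance window (the receiver is assumed to be defined as in the routing example of section 5):

```yaml
# Mute ticket-level payments alerts during an illustrative weekly maintenance
# window; ad-hoc silences remain the tool for one-off releases.
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ["sunday"]
        times:
          - start_time: "03:00"
            end_time: "05:00"

route:
  routes:
    - matchers: [ service="payments-api", severity="ticket" ]
      receiver: sre-slack          # receiver defined as in the section 5 example
      mute_time_intervals: [ weekly-maintenance ]
```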
5) Alertmanager Routing and Escalation
```yaml
route:
  group_by: [service, region]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: sre-slack
  routes:
    - matchers: [ severity="page" ]
      receiver: pagerduty-sre
      continue: true
    - matchers: [ service="payments-api" ]
      receiver: payments-slack

receivers:
  - name: pagerduty-sre
    pagerduty_configs:
      - routing_key: <PD_KEY>
        severity: "critical"
  - name: sre-slack
    slack_configs:
      - channel: "#alerts-sre"
        send_resolved: true
        title: "{{ .CommonLabels.service }} {{ .CommonLabels.severity }}"
        text: "Runbook: {{ .CommonAnnotations.runbook }}"

inhibit_rules:
  - source_matchers: [ severity="page" ]
    target_matchers: [ severity="ticket" ]
    equal: [ "service" ]
```
Idea: severity=page → PagerDuty/SMS; everything else goes to Slack/ticket. Inhibition suppresses lower-level noise while a higher-severity alert for the same service is active.
6) Grafana Alerting (as an additional layer)
Centralized alert rules on top of dashboards (Prometheus/Loki/Cloud).
Contact points: PagerDuty/Slack/Email, Notification policies per folder.
Silences: planned work, migrations, releases.
Snapshots with an automatic screenshot of the panel attached to the ticket.
7) On-call and operational processes
Rotation: 1st line (SRE/platform), 2nd line (service owner), 3rd line (DB/Payments/Security).
Response SLAs: acknowledgement ≤ 5 min (SEV1), diagnosis ≤ 15 min, communication every 15-30 min.
Duty channels: '#incident-warroom', '#status-updates' (facts only).
Runbooks: a link in every alert + ChatOps quick commands ('/rollback', '/freeze', '/scale').
Training alerts: monthly (verifying people, channels, and runbook relevance).
8) Incidents: Life Cycle
1. Detection (alert/user report/synthetics) → on-call acknowledges.
2. Triage: determine SEV, affected scope, and a hypothesis; open a war-room.
3. Stabilization: roll forward/rollback, scaling, feature flags.
4. Communications: status template (see below), ETA/next steps.
5. Closure: confirm SLO recovery.
6. Post-Incident Review (RCA): within 24-72 hours, blameless, with action items.
Status update template:
- What is broken and who is affected (region/tenant/channel)
- When it started / current SEV
- Temporary measures (mitigation)
- Next status update in N minutes
- Contact (Incident Manager)
9) iGaming specifics: pain points and alerts
Payments/TTW: share of PSP timeouts, growth in decline codes, TTW p95 > 3 min (see the sketch after this list).
Tournament peaks: API p99, game launch time, queue lag; raise limits/auto-scale in advance.
Withdrawals: SLA of back-office/manual checks, limits by country.
Game providers: availability by studio, session initialization time, drop in game launches.
RG/Compliance: bursts of long sessions / loss chasing, threshold breaches; not a page but a ticket + notification to the RG team.
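For the Payments/TTW item above, a TTW alert might look like the sketch below, assuming wallet-credit latency is exported as a histogram named `wallet_credit_duration_seconds` (the metric name is an assumption about instrumentation):

```yaml
# Page when p95 time from successful payment to wallet credit exceeds 3 minutes.
# The histogram name is an assumption about how TTW is instrumented.
- alert: TimeToWalletP95High
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(wallet_credit_duration_seconds_bucket[10m]))
    ) > 180
  for: 10m
  labels: { severity: "page", service: "payments-worker" }
  annotations:
    summary: "TTW p95 > 3 min"
    runbook: "https://runbooks/payments/ttw"
```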
10) Rule examples (optional)
High latency p95 (API)
```yaml
- alert: HighLatencyP95
  expr: |
    histogram_quantile(0.95,
      sum by (le, service) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
    ) > 0.25
  for: 10m
  labels: { severity: "page", service: "api" }
  annotations:
    summary: "p95 latency > 250ms"
    runbook: "https://runbooks/api/latency"
```
Withdrawals queue lag
```yaml
- alert: WithdrawalsQueueLag
  expr: max_over_time(queue_lag_seconds{queue="withdrawals"}[10m]) > 300
  for: 10m
  labels: { severity: "page", service: "payments-worker" }
  annotations:
    summary: "Withdrawals lag > 5m"
    runbook: "https://runbooks/payments/queue"
```
Payment Conversion Dipped
```yaml
- alert: PaymentConversionDrop
  expr: |
    (sum(rate(payments_success_total[15m])) / sum(rate(payments_attempt_total[15m])))
    < (payment_conv_baseline - 0.003)
  for: 20m
  labels: { severity: "page", domain: "payments" }
  annotations:
    summary: "Payment conversion below baseline − 0.3%"
    runbook: "https://runbooks/payments/conversion"
```
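In the same optional spirit, the RG/Compliance signal from section 9 can be expressed as a ticket-level rule rather than a page; a sketch assuming a gauge of sessions currently exceeding the RG duration threshold (the metric name and the threshold value are assumptions):

```yaml
# Ticket (not page) when the number of sessions above the RG duration threshold
# spikes; the metric name and the value 50 are illustrative assumptions.
- alert: RGLongSessionsSpike
  expr: sum(rg_sessions_over_duration_threshold) > 50
  for: 30m
  labels: { severity: "ticket", domain: "rg" }
  annotations:
    summary: "Spike in sessions exceeding the RG duration threshold"
    runbook: "https://runbooks/rg/long-sessions"
```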
11) ChatOps and Automation
Auto-posting alerts with action buttons: Stop canary, Rollback, Scale +N (see the sketch below).
Command shortcuts: '/incident start', '/status update', '/call …'.
Bots pull in context: the latest deploy, the dependency graph, exemplars, related tickets.
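A hedged sketch of such action buttons using the `actions` field of Alertmanager's `slack_configs`; this could also serve as one possible definition of the `payments-slack` receiver referenced in section 5, and the button URLs point at hypothetical internal automation endpoints:

```yaml
# One possible definition of the payments-slack receiver from section 5,
# with action buttons; the button URLs are hypothetical automation endpoints.
- name: payments-slack
  slack_configs:
    - channel: "#alerts-payments"
      send_resolved: true
      title: "{{ .CommonLabels.service }} {{ .CommonLabels.severity }}"
      text: "Runbook: {{ .CommonAnnotations.runbook }}"
      actions:
        - type: button
          text: "Stop canary"
          url: "https://deploy.example.internal/canary/stop"
        - type: button
          text: "Rollback"
          url: "https://deploy.example.internal/rollback"
          style: "danger"
```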
12) Post-Incident Work (RCA)
Facts: timeline, what was seen and tried, what worked.
Root cause: technical and organizational reasons.
Detection & defenses: which signals helped and which failed.
Action items: specific tasks (SLOs/alerts/code/limits/tests/runbooks).
Due dates & owners: deadlines and responsibility; a follow-up session in 2-4 weeks.
13) Implementation checklist
1. Define SLIs/SLOs for key flows (API/payments/games/TTW).
2. Set up recording rules, multi-burn alerts, and Alertmanager routing.
3. Introduce on-call with rotation, response SLAs, and escalation.
4. Link alerts to runbooks and ChatOps commands.
5. Configure suppression/mute windows and release/maintenance annotations.
6. Run training alerts and game-day scenarios (PSP outage, p99 spike, queue lag growth).
7. Measure alert quality: MTTA/MTTR, % noisy/false alerts, SLO coverage.
8. Hold regular RCAs and revise thresholds/processes.
9. Introduce status communication for business/support (templates).
10. Document everything as code: rules, routes, runbook links (see the CI sketch below).
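To support items 2 and 10, rule and routing files kept in a repository can be linted in CI with `promtool` and `amtool`; a sketch in a GitHub-Actions-style workflow, assuming a repo layout with `rules/` and `alertmanager/alertmanager.yml` and that both tools are available on the runner:

```yaml
# CI sketch (GitHub-Actions-style): lint rules and routing on every change.
# Paths and the presence of promtool/amtool on the runner are assumptions.
name: monitoring-config-checks
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check Prometheus rules
        run: promtool check rules rules/*.yml
      - name: Check Alertmanager config
        run: amtool check-config alertmanager/alertmanager.yml
```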
14) Anti-patterns
Alerting on every metric → alert fatigue and ignored alerts.
No SLO → unclear what is "normal" and what is "on fire."
No suppression/inhibition → an avalanche of duplicates.
Paging at night for minor events (SEV not matched to impact).
Alerts without runbook/owner.
"Manual" actions without ChatOps/auditing.
No RCA/action items → repeated incidents.
Summary
Alerting and response is a process, not a set of rules. Link SLOs to multi-burn alerts, build clear on-call and escalation, add ChatOps and living runbooks, and run RCAs and training sessions regularly. Incidents then become rarer, shorter, and cheaper, and releases more predictable even during iGaming's hottest hours.