Incident Management
(Section: Technology and Infrastructure)
Brief Summary
Incident management is a repeatable process for quickly restoring user value and minimizing business damage. It rests on clear roles (Incident Manager, Tech Lead, Comms), SLO gates, escalation paths, ChatOps processes, prepared runbooks, and blameless post-incident reviews with measurable action items.
1) Goals and principles
Speed and safety: rapid diagnosis → safe stabilization → sustained recovery.
Sole Owner - The assigned Incident Manager (IM) makes process decisions.
Communications as a product: predictable updates for stakeholders and users.
Data > opinions: SLO/metrics/traces/logs are the source of truth.
Blameless: analysis of reasons without personal accusations; focus on system improvements.
2) Classification of incidents (Severity/Impact/Urgency)
Severity (example):
- SEV1 (critical): severe damage to revenue/TTW/payments, > 20% of users or entire regions; SLA impaired / PII threat.
- SEV2 (high): partial degradation of key flows (deposit/bet/launch of games), impact 5-20%.
- SEV3 (medium): noticeable degradation of secondary services, there is a bypass.
- SEV4 (low): minor, limited effect, no effect on SLO/SLA.
Impact: who is affected (all/region/tenant/channel). Urgency: degradation rate (fast-burn/slow-burn on error budget).
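As a sketch, the example thresholds above can be encoded in a small classifier (the exact cut-offs are assumptions taken from the table; tune them to your own SLA):

```python
def classify_severity(user_impact_pct: float, key_flow: bool,
                      has_workaround: bool) -> str:
    """Map an incident's blast radius to a SEV level.

    Thresholds mirror the example table: >20% on a key flow is SEV1,
    5-20% is SEV2; secondary services with a bypass are SEV3.
    """
    if key_flow and user_impact_pct > 20:
        return "SEV1"
    if key_flow and user_impact_pct >= 5:
        return "SEV2"
    if has_workaround:
        return "SEV3"
    return "SEV4"
```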
3) Incident lifecycle
1. Detect - signal from alerts/SLO/synthetics/reports.
2. Acknowledge - on-call confirms reception, assigns IM.
3. Triage - SEV/Impact score, hypothesis collection, War-Room discovery.
4. Mitigate - stabilization (rollback/route switching/feature flags/scaling).
5. Communicate - regular status updates (inside/out).
6. Recover - Full SLO/business metrics recovery.
7. Close - recording of chronology, collection of artifacts, PIR (RCA + action items).
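The seven steps can be modeled as an explicit state machine so tooling can reject out-of-order transitions (a minimal sketch; "communicate" runs continuously throughout and is therefore not modeled as a separate state):

```python
# Allowed forward transitions through the incident lifecycle.
TRANSITIONS = {
    "detected": {"acknowledged"},
    "acknowledged": {"triaged"},
    "triaged": {"mitigated"},
    "mitigated": {"recovered"},
    "recovered": {"closed"},
}

def advance(state: str, nxt: str) -> str:
    """Move the incident to the next state, refusing skipped steps."""
    if nxt not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {nxt}")
    return nxt
```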
4) Roles and Responsibilities (RACI)
Incident Manager (IM) - process owner, assigns roles, monitors time, makes process decisions (R).
Technical Lead (TL) - conducts diagnostics/hypotheses/fixes, coordinates engineers (A/R).
Communications (Comms) - status updates, connection with support/business/PR, status page (R).
Scribe - protocol (timeline, decisions made, links, artifacts) (R).
Stakeholders - Product/Payments/Gaming Providers/Security (C/I).
Minimum for SEV1: IM + TL + Comms + Scribe. Combining roles is allowed for SEV2.
5) War-Room and ChatOps
Dedicated channels: '#incident-warroom-<id>' (working), '#incident-status' (updates only).
Template commands: '/incident start', '/status update', '/call <owner>', '/rollback', '/freeze', '/scale +N'.
The bot pulls up the context: recent releases, dashboards, related alerts, trace exemplars, dependency schemes.
Communication rules: brief, fact-based, one speaker (TL), IM moderates.
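A minimal parser for slash commands like the ones above might look like this (command names are the listed ones; the dispatch targets behind them are out of scope here):

```python
def parse_command(line: str) -> tuple[str, list[str]]:
    """Split a ChatOps line like '/incident start SEV1' into (name, args)."""
    parts = line.strip().split()
    if not parts or not parts[0].startswith("/"):
        raise ValueError("not a slash command")
    return parts[0].lstrip("/"), parts[1:]
```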
6) Triggers and gates
SLO gates: fast/slow burn, payment conversion drop, TTW p95 > threshold, API p99 ↑, payment queues backing up.
Auto actions: stop canary, rollback, enable degrade mode (limit features), enable high-frequency synthetics.
Freeze: all releases/database migrations halted until stabilization and PIR.
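Burn-rate gating can be sketched as follows; the 14.4/6 thresholds come from the common multiwindow burn-rate alerting pattern and are assumptions, not values from this document:

```python
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo_target)

def gate(fast: float, slow: float) -> str:
    """Decide whether to page based on fast- and slow-window burn rates."""
    if fast >= 14.4:
        return "page: fast burn"
    if slow >= 6.0:
        return "page: slow burn"
    return "ok"
```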
7) Typical scenarios (runbook patterns)
A) Payments: increase in timeouts/failures at PSP
1. Stop promote and freeze payment loop releases.
2. Switch the PSP route to standby, raise timeouts/retries per policy.
3. Reconcile incomplete transactions, retry with idempotency keys.
4. Comms → support: is the standby route working? ETA.
B) API p99↑ and 5xx after release
1. Rollback (blue-green/canary → stable).
2. Check cache hit rate, queue depth, database/game-provider hotspots.
3. Temporary scaling, limiting heavy features through feature flags.
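Step 3's limiting of heavy features via flags can be as simple as this in-memory sketch (the flag names are hypothetical; a real system would use a flag service):

```python
FLAGS = {"live_leaderboard": True, "heavy_reports": True}  # hypothetical flags

def degrade(disable: list[str]) -> dict:
    """Flip off expensive features during mitigation; returns the new state."""
    for name in disable:
        if name in FLAGS:
            FLAGS[name] = False
    return FLAGS
```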
C) Game provider unavailable
1. Switch traffic to available studios/games, show a status banner.
2. Turn on synthetic checks every 30-60 s.
3. Agree on compensation/bonuses (by policy) - add to PIR.
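The 30-60 s synthetic checks in step 2 amount to a polling loop; a sketch with the probe and sleep functions injected so the loop is testable:

```python
import time

def wait_for_recovery(probe, interval_s: int = 30, attempts: int = 10,
                      sleep=time.sleep) -> bool:
    """Poll a provider health probe until it succeeds or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(interval_s)
    return False
```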
D) PII leak (or suspected leak)
1. Isolate the component, revoke keys/tokens, preserve logs (WORM).
2. Align communications with legal/regulatory.
3. Post-incident actions: rotate secrets, review masking and access.
8) Communications (internal/external)
Update frequency: SEV1 - every 15-30 minutes, SEV2 - 30-60 minutes.
Internal status template:
- What's broken: "Deposits via PSP-X: timeouts rising."
- Affected: "TR/BR, ~18% of users."
- When it started: "12:07 EET, SEV1."
- What we're doing: "Switching route to PSP-Y; retries/rate cap enabled."
- Next update: "in 20 minutes."
- Contact: "IM @duty-im, TL @oncall-pay."
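The internal template renders mechanically; a sketch that keeps the field order above:

```python
def format_status(ts: str, sev: str, title: str, impact: str,
                  action: str, next_eta: str) -> str:
    """Render an internal status update from the template fields."""
    return (f"[{ts}] {sev} {title}\n"
            f"Impact: {impact}\n"
            f"Mitigation: {action}\n"
            f"ETA next update: {next_eta}")
```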
Public status (page/social networks) - abbreviated, without PII and unnecessary details, with ETA and a link to further updates.
9) Artifact collection and auditing
Event timeline (minute accuracy), service versions, feature flags, config changes.
Dashboard screenshots, exemplar traces (trace_id), logs before/during/after.
Links to tickets, PRs, releases, runbooks.
Communications report (when/to/what).
All of this goes into the incident card.
10) Closure and PIR (Post-Incident Review)
PIR format (short):
- Summary: what happened, scale, duration, SEV.
- Impact: users/regions, SLO/SLA, financial effect.
- Timeline: detailed, minute by minute.
- Root Cause: technical + organizational (why it wasn't detected earlier).
- Detections & Defenses: what helped/failed (alerts, synthetics, feature flags).
- Action Items: specific tasks, owners, deadlines (and how to verify the effect).
- Lessons Learned: what we change in process/architecture/observability.
Rules: no blame, maximum facts, mandatory follow-up after 2-4 weeks to verify completed items.
11) Process Reliability Metrics
MTTD (Mean Time To Detect) - from failure onset to detection.
MTTA (Mean Time To Acknowledge) - until on-call confirmation.
MTTR (Mean Time To Restore) - until SLO is restored.
Change Failure Rate - % of releases resulting in incidents.
Incident Rate by SEV, distribution by domain (Payments/Games/Infra).
Alert Quality: proportion of noisy/false alerts, time to action after an alert.
Comms SLA: adherence to the status-update cadence.
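MTTA and MTTR fall out of the timestamps recorded on incident cards; a minimal sketch of the averaging:

```python
from datetime import datetime

def mean_minutes(pairs: list[tuple[datetime, datetime]]) -> float:
    """Average gap in minutes between (start, end) timestamp pairs.

    MTTA uses (detected_at, acked_at) pairs;
    MTTR uses (detected_at, restored_at) pairs.
    """
    gaps = [(end - start).total_seconds() / 60.0 for start, end in pairs]
    return sum(gaps) / len(gaps)
```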
12) Integration with SLO and releases
Gates in CD: canary promotion only with green SLO proxies (availability, p95, conversion, TTW).
Freeze procedures: on fast-burn/SEV1, stop releases until PIR.
Auto annotations in graphs: releases/flags/migrations are visible on dashboards.
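A CD gate can check the SLO proxies before promoting a canary; the keys and thresholds below are illustrative assumptions, not values from a real pipeline:

```python
def can_promote(slo: dict) -> bool:
    """Green-gate for canary promotion: availability, latency, burn state.

    Missing metrics fail closed: no data means no promotion.
    """
    return (slo.get("availability", 0.0) >= 0.999
            and slo.get("p95_ms", float("inf")) <= 300
            and not slo.get("fast_burn", True))
```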
13) Regulatory and Compliance
PII: masking/pseudonymization in logs/traces, WORM audit stores, access control.
Regionality: Do not take user data outside of permitted jurisdictions.
Reporting: formalized letters/notifications to regulators - templates and escalation process.
14) Learning and Readiness (Game-Day)
Quarterly exercises: "PSP drop," "game provider unavailable," "p99 surge," "key leak."
Measure MTTA/MTTR during drills, run a retro after each exercise.
Update runbooks and contacts, verify ChatOps commands.
15) Readiness checklist (before incident)
1. SEV rules and escalation matrix agreed.
2. Assigned on-call rotations, IM/TL/Comms/Scribe.
3. Runbooks for key scenarios (payments, games, databases, caches, queues).
4. SLO card and burn-rate alerts, status page.
5. ChatOps bot: commands, auto-context, status templates.
6. PIR templates and incident cards.
7. Regular game-day and contact/rights revisions.
8. Freeze policy and "red button" (rollback/kill-switch).
16) Antipatterns
No single IM, "the crowd leads" → chaos and delays.
No SLO gates → late detection, noisy alerts.
Releasing during an incident without a freeze → cascading failures.
Insufficient logs and traces, no artifacts → weak PIR.
Blame culture → hidden mistakes, fear of escalation.
Opaque communications → loss of business/user trust.
17) Templates (copy to your wiki)
A) Incident Card (YAML)
id: INC-2025-11-005
title: PSP-X timeouts in TR/BR
sev: SEV1
start_at: 2025-11-05T12:07:00+02:00
status: active
impact: "Deposits via PSP-X failing for ~18% users (TR, BR)"
im: "@oncall-im"
tl: "@oncall-pay"
comms: "@oncall-comms"
scribe: "@oncall-scribe"
mitigations:
- "Reroute to PSP-Y"
- "Enable retries and raise timeouts"
next_update_in: "20m"
links:
grafana: "<dashboard-url>"
traces: "<tempo-link>"
logs: "<loki-query>"
runbook: "payments/psp_timeout"
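A card like the one above can be sanity-checked before filing; a sketch over an already-parsed dict (field names taken from the template; parsing the YAML itself would need a YAML library and is omitted):

```python
REQUIRED = {"id", "title", "sev", "start_at", "status",
            "impact", "im", "tl", "comms", "scribe"}

def missing_fields(card: dict) -> list[str]:
    """Return required incident-card fields absent from the parsed card."""
    return sorted(REQUIRED - card.keys())
```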
B) Status update (internal)
[12:25] SEV1 PSP-X timeouts — TR/BR
Impact: ~18% deposits affected. SLO fast-burn active.
Mitigation: Rerouting to PSP-Y; retries enabled; release freeze.
ETA next update: 12:45 EET
IM: @oncall-im TL: @oncall-pay
C) PIR (header)
Summary, Impact, Timeline, Root Cause (tech+org),
Detections/Defenses, Action Items (owner+due), Lessons Learned.
Summary
Strong incident management is structure + discipline: pre-agreed roles, SLO gates, rehearsed runbooks, transparent communications, and blameless PIRs. This loop reduces MTTA/MTTR, lowers the cost of downtime, builds user trust, and lets you release bolder - but safely.