Incident Management
(Section: Technology and Infrastructure)
Brief Summary
Incident management is a repeatable process for quickly restoring user value and minimizing business damage. It rests on clear roles (Incident Manager, Tech Lead, Comms), SLO gates, escalation paths, ChatOps processes, prepared runbooks, and blameless post-incident reviews with measurable action items.
1) Goals and principles
Speed and safety: rapid diagnosis → safe stabilization → sustained recovery.
Sole Owner - The assigned Incident Manager (IM) makes process decisions.
Communications as a product: predictable updates for stakeholders and users.
Data > opinions: SLO/metrics/traces/logs are the source of truth.
Blameless: analysis of reasons without personal accusations; focus on system improvements.
2) Classification of incidents (Severity/Impact/Urgency)
Severity (example):
- SEV1 (critical): severe damage to revenue/TTW/payments, > 20% of users or entire regions; SLA impaired / PII threat.
- SEV2 (high): partial degradation of key flows (deposit/bet/launch of games), impact 5-20%.
- SEV3 (medium): noticeable degradation of secondary services, there is a bypass.
- SEV4 (low): minor, limited effect, no effect on SLO/SLA.
Impact: who is affected (all/region/tenant/channel). Urgency: degradation rate (fast-burn/slow-burn on error budget).
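As a sketch, the example thresholds above can be encoded in a small classifier (the exact cut-offs are assumptions taken from the table; tune them to your own SLA):

```python
def classify_severity(user_impact_pct: float, key_flow: bool,
                      has_workaround: bool) -> str:
    """Map an incident's blast radius to a SEV level.

    Thresholds mirror the example table: >20% on a key flow is SEV1,
    5-20% is SEV2; secondary services with a bypass are SEV3.
    """
    if key_flow and user_impact_pct > 20:
        return "SEV1"
    if key_flow and user_impact_pct >= 5:
        return "SEV2"
    if has_workaround:
        return "SEV3"
    return "SEV4"
```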
3) Incident lifecycle
1. Detect - signal from alerts/SLO/synthetics/reports.
2. Acknowledge - on-call confirms reception, assigns IM.
3. Triage - SEV/Impact score, hypothesis collection, War-Room discovery.
4. Mitigate - stabilization (rollback/route switching/feature flags/scaling).
5. Communicate - regular status updates (inside/out).
6. Recover - Full SLO/business metrics recovery.
7. Close - recording of chronology, collection of artifacts, PIR (RCA + action items).
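The seven steps can be modeled as an explicit state machine so tooling can reject out-of-order transitions (a minimal sketch; "communicate" runs continuously throughout and is therefore not modeled as a separate state):

```python
# Allowed forward transitions through the incident lifecycle.
TRANSITIONS = {
    "detected": {"acknowledged"},
    "acknowledged": {"triaged"},
    "triaged": {"mitigated"},
    "mitigated": {"recovered"},
    "recovered": {"closed"},
}

def advance(state: str, nxt: str) -> str:
    """Move the incident to the next state, refusing skipped steps."""
    if nxt not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {nxt}")
    return nxt
```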
4) Roles and Responsibilities (RACI)
Incident Manager (IM) - process owner, assigns roles, monitors time, makes process decisions (R).
Technical Lead (TL) - conducts diagnostics/hypotheses/fixes, coordinates engineers (A/R).
Communications (Comms) - status updates, connection with support/business/PR, status page (R).
Scribe - protocol (timeline, decisions made, links, artifacts) (R).
Stakeholders - Product/Payments/Gaming Providers/Security (C/I).
Minimum for SEV1: IM + TL + Comms + Scribe. Combining roles is allowed for SEV2.
5) War-Room and ChatOps
Dedicated channels: '#incident-warroom-<id>' (working), '#incident-status' (updates only).
Template commands: '/incident start', '/status update', '/call <owner>', '/rollback', '/freeze', '/scale +N'.
The bot pulls up the context: recent releases, dashboards, related alerts, trace exemplars, dependency schemes.
Communication rules: brief, fact-based, one speaker (TL), IM moderates.
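A minimal parser for slash commands like the ones above might look like this (command names are the listed ones; the dispatch targets behind them are out of scope here):

```python
def parse_command(line: str) -> tuple[str, list[str]]:
    """Split a ChatOps line like '/incident start SEV1' into (name, args)."""
    parts = line.strip().split()
    if not parts or not parts[0].startswith("/"):
        raise ValueError("not a slash command")
    return parts[0].lstrip("/"), parts[1:]
```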
6) Triggers and gates
SLO gates: fast/slow burn, payment conversion drop, TTW p95 > threshold, API p99 ↑, payment queues backing up.
Auto actions: stop canary, rollback, enable degrade mode (limit features), enable high-frequency synthetics.
Freeze: all releases/database migrations halted until stabilization and PIR.
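Burn-rate gating can be sketched as follows; the 14.4/6 thresholds come from the common multiwindow burn-rate alerting pattern and are assumptions, not values from this document:

```python
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo_target)

def gate(fast: float, slow: float) -> str:
    """Decide whether to page based on fast- and slow-window burn rates."""
    if fast >= 14.4:
        return "page: fast burn"
    if slow >= 6.0:
        return "page: slow burn"
    return "ok"
```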
7) Typical scenarios (runbook patterns)
A) Payments: increase in timeouts/failures at PSP
1. Stop promote and freeze payment loop releases.
2. Switch the PSP route to standby, raise timeouts/retries per policy.
3. Reconcile incomplete transactions, retry with idempotency keys.
4. Comms → support: is the standby route working? ETA.
B) API p99↑ and 5xx after release
1. Rollback (blue-green/canary → stable).
2. Check cache hit rate, queue depth, database/game-provider hotspots.
3. Temporary scaling, limiting heavy features through feature flags.
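Step 3's limiting of heavy features via flags can be as simple as this in-memory sketch (the flag names are hypothetical; a real system would use a flag service):

```python
FLAGS = {"live_leaderboard": True, "heavy_reports": True}  # hypothetical flags

def degrade(disable: list[str]) -> dict:
    """Flip off expensive features during mitigation; returns the new state."""
    for name in disable:
        if name in FLAGS:
            FLAGS[name] = False
    return FLAGS
```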
C) Game provider unavailable
1. Switch traffic to available studios/games, show a status banner.
2. Turn on synthetic checks every 30-60 s.
3. Agree on compensation/bonuses (by policy) - add to PIR.
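The 30-60 s synthetic checks in step 2 amount to a polling loop; a sketch with the probe and sleep functions injected so the loop is testable:

```python
import time

def wait_for_recovery(probe, interval_s: int = 30, attempts: int = 10,
                      sleep=time.sleep) -> bool:
    """Poll a provider health probe until it succeeds or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(interval_s)
    return False
```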
D) PII leak (or suspected leak)
1. Isolate the component, revoke keys/tokens, preserve logs (WORM).
2. Align communications with legal/regulatory.
3. Post-incident actions: rotate secrets, review masking and access.
8) Communications (internal/external)
Update frequency: SEV1 - every 15-30 minutes, SEV2 - 30-60 minutes.
Internal status template:
- What's broken: "Deposits via PSP-X: timeouts rising."
- Affected: "TR/BR, ~18% of users."
- When it started: "12:07 EET, SEV1."
- What we're doing: "Switching route to PSP-Y; retries/rate cap enabled."
- Next update: "in 20 minutes."
- Contact: "IM @duty-im, TL @oncall-pay."
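The internal template renders mechanically; a sketch that keeps the field order above:

```python
def format_status(ts: str, sev: str, title: str, impact: str,
                  action: str, next_eta: str) -> str:
    """Render an internal status update from the template fields."""
    return (f"[{ts}] {sev} {title}\n"
            f"Impact: {impact}\n"
            f"Mitigation: {action}\n"
            f"ETA next update: {next_eta}")
```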
Public status (page/social networks) - abbreviated, without PII and unnecessary details, with ETA and a link to further updates.
9) Artifact collection and auditing
Event timeline (minute accuracy), service versions, feature flags, config changes.
Dashboard screenshots, exemplar traces (trace_id), logs before/during/after.
Links to tickets, PRs, releases, runbooks.
Communications report (when/to/what).
All of this goes into the incident card.
10) Closure and PIR (Post-Incident Review)
PIR format (short):
- Summary: what happened, scale, duration, SEV.
- Impact: users/regions, SLO/SLA, financial effect.
- Timeline: detailed, minute by minute.
- Root Cause: technical + organizational (why it wasn't detected earlier).
- Detections & Defenses: what helped/failed (alerts, synthetics, feature flags).
- Action Items: specific tasks, owners, deadlines (and how to verify the effect).
- Lessons Learned: what we change in process/architecture/observability.
Rules: no blame, maximum facts, mandatory follow-up after 2-4 weeks to verify completed items.
11) Process Reliability Metrics
MTTD (Mean Time To Detect) - from failure onset to detection.
MTTA (Mean Time To Acknowledge) - until on-call confirmation.
MTTR (Mean Time To Restore) - until SLO is restored.
Change Failure Rate - % of releases resulting in incidents.
Incident Rate by SEV, distribution by domain (Payments/Games/Infra).
Alert Quality: proportion of noisy/false alerts, time to action after an alert.
Comms SLA: adherence to the status-update cadence.
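MTTA and MTTR fall out of the timestamps recorded on incident cards; a minimal sketch of the averaging:

```python
from datetime import datetime

def mean_minutes(pairs: list[tuple[datetime, datetime]]) -> float:
    """Average gap in minutes between (start, end) timestamp pairs.

    MTTA uses (detected_at, acked_at) pairs;
    MTTR uses (detected_at, restored_at) pairs.
    """
    gaps = [(end - start).total_seconds() / 60.0 for start, end in pairs]
    return sum(gaps) / len(gaps)
```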
12) Integration with SLO and releases
Gates in CD: canary promotion only with green SLO proxies (availability, p95, conversion, TTW).
Freeze procedures: on fast-burn/SEV1, stop releases until PIR.
Auto annotations in graphs: releases/flags/migrations are visible on dashboards.
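A CD gate can check the SLO proxies before promoting a canary; the keys and thresholds below are illustrative assumptions, not values from a real pipeline:

```python
def can_promote(slo: dict) -> bool:
    """Green-gate for canary promotion: availability, latency, burn state.

    Missing metrics fail closed: no data means no promotion.
    """
    return (slo.get("availability", 0.0) >= 0.999
            and slo.get("p95_ms", float("inf")) <= 300
            and not slo.get("fast_burn", True))
```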
13) Regulatory and Compliance
PII: masking/pseudonymization in logs/traces, WORM audit stores, access control.
Regionality: Do not take user data outside of permitted jurisdictions.
Reporting: formalized letters/notifications to regulators - templates and escalation process.
14) Learning and Readiness (Game-Day)
Quarterly exercises: "PSP drop," "game provider unavailable," "p99 surge," "key leak."
Measure MTTA/MTTR during drills, run a retro after each exercise.
Update runbooks and contacts, verify ChatOps commands.
15) Readiness checklist (before incident)
1. SEV rules and escalation matrix agreed.
2. Assigned on-call rotations, IM/TL/Comms/Scribe.
3. Runbooks for key scenarios (payments, games, databases, caches, queues).
4. SLO card and burn-rate alerts, status page.
5. ChatOps bot: commands, auto-context, status templates.
6. PIR templates and incident cards.
7. Regular game-day and contact/rights revisions.
8. Freeze policy and "red button" (rollback/kill-switch).
16) Antipatterns
No single IM, "the crowd leads" → chaos and delays.
No SLO gates → late detection, noisy alerts.
Releasing during an incident without a freeze → cascading failures.
Insufficient logs and traces, no artifacts → weak PIR.
Blame culture → hidden mistakes, fear of escalation.
Opaque communications → loss of business/user trust.
17) Templates (copy to your wiki)
A) Incident Card (YAML)
id: INC-2025-11-005
title: PSP-X timeouts in TR/BR
sev: SEV1
start_at: 2025-11-05T12:07:00+02:00
status: active
impact: "Deposits via PSP-X failing for ~18% users (TR, BR)"
im: "@oncall-im"
tl: "@oncall-pay"
comms: "@oncall-comms"
scribe: "@oncall-scribe"
mitigations:
- "Reroute to PSP-Y"
- "Enable retries and raise timeouts"
next_update_in: "20m"
links:
grafana: "<dashboard-url>"
traces: "<tempo-link>"
logs: "<loki-query>"
runbook: "payments/psp_timeout"
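A card like the one above can be sanity-checked before filing; a sketch over an already-parsed dict (field names taken from the template; parsing the YAML itself would need a YAML library and is omitted):

```python
REQUIRED = {"id", "title", "sev", "start_at", "status",
            "impact", "im", "tl", "comms", "scribe"}

def missing_fields(card: dict) -> list[str]:
    """Return required incident-card fields absent from the parsed card."""
    return sorted(REQUIRED - card.keys())
```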
B) Status update (internal)
[12:25] SEV1 PSP-X timeouts — TR/BR
Impact: ~18% deposits affected. SLO fast-burn active.
Mitigation: Rerouting to PSP-Y; retries enabled; release freeze.
ETA next update: 12:45 EET
IM: @oncall-im TL: @oncall-pay
C) PIR (header)
Summary, Impact, Timeline, Root Cause (tech+org),
Detections/Defenses, Action Items (owner+due), Lessons Learned.
Summary
Strong incident management is structure + discipline: pre-agreed roles, SLO gates, rehearsed runbooks, transparent communications, and blameless PIRs. This loop reduces MTTA/MTTR, lowers the cost of downtime, builds user trust, and lets you release bolder - but safely.