Incident and outage response
(Section: Operations and Management)
1) Definitions and objectives
Incident - an event that violates an SLO, security, or compliance requirement, or that puts customers, money, data, or reputation at risk.
Response goals: restore the service quickly, minimize damage, preserve evidence, communicate transparently, and prevent recurrence.
Key principles
Safety first: protecting people, data, and money takes priority over features.
One throat to choke: a single Incident Commander (IC) makes the decisions.
Actionable now: every hypothesis is followed immediately by a test or an action.
Evidence matters: everything is logged, artifacts are signed, the timeline is detailed.
2) Classification (severity & priority)
Trigger: SLO violation, alert rule, manual report, legal incident (DPO/CCO).
3) Roles and Responsibilities (RACI)
Incident Commander (A) - leads the incident, sets tasks, makes decisions; hands over the IC role during long incidents.
Tech Lead (R) - technical diagnostics and fixes, coordination of SRE/engineering.
Comms Lead (R) - writes internal and external status updates; owns the status page.
Scribe (R) - keeps the record and the timeline, collects artifacts.
Security/Legal (C/A for security cases) - risk assessment, mandatory notifications.
Customer Support (C) - response templates, ticket routing.
Partner Liaison (C) - communication with providers/tenants.
Management (I) - kept informed; makes business decisions (credits/compensation).
4) First 15 minutes (template)
1. Assign an IC and open an incident card (chat channel, video bridge, Jira/Tracker).
2. Assign a SEV and record the SLO symptom (what exactly is violated).
3. Stabilize first:
- engage runbooks/runes: circuit breakers, throttling, route switching, promo pause;
- in case of compromise - kill-switch sensitive functions.
4. Assignments: Tech Lead - diagnostics; Comms - holding statement (first update within 10-15 minutes).
5. Formulate hypotheses (three at most), assign owners, set verification timers (5-10 minutes).
6. Collect artifacts: metric snapshots, configs, release hashes, logs with 'trace_id', receipts.
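The hypothesis rule in the template (three at most, each with an owner and a 5-10 minute verification timer) can be enforced by the incident card itself. The following is a minimal sketch, not a prescribed implementation; the names `IncidentCard`, `Hypothesis`, `add_hypothesis`, and `overdue` are hypothetical and do not refer to any particular tracker.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

MAX_HYPOTHESES = 3  # the template allows at most three open hypotheses


@dataclass
class Hypothesis:
    text: str
    owner: str
    check_by: datetime  # verification deadline


@dataclass
class IncidentCard:
    title: str
    sev: str
    commander: str
    slo_symptom: str  # what exactly is violated
    opened_at: datetime
    hypotheses: list = field(default_factory=list)

    def add_hypothesis(self, text, owner, now, minutes=10):
        """Register a hypothesis with an owner and a 5-10 minute timer."""
        if not 5 <= minutes <= 10:
            raise ValueError("verification timer must be 5-10 minutes")
        if len(self.hypotheses) >= MAX_HYPOTHESES:
            raise ValueError("close a hypothesis before adding a fourth")
        hypothesis = Hypothesis(text, owner, now + timedelta(minutes=minutes))
        self.hypotheses.append(hypothesis)
        return hypothesis

    def overdue(self, now):
        """Hypotheses whose verification timer has expired."""
        return [h for h in self.hypotheses if now >= h.check_by]


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    card = IncidentCard("Checkout failures", "SEV1", "alice",
                        "checkout error rate above SLO", t0)
    card.add_hypothesis("PSP-1 timeouts", "bob", t0, minutes=5)
    print([h.owner for h in card.overdue(t0 + timedelta(minutes=5))])
```

The IC can poll `overdue` to ask owners for a verdict once their timers expire, which keeps the "actionable now" principle visible on the card.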
5) First hour (template)
Communication v1 (15-20 min): facts, scope, symptoms, what we are doing, time of the next update. No speculation.
Incident boundaries: which regions/tenants/channels/versions are affected.
Damage control: temporary caps/restrictions, disabling "noisy" integrations, enabling degraded mode.
Forensics: freeze log rotation, protect artifacts (WORM/signatures).
Recovery roadmap: T+30/T+60 with checkpoints.
6) Communications and status page
Internal update intervals: P1 - every 15 min, P2 - every 30-60 min.
External: status page/tenants/SLA partners.
- What is observed: "since XX:YY UTC, increased checkout failures in the EU region (p95 > 250 ms)"
- Affected: "operators A/B/C, ~40% of traffic"
- What we are doing: "enabled an alternative route, throttled promo; working with provider PSP-1"
- Timing: "next update in 15 minutes"
- Compensation: "credit notes will be applied per SLA after the incident is closed"
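The update fields above lend themselves to a fixed template, so that every message has the same shape and always states when the next update is due. The sketch below is one possible rendering, assuming the P1/P2 intervals from this section (30 minutes is used for P2 as the lower bound of 30-60); `StatusUpdate` and `render` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Minutes between updates; P2 uses the lower bound of the 30-60 range.
UPDATE_INTERVAL_MIN = {"P1": 15, "P2": 30}


@dataclass
class StatusUpdate:
    priority: str
    observed: str
    affected: str
    actions: str
    compensation: str = ""  # optional, usually filled in near closure

    def render(self, now):
        """Format the update and compute the time of the next one."""
        next_at = now + timedelta(minutes=UPDATE_INTERVAL_MIN[self.priority])
        lines = [
            f"[{self.priority}] Status as of {now:%H:%M} UTC",
            f"What is observed: {self.observed}",
            f"Affected: {self.affected}",
            f"What we are doing: {self.actions}",
            f"Next update: {next_at:%H:%M} UTC",
        ]
        if self.compensation:
            lines.append(f"Compensation: {self.compensation}")
        return "\n".join(lines)


if __name__ == "__main__":
    update = StatusUpdate(
        priority="P1",
        observed="increased checkout failures in the EU region",
        affected="operators A/B/C, ~40% of traffic",
        actions="enabled an alternative route, throttled promo",
    )
    print(update.render(datetime.now(timezone.utc)))
```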
7) Playbooks (references for iGaming/fintech)
PriceMismatch (storefront ≠ checkout): forced cache invalidation, reconciliation of 'fx_version'/'tax_rule_version', freeze of dynamic promos, compensation for discrepancies per policy.
WebhookLag (partners/affiliates): scale workers, increase batch size, prioritize retries, put a temporary cap on new subscriptions.
Payments Outage/PSP degradation: switch to a backup PSP, reduce client timeouts, clear queues manually, quarantine gray transactions.
RTP Drift: pause bonuses, check the paytable/version, extend the monitoring window, roll back the RTP profile.
Fraud Spike: tighten velocity/limits, enable additional KYC checks, isolate suspicious cohorts, manually review high winnings.
Data/PII Exposure: isolate systems, notify DPO/Legal, inventory the affected records, send regulatory notifications within the required deadlines.
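As an illustration of the "switch to a backup PSP" step, the sketch below picks a route from recent error rates. It is only a sketch under stated assumptions: the function `choose_psp`, the 20% threshold, and the shape of the `error_rates` mapping are all hypothetical, and a PSP with no data is treated as unhealthy.

```python
def choose_psp(error_rates, primary, backups, max_error_rate=0.2):
    """Return the first PSP whose recent error rate is acceptable.

    error_rates maps a PSP name to its recent error rate (0.0-1.0).
    A PSP missing from the mapping counts as unhealthy (rate 1.0).
    Returns None when no route is healthy, so the caller can queue or
    quarantine payments instead of routing them.
    """
    for psp in [primary, *backups]:
        if error_rates.get(psp, 1.0) <= max_error_rate:
            return psp
    return None


if __name__ == "__main__":
    rates = {"PSP-1": 0.45, "PSP-2": 0.03}
    print(choose_psp(rates, primary="PSP-1", backups=["PSP-2"]))
```

Keeping the primary first in the search order means traffic returns to it on its own once its error rate recovers; whether that is desirable, or whether switching back should need IC approval, is a policy choice.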
8) Tools and runes (auto-actions)
Buttons: Pause Promo, Re-Route, Raise Limit, Rollback, Flush Cache, Disable Webhooks, Enable Safe Mode.
Guardrails: protection against "saddling" - the number of rollbacks is limited, logs are signed, each action is linked to the IC/Scribe.
Provability: DSSE signatures, snapshot hashes, Merkle log slices.
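The guardrails can be pictured as an action log that refuses rollbacks beyond a limit and makes later tampering detectable. The sketch below uses a plain SHA-256 hash chain, a deliberately simpler stand-in for the DSSE signatures and Merkle slices named above; it involves no keys or signatures. The class `ActionLog`, its fields, and the limit of two rollbacks are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # "previous hash" of the first entry
FIELDS = ("action", "approved_by", "recorded_by", "prev")


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class ActionLog:
    def __init__(self, max_rollbacks=2):
        self.entries = []
        self.max_rollbacks = max_rollbacks

    def record(self, action, approved_by, recorded_by):
        """Append an action tied to the IC (approver) and the Scribe."""
        if action == "Rollback":
            used = sum(1 for e in self.entries if e["action"] == "Rollback")
            if used >= self.max_rollbacks:
                raise PermissionError("rollback limit reached; escalate to the IC")
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"action": action, "approved_by": approved_by,
                "recorded_by": recorded_by, "prev": prev}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            body = {key: entry[key] for key in FIELDS}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    log = ActionLog()
    log.record("Pause Promo", approved_by="ic-alice", recorded_by="scribe-bob")
    log.record("Rollback", approved_by="ic-alice", recorded_by="scribe-bob")
    print(log.verify())
```

A hash chain only shows that the log was not altered after the fact; proving who wrote it still requires signatures over the entries or over the final hash.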
9) End of incident
Criteria: SLO restored, queues drained, data/money reconciled, risks closed, communications sent.
Closing ritual: final status update, finalized timeline, list of impacts, preliminary root-cause hypotheses, post-mortem date set.
10) Post-mortem (blameless)
Deadline: P1 - within 3 working days; P2 - within 5 working days.
Content: facts/timeline, root causes (5 Whys/FRAM), impact (SLO, finance, customers), what worked and what did not, action items (owner, deadline, measurable effect).
Effectiveness check: after 30-60 days - review of action-item completion and metrics (recurrence, MTTR, alert noise).
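The post-mortem deadlines are stated in working days, which is easy to miscount around weekends. The following sketch computes the due date under a simplifying assumption: working days are Monday to Friday and public holidays are ignored. The function name `postmortem_due` is hypothetical.

```python
from datetime import date, timedelta

# Working days allowed for the post-mortem, per priority.
POSTMORTEM_WORKING_DAYS = {"P1": 3, "P2": 5}


def postmortem_due(closed_on, priority):
    """Date by which the post-mortem is due, counting Mon-Fri only.

    The closing day itself is not counted; holidays are ignored.
    """
    remaining = POSTMORTEM_WORKING_DAYS[priority]
    day = closed_on
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return day


if __name__ == "__main__":
    # A P1 incident closed on Thursday, 2024-01-04.
    print(postmortem_due(date(2024, 1, 4), "P1"))  # 2024-01-09
```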
11) Incident Management Metrics and SLOs
MTTD/MTTA/MTTR, Change Failure Rate, Time to Comms v1, % auto-resolved (runes).
Alert Noise: percentage of irrelevant signals, pages per on-call shift.
Repeat Incidents: proportion of repeats within 90 days.
Post-mortem SLA: proportion completed/closed on time.
Response SLOs: P1 - first communication ≤ 15 min; MTTR ≤ 60 min; artifact completeness = 100%.
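These metrics can be derived from a handful of timestamps per incident. The sketch below is one way to compute them and rests on assumptions the text does not fix: MTTD is measured from start to detection, MTTA from detection to acknowledgement, MTTR from start to resolution, and time to first communication from detection. The `Incident` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    kind: str  # e.g. "WebhookLag"
    started: datetime
    detected: datetime
    acknowledged: datetime
    first_comms: datetime
    resolved: datetime


def _minutes(delta):
    return delta.total_seconds() / 60


def summarize(incidents):
    """Mean detection, acknowledgement, comms and recovery times (minutes)."""
    return {
        "mttd_min": mean([_minutes(i.detected - i.started) for i in incidents]),
        "mtta_min": mean([_minutes(i.acknowledged - i.detected) for i in incidents]),
        "mttr_min": mean([_minutes(i.resolved - i.started) for i in incidents]),
        "comms_v1_min": mean([_minutes(i.first_comms - i.detected) for i in incidents]),
    }


def repeat_rate(incidents, window_days=90):
    """Share of incidents whose kind recurred within the window."""
    ordered = sorted(incidents, key=lambda i: i.started)
    last_seen = {}
    repeats = 0
    for inc in ordered:
        previous = last_seen.get(inc.kind)
        if previous is not None and inc.started - previous <= timedelta(days=window_days):
            repeats += 1
        last_seen[inc.kind] = inc.started
    return repeats / len(ordered) if ordered else 0.0


def meets_p1_slo(inc):
    """First communication within 15 min and recovery within 60 min."""
    return (_minutes(inc.first_comms - inc.detected) <= 15
            and _minutes(inc.resolved - inc.started) <= 60)
```

Whichever definitions are chosen, they should be written down next to the dashboard, since MTTR in particular is measured from different starting points in different organizations.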
12) Legal/Compliance/Privacy
Legal notices: local regulators' notification deadlines for leaks/incidents.
PII minimization: access to raw data only through approved jobs; tokenization/masking.
Artifact storage: WORM logs, retention period by jurisdiction; access control (RBAC/ABAC, JIT).
Counterparties: contractual SLAs, escalation process, proceedings receipts.
13) On-call organization and escalation
24×7 on-call: rotation by role (SRE, App, Data, Security, Payments).
Escalation matrix: who is responsible for which regions/products/providers; redundant contact channels (chat/voice/SMS).
Exercises (GameDays): simulations - PSP outage, retry storm, price mismatch, key compromise, region failure.
14) Incident dashboards
Live (now): SLO status, p95/p99, map of regions/tenants, task queue, artifacts collected/not collected.
History: trends by incident type, effectiveness of runes, recurrence of causes.
Quality control: timeline completeness, post-mortem coverage, compliance with communication SLAs.
15) Implementation checklist
- Approve SEV scale and SLO triggers.
- Assign roles (IC/Tech/Comms/Scribe/Sec/Legal) and 24×7 rotations.
- Launch a single incident card template and status page.
- Describe playbooks (PriceMismatch/WebhookLag/Payments/RTP/Fraud/PII).
- Implement runes with auditing and a red button.
- Enable WORM/Signatures/Artifact Collection.
- Define the communications procedure (internal/external) and update SLAs.
- Set up the post-mortem process and templates; KPI for action-item completion.
- Run GameDays monthly; review incident trends quarterly.
- Build a dashboard of IR metrics (MTTA/MTTR/Noise/Repeat/Comms SLA).
16) FAQ
Why a single IC?
A single decision point removes chaos and speeds up the response.
When to announce publicly?
As soon as there is a confirmed fact and a stabilization plan. Take regulatory deadlines into account.
What is more important - a fix or a report?
Recovery and safety come first. Artifacts are collected in parallel. The report comes after stabilization.
Is it possible to automate everything?
No, but runes cover the "frequent and simple" steps. The rest is handled through clear playbooks and drills.
Recap: strong incident response isn't just about PagerDuty and a chat channel. It is a discipline of roles, a fast first 15 minutes, controlled runes, transparent communications, provable forensics, and a mandatory post-mortem. With this framework, you reduce MTTR, protect money and data, and increase the confidence of customers and regulators.