Incident and accident response

(Section: Operations and Management)

1) Definitions and objectives

Incident - an event that violates an SLO, security or compliance requirement, or creates a risk to customers, money, data or reputation.
Goals of the response: restore the service quickly, minimize damage, preserve evidence, communicate transparently and prevent recurrence.

Key principles

Safety first: protecting people, data and money takes priority over features.
One throat to choke: A single Incident Commander (IC) makes decisions.
Actionable now: each hypothesis is followed by a test/action.
Evidence matters: everything is logged, artifacts are signed, the timeline is detailed.

2) Classification (severity & priority)

SEV	Signs	MTTR objective	Examples
P1 / SEV-0	Massive unavailability/loss of money/PII leak	≤ 60 min	Checkout failing; personal data leak; incorrect debits
P2 / SEV-1	Severe degradation/part of a region affected	≤ 4 h	Webhook lag; out-of-sync prices; high provider error rate
P3 / SEV-2	Local degradation/growing error rate	≤ 24 h	Partner queue overload; spike in fraud signals
P4 / SEV-3	Minor bugs/risk trends	As planned	Metric deviations; outdated certificates

Triggers: SLO violation, alert rule, manual report, legal incident (DPO/CCO).
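
As a rough illustration, the scale above can be kept as data so that every incident carries a severity and an MTTR objective. This is only a sketch: the `Severity` type, the boolean flags and the `classify` helper are hypothetical names, and only the labels and time limits come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    label: str
    mttr_objective: Optional[timedelta]  # None stands for "as planned"


SEVERITIES = {
    "P1": Severity("P1 / SEV-0", timedelta(minutes=60)),
    "P2": Severity("P2 / SEV-1", timedelta(hours=4)),
    "P3": Severity("P3 / SEV-2", timedelta(hours=24)),
    "P4": Severity("P4 / SEV-3", None),
}


def classify(*, pii_leak: bool = False, money_loss: bool = False,
             mass_outage: bool = False, severe_degradation: bool = False,
             local_degradation: bool = False) -> Severity:
    """Return the highest severity whose signs are present (flags are assumptions)."""
    if pii_leak or money_loss or mass_outage:
        return SEVERITIES["P1"]
    if severe_degradation:
        return SEVERITIES["P2"]
    if local_degradation:
        return SEVERITIES["P3"]
    return SEVERITIES["P4"]
```

For example, `classify(severe_degradation=True, local_degradation=True)` yields the P2 / SEV-1 entry, because the most severe matching sign wins.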

3) Roles and Responsibilities (RACI)

Incident Commander (A) - leads the incident, sets tasks, makes decisions; the IC is handed over during long incidents.
Tech Lead (R) - technical diagnostics/fixes, SRE/engineering coordination.
Comms Lead (R) - writes status updates (internal/external), owns the status page.
Scribe (R) - keeps the minutes and the timeline, collects artifacts.
Security/Legal (C/A for security cases) - risk assessment, mandatory notifications.
Customer Support (C) - response templates, ticket routing.
Partner Liaison (C) - communication with providers/tenants.
Management (I) - kept informed; makes business decisions (credits/compensation).

4) First 15 minutes (template)

1. Assign an IC and open an incident card (chat channel, video bridge, Jira/Tracker).
2. Assign a SEV and record the SLO symptom (what exactly is violated).
3. Stabilize:
   enable runbooks/auto-actions: circuit breakers, throttling, route switching, promo pause;
   in case of compromise - use the kill switch for sensitive functions.
4. Assignments: Tech Lead - diagnostics; Comms Lead - a holding statement (the first update follows within 10-15 minutes).
5. Identify hypotheses (three maximum), assign owners, set verification timers (5-10 minutes).
6. Collect artifacts: snapshots of metrics, configs, release hashes, logs with 'trace_id', receipts.
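
The steps above could be backed by a small incident card structure that enforces the "three hypotheses maximum" rule and the 5-10 minute verification timers. This is a minimal sketch under stated assumptions: `IncidentCard`, `Hypothesis` and every field name are hypothetical, and a real card would live in the tracker rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List

MAX_HYPOTHESES = 3  # step 5: "three maximum"


@dataclass
class Hypothesis:
    text: str
    owner: str
    check_by: datetime  # moment the verification timer expires


@dataclass
class IncidentCard:
    commander: str
    severity: str
    slo_symptom: str
    opened_at: datetime
    hypotheses: List[Hypothesis] = field(default_factory=list)

    @property
    def first_update_due(self) -> datetime:
        # Step 4: the first update follows within 10-15 minutes; the upper bound is used.
        return self.opened_at + timedelta(minutes=15)

    def add_hypothesis(self, text: str, owner: str, now: datetime,
                       minutes: int = 10) -> Hypothesis:
        if len(self.hypotheses) >= MAX_HYPOTHESES:
            raise ValueError("close a hypothesis before opening a fourth")
        if not 5 <= minutes <= 10:
            raise ValueError("verification timer should be 5-10 minutes")
        hyp = Hypothesis(text, owner, now + timedelta(minutes=minutes))
        self.hypotheses.append(hyp)
        return hyp


opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
card = IncidentCard("alice", "P1", "checkout error rate above SLO", opened)
card.add_hypothesis("PSP-1 timeouts", "bob", now=opened, minutes=5)
```

Passing `now` explicitly keeps the timers testable; in practice it would be the current UTC time.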

5) First hour (template)

Communication v1 (15-20 min): facts, scope, symptoms, what we are doing, time of the next update. No speculation.
Incident boundaries: which regions/tenants/channels/versions are affected.
Damage control: temporary caps/restrictions, disconnection of "noisy" integrations, activation of degradation mode.
Forensics: freeze log rotation, protect artifacts (WORM/signatures).
Recovery roadmap: T+30/T+60 with checkpoints.

6) Communications and status page

Internal intervals: P1 - every 15 min, P2 - 30-60 min.
External: status page/tenants/SLA partners.

Message template:

What is observed: "since XX:YY UTC, increased checkout failures in the EU region (p95 > 250 ms)"
Affected: "operators A/B/C, ~40% of traffic"
What we are doing: "enabled an alternative route, throttled promos; working with provider PSP-1"
Timing: "next update in 15 minutes"
Compensation: "credit notes will be applied per SLA after the incident is closed"
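
A template like this is easy to render from structured fields so that every update has the same shape and always states the next update time. The sketch below is an assumption-laden illustration: `render_update` and `UPDATE_INTERVAL_MIN` are hypothetical names, and the P2 interval uses the lower bound of the 30-60 minute range.

```python
from datetime import datetime, timedelta, timezone

# Internal update intervals in minutes; P2 is "30-60 min", the lower bound is assumed here.
UPDATE_INTERVAL_MIN = {"P1": 15, "P2": 30}


def render_update(severity: str, observed: str, affected: str,
                  actions: str, now: datetime) -> str:
    """Build a status update and compute when the next one is due."""
    next_update = now + timedelta(minutes=UPDATE_INTERVAL_MIN[severity])
    lines = [
        f"What is observed: {observed}",
        f"Affected: {affected}",
        f"What we are doing: {actions}",
        f"Next update: {next_update:%H:%M} UTC",
    ]
    return "\n".join(lines)


message = render_update(
    "P1",
    observed="increased checkout failures in the EU region",
    affected="operators A/B/C, ~40% of traffic",
    actions="alternative route enabled; working with provider PSP-1",
    now=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```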

7) Playbooks (references for iGaming/fintech)

PriceMismatch (storefront ≠ checkout): forced cache invalidation, reconciliation of 'fx_version'/'tax_rule_version', freeze of dynamic promos, compensation for the discrepancy per policy.
WebhookLag (partners/affiliates): scaling workers, increasing batch size, prioritized retries, temporary cap on new subscriptions.
Payments Outage/PSP degradation: switching to a backup PSP, reducing client timeouts, manual queue clearing, quarantining "gray" transactions.
RTP Drift: bonus pause, paytable/version check, extended monitoring window, RTP profile rollback.
Fraud Spike: tighten velocity checks/limits, enable additional KYC checks, isolate suspicious cohorts, manually review high winnings.
Data/PII Exposure: system isolation, DPO/Legal notification, inventory of affected records, regulatory notifications by timeline.
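
To make the "switching to a backup PSP" step concrete, here is a minimal sketch of a per-provider circuit breaker with priority-based routing. The `PspCircuit` class, the `pick_psp` function, the window size and the error threshold are all assumptions for illustration; the sketch also omits a half-open state, so a tripped circuit only closes again once newer successful results push the failures out of the window.

```python
from collections import deque
from typing import Deque, Dict, List


class PspCircuit:
    """Tracks the most recent call results for one payment provider."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5):
        self.results: Deque[bool] = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def is_open(self) -> bool:
        # Do not trip until the window is full, to avoid reacting to a single failure.
        if len(self.results) < (self.results.maxlen or 0):
            return False
        errors = self.results.count(False)
        return errors / len(self.results) > self.max_error_rate


def pick_psp(priority: List[str], circuits: Dict[str, PspCircuit]) -> str:
    """Return the first provider in priority order whose circuit is closed."""
    for name in priority:
        if not circuits[name].is_open:
            return name
    raise RuntimeError("all PSPs degraded; escalate to the IC")


circuits = {"PSP-1": PspCircuit(window=4), "PSP-2": PspCircuit(window=4)}
for ok in (False, False, True, False):  # 75% errors on the primary provider
    circuits["PSP-1"].record(ok)
route = pick_psp(["PSP-1", "PSP-2"], circuits)  # falls back to "PSP-2"
```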

8) Tools and auto-actions ("runes")

Buttons: Pause Promo, Re-Route, Raise Limit, Rollback, Flush Cache, Disable Webhooks, Enable Safe Mode.
Guard rails: protection against "saddling" (rollbacks are limited), signed logs, and every action tied to the IC/Scribe.
Provability: DSSE signatures, snapshot hashes, Merkle log slices.
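
As one possible illustration of "Merkle log slices", the sketch below computes a Merkle root over a batch of log entries, so that a single hash can later prove the batch was not altered. It is a simplification under stated assumptions: the function name is hypothetical, odd levels are padded by duplicating the last node (a common shortcut that differs from the RFC 6962 construction), and DSSE signing of the root is not shown.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(entries: List[bytes]) -> str:
    """Return the hex Merkle root of the given log entries."""
    if not entries:
        raise ValueError("at least one log entry is required")
    # Distinct prefixes for leaves and inner nodes prevent a leaf posing as a node.
    level = [_h(b"\x00" + entry) for entry in entries]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()


log_slice = [b"12:00 IC assigned", b"12:03 reroute enabled", b"12:10 promo paused"]
root = merkle_root(log_slice)
```

Changing any entry in `log_slice` changes `root`, which is what makes the stored root useful as evidence.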

9) End of incident

Criteria: SLO restored, queues drained, data/money reconciled, risks closed, communications sent.
Closing ritual: final status update, finalized timeline, list of impacts, preliminary hypotheses about causes, post-mortem date set.

10) Post-mortem (blameless)

Deadline: P1 - within 3 working days; P2 - within 5 working days.
Content: facts/timeline, root causes (5 Whys/FRAM), impact (SLO, finance, customers), what worked and what did not, action items (owner, due date, measurable effect).
Effectiveness check: after 30-60 days - review of action-item completion and metrics (recurrence, MTTR, alert noise).
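
The post-mortem deadlines are counted in working days, which is easy to get wrong around weekends. A minimal sketch of the calculation, assuming a Monday-Friday working week and ignoring public holidays (the function and constant names are hypothetical):

```python
from datetime import date, timedelta

POSTMORTEM_WORKING_DAYS = {"P1": 3, "P2": 5}


def postmortem_due(closed_on: date, severity: str) -> date:
    """Return the post-mortem due date, counting Monday-Friday only."""
    remaining = POSTMORTEM_WORKING_DAYS[severity]
    day = closed_on
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return day


# An incident closed on Friday, 1 March 2024:
p1_due = postmortem_due(date(2024, 3, 1), "P1")  # Wednesday, 6 March 2024
p2_due = postmortem_due(date(2024, 3, 1), "P2")  # Friday, 8 March 2024
```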

11) Incident Management Metrics and SLOs

MTTD/MTTA/MTTR, Change Failure Rate, Time to Comms v1, % auto-resolved (by auto-actions).
Alert Noise: percentage of irrelevant signals, pages per on-call shift.
Repeat Incidents: Proportion of repeats in 90 days.
Post-mortem SLA: proportion of completed/closed on time.
Response SLOs: P1 - first communication ≤ 15 min; MTTR ≤ 60 min; artifact completeness = 100%.
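
The time-based metrics above can be derived from four timestamps per incident. The sketch below is illustrative only: the `IncidentTimes` fields are hypothetical, and it assumes MTTR is measured from the start of impact to resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class IncidentTimes:
    started: datetime       # impact began
    detected: datetime      # alert fired or report received
    acknowledged: datetime  # on-call accepted the page
    resolved: datetime      # SLO restored


def _mean_minutes(deltas: List[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: List[IncidentTimes]) -> Dict[str, float]:
    """Mean time to detect, acknowledge and resolve, in minutes."""
    return {
        "mttd_min": _mean_minutes([i.detected - i.started for i in incidents]),
        "mtta_min": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "mttr_min": _mean_minutes([i.resolved - i.started for i in incidents]),
    }


day = datetime(2024, 1, 1)
history = [
    IncidentTimes(day.replace(hour=10), day.replace(hour=10, minute=4),
                  day.replace(hour=10, minute=6), day.replace(hour=10, minute=50)),
    IncidentTimes(day.replace(hour=12), day.replace(hour=12, minute=6),
                  day.replace(hour=12, minute=10), day.replace(hour=13, minute=10)),
]
stats = summarize(history)  # MTTD 5 min, MTTA 3 min, MTTR 60 min
```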

12) Law/Compliance/Privacy

Legal notices: notification deadlines set by local regulators for leaks/incidents.
PII minimization: access to primary data only through approved jobs; tokenization/masking.
Artifact storage: WORM logs, retention period by jurisdiction; access control (RBAC/ABAC, JIT).
Counterparties: contractual SLAs, escalation process, records of proceedings.
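
As a small illustration of tokenization and masking for incident artifacts, the sketch below replaces an identifier with a keyed token and masks an email address before it is shared in an incident channel. The function names and the 16-character token length are assumptions; key management and any token-to-value lookup are out of scope.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: the same value and key always give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an assumption, not a requirement


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"
    return f"{local[0]}***@{domain}"


key = b"demo-key-not-for-production"
token = tokenize("player-42", key)
masked = mask_email("alice@example.com")  # "a***@example.com"
```

Because the token is deterministic, the same player can be correlated across log lines without exposing the raw identifier.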

13) On-call organization and escalation

24×7 on-call: rotation by role (SRE, App, Data, Security, Payments).
Escalation matrix: who is responsible for which regions/products/providers; redundant contact channels (chat/voice/SMS).
Exercises (GameDays): simulations - PSP outage, retry storm, price mismatch, key compromise, region failure.
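
An escalation matrix can be as simple as a lookup table with wildcard fallbacks. The sketch below is hypothetical throughout: the regions, products and contact names are placeholders, and the "most specific match wins" rule is an assumption about how such a matrix might be resolved.

```python
from typing import Dict, List, Tuple

# (region, product) -> ordered escalation chain; "*" matches anything.
ESCALATION: Dict[Tuple[str, str], List[str]] = {
    ("EU", "payments"): ["payments-oncall", "sre-oncall", "head-of-payments"],
    ("*", "payments"): ["payments-oncall", "sre-oncall"],
    ("*", "*"): ["sre-oncall"],
}


def escalation_chain(region: str, product: str) -> List[str]:
    """Return the most specific chain: exact match, then product-wide, then default."""
    for key in ((region, product), ("*", product), ("*", "*")):
        if key in ESCALATION:
            return ESCALATION[key]
    return []


eu_chain = escalation_chain("EU", "payments")
default_chain = escalation_chain("US", "casino")  # falls back to ["sre-oncall"]
```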

14) Incident dashboards

Live (now): SLO status, p95/p99, map of regions/tenants, task queue, artifacts collected/missing.
History: trends by incident type, effectiveness of auto-actions, recurrence of causes.
Quality control: timeline completeness, post-mortem coverage, communications SLA.

15) Implementation checklist

Approve SEV scale and SLO triggers.
Assign roles (IC/Tech/Comms/Scribe/Sec/Legal) and rotations 24 × 7.
Launch a single incident card template and status page.
Describe playbooks (PriceMismatch/WebhookLag/Payments/RTP/Fraud/PII).
Implement auto-actions with an audit trail and a red button.
Enable WORM/Signatures/Artifact Collection.
Define the communications procedure (internal/external) and the SLA for updates.
Set up the post-mortem process and templates; track a KPI for action-item completion.
Run GameDays monthly; review incident trends quarterly.
Build a dashboard of IR metrics (MTTA/MTTR/Noise/Repeat/Comms SLA).

16) FAQ

Why a single IC?
A single decision point removes chaos and accelerates reactions.

When to announce publicly?
As soon as there is a confirmed fact and a stabilization plan. Take regulatory deadlines into account.

What is more important - a fix or a report?
Recovery and security come first, with artifact collection running in parallel. The report follows after stabilization.

Is it possible to automate everything?
No, but auto-actions cover the "frequent and simple" steps. The rest relies on clear playbooks and regular drills.

Recap: strong incident response isn't just about PagerDuty and a chat channel. It is a discipline of clear roles, a fast first 15 minutes, controlled auto-actions, transparent communications, provable forensics and a mandatory post-mortem. With this framework in place, you reduce MTTR, protect money and data, and increase customer and regulator confidence.
