Incident simulations
1) Why run simulations
Incident simulations are safe exercises in which the team practices detection, diagnosis, escalation and recovery using real playbooks. They:
- lower MTTD/MTTA/MTTR and increase confidence in rollbacks and failovers;
- identify process gaps (escalation, communications) and architectural weaknesses;
- serve as input to RCA→CAPA and improve documentation (runbooks/SOPs);
- confirm readiness for SLA/regulatory/audit requirements.
2) Simulation formats
Tabletop - a conversational walkthrough of the scenario on a board or in chat: cheap, fast, great for practicing roles and communications.
Game Day (exercises in staging/production with restrictions) - hands-on execution of playbook steps; in production, only safe, reversible actions with clear gates.
Chaos Engineering - controlled failures (disconnecting dependencies/network/nodes) to verify resilience and SLO gates.
DR exercises (Disaster Recovery) - AZ/region failure, recovery from backups, switching providers.
Comms drill - communications only: status page, message templates, PR/Legal.
3) Roles and responsibilities
Incident Commander (IC) - makes decisions, drives the plan, handles de-escalation.
Tech Lead (TL) - diagnostics, hypotheses, technical response to the "injections."
Comms Lead (CL) - internal/external updates, status page.
Scribe - keeps the record (timeline, actions, decisions, artifacts).
Observers/Assessors - record metrics and adherence to procedures.
Red Team (optional) - introduces unexpected "injections."
4) Simulation success metrics
MTTD/MTTA/MTTR for the synthetic incident.
Comms SLA: timeliness and quality of updates.
SLO guardrails: correct reaction to burn rate, quorum of external probes.
Runbook fidelity: % of steps executed as documented, without improvisation.
Escalation latency - how quickly the needed role/provider is engaged.
Checklist pass rate: adherence to "ready/accepted/closed."
Noise & fatigue: spurious alerts, on-call overload.
CAPA completion: percentage of completed actions after simulation.
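The headline metrics above can be derived directly from the timestamps the Scribe records. A minimal sketch (the event names and values are hypothetical, not from the original):

```python
from datetime import datetime

# Hypothetical event timestamps captured by the Scribe during a drill.
events = {
    "impact_start": datetime(2025, 11, 1, 10, 0),   # T0: injection of symptoms
    "detected":     datetime(2025, 11, 1, 10, 4),   # first alert fired
    "acknowledged": datetime(2025, 11, 1, 10, 6),   # on-call acknowledged the page
    "resolved":     datetime(2025, 11, 1, 10, 38),  # SLI back in the green zone
}

def minutes(a: datetime, b: datetime) -> float:
    """Elapsed minutes between two events."""
    return (b - a).total_seconds() / 60

mttd = minutes(events["impact_start"], events["detected"])   # time to detect
mtta = minutes(events["detected"], events["acknowledged"])   # time to acknowledge
mttr = minutes(events["impact_start"], events["resolved"])   # time to recover

print(f"MTTD={mttd:.0f}m MTTA={mtta:.0f}m MTTR={mttr:.0f}m")
# MTTD=4m MTTA=2m MTTR=38m
```

Computing these from the same timeline document for every exercise keeps the numbers comparable across drills.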
5) Preparation: what you need before the start
Purpose and hypotheses: what we check (processes, architecture, people).
Scenario and "injections": sequence of symptoms/events with timings.
Safety restrictions: no irreversible changes; documented rollback points.
Data and environments: synthetic traffic, degradation feature flags, non-production keys.
Documents: links to runbooks/SOPs, escalation policy, provider contact list.
Observability: pre-marked dashboards/alerts, test canaries.
Logistics: time/duration, participants, war-room channel, recording.
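The "degradation feature flags" mentioned above can be as simple as an audited in-memory toggle; a real setup would use a flag service (LaunchDarkly, Unleash, etc.). All names here (`degrade-payments-ux`, `payments_checkout`) are illustrative assumptions:

```python
# Hypothetical in-process feature-flag store for a drill environment.
FLAGS = {"degrade-payments-ux": False, "read-only-mode": False}

def set_flag(name: str, value: bool, actor: str) -> None:
    """Flip a flag; every change is announced so it lands in the audit trail."""
    if name not in FLAGS:
        raise KeyError(f"unknown flag: {name}")
    FLAGS[name] = value
    print(f"[audit] {actor} set {name}={value}")

def payments_checkout(amount_cents: int) -> dict:
    """Degraded path skips optional steps (e.g. loyalty, receipt rendering)."""
    if FLAGS["degrade-payments-ux"]:
        return {"flow": "simplified", "amount": amount_cents}
    return {"flow": "full", "amount": amount_cents}

set_flag("degrade-payments-ux", True, actor="ic")
print(payments_checkout(1999)["flow"])  # simplified
```

Because the flag flip is a single reversible action, it satisfies the "undo points" requirement: restoring normal behavior is one more `set_flag` call.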
6) Simulation execution: stages
1. Brief (5-10 min): IC restates goals, roles, safety rules, completion criteria.
2. T0 - injection of symptoms: alert(s), drop in a business SLI, external provider status.
3. Triage and escalation: assigning SEV, freezing releases, engaging the required roles.
4. Diagnostics: hypotheses, DNS/TLS/CDN/DB/cache/bus checks, release annotations.
5. Mitigating actions: rollback/canary rollback, degradation flags, provider failover, limits/retries.
6. Communications: regular updates (format: Impact→Diagnosis→Actions→Next update).
7. Recovery and verification: external synthetics + SLI in the green zone for N intervals.
8. Debrief (AAR): 15-30 min - facts, conclusions, CAPA.
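The verification rule in step 7 ("SLI in the green zone for N intervals") can be checked mechanically. A minimal sketch, with hypothetical sample values:

```python
def recovered(sli_values: list, slo_target: float, n_required: int) -> bool:
    """Return True once the SLI has met the SLO target for
    n_required consecutive intervals (step 7: recovery and verification)."""
    streak = 0
    for value in sli_values:
        streak = streak + 1 if value >= slo_target else 0
        if streak >= n_required:
            return True
    return False

# Payment success ratio sampled once per interval; target 0.98, need 3 green intervals.
samples = [0.91, 0.95, 0.99, 0.97, 0.985, 0.99, 0.992]
print(recovered(samples, 0.98, 3))  # True: the last three samples all meet the target
```

Requiring consecutive intervals (rather than a single good sample) avoids declaring recovery during a transient bounce-back.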
7) Example scenarios (catalog)
Payment success drop: provider A degrades in one country; expected actions - traffic redistribution, enabling simplified UX, communication.
DNS failure: record/TTL error, some users cannot resolve the domain; expected steps - record fix/fallback, CDN cache purge, status updates.
Expired TLS certificate: handshake fails for older clients; expected - emergency renewal and certificate chain check.
Kafka lag: growing delay in KYC/AML events; expected - scale consumers, throttle producers.
Database p99 ↑ and 5xx growth: inefficient indexes, connection limit; expected - feature flags, limits, hotfix/rollback.
Regional failure: AZ/PoP outage; expected - GSLB/Anycast failover, verification of data and SLOs.
Comms drill: everything is "green," but we exercise templates, intervals and coordination with Legal/PR.
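The "traffic redistribution" expected in the payment scenario amounts to reweighting provider routing. A minimal sketch, assuming weighted random routing and hypothetical provider names (`psp_a`, `psp_b`):

```python
import random

def reweight(weights: dict, provider: str, new_share: float) -> dict:
    """Pin a degraded provider to new_share and redistribute the remainder
    proportionally among the healthy providers (a reversible action)."""
    others = {p: w for p, w in weights.items() if p != provider}
    total = sum(others.values())
    out = {p: (1 - new_share) * w / total for p, w in others.items()}
    out[provider] = new_share
    return out

def route(weights: dict, rng=random) -> str:
    """Pick a provider for one transaction according to current weights."""
    r, cum = rng.random(), 0.0
    for provider, w in weights.items():
        cum += w
        if r < cum:
            return provider
    return provider  # guard against floating-point rounding at the tail

weights = {"psp_a": 0.7, "psp_b": 0.3}
weights = reweight(weights, "psp_a", 0.3)  # drill action: PSP-A share down to 30%
print(weights)  # PSP-B picks up the remainder; shares still sum to 1
```

Because the change is a routing weight, not a code deploy, it can be rolled back instantly, which is what makes it a safe Game Day action.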
8) Template "injection" (card)
ID: INJ-2025-11-01-01
Purpose: Verify payment failover and comms SLA
Trigger T0: 30% drop in transaction success in the TR region (SLI alert + burn rate)
Signals: 5xx growth in payment API, external status PSP-A = partial outage
Expected actions: reduce PSP-A share to 30%, enable degrade-payments-UX, status update every 15 min
Success criteria: payment success ≥ 98% within 30 minutes, two green SLI intervals
Safety constraints: no direct database edits; feature flags/routing only
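The injection card is easy to keep as a structured record so a drill harness can validate and log it. A minimal sketch mirroring the template fields above (the class and field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class InjectionCard:
    """One 'injection' card, matching the template in this section."""
    card_id: str
    purpose: str
    trigger_t0: str
    signals: list = field(default_factory=list)
    expected_actions: list = field(default_factory=list)
    success_criteria: str = ""
    safety: str = ""

card = InjectionCard(
    card_id="INJ-2025-11-01-01",
    purpose="Verify payment failover and comms SLA",
    trigger_t0="30% drop in transaction success in the TR region (SLI alert + burn rate)",
    signals=["5xx growth in payment API", "external status PSP-A = partial outage"],
    expected_actions=["reduce PSP-A share to 30%", "enable degrade-payments-UX",
                      "status update every 15 min"],
    success_criteria="payment success >= 98% within 30 min, two green SLI intervals",
    safety="no direct database edits; feature flags/routing only",
)
print(card.card_id)  # INJ-2025-11-01-01
```

Keeping cards machine-readable also lets assessors diff "expected actions" against the Scribe's timeline during the AAR.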
9) Safety and compliance
Production simulations - reversible actions only: feature flags, shifting traffic in small fractions, read replicas, "shadow traffic."
Access control/audit: all actions via ChatOps/pipeline; logs in immutable storage.
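One simple way to make an action log tamper-evident (in the spirit of "logs in immutable storage") is to hash-chain the entries, so any later edit breaks the chain. A minimal sketch, not a substitute for real WORM storage:

```python
import hashlib
import json
import time

def append_entry(log: list, actor: str, action: str) -> list:
    """Append an action record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"ts": time.time(), "actor": actor, "action": action, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)
    return log

def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = "0" * 64
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True

log = []
append_entry(log, "ic", "declare SEV2, freeze releases")
append_entry(log, "tl", "reduce PSP-A share to 30% via flag")
print(verify(log))   # True
log[0]["action"] = "edited after the fact"
print(verify(log))   # False: the chain no longer matches
```

In practice the same property is usually obtained from append-only object storage or a log service, but the check logic is the same idea.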
PII/secrets are not used in exercise artifacts; data is anonymized.
Regulatory: if the simulation touches client communications, mark messages as "drill" in internal channels; public posts are not simulated.
10) Assessment and AAR → RCA → CAPA
AAR (After Action Review) - immediately after the exercise: what was expected/seen, what worked/not.
RCA - for significant failures (for example, escalation did not work) according to the RCA template.
CAPA - list of actions with owners/deadlines/effect metrics (changes in playbooks, alerts, architecture).
Checkpoints - D+14/D+30: verify completion, re-run mini-drills on the weak spots.
11) Documentation and artifacts
Simulation plan: goals, scenario, injections, participants, windows, success criteria.
Timeline (UTC): T0…Tn, IC decisions, technical steps, updates.
Screenshots of dashboards/logs, excerpts of alerts and statuses.
Summary report: metrics, playbook deviations, CAPAs.
Documentation updates: runbook/SOP/contact edits, links to new dashboards.
12) Frequency and coverage
Tabletop: 2-4 times a month (by key streams and roles).
Game Days in staging: 1-2 times a month.
Chaos exercises (prod-light): quarterly, strictly gated.
DR exercises: 1-2 times a year with real switching.
Comms-drill: monthly to train templates and SLA updates.
13) Checklists
Before simulation
- Scenario, "injections," success criteria, safety windows.
- Roles, channels, status of templates are consistent.
- Availability of stands/flags/dashboards checked.
- Rollback and reversibility plan documented.
- Risks and impact on SLO/customers assessed.
During
- SEV assigned, freeze releases (if needed).
- Communication on a schedule, the format is consistent.
- All actions via audit tools.
- Scribe maintains a protocol, collects artifacts.
- Safety: prohibitions/restrictions are respected.
After
- AAR posted, report saved.
- RCA (in case of failures) is initiated.
- CAPAs are issued with owners/deadlines.
- Updated runbook/SOP/contacts.
- A retest of the vulnerabilities is planned.
14) Anti-patterns
"Improvisation instead of a plan" - no script and no success criteria.
Risks without gates and cancellation plan - exercises turn into an incident.
Working out only equipment without communications and escalation.
Lack of AAR/RCA - the team is not learning.
Prod chaos without observability and SLO guardrails.
Opaque access: covert manual edits in prod.
15) Mini templates
Game Day Agenda (60-90 min)
1. Brief (5 min) → Goals, roles, security.
2. Scenario T0 (5 min) → Presentation of symptoms.
3. Triage/escalation (10 min).
4. Diagnostics + actions (30-45 min) - 1-2 "injections."
5. Recovery and verification (10 min).
6. AAR (15 min) - conclusions, CAPA.
AAR Template (Short)
What was expected:
What happened:
What worked:
What didn't work:
Solutions and why:
Actions (CAPA) with deadlines:
Responsible persons:
Retest Date:
16) The bottom line
Incident simulations are a "simulator" for people, processes, and architecture. Regular, safe, and measurable exercises turn crises into routine: the team reacts faster, playbooks actually work, the architecture becomes more resilient, and regulators and clients see the maturity of the operations function. The keys are clear goals, safety gates, good metrics and a mandatory AAR→RCA→CAPA loop.