Incidents and SRE playbooks
1) What an incident is and how it relates to SLOs
An incident is an event that violates an SLO/service function or creates a risk of violation (the error budget is being burned unacceptably fast).
Classic metrics: MTTD, MTTA, MTTR, MTBF.
The error budget and burn rate determine the priority and escalation windows.
2) Severity levels (SEVs) and criteria
SEV triggers: 5xx rate above threshold, p95 > threshold, payment decline spike, Kafka lag > threshold, NodeNotReady > X min, TLS certificate expires in <7 days, DDoS signals / data leak.
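One way to keep such triggers unambiguous is to encode them as a machine-readable severity matrix. The sketch below is illustrative only; the metric names and thresholds are placeholder assumptions, not prescribed values:

```yaml
# Illustrative SEV trigger matrix; metric names and thresholds are placeholders.
sev_matrix:
  - sev: 1
    triggers:
      - { metric: http_5xx_ratio, condition: "> 0.05 for 5m" }
      - { metric: payment_decline_rate, condition: "> 2x baseline for 10m" }
  - sev: 2
    triggers:
      - { metric: latency_p95_seconds, condition: "> threshold for 15m" }
      - { metric: kafka_consumer_lag, condition: "> threshold for 10m" }
      - { metric: tls_cert_days_left, condition: "< 7" }
```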
3) Roles and Responsibilities (RACI)
Incident Commander (IC) - sole decision maker, manages the task flow, changes SEV status.
Ops Lead (Tech Lead) - technical strategy, hypotheses, coordination of fixes.
Communications Lead (Comms) - status updates (internal/external), StatusPage/chat/mail.
Scribe - timeline, decisions, artifacts, links to graphs/logs.
On-call Engineers/SMEs - execution of playbook actions.
Security/Privacy - engaged for security or PII incidents.
FinOps/Payments - engaged when billing/PSP/cost is affected.
4) Incident lifecycle
1. Detection (alert/report/synthetic) → auto-creation of an incident card.
2. Triage (IC assigned, SEV assigned, minimum context collection).
3. Stabilization (mitigation: turn off the feature/rollback/rate-limit/failover).
4. Investigation (RCA hypotheses, collection of facts).
5. Service recovery (validate SLO, observation).
6. Communication (inside/outside, final report).
7. Postmortem (blameless, CAPA plan, owners, deadlines).
8. Prevention (tests/alerts/playbooks/flags, additional team training).
5) Communications and "war-room"
A single incident channel (`#inc-sev1-YYYYMMDD-hhmm`), facts and actions only.
Radio-protocol-style commands: "IC: assigning rollback of version 1.24 → ETA 10 min."
Status updates: SEV-1 every 15 minutes, SEV-2 every 30-60 minutes.
Status Page / external communication - via the Comms Lead, using templates.
Forbidden: parallel "quiet" rooms, posting untested hypotheses into the common channel.
6) Alerting and SLO-burn (example rules)
Fast-channel (1-5 min) and slow-channel (1-2 h) burn-rate alerts.
Multiple signals: error budget, 5xx%, p95, Kafka lag, payment decline rate, synthetics.
Search for the root cause - only after stabilizing symptoms.
```promql
# 5xx error ratio above the SLO threshold
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.01

# Fast burn-rate (example); substitute (1 - SLO) with the numeric error-budget fraction
(sum(rate(http_requests_total{status=~"5.."}[1m]))
  / sum(rate(http_requests_total[1m])))
  / (1 - SLO) > 14.4
```
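For reference, the factor 14.4 is the classic fast-burn multiplier: at that rate a 30-day error budget is exhausted in about two days, i.e. roughly 2% of the monthly budget per hour. Below is a hedged sketch of how the fast and slow channels might look as Prometheus alerting rules, assuming a 99.9% availability SLO and the metric above; group and alert names are illustrative:

```yaml
# Illustrative multi-window burn-rate rules for a 99.9% availability SLO.
groups:
  - name: slo-burn
    rules:
      - alert: FastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) / (1 - 0.999) > 14.4
          and
          (sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))) / (1 - 0.999) > 14.4
        for: 2m
        labels: {severity: page}
      - alert: SlowBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))) / (1 - 0.999) > 3
          and
          (sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))) / (1 - 0.999) > 3
        for: 15m
        labels: {severity: ticket}
```

Each rule pairs a short and a long window so that a one-off spike does not page, while sustained burn does.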
7) Playbooks vs runbooks
Playbook - a scenario of actions for a given incident type (branching, conditions, risks).
Runbook - a specific "map" of steps/commands (checks, fixes, verification).
Rule: a playbook references several runbooks (rollbacks, feature flags, failover, scaling, traffic blocking, etc.).
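A hedged way to make that reference explicit is a small index kept next to each playbook; the file paths and names below are assumptions:

```yaml
# Illustrative playbook-to-runbook index; paths are hypothetical.
playbook: api-5xx-spike
runbooks:
  - runbooks/rollback-api.md
  - runbooks/feature-flag-off.md
  - runbooks/scale-api-replicas.md
  - runbooks/enable-rate-limit.md
```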
8) Incident card template
```yaml
id: INC-YYYYMMDD-XXXX
title: "[SEV-1] 5xx spike on API /payments"
status: active            # one of: active / monitoring / resolved
sev: 1
reported_at: 2025-11-03T17:42Z
ic: <name>
ops_lead: <name>
comms_lead: <name>
scope:
  regions: [eu-west-1]
  tenants: [prod]
  services: [api, payments]
impact: "5xx=12% (normally <0.5%), deposit conversion -20%"
mitigation: "rolled back to 1.23.4, rate limit 2k rps enabled, feature X disabled"
timeline:
  - "17:42: fast SLO burn-rate alert"
  - "17:46: IC assigned, war-room opened"
  - "17:52: release 1.24 identified as the candidate cause"
  - "18:02: rollback completed, 5xx back to 0.3%"
artifacts:
  dashboards: [...]
  logs: [...]
  traces: [...]
risk: "another spike is possible when feature X is re-enabled"
next_steps: "canary release, tests, postmortem by 2025-11-05"
```
9) SRE playbook template (Markdown)
```markdown
# Playbook: <name>
## Scope / symptoms
List of detectors, signatures in metrics/logs/traces.
## Rapid stabilization (Triage & Mitigation)
- [ ] Limit traffic / enable WAF rule / feature flag OFF
- [ ] Rollback / canary release / roll out a config fix
- [ ] Enable degraded mode (read-only, forced cache)
## Diagnostics (RCA hints)
- Metrics: … Logs: … Traces: …
- Common root causes / hypothesis checklist
## Risks and communications
- Internal/external updates, SLA obligations
## Verification
- [ ] SLO restored (threshold / observation window)
- [ ] No regressions in adjacent services
## Follow-up
- CAPA, backlog tasks, updates to alerts/dashboards/playbook
```
10) Typical playbooks
10.1 API 5xx spike
Stabilization: disable the problematic feature flag; scale up API replicas; enable caching; roll back the release.
Diagnostics: release diff, errors in logs (top exceptions), p95 growth, DB/cache pressure.
Risks: cascade in payments/backends.
10.2 DB: replication lag / lock storm
Stabilization: pause heavy jobs/reports; redirect reads to the primary; increase wal_buffers/replication slots.
Diagnostics: long transactions, blocking queries, plan changes.
Fix: indexes/hints, rework jobs, split queries.
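A hedged detector sketch for this scenario, assuming replication lag is exported in seconds by postgres_exporter (the metric name varies by exporter version; the 300 s threshold is illustrative):

```yaml
# Illustrative Prometheus rule; adjust the metric name to your exporter.
- alert: PostgresReplicationLagHigh
  expr: pg_replication_lag_seconds > 300
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Replica {{ $labels.instance }} is lagging more than 5 minutes"
```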
10.3 Kafka consumer lag
Stabilization: temporarily scale consumers; reduce production from non-critical services; increase partitions/quotas.
Diagnostics: rebalances, slow deserialization, GC pauses.
Verification: lag back to the target value, no message loss.
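A hedged detector for the same scenario, assuming consumer-group lag is exposed by kafka_exporter as `kafka_consumergroup_lag` (the 100k threshold is illustrative):

```yaml
# Illustrative Prometheus rule; assumes kafka_exporter metrics are scraped.
- alert: KafkaConsumerLagHigh
  expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 100000
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Consumer group {{ $labels.consumergroup }} lag on {{ $labels.topic }} above target"
```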
10.4 K8s NodeNotReady / resource storm
Stabilization: cordon + drain; redistribute workloads; check CNI/overlay; disable noisy DaemonSets.
Diagnostics: disk pressure, OOM, throttling, network drops.
Prevention: PodDisruptionBudgets, resource limits/requests.
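A minimal PodDisruptionBudget sketch for the prevention step, so that drains and evictions keep a floor of API pods running (namespace, selector and minAvailable are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2          # keep at least 2 api pods during voluntary disruptions
  selector:
    matchLabels:
      app: api
```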
10.5 TLS certificates expiring
Stabilization: force-renew the secret/ingress; temporary override.
Diagnostics: chain of trust, clock-skew.
Prevention: alerts at T-30/T-7/T-1, auto-renewal.
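A hedged sketch of the T-30/T-7/T-1 ladder as Prometheus rules, assuming certificate expiry is probed via blackbox_exporter's `probe_ssl_earliest_cert_expiry` metric:

```yaml
# Illustrative staged certificate-expiry alerts (seconds until expiry).
- alert: TLSCertExpires30d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 30 * 86400
  labels: {severity: ticket}
- alert: TLSCertExpires7d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 7 * 86400
  labels: {severity: warning}
- alert: TLSCertExpires1d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 86400
  labels: {severity: page}
```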
10.6 DDoS / abnormal traffic
Stabilization: WAF/bot rules, rate limits/geo filters, upstream load shedding.
Diagnostics: attack profiles (L3/L4/L7), sources, upstream protection ("umbrella") services.
Prevention: anycast, autoscaling, caching, coordination with upstream providers.
10.7 Payment PSP outage
Stabilization: smart routing to alternative PSPs/methods; retries with jitter; "soft" UI degradation.
Diagnostics: spike in failures by decline code, PSP API statuses / status pages.
Communications: transparent updates for business and support, accurate decline/conversion statistics.
10.8 Security incident / PII leak
Stabilization: node isolation / secret rotation, block exfiltration, Legal Hold.
Diagnostics: access timelines, affected subjects/fields.
Notifications: regulators/partners/users per jurisdictional requirements.
Prevention: strengthen DLP/segmentation, least privilege.
11) Automation of playbooks
ChatOps commands: `/ic set sev 1`, `/deploy rollback api 1.23.4`, `/feature off X`.
Runbook-bots: semi-automatic steps (drain node, flip traffic, purge cache).
Self-healing hooks: detector → standard mitigation (rate-limit, restart, scale).
Auto-create cards/timelines from alerts and commands.
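A hedged Alertmanager fragment for the auto-create step: route paging alerts to a webhook that opens the incident card (the incident-bot receiver and endpoint are hypothetical):

```yaml
# Illustrative Alertmanager routing; the incident-bot endpoint is hypothetical.
route:
  routes:
    - matchers: ['severity="page"']
      receiver: incident-bot
receivers:
  - name: incident-bot
    webhook_configs:
      - url: https://incident-bot.internal/api/v1/incidents   # hypothetical endpoint
        send_resolved: true
```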
12) Playbook quality: checklist
- Clear symptoms and detectors (metrics/logs/traces).
- Rapid stabilization steps with risk assessment.
- Commands/scripts are up to date, checked in staging.
- Verification of SLO recovery.
- Communication templates and external update criteria.
- Post-mortem reference and CAPA after closing.
13) Postmortem (blameless) and CAPA
The goal: to learn, not to find the culprit.
Content: what happened, what worked well/poorly, contributing factors (technology + processes), prevention actions.
Deadline: SEV-1 - within 48 hours; SEV-2 - within 3 working days.
CAPA: specific owners, deadlines, measurable effects (reduced MTTR / improved MTTD).
14) Legal aspects and evidence base
Legal Hold: freeze logs/traces/alerts, write-once storage.
Chain of custody for artifacts: role-based access, integrity control.
Regulatory notices: timelines/templates for jurisdictions (especially with affected payments/PII).
Privacy: PII minimization and masking during analysis.
15) Incident Process Performance Metrics
MTTD/MTTA/MTTR by quarter and domain.
SEV classification accuracy (under-/over-rating).
Share of auto-mitigated incidents.
Playbook coverage of top-N scenarios (>90%).
CAPA completed on time.
16) Implementation by phase
1. Week 1: SEV matrix, on-call roles, general card template, war-room regulations.
2. Week 2: Playbooks for top 5 symptoms (5xx, DB lag, Kafka-lag, NodeNotReady, TLS).
3. Week 3: ChatOps/bots, auto-creating cards, communication templates/StatusPage.
4. Week 4+: security playbooks, PSP outages, Legal Hold, regular drills/chaos games.
17) Examples of "fast" runbooks (fragments)
Rollback API (K8s)
```bash
kubectl rollout undo deploy/api -n prod
kubectl rollout status deploy/api -n prod --timeout=5m
# Verification:
kubectl -n prod top pods -l app=api
```
Drain node
```bash
kubectl cordon $NODE && kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=10m
```
Feature-flag OFF (example)
```bash
curl -X POST "$FF_URL/toggle" -H "Authorization: Bearer $TOKEN" -d '{"feature":"X","enabled":false}'
```
18) Mini-FAQ
When to raise the SEV-1?
When a key SLO/business function (payments, login, gameplay) is impacted and the burn rate is consuming the budget hours ahead.
What is more important - RCA or recovery?
Stabilization always comes first, then RCA; time to stabilization is the key indicator.
Do I need to automate everything?
Automate frequent, safe steps; handle rare/risky ones via semi-automation with IC confirmation.
Result
A robust incident process rests on three pillars: clear roles and SEV rules, quality playbooks/runbooks with automation, and a blameless postmortem culture. Capture patterns, train on-call engineers, measure MTTR and error-budget burn, and continuously improve detectors and playbooks - this directly reduces the risk and cost of downtime.