Incidents and SRE playbooks
1) What an incident is and how it relates to SLOs
An incident is an event that violates an SLO/service function or creates a risk of violation (the error budget is being burned unacceptably fast).
Classic metrics: MTTD, MTTA, MTTR, MTBF.
The error budget and burn rate determine the priority and escalation windows.
2) Severity levels (SEVs) and criteria
SEV triggers: 5xx rate above threshold, p95 > threshold, payment decline spike, Kafka lag > threshold, NodeNotReady > X min, TLS certificate expires in <7 days, DDoS signals / data leak.
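One way to keep such triggers unambiguous is to encode them as a machine-readable severity matrix. The sketch below is illustrative only; the metric names and thresholds are placeholder assumptions, not prescribed values:

```yaml
# Illustrative SEV trigger matrix; metric names and thresholds are placeholders.
sev_matrix:
  - sev: 1
    triggers:
      - { metric: http_5xx_ratio, condition: "> 0.05 for 5m" }
      - { metric: payment_decline_rate, condition: "> 2x baseline for 10m" }
  - sev: 2
    triggers:
      - { metric: latency_p95_seconds, condition: "> threshold for 15m" }
      - { metric: kafka_consumer_lag, condition: "> threshold for 10m" }
      - { metric: tls_cert_days_left, condition: "< 7" }
```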
3) Roles and Responsibilities (RACI)
Incident Commander (IC) - sole decision maker, manages the task flow, changes SEV status.
Ops Lead (Tech Lead) - technical strategy, hypotheses, coordination of fixes.
Communications Lead (Comms) - status updates (internal/external), StatusPage/chat/mail.
Scribe - timeline, decisions, artifacts, links to graphs/logs.
On-call Engineers/SMEs - execution of playbook actions.
Security/Privacy - engaged for security or PII incidents.
FinOps/Payments - engaged when billing/PSP/cost is affected.
4) Incident lifecycle
1. Detection (alert/report/synthetic) → auto-creation of an incident card.
2. Triage (IC assigned, SEV assigned, minimum context collection).
3. Stabilization (mitigation: turn off the feature/rollback/rate-limit/failover).
4. Investigation (RCA hypotheses, collection of facts).
5. Service recovery (validate SLO, observation).
6. Communication (inside/outside, final report).
7. Postmortem (blameless, CAPA plan, owners, deadlines).
8. Prevention (tests/alerts/playbooks/flags, additional team training).
5) Communications and "war-room"
A single incident channel (`#inc-sev1-YYYYMMDD-hhmm`), facts and actions only.
Radio-protocol-style commands: "IC: assigning rollback of version 1.24 → ETA 10 min."
Status updates: SEV-1 every 15 minutes, SEV-2 every 30-60 minutes.
Status Page / external communication - via the Comms Lead, using templates.
Forbidden: parallel "quiet" rooms, posting untested hypotheses into the common channel.
6) Alerting and SLO-burn (example rules)
Fast-channel (1-5 min) and slow-channel (1-2 h) burn-rate alerts.
Multiple signals: error budget, 5xx%, p95, Kafka lag, payment decline rate, synthetics.
Search for the root cause - only after stabilizing symptoms.
```promql
# 5xx error ratio above the SLO threshold
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.01

# Fast burn-rate (example); substitute (1 - SLO) with the numeric error-budget fraction
(sum(rate(http_requests_total{status=~"5.."}[1m]))
  / sum(rate(http_requests_total[1m])))
  / (1 - SLO) > 14.4
```
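For reference, the factor 14.4 is the classic fast-burn multiplier: at that rate a 30-day error budget is exhausted in about two days, i.e. roughly 2% of the monthly budget per hour. Below is a hedged sketch of how the fast and slow channels might look as Prometheus alerting rules, assuming a 99.9% availability SLO and the metric above; group and alert names are illustrative:

```yaml
# Illustrative multi-window burn-rate rules for a 99.9% availability SLO.
groups:
  - name: slo-burn
    rules:
      - alert: FastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) / (1 - 0.999) > 14.4
          and
          (sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))) / (1 - 0.999) > 14.4
        for: 2m
        labels: {severity: page}
      - alert: SlowBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[30m]))
            / sum(rate(http_requests_total[30m]))) / (1 - 0.999) > 3
          and
          (sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))) / (1 - 0.999) > 3
        for: 15m
        labels: {severity: ticket}
```

Each rule pairs a short and a long window so that a one-off spike does not page, while sustained burn does.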
7) Playbooks vs runbooks
Playbook - a scenario of actions for a given incident type (branching, conditions, risks).
Runbook - a specific "map" of steps/commands (checks, fixes, verification).
Rule: a playbook references several runbooks (rollbacks, feature flags, failover, scaling, traffic blocking, etc.).
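A hedged way to make that reference explicit is a small index kept next to each playbook; the file paths and names below are assumptions:

```yaml
# Illustrative playbook-to-runbook index; paths are hypothetical.
playbook: api-5xx-spike
runbooks:
  - runbooks/rollback-api.md
  - runbooks/feature-flag-off.md
  - runbooks/scale-api-replicas.md
  - runbooks/enable-rate-limit.md
```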
8) Incident card template
```yaml
id: INC-YYYYMMDD-XXXX
title: "[SEV-1] 5xx spike on API /payments"
status: active            # one of: active / monitoring / resolved
sev: 1
reported_at: 2025-11-03T17:42Z
ic: <name>
ops_lead: <name>
comms_lead: <name>
scope:
  regions: [eu-west-1]
  tenants: [prod]
  services: [api, payments]
impact: "5xx=12% (normally <0.5%), deposit conversion -20%"
mitigation: "rolled back to 1.23.4, rate limit 2k rps enabled, feature X disabled"
timeline:
  - "17:42: fast SLO burn-rate alert"
  - "17:46: IC assigned, war-room opened"
  - "17:52: release 1.24 identified as the candidate cause"
  - "18:02: rollback completed, 5xx back to 0.3%"
artifacts:
  dashboards: [...]
  logs: [...]
  traces: [...]
risk: "another spike is possible when feature X is re-enabled"
next_steps: "canary release, tests, postmortem by 2025-11-05"
```
9) SRE playbook template (Markdown)
```markdown
# Playbook: <name>
## Scope / symptoms
List of detectors, signatures in metrics/logs/traces.
## Rapid stabilization (Triage & Mitigation)
- [ ] Limit traffic / enable WAF rule / feature flag OFF
- [ ] Rollback / canary release / roll out a config fix
- [ ] Enable degraded mode (read-only, forced cache)
## Diagnostics (RCA hints)
- Metrics: … Logs: … Traces: …
- Common root causes / hypothesis checklist
## Risks and communications
- Internal/external updates, SLA obligations
## Verification
- [ ] SLO restored (threshold / observation window)
- [ ] No regressions in adjacent services
## Follow-up
- CAPA, backlog tasks, updates to alerts/dashboards/playbook
```
10) Typical playbooks
10.1 API 5xx spike
Stabilization: disable the problematic feature flag; scale up API replicas; enable caching; roll back the release.
Diagnostics: release diff, errors in logs (top exceptions), p95 growth, DB/cache pressure.
Risks: cascade in payments/backends.
10.2 DB: replication lag / lock storm
Stabilization: pause heavy jobs/reports; redirect reads to the primary; increase wal_buffers/replication slots.
Diagnostics: long transactions, blocking queries, plan changes.
Fix: indexes/hints, rework jobs, split queries.
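A hedged detector sketch for this scenario, assuming replication lag is exported in seconds by postgres_exporter (the metric name varies by exporter version; the 300 s threshold is illustrative):

```yaml
# Illustrative Prometheus rule; adjust the metric name to your exporter.
- alert: PostgresReplicationLagHigh
  expr: pg_replication_lag_seconds > 300
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Replica {{ $labels.instance }} is lagging more than 5 minutes"
```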
10.3 Kafka consumer lag
Stabilization: temporarily scale consumers; reduce production from non-critical services; increase partitions/quotas.
Diagnostics: rebalances, slow deserialization, GC pauses.
Verification: lag back to the target value, no message loss.
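A hedged detector for the same scenario, assuming consumer-group lag is exposed by kafka_exporter as `kafka_consumergroup_lag` (the 100k threshold is illustrative):

```yaml
# Illustrative Prometheus rule; assumes kafka_exporter metrics are scraped.
- alert: KafkaConsumerLagHigh
  expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 100000
  for: 10m
  labels: {severity: warning}
  annotations:
    summary: "Consumer group {{ $labels.consumergroup }} lag on {{ $labels.topic }} above target"
```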
10.4 K8s NodeNotReady / resource storm
Stabilization: cordon + drain; redistribute workloads; check CNI/overlay; disable noisy DaemonSets.
Diagnostics: disk pressure, OOM, throttling, network drops.
Prevention: PodDisruptionBudgets, resource limits/requests.
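A minimal PodDisruptionBudget sketch for the prevention step, so that drains and evictions keep a floor of API pods running (namespace, selector and minAvailable are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: prod
spec:
  minAvailable: 2          # keep at least 2 api pods during voluntary disruptions
  selector:
    matchLabels:
      app: api
```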
10.5 TLS certificates expiring
Stabilization: force-renew the secret/ingress; temporary override.
Diagnostics: chain of trust, clock-skew.
Prevention: alerts at T-30/T-7/T-1, auto-renewal.
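A hedged sketch of the T-30/T-7/T-1 ladder as Prometheus rules, assuming certificate expiry is probed via blackbox_exporter's `probe_ssl_earliest_cert_expiry` metric:

```yaml
# Illustrative staged certificate-expiry alerts (seconds until expiry).
- alert: TLSCertExpires30d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 30 * 86400
  labels: {severity: ticket}
- alert: TLSCertExpires7d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 7 * 86400
  labels: {severity: warning}
- alert: TLSCertExpires1d
  expr: (probe_ssl_earliest_cert_expiry - time()) < 86400
  labels: {severity: page}
```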
10.6 DDoS / abnormal traffic
Stabilization: WAF/bot rules, rate limits/geo filters, upstream load shedding.
Diagnostics: attack profiles (L3/L4/L7), sources, upstream protection ("umbrella") services.
Prevention: anycast, autoscaling, caching, coordination with upstream providers.
10.7 Payment PSP outage
Stabilization: smart routing to alternative PSPs/methods; retries with jitter; "soft" UI degradation.
Diagnostics: spike in failures by decline code, PSP API statuses / status pages.
Communications: transparent updates for business and support, accurate decline/conversion statistics.
10.8 Security incident / PII leak
Stabilization: node isolation / secret rotation, block exfiltration, Legal Hold.
Diagnostics: access timelines, affected subjects/fields.
Notifications: regulators/partners/users per jurisdictional requirements.
Prevention: strengthen DLP/segmentation, least privilege.
11) Automation of playbooks
ChatOps commands: `/ic set sev 1`, `/deploy rollback api 1.23.4`, `/feature off X`.
Runbook-bots: semi-automatic steps (drain node, flip traffic, purge cache).
Self-healing hooks: detector → standard mitigation (rate-limit, restart, scale).
Auto-create cards/timelines from alerts and commands.
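A hedged Alertmanager fragment for the auto-create step: route paging alerts to a webhook that opens the incident card (the incident-bot receiver and endpoint are hypothetical):

```yaml
# Illustrative Alertmanager routing; the incident-bot endpoint is hypothetical.
route:
  routes:
    - matchers: ['severity="page"']
      receiver: incident-bot
receivers:
  - name: incident-bot
    webhook_configs:
      - url: https://incident-bot.internal/api/v1/incidents   # hypothetical endpoint
        send_resolved: true
```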
12) Playbook quality: checklist
- Clear symptoms and detectors (metrics/logs/traces).
- Rapid stabilization steps with risk assessment.
- Commands/scripts are up to date, checked in staging.
- Verification of SLO recovery.
- Communication templates and external update criteria.
- Post-mortem reference and CAPA after closing.
13) Postmortem (blameless) and CAPA
The goal: to learn, not to find the culprit.
Content: what happened, what worked well/poorly, contributing factors (technology + processes), prevention actions.
Deadline: SEV-1 - within 48 hours; SEV-2 - within 3 working days.
CAPA: specific owners, deadlines, measurable effects (reduced MTTR / improved MTTD).
14) Legal aspects and evidence base
Legal Hold: freeze logs/traces/alerts, write-once storage.
Chain of custody for artifacts: role-based access, integrity control.
Regulatory notices: timelines/templates for jurisdictions (especially with affected payments/PII).
Privacy: PII minimization and masking during analysis.
15) Incident Process Performance Metrics
MTTD/MTTA/MTTR by quarter and domain.
SEV classification accuracy (under-/over-rating).
Share of auto-mitigated incidents.
Playbook coverage of top-N scenarios (>90%).
CAPA completed on time.
16) Implementation by phase
1. Week 1: SEV matrix, on-call roles, general card template, war-room regulations.
2. Week 2: Playbooks for top 5 symptoms (5xx, DB lag, Kafka-lag, NodeNotReady, TLS).
3. Week 3: ChatOps/bots, auto-creating cards, communication templates/StatusPage.
4. Week 4+: security playbooks, PSP outages, Legal Hold, regular drills/chaos games.
17) Examples of "fast" runbooks (fragments)
Rollback API (K8s)
```bash
kubectl rollout undo deploy/api -n prod
kubectl rollout status deploy/api -n prod --timeout=5m
# Verification:
kubectl -n prod top pods -l app=api
```
Drain node
```bash
kubectl cordon $NODE && kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --timeout=10m
```
Feature-flag OFF (example)
```bash
curl -X POST "$FF_URL/toggle" -H "Authorization: Bearer $TOKEN" -d '{"feature":"X","enabled":false}'
```
18) Mini-FAQ
When to raise the SEV-1?
When a key SLO/business function (payments, login, gameplay) is impacted and the burn rate is consuming the budget hours ahead.
What is more important - RCA or recovery?
Stabilization always comes first, then RCA; time to stabilization is the key indicator.
Do I need to automate everything?
Automate frequent, safe steps; handle rare/risky ones via semi-automation with IC confirmation.
Result
A robust incident process rests on three pillars: clear roles and SEV rules, quality playbooks/runbooks with automation, and a blameless postmortem culture. Capture patterns, train on-call engineers, measure MTTR and error-budget burn, and continuously improve detectors and playbooks - this directly reduces the risk and cost of downtime.