GH GambleHub

Incidents and SRE playbooks

1) What an incident is and how it relates to SLOs

An incident is an event that violates an SLO or a service function, or creates a risk of violation (the error budget is burning unacceptably fast).
Classic metrics: MTTD, MTTA, MTTR, MTBF.
The error budget and burn rate determine priority and escalation windows.
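
As an illustration of how the classic metrics can be derived from incident timestamps, here is a minimal Python sketch. The `IncidentTimes` record and the interval boundaries (for example, MTTR measured from the start of impact to recovery) are assumptions; teams define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    """Hypothetical record of the key timestamps of one incident."""
    started: datetime       # impact began
    detected: datetime      # alert fired or report arrived
    acknowledged: datetime  # on-call engineer took ownership
    resolved: datetime      # service restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Return MTTD, MTTA and MTTR in minutes (assumed definitions)."""
    return {
        "mttd": _mean_minutes([i.detected - i.started for i in incidents]),
        "mtta": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "mttr": _mean_minutes([i.resolved - i.started for i in incidents]),
    }
```

MTBF is omitted here because it needs the gaps between consecutive incidents rather than the timestamps of a single one.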


2) Severity levels (SEVs) and criteria

| SEV | Symptom | Impact | MTTR target |
|---|---|---|---|
| SEV-1 | Broken critical SLO/total outage for key traffic | All users/payments | ≤ 60 min |
| SEV-2 | Degradation (p95 latency, 5xx/payment errors ↑) | Significant part | ≤ 4 h |
| SEV-3 | Local issues/deviations from baseline | Individual service/region | ≤ 1 business day |
| SEV-4 | Potential risk/defect without current impact | Preparation of fixes | According to plan |

SEV triggers: 5xx% above threshold, p95 > threshold, payment decline spike, Kafka lag > threshold, NodeNotReady > X min, TLS certificate expiring in < 7 days, DDoS/leak signals.
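
The severity matrix can be turned into a simple triage helper. The sketch below is an assumption-laden illustration: the `Signals` fields are hypothetical boolean summaries of the criteria above, and real triage would derive them from alerts and thresholds that this document leaves unspecified.

```python
from dataclasses import dataclass

# MTTR targets taken from the severity matrix above.
MTTR_TARGET = {
    1: "<= 60 min",
    2: "<= 4 h",
    3: "<= 1 business day",
    4: "according to plan",
}


@dataclass
class Signals:
    """Hypothetical triage inputs summarizing the SEV criteria."""
    critical_slo_broken: bool = False      # key traffic down, all users/payments hit
    significant_degradation: bool = False  # p95 latency or 5xx/payment errors up
    local_issue: bool = False              # one service or region off baseline


def classify_sev(signals):
    """Return the most severe level whose criterion matches."""
    if signals.critical_slo_broken:
        return 1
    if signals.significant_degradation:
        return 2
    if signals.local_issue:
        return 3
    return 4  # potential risk without current impact
```

Checking the most severe criterion first means an incident that matches several rows is never under-rated.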


3) Roles and Responsibilities (RACI)

Incident Commander (IC) - sole decision-maker; manages the task flow and SEV status changes.
Ops Lead (Tech Lead) - technical strategy, hypotheses, coordination of fixes.
Communications Lead (Comms) - status updates (internal/external), StatusPage/chat/mail.
Scribe (Chronicler) - timeline, decisions, artifacts, links to graphs/logs.
On-call Engineers/SMEs - execution of playbook actions.
Security/Privacy - engaged for security or PII incidents.
FinOps/Payments - engaged when billing/PSP/cost is affected.


4) Incident lifecycle

1. Detection (alert/report/synthetic) → auto-creation of an incident card.
2. Triage (IC assigned, SEV assigned, minimum context collection).
3. Stabilization (mitigation: turn off the feature/rollback/rate-limit/failover).
4. Investigation (RCA hypotheses, collection of facts).
5. Service recovery (validate SLO, observation).
6. Communication (internal/external, final report).
7. Postmortem (blameless, CAPA plan, owners, deadlines).
8. Prevention (tests/alerts/playbooks/flags, additional training of the team).


5) Communications and "war-room"

Unified incident channel (`#inc-sev1-YYYYMMDD-hhmm`): facts and actions only.
Radio-protocol style commands: "IC: assigning rollback of version 1.24 → ETA 10 min."
Status updates: SEV-1 every 15 minutes, SEV-2 every 30-60 minutes.
Status Page/external communication: via the Comms Lead, using a template.
Forbidden: parallel "quiet" rooms and posting untested hypotheses to the common channel.
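
The channel naming convention and the update cadence are easy to automate. The following Python sketch is one possible way to do it; the function names are hypothetical, and for SEV-2 it assumes the lower bound (30 minutes) of the 30-60 minute range.

```python
from datetime import datetime, timedelta

# Status update cadence per severity; SEV-2 uses the lower bound of 30-60 min.
UPDATE_INTERVAL = {
    1: timedelta(minutes=15),
    2: timedelta(minutes=30),
}


def channel_name(sev, opened_at):
    """Build a war-room channel name in the inc-sevN-YYYYMMDD-hhmm format."""
    return f"#inc-sev{sev}-{opened_at:%Y%m%d-%H%M}"


def next_update_due(sev, last_update):
    """Return the time by which the next status update should be posted."""
    return last_update + UPDATE_INTERVAL[sev]
```

For example, a SEV-1 war room opened at 17:46 on 2025-11-03 gets the channel `#inc-sev1-20251103-1746`, with the next update due at 18:01.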


6) Alerting and SLO-burn (example rules)

Fast window (1-5 min) and slow window (1-2 h) burn-rate alerts.
Multi-signal: error budget, 5xx%, p95, Kafka lag, payment decline rate, synthetics.
Search for the root cause only after the symptoms are stabilized.

Examples (generalized):
```promql
# 5xx error ratio above the SLO threshold
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01

# Fast burn rate (example); SLO is a placeholder for the target as a ratio (for instance 0.999)
(sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])))
/ (1 - SLO) > 14.4
```
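
To make the arithmetic behind the burn-rate rule concrete, here is a small Python sketch. It assumes a 30-day SLO window and a "page only when both the fast and the slow window exceed the threshold" policy; both are common conventions rather than something this document prescribes, and the function names are hypothetical.

```python
def burn_rate(errors, total, slo):
    """Error ratio divided by the error budget (1 - SLO)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def hours_to_exhaustion(rate, window_days=30):
    """Hours until the whole budget is gone if this burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(fast_rate, slow_rate, threshold=14.4):
    """Page only when both windows agree, to filter short blips."""
    return fast_rate > threshold and slow_rate > threshold
```

With an SLO of 99.9%, 150 failed requests out of 10,000 give a burn rate of about 15. At the 14.4 threshold a 30-day budget would be exhausted in roughly 50 hours, which is the same as spending 2% of it in one hour.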

7) Playbooks vs runbooks

Playbook - scenario of actions by type of incident (branching, conditions, risks).
Runbook - a specific "map" of steps/commands (checks, fixes, verification).
Rule: the playbook refers to several runbooks (rollbacks, feature-flags, failover, scaling, blocking traffic, etc.).


8) Incident card template

```yaml
id: INC-YYYYMMDD-XXXX
title: "[SEV-1] 5xx spike on API /payments"
status: active  # one of: active | monitoring | resolved
sev: 1
reported_at: 2025-11-03T17:42Z
ic: <full name>
ops_lead: <full name>
comms_lead: <full name>
scope:
  regions: [eu-west-1]
  tenants: [prod]
  services: [api, payments]
impact: "5xx=12% (normally <0.5%), deposit conversion -20%"
mitigation: "rolled back to 1.23.4, rate limit of 2k rps enabled, feature X turned off"
timeline:
  - "17:42: fast SLO burn-rate alert"
  - "17:46: IC assigned, war room opened"
  - "17:52: release 1.24 identified as a candidate"
  - "18:02: rollback complete, 5xx back to 0.3%"
artifacts:
  dashboards: [...]
  logs: [...]
  traces: [...]
risk: "another spike is possible when feature X is re-enabled"
next_steps: "canary release, tests, postmortem by 2025-11-05"
```
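
A card like this can be checked automatically before it is saved. The Python sketch below is a hypothetical validator: the set of required fields and the helper names are assumptions, and the timeline helper assumes entries in the `HH:MM: text` form that all fall on the same day.

```python
from datetime import datetime

REQUIRED_FIELDS = ("id", "title", "status", "sev", "reported_at", "ic")
ALLOWED_STATUS = {"active", "monitoring", "resolved"}


def validate_card(card):
    """Return a list of problems found in an incident card (a dict)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in card]
    if "status" in card and card["status"] not in ALLOWED_STATUS:
        problems.append(f"invalid status: {card['status']!r}")
    if "sev" in card and card["sev"] not in (1, 2, 3, 4):
        problems.append(f"invalid sev: {card['sev']!r}")
    return problems


def minutes_to_mitigation(timeline):
    """Minutes between the first and last 'HH:MM: text' timeline entries."""
    first = datetime.strptime(timeline[0][:5], "%H:%M")
    last = datetime.strptime(timeline[-1][:5], "%H:%M")
    return int((last - first).total_seconds() // 60)
```

Applied to the timeline in the template (17:42 alert, 18:02 rollback complete), the helper reports 20 minutes from detection to mitigation.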

9) SRE playbook template (Markdown)

```markdown
# Playbook: <name>

## Scope/symptoms
List of detectors, signatures in metrics/logs/traces.

## Rapid stabilization (Triage & Mitigation)
- [ ] Limit traffic/enable a WAF rule/turn the feature flag OFF
- [ ] Rollback/canary release/roll out a configuration fix
- [ ] Enable degraded mode (read-only, forced cache)

## Diagnostics (RCA hints)
- Metrics: … Logs: … Traces: …
- Common root causes/hypothesis checklist

## Risks and communications
- Internal/external updates, SLA obligations

## Verification
- [ ] SLO restored (threshold/window duration)
- [ ] No regression in adjacent services

## Follow-up actions
- CAPA, backlog tasks, updates to alerts/dashboards/playbook
```

10) Typical playbooks

10.1 API 5xx spike

Stabilization: turn off the problematic feature flag; scale up API replicas; enable caching; roll back the release.
Diagnostics: release diff, errors in logs (top exceptions), p95 growth, DB/cache pressure.
Risks: cascading failure into payments/backends.

10.2 DB: replication lag / lock storm

Stabilization: suspend heavy jobs/reports; redirect reads to the primary; increase wal_buffers/replication slots.
Diagnostics: long transactions, blocking queries, plan changes.
Fix: indexes/hints, reworking jobs, splitting queries.

10.3 Kafka consumer lag

Stabilization: temporarily scale out consumers; reduce production from non-critical services; increase partitions/quotas.
Diagnostics: rebalances, slow deserialization, GC pauses.
Verification: lag returns to the target value, no dropped messages.

10.4 K8s NodeNotReady/resource storm

Stabilization: cordon + drain; redistribute workloads; check CNI/overlay; turn off noisy DaemonSets.
Diagnostics: disk pressure, OOM, throttling, network drops.
Prevention: pod disruption budgets, resource limits/requests.

10.5 TLS/certificate expiry

Stabilization: forced update of the secret/ingress; temporary override.
Diagnostics: chain of trust, clock skew.
Prevention: alerts at T-30/T-7/T-1, auto-renewal.
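
The T-30/T-7/T-1 alert tiers can be expressed as a small pure function. This Python sketch assumes the certificate's expiry time (`not_after`) has already been read from the certificate by some other means; the function name and the `EXPIRED` label are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ALERT_DAYS = (30, 7, 1)  # T-30 / T-7 / T-1 tiers


def expiry_alert(not_after, now):
    """Return the tightest alert tier reached, 'EXPIRED', or None."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    days_left = remaining / timedelta(days=1)  # fractional days remaining
    for tier in sorted(ALERT_DAYS):            # check 1, then 7, then 30
        if days_left <= tier:
            return f"T-{tier}"
    return None  # more than 30 days left, nothing to alert on
```

Checking the smallest tier first ensures that a certificate with 12 hours left is reported as T-1 rather than T-30.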

10.6 DDoS/abnormal traffic

Stabilization: WAF/bot rules, rate limits/geo-filters, upstream load shedding.
Diagnostics: attack profiles (L3/4/7), sources, umbrellas.
Prevention: anycast, autoscaling, caching, coordination with providers.

10.7 Payment PSP outage

Stabilization: smart routing to alternative PSPs/methods; increase retries with jitter; "soft" UI degradation.
Diagnostics: failure spikes by error code, API statuses/PSP status pages.
Communications: transparent updates for business and support, correct ND/conversion statistics.
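
Retrying with jitter, mentioned in the stabilization step, can be sketched as follows. This is a generic "full jitter" exponential backoff in Python, not the retry policy of any particular PSP client; the parameter values are assumptions, and retrying payment calls is only safe when the PSP API supports idempotent requests.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Yield one 'full jitter' delay per attempt: uniform(0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(fn, attempts=4, sleep=time.sleep, rng=random):
    """Call fn(), retrying transient errors with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt, delay in enumerate(backoff_delays(attempts, rng=rng)):
        try:
            return fn()
        except (TimeoutError, ConnectionError) as exc:  # transient errors only
            last_error = exc
            if attempt < attempts - 1:  # no point sleeping after the final try
                sleep(delay)
    raise last_error
```

Randomizing the delay spreads retries from many clients over time, so a recovering PSP is not hit by synchronized retry waves.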

10.8 Security incident/PII leak

Stabilization: node isolation/secret rotation, exfiltration blocking, Legal Hold.
Diagnostics: access timelines, affected subjects/fields.
Notices: regulators/partners/users, per jurisdictional requirements.
Prevention: stronger DLP/segmentation, "least privilege."


11) Automation of playbooks

ChatOps commands: `/ic set sev 1`, `/deploy rollback api 1.23.4`, `/feature off X`.
Runbook-bots: semi-automatic steps (drain node, flip traffic, purge cache).
Self-healing hooks: detector → standard mitigation (rate-limit, restart, scale).
Auto-create cards/timelines from alerts and commands.
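
As a rough illustration of how a bot might recognize the ChatOps commands listed above, here is a Python sketch of a command parser. The command grammar is inferred from the three examples only, and the handler names are hypothetical; a real bot would also check permissions and ask the IC to confirm risky actions.

```python
import re

# One pattern per supported command; names are hypothetical handler keys.
COMMANDS = {
    "ic_set_sev": re.compile(r"^/ic set sev (?P<sev>[1-4])$"),
    "deploy_rollback": re.compile(
        r"^/deploy rollback (?P<service>[\w-]+) (?P<version>\d+\.\d+\.\d+)$"
    ),
    "feature_off": re.compile(r"^/feature off (?P<feature>[\w-]+)$"),
}


def parse_command(text):
    """Return (handler_name, arguments) or None for an unknown command."""
    normalized = " ".join(text.split())  # collapse stray whitespace
    for name, pattern in COMMANDS.items():
        match = pattern.match(normalized)
        if match:
            return name, match.groupdict()
    return None
```

Strict patterns mean that a malformed command such as `/ic set sev 9` is rejected instead of being partially executed.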


12) Playbook quality: checklist

  • Clear symptoms and detectors (metrics/logs/traces).
  • Rapid stabilization steps with risk assessment.
  • Commands/scripts are up to date and tested in staging.
  • Verification of SLO recovery.
  • Communication templates and external update criteria.
  • Link to the postmortem and CAPA after closure.

13) Postmortem (blameless) and CAPA

The goal: to learn, not to find the culprit.
Content: what happened, what went well and what went poorly, contributing factors (technical + process), preventive actions.
Deadline: SEV-1 within 48 hours; SEV-2 within 3 working days.
CAPA: specific owners, deadlines, measurable effects (reduced MTTR/MTTD).
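
The postmortem deadlines can be computed mechanically. The Python sketch below assumes the clock starts when the incident is resolved and that working days are Monday to Friday with no holiday calendar; both are assumptions, since the text states only the durations.

```python
from datetime import datetime, timedelta


def add_working_days(start, days):
    """Advance by the given number of Monday-Friday days."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days -= 1
    return current


def postmortem_due(sev, resolved_at):
    """Return the postmortem deadline for SEV-1 and SEV-2 incidents."""
    if sev == 1:
        return resolved_at + timedelta(hours=48)
    if sev == 2:
        return add_working_days(resolved_at, 3)
    raise ValueError("no postmortem deadline defined for this severity")
```

For a SEV-1 resolved on Monday 2025-11-03 at 18:02, the deadline falls on 2025-11-05 at 18:02.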


14) Legal aspects and evidence base

Legal Hold: freeze logs/traces/alerts in write-once storage.
Chain of custody for artifacts: role-based access, integrity control.
Regulatory notices: deadlines/templates per jurisdiction (especially when payments or PII are affected).
Privacy: minimize and mask PII during analysis.


15) Incident Process Performance Metrics

MTTD/MTTA/MTTR by quarter and domain.
SEV accuracy (underrating/overrating).
Share of auto-mitigate incidents.
Playbook coverage of top N scenarios (> 90%).
On-time completion of CAPA items.
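
Several of these process metrics are simple ratios. The Python sketch below shows one way to compute them; the incident record keys (`initial_sev`, `final_sev`, `auto_mitigated`) and the scenario names are hypothetical.

```python
def _share(part, total):
    """Safe ratio that returns 0.0 for an empty denominator."""
    return part / total if total else 0.0


def process_metrics(incidents, top_scenarios, playbooks):
    """Compute SEV accuracy, auto-mitigation share and playbook coverage."""
    total = len(incidents)
    accurate = sum(1 for i in incidents if i["initial_sev"] == i["final_sev"])
    auto = sum(1 for i in incidents if i["auto_mitigated"])
    covered = len(set(top_scenarios) & set(playbooks))
    return {
        "sev_accuracy": _share(accurate, total),
        "auto_mitigated_share": _share(auto, total),
        "playbook_coverage": _share(covered, len(set(top_scenarios))),
    }
```

A coverage value of 0.8 for the top 5 scenarios, for instance, would fall short of the > 90% target.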


16) Implementation by phase

1. Week 1: SEV matrix, on-call roles, common card template, war-room rules.
2. Week 2: playbooks for the top 5 symptoms (5xx, DB lag, Kafka lag, NodeNotReady, TLS).
3. Week 3: ChatOps/bots, auto-creation of cards, communication templates/StatusPage.
4. Week 4+: security playbooks, PSP outages, Legal Hold, regular drills/chaos games.


17) Examples of "fast" runbooks (fragments)

Rollback API (K8s)

```bash
# Roll back the deployment to the previous revision and wait for it to settle
kubectl rollout undo deploy/api -n prod
kubectl rollout status deploy/api -n prod --timeout=5m
# Verification:
kubectl -n prod top pods -l app=api
```

Drain node

```bash
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
```

Feature-flag OFF (example)

```bash
curl -X POST "$FF_URL/toggle" -H "Authorization: Bearer $TOKEN" -d '{"feature":"X","enabled":false}'
```

18) Mini-FAQ

When should SEV-1 be raised?
When a key SLO or business function (payments, login, game) is affected and the burn rate consumes hours' worth of error budget in advance.

What matters more, RCA or recovery?
Stabilization always comes first, then RCA. Time to stabilization is the key metric.

Do I need to automate everything?
Automate frequent and safe steps; handle rare or risky ones semi-automatically, with IC confirmation.


Result

A robust incident process rests on three pillars: clear roles and SEV rules, high-quality playbooks/runbooks with automation, and a blameless postmortem culture. Capture patterns, train on-call engineers, measure MTTR and the error budget, and continuously improve detectors and playbooks; this directly reduces the risk and cost of downtime.
