[SEV] Short Description and Date
1) Principles and culture
Blameless. An error is a property of the system, not of a person. We ask "why did it happen," not "who is to blame."
Facts and invariants. All conclusions rest on the timeline, SLOs, traces, and logs.
Transparency within the company. Summaries and lessons learned are available to related teams.
Actions over paperwork. A document that changes nothing ≡ lost time.
Fast publication. A draft postmortem appears within 48-72 hours of the incident.
2) Taxonomy and incident criteria
Severity (SEV):
- SEV1 - complete unavailability / loss of money or data;
- SEV2 - significant degradation (error rate above SLO, p99 out of bounds);
- SEV3 - partial degradation / a workaround exists.
Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failure rate).
SLO / error budget: how much of the budget is burned and how that constrains the pace of releases and experiments.
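The error-budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a production SLO tool; the 99.9% target and request counts are hypothetical:

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the window.

    slo: availability target, e.g. 0.999
    total: total requests in the window
    failed: requests that violated the SLI
    """
    budget = (1.0 - slo) * total          # failures the SLO allows
    if budget == 0:
        return 0.0
    return max(0.0, 1.0 - failed / budget)

# Hypothetical example: 99.9% SLO, 1M requests, 700 bad ones.
# The budget is 1000 failures; 700 are spent, so 30% remains.
remaining = error_budget_remaining(0.999, 1_000_000, 700)
```

When the remaining fraction approaches zero, release and experiment policy should tighten, as section 2 notes.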
3) Incident roles and process
Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
Communications Lead: informs stakeholders/customers using a template.
Ops/On-call: mitigation and remediation actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Experts (SME): deep diagnostics.
Stages: detection → escalation → stabilization → verification → recovery → postmortem → implementation of improvements.
4) Postmortem template (structure)
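The body of this section appears to be missing from the source. The worked example in section 10 suggests a structure like the following sketch; all field names here are assumptions inferred from that example, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class PostmortemAction:
    description: str
    owner: str
    deadline: str            # ISO date
    dod: str                 # acceptance criteria (Definition of Done)
    severity: str            # e.g. "critical"
    status: str = "Open"

@dataclass
class Postmortem:
    title: str               # "[SEV] Short description and date"
    severity: str            # SEV1..SEV3
    event: str               # what happened, when (UTC)
    impact: str              # users/regions/business metrics affected
    timeline: list[str] = field(default_factory=list)
    root_cause: str = ""
    barrier_failure: str = ""   # which guards should have stopped it
    what_worked: str = ""
    actions: list[PostmortemAction] = field(default_factory=list)
```

A fixed structure like this keeps every postmortem comparable and makes the release-gate check in section 11.3 mechanical.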
5) RCA Techniques (Root Cause Analysis)
5 Whys - successive refinement of causes down to the system level.
Ishikawa (fishbone) - factors grouped as "People / Processes / Tools / Materials / Environment / Measurements."
Event Chain / Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) should have stopped the incident and why they did not.
Change Correlation - correlation with releases, config changes, feature flags, and provider incidents.
Practice: avoid "root cause = a person / a single bug." Look for a systemic combination (technical debt + missing guard rails + stale runbooks).
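Change Correlation reduces to a window query over change events. In this sketch the incident start time is taken from the example timeline in section 11.1, while the change events and the 30-minute lookback window are assumptions:

```python
from datetime import datetime, timedelta

def correlated_changes(changes, incident_start, lookback_min=30):
    """Return changes (deploys, config edits, flag flips) that landed
    within `lookback_min` minutes before the incident began."""
    window = timedelta(minutes=lookback_min)
    return [c for c in changes
            if incident_start - window <= c["at"] <= incident_start]

t0 = datetime(2025, 10, 31, 13, 22)   # incident start (section 11.1)
changes = [   # hypothetical change log entries
    {"kind": "deploy", "target": "currency-api",
     "at": datetime(2025, 10, 31, 13, 5)},
    {"kind": "flag", "target": "new-router",
     "at": datetime(2025, 10, 31, 9, 0)},
]
suspects = correlated_changes(changes, t0)   # only the 13:05 deploy
```

A correlated change is a lead, not a verdict: combine it with barrier analysis before declaring a cause.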
6) Communications and transparency
Internal: a single channel (war room), short updates on a template: status → actions → ETA of the next update.
External: status page / mailing with facts, no finger-pointing, with an apology and an action plan.
Sensitivity: do not disclose personal data or secrets; legal wording must be agreed on.
After the incident: a plain-language summary note with a link to the technical report.
External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (for up to 18 minutes). The cause was degradation of a dependent service. We enabled a bypass mode and restored service at 13:58 UTC. We apologize. Within 72 hours we will publish a report with actions to prevent recurrence."
7) Actions and implementation management
Each action has an owner, a deadline, acceptance criteria, and a link to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, circuit breakers, bulkheads, backpressure, stability/chaos tests.
2. Observability: SLI/SLO, alert guards, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: cache with request coalescing, outbox/saga, idempotency, rate limiters / load shedding.
Gates: a release is blocked until critical postmortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms the risk has been eliminated.
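"Retries with jitter" from the engineering class above can be sketched with the well-known full-jitter scheme; the base delay and cap values here are assumptions:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """'Full jitter' delay (seconds) before retry `attempt` (0-based):
    uniform in [0, min(cap, base * 2**attempt)]. Randomizing the delay
    breaks the synchronized retry waves that turn a slow dependency
    into a cascade."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

In practice this should be paired with a total retry budget, so retries stop entirely once the dependency is saturated, which is exactly the failure mode in the section 10 example.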
8) Integration of feedback
Sources:
Telemetry: p99/p99.9 tail latencies, error rate, queue depth, CDC lag, retry budget.
VoC/Support: call topics, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failures/friction, funnel drop-off.
Partners/Integrators: webhook failures, contract incompatibilities, SLA timings.
Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architecture ticket is created with a hypothesis and the cost of the problem.
3. It enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure the effect → update SLI/SLO/cost baselines.
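Step 3 of the loop, ranking by ROI and risk, might look like the following sketch. The scoring formula (monthly cost × frequency ÷ effort) and all field names are assumptions, one simple proxy among many possible:

```python
def rank_signals(signals):
    """Order improvement candidates by a rough ROI proxy:
    expected monthly cost of the problem times how often it bites,
    divided by the engineering effort to fix it."""
    return sorted(
        signals,
        key=lambda s: s["monthly_cost"] * s["frequency"] / s["effort"],
        reverse=True,
    )

backlog = rank_signals([   # hypothetical signals from section 8 sources
    {"name": "retry storms",  "monthly_cost": 8000, "frequency": 3, "effort": 5},
    {"name": "webhook drops", "monthly_cost": 2000, "frequency": 1, "effort": 2},
])
```

Whatever formula is used, it should be written down, so ranking arguments are about inputs, not opinions.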
9) Post-mortem maturity metrics
% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeat incidents with the same root cause (target → 0).
Proportion of incidents caught by guards (breaker/limiter/timeouts) vs. those that "broke through."
Dashboard coverage (SLIs covering critical paths) and alert noise.
Share of game-day/chaos scenarios that simulate the failure classes already observed.
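The first metric above (% of postmortems published within 72 hours) is simple arithmetic over postmortem records; the record shape in this sketch is an assumption:

```python
def pct_published_on_time(postmortems, sla_hours=72):
    """Share (in percent) of postmortems whose draft was published
    within the SLA window after the incident."""
    if not postmortems:
        return 0.0
    on_time = sum(1 for p in postmortems
                  if p["hours_to_publish"] <= sla_hours)
    return 100.0 * on_time / len(postmortems)

pct = pct_published_on_time([   # hypothetical records
    {"hours_to_publish": 40},
    {"hours_to_publish": 70},
    {"hours_to_publish": 90},   # late
])
```

Tracked weekly against the ≥ 90% target, this turns "we publish quickly" from a claim into a measurement.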
10) Example of postmortem (summary)
Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts went to retries, some ended in cancellation. Q4 error budget: −7%.
Root cause: "slow success" from the currency dependency (p95 +400 ms); retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce a timeout budget and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
A breaker for "slow success" and fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook; add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing the chaos smoke test.
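The "breaker for slow success" action deserves a sketch, since an ordinary breaker keyed on 5xx never trips on this failure mode. This is a minimal illustration; the latency budget, window size, and trip ratio are assumptions:

```python
class SlowSuccessBreaker:
    """Opens when recent calls are too often 'slow successes':
    responses that return 200 but exceed the latency budget.
    A breaker that only counts 5xx would stay closed here."""

    def __init__(self, latency_budget_s=0.4, window=20, trip_ratio=0.5):
        self.latency_budget_s = latency_budget_s
        self.window = window          # number of recent calls to keep
        self.trip_ratio = trip_ratio  # slow share that opens the breaker
        self.samples = []             # True = slow, False = fast

    def record(self, duration_s: float) -> None:
        self.samples.append(duration_s > self.latency_budget_s)
        self.samples = self.samples[-self.window:]

    def is_open(self) -> bool:
        if len(self.samples) < self.window:
            return False              # not enough data to judge yet
        return sum(self.samples) / len(self.samples) >= self.trip_ratio
```

While the breaker is open, the caller would serve stale data (≤ 15 minutes old, matching the action's DoD) instead of hitting the dependency.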
11) Artifact patterns
11.1 Timeline (example)
13:22:10 Alert: p99 > 800 ms (gateway)
13:24:00 IC assigned, war room opened
13:27:30 currency-api "slow success" identified
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41:00 stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Retries rate-limited at the gateway
13:58:00 Incident closed, 30 min of monitoring
11.2 Actions and Validation (DoD)
Action: enable the breaker (slow_success)
DoD: chaos scenario "+300 ms to currency" → p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
11.3 Release-gate policy (check)
deny_release if any(a.status != "Done" and a.severity == "critical" for a in postmortem.actions)
12) Anti-patterns
"Witch hunt" and punishment → hiding mistakes, loss of signals.
Protocol for the sake of protocol: long documents without actions/owners/deadlines.
OCA level "bug in the code" without system factors.
Closing the incident without retesting and updating the baselines.
Lack of publicity within the company: repeating the same mistakes on other teams.
Ignoring feedback from support/partners and "invisible" degradation (slow success).
Summary "fixed everything, moving on" - no changes in architecture/processes.
13) Architect checklist
1. Is there a single postmortem template and a publication SLA of ≤ 72 hours?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Are timelines built from telemetry (traces/metrics/logs) and release/flag markers?
4. Are RCA methods applied systematically (5 Whys, Ishikawa, Barrier Analysis)?
5. Do actions have owners, deadlines, and a DoD, linked to risk and release gates?
6. Does each incident update runbooks, chaos scenarios, and alerts?
7. Are VoC/Support channels built in, with a regular review of the "top pains"?
8. Does the error budget affect release and experiment policy?
9. Are maturity metrics tracked (time-to-postmortem, reopen rate, repeat incidents)?
10. Are analyses public across teams, with a searchable knowledge base?
Conclusion
Postmortems and feedback loops are the architecture's learning mechanism. When blameless reviews, measurable effects of actions, and integration of production signals become the norm, the system grows more stable, faster, and clearer every week. Make facts visible, actions mandatory, and knowledge accessible, and incidents become fuel for your platform's evolution.