[SEV] Short description and date

1) Principles and culture

Blameless. An error is a property of the system, not of a person. We ask "why did it happen," not "who is to blame."
Facts and invariants. All conclusions are based on timelines, SLOs, traces, and logs.
Visibility within the company. Results and lessons are available to adjacent teams.
Actions matter more than the write-up. A document that changes nothing ≡ wasted time.
Fast publication. A draft postmortem is published within 48-72 hours of the incident.

2) Incident taxonomy and criteria

Severity (SEV):

SEV1 - complete unavailability/loss of money or data;
SEV2 - significant degradation (errors > SLO, p99 out of bounds);
SEV3 - partial degradation/a workaround exists.
Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failures).
SLO/error budget: how much of the budget has been consumed and how this affects the pace of releases and experiments.
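
The error-budget criterion can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLO. The function names and the 75%/100% policy thresholds are illustrative assumptions, not values from this document.

```python
def error_budget_consumed(slo_target: float, total_requests: int,
                          failed_requests: int) -> float:
    """Return the fraction of the error budget consumed (1.0 means fully spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    # The budget is the number of failures the SLO still tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


def release_policy(consumed: float) -> str:
    """Map budget consumption to a release policy (thresholds are assumptions)."""
    if consumed >= 1.0:
        return "freeze"        # only reliability fixes ship
    if consumed >= 0.75:
        return "restricted"    # risky changes and experiments are paused
    return "normal"
```

For example, with a 99.9% SLO and 1,000,000 requests, 500 failed requests consume about half of the budget, which this sketch maps to the "normal" policy.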

3) Incident roles and process

Incident Commander (IC): runs the process, prioritizes steps, and assigns owners.
Communications Lead: informs stakeholders/customers using a template.
Ops/On-Call: remediation and mitigation actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Experts (SME): in-depth diagnosis.

Stages: detection → escalation → stabilization → verification → recovery → postmortem → rollout of improvements.

4) Postmortem template (structure)



5) RCA Techniques (Root Cause Search)

5 Whys - successive clarification of causes down to the system level.
Ishikawa (fishbone) - factors "People/Processes/Tools/Materials/Environment/Measurement."
Event-Chain/Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident and why they did not work.
Change Correlation - correlation with releases, config diffs, feature flags, provider incidents.

Practice: avoid "root cause = a person/a single bug." Look for a combination of system factors (technical debt + missing guard rails + outdated runbooks).
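
Change correlation in particular lends itself to a small helper. The sketch below assumes change events (releases, config diffs, flag flips, provider incidents) are already collected in one list. The `Change` type and the two-hour default window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str        # e.g. "release", "config", "flag", "provider"
    name: str
    at: datetime


def changes_before(incident_start: datetime, changes: list[Change],
                   window: timedelta = timedelta(hours=2)) -> list[Change]:
    """Return changes inside [incident_start - window, incident_start], newest first."""
    earliest = incident_start - window
    candidates = [c for c in changes if earliest <= c.at <= incident_start]
    return sorted(candidates, key=lambda c: c.at, reverse=True)
```

The result is a list of suspects, not a verdict: a change that happened shortly before the incident still needs to be confirmed against traces and metrics.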

6) Communications and transparency

Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with facts, without assigning blame, with an apology and an action plan.
Sensitivity: do not disclose personal data or secrets; wording is agreed with legal.
After the incident: a summary note in plain language with a link to the technical report.

External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (for up to 18 minutes). The cause was degradation of a dependent service. We turned on bypass mode and restored operation at 13:58 UTC. We apologize. Within 72 hours, we will publish a report with actions to prevent recurrence."

7) Actions and implementation management

Each action has an owner, a deadline, acceptance criteria, and a link to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, resilience/chaos tests.
2. Observability: SLI/SLO, alert guards, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: cache with request coalescing, outbox/saga, idempotency, limiters/load shedding.
Gates: releases are blocked until critical post-mortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms that the risk has been eliminated.
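
As an illustration of the "timeout budgets" and "retries with jitter" engineering actions, here is a minimal sketch of full-jitter exponential backoff bounded by an overall time budget. All parameter names and default values are assumptions, and the sketch bounds only the waiting between attempts, not the duration of `op` itself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(op: Callable[[], T], *, budget_s: float = 1.0,
                      base_s: float = 0.05, cap_s: float = 0.4,
                      max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> T:
    """Retry op with full-jitter backoff while the overall time budget allows it."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    deadline = clock() + budget_s
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
            out_of_attempts = attempt == max_attempts - 1
            out_of_budget = clock() + delay >= deadline
            if out_of_attempts or out_of_budget:
                raise  # give up and surface the last error to the caller
            sleep(delay)
    raise AssertionError("unreachable")
```

The random delay spreads retries from many clients over time, so that a slow dependency is not hit by synchronized retry waves. The `sleep` and `clock` parameters exist only to make the function testable without real waiting.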

8) Integration of feedback

Sources:
Telemetry: p99/p99.9 tails, error rate, queue depth, CDC lag, retry budget.
VoC/Support: ticket topics, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failures/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibilities, SLA timings.

Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architectural ticket is created with a hypothesis and the estimated cost of the problem.
3. The ticket enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure the effect → update SLI/SLO/cost baselines.
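
Step 3 of the loop, ranking by ROI and risk, might be sketched as follows. The scoring formula (risk-weighted monthly cost divided by fix effort) and the `Ticket` fields are assumptions chosen for illustration; a real portfolio would likely use its own model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    title: str
    monthly_cost: float   # estimated cost of the problem per month
    fix_effort: float     # estimated effort, in the same currency units
    risk: float           # 0..1, likelihood that the problem recurs or worsens


def score(ticket: Ticket) -> float:
    """Risk-weighted ROI: expected monthly saving per unit of effort."""
    if ticket.fix_effort <= 0:
        raise ValueError("fix_effort must be positive")
    return ticket.monthly_cost * ticket.risk / ticket.fix_effort


def rank(tickets: list[Ticket]) -> list[Ticket]:
    """Order tickets from highest to lowest score."""
    return sorted(tickets, key=score, reverse=True)
```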

9) Post-mortem maturity metrics

% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeated incidents with the same cause (target → 0).
Proportion of incidents caught by guards (breaker/limiter/timeouts) vs incidents that broke through.
Dashboard coverage (SLIs covering critical paths) and alert "noise."
Share of game-day/chaos scenarios that simulate detected failure classes.
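
Two of these metrics, the share of postmortems published within 72 hours and the action reopen rate, can be computed from simple records. This is a sketch under the assumption that each postmortem is tracked as a record like the hypothetical `PostmortemRecord` below, with the 72-hour clock starting when the incident is closed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PostmortemRecord:
    incident_closed: datetime
    published: Optional[datetime]   # None while the postmortem is unpublished
    actions_closed: int
    actions_reopened: int


def published_on_time_share(records: list[PostmortemRecord],
                            limit: timedelta = timedelta(hours=72)) -> float:
    """Fraction of postmortems published within the limit after incident closure."""
    if not records:
        return 0.0
    on_time = sum(
        1 for r in records
        if r.published is not None and r.published - r.incident_closed <= limit
    )
    return on_time / len(records)


def reopen_rate(records: list[PostmortemRecord]) -> float:
    """Reopened actions as a fraction of all closed actions."""
    closed = sum(r.actions_closed for r in records)
    reopened = sum(r.actions_reopened for r in records)
    return reopened / closed if closed else 0.0
```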

10) Example of postmortem (summary)

Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts needed retries, some were cancelled. Q4 error budget: −7%.
Root Cause: "slow success" of the currency dependency (p95 +400 ms), retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce a timeout budget and retries with jitter (DoD: p99 < 400 ms at +300 ms added to the dependency).
Add a breaker for "slow success" with a fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook, add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing chaos-smoke.
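
The second action, a breaker that also reacts to "slow success," could look roughly like the sketch below. It is deliberately simplified: it measures latency after the call returns rather than enforcing a timeout, and it has no half-open state. The class name, thresholds, and the 900-second staleness limit (15 minutes, from the action above) are otherwise assumptions.

```python
import time
from typing import Any, Callable, Optional


class SlowSuccessBreaker:
    """Opens after consecutive slow or failed calls; serves stale data while open."""

    def __init__(self, slow_after_s: float = 0.4, trip_after: int = 3,
                 open_for_s: float = 30.0, max_stale_s: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.slow_after_s = slow_after_s
        self.trip_after = trip_after
        self.open_for_s = open_for_s
        self.max_stale_s = max_stale_s
        self._clock = clock
        self._bad_streak = 0
        self._open_until = 0.0
        self._last_value: Any = None
        self._last_value_at: Optional[float] = None

    def call(self, fetch: Callable[[], Any]) -> Any:
        started = self._clock()
        if started < self._open_until:
            return self._stale(started)       # breaker open: skip the dependency
        try:
            value = fetch()
        except Exception:
            failed_at = self._clock()
            self._record_bad(failed_at)
            return self._stale(failed_at)
        finished = self._clock()
        if finished - started > self.slow_after_s:
            self._record_bad(finished)        # a slow success counts as a bad call
        else:
            self._bad_streak = 0
        self._last_value, self._last_value_at = value, finished
        return value

    def _record_bad(self, now: float) -> None:
        self._bad_streak += 1
        if self._bad_streak >= self.trip_after:
            self._open_until = now + self.open_for_s
            self._bad_streak = 0

    def _stale(self, now: float) -> Any:
        if self._last_value_at is None or now - self._last_value_at > self.max_stale_s:
            raise RuntimeError("no sufficiently fresh fallback value")
        return self._last_value
```

Counting slow responses as failures closes the gap described under "Barrier failure": a breaker that watches only 5xx never trips when the dependency answers correctly but too slowly.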

11) Artifact patterns

11.1 Timeline (example)

13:22:10 Alert p99 > 800 ms (Gateway)

13:24:00 IC assigned, war room opened

13:27:30 "Slow success" of currency-api identified

13:30:15 Feature flag stale-rates ON (10% of traffic)

13:41:00 Stale rates at 100%, p99 stabilized at 290 ms

13:52:40 Retries limited at the gateway

13:58:00 Incident closed, monitoring for 30 min
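
A timeline in this form can feed response-time metrics directly. The sketch below assumes all entries fall on the same day and uses short event labels that are hypothetical stand-ins for the descriptions above.

```python
from datetime import datetime, timedelta

# Timestamps from the example timeline; the labels are illustrative.
TIMELINE = [
    ("13:22:10", "alert"),
    ("13:24:00", "ic_assigned"),
    ("13:27:30", "cause_identified"),
    ("13:30:15", "mitigation_started"),
    ("13:41:00", "stabilized"),
    ("13:58:00", "closed"),
]


def elapsed(timeline: list[tuple[str, str]], start: str, end: str) -> timedelta:
    """Return the time between two labelled events (same-day timestamps assumed)."""
    times = {label: datetime.strptime(ts, "%H:%M:%S") for ts, label in timeline}
    return times[end] - times[start]
```

For this incident, `elapsed(TIMELINE, "alert", "stabilized")` gives 18 minutes 50 seconds, and `elapsed(TIMELINE, "alert", "closed")` gives 35 minutes 50 seconds.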


11.2 Solutions and Validation (DoD)

Solution: enable the breaker (slow_success)

DoD: chaos scenario "+300 ms to currency" - p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
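
A DoD like this is easiest to enforce when it is checked mechanically after the chaos run. A minimal sketch, assuming the run reports its measurements as a dictionary; the metric keys and the `dod_violations` helper are hypothetical, while the limits come from the DoD above.

```python
# Upper limits from the DoD; rates are expressed as fractions.
LIMITS = {"p99_ms": 450.0, "error_rate": 0.005, "stale_share": 0.12}


def dod_violations(measured: dict[str, float],
                   limits: dict[str, float]) -> list[str]:
    """Return the metrics that are missing or not strictly below their limit."""
    violations = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is None or value >= limit:
            violations.append(name)
    return violations
```

An empty result means the DoD is met. A missing measurement counts as a violation, so an incomplete chaos run cannot pass by accident.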


11.3 Policy "gate" (check)

deny_release if any(action.status != "Done" and action.severity in ["critical"] for action in postmortem_actions)
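
The same rule as a runnable check, for example as a CI step. This is a sketch: the `PostmortemAction` type and its field values are assumptions about how actions might be exported from a tracker.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PostmortemAction:
    id: str
    severity: str   # e.g. "critical", "major", "minor"
    status: str     # e.g. "Open", "In Progress", "Done"


BLOCKING_SEVERITIES = ("critical",)


def deny_release(actions: Iterable[PostmortemAction]) -> bool:
    """Return True if any blocking-severity action is not yet Done."""
    return any(
        a.status != "Done" and a.severity in BLOCKING_SEVERITIES
        for a in actions
    )
```

A CI job could load the open actions, call `deny_release`, and exit with a non-zero status when it returns True.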


12) Anti-patterns

"Witch hunts" and punishment → errors get covered up, signals are lost.
Paperwork for its own sake: long documents without actions/owners/deadlines.
RCAs that stop at the bug-in-code level without system factors.
Closing the incident without a retest and without updating baselines.
Lack of visibility within the company: the same mistakes are repeated in other teams.
Ignoring feedback from support/partners and "invisible" degradations (slow success).
The summary "everything is fixed, moving on" - no change to architecture/processes.

13) Architect's checklist

1. Is there a single postmortem template and a publication SLA of ≤ 72 hours?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Are timelines based on telemetry (traces/metrics/logs) and release/flag tags?
4. Are RCA techniques applied systematically (5 Whys, Ishikawa, Barrier Analysis)?
5. Do actions have owners, deadlines, a DoD, a risk link, and release gates?
6. Does each incident lead to updates of runbooks/chaos scenarios/alerts?
7. Are VoC/support channels integrated, with a regular review of "top pain points"?
8. Does the error budget influence release and experimentation policy?
9. Are maturity metrics tracked (time-to-postmortem, reopen rate, recurrence)?
10. Are postmortem reviews open across teams, and is there a searchable knowledge base?

Conclusion

Post-mortems and feedback are a mechanism for architectural learning. When blameless analyses, measurable effects of actions, and the integration of signals from production become the norm, the system grows more stable, faster, and easier to understand every week. Make facts visible, actions mandatory, and knowledge accessible, and incidents become fuel for the evolution of your platform.
