[SEV] Short description and date
1) Principles and culture
Blameless. An error is a property of the system, not of a person. We ask "why did this happen," not "who is to blame."
Facts and evidence. Every conclusion rests on the timeline, SLOs, traces, and logs.
Openness within the company. Conclusions and lessons are available to the relevant teams.
Actions matter more than the write-up. A document that leads to no action is wasted time.
Fast publication. The postmortem draft is ready within 48-72 hours after the incident.
2) Incident taxonomy and criteria
Severity (SEV):
- SEV1 - complete unavailability, or loss of money/data;
- SEV2 - significant degradation (errors > SLO, p99);
- SEV3 - partial degradation, or a workaround exists.
- Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment declines).
- SLO/error budget: how much of the budget was burned, and how that affects the pace of releases and experiments.
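The error-budget item can be made concrete with a small calculation. The sketch below assumes a request-based SLO; the function names and the 75% / 100% policy thresholds are hypothetical and only illustrate how burned budget might feed release policy.

```python
def error_budget_consumed(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget consumed (1.0 = fully spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


def release_policy(consumed):
    """Map budget consumption to a release mode (hypothetical thresholds)."""
    if consumed >= 1.0:
        return "freeze"   # budget spent: only reliability fixes ship
    if consumed >= 0.75:
        return "slow"     # budget nearly spent: fewer, smaller releases
    return "normal"
```

For example, with a 99.9% SLO and 1,000,000 requests the budget is about 1,000 failed requests, so 70 failures consume roughly 7% of it.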
3) Incident roles and process
Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
Communications Lead: informs stakeholders/customers using a template.
Ops/On-call: remediation and mitigating actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Experts (SME): in-depth diagnosis.
Stages: detect → escalate → stabilize → verify → recover → postmortem → implement improvements.
4) Postmortem template (structure)
5) RCA techniques (root cause analysis)
5 Whys - step-by-step refinement of causes down to the system level.
Ishikawa (fishbone) - factors: People/Processes/Tools/Materials/Environment/Measurements.
Event-Chain/Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident and why they did not work.
Change Correlation - correlation with releases, config changes, feature flags, and provider incidents.
Practice: avoid "root cause = a person / a single bug." Look for a systemic combination (technical debt + missing guardrails + outdated runbooks).
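Change Correlation is easy to start mechanically: list every change that landed shortly before the incident began. The sketch below is a minimal illustration; the `Change` record, the `suspect_changes` helper, and the two-hour lookback window are all assumptions, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str      # e.g. "release", "config", "flag", "provider"
    name: str
    at: datetime   # when the change took effect


def suspect_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)
```

The result is a list of candidates, not a verdict: a change close to the start of an incident is a hypothesis to test, in line with the practice of looking for a combination of factors.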
6) Communications and transparency
Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with facts and no blame, with an apology and an action plan.
Sensitivity: do not disclose personal data (PD) or secrets; agree the wording with legal.
After the incident: a summary note in plain language with a link to the technical report.
External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (for up to 18 minutes). The cause was degradation of a dependent service. We turned on bypass mode and restored operation at 13:58 UTC. We apologize. Within 72 hours, we will publish a report with actions to prevent recurrence."
7) Actions and implementation management
Each action has an owner, a deadline, acceptance criteria, and a link to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, stability/chaos tests.
2. Observability: SLI/SLO, alert guards, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: cache with coalescing, outbox/saga, idempotency, limiters/load shedding.
Gates: releases are blocked until critical postmortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms that the risk has been eliminated.
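Two of the engineering actions above, timeout budgets and retries with jitter, combine naturally. The sketch below uses "full jitter" exponential backoff inside one overall deadline. The function name, parameters, and defaults are hypothetical, and the broad `except Exception` is a simplification: real code should retry only errors known to be retryable.

```python
import random
import time


def call_with_retries(op, *, budget_s, base_delay_s=0.05, max_delay_s=1.0,
                      max_attempts=4, rng=random.random, sleep=time.sleep):
    """Call op(remaining_seconds) until it succeeds or the budget is spent."""
    deadline = time.monotonic() + budget_s
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # The operation receives the remaining budget as its own timeout.
            return op(remaining)
        except Exception as exc:  # sketch only: narrow this in real code
            last_error = exc
        if attempt == max_attempts - 1:
            break
        # Full jitter: a random delay in [0, min(cap, base * 2^attempt)).
        delay = rng() * min(max_delay_s, base_delay_s * 2 ** attempt)
        if time.monotonic() + delay >= deadline:
            break  # sleeping would overrun the budget, so give up now
        sleep(delay)
    raise TimeoutError("retry budget exhausted") from last_error
```

Randomizing the delay spreads retries from many clients over time, which is what prevents a synchronized retry wave from turning a slow dependency into a cascade.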
8) Integration of feedback
Sources:
Telemetry: p99/p99.9 tails, error rate, queue depth, CDC lag, retry budget.
VoC/Support: ticket topics, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failure/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibility, SLA timing.
Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architectural ticket is created with a hypothesis and the cost of the problem.
3. The ticket enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure effect → update SLI/SLO/cost baselines.
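Step 3 of the loop, ranking by ROI and risk, might look like the sketch below. The scoring formula, the `RISK_WEIGHT` values, and the `Ticket` fields are hypothetical assumptions chosen for illustration; the text does not prescribe a particular formula.

```python
from dataclasses import dataclass

# Hypothetical weights: higher severity multiplies the priority.
RISK_WEIGHT = {"SEV1": 3.0, "SEV2": 2.0, "SEV3": 1.0}


@dataclass(frozen=True)
class Ticket:
    title: str
    severity: str              # "SEV1", "SEV2" or "SEV3"
    cost_per_event: float      # estimated cost of one occurrence
    events_per_quarter: float  # observed or estimated frequency
    effort_days: float         # estimated effort to fix

    @property
    def roi(self):
        # Avoided cost per quarter divided by the effort to avoid it.
        return self.cost_per_event * self.events_per_quarter / self.effort_days

    @property
    def priority(self):
        return self.roi * RISK_WEIGHT[self.severity]


def rank(tickets):
    """Return tickets ordered from highest to lowest priority."""
    return sorted(tickets, key=lambda t: t.priority, reverse=True)
```

A frequent, cheap-to-fix SEV3 problem can outrank a rare SEV1 one under this scoring, which is worth reviewing by hand before the portfolio is committed.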
9) Post-mortem maturity metrics
% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeated incidents for the same reason (target → 0).
Proportion of incidents caught by guards (breakers/limiters/timeouts) versus those that broke through.
Dashboard coverage (SLIs covering critical paths) and alert noise.
Share of game-day/chaos scenarios that simulate detected failure classes.
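Two of these metrics, the share of postmortems published within 72 hours and the action reopen rate, can be computed from simple records. The sketch below is an illustration; the `PostmortemRecord` fields are assumptions about what an incident tracker might export.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PostmortemRecord:
    incident_end: datetime
    published_at: Optional[datetime]  # None means not published yet
    actions_closed: int
    actions_reopened: int


def pct_published_on_time(records, limit=timedelta(hours=72)):
    """Percentage of postmortems published within the limit (target >= 90)."""
    if not records:
        return 0.0
    on_time = sum(
        1 for r in records
        if r.published_at is not None and r.published_at - r.incident_end <= limit
    )
    return 100.0 * on_time / len(records)


def reopen_rate(records):
    """Reopened actions as a fraction of closed actions."""
    closed = sum(r.actions_closed for r in records)
    reopened = sum(r.actions_reopened for r in records)
    return reopened / closed if closed else 0.0
```

Unpublished postmortems count against the on-time percentage, so the metric cannot be improved by leaving drafts unfinished.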
10) Example of postmortem (summary)
Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts required retries, and some were abandoned. Q4 error budget: −7%.
Root cause: "slow success" of the currency dependency (p95 +400 ms) combined with retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce a timeout budget and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
Add a breaker for "slow success" and a fallback to stale data no older than 15 minutes.
Update the "slow dependency" runbook and add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing chaos-smoke.
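The second action, a breaker that reacts to "slow success" rather than only to errors, could be sketched as follows. The class, its thresholds, and the `fetch_rate` helper are hypothetical; the 15-minute staleness limit is the one from the action list.

```python
import time


class SlowSuccessBreaker:
    """Opens after several consecutive slow or failed calls."""

    def __init__(self, slow_threshold_s=0.4, max_consecutive_slow=3,
                 open_for_s=30.0, clock=time.monotonic):
        self.slow_threshold_s = slow_threshold_s
        self.max_consecutive_slow = max_consecutive_slow
        self.open_for_s = open_for_s
        self._clock = clock
        self._streak = 0
        self._open_until = None  # None means the breaker is closed

    def allow(self):
        return self._open_until is None or self._clock() >= self._open_until

    def record(self, duration_s, ok):
        if ok and duration_s <= self.slow_threshold_s:
            self._streak = 0
            self._open_until = None
            return
        # A slow success counts against the dependency just like an error.
        self._streak += 1
        if self._streak >= self.max_consecutive_slow:
            self._open_until = self._clock() + self.open_for_s
            self._streak = 0


def fetch_rate(breaker, fetch, stale, *, max_stale_s=15 * 60, clock=time.monotonic):
    """Return (value, source); stale is (value, fetched_at) or None."""
    if breaker.allow():
        started = clock()
        try:
            value = fetch()
        except Exception:  # sketch only: narrow this in real code
            breaker.record(clock() - started, ok=False)
        else:
            breaker.record(clock() - started, ok=True)
            return value, "fresh"
    if stale is not None and clock() - stale[1] <= max_stale_s:
        return stale[0], "stale"
    raise RuntimeError("dependency unavailable and no usable stale value")
```

Counting the `"stale"` results returned by `fetch_rate` is one way to feed the "served-stale share" dashboard mentioned in the actions.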
11) Artifact templates
11.1 Timeline (example)
13:22:10 Alert p99 > 800 ms (gateway)
13:24:00 IC assigned, war room opened
13:27:30 "Slow success" identified in currency-api
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41:00 Stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Retries limited at the gateway
13:58:00 Incident closed, monitoring for 30 min
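A timeline in this shape is easy to turn into durations such as time to assign an IC or time to stabilize. The sketch below parses a copy of the example timeline (notes translated to English); the helper names are hypothetical, and the date comes from the example postmortem in section 10.

```python
from datetime import datetime

TIMELINE = """\
13:22:10 Alert p99 > 800ms (gateway)
13:24:00 IC assigned, war room opened
13:27:30 Slow success identified in currency-api
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41:00 Stale-rates at 100%, p99 stabilized at 290ms
13:52:40 Retries limited at the gateway
13:58:00 Incident closed, monitoring for 30 min
"""


def parse_timeline(text, day="2025-10-31"):
    """Parse 'HH:MM:SS note' lines into (datetime, note) pairs."""
    events = []
    for line in text.strip().splitlines():
        stamp, _, note = line.partition(" ")
        events.append((datetime.fromisoformat(f"{day}T{stamp}"), note))
    return events


def elapsed_seconds(events, start_index, end_index):
    """Seconds between two timeline entries, by position."""
    return (events[end_index][0] - events[start_index][0]).total_seconds()
```

With this data, the IC was assigned 110 seconds after the first alert, p99 stabilized after 1,130 seconds, and the incident closed after 2,150 seconds.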
11.2 Solutions and validation (DoD)
Solution: breaker (slow_success)
DoD: chaos scenario "+300 ms to currency" - p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
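A DoD written as thresholds can be checked automatically after the chaos run. The sketch below encodes the three limits from the DoD line; the metric key names and the `dod_violations` helper are hypothetical, and a missing metric is deliberately treated as a failure.

```python
# Limits taken from the DoD above; each measured value must stay below its limit.
DOD_LIMITS = {
    "p99_ms": 450.0,
    "error_rate_pct": 0.5,
    "stale_share_pct": 12.0,
}


def dod_violations(measured, limits=DOD_LIMITS):
    """Return the names of metrics that are missing or not below their limit."""
    violations = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is None or value >= limit:
            violations.append(name)
    return violations
```

An empty result means the retest met the DoD; anything else names the metrics that still need work.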
11.3 Policy "gate" (check)
deny_release if any(action.status != "Done" and action.severity in ["critical"] for action in postmortem_actions)
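The gate rule translates almost directly into runnable Python. This is a sketch under the assumption that postmortem actions are available as simple records; the `Action` type and its fields are hypothetical, and a real setup would more likely express the rule in the policy engine of the CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    title: str
    severity: str  # e.g. "critical", "major", "minor"
    status: str    # e.g. "Open", "In progress", "Done"


def deny_release(actions):
    """True if any critical postmortem action is not yet Done."""
    return any(a.severity == "critical" and a.status != "Done" for a in actions)
```

An empty action list does not block the release, and non-critical open actions are ignored by this rule.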
12) Anti-patterns
"Witch hunts" and punishment → hidden errors, lost signals.
A postmortem for the sake of a postmortem: long documents with no actions, owners, or deadlines.
An RCA that stops at "a bug in the code" with no systemic factors.
Closing the incident without a retest and without updating baselines.
Lack of openness within the company: other teams repeat the same mistakes.
Ignoring feedback from support/partners and "invisible" degradation (slow success).
A "we fixed everything, moving on" summary with no changes to architecture or processes.
13) Architect's checklist
1. Is there a single postmortem template and a publication SLA of ≤ 72 h?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Are timelines based on telemetry (traces/metrics/logs) and on release/flag markers?
4. Are RCA techniques applied systematically (5 Whys, Ishikawa, Barrier)?
5. Do actions have owners, deadlines, and DoD, and are they tied to risk and release gates?
6. Does every incident lead to updated runbooks, chaos scenarios, and alerts?
7. Are VoC/Support channels in place, with a regular review of the "top pains"?
8. Does the error budget affect release and experiment policy?
9. Are maturity metrics tracked (time-to-postmortem, reopen rate, recurrence)?
10. Is there a knowledge base with openly shared internal reviews and search?
Conclusion
Postmortems and feedback are the mechanism by which an architecture learns. When blameless reviews, measurable effects of actions, and signals from production become routine, the system grows more resilient, faster, and easier to understand week by week. Make facts visible, actions mandatory, and knowledge accessible, and incidents will become fuel for the evolution of your platform.