[SEV] Brief description and date
1) Principles and culture
- Blameless. Errors are a property of the system, not of people. We ask "why did this happen", not "who is to blame".
- Facts and invariants. Every finding rests on the timeline, SLOs, traces, and logs.
- Company-wide sharing. Related teams receive the results and lessons learned.
- Actions matter more than minutes. A document that changes nothing ≡ wasted time.
- Fast publication. The post-mortem draft appears within 48-72 hours of the incident.
2) Classification and incident criteria
Severity (SEV):
- SEV1 - complete unavailability / loss of money or data;
- SEV2 - severe degradation (errors above SLO, external p99 out of bounds);
- SEV3 - partial degradation / a workaround exists.
- Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failures).
- SLO / error budget: how much budget was burned, and how that affects release velocity and experiments.
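Error-budget accounting like the above can be expressed in a few lines. A minimal sketch, with assumed numbers and an illustrative function name:

```python
# Sketch: how much of the error budget a period of traffic consumed.
def error_budget_burned(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget consumed (1.0 = fully burned)."""
    # A 99.9% SLO allows 0.1% of requests to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return float("inf")  # a 100% SLO has no budget at all
    return failed_requests / allowed_failures

# Example: 99.9% SLO, 1,000,000 requests, 300 failures -> 30% of the budget burned.
burned = error_budget_burned(0.999, 1_000_000, 300)
print(f"{burned:.0%} of the error budget burned")
```

A gate on this number (e.g. freeze risky releases above 100% burn) is one way the budget can "affect release velocity" as described above.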
3) Roles and the incident process
- Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
- Communications Lead: notifies stakeholders/customers using the template.
- Ops/On-call: mitigation and remediation actions.
- Subject Matter Expert (SME): deep diagnostics.
- Scribe: maintains the timeline and artifacts.
Steps: detect → escalate → stabilize → verify → recover → implement improvements.
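The step sequence above can be encoded so tooling enforces the order. A minimal sketch; the stage names are illustrative shorthand for the steps listed:

```python
from typing import Optional

# Ordered stages of the incident process described above.
INCIDENT_FLOW = ["detect", "escalate", "stabilize", "verify", "recover", "improve"]

def next_step(current: str) -> Optional[str]:
    """Return the next stage of the incident process, or None when the flow is done."""
    i = INCIDENT_FLOW.index(current)
    return INCIDENT_FLOW[i + 1] if i + 1 < len(INCIDENT_FLOW) else None
```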
4) Post-mortem template (structure)
5) RCA Techniques (Root Cause Search)
- 5 Whys - sequentially asking "why" until the causes reach the system level.
- Ishikawa (fishbone) - factors grouped as "People / Processes / Tools / Materials / Environment / Measurements".
- Event chain / ripple - a chain of events with probabilities and triggers.
- Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident, and why they did not.
- Change Correlation - correlation with releases, config changes, feature flags, and provider incidents.
Practice: avoid "root cause = a person / a single bug". Look for a systemic combination (tech debt + missing guard rails + stale runbooks).
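Change Correlation in particular is easy to automate: list everything that shipped shortly before the incident window. A minimal sketch, with a hypothetical event shape and made-up change IDs:

```python
from datetime import datetime, timedelta

def correlated_changes(changes, incident_start, window_minutes=60):
    """Return changes that landed within `window_minutes` before the incident started."""
    window = timedelta(minutes=window_minutes)
    return [c for c in changes if incident_start - window <= c["at"] <= incident_start]

changes = [
    {"kind": "release", "id": "api-v2.3.1", "at": datetime(2025, 10, 31, 13, 0)},
    {"kind": "flag", "id": "new-router", "at": datetime(2025, 10, 31, 9, 15)},
]
# Only the 13:00 release falls inside the 60-minute window before 13:22.
suspects = correlated_changes(changes, datetime(2025, 10, 31, 13, 22))
```

Correlation is a lead, not a verdict: a suspect change still needs the barrier and 5-Whys analysis above to become a root cause.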
6) Communications and transparency
Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with facts without "guilt," with apologies and an action plan.
Sensitivity: do not disclose personal data or secrets; legal wording must be agreed.
After the incident: a summary note in plain language with a link to the technical report.
External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (up to 18 minutes). The reason is the degradation of the dependent service. We turned on bypass mode and restored operation at 13:58 UTC. Apologies. Within 72 hours, we will publish a report with actions to prevent recurrence"
7) Actions and implementation management
Each action has an owner, a deadline, acceptance criteria, and a link to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, stability/chaos tests.
2. Observability: SLI/SLO, alert guards, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: cache with request coalescing, outbox/saga, idempotency, rate limiters/sharding.
Gates: releases are blocked until critical post-mortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms the risk has been eliminated.
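Two of the engineering actions above, a timeout budget and retries with jitter, fit in a short sketch. Parameters are assumed defaults, not recommendations:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def remaining_budget(total_budget: float, elapsed: float) -> float:
    """Timeout budget left for downstream calls; stop retrying when it hits zero."""
    return max(0.0, total_budget - elapsed)
```

Full jitter spreads retry storms out in time, and the shared budget ensures retries never push total latency past what the caller promised upstream.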
8) Integration of feedback
Sources:
Telemetry: p99/p99.9 tail latencies, error rate, queue depth, CDC lag, retry budget.
VoC/Support: topics of calls, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failure/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibility, SLA timing.
Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architecture ticket is created with a hypothesis and the cost of the problem.
3. The ticket enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure the effect → update SLI/SLO/cost baselines.
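Step 3 of the loop needs some ranking function. A minimal sketch with a hypothetical scoring formula and made-up ticket data; real portfolios would weigh these factors differently:

```python
def rank_tickets(tickets):
    """Sort tickets by a simple (benefit / cost) score weighted by risk reduction."""
    return sorted(
        tickets,
        key=lambda t: (t["benefit"] / max(t["cost"], 1e-9)) * t["risk_reduction"],
        reverse=True,
    )

tickets = [
    {"id": "outbox-pattern", "benefit": 8, "cost": 5, "risk_reduction": 0.9},
    {"id": "dash-refresh", "benefit": 2, "cost": 1, "risk_reduction": 0.2},
]
# outbox-pattern scores 1.44 vs 0.4, so it is ranked first.
```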
9) Post-mortem maturity metrics
% postmortems published ≤ 72 h (target ≥ 90%).
Average lead time from the incident to closure of the key actions.
Reopen rate of actions (quality of the DoD wording).
Repeat incidents with the same cause (target → 0).
Share of incidents stopped by guards (breakers/limiters/timeouts) vs those that broke through.
Dashboard coverage (SLIs covering the critical paths) and alert "noise".
Share of game-day/chaos scenarios that simulate detected failure classes.
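The first metric in the list is mechanical to compute. A minimal sketch, with a hypothetical record shape and example dates:

```python
from datetime import datetime, timedelta

def published_within_sla(postmortems, sla=timedelta(hours=72)):
    """Fraction of post-mortems whose report appeared within the publication SLA."""
    if not postmortems:
        return 0.0
    on_time = sum(1 for p in postmortems if p["published"] - p["incident_end"] <= sla)
    return on_time / len(postmortems)

pms = [
    {"incident_end": datetime(2025, 10, 31, 14), "published": datetime(2025, 11, 2, 10)},   # 44 h
    {"incident_end": datetime(2025, 10, 1, 9),   "published": datetime(2025, 10, 6, 9)},    # 120 h
]
# -> 0.5: the second report took 5 days, missing the 72-hour SLA.
```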
10) Example of postmortem (summary)
Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts went to retries, some ended in cancellation. Q4 error budget: −7%.
Root cause: "slow success" of the currency dependency (p95 +400 ms); retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce a timeout budget and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
Add a breaker for "slow success" and a fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook; add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing the chaos smoke test.
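The key barrier fix here, a breaker that also counts "slow success", can be sketched in a few lines. Thresholds are assumed for illustration, and a production breaker would add half-open probing and sliding windows:

```python
class SlowSuccessBreaker:
    """Breaker that treats over-budget latency as a failure, even on HTTP 200."""

    def __init__(self, latency_budget_ms: float = 400, max_failures: int = 5):
        self.latency_budget_ms = latency_budget_ms
        self.max_failures = max_failures
        self.failures = 0
        self.open = False

    def record(self, ok: bool, latency_ms: float) -> None:
        # A slow success counts as an error for breaker accounting.
        if not ok or latency_ms > self.latency_budget_ms:
            self.failures += 1
        else:
            self.failures = 0  # simplification: any fast success resets the count
        self.open = self.failures >= self.max_failures

    def allow(self) -> bool:
        return not self.open
```

This is exactly the gap the post-mortem names: a breaker keyed only on 5xx never sees a dependency that answers correctly but 400 ms too late.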
11) Artifact patterns
11.1 Timeline (example)
13:22:10 Alert: p99 > 800 ms (gateway)
13:24:00 IC assigned, war room opened
13:27:30 "Slow success" of currency-api identified
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41:00 stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Rollback at the gateway
13:58:00 Incident closed; monitoring for another 30 minutes
11.2 Decisions and validation (DoD)
Decision: enable the breaker (slow_success)
DoD: chaos scenario "+300 ms to currency" - p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
11.3 Policy gate (check)
deny_release if any(postmortem_action.status != "Done" and postmortem_action.severity == "critical")
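The pseudocode gate above translates directly into an executable check. A minimal sketch in Python, with an assumed action-record shape and made-up action IDs:

```python
def deny_release(postmortem_actions) -> bool:
    """Block the release while any critical post-mortem action is not Done."""
    return any(
        a["status"] != "Done" and a["severity"] == "critical"
        for a in postmortem_actions
    )

actions = [
    {"id": "breaker-slow-success", "severity": "critical", "status": "In Progress"},
    {"id": "dashboard-stale-share", "severity": "normal", "status": "Done"},
]
# deny_release(actions) -> True: a critical action is still open.
```

In practice such a check would run inside the CI/CD pipeline (or a policy engine) so the gate cannot be skipped by hand.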
12) Anti-patterns
- "Witch hunts" and punishment → people hide mistakes, the signal is lost.
- Minutes instead of actions: long documents with no actions, owners, or deadlines.
- RCA that stops at "a bug in the code" without the systemic factors.
- Closing incidents without verification and baseline updates.
- No company-wide sharing: other teams repeat the same mistakes.
- Ignoring feedback from support/partners and "invisible" degradations (slow success).
- The summary "everything is fixed, moving on" - no changes to architecture or process.
13) Architect's checklist
1. Is there a single post-mortem template with a publication SLA of ≤ 72 hours?
2. Are the roles (IC, Comms, Scribe, SME) assigned automatically?
3. Is the timeline built from telemetry (traces/metrics/logs) and release/flag tags?
4. Are RCA techniques applied systematically (5 Whys, Ishikawa, Barrier Analysis)?
5. Do actions have owners, deadlines, and a DoD; are they tied to release risks and gates?
6. Do incidents lead to updated runbooks, chaos scenarios, and alerts?
7. Are VoC/Support channels integrated, and are the "top pains" reviewed regularly?
8. Does the error budget influence release and experiment policy?
9. Are maturity metrics tracked (time to post-mortem, reopen rate, recurrence)?
10. Are post-mortems shared across teams with a searchable knowledge base?
Conclusion
Post-mortems and feedback loops are the mechanism by which an architecture learns. When blameless analysis, measurable action outcomes, and the integration of production signals become the norm, the system gets more stable, faster, and clearer week by week. Make the facts visible, the actions mandatory, and the knowledge accessible, and incidents become fuel for the platform's evolution.