[SEV] Brief description and date
1) Principles and culture
Blameless. Errors are a property of the system, not of people. We ask "why did this happen," not "who is to blame."
Facts and invariants. Every finding is grounded in the timeline, SLOs, traces, and logs.
Internal sharing. Affected teams receive the findings and lessons.
Actions over minutes. A document that changes nothing ≡ wasted time.
Fast publication. A post-mortem draft within 48–72 hours of the incident.
2) Classification and incident criteria
Severity (SEV):
- SEV1 - complete unavailability / loss of money or data;
- SEV2 - severe degradation (errors above SLO, p99 out of bounds for external users);
- SEV3 - partial degradation / a workaround exists.
- Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failures).
- SLO/error budget: how much budget was burned and how this affects release velocity and experiments.
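Error-budget accounting can be sketched in a few lines. This is a minimal illustration; the 99.9% target and 30-day window are assumptions for the example, not values from this document:

```python
# Minimal error-budget sketch: how much of the budget one incident burns.

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Total allowed 'bad' minutes for the window, e.g. 0.1% of 30 days."""
    return window_minutes * (1.0 - slo_target)

def budget_burned(incident_minutes: float, slo_target: float, window_minutes: int) -> float:
    """Fraction of the error budget consumed by a single incident."""
    return incident_minutes / error_budget_minutes(slo_target, window_minutes)

WINDOW = 30 * 24 * 60                                  # 30 days in minutes
print(round(error_budget_minutes(0.999, WINDOW), 1))   # 43.2 minutes of budget
print(round(budget_burned(18, 0.999, WINDOW), 2))      # an 18-minute outage burns 0.42
```

A burn fraction like this is what feeds the release-velocity decision: the closer it is to 1.0 within the window, the stricter the gates on further releases and experiments.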
3) Roles and incident process
Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
Communications Lead: notifies stakeholders/customers using templates.
Ops/On-call: mitigation and remediation actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Expert (SME): deep diagnostics.
Steps: detect → escalate → stabilize → verify → recover → implement improvements.
4) Post-mortem template (structure)
5) RCA techniques (root-cause analysis)
5 Whys - successively drilling into causes down to the system level.
Ishikawa (fishbone) - factors across "People/Processes/Tools/Materials/Environment/Measurements."
Event-Chain/Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident, and why they did not.
Change Correlation - correlation with releases, config changes, feature flags, provider incidents.
Practice: avoid "root cause = a person / one bug." Look for a systemic combination (debt + missing guardrails + outdated runbooks).
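Change Correlation from the list above is often just a lookup of changes that landed shortly before the incident. A minimal sketch; the change-log entries and the two-hour window are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical change log; in practice this comes from CI/CD pipelines
# and feature-flag / config audit trails.
changes = [
    ("deploy payment-api v1.42", datetime(2025, 10, 31, 12, 5)),
    ("flag checkout-v2 ON",      datetime(2025, 10, 31, 13, 10)),
    ("config edit rate-limits",  datetime(2025, 10, 29, 9, 0)),
]

def suspects(incident_start: datetime, window: timedelta = timedelta(hours=2)):
    """Return changes that landed within `window` before the incident started."""
    return [name for name, ts in changes
            if timedelta(0) <= incident_start - ts <= window]

print(suspects(datetime(2025, 10, 31, 13, 22)))
# → ['deploy payment-api v1.42', 'flag checkout-v2 ON']
```

The output is a list of candidate triggers, not a verdict: each suspect still needs to be confirmed or ruled out against the timeline and telemetry.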
6) Communications and transparency
Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with facts without "guilt," with apologies and an action plan.
Sensitivity: do not disclose personal data or secrets; legal wording must be agreed in advance.
After the incident: a summary note in plain language with a link to the technical report.
External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (for up to 18 minutes). The cause was degradation of a dependent service. We enabled bypass mode and restored operation at 13:58 UTC. We apologize. Within 72 hours we will publish a report with actions to prevent recurrence."
7) Actions and implementation management
Each action has an owner, deadline, acceptance criteria, and links to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, stability/chaos tests.
2. Observability: SLI/SLO, guard alerts, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: caching with coalescing, outbox/saga, idempotency, rate limiters/sharding.
Gates: releases are blocked until critical post-mortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms the risk is eliminated.
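The first action class (timeout budgets, retries with jitter) can be sketched as follows. Function names and parameters are illustrative assumptions; delays are accumulated rather than slept so the sketch stays testable:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """'Full jitter' exponential backoff: random delay in [0, min(cap, base * 2**attempt)].
    Randomization desynchronizes clients and helps avoid retry storms."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_budget(op, budget: float, max_attempts: int = 5):
    """Retry `op` until it succeeds, the attempt limit is hit,
    or the total timeout budget (seconds of delay) would be exceeded."""
    spent = 0.0
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            delay = backoff_with_jitter(attempt)
            if spent + delay > budget:
                raise TimeoutError("timeout budget exhausted")
            spent += delay  # in real code: time.sleep(delay)
    raise RuntimeError("attempts exhausted")
```

A breaker would sit one layer above this: after N consecutive budget or attempt exhaustions it opens and the caller falls back instead of retrying at all.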
8) Integration of feedback
Sources:
Telemetry: p99/p99.9 tails, error rate, queue depth, CDC lag, retry budget.
VoC/Support: topics of calls, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failure/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibility, SLA timing.
Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architectural ticket is created with a hypothesis and the price of the problem.
3. It enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure effect → update SLI/SLO/cost baselines.
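The ranking step of the loop above can be sketched as a scoring function. The formula (cost × frequency ÷ effort) is an illustrative ROI proxy, and the ticket entries are hypothetical:

```python
# Rank architectural tickets by a rough ROI proxy: the cost of the problem
# times how often it fires, divided by the effort to fix it.
def rank(tickets):
    return sorted(tickets,
                  key=lambda t: t["cost"] * t["frequency"] / t["effort"],
                  reverse=True)

tickets = [
    {"id": "ARCH-12", "cost": 8, "frequency": 5, "effort": 3},
    {"id": "ARCH-7",  "cost": 9, "frequency": 1, "effort": 5},
    {"id": "ARCH-30", "cost": 3, "frequency": 9, "effort": 2},
]
print([t["id"] for t in rank(tickets)])
# → ['ARCH-30', 'ARCH-12', 'ARCH-7']
```

In practice the score would also factor in risk class (from the severity/cost/frequency classification in step 1), but the shape of the decision is the same: an ordered backlog, not ad-hoc picking.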
9) Post-mortem maturity metrics
% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeated incidents for the same reason (target → 0).
Proportion of incidents caught by guards (breaker/limiter/timeouts) vs those that broke through.
Dashboard coverage (SLIs covering critical paths) and alert "noise."
Share of game-day/chaos scenarios that simulate detected failure classes.
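The first maturity metric can be computed directly from post-mortem records. A minimal sketch; the record layout and the two sample entries are hypothetical:

```python
from datetime import datetime

# Hypothetical post-mortem records: incident start and publication time.
postmortems = [
    {"incident": datetime(2025, 10, 31, 13, 22), "published": datetime(2025, 11, 2, 9, 0)},
    {"incident": datetime(2025, 10, 10, 2, 0),   "published": datetime(2025, 10, 15, 12, 0)},
]

def pct_published_within(records, hours: float = 72.0) -> float:
    """Share of post-mortems published within `hours` of the incident start."""
    on_time = sum(
        1 for r in records
        if (r["published"] - r["incident"]).total_seconds() <= hours * 3600
    )
    return 100.0 * on_time / len(records)

print(pct_published_within(postmortems))
# → 50.0  (the first is ~44 h, on time; the second is ~130 h, late)
```

The other metrics (action lead time, reopen rate, recurrence) follow the same pattern: a timestamped record per incident/action, aggregated per quarter against the targets listed above.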
10) Example post-mortem (summary)
Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts hit retries, some ended in abandonment. Q4 error budget: −7%.
Root cause: "slow success" from the currency dependency (p95 +400 ms); retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce timeout budgets and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
Breaker for "slow success" and a fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook, add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing chaos-smoke.
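The "breaker for slow success" action can be sketched as a breaker that trips on latency, not only on hard errors. Class name, thresholds, and window size are illustrative assumptions:

```python
from collections import deque

class SlowSuccessBreaker:
    """Opens when too many recent calls are 'slow successes' (they return 2xx
    but exceed the latency threshold), not only when they fail outright."""

    def __init__(self, threshold_ms: float = 400.0, window: int = 20,
                 max_slow_ratio: float = 0.5):
        self.threshold_ms = threshold_ms
        self.max_slow_ratio = max_slow_ratio
        self.samples = deque(maxlen=window)   # True = call was slow

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms > self.threshold_ms)

    @property
    def open(self) -> bool:
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough data yet
        return sum(self.samples) / len(self.samples) > self.max_slow_ratio
```

When `open` is True, the caller would serve the stale-data fallback (≤ 15 minutes old, per the action list) instead of calling the dependency.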
11) Artifact patterns
11.1 Timeline (example)
13:22:10 Alert: p99 > 800 ms (gateway)
13:24 IC assigned, war room opened
13:27:30 "Slow success" of currency-api identified
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41 stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Retries at the gateway stopped
13:58 Incident closed, 30-minute monitoring
11.2 Solution and validation (DoD)
Solution: enable breaker (slow_success)
DoD: chaos scenario "+300 ms to currency" → p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
11.3 "Gate" policy (check)
deny_release if any(postmortem_action.status != "Done" and action.severity in ["critical"])
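The gate pseudocode above can be made executable as a minimal sketch. Field names mirror the pseudocode; the ticket entries are hypothetical:

```python
def deny_release(actions) -> bool:
    """Block the release if any critical post-mortem action is not Done."""
    return any(a["status"] != "Done" and a["severity"] == "critical"
               for a in actions)

actions = [
    {"id": "PM-1", "severity": "critical", "status": "Done"},
    {"id": "PM-2", "severity": "critical", "status": "In Progress"},
    {"id": "PM-3", "severity": "minor",    "status": "Open"},
]
print(deny_release(actions))
# → True: PM-2 is critical and not Done
```

In a real Policy-as-Code setup this check would live in the CI pipeline (e.g. as an OPA/Rego policy or a pipeline script) and fail the release job rather than print a boolean.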
12) Anti-patterns
"Witch hunts" and punishment → hidden errors, lost signal.
Minutes without substance: long documents with no actions/owners/deadlines.
RCA at the level of "a bug in the code" without systemic factors.
Closing an incident without verification and baseline updates.
No internal sharing: other teams repeat the same mistakes.
Ignoring support/partner feedback and "invisible" degradations (slow success).
The summary "everything is fixed, moving on" - no changes to architecture or process.
13) Architect's checklist
1. Is there a single post-mortem template and an SLA to publish within ≤ 72 hours?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Is the timeline built from telemetry (traces/metrics/logs) and release/flag tags?
4. Are RCA techniques applied systematically (5 Whys, Ishikawa, Barrier)?
5. Do actions have owners, deadlines, and a DoD, and are they tied to release risks and gates?
6. Do incidents lead to updated runbooks/chaos scenarios/alerts?
7. Are VoC/Support channels integrated, and are the "top pains" reviewed regularly?
8. Does the error budget influence release and experiment policy?
9. Are maturity metrics tracked (time to post-mortem, reopen rate, recurrence)?
10. Are analyses published across teams, and is the knowledge base searchable?
Conclusion
Post-mortems and feedback loops are the mechanism by which an architecture learns. When blameless analysis, measurable action outcomes, and the integration of production signals become the norm, the system grows more stable, faster, and clearer week by week. Make facts visible, actions mandatory, and knowledge accessible, and incidents become fuel for the platform's evolution.