[SEV] Brief description and date
1) Principles and culture
Blameless. Errors are a property of the system, not of people. We ask "why did this happen," not "who is to blame."
Facts and invariants. Every finding is grounded in the timeline, SLOs, traces, and logs.
Internal sharing. Affected teams receive the findings and lessons.
Actions over minutes. A document that changes nothing ≡ wasted time.
Fast publication. A post-mortem draft within 48–72 hours of the incident.
2) Classification and incident criteria
Severity (SEV):
- SEV1 - complete unavailability / loss of money or data;
- SEV2 - severe degradation (errors above SLO, p99 out of bounds for external users);
- SEV3 - partial degradation / a workaround exists.
- Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failures).
- SLO/error budget: how much budget was burned and how this affects release velocity and experiments.
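Error-budget accounting can be sketched in a few lines. This is a minimal illustration; the 99.9% target and 30-day window are assumptions for the example, not values from this document:

```python
# Minimal error-budget sketch: how much of the budget one incident burns.

def error_budget_minutes(slo_target: float, window_minutes: int) -> float:
    """Total allowed 'bad' minutes for the window, e.g. 0.1% of 30 days."""
    return window_minutes * (1.0 - slo_target)

def budget_burned(incident_minutes: float, slo_target: float, window_minutes: int) -> float:
    """Fraction of the error budget consumed by a single incident."""
    return incident_minutes / error_budget_minutes(slo_target, window_minutes)

WINDOW = 30 * 24 * 60                                  # 30 days in minutes
print(round(error_budget_minutes(0.999, WINDOW), 1))   # 43.2 minutes of budget
print(round(budget_burned(18, 0.999, WINDOW), 2))      # an 18-minute outage burns 0.42
```

A burn fraction like this is what feeds the release-velocity decision: the closer it is to 1.0 within the window, the stricter the gates on further releases and experiments.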
3) Roles and incident process
Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
Communications Lead: notifies stakeholders/customers using templates.
Ops/On-call: mitigation and remediation actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Expert (SME): deep diagnostics.
Steps: detect → escalate → stabilize → verify → recover → implement improvements.
4) Post-mortem template (structure)
5) RCA techniques (root-cause analysis)
5 Whys - successively drilling into causes down to the system level.
Ishikawa (fishbone) - factors across "People/Processes/Tools/Materials/Environment/Measurements."
Event-Chain/Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident, and why they did not.
Change Correlation - correlation with releases, config changes, feature flags, provider incidents.
Practice: avoid "root cause = a person / one bug." Look for a systemic combination (debt + missing guardrails + outdated runbooks).
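Change Correlation from the list above is often just a lookup of changes that landed shortly before the incident. A minimal sketch; the change-log entries and the two-hour window are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical change log; in practice this comes from CI/CD pipelines
# and feature-flag / config audit trails.
changes = [
    ("deploy payment-api v1.42", datetime(2025, 10, 31, 12, 5)),
    ("flag checkout-v2 ON",      datetime(2025, 10, 31, 13, 10)),
    ("config edit rate-limits",  datetime(2025, 10, 29, 9, 0)),
]

def suspects(incident_start: datetime, window: timedelta = timedelta(hours=2)):
    """Return changes that landed within `window` before the incident started."""
    return [name for name, ts in changes
            if timedelta(0) <= incident_start - ts <= window]

print(suspects(datetime(2025, 10, 31, 13, 22)))
# → ['deploy payment-api v1.42', 'flag checkout-v2 ON']
```

The output is a list of candidate triggers, not a verdict: each suspect still needs to be confirmed or ruled out against the timeline and telemetry.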
6) Communications and transparency
Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with facts without "guilt," with apologies and an action plan.
Sensitivity: do not disclose personal data or secrets; legal wording must be agreed in advance.
After the incident: a summary note in plain language with a link to the technical report.
External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (for up to 18 minutes). The cause was degradation of a dependent service. We enabled bypass mode and restored operation at 13:58 UTC. We apologize. Within 72 hours we will publish a report with actions to prevent recurrence."
7) Actions and implementation management
Each action has an owner, deadline, acceptance criteria, and links to risk and priority.
Action classes:
1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, stability/chaos tests.
2. Observability: SLI/SLO, guard alerts, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: caching with coalescing, outbox/saga, idempotency, rate limiters/sharding.
Gates: releases are blocked until critical post-mortem actions are closed (Policy as Code).
Verification: a retest (chaos/load) confirms the risk is eliminated.
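The first action class (timeout budgets, retries with jitter) can be sketched as follows. Function names and parameters are illustrative assumptions; delays are accumulated rather than slept so the sketch stays testable:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """'Full jitter' exponential backoff: random delay in [0, min(cap, base * 2**attempt)].
    Randomization desynchronizes clients and helps avoid retry storms."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_budget(op, budget: float, max_attempts: int = 5):
    """Retry `op` until it succeeds, the attempt limit is hit,
    or the total timeout budget (seconds of delay) would be exceeded."""
    spent = 0.0
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            delay = backoff_with_jitter(attempt)
            if spent + delay > budget:
                raise TimeoutError("timeout budget exhausted")
            spent += delay  # in real code: time.sleep(delay)
    raise RuntimeError("attempts exhausted")
```

A breaker would sit one layer above this: after N consecutive budget or attempt exhaustions it opens and the caller falls back instead of retrying at all.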
8) Integration of feedback
Sources:
Telemetry: p99/p99.9 tails, error rate, queue depth, CDC lag, retry budget.
VoC/Support: topics of calls, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failure/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibility, SLA timing.
Signal → decision loop:
1. The signal is classified (severity/cost/frequency).
2. An architectural ticket is created with a hypothesis and the price of the problem.
3. It enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure effect → update SLI/SLO/cost baselines.
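The ranking step of the loop above can be sketched as a scoring function. The formula (cost × frequency ÷ effort) is an illustrative ROI proxy, and the ticket entries are hypothetical:

```python
# Rank architectural tickets by a rough ROI proxy: the cost of the problem
# times how often it fires, divided by the effort to fix it.
def rank(tickets):
    return sorted(tickets,
                  key=lambda t: t["cost"] * t["frequency"] / t["effort"],
                  reverse=True)

tickets = [
    {"id": "ARCH-12", "cost": 8, "frequency": 5, "effort": 3},
    {"id": "ARCH-7",  "cost": 9, "frequency": 1, "effort": 5},
    {"id": "ARCH-30", "cost": 3, "frequency": 9, "effort": 2},
]
print([t["id"] for t in rank(tickets)])
# → ['ARCH-30', 'ARCH-12', 'ARCH-7']
```

In practice the score would also factor in risk class (from the severity/cost/frequency classification in step 1), but the shape of the decision is the same: an ordered backlog, not ad-hoc picking.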
9) Post-mortem maturity metrics
% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeated incidents for the same reason (target → 0).
Proportion of incidents caught by guards (breaker/limiter/timeouts) vs those that broke through.
Dashboard coverage (SLIs covering critical paths) and alert "noise."
Share of game-day/chaos scenarios that simulate detected failure classes.
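The first maturity metric can be computed directly from post-mortem records. A minimal sketch; the record layout and the two sample entries are hypothetical:

```python
from datetime import datetime

# Hypothetical post-mortem records: incident start and publication time.
postmortems = [
    {"incident": datetime(2025, 10, 31, 13, 22), "published": datetime(2025, 11, 2, 9, 0)},
    {"incident": datetime(2025, 10, 10, 2, 0),   "published": datetime(2025, 10, 15, 12, 0)},
]

def pct_published_within(records, hours: float = 72.0) -> float:
    """Share of post-mortems published within `hours` of the incident start."""
    on_time = sum(
        1 for r in records
        if (r["published"] - r["incident"]).total_seconds() <= hours * 3600
    )
    return 100.0 * on_time / len(records)

print(pct_published_within(postmortems))
# → 50.0  (the first is ~44 h, on time; the second is ~130 h, late)
```

The other metrics (action lead time, reopen rate, recurrence) follow the same pattern: a timestamped record per incident/action, aggregated per quarter against the targets listed above.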
10) Example post-mortem (summary)
Event: SEV2. Payment API: p99 up to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22–13:58 UTC).
Impact: 12% of payment attempts hit retries, some ended in abandonment. Q4 error budget: −7%.
Root cause: "slow success" from the currency dependency (p95 +400 ms); retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:
Introduce timeout budgets and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
Breaker for "slow success" and a fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook, add a chaos scenario.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: no release without passing chaos-smoke.
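The "breaker for slow success" action can be sketched as a breaker that trips on latency, not only on hard errors. Class name, thresholds, and window size are illustrative assumptions:

```python
from collections import deque

class SlowSuccessBreaker:
    """Opens when too many recent calls are 'slow successes' (they return 2xx
    but exceed the latency threshold), not only when they fail outright."""

    def __init__(self, threshold_ms: float = 400.0, window: int = 20,
                 max_slow_ratio: float = 0.5):
        self.threshold_ms = threshold_ms
        self.max_slow_ratio = max_slow_ratio
        self.samples = deque(maxlen=window)   # True = call was slow

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms > self.threshold_ms)

    @property
    def open(self) -> bool:
        if len(self.samples) < self.samples.maxlen:
            return False                      # not enough data yet
        return sum(self.samples) / len(self.samples) > self.max_slow_ratio
```

When `open` is True, the caller would serve the stale-data fallback (≤ 15 minutes old, per the action list) instead of calling the dependency.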
11) Artifact patterns
11.1 Timeline (example)
13:22:10 Alert: p99 > 800 ms (gateway)
13:24 IC assigned, war room opened
13:27:30 "Slow success" of currency-api identified
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41 stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Retries at the gateway stopped
13:58 Incident closed, 30-minute monitoring
11.2 Solution and validation (DoD)
Solution: enable breaker (slow_success)
DoD: chaos scenario "+300 ms to currency" → p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
11.3 "Gate" policy (check)
deny_release if any(postmortem_action.status != "Done" and action.severity in ["critical"])
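The gate pseudocode above can be made executable as a minimal sketch. Field names mirror the pseudocode; the ticket entries are hypothetical:

```python
def deny_release(actions) -> bool:
    """Block the release if any critical post-mortem action is not Done."""
    return any(a["status"] != "Done" and a["severity"] == "critical"
               for a in actions)

actions = [
    {"id": "PM-1", "severity": "critical", "status": "Done"},
    {"id": "PM-2", "severity": "critical", "status": "In Progress"},
    {"id": "PM-3", "severity": "minor",    "status": "Open"},
]
print(deny_release(actions))
# → True: PM-2 is critical and not Done
```

In a real Policy-as-Code setup this check would live in the CI pipeline (e.g. as an OPA/Rego policy or a pipeline script) and fail the release job rather than print a boolean.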
12) Anti-patterns
"Witch hunts" and punishment → hidden errors, lost signal.
Minutes without substance: long documents with no actions/owners/deadlines.
RCA at the level of "a bug in the code" without systemic factors.
Closing an incident without verification and baseline updates.
No internal sharing: other teams repeat the same mistakes.
Ignoring support/partner feedback and "invisible" degradations (slow success).
The summary "everything is fixed, moving on" - no changes to architecture or process.
13) Architect's checklist
1. Is there a single post-mortem template and an SLA to publish within ≤ 72 hours?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Is the timeline built from telemetry (traces/metrics/logs) and release/flag tags?
4. Are RCA techniques applied systematically (5 Whys, Ishikawa, Barrier)?
5. Do actions have owners, deadlines, and a DoD, and are they tied to release risks and gates?
6. Do incidents lead to updated runbooks/chaos scenarios/alerts?
7. Are VoC/Support channels integrated, and are the "top pains" reviewed regularly?
8. Does the error budget influence release and experiment policy?
9. Are maturity metrics tracked (time to post-mortem, reopen rate, recurrence)?
10. Are analyses published across teams, and is the knowledge base searchable?
Conclusion
Post-mortems and feedback loops are the mechanism by which an architecture learns. When blameless analysis, measurable action outcomes, and the integration of production signals become the norm, the system grows more stable, faster, and clearer week by week. Make facts visible, actions mandatory, and knowledge accessible, and incidents become fuel for the platform's evolution.