[SEV] Short description and date

1) Principles and culture

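Blameless. Errors are a property of the system, not of a person. We ask "why did this happen."
Facts and invariants. Conclusions are based on the timeline, SLOs, traces, and logs.
Internal publicity. Results and lessons are shared with the relevant teams.
Actions matter more than the report. A document that changes nothing ≡ lost time.
Fast publication. The postmortem draft is ready within 48-72 hours of the incident.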

2) Classification and incident criteria

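Severity (SEV):

SEV1 - complete unavailability, or loss of money/data.
SEV2 - significant degradation (errors > SLO, p99 out of bounds).
SEV3 - partial degradation / a workaround exists.
Impact: affected regions/tenants/products, duration, business metrics (conversion, GMV, payment failures).
SLO/error budget: how much of the budget has been burned and how that affects the pace of releases and experiments.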
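The error-budget item above comes down to simple arithmetic. The following is a minimal sketch, assuming a request-based SLO; the function names and the 75% "restricted" threshold are illustrative assumptions, not values from this document.

```python
def error_budget_consumed(slo_target: float, total_requests: int,
                          failed_requests: int) -> float:
    """Return the fraction of the error budget consumed (1.0 = fully spent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 0.0
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


def release_policy(consumed: float) -> str:
    """Map budget consumption to a release stance (thresholds are assumptions)."""
    if consumed >= 1.0:
        return "freeze"        # budget spent: only reliability work ships
    if consumed >= 0.75:
        return "restricted"    # slow down risky releases and experiments
    return "normal"


if __name__ == "__main__":
    burned = error_budget_consumed(0.999, 1_000_000, 500)
    print(f"budget consumed: {burned:.0%}, policy: {release_policy(burned)}")
```

With a 99.9% SLO over one million requests, the budget is about 1,000 failed requests, so 500 failures burn roughly half of it.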

3) Incident roles and process

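Incident Commander (IC): runs the process, prioritizes steps, assigns owners.
Communications Lead: informs stakeholders/customers using a template.
Ops/On-call: remediation and mitigating actions.
Scribe: maintains the timeline and artifacts.
Subject Matter Experts (SME): deep diagnostics.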

Stages: detection → escalation → stabilization → verification → recovery → postmortem → rollout of improvements.

4) Postmortem template (structure)



5) RCA Techniques (Root Cause Search)

5 Whys - successive clarification of causes down to the system level.
Ishikawa (fishbone) - factors "People/Processes/Tools/Materials/Environment/Measurements."
Event-Chain/Ripple - a chain of events with probabilities and triggers.
Barrier Analysis - which "fuses" (timeouts, breakers, quotas, tests) were supposed to stop the incident and why they did not work.
Change Correlation - correlation with releases, config changes, feature flags, provider incidents.

Practice: Avoid "root cause = a person/a single bug." Look for a combination of system factors (debt + missing guardrails + outdated runbooks).
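Change correlation in particular is easy to mechanize. Below is a minimal sketch, assuming change events (releases, config changes, flag flips, provider incidents) are already collected in one list; the `Change` type, the `changes_before` helper, and the two-hour window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str       # e.g. "release", "config", "feature_flag", "provider_incident"
    name: str
    at: datetime


def changes_before(incident_start: datetime, changes: list[Change],
                   window: timedelta = timedelta(hours=2)) -> list[Change]:
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    hits = [c for c in changes if earliest <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


if __name__ == "__main__":
    start = datetime(2025, 10, 31, 13, 22)
    log = [
        Change("release", "gateway-v2", start - timedelta(minutes=30)),
        Change("config", "old-change", start - timedelta(hours=5)),
        Change("feature_flag", "flag-x", start - timedelta(minutes=5)),
    ]
    for change in changes_before(start, log):
        print(change.at.isoformat(), change.kind, change.name)
```

The output is a list of suspects, not a verdict: a change close in time to the incident still has to be confirmed through the other techniques.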

6) Communications and transparency

Internal: single channel (war-room), short updates according to the template: status → actions → ETA of the next update.
External: status page/newsletter with the facts, without assigning blame, with an apology and an action plan.
Sensitivity: do not disclose personal data (PD) or secrets; agree on the legal wording.
After the incident: a summary note in plain language with a link to the technical report.

External update template (brief):
"31 Oct 2025, 13:40 UTC - some users encountered payment errors (up to 18 minutes). The cause was degradation of a dependent service. We turned on bypass mode and restored operation at 13:58 UTC. We apologize. Within 72 hours, we will publish a report with actions to prevent recurrence."

7) Actions and implementation management

Each action has an owner, a deadline, acceptance criteria, and a link to risk and priority.
Action classes:

1. Engineering: timeout budgets, retries with jitter, breakers, bulkheads, backpressure, resilience/chaos tests.
2. Observability: SLI/SLO, alert guards, saturation, traces, steady-state dashboards.
3. Process: runbook updates, on-call drills, game days, CI gates, two-person review for risky changes.
4. Architecture: cache with coalescing, outbox/saga, idempotency, limiters/load shedding.
Gates: releases fail unless "post-mortem critical actions" are closed (Policy as Code).
Verification: retest (chaos/load) confirms the elimination of the risk.
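Of the engineering actions above, "timeout budgets" and "retries with jitter" are the ones most often implemented incorrectly. Here is a minimal sketch, assuming full-jitter exponential backoff and one overall deadline per request; `call_with_retries` and all its parameters are hypothetical names, and the injectable `sleep`, `clock`, and `rng` exist only to make the logic testable.

```python
import random
import time


def call_with_retries(operation, *, budget_s=1.0, base_delay_s=0.05,
                      max_delay_s=0.4, max_attempts=4,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Retry `operation` with full-jitter backoff inside one overall time budget."""
    deadline = clock() + budget_s
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            # Pass the remaining budget down so the callee can set its own timeout.
            return operation(remaining)
        except Exception as exc:  # sketch: real code would catch only retryable errors
            last_error = exc
        # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)).
        delay = rng() * min(max_delay_s, base_delay_s * (2 ** attempt))
        if attempt == max_attempts - 1 or clock() + delay >= deadline:
            break  # no attempts left, or the wait would overrun the budget
        sleep(delay)
    raise TimeoutError("retry budget exhausted") from last_error
```

The randomized delay spreads retries from many clients over time, and the shared deadline keeps the total latency bounded regardless of how many attempts are made.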

8) Integration of feedback

Sources:

Telemetry: p99/p99.9 tails, error rate, queue depth, CDC lag, retry budget.
VoC/Support: contact topics, CSAT/NPS, churn signals, "pain points."
Product/Analytics: user behavior, failure/friction, drop-off in funnels.
Partners/Integrators: webhook failures, contract incompatibility, SLA timing.

Signal → decision loop:

1. The signal is classified (severity/cost/frequency).
2. An architectural ticket is created with a hypothesis and the cost of the problem.
3. The ticket enters the engineering portfolio (quarterly/monthly), ranked by ROI and risk.
4. Execute → measure effect → update SLI/SLO/cost baselines.
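Step 3 of the loop above needs some ranking rule. The sketch below shows one possible rule and nothing more: the `Ticket` fields, the units, and the scoring formula (cost avoided per week of effort, boosted by risk) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    title: str
    monthly_cost: float   # estimated cost of the problem per month (any currency)
    effort_weeks: float   # estimated engineering effort
    risk: float           # 0..1, assumed chance of a serious incident if left alone


def priority(ticket: Ticket) -> float:
    """Illustrative score: cost avoided per week of effort, boosted by risk."""
    if ticket.effort_weeks <= 0:
        raise ValueError("effort_weeks must be positive")
    roi = ticket.monthly_cost / ticket.effort_weeks
    return roi * (1.0 + ticket.risk)


def rank(tickets: list[Ticket]) -> list[Ticket]:
    """Return tickets ordered from highest to lowest priority."""
    return sorted(tickets, key=priority, reverse=True)
```

Whatever formula a team picks, keeping it explicit makes the portfolio discussion about the inputs (cost, effort, risk) rather than about opinions.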

9) Post-mortem maturity metrics

% postmortems published ≤ 72 h (target ≥ 90%).
Average "lead time" from incident to closure of key actions.
Reopen rate of actions (quality of DoD formulations).
Repeated incidents for the same reason (target → 0).
Proportion of incidents caught by guards (breakers/limiters/timeouts) vs those that broke through.
Dashboard coverage (SLIs covering critical paths) and alert "noise."
Share of game-day/chaos scenarios that simulate detected failure classes.
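Two of the metrics above can be computed directly from incident records. This is a minimal sketch, assuming each record stores when the incident was closed, when the postmortem was published, and a root-cause tag; the `IncidentRecord` type and the choice to measure the 72 hours from `closed_at` are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class IncidentRecord:
    closed_at: datetime
    postmortem_published_at: Optional[datetime]  # None = not published yet
    root_cause_tag: str


def published_on_time_share(records: list[IncidentRecord],
                            limit: timedelta = timedelta(hours=72)) -> float:
    """Share of incidents whose postmortem was published within `limit`."""
    if not records:
        return 0.0
    on_time = sum(
        1 for r in records
        if r.postmortem_published_at is not None
        and r.postmortem_published_at - r.closed_at <= limit
    )
    return on_time / len(records)


def repeat_incident_count(records: list[IncidentRecord]) -> int:
    """Count incidents whose root-cause tag already appeared in an earlier one."""
    seen: set[str] = set()
    repeats = 0
    for record in sorted(records, key=lambda r: r.closed_at):
        if record.root_cause_tag in seen:
            repeats += 1
        seen.add(record.root_cause_tag)
    return repeats
```

The repeat count is only as good as the tagging discipline: if root-cause tags are free text, similar causes will not be matched.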

10) Example of postmortem (summary)

Event: SEV2. Payment API: p99 rose to 1.8 s, 3% 5xx, 31 Oct 2025 (13:22-13:58 UTC).
Impact: 12% of payment attempts needed retries, some were abandoned. Q4 error budget: −7%.
Root cause: "slow success" of the currency dependency (p95 +400 ms), retries without jitter → cascade.
Barrier failure: the breaker was configured only for 5xx, not for timeouts; there was no rate cap for low-priority traffic.
What worked: manual load shedding and the stale-rates feature flag.
Actions:

Introduce a timeout budget and retries with jitter (DoD: p99 < 400 ms with +300 ms added to the dependency).
Add a breaker for "slow success" and a fallback to stale data ≤ 15 minutes old.
Update the "slow dependency" runbook, add a chaos script.
Add a "served-stale share" dashboard and an alert at > 10%.
Introduce a release gate: block the release unless the chaos smoke test passes.
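The barrier failure in this example was a breaker that reacted only to errors. The following is a minimal sketch of a breaker that also counts slow successes, assuming a consecutive-bad-call policy; `SlowSuccessBreaker`, its thresholds, and the caller-supplied `fallback` are hypothetical, and checking that stale data is no older than 15 minutes is left to the fallback.

```python
import time


class SlowSuccessBreaker:
    """Opens after too many consecutive failures or slow responses."""

    def __init__(self, slow_threshold_s=0.4, max_bad_calls=3,
                 open_for_s=30.0, clock=time.monotonic):
        self._slow_threshold_s = slow_threshold_s
        self._max_bad_calls = max_bad_calls
        self._open_for_s = open_for_s
        self._clock = clock
        self._bad_calls = 0
        self._open_until = 0.0

    def call(self, fetch, fallback):
        """Run `fetch`, or serve `fallback` while the breaker is open."""
        started = self._clock()
        if started < self._open_until:
            return fallback()
        try:
            value = fetch()
        except Exception:
            self._record_bad()
            return fallback()
        if self._clock() - started > self._slow_threshold_s:
            self._record_bad()   # a slow success still counts against the dependency
        else:
            self._bad_calls = 0  # a fast success resets the streak
        return value

    def _record_bad(self):
        self._bad_calls += 1
        if self._bad_calls >= self._max_bad_calls:
            self._open_until = self._clock() + self._open_for_s
            self._bad_calls = 0
```

A slow call still returns its fresh value here; the breaker only protects subsequent callers. A production version would also enforce a timeout on `fetch` itself.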

11) Artifact templates

11.1 Timeline (example)

13:22:10 Alert: p99 > 800 ms (gateway)
13:24:00 IC assigned, war room opened
13:27:30 currency-api "slow success" identified
13:30:15 Feature flag stale-rates ON (10% of traffic)
13:41:00 stale-rates at 100%, p99 stabilized at 290 ms
13:52:40 Retries limited at the gateway
13:58:00 Incident closed, monitoring for 30 minutes


11.2 Decisions and validation (DoD)

Decision: enable the breaker (slow_success)

DoD: chaos script "+300ms to currency" - p99 < 450 ms, error_rate < 0.5%, stale_share < 12%
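A DoD like the one above is most useful when it is checked automatically after the chaos run. Here is a minimal sketch, assuming the run produces a flat dictionary of measurements; the metric keys and the `dod_violations` helper are hypothetical, while the limits are the ones stated in the DoD.

```python
# Limits taken from the DoD: every metric must stay strictly below its limit.
DOD_LIMITS = {"p99_ms": 450.0, "error_rate": 0.005, "stale_share": 0.12}


def dod_violations(measured: dict[str, float],
                   limits: dict[str, float] = DOD_LIMITS) -> list[str]:
    """Return one message per metric that is missing or not below its limit."""
    problems = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is None:
            problems.append(f"{name}: no measurement")
        elif value >= limit:
            problems.append(f"{name}: {value} is not below {limit}")
    return problems


if __name__ == "__main__":
    run = {"p99_ms": 430.0, "error_rate": 0.002, "stale_share": 0.08}
    print(dod_violations(run) or "DoD met")
```

Treating a missing measurement as a violation prevents an action from being closed simply because the chaos run did not report a metric.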


11.3 Policy "gate" (check)

deny_release if any(postmortem_action.status != "Done" and postmortem_action.severity in ["critical"])
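The gate rule above can be sketched as an ordinary function that a CI step could call. This is a minimal illustration, assuming postmortem actions are available as records with a status and a severity; the `PostmortemAction` type and its field names are hypothetical and not tied to any specific policy engine.

```python
from dataclasses import dataclass

BLOCKING_SEVERITIES = ("critical",)


@dataclass(frozen=True)
class PostmortemAction:
    id: str
    status: str     # e.g. "Open", "In Progress", "Done"
    severity: str   # e.g. "critical", "major", "minor"


def deny_release(actions: list[PostmortemAction]) -> bool:
    """Deny the release if any critical postmortem action is not Done."""
    return any(
        action.status != "Done" and action.severity in BLOCKING_SEVERITIES
        for action in actions
    )


if __name__ == "__main__":
    backlog = [
        PostmortemAction("PM-1", "Done", "critical"),
        PostmortemAction("PM-2", "Open", "minor"),
    ]
    print("release denied" if deny_release(backlog) else "release allowed")
```

Only open critical actions block the release, so lower-severity follow-ups can stay in progress without freezing delivery.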


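12) Anti-patterns

"Witch hunts" and punishment → hidden failures, lost signals.
A report for the report's sake: long documents with no actions/owners/deadlines.
RCA at the level of "a bug in the code" with no system factors.
Closing the incident without a retest and a baseline update.
No internal publicity: other teams repeat the same mistakes.
Ignoring feedback from support/partners and "invisible" degradations (slow success).
Summaries like "fixed everything, moving on" - no changes to architecture or process.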

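13) Architect's checklist

1. Is there a postmortem template and a publication SLA of ≤ 72 hours?
2. Are roles (IC, Comms, Scribe, SME) assigned automatically?
3. Is the timeline based on telemetry (traces/metrics/logs) and release/flag markers?
4. Are RCA methods applied systematically (5 Whys, Ishikawa, Barrier)?
5. Do actions have owners, deadlines, and a DoD, linked to risk and release gates?
6. Do incidents lead to updates of runbooks, chaos scripts, and alerts?
7. Are VoC/support channels built in, with a regular review of "top pains"?
8. Does the error budget affect release and experiment policy?
9. Are maturity metrics tracked (postmortems published on time, reopen rate, repeat incidents)?
10. Is a searchable knowledge base of incident analyses available across teams?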

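Conclusion

Postmortems and feedback are the learning mechanism of an architecture. When blameless analysis, measurable effects of actions, and the integration of signals from production become the norm, the system gets more stable, faster, and clearer every week. Make facts visible, actions mandatory, and knowledge accessible, and incidents become fuel for the platform's evolution.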
