Incident metrics
1) Why measure incidents
Incident metrics turn chaotic events into a manageable process: they help reduce response and recovery times, cut root-cause recurrence, demonstrate SLO/contract compliance, and reveal automation opportunities. A good metric set covers the entire cycle: detection → classification → escalation → mitigation → recovery → postmortem → CAPA.
2) Basic definitions and formulas
Event intervals
MTTD (Mean Time To Detect) = mean time from T0 (actual onset of impact) to the first signal/detection.
MTTA (Mean Time To Acknowledge) = mean time from the first signal to acknowledgement by the on-call engineer.
MTTM (Mean Time To Mitigate) = mean time until impact drops below the SLO threshold (often the time to a user-facing workaround or graceful degradation).
MTTR (Mean Time To Recover) = mean time to full recovery of the target SLIs (see the SQL sketch after this list).
MTBF (Mean Time Between Failures) = average interval between relevant incidents.
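A minimal SQL sketch for these interval metrics, assuming one row per incident with the timestamp fields from the schema in section 10 (means shown; medians work the same way via PERCENTILE_CONT):
```sql
-- Mean interval metrics over a 28-day window, in minutes.
-- Assumes the incidents table from section 10 with UTC timestamps.
SELECT
  AVG(EXTRACT(EPOCH FROM (t_detected  - t0_actual))  / 60) AS mttd_min,
  AVG(EXTRACT(EPOCH FROM (t_ack       - t_detected)) / 60) AS mtta_min,
  AVG(EXTRACT(EPOCH FROM (t_mitigated - t0_actual))  / 60) AS mttm_min,
  AVG(EXTRACT(EPOCH FROM (t_recovered - t0_actual))  / 60) AS mttr_min
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```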
Operational times
Time to Declare - from T0 to the official declaration of the incident/SEV level.
Time to Comms - from declaration to the first public/internal update (per the comms SLA).
Time in State - time spent in each stage (triage/diag/fix/verify).
Frequency and rates
Incident Count - number of incidents per period.
Incident Rate - incidents per 1k/10k/100k successful transactions or requests (normalized).
SEV Mix - distribution by severity (SEV-0... SEV-3).
SLA Breach Count/Rate - number/share of violations of external SLAs.
Change Failure Rate - % of incidents caused by changes (releases/configs/migrations).
Quality of signals and processes
% Actionable Pages - the proportion of pages that led to meaningful playbook actions.
False Positive Rate (Pages) - the proportion of false positives.
Detection Coverage - the proportion of incidents detected by automation (not clients/support).
Reopen Rate - the proportion of recurring incidents with the same root cause within 90 days.
CAPA Completion - % of corrective/preventive actions closed on time.
Comms SLA Adherence - the proportion of updates published at the required cadence.
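A sketch for a few of these ratios; the pages table (incident_id, receiver, created_at, is_actionable, is_false_positive) is a hypothetical alerting export, while reopened_within_90d comes from the incident card in section 10:
```sql
-- Page quality over 28 days (hypothetical pages table: one row per page sent to on-call).
SELECT
  100.0 * COUNT(*) FILTER (WHERE is_actionable)     / NULLIF(COUNT(*), 0) AS actionable_pages_pct,
  100.0 * COUNT(*) FILTER (WHERE is_false_positive) / NULLIF(COUNT(*), 0) AS fp_pages_pct
FROM pages
WHERE created_at >= current_date - INTERVAL '28 days';

-- Reopen Rate (90 d) straight from the incident card flag.
SELECT
  100.0 * COUNT(*) FILTER (WHERE reopened_within_90d) / NULLIF(COUNT(*), 0) AS reopen_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days';
```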
3) Metrics Map by Incident Stage
Detection: MTTD, Detection Coverage, False Positive Rate (Pages).
Acknowledgement/triage: MTTA, % Actionable Pages, Page Storm Index.
Declaration and comms: Time to Declare, Time to Comms, Comms SLA Adherence.
Mitigation: MTTM, Time in State.
Recovery: MTTR, SLA Breach Count/Rate.
Post-incident: Reopen Rate, CAPA Completion, Postmortem Lead Time.
4) Normalization and segmentation
Normalize counters by volume (traffic, successful transactions, active users); a SQL sketch follows this list.
Segment by: region/tenant, provider (PSP/KYC/CDN), type of change (code/config/infra), time of day (day/night), detection source (synthetic/RUM/infra/support).
Business SLIs (payment success, registrations, top-ups) matter most to the business - tie incident metrics to their degradation.
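A sketch of a normalized, segmented rate; the daily_volume table (day, service, region, successful_tx) is an assumption about where traffic counts live:
```sql
-- Incidents per 1k successful transactions, by region, over 28 days.
WITH vol AS (
  SELECT region, SUM(successful_tx) AS tx
  FROM daily_volume
  WHERE day >= current_date - INTERVAL '28 days'
  GROUP BY region
),
inc AS (
  SELECT region, COUNT(*) AS n
  FROM incidents
  WHERE t0_actual >= current_date - INTERVAL '28 days'
  GROUP BY region
)
SELECT v.region,
       1000.0 * COALESCE(i.n, 0) / NULLIF(v.tx, 0) AS incidents_per_1k_tx
FROM vol v
LEFT JOIN inc i USING (region)
ORDER BY incidents_per_1k_tx DESC;
```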
5) Threshold goals (reference points; adapt to your domain)
MTTD: ≤ 5 min for Tier-0, ≤ 10-15 min for Tier-1.
MTTA: ≤ 5 min (24/7), ≤ 10 min (follow-the-sun).
MTTM: ≤ 15 min (Tier-0), ≤ 30-60 min (Tier-1).
MTTR: ≤ 60 min (Tier-0), ≤ 4 h (Tier-1).
Detection Coverage: ≥ 85% automation.
% Actionable Pages: ≥ 80–90%; FP Pages: ≤ 5%.
Reopen Rate (90 d): ≤ 5–10%.
CAPA Completion (on time): ≥ 85%.
6) Attribution of causes and impact of changes
Assign a primary cause (Code/Config/Infra/Provider/Security/Data/Capacity) and trigger (release ID, config change, migration, external factor) to each incident.
Track Change-linked MTTR/Count - how much releases and configs contribute (the basis for gate/canary policies).
Track Provider-caused incidents (PSP/KYC/CDN/Cloud) separately to manage routing and contracts.
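A sketch of the cause/change attribution breakdown, reusing root_cause, trigger_id and the timestamps from the section 10 schema:
```sql
-- Per primary cause (28 days): count, median MTTR, and the share linked to a change trigger.
SELECT root_cause,
       COUNT(*) AS incidents,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual)) / 60
       ) AS mttr_min_p50,
       100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_linked_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days'
  AND t_recovered IS NOT NULL
GROUP BY root_cause
ORDER BY incidents DESC;
```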
7) Communications and Customer Impact
Time to First Public Update and Update Cadence (for example, every 15/30 minutes).
Complaint Rate - tickets/complaints per incident, and its trend.
Status Accuracy - the share of public updates that did not require retraction.
Post-Incident NPS (for key customers) - a short survey after SEV-0/1 incidents.
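A sketch for Time to First Public Update, assuming public updates are normalized into their own table (incident_id, published_at) rather than kept as the array field from section 10:
```sql
-- Median minutes from declaration to the first public update, over 28 days.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (u.first_update - i.t_declared)) / 60
       ) AS ttfpu_min_p50
FROM incidents i
JOIN (
  SELECT incident_id, MIN(published_at) AS first_update
  FROM public_updates
  GROUP BY incident_id
) u USING (incident_id)
WHERE i.t0_actual >= current_date - INTERVAL '28 days';
```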
8) Alerting quality metrics around incidents
Page Storm Index - the number of pages/hour per on-call during an incident (median/p95).
Dedup Efficiency - the proportion of suppressed duplicates.
Quorum Confirmation Rate - the proportion of incidents where the quorum of probes (≥2 independent sources) was triggered.
Shadow→Canary→Prod conversion rate of new alert rules (Alert-as-Code).
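A sketch for the Page Storm Index, again against the hypothetical pages table introduced in section 2:
```sql
-- Pages per on-call receiver per hour during active incidents (28 days), p50/p95.
WITH per_hour AS (
  SELECT incident_id,
         receiver,
         date_trunc('hour', created_at) AS hr,
         COUNT(*) AS pages
  FROM pages
  WHERE incident_id IS NOT NULL
    AND created_at >= current_date - INTERVAL '28 days'
  GROUP BY incident_id, receiver, date_trunc('hour', created_at)
)
SELECT PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY pages) AS page_storm_p50,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY pages) AS page_storm_p95
FROM per_hour;
```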
9) Dashboards (minimum set)
1. Executive (28 days): number of incidents, SEV distribution, MTTR/MTTM, SLA breaches, Reopen, CAPA.
2. SRE Operations: MTTD/MTTA by hour/shift, Page Storm, Actionable %, Detection Coverage, Time to Declare/Comms.
3. Change Impact: share of release/config-driven incidents, MTTR for change-related incidents, maintenance windows vs incidents.
4. Providers: incidents by provider, degradation time, route switches, contractual SLAs.
5. Heatmap by Service/Region: Incidents and MTTR per 1k transactions.
Combine SLI/SLO charts with release annotations and SEV markers.
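A sketch of the query behind dashboard 5 (the Service/Region heatmap), combining the incidents table with the hypothetical daily_volume table from section 4:
```sql
-- Incidents per 1k successful transactions and median MTTR, per service/region, over 28 days.
WITH vol AS (
  SELECT service, region, SUM(successful_tx) AS tx
  FROM daily_volume
  WHERE day >= current_date - INTERVAL '28 days'
  GROUP BY service, region
)
SELECT i.service,
       i.region,
       1000.0 * COUNT(*) / NULLIF(MAX(v.tx), 0) AS incidents_per_1k_tx,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (i.t_recovered - i.t0_actual)) / 60
       ) AS mttr_min_p50
FROM incidents i
JOIN vol v USING (service, region)
WHERE i.t0_actual >= current_date - INTERVAL '28 days'
  AND i.t_recovered IS NOT NULL
GROUP BY i.service, i.region
ORDER BY incidents_per_1k_tx DESC;
```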
10) Incident data schema (recommended)
Minimum card/table fields:
incident_id, sev, state, service, region, tenant, provider?,
t0_actual, t_detected, t_ack, t_declared, t_mitigated, t_recovered,
source_detect (synthetic | rum | infra | support),
root_cause (code | config | infra | provider | security | data | capacity | other),
trigger_id (release_id | change_id | external_id),
slo_impact (availability | latency | success), burn_minutes,
sla_breach (bool), public_updates[], owners (IC/TL/Comms/Scribe),
postmortem_id, capa_ids[], reopened_within_90d (bool)
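A minimal DDL sketch of this card as a single Postgres table; the concrete types, defaults and comments are assumptions layered on top of the field list above:
```sql
-- Minimal incident fact table; array fields kept as native arrays for simplicity.
CREATE TABLE incidents (
    incident_id          text PRIMARY KEY,
    sev                  text NOT NULL,        -- 'SEV-0' ... 'SEV-3'
    state                text NOT NULL,        -- triage / diag / fix / verify / closed
    service              text NOT NULL,
    region               text,
    tenant               text,
    provider             text,                 -- nullable: only for provider-caused incidents
    t0_actual            timestamptz NOT NULL,
    t_detected           timestamptz,
    t_ack                timestamptz,
    t_declared           timestamptz,
    t_mitigated          timestamptz,
    t_recovered          timestamptz,
    source_detect        text,                 -- synthetic | rum | infra | support
    root_cause           text,                 -- code | config | infra | provider | security | data | capacity | other
    trigger_id           text,                 -- release_id / change_id / external_id
    slo_impact           text,                 -- availability | latency | success
    burn_minutes         numeric,
    sla_breach           boolean DEFAULT false,
    public_updates       timestamptz[],
    owners               jsonb,                -- IC / TL / Comms / Scribe
    postmortem_id        text,
    capa_ids             text[],
    reopened_within_90d  boolean DEFAULT false
);
```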
11) Calculation examples (SQL)
MTTR over the period (median):
```sql
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual)) / 60
       ) AS mttr_min
FROM incidents
WHERE t0_actual >= '2025-10-01'
  AND t_recovered IS NOT NULL
  AND sev IN ('SEV-0', 'SEV-1', 'SEV-2');
```
Detection Coverage:
```sql
SELECT 100.0 * SUM(CASE WHEN source_detect <> 'support' THEN 1 ELSE 0 END) / COUNT(*) AS detection_coverage_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```
Change Failure Rate (28 days):
```sql
-- Assumes trigger_id is populated only for change-triggered incidents (release/config/migration).
SELECT 100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_failure_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```
12) Link to SLO and error budgets
Record SLO burn minutes per incident - this is the main "weight" of the event.
Prioritize CAPA by total burn and SEV weight rather than incident count.
Tie burn to financial impact (for example, $/minute of downtime or $/lost transaction).
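A sketch of burn-based prioritization with a rough $ conversion; the 120 $/burn-minute rate is a placeholder, not a real figure:
```sql
-- Total SLO burn and estimated cost per primary cause, 90-day window.
SELECT root_cause,
       COUNT(*)                  AS incidents,
       SUM(burn_minutes)         AS burn_min_total,
       SUM(burn_minutes) * 120.0 AS est_cost_usd   -- placeholder $/burn-minute
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days'
GROUP BY root_cause
ORDER BY burn_min_total DESC;
```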
13) Program-level metrics
Postmortem Lead Time: median time from incident closure to publication of the report (see the SQL sketch after this list).
Evidence Completeness: share of reports with a timeline, SLI charts, logs, and links to PRs/comms.
Alert Hygiene Score: composite index over actionable/FP/dedup/quorum.
Handover Defects: the proportion of shift handovers where context on active incidents was lost.
Training Coverage: % of on-call engineers who completed incident drills in the quarter.
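A sketch for Postmortem Lead Time, assuming a hypothetical postmortems table (postmortem_id, published_at) and using t_recovered as a proxy for incident closure:
```sql
-- Median hours from incident recovery to postmortem publication, 90-day window.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (p.published_at - i.t_recovered)) / 3600
       ) AS postmortem_lead_time_h_p50
FROM incidents i
JOIN postmortems p ON p.postmortem_id = i.postmortem_id
WHERE i.t0_actual >= current_date - INTERVAL '90 days';
```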
14) Metrics implementation checklist
- Uniform timestamps (UTC) and an incident event contract are defined.
- SEV levels, root-cause taxonomy, and detection sources are adopted.
- Metrics are normalized to volume (traffic/success).
- Three dashboards are ready: Executive, Operations, Change Impact.
- Alert-as-Code: each page rule has a playbook and an owner.
- Postmortem SLA (e.g., draft ≤72 h, final ≤5 business days).
- CAPAs are tracked with effect KPIs and D+14/D+30 due dates.
- Weekly Incident Review: trends, top causes, CAPA status.
15) Anti-patterns
Measuring only MTTR without MTTD/MTTA/MTTM → losing control of the early phases.
Not normalizing by volume → large services "look" worse.
Inconsistent SEV assignment → incidents that cannot be compared.
Missing evidence → arguments instead of improvements.
Focusing on incident count instead of burn/SLO impact.
Ignoring Reopen and CAPA → endless recurrences.
Metrics kept in Excel without automated ingestion from telemetry/ITSM.
16) Mini templates
Incident Card (abbr.)
INC: 2025-11-01-042 (SEV-1)
T0=12:04Z, Detected=12:07, Ack=12:09, Declared=12:11,
Mitigated=12:24, Recovered=12:48
Service: payments-api (EU)
SLI: success_ratio (-3.6% vs SLO, burn=18 min)
Root cause: provider (PSP-A), Trigger: status red
Comms: first 12:12Z, cadence 15m, SLA met
Links: dashboards, logs, traces, release notes
Executive report (28 days, key lines)
Incidents: 12 (SEV-0:1, SEV-1:3, SEV-2:6, SEV-3:2)
Median MTTR: 52 min; Median MTTD: 4 min; MTTA: 3 min; MTTM: 17 min
Detection Coverage: 88%; Actionable Pages: 86%; FP Pages: 3.2%
Change Failure Rate: 33% (4/12), 3 of them config-related
Reopen (90d): 1/12 (8.3%); CAPA Completion: 82% (2 overdue)
Top Root Causes: provider(4), config(3), capacity(2)
17) Roadmap (4-6 weeks)
1. Week 1: timestamp/field standard, SEV/root-cause dictionary, a basic incident data mart.
2. Week 2: MTTD/MTTA/MTTM/MTTR calculations, normalization, and a SEV dashboard.
3. Week 3: linkage to releases/configs, Detection Coverage, and Alert Hygiene.
4. Week 4: Executive report, postmortem SLA, CAPA tracker.
5. Weeks 5-6: provider reports, a burn→$ financial model, quarterly goals, and a quarterly Incident Review.
18) The bottom line
Incident metrics are not just numbers; they are the storyline of operational reliability. When you measure the entire flow (from detection to CAPA), normalize the metrics, link them to SLOs and changes, and review them regularly, the organization predictably reduces response time, cost, and incident frequency, and users see a stable service.