Incident metrics
1) Why measure incidents
Incident metrics turn chaotic events into a manageable process: they help reduce response and recovery times, cut root-cause recurrence, demonstrate SLO/contract compliance, and reveal automation opportunities. A good metric set covers the entire cycle: detection → classification → escalation → mitigation → recovery → postmortem → CAPA.
2) Basic definitions and formulas
Event intervals
MTTD (Mean Time To Detect) = mean time from T0 (actual onset of impact) to the first signal/detection.
MTTA (Mean Time To Acknowledge) = mean time from the first signal to acknowledgement by the on-call engineer.
MTTM (Mean Time To Mitigate) = mean time until impact drops below the SLO threshold (often the time to a user-facing workaround or graceful degradation).
MTTR (Mean Time To Recover) = mean time to full recovery of the target SLIs (see the SQL sketch after this list).
MTBF (Mean Time Between Failures) = average interval between relevant incidents.
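A minimal SQL sketch for these interval metrics, assuming one row per incident with the timestamp fields from the schema in section 10 (means shown; medians work the same way via PERCENTILE_CONT):
```sql
-- Mean interval metrics over a 28-day window, in minutes.
-- Assumes the incidents table from section 10 with UTC timestamps.
SELECT
  AVG(EXTRACT(EPOCH FROM (t_detected  - t0_actual))  / 60) AS mttd_min,
  AVG(EXTRACT(EPOCH FROM (t_ack       - t_detected)) / 60) AS mtta_min,
  AVG(EXTRACT(EPOCH FROM (t_mitigated - t0_actual))  / 60) AS mttm_min,
  AVG(EXTRACT(EPOCH FROM (t_recovered - t0_actual))  / 60) AS mttr_min
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```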
Operational times
Time to Declare - from T0 to the official declaration of the incident/SEV level.
Time to Comms - from declaration to the first public/internal update (per the comms SLA).
Time in State - time spent in each stage (triage/diag/fix/verify).
Frequency and rates
Incident Count - number of incidents per period.
Incident Rate - incidents per 1k/10k/100k successful transactions or requests (normalized).
SEV Mix - distribution by severity (SEV-0... SEV-3).
SLA Breach Count/Rate - number/share of violations of external SLAs.
Change Failure Rate - % of incidents caused by changes (releases/configs/migrations).
Quality of signals and processes
% Actionable Pages - the proportion of pages that led to meaningful playbook actions.
False Positive Rate (Pages) - the proportion of false positives.
Detection Coverage - the proportion of incidents detected by automation (not clients/support).
Reopen Rate - the proportion of recurring incidents with the same root cause within 90 days.
CAPA Completion - % of corrective/preventive actions closed on time.
Comms SLA Adherence - the proportion of updates published at the required cadence.
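A sketch for a few of these ratios; the pages table (incident_id, receiver, created_at, is_actionable, is_false_positive) is a hypothetical alerting export, while reopened_within_90d comes from the incident card in section 10:
```sql
-- Page quality over 28 days (hypothetical pages table: one row per page sent to on-call).
SELECT
  100.0 * COUNT(*) FILTER (WHERE is_actionable)     / NULLIF(COUNT(*), 0) AS actionable_pages_pct,
  100.0 * COUNT(*) FILTER (WHERE is_false_positive) / NULLIF(COUNT(*), 0) AS fp_pages_pct
FROM pages
WHERE created_at >= current_date - INTERVAL '28 days';

-- Reopen Rate (90 d) straight from the incident card flag.
SELECT
  100.0 * COUNT(*) FILTER (WHERE reopened_within_90d) / NULLIF(COUNT(*), 0) AS reopen_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days';
```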
3) Metrics Map by Incident Stage
Detection: MTTD, Detection Coverage, False Positive Rate (Pages).
Acknowledgement/triage: MTTA, % Actionable Pages, Page Storm Index.
Declaration and comms: Time to Declare, Time to Comms, Comms SLA Adherence.
Mitigation: MTTM, Time in State.
Recovery: MTTR, SLA Breach Count/Rate.
Post-incident: Reopen Rate, CAPA Completion, Postmortem Lead Time.
4) Normalization and segmentation
Normalize counters by volume (traffic, successful transactions, active users); a SQL sketch follows this list.
Segment by: region/tenant, provider (PSP/KYC/CDN), type of change (code/config/infra), time of day (day/night), detection source (synthetic/RUM/infra/support).
Business SLIs (payment success, registrations, top-ups) matter most to the business - tie incident metrics to their degradation.
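A sketch of a normalized, segmented rate; the daily_volume table (day, service, region, successful_tx) is an assumption about where traffic counts live:
```sql
-- Incidents per 1k successful transactions, by region, over 28 days.
WITH vol AS (
  SELECT region, SUM(successful_tx) AS tx
  FROM daily_volume
  WHERE day >= current_date - INTERVAL '28 days'
  GROUP BY region
),
inc AS (
  SELECT region, COUNT(*) AS n
  FROM incidents
  WHERE t0_actual >= current_date - INTERVAL '28 days'
  GROUP BY region
)
SELECT v.region,
       1000.0 * COALESCE(i.n, 0) / NULLIF(v.tx, 0) AS incidents_per_1k_tx
FROM vol v
LEFT JOIN inc i USING (region)
ORDER BY incidents_per_1k_tx DESC;
```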
5) Threshold goals (reference points; adapt to your domain)
MTTD: ≤ 5 min for Tier-0, ≤ 10-15 min for Tier-1.
MTTA: ≤ 5 min (24/7), ≤ 10 min (follow-the-sun).
MTTM: ≤ 15 min (Tier-0), ≤ 30-60 min (Tier-1).
MTTR: ≤ 60 min (Tier-0), ≤ 4 h (Tier-1).
Detection Coverage: ≥ 85% automation.
% Actionable Pages: ≥ 80–90%; FP Pages: ≤ 5%.
Reopen Rate (90 d): ≤ 5–10%.
CAPA Completion (on time): ≥ 85%.
6) Attribution of causes and impact of changes
Assign a primary cause (Code/Config/Infra/Provider/Security/Data/Capacity) and trigger (release ID, config change, migration, external factor) to each incident.
Track Change-linked MTTR/Count - how much releases and configs contribute (the basis for gate/canary policies).
Track Provider-caused incidents (PSP/KYC/CDN/Cloud) separately to manage routing and contracts.
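A sketch of the cause/change attribution breakdown, reusing root_cause, trigger_id and the timestamps from the section 10 schema:
```sql
-- Per primary cause (28 days): count, median MTTR, and the share linked to a change trigger.
SELECT root_cause,
       COUNT(*) AS incidents,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual)) / 60
       ) AS mttr_min_p50,
       100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_linked_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days'
  AND t_recovered IS NOT NULL
GROUP BY root_cause
ORDER BY incidents DESC;
```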
7) Communications and Customer Impact
Time to First Public Update and Update Cadence (for example, every 15/30 minutes).
Complaint Rate - tickets/complaints per incident, and its trend.
Status Accuracy - the share of public updates that did not require retraction.
Post-Incident NPS (for key customers) - a short survey after SEV-0/1 incidents.
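A sketch for Time to First Public Update, assuming public updates are normalized into their own table (incident_id, published_at) rather than kept as the array field from section 10:
```sql
-- Median minutes from declaration to the first public update, over 28 days.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (u.first_update - i.t_declared)) / 60
       ) AS ttfpu_min_p50
FROM incidents i
JOIN (
  SELECT incident_id, MIN(published_at) AS first_update
  FROM public_updates
  GROUP BY incident_id
) u USING (incident_id)
WHERE i.t0_actual >= current_date - INTERVAL '28 days';
```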
8) Alerting quality metrics around incidents
Page Storm Index - the number of pages/hour per on-call during an incident (median/p95).
Dedup Efficiency - the proportion of suppressed duplicates.
Quorum Confirmation Rate - the proportion of incidents where the quorum of probes (≥2 independent sources) was triggered.
Shadow→Canary→Prod conversion rate of new alert rules (Alert-as-Code).
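A sketch for the Page Storm Index, again against the hypothetical pages table introduced in section 2:
```sql
-- Pages per on-call receiver per hour during active incidents (28 days), p50/p95.
WITH per_hour AS (
  SELECT incident_id,
         receiver,
         date_trunc('hour', created_at) AS hr,
         COUNT(*) AS pages
  FROM pages
  WHERE incident_id IS NOT NULL
    AND created_at >= current_date - INTERVAL '28 days'
  GROUP BY incident_id, receiver, date_trunc('hour', created_at)
)
SELECT PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY pages) AS page_storm_p50,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY pages) AS page_storm_p95
FROM per_hour;
```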
9) Dashboards (minimum set)
1. Executive (28 days): number of incidents, SEV distribution, MTTR/MTTM, SLA breaches, Reopen, CAPA.
2. SRE Operations: MTTD/MTTA by hour/shift, Page Storm, Actionable %, Detection Coverage, Time to Declare/Comms.
3. Change Impact: share of release/config-driven incidents, MTTR for change-related incidents, maintenance windows vs incidents.
4. Providers: incidents by provider, degradation time, route switches, contractual SLAs.
5. Heatmap by Service/Region: Incidents and MTTR per 1k transactions.
Combine SLI/SLO charts with release annotations and SEV markers.
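A sketch of the query behind dashboard 5 (the Service/Region heatmap), combining the incidents table with the hypothetical daily_volume table from section 4:
```sql
-- Incidents per 1k successful transactions and median MTTR, per service/region, over 28 days.
WITH vol AS (
  SELECT service, region, SUM(successful_tx) AS tx
  FROM daily_volume
  WHERE day >= current_date - INTERVAL '28 days'
  GROUP BY service, region
)
SELECT i.service,
       i.region,
       1000.0 * COUNT(*) / NULLIF(MAX(v.tx), 0) AS incidents_per_1k_tx,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (i.t_recovered - i.t0_actual)) / 60
       ) AS mttr_min_p50
FROM incidents i
JOIN vol v USING (service, region)
WHERE i.t0_actual >= current_date - INTERVAL '28 days'
  AND i.t_recovered IS NOT NULL
GROUP BY i.service, i.region
ORDER BY incidents_per_1k_tx DESC;
```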
10) Incident data schema (recommended)
Minimum card/table fields:
incident_id, sev, state, service, region, tenant, provider?,
t0_actual, t_detected, t_ack, t_declared, t_mitigated, t_recovered,
source_detect (synthetic | rum | infra | support),
root_cause (code | config | infra | provider | security | data | capacity | other),
trigger_id (release_id | change_id | external_id),
slo_impact (availability | latency | success), burn_minutes,
sla_breach (bool), public_updates[], owners (IC/TL/Comms/Scribe),
postmortem_id, capa_ids[], reopened_within_90d (bool)
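A minimal DDL sketch of this card as a single Postgres table; the concrete types, defaults and comments are assumptions layered on top of the field list above:
```sql
-- Minimal incident fact table; array fields kept as native arrays for simplicity.
CREATE TABLE incidents (
    incident_id          text PRIMARY KEY,
    sev                  text NOT NULL,        -- 'SEV-0' ... 'SEV-3'
    state                text NOT NULL,        -- triage / diag / fix / verify / closed
    service              text NOT NULL,
    region               text,
    tenant               text,
    provider             text,                 -- nullable: only for provider-caused incidents
    t0_actual            timestamptz NOT NULL,
    t_detected           timestamptz,
    t_ack                timestamptz,
    t_declared           timestamptz,
    t_mitigated          timestamptz,
    t_recovered          timestamptz,
    source_detect        text,                 -- synthetic | rum | infra | support
    root_cause           text,                 -- code | config | infra | provider | security | data | capacity | other
    trigger_id           text,                 -- release_id / change_id / external_id
    slo_impact           text,                 -- availability | latency | success
    burn_minutes         numeric,
    sla_breach           boolean DEFAULT false,
    public_updates       timestamptz[],
    owners               jsonb,                -- IC / TL / Comms / Scribe
    postmortem_id        text,
    capa_ids             text[],
    reopened_within_90d  boolean DEFAULT false
);
```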
11) Calculation examples (SQL)
MTTR over the period (median):
```sql
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual)) / 60
       ) AS mttr_min
FROM incidents
WHERE t0_actual >= '2025-10-01'
  AND t_recovered IS NOT NULL
  AND sev IN ('SEV-0', 'SEV-1', 'SEV-2');
```
Detection Coverage:
```sql
SELECT 100.0 * SUM(CASE WHEN source_detect <> 'support' THEN 1 ELSE 0 END) / COUNT(*) AS detection_coverage_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```
Change Failure Rate (28 days):
```sql
-- Assumes trigger_id is populated only for change-triggered incidents (release/config/migration).
SELECT 100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_failure_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```
12) Link to SLO and error budgets
Record SLO burn minutes per incident - this is the main "weight" of the event.
Prioritize CAPA by total burn and SEV weight rather than incident count.
Tie burn to financial impact (for example, $/minute of downtime or $/lost transaction).
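A sketch of burn-based prioritization with a rough $ conversion; the 120 $/burn-minute rate is a placeholder, not a real figure:
```sql
-- Total SLO burn and estimated cost per primary cause, 90-day window.
SELECT root_cause,
       COUNT(*)                  AS incidents,
       SUM(burn_minutes)         AS burn_min_total,
       SUM(burn_minutes) * 120.0 AS est_cost_usd   -- placeholder $/burn-minute
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days'
GROUP BY root_cause
ORDER BY burn_min_total DESC;
```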
13) Program-level metrics
Postmortem Lead Time: median time from incident closure to publication of the report (see the SQL sketch after this list).
Evidence Completeness: share of reports with a timeline, SLI charts, logs, and links to PRs/comms.
Alert Hygiene Score: composite index over actionable/FP/dedup/quorum.
Handover Defects: the proportion of shift handovers where context on active incidents was lost.
Training Coverage: % of on-call engineers who completed incident drills in the quarter.
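A sketch for Postmortem Lead Time, assuming a hypothetical postmortems table (postmortem_id, published_at) and using t_recovered as a proxy for incident closure:
```sql
-- Median hours from incident recovery to postmortem publication, 90-day window.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (p.published_at - i.t_recovered)) / 3600
       ) AS postmortem_lead_time_h_p50
FROM incidents i
JOIN postmortems p ON p.postmortem_id = i.postmortem_id
WHERE i.t0_actual >= current_date - INTERVAL '90 days';
```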
14) Metrics implementation checklist
- Uniform timestamps (UTC) and an incident event contract are defined.
- SEV levels, root-cause taxonomy, and detection sources are adopted.
- Metrics are normalized to volume (traffic/success).
- Three dashboards are ready: Executive, Operations, Change Impact.
- Alert-as-Code: each page rule has a playbook and an owner.
- Postmortem SLA (e.g., draft ≤72 h, final ≤5 business days).
- CAPAs are tracked with effect KPIs and D+14/D+30 due dates.
- Weekly Incident Review: trends, top causes, CAPA status.
15) Anti-patterns
Measuring only MTTR without MTTD/MTTA/MTTM → losing control of the early phases.
Not normalizing by volume → large services "look" worse.
Inconsistent SEV assignment → incidents that cannot be compared.
Missing evidence → arguments instead of improvements.
Focusing on incident count instead of burn/SLO impact.
Ignoring Reopen and CAPA → endless recurrences.
Metrics kept in Excel without automated ingestion from telemetry/ITSM.
16) Mini templates
Incident Card (abbr.)
INC: 2025-11-01-042 (SEV-1)
T0=12:04Z, Detected=12:07, Ack=12:09, Declared=12:11,
Mitigated=12:24, Recovered=12:48
Service: payments-api (EU)
SLI: success_ratio (-3.6% vs SLO, burn=18 min)
Root cause: provider (PSP-A), Trigger: status red
Comms: first 12:12Z, cadence 15m, SLA met
Links: dashboards, logs, traces, release notes
Executive report (28 days, key lines)
Incidents: 12 (SEV-0:1, SEV-1:3, SEV-2:6, SEV-3:2)
Median MTTR: 52 min; Median MTTD: 4 min; MTTA: 3 min; MTTM: 17 min
Detection Coverage: 88%; Actionable Pages: 86%; FP Pages: 3.2%
Change Failure Rate: 33% (4/12), 3 of them config-related
Reopen (90d): 1/12 (8.3%); CAPA Completion: 82% (2 overdue)
Top Root Causes: provider(4), config(3), capacity(2)
17) Roadmap (4-6 weeks)
1. Week 1: timestamp/field standard, SEV/root-cause dictionary, a basic incident data mart.
2. Week 2: MTTD/MTTA/MTTM/MTTR calculations, normalization, and a SEV dashboard.
3. Week 3: linkage to releases/configs, Detection Coverage, and Alert Hygiene.
4. Week 4: Executive report, postmortem SLA, CAPA tracker.
5. Weeks 5-6: provider reports, a burn→$ financial model, quarterly goals, and a quarterly Incident Review.
18) The bottom line
Incident metrics are not just numbers; they are the storyline of operational reliability. When you measure the entire flow (from detection to CAPA), normalize the metrics, link them to SLOs and changes, and review them regularly, the organization predictably reduces response time, cost, and incident frequency, and users see a stable service.