Incident metrics

1) Why measure incidents

Incident metrics turn chaotic events into a manageable process: they help reduce response and recovery times, cut root-cause recurrence, prove SLO/contract fulfillment, and reveal automation opportunities. A good set of metrics covers the entire cycle: detection → classification → escalation → mitigation → recovery → postmortem → CAPA.


2) Basic definitions and formulas

Event intervals

MTTD (Mean Time To Detect) = mean time from T0 (actual onset of impact) to the first signal/detection.
MTTA (Mean Time To Acknowledge) = mean time from the first signal to on-call acknowledgment (ack).
MTTM (Mean Time To Mitigate) = mean time until impact is reduced below the SLO threshold (often = time to a UX workaround/degraded mode).
MTTR (Mean Time To Recover) = mean time to full recovery of the target SLIs (see the sketch after this list).
MTBF (Mean Time Between Failures) = mean interval between relevant incidents.
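
These intervals fall straight out of the incident timestamps. A minimal sketch in the spirit of the SQL examples in section 11, assuming the incidents table from section 10; medians are used (as in section 11) because incident durations are heavy-tailed, and anchoring MTTM to T0 rather than to detection is a per-team choice:

-- Median interval metrics (minutes) over a 28-day window
SELECT
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_detected  - t0_actual))/60)  AS mttd_min,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_ack       - t_detected))/60) AS mtta_min,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_mitigated - t0_actual))/60)  AS mttm_min,  -- anchored to T0 here (assumption)
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60)  AS mttr_min
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';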

Operating times

Time to Declare - from T0 to the official SEV/incident declaration.
Time to Comms - from declaration to the first public/internal update per the comms SLA.
Time in State - time spent in each stage (triage/diagnose/fix/verify).

Frequency and share metrics

Incident Count - number of incidents per period.
Incident Rate - incidents per 1k/10k/100k successful transactions or requests (normalization; see the sketch after this list).
SEV Mix - distribution by severity (SEV-0...SEV-3).
SLA Breach Count/Rate - number/share of external SLA violations.
Change Failure Rate - % of incidents caused by changes (releases/configs/migrations).
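
Normalization needs a volume source. A minimal sketch, assuming a hypothetical daily_traffic(day, success_count) table alongside the incidents table from section 10:

-- Incidents per 10k successful transactions over 28 days
-- daily_traffic(day, success_count) is an assumed volume table, not part of the incident schema
SELECT 10000.0
       * (SELECT COUNT(*) FROM incidents
          WHERE t0_actual >= current_date - INTERVAL '28 days')
       / NULLIF((SELECT SUM(success_count) FROM daily_traffic
                 WHERE day >= current_date - INTERVAL '28 days'), 0) AS incidents_per_10k;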

Quality of signals and processes

% Actionable Pages - share of pages that led to meaningful playbook actions (see the sketch after this list).
False Positive Rate (Pages) - share of false-positive pages.
Detection Coverage - share of incidents detected by automation (not by clients/support).
Reopen Rate - share of repeat incidents with the same root cause within 90 days.
CAPA Completion - % of corrective/preventive actions closed on time.
Comms SLA Adherence - share of updates published at the required cadence.
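
The page-quality metrics above require labeling each page after the fact. A minimal sketch, assuming a hypothetical pages(page_id, fired_at, disposition) table where disposition is assigned during review:

-- Share of actionable and false-positive pages over 28 days
-- 'actionable' / 'false_positive' / 'duplicate' labels are an assumed review convention
SELECT
  100.0 * COUNT(*) FILTER (WHERE disposition = 'actionable')     / NULLIF(COUNT(*), 0) AS actionable_pct,
  100.0 * COUNT(*) FILTER (WHERE disposition = 'false_positive') / NULLIF(COUNT(*), 0) AS fp_pct
FROM pages
WHERE fired_at >= current_date - INTERVAL '28 days';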


3) Metrics Map by Incident Stage

Stage | Key metrics | Question
Detection | MTTD, Detection Coverage, Source Mix (monitoring vs users) | How quickly, and by whom, is the problem identified?
Reaction | MTTA, Time to Declare, Page-to-Ack %, Escalation Latency | How quickly does the team mobilize and assign a SEV?
Mitigation | MTTM, Workaround Success %, Change Freeze Latency | How quickly is the impact reduced to a safe level?
Recovery | MTTR, SLO Burn Stopped Time, Residual Risk Window | When did the service fully return to normal?
Comms | Time to Comms, Comms SLA Adherence, Sentiment/Complaints | How well and on time are we communicating?
Learning | Postmortem Lead Time, CAPA Completion/Overdue, Reopen Rate | Are we learning and closing the improvement loop?

4) Normalization and segmentation

Normalize counters by volume (traffic, successful operations, active users).
Segment by: region/tenant, provider (PSP/KYC/CDN), change type (code/config/infra), time of day (day/night), detection source (synthetic/RUM/infra/support); see the sketch below.
Business SLIs (payment success, registrations, deposits) matter most to the business - tie incident metrics to their degradation.
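
Segmentation is a GROUP BY over the same incident table. A minimal sketch for one slice (region × root cause):

-- Median MTTR and incident count per region and root cause, 90-day window
SELECT region, root_cause,
       COUNT(*) AS incident_count,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60) AS mttr_min
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days'
GROUP BY region, root_cause
ORDER BY incident_count DESC;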


5) Target thresholds (reference points; adapt to your domain)

MTTD: ≤ 5 min for Tier-0, ≤ 10-15 min for Tier-1.
MTTA: ≤ 5 min (24/7), ≤ 10 min (follow-the-sun).
MTTM: ≤ 15 min (Tier-0), ≤ 30-60 min (Tier-1).
MTTR: ≤ 60 min (Tier-0), ≤ 4 h (Tier-1).
Detection Coverage: ≥ 85% automation.
% Actionable Pages: ≥ 80–90%; FP Pages: ≤ 5%.
Reopen Rate (90d): ≤ 5–10%.
CAPA Completion (on time): ≥ 85%.
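
These targets can be checked mechanically each period. A minimal sketch for two of them, with the Tier-0 thresholds hard-coded for illustration:

-- Current 28-day medians vs the Tier-0 targets above (5 min MTTD, 60 min MTTR)
WITH m AS (
  SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_detected  - t0_actual))/60) AS mttd_min,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60) AS mttr_min
  FROM incidents
  WHERE t0_actual >= current_date - INTERVAL '28 days'
)
SELECT mttd_min, mttd_min <= 5  AS mttd_on_target,
       mttr_min, mttr_min <= 60 AS mttr_on_target
FROM m;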


6) Attribution of causes and impact of changes

Assign each incident a primary cause (Code/Config/Infra/Provider/Security/Data/Capacity) and a trigger (release ID, config change, migration, external factor).
Track Change-linked MTTR/Count - how much releases and configs contribute (the basis for gate/canary policies).
Count Provider-caused incidents (PSP/KYC/CDN/Cloud) separately to manage routing and contracts (see the sketch below).
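
With root_cause and provider recorded per incident (section 10), the provider cut is one query. A minimal sketch:

-- Provider-caused incidents and median recovery time, per provider, 90-day window
SELECT provider,
       COUNT(*) AS incident_count,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60) AS mttr_min
FROM incidents
WHERE root_cause = 'provider'
  AND t0_actual >= current_date - INTERVAL '90 days'
GROUP BY provider
ORDER BY incident_count DESC;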


7) Communications and Customer Impact

Time to First Public Update and Update Cadence (e.g., every 15/30 minutes).
Complaint Rate - tickets/complaints per incident, with the trend.
Status Accuracy - share of public updates issued without retractions.
Post-Incident NPS (for key customers) - a brief survey after SEV-0/1.


8) Alerting quality metrics around incidents

Page Storm Index - pages per hour per on-call during an incident (median/p95); see the sketch after this list.
Dedup Efficiency - share of suppressed duplicate pages.
Quorum Confirmation Rate - share of incidents confirmed by a quorum of probes (≥2 independent sources).
Shadow→Canary→Prod conversion of new rules (Alert-as-Code).
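
Page Storm Index needs per-page rather than per-incident data. A minimal sketch, again assuming a hypothetical pages table, here with oncall_id and incident_id columns:

-- Pages per on-call per hour during incidents: median and p95
WITH hourly AS (
  SELECT oncall_id,
         date_trunc('hour', fired_at) AS hr,
         COUNT(*) AS pages_per_hour
  FROM pages
  WHERE incident_id IS NOT NULL          -- only pages tied to an incident
  GROUP BY oncall_id, hr
)
SELECT PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY pages_per_hour) AS storm_p50,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY pages_per_hour) AS storm_p95
FROM hourly;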


9) Dashboards (minimum set)

1. Executive (28 days): incident count, SEV distribution, MTTR/MTTM, SLA breaches, Reopen, CAPA.
2. SRE Operations: MTTD/MTTA by hour/shift, Page Storm, Actionable %, Detection Coverage, Time to Declare/Comms.
3. Change Impact: share of release/config incidents, MTTR for change-linked incidents, maintenance windows vs incidents.
4. Providers: incidents by provider, degradation time, route switches, contractual SLAs.
5. Heatmap by Service/Region: incidents and MTTR per 1k transactions.

Combine SLI/SLO charts with release annotations and SEV marks.


10) Incident data schema (recommended)

Minimum card/table fields:

incident_id, sev, state, service, region, tenant, provider?,
t0_actual, t_detected, t_ack, t_declared, t_mitigated, t_recovered,
source_detect (synthetic | rum | infra | support),
root_cause (code | config | infra | provider | security | data | capacity | other),
trigger_id (release_id | change_id | external_id),
slo_impact (availability | latency | success), burn_minutes,
sla_breach (bool), public_updates[], owners (IC/TL/Comms/Scribe),
postmortem_id, capa_ids[], reopened_within_90d (bool)
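
A minimal DDL sketch of this card as a table; column types and defaults are assumptions to adapt to your store (the queries in section 11 run against it as-is):

-- Minimal incidents table; types are illustrative
CREATE TABLE incidents (
  incident_id   text PRIMARY KEY,
  sev           text NOT NULL,              -- 'SEV-0' .. 'SEV-3'
  state         text NOT NULL,
  service       text NOT NULL,
  region        text,
  tenant        text,
  provider      text,                       -- nullable: set only for provider-linked incidents
  t0_actual     timestamptz NOT NULL,       -- all timestamps in UTC
  t_detected    timestamptz,
  t_ack         timestamptz,
  t_declared    timestamptz,
  t_mitigated   timestamptz,
  t_recovered   timestamptz,
  source_detect text,                       -- synthetic | rum | infra | support
  root_cause    text,                       -- code | config | infra | provider | security | data | capacity | other
  trigger_id    text,                       -- release_id | change_id | external_id
  slo_impact    text,                       -- availability | latency | success
  burn_minutes  numeric,
  sla_breach    boolean DEFAULT false,
  public_updates timestamptz[],             -- timestamps of published updates
  owners        jsonb,                      -- IC/TL/Comms/Scribe
  postmortem_id text,
  capa_ids      text[],
  reopened_within_90d boolean DEFAULT false
);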

11) Calculation examples (SQL idea)

MTTR for the period (median):

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60) AS mttr_min
FROM incidents
WHERE t0_actual >= '2025-10-01' AND t_recovered IS NOT NULL AND sev IN ('SEV-0','SEV-1','SEV-2');
Detection Coverage:

SELECT 100.0 * SUM(CASE WHEN source_detect <> 'support' THEN 1 ELSE 0 END) / COUNT(*) AS detection_coverage_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
Change Failure Rate (28 days):

SELECT 100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_failure_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
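Reopen Rate (section 2) follows the same pattern, using the reopened_within_90d flag:

SELECT 100.0 * COUNT(*) FILTER (WHERE reopened_within_90d)
       / NULLIF(COUNT(*), 0) AS reopen_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days';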

12) Link to SLO and error budgets

Record SLO burn minutes per incident - this is the main "weight" of the event.
Prioritize CAPA by total burn and SEV weight rather than by incident count.
Tie burn to financial impact (e.g., $/minute of downtime or $/lost transaction); see the sketch below.
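
A minimal sketch of the burn→$ roll-up; the $120 per burn-minute constant is a placeholder to calibrate against your own downtime cost:

-- Burn-weighted cause ranking with an assumed $/burn-minute conversion
SELECT root_cause,
       SUM(burn_minutes)         AS total_burn_min,
       SUM(burn_minutes) * 120.0 AS est_cost_usd   -- $120/min is illustrative only
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '90 days'
GROUP BY root_cause
ORDER BY total_burn_min DESC;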


13) Program-level metrics

Postmortem Lead Time: median time from incident closure to report publication.
Evidence Completeness: share of reports with a timeline, SLI charts, logs, and links to PRs/comms.
Alert Hygiene Score: composite index over actionable/FP/dedup/quorum.
Handover Defects: share of shifts where the context of active incidents is lost.
Training Coverage: % of on-call engineers who completed a simulation during the quarter.


14) Metrics implementation checklist

  • Uniform timestamps (UTC) and an incident event contract are defined.
  • SEV levels, root-cause taxonomy, and detection sources are adopted.
  • Metrics are normalized by volume (traffic/success).
  • Three dashboards are ready: Executive, Operations, Change Impact.
  • Alert-as-Code: each page rule has a playbook and an owner.
  • Postmortem SLA (e.g., draft ≤72 h, final ≤5 business days).
  • CAPAs are tracked with effect KPIs and D+14/D+30 due dates.
  • Weekly Incident Review: trends, top causes, CAPA status.

15) Anti-patterns

Tracking only MTTR without MTTD/MTTA/MTTM → the early phases become unmanageable.
Not normalizing by volume → large services "look" worse.
Inconsistent SEV assignment → incomparable incidents.
Lack of evidence → arguments instead of improvements.
Focusing on incident count instead of burn/SLO impact.
Ignoring Reopen and CAPA → endless recurrences.
Metrics kept in Excel without automated feeds from telemetry/ITSM.


16) Mini templates

Incident Card (abbr.)


INC: 2025-11-01-042 (SEV-1)
T0=12:04Z, Detected=12:07, Ack=12:09, Declared=12:11,
Mitigated=12:24, Recovered=12:48
Service: payments-api (EU)
SLI: success_ratio (-3.6% vs SLO, burn=18 min)
Root cause: provider (PSP-A), Trigger: status red
Comms: first 12:12Z, cadence 15m, SLA met
Links: dashboards, logs, traces, release notes

Executive report (28 days, key lines)


Incidents: 12 (SEV-0:1, SEV-1:3, SEV-2:6, SEV-3:2)
Median MTTR: 52 min; Median MTTD: 4 min; MTTA: 3 min; MTTM: 17 min
Detection Coverage: 88%; Actionable Pages: 86%; FP Pages: 3.2%
Change Failure Rate: 33% (4/12), 3 of them config-related
Reopen (90d): 1/12 (8.3%); CAPA Completion: 82% (2 overdue)
Top Root Causes: provider(4), config(3), capacity(2)
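
Most of the lines above come from one pass over the incident table. A minimal sketch for the SEV mix and per-SEV median MTTR:

-- SEV mix and median MTTR per severity, 28-day window
SELECT sev,
       COUNT(*) AS incidents,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual))/60) AS mttr_min
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days'
GROUP BY sev
ORDER BY sev;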

17) Roadmap (4-6 weeks)

1. Week 1: timestamp/field standard, SEV/root-cause dictionary, basic incident data mart.
2. Week 2: MTTD/MTTA/MTTM/MTTR calculations, normalization, SEV dashboard.
3. Week 3: linkage with releases/configs, Detection Coverage, Alert Hygiene.
4. Week 4: Executive report, postmortem SLA, CAPA tracker.
5. Weeks 5-6: provider reports, burn→$ financial model, quarterly goals and quarterly Incident Review.


18) The bottom line

Incident metrics are not just numbers but a storyboard of operational reliability. When you measure the entire flow (from detection to CAPA), normalize the metrics, tie them to SLOs and changes, and review them regularly, the organization predictably reduces response time, cost, and incident frequency - and users see a stable service.
