Escalation Matrix
1) Matrix purpose
The escalation matrix is a uniform set of rules for who gets involved and when, so that incidents move quickly from chaos to a managed process. It defines:
- SEV levels and their criteria;
- timings (detection → ACK → escalation → updates);
- roles and channels for each step;
- exceptions (no quiet hours for security and compliance);
- links to playbooks and the status page.
2) Classification by severity (SEV)
Specify target numbers for your domain and SLO.
3) Basic who/when/where matrix
4) Escalation decision tree (essentials)
1. Any confirmed impact on SLO?
→ Yes: assign an IC, declare a SEV, open a war-room.
→ No: ticket/observation, no page.
2. ACK received on time?
→ Yes: continue along the playbook.
→ No: escalate P2 → IC → Duty Manager (time-based ladder).
3. Security/leak/PII?
→ Always engage Security IR + Legal; public communications must be coordinated.
4. External provider?
→ Escalate to the Vendor Owner, switch routes, record it on the status page.
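The tree above can be sketched as a small function. This is a minimal illustration, not a reference implementation: the `Signal` fields and action strings are hypothetical names, and it assumes the security and provider branches are evaluated independently of the SLO branch (the "always" in step 3).

```python
from dataclasses import dataclass

@dataclass
class Signal:
    slo_impact: bool         # confirmed impact on an SLO
    ack_on_time: bool        # ACK arrived within the ACK target
    security_related: bool   # security incident, leak, or PII exposure
    external_provider: bool  # an external provider is involved

def decide(signal: Signal) -> list[str]:
    """Return the ordered actions for one incident signal."""
    actions = []
    if signal.slo_impact:
        actions += ["assign IC", "declare SEV", "open war-room"]
        if signal.ack_on_time:
            actions.append("continue along playbook")
        else:
            actions.append("escalate P2 -> IC -> Duty Manager")
    else:
        actions.append("open ticket / observe (no page)")
    # Assumption: these two branches apply regardless of SLO impact.
    if signal.security_related:
        actions.append("engage Security IR + Legal; coordinate public comms")
    if signal.external_provider:
        actions.append("escalate to Vendor Owner; switch routes; record on status page")
    return actions
```

For example, a signal with SLO impact and no timely ACK yields the IC/SEV/war-room actions followed by the ladder escalation.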
5) Escalation Roles and Responsibilities (short)
P1 (Primary): triage, playbook start, liaison with the IC.
P2 (Secondary): backup, complex actions, context retention.
IC (Incident Commander): declares the SEV, decides on freeze/rollback, keeps the pace.
Duty Manager: removes blockers, redistributes resources, makes organizational decisions.
Comms: status page, updates per SLA.
Security IR: isolation, forensics, legal notifications.
Vendor Owner: external providers, switchover/fallback.
6) Timing guidelines (benchmarks)
SEV-1/0: ACK ≤ 5 min, Declare ≤ 10 min, First Comms ≤ 15 min, Updates every 15–30 min.
Escalation ladder: P1 → P2 (5 min) → IC (10 min) → Duty Manager (15 min) → Exec on-call (30 min).
Security: no delays and no "quiet hours"; updates every 15 min.
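The ladder timings can be encoded as data and queried by elapsed time. A minimal sketch, assuming the clock starts when the page is sent; the function names are illustrative only.

```python
from typing import Optional

# (minutes without ACK, role to engage), taken from the ladder above
LADDER = [(5, "P2"), (10, "IC"), (15, "Duty Manager"), (30, "Exec on-call")]

def engaged_roles(minutes_without_ack: float) -> list[str]:
    """Roles that should already have been paged after the given delay."""
    return [role for after_min, role in LADDER if minutes_without_ack >= after_min]

def next_escalation(minutes_without_ack: float) -> Optional[tuple[int, str]]:
    """Next ladder step still ahead, or None once the ladder is exhausted."""
    for after_min, role in LADDER:
        if minutes_without_ack < after_min:
            return after_min, role
    return None
```

Keeping the ladder as a plain list of pairs makes it easy to load the same values from a policy file instead of hard-coding them.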
7) Routing and segmentation
By service/region/tenant: routing key = 'service + region + tenant'.
Probe quorum: escalate only when ≥2 independent sources confirm the problem (for example, synthetic checks from 2 regions plus RUM or a business SLI).
Dedup: one master alert instead of dozens of symptoms (e.g. a "red" DB alert suppresses 5xx noise).
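The three routing rules can be illustrated together. This is a sketch under assumptions: the `:` separator in the routing key, the alert names, and the shape of the suppression map are all hypothetical.

```python
def routing_key(service: str, region: str, tenant: str) -> str:
    """Build the routing key from service + region + tenant."""
    return f"{service}:{region}:{tenant}"

def quorum_met(confirmations: dict[str, bool], required: int = 2) -> bool:
    """True when at least `required` independent sources confirm the problem."""
    return sum(confirmations.values()) >= required

def dedup(active: list[str], suppresses: dict[str, set[str]]) -> list[str]:
    """Keep master alerts; drop duplicates and symptoms they suppress."""
    suppressed: set[str] = set()
    for name in active:
        suppressed |= suppresses.get(name, set())
    kept: list[str] = []
    for name in active:
        if name not in suppressed and name not in kept:
            kept.append(name)
    return kept
```

With `suppresses = {"db_down": {"http_5xx"}}`, an active list of `["http_5xx", "db_down", "http_5xx"]` collapses to the single master alert `db_down`.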
8) Exceptions and special modes
Security/Legal: Security IR and Legal are escalated out of turn; public statements go out only after coordination.
Providers: a separate OLA/SLA matrix (contacts, time zones, priority).
Change Freeze: on SEV-1/0, releases and config changes are frozen automatically.
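Two of these exceptions reduce to simple predicates. A hedged sketch: the 22:00-07:00 quiet window and the category names are assumptions made for illustration; the source only states that security and compliance have no quiet hours and that SEV-1/0 triggers a freeze.

```python
URGENT_CATEGORIES = {"security", "compliance"}  # exempt from quiet hours
FREEZE_SEVS = {"SEV-0", "SEV-1"}                # severities that trigger a freeze

def in_quiet_hours(hour: int, start: int = 22, end: int = 7) -> bool:
    """True if `hour` (0-23) is inside the quiet window, which may wrap midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end

def may_notify(hour: int, category: str) -> bool:
    """Security and compliance always notify; others respect quiet hours."""
    if category in URGENT_CATEGORIES:
        return True
    return not in_quiet_hours(hour)

def freeze_required(active_sevs: list[str]) -> bool:
    """Releases and config changes freeze while any SEV-0/SEV-1 is active."""
    return any(sev in FREEZE_SEVS for sev in active_sevs)
```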
9) Matrix maturity metrics
Ack p95 (SEV-1/0) ≤ 5 min.
Time to Declare (median) ≤ 10 min.
Comms SLA Adherence ≥ 95%.
Escalation Success (resolved at P1/P2 level) ≥ 70%.
No-ACK escalations: decreasing quarter over quarter.
Vendor Response Time for critical providers: within contract terms.
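The first few metrics are straightforward to compute from incident records. A minimal sketch, assuming per-incident timings in minutes and a recorded resolution level; the nearest-rank percentile method is one common choice, not something the matrix prescribes.

```python
from statistics import median

def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling
    return ordered[max(rank, 1) - 1]

def maturity_report(ack_minutes: list[float],
                    declare_minutes: list[float],
                    resolved_levels: list[str]) -> dict[str, bool]:
    """Check Ack p95, median Time to Declare, and Escalation Success targets."""
    at_p1_p2 = sum(level in {"P1", "P2"} for level in resolved_levels)
    return {
        "ack_p95_ok": percentile_nearest_rank(ack_minutes, 95) <= 5,
        "declare_median_ok": median(declare_minutes) <= 10,
        "escalation_success_ok": at_p1_p2 / len(resolved_levels) >= 0.70,
    }
```

The function expects non-empty inputs; a real report would also need to handle quarters with no SEV-1/0 incidents.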
10) Checklists
Online (for on-call)
- SLO impact and potential SEV identified.
- ACK made and IC assigned (for SEV-1/0).
- War-room open, playbook attached.
- Status update published/scheduled per SLA.
- Freeze enabled (if needed), provider/security escalated.
Process (weekly review)
- Did the escalation ladder fire within SLA?
- Were there any unnecessary escalations before IC?
- Were customer notifications timely and accurate?
- Were there blockers (access rights, provider contacts, an unresponsive channel)?
- Are CAPAs in place for process failures?
11) Templates
11.1 Escalation Policy (YAML sketch)
```yaml
policy:
  sev_levels:
    - id: SEV-0
      declare_tgt_min: 5
      first_comms_min: 10
      update_cadence_min: 15
    - id: SEV-1
      declare_tgt_min: 10
      first_comms_min: 15
      update_cadence_min: 30
  ack_sla_min:
    default: 5
  ladder:
    - after_min: 5
      escalate_to: "P2:oncall-<service>"
    - after_min: 10
      escalate_to: "IC:ic-of-the-day"
    - after_min: 15
      escalate_to: "DutyManager:duty"
    - after_min: 30
      escalate_to: "Exec:oncall-exec"
  channels:
    war_room: "#war-room-<service>"
    alerts: "#alerts-<service>"
    security: "#sec-war-room"
    providers: "vendors@list"
  quorum:
    required_sources: 2
    sources: ["synthetic:eu,us", "rum:<service>", "biz_sli:<kpi>"]
  exceptions:
    security: { quiet_hours: false, legal_approval_required: true }
    providers: { auto_switch: true, notify_vendor_owner: true }
```
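A policy like this is worth sanity-checking before it reaches the pager. The sketch below assumes the policy has already been loaded into a Python dict (for instance with a YAML parser) with the keys shown above; the specific checks are illustrative suggestions, not rules from the matrix itself.

```python
def validate_policy(policy: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the policy looks sane."""
    errors = []
    steps = [step["after_min"] for step in policy.get("ladder", [])]
    # Ladder delays must be strictly increasing (no duplicates, no reordering).
    if steps != sorted(set(steps)):
        errors.append("ladder after_min values must be strictly increasing")
    ack = policy.get("ack_sla_min", {}).get("default")
    if ack is not None and steps and steps[0] < ack:
        errors.append("first ladder step fires before the ACK SLA expires")
    for sev in policy.get("sev_levels", []):
        if sev["first_comms_min"] < sev["declare_tgt_min"]:
            errors.append(f"{sev['id']}: first comms scheduled before declaration")
    quorum = policy.get("quorum", {})
    if quorum.get("required_sources", 0) > len(quorum.get("sources", [])):
        errors.append("quorum requires more sources than are listed")
    return errors
```

Running this in CI next to the alert rules would catch a mis-ordered ladder before an incident does.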
11.2 Timed escalation card (for the bot)
T + 05m: no ACK → escalate to P2
T + 10m: no ACK/Declare → escalate to IC, open war-room
T + 15m: no Comms → remind Comms, escalate to Duty Manager
T + 30m: no Updates → remind IC, CC Exec on-call
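A bot can evaluate this card on every tick from the incident's current state. A minimal sketch: `IncidentState` and its flags are hypothetical, and the function only reports which actions are due without sending anything.

```python
from dataclasses import dataclass

@dataclass
class IncidentState:
    acked: bool = False
    declared: bool = False
    first_comms_sent: bool = False
    update_sent: bool = False

def due_actions(elapsed_min: float, state: IncidentState) -> list[str]:
    """Actions the bot should take at T + elapsed_min, per the card above."""
    actions = []
    if elapsed_min >= 5 and not state.acked:
        actions.append("escalate to P2")
    if elapsed_min >= 10 and not (state.acked and state.declared):
        actions.append("escalate to IC; open war-room")
    if elapsed_min >= 15 and not state.first_comms_sent:
        actions.append("remind Comms; escalate to Duty Manager")
    if elapsed_min >= 30 and not state.update_sent:
        actions.append("remind IC; CC Exec on-call")
    return actions
```

A production bot would also remember which actions it has already taken, so that repeated ticks do not re-page the same people.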
11.3 Template for the first public update
Impact: [services/regions] affected, [symptoms e.g. delays/errors].
Reason: under investigation; impact confirmed by monitoring quorum.
Actions: bypass routes/restrictions enabled, provider switchover in progress.
Next update: [time, time zone].
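Filling the template from structured fields keeps first updates consistent under pressure. A sketch only: the function and parameter names are hypothetical, and the fixed "Reason" line assumes the cause is still unknown at first-update time.

```python
def first_public_update(services: str, regions: str, symptoms: str,
                        actions: str, next_update: str) -> str:
    """Render the first public status update from the template fields."""
    lines = [
        f"Impact: {services} in {regions} affected, {symptoms}.",
        "Reason: under investigation; impact confirmed by monitoring quorum.",
        f"Actions: {actions}.",
        f"Next update: {next_update}.",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(first_public_update(
        services="payments API",
        regions="eu",
        symptoms="elevated errors and delays",
        actions="bypass routes enabled, provider switchover in progress",
        next_update="14:30 UTC",
    ))
```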
12) Integrations
Alert-as-Code: Each Page rule references exactly one playbook and knows its own escalation matrix.
ChatOps: commands '/declare sev1 ', '/page p2', '/status update ', plus automatic update timers.
CMDB/Catalog: each service record lists its owners, on-call, matrix, providers, and channels.
Status page: templates for SEV-1/0, update history, links to RCA.
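The ChatOps commands can be parsed with a few regular expressions. This is a hedged sketch: the optional trailing text after `/declare` and `/status update`, and the shape of the returned dict, are assumptions, since the source only names the commands.

```python
import re

def parse_command(text: str) -> dict:
    """Parse a ChatOps command into an action dict; raise ValueError if unknown."""
    text = text.strip()
    m = re.fullmatch(r"/declare sev(\d)(?:\s+(.*))?", text)
    if m:
        return {"action": "declare", "sev": f"SEV-{m.group(1)}", "note": m.group(2) or ""}
    m = re.fullmatch(r"/page (\w+)", text)
    if m:
        return {"action": "page", "target": m.group(1).upper()}
    m = re.fullmatch(r"/status update(?:\s+(.*))?", text)
    if m:
        return {"action": "status_update", "text": m.group(1) or ""}
    raise ValueError(f"unknown command: {text!r}")
```

For instance, `parse_command("/page p2")` returns an action dict targeting `P2`, which the bot can hand to the pager integration.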
13) Anti-patterns
"Escalate all at once" → noise and blurred responsibility.
No IC/war-room: decisions scatter across chats.
Delayed first update: more complaints and PR risk.
No security exceptions: legal risk.
External providers without an owner or contacts.
The ladder is not automated: everything moves slowly and by hand.
14) Implementation Roadmap (3-5 weeks)
1. Week 1: fix SEV criteria and timings; collect role and provider contacts; select channels.
2. Week 2: describe the policy (YAML), bind it to Alert-as-Code, turn on the ladder in the pager/bot.
3. Week 3: pilot on 2-3 critical services; debug the Comms SLA and templates.
4. Weeks 4-5: expand coverage, introduce a weekly Escalation Review and maturity metrics.
15) The bottom line
The escalation matrix is the operational constitution for incidents: who gets involved, when, and how. With clear SEVs, timings, channels, security exceptions, and integration with playbooks and the status page, the team reacts quickly, coherently, and transparently, and users see predictable updates and confident service recovery.