Escalation Matrix

1) Matrix purpose

The escalation matrix is a uniform set of rules for who gets involved and when, so that incidents move quickly from chaos to a managed process. It defines:

SEV levels and their criteria;
timings (detection → ack → escalation → updates);
roles and channels for each step;
exceptions (no quiet hours for security and compliance);
links to playbooks and the status page.

2) Classification by severity (SEV)

SEV	Impact	Examples	Time targets
SEV-0	Complete unavailability of key business functions or data	Region down, Tier-0 data loss	Declare ≤ 5 min; First Comms ≤ 10 min; MTTR: ASAP
SEV-1	Serious SLO degradation	Payments 3% below SLO, p95 > 400 ms	Declare ≤ 10 min; First Comms ≤ 15 min; Updates every 15–30 min
SEV-2	Partial degradation, workaround possible	One provider is down, a fallback exists	Declare ≤ 20 min; Comms as needed
SEV-3	Low impact/internal	Non-customer-affecting failures	No public updates

Adjust the target numbers to your own domain and SLOs.
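The time targets in the table can be kept as data so that tooling can check incidents against them. The following is a minimal Python sketch, not a reference implementation: the names SevTargets, TARGETS, and missed_targets are hypothetical, and the numbers are copied from the table above. Where the table sets no numeric target (for example first comms for SEV-2), the sketch uses None.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SevTargets:
    """Time targets in minutes; None means the table sets no numeric target."""
    declare_min: Optional[int]
    first_comms_min: Optional[int]


# Values copied from the severity table; adjust them to your own SLOs.
TARGETS = {
    "SEV-0": SevTargets(declare_min=5, first_comms_min=10),
    "SEV-1": SevTargets(declare_min=10, first_comms_min=15),
    "SEV-2": SevTargets(declare_min=20, first_comms_min=None),
    "SEV-3": SevTargets(declare_min=None, first_comms_min=None),
}


def missed_targets(
    sev: str,
    declared_after_min: float,
    first_comms_after_min: Optional[float],
) -> list[str]:
    """Return the names of the time targets this incident missed."""
    targets = TARGETS[sev]
    missed = []
    if targets.declare_min is not None and declared_after_min > targets.declare_min:
        missed.append("declare")
    if targets.first_comms_min is not None:
        # When a target exists, sending no comms at all counts as a miss.
        if first_comms_after_min is None or first_comms_after_min > targets.first_comms_min:
            missed.append("first_comms")
    return missed
```

For example, missed_targets("SEV-1", 8, 12) returns an empty list, while a SEV-0 declared after 7 minutes with no communication sent misses both targets.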

3) Basic who/when/where matrix

Event	Timing	Who initiates	Escalate to	Channel/Tool	Comment
Detection (Page)	T0 → immediately	Monitoring/P1	P1	Pager/chat #alerts-svc	Playbook attached automatically
ACK Page	≤ 5 min (SEV-1/0)	P1	none	Pager	No ACK triggers auto-escalation
No-ACK	5 min	Pager	P2	Pager/Sound	Then IC after 5–10 min
Declare SEV-1/0	≤ 10 min	IC/P1	Duty Manager, Comms	#war-room-<service>, status page	Freeze releases
First Comms	≤ 15 min	Comms (directed by IC)	Customers/internal stakeholders	Status page/mail	Template: Impact / Diagnosis / Actions / ETA
Security trigger	Immediately	Security IR	IC, Legal, Exec	#sec-war-room	No quiet hours
Provider red	≤ 5 min after confirmation	Vendor Owner	IC, Product	Vendor channel/mail	Initiate switchover
No update	> 30 min (SEV-1/0)	Bot	IC/Comms	War-room	Update-SLA reminder

4) Escalation decision tree (essentials)

1. Any confirmed impact on SLO?

→ Yes: assign an IC, declare a SEV, open a war-room.
→ No: ticket/observation, no page.

2. Got an ACK on time?

→ Yes: continue with the playbook.
→ No: escalate P2 → IC → Duty Manager (time-based ladder).

3. Security/leak/PII?

→ Always involve Security IR and Legal; public communications must be coordinated with them.

4. External provider?

→ Escalate to the Vendor Owner, switch routes, record it in the status.
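The four questions above can be sketched as one function that collects the resulting actions. This is an illustrative Python sketch under stated assumptions: the Signal fields and the decide function are hypothetical names, and the sketch assumes that the ACK question only applies when a page was actually sent (that is, when SLO impact is confirmed).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Signal:
    slo_impact_confirmed: bool
    acked_on_time: bool
    security_related: bool   # security incident, leak, or PII exposure
    external_provider: bool


def decide(signal: Signal) -> list[str]:
    """Walk the four questions in order and collect the resulting actions."""
    actions = []

    # 1. Any confirmed impact on SLO?
    if signal.slo_impact_confirmed:
        actions += ["assign IC", "declare SEV", "open war-room"]
        # 2. Got an ACK on time? (assumed to matter only when a page was sent)
        if signal.acked_on_time:
            actions.append("continue with playbook")
        else:
            actions.append("escalate P2 -> IC -> Duty Manager")
    else:
        actions.append("open ticket / observe, no page")

    # 3. Security, leak, or PII: always involve Security IR and Legal.
    if signal.security_related:
        actions.append("engage Security IR and Legal; coordinate public comms")

    # 4. External provider involved.
    if signal.external_provider:
        actions.append("escalate to Vendor Owner; switch routes; record in status")

    return actions
```

Keeping the security and provider checks outside the SLO branch means a security trigger is escalated even when no SLO impact has been confirmed yet, which matches the "always" wording of question 3.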

5) Escalation Roles and Responsibilities (short)

P1 (Primary): triage, starts the playbook, liaises with the IC.
P2 (Secondary): backup, complex actions, keeps context.
IC (Incident Commander): declares the SEV, decides on freeze/rollback, keeps the pace.
Duty Manager: removes blockers, reallocates resources, makes organizational decisions.
Comms: status page, updates per SLA.
Security IR: isolation, forensics, legal notifications.
Vendor Owner: external providers, switchover/fallback.

6) Timing guidelines (reference points)

SEV-1/0: ACK ≤ 5 min, Declare ≤ 10 min, First Comms ≤ 15 min, Updates every 15–30 min.
Escalation ladder: P1 → P2 (5 min) → IC (10 min) → Duty Manager (15 min) → Exec on-call (30 min).
Security: no delays and no quiet hours; updates every 15 min.
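The ladder is easy to express as a lookup from elapsed time to the roles that should have been paged. A minimal Python sketch follows; LADDER and roles_to_page are hypothetical names, the thresholds come from the ladder above, and treating the threshold as inclusive (escalate at exactly 5 minutes) is an assumption.

```python
from __future__ import annotations

# (minutes without ACK, role to page); thresholds taken from the ladder above.
LADDER = [(5, "P2"), (10, "IC"), (15, "Duty Manager"), (30, "Exec on-call")]


def roles_to_page(minutes_without_ack: float) -> list[str]:
    """Return every role that should have been paged by now, P1 first."""
    paged = ["P1"]  # P1 is paged at T0
    for after_min, role in LADDER:
        if minutes_without_ack >= after_min:
            paged.append(role)
    return paged
```

With this sketch, roles_to_page(12) yields P1, P2, and IC; the Duty Manager joins at 15 minutes.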

7) Routing and segmentation

By service/region/tenant: routing key = 'service + region + tenant'.
Quorum of probes: escalate only when at least 2 independent sources confirm the problem (synthetic checks from 2 regions + RUM/business SLI).
Dedup: one master alert instead of dozens of symptoms (a "red" DB alert suppresses the 5xx noise).
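The three routing rules can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the function names, the ":" separator in the routing key, the alert dictionaries with "key" and "name" fields, and the choice to count each entry in the confirmations mapping as one independent source.

```python
from __future__ import annotations


def routing_key(service: str, region: str, tenant: str) -> str:
    """Build the routing key from service, region, and tenant."""
    return f"{service}:{region}:{tenant}"


def quorum_reached(confirmations: dict[str, bool], required: int = 2) -> bool:
    """True when at least `required` sources confirm the problem."""
    return sum(1 for confirmed in confirmations.values() if confirmed) >= required


def dedup(alerts: list[dict], suppressed_by: dict[str, str]) -> list[dict]:
    """Drop symptom alerts whose master alert fires for the same routing key.

    `suppressed_by` maps a symptom alert name to the master alert name,
    for example {"http_5xx": "db_red"}.
    """
    firing = {(alert["key"], alert["name"]) for alert in alerts}
    return [
        alert
        for alert in alerts
        if (alert["key"], suppressed_by.get(alert["name"])) not in firing
    ]
```

Suppression is scoped to the routing key, so a red database in one region does not hide 5xx alerts from another region.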

8) Exceptions and special modes

Security/Legal: Security IR and Legal are escalated immediately, outside the normal order; public statements are released only after coordination with them.
Providers: separate OLA/SLA matrix (contacts, time zones, priority).
Change freeze: on SEV-1/0, releases and config changes are frozen automatically.

9) Matrix maturity metrics

Ack p95 (SEV-1/0) ≤ 5 min.
Time to Declare (median) ≤ 10 min.
Comms SLA Adherence ≥ 95%.
Escalation Success (resolved at P1/P2 level) ≥ 70%.
No-ACK escalations decreasing quarter over quarter.
Vendor Response Time for critical providers within contract terms.
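Several of these metrics can be computed directly from incident records. The Python sketch below is illustrative only: percentile and maturity_report are hypothetical names, the thresholds are the ones listed above, and the nearest-rank percentile is one of several common definitions, chosen here as an assumption.

```python
from __future__ import annotations

import math
from statistics import median


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def maturity_report(
    ack_min: list[float],            # minutes to ACK, SEV-1/0 pages
    declare_min: list[float],        # minutes to declare
    comms_on_time: list[bool],       # one flag per required update
    resolved_at_p1_p2: list[bool],   # one flag per incident
) -> dict[str, bool]:
    """Check each metric against the thresholds listed above."""
    return {
        "ack_p95_ok": percentile(ack_min, 95) <= 5,
        "declare_median_ok": median(declare_min) <= 10,
        "comms_adherence_ok": sum(comms_on_time) / len(comms_on_time) >= 0.95,
        "escalation_success_ok": sum(resolved_at_p1_p2) / len(resolved_at_p1_p2) >= 0.70,
    }
```

The quarter-over-quarter trend and the vendor response time are left out because they need historical data and contract terms that this sketch does not model.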

10) Checklists

Online (for on-call)

SLO impact and potential SEV identified.
ACK made and IC assigned (for SEV-1/0).
War-room open, playbook attached.
Status update published or scheduled per SLA.
Freeze enabled (if needed), provider/security escalated.

Process (weekly review)

Did the escalation ladder work within the SLA?
Were there any unnecessary escalations up to the IC?
Were customer notifications timely and accurate?
Were there blockers (access, provider contacts, a silent channel)?
Are CAPAs (corrective and preventive actions) in place for process failures?

11) Templates

11.1 Escalation policy (YAML sketch)

policy:
  sev_levels:
    - id: SEV-0
      declare_tgt_min: 5
      first_comms_min: 10
      update_cadence_min: 15
    - id: SEV-1
      declare_tgt_min: 10
      first_comms_min: 15
      update_cadence_min: 30
  ack_sla_min:
    default: 5
  ladder:
    - after_min: 5
      escalate_to: "P2:oncall-<service>"
    - after_min: 10
      escalate_to: "IC:ic-of-the-day"
    - after_min: 15
      escalate_to: "DutyManager:duty"
    - after_min: 30
      escalate_to: "Exec:oncall-exec"
  channels:
    war_room: "#war-room-<service>"
    alerts: "#alerts-<service>"
    security: "#sec-war-room"
    providers: "vendors@list"
  quorum:
    required_sources: 2
    sources: ["synthetic:eu,us", "rum:<service>", "biz_sli:<kpi>"]
  exceptions:
    security: { quiet_hours: false, legal_approval_required: true }
    providers: { auto_switch: true, notify_vendor_owner: true }
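A policy like this is worth checking for internal consistency before it is wired into a pager. The sketch below is a hypothetical validator, not part of any tool: it takes the content of the policy key as a plain Python dict (which could come from a YAML loader such as PyYAML's yaml.safe_load), and the three rules it checks are assumptions about what "consistent" means here.

```python
from __future__ import annotations


def validate_policy(policy: dict) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []

    # Ladder steps should fire in strictly increasing time order.
    steps = [step["after_min"] for step in policy.get("ladder", [])]
    if steps != sorted(set(steps)):
        problems.append("ladder steps must be strictly increasing")

    # An incident should be declared before the first communication is due.
    for sev in policy.get("sev_levels", []):
        if sev["declare_tgt_min"] > sev["first_comms_min"]:
            problems.append(f"{sev['id']}: declare target is later than first comms")

    # The quorum cannot require more sources than are configured.
    quorum = policy.get("quorum", {})
    if quorum.get("required_sources", 0) > len(quorum.get("sources", [])):
        problems.append("quorum requires more sources than are configured")

    return problems
```

Returning a list of messages rather than raising on the first error lets a CI job report every problem in one run.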

11.2 Timed escalation card (for the bot)


T+05m: no ACK → escalate to P2
T+10m: no ACK/Declare → escalate to IC, open war-room
T+15m: no Comms → remind Comms, escalate to Duty Manager
T+30m: no Updates → remind IC, CC Exec on-call
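A bot can evaluate this card against the current incident state on every tick. The Python sketch below is one possible reading of the card, with hypothetical names (IncidentState, bot_actions); it returns every action that is due at the given moment and leaves it to the caller to remember which actions were already sent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IncidentState:
    acked: bool = False
    declared: bool = False
    first_comms_sent: bool = False
    update_published: bool = False


def bot_actions(minutes_since_page: float, state: IncidentState) -> list[str]:
    """Return the card actions that are due, given elapsed time and state."""
    actions = []
    if minutes_since_page >= 5 and not state.acked:
        actions.append("escalate to P2")
    if minutes_since_page >= 10 and not (state.acked and state.declared):
        actions.append("escalate to IC, open war-room")
    if minutes_since_page >= 15 and not state.first_comms_sent:
        actions.append("remind Comms, escalate to Duty Manager")
    if minutes_since_page >= 30 and not state.update_published:
        actions.append("remind IC, CC Exec on-call")
    return actions
```

For example, 12 minutes after the page with an ACK but no declared SEV, the only action due is the escalation to the IC.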

11.3 Template for the first public update


Impact: [services/regions] affected, [symptoms, e.g. delays/errors].
Reason: under investigation; impact confirmed by monitoring quorum.
Actions: bypass routes/limits are enabled, provider switchover is in progress.
Next update: [time, time zone].
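Filling the template programmatically keeps the first update consistent under pressure. The sketch below is a hypothetical helper (first_update and its parameters are not from any tool); it assumes a timezone-aware timestamp, reports the next update time in UTC, and uses a 15-minute default cadence as an assumption.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def first_update(
    services: list[str],
    regions: list[str],
    symptoms: str,
    actions: str,
    now: datetime,
    next_in_min: int = 15,
) -> str:
    """Render the first public update; `now` must be timezone-aware."""
    next_update = (now + timedelta(minutes=next_in_min)).astimezone(timezone.utc)
    return "\n".join([
        f"Impact: {', '.join(services)} in {', '.join(regions)} affected, {symptoms}.",
        "Reason: under investigation; impact confirmed by monitoring quorum.",
        f"Actions: {actions}.",
        f"Next update: {next_update:%H:%M} UTC.",
    ])
```

Computing the next update time from the cadence, instead of typing it by hand, makes it harder to publish a promise that the update timer does not track.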

12) Integrations

Alert-as-Code: Each Page rule references exactly one playbook and knows its own escalation matrix.
ChatOps: commands '/declare sev1 ', '/page p2', '/status update ', plus automatic update timers.
CMDB/Catalog: each service record lists its owners, on-call, matrix, providers, and channels.
Status page: templates for SEV-1/0, update history, links to RCA.

13) Anti-patterns

"Escalate all at once" → noise and blurred responsibility.
No IC or war-room → decisions scatter across chats.
Delayed first update → more complaints and PR risk.
No security exceptions → legal risks.
External providers without an owner and contacts.
The ladder is not automated → everything runs "with the handbrake on."

14) Implementation Roadmap (3-5 weeks)

1. Week 1: fix SEV criteria and timings; collect role and provider contacts; select channels.
2. Week 2: describe the policy (YAML), bind it to Alert-as-Code, turn on the ladder in the pager/bot.
3. Week 3: pilot on 2-3 critical services; debug the Comms SLA and templates.
4. Weeks 4-5: expand coverage, introduce a weekly Escalation Review and maturity metrics.

15) The bottom line

The escalation matrix is the operational constitution for incidents: who gets involved, when, and how. With clear SEVs, timings, channels, security exceptions, and integration with playbooks and the status page, the team reacts quickly, coherently, and transparently, and users see predictable updates and confident service recovery.
