GambleHub

Escalation Matrix

1) Matrix purpose

The escalation matrix is a single set of rules for who gets involved and when, so that incidents move quickly from chaos to a managed process. It defines:
  • SEV levels and their criteria;
  • timings (detection → ACK → escalation → updates);
  • roles and channels for each step;
  • exceptions (no quiet hours for security and compliance);
  • links to playbooks and the status page.

2) Classification by severity (SEV)

| SEV | Impact | Examples | Time targets |
|---|---|---|---|
| SEV-0 | Complete unavailability of key business functions or data | Regional outage, Tier-0 data loss | Declare ≤ 5 min; first comms ≤ 10 min; MTTR: as fast as possible |
| SEV-1 | Serious SLO degradation | Payments 3% below SLO, p95 > 400 ms | Declare ≤ 10 min; first comms ≤ 15 min; updates every 15–30 min |
| SEV-2 | Partial degradation, workaround possible | One provider down, fallback available | Declare ≤ 20 min; comms as needed |
| SEV-3 | Low or internal impact | Failures that do not affect customers | No public updates |

Adjust the target numbers to your own domain and SLOs.
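The targets in the table can be kept as data, so that a post-incident check flags missed deadlines automatically. This is a minimal Python sketch. It assumes all times are measured in minutes from detection (T0). `SevTargets` and `target_breaches` are hypothetical names. SEV-3 is treated as having no declare target because the table gives none.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SevTargets:
    """Time targets in minutes from detection (T0); None means no fixed target."""
    declare_min: Optional[int]
    first_comms_min: Optional[int]


# Values copied from the SEV table; adjust to your own domain and SLOs.
SEV_TARGETS = {
    "SEV-0": SevTargets(declare_min=5, first_comms_min=10),
    "SEV-1": SevTargets(declare_min=10, first_comms_min=15),
    "SEV-2": SevTargets(declare_min=20, first_comms_min=None),  # comms as needed
    "SEV-3": SevTargets(declare_min=None, first_comms_min=None),  # no public updates
}


def target_breaches(sev: str, declared_at: float, first_comms_at: Optional[float]) -> List[str]:
    """Return the names of the time targets this incident missed."""
    t = SEV_TARGETS[sev]
    breaches: List[str] = []
    if t.declare_min is not None and declared_at > t.declare_min:
        breaches.append("declare")
    # A missing first update counts as a breach only when a target exists.
    if t.first_comms_min is not None and (first_comms_at is None or first_comms_at > t.first_comms_min):
        breaches.append("first_comms")
    return breaches
```

For example, `target_breaches("SEV-1", declared_at=8, first_comms_at=20)` returns `["first_comms"]`. The incident was declared in time, but the first update arrived 5 minutes late.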

3) Basic who/when/where matrix

| Event | Timing | Initiated by | Escalated to | Channel/tool | Comment |
|---|---|---|---|---|---|
| Detection (page) | T0, immediately | Monitoring/P1 | P1 | Pager, chat #alerts-svc | Playbook auto-attached |
| ACK of page | ≤ 5 min (SEV-1/0) | P1 | - | Pager | No ACK triggers auto-escalation |
| No ACK | 5 min | Pager | P2 | Pager/sound | Then IC within 5–10 min |
| Declare SEV-1/0 | ≤ 10 min | IC/P1 | Duty Manager, Comms | #war-room-<service>, status page | Freeze releases |
| First comms | ≤ 15 min | Comms (on behalf of IC) | Customers, internal stakeholders | Status page/email | Impact-Diagnosis-Actions-ETA template |
| Security trigger | Immediately | Security IR | IC, Legal, Exec | #sec-war-room | No quiet hours |
| Provider red | ≤ 5 min after confirmation | Vendor Owner | IC, Product | Vendor channel/email | Initiate switchover |
| No update | > 30 min (SEV-1/0) | Bot | IC/Comms | War-room | Update SLA reminder |
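The matrix is easier to keep in sync with the pager and the bot when it is stored as data. This is a Python sketch with some assumptions. The event keys (such as `"declare_sev01"`) are hypothetical. Channel names use a `<service>` placeholder, following the `#war-room-<service>` pattern from the policy template. `route` only looks up who to escalate to and which concrete channels to use.

```python
from typing import NamedTuple, Tuple


class MatrixRow(NamedTuple):
    timing: str
    initiator: str
    escalate_to: Tuple[str, ...]
    channels: Tuple[str, ...]
    comment: str


# The who/when/where matrix, one row per event ("<service>" is a placeholder).
MATRIX = {
    "detection": MatrixRow("T0, immediately", "Monitoring/P1", ("P1",),
                           ("pager", "#alerts-<service>"), "auto-attach playbook"),
    "ack": MatrixRow("<= 5 min (SEV-1/0)", "P1", (), ("pager",),
                     "no ACK -> auto-escalation"),
    "no_ack": MatrixRow("5 min", "pager", ("P2",), ("pager",), "then IC in 5-10 min"),
    "declare_sev01": MatrixRow("<= 10 min", "IC/P1", ("Duty Manager", "Comms"),
                               ("#war-room-<service>", "status-page"), "freeze releases"),
    "first_comms": MatrixRow("<= 15 min", "Comms (for IC)", ("customers", "internal stakeholders"),
                             ("status-page", "email"), "Impact/Diagnosis/Actions/ETA"),
    "security": MatrixRow("immediately", "Security IR", ("IC", "Legal", "Exec"),
                          ("#sec-war-room",), "no quiet hours"),
    "provider_red": MatrixRow("<= 5 min after confirmation", "Vendor Owner", ("IC", "Product"),
                              ("vendor-channel", "email"), "initiate switchover"),
    "no_update": MatrixRow("> 30 min (SEV-1/0)", "bot", ("IC", "Comms"),
                           ("#war-room-<service>",), "update SLA reminder"),
}


def route(event: str, service: str) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Return (escalation targets, concrete channels) for a matrix event."""
    row = MATRIX[event]
    channels = tuple(c.replace("<service>", service) for c in row.channels)
    return row.escalate_to, channels
```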

4) Escalation decision tree (summary)

1. Any confirmed impact on SLO?

→ Yes: assign an IC, declare a SEV, open a war-room.
→ No: ticket/observation, no page.

2. Got an ACK on time?

→ Yes: we continue along the playbook.
→ No: P2 → IC → DM (time-based ladder).

3. Security/leak/PII?

→ Always engage Security IR and Legal; public communications are coordinated with them.

4. External provider?

→ Escalate to the Vendor Owner, switch routes, and note it on the status page.
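The decision tree can be written directly as a function. This is a minimal sketch with several assumptions. The four questions arrive as booleans on a hypothetical `Signal` record. The returned strings are illustrative action names for a bot or pager. The security branch runs first because step 3 says "always", which we read as applying whatever the other answers are.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signal:
    slo_impact_confirmed: bool
    acked_in_time: bool
    security_or_pii: bool
    external_provider: bool


def decide(sig: Signal) -> List[str]:
    actions: List[str] = []
    # Step 3: security/leak/PII always pulls in Security IR and Legal.
    if sig.security_or_pii:
        actions += ["page:security_ir", "page:legal", "comms:legal_approval_required"]
    # Step 1: no confirmed SLO impact -> ticket/observation, no page.
    if not sig.slo_impact_confirmed:
        actions.append("open_ticket")
        return actions
    actions += ["assign_ic", "declare_sev", "open_war_room"]
    # Step 2: an on-time ACK keeps us on the playbook; otherwise start the
    # time-based ladder (one step at a time, not everyone at once).
    if sig.acked_in_time:
        actions.append("follow_playbook")
    else:
        actions.append("start_ladder:P2->IC->DM")
    # Step 4: external provider -> Vendor Owner, route switch, status page note.
    if sig.external_provider:
        actions += ["escalate:vendor_owner", "switch_routes", "status_page:note_provider"]
    return actions
```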

5) Escalation roles and responsibilities (summary)

P1 (Primary): triage, starting the playbook, handoff to the IC.
P2 (Secondary): backup, complex actions, keeping context.
IC (Incident Commander): declares the SEV, decides on freeze/rollback, sets the pace.
Duty Manager: removes blockers, reallocates resources, makes organizational decisions.
Comms: status page, updates within SLA.
Security IR: isolation, forensics, legal notices.
Vendor Owner: external providers, switchover/fallback.

6) Timing guidelines

SEV-1/0: ACK ≤ 5 min, declare ≤ 10 min, first comms ≤ 15 min, updates every 15–30 min.
Escalation ladder: P1 → P2 (5 min) → IC (10 min) → Duty Manager (15 min) → Exec on-call (30 min).
Security: no delays and no quiet hours; updates every 15 min.
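The ladder is cumulative. Each step adds a role, and nobody is dropped. This sketch assumes the clock starts at the page (T0) and stops at the first ACK. `LADDER`, `paged_roles`, and `next_step` are illustrative names.

```python
from typing import List, Optional, Tuple

# (minutes without ACK, role added at that point), from the ladder above.
LADDER: List[Tuple[int, str]] = [
    (0, "P1"),
    (5, "P2"),
    (10, "IC"),
    (15, "Duty Manager"),
    (30, "Exec on-call"),
]


def paged_roles(minutes_without_ack: float) -> List[str]:
    """Roles that should be engaged after this many minutes with no ACK."""
    return [role for after, role in LADDER if minutes_without_ack >= after]


def next_step(minutes_without_ack: float) -> Optional[Tuple[int, str]]:
    """Return (minute, role) of the next escalation, or None at the top of the ladder."""
    for after, role in LADDER:
        if after > minutes_without_ack:
            return after, role
    return None
```

For example, 12 minutes without an ACK means P1, P2, and the IC are all engaged, and the Duty Manager is next at minute 15.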

7) Routing and segmentation

Route by service, region, and tenant: routing key = 'service + region + tenant'.
Probe quorum: escalate only when at least 2 independent sources confirm the problem (e.g., synthetic checks from 2 regions + RUM/business SLI).
Dedup: one master alert instead of dozens of symptom alerts (a red DB alert suppresses the resulting 5xx noise).
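Routing, quorum, and dedup can be combined into one pass over incoming alerts. This sketch makes two assumptions. First, each alert carries a `source` label, and synthetic probes from different regions count as different sources. Second, `SUPPRESSES` is a hypothetical map from a root-cause symptom to the downstream symptoms it hides.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Alert:
    service: str
    region: str
    tenant: str
    source: str   # e.g. "synthetic:eu", "synthetic:us", "rum", "biz_sli"
    symptom: str  # e.g. "db_red", "http_5xx"


# Hypothetical: a red database hides the 5xx/latency noise it causes.
SUPPRESSES: Dict[str, Set[str]] = {"db_red": {"http_5xx", "latency_p95"}}


def routing_key(a: Alert) -> str:
    return f"{a.service}+{a.region}+{a.tenant}"


def master_alerts(alerts: List[Alert], required_sources: int = 2) -> Dict[str, List[str]]:
    """Per routing key: the symptoms worth paging on; keys without quorum are dropped."""
    groups: Dict[str, List[Alert]] = {}
    for a in alerts:
        groups.setdefault(routing_key(a), []).append(a)

    result: Dict[str, List[str]] = {}
    for key, group in groups.items():
        # Quorum: at least N independent sources must confirm the problem.
        if len({a.source for a in group}) < required_sources:
            continue
        symptoms = {a.symptom for a in group}
        suppressed: Set[str] = set()
        for s in symptoms:
            suppressed |= SUPPRESSES.get(s, set())
        # Dedup: keep root causes, drop the symptoms they explain.
        result[key] = sorted(symptoms - suppressed)
    return result
```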

8) Exceptions and special modes

Security/Legal: Security IR and Legal are escalated immediately, bypassing the ladder; public texts go out only after their approval.
Providers: a separate OLA/SLA matrix (contacts, time zones, priority).
Change freeze: declaring SEV-1/0 automatically freezes releases and config changes.
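The special modes reduce to a few rules that a bot can enforce. In this sketch the mode names are hypothetical labels. Quiet-hours handling for non-security alerts is left out, because the matrix only states that security ignores quiet hours.

```python
from typing import Set


def special_modes(sev: str, security: bool) -> Set[str]:
    """Modes switched on automatically for an incident."""
    modes: Set[str] = set()
    if sev in ("SEV-0", "SEV-1"):
        # Declaring SEV-1/0 freezes releases and config changes.
        modes.add("change_freeze")
    if security:
        # Security IR and Legal skip the ladder and ignore quiet hours.
        modes |= {"page_security_ir_now", "page_legal_now", "ignore_quiet_hours"}
    return modes


def can_publish_public_text(security: bool, legal_approved: bool) -> bool:
    """Public texts about security incidents go out only after approval."""
    return legal_approved or not security
```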

9) Matrix maturity metrics

Ack p95 (SEV-1/0) ≤ 5 min.
Time to Declare (median) ≤ 10 min.
Comms SLA Adherence ≥ 95%.
Escalation Success (resolved at P1/P2 level) ≥ 70%.
No-ACK escalations decrease quarter over quarter.
Vendor response time for critical providers stays within contractual terms.
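The first four metrics can be computed from per-incident records. This sketch makes several assumptions. The hypothetical `IncidentRecord` fields are already collected. p95 uses the nearest-rank method. The window contains at least one SEV-1/0 incident and at least one due update. No-ACK trends and vendor response times need quarter-level and contract data, so they are left out.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class IncidentRecord:
    sev: str
    ack_min: float        # minutes from page to ACK
    declare_min: float    # minutes from detection to SEV declaration
    updates_on_time: int  # status updates published within SLA
    updates_due: int      # status updates that were due
    resolved_by: str      # level that resolved it: "P1", "P2", "IC", ...


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    return s[math.ceil(0.95 * len(s)) - 1]


def maturity(incidents: List[IncidentRecord]) -> Dict[str, Tuple[float, bool]]:
    """Each metric's value and whether it meets its target."""
    sev01 = [i for i in incidents if i.sev in ("SEV-0", "SEV-1")]
    ack_p95 = p95([i.ack_min for i in sev01])
    declare_median = statistics.median([i.declare_min for i in incidents])
    comms = sum(i.updates_on_time for i in incidents) / sum(i.updates_due for i in incidents)
    success = sum(i.resolved_by in ("P1", "P2") for i in incidents) / len(incidents)
    return {
        "ack_p95_min": (ack_p95, ack_p95 <= 5),
        "declare_median_min": (declare_median, declare_median <= 10),
        "comms_adherence": (comms, comms >= 0.95),
        "escalation_success": (success, success >= 0.70),
    }
```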

10) Checklists

Live incident (for on-call)

  • SLO impact and potential SEV identified.
  • ACK made and IC assigned (for SEV-1/0).
  • War-room open, playbook attached.
  • Status update published or scheduled per SLA.
  • Freeze enabled (if needed), provider/security escalated.

Process (weekly review)

  • Did the escalation ladder work within SLA?
  • Were there unnecessary escalations to the IC?
  • Were customer notifications timely and accurate?
  • Were there blockers (accesses, provider contacts, silent channel)?
  • CAPAs are filed for process failures too.

11) Templates

11.1 Escalation policy (YAML sketch)

```yaml
policy:
  sev_levels:
    - id: SEV-0
      declare_tgt_min: 5
      first_comms_min: 10
      update_cadence_min: 15
    - id: SEV-1
      declare_tgt_min: 10
      first_comms_min: 15
      update_cadence_min: 30
  ack_sla_min:
    default: 5
  ladder:
    - after_min: 5
      escalate_to: "P2:oncall-<service>"
    - after_min: 10
      escalate_to: "IC:ic-of-the-day"
    - after_min: 15
      escalate_to: "DutyManager:duty"
    - after_min: 30
      escalate_to: "Exec:oncall-exec"
  channels:
    war_room: "#war-room-<service>"
    alerts: "#alerts-<service>"
    security: "#sec-war-room"
    providers: "vendors@list"
  quorum:
    required_sources: 2
    sources: ["synthetic:eu,us", "rum:<service>", "biz_sli:<kpi>"]
  exceptions:
    security: { quiet_hours: false, legal_approval_required: true }
    providers: { auto_switch: true, notify_vendor_owner: true }
```
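Before the pager loads a policy like this, a few consistency checks catch common mistakes. In this sketch the policy is the parsed form of the YAML, written as a Python dict. In practice it could come from a YAML parser such as PyYAML's `yaml.safe_load`. Only the fields checked here are included, and the check rules themselves are assumptions rather than part of the template.

```python
from typing import Any, Dict, List

# Parsed form of the policy template (the value under "policy"); checked fields only.
POLICY: Dict[str, Any] = {
    "sev_levels": [
        {"id": "SEV-0", "declare_tgt_min": 5, "first_comms_min": 10, "update_cadence_min": 15},
        {"id": "SEV-1", "declare_tgt_min": 10, "first_comms_min": 15, "update_cadence_min": 30},
    ],
    "ack_sla_min": {"default": 5},
    "ladder": [
        {"after_min": 5, "escalate_to": "P2:oncall-<service>"},
        {"after_min": 10, "escalate_to": "IC:ic-of-the-day"},
        {"after_min": 15, "escalate_to": "DutyManager:duty"},
        {"after_min": 30, "escalate_to": "Exec:oncall-exec"},
    ],
    "quorum": {"required_sources": 2, "sources": ["synthetic:eu,us", "rum:<service>", "biz_sli:<kpi>"]},
    "exceptions": {"security": {"quiet_hours": False, "legal_approval_required": True}},
}


def validate(policy: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the policy looks consistent."""
    errors: List[str] = []
    for lvl in policy["sev_levels"]:
        if lvl["first_comms_min"] < lvl["declare_tgt_min"]:
            errors.append(f"{lvl['id']}: first comms due before declaration")
    steps = [step["after_min"] for step in policy["ladder"]]
    if any(b <= a for a, b in zip(steps, steps[1:])):
        errors.append("ladder: after_min must be strictly increasing")
    if steps and steps[0] < policy["ack_sla_min"]["default"]:
        errors.append("ladder: first step fires before the ACK SLA expires")
    quorum = policy["quorum"]
    if quorum["required_sources"] > len(quorum["sources"]):
        errors.append("quorum: requires more sources than are configured")
    if policy["exceptions"]["security"]["quiet_hours"]:
        errors.append("exceptions: security must not have quiet hours")
    return errors
```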

11.2 Time-based escalation card (for the bot)


T+05m: no ACK → escalate to P2
T+10m: no ACK/declare → escalate to IC, open war-room
T+15m: no comms → remind Comms, escalate to Duty Manager
T+30m: no updates → remind IC, CC Exec on-call
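The card maps cleanly onto a function that the bot evaluates on each tick. This sketch makes two assumptions. The incident state is tracked in a hypothetical `IncidentState` record. The 30-minute update rule counts from T0 when no update has been posted yet. The function is stateless, so it returns the same actions on every call. A real bot would also need to remember what it has already sent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentState:
    acked: bool = False
    declared: bool = False
    comms_sent: bool = False
    last_update_at: Optional[float] = None  # minutes since T0 of the last status update


def bot_actions(t: float, s: IncidentState) -> List[str]:
    """Actions due at minute t after T0, per the escalation card."""
    actions: List[str] = []
    if t >= 5 and not s.acked:
        actions.append("escalate:P2")
    if t >= 10 and not (s.acked and s.declared):
        actions += ["escalate:IC", "open:war-room"]
    if t >= 15 and not s.comms_sent:
        actions += ["remind:Comms", "escalate:DutyManager"]
    # No update for 30 minutes (counting from T0 if none was ever posted).
    last = s.last_update_at if s.last_update_at is not None else 0.0
    if t - last >= 30:
        actions += ["remind:IC", "cc:Exec-oncall"]
    return actions
```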

11.3 Template for the first public update


Impact: [services/regions] affected; [symptoms, e.g., delays/errors].
Cause: under investigation; impact confirmed by monitoring quorum.
Actions: workaround routes/limits enabled; provider switchover in progress.
Next update: [time, time zone].
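Filling the template in code helps keep the first update consistent and ensures it always includes a time zone. This sketch follows the template's four lines. The function name and its arguments are hypothetical. It rejects a naive `datetime` because the template requires a time zone.

```python
from datetime import datetime, timezone
from typing import List


def first_update(services: List[str], symptoms: str, actions: str, next_update: datetime) -> str:
    """Fill the first-update template; next_update must be timezone-aware."""
    if next_update.tzinfo is None:
        raise ValueError("next_update needs a time zone")
    return "\n".join([
        f"Impact: {', '.join(services)} affected; {symptoms}.",
        "Cause: under investigation; impact confirmed by monitoring quorum.",
        f"Actions: {actions}.",
        f"Next update: {next_update:%H:%M} {next_update.tzname()}.",
    ])


if __name__ == "__main__":
    print(first_update(["payments (EU)"], "elevated error rate",
                       "traffic moved to the backup provider",
                       datetime(2024, 5, 1, 14, 30, tzinfo=timezone.utc)))
```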

12) Integrations

Alert-as-Code: every paging rule references exactly one playbook and its own escalation matrix.
ChatOps: commands '/declare sev1 ', '/page p2', '/status update ', automatic update timers.
CMDB/Catalog: each service lists its owners, on-call, escalation matrix, providers, and channels.
Status page: templates for SEV-1/0, update history, links to RCA.
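A ChatOps bot has to turn these commands into structured actions. This is a sketch of a parser with some assumptions. The arguments after `/declare sev1` and `/status update` are not specified in the text, so treating them as free text is an assumption. The `/page` targets other than `p2` (`p1`, `ic`, `dm`) are hypothetical additions.

```python
import re
from typing import Dict

# Hypothetical command grammar; only the command names come from the text.
_DECLARE = re.compile(r"/declare\s+sev([0-3])(?:\s+(.*))?", re.IGNORECASE)
_PAGE = re.compile(r"/page\s+(p1|p2|ic|dm)", re.IGNORECASE)
_STATUS = re.compile(r"/status\s+update(?:\s+(.*))?", re.IGNORECASE)


def parse_command(text: str) -> Dict[str, str]:
    """Parse a ChatOps command into an action dict; raise ValueError if unknown."""
    text = text.strip()
    m = _DECLARE.fullmatch(text)
    if m:
        return {"cmd": "declare", "sev": f"SEV-{m.group(1)}", "note": m.group(2) or ""}
    m = _PAGE.fullmatch(text)
    if m:
        return {"cmd": "page", "target": m.group(1).upper()}
    m = _STATUS.fullmatch(text)
    if m:
        return {"cmd": "status_update", "text": m.group(1) or ""}
    raise ValueError(f"unknown command: {text!r}")
```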

13) Anti-patterns

"Escalate all at once" → noise and blurred responsibility.
No IC/war-room → decisions get scattered across chats.
Delayed first update → more complaints and PR risk.
No security exceptions → legal risk.
External providers without an owner or contacts.
Ladder not automated → everything runs with the handbrake on.

14) Implementation Roadmap (3-5 weeks)

1. Week 1: finalize SEV criteria and timings; collect role and provider contacts; choose channels.
2. Week 2: describe the policy (YAML), bind it to Alert-as-Code, enable the ladder in the pager/bot.
3. Week 3: pilot on 2–3 critical services; tune Comms SLAs and templates.
4. Weeks 4–5: expand coverage; introduce a weekly Escalation Review and maturity metrics.

15) The bottom line

The escalation matrix is the operational constitution for incidents: who gets involved, when, and how. With clear SEVs, timings, channels, security exceptions, and integration with playbooks and the status page, the team responds quickly, coherently, and transparently, and users see predictable updates and confident service recovery.
