Incident Escalation
1) Purpose and principles
Incident escalation is the managed process of quickly engaging the right roles and resources to minimize impact on users and business metrics.
Key principles:
- Speed over perfection. It is better to declare an incident early and de-escalate later than to declare it late.
- Unified command. A single person owns the resolution: the Incident Commander (IC).
- Transparency. Clear statuses and communication channels for internal and external stakeholders.
- Record everything. All steps, decisions, and timelines are captured for audit and improvement.
2) Severity gradation (SEV/P-levels)
Example scale (adapt to your domain/jurisdictions):
- SEV-0/P0 (critical) - complete unavailability of a key function (login/payments), data leak, legal risk. Page the entire core on-call rotation immediately, freeze releases.
- SEV-1/P1 (high) - p95/p99 latency degradation, elevated error/failure rate in a key flow, unavailability of a region or provider.
- SEV-2/P2 (medium) - partial degradation for a limited cohort (region, provider); a workaround exists.
- SEV-3/P3 (low) - not critical for users but requires attention (background ETL delay, overdue report).
- Rule of thumb: blast radius (how many users / how much revenue) × duration × sensitivity (regulatory/PR) → SEV level; a minimal scoring sketch follows this list.
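One possible way to codify this rule of thumb, as a sketch only: the thresholds and the names Impact and score_severity are assumptions for illustration, not a fixed standard.

```python
from dataclasses import dataclass

# Illustrative SEV classifier: blast radius x duration x sensitivity -> SEV level.
# Thresholds and names are assumptions for this sketch, not a fixed standard.

@dataclass
class Impact:
    affected_pct: float          # share of users/turnover affected, 0..100
    duration_min: float          # how long the degradation has lasted
    regulatory_or_pr_risk: bool  # data leak, legal exposure, public visibility

def score_severity(i: Impact) -> str:
    if i.regulatory_or_pr_risk or i.affected_pct >= 80:
        return "SEV-0"
    if i.affected_pct >= 20 or (i.affected_pct >= 5 and i.duration_min >= 30):
        return "SEV-1"
    if i.affected_pct >= 1:
        return "SEV-2"
    return "SEV-3"

print(score_severity(Impact(affected_pct=25, duration_min=10, regulatory_or_pr_risk=False)))  # SEV-1
```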
3) Process KPIs
MTTD (mean time to detect) - from the start of the incident to the first signal.
MTTA (mean time to acknowledge) - from the first signal to IC acknowledgement.
MTTR (mean time to recover) - from the start of the incident until the SLO/function is restored.
Escalation latency - from acknowledgement to engaging the required role/team.
Reopen Rate - the proportion of incidents reopened after "resolved."
Comm SLA - adherence to the scheduled intervals for external/internal updates.
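To make the timers above concrete, a minimal sketch of the per-incident values from which the means are averaged; the timestamp field names are assumptions.

```python
from datetime import datetime

# Illustrative per-incident timers behind MTTD/MTTA/MTTR; the means are then
# averaged across incidents. Timestamp field names are assumptions.

timeline = {
    "started":      datetime(2024, 5, 1, 12, 0),   # degradation begins
    "detected":     datetime(2024, 5, 1, 12, 4),   # first alert fires
    "acknowledged": datetime(2024, 5, 1, 12, 6),   # IC acknowledges the page
    "recovered":    datetime(2024, 5, 1, 12, 41),  # SLI back within SLO
}

ttd = timeline["detected"] - timeline["started"]       # feeds MTTD
tta = timeline["acknowledged"] - timeline["detected"]  # feeds MTTA
ttr = timeline["recovered"] - timeline["started"]      # feeds MTTR
print(f"TTD={ttd}, TTA={tta}, TTR={ttr}")
```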
4) Roles and Responsibilities (RACI)
Incident Commander (IC): owns the resolution; sets the severity level, the plan, the freeze, escalations, and de-escalation. Does not write fixes.
Tech Lead (TL): technical diagnostics, hypotheses, coordination of engineers.
Comms Lead (CL): status pages, client and internal communication, coordination with Legal/PR.
Scribe: accurate recording of facts, timelines, decisions made.
Liaisons: representatives of external providers/teams (payments, KYC, hosting).
On-call engineers: execution of the plan, launching playbooks/rollbacks.
Assign duty schedules and backups for each role.
5) Channels and artifacts
War-room channel (ChatOps): a single point of coordination (Slack/Teams) with templated auto-annotations (versions, flags, canaries).
Video bridge for SEV-1 and above.
Incident ticket (one-pager): ID, SEV, IC, participants, hypothesis/diagnosis, steps, ETA, status, impact, links to graphs (a structural sketch follows this list).
Status page: public/internal; schedule of regular updates (for example, every 15-30 minutes for SEV-1 and above).
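A minimal sketch of the one-pager as a structured record; the field names are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

# Illustrative shape of the incident one-pager; fields mirror the list above,
# names are assumptions for this sketch.

@dataclass
class IncidentTicket:
    incident_id: str
    severity: str                  # "SEV-0".."SEV-3"
    commander: str                 # IC
    participants: list[str] = field(default_factory=list)
    hypothesis: str = ""
    steps_taken: list[str] = field(default_factory=list)
    eta: str = ""
    status: str = "investigating"  # investigating / mitigating / monitoring / resolved
    impact: str = ""
    dashboards: list[str] = field(default_factory=list)  # links to graphs

ticket = IncidentTicket(incident_id="INC-1042", severity="SEV-1", commander="alice")
print(ticket.status)  # investigating
```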
6) Time boxes and standard intervals
T0 (minutes 0-5): IC assigned, SEV set, release freeze (if needed), war-room opened.
T+15 min: first public/internal message (what is affected, workaround, next update window).
T+30/60 min: escalate to the next level (platform/DB/security/providers) if there is no stable improvement.
Regular updates: SEV-0 every 15 minutes; SEV-1 every 30 minutes; SEV-2+ every hour.
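A minimal sketch of a Comm SLA check against this cadence; the interval values follow the text above, while the function and variable names are assumptions.

```python
from datetime import datetime, timedelta

# Illustrative check: is the next scheduled status update overdue?
# Intervals follow the cadence above; function and variable names are assumptions.

UPDATE_INTERVAL = {
    "SEV-0": timedelta(minutes=15),
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=1),
}

def update_overdue(severity: str, last_update: datetime, now: datetime) -> bool:
    return now - last_update > UPDATE_INTERVAL[severity]

print(update_overdue("SEV-1", datetime(2024, 5, 1, 12, 0), datetime(2024, 5, 1, 12, 40)))  # True
```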
7) Auto-escalation rules (trigger policies)
Codified and wired into monitoring/alerting (a burn-rate sketch follows this list):
- Error-budget burn rate above threshold in both the short and the long window.
- Quorum of external probes: ≥2 regions observe HTTP/TLS/DNS degradation.
- Business SLI (payment/registration success rate) falls below SLO.
- Security signatures: suspected leak/compromise.
- Provider signal: webhook status "major outage."
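A minimal sketch of the multi-window burn-rate trigger from the first bullet; the SLO target, threshold, and window sizes are assumptions.

```python
# Illustrative multi-window burn-rate trigger: page only when both the short and
# the long window exceed the threshold. SLO target, threshold, and windows are assumptions.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    # How many times faster than allowed the error budget is being consumed.
    budget = 1.0 - slo_target
    return error_ratio / budget if budget > 0 else float("inf")

def should_escalate(short_window_errors: float, long_window_errors: float,
                    slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    return (burn_rate(short_window_errors, slo_target) >= threshold and
            burn_rate(long_window_errors, slo_target) >= threshold)

# 2% errors over 5 min and 1.5% over 1 h against a 99.9% SLO -> escalate.
print(should_escalate(0.02, 0.015))  # True
```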
8) Process from detection to resolution
1. Declare the incident (IC): SEV, scope, freeze, launch the playbook.
2. Diagnose (TL): hypotheses, isolate the blast radius (region, provider, feature), run checks (DNS/TLS/CDN/DB/caches/message bus).
3. Mitigate (quick wins): roll back / scale down the canary, enable a degradation feature flag, fail over the provider, rate-limit, fall back to cache.
4. Communicate (CL): status page, customers/partners, Legal/PR, updates on schedule.
5. Confirm recovery: external synthetics plus real metrics (SLIs), lift the freeze.
6. De-escalate: lower the SEV, move to an observation period of N minutes/hours.
7. Close and run RCA: prepare the post-mortem, action items with owners and deadlines.
9) Working with external providers
Run your own probes against providers from several regions, plus mirror-logged samples of requests/errors.
Escalation agreements (contacts, response SLAs, priority, status webhooks).
Automatic failover/traffic shifting driven by the provider's SLO (a reweighting sketch follows this list).
Evidence base: timeline, sample requests/responses, latency/error graphs, provider ticket ID.
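A minimal sketch of SLO-driven traffic reweighting across providers; the provider names, SLO target, and even-split policy are assumptions.

```python
# Illustrative traffic-shift decision driven by per-provider success-rate SLIs.
# Provider names, the SLO target, and the even-split policy are assumptions.

def reweight(success_rate: dict[str, float], slo: float = 0.995) -> dict[str, float]:
    healthy = [p for p, rate in success_rate.items() if rate >= slo]
    if not healthy:                              # everyone degraded: keep an even split
        return {p: 1 / len(success_rate) for p in success_rate}
    share = 1 / len(healthy)
    return {p: (share if p in healthy else 0.0) for p in success_rate}

# Provider A breaches its SLO, so traffic moves to provider B.
print(reweight({"provider_a": 0.97, "provider_b": 0.999}))  # {'provider_a': 0.0, 'provider_b': 1.0}
```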
10) Regulatory, Security, and PR
Security/P0: isolation, collection of artifacts, minimization of disclosure, mandatory notifications (internal/external/regulator).
Legal: approves the wording of external updates, tracks contractual SLAs/penalties.
PR/Customer Service: ready-made response templates, Q&A, compensations/credits (if applicable).
11) Message templates
Initial (T+15):
- "We are investigating a SEV-1 incident affecting [function/region]. Symptoms: [brief]. Workaround activated: [description]. Next update at [time]."
- "Diagnosis: [hypothesis/confirmation]. Actions: [switched provider/rolled back release/enabled degradation]. Impact reduced to [percentage/cohort]. Next update at [time]."
- "The SEV-1 incident has been resolved. Root cause: [root]. Recovery time: [MTTR]. Next steps: [fix/checks/monitoring for N hours]. Post-mortem: [when/where]."
12) Playbooks (examples)
Falling payment success rate: reduce provider A's traffic share and shift X% to B; enable degrade-payments-UX; enable retries within limits; notify the finance team (a data representation sketch follows this list).
Growing API p99: scale down the canary of the new version; turn off heavy features; increase cache TTLs; check DB indexes/connections.
DNS/TLS/CDN problem: verify the certificates/chain; update the DNS record; switch to the standby CDN; rebuild the cache.
Suspected security incident: isolate nodes, rotate keys, enforce mTLS on endpoints, collect artifacts, notify Legal.
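A minimal sketch of how the first playbook above could be kept as versioned, parameterized data (in line with the playbook catalog in section 18); the step wording and parameter names are assumptions.

```python
# Illustrative representation of a playbook as versioned, parameterized data,
# so it can be tested and launched from ChatOps. Names and parameters are assumptions.

PAYMENT_SUCCESS_DROP = {
    "name": "payment-success-drop",
    "version": "1.3",
    "parameters": {"shift_pct": 50, "target_provider": "provider_b"},
    "steps": [
        "Reduce provider A's traffic share by {shift_pct}% towards {target_provider}",
        "Enable the degrade-payments-UX feature flag",
        "Enable retries within configured limits",
        "Notify the finance team",
    ],
}

def render_steps(playbook: dict) -> list[str]:
    return [step.format(**playbook["parameters"]) for step in playbook["steps"]]

for step in render_steps(PAYMENT_SUCCESS_DROP):
    print("-", step)
```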
13) De-escalation and "resolved" criteria
An incident is downgraded when:
- SLI/SLO have been stable in the green zone for ≥ N intervals (a check sketch follows this list);
- mitigating actions have been applied and the observation period has passed without regression;
- for the security class, attack vectors are confirmed closed and keys/secrets have been rotated.
Close only after the timeline is recorded and action items have owners and deadlines.
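A minimal sketch of the "stable in the green zone for ≥ N intervals" check; the sample values, SLO target, and N are assumptions.

```python
# Illustrative de-escalation check: the last N interval samples all meet the SLO target.
# Sample values, the SLO target, and N are assumptions for this sketch.

def stable_in_slo(sli_samples: list[float], slo_target: float, n_intervals: int) -> bool:
    recent = sli_samples[-n_intervals:]
    return len(recent) == n_intervals and all(s >= slo_target for s in recent)

availability = [0.991, 0.996, 0.999, 0.9995, 0.9993, 0.9991]
print(stable_in_slo(availability, slo_target=0.999, n_intervals=4))  # True
```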
14) Post-mortem (blameless)
Structure:
1. Facts (timeline, what users and metrics experienced).
2. Root cause (technical/process).
3. What worked/didn't work in escalation.
4. Preventive measures (tests, alerts, limits, architecture).
5. Action plan with deadlines and owners.
6. Link to error budget and revise SLOs/processes.
15) Process Maturity Metrics
Percentage of incidents detected before user complaints.
MTTA by SEV level; time to engage the required role.
Compliance with update intervals (Comm SLA).
Percentage of incidents resolved via playbooks, without manual improvisation.
On-time completion of action items from post-mortems.
16) Anti-patterns
"Somebody do something" - no IC/roles.
Too many voices in the war-room - arguing over theories instead of acting.
Late declaration → time lost gathering people.
No freeze or release annotations - concurrent changes mask the cause.
Lack of external communication - escalating complaints/PR risk.
Closing without a post-mortem and action items - the same mistakes repeat.
17) IC Checklist (Pocket Card)
- Assign a SEV and open the war-room.
- Assign TL, CL, Scribe; confirm on-call engineers are present.
- Enable a release freeze (if SEV-1+).
- Confirm sources of truth: SLI dashboards, synthetics, logs, tracing.
- Approve quick mitigating actions (rollback/flags/failover).
- Send regular updates on schedule.
- Record the criteria for "resolved" and the post-recovery monitoring plan.
- Initiate the post-mortem and assign owners to action items.
18) Embedding in daily operations
Game days: simulations of key scenarios.
Playbook catalog: versioned, tested, with parameters.
Tools: ChatOps commands such as /declare, /page, /status, /rollback (a dispatcher sketch follows this list).
Integrations: ticketing, status page, post-mortems, CMDB/service catalog.
Alignment with SLOs/error budget: auto-escalation triggers and freeze rules.
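A minimal sketch of a dispatcher for such ChatOps commands; the handler names and bodies are assumptions, and a real bot would call the chat, paging, and ticketing APIs.

```python
# Illustrative ChatOps command dispatcher for /declare, /status, etc.
# Handler names and behavior are assumptions; a real bot would integrate with
# Slack/Teams, paging, and ticketing APIs.

def declare(sev: str, summary: str) -> str:
    # Would open the war-room, page on-call, and create the incident ticket.
    return f"Declared {sev}: {summary}. War-room opened, on-call paged."

def status(incident_id: str) -> str:
    # Would read the incident ticket and post the current one-pager.
    return f"Status for {incident_id}: mitigating, next update in 15 min."

COMMANDS = {"/declare": declare, "/status": status}

def handle(command: str, *args: str) -> str:
    handler = COMMANDS.get(command)
    return handler(*args) if handler else f"Unknown command: {command}"

print(handle("/declare", "SEV-1", "payment success rate below SLO"))
```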
19) The bottom line
Escalation is an operational discipline, not just a call to the on-call engineer. Clear SEV levels, an assigned IC, ready-made playbooks, update timeboxes, and integration with SLO metrics and error-budget policies turn a chaotic fire drill into a manageable process with a predictable outcome: fast service recovery, minimal PR/regulatory risk, and systemic improvements after each incident.