Roles and Responsibilities in Operations
1) Why formalize roles
Clear role allocation reduces MTTA/MTTR, eliminates grey areas, speeds up releases, and makes SLO and compliance adherence reproducible. Role = responsibility + authority + interfaces (whom to notify, whom to escalate to, which decisions are authorized).
2) Basic RACI model
R (Responsible) - performs the work.
A (Accountable) - bears the final responsibility and makes decisions.
C (Consulted) - expert, consulted before/during.
I (Informed) - kept informed within the agreed SLA.
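A RACI matrix is easy to check mechanically. The sketch below is a minimal illustration, not a prescribed tool: it assumes a hypothetical in-memory layout (process name mapped to letters, each letter mapped to a list of role names) and applies the common convention of exactly one A and at least one R per process. That convention is an assumption layered on top of the definitions above.

```python
from typing import Dict, List

# Hypothetical layout: process name -> {RACI letter: [role names]}
RaciRow = Dict[str, List[str]]


def validate_raci(matrix: Dict[str, RaciRow]) -> List[str]:
    """Return human-readable problems; an empty list means the matrix passes."""
    problems: List[str] = []
    for process, row in matrix.items():
        unknown = set(row) - {"R", "A", "C", "I"}
        if unknown:
            problems.append(f"{process}: unknown letters {sorted(unknown)}")
        accountable = row.get("A", [])
        if len(accountable) != 1:
            # Assumed convention: a single accountable role per process.
            problems.append(
                f"{process}: expected exactly one A, found {len(accountable)}"
            )
        if not row.get("R"):
            problems.append(f"{process}: no R assigned")
    return problems


example = {
    "incident": {
        "A": ["ic-of-the-day"],
        "R": ["oncall-payments"],
        "C": ["security", "data"],
        "I": ["mgmt", "comms"],
    },
    # Deliberately broken row: nobody is accountable.
    "releases": {"A": [], "R": ["dev", "platform"]},
}

for problem in validate_raci(example):
    print(problem)
```

A check like this could run in CI against the RACI file of each service, so that a missing A is caught before an incident rather than during one.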
3) Role catalogue (descriptions and responsibilities)
3.1 Incident Commander (IC)
Purpose: leads the response to SEV-1/0 incidents.
Authority: declare SEV, freeze releases, switch traffic, escalate.
Main tasks: maintaining the timeline, decision making, keeping the team focused, task allocation, Go/No-Go.
Artifacts: incident card, updates within the comms SLA, final AAR.
3.2 P1/P2 On-Call (Primary/Secondary)
Purpose: initial response and technical actions.
P1: triage, running playbooks, communication with IC.
P2: backup, complex changes, context retention; during alert storms, takes over sub-streams.
3.3 SRE / Platform Engineer
Purpose: platform reliability and guardrails (SLO, alerts, GitOps, autoscaling, DR).
Tasks: SLI/SLO, alert hygiene, progressive releases, infrastructure as code, capacity, observability.
During an incident: root-cause diagnostics, rollbacks/fallbacks, enabling degraded-UX modes.
3.4 Service Owner / Product Owner
Purpose: quality of service in a business sense.
Tasks: defining SLOs/priorities, coordinating releases/windows, participating in Go/No-Go.
Comms: decides, together with Comms, when and what to tell customers.
3.5 Release Manager
Purpose: safe change delivery.
Tasks: release orchestration, gate checks, canary/blue-green rollouts, release annotations, freezes during incidents.
3.6 CAB Chair / Change Manager
Purpose: change risk management.
Tasks: RFC process, implementation/backout plans, change-conflict calendar, high-risk approvals.
3.7 RCA Lead / Problem Manager
Purpose: post-incident review, CAPA.
Tasks: timeline, evidence-based causality, corrective/preventive actions, follow-up checks at D+14/D+30.
3.8 Security (IR Lead, AppSec/CloudSec)
Purpose: security and incident response.
Tasks: triage of security events, key rotation, isolation, forensics, regulatory notifications, WORM audit logs.
3.9 DataOps / Analytics
Purpose: reliability of data and pipelines.
Tasks: freshness/quality (DQ), data contracts, lineage, backfills, SLAs for BI/reports.
3.10 FinOps
Purpose: cost management.
Tasks: quotas/limits, $/unit reports, budget gates, optimizations (log volumes, egress, reservations).
3.11 Compliance / Legal
Purpose: regulatory and contractual compliance.
Tasks: notification deadlines, retention/immutability of evidence, approval of public statements.
3.12 Support / Comms
Purpose: communications with customers/internal stakeholders.
Tasks: status page, update templates, frequency and clarity of messages, collection of feedback.
3.13 Vendor Manager / Provider Owner
Purpose: relations with external providers (PSP/KYC/CDN, etc.).
Tasks: escalation, SLA/OLA, backup routes, maintenance-window coordination.
4) Roles on shift and in escalation
Shift: P1/P2 + IC-of-the-day (the IC must not be combined with P1).
Time-based escalation: P1 → P2 (5 min without ack) → IC (10 min) → Duty Manager (15 min).
Quiet hours: P2/P3-priority signals do not wake anyone up; security signals always do.
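The escalation chain and the quiet-hours rule can be written down as two small functions. This is a sketch under stated assumptions: the thresholds are read as minutes since the initial page, the quiet-hours window (22:00 to 07:00) is hypothetical, and the priority labels "P1"/"P2"/"P3" in `should_page` refer to alert priority, not to the Primary/Secondary on-call roles.

```python
from datetime import time
from typing import List, Tuple

# Minutes without ack -> role paged at that point (taken from the chain above).
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "P1"),
    (5, "P2"),
    (10, "IC"),
    (15, "Duty Manager"),
]


def paged_roles(minutes_without_ack: float) -> List[str]:
    """Roles that should have been paged after the given time without ack."""
    return [role for threshold, role in ESCALATION_STEPS
            if minutes_without_ack >= threshold]


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end); the window may cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_page(priority: str, is_security: bool, now: time,
                quiet_start: time = time(22, 0),   # assumed window start
                quiet_end: time = time(7, 0)) -> bool:  # assumed window end
    """Security signals always page; P2/P3 alerts are held during quiet hours."""
    if is_security:
        return True
    if priority in {"P2", "P3"} and in_quiet_hours(now, quiet_start, quiet_end):
        return False
    return True


print(paged_roles(7))                          # roles paged after 7 minutes
print(should_page("P2", False, time(3, 0)))    # low priority, 03:00
print(should_page("P2", True, time(3, 0)))     # security signal, 03:00
```

Keeping the thresholds in one table means the paging tool and the role cards can be generated from the same source.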
5) Interaction interfaces (who works with whom, and how)
IC ↔ Release Manager: freeze/rollback decisions.
IC ↔ Comms: update texts and frequency.
SRE ↔ DataOps: business SLIs (payment success, data freshness) in SLO guardrails.
Security ↔ Legal: reports of security incidents, notification periods.
Vendor Owner ↔ IC: provider status, switchover/fallback.
6) KPIs by role (benchmarks)
IC: Time-to-Declare, Comms SLA compliance, MTTR for SEV-1/0.
P1/P2: MTTA, Time-to-First-Action, % playbook adherence.
SRE/Platform: SLO coverage, Alert Hygiene, % of successful auto-rollbacks.
Release Manager: Change Failure Rate, On-time windows, Mean Rollback Time.
RCA Lead: Postmortem Lead Time, CAPA Completion/Overdue, Reopen rate ≤ 5–10%.
Security: Mean Time to Contain, Secret/Cert Rotation Time.
DataOps: Freshness SLO Adherence, Backfill Success Rate.
Comms: Status Accuracy, Complaint Rate per Incident.
FinOps: $/unit, % QoQ savings, quota compliance.
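Two of these KPIs, MTTA and Change Failure Rate, reduce to simple arithmetic once the raw events are available. The sketch below assumes a hypothetical input shape (pairs of page and ack timestamps, and plain counts of changes); real data would come from the paging and deployment tools.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mtta_minutes(pages: List[Tuple[datetime, datetime]]) -> float:
    """Mean time to acknowledge: average of (ack - paged) in minutes."""
    if not pages:
        raise ValueError("no pages to average")
    total_seconds = sum((ack - paged).total_seconds() for paged, ack in pages)
    return total_seconds / len(pages) / 60


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Share of changes that caused a failure; 0.0 when nothing was shipped."""
    if total_changes == 0:
        return 0.0
    return failed_changes / total_changes


t0 = datetime(2024, 1, 1, 12, 0)
sample_pages = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=6)),
]
print(mtta_minutes(sample_pages))        # 4.0
print(change_failure_rate(40, 3))        # 0.075
```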
7) Role card templates
7.1 IC Card
Role: Incident Commander
Scope: SEV-1/0 (prod)
Decisions: declare SEV, freeze deploy, traffic shift, rollback/failover
Runbooks: rb://core/ic, rb://comms/status
SLA: TTD ≤10m, first comms ≤15m, updates q=15–30m
Escalations: Duty Manager (15m), Exec On-call (30m)
7.2 P1/P2 Card
Role: Primary/Secondary On-call (service: checkout-api)
Runbooks: rb://checkout/5xx, rb://checkout/rollback
Tools: logs, traces, SLO board, feature flags
SLA: Ack ≤5m, first action ≤10m, handover at shift boundaries
7.3 Release Manager Card
Role: Release Manager
Gates: tests, signatures, active_sev=none, SLO guardrails green 30m
Strategy: canary 1/5/25%, blue-green optional, auto-rollback on burn
Evidence: release annotations, diff configs, dashboards before/after
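The gates on the Release Manager card map naturally to a pre-release check. The following is a minimal sketch, assuming a hypothetical `GateInputs` record filled in by CI and monitoring; the field names are illustrative and the 30-minute default mirrors the "SLO guardrails green 30m" gate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GateInputs:
    tests_passed: bool
    artifacts_signed: bool
    active_sev: Optional[str]   # e.g. "SEV-1", or None when no incident is open
    slo_green_minutes: int      # how long SLO guardrails have been green


def failed_gates(inputs: GateInputs, required_green_minutes: int = 30) -> List[str]:
    """Return the names of gates that block the release (empty list = Go)."""
    failures: List[str] = []
    if not inputs.tests_passed:
        failures.append("tests")
    if not inputs.artifacts_signed:
        failures.append("signatures")
    if inputs.active_sev is not None:
        failures.append("active_sev")
    if inputs.slo_green_minutes < required_green_minutes:
        failures.append("slo_guardrails")
    return failures


print(failed_gates(GateInputs(True, True, None, 45)))       # []
print(failed_gates(GateInputs(True, False, "SEV-1", 10)))
```

Returning the list of failed gates, rather than a single boolean, gives the Release Manager something concrete to put into the release annotation.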
8) Processes and role participation (summary)
A — Accountable, R — Responsible, C — Consulted, I — Informed.
9) Checklists
9.1 Assigning roles
- Each role has an owner, a substitute, and a coverage area.
- Authority (which decisions the role can make) is documented.
- Playbooks and links are attached.
- Response and comms SLAs are published.
- The role is recorded in the CMDB for each service.
9.2 Shift and handover
- Shift card updated (active incidents, risks, windows).
- JIT/JEA accesses verified.
- Confirmation message posted to the channel: "shift handed over/accepted."
9.3 Post-incident
- AAR conducted, RCA assigned.
- CAPA with owners/deadlines, follow-up checks at D+14/D+30.
- Playbooks/alerts/policies updated.
10) Anti-patterns
Unclear "who decides" → delays and duplicate efforts.
IC combined with P1 - loss of leadership.
Public comms without agreement with Legal/Comms.
A release without Release Manager and gates → CFR growth.
No backup for a role (sickness/leave).
"Heroism" instead of process: the situation is saved manually, but the guardrails are never fixed.
Roles are not reflected in the CMDB/Service Catalog → lost escalations.
11) Embedding in tools
ChatOps: commands `/who oncall`, `/declare sev1`, `/freeze`, `/rollback`, `/status update`.
Directory/CMDB: the service has an owner, on-call, SLO, dashboards, playbooks, windows.
Alert-as-Code: Each Page has an owner and a default playbook.
GitOps: IC/Release Manager decisions are reflected in release annotations and tickets.
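One way to tie the ChatOps commands to role authority is a simple allow-list checked before a command runs. The mapping below is an assumption made for illustration (the text only states that the IC may declare a SEV and freeze releases); the role identifiers are hypothetical too.

```python
from typing import Dict, Set

# Assumed command -> roles allowed to run it; "*" means anyone.
ALLOWED_ROLES: Dict[str, Set[str]] = {
    "/who oncall": {"*"},
    "/declare sev1": {"ic"},
    "/freeze": {"ic", "release-manager"},
    "/rollback": {"ic", "release-manager"},
    "/status update": {"ic", "comms"},
}


def is_authorized(command: str, role: str) -> bool:
    """Check a command against the allow-list; unknown commands are rejected."""
    allowed = ALLOWED_ROLES.get(command)
    if allowed is None:
        return False
    return "*" in allowed or role in allowed


print(is_authorized("/freeze", "ic"))           # True
print(is_authorized("/declare sev1", "p1"))     # False
print(is_authorized("/who oncall", "support"))  # True
```

Rejecting unknown commands by default keeps a typo in the bot configuration from silently granting access.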
12) Role distribution maturity metrics
Coverage of roles in directories: 100% of critical services.
On-call SLA: Ack p95 ≤ 5 min; Page Storm p95 under control.
Postmortem SLA: draft ≤ 72h; CAPA completion ≥ 85%.
Change governance: % of high-risk changes with RFC/CAB ≥ 95%.
Comms: SLA adherence ≥ 95%, Complaint Rate ↓ QoQ.
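Several of these targets can be evaluated automatically. The sketch below is illustrative only: it assumes a nearest-rank definition of p95 (other percentile definitions give slightly different values on small samples) and hypothetical input counts; the thresholds are the ones listed above.

```python
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def maturity_report(ack_minutes: List[float],
                    capa_done: int, capa_total: int,
                    covered_services: int, critical_services: int) -> Dict[str, bool]:
    """Compare measured values with the targets from this section."""
    return {
        "ack_p95_ok": p95(ack_minutes) <= 5,
        "capa_completion_ok": capa_total > 0 and capa_done / capa_total >= 0.85,
        "role_coverage_ok": covered_services == critical_services,
    }


report = maturity_report(
    ack_minutes=[1, 2, 2, 3, 4],
    capa_done=18, capa_total=20,
    covered_services=9, critical_services=10,
)
print(report)
```

In this sample the on-call and CAPA targets pass, while role coverage fails because one critical service has no roles recorded.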
13) Mini templates
13.1 RACI for service (YAML file in repo)
service: payments-api
roles:
  owner: team-payments
  oncall: oncall-payments
  ic: ic-of-the-day
raci:
  incident: {A: ic-of-the-day, R: oncall-payments, C: [security, data], I: [mgmt, comms]}
  releases: {A: release-manager, R: [dev, platform], C: security, I: support}
  changes: {A: cab, R: owner, C: [sre, security], I: affected-teams}
  postmortem: {A: rca-lead, R: owner, C: [security, data], I: mgmt}
13.2 Role profile (Markdown)
Role: Duty Manager
Purpose: Escalation and SEV-1/0
Powers: Assign ICs, reallocate resources, approve freeze
Inputs: #war-room channel, SLO dashboards, IC reports
Outputs: resolutions, after-the-fact report, CAPA escalations
14) The bottom line
Operations are robust when roles are transparent, empowered, and built into tools. The role catalog, RACI, clear interfaces and metrics for each role turn incidents, releases and changes into managed processes: decisions are made quickly, risks are controlled, and users see a stable service.