Roles and Responsibilities in Operations

1) Why formalize roles

Clear role allocation reduces MTTA/MTTR, eliminates grey areas, speeds up releases, and makes SLO and compliance adherence reproducible. A role = responsibility + authority + interfaces (whom we notify, to whom we escalate, which decisions are authorized).

2) Basic RACI model

R (Responsible) - performs the work.
A (Accountable) - bears the final responsibility and makes decisions.
C (Consulted) - expert, consulted before/during.
I (Informed) - kept informed within the agreed SLA.

Top-level example:

Process	A	R	C	I
Incidents (SEV-1/0)	IC	P1/P2, SRE, Owning Team	Security, Product, Data	Mgmt, Support
Releases	Release Manager/Owner	Dev, Platform/SRE	Security, QA	Support, Mgmt
Changes (RFC/CAB)	CAB Chair	Service Owner	Security, SRE, Data	Affected teams
Maintenance windows	Service Owner	Platform/SRE	Product, Support	Customers/Partners
Post-mortems	RCA Lead	Owning Team, Scribe	Security, Data, Product	Mgmt
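A matrix like the one above can be checked mechanically. The sketch below is a minimal illustration, assuming the common RACI convention (not stated in the text) that each process has exactly one A and at least one R; the dictionary layout and the function name `validate_raci` are hypothetical.

```python
from typing import Dict, List

# Rows mirror the top-level example table; each cell is a list of roles.
RACI: Dict[str, Dict[str, List[str]]] = {
    "Incidents (SEV-1/0)": {"A": ["IC"], "R": ["P1/P2", "SRE", "Owning Team"],
                            "C": ["Security", "Product", "Data"], "I": ["Mgmt", "Support"]},
    "Releases": {"A": ["Release Manager/Owner"], "R": ["Dev", "Platform/SRE"],
                 "C": ["Security", "QA"], "I": ["Support", "Mgmt"]},
    "Changes (RFC/CAB)": {"A": ["CAB Chair"], "R": ["Service Owner"],
                          "C": ["Security", "SRE", "Data"], "I": ["Affected teams"]},
    "Maintenance windows": {"A": ["Service Owner"], "R": ["Platform/SRE"],
                            "C": ["Product", "Support"], "I": ["Customers/Partners"]},
    "Post-mortems": {"A": ["RCA Lead"], "R": ["Owning Team", "Scribe"],
                     "C": ["Security", "Data", "Product"], "I": ["Mgmt"]},
}


def validate_raci(matrix: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return a list of problems; an empty list means the matrix passes."""
    problems: List[str] = []
    for process, row in matrix.items():
        accountable = row.get("A", [])
        # Assumed convention: a single accountable role per process.
        if len(accountable) != 1:
            problems.append(f"{process}: expected exactly one A, found {len(accountable)}")
        # Someone has to actually do the work.
        if not row.get("R"):
            problems.append(f"{process}: no R assigned")
    return problems
```

Running such a check in CI on the role file keeps "who decides" unambiguous as the matrix evolves.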

3) Role catalogue (descriptions and responsibilities)

3.1 Incident Commander (IC)

Purpose: leads the response to a SEV-1/0 incident.
Authority: declare SEV, freeze releases, switch traffic, escalate.
Main tasks: timeline, decision making, keeping the team focused, task allocation, Go/No-Go.
Artifacts: incident card, updates on the SLA schedule, final AAR.

3.2 P1/P2 On-Call (Primary/Secondary)

Purpose: initial response and technical actions.
P1: triage, running playbooks, communication with the IC.
P2: backup, complex changes, keeping context; during page storms, takes over sub-streams.

3.3 SRE / Platform Engineer

Purpose: platform reliability and guardrails (SLO, alerts, GitOps, autoscaling, DR).
Tasks: SLI/SLO, alert hygiene, progressive releases, infrastructure as code, capacity, observability.
During an incident: root-cause diagnostics, rollbacks/fallbacks, enabling degraded-UX mode.

3.4 Service Owner / Product Owner

Purpose: quality of service in the business sense.
Tasks: defining SLOs/priorities, coordinating releases/windows, participating in Go/No-Go.
Comms: decides, together with Comms, when and what to tell customers.

3.5 Release Manager

Purpose: safe delivery of changes.
Tasks: release orchestration, gate checks, canary/blue-green, release annotations, freezes during incidents.

3.6 CAB Chair / Change Manager

Purpose: change risk management.
Tasks: RFC process, implementation/backout plans, change-conflict calendar, high-risk approvals.

3.7 RCA Lead / Problem Manager

Purpose: post-incident review, CAPA.
Tasks: timeline, evidence-based causality, corrective/preventive actions, follow-up checks at D+14/D+30.

3.8 Security (IR Lead, AppSec/CloudSec)

Purpose: Security and Incident Response.
Tasks: triage security events, key rotation, isolation, forensics, regulatory notifications, WORM audit.

3.9 DataOps / Analytics

Purpose: reliability of data and pipelines.
Tasks: freshness/quality (DQ), data contracts, lineage, backfills, SLAs for BI/reports.

3.10 FinOps

Purpose: managed cost.
Tasks: quotas/limits, $/unit reports, budget gates, optimizations (log volumes, egress, reservations).

3.11 Compliance / Legal

Purpose: regulatory and contractual compliance.
Tasks: notification deadlines, retention/immutability of evidence, approval of public statements.

3.12 Support / Comms

Purpose: communications with customers and internal stakeholders.
Tasks: status page, update templates, frequency and clarity of messages, collecting feedback.

3.13 Vendor Manager / Provider Owner

Purpose: relations with external providers (PSP/KYC/CDN, etc.).
Tasks: escalation, SLA/OLA, backup routes, maintenance-window coordination.

4) Roles in shift and escalation

Shift: P1/P2 + IC-of-the-day (do not combine the IC role with P1).
Time-based escalation: P1→P2 (5 min without ack) → IC (10 min) → Duty Manager (15 min).
Quiet hours: P2/P3 signals do not wake anyone up; security signals always page.
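The escalation ladder and quiet-hours rule above can be sketched as a small lookup. This is an illustration only: it assumes the thresholds are cumulative minutes since the first page, and it reads "P2/P3 signals" as alert priority levels rather than the Secondary on-call role. The names `ESCALATION_LADDER`, `paged_roles` and `should_page` are hypothetical.

```python
from typing import List, Tuple

# (minutes without acknowledgement, role that gets paged at that point).
# Assumption: thresholds count from the initial page to P1.
ESCALATION_LADDER: List[Tuple[int, str]] = [
    (0, "P1"),
    (5, "P2"),
    (10, "IC"),
    (15, "Duty Manager"),
]


def paged_roles(minutes_without_ack: float) -> List[str]:
    """Roles that should have been paged after this much unacknowledged time."""
    return [role for threshold, role in ESCALATION_LADDER
            if minutes_without_ack >= threshold]


def should_page(priority: str, is_security: bool, quiet_hours: bool) -> bool:
    """Apply the quiet-hours rule: low-priority signals wait, security never does."""
    if is_security:
        return True
    if quiet_hours and priority in {"P2", "P3"}:
        return False
    return True
```

For example, `paged_roles(7)` returns `["P1", "P2"]`, and a non-security P2 signal during quiet hours is held until morning.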

5) Interaction interfaces (who talks to whom, and about what)

IC ↔ Release Manager: freeze/rollback decisions.
IC ↔ Comms: update texts and frequency.
SRE ↔ DataOps: business SLIs (payment success, data freshness) in SLO guardrails.
Security ↔ Legal: security incident reports, notification deadlines.
Vendor Owner ↔ IC: provider status, switchover/fallback.

6) KPIs by role (benchmarks)

IC: Time-to-Declare, Comms SLA compliance, MTTR for SEV-1/0.
P1/P2: MTTA, Time-to-First-Action, % of playbooks followed.
SRE/Platform: SLO coverage, alert hygiene, % of successful auto-rollbacks.
Release Manager: Change Failure Rate, on-time windows, Mean Rollback Time.
RCA Lead: Postmortem Lead Time, CAPA Completion/Overdue, Reopen rate ≤ 5–10%.
Security: Mean Time to Contain, Secret/Cert Rotation Time.
DataOps: Freshness SLO Adherence, Backfill Success Rate.
Comms: Status Accuracy, Complaint Rate per Incident.
FinOps: $/unit, % QoQ savings, quota compliance.

7) Role card templates

7.1 IC Card

Role: Incident Commander
Scope: SEV-1/0 (prod)
Decisions: declare SEV, freeze deploy, traffic shift, rollback/failover
Runbooks: rb://core/ic, rb://comms/status
SLA: TTD ≤10m, first comms ≤15m, updates q=15–30m
Escalations: Duty Manager (15m), Exec On-call (30m)
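The comms part of the SLA line in this card can be checked after the fact from status-update timestamps. The following is a minimal sketch, assuming "first comms ≤15m" is measured from the moment the SEV is declared and that 30 minutes is the upper bound on the gap between updates; `comms_sla_breaches` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import List

FIRST_COMMS_LIMIT = timedelta(minutes=15)   # first comms <= 15m
MAX_UPDATE_GAP = timedelta(minutes=30)      # upper bound of the 15-30m cadence


def comms_sla_breaches(declared_at: datetime, updates: List[datetime]) -> List[str]:
    """Return human-readable comms SLA breaches for one incident."""
    if not updates:
        return ["no status updates were sent"]
    ordered = sorted(updates)
    breaches: List[str] = []
    if ordered[0] - declared_at > FIRST_COMMS_LIMIT:
        breaches.append("first comms later than 15m after declaration")
    # Compare each update with the one before it.
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > MAX_UPDATE_GAP:
            minutes = int(gap.total_seconds() // 60)
            breaches.append(f"gap of {minutes}m between updates")
    return breaches
```

The output can feed the IC's "Comms SLA compliance" KPI or be attached to the AAR.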

7.2 P1/P2 Card

Role: Primary/Secondary On-call (service: checkout-api)
Runbooks: rb://checkout/5xx, rb://checkout/rollback
Tools: logs, traces, SLO board, feature flags
SLA: Ack ≤5m, first action ≤10m, handover at shift boundaries

7.3 Release Manager Card

Role: Release Manager
Gates: tests, signatures, active_sev=none, SLO guardrails green 30m
Strategy: canary 1/5/25%, blue-green optional, auto-rollback on burn
Evidence: release annotations, diff configs, dashboards before/after
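The gates and canary strategy in this card lend themselves to a pre-deploy check. The sketch below is illustrative: `ReleaseContext` and its fields are hypothetical stand-ins for whatever the pipeline exposes, and promoting to 100% after the 25% stage is an assumption, since the card only lists the 1/5/25% steps.

```python
from dataclasses import dataclass
from typing import List

CANARY_STEPS = [1, 5, 25]        # percent of traffic per canary stage
REQUIRED_GREEN_MINUTES = 30      # SLO guardrails must be green this long


@dataclass
class ReleaseContext:
    tests_passed: bool
    artifacts_signed: bool
    active_sev: bool             # True while a SEV incident is open
    slo_green_minutes: int       # how long SLO guardrails have been green


def blocked_gates(ctx: ReleaseContext) -> List[str]:
    """Names of the gates that currently block the release."""
    blocked: List[str] = []
    if not ctx.tests_passed:
        blocked.append("tests")
    if not ctx.artifacts_signed:
        blocked.append("signatures")
    if ctx.active_sev:
        blocked.append("active_sev")
    if ctx.slo_green_minutes < REQUIRED_GREEN_MINUTES:
        blocked.append("slo_guardrails")
    return blocked


def next_canary_step(current_percent: int, burning: bool) -> int:
    """Next traffic share: 0 means roll back, 100 means full rollout (assumed)."""
    if burning:
        return 0                 # auto-rollback on error-budget burn
    for step in CANARY_STEPS:
        if step > current_percent:
            return step
    return 100
```

An empty list from `blocked_gates` corresponds to a Go; anything else is a No-Go with the reason already spelled out for the release annotation.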

8) Processes and role participation (summary)

Process	IC	P1/P2	SRE/Platform	Owner	Release	CAB	Security	DataOps	Comms	Vendor
Incident	A	R	R	C	I	I	C	C	R	C
Release	I	I	C	A	R	C	C	C	I	I
RFC/Window	I	I	R	A	C	A	C	C	C	C
Post-mortem	A	R	R	C	C	I	C	C	I	I

A — Accountable, R — Responsible, C — Consulted, I — Informed.

9) Checklists

9.1 Assigning roles

Each role has an owner, a substitute, and a coverage area.
Authority (which decisions the role may make) is documented.
Playbooks and links are attached.
Response and comms SLAs are published.
The role is recorded in the CMDB for each service.

9.2 Shift and handover

Shift card updated (active incidents, risks, windows).
JIT/JEA access verified.
Confirmation message posted to the channel: "shift handed over/accepted."

9.3 Post-incident

AAR conducted, RCA assigned.
CAPA with owners/deadlines, follow-up checks at D+14/D+30.
Playbooks/alerts/policies updated.

10) Anti-patterns

Unclear "who decides" → delays and duplicated effort.
IC combined with P1 → loss of leadership.
Public comms without sign-off from Legal/Comms.
A release without a Release Manager and gates → rising CFR.
No backup for a role (sickness/leave).
"Heroism" instead of process: we rescue things manually but never fix the guardrails.
Roles are not reflected in the CMDB/Service Catalog → lost escalations.

11) Embedding in tools

ChatOps: commands `/who oncall`, `/declare sev1`, `/freeze`, `/rollback`, `/status update`.
Directory/CMDB: each service has an owner, on-call, SLO, dashboards, playbooks, windows.
Alert-as-Code: each page has an owner and a default playbook.
GitOps: IC/Release decisions are reflected in release annotations and tickets.
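The Alert-as-Code rule above (every page has an owner and a default playbook) is easy to enforce as a lint step. The sketch below assumes alert definitions are available as plain dictionaries with `name`, `owner` and `playbook` keys; that schema, the `oncall-checkout` owner and the `checkout-latency` alert are hypothetical.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("owner", "playbook")


def alerts_missing_routing(alerts: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map alert name -> required fields that are absent or empty."""
    problems: Dict[str, List[str]] = {}
    for alert in alerts:
        missing = [field for field in REQUIRED_FIELDS if not alert.get(field)]
        if missing:
            problems[alert.get("name", "<unnamed>")] = missing
    return problems


ALERTS = [
    {"name": "checkout-5xx", "owner": "oncall-checkout", "playbook": "rb://checkout/5xx"},
    # No default playbook: this definition should be flagged.
    {"name": "checkout-latency", "owner": "oncall-checkout"},
]
```

Failing the build when the returned mapping is non-empty prevents ownerless pages from ever reaching production.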

12) Role distribution maturity metrics

Role coverage in directories: 100% of critical services.
On-call SLA: Ack p95 ≤ 5 min; Page Storm p95 under control.
Postmortem SLA: draft ≤ 72h; CAPA completion ≥ 85%.
Change governance: % of high-risk changes with RFC/CAB ≥ 95%.
Comms: SLA adherence ≥ 95%, Complaint Rate ↓ QoQ.
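Two of the metrics above, Ack p95 and role coverage, can be computed directly from on-call and directory data. This is a rough sketch: it uses the nearest-rank percentile definition (one of several in use), and the directory layout with `owner` and `oncall` fields is an assumption.

```python
from typing import Dict, List, Sequence


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 of an empty list is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def role_coverage(services: Dict[str, Dict[str, str]],
                  required: Sequence[str] = ("owner", "oncall")) -> float:
    """Share of services whose required role fields are all filled in."""
    if not services:
        return 0.0
    covered = sum(1 for roles in services.values()
                  if all(roles.get(field) for field in required))
    return covered / len(services)
```

With acknowledgement times in minutes, the on-call target holds when `p95(ack_times) <= 5`; the coverage target holds when `role_coverage(critical_services) == 1.0`.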

13) Mini templates

13.1 RACI for a service (YAML file in the repo)

service: payments-api
roles:
  owner: team-payments
  oncall: oncall-payments
  ic: ic-of-the-day
raci:
  incident:   {A: ic-of-the-day, R: [oncall-payments], C: [security, data], I: [mgmt, comms]}
  releases:   {A: release-manager, R: [dev, platform], C: [security], I: [support]}
  changes:    {A: cab, R: [owner], C: [sre, security], I: [affected-teams]}
  postmortem: {A: rca-lead, R: [owner], C: [security, data], I: [mgmt]}

13.2 Role profile (Markdown)

Role: Duty Manager
Purpose: Escalation and SEV-1/0
Powers: Assign ICs, reallocate resources, approve freeze
Inputs: #war-room channel, SLO dashboards, IC reports
Outputs: resolutions, after-the-fact report, CAPA escalations

14) The bottom line

Operations are robust when roles are transparent, empowered, and built into tools. The role catalog, RACI, clear interfaces and metrics for each role turn incidents, releases and changes into managed processes: decisions are made quickly, risks are controlled, and users see a stable service.
