SOP:
Standardization of operating procedures
1) Why you need it
SOPs are the company's "operating system." Standardization removes chaos and "individual styles," reduces MTTR, alert noise, and incident risk, accelerates onboarding, and makes results reproducible.
Objectives:
- Reduce the variability of actions in incidents and routines.
- Accelerate training and improve the quality of handovers.
- Make processes auditable: audits, metrics, data-driven improvements.
- Ensure compliance with regulatory and internal requirements.
2) Standardization principles
1. Uniform format and terminology. One notation, one definition (SLO, ETA, Owner).
2. Actionable, not an encyclopedia. Only verifiable steps, success criteria, and rollback.
3. Minimal branching. Clear if/then decisions instead of improvisation.
4. Versioning and ownership. Each SOP has an owner, version, and revision date.
5. Integration with tools. Links to dashboards, tickets, feature flags, CLI commands.
6. Availability on call. Quick to find, read, and execute from a single link.
7. Continuous improvement. Post-mortems → SOP update tasks.
3) SOP framework (template)
4) SOP classification
Incident: P1/P2 (critical), P3 (important).
Operational routines: releases, feature flags, database migrations, provider failover.
DR/BCP: taking a region out of service, restoring from backup, working offline.
Quality control/audit: reviews, readiness questionnaires, access checks.
Security/compliance: KYC/AML checks, log storage, privacy.
5) RACI: Ownership and Responsibility
Process               R (Responsible)   A (Accountable)    C (Consulted)    I (Informed)
--------------------  ----------------  -----------------  ---------------  ------------
Create/Update SOP     Domain Owner      Head of Ops        SRE/Compliance   Teams
SLA review            Ops Enablement    Head of Ops        Domain leads     All
Use in an incident    On-call           Incident Manager   Domain Owner     Stakeholders
6) SOP lifecycle
1. Initiation: need from post-mortem/incident/audit.
2. Draft: by template, with specific artifacts and commands.
3. Review: Domain Owner + Head of Ops + specialized consultants.
4. Publishing: to portal/repository; annotations on dashboards.
5. Training: short training/screencast, knowledge test.
6. Application: recorded in ticket/incident.
7. Audit: on the review SLA schedule or after a significant event.
8. Archiving: mark 'deprecated', indicate replacement.
7) Documentation as code (minimum standard)
Store SOPs in Git (Markdown + YAML metadata) with PR review and CI linting.
Required fields are 'owner', 'version', 'last_review', 'sla_review'.
Run a link checker and a structure validator in CI; publish to the portal automatically after merge.
Significant changes go through the changelog and notifications in the #ops channel.
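The required-field check can be sketched as a small CI step. This is a minimal illustration, not a reference implementation. It assumes each SOP file starts with a flat 'key: value' block between '---' lines, that 'last_review' is a YYYY-MM-DD date, and that 'sla_review' is a number of days; none of these details are fixed by the standard above. The function names are hypothetical.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("owner", "version", "last_review", "sla_review")


def parse_front_matter(text: str) -> dict[str, str]:
    """Read flat `key: value` pairs from a leading '---' ... '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def validate_sop(text: str, today: date) -> list[str]:
    """Return a list of problems; an empty list means the SOP passes."""
    meta = parse_front_matter(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not meta.get(name)]
    if meta.get("last_review") and meta.get("sla_review"):
        try:
            reviewed = date.fromisoformat(meta["last_review"])
            sla_days = int(meta["sla_review"])  # assumption: SLA expressed in days
        except ValueError:
            problems.append("last_review must be YYYY-MM-DD and sla_review an integer")
        else:
            if today - reviewed > timedelta(days=sla_days):
                problems.append("review overdue")
    return problems
```

In CI, the job would fail when any file returns a non-empty list; a real pipeline would more likely use a full YAML parser than this flat reader.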
8) SOP integrations
Incident Manager: an "Open SOP" button when creating/escalating an incident.
Grafana/Observability: links from panels to relevant SOPs; release annotations.
Feature Flags/Release: canary step templates, SLO gates, rollback.
AI assistant: RAG search over SOPs, TL;DR summaries, and suggested actions.
BCP/DR: the DR playbook is loaded automatically on trigger.
9) SOP quality check (KPI and review)
KPI:
Coverage: ≥ 90% of critical scenarios are covered by an SOP.
Review SLA: ≤ 180 days (share of overdue SOPs = 0).
Usage Rate: ≥ 70% of incidents are handled with an SOP opened.
DoD Pass Rate: ≥ 90% of steps are closed with their success criteria met.
Broken Links: 0 (per CI).
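The KPI thresholds above can be checked mechanically once the counts are collected. The sketch below is illustrative only: the 'KpiInputs' structure and its field names are hypothetical, and how the counts are gathered (incident tracker, CI reports) is left out.

```python
from dataclasses import dataclass


@dataclass
class KpiInputs:
    critical_scenarios: int   # critical scenarios identified
    scenarios_with_sop: int   # of those, covered by an SOP
    incidents: int            # incidents in the reporting period
    incidents_with_sop: int   # incidents where an SOP was opened
    steps_executed: int       # SOP steps executed
    steps_meeting_dod: int    # steps closed with success criteria met
    overdue_sops: int         # SOPs past the 180-day review SLA
    broken_links: int         # broken links reported by CI


def ratio(part: int, whole: int) -> float:
    """Share of part in whole; 0.0 when there is nothing to measure."""
    return part / whole if whole else 0.0


def check_kpis(k: KpiInputs) -> dict[str, bool]:
    """Map each KPI name to True when its target is met."""
    return {
        "coverage": ratio(k.scenarios_with_sop, k.critical_scenarios) >= 0.90,
        "review_sla": k.overdue_sops == 0,
        "usage_rate": ratio(k.incidents_with_sop, k.incidents) >= 0.70,
        "dod_pass_rate": ratio(k.steps_meeting_dod, k.steps_executed) >= 0.90,
        "broken_links": k.broken_links == 0,
    }
```

Treating an empty denominator as 0.0 is a design choice that makes a period with no data fail the check rather than pass silently.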
Weekly monitoring:
Top 5 most used and top 5 most outdated SOPs.
SOP ↔ postmortem linkage: whether preventive actions have been completed.
Noisy SOPs (frequent rollbacks) are candidates for rework.
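The weekly top-5 lists can be produced from two simple inputs. This is a sketch under assumptions: the log of opened SOP IDs and the map of last review dates are hypothetical inputs, and "outdated" is interpreted here as "longest time since last review."

```python
from collections import Counter
from datetime import date


def top_used(sop_ids_opened: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Most frequently opened SOPs with their open counts."""
    return Counter(sop_ids_opened).most_common(n)


def most_stale(last_review: dict[str, date], n: int = 5) -> list[str]:
    """SOP IDs ordered from the oldest review date to the newest."""
    return sorted(last_review, key=lambda sop_id: last_review[sop_id])[:n]
```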
10) Content standards
Steps → specifics: commands/queries/parameters + the expected effect on the metric.
Time requirements: ETA for updates/next steps.
Escalation: clear matrix, contacts, backup channels.
Security: warnings, restrictions; PII/secrets only via vault/links.
Localization: in the on-call language (critical for distributed teams).
11) SOP examples (fragments)
SOP: Canary pause on SLO degradation
Triggers: error_budget_burn > 4x 10m, api_p99 > 1.3×baseline 10m
Steps:
- 1) Pause canary in release-tool
- 2) Check panels "Change Safety" and "API p99"
- 3) Create ticket REG-, specify baseline/window
- DoD: p99 ≤ 1.1×baseline 15m
- Rollback: disable flag completely, postmortem ≤ 72h
SOP: PSP provider failover
Triggers: quota_usage > 0.9 OR outbound_error_rate > 2×baseline 5m
Steps:
- 1) Enable PSP-Y routing (config/button)
- 2) Check deposit conversion and p95 PSP-Y
- 3) Annotations on graphs, update in #incident-channel
- DoD: success_rate ≥ 99.5%, p95 ≤ 300ms 10m
- Rollback: return 20% of traffic to PSP-X once it stabilizes
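The trigger and DoD conditions of the PSP failover fragment can be expressed as two predicates. This is a hedged sketch: the 'PspSnapshot' structure is hypothetical, and it assumes the values are already aggregated over the 5m and 10m windows by the monitoring system.

```python
from dataclasses import dataclass


@dataclass
class PspSnapshot:
    quota_usage: float          # fraction of provider quota used, 0..1
    outbound_error_rate: float  # error rate over the 5m window
    baseline_error_rate: float  # normal error rate for comparison
    success_rate: float         # fraction of successful payments, 10m window
    p95_ms: float               # p95 latency in milliseconds, 10m window


def failover_triggered(s: PspSnapshot) -> bool:
    """quota_usage > 0.9 OR outbound_error_rate > 2x baseline."""
    return s.quota_usage > 0.9 or s.outbound_error_rate > 2 * s.baseline_error_rate


def dod_met(s: PspSnapshot) -> bool:
    """success_rate >= 99.5% and p95 <= 300 ms."""
    return s.success_rate >= 0.995 and s.p95_ms <= 300
```

In use, the trigger would be evaluated on the PSP-X snapshot and the DoD on the PSP-Y snapshot after routing is switched.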
12) Checklists
SOP readiness checklist:
[] The objective and triggers are clear and measurable.
[] Steps include concrete commands/links.
[] DoD/Rollback formulated.
[] Escalations and contacts are relevant.
[] Metadata is filled (owner, version, last_review).
[] Link checker and CI validator pass.
SOP application checklist (in incident):
[] SOP opened from Incident Manager/panel link.
[] The steps are completed and the results recorded.
[] DoD reached or not reached: verified.
[] Actions/deviations are recorded in the ticket.
[] SOP updates/improvements are filed as tasks (if needed).
13) Training and onboarding
Mini-courses on key SOPs (Payments/Bets/Games/KYC).
Shadow on-call shifts with mandatory use of SOPs during training.
Weekly "SOP clinics": 30 minutes of analysis/improvement.
Simulations (game days): practicing DR and incident SOPs.
14) SOP Change Management
RFC via PR, tagged 'minor/major/breaking'.
Breaking changes - with mandatory training and announcement.
Auto-notifications to domain owners and on-call.
Separate "SOP-Release Notes" at the end of each week.
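The rules above map a change tag to a set of follow-up actions. The sketch below is one possible encoding; the 'ChangeTag' enum and the action strings are illustrative, and reading the tag from a PR label is assumed rather than specified.

```python
from enum import Enum


class ChangeTag(Enum):
    MINOR = "minor"
    MAJOR = "major"
    BREAKING = "breaking"


def required_actions(tag: ChangeTag) -> list[str]:
    """Follow-up actions for an SOP change with the given tag."""
    # Every change is announced automatically and lands in the weekly notes.
    actions = [
        "notify domain owners and on-call",
        "include in weekly SOP release notes",
    ]
    # Breaking changes additionally require an announcement and training.
    if tag is ChangeTag.BREAKING:
        actions += ["announcement", "mandatory training"]
    return actions
```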
15) Anti-patterns
Free-form "however it turns out" writing and different templates per team.
SOP without owner/version/revision date.
"Encyclopedic" texts instead of step-by-step actions.
No Rollback/DoD - nothing to check success with.
Broken links, "manual from chat" commands, private "secret" steps.
Invisible SOP changes without recording or training.
16) 30/60/90 - implementation plan
30 days:
Approve SOP template and minimum standards.
Create a repository 'ops-sop/' (docs-as-code), enable CI linters.
Digitize 10-15 critical SOPs (incidents/releases/providers).
Connect Incident Manager and observability panels to SOP links.
60 days:
Reach Coverage ≥ 70% for critical scenarios.
Launch weekly "SOP clinics" and on-call trainings.
Add AI search (RAG) over SOPs and TL;DR cards.
Introduce the Review SLA (180 days) and report overdue SOPs.
90 days:
Coverage ≥ 90%, Usage Rate ≥ 70% of incidents.
Embed DoD/Rollback in all SOPs, close broken links (0).
Tie SOP KPIs to team OKRs (MTTR, Change Failure Rate).
Run a retro and record improvements for the next quarter.
17) FAQ
Q: How does an SOP differ from a runbook?
A: An SOP is a standardized procedure (the regulated "how to"). A runbook is a detailed instruction for a specific case or service. Often, an SOP refers to one or more runbooks.
Q: How much detail should an SOP contain?
A: Just enough for the operator to act without digging through chat. Anything that does not affect the action goes into separate reference materials.
Q: How do you keep SOPs up to date?
A: A review SLA (≤ 180 days), automatic reminders, CI linters, and Usage/DoD metrics. Any deviation during an incident → SOP update task.