Standard Operating Procedures
1) What an SOP is and why it is needed
An SOP (Standard Operating Procedure) is a formalized, validated sequence of steps for a repeatable operation, with clearly defined inputs/outputs, roles, and quality criteria.
The objectives of an SOP are:
- Reduce execution variability and risk.
- Reduce MTTA/MTTR through ready-made actions.
- Compliance and audit: reproducibility, traceability.
- Onboarding: faster learning and a faster shadow → solo transition.
SOP ≠ playbook: a playbook is a decision tree with forks; an SOP is a linear procedure for a specific scenario (or a single playbook branch).
2) Principles of a good SOP
Outcome-driven: focus on the outcome (SLO/business criteria), not just the steps.
Unambiguity: exact commands, parameters, expected effects, and checkpoints.
Safety by default: gates, limits, and backout/rollback are spelled out.
Minimum context: short notes + links to detailed runbooks/diagnostics.
Up to date: review date, owner, version, expiration date.
Executability: JIT/JEA access, precondition checks, artifact templates.
3) Standard SOP structure (skeleton)
ID/Version/Review Date
Name and short purpose (what and why)
Scope (Services/Regions/Tenants, SEV/Risk)
Roles and Responsibilities (RACI: R/A/C/I)
Preconditions (access, windows, stage, reserve, artifacts)
Materials/tools (dashboards, feature flags, repos, keys)
Quality gates (SLO guardrails, quorum of probes, alerts)
Step-by-step instruction (step → command → expected result → verification)
Branches (if X, do Y) [keep to a minimum]
Backout/Rollback (start conditions, steps, verification)
Communications (who, when, where; message templates)
Evidence (what to save: screenshots, logs, checksums, links)
Completion (success criteria, observation period, who closes the ticket)
Change History (What, By Whom, and Why)
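The skeleton above maps naturally onto a small data model. The following is a minimal sketch, not a prescribed schema: the class and field names (Sop, Step, raci, and so on) are assumptions chosen to mirror the skeleton, and a real catalog may store the same information as YAML front matter instead.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Step:
    action: str        # what to do
    command: str       # exact command or script reference
    expected: str      # expected result
    verification: str  # how the result is confirmed


@dataclass
class Sop:
    # Hypothetical field names mirroring the skeleton sections.
    sop_id: str
    version: str
    review_date: date
    name: str
    purpose: str
    scope: str
    raci: dict[str, str]  # role -> "R", "A", "C" or "I"
    preconditions: list[str]
    gates: list[str]
    steps: list[Step]
    backout: list[str]
    comms: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    completion: str = ""
    history: list[str] = field(default_factory=list)

    def accountable(self) -> list[str]:
        """Return the roles marked as Accountable in the RACI map."""
        return [role for role, letter in self.raci.items() if letter == "A"]
```

Keeping step, expected result, and verification as separate fields makes it easy for a linter to reject a step that has a command but no verification.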
4) SOP catalog and ownership
Single repository (Docs-as-Code) with tags: 'domain/ops', 'service/checkout', 'risk/high', 'provider/psp-a'.
Owner card: team, duty contacts, backup owner.
Freshness SLA (e.g. review at least every 90 days or after an incident/release).
SOP linter/validator (CI): checks structure, links, owners, and review period.
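A CI validator of this kind can be very small. The sketch below assumes each SOP's metadata has already been parsed into a dictionary; the field names in REQUIRED_FIELDS are illustrative assumptions, and the 90-day limit follows the example SLA above.

```python
from datetime import date, timedelta

# Hypothetical metadata keys; adjust to the fields your template actually uses.
REQUIRED_FIELDS = ("id", "version", "owner", "review_date", "steps", "backout")
MAX_REVIEW_AGE = timedelta(days=90)


def lint_sop(meta: dict, today: date) -> list[str]:
    """Return a list of human-readable problems; an empty list means the SOP passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not meta.get(name)
    ]
    review_date = meta.get("review_date")
    if isinstance(review_date, date) and today - review_date > MAX_REVIEW_AGE:
        problems.append(f"review overdue: last reviewed {review_date.isoformat()}")
    return problems
```

In CI, a non-empty result would fail the build, so a stale or ownerless SOP cannot be merged silently.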
5) SOP lifecycle
1. Initiation (after incident/drill/new process).
2. Draft (author = service/process owner).
3. Review (SRE/Security/Legal/Comms - by domain).
4. Pilot (tabletop/game day): measure timing; findings → edits.
5. Publication (version, date, number, templates in CMDB/service catalog).
6. Operational application (annotations in tickets/chats, evidence collection).
7. Update (by RCA/CAPA, by review deadline, by architecture changes).
8. Archiving/deprecation (replaced by a new SOP/playbook).
6) Relationship to neighboring artifacts
Playbooks: an SOP is a "linear branch" inside a playbook; playbook steps reference it.
Runbooks: technical details/scripts live in the runbook; the SOP links to it.
Policies (Policy-as-Code): access gates, permissions, RBAC - mandatory links.
SLO/SLI: success criteria and guardrails.
Escalation matrix: roles/timings when SOP execution fails.
Maintenance windows: slot/comms requirements for high-risk SOPs.
7) SOP performance metrics
Time-to-Execute (median/p95) - how long the procedure takes.
Success Rate - the share of runs completed without escalation/rollback.
Evidence Completeness - how complete the saved artifacts are.
SLO Impact - whether there is any degradation during/after a step (burn-minutes).
Defect Density - review/exercise findings per 10 SOPs.
Freshness - the share of SOPs reviewed within the last 90 days.
Adoption - how many alerts/windows are actually tied to the SOP.
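Two of these metrics, Time-to-Execute and Success Rate, can be computed directly from run records. This is a minimal sketch under assumptions: the record keys (minutes, escalated, rolled_back) are hypothetical, and p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math
from statistics import median


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_runs(runs: list[dict]) -> dict:
    """Summarize SOP runs; each run has 'minutes', 'escalated', 'rolled_back'."""
    if not runs:
        raise ValueError("no runs to summarize")
    durations = [run["minutes"] for run in runs]
    clean = [run for run in runs if not run["escalated"] and not run["rolled_back"]]
    return {
        "tte_median": median(durations),
        "tte_p95": percentile(durations, 95),
        "success_rate": len(clean) / len(runs),
    }
```

With only a handful of runs, p95 is effectively the slowest run, so it is worth reporting the sample size next to it.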
8) SOP Author Checklist
- Purpose and scope defined.
- Roles, access, and windows described.
- Quality gates and SLOs are measurable and have signal sources.
- Steps are executable: commands/scripts, expected results, verification.
- Backout/rollback and its trigger criteria are clear.
- Comms templates are attached.
- The evidence list is structured.
- Version/date/owner/review specified.
9) SOP execution checklist
- Preconditions and JIT/JEA access confirmed.
- Ticket/war-room is open and annotations are enabled.
- Observability: the necessary dashboards/alerts are open.
- Steps followed in order; verification after each.
- On a guardrail violation - immediate backout and escalation.
- Evidence is complete; final SLO/business SLI check done.
- Ticket closed, status page/comms updated.
10) SOP examples (fragments)
10.1 SOP: Canary release rollback (REL-ROLLBACK-01)
Goal: return to the stable version when the burn-rate threshold is exceeded or p99 grows.
Scope: checkout-api service (prod, EU).
Roles: Release (R), IC (A in SEV-1), P1 (R), Comms (I).
Preconditions: feature flags are ready; JEA access granted; release annotations enabled.
Gates: slo.payment_success, http_p99; quorum of synthetic EU/US + RUM.
Steps:
1) Freeze unrelated deploys.
2) Roll back to tag v2.3.7 (command...) → wait 5 minutes.
Expected: p99↓, error_rate↓, burn-rate < threshold.
3) Business SLI check (payment success, conversion) for 10 min.
4) Remove alert suppression; update the release annotation.
Backout: if the rollback does not help - escalate to IC, enable degrade-UX, consider failover.
Comms: "Rolled back; metrics are stabilizing; next update in 15 minutes."
Evidence: before/after screenshots, links to dashboards, command and output.
Completion: 30 min of green SLOs; close the ticket; assign an RCA (if SEV-1).
Version: 1.6 (2025-10-28)
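The rollback trigger in this SOP (burn rate exceeded or p99 growth) can be expressed as a small check. The sketch below uses the usual definition of burn rate as the observed error rate divided by the error budget (1 - SLO target). The default thresholds (14.4 for burn rate, 1.5x for p99 growth) are illustrative assumptions, not values taken from this SOP.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    return error_rate / budget


def should_rollback(
    error_rate: float,
    slo_target: float,
    p99_ms: float,
    baseline_p99_ms: float,
    burn_threshold: float = 14.4,   # assumed threshold, tune per service
    p99_growth_limit: float = 1.5,  # assumed: roll back if p99 exceeds 1.5x baseline
) -> bool:
    """True when either the burn-rate or the p99 condition is violated."""
    too_fast_burn = burn_rate(error_rate, slo_target) > burn_threshold
    p99_regressed = p99_ms > baseline_p99_ms * p99_growth_limit
    return too_fast_burn or p99_regressed
```

For example, a 2% error rate against a 99.9% SLO gives a burn rate of about 20, which exceeds the assumed threshold and would trigger the rollback.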
10.2 SOP: Scheduled DB upgrade (MW-DB-UPGRADE-02)
Purpose: upgrade PostgreSQL to a new minor version without data loss.
Scope: payments-db (prod EU), 02:00-04:00 Europe/Kyiv.
Roles: DB Lead (R), SRE (C), Service Owner (A), Comms (R for client communications).
Preconditions: backups verified; replica in sync; test upgrade passed.
Gates: lag ≤ 30s, error_rate < 0.5%, p99 < 400ms, SLO green for 30m.
Steps:
1) Shift traffic to the canary replica 1%→5%→25%; monitor SLIs.
2) Sequentially upgrade secondary nodes → switchover → upgrade the former primary.
3) Restore replication, check consistency.
Backout: promote the stable replica; return the writer; roll back packages.
Comms: notices at T-7/T-2 days and T-60/T-15 min; updates every 30m during the window.
Evidence: migration logs, checksums, p95/p99 graphs.
Completion: 60m of observation without burn; MW report with evidence.
Version: 2.1 (2025-09-12)
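The gates of this SOP are concrete enough to be checked mechanically before each step. The following is a minimal sketch: the thresholds come from the Gates line above, while the metric key names (replication_lag_s, error_rate_pct, p99_ms, slo_green_min) are assumptions about how the monitoring data is exposed.

```python
import operator

# name -> (comparison, limit); limits are taken from the SOP gates,
# key names are hypothetical.
GATES = {
    "replication_lag_s": (operator.le, 30),
    "error_rate_pct": (operator.lt, 0.5),
    "p99_ms": (operator.lt, 400),
    "slo_green_min": (operator.ge, 30),
}


def failed_gates(metrics: dict) -> list[str]:
    """Return the names of gates that fail; a missing metric counts as a failure."""
    failed = []
    for name, (compare, limit) in GATES.items():
        value = metrics.get(name)
        if value is None or not compare(value, limit):
            failed.append(name)
    return failed
```

Treating a missing metric as a failed gate keeps the check safe by default: the procedure does not proceed when a signal source is silent.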
10.3 SOP: PSP Provider Switching (PROV-PSP-SWITCH-01)
Objective: maintain the payment success_ratio when PSP-A degrades.
Trigger: PSP-A red/partial status + success_ratio drop ≥ 2%.
Steps:
1) Set weights: PSP-A 30%, PSP-B 70%.
2) Enable degrade_payments_ux; increase retries (within SLA).
3) Monitor fraud_rate/chargeback-risk for 30m.
Backout: restore the original weights once the SLI has been green for 60m.
Comms: status page (first ≤15m, cadence 30m).
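The trigger and the weight change from this SOP can be sketched as follows. Two assumptions are made loudly here: the success_ratio drop is interpreted as percentage points relative to a baseline, and the status values are plain strings; neither detail is specified in the SOP itself. The function names are hypothetical.

```python
# Weights from step 1 of the SOP.
DEGRADED_WEIGHTS = {"PSP-A": 30, "PSP-B": 70}


def trigger_fired(
    status: str,
    baseline_ratio: float,
    current_ratio: float,
    min_drop_pct: float = 2.0,
) -> bool:
    """True when PSP-A is red/partial AND success_ratio dropped by at least min_drop_pct.

    Assumption: the drop is measured in percentage points against a baseline.
    """
    drop_pct = (baseline_ratio - current_ratio) * 100
    return status in ("red", "partial") and drop_pct >= min_drop_pct


def choose_weights(
    current_weights: dict,
    status: str,
    baseline_ratio: float,
    current_ratio: float,
) -> dict:
    """Return the degraded weights when the trigger fires, else keep the current ones."""
    if trigger_fired(status, baseline_ratio, current_ratio):
        return dict(DEGRADED_WEIGHTS)
    return current_weights
```

Both conditions must hold, so a red provider status alone, without a measurable drop in success_ratio, does not move traffic.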
10.4 SOP: Backup recovery check (DATA-BACKUP-RESTORE-CHECK-03)
Objective: weekly verification of recoverability.
Steps: restore from backup in an isolated environment → hash verification → consistency queries → report.
Success criterion: time-to-restore ≤ 45 min; 100% integrity.
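The hash verification step and the success criterion can be combined into one check. This is a minimal sketch assuming that SHA-256 digests were recorded at backup time and that restored objects are available as bytes; the function and parameter names are hypothetical, and the choice of SHA-256 is an assumption, since the SOP does not name a hash.

```python
import hashlib

MAX_RESTORE_MINUTES = 45  # from the success criterion


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def restore_check(
    expected_hashes: dict[str, str],
    restored: dict[str, bytes],
    restore_minutes: float,
) -> dict:
    """Compare restored objects against recorded digests and check the time limit."""
    mismatched = sorted(
        name
        for name, digest in expected_hashes.items()
        if name not in restored or sha256_hex(restored[name]) != digest
    )
    time_ok = restore_minutes <= MAX_RESTORE_MINUTES
    return {
        "mismatched": mismatched,
        "integrity_ok": not mismatched,
        "time_ok": time_ok,
        "passed": not mismatched and time_ok,
    }
```

The returned dictionary doubles as evidence for the report: it records which objects failed and whether the time limit was met.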
11) Automation around SOPs
SOP templating: skeleton generation with RACI/gates/comms blocks.
Executor bot: steps with checkboxes, timers, cadence reminders, automatic evidence collection.
Integration with CMDB/catalog: each service lists its relevant SOPs.
Telemetry annotations: "SOP-RUN: <ID> step N" → quick parsing.
Admission policies: a deployment/window starts only when the SOP gates are green.
12) Anti-patterns
SOP without an owner or review date - a "dead" document.
Bloated instructions without success criteria and backout.
Inconsistent commands/keys - risk of errors and leaks.
Different versions in the wiki and in the repository - diverging sources of truth.
No evidence - nothing to confirm quality/compliance.
"One SOP for all cases" - executability is lost.
13) Implementation Roadmap (4-6 weeks)
1. Week 1: approve the SOP template, linter, and catalog; select the top 10 scenarios.
2. Week 2: write SOPs for releases/rollback/provider/backups; run tabletop pilots.
3. Week 3: connect the ChatOps bot and telemetry annotations; associate alerts with SOPs.
4. Week 4: set a quarterly review schedule; introduce Freshness/Success Rate metrics.
5. Weeks 5-6: cover 90% of critical operations; add DR/Security SOPs; automate evidence collection.
14) The bottom line
SOPs make operations predictable and verifiable: uniform quality gates, detailed steps, explicit roles, and reversibility. Combined with playbooks, policies, SLOs, and automation, they turn operations into a reliable production line - quick reactions, minimal risk, and clear responsibility.