SOP:
Amal amallaryny standartlaşdyrmak
1) Bu näme üçin zerur?
SOP - bu kompaniýanyň "operasiýa OS". Standartlaşdyrmak bulam-bujarlygy we "aýratyn stilleri" aýyrýar, MTTR-i, aladalaryň sesini we hadysalaryň töwekgelçiligini azaldýar, onbordingi çaltlaşdyrýar we netijeleri köpeldip bolýar.
Maksatlar:- Wakalar we rutinler ýüze çykan halatynda hereketleriň üýtgeýşini azaltmak.
- Okuwy çaltlaşdyrmak we hendowerleriň hilini ýokarlandyrmak.
- Amallary barlanylýan etmek: audit, metrika, maglumatlar boýunça gowulaşmalar.
- Düzgünleşdiriji we içerki talaplaryň berjaý edilmegini üpjün etmek.
2) Standartlaşdyrmagyň ýörelgeleri
1. Bitewi format we terminologiýa. Bir bellik, bir kesgitleme (SLO, ETA, Owner).
2. Actionable, ensiklopediýa däl. Diňe barlanylýan ädimler, üstünlik we yza gaýdyp gelmek ölçegleri.
3. Iň az şahasy. Erkin beýan etmegiň ýerine "eger/onda" anyk çözgütler.
4. Wersiýalaşdyrmak we eýeçilik etmek. Her SOP-iň eýesi, wersiýasy we gözden geçiriliş senesi bar.
5. Gurallar bilen integrasiýa. Dashbordlara, biletlere, aýratynlyklara, CLI buýruklaryna salgylanmalar.
6. On-call elýeterliligi. Çalt gözläň, okaň, bir baglanyşyk bilen ýerine ýetiriň.
7. Yzygiderli gowulaşmak. Postmortemalar → SOP täzelenmesi üçin meseleler.
3) SOP çarçuwasy (şablon)
4) SOP classification
Incident: P1/P2 (critical), P3 (important).
Operational routines: releases, feature flags, database migrations, provider failover.
DR/BCP: disabling the region, restoring from backup, working offline.
Quality control/audit: revisions, readiness questionnaires, access.
Security/compliance: KYC/AML checks, log storage, privacy.
5) RACI: Ownership and Responsibility
Process R (performer) A (responsible) C (consultant) I (notify)
------------------------ --------------- ----------------- --------------- -------------
Create/Update SOP Domain Owner Head of Ops SRE/Compliance Teams
SLA Revision Ops Enablement Head of Ops Domain leads All
Use in an incident On-call Incident Manager Domain Owner Stakeholders
6) SOP lifecycle
1. Initiation: need from post-mortem/incident/audit.
2. Draft: by template, with specific artifacts and commands.
3. Review: Domain Owner + Head of Ops + specialized consultants.
4. Publishing: to portal/repository; annotations on dashboards.
5. Training: short training/screencast, knowledge test.
6. Application: recorded in ticket/incident.
7. Audit: by SLA revision or after a significant event.
8. Archiving: mark 'deprecated', indicate replacement.
7) Documentation as code (minimum standard)
We store SOP in Git (Markdown + YAML metadata), PR review, CI-lint.
Required fields are 'owner', 'version', 'last _ review', 'sla _ review'.
Link checker and structure validator in CI; auto-release portal after merge.
Significant changes - through changelog and notifications in the # ops channel.
8) SOP integrations
Incident Manager: Open SOP button when creating/escalating an incident.
Grafana/Observability: references from panels to relevant SOPs; release annotations.
Feature Flags/Release: canary step templates, SLO gates, rollback.
AI assistant: RAG search by SOP, TL; DR and proposals for action.
BCP/DR: DR-playbook automatically loaded by trigger.
9) SOP quality check (KPI and review)
KPI:
Coverage ≥ 90% of critical scenarios are closed by SOP.
Review SLA ≤ 180 days (share of overdue - 0).
Usage Rate ≥ 70% of overt SOP incidents.
DoD Pass Rate ≥ 90% of steps are closed with success criteria.
Broken Links = 0 (по CI).
Weekly monitoring:
Top 5 used and top 5 obsolete SOPs.
SOP communication ↔ postmortems: whether Preventive Actions have been performed.
Noisy SOPs (frequent rollback returns) are candidates for recycling.
10) Containment standards
Steps → specifics: commands/queries/parameters + expected effect in metric.
Time requirements: ETA for updates/next steps.
Escalation: clear matrix, contacts, backup channels.
Security: warnings, restrictions, PII/secrets - via vault/links.
Localization: in the on-call language (critical for distributed commands).
11) SOP examples (fragments)
SOP: Canary pause in SLO degradation
Triggers: error_budget_burn > 4x 10m, api_p99 > 1. 3×baseline 10m
Steps:- 1) Pause canary release-tool (baglanyşyk)
- 2) "Change Safety" we "API p99" panellerini barlaň
- 3) REG-
biletini dörediň, baseline/penjiräni görkeziň - DoD: p99 ≤ 1. 1 × baseline 15m, ýalňyşlyklar
- Rollback: baýdagy doly öçürmek, postmortem ≤ 72h
SOP: PSP Provider Feilover
Triggers: quota_usage>0. 9 OR outbound_error_rate>2×baseline 5m
Steps:- 1) PSP-Y marşrutyny açyň
- 2) Goýumlaryň öwrülişigini barlamak we p95 PSP-Y
- 3) Grafiklerdäki düşündirişler, -channel #incident täzelenme
- DoD: success_rate ≥ 99. 5%, p95 ≤ 300ms 10m
- Rollback: PSP-X durnuklaşanda 20% traffigiň bölekleýin yzyna gaýtarylmagy
12) Çek-listler
SOP taýynlyk barlagy:
[] Maksat we triggerler düşnükli we ölçelip bolýar.
[] Toparlar/baglanyşyklar bilen ädimme-ädim hereketler bar.
[] DoD/Rollback düzüldi.
[] Eskalasiýa we aragatnaşyklar möhümdir.
[] Meta-maglumatlar dolduryldy (owner, wersiýa, last_review).
[] Link çeker we CI tassyklaýjy geçýär.
SOP ulanyş barlagy (hadysada):
[] SOP "Incident Manager" -dan açykdyr.
[] Ädimler ýerine ýetirildi we netijeler hasaba alyndy.
[] DoD ýetildi/ýok - bellendi.
[] Hereketler/laýyk gelmezlikler biletde ýazylýar.
[] SOP täzelenmeleri/gowulaşmalary wezipeler tarapyndan döredildi (zerur bolsa).
13) Okuw we onbording
Esasy SOP (Payments/Bets/Games/KYC) boýunça kiçi kurslar.
Türgenleşiklerde SOP-ni hökmany ulanmak bilen şadow-nobatçylyk.
Hepdelik "SOP-klinikalar": 30 minut gözleg/gowulaşma.
Simulýasiýa (game-days): DR- we hadysaly SOP-leri işlemek.
14) SOP üýtgetmelerini dolandyrmak
PR arkaly RFC, bellikler 'minor/major/breaking'.
Breaking-üýtgeşmeler - hökmany okuw we yglan etmek bilen.
Domen eýelerine we on-kolla awto-habarnamalar.
Her hepdäniň ahyrynda aýratyn "SOP-Release Notes".
15) Anti-patternler
"Nädip bolýar" erkin görnüşi we toparlar boýunça dürli şablonlar.
Eýesiz SOP/wersiýasy/gözden geçiriliş senesi.
Ädimme-ädim hereketleriň ýerine "ensiklopedik" tekstler.
Rollback/DoD ýok - üstünligi barlamak üçin hiç zat ýok.
Döwülen baglanyşyklar, "el bilen söhbetdeşlik" buýruklary, şahsy "gizlin" ädimler.
Ýazylman we öwrenilmän görünmeýän SOP üýtgeşmeleri.
16) 30/60/90 - durmuşa geçirmek meýilnamasy
30 gün:
SOP şablonyny we iň pes standartlary tassyklaň.
'ops-sop/' (docs-as-code) ammaryny dörediň, CI-linterleri goşuň.
10-15 sany möhüm SOP (hadysalar/goýberişler/üpjün edijiler) sanlaşdyryň.
Incident Manager we gözegçilik panellerini SOP baglanyşyklaryna birikdiriň.
60 gün:
Coverage ≥ 70% -e ýetiň.
Hepdelik "SOP-klinikalary" we on-call okuwlaryny başla.
SOP we TL boýunça AI-gözleg (RAG) goşmak; DR kartoçkalary.
SLA synyny (180 gün) we möhleti geçen SOP hasabatyny giriziň.
90 gün:
Coverage ≥ 90%, Usage Rate ≥ 70% hadysalar.
DoD/Rollback programmasyny ähli SOP-lere birleşdiriň, döwülen baglanyşyklary ýapyň (0).
KPI SOP-i OKR (MTTR, Change Failure Rate) bilen baglanyşdyryň.
Retro geçiriň we indiki çärýegiň gowulaşmalaryny ýazga alyň.
17) FAQ
Q: SOP runbook-dan nähili tapawutlanýar?
A: SOP - standartlaşdyrylan amal ("nädip dogry" düzgüni). Runbook - belli bir iş/hyzmat üçin jikme-jik görkezmeler. SOP köplenç bir ýa-da birnäçe runbook-a salgylanýar.
Q: SOP-da näçe bölek bolmaly?
A: Operator söhbetdeşlige "gazmazdan" hereketleri ýerine ýetirip biler ýaly. Herekete täsir etmeýän hemme zat - aýry-aýry salgylanma materiallaryna.
Q: Nädip aktuallygyny saklamaly?
A: SLA barlaglary (180 günden ≤), awtomatiki ýatlatmalar, CI-linterler we Usage/DoD metrikasy. Islendik sapma hadysasy → SOP-ni täzelemek meselesi.