SOP: Standardizing Operating Procedures
1) Why is this needed?
SOPs are the company's "operating system for operations". Standardization removes chaos and "individual styles", reduces MTTR, alert noise, and incident risk, speeds up onboarding, and makes results repeatable.
Goals:
- Reduce variability in actions during incidents and routine operations.
- Speed up training and improve the quality of handovers.
- Make processes verifiable: audits, metrics, data-driven improvements.
- Ensure compliance with regulatory and internal requirements.
2) Standardization principles
1. Single format and terminology. One notation, one definition (SLO, ETA, Owner).
2. Actionable, not an encyclopedia. Only verifiable steps, success criteria, and rollback criteria.
3. Minimal branching. Explicit "if/then" decisions instead of free-form narration.
4. Versioning and ownership. Every SOP has an owner, a version, and a review date.
5. Tool integration. Links to dashboards, tickets, feature flags, and CLI commands.
6. Usable on call. Quick to find, read, and execute from a single link.
7. Continuous improvement. Postmortems → SOP update tasks.
3) SOP skeleton (template)
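A minimal sketch of such a skeleton, assembled only from the fields that sections 7, 10, and 11 name (owner, version, last_review, sla_review, triggers, steps, DoD, rollback, escalation). The class name, the helper, and the sample values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field


@dataclass
class SOP:
    """Skeleton of one SOP; field names are illustrative, not a fixed schema."""
    title: str
    owner: str
    version: str
    last_review: str          # ISO date, e.g. "2024-01-15"
    sla_review: str           # review interval, e.g. "180d"
    triggers: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)
    dod: str = ""             # definition of done (success criteria)
    rollback: str = ""
    escalation: str = ""


def missing_sections(sop: SOP) -> list[str]:
    """Return the names of skeleton sections that are still empty."""
    required = ["owner", "version", "last_review", "sla_review",
                "triggers", "steps", "dod", "rollback", "escalation"]
    return [name for name in required if not getattr(sop, name)]


# Hypothetical draft: owner, version, and dates are placeholder values.
draft = SOP(
    title="Canary pause on SLO degradation",
    owner="release-team",
    version="0.1.0",
    last_review="2024-01-15",
    sla_review="180d",
    triggers=["error_budget_burn > 4x over 10m"],
    steps=["Pause the canary in release-tool"],
)
print(missing_sections(draft))  # ['dod', 'rollback', 'escalation']
```

A draft that still reports missing sections is not ready for review.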
4) SOP classification
Incident: P1/P2 (critical), P3 (important).
Operational routines: releases, feature flags, database migrations, provider failover.
DR/BCP: disabling the region, restoring from backup, working offline.
Quality control/audit: reviews, readiness questionnaires, access checks.
Security/compliance: KYC/AML checks, log storage, privacy.
5) RACI: Ownership and Responsibility
Process             R (Responsible)   A (Accountable)    C (Consulted)    I (Informed)
------------------  ----------------  -----------------  ---------------  -------------
Create/Update SOP   Domain Owner      Head of Ops        SRE/Compliance   Teams
Review per SLA      Ops Enablement    Head of Ops        Domain leads     All
Use in an incident  On-call           Incident Manager   Domain Owner     Stakeholders
6) SOP lifecycle
1. Initiation: a need identified by a postmortem, incident, or audit.
2. Draft: by template, with specific artifacts and commands.
3. Review: Domain Owner + Head of Ops + specialized consultants.
4. Publishing: to portal/repository; annotations on dashboards.
5. Training: short training/screencast, knowledge test.
6. Application: recorded in ticket/incident.
7. Audit: per the review SLA or after a significant event.
8. Archiving: mark 'deprecated', indicate replacement.
7) Documentation as code (minimum standard)
SOPs are stored in Git (Markdown + YAML metadata), with PR review and CI linting.
Required fields are 'owner', 'version', 'last_review', 'sla_review'.
A link checker and a structure validator run in CI; the portal is published automatically after merge.
Significant changes go through the changelog and notifications in the #ops channel.
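The required-fields check can be sketched as a small CI step. This is a simplified illustration: it assumes the metadata is a flat 'key: value' block between '---' lines at the top of the Markdown file and parses it by hand rather than with a full YAML parser. The function names and the sample document are hypothetical.

```python
REQUIRED_FIELDS = ("owner", "version", "last_review", "sla_review")


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse a flat 'key: value' block delimited by '---' lines.

    Simplification: no nested YAML, lists, or multi-line values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def missing_fields(text: str) -> list[str]:
    """Return required metadata fields that are absent or empty."""
    meta = parse_front_matter(text)
    return [name for name in REQUIRED_FIELDS if not meta.get(name)]


# Hypothetical SOP file with one required field left out.
doc = """---
owner: payments-team
version: 1.2.0
last_review: 2024-03-01
---
# SOP: PSP provider failover
"""
print(missing_fields(doc))  # ['sla_review']
```

In CI, a non-empty result would fail the build so that an SOP without complete metadata cannot be merged.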
8) SOP integrations
Incident Manager: an "Open SOP" button when creating or escalating an incident.
Grafana/Observability: links from panels to the relevant SOPs; release annotations.
Feature Flags/Release: canary step templates, SLO gates, rollback.
AI assistant: RAG search over SOPs, TL;DR summaries, and suggested actions.
BCP/DR: the DR playbook is loaded automatically on trigger.
9) SOP quality check (KPI and review)
KPI:
Coverage: ≥ 90% of critical scenarios are covered by an SOP.
Review SLA: ≤ 180 days (share of overdue SOPs: 0).
Usage Rate: ≥ 70% of incidents have an SOP opened.
DoD Pass Rate: ≥ 90% of steps are closed with their success criteria met.
Broken Links: 0 (per CI).
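As a rough illustration, the targets listed above can be evaluated from simple counts. The `SopStats` fields and the sample numbers are assumptions made for this sketch; where the counts come from (incident tracker, CI reports) is left open.

```python
from dataclasses import dataclass


@dataclass
class SopStats:
    critical_scenarios: int        # critical scenarios identified
    scenarios_with_sop: int        # of those, covered by an SOP
    incidents: int                 # incidents in the reporting period
    incidents_with_sop_opened: int
    steps_executed: int
    steps_meeting_dod: int
    overdue_reviews: int
    broken_links: int


def ratio(part: int, whole: int) -> float:
    """Share of part in whole; 0.0 when there is nothing to measure."""
    return part / whole if whole else 0.0


def kpi_report(s: SopStats) -> dict[str, bool]:
    """Compare each KPI with its target; True means the target is met."""
    return {
        "coverage": ratio(s.scenarios_with_sop, s.critical_scenarios) >= 0.90,
        "review_sla": s.overdue_reviews == 0,
        "usage_rate": ratio(s.incidents_with_sop_opened, s.incidents) >= 0.70,
        "dod_pass_rate": ratio(s.steps_meeting_dod, s.steps_executed) >= 0.90,
        "broken_links": s.broken_links == 0,
    }


# Made-up sample counts for one reporting period.
stats = SopStats(40, 37, 20, 13, 200, 186, 0, 2)
print(kpi_report(stats))
```

With these sample counts, coverage (92.5%) and DoD pass rate (93%) meet their targets, while usage rate (65%) and broken links (2) do not.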
Weekly monitoring:
Top 5 used and top 5 obsolete SOPs.
SOP ↔ postmortem linkage: whether Preventive Actions have been completed.
Noisy SOPs (frequent rollbacks) are candidates for rework.
10) Content standards
Steps → specifics: commands/queries/parameters + the expected effect on metrics.
Time requirements: ETA for updates/next steps.
Escalation: clear matrix, contacts, backup channels.
Security: warnings, restrictions; PII/secrets only via vault/links.
Localization: in the on-call language (critical for distributed teams).
11) SOP examples (fragments)
SOP: Canary pause on SLO degradation
Triggers: error_budget_burn > 4x over 10m, api_p99 > 1.3×baseline over 10m
Steps:
1. Pause the canary in release-tool (link)
2. Check the "Change Safety" and "API p99" panels
3. Create a REG- ticket; specify the baseline/window
DoD: p99 ≤ 1.1×baseline for 15m, error …
Rollback: turn the flag fully off; postmortem within 72 hours
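The trigger and the latency part of the DoD for this SOP can be expressed as two small predicates. This is a sketch under stated assumptions: inputs are taken to be already aggregated over the relevant window, and the two triggers are treated as alternatives (OR), which the fragment above does not state explicitly. Function and parameter names are hypothetical.

```python
def should_pause_canary(burn_rate: float, p99_ms: float,
                        baseline_p99_ms: float) -> bool:
    """True when either trigger fires.

    Inputs are assumed to be aggregated over the 10-minute window.
    Combining the two triggers with OR is an assumption.
    """
    return burn_rate > 4.0 or p99_ms > 1.3 * baseline_p99_ms


def p99_recovered(p99_ms: float, baseline_p99_ms: float) -> bool:
    """DoD latency check: p99 within 1.1x baseline (15-minute window)."""
    return p99_ms <= 1.1 * baseline_p99_ms


# Baseline p99 of 200 ms, so the trigger threshold is 260 ms.
print(should_pause_canary(burn_rate=2.0, p99_ms=280.0, baseline_p99_ms=200.0))
print(should_pause_canary(burn_rate=2.0, p99_ms=250.0, baseline_p99_ms=200.0))
print(p99_recovered(p99_ms=215.0, baseline_p99_ms=200.0))
```

Keeping the thresholds in code next to the SOP makes the trigger measurable in the sense of the readiness checklist.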
SOP: PSP provider failover
Triggers: quota_usage > 0.9 OR outbound_error_rate > 2×baseline over 5m
Steps:
1. Enable routing to PSP-Y
2. Check deposit conversion and p95 for PSP-Y
3. Annotate the graphs; post an update in the #incident channel
DoD: success_rate ≥ 99.5%, p95 ≤ 300ms for 10m
Rollback: once PSP-X stabilizes, partially return 20% of traffic
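A minimal sketch of the decision logic in this SOP. It assumes the metrics are already aggregated over their windows and that failover sends all traffic to PSP-Y until PSP-X stabilizes; the SOP itself only says to enable PSP-Y routing, so the 0/100 split is an assumption. All function names are hypothetical.

```python
def failover_needed(quota_usage: float, error_rate: float,
                    baseline_error_rate: float) -> bool:
    """Trigger: quota above 90% OR error rate above 2x baseline (5m window)."""
    return quota_usage > 0.9 or error_rate > 2 * baseline_error_rate


def dod_met(success_rate: float, p95_ms: float) -> bool:
    """DoD: success rate at least 99.5% and p95 at most 300 ms (10m window)."""
    return success_rate >= 0.995 and p95_ms <= 300


def routing_weights(failover_active: bool, psp_x_stable: bool) -> dict[str, int]:
    """Traffic split in percent between the two providers."""
    if not failover_active:
        return {"PSP-X": 100, "PSP-Y": 0}
    if psp_x_stable:
        # Rollback step: partially return 20% of traffic to PSP-X.
        return {"PSP-X": 20, "PSP-Y": 80}
    # Assumption: full failover while PSP-X is still unstable.
    return {"PSP-X": 0, "PSP-Y": 100}


print(failover_needed(quota_usage=0.95, error_rate=0.01, baseline_error_rate=0.01))
print(dod_met(success_rate=0.997, p95_ms=280))
print(routing_weights(failover_active=True, psp_x_stable=True))
```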
12) Checklists
SOP readiness checklist:
[ ] The goal and triggers are clear and measurable.
[ ] Step-by-step actions with commands/links are present.
[ ] DoD/Rollback are formulated.
[ ] Escalation paths and contacts are up to date.
[ ] Metadata is filled in (owner, version, last_review).
[ ] The link checker and CI validator pass.
SOP usage checklist (during an incident):
[ ] The SOP was opened from Incident Manager or panel links.
[ ] Steps were executed and the results recorded.
[ ] DoD achieved/not achieved: recorded.
[ ] Deviations/discrepancies are recorded in the ticket.
[ ] Tasks for SOP updates/improvements have been created (if needed).
13) Training and onboarding
Mini-courses on the core SOPs (Payments/Bets/Games/KYC).
Shadow on-call shifts with mandatory SOP use during exercises.
Weekly "SOP clinics": 30 minutes of review/improvement.
Simulations (game days): practicing DR and adverse-scenario SOPs.
14) Managing SOP changes
RFC via PR, with tags 'minor/major/breaking'.
Breaking changes require mandatory training and an announcement.
Automatic notifications to domain owners and on-call.
Separate "SOP Release Notes" at the end of each week.
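The rules above can be read as a mapping from the PR tag to the follow-up actions a change requires. A minimal sketch, with a hypothetical function name and action strings chosen for illustration:

```python
def required_actions(tag: str) -> list[str]:
    """Follow-up actions for a merged SOP change, by PR tag."""
    if tag not in ("minor", "major", "breaking"):
        raise ValueError(f"unknown change tag: {tag!r}")
    # Every change notifies owners and on-call and lands in the release notes.
    actions = ["notify domain owners and on-call",
               "add to weekly SOP Release Notes"]
    if tag == "breaking":
        # Breaking changes additionally require training and an announcement.
        actions += ["mandatory training", "announcement"]
    return actions


print(required_actions("minor"))
print(required_actions("breaking"))
```

Rejecting unknown tags keeps an untagged or mistyped PR from silently skipping the training step.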
15) Anti-patterns
Free-form "however it turns out" write-ups and different templates across teams.
An SOP without an owner, version, or review date.
"Encyclopedia" texts instead of step-by-step actions.
No Rollback/DoD: nothing to verify success against.
Broken links, "copy it from chat" manual commands, personal "secret" steps.
Invisible SOP changes: unrecorded and with no training.
16) 30/60/90 rollout plan
30 days:
Approve the SOP template and minimum standards.
Create the 'ops-sop/' repository (docs-as-code) and enable CI linters.
Digitize 10-15 critical SOPs (incidents/releases/providers).
Connect Incident Manager and observability panels to SOP links.
60 days:
Reach Coverage ≥ 70% for critical scenarios.
Launch weekly "SOP clinics" and on-call trainings.
Add AI search (RAG) over SOPs and TL;DR cards.
Introduce the Review SLA (180 days) and a report on overdue SOPs.
90 days:
Coverage ≥ 90%, Usage Rate ≥ 70%.
Integrate DoD/Rollback into all SOPs; bring broken links down to 0.
Tie SOP KPIs to team OKRs (MTTR, Change Failure Rate).
Hold a retro and record improvements for the next quarter.
17) FAQ
Q: How does an SOP differ from a runbook?
A: An SOP is a standardized procedure (the "right" way to do it). A runbook is detailed instructions for a specific case or service. An SOP often refers to one or more runbooks.
Q: How much detail should an SOP contain?
A: Exactly enough for the operator to act without digging through chat. Everything that does not affect the outcome belongs in separate reference material.
Q: How do you keep SOPs current?
A: A review SLA (≤ 180 days), automatic reminders, CI linters, and Usage/DoD metrics. Any incident that exposes a defect → an SOP update task.
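The automatic reminders could be driven by a check like the one below, which lists SOPs whose last review is older than 180 days. It is a sketch: the SOP names and dates are made up, and in practice `last_review` would be read from each file's metadata.

```python
from datetime import date, timedelta

REVIEW_SLA = timedelta(days=180)


def overdue_sops(last_reviews: dict[str, date], today: date) -> list[str]:
    """Names of SOPs whose last review is older than the review SLA."""
    return sorted(name for name, reviewed in last_reviews.items()
                  if today - reviewed > REVIEW_SLA)


# Hypothetical SOP names and review dates.
reviews = {
    "canary-pause": date(2024, 1, 10),
    "psp-failover": date(2024, 6, 1),
}
print(overdue_sops(reviews, today=date(2024, 8, 1)))  # ['canary-pause']
```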