Operational Analytics
1) What operational analytics is and why it is needed
Operational analytics (Ops Analytics) combines data from observability (metrics/logs/traces), ITSM (incidents/problems/changes), CI/CD (releases/configs), providers (PSP/KYC/CDN/Cloud), FinOps, and business SLIs (payment success, sign-ups), and turns it into unified data marts and dashboards for decision-making.
Goals:
- reduce MTTD/MTTR through early detection and correct attribution of causes;
- keep SLOs and error budgets under control;
- link changes to impact (releases/configs → SLI/SLO/complaints/costs);
- give teams and management self-service analytics.
2) Sources and the canonical data layer
Telemetry: metrics (SLI/resources), logs (sampling/PII redaction), traces (trace_id/span_id, release tags).
ITSM/incident modules: SEV, T0/Detected/Ack/Declared/Mitigated/Recovered timestamps, RCA/CAPA.
CI/CD & Config: versions, commits, canary/blue-green, flag state, target configs.
Providers: statuses/SLA, latencies, error codes, routing weights.
FinOps: cost by tag/account/tenant, $/unit (per 1k operations).
DataOps: data mart freshness, DQ errors, lineage.
Core principle: a single correlation through shared identifiers: `service`, `region`, `tenant`, `release_id`, `change_id`, `incident_id`, `provider`, `trace_id`.
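To illustrate correlation through shared identifiers, here is a minimal sketch that groups incidents under the change they reference. The records and the helper name `incidents_by_change` are hypothetical; only the identifier names (`service`, `region`, `change_id`, `incident_id`) come from the list above.

```python
from collections import defaultdict

# Hypothetical sample records; a real pipeline would read them from the data marts.
changes = [
    {"change_id": "chg-1", "service": "payments-api", "region": "eu"},
    {"change_id": "chg-2", "service": "auth-api", "region": "us"},
]
incidents = [
    {"incident_id": "inc-1", "service": "payments-api", "region": "eu", "change_id": "chg-1"},
    {"incident_id": "inc-2", "service": "payments-api", "region": "eu", "change_id": None},
]


def incidents_by_change(changes, incidents):
    """Group incident ids under the change_id they reference."""
    known = {c["change_id"] for c in changes}
    grouped = defaultdict(list)
    for inc in incidents:
        key = inc.get("change_id")
        if key in known:  # ignore incidents with no link or an unknown change
            grouped[key].append(inc["incident_id"])
    return dict(grouped)


linked = incidents_by_change(changes, incidents)
print(linked)
```

Incidents without a valid `change_id` are left out of the result, which is what makes missing identifiers visible as an attribution gap.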
3) Unified data model (simplified skeleton)
dim_service(service_id, owner, tier, slo_targets…)
dim_time(ts, date, hour, tz)
dim_region(region_id, country, cloud)
dim_provider(provider_id, type, sla)
fact_sli(ts, service_id, region_id, tenant, metric, value, target, window)
fact_incident(incident_id, service_id, sev, t0, t_detected, t_ack, t_declared, t_mitigated, t_recovered, root_cause, trigger_id, burn_minutes)
fact_change(change_id, type(code|config|infra), service_id, region_id, started_at, finished_at, canary_pct, outcome(ok|rollback), annotations)
fact_cost(ts, service_id, region_id, tenant, cost_total, cost_per_1k)
fact_provider(ts, provider_id, region_id, metric(latency|error|status), value)
fact_dq(ts, dataset, freshness_min, dq_errors)
4) SLI/SLO and business metrics
Business SLIs: `payment_success_ratio`, `signup_completion`, `deposit_latency`.
Technical SLIs: `availability`, `http_p95`, `error_rate`, `queue_depth`.
SLO layer: targets + burn rate (short/long window), automatic annotations of violations.
Normalization: metrics per 1k successful operations/users/traffic.
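The burn-rate idea with a short and a long window can be sketched as follows. Burn rate is taken here as the observed error ratio divided by the error ratio the SLO allows. The function names, the sample counts and the threshold of 2.0 are assumptions for illustration, not values from this document.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error ratio divided by the error ratio allowed by the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_ratio


def should_alert(short_window_rate, long_window_rate, threshold):
    """Alert only when both windows burn faster than the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical counts for a 99.5% SLO
short = burn_rate(bad_events=30, total_events=1_000, slo_target=0.995)
long_ = burn_rate(bad_events=120, total_events=10_000, slo_target=0.995)
alert = should_alert(short, long_, threshold=2.0)
print(round(short, 2), round(long_, 2), alert)
```

Requiring both windows to exceed the threshold filters out short spikes that the long window has not yet confirmed.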
5) Correlation and cause attribution
Releases/configs ↔ SLI/SLO: annotations on charts; cause-and-effect reports (share of change-related incidents; MTTR of change-related incidents).
Providers ↔ business SLIs: routing weights vs latency/errors, each provider's contribution to SLO misses.
Capacity/resources: pool saturation → p95 growth → impact on conversion.
6) Anomalies and forecasting
Anomaly detection: seasonality + percentile thresholds + change-detection features (before/after a release).
Forecasting: weekly/seasonal load patterns, error-budget burn-out forecast, cost prediction ($/unit).
Guardrails: alert only on a quorum of sources (synthetic + RUM + business SLI).
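The quorum guardrail can be sketched as a small function. The source names mirror the text (synthetic, RUM, business SLI); the function name `quorum_alert` and the default quorum of two sources are assumptions.

```python
def quorum_alert(signals, min_sources=2):
    """Fire only when at least `min_sources` independent sources report degradation.

    `signals` maps a source name to True (degraded) or False (healthy).
    Returns a (fire, degraded_sources) tuple.
    """
    degraded = [name for name, is_degraded in signals.items() if is_degraded]
    return len(degraded) >= min_sources, degraded


fire, sources = quorum_alert({"synthetic": True, "rum": False, "business_sli": True})
print(fire, sources)
```

A single noisy probe does not page anyone, while agreement between, say, a synthetic check and a business SLI does.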
7) Data marts and dashboards (reference)
1. Executive 28d: SEV mix, MTTR/MTTD, SLO adherence, $/unit, top causes.
2. SRE Ops: SLI/SLO + burn rate, Page Storm, Actionable %, Change Failure Rate.
3. Change Impact: releases/configs ↔ SLI/SLO/complaints, rollbacks and their effect.
4. Providers: PSP/KYC/CDN status lines, impact on business SLIs, response times.
5. FinOps: cost per 1k txn, logs/egress, cost anomalies, recommendations (sampling, retention).
6. DataOps: data mart freshness, DQ errors, pipeline SLAs, backfill success.
8) Data quality and governance
Event contracts: explicit schemas for incidents/releases/SLIs (required fields, a single time zone).
DQ checks: key completeness, uniqueness, timeline consistency (t0 ≤ detected ≤ ack ...).
Lineage: traceable from dashboard back to source.
PII/secrets: redaction/masking per policy; WORM storage for evidence.
Freshness SLA: Ops data marts lag by no more than 5 minutes.
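A timeline consistency check along the lines of t0 ≤ detected ≤ ack ... might look like the sketch below. The full ordering is an assumption taken from the ITSM timestamp list (T0/Detected/Ack/Declared/Mitigated/Recovered), and the helper names are hypothetical. Only fields that are present are compared; completeness would be a separate check.

```python
from datetime import datetime

# Assumed ordering of incident timestamps, earliest first.
ORDER = ["t0", "detected", "ack", "declared", "mitigated", "recovered"]


def parse_utc(ts):
    """Parse an ISO-8601 timestamp that ends with 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def timeline_errors(incident):
    """Return a list of ordering violations among the timestamps that are present."""
    present = [(name, parse_utc(incident[name])) for name in ORDER if incident.get(name)]
    errors = []
    for (prev_name, prev_ts), (name, ts) in zip(present, present[1:]):
        if ts < prev_ts:
            errors.append(f"{name} is earlier than {prev_name}")
    return errors


good = {"t0": "2025-11-01T12:04:00Z", "detected": "2025-11-01T12:07:00Z", "ack": "2025-11-01T12:09:00Z"}
bad = {"t0": "2025-11-01T12:04:00Z", "detected": "2025-11-01T12:02:00Z", "ack": "2025-11-01T12:09:00Z"}
print(timeline_errors(good), timeline_errors(bad))
```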
9) Maturity metrics for operational analytics
Coverage: % of critical services present in data marts and SLO boards (target ≥ 95%).
Freshness: share of widgets whose data is no more than 5 minutes old (target ≥ 95%).
Actionability: % of transitions from a dashboard to an action (playbook/SOP/ticket) ≥ 90%.
Detection Coverage: ≥ 85% of incidents are detected automatically.
Attribution Rate: share of incidents with a confirmed cause and trigger ≥ 90%.
Change Impact Share: share of incidents related to changes (we monitor the trend).
Data Quality: DQ errors per week, trending down QoQ.
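Two of these metrics can be computed directly from incident records, as in the sketch below. The `detected_by` field and the sample incidents are hypothetical; the targets (85% and 90%) are the ones stated above.

```python
def share_pct(items, predicate):
    """Percentage of items that satisfy the predicate; None for an empty list."""
    if not items:
        return None
    hits = sum(1 for item in items if predicate(item))
    return 100.0 * hits / len(items)


# Hypothetical incidents; `detected_by` is an assumed field, not part of the model.
incidents = [
    {"incident_id": "inc-1", "detected_by": "alert", "root_cause": "code", "trigger_id": "chg-1"},
    {"incident_id": "inc-2", "detected_by": "alert", "root_cause": "provider", "trigger_id": None},
    {"incident_id": "inc-3", "detected_by": "customer", "root_cause": None, "trigger_id": None},
    {"incident_id": "inc-4", "detected_by": "alert", "root_cause": "config", "trigger_id": "chg-7"},
]

metrics = {
    "detection_coverage": share_pct(incidents, lambda i: i["detected_by"] == "alert"),
    "attribution_rate": share_pct(
        incidents, lambda i: bool(i["root_cause"]) and bool(i["trigger_id"])
    ),
}
targets = {"detection_coverage": 85.0, "attribution_rate": 90.0}
meets_target = {name: metrics[name] >= targets[name] for name in targets}
print(metrics, meets_target)
```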
10) Process: from data to action
1. Collect → clean → normalize → data mart (ETL/ELT, feature layer for ML).
2. Detect/forecast → escalate according to the matrix (IC/P1/P2/Comms).
3. Act: playbook/SOP, release gate, feature flag, provider switch.
4. Evidence and AAR/RCA: timeline, charts, links to releases/logs/traces.
5. CAPA and product decisions: prioritize by burn minutes and $ impact.
11) Sample queries (concept)
11.1 Impact of releases on SLO (24 hours)
```sql
-- Incidents and error-budget burn attributed to each change in the last 24 hours
SELECT r.change_id,
       COUNT(i.incident_id) AS incidents,
       -- COALESCE keeps changes without incidents at 0 instead of NULL
       COALESCE(SUM(i.burn_minutes), 0) AS burn_total_min,
       AVG(CASE WHEN i.root_cause = 'code' THEN 1 ELSE 0 END) AS code_ratio
FROM fact_change r
LEFT JOIN fact_incident i
       ON i.trigger_id = r.change_id
WHERE r.started_at >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY burn_total_min DESC;
```
11.2 Share of provider-caused incidents by region
```sql
-- Assumes fact_incident is enriched with region_id and provider_id
SELECT region_id, provider_id,
       SUM(CASE WHEN root_cause = 'provider' THEN 1 ELSE 0 END) AS prov_inc,
       COUNT(*) AS all_inc,
       100.0 * SUM(CASE WHEN root_cause = 'provider' THEN 1 ELSE 0 END) / COUNT(*) AS pct
FROM fact_incident
WHERE t0 >= DATE_TRUNC('month', NOW())
GROUP BY 1, 2
ORDER BY pct DESC;
```
11.3 Cost per 1k successful payments
```sql
-- Daily cost per 1,000 successful payments; NULLIF avoids division by zero
SELECT date(ts) AS d,
       SUM(cost_total) / NULLIF(SUM(success_payments) / 1000.0, 0) AS cost_per_1k
FROM fact_cost c
JOIN biz_payments b USING (ts, service_id, region_id, tenant)
GROUP BY d
ORDER BY d DESC;
```
12) Artifact templates
12.1 Incident event schema (JSON, excerpt)
```json
{
  "incident_id": "2025-11-01-042",
  "service": "payments-api",
  "region": "eu",
  "sev": "SEV-1",
  "t0": "2025-11-01T12:04:00Z",
  "detected": "2025-11-01T12:07:00Z",
  "ack": "2025-11-01T12:09:00Z",
  "declared": "2025-11-01T12:11:00Z",
  "mitigated": "2025-11-01T12:24:00Z",
  "recovered": "2025-11-01T12:48:00Z",
  "root_cause": "provider",
  "trigger_id": "chg-7842",
  "burn_minutes": 18
}
```
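As an illustration of how such an event feeds MTTD/MTTR reporting, the sketch below derives per-incident durations from the timestamps in the example. The definitions used here (time to detect = detected − t0, time to mitigate = mitigated − t0, time to recover = recovered − t0) are assumptions; teams define these intervals differently.

```python
import json
from datetime import datetime

# Timestamps copied from the incident example; other fields omitted for brevity.
incident_json = """
{
  "incident_id": "2025-11-01-042",
  "t0": "2025-11-01T12:04:00Z",
  "detected": "2025-11-01T12:07:00Z",
  "mitigated": "2025-11-01T12:24:00Z",
  "recovered": "2025-11-01T12:48:00Z"
}
"""


def minutes_between(start, end):
    """Minutes between two ISO-8601 UTC timestamps that end with 'Z'."""
    t_start = datetime.fromisoformat(start.replace("Z", "+00:00"))
    t_end = datetime.fromisoformat(end.replace("Z", "+00:00"))
    return (t_end - t_start).total_seconds() / 60


incident = json.loads(incident_json)
time_to_detect = minutes_between(incident["t0"], incident["detected"])
time_to_mitigate = minutes_between(incident["t0"], incident["mitigated"])
time_to_recover = minutes_between(incident["t0"], incident["recovered"])
print(time_to_detect, time_to_mitigate, time_to_recover)
```

Averaging these per-incident values over a reporting window gives the MTTD and MTTR figures shown on the dashboards.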
12.2 Metric catalog (YAML, excerpt)
```yaml
metric: biz.payment_success_ratio
owner: team-payments
type: sli
target: 99.5
windows: ["5m", "1h", "6h", "28d"]
tags: [tier0, region:eu]
pii: false
```
12.3 Executive report card (sections)
1) SEV mix and MTTR/MTTD trends
2) SLO adherence and burn-out risks
3) Change Impact (CFR)
4) Providers: Degradation and switchover
5) FinOps: $/unit, log anomalies/egress
6) CAPAs: Status and Deadlines
13) Tools and architecture patterns
Data Lake + DWH: a "raw" layer for telemetry, data marts for decisions.
Stream processing: near-real-time SLI/burn rate, online features for anomaly detection.
Feature Store: reuse of features (canary, seasonality, provider signals).
Semantic Layer/Metric Store: single definitions of metrics (SLO, MTTR ...).
Access Control: RBAC/ABAC, row-level security.
Catalog/Lineage: search, descriptions, dependencies, owners.
14) Checklists
14.1 Launching operational analytics
- SLI/SLO, SEV, cause and change-type dictionaries are approved.
- Event schemas and a single time zone are in place.
- Connectors for telemetry, ITSM, CI/CD, providers and billing are set up.
- Data marts: SLI/SLO, Incidents, Changes, Providers, FinOps.
- Executive/SRE/Change/Providers dashboards are available.
- Quorum alerts and suppression are aligned with maintenance windows.
14.2 Weekly Ops review
- SEV trends, MTTR/MTTD, SLO misses, burn minutes.
- Change Impact and CFR, rollback status.
- Provider incidents and response times.
- FinOps: $/unit, log/egress anomalies.
- CAPA status, overdue items, priorities.
15) Anti-patterns
A "wall of charts" with no path to action.
Different metric definitions across teams (no semantic layer).
No annotations for releases/maintenance windows, which leads to weak cause attribution.
Relying on averages instead of p95/p99.
No normalization by volume, so large services "look worse".
PII in logs/data marts, retention violations.
Data goes stale (> 5-10 minutes for real-time widgets).
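The anti-pattern of relying on averages instead of p95/p99 is easy to demonstrate with made-up numbers. The sketch below uses a nearest-rank percentile, which is one of several common percentile definitions; the latency sample is hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct / 100 * n)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests and 10 very slow ones
latencies_ms = [100] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile_nearest_rank(latencies_ms, 50)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(mean_ms, p50_ms, p95_ms)
```

The mean (290 ms) suggests a mildly slow service, while p95 (2000 ms) shows that one request in ten takes twenty times longer than the median.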
16) Implementation roadmap (4-8 weeks)
1. Week 1: metric dictionary, event schemas, agreements on ID correlation; SLI/SLO and ITSM connected.
2. Week 2: Incidents/Changes/Providers data marts, release annotations; Executive & SRE dashboards.
3. Week 3: FinOps layer ($/unit) linked to SLIs; quorum-based anomaly detection.
4. Week 4: self-service (semantic layer/metric store), catalog and lineage.
5. Weeks 5-6: load/cost forecasting, provider reports, CAPA data mart.
6. Weeks 7-8: coverage ≥ 95% of Tier-0/1 services, freshness SLA ≤ 5 min, regular Ops reviews.
17) Summary
Operational analytics is a decision-making engine: single definitions of metrics, fresh data marts, correct cause attribution, and a direct path to playbooks and SOPs. In such a system the team quickly detects and explains failures, accurately assesses the impact of releases and providers, manages costs, and systematically reduces risk, while users get a stable service.