GambleHub

Incident metrics

1) Why measure incidents?

Incident metrics turn chaotic incidents into a managed process. They help cut response and recovery times, reduce recurring causes, prove that SLOs and contractual commitments are met, and reveal where to automate. A good metric set covers the whole cycle: detection → classification → escalation → mitigation → recovery → analysis → CAPA.


2) Basic definitions and formulas

Incident intervals

MTTD (Mean Time To Detect) = mean time from T0 (the actual start of impact) to the first signal/detection.
MTTA (Mean Time To Acknowledge) = mean time from the first signal to acknowledgement (ack) by on-call.
MTTM (Mean Time To Mitigate) = mean time until the impact falls back within the SLO threshold (often the time to a workaround or a controlled UX degradation).
MTTR (Mean Time To Recover) = mean time until the target SLI is fully restored.
MTBF (Mean Time Between Failures) = mean interval between related incidents.
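A minimal Python sketch of these interval means, assuming each incident record carries the timestamp fields from the schema in section 10 (t0_actual, t_detected, t_ack, t_mitigated, t_recovered). The sample records are hypothetical. Two choices here are assumptions rather than part of the definitions: MTTM and MTTR are measured from T0, and MTBF is taken as the mean gap between consecutive incident start times (some teams measure from the end of the previous incident instead).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (UTC, ISO 8601); field names follow the schema in section 10.
incidents = [
    {"t0_actual": "2025-11-01T12:04:00+00:00", "t_detected": "2025-11-01T12:07:00+00:00",
     "t_ack": "2025-11-01T12:09:00+00:00", "t_mitigated": "2025-11-01T12:24:00+00:00",
     "t_recovered": "2025-11-01T12:48:00+00:00"},
    {"t0_actual": "2025-11-03T08:00:00+00:00", "t_detected": "2025-11-03T08:05:00+00:00",
     "t_ack": "2025-11-03T08:09:00+00:00", "t_mitigated": "2025-11-03T08:30:00+00:00",
     "t_recovered": "2025-11-03T09:10:00+00:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed from one ISO 8601 timestamp to another."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60


def interval_means(rows):
    """Mean of each per-incident interval, in minutes (assumes every timestamp is present)."""
    return {
        "MTTD": mean(minutes_between(r["t0_actual"], r["t_detected"]) for r in rows),
        "MTTA": mean(minutes_between(r["t_detected"], r["t_ack"]) for r in rows),
        # Assumption: mitigation and recovery are both measured from T0.
        "MTTM": mean(minutes_between(r["t0_actual"], r["t_mitigated"]) for r in rows),
        "MTTR": mean(minutes_between(r["t0_actual"], r["t_recovered"]) for r in rows),
    }


def mtbf_hours(rows):
    """Mean gap between consecutive incident starts (T0), in hours; None if fewer than two incidents."""
    starts = sorted(datetime.fromisoformat(r["t0_actual"]) for r in rows)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]
    return mean(gaps) if gaps else None


print(interval_means(incidents))
print(mtbf_hours(incidents))
```

In reporting, the median (as in the SQL in section 11) is often preferred over the mean, because a single long incident can dominate an average.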

Operational times

Time to Declare - from T0 until the SEV/incident level is officially declared.
Time to Comms - from the declaration to the first public/internal update, measured against the SLA.
Time in State - duration of each stage (triage/diag/fix/verify).
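Time in State can be derived from an ordered log of state transitions. The sketch below is a hedged illustration: the transition log format is a hypothetical assumption, while the declare/comms timestamps reuse the sample incident card from section 16 (T0 12:04, declared 12:11, first update 12:12).

```python
from datetime import datetime

# Hypothetical transition log for one incident: (state entered, UTC timestamp).
transitions = [
    ("triage", "2025-11-01T12:09:00+00:00"),
    ("diag", "2025-11-01T12:14:00+00:00"),
    ("fix", "2025-11-01T12:20:00+00:00"),
    ("verify", "2025-11-01T12:40:00+00:00"),
    ("closed", "2025-11-01T12:48:00+00:00"),
]


def minutes(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60


def time_in_state(log):
    """Minutes spent in each state; a state entered more than once accumulates its time."""
    durations = {}
    for (state, entered), (_, left) in zip(log, log[1:]):
        durations[state] = durations.get(state, 0.0) + minutes(entered, left)
    return durations


time_to_declare = minutes("2025-11-01T12:04:00+00:00", "2025-11-01T12:11:00+00:00")  # T0 -> declared
time_to_comms = minutes("2025-11-01T12:11:00+00:00", "2025-11-01T12:12:00+00:00")  # declared -> first update

print(time_in_state(transitions), time_to_declare, time_to_comms)
```

The terminal state ("closed") gets no duration because nothing follows it; accumulating per state keeps the numbers correct when an incident bounces between fix and verify.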

Frequency and share

Incident Count - number of incidents in the period.
Incident Rate - incidents per 1k/10k/100k successful transactions or requests (normalization).
SEV Mix - distribution by severity (SEV-0 ... SEV-3).
SLA Breach Count/Rate - number/share of external SLA breaches.
Change Failure Rate - % of incidents caused by changes (releases/configs/migrations).
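A small sketch of the frequency metrics, assuming a hypothetical `change_linked` flag on each incident and a hypothetical period volume of successful transactions. The SEV counts and the 4 change-linked incidents mirror the sample executive report in section 16; the transaction volume is made up for illustration.

```python
from collections import Counter

# Hypothetical period: 12 incidents, 4 of them triggered by a release/config/migration.
incidents = (
    [{"sev": "SEV-0", "change_linked": True}]
    + [{"sev": "SEV-1", "change_linked": i == 0} for i in range(3)]
    + [{"sev": "SEV-2", "change_linked": i < 2} for i in range(6)]
    + [{"sev": "SEV-3", "change_linked": False} for _ in range(2)]
)
successful_tx = 3_000_000  # hypothetical volume of successful transactions in the period


def incident_rate(count: int, volume: int, per: int = 100_000) -> float:
    """Incidents per `per` units of volume, e.g. per 100k successful transactions."""
    return count * per / volume


def sev_mix(rows):
    """Incident count per severity level, including levels with zero incidents."""
    counts = Counter(r["sev"] for r in rows)
    return {sev: counts[sev] for sev in ("SEV-0", "SEV-1", "SEV-2", "SEV-3")}


def change_failure_rate_pct(rows):
    """Share of incidents caused by changes, as defined above."""
    if not rows:
        return None
    return 100.0 * sum(r["change_linked"] for r in rows) / len(rows)


print(incident_rate(len(incidents), successful_tx), sev_mix(incidents), change_failure_rate_pct(incidents))
```

Note that this follows the document's per-incident definition; the DORA variant of Change Failure Rate divides by the number of deployments instead.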

Signal and process quality

% Actionable Pages - share of pages that led to meaningful action according to the playbook.
False Positive Rate (Pages) - share of pages that were false alarms.
Detection Coverage - share of incidents detected by automation (rather than reported by customers).
Reopen Rate - share of incidents that recur with the same root cause within 90 days.
CAPA Completion - % of corrective/preventive actions closed on time.
Comms SLA Adherence - share of updates published at the required cadence.
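Each of these ratios is a "share of rows matching a condition", so one helper covers them. A hedged sketch with hypothetical sample data: the page flags (`actionable`, `false_positive`) and the per-CAPA `closed_on_time` flag are assumed field names, while `source_detect` and `reopened_within_90d` come from the schema in section 10.

```python
def share_pct(rows, predicate):
    """Percentage of rows for which predicate(row) is true; None for an empty window."""
    rows = list(rows)
    if not rows:
        return None
    return 100.0 * sum(1 for r in rows if predicate(r)) / len(rows)


# Hypothetical 28-day window.
pages = (
    [{"actionable": True, "false_positive": False}] * 17
    + [{"actionable": False, "false_positive": False}] * 2
    + [{"actionable": False, "false_positive": True}]
)
sources = ("synthetic", "rum", "infra", "infra", "support", "synthetic", "infra", "synthetic")
# Hypothetical: only the RUM-detected incident recurred within 90 days.
incidents = [{"source_detect": s, "reopened_within_90d": s == "rum"} for s in sources]
capas = [{"closed_on_time": True}] * 9 + [{"closed_on_time": False}]

metrics = {
    "actionable_pages_pct": share_pct(pages, lambda p: p["actionable"]),
    "fp_pages_pct": share_pct(pages, lambda p: p["false_positive"]),
    # Detection Coverage: anything not reported via support counts as automated detection.
    "detection_coverage_pct": share_pct(incidents, lambda i: i["source_detect"] != "support"),
    "reopen_rate_pct": share_pct(incidents, lambda i: i["reopened_within_90d"]),
    "capa_on_time_pct": share_pct(capas, lambda c: c["closed_on_time"]),
}
print(metrics)
```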


3) Metrics map by incident stage

| Stage | Key metrics | Question |
| --- | --- | --- |
| Detection | MTTD, Detection Coverage, Source Mix (monitoring vs users) | How quickly is the problem detected, and by whom? |
| Response | MTTA, Time to Declare, Page-to-Ack %, Escalation Latency | Does the team mobilize quickly and assign a SEV? |
| Mitigation | MTTM, Workaround Success %, Change Freeze Latency | How quickly does the impact drop to a safe level? |
| Recovery | MTTR, SLO Burn Stopped Time, Residual Risk Window | When was the service fully restored? |
| Comms | Time to Comms, Comms SLA Adherence, Sentiment/Complaints | Are we communicating on time and well? |
| Learning | Postmortem Lead Time, CAPA Completion/Overdue, Reopen Rate | Are we learning and closing out improvements? |

4) Normalization and segmentation

Normalize counters by volume (traffic, successful operations, active users).
Segment by region/tenant, provider (PSP/KYC/CDN), change type (code/config/infra), time of day (day/night), and detection source (synthetic/RUM/infra/support).
Business SLIs (payment, registration and deposit success) matter most to the business: tie incident metrics to their degradation.
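Normalization matters because raw counts penalize high-volume segments. A minimal sketch, assuming hypothetical per-region volumes of successful transactions; the same helper works for any segment key from the list above (tenant, provider, detection source).

```python
from collections import Counter

# Hypothetical incidents and per-region successful-transaction volumes for one period.
incidents = [{"region": "EU"}, {"region": "EU"}, {"region": "US"}]
volume = {"EU": 2_000_000, "US": 500_000, "APAC": 0}


def rate_by_segment(rows, volumes, key="region", per=100_000):
    """Incidents per `per` units of volume for each segment; None where volume is zero."""
    counts = Counter(r[key] for r in rows)
    return {seg: (counts[seg] * per / vol if vol else None) for seg, vol in volumes.items()}


print(rate_by_segment(incidents, volume))
```

Here EU has more incidents in absolute terms (2 vs 1), yet US is twice as bad per 100k transactions.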


5) Target thresholds (adapt to your domain)

MTTD: ≤ 5 min for Tier-0, ≤ 10-15 min for Tier-1.
MTTA: ≤ 5 min (24/7), ≤ 10 min (follow-the-sun).
MTTM: ≤ 15 min (Tier-0), ≤ 30-60 min (Tier-1).
MTTR: ≤ 60 min (Tier-0), ≤ 4 h (Tier-1).
Detection Coverage: ≥ 85% by automation.
% Actionable Pages: ≥ 80–90%; FP Pages: ≤ 5%.
Reopen Rate (90d): ≤ 5–10%.
CAPA Completion (on time): ≥ 85%.
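Targets can be kept as data and checked automatically. A hedged sketch with several assumptions: for ranges such as "10-15 min" it takes the lenient upper bound, it assumes 24/7 on-call (MTTA ≤ 5 min) for both tiers, and it uses 80% as the Actionable Pages floor and 10% as the Reopen Rate ceiling. The sample input is the figure set from the executive report in section 16.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge}

# Time targets in minutes; assumption: upper end of each range, 24/7 on-call for MTTA.
TIER_TARGETS = {
    "Tier-0": {"MTTD": ("<=", 5), "MTTA": ("<=", 5), "MTTM": ("<=", 15), "MTTR": ("<=", 60)},
    "Tier-1": {"MTTD": ("<=", 15), "MTTA": ("<=", 5), "MTTM": ("<=", 60), "MTTR": ("<=", 240)},
}
# Percentage targets shared by all tiers.
SHARED_TARGETS = {
    "detection_coverage_pct": (">=", 85),
    "actionable_pages_pct": (">=", 80),
    "fp_pages_pct": ("<=", 5),
    "reopen_90d_pct": ("<=", 10),
    "capa_on_time_pct": (">=", 85),
}


def missed_targets(tier, measured):
    """Sorted names of measured metrics that miss their target for the given tier."""
    targets = {**SHARED_TARGETS, **TIER_TARGETS[tier]}
    missed = []
    for name, value in measured.items():
        if name in targets:
            op, limit = targets[name]
            if not OPS[op](value, limit):
                missed.append(name)
    return sorted(missed)


report = {"MTTD": 4, "MTTA": 3, "MTTM": 17, "MTTR": 52, "detection_coverage_pct": 88,
          "actionable_pages_pct": 86, "fp_pages_pct": 3.2, "reopen_90d_pct": 8.3, "capa_on_time_pct": 82}
print(missed_targets("Tier-0", report))
```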


6) Cause attribution and the impact of changes

Assign each incident a primary cause (Code/Config/Infra/Provider/Security/Data/Capacity) and a trigger (release ID, config change, migration, external factor).
Change-linked MTTR/Count - how much releases and configs contribute (a baseline for gate/canary policies).
Track provider-caused incidents (PSP/KYC/CDN/Cloud) separately to manage routing and contracts.
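Attribution pays off once it is aggregated. A minimal sketch, assuming each incident has a `root_cause` from the taxonomy above, a hypothetical `trigger_type` field (the schema in section 10 only has `trigger_id`, so the trigger kind would have to be stored or derived separately) and a per-incident recovery time in minutes. Medians are used instead of means here.

```python
from collections import defaultdict
from statistics import median

CHANGE_TRIGGERS = {"release", "change", "migration"}  # hypothetical trigger kinds

incidents = [
    {"root_cause": "config", "trigger_type": "change", "ttr_min": 35},
    {"root_cause": "code", "trigger_type": "release", "ttr_min": 50},
    {"root_cause": "provider", "trigger_type": "external", "ttr_min": 80},
    {"root_cause": "capacity", "trigger_type": None, "ttr_min": 40},
    {"root_cause": "config", "trigger_type": "change", "ttr_min": 25},
]


def by_cause(rows):
    """Count and median time to recover per primary cause."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["root_cause"]].append(r["ttr_min"])
    return {cause: {"count": len(v), "median_ttr_min": median(v)} for cause, v in groups.items()}


def change_linked(rows):
    """Change-linked count and median TTR: incidents triggered by a release, config change or migration."""
    ttrs = [r["ttr_min"] for r in rows if r["trigger_type"] in CHANGE_TRIGGERS]
    return {"count": len(ttrs), "median_ttr_min": median(ttrs) if ttrs else None}


provider_caused = [r for r in incidents if r["root_cause"] == "provider"]
print(by_cause(incidents), change_linked(incidents), len(provider_caused))
```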


7) Communications and customer impact

Time to First Public Update and Update Cadence (e.g. every 15/30 minutes).
Complaint Rate - tickets/complaints per incident, and their trend.
Status Accuracy - share of public updates published without a retraction.
Post-Incident NPS (for key customers) - a short pulse survey after SEV-1/0.
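Time to First Public Update and cadence adherence can be computed from update timestamps alone. A sketch under one possible interpretation (an assumption, not a standard): every gap between consecutive updates, plus the gap from the last update to resolution, counts as on time if it does not exceed the cadence. The timestamps are hypothetical.

```python
from datetime import datetime


def comms_metrics(declared, updates, resolved, cadence_min):
    """Return (minutes to first update, % of update gaps within cadence)."""
    declared_at = datetime.fromisoformat(declared)
    points = sorted(datetime.fromisoformat(u) for u in updates)
    first = (points[0] - declared_at).total_seconds() / 60 if points else None
    # Gaps between consecutive updates plus the final gap up to resolution.
    points.append(datetime.fromisoformat(resolved))
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(points, points[1:])]
    adherence = 100.0 * sum(g <= cadence_min for g in gaps) / len(gaps) if gaps else None
    return first, adherence


first, adherence = comms_metrics(
    declared="2025-11-05T10:00:00+00:00",
    updates=["2025-11-05T10:05:00+00:00", "2025-11-05T10:20:00+00:00", "2025-11-05T10:38:00+00:00"],
    resolved="2025-11-05T10:45:00+00:00",
    cadence_min=15,
)
print(first, adherence)
```

Here the gaps are 15, 18 and 7 minutes, so one of three intervals misses a 15-minute cadence.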


8) Alerting quality metrics around incidents

Page Storm Index - pages per hour per on-call engineer during an incident (median/p95).
Dedup Efficiency - share of duplicate alerts suppressed.
Quorum Confirmation Rate - share of incidents confirmed by a probe quorum (≥ 2 independent sources).
Conversion of new rules through Shadow → Canary → Prod (Alert-as-Code).
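A hedged sketch of the three alerting-quality ratios. Assumptions: the Page Storm Index is computed per incident as pages / (on-call engineers × incident hours) and summarized across incidents with the median and a nearest-rank p95; "independent sources" means distinct probe types; all inputs are hypothetical.

```python
import math
from statistics import median


def nearest_rank(values, q):
    """Nearest-rank percentile (q in (0, 1]) of a non-empty list."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q * len(ordered)) - 1)]


def page_storm_index(incidents):
    """incidents: (pages, on_call_count, duration_hours) tuples -> (median, p95) pages/hour per on-call."""
    rates = [pages / (on_call * hours) for pages, on_call, hours in incidents]
    return median(rates), nearest_rank(rates, 0.95)


def dedup_efficiency_pct(raw_alerts, pages_sent):
    """Share of raw alerts suppressed as duplicates before anyone was paged."""
    return 100.0 * (raw_alerts - pages_sent) / raw_alerts if raw_alerts else None


def quorum_confirmation_pct(incident_sources):
    """Share of incidents confirmed by at least two distinct probe sources."""
    if not incident_sources:
        return None
    confirmed = sum(1 for sources in incident_sources if len(set(sources)) >= 2)
    return 100.0 * confirmed / len(incident_sources)


storm = page_storm_index([(12, 2, 1.5), (4, 1, 2.0), (30, 3, 1.0), (6, 2, 3.0)])
dedup = dedup_efficiency_pct(raw_alerts=400, pages_sent=40)
quorum = quorum_confirmation_pct([["synthetic", "rum"], ["infra"], ["synthetic", "infra", "rum"], ["rum", "rum"]])
print(storm, dedup, quorum)
```

The last quorum sample shows why sources are deduplicated: two signals from the same probe type do not count as independent confirmation.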


9) Dashboards (minimal set)

1. Executive (28 days): incident count, SEV distribution, MTTR/MTTM, SLA breaches, Reopen, CAPA.
2. SRE Operations: MTTD/MTTA by hour/shift, Page Storm, Actionable %, Detection Coverage, Time to Declare/Comms.
3. Change Impact: share of incidents linked to releases/configs, MTTR for change-linked incidents, maintenance windows, etc.
4. Providers: incidents by provider, degradation time, route switches, contractual SLA.
5. Heatmap by service/region: incidents per 1k transactions and MTTR.

Overlay SLI/SLO charts with release annotations and SEV markers.


10) Incident data schema (recommended)

Minimum fields of the incident card/table:

incident_id, sev, state, service, region, tenant, provider?,
t0_actual, t_detected, t_ack, t_declared, t_mitigated, t_recovered,
source_detect (synthetic | rum | infra | support),
root_cause (code | config | infra | provider | security | data | capacity | other),
trigger_id (release_id | change_id | external_id),
slo_impact (availability | latency | success), burn_minutes,
sla_breach (bool), public_updates[], owners (IC/TL/Comms/Scribe),
postmortem_id, capa_ids[], reopened_within_90d (bool)
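One way to pin this schema down as a typed record is a Python dataclass. This is a hedged sketch rather than a reference model: the field types, which fields are required, and the validation rules are assumptions; the enumerations are the ones listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Allowed values, taken from the schema above.
SOURCES = {"synthetic", "rum", "infra", "support"}
ROOT_CAUSES = {"code", "config", "infra", "provider", "security", "data", "capacity", "other"}

# Pairs (earlier, later) that must be ordered whenever both timestamps are set.
ORDERED_PAIRS = [
    ("t0_actual", "t_detected"), ("t_detected", "t_ack"), ("t0_actual", "t_declared"),
    ("t0_actual", "t_mitigated"), ("t_mitigated", "t_recovered"), ("t0_actual", "t_recovered"),
]


@dataclass
class Incident:
    incident_id: str
    sev: str
    state: str
    service: str
    region: str
    tenant: str
    t0_actual: datetime
    t_detected: datetime
    source_detect: str
    root_cause: str
    provider: Optional[str] = None
    t_ack: Optional[datetime] = None
    t_declared: Optional[datetime] = None
    t_mitigated: Optional[datetime] = None
    t_recovered: Optional[datetime] = None
    trigger_id: Optional[str] = None  # release_id | change_id | external_id
    slo_impact: Optional[str] = None  # availability | latency | success
    burn_minutes: float = 0.0
    sla_breach: bool = False
    public_updates: List[datetime] = field(default_factory=list)
    owners: Dict[str, str] = field(default_factory=dict)  # role (IC/TL/Comms/Scribe) -> person
    postmortem_id: Optional[str] = None
    capa_ids: List[str] = field(default_factory=list)
    reopened_within_90d: bool = False

    def __post_init__(self):
        if self.source_detect not in SOURCES:
            raise ValueError(f"unknown source_detect: {self.source_detect}")
        if self.root_cause not in ROOT_CAUSES:
            raise ValueError(f"unknown root_cause: {self.root_cause}")
        for earlier, later in ORDERED_PAIRS:
            a, b = getattr(self, earlier), getattr(self, later)
            if a is not None and b is not None and b < a:
                raise ValueError(f"{later} precedes {earlier}")


# Loosely based on the sample card in section 16; tenant and source_detect are made up.
inc = Incident(
    "2025-11-01-042", "SEV-1", "resolved", "payments-api", "EU", "tenant-1",
    t0_actual=datetime(2025, 11, 1, 12, 4, tzinfo=timezone.utc),
    t_detected=datetime(2025, 11, 1, 12, 7, tzinfo=timezone.utc),
    source_detect="synthetic", root_cause="provider", provider="PSP-A",
    t_recovered=datetime(2025, 11, 1, 12, 48, tzinfo=timezone.utc),
    burn_minutes=18.0,
)
```

Validating on construction keeps bad timestamps out of the data mart, which would otherwise surface later as negative MTTx values.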

11) Sample calculations (SQL sketch)

MTTR for the period (median):
```sql
-- Median time to recover, in minutes, for SEV-0..SEV-2 incidents since 2025-10-01 (PostgreSQL syntax).
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (t_recovered - t0_actual)) / 60) AS mttr_min
FROM incidents
WHERE t0_actual >= '2025-10-01' AND t_recovered IS NOT NULL AND sev IN ('SEV-0','SEV-1','SEV-2');
```
Detection Coverage:
```sql
-- Share of incidents not reported via support, i.e. detected by automation.
SELECT 100.0 * SUM(CASE WHEN source_detect <> 'support' THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS detection_coverage_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```
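Change Failure Rate (last 28 days):
```sql
-- Assumes trigger_id is set only for release/config/migration triggers;
-- if it can also hold an external_id, filter on the trigger type instead.
SELECT 100.0 * COUNT(*) FILTER (WHERE trigger_id IS NOT NULL) / NULLIF(COUNT(*), 0) AS change_failure_rate_pct
FROM incidents
WHERE t0_actual >= current_date - INTERVAL '28 days';
```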

12) Link to SLOs and error budgets

Record SLO burn minutes for each incident: this is the incident's main "weight".
Prioritize CAPA by total burn and SEV weight, not by incident count.
Tie burn to financial impact (e.g. $ per burn minute or $ per lost transaction).
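A minimal sketch of both ideas. The SEV weights and the dollar rate per burn minute are hypothetical placeholders to be calibrated per business; the point is that one heavy SEV-1 can outrank several minor incidents that share a cause.

```python
# Hypothetical SEV weights and cost rate; calibrate these for your own business.
SEV_WEIGHT = {"SEV-0": 8, "SEV-1": 4, "SEV-2": 2, "SEV-3": 1}
USD_PER_BURN_MINUTE = 250.0


def financial_impact(burn_minutes: float, usd_per_minute: float = USD_PER_BURN_MINUTE) -> float:
    """Dollar estimate of an incident's SLO burn."""
    return burn_minutes * usd_per_minute


def capa_priority(capas):
    """Order CAPA candidates by SEV-weighted total burn rather than by incident count."""
    def weighted_burn(capa):
        return sum(burn * SEV_WEIGHT[sev] for sev, burn in capa["incidents"])
    return sorted(capas, key=weighted_burn, reverse=True)


capas = [
    {"id": "A", "incidents": [("SEV-3", 2), ("SEV-3", 3), ("SEV-3", 1)]},  # frequent, mild
    {"id": "B", "incidents": [("SEV-1", 18)]},  # rare, heavy
]
print([c["id"] for c in capa_priority(capas)], financial_impact(18))
```

Counting incidents would put A first (3 vs 1); weighting by burn and SEV puts B first (72 vs 6).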


13) Process maturity metrics (program-level)

Postmortem Lead Time: median time from incident closure to publication of the report.
Evidence Completeness: share of reports that include a timeline, SLI charts, logs, and links to PRs/comms.
Alert Hygiene Score: composite index of actionable/FP/dedup/quorum.
Handover Defects: share of shift handovers that lost the context of active incidents.
Training Coverage: % of on-call engineers who went through a simulation during the quarter.
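The Alert Hygiene Score is described only as a composite index, so the weights below are an assumption. In this sketch each input is a percentage, the false-positive share is inverted so that fewer false pages raise the score, and the weighted sum stays on a 0-100 scale as long as the weights add up to 1. The sample inputs take Actionable 86% and FP 3.2% from the executive report in section 16; the dedup and quorum values are hypothetical.

```python
import math

# Hypothetical weights; they must sum to 1 to keep the score on a 0-100 scale.
WEIGHTS = {"actionable": 0.4, "fp": 0.3, "dedup": 0.15, "quorum": 0.15}
assert math.isclose(sum(WEIGHTS.values()), 1.0)


def alert_hygiene_score(actionable_pct, fp_pct, dedup_pct, quorum_pct):
    """Weighted 0-100 index; the FP share is inverted so fewer false pages score higher."""
    parts = {
        "actionable": actionable_pct,
        "fp": 100.0 - fp_pct,
        "dedup": dedup_pct,
        "quorum": quorum_pct,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


print(round(alert_hygiene_score(86, 3.2, 90, 75), 2))
```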


14) Metrics rollout checklist

  • Unified incident timestamps (UTC) and the data contract are defined.
  • A SEV dictionary, a root cause taxonomy, and a list of detection sources are adopted.
  • Metrics are normalized by volume (traffic/successful operations).
  • 3 dashboards are ready: Executive, Operations, Change Impact.
  • Alert-as-Code: every paging rule has a playbook and an owner.
  • A postmortem SLA is set (e.g. draft ≤ 72 h, final ≤ 5 business days).
  • Each CAPA item carries an effect KPI and D+14/D+30 check dates.
  • Weekly Incident Review: trends, top causes, CAPA status.
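The postmortem SLA item can be checked automatically. A minimal sketch, assuming the example limits above (draft within 72 hours and final within 5 business days of incident closure), Monday-to-Friday working days and no holiday calendar; the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def add_business_days(start: datetime, days: int) -> datetime:
    """Move forward by `days` working days, skipping Saturdays and Sundays (holidays ignored)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


def postmortem_sla(closed_at, draft_at, final_at):
    """(draft on time, final on time) for one postmortem."""
    draft_ok = draft_at - closed_at <= timedelta(hours=72)
    final_ok = final_at <= add_business_days(closed_at, 5)
    return draft_ok, final_ok


closed = datetime(2025, 10, 31, 10, 0, tzinfo=timezone.utc)  # a Friday
print(postmortem_sla(closed,
                     draft_at=datetime(2025, 11, 3, 9, 0, tzinfo=timezone.utc),
                     final_at=datetime(2025, 11, 10, 9, 0, tzinfo=timezone.utc)))
```

For a Friday closure the final report is due the following Friday, so a Monday publication misses the SLA even though the draft (71 h) was on time.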

15) Anti-patterns

Counting only MTTR, without MTTD/MTTA/MTTM → losing control over the early phases.
Not normalizing → large services look worse.
No systematic SEV scale → incidents cannot be compared.
No evidence → arguments instead of improvements.
Focusing on incident count instead of burn/SLO impact.
Ignoring Reopen and CAPA → endless recurrences.
Metrics kept in Excel, without automatic export from telemetry/ITSM.


16) Mini templates

Incident card (sample)


INC: 2025-11-01-042 (SEV-1)
T0=12:04Z, Detected=12:07, Ack=12:09, Declared=12:11,
Mitigated=12:24, Recovered=12:48
Service: payments-api (EU)
SLI: success_ratio (-3.6% vs SLO, burn=18 min)
Root cause: provider (PSP-A), Trigger: status red
Comms: first 12:12Z, cadence 15m, SLA met
Links: dashboards, logs, traces, release notes

Executive report (28 days, key lines)


Incidents: 12 (SEV-0:1, SEV-1:3, SEV-2:6, SEV-3:2)
Median MTTR: 52 min; Median MTTD: 4 min; MTTA: 3 min; MTTM: 17 min
Detection Coverage: 88%; Actionable Pages: 86%; FP Pages: 3.2%
Change Failure Rate: 33% (4/12), 3 of them config-related
Reopen(90d): 1/12 (8.3%); CAPA Completion: 82% (2 overdue)
Top Root Causes: provider(4), config(3), capacity(2)
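The count-based lines of the report follow directly from the raw counts. A small sketch that regenerates them; the function name and signature are hypothetical, and the inputs are the counts shown above.

```python
def executive_summary(sev_counts, change_linked, reopened_90d):
    """Format the count-based lines of the executive report."""
    total = sum(sev_counts.values())
    if total == 0:
        return ["Incidents: 0"]
    mix = ", ".join(f"{sev}:{n}" for sev, n in sev_counts.items())
    return [
        f"Incidents: {total} ({mix})",
        f"Change Failure Rate: {change_linked / total:.0%} ({change_linked}/{total})",
        f"Reopen(90d): {reopened_90d}/{total} ({reopened_90d / total:.1%})",
    ]


for line in executive_summary({"SEV-0": 1, "SEV-1": 3, "SEV-2": 6, "SEV-3": 2}, change_linked=4, reopened_90d=1):
    print(line)
```

Generating these lines from the incident table, rather than typing them by hand, keeps the percentages and the underlying counts consistent.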

17) Roadmap (4-6 weeks)

1. Week 1: timestamp/field standard, SEV/cause dictionary; a basic incident data mart.
2. Week 2: MTTD/MTTA/MTTM/MTTR calculations, normalization, and a SEV dashboard.
3. Week 3: linking incidents to releases/configs, Detection Coverage, and Alert Hygiene.
4. Week 4: Executive report, postmortem SLA, CAPA tracker.
5. Weeks 5-6: provider reports, a burn → $ financial model, quarterly targets, and a quarterly Incident Review.


18) Summary

Incident metrics are not just numbers; they are a picture of operational reliability. When you measure the whole flow (from detection to CAPA), normalize the metrics, tie them to SLOs and changes, and review them regularly, the organization reduces response time, costs, and incident recurrence, and users see a stable service.
