Alerting and Incident Response
(Section: Technology and Infrastructure)
Summary
Strong alerting is not just "a metric that turned red" but a signal that user value is being harmed. For iGaming, the essentials are SLO gates (latency, availability, payment conversion, Time-to-Wallet), multi-burn rules, and clearly defined roles for on-call, escalation, ChatOps, and runbooks. The goal is to spot a deviation quickly, notify the people who can fix it, and capture what was learned so that the next response is faster and cheaper.
1) Basics: from metrics to actions
SLI → SLO → Alert: a measurable quality indicator → a target level → a "the budget is burning" condition.
Severity (SEV): SEV1 is critical (revenue/GGR at risk), SEV2 is major, SEV3 is moderate, SEV4 is minor.
Impact/Urgency: who is affected (everyone/region/tenant/channel) and how urgent it is (TTW ↑, p99 ↑, error rate ↑).
Actionability: every alert maps to a specific action (runbook + owner).
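The actionability rule can be enforced mechanically before a rule is merged. The sketch below is one possible lint, not an established tool: the `AlertRule` class and the `owner` label name are assumptions made for illustration, while the `runbook` annotation and the `page`/`ticket` severities follow the rule examples in this article.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def actionability_problems(rule: AlertRule) -> list[str]:
    """Return the reasons a rule is not actionable (empty list means OK)."""
    problems = []
    if not rule.annotations.get("runbook"):
        problems.append("missing runbook annotation")
    if not rule.labels.get("owner"):  # "owner" is an assumed label name
        problems.append("missing owner label")
    if rule.labels.get("severity") not in {"page", "ticket"}:
        problems.append("severity must be 'page' or 'ticket'")
    return problems


good = AlertRule(
    "PaymentsSLOFastBurn",
    labels={"severity": "page", "owner": "payments-team"},
    annotations={"runbook": "https://runbooks/payments/slo"},
)
bad = AlertRule("CpuHigh", labels={"severity": "page"})  # hypothetical rule

print(actionability_problems(good))
print(actionability_problems(bad))
```

A check like this could run in CI against the rule files, so that an alert without a runbook or an owner never reaches production.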
2) Signal taxonomy
Technical SLOs: p95/p99 API latency, error rate, saturation (CPU/IO/GPU), queue lag.
Business SLOs: payment conversion (attempt → success), Time-to-Wallet (TTW), bet success rate, game launches.
Payment routes: PSP-specific metrics (timeout/decline spikes).
Front end/mobile: RUM metrics (LCP/INP), crash rate, synthetic scenarios (login/deposit/bet/withdrawal).
3) Alerting policy: SLOs and burn rate
SLI/SLO examples
Availability of `payments-api` ≥ 99.9% / 30d
p95 of `/deposit` ≤ 250 ms / 30d
`payments_attempt → success` ≥ baseline − 0.3% / 24h
TTW p95 ≤ 3 min / 24h
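To make the availability target concrete, it helps to convert it into an error budget. The function below is a small worked example; its name is illustrative, and it assumes the budget is counted in minutes of full unavailability.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the window for an availability SLO."""
    return (1.0 - slo) * window_days * 24 * 60


budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

In other words, a 99.9% target over 30 days leaves roughly 43 minutes of downtime before the SLO is missed.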
Multi-window / Multi-burn (PromQL idea)
Fast burn: the error budget is being spent 5-10× faster than normal (page within 5-15 minutes).
Slow burn: the budget is burning down gradually (review and analysis over 1-3 hours).
yaml
# Recording rule, computed in advance: per-job share of successful requests.
# "by (job)" keeps the job label that the alerts below select on.
- record: job:http:success_ratio
  expr: |
    sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
    /
    sum by (job) (rate(http_requests_total[5m]))

# Fast burn (99.9% SLO): error ratio above 14x the budgeted rate
- alert: PaymentsSLOFastBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 14
  for: 10m
  labels: { severity: "page", service: "payments-api" }
  annotations:
    summary: "SLO fast burn (payments-api)"
    runbook: "https://runbooks/payments/slo"

# Slow burn: error ratio above 6x the budgeted rate, sustained for an hour
- alert: PaymentsSLOSlowBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 6
  for: 1h
  labels: { severity: "ticket", service: "payments-api" }
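The arithmetic behind these thresholds can be checked outside Prometheus. The sketch below is a simplified model with illustrative function names. It treats burn rate as the observed error ratio divided by the budgeted ratio, and it pages only when both a long and a short window exceed the factor. That pairing is one common reading of "multi-window", and the window lengths themselves are left to the reader.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def should_page(err_long: float, err_short: float, slo: float, factor: float) -> bool:
    """Multi-window check: both the long and the short window must burn fast."""
    return burn_rate(err_long, slo) > factor and burn_rate(err_short, slo) > factor


def hours_to_exhaust(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_days * 24 / rate


# Ongoing outage: 2% errors over the long window, 1.8% over the short one.
print(should_page(0.02, 0.018, slo=0.999, factor=14))   # True
# Already recovered: the short window is back to the budgeted rate.
print(should_page(0.02, 0.001, slo=0.999, factor=14))   # False
print(round(hours_to_exhaust(14), 1))                    # 51.4
```

The short window is what lets the alert resolve quickly after recovery, while the long window keeps brief blips from paging anyone.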
4) Reducing noise and improving signal quality
The right source of truth: alert on aggregates (recording rules), not on heavy "raw" expressions.
Deduplication: Alertmanager groups by `service/region/severity`.
Hierarchy: alert first on business/SLI signals; technical metrics sit below them as diagnostics.
Suppression: during planned maintenance or releases (annotations) and during upstream incidents.
Cardinality: do not use `user_id`/`session_id` in alert labels.
Test alerts: regular "drill" triggers (to verify channels, roles, and runbook links).
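Two of the points above, grouping and label cardinality, are easy to illustrate in a few lines. This is a toy model of the idea rather than Alertmanager's implementation; the alert name `HighErrorRate` and the region value are made up for the example.

```python
from collections import defaultdict

FORBIDDEN_LABELS = {"user_id", "session_id"}  # high-cardinality labels named in the text


def group_alerts(alerts, keys=("service", "region", "severity")):
    """Collapse alerts sharing the same grouping labels into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(k, "") for k in keys)
        groups[group_key].append(alert["alertname"])
    return dict(groups)


def high_cardinality_labels(labels: dict) -> list[str]:
    """Return the forbidden labels present on an alert, sorted by name."""
    return sorted(FORBIDDEN_LABELS & set(labels))


alerts = [
    {"alertname": "HighLatencyP95", "service": "api", "region": "eu", "severity": "page"},
    {"alertname": "HighErrorRate", "service": "api", "region": "eu", "severity": "page"},
    {"alertname": "WithdrawalsQueueLag", "service": "payments-worker",
     "region": "eu", "severity": "page"},
]
grouped = group_alerts(alerts)
for key, names in grouped.items():
    print(key, names)
print(high_cardinality_labels({"service": "api", "user_id": "42"}))
```

Three firing alerts become two notifications here, and adding a per-user label would split them into one group per user, which is why such labels are kept out of alerts.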
5) Alertmanager: routing and escalation
yaml
route:
  group_by: [service, region]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  receiver: sre-slack
  routes:
    - matchers: [ severity="page" ]
      receiver: pagerduty-sre
      continue: true
    - matchers: [ service="payments-api" ]
      receiver: payments-slack

receivers:
  - name: pagerduty-sre
    pagerduty_configs:
      - routing_key: <PD_KEY>
        severity: "critical"
  - name: sre-slack
    slack_configs:
      - channel: "#alerts-sre"
        send_resolved: true
        title: "{{ .CommonLabels.service }} {{ .CommonLabels.severity }}"
        text: "Runbook: {{ .CommonAnnotations.runbook }}"
  # The route above references payments-slack, so the receiver must exist;
  # the channel name below is an assumed placeholder.
  - name: payments-slack
    slack_configs:
      - channel: "#alerts-payments"
        send_resolved: true

inhibit_rules:
  - source_matchers: [ severity="page" ]
    target_matchers: [ severity="ticket" ]
    equal: [ "service" ]
Idea: SEV = page → PagerDuty/SMS; everything else → Slack/ticket. Inhibition suppresses lower-severity "noise" for a service while a higher SEV is active for the same service.
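It is worth tracing a few alerts through this routing tree by hand. The sketch below is a deliberately simplified model of Alertmanager's behaviour for this one config (equality matchers only, one level of child routes), not its actual implementation. Child routes are tried in order, `continue: true` lets evaluation move on to the next sibling, and the root receiver is used only when no child matches.

```python
def matches(matchers: dict, labels: dict) -> bool:
    return all(labels.get(k) == v for k, v in matchers.items())


ROUTES = [  # child routes, in the order they appear in the config
    {"matchers": {"severity": "page"}, "receiver": "pagerduty-sre", "continue": True},
    {"matchers": {"service": "payments-api"}, "receiver": "payments-slack", "continue": False},
]
DEFAULT_RECEIVER = "sre-slack"


def receivers_for(labels: dict) -> list[str]:
    hit = []
    for route in ROUTES:
        if matches(route["matchers"], labels):
            hit.append(route["receiver"])
            if not route["continue"]:
                break
    return hit or [DEFAULT_RECEIVER]


def is_inhibited(target: dict, firing: list[dict]) -> bool:
    """A ticket is muted while a page for the same service is firing."""
    if target.get("severity") != "ticket":
        return False
    return any(
        src.get("severity") == "page" and src.get("service") == target.get("service")
        for src in firing
    )


print(receivers_for({"severity": "page", "service": "payments-api"}))
print(receivers_for({"severity": "page", "service": "api"}))
print(receivers_for({"severity": "ticket", "service": "api"}))
```

Under this model a page for a non-payments service reaches PagerDuty only, not `sre-slack`, because a matching child route takes the alert away from the root receiver. If pages should also appear in Slack, that needs an explicit route.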
6) Grafana Alerting (as an additional layer)
Centralized alert rules alongside the dashboards (Prometheus/Loki/Cloud).
Contact points: PagerDuty/Slack/Email; notification policies per folder.
Silences: planned work, migrations, releases.
Snapshots with an automatic screenshot of the panel.
7) On-call and "live" processes
Rotation: tier 1 (SRE/platform), tier 2 (service owner), tier 3 (DB/Payments/Sec).
Response SLAs: acknowledge within 5 minutes (SEV1), diagnose within 15 minutes, send updates every 15-30 minutes.
On-call channels: `#incident-warroom`, `#status-updates` (facts only).
Runbooks: a link in every alert plus ChatOps quick commands (`/rollback`, `/freeze`, `/scale`).
Drills: monthly (checking people, channels, and whether runbooks are up to date).
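The response targets above can be checked against incident timestamps after the fact. The sketch below assumes both targets are measured from the moment the alert fired; the text does not say where the diagnosis clock starts, so that is a choice made here. The function name and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta

ACK_SLA = timedelta(minutes=5)         # SEV1 acknowledgement target from the text
DIAGNOSIS_SLA = timedelta(minutes=15)  # diagnosis target, assumed to start at firing


def sla_breaches(fired_at: datetime, acked_at: datetime,
                 diagnosed_at: datetime) -> list[str]:
    """Return which response targets were missed for one incident."""
    breaches = []
    if acked_at - fired_at > ACK_SLA:
        breaches.append("ack")
    if diagnosed_at - fired_at > DIAGNOSIS_SLA:
        breaches.append("diagnosis")
    return breaches


fired = datetime(2024, 1, 1, 2, 0)
late_ack = sla_breaches(fired, fired + timedelta(minutes=7), fired + timedelta(minutes=12))
on_time = sla_breaches(fired, fired + timedelta(minutes=3), fired + timedelta(minutes=15))
print(late_ack)  # ['ack']
print(on_time)   # []
```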
8) Incidents: lifecycle
1. Detect (alert/report/synthetic check) → on-call acknowledges.
2. Triage: determine the SEV, who is affected, and a working hypothesis; open a war room.
3. Stabilize: runbooks/rollback/scaling/feature flags.
4. Communicate: status update template (see below), ETA/next steps.
5. Close: confirm that the SLO has recovered.
6. Post-Incident Review (RCA): 24-72 hours later, blameless, with action items.
Status update template:
- What broke / who is affected (region/tenant/channel)
- When it started / SEV
- Temporary measures (mitigation)
- Next status update in N minutes
- Contact (Incident Manager)
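A status update is easier to keep consistent when it is rendered from fields rather than typed freehand. The helper below is a minimal sketch: the function name, the line layout, and all sample values are hypothetical, and only the five fields come from the list above.

```python
def render_status(what_broke, affected, started_at, sev, mitigation,
                  next_update_min, commander):
    """Fill the status fields in the order the template lists them."""
    return "\n".join([
        f"What broke: {what_broke} | Affected: {affected}",
        f"Started: {started_at} | {sev}",
        f"Mitigation: {mitigation}",
        f"Next update in {next_update_min} minutes",
        f"Contact: {commander} (Incident Manager)",
    ])


msg = render_status(
    what_broke="deposit timeouts via one PSP",
    affected="region=eu",
    started_at="14:05 UTC",
    sev="SEV2",
    mitigation="traffic shifted to a backup PSP",
    next_update_min=30,
    commander="on-call SRE",
)
print(msg)
```

Because every field is a required argument, an update cannot be posted with the mitigation or the next-update time left out.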
9) iGaming specifics: "pain" zones and alerts
Payments/TTW: share of PSP timeouts, growth in failures by code, TTW p95 > 3 min.
Tournament peaks: p99 API latency, game start time, queue lag; pre-warming of limits and autoscaling.
Withdrawals: back-office/manual review SLAs, per-country limits.
Game providers: availability per studio, session start time, drop in launches.
RG/Compliance: spikes in long sessions or loss chasing, and limit breaches, call for a ticket plus a notification to the RG team, not a page.
10) Rule examples (additional)
High p95 latency (API)
yaml
- alert: HighLatencyP95
  expr: |
    histogram_quantile(0.95,
      sum by (le, service) (rate(http_request_duration_seconds_bucket{service="api"}[5m]))
    ) > 0.25
  for: 10m
  labels: { severity: "page", service: "api" }
  annotations:
    summary: "p95 latency > 250ms"
    runbook: "https://runbooks/api/latency"
Withdrawals queue is "burning"
yaml
- alert: WithdrawalsQueueLag
  expr: max_over_time(queue_lag_seconds{queue="withdrawals"}[10m]) > 300
  for: 10m
  labels: { severity: "page", service: "payments-worker" }
  annotations:
    summary: "Withdrawals lag >5m"
    runbook: "https://runbooks/payments/queue"
Payment conversion
yaml
- alert: PaymentConversionDrop
  expr: |
    (sum(rate(payments_success_total[15m])) / sum(rate(payments_attempt_total[15m])))
    < (payment_conv_baseline - 0.003)
  for: 20m
  labels: { severity: "page", domain: "payments" }
  annotations:
    summary: "Payment conversion below baseline -0.3%"
    runbook: "https://runbooks/payments/conversion"
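The conversion rule compares a ratio against a baseline minus an absolute 0.003, that is, 0.3 percentage points. A quick worked example with made-up counts shows where the threshold falls; the function name and the zero-traffic behaviour are choices made for this sketch.

```python
def conversion_dropped(success: int, attempts: int, baseline: float,
                       tolerance: float = 0.003) -> bool:
    """True when conversion is more than `tolerance` below the baseline."""
    if attempts == 0:
        return False  # no traffic: nothing to compare (a choice made for this sketch)
    return success / attempts < baseline - tolerance


# Baseline 94.5% gives a threshold of 94.2%.
print(conversion_dropped(9400, 10000, baseline=0.945))  # 94.0% -> True
print(conversion_dropped(9430, 10000, baseline=0.945))  # 94.3% -> False
```

Whether zero traffic should stay silent or alert on its own is a separate decision, since a complete stop in payment attempts is usually an incident too.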
11) ChatOps and automation
Stop canary, Rollback, Scale +N.
Command shortcuts: `/incident start`, `/status update`, `/call …`
Bots enrich the context: latest deploy, dependency graph, trace examples (exemplars), linked tickets.
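A ChatOps bot typically validates a command against an allowlist and writes an audit record before doing anything. The sketch below covers only that parsing step. The three command names are the quick commands from the on-call section, while the argument syntax and the audit line format are assumptions for illustration.

```python
import shlex

ALLOWED = {"/rollback", "/freeze", "/scale"}  # quick commands named in the on-call section


def parse_command(text: str):
    """Split a chat message into (command, args); reject anything not allowlisted."""
    parts = shlex.split(text)
    if not parts or parts[0] not in ALLOWED:
        raise ValueError(f"unknown command: {text!r}")
    return parts[0], parts[1:]


def audit_line(user: str, text: str) -> str:
    """Build the record written before the command is executed."""
    cmd, args = parse_command(text)
    return f"user={user} cmd={cmd} args={' '.join(args)}"


print(parse_command("/scale payments-api +2"))
print(audit_line("alice", "/rollback payments-api"))
```

Rejecting unknown commands up front keeps the audit log limited to actions the bot is actually permitted to run.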
12) Post-incident work (RCA)
Facts: a timeline of what was seen and tried, and what worked.
Root cause: technical and organizational causes.
Detections & Defenses: which signals helped and which failed.
Action items: specific tasks (SLOs/alerts/code/limits/tests/runbooks).
Due dates & owners: deadlines and owners; a follow-up session 2-4 weeks later.
13) Implementation checklist
1. Define SLIs/SLOs for the key flows (API/Payments/Games/TTW).
2. Set up recording rules, multi-burn alerts, and Alertmanager routing.
3. Introduce on-call with rotation, response SLAs, and escalations.
4. Link alerts to runbooks and ChatOps commands.
5. Configure suppression/silence windows and release/maintenance annotations.
6. Run drills and game-day scenarios (PSP degradation, p99 growth, queue lag growth).
7. Measure alert quality: MTTA/MTTR, % noisy/false, SLO coverage.
8. Hold regular RCAs and review thresholds and processes.
9. Establish status communications with business and support.
10. Document everything as code: rules, routes, runbook links.
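The alert-quality measures from item 7 can be computed from a simple incident log. The records below are sample data, not real measurements, and the field names are assumptions. MTTA and MTTR are averaged over actionable alerts only, which is one reasonable convention among several.

```python
from statistics import mean

incidents = [  # minutes measured from the alert firing; sample data only
    {"ack_min": 3, "resolve_min": 40, "actionable": True},
    {"ack_min": 6, "resolve_min": 90, "actionable": True},
    {"ack_min": 2, "resolve_min": 5, "actionable": False},  # noisy/false alert
    {"ack_min": 6, "resolve_min": 20, "actionable": True},
]


def alert_quality(records):
    """Mean time to acknowledge/resolve plus the share of noisy alerts."""
    real = [r for r in records if r["actionable"]]
    return {
        "mtta_min": mean(r["ack_min"] for r in real),
        "mttr_min": mean(r["resolve_min"] for r in real),
        "noisy_pct": 100 * (len(records) - len(real)) / len(records),
    }


quality = alert_quality(incidents)
print(quality)
```

Tracked month over month, a rising noisy share is a cue to retire or retune rules before on-call engineers start ignoring pages.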
14) Anti-patterns
Alerting on "every metric" → alert fatigue, alerts get ignored.
No SLOs → unclear what is "normal" and what is "burning".
No suppression/inhibition → an avalanche of duplicates.
Paging at night for minor events (SEV is not matched against Impact).
Alerts without a runbook or an owner.
"Manual" actions without ChatOps or an audit trail.
No RCA/action items → incidents recur.
Conclusion
Alerting and response are a process, not a set of rules. Tie SLOs to multi-burn alerts, build a clear on-call escalation chain, add ChatOps and living runbooks, and run RCAs and drills regularly. Incidents will then be rarer, shorter, and cheaper, and releases will stay predictable even during iGaming peak hours.