Prometheus: Collecting Metrics
(Section: Technology and Infrastructure)
Summary
Prometheus is the de facto industry standard for time-series metrics: it scrapes targets over HTTP, stores series in its TSDB, evaluates aggregations in PromQL, and fires alerts through Alertmanager. For iGaming, it is the foundation for an SLO-based approach (RED/USE, business metrics for payments), fast p95/p99 diagnostics, and automated responses (freeze/rollback).
1) Data model and cardinality
Metric: `name{label1="v1", label2="v2"} value @ timestamp`.
Cardinality = the number of unique label-value combinations, bounded above by the product of the number of distinct values of each label; it is the main cost driver.
- Base: `service`, `env`, `region`, `instance`, `pod`, `container`, `version`;
- Domain: `route`, `psp`, `tenant` (be careful!), `game_provider`.
- Not allowed: `user_id`, `session_id`, and other random or high-cardinality values.
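The cardinality rule above can be illustrated with a small Python sketch. The label value counts are hypothetical and chosen only to show how one unbounded label dominates the total; real counts depend on your services.

```python
from math import prod

def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical distinct-value counts per label (illustration only).
base = {"service": 1, "env": 1, "region": 3, "route": 20, "psp": 8}

print(max_series(base))                          # 480
print(max_series({**base, "user_id": 100_000}))  # 48000000
```

In practice not every combination occurs, so the real number is usually lower, but the product is a useful worst case when reviewing a new label.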
2) Metric types
Counter: only increases (for example, `http_requests_total`).
Gauge: instantaneous values (for example, `queue_depth`).
Histogram/Summary: latency distributions. In production, use Histogram (with `histogram_quantile()` and exemplars).
Native Histograms: dynamic buckets that improve accuracy and save space (enable them where supported).
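```go
import "github.com/prometheus/client_golang/prometheus"

// httpLatency records HTTP request duration, labelled by route and method.
var httpLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP latency",
		Buckets: prometheus.DefBuckets, // or custom
	},
	[]string{"route", "method"},
)
```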
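To make the bucket model behind a Histogram concrete, here is a minimal Python sketch of how observations end up in cumulative `le` buckets. It is not the client library's implementation, and the bucket bounds are hypothetical.

```python
from bisect import bisect_left

# Upper bounds ("le") in seconds; hypothetical values for illustration.
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0]

def bucket_counts(observations, bounds=BOUNDS):
    """Return cumulative counts per 'le' bound, plus the +Inf bucket."""
    per_bucket = [0] * (len(bounds) + 1)  # last slot is +Inf
    for value in observations:
        # Index of the first bound >= value ("le" is inclusive).
        per_bucket[bisect_left(bounds, value)] += 1
    cumulative, running = [], 0
    for count in per_bucket:
        running += count
        cumulative.append(running)
    return dict(zip([*bounds, float("inf")], cumulative))

counts = bucket_counts([0.03, 0.07, 0.2, 0.2, 0.8, 3.0])
print(counts)  # {0.05: 1, 0.1: 2, 0.25: 4, 0.5: 4, 1.0: 5, inf: 6}
```

Because each bucket is cumulative, buckets from several instances can be summed per `le` before a quantile is estimated.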
3) Exporters and what to measure
Services: in-code instrumentation (SDKs for Go/Java/Node/Python), RED metrics for APIs, business metrics (payment conversion).
System: node_exporter, cAdvisor/kubelet.
Third-party: databases/caches (mysqld_exporter, postgres_exporter, redis_exporter), NGINX/HAProxy, Kafka/RabbitMQ.
OTel metrics: through the OpenTelemetry Collector → Prometheus Remote Write or the Prometheus receiver → into the shared stack.
4) Scrape and relabel: how to connect targets
Basic `prometheus.yml`
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: "prod"
    region: "eu-west"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']

  - job_name: 'payments-api'
    # Targets / service discovery are not shown in this example.
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/ssl/ca.crt
      cert_file: /etc/ssl/tls.crt
      key_file: /etc/ssl/tls.key
    relabel_configs:
      # Strip the port from __address__ and store the host as `instance`.
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance
        replacement: '$1'
```
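The relabel rule above can be approximated in Python to see what it produces. This is a simplified sketch, assuming the intended pattern is `(.*):\d+` and handling only the `$1` reference; `relabel_replace` is a hypothetical helper, not a Prometheus API.

```python
import re

def relabel_replace(source_value, regex, replacement):
    """Mimic a 'replace' relabel action on a single source label value."""
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None  # no match: the target label is left unchanged
    # Prometheus writes $1; Python's equivalent group reference is \1.
    return match.expand(replacement.replace("$1", r"\1"))

print(relabel_replace("10.0.1.10:9100", r"(.*):\d+", "$1"))  # 10.0.1.10
```

The result is that `instance` carries only the host part, so dashboards stay stable if the exporter port changes.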
Kubernetes via Prometheus Operator
Use ServiceMonitor/PodMonitor instead of hand-written `scrape_configs`.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: payments-api }
spec:
  selector: { matchLabels: { app: payments-api } }
  namespaceSelector: { matchNames: [ "prod" ] }
  endpoints:
    - port: metrics
      interval: 15s
      scheme: http
      relabelings:
        - action: replace
          targetLabel: service
          replacement: "payments-api"
```
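K8s annotations (without the Operator, simplified)
```yaml
# These annotations are a convention: they take effect only if the
# scrape config contains relabel rules that read them.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```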
5) Storage: TSDB, WAL, and retention
WAL (Write-Ahead Log) → fast recovery after a restart.
Compaction: merges blocks, saving disk and CPU.
Retention: keep hot data for 7-30 days; offload long-term data elsewhere (see "Scaling").
- `--storage.tsdb.retention.time=15d`
- `--storage.tsdb.max-block-chunk-segment-size`
- Disk: fast SSD/NVMe; avoid network volumes unless necessary.
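When choosing a retention value, a rough disk estimate helps. The sketch below uses the common rule of thumb that required space is retention time × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample after compression. The ingestion rate is an assumed figure, and the result excludes WAL and compaction headroom.

```python
def tsdb_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB size estimate; bytes_per_sample of ~1-2 is a rule of thumb."""
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample

# 15 days of retention at an assumed 100,000 samples/s.
estimate = tsdb_disk_bytes(15, 100_000)
print(f"{estimate / 1e9:.1f} GB")  # 259.2 GB
```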
6) PromQL: basics and common patterns
Rate/irate
```promql
rate(http_requests_total{route="/deposit"}[5m])
```
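To show what `rate()` does conceptually, here is a simplified Python sketch: it sums counter increases across a window, treats any drop as a counter reset, and divides by elapsed time. The real function also extrapolates to the window boundaries, which is omitted here, and the sample values are made up.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return None  # need at least two samples in the window
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a process restart):
        # count the new value as growth from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed

# The counter restarts between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(window))  # 2.0
```

This reset handling is why `rate()` should be applied to counters only, and before any `sum()`.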
Errors and success ratio
```promql
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/ sum(rate(http_requests_total[5m]))
```
p95 latency
```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
```
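The p95 query relies on linear interpolation inside the bucket that contains the requested rank. The following Python sketch shows that idea for classic (non-native) histograms; the bucket counts are invented, and edge cases of the real function are simplified.

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count), ending with +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Hypothetical cumulative counts for one route.
buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3))  # about 0.406
```

The estimate can never be more precise than the bucket layout, which is why bucket bounds should sit close to the SLO threshold.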
Queues/saturation
```promql
max(queue_depth{queue="withdrawals"}) by (region)
```
7) Recording rules and performance
Precompute heavy expressions and store the results as new series.
```yaml
groups:
  - name: api.rules
    interval: 30s
    rules:
      - record: job:http:request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
      # `by (job)` keeps the job label so alerts can select a single job.
      - record: job:http:success_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
```
Benefits: faster dashboards and less CPU load on Prometheus.
8) Alerting and SLO (burn rate)
Burn-rate alerts (multi-window, multi-burn)
```yaml
groups:
  - name: slo.payments
    rules:
      - alert: PaymentsSLOFastBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 14
        for: 5m
        labels: { severity: "page" }
        annotations:
          summary: "SLO fast burn"
          runbook: "https://runbooks/payments/slo"
      - alert: PaymentsSLOSlowBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 6
        for: 1h
        labels: { severity: "ticket" }
```
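The arithmetic behind the two thresholds can be sketched in Python. The 99.9% target and the factors 14 and 6 come from the rules above; the 30-day SLO window is an assumption made here for illustration.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def hours_to_exhaustion(rate, slo_window_days=30):
    """Time until the whole budget is gone at a constant burn rate."""
    return slo_window_days * 24 / rate

print(round(burn_rate(0.014, 0.999), 2))   # 14.0 -> fast-burn threshold
print(round(hours_to_exhaustion(14), 1))   # 51.4 hours
print(hours_to_exhaustion(6))              # 120.0 hours
```

Under that assumption, a burn rate of 14 spends a month of budget in about two days, which is why it pages, while a burn rate of 6 leaves about five days and only opens a ticket.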
Alertmanager: routing by service/region, suppression of duplicates, ChatOps.
9) Correlation with traces and logs
Exemplars: attach a `trace_id` to histogram buckets.
Set the `service`, `version`, and `region` labels to enable release comparison.
On dashboards: release annotations (Git SHA/version).
10) Scaling and long-term storage
Federation: an upper-level Prometheus collects from lower-level ones (filtered by job/label).
Remote Write: ships series to long-term storage backends/clusters (Thanos/Cortex/Mimir).
Pros: effectively unlimited retention, horizontal scaling, global view.
Cons: harder to operate, higher cost.
Functional sharding: separate instances for system, business, and security metrics.
11) Security
TLS/mTLS for Alertmanager and `remote_write`.
Basic/token authentication for `/targets` and the API (behind a proxying gateway).
RBAC: restrict access to the UI/series by role; hide private labels.
PII hygiene: do not write PII into metrics; use hashes/aliases.
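As one possible reading of the "hashes/aliases" advice, a label value can be replaced with a keyed hash before it is exported. This is a minimal sketch using only the standard library; `pseudonymize`, the secret, and the truncation length are illustrative assumptions, not a prescribed scheme.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret: bytes, length: int = 12) -> str:
    """Return a stable, keyed alias for a sensitive label value."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncated for readability on dashboards

# Hypothetical tenant identifier and secret.
alias = pseudonymize("tenant-42", b"rotate-me")
print(alias)
```

Hashing hides the raw value but does not reduce cardinality, so it suits bounded labels such as `tenant`, not per-user identifiers.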
12) Kubernetes practice
Prometheus Operator: CRDs (ServiceMonitor, PodMonitor, Alertmanager, Prometheus).
kube-state-metrics + cAdvisor → a full picture of the cluster.
Taints and resources: dedicated nodes for monitoring; CPU/RAM limits.
Noise reduction: label selectors for "production" namespaces; raise `scrape_interval` where acceptable.
13) Business and product metrics
Payments: `payments_success_total{psp, currency}`, `payment_conversion_ratio`, `ttw_seconds_histogram`.
Game activity: bets/min, session retention as a gauge, drop-off on errors.
Risk/fraud: triggers on velocity/geo anomalies; log the details separately and keep only aggregates in metrics.
14) Cost and performance (FinOps)
Control cardinality (tag review before adding a new label).
Sample histograms and infrequent exporters → raise `scrape_interval` for non-critical targets.
Downsample data kept in long-term storage.
Cache dashboards and rely heavily on recording rules.
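The effect of a longer `scrape_interval` on ingestion load is simple arithmetic: samples per second is roughly the number of active series divided by the interval. The series count below is an assumed figure for a group of non-critical targets.

```python
def samples_per_second(active_series, scrape_interval_s):
    """Approximate ingestion rate for a set of targets."""
    return active_series / scrape_interval_s

# Assumed: 900,000 series on non-critical targets.
before = samples_per_second(900_000, 15)  # 60000.0
after = samples_per_second(900_000, 60)   # 15000.0
print(f"saved: {1 - after / before:.0%}")  # saved: 75%
```

The trade-off is coarser resolution: range windows in queries and alerts need to stay comfortably longer than the new interval.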
15) "Quick start" examples
RED exporter in an application (Python)
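```python
from prometheus_client import Counter, Histogram, start_http_server

# RED metrics: request count (with status) and request duration.
reqs = Counter('http_requests_total', 'Total HTTP requests', ['route', 'method', 'status'])
lat = Histogram('http_request_duration_seconds', 'HTTP request latency', ['route', 'method'])

start_http_server(8000)  # expose /metrics on port 8000

def handle(req):
    # `app` is the application's own request handler (not shown here).
    with lat.labels(req.route, req.method).time():
        status = app(req)
    reqs.labels(req.route, req.method, str(status)).inc()
    return status
```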
Threshold alert on p95
```yaml
alert: HighLatencyP95
expr: |
  histogram_quantile(0.95,
    sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.25
for: 10m
labels: { severity: "page", service: "api" }
```
16) Rollout checklist
1. Define the core metrics (RED/USE) and the set of domain indicators.
2. Agree on labels and cardinality guidelines.
3. Configure scrape/ServiceMonitor, TLS/mTLS, and relabeling.
4. Add histograms for key routes, with exemplars.
5. Create recording rules for p95, success ratio, and business aggregates.
6. Introduce SLOs (burn rate) and Alertmanager routing.
7. Build dashboards: service map, release compare, payments.
8. Decide on federation/remote_write and retention.
9. Restrict access (RBAC) and verify that no PII is present.
10. Put runbooks and game-day drills in place.
17) Anti-patterns
High-cardinality labels (user/session/request_id).
Summary instead of Histogram for core SLOs → no `histogram_quantile`, so quantiles cannot be aggregated across instances.
Scraping "everything" without filtering/rotation → rising cost and noise.
Alerts on raw metrics without SLOs → alert fatigue.
No recording rules → "heavy" dashboards.
Trusting metrics without TLS/mTLS → risk of tampering/leakage.
Conclusion
Prometheus gives an iGaming platform observability tied to its goals: accurate histograms, stable aggregates, clear SLO alerts, and scaling to a multi-region deployment. Label discipline, well-designed recording rules, links to traces/logs, and a carefully planned storage architecture support fast releases and help keep p99 under control even at peak load.