Metrics collection: Prometheus, Grafana
1) Purpose and scope
The job of the metrics pipeline is to collect and store time series reliably, and to provide fast PromQL for RCA, SLO-based alerts, and clear dashboards. The core pair is Prometheus (scrape → store → query) and Grafana (visualization, alerts, release annotations). For long-term storage and global querying, add Thanos/Cortex/Mimir.
2) Data model and semantics
Series = metric name + set of labels (key = value).
Types: counter, gauge, histogram, summary (in production, most often histogram).
- RED (API): `rate`, `errors`, `duration` (histograms).
- USE (resources): Utilization, Saturation, Errors (CPU/RAM/Disk/Net).
- Naming: `namespace_subsystem_metric_unit` (for example, `http_server_requests_total`, `db_connections_current`).
Anti-cardinality: keep the number of distinct label values small (no user_id or request_id in labels).
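To make the cardinality rule concrete, here is a minimal sketch of the worst-case series count for one metric, computed as the product of distinct values per label. The label names and counts are hypothetical and only illustrate the order of magnitude.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single request counter.
bounded = {"service": 20, "env": 3, "route": 50, "code": 8}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 24000
print(max_series(with_user_id))  # 2400000000
```

A single unbounded label multiplies the whole product, which is why identifiers such as user_id belong in logs or traces rather than in labels.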
3) Exposition and service discovery
Exporters: node_exporter, kube-state-metrics, cAdvisor, databases/queues (postgres_exporter, redis_exporter, kafka_exporter).
Your own services: client libraries (Go/Java/Node/Python) → `/metrics`.
Service discovery: Kubernetes, EC2/ASG, Consul, static files.
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs: [{ role: node }]
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  - job_name: 'apps'
    kubernetes_sd_configs: [{ role: pod }]
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path when one is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
      # Keep the pod host and swap in the annotated port
      # (writing only the port into __address__ would drop the host).
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
```
Pod annotations:
```yaml
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"
```
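The port annotation only takes effect if the relabel rule keeps the pod host and replaces just the port. The sketch below simulates one commonly used form of that rule in plain Python; the regex and replacement are an assumption about how the rule is written, and `relabel_address` is a hypothetical helper, not part of Prometheus.

```python
import re

# Assumed rule, in the form often used for the port annotation:
#   source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
#   regex: ([^:]+)(?::\d+)?;(\d+)
#   replacement: $1:$2
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after the rule is applied.

    Source label values are joined with ';' and the regex must match the
    whole joined string; when it does not match, the address is unchanged.
    """
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(relabel_address("10.0.0.7:9090", "8080"))  # 10.0.0.7:8080
print(relabel_address("10.0.0.7", "8080"))       # 10.0.0.7:8080
print(relabel_address("10.0.0.7:9090", ""))      # 10.0.0.7:9090
```

The last call shows the no-annotation case: the regex does not match, so the discovered address is left as it was. IPv6 addresses are not handled in this sketch.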
4) Histograms and latency
For SLOs, use explicit buckets:
- Web/API: `[10ms, 25, 50, 100, 200, 400, 800, 1600]`
- Payments: extend the tail to 5-10 s.
```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```
With exemplars (if enabled):
```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m]))
)
```
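Bucket choice matters because `histogram_quantile` only interpolates inside a bucket; it never sees individual observations. The following is a simplified sketch of that linear interpolation for classic cumulative buckets, not the Prometheus implementation itself. The bucket counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Finds the first bucket whose cumulative count reaches the target rank,
    then interpolates linearly between that bucket's lower and upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies above the last finite bound: report that bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts, upper bounds in seconds.
sample = [(0.1, 60), (0.2, 90), (0.4, 98), (math.inf, 100)]
print(f"{histogram_quantile(0.95, sample):.3f}")  # 0.325
print(histogram_quantile(0.99, sample))           # 0.4
```

The p99 result is capped at 0.4 s because everything slower falls into the `+Inf` bucket, which is the reason to extend the tail for slow operations.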
5) Recording rules
They reduce the cost of heavy queries and standardize SLIs.
```yaml
groups:
  - name: api_sli
    interval: 30s
    rules:
      - record: job:http:success_ratio:rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - record: job:http:duration_p95:5m
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
6) SLOs and alerts (multi-window burn)
SLO: 99.9% successful requests over 30 days.
```yaml
groups:
  - name: slo_burn
    rules:
      - alert: ErrorBudgetBurnHighShort
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 14
        for: 5m
        labels: { severity: critical }
        annotations: { summary: "Fast burn >14x for 5m" }
      - alert: ErrorBudgetBurnHighLong
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 6
        for: 1h
        labels: { severity: critical }
        annotations: { summary: "Long burn >6x for 1h" }
```
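The factors 14 and 6 in the alerts are burn rates: how many times faster than allowed the error budget is being spent. A small worked sketch of the arithmetic, assuming the 99.9% / 30-day SLO stated above and a constant burn rate (the function names are illustrative):

```python
SLO_TARGET = 0.999  # 99.9% successful requests
WINDOW_DAYS = 30    # SLO window


def error_ratio_threshold(burn_rate: float, slo: float = SLO_TARGET) -> float:
    """Error ratio at which the budget burns `burn_rate` times faster than allowed."""
    return (1 - slo) * burn_rate


def hours_to_exhaust(burn_rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    return window_days * 24 / burn_rate


for rate in (14, 6):
    print(f"{rate}x: error ratio > {error_ratio_threshold(rate):.4f}, "
          f"budget gone in {hours_to_exhaust(rate):.1f} h")
# 14x: error ratio > 0.0140, budget gone in 51.4 h
# 6x: error ratio > 0.0060, budget gone in 120.0 h
```

So a sustained 1.4% error ratio would consume a month of budget in roughly two days, which is why the fast-burn alert pages after only a few minutes.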
Alertmanager (simplified):
```yaml
route:
  receiver: pager
  group_by: ["service"]
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true
```
7) Label hygiene and cost control
Label names are stable and standardized: `service`, `env`, `region`, `route`, `code`, `version`.
Limit cardinality: the `route` label in metrics should carry the `http.route` template (not the full URL).
Sampling logic belongs in traces; metrics carry no user_id.
Release attributes (`service.version`) are useful for comparing the old and new versions.
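As an illustration of using a route template instead of the full URL, the sketch below collapses variable path segments into placeholders before the value is used as a `route` label. The patterns and the `route_label` helper are hypothetical; a real service would normally take the template directly from its router rather than infer it from the path.

```python
import re

# Hypothetical patterns for segments that vary per request.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_NUMBER = re.compile(r"^\d+$")


def route_label(path: str) -> str:
    """Collapse variable path segments so the label has few distinct values."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.strip("/").split("/"):
        if _NUMBER.match(segment):
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/" + "/".join(segments)


print(route_label("/users/42/orders?page=2"))  # /users/:id/orders
print(route_label("/files/123e4567-e89b-12d3-a456-426614174000"))  # /files/:uuid
```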
8) Scaling and HA
Prometheus scales vertically and by sharding across scrape targets:
- Two Prometheus instances (A/B) scrape the same targets (HA; both send the same alerts, which Alertmanager deduplicates).
- Thanos for global queries and long-term storage: a Sidecar next to each Prometheus (uploading to S3/GCS), plus Store and Query.
- Alternative: Cortex/Mimir (remote-write, multi-tenancy, horizontal scaling).
```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push
    # Prometheus does not expand environment variables in its config file;
    # inject the token at deploy time or use password_file.
    basic_auth: { username: tenantA, password: $MIMIR_TOKEN }
```
Local TSDB retention:
```yaml
--storage.tsdb.retention.time=15d
--storage.tsdb.max-block-duration=2h
```
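Retention translates directly into disk. A rough sizing sketch follows, using the rule of thumb from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, with about 1-2 bytes per sample). The load figures are hypothetical, and the result ignores WAL and compaction overhead.

```python
def tsdb_disk_bytes(active_series: int, scrape_interval_s: float,
                    retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough disk need: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical load: 1M active series, 15s scrape interval, 15d retention.
need = tsdb_disk_bytes(1_000_000, 15, 15)
print(f"{need / 1e9:.1f} GB")  # 172.8 GB
```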
9) Grafana: dashboards, alerts, annotations
Standard dashboards:
1. Platform Overview (SLO/RED, error budget).
2. API by Route (RPS/5xx/p95, comparison by `version`).
3. K8s Cluster/Nodes (control plane, saturation).
4. DB/Cache/Queues (lag/locks/hit ratio/backlog).
5. Per-Release (before/after, release annotations from CI).
Grafana Alerting: PromQL triggers, on-call rotations, mute timings for release windows.
Annotations: CI adds a release event with `commit`, `image.tag`, and a link to the pipeline.
10) Kubernetes: what to measure
Control plane: `apiserver_request_total`, etcd leader/fsync, scheduler latency.
Workloads: restarts, `container_cpu_cfs_throttled_seconds_total`, OOM, Pending/Evicted, PDB violations.
Network: drops, conntrack, `kube-proxy` errors.
Quotas/limits: requests vs. limits, HPA/VPA, node saturation.
11) DB/cache/queues: key signals
PostgreSQL/MySQL: `connections`, `locks`, `deadlocks_total`, `xact_commit/rollback`, replication lag.
Redis: hit ratio, `evictions`, latency, `instantaneous_ops_per_sec`.
Kafka/RabbitMQ: consumer lag, unacked, ISR, disk usage.
```promql
# Queue backlog
sum by (topic) (kafka_consumergroup_lag) > 1000
# Postgres replication lag
max(pg_replication_lag_seconds) > 2
```
12) Security and multi-tenancy
RBAC for Prometheus/Grafana, data source permissions.
TLS/mTLS on the ingress path and between components.
Tenant isolation: a separate Prometheus per tenant, or a tenant label in Cortex/Mimir; limits on series and queries.
No secrets in alerts or notifications (no PII; use a ticket ID instead).
13) Integration with releases and automatic rollbacks
SLO rules → AnalysisTemplate (Argo Rollouts) or a CI gate.
When burn alerts fire: pause or roll back the canary; include a link to the release in the log or annotation.
Compare the stable and canary versions through the `version` label.
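A CI gate of this kind can be as simple as comparing per-version numbers that were already queried from Prometheus. The sketch below shows only the decision step; the `VersionStats` type, the `canary_verdict` function, and both thresholds are illustrative assumptions, not Argo Rollouts or Grafana APIs.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    p95_seconds: float  # e.g. histogram_quantile(0.95, ...) for one version
    error_ratio: float  # e.g. share of 5xx responses for one version


def canary_verdict(stable: VersionStats, canary: VersionStats,
                   max_latency_ratio: float = 1.2,
                   max_error_delta: float = 0.001) -> str:
    """Return "promote" or "rollback" (thresholds are illustrative assumptions)."""
    if canary.error_ratio - stable.error_ratio > max_error_delta:
        return "rollback"
    if stable.p95_seconds > 0 and canary.p95_seconds / stable.p95_seconds > max_latency_ratio:
        return "rollback"
    return "promote"


stable = VersionStats(p95_seconds=0.20, error_ratio=0.001)
print(canary_verdict(stable, VersionStats(0.21, 0.0012)))  # promote
print(canary_verdict(stable, VersionStats(0.30, 0.0010)))  # rollback
```

Comparing the canary against the stable version at the same moment, rather than against a fixed number, keeps the gate from tripping on load that affects both versions equally.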
14) Common mistakes (anti-patterns)
Uncontrolled label cardinality (user_id, url.full, dynamic keys).
Prod and stage mixed in one cluster without an `env` label.
Only gauges, with no RED/USE; p95/p99 without histograms.
Hardware-level alerts not tied to an SLO → noise.
No recording rules → heavy queries during production incidents.
No release annotations → changes are hard to correlate with regressions.
15) Implementation checklist (0-45 days)
Days 0-10
Node/kube-state/cAdvisor exporters; `/metrics` on services.
Baseline RED/USE dashboards; standard histogram buckets.
Enable release annotations from CI.
Days 11-25
Recording rules for SLIs; multi-window burn alerts.
HA Prometheus; configuration backed up via GitOps.
Alertmanager: routes, silences, on-call rotations.
Days 26-45
Remote-write to Thanos/Cortex/Mimir; long-term storage.
Cardinality optimization; limits on series and queries.
SLO gating of releases and auto-rollback integration.
16) Maturity metrics
RED/USE coverage of key services ≥ 95%.
Execution time of heavy PromQL queries < 2 s (p95), thanks to recording rules.
Ratio of useful to noisy alerts > 3:1.
Cardinality under control: < 10M active series per cluster, no explosions.
100% of releases have a Grafana annotation and a before/after metrics comparison.
17) Useful snippets
Comparing the stable and canary versions
```promql
histogram_quantile(0.95,
  sum by (le, version) (rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m]))
)
```
5xx errors by route
```promql
topk(5,
  sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
)
```
Container CPU saturation
```promql
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
```
Linking metrics to traces (exemplars enabled)
```promql
sum(rate(http_request_duration_seconds_bucket[5m])) by (le) # clickable through to the trace
```
18) Conclusion
Prometheus + Grafana is the de facto standard for metrics. What wins is semantics and discipline: RED/USE, careful labels, histograms for SLOs, recording rules, and SLO alerts. Add HA and long-term storage, release annotations, and integration with automatic rollbacks, and you get a fast, reproducible, and economical metrics pipeline that supports product decisions.