GambleHub

Collecting metrics: Prometheus, Grafana

1) Purpose and scope

The job of the metrics pipeline is to collect and store time series reliably, support fast PromQL queries for RCA, drive SLO-based alerts, and provide readable dashboards. The core pair is Prometheus (scrape → store → query) and Grafana (visualization, alerts, release annotations). For long-term retention and global queries, add Thanos/Cortex/Mimir.

2) Data model and semantics

A series = metric name + a set of labels (key=value).
Types: counter, gauge, histogram, summary (in production, most often histogram).

Semantics:
  • RED (API): `rate`, `errors`, `duration` (histograms).
  • USE (resources): Utilization, Saturation, Errors (CPU/RAM/Disk/Net).
  • Naming: `namespace_subsystem_metric_unit` (for example, `http_server_requests_total`, `db_connections_current`).

Anti-cardinality: keep the number of distinct label values small (no user_id or request_id in labels).
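
To make the cardinality warning concrete, here is a rough Python sketch of how the worst-case number of series for one metric grows as the product of distinct values per label. The label names and counts are invented for illustration, not measurements.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for http_server_requests_total.
bounded = {"service": 20, "route": 30, "code": 8}
with_user_id = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))       # 4800
print(worst_case_series(with_user_id))  # 480000000
```

A single unbounded label multiplies the whole product, which is why identifiers such as user_id belong in traces or logs rather than in labels.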

3) Exposition and service discovery

Exporters: node_exporter, kube-state-metrics, cAdvisor, databases/queues (postgres_exporter, redis_exporter, kafka_exporter).
Your own services: client libraries (Go/Java/Node/Python) → `/metrics`.
Service discovery: Kubernetes, EC2/ASG, Consul, static files.
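
As a rough illustration of what a `/metrics` endpoint returns, the sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. In a real service the official client library would normally do this; the function names and sample values here are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: dict[tuple[tuple[str, str], ...], float],
) -> str:
    """Render one counter family: HELP line, TYPE line, then one line per series."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples keyed by (label, value) pairs.
samples = {
    (("route", "/withdraw"), ("code", "200")): 3,
    (("route", "/withdraw"), ("code", "500")): 1,
}
text = render_counter("http_server_requests_total", "Total HTTP requests handled.", samples)
print(text)
```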

A basic `prometheus.yml` (excerpt):
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs: [{ role: node }]
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'apps'
    kubernetes_sd_configs: [{ role: pod }]
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
      # Keep the pod host and swap in the annotated port
      # (writing the port alone into __address__ would drop the host).
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
```
Pod annotations:
```yaml
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"
```
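
Relabeling joins the values of `source_labels` with `;` and matches the result against a fully anchored regex. The sketch below imitates that in Python for a commonly used address-plus-port rewrite. The function name is hypothetical, and the code only mimics the matching step; it is not Prometheus itself.

```python
import re

# Pattern commonly used to combine a pod address with an annotated port.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Imitate a `replace` relabel step that targets __address__."""
    joined = ";".join([address, port_annotation])  # source_labels are joined with ";"
    match = ADDRESS_PORT.fullmatch(joined)          # relabel regexes are fully anchored
    if match is None:
        return address                              # no match: the target label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"     # equivalent of replacement "$1:$2"


print(relabel_address("10.0.0.5:9090", "8080"))  # 10.0.0.5:8080
print(relabel_address("10.0.0.5", "8080"))       # 10.0.0.5:8080
print(relabel_address("10.0.0.5:9090", ""))      # 10.0.0.5:9090
```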

4) Histograms and latency

Use explicit buckets for SLOs:
  • Web/API: `[10ms, 25, 50, 100, 200, 400, 800, 1600]`
  • Payments/payouts: extend the tail to 5-10 s.
PromQL p95:
```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```
With exemplars (if enabled):
```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m]))
)
```
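
To show why bucket boundaries matter for SLOs, here is a simplified Python sketch of the linear interpolation that `histogram_quantile` performs over cumulative buckets. It assumes a classic histogram with a `+Inf` bucket and ignores edge cases handled by the real implementation; the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts, upper bounds in seconds.
buckets = [(0.1, 50), (0.2, 80), (0.4, 95), (0.8, 99), (math.inf, 100)]
p90 = histogram_quantile(0.9, buckets)
p995 = histogram_quantile(0.995, buckets)
print(round(p90, 4), p995)
```

The estimate can never be finer than the bucket it lands in, so boundaries should sit close to the SLO threshold.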

5) Recording rules

They reduce the cost of heavy queries and standardize SLIs.

```yaml
groups:
  - name: api_sli
    interval: 30s
    rules:
      - record: job:http:success_ratio:rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - record: job:http:duration_p95:5m
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
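
For intuition about what a success-ratio SLI computes, the Python sketch below derives it from raw counter samples, treating any drop in a counter as a reset. It is a simplification: the real `rate()` also extrapolates to the window edges and divides by the window length, which cancels out in a ratio over the same window. The sample values are made up.

```python
def increase(samples: list[float]) -> float:
    """Total counter increase across ordered samples; a drop is treated as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def success_ratio(non_5xx: list[float], all_requests: list[float]) -> float:
    """Share of requests that did not end in a 5xx status over the sampled window."""
    return increase(non_5xx) / increase(all_requests)


# Hypothetical samples within one window; the process restarts before the 4th sample.
non_5xx = [990, 1288, 1586, 99, 396]
all_requests = [1000, 1300, 1600, 100, 400]

ratio = success_ratio(non_5xx, all_requests)
print(round(ratio, 3))  # 0.992
```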

6) SLOs and alerts (multi-window burn)

SLO: 99.9% successful requests over 30 days.

```yaml
groups:
  - name: slo_burn
    rules:
      - alert: ErrorBudgetBurnHighShort
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 14
        for: 5m
        labels: { severity: critical }
        annotations: { summary: "Fast burn >14x for 5m" }

      - alert: ErrorBudgetBurnHighLong
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 6
        for: 1h
        labels: { severity: critical }
        annotations: { summary: "Long burn >6x for 1h" }
```
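
The arithmetic behind the 14x and 6x burn thresholds can be sketched as follows. This is a small Python illustration assuming the 99.9% / 30-day SLO stated above; the function names are hypothetical.

```python
SLO = 0.999
WINDOW_DAYS = 30

error_budget = 1 - SLO  # allowed error ratio over the SLO window


def burn_rate(error_ratio: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / error_budget


def days_to_exhaustion(rate: float) -> float:
    """Days until the whole 30-day budget is gone at a constant burn rate."""
    return WINDOW_DAYS / rate


for factor in (14, 6):
    threshold = error_budget * factor
    print(f"{factor}x -> error ratio > {threshold:.4f}, "
          f"budget gone in {days_to_exhaustion(factor):.1f} days")
```

At 14x the error ratio is about 1.4% and the monthly budget would last roughly two days; at 6x it is about 0.6% and the budget lasts five days.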
Alertmanager (simplified):
```yaml
route:
  receiver: pager
  group_by: ["service"]
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true
```

7) Label hygiene and cost control

Label names are stable and standardized: `service`, `env`, `region`, `route`, `code`, `version`.
Limit cardinality: the `route` label on metrics should use the `http.route` template (not the full URL).
Sampling logic belongs in traces; metrics never carry user_id.
Release attributes (`service.version`) are useful for comparing the old and new versions.
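
When the framework does not expose the matched route template, a fallback is to normalize raw paths before using them as a label. The Python sketch below is one hypothetical way to do that; the placeholder names `:id` and `:uuid` and the function name are assumptions, and the framework's own `http.route` value is preferable when available.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def to_route(path: str) -> str:
    """Collapse a raw URL path into a low-cardinality route template."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for seg in path.split("/"):
        if seg.isdigit():
            segments.append(":id")       # numeric identifiers
        elif _UUID.match(seg):
            segments.append(":uuid")     # UUID identifiers
        else:
            segments.append(seg)
    return "/".join(segments)


print(to_route("/users/12345/withdrawals/987?limit=10"))
# /users/:id/withdrawals/:id
```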

8) Scaling and HA

Prometheus scales vertically and by sharding scrape targets:
  • Two Prometheus instances (A/B) scrape the same targets (HA; Alertmanager deduplicates their alerts).
  • For global queries and long-term retention: Thanos, with a Sidecar next to each Prometheus (uploading to S3/GCS) plus Store and Query.
  • Alternative: Cortex/Mimir (remote-write, multi-tenancy, horizontal scaling).
Remote write (example):
```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push
    # $MIMIR_TOKEN is a placeholder: Prometheus does not expand environment
    # variables in its config file, so inject the secret at deploy time.
    basic_auth: { username: tenantA, password: $MIMIR_TOKEN }
```
Local TSDB retention:
```
--storage.tsdb.retention.time=15d
--storage.tsdb.max-block-duration=2h
```
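
To relate a retention setting to disk usage, the Python sketch below applies the rule of thumb from the Prometheus storage documentation: retention seconds × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample. The series count and scrape interval are illustrative assumptions, not recommendations.

```python
def tsdb_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough local TSDB size estimate for a given retention period."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical load: 1M active series scraped every 15 s, kept for 15 days.
estimate = tsdb_disk_bytes(1_000_000, 15, 15)
print(f"{estimate / 1e9:.1f} GB")  # 172.8 GB
```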

9) Grafana: dashboards, alerts, annotations

Standard dashboards:

1. Platform Overview (SLO/RED, error budget).

2. API by Route (RPS/5xx/p95, comparison by `version`).

3. K8s Cluster/Nodes (control plane, saturation).

4. DB/Cache/Queues (lag/locks/hit ratio/backlog).

5. Per-Release (before/after, release annotations from CI).

Grafana Alerting: PromQL triggers, on-call rotations, mute timings for release windows.

Annotations: CI adds a release event with `commit`, `image.tag`, and a link to the pipeline.
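
A CI job could create such a release annotation through Grafana's HTTP API (`POST /api/annotations`). The Python sketch below only builds the request and does not send it. The URLs, token, tag layout, and function names are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_release_annotation(commit: str, image_tag: str, pipeline_url: str, time_ms: int) -> dict:
    """Annotation body: epoch-millisecond timestamp, tags for filtering, free-form text."""
    return {
        "time": time_ms,
        "tags": ["release", f"commit:{commit}", f"image.tag:{image_tag}"],
        "text": f"Release {image_tag} ({commit}): {pipeline_url}",
    }


def build_request(grafana_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST to the annotations endpoint."""
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; a real job would read these from the CI environment.
payload = build_release_annotation(
    "abc1234", "1.42.0", "https://ci.example.com/pipelines/1", 1_700_000_000_000
)
request = build_request("https://grafana.example.com/", "PLACEHOLDER_TOKEN", payload)
print(request.get_method(), request.full_url)
# In CI: urllib.request.urlopen(request) would submit the annotation.
```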

10) Kubernetes: what to measure

Control plane: `apiserver_request_total`, etcd leader/fsync, scheduler latency.
Workloads: restarts, `container_cpu_cfs_throttled_seconds_total`, OOM, Pending/Evicted, PDB violations.
Network: drops, conntrack, `kube-proxy` errors.
Quotas/limits: Requests vs Limits, HPA/VPA, node saturation.

11) Databases/caches/queues: key signals

PostgreSQL/MySQL: `connections`, `locks`, `deadlocks_total`, `xact_commit/rollback`, replication lag.
Redis: hit ratio, `evictions`, latency, `instantaneous_ops_per_sec`.
Kafka/RabbitMQ: consumer lag, unacked, ISR, disk usage.

PromQL examples:
```promql
# Queue backlog
sum by (topic) (kafka_consumergroup_lag) > 1000

# Postgres replication lag
max(pg_replication_lag_seconds) > 2
```

12) Security and multi-tenancy

RBAC for Prometheus/Grafana, datasource permissions.
TLS/mTLS at ingress and between components.
Tenant isolation: a separate Prometheus per tenant, or a tenant label in Cortex/Mimir; limits on series and queries.
No secrets in alerts or notifications (no PII; use the ticket ID instead).

13) Integration with releases and auto-rollbacks

SLO rules → AnalysisTemplate (Argo Rollouts) or a CI gate.
When burn alerts fire, pause or roll back the canary; put a link to the release in the log/annotation.
Compare the stable and canary versions through the `version` label.
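
The gate logic described above could be sketched as a small decision function that a CI step or rollout controller evaluates after querying per-version metrics. This is a hypothetical Python illustration: the thresholds, verdict names, and function name are assumptions, not Argo Rollouts behavior.

```python
def canary_verdict(
    stable_p95: float,
    canary_p95: float,
    canary_error_ratio: float,
    max_latency_regression: float = 1.2,  # assumed: canary may be at most 20% slower
    max_error_ratio: float = 0.001,       # assumed: matches a 99.9% success SLO
) -> str:
    """Return 'rollback', 'pause', or 'promote' for a canary step."""
    if canary_error_ratio > max_error_ratio:
        return "rollback"  # error budget is burning: revert
    if canary_p95 > stable_p95 * max_latency_regression:
        return "pause"     # latency regression: hold and investigate
    return "promote"


print(canary_verdict(0.200, 0.210, 0.0005))  # promote
print(canary_verdict(0.200, 0.300, 0.0005))  # pause
print(canary_verdict(0.200, 0.210, 0.0100))  # rollback
```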

14) Common mistakes (anti-patterns)

Uncontrolled label cardinality (user_id, url.full, dynamic keys).
Prod and stage mixed in one cluster without an `env` label.
Only gauges, with no RED/USE metrics; p95/p99 without histograms.
"Hardware" alerts not tied to an SLO → noise.
No recording rules → heavy queries during production incidents.
No release annotations → hard to correlate changes with regressions.

15) Rollout checklist (0-45 days)

Days 0-10

Node/kube-state/cAdvisor exporters; `/metrics` on services.
Baseline RED/USE dashboards; standard histogram buckets.
Enable release annotations from CI.

Days 11-25

Recording rules for SLIs; multi-window burn alerts.
HA Prometheus; configuration backed up through GitOps.
Alertmanager: routes, silences, on-call rotations.

Days 26-45

Remote-write to Thanos/Cortex/Mimir for long-term retention.
Cardinality optimization; limits on series and queries.
SLO-gated releases and auto-rollback integration.

16) Maturity metrics

RED/USE coverage across key services ≥ 95%.
Heavy PromQL queries complete in < 2 s (p95) thanks to recording rules.
Ratio of useful to noisy alerts > 3:1.
Cardinality under control: < 10M active series per cluster, no cardinality explosions.
100% of releases have a Grafana annotation and a before/after metrics comparison.

17) Useful snippets

Comparing the stable and canary versions

```promql
histogram_quantile(0.95,
  sum by (le, version) (rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m]))
)
```

5xx errors by route

```promql
topk(5,
  sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
)
```

Container CPU saturation

```promql
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
```

Linking metrics to traces (exemplars enabled)

```promql
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))  # points link through to the trace
```

18) Summary

Prometheus + Grafana is the de facto standard for metrics. Semantics and discipline are what win: RED/USE, careful labels, SLO histograms, recording rules, and SLO alerts. Add HA and long-term storage, release annotations, and auto-rollback integration, and you get a fast, reproducible, and cost-efficient metrics pipeline that supports product decisions.
