GH GambleHub

Metrleri ýygnamak: Prometheus, Grafana

Metrleri ýygnamak: Prometheus, Grafana

1) Maksady we çarçuwasy

Metrleriň konturynyň wezipesi wagt hatarlaryny ygtybarly ýygnamak we saklamak, RCA üçin çalt PromQL bermek, SLO boýunça aladalar we düşnükli daşbordlar. Esasy jübüt: Prometheus (scrape → store → query) we Grafana (wizualizasiýa, alert, relizleriň düşündirişleri). Uzak wagtlap saklamak we global talap etmek üçin - Thanos/Cortex/Mimir.

2) Maglumatlaryň nusgasy we semantika

Seriýa = metrikanyň ady + label's toplumy (açar = many).
Görnüşleri: counter, gauge, histogram, summary (proda - köplenç histogram).

Semantika:
  • RED (API): 'rate', 'errors', 'duration' (gistogrammalar).
  • USE (ресурсы): Utilization, Saturation, Errors (CPU/RAM/Disk/Net).
  • Ady: 'namespace _ subsystem _ metric _ unit' (mysal üçin 'http _ server _ requests _ total', 'db _ connections _ current').

Anti-kardinalizm: Dürli bahalary minimallaşdyryň (user_id request_id ýok).

3) Sergi we hyzmat-diskaweri

Eksportçylar: node_exporter, kube-state-metrics, cAdvisor, DB/Nobatlar (postgres_exporter, redis_exporter, kafka_exporter).
Öz hyzmatlary: müşderi kitaphanalary (Go/Java/Node/Python) → '/metrics '.
Service Discovery: Kubernetes, EC2/ASG, Consul, static files.

Esasy 'prometheus. yml '(bölek):
yaml global:
scrape_interval: 15s evaluation_interval: 15s scrape_configs:
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs: [{ role: node }]
relabel_configs:
- action: labelmap regex: __meta_kubernetes_node_label_(.+)
- job_name: 'apps'
kubernetes_sd_configs: [{ role: pod }]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep regex: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
target_label: __address__
regex: (.+)
replacement: $1
Pod düşündirişleri:
yaml prometheus. io/scrape: "true"
prometheus. io/path: /metrics prometheus. io/port: "8080"

4) Gistogrammalar we latency

SLO-laryňyzyň aşagyndaky aýratyn baketleri ulanyň:
  • Web/API: '[10ms, 25,50,100,200,400,800,1600]'
  • Tölegler/tölegler: 5-10s çenli guýruk goşuň.
PromQL p95:
promql histogram_quantile(0. 95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
Exemplars bilen (goşulsa):
promql histogram_quantile(0. 95,
sum by (le, route) (rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m]))
)

5) Ýazgy düzgünleri (recording rules)

Agyr haýyşlary azaldýarlar, SLI standartlaşdyrýarlar.

yaml groups:
- name: api_sli interval: 30s rules:
- record: job:http:success_ratio:rate5m expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
- record: job:http:duration_p95:5m expr: histogram_quantile(0. 95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

6) SLO we alertler (multi-window burn)

SLO 99. 9 %/30d.

yaml groups:
- name: slo_burn rules:
- alert: ErrorBudgetBurnHighShort expr: (1 - job:http:success_ratio:rate5m) > (1 - 0. 999) 14 for: 5m labels: { severity: critical }
annotations: { summary: "Fast burn >14x for 5m" }

- alert: ErrorBudgetBurnHighLong expr: (1 - job:http:success_ratio:rate5m) > (1 - 0. 999) 6 for: 1h labels: { severity: critical }
annotations: { summary: "Long burn >6x for 1h" }
Alertmanager (ýönekeý):
yaml route:
receiver: pager group_by: ["service"]
receivers:
- name: pager slack_configs:
- channel: "#oncall"
send_resolved: true

7) Label-arassaçylyk we tygşytlamak

Label's atlary durnukly we standartlaşdyrylan: 'service', 'env', 'region', 'route', 'code', 'version'.
Kardinallygy çäklendiriň: 'route' metrleri 'http' şablonyny ulanmaly. route '(doly URL däl).
Logikanyň samplingi - trekslerde; metriklerde - user_id ýok.
Reliz aýratynlyklary ('service. version ') köne/täze wersiýasyny deňeşdirmek üçin peýdalydyr.

8) Masştab we HA

Prometheus - dik we scrape-target boýunça şardlamak:
  • Iki Prometheus (A/B) şol bir maksatlary gizleýär (HA → alertler gaýtalanýar).
  • Thanos: Global haýyşlar we uzak möhletli saklamak üçin her Prometheus, Store + Query (S3/GCS) üçin Sidecar.
  • Alternatiwa: Cortex/Mimir (remote-write, köptenantlyk, gorizontal masştab).
Remote write (mysal):
yaml remote_write:
- url: https://mimir. example. com/api/v1/push basic_auth: { username: tenantA, password: $MIMIR_TOKEN }
Lokal TSDB satuwy:
yaml
--storage. tsdb. retention. time=15d
--storage. tsdb. max-block-duration=2h

9) Grafana: daşbordlar, alertler, düşündirişler

Standart daşbordlar:

1. Platform Overview (SLO/RED, error-budget).

2. API by Route (RPS/5xx/p95, deňeşdirme 'version').

3. K8s Cluster/Nodes (control-plane, saturation).

4. DB/Cache/Queues (lag/locks/hit ratio/backlog).

5. Per-Release (öň/soň, CI-den neşirleriň düşündirişleri).

Grafana Alerting: PromQL, on-call, mute-times "goýberiş penjireleri" boýunça triggerler.

Annotations: CI 'commit', 'image' bilen çykyş çäresini goşýar. tag ', paypline salgylanma.

10) Kubernetes: nämäni ölçemeli

Control-plane: `apiserver_request_total`, etcd leader/fsync, scheduler latency.
Workloads: restarts, 'container _ cpu _ cfs _ throttled _ seconds _ total', OOM, Pending/Evicted, PDB düzgün bozmalar.
Tor: damjalar, conntrack, 'kube-proxy' hatalary.
Kwotalar/çäkler: Requests vs Limits, HPA/VPA, düwünleriň saturasiýasy.

11) BD/keş/nobatlar: esasy signallar

PostgreSQL/MySQL: `connections`, `locks`, `deadlocks_total`, `xact_commit/rollback`, replication lag.
Redis: hit ratio, `evictions`, latency `instantaneous_ops_per_sec`.
Kafka/RabbitMQ: consumer lag, unacked, ISR, disk usage.

PromQL mysallary:
promql
Queue backlog sum by (topic) (kafka_consumergroup_lag)> 1000

Postgres replication lag max(pg_replication_lag_seconds) > 2

12) Howpsuzlyk we köp tenantlyk

RBAC k Prometheus/Grafana, datasource-permishen.
Ingress/komponentleriň arasynda TLS/mTLS zynjyry.
Kärendeçileriň izolýasiýasy: Cortex/Mimir-de aýratyn Prometheus ýa-da tenant-label; seriýa we haýyş üçin çäklendirmeler.
Alertlerde/notifikasiýalarda syrlar gadagan edilýär (PII däl-de, ID biletini ulanyň).

13) Relizler we awto-yza gaýdyşlar bilen integrasiýa

SLO düzgünleri → AnalysisTemplate (Argo Rollouts) ýa-da CI-gate.
Burn-alertler açylanda - pause/rollback canary; logda/düşündirişde - goýberilmegine salgylanma.
Label 'version' arkaly durnukly we kanar görnüşini deňeşdirmek.

14) Adaty ýalňyşlyklar (anti-pattern)

Labelleriň gözegçiliksiz kardinallygy (user_id, url. full, dinamiki açarlar).
Prod bilen stage-i bir toparda 'env' label-siz garyşdyryň.
RED/USE bolmasa diňe gauge; histogramsyz p95/p99.
SLO → sessiz "demir" alertleri.
Proto-hadysalarda "agyr" soraglaryň ýoklugy.
Çykyşlaryň düşündirişleri ýok → üýtgeşmeleri we zaýalanmalary deňeşdirmek kyn.

15) Giriş çek-sanawy (0-45 gün)

0-10 gün

Node/kube-state/cAdvisor eksportçylary; '/metrics 'hyzmatlarda.
Esasy RED/USE daşbordlary; gistogrammalaryň standart baketleri.
CI neşirleriniň düşündirişlerini açyň.

11-25 gün

SLI üçin recording rules; multi-window burn alerty.
HA Prometheus (goşa scrape), GitOps konfigurasiýalarynyň ätiýaçlyk nusgasy.
Alertmanager: marşrutlar/asuda re modeim/on-call aýlawlary.

26-45 gün

Remote-write in Thanos/Cortex/Mimir, uzak wagtlap saklamak.
Kardinallygy optimizirlemek, seriýalara çäkler, haýyşlar.
SLO-geýting relizleri we awto-rollback integrasiýasy.

16) Kämillik ölçegleri

Esasy hyzmatlar boýunça RED/USE ýapmak ≥ 95%.
recording rules hasabyna "agyr" PromQL <2 c (p95) ýerine ýetirmegiň ortaça wagty.
Peýdaly/şowhunly alertleriň gatnaşygy> 3:1.
Gözegçilik astynda kardinallyk: <10M klaster üçin işjeň seriýalar, partlamalaryň ýoklugy.
Relizleriň 100% -inde Grafana düşündirişi we metrikleri öň/soň deňeşdirmek bar.

17) Peýdaly snippetler

Stable vs canary

promql histogram_quantile(0. 95,
sum by (le, version) (rate(http_request_duration_seconds_bucket{version=~"stable    canary"}[5m]))
)

Marşrutlar boýunça 5xx ýalňyşlyklary

promql topk(5,
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
)

CPU konteýner saturasiýasy

promql rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0. 1

Metrleriň treýsler bilen baglanyşygy (Exemplars goşuldy)

promql sum (rate (http_request_duration_seconds_bucket[5m])) by (le) # clickable to the track

18) Netijenama

Prometheus + Grafana - metrikler üçin de-fakto standart. Semantika we düzgün-nyzam ýeňýär: RED/USE, arassa label's, SLO-nyň aşagyndaky gistogrammalar, recording rules we SLO-alertler. HA we uzak möhletli saklamak, neşirleriň düşündirişleri we awto-yza gaýdyp gelmek bilen integrasiýa goşuň - we önümde karar bermäge kömek edýän çalt, ulaldylýan we tygşytly metrik kontury alarsyňyz.

Contact

Biziň bilen habarlaşyň

Islendik sorag ýa-da goldaw boýunça bize ýazyp bilersiňiz.Biz hemişe kömek etmäge taýýar.

Telegram
@Gamble_GC
Integrasiýany başlamak

Email — hökmany. Telegram ýa-da WhatsApp — islege görä.

Adyňyz obýýektiw däl / islege görä
Email obýýektiw däl / islege görä
Tema obýýektiw däl / islege görä
Habar obýýektiw däl / islege görä
Telegram obýýektiw däl / islege görä
@
Eger Telegram görkezen bolsaňyz — Email-den daşary şol ýerden hem jogap bereris.
WhatsApp obýýektiw däl / islege görä
Format: ýurduň kody we belgi (meselem, +993XXXXXXXX).

Düwmäni basmak bilen siz maglumatlaryňyzyň işlenmegine razylyk berýärsiňiz.