Metrics collection: Prometheus, Grafana
1) Purpose and frame
The metrics pipeline has to collect and store time series reliably, answer PromQL queries fast enough for root-cause analysis (RCA), drive SLO alerts, and feed readable dashboards. The basic pair is Prometheus (scrape → store → query) and Grafana (visualization, alerts, release annotations). For long-term storage and global queries, add Thanos, Cortex, or Mimir.
2) Data model and semantics
Series = metric name + set of labels (key = value).
Types: counter, gauge, histogram, summary (in production, histograms are usually preferred over summaries).
- RED (API): rate, errors, duration (histograms).
- USE (resources): utilization, saturation, errors (CPU/RAM/disk/network).
- Naming: 'namespace_subsystem_metric_unit' (for example, 'http_server_requests_total', 'db_connections_current').
Anti-cardinality: keep the number of distinct label values small (no 'user_id' or 'request_id' in labels).
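To see why unbounded labels are dangerous, it helps to estimate the worst case: the number of series for one metric is bounded by the product of the distinct values of each label. The sketch below is only an illustration; the function name and the label counts are hypothetical and not taken from any real system.

```python
from math import prod
from typing import Mapping


def estimate_series(label_value_counts: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single counter such as http_server_requests_total.
bounded = {"service": 20, "route": 30, "code": 8}
unbounded = {**bounded, "user_id": 100_000}

print(estimate_series(bounded))    # 4800
print(estimate_series(unbounded))  # 480000000
```

In practice not every combination occurs, so the real number is lower, but a single per-user label still multiplies the series count by the size of the user base.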
3) Exposure and service discovery
Exporters: node_exporter, kube-state-metrics, cAdvisor, DB/Queues (postgres_exporter, redis_exporter, kafka_exporter).
Native services: client libraries (Go/Java/Node/Python) → '/metrics'.
Service Discovery: Kubernetes, EC2/ASG, Consul, static files.
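yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs: [{ role: node }]
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
  - job_name: 'apps'
    kubernetes_sd_configs: [{ role: pod }]
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path when prometheus.io/path is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        target_label: __metrics_path__
        regex: (.+)
      # Keep the pod host and swap in the annotated port; writing the port
      # alone into __address__ would drop the host.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2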
Pod annotations:
yaml
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "8080"
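The relabeling steps in the scrape config can be easier to follow when simulated outside Prometheus. The following is a minimal sketch, assuming the documented defaults (source label values joined with ";", regex anchored on both ends, no change when the regex does not match). The helper names 'keep' and 'replace' are made up for this illustration, and Python's 're' module only approximates the RE2 syntax Prometheus uses.

```python
import re

SEPARATOR = ";"  # Prometheus joins source label values with ";" by default


def _joined(labels, source_labels):
    return SEPARATOR.join(labels.get(name, "") for name in source_labels)


def keep(labels, source_labels, regex):
    """'keep' action: the target survives only if the regex matches fully."""
    return re.fullmatch(regex, _joined(labels, source_labels)) is not None


def replace(labels, source_labels, regex, target_label, replacement="$1"):
    """'replace' action: on a full match, write the expanded replacement."""
    match = re.fullmatch(regex, _joined(labels, source_labels))
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate Prometheus-style $1 references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    out = dict(labels)
    out[target_label] = match.expand(template)
    return out


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
PATH = "__meta_kubernetes_pod_annotation_prometheus_io_path"
PORT = "__meta_kubernetes_pod_annotation_prometheus_io_port"

# Hypothetical discovered pod; the address and annotation values are invented.
pod = {
    "__address__": "10.0.0.5:9090",
    "__metrics_path__": "/metrics",
    SCRAPE: "true",
    PATH: "/internal/metrics",
    PORT: "8080",
}

if keep(pod, [SCRAPE], "true"):
    pod = replace(pod, [PATH], "(.+)", "__metrics_path__")
    # Commonly used variant: join the pod host with the annotated port.
    pod = replace(pod, ["__address__", PORT],
                  r"([^:]+)(?::\d+)?;(\d+)", "__address__", "$1:$2")
    print(pod["__address__"], pod["__metrics_path__"])
```

A pod without the path annotation yields an empty source value, '(.+)' does not match it, and the default '/metrics' path is kept.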
4) Histograms and latency
Use explicit buckets aligned with your SLOs:
- Web/API: '[10, 25, 50, 100, 200, 400, 800, 1600]' ms.
- Payments/payouts: extend the tail to 5-10 s.
promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
With exemplars (if enabled):
promql
histogram_quantile(0.95,
  sum by (le, route) (rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m]))
)
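Bucket choice matters because the quantile is interpolated inside a bucket rather than measured. The sketch below is a simplified Python rendering of that linear interpolation, assuming cumulative counts, positive bucket bounds, and a final +Inf bucket; it is not the Prometheus implementation and skips its edge cases. The bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts, bounds in seconds.
latency = [(0.1, 50), (0.2, 80), (0.4, 95), (math.inf, 100)]
print(histogram_quantile(0.9, latency))
print(histogram_quantile(0.99, latency))
```

The second call shows the cost of a short tail: once the quantile lands in the +Inf bucket, the estimate is capped at the highest finite bound, which is why slow flows such as payouts need buckets up to 5-10 s.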
5) Recording rules
Recording rules precompute heavy queries and standardize SLIs.
yaml
groups:
  - name: api_sli
    interval: 30s
    rules:
      # Share of requests that did not end in a 5xx, over 5 minutes.
      - record: job:http:success_ratio:rate5m
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      # p95 latency over 5 minutes.
      - record: job:http:duration_p95:5m
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
6) SLO and alerts (multi-window burn)
SLO: 99.9% successful requests over 30 days.
yaml
groups:
  - name: slo_burn
    rules:
      - alert: ErrorBudgetBurnHighShort
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 14
        for: 5m
        labels: { severity: critical }
        annotations: { summary: "Fast burn >14x for 5m" }
      - alert: ErrorBudgetBurnHighLong
        expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 6
        for: 1h
        labels: { severity: critical }
        annotations: { summary: "Long burn >6x for 1h" }
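The factors 14 and 6 in the burn alerts are multiples of the error budget. The arithmetic can be sketched as follows, assuming the 99.9% / 30-day SLO stated above; the function names are illustrative only.

```python
def error_budget(slo):
    """Allowed error ratio, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo)


def alert_threshold(slo, factor):
    """Error ratio at which a burn alert with the given factor fires."""
    return error_budget(slo) * factor


def hours_to_exhaust(factor, period_days=30):
    """Time until the whole budget is gone if the burn rate stays constant."""
    return period_days * 24 / factor


print(round(alert_threshold(0.999, 14), 6))  # error ratio for the fast-burn alert
print(round(hours_to_exhaust(14), 1))        # hours until the 30-day budget is spent
print(round(hours_to_exhaust(6), 1))
```

At a sustained 14x burn the monthly budget lasts roughly two days, and at 6x about five days, which is why the fast alert uses a short 'for' window and the slow one a long window.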
Alertmanager (simplified):
yaml
route:
  receiver: pager
  group_by: ["service"]
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true
7) Label-hygiene and economy
Label names are stable and standardized: 'service', 'env', 'region', 'route', 'code', 'version'.
Limit cardinality: metrics with a 'route' label must carry the route template ('http.route'), not the full URL.
Sampling logic belongs in traces; metrics never carry 'user_id'.
Release attributes ('service.version') are useful for comparing old and new versions.
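Most web frameworks expose the matched route template directly, and that is the preferred source for the 'route' label. Where only the raw path is available, a fallback normalizer along the following lines keeps the label bounded. This is a hypothetical sketch: the function name, the placeholder names, and the rules (numeric IDs and UUIDs only) are assumptions for illustration.

```python
import re

_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def route_template(path):
    """Collapse variable path segments so the 'route' label stays low-cardinality."""
    segments = []
    # Drop the query string first, then inspect each path segment.
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_template("/users/12345/orders/9?page=2"))  # /users/{id}/orders/{id}
print(route_template("/withdraw"))                     # /withdraw
```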
8) Scaling and HA
Prometheus scales vertically and by splitting scrape targets across instances:
- Two Prometheus replicas (A/B) scrape the same targets for HA; both fire the same alerts, which Alertmanager deduplicates.
- Thanos: a Sidecar next to each Prometheus, plus Store and Query for global queries and long-term storage (S3/GCS).
- Alternative: Cortex/Mimir (remote-write, multi-tenancy, horizontal scaling).
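yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push
    # Prometheus does not expand environment variables in its config file:
    # substitute the token at deploy time or use password_file instead.
    basic_auth: { username: tenantA, password: $MIMIR_TOKEN }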
Local TSDB Retention:
yaml
--storage.tsdb.retention.time=15d
--storage.tsdb.max-block-duration=2h
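Local retention translates directly into disk space. The Prometheus storage documentation gives the rule of thumb needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with roughly 1-2 bytes per sample. The sketch below applies it; the series count and scrape interval are invented inputs, and the helper names are illustrative.

```python
def samples_per_second(active_series, scrape_interval_seconds):
    """Ingestion rate if every active series is scraped once per interval."""
    return active_series / scrape_interval_seconds


def tsdb_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2):
    """Rule-of-thumb disk need: retention seconds x samples/s x bytes/sample."""
    return int(retention_days * 86_400 * ingest_rate * bytes_per_sample)


# Hypothetical cluster: 1.5M active series scraped every 15 s, kept for 15 days.
rate = samples_per_second(1_500_000, 15)
print(rate)
print(tsdb_disk_bytes(15, rate) / 1e9, "GB")
```

Using 2 bytes per sample is the conservative end of the range; the estimate excludes the WAL and any headroom for compaction.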
9) Grafana: dashboards, alerts, annotations
Standard dashboards:
1. Platform Overview (SLO/RED, error budget).
2. API by Route (RPS/5xx/p95, comparison by 'version').
3. K8s Cluster/Nodes (control-plane, saturation).
4. DB/Cache/Queues (lag/locks/hit ratio/backlog).
5. Per-Release (before/after, release annotations from CI).
Grafana Alerting: rules on PromQL queries, on-call rotations, mute timings for release windows.
Annotations: CI adds a release event with 'commit', 'image.tag', and a link to the pipeline.
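A CI job can create such a release annotation through Grafana's HTTP API ('POST /api/annotations', with 'time' in epoch milliseconds plus 'tags' and 'text'). The sketch below only builds the request; the Grafana URL, token, tag scheme, and helper names are placeholders chosen for this example, not values from the original setup.

```python
import json
import time
import urllib.request


def build_release_annotation(commit, image_tag, pipeline_url, service, when=None):
    """Payload for a release event; 'when' is a Unix timestamp in seconds."""
    millis = int((time.time() if when is None else when) * 1000)
    return {
        "time": millis,  # Grafana expects epoch milliseconds
        "tags": ["release", f"service:{service}", f"commit:{commit}",
                 f"image.tag:{image_tag}"],
        "text": f"Release {image_tag} ({commit}): {pipeline_url}",
    }


def annotation_request(grafana_url, token, payload):
    """Prepare (but do not send) the POST request to the annotations API."""
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


payload = build_release_annotation(
    "abc123", "v1.2.3", "https://ci.example.com/pipelines/1", "payments")
request = annotation_request("https://grafana.example.com", "PLACEHOLDER_TOKEN", payload)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it from the CI job.
```

Tagging by service and release makes it possible to filter annotations per dashboard instead of showing every deployment everywhere.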
10) Kubernetes: what to measure
Control-plane: `apiserver_request_total`, etcd leader/fsync, scheduler latency.
Workloads: restarts, 'container_cpu_cfs_throttled_seconds_total', OOM kills, Pending/Evicted pods, PDB violations.
Network: drops, conntrack, 'kube-proxy' errors.
Quotas/limits: Requests vs Limits, HPA/VPA, node saturation.
11) DB/caches/queues: key signals
PostgreSQL/MySQL: `connections`, `locks`, `deadlocks_total`, `xact_commit/rollback`, replication lag.
Redis: hit ratio, `evictions`, latency, `instantaneous_ops_per_sec`.
Kafka/RabbitMQ: consumer lag, unacked, ISR, disk usage.
promql
# Queue backlog
sum by (topic) (kafka_consumergroup_lag) > 1000
# Postgres replication lag
max(pg_replication_lag_seconds) > 2
12) Safety and multi-tenancy
RBAC for Prometheus/Grafana, data source permissions.
TLS/mTLS on ingress and between components.
Tenant isolation: separate Prometheus instances or a tenant label in Cortex/Mimir; series and query limits.
No secrets or PII in alerts and notifications (reference a ticket ID instead).
13) Integration with releases and auto-rollbacks
SLO rules → AnalysisTemplate (Argo Rollouts) or CI-gate.
When burn alerts fire, pause or roll back the canary and record a link to the release in the log or annotation.
Stable and canary versions are compared via the 'version' label.
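The gating logic behind such a rollout check can be sketched as a small decision function. Everything here is an assumption made for illustration: the function name, the input shape, the verdict strings, and the 20% latency tolerance. The burn factor of 14 and the 99.9% SLO mirror the values used in the alerting section. In a real setup the inputs would come from PromQL queries filtered by 'version'.

```python
def canary_verdict(stable, canary, slo=0.999, max_burn=14, max_latency_regression=1.2):
    """Decide what to do with a canary from its error ratio and p95 latency."""
    burn = canary["error_ratio"] / (1.0 - slo)
    if burn > max_burn:
        return "rollback"  # canary is burning error budget too fast
    if canary["p95_seconds"] > stable["p95_seconds"] * max_latency_regression:
        return "pause"     # latency regressed versus stable: hold for a human
    return "promote"


# Hypothetical measurements for the two versions.
stable = {"error_ratio": 0.0004, "p95_seconds": 0.30}
print(canary_verdict(stable, {"error_ratio": 0.02, "p95_seconds": 0.31}))    # rollback
print(canary_verdict(stable, {"error_ratio": 0.0005, "p95_seconds": 0.50}))  # pause
print(canary_verdict(stable, {"error_ratio": 0.0005, "p95_seconds": 0.31}))  # promote
```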
14) Typical errors (anti-patterns)
Uncontrolled label cardinality ('user_id', 'url.full', dynamic keys).
Mixing prod and stage in the same cluster without an 'env' label.
Gauges only, without RED/USE; no histograms for p95/p99.
Alerts on hardware metrics that are not tied to an SLO → noise.
No recording rules → heavy queries during production incidents.
No release annotations → hard to correlate changes with degradation.
15) Implementation checklist (0-45 days)
0-10 days
Node/kube-state/cAdvisor exporters; '/metrics' in services.
Basic RED/USE dashboards; standard histogram buckets.
Enable release annotations from CI.
11-25 days
Recording rules for SLI; multi-window burn alerts.
HA Prometheus (double scrape), configs backed up via GitOps.
Alertmanager: routes, silences, on-call rotations.
26-45 days
Remote-write to Thanos/Cortex/Mimir, long-term storage.
Cardinality optimization; series and query limits.
SLO-gating releases and auto-rollback integration.
16) Maturity metrics
RED/USE coverage for key services ≥ 95%.
p95 execution time of heavy PromQL queries < 2 s, thanks to recording rules.
Ratio of useful to noisy alerts > 3:1.
Cardinality under control: < 10M active series per cluster, no spikes.
100% of releases are annotated in Grafana and correlated with before/after metrics.
17) Useful snippets
Stable vs canary comparison by version
promql
histogram_quantile(0.95,
  sum by (le, version) (rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m]))
)
5xx errors by route
promql
topk(5,
  sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
)
Container CPU saturation
promql
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
Linking metrics to traces (exemplars enabled)
promql
sum(rate(http_request_duration_seconds_bucket[5m])) by (le) # exemplars click through to the trace
18) Conclusion
Prometheus + Grafana is the de facto standard for metrics. Semantics and discipline win: RED/USE, tidy labels, histograms for SLOs, recording rules, and SLO alerts. Add HA, long-term storage, release annotations, and integration with auto-rollbacks, and you get a fast, scalable, and economical metrics pipeline that supports decisions in production.