Metrics collection: Prometheus, Grafana

1) Purpose and scope

The metrics pipeline must reliably collect and store time series, support fast PromQL queries for root-cause analysis (RCA), drive SLO alerts, and feed readable dashboards. The basic pair is Prometheus (scrape → store → query) and Grafana (visualization, alerts, release annotations). For long-term storage and global queries, add Thanos, Cortex, or Mimir.

2) Data model and semantics

Series = metric name + set of labels (key = value).
Types: counter, gauge, histogram, summary (in production, histogram is usually preferred over summary).

Semantics:

RED (API): 'rate', 'errors', 'duration' (histograms).
USE (resources): Utilization, Saturation, Errors (CPU/RAM/Disk/Net).
Naming: 'namespace_subsystem_metric_unit' (for example, 'http_server_requests_total', 'db_connections_current').

Anti-cardinality: minimize the number of distinct label values (no user_id or request_id in labels).
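
To make the anti-cardinality rule concrete, here is a minimal sketch of a worst-case estimate: the number of series one metric can produce is bounded by the product of distinct values per label. The label counts below are hypothetical, chosen only for illustration.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical numbers of distinct values per label for one HTTP counter.
bounded = {"service": 20, "env": 3, "route": 50, "code": 8}
# The same metric after someone adds a per-user label.
unbounded = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))    # 24000
print(worst_case_series(unbounded))  # 2400000000
```

Real series counts are usually lower than this bound because not every label combination occurs, but one unbounded label is enough to multiply the total by several orders of magnitude.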

3) Exposure and service discovery

Exporters: node_exporter, kube-state-metrics, cAdvisor, DB/Queues (postgres_exporter, redis_exporter, kafka_exporter).
Native services: client libraries (Go/Java/Node/Python) → '/metrics'.
Service Discovery: Kubernetes, EC2/ASG, Consul, static files.
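
For orientation, the sketch below shows what a service returns on '/metrics' in the Prometheus text exposition format. It is a hand-rolled illustration that uses only the standard library; in a real service the official client library would produce this output. The function names 'render_counter' and 'escape_label_value' are hypothetical.

```python
def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one counter family as text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        label_part = "{" + pairs + "}" if pairs else ""
        lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines) + "\n"


# Illustrative sample: one series of a request counter.
print(render_counter(
    "http_server_requests_total",
    "Total HTTP requests.",
    [({"route": "/withdraw", "code": "200"}, 42)],
))
```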

Basic 'prometheus.yml' (snippet):

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs: [{ role: node }]
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      - job_name: 'apps'
        kubernetes_sd_configs: [{ role: pod }]
        relabel_configs:
          # Scrape only pods annotated with prometheus.io/scrape: "true".
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # Override the metrics path when the annotation is set.
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            target_label: __metrics_path__
            regex: (.+)
          # Combine the pod IP from __address__ with the annotated port;
          # writing the port alone into __address__ would drop the host.
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2

Pod annotations:

    prometheus.io/scrape: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"

4) Histograms and latency

Use explicit buckets for your SLOs:

Web/API: 10, 25, 50, 100, 200, 400, 800, 1600 ms (written in seconds, 0.01 to 1.6, for a '_seconds' metric).
Payments/payouts: extend the tail with buckets up to 5-10 s.

PromQL p95:

    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    )
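
To see why bucket boundaries matter, the following sketch approximates how a quantile is estimated from cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. This is a simplified illustration, not Prometheus's own code. It assumes positive bucket bounds and a '+Inf' bucket, and the sample counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan  # a +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The rank falls past the last finite bound: report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Invented cumulative counts for buckets with bounds in seconds.
sample = [(0.1, 60), (0.2, 90), (0.4, 98), (math.inf, 100)]
print(round(histogram_quantile(0.95, sample), 3))  # 0.325
print(histogram_quantile(0.99, sample))            # 0.4
```

The p99 result shows the cost of a short tail: once the rank lands in the '+Inf' bucket, the estimate is capped at the highest finite bound, which is why slow operations need buckets that extend further.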

With Exemplars (if enabled):

    histogram_quantile(0.95,
      sum by (le, route) (rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m]))
    )

5) Recording rules

Recording rules precompute heavy queries and standardize SLIs.

    groups:
      - name: api_sli
        interval: 30s
        rules:
          - record: job:http:success_ratio:rate5m
            expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - record: job:http:duration_p95:5m
            expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

6) SLO and alerts (multi-window burn)

SLO: 99.9% successful requests over 30 days.

    groups:
      - name: slo_burn
        rules:
          - alert: ErrorBudgetBurnHighShort
            expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 14
            for: 5m
            labels: { severity: critical }
            annotations: { summary: "Fast burn >14x for 5m" }

          - alert: ErrorBudgetBurnHighLong
            expr: (1 - job:http:success_ratio:rate5m) > (1 - 0.999) * 6
            for: 1h
            labels: { severity: critical }
            annotations: { summary: "Long burn >6x for 1h" }
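
The arithmetic behind the 14x and 6x factors can be sketched as follows. A burn rate of N means the error budget is consumed N times faster than the SLO allows, so a 30-day budget lasts 30/N days. The helper names are hypothetical; the factors and the SLO are the ones used in the rules above.

```python
def burn_threshold(slo: float, factor: float) -> float:
    """Error ratio that corresponds to burning the budget `factor` times too fast."""
    return (1 - slo) * factor


def hours_to_exhaust(window_days: float, factor: float) -> float:
    """Hours until the whole error budget is gone at a constant burn rate."""
    return window_days * 24 / factor


SLO = 0.999       # 99.9% successful requests
WINDOW_DAYS = 30  # SLO window

for factor in (14, 6):
    threshold = burn_threshold(SLO, factor)
    hours = hours_to_exhaust(WINDOW_DAYS, factor)
    print(f"{factor}x: error ratio > {threshold:.4f}, budget gone in {hours:.1f} h")
```

At 14x the budget lasts roughly 51 hours and at 6x it lasts 120 hours, which is why the fast-burn alert uses a short 'for' duration and the slow-burn alert can afford a longer one.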

Alertmanager (simplified):

    route:
      receiver: pager
      group_by: ["service"]
    receivers:
      - name: pager
        slack_configs:
          - channel: "#oncall"
            send_resolved: true

7) Label-hygiene and economy

Label names are stable and standardized: 'service', 'env', 'region', 'route', 'code', 'version'.
Limit cardinality: metrics with a 'route' label must use the 'http.route' template (not the full URL).
Sampling logic belongs in traces; metrics carry no user_id.
Release attributes ('service.version') are useful for comparing old and new versions.
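
As an illustration of the route rule, the sketch below collapses a full URL path into a bounded template by replacing numeric and UUID segments with a placeholder. It is a fallback heuristic under stated assumptions: when the web framework exposes the matched route pattern, that pattern is the better source. The function name and the '{id}' placeholder are hypothetical.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def route_template(path: str) -> str:
    """Turn a concrete request path into a low-cardinality route label value."""
    path = path.split("?", 1)[0]  # the query string never belongs in a label
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment) or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(route_template("/users/12345/orders/987?expand=items"))
# /users/{id}/orders/{id}
print(route_template("/payouts/3f2504e0-4f89-11d3-9a0c-0305e82c3301"))
# /payouts/{id}
```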

8) Scaling and HA

Prometheus scales vertically and by sharding scrape targets:

Two Prometheus replicas (A/B) scrape the same targets for HA; both fire the same alerts, which Alertmanager deduplicates.
Thanos: a Sidecar next to each Prometheus, plus Store and Query for global queries and long-term storage (S3/GCS).
Alternative: Cortex/Mimir (remote write, multi-tenancy, horizontal scaling).

Remote write (example):

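    remote_write:
      - url: https://mimir.example.com/api/v1/push
        basic_auth:
          username: tenantA
          # Prometheus does not expand environment variables in this file;
          # substitute the token at deploy time or use password_file instead.
          password: $MIMIR_TOKEN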

Local TSDB Retention:

    --storage.tsdb.retention.time=15d
    --storage.tsdb.max-block-duration=2h

9) Grafana: dashboards, alerts, annotations

Standard dashboards:

1. Platform Overview (SLO/RED, error-budget).

2. API by Route (RPS/5xx/p95, comparison 'version').

3. K8s Cluster/Nodes (control-plane, saturation).

4. DB/Cache/Queues (lag/locks/hit ratio/backlog).

5. Per-Release (before/after, release annotations from CI).

Grafana Alerting: PromQL-based triggers, on-call rotations, and mute timings for release windows.

Annotations: CI adds a release event with 'commit', 'image.tag', and a link to the pipeline.
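
A CI step that posts such a release event could look roughly like the sketch below. It builds a request for Grafana's annotations HTTP API ('POST /api/annotations') but does not send it. The Grafana URL, token, tag scheme, and function name are assumptions made for illustration, and the exact payload should be checked against the API documentation for your Grafana version.

```python
import json
import time
import urllib.request


def build_release_annotation(grafana_url, token, service, commit, image_tag,
                             pipeline_url, now_ms=None):
    """Build (but do not send) a request that creates a release annotation."""
    payload = {
        # Grafana expects epoch milliseconds.
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        # Hypothetical tag scheme; dashboards can filter annotations by tag.
        "tags": ["release", f"service:{service}"],
        "text": f"Release {image_tag} (commit {commit}) {pipeline_url}",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values; a real pipeline would read these from CI variables.
request = build_release_annotation(
    "https://grafana.example.com", "TOKEN", "payments",
    "abc1234", "payments:1.42.0", "https://ci.example.com/pipelines/1",
    now_ms=1_700_000_000_000,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to 'urllib.request.urlopen' would send it; the sketch stops short of that so it can run without a Grafana instance.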

10) Kubernetes: what to measure

Control-plane: `apiserver_request_total`, etcd leader/fsync, scheduler latency.
Workloads: restarts, 'container_cpu_cfs_throttled_seconds_total', OOM, Pending/Evicted, PDB violations.
Network: drops, conntrack, 'kube-proxy' errors.
Quotas/limits: Requests vs Limits, HPA/VPA, node saturation.

11) DB/caches/queues: key signals

PostgreSQL/MySQL: `connections`, `locks`, `deadlocks_total`, `xact_commit/rollback`, replication lag.
Redis: hit ratio, `evictions`, latency, `instantaneous_ops_per_sec`.
Kafka/RabbitMQ: consumer lag, unacked, ISR, disk usage.

Examples of PromQL:

    # Queue backlog
    sum by (topic) (kafka_consumergroup_lag) > 1000

    # Postgres replication lag
    max(pg_replication_lag_seconds) > 2

12) Security and multi-tenancy

RBAC for Prometheus/Grafana, data source permissions.
TLS/mTLS on ingress and between components.
Tenant isolation: separate Prometheus instances or a tenant label in Cortex/Mimir; series and query limits.
No secrets or PII in alerts or notifications (use a ticket ID instead).

13) Integration with releases and auto-rollbacks

SLO rules feed an AnalysisTemplate (Argo Rollouts) or a CI gate.
When burn alerts fire, pause or roll back the canary and record a link to the release in the log or annotation.
Compare stable and canary versions via the 'version' label.
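
The decision a CI gate makes from these signals can be sketched as follows. The inputs would come from PromQL queries split by the 'version' label; here they are plain numbers. The 'max_p95_ratio' threshold, the verdict names, and the 'VersionStats' type are hypothetical, and the error check reuses the 14x burn factor from the SLO section.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    error_ratio: float   # share of failed requests, 0..1
    p95_seconds: float   # p95 latency for this version


def canary_verdict(stable: VersionStats, canary: VersionStats,
                   slo: float = 0.999, burn_factor: float = 14.0,
                   max_p95_ratio: float = 1.2) -> str:
    """Return 'rollback', 'pause' or 'promote' for a canary release."""
    # Fast error-budget burn on the canary: roll back immediately.
    if canary.error_ratio > (1 - slo) * burn_factor:
        return "rollback"
    # Latency regression relative to stable: pause for investigation.
    if stable.p95_seconds > 0 and canary.p95_seconds / stable.p95_seconds > max_p95_ratio:
        return "pause"
    return "promote"


stable = VersionStats(error_ratio=0.0005, p95_seconds=0.20)
print(canary_verdict(stable, VersionStats(0.02, 0.21)))   # rollback
print(canary_verdict(stable, VersionStats(0.001, 0.30)))  # pause
print(canary_verdict(stable, VersionStats(0.001, 0.21)))  # promote
```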

14) Typical errors (anti-patterns)

Uncontrolled label cardinality (user_id, url.full, dynamic keys).
Mixing prod and stage in the same cluster without an 'env' label.
Gauges only, with no RED/USE coverage and no histograms for p95/p99.
Hardware-level alerts not tied to an SLO → noise.
No recording rules → heavy queries during production incidents.
No release annotations → changes are hard to correlate with degradations.

15) Implementation checklist (0-45 days)

0-10 days

Node/kube-state/cAdvisor exporters; '/metrics' in services.
Basic RED/USE dashboards; standard histogram buckets.
Enable release annotations from CI.

11-25 days

Recording rules for SLI; multi-window burn alerts.
HA Prometheus (dual scrape), configs backed up via GitOps.
Alertmanager: routes, silences, on-call rotations.

26-45 days

Remote write to Thanos/Cortex/Mimir, long-term storage.
Cardinality optimization; series and query limits.
SLO-gating releases and auto-rollback integration.

16) Maturity metrics

RED/USE coverage for key services ≥ 95%.
Heavy PromQL queries complete in < 2 s (p95) thanks to recording rules.
Ratio of useful to noisy alerts > 3:1.
Cardinality under control: < 10M active series per cluster, no spikes.
100% of releases are annotated in Grafana and correlated with before/after metrics.

17) Useful snippets

Stable vs canary comparison by version

    histogram_quantile(0.95,
      sum by (le, version) (rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m]))
    )

5xx errors by route

    topk(5,
      sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
    )

Container CPU saturation

    rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1

Linking metrics to traces (exemplars enabled)

    sum by (le) (rate(http_request_duration_seconds_bucket[5m]))  # exemplars link through to the trace

18) Conclusion

Prometheus + Grafana is the de facto standard for metrics. Semantics and discipline are what make it work: RED/USE, tidy labels, histograms for SLOs, recording rules, and SLO alerts. Add HA and long-term storage, release annotations, and integration with auto-rollbacks, and you have a fast, scalable, and economical metrics pipeline that helps you make decisions in production.
