Prometheus: collecting metrics
(Section: Technology and Infrastructure)
Brief Summary
Prometheus is the industry standard for time-series metrics: it scrapes targets over HTTP, stores series in its TSDB, computes aggregates with PromQL, and fires alerts through Alertmanager. For iGaming, it is the foundation of the SLO approach (RED/USE, business payment metrics), fast p95/p99 diagnostics, and automated decisions (freeze/rollback).
1) Data model and cardinality
Metric: `name{label1="v1",label2="v2"} value @ timestamp`.
Cardinality = the product of the unique value counts across all labels; the main cost driver. Typical labels:
- base: `service`, `env`, `region`, `instance`, `pod`, `container`, `version`;
- domain: `route`, `psp`, `tenant` (caution!), `game_provider`;
- never: `user_id`, `session_id`, or any random/high-cardinality values.
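The multiplication behind that definition is easy to underestimate; a quick back-of-the-envelope sketch (all counts are hypothetical):

```python
# Hypothetical label-value counts for one payments-API metric.
routes, methods, statuses = 20, 4, 8   # bounded label values: fine
pods, versions = 50, 2                 # bounded by deployment size: fine

series = routes * methods * statuses * pods * versions
print(series)  # 64000 series for a single metric name

# An unbounded label (e.g. user_id with 1M values) multiplies this further:
print(series * 1_000_000)  # why user_id/session_id are forbidden as labels
```

Every one of those series carries its own chunks, index entries, and query cost, which is why each new label should pass review first.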
2) Types of metrics
Counter - only grows (e.g. `http_requests_total`).
Gauge - instantaneous values (e.g. `queue_depth`).
Histogram/Summary - latency distributions. In prod, prefer Histogram (it supports `histogram_quantile()` and exemplars).
Native Histograms - variable-resolution buckets that improve accuracy and reduce storage size (enable where available).
```go
import "github.com/prometheus/client_golang/prometheus"

var httpLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP latency",
		Buckets: prometheus.DefBuckets, // or custom
	},
	[]string{"route", "method"},
)
```
3) Exporters and what to measure
Service: your code (SDK for Go/Java/Node/Python), RED API metrics, business metrics (payment conversion).
System: node_exporter, cAdvisor/kubelet.
Third-party: DB/caches (mysqld_exporter, postgres_exporter, redis_exporter), NGINX/HAProxy, Kafka/RabbitMQ.
OTel metrics: via the OpenTelemetry Collector → Prometheus remote write, or the Prometheus receiver → into the shared stack.
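One possible wiring is a minimal Collector pipeline that receives OTLP and forwards to Prometheus remote write (an illustrative sketch; the endpoint URL is a placeholder, and the `prometheusremotewrite` exporter requires a Collector build that includes it, e.g. otelcol-contrib):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.internal/api/v1/write"  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```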
4) Scrape and relabel: how to connect targets
Basic `prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: "prod"
    region: "eu-west"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']

  - job_name: 'payments-api'
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/ssl/ca.crt
      cert_file: /etc/ssl/tls.crt
      key_file: /etc/ssl/tls.key
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance
        replacement: '$1'
```
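The relabel pattern (everything before the port becomes `instance`) can be sanity-checked outside Prometheus; Go's RE2 and Python's `re` agree on this expression, and relabel regexes are fully anchored, which `fullmatch` mirrors (a quick illustrative check):

```python
import re

# The same pattern as in relabel_configs above.
pattern = re.compile(r'(.*):\d+')

address = '10.0.1.10:9100'        # what Prometheus puts into __address__
match = pattern.fullmatch(address)  # relabel regexes are anchored end-to-end
instance = match.group(1)           # what '$1' expands to in `replacement`
print(instance)  # 10.0.1.10
```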
Kubernetes via Prometheus Operator
Use ServiceMonitor/PodMonitor instead of manual `scrape_configs`.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
spec:
  selector:
    matchLabels:
      app: payments-api
  namespaceSelector:
    matchNames: ["prod"]
  endpoints:
    - port: metrics
      interval: 15s
      scheme: http
      relabelings:
        - action: replace
          targetLabel: service
          replacement: "payments-api"
```
K8s annotations (without the Operator, simplified)

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```
5) Storage: TSDB, WAL and retention
WAL (Write-Ahead Log) → rapid recovery from restart.
Compaction: block compression, disk/CPU savings.
Retention: keep hot data for 7-30 days; move long-term data out (see Scaling).
- `--storage.tsdb.retention.time=15d`
- `--storage.tsdb.max-block-chunk-segment-size`
- Disk: fast SSD/NVMe; avoid network volumes unless necessary.
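Retention sizing can be estimated from the ingest rate; Prometheus TSDB averages roughly 1-2 bytes per sample after compression. A rough sketch, where every input number is an assumption to replace with your own:

```python
targets = 200            # scraped endpoints (assumption)
series_per_target = 500  # series each endpoint exposes (assumption)
scrape_interval = 15     # seconds
bytes_per_sample = 1.5   # typical post-compression average (rule of thumb)
retention_days = 15

samples_per_sec = targets * series_per_target / scrape_interval
disk_bytes = samples_per_sec * bytes_per_sample * retention_days * 86_400
print(f"{samples_per_sec:.0f} samples/s, ~{disk_bytes / 1e9:.0f} GB for {retention_days}d")
```

This ignores WAL overhead and churn (short-lived series inflate the index), so treat it as a lower bound when provisioning.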
6) PromQL: fundamentals and frequent patterns
Rate/irate
```promql
rate(http_requests_total{route="/deposit"}[5m])
```
Errors and success rates
```promql
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```
p95 latency
```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
```
Queues/Saturation
```promql
max(queue_depth{queue="withdrawals"}) by (region)
```
7) Recording rules and performance
Precompute heavy expressions and store them as new series.
```yaml
groups:
  - name: api.rules
    interval: 30s
    rules:
      - record: job:http:request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http:success_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```
Result: fast dashboards and less load on the Prometheus CPU.
8) Alerting and SLO (burn rate)
Burn-rate alerts (multi-window, multi-burn-rate)
```yaml
groups:
  - name: slo.payments
    rules:
      - alert: PaymentsSLOFastBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > 14 * (1 - 0.999)
        for: 5m
        labels: { severity: "page" }
        annotations:
          summary: "SLO fast burn"
          runbook: "https://runbooks/payments/slo"
      - alert: PaymentsSLOSlowBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > 6 * (1 - 0.999)
        for: 1h
        labels: { severity: "ticket" }
```
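The multipliers come from the multi-window burn-rate methodology: a burn rate of 14.4 exhausts a 30-day error budget in about 2 days, a rate of 6 in 5 days (14 above is a common rounding of 14.4). The arithmetic is simple:

```python
slo = 0.999
error_budget = 1 - slo  # fraction of requests allowed to fail over the window
window_days = 30

# burn rate = observed error ratio / error budget ratio;
# at burn rate B, the budget lasts window_days / B.
for burn_rate in (14.4, 6, 1):
    days_to_exhaust = window_days / burn_rate
    print(f"burn rate {burn_rate}: budget exhausted in {days_to_exhaust:.1f} days")
```

This is why the fast-burn alert pages (hours to react) while the slow-burn one only files a ticket.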
Alertmanager: routing by service/region, deduplication and inhibition, ChatOps integrations.
9) Correlation with traces and logs
Enable exemplars: clickable `trace_id` links from histogram buckets.
Put `service`, `version`, and `region` labels on metrics for release comparison.
On dashboards, add release annotations (Git SHA/version).
10) Scaling and long-term storage
Federation: an upper-level Prometheus scrapes aggregates from lower-level ones (filtered by job/labels).
Remote Write: streaming samples to long-term storage backends/clusters (Thanos/Cortex/Mimir).
Pros: infinite retention, horizontal scaling, global view.
Cons: more difficult to operate, cost.
Sharding by function: separate instances for system metrics, business, security.
11) Security
TLS/mTLS between Prometheus ↔ targets, Alertmanager, and remote_write endpoints.
Basic/token authentication for targets and the API (behind a reverse proxy/gateway).
RBAC: restrict UI/series access by role; hide sensitive labels.
PII hygiene: do not write PII into metrics; use hashes/aliases.
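When a per-customer dimension is genuinely needed (in logs or joins, never as a metric label), a keyed hash keeps the raw identifier out of telemetry. A minimal sketch; the salt value and its handling are placeholders, and real deployments should manage the key in a secret store and rotate it:

```python
import hashlib
import hmac

SALT = b"rotate-me-regularly"  # placeholder; keep out of source control

def alias(user_id: str) -> str:
    """Stable, non-reversible alias for an identifier (for logs, NOT metric labels)."""
    return hmac.new(SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]

print(alias("user-42"))                       # stable: same input, same alias
print(alias("user-42") != alias("user-43"))   # distinct inputs get distinct aliases
```

An HMAC (rather than a bare hash) prevents dictionary attacks against predictable identifiers as long as the key stays secret.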
12) Kubernetes practices
Prometheus Operator: CRD (ServiceMonitor, PodMonitor, Alertmanager, Prometheus).
kube-state-metrics + cAdvisor → a complete picture of the cluster.
Taints and resources: dedicated nodes for monitoring; CPU/RAM limits.
Noise reduction: label selectors for production namespaces; widen `scrape_interval` where possible.
13) Business Metrics and Product
Payments: `payments_success_total{psp, currency}`, `payment_conversion_ratio`, `ttw_seconds_histogram`.
Game activity: bets/min, active sessions as a gauge, drop-off on errors.
Risk/fraud: triggers on velocity/geo anomalies; log details separately, keep metrics as aggregates.
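Per-PSP payment conversion can then be derived directly from such counters (an illustrative query; `payments_attempts_total` is an assumed companion counter, not named in the list above):

```promql
sum by (psp) (rate(payments_success_total[15m]))
  / sum by (psp) (rate(payments_attempts_total[15m]))
```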
14) Cost and performance (FinOps)
Control cardinality (review before adding any new label).
Coarser histograms / infrequent exporters → raise `scrape_interval` for non-critical targets.
Downsampling in long-term storage backends.
Dashboard caching and broad reliance on recording rules.
15) Quick-start examples
RED Exporter in Application (Python)
python from prometheus_client import Counter, Histogram, start_http_server reqs = Counter('http_requests_total','', ['route','method','status'])
lat = Histogram('http_request_duration_seconds','', ['route','method'])
start_http_server(8000)
def handle(req):
with lat. labels(req. route, req. method). time():
status = app(req)
reqs. labels(req. route, req. method, str(status)). inc()
return status
p95 threshold alert

```yaml
- alert: HighLatencyP95
  expr: |
    histogram_quantile(0.95,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.25
  for: 10m
  labels: { severity: "page", service: "api" }
```
16) Implementation checklist
1. Define a set of basic metrics (RED/USE) and domain metrics.
2. Agree on label conventions and a cardinality guideline.
3. Configure scrape/ServiceMonitor, TLS/mTLS, relabel.
4. Include histograms for key paths and exemplars.
5. Create recording rules for p95, success ratio, business aggregates.
6. Introduce SLO (burn-rate) alerts and Alertmanager routing.
7. Build dashboards: service map, release comparison, payments.
8. Decide on federation/remote_write and retention.
9. Restrict access (RBAC), check the absence of PII.
10. Write runbooks and run game-day exercises.
17) Anti-patterns
Labels with high cardinality (user/session/request_id).
Summary instead of Histogram for key SLOs → no `histogram_quantile()`.
Scraping everything indiscriminately without filtering/relabeling → growing cost and noise.
Alerts on raw metrics without SLOs → alert fatigue.
No recording rules → "heavy" dashboards.
Trusting metrics without TLS/mTLS → risk of spoofing/leakage.
Summary
Prometheus gives an iGaming platform goal-driven observability: accurate histograms, stable aggregates, clear SLO alerts, and scaling to a multi-region footprint. Label discipline, correct recording rules, links to traces and logs, and a well-thought-out storage architecture enable fast releases and a predictable p99 even at peak load.