Prometheus: collecting metrics
(Section: Technology and Infrastructure)
Brief Summary
Prometheus is the industry standard for time-series metrics: it scrapes targets over HTTP, stores series in its TSDB, computes aggregates with PromQL, and fires alerts through Alertmanager. For iGaming, it is the foundation of the SLO approach (RED/USE, business payment metrics), fast p95/p99 diagnostics, and automated decisions (freeze/rollback).
1) Data model and cardinality
Metric: `name{label1="v1",label2="v2"} value @ timestamp`.
Cardinality = the product of the unique value counts across all labels; the main cost driver. Typical labels:
- base: `service`, `env`, `region`, `instance`, `pod`, `container`, `version`;
- domain: `route`, `psp`, `tenant` (caution!), `game_provider`;
- never: `user_id`, `session_id`, or any random/high-cardinality values.
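The multiplication behind that definition is easy to underestimate; a quick back-of-the-envelope sketch (all counts are hypothetical):

```python
# Hypothetical label-value counts for one payments-API metric.
routes, methods, statuses = 20, 4, 8   # bounded label values: fine
pods, versions = 50, 2                 # bounded by deployment size: fine

series = routes * methods * statuses * pods * versions
print(series)  # 64000 series for a single metric name

# An unbounded label (e.g. user_id with 1M values) multiplies this further:
print(series * 1_000_000)  # why user_id/session_id are forbidden as labels
```

Every one of those series carries its own chunks, index entries, and query cost, which is why each new label should pass review first.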
2) Types of metrics
Counter - only grows (e.g. `http_requests_total`).
Gauge - instantaneous values (e.g. `queue_depth`).
Histogram/Summary - latency distributions. In prod, prefer Histogram (it supports `histogram_quantile()` and exemplars).
Native Histograms - variable-resolution buckets that improve accuracy and reduce storage size (enable where available).
```go
import "github.com/prometheus/client_golang/prometheus"

var httpLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP latency",
		Buckets: prometheus.DefBuckets, // or custom
	},
	[]string{"route", "method"},
)
```
3) Exporters and what to measure
Service: your code (SDK for Go/Java/Node/Python), RED API metrics, business metrics (payment conversion).
System: node_exporter, cAdvisor/kubelet.
Third-party: DB/caches (mysqld_exporter, postgres_exporter, redis_exporter), NGINX/HAProxy, Kafka/RabbitMQ.
OTel metrics: via the OpenTelemetry Collector → Prometheus remote write, or the Prometheus receiver → into the shared stack.
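One possible wiring is a minimal Collector pipeline that receives OTLP and forwards to Prometheus remote write (an illustrative sketch; the endpoint URL is a placeholder, and the `prometheusremotewrite` exporter requires a Collector build that includes it, e.g. otelcol-contrib):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.internal/api/v1/write"  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```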
4) Scrape and relabel: how to connect targets
Basic `prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: "prod"
    region: "eu-west"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']

  - job_name: 'payments-api'
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/ssl/ca.crt
      cert_file: /etc/ssl/tls.crt
      key_file: /etc/ssl/tls.key
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance
        replacement: '$1'
```
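The relabel pattern (everything before the port becomes `instance`) can be sanity-checked outside Prometheus; Go's RE2 and Python's `re` agree on this expression, and relabel regexes are fully anchored, which `fullmatch` mirrors (a quick illustrative check):

```python
import re

# The same pattern as in relabel_configs above.
pattern = re.compile(r'(.*):\d+')

address = '10.0.1.10:9100'        # what Prometheus puts into __address__
match = pattern.fullmatch(address)  # relabel regexes are anchored end-to-end
instance = match.group(1)           # what '$1' expands to in `replacement`
print(instance)  # 10.0.1.10
```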
Kubernetes via Prometheus Operator
Use ServiceMonitor/PodMonitor instead of manual `scrape_configs`.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-api
spec:
  selector:
    matchLabels:
      app: payments-api
  namespaceSelector:
    matchNames: ["prod"]
  endpoints:
    - port: metrics
      interval: 15s
      scheme: http
      relabelings:
        - action: replace
          targetLabel: service
          replacement: "payments-api"
```
K8s annotations (without the Operator, simplified)

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```
5) Storage: TSDB, WAL and retention
WAL (Write-Ahead Log) → rapid recovery from restart.
Compaction: block compression, disk/CPU savings.
Retention: keep hot data for 7-30 days; move long-term data out (see Scaling).
- `--storage.tsdb.retention.time=15d`
- `--storage.tsdb.max-block-chunk-segment-size`
- Disk: fast SSD/NVMe; avoid network volumes unless necessary.
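Retention sizing can be estimated from the ingest rate; Prometheus TSDB averages roughly 1-2 bytes per sample after compression. A rough sketch, where every input number is an assumption to replace with your own:

```python
targets = 200            # scraped endpoints (assumption)
series_per_target = 500  # series each endpoint exposes (assumption)
scrape_interval = 15     # seconds
bytes_per_sample = 1.5   # typical post-compression average (rule of thumb)
retention_days = 15

samples_per_sec = targets * series_per_target / scrape_interval
disk_bytes = samples_per_sec * bytes_per_sample * retention_days * 86_400
print(f"{samples_per_sec:.0f} samples/s, ~{disk_bytes / 1e9:.0f} GB for {retention_days}d")
```

This ignores WAL overhead and churn (short-lived series inflate the index), so treat it as a lower bound when provisioning.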
6) PromQL: fundamentals and frequent patterns
Rate/irate
```promql
rate(http_requests_total{route="/deposit"}[5m])
```
Errors and success rates
```promql
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```
p95 latency
```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
```
Queues/Saturation
```promql
max(queue_depth{queue="withdrawals"}) by (region)
```
7) Recording rules and performance
Precompute heavy expressions and store them as new series.
```yaml
groups:
  - name: api.rules
    interval: 30s
    rules:
      - record: job:http:request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http:success_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
```
Result: fast dashboards and less load on the Prometheus CPU.
8) Alerting and SLO (burn rate)
Burn-rate alerts (multi-window, multi-burn-rate)
```yaml
groups:
  - name: slo.payments
    rules:
      - alert: PaymentsSLOFastBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > 14 * (1 - 0.999)
        for: 5m
        labels: { severity: "page" }
        annotations:
          summary: "SLO fast burn"
          runbook: "https://runbooks/payments/slo"
      - alert: PaymentsSLOSlowBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > 6 * (1 - 0.999)
        for: 1h
        labels: { severity: "ticket" }
```
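The multipliers come from the multi-window burn-rate methodology: a burn rate of 14.4 exhausts a 30-day error budget in about 2 days, a rate of 6 in 5 days (14 above is a common rounding of 14.4). The arithmetic is simple:

```python
slo = 0.999
error_budget = 1 - slo  # fraction of requests allowed to fail over the window
window_days = 30

# burn rate = observed error ratio / error budget ratio;
# at burn rate B, the budget lasts window_days / B.
for burn_rate in (14.4, 6, 1):
    days_to_exhaust = window_days / burn_rate
    print(f"burn rate {burn_rate}: budget exhausted in {days_to_exhaust:.1f} days")
```

This is why the fast-burn alert pages (hours to react) while the slow-burn one only files a ticket.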
Alertmanager: routing by service/region, deduplication and inhibition, ChatOps integrations.
9) Correlation with traces and logs
Enable exemplars: clickable `trace_id` links from histogram buckets.
Put `service`, `version`, and `region` labels on metrics for release comparison.
On dashboards, add release annotations (Git SHA/version).
10) Scaling and long-term storage
Federation: an upper-level Prometheus scrapes aggregates from lower-level ones (filtered by job/labels).
Remote Write: streaming samples to long-term storage backends/clusters (Thanos/Cortex/Mimir).
Pros: infinite retention, horizontal scaling, global view.
Cons: more difficult to operate, cost.
Sharding by function: separate instances for system metrics, business, security.
11) Security
TLS/mTLS between Prometheus ↔ targets, Alertmanager, and remote_write endpoints.
Basic/token authentication for targets and the API (behind a reverse proxy/gateway).
RBAC: restrict UI/series access by role; hide sensitive labels.
PII hygiene: do not write PII into metrics; use hashes/aliases.
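When a per-customer dimension is genuinely needed (in logs or joins, never as a metric label), a keyed hash keeps the raw identifier out of telemetry. A minimal sketch; the salt value and its handling are placeholders, and real deployments should manage the key in a secret store and rotate it:

```python
import hashlib
import hmac

SALT = b"rotate-me-regularly"  # placeholder; keep out of source control

def alias(user_id: str) -> str:
    """Stable, non-reversible alias for an identifier (for logs, NOT metric labels)."""
    return hmac.new(SALT, user_id.encode(), hashlib.sha256).hexdigest()[:16]

print(alias("user-42"))                       # stable: same input, same alias
print(alias("user-42") != alias("user-43"))   # distinct inputs get distinct aliases
```

An HMAC (rather than a bare hash) prevents dictionary attacks against predictable identifiers as long as the key stays secret.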
12) Kubernetes practices
Prometheus Operator: CRD (ServiceMonitor, PodMonitor, Alertmanager, Prometheus).
kube-state-metrics + cAdvisor → a complete picture of the cluster.
Taints and resources: dedicated nodes for monitoring; CPU/RAM limits.
Noise reduction: label selectors for production namespaces; widen `scrape_interval` where possible.
13) Business Metrics and Product
Payments: `payments_success_total{psp, currency}`, `payment_conversion_ratio`, `ttw_seconds_histogram`.
Game activity: bets/min, active sessions as a gauge, drop-off on errors.
Risk/fraud: triggers on velocity/geo anomalies; log details separately, keep metrics as aggregates.
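Per-PSP payment conversion can then be derived directly from such counters (an illustrative query; `payments_attempts_total` is an assumed companion counter, not named in the list above):

```promql
sum by (psp) (rate(payments_success_total[15m]))
  / sum by (psp) (rate(payments_attempts_total[15m]))
```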
14) Cost and performance (FinOps)
Control cardinality (review before adding any new label).
Coarser histograms / infrequent exporters → raise `scrape_interval` for non-critical targets.
Downsampling in long-term storage backends.
Dashboard caching and broad reliance on recording rules.
15) Quick-start examples
RED Exporter in Application (Python)
python from prometheus_client import Counter, Histogram, start_http_server reqs = Counter('http_requests_total','', ['route','method','status'])
lat = Histogram('http_request_duration_seconds','', ['route','method'])
start_http_server(8000)
def handle(req):
with lat. labels(req. route, req. method). time():
status = app(req)
reqs. labels(req. route, req. method, str(status)). inc()
return status
p95 threshold alert

```yaml
- alert: HighLatencyP95
  expr: |
    histogram_quantile(0.95,
      sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.25
  for: 10m
  labels: { severity: "page", service: "api" }
```
16) Implementation checklist
1. Define a set of basic metrics (RED/USE) and domain metrics.
2. Agree on label conventions and a cardinality guideline.
3. Configure scrape/ServiceMonitor, TLS/mTLS, relabel.
4. Include histograms for key paths and exemplars.
5. Create recording rules for p95, success ratio, business aggregates.
6. Introduce SLO (burn-rate) alerts and Alertmanager routing.
7. Build dashboards: service map, release comparison, payments.
8. Decide on federation/remote_write and retention.
9. Restrict access (RBAC), check the absence of PII.
10. Write runbooks and run game-day exercises.
17) Anti-patterns
Labels with high cardinality (user/session/request_id).
Summary instead of Histogram for key SLOs → no `histogram_quantile()`.
Scraping everything indiscriminately without filtering/relabeling → growing cost and noise.
Alerts on raw metrics without SLOs → alert fatigue.
No recording rules → "heavy" dashboards.
Trusting metrics without TLS/mTLS → risk of spoofing/leakage.
Summary
Prometheus gives an iGaming platform goal-driven observability: accurate histograms, stable aggregates, clear SLO alerts, and scaling to a multi-region footprint. Label discipline, correct recording rules, links to traces and logs, and a well-thought-out storage architecture enable fast releases and a predictable p99 even at peak load.