GH GambleHub

Prometheus: collecting metrics

Brief Summary

Prometheus is the de facto industry standard for time-series metrics: it scrapes targets over HTTP, stores series in its TSDB, computes aggregates with PromQL, and fires alerts through Alertmanager. For iGaming, this is the basis of the SLO approach (RED/USE, business payment metrics), fast p95/p99 diagnostics, and automated responses (freeze/rollback).

1) Data model and cardinality

Metric: `name{label1="v1", label2="v2"} value @ timestamp`.
Cardinality = the number of unique label sets per metric (at worst, the product of the distinct values of every label); it is the main cost factor.

Label practices:
  • basic: `service`, `env`, `region`, `instance`, `pod`, `container`, `version`;
  • domain: `route`, `psp`, `tenant` (caution!), `game_provider`;
  • never use `user_id`, `session_id`, or other random/high-cardinality values as labels.
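To make the cost of a bad label concrete, here is a minimal sketch that computes the worst-case series count for one metric as the product of distinct values per label. The label counts are hypothetical numbers chosen for illustration, not measurements.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical number of distinct values per label (assumed for illustration).
safe = {"service": 20, "env": 2, "region": 3, "route": 30, "psp": 10}
risky = {**safe, "user_id": 100_000}  # one unbounded label added

print(max_series(safe))   # 36000
print(max_series(risky))  # 3600000000
```

Real series counts are usually lower than this bound because not every combination occurs, but a single unbounded label still multiplies whatever the real number is.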

2) Types of metrics

Counter - only grows (for example, `http_requests_total`).
Gauge - an instantaneous value (for example, `queue_depth`).
Histogram/Summary - latency distributions. In production, prefer Histogram (it supports `histogram_quantile()` and exemplars).
Native Histograms - dynamically sized buckets that improve accuracy and reduce storage size (enable where available).

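Example (Go):

```go
import "github.com/prometheus/client_golang/prometheus"

// httpLatency records request duration in seconds, partitioned by route and method.
var httpLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP latency",
		Buckets: prometheus.DefBuckets, // or custom
	},
	[]string{"route", "method"},
)

func init() {
	prometheus.MustRegister(httpLatency) // make the metric visible on /metrics
}
```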

3) Exporters and what to measure

Service: your code (SDK for Go/Java/Node/Python), RED API metrics, business metrics (payment conversion).
System: node_exporter, cAdvisor/kubelet.
Third-party: DB/caches (mysqld_exporter, postgres_exporter, redis_exporter), NGINX/HAProxy, Kafka/RabbitMQ.
OTel metrics: via the OpenTelemetry Collector → Prometheus Remote Write or the Prometheus receiver → into the common stack.

4) Scrape and relabel: how to connect targets

Basic `prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: "prod"
    region: "eu-west"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100']

  - job_name: 'payments-api'
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/ssl/ca.crt
      cert_file: /etc/ssl/tls.crt
      key_file: /etc/ssl/tls.key
    relabel_configs:
      # Strip the port from __address__ and keep the host as the instance label.
      - source_labels: [__address__]
        regex: '(.*):\d+'
        target_label: instance
        replacement: '$1'
```
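Relabel regexes in Prometheus are anchored at both ends, so the pattern has to match the whole `__address__` value: a pattern such as `(.*):\d+` captures the host part, while a single-character group like `(.)` would not match a full address and would leave the labels unchanged. The sketch below is a simplified Python model of a rule with `action: replace`; the function name `relabel_replace` is hypothetical, and Python's `re` only approximates the RE2 syntax Prometheus uses.

```python
import re


def relabel_replace(labels: dict, source_labels: list, regex: str,
                    target_label: str, replacement: str,
                    separator: str = ";") -> dict:
    """Simplified model of a relabel rule with action: replace."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # anchored at both ends, like Prometheus
    if match is None:
        return labels  # no match: the label set is left untouched
    # Translate $1-style references into Python's \1 syntax ($1..$9 only).
    py_replacement = re.sub(r"\$(\d)", r"\\\1", replacement)
    return {**labels, target_label: match.expand(py_replacement)}


target = {"__address__": "10.0.1.10:8443"}
print(relabel_replace(target, ["__address__"], r"(.*):\d+", "instance", "$1"))
print(relabel_replace(target, ["__address__"], r"(.):\d+", "instance", "$1"))
```

The first call adds `instance` with the host only; the second returns the labels unchanged because the pattern cannot cover the whole address.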

Kubernetes via Prometheus Operator

Use ServiceMonitor/PodMonitor instead of manual `scrape_configs`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata: { name: payments-api }
spec:
  selector: { matchLabels: { app: payments-api } }
  namespaceSelector: { matchNames: ["prod"] }
  endpoints:
    - port: metrics
      interval: 15s
      scheme: http
      relabelings:
        - action: replace
          targetLabel: service
          replacement: "payments-api"
```

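Kubernetes annotations (without Operator, simplified)

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```

These annotations are a convention rather than a built-in feature: they take effect only when the scrape job uses `kubernetes_sd_configs` with relabel rules that read them.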

5) Storage: TSDB, WAL and retention

WAL (Write-Ahead Log) → fast recovery after a restart.
Compaction: merging blocks into larger ones, which saves disk and CPU.
Retention: keep hot data for 7-30 days; move long-term data elsewhere (see Scaling).

Tuning:
  • `--storage.tsdb.retention.time=15d`
  • `--storage.tsdb.max-block-chunk-segment-size`
  • Disk: fast SSD/NVMe; avoid network volumes unless necessary.
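For capacity planning, the Prometheus documentation gives a rough formula: disk space ≈ retention in seconds × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample after compression. The sketch below applies it; the load figures (500k active series, 15s interval) are assumptions for illustration only.

```python
def tsdb_disk_bytes(retention_days: float, samples_per_second: float,
                    bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Hypothetical load: 500k active series scraped every 15 seconds.
samples_per_second = 500_000 / 15
estimate = tsdb_disk_bytes(15, samples_per_second)
print(f"about {estimate / 1e9:.0f} GB for 15 days of retention")
```

Treat the result as an order-of-magnitude figure and leave headroom for the WAL and for compaction, which temporarily needs extra space.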

6) PromQL: fundamentals and frequent patterns

Rate/irate

```promql
rate(http_requests_total{route="/deposit"}[5m])
```
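What `rate()` does with a counter can be sketched in a few lines: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. This is a simplified model with a hypothetical function name; the real PromQL function also extrapolates to the window boundaries, which is omitted here.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over (timestamp, value) samples of a counter.

    Simplified: compensates for counter resets but, unlike PromQL rate(),
    does not extrapolate to the edges of the range window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at zero,
        # so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter restarts between t=15 and t=30 (130 -> 10).
window = [(0, 100), (15, 130), (30, 10), (45, 40), (60, 70)]
print(simple_rate(window))
```

This is why counters must only grow: a decrease is the signal `rate()` uses to detect a restart.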

Error and success ratios

```promql
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/ sum(rate(http_requests_total[5m]))
```

p95 latency

```promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)
```
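The quantile returned by `histogram_quantile()` is an estimate: it finds the bucket that contains the target rank and interpolates linearly inside it, so accuracy depends on bucket boundaries. The following sketch models that behaviour for classic histograms with positive bounds; the bucket counts are invented, and edge cases handled by the real function are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model of the classic-histogram logic: linear interpolation
    inside the bucket that contains the target rank.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch supports 0 < q <= 1 only")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # Rank falls beyond the last finite bound: report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    # The first bucket is assumed to start at 0 (true for latency buckets).
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for le = 0.1, 0.25, 0.5, 1.0, +Inf.
observed = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.9, observed))
```

If the p95 or p99 you care about sits near an SLO threshold, place a bucket boundary at that threshold so the estimate is exact where it matters.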

Queues/Saturation

```promql
max(queue_depth{queue="withdrawals"}) by (region)
```

7) Recording rules and performance

Precompute heavy expressions and store the results as series.

```yaml
groups:
  - name: api.rules
    interval: 30s
    rules:
      - record: job:http:request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
      # Aggregated by job so that the result keeps the job label.
      - record: job:http:success_ratio
        expr: |
          sum by (job) (rate(http_requests_total{status=~"2..|3.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
```

Benefit: fast dashboards and less CPU load on Prometheus.

8) Alerting and SLO (burn rate)

Burn-rate alerts (multi-window, multi-burn)

```yaml
groups:
  - name: slo.payments
    rules:
      - alert: PaymentsSLOFastBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 14
        for: 5m
        labels: { severity: "page" }
        annotations:
          summary: "SLO fast burn"
          runbook: "https://runbooks/payments/slo"
      - alert: PaymentsSLOSlowBurn
        expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 6
        for: 1h
        labels: { severity: "ticket" }
```
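The factors 14 and 6 in the alert expressions are burn rates: how many times faster than allowed the error budget is being spent. A minimal sketch of the arithmetic follows; the 30-day SLO period and the 1h/6h windows are assumptions for illustration, not values from the rules above.

```python
def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns burn_rate times faster than allowed."""
    return (1 - slo) * burn_rate


def budget_consumed(burn_rate: float, window_hours: float,
                    period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for the window."""
    return burn_rate * window_hours / (period_days * 24)


# With a 99.9% SLO, the fast-burn alert fires above roughly 1.4% errors.
print(f"fast threshold: {error_ratio_threshold(0.999, 14):.4f}")
print(f"slow threshold: {error_ratio_threshold(0.999, 6):.4f}")

# Assumed windows: burn rate 14 for 1 hour, burn rate 6 for 6 hours.
print(f"budget spent, fast: {budget_consumed(14, 1):.1%}")
print(f"budget spent, slow: {budget_consumed(6, 6):.1%}")
```

Reading the numbers this way helps pick severities: a burn rate that would exhaust the monthly budget within a couple of days justifies a page, a slower one a ticket.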

Alertmanager: routing by service/region, suppression of duplicates, ChatOps integration.
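The grouping idea behind Alertmanager's `group_by` route setting can be pictured as bucketing alerts by a few label values and dropping exact repeats. The sketch below is only a conceptual model with a hypothetical function name and made-up alerts; it is not how Alertmanager is implemented.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: tuple[str, ...]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the group_by labels, dropping exact duplicates."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        if alert not in groups[key]:  # identical label set = the same alert
            groups[key].append(alert)
    return dict(groups)


# Made-up alerts: the first two are identical and collapse into one.
incoming = [
    {"alertname": "PaymentsSLOFastBurn", "service": "payments-api", "region": "eu-west"},
    {"alertname": "PaymentsSLOFastBurn", "service": "payments-api", "region": "eu-west"},
    {"alertname": "HighLatencyP95", "service": "payments-api", "region": "eu-west"},
    {"alertname": "HighLatencyP95", "service": "api", "region": "us-east"},
]
for key, members in group_alerts(incoming, ("service", "region")).items():
    print(key, [a["alertname"] for a in members])
```

One notification per (service, region) group keeps an incident from turning into a flood of near-identical messages.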

9) Correlation with traces and logs

Enable exemplars: clickable `trace_id` links attached to histogram buckets.
Put the labels `service`, `version`, `region` on metrics to support "release compare" views.
On dashboards: release annotations (Git SHA/version).

10) Scaling and long-term storage

Federation: a top-level Prometheus aggregates from lower-level instances (filtered by job/label).
Remote Write: sending samples to long-term storage backends/clusters (Thanos/Cortex/Mimir).

Pros: practically unlimited retention, horizontal scaling, global view.
Cons: harder to operate, higher cost.
Sharding by function: separate instances for system, business, and security metrics.

11) Security

TLS/mTLS between Prometheus ↔ Alertmanager/remote_write targets.
Basic/token authentication for `/targets` and the API (via a proxy gateway in front of Prometheus).
RBAC: restrict access to the UI/series by role; hide private labels.
PII hygiene: do not write PII into metrics; use hashes/aliases.
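One way to build such aliases is a keyed hash: the same input always maps to the same short alias, but the original value cannot be recovered or guessed from a dictionary without the key. The sketch below uses HMAC-SHA256 from the standard library; the function name, the key, and the 12-character length are assumptions. Aliasing does not reduce cardinality, so it suits bounded values such as a tenant or merchant identifier, not per-user IDs.

```python
import hashlib
import hmac


def alias(value: str, secret: bytes, length: int = 12) -> str:
    """Stable, non-reversible alias for a sensitive label value (keyed hash)."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key; in practice it would come from a secret store, not source code.
SECRET = b"replace-with-a-real-secret"

print(alias("merchant-acme-ltd", SECRET))
print(alias("merchant-acme-ltd", SECRET) == alias("merchant-acme-ltd", SECRET))
```

Keeping the mapping from alias back to the real name in an access-controlled system lets on-call engineers resolve it when needed.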

12) Kubernetes practices

Prometheus Operator: CRD (ServiceMonitor, PodMonitor, Alertmanager, Prometheus).
kube-state-metrics + cAdvisor → a complete picture of the cluster.
Taints and resources: dedicated nodes for monitoring; CPU/RAM limits.
Noise reduction: label selectors for "production" namespaces, a longer `scrape_interval` where possible.

13) Business Metrics and Product

Payments: `payments_success_total{psp, currency}`, `payment_conversion_ratio`, `ttw_seconds_histogram`.
Game activity: bets/min, held sessions as a gauge, drop-off on errors.
Risk/fraud: triggers for velocity/geo anomalies; detailed events are logged separately, metrics carry only aggregates.

14) Cost and performance (FinOps)

Control cardinality (review every new label before it is added).
Sparser sampling for histograms and rarely used exporters → raise `scrape_interval` for non-critical targets.
Downsampling in long-term storage backends.
Dashboard caching and broad reliance on recording rules.
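The effect of a longer `scrape_interval` on ingestion is simple arithmetic, sketched here with a hypothetical exporter that exposes 10,000 series.

```python
def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Samples ingested per day for a target exposing `series` time series."""
    return series * (86_400 // scrape_interval_s)


# Hypothetical non-critical exporter with 10,000 series.
at_15s = samples_per_day(10_000, 15)
at_60s = samples_per_day(10_000, 60)
print(at_15s, at_60s, at_15s // at_60s)
```

Moving that target from 15s to 60s cuts its ingestion by a factor of four, at the price of coarser resolution and `rate()` windows that must span at least a few scrapes.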

15) Quick-start examples

RED exporter in an application (Python)

```python
from prometheus_client import Counter, Histogram, start_http_server

reqs = Counter('http_requests_total', 'Total HTTP requests', ['route', 'method', 'status'])
lat = Histogram('http_request_duration_seconds', 'HTTP request latency', ['route', 'method'])
start_http_server(8000)  # expose /metrics on port 8000


def handle(req):
    # app() stands for the application's own request handler (not defined here).
    with lat.labels(req.route, req.method).time():
        status = app(req)
    reqs.labels(req.route, req.method, str(status)).inc()
    return status
```

Threshold alert on p95

```yaml
alert: HighLatencyP95
expr: |
  histogram_quantile(0.95,
    sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.25
for: 10m
labels: { severity: "page", service: "api" }
```

16) Implementation checklist

1. Define a set of basic metrics (RED/USE) and domain metrics.
2. Agree on labels and a cardinality guide.
3. Configure scrape/ServiceMonitor, TLS/mTLS, relabeling.
4. Enable histograms for key paths, plus exemplars.
5. Create recording rules for p95, success ratio, business aggregates.
6. Introduce SLO alerts (burn rate) and Alertmanager routing.
7. Build dashboards: service map, release compare, payments.
8. Decide on federation/remote_write and retention.
9. Restrict access (RBAC) and verify that no PII is present.
10. Set up runbooks and game-day checks.

17) Anti-patterns

Labels with high cardinality (user/session/request_id).
Summary instead of Histogram for key SLOs → no `histogram_quantile` and no aggregation across instances.
Scraping everything indiscriminately without filtering/rotation → higher cost and more noise.
Alerts on raw metrics without SLOs → alert fatigue.
Lack of recording rules → "heavy" dashboards.
Trusting metrics without TLS/mTLS → risk of spoofing/leakage.

Summary

Prometheus gives the iGaming platform goal-driven observability: accurate histograms, stable aggregates, clear SLO alerts, and scaling to a multi-region view. Label discipline, well-chosen recording rules, links to traces and logs, and a deliberate storage architecture enable fast releases and a predictable p99 even at peak load.
