Grafana and data visualization

(Section: Technology and Infrastructure)

Brief Summary

Grafana is the single window onto observability: metrics, logs, traces, business indicators and alerts in one place. For iGaming, that means p95/p99 latency monitoring, payment conversion, Time-to-Wallet (TTW), game provider availability, geo-distribution of incidents, and stable vs canary comparison. Success factors: templates (variables), easy-to-read panels, release annotations, SLO dashboards and disciplined access rights.

1) Connection architecture

Datasources: Prometheus (metrics), Loki/ELK (logs), Tempo/Jaeger (traces), ClickHouse/BigQuery/PG (business data), OTLP via Gateway.
Key links: from a metric → exemplar → trace → related logs by `trace_id`.
Folders and RBAC: separate folders `SRE`, `Payments`, `Risk`, `Games`, `BizOps`; roles `Viewer/Editor/Admin` and granular permissions.
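
The metric → exemplar → trace link is configured on the Prometheus datasource. A minimal sketch in Python, assuming a Tempo datasource with the hypothetical UID `tempo` and exemplars that carry a `trace_id` label, builds the corresponding provisioning entry. The URL and names are illustrative, not taken from a real setup:

```python
import json


def prometheus_datasource_with_exemplars(tempo_uid: str, label: str = "trace_id") -> dict:
    """Build a Prometheus datasource entry whose exemplars link to a tracing datasource."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",
        "url": "http://prometheus:9090",  # assumed address
        "jsonData": {
            # Grafana reads this list to turn an exemplar label into a trace link.
            "exemplarTraceIdDestinations": [
                {"name": label, "datasourceUid": tempo_uid}
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(prometheus_datasource_with_exemplars("tempo"), indent=2))
```

The `name` must match the label the application attaches to its exemplars; if the services emit `traceID` instead, pass that as `label`.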

2) Dashboard design: principles

1. Answer the question in 1-2 clicks: from the SLO card to the details.
2. RED/USE for each service + domain cards (TTW, deposit conversion).
3. Stable grid: 24-column, large KPI on top, details on the bottom.
4. Colors and thresholds: keep them to a minimum and tie them to SLA/SLO only.
5. Release annotations: Git SHA, version, release type (canary/blue-green).

3) Variables and templates (templating)

Variables turn one dashboard into many.

Example (Prometheus query-variable):

Name: `service`
Query: `label_values(up, service)`
Multi-select + include all - convenient for aggregates.

Cascade variables:

`region` → `env` → `service` → `instance`.
Use `regex`/`sort` for UX and `refresh: On dashboard load`.
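
The cascade can be sketched as a small generator for the `templating.list` part of a dashboard JSON. This is an illustration under assumptions: the `up` metric is assumed to carry all four labels, and the helper name `cascade_variables` is hypothetical. Each variable filters by the variables before it in the chain:

```python
import json

CHAIN = ["region", "env", "service", "instance"]


def cascade_variables(chain, metric="up"):
    """Return dashboard variables where each one is filtered by its parents."""
    variables = []
    for i, name in enumerate(chain):
        parents = chain[:i]
        # e.g. region=~"$region",env=~"$env" (regex match works with multi-select)
        matchers = ",".join(f'{p}=~"${p}"' for p in parents)
        selector = f"{metric}{{{matchers}}}" if matchers else metric
        variables.append({
            "name": name,
            "type": "query",
            "query": f"label_values({selector}, {name})",
            "refresh": 1,       # 1 = On dashboard load
            "sort": 1,          # 1 = alphabetical, ascending
            "multi": True,
            "includeAll": True,
        })
    return variables


if __name__ == "__main__":
    print(json.dumps({"templating": {"list": cascade_variables(CHAIN)}}, indent=2))
```

The third variable, for example, gets the query `label_values(up{region=~"$region",env=~"$env"}, service)`, so changing the region narrows every selector below it.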

4) Panels and rendering types

Time series: p50/p95/p99, error-rate, throughput.
Stat/Gauge: target KPI (availability, TTW p95).
Bar gauge/Table: top N routes/PSP/game providers.
Geomap: heat maps of incidents/latency by country/ROR.
Canvas: flow diagrams (Player → API → PSP → Bank).
Node graph: service dependencies, coloring by errors.

Transformations:

Labels to fields, Outer join, Reduce (min/max/avg), Add field from calculation (conversion).

5) Examples of queries and panels

5.1 p95 latency (PromQL)

promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket{service="$service",region="$region"}[5m]))
)
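
To read a p95 panel correctly it helps to know that `histogram_quantile` estimates the value by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified re-implementation for illustration only (classic cumulative `le` buckets, `0 < q <= 1`, lowest bucket assumed to start at 0); the bucket numbers are invented:

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound equal to +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into +Inf: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 50), (0.25, 90), (0.5, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

With these sample buckets the rank 95 lands in the 0.25-0.5 s bucket, so the estimate is roughly 0.41 s. The result depends on bucket boundaries: a threshold such as 0.25 s is only measured precisely if it is itself a bucket bound.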

5.2 Request success ratio (SLO proxy)

promql
sum(rate(http_requests_total{service="$service",status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))

5.3 Payment conversion (PromQL aggregate)

promql
sum(rate(payments_success_total{psp=~"$psp",currency=~"$currency"}[15m]))
/
sum(rate(payments_attempt_total{psp=~"$psp",currency=~"$currency"}[15m]))

5.4 Quick jump to the trace (exemplars)

In the `Time series` panel, turn on Exemplars → click a point → Tempo opens with the `trace_id`.

5.5 Loki: logs by trace_id

logql
{service="$service"} |= "$traceID"

6) Annotations and events

Release annotations: an event is added automatically during deployment (version, author, canary weight).
Incident/Freeze: marks for incident start/end and release freeze windows.
Business events: mark large campaigns/tournaments on the charts.
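
A release annotation can be pushed from the deployment pipeline through Grafana's HTTP API (`POST /api/annotations`). The sketch below is one possible shape, not a prescribed integration: the tag scheme, the text format, and the helper names are assumptions, and the base URL and token must come from your own environment:

```python
import json
import time
import urllib.request


def build_release_annotation(version, git_sha, release_type, author, ts_ms=None):
    """Build the JSON body for a release annotation (time is epoch milliseconds)."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return {
        "time": ts_ms,
        "tags": ["release", release_type, f"version:{version}"],
        "text": f"Release {version} ({git_sha[:7]}) by {author}",
    }


def post_annotation(base_url, token, payload):
    """Send the annotation to Grafana; base_url and token are supplied by the caller."""
    req = urllib.request.Request(
        f"{base_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    body = build_release_annotation("1.4.2", "abcdef1234567", "canary", "alice")
    print(json.dumps(body, indent=2))  # call post_annotation(...) from CI to send it
```

Without a dashboard reference in the body the annotation is organization-wide, so any dashboard with an annotation query on the `release` tag can show it.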

7) Alerts at Grafana

Alert rules: managed centrally (based on Prometheus/Loki/Cloud).
Contact points: PagerDuty/Slack/Email; notification policies (routing by folder/tags).
Multi-window burn-rate: alerts on fast and slow error-budget burn.
Silences: during scheduled maintenance windows and for duplicate alerts.

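Example expression for p95 (alert rules cannot use dashboard variables such as `$service`, so the expression groups by `service` instead of filtering on it):

promql
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 0.25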
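
The multi-window burn-rate idea from the list above can be illustrated with a few lines of arithmetic. Burn rate is the observed error ratio divided by the error budget (1 - SLO); a page fires only when both a long and a short window exceed the threshold. The thresholds below (14.4 for 1h/5m, 6 for 6h/30m) are the commonly cited values for a 30-day SLO and are an assumption here, not something this document prescribes:

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' the error budget is consumed."""
    return error_ratio / (1.0 - slo)


def should_alert(long_window_errors, short_window_errors, slo, threshold):
    """Fire only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_errors, slo) >= threshold
        and burn_rate(short_window_errors, slo) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # 99.9 % availability target
    # Fast burn: 2 % errors over 1h and 3 % over the last 5m.
    print(should_alert(0.02, 0.03, slo, threshold=14.4))
    # Already recovered: the 5m window is back to 0.1 % errors.
    print(should_alert(0.02, 0.001, slo, threshold=14.4))
```

In Grafana the same logic is expressed as two ratio queries per window pair combined with `and`; the Python version only shows why both conditions are needed.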

8) Provisioning as code (IaC)

Store datasources/dashboards/alerts in Git.

datasource.yaml

yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo:3100

dashboard.yaml

yaml
apiVersion: 1
providers:
  - name: sres
    folder: SRE
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards/sre

grafana.ini (fragment)

ini
[auth]
disable_login_form = false
[users]
viewers_can_edit = false
; unified alerting provides the contact points and notification policies used above
[unified_alerting]
enabled = true
[unified_alerting.screenshots]
capture = true

9) Security and access

SSO (OIDC/SAML), groups → roles → folders.
Datasource permissions: grant access only to the folders that need it; read-only for Viewer.
PII hygiene: do not pull fields with PII into panels; for logs, filter/mask.
Secrets: only via Vault or secure JSON fields (`secureJsonData`), never in plain text in dashboards.
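
Masking is best done before logs reach Loki, for example in the application logger or the log shipper. The Python sketch below only illustrates the idea; the two regular expressions (e-mail addresses and 13-19 digit card-like numbers) and the placeholder tokens are assumptions and will not cover every PII format:

```python
import re

# Deliberately simple patterns: good enough to show the approach, not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PAN = re.compile(r"\b\d{13,19}\b")  # card-number-like digit runs


def mask_pii(line: str) -> str:
    """Replace e-mail addresses and card-like numbers with fixed placeholders."""
    line = EMAIL.sub("<email>", line)
    return PAN.sub("<pan>", line)


if __name__ == "__main__":
    print(mask_pii("deposit by a.b@example.com card 4111111111111111 ok"))
```

Fixed placeholders keep log lines searchable by structure while making sure a Viewer who opens the Logs panel never sees the raw values.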

10) Performance and cost

Recording rules in Prometheus for heavy expressions.
Downsampling/Retention in long-term storage backends.
Query caching and sensible refresh intervals (not "1s" everywhere).
Limit the cardinality of variables (do not use `user_id`/`session_id` as variables).
Isolation: separate instances/folders for noisy teams.
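
Two of the rules above (no aggressive refresh, no high-cardinality variables) are easy to check automatically when dashboards live in Git. A minimal lint sketch in Python follows; the 10-second floor and the list of forbidden variable names are assumed team policy, and `lint_dashboard` is a hypothetical helper, not a Grafana tool:

```python
import re

HIGH_CARDINALITY = {"user_id", "session_id"}  # assumed deny-list
MIN_REFRESH_SECONDS = 10                      # assumed policy floor
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def refresh_seconds(value):
    """Convert a refresh string such as '5s' or '1m' to seconds; None if unset."""
    if not isinstance(value, str):
        return None
    match = re.fullmatch(r"(\d+)([smhd])", value)
    return int(match.group(1)) * _UNITS[match.group(2)] if match else None


def lint_dashboard(dashboard):
    """Return a list of human-readable problems found in a dashboard JSON dict."""
    problems = []
    secs = refresh_seconds(dashboard.get("refresh"))
    if secs is not None and secs < MIN_REFRESH_SECONDS:
        problems.append(f"refresh {dashboard['refresh']} is below {MIN_REFRESH_SECONDS}s")
    for var in dashboard.get("templating", {}).get("list", []):
        if var.get("name") in HIGH_CARDINALITY:
            problems.append(f"variable '{var['name']}' is high-cardinality")
    return problems


if __name__ == "__main__":
    bad = {"refresh": "1s", "templating": {"list": [{"name": "user_id"}, {"name": "region"}]}}
    for problem in lint_dashboard(bad):
        print(problem)
```

Run in CI over every dashboard file, a check like this catches the cheap mistakes before they reach production Grafana.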

11) Specialized dashboards for iGaming

Payments: attempts/success/TTW p95, errors by PSP/route, geo-deviation map.
Games/Providers: latency and error-rate by studio/game, launch conversion.
Risk/Fraud: action velocity, bursts by device/IP, correlations (table + bar gauge).
RG/Compliance: sessions over thresholds, stake growth, anomaly alerts.
Release Compare: stable vs canary by p95/error/business metrics.
Infra/USE: Utilization/Saturation/Errors by Cluster and Queue.

12) Example of a JSON dashboard (fragment)

json
{
  "title": "Payments SLO",
  "tags": ["slo", "payments"],
  "time": {"from": "now-6h", "to": "now"},
  "panels": [
    {
      "type": "stat",
      "title": "Availability",
      "targets": [{"expr": "sum(rate(http_requests_total{service=\"payments-api\",status=~\"2..|3..\"}[5m]))/sum(rate(http_requests_total{service=\"payments-api\"}[5m]))"}],
      "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": 0}, {"color": "green", "value": 0.999}]}
    },
    {
      "type": "timeseries",
      "title": "p95 latency",
      "exemplars": {"color": "rgba(31,120,193,0.6)"},
      "targets": [{"expr": "histogram_quantile(0.95,sum by (le) (rate(http_request_duration_seconds_bucket{service=\"payments-api\"}[5m])))"}]
    }
  ]
}

13) Runbooks and UX improvements

Each alert has a Runbook URL (action instruction).
Links to related dashboards (Payments ↔ Infra ↔ PSP).
Drilldown: clicks on labels → filters (region/psp/route).
Variable defaults: `env = prod`, `region = eu`; this speeds up the start.

14) Implementation checklist

1. Configure datasources: Prometheus/Loki/Tempo/SQL.
2. Set up folders and RBAC; audit permissions.
3. Create template variables (region/env/service).
4. Build SLO dashboards (availability, p95, error-rate, error budget).
5. Add release annotations and stable/canary comparisons.
6. Enable exemplars and one-click navigation to traces/logs.
7. Configure alerts (multi-window burn-rate) and routing.
8. Provision everything as code, store in Git, do a review.
9. Optimize performance: recording rules, intervals, cache.
10. Introduce business dashboards (TTW, payment conversion, GGR cards).

15) Antipatterns

A "zoo" of inconsistent dashboards without variables and standards.
Panels with heavy PromQL without recording rules → slow UI.
Too many colors and legends, Y-axes with different scales.
PII exposed in panels accessible to Viewer.
No release annotations: it is unclear where the jumps come from.
One monolithic "mono-view" dashboard instead of a folder structure.

Summary

Grafana is the interface where engineering meets the product: metrics, logs and traces connect to the business picture. Templates, well-chosen panels, annotations, and alerts turn data into decisions: rapid diagnosis, predictable releases, and manageable observability cost.
