GH GambleHub

Grafana and data visualization

Brief Summary

Grafana is the single front end for observability: metrics, logs, traces, business indicators and alerts in one place. For iGaming this means monitoring p95/p99 latency, payment conversion, Time-to-Wallet (TTW), game provider availability, the geographic distribution of incidents, and stable vs canary comparisons. What makes it work: template variables, readable panels, release annotations, SLO dashboards and disciplined access rights.

1) Connection architecture

Datasources: Prometheus (metrics), Loki/ELK (logs), Tempo/Jaeger (traces), ClickHouse/BigQuery/PG (business data), OTLP via Gateway.
Key links: metric → exemplar → trace → related logs by `trace_id`.
Folders and RBAC: separate folders `SRE`, `Payments`, `Risk`, `Games`, `BizOps`; roles `Viewer/Editor/Admin` and granular permissions.

2) Dashboard design: principles

1. An answer within 1-2 clicks: from the SLO card to the details.
2. RED/USE for each service, plus domain cards (TTW, deposit conversion).
3. Stable grid: 24 columns, large KPIs on top, details below.
4. Colors and thresholds: keep them minimal and tied to SLA/SLO only.
5. Release annotations: Git SHA, version, release type (canary/blue-green).

3) Variables and templates (templating)

Variables turn one dashboard into many.

Example (Prometheus query-variable):
  • Name: `service`
  • Query: `label_values(up, service)`
  • Multi-value + Include All: convenient for aggregates.
Chained (cascading) variables:
  • `region` → `env` → `service` → `instance`.
  • Use `Regex` and `Sort` for UX, and `Refresh: On dashboard load`.
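
The chained variables above can also be generated rather than clicked together. The following is a minimal sketch, assuming the metric `up` carries all four labels and that the result is pasted into a dashboard's `templating.list`; the helper name `build_variables` is hypothetical.

```python
import json

# Label chain from the text: each variable is filtered by the ones before it.
CHAIN = ["region", "env", "service", "instance"]


def build_variables(chain, metric="up"):
    """Return a Grafana-style `templating.list` for chained query variables."""
    variables = []
    for i, name in enumerate(chain):
        # Filter by every parent variable, e.g. up{region=~"$region"}.
        matchers = ",".join(f'{p}=~"${p}"' for p in chain[:i])
        selector = f"{metric}{{{matchers}}}" if matchers else metric
        variables.append({
            "name": name,
            "type": "query",
            "query": f"label_values({selector}, {name})",
            "refresh": 1,        # 1 = refresh on dashboard load
            "sort": 1,           # 1 = alphabetical, ascending
            "multi": True,       # multi-value selection
            "includeAll": True,  # adds the "All" option
        })
    return variables


if __name__ == "__main__":
    print(json.dumps(build_variables(CHAIN), indent=2))
```

Each child query uses `=~` so that multi-value and All selections in a parent variable still match.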

4) Panels and rendering types

Time series: p50/p95/p99, error-rate, throughput.
Stat/Gauge: target KPI (availability, TTW p95).
Bar gauge/Table: top N routes/PSP/game providers.
Geomap: heatmaps of incidents/latency by country/PoP.
Canvas: flow diagrams (Player → API → PSP → Bank).
Node graph: service dependencies, coloring by errors.

Transformations:
  • Labels to fields, Outer join, Reduce (min/max/avg), Add field from calculation (conversion).

5) Examples of queries and panels

5.1 p95 latency (PromQL)

promql
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket{service="$service",region="$region"}[5m]))
)
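
To read this panel correctly it helps to know that the quantile is an estimate interpolated inside a histogram bucket, so its precision depends on the bucket boundaries. The sketch below is a simplified Python illustration of that interpolation, not Prometheus's actual code; the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least one finite
    bucket, and end with (math.inf, total), like the `le` label series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative data: 100 requests, cumulative counts per latency bound (seconds).
EXAMPLE = [(0.1, 50), (0.25, 90), (0.5, 98), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, EXAMPLE))
```

With these buckets the p95 lands somewhere between 0.25 s and 0.5 s; a threshold that falls inside a wide bucket therefore cannot be measured more precisely than that bucket allows.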

5.2 Request success rate (SLO proxy)

promql
sum(rate(http_requests_total{service="$service",status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))

5.3 Payment conversion (PromQL aggregate)

promql
sum(rate(payments_success_total{psp=~"$psp",currency=~"$currency"}[15m]))
/
sum(rate(payments_attempt_total{psp=~"$psp",currency=~"$currency"}[15m]))

5.4 Quick jump to the trace (exemplars)

In the `Time series` panel, enable Exemplars; clicking a point then opens Tempo with the corresponding `trace_id`.

5.5 Loki logs by trace_id

logql
{service="$service"} |= "$traceID"

6) Annotations and events

Release annotations: automatically add an event on each deploy (version, author, canary weight).
Incident/Freeze: incident start/end marks and release freeze windows.
Business events: mark large campaigns/tournaments on the charts.
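
A deploy pipeline can add release annotations through Grafana's HTTP API (`POST /api/annotations`). The sketch below only builds the request; the base URL, token, tag scheme and the helper name `build_release_annotation` are assumptions for illustration.

```python
import json
import time
import urllib.request


def build_release_annotation(base_url, token, version, git_sha, release_type, when_ms=None):
    """Build (but do not send) a POST request to Grafana's /api/annotations."""
    payload = {
        # Annotation time in epoch milliseconds; defaults to "now".
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        # Hypothetical tag scheme; dashboards can filter annotations by tag.
        "tags": ["release", release_type, f"version:{version}"],
        "text": f"Release {version} ({git_sha})",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_release_annotation(
        "http://grafana:3000", "TOKEN", "1.4.2", "abc1234", "canary"
    )
    print(req.full_url, req.data.decode("utf-8"))
```

In CI the request would be sent with `urllib.request.urlopen(req)` right after the rollout step, using a service account token stored as a pipeline secret.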

7) Alerts in Grafana

Alert rules: managed centrally (on top of Prometheus/Loki/Cloud).
Contact points: PagerDuty/Slack/Email; notification policies (routing by folder/tags).
Multi-window burn-rate: fast and slow error-budget burn.
Silences: for scheduled maintenance windows and duplicate alerts.

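Example expression for p95 (alert queries do not support dashboard template variables such as `$service`, so the rule groups by `service` instead of filtering on it):
promql
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 0.25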
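
The multi-window burn-rate idea from this section can be sketched in a few lines. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); an alert fires only when both a long and a short window exceed the threshold. The window pairs and thresholds below are commonly cited starting points for a 30-day SLO and are assumptions here, not values from this article.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_alert(long_err, short_err, slo_target, threshold):
    """Fire only if both the long and the short window burn above threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (burn_rate(long_err, slo_target) > threshold
            and burn_rate(short_err, slo_target) > threshold)


# Assumed window pairs and thresholds (illustrative, tune per SLO).
FAST = {"long": "1h", "short": "5m", "threshold": 14.4}
SLOW = {"long": "6h", "short": "30m", "threshold": 6.0}

if __name__ == "__main__":
    # 2% errors against a 99.9% SLO burns the budget about 20x too fast.
    print(should_alert(0.02, 0.02, 0.999, FAST["threshold"]))
```

In practice the two error ratios come from the same success-rate query evaluated over the long and short ranges, ideally via recording rules.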

8) Provisioning as code (IaC)

Store datasources/dashboards/alerts in Git.

datasource.yaml

yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo:3100

dashboard.yaml

yaml
apiVersion: 1
providers:
  - name: sres
    folder: SRE
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards/sre

grafana.ini (fragment)

ini
[auth]
disable_login_form = false
[users]
viewers_can_edit = false
[unified_alerting]
enabled = true
[unified_alerting.screenshots]
capture = true

9) Security and access

SSO (OIDC/SAML): groups → roles → folders.
Datasource permissions: grant access only where it is needed; read-only for Viewer.
PII hygiene: do not pull PII fields into panels; filter/mask them in logs.
Secrets: only through Vault or secure JSON fields, never as plain text in dashboards.

10) Performance and cost

Recording rules in Prometheus for heavy expressions.
Downsampling/Retention in long-term storage backends.
Query caching and sensible refresh intervals (not "1s" everywhere).
Limit variable cardinality (do not template on `user_id`/`session_id`).
Isolation: separate instances/folders for noisy teams.
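
Some of these rules can be checked automatically when dashboards are reviewed in Git. The following is a minimal lint sketch over exported dashboard JSON; the 10-second minimum is an assumed team policy, and the helper names are hypothetical.

```python
import re

HIGH_CARDINALITY = ("user_id", "session_id")  # labels named in the list above
MIN_REFRESH_SECONDS = 10                      # assumed policy, not a Grafana default
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def refresh_seconds(value):
    """Convert a refresh string such as "5s" or "1m" to seconds (None if unset)."""
    match = re.fullmatch(r"(\d+)([smhd])", value or "")
    return int(match.group(1)) * _UNITS[match.group(2)] if match else None


def lint_dashboard(dashboard):
    """Return a list of human-readable problems found in a dashboard dict."""
    problems = []
    seconds = refresh_seconds(dashboard.get("refresh"))
    if seconds is not None and seconds < MIN_REFRESH_SECONDS:
        problems.append(f"refresh {dashboard['refresh']} is below {MIN_REFRESH_SECONDS}s")
    for var in dashboard.get("templating", {}).get("list", []):
        # Check both the variable name and its query for risky labels.
        text = f"{var.get('name', '')} {var.get('query', '')}"
        for label in HIGH_CARDINALITY:
            if label in text:
                problems.append(f"variable '{var.get('name')}' uses high-cardinality label {label}")
    return problems
```

Run in CI, a non-empty result from `lint_dashboard(json.load(f))` can fail the review before a slow dashboard reaches production.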

11) Specialized dashboards for iGaming

Payments: attempts/success/TTW p95, errors by PSP/route, geo-deviation map.
Games/Providers: latency and error-rate by studio/game, launch conversion.
Risk/Fraud: action velocity, device/IP bursts, correlations (table + bar gauge).
RG/Compliance: sessions above thresholds, stake growth, anomaly alerts.
Release Compare: stable vs canary by p95/error/business metrics.
Infra/USE: Utilization/Saturation/Errors by Cluster and Queue.

12) Example of a JSON dashboard (fragment)

json
{
  "title": "Payments SLO",
  "tags": ["slo", "payments"],
  "time": {"from": "now-6h", "to": "now"},
  "panels": [
    {
      "type": "stat",
      "title": "Availability",
      "targets": [{"expr": "sum(rate(http_requests_total{service=\"payments-api\",status=~\"2..|3..\"}[5m]))/sum(rate(http_requests_total{service=\"payments-api\"}[5m]))"}],
      "thresholds": {"mode": "absolute", "steps": [{"color": "red", "value": 0}, {"color": "green", "value": 0.999}]}
    },
    {
      "type": "timeseries",
      "title": "p95 latency",
      "exemplars": {"color": "rgba(31,120,193,0.6)"},
      "targets": [{"expr": "histogram_quantile(0.95,sum by (le) (rate(http_request_duration_seconds_bucket{service=\"payments-api\"}[5m])))"}]
    }
  ]
}

13) Runbooks and UX improvements

Each alert has a Runbook URL (action instruction).
Links to related dashboards (Payments ↔ Infra ↔ PSP).
Drilldown: clicking a label applies it as a filter (region/psp/route).
Variable defaults: `env=prod`, `region=eu` speed up the first load.

14) Implementation checklist

1. Configure datasources: Prometheus/Loki/Tempo/SQL.
2. Set up folders and RBAC; audit permissions.
3. Create template variables (region/env/service).
4. Build SLO dashboards (availability, p95, error-rate, error budget).
5. Add release annotations and stable/canary comparisons.
6. Enable exemplars and click-through to traces/logs.
7. Configure alerts (multi-window burn-rate) and routing.
8. Provision everything as code, store in Git, do a review.
9. Optimize performance: recording rules, intervals, cache.
10. Roll out business dashboards (TTW, payment conversion, GGR cards).

15) Antipatterns

A "zoo" of inconsistent dashboards without variables or standards.
Panels with heavy PromQL and no recording rules → slow UI.
Too many colors, legends and Y-axes with different scales.
PII exposed in panels that are open to Viewer.
No release annotations, so it is unclear where the jumps come from.
One monolithic "mega-view" dashboard instead of a folder structure.

Summary

Grafana is the interface where engineering meets product: metrics, logs and traces connect to the business picture. Templates, well-chosen panels, annotations and alerts turn data into decisions: rapid diagnosis, predictable releases and a manageable observability cost.
