Monitoring and logging

1) Why it matters in iGaming

Real-time money: deposits, instant payouts, bet and winnings calculation, and tournaments are all sensitive to delays and failures.
Regulation and audit: full traceability of actions is required (KYC/AML, payments, responsible-gaming limits).
Complex distributed architecture: API gateways, payment orchestration, EDA/Kafka, provider services, mobile clients, front ends, BI bus.
The goal: reduce MTTD/MTTR, hold SLOs on the golden signals, and keep incidents manageable.

2) Basic concepts of observability

Logs: detailed events (structured JSON) suitable for investigations and audits.
Metrics: aggregates over time (TSDB), suitable for SLOs and alerts.
Traces: causal chains of requests (trace/span) across services, brokers, and databases.
Events: domain events (BetPlaced, DepositApproved) that bridge business metrics and technical telemetry.

3) "Golden Signals" and SLI/SLO for iGaming

Latency: P95/P99 on critical flows (authorization, deposit, bet placement, session start, spin).
Traffic: RPS per API, TPS per payment flow, EPS per event stream.
Errors: share of 5xx/4xx responses, decline rate, failed withdrawals, provider errors.
Saturation: CPU, memory, IO, Kafka lag, DB connections, thread-pools.

Example SLO (payment gateway):

SLI: `1 - (failed_payments / total_payments)`
SLO: 99.7% successful card authorizations over 30 days (error budget 0.3%).
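
The arithmetic behind this SLO can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `error_budget` and the sample volumes are assumptions made for the example.

```python
def error_budget(slo_target: float, total_events: int, failed_events: int) -> dict:
    """Compute the SLI and error-budget consumption for one SLO window.

    slo_target is a fraction, e.g. 0.997 for 99.7%.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = 1 - failed_events / total_events          # SLI from the text: 1 - failed/total
    allowed_failures = (1 - slo_target) * total_events
    budget_consumed = failed_events / allowed_failures
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,          # 1.0 means the budget is fully spent
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical 30-day window: 1,000,000 card authorizations, 1,200 failures.
    report = error_budget(0.997, 1_000_000, 1_200)
    print({k: round(v, 4) if isinstance(v, float) else v for k, v in report.items()})
```

With a 99.7% target, one million attempts allow roughly 3,000 failures in the window, so 1,200 failures consume about 40% of the budget.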

4) Collection and processing architecture

1. Ingestion: agents (OTel Collector/Fluent Bit), SDKs in the application, RUM/synthetics.
2. Routing: broker/telemetry bus (OTLP over HTTP/gRPC), filters and PII masking.
3. Storage:
   Metrics: TSDB (aggregation, downsampling).
   Logs: hot (indexed) / warm (lightly indexed) / cold (object storage, WORM).
   Traces: time-indexed storage with retention policies and tail-sampling.
4. Analytics/alerts: rules (PromQL/LogQL/SQL), correlation with traces and releases.
5. Dashboards: technical and business views (payments, RNG/providers, tournament engine).

5) Log standard (JSON) and event taxonomy

Strict JSON logging with uniform keys and levels is recommended.

Levels: `DEBUG < INFO < NOTICE < WARN < ERROR < FATAL < AUDIT`

Taxonomy: `auth.`, `payment.`, `gameplay.`, `risk.`, `psp.`, `kyc.`, `rg.` (responsible gaming), `ops.`.

Example of a JSON event (AUDIT/PII-safe):

```json
{
  "ts": "2025-11-04T19:45:31.842Z",
  "lvl": "AUDIT",
  "event_type": "payment.deposit_approved",
  "correlation_id": "c-7d2c1f0b",
  "trace_id": "2d6a9c0e4c0b1f72",
  "span_id": "9f3a81d2a1c3b764",
  "request_id": "r-8f12de9e",
  "tenant": "brand_eu",
  "psp": "acq_xyz",
  "user_id_hash": "u:sha256:1e63…",
  "device_id": "d-3c8f…",
  "ip_trunc": "203.0.113.0/24",
  "amount_minor": 5000,
  "currency": "EUR",
  "result": "approved",
  "latency_ms": 312,
  "tags": ["pci_safe", "kyc_passed", "low_risk"],
  "extra": {
    "bin": "411111",
    "method": "card",
    "region": "EU",
    "ab_test": "checkout_v2"
  }
}
```

PII/PCI security rules:

Mask PAN/BIN (store only the fields the policy allows); hash or tokenize email/phone.
Truncate IP addresses to /24; store GeoIP data separately.
Do not allow unsanitized user input as free text in `extra`.
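
The masking rules above could be sketched as small helpers applied before a record leaves the service. Several details here are assumptions made for illustration: the helper names, the use of a keyed HMAC-SHA256 rather than a plain hash, the /48 prefix for IPv6, and keeping only the last four PAN digits. The actual policy decides which fields may be stored.

```python
import hashlib
import hmac
import ipaddress


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Pseudonymize a user id with a keyed hash (assumption: HMAC-SHA256)."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"u:sha256:{digest}"


def truncate_ip(ip: str) -> str:
    """Truncate an IPv4 address to /24 (IPv6 to /48, an assumed equivalent)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    # strict=False zeroes the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


def mask_pan(pan: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) < 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + digits[-4:]


if __name__ == "__main__":
    print(truncate_ip("203.0.113.57"))
    print(mask_pan("4111 1111 1111 1111"))
    print(hash_user_id("player-42", b"demo-secret")[:20])
```

A keyed hash is used in this sketch because an unkeyed hash of a small identifier space can be reversed by enumeration; the key would live in a secret store, not in code.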

6) Correlation: trace_id, correlation_id, idempotency_key

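Add `trace_id` (from OTel), `span_id`, `correlation_id` (end-to-end for the business process), and `idempotency_key` (for payment requests) to every log record; attach them to metrics as exemplars rather than labels so that cardinality stays bounded.
Propagate baggage (tenant/brand, market, A/B variant) so that telemetry can be sliced by these dimensions.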
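
One way to get these fields onto every log line without passing them through each function call is to bind them to the request context. The sketch below uses only the standard library; the names `bind_context`, `CorrelatedJsonFormatter`, and `build_logger` are hypothetical, and in a real service the trace and span IDs would come from the active OTel span instead of being set by hand.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

CONTEXT_FIELDS = ("trace_id", "span_id", "correlation_id", "idempotency_key", "tenant")
_context = {name: contextvars.ContextVar(name, default=None) for name in CONTEXT_FIELDS}


def bind_context(**fields):
    """Set correlation fields for the current request or task."""
    for name, value in fields.items():
        _context[name].set(value)  # KeyError on unknown names keeps keys uniform


class CorrelatedJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the bound correlation fields."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "lvl": record.levelname,
            "event_type": record.getMessage(),
        }
        for name, var in _context.items():
            value = var.get()
            if value is not None:
                payload[name] = value
        return json.dumps(payload)


def build_logger(name="payments"):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(CorrelatedJsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    bind_context(trace_id="2d6a9c0e4c0b1f72", correlation_id="c-7d2c1f0b", tenant="brand_eu")
    build_logger().info("payment.deposit_requested")
```

Because `contextvars` values are scoped per thread and per asyncio task, concurrent requests do not overwrite each other's correlation fields.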

7) Metrics: Technical and Business

Technical: RPS, p95 latency, error rate, saturation, GC, pool usage, Kafka consumer lag.
Business: registration→deposit conversion rate, successful authorizations, payment cancellations, NGR/GGR, ARPPU, RTP anomalies, drop-off at the KYC step, share of players with responsible-gaming limits.

Example PromQL (API error rate):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

8) Tracing and OpenTelemetry

Instrument the gateway, payment orchestrator, game core, notifications, KYC/AML, and provider integrations.
Use head-sampling for overall traffic plus tail-sampling at an elevated rate for errors, slow spans, and payments.
Context propagation: `traceparent`/`tracestate`, Kafka headers, gRPC metadata.
Annotate spans with domain events: `BetPlaced`, `WithdrawalRequested`.
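
The tail-sampling decision described above can be illustrated with a small policy function that runs once a trace is complete. This is a sketch under stated assumptions, not the OTel Collector's implementation: the `Span` type, the 1000 ms threshold, the `payment.` name prefix, the 5% baseline, and the choice to keep every matching trace (rather than merely raising its sampling rate) are all hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(trace_id, spans, latency_threshold_ms=1000.0, baseline_ratio=0.05):
    """Decide, after the trace has finished, whether to store it."""
    if any(s.is_error for s in spans):
        return True                      # always keep failed traces
    if any(s.duration_ms >= latency_threshold_ms for s in spans):
        return True                      # always keep slow traces
    if any(s.name.startswith("payment.") for s in spans):
        return True                      # always keep payment flows
    # Deterministic baseline: hashing the trace id gives the same answer on
    # every collector instance, so a trace is never half-kept.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < baseline_ratio * 10_000


if __name__ == "__main__":
    slow_spin = [Span("gameplay.spin", 1450.0)]
    print(keep_trace("2d6a9c0e4c0b1f72", slow_spin))
```

Head-sampling makes its decision before any span has run, so it cannot know about errors or latency; that is why the error and latency rules sit in the tail stage.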

9) Alerting without noise

Multi-stage thresholds (warning/critical), flapping suppression, deduplication, time windows.
Correlation: group "5xx growth" + "Kafka lag" + "PSP p95 latency" into a single incident.
SLO-based alerts: escalate as the error budget burns.
Alerts-as-Code (GitOps), with review and rule tests.

Example rule (Prometheus):

```yaml
groups:
  - name: payments
    rules:
      - alert: PaymentErrorSpike
        expr: (sum(rate(payment_errors_total[5m])) / sum(rate(payment_attempts_total[5m]))) > 0.02
        for: 10m
        labels: { severity: "critical", team: "payments" }
        annotations:
          summary: "Payment errors > 2% for 10m"
          runbook: "runbooks/payments/error-spike.md"
```
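
A fixed threshold like the one above is simple, but an SLO-based alert is usually phrased as a burn rate: how many times faster than "exactly on budget" the error budget is being spent. The sketch below shows the calculation and a two-window check. The function names are hypothetical, and the default factor of 14.4 is the commonly cited value for spending 2% of a 30-day budget within one hour; treat it as an assumption to tune, not a requirement.

```python
def burn_rate(errors: int, attempts: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if attempts == 0:
        return 0.0
    return (errors / attempts) / (1 - slo_target)


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    long_window and short_window are (errors, attempts) tuples, e.g. for
    the last 1h and the last 5m. The short window confirms that the
    problem is still happening, which suppresses stale alerts.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical counts against the 99.7% payment SLO.
    print(round(burn_rate(50, 1000, 0.997), 2))
    print(should_page((50, 1000), (6, 100), 0.997))
```

A burn rate of 1.0 means the budget would run out exactly at the end of the SLO window; values well above 1.0 justify escalation.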

10) Log search (example LogQL)

```logql
{app="psp-orchestrator", level=~"ERROR|FATAL"}
  |= "decline"
  | json
  | amount_minor > 10000
  | region = "EU"
```

The goal is to quickly weed out noise and highlight "expensive" failures in the target region.

11) Dashboards: what's mandatory

Payments Health: successes/failures by PSP, latency by method, regional map, provider SLAs.
Game Core: RPS by provider, p95 spin latency, SDK error ratio, RTP anomalies by slot.
Player Journey: registration→KYC→deposit→play→withdrawal.
Infra: Kafka lag, DB connections, cache hit ratio, Kubernetes cluster (grid of pods/nodes).

12) Storage, retention and cost (FinOps)

Cardinality under control: avoid metrics with high-cardinality labels (e.g. user_id).
Retention: metrics hot for 30-90 days, downsampled for up to 13 months; logs hot for 7-14 days, warm for 30-90 days, cold for 1-3 years (subject to regulatory requirements).
WORM/immutability for audit logs, Object Lock.
Compression/partitioning and ILM policies; separate indexes for audit/PII-safe data.
Sample logs at INFO/DEBUG; keep ERROR/AUDIT complete.
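
Level-based log sampling can be sketched with a standard `logging.Filter`. The class name `LevelSampler`, the numeric value chosen for a custom `AUDIT` level, and the sample rates are assumptions for illustration; any level without a configured rate is kept in full.

```python
import logging
import random

AUDIT = 60  # hypothetical numeric level above CRITICAL, matching AUDIT > FATAL
logging.addLevelName(AUDIT, "AUDIT")


class LevelSampler(logging.Filter):
    """Keep a configurable fraction of records per level."""

    def __init__(self, rates, rng=None):
        super().__init__()
        self.rates = rates                    # e.g. {logging.DEBUG: 0.01}
        self.rng = rng or random.Random()

    def filter(self, record):
        rate = self.rates.get(record.levelno, 1.0)  # unlisted levels: keep all
        return self.rng.random() < rate


if __name__ == "__main__":
    logger = logging.getLogger("gameplay")
    handler = logging.StreamHandler()
    handler.addFilter(LevelSampler({logging.DEBUG: 0.01, logging.INFO: 0.10}))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.log(AUDIT, "rg.limit_changed")     # never sampled out
    logger.info("gameplay.spin_completed")    # kept roughly 10% of the time
```

Attaching the filter to the handler rather than the logger means a separate audit handler can still receive every record.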

13) Safety and compliance

PII/PCI: tokenization, hashing, masking; minimizing data.
RBAC/ABAC: access to logs/traces is role-based, with tenant separation.
Secrets and keys: do not log credentials/tokens; run secret detectors in CI.
Audit trail: admin panel logins, changes to limits/payments, and manual balance adjustments go only to the AUDIT index, immutably.
Legal-hold: a mechanism for freezing retentions in investigations.

14) Telemetry data quality

Schema Registry for logs/events (versioning, compatibility).
Uniform field naming (snake_case, units of measure).
Validation at ingestion (drop malformed events, count rejects as metrics).

Backpressure and protection against "log storms."
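
Validation at ingestion might look like the sketch below: check required fields, key naming, and the event taxonomy, then drop anything malformed while counting rejects so the drop rate itself is observable. The required-field list, the function names, and the counter keys are assumptions for illustration; a schema registry would normally supply the real contract.

```python
import re
from collections import Counter

REQUIRED = ("ts", "lvl", "event_type", "trace_id", "span_id", "correlation_id", "tenant")
TAXONOMY = ("auth.", "payment.", "gameplay.", "risk.", "psp.", "kyc.", "rg.", "ops.")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def validate_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event is accepted."""
    problems = []
    for field in REQUIRED:
        if event.get(field) in (None, ""):
            problems.append(f"missing:{field}")
    for key in event:
        if not SNAKE_CASE.match(key):
            problems.append(f"not_snake_case:{key}")
    event_type = event.get("event_type")
    if isinstance(event_type, str) and not event_type.startswith(TAXONOMY):
        problems.append("unknown_taxonomy")
    return problems


def ingest(event: dict, sink: list, rejects: Counter) -> bool:
    """Append valid events to the sink; count and drop invalid ones."""
    problems = validate_event(event)
    if problems:
        rejects[problems[0].split(":")[0]] += 1   # one counter per first failure kind
        return False
    sink.append(event)
    return True


if __name__ == "__main__":
    sink, rejects = [], Counter()
    ingest({"ts": "2025-11-04T19:45:31.842Z", "lvl": "INFO", "userId": "x"}, sink, rejects)
    print(len(sink), dict(rejects))
```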

15) SRE processes, on-call and runbooks

On-call matrix and escalations; quiet hours and rotations.
Runbooks are tied to alerts (diagnostic steps, SQL/LogQL recipes, feature flags for graceful degradation).
Blameless postmortems, with action items that have owners and deadlines.
Team indicators: MTTD/MTTR, percentage of noisy alerts, runbook coverage.

16) RUM and synthetics

RUM: Web Vitals (LCP, CLS, INP), front-end errors, device fingerprints, regions/providers.
Synthetics: "registration→deposit→spin→withdrawal" scenarios from different regions; private locations for internal paths (admin/back office).

17) Release practices, experiments and feature flags

Link traces to release versions (commit/artifact).

A/B tags in baggage → dashboard "effect of experiment on SLI."

Canary/Blue-Green: separate panels for canaries, error-budget burn rate.

18) Anomaly detection and anti-fraud signals

Statistical triggers (seasonality-aware) on decline-rate/chargeback-risk/surge of new cards.

Correlations: "growth of unsuccessful deposits + new release of PSP adapter."

Streaming rules (Kafka → Flink) for near-real-time reactions.
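
A seasonality-aware statistical trigger can be as simple as comparing the current value with history from the same time slot (for example, the same hour of the week) instead of with the immediately preceding minutes. The sketch below uses a one-sided z-score; the function name, the threshold of 3.0, the minimum sample count, and the sample data are assumptions for illustration, and a production detector would typically be more robust to outliers.

```python
from statistics import mean, stdev


def is_anomalous(current, same_slot_history, z_threshold=3.0, min_samples=4):
    """Flag an upward spike relative to the same slot in previous weeks.

    same_slot_history holds past values of the metric (e.g. decline rate)
    observed in the same hour-of-week, so normal weekly peaks are not flagged.
    """
    if len(same_slot_history) < min_samples:
        return False                    # not enough history to judge
    mu = mean(same_slot_history)
    sigma = stdev(same_slot_history)
    if sigma == 0:
        return current > mu             # flat history: any increase stands out
    return (current - mu) / sigma > z_threshold


if __name__ == "__main__":
    # Hypothetical decline rates for Friday 20:00 over the last five weeks.
    history = [0.050, 0.060, 0.050, 0.055, 0.052]
    print(is_anomalous(0.120, history))
    print(is_anomalous(0.055, history))
```

The same function could run inside a streaming job, with the per-slot history kept in keyed state.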

19) Implementation Roadmap (by phase)

Stage 0 - Basis: JSON logs, unified correlation fields, basic service metrics, common dashboards, first alerts.
Stage 1 - Tracing: OTel instrumentation, head + tail sampling, linking to logs.
Stage 2 - Business SLI/SLO: payment/withdrawal/game metrics, SLO alerts, error-budget processes.
Stage 3 - Maturity: Alerts-as-Code, ILM, separate retention policies, anomaly detection, per-service runbooks, SRE practices in CI/CD.

20) Review checklist

JSON-only logs, uniform keys, PII masking.
In each event: `trace_id`, `span_id`, `correlation_id`, `tenant`.
Metrics cover the golden signals and business flows.
SLOs are described, there is an error-budget and alerts on burn rate.
Tail-sampling is enabled for payment errors and high latencies.
ILM and WORM are configured for audit logs.
RBAC for telemetry, access audit.
Dashboards for Payments/Game Core/Player Journey/Infra.
Runbooks are tied to every critical alert.
Postmortems and action items are in the backlog with owners.

Appendix A: OpenTelemetry attributes (recommendation)

`service.name`, `service.version`, `deployment.environment`

`cloud.region`, `k8s.pod.name`, `k8s.container.name`

`tenant`, `brand`, `market`, `ab_test`, `user_segment`

`payment.method`, `psp`, `game.provider`, `game.id`

Appendix B: Examples of Metrics for SLO

`payment_success_ratio`, `withdrawal_ttw_p95` (time-to-wallet), `psp_latency_p99`

`game_spin_latency_p95`, `provider_error_rate`, `kafka_consumer_lag`

`auth_success_ratio`, `kyc_step_dropout`, `cache_hit_ratio`

Appendix C: Quick Investigative Recipes

"`payment_error_rate` growing" → compare by PSP/region/method, check tail-sampled traces, look at the adapter release.
"p99 spin latency ↑" → trace front end→gateway→provider; check provider/channels, thread-pool limits, network retries.
"Kafka lag ↑" → consumer health, producer retries, backpressure, slow sinks/DB.

💡 By adhering to these practices, the platform gets a robust, verifiable and cost-effective observability system that doubles as an engineering tool, business radar and compliance guarantor.
