Monitoring and logging
1) Why it matters in iGaming
Money in real time: accepting deposits, instant payouts, settling bets and winnings, tournaments - all of it is sensitive to delays and failures.
Regulation and audit: full traceability of actions is required (KYC/AML, payments, responsible gaming limits).
Complex distributed architecture: API gateways, payment orchestration, EDA/Kafka, provider services, mobile clients, front ends, BI bus.
The goal: reduce MTTD/MTTR, hold SLOs on the golden signals, and keep the incident rate under control.
2) Basic concepts of observability
Logs: detailed events (structured JSON) suitable for investigations and audits.
Metrics: aggregates in time (TSDB), suitable for SLO/alerts.
Traces: causal chains of a request (trace/span) across services, brokers and databases.
Events: domain events (BetPlaced, DepositApproved) - a bridge between business metrics and technical telemetry.
3) "Golden Signals" and SLI/SLO for iGaming
Latency: P95/P99 on critical flows (authorization, deposit, bet, session start, spin).
Traffic: RPS per API, TPS for payments, EPS for events.
Errors: share of 5xx/4xx, decline rate, failed withdrawals, provider errors.
Saturation: CPU, memory, IO, Kafka lag, DB connections, thread-pools.
- SLI: `1 - (failed_payments / total_payments)`
- SLO: 99.7% of successful card authorizations in 30 days (error budget 0.3%).
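The SLI/SLO pair above implies a concrete error budget. A minimal sketch of that arithmetic follows; the function names and the sample counts are illustrative assumptions, not part of any particular monitoring stack.

```python
def sli(failed: int, total: int) -> float:
    """Success-ratio SLI: 1 - failed/total (1.0 when there is no traffic)."""
    return 1.0 if total == 0 else 1.0 - failed / total


def error_budget_remaining(failed: int, total: int, slo: float = 0.997) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = (1.0 - slo) * total
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed / allowed_failures


# Illustrative numbers only: 1,000,000 authorizations, 1,200 failures in the window.
print(round(sli(1_200, 1_000_000), 4))                     # 0.9988
print(round(error_budget_remaining(1_200, 1_000_000), 2))  # 0.6
```

With a 99.7% SLO, one million attempts allow about 3,000 failures, so 1,200 failures leave roughly 60% of the budget.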
4) Collection and processing architecture
1. Ingestion: agents (OTel Collector/Fluent Bit), SDKs in the application, RUM/synthetics.
2. Routing: broker/telemetry bus (OTLP over HTTP/gRPC), filters and PII masking.
3. Storage:
- Metrics: TSDB (aggregation, downsampling).
- Logs: hot (indexed) / warm (less indexed) / cold (object storage, WORM).
- Traces: time-indexed storage with retention policies and tail sampling.
4. Analytics/alerts: rules (PromQL/LogQL/SQL), correlation with traces and releases.
5. Dashboards: technical and business views (payments, RNG/providers, tournament engine).
5) Log standard (JSON) and event taxonomy
Strictly JSON logs with a unified set of keys and levels are recommended.
Levels: `DEBUG < INFO < NOTICE < WARN < ERROR < FATAL < AUDIT`
Taxonomy: `auth.`, `payment.`, `gameplay.`, `risk.`, `psp.`, `kyc.`, `rg.` (responsible gaming), `ops.`.
Example of a JSON event (AUDIT/PII-safe):
```json
{
  "ts": "2025-11-04T19:45:31.842Z",
  "lvl": "AUDIT",
  "event_type": "payment.deposit_approved",
  "correlation_id": "c-7d2c1f0b",
  "trace_id": "2d6a9c0e4c0b1f72",
  "span_id": "9f3a81d2a1c3b764",
  "request_id": "r-8f12de9e",
  "tenant": "brand_eu",
  "psp": "acq_xyz",
  "user_id_hash": "u:sha256:1e63…",
  "device_id": "d-3c8f…",
  "ip_trunc": "203.0.113.0/24",
  "amount_minor": 5000,
  "currency": "EUR",
  "result": "approved",
  "latency_ms": 312,
  "tags": ["pci_safe", "kyc_passed", "low_risk"],
  "extra": {
    "bin": "411111",
    "method": "card",
    "region": "EU",
    "ab_test": "checkout_v2"
  }
}
```
PII/PCI security rules:
- Mask PAN/BIN (store only the fields the policy allows); hash or tokenize email and phone.
- Truncate IPs to /24; store GeoIP separately.
- Do not allow unsanitized user input as free text in `extra`.
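The masking rules above can be sketched with the Python standard library. The helper names, the keyed-hash scheme and the "keep the last four digits" PAN policy are assumptions for illustration; the actual retained fields must follow your own PCI policy.

```python
import hashlib
import hmac
import ipaddress


def hash_user_id(user_id: str, key: bytes) -> str:
    """Keyed SHA-256 so the raw identifier never reaches the log pipeline."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"u:sha256:{digest}"


def truncate_ip(ip: str) -> str:
    """Truncate an IPv4 address to its /24 network."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def mask_pan(pan: str) -> str:
    """Mask every digit except the last four (assumed policy, not a PCI ruling)."""
    digits = [c for c in pan if c.isdigit()]
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


print(truncate_ip("203.0.113.57"))      # 203.0.113.0/24
print(mask_pan("4111 1111 1111 1111"))  # ************1111
```

A keyed hash (HMAC) is used rather than a bare SHA-256 so that identifiers from a small space cannot be recovered by brute force without the key.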
6) Correlation: trace_id, correlation_id, idempotency_key
Add `trace_id` (from OTel), `span_id`, `correlation_id` (end-to-end for the business process) and `idempotency_key` (for payment requests) to every log record; on metrics, attach them as exemplars rather than labels to keep cardinality under control.
Propagate baggage (tenant/brand, market, A/B variant) so telemetry can be sliced by these dimensions.
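One way to get the correlation fields into every log record is to keep them in request-scoped context and merge them in the formatter. The sketch below uses only the standard library; `bind_context` and `JsonFormatter` are hypothetical names, and in a real service the IDs would come from the OTel SDK rather than being set by hand.

```python
import contextvars
import json
import logging

# Request-scoped correlation fields (hypothetical holder, one per request/task).
_log_ctx = contextvars.ContextVar("log_ctx", default={})


def bind_context(**fields):
    """Merge correlation fields into the current context without mutating the old dict."""
    _log_ctx.set({**_log_ctx.get(), **fields})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the bound context."""

    def format(self, record):
        payload = {"lvl": record.levelname, "msg": record.getMessage()}
        payload.update(_log_ctx.get())
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("psp-orchestrator")
log.addHandler(handler)
log.setLevel(logging.INFO)

bind_context(trace_id="2d6a9c0e4c0b1f72", correlation_id="c-7d2c1f0b", tenant="brand_eu")
log.info("deposit approved")
```

Because the fields live in a `ContextVar`, concurrent requests handled by asyncio tasks or threads do not see each other's IDs.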
7) Metrics: Technical and Business
Technical: RPS, p95 latency, error rate, saturation, GC, pool usage, Kafka consumer lag.
Business: registration→deposit conversion rate (CR), successful authorizations, payment declines, NGR/GGR, ARPPU, RTP anomalies, drop-off at the KYC step, share of players with responsible gaming limits.
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```
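For ad hoc checks outside the TSDB, the same two technical signals (5xx share and a latency percentile) can be computed from raw samples. This is a sketch under the assumption that status codes and latencies are already collected into plain lists; the nearest-rank method is one of several percentile definitions and will not match histogram-based estimates exactly.

```python
import math


def error_ratio(statuses):
    """Share of 5xx responses among all responses (0.0 for an empty window)."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 500 <= s <= 599) / len(statuses)


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


print(error_ratio([200, 200, 500, 404]))  # 0.25
print(percentile(range(1, 101), 95))      # 95
```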
8) Tracing and OpenTelemetry
We instrument the gateway, payment orchestrator, game core, notifications, KYC/AML, integration with providers.
Head sampling for the overall flow, plus tail sampling at a higher rate for errors, slow spans and payments.
Context propagation: `traceparent`/`tracestate`, Kafka headers, gRPC metadata.
Annotate spans with domain events: `BetPlaced`, `WithdrawalRequested`.
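The tail-sampling rule described above (always keep errors, slow spans and payments; sample the rest) reduces to a decision made once all spans of a trace are available. The sketch below is illustrative only: the `Span` type, the 1000 ms threshold, the 5% base rate and the `payment.` name prefix are assumptions, and in practice this policy is usually configured in the collector rather than hand-written.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, slow_ms=1000.0, base_rate=0.05, rng=random.random):
    """Keep the whole trace if any span is failed, slow or payment-related; else sample."""
    for span in spans:
        if span.error or span.duration_ms >= slow_ms or span.name.startswith("payment."):
            return True
    return rng() < base_rate


trace = [Span("gateway.request", 40.0), Span("payment.deposit", 312.0)]
print(keep_trace(trace))  # True: the trace contains a payment span
```

The random source is injectable (`rng`) so the policy can be tested deterministically.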
9) Alerting without noise
Multi-stage thresholds (warning/critical), flapping suppression, deduplication, time windows.
Correlation: group "5xx growth" + "Kafka lag" + "PSP p95 latency" into one incident.
SLO-based alerts: escalate when the error budget is being spent.
Alerts-as-Code (GitOps), with review and rule tests.
```yaml
groups:
  - name: payments
    rules:
      - alert: PaymentErrorSpike
        expr: (sum(rate(payment_errors_total[5m])) / sum(rate(payment_attempts_total[5m]))) > 0.02
        for: 10m
        labels: { severity: "critical", team: "payments" }
        annotations:
          summary: "Payment errors > 2% for 10m"
          runbook: "runbooks/payments/error-spike.md"
```
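SLO-based alerting is often expressed as a burn rate: how many times faster than "exactly on budget" the error budget is being consumed. A hedged sketch follows; the two-window check and the 14.4 threshold are a commonly cited convention used here as an assumption, not a value taken from this document.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0")
    return error_ratio / budget


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo: float = 0.997, threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, which filters out short blips."""
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)


print(should_page(0.05, 0.05))  # True: both windows far above budget
print(should_page(0.05, 0.01))  # False: the long window has not confirmed the spike
```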
10) Log search (example LogQL)
```logql
{app="psp-orchestrator", level=~"ERROR|FATAL"}
|= "decline"
| json
| amount_minor > 10000
| region="EU"
```
The goal is to quickly weed out noise and highlight "expensive" failures in the target region.
11) Dashboards: what's mandatory
Payments Health: success/failures by PSP, latency by method, map of regions, provider SLAs.
Game Core: RPS by provider, p95 spin latency, SDK error ratio, RTP anomalies by slot.
Player Journey: registration → KYC → deposit → play → withdrawal.
Infra: Kafka lag, DB connections, cache hit ratio, Kubernetes cluster (grid of pods/nodes).
12) Storage, retention and cost (FinOps)
Cardinality under control: avoid metrics with high-cardinality labels (e.g. user_id).
Retention: hot metrics 30-90 days, downsampled up to 13 months; logs hot 7-14 days, warm 30-90 days, cold 1-3 years (subject to regulatory requirements).
WORM/immutability for audit logs, Object Lock.
Compression/partitioning and ILM policies; separate indexes for audit/PII-safe.
Sample logs at INFO/DEBUG; keep ERROR/AUDIT in full.
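Level-based log sampling can be made deterministic by hashing the trace ID, so that all INFO/DEBUG lines of one trace are either kept or dropped together. A minimal sketch; the 10% default and the decision to keep every level other than DEBUG/INFO are assumptions.

```python
import zlib

SAMPLED_LEVELS = {"DEBUG", "INFO"}  # every other level is kept in full


def keep_log(level: str, trace_id: str, sample_percent: int = 10) -> bool:
    """Keep high-severity and audit records always; sample the rest by trace ID."""
    if level not in SAMPLED_LEVELS:
        return True
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 100
    return bucket < sample_percent


print(keep_log("AUDIT", "2d6a9c0e4c0b1f72"))                   # True
print(keep_log("INFO", "2d6a9c0e4c0b1f72", sample_percent=0))  # False
```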
13) Security and compliance
PII/PCI: tokenization, hashing, masking; minimizing data.
RBAC/ABAC: access to logs/traces by role, with tenant separation.
Secrets and keys: do not log credentials/tokens; run secret scanners in CI.
Audit trail: admin panel logins, changes to limits/payments, manual balance adjustments - written only to the AUDIT index, immutably.
Legal hold: a mechanism for freezing retention during investigations.
14) Telemetry data quality
Schema Registry for logs/events (versioning, compatibility).
Unified field naming (snake_case, units of measure).
Validation at ingestion (drop dirty events, track reject metrics).
Backpressure and protection against "log storms."
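Validation at ingestion can be as simple as checking required keys and the naming convention, and counting rejects by reason. The sketch below is an assumption-laden illustration: the required-field list is drawn from the correlation fields used in this document, and a real pipeline would validate against the versioned schema in the registry instead.

```python
import re
from collections import Counter

REQUIRED = ("ts", "lvl", "event_type", "trace_id", "span_id", "correlation_id", "tenant")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

rejected = Counter()  # reject metrics, keyed by reason


def validate_event(event: dict) -> bool:
    """Return True if the event may enter the pipeline; count the reason otherwise."""
    if any(key not in event for key in REQUIRED):
        rejected["missing_field"] += 1
        return False
    if any(not SNAKE_CASE.match(key) for key in event):
        rejected["bad_key_name"] += 1
        return False
    return True
```

Only top-level keys are checked here; nested objects such as `extra` would need a recursive pass.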
15) SRE processes, on-call and runbooks
On-call matrix and escalations; quiet hours and rotations.
Runbooks are tied to alerts (diagnostic steps, SQL/LogQL recipes, feature flags for degradation).
Blameless postmortems; action items with owners and deadlines.
Team indicators: MTTD/MTTR, percentage of noisy alerts, runbook coverage.
16) RUM and synthetics
RUM: Web Vitals (LCP, CLS, INP), front-end errors, device fingerprints, regions/providers.
Synthetics: "registration → deposit → spin → withdrawal" scenarios from different regions; private locations for internal paths (admin/back office).
17) Release, experiment and feature flag practices
Link traces to release versions (commit/artifact).
A/B tags in baggage → dashboard "effect of experiment on SLI."
Canary/Blue-Green: separate panels for canaries, error-budget burn rate.
18) Anomaly detection and anti-fraud signals
Statistical triggers (seasonality-aware) on decline-rate/chargeback-risk/surge of new cards.
Correlations: "growth of unsuccessful deposits + new release of PSP adapter."
Streaming rules (Kafka → Flink) for near-real-time reactions.
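A seasonality-aware trigger compares the current value with history from the same point in the cycle (for example the same hour of the week) rather than with a global average. A minimal z-score sketch; the threshold of 3 and the one-sided (upward only) check are assumptions suited to metrics like decline rate.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(current: float, same_slot_history: Sequence[float],
                 z_threshold: float = 3.0) -> bool:
    """Flag values far above the baseline built from the same seasonal slot."""
    if len(same_slot_history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(same_slot_history)
    sigma = stdev(same_slot_history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold


# Decline rate for "Friday 20:00" over the previous five weeks (illustrative values).
history = [0.020, 0.022, 0.019, 0.021, 0.020]
print(is_anomalous(0.050, history))  # True
print(is_anomalous(0.021, history))  # False
```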
19) Implementation Roadmap (by phase)
Stage 0 - Basis: JSON logs, unified correlation fields, basic service metrics, common dashboards, first alerts.
Stage 1 - Tracing: OTel instrumentation, head + tail sampling, linking to logs.
Stage 2 - Business SLI/SLO: payment/withdrawal/game metrics, SLO alerts, error-budget processes.
Stage 3 - Maturity: Alerts-as-Code, ILM, separate retention policies, anomaly detection, per-service runbooks, SRE practices in CI/CD.
20) Review checklist
- JSON-only logs, unified keys, PII masking.
- In each event: `trace_id`, `span_id`, `correlation_id`, `tenant`.
- Metrics cover the golden signals and business flows.
- SLOs are described, there is an error-budget and alerts on burn rate.
- Tail-sampling is enabled for payment errors and high latencies.
- ILM and WORM are configured for audit logs.
- RBAC for telemetry, access audit.
- Dashboards for Payments/Game Core/Player Journey/Infra.
- Runbooks are tied to every critical alert.
- Postmortems and action items are in the backlog, with owners.
Appendix A: OpenTelemetry attributes (recommendation)
`service.name`, `service.version`, `deployment.environment`
`cloud.region`, `k8s.pod.name`, `k8s.container.name`
`tenant`, `brand`, `market`, `ab_test`, `user_segment`
`payment.method`, `psp`, `game.provider`, `game.id`
Appendix B: Examples of Metrics for SLO
`payment_success_ratio`, `withdrawal_ttw_p95` (time-to-wallet), `psp_latency_p99`
`game_spin_latency_p95`, `provider_error_rate`, `kafka_consumer_lag`
`auth_success_ratio`, `kyc_step_dropout`, `cache_hit_ratio`
Appendix C: Quick Investigative Recipes
"`payment_error_rate` growing" → compare by PSP/region/method, check tail-sampled traces, look at recent adapter releases.
"p99 spin ↑" → trace front end → gateway → provider; check provider/channels, thread pool limits, network retries.
"Kafka lag ↑" → consumer health, producer retries, backpressure, slow sinks/DB.