Telemetry Streams

1) Purpose and context

Telemetry streams provide a continuous flow of observational data about the platform's behavior: what is happening, why, and how much it costs. In iGaming this is key to early detection of deposit/bet degradation, visibility into external providers (PSP/KYC/game studios), and demonstrable SLO and regulatory compliance.

2) Telemetry source map

Metrics (TSDB): RED/USE, business SLIs (authorization success rate, % of successful bets).
Traces (OTel): request chains through frontend → API → brokers → database/PSP.
Logs (structured): events, audit operations, errors.
RUM: TTFB/LCP, JS errors, geo/device.
Synthetics: external test transactions (login/deposit/bet) from different GEOs.
Low-level telemetry: eBPF, CPU/IO/allocation profiling, network p95/p99.
External statuses: webhooks and polling of PSP/KYC/CDN/WAF.

3) Standards and schemas

OpenTelemetry as the lingua franca: unified attribute semantics (`service.name`, `deployment.environment`, `enduser.id` (masked), trace/span IDs, PSP codes).
Schema conventions: versioning, a schema registry for logs/traces, breaking changes only behind a flag and with a grace period.
Correlation ID: a single `correlation_id` for each payment/bet across all layers, plus exemplars on metric percentiles.
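
As a rough illustration of the attribute conventions above, the sketch below builds a common attribute set with a masked `enduser.id` and a shared `correlation_id`. The helper names, the HMAC-based masking, and the `MASKING_KEY` value are assumptions made for this example, not something the text prescribes.

```python
import hashlib
import hmac
import uuid

# Hypothetical key for illustration only; a real key would come from a secret store.
MASKING_KEY = b"example-masking-key"


def mask_user_id(user_id: str) -> str:
    """Return a stable, non-reversible token for a user identifier."""
    digest = hmac.new(MASKING_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def build_attributes(service, environment, user_id, correlation_id=None):
    """Build the shared attributes attached to every metric, span, and log record."""
    return {
        "service.name": service,
        "deployment.environment": environment,
        "enduser.id": mask_user_id(user_id),  # masked, never the raw ID
        # Reuse the caller's ID so a payment/bet keeps one ID across all layers.
        "correlation_id": correlation_id or uuid.uuid4().hex,
    }
```

Passing the same `correlation_id` from the frontend through the API to the PSP call is what allows a metric exemplar, a trace, and a log line to be joined later.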

4) Ingestion pipeline (high-level)

1. Producers: SDKs/agents/collectors (OTel Collector on nodes).
2. Edge buffering: local queues (memory/disk) with limits.
3. Transport: OTLP over gRPC/HTTP → message broker (Kafka/Pulsar) with idempotency keys.
4. Processors: normalization, enrichment (GEO/tenant/channel), PII filters, fine-grained sampling.
5. Fan-out: to the TSDB (metrics), trace storage, the log system, the lake/DWH, and alerting rules.
6. Consumers: dashboards, SLO alerts (burn-rate), investigations, status page, automated release gates.

5) QoS and flow classes

Class A (real time, P1): SLI/SLO, synthetics, key providers (PSP/KYC). Delivery SLA: <5-10 s, ≥99.9%.
Class B (operational): traces/logs for RCA. SLA: <1-2 min.
Class C (analytical): aggregates and batches to the lake/DWH. SLA: hours to a day.
Routing by class determines prioritization, resource requests, and dedicated queues/topics.
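
One way to picture class-based routing is a small lookup from QoS class to a dedicated topic and delivery target. This is only a sketch: the topic names, the `qos_class` event field, and the choice to default unknown events to Class C are assumptions; the delay figures are the upper bounds quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosClass:
    name: str
    topic: str          # hypothetical topic name
    max_delay_s: float  # upper bound of the delivery SLA
    priority: int       # lower value means more urgent


CLASSES = {
    "A": QosClass("A", "telemetry.class-a", 10, 0),     # real time
    "B": QosClass("B", "telemetry.class-b", 120, 1),    # operational
    "C": QosClass("C", "telemetry.class-c", 86400, 2),  # analytical
}


def route(event: dict) -> QosClass:
    """Pick the QoS class from the event's 'qos_class' field, defaulting to C."""
    return CLASSES.get(event.get("qos_class", "C"), CLASSES["C"])
```

Defaulting to the lowest class means a mislabeled or unlabeled producer cannot crowd out Class A traffic.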

6) Sampling, aggregation, retention

Metrics: downsampling of historical series (1s→10s→1m), percentile aggregates, exemplars.
Traces: tail-based sampling (raise the sampled share for anomalies, PSP errors, and p99 spikes).
Logs: per-profile log levels, compression, noise filtering (health pings; DEBUG in production is prohibited).
Retention: "hot" (7-14 days of detail) and "cold" (aggregates/archive). Policies are set per data class and cost.
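
The tail-based sampling rule for traces can be sketched as a decision made after the whole trace has been collected. The span field names (`error`, `psp_error`, `start_ms`, `end_ms`) and the 5% base rate are assumptions for illustration; a production setup would more likely configure this in the collector than hand-code it.

```python
import random


def keep_trace(spans, p99_ms, base_rate=0.05, rng=random.random):
    """Decide whether to keep a completed trace.

    Always keep traces with errors or PSP errors, and traces slower than the
    current p99; sample the remaining "normal" traces at base_rate.
    """
    if any(span.get("error") or span.get("psp_error") for span in spans):
        return True
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > p99_ms:
        return True
    return rng() < base_rate
```

The `rng` parameter exists only so the decision can be tested deterministically.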

7) Privacy and compliance

PII hygiene: masking/tokenization of identifiers; KYC documents and card tokens are prohibited in telemetry.
Geo-localization: storage by jurisdiction; export only through an approved workflow (encryption, TTL, audit).
Access control: RBAC/ABAC for telemetry storage, SoD for exports.
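
A PII filter in the processing stage might look like the minimal sketch below. The forbidden field names and the card-number pattern are assumptions chosen for the example; a real filter would be driven by the schema registry and cover more identifier types.

```python
import re

# Hypothetical field names that must never appear in telemetry.
FORBIDDEN_KEYS = {"card_token", "kyc_document"}
# Runs of 13-19 digits are treated as card-number-like and redacted.
PAN_LIKE = re.compile(r"\b\d{13,19}\b")


def scrub(event: dict) -> dict:
    """Return a copy of the event without forbidden fields or card-like numbers."""
    clean = {}
    for key, value in event.items():
        if key in FORBIDDEN_KEYS:
            continue  # drop the field entirely
        if isinstance(value, str):
            value = PAN_LIKE.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```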

8) Flow reliability

Idempotency: event keys, deduplication in processors.
Backpressure: ingestion limits per tenant/service; drop policies for low-priority fields under overload.
Replays: retain data in the broker for ≥72 h for reprocessing.
Dead-letter: route failed events (schema, size, PII violation) to a secured DLQ with alerts.
Versioning: "dual streams" (v1 + v2) during schema changes and consumer migration.
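
The idempotency and dead-letter rules can be sketched as a processor step that validates, deduplicates, and diverts bad events. The required field names, the 64 KiB size limit, and the in-memory stores are assumptions for illustration; in practice the DLQ would be a separate secured topic and the dedup state would be bounded by a time window.

```python
import json
from collections import OrderedDict

MAX_EVENT_BYTES = 64 * 1024  # assumed size limit
REQUIRED_FIELDS = ("event_id", "timestamp", "tenant_id")  # assumed minimal schema


class Processor:
    def __init__(self, dedup_capacity=100_000):
        self._seen = OrderedDict()  # recently seen event keys, oldest first
        self._capacity = dedup_capacity
        self.accepted = []
        self.dlq = []

    def handle(self, event: dict) -> str:
        """Return 'accepted', 'duplicate', or 'dlq' for one incoming event."""
        missing = [f for f in REQUIRED_FIELDS if f not in event]
        if missing:
            self.dlq.append({"event": event, "reason": f"schema: missing {missing}"})
            return "dlq"
        if len(json.dumps(event, default=str).encode("utf-8")) > MAX_EVENT_BYTES:
            self.dlq.append({"event": event, "reason": "size"})
            return "dlq"
        key = event["event_id"]
        if key in self._seen:
            return "duplicate"  # redelivery or replay: safe to ignore
        self._seen[key] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest key
        self.accepted.append(event)
        return "accepted"
```

Because duplicates are dropped by key, replaying the broker's retained window after an outage does not double-count events that were already processed.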

9) Multi-tenant and isolation

`tenant_id`/`brand`/`region` tags on every event; hard quotas and budgets.
Isolation of Class A/B streams by topic; showback/chargeback for ingestion and storage.
Masking/aggregation down to the tenant boundary on export.
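
Per-tenant quotas are often implemented as a token bucket keyed by `tenant_id`. The sketch below is one such implementation under assumed parameters (`rate_per_s`, `burst`); the text itself does not name a specific algorithm.

```python
class TenantQuota:
    """Token bucket per tenant: 'rate_per_s' sustained events, 'burst' at once."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self._buckets = {}  # tenant_id -> (tokens, last_timestamp)

    def allow(self, tenant_id: str, now: float) -> bool:
        """Return True if the tenant may ingest one more event at time 'now'."""
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[tenant_id] = (tokens - 1, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Each tenant has its own bucket, so one noisy tenant exhausting its quota does not affect another tenant's ingestion.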

10) Stream catalog (example fields)

Identifier: `telemetry.payments.auth.success.rate.eu`

Class: A (real time)

Schema: `{timestamp, tenant, region, psp, bank_bin_group, success_rate, window}`

Source: OTel Collector + PSP-router metrics

Consumers: SLO alerts, Exec dashboard, status page

Retention: hot for 30 days, aggregates for 12 months

Owner: Payments SRE, dpo-owner (privacy)

Stream SLO: delay <10 s p95, loss <0.1%/day
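
The catalog entry above could be kept as a typed record so that producers and consumers can check events against it. The class and method names below are assumptions; the field values are the ones from the example entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamSpec:
    id: str
    qos_class: str
    schema_fields: tuple
    source: str
    consumers: tuple
    owner: str
    max_delay_s_p95: float
    max_loss_pct_per_day: float

    def missing_fields(self, event: dict) -> list:
        """Return the schema fields that are absent from an event, sorted."""
        return sorted(set(self.schema_fields) - set(event))


AUTH_SUCCESS_EU = StreamSpec(
    id="telemetry.payments.auth.success.rate.eu",
    qos_class="A",
    schema_fields=("timestamp", "tenant", "region", "psp",
                   "bank_bin_group", "success_rate", "window"),
    source="OTel Collector + PSP-router metrics",
    consumers=("SLO alerts", "Exec dashboard", "status page"),
    owner="Payments SRE",
    max_delay_s_p95=10,
    max_loss_pct_per_day=0.1,
)
```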

11) Integration with alerting and releases

SLO alerts by burn-rate (fast/slow windows) for deposits/bets.
Release gates: canary analysis on SLIs; automatic stop/rollback on degradation.
Status page: updates fed from the incident card plus SLI aggregates.
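
A burn-rate alert with fast and slow windows can be sketched as follows. Burn rate is the observed error rate divided by the error budget (1 - SLO); the 14.4 threshold and the 99.9% SLO are commonly cited example values and are assumptions here, not figures from this text.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error rate relative to the error budget; 1.0 means burning exactly on budget."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def should_page(fast, slow, slo=0.999, threshold=14.4):
    """Page only if both the fast and the slow window exceed the threshold.

    'fast' and 'slow' are (errors, total) pairs, e.g. for 5-minute and 1-hour windows.
    """
    return (burn_rate(*fast, slo) >= threshold
            and burn_rate(*slow, slo) >= threshold)
```

Requiring both windows reduces false pages: the slow window confirms the problem is sustained, and the fast window confirms it is still happening.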

12) A set of key dashboards

Exec: uptime, burn-rate, authorization/bet success (by GEO/PSP), provider status, telemetry $/RPS.
SRE/Platform: RED/USE by service, queue lag, outlier detection, eBPF profiles.
Payments/Risk: conversion by bank/PSP, soft/hard declines, KYC SLA, early chargeback signals.
Cost-obs: ingest volume by source, top labels by cardinality, cost per stream.

13) Observability economics (FinOps)

Cost KPIs: $/GB ingested, $/trace, $/SLI dashboard; a report on "heavy" metrics and labels.
Optimizations: aggregation and downsampling, dynamic sampling, trimming chatty logs, storage classes by importance.
Policies: quotas on high-cardinality labels, limits on emission frequency, quarterly schema review.
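
A cardinality quota check can be as simple as counting distinct values per label across the active series. The function names and the quota value are assumptions for illustration; most TSDBs expose equivalent statistics directly.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label name across a list of label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(distinct) for name, distinct in values.items()}


def over_quota(series, quota):
    """Return label names whose distinct-value count exceeds the quota, sorted."""
    return sorted(name for name, count in label_cardinality(series).items()
                  if count > quota)
```

Run against a sample of series, this surfaces labels such as a per-user ID that would otherwise inflate the TSDB bill.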

14) Processes and roles

Data/Observability Owners per domain (Payments, Games, Core API, Infra).
Change control for schemas: PR review, test environments, consumer compatibility.
Tabletop/chaos days: provider outages, broker overload, backpressure/idempotency checks.
Post-mortem: include telemetry analysis (sufficiency of signals, false alarms, cost).

15) Implementation Roadmap (8-12 weeks)

Weeks 1-2: audit of current streams, source map, telemetry SLO targets, selection of standards (OTel, TSDB, traces, logs).
Weeks 3-4: OTel collectors, a single correlation ID, basic RED/USE plus business SLIs for deposit/bet, stream catalog v0.
Weeks 5-6: tail-based sampling, GEO synthetics, DLQ/idempotency, privacy filters.
Weeks 7-8: FinOps panel (ingest/retention), downsampling, cardinality quotas, SLO alerts (burn-rate).
Weeks 9-10: eBPF/low-level signals, status page feed, release gates.
Weeks 11-12: chaos tests, cost optimization, formal stream SLAs, launch of the quarterly schema review.

16) Artifact templates

Telemetry Stream Spec: id, owner, schema, QoS class, sources, consumers, retention, SLO/alerts, privacy policy.
Schema PR Template: change/migration, compatibility, tests, rollback plan.
Sampling Policy: rules for raising the sampling rate on anomalies; target budgets.
Cost Review Pack: top sources by $/value, TTL/aggregation proposals.
Incident Telemetry Checklist: the charts/traces/logs that must be available for RCA.

17) KPI/KRI of telemetry streams

Delivery: p95 delay by class, % of lost messages/day.
Coverage: share of critical paths with tracing >90%, share of SLIs covered by metrics.
Signal quality: % of incidents caught by SLIs before customer complaints, false/missed alerts.
Cost: $/RPS for telemetry, $/trace, share of "noise" in ingest.
Reliability: recovery time after broker degradation, replay volume.
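
The two delivery KPIs can be computed from raw delay samples and message counters roughly as below. The nearest-rank percentile method is an assumption; a monitoring backend may use interpolation or histogram buckets and give slightly different values.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of delay samples."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def loss_pct(sent: int, received: int) -> float:
    """Percentage of messages sent by producers that never reached the consumer."""
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent
```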

18) Antipatterns

High-cardinality metrics (userId, sessionId) in TSDB.
A single "black box" of logs with no structure or schemas.
No DLQ/idempotency → duplicates and losses at peak load.
"Endless" retention without FinOps → runaway bill growth.
Traces without business context (PSP/bank/GEO) → poor diagnostics.
Inconsistent schemas between teams → consumers break.

Summary

Telemetry streams are a managed, multi-layered system: OTel standards and schemas → reliable ingestion with QoS and backpressure → sampling/aggregation and retention to control cost → privacy and multi-tenant isolation → SLO alerts, dashboards, and release gates. This setup provides early signals, fast RCA, predictable costs, and stability of the iGaming platform under peak load.
