Telemetry Streams
1) Purpose and context
Telemetry streams provide a continuous flow of observational data about the platform: what is happening, why, and how much it costs. In iGaming this is key to early detection of deposit/bet degradation, visibility into external providers (PSP/KYC/game studios), and demonstrable SLO and regulatory compliance.
2) Telemetry source map
Metrics (TSDB): RED/USE, business SLIs (authorization success rate, % of successful bets).
Traces (OTel): request chains through frontend → API → brokers → database/PSP.
Logs (structured): events, audit operations, errors.
RUM: TTFB/LCP, JS errors, geo/device.
Synthetics: external test transactions (login/deposit/sandbox bet) from different GEOs.
Low-level telemetry: eBPF, CPU/IO/allocation profiling, network p95/p99 latency.
External statuses: webhooks and status polls for PSP/KYC/CDN/WAF.
3) Standards and schemes
OpenTelemetry as the lingua franca: unified attribute semantics (`service.name`, `deployment.environment`, `enduser.id` (masked), trace/span IDs, PSP codes).
Schema conventions: versioning, a schema registry for logs/traces, breaking changes only behind a flag and with a grace period.
Correlation ID: a single `correlation_id` per payment/bet across all layers, plus exemplars on metric percentiles.
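As an illustration of the conventions above, the following minimal sketch builds one attribute set that metrics, traces and logs could share. It is not tied to any SDK: the helper names, the `psp.code` attribute name and the hard-coded masking key are assumptions made for the example, and the masking method (truncated HMAC-SHA256) is one possible choice rather than a prescribed one.

```python
import hashlib
import hmac
import uuid

# Hypothetical key for pseudonymizing user IDs; a real setup would load it from a secret store.
MASKING_KEY = b"example-masking-key"


def mask_user_id(user_id: str) -> str:
    """Return a stable, non-reversible token for a user ID (HMAC-SHA256, truncated)."""
    digest = hmac.new(MASKING_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def build_attributes(service, environment, user_id, correlation_id=None, psp_code=None):
    """Build a flat attribute dict to attach to every signal of one payment/bet."""
    attrs = {
        "service.name": service,
        "deployment.environment": environment,
        "enduser.id": mask_user_id(user_id),  # never the raw identifier
        # Reuse the caller's correlation ID; generate one only at the entry point.
        "correlation_id": correlation_id or uuid.uuid4().hex,
    }
    if psp_code is not None:
        attrs["psp.code"] = psp_code  # hypothetical attribute name
    return attrs
```

Because the mask is deterministic, the same user maps to the same token in every layer, so signals can still be joined without exposing the identifier.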
4) Ingestion pipeline (high-level)
1. Producers: SDK/agents/collectors (OTel Collector on nodes).
2. Edge buffering: local queues (memory/disk) with limits.
3. Transport: gRPC/HTTP OTLP → message broker (Kafka/Pulsar) with idempotency keys.
4. Processors: normalization, enrichment (GEO/tenant/channel), PII filters, fine-grained sampling.
5. Fan-out: to the TSDB (metrics), trace storage, the log system, the lake/DWH, and alert rules.
6. Consumers: dashboards, SLO alerts (burn-rate), investigations, status page, release auto-gates.
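The edge-buffering step can be sketched as a bounded in-memory queue. This is a simplified illustration, not a collector implementation: the class name, the drop-oldest policy and the event-count limit are assumptions, and a disk-backed queue would follow the same shape.

```python
from collections import deque


class EdgeBuffer:
    """Bounded in-memory queue that drops the oldest events when full."""

    def __init__(self, max_events: int):
        self._queue = deque()
        self._max_events = max_events
        self.dropped = 0  # exposed so the drop count can itself be reported as a metric

    def push(self, event: dict) -> None:
        """Add an event, evicting the oldest one if the limit is reached."""
        if len(self._queue) >= self._max_events:
            self._queue.popleft()
            self.dropped += 1
        self._queue.append(event)

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size events in arrival order."""
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(self._queue.popleft())
        return batch
```

Dropping the oldest events favors fresh data, which suits real-time signals; a stream where completeness matters more than freshness might instead reject new events or spill to disk.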
5) QoS and flow classes
Class A (real time, P1): SLI/SLO, synthetics, key providers (PSP/KYC). Delivery SLA: <5-10 s, ≥99.9%.
Class B (operational): traces/logs for RCA. SLA: <1-2 min.
Class C (analytical): aggregates and batches to the lake/DWH. SLA: hours to a day.
Routing by class: prioritization, different requests, dedicated queues/topics per class.
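A minimal sketch of class-based routing, assuming each event carries a `qos_class` field. The topic names and the field name are hypothetical; the delay targets take the upper bounds of the class definitions above (10 s, 2 min, one day).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QosClass:
    name: str
    topic: str          # hypothetical topic name
    max_delay_s: float  # upper bound of the delivery target for the class
    priority: int       # lower value = more urgent


QOS = {
    "A": QosClass("A", "telemetry.class-a", 10, 0),
    "B": QosClass("B", "telemetry.class-b", 120, 1),
    "C": QosClass("C", "telemetry.class-c", 86400, 2),
}


def route(event: dict) -> QosClass:
    """Pick the QoS class for an event; unknown or missing classes fall back to C."""
    return QOS.get(event.get("qos_class"), QOS["C"])
```

Falling back to class C means an unlabeled stream can never compete with class A traffic for priority.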
6) Sampling, aggregation, retention
Metrics: downsampling of historical series (1s→10s→1m), percentile aggregates, exemplars.
Traces: tail-based sampling (raise the sampled share for anomalies, PSP errors, and p99 spikes).
Logs: log level per profile, compression, noise filtering (health pings; DEBUG in production is prohibited).
Retention: "hot" (7-14 days of detail) and "cold" (aggregates/archive). Policies are set per data class and cost.
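The tail-based sampling rule can be sketched as a decision made once a trace is complete. This is a simplified illustration: the span fields (`error`, `duration_ms`), the 5% base rate and the function name are assumptions, and the p99 threshold is expected to come from the metrics pipeline.

```python
import hashlib


def keep_trace(trace_id: str, spans: list, p99_ms: float, base_rate: float = 0.05) -> bool:
    """Decide, after the trace has finished, whether to keep it."""
    # Always keep traces that contain an error (for example a PSP failure).
    if any(span.get("error") for span in spans):
        return True
    # Always keep traces whose slowest span exceeds the current p99 latency.
    if max((span.get("duration_ms", 0) for span in spans), default=0) > p99_ms:
        return True
    # Otherwise keep a deterministic share: hash the trace ID into [0, 1).
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 2**32
    return bucket < base_rate
```

Hashing the trace ID rather than drawing a random number means every collector instance reaches the same decision for the same trace.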
7) Privacy and compliance
PII hygiene: masking/tokenization of identifiers; KYC documents and card tokens are prohibited in telemetry.
Data localization: storage by jurisdiction; export only through an approved workflow (encryption, TTL, audit).
Access control: RBAC/ABAC on telemetry storage, SoD (segregation of duties) for exports.
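A PII filter in the processing stage might look like the sketch below. It is deliberately simple and makes several assumptions: the forbidden key names are hypothetical, and the regular expression only catches runs of 13 to 19 digits that look like card numbers, so it is a safety net rather than a complete detector.

```python
import re

# Hypothetical field names that must never reach telemetry storage.
FORBIDDEN_KEYS = {"card_token", "kyc_document", "passport_number"}

# 13-19 digits, optionally separated by single spaces or hyphens (card-number-like).
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def scrub_event(event: dict) -> dict:
    """Return a copy of the event without forbidden keys and with card-like numbers redacted."""
    clean = {}
    for key, value in event.items():
        if key in FORBIDDEN_KEYS:
            continue  # drop the field entirely
        if isinstance(value, str):
            value = PAN_RE.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```

For example, `scrub_event({"msg": "card 4111 1111 1111 1111 declined", "card_token": "tok_1"})` returns `{"msg": "card [REDACTED] declined"}`.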
8) Flow reliability
Idempotency: event keys, deduplication in processors.
Backpressure: ingestion limits per tenant/service; drop policies for low-priority fields under overload.
Replays: retain data in the broker for ≥72 h for reprocessing.
Dead-letter: route failed events (schema, size, PII violation) to a secure DLQ with alerts.
Versioning: a "dual stream" (v1 + v2) during schema changes, followed by consumer migration.
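Deduplication by event key and dead-letter routing can be combined in a single processor, as in this minimal sketch. The required fields, the 64 KiB size limit and the in-memory `seen` set are assumptions; a production processor would keep seen keys in a TTL-bounded store and publish to real topics.

```python
import json

MAX_EVENT_BYTES = 64 * 1024  # hypothetical size limit
REQUIRED_FIELDS = ("event_id", "timestamp", "tenant")  # hypothetical minimal schema


class Processor:
    """Accepts each valid event once; invalid events go to a dead-letter list."""

    def __init__(self):
        self.seen = set()   # event keys already processed
        self.accepted = []
        self.dlq = []       # entries carry the original event and a rejection reason

    def handle(self, event: dict) -> str:
        missing = [name for name in REQUIRED_FIELDS if name not in event]
        if missing:
            self.dlq.append({"event": event, "reason": "schema: missing " + ",".join(missing)})
            return "dlq"
        if len(json.dumps(event).encode("utf-8")) > MAX_EVENT_BYTES:
            self.dlq.append({"event": event, "reason": "size"})
            return "dlq"
        if event["event_id"] in self.seen:
            return "duplicate"  # safe to ignore: a retry or a replay
        self.seen.add(event["event_id"])
        self.accepted.append(event)
        return "accepted"
```

With this shape, replaying a broker partition after an incident is harmless: events that were already processed are reported as duplicates instead of being counted twice.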
9) Multi-tenant and isolation
Tags `tenant_id`/`brand`/`region` on every event; hard quotas and budgets per tenant.
Isolation of class A and class B streams by topic; showback/chargeback for ingestion and storage.
Masking/aggregation at the tenant boundary on export.
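Per-tenant quotas can be enforced with a token bucket keyed by `tenant_id`. The sketch below is one common way to do it, not the only one; the class name and parameters are illustrative, and the clock is injectable so the behavior can be tested without waiting.

```python
import time


class TenantQuota:
    """Token bucket per tenant: `rate` events per second, bursts up to `burst` events."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._state = {}  # tenant_id -> (tokens, time of last update)

    def allow(self, tenant_id: str) -> bool:
        """Return True and spend one token if the tenant is within its quota."""
        now = self.clock()
        tokens, last = self._state.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[tenant_id] = (tokens - 1, now)
            return True
        self._state[tenant_id] = (tokens, now)
        return False
```

Because state is kept per tenant, one noisy brand exhausting its bucket does not reduce the allowance of any other tenant.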
10) Stream catalog (example fields)
Identifier: `telemetry.payments.auth.success.rate.eu`
Class: A (real time)
Schema: `{timestamp, tenant, region, psp, bank_bin_group, success_rate, window}`
Source: OTel Collector + PSP-router metrics
Consumers: SLO alerts, Exec dashboard, status page
Retention: hot for 30 days, aggregates for 12 months
Owner: Payments SRE, dpo-owner (privacy)
Stream SLO: p95 delay <10 s, loss <0.1%/day
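A catalog entry like the one above can be kept as a typed record so that it is validated in CI. The sketch below mirrors the example fields; the class, the attribute names and the validation rules are assumptions for illustration, not a fixed format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StreamSpec:
    stream_id: str
    qos_class: str
    schema_fields: tuple
    source: str
    consumers: tuple
    hot_retention_days: int
    aggregate_retention_months: int
    owners: tuple
    slo_delay_p95_s: float
    slo_max_loss_pct_per_day: float

    def validate(self) -> list:
        """Return a list of problems; an empty list means the spec is acceptable."""
        problems = []
        if self.qos_class not in ("A", "B", "C"):
            problems.append("unknown QoS class")
        if not self.owners:
            problems.append("stream has no owner")
        if "timestamp" not in self.schema_fields:
            problems.append("schema lacks a timestamp field")
        if self.slo_delay_p95_s <= 0:
            problems.append("delay SLO must be positive")
        return problems


PAYMENTS_AUTH = StreamSpec(
    stream_id="telemetry.payments.auth.success.rate.eu",
    qos_class="A",
    schema_fields=("timestamp", "tenant", "region", "psp", "bank_bin_group", "success_rate", "window"),
    source="OTel Collector + PSP-router metrics",
    consumers=("SLO alerts", "Exec dashboard", "status page"),
    hot_retention_days=30,
    aggregate_retention_months=12,
    owners=("Payments SRE", "dpo-owner"),
    slo_delay_p95_s=10,
    slo_max_loss_pct_per_day=0.1,
)

# A copy without owners fails validation, which is how an orphaned stream would be caught.
ORPHANED = replace(PAYMENTS_AUTH, owners=())
```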
11) Integration with alerting and releases
SLO alerts by burn rate (fast/slow window) for deposits/bets.
Release gates: canary analysis on SLIs; automatic stop/rollback on degradation.
Status page: update feed from the incident card plus SLI aggregates.
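Burn rate is the observed error rate divided by the error rate the SLO allows, so a value of 1 spends the error budget exactly over the SLO period. The sketch below pages only when both a fast and a slow window exceed a threshold. The default of 14.4 is a commonly used value for a 1 h / 5 min window pair on a 30-day SLO; it is an assumption here, not something the stream catalog prescribes.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget rate (1 - SLO target)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo_target)


def should_page(fast: tuple, slow: tuple, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    `fast` and `slow` are (errors, total) pairs for the short and long window.
    """
    return (
        burn_rate(fast[0], fast[1], slo_target) >= threshold
        and burn_rate(slow[0], slow[1], slo_target) >= threshold
    )
```

Requiring both windows keeps a brief spike from paging on its own (the slow window stays low) and lets the alert clear quickly after recovery (the fast window drops first).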
12) A set of key dashboards
Exec: uptime, burn rate, authorization/bet success (by GEO/PSP), provider status, telemetry cost in $/RPS.
SRE/Platform: RED/USE by service, queue lag, outlier detection, eBPF profiles.
Payments/Risk: conversion by bank/PSP, soft/hard declines, KYC SLA, early chargeback signals.
Cost-obs: ingest volume by source, top labels by cardinality, cost per stream.
13) Observability Finance (FinOps)
Cost KPIs: $/GB ingested, $/trace, $/SLI dashboard; a report on "heavy" metrics and labels.
Optimizations: aggregation and downsampling, dynamic sampling, cleanup of chatty logs, storage classes by importance.
Policies: quotas on high-cardinality labels, limits on emission frequency, quarterly schema review.
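A report on "heavy" labels can start from counting distinct values per label across the active series. The sketch below assumes the series are available as plain label dictionaries (for example from a TSDB export); the function names and the quota value are illustrative.

```python
from collections import defaultdict


def label_cardinality(series) -> dict:
    """Count distinct values per label. `series` is an iterable of label dicts, one per time series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(distinct) for name, distinct in values.items()}


def over_quota(series, quota: int) -> list:
    """Return (label, cardinality) pairs above the quota, highest cardinality first."""
    counts = label_cardinality(series)
    offenders = [(name, n) for name, n in counts.items() if n > quota]
    return sorted(offenders, key=lambda item: -item[1])
```

Labels such as a user or session ID show up at the top of this list immediately, which makes the report a practical input for the quarterly schema review.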
14) Processes and roles
Data/Observability Owners per domain (Payments, Games, Core API, Infra).
Change control for schemas: PR review, test environments, consumer compatibility checks.
Tabletop/chaos days: provider outages, broker overload, backpressure/idempotency checks.
Post-mortem: include telemetry analysis (sufficiency of signals, false alarms, cost).
15) Implementation Roadmap (8-12 weeks)
Weeks 1-2: audit of current streams, source map, telemetry SLO goals, selection of standards (OTel, TSDB, traces, logs).
Weeks 3-4: OTel collectors, a single correlation ID, basic RED/USE plus business SLIs for deposit/bet, stream catalog v0.
Weeks 5-6: tail-based sampling, GEO synthetics, DLQ/idempotency, privacy filters.
Weeks 7-8: FinOps panel (ingest/retention), downsampling, cardinality quotas, SLO alerts (burn rate).
Weeks 9-10: eBPF/low-level signals, status page feed, release gates.
Weeks 11-12: chaos tests, cost optimization, formal stream SLAs, launch of the quarterly schema review.
16) Artifact patterns
Telemetry Stream Spec: id, owner, schema, QoS class, sources, consumers, retention, SLO/alerts, privacy policy.
Schema PR Template: change/migration, compatibility, tests, rollback plan.
Sampling Policy: rules for raising the sampling rate on anomalies; target budgets.
Cost Review Pack: top sources by $/value, TTL/aggregation proposals.
Incident Telemetry Checklist: the charts/traces/logs required for RCA.
17) KPI/KRI of telemetry streams
Delivery: p95 delay by class, % of messages lost per day.
Coverage: share of critical paths with tracing >90%, share of SLIs covered by metrics.
Signal quality: % of incidents caught by SLIs before user complaints, false/missed alerts.
Cost: $/RPS for telemetry, $/trace, share of "noise" in ingest.
Reliability: recovery time after broker degradation, replay volume.
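The two delivery KPIs are straightforward to compute from per-message delays and sent/received counters. The sketch below uses the nearest-rank definition of the 95th percentile; that choice and the function names are assumptions, and a TSDB would normally estimate percentiles from histograms instead.

```python
def p95(values) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p95 of an empty sample is undefined")
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def loss_pct(sent: int, received: int) -> float:
    """Percentage of messages that were sent but never delivered."""
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent
```

For a class A stream, these values would be compared against the stream SLO from the catalog, for example a p95 delay under 10 s and a loss under 0.1% per day.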
18) Antipatterns
High-cardinality metrics (userId, sessionId) in TSDB.
A single "black box" of logs without structuring and schemes.
No DLQ/idempotency → duplicates and losses at peak.
"Endless" retention without FinOps → exponential bill growth.
Traces without business context (PSP/bank/GEO) → poor diagnostics.
Inconsistent schemas between teams → consumers break.
Summary
Telemetry streams form a managed, multi-layered system: OTel standards and schemas → reliable ingestion with QoS and backpressure → sampling/aggregation and retention for cost control → privacy and multi-tenant isolation → SLO alerts, dashboards, and release gates. This pipeline yields early signals, fast RCA, predictable costs, and a stable iGaming platform under peak load.