Telemetry and Event Collection
1) Purpose and principles
Objectives:
- A single, predictable event flow for analytics, anti-fraud, RG, compliance, and ML.
- End-to-end tracing (user/session/request/trace) and reproducibility.
- PII minimization and privacy compliance.
Principles: schema-first, privacy-by-design, idempotency-by-default, observability-by-default, cost-aware.
2) Taxonomy of events
Payment: `payment.deposit`, `payment.withdrawal`, `payment.chargeback`.
Gaming: `game.session_start/stop`, `game.bet`, `game.payout`, `bonus.applied`.
Customer: `auth.login`, `profile.update`, `kyc.status_changed`, `rg.limit_set`.
Operational: `api.request`, `error.exception`, `release.deploy`, `feature.flag_changed`.
Compliance: `aml.alert_opened`, `sanctions.screened`, `dsar.requested`.
Each type has a domain owner, a schema, and a freshness SLO.
3) Schemas and contracts
Required fields (minimum):
- `event_time` (UTC), `event_type`, `schema_version`, `event_id` (UUID/ULID),
- `trace_id`/`span_id`, `request_id`, `user.pseudo_id`, `session_id`.
Example (`game.bet`):
```json
{
  "event_id": "01HFY1S93R8X",
  "event_time": "2025-11-01T18:45:12.387Z",
  "event_type": "game.bet",
  "schema_version": "1.4.0",
  "user": {"pseudo_id": "p-7a2e", "age_band": "25-34", "country": "EE"},
  "session": {"id": "s-2233", "device_id": "d-9af0"},
  "game": {"id": "G-BookOfX", "provider": "StudioA", "stake": {"value": 2.00, "currency": "EUR"}},
  "ctx": {"ip": "198.51.100.10", "trace_id": "f4c2...", "request_id": "req-7f91"},
  "labels": {"market": "EE", "affiliate": "A-77"}
}
```
Schema evolution: semantic versioning; backward-compatible changes only add nullable fields; breaking changes go into a new version (`/v2`) with a dual-write period.
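To make the "add nullable fields only" rule concrete, here is a minimal TypeScript sketch of the `game.bet` payload above evolving from 1.4.0 to a hypothetical 1.5.0. The field `jackpot_contribution` is an invented example, not part of the actual taxonomy.

```ts
// Illustrative types only; field names follow the JSON example above.
interface Money {
  value: number;
  currency: string; // ISO 4217, e.g. "EUR"
}

interface GameBetEventV1_4 {
  event_id: string;       // UUID/ULID generated on the client
  event_time: string;     // UTC ISO-8601
  event_type: "game.bet";
  schema_version: "1.4.0";
  user: { pseudo_id: string; age_band?: string; country?: string };
  session: { id: string; device_id: string };
  game: { id: string; provider: string; stake: Money };
}

// Backward-compatible 1.5.0: the only change is a new optional (nullable) field.
type GameBetEventV1_5 = Omit<GameBetEventV1_4, "schema_version" | "game"> & {
  schema_version: "1.5.0";
  game: GameBetEventV1_4["game"] & {
    jackpot_contribution?: Money | null; // hypothetical new field
  };
};
```

Anything that removes or retypes an existing field would instead require a `/v2` stream with dual writing until consumers migrate.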
4) Instrumentation: where and how
4.1 Client (Web/Mobile/Desktop)
Telemetry SDK with a local buffer, batch submission, and exponential retries.
Auto-captured events: page views, clicks, element visibility, web vitals (TTFB, LCP, CLS), JS errors.
Identifiers: `device_id` (stable but privacy-safe), `session_id` (rotated), `user.pseudo_id`.
Noise protection: dedup by `event_id`, throttling, client-side sampling (see the sketch below).
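A minimal sketch of the client-side behavior described above: a local queue, batch flush, and exponential backoff with jitter. The endpoint, batch limit, and retry counts are assumptions for illustration, not the actual igsdk-js implementation.

```ts
type TelemetryEvent = { event_id: string; event_type: string; [k: string]: unknown };

class TelemetryBuffer {
  private queue: TelemetryEvent[] = [];

  constructor(
    private endpoint = "/telemetry/batch", // assumed edge endpoint
    private maxBatch = 100,
    private maxRetries = 5,
  ) {}

  enqueue(event: TelemetryEvent): void {
    // Client-side dedup guard: drop if the same event_id is already queued.
    if (this.queue.some(e => e.event_id === event.event_id)) return;
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) void this.flush();
  }

  async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    const events = this.queue.splice(0, this.maxBatch);
    const body = JSON.stringify({ sent_at: new Date().toISOString(), events });

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        const res = await fetch(this.endpoint, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body,
        });
        if (res.ok) return; // batch accepted
      } catch {
        // network error: fall through to backoff and retry
      }
      // Exponential backoff with jitter: ~0.5s, 1s, 2s, ... plus random spread.
      const delay = 500 * 2 ** attempt + Math.random() * 250;
      await new Promise(r => setTimeout(r, delay));
    }
    // All retries failed: return the batch to the queue for a later flush.
    this.queue.unshift(...events);
  }
}
```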
4.2 Server/backend
Logger/tracer wrappers (OpenTelemetry) → emission of domain events.
Mandatory propagation of `trace_id` from the edge/gateway to all downstream services.
Outbox pattern for transactional publishing of domain events (sketched below).
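A sketch of the outbox pattern under assumed table names (`bets`, `outbox`) using node-postgres: the domain write and the event row commit in the same transaction, and a separate relay later publishes outbox rows to the bus. This is an illustration of the pattern, not the platform's actual code.

```ts
import { Pool } from "pg";
import { randomUUID } from "crypto";

const pool = new Pool(); // connection settings come from the PG* environment variables

export async function placeBet(userPseudoId: string, stake: number): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // 1) Domain write.
    await client.query(
      "INSERT INTO bets (user_pseudo_id, stake) VALUES ($1, $2)",
      [userPseudoId, stake],
    );

    // 2) Outbox write in the same transaction.
    const event = {
      event_id: randomUUID(),
      event_time: new Date().toISOString(),
      event_type: "game.bet",
      user: { pseudo_id: userPseudoId },
      stake,
    };
    await client.query(
      "INSERT INTO outbox (event_id, event_type, payload) VALUES ($1, $2, $3)",
      [event.event_id, event.event_type, JSON.stringify(event)],
    );

    await client.query("COMMIT"); // both rows commit, or neither does
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
// A relay process reads unpublished outbox rows, emits them to Kafka,
// and marks them published only after the broker acknowledges.
```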
4.3 Providers/third parties
Connectors (PSP/KYC/game studios) with normalization to the house schemas; versioned adapters.
Signature/payload integrity checks, perimeter logging (ingest audit).
5) OpenTelemetry (OTel)
Traces: each request receives a `trace_id`; logs/events are correlated via `trace_id`/`span_id`.
Logs: use OTel Logs/converters; environment labels `service.name`, `deployment.env`.
Metrics: RPS/latency/error-rate by service, business metrics (GGR, conversion).
Collector: a single point of receipt, buffering, and export to Kafka/HTTP/the observability stack.
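A minimal sketch of the correlation described in this section, using only the public `@opentelemetry/api` package: the active span's `trace_id`/`span_id` are stamped into the domain event's `ctx` block. SDK and exporter setup (Collector endpoint, `service.name`) is assumed to exist elsewhere, and `publish` is a hypothetical stand-in for the outbox/bus producer.

```ts
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("igsdk-backend"); // assumed instrumentation name

export function emitBetEvent(payload: Record<string, unknown>): void {
  tracer.startActiveSpan("game.bet", span => {
    const { traceId, spanId } = span.spanContext();
    const event = {
      ...payload,
      event_type: "game.bet",
      ctx: { trace_id: traceId, span_id: spanId }, // same ids as in the traces
    };
    publish(event); // hypothetical publisher (outbox/bus producer)
    span.end();
  });
}

// Hypothetical stand-in so the sketch is self-contained.
function publish(event: Record<string, unknown>): void {
  console.log(JSON.stringify(event));
}
```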
6) Identifiers and correlation
`event_id` - uniqueness and idempotency.
`user.pseudo_id` - stable pseudonym (the mapping is stored separately with restricted access).
`session_id`, `request_id`, `trace_id`, `device_id` are required for end-to-end analysis.
IDs are kept consistent at the API gateway and SDK level.
7) Sampling and volume control
Rules: per event type, per market, dynamic (adaptive) under load.
Always-captured events: payment/compliance/incidents are never sampled.
Analytical events: 10-50% sampling with corrective weights in the data marts is allowed.
Server-side downsampling: acceptable for high-frequency metrics (see the sketch below).
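A sketch of per-event-type sampling with corrective weights, under assumed rates and event names: critical types keep a rate of 1.0 (never sampled), analytical types keep a configurable share and carry `sample_weight = 1/rate` so aggregates in the marts can be re-scaled.

```ts
const SAMPLING_RATES: Record<string, number> = {
  "payment.*": 1.0,        // never sampled
  "aml.*": 1.0,            // never sampled
  "ui.click": 0.1,         // assumed analytical rate: keep 10%
  "game.spin_render": 0.25, // assumed high-frequency UI event
};

function rateFor(eventType: string): number {
  const exact = SAMPLING_RATES[eventType];
  if (exact !== undefined) return exact;
  const prefix = eventType.split(".")[0] + ".*";
  return SAMPLING_RATES[prefix] ?? 1.0; // default: keep everything
}

export function sample<T extends { event_type: string }>(
  event: T,
): (T & { sample_weight: number }) | null {
  const rate = rateFor(event.event_type);
  if (Math.random() >= rate) return null;          // dropped on the client/edge
  return { ...event, sample_weight: 1 / rate };    // weight for downstream sums
}
```

An adaptive variant would adjust the rates at runtime based on ingest load, which is why the rules above are listed as dynamic.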
8) Privacy and compliance
Minimize PII: tokenize PAN/IBAN/email; IP → geo codes/ASN at ingest (see the sketch after this list).
Regionalization: send to regional ingest endpoints (EEA/UK/BR).
DSAR/RTBF: support selective hiding/erasure of projections; keep a legally defensible operations log.
Retention policies: retention periods by type (shorter for analytics, longer for regulatory data); Legal Hold.
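A minimal sketch of ingest-time PII minimization: the e-mail is replaced with a keyed HMAC token and the IP is reduced to coarse network information before the event leaves the edge. The key name and token format are assumptions; key management (KMS, rotation) and the actual geo/ASN lookup are out of scope here.

```ts
import { createHmac } from "crypto";

const PII_TOKEN_KEY = process.env.PII_TOKEN_KEY ?? "dev-only-key"; // assumed secret source

export function tokenizeEmail(email: string): string {
  // Deterministic token: the same e-mail always maps to the same pseudonym,
  // so joins still work without exposing the raw value downstream.
  return "em_" + createHmac("sha256", PII_TOKEN_KEY)
    .update(email.trim().toLowerCase())
    .digest("hex")
    .slice(0, 32);
}

export function coarsenIp(ip: string): string {
  // IPv4: keep only the /24; a real implementation would also map to geo/ASN.
  const parts = ip.split(".");
  return parts.length === 4 ? `${parts[0]}.${parts[1]}.${parts[2]}.0/24` : "unknown";
}
```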
9) Transport and buffering
Client → Edge: HTTPS (HTTP/2/3), `POST /telemetry/batch` (up to 100 events).
Edge → Bus: Kafka/Redpanda, partitioned by `user.pseudo_id`/`tenant_id` (see the sketch after the envelope example below).
Formats: JSON (at ingest), Avro/Protobuf (on the bus), Parquet (in the lake).
Reliability: retries with jitter, DLQ, poison-pill isolation.
```json
{
  "sdk": {"name": "igsdk-js", "version": "2.7.1"},
  "sent_at": "2025-11-01T18:45:12.500Z",
  "events": [ {...}, {...} ]
}
```
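A sketch of the edge → bus hop with kafkajs: the validated batch is fanned out to Kafka, keyed by `user.pseudo_id` so per-user ordering holds within a partition. The topic and broker names are assumptions; in real code the producer would be connected once at startup rather than per call.

```ts
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "telemetry-edge", brokers: ["kafka:9092"] });
const producer = kafka.producer();

type IngestEvent = { event_id: string; user?: { pseudo_id?: string }; [k: string]: unknown };

export async function publishBatch(events: IngestEvent[]): Promise<void> {
  await producer.connect(); // resolves immediately if already connected
  await producer.send({
    topic: "telemetry.events.v1", // assumed topic name
    messages: events.map(e => ({
      key: e.user?.pseudo_id ?? "anonymous", // partition key → per-user ordering
      value: JSON.stringify({ ...e, received_at: new Date().toISOString() }),
    })),
  });
}
```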
10) Reliability and idempotency
Client-generated `event_id` + server-side dedup by `(event_id, source)`.
Outbox in services, exactly-once semantics in streams (keyed state + dedupe).
Ordering within a key: partitioning by `user`/`session`.
Clock control: NTP/PTP, allowed drift (for example, ≤ 200 ms), `received_at` set on the server (dedup sketched below).
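A sketch of server-side deduplication keyed by `(event_id, source)` within a time window, as described above. An in-memory map stands in for the real keyed state store (e.g. Flink/RocksDB state or Redis); the 24-hour window is an assumed value.

```ts
const DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000; // assumed 24h window
const seen = new Map<string, number>();      // "source:event_id" → first-seen timestamp

export function isDuplicate(eventId: string, source: string, now = Date.now()): boolean {
  const key = `${source}:${eventId}`;
  const firstSeen = seen.get(key);
  if (firstSeen !== undefined && now - firstSeen < DEDUP_WINDOW_MS) return true;
  seen.set(key, now);
  return false;
}

// Periodically evict expired keys so the window does not grow without bound.
export function evictExpired(now = Date.now()): void {
  for (const [key, ts] of seen) {
    if (now - ts >= DEDUP_WINDOW_MS) seen.delete(key);
  }
}
```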
11) Telemetry Quality (TQ) and SLO
Completeness: ≥ 99.5% of critical-type events within the reporting period T.
Freshness: p95 delivery delay to Silver ≤ 15 min.
Correctness: valid schemas ≥ 99.9%, drop rate < 0.1%.
Trace coverage: share of requests with a `trace_id` ≥ 98%.
Cost/GB: target budget for ingest/storage by domain.
12) Observability and dashboards
Minimum widgets:
- Ingest lag (p50/p95) by source and region.
- Completeness by event type and market.
- Schema validation errors / oversized payloads.
- SDK version map and percentage of legacy clients.
- Correlation of web-vitals ↔ conversion/failures.
13) Client SDK Requirements
Light footprint, offline buffer, deferred initialization.
Settings: sampling, max batch size, max queue age, privacy mode (no-PII) - see the config sketch after this list.
Protection: package signature/anti-tamper, key obfuscation.
Update: feature-flags to disable noisy events.
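A sketch of a configuration surface that covers the requirements above. The option names and defaults are illustrative assumptions, not the actual igsdk-js API.

```ts
interface TelemetryConfig {
  samplingRate: number;      // 0..1, applied to analytical events only
  maxBatchSize: number;      // events per POST /telemetry/batch
  maxQueueAgeMs: number;     // flush even a small batch once it reaches this age
  privacyMode: "standard" | "no-pii"; // no-pii strips optional identifiers client-side
  disabledEvents: string[];  // driven by feature flags to silence noisy event types
}

const defaultConfig: TelemetryConfig = {
  samplingRate: 0.25,
  maxBatchSize: 100,
  maxQueueAgeMs: 5_000,
  privacyMode: "no-pii",
  disabledEvents: ["ui.hover"], // hypothetical noisy event type
};
```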
14) Edge layer and protection
Rate limit, WAF, schema validation, compression (gzip/br).
Token bucket per client; anti-replay (`request_id`, TTL) - sketched below.
IP and UA are extracted, then normalized/enriched outside the "raw" payload.
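A sketch of the two edge protections named above: a per-client token bucket for rate limiting and a TTL-bounded `request_id` cache for anti-replay. Capacities and TTLs are assumed values; a production edge would keep this state in a shared store rather than in process memory.

```ts
const BUCKET_CAPACITY = 50;       // assumed burst size per client
const REFILL_PER_SEC = 10;        // assumed sustained request rate
const REPLAY_TTL_MS = 5 * 60_000; // assumed request_id TTL

const buckets = new Map<string, { tokens: number; updatedAt: number }>();
const seenRequestIds = new Map<string, number>();

export function allowRequest(clientId: string, requestId: string, now = Date.now()): boolean {
  // Anti-replay: reject a request_id already seen within the TTL.
  const seenAt = seenRequestIds.get(requestId);
  if (seenAt !== undefined && now - seenAt < REPLAY_TTL_MS) return false;
  seenRequestIds.set(requestId, now);

  // Token bucket: refill proportionally to elapsed time, then spend one token.
  const b = buckets.get(clientId) ?? { tokens: BUCKET_CAPACITY, updatedAt: now };
  b.tokens = Math.min(BUCKET_CAPACITY, b.tokens + ((now - b.updatedAt) / 1000) * REFILL_PER_SEC);
  b.updatedAt = now;
  if (b.tokens < 1) {
    buckets.set(clientId, b);
    return false; // rate limited
  }
  b.tokens -= 1;
  buckets.set(clientId, b);
  return true;
}
```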
15) Integration with the data pipeline
Bronze: append-only raw payload (for forensics).
Silver: normalized tables with deduplication/enrichment.
Gold: data marts for BI/AML/RG/product.
Lineage between events and reports; versioned transformations.
16) Client quality analytics
Share of silent clients (no events in N hours).
"Storm" anomalies (mass duplicates/bursts).
Share of "legacy SDKs" by version and platform.
17) Processes and RACI
R: Data Platform (ingest/bus/validators), App Teams (SDK instrumentation).
A: Head of Data/Architecture.
C: Compliance/DPO (PII/retention), SRE (SLO/incidents).
I: BI/Marketing/Risk/Product.
18) Implementation Roadmap
MVP (2-4 weeks):
1. Event taxonomy v1 + JSON schemas for 6-8 types.
2. SDK (Web/Android/iOS) with batching and sampling; Edge `/telemetry/batch`.
3. Kafka + Bronze layer; basic validators and dedup.
4. Dashboards for ingest lag/completeness, alerts on drops/validation errors.
Phase 2 (4-8 weeks):
- OTel Collector, trace correlation; Silver normalization and DQ rules.
- Regional endpoints (EEA/UK), privacy mode, DSAR/RTBF procedures.
- SDK version map, auto-rollout of updates by rings.
- Exactly-once in streams, Feature Store connections, online anti-fraud feeds.
- Rules-as-Code for schemas and validators, impact analysis.
- Cost optimization: adaptive sampling, Z-order/clustering in the lake.
19) Quality checklist before release
- Required schema fields and correct types are filled in.
- `trace_id`/`request_id`/`session_id` are present.
- SDK supports batch, retry, sampling.
- Edge validates the schema and limits the payload size.
- Privacy filters and tokenization of sensitive fields are enabled.
- Configured SLO/alerts and dashboards.
- Documentation for domains (example event, owner, SLA).
20) Frequent mistakes and how to avoid them
Raw events without schemas: introduce a schema registry and CI validation.
No idempotency: require `event_id` and maintain deduplication windows.
PII mixed into analytics: keep mappings separate, mask fields.
No tracing: propagate `trace_id` through gateway → services → events.
Unmanaged volumes: use sampling/throttling and budget quotas.
A global endpoint without regions: use regionalization and data residency.
21) Glossary (brief)
OpenTelemetry (OTel) is an open standard for traces/metrics/logs.
Outbox - transactional publishing of domain events.
DLQ (dead-letter queue) - a queue for "broken" messages.
Sampling - selection of a part of events for volume reduction.
Data Residency - storing data in the desired jurisdiction.
22) Bottom line
Well-designed telemetry is about agreements, not just "sending logs": strict schemas, consistent identifiers, privacy by default, reliable transport, observability, and cost awareness. By following this article, you get a steady stream of events ready for analytics, compliance, and machine learning with predictable SLOs.