GH GambleHub

Distributed Tracing

(Section: Technology and Infrastructure)

Brief Summary

Distributed traces answer the question of where and why time is lost along a request's path through the gateway, APIs, queues, databases, and external providers (PSPs/game studios). OpenTelemetry (OTel) is an open SDK/agent/protocol standard that unifies traces, metrics, and logs. In iGaming it is a baseline tool for keeping p95/p99 under control, localizing payment issues quickly, and identifying bottlenecks before peak tournaments.

1) OTel concepts

Trace - the full path of an operation (deposit, bet, withdrawal).
Span - a unit of work (HTTP handler, SQL query, queue/provider call).
Attributes - key-value details (`net.peer.name`, `db.system`, `psp.route`).
Events - point-in-time occurrences (retry, timeout, cache miss).
Links - references to other traces (important for async/queues).
Resource - process metadata: `service.name`, `service.version`, `deployment.environment`, `cloud.region`.

2) Context propagation

Use W3C Trace Context:

traceparent: 00-<trace_id>-<span_id>-01
tracestate: ...

Additionally, use baggage for safe keys (for example, `tenant`, `route`); never put PII there.

Where to propagate the context: API gateway → internal RPCs → queue producer → consumer → external HTTP (PSP/providers).
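The header format above can be sketched without any SDK. A minimal, stdlib-only Go example of building and parsing a `traceparent` value (the field lengths follow the W3C format; the `TraceContext` type and helper names are hypothetical, for illustration only):

```go
package main

import (
	"fmt"
	"strings"
)

// TraceContext holds the four fields carried by a W3C traceparent header.
type TraceContext struct {
	Version string // "00"
	TraceID string // 32 hex chars
	SpanID  string // 16 hex chars
	Flags   string // "01" = sampled
}

// Format renders the context as a traceparent header value.
func (tc TraceContext) Format() string {
	return fmt.Sprintf("%s-%s-%s-%s", tc.Version, tc.TraceID, tc.SpanID, tc.Flags)
}

// ParseTraceparent splits a traceparent header into its four fields,
// rejecting values with the wrong shape.
func ParseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	return TraceContext{Version: parts[0], TraceID: parts[1], SpanID: parts[2], Flags: parts[3]}, nil
}

func main() {
	tc := TraceContext{Version: "00", TraceID: "4bf92f3577b34da6a3ce929d0e0e4736", SpanID: "00f067aa0ba902b7", Flags: "01"}
	h := tc.Format()
	fmt.Println(h)
	back, _ := ParseTraceparent(h)
	fmt.Println(back.TraceID)
}
```

In production this is exactly what an OTel propagator does on every hop listed above; the point of the sketch is that the header is plain text, so any layer (gateway, RPC, queue) can carry it.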

3) Semantic conventions (mandatory minimum)

HTTP/RPC: `http.method`, `http.route`, `http.status_code`.
DB/cache: `db.system` (`mysql`/`postgresql`/`redis`), `db.statement` (masked), `db.operation`.
Queues: `messaging.system` (`kafka`/`rabbitmq`), `messaging.destination`, `messaging.operation` (`send`/`process`).
Payments: `psp.route`, `psp.provider`, `payment.id` (pseudonymized), `amount`, `currency`.
iGaming domain: `game.provider`, `game.session_id` (hashed), `player.id_hash`.

A single taxonomy → comparable dashboards and a fast search for root causes.

4) Sampling: How not to drown in data

Head-based

Simple and cheap; suitable for the general flow.
Downside - you can lose the "interesting" slow/erroneous traces.

Tail-based (in the Collector)

The decision is made after the spans complete: only errors, slow traces, and important segments (VIP/payments) are kept.
Ideal for production load: greatly reduces cost while staying highly informative.

Recommended hybrid:
  • Head: 5-10% for "background" coverage.
  • Tail: 100% of errors + p95+ slow traces + payment traces/canary releases.
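The hybrid policy above reduces to a small decision function. A stdlib-only Go sketch (the `SpanSummary` type, thresholds, and service names are illustrative assumptions; a real tail sampler works on full finished traces inside the Collector):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// SpanSummary is the minimal information a sampling decision needs
// once a trace has finished.
type SpanSummary struct {
	TraceID    string
	Service    string
	DurationMs int
	IsError    bool
}

// headSampled applies the cheap head-based decision: keep roughly
// `percent`% of traces, chosen deterministically from the trace ID
// so every service makes the same choice for the same trace.
func headSampled(traceID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%100 < percent
}

// keepTrace mirrors the hybrid policy: always keep errors, slow traces
// and payment services; otherwise fall back to head sampling.
func keepTrace(s SpanSummary, slowMs int, headPercent uint32) bool {
	if s.IsError || s.DurationMs >= slowMs {
		return true
	}
	if s.Service == "payments-api" || s.Service == "payments-worker" {
		return true
	}
	return headSampled(s.TraceID, headPercent)
}

func main() {
	fmt.Println(keepTrace(SpanSummary{TraceID: "a", Service: "wallet", DurationMs: 900}, 250, 10))
	fmt.Println(keepTrace(SpanSummary{TraceID: "b", Service: "payments-api", DurationMs: 20}, 250, 10))
}
```

Hashing the trace ID (rather than calling a random generator) is what makes head sampling consistent across services: either every hop keeps the trace or none does.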

5) OpenTelemetry Collector topologies

Agent-sidecar (on each node/pod): local ingestion, minimal buffering, export to the aggregator.
Gateway (per cluster): tail sampling, routing, enrichment, export to Tempo/Jaeger/Zipkin/OTLP.

Example: tail-sampling (YAML fragment)

processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ ERROR ]
      - name: slow_p95
        type: latency
        latency:
          threshold_ms: 250
      - name: payments
        type: string_attribute
        string_attribute:
          key: service.name
          values: [ "payments-api", "payments-worker" ]

6) Correlation with metrics and logs

Add `trace_id`/`span_id` to every log entry.
Store latency metrics as histograms and include exemplars - a reference to a representative `trace_id` for jumping from a p95 bucket to a concrete trace.
Release annotations (Git SHA, chart version) - as events/labels.

7) Instrumentation (languages and auto-agents)

Go (manual + auto)

// Set up the provider with a batching exporter and Resource attributes.
tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),
    sdktrace.WithResource(resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName("payments-api"),
    )),
)
otel.SetTracerProvider(tp)

// Start a span for the operation and attach domain attributes.
ctx, span := tracer.Start(ctx, "Deposit")
defer span.End()
span.SetAttributes(
    attribute.String("psp.route", "pspX"),
    attribute.String("currency", "EUR"),
)

Java

Auto-agent `-javaagent:opentelemetry-javaagent.jar`, configured via env (`OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`).
Manual - annotations/instrumentation of hot spots (JDBC pools, cache).

Node.js / Python

Auto-instrumentation with the SDK + plugins (Express/FastAPI/Celery).
For queues - producer/consumer wrappers that set `messaging.*` attributes and links.

8) Queues and async: correct spans

Producer (`send`): a span for publishing to the topic/queue.
Consumer (`process`): a new span for message processing with a link to the producer span (preserves the causal relationship without sharing a `trace_id`).
Attributes: `messaging.kafka.partition`, `messaging.rabbitmq.routing_key`, `messaging.message_id`.
On retries - a `retry` event with an attempt counter.
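The producer/consumer hand-off works because the context travels in message headers, exactly like an HTTP header. A stdlib-only Go sketch (the `Message` type and helper names are hypothetical; real code would use the OTel propagator against the broker client's header API):

```go
package main

import "fmt"

// Message is a minimal stand-in for a Kafka/RabbitMQ message with headers.
type Message struct {
	Headers map[string]string
	Body    []byte
}

// InjectContext writes the producer's traceparent into the message headers,
// mirroring what an OTel propagator does for outgoing HTTP requests.
func InjectContext(m *Message, traceparent string) {
	if m.Headers == nil {
		m.Headers = map[string]string{}
	}
	m.Headers["traceparent"] = traceparent
}

// ExtractContext reads the traceparent on the consumer side; the consumer
// then starts a NEW span and attaches the producer span as a link.
func ExtractContext(m *Message) (string, bool) {
	tp, ok := m.Headers["traceparent"]
	return tp, ok
}

func main() {
	msg := &Message{Body: []byte(`{"withdrawal_id": 42}`)}
	InjectContext(msg, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	tp, ok := ExtractContext(msg)
	fmt.Println(ok, tp)
}
```

Using a link (rather than a parent relationship) is deliberate: the consumer's work may run minutes later and batch many messages, so it gets its own trace while the causal connection survives.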

9) DB/Cache and N + 1

Enable database driver tracing; group queries of the same shape into batches.
For Redis/cache, use the attributes `cache.hit`/`cache.miss`.
Move "heavy" queries into separate spans - it becomes visible where p99 comes from.
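An N+1 pattern is visible in trace data as many repetitions of the same short query shape. A stdlib-only Go sketch of that detection heuristic (the `DBSpan` type, thresholds, and function name are illustrative assumptions, not an OTel API):

```go
package main

import "fmt"

// DBSpan is the slice of a trace needed to spot an N+1 pattern.
type DBSpan struct {
	Statement  string // already masked, e.g. "SELECT ... WHERE id = ?"
	DurationMs int
}

// looksLikeNPlusOne flags a trace containing many repetitions of the
// same short query shape: the classic N+1 signature.
func looksLikeNPlusOne(spans []DBSpan, minRepeats, maxMs int) (string, bool) {
	counts := map[string]int{}
	for _, s := range spans {
		if s.DurationMs <= maxMs {
			counts[s.Statement]++
		}
	}
	for stmt, n := range counts {
		if n >= minRepeats {
			return stmt, true
		}
	}
	return "", false
}

func main() {
	spans := make([]DBSpan, 0, 50)
	for i := 0; i < 50; i++ {
		spans = append(spans, DBSpan{Statement: "SELECT * FROM bets WHERE id = ?", DurationMs: 2})
	}
	stmt, found := looksLikeNPlusOne(spans, 20, 10)
	fmt.Println(found, stmt)
}
```

Counting by masked statement shape (not raw SQL) is what makes the grouping work: every iteration of the loop produces the same key.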

10) External providers: PSP/game studios

Wrap HTTP clients: `psp.provider`, `psp.route`, `timeout_ms`, `attempt`.
Log error codes/types, but never PII (card numbers, tokens).
Compare studios/routes by `duration` and `error-rate`.

11) Frontend and RUM

OTel Web SDK: `page_view`, `resource_load`, `xhr`.
Propagate `traceparent` to the backend to stitch the user's path through UI → API → database.
Segmentation by geo/network provider - optional labels.

12) Safety and PII

Mask fields (`db.statement` redacted), hash `player_id`.
Data zones: `pii = true`, `region = EU/TR/LATAM`.
Role-based access control for payment traces.
WORM/retention: retention periods for sensitive traces, deletion by policy.
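The masking and hashing steps can be sketched in a few lines of stdlib Go (the helper names and regex are illustrative assumptions; in production masking usually happens in the instrumentation library or a Collector processor, and a keyed HMAC is stronger than a bare hash):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

// hashPlayerID pseudonymizes a player identifier before it is attached
// to a span attribute; an HMAC with a rotated key is stronger in production.
func hashPlayerID(id string) string {
	sum := sha256.Sum256([]byte(id))
	return hex.EncodeToString(sum[:])[:16] // shortened to limit cardinality
}

// literals matches quoted strings and bare numbers in SQL text.
var literals = regexp.MustCompile(`('[^']*')|\b\d+\b`)

// maskStatement redacts literal values from db.statement so that card
// numbers, amounts and names never reach the tracing backend.
func maskStatement(sql string) string {
	return literals.ReplaceAllString(sql, "?")
}

func main() {
	fmt.Println(maskStatement("SELECT * FROM wallets WHERE player_id = 12345 AND name = 'alice'"))
	fmt.Println(hashPlayerID("player-42"))
}
```

Masking before export (not in the backend) is the safe default: data that never leaves the process cannot leak from the tracing pipeline.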

13) Performance and cost

Tail sampling by policy: "errors + slow + payments + canary releases."

Downsample metric histograms, deduplicate logs aggressively.
Constrain cardinality: never write `user_id` as a metric label.
Buffers/batches in the Collector, OTLP compression.

14) Dashboards and analysis

Service map: service dependencies, colored by error rate/latency.
Release compare: stable vs canary revision (p95, error rate, payment conversion).
Top slow traces: along the `/deposit` route, broken down by PSP/region.
Queue lag: traces with deep consumer delay.

15) Examples of Collector configurations

Pipelines (metrics/traces/logs, fragment)

receivers:
  otlp: { protocols: { http: {}, grpc: {} } }

processors:
  batch:
  memory_limiter:
    limit_mib: 1024
    spike_limit_mib: 256
  attributes/payments:
    actions:
      - key: "psp.provider"
        action: insert
        value: "pspX"

exporters:
  otlp/traces:  { endpoint: tempo:4317, tls: { insecure: true } }
  otlp/metrics: { endpoint: prometheus-otlp:4317, tls: { insecure: true } }
  otlp/logs:    { endpoint: loki-otlp:4317, tls: { insecure: true } }

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ memory_limiter, batch, tail_sampling ]
      exporters: [ otlp/traces ]
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ otlp/metrics ]
    logs:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ otlp/logs ]

16) Runbooks (typical scenarios)

A) p99 growth in `payments-api`

1. Open "Top slow traces" → drill into the database/PSP spans.
2. If the problem is the PSP - switch the route, tune retries/timeouts.
3. Check the `withdrawals` queue (lag), scale up the consumers.

B) Post-release 5xx errors

1. Filter by `service.version`.
2. Compare stable/canary; look for spikes by `psp.route`.
3. Freeze the promotion, roll back (see Release Strategies/Rollbacks).

C) Suspected N+1

1. Look for traces with a large number of short DB spans.
2. Enable aggregation/joins, add a cache layer.

17) Implementation checklist

1. Enable the OTel SDK and uniform Resource attributes (`service.name`, `env`, `region`).
2. Propagate W3C Trace Context through all layers and queues.
3. Minimum set of semantic attributes (HTTP/DB/queue/PSP).
4. Tail sampling: errors, p95+, payments, canary.
5. Logs with `trace_id`/`span_id`, metrics with exemplars.
6. Dashboards: service map, release compare, payments flow.
7. PII policies: masking, zones, roles, retention.
8. Tests/load runs: verify correlation and tracing completeness before peaks.
9. Auto-generated runbook links in alerts.
10. Telemetry cost and cardinality report.

18) Antipatterns

Traces "only at the entrance" without DB/queue spans → useless.
No propagation in async paths → cause-and-effect chains break.
Random 1% sampling without tail logic → you won't catch the slow/erroneous traces.
Logs without `trace_id` → no end-to-end correlation.
Raw PII in attributes/logs → compliance risks.
Cardinality "through the ceiling" (user/session as metric labels) → cost explosion.

Summary

OpenTelemetry turns observability from a set of disparate tools into an end-to-end performance language. With correct context propagation, careful semantics, tail sampling, and the "metrics ↔ traces ↔ logs" combination, an iGaming team keeps p95/p99 under control, quickly isolates bottlenecks (DB, queues, PSP), and ships releases confidently even at traffic peaks.
