Distributed Traces
1) Why and what it is
Distributed tracing links operations along the entire request path: frontend → API gateway → microservices → databases/caches → brokers → background jobs/pipelines.
The result is a trace composed of spans, where each span captures one component's operation with attributes, events, and a status. This accelerates RCA, helps keep SLOs, and reduces MTTR.
- Visibility of the critical path and bottlenecks.
- Correlation of symptoms (metrics) with causes (spans) and details (logs).
- Analysis of retries, queues, DLQs, fan-outs, and latency "sawtooth" patterns.
2) Trace data model
Trace - a call graph identified by `trace_id`.
Span - an operation: `name`, `kind` (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL), `start/end`, `status`, `attributes`, `events`, `links[]`.
Attributes - key-value pairs (`http.route`, `db.system`, `messaging.system`, `cloud.region`, etc.).
Events - point-in-time markers inside a span (for example, `retry`, `cache_miss`).
Span Links - relationships beyond parent-child (batches, retries, fan-in/fan-out).
Resource - process/service metadata (`service.name`, version, environment).
3) Context and propagation
3.1 W3C Trace Context
Headers:
- `traceparent`: `version-traceid-spanid-flags` (the flags carry the sampled bit).
- `tracestate`: vendor-specific state (keep it minimal).
- Baggage - keys for business context (keep it small, no PII/secrets).
3.2 Propagating context
HTTP: `traceparent`/`tracestate`; gRPC: metadata; WebSocket: during the upgrade handshake and inside messages.
Messaging: in message headers (Kafka/NATS/RabbitMQ) - inject the original context on the PRODUCER side and extract it on the CONSUMER side.
Databases: do not propagate the context into the database - record attributes on the span (query, rows, `db.system`), but never the bound values.
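A minimal sketch of HTTP header injection with the OpenTelemetry Python API; the tracer name, span name, and downstream URL are illustrative assumptions (auto-instrumentation for `requests` can do the same automatically):

```python
from opentelemetry import trace, propagate
import requests

# Assumes a TracerProvider is already configured (see section 10.1).
tracer = trace.get_tracer("checkout")

def charge(payload: dict) -> requests.Response:
    # CLIENT span for the outgoing HTTP call
    with tracer.start_as_current_span("billing.charge", kind=trace.SpanKind.CLIENT):
        headers: dict = {}
        propagate.inject(headers)  # writes W3C traceparent/tracestate (and baggage, if configured)
        return requests.post("http://billing:8080/charge", json=payload, headers=headers, timeout=2)
```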
4) Sampling: How not to go broke
Head sampling: probabilistic or rule-based (by route, tenant, endpoint), decided at the start of the trace.
Tail sampling (in the collector): keep the "interesting" traces - errors, long p95/p99, rare paths.
Exemplars: histogram metrics store references to specific `trace_id`s.
Recommendation: combine them - 5-20% head sampling plus tail rules that keep 100% of 5xx/timeout/p99 traces.
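A minimal head-sampling sketch with the OpenTelemetry Python SDK; the 10% ratio is an illustrative value, not a universal recommendation:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of traces at the root; child spans follow the parent's decision,
# so one head decision covers the whole request path. Tail rules in the
# collector (section 10.2) then keep 100% of errors and slow traces.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
)
```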
5) Attributes and taxonomy (minimum required)
General: `service.name`, `service.version`, `deployment.environment`, `cloud.region`, `http.route`, `http.method`, `http.status_code`, `db.system`, `db.statement` (truncated/without values), `messaging.system`, `messaging.operation`, `peer.service`, `net.peer.name`, `tenant.id`, `request.id`.
Business labels: sparing and PII-free. Example: `order.segment`, `plan.tier`.
6) Asynchronous scenarios, queues, and batches
PRODUCER → CONSUMER: create a PRODUCER span and inject the context into the message headers (`traceparent`, baggage). The CONSUMER starts a CONSUMER span with a link to the PRODUCER span (when there is no strict parent-child hierarchy).
Fan-out: one input, many outputs → child spans or links.
Batch: the CONSUMER reads a burst of N messages → one span with `events` per messageId, or `links` to the N separate contexts (see the sketch below).
DLQ: a separate `messaging.dlq.publish` span with reason and count.
Retries: a `retry` event plus a `retry.count` attribute; preferably a new CHILD span per attempt.
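A minimal sketch of the batch case, assuming the producer already injected W3C headers into each message; the tracer name and the header layout (a kafka-python-style client) are illustrative:

```python
from opentelemetry import trace, propagate
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("orders-worker")

def process_batch(messages) -> None:
    # One link per message context, so the batch span is reachable from every producer's trace.
    links = []
    for msg in messages:
        carrier = {k: v.decode() for k, v in (msg.headers or [])}
        ctx = trace.get_current_span(propagate.extract(carrier)).get_span_context()
        if ctx.is_valid:
            links.append(Link(ctx))
    with tracer.start_as_current_span(
        "orders batch process", kind=SpanKind.CONSUMER, links=links,
        attributes={"messaging.system": "kafka", "messaging.batch.message_count": len(messages)},
    ):
        ...  # process the whole batch
```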
7) Integration with logs and metrics
Write JSON logs with `trace_id`/`span_id` → jump from a span straight to its logs.
RED/USE metrics carry exemplars → from p99 spikes jump to the "bad" spans.
Traces feed both technical signals (dependency errors) and business signals (conversion) through events.
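A minimal sketch of trace-log correlation with the standard `logging` module; the JSON layout is intentionally naive and illustrative (the OpenTelemetry logging instrumentation can inject these fields automatically):

```python
import logging
from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    """Attach the current trace_id/span_id to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
# Naive JSON layout; a real formatter should escape the message properly.
handler.setFormatter(logging.Formatter(
    '{"ts":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s",'
    '"trace_id":"%(trace_id)s","span_id":"%(span_id)s"}'))
logging.getLogger().addHandler(handler)
```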
8) Performance and cost
Sample and throttle events.
Reduce attribute cardinality (no `user_id`/`session_id` as labels!).
Compression and batching in the exporter; bounded export timeouts (see the sketch after this list).
Storage: hot tier for 1-7 days, then aggregates and only the "problem" traces.
Cost categories: collectors, indexes, storage, egress.
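A minimal sketch of exporter batching and span limits with the OpenTelemetry Python SDK; the specific numbers are illustrative starting points, not tuned recommendations:

```python
from opentelemetry.sdk.trace import TracerProvider, SpanLimits
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    # Cap per-span payload size to keep cardinality and storage in check.
    span_limits=SpanLimits(max_attributes=64, max_events=64, max_links=32),
)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="otel-collector:4317", timeout=5),  # bounded export timeout (seconds)
    max_queue_size=2048,         # drop rather than block when the backend is slow
    schedule_delay_millis=5000,  # flush batches every 5 s
    max_export_batch_size=512,
))
```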
9) Security and privacy
In transit: TLS 1.3/mTLS between agents and the collector; at rest: encryption and private key management (see "In Transit/At Rest Encryption").
PII and secrets: never write them into attributes/events; tokenize/mask on the producer side.
Multi-tenancy: `tenant.id` as a resource label, isolated namespaces, read policies, and auditing of trace access (see "Audit and immutable logs").
10) Implementation schemes (reference)
10.1 OpenTelemetry SDK (pseudocode)
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({
    "service.name": "checkout",
    "service.version": "1.12.0",
    "deployment.environment": "prod",
}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("POST /pay", attributes={
    "http.route": "/pay", "http.method": "POST", "tenant.id": "t-42",
}):
    ...  # business logic, external API calls, database access
```
10.2 OTel Collector - tail sampling (fragment)
```yaml
processors:
  batch: {}
  tail_sampling:
    decision_wait: 2s
    policies:
      - { name: keep-errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: keep-slow, type: latency, latency: { threshold_ms: 900 } }
      - { name: baseline, type: probabilistic, probabilistic: { sampling_percentage: 10 } }
exporters:
  otlphttp: { endpoint: http://trace-backend:4318 }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [otlphttp]
```
10.3 Kafka - context propagation (concept)
PRODUCER: add the `traceparent` and `baggage` headers to the message.
CONSUMER: if the message starts a new flow, begin a new CONSUMER span with a link to the context extracted from the headers.
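A minimal produce/consume sketch with the OpenTelemetry Python API and a kafka-python-style client; the topic name `orders` and the function names are illustrative (the batch variant with multiple links is sketched in section 6):

```python
from opentelemetry import trace, propagate
from opentelemetry.trace import Link, SpanKind

tracer = trace.get_tracer("orders")

def publish(producer, payload: bytes) -> None:
    # PRODUCER span: inject the current context into the message headers.
    with tracer.start_as_current_span("orders publish", kind=SpanKind.PRODUCER,
                                      attributes={"messaging.system": "kafka"}):
        carrier: dict = {}
        propagate.inject(carrier)  # writes traceparent/tracestate (and baggage, if configured)
        producer.send("orders", value=payload,
                      headers=[(k, v.encode()) for k, v in carrier.items()])

def consume(message) -> None:
    # CONSUMER span: link to the producer's context instead of parenting under it.
    carrier = {k: v.decode() for k, v in (message.headers or [])}
    producer_ctx = trace.get_current_span(propagate.extract(carrier)).get_span_context()
    links = [Link(producer_ctx)] if producer_ctx.is_valid else []
    with tracer.start_as_current_span("orders process", kind=SpanKind.CONSUMER, links=links,
                                      attributes={"messaging.system": "kafka",
                                                  "messaging.operation": "process"}):
        ...  # business processing
```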
11) Data/ETL and ML
For batch pipelines: a span per batch/partition with `dataset.urn`, `run.id`, `rows.in`/`rows.out`, `freshness.lag`.
For ML: spans for training/inference with model version, latency, and feature-store calls.
Link to lineage: `run.id` and `dataset.urn` let you jump from a trace to the data-lineage graph.
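A minimal sketch of an ETL batch span; the attribute names follow this section, while `load_partition`, the URN, and the lag value are illustrative assumptions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("etl.orders-daily")

def run_partition(run_id: str, partition: str, rows_in: int) -> None:
    # One span per batch/partition, tagged so it can be joined with the lineage graph.
    with tracer.start_as_current_span("orders-daily load", attributes={
        "run.id": run_id,
        "dataset.urn": "urn:dataset:warehouse.orders_daily",  # illustrative URN
        "dataset.partition": partition,
        "rows.in": rows_in,
    }) as span:
        rows_out = load_partition(partition)  # hypothetical loader
        span.set_attribute("rows.out", rows_out)
        span.add_event("freshness.check", {"freshness.lag_seconds": 120})  # illustrative value
```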
12) Trace Platform SLO
Ingestion availability: ≥ 99.9%.
Indexing delay: ≤ 60 s at p95.
Head-sample coverage: ≥ 5-10% of key routes.
100% retention of traces with status ERROR or with latency above the threshold from the "critical paths" catalog.
Platform alerts: growth in drops, export timeouts, indexer lag, cardinality blow-ups.
13) Testing and verification
Trace contract in CI: key endpoints produce spans, mandatory attributes are present, and `traceparent` propagates correctly through the gateway/proxy (see the sketch after this list).
Synthetic/RUM probes: collect traces from the outside.
Chaos/incident drills: disable dependencies and verify that the tail sampler picks up the errors.
Smoke test in production: after a release, check that spans arrive and that the exemplar → trace jump works.
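A minimal sketch of a propagation contract check with pytest and requests; the gateway URL and the `/debug/headers` echo endpoint are hypothetical and stand in for whatever debug hook your stack provides:

```python
import re
import requests

TRACEPARENT = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"  # W3C example value
GATEWAY = "https://staging-gateway.example.com"  # illustrative

def test_traceparent_propagates_through_gateway():
    # Hypothetical debug endpoint that echoes the headers the upstream service received.
    resp = requests.get(f"{GATEWAY}/debug/headers",
                        headers={"traceparent": TRACEPARENT}, timeout=5)
    received = resp.json().get("traceparent", "")
    # The trace-id must survive the hop; the gateway/service may rewrite the parent span-id.
    assert received.split("-")[1] == TRACEPARENT.split("-")[1]
    assert re.fullmatch(r"00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}", received)
```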
14) Checklists
Before production
- W3C Trace Context is propagated everywhere; for messages - via headers.
- Basic head sampling is enabled; tail rules for 5xx/p99 are configured.
- Mandatory attributes: route, method, status, `service.version`, `tenant.id`.
- JSON logs with `trace_id`/`span_id`; metrics with exemplars.
- PII sanitizers; encryption in transit/at rest; access policies.
- Dashboards: "critical path", "dependency errors", "retries/timeouts".
Operation
- Monthly review of attribute cardinality; quotas.
- Tune tail sampling against SLOs (less noise, all "hot" traces in the sample).
- RCA drills following the metric → exemplar → trace → logs path.
- Check coverage of queues, DLQs, and ETL jobs.
15) Runbooks
RCA: p99 spike on /pay
1. Open the RED dashboard; from the p99 bucket follow an exemplar to a trace.
2. Find the "narrow" CLIENT span (for example, `gateway.call`), check `retry.count` and timeouts.
3. Compare service/dependency versions, region/zone.
4. Enable degradation (cached responses/RPS limits), notify the dependency owners.
5. After the fix - file RCA and optimization tickets.
DLQ surge
1. Filter traces by `messaging.dlq.publish`.
2. Check the reasons (events), correlate with the release.
3. Start reprocessing, temporarily increase the CONSUMER timeout, notify downstream owners.
16) Frequent errors
No context propagation through gateways/brokers. Solution: middleware/interceptors, shared libraries.
Tracing 100% of traffic. Expensive and pointless - use tail sampling.
Logs without `trace_id`. Correlation is lost → MTTR grows.
PII in attributes. Mask/tokenize; keep only the technical context.
"Silent" background jobs. Add spans per batch/partition and a `run.id`.
Naming drift. Introduce a dictionary of span names and attribute keys.
17) FAQ
Q: Is head or tail sampling better?
A: A combination. Head sampling gives the base layer; tail sampling guarantees that anomalies and errors are kept.
Q: How do I trace through Kafka without a rigid hierarchy?
A: Use span links between PRODUCER and CONSUMER; context - in headers.
Q: What to do with sensitive SQL?
A: Record `db.statement` truncated/normalized (no values), or just `db.operation` plus sizes/timings (see the sketch after this FAQ).
Q: How do traces relate to business metrics?
A: Add domain attributes without PII (plan/segment), use "business stage" events inside spans, and jump from conversion metrics to traces via exemplars.
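A minimal sketch of statement normalization before setting `db.statement`, assuming simple literal patterns; real sanitizers should be driver- or dialect-aware:

```python
import re

_STRING = re.compile(r"'(?:[^']|'')*'")     # SQL string literals, including escaped quotes
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")  # integer and decimal literals

def normalize_statement(sql: str, max_len: int = 512) -> str:
    """Replace literal values with '?' and truncate, so no data values reach the span."""
    sanitized = _NUMBER.sub("?", _STRING.sub("?", sql))
    return sanitized[:max_len]

# Example: "SELECT * FROM users WHERE email = 'a@b.c' AND plan = 2"
#       -> "SELECT * FROM users WHERE email = ? AND plan = ?"
```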
- "Observability: logs, metrics, traces"
- "Audit and immutable logs"
- "In Transit/At Rest Encryption"
- "Data Origin (Lineage)"
- "Privacy by Design (GDPR)"
- "Secret Management"