Distributed Tracing: OpenTelemetry
1) Why OTel and what it gives
OpenTelemetry (OTel) is an open standard plus a set of SDKs, agents, and collectors for telemetry (traces, metrics, logs) speaking a single protocol, OTLP. Objectives:
- End-to-end visibility of request paths (gateway → services → DB/cache/queues).
- Fast RCA/debugging of degradations and releases (canary/blue-green).
- Linkage with SLOs and auto-rollbacks (operational decisions driven by data).
- Vendor-agnostic: export to any backend, with no lock-in to a single APM.
Guiding principles: standardize, sample smartly, secure by default, correlate everything.
2) Basics: context, spans, attributes
Trace - a tree/graph of calls; Span - a single operation (RPC, SQL query, queue call).
Span Kind: `SERVER`, `CLIENT`, `PRODUCER`, `CONSUMER`, `INTERNAL`.
W3C Trace Context: the `traceparent` and `tracestate` headers; context is carried across services.
Attributes - key-value pairs (keep cardinality low!), Events - timestamped marks, Status - error code/description.
Links - connect spans outside a strict hierarchy (important for async/fan-out/fan-in).
Span naming examples:
- HTTP: `HTTP {METHOD}` (the route, e.g. `/withdraw`, goes into an attribute)
- DB: `DB SELECT` / `DB INSERT`
- Queue: `QUEUE publish topic=X` / `QUEUE consume topic=X`
3) Semantic conventions (semconv)
Use stable attribute schemas:
- HTTP/gRPC: `http.method`, `http.route`, `http.status_code`, `url.full`.
- DB: `db.system=postgresql`, `db.statement` (only a safe, sanitized excerpt!), `db.name`.
- Messaging: `messaging.system=kafka`, `messaging.operation=receive`, `messaging.destination`.
- Cloud/K8s/Host: `cloud.region`, `k8s.pod.name`, `container.id`.
- Resource attributes (required): `service.name`, `service.version`, `deployment.environment`.
Pin schema stability via `schemaUrl` in SDK/Collector resources.
4) Sampling: head, tail, adaptive
Head-based (in the SDK): decides up front, cheap; good for high QPS, but can miss the "interesting" traces.
Tail-based (in the Collector): decides after the trace completes; allows rules by status, latency, attributes.
Adaptive/dynamic: raises the sampled share on errors / p95 growth.
Production-grade recipe: head-sample 1-5% globally + tail-select the "important": `status = ERROR`, `latency > p95`, "money" routes, PSP/KYC errors.
5) Correlation: metrics, logs, trails
Exemplars: `trace_id` labels attached to metric histograms (quick jump from a metric to the trace).
Logs: add `trace_id`/`span_id` and you can pivot from logs to traces.
SpanMetrics (processor) aggregates RED metrics (requests, errors, duration) from traces for SLOs/alerts.
6) Deployment architecture
Agent (DaemonSet) on each node receives from applications (OTLP) and forwards.
Gateway (cluster/region) - the central Collector (many replicas) with routing/sampling/enrichment pipelines.
OTLP: gRPC `4317`, HTTP `4318`; enable TLS/mTLS.
Pros of "agent + gateway": isolation, buffering, local backpressure, a simpler network topology.
7) OpenTelemetry Collector - basic template (gateway)
```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  memory_limiter: { check_interval: 5s, limit_percentage: 75 }
  batch: { timeout: 2s, send_batch_size: 8192 }
  attributes:
    actions:
      - { key: deployment.environment, action: upsert, value: prod }
  resource:
    attributes:
      - { key: service.namespace, action: upsert, value: core }
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_traces
        type: latency
        latency: { threshold_ms: 800 }
      - name: important_routes
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/withdraw", "/deposit"]
      - name: baseline_prob
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp/apm:
    endpoint: apm-backend:4317
    tls: { insecure: true }
  prometheus:
    endpoint: 0.0.0.0:9464

extensions:
  health_check: {}
  pprof: { endpoint: 0.0.0.0:1777 }
  zpages: { endpoint: 0.0.0.0:55679 }

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      # tail_sampling must run before batch so it sees whole traces
      processors: [memory_limiter, attributes, resource, tail_sampling, batch]
      exporters: [otlp/apm]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/apm]  # route logs to the same OTLP backend
```
8) SpanMetrics and RED for SLO
Add the processor:

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus
    histogram:
      explicit:
        buckets: [50ms, 100ms, 200ms, 400ms, 800ms, 1600ms, 3200ms]
service:
  pipelines:
    traces: { receivers: [otlp], processors: [spanmetrics, tail_sampling, batch], exporters: [otlp/apm] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
```

Now you have `traces_spanmetrics_calls{service, route, code}` and `traces_spanmetrics_duration_bucket` series for SLOs/alerts.
9) K8s: deploying Collector (DaemonSet + Deployment)
Agent (DaemonSet) fragment:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: otel-agent, namespace: observability }
spec:
  selector: { matchLabels: { app: otel-agent } }
  template:
    metadata: { labels: { app: otel-agent } }
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector:latest
          args: ["--config=/conf/agent.yaml"]
          ports:
            - { containerPort: 4317, name: otlp-grpc }
            - { containerPort: 4318, name: otlp-http }
```
Gateway (Deployment) - several replicas, Service ClusterIP/Ingress, HPA by CPU/QPS.
10) Security and privacy
TLS/mTLS between SDK → Agent → Gateway → Backend.
Authentication (Basic/OAuth/headers) at the Gateway ingress; restrict allowed origins.
PII redaction: filter/mask attributes (`user.email`, `payment.card`) in a Collector processor.
Limits: in the SDK, cap event size and attribute count (cardinality protection).
RBAC in the backend + separate per-project/per-tenant namespaces.
```yaml
processors:
  attributes/redact:
    actions:
      - key: user.email
        action: delete
      - key: payment.card
        action: delete
```
11) Instrumentation: Quick starts
Node.js

```js
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes as R } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: "http://otel-agent.observability:4317" }),
  resource: new Resource({
    [R.SERVICE_NAME]: "payments-api",
    [R.SERVICE_VERSION]: "1.14.2",
    [R.DEPLOYMENT_ENVIRONMENT]: "prod"
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
Java (Spring)

```yaml
# Gradle dependency: io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter
# application.yml
otel:
  service:
    name: orders-api
  exporter:
    otlp:
      endpoint: http://otel-agent.observability:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.05
```
Python (FastAPI)
python from opentelemetry import trace from opentelemetry. sdk. resources import Resource from opentelemetry. exporter. otlp. proto. grpc. trace_exporter import OTLPSpanExporter from opentelemetry. sdk. trace import TracerProvider from opentelemetry. sdk. trace. export import BatchSpanProcessor
provider = TracerProvider(resource=Resource. create({"service. name":"fraud-scoring","deployment. environment":"prod"}))
provider. add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-agent. observability:4317", insecure=True)))
trace. set_tracer_provider(provider)
Go
go exp, _:= otlptracegrpc. New(ctx, otlptracegrpc. WithEndpoint("otel-agent. observability:4317"), otlptracegrpc. WithInsecure())
res:= resource. NewWithAttributes(semconv. SchemaURL, semconv. ServiceNameKey. String("gateway"), semconv. DeploymentEnvironmentKey. String("prod"))
tp:= sdktrace. NewTracerProvider(sdktrace. WithBatcher(exp), sdktrace. WithResource(res), sdktrace. WithSampler(sdktrace. ParentBased(sdktrace. TraceIDRatioBased(0. 05))))
otel. SetTracerProvider(tp)
12) Asynchronous: queues, buses, cron
PRODUCER/CONSUMER spans connected via `links` (messages have their own life cycle).
Propagate context in message headers (`traceparent`/`baggage`).
For batch consumption, create a span per message or aggregate with the `messaging.batch.size` attribute.
For cron/jobs: a new trace per run + links to the originating events (if any).
13) Baggage and targeting
Keep a minimum of stable keys (`tenant_id`, `region`, `vip_tier`) in baggage; forbid PII.
Propagate it through the gateway and into logs for subsequent aggregation of metrics by segment.
14) Integration with releases and SLO gating
Canary steps → check `traces_spanmetrics_*` on the affected routes/sub-segments.
On degradation (5xx/p95) - auto-pause and rollback (Argo Rollouts AnalysisTemplate + PromQL).
Exemplars in the metrics lead directly to the "bad" traces of the release window.
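The gating rule itself is a simple threshold check; this hypothetical helper (name, signature, and default thresholds are all assumptions, not from the source) shows the decision an AnalysisTemplate encodes:

```python
def should_rollback(error_rate: float, p95_ms: float,
                    max_error_rate: float = 0.01,
                    max_p95_ms: float = 800.0) -> bool:
    """True if the canary breaches either the error-rate or the latency SLO,
    given RED numbers pulled from the spanmetrics series."""
    return error_rate > max_error_rate or p95_ms > max_p95_ms
```

In practice the same comparison is written as a PromQL expression over `traces_spanmetrics_*` inside the rollout analysis step.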
15) Limits and performance
Limit: `OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT`, `OTEL_SPAN_EVENT_COUNT_LIMIT`, `OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT`.
Sample exception events/stack traces by probability or frequency.
Use the batch processor in both SDK and Collector; keep queues sized to avoid losing traces during bursts.
16) Interoperability and migrations
Propagators: use W3C; support dual reading of B3/X-Ray during migration.
Export: OTLP → APM (Jaeger/Tempo/Elastic/X-Ray, etc.).
Stable semconv versions - pin `schemaUrl` and plan upgrades.
17) Anti-patterns
High-cardinality attributes (`user_id` as a label, dynamic keys).
Logs without `trace_id` → no correlation.
Exporting directly from applications to an Internet APM (no gateway, no TLS/mTLS).
Collecting 100% of traces in production - expensive and pointless.
Dumping raw SQL queries with user data into `db.statement`.
Inconsistent service names/versions - metrics "fall apart."
18) Implementation checklist (0-45 days)
0-10 days
Enable SDK/auto-instrumentation on 2-3 critical services.
Configure Agent (DaemonSet) + Gateway (Deployment), OTLP 4317/4318 with TLS.
Add `service.name`, `service.version`, `deployment.environment` everywhere.
11-25 days
Tail-sampling by errors/latency/"money" routes.
SpanMetrics → Prometheus, include Exemplars and RED/SLO dashboards.
Propagate the W3C through the API gateway/NGINX/mesh; correlate logs.
26-45 days
Cover queues/DB/cache; links for async.
PII redaction policies in the Collector; attribute limits in the SDK.
Integrate SLO gating of releases and auto-rollback.
19) Maturity metrics
Incoming request coverage with tracing ≥ 95% (accounting for head/tail sampling).
The share of metrics with Exemplars ≥ 80%.
RCA time "from metric to trace" ≤ 2 min (p50).
0 PII leaks in attributes/events (scanner).
All services have `service.name`/`service.version`/`deployment.environment` and consistent semantics.
20) Appendices: useful fragments
NGINX propagation:

```nginx
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate  $http_tracestate;
proxy_set_header baggage     $http_baggage;
```

Prometheus with Exemplars (Grafana):

```
histogram_quantile(0.95, sum(rate(traces_spanmetrics_duration_bucket{route="/withdraw"}[5m])) by (le))
```
Policy: prohibition of PII attributes (pseudo-linter)
```yaml
forbid_attributes:
  - user.email
  - payment.card
  - personal.
```
21) Conclusion
OpenTelemetry turns observability into a standardized, manageable discipline: unified semantics, safe propagation, smart sampling, and tight correlation with metrics and logs. Build agent + gateway, add tail-sampling, SpanMetrics, and Exemplars, watch PII and cardinality - and tracing becomes a tool not only for debugging but also for automated SRE/release decisions, reducing MTTR and risk with every release.