Observability and trace sampling
1) Why Observability
Observability (O11y) answers three questions: what is happening, why it is happening, and how to fix it. It relies on four signals:
- Metrics (aggregates, react quickly);
- Logs (details and forensics);
- Traces (end-to-end cause-and-effect relationships);
- Profiles (CPU/heap/lock contention in production).
The key is correlation between signals plus telemetry economics (sampling, retention, compression).
2) Signal map and principles
2.1 RED/USE
RED (for APIs): Rate (RPS), Errors (share of 5xx and significant 4xx), Duration (p50/p95/p99).
USE (for resources): Utilization, Saturation, Errors (NIC, CPU, disk, queues).
2.2 Product invariants
Define an SLO (e.g. "p95 latency of `/v1/payments` ≤ 300 ms, error budget 0.5% over 30 days"). Alerts should fire only when the SLO is violated or the error budget is burning.
2.3 Context
Implement W3C Trace Context (`traceparent`, `tracestate`) and baggage to safely propagate technical/business attributes (e.g. `tenant`, `region`; no PII).
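To make the propagation step concrete, here is a minimal sketch of reading and re-emitting a `traceparent` header. It assumes version `00` of the W3C format only (`version-traceid-parentid-flags`, lowercase hex); the names `TraceParent`, `parse_traceparent` and `child_traceparent` are illustrative, and in practice an OTel SDK propagator would do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

The trace ID stays constant across hops while the parent ID changes per span, which is what lets logs, metrics and traces be joined on `trace_id` later.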
3) Observability architecture
SDK/auto-instrumentation: OpenTelemetry (OTel) in services (HTTP/gRPC/DB/clients).
OTel Collector as a bus: receive → enrich → sample → export (Prometheus, Tempo/Jaeger, Loki/ELK, ClickHouse).
- Metrics: Prometheus/Mimir/VictoriaMetrics;
- Traces: Tempo/Jaeger/Zipkin;
- Logs: Loki/ELK/Vector → S3 + cheap storage;
- Profiles: Pyroscope/Parca.
- Correlation: service graphs, exemplars, jumping from a p99 graph to a specific trace.
4) Trace sampling: strategies
4.1 Head-based sampling
The decision is made when the trace starts. Pros: simple and cheap to implement (in the SDK or at ingress).
Cons: may miss rare errors and slow requests.
When: high RPS, strict budgets, a predictable share is required (for example, 1-5%).
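A head-based decision can be sketched as a pure function of the trace ID. This is an illustration under the assumption that trace IDs are uniformly random 128-bit hex strings; `head_sample` and `observed_share` are hypothetical names, not an SDK API.

```python
import secrets


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at trace start, using only the trace ID.

    The low 64 bits of the ID are treated as a uniform random number, so every
    service that sees the same trace ID reaches the same decision.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64 < int(ratio * (1 << 64))


def observed_share(ratio: float, n: int = 20000) -> float:
    """Fraction of n random trace IDs that would be kept at the given ratio."""
    kept = sum(head_sample(secrets.token_hex(16), ratio) for _ in range(n))
    return kept / n
```

Because the decision ignores what happens later in the request, the kept share is predictable, but an error that occurs in an unsampled trace is lost.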
4.2 Tail-based sampling
The decision is made in the Collector after the whole trace has been collected, not at its start.
Pros: anomalies are reliably kept: errors, p99 latency, specific routes/tenants.
Cons: requires buffering; harder and more expensive to operate.
When: "meaningful" traces are needed at moderate cost.
4.3 Combined model
Global head sampling at 1-5%, plus tail rules: "always keep error/slow traces", "sample 50% of canary traffic", "keep all traces of payment paths during an incident".
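The tail rules above can be sketched as one decision function over a finished trace. This is a simplified illustration: the `Span` fields, the default thresholds, and `keep_trace` itself are assumptions made for the example, not the Collector's implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    route: str
    duration_ms: float
    is_error: bool
    canary: bool = False


def keep_trace(spans: List[Span], *, slow_ms: float = 300.0,
               canary_share: float = 0.5, default_share: float = 0.05,
               rng: Optional[random.Random] = None) -> bool:
    """Tail decision: rules are checked in priority order on the whole trace."""
    if rng is None:
        rng = random.Random()
    if any(s.is_error for s in spans):
        return True  # always keep errors
    if any(s.duration_ms > slow_ms for s in spans):
        return True  # always keep slow traces
    if any(s.canary for s in spans):
        return rng.random() < canary_share  # e.g. 50% of canary traffic
    return rng.random() < default_share  # probabilistic default for the rest
```

The ordering matters: deterministic "always keep" rules come first, so the probabilistic fallback only ever applies to unremarkable traces.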
5) Dynamic sampling and telemetry budget
Budget-aware: keep volume ≤ N traces/min; if exceeded, raise the thresholds (for example, keep only p99.5+ or errors only).
Rules by route/tenant: important endpoints/tenants get a larger share.
Adaptive windows: during bursts, temporarily raise the share of error/slow traces kept.
Cardinality reduction: normalize user-agent and IP/ASN, collapse stack traces, mask secrets.
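One way to picture the budget-aware part is a controller that recomputes the keep probability once per minute. This is a sketch under simple assumptions (one global budget, proportional adjustment); `BudgetSampler` and its parameters are hypothetical, and it covers only the probabilistic share, not threshold raising.

```python
class BudgetSampler:
    """Adjusts the keep probability per window so kept traces stay near a budget."""

    def __init__(self, budget_per_min: int, min_rate: float = 0.001):
        self.budget_per_min = budget_per_min
        self.min_rate = min_rate  # never go fully blind
        self.rate = 1.0  # start by keeping everything

    def end_of_minute(self, seen: int) -> float:
        """Given the traces seen in the last minute, set the rate for the next."""
        if seen <= 0:
            self.rate = 1.0
        else:
            target = self.budget_per_min / seen
            self.rate = max(self.min_rate, min(1.0, target))
        return self.rate
```

During a traffic burst the rate drops so the kept volume stays close to the budget; when traffic falls back below the budget the rate returns to 1.0.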
6) Configs (references)
6.1 OpenTelemetry Collector - tail sampling (YAML fragment)
```yaml
receivers:
  otlp: { protocols: { http: {}, grpc: {} } }
processors:
  batch: { send_batch_size: 8192, timeout: 2s }
  tail_sampling:
    decision_wait: 5s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: always-error
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-endpoints
        type: latency
        latency: { threshold_ms: 300 }  # p95 target
      - name: important-routes
        type: string_attribute
        string_attribute: { key: http.target, values: ["/v1/payments", "/v1/payouts"] }
      - name: tenant-eu1
        type: string_attribute
        string_attribute: { key: tenant, values: ["eu-1"] }
      - name: probabilistic-default
        type: probabilistic
        probabilistic: { sampling_percentage: 5.0 }
exporters:
  otlphttp/tempo: { endpoint: "http://tempo:4318" }
  prometheus: { endpoint: "0.0.0.0:9464" }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]  # sample first, then batch what is kept
      exporters: [otlphttp/tempo]
```
6.2 Prometheus - exemplars (fragment)
In the application, when recording histograms, add exemplars with `trace_id`. In Grafana, clicking an exemplar point then opens the trace.
```yaml
scrape_configs:
  - job_name: api
    scrape_interval: 10s
    honor_labels: true
    static_configs: [{ targets: ["api:9100"] }]
    exemplar_limit: 10  # kept from the source; check that your Prometheus version accepts this key
```
6.3 Loki - reducing the cost of logs
Use only stable labels (`service`, `env`, `region`, `route_class`).
High-cardinality fields (`request_id`, `user_id`) go in the payload, with redaction.
Sample "successful" info logs; keep all errors/warnings.
6.4 Jaeger/Tempo - retention and index
Store raw traces for 3-7 days, aggregates/summaries for longer.
Keep Parquet blocks in cheap S3-compatible storage and keep indexes compact.
7) Trace modeling
7.1 Naming and attributes
`service.name`, `service.version`, `deployment.environment`.
`http.method`, `http.route`, `http.target`, `http.status_code`, `net.peer.name`.
Business attributes without PII: `tenant`, `region`, `payment_provider`, `game_id`.
7.2 Events and links
Span events: important points (start of a DB transaction, retry, circuit open, cache miss).
Links: connect a request to the webhook/event it triggers; useful for EDA and outbox/inbox.
7.3 Exemplars
Add exemplars with `trace_id` to latency/size histograms: navigation from metric to trace in one click.
8) Metrics: what and how
8.1 Technical
RED by route/tenant/provider (PSP, KYC).
Pools: `db_connections_in_use`, `http_client_in_flight`, `queue_depth`.
Resilience: retries, timeouts, circuit open/half-open, rate-limit hits.
Go/Java/Python runtime: GC pauses, heap, safepoints, GIL delays.
8.2 Business metrics
Registrations/logins/deposits/withdrawals, conversion, 3DS/KYC failures, chargeback ratio.
Key indicators: time-to-wallet, payout success rate.
8.3 Cardinality and storage
Histograms with explicit buckets (e.g. `[50, 100, 200, 300, 500, 1000, 2000]` ms).
Avoid high-cardinality labels (raw `user_id`, `request_id`); move them to logs/traces.
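To illustrate how explicit buckets and exemplars fit together, here is a small stand-alone sketch. It is not a Prometheus client: `LatencyHistogram` is a hypothetical class, counts are per-bucket rather than cumulative, and only the latest `trace_id` per bucket is remembered.

```python
import bisect
from typing import Dict, List, Optional

BUCKETS_MS = [50, 100, 200, 300, 500, 1000, 2000]  # upper bounds from the text


class LatencyHistogram:
    def __init__(self, bounds: List[float]):
        self.bounds = sorted(bounds)
        # One extra slot at the end acts as the +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars: Dict[int, str] = {}  # bucket index -> latest trace_id

    def observe(self, value_ms: float, trace_id: Optional[str] = None) -> int:
        """Record one observation and return the index of its bucket."""
        # bisect_left finds the first bound >= value, i.e. the "le" bucket.
        idx = bisect.bisect_left(self.bounds, value_ms)
        self.counts[idx] += 1
        if trace_id is not None:
            self.exemplars[idx] = trace_id  # high-cardinality ID lives here, not in a label
        return idx
```

The point of the design is that `trace_id` never becomes a label: the number of series is fixed by the bucket list, while each bucket still carries a pointer to one concrete trace.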
9) Logs: standards and correlation
Format: JSON with required keys (`timestamp`, `level`, `message`, `trace_id`, `span_id`, `service`, `env`).
Redaction: mask PANs, tokens, PII.
Sampling: 100% for `error`/`warn`, 5-20% for `info` on "noisy" paths.
Link to traces via `trace_id`: pivot from a log line to its trace and back.
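The log rules above might look roughly like this in code. It is a sketch: the PAN pattern is a deliberately naive assumption (any run of 13-19 digits with optional spaces or dashes), and `mask_pan`, `should_emit` and `log_line` are illustrative helpers rather than part of any logging library.

```python
import json
import random
import re
from datetime import datetime, timezone

# Naive, hypothetical pattern: 13-19 digits, optionally separated by space or dash.
_PAN_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def mask_pan(text: str) -> str:
    """Replace anything that looks like a card number, keeping the last 4 digits."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]
    return _PAN_RE.sub(_mask, text)


def should_emit(level: str, info_share: float = 0.1, rng=random.random) -> bool:
    """Keep every error/warn line; sample the rest at info_share."""
    if level in ("error", "warn"):
        return True
    return rng() < info_share


def log_line(level, message, trace_id, span_id, service, env) -> str:
    """Serialize one record with the required keys and a redacted message."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": mask_pan(message),
        "trace_id": trace_id,
        "span_id": span_id,
        "service": service,
        "env": env,
    }
    return json.dumps(record)
```

Redaction runs inside the formatter so that no caller can forget it; a production setup would likely add token and PII patterns alongside the PAN rule.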
10) Profiling in production
Enable continuous profiling (Pyroscope/Parca) for CPU/heap/alloc/locks.
Correlate p99 peaks with hot stacks; keep for 7-14 days.
11) Alerting on SLO/error budget
SLO alerts: "the error budget is being spent faster than X%/hour" (predictive alerts).
Symptoms, not causes: alert at the client level (RUM/edge or per-route), not on CPU.
Multi-window, multi-burn-rate: 2% of the budget burned in 1 hour and 5% in 6 hours, as two separate conditions.
Silencing during planned degradation: shift thresholds while feature flags/canaries are rolling out.
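The two burn-rate conditions can be worked through numerically. The sketch below assumes a 30-day SLO period and the 0.5% error budget used as the SLO example earlier; the function names are illustrative. Burn rate is the observed error ratio divided by the budgeted error ratio, and the threshold for "X of the budget in W hours" is X × period / W.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period (assumption taken from the SLO example)


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate at which budget_fraction of the budget is spent in window_hours."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def burn_rate(error_ratio: float, error_budget: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget


def should_page(err_1h: float, err_6h: float, error_budget: float = 0.005) -> bool:
    """Two independent conditions: a fast 1-hour burn or a slower 6-hour burn."""
    fast = burn_rate(err_1h, error_budget) >= burn_rate_threshold(0.02, 1)
    slow = burn_rate(err_6h, error_budget) >= burn_rate_threshold(0.05, 6)
    return fast or slow
```

With these assumptions the thresholds come out at a burn rate of 14.4 for the 1-hour window and 6 for the 6-hour window, so with a 0.5% budget an 8% error ratio over one hour pages, while a steady 1% does not.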
12) Cost and retention
Volume quotas: traces ≤ N TB/month; logs - hot for 3-7 days, cold in S3 for 30-90 days; metrics - downsampling (1 min → 5 min → 1 h).
Tail rules reduce volume by 10-100x while keeping error/slow traces.
The cheapest signal is metrics; the most valuable are well-chosen traces and profiles.
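A rough back-of-the-envelope calculation shows how a keep share maps to a monthly trace quota. Every input here (traces per second, spans per trace, bytes per span) is a hypothetical figure to be replaced with your own measurements; the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, matching a monthly quota


def monthly_trace_tb(traces_per_sec: float, spans_per_trace: float,
                     bytes_per_span: float, keep_share: float) -> float:
    """Stored trace volume per month, in decimal terabytes."""
    bytes_per_month = (traces_per_sec * spans_per_trace * bytes_per_span
                       * keep_share * SECONDS_PER_MONTH)
    return bytes_per_month / 1e12


def max_keep_share(quota_tb: float, traces_per_sec: float,
                   spans_per_trace: float, bytes_per_span: float) -> float:
    """Largest keep share (capped at 1.0) that fits the monthly quota."""
    full = monthly_trace_tb(traces_per_sec, spans_per_trace, bytes_per_span, 1.0)
    return min(1.0, quota_tb / full) if full > 0 else 1.0
```

With the hypothetical inputs of 5000 traces/s, 10 spans per trace and 500 bytes per span, keeping everything comes to about 64.8 TB per month, and keeping 5% to about 3.24 TB, which is the order of reduction the tail rules aim for.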
13) Antipatterns
"100% of traces, always" → exploding cost, noise, and slowdowns.
Free-form logs without keys or masking.
Metrics with unbounded label cardinality (user_id/ip/full UA).
No `traceparent`/`baggage` - signals cannot be correlated.
Alerts on CPU/heap instead of SLO - the alert chat "burns" to no benefit.
"Random 1%" sampling with no priority for errors/slow requests - valuable cases are lost.
14) Examples of dashboards (skeletons)
API Overview: RPS, error rate by class, latency p95/p99 (exemplars are clickable), top routes.
Release/Canary: comparison of old/new version metrics, outlier-rate, open-circuits, retries.
PSP/KYC: success rate by provider, latency and failures, correlation with payout errors.
Infra: USE by resources, queue saturation, network drops.
15) Specifics of iGaming/Finance
Critical paths (deposits/withdrawals): 100% tracing only during incidents or for limited windows; in normal mode - tail rule "everything with an error or high latency".
Region/tenant: add `tenant`, `jurisdiction`, `brand` to the baggage; build SLOs by jurisdiction.
Anti-fraud/bot filter: metrics and traces of Risk API decisions (allow/deny/challenge), challenge-pass-rate, velocity-hits.
Audit/compliance: keep the minimum necessary, without PII; immutable audit logs go in a separate, isolated circuit.
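As an illustration of "baggage without PII", the sketch below serializes only allowlisted keys into a W3C `baggage` header value (comma-separated `key=value` members with percent-encoded values). The allowlist contents and the helper names are assumptions for the example; an OTel SDK provides its own baggage API.

```python
from urllib.parse import quote, unquote

# Hypothetical allowlist: only low-cardinality, non-PII business attributes.
ALLOWED_KEYS = {"tenant", "jurisdiction", "brand", "region"}


def build_baggage(attrs: dict) -> str:
    """Serialize allowlisted attributes as a baggage header value."""
    items = []
    for key in sorted(attrs):
        if key not in ALLOWED_KEYS:
            continue  # drop anything not explicitly allowed, e.g. user_id or email
        items.append(f"{key}={quote(str(attrs[key]), safe='')}")
    return ",".join(items)


def parse_baggage(header: str) -> dict:
    """Parse a baggage header value; member properties after ';' are ignored."""
    result = {}
    for member in header.split(","):
        pair = member.split(";", 1)[0].strip()
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        result[key.strip()] = unquote(value.strip())
    return result
```

An allowlist is used instead of a denylist so that a new attribute has to be approved before it can leave the service, which fits the "minimum necessary" rule above.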
16) Prod Readiness Checklist
- End-to-end propagation (`traceparent`, `baggage`), log/metric/trace correlation.
- OTel Collector with tail sampling (errors/slow/important routes) + probabilistic default.
- RED/USE metrics, explicit buckets, exemplars → jump to trace.
- SLOs and error budget alerts (two windows).
- Telemetry policy and budget; metric downsampling; cold storage for logs.
- Standardized JSON logs, redaction of PII/secrets.
- Production profiling enabled; "hot stack" dashboards ready for incidents.
- Canary dashboards and version comparison; releases without "blind spots".
- Runbook: how to temporarily raise the sampling share during an incident.
- Documentation of attribute/label naming and a ban on high-cardinality labels.
17) TL;DR
Build observability around correlation: RED/USE metrics → exemplars → traces → logs/profiles. Manage cost through combined sampling: a small head percentage plus tail rules (errors, slow, important routes/tenants). Alert on SLOs and the error budget. Keep retention and cardinality under control, and use the OTel Collector as a "central nervous system". For payment/jurisdictional paths - priority telemetry and strict data hygiene.