Observability and trace sampling

1) Why Observability

Observability (O11y) answers three questions: what is happening, why it is happening, and how to fix it. It relies on four signals:

Metrics (aggregates, react quickly);
Logs (details and forensics);
Traces (cross-cutting cause-effect relationships);
Profiles (CPU/heap/lock contention in production).

The key is correlation between signals plus telemetry economics (sampling, retention, compression).

2) Signal map and principles

2.1 RED/USE

RED (for APIs): Rate (RPS), Errors (share of 5xx and significant 4xx), Duration (p50/p95/p99).
USE (for resources): Utilization, Saturation, Errors (NIC, CPU, disk, queues).

2.2 Product Invariants

Define an SLO (e.g. "p95 latency of '/v1/payments' ≤ 300 ms, error budget 0.5% over 30 days"). Alerts should "scream" only when the SLO is violated or the error budget is burning.
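The error budget in the example above can be turned into simple arithmetic. This is a minimal sketch, assuming a request-based SLO in which a request counts as "bad" when it fails or exceeds the latency target; the function name and the request counts are illustrative, not taken from any real system.

```python
def error_budget_remaining(total: int, bad: int, budget_fraction: float = 0.005) -> float:
    """Share of the error budget still unspent (negative when overspent)."""
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = total * budget_fraction  # e.g. 0.5% of all requests
    return 1.0 - bad / allowed_bad


# Hypothetical 30-day window: 10M requests, 20k of them bad.
remaining = error_budget_remaining(total=10_000_000, bad=20_000)
print(f"{remaining:.0%} of the error budget left")
```

With a 0.5% budget, 10M requests allow 50,000 bad ones, so 20,000 bad requests leave a bit more than half of the budget.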

2.3 Context

Implement W3C Trace Context ('traceparent', 'tracestate') and baggage to safely propagate technical/business attributes (e.g. 'tenant', 'region'; no PII).
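To illustrate what propagation involves, here is a small standard-library sketch that parses a version-00 'traceparent' header, builds the header for an outgoing call, and emits a 'baggage' header from an allowlist so that PII cannot leak by accident. The helper names and the allowlist are assumptions for illustration; in practice the OpenTelemetry SDK propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional
from urllib.parse import quote

# version-traceid-parentid-flags, lowercase hex (version 00 layout)
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

ALLOWED_BAGGAGE_KEYS = {"tenant", "region"}  # hypothetical allowlist, no PII


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is invalid."""
    m = _TRACEPARENT_RE.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid according to the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"


def build_baggage(attrs: dict) -> str:
    """Serialize only allowlisted keys; values are percent-encoded."""
    return ",".join(
        f"{k}={quote(str(v), safe='')}"
        for k, v in sorted(attrs.items())
        if k in ALLOWED_BAGGAGE_KEYS
    )


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(ctx)
baggage = build_baggage({"tenant": "acme", "region": "eu-1", "email": "a@b.c"})
```

The 'email' key is dropped because it is not on the allowlist, which is one simple way to enforce the "no PII" rule at the propagation boundary.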

3) Observability architecture

SDK/auto-instrumentation: OpenTelemetry (OTel) in services (HTTP/gRPC/DB/clients).
OTel Collector as a bus: receive → enrich → sample → export (Prometheus, Tempo/Jaeger, Loki/ELK, ClickHouse).

Storage backends:

Metrics: Prometheus/Mimir/VictoriaMetrics;
Traces: Tempo/Jaeger/Zipkin;
Logs: Loki/ELK/Vector → S3 plus cheap storage;
Profiles: Pyroscope/Parca.
Correlation: service graphs, exemplars, jumping from a p99 graph to a specific trace.

4) Trace sampling: strategies

4.1 Head-based sampling

The decision is made at the start of the trace (in the SDK or at ingress), so it is simple and cheap to implement.
Cons: may miss rare errors and slow requests.

When: high RPS, strict budgets, a predictable share is required (for example, 1-5%).
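A head-based decision is usually derived from the trace ID itself, so that every service makes the same choice for the same trace without coordination. The sketch below shows the idea with plain Python; it is similar in spirit to ratio-based samplers in tracing SDKs, but the function name and the exact bit layout are assumptions, not any library's algorithm.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    threshold = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < threshold


# The same trace ID always produces the same decision.
decision = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.05)
```

Because trace IDs are random, roughly 'ratio' of all traces is kept, which gives the predictable share mentioned above.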

4.2 Tail-based sampling

The decision is made in the Collector after the trace has completed.
Anomalies are guaranteed to be kept: errors, p99 latency, specific routes/tenants.
Cons: requires buffering; more complex and more expensive.

When: "meaningful" traces are needed at moderate cost.

4.3 Combined model

Global head 1-5%, plus tail rules: "always save errors/slow spans," "sample 50% of canary traffic," "save all traces of payment paths in an incident."
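The tail rules can be read as an ordered list of checks applied to a finished trace. The following is a sketch under stated assumptions: 'TraceSummary', the 'deployment' attribute and the 'incident_routes' parameter are hypothetical names, and a real deployment would express the same rules as Collector policies rather than application code.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TraceSummary:
    trace_id: str            # 32 hex characters
    duration_ms: float
    has_error: bool
    route: str = ""
    attributes: Dict[str, str] = field(default_factory=dict)


def keep_trace(
    t: TraceSummary,
    *,
    slow_ms: float = 300.0,
    incident_routes: Tuple[str, ...] = (),
    canary_pct: int = 50,
    default_pct: int = 5,
) -> Tuple[bool, str]:
    """Return (decision, name of the rule that made it)."""
    if t.has_error:
        return True, "error"
    if t.duration_ms >= slow_ms:
        return True, "slow"
    if t.route in incident_routes:  # filled in only while an incident is open
        return True, "incident-route"
    # Deterministic bucket 0..99 derived from the trace ID.
    bucket = int(t.trace_id[-8:], 16) % 100
    if t.attributes.get("deployment") == "canary":
        return bucket < canary_pct, "canary"
    return bucket < default_pct, "default"


payment = TraceSummary("4bf92f3577b34da6a3ce929d0e0e4736", 120.0, False, "/v1/payments")
print(keep_trace(payment, incident_routes=("/v1/payments",)))
```

Ordering matters: errors and slow traces are checked first so that the probabilistic rules can never drop them.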

5) Dynamic sampling and telemetry budget

Budget-aware: keep volume ≤ N traces/min; if exceeded, raise the thresholds (for example, keep only p99.5+ or errors only).
Rules by route/tenant: important endpoints/tenants get a larger share.
Adaptive windows: during bursts, temporarily increase the share of error/slow traces that are kept.
Cardinality reduction: normalize user-agent and IP/ASN, collapse stack traces, mask secrets.
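One way to picture the budget-aware rule is a sampler that raises its "slow" threshold whenever the previous window exceeded the budget and relaxes it again when volume drops. This is only a sketch: the class name, the 1.5x step and the 5000 ms ceiling are arbitrary assumptions, and errors are always kept regardless of budget.

```python
class BudgetAwareSampler:
    def __init__(self, budget_per_window: int, base_slow_ms: float = 300.0,
                 step: float = 1.5, max_slow_ms: float = 5000.0):
        self.budget = budget_per_window
        self.base_slow_ms = base_slow_ms
        self.slow_ms = base_slow_ms   # current latency threshold
        self.step = step
        self.max_slow_ms = max_slow_ms
        self.kept = 0                 # traces kept in the current window

    def keep(self, duration_ms: float, has_error: bool) -> bool:
        decision = has_error or duration_ms >= self.slow_ms
        if decision:
            self.kept += 1
        return decision

    def close_window(self) -> None:
        """Call once per window (e.g. every minute) to adapt the threshold."""
        if self.kept > self.budget:
            # Over budget: keep only even slower traces next window.
            self.slow_ms = min(self.slow_ms * self.step, self.max_slow_ms)
        elif self.kept < self.budget / 2:
            # Well under budget: relax back towards the base threshold.
            self.slow_ms = max(self.slow_ms / self.step, self.base_slow_ms)
        self.kept = 0


sampler = BudgetAwareSampler(budget_per_window=2)
for _ in range(5):
    sampler.keep(400.0, has_error=False)   # burst of slow traces
sampler.close_window()                      # threshold rises from 300 to 450 ms
```

After the burst, a 400 ms trace is no longer kept, while an error trace still is.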

6) Configs (reference examples)

6.1 OpenTelemetry Collector - tail sampling (YAML fragment)

    receivers:
      otlp: { protocols: { http: {}, grpc: {} } }

    processors:
      batch: { send_batch_size: 8192, timeout: 2s }
      tail_sampling:
        decision_wait: 5s
        num_traces: 100000
        expected_new_traces_per_sec: 5000
        policies:
          - name: always-error
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow-endpoints
            type: latency
            latency: { threshold_ms: 300 }  # p95 target
          - name: important-routes
            type: string_attribute
            string_attribute: { key: http.target, values: ["/v1/payments", "/v1/payouts"] }
          - name: tenant-eu1
            type: string_attribute
            string_attribute: { key: tenant, values: ["eu-1"] }
          - name: probabilistic-default
            type: probabilistic
            probabilistic: { sampling_percentage: 5 }

    exporters:
      otlphttp/tempo:
        endpoint: http://tempo:4318
      prometheus:
        endpoint: "0.0.0.0:9464"

    service:
      pipelines:
        traces:
          receivers: [otlp]
          # sample first, then batch only the spans that were kept
          processors: [tail_sampling, batch]
          exporters: [otlphttp/tempo]

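6.2 Prometheus - exemplars (fragment)

In the application, attach exemplars carrying 'trace_id' when recording histograms. In Grafana, clicking an exemplar point on the graph then opens the trace. Prometheus stores exemplars only when it is started with '--enable-feature=exemplar-storage'.

    scrape_configs:
      - job_name: api
        scrape_interval: 10s
        honor_labels: true
        static_configs:
          - targets: ["api:9100"]
        exemplar_limit: 10  # verify that your Prometheus version supports this key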

6.3 Loki - reducing the cost of logs

Use only stable labels ('service', 'env', 'region', 'route_class').
High-cardinality fields (request_id, user_id) go in the payload, with redaction.
Sample "successful" info logs; keep all errors/warnings.

6.4 Jaeger/Tempo - retention and index

Store raw traces for 3-7 days, aggregates/summaries for longer.
Enable parquet/blocks in cheap (S3-compatible) storage and keep indexes compact.

7) Trace modeling

7.1 Naming and Attributes

`service.name`, `service.version`, `deployment.environment`.
`http.method`, `http.route`, `http.target`, `http.status_code`, `net.peer.name`.
Business attributes without PII: 'tenant', 'region', 'payment_provider', 'game_id'.

7.2 Events and links

Span events: important points (start of a DB transaction, retry, circuit open, cache miss).
Links: connect a request to the webhook/event it triggers; useful for EDA and outbox/inbox.

7.3 Exemplars

Add exemplars with 'trace_id' to latency/size histograms: one-click navigation from a metric to a trace.
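Conceptually, an exemplar is just a sample observation (with its trace ID) remembered next to the histogram bucket it fell into. The toy class below illustrates that idea with the standard library only; the class and method names are hypothetical, and a real service would rely on the exemplar support of its Prometheus or OpenTelemetry client library instead.

```python
import bisect
from typing import List, Optional, Tuple


class ExemplarHistogram:
    def __init__(self, bounds_ms: List[float]):
        self.bounds = sorted(bounds_ms)
        # One extra slot for the +Inf bucket. Counts are per bucket here;
        # Prometheus exposes them cumulatively.
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars: List[Optional[Tuple[float, str]]] = [None] * (len(self.bounds) + 1)

    def observe(self, value_ms: float, trace_id: Optional[str] = None) -> None:
        # bisect_left gives "le" semantics: 300 lands in the le=300 bucket.
        i = bisect.bisect_left(self.bounds, value_ms)
        self.counts[i] += 1
        if trace_id is not None:
            self.exemplars[i] = (value_ms, trace_id)  # keep the latest one

    def slowest_exemplar(self) -> Optional[Tuple[float, str]]:
        """Exemplar from the highest bucket that has one (the 'p99 click')."""
        for ex in reversed(self.exemplars):
            if ex is not None:
                return ex
        return None


hist = ExemplarHistogram([50, 100, 200, 300, 500, 1000, 2000])
hist.observe(300, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
hist.observe(2500, trace_id="a3ce929d0e0e47364bf92f3577b34da6")
```

Here 'slowest_exemplar()' returns the 2500 ms observation together with its trace ID, which is what a dashboard would use to link a latency spike to a concrete trace.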

8) Metrics: what and how

8.1 Technical

RED by route/tenant/provider (PSP, KYC).
Pools: `db_connections_in_use`, `http_client_in_flight`, `queue_depth`.
Resilience: retries, timeouts, circuit open/half-open, rate-limit hits.
Go/Java/Python runtime: GC pauses, heap, safepoints, GIL delays.

8.2 Business Metrics

Registrations/logins/deposits/withdrawals, conversion, 3DS/KYC failures, chargeback ratio.
Key indicators: time-to-wallet, payout success rate.

8.3 Cardinality and storage

Histograms with explicit buckets (e.g. '[50,100,200,300,500,1000,2000] ms').
Avoid labels with high cardinality (raw user_id, request_id); move them to logs/traces.
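Keeping labels low-cardinality usually means normalizing raw values before they become label values. A minimal sketch, assuming numeric IDs, UUIDs and long hex tokens are the only variable path segments; the function names and the ':id' placeholder are illustrative choices.

```python
import re

_ID_SEGMENT = re.compile(
    r"^(\d+"                                                 # numeric IDs
    r"|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"   # UUIDs
    r"|[0-9a-fA-F]{16,})$"                                   # long hex tokens
)


def normalize_route(path: str) -> str:
    """Drop the query string and replace ID-like segments with ':id'."""
    path = path.split("?", 1)[0]
    return "/".join(":id" if _ID_SEGMENT.match(seg) else seg for seg in path.split("/"))


def status_class(code: int) -> str:
    """Collapse HTTP status codes into classes such as '2xx' or '5xx'."""
    return f"{code // 100}xx"


route_label = normalize_route("/v1/payments/12345/refunds?page=2")
status_label = status_class(503)
```

The raw path and the exact status code still belong in the span or log line; only the normalized forms become metric labels.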

9) Logs: standards and correlation

Format: JSON + required keys ('timestamp', 'level', 'message', 'trace_id', 'span_id', 'service', 'env').
Redaction: mask PAN, tokens, PII.

Sampling: 100% for 'error/warn', 5-20% for 'info' on "noisy" paths.

Binding to traces is done via 'trace_id': pivot from a log line to its trace and back.
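The log standard above maps naturally onto Python's 'logging' module: a formatter that emits the required JSON keys and masks card numbers, plus a filter that samples info-level records. This is a sketch with assumptions: the class names are hypothetical, 'trace_id'/'span_id' are expected to arrive via the 'extra' argument, and the PAN regex is deliberately crude (it masks any 13-19 digit run).

```python
import json
import logging
import random
import re
from datetime import datetime, timezone

_PAN_RE = re.compile(r"\b\d{13,19}\b")


def mask_pan(text: str) -> str:
    """Replace all but the last four digits of card-like numbers."""
    return _PAN_RE.sub(lambda m: "*" * (len(m.group()) - 4) + m.group()[-4:], text)


class JsonFormatter(logging.Formatter):
    def __init__(self, service: str, env: str):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": mask_pan(record.getMessage()),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "service": self.service,
            "env": self.env,
        })


class InfoSampler(logging.Filter):
    """Keep every warning/error; keep only a share of lower-level records."""

    def __init__(self, info_rate: float = 0.1, rng=random.random):
        super().__init__()
        self.info_rate = info_rate
        self.rng = rng

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng() < self.info_rate


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payments", env="prod"))
handler.addFilter(InfoSampler(info_rate=0.1))
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.error("card 4111111111111111 declined", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Attaching the filter to the handler rather than the logger keeps sampling a property of the output sink, so a second handler (for example, an audit sink) can still receive everything.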

10) Profiling in production

Enable continuous profiling (Pyroscope/Parca) for CPU/heap/alloc/locks.
Correlate p99 peaks with hot stacks; retain profiles for 7-14 days.

11) Alerting on SLO/error budget

SLO alerts: "the error budget is being spent faster than X%/hour" (predictive alerts).
Symptoms, not causes: alert at the client level (RUM/edge or per-route), not on CPU.
Multi-window, multi-burn-rate: 2% of the budget consumed in 1 hour and 5% in 6 hours are two separate conditions.
Silencing during planned degradation: shift thresholds while feature-flag or canary rollouts are in progress.
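The two windows translate into burn-rate thresholds: spending a given share of a 30-day budget within a shorter window means burning it that many times faster than the sustainable pace. The sketch below works through the arithmetic; the function names are illustrative, and treating the two conditions as alternatives (either one pages) is an assumption.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate_threshold(budget_share: float, window_hours: float) -> float:
    """Burn rate at which `budget_share` of the budget is spent in `window_hours`."""
    return budget_share * SLO_WINDOW_HOURS / window_hours


def should_page(err_rate_1h: float, err_rate_6h: float, budget: float = 0.005) -> bool:
    """err_rate_* are observed error ratios; budget is the allowed error ratio."""
    fast = err_rate_1h / budget >= burn_rate_threshold(0.02, 1)   # about 14.4x
    slow = err_rate_6h / budget >= burn_rate_threshold(0.05, 6)   # about 6x
    return fast or slow


# An 8% error rate over the last hour is 16x the sustainable pace for a 0.5% budget.
page = should_page(err_rate_1h=0.08, err_rate_6h=0.01)
```

A burn rate of 1 would spend the budget exactly over 30 days, so the 1-hour condition only fires on sharp regressions while the 6-hour condition catches slower leaks.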

12) Cost and retention

Volume quotas: traces ≤ N TB/month; logs - hot for 3-7 days, cold in S3 for 30-90 days; metrics - downsampling (1 min → 5 min → 1 h).
Tail rules reduce volume by 10-100×, while keeping error/slow traces.
Metrics are the cheapest signal; the most valuable are the "right" traces and profiles.

13) Antipatterns

"100% of traces, always" → an explosion of cost, noise and slowdowns.
Free-format logs without keys/masking.
Metrics with unbounded label values (user_id/ip/full UA).
No 'traceparent'/'baggage' - signals cannot be correlated.
Alerts on CPU/heap instead of SLO - the alert chat "burns" with no benefit.
"Random 1%" sampling without priority for errors/slow requests - valuable cases are lost.

14) Examples of dashboards (skeletons)

API Overview: RPS, error rate by class, latency p95/p99 (exemplars are clickable), top routes.
Release/Canary: comparison of old/new version metrics, outlier rate, open circuits, retries.
PSP/KYC: success rate by provider, latency and failures, correlation with payout errors.
Infra: USE by resources, queue saturation, network drops.

15) Specifics of iGaming/Finance

Critical paths (deposits/withdrawals): 100% tracing only during incidents or in limited windows; in normal mode, tail-sample "everything with an error or long latency."

Region/tenant: add 'tenant', 'jurisdiction', 'brand' to the baggage; build SLOs by jurisdiction.
Anti-fraud/bot filter: metrics and traces of Risk API decisions (allow/deny/challenge), challenge-pass-rate, velocity-hits.
Audit/compliance: keep the minimum necessary, without PII; immutable logs go in a separate, isolated perimeter.

16) Prod Readiness Checklist

End-to-end propagation ('traceparent', 'baggage'), log/metric/trace correlation.
OTel Collector with tail sampling (errors/slow/important routes) + probabilistic default.
RED/USE metrics, explicit buckets, exemplars → jump to trace.
SLO and error budget alerts (two windows).
Telemetry policy and budget; metric downsampling; cold storage for logs.
Standardized JSON logs, redaction of PII/secrets.
Profiling in production enabled; dashboards of "hot" stacks for incidents.
Canary dashboards and version comparison; releases without "blind spots."
Runbook: how to temporarily increase the sampling share during an incident.
Attribute/label naming documentation and a ban on high-cardinality labels.

17) TL;DR

Build observability around correlation: RED/USE metrics → exemplars → traces → logs/profiles. Manage cost through combined sampling: a small head percentage plus tail rules (errors, slow, important routes/tenants). Alert on SLOs and error budget. Keep retention and cardinality under control, and use the OTel Collector as a "central nervous system." For payment/jurisdictional paths: priority telemetry and strict data hygiene.
