Observability: logs, metrics, traces
1) Why you need it
Observability - the ability of a system to answer unplanned questions about its state. It relies on three main signals:
- Metrics - compact aggregates for SLI/SLO and symptom-based alerting.
- Traces - end-to-end request chains.
- Logs - detailed events for investigations and audits.
Purpose: fast RCA, preventive alerts and managed reliability within an error budget.
2) Architecture principles
Single context: propagate `trace_id`, `span_id`, `tenant_id`, `request_id`, `user_agent`, `client_ip_hash` everywhere.
Standards: OpenTelemetry (OTel) for SDKs/agents, canonical JSON log format with a schema.
Symptoms > causes: alert on user-facing symptoms (latency/errors), not on CPU.
Signal linkage: from a metric → to exemplars → to specific traces and logs by `trace_id`.
Security and privacy: PII masking in logs, encryption in transit/at rest, immutable audit logs.
Multi-tenancy: separation of namespaces/keys/policies.
3) Signal taxonomy and schemes
3.1 Metrics
RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for infrastructure.
Types: counter, gauge, histogram/summary. For latency, use a histogram with fixed buckets.
Exemplars: references to `trace_id` in "hot" histogram buckets.
```yaml
name: http_server_duration_seconds
labels: {service, route, method, code, tenant}
type: histogram
buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
exemplar: trace_id
```
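A minimal sketch of registering such a RED latency histogram with the Python `prometheus_client` library (assuming prometheus_client >= 0.12 for exemplar support; the service/tenant values are illustrative):

```python
from prometheus_client import Histogram, start_http_server

# Latency histogram with fixed buckets, matching the schema above.
HTTP_DURATION = Histogram(
    "http_server_duration_seconds",
    "HTTP server request duration in seconds",
    labelnames=["service", "route", "method", "code", "tenant"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)

def observe_request(duration_s: float, route: str, code: int, trace_id: str) -> None:
    # The exemplar links a "hot" bucket observation to a concrete trace;
    # exemplars are only exposed via the OpenMetrics exposition format.
    HTTP_DURATION.labels(
        service="checkout", route=route, method="POST", code=str(code), tenant="t-42"
    ).observe(duration_s, exemplar={"trace_id": trace_id})

start_http_server(8080)  # expose /metrics for a Prometheus-compatible scraper
```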
3.2 Traces
Span = an operation with `name`, `start/end`, `attributes`, `events`, `status`.
W3C Trace Context for interoperability.
Sampling: baseline (head) + dynamic (tail) + "importance" rules (errors, high p95).
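A short sketch of head sampling configured in the OTel Python SDK (a probabilistic, parent-respecting sampler; the 10% ratio is an illustrative value):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: keep ~10% of new traces, but always follow the parent's decision
# so a trace is never half-sampled across services.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
```

Tail rules (errors, slow p99) are then applied on the collector, as in section 8.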
3.3 Logs
Only structured JSON; levels: DEBUG/INFO/WARN/ERROR.
Required fields: `ts_utc`, `level`, `message`, `trace_id`, `span_id`, `tenant_id`, `env`, `service`, `region`, `host`, `labels{}`.
Forbidden: secrets, tokens, PAN, passwords. PII - tokenized/masked only.
```json
{"ts":"2025-10-31T12:05:42.123Z","level":"ERROR","service":"checkout","env":"prod",
 "trace_id":"c03c...","span_id":"9ab1...","tenant_id":"t-42","route":"/pay",
 "code":502,"msg":"payment gateway timeout","retry":true}
```
4) Collection and transport
Agents/exporters (daemonset/sidecar) → buffer → bus/ingest node (TLS/mTLS) → signal store.
Requirements: back-pressure, retries, deduplication, cardinality limits (labels!), protection against "log storms".
Metrics: pull (Prometheus-compatible) or push via OTLP.
Traces: OTLP over gRPC/HTTP, tail sampling on the collector.
Logs: local collection (journal/docker/stdout) → parser → normalizer.
5) Storage and retention (tiered)
Metrics: hot TSDB for 7-30 days (with downsampling), aggregates for longer periods (90-365 days).
Traces: 1-7 days in full, then aggregates/spans of "important" services; index on `service`, `status`, `error`.
Logs: hot index 7-14 days, warm 3-6 months, archive up to 1-7 years (compliance). Audit logs - WORM.
Cost optimization: downsampling, filtering DEBUG out in prod, label quotas, sampling for traces.
6) SLI/SLO, alerting, and on-call
SLI: availability (% successful requests), latency (p95/p99), 5xx share, data freshness, share of successful jobs.
SLO: a target on an SLI (e.g. 99.9% of requests successful and ≤ 400 ms).
Error budget: the 0.1% "margin for error" → drives the rules for feature freezes/experiments (see the burn-rate sketch after the alert examples below).
- `ALERT HighLatency` if `p99(http_server_duration_seconds{route="/pay"}) > 1s` for 5 min.
- `ALERT ErrorRate` if `rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Resource alerts (CPU/disk) - auxiliary only, no paging.
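A minimal sketch of the error-budget arithmetic behind such alerts (the 99.9% SLO is the value from the example above; the observed error ratio is assumed to come from the `ErrorRate` query):

```python
# Error budget for a 99.9% SLO: 0.1% of requests may fail over the SLO window.
SLO = 0.999
ERROR_BUDGET = 1.0 - SLO            # 0.001

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than allowed the budget is being consumed.
    1.0 = budget lasts exactly the SLO window; >> 1.0 = page someone."""
    return observed_error_ratio / ERROR_BUDGET

# Example: 2% errors over the last 5 minutes (the ErrorRate threshold above)
print(burn_rate(0.02))   # 20.0 -> burning the budget ~20x too fast
```

Multi-window burn-rate alerts (a fast window plus a slow one) reduce flapping compared to a single fixed threshold.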
7) Signal correlation
The metric goes "red" → click the exemplar → a specific `trace_id` → find the "slow" span → open the logs by the same `trace_id`.
Correlation with releases: attributes `version`, `image_sha`, `feature_flag`.
For data/ETL: `dataset_urn`, `run_id`, a link to lineage (see the corresponding article).
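A small sketch of attaching such correlation attributes to the current span with the OTel Python SDK (attribute values are illustrative placeholders; assumes a tracer provider is already configured as in section 10):

```python
from opentelemetry import trace

span = trace.get_current_span()
# Release correlation: which build/flag produced this request
span.set_attribute("version", "1.8.3")
span.set_attribute("image_sha", "sha256:...")        # placeholder value
span.set_attribute("feature_flag", "new_checkout")   # illustrative flag name
# Data/ETL correlation: which run and dataset this span belongs to
span.set_attribute("run_id", "run-2025-10-31-01")
span.set_attribute("dataset_urn", "urn:dataset:payments")
```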
8) Sampling and cardinality
Metrics: restrict labels (no `user_id`, `session_id`); quotas/validation at registration.
Traces: combine head sampling (at ingest) and tail sampling (at the collector) with rules such as "everything that is 5xx, a p99 outlier, or an error - keep 100%".
Logs: levels and throttling; for frequently recurring errors, aggregate events by a dedupe key (see the throttling sketch after the config below).
Example tail sampling (conceptually, OTel Collector):

```yaml
processors:
  tail_sampling:
    decision_wait: 2s
    policies:
      - type: status_code
        status_code: ERROR
        rate_allocation: 1.0
      - type: latency
        threshold_ms: 900
        rate_allocation: 1.0
      - type: probabilistic
        hash_seed: 42
        sampling_percentage: 10
```
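For the log throttling mentioned above, a minimal sketch of a dedupe filter that suppresses frequently recurring errors by a dedupe key (the key choice and the 60-second window are illustrative assumptions):

```python
import logging
import time
from collections import defaultdict

class DedupeFilter(logging.Filter):
    """Let the first occurrence of each (level, message) key through,
    then suppress repeats for `window` seconds and count them."""
    def __init__(self, window: float = 60.0):
        super().__init__()
        self.window = window
        self.last_seen: dict[tuple, float] = {}
        self.suppressed = defaultdict(int)

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelname, record.getMessage())    # dedupe key
        now = time.monotonic()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.window:
            self.suppressed[key] += 1
            return False                                 # drop the repeat
        record.suppressed = self.suppressed.pop(key, 0)  # attach the repeat count
        self.last_seen[key] = now
        return True

logging.getLogger("app").addFilter(DedupeFilter())
```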
9) Security and privacy
In transit / at rest: encryption (TLS 1.3, AEAD, KMS/HSM).
PII/secrets: sanitizers before shipment, tokenization, masking.
Access: ABAC/RBAC for reads; separate producer/reader/admin roles.
Audit: an immutable log of access to logs/traces; exports only in encrypted form.
Multi-tenancy: namespaces/tenant-labels with policies; encryption key isolation.
10) Configuration profiles (fragments)
Prometheus (HTTP metrics + alerting):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app-1:8080', 'app-2:8080']
rule_files: ['slo.rules.yaml']
```
slo.rules.yaml (RED example):

```yaml
groups:
  - name: http_slo
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, route))
      - alert: HighLatencyP99
        expr: job:http_request_duration_seconds:p99{route="/pay"} > 1
        for: 5m
```
OpenTelemetry SDK (Python):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout", "service.version": "1.8.3"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("pay", attributes={"route": "/pay", "tenant": "t-42"}):
    pass  # business logic
```
Application logs (stdout JSON):
```python
log.info("gw_timeout", extra={"route": "/pay", "code": 502, "trace_id": get_trace_id()})
```
11) Data/ETL and streaming
SLIs for data: freshness (max lag), completeness (rows vs. expectation), quality (validators/duplicates).
Alerts: missed windows, consumer lag, DLQ growth.
Correlation: `run_id`, `dataset_urn`, lineage events; traces for pipelines (a span per batch/partition).
Kafka/NATS: producer/consumer metrics, lag/failures; trace propagation via headers (including `traceparent`).
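A minimal sketch of propagating the W3C `traceparent` through message headers with the OTel Python propagation API (the `producer.send` / `message.headers` calls are placeholders for whatever Kafka/NATS client is in use):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("etl")

# Producer side: copy the current trace context into the message headers.
def publish(producer, topic: str, value: bytes) -> None:
    carrier: dict[str, str] = {}
    inject(carrier)                      # fills in 'traceparent' (and 'tracestate')
    headers = [(k, v.encode()) for k, v in carrier.items()]
    producer.send(topic, value=value, headers=headers)   # placeholder client call

# Consumer side: continue the same trace for the processing span.
def handle(message) -> None:
    carrier = {k: v.decode() for k, v in (message.headers or [])}
    ctx = extract(carrier)
    with tracer.start_as_current_span("process_batch", context=ctx):
        ...  # batch/partition processing
```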
12) Profiling and eBPF (additional signal)
Low-level hot paths (CPU/alloc/IO); profiles captured per incident.
eBPF telemetry (network latency, DNS, system calls) tied to `trace_id`/PID.
13) Observability testing
Signal contract: CI checks that the expected metrics/labels/histograms are exported.
Synthetic probes: RUM scenarios/simulated clients for external SLIs.
Chaos/fire drills: disable dependencies, induce degradation - watch how alerts and on-call engineers react.
Smoke in prod: a post-deploy check that new endpoints emit metrics and traces.
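A minimal sketch of such a signal-contract check in CI, pytest-style (assumes the service under test exposes a Prometheus-format /metrics endpoint at the URL shown; uses the parser shipped with `prometheus_client`):

```python
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

EXPECTED = {"http_server_duration_seconds"}          # contract: metrics that must exist
EXPECTED_LABELS = {"service", "route", "method", "code", "tenant"}

def test_metrics_contract():
    body = urllib.request.urlopen("http://localhost:8080/metrics").read().decode()
    families = {f.name: f for f in text_string_to_metric_families(body)}
    missing = EXPECTED - families.keys()
    assert not missing, f"missing metrics: {missing}"
    # Every exported sample of the latency histogram must carry the agreed labels.
    for sample in families["http_server_duration_seconds"].samples:
        assert EXPECTED_LABELS <= sample.labels.keys(), sample
```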
14) Cost and volume control
Budgets per signal/team; a "cost per signal" dashboard.
Cardinality kept within budget (an SLO for cardinality), limits on new labels.
Downsampling, tiering by data class, cold archives, and WORM for audit.
15) Operation and SLO of the observability platform
Platform SLO: 99.9% of ingests successful; metrics indexed within ≤ 30 s, logs ≤ 2 min, traces ≤ 1 min.
Platform alerts: ingestion lag, growth in drops, signing/encryption errors, buffer overflow.
DR/HA: multi-region, replication, backups of configs/rules.
16) Checklists
Before production:
- `trace_id`/`span_id` propagated everywhere; JSON logs with a schema.
- RED/USE metrics with histograms; exemplars linked to traces.
- Tail sampling enabled; 5xx/p99 rules = 100%.
- Symptom-based alerts + runbooks; quiet hours/anti-flap.
- PII sanitizers; encryption at rest/in transit; WORM for audit.
- Retention policies and budgets for volumes/cardinality.
Regularly:
- Monthly alert review (noise/precision), threshold tuning.
- Error budget report and actions taken (feature freeze, hardening).
- Check dashboard/log/trace coverage of critical paths.
- Incident drills and runbook updates.
17) Runbooks
RCA: p99 spike on /pay
1. Open the RED dashboard for `checkout`.
2. Jump via an exemplar → a slow trace → find the bottleneck span (e.g. `gateway.call`).
3. Open the logs by `trace_id` → look at timeouts/retries.
4. Roll back the feature / enable an RPS limit, notify the dependency owners.
5. After stabilization: RCA, optimization tickets, a regression test for the scenario.
Data freshness degradation (ETL)
1. The "freshness" SLI is red → trace the job run → find the failing step.
2. Check broker/DLQ logs and connector errors.
3. Start reprocessing, notify consumers (BI/product) via the status channel.
18) Common mistakes
Logs without a schema and without `trace_id`. Investigations take several times longer.
Alerts on infrastructure instead of symptoms. Paging misses the real problems.
Unbounded metric cardinality. Cost explosion and instability.
Keeping 100% of traces. Expensive and unnecessary; enable smart sampling.
PII/secrets in the logs. Enable sanitizers and redaction lists.
"Silent" features. New code shipped without metrics/traces/logs.
19) FAQ
Q: Do I need to store the raw text of the logs?
A: Yes, but with retention and archives; for alerts and SLOs, aggregates are sufficient. Audit - in WORM.
Q: What to choose for traces - head or tail sampling?
A: Combine: probabilistic head sampling as the baseline + tail rules for errors and anomalies.
Q: How do I link user metrics and technical metrics?
A: Through a shared `trace_id` and business labels (`route`, `tenant`, `plan`), and through product events (conversions) correlated to traces.
Q: How not to drown in alerts?
A: Alert on symptoms, introduce quiet hours, deduplication, grouping, SLO-based prioritization, and a default owner for every alert.
- "Audit and immutable logs"
- "In Transit/At Rest Encryption"
- "Secret Management"
- "Data Origin (Lineage)"
- «Privacy by Design (GDPR)»
- "Webhook Delivery Guarantees"