Observability: logs, metrics, traces
1) Why you need it
Observability - the ability of a system to answer unplanned questions about its state. It relies on three main signals:
- Metrics - compact aggregates for SLI/SLO and symptom-based alerting.
- Traces - end-to-end request chains.
- Logs - detailed events for investigations and audits.
Purpose: fast RCA, preventive alerts and managed reliability within an error budget.
2) Architecture principles
Single context: propagate `trace_id`, `span_id`, `tenant_id`, `request_id`, `user_agent`, `client_ip_hash` everywhere.
Standards: OpenTelemetry (OTel) for SDKs/agents, canonical JSON log format with a schema.
Symptoms > causes: alert on user-facing symptoms (latency/errors), not on CPU.
Signal linkage: from a metric → to exemplars → to specific traces and logs by `trace_id`.
Security and privacy: PII masking in logs, encryption in transit/at rest, immutable audit logs.
Multi-tenancy: separation of namespaces/keys/policies.
3) Signal taxonomy and schemes
3.1 Metrics
RED (Rate, Errors, Duration) for services and USE (Utilization, Saturation, Errors) for infrastructure.
Types: counter, gauge, histogram/summary. For latency, use a histogram with fixed buckets.
Exemplars: references to `trace_id` in "hot" histogram buckets.
```yaml
name: http_server_duration_seconds
labels: {service, route, method, code, tenant}
type: histogram
buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
exemplar: trace_id
```
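A minimal sketch of registering such a RED latency histogram with the Python `prometheus_client` library (assuming prometheus_client >= 0.12 for exemplar support; the service/tenant values are illustrative):

```python
from prometheus_client import Histogram, start_http_server

# Latency histogram with fixed buckets, matching the schema above.
HTTP_DURATION = Histogram(
    "http_server_duration_seconds",
    "HTTP server request duration in seconds",
    labelnames=["service", "route", "method", "code", "tenant"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)

def observe_request(duration_s: float, route: str, code: int, trace_id: str) -> None:
    # The exemplar links a "hot" bucket observation to a concrete trace;
    # exemplars are only exposed via the OpenMetrics exposition format.
    HTTP_DURATION.labels(
        service="checkout", route=route, method="POST", code=str(code), tenant="t-42"
    ).observe(duration_s, exemplar={"trace_id": trace_id})

start_http_server(8080)  # expose /metrics for a Prometheus-compatible scraper
```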
3.2 Traces
Span = an operation with `name`, `start/end`, `attributes`, `events`, `status`.
W3C Trace Context for interoperability.
Sampling: baseline (head) + dynamic (tail) + "importance" rules (errors, high p95).
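A short sketch of head sampling configured in the OTel Python SDK (a probabilistic, parent-respecting sampler; the 10% ratio is an illustrative value):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: keep ~10% of new traces, but always follow the parent's decision
# so a trace is never half-sampled across services.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
```

Tail rules (errors, slow p99) are then applied on the collector, as in section 8.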
3.3 Logs
Only structured JSON; levels: DEBUG/INFO/WARN/ERROR.
Required fields: `ts_utc`, `level`, `message`, `trace_id`, `span_id`, `tenant_id`, `env`, `service`, `region`, `host`, `labels{}`.
Forbidden: secrets, tokens, PAN, passwords. PII - tokenized/masked only.
```json
{"ts":"2025-10-31T12:05:42.123Z","level":"ERROR","service":"checkout","env":"prod",
 "trace_id":"c03c...","span_id":"9ab1...","tenant_id":"t-42","route":"/pay",
 "code":502,"msg":"payment gateway timeout","retry":true}
```
4) Collection and transport
Agents/exporters (daemonset/sidecar) → buffer → bus/ingest node (TLS/mTLS) → signal store.
Requirements: back-pressure, retries, deduplication, cardinality limits (labels!), protection against "log storms".
Metrics: pull (Prometheus-compatible) or push via OTLP.
Traces: OTLP over gRPC/HTTP, tail sampling on the collector.
Logs: local collection (journal/docker/stdout) → parser → normalizer.
5) Storage and retention (tiered)
Metrics: hot TSDB for 7-30 days (with downsampling), aggregates for longer periods (90-365 days).
Traces: 1-7 days in full, then aggregates/spans of "important" services; index on `service`, `status`, `error`.
Logs: hot index 7-14 days, warm 3-6 months, archive up to 1-7 years (compliance). Audit logs - WORM.
Cost optimization: downsampling, filtering DEBUG out in prod, label quotas, sampling for traces.
6) SLI/SLO, alerting, and on-call
SLI: availability (% successful requests), latency (p95/p99), 5xx share, data freshness, share of successful jobs.
SLO: a target on an SLI (e.g. 99.9% of requests successful and ≤ 400 ms).
Error budget: the 0.1% "margin for error" → drives the rules for feature freezes/experiments (see the burn-rate sketch after the alert examples below).
- `ALERT HighLatency` if `p99(http_server_duration_seconds{route="/pay"}) > 1s` for 5 min.
- `ALERT ErrorRate` if `rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.02`.
- Resource alerts (CPU/disk) - auxiliary only, no paging.
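A minimal sketch of the error-budget arithmetic behind such alerts (the 99.9% SLO is the value from the example above; the observed error ratio is assumed to come from the `ErrorRate` query):

```python
# Error budget for a 99.9% SLO: 0.1% of requests may fail over the SLO window.
SLO = 0.999
ERROR_BUDGET = 1.0 - SLO            # 0.001

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than allowed the budget is being consumed.
    1.0 = budget lasts exactly the SLO window; >> 1.0 = page someone."""
    return observed_error_ratio / ERROR_BUDGET

# Example: 2% errors over the last 5 minutes (the ErrorRate threshold above)
print(burn_rate(0.02))   # 20.0 -> burning the budget ~20x too fast
```

Multi-window burn-rate alerts (a fast window plus a slow one) reduce flapping compared to a single fixed threshold.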
7) Signal correlation
The metric goes "red" → click the exemplar → a specific `trace_id` → find the "slow" span → open the logs by the same `trace_id`.
Correlation with releases: attributes `version`, `image_sha`, `feature_flag`.
For data/ETL: `dataset_urn`, `run_id`, a link to lineage (see the corresponding article).
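A small sketch of attaching such correlation attributes to the current span with the OTel Python SDK (attribute values are illustrative placeholders; assumes a tracer provider is already configured as in section 10):

```python
from opentelemetry import trace

span = trace.get_current_span()
# Release correlation: which build/flag produced this request
span.set_attribute("version", "1.8.3")
span.set_attribute("image_sha", "sha256:...")        # placeholder value
span.set_attribute("feature_flag", "new_checkout")   # illustrative flag name
# Data/ETL correlation: which run and dataset this span belongs to
span.set_attribute("run_id", "run-2025-10-31-01")
span.set_attribute("dataset_urn", "urn:dataset:payments")
```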
8) Sampling and cardinality
Metrics: restrict labels (no `user_id`, `session_id`); quotas/validation at registration.
Traces: combine head sampling (at ingest) and tail sampling (at the collector) with rules such as "everything that is 5xx, a p99 outlier, or an error - keep 100%".
Logs: levels and throttling; for frequently recurring errors, aggregate events by a dedupe key (see the throttling sketch after the config below).
Example tail sampling (conceptually, OTel Collector):

```yaml
processors:
  tail_sampling:
    decision_wait: 2s
    policies:
      - type: status_code
        status_code: ERROR
        rate_allocation: 1.0
      - type: latency
        threshold_ms: 900
        rate_allocation: 1.0
      - type: probabilistic
        hash_seed: 42
        sampling_percentage: 10
```
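For the log throttling mentioned above, a minimal sketch of a dedupe filter that suppresses frequently recurring errors by a dedupe key (the key choice and the 60-second window are illustrative assumptions):

```python
import logging
import time
from collections import defaultdict

class DedupeFilter(logging.Filter):
    """Let the first occurrence of each (level, message) key through,
    then suppress repeats for `window` seconds and count them."""
    def __init__(self, window: float = 60.0):
        super().__init__()
        self.window = window
        self.last_seen: dict[tuple, float] = {}
        self.suppressed = defaultdict(int)

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelname, record.getMessage())    # dedupe key
        now = time.monotonic()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.window:
            self.suppressed[key] += 1
            return False                                 # drop the repeat
        record.suppressed = self.suppressed.pop(key, 0)  # attach the repeat count
        self.last_seen[key] = now
        return True

logging.getLogger("app").addFilter(DedupeFilter())
```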
9) Security and privacy
In transit / at rest: encryption (TLS 1.3, AEAD, KMS/HSM).
PII/secrets: sanitizers before shipment, tokenization, masking.
Access: ABAC/RBAC for reads; separate producer/reader/admin roles.
Audit: an immutable log of access to logs/traces; exports only in encrypted form.
Multi-tenancy: namespaces/tenant-labels with policies; encryption key isolation.
10) Configuration profiles (fragments)
Prometheus (HTTP metrics + alerting):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app-1:8080', 'app-2:8080']
rule_files: ['slo.rules.yaml']
```
slo.rules.yaml (RED example):

```yaml
groups:
  - name: http_slo
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, route))
      - alert: HighLatencyP99
        expr: job:http_request_duration_seconds:p99{route="/pay"} > 1
        for: 5m
```
OpenTelemetry SDK (Python):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout", "service.version": "1.8.3"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("pay", attributes={"route": "/pay", "tenant": "t-42"}):
    pass  # business logic
```
Application logs (stdout JSON):
```python
log.info("gw_timeout", extra={"route": "/pay", "code": 502, "trace_id": get_trace_id()})
```
11) Data/ETL and streaming
SLIs for data: freshness (max lag), completeness (rows vs. expectation), quality (validators/duplicates).
Alerts: missed windows, consumer lag, DLQ growth.
Correlation: `run_id`, `dataset_urn`, lineage events; traces for pipelines (a span per batch/partition).
Kafka/NATS: producer/consumer metrics, lag/failures; trace propagation via headers (including `traceparent`).
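A minimal sketch of propagating the W3C `traceparent` through message headers with the OTel Python propagation API (the `producer.send` / `message.headers` calls are placeholders for whatever Kafka/NATS client is in use):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("etl")

# Producer side: copy the current trace context into the message headers.
def publish(producer, topic: str, value: bytes) -> None:
    carrier: dict[str, str] = {}
    inject(carrier)                      # fills in 'traceparent' (and 'tracestate')
    headers = [(k, v.encode()) for k, v in carrier.items()]
    producer.send(topic, value=value, headers=headers)   # placeholder client call

# Consumer side: continue the same trace for the processing span.
def handle(message) -> None:
    carrier = {k: v.decode() for k, v in (message.headers or [])}
    ctx = extract(carrier)
    with tracer.start_as_current_span("process_batch", context=ctx):
        ...  # batch/partition processing
```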
12) Profiling and eBPF (additional signal)
Low-level hot paths (CPU/alloc/IO); profiles captured per incident.
eBPF telemetry (network latency, DNS, system calls) tied to `trace_id`/PID.
13) Observability testing
Signal contract: CI checks that the expected metrics/labels/histograms are exported.
Synthetic probes: RUM scenarios/simulated clients for external SLIs.
Chaos/fire drills: disable dependencies, induce degradation - watch how alerts and on-call engineers react.
Smoke in prod: a post-deploy check that new endpoints emit metrics and traces.
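A minimal sketch of such a signal-contract check in CI, pytest-style (assumes the service under test exposes a Prometheus-format /metrics endpoint at the URL shown; uses the parser shipped with `prometheus_client`):

```python
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

EXPECTED = {"http_server_duration_seconds"}          # contract: metrics that must exist
EXPECTED_LABELS = {"service", "route", "method", "code", "tenant"}

def test_metrics_contract():
    body = urllib.request.urlopen("http://localhost:8080/metrics").read().decode()
    families = {f.name: f for f in text_string_to_metric_families(body)}
    missing = EXPECTED - families.keys()
    assert not missing, f"missing metrics: {missing}"
    # Every exported sample of the latency histogram must carry the agreed labels.
    for sample in families["http_server_duration_seconds"].samples:
        assert EXPECTED_LABELS <= sample.labels.keys(), sample
```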
14) Cost and volume control
Budgets per signal/team; a "cost per signal" dashboard.
Cardinality kept within budget (an SLO for cardinality), limits on new labels.
Downsampling, tiering by data class, cold archives, and WORM for audit.
15) Operation and SLO of the observability platform
Platform SLO: 99.9% of ingests successful; metrics indexed within ≤ 30 s, logs ≤ 2 min, traces ≤ 1 min.
Platform alerts: ingestion lag, growth in drops, signing/encryption errors, buffer overflow.
DR/HA: multi-region, replication, backups of configs/rules.
16) Checklists
Before production:
- `trace_id`/`span_id` propagated everywhere; JSON logs with a schema.
- RED/USE metrics with histograms; exemplars linked to traces.
- Tail sampling enabled; 5xx/p99 rules = 100%.
- Symptom-based alerts + runbooks; quiet hours/anti-flap.
- PII sanitizers; encryption at rest/in transit; WORM for audit.
- Retention policies and budgets for volumes/cardinality.
Regularly:
- Monthly alert review (noise/precision), threshold tuning.
- Error budget report and actions taken (feature freeze, hardening).
- Check dashboard/log/trace coverage of critical paths.
- Incident drills and runbook updates.
17) Runbooks
RCA: p99 spike on /pay
1. Open the RED dashboard for `checkout`.
2. Jump via an exemplar → a slow trace → find the bottleneck span (e.g. `gateway.call`).
3. Open the logs by `trace_id` → look at timeouts/retries.
4. Roll back the feature / enable an RPS limit, notify the dependency owners.
5. After stabilization: RCA, optimization tickets, a regression test for the scenario.
Data freshness degradation (ETL)
1. The "freshness" SLI is red → trace the job run → find the failing step.
2. Check broker/DLQ logs and connector errors.
3. Start reprocessing, notify consumers (BI/product) via the status channel.
18) Common mistakes
Logs without a schema and without `trace_id`. Investigations take several times longer.
Alerts on infrastructure instead of symptoms. Paging misses the real problems.
Unbounded metric cardinality. Cost explosion and instability.
Keeping 100% of traces. Expensive and unnecessary; enable smart sampling.
PII/secrets in the logs. Enable sanitizers and redaction lists.
"Silent" features. New code shipped without metrics/traces/logs.
19) FAQ
Q: Do I need to store the raw text of the logs?
A: Yes, but with retention and archives; for alerts and SLOs, aggregates are sufficient. Audit - in WORM.
Q: What to choose for traces - head or tail sampling?
A: Combine: probabilistic head sampling as the baseline + tail rules for errors and anomalies.
Q: How do I link user metrics and technical metrics?
A: Through a shared `trace_id` and business labels (`route`, `tenant`, `plan`), and through product events (conversions) correlated to traces.
Q: How not to drown in alerts?
A: Alert on symptoms, introduce quiet hours, deduplication, grouping, SLO-based prioritization, and a default owner for every alert.
- "Audit and immutable logs"
- "In Transit/At Rest Encryption"
- "Secret Management"
- "Data Origin (Lineage)"
- «Privacy by Design (GDPR)»
- "Webhook Delivery Guarantees"