Logging and tracing events
1) Purpose and frame
Logs and traces are the foundation of observability.
Logs answer "what happened" and "in what context."
Traces answer "where and why a request was slow or failed" along a distributed request path.
Key principles:
- Structured by default (JSON); trace-first: every log on the hot path carries `trace_id`/`span_id`.
- Minimum noise, maximum signal: levels, sampling, cardinality control.
- Security and privacy: masking, redaction, access control.
- Versioned schemas for logs and events.
2) Taxonomy of events
Separate streams and indexes by purpose:
1. Technical logs (runtime, errors, network timeouts, retries).
2. Business events (registration, deposit, bet, withdrawal, KYC stage): suitable for product analytics and for incidents on the "money" paths.
3. Audit (who changed what and when: configs, access, flags, limits): an immutable journal.
4. Security (authentication, privilege escalation, sanction/PEP flags).
5. Infrastructure (K8s events, autoscaling, HPA/VPA, nodes/disk/network).
Each stream gets its own rules for retention, indexing, and access.
3) Structured log (JSON standard)
```json
{
  "ts": "2025-11-03T14:28:15.123Z",
  "level": "ERROR",
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central-1",
  "trace_id": "8a4f0c2e9b1f42d7",
  "span_id": "c7d1f3a4b8b6e912",
  "parent_span_id": "a1b2c3d4e5f60789",
  "logger": "withdraw.handler",
  "event": "psp_decline",
  "msg": "PSP declined transaction",
  "http": { "method": "POST", "route": "/withdraw", "status": 502, "latency_ms": 842 },
  "user": { "tenant_id": "t_9f2", "user_key": "hash_0a7c", "vip_tier": 3 },
  "payment": { "psp": "acme", "amount": 120.50, "currency": "EUR", "idempotency_key": "u123:wd:7845" },
  "safe": true,           // secret check passed
  "version": "1.14.2",    // service version (SemVer)
  "build": "sha-1f2a3b4",
  "kubernetes": { "pod": "payments-7cbdf", "node": "ip-10-0-2-41" }
}
```
Requirements: a flat schema plus nested objects per domain; required fields (`ts`, `level`, `service`, `env`, `trace_id`, `msg`); numeric values as numbers, not strings.
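These requirements can be checked mechanically, which is also what the CI schema linter in the checklist below relies on. A minimal Python sketch, assuming the record is already parsed into a dict; the field names follow the example above and the numeric-field list is an illustrative assumption:

```python
# Hypothetical schema check for one structured log record (not tied to any specific library).
REQUIRED_FIELDS = {"ts", "level", "service", "env", "trace_id", "msg"}
NUMERIC_FIELDS = {("http", "latency_ms"), ("http", "status"), ("payment", "amount")}

def validate_log_record(record: dict) -> list[str]:
    """Return schema violations: missing required fields or numbers emitted as strings."""
    errors = [f"missing required field: {name}" for name in sorted(REQUIRED_FIELDS - record.keys())]
    for group, key in NUMERIC_FIELDS:
        value = record.get(group, {}).get(key)
        if value is not None and not isinstance(value, (int, float)):
            errors.append(f"{group}.{key} must be a number, got {type(value).__name__}")
    return errors
```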
4) Levels, cardinality and scope
Levels: `DEBUG` (not in production), `INFO` (business facts), `WARN` (anomalies), `ERROR` (errors), `FATAL` (crashes).
Cardinality: avoid arbitrary keys and dynamic labels; never encode IDs into key names.
Log sampling: rate-limit repeating messages (a sketch follows below); enable `DEBUG` only scoped and time-boxed (via a feature flag).
Idempotency: log the `idempotency_key` so consumers can suppress duplicate events.
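A minimal sketch of the rate-limiting idea above, assuming an in-process wrapper around the logger; the window and budget values are illustrative, not prescribed:

```python
import time
from collections import defaultdict

class LogRateLimiter:
    """Allow at most `max_per_window` occurrences of the same event key per time window."""

    def __init__(self, max_per_window: int = 10, window_s: float = 60.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self._timestamps: dict[str, list[float]] = defaultdict(list)

    def allow(self, event: str) -> bool:
        now = time.monotonic()
        recent = [t for t in self._timestamps[event] if now - t < self.window_s]
        recent.append(now)
        self._timestamps[event] = recent
        return len(recent) <= self.max_per_window

# Usage: emit the log line only while the event key is under its budget.
# if limiter.allow("psp_decline"): logger.error(record)
```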
5) Privacy and security
Mask PII/secrets at the agent (Fluent Bit/Vector): mask by key (`email`, `card`, `token`, `authorization`); an in-app safety net is sketched after this list.
Hash `user_key`; keep only the context that is actually required (country, KYC level, VIP tier).
Separate storages: warm (online search) and cold (archive without PII, or with a stripped-down context).
Audit: append-only, WORM storage, access strictly on the least-privilege principle.
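Masking belongs on the agent, as noted above, but a defense-in-depth check in the application costs little. A minimal sketch; the key list mirrors the examples in this section and is an assumption, not a complete policy:

```python
SENSITIVE_KEYS = {"email", "card", "card_number", "token", "authorization", "password"}

def mask_sensitive(value):
    """Recursively replace values of sensitive keys before the record is serialized."""
    if isinstance(value, dict):
        return {
            k: "***MASKED***" if k.lower() in SENSITIVE_KEYS else mask_sensitive(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [mask_sensitive(item) for item in value]
    return value
```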
6) Tracing: standards and context
W3C Trace Context: `traceparent`/`tracestate` headers, plus baggage for keys that are safe to propagate (e.g. `tenant_id`, `region`).
Link metrics and traces: exemplars carry a `trace_id` at histogram sample points, which speeds up RCA.
Sampling: a 1-5% baseline, plus dynamic "on error / slow p95" sampling up to 100% for problem requests.
Links: for asynchronous queues/sagas, connect spans through `links`, not only through `parent` (see the sketch below).
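A minimal sketch of the last point with the OpenTelemetry Python API: the consumer of an asynchronous message recovers the producer's W3C context from the message headers and attaches it as a link rather than a parent. The tracer and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import Link

tracer = trace.get_tracer("payments-consumer")  # illustrative tracer name

def handle_message(body: bytes, headers: dict) -> None:
    # Recover the producer's span context carried in the message headers (traceparent).
    producer_ctx = trace.get_current_span(extract(headers)).get_span_context()
    links = [Link(producer_ctx)] if producer_ctx.is_valid else []
    # The consumer starts its own span but stays correlated with the producing request.
    with tracer.start_as_current_span("queue.process", links=links):
        ...  # business logic
```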
7) Collection and routing
Agent: Fluent Bit/Vector for logs; OTLP export to OpenTelemetry Collector.
Collector: central gateway (batch/transform/filter/routing).
```txt
App → (OTLP logs/traces/metrics) → OTel Collector
  → logs: redact → route (security | audit | tech | biz) → hot index / cold archive
  → traces: tail_sampling (errors, p95 > threshold) → APM backend
  → metrics: Prometheus exporter (for SLO/alerts)
```
OTel Collector (fragment):
```yaml
processors:
  batch: {}
  attributes:
    actions:
      - key: env
        value: prod
        action: insert
  filter/logs:
    logs:
      include:
        match_type: strict
        resource_attributes:
          - key: service.name
            value: payments-api
exporters:
  otlp/traces: { endpoint: "apm:4317", tls: { insecure: true } }
  loki: { endpoint: "http://loki:3100/loki/api/v1/push" }
  prometheus: {}
service:
  pipelines:
    logs: { receivers: [otlp], processors: [attributes, batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [batch], exporters: [otlp/traces] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
```
8) Instrumentation: SDK examples
8.1 Node.js (Pino + OTel)
```js
import pino from "pino";
import { context, trace } from "@opentelemetry/api";

const logger = pino({ level: process.env.LOG_LEVEL || "info" });

// Attach the active trace context to every structured log line.
function log(info) {
  const span = trace.getSpan(context.active());
  const base = span
    ? { trace_id: span.spanContext().traceId, span_id: span.spanContext().spanId }
    : {};
  logger.info({ ...base, ...info });
}

// Example:
// log({ event: "deposit.created", amount: 50, currency: "EUR", user: { user_key: "hash_0a7c" } });
```
8.2 Java (SLF4J + OTel)
```java
MDC.put("trace_id", Span.current().getSpanContext().getTraceId());
MDC.put("span_id", Span.current().getSpanContext().getSpanId());
log.info("psp_response status={} latency_ms={}", status, latency);
```
8.3 Python (structlog + OTel)
```python
import structlog
from opentelemetry import trace

log = structlog.get_logger()

def log_json(event, **kwargs):
    span = trace.get_current_span()
    ctx = {}
    if span and span.get_span_context().is_valid:
        sc = span.get_span_context()
        # OTel stores IDs as ints; render them as hex to match the log schema
        ctx = {"trace_id": format(sc.trace_id, "032x"), "span_id": format(sc.span_id, "016x")}
    log.msg(event, **ctx, **kwargs)
```
8.4 NGINX: forwarding trace headers
```nginx
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
```
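For outgoing calls made from application code, the same headers can be injected explicitly. A minimal Python sketch using `opentelemetry.propagate.inject`; auto-instrumentation normally does this for you, and the downstream URL is a placeholder:

```python
import requests
from opentelemetry.propagate import inject

def call_downstream(payload: dict) -> requests.Response:
    headers: dict[str, str] = {}
    inject(headers)  # adds traceparent/tracestate from the currently active span
    # Placeholder URL: substitute the real downstream service.
    return requests.post("http://psp-gateway.internal/withdraw", json=payload, headers=headers)
```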
9) Logs as a signal for alerts and auto-actions
Aggregate error patterns (`psp_decline`, `fraud_flag`) and correlate them with SLOs.
Alert on pattern rate: "5xx on /withdraw > 0.5% over 10m", "fraud_flag spike > +200% of baseline".
Auto-actions: on a matching pattern, enable the kill-switch (`withdrawals_manual_mode = true`) through the flag platform; a sketch follows the example rule below.
Example rule (pseudo-expression):
rate(count_over_time({service="payments-api", level="ERROR", event="psp_decline"}[5m])) > 5
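A minimal sketch of the auto-action half, assuming an alerting webhook hands the fired metric to application code; `flag_client` and its `set_flag()` method are hypothetical stand-ins for whatever flag platform is in use, and the threshold mirrors the example rule:

```python
DECLINE_RATE_THRESHOLD = 5.0  # matches the example rule above (psp_decline over 5m)

def on_alert(metric_name: str, value: float, flag_client) -> None:
    # flag_client.set_flag() is a hypothetical flag-platform call, not a real API.
    if metric_name == "psp_decline_rate" and value > DECLINE_RATE_THRESHOLD:
        flag_client.set_flag("withdrawals_manual_mode", True, reason="psp_decline spike")
```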
10) Retention, indexing, storage
Hot: 7-14 days (operational investigation).
Warm: 30-90 days (trends, RCA).
Cold: 180-365+ days (archive, audit): compression, cheap storage classes, possibly without full-text search.
Indexing: fixed keys (`service`, `env`, `level`, `event`, `trace_id`, `user.tenant_id`); do not index everything indiscriminately.
Limit event size (for example, ≤ 32 KB) and truncate or drop the excess: "extra data in storage is the enemy of MTTR."
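A minimal sketch of enforcing the size limit in the application, assuming the heaviest optional fields can be dropped; the droppable field names are illustrative:

```python
import json

MAX_EVENT_BYTES = 32 * 1024  # matches the example limit above

def enforce_size_limit(record: dict) -> dict:
    """Drop the heaviest optional fields until the serialized event fits, keeping correlation keys."""
    droppable = ["stacktrace", "request_body", "response_body"]  # illustrative field names
    while len(json.dumps(record).encode("utf-8")) > MAX_EVENT_BYTES and droppable:
        record.pop(droppable.pop(0), None)
        record["truncated"] = True
    return record
```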
11) Audit and immutability
Write audit events to a separate stream with signatures/hashes, server time, "who/what/when/why", and a link to the ticket; a hash-chaining sketch follows the example below.
"Who enabled the bonus flag at 100% in DE?" must be answerable in one or two queries.
```json
{
  "ts": "2025-11-03T14:00:00.000Z",
  "actor": "alice@company",
  "action": "feature_flag.update",
  "target": "bonus.enable_vip",
  "old": { "rollout": 10 },
  "new": { "rollout": 100 },
  "reason": "campaign_2311",
  "ticket": "OPS-3481",
  "trace_id": "cf12ab.."
}
```
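One way to make such a stream tamper-evident is to chain record hashes, as hinted at by "signatures/hashes" above. A minimal sketch; a real deployment would combine this with WORM storage and proper signatures:

```python
import hashlib
import json

def chain_audit_record(record: dict, prev_hash: str) -> dict:
    """Append prev_hash and a SHA-256 over the canonical record, so any later edit breaks the chain."""
    chained = dict(record, prev_hash=prev_hash)
    payload = json.dumps(chained, sort_keys=True, separators=(",", ":")).encode("utf-8")
    chained["record_hash"] = hashlib.sha256(payload).hexdigest()
    return chained
```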
12) Business events and data model
Business events are not "text in the logs" but a contract:
- `event_type`, `event_id`, `occurred_at`, `actor`, `subject`, `amount`, `currency`, `status`, `idempotency_key`.
- Use the Outbox pattern and at-least-once delivery with idempotent consumers (see the sketch after this list).
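A minimal sketch of an idempotent consumer under at-least-once delivery, deduplicating on `event_id`; the in-memory set stands in for a durable store (a DB table or Redis set), and `process()` is a hypothetical domain handler:

```python
seen_ids: set[str] = set()  # stand-in for a durable dedup store

def process(event: dict) -> None:
    """Hypothetical domain handler for a business event."""
    ...

def handle_business_event(event: dict) -> None:
    # At-least-once delivery means duplicates will arrive; dedup on event_id makes replays harmless.
    event_id = event["event_id"]
    if event_id in seen_ids:
        return
    process(event)
    seen_ids.add(event_id)  # mark as processed only after the side effect succeeds
```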
13) Kubernetes and the log pipeline
Sidecar/DaemonSet agents with an on-disk buffer (to survive network interruptions).
Pod annotations for routing (`log.type`, `retention.tier`).
Collect the logs of K8s controllers separately (cluster-level index).
```ini
[FILTER]
    Name   modify
    Match  *
    Remove authorization
    Remove password
    Remove card_number
```
14) Anti-patterns
Free-form string logs written "however it comes out"; no `trace_id`.
PII/secrets in the logs; full payload dumps.
Millions of unique keys → an exploded index.
`DEBUG` in production 24/7.
Mixing audit, security, and technical logs into one index.
No retention policy and no tested backup recovery.
15) Implementation checklist (0-45 days)
0-10 days
Enable W3C Trace Context on the gateway/clients; forward the headers.
Switch application logs to JSON; add `trace_id`/`span_id`.
Ban PII/secrets (masking on the agent); approve the list of allowed fields.
11-25 days
Separate the streams: tech/biz/audit/security/infra; set retention and ACLs.
Enable the OTel Collector with tail sampling of errors/slow requests.
Dashboards for log rate and errors by route, plus jump-to-trace (exemplars).
26-45 days
Event pattern alerts and correlation with SLO.
Archive/restore (DR test) for cold logs.
A log schema linter in CI; a contract for business events.
16) Maturity metrics
`trace_id` coverage of requests ≥ 95%.
Share of JSON logs ≥ 99%.
Incidents investigated via jump-to-trace are resolved in < 15 min (p50).
Zero cases of PII in logs (leak scanner).
Retention is enforced for all streams (proven automatically by audit).
17) Appendix: mini snippets
W3C traceparent generation (pseudo)
```txt
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```
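A minimal Python sketch of generating such a header, assuming random IDs are acceptable (i.e. this service starts the trace):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"
```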
PromQL: linking logs and SLO (example)
```txt
high_error_logs = rate(log_events_total{service="payments-api",level="ERROR"}[5m])
5xx_rate = sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
         / sum(rate(http_requests_total{service="payments-api"}[5m]))
alert if high_error_logs > 10 and 5xx_rate > 0.005
```
OpenAPI: correlation headers
```yaml
components:
  parameters:
    Traceparent:
      name: traceparent
      in: header
      required: false
      schema: { type: string }
```
18) Conclusion
A strong logging and tracing setup is a contract plus discipline: structured JSON logs, a single `trace_id`, safe PII handling, per-stream routing and retention, and a tight link to SLOs, alerts, and rollbacks. Move from a "dump of text" to event contracts and traces, and diagnosing production incidents becomes fast, predictable, and verifiable.