Logging and tracing events
1) Purpose and frame
Logs and traces are the foundation of observability.
Logs answer "what happened" and "in what context."
Traces answer "where and why a request was slow or failed" along a distributed request path.
Key principles:
- Structured by default (JSON); trace-first: every log on the hot path carries `trace_id`/`span_id`.
- Minimum noise, maximum signal: levels, sampling, cardinality control.
- Security and privacy: masking, redaction, access control.
- Versioned schemas for logs and events.
2) Taxonomy of events
Separate streams and indexes by purpose:
1. Technical logs (runtime, errors, network timeouts, retries).
2. Business events (registration, deposit, bet, withdrawal, KYC stage): suitable for product analytics and for incidents on the "money" paths.
3. Audit (who changed what and when: configs, access, flags, limits): an immutable journal.
4. Security (authentication, privilege escalation, sanction/PEP flags).
5. Infrastructure (K8s events, autoscaling, HPA/VPA, nodes/disk/network).
Each stream gets its own rules for retention, indexing, and access.
3) Structured log (JSON standard)
```json
{
  "ts": "2025-11-03T14:28:15.123Z",
  "level": "ERROR",
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central-1",
  "trace_id": "8a4f0c2e9b1f42d7",
  "span_id": "c7d1f3a4b8b6e912",
  "parent_span_id": "a1b2c3d4e5f60789",
  "logger": "withdraw.handler",
  "event": "psp_decline",
  "msg": "PSP declined transaction",
  "http": { "method": "POST", "route": "/withdraw", "status": 502, "latency_ms": 842 },
  "user": { "tenant_id": "t_9f2", "user_key": "hash_0a7c", "vip_tier": 3 },
  "payment": { "psp": "acme", "amount": 120.50, "currency": "EUR", "idempotency_key": "u123:wd:7845" },
  "safe": true,           // secret check passed
  "version": "1.14.2",    // service version (SemVer)
  "build": "sha-1f2a3b4",
  "kubernetes": { "pod": "payments-7cbdf", "node": "ip-10-0-2-41" }
}
```
Requirements: a flat schema plus nested objects per domain; required fields (`ts`, `level`, `service`, `env`, `trace_id`, `msg`); numeric values as numbers, not strings.
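These requirements can be checked mechanically, which is also what the CI schema linter in the checklist below relies on. A minimal Python sketch, assuming the record is already parsed into a dict; the field names follow the example above and the numeric-field list is an illustrative assumption:

```python
# Hypothetical schema check for one structured log record (not tied to any specific library).
REQUIRED_FIELDS = {"ts", "level", "service", "env", "trace_id", "msg"}
NUMERIC_FIELDS = {("http", "latency_ms"), ("http", "status"), ("payment", "amount")}

def validate_log_record(record: dict) -> list[str]:
    """Return schema violations: missing required fields or numbers emitted as strings."""
    errors = [f"missing required field: {name}" for name in sorted(REQUIRED_FIELDS - record.keys())]
    for group, key in NUMERIC_FIELDS:
        value = record.get(group, {}).get(key)
        if value is not None and not isinstance(value, (int, float)):
            errors.append(f"{group}.{key} must be a number, got {type(value).__name__}")
    return errors
```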
4) Levels, cardinality and scope
Levels: `DEBUG` (not in production), `INFO` (business facts), `WARN` (anomalies), `ERROR` (errors), `FATAL` (crashes).
Cardinality: avoid arbitrary keys and dynamic labels; never encode IDs into key names.
Log sampling: rate-limit repeating messages (a sketch follows below); enable `DEBUG` only scoped and time-boxed (via a feature flag).
Idempotency: log the `idempotency_key` so consumers can suppress duplicate events.
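A minimal sketch of the rate-limiting idea above, assuming an in-process wrapper around the logger; the window and budget values are illustrative, not prescribed:

```python
import time
from collections import defaultdict

class LogRateLimiter:
    """Allow at most `max_per_window` occurrences of the same event key per time window."""

    def __init__(self, max_per_window: int = 10, window_s: float = 60.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self._timestamps: dict[str, list[float]] = defaultdict(list)

    def allow(self, event: str) -> bool:
        now = time.monotonic()
        recent = [t for t in self._timestamps[event] if now - t < self.window_s]
        recent.append(now)
        self._timestamps[event] = recent
        return len(recent) <= self.max_per_window

# Usage: emit the log line only while the event key is under its budget.
# if limiter.allow("psp_decline"): logger.error(record)
```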
5) Privacy and security
Mask PII/secrets at the agent (Fluent Bit/Vector): mask by key (`email`, `card`, `token`, `authorization`); an in-app safety net is sketched after this list.
Hash `user_key`; keep only the context that is actually required (country, KYC level, VIP tier).
Separate storages: warm (online search) and cold (archive without PII, or with a stripped-down context).
Audit: append-only, WORM storage, access strictly on the least-privilege principle.
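Masking belongs on the agent, as noted above, but a defense-in-depth check in the application costs little. A minimal sketch; the key list mirrors the examples in this section and is an assumption, not a complete policy:

```python
SENSITIVE_KEYS = {"email", "card", "card_number", "token", "authorization", "password"}

def mask_sensitive(value):
    """Recursively replace values of sensitive keys before the record is serialized."""
    if isinstance(value, dict):
        return {
            k: "***MASKED***" if k.lower() in SENSITIVE_KEYS else mask_sensitive(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [mask_sensitive(item) for item in value]
    return value
```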
6) Tracing: standards and context
W3C Trace Context: `traceparent`/`tracestate` headers, plus baggage for keys that are safe to propagate (e.g. `tenant_id`, `region`).
Link metrics and traces: exemplars carry a `trace_id` at histogram sample points, which speeds up RCA.
Sampling: a 1-5% baseline, plus dynamic "on error / slow p95" sampling up to 100% for problem requests.
Links: for asynchronous queues/sagas, connect spans through `links`, not only through `parent` (see the sketch below).
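A minimal sketch of the last point with the OpenTelemetry Python API: the consumer of an asynchronous message recovers the producer's W3C context from the message headers and attaches it as a link rather than a parent. The tracer and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import Link

tracer = trace.get_tracer("payments-consumer")  # illustrative tracer name

def handle_message(body: bytes, headers: dict) -> None:
    # Recover the producer's span context carried in the message headers (traceparent).
    producer_ctx = trace.get_current_span(extract(headers)).get_span_context()
    links = [Link(producer_ctx)] if producer_ctx.is_valid else []
    # The consumer starts its own span but stays correlated with the producing request.
    with tracer.start_as_current_span("queue.process", links=links):
        ...  # business logic
```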
7) Collection and routing
Agent: Fluent Bit/Vector for logs; OTLP export to OpenTelemetry Collector.
Collector: central gateway (batch/transform/filter/routing).
```txt
App → (OTLP logs/traces/metrics) → OTel Collector
  → logs: redact → route (security | audit | tech | biz) → hot index / cold archive
  → traces: tail_sampling (errors, p95 > threshold) → APM backend
  → metrics: Prometheus exporter (for SLO/alerts)
```
OTel Collector (fragment):
```yaml
processors:
  batch: {}
  attributes:
    actions:
      - key: env
        value: prod
        action: insert
  filter/logs:
    logs:
      include:
        match_type: strict
        resource_attributes:
          - key: service.name
            value: payments-api
exporters:
  otlp/traces: { endpoint: "apm:4317", tls: { insecure: true } }
  loki: { endpoint: "http://loki:3100/loki/api/v1/push" }
  prometheus: {}
service:
  pipelines:
    logs: { receivers: [otlp], processors: [attributes, batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [batch], exporters: [otlp/traces] }
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
```
8) Instrumentation: SDK examples
8.1 Node.js (Pino + OTel)
```js
import pino from "pino";
import { context, trace } from "@opentelemetry/api";

const logger = pino({ level: process.env.LOG_LEVEL || "info" });

// Attach the active trace context to every structured log line.
function log(info) {
  const span = trace.getSpan(context.active());
  const base = span
    ? { trace_id: span.spanContext().traceId, span_id: span.spanContext().spanId }
    : {};
  logger.info({ ...base, ...info });
}

// Example:
// log({ event: "deposit.created", amount: 50, currency: "EUR", user: { user_key: "hash_0a7c" } });
```
8.2 Java (SLF4J + OTel)
```java
MDC.put("trace_id", Span.current().getSpanContext().getTraceId());
MDC.put("span_id", Span.current().getSpanContext().getSpanId());
log.info("psp_response status={} latency_ms={}", status, latency);
```
8.3 Python (structlog + OTel)
```python
import structlog
from opentelemetry import trace

log = structlog.get_logger()

def log_json(event, **kwargs):
    span = trace.get_current_span()
    ctx = {}
    if span and span.get_span_context().is_valid:
        sc = span.get_span_context()
        # OTel stores IDs as ints; render them as hex to match the log schema
        ctx = {"trace_id": format(sc.trace_id, "032x"), "span_id": format(sc.span_id, "016x")}
    log.msg(event, **ctx, **kwargs)
```
8.4 NGINX: forwarding trace headers
```nginx
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
```
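For outgoing calls made from application code, the same headers can be injected explicitly. A minimal Python sketch using `opentelemetry.propagate.inject`; auto-instrumentation normally does this for you, and the downstream URL is a placeholder:

```python
import requests
from opentelemetry.propagate import inject

def call_downstream(payload: dict) -> requests.Response:
    headers: dict[str, str] = {}
    inject(headers)  # adds traceparent/tracestate from the currently active span
    # Placeholder URL: substitute the real downstream service.
    return requests.post("http://psp-gateway.internal/withdraw", json=payload, headers=headers)
```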
9) Logs as a signal for alerts and auto-actions
Aggregate error patterns (`psp_decline`, `fraud_flag`) and correlate them with SLOs.
Alert on pattern rate: "5xx on /withdraw > 0.5% over 10m", "fraud_flag spike > +200% of baseline".
Auto-actions: on a matching pattern, enable the kill-switch (`withdrawals_manual_mode = true`) through the flag platform; a sketch follows the example rule below.
Example rule (pseudo-expression):
rate(count_over_time({service="payments-api", level="ERROR", event="psp_decline"}[5m])) > 5
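A minimal sketch of the auto-action half, assuming an alerting webhook hands the fired metric to application code; `flag_client` and its `set_flag()` method are hypothetical stand-ins for whatever flag platform is in use, and the threshold mirrors the example rule:

```python
DECLINE_RATE_THRESHOLD = 5.0  # matches the example rule above (psp_decline over 5m)

def on_alert(metric_name: str, value: float, flag_client) -> None:
    # flag_client.set_flag() is a hypothetical flag-platform call, not a real API.
    if metric_name == "psp_decline_rate" and value > DECLINE_RATE_THRESHOLD:
        flag_client.set_flag("withdrawals_manual_mode", True, reason="psp_decline spike")
```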
10) Retention, indexing, storage
Hot: 7-14 days (operational investigation).
Warm: 30-90 days (trends, RCA).
Cold: 180-365+ days (archive, audit): compression, cheap storage classes, possibly without full-text search.
Indexing: fixed keys (`service`, `env`, `level`, `event`, `trace_id`, `user.tenant_id`); do not index everything indiscriminately.
Limit event size (for example, ≤ 32 KB) and truncate or drop the excess: "extra data in storage is the enemy of MTTR."
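A minimal sketch of enforcing the size limit in the application, assuming the heaviest optional fields can be dropped; the droppable field names are illustrative:

```python
import json

MAX_EVENT_BYTES = 32 * 1024  # matches the example limit above

def enforce_size_limit(record: dict) -> dict:
    """Drop the heaviest optional fields until the serialized event fits, keeping correlation keys."""
    droppable = ["stacktrace", "request_body", "response_body"]  # illustrative field names
    while len(json.dumps(record).encode("utf-8")) > MAX_EVENT_BYTES and droppable:
        record.pop(droppable.pop(0), None)
        record["truncated"] = True
    return record
```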
11) Audit and immutability
Write audit events to a separate stream with signatures/hashes, server time, "who/what/when/why", and a link to the ticket; a hash-chaining sketch follows the example below.
"Who enabled the bonus flag at 100% in DE?" must be answerable in one or two queries.
```json
{
  "ts": "2025-11-03T14:00:00.000Z",
  "actor": "alice@company",
  "action": "feature_flag.update",
  "target": "bonus.enable_vip",
  "old": { "rollout": 10 },
  "new": { "rollout": 100 },
  "reason": "campaign_2311",
  "ticket": "OPS-3481",
  "trace_id": "cf12ab.."
}
```
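One way to make such a stream tamper-evident is to chain record hashes, as hinted at by "signatures/hashes" above. A minimal sketch; a real deployment would combine this with WORM storage and proper signatures:

```python
import hashlib
import json

def chain_audit_record(record: dict, prev_hash: str) -> dict:
    """Append prev_hash and a SHA-256 over the canonical record, so any later edit breaks the chain."""
    chained = dict(record, prev_hash=prev_hash)
    payload = json.dumps(chained, sort_keys=True, separators=(",", ":")).encode("utf-8")
    chained["record_hash"] = hashlib.sha256(payload).hexdigest()
    return chained
```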
12) Business events and data model
Business events are not "text in the logs" but a contract:
- `event_type`, `event_id`, `occurred_at`, `actor`, `subject`, `amount`, `currency`, `status`, `idempotency_key`.
- Use the Outbox pattern and at-least-once delivery with idempotent consumers (see the sketch after this list).
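A minimal sketch of an idempotent consumer under at-least-once delivery, deduplicating on `event_id`; the in-memory set stands in for a durable store (a DB table or Redis set), and `process()` is a hypothetical domain handler:

```python
seen_ids: set[str] = set()  # stand-in for a durable dedup store

def process(event: dict) -> None:
    """Hypothetical domain handler for a business event."""
    ...

def handle_business_event(event: dict) -> None:
    # At-least-once delivery means duplicates will arrive; dedup on event_id makes replays harmless.
    event_id = event["event_id"]
    if event_id in seen_ids:
        return
    process(event)
    seen_ids.add(event_id)  # mark as processed only after the side effect succeeds
```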
13) Kubernetes and the log pipeline
Sidecar/DaemonSet agents with an on-disk buffer (to survive network interruptions).
Pod annotations for routing (`log.type`, `retention.tier`).
Collect the logs of K8s controllers separately (cluster-level index).
```ini
[FILTER]
    Name   modify
    Match  *
    Remove authorization
    Remove password
    Remove card_number
```
14) Anti-patterns
Free-form string logs written "however it comes out"; no `trace_id`.
PII/secrets in the logs; full payload dumps.
Millions of unique keys → an exploded index.
`DEBUG` in production 24/7.
Mixing audit, security, and technical logs into one index.
No retention policy and no tested backup recovery.
15) Implementation checklist (0-45 days)
0-10 days
Enable W3C Trace Context on the gateway/clients; forward the headers.
Switch application logs to JSON; add `trace_id`/`span_id`.
Ban PII/secrets (masking on the agent); approve the list of allowed fields.
11-25 days
Separate the streams: tech/biz/audit/security/infra; set retention and ACLs.
Enable the OTel Collector with tail sampling of errors/slow requests.
Dashboards for log rate and errors by route, plus jump-to-trace (exemplars).
26-45 days
Event pattern alerts and correlation with SLO.
Archive/restore (DR test) for cold logs.
A log schema linter in CI; a contract for business events.
16) Maturity metrics
`trace_id` coverage of requests ≥ 95%.
Share of JSON logs ≥ 99%.
Incidents investigated via jump-to-trace are resolved in < 15 min (p50).
Zero cases of PII in logs (leak scanner).
Retention is enforced for all streams (proven automatically by audit).
17) Appendix: mini snippets
W3C traceparent generation (pseudo)
```txt
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```
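A minimal Python sketch of generating such a header, assuming random IDs are acceptable (i.e. this service starts the trace):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"
```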
PromQL: linking logs and SLO (example)
```txt
high_error_logs = rate(log_events_total{service="payments-api",level="ERROR"}[5m])
5xx_rate = sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
         / sum(rate(http_requests_total{service="payments-api"}[5m]))
alert if high_error_logs > 10 and 5xx_rate > 0.005
```
OpenAPI: correlation headers
```yaml
components:
  parameters:
    Traceparent:
      name: traceparent
      in: header
      required: false
      schema: { type: string }
```
18) Conclusion
A strong logging and tracing setup is a contract plus discipline: structured JSON logs, a single `trace_id`, safe PII handling, per-stream routing and retention, and a tight link to SLOs, alerts, and rollbacks. Move from a "dump of text" to event contracts and traces, and diagnosing production incidents becomes fast, predictable, and verifiable.