Centralization of logs
1) Why centralize logs
Centralized logs are the foundation of observability, audit, and compliance. They:
- speed up the search for incident root causes (correlation by request_id/trace_id);
- allow you to build signal alerts on symptoms (errors, anomalies);
- give an audit trail (who/when/what did it);
- lower costs by unifying retention and storage.
2) Basic principles
1. Only structured logs (JSON/RFC5424) - no "free-text" without keys.
2. Uniform scheme of keys: 'ts, level, service, env, region, tenant, trace_id, span_id, request_id, user_id (masked), msg, kv...'.
3. Correlation by default: propagate trace_id from the gateway through the backends and into the logs (a minimal logger sketch follows this list).
4. Noise minimization: correct levels, sampling, deduplication of repeated messages.
5. Safety by design: PII masking, RBAC/ABAC, immutability.
6. Economy: hot/warm/cold, compression, aggregation, TTL and rehydration.
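A minimal sketch of principles 1-3 in Python, using only the standard library; the service name, ids, and field values are illustrative assumptions, and the keys follow the scheme above:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON line using the uniform key scheme."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                  + f".{int(record.msecs):03d}Z",                 # RFC3339, UTC
            "level": record.levelname,
            "service": "payments-api",     # assumption: injected from deploy config
            "env": "prod",
            "trace_id": getattr(record, "trace_id", None),   # propagated, never invented
            "request_id": getattr(record, "request_id", None),
            "msg": record.getMessage(),
            "kv": getattr(record, "kv", {}),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, so the node agent can tail it
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# trace_id/request_id arrive with the request (e.g., from the gateway) and pass through
log.error("upstream PSP timeout",
          extra={"trace_id": "0af7651916cd43dd8448eb211c80319c",
                 "request_id": "r-7f2c",
                 "kv": {"provider": "psp-a", "attempt": 2}})
```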
3) Typical architectures
EFK/ELK: (Fluent Bit/Fluentd/Filebeat) → (Kafka, optional) → (Elasticsearch/OpenSearch) → (Kibana/OpenSearch Dashboards). Universal search and aggregation.
Loki-like (log indexing by labels): Promtail/Fluent Bit → Loki → Grafana. Cheaper at large volumes; powerful label filtering plus linear scanning of raw lines.
Cloud: CloudWatch/Cloud Logging/Log Analytics + export to cold storage (S3/GCS/ADLS) and/or SIEM.
Data Lake approach: shippers → object storage (parquet/iceberg) → cheap analytical queries (Athena/BigQuery/Spark) + online layer (OpenSearch/Loki) for the last N days.
Recommendation: keep an online layer (7-14 days hot) and an archive (months/years) in the lake, with the ability to rehydrate.
4) Schema and log format (recommendation)
Minimum JSON format:
```json
{
  "ts": "2025-11-01T13:45:12.345Z",
  "level": "ERROR",
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central",
  "tenant": "tr",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "request_id": "r-7f2c",
  "user_id": "",
  "route": "/v1/payments/charge",
  "code": "PSP_TIMEOUT",
  "latency_ms": 1200,
  "msg": "upstream PSP timeout",
  "kv": {"provider": "psp-a", "attempt": 2, "timeout_ms": 800}
}
```
(user_id is masked at the edge.)
Standards: RFC3339 for time, level from the set 'TRACE/DEBUG/INFO/WARN/ERROR/FATAL', keys in snake_case.
5) Logging levels and sampling
DEBUG - only in dev/stage; in prod by flag and with TTL.
INFO - life cycle of requests/events.
WARN - suspicious situations without affecting SLO.
ERROR/FATAL - impact on a request/user.
Sampling and noise control (a rate-limit sketch follows this list):
- rate-limit for repeated errors (for example, 1/sec/key);
- tail-sampling of traces (keep full logs/traces only for "bad" requests);
- dynamic: during an error storm, reduce detail but keep summaries.
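A minimal sketch of the 1/sec/key rate limit as a standard-library logging filter; the key choice (logger name plus message template) is an assumption, and dropped repeats are summarized on the next emitted record:

```python
import logging
import time
from collections import defaultdict

class RateLimitFilter(logging.Filter):
    """Pass at most one record per key per interval; count what was dropped."""
    def __init__(self, interval_s=1.0):
        super().__init__()
        self.interval_s = interval_s
        self.last_emit = {}
        self.dropped = defaultdict(int)

    def filter(self, record):
        key = (record.name, record.msg)      # msg = the unformatted template
        now = time.monotonic()
        if now - self.last_emit.get(key, 0.0) >= self.interval_s:
            self.last_emit[key] = now
            if self.dropped[key]:
                # summarize suppressed repeats on the next record that passes
                record.msg = f"{record.msg} (dropped {self.dropped[key]} repeats)"
                self.dropped[key] = 0
            return True
        self.dropped[key] += 1
        return False

logging.getLogger("payments-api").addFilter(RateLimitFilter(interval_s=1.0))
```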
6) Delivery of logs (agents and shippers)
On the node: Fluent Bit/Filebeat/Promtail collect stdout, files, and journald; they parse, mask, and buffer.
Network queues: Kafka/NATS for peak smoothing, retries, and ordering.
Reliability: backpressure, disk buffers, delivery acknowledgements (at-least-once), idempotent indexing via key hashes (see the sketch below).
Filtering at the edge: discard "chatter" and secrets before they hit the network.
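At-least-once delivery means duplicates will arrive; a minimal sketch of the key-hash idea, deriving a deterministic document _id so a redelivered event overwrites itself instead of indexing twice (the chosen key fields are an assumption):

```python
import hashlib

def doc_id(event: dict) -> str:
    """Deterministic _id: same event -> same id -> a duplicate becomes an upsert."""
    key = "|".join(str(event.get(k, "")) for k in ("ts", "service", "request_id", "msg"))
    return hashlib.sha256(key.encode()).hexdigest()[:32]

event = {"ts": "2025-11-01T13:45:12.345Z", "service": "payments-api",
         "request_id": "r-7f2c", "msg": "upstream PSP timeout"}
# index with an explicit id, e.g. PUT logs-2025.11.01/_doc/<doc_id(event)>
print(doc_id(event))
```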
7) Indexing and storage
Time partitioning (daily/weekly) + by 'env/region/tenant' (via index templates or labels).
Storage layers:
- Hot (SSD, 3-14 days): fast search and alerts.
- Warm (HDD / frozen tier, 30-90 days): queried occasionally.
- Cold/Archive (object, months/years): compliance and rare investigations.
- Compression and rotation: ILM/ISM (life cycle policies), gzip/zstd, downsampling (aggregation tables).
- Rehydration: temporarily loading archived batches into the "hot" cluster for an investigation (see the sketch below).
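A minimal rehydration sketch under assumed names (an S3 bucket logs-archive, gzipped NDJSON batches, an OpenSearch hot cluster); real setups may instead restore snapshots or use searchable snapshots:

```python
import gzip
import json
import boto3                                  # assumption: archive lives in S3
from opensearchpy import OpenSearch, helpers  # assumption: hot layer is OpenSearch

s3 = boto3.client("s3")
os_client = OpenSearch(hosts=["https://opensearch.svc:9200"])

def rehydrate(bucket: str, key: str, index: str):
    """Load one archived NDJSON.gz batch into a temporary hot index."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    events = (json.loads(line) for line in gzip.decompress(body).splitlines() if line)
    helpers.bulk(os_client, ({"_index": index, "_source": e} for e in events))

rehydrate("logs-archive", "payments-api/2025/06/01/batch-0001.ndjson.gz",
          "rehydrated-payments-2025-06-01")  # delete the index after the investigation
```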
8) Search & Analytics: Sample Queries
Incident: time filter × 'service=...' × 'level>=ERROR' × 'trace_id'/'request_id' (see the query sketch after this list).
Providers: 'code:PSP_*' and 'kv.provider:psp-a', grouped by region.
Anomalies: a rise in message frequency or a shift in field distributions (ML detectors, rule-based).
Audit: 'category:audit' + 'actor'/'resource' + result.
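The incident query above expressed through the opensearch-py client, a sketch assuming indices matching logs-* and the field names from section 4:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch.svc:9200"])

# time window x service x level>=ERROR x trace_id (values are illustrative)
resp = client.search(index="logs-*", body={
    "query": {"bool": {"filter": [
        {"range": {"ts": {"gte": "now-1h"}}},
        {"term": {"service": "payments-api"}},
        {"terms": {"level": ["ERROR", "FATAL"]}},   # 'level >= ERROR'
        {"term": {"trace_id": "0af7651916cd43dd8448eb211c80319c"}},
    ]}},
    "sort": [{"ts": "asc"}],
    "size": 100,
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["ts"], hit["_source"]["msg"])
```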
9) Correlation with metrics and traces
Identical identifiers: 'trace_id'/'span_id' in all three signals (metrics, logs, traces); a sketch of injecting them follows this list.
Links from graphs: clickable drill-down from a p99 panel to the logs filtered by 'trace_id'.
Release annotations: versions/canaries in metrics and logs for quick attribution.
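A minimal sketch of stamping the active OpenTelemetry trace/span ids onto every log record (assumes the opentelemetry-api package and an instrumented app); the formatter from section 2 would then pick up record.trace_id:

```python
import logging
from opentelemetry import trace  # assumption: app is instrumented with OpenTelemetry

class TraceContextFilter(logging.Filter):
    """Attach the active trace/span ids to every record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else None
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else None
        return True

logging.getLogger("payments-api").addFilter(TraceContextFilter())
```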
10) Safety, PII and Compliance
Field classification: PII/secrets/financial data - mask or delete at ingest (Fluent Bit/Lua filters, RE2; an in-process masking sketch follows this list).
RBAC/ABAC: index/label access by role, row-/field-level-security.
Immutability (WORM/append-only) for audit and regulatory requirements.
Retention and the "right to be forgotten": TTL/deletion by key, tokenization of 'user_id'.
Signatures/hashes: integrity of critical journals (admin actions, payments).
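Edge filters (like the Fluent Bit example in section 15) can drop whole fields; a minimal in-process sketch for masking card-like values inside strings, where the regex is deliberately coarse and purely an assumption:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")  # coarse PAN-like pattern (assumption)

def mask_pii(event: dict) -> dict:
    """Mask card-like digit runs in string fields; drop known secret keys outright."""
    masked = {}
    for k, v in event.items():
        if k in ("password", "credit_card", "card_number"):
            continue                             # never ship these at all
        masked[k] = CARD_RE.sub("****MASKED****", v) if isinstance(v, str) else v
    return masked

print(mask_pii({"msg": "charge card 4111 1111 1111 1111 failed", "password": "x"}))
```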
11) SLO and pipeline log metrics
Delivery: 99.9% of events reach the hot layer in ≤ 30-60 seconds.
Losses: < 0.01% over 24 hours (measured against reference marks; a canary sketch follows this list).
Search availability: ≥ 99.9% over 28 days.
Query latency: p95 ≤ 2-5 seconds on typical filters.
Cost: $/1M events and $/GB stored per layer.
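A minimal sketch of the reference-mark (canary) emitter behind the loss SLO: sequenced events travel the normal pipeline, and a checker later counts gaps in seq and measures ingest lag per mark (the checker mirrors the search sketch in section 8 and is not shown):

```python
import json
import logging
import sys
import time

log = logging.getLogger("log-canary")
log.addHandler(logging.StreamHandler(sys.stdout))
log.setLevel(logging.INFO)

# One sequenced mark per second through the regular shipping path.
# Loss = missing seq values in the hot layer; lag = indexed_at - ts.
for seq in range(3600):
    log.info(json.dumps({"ts": time.time(), "category": "canary",
                         "service": "log-canary", "seq": seq}))
    time.sleep(1)
```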
12) Dashboards (minimum)
Pipeline health: shipper input/output rates, retries, buffer fill, Kafka lag.
Errors by service/code: top N, trends, 'latency_ms' percentiles.
Audit activity: admin actions, provider errors, access.
Economics: volume/day, index growth, cost per layer, "expensive" queries.
13) Operations and playbooks
Log storm: enable aggressive sampling/rate-limit on the agent, raise buffers, temporarily transfer part of the stream to warm.
Schema drift: alert on the appearance of new keys/types, reconcile against the schema catalog.
Slow search: rebuilding indexes, increasing replicas, analyzing "heavy" queries, archiving old batches.
Security incident: immediately enable immutability, offload artifacts, restrict access by role, run the RCA.
14) FinOps: how not to go broke on the logs
Reduce verbosity: fold a multi-line stacktrace into a single 'stack' field and sample repeats (see the sketch after this list).
Enable TTL: different per 'env'/'level'/'category'.
Use Loki/archive + on-demand rehydration for rarely accessed data.
Batching and compression: larger batches are cheaper, but watch the search SLA.
Materialize frequent computations (daily aggregates).
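A minimal sketch of the first item: the whole traceback becomes one 'stack' string field instead of dozens of raw lines, using the standard formatter hook:

```python
import json
import logging
import sys

class OneLineJson(logging.Formatter):
    """Put the whole traceback into a single 'stack' field instead of raw lines."""
    def format(self, record):
        out = {"level": record.levelname, "msg": record.getMessage()}
        if record.exc_info:
            out["stack"] = self.formatException(record.exc_info)  # one string field
        return json.dumps(out)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(OneLineJson())
log = logging.getLogger("payments-api")
log.addHandler(handler)

try:
    1 / 0
except ZeroDivisionError:
    log.exception("charge failed")  # traceback lands in 'stack', not as 20 log lines
```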
15) Instrumental examples
Fluent Bit (masking and shipping to OpenSearch):
```ini
[INPUT]
    Name               tail
    Path               /var/log/app/*.log
    Parser             json
    Mem_Buf_Limit      256MB

[FILTER]
    # record_modifier drops sensitive fields before they leave the node
    Name               record_modifier
    Match              *
    Remove_key         credit_card
    Remove_key         password

[OUTPUT]
    Name               es
    Match              *
    Host               opensearch.svc
    Port               9200
    Logstash_Format    On
    Logstash_Prefix    logs
    Suppress_Type_Name On
```
Nginx access log in JSON with trace_id:
```nginx
# $http_trace_id is the value of the incoming Trace-Id request header
log_format json escape=json '{"ts":"$time_iso8601","remote":"$remote_addr",'
    '"method":"$request_method","path":"$uri","status":$status,'
    '"bytes":$body_bytes_sent,"ua":"$http_user_agent","trace_id":"$http_trace_id"}';
access_log /var/log/nginx/access.json json;
```
Index lifecycle policy, hot→warm→delete (Elasticsearch ILM format; OpenSearch ISM expresses the same as states and transitions):
```json
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "7d", "max_size": "50gb" } } },
      "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
```
16) Implementation checklist
- Accepted field layout and log levels; trace/request-id correlation is enabled.
- Configured agents (Fluent Bit/Promtail) with masking and buffers.
- Online layer (OpenSearch/Loki/Cloud) and archive (S3/GCS + parquet) selected.
- ILM/ISM + hot/warm/cold retention policies, rehydration process.
- RBAC/ABAC, audit immutability, access log.
- Pipeline dashboards; alerts on losses, lag, and disk buffers.
- Playbooks: log storm, schema drift, slow search, security incident.
- Financial limits: $/1M events, quotas for "expensive" queries.
17) Anti-patterns
Text logs without structure → impossible to filter or aggregate.
Giant stacktraces at INFO → volume explosion.
No correlation → scrambling across every service by hand.
Storing "everything forever" → a runaway cloud bill.
Secrets/PII in the logs → compliance risks.
Manual index edits in prod → drift and long search downtime.
18) The bottom line
Log centralization is a system, not just a stack. Standardized schema, correlation, secure shippers, layered storage, and strict access policies turn logs into a powerful tool for SRE, security, and product. Correct retentions and FinOps keep the budget, and pipeline SLOs and playbooks make investigations fast and reproducible.