Centralization of logs
1) Why centralize logs
Centralized logs are the foundation of observability, audit, and compliance. They:
- speed up the search for incident root causes (correlation by request_id/trace_id);
- allow you to build signal alerts on symptoms (errors, anomalies);
- give an audit trail (who/when/what did it);
- lower costs by unifying retention and storage.
2) Basic principles
1. Only structured logs (JSON/RFC5424) - no "free-text" without keys.
2. Uniform scheme of keys: 'ts, level, service, env, region, tenant, trace_id, span_id, request_id, user_id (masked), msg, kv...'.
3. Correlation by default: propagate trace_id from the gateway through the backends and into the logs (a minimal logger sketch follows this list).
4. Noise minimization: correct levels, sampling, deduplication of repeated messages.
5. Safety by design: PII masking, RBAC/ABAC, immutability.
6. Economy: hot/warm/cold, compression, aggregation, TTL and rehydration.
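A minimal sketch of principles 1-3 in Python, using only the standard library; the service name, ids, and field values are illustrative assumptions, and the keys follow the scheme above:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON line using the uniform key scheme."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                  + f".{int(record.msecs):03d}Z",                 # RFC3339, UTC
            "level": record.levelname,
            "service": "payments-api",     # assumption: injected from deploy config
            "env": "prod",
            "trace_id": getattr(record, "trace_id", None),   # propagated, never invented
            "request_id": getattr(record, "request_id", None),
            "msg": record.getMessage(),
            "kv": getattr(record, "kv", {}),
        })

handler = logging.StreamHandler(sys.stdout)  # stdout, so the node agent can tail it
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments-api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# trace_id/request_id arrive with the request (e.g., from the gateway) and pass through
log.error("upstream PSP timeout",
          extra={"trace_id": "0af7651916cd43dd8448eb211c80319c",
                 "request_id": "r-7f2c",
                 "kv": {"provider": "psp-a", "attempt": 2}})
```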
3) Typical architectures
EFK/ELK: (Fluent Bit/Fluentd/Filebeat) → (Kafka, optional) → (Elasticsearch/OpenSearch) → (Kibana/OpenSearch Dashboards). Universal search and aggregation.
Loki-like (log indexing by labels): Promtail/Fluent Bit → Loki → Grafana. Cheaper at large volumes; powerful label filtering plus linear scanning of raw lines.
Cloud: CloudWatch/Cloud Logging/Log Analytics + export to cold storage (S3/GCS/ADLS) and/or SIEM.
Data Lake approach: shippers → object storage (parquet/iceberg) → cheap analytical queries (Athena/BigQuery/Spark) + online layer (OpenSearch/Loki) for the last N days.
Recommendation: keep an online layer (7-14 days hot) and an archive (months/years) in the lake, with the ability to rehydrate.
4) Schema and log format (recommendation)
Minimum JSON format:
```json
{
  "ts": "2025-11-01T13:45:12.345Z",
  "level": "ERROR",
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central",
  "tenant": "tr",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "request_id": "r-7f2c",
  "user_id": "",
  "route": "/v1/payments/charge",
  "code": "PSP_TIMEOUT",
  "latency_ms": 1200,
  "msg": "upstream PSP timeout",
  "kv": {"provider": "psp-a", "attempt": 2, "timeout_ms": 800}
}
```
(user_id is masked at the edge.)
Standards: RFC3339 for time, level from the set 'TRACE/DEBUG/INFO/WARN/ERROR/FATAL', keys in snake_case.
5) Logging levels and sampling
DEBUG - only in dev/stage; in prod by flag and with TTL.
INFO - life cycle of requests/events.
WARN - suspicious situations without affecting SLO.
ERROR/FATAL - impact on a request/user.
Sampling and noise control (a rate-limit sketch follows this list):
- rate-limit for repeated errors (for example, 1/sec/key);
- tail-sampling of traces (keep full logs/traces only for "bad" requests);
- dynamic: during an error storm, reduce detail but keep summaries.
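A minimal sketch of the 1/sec/key rate limit as a standard-library logging filter; the key choice (logger name plus message template) is an assumption, and dropped repeats are summarized on the next emitted record:

```python
import logging
import time
from collections import defaultdict

class RateLimitFilter(logging.Filter):
    """Pass at most one record per key per interval; count what was dropped."""
    def __init__(self, interval_s=1.0):
        super().__init__()
        self.interval_s = interval_s
        self.last_emit = {}
        self.dropped = defaultdict(int)

    def filter(self, record):
        key = (record.name, record.msg)      # msg = the unformatted template
        now = time.monotonic()
        if now - self.last_emit.get(key, 0.0) >= self.interval_s:
            self.last_emit[key] = now
            if self.dropped[key]:
                # summarize suppressed repeats on the next record that passes
                record.msg = f"{record.msg} (dropped {self.dropped[key]} repeats)"
                self.dropped[key] = 0
            return True
        self.dropped[key] += 1
        return False

logging.getLogger("payments-api").addFilter(RateLimitFilter(interval_s=1.0))
```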
6) Delivery of logs (agents and shippers)
On the node: Fluent Bit/Filebeat/Promtail collect stdout, files, and journald; they parse, mask, and buffer.
Network queues: Kafka/NATS for peak smoothing, retries, and ordering.
Reliability: backpressure, disk buffers, delivery acknowledgements (at-least-once), idempotent indexing via key hashes (see the sketch below).
Filtering at the edge: discard "chatter" and secrets before they hit the network.
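At-least-once delivery means duplicates will arrive; a minimal sketch of the key-hash idea, deriving a deterministic document _id so a redelivered event overwrites itself instead of indexing twice (the chosen key fields are an assumption):

```python
import hashlib

def doc_id(event: dict) -> str:
    """Deterministic _id: same event -> same id -> a duplicate becomes an upsert."""
    key = "|".join(str(event.get(k, "")) for k in ("ts", "service", "request_id", "msg"))
    return hashlib.sha256(key.encode()).hexdigest()[:32]

event = {"ts": "2025-11-01T13:45:12.345Z", "service": "payments-api",
         "request_id": "r-7f2c", "msg": "upstream PSP timeout"}
# index with an explicit id, e.g. PUT logs-2025.11.01/_doc/<doc_id(event)>
print(doc_id(event))
```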
7) Indexing and storage
Time partitioning (daily/weekly) + by 'env/region/tenant' (via index templates or labels).
Storage layers:
- Hot (SSD, 3-14 days): fast search and alerts.
- Warm (HDD / frozen tier, 30-90 days): queried occasionally.
- Cold/Archive (object, months/years): compliance and rare investigations.
- Compression and rotation: ILM/ISM (life cycle policies), gzip/zstd, downsampling (aggregation tables).
- Rehydration: temporarily loading archived batches into the "hot" cluster for an investigation (see the sketch below).
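A minimal rehydration sketch under assumed names (an S3 bucket logs-archive, gzipped NDJSON batches, an OpenSearch hot cluster); real setups may instead restore snapshots or use searchable snapshots:

```python
import gzip
import json
import boto3                                  # assumption: archive lives in S3
from opensearchpy import OpenSearch, helpers  # assumption: hot layer is OpenSearch

s3 = boto3.client("s3")
os_client = OpenSearch(hosts=["https://opensearch.svc:9200"])

def rehydrate(bucket: str, key: str, index: str):
    """Load one archived NDJSON.gz batch into a temporary hot index."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    events = (json.loads(line) for line in gzip.decompress(body).splitlines() if line)
    helpers.bulk(os_client, ({"_index": index, "_source": e} for e in events))

rehydrate("logs-archive", "payments-api/2025/06/01/batch-0001.ndjson.gz",
          "rehydrated-payments-2025-06-01")  # delete the index after the investigation
```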
8) Search & Analytics: Sample Queries
Incident: time filter × 'service=...' × 'level>=ERROR' × 'trace_id'/'request_id' (see the query sketch after this list).
Providers: 'code:PSP_*' and 'kv.provider:psp-a', grouped by region.
Anomalies: a rise in message frequency or a shift in field distributions (ML detectors, rule-based).
Audit: 'category:audit' + 'actor'/'resource' + result.
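The incident query above expressed through the opensearch-py client, a sketch assuming indices matching logs-* and the field names from section 4:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://opensearch.svc:9200"])

# time window x service x level>=ERROR x trace_id (values are illustrative)
resp = client.search(index="logs-*", body={
    "query": {"bool": {"filter": [
        {"range": {"ts": {"gte": "now-1h"}}},
        {"term": {"service": "payments-api"}},
        {"terms": {"level": ["ERROR", "FATAL"]}},   # 'level >= ERROR'
        {"term": {"trace_id": "0af7651916cd43dd8448eb211c80319c"}},
    ]}},
    "sort": [{"ts": "asc"}],
    "size": 100,
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["ts"], hit["_source"]["msg"])
```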
9) Correlation with metrics and traces
Identical identifiers: 'trace_id'/'span_id' in all three signals (metrics, logs, traces); a sketch of injecting them follows this list.
Links from graphs: clickable drill-down from a p99 panel to the logs filtered by 'trace_id'.
Release annotations: versions/canaries in metrics and logs for quick attribution.
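A minimal sketch of stamping the active OpenTelemetry trace/span ids onto every log record (assumes the opentelemetry-api package and an instrumented app); the formatter from section 2 would then pick up record.trace_id:

```python
import logging
from opentelemetry import trace  # assumption: app is instrumented with OpenTelemetry

class TraceContextFilter(logging.Filter):
    """Attach the active trace/span ids to every record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else None
        record.span_id = format(ctx.span_id, "016x") if ctx.is_valid else None
        return True

logging.getLogger("payments-api").addFilter(TraceContextFilter())
```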
10) Safety, PII and Compliance
Field classification: PII/secrets/financial data - mask or delete at ingest (Fluent Bit/Lua filters, RE2; an in-process masking sketch follows this list).
RBAC/ABAC: index/label access by role, row-/field-level-security.
Immutability (WORM/append-only) for audit and regulatory requirements.
Retention and the "right to be forgotten": TTL/deletion by key, tokenization of 'user_id'.
Signatures/hashes: integrity of critical journals (admin actions, payments).
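Edge filters (like the Fluent Bit example in section 15) can drop whole fields; a minimal in-process sketch for masking card-like values inside strings, where the regex is deliberately coarse and purely an assumption:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")  # coarse PAN-like pattern (assumption)

def mask_pii(event: dict) -> dict:
    """Mask card-like digit runs in string fields; drop known secret keys outright."""
    masked = {}
    for k, v in event.items():
        if k in ("password", "credit_card", "card_number"):
            continue                             # never ship these at all
        masked[k] = CARD_RE.sub("****MASKED****", v) if isinstance(v, str) else v
    return masked

print(mask_pii({"msg": "charge card 4111 1111 1111 1111 failed", "password": "x"}))
```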
11) SLO and pipeline log metrics
Delivery: 99.9% of events reach the hot layer in ≤ 30-60 seconds.
Losses: < 0.01% over 24 hours (measured against reference marks; a canary sketch follows this list).
Search availability: ≥ 99.9% over 28 days.
Query latency: p95 ≤ 2-5 seconds on typical filters.
Cost: $/1M events and $/GB stored per layer.
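A minimal sketch of the reference-mark (canary) emitter behind the loss SLO: sequenced events travel the normal pipeline, and a checker later counts gaps in seq and measures ingest lag per mark (the checker mirrors the search sketch in section 8 and is not shown):

```python
import json
import logging
import sys
import time

log = logging.getLogger("log-canary")
log.addHandler(logging.StreamHandler(sys.stdout))
log.setLevel(logging.INFO)

# One sequenced mark per second through the regular shipping path.
# Loss = missing seq values in the hot layer; lag = indexed_at - ts.
for seq in range(3600):
    log.info(json.dumps({"ts": time.time(), "category": "canary",
                         "service": "log-canary", "seq": seq}))
    time.sleep(1)
```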
12) Dashboards (minimum)
Pipeline health: shipper input/output rates, retries, buffer fill, Kafka lag.
Errors by service/code: top N, trends, 'latency_ms' percentiles.
Audit activity: admin actions, provider errors, access.
Economics: volume/day, index growth, cost per layer, "expensive" queries.
13) Operations and playbooks
Log storm: enable aggressive sampling/rate-limit on the agent, raise buffers, temporarily transfer part of the stream to warm.
Schema drift: alert on the appearance of new keys/types, reconcile against the schema catalog.
Slow search: rebuilding indexes, increasing replicas, analyzing "heavy" queries, archiving old batches.
Security incident: immediately enable immutability, offload artifacts, restrict access by role, run the RCA.
14) FinOps: how not to go broke on the logs
Reduce verbosity: fold a multi-line stacktrace into a single 'stack' field and sample repeats (see the sketch after this list).
Enable TTL: different per 'env'/'level'/'category'.
Use Loki/archive + on-demand rehydration for rarely accessed data.
Batching and compression: larger batches are cheaper, but watch the search SLA.
Materialize frequent computations (daily aggregates).
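A minimal sketch of the first item: the whole traceback becomes one 'stack' string field instead of dozens of raw lines, using the standard formatter hook:

```python
import json
import logging
import sys

class OneLineJson(logging.Formatter):
    """Put the whole traceback into a single 'stack' field instead of raw lines."""
    def format(self, record):
        out = {"level": record.levelname, "msg": record.getMessage()}
        if record.exc_info:
            out["stack"] = self.formatException(record.exc_info)  # one string field
        return json.dumps(out)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(OneLineJson())
log = logging.getLogger("payments-api")
log.addHandler(handler)

try:
    1 / 0
except ZeroDivisionError:
    log.exception("charge failed")  # traceback lands in 'stack', not as 20 log lines
```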
15) Instrumental examples
Fluent Bit (masking and shipping to OpenSearch):
```ini
[INPUT]
    Name               tail
    Path               /var/log/app/*.log
    Parser             json
    Mem_Buf_Limit      256MB

[FILTER]
    # record_modifier drops sensitive fields before they leave the node
    Name               record_modifier
    Match              *
    Remove_key         credit_card
    Remove_key         password

[OUTPUT]
    Name               es
    Match              *
    Host               opensearch.svc
    Port               9200
    Logstash_Format    On
    Logstash_Prefix    logs
    Suppress_Type_Name On
```
Nginx access log in JSON with trace_id:
```nginx
# $http_trace_id is the value of the incoming Trace-Id request header
log_format json escape=json '{"ts":"$time_iso8601","remote":"$remote_addr",'
    '"method":"$request_method","path":"$uri","status":$status,'
    '"bytes":$body_bytes_sent,"ua":"$http_user_agent","trace_id":"$http_trace_id"}';
access_log /var/log/nginx/access.json json;
```
Index lifecycle policy, hot→warm→delete (Elasticsearch ILM format; OpenSearch ISM expresses the same as states and transitions):
```json
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "7d", "max_size": "50gb" } } },
      "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}
```
16) Implementation checklist
- Accepted field layout and log levels; trace/request-id correlation is enabled.
- Configured agents (Fluent Bit/Promtail) with masking and buffers.
- Online layer (OpenSearch/Loki/Cloud) and archive (S3/GCS + parquet) selected.
- ILM/ISM + hot/warm/cold retention policies, rehydration process.
- RBAC/ABAC, audit immutability, access log.
- Pipeline dashboards; alerts on losses, lag, and disk buffers.
- Playbooks: log storm, schema drift, slow search, security incident.
- Financial limits: $/1M events, quotas for "expensive" queries.
17) Anti-patterns
Text logs without structure → impossible to filter or aggregate.
Giant stacktraces at INFO → volume explosion.
No correlation → scrambling across every service by hand.
Storing "everything forever" → a runaway cloud bill.
Secrets/PII in the logs → compliance risks.
Manual index edits in prod → drift and long search downtime.
18) The bottom line
Log centralization is a system, not just a stack. Standardized schema, correlation, secure shippers, layered storage, and strict access policies turn logs into a powerful tool for SRE, security, and product. Correct retentions and FinOps keep the budget, and pipeline SLOs and playbooks make investigations fast and reproducible.