GH GambleHub

Log Centralization

1) Why centralize logs

Centralized logs are the foundation of observability, audit, and compliance. They:
  • speed up root-cause search during incidents (correlation by request_id/trace_id);
  • let you build alerts on symptoms (errors, anomalies);
  • provide an audit trail (who did what, and when);
  • lower cost through unified retention and storage.

2) Basic principles

1. Only structured logs (JSON/RFC 5424) - no free text without keys.
2. A uniform key scheme: 'ts, level, service, env, region, tenant, trace_id, span_id, request_id, user_id (masked), msg, kv...'.
3. Correlation by default: propagate trace_id from the gateway to backends and into the logs.
4. Noise minimization: correct levels, sampling, deduplication of repeats.
5. Security by design: PII masking, RBAC/ABAC, immutability.
6. Economy: hot/warm/cold tiers, compression, aggregation, TTL, and rehydration.
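The first three principles can be sketched with Python's standard `logging` module: a formatter that emits one JSON object per line using the key scheme above, with the trace context carried on the record. The service/env values and the `kv` convention are illustrative assumptions, not a fixed standard.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line using the uniform key scheme."""

    def __init__(self, service, env, region):
        super().__init__()
        self.static = {"service": service, "env": env, "region": region}

    def format(self, record):
        doc = {
            # RFC 3339 timestamp in UTC with millisecond precision
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
                  + f".{int(record.msecs):03d}Z",
            "level": record.levelname,
            **self.static,
            "trace_id": getattr(record, "trace_id", ""),
            "msg": record.getMessage(),
        }
        # Extra structured context travels under "kv"
        kv = getattr(record, "kv", None)
        if kv:
            doc["kv"] = kv
        return json.dumps(doc, separators=(",", ":"))

logger = logging.getLogger("payments-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payments-api", "prod", "eu-central"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("upstream PSP timeout",
             extra={"trace_id": "0af7651916cd43dd", "kv": {"attempt": 2}})
```

In a real service the trace_id would come from the incoming request context (e.g., W3C `traceparent`) rather than be passed per call.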


3) Typical architectures

EFK/ELK: (Fluent Bit/Fluentd/Filebeat) → (Kafka — optional) → (Elasticsearch/OpenSearch) → (Kibana/OpenSearch Dashboards). Universal search and aggregation.
Loki-style (log indexing by labels): Promtail/Fluent Bit → Loki → Grafana. Cheaper at large volumes; powerful label filtering plus linear log viewing.
Cloud: CloudWatch/Cloud Logging/Log Analytics + export to cold storage (S3/GCS/ADLS) and/or a SIEM.
Data-lake approach: shippers → object storage (Parquet/Iceberg) → cheap analytical queries (Athena/BigQuery/Spark) + an online layer (OpenSearch/Loki) for the last N days.

Recommendation: keep the online layer hot for 7-14 days and the archive (months/years) in the lake, with the ability to rehydrate.


4) Diagram and format of logs (recommendation)

Minimum JSON format:
json
{
  "ts": "2025-11-01T13:45:12.345Z",
  "level": "ERROR",
  "service": "payments-api",
  "env": "prod",
  "region": "eu-central",
  "tenant": "tr",
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331",
  "request_id": "r-7f2c",
  "user_id": "",        // masked
  "route": "/v1/payments/charge",
  "code": "PSP_TIMEOUT",
  "latency_ms": 1200,
  "msg": "upstream PSP timeout",
  "kv": {"provider": "psp-a", "attempt": 2, "timeout_ms": 800}
}

Standards: RFC 3339 timestamps, level from the set 'TRACE/DEBUG/INFO/WARN/ERROR/FATAL', keys in snake_case.


5) Logging levels and sampling

DEBUG - only in dev/stage; in prod behind a flag and with a TTL.
INFO - request/event lifecycle.
WARN - suspicious situations with no SLO impact.
ERROR/FATAL - impact on a request or user.

Sampling:
  • rate-limit repeated errors (e.g., 1/sec per key);
  • tail-sampling of traces (keep full logs/traces only for "bad" requests);
  • dynamic: during an error storm, reduce detail and keep summaries.
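The 1/sec/key rule can be sketched as a small in-process sampler; in practice this is usually done at the agent (e.g., a throttle filter in the shipper), but the logic is the same. The class and its field names are illustrative.

```python
import time
from collections import defaultdict

class RateLimitSampler:
    """Pass at most `limit` events per key per one-second window;
    count the rest so a summary can be emitted instead of raw repeats."""

    def __init__(self, limit=1, now=time.monotonic):
        self.limit = limit
        self.now = now                    # injectable clock for testing
        self.window = {}                  # key -> (window_start_second, passed_count)
        self.dropped = defaultdict(int)   # key -> suppressed event count

    def allow(self, key):
        sec = int(self.now())
        start, count = self.window.get(key, (sec, 0))
        if start != sec:                  # new second: reset the window
            start, count = sec, 0
        if count < self.limit:
            self.window[key] = (start, count + 1)
            return True
        self.window[key] = (start, count)
        self.dropped[key] += 1            # keep a summary, drop the repeat
        return False
```

A periodic flush of `dropped` as a single "suppressed N repeats" event preserves the signal without the volume.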

6) Delivery of logs (agents and shippers)

On the node: Fluent Bit/Filebeat/Promtail collect stdout, files, and journald; they parse, mask, and buffer.
Network queues: Kafka/NATS for peak smoothing, retries, and ordering.
Reliability: backpressure, disk buffers, delivery acknowledgements (at-least-once), idempotent indexing (key hash).
Edge filtering: drop "chatter" and secrets before they hit the network.
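At-least-once delivery means duplicates will arrive; a deterministic document ID makes re-indexing a duplicate an overwrite instead of a second copy. A minimal sketch, assuming the identity fields are (ts, trace_id, span_id, msg) — pick whatever uniquely identifies an event in your scheme.

```python
import hashlib
import json

def event_doc_id(event: dict) -> str:
    """Deterministic document ID for at-least-once pipelines.

    Re-delivered duplicates hash to the same ID, so the index
    deduplicates them. Only identity fields participate; volatile
    fields (latency, counters) are deliberately excluded."""
    basis = json.dumps(
        {k: event.get(k) for k in ("ts", "trace_id", "span_id", "msg")},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(basis.encode()).hexdigest()[:32]
```

The same idea applies whether the sink is OpenSearch (`_id`), a keyed Kafka topic, or an object-store batch key.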


7) Indexing and storage

Time partitioning (daily/weekly) + by 'env/region/tenant' (via index templates or labels).

Storage layers:
  • Hot (SSD, 3-14 days): fast search and alerts.
  • Warm (HDD/frozen tier, 30-90 days): occasional lookups.
  • Cold/Archive (object storage, months/years): compliance and rare investigations.
  • Compression and rotation: ILM/ISM (lifecycle policies), gzip/zstd, downsampling (aggregate tables).
  • Rehydration: temporarily loading archived batches into a "hot" cluster for an investigation.
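Time plus env/region partitioning usually shows up as a naming convention that lifecycle policies can then act on by age. A sketch of such a convention (the exact name format is an assumption; match it to your index templates):

```python
from datetime import datetime, timezone

def index_name(env: str, region: str, ts: str) -> str:
    """Daily, env/region-partitioned index name,
    e.g. 'logs-prod-eu-central-2025.11.01'.

    ILM/ISM then rolls whole daily indices hot -> warm -> delete
    by age, instead of deleting individual documents."""
    day = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)
    return f"logs-{env}-{region}-{day:%Y.%m.%d}"
```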

8) Search & Analytics: Sample Queries

Incident: time filter × 'service = ...' × 'level >= ERROR' × 'trace_id'/'request_id'.
Providers: 'code:PSP_*' and 'kv.provider:psp-a', grouped by region.
Anomalies: a rise in message frequency or a shift in field distributions (ML detectors, rule-based).
Audit: 'category:audit' + 'actor'/'resource' + result.
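The incident pattern translates into an Elasticsearch/OpenSearch-style bool query. A sketch, assuming the field names from the scheme in section 4 ('ts', 'service', 'level', 'trace_id'); adjust to your actual mapping.

```python
def incident_query(service: str, trace_id: str, since: str, until: str) -> dict:
    """Bool query for: time window x service x level >= ERROR x trace_id.

    'level >= ERROR' is expressed as a terms filter because levels
    are stored as strings, not ordered numbers."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"ts": {"gte": since, "lte": until}}},
                    {"term": {"service": service}},
                    {"terms": {"level": ["ERROR", "FATAL"]}},
                    {"term": {"trace_id": trace_id}},
                ]
            }
        },
        "sort": [{"ts": "asc"}],
    }
```

Using `filter` (not `must`) skips scoring and lets the engine cache the clauses, which matters on hot incident dashboards.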


9) Correlation with metrics and traces

Shared identifiers: 'trace_id/span_id' in all three signals (metrics, logs, traces).
Links from graphs: clickable drill-down from a p99 panel to the logs filtered by 'trace_id'.
Release annotations: versions/canaries in metrics and logs for quick attribution.


10) Safety, PII and Compliance

Field classification: PII/secrets/finance - mask or drop at ingest (Fluent Bit/Lua filters, RE2).
RBAC/ABAC: index/label access by role, row-/field-level security.
Immutability (WORM/append-only) for audit and regulatory requirements.
Retention and the "right to be forgotten": TTL/deletion by key, tokenization of 'user_id'.
Signatures/hashes: integrity of critical journals (admin actions, payments).
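Tokenizing 'user_id' can be done with a keyed hash: the same user always maps to the same token, so correlation across events still works, but the raw ID never reaches the log store. A minimal sketch; the `u-` prefix and token length are arbitrary choices.

```python
import hashlib
import hmac

def tokenize_user_id(user_id: str, secret: bytes) -> str:
    """Stable keyed token for a user ID (HMAC-SHA256, truncated).

    Destroying or rotating `secret` makes all past tokens
    unlinkable at once - one way to implement the
    "right to be forgotten" without rewriting the archive."""
    digest = hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()
    return "u-" + digest[:16]
```

A plain unkeyed hash would not suffice here: user IDs are low-entropy and could be brute-forced back from the hash.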


11) SLO and pipeline log metrics

Delivery: 99.9% of events reach the hot layer in ≤ 30-60 seconds.
Loss: < 0.01% over 24 hours (measured against reference marks).
Search availability: ≥ 99.9% over 28 days.
Query latency: p95 ≤ 2-5 seconds on typical filters.
Cost: $/1M events and $/GB stored per tier.


12) Dashboards (minimum)

Pipeline health: shipper in/out rates, retries, buffer fill, Kafka lag.
Errors by service/code: top N, trends, 'latency_ms' percentiles.
Audit activity: admin actions, provider errors, access events.
Economics: volume/day, index growth, cost per tier, "expensive" queries.


13) Operations and playbooks

Log storm: enable aggressive sampling/rate limiting on the agent, raise buffers, temporarily divert part of the stream to warm.
Schema drift: alert on the appearance of new keys/types, start schema-catalog reconciliation.
Slow search: rebuild indexes, add replicas, analyze "heavy" queries, archive old batches.
Security incident: immediately enable immutability, offload artifacts, restrict access by role, run an RCA.


14) FinOps: how not to go broke on the logs

Cut verbosity: turn a multi-line stacktrace into a 'stack' field and sample repeats.
Enable TTLs: different per 'env'/'level'/'category'.
Use Loki/archive + on-demand rehydration for rarely accessed data.
Batching and compression: bigger batches are cheaper, but watch the search SLA.
Materialize frequent computations (daily aggregates).
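Folding a stacktrace into a single field can be sketched as below: one event with a 'msg' and a truncated 'stack', instead of N separate log lines. Treating the first line as the message and the frame cap are simplifying assumptions.

```python
def fold_stacktrace(raw: str, max_frames: int = 5) -> dict:
    """Collapse a multi-line traceback into one structured event.

    Keeps the first line as 'msg', the first `max_frames` frames as a
    single 'stack' string, and notes how many frames were cut."""
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    head, frames = lines[0], lines[1:]
    stack = "\n".join(frames[:max_frames])
    if len(frames) > max_frames:
        stack += f"\n... {len(frames) - max_frames} more frames"
    return {"msg": head, "stack": stack}
```

Combined with the rate-limit sampling from section 5, this is usually the single biggest volume reduction available.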


15) Instrumental examples

Fluent Bit (masking and sending to OpenSearch)

ini
[INPUT]
    Name                tail
    Path                /var/log/app/*.log
    Parser              json
    Mem_Buf_Limit       256MB

[FILTER]
    Name                modify
    Match               *
    # the modify filter takes one key per Remove_key entry
    Remove_key          credit_card
    Remove_key          password

[OUTPUT]
    Name                es
    Host                opensearch.svc
    Port                9200
    Index               logs-${tag}-${date}
    Logstash_Format     On
    Suppress_Type_Name  On

Nginx access log in JSON with trace_id

nginx
log_format json escape=json '{"ts":"$time_iso8601","remote":"$remote_addr",'
    '"method":"$request_method","path":"$uri","status":$status,'
    '"bytes":$body_bytes_sent,"ua":"$http_user_agent","trace_id":"$http_trace_id"}';
access_log /var/log/nginx/access.json json;

Elasticsearch ILM policy (hot → warm → delete); OpenSearch ISM is analogous

json
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_age": "7d", "max_size": "50gb" } } },
      "warm":   { "min_age": "7d",  "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

16) Implementation checklist

  • Field scheme and log levels agreed; trace/request-id correlation enabled.
  • Agents configured (Fluent Bit/Promtail) with masking and buffers.
  • Online layer (OpenSearch/Loki/cloud) and archive (S3/GCS + Parquet) selected.
  • ILM/ISM + hot/warm/cold retention policies, rehydration process.
  • RBAC/ABAC, audit immutability, access log.
  • Pipeline dashboards; alerts on loss/lag/disk buffers.
  • Playbooks: log storm, schema drift, slow search, security incident.
  • Financial limits: $/1M events, quotas for "expensive" queries.

17) Anti-patterns

Unstructured text logs → impossible to filter and aggregate.
Giant stacktraces at INFO → volume explosion.
No correlation → blind searching across all services.
Storing "everything forever" → a sky-high cloud bill.
Secrets/PII in logs → compliance risk.
Manual index edits in production → drift and long search outages.


18) The bottom line

Log centralization is a system, not just a stack. Standardized schema, correlation, secure shippers, layered storage, and strict access policies turn logs into a powerful tool for SRE, security, and product. Correct retentions and FinOps keep the budget, and pipeline SLOs and playbooks make investigations fast and reproducible.
