Log pipelines: ELK and Loki
1) Why and when: logging goals
Observability and RCA: faster debugging, post-mortems, SLO/SLA control.
Security and audit: traces of access, anomalies, investigations.
Business metrics: conversion, payment flow, PSP errors, user behavior.
Compliance: storage, PII masking, retention policies, Legal Hold.
Types of logs: application, infrastructure (kubelet, kube-proxy, CNI, ingress), network, audit, payment, web events, Nginx/Envoy, database.
2) High-level architectures
Option A: ELK
Producers → Log shipper (Filebeat/Fluent Bit/Vector) → Logstash/Beats input → Elasticsearch → Kibana/alerting
Option B: Loki
Producers → Promtail/Fluent Bit → Loki distributor/ingester/querier → Grafana/alerting
Hybrid
ELK for full-text/faceted search, Loki for low-cost scalable storage and fast grep-like queries; correlation with metrics/traces in Grafana.
3) Data flow and processing levels
1. Collection: file tailing, journald, syslog, container stdout, HTTP.
2. Enrichment: timestamp normalization, host/pod/namespace, env (prod/stage), release, commit SHA, trace/span id.
3. Parsing: JSON → flat fields; grok/regex; Nginx/Envoy formats; payment schemes (PSP error codes).
4. Filtering/redaction: strip PII (PAN, CVV, e-mail, addresses), secrets, tokens.
5. Routing: by tenant/service/log level; hot/warm/cold; to S3/object storage.
6. Storage and retention: TTL policy by data class.
7. Access/Analytics/Alerts.
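Steps 2-4 above (enrichment, parsing, redaction) can be sketched as a small processing function. This is a minimal illustration, assuming a flat JSON record and hypothetical field names (`msg`, `pod`, `namespace`); real shippers do this in their pipeline stages.

```python
import re

# Static metadata every record gets after enrichment (hypothetical values).
STATIC_META = {"env": "prod", "release": "1.4.2", "commit": "abc1234"}

PAN_RE = re.compile(r"\b[0-9]{12,19}\b")  # candidate card numbers

def enrich_and_redact(record: dict, pod_meta: dict) -> dict:
    """Enrich a parsed JSON log record with deployment/pod metadata
    and mask PAN-like digit runs in the message."""
    out = dict(record)
    out.update(STATIC_META)
    out["pod"] = pod_meta.get("pod")
    out["namespace"] = pod_meta.get("namespace")
    if "msg" in out:
        out["msg"] = PAN_RE.sub("[REDACTED_PAN]", out["msg"])
    return out

rec = enrich_and_redact(
    {"ts": "2024-05-01T12:00:00Z", "level": "error",
     "msg": "card 4111111111111111 declined"},
    {"pod": "payments-7d9f", "namespace": "prod"},
)
print(rec["msg"])  # card [REDACTED_PAN] declined
```

Doing this at the shipper (step 4) rather than centrally means PII never leaves the node unmasked.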
4) ELK: key solutions
4.1 Logstash/Beats
Use Beats/Fluent Bit on nodes as lightweight shippers, and Logstash as the central ETL (grok, dissect, mutate, geoip, translate).
Logstash pools: ingest-ETL, security-ETL, payments-ETL, to isolate workloads.
4.2 Elasticsearch
Sharding: aim for ~20-50 GB per shard; avoid a "shard explosion."
Index strategy: `logs-`-prefixed indices with ILM tiers:
- hot: SSD, 1-7 days; warm: HDD, 7-30 days; cold: high-capacity storage; frozen: minimum cost with slower access.
- Mappings: type fields explicitly, restrict fielddata, and control dynamic field creation.
- Cache and queries: filter on keyword fields, use aggregations sparingly; pin high-frequency searches to the hot tier.
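The ~20-50 GB/shard guidance reduces to back-of-the-envelope arithmetic. A sketch with illustrative numbers (`daily_gb` and `target_shard_gb` are examples, not recommendations):

```python
import math

def shards_needed(daily_gb: float, retention_days: int,
                  target_shard_gb: float = 40.0, replicas: int = 1) -> int:
    """Rough total shard count (primaries + replicas) for a tier,
    following the 20-50 GB/shard guidance."""
    total_gb = daily_gb * retention_days
    primaries = math.ceil(total_gb / target_shard_gb)
    return primaries * (1 + replicas)

# e.g. 100 GB/day kept 7 days in the hot tier, 1 replica:
print(shards_needed(100, 7))  # 36 total shards (18 primaries x 2 copies)
```

If the result runs into the thousands, that is the "shard explosion" warning sign: raise the target shard size or shorten the tier's retention.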
4.3 Kibana
Spaces for multi-tenancy.
Saved searches, Lens/TSVB, threshold/alert metrics.
RBAC by index patterns (`logs-tenant-*`).
5) Loki: key decisions
5.1 Label model
Labels are Loki's "index." Use low cardinality: 'cluster', 'namespace', 'app', 'level', 'env', 'tenant'.
High-cardinality fields (uid, request_id) stay in the log line; extract them at query time with `|=`, `| json`, `| regexp` in LogQL.
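Why low cardinality matters: every unique label combination creates a separate stream, so the worst-case stream count is the product of label cardinalities. A toy illustration (the cardinality numbers are made up):

```python
from math import prod

def stream_count(label_cardinalities: dict) -> int:
    """Worst-case number of Loki streams = product of the
    per-label cardinalities."""
    return prod(label_cardinalities.values())

good = {"cluster": 3, "namespace": 20, "app": 50, "level": 4, "env": 2}
bad = dict(good, user_id=100_000)  # high-cardinality label: do not do this

print(stream_count(good))  # 24000 streams: manageable
print(stream_count(bad))   # 2400000000 streams: index/memory explosion
```

Adding a single high-cardinality label multiplies the stream count by its cardinality, which is why `user_id`-style labels blow up ingester memory.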
5.2 Components
Promtail: collects stdout, files, journald; parsers (JSON, regex, cri).
Distributor/Ingester/Querier/Query-frontend: scaling by role; request caching.
Object storage (S3/GCS/MinIO) for long-term storage of chunk logs.
5.3 LogQL techniques
Fast grep: `{app="payments", level="error"} |= "declined"`
JSON parsing: `avg_over_time({app="api"} | json | code=~"5.." | unwrap duration [5m])`
Correlation with metrics: `rate({app="nginx"} |= "200" [5m])`
6) ELK vs Loki comparison (in brief)
Search/aggregation: ELK is stronger for complex full-text and faceted queries; Loki - grep-like, fast and cheap.
Cost: Loki is often cheaper on larger volumes (object storage + smaller index).
Operational complexity: ELK demands discipline around indices/ILM and JVM heap sizing; Loki demands label discipline.
Correlation with metrics/traces: Loki integrates naturally with the Grafana/OTel stack; ELK can too, but usually via extra integration work.
7) Safety and compliance
PII redaction at the edge (shipper): mask PAN, e-mail, phone numbers, addresses, tokens.
TLS in-transit, mTLS between agents and buses.
RBAC: per-tenant indexes/labels; isolation of namespaces/spaces.
Secrets hygiene: no secrets in environment variables; use dedicated secret managers.
Legal Hold: segment/index freezing mechanism; write-once for disputed periods.
Deletion/retention: TTL policies by data class (prod/stateful/payments/audit).
Log access audit trails.
8) Reliability and throughput
Buffering and backpressure: local files/disks for agents; retries with exponential backoff.
Idempotency: `ingest_id`/`log_id` fields to avoid duplicates on retries.
HA: at least 3 nodes for ES masters / Loki ingesters; anti-affinity across AZs.
Quotas and rate-limits by tenant/service; protection against "storms" logging.
Log level policy: limit `ERROR` volume; enable `DEBUG` only temporarily via dynamic flags.
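The backoff and idempotency bullets above can be sketched as follows. `backoff_delays` and `Deduper` are hypothetical helper names; a real shipper would bound the `seen` set (TTL, LRU, or a bloom filter) rather than grow it forever.

```python
import random

def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff with full jitter for shipper retries:
    each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

class Deduper:
    """Drop records whose log_id was already accepted, so retried
    batches do not create duplicates (idempotent ingest)."""
    def __init__(self):
        self.seen = set()

    def accept(self, record: dict) -> bool:
        log_id = record["log_id"]
        if log_id in self.seen:
            return False
        self.seen.add(log_id)
        return True

d = Deduper()
print(d.accept({"log_id": "a1", "msg": "x"}))  # True  (first delivery)
print(d.accept({"log_id": "a1", "msg": "x"}))  # False (retry duplicate dropped)
```

Jitter matters: without it, all agents retry in lockstep after an outage and recreate the overload ("thundering herd").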
9) Performance and tuning
ELK:
- JVM heap ~50% of RAM (but ≤ ~30-32 GB per node); the page cache matters.
- Sensible rollover (20-50 GB/shard); raise `refresh_interval` for ingest-heavy indexes.
- In Logstash, avoid heavy grok; prefer JSON logging at the source.
Loki:
- The right label set is the key to query speed.
- Larger chunks → cheaper storage but more ingester memory; balance them.
- Query-frontend + cache (memcached/Redis) for repeated queries.
10) FinOps for logs (cost)
Decreasing cardinality of fields/labels.
DEBUG sampling and dynamic "log switches."
Rotation: short hot tier, long cold tier in object storage.
Deduplication and consolidated messages (batch).
Archiving rarely used logs to cheap storage classes.
Cost dashboard: volumes by data stream/label/index/tenant.
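DEBUG sampling can be implemented deterministically with hash-based bucketing, so every agent makes the same keep/drop decision for the same key. A sketch (the 1% rate and the `crc32` choice are illustrative):

```python
import zlib

def keep_debug(line_key: str, sample_rate: float = 0.01) -> bool:
    """Deterministically keep ~sample_rate of DEBUG lines:
    hash the key into 10000 buckets and keep the lowest ones."""
    bucket = zlib.crc32(line_key.encode()) % 10_000
    return bucket < sample_rate * 10_000

kept = sum(keep_debug(f"req-{i}") for i in range(100_000))
print(kept)  # roughly 1% of 100000 DEBUG lines survive
```

Using something like a request id as the key keeps or drops all DEBUG lines of one request together, so sampled requests stay fully debuggable.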
11) Observability 3-in-1
Trace-ID/Span-ID to each log (middleware on API gateways and services).
OpenTelemetry: single context; exporters to Tempo/Jaeger, metrics to Prometheus/Mimir, logs to Loki/ELK.
Quick scenarios: "alert on a metric → jump to the corresponding logs → jump to the trace."
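A minimal sketch of stamping every log record with the current trace id, using Python's stdlib `logging` and `contextvars`. The variable and filter names are hypothetical; a real service would read the id from the OTel context set by middleware.

```python
import contextvars
import logging

# Per-request context, e.g. populated by API-gateway middleware.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceFilter(logging.Filter):
    """Stamp each record with the current trace id so logs can be
    joined to traces in Grafana/Tempo or Kibana."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceFilter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

trace_id_var.set("4bf92f3577b34da6")
log.info("charge created")  # INFO trace=4bf92f3577b34da6 charge created
```

With `trace_id` as a required log field, the "alert → logs → trace" jump is a single filter/click rather than a manual search.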
12) Multi-tenancy and isolation
Namespace-based isolation (K8s labels), separate index patterns / `tenant` labels.
Separation of alerts/dashboards/retention by tenant.
Consumption billing: volume of ingest, storage, requests.
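Consumption billing can start as simple unit-price attribution across the three drivers listed above. A sketch with made-up prices (`PRICES` is purely illustrative):

```python
# Hypothetical unit prices; real numbers come from your provider/FinOps data.
PRICES = {"ingest_gb": 0.50, "storage_gb_month": 0.03, "queries_1k": 0.10}

def tenant_bill(ingest_gb: float, storage_gb_month: float, queries: int) -> float:
    """Monthly cost attributed to one tenant: ingest volume,
    stored volume, and query count."""
    return round(ingest_gb * PRICES["ingest_gb"]
                 + storage_gb_month * PRICES["storage_gb_month"]
                 + queries / 1000 * PRICES["queries_1k"], 2)

print(tenant_bill(200, 1500, 50_000))  # 150.0
```

Even rough per-tenant numbers like these change behavior: teams see the cost of chatty DEBUG logging and long retention.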
13) Monitoring and SLO for the conveyor itself
13) Monitoring and SLO for the pipeline itself
Ingest SLO: "99.9% of logs delivered …"
Search SLO: "p95 queries …"
Technical metrics: queue depth, dropped logs, reprocess rate, parser error rate, ingester/ES node failures.
14) Typical deployment schemes
Managed: Elasticsearch Service/OpenSearch, Grafana Cloud Loki.
15) Configuration examples
Self-hosted K8s: StatefulSets for ES/Loki, anti-affinity across AZs, PersistentVolumes, object storage.
Edge agents (applications in the regions): local buffer + TLS channel to the central ingest.
15.1 Promtail (K8s, CRI JSON)

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
            trace: trace_id
      - labels:
          level:
          app:
          namespace:
      - match:
          selector: '{namespace="prod"}'
          stages:
            - replace:
                expression: '(?P<pan>\b[0-9]{12,19}\b)'
                replace: '[REDACTED_PAN]'
    relabel_configs:
      - action: replace
        source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - action: replace
        source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - action: replace
        source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```

15.2 Logstash (ingest and masking)

```ruby
input {
  beats { port => 5044 }
}
filter {
  json { source => "message" skip_on_invalid_json => true }
  mutate { add_field => { "env" => "%{[kubernetes][labels][env]}" } }
  # PII masking
  mutate {
    gsub => [
      "message", "\b[0-9]{12,19}\b", "[REDACTED_PAN]",
      "message", "(?i)(authorization: Bearer )([A-Za-z0-9\.\-_]+)", "\1[REDACTED_TOKEN]"
    ]
  }
}
output {
  elasticsearch {
    hosts => ["https://es-hot-1:9200","https://es-hot-2:9200"]
    index => "logs-%{[fields][tenant]}-%{[app]}-%{+YYYY.MM.dd}"
    ilm_enabled => true
    ssl => true
    cacert => "/etc/ssl/certs/ca.crt"
    user => "${ES_USER}"
    password => "${ES_PASS}"
  }
}
```
16) Alerting and dashboards (templates)
API errors: `rate({app="api",level="error"}[5m]) > threshold` → PagerDuty/Telegram.
5xx spikes in Nginx/Envoy; ingest drops at the agents; rising search latency.
17) Quality checks (log QA)
Logging contracts: JSON format, required fields (`ts`, `level`, `service`, `env`, `trace_id`, `msg`).
Log linter in CI: block new high-cardinality fields without sign-off.
Canary services: generate reference logs for early detection of regressions.
18) Frequent errors and anti-patterns
Loki labels with high cardinality (`user_id`, `request_id`) → memory explosion.
Dynamic fields in ES without mappings → "mapping explosion."
`DEBUG` left on in production "forever": enable it via flags and with a TTL.
No PII redaction.
One common "monolithic" pipeline for everything; segment by domain instead.
19) Implementation plan (iterations)
1. MVP: agents + one pipeline (applications), basic dashboards, PII redaction.
2. Expansion: network/infra logs, SLO alerts, correlation with traces.
3. FinOps: retention matrix, cost reporting, label/index optimization.
4. Multi-tenancy: spaces, RBAC, consumption billing.
5. Reliability: HA, disaster drills, Legal Hold.
20) Production launch checklist
21) Mini-FAQ
What to choose, ELK or Loki?