Log pipelines: ELK and Loki
1) Why and when: logging goals
Observability and RCA: faster debugging, post-mortems, SLO/SLA control.
Security and audit: traces of access, anomalies, investigations.
Business metrics: conversion, payment flow, PSP errors, user behavior.
Compliance: storage, PII masking, retention policies, Legal Hold.
Types of logs: application, infrastructure (kubelet, kube-proxy, CNI, ingress), network, audit, payment, web events, Nginx/Envoy, database.
2) High-level architectures
Option A: ELK
Producers → Log shipper (Filebeat/Fluent Bit/Vector) → Logstash/Beats input → Elasticsearch → Kibana/alerting
Option B: Loki
Producers → Promtail/Fluent Bit → Loki distributor/ingester/querier → Grafana/alerting
Hybrid
ELK for full-text/faceted search, Loki for low-cost scalable storage and fast grep-like queries; correlation with metrics/traces in Grafana.
3) Data flow and processing levels
1. Collection: file tailing, journald, syslog, container stdout, HTTP.
2. Enrichment: timestamp normalization, host/pod/namespace, env (prod/stage), release, commit SHA, trace/span id.
3. Parsing: JSON → flat fields; grok/regex; Nginx/Envoy formats; payment schemes (PSP error codes).
4. Filtering/redaction: strip PII (PAN, CVV, e-mail, addresses), secrets, tokens.
5. Routing: by tenant/service/log level; hot/warm/cold; to S3/object storage.
6. Storage and retention: TTL policy by data class.
7. Access/Analytics/Alerts.
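Steps 2-4 above (enrichment, parsing, redaction) can be sketched as a small processing function. This is a minimal illustration, assuming a flat JSON record and hypothetical field names (`msg`, `pod`, `namespace`); real shippers do this in their pipeline stages.

```python
import re

# Static metadata every record gets after enrichment (hypothetical values).
STATIC_META = {"env": "prod", "release": "1.4.2", "commit": "abc1234"}

PAN_RE = re.compile(r"\b[0-9]{12,19}\b")  # candidate card numbers

def enrich_and_redact(record: dict, pod_meta: dict) -> dict:
    """Enrich a parsed JSON log record with deployment/pod metadata
    and mask PAN-like digit runs in the message."""
    out = dict(record)
    out.update(STATIC_META)
    out["pod"] = pod_meta.get("pod")
    out["namespace"] = pod_meta.get("namespace")
    if "msg" in out:
        out["msg"] = PAN_RE.sub("[REDACTED_PAN]", out["msg"])
    return out

rec = enrich_and_redact(
    {"ts": "2024-05-01T12:00:00Z", "level": "error",
     "msg": "card 4111111111111111 declined"},
    {"pod": "payments-7d9f", "namespace": "prod"},
)
print(rec["msg"])  # card [REDACTED_PAN] declined
```

Doing this at the shipper (step 4) rather than centrally means PII never leaves the node unmasked.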
4) ELK: key solutions
4.1 Logstash/Beats
Use Beats/Fluent Bit on nodes as lightweight shippers, and Logstash as the central ETL (grok, dissect, mutate, geoip, translate).
Logstash pools: ingest-ETL, security-ETL, payments-ETL, to isolate workloads.
4.2 Elasticsearch
Sharding: aim for ~20-50 GB per shard; avoid a "shard explosion."
Index strategy: `logs-`-prefixed indices with ILM tiers:
- hot: SSD, 1-7 days; warm: HDD, 7-30 days; cold: high-capacity storage; frozen: minimum cost with slower access.
- Mappings: type fields explicitly, restrict fielddata, and control dynamic field creation.
- Cache and queries: filter on keyword fields, use aggregations sparingly; pin high-frequency searches to the hot tier.
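The ~20-50 GB/shard guidance reduces to back-of-the-envelope arithmetic. A sketch with illustrative numbers (`daily_gb` and `target_shard_gb` are examples, not recommendations):

```python
import math

def shards_needed(daily_gb: float, retention_days: int,
                  target_shard_gb: float = 40.0, replicas: int = 1) -> int:
    """Rough total shard count (primaries + replicas) for a tier,
    following the 20-50 GB/shard guidance."""
    total_gb = daily_gb * retention_days
    primaries = math.ceil(total_gb / target_shard_gb)
    return primaries * (1 + replicas)

# e.g. 100 GB/day kept 7 days in the hot tier, 1 replica:
print(shards_needed(100, 7))  # 36 total shards (18 primaries x 2 copies)
```

If the result runs into the thousands, that is the "shard explosion" warning sign: raise the target shard size or shorten the tier's retention.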
4.3 Kibana
Spaces for multi-tenancy.
Saved searches, Lens/TSVB, threshold/alert metrics.
RBAC by index patterns (`logs-tenant-*`).
5) Loki: key decisions
5.1 Label model
Labels are Loki's "index." Use low cardinality: 'cluster', 'namespace', 'app', 'level', 'env', 'tenant'.
High-cardinality fields (uid, request_id) stay in the log line; extract them at query time with `|=`, `| json`, `| regexp` in LogQL.
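Why low cardinality matters: every unique label combination creates a separate stream, so the worst-case stream count is the product of label cardinalities. A toy illustration (the cardinality numbers are made up):

```python
from math import prod

def stream_count(label_cardinalities: dict) -> int:
    """Worst-case number of Loki streams = product of the
    per-label cardinalities."""
    return prod(label_cardinalities.values())

good = {"cluster": 3, "namespace": 20, "app": 50, "level": 4, "env": 2}
bad = dict(good, user_id=100_000)  # high-cardinality label: do not do this

print(stream_count(good))  # 24000 streams: manageable
print(stream_count(bad))   # 2400000000 streams: index/memory explosion
```

Adding a single high-cardinality label multiplies the stream count by its cardinality, which is why `user_id`-style labels blow up ingester memory.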
5.2 Components
Promtail: collects stdout, files, journald; parsers (JSON, regex, cri).
Distributor/Ingester/Querier/Query-frontend: scaling by role; request caching.
Object storage (S3/GCS/MinIO) for long-term storage of chunk logs.
5.3 LogQL techniques
Fast grep: `{app="payments", level="error"} |= "declined"`
JSON parsing: `avg_over_time({app="api"} | json | code=~"5.." | unwrap duration [5m])`
Correlation with metrics: `rate({app="nginx"} |= "200" [5m])`
6) ELK vs Loki comparison (in brief)
Search/aggregation: ELK is stronger for complex full-text and faceted queries; Loki - grep-like, fast and cheap.
Cost: Loki is often cheaper on larger volumes (object storage + smaller index).
Operational complexity: ELK demands discipline around indices/ILM and JVM heap sizing; Loki demands label discipline.
Correlation with metrics/traces: Loki integrates naturally with the Grafana/OTel stack; ELK can too, but usually via extra integration work.
7) Safety and compliance
PII redaction at the edge (shipper): mask PAN, e-mail, phone numbers, addresses, tokens.
TLS in-transit, mTLS between agents and buses.
RBAC: per-tenant indexes/labels; isolation of namespaces/spaces.
Secrets hygiene: no secrets in environment variables; use dedicated secret managers.
Legal Hold: segment/index freezing mechanism; write-once for disputed periods.
Deletion/retention: TTL policies by data class (prod/stateful/payments/audit).
Log access audit trails.
8) Reliability and throughput
Buffering and backpressure: local files/disks for agents; retries with exponential backoff.
Idempotency: `ingest_id`/`log_id` fields to avoid duplicates on retries.
HA: at least 3 nodes for ES masters / Loki ingesters; anti-affinity across AZs.
Quotas and rate-limits by tenant/service; protection against "storms" logging.
Log level policy: limit `ERROR` volume; enable `DEBUG` only temporarily via dynamic flags.
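The backoff and idempotency bullets above can be sketched as follows. `backoff_delays` and `Deduper` are hypothetical helper names; a real shipper would bound the `seen` set (TTL, LRU, or a bloom filter) rather than grow it forever.

```python
import random

def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff with full jitter for shipper retries:
    each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

class Deduper:
    """Drop records whose log_id was already accepted, so retried
    batches do not create duplicates (idempotent ingest)."""
    def __init__(self):
        self.seen = set()

    def accept(self, record: dict) -> bool:
        log_id = record["log_id"]
        if log_id in self.seen:
            return False
        self.seen.add(log_id)
        return True

d = Deduper()
print(d.accept({"log_id": "a1", "msg": "x"}))  # True  (first delivery)
print(d.accept({"log_id": "a1", "msg": "x"}))  # False (retry duplicate dropped)
```

Jitter matters: without it, all agents retry in lockstep after an outage and recreate the overload ("thundering herd").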
9) Performance and tuning
ELK:
- JVM heap ~50% of RAM (but ≤ ~30-32 GB per node); the page cache matters.
- Sensible rollover (20-50 GB/shard); raise `refresh_interval` for ingest-heavy indexes.
- In Logstash, avoid heavy grok; prefer JSON logging at the source.
Loki:
- The right label set is the key to query speed.
- Larger chunks → cheaper storage but more ingester memory; balance them.
- Query-frontend + cache (memcached/Redis) for repeated queries.
10) FinOps for logs (cost)
Decreasing cardinality of fields/labels.
DEBUG sampling and dynamic "log switches."
Rotation: short hot tier, long cold tier in object storage.
Deduplication and consolidated messages (batch).
Archiving rarely used logs to cheap storage classes.
Cost dashboard: volumes by data stream/label/index/tenant.
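DEBUG sampling can be implemented deterministically with hash-based bucketing, so every agent makes the same keep/drop decision for the same key. A sketch (the 1% rate and the `crc32` choice are illustrative):

```python
import zlib

def keep_debug(line_key: str, sample_rate: float = 0.01) -> bool:
    """Deterministically keep ~sample_rate of DEBUG lines:
    hash the key into 10000 buckets and keep the lowest ones."""
    bucket = zlib.crc32(line_key.encode()) % 10_000
    return bucket < sample_rate * 10_000

kept = sum(keep_debug(f"req-{i}") for i in range(100_000))
print(kept)  # roughly 1% of 100000 DEBUG lines survive
```

Using something like a request id as the key keeps or drops all DEBUG lines of one request together, so sampled requests stay fully debuggable.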
11) Observability 3-in-1
Trace-ID/Span-ID to each log (middleware on API gateways and services).
OpenTelemetry: single context; exporters to Tempo/Jaeger, metrics to Prometheus/Mimir, logs to Loki/ELK.
Quick scenarios: "alert on a metric → jump to the corresponding logs → jump to the trace."
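A minimal sketch of stamping every log record with the current trace id, using Python's stdlib `logging` and `contextvars`. The variable and filter names are hypothetical; a real service would read the id from the OTel context set by middleware.

```python
import contextvars
import logging

# Per-request context, e.g. populated by API-gateway middleware.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceFilter(logging.Filter):
    """Stamp each record with the current trace id so logs can be
    joined to traces in Grafana/Tempo or Kibana."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceFilter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

trace_id_var.set("4bf92f3577b34da6")
log.info("charge created")  # INFO trace=4bf92f3577b34da6 charge created
```

With `trace_id` as a required log field, the "alert → logs → trace" jump is a single filter/click rather than a manual search.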
12) Multi-tenancy and isolation
Namespace-based isolation (K8s labels), separate index patterns / `tenant` labels.
Separation of alerts/dashboards/retention by tenant.
Consumption billing: volume of ingest, storage, requests.
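Consumption billing can start as simple unit-price attribution across the three drivers listed above. A sketch with made-up prices (`PRICES` is purely illustrative):

```python
# Hypothetical unit prices; real numbers come from your provider/FinOps data.
PRICES = {"ingest_gb": 0.50, "storage_gb_month": 0.03, "queries_1k": 0.10}

def tenant_bill(ingest_gb: float, storage_gb_month: float, queries: int) -> float:
    """Monthly cost attributed to one tenant: ingest volume,
    stored volume, and query count."""
    return round(ingest_gb * PRICES["ingest_gb"]
                 + storage_gb_month * PRICES["storage_gb_month"]
                 + queries / 1000 * PRICES["queries_1k"], 2)

print(tenant_bill(200, 1500, 50_000))  # 150.0
```

Even rough per-tenant numbers like these change behavior: teams see the cost of chatty DEBUG logging and long retention.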
13) Monitoring and SLO for the conveyor itself
13) Monitoring and SLO for the pipeline itself
Ingest SLO: "99.9% of logs delivered …"
Search SLO: "p95 queries …"
Technical metrics: queue depth, dropped logs, reprocess rate, parser error rate, ingester/ES node failures.
14) Typical deployment schemes
Managed: Elasticsearch Service/OpenSearch, Grafana Cloud Loki.
15) Configuration examples
Self-hosted K8s: StatefulSets for ES/Loki, anti-affinity across AZs, PersistentVolumes, object storage.
Edge agents (applications in the regions): local buffer + TLS channel to the central ingest.
15.1 Promtail (K8s, CRI JSON)

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
            trace: trace_id
      - labels:
          level:
          app:
          namespace:
      - match:
          selector: '{namespace="prod"}'
          stages:
            - replace:
                expression: '(?P<pan>\b[0-9]{12,19}\b)'
                replace: '[REDACTED_PAN]'
    relabel_configs:
      - action: replace
        source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - action: replace
        source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - action: replace
        source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
```

15.2 Logstash (ingest and masking)

```ruby
input {
  beats { port => 5044 }
}
filter {
  json { source => "message" skip_on_invalid_json => true }
  mutate { add_field => { "env" => "%{[kubernetes][labels][env]}" } }
  # PII masking
  mutate {
    gsub => [
      "message", "\b[0-9]{12,19}\b", "[REDACTED_PAN]",
      "message", "(?i)(authorization: Bearer )([A-Za-z0-9\.\-_]+)", "\1[REDACTED_TOKEN]"
    ]
  }
}
output {
  elasticsearch {
    hosts => ["https://es-hot-1:9200","https://es-hot-2:9200"]
    index => "logs-%{[fields][tenant]}-%{[app]}-%{+YYYY.MM.dd}"
    ilm_enabled => true
    ssl => true
    cacert => "/etc/ssl/certs/ca.crt"
    user => "${ES_USER}"
    password => "${ES_PASS}"
  }
}
```
16) Alerting and dashboards (templates)
API errors: `rate({app="api",level="error"}[5m]) > threshold` → PagerDuty/Telegram.
5xx spikes in Nginx/Envoy; ingest drops at the agents; rising search latency.
17) Quality checks (log QA)
Logging contracts: JSON format, required fields (`ts`, `level`, `service`, `env`, `trace_id`, `msg`).
Log linter in CI: block new high-cardinality fields without sign-off.
Canary services: generate reference logs for early detection of regressions.
18) Frequent errors and anti-patterns
Loki labels with high cardinality (`user_id`, `request_id`) → memory explosion.
Dynamic fields in ES without mappings → "mapping explosion."
`DEBUG` left on in production "forever": enable it via flags and with a TTL.
No PII redaction.
One common "monolithic" pipeline for everything; segment by domain instead.
19) Implementation plan (iterations)
1. MVP: agents + one pipeline (applications), basic dashboards, PII redaction.
2. Expansion: network/infra logs, SLO alerts, correlation with traces.
3. FinOps: retention matrix, cost reporting, label/index optimization.
4. Multi-tenancy: spaces, RBAC, consumption billing.
5. Reliability: HA, disaster drills, Legal Hold.
20) Production launch checklist
21) Mini-FAQ
What to choose, ELK or Loki?