Observability stack
1) Why you need an observability stack
Rapid RCA and reduced MTTR: from symptom to cause in minutes.
SLO management: measurement of errors/latency, alerts on error-budget burn.
Release control: canary comparisons, auto-rollback driven by metrics.
Security and audit: access trails, anomaly detection, Legal Hold.
FinOps transparency: cost of storage/requests, cost-per-SLO.
Methodologies: Golden Signals (latency/traffic/errors/saturation), RED, USE.
2) Basic stack architecture
Components by Layer
Collection/agents: Exporters, Promtail/Fluent Bit, OTel SDK/Auto-Instr, Blackbox-probes.
Bus/ingest: Prometheus remote_write → Mimir/Thanos, Loki distributors/ingesters, Tempo/Jaeger ingesters.
Storage: object S3/GCS/MinIO (long-term cold), SSD (hot series).
Queries/visualization: Grafana (panels, SLO widgets), Kibana (if ELK).
Management: Alertmanager/Grafana alerts, service catalog, RBAC, secrets manager.
Deployment patterns
Managed (Grafana Cloud/cloud services) - fast to start, more expensive at volume.
Self-hosted in K8s - full control, requires operations effort and FinOps.
3) Data standards: unified "observability scheme"
3.1 Metrics (Prometheus/OpenMetrics)
Required labels: `env`, `region`, `cluster`, `namespace`, `service`, `version`, `tenant` (if multi-tenant), `endpoint`.
Naming: `snake_case`; suffixes `_total`, `_seconds`, `_bytes`.
Histograms: fixed `buckets` (SLO-oriented).
Cardinality: never put `user_id` or `request_id` into labels.
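A CI lint for these conventions can be sketched in a few lines of Python; the regex and the forbidden-label set below are assumptions derived from this section, not an existing tool:

```python
import re

# Rules from this section: snake_case names with unit suffixes,
# and no high-cardinality labels such as user_id / request_id.
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(_total|_seconds|_bytes)$")
FORBIDDEN_LABELS = {"user_id", "request_id"}

def lint_metric(name: str, labels: set[str]) -> list[str]:
    """Return a list of rule violations for one metric definition."""
    problems = []
    if not NAME_RE.match(name):
        problems.append(f"{name}: name must be snake_case with _total/_seconds/_bytes suffix")
    bad = labels & FORBIDDEN_LABELS
    if bad:
        problems.append(f"{name}: high-cardinality labels {sorted(bad)}")
    return problems

print(lint_metric("http_requests_total", {"env", "service", "endpoint"}))  # []
print(lint_metric("HttpLatency", {"user_id", "endpoint"}))                 # two violations
```

Running such a check in CI blocks cardinality and naming regressions before they reach ingest.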
3.2 Logs
Format: JSON; required fields `ts`, `level`, `service`, `env`, `trace_id`, `span_id`, `msg`.
PII: masking on the agent (PAN, tokens, e-mail, etc.).
Loki labels: only low cardinality (`app`, `namespace`, `level`, `tenant`).
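A minimal sketch of an emitter that satisfies this log schema, with agent-side masking; the PAN/e-mail regexes are simplified illustrations, not production-grade detectors:

```python
import json
import re
import time

PAN_RE = re.compile(r"\b\d{13,19}\b")          # card numbers (PAN), naive
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    """Agent-side masking of PAN and e-mail, per section 3.2."""
    return EMAIL_RE.sub("***@***", PAN_RE.sub("****", text))

def log_line(level: str, service: str, env: str, msg: str,
             trace_id: str = "", span_id: str = "") -> str:
    """Emit one JSON log record containing every required field."""
    return json.dumps({
        "ts": time.time(), "level": level, "service": service,
        "env": env, "trace_id": trace_id, "span_id": span_id,
        "msg": mask_pii(msg),
    })

print(log_line("error", "payments", "prod",
               "charge failed for bob@example.com card 4111111111111111",
               trace_id="abc123"))
```

In production this masking normally lives in the agent (Promtail/Fluent Bit pipeline stages), not in application code; the sketch only shows the contract.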
3.3 Traces
OTel semantics: `service.name`, `deployment.environment`, `db.system`, `http.*`.
Sampling: `always_on`/tail-sampling for p99-critical paths, `parent`-based/ratio for the rest.
ID propagation: write `trace_id`/`span_id` into logs and metrics (labels/fields).
4) M-L-T correlation (Metrics/Logs/Traces)
From an alert graph (metric) → logs filtered by `trace_id` → a specific trace.
From a trace (slow span) → metrics of the specific backend over the span's time interval.
Drilldown buttons in the panels: "to logs" and "to traces" with variable substitution (`$env`, `$service`, `$trace_id`).
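The variable substitution behind such drilldown buttons can be illustrated with plain string templates; the query shape and URL below are hypothetical stand-ins for real Grafana data links:

```python
from string import Template

# Hypothetical drilldown templates; real panels would wire the same
# variables ($env, $service, $trace_id) into Grafana data links.
TO_LOGS = Template('{service="$service", env="$env"} |= "$trace_id"')
TO_TRACES = Template("tempo://trace/$trace_id")

ctx = {"env": "prod", "service": "api", "trace_id": "abc123"}
print(TO_LOGS.substitute(ctx))    # {service="api", env="prod"} |= "abc123"
print(TO_TRACES.substitute(ctx))  # tempo://trace/abc123
```

The point is that one set of variables pivots the same context across all three signals.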
5) OpenTelemetry Collector: Reference Pipeline
```yaml
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-nodes
          static_configs: [{ targets: ['kubelet:9100'] }]

processors:
  batch: {}
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
  attributes:
    actions:
      - key: deployment.environment
        value: ${ENV}
        action: insert
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: important-routes
        type: string_attribute
        string_attribute: { key: http.target, values: ["/payments", "/login"] }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlphttp/mimir: { endpoint: "https://mimir/api/v1/push" }
  otlphttp/tempo: { endpoint: "https://tempo/api/traces" }
  loki:
    endpoint: https://loki/loki/api/v1/push
    labels:
      attributes:
        env: "deployment.environment"
        service: "service.name"

service:
  pipelines:
    metrics: { receivers: [prometheus, otlp], processors: [memory_limiter, batch], exporters: [otlphttp/mimir] }
    logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
    traces: { receivers: [otlp], processors: [memory_limiter, attributes, tail_sampling, batch], exporters: [otlphttp/tempo] }
```
6) Alerting: SLO and multi-burn
The idea: alert not on "CPU > 80%" but on Error Budget consumption.
PromQL templates:

```promql
# 5-minute error ratio
err_ratio_5m =
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))

# Fast burn (1m window)
(err_ratio_1m / (1 - SLO)) > 14.4

# Slow burn (30m window)
(err_ratio_30m / (1 - SLO)) > 2
```
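The thresholds 14.4 and 2 are not magic: each answers "what share of the monthly error budget does this window burn". A quick check of the arithmetic, assuming a 30-day budget period and the classic SRE fast/slow-burn targets:

```python
# Fast burn: fire when the current error rate would burn 2% of the
# 30-day error budget within 1 hour.
budget_fraction = 0.02          # share of monthly budget consumed
window_hours = 1.0              # burn window
month_hours = 30 * 24           # 720h budget period

fast_burn_rate = budget_fraction * month_hours / window_hours
print(fast_burn_rate)           # 14.4

# Slow burn: 10% of the budget over 36 hours -> burn rate 2.
slow_burn_rate = 0.10 * month_hours / 36
print(slow_burn_rate)           # 2.0
```

Pairing a fast and a slow window (multi-burn) keeps alerts both responsive and flap-free.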
Latency (histograms):

```promql
latency_p95 =
  histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
7) Dashboards: folder structure
00_Overview - platform: SLO, p95, 5xx%, capacity, active incidents.
10_Services - by services: RPS, p95/p99, errors, releases (annotations).
20_Infra - K8s/nodes/storage/network, etcd, controllers.
30_DB/Queues — PostgreSQL/Redis/Kafka/RabbitMQ.
40_Edge/DNS/CDN/WAF - ingress, LB, WAF rules.
50_Synthetic - uptime and headless scripts.
60_Cost/FinOps - storage, queries, hot/cold, forecast.
Each panel: description, units, owner, runbook link, drilldown.
8) Logs: LogQL workshop
```logql
# API errors
{app="api", level="error"} |= "Exception"

# Nginx 5xx over 5 minutes
sum(count_over_time({app="nginx"} | json | status=~"5.." [5m]))

# Extract fields and average a duration
avg_over_time({app="payments"} | json | code != "" | unwrap duration [5m])
```
9) Traces: TraceQL and tricks
Find the slowest spans:

```traceql
{ resource.service.name = "api" && duration > 500ms }
```

Slow SQL inside a slow request (descendant operator):

```traceql
{ name = "HTTP GET /order" } >> { name =~ "SELECT.*" && duration > 50ms }
```
10) Synthetics and uptime
Blackbox exporter: HTTP/TCP/TLS/DNS probes from ≥3 regions/ASNs.
Headless: scheduled login/deposit scripts.
Quorum alerts: fire only if ≥2 regions see the failure.
Status page: automatic updates + manual comments.
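The quorum rule reduces to counting failed regions; a minimal sketch:

```python
def quorum_failure(results: dict[str, bool], quorum: int = 2) -> bool:
    """Page only if at least `quorum` regions report a failed probe.

    `results` maps region name -> probe success (True = up).
    """
    failures = sum(1 for ok in results.values() if not ok)
    return failures >= quorum

# One flaky region stays silent; two regions down pages on-call.
print(quorum_failure({"eu": True, "us": False, "ap": True}))   # False
print(quorum_failure({"eu": False, "us": False, "ap": True}))  # True
```

In practice the same logic is expressed in Alertmanager/PromQL over per-region probe results; the sketch only shows the decision rule.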
11) Storage and retention
Metrics: hot 7-30 days (fast series storage), downsampling/recording rules; cold - object storage (months).
Logs: hot 3-7 days, then S3/GCS with an index (Loki chunk store/ELK ILM).
Traces: 3-7 days `always_on` + long-term storage for sampled traces (tail-sampled/selected).
- Rollover by size and time; budget for queries (quotas/limits).
- Separate policies for prod/stage and security data.
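The tiering policy above can be expressed as a simple age-to-class mapping; the hot windows below are the assumed upper bounds from this section:

```python
def storage_class(age_days: int, signal: str) -> str:
    """Map data age to a storage tier.

    Hot windows (days) mirror the upper bounds assumed in this
    section: metrics 30, logs 7, traces 7.
    """
    hot_window = {"metrics": 30, "logs": 7, "traces": 7}
    return "hot-ssd" if age_days <= hot_window[signal] else "cold-object"

print(storage_class(3, "logs"))      # hot-ssd
print(storage_class(14, "logs"))     # cold-object
print(storage_class(14, "metrics"))  # hot-ssd
```

Real systems apply this via lifecycle rules (Loki retention/compactor, Thanos/Mimir blocks, ELK ILM) rather than application code.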
12) Multi-tenancy and accesses
Separate by `tenant`/`namespace`/Spaces, index patterns, and permissions.
Tag resources for billing: `tenant`, `service`, `team`.
Scope dashboards/alerts to the Spaces of specific teams.
13) Security and compliance
TLS/mTLS from agents to backends, HMAC for private health endpoints.
RBAC for read/write, audit of all queries and alerts.
PII redaction at the edge; no secrets in logs; DSAR/Legal Hold.
Isolation: separate clusters/namespaces for sensitive domains.
14) FinOps: cost of observability
Reduce label cardinality and log volume at ingest (not at query time).
Trace sampling + targeted always-on for critical paths.
Downsampling/recording rules for heavy aggregations.
Archiving rare access to cold object.
Metrics: `storage_cost_gb_day`, `query_cost_hour`, `cost_per_rps`, `cost_per_9`.
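A toy cost model tying these metrics together; every price and volume below is illustrative, not a real tariff:

```python
# Hypothetical cost model: prices (USD) and volumes are illustrative.
def observability_cost(hot_gb: float, cold_gb: float, query_hours: float,
                       rps: float, hot_price: float = 0.10,
                       cold_price: float = 0.01, query_price: float = 0.05) -> dict:
    """Daily cost split into the FinOps metrics named in this section."""
    storage_cost_gb_day = hot_gb * hot_price + cold_gb * cold_price
    query_cost = query_hours * query_price
    total = storage_cost_gb_day + query_cost
    return {
        "storage_cost_gb_day": round(storage_cost_gb_day, 2),
        "query_cost_hour": query_price,
        "cost_per_rps": round(total / rps, 4),
    }

print(observability_cost(hot_gb=500, cold_gb=5000, query_hours=24, rps=2000))
```

Even a model this crude makes the hot/cold trade-off visible: here cold storage holds 10x the data for the same spend as hot.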
15) CI/CD and observability tests
Lint metrics/logs in CI: block cardinality "explosions", verify histograms/units.
Observability contract tests: required metrics/log fields, `trace_id` set in middleware.
Canary: release annotations on graphs, SLO-based auto-rollback.
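A contract test for the required log fields of section 3.2 can be a few lines of Python (the field names are taken from that section):

```python
import json

REQUIRED_LOG_FIELDS = {"ts", "level", "service", "env", "trace_id", "span_id", "msg"}

def check_log_contract(raw_line: str) -> set[str]:
    """Return the set of required fields missing from one JSON log line."""
    record = json.loads(raw_line)
    return REQUIRED_LOG_FIELDS - record.keys()

good = ('{"ts": 1, "level": "info", "service": "api", "env": "prod", '
        '"trace_id": "t", "span_id": "s", "msg": "ok"}')
bad = '{"level": "info", "msg": "no ids"}'

print(check_log_contract(good))          # set()
print(sorted(check_log_contract(bad)))   # ['env', 'service', 'span_id', 'trace_id', 'ts']
```

Run it in CI against sample output of each service so a missing `trace_id` fails the build instead of breaking correlation in production.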
16) Examples: quick queries
Top endpoints by errors:

```promql
topk(10, sum by (route) (rate(http_requests_total{status=~"5.."}[5m])))
```

CPU throttling:

```promql
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[5m])) > 0
```

Kafka lag:

```promql
max by (topic, group) (kafka_consumergroup_lag)
```
From logs to traces (Loki → Tempo): pass `trace_id` as a link into the Tempo UI/dashboard.
17) Stack quality: checklist
- Agreed metric/log/trace schemas and units.
- `trace_id` in logs and metrics, drilldown from panels.
- Multi-burn SLO alerts without flapping (quorum/multi-window).
- Downsampling, request quotas, step/range limits.
- Retention and storage classes are documented and applied.
- RBAC/audit enabled, PII redaction in place.
- Dashboards: owner, runbooks, ≤2-3 screens, fast response.
- FinOps-dashboard (volumes, cost, top talkers).
18) Implementation plan (3 iterations)
1. MVP (2 weeks): Prometheus→Mimir, Loki, Tempo; OTel Collector; basic dashboards and SLO alerts; blackbox probes.
2. Scale (3-4 weeks): tail-sampling, downsampling, multi-region ingest, RBAC/Spaces, FinOps-dashboards.
3. Pro (4+ weeks): auto-rollback on SLO, headless synthetics for key paths, Legal Hold, SLO portfolio and reporting.
19) Anti-patterns
"Pretty graphs without SLOs" - no action → no benefit.
High-cardinality labels (`user_id`, `request_id`) - memory and cost explosion.
Logs without JSON and without `trace_id` - no correlation.
Alerting on resources instead of symptoms - noise and on-call burnout.
Lack of retention policies - uncontrolled cost increases.
20) Mini-FAQ
What to choose: Loki or ELK?
ELK for complex search/facets; Loki is cheaper and faster for grep-like scenarios. A hybrid is often used.
Does everyone need traces?
Yes, at least on key paths (login, checkout, payments) with tail-sampling - it dramatically speeds up RCA.
How to start from scratch?
OTel Collector → Mimir/Loki/Tempo → basic SLO and blackbox probes → then dashboards and burn-rate alerts.
Summary
The observability stack is not a set of disparate tools but a coherent system: uniform data standards → M-L-T correlation → SLO alerting and synthetics → security and FinOps. Fix the schemas, label discipline, and retention; connect OTel; add drilldowns and auto-rollback - and you get manageable reliability at a predictable cost.