Infrastructure Monitoring
1) Objectives and framing
Infrastructure monitoring is a system of signals about the health, performance, and availability of a platform. It must:
- Warn before users are impacted (early detection).
- Point to the root cause (from symptom to cause).
- Support SLO gating of releases and automatic rollbacks.
- Feed post-incident analysis (evidence as data).
Supporting principles: observable by design; less noise, more signal; automated reactions; a single pane of glass as the source of truth.
2) Observability triad
- Metrics (time series): RED (Rate/Errors/Duration) for services, USE (Utilization/Saturation/Errors) for resources.
- Logs: event detail with context; must not contain secrets/PII.
- Traces: distributed requests with causal relationships.
- Profiling (CPU/heap/lock/io), eBPF for system level.
- Events/audit (K8s Events, changes in configs/secrets).
3) SLI/SLO/SLA - the language of quality
SLI: `availability`, `error_rate`, `p95_latency`, `queue_lag`.
SLO (target): "successful requests ≥ 99.9% over 30 days."
Error budget: the allowed failure margin; used to auto-stop releases.
SLO example (YAML):

```yaml
service: "api-gateway"
slis:
  - name: success_rate
    query_good: sum(rate(http_requests_total{status!~"5.."}[5m]))
    query_total: sum(rate(http_requests_total[5m]))
slo: 99.9
window: 30d
```
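As a sanity check on the budget math: a 99.9% SLO over a 30-day window leaves an error budget of 0.1%, i.e. 0.001 × 30 × 24 × 60 ≈ 43.2 minutes of full unavailability (or the equivalent share of failed requests) before releases should be auto-stopped.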
4) Monitoring layers map
1. Hosts/VM/nodes: CPU/Load/Steal, RAM/Swap, Disk IOPS/Latency, Filesystem.
2. Network/LB/DNS: RTT, packets/drops, backlog, SYN/Timeout, health-probes.
3. Kubernetes/Orchestrator: API server, etcd, controllers, scheduler; pods/nodes, pending/evicted, throttling, kube-events.
4. Services/containers: RED (Rate/Errors/Duration), readiness/liveness.
5. Databases/caches: QPS, lock wait, replication lag, buffer hit, slow queries.
6. Queues/buses: consumer lag, requeue/dead-letter, throughput.
7. Storage/cloud: S3/Blob errors and latency, 429/503 from providers.
8. Edge/perimeter: WAF, rate limits, 4xx/5xx by route, CDN.
9. Synthetics: scripted HTTP checks (deposit/withdrawal), TLS/certificates.
10. Cost/capacity: cost per service, utilization, headroom.
5) Whitebox and blackbox
Whitebox: exporters/SDKs within services (Prometheus, OpenTelemetry).
Blackbox: external samples from different regions (availability, latency, TLS expiration).
Combine both: detect symptoms from the outside, diagnose from the inside.
Example `blackbox_exporter` module:

```yaml
modules:
  https_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
```
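For the module to be useful, Prometheus needs a probe job that rewrites each target into the exporter's `/probe` parameter. A minimal sketch, assuming the exporter listens on `blackbox-exporter:9115` and the probed URL is a placeholder:

```yaml
scrape_configs:
  - job_name: "blackbox-https"
    metrics_path: /probe
    params:
      module: [https_2xx]                          # module defined above
    static_configs:
      - targets: ["https://example.com/health"]    # placeholder target
    relabel_configs:
      - source_labels: [__address__]               # pass the URL as the ?target= parameter
        target_label: __param_target
      - source_labels: [__param_target]            # keep the probed URL as the instance label
        target_label: instance
      - target_label: __address__                  # actually scrape the exporter itself
        replacement: "blackbox-exporter:9115"
```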
6) Kubernetes: key signals
Cluster: `apiserver_request_total`, `etcd_server_has_leader`, etcd fsync latency.
Nodes: `container_cpu_cfs_throttled_seconds_total`, `node_pressure_*` (PSI).
Pods: Pending/CrashLoopBackOff, OOMKilled, restart counts (example rules after this list).
Requests/limits: requests vs limits, PodDisruptionBudget, HPA/VPA.
Network: NetworkPolicy drops, conntrack exhaustion.
Dashboards: "Cluster health", "Workload saturation", "Top erroring services".
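A hedged sketch of the pod-level rules referenced above, assuming kube-state-metrics is installed (it exposes `kube_pod_status_phase` and `kube_pod_container_status_waiting_reason`); thresholds and durations are illustrative:

```yaml
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodsPendingTooLong
        expr: sum(kube_pod_status_phase{phase="Pending"}) by (namespace) > 0
        for: 15m
        labels: { severity: "warning" }
        annotations:
          summary: "Pods stuck in Pending in {{ $labels.namespace }} for 15m"
      - alert: PodCrashLooping
        expr: sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) by (namespace, pod) > 0
        for: 10m
        labels: { severity: "warning" }
        annotations:
          summary: "{{ $labels.pod }} in {{ $labels.namespace }} is in CrashLoopBackOff"
```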
7) DB and queues
PostgreSQL/MySQL: replication lag, deadlocks, slow query %, checkpoint I/O.
Redis/Memcached: hit ratio, evictions, rejected connections.
Kafka/RabbitMQ: consumer lag, unacked, requeue, broker ISR, disk usage.
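A sketch of a queue-lag alert, assuming kafka_exporter-style metric names (`kafka_consumergroup_lag`); rename the metric to whatever your exporter actually exposes and tune the threshold to your traffic:

```yaml
- alert: KafkaConsumerLagHigh
  expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
  for: 15m
  labels: { severity: "warning" }
  annotations:
    summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```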
8) RED/USE metrics and business correlations
RED: `rate` (RPS), `errors` (4xx/5xx), `duration` (p95/p99).
USE (for resources): Utilization, Saturation, Errors.
Correlate with product metrics: deposit/payment success, fraud flags, conversion; these serve as guardrails for canary releases (see the sketch below).
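One way to express such a guard as a rule, assuming requests carry a `version` label (as in the canary query in section 22); the 2x tolerance is illustrative, not a recommendation:

```yaml
- alert: CanaryErrorRateRegression
  expr: |
    (
      sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{version="canary"}[5m]))
    ) > 2 * (
      sum(rate(http_requests_total{version="stable",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{version="stable"}[5m]))
    )
  for: 10m
  labels: { severity: "critical" }
  annotations:
    summary: "Canary 5xx rate is more than 2x the stable baseline"
```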
9) Alerting structure
Tier-1 (page): incidents affecting SLO (availability, 5xx, latency, cluster critical component failure).
Tier-2 (ticket): capacity degradation, error growth without affecting SLO.
Tier-3 (inform): trends, predictive capacity, expiring certificates.
Escalation rules: silence windows and duplicate suppression, on-call rotations, follow-the-sun.
Example Alertmanager routes:

```yaml
route:
  group_by: ["service", "severity"]
  receiver: "pager"
  routes:
    - match: { severity: "critical" }
      receiver: "pager"
    - match: { severity: "warning" }
      receiver: "tickets"
```
10) Prometheus Rule Examples
10.1 5xx errors against the SLO threshold

```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.005
        for: 10m
        labels: { severity: "critical", service: "api-gateway" }
        annotations:
          summary: "5xx > 0.5% for 10m"
          runbook: "https://runbooks/api-gateway/5xx"
```
10.2 Error-budget burn (multi-window burn)

```yaml
- alert: ErrorBudgetBurn
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[1m])) /
      sum(rate(http_requests_total[1m]))
    )) > 14 * (1 - 0.999)
  for: 5m
  labels: { severity: "critical", slo: "99.9" }
  annotations: { summary: "Fast burn >14x for 5m" }
```
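To make the policy genuinely multi-window, pair the fast rule with a slower companion that catches gradual burn; a sketch with illustrative thresholds (the classic SRE Workbook pairing uses roughly 14x and 6x burn rates):

```yaml
- alert: ErrorBudgetBurnSlow
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[30m])) /
      sum(rate(http_requests_total[30m]))
    )) > 6 * (1 - 0.999)
  for: 15m
  labels: { severity: "warning", slo: "99.9" }
  annotations: { summary: "Slow burn >6x for 15m" }
```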
10.3 System saturation (CPU throttling)

```yaml
- alert: CPUThrottlingHigh
  expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
  for: 10m
  labels: { severity: "warning" }
  annotations: { summary: "CPU throttling >10%" }
```
11) Logs: collection, normalization, retention
Standardization: JSON logs with `ts`, `level`, `service`, `trace_id`, `user`/`tenant`.
Pipeline: agent (Fluent Bit/Vector) → buffer → index/storage.
Redaction: mask PII/secrets at the edge (see the sketch after this list).
Retention: fast storage class (7-14 days), cold archive (30-180 days).
Semantics: route error and deprecation events to separate channels.
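A minimal sketch of the edge-masking step, here using Vector as the agent (Fluent Bit works just as well); paths, field names, and the sink are placeholders:

```yaml
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]      # placeholder path

transforms:
  mask_pii:
    type: remap
    inputs: ["app_logs"]
    source: |
      # parse the JSON payload, then drop or hash sensitive fields at the edge
      . = object!(parse_json!(.message))
      del(.password)
      if exists(.email) { .email = sha2(string!(.email)) }

sinks:
  log_buffer:
    type: console                        # placeholder; point at your real buffer/storage
    inputs: ["mask_pii"]
    encoding:
      codec: json
```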
12) Traces and OpenTelemetry
Instrument entry points (gateway), client→service calls, DBs/caches/queues.
Bind metrics to trace attributes (Exemplars) for quick navigation.
OTel Collector as a central gateway: filtering, sampling, export to selected backends.
```yaml
receivers: { otlp: { protocols: { http: {}, grpc: {} } } }
processors:
  batch: {}
  tail_sampling:                   # shipped in the Collector "contrib" distribution
    policies:
      - { name: errors, type: status_code, status_codes: [ERROR] }
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"       # the prometheus exporter needs a listen endpoint
  otlp:
    endpoint: "traces.sink:4317"
service:
  pipelines:
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp] }
```
13) Synthetics and external checks
Scripted HTTP runs of business scenarios (login, deposit, withdrawal, purchase).
TLS/domain: certificate expiry, CAA, DNS health.
Regional coverage: probes from key countries/providers (routing, blocklists).
Synthetics must alert when the service is unavailable to users, even if internal telemetry is green (example rule below).
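A certificate-expiry rule on top of blackbox_exporter's `probe_ssl_earliest_cert_expiry` metric; the 14-day threshold is illustrative:

```yaml
- alert: TLSCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
  for: 1h
  labels: { severity: "warning" }
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```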
14) Profiling and eBPF
Continuous profiling: identify hot functions and lock contention.
eBPF: system-level events (syscalls, TCP retransmits) in production with minimal overhead.
Alert on profile anomalies without paging (tickets only); treat post-release regressions as a rollback signal.
15) Dashboards and the "truth panel"
Minimum set:
1. Platform overview: SLI/SLO for key services, error-budget burn, active alerts.
2. API RED: RPS/ERRORS/DURATION by route.
3. K8s cluster: control plane, nodes, capacity headroom.
4. DB/Cache: lag/locks/slow query %, hit ratio.
5. Queues: backlog/lag, fail/retry.
6. Per-release: comparison of before/after metrics (canary windows).
7. FinOps: cost per namespace/service, idle/oversized resources.
16) Incidents, alert noise and escalation
Deduplication: group by service/cause, suppress cascading alerts.
Silences/maintenance windows: releases and migrations should not paint everything red (see the muting sketch after this list).
Runbooks: every critical alert links to diagnostic steps and a rollback "button."
Postmortem: timeline, lessons learned, which signals were added or cleaned up.
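A sketch of the maintenance muting mentioned above, using Alertmanager time intervals (available in recent Alertmanager releases); the window name and schedule are placeholders:

```yaml
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ["saturday"]
        times:
          - start_time: "02:00"
            end_time: "04:00"
route:
  routes:
    - match: { severity: "warning" }
      receiver: "tickets"
      mute_time_intervals: ["weekly-maintenance"]
```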
17) Safety in monitoring
RBAC for reading/editing rules/datasources.
Secrets: exporter/agent tokens via a secrets manager.
Isolation: per-client/tenant metrics in separate workspaces/tenants.
Integrity: signature of agents/builds, configs via GitOps (merge review).
18) Finance & Capacity (FinOps)
Quotas and budgets; alerts on abnormal growth.
Right-sizing: analysis of requests/limits, CPU/RAM utilization, spot instances for non-critical tasks.
"Cost per request/tenant" as an efficiency KPI (see the recording-rule sketch below).
19) Anti-patterns
Only infrastructure metrics, with no user-facing/business SLIs.
100+ alerts "about everything" → on-call blindness.
Logs as the only source (without metrics and tracing).
Mutable dashboards without versioning/review.
No synthetics: "everything is green," but the frontend is unavailable to users.
No link to releases: impossible to answer "what changed at moment X."
20) Implementation checklist (0-60 days)
0-15 days
Define SLI/SLO for 3-5 key services.
Enable basic exporters/agents, standardize JSON logs.
Configure Tier-1 alerts (availability, 5xx, p95).
16-30 days
Add synthetics for critical scenarios.
Enable OTel on input/critical services.
Build "Per-release" dashboards and error-budget burn rules.
31-60 days
Cover DB/queues/cache with advanced signals.
Implement eBPF/profiling for high-CPU services.
GitOps for rules/dashboards/alerts, regular noise cleaning.
21) Maturity metrics
SLO coverage of key services ≥ 95%.
MTTA/MTTR (targets: minutes / tens of minutes).
Proportion of Tier-1 alerts closed by auto-action or quick rollback.
Ratio of useful to noisy alerts > 3:1.
Synthetic coverage of all "money" paths = 100%.
22) Applications: mini-templates
Prometheus - Availability by Status Class
```yaml
- record: job:http:availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
```
Grafana - p95 for canaries

```yaml
expr: >
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m])) by (le, version))
```
Alertmanager - on-call receiver and inhibition

```yaml
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true

inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["service"]
```
23) Conclusion
Monitoring is not a set of graphs but the operating system of SRE: SLI/SLO as the quality contract, metrics/traces/logs as sources of truth, alerts as a controlled signal, synthetics as the "user's voice," GitOps as the discipline of change. Build a single loop from host to API, tie it to releases and rollbacks, and the platform stays predictable, fast, and economical.