Infrastructure Monitoring
1) Objectives and framing
Infrastructure monitoring is a system of signals about the health, performance, and availability of a platform. It must:
- Warn before users are impacted (early detection).
- Point to the root cause (from symptom to cause).
- Support SLO gating of releases and automatic rollbacks.
- Feed post-incident analysis (evidence as data).
Supporting principles: observable by design; less noise, more signal; automated reactions; a single pane of glass as the source of truth.
2) Observability triad
- Metrics (time series): RED (Rate/Errors/Duration) for services, USE (Utilization/Saturation/Errors) for resources.
- Logs: event detail with context; must not contain secrets/PII.
- Traces: distributed requests with causal relationships.
- Profiling (CPU/heap/lock/io), eBPF for system level.
- Events/audit (K8s Events, changes in configs/secrets).
3) SLI/SLO/SLA - the language of quality
SLI: `availability`, `error_rate`, `p95_latency`, `queue_lag`.
SLO (target): "successful requests ≥ 99.9% over 30 days."
Error budget: the allowed failure margin; used to auto-stop releases.
SLO example (YAML):

```yaml
service: "api-gateway"
slis:
  - name: success_rate
    query_good: sum(rate(http_requests_total{status!~"5.."}[5m]))
    query_total: sum(rate(http_requests_total[5m]))
slo: 99.9
window: 30d
```
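As a sanity check on the budget math: a 99.9% SLO over a 30-day window leaves an error budget of 0.1%, i.e. 0.001 × 30 × 24 × 60 ≈ 43.2 minutes of full unavailability (or the equivalent share of failed requests) before releases should be auto-stopped.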
4) Monitoring layers map
1. Hosts/VM/nodes: CPU/Load/Steal, RAM/Swap, Disk IOPS/Latency, Filesystem.
2. Network/LB/DNS: RTT, packets/drops, backlog, SYN/Timeout, health-probes.
3. Kubernetes/Orchestrator: API server, etcd, controllers, scheduler; pods/nodes, pending/evicted, throttling, kube-events.
4. Services/containers: RED (Rate/Errors/Duration), readiness/liveness.
5. Databases/caches: QPS, lock wait, replication lag, buffer hit, slow queries.
6. Queues/buses: consumer lag, requeue/dead-letter, throughput.
7. Storage/cloud: S3/Blob errors and latency, 429/503 from providers.
8. Edge/perimeter: WAF, rate limits, 4xx/5xx by route, CDN.
9. Synthetics: scripted HTTP checks (deposit/withdrawal), TLS/certificates.
10. Cost/capacity: cost per service, utilization, headroom.
5) Whitebox and blackbox
Whitebox: exporters/SDKs within services (Prometheus, OpenTelemetry).
Blackbox: external samples from different regions (availability, latency, TLS expiration).
Combine both: detect symptoms from the outside, diagnose from the inside.
Example `blackbox_exporter` module:

```yaml
modules:
  https_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
```
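For the module to be useful, Prometheus needs a probe job that rewrites each target into the exporter's `/probe` parameter. A minimal sketch, assuming the exporter listens on `blackbox-exporter:9115` and the probed URL is a placeholder:

```yaml
scrape_configs:
  - job_name: "blackbox-https"
    metrics_path: /probe
    params:
      module: [https_2xx]                          # module defined above
    static_configs:
      - targets: ["https://example.com/health"]    # placeholder target
    relabel_configs:
      - source_labels: [__address__]               # pass the URL as the ?target= parameter
        target_label: __param_target
      - source_labels: [__param_target]            # keep the probed URL as the instance label
        target_label: instance
      - target_label: __address__                  # actually scrape the exporter itself
        replacement: "blackbox-exporter:9115"
```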
6) Kubernetes: key signals
Cluster: `apiserver_request_total`, `etcd_server_has_leader`, etcd fsync latency.
Nodes: `container_cpu_cfs_throttled_seconds_total`, `node_pressure_*` (PSI).
Pods: Pending/CrashLoopBackOff, OOMKilled, restart counts (example rules after this list).
Requests/limits: requests vs limits, PodDisruptionBudget, HPA/VPA.
Network: NetworkPolicy drops, conntrack exhaustion.
Dashboards: "Cluster health", "Workload saturation", "Top erroring services".
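A hedged sketch of the pod-level rules referenced above, assuming kube-state-metrics is installed (it exposes `kube_pod_status_phase` and `kube_pod_container_status_waiting_reason`); thresholds and durations are illustrative:

```yaml
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodsPendingTooLong
        expr: sum(kube_pod_status_phase{phase="Pending"}) by (namespace) > 0
        for: 15m
        labels: { severity: "warning" }
        annotations:
          summary: "Pods stuck in Pending in {{ $labels.namespace }} for 15m"
      - alert: PodCrashLooping
        expr: sum(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}) by (namespace, pod) > 0
        for: 10m
        labels: { severity: "warning" }
        annotations:
          summary: "{{ $labels.pod }} in {{ $labels.namespace }} is in CrashLoopBackOff"
```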
7) DB and queues
PostgreSQL/MySQL: replication lag, deadlocks, slow query %, checkpoint I/O.
Redis/Memcached: hit ratio, evictions, rejected connections.
Kafka/RabbitMQ: consumer lag, unacked, requeue, broker ISR, disk usage.
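A sketch of a queue-lag alert, assuming kafka_exporter-style metric names (`kafka_consumergroup_lag`); rename the metric to whatever your exporter actually exposes and tune the threshold to your traffic:

```yaml
- alert: KafkaConsumerLagHigh
  expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
  for: 15m
  labels: { severity: "warning" }
  annotations:
    summary: "Consumer group {{ $labels.consumergroup }} is lagging on {{ $labels.topic }}"
```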
8) RED/USE metrics and business correlations
RED: `rate` (RPS), `errors` (4xx/5xx), `duration` (p95/p99).
USE (for resources): Utilization, Saturation, Errors.
Correlate with product metrics: deposit/payment success, fraud flags, conversion; these serve as guardrails for canary releases (see the sketch below).
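One way to express such a guard as a rule, assuming requests carry a `version` label (as in the canary query in section 22); the 2x tolerance is illustrative, not a recommendation:

```yaml
- alert: CanaryErrorRateRegression
  expr: |
    (
      sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{version="canary"}[5m]))
    ) > 2 * (
      sum(rate(http_requests_total{version="stable",status=~"5.."}[5m]))
        / sum(rate(http_requests_total{version="stable"}[5m]))
    )
  for: 10m
  labels: { severity: "critical" }
  annotations:
    summary: "Canary 5xx rate is more than 2x the stable baseline"
```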
9) Alerting structure
Tier-1 (page): incidents affecting SLO (availability, 5xx, latency, cluster critical component failure).
Tier-2 (ticket): capacity degradation, error growth without affecting SLO.
Tier-3 (inform): trends, predictive capacity, expiring certificates.
Escalation rules: silence windows and duplicate suppression, on-call rotations, follow-the-sun.
Example Alertmanager routes:

```yaml
route:
  group_by: ["service", "severity"]
  receiver: "pager"
  routes:
    - match: { severity: "critical" }
      receiver: "pager"
    - match: { severity: "warning" }
      receiver: "tickets"
```
10) Prometheus Rule Examples
10.1 5xx errors against the SLO threshold

```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.005
        for: 10m
        labels: { severity: "critical", service: "api-gateway" }
        annotations:
          summary: "5xx > 0.5% for 10m"
          runbook: "https://runbooks/api-gateway/5xx"
```
10.2 Error-budget burn (multi-window burn)

```yaml
- alert: ErrorBudgetBurn
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[1m])) /
      sum(rate(http_requests_total[1m]))
    )) > 14 * (1 - 0.999)
  for: 5m
  labels: { severity: "critical", slo: "99.9" }
  annotations: { summary: "Fast burn >14x for 5m" }
```
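To make the policy genuinely multi-window, pair the fast rule with a slower companion that catches gradual burn; a sketch with illustrative thresholds (the classic SRE Workbook pairing uses roughly 14x and 6x burn rates):

```yaml
- alert: ErrorBudgetBurnSlow
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[30m])) /
      sum(rate(http_requests_total[30m]))
    )) > 6 * (1 - 0.999)
  for: 15m
  labels: { severity: "warning", slo: "99.9" }
  annotations: { summary: "Slow burn >6x for 15m" }
```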
10.3 System saturation (CPU throttling)

```yaml
- alert: CPUThrottlingHigh
  expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
  for: 10m
  labels: { severity: "warning" }
  annotations: { summary: "CPU throttling >10%" }
```
11) Logs: collection, normalization, retention
Standardization: JSON logs with `ts`, `level`, `service`, `trace_id`, `user`/`tenant`.
Pipeline: agent (Fluent Bit/Vector) → buffer → index/storage.
Redaction: mask PII/secrets at the edge (see the sketch after this list).
Retention: fast storage class (7-14 days), cold archive (30-180 days).
Semantics: route error and deprecation events to separate channels.
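A minimal sketch of the edge-masking step, here using Vector as the agent (Fluent Bit works just as well); paths, field names, and the sink are placeholders:

```yaml
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]      # placeholder path

transforms:
  mask_pii:
    type: remap
    inputs: ["app_logs"]
    source: |
      # parse the JSON payload, then drop or hash sensitive fields at the edge
      . = object!(parse_json!(.message))
      del(.password)
      if exists(.email) { .email = sha2(string!(.email)) }

sinks:
  log_buffer:
    type: console                        # placeholder; point at your real buffer/storage
    inputs: ["mask_pii"]
    encoding:
      codec: json
```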
12) Traces and OpenTelemetry
Instrument entry points (gateway), client→service calls, DBs/caches/queues.
Bind metrics to trace attributes (Exemplars) for quick navigation.
OTel Collector as a central gateway: filtering, sampling, export to selected backends.
```yaml
receivers: { otlp: { protocols: { http: {}, grpc: {} } } }
processors:
  batch: {}
  tail_sampling:                   # shipped in the Collector "contrib" distribution
    policies:
      - { name: errors, type: status_code, status_codes: [ERROR] }
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"       # the prometheus exporter needs a listen endpoint
  otlp:
    endpoint: "traces.sink:4317"
service:
  pipelines:
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp] }
```
13) Synthetics and external checks
Scripted HTTP runs of business scenarios (login, deposit, withdrawal, purchase).
TLS/domain: certificate expiry, CAA, DNS health.
Regional coverage: probes from key countries/providers (routing, blocklists).
Synthetics must alert when the service is unavailable to users, even if internal telemetry is green (example rule below).
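A certificate-expiry rule on top of blackbox_exporter's `probe_ssl_earliest_cert_expiry` metric; the 14-day threshold is illustrative:

```yaml
- alert: TLSCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
  for: 1h
  labels: { severity: "warning" }
  annotations:
    summary: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```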
14) Profiling and eBPF
Continuous profiling: identify hot functions and lock contention.
eBPF: system-level events (syscalls, TCP retransmits) in production with minimal overhead.
Alert on profile anomalies without paging (tickets only); treat post-release regressions as a rollback signal.
15) Dashboards and the "truth panel"
Minimum set:
1. Platform overview: SLI/SLO for key services, error-budget burn, active alerts.
2. API RED: RPS/ERRORS/DURATION by route.
3. K8s cluster: control plane, nodes, capacity headroom.
4. DB/Cache: lag/locks/slow query %, hit ratio.
5. Queues: backlog/lag, fail/retry.
6. Per-release: comparison of before/after metrics (canary windows).
7. FinOps: cost per namespace/service, idle/oversized resources.
16) Incidents, alert noise and escalation
Deduplication: group by service/cause, suppress cascading alerts.
Silences/maintenance windows: releases and migrations should not paint everything red (see the muting sketch after this list).
Runbooks: every critical alert links to diagnostic steps and a rollback "button."
Postmortem: timeline, lessons learned, which signals were added or cleaned up.
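A sketch of the maintenance muting mentioned above, using Alertmanager time intervals (available in recent Alertmanager releases); the window name and schedule are placeholders:

```yaml
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ["saturday"]
        times:
          - start_time: "02:00"
            end_time: "04:00"
route:
  routes:
    - match: { severity: "warning" }
      receiver: "tickets"
      mute_time_intervals: ["weekly-maintenance"]
```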
17) Safety in monitoring
RBAC for reading/editing rules/datasources.
Secrets: exporter/agent tokens via a secrets manager.
Isolation: per-client/tenant metrics in separate workspaces/tenants.
Integrity: signature of agents/builds, configs via GitOps (merge review).
18) Finance & Capacity (FinOps)
Quotas and budgets; alerts on abnormal growth.
Right-sizing: analysis of requests/limits, CPU/RAM utilization, spot instances for non-critical tasks.
"Cost per request/tenant" as an efficiency KPI (see the recording-rule sketch below).
19) Anti-patterns
Only infrastructure metrics, with no user-facing/business SLIs.
100+ alerts "about everything" → on-call blindness.
Logs as the only source (without metrics and tracing).
Mutable dashboards without versioning/review.
No synthetics: "everything is green," but the frontend is unavailable to users.
No link to releases: impossible to answer "what changed at moment X."
20) Implementation checklist (0-60 days)
0-15 days
Define SLI/SLO for 3-5 key services.
Enable basic exporters/agents, standardize JSON logs.
Configure Tier-1 alerts (availability, 5xx, p95).
16-30 days
Add synthetics for critical scenarios.
Enable OTel on input/critical services.
Build "Per-release" dashboards and error-budget burn rules.
31-60 days
Cover DB/queues/cache with advanced signals.
Implement eBPF/profiling for high-CPU services.
GitOps for rules/dashboards/alerts, regular noise cleaning.
21) Maturity metrics
SLO coverage of key services ≥ 95%.
MTTA/MTTR (targets: minutes / tens of minutes).
Proportion of Tier-1 alerts closed by auto-action or quick rollback.
Ratio of useful to noisy alerts > 3:1.
Synthetic coverage of all "money" paths = 100%.
22) Applications: mini-templates
Prometheus - Availability by Status Class
```yaml
- record: job:http:availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m])) /
    sum(rate(http_requests_total[5m]))
```
Grafana - p95 for canaries

```yaml
expr: >
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m])) by (le, version))
```
Alertmanager - on-call receiver and inhibition

```yaml
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true

inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["service"]
```
23) Conclusion
Monitoring is not a set of graphs but the operating system of SRE: SLI/SLO as the quality contract, metrics/traces/logs as sources of truth, alerts as a controlled signal, synthetics as the "user's voice," GitOps as the discipline of change. Build a single loop from host to API, tie it to releases and rollbacks, and the platform stays predictable, fast, and economical.