Infrastructure Monitoring

1) Objectives and scope

Infrastructure monitoring is the system of signals about the health, performance, and availability of the platform. It must:
  • Warn before failures reach users (early detection).
  • Point to the root cause (from symptom to cause).
  • Support SLO gating of releases and auto-rollbacks.
  • Feed the post-incident analysis (evidence as data).

Supporting principles: observability by design, less noise and more signal, automated reactions, a single pane of truth.

2) Observability triad

Metrics: time series for Rate/Errors/Duration (RED) and Utilization/Saturation/Errors (USE).

Logs: event detail with context; contain no secrets/PII.
Traces: distributed cases with causal relationships.

Plus:
  • Profiling (CPU/heap/locks/I/O), eBPF for the system level.
  • Events/audit (K8s Events, changes in configs/secrets).

3) SLI/SLO/SLA: the language of quality

SLI: `availability`, `error_rate`, `p95_latency`, `queue_lag`.

SLO (target): "successful requests ≥ 99.9% over 30 days."

Error budget: the allowed margin of unreliability; used to auto-stop releases when it burns too fast.
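
For intuition: a 99.9% SLO over a 30-day window leaves an error budget of (1 - 0.999) * 30 * 24 * 60 ≈ 43.2 minutes of full unavailability (or the equivalent share of failed requests).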

SLO Example (YAML):
```yaml
service: "api-gateway"
slis:
  - name: success_rate
    query_good: sum(rate(http_requests_total{status!~"5.."}[5m]))
    query_total: sum(rate(http_requests_total[5m]))
    slo: 99.9
    window: 30d
```

4) Monitoring layers map

1. Hosts/VM/nodes: CPU/Load/Steal, RAM/Swap, Disk IOPS/Latency, Filesystem.
2. Network/LB/DNS: RTT, packets/drops, backlog, SYN/Timeout, health-probes.
3. Kubernetes/Orchestrator: API server, etcd, controllers, scheduler; pods/nodes, pending/evicted, throttling, kube-events.
4. Services/containers: RED (Rate/Errors/Duration), readiness/liveness.
5. Databases/caches: QPS, lock wait, replication lag, buffer hit, slow queries.
6. Queues/buses: consumer lag, requeue/dead-letter, throughput.
7. Storage/cloud: S3/Blob errors and latency, 429/503 from providers.
8. Perimeter boundaries: WAF/Rate Limits, 4xx/5xx by route, CDN.
9. Synthetics: scripted HTTP checks (deposit/withdrawal), TLS/certificates.
10. Economy/capacity: cost per service, utilization, headroom.
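
To start covering the host and orchestrator layers, a minimal Prometheus scrape sketch; job names, target addresses, and ports are illustrative defaults, not a prescription:

```yaml
# Minimal scrape sketch for the host/orchestrator layers.
# Targets assume default ports (node_exporter :9100, kube-state-metrics :8080).
scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]
```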

5) Whitebox and Blackbox

Whitebox: exporters/SDKs within services (Prometheus, OpenTelemetry).
Blackbox: external probes from different regions (availability, latency, TLS expiration).

Combine them: "symptom outside" + "diagnosis inside."

Example `blackbox_exporter` module:

```yaml
modules:
  https_2xx:
    prober: http
    http:
      method: GET
      preferred_ip_protocol: "ip4"
```

6) Kubernetes: key signals

Cluster: `apiserver_request_total`, `etcd_server_has_leader`, etcd fsync latency.
Nodes: `container_cpu_cfs_throttled_seconds_total`, `node_pressure`.
Pods: Pending/CrashLoopBackOff, OOMKilled, restarts.
Requests/Limits: requests vs. limits, PodDisruptionBudget, HPA/VPA.
Network: NetworkPolicy drops, conntrack exhaustion.

Dashboards: "Cluster health", "Workload saturation", "Top erroring services".
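
A sketch of a pod-health alert built on kube-state-metrics counters; the threshold and labels are assumptions:

```yaml
# Fires when containers restart repeatedly (CrashLoopBackOff symptom).
- alert: KubePodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
  for: 15m
  labels: { severity: "warning" }
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted >3 times in 15m"
```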

7) DB and queues

PostgreSQL/MySQL: replication lag, deadlocks, slow query %, checkpoint I/O.
Redis/Memcached: hit ratio, evictions, rejected connections.
Kafka/RabbitMQ: consumer lag, unacked, requeue, broker ISR, disk usage.
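
Example alert sketches for lag; the metric names assume the common postgres_exporter and kafka_exporter conventions and may differ in your setup:

```yaml
# Replication and consumer lag; thresholds are illustrative.
- alert: PostgresReplicationLagHigh
  expr: pg_replication_lag > 30
  for: 5m
  labels: { severity: "warning" }
  annotations: { summary: "Replica lag above 30s" }
- alert: KafkaConsumerLagHigh
  expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
  for: 10m
  labels: { severity: "warning" }
  annotations: { summary: "Consumer lag above 10k messages on {{ $labels.topic }}" }
```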

8) RED/USE metrics and business correlations

RED: `rate` (RPS), `errors` (4xx/5xx), `duration` (p95/p99).
USE (for resources): Utilization, Saturation, Errors.
Correlate with product metrics: deposit/payment success, fraud flags, conversion; these serve as guards for canary releases.
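
A sketch of such a guard as a recording rule; the metric `deposit_requests_total` and its `outcome` label are hypothetical placeholders for your product counters:

```yaml
# Business SLI used as a canary guard: share of successful deposits over 5 minutes.
- record: product:deposit_success:ratio_rate5m
  expr: |
    sum(rate(deposit_requests_total{outcome="success"}[5m]))
    /
    sum(rate(deposit_requests_total[5m]))
```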

9) Alerting structure

Tier-1 (page): incidents affecting SLO (availability, 5xx, latency, cluster critical component failure).
Tier-2 (ticket): capacity degradation, error growth without affecting SLO.
Tier-3 (inform): trends, predictive capacity, expiring certificates.

Escalation rules: silence windows, deduplication/grouping, on-call rotations, follow-the-sun.

Example of Alertmanager routes:
```yaml
route:
  group_by: ["service", "severity"]
  receiver: "pager"
  routes:
    - match: { severity: "critical" }
      receiver: "pager"
    - match: { severity: "warning" }
      receiver: "tickets"
```

10) Prometheus Rule Examples

10.1 5xx errors with SLO threshold

```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.005
        for: 10m
        labels: { severity: "critical", service: "api-gateway" }
        annotations:
          summary: "5xx > 0.5% for 10m"
          runbook: "https://runbooks/api-gateway/5xx"
```

10.2 Error-budget burn (multi-window burn rate)

```yaml
- alert: ErrorBudgetBurn
  expr: |
    (1 - (
      sum(rate(http_requests_total{status!~"5.."}[1m]))
      /
      sum(rate(http_requests_total[1m]))
    )) > (1 - 0.999) * 14
  for: 5m
  labels: { severity: "critical", slo: "99.9" }
  annotations: { summary: "Fast burn >14x for 5m" }
```

10.3 System saturation (CPU throttling)

```yaml
- alert: CPUThrottlingHigh
  expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
  for: 10m
  labels: { severity: "warning" }
  annotations: { summary: "CPU throttling >10%" }
```

11) Logs: collection, normalization, retention

Standardization: JSON logs with `ts`, `level`, `service`, `trace_id`, `user`/`tenant`.
Pipeline: agent (Fluent Bit/Vector) → buffer → index/storage.
Redaction: mask PII/secrets at the edge (agent side).
Retention: fast storage class (7-14 days), cold archive (30-180 days).
Semantics: errors and deprecation warnings go to separate channels.
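
A minimal pipeline sketch in Vector terms, assuming JSON lines on disk and illustrative field names and sink; the same shape applies to Fluent Bit:

```yaml
# Sketch: tail JSON logs, drop sensitive fields at the edge, ship to storage.
sources:
  app_logs:
    type: file
    include: ["/var/log/app/*.log"]

transforms:
  normalize:
    type: remap
    inputs: ["app_logs"]
    source: |
      . = parse_json!(.message)     # structured fields: ts, level, service, trace_id
      del(.password)                # edge-side masking of sensitive keys (illustrative)
      del(.card_number)

sinks:
  log_store:
    type: loki
    inputs: ["normalize"]
    endpoint: "http://loki:3100"
    labels:
      service: "{{ service }}"
      level: "{{ level }}"
    encoding:
      codec: json
```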

12) Traces and OpenTelemetry

Instrument entry points (gateway), client→service calls, DB/caches/queues.
Link metric samples to trace IDs (exemplars) for quick navigation.
OTel Collector as a central gateway: filtering, sampling, export to selected backends.

OTel Collector example (fragment):
```yaml
receivers: { otlp: { protocols: { http: {}, grpc: {} } } }
processors:
  batch: {}
  tail_sampling:
    policies:
      - { name: errors, type: status_code, status_codes: [ERROR] }
exporters:
  prometheus: {}
  otlp: { endpoint: "traces.sink:4317" }
service:
  pipelines:
    metrics: { receivers: [otlp], processors: [batch], exporters: [prometheus] }
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp] }
```

13) Synthetics and external checks

HTTP runs of business scenarios (login, deposit, withdrawal, purchase).
TLS/domain: certificate expiry/CAA/DNS health.
Regionality: probes from key countries/providers (routing/blocklists).
Synthetics must alert when the service is unavailable to users, even if internal telemetry is green.
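
To wire the `https_2xx` module from section 5 into Prometheus, a standard probe scrape sketch; the target URLs and exporter address are placeholders:

```yaml
scrape_configs:
  - job_name: "blackbox-https"
    metrics_path: /probe
    params:
      module: [https_2xx]
    static_configs:
      - targets:
          - https://example.com/login      # placeholder business endpoints
          - https://example.com/deposit
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # exporter address is an assumption
```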

14) Profiling and eBPF

Continuous profiling: identification of hot functions, locks.
eBPF: system events (syscalls, TCP retransmits) in production with minimal overhead.
Profiling alerts should be non-paging (tickets); post-release regressions serve as a rollback signal.

15) Dashboards and the "truth panel"

Minimum set:

1. Platform Overview: SLI/SLO by key services, error-budget, alerts.

2. API RED: RPS/ERRORS/DURATION by route.

3. K8s Cluster: control plane, nodes, capacity headroom.

4. DB/Cache: lag/locks/slow query %, hit ratio.

5. Queues: backlog/lag, fail/retry.

6. Per-release: comparison of before/after metrics (canary windows).

7. FinOps: cost per namespace/service, idle/oversized resources.

16) Incidents, alert noise and escalation

Deduplication: grouping by service/cause, cascade suppression.

Silences/maintenance windows: releases and migrations should not turn everything red.
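
One way to keep planned work quiet is Alertmanager mute time intervals; the window name and schedule below are assumptions:

```yaml
# Mute non-critical alerts during a planned release window.
time_intervals:
  - name: release-window
    time_intervals:
      - weekdays: ["tuesday"]
        times:
          - start_time: "10:00"
            end_time: "12:00"
route:
  routes:
    - match: { severity: "warning" }
      mute_time_intervals: ["release-window"]
```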

Runbooks: each critical alert with diagnostic steps and a rollback "button."

Postmortem: timeline, lessons learned, which signals were added or cleaned up.

17) Security in monitoring

RBAC for reading/editing rules/datasources.
Secrets: Exporter/agent tokens - via Secret Manager.
Isolation: per-client/tenant metrics go into separate spaces/dashboards.
Integrity: signature of agents/builds, configs via GitOps (merge review).

18) Finance & Capacity (FinOps)

Quotas and budgets; alerts on abnormal growth.
Right-sizing: analysis of requests/limits, CPU/RAM utilization, spot instances for non-critical tasks.
"Cost per request/tenant" as performance KPIs.

19) Anti-patterns

Only infrastructure metrics, without user-facing SLIs.
100+ alerts "about everything" → on-call blindness.
Logs as the only source (without metrics and tracing).
Mutable dashboards without versioning/review.
No synthetics: "everything is green," but the frontend is unavailable to users.
No link to releases: impossible to answer "what changed at moment X."

20) Implementation checklist (0-60 days)

0-15 days

Define SLI/SLO for 3-5 key services.
Enable basic exporters/agents, standardize JSON logs.
Configure Tier-1 alerts (availability, 5xx, p95).

16-30 days

Add synthetics for critical scenarios.
Enable OTel on input/critical services.
Dashboards "Per-release" and error-budget burn-rules.

31-60 days

Cover DB/queues/cache with advanced signals.
Implement eBPF/profiling for high-CPU services.
GitOps for rules/dashboards/alerts, regular noise cleaning.

21) Maturity metrics

SLO coverage of key services ≥ 95%.
MTTA/MTTR (targets: minutes / tens of minutes).
Proportion of Tier-1 alerts closed by auto-action or quick rollback.
Ratio of useful to noisy alerts > 3:1.
Synthetic coverage of all "money" paths = 100%.

22) Appendix: mini-templates

Prometheus - Availability by Status Class

```yaml
- record: job:http:availability:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
```

Grafana - p95 panel for canary comparison


```yaml
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{version=~"stable|canary"}[5m])) by (le, version))
```

Alertmanager - on-call receiver and inhibition

```yaml
receivers:
  - name: pager
    slack_configs:
      - channel: "#oncall"
        send_resolved: true
inhibit_rules:
  - source_match: { severity: "critical" }
    target_match: { severity: "warning" }
    equal: ["service"]
```

23) Conclusion

Monitoring is not a set of graphs but the operating system of SRE: SLI/SLO as the quality contract, metrics/traces/logs as sources of truth, alerts as a controlled signal, synthetics as the "user's voice," GitOps as the discipline of change. Build a single loop from host to API, tie it to releases and rollbacks, and the platform becomes predictable, fast, and cost-efficient.
