Operations and Alerting → System Capacity Management
System capacity alerts
1) Why do you need it
Capacity alerts warn of approaching technical limits well before an incident: "we are at 80% of the ceiling - time to scale." For a product business this is directly about money: missed bets/deposits, dropped sessions, live-game delays and provider failures = lost revenue, reputational damage, fines and chargebacks.
Objectives:
- Predictably withstand peak loads (events, tournaments, streams, large campaigns).
- Turn on auto-scaling in time and plan capacity uplift.
- Reduce noise and page people only when the SLO or money is at risk.
- Give engineers precise recommendations via runbooks.
2) Basic concepts
Capacity: maximum stable throughput (RPS/TPS, connections, IOPS, bandwidth).
Headroom: the margin between current load and the limit.
SLO/SLA: target levels of availability/response time; alerts must be "SLO-aware."
Burn rate: how fast the SLO error/latency budget is being consumed (see the sketch below).
High/Low Watermark: upper/lower levels for triggering and auto-resolving.
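As a minimal sketch (in the same pseudo-Prometheus style as the rule examples in section 7), burn rate can be expressed as the observed error ratio divided by the budget allowed by the SLO; the metric names and the 99.9% availability target are assumptions:

api:error_ratio:1h = sum(rate(http_requests_total{code=~"5.."}[1h]))
                     / sum(rate(http_requests_total[1h]))

# For a 99.9% SLO the error budget is 0.001.
# Burn rate 1 = the budget is consumed exactly over the SLO window;
# burn rate 4 = it will be gone in a quarter of that time.
api:slo_burnrate:1h = api:error_ratio:1h / 0.001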
3) Signal architecture and data sources
Telemetry: metrics (Prometheus/OTel), logs (ELK/ClickHouse), traces (OTel/Jaeger).
Layered approach: alerts per layer (Edge → API → business services → queues/streams → databases/caches → file/object storage → external providers).
Context: feature flags, releases, marketing campaigns, tournaments, geo breakdown.
Incident bus: Alertmanager/PagerDuty/Opsgenie/Slack; links to runbooks and the escalation matrix.
4) Key metrics by layer (what to monitor and why)
Edge / L7
RPS, 95-/99-percentile latency, error rate (5xx/4xx), open connections.
Rate limits/quotas, drops at the CDN/WAF/firewall.
API gateway / Backend-for-Frontend
Worker/thread-pool saturation, request queue depth, timeouts to downstreams.
Share of degraded responses (fallbacks, circuit breakers).
Queues/Streaming (Kafka/Rabbit/Pulsar)
Lag/consumer delay, backlog growth rate, throughput (msg/s, MB/s).
Partition skew, rebalancing churn, ISR (for Kafka), retries/dead-letter queues.
Asynchronous workers
Task processing time, queue length, percentage of tasks breaching their SLA.
CPU/Memory/FD saturation in worker pools.
Caches (Redis/Memcached)
Hit ratio, latency, evictions, used memory, connected clients/ops/s.
Clusters: slots/replicas, failover events.
Databases (PostgreSQL/MySQL/ClickHouse)
Active connections vs max, lock waits, replication lag, buffer/cache hit.
IOPS, read/write latency, checkpoint/flush, bloat/fragmentation.
Object/File Storage
PUT/GET latency, 4xx/5xx, egress, requests/sec, provider limits.
External Providers (Payments/KYC/Game Providers)
TPS limits, QPS windows, error rate/timeouts, retry queue, "cost per call."
Infrastructure
CPU/Memory/FD/IOPS/Network saturation on nodes/pods/ASG.
HPA/VPA events, pending pods, container OOM/Throttling.
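To make these layers comparable on a single "headroom" view, it can help to normalize each one into a 0..1 saturation ratio with recording rules. A short sketch in the same pseudo-Prometheus style; all metric names are assumptions and depend on your exporters:

# Headroom = 1 - saturation; each layer reduced to a 0..1 ratio
db:saturation:connections   = pg_stat_activity_active / pg_max_connections
cache:saturation:memory     = redis_used_memory / redis_maxmemory
api:saturation:latency      = api_request_duration_p99 / api_latency_slo_seconds
stream:saturation:consumers = hpa_desired_replicas / hpa_max_replicas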
5) Types of capacity alerts
1. Static thresholds
Simple and straightforward: 'db_connections > 80% of max'. Good as a baseline safety-net signal.
2. Adaptive (dynamic) thresholds
Based on seasonality and trend (rolling windows, STL decomposition). They catch values that are "unusually high for this hour/day of the week."
3. SLO-oriented (burn-rate)
Triggered when the error-budget burn rate threatens the SLO within an X-hour horizon.
4. Prognostic (forecast-alerts)
"After 20 minutes in the current trend, the queue will reach 90%." Linear/Robust/Prophet-like prediction on short windows is used.
5. Multi-signal
Trigger on a combination: 'queue_lag ↑' + 'consumer_cpu > 85%' + 'autoscaling at max' → "manual intervention is needed."
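A sketch of a forecast alert (type 4 above) using PromQL's predict_linear; the 90% threshold, the 20-minute horizon, and the metric names are assumptions:

ALERT QueueWillSaturateSoon
  IF predict_linear(rabbitmq_queue_messages[15m], 20*60)
     > 0.9 * rabbitmq_queue_max_length
  FOR 5m
  LABELS {severity="warning", team="streaming"}
  ANNOTATIONS {summary="At the current trend the queue reaches 90% within ~20 minutes"}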
6) Threshold policies and anti-noise
High/Low Watermark:
- Up: warning at 70-75%, critical at 85-90%. Down: hysteresis of 5-10 pp so alerts do not flap around the threshold (see the hysteresis sketch at the end of this section).
- 'for: 5m' for criticals, 'for: 10-15m' for warnings. Night mode: route non-critical alerts to chat without paging.
- Group by service/cluster/geo so as not to multiply incident cards.
- Suppress dependent alerts: if the KYC provider is down and the API errors are caused by it, page the integration owner, not every consumer.
- During promo periods, raise noise thresholds for the "expected growth," but leave SLO alerts intact.
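One commonly used workaround for watermark hysteresis in Prometheus-style rules is to keep an alert firing at the lower threshold once it has already fired, by referencing the built-in ALERTS series. A sketch only, with assumed metric names:

ALERT DbConnectionsHigh
  IF (pg_stat_activity_active / pg_max_connections) > 0.85
     OR (
       (pg_stat_activity_active / pg_max_connections) > 0.75
       AND ON() ALERTS{alertname="DbConnectionsHigh", alertstate="firing"} == 1
     )
  FOR 5m
  LABELS {severity="critical", team="core-db"}

The alert fires above 85% and only resolves once the ratio drops below 75%, avoiding the 80%/79% "sawing" described in the anti-patterns section.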
7) Rule examples (pseudo-Prometheus)
DB connections:
ALERT PostgresConnectionsHigh
IF (pg_stat_activity_active / pg_max_connections) > 0.85
FOR 5m
LABELS {severity="critical", team="core-db"}
ANNOTATIONS {summary="Postgres connections >85%"}
Kafka lag + auto-scaling at the limit:
ALERT StreamBacklogAtRisk
IF (kafka_consumer_lag > 5_000_000 AND rate(kafka_consumer_lag[5m]) > 50_000)
AND (hpa_desired_replicas == hpa_max_replicas)
FOR 10m
LABELS {severity="critical", team="streaming"}
Burn-rate SLO (API latency):
ALERT ApiLatencySLOBurn
IF slo_latency_budget_burnrate{le="300ms"} > 4
FOR 15m
LABELS {severity="page", team="api"}
ANNOTATIONS {runbook="wiki://runbooks/api-latency"}
Redis memory and evictions:
ALERT RedisEvictions
IF rate(redis_evicted_keys_total[5m]) > 0
AND (redis_used_memory / redis_maxmemory) > 0.8
FOR 5m
LABELS {severity="warning", team="caching"}
Payment Provider - Limits:
ALERT PSPThroughputLimitNear
IF increase(psp_calls_total[10m]) > 0.9 * psp_rate_limit_window
FOR 5m
LABELS {severity="warning", team="payments", provider="PSP-X"}
8) SLO approach and business priority
From signal to business impact: capacity alerts should state the risk to the SLO and to business metrics (specific games/geos, GGR, deposit conversion).
Multi-level severity: warnings go to the service on-call; criticals page the domain owner; an SLO breach opens a major incident and a shared incident channel (see the multi-window burn-rate sketch below).
Degradation levers: automatic load shedding (partial read-only mode, cutting heavy features, reducing the frequency of jackpot broadcasts, turning off "heavy" animations in live games).
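A multi-window burn-rate pairing is a common way to implement these severity levels. The sketch below assumes recording rules like the burn-rate example in section 2 and a 30-day SLO window (burn rate 14.4 ≈ 2% of the monthly budget consumed in one hour); all names are illustrative:

ALERT ApiAvailabilitySLOFastBurn
  IF api:slo_burnrate:1h > 14.4 AND api:slo_burnrate:5m > 14.4
  FOR 2m
  LABELS {severity="page", team="api"}

ALERT ApiAvailabilitySLOSlowBurn
  IF api:slo_burnrate:6h > 6 AND api:slo_burnrate:30m > 6
  FOR 15m
  LABELS {severity="warning", team="api"}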
9) Auto-scaling and "correct" triggers
HPA/VPA: target not only CPU/Memory but also business metrics (RPS, queue lag, p99 latency); a sketch follows this list.
Warm-up timings: account for cold starts and provider limits (ASG spin-up, container start-up, cache warm-up).
Guardrails: stop conditions when errors grow avalanche-style; protection against "scaling the problem."
Capacity playbooks: where and how to add a shard/partition/replica, how to redistribute traffic across regions.
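As an illustration of scaling on a business metric rather than CPU, a minimal HorizontalPodAutoscaler sketch (Kubernetes autoscaling/v2) targeting consumer lag exposed through a metrics adapter; the workload name, metric name and numbers are assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bet-settlement-consumer        # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bet-settlement-consumer
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag     # name depends on your metrics adapter
          selector:
            matchLabels:
              topic: bets
        target:
          type: AverageValue
          averageValue: "10000"        # target lag per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # acts as hysteresis against flapping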
10) Process: from design to operation
1. Limit mapping: collect the "true" bottleneck limits for each layer (max connections, IOPS, TPS, provider quotas).
2. Select predictor metrics: which signals are the first to indicate "we hit the limit in N minutes."
3. Threshold design: high/low + SLO-burn + compound.
4. Runbook for every critical alert: diagnostic steps ("what to open," "which commands," "where to escalate") and three action options: quick workaround, scaling, degradation.
5. Testing: load simulations (chaos/game days), dry runs of alerts, anti-noise checks.
6. Review and adoption: signal owner = service owner. No owner - no page.
7. Retrospectives and tuning: weekly review of false positives/misses; track MTTA (ack), MTTD, MTTR, and the noise/signal ratio.
11) Anti-patterns
CPU > 90% ⇒ panic: without correlation with latency/queues, this may be normal.
"One threshold for all": different regions/time zones - different traffic profiles.
Alert without a runbook: a page without clear actions burns out the on-call.
Blindness to providers: external quotas/limits are often the first thing to break user flows (PSP, KYC, anti-fraud, game providers).
No hysteresis: flapping at the 80%/79% boundary.
12) Features of iGaming/financial platforms
Scheduled peaks: prime time, tournament finals, major matches; scale up replicas and warm caches in advance.
Live streams and jackpots: bursts of broadcast events → limits on brokers/WebSockets.
Payments and KYC: provider windows, anti-fraud scoring; keep backup routes and a "grace mode" for deposits.
Geo-balancing: on local provider failures, divert traffic to a neighboring region that has headroom.
Accountability: when bets/jackpots are at risk - an immediate page to the domain team plus a business alert.
13) Dashboards (minimum set)
Capacity Overview: headroom by layer, top 3 risky areas, burn-rate SLO.
Stream & Queues: lag, backlog growth, consumer saturation, HPA state.
DB & Cache: connections, repl-lag, p95/p99 latency, hit ratio, evictions.
Providers: TPS/windows/quotas, timeouts/errors, call cost.
Release/Feature context: releases/feature flags annotated next to the graphs.
14) Implementation checklist
- List of "true" limits and owners.
- Predictor metrics map + inter-layer associations.
- Static thresholds + hysteresis.
- SLO-burn-alerts on critical paths (deposit, bet, live game launch).
- Predictive alerts on queue/streams/connections.
- Suppression/maintenance windows; anti-noise policies.
- Runbooks with commands, graphs, and degradation levers.
- Weekly analysis of false positives and tuning.
- Account for marketing campaigns and event calendar.
15) Example runbook pattern (abbreviated)
Signal: 'StreamBacklogAtRisk'
Objective: prevent lag from growing above 10 million and processing delay above 5 min.
Diagnosis (3-5 min):
1. Check 'hpa_desired/max' and throttling/OOM in the pods.
2. Look at 'rate(lag)' and partition skew (queries sketched below).
3. Check the broker (ISR, under-replicated partitions, network).
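A few PromQL queries that may help with these steps; metric names depend on your exporters and are assumptions:

# Lag growth rate over the last 5 minutes
sum(rate(kafka_consumer_lag[5m]))

# Partition skew: the most lagging partitions relative to the average
topk(5, kafka_consumer_lag) / scalar(avg(kafka_consumer_lag))

# CPU throttling of consumer pods (cAdvisor metric)
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="streaming"}[5m]))

# HPA saturation (kube-state-metrics)
kube_horizontalpodautoscaler_status_desired_replicas
  / kube_horizontalpodautoscaler_spec_max_replicas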
Actions:
- Increase consumer replicas by N, raise max-in-flight.
- Enable the priority pool for critical topics.
- Temporarily reduce the frequency of secondary processing/enrichment.
- If the ASG is at max, request a temporary uplift from the cloud provider; in parallel, enable degradation of heavy features.
- Rollback: return to the normal traffic profile once 'lag < 1 million' holds for 15 minutes.
- Escalation: Kafka cluster owner, then the platform SRE team.
16) KPI and signal quality
Coverage: % of critical paths covered by capacity alerts.
Noise/Signal: No more than 1 false page per on-call/week.
MTTD/MTTR: capacity incidents are detected ≤5 minutes before they hit the SLO.
Proactive saves: number of incidents prevented (by postmortem).
17) Fast start (conservative defaults)
DB: warning at 75% of connections/IOPS/latency; critical at 85%; hysteresis 8-10 pp.
Caches: 'hit_ratio < 0.9' and 'evictions > 0' for > 5 min - warning; 'used_mem > 85%' - critical.
Queues: lag growth > 3σ above the 30-day average + 'HPA at max' - critical.
API: 'p99 > 1.3 × SLO' for 10 min - warning; 'burn-rate > 4' for 15 min - critical (sketched below).
Providers: 'throughput > 90% of quota' - warning; 'timeouts > 5%' - critical.
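As an example, the API default above might look like this in the pseudo-Prometheus style of section 7; the histogram metric name and the 300 ms target are assumptions:

ALERT ApiLatencyAboveTarget
  IF histogram_quantile(0.99,
       sum by (le) (rate(api_request_duration_seconds_bucket[5m])))
     > 1.3 * 0.300
  FOR 10m
  LABELS {severity="warning", team="api"}
  ANNOTATIONS {summary="p99 latency is 30% above the 300ms SLO target"}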
18) FAQ
Q: Why not just "CPU > 80%"?
A: Without latency/queuing context, it is noise. CPU on its own does not equal risk.
Q: Do we need adaptive thresholds?
A: Yes - for daily/weekly seasonality they reduce false positives.
Q: How to consider marketing/events?
A: Campaign calendar → annotations on graphs + temporary anti-noise adjustment, but do not touch SLO alerts.