Operations and Alerting → System Capacity Management
System capacity alerts
1) Why do you need it
Capacity alerts warn of approaching technical limits well before an incident: "we are at 80% of the ceiling - time to scale." For a product business this is directly about money: missed bets/deposits, dropped sessions, live-game delays and provider failures = lost revenue, reputational damage, fines and chargebacks.
Objectives:
- Predictably withstand peak loads (events, tournaments, streams, large campaigns).
- Turn on auto-scaling in time and plan capacity uplift.
- Reduce noise and page people only when the SLO or money is at risk.
- Give engineers precise recommendations via runbooks.
2) Basic concepts
Capacity: maximum stable throughput (RPS/TPS, connections, IOPS, bandwidth).
Headroom: the margin between current load and the limit.
SLO/SLA: target levels of availability/response time; alerts must be "SLO-aware."
Burn rate: how fast the SLO error/latency budget is being consumed (see the sketch below).
High/Low Watermark: upper/lower levels for triggering and auto-resolving.
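As a minimal sketch (in the same pseudo-Prometheus style as the rule examples in section 7), burn rate can be expressed as the observed error ratio divided by the budget allowed by the SLO; the metric names and the 99.9% availability target are assumptions:

api:error_ratio:1h = sum(rate(http_requests_total{code=~"5.."}[1h]))
                     / sum(rate(http_requests_total[1h]))

# For a 99.9% SLO the error budget is 0.001.
# Burn rate 1 = the budget is consumed exactly over the SLO window;
# burn rate 4 = it will be gone in a quarter of that time.
api:slo_burnrate:1h = api:error_ratio:1h / 0.001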
3) Signal architecture and data sources
Telemetry: metrics (Prometheus/OTel), logs (ELK/ClickHouse), traces (OTel/Jaeger).
Layered approach: alerts per layer (Edge → API → business services → queues/streams → databases/caches → file/object storage → external providers).
Context: feature flags, releases, marketing campaigns, tournaments, geo breakdown.
Incident bus: Alertmanager/PagerDuty/Opsgenie/Slack; links to runbooks and the escalation matrix.
4) Key metrics by layer (what to monitor and why)
Edge / L7
RPS, 95-/99-percentile latency, error rate (5xx/4xx), open connections.
Rate limits/quotas, drops at the CDN/WAF/firewall.
API gateway / Backend-for-Frontend
Worker/thread-pool saturation, request queue depth, timeouts to downstreams.
Share of degraded responses (fallbacks, circuit breakers).
Queues/Streaming (Kafka/Rabbit/Pulsar)
Lag/consumer delay, backlog growth rate, throughput (msg/s, MB/s).
Partition skew, rebalancing churn, ISR (for Kafka), retries/dead-letter queues.
Asynchronous workers
Task processing time, queue length, percentage of tasks breaching their SLA.
CPU/Memory/FD saturation in worker pools.
Caches (Redis/Memcached)
Hit ratio, latency, evictions, used memory, connected clients/ops/s.
Clusters: slots/replicas, failover events.
Databases (PostgreSQL/MySQL/ClickHouse)
Active connections vs max, lock waits, replication lag, buffer/cache hit.
IOPS, read/write latency, checkpoint/flush, bloat/fragmentation.
Object/File Storage
PUT/GET latency, 4xx/5xx, egress, requests/sec, provider limits.
External Providers (Payments/KYC/Game Providers)
TPS limits, QPS windows, error rate/timeouts, retry queue, "cost per call."
Infrastructure
CPU/Memory/FD/IOPS/Network saturation on nodes/pods/ASG.
HPA/VPA events, pending pods, container OOM/Throttling.
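To make these layers comparable on a single "headroom" view, it can help to normalize each one into a 0..1 saturation ratio with recording rules. A short sketch in the same pseudo-Prometheus style; all metric names are assumptions and depend on your exporters:

# Headroom = 1 - saturation; each layer reduced to a 0..1 ratio
db:saturation:connections   = pg_stat_activity_active / pg_max_connections
cache:saturation:memory     = redis_used_memory / redis_maxmemory
api:saturation:latency      = api_request_duration_p99 / api_latency_slo_seconds
stream:saturation:consumers = hpa_desired_replicas / hpa_max_replicas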
5) Types of capacity alerts
1. Static thresholds
Simple and straightforward: 'db_connections > 80% of max'. Good as a baseline safety-net signal.
2. Adaptive (dynamic) thresholds
Based on seasonality and trend (rolling windows, STL decomposition). They catch values that are "unusually high for this hour/day of the week."
3. SLO-oriented (burn-rate)
Triggered when the error-budget burn rate threatens the SLO within an X-hour horizon.
4. Prognostic (forecast-alerts)
"After 20 minutes in the current trend, the queue will reach 90%." Linear/Robust/Prophet-like prediction on short windows is used.
5. Multi-signal
Trigger on a combination: 'queue_lag ↑' + 'consumer_cpu > 85%' + 'autoscaling at max' → "manual intervention is needed."
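A sketch of a forecast alert (type 4 above) using PromQL's predict_linear; the 90% threshold, the 20-minute horizon, and the metric names are assumptions:

ALERT QueueWillSaturateSoon
  IF predict_linear(rabbitmq_queue_messages[15m], 20*60)
     > 0.9 * rabbitmq_queue_max_length
  FOR 5m
  LABELS {severity="warning", team="streaming"}
  ANNOTATIONS {summary="At the current trend the queue reaches 90% within ~20 minutes"}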
6) Threshold policies and anti-noise
High/Low Watermark:
- Up: warning at 70-75%, critical at 85-90%. Down: hysteresis of 5-10 pp so alerts do not flap around the threshold (see the hysteresis sketch at the end of this section).
- 'for: 5m' for criticals, 'for: 10-15m' for warnings. Night mode: route non-critical alerts to chat without paging.
- Group by service/cluster/geo so as not to multiply incident cards.
- Suppress dependent alerts: if the KYC provider is down and the API errors are caused by it, page the integration owner, not every consumer.
- During promo periods, raise noise thresholds for the "expected growth," but leave SLO alerts intact.
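One commonly used workaround for watermark hysteresis in Prometheus-style rules is to keep an alert firing at the lower threshold once it has already fired, by referencing the built-in ALERTS series. A sketch only, with assumed metric names:

ALERT DbConnectionsHigh
  IF (pg_stat_activity_active / pg_max_connections) > 0.85
     OR (
       (pg_stat_activity_active / pg_max_connections) > 0.75
       AND ON() ALERTS{alertname="DbConnectionsHigh", alertstate="firing"} == 1
     )
  FOR 5m
  LABELS {severity="critical", team="core-db"}

The alert fires above 85% and only resolves once the ratio drops below 75%, avoiding the 80%/79% "sawing" described in the anti-patterns section.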
7) Rule examples (pseudo-Prometheus)
DB connections:
ALERT PostgresConnectionsHigh
IF (pg_stat_activity_active / pg_max_connections) > 0.85
FOR 5m
LABELS {severity="critical", team="core-db"}
ANNOTATIONS {summary="Postgres connections >85%"}
Kafka lag + auto-scaling at the limit:
ALERT StreamBacklogAtRisk
IF (kafka_consumer_lag > 5_000_000 AND rate(kafka_consumer_lag[5m]) > 50_000)
AND (hpa_desired_replicas == hpa_max_replicas)
FOR 10m
LABELS {severity="critical", team="streaming"}
Burn-rate SLO (API latency):
ALERT ApiLatencySLOBurn
IF slo_latency_budget_burnrate{le="300ms"} > 4
FOR 15m
LABELS {severity="page", team="api"}
ANNOTATIONS {runbook="wiki://runbooks/api-latency"}
Redis memory and evictions:
ALERT RedisEvictions
IF rate(redis_evicted_keys_total[5m]) > 0
AND (redis_used_memory / redis_maxmemory) > 0.8
FOR 5m
LABELS {severity="warning", team="caching"}
Payment Provider - Limits:
ALERT PSPThroughputLimitNear
IF increase(psp_calls_total[10m]) > 0.9 * psp_rate_limit_window
FOR 5m
LABELS {severity="warning", team="payments", provider="PSP-X"}
8) SLO approach and business priority
From signal to business impact: capacity alerts should state the risk to the SLO and to business metrics (specific games/geos, GGR, deposit conversion).
Multi-level severity: warnings go to the service on-call; criticals page the domain owner; an SLO breach opens a major incident and a shared incident channel (see the multi-window burn-rate sketch below).
Degradation levers: automatic load shedding (partial read-only mode, cutting heavy features, reducing the frequency of jackpot broadcasts, turning off "heavy" animations in live games).
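A multi-window burn-rate pairing is a common way to implement these severity levels. The sketch below assumes recording rules like the burn-rate example in section 2 and a 30-day SLO window (burn rate 14.4 ≈ 2% of the monthly budget consumed in one hour); all names are illustrative:

ALERT ApiAvailabilitySLOFastBurn
  IF api:slo_burnrate:1h > 14.4 AND api:slo_burnrate:5m > 14.4
  FOR 2m
  LABELS {severity="page", team="api"}

ALERT ApiAvailabilitySLOSlowBurn
  IF api:slo_burnrate:6h > 6 AND api:slo_burnrate:30m > 6
  FOR 15m
  LABELS {severity="warning", team="api"}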
9) Auto-scaling and "correct" triggers
HPA/VPA: target not only CPU/Memory but also business metrics (RPS, queue lag, p99 latency); a sketch follows this list.
Warm-up timings: account for cold starts and provider limits (ASG spin-up, container start-up, cache warm-up).
Guardrails: stop conditions when errors grow avalanche-style; protection against "scaling the problem."
Capacity playbooks: where and how to add a shard/partition/replica, how to redistribute traffic across regions.
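As an illustration of scaling on a business metric rather than CPU, a minimal HorizontalPodAutoscaler sketch (Kubernetes autoscaling/v2) targeting consumer lag exposed through a metrics adapter; the workload name, metric name and numbers are assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bet-settlement-consumer        # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bet-settlement-consumer
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag     # name depends on your metrics adapter
          selector:
            matchLabels:
              topic: bets
        target:
          type: AverageValue
          averageValue: "10000"        # target lag per replica
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # acts as hysteresis against flapping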
10) Process: from design to operation
1. Limit mapping: collect the "true" bottleneck limits for each layer (max connections, IOPS, TPS, provider quotas).
2. Select predictor metrics: which signals are the first to indicate "we hit the limit in N minutes."
3. Threshold design: high/low + SLO-burn + compound.
4. Runbook for every critical alert: diagnostic steps ("what to open," "which commands," "where to escalate") and three action options: quick workaround, scaling, degradation.
5. Testing: load simulations (chaos/game days), dry runs of alerts, anti-noise checks.
6. Review and adoption: signal owner = service owner. No owner - no page.
7. Retrospectives and tuning: weekly review of false positives/misses; track MTTA (ack), MTTD, MTTR, and the noise/signal ratio.
11) Anti-patterns
CPU > 90% ⇒ panic: without correlation with latency/queues, this may be normal.
"One threshold for all": different regions/time zones - different traffic profiles.
Alert without a runbook: a page without clear actions burns out the on-call.
Blindness to providers: external quotas/limits are often the first thing to break user flows (PSP, KYC, anti-fraud, game providers).
No hysteresis: flapping at the 80%/79% boundary.
12) Features of iGaming/financial platforms
Scheduled peaks: prime time, tournament finals, major matches; scale up replicas and warm caches in advance.
Live streams and jackpots: bursts of broadcast events → limits on brokers/WebSockets.
Payments and KYC: provider windows, anti-fraud scoring; keep backup routes and a "grace mode" for deposits.
Geo-balancing: on local provider failures, divert traffic to a neighboring region that has headroom.
Accountability: when bets/jackpots are at risk - an immediate page to the domain team plus a business alert.
13) Dashboards (minimum set)
Capacity Overview: headroom by layer, top 3 risky areas, burn-rate SLO.
Stream & Queues: lag, backlog growth, consumer saturation, HPA state.
DB & Cache: connections, repl-lag, p95/p99 latency, hit ratio, evictions.
Providers: TPS/windows/quotas, timeouts/errors, call cost.
Release/Feature context: releases/feature flags annotated next to the graphs.
14) Implementation checklist
- List of "true" limits and owners.
- Predictor metrics map + inter-layer associations.
- Static thresholds + hysteresis.
- SLO-burn-alerts on critical paths (deposit, bet, live game launch).
- Predictive alerts on queue/streams/connections.
- Suppression/maintenance windows; anti-noise policies.
- Runbooks with commands, graphs, and degradation levers.
- Weekly analysis of false positives and tuning.
- Account for marketing campaigns and event calendar.
15) Example runbook pattern (abbreviated)
Signal: 'StreamBacklogAtRisk'
Objective: prevent lag from growing above 10 million and processing delay above 5 min.
Diagnosis (3-5 min):
1. Check 'hpa_desired/max' and throttling/OOM in the pods.
2. Look at 'rate(lag)' and partition skew (queries sketched below).
3. Check the broker (ISR, under-replicated partitions, network).
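A few PromQL queries that may help with these steps; metric names depend on your exporters and are assumptions:

# Lag growth rate over the last 5 minutes
sum(rate(kafka_consumer_lag[5m]))

# Partition skew: the most lagging partitions relative to the average
topk(5, kafka_consumer_lag) / scalar(avg(kafka_consumer_lag))

# CPU throttling of consumer pods (cAdvisor metric)
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="streaming"}[5m]))

# HPA saturation (kube-state-metrics)
kube_horizontalpodautoscaler_status_desired_replicas
  / kube_horizontalpodautoscaler_spec_max_replicas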
Actions:
- Increase consumer replicas by N, raise max-in-flight.
- Enable the priority pool for critical topics.
- Temporarily reduce the frequency of secondary processing/enrichment.
- If the ASG is at max, request a temporary uplift from the cloud provider; in parallel, enable degradation of heavy features.
- Rollback: return to the normal traffic profile once 'lag < 1 million' holds for 15 minutes.
- Escalation: Kafka cluster owner, then the platform SRE team.
16) KPI and signal quality
Coverage: % of critical paths covered by capacity alerts.
Noise/Signal: No more than 1 false page per on-call/week.
MTTD/MTTR: capacity incidents are detected ≤5 minutes before they hit the SLO.
Proactive saves: number of incidents prevented (by postmortem).
17) Fast start (conservative defaults)
DB: warning at 75% of connections/IOPS/latency; critical at 85%; hysteresis 8-10 pp.
Caches: 'hit_ratio < 0.9' and 'evictions > 0' for > 5 min - warning; 'used_mem > 85%' - critical.
Queues: lag growth > 3σ above the 30-day average + 'HPA at max' - critical.
API: 'p99 > 1.3 × SLO' for 10 min - warning; 'burn-rate > 4' for 15 min - critical (sketched below).
Providers: 'throughput > 90% of quota' - warning; 'timeouts > 5%' - critical.
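As an example, the API default above might look like this in the pseudo-Prometheus style of section 7; the histogram metric name and the 300 ms target are assumptions:

ALERT ApiLatencyAboveTarget
  IF histogram_quantile(0.99,
       sum by (le) (rate(api_request_duration_seconds_bucket[5m])))
     > 1.3 * 0.300
  FOR 10m
  LABELS {severity="warning", team="api"}
  ANNOTATIONS {summary="p99 latency is 30% above the 300ms SLO target"}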
18) FAQ
Q: Why not just "CPU > 80%"?
A: Without latency/queuing context, it is noise. CPU on its own does not equal risk.
Q: Do we need adaptive thresholds?
A: Yes - for daily/weekly seasonality they reduce false positives.
Q: How to consider marketing/events?
A: Campaign calendar → annotations on graphs + temporary anti-noise adjustment, but do not touch SLO alerts.