System capacity alerts

1) Why you need it

Capacity alerts warn of approaching technical limits long before an incident: "we are at 80% of the ceiling - time to scale." For a product business this is directly about money: missed bets/deposits, session drops, live game delays and provider failures mean lost revenue, reputational damage, fines and chargebacks.

Objectives:
  • Predictably withstand peak loads (events, tournaments, streams, large campaigns).
  • Turn on auto-scaling on time and plan capacity uplift.
  • Reduce noise and page only when SLO or money is actually at risk.
  • Give engineers precise, actionable steps through runbooks.

2) Basic concepts

Capacity: the maximum stable throughput (RPS/TPS, connections, IOPS, bandwidth).
Headroom: the margin between current load and the limit.

SLO/SLA: target levels of availability/response time; alerts must be "SLO-aware."

Burn-rate: how fast the SLO error/latency budget is being consumed (see the sketch below).
High/Low Watermark: upper/lower levels for triggering and auto-recovery.
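
For intuition, a burn-rate can be computed as the observed error ratio divided by the error ratio allowed by the SLO. A minimal sketch in the pseudo-Prometheus notation used in section 7, assuming a 99.9% availability SLO and generic request/error counters (the metric names here are illustrative, not the platform's own):

# Allowed error ratio for a 99.9% SLO is 1 - 0.999 = 0.001.
slo_availability_burnrate_1h =
  (sum(rate(http_requests_errors_total[1h])) / sum(rate(http_requests_total[1h]))) / 0.001
# burn-rate = 1: the budget is spent exactly at the planned pace;
# burn-rate = 4: a 30-day budget would be gone in roughly a week.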

3) Signal architecture and data sources

Telemetry: metrics (Prometheus/OTel), logs (ELK/ClickHouse), traces (OTel/Jaeger).
Layered approach: alerts per layer (Edge → API → business services → queues/streams → databases/caches → file/object storage → external providers).
Context: feature flags, releases, marketing campaigns, tournaments, geo distribution.
Incident bus: Alertmanager/PagerDuty/Opsgenie/Slack; links to runbooks and the escalation matrix.

4) Key metrics by layer (what to monitor and why)

Edge / L7

RPS, 95-/99-percentile latency, error rate (5xx/4xx), open connections.
Rate limits/quotas, drops at the CDN/WAF/firewall.

API gateway / Backend-for-Frontend

Worker/thread pool saturation, request queue depth, timeouts to downstreams.
Degradation fraction (fallbacks, circuit-breakers).

Queues/Streaming (Kafka/Rabbit/Pulsar)

Lag/consumer delay, backlog growth rate, throughput (msg/s, MB/s).
Partition skew, rebalancing churn, ISR (for Kafka), retry/dead-letter queues.

Asynchronous workers

Task timeouts, queue length, percentage of tasks that miss their SLA.
CPU/Memory/FD saturation in worker pools.

Caches (Redis/Memcached)

Hit ratio, latency, evictions, used memory, connected clients/ops/s.
Clusters: slots/replicas, failover events.

Databases (PostgreSQL/MySQL/ClickHouse)

Active connections vs max, lock waits, replication lag, buffer/cache hit.
IOPS, read/write latency, checkpoint/flush, bloat/fragmentation.

Object/File Storage

PUT/GET latency, 4xx/5xx, egress, requests/sec, provider limits.

External Providers (Payments/LCC/Game Providers)

TPS limits, QPS windows, error rates/timeouts, retry queue, "cost per call."

Infrastructure

CPU/Memory/FD/IOPS/Network saturation on nodes/pods/ASG.
HPA/VPA events, pending pods, container OOM/Throttling.

5) Types of capacity alerts

1. Static thresholds

Simple and straightforward: 'db_connections > 80% of max'. Good as a basic early-warning signal.

2. Adaptive (dynamic) thresholds

Based on seasonality and trend (rolling windows, STL decomposition). They catch values that are "unusually high for this hour/day of the week."
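
A minimal sketch of such a rule in the pseudo-Prometheus notation from section 7, comparing current traffic with the same hour one week earlier (the metric name edge_requests_per_second, the 1.5x factor and the absolute floor are illustrative assumptions):

ALERT EdgeTrafficAnomalouslyHigh
# Fire only when traffic is both 50% above the same hour last week and non-trivial in absolute terms.
IF edge_requests_per_second > 1.5 * (edge_requests_per_second offset 1w)
AND edge_requests_per_second > 1000
FOR 10m
LABELS {severity="warning", team="edge"}
ANNOTATIONS {summary="RPS is >50% above the same hour last week"}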

3. SLO-oriented (burn-rate)

They fire when the rate at which the error budget is being consumed would jeopardize the SLO within an X-hour horizon.

4. Prognostic (forecast-alerts)

"At the current trend the queue will reach 90% in 20 minutes." Uses linear/robust/Prophet-like forecasting over short windows.

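PromQL's predict_linear expresses this kind of forecast directly; a sketch under the same pseudo-Prometheus conventions as section 7, assuming a queue-depth gauge and a known capacity (queue_depth and queue_depth_limit are illustrative names):

ALERT QueueWillSaturateSoon
# Extrapolate the last 15 minutes of growth 1200 seconds (20 minutes) ahead.
IF predict_linear(queue_depth[15m], 1200) > 0.9 * queue_depth_limit
FOR 5m
LABELS {severity="warning", team="async-workers"}
ANNOTATIONS {summary="At the current trend the queue reaches 90% of capacity within ~20 minutes"}
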
5. Multi-signal

Trigger on a combination: 'queue_lag ↑' + 'consumer_cpu > 85%' + 'autoscaling at max' → "manual intervention needed."

6) Threshold policies and anti-noise

High/Low Watermark:
  • Up: warning at 70-75%, critical at 85-90%. Down: 5-10 pp of hysteresis so the alert does not "saw" around the threshold (see the sketch after this list).
Time windows and suppressions:
  • 'for: 5m' for criticals, 'for: 10-15m' for warnings. Night mode: route non-critical alerts to chat without paging.
Event grouping:
  • Group by service/cluster/geo so as not to spawn a flood of incident cards.
Dependency-aware suppression:
  • If the KYC provider is down and the API errors are caused by it, page the integration owner, not every consumer.
Marketing time windows:
  • During promotions, raise the noise thresholds to allow for "expected growth," but leave SLO alerts untouched.
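
One way to express High/Low Watermark hysteresis in Prometheus-style rules is to reference the built-in ALERTS series, so a firing alert only clears once the metric falls back below the lower watermark. A sketch, assuming a precomputed ratio metric db_connections_ratio (illustrative name); note that real PromQL uses lowercase or/and and needs explicit label matching such as "and on(instance)":

ALERT DbConnectionsHighWatermark
# Fires above 85%; once firing, it stays active until the ratio drops below 75%.
IF db_connections_ratio > 0.85
OR (db_connections_ratio > 0.75
    AND ALERTS{alertname="DbConnectionsHighWatermark", alertstate="firing"} == 1)
FOR 5m
LABELS {severity="critical", team="core-db"}
ANNOTATIONS {summary="DB connections above the high watermark; clears only below 75%"}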

7) Rule examples (pseudo-Prometheus)

DB connections:

ALERT PostgresConnectionsHigh
IF (pg_stat_activity_active / pg_max_connections) > 0.85
FOR 5m
LABELS {severity="critical", team="core-db"}
ANNOTATIONS {summary="Postgres connections >85%"}
Kafka lag + auto-scaling at the limit:

ALERT StreamBacklogAtRisk
IF (kafka_consumer_lag > 5_000_000 AND rate(kafka_consumer_lag[5m]) > 50_000)
AND (hpa_desired_replicas == hpa_max_replicas)
FOR 10m
LABELS {severity="critical", team="streaming"}
Burn-rate SLO (API latency):

ALERT ApiLatencySLOBurn
IF slo_latency_budget_burnrate{le="300ms"} > 4
FOR 15m
LABELS {severity="page", team="api"}
ANNOTATIONS {runbook="wiki://runbooks/api-latency"}
Redis memory and evictions:

ALERT RedisEvictions
IF rate(redis_evicted_keys_total[5m]) > 0
AND (redis_used_memory / redis_maxmemory) > 0.8
FOR 5m
LABELS {severity="warning", team="caching"}
Payment Provider - Limits:

ALERT PSPThroughputLimitNear
IF increase(psp_calls_total[10m]) > 0.9 * psp_rate_limit_window
FOR 5m
LABELS {severity="warning", team="payments", provider="PSP-X"}

8) SLO approach and business priority

From signal to business impact: Capacity alerts should reference risk to SLO (specific games/geo/GGR metrics, deposit conversion).
Multilevel: warnings go to the service on-call; a critical pages the domain owner; an SLO drop opens a major incident and a shared cross-team channel.
Degradation switches: automatic load shedding (partial read-only, trimming heavy features, reducing the frequency of jackpot broadcasts, turning off "heavy" animations in live games).

9) Auto-scaling and "correct" triggers

HPA/VPA: target not only CPU/Memory but also business metrics (RPS, queue lag, p99 latency).
Warm-up timing: account for cold starts and provider limits (ASG spin-up, container start-up, cache warm-up).

Guardrails: stop conditions for avalanche-like error growth; protection against "scaling the problem."
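
A sketch of such a guardrail in the pseudo-Prometheus notation from section 7: stop trusting autoscaling and page a human when the autoscaler is already at its ceiling and the error ratio keeps climbing (the 5% threshold, metric names and team label are illustrative assumptions):

ALERT ScalingGuardrailHit
IF (hpa_desired_replicas == hpa_max_replicas)
AND (rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m])) > 0.05
FOR 5m
LABELS {severity="page", team="sre-platform"}
ANNOTATIONS {summary="Autoscaling is at max and errors keep growing - investigate, do not scale the problem"}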

Capacity playbooks: where and how to add a shard/partition/replica, and how to redistribute traffic across regions.

10) Process: from design to operation

1. Limit mapping: collect the "true" bottleneck limits for each layer (max connections, IOPS, TPS, provider quotas); see the headroom sketch after this list.
2. Selecting predictor metrics: which signals are the first to say "we hit the limit in N minutes."
3. Threshold design: high/low + SLO-burn + compound.
4. A runbook for each critical alert: diagnostic steps ("what to open," "which commands," "where to escalate") and three action options: quick workaround, scaling, degradation.
5. Testing: load simulations (chaos/game days), dry runs of alerts, anti-noise checks.
6. Review and adoption: signal owner = service owner. No owner - no page.

7. Retrospectives and tuning: weekly review of false/missed alerts; track MTTA (ack), MTTD, MTTR and the noise/signal ratio.
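
A practical way to make step 1 reusable is to publish each layer's headroom as recording rules, so dashboards and alerts share one convention. A sketch in the pseudo-Prometheus notation from section 7; the limit metrics (kafka_consumer_lag_limit, psp_rate_limit_window) are assumed to be exported or set as constants:

# Headroom = the share of a hard limit that is still unused (1.0 = fully free, 0.0 = at the limit).
headroom:postgres_connections = 1 - (pg_stat_activity_active / pg_max_connections)
headroom:kafka_consumer_lag   = 1 - (kafka_consumer_lag / kafka_consumer_lag_limit)
headroom:psp_throughput       = 1 - (increase(psp_calls_total[10m]) / psp_rate_limit_window)

Alerts and dashboards can then use a single pattern such as "headroom:* < 0.15" for warnings.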

11) Anti-patterns

CPU > 90% ⇒ panic: without correlating it with latency/queues this may be perfectly normal.
"One threshold for all": different regions/time zones - different traffic profiles.
Alert without a runbook: a page with no clear action burns out the on-call.
Provider blindness: external quotas/limits are often the first thing to break critical flows (PSP, KYC, anti-fraud, game providers).
No hysteresis: "sawing" back and forth across the 80%/79% boundary.

12) Features of iGaming/financial platforms

Scheduled peaks: prime time, tournament finals, major matches; scale up target replicas and warm caches in advance.
Live streams and jackpots: bursts of broadcast events → pressure on broker/websocket limits.
Payments and KYC: provider windows, anti-fraud scoring; keep backup routes and a "grace mode" for deposits.
Geo-balancing: on local provider failures, divert traffic to a neighboring region that has headroom.
Accountability: when bets/jackpots are at risk of being lost, page the domain team instantly and raise a business alert.

13) Dashboards (minimum set)

Capacity Overview: headroom by layer, top 3 risky areas, burn-rate SLO.
Stream & Queues: lag, backlog growth, consumer saturation, HPA state.
DB & Cache: connections, repl-lag, p95/p99 latency, hit ratio, evictions.
Providers: TPS/windows/quotas, timeouts/errors, call cost.
Release/feature context: releases and feature flags annotated next to the curves.

14) Implementation checklist

  • List of "true" limits and owners.
  • Predictor metrics map + inter-layer associations.
  • Static thresholds + hysteresis.
  • SLO-burn-alerts on critical paths (deposit, bet, live game launch).
  • Predictive alerts on queues/streams/connections.
  • Suppression/maintenance windows; anti-noise policy.
  • Runbooks with commands, graphs, and degradation switches.
  • Weekly analysis of false positives and tuning.
  • Account for marketing campaigns and event calendar.

15) Example runbook pattern (abbreviated)

Signal: 'StreamBacklogAtRisk'

Objective: prevent lag growth beyond 10 million and processing delay beyond 5 minutes.

Diagnosis (3-5 min):

1. Check 'hpa_desired/max' and throttling/OOM in the pods.
2. Look at 'rate(lag)' and partition skew.
3. Check the broker (ISR, under-replicated partitions, network).
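
Example queries for these three steps, in the pseudo-Prometheus notation from section 7 (exact metric names depend on the exporters in use and are assumptions here):

# 1. Is the autoscaler already at its ceiling? Any CPU throttling in the consumer pods?
hpa_desired_replicas == hpa_max_replicas
rate(container_cpu_cfs_throttled_seconds_total{namespace="streaming"}[5m])

# 2. How fast is the lag growing, and is it concentrated on a few partitions?
rate(kafka_consumer_lag[5m])
topk(5, kafka_consumer_lag_by_partition)

# 3. Broker-side health: under-replicated partitions point to a cluster problem.
kafka_under_replicated_partitions > 0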

Actions:
  • Increase consumer replicas by +N, raise max-in-flight.
  • Enable the priority pool for "critical" topics.
  • Temporarily reduce the frequency of secondary processing/enrichment.
  • If the ASG is at max, request a temporary quota uplift from the cloud provider; in parallel, enable degradation of heavy features.
  • Rollback: return to the normal traffic profile once 'lag < 1 million' has held for 15 minutes.
  • Escalation: Kafka cluster owner, then SRE platform.

16) KPI and signal quality

Coverage: % of critical paths covered by capacity alerts.
Noise/Signal: no more than 1 false page per on-call engineer per week.
MTTD/MTTR: capacity incidents are detected within ≤5 minutes, before the SLO is impacted.
Proactive saves: the number of incidents prevented (per postmortem reviews).

17) Fast start (conservative defaults)

DB: warning at 75% of connections/IOPS/latency; critical at 85%, hysteresis 8-10 pp.
Caches: 'hit_ratio < 0.9' and 'evictions > 0' for 5 min - warning; 'used_mem > 85%' - critical.
Queues: lag more than 3σ above the 30-day average + 'hpa at max' - critical (sketched below).
API: 'p99 > 1.3 × SLO' for 10 min - warning; 'burn-rate > 4' for 15 min - critical.
Providers: 'throughput > 90% of quota' - warning; 'timeouts > 5%' - critical.
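
The queue default above can be written against the same lag metric used in section 7; a sketch (in practice the 30-day baseline would be precomputed with recording rules rather than queried raw):

ALERT StreamLagAboveSeasonalBaseline
# Lag is more than 3 standard deviations above its 30-day average while autoscaling is maxed out.
IF kafka_consumer_lag
   > avg_over_time(kafka_consumer_lag[30d]) + 3 * stddev_over_time(kafka_consumer_lag[30d])
AND (hpa_desired_replicas == hpa_max_replicas)
FOR 10m
LABELS {severity="critical", team="streaming"}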

18) FAQ

Q: Why not just "CPU> 80%"?
A: Without latency/queue context it is just noise. High CPU by itself does not equal risk.

Q: Do we need adaptive thresholds?
A: Yes, for daily/weekly seasonality they noticeably reduce false positives.

Q: How to consider marketing/events?
A: Campaign calendar → annotations on graphs + temporary anti-noise adjustment, but do not touch SLO alerts.
