GH GambleHub

Capacity planning and load growth

Brief Summary

Capacity is the ability to sustain the target SLO under expected load growth and failures. The plan rests on:

1. Demand forecast (baseline trend + seasonality + events).

2. Load model (open-model for the Internet).

3. Headroom and error budget.

4. Scaling (horizontal/vertical/auto) + limiters (rate-limit/backpressure).

5. Finance: $/1000 RPS, $/ms p95, TCO by scenario.

Terms and Metrics

Throughput: RPS/QPS/CPS - actual throughput (requests, queries, connections per second).
Latency p95/p99: target SLOs for user paths.
Saturation: CPU/memory/IO/FD/connection/queue utilization.
Error rate: 5xx/timeout/429; error budget for the period.
Headroom: share of free capacity at peak traffic (recommended ≥ 30%).
Burst: short-term spike (seconds/minutes); Spike: sharp rise × N.

Basic models and formulas

Little's Law (for queueing systems)


L = λ · W

L is the average number of requests in the system, λ is the average arrival rate (RPS), W is the average time in the system. Useful for estimating queue depth.
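As a quick illustration, Little's Law can be applied directly (the numbers below are hypothetical):

```python
def requests_in_flight(arrival_rps: float, avg_time_s: float) -> float:
    """Little's Law: L = λ · W — average number of requests inside the system."""
    return arrival_rps * avg_time_s

# 2,000 RPS with an average residence time of 50 ms:
print(requests_in_flight(2000, 0.050))  # → 100.0
```

If the measured in-flight count is much higher than λ·W predicts, requests are piling up in a queue somewhere.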

Load factor (ρ)


ρ = λ / μ

μ is the service rate (RPS at 100% CPU). As ρ → 1, latency grows non-linearly, so keep the operating point at ρ ≤ 0.6–0.75.

Safety factor/margin


Capacity_required = Peak_load × (1 + Headroom) × Degradation_factor

Where Degradation_factor accounts for instance failures (N+1), cache degradation, or the loss of one PoP/region (e.g., 1.2).
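The formula above translates into a one-line helper; the inputs here are hypothetical:

```python
def capacity_required(peak_load_rps: float, headroom: float, degradation_factor: float) -> float:
    """Capacity_required = Peak_load × (1 + Headroom) × Degradation_factor."""
    return peak_load_rps * (1 + headroom) * degradation_factor

# 35k RPS peak, 30% headroom, degradation factor 1.2:
print(round(capacity_required(35_000, 0.30, 1.2)))  # → 54600
```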

Demand forecast

1. History: daily/weekly profiles, seasonality, correlation with events (matches/streams/payouts).
2. Events: scenario coefficients (regular day × 1, tournament × 2.3, final × 3.5).
3. Sources of fluctuations: marketing campaigns, releases, bot anomalies.
4. Forecast units: RPS by route (login, lobby, catalog, payments), TLS CPS, DB QPS, disk IOPS, egress Gbps.
5. Confidence: keep two scenarios - conservative and aggressive.
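Applying the scenario coefficients from step 2 to a per-route baseline is a simple table multiplication; the baseline RPS figures below are hypothetical:

```python
# Hypothetical per-route baseline RPS; scenario coefficients from the text.
baseline_rps = {"login": 3_000, "lobby": 8_000, "catalog": 5_000, "payments": 1_200}
scenarios = {"regular": 1.0, "tournament": 2.3, "final": 3.5}

def forecast(route_rps: dict, scenario: str) -> dict:
    """Scale every route's baseline by the scenario coefficient."""
    k = scenarios[scenario]
    return {route: rps * k for route, rps in route_rps.items()}

print(forecast(baseline_rps, "final")["payments"])  # → 4200.0
```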

Load simulation

Open model (Poisson-like arrivals): plausible for public APIs/web - use it for sizing.
Closed model (VU + think-time): suitable for internal sequences; combine both.
Route mixes: weight fractions per endpoint; include not only "hot" routes but also "expensive" ones (registration, deposit).
Do not forget: retries, queues, partner limits (PSP, third-party APIs).
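The open-model arrival process can be sketched as a Poisson generator: inter-arrival gaps are exponential with mean 1/λ, independent of how fast the system responds (rate, duration, and seed below are hypothetical):

```python
import random

def poisson_arrival_times(rate_rps: float, duration_s: float, seed: int = 42) -> list[float]:
    """Open-model arrivals: exponential inter-arrival gaps, mean 1/λ."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)
        if t > duration_s:
            return times
        times.append(t)

arrivals = poisson_arrival_times(rate_rps=100, duration_s=10)
# Expect roughly rate × duration ≈ 1000 arrivals.
print(len(arrivals))
```

A closed model would instead loop a fixed number of virtual users with think-time, which throttles itself when the system slows down and therefore underestimates overload.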

Safety margin design

Headroom target: ≥ 30% over the peak (for the Internet); for the payment core and critical paths - 40–50%.
N+1 / N+2: withstand the failure of 1–2 instances/zones without violating the SLO.
Multi-region: each region handles ≥ 60% of the total peak (to survive the loss of a neighbor).
Degrade mode: disable secondary features, reduce payload, serve cached/stub responses.

Sizing by Layer

Network/Edge

CPS/RPS at the front, TLS handshake p95, resumption ≥ 70%, egress Gbps.
Anycast/geo-routing, CDN/WAF limits (agree on them in advance).
Margin: link/uplink ≥ peak × 1.3; SYN backlog with margin; UDP/443 for HTTP/3.

Balancers/Proxies

RPS per instance, open connections, queues, CPU/IRQ.
Keepalive and connection pooling - reduce connections to backends.
Margin: ρ ≤ 0.7; limiters by CPS/RPS per route.

Applications

Target performance per core (RPS/core) at plateau.
Pools (thread/DB/HTTP) - do not run into limits.
Margin: autoscale at CPU 60–70% plus a latency trigger (p95).

Caches

Hit ratio, hotset volume, eviction, replicas.
Margin: memory ≥ 1.2 × hotset, network headroom ≥ 30%.

Databases

QPS/TPM, query p95, locks, buffer cache, WAL/replication lag.
Disk IOPS and latency are key to p95.
Margin: CPU operating point 50–65%, replica lag < target; a sharding plan and read replicas.

Disks/Storage

IOPS (4k/64k), throughput, fsync cost.
Margin: IOPS ≥ peak × 1.5; latency p95 within the target window; separate pools for logs/data.

GPU/ML (if there is online inference)

Samples/s, latency, VRAM headroom, batching.
Margin: tune batch parameters for sawtooth load; keep a warm pool of GPUs.

Auto-scaling

HPA/KEDA: CPU metrics + custom (p95 latency, RPS, queue).
Warm pools: pre-heated instances before events.

Step scaling: steps with cooldown to avoid oscillation.

Reaction time: aim for T_scale ≤ 1–2 minutes for the front layer; for the DB, scale in advance.
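A latency-driven scaling rule can be sketched as below. This is a minimal illustration loosely modeled on the ratio-based idea behind HPA, not its actual algorithm; all numbers are hypothetical:

```python
def desired_replicas(current: int, p95_ms: float, target_p95_ms: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale replicas proportionally to observed/target p95, clamped to [min, max].
    A real controller would also apply a cooldown between steps."""
    ratio = p95_ms / target_p95_ms
    return max(min_replicas, min(max_replicas, round(current * ratio)))

# p95 at 320 ms vs a 250 ms target → scale 40 replicas up to 51:
print(desired_replicas(40, p95_ms=320, target_p95_ms=250,
                       min_replicas=10, max_replicas=100))  # → 51
```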

Limiters and backpressure

Rate-limit by IP/ASN/device/route; partner quotas.
Queues with TTL; a polite refusal (429, or via a gray route) before timeouts hit.
Idempotency: keys for payments; retries with a budget + jitter.
Request collapsing / SWR: don't wake the origin during a spike.
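Two of the levers above, a per-route token-bucket limiter and jittered exponential backoff, can be sketched like this (rates and limits are hypothetical):

```python
import random
import time

class TokenBucket:
    """Per-route rate limiter: refuse (HTTP 429) instead of queueing into timeouts."""
    def __init__(self, rate_per_s: float, burst: int):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_with_jitter(attempt: int, base_s: float = 0.1, cap_s: float = 5.0) -> float:
    """'Full jitter' backoff: a random delay in [0, min(cap, base · 2^attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
```

The jitter spreads retries out in time, which prevents synchronized retry waves from re-saturating a recovering backend.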

Example of quick calculation

Given: a forecast peak of 35k RPS on the API, p95 ≤ 250 ms, average service time 8 ms → μ ≈ 125 RPS/core; 8 cores per instance → ~1000 RPS/instance.
Step 1 (no margin): 35 instances.
Step 2 (30% headroom): 35 × 1.3 ≈ 46.
Step 3 (loss of one AZ, +20%): 46 × 1.2 ≈ 55.
Step 4 (rounding + 10% hot reserve): 61 instances.
Check: ρ ≈ 35k / 61k ≈ 0.57 - in the green zone.
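The same steps, written out as a script so the margins can be re-run with different inputs:

```python
import math

peak_rps = 35_000
rps_per_instance = 1_000                              # ~125 RPS/core × 8 cores

base = math.ceil(peak_rps / rps_per_instance)         # step 1: 35
with_headroom = math.ceil(base * 1.3)                 # step 2: 46
with_az_loss = round(with_headroom * 1.2)             # step 3: 55
final = math.ceil(with_az_loss * 1.1)                 # step 4 (+10% hot reserve): 61

rho = peak_rps / (final * rps_per_instance)
print(final, round(rho, 2))  # → 61 0.57
```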

Financial Model (FinOps)

$/1000 RPS by layer (edge, proxy, app, DB).
$/ms p95 (tail reduction cost).
TCO scenarios: on-demand vs reserved vs spot (with the risk of interruptions).
Capacity plan: quarterly account/cluster limits, cloud quotas, PSP/CDN limits.
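The $/1000 RPS unit metric is a straight division; the monthly cost figure below is hypothetical:

```python
def cost_per_1000_rps(monthly_cost_usd: float, sustained_rps: float) -> float:
    """Unit economics: dollars per month per 1,000 RPS served by a layer."""
    return monthly_cost_usd / sustained_rps * 1000

# Hypothetical: an app layer costing $42,700/month that serves 35k RPS at peak.
print(round(cost_per_1000_rps(42_700, 35_000), 2))  # → 1220.0
```

Tracking this per layer (edge, proxy, app, DB) shows where an extra 1,000 RPS of capacity is cheapest to buy.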

Failure readiness and DR

Multi-AZ/region: each side carries ≈ 60% of the load.
Failover plan: Anycast withdrawal, GSLB switchover, TTL ≤ 60–120 s.
Critical dependencies: PSP/bank limits, a secondary provider.
Periodic drills: game days with a PoP/BG/cache switched off.

Observability and early saturation signals

Growth of p95/p99 and queues under stable input.
Cache hit-ratio drop, origin egress growth.
Rising retransmits / ECN CE marks, falling TLS resumption rate.
Growth of 429s/timeouts and retry rate.
For databases - growing lock conflicts, checkpoint time, WAL fsync.

Operational Practices

Monthly capacity review: actuals vs plan.
Change windows for events: freeze changes and limits.
Prewarm (CDN/DNS/TLS/pools) 10–30 min before the peak.
Limit versioning: keep rate-limit/pool configs in Git.

iGaming/fintech specific

Tournaments/matches: spike + plateau profiles, gray routes for bots, separate registration/deposit limits.
Payments/PSP: provider/method quotas, fallback routes, egress-IP pools, Time-to-Wallet SLA.
Content providers: distribution by studio, hot caches, shard pools.
Antifraud/AML: limits on rules/scoring, degradation to lightweight rules at peak.

Implementation checklist

  • Peak forecast (base/season/events), two scenarios.
  • SLO/error budget and target headroom ≥ 30%.
  • Sizing by layer (edge/proxy/app/cache/DB/IO/network).
  • Rate-limit, queue, idempotency, retry-budget.
  • HPA/KEDA + warm pools; promotion plan before the event.
  • Multi-AZ/region, failover playbooks, TTL and GSLB.
  • Cloud/PSP/CDN quotas are consistent and documented.
  • Observability: capacity dashboards, early saturation signals.
  • DR exercises and regular capacity-review.

Common errors

Planning for average RPS without tails/spikes.
ρ ≈ 0.9 "on paper" - latency explodes at the slightest noise.
Ignoring external service limits (PSP/CDN/DB cluster).
No degrade modes or backpressure - cascading failures.
Autoscaling without prewarming - it catches up "after" the peak.
A single headroom for all layers - the bottleneck just migrates.

Mini playbooks

Before peak event (T-30 min)

1. Increase minReplicas/target HPA, enable warm pool.
2. Warm up CDN/DNS/TLS/connections, warm up caches.
3. Raise PSP pool limits and quotas as agreed.
4. Turn on gray routes/bot filters, throttle heavy endpoints.

Partial loss of region

1. GSLB → neighboring region, TTL 60-120 s.
2. Enable degrade mode (cache/simplified checkout).
3. Redistribute PSP/egress-IP limits.
4. Status communication, p95/error control.

Surge in retries

1. Reduce retry-budget, enable backoff + jitter.
2. Enable request-collapsing/SWR on GET.
3. Temporarily tighten the rate-limit for "noisy" ASNs.

Result

Capacity planning is demand forecast + engineering model + safety margin + operational levers. Formalize SLOs and headroom, account for external limits, automate scaling and degradation, measure the "cost per millisecond," and hold regular capacity reviews. Then load growth becomes a manageable business metric rather than a risk.

Contact

Get in Touch

Reach out with any questions or support needs. We are always ready to help!

Telegram
@Gamble_GC