GH GambleHub

Capacity Planning

1) What capacity planning is and why it is needed

Capacity planning is the systematic process of evaluating and securing the resources needed to meet target SLOs at minimal cost. This covers not only CPU/memory, but also network bandwidth, storage, databases/caches, queues/event bus, external providers (payments/KYC/anti-fraud), and human resources (on-call, support).

Objectives:
  • Meet SLOs/SLAs even during peaks and degradations.
  • Minimize TCO and capital overprovisioning.
  • Reduce the risk of incidents caused by resource exhaustion (saturation → p99 growth/errors).
  • Make releases and campaigns predictable (marketing, tournaments, top matches).

2) Inputs and sources of truth

Observability: RPS/concurrency, p50/p95/p99, error rate, saturation (CPU, memory, disk IOPS, network pps/Mbps), queue lengths, rate limits.
Business events: campaign calendars, seasonality (evenings/weekends/mega-events), regions/jurisdictions.
Technical debt/features: release roadmap, architectural changes (for example, encryption, new logging).
Providers: quotas and throughput of payment/KYC/mail/anti-fraud services.
Past incidents: where the bottleneck was (database, cache, L7 balancer, bus, CDN, disk).

3) Basic concepts and formulas

Headroom - capacity margin: `headroom = (max_stable_RPS − actual_RPS) / max_stable_RPS`.
Target: 20-40% at peak (for critical flows).
Saturation - the ratio of the resource in use to the resource available (CPU%, memory/GC, connections, file descriptors, IOPS, queue depth).
Sustained throughput - the rate at which p99 and error rate stay within SLO over a long period (not a one-off burst).
Capacity Unit (CU) - a normalized unit of capacity for the service (for example, X RPS per pod with vCPU = 1, RAM = 2 GiB).
The system limit is the maximum without degradation: `N_pods × CU`. It is important to account for shared dependencies (DB/cache/bus).
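
The headroom and CU formulas above can be sketched in a few lines of Python (the numbers in the usage lines are illustrative, not from any real service):

```python
import math

def headroom(max_stable_rps: float, actual_rps: float) -> float:
    """Spare fraction of stable capacity: (max_stable - actual) / max_stable."""
    return (max_stable_rps - actual_rps) / max_stable_rps

def pods_needed(target_rps: float, cu_rps_per_pod: float, headroom_target: float) -> int:
    """Pods required so the target load still leaves the desired headroom."""
    required_capacity = target_rps / (1 - headroom_target)
    return math.ceil(required_capacity / cu_rps_per_pod)

print(round(headroom(1000, 700), 2))   # 0.3 -> 30% headroom
print(pods_needed(5000, 350, 0.25))    # 20 pods
```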

4) Demand Model: Forecasting

Statistical series: weekly/daily seasonality, holidays, sports finals, regional peaks.
Cohorts: by country, payment providers, devices, VIP segments.
Event deltas: impact of campaigns/pushes/releases/SEO.
"What if" (scenario planning): +50% traffic at 19:00-22:00; outage of provider A → redistribution to B (+30% latency).
Real-time adjustments: nowcasting from leading metrics (session activity upticks, pre-match queues, carts).
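
A minimal what-if sketch: stack multiplicative scenario deltas on top of a baseline forecast (baseline and delta values below are illustrative):

```python
def scenario_rps(baseline: float, deltas: list[float]) -> float:
    """Apply multiplicative deltas, e.g. an evening peak (+50%) on top of a campaign (+25%)."""
    rps = baseline
    for d in deltas:
        rps *= 1 + d
    return rps

print(scenario_rps(4000, [0.50]))        # evening peak alone -> 6000.0
print(scenario_rps(4000, [0.50, 0.25]))  # peak stacked with a campaign -> 7500.0
```

Stacking deltas multiplicatively (rather than adding them) matters when several events overlap: the combined peak is what the capacity gate should be checked against.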

5) Supply model: where the chain "breaks"

Request pipeline: Edge/CDN → L7 balancer → application → cache → DB → external APIs → queue/bus → consumers/ETL.

For each link we fix:
  • Capacity (CU/instance), scalability (horizontal/vertical), limits (connections, pps, IOPS), latencies.
  • Failure policies (rate limit, circuit breaker, degradation).
  • Local SLOs and their contribution to the e2e SLO.
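
The chain's end-to-end throughput is capped by its weakest link; a small sketch of that check (link names and capacities below are illustrative):

```python
def e2e_capacity(link_capacity_rps: dict[str, float]) -> tuple[str, float]:
    """Return the bottleneck link and its capacity: the whole chain can do no more."""
    bottleneck = min(link_capacity_rps, key=link_capacity_rps.get)
    return bottleneck, link_capacity_rps[bottleneck]

links = {"edge": 50_000, "app": 12_000, "cache": 30_000, "db": 9_000}
print(e2e_capacity(links))  # ('db', 9000) -- the DB caps the pipeline
```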

6) Error margin and budget

We tie headroom to the error budget: the less budget remains, the more margin we keep.
For critical flows (payments/verification) headroom is higher; for secondary flows, lower.
Cold/warm reserves: activated at peak or during incidents.

7) Scaling: Tactics

HPA (by load metrics): RPS, latency, queue length, user SLIs (better than CPU%).
VPA: adjusting pod resources (careful with stateful workloads and GC-driven p99).
KEDA/adapters: scaling by external sources (Kafka lag, Redis list length, CloudQueue depth).
Warm pools/warming up: pre-provisioned instances to avoid a cold start.
"Load-as-Code" approach: autoscaling/limit/timeout/retry policies are versioned and reviewed.
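
Queue-driven scaling (the KEDA-style tactic above) reduces to one formula: enough consumers to drain the current lag within a target window. A sketch with illustrative parameters:

```python
import math

def desired_replicas(queue_lag: int, per_pod_rate: float, drain_target_s: float,
                     min_r: int = 1, max_r: int = 50) -> int:
    """Replicas needed to drain `queue_lag` messages within `drain_target_s` seconds,
    clamped to the allowed autoscaling range."""
    needed = math.ceil(queue_lag / (per_pod_rate * drain_target_s))
    return max(min_r, min(max_r, needed))

print(desired_replicas(30_000, 100, 60))  # 5 pods to drain 30k messages in 60s
```

The clamp to `max_r` is the safety valve: it keeps a lag spike from scaling consumers past what the shared DB/bus downstream can absorb.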

8) Queues, backpressure and tail control

The goal is to prevent avalanche-like growth of p99.
We limit concurrency and queue size, and introduce timeouts and idempotency.
Hedging/retry budget: cap the total time budget of the user and the system.
Graceful degradation: disabling secondary features when overloading.

9) DB, caches and storage

DB: connection limits, WAL/fsync, indexes, query plans, replica lag, hot keys/tables, max TPS for transactions.
Caches: hit ratio by segment, miss storms during releases/outages, key distribution.
Storage: IOPS/throughput, latencies, compression, TTL, cleanup of old batches/snapshots.
Schema migrations: expand → migrate → contract, without blocking locks.
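
Why miss storms matter can be shown in two lines: origin/DB load scales with the cache miss rate, so a modest hit-ratio drop multiplies backend traffic (numbers illustrative):

```python
def origin_load(total_rps: float, hit_ratio: float) -> float:
    """Requests that fall through the cache to the origin/DB."""
    return total_rps * (1 - hit_ratio)

print(round(origin_load(10_000, 0.95)))  # 500 rps reach the origin at a 95% hit ratio
print(round(origin_load(10_000, 0.70)))  # 3000 rps -- a 6x jump after a release invalidates keys
```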

10) Event flows and ETLs

Kafka/bus: partition throughput, lag, ISR, compaction, producer/consumer limits.

ETL/batches: start windows, runtime budgets, throttle I/O

Idempotency and exactly-once-like semantics for critical flows (payments/balances).
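
For consumer lag the key capacity question is drain time: lag only shrinks by the difference between consume and produce rates. A sketch with illustrative rates:

```python
def drain_time_s(lag_msgs: float, consume_rate: float, produce_rate: float) -> float:
    """Seconds to work off consumer lag; infinite if consumers cannot outpace producers."""
    if consume_rate <= produce_rate:
        return float("inf")
    return lag_msgs / (consume_rate - produce_rate)

print(drain_time_s(600_000, 12_000, 10_000))  # 300.0 seconds (5 minutes)
```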

11) Network and perimeter

L4/L7 balancers: connection limits, syn backlog, TLS offload, session reuse.
CDN/Edge: bandwidth, cache policy to reduce origin load.
Intra-network limits: pps/mbps in VPC/subnet, egress-cost (FinOps).

12) Multi-region, DR and jurisdictions

Strategies: active-active (GSLB/Anycast), active-passive (hot/warm/cold DR).
N+1 per region: survive the loss of an AZ/region while maintaining SLOs for core flows.
Legal localization: splitting traffic/data by country; per-provider limits and SLOs differ.
DR tests: regular game-days with real load transfer.
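
N+1 sizing has a direct capacity cost: after losing one region, the surviving N−1 regions must absorb the full peak. A sketch (totals illustrative):

```python
def per_region_capacity(total_peak_rps: float, regions: int) -> float:
    """Capacity each region must hold so that N-1 surviving regions carry the full peak."""
    assert regions >= 2, "N+1 sizing needs at least two regions"
    return total_peak_rps / (regions - 1)

# With 3 active regions, each must be sized for half the total peak, not a third.
print(per_region_capacity(30_000, 3))  # 15000.0
```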

13) External providers: quotas and routes

Payments/KYC/anti-fraud/mail/SMS: TPS, burst quotas, daily limits.
Multi-provider: routing by latency/success rate, SLO per provider, automatic failover.
SLA contracts: e2e-SLO compliance, escalation channels, status webhooks.
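
One simple multi-provider policy is to fill providers in priority order up to their quotas (a sketch, not the only routing strategy; quotas below are illustrative and rely on Python dicts preserving insertion order):

```python
def route_split(demand_rps: float, quotas: dict[str, float]) -> dict[str, float]:
    """Assign demand to providers in priority (insertion) order, capped by each quota."""
    split, remaining = {}, demand_rps
    for provider, quota in quotas.items():
        share = min(quota, remaining)
        split[provider] = share
        remaining -= share
    if remaining > 0:
        raise RuntimeError(f"demand exceeds total provider quota by {remaining} rps")
    return split

print(route_split(133, {"A": 120, "B": 60}))  # {'A': 120, 'B': 13}
```

Production routers usually also weigh latency and success rate per provider, as noted above; quota fill is only the hard constraint.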

14) FinOps: Cost and Efficiency

TCO: compute + storage + network egress + licenses/providers + on-call.
Unit economics: cost of 1k requests / 1 deposit transaction / 1 KYC check.
Optimization: right-sizing, spot/committed-use discounts, cache hit rate, log/trace dedup, cold storage tiers.
Time-shifting load: non-critical batches in off-peak windows and cheaper regions.
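
The unit-economics KPI above is a one-line calculation; the cost and volume figures here are purely illustrative:

```python
def cost_per_1k_requests(monthly_cost_usd: float, monthly_requests: float) -> float:
    """FinOps unit metric: dollars per thousand requests."""
    return monthly_cost_usd / monthly_requests * 1000

# e.g. $42,000/month of attributable cost over 1.2B requests
print(round(cost_per_1k_requests(42_000, 1_200_000_000), 4))  # 0.035
```

Tracking this number per flow (requests, deposits, KYC checks) makes right-sizing wins measurable rather than anecdotal.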

15) Dashboards and reporting (minimum set)

Capacity Overview:
  • Current load vs steady throughput across links.
  • Headroom by service and region; 24/72 hour forecast.
  • FinOps KPI: $/1k requests, $/deposit.
Risk & Hotspots:
  • Top bottlenecks (p99, saturation, lag), DR margin.
Providers:
  • Provider success/latency and limits; share of traffic on routes.
Backlog:
  • Upgrade/index/optimization plan, expected savings/capacity growth.

16) Processes and roles

RACI: Platform (infra/clusters/balancers), Database/Data (indexes, replication), service teams (profiling/caching), SRE (SLOs, alerts), Sec/Compliance (crypto/logs), Finance (budget).
Rhythm: weekly capacity review (roadmap, forecast, risks), monthly FinOps reports, quarterly DR tests.
Change management: major campaigns/releases pass through the capacity gate (checklist below).

17) Capacity-gate

  • Peak load forecast plus a "+X% emergency tail".
  • Headroom available for core flows (payments/KYC/login).
  • Quotas have been confirmed to providers; alternative routes are active.
  • HPA/KEDA thresholds and warm-pool are configured.
  • Queues/limits and degradation checked (playbooks ready).
  • Canary shares and auto-rollback are enabled.
  • Dashboards/alerts (burn-rate, saturation, p99) checked.
  • DR plan and escalation contacts are relevant.

18) Anti-patterns

"CPU < 70%, so everything is fine": ignoring dependency limits (DB connections, IOPS, queues).
A centralized "black box" without per-link metrics - impossible to tell where the limit is.
No cache strategy - release-time miss storms kill the origin.
Hardcoded retry limits without budgets - request storms.
A single payment provider - a point of failure exactly at its peak.
Ignoring warm reserves - cold starts become a cause of incidents.
No periodic DR tests - the plan does not work when it is needed.

19) Mini cost estimates (example)

Service X: stable 350 RPS per pod (vCPU = 1, RAM = 2 GiB). Target: 5,000 RPS with 25% headroom.
Required capacity = `5000 / 0.75 ≈ 6667 RPS`.
Pods = `ceil(6667 / 350) = 20`. Plus a 15% warm pool → 3 more pods.
DB: limit 12k TPS, current 9k TPS, peak forecast 10.5k TPS → margin 1.5k (~14% of peak). Requires indexes/sharding/replicas or caching to bring the peak down to 8.5k.
Provider A (KYC): quota 120 RPS, peak 95 RPS, campaign +40% → 133 RPS > quota → route 70% A / 30% B.
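
The worked example above can be reproduced end to end in a few lines of Python, using only the figures given in the text:

```python
import math

# Service X sizing
per_pod_rps, target_rps, headroom = 350, 5_000, 0.25
required = target_rps / (1 - headroom)      # ~6667 RPS needed to keep 25% headroom
pods = math.ceil(required / per_pod_rps)    # pods for the required capacity
warm_pool = math.ceil(pods * 0.15)          # 15% warm pool on top

# DB margin at forecast peak
db_limit, db_peak = 12_000, 10_500
db_margin = (db_limit - db_peak) / db_peak  # margin relative to peak

# KYC provider quota check: campaign adds 40% to the 95 RPS peak
kyc_quota = 120
kyc_peak = 95 * 1.4                         # ~133 RPS
overflow = kyc_peak > kyc_quota             # True -> route the excess to provider B

print(pods, warm_pool, round(db_margin, 2), overflow)  # 20 3 0.14 True
```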

20) Capacity planning implementation template

1. Describe the e2e path and bottlenecks.
2. Enter the CU and measure the sustained throughput of each layer.
3. Configure saturation and p99 metrics on all links.
4. Generate event/campaign/release calendar.
5. Construct cohort prediction and what-if scenarios.
6. Pin headroom per flow and per region (tied to the error budget).
7. Set up HPA/VPA/KEDA + warm pools, limits/retries/queues.
8. Check provider quotas, enable multi-routes.
9. Collect dashboards and weekly rhythm capacity-review.
10. Quarterly - DR exercises and model revision.

21) The bottom line

Capacity planning is a managed combination of forecasting, architectural constraints and cost, not just "add more CPU." When every layer of the e2e path has a measured capacity, and headroom and degradation strategies are tied to SLOs and the error budget, peak loads, campaigns and outages stop being a surprise. This approach reduces incident risk, stabilizes business metrics and optimizes costs.
