Autoscaling and SLA Balance
1) Goals and principles
The goal of autoscaling is to meet the SLO (latency/availability) at minimum cost.
SLA↔SLO↔Cost: do not chase "endless" scale; scale within the error budget and the monetary limits.
Open load model: incoming requests form a stream of intensity 'λ'; the system must provide average parallelism 'N ≈ λ × W' (Little's law), where 'W' is the average service time.
2) What metrics are suitable for triggers
Technical:
- CPU/RAM/IO (proxy for saturation).
- In-flight requests and pool wait time.
- p95/p99 application latency (directly reflects the SLO).
- RPS/arrival rate.
- Queues: depth, age of messages, processing speed.
Business:
- The share of transactions completed successfully within T seconds (deposits, checkout).
- Time to confirm transactions.
Recommendation: combine 2-3 signals, for example latency + pool wait for services, and queue depth + message age for workers.
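One way to combine signals is to scale out when any signal breaches its limit and to scale in only when all of them are comfortably below it. The sketch below assumes two signals for a service; the names 'Signals', 'decide' and all threshold values are hypothetical and need tuning against your own SLO.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    p95_latency_ms: float  # application latency, 95th percentile
    pool_wait_ms: float    # time a request waits for a pool slot


# Hypothetical limits, chosen only for illustration.
LATENCY_LIMIT_MS = 300.0
POOL_WAIT_LIMIT_MS = 20.0
SCALE_IN_FRACTION = 0.5  # scale in only below half of every limit


def decide(s: Signals) -> str:
    """Return 'out', 'in' or 'hold' from the combined signals."""
    ratios = [
        s.p95_latency_ms / LATENCY_LIMIT_MS,
        s.pool_wait_ms / POOL_WAIT_LIMIT_MS,
    ]
    if max(ratios) > 1.0:
        return "out"   # any signal over its limit triggers scale-out
    if max(ratios) < SCALE_IN_FRACTION:
        return "in"    # every signal is well below its limit
    return "hold"      # dead band between the two thresholds
```

The dead band between the scale-in and scale-out thresholds is a deliberate choice here: it keeps the decision from oscillating when a signal hovers near its limit.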
3) Reactive vs predictive scale
Reactive (feedback): HPA/ASG add or remove capacity after the fact, based on observed metrics. Simple, but it lags behind the load.
Predictive (feed-forward): driven by the calendar, past telemetry, and market events. Enables pre-warming: bring up N instances Δt before the peak.
In practice: a hybrid - a baseline (minimum), a predictive boost before events, and reactive scaling to cover the remainder.
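The hybrid approach can be sketched as taking the maximum of three recommendations. This is a minimal illustration, not a specific autoscaler's API: 'hybrid_replicas' and its parameters are hypothetical, times are plain seconds, and 'predicted_replicas' stands for whatever the forecast says the event needs.

```python
def hybrid_replicas(
    baseline: int,
    reactive: int,
    predicted_replicas: int,
    now: float,
    event_start: float,
    event_end: float,
    prewarm_s: float,
) -> int:
    """Combine baseline, predictive boost and reactive recommendation."""
    # The predictive boost applies from (event_start - prewarm_s) to event_end.
    in_window = (event_start - prewarm_s) <= now <= event_end
    predictive = predicted_replicas if in_window else 0
    # Never go below the baseline; otherwise follow the larger demand.
    return max(baseline, reactive, predictive)
```

For example, with a baseline of 4, a reactive recommendation of 6 and a forecast of 20 replicas for an event starting at t=1000 with a 300-second pre-warm, the function returns 6 at t=500 and 20 at t=900.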
4) Scale policies and stability parameters
Target tracking: keep the metric near the target (e.g. CPU 60%).
Step scaling: add capacity in steps sized by how far the metric exceeds the threshold (more aggressively on spikes).
Stabilization window/cooldown: damps flapping (e.g. 60-180 sec).
Min/Max: lower and upper limits; keep max within DB/provider limits.
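The policies above can be illustrated with a small sketch. It uses the proportional rule 'desired = ceil(current × metric / target)' (the Kubernetes HPA documentation describes a rule of this form), clamps the result to Min/Max, and delays scale-in by returning the highest recommendation seen in a recent window. The class name 'TargetTracker' is hypothetical, and the window is counted in evaluation ticks rather than seconds to keep the example short.

```python
import math
from collections import deque


class TargetTracker:
    """Target tracking with Min/Max clamping and a scale-in stabilization window."""

    def __init__(self, target: float, min_replicas: int,
                 max_replicas: int, window: int) -> None:
        self.target = target
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        # Holds the last `window` clamped recommendations.
        self.recent = deque(maxlen=window)

    def step(self, current_replicas: int, metric: float) -> int:
        # Proportional rule: keep the metric near the target.
        raw = math.ceil(current_replicas * metric / self.target)
        clamped = max(self.min_replicas, min(self.max_replicas, raw))
        self.recent.append(clamped)
        # Scale-out takes effect immediately; scale-in waits until the
        # higher recommendations have aged out of the window.
        return max(self.recent)
```

With a CPU target of 60, 4 replicas at 90% CPU yield 6 replicas at once, while a later drop to 30% only reduces the count after the window has filled with lower recommendations.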
5) Level coordination (architectural cascade)
1. Perimeter/API gateway - elastic, but with limits and backpressure.
2. Services - HPA by latency/RPS/pool wait.
3. Queues/workers - KEDA/ASG by message depth/age.
4. DB/cache - scale carefully (replicas/shards) and in advance.
Rule: do not grow the application tier faster than the data tier can withstand.
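One concrete way to apply this rule is to derive the application's Max from the database connection limit. The sketch below assumes each replica opens a fixed-size connection pool; 'max_app_replicas' and its parameters are hypothetical names, and real limits may also depend on CPU, IOPS, or provider quotas.

```python
def max_app_replicas(db_max_connections: int, reserved: int,
                     pool_size_per_replica: int) -> int:
    """Upper bound on app replicas so that pools never exceed the DB limit."""
    if pool_size_per_replica <= 0:
        raise ValueError("pool_size_per_replica must be positive")
    # Connections kept aside for admin tools, migrations, other services.
    usable = db_max_connections - reserved
    return max(0, usable // pool_size_per_replica)
```

For instance, a database that allows 500 connections, with 50 reserved and 20 connections per replica, supports at most 22 replicas.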
6) Queues and Little's Law (how to count the workers)
For a queue with input rate 'λ' (msg/s) and average processing time 'W' (s):
- Required concurrency: 'N_min ≈ λ × W'.
- Peak/tail margin: 'N ≈ λ × W × (1.2–1.5)'.
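The worker count can be computed directly from these two formulas. A minimal sketch, assuming each worker processes one message at a time; the function name 'workers_needed' and the sample numbers are illustrative only.

```python
import math


def workers_needed(arrival_rate: float, service_time_s: float,
                   margin: float = 1.5) -> tuple[float, int]:
    """Return (N_min, N): Little's-law concurrency and the padded worker count."""
    n_min = arrival_rate * service_time_s  # N_min ≈ λ × W
    n = math.ceil(n_min * margin)          # N ≈ λ × W × margin, rounded up
    return n_min, n


# 200 msg/s at 0.25 s per message needs 50 busy workers on average;
# a 1.5 margin for peaks and tail latency raises that to 75.
n_min, n = workers_needed(arrival_rate=200, service_time_s=0.25, margin=1.5)
```

Rounding up rather than to the nearest integer is intentional: a fractional worker short means the queue grows without bound under sustained load.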