Resource Planner and Auto-Scaling
Brief summary
Stable scaling rests on four pillars:
1. Correct requests/limits and QoS classes.
2. Correct placement (topology, affinity, priorities, preemption).
3. Multi-level auto-scaling: HPA/VPA/KEDA + Cluster/Node Autoscaler + warm pools.
4. SLO-oriented logic (latency/queue depth) with anti-flapping and budgets.
Basic Resource Model
Requests/Limits and QoS Classes
Requests = guarantees for the scheduler; Limits = ceilings for runtime.
QoS: Guaranteed (requests = limits for CPU and memory), Burstable (partially set), BestEffort (none set).
Production services with strict SLOs → Guaranteed/Burstable; background jobs → Burstable/BestEffort (see the sketch below).
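A minimal sketch of a container resource block where requests equal limits for CPU and memory, which yields the Guaranteed QoS class; the names and values are illustrative, not from the original text.
```yaml
apiVersion: v1
kind: Pod
metadata: { name: api }                        # hypothetical service
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0      # hypothetical image
      resources:
        requests: { cpu: "1", memory: "2Gi" }
        limits:   { cpu: "1", memory: "2Gi" }  # requests == limits → Guaranteed QoS
```
Dropping the limits (or setting them higher than the requests) moves the pod to Burstable; omitting both makes it BestEffort.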
CPU/Memory/IO/Network
CPU is elastic (time-sharing); memory is hard (OOM-kill when the limit is exceeded).
Set IO/network limits and priorities separately (cgroups/tc), otherwise you get "noisy neighbors" (a hedged example follows).
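Kubernetes has no first-class network limit in the pod spec; one common option, assuming the cluster's CNI includes the bandwidth plugin, is per-pod annotations. The workload name and values below are illustrative.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                             # hypothetical batch workload
  annotations:
    kubernetes.io/ingress-bandwidth: "100M"      # requires a CNI with the bandwidth plugin
    kubernetes.io/egress-bandwidth: "100M"
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0     # hypothetical image
```
Disk IO throttling usually has to be handled at the node/cgroup level or by separating pools.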
GPU/Accelerators
Request accelerators as resources (GPU = 1, VRAM via profiles/MIG); use nodeSelector/taints and PodPriority for critical workloads.
For inference, tune batch size and model warm-up (see the sketch below).
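A sketch of a GPU inference pod under these rules: one whole GPU as an extended resource, pinned to a GPU pool via nodeSelector and a toleration, with an elevated priority. Pool labels, taint keys and the PriorityClass name are assumptions.
```yaml
apiVersion: v1
kind: Pod
metadata: { name: model-inference }               # hypothetical inference service
spec:
  priorityClassName: critical-online              # assumes this PriorityClass exists
  nodeSelector: { pool: gpu }                     # illustrative node label
  tolerations:
    - { key: "pool", operator: "Equal", value: "gpu", effect: "NoSchedule" }
  containers:
    - name: inference
      image: registry.example.com/inference:1.0   # hypothetical image
      resources:
        limits: { nvidia.com/gpu: 1 }             # GPUs are requested via limits
```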
Scheduling Policies
Priorities, preemption and PDB
PriorityClass for critical paths (payments, login); preemption is allowed.
PodDisruptionBudget protects a minimum number of replicas during evictions/updates (example below).
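A hedged sketch of both objects: a PriorityClass for the payment path and a PDB that keeps at least 80% of replicas available during voluntary disruptions; names and thresholds are illustrative.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: critical-online }
value: 1000000
preemptionPolicy: PreemptLowerPriority            # allow evicting lower-priority pods
globalDefault: false
description: "Critical user-facing paths (payments, login)"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: payments-pdb }
spec:
  minAvailable: 80%                               # keep most replicas during drains/updates
  selector: { matchLabels: { app: payments } }
```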
Affinity/Topology
node/pod affinity for co-location/anti-colocation (for example, never place replicas on the same host; see the sketch below).
topologySpreadConstraints spread pods evenly across zones/AZs.
NUMA/topology: CPU pinning/hugepages where low latency matters.
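A sketch of required pod anti-affinity that keeps replicas off the same host (the app label is an assumption); zone spreading via topologySpreadConstraints is shown in the config cheat sheet at the end.
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector: { matchLabels: { app: api } }
        topologyKey: kubernetes.io/hostname       # never two replicas on one node
```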
Taints and tolerations
Separate pools: 'prod', 'batch', 'gpu', 'system'. Critical workloads tolerate fewer neighbors (example below).
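A hedged example of dedicating a pool: the batch nodes carry a taint, and only workloads that explicitly tolerate it can land there; the key/value are assumptions.
```yaml
# Taint on the batch pool's nodes (e.g. kubectl taint nodes <node> pool=batch:NoSchedule).
# Toleration in the batch workload's pod spec:
tolerations:
  - key: "pool"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
```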
Auto-scaling: levels and signals
1) HPA (Horizontal Pod Autoscaler)
Scales pod replicas based on metrics: CPU/memory/custom (Prometheus Adapter).
Good signals: latency p95/p99, queue length/lag, RPS per pod, consumer lag.
Anti-flapping: stabilization (stabilizationWindow), minimum step, cooldown.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 6
  maxReplicas: 120
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies: [{ type: Percent, value: 100, periodSeconds: 60 }]
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 20, periodSeconds: 60 }]
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_request_duration_seconds_p95
        target:
          type: AverageValue
          averageValue: "0.25"   # 250 ms
```
2) VPA (Vertical Pod Autoscaler)
Tunes requests/limits to actual consumption (publishes recommendations).
Modes: 'Off', 'Auto' (restarts pods), 'Initial' (applied only at pod start).
In practice: enable in 'Off' mode → collect statistics → apply with releases.
3) KEDA/queue-based scaling
Reacts to external signals: Kafka lag, SQS depth, Redis length, Prometheus.
Ideal for Event/Queue Consumers (EDA).
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: consumer-scale }
spec:
  scaleTargetRef: { name: txn-consumer }
  minReplicaCount: 2
  maxReplicaCount: 200
  cooldownPeriod: 120
  pollingInterval: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker:9092
        consumerGroup: tx-cg
        topic: payments
        lagThreshold: "10000"
```
4) Cluster/Node Autoscaler (CA) + Warm Pools
CA adds/removes nodes on capacity shortage/excess.
Warm pools: pre-warmed nodes / pre-pulled images (speed up cold starts).
For peaks, use step-scaling and raise minNodes in advance.
Reaction speed and warm-up
Reaction-delay SLO: front layer ≤ 1-2 minutes; backends/DB scaled separately and in advance.
Warm-up: TLS/DNS/connections, loading models, cache warm-up and JIT.
Shadow load to exercise the cold path before the event (a warm-up gating sketch follows).
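One way to gate traffic on warm-up, as a sketch: a startupProbe against a warm-up endpoint keeps the pod out of rotation until models are loaded and caches are hot. The endpoint path, port and timings are assumptions.
```yaml
containers:
  - name: api
    image: registry.example.com/api:1.0            # hypothetical image
    startupProbe:
      httpGet: { path: /warmup/ready, port: 8080 } # hypothetical warm-up endpoint
      periodSeconds: 5
      failureThreshold: 60                         # allow up to ~5 minutes of warm-up
    readinessProbe:                                # runs only after the startupProbe succeeds
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 5
```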
Anti-flapping and stability
Hysteresis on metrics; smoothing (exponential moving average).
Stabilization windows in HPA, larger ones for 'scaleDown'.
Step-scaling instead of a sawtooth; rate-limit replica changes.
Budget-scaling: cap the % of traffic/replicas added per minute (a smoothing sketch follows).
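A sketch of smoothing the scaling signal on the Prometheus side: a recording rule takes a 5-minute moving average (a simple stand-in for exponential smoothing) of the p95 latency the HPA consumes. The metric and rule names are assumptions.
```yaml
groups:
  - name: autoscaling-signals
    rules:
      - record: job:http_p95_latency:avg5m
        # damped signal for the HPA: spikes shorter than the window barely move it
        expr: avg_over_time(http_server_request_duration_seconds_p95[5m])
```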
Observability and SLO
- Key SLIs: p95/p99 latency, error rate, throughput, queue depth/lag, CPU/memory saturation, pod pending time, node pressure.
- Alerts: growing pending pods, unschedulable events, IP/subnet shortage, slow image pulls, evictions (an example rule follows).
- Traces: tail-based sampling of p99 tails → shows bottlenecks during scaling.
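An illustrative Prometheus alert for pods stuck in Pending, based on the kube-state-metrics metric kube_pod_status_phase; the threshold and labels are assumptions.
```yaml
groups:
  - name: capacity-alerts
    rules:
      - alert: PodsPendingTooLong
        expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
        for: 5m                                   # pending ≥ 5 min → likely capacity or scheduling issue
        labels: { severity: warning }
        annotations:
          summary: "Pods stuck in Pending - check Cluster Autoscaler, quotas and IP/subnet capacity"
```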
FinOps: cost of elasticity
Metrics: $/1000 RPS, $/ms of p95, $/hour of reserve capacity.
Mix: on-demand + reserved + spot (for non-critical workloads).
Tie auto-scaling thresholds to the cost of an error: sometimes keeping warm headroom is cheaper.
Specifics for iGaming/fintech
Match/tournament peaks: raise 'minReplicas'/minNodes in advance, enable warm pools and warm up caches/models.
Payment consumers: KEDA on lag + idempotency; provider (PSP) limits as external degradation triggers.
Anti-bot: a separate pool, fast rule scaling, "gray" routes.
Regulatory: PDB for compliance services; priorities higher than for batch.
Checklists
Design
- Requests/limits set from profiling data; QoS class chosen.
- PriorityClass, PDB, taints/tolerations and topologySpread - configured.
- HPA driven by SLO metrics, not just CPU.
- VPA in 'Off' mode to collect recommendations (migration to 'Auto' planned).
- KEDA for event/queue-driven load.
- CA + warm pools, images are cached (image pre-pull).
Operation
- Stabilization windows and cooldowns are set (no flapping).
- Alerts on pending/unschedulable, lag, p95, error rate.
- Runbooks: "no nodes," "image won't pull," "OOM/evictions," "retry storm."
- Monthly capacity review: actual scaling vs plan/cost.
Common mistakes
Mixing critical and batch workloads on the same pool without taints → "noisy neighbors."
HPA on CPU only → latency regressions when IO/database is the bottleneck.
No PDB and priorities → critical pods are evicted first.
No warm-up → cold starts at peak.
Aggressive 'scaleDown' → sawtooth and container thrashing.
KEDA without idempotency → duplicate message processing during a storm.
Mini playbooks
1) Before peak event (T-30 min)
1. Increase 'minReplicas'/minNodes, activate warm pools.
2. Warm up CDN/DNS/TLS/connections, load models.
3. Enable gray routes/limits for bots.
4. Check dashboards: pending/lag/p95.
2) Node deficiency (unschedulable)
1. Check CA, cloud quotas, subnets/IP.
2. Temporarily lower batch limits, enable preemption of low priorities.
3. Temporarily bring up a larger instance type or a second pool.
3) Growth of lag in the queue
1. KEDA: scale up by trigger.
2. Raise consumer limits.
3. Enable idempotency keys and producer backpressure.
4) Replica sawtooth
1. Increase stabilization/cooldown windows.
2. Switch to step-scaling.
3. Smooth the metric with an exponential moving average.
Config cheat sheet
VPA (recommendation collection):
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: "apps/v1", kind: Deployment, name: api }
  updatePolicy: { updateMode: "Off" }   # collect recommendations only
```
Cluster Autoscaler (flag ideas, conceptual):
```
--balance-similar-node-groups
--expander=least-waste
--max-empty-bulk-delete=10
--scale-down-utilization-threshold=0.5
--scale-down-delay-after-add=10m
```
Topology spread:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector: { matchLabels: { app: api } }
```
Result
Effective scheduling and auto-scaling come from correct requests/limits + smart placement + multi-level scaling (HPA/VPA/KEDA/CA) + warm-up and anti-flapping, all tied to SLOs and the cost of a millisecond. Pin the policies in IaC, observe the right metrics (latency/lag), keep warm headroom before peaks, and the platform stays elastic, predictable and economical.