
Resource Planner and Auto-Scaling

Brief summary

Stable scaling rests on four pillars:

1. Correct requests/limits and QoS classes.

2. Correct placement (topology, affinity, priorities, preemption).

3. Multi-level auto-scaling: HPA/VPA/KEDA + Cluster/Node autoscaler + warm pools.

4. SLO-oriented logic (latency/queue depth) with anti-flapping and budgets.


Basic Resource Model

Requests/Limits and QoS Classes

Requests = guarantees for the scheduler; Limits = ceilings for runtime.
QoS classes: Guaranteed (requests = limits for CPU and memory), Burstable (partially set), BestEffort (none set).
Production services with hard SLOs - Guaranteed/Burstable; background jobs - Burstable/BestEffort.
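A minimal sketch of a Guaranteed-QoS pod, assuming illustrative names and sizes (profile real consumption before fixing the numbers):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api                                  # illustrative name
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0    # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"                        # requests == limits for CPU and memory -> Guaranteed QoS
          memory: "512Mi"
```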

CPU/Memory/IO/Network

CPU is elastic (time-shared); memory is hard (exceeding the limit means an OOM kill).

For IO/network, set limits/priorities separately (cgroups/tc), otherwise you get "noisy neighbors."
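Assuming the CNI bandwidth plugin is enabled in the cluster, per-pod network shaping can be sketched with annotations (values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                             # illustrative name
  annotations:
    kubernetes.io/ingress-bandwidth: "50M"       # requires the CNI bandwidth plugin
    kubernetes.io/egress-bandwidth: "50M"
spec:
  containers:
    - name: worker
      image: registry.example.com/batch:1.0      # placeholder image
```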

GPU/Accelerators

Request accelerators as an explicit resource (e.g., GPU = 1, VRAM via profiles); use nodeSelector/taints and PodPriority for critical workloads.
For inference - tune batch size and warm up models.
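A sketch of a GPU request; the resource name assumes the NVIDIA device plugin, and the pool label, taint and PriorityClass are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference                                # illustrative name
spec:
  priorityClassName: critical-online             # assumed PriorityClass (defined below)
  nodeSelector:
    pool: gpu                                    # assumed node-pool label
  tolerations:
    - { key: "pool", operator: "Equal", value: "gpu", effect: "NoSchedule" }
  containers:
    - name: model
      image: registry.example.com/inference:1.0  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                      # GPUs are requested as whole units via the device plugin
```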


Scheduling Policies

Priorities, preemption and PDB

PriorityClass for critical paths (payments, login); preemption is allowed.
PodDisruptionBudget protects a minimum number of replicas during node drains/updates.
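A sketch with illustrative names and numbers:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-online          # illustrative name
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Critical paths: payments, login"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                  # illustrative name
spec:
  minAvailable: 4                # keep at least 4 replicas during drains/updates
  selector:
    matchLabels:
      app: api
```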

Affinity/Topology

node/pod affinity for colocation/anti-colocation (for example, do not put replicas on the same host).
topologySpreadConstraints spread pods across zones/AZs.
NUMA/topology: CPU pinning/hugepages where low latency matters.
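A sketch of anti-colocation with podAntiAffinity (the label is illustrative); a zone-spread example is in the config cheat sheet at the end:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname    # never two replicas on the same node
        labelSelector:
          matchLabels:
            app: api                           # illustrative label
```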

Taints and tolerations

Separate pools: 'prod', 'batch', 'gpu', 'system'. Critical workloads tolerate fewer neighbors.
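A sketch of dedicating a pool; the taint key/value are illustrative:

```yaml
# Taint the pool's nodes, e.g.: kubectl taint nodes <node> pool=prod:NoSchedule
# Then only pods that tolerate the taint (and select the pool) land there:
tolerations:
  - { key: "pool", operator: "Equal", value: "prod", effect: "NoSchedule" }
nodeSelector:
  pool: prod        # pair the toleration with a selector to actually pin the workload
```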


Auto-scaling: levels and signals

1) HPA (Horizontal Pod Autoscaler)

Scales pod replicas by metrics: CPU/memory/custom (Prometheus Adapter).
Good signals: latency p95/p99, queue length/lag, RPS per pod, consumer lag.
Anti-flapping: stabilization (stabilizationWindow), minimum step, cooldown.

Example of HPA (latency-driven):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 6
  maxReplicas: 120
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies: [{ type: Percent, value: 100, periodSeconds: 60 }]
    scaleDown:
      stabilizationWindowSeconds: 300
      policies: [{ type: Percent, value: 20, periodSeconds: 60 }]
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_request_duration_seconds_p95
        target:
          type: AverageValue
          averageValue: "0.25"   # 250ms
```

2) VPA (Vertical Pod Autoscaler)

Tunes requests/limits to actual consumption (maintains recommendations).
Modes: 'Off', 'Auto' (restarts pods), 'Initial' (only at pod start).
In practice: start with 'Off' → collect statistics → apply alongside releases.

3) KEDA/queue-based scaling

Reacts to external signals: Kafka lag, SQS depth, Redis length, Prometheus.
Ideal for Event/Queue Consumers (EDA).

KEDA ScaledObject (Kafka lag):
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: consumer-scale }
spec:
  scaleTargetRef: { name: txn-consumer }
  minReplicaCount: 2
  maxReplicaCount: 200
  cooldownPeriod: 120
  pollingInterval: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker:9092
        consumerGroup: tx-cg
        topic: payments
        lagThreshold: "10000"
```

4) Cluster/Node Autoscaler (CA) + Warm Pools

CA adds/removes nodes on capacity shortage/excess.
Warm pools: pre-warmed nodes/pre-pulled images (speed up cold starts).
For peaks - step-scaling and raised minNodes in advance.
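One common way to keep warm headroom is low-priority placeholder pods that the scheduler preempts as soon as real workloads need the capacity; a sketch with illustrative names and sizes:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                         # lower than any real workload -> preempted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-reserve           # illustrative name
spec:
  replicas: 10                     # how much headroom to keep warm
  selector: { matchLabels: { app: capacity-reserve } }
  template:
    metadata: { labels: { app: capacity-reserve } }
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: { cpu: "500m", memory: "512Mi" }   # size of one headroom unit
```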


Reaction speed and warm-up

SLO for reaction delay: front layer ≤ 1-2 minutes; backends/DB scale separately and in advance.
Warm-up: TLS/DNS/connections, model loading, cache warm-up and JIT.
Shadow load to "pump" the cold path before the event.
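A sketch of gating traffic behind warm-up; the /warmup and /ready endpoints are assumptions to adapt per service:

```yaml
containers:
  - name: api
    image: registry.example.com/api:1.0        # placeholder image
    lifecycle:
      postStart:
        exec:
          command: ["sh", "-c", "curl -sf http://localhost:8080/warmup || true"]   # assumed warm-up hook
    startupProbe:
      httpGet: { path: /ready, port: 8080 }    # assumed readiness endpoint
      failureThreshold: 30
      periodSeconds: 2                         # allow up to ~60s for JIT/caches/connections
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5
```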


Anti-flapping and stability

Hysteresis on metrics, smoothing (exponential moving average).
Stabilization windows in HPA, larger for 'scaleDown'.
Step-scaling instead of a sawtooth; rate-limit replica changes.
Budget-scaling: limit the % of traffic/replicas added per minute.
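A sketch of rate-limited, step-wise scaling via the HPA 'behavior' block (numbers are illustrative):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 10                  # add at most 10 pods per minute
        periodSeconds: 60
    selectPolicy: Min              # take the most conservative policy
  scaleDown:
    stabilizationWindowSeconds: 600
    policies:
      - type: Percent
        value: 10                  # shed at most 10% of replicas per minute
        periodSeconds: 60
```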


Observability and SLO

Key SLIs:
  • p95/99 latency, error rate, throughput, queue depth/lag, CPU/Memory saturation, pod pending time, node pressure.
Alerts:
  • Growth of pending pods, unschedulable events, IP/subnet shortage, slow image pulls, evictions.
  • Traces: tail-based sampling of p99 tails → bottlenecks become visible during scaling.
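A sketch of a pending-pods alert, assuming kube-state-metrics and the Prometheus Operator's PrometheusRule CRD:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scaling-alerts             # illustrative name
spec:
  groups:
    - name: scheduling
      rules:
        - alert: PodsPendingTooLong
          expr: sum(kube_pod_status_phase{phase="Pending"}) > 5
          for: 5m                  # pending pods not absorbed within 5 minutes
          labels: { severity: warning }
          annotations:
            summary: "Pods stuck in Pending - check Cluster Autoscaler, quotas, subnets/IPs"
```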

FinOps: cost of elasticity

Metrics: $/1000 RPS, $/ms of p95, $/hour of reserve capacity.
Mix: on-demand + reserved + spot (for non-critical workloads).
Tie the auto-scale threshold to the cost of being wrong: sometimes keeping a warm reserve is cheaper.


Specifics for iGaming/fintech

Match/tournament peaks: raise 'minReplicas' and minNodes in advance, turn on warm pools and warm up caches/models.
Payment consumers: KEDA by lag + idempotency; provider (PSP) limits as external degradation triggers.
Anti-bot: a separate pool, fast scaling of rules, "gray" routes.
Regulatory: PDBs for compliance services; priorities higher than batch.


Checklists

Design

  • Requests/limits specified by profiling data; QoS selected.
  • PriorityClass, PDB, taints/tolerations and topologySpread - configured.
  • HPA by SLO metrics, not just CPU.
  • VPA in 'Off' mode to collect recommendations (migration to 'Auto' planned).
  • KEDA for event/queue-driven workloads.
  • CA + warm pools, images are cached (image pre-pull).

Operation

  • Stabilization windows and cooldowns are set (flapping excluded).
  • Alerts on pending/unschedulable, lag, p95, error rate.
  • Runbooks: "no nodes," "image won't pull," "OOM/evictions," "retry storm."
  • Monthly capacity review: actual scaling vs plan/cost.

Common mistakes

Mixing critical and batch workloads in the same pool without taints → "noisy neighbors."

HPA by CPU only → latency regressions under IO/database constraints.
No PDB and priorities → critical workloads get evicted first.
No warm-up → cold starts at peak.
Aggressive 'scaleDown' → sawtooth scaling and container thrashing.
KEDA without idempotency → duplicate message processing during a storm.


Mini playbooks

1) Before peak event (T-30 min)

1. Increase 'minReplicas'/minNodes, activate warm pools.
2. Warm up CDN/DNS/TLS/connections, load models.
3. Include gray routes/limits for bots.
4. Check dashboards: pending/lag/p95.

2) Node shortage (unschedulable pods)

1. Check CA, cloud quotas, subnets/IP.
2. Temporarily lower batch limits, enable preemption of low priorities.
3. Temporarily bring up a larger instance type or a second pool.

3) Growth of lag in the queue

1. KEDA: scale up by trigger.
2. Raise consumer limits.
3. Enable idempotency keys and backpressure on producers.

4) Replica sawtooth (flapping)

1. Increase stabilization/cooldown.
2. Switch to step-scaling.
3. Smooth the metric with an exponential moving average.


Config cheat sheet

VPA (collection of recommendations):
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: api-vpa }
spec:
  targetRef: { apiVersion: "apps/v1", kind: Deployment, name: api }
  updatePolicy: { updateMode: "Off" }   # collect recommendations only
```
Cluster Autoscaler (flag ideas, concept):

```
--balance-similar-node-groups
--expander=least-waste
--max-empty-bulk-delete=10
--scale-down-utilization-threshold=0.5
--scale-down-delay-after-add=10m
```
Topology spread:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector: { matchLabels: { app: api } }
```

Result

An efficient scheduler and auto-scaling come down to correct requests/limits + smart placement + multi-level scaling (HPA/VPA/KEDA/CA) + warm-up and anti-flapping, all tied to SLOs and the cost of a millisecond. Fix policies in IaC, observe the "right" metrics (latency/lag), keep a warm reserve for peaks, and the platform will stay elastic, predictable and economical.
