Resource allocation
1) Task and principles
Resource allocation is a systematic way to match demand (load, projects, incidents) with supply (CPU/RAM/IO/network, licenses, people, budgets) so that target SLOs are met within FinOps constraints.
Basic principles:
- SLO-first: every resource serves a quality target; allocation is the tool for meeting it.
- Fairness + Priority: everyone gets a fair share, but guaranteed workloads come first.
- Isolation: limit the blast radius of resource-hungry workloads.
- Elasticity: automatic scale-out and scale-in to follow actual demand.
- Cost-aware: each additional resource should have a clear, explainable effect on SLO or revenue.
- Evidence-based: decisions are confirmed by telemetry and experiments.
2) Resource taxonomy
Computing: CPU/Memory/GPU, container pools, serverless quotas.
Storage: IOPS/throughput, hot/warm/cold layers, cache.
Network: egress/ingress, CDN, private channels, IP pools.
Data: slots/window resources in DWH/streaming, backfill windows.
People: on-call slots, incident commander (IC) and release duty, SRE/Dev time (hours per sprint).
Vendors: provider limits (PSP/KYC/CDN), rate-limits and connections.
3) Prioritization model (portfolio)
Tier-0: vital flows (login, payments). Guaranteed resources, dedicated pools.
Tier-1: business-critical (core product, D-1 reports). Preferred quotas.
Tier-2/3: auxiliary/research. Burstable, with budget limits.
Projects: Impact × Urgency × Confidence × Cost rating → rank; approval at the CAB/portfolio review.
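A minimal sketch of the ranking rule above, assuming each factor is scored on a 1-5 scale and that the cost rating is inverted (cheaper projects score higher), so a plain product can be sorted in descending order. The `Project` class, the field names, and the sample backlog are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    impact: int       # 1..5, higher = more business impact
    urgency: int      # 1..5, higher = more time-critical
    confidence: int   # 1..5, higher = better evidence behind the estimate
    cost_rating: int  # 1..5, ASSUMPTION: higher = cheaper to deliver

    @property
    def score(self) -> int:
        # Plain product of the four factors.
        return self.impact * self.urgency * self.confidence * self.cost_rating


def rank(projects: list[Project]) -> list[Project]:
    # Highest score first; ties are broken by name so the order is deterministic.
    return sorted(projects, key=lambda p: (-p.score, p.name))


# Hypothetical backlog used only for illustration.
backlog = [
    Project("report-cache", impact=3, urgency=2, confidence=5, cost_rating=4),
    Project("payments-failover", impact=5, urgency=4, confidence=4, cost_rating=3),
    Project("ml-sandbox", impact=2, urgency=1, confidence=2, cost_rating=2),
]

for project in rank(backlog):
    print(f"{project.name}: {project.score}")
```

A product punishes a low score in any single factor; a weighted sum would be a softer alternative if one weak factor should not sink a project.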
4) Allocation policies (guarantees, quotas, limits)
Guaranteed (dedicated): fixed share/reserve; for Tier-0/1.
Burstable: base quota + right to borrow up to the limit.
Best-effort: no guarantees; can be preempted.
Quota/Limit-as-Code: all quotas and limits are described declaratively (policy repository).
Preemption/Pod Disruption Budget: who can be evicted, and how fast.
Network quotas: egress per tenant, limits on connections to providers.
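The three policies above can be sketched as one admission check. This is a simplified model, not a scheduler implementation: it assumes the guaranteed share is reserved separately, so only usage above `base` competes for the shared pool. The `Quota` class and `admit` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    base: float        # guaranteed share (e.g. CPU cores); 0 for best-effort
    limit: float       # hard ceiling; equal to base for a guaranteed policy
    used: float = 0.0  # current consumption


def admit(quota: Quota, request: float, pool_free: float) -> bool:
    """Return True if `request` more units may be granted to this tenant."""
    new_used = quota.used + request
    if new_used > quota.limit:
        return False  # the hard limit is never exceeded
    if new_used <= quota.base:
        return True   # still inside the guaranteed share
    # Only the part above `base` is borrowed from the shared pool.
    borrowed = new_used - max(quota.used, quota.base)
    return borrowed <= pool_free


burstable = Quota(base=8, limit=12, used=6)
print(admit(burstable, 2, pool_free=0))    # fits inside the guaranteed share
print(admit(burstable, 4, pool_free=1))    # needs 2 borrowed units, only 1 is free
print(admit(burstable, 7, pool_free=100))  # would exceed the hard limit
```

Under this model, guaranteed is `base == limit`, burstable is `base < limit`, and best-effort is `base == 0`.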
5) Multi-tenancy and isolation
Namespace/Account per tenant: individual limits, budget, audit.
Noisy neighbors: cgroups/requests/limits/IO-throttling; separate nodes for "heavy" tasks.
P95-isolation: SLOs are calculated on percentiles, not averages; one tenant's burst must not break its neighbors' p95.
Data tenancy: separate storage layers and caches for VIP/regions.
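To illustrate why isolation is judged on percentiles, the sketch below compares the mean and the p95 of a made-up latency sample in which 10% of requests are slowed by a neighbor's burst. The sample values are hypothetical; the percentile uses the nearest-rank definition.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 normal requests, 10 hit by a noisy neighbor.
latencies = [100.0] * 90 + [900.0] * 10

print(f"mean = {mean(latencies):.0f} ms")
print(f"p95  = {percentile(latencies, 95):.0f} ms")
```

Here the mean is 180 ms, which may still look acceptable, while the p95 is 900 ms: the average hides exactly the requests that the burst damaged.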
6) Auto-scaling and elasticity
HPA/VPA/Cluster-autoscaler: scale by SLI/SLI proxy (latency p95, queue depth), not just CPU.
Scheduled scaling: scale up in advance of known peak windows and events.
Warm pools: pre-warmed nodes/connections for fast scale-up.
Network/CDN: automatic rebalancing based on RUM, Anycast, and POP load.
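Scaling on an SLI proxy such as queue depth can be sketched with the proportional rule described in the Kubernetes HPA documentation: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance. The function name and the sample numbers are assumptions; `metric` is taken to be the average value per replica.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling on a per-replica metric, clamped to [min, max]."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: avoid flapping
    else:
        wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 10 replicas, each with 400 queued items against a target of 200.
print(desired_replicas(10, metric=400, target=200, min_replicas=6, max_replicas=60))
```

The same function works for a latency p95 target; only the metric source changes. A cooldown between decisions is still needed and is not modeled here.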
7) Queues, service classes and SLAs
Classes: 'gold/silver/bronze' with target wait times and error budgets.
Queues/buses: prioritization, dedicated partitions for Tier-0, DLQ.
Backpressure: drop/shape/slow disciplines to protect the core.
Adaptive timeouts/retries: tuned to the service class and the current system state.
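A minimal sketch of the "drop" discipline combined with service classes: a bounded queue that, when full, sheds the newest item of the lowest class to make room for a higher-class one. `BoundedClassQueue` and the priority mapping are hypothetical; a production queue would also need locking and metrics.

```python
import heapq
import itertools

# Lower number = higher priority.
PRIORITY = {"gold": 0, "silver": 1, "bronze": 2}


class BoundedClassQueue:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: list[tuple[int, int, object]] = []  # (priority, sequence, payload)
        self._seq = itertools.count()  # unique, so payloads are never compared

    def offer(self, service_class: str, payload: object) -> bool:
        """Enqueue; when full, evict a lower-class item or reject the new one."""
        entry = (PRIORITY[service_class], next(self._seq), payload)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        worst = max(self._heap)  # lowest class, newest arrival
        if entry < worst:        # true only if the new item has a strictly higher class
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            heapq.heappush(self._heap, entry)
            return True
        return False             # backpressure: the caller must drop or retry later

    def poll(self):
        """Dequeue the highest-class, oldest item, or None when empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = BoundedClassQueue(capacity=2)
queue.offer("bronze", "b1")
queue.offer("silver", "s1")
print(queue.offer("gold", "g1"))    # evicts b1
print(queue.offer("bronze", "b2"))  # rejected: the queue holds only higher classes
```

Strict priority like this can starve bronze traffic under sustained load; aging or weighted dequeueing is the usual remedy when starvation must be ruled out.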
8) Human resources
Shifts and coverage: staffing matched to traffic (follow-the-sun), with P1 and P2 backups at peak.
SRE/Dev focus: share of time spent on reactive vs. proactive work (e.g. 50/50), tracked with KPIs.
Resource requests: RFC templates for hours/sprint, with a transparent priority queue.
9) Financial Model (FinOps)
Unit economics: $/1k requests, $/successful payment, $/GiB of logs.
Budgets and alerts: quotas per account/tenant, warnings about overspending.
Optimization: hot/warm/cold storage, log sampling, spot pools for non-critical workloads.
Showback/Chargeback: cost reports by team/tenant motivate efficiency.
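The unit-cost and budget-alert ideas above reduce to simple arithmetic. The sketch below assumes a linear burn-down of a monthly budget and a 10% warning margin; both are illustrative choices, as are the function names and figures.

```python
def unit_cost(total_cost_usd: float, units: float, per: float = 1.0) -> float:
    """Cost per `per` units, e.g. per=1000 for $/1k requests."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost_usd / units * per


def budget_status(spent: float, budget: float, elapsed_fraction: float,
                  warn_ratio: float = 1.1) -> str:
    """Compare actual spend with a linear burn-down of the period budget."""
    expected = budget * elapsed_fraction
    if spent > budget:
        return "over-budget"
    if spent > expected * warn_ratio:
        return "warning"
    return "ok"


# Hypothetical month: $4,200 spent serving 12M requests.
print(f"${unit_cost(4200, 12_000_000, per=1000):.2f} per 1k requests")
# Halfway through the month, $6,000 of a $10,000 budget is gone.
print(budget_status(spent=6000, budget=10000, elapsed_fraction=0.5))
```

A linear burn-down is a poor fit for seasonal traffic; weighting `elapsed_fraction` by the expected load profile would give fewer false alarms.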
10) Provider management
Limits and windows: contracted TPS and queues at PSP/KYC/CDN providers; maintenance windows tracked in the calendar.
Failover profiles: weights and routing between multiple providers.
Health metrics: response time, reliability, cost per successful operation.
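A failover profile with weights can be sketched as weighted random selection over the providers that are currently healthy. The provider names, weights, and the `pick_provider` function are hypothetical; how health is determined is out of scope here.

```python
import random


def pick_provider(weights: dict[str, float], healthy: set[str],
                  rng: random.Random) -> str:
    """Weighted choice among healthy providers; unhealthy ones get no traffic."""
    candidates = {name: w for name, w in weights.items() if name in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy provider available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


profile = {"psp-a": 0.7, "psp-b": 0.3}  # hypothetical routing weights
rng = random.Random(42)                  # seeded only to make the demo repeatable

# Normal operation: traffic is split roughly 70/30.
sample = [pick_provider(profile, {"psp-a", "psp-b"}, rng) for _ in range(1000)]
print(sample.count("psp-a"), sample.count("psp-b"))

# psp-a is marked unhealthy: everything fails over to psp-b.
print(pick_provider(profile, {"psp-b"}, rng))
```

When a provider has a contracted TPS limit, the failover target's limit should be checked before shifting the full load onto it.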
11) Allocation maturity metrics
SLO adherence by class: % compliance in gold/silver/bronze.
Resource efficiency: CPU/RAM/IO utilization (median/p95), idle share.
Cost per SLO-point: how the cost of holding the SLO target changes.
Throttling/Preemption rate: how often we throttle or preempt, and whom.
Hotspot MTTA: how quickly an overheated pool or tenant is acknowledged.
Fairness index: spread of latency/quota between tenants (Gini coefficient or coefficient of variation).
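The fairness index can be computed directly from per-tenant figures. The sketch below implements the Gini coefficient (0 = perfectly even, approaching 1 = one tenant takes everything) and the coefficient of variation; the sample per-tenant values are made up.

```python
from statistics import mean, pstdev


def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values, via the sorted-rank formula."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


# Hypothetical p95 queue wait (ms) per tenant.
waits = [120.0, 130.0, 110.0, 480.0]
print(f"gini = {gini(waits):.2f}, cv = {coefficient_of_variation(waits):.2f}")
```

Tracking the index per service class rather than across all tenants avoids flagging the intended gap between gold and bronze as unfairness.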
12) Checklists
Before changing an allocation
- SLO targets and the service class are defined.
- Load telemetry is available (p95/p99, growth, seasonality).
- Quotas/limits are described in Git and reviewed.
- Effects on neighbors are tested (isolation tests).
- A rollback plan and guardrails are ready.
Weekly operations review
- Pool utilization heatmap and hotspot report.
- FinOps report: $/unit, overruns, anomalies.
- Provider limits and SLAs are met.
- Queues: wait times within class targets, no starvation.
- CAPA for identified bottlenecks is in progress.
13) Templates (ideas)
13.1 Quota Policy (YAML)
```yaml
tenant: vip-eu
class: gold
compute:
  cpu:
    request: "8000m"
    limit: "12000m"
  memory:
    request: "16Gi"
    limit: "24Gi"
storage:
  tier: hot
  iops_min: 8000
network:
  egress_mbps_cap: 500
slo:
  latency_p95_ms: 250
preemption:
  protected: true
burst:
  allowed: true
  max_factor: 1.5
```
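A policy like the one above can be checked mechanically before it is merged. The sketch below works on a plain dict that mirrors the template (loading the YAML file is left out to stay dependency-free). It assumes Kubernetes-style quantity suffixes and reads `max_factor` as the allowed ratio of limit to request; that reading is an interpretation, not something the template states.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text: str) -> float:
    """Parse '8000m' (milli-units) or '16Gi' (binary bytes) into a plain number."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return float(text[:-2]) * factor
    if text.endswith("m"):
        return float(text[:-1]) / 1000
    return float(text)


def violations(policy: dict) -> list[str]:
    """Return human-readable problems; an empty list means the policy passes."""
    burst = policy.get("burst", {})
    # ASSUMPTION: without burst, the limit may not exceed the request at all.
    factor = burst.get("max_factor", 1.0) if burst.get("allowed") else 1.0
    problems = []
    for resource, spec in policy.get("compute", {}).items():
        request = parse_quantity(spec["request"])
        limit = parse_quantity(spec["limit"])
        if limit < request:
            problems.append(f"{resource}: limit is below request")
        elif limit > request * factor:
            problems.append(f"{resource}: limit exceeds request x {factor}")
    return problems


policy = {
    "tenant": "vip-eu",
    "compute": {
        "cpu": {"request": "8000m", "limit": "12000m"},
        "memory": {"request": "16Gi", "limit": "24Gi"},
    },
    "burst": {"allowed": True, "max_factor": 1.5},
}
print(violations(policy))
```

Run as a CI step on the policy repository, a check of this kind turns "quotas are reviewed" from a checklist item into an enforced gate.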
13.2 Autoscaling profile (fragment)
```yaml
autoscaling:
  metric: "queue_depth"  # or biz_sli.payment_latency_p95
  target: 200
  min_replicas: 6
  max_replicas: 60
  warm_pool: 4
  cooldown_sec: 120
```
13.3 Service class and queues
```yaml
class: gold
sla:
  wait_p95_ms: 150
queue:
  partition: "gold-eu"
  retry_policy:
    attempts: 2
    backoff_ms: 200
  backpressure: "shape"  # or drop/slow
```
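The retry policy in the template gives only an attempt count and a base backoff. One common way to turn that into concrete delays is exponential backoff with full jitter, sketched below. Treating `attempts` as the number of retries and doubling the ceiling each time are assumptions; the template does not specify either.

```python
import random


def backoff_delays(attempts: int, backoff_ms: float, rng: random.Random,
                   factor: float = 2.0) -> list[float]:
    """Delay in ms before each retry: uniform in [0, backoff_ms * factor**i]."""
    return [rng.uniform(0, backoff_ms * factor ** i) for i in range(attempts)]


# Values taken from the gold-class template: 2 attempts, 200 ms base backoff.
delays = backoff_delays(attempts=2, backoff_ms=200, rng=random.Random())
print([round(d) for d in delays])
```

With a 150 ms p95 wait target, the sum of delays should also be capped by the remaining latency budget, so that retries do not push a gold request past its own SLA.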
13.4 Resource Request (People)
RFC: RES-OPS-2025-11
Goal: reinforce on-call P2 during the peak of the November promotions (EU)
Period: 2025-11-25..2025-12-05
Rationale: traffic forecast +30%; last year's p95 MTTA increased
Request: +1 P2 slot per day, +IC in prime time
14) Procedures and automation
Planner bot: calculates quotas from traffic history and SLO targets, then opens a PR to the policy repository.
Guardrails bot: sends a stop signal to deploys when quota is insufficient or the pool is oversubscribed.
Comms bot: notifies teams about overspending, preemption, and class changes.
Annotations: releases and maintenance windows temporarily change weights/quotas for the duration of the work (the suppression is lifted afterwards).
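The planner bot's core calculation can be sketched as "p95 of observed peak usage plus headroom, rounded up to an allocation step". The 20% headroom, the 0.5-core step, and the function name are illustrative assumptions; a real planner would also account for growth and seasonality.

```python
import math


def propose_quota(hourly_peaks: list[float], headroom: float = 1.2,
                  step: float = 0.5) -> float:
    """Propose a quota from usage history (e.g. CPU cores per hour)."""
    if not hourly_peaks:
        raise ValueError("usage history is empty")
    ordered = sorted(hourly_peaks)
    rank = max(1, math.ceil(95 * len(ordered) / 100))  # nearest-rank p95
    p95 = ordered[rank - 1]
    # Round up so the proposal never lands below the computed need.
    return math.ceil(p95 * headroom / step) * step


# Hypothetical history: hourly peak CPU usage in cores.
history = [float(cores) for cores in range(1, 21)]
print(propose_quota(history))
```

The result would go into the pull request body together with the input window, so reviewers can see which data the proposal rests on.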
15) Anti-patterns
One large pool for everyone, with no isolation from noisy neighbors.
Allocation by gut feel, without SLOs and telemetry.
Uncontrolled burst without an upper limit → neighbors get starved.
Lack of backpressure/queues → timeouts snowball.
Ignoring the cost of logs/egress: a "quiet" budget leak.
Fixed quotas that ignore seasonality/peaks → unavailability or overspending.
16) Implementation Roadmap (4-8 weeks)
1. Weeks 1-2: inventory of resources and services; class assignment (gold/silver/bronze) and initial quotas; baseline SLOs.
2. Weeks 3-4: enable auto-scaling by SLI proxy; configure queues and backpressure; isolate Tier-0 pools.
3. Weeks 5-6: FinOps reporting ($/unit, quotas, budget alerts); warm pools and scheduled scaling for peak days.
4. Weeks 7-8: Planner/Guardrails automation; tenant dashboard (quota/cost visibility); quarterly review of fairness and hotspots.
17) The bottom line
Resource allocation is not a one-time setup but a living process built into SLOs, telemetry, and FinOps. When priorities are formalized, quotas and limits are managed as code, isolation and elasticity are the default, and decisions are backed by metrics and cost, the system reliably survives peaks, protects critical flows, and does not burn through the budget.