FinOps and infrastructure budgeting
1) FinOps objectives and area of responsibility
FinOps integrates engineering, finance and product to manage cost while maintaining SLO/delivery speed.
Results:
- Cost transparency by service/team/tenant/region.
- Predictability (plan/actual, deviations, reforecast).
- Conscious trade-off: performance ↔ cost.
Roles:
- Product owners: revenue and unit-economics goals.
- Eng/Platform: architectural levers and SLOs.
- Finance: budgets, commits, reporting.
- FinOps guild: process, tools, training.
2) Metrics and unit economics
Basic cost SLIs:
- Cost/Req (cost of one request), Cost/ActiveUser/Month, Cost/Tenant/Brand/Region.
- COGS% (cost/revenue), Gross Margin.
- Waste%: the share of paid-for capacity that is not used.
- Coverage% (RI/CUD/SP) - the share of the load covered by commits.
- Egress/Req, Storage/Req, Observability/Req.
Cost/Req = (Compute + Storage + Network + Observability + 3rd-party) / #Requests
COGS% = COGS / Revenue
Waste% = (Idle + Over-provision + Unused) / Total
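The three formulas above can be sketched in code as follows. This is only an illustration: the `CostBreakdown` structure, the function names and the sample figures are assumptions, not part of any billing API.

```python
from dataclasses import dataclass


@dataclass
class CostBreakdown:
    """Cost components for one period, all in the same currency (hypothetical structure)."""
    compute: float
    storage: float
    network: float
    observability: float
    third_party: float

    @property
    def total(self) -> float:
        return (self.compute + self.storage + self.network
                + self.observability + self.third_party)


def cost_per_request(costs: CostBreakdown, requests: int) -> float:
    """Cost/Req = (Compute + Storage + Network + Observability + 3rd-party) / #Requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return costs.total / requests


def cogs_pct(cogs: float, revenue: float) -> float:
    """COGS% = COGS / Revenue, returned as a percentage."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return 100.0 * cogs / revenue


def waste_pct(idle: float, over_provisioned: float, unused: float, total: float) -> float:
    """Waste% = (Idle + Over-provision + Unused) / Total, returned as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (idle + over_provisioned + unused) / total


# Illustrative numbers only.
month = CostBreakdown(compute=6000, storage=1500, network=1200,
                      observability=800, third_party=500)
unit_cost = cost_per_request(month, requests=50_000_000)
```

Tracking `unit_cost` per service over time is usually more informative than the absolute bill, because it separates traffic growth from efficiency changes.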
3) Tagging, ownership and policies
Required tags: 'env', 'team', 'service', 'tenant', 'product', 'cost_center', 'slo_tier', 'owner', 'ttl'.
Ownership: each resource has a responsible owner and a review period.
Policies as code: disallow untagged resource creation, size limits, valid regions, test environment lifetime.
- Deny public egress without a proxy/PrivateLink.
- Require 'description'/'owner'/'ttl' on SG/NSG/firewall rules.
- Budget quotas per team (soft/hard thresholds).
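A policy-as-code check for the tagging rules above might look like the sketch below. Real deployments would normally express this in the policy engine of the cloud or cluster in use; the `admit` function, the resource dictionary shape and the `ALLOWED_REGIONS` values here are hypothetical and only illustrate the logic.

```python
# Required tags as listed in this section.
REQUIRED_TAGS = ("env", "team", "service", "tenant", "product",
                 "cost_center", "slo_tier", "owner", "ttl")

# Hypothetical allowlist; substitute the regions your organisation permits.
ALLOWED_REGIONS = frozenset({"eu-west-1", "eu-central-1"})


def missing_tags(tags: dict) -> list:
    """Return required tags that are absent or blank, in declaration order."""
    return [t for t in REQUIRED_TAGS if not str(tags.get(t) or "").strip()]


def admit(resource: dict) -> tuple:
    """Decide whether a resource creation request passes the policy.

    `resource` is an assumed shape: {"region": str, "tags": {name: value}}.
    Returns (allowed, reason).
    """
    missing = missing_tags(resource.get("tags") or {})
    if missing:
        return False, "missing tags: " + ", ".join(missing)
    region = resource.get("region")
    if region not in ALLOWED_REGIONS:
        return False, f"region not allowed: {region}"
    return True, "ok"
```

Returning a reason string alongside the decision makes the denial actionable for the team that hit it.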
4) Budget cycles and calendar
Annual budget (AOP): targets for COGS, margins and cloud commits.
Quarterly plans: adjustments by roadmap/seasonality.
Rolling forecast (monthly, 6-9 month horizon): takes actuals and trends into account and recalculates the deficit/surplus.
Incident pool: 3-5% reserve for unexpected egress/capacity.
Budget hierarchy: 1) Company → 2) Product/Brand → 3) Team/Service → 4) Environment → 5) Resource class.
5) Load and cost forecasting
Drivers: MAU/DAU, RPS by route, data volumes, batch/ML job frequency, seasonality, marketing campaigns.
Models: exponential smoothing plus event adjustments (releases, regions, providers).
What-if: X% RPS growth, migrating to another region, enabling caching/edge, changing storage class.
- Separate fixed costs (commits, leases, always-on) from variable costs (on-demand/spot, egress).
- Keep a scaling ladder (capex/commit steps) for reaching the peaks.
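The "exponential smoothing plus event adjustments" model mentioned above can be sketched as simple exponential smoothing with a per-period multiplier for known events. The function names, the `alpha` default and the multiplier mechanism are assumptions made for illustration; a production forecast would likely also model trend and seasonality.

```python
def exp_smooth(series, alpha=0.3):
    """Simple exponential smoothing; returns the final smoothed level."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def forecast(series, horizon, alpha=0.3, event_multipliers=None):
    """Flat forecast from the smoothed level, scaled for known events.

    `event_multipliers` maps a 1-based future period to a factor,
    e.g. {2: 1.2} for a campaign expected to add 20% in period 2.
    """
    level = exp_smooth(series, alpha)
    multipliers = event_multipliers or {}
    return [level * multipliers.get(step, 1.0) for step in range(1, horizon + 1)]


# Monthly cost history (illustrative) and a campaign in the second month ahead.
history = [100.0, 200.0]
plan = forecast(history, horizon=3, alpha=0.5, event_multipliers={2: 1.2})
```

The same function covers what-if questions such as "X% RPS growth": pass the growth factor as a multiplier for the affected periods and compare the result against the baseline forecast.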
6) Commits at cloud providers
Reserved Instances/CUD/Savings Plans: cover the stable 50-70% of baseload.
Diversify by term (1-year/3-year/extendable) and by region/instance type.
On-Demand buffer for peaks and troughs.
Spot/Preemptible: stateless/CI/background analytics, with a safe fallback.
- First rightsizing and autoscaling, then commits.
- Resale marketplaces (where available) for unused RIs.
- Monitor egress rates and discounts for direct connections.
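As a rough sketch of how Coverage% and commit sizing relate, the code below measures how much of an hourly usage series a given commitment would cover and how much of the commitment would sit unused. Taking the minimum observed usage as the "baseload" is an assumption made here for simplicity, and the function names are hypothetical.

```python
def commit_for_baseload(hourly_usage, fraction=0.6):
    """Suggest a commit level as a fraction (e.g. 0.5-0.7) of the baseload.

    Baseload is approximated by the minimum observed hourly usage,
    which is an assumption of this sketch.
    """
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    return fraction * min(hourly_usage)


def coverage_report(hourly_usage, commit_units):
    """Compare a commit level against actual usage, hour by hour."""
    total = sum(hourly_usage)
    if total <= 0:
        raise ValueError("total usage must be positive")
    covered = sum(min(u, commit_units) for u in hourly_usage)
    unused = sum(max(commit_units - u, 0) for u in hourly_usage)
    return {
        "coverage_pct": 100.0 * covered / total,  # share of load covered by commits
        "on_demand_units": total - covered,       # remainder billed on demand
        "unused_commit_units": unused,            # paid for but not consumed
    }


usage = [10, 12, 20, 8]  # illustrative instance-hours
report = coverage_report(usage, commit_units=10)
```

A non-zero `unused_commit_units` is the signal that the commit sits above the real baseload, which is why rightsizing and autoscaling should come first.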
7) Architectural cost reduction levers
Compute: horizontal autoscaling, Karpenter/Cluster Autoscaler, class-based QoS, shutting down dev clusters at night.
Storage: storage classes (hot/warm/cold), lifecycle rules/TTL, partitioning, dedup, compression.
Network: CDN/edge + SWR, PrivateLink/PSC, API call aggregation, HTTP/3/QUIC.
DB/Cache: pgBouncer/RDS Proxy, read replicas, TTL/archive, two-tier cache.
Observability: tail-sampling of traces (100% of errors and p99, 1-10% of the rest), retention by class, metric downsampling.
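The tail-sampling rule described above (keep every error and every trace at or above p99 latency, sample the rest at 1-10%) can be sketched as a single decision function. This is not the configuration of any particular collector; `keep_trace` and its parameters are hypothetical, and the hash-based bucket is one assumed way to make the decision deterministic per trace.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               p99_ms: float, base_rate: float = 0.05) -> bool:
    """Decide, after a trace has completed, whether to retain it.

    Errors and slow traces (>= p99) are always kept; everything else is
    kept with probability `base_rate`, derived from a hash of the trace ID
    so that repeated evaluations of the same trace agree.
    """
    if is_error or duration_ms >= p99_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Top 53 bits of the hash mapped to a float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < base_rate
```

Because the decision depends only on the trace ID and the thresholds, several collector replicas would reach the same verdict for the same trace.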
8) Chargeback / Showback
Internal billing model:
- Showback (soft): monthly report with no money transfer.
- Chargeback (hard): costs are actually charged against the team's budget.
Allocation:
- Direct costs → by tag.
- Shared costs (egress, logging platform) → proportional to drivers (requests, GB of logs, storage).
- Disputed cases: the FinOps guild arbitrates and helps teams optimize.
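Proportional allocation of a shared cost by driver, as described above, comes down to a few lines. The team names and figures are illustrative, and `allocate_shared` is a hypothetical helper rather than part of any billing tool.

```python
def allocate_shared(shared_cost: float, drivers: dict) -> dict:
    """Split a shared cost across teams in proportion to a usage driver.

    `drivers` maps team -> driver value (requests, GB of logs, storage, ...).
    """
    total = sum(drivers.values())
    if total <= 0:
        raise ValueError("driver total must be positive")
    return {team: shared_cost * value / total for team, value in drivers.items()}


# Logging platform bill split by GB of logs ingested (illustrative numbers).
shares = allocate_shared(1000.0, {"payments": 600, "lobby": 300, "kyc": 100})
```

Choosing the driver matters more than the arithmetic: a driver that teams can influence (such as log volume) gives them a lever to reduce their share.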
9) Dashboards and alerts
Mandatory minimum:
- Cost map: by service/team/tenant/region, with drill-down to individual resources.
- Plan/actual/deviations + forecast (rolling).
- Coverage RI/CUD/Spot and savings.
- Egress heatmap (directions, providers, PSP).
- Cost ↔ SLO: p95/p99 correlation with Cost/Req.
Alerts:
- Anomaly detection: a surge of more than 30% above trend within 24 hours.
- Budgets: alerts at 50/80/100% of the period budget.
- Sudden growth of egress, "DEBUG-logs in prod," drop in coverage%.
- "Idle services" and unused volumes/IPs.
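The two numeric alert rules above (budget thresholds at 50/80/100% and a surge of more than 30% over trend) can be sketched as follows. How the trend value is computed is left open here; both function names are assumptions for illustration.

```python
def crossed_thresholds(spend: float, budget: float, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget thresholds already reached by the current spend."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


def is_anomaly(last_24h_cost: float, trend_cost: float, max_surge: float = 0.30) -> bool:
    """Flag a 24-hour cost that exceeds the trend by more than `max_surge`."""
    if trend_cost <= 0:
        # No meaningful baseline: any spend at all is worth a look.
        return last_24h_cost > 0
    return (last_24h_cost - trend_cost) / trend_cost > max_surge
```

For example, `crossed_thresholds(8500, 10000)` reports the 50% and 80% marks, and a day costing 140 against a trend of 100 is flagged while 125 is not.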
10) Processes and RACI
Weekly FinOps stand-up: top deviations, actions, owners.
Change review: estimate the cost of features before they go to production.
Cost GameDays: artificial peaks/feature flags → test budget resilience.
Runbooks: how to increase/decrease commits, how to urgently cut egress/logs, how to park environments.
11) Documents and templates
11.1 Budget template (fragment)
Revenue/MAU/Tenants
COGS: Compute/Storage/Network/Observability/3rd-party
RI/CUD/SP commits (coverage, term)
Incident reserve (3-5%)
Optimization plan (savings, owner, deadline)
11.2 What-if template
ΔRPS = +20% → ΔCompute + ΔEgress
Enable CDN-SWR → − X% egress, − Y $
Reduce log retention from 30 to 14 days → − Z $
CUD + 20k $/year → payback 7.5 months
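Two of the what-if lines above can be turned into small calculations. Both rest on simplifying assumptions that the template itself does not state: payback is treated as outlay divided by monthly savings, and log storage cost is treated as linear in retention days. The function names and the monthly savings figure are hypothetical.

```python
def payback_months(outlay: float, monthly_savings: float) -> float:
    """Simple payback period: months until cumulative savings equal the outlay."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return outlay / monthly_savings


def retention_saving(monthly_log_cost: float, old_days: int, new_days: int) -> float:
    """Monthly saving from shorter retention, assuming cost scales linearly with days kept."""
    if old_days <= 0 or not 0 <= new_days <= old_days:
        raise ValueError("need 0 <= new_days <= old_days and old_days > 0")
    return monthly_log_cost * (old_days - new_days) / old_days


# A 20k commitment pays back in about 7.5 months if it saves roughly 2,667 per month
# (an assumed figure chosen to match the template line).
months = payback_months(20_000, 32_000 / 12)

# Cutting retention from 30 to 14 days on an assumed 3,000/month log bill.
saving = retention_saving(3000, old_days=30, new_days=14)
```

Filling the X, Y and Z placeholders in the template with outputs like these keeps each what-if line traceable to its inputs.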
12) Risk Management and Compliance
Suppliers: SLAs/penalties, exit strategies, lock-in risks.
Legal: regions/retention periods, WORM for audit.
FX/currency: exchange rate sensitivity, multicurrency accounting.
Capitalization/amortization: accounting treatment of long-term commits and private connections.
13) Antipatterns
"Temporary" resources without TTL → forever.
Commits made before rightsizing/autoscaling.
No tags → gray costs.
DEBUG-level logging in production and 100% of traces retained.
Dev/stage running 24×7 without auto-pause.
Spot without on-demand buffer.
Public egress in each spoke without CDN/proxy.
14) Specifics of iGaming/Finance
PSP fees are part of COGS: smart routing to cheaper/more reliable providers, status caching, idempotent retries.
KYC/AML: request batching, cache with TTL set by policy, Cost/KYC metric.
"Money paths" (deposit/withdrawal): separate budget/SLO, provisioned capacity only here, real-time cost dashboards.
Data residency: regional accounts/projects, local CDN/edge, private channels to PSP.
GGR/margin: link Cost/Req to game verticals/providers; reports per brand/jurisdiction.
15) Quick savings recipes
Enable tail-sampling of traces and reduce log retention by class.
Enable SWR on the CDN and warm up the origin shield.
Move to pgBouncer/RDS Proxy to eliminate connection storms.
Lower requests/limits to p95 usage and enable Karpenter.
Move static/archive data to cold storage with lifecycle rules.
Route egress through PrivateLink/PSC and tighten FQDN allowlists.
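The "lower requests/limits to p95" recipe can be sketched as a nearest-rank percentile over observed usage samples. Where the samples come from (a metrics backend, a profiler) is outside this sketch, and `recommend_request` with its integer `headroom_pct` parameter is a hypothetical helper.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[rank - 1]


def recommend_request(samples_millicores, pct: int = 95, headroom_pct: int = 0) -> int:
    """Suggest a CPU request (millicores) from usage samples.

    Uses the chosen percentile plus optional headroom, rounded up.
    """
    base = percentile(samples_millicores, pct)
    return (base * (100 + headroom_pct) + 99) // 100


# Illustrative usage samples: 1..100 millicores.
observed = list(range(1, 101))
request = recommend_request(observed, pct=95, headroom_pct=20)
```

Integer arithmetic is used throughout to avoid rounding surprises when the percentile rank or headroom lands on a boundary.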
16) FinOps prod checklist
- Tags/owners/TTL on 100% of resources; policies block untagged ones.
- Budgets and alerts 50/80/100%; anomaly detection is enabled.
- Rightsizing completed; autoscaling/pause dev environments.
- Coverage RI/CUD/SP ≥ target (50-70% base); there is an on-demand buffer.
- CDN/edge + SWR; private channels to PaaS/PSP; egress-dashboard.
- Logs/traces: tail-sampling, retention by class; PII filtering.
- Storage policies: classes, TTL, archive; partitioning large tables.
- Cost/Req, Cost/Tenant/Brand/Region dashboards; egress heatmap; plan/actual/forecast.
- Processes: FinOps stand-up, cost change review, GameDays.
- For iGaming: "money paths" budgets, PSP/KYC/AML accounting, WORM audit.
17) TL;DR
Build transparency (tags, dashboards, plan/actual), turn on rightsizing and autoscaling, cover the base load with commits (RI/CUD/SP), reduce egress and storage costs through CDN/SWR, PrivateLink, storage classes and lifecycles, and pay only for valuable telemetry. Manage the budget through a rolling forecast, alerts and chargeback; for iGaming, keep a separate perimeter and budget for "money paths" with tight SLOs and PSP/KYC/AML accounting.