Cloud cost optimization
1) Why FinOps and what the goals are
The goal is to reduce COGS while maintaining SLOs and development speed. Key questions:
- How much does one request, one active user, or one tenant cost?
- What is the marginal cost of a new feature or additional traffic?
- Where are the "leaks" (egress, redundant logs, CPU/memory overprovisioning, idle resources)?
Baseline metrics
Cost/Req, Cost/Active Minute, Cost/Tenant/Brand, Cost/GB-stored, Cost/GB-egress.
COGS%: cost of goods sold as a share of revenue.
Waste%: (paid but unused resources) / (all resources).
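As a rough illustration of how these ratios fall out of billing data, here is a minimal sketch. The `MonthlyUsage` fields and the `unit_metrics` helper are hypothetical names, and the sketch assumes that cloud spend is the only COGS component and that waste is measured in money rather than in resource units.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    total_cost: float    # total cloud bill for the period (assumed to be all of COGS)
    unused_cost: float   # spend on paid-but-unused resources
    revenue: float       # revenue for the same period
    requests: int        # requests served
    tenants: int         # active tenants


def unit_metrics(u: MonthlyUsage) -> dict[str, float]:
    """Return the baseline cost ratios for one billing period."""
    return {
        "cost_per_req": u.total_cost / u.requests,
        "cost_per_tenant": u.total_cost / u.tenants,
        "cogs_pct": 100 * u.total_cost / u.revenue,
        "waste_pct": 100 * u.unused_cost / u.total_cost,
    }
```

Tracking these per service or per tenant, rather than only in total, is what makes the later dashboards actionable.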
2) Housekeeping: tags, ownership, budgets
Tags/labels: 'env', 'team', 'service', 'tenant', 'product', 'cost_center', 'slo_tier'.
Ownership: each resource has an owner and a TTL.
Budgets/alerts: monthly/weekly budgets with thresholds at 50/80/100%, plus anomaly detection.
Policy as code: block untagged resources, enforce size limits, default regions, and allocated quotas.
```hcl
# Policy module that rejects resources missing the required tags
module "policy" {
  source        = "finops/policy/required-tags"
  required_tags = ["env", "team", "service", "cost_center", "tenant"]
}
```
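The same rule can also be checked offline against an exported inventory. The sketch below is one possible way to do it: `REQUIRED_TAGS` mirrors the list in the policy above, while the inventory shape (resource id mapped to a tag dictionary) and the function names are assumptions made for illustration.

```python
REQUIRED_TAGS = {"env", "team", "service", "cost_center", "tenant"}


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return required tag keys that are absent or empty on a resource."""
    present = {k for k, v in resource_tags.items() if v.strip()}
    return REQUIRED_TAGS - present


def untagged_resources(inventory: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map resource id -> missing tags, for non-compliant resources only."""
    report = {}
    for resource_id, tags in inventory.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

A report like this is useful during rollout, before the blocking policy is switched on, to see how many existing resources would fail it.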
3) Architectural levers for savings
3.1 Rightsizing and auto-scaling
Rightsizing: size instances to the actual p95 of CPU/RAM usage.
Auto-scaling: prefer horizontal over vertical; for K8s, Cluster Autoscaler/Karpenter; for serverless, min/max concurrency.
Cold paths go to queues/batches; long-running tasks go to scheduled workers.
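To make the rightsizing rule concrete, the following sketch picks the cheapest instance that fits the observed p95 plus some headroom. The catalog tuples, the 20% headroom, and the nearest-rank percentile are illustrative assumptions, not a recommendation from any provider.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def pick_instance(cpu_samples, ram_samples, catalog, headroom=1.2):
    """Return the cheapest catalog entry that fits p95 usage plus headroom.

    catalog: list of (name, vcpu, ram_gib, hourly_price) tuples (hypothetical).
    """
    need_cpu = p95(cpu_samples) * headroom
    need_ram = p95(ram_samples) * headroom
    fitting = [c for c in catalog if c[1] >= need_cpu and c[2] >= need_ram]
    if not fitting:
        raise ValueError("no instance type fits the observed p95 usage")
    return min(fitting, key=lambda c: c[3])
```

Sizing to p95 rather than to the peak deliberately leaves the top 5% of samples to auto-scaling or burst capacity.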
3.2 Spot and committed capacity
Spot/Preemptible for stateless/background workloads and CI; keep an On-Demand buffer.
RI/CUD/Savings Plans: commit to the stable 50-70% baseload; keep the rest elastic.
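Why commit only to the baseload rather than to the peak? A small model helps. In the sketch below, committed units are paid for every hour whether used or not, and anything above the commitment is billed at the on-demand rate. The 40% discount and the unit rate are made-up numbers for illustration.

```python
def hourly_bill(demand, committed, on_demand_rate, commit_discount):
    """Cost of one hour given demand and committed units.

    Committed units are paid for whether used or not; demand above the
    commitment is billed at the on-demand rate.
    """
    committed_rate = on_demand_rate * (1 - commit_discount)
    overflow = max(0, demand - committed)
    return committed * committed_rate + overflow * on_demand_rate


def total_cost(hourly_demand, committed, on_demand_rate=1.0, commit_discount=0.4):
    """Total cost across a series of hourly demand values."""
    return sum(
        hourly_bill(d, committed, on_demand_rate, commit_discount)
        for d in hourly_demand
    )
```

With demand of 10 units for three hours and 20 units for one hour, committing to 10 units is cheaper in this model than committing to either 0 or 20, which is the intuition behind covering only the stable part of the load.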
3.3 Data storage and classes
Separate tiers: hot (SSD), warm (standard), cold/archive (Glacier/Archive).
Lifecycle policies: transition objects between classes and delete them after the retention period.
Enable versioning only where needed, and object lock (WORM) only for audit data.
3.4 Network and egress
CDN/edge with stale-while-revalidate reduces inter-region egress.
Private channels (PrivateLink/PSC/Direct Connect/Interconnect) instead of the public Internet.
Compression (Brotli/Zstd) and HTTP/3 (QUIC): fewer bytes on the wire, fewer round trips and reconnections.
3.5 Databases and caches
Use a two-tier scheme: cache (Redis/Memcached) + storage.
Use read replicas for analytics, enable autovacuum/compaction, and pool connections with PgBouncer/RDS Proxy.
For large tables: partitioning, TTL, archiving.
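The two-tier scheme is usually implemented as cache-aside: read from the cache, and fall back to storage only on a miss. The sketch below uses an in-process dictionary as a stand-in for Redis/Memcached; `TTLCache` and `get_or_load` are hypothetical names, and a real deployment would also need to handle invalidation and concurrent loads.

```python
import time


class TTLCache:
    """Tiny in-memory cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable so expiry can be tested
        self._data = {}          # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return a cached value, calling loader(key) only on miss or expiry."""
        now = self.clock()
        hit = self._data.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)      # the expensive storage read
        self._data[key] = (now + self.ttl, value)
        return value
```

Every cache hit is a database read that was not paid for, so the hit ratio is worth tracking next to Cost/Req.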
4) Kubernetes economics
Requests/limits by SLO class; prohibit 'limits: null'.
VPA (recommendations), Karpenter (instance selection for pods), bin packing (tolerations/affinity).
Separate prod/stage/dev at the cluster or node level (different instance types and policies).
Network and storage classes: choose SC/IOPS by load profile, not "premium everywhere."
QoS classes and priorities: save on background jobs.
Log profiles: sidecar agents with a local buffer, sending in batches.
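To show why bin packing matters for the node bill, here is a toy first-fit-decreasing packer over CPU requests (in millicores). It is only an illustration of the idea: it ignores memory, affinity, and tolerations, and it is not how the Kubernetes scheduler or Karpenter actually place pods.

```python
def first_fit_decreasing(pod_cpu_requests, node_cpu_capacity):
    """Pack pod CPU requests onto equal-sized nodes with a greedy heuristic.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []       # pods placed on each node
    free = []        # remaining capacity per node
    for req in sorted(pod_cpu_requests, reverse=True):
        if req > node_cpu_capacity:
            raise ValueError(f"pod request {req} exceeds node capacity")
        for i, spare in enumerate(free):
            if req <= spare:
                nodes[i].append(req)
                free[i] -= req
                break
        else:
            nodes.append([req])                      # open a new node
            free.append(node_cpu_capacity - req)
    return nodes
```

Inflated requests pack badly: if every pod asks for more CPU than it uses, the same workload opens more nodes, which is where much of the Waste% in a cluster tends to come from.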
5) Serverless economics
Min instances/provisioned concurrency: for hot endpoints only.
Small deployment bundle, lazy initialization, connection reuse.
Deadlines, and queues for heavy tasks.
Aggregator functions (fan-in) instead of a dozen round trips to dependencies.
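Lazy initialization and connection reuse can be sketched as follows. `FakeDbClient` is a hypothetical stand-in for a real database or HTTP client, and `handler` is an assumed entry-point name; the point is only that the client is built on first use and then shared by later invocations in the same warm instance.

```python
import functools


class FakeDbClient:
    """Stand-in for a real database/HTTP client (hypothetical)."""

    instances_created = 0

    def __init__(self):
        FakeDbClient.instances_created += 1

    def query(self, sql):
        return f"result of {sql}"


@functools.lru_cache(maxsize=None)
def get_client():
    """Create the client on first use, then reuse it across invocations."""
    return FakeDbClient()


def handler(event):
    """Function entry point; the client is not built at import time."""
    return get_client().query(event["sql"])
```

Deferring the client to first use keeps cold starts short for code paths that never touch the database, and reusing it avoids paying connection setup on every call.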
6) Observability: pay for valuable telemetry
Logs: structured, not verbose; retention by class (prod errors kept longer, debug kept short).
Trace sampling: tail-based; keep 100% of errors and p99 traces, 1-10% of the rest.
Metrics: aggregation/downsampling, sparse sending.
PII filtering before sending (fewer bytes and risks).
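The tail-based sampling rule above can be expressed as a decision made after a trace has finished. This is a minimal sketch: the trace dictionary shape, the `p99_latency_ms` threshold, and the 5% base rate are assumptions, and a production setup would normally rely on the collector's own sampling configuration.

```python
import random


def keep_trace(trace, p99_latency_ms, base_rate=0.05, rng=random.random):
    """Decide, after a trace has finished, whether to store it.

    trace: dict with 'error' (bool) and 'duration_ms' (float).
    """
    if trace["error"]:
        return True                       # keep 100% of errors
    if trace["duration_ms"] >= p99_latency_ms:
        return True                       # keep the slow tail
    return rng() < base_rate              # sample the rest
```

Because the decision is made at the tail, the interesting traces are never lost to sampling, while the bulk of ordinary traffic is stored at a fraction of the volume.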
7) Vendors and marketplace
Compare prices across regions, the margin on managed services, and marketplace bundles.
Negotiations: volume discounts (RI/CUD), commitments, credit programs.
Avoid duplicating SaaS with overlapping functionality.
8) Unit economics and dashboards
Key cost SLIs/SLOs
Cost/Req by routes (login, catalog, deposit).
Cost/Tenant/Brand/Region.
Egress/Req, Storage/Req, Compute/Req.
Waste% and RI/SP Coverage%.
Dashboards (minimum set)
"Cost map" by service/team with drill-down to the resource.
Egress "heat map" by direction.
"Service → cost → SLO": correlation of p99 and Cost/Req.
RI/CUD/Spot coverage and a savings trend line.
9) FinOps processes
Weekly bill review with service owners.
Change review: estimate the cost of a feature before enabling it in production.
Guardrails: quota limits, automatic shutdown of idle resources, TTL for test environments.
Cost GameDays: artificial traffic spikes/feature flags to check that budgets hold.
10) Antipatterns
"Temporary" resources without TTL → forever.
'0.0.0.0/0' egress + no CDN → egress bills explode.
No tags/labels → impossible to allocate costs.
DEBUG logs in prod, 100% traces → meaningless terabytes.
Provisioned/serverful capacity "just in case," without usage metrics.
All workloads On-Demand only, no RI/Spot/commitments.
11) Specifics of iGaming/Finance
PSP/payment fees are part of COGS: optimize smart routing toward cheap/reliable providers; cache statuses; avoid retries without idempotency.
KYC/AML vendors: batch requests, cache results (TTL per policy), measure Cost/KYC.
"Money paths" (deposit/withdrawal): separate SLOs and budgets; reserves for peak events; warm instances only there.
Content/CDN: local edge and regional domains to reduce egress and comply with data residency.
Legal requirements: WORM storage for audit; limit its scope (aggregation, TTL, compression).
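One way to think about "cheap/reliable" PSP routing is expected fee per successful payment. The sketch below is purely illustrative: the provider fields, the fee model (a fixed plus percentage fee charged per attempt), and the minimum success rate are all assumptions, and real PSP contracts differ.

```python
def effective_cost(provider, amount):
    """Expected fee per successful payment if failed attempts are retried.

    Assumes fees are charged per attempt, which is not true for every PSP.
    """
    fee_per_attempt = provider["fee_fixed"] + amount * provider["fee_pct"]
    return fee_per_attempt / provider["success_rate"]


def choose_psp(providers, amount, min_success_rate=0.9):
    """Pick the cheapest provider among those that are reliable enough."""
    eligible = [p for p in providers if p["success_rate"] >= min_success_rate]
    if not eligible:
        raise ValueError("no provider meets the minimum success rate")
    return min(eligible, key=lambda p: effective_cost(p, amount))["name"]
```

The reliability floor matters because the cheapest provider on paper can still cost more in failed deposits and lost players than it saves in fees.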
12) Mini recipes
12.1 Log retention policy
Prod errors: 30-90 days; Info: 7-14 days; Debug: 24-72 hours.
Archive only when compliance requires it.
12.2 Canary telemetry
For a new feature: 100% of traces for the first 24 hours → then tail-sampling.
12.3 Object lifecycles
```json
[
  {"prefix": "raw/", "days_to_warm": 30, "days_to_cold": 90, "days_to_delete": 365},
  {"prefix": "audit/", "lock": "WORM-365d"}
]
```
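The policy above is a schematic format rather than any provider's real lifecycle syntax. Under that assumption, the sketch below shows how such rules could be evaluated for a single object; `storage_action` and the returned labels are hypothetical.

```python
import json

POLICY_JSON = """
[
  {"prefix": "raw/", "days_to_warm": 30, "days_to_cold": 90, "days_to_delete": 365},
  {"prefix": "audit/", "lock": "WORM-365d"}
]
"""


def storage_action(key, age_days, rules):
    """Return 'hot', 'warm', 'cold', 'delete' or 'locked' for an object."""
    for rule in rules:
        if not key.startswith(rule["prefix"]):
            continue
        if "lock" in rule:
            return "locked"               # WORM: not transitioned or deleted here
        if age_days >= rule["days_to_delete"]:
            return "delete"
        if age_days >= rule["days_to_cold"]:
            return "cold"
        if age_days >= rule["days_to_warm"]:
            return "warm"
        return "hot"
    return "hot"                          # no rule matched: leave as-is
```

For example, with `rules = json.loads(POLICY_JSON)`, an object under `raw/` that is 120 days old maps to the cold tier, while anything under `audit/` stays locked regardless of age.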
12.4 Budgets/alerts (idea)
Monthly budget per team; alerts at 50/80/100%; anomaly detection: spend more than 30% above the trend over 24 hours.
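The budget idea can be sketched in a few lines. The function names are hypothetical, and the anomaly check is deliberately naive: it compares the last 24 hours of spend against a trend value that is assumed to be computed elsewhere.

```python
def crossed_thresholds(spend, budget, thresholds=(50, 80, 100)):
    """Return the alert thresholds (in %) that current spend has reached."""
    used_pct = 100 * spend / budget
    return [t for t in thresholds if used_pct >= t]


def is_anomaly(last_24h_spend, trend_24h_spend, max_deviation=0.30):
    """True if the last 24h of spend exceeds the trend by more than 30%."""
    return last_24h_spend > trend_24h_spend * (1 + max_deviation)
```

The two checks complement each other: thresholds catch slow drift toward the monthly limit, while the anomaly check catches a sudden spike long before any threshold is reached.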
13) Prod Readiness Checklist
- 100% of resources have tags and owners; policies block untagged ones.
- Budgets and alerts + anomaly detection; reports by tenant/brand/region.
- RI/CUD/Spot cover baseload; there is an On-Demand buffer.
- K8s: requests/limits set; VPA/Karpenter; bin packing; separate Storage/IOPS classes.
- Serverless: provisioned/min for hot paths only; cold paths go through queues.
- CDN/edge enabled; private channels to PaaS; egress-dashboard.
- Logs/traces: tail-sampling, retention by class; PII filtering.
- Storage lifecycles and archive; partitioning large tables.
- Financial dashboards Cost/Req, Cost/Tenant, Waste%, Coverage RI/SP%.
- For iGaming: PSP/KYC/AML expense accounting, SLO and money path budgets, WORM audit.
14) TL;DR
First, visibility (tags, budgets, dashboards), then structural levers: rightsizing, auto-scaling, RI/Spot/commitments, CDN/edge and private channels, storage classes and lifecycles. Pay for valuable telemetry (tail-sampling, short retention) and automate guardrails. In iGaming, treat PSP/KYC/AML as part of COGS and single out "money paths" with separate SLOs and budgets.