Cloud cost optimization

1) Why FinOps, and its goals

The goal is to reduce COGS while maintaining SLOs and development speed. Key questions:

How much does one request, one active user, or one tenant cost?
What is the marginal cost of a new feature or additional traffic?
Where are the "leaks" (egress, redundant logs, CPU/memory overprovisioning, idle resources)?

Basic metrics

Cost/Req, Cost/Active Minute, Cost/Tenant/Brand, Cost/GB-stored, Cost/GB-egress.
COGS%: cost of goods sold as a share of revenue.
Waste%: (paid but unused resources) / (all paid resources).
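
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming cloud spend is the only COGS component and that capacity is measured in a single unit such as vCPU-hours; the `PeriodCosts` type and its field names are hypothetical, not taken from any billing API.

```python
from dataclasses import dataclass


@dataclass
class PeriodCosts:
    """Inputs for one billing period (hypothetical shape)."""
    total_cost: float     # cloud spend for the period
    revenue: float        # revenue for the same period
    requests: int         # requests served
    paid_capacity: float  # capacity paid for, e.g. vCPU-hours
    used_capacity: float  # capacity actually used, same unit


def cost_per_request(p: PeriodCosts) -> float:
    return p.total_cost / p.requests if p.requests else 0.0


def cogs_percent(p: PeriodCosts) -> float:
    # Assumption: COGS is approximated by cloud spend alone.
    return 100.0 * p.total_cost / p.revenue if p.revenue else 0.0


def waste_percent(p: PeriodCosts) -> float:
    if p.paid_capacity <= 0:
        return 0.0
    unused = max(p.paid_capacity - p.used_capacity, 0.0)
    return 100.0 * unused / p.paid_capacity


period = PeriodCosts(total_cost=12_000.0, revenue=100_000.0,
                     requests=4_000_000, paid_capacity=1_000.0,
                     used_capacity=650.0)
print(cost_per_request(period), cogs_percent(period), waste_percent(period))
```

In practice the same functions would be evaluated per route, tenant, or brand by filtering the cost and usage data on tags first.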

2) Getting organized: tags, ownership, budgets

Tags/labels: 'env', 'team', 'service', 'tenant', 'product', 'cost_center', 'slo_tier'.
Ownership: each resource has an owner and a TTL.
Budgets/alerts: monthly/weekly budgets with thresholds at 50/80/100% plus anomaly detection.
Policies as code: block untagged resources, enforce size limits, default regions, and allocated quotas.

Terraform example, mandatory tags (idea):

hcl
module "policy" {
  source        = "finops/policy/required-tags"
  required_tags = ["env", "team", "service", "cost_center", "tenant"]
}
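
The same rule can also be checked outside Terraform, for example in a CI step or an inventory scan. The sketch below assumes resources are available as plain dictionaries of tags; the function name `missing_tags` is illustrative, and the required set simply mirrors the list in the module above.

```python
REQUIRED_TAGS = {"env", "team", "service", "cost_center", "tenant"}


def missing_tags(resource_tags: dict) -> list:
    """Return the required tags that are absent or empty, sorted by name."""
    present = {key for key, value in resource_tags.items() if str(value).strip()}
    return sorted(REQUIRED_TAGS - present)


# An untagged or partially tagged resource is reported, not silently accepted.
print(missing_tags({"env": "prod", "team": "payments", "service": "api"}))
```

A policy gate would fail the change whenever the returned list is non-empty.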

3) Architectural cost levers

3.1 Rightsizing and auto-scaling

Rightsizing: size instances to the actual p95 of CPU/RAM usage.
Auto-scaling: prefer horizontal over vertical; for K8s, Cluster Autoscaler/Karpenter; for serverless, min/max concurrency.
Cold paths go to queues/batches; long-running tasks go to scheduled workers.
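
Rightsizing by p95 can be illustrated as follows. This is a sketch under stated assumptions: usage samples are already collected in vCPU and GiB, the percentile uses the nearest-rank method, the 20% headroom is an arbitrary illustrative value, and the instance catalog and its prices are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pick_instance(cpu_samples, mem_samples, catalog, headroom=1.2):
    """Cheapest catalog entry that fits p95 usage plus headroom, or None."""
    need_cpu = percentile(cpu_samples, 95) * headroom
    need_mem = percentile(mem_samples, 95) * headroom
    fits = [i for i in catalog
            if i["vcpu"] >= need_cpu and i["mem_gib"] >= need_mem]
    return min(fits, key=lambda i: i["hourly_usd"]) if fits else None


# Hypothetical catalog; names and prices are placeholders.
CATALOG = [
    {"name": "small", "vcpu": 1, "mem_gib": 1, "hourly_usd": 0.01},
    {"name": "medium", "vcpu": 2, "mem_gib": 4, "hourly_usd": 0.04},
    {"name": "large", "vcpu": 4, "mem_gib": 16, "hourly_usd": 0.16},
]

cpu = [0.5] * 19 + [4.0]  # one short spike among 20 samples
mem = [1.0] * 20
print(pick_instance(cpu, mem, CATALOG)["name"])
```

Sizing to p95 rather than to the maximum means the single 4 vCPU spike does not force the "large" instance; short spikes are left to auto-scaling.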

3.2 Spot and committed capacity

Spot/Preemptible for stateless/background workloads and CI; keep an On-Demand buffer.
RI/CUD/Savings Plans: commit to the stable baseload (50-70%); keep the rest elastic.
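
Why only the baseload is worth committing can be shown with simple arithmetic. The sketch below assumes a flat 30% commitment discount and a normalized on-demand rate of 1.0 per unit-hour; both numbers are illustrative, not vendor prices, and committed capacity is paid for whether or not it is used.

```python
def total_cost(hourly_demand, committed, on_demand_rate=1.0, commit_discount=0.3):
    """Cost of serving the demand with a fixed amount of committed capacity."""
    committed_rate = on_demand_rate * (1 - commit_discount)
    cost = 0.0
    for demand in hourly_demand:
        overflow = max(demand - committed, 0)  # served at the on-demand rate
        cost += committed * committed_rate + overflow * on_demand_rate
    return cost


def coverage_percent(hourly_demand, committed):
    """Share of used capacity that was covered by the commitment."""
    used = sum(hourly_demand)
    covered = sum(min(d, committed) for d in hourly_demand)
    return 100.0 * covered / used if used else 0.0


demand = [10, 10, 20, 40]  # units per hour, with one peak
for committed in (0, 10, 40):
    print(committed, round(total_cost(demand, committed), 2),
          coverage_percent(demand, committed))
```

With this demand profile, committing to the baseload of 10 is cheaper than pure on-demand, while committing to the peak of 40 is the most expensive option because the commitment sits idle most of the time.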

3.3 Data storage and classes

Separate tiers: hot (SSD), warm (standard), cold/archive (Glacier/Archive).
Lifecycle policies: transition between classes, delete after the retention term.
Enable versioning only where needed, and object lock (WORM) for audit data only.

3.4 Network and egress

CDN/edge + stale-while-revalidate reduces inter-region egress.
Private links (PrivateLink/PSC/Direct Connect/Interconnect) instead of the "raw" Internet.
Compression (Brotli/Zstd) and HTTP/3 (QUIC): fewer bytes, round trips, and reconnections.

3.5 Databases and caches

Use a two-level scheme: cache (Redis/Memcached) + storage.
Read replicas for analytics; enable auto-vacuum/compaction; use pgBouncer/RDS Proxy.
For large tables: partitioning/TTL/archiving.

4) Kubernetes economics

Requests/limits by SLO class; prohibit 'limits: null'.
VPA (recommendations), Karpenter (instance selection for pods), bin packing (tolerations/affinity).
Separate prod/stage/dev at the cluster/node level (different instance types and policies).

Network and storage classes: choose SC/IOPS by load profile, not "premium everywhere."

QoS classes and priorities: save on background jobs.
Log profiles: sidecar agents with a local buffer, shipping in batches.
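
To get a feel for what bin packing buys, the node count for a set of pod CPU requests can be estimated with the classic first-fit-decreasing heuristic. This is only a rough estimate under simplifying assumptions: a single resource dimension (CPU in millicores), identical nodes, and no affinity or taint constraints. It is not how the Kubernetes scheduler or Karpenter actually place pods.

```python
def estimate_nodes(pod_cpu_millicores, node_cpu_millicores):
    """First-fit-decreasing estimate of how many identical nodes are needed."""
    free = []  # remaining CPU on each node opened so far
    for request in sorted(pod_cpu_millicores, reverse=True):
        if request > node_cpu_millicores:
            raise ValueError("pod does not fit on an empty node")
        for index, remaining in enumerate(free):
            if request <= remaining:
                free[index] = remaining - request
                break
        else:
            # No existing node has room: open a new one.
            free.append(node_cpu_millicores - request)
    return len(free)


requests = [2000, 500, 1500, 1000, 3000, 500, 2500, 1000]
print(estimate_nodes(requests, node_cpu_millicores=4000))
```

Comparing this estimate with the number of nodes actually running gives a quick signal of how much capacity inflated requests or scattered placement are costing.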

5) Serverless economics

Min instances/provisioned concurrency: for hot endpoints only.
Small deployment bundle, lazy init, connection reuse.
Timeouts, and queues for heavy tasks.
Aggregator functions (fan-in) instead of a dozen round trips to dependencies.
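
Lazy init and connection reuse usually come down to creating expensive clients once per warm instance rather than once per invocation. The sketch below is provider-agnostic: `ExpensiveClient` is a stand-in for a real database or HTTP client, and `handler` is a hypothetical function entry point.

```python
from typing import Optional


class ExpensiveClient:
    """Stand-in for a client that is costly to create (DB, HTTP pool, SDK)."""
    created = 0  # construction counter, for demonstration only

    def __init__(self) -> None:
        ExpensiveClient.created += 1

    def query(self, key: str) -> str:
        return f"value-for-{key}"


_client: Optional[ExpensiveClient] = None


def get_client() -> ExpensiveClient:
    """Create the client on first use and reuse it on later invocations."""
    global _client
    if _client is None:
        _client = ExpensiveClient()
    return _client


def handler(event: dict) -> str:
    # Nothing expensive happens at import time; cold starts stay short.
    return get_client().query(event["key"])


for key in ("a", "b", "c"):
    handler({"key": key})
print(ExpensiveClient.created)
```

The client lives at module scope, so it survives between invocations on the same warm instance, while code paths that never need it never pay for it.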

6) Observability: pay for valuable telemetry

Logs: structured, not verbose; retention by class (prod errors kept longer, debug kept briefly).
Trace sampling: tail-based; keep 100% of errors and p99 outliers, 1-10% of the rest.
Metrics: aggregation/downsampling, sparse reporting.
PII filtering before sending (fewer bytes and less risk).
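
The tail-based sampling rule above can be sketched as a decision made once a trace is complete. This is a simplified illustration, not the API of any tracing product: the function name, the 5% base rate, and the idea of passing a precomputed p99 latency are all assumptions. Hashing the trace ID keeps the decision deterministic, so every collector replica makes the same choice for the same trace.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               p99_ms: float, base_rate: float = 0.05) -> bool:
    """Keep all errors and slow traces; sample the rest at base_rate."""
    if has_error or duration_ms >= p99_ms:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < base_rate


print(keep_trace("trace-1", has_error=True, duration_ms=12.0, p99_ms=500.0))
print(keep_trace("trace-2", has_error=False, duration_ms=900.0, p99_ms=500.0))
```

The decision needs the whole trace (error status and total duration), which is why it runs at the tail, after spans are buffered, rather than at the first span.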

7) Vendors and marketplace

Compare prices across regions, the markup on managed services, and marketplace bundles.
Negotiations: volume discounts (RI/CUD), commits, credit programs.
Avoid duplicate SaaS tools with overlapping functionality.

8) Unit economics and dashboards

Key cost SLIs/SLOs

Cost/Req by route (login, catalog, deposit).
Cost/Tenant/Brand/Region.
Egress/Req, Storage/Req, Compute/Req.
Waste% and RI/SP coverage %.

Dashboards (minimum set)

"Cost map" by service/team with drill-down to the resource.
Egress "heat map" by direction.
"Service → cost → SLO": correlation of p99 and Cost/Req.
"RI/CUD/Spot" coverage and savings trend line.

9) FinOps processes

Weekly bill review with service owners.
Change review that estimates the cost of a feature before it is enabled in production.
Guardrails: quotas and limits, automatic shutdown of idle resources, TTL for test environments.
Cost GameDays: artificial spikes/feature flags to check that budgets hold.
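
A TTL guardrail for test environments can be as simple as a scheduled job that lists what should be torn down. The sketch below assumes an inventory of resources as dictionaries with `id`, `created_at`, and an optional `ttl_hours` field; these field names are hypothetical, and the actual deletion call is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def resources_to_reap(resources, now):
    """Return (id, reason) pairs for resources past their TTL or lacking one."""
    findings = []
    for resource in resources:
        ttl_hours = resource.get("ttl_hours")
        if ttl_hours is None:
            findings.append((resource["id"], "no-ttl"))
        elif resource["created_at"] + timedelta(hours=ttl_hours) <= now:
            findings.append((resource["id"], "expired"))
    return findings


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
inventory = [
    {"id": "env-a", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc), "ttl_hours": 72},
    {"id": "env-b", "created_at": datetime(2024, 1, 9, 12, tzinfo=timezone.utc), "ttl_hours": 72},
    {"id": "env-c", "created_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]
print(resources_to_reap(inventory, now))
```

Resources without a TTL are reported separately so the owner can be asked to set one, instead of being deleted outright.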

10) Antipatterns

"Temporary" resources without TTL → permanent.
'0.0.0.0/0' egress + no CDN → egress bills explode.
No tags/labels → costs cannot be allocated.
DEBUG logs in prod, 100% traces → meaningless terabytes.
Provisioned/serverful capacity "just in case," without usage metrics.
All workloads on On-Demand only, no RI/Spot/commits.

11) Specifics of iGaming/Finance

PSP/payment fees are part of COGS: optimize smart routing to cheap/reliable providers; cache statuses; avoid retries without idempotency.
KYC/AML vendors: batch requests, cache results (TTL per policy), measure Cost/KYC.
"Money paths" (deposit/withdrawal): separate SLOs and budgets; reserve capacity for peak events, keep warm instances only there.
Content/CDN: local edge and regional domains to reduce egress and comply with data residency.
Legal requirements: WORM storage for audit; limit its scope (aggregation, TTL, compression).
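
Caching KYC results with a TTL and measuring Cost/KYC can be sketched as a thin wrapper around the vendor call. Everything here is illustrative: `KycCache`, the `fetch` callable, and the per-call price are assumptions, and the TTL must come from compliance policy rather than from cost considerations alone.

```python
import time


class KycCache:
    """In-memory TTL cache around a paid vendor lookup (sketch only)."""

    def __init__(self, fetch, ttl_seconds, cost_per_call, clock=time.monotonic):
        self._fetch = fetch            # callable: user_id -> result
        self._ttl = ttl_seconds
        self._cost = cost_per_call
        self._clock = clock
        self._entries = {}             # user_id -> (stored_at, result)
        self.hits = 0
        self.vendor_calls = 0

    def check(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        result = self._fetch(user_id)  # the billable call
        self.vendor_calls += 1
        self._entries[user_id] = (now, result)
        return result

    def cost_per_check(self):
        """Average vendor spend per check, cache hits included."""
        total = self.hits + self.vendor_calls
        return self.vendor_calls * self._cost / total if total else 0.0


cache = KycCache(fetch=lambda user_id: {"user": user_id, "status": "ok"},
                 ttl_seconds=3600, cost_per_call=0.50)
cache.check("u1")
cache.check("u1")
print(cache.vendor_calls, cache.hits, cache.cost_per_check())
```

Tracking `cost_per_check` over time shows directly whether the TTL and batching choices are reducing vendor spend.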

12) Mini recipes

12.1 Log retention policy

Prod errors: 30-90 days; Info: 7-14 days; Debug: 24-72 hours.
Archive only when compliance requires it.

12.2 Canary telemetry

For a new feature: 100% of traces for the first 24 hours → then tail-sampling.

12.3 Object lifecycles

json
[
{"prefix": "raw/", "days_to_warm": 30, "days_to_cold": 90, "days_to_delete": 365},
{"prefix": "audit/", "lock": "WORM-365d"}
]
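
How such rules translate into a storage class for a given object can be sketched as follows. The rule fields are the ones from the JSON above, which is itself an illustrative schema rather than any provider's lifecycle format; the function name and the returned labels are assumptions.

```python
import json

RULES = json.loads("""
[
  {"prefix": "raw/", "days_to_warm": 30, "days_to_cold": 90, "days_to_delete": 365},
  {"prefix": "audit/", "lock": "WORM-365d"}
]
""")


def storage_action(key: str, age_days: int, rules=RULES) -> str:
    """Return the target state for an object under the first matching rule."""
    for rule in rules:
        if not key.startswith(rule["prefix"]):
            continue
        if "lock" in rule:
            return "locked"  # WORM: never moved or deleted by this policy
        if age_days >= rule["days_to_delete"]:
            return "delete"
        if age_days >= rule["days_to_cold"]:
            return "cold"
        if age_days >= rule["days_to_warm"]:
            return "warm"
        return "hot"
    return "unmanaged"  # no rule matches this prefix


for age in (10, 45, 100, 400):
    print(age, storage_action("raw/events.parquet", age))
```

Objects that fall through as "unmanaged" are worth reporting, since data without a lifecycle rule tends to stay in the most expensive class indefinitely.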

12.4 Budgets/alerts (idea)

Monthly budget per team; alerts at 50/80/100%; anomaly detection when spend exceeds the trend by more than 30% over 24 hours.
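
The two checks in this recipe can be sketched directly. The thresholds and the 30% deviation come from the text above; how the 24-hour trend value is computed (for example, a moving average of previous days) is left as an assumption, and the function names are illustrative.

```python
def crossed_thresholds(spend, budget, thresholds=(50, 80, 100)):
    """Return the budget thresholds (in %) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = 100.0 * spend / budget
    return [t for t in thresholds if percent_used >= t]


def is_anomaly(last_24h_spend, trend_24h_spend, max_deviation=0.30):
    """True when the last 24 hours exceed the trend by more than max_deviation."""
    if trend_24h_spend <= 0:
        return last_24h_spend > 0
    return (last_24h_spend - trend_24h_spend) / trend_24h_spend > max_deviation


print(crossed_thresholds(spend=8_500, budget=10_000))
print(is_anomaly(last_24h_spend=140, trend_24h_spend=100))
```

Budget thresholds catch slow overspend across the month, while the anomaly check catches a sudden jump long before any monthly threshold is reached.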

13) Prod Readiness Checklist

100% of resources have tags and owners; policies block untagged ones.
Budgets and alerts + anomaly detection; reports by tenant/brand/region.
RI/CUD/Spot cover the baseload; there is an On-Demand buffer.
K8s: requests/limits set; VPA/Karpenter; bin packing; separate storage/IOPS classes.
Serverless: provisioned/min instances for hot paths only; cold paths go through queues.
CDN/edge enabled; private links to PaaS; egress dashboard.
Logs/traces: tail-sampling, retention by class; PII filtering.
Storage lifecycles and archive; partitioning of large tables.
Financial dashboards: Cost/Req, Cost/Tenant, Waste%, RI/SP coverage %.
For iGaming: PSP/KYC/AML expense accounting, SLOs and budgets for money paths, WORM audit.

14) TL;DR

First, visibility (tags, budgets, dashboards); then structural levers: rightsizing, auto-scaling, RI/Spot/commits, CDN/edge and private links, storage classes and lifecycles. Pay only for valuable telemetry (tail-sampling, short retention) and automate guardrails. In iGaming, treat PSP/KYC/AML as part of COGS and give "money paths" separate SLOs and budgets.
