Cost architecture
1) Principles and roles
Cost as a feature. Price is part of the UX, the product, and architectural decisions.
Shared responsibility. Engineers, platform/DevEx, finance, and product form a single feedback loop.
A single source of truth. Tag/label catalog, cost dictionary, and data sources.
Watch → Optimize → Manage loop. Built-in dashboards, automatic gates and policies.
Roles: Cost Architect, FinOps Analyst, Product Owner, Platform Team.
2) Cost data model
Unit economics:
- For APIs: '$/1000 requests', '$/CPU-millisecond', '$/GB egress'.
- For data: '$/GB-month of storage', '$/database request', '$/million messages'.
- For users: 'CAC', 'ARPU/ARPPU', 'Gross Margin', 'LTV:CAC'.
- For flows: '$/transaction', '$/deposit', '$/test run'.
cost_record {
  ts, provider, account, region, service, usage_qty, usage_unit,
  list_price, net_price, discounts,
  tags: { env, team, product, feature, tenant, cost_center, pii, tier },
  resource_id, allocation_keys: { req_id?, tenant_id?, dataset? }
}
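A minimal sketch of how this record could be carried in code and rolled up into a unit metric. It assumes that `net_price` is the per-unit price after discounts and that request counts come from a separate source (for example, load balancer metrics). `CostRecord`, `record_cost`, and `cost_per_1000_requests` are illustrative names, not part of any billing API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class CostRecord:
    ts: str
    provider: str
    account: str
    region: str
    service: str
    usage_qty: float
    usage_unit: str
    list_price: float            # assumed: price per usage unit before discounts
    net_price: float             # assumed: price per usage unit after discounts
    discounts: float = 0.0
    tags: Dict[str, str] = field(default_factory=dict)
    resource_id: Optional[str] = None
    allocation_keys: Dict[str, str] = field(default_factory=dict)


def record_cost(rec: CostRecord) -> float:
    """Net cost of one record: quantity times the discounted unit price."""
    return rec.usage_qty * rec.net_price


def cost_per_1000_requests(records: Iterable[CostRecord], feature: str, requests: int) -> float:
    """Unit metric '$/1000 requests' for all records tagged with the given feature."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    total = sum(record_cost(r) for r in records if r.tags.get("feature") == feature)
    return total / requests * 1000
```

Filtering on the `feature` tag is what makes the tag catalog matter: a record without the tag silently drops out of the unit metric.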
Gold tags (required): 'env', 'team', 'product', 'feature', 'cost_center', 'owner', 'pii', 'tier' (hot/warm/cold), 'region'.
3) Attribution: showback/chargeback
Showback: transparent cost reports per team/feature, with no internal charges.
Chargeback: costs are allocated by rules: direct costs go to the owner; shared resources are split by allocation keys such as RPS, CPU-seconds, GB-hours, or event volume.
cluster_cost = sum(provider_cost where resource in "k8s-node:")
weights = { service: cpu_seconds(service)/total_cpu_seconds }
for service in services:
    charge[service] = direct_cost(service) + cluster_cost * weights[service]
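The allocation rule above can be sketched as runnable Python. This is an illustration under stated assumptions: shared cluster cost is split only by CPU-seconds, and the inputs arrive as plain dictionaries. The function name `allocate_chargeback` is hypothetical.

```python
from typing import Dict


def allocate_chargeback(
    direct_cost: Dict[str, float],
    cpu_seconds: Dict[str, float],
    cluster_cost: float,
) -> Dict[str, float]:
    """Charge each service its direct cost plus a CPU-weighted share of the shared cluster cost."""
    total_cpu = sum(cpu_seconds.values())
    if total_cpu <= 0:
        raise ValueError("total CPU seconds must be positive to compute weights")
    charge: Dict[str, float] = {}
    for service in sorted(set(direct_cost) | set(cpu_seconds)):
        weight = cpu_seconds.get(service, 0.0) / total_cpu
        charge[service] = direct_cost.get(service, 0.0) + cluster_cost * weight
    return charge
```

Because the weights sum to 1, the charges always add up to the total direct cost plus the cluster cost, so nothing is left unallocated.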
4) Policy as Code
Budget rules: limits per 'env/team/feature'; automatic alert or deploy block when a budget overrun is forecast.
Tag requirements: resources without mandatory tags are denied by the admission controller.
Profile limits: no large machines in 'dev', TTL on ephemeral resources, minimum reservations.
policy: require-tags-and-limits
deny_if_missing_tags: [team, product, env, cost_center, owner]
constraints:
  env==dev:
    max_instance_type: "c6i.large"
    ttl_hours: 72
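To make the policy concrete, here is a sketch of an evaluator that an admission hook might call. It is not the syntax of OPA, Kyverno, or any other real engine. The resource shape (a dict with `tags`, `instance_type`, `ttl_hours`) and the size ranking `SIZE_ORDER` are assumptions made for illustration; only the size suffix of the instance type is compared.

```python
from typing import Any, Dict, List

REQUIRED_TAGS = ["team", "product", "env", "cost_center", "owner"]
# Hypothetical ranking of instance size suffixes, smallest first.
SIZE_ORDER = ["medium", "large", "xlarge", "2xlarge", "4xlarge"]
DEV_MAX_SIZE = "large"      # mirrors max_instance_type: "c6i.large"
DEV_TTL_HOURS = 72          # mirrors ttl_hours: 72


def evaluate(resource: Dict[str, Any]) -> List[str]:
    """Return a list of policy violations; an empty list means the resource is admitted."""
    violations: List[str] = []
    tags = resource.get("tags", {})
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    if missing:
        violations.append("missing tags: " + ", ".join(missing))
    if tags.get("env") == "dev":
        instance_type = resource.get("instance_type")
        if instance_type:
            size = instance_type.rsplit(".", 1)[-1]
            # Unknown sizes are rejected rather than silently allowed.
            if size not in SIZE_ORDER or SIZE_ORDER.index(size) > SIZE_ORDER.index(DEV_MAX_SIZE):
                violations.append(f"instance type {instance_type} exceeds dev limit")
        ttl = resource.get("ttl_hours")
        if ttl is None or ttl > DEV_TTL_HOURS:
            violations.append(f"ttl_hours must be set and <= {DEV_TTL_HOURS}")
    return violations
```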
5) Computing: Cost Reduction Patterns
Rightsizing: automatically match vCPU/RAM to p95/p99 usage, seasonality, and headroom.
Auto-scaling: target-based (CPU/RPS/lag) or step functions, with hysteresis to prevent thrashing.
Pricing model selection: on-demand vs spot/preemptible, Reserved Instances/Savings Plans; a mix for critical and background workloads.
Batch pipelines: run in windows of "cheap" capacity, compress batches, use priority queues.
Caching and request coalescing: fewer reads from expensive sources.
Edge/network optimization: HTTP/2/3, keep-alive, compression, CDN.
if rps > target*1.2 for 3m: replicas += ceil(rps/target); cool_down 5m
if rps < target*0.6 for 10m: replicas = max(min_replicas, replicas-1)
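The scaling rule above can be sketched as a small stateful controller. Several things here are assumptions rather than statements from the rule itself: `target_rps` is read as the desired RPS per replica, scale-up jumps to the computed replica count (at least one more than the current count) instead of adding to it, and the 5-minute cooldown also delays scale-down. The class name `Autoscaler` and the `max_replicas` cap are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Autoscaler:
    target_rps: float            # assumed: desired RPS per replica
    replicas: int
    min_replicas: int = 1
    max_replicas: int = 50       # hypothetical safety cap
    up_factor: float = 1.2
    down_factor: float = 0.6
    up_hold_s: float = 180       # "for 3m"
    down_hold_s: float = 600     # "for 10m"
    cooldown_s: float = 300      # "cool_down 5m"
    _above_since: Optional[float] = None
    _below_since: Optional[float] = None
    _cooldown_until: float = 0.0

    def observe(self, now_s: float, total_rps: float) -> int:
        """Feed one load sample; return the replica count after applying the rule."""
        per_replica = total_rps / self.replicas
        if per_replica > self.target_rps * self.up_factor:
            self._below_since = None
            if self._above_since is None:
                self._above_since = now_s
            held = now_s - self._above_since
            if held >= self.up_hold_s and now_s >= self._cooldown_until:
                needed = math.ceil(total_rps / self.target_rps)
                self.replicas = min(self.max_replicas, max(self.replicas + 1, needed))
                self._cooldown_until = now_s + self.cooldown_s
                self._above_since = None
        elif per_replica < self.target_rps * self.down_factor:
            self._above_since = None
            if self._below_since is None:
                self._below_since = now_s
            held = now_s - self._below_since
            if held >= self.down_hold_s and now_s >= self._cooldown_until:
                self.replicas = max(self.min_replicas, self.replicas - 1)
                self._below_since = None  # require a full hold period before the next step down
        else:
            # Inside the hysteresis band: reset both timers, change nothing.
            self._above_since = None
            self._below_since = None
        return self.replicas
```

The gap between the 1.2 and 0.6 thresholds is the hysteresis band: load that hovers around the target never triggers either branch, which is what prevents thrashing.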
6) Storage and data: hot/warm/cold
Tiering: hot data (instant access), warm (infrequent requests), cold/archive.
Formats: columnar (Parquet/ORC) for analytics, with compression and partitioning by date/key.
TTL/ILM: set a lifecycle policy: 'hot 7d → warm 90d → cold 365d → delete'.
Cache layer: Redis/Memcached with request coalescing and protection against cache-miss storms.
Quotas and query budgets: predictable limits on expensive joins/scans.
dataset: events_main
lifecycle:
  - { phase: hot, duration: 7d, storage: nvme }
  - { phase: warm, duration: 90d, storage: ssd, compress: zstd }
  - { phase: cold, duration: 365d, storage: object, glacier: true }
  - { phase: purge, duration: 0d }
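A lifecycle like this needs a rule that maps a record's age to its current phase. The sketch below assumes the durations are consecutive, so warm ends at day 97 and cold at day 462. If the durations are meant as absolute ages instead, the boundaries would be the values themselves. `LIFECYCLE` and `phase_for_age` are illustrative names.

```python
from typing import List, Tuple

# (phase, duration in days), mirroring hot 7d -> warm 90d -> cold 365d -> purge
LIFECYCLE: List[Tuple[str, int]] = [("hot", 7), ("warm", 90), ("cold", 365)]


def phase_for_age(age_days: float, lifecycle: List[Tuple[str, int]] = LIFECYCLE) -> str:
    """Return the phase a record of the given age belongs to, or 'purge' once all phases have elapsed."""
    elapsed = 0
    for phase, duration in lifecycle:
        elapsed += duration
        if age_days < elapsed:
            return phase
    return "purge"
```

An ILM enforcer could run this over partition dates and move or delete any partition whose stored tier no longer matches the returned phase.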
7) Network and egress
Minimize inter-region traffic: local copies and edge aggregation.
CDN and caches: origin shield, reasonable TTLs, validation/invalidation.
Protocols: binary (gRPC) for chatty services, compression only where it pays off.
Deduplicate and filter events at the producer: "do not carry garbage."
8) Observability and cost of SRE
Telemetry cost cards: '$/log-GB', '$/metric-series', '$/trace'.
Sampling and aggregation: tail-based sampling, metric downsampling, retention by importance (SLO metrics get higher priority).
Log deduplication and log hygiene: no personal data in logs, fewer unused fields, limits on event size.
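Tail-based sampling means the keep/drop decision is made after a trace has finished, so it can depend on the outcome. A minimal sketch of such a decision follows. The thresholds (`slow_ms`, `base_rate`) and the function name `keep_trace` are assumptions for illustration; real collectors expose their own configuration for this.

```python
import hashlib


def keep_trace(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_ms: float = 500.0,     # hypothetical latency threshold
    base_rate: float = 0.05,    # hypothetical share of ordinary traces to keep
) -> bool:
    """Keep every failed or slow trace; keep a deterministic sample of the rest."""
    if had_error or duration_ms >= slow_ms:
        return True
    # Hash the trace id into [0, 1) so every collector makes the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < base_rate
```

With a 5% base rate, the '$/trace' line item shrinks roughly twentyfold for healthy traffic while the traces needed for debugging are still retained.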
9) CI/CD and test environments
Ephemeral environments with auto-TTL, one environment per PR.
Perf-smoke in PRs: short runs for an early estimate of the "cost per request."
Caches/artifacts: reuse of container layers and build outputs.
Gates: the build/deploy is rejected if the "price of latency"/RPS has regressed by more than X% relative to the baseline.
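The gate can be as simple as a percentage comparison against a stored baseline. The sketch below assumes the compared metric is cost per 1000 requests measured by the perf-smoke run; `cost_gate` and its parameters are hypothetical names.

```python
def cost_gate(baseline_cost_per_1k: float, candidate_cost_per_1k: float, max_regression_pct: float) -> bool:
    """Return True if the candidate build passes, i.e. its unit cost regressed by at most max_regression_pct."""
    if baseline_cost_per_1k <= 0:
        raise ValueError("baseline must be positive")
    change_pct = (candidate_cost_per_1k - baseline_cost_per_1k) / baseline_cost_per_1k * 100
    return change_pct <= max_regression_pct
```

In CI this would typically translate into a non-zero exit code when the function returns False, so the pipeline stops before deploy.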
10) Forecasting, budgets and anomalies
Forecasts: seasonality/trend, events (campaigns, releases), feature → value correlation.
Budgets by level: team/product/feature/tenant; escalation at 80/90/100%.
Anomalies: sudden spikes by service/region/account; automatic "bisect" and feature-flag rollback.
if forecast(month_end_cost) > budget*0.9 and variance ↑:
    alert(team_owner)
    suggest: rightsizing + RI/SP coverage + ILM tighten
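A runnable sketch of the alert condition above. It swaps in deliberately simple techniques, which are assumptions and not the forecasting method described in this section: the month-end forecast is a naive run-rate (average daily cost times days in the month, with no seasonality or trend), and "variance ↑" is read as the last 7 days having higher variance than the 7 days before. All function names are illustrative.

```python
import calendar
import statistics
from datetime import date
from typing import List


def forecast_month_end(daily_costs: List[float], today: date) -> float:
    """Naive run-rate forecast: average daily cost so far times the number of days in the month."""
    if not daily_costs:
        raise ValueError("need at least one day of cost data")
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return sum(daily_costs) / len(daily_costs) * days_in_month


def variance_rising(daily_costs: List[float], window: int = 7) -> bool:
    """True if the most recent window is more volatile than the window before it."""
    if len(daily_costs) < 2 * window:
        return False
    recent = daily_costs[-window:]
    previous = daily_costs[-2 * window:-window]
    return statistics.pvariance(recent) > statistics.pvariance(previous)


def should_alert(daily_costs: List[float], today: date, budget: float) -> bool:
    """Alert when the forecast exceeds 90% of the budget and spend is getting more volatile."""
    return forecast_month_end(daily_costs, today) > budget * 0.9 and variance_rising(daily_costs)
```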
11) Procurement and Commerce
RI/Savings Plans/Committed Use: cover the stable baseline; monitor coverage and the percentage of unused commitment.
Spot/Preemptible: background tasks and fault-tolerant workloads; checkpointing and quick restart.
Licenses and SaaS: ROI matrix, benchmarking of alternatives, periodic "vendor fitness review."
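Coverage and unused commitment are easy to confuse, so a small worked sketch may help. It assumes usage is available as instance-hours per hour and that the commitment is a flat number of instance-hours per hour; `commitment_metrics` is a hypothetical helper, not a provider API.

```python
from typing import List, Tuple


def commitment_metrics(usage_per_hour: List[float], committed_per_hour: float) -> Tuple[float, float]:
    """Return (coverage, unused): the share of usage covered by the commitment and the share of the commitment left unused."""
    total_usage = sum(usage_per_hour)
    total_committed = committed_per_hour * len(usage_per_hour)
    if total_usage <= 0 or total_committed <= 0:
        raise ValueError("usage and commitment must be positive")
    # In each hour the commitment can cover at most the actual usage.
    covered = sum(min(u, committed_per_hour) for u in usage_per_hour)
    coverage = covered / total_usage
    unused = 1 - covered / total_committed
    return coverage, unused
```

Low coverage with near-zero unused commitment suggests buying more; high unused commitment suggests the baseline was overestimated.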
12) Multi-tenancy and billing
Partitioning by tenant: logical/physical separation, limits and quotas.
Tenant-aware limiters/rate caps: prevent the "noisy neighbor" effect.
Usage models: billing by events, RPS, data volumes; transparent metrics for clients.
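A tenant-aware limiter can be sketched as one token bucket per tenant, so that one tenant exhausting its budget does not affect the others. This is an in-memory, single-process illustration; the class name `TenantRateLimiter` and its parameters are assumptions, and a production limiter would usually need shared state.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """Token bucket per tenant: each tenant refills at rate_per_s up to burst tokens."""

    def __init__(self, rate_per_s: float, burst: float, clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_refill_ts)

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the tenant still has budget for this request."""
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate_per_s)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

The `cost` argument lets expensive operations (a large scan, a bulk export) consume more than one token, which ties the limiter back to usage-based billing.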
13) Security and compliance as a cost factor
Cryptography and storage: FPE and key management incur KMS/HSM costs; optimize the frequency of operations.
Regulatory copies: separate "legal" retention from operational retention; an archive is cheaper than "eternal warm" storage.
Data minimization: less data means smaller bills and lower risk.
14) Engineering anti-patterns (expensive!)
Chatty APIs without batching or caching.
Unbounded queues and unbounded parallelism: latency and the bill both grow.
Zero TTLs and hot keys without coalescing.
"All-seeing" dashboards with millions of metric series.
Resources without tags → "gray" spending without an owner.
Lack of ILM/TTL → unbounded storage growth.
15) Tools and artifacts (vendor-neutral)
Tag catalog (schema + linter in CI).
Cost extractor (aggregates usage/billing data and normalizes it into a single format).
Unit-economics dashboards (API cost, dataset cost, tenant cost).
Auto-fixers (rightsizer, RI/SP recommendations, ILM enforcer).
Cost policies (admission/OPA/Kyverno) and budget red lines.
16) Mini recipes
"Request price" formula (HTTP)
request_cost = (cpu_ms * $/cpu_ms) +
               (mem_mb_s * $/mb_s) +
               (egress_mb * $/mb) +
               (db_calls * $/call) +
               (cache_ops * $/op * miss_penalty)
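The formula translates directly into code. In this sketch the rate names are hypothetical, and `miss_penalty` is treated as a multiplier on the cache term, which is one reading of the formula rather than something it states explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    per_cpu_ms: float       # $ per CPU-millisecond
    per_mb_s: float         # $ per MB-second of memory
    per_egress_mb: float    # $ per MB of egress
    per_db_call: float      # $ per database call
    per_cache_op: float     # $ per cache operation
    miss_penalty: float     # assumed: multiplier reflecting the extra cost of cache misses


def request_cost(cpu_ms: float, mem_mb_s: float, egress_mb: float,
                 db_calls: int, cache_ops: int, rates: Rates) -> float:
    """Cost of a single HTTP request as the sum of its resource components."""
    return (
        cpu_ms * rates.per_cpu_ms
        + mem_mb_s * rates.per_mb_s
        + egress_mb * rates.per_egress_mb
        + db_calls * rates.per_db_call
        + cache_ops * rates.per_cache_op * rates.miss_penalty
    )
```

Multiplying the result by 1000 gives the '$/1000 requests' unit metric used for ranking endpoints.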
Quick Service Audit
Top 3 expensive endpoints by $/1000 req.
Cache hit/miss ratio and storm (hot) keys.
Untagged resource lists.
ILM and dataset retention.
RI/SP coverage (%).
Economical retry policy
retry = min(3, floor(budget_ms / (base_timeout_ms * 1.5^attempt)))
jitter = uniform(0.5..1.5)
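The two lines above can be sketched in Python as follows. How the jitter is applied is not spelled out in the recipe, so multiplying the backoff delay by it is an assumption, as are the function names `retries_allowed` and `backoff_delay_ms`.

```python
import math
import random


def retries_allowed(budget_ms: float, base_timeout_ms: float, attempt: int,
                    max_retries: int = 3, backoff: float = 1.5) -> int:
    """How many retries still fit into the latency budget at the given attempt number."""
    timeout_ms = base_timeout_ms * backoff ** attempt
    return min(max_retries, math.floor(budget_ms / timeout_ms))


def backoff_delay_ms(base_timeout_ms: float, attempt: int,
                     backoff: float = 1.5, rng=random) -> float:
    """Exponential backoff delay scaled by a jitter factor drawn from uniform(0.5, 1.5)."""
    jitter = rng.uniform(0.5, 1.5)
    return base_timeout_ms * backoff ** attempt * jitter
```

Capping retries by the remaining budget is what makes the policy economical: once the next attempt no longer fits, the caller fails fast instead of paying for work whose result will arrive too late.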
17) Cost Architect checklist
1. Defined unit metrics ('$/req', '$/GB-month', '$/txn') and owners?
2. Tag policy enforced? Are untagged resources blocked?
3. Showback/chargeback and product/feature reports implemented?
4. Autoscale and rightsizing configured, headroom defined?
5. Data tiered (hot/warm/cold), ILM/TTL applied?
6. Egress and inter-region flows minimized? CDN/caches enabled?
7. Observability optimized (sampling, retention, downsampling)?
8. Are CI/CD regression gates and policy-checks active?
9. Are forecasts/budgets/anomaly analysis automated?
10. RI/SP/Spot mix covers base loads?
11. Are there quotas, limits, and transparent usage metrics for multi-tenant setups?
12. FinOps runbook and monthly cost-review plan documented?
Conclusion
Cost architecture is not "saving at all costs" but value management: knowing how much each millisecond costs and how much revenue it generates. By embedding cost in architecture, processes, and tools (tags, policies, gates, dashboards, ILM, autoscaling), you get a platform where decisions rest on metrics and economics rather than intuition. This speeds up product delivery, reduces risk, and makes the business more predictably profitable.