Cloud Architecture and SLAs
1) Why SLAs and how to manage them
SLA (Service Level Agreement) - an external promise to the business and partners about the availability, speed, and correctness of the service.
SLO (Service Level Objective) - internal target levels for teams.
SLI (Service Level Indicator) - measurable metrics against which SLOs are evaluated.
iGaming/fintech is characterized by rigid peak windows (tournaments, live betting, reporting periods, "payday" spikes), strong dependence on PSP/KYC providers, and geography. SLAs should account for this behavior, and the architecture should provide guarantees not just on averages but at percentiles.
2) Basic terminology
Availability - the percentage of successful requests per interval.
Latency - P50/P95/P99 for key operations.
Error - define precisely what counts as an error (5xx, timeout, business error?).
RTO (Recovery Time Objective) - how much time is allowed for recovery.
RPO (Recovery Point Objective) - how much data may be lost in a disaster.
Error Budget - 1 − SLO, the "reserve" for changes and incidents.
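To make the error budget concrete, an availability SLO translates directly into allowed downtime. A minimal Python sketch (illustrative only, not tied to any monitoring stack):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    Error budget = 1 - SLO, expressed here as wall-clock minutes.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

# A 99.95% SLO over 30 days leaves roughly 21.6 minutes of downtime.
print(error_budget_minutes(0.9995))
```

This is why an extra "nine" is so expensive: at 99.99% the same 30-day window leaves only about 4.3 minutes.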
3) Cloud architecture framework for SLAs
3.1 Multi-AZ (multiple availability zones)
Replicate state (DB, cache, queues) across at least 2-3 AZs.
Cold/warm standbys, automatic failover.
Local load balancers (L4/L7) with per-AZ health checks.
3.2 Multi-region
Active-active: low RTO/RPO, but harder consistency and higher cost.
Active-passive (hot/warm): cheaper, higher RTO, but simpler data control.
Geographic routing (GeoDNS/Anycast), "blast radius" isolation.
3.3 Storage and data
Transactional databases: synchronous replication within a region, asynchronous across regions.
Cache: cross-region replicas, "local reads + async warmup" mode.
Object storage: versioning, lifecycle policies, cross-region replication.
Queues/streaming: mirrored clusters / multi-region streams.
3.4 Contour isolation
Separate critical services (payments/wallet) from "heavy" analytical workloads.
Rate limits/quotas between contours so that reporting jobs cannot starve production.
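One common way to enforce such inter-contour quotas is a token bucket per contour. A minimal sketch (names and rates are hypothetical):

```python
import time

class TokenBucket:
    """Token-bucket quota so a batch/reporting contour cannot exhaust
    capacity reserved for the critical (payments/wallet) path."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request fits in the quota, else reject it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical split: the analytics contour gets 5 rps with a burst of 10.
reports_quota = TokenBucket(rate=5, capacity=10)
```

Requests rejected here should fail fast (or queue with backpressure) rather than compete with production traffic.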
4) High availability patterns
Backpressure - control the incoming flow; do not let queues grow "to the horizon."
Bulkhead & pool isolation - isolate connection and resource pools.
Circuit breaker + timeouts - protection against hangs in external integrations.
Idempotency - safe request retries without double charges.
Graceful degradation - under degradation, disable non-essential features (avatars, advanced filters).
Chaos/Failure Injection - planned "failures" to test reliability hypotheses.
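Several of these patterns are only a few lines of code. A minimal circuit-breaker sketch for an external integration such as a PSP call (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors the circuit opens and calls
    fail fast; after `reset_after` seconds one probe call is allowed
    (half-open) to test whether the dependency has recovered."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success resets the counter
        return result
```

In production the `call` wrapper would also apply a timeout, since a hang is as dangerous as an error; this sketch omits that for brevity.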
5) DR (Disaster Recovery) Strategies
Choice by tier: payments/wallet - at minimum Hot Standby; content/catalog - Warm; reports - Backup & Restore with clear recovery windows.
6) SLI/SLO: how to measure correctly
6.1 SLIs by level
Client SLI: end-to-end (including the gateway and external providers).
Service SLI: "pure" service latency/errors.
Business SLI: CR (registration→deposit), T2W (time-to-wallet), PSP decline rate.
6.2 SLO examples
Core API availability: ≥ 99.95% over 30 days.
Payout latency: P95 ≤ 350 ms, P99 ≤ 700 ms.
PSP webhook delivery: ≥ 99.9% within 60 s (with retries).
Report data freshness: ≤ 10 min lag 95% of the time.
6.3 Error budget policy
50% of the budget for changes (releases/experiments), 50% for incidents.
Budget exhausted → feature freeze, stabilization work only.
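The freeze decision can be made mechanical. A sketch of the policy check (the 1.0 threshold means "burning faster than the budget sustains"; a real policy might trigger earlier):

```python
def budget_burn_rate(bad: int, total: int, slo: float) -> float:
    """Burn rate: observed error ratio divided by the budget ratio (1 - SLO).
    1.0 means the budget is consumed at exactly the sustainable pace."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)

def should_freeze(bad: int, total: int, slo: float) -> bool:
    """Feature freeze when the window burns budget faster than sustainable."""
    return budget_burn_rate(bad, total, slo) > 1.0
```

For example, with a 99.9% SLO, a window with 0.5% errors burns the budget at 5x and should trigger the freeze.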
7) Performance and scaling
HPA/VPA driven by SLO-oriented signals (not only CPU, but also queue depth/latency).
Predictive scaling based on schedules and historical peaks.
Warm pools / pre-warmed connections to the DB/PSPs before tournaments.
Caching and edge delivery - reduce RTT, especially for game catalogs and static assets.
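Connection pre-warming can be as simple as opening the pool eagerly before a known peak. A sketch, where `connect` stands in for a real DB/PSP client factory (hypothetical):

```python
import queue

class WarmPool:
    """Pool of pre-opened connections: pay the handshake cost before the
    peak (e.g. a tournament start) instead of on the first user request."""

    def __init__(self, connect, size: int):
        self._pool = queue.Queue()
        for _ in range(size):           # preheat eagerly, up front
            self._pool.put(connect())

    def acquire(self):
        return self._pool.get_nowait()  # raises queue.Empty if exhausted

    def release(self, conn) -> None:
        self._pool.put(conn)
```

A scheduler would construct such a pool shortly before the predicted peak window; real pools also need health checks and reconnection, omitted here.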
8) Network layer and global traffic
Anycast/GeoDNS to minimize latency and localize failures.
Failover policies: region health checks, thresholds, "stickiness" with TTL.
mTLS/WAF/rate limiting at the edge, protection against bot traffic.
Egress control to PSP/KYC via allow-lists and SLA-aware retries.
9) Data and consistency
Choose the consistency level: strong (payments) vs eventual (catalog/ratings).
CQRS to offload reads and separate critical command paths.
Outbox/Inbox patterns for "exactly once" event delivery.
Zero-downtime migrations: expand-migrate-contract, dual writes during major changes.
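The outbox pattern can be sketched with SQLite standing in for the transactional database (table and event names are hypothetical). Strictly, this gives at-least-once publishing; consumers deduplicate by event id, which yields "effectively exactly once":

```python
import sqlite3

# Transactional outbox: the business write and the event record commit
# atomically; a relay later publishes unsent events and marks them sent.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
           "payload TEXT, sent INTEGER DEFAULT 0)")

def credit(user_id: str, amount: int) -> None:
    with db:  # one transaction: state change + outbox event
        db.execute("INSERT INTO wallet VALUES (?, ?) "
                   "ON CONFLICT(user_id) DO UPDATE SET balance = balance + ?",
                   (user_id, amount, amount))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"credited:{user_id}:{amount}",))

def relay(publish) -> None:
    """Publish unsent events, then mark them sent (at-least-once)."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)  # consumers deduplicate by event id
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
```

If the process crashes between the business transaction and publishing, the event is still in the outbox and the relay delivers it on the next pass; nothing is lost and nothing is written twice.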
10) Observability under SLA
Traces through the gateway: correlate trace_id with partner/region/API version.
SLO dashboards with burn rate, per-region and per-provider "weather".
Alerts on symptoms, not proxies (not CPU, but P99/errors).
Synthetics: external checks from target countries (TR, BR, EU...).
Audit and reporting: export SLIs/SLOs to the partner portal.
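Symptom-based alerting on SLOs is usually implemented as a multiwindow burn-rate rule: page only when both a short window (the problem is still happening) and a long window (it is not a blip) burn fast. A sketch, with the commonly cited 14.4x threshold (which burns about 2% of a 30-day budget per hour):

```python
def window_burn(error_ratio: float, slo: float) -> float:
    """Burn rate of one observation window against the budget (1 - SLO)."""
    return error_ratio / (1.0 - slo)

def should_page(short_ratio: float, long_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multiwindow rule: page only when both the short window (e.g. 5 min)
    and the long window (e.g. 1 hour) burn the budget at >= threshold."""
    return (window_burn(short_ratio, slo) >= threshold
            and window_burn(long_ratio, slo) >= threshold)
```

Lower thresholds over longer windows (e.g. 6x over 6 hours) can route to a ticket instead of a page.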
11) Security and compliance
Network segmentation and secrets management (KMS/Vault).
Encryption in flight and at rest, PAN/PII tokenization.
Role-based access policies for admins/operators.
Immutable logs (WORM) and retention for audit.
Regulatory: in-region data storage, reporting, provable SLA compliance.
12) FinOps: SLA as a cost driver
Price SLO deviations: how much does +0.01% availability cost?
Profile peak windows; do not over-provision constant capacity.
Right-sizing and "spot where possible" for background tasks.
Quotas and budgets per contour; do not allow "free" degradation.
13) Reliability testing
GameDay/chaos sessions: AZ/PSP outages, queue delays, BGP breaks.
DR drills: regular practice of region switchover against RTO targets.
Load & soak: long runs with realistic betting/tournament profiles.
Incident replay: a library of known failures and replay scripts.
14) The process side of SLAs
SLO catalog: owner, formula, metrics, sources, alerts.
Changes via RFC/ADR: assess the impact on the error budget.
Post-mortems: improve architecture and runbooks, adjust SLOs.
Partner communications: mailings, status page, planned maintenance.
15) SLI/SLO/report examples
15.1 Formulas
SLI_availability = (successful_requests / all_requests) × 100%
SLI_latency_P99 = percentile_99(request_latency)
SLI_webhook_D+60 = share of webhooks delivered ≤ 60 s
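The three formulas above map directly to code. A sketch using the nearest-rank method for the percentile (other percentile definitions are equally valid):

```python
import math

def sli_availability(successful: int, total: int) -> float:
    """(successful_requests / all_requests) * 100%."""
    return 100.0 * successful / total if total else 100.0

def sli_latency_p99(latencies_ms: list) -> float:
    """Nearest-rank 99th percentile of request latency."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

def sli_webhook_d60(delivery_times_s: list) -> float:
    """Share of webhooks delivered within 60 seconds."""
    return sum(1 for t in delivery_times_s if t <= 60) / len(delivery_times_s)
```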
15.2 Example SLO set for a Core API
Availability (30 days): 99.95%
P95 for endpoint '/v2/payouts/create': ≤ 350 ms
5xx errors (rolling 1 hour): < 0.3%
Webhook delivery ≤ 60 s (P99): ≥ 99.9%
Wallet RPO: ≤ 60 s, RTO: ≤ 5 min
15.3 SLA report (summary)
Achieved: 99.97% (SLO: 99.95%) - met.
Violations: 2 episodes in the BR region due to PSP timeouts (8 minutes cumulative).
Measures: added smart routing by failure codes, increased the warm connection pool to PSP-B.
16) Implementation checklist
1. Critical user paths and corresponding SLIs are defined.
2. SLOs for 30/90 days + an error budget policy.
3. Multi-AZ setup and a DR plan with RTO/RPO targets, regular drills.
4. Synthetics from target geos, per-region/per-PSP dashboards.
5. Stability patterns: circuit breaker, backpressure, idempotency.
6. A degradation policy and feature flags for disabling features.
7. FinOps: per-contour budgets, peak forecasting, warm pools.
8. Security: segmentation, encryption, auditing.
9. SLA documentation for partners, communication process.
10. Retrospectives and SLO revisions every 1-2 quarters.
17) Anti-patterns
Promising SLAs without measurable SLIs and a transparent counting methodology.
Counting availability "at the service entrance," ignoring the gateway/providers.
Relying only on average latency, ignoring P99 tails.
DR "on paper," with no real drills.
Unlimited resources without quotas: one report takes down prod.
Mixing production and heavy analytics in one cluster/database.
18) The bottom line
Cloud architecture for SLAs is a combination of technical patterns (multi-AZ/multi-region, isolation, fault-tolerant data), processes (SLOs, error budgets, DR drills) and economics (FinOps). Plan for predictable failures: test fault tolerance, measure at percentiles, limit the "blast radius" and communicate openly. Then SLA promises become not marketing but a managed engineering practice.