Cloud Architecture and SLAs
1) Why SLAs and how to manage them
SLA (Service Level Agreement) - an external promise to the business and partners about the availability, speed, and correctness of the service.
SLO (Service Level Objective) - internal target levels for teams.
SLI (Service Level Indicator) - measurable metrics against which SLOs are evaluated.
iGaming/fintech is characterized by rigid peak windows (tournaments, live betting, reporting periods, "payday" spikes), strong dependence on PSP/KYC providers, and geography. SLAs should account for this behavior, and the architecture should provide guarantees not just on averages but at percentiles.
2) Basic terminology
Availability - the percentage of successful requests per interval.
Latency - P50/P95/P99 for key operations.
Error - define precisely what counts as an error (5xx, timeout, business error?).
RTO (Recovery Time Objective) - how much time is allowed for recovery.
RPO (Recovery Point Objective) - how much data may be lost in a disaster.
Error Budget - 1 − SLO, the "reserve" for changes and incidents.
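To make the error budget concrete, an availability SLO translates directly into allowed downtime. A minimal Python sketch (illustrative only, not tied to any monitoring stack):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    Error budget = 1 - SLO, expressed here as wall-clock minutes.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

# A 99.95% SLO over 30 days leaves roughly 21.6 minutes of downtime.
print(error_budget_minutes(0.9995))
```

This is why an extra "nine" is so expensive: at 99.99% the same 30-day window leaves only about 4.3 minutes.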
3) Cloud architecture framework for SLAs
3.1 Multi-AZ (multiple availability zones)
Replicate state (DB, cache, queues) across at least 2-3 AZs.
Cold/warm standbys, automatic failover.
Local load balancers (L4/L7) with per-AZ health checks.
3.2 Multi-region
Active-active: low RTO/RPO, but harder consistency and higher cost.
Active-passive (hot/warm): cheaper, higher RTO, but simpler data control.
Geographic routing (GeoDNS/Anycast), "blast radius" isolation.
3.3 Storage and data
Transactional databases: synchronous replication within a region, asynchronous across regions.
Cache: cross-region replicas, "local reads + async warmup" mode.
Object storage: versioning, lifecycle policies, cross-region replication.
Queues/streaming: mirrored clusters / multi-region streams.
3.4 Contour isolation
Separate critical services (payments/wallet) from "heavy" analytical workloads.
Rate limits/quotas between contours so that reporting jobs cannot starve production.
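One common way to enforce such inter-contour quotas is a token bucket per contour. A minimal sketch (names and rates are hypothetical):

```python
import time

class TokenBucket:
    """Token-bucket quota so a batch/reporting contour cannot exhaust
    capacity reserved for the critical (payments/wallet) path."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request fits in the quota, else reject it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical split: the analytics contour gets 5 rps with a burst of 10.
reports_quota = TokenBucket(rate=5, capacity=10)
```

Requests rejected here should fail fast (or queue with backpressure) rather than compete with production traffic.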
4) High availability patterns
Backpressure - control the incoming flow; do not let queues grow "to the horizon."
Bulkhead & pool isolation - isolate connection and resource pools.
Circuit breaker + timeouts - protection against hangs in external integrations.
Idempotency - safe request retries without double charges.
Graceful degradation - under degradation, disable non-essential features (avatars, advanced filters).
Chaos/Failure Injection - planned "failures" to test reliability hypotheses.
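Several of these patterns are only a few lines of code. A minimal circuit-breaker sketch for an external integration such as a PSP call (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors the circuit opens and calls
    fail fast; after `reset_after` seconds one probe call is allowed
    (half-open) to test whether the dependency has recovered."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # success resets the counter
        return result
```

In production the `call` wrapper would also apply a timeout, since a hang is as dangerous as an error; this sketch omits that for brevity.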
5) DR (Disaster Recovery) Strategies
Choice by tier: payments/wallet - at minimum Hot Standby; content/catalog - Warm; reports - Backup & Restore with clear recovery windows.
6) SLI/SLO: how to measure correctly
6.1 SLIs by level
Client SLI: end-to-end (including the gateway and external providers).
Service SLI: "pure" service latency/errors.
Business SLI: CR (registration→deposit), T2W (time-to-wallet), PSP decline rate.
6.2 SLO examples
Core API availability: ≥ 99.95% over 30 days.
Payout latency: P95 ≤ 350 ms, P99 ≤ 700 ms.
PSP webhook delivery: ≥ 99.9% within 60 s (with retries).
Report data freshness: ≤ 10 min lag 95% of the time.
6.3 Error budget policy
50% of the budget for changes (releases/experiments), 50% for incidents.
Budget exhausted → feature freeze, stabilization work only.
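The freeze decision can be made mechanical. A sketch of the policy check (the 1.0 threshold means "burning faster than the budget sustains"; a real policy might trigger earlier):

```python
def budget_burn_rate(bad: int, total: int, slo: float) -> float:
    """Burn rate: observed error ratio divided by the budget ratio (1 - SLO).
    1.0 means the budget is consumed at exactly the sustainable pace."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)

def should_freeze(bad: int, total: int, slo: float) -> bool:
    """Feature freeze when the window burns budget faster than sustainable."""
    return budget_burn_rate(bad, total, slo) > 1.0
```

For example, with a 99.9% SLO, a window with 0.5% errors burns the budget at 5x and should trigger the freeze.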
7) Performance and scaling
HPA/VPA driven by SLO-oriented signals (not only CPU, but also queue depth/latency).
Predictive scaling based on schedules and historical peaks.
Warm pools / pre-warmed connections to the DB/PSPs before tournaments.
Caching and edge delivery - reduce RTT, especially for game catalogs and static assets.
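Connection pre-warming can be as simple as opening the pool eagerly before a known peak. A sketch, where `connect` stands in for a real DB/PSP client factory (hypothetical):

```python
import queue

class WarmPool:
    """Pool of pre-opened connections: pay the handshake cost before the
    peak (e.g. a tournament start) instead of on the first user request."""

    def __init__(self, connect, size: int):
        self._pool = queue.Queue()
        for _ in range(size):           # preheat eagerly, up front
            self._pool.put(connect())

    def acquire(self):
        return self._pool.get_nowait()  # raises queue.Empty if exhausted

    def release(self, conn) -> None:
        self._pool.put(conn)
```

A scheduler would construct such a pool shortly before the predicted peak window; real pools also need health checks and reconnection, omitted here.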
8) Network layer and global traffic
Anycast/GeoDNS to minimize latency and localize failures.
Failover policies: region health checks, thresholds, "stickiness" with TTL.
mTLS/WAF/rate limiting at the edge, protection against bot traffic.
Egress control to PSP/KYC via allow-lists and SLA-aware retries.
9) Data and consistency
Choose the consistency level: strong (payments) vs eventual (catalog/ratings).
CQRS to offload reads and separate critical command paths.
Outbox/Inbox patterns for "exactly once" event delivery.
Zero-downtime migrations: expand-migrate-contract, dual writes during major changes.
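The outbox pattern can be sketched with SQLite standing in for the transactional database (table and event names are hypothetical). Strictly, this gives at-least-once publishing; consumers deduplicate by event id, which yields "effectively exactly once":

```python
import sqlite3

# Transactional outbox: the business write and the event record commit
# atomically; a relay later publishes unsent events and marks them sent.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, "
           "payload TEXT, sent INTEGER DEFAULT 0)")

def credit(user_id: str, amount: int) -> None:
    with db:  # one transaction: state change + outbox event
        db.execute("INSERT INTO wallet VALUES (?, ?) "
                   "ON CONFLICT(user_id) DO UPDATE SET balance = balance + ?",
                   (user_id, amount, amount))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"credited:{user_id}:{amount}",))

def relay(publish) -> None:
    """Publish unsent events, then mark them sent (at-least-once)."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)  # consumers deduplicate by event id
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
```

If the process crashes between the business transaction and publishing, the event is still in the outbox and the relay delivers it on the next pass; nothing is lost and nothing is written twice.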
10) Observability under SLA
Traces through the gateway: correlate trace_id with partner/region/API version.
SLO dashboards with burn rate, per-region and per-provider "weather".
Alerts on symptoms, not proxies (not CPU, but P99/errors).
Synthetics: external checks from target countries (TR, BR, EU...).
Audit and reporting: export SLIs/SLOs to the partner portal.
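Symptom-based alerting on SLOs is usually implemented as a multiwindow burn-rate rule: page only when both a short window (the problem is still happening) and a long window (it is not a blip) burn fast. A sketch, with the commonly cited 14.4x threshold (which burns about 2% of a 30-day budget per hour):

```python
def window_burn(error_ratio: float, slo: float) -> float:
    """Burn rate of one observation window against the budget (1 - SLO)."""
    return error_ratio / (1.0 - slo)

def should_page(short_ratio: float, long_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multiwindow rule: page only when both the short window (e.g. 5 min)
    and the long window (e.g. 1 hour) burn the budget at >= threshold."""
    return (window_burn(short_ratio, slo) >= threshold
            and window_burn(long_ratio, slo) >= threshold)
```

Lower thresholds over longer windows (e.g. 6x over 6 hours) can route to a ticket instead of a page.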
11) Security and compliance
Network segmentation and secrets management (KMS/Vault).
Encryption in flight and at rest, PAN/PII tokenization.
Role-based access policies for admins/operators.
Immutable logs (WORM) and retention for audit.
Regulatory: in-region data storage, reporting, provable SLA compliance.
12) FinOps: SLA as a cost driver
Price SLO deviations: how much does +0.01% availability cost?
Profile peak windows; do not over-provision constant capacity.
Right-sizing and "spot where possible" for background tasks.
Quotas and budgets per contour; do not allow "free" degradation.
13) Reliability testing
GameDay/chaos sessions: AZ/PSP outages, queue delays, BGP breaks.
DR drills: regular practice of region switchover against RTO targets.
Load & soak: long runs with realistic betting/tournament profiles.
Incident replay: a library of known failures and replay scripts.
14) The process side of SLAs
SLO catalog: owner, formula, metrics, sources, alerts.
Changes via RFC/ADR: assess the impact on the error budget.
Post-mortems: improve architecture and runbooks, adjust SLOs.
Partner communications: mailings, status page, planned maintenance.
15) SLI/SLO/report examples
15.1 Formulas
SLI_availability = (successful_requests / all_requests) × 100%
SLI_latency_P99 = percentile_99(request_latency)
SLI_webhook_D+60 = share of webhooks delivered ≤ 60 s
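The three formulas above map directly to code. A sketch using the nearest-rank method for the percentile (other percentile definitions are equally valid):

```python
import math

def sli_availability(successful: int, total: int) -> float:
    """(successful_requests / all_requests) * 100%."""
    return 100.0 * successful / total if total else 100.0

def sli_latency_p99(latencies_ms: list) -> float:
    """Nearest-rank 99th percentile of request latency."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

def sli_webhook_d60(delivery_times_s: list) -> float:
    """Share of webhooks delivered within 60 seconds."""
    return sum(1 for t in delivery_times_s if t <= 60) / len(delivery_times_s)
```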
15.2 Example SLO set for a Core API
Availability (30 days): 99.95%
P95 for endpoint '/v2/payouts/create': ≤ 350 ms
5xx errors (rolling 1 hour): < 0.3%
Webhook delivery ≤ 60 s (P99): ≥ 99.9%
Wallet RPO: ≤ 60 s, RTO: ≤ 5 min
15.3 SLA report (summary)
Achieved: 99.97% (SLO: 99.95%) - met.
Violations: 2 episodes in the BR region due to PSP timeouts (8 minutes cumulative).
Measures: added smart routing by failure codes, increased the warm connection pool to PSP-B.
16) Implementation checklist
1. Critical user paths and corresponding SLIs are defined.
2. SLOs for 30/90 days + an error budget policy.
3. Multi-AZ setup and a DR plan with RTO/RPO targets, regular drills.
4. Synthetics from target geos, per-region/per-PSP dashboards.
5. Stability patterns: circuit breaker, backpressure, idempotency.
6. A degradation policy and feature flags for disabling features.
7. FinOps: per-contour budgets, peak forecasting, warm pools.
8. Security: segmentation, encryption, auditing.
9. SLA documentation for partners, communication process.
10. Retrospectives and SLO revisions every 1-2 quarters.
17) Anti-patterns
Promising SLAs without measurable SLIs and a transparent counting methodology.
Counting availability "at the service entrance," ignoring the gateway/providers.
Relying only on average latency, ignoring P99 tails.
DR "on paper," with no real drills.
Unlimited resources without quotas: one report takes down prod.
Mixing production and heavy analytics in one cluster/database.
18) The bottom line
Cloud architecture for SLAs is a combination of technical patterns (multi-AZ/multi-region, isolation, fault-tolerant data), processes (SLOs, error budgets, DR drills) and economics (FinOps). Plan for predictable failures: test fault tolerance, measure at percentiles, limit the "blast radius" and communicate openly. Then SLA promises become not marketing but a managed engineering practice.