Cloud Architecture and SLAs

1) Why SLAs and how to manage them

SLA (Service Level Agreement) - an external promise to the business/partners about the availability, speed and correctness of the service.
SLO (Service Level Objective) - internal target levels for teams.
SLI (Service Level Indicator) - the measurable metrics against which an SLO is evaluated.

iGaming/fintech is characterized by sharp peak windows (tournaments, live betting, reporting periods, paydays) and a strong dependence on PSP/KYC providers and geography. SLAs should account for this behavior, and the architecture should provide guarantees not only for averages but also for percentiles.


2) Basic terminology

Availability - the percentage of successful requests per interval.
Latency - P50/P95/P99 for key operations.
Error - define precisely what counts as one (5xx, timeout, business error?).
RTO (Recovery Time Objective) - how much time is allowed for recovery.
RPO (Recovery Point Objective) - how much data may be lost in a disaster.
Error Budget - 1 − SLO, the "reserve" available for changes and incidents.
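
As a worked example of the error budget definition above, the sketch below converts an availability SLO into allowed minutes of unavailability. It assumes a time-based reading of availability (with the request-based definition above, the budget would be a count of failed requests instead); the function name `error_budget_minutes` is illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for a time-based availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    # Error budget = 1 - SLO, expressed here as a share of the window.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.99):
        print(f"SLO {slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

For a 30-day window, 99.95% leaves roughly 21.6 minutes, so a single slow failover can consume most of the budget.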


3) Cloud architecture framework for SLAs

3.1 Multi-zone (Multi-AZ)

Replicate state (DB, cache, queues) across at least 2-3 AZs.
Cold/warm standbys, automatic failover.
Local load balancers (L4/L7) with per-AZ health checks.

3.2 Multi-region

Active-active: low RTO/RPO, but consistency is harder and the cost is higher.
Active-passive (hot/warm): cheaper, higher RTO, but simpler data control.
Geographic routing (GeoDNS/Anycast), "blast radius" isolation.

3.3 Storage and data

Transactional databases: synchronous replication within a region, asynchronous replication between regions.
Cache: cross-region replicas, "local reads + async warmup" mode.
Object storage: versioning, lifecycle policies, cross-region replication.
Queues/streaming: mirrored clusters/multi-region streams.

3.4 Workload isolation

Separate critical services (payments/wallet) from "heavy" analytical tasks.
Rate limits/quotas between workloads so that reports do not starve production.
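
One common way to enforce the rate limits and quotas described above is a token bucket per workload. The sketch below is a minimal single-process version under stated assumptions: the class name, the workload names and the rates are hypothetical, and time is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Token bucket: refills at `rate_per_sec`, allows bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical quotas: reporting gets a small share, payments a large one.
quotas = {
    "payments": TokenBucket(rate_per_sec=1000, capacity=2000),
    "reports": TokenBucket(rate_per_sec=5, capacity=10),
}
```

A request from the reporting workload would call `quotas["reports"].allow(time.monotonic())` and be rejected or queued when the bucket is empty, leaving shared capacity for payments.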


4) High availability patterns

Backpressure - control the incoming flow; do not let queues grow without bound.
Bulkhead & Pool Isolation - isolate connection and resource pools.
Circuit Breaker + Timeouts - protection against hanging external integrations.
Idempotency - retried requests must not cause double debits.
Graceful Degradation - under degradation, disable non-essential features (avatars, advanced filters).
Chaos/Failure Injection - planned "failures" to test reliability hypotheses.
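
To make the circuit breaker item concrete, here is a minimal sketch of the pattern. It is an illustration rather than a production implementation: the class names, the default thresholds and the injected `clock` are assumptions, it is not thread-safe, and the wrapped call is expected to enforce its own timeout (for example via the HTTP client's timeout setting).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_sec:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling up on a stalled PSP, which also limits thread and connection pool exhaustion.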


5) DR (Disaster Recovery) Strategies

| Strategy | RTO | RPO | Cost | Complexity | Comment |
|---|---|---|---|---|---|
| Backup & Restore | hours | minutes to hours | low | low | For non-critical systems; not acceptable for the payment core |
| Warm Standby (region) | minutes | minutes | medium | medium | Keep minimal replicas and warm them up periodically |
| Hot Standby (region) | < 5-10 min | < 1-2 min | medium to high | medium | Fast failover, cross-region logs |
| Active-Active | seconds to minutes | ~0-1 min | high | high | Requires carefully designed consistency and conflict resolution |

Choice: payments/wallet - Hot Standby at minimum; content/catalog - Warm Standby; reports - Backup & Restore with clearly defined recovery windows.


6) SLI/SLO: how to measure correctly

6.1 SLI by level

Client SLI: end-to-end (including the gateway and external providers).
Service SLI: the service's own latency/errors.
Business SLI: CR (registration→deposit), T2W (time-to-wallet), PSP decline rate.

6.2 SLO examples

Core API availability: ≥ 99.95% over 30 days.
Payout latency: P95 ≤ 350 ms, P99 ≤ 700 ms.
PSP webhook delivery: ≥ 99.9% within 60 sec (including retries).
Report data freshness: lag ≤ 10 min for 95% of the time.

6.3 Error Budget Policy

50% of the budget is for changes (releases/experiments), 50% is for incidents.
Budget exhausted → feature freeze, stabilization work only.
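
The policy above can be expressed as a small check that runs against request counters. This is a sketch under assumptions: the budget is counted in failed requests, failures are already attributed to either changes or incidents, and the names `BudgetStatus` and `evaluate_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    budget: float                  # failed requests the SLO allows in the window
    remaining: float               # budget left after all recorded failures
    change_share_exhausted: bool   # releases/experiments used up their share
    feature_freeze: bool           # whole budget spent: stabilization only


def evaluate_budget(slo_percent: float, total_requests: int,
                    failed_by_changes: int, failed_by_incidents: int,
                    change_share: float = 0.5) -> BudgetStatus:
    # Error budget = (1 - SLO) of all requests in the window.
    budget = total_requests * (100.0 - slo_percent) / 100.0
    remaining = budget - failed_by_changes - failed_by_incidents
    return BudgetStatus(
        budget=budget,
        remaining=remaining,
        change_share_exhausted=failed_by_changes >= budget * change_share,
        feature_freeze=remaining <= 0,
    )
```

With a 99.95% SLO and 10 million requests in the window, the budget is about 5,000 failed requests, of which about 2,500 are available for releases and experiments under the 50/50 split.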


7) Performance and scaling

HPA/VPA with SLO-oriented signals (not only CPU, but also queue depth/latency).
Predictive scaling based on schedules and historical peaks.
Warm pools and pre-warmed connections to DB/PSP before tournaments.
Caching and edge delivery to reduce RTT, especially for game catalogs and static assets.
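
Scaling on SLO-oriented signals can be sketched with the ratio rule that the Kubernetes HPA documentation describes: desired replicas = ceil(current replicas × observed metric / target metric), taking the largest proposal when several metrics are configured. The function below is an illustration only; the signal names, targets and replica bounds are assumptions, and the real HPA also applies a tolerance band and stabilization windows that are omitted here.

```python
import math
from typing import Dict, Tuple


def desired_replicas(current: int, signals: Dict[str, Tuple[float, float]],
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """signals maps a metric name to (observed_value, target_value)."""
    proposals = []
    for name, (observed, target) in signals.items():
        if target <= 0:
            raise ValueError(f"target for {name!r} must be positive")
        proposals.append(math.ceil(current * observed / target))
    # The most demanding signal wins; with no signals, keep the current size.
    wanted = max(proposals) if proposals else current
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical snapshot: CPU looks fine, but the queue is three times over target.
snapshot = {
    "cpu_utilization": (0.4, 0.6),
    "queue_depth": (300, 100),
    "p95_latency_ms": (350, 350),
}
```

Here `desired_replicas(10, snapshot)` scales on the queue backlog, which a CPU-only policy would have missed.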


8) Network layer and global traffic

Anycast/GeoDNS to minimize latency and localize failures.
Failover policies: regional health checks, thresholds, "stickiness" with TTL.
mTLS/WAF/rate limiting at the edge, protection against bot traffic.
Egress control to PSP/KYC via allow-lists and SLA-aware retries.
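
One reading of SLA-aware retries toward PSP/KYC providers is that retries must fit inside the caller's own time budget. The sketch below retries with exponential backoff and jitter but gives up once the next attempt would exceed the deadline. The function name, the defaults and the set of retryable exceptions are assumptions; `clock`, `sleep` and `jitter` are injectable only to make the behavior testable.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_deadline(
    fn: Callable[[], T],
    deadline_sec: float,
    base_delay_sec: float = 0.1,
    max_delay_sec: float = 2.0,
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    start = clock()
    attempt = 0
    while True:
        try:
            return fn()
        except retryable:
            attempt += 1
            backoff = min(max_delay_sec, base_delay_sec * 2 ** (attempt - 1))
            delay = backoff * jitter()  # "full jitter": uniform in [0, backoff)
            if (clock() - start) + delay >= deadline_sec:
                raise  # another attempt would exceed the time budget
            sleep(delay)
```

Retries of payment operations are only safe when the operation is idempotent, for example when the request carries an idempotency key.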


9) Data and consistency

Choose the consistency level per domain: strict (payments) vs eventual (catalog/ratings).
CQRS to offload reads and keep critical command paths isolated.
Outbox/Inbox for effectively "exactly once" event processing (at-least-once delivery plus deduplication).
Zero-downtime migrations: expand-migrate-contract, dual writes during major changes.
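
The inbox half of the Outbox/Inbox pattern can be sketched as follows: the consumer records each event ID and applies its effect in the same database transaction, so a redelivered event is detected and skipped. The example uses an in-memory SQLite database purely for illustration; the table names, columns and function names are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE wallet (user_id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    db.commit()
    return db


def apply_deposit(db: sqlite3.Connection, event_id: str, user_id: str, amount: int) -> bool:
    """Apply a deposit event once. Returns False if the event was already processed."""
    with db:  # one transaction: commit on success, roll back on error
        try:
            # The primary key on event_id is what rejects duplicates.
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
        except sqlite3.IntegrityError:
            return False
        db.execute("INSERT OR IGNORE INTO wallet (user_id, balance) VALUES (?, 0)", (user_id,))
        db.execute("UPDATE wallet SET balance = balance + ? WHERE user_id = ?", (amount, user_id))
    return True


def balance(db: sqlite3.Connection, user_id: str) -> int:
    row = db.execute("SELECT balance FROM wallet WHERE user_id = ?", (user_id,)).fetchone()
    return row[0] if row else 0
```

Because the inbox insert and the balance update commit together, a crash between them cannot leave the event marked as processed without its effect, or the reverse.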


10) Observability for SLAs

Tracing through the gateway: correlate 'trace_id' with partner/region/API version.
SLO dashboards with burn rate, "weather" by region and provider.
Alert on symptoms, not on proxy signals (not CPU, but P99/errors).
Synthetics: external checks from target countries (TR, BR, EU...).
Audit and reporting: export SLI/SLO to the partner portal.
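
Burn rate, as shown on the SLO dashboards above, is the observed error ratio divided by the error budget (1 − SLO); a value of 1 means the budget would be spent exactly at the end of the window. The sketch below pairs a long and a short window so that an alert fires only while the problem is still ongoing. The threshold of 14.4 is an assumption taken from common SRE practice (at that rate, one hour consumes about 2% of a 30-day budget), not a value from this article.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0:
        raise ValueError("slo must be below 1.0")
    return error_ratio / budget_ratio


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    # Both windows must burn fast: the long one filters out brief blips,
    # the short one confirms the problem is still happening right now.
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)
```

For a 99.95% SLO, a sustained 1% error ratio is a burn rate of about 20 and would page, while 0.1% is a burn rate of about 2 and would not.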


11) Security and compliance

Network segmentation and secret management (KMS/Vault).
Encryption in transit and at rest, PAN/PII tokenization.
Role-based access policies for admins/operators.
Immutable (WORM) logs and retention for audit.
Regulatory: in-region data storage, reporting, provable SLA compliance.


12) FinOps: SLA as a cost driver

Put a price on SLO changes: how much does an extra +0.01% of availability cost?
Profile peak windows; do not inflate always-on capacity.
Right-sizing and "spot where you can" for background tasks.
Quotas and budgets per workload; do not allow "free" degradation.


13) Reliability testing

GameDay/Chaos sessions: switching off an AZ/PSP, queue delays, BGP breaks.
DR drills: regular region switchover exercises with RTO targets.
Load & Soak: long runs with real betting/tournament profiles.
Incident replay: a library of known failures and replay scripts.


14) The process side of SLAs

SLO catalog: owner, formula, metrics, sources, alerts.
Changes via RFC/ADR: assess the impact on the error budget.
Post-mortems: improve the architecture and runbooks, adjust SLOs.
Communication with partners: mailings, status page, planned maintenance.


15) SLI/SLO/Report Examples

15.1 Formulas

SLI_availability = (successful_requests / total_requests) × 100%
SLI_latency_P99 = percentile_99(request_latency)
SLI_webhook_D+60 = share of webhooks delivered within ≤ 60 sec
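
The three formulas can be computed from raw samples as in the sketch below. It is a minimal illustration: the function names are hypothetical, and the percentile uses the nearest-rank method, which is one of several valid definitions and may differ slightly from what a given monitoring system reports.

```python
import math
from typing import Sequence


def availability_percent(successful: int, total: int) -> float:
    """SLI_availability: share of successful requests, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=99 for SLI_latency_P99."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def delivered_within_percent(delays_sec: Sequence[float], limit_sec: float = 60.0) -> float:
    """SLI_webhook_D+60: share of webhooks delivered within the limit, in percent."""
    if not delays_sec:
        raise ValueError("delays_sec must not be empty")
    on_time = sum(1 for d in delays_sec if d <= limit_sec)
    return 100.0 * on_time / len(delays_sec)
```

Computing all three from the same request log keeps the SLA report and the internal dashboards consistent with each other.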

15.2 Core API SLO Set Example

Availability (30 days): 99.95%
P95 for endpoint '/v2/payouts/create': ≤ 350 ms
5xx errors (rolling 1 hour): < 0.3%
Webhook delivery ≤ 60 sec (P99): ≥ 99.9%
RPO for wallet: ≤ 60 sec, RTO ≤ 5 min

15.3 SLA report (excerpt)

Achieved: 99.97% (SLO 99.95%), target met.
Violations: 2 episodes in the BR region due to PSP timeouts (8 minutes in total).
Measures: added smart routing by failure codes, increased the warm pool of connections to PSP-B.


16) Implementation checklist

1. Critical user paths and corresponding SLIs are defined.
2. SLOs set for 30/90 days, plus an error budget policy.
3. Multi-AZ setup and a DR plan with RTO/RPO targets, regular drills.
4. Synthetic checks from target geographies, dashboards per region/per PSP.
5. Stability patterns: circuit breaker, backpressure, idempotency.
6. Degradation policy and feature flags for features that can be switched off.
7. FinOps: per-workload budgets, peak forecasts, warm pools.
8. Security: segmentation, encryption, auditing.
9. SLA documentation for partners, communication process.
10. Retrospectives and SLO revisions every 1-2 quarters.


17) Anti-patterns

Promising SLAs without measurable SLIs and a transparent calculation method.
Measuring availability only at the service entry point, ignoring the gateway/providers.
Relying only on average latency, ignoring P99 tails.
DR "on paper" with no real drills.
Unbounded resources without limits: one report brings down prod.
Mixing production and heavy analytics in one cluster/database.


18) The bottom line

Cloud architecture for SLAs is a combination of technical patterns (multi-AZ/multi-region, isolation, fault-tolerant data), processes (SLO, error budget, DR drills) and economics (FinOps). Allow yourself planned, predictable failures: test fault tolerance, measure by percentiles, limit the blast radius and communicate openly. SLA promises then stop being marketing and become a managed engineering practice.
