SLO, SLA and reliability monitoring
(Section: Technology and Infrastructure)
Brief Summary
An SLO is an internal quality goal, an SLA is an external commitment to the customer, and an SLI is how quality is measured. In iGaming, the key SLIs are API and payment availability, p95/p99 latency of critical routes, Time-to-Wallet (TTW), payment conversion, game launch success, and queue metrics. Reliability management is built around an error budget, multi-window burn-rate alerts, clear release gates, and visual dashboards with release annotations.
1) Terms and differences
SLI (Service Level Indicator) - a measured indicator (e.g., the share of successful requests in a time window).
SLO (Service Level Objective) - a target SLI value (e.g., "availability 99.9% over 30 days").
SLA (Service Level Agreement) - a contract with liability and compensation; it is based on real SLOs but adds legal clauses and planned maintenance windows.
Rule: first stabilize SLIs/SLOs internally, and only then commit to an SLA externally.
2) SLI framework for iGaming
Technical SLIs
Availability: share of successful (2xx/3xx) requests over all requests.
Latency: p95/p99 on key routes (`/deposit`, `/bet`, `/game/init`).
Errors: share of 5xx responses and timeouts.
Saturation/queues: backlogs of delayed payouts and transactions.
Business SLIs
Payment conversion: `success/attempt`.
TTW p95: time from withdrawal request to funds being credited.
Game start success: share of successfully started game sessions and provider initializations.
KYC/AML flow success.
3) Error budget: how to count
Error Budget = 1 − SLO.
Example: availability SLO of 99.9%/30d ⇒ error budget = 0.1% of the time ≈ 43 min 12 s in a 30-day window.
```
success_ratio = success_requests / all_requests
error_ratio   = 1 - success_ratio
```
SLOs are computed over sliding windows (30/7/1 days) and are visible on dashboards.
Usage policy:
- Fast budget burn → freeze releases, stop canary rollouts, work on stability.
- Budget surplus → allow more frequent (controlled) changes.
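The budget arithmetic above can be sketched in plain Python (function names are illustrative):

```python
def error_budget_seconds(slo: float, window_days: int) -> float:
    """Total error budget for an availability SLO over a window, in seconds."""
    return (1 - slo) * window_days * 24 * 3600

def budget_remaining(slo: float, window_days: int, downtime_seconds: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1 - downtime_seconds / error_budget_seconds(slo, window_days)

# 99.9% over 30 days => 0.1% of 2,592,000 s ≈ 2,592 s ≈ 43 min 12 s
budget = error_budget_seconds(0.999, 30)
left = budget_remaining(0.999, 30, downtime_seconds=1296)  # half the budget spent
```

A dashboard that shows `budget_remaining` directly is what makes the freeze/allow policy above actionable for the team.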
4) SLO examples for key flows
Payments API:
- Availability ≥ 99.9%/30d
- Latency p95 `/deposit` ≤ 250 ms/30d
- Payment conversion ≥ baseline − 0.3%/24h
- TTW p95 (withdrawals) ≤ 3 min/24h
Games:
- Game init success ≥ 99.5%/7d; p95 game init ≤ 600 ms/7d
Backoffice jobs:
- Job success ≥ 99%/7d; lag < 5 min (peak windows tracked separately).
5) Measurement: formulas and PromQL (ideas)
Success ratio of requests:
```promql
  sum(rate(http_requests_total{status=~"2..|3..",service="payments-api"}[5m]))
/
  sum(rate(http_requests_total{service="payments-api"}[5m]))
```
p95 latency:
```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="payments-api",route="/deposit"}[5m])))
```
TTW p95 (event histogram):
```promql
histogram_quantile(0.95,
  sum by (le) (rate(ttw_seconds_bucket{flow="withdrawal"}[15m])))
```
Payment conversion:
```promql
sum(rate(payments_success_total[15m])) / sum(rate(payments_attempt_total[15m]))
```
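What `histogram_quantile` does under the hood can be sketched in Python: linear interpolation inside cumulative histogram buckets, the same estimation Prometheus applies (bucket bounds and counts below are illustrative):

```python
def quantile_from_buckets(q, buckets):
    """Estimate a quantile from a cumulative histogram.

    buckets: list of (upper_bound, cumulative_count), sorted by bound;
    the last bound may be float('inf'). Interpolates linearly within
    the bucket that contains the target rank, as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# /deposit latency buckets (seconds) with cumulative request counts
buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000), (float('inf'), 1000)]
p95 = quantile_from_buckets(0.95, buckets)  # rank 950 lands in the 0.25-0.5 s bucket
```

This is also why bucket boundaries matter: a p95 SLO of 250 ms needs a bucket edge at or near 0.25 s, or the interpolation error hides real regressions.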
6) Burn-rate alerts (multi-window)
The idea: compare the current rate of error-budget consumption against the allowed rate.
Example for SLO 99.9%:
- Fast burn: 14× budget consumption over 1 hour → page within 5-15 minutes.
- Slow burn: 6× budget consumption over 24 hours → ticket and root-cause analysis.
```yaml
# recording rule job:http:success_ratio is defined in advance
- alert: SLOFastBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 14
  for: 10m
  labels: { severity: "page" }

- alert: SLOSlowBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 6
  for: 1h
  labels: { severity: "ticket" }
```
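Outside Prometheus, the same multi-window logic reduces to a few lines of Python (thresholds follow the 14×/6× example above; names are illustrative):

```python
SLO = 0.999
BUDGET = 1 - SLO  # allowed error ratio per window

def burn_rate(error_ratio: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return error_ratio / BUDGET

def classify(error_ratio_1h: float, error_ratio_24h: float) -> str:
    """Multi-window burn-rate decision: page, ticket, or no action."""
    if burn_rate(error_ratio_1h) > 14:
        return "page"    # fast burn: page on-call within minutes
    if burn_rate(error_ratio_24h) > 6:
        return "ticket"  # slow burn: file a ticket, analyze the cause
    return "ok"

print(classify(0.02, 0.004))  # 2% errors in 1h -> burn rate 20x -> "page"
```

Pairing a short and a long window is what keeps this both fast on real outages and quiet on brief blips.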
7) Dashboards: "SLO cards" and the operating model
Top level (map):
- Service cards: Availability, p95, Error rate, Burn rate, remaining Error Budget.
- Filters: `env`, `region`, `tenant`, `version`.
- Release annotations: Git SHA, type (canary/blue-green), switch time.
- Stable vs canary comparison.
- Breakdown by PSP/game providers.
- Links to exemplars (trace_id) and related logs.
- Queue lag and saturation (USE metrics).
8) SLO processes: gates, freeze, escalations
Gates in CD: canary promotion is allowed only when the SLO proxies pass (availability, p95, conversion).
Freeze: on fast burn or a zero budget balance, stop releases until recovery.
Escalation: SEV matrix (SEV1 payments/deposits, SEV2 games, SEV3 back office).
RCA: blameless analysis; update tests, limits, and feature flags.
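The CD gate above can be sketched as a pure function over measured SLO proxies (thresholds mirror section 4; all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    availability: float      # success ratio over the evaluation window
    p95_deposit_ms: float    # p95 latency of /deposit, milliseconds
    conversion_delta: float  # canary conversion minus stable baseline

def allow_promotion(m: CanaryMetrics) -> bool:
    """Promote the canary only if every SLO proxy passes."""
    return (m.availability >= 0.999
            and m.p95_deposit_ms <= 250
            and m.conversion_delta >= -0.003)

healthy = CanaryMetrics(availability=0.9995, p95_deposit_ms=210, conversion_delta=-0.001)
degraded = CanaryMetrics(availability=0.9995, p95_deposit_ms=310, conversion_delta=-0.001)
```

Keeping the gate a pure function of metrics makes the promotion decision auditable and trivially unit-testable in the pipeline.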
9) Data/ML-SLO (for recommenders/LLM)
Latency: model response p95 ≤ 300 ms (or tokens/s ≥ N).
Quality proxies: share of valid responses, low toxicity, share of helpful answers.
Freshness: age of features/data ≤ X hours.
Cost per 1k events: spend stays within budget.
SLO gates are integrated into model releases (A/B/canary rollout).
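The freshness SLO, for example, reduces to a timestamp comparison (a minimal sketch; the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(feature_updated_at: datetime, max_age_hours: float = 6) -> bool:
    """Data/feature freshness SLO: age must not exceed max_age_hours."""
    age = datetime.now(timezone.utc) - feature_updated_at
    return age <= timedelta(hours=max_age_hours)

stale = datetime.now(timezone.utc) - timedelta(hours=8)
print(is_fresh(stale))  # False: 8h old exceeds the 6h freshness SLO
```

Wired into the model-release gate, a failed freshness check blocks the rollout the same way an availability breach does.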
10) SLA design based on SLO
Choose conservative SLOs as the basis for SLAs.
Define exceptions (planned maintenance, external provider dependencies, incident procedures).
Define remedies per violation tier (service credits/discounts), plus reporting and verification mechanisms.
11) Frequent errors and anti-patterns
No SLO, just "uptime 100%" - unrealistic, demotivating, and hides risk.
Alerts on "every metric" instead of burn rate - alert fatigue and ignored pages.
PII mixed into metrics/logs used for SLOs - compliance risk.
Cardinality explosions: `user_id`/`session_id` as labels.
No release annotations - degradations are hard to associate with changes.
Opaque error budget - the team does not know when it is safe to take risks.
SLOs not tied to the business - technical metrics are "green" while revenue is "red."
12) Implementation checklist
1. Approve the basic SLIs (Availability, p95/p99, Error-rate, TTW, Conversion).
2. Set the SLO on the 30/7/1 day windows and count the Error Budget.
3. Add recording rules and burn-rate alerts (fast/slow).
4. Build an SLO map with release annotations and canary/stable comparisons.
5. Add gates to CD: no SLO pass, no promotion.
6. Introduce freeze procedures and an SEV escalation matrix.
7. Link SLOs to business metrics (conv, TTW) and payment routes.
8. For Data/ML, define latency/quality/freshness-SLO.
9. Regular RCAs and SLO/threshold revisions (quarterly).
10. Document SLAs only after SLO has stabilized.
13) Examples of "ready" goals (as a start)
API (general): Availability 99.9%/30d; p95 ≤ 250 ms/30d; Error rate ≤ 0.3%/30d
Payments: Conversion ≥ baseline − 0.3%/24h; TTW p95 ≤ 3 min/24h
Games init: Success ≥ 99.5%/7d; p95 ≤ 600 ms/7d
Backoffice jobs: Success ≥ 99%/7d; lag ≤ 5 min/7d
LLM/Reco: tokens/s ≥ N; toxicity violations ≤ 0.5%/7d; freshness ≤ 6h.
Summary
The SLO/SLA approach turns reliability from "better than yesterday" into a measurable discipline: transparent SLIs, an understandable error budget, burn-rate alerts, clear dashboards, and quality gates built into releases. This framework gives an iGaming platform predictable p95/p99, stable payments and TTW, and therefore better revenue and fewer incidents during peak hours.