SLO, SLA and reliability monitoring
(Section: Technology and Infrastructure)
Brief Summary
An SLO is an internal quality goal, an SLA is an external commitment to the customer, and an SLI is how quality is measured. In iGaming, the key SLIs are API and payment availability, p95/p99 latency of critical routes, Time-to-Wallet (TTW), payment conversion, game launch success, and queue metrics. Reliability management is built around an error budget, multi-window burn-rate alerts, clear release gates, and visual dashboards with release annotations.
1) Terms and differences
SLI (Service Level Indicator) - a measured indicator (e.g., the share of successful requests in a time window).
SLO (Service Level Objective) - a target SLI value (e.g., "availability 99.9% over 30 days").
SLA (Service Level Agreement) - a contract with liability and compensation; it is based on real SLOs but adds legal clauses and planned maintenance windows.
Rule: first stabilize SLIs/SLOs internally, and only then commit to an SLA externally.
2) SLI framework for iGaming
Technical SLIs
Availability: share of successful (2xx/3xx) requests over all requests.
Latency: p95/p99 on key routes (`/deposit`, `/bet`, `/game/init`).
Errors: share of 5xx responses and timeouts.
Saturation/queues: backlogs of delayed payouts and transactions.
Business SLIs
Payment conversion: `success/attempt`.
TTW p95: time from withdrawal request to funds being credited.
Game start success: share of successfully started game sessions and provider initializations.
KYC/AML flow success.
3) Error budget: how to count
Error Budget = 1 − SLO.
Example: availability SLO of 99.9%/30d ⇒ error budget = 0.1% of the time ≈ 43 min 12 s in a 30-day window.
```
success_ratio = success_requests / all_requests
error_ratio   = 1 - success_ratio
```
SLOs are computed over sliding windows (30/7/1 days) and are visible on dashboards.
Usage policy:
- Fast budget burn → freeze releases, stop canary rollouts, work on stability.
- Budget surplus → allow more frequent (controlled) changes.
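The budget arithmetic above can be sketched in plain Python (function names are illustrative):

```python
def error_budget_seconds(slo: float, window_days: int) -> float:
    """Total error budget for an availability SLO over a window, in seconds."""
    return (1 - slo) * window_days * 24 * 3600

def budget_remaining(slo: float, window_days: int, downtime_seconds: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    return 1 - downtime_seconds / error_budget_seconds(slo, window_days)

# 99.9% over 30 days => 0.1% of 2,592,000 s ≈ 2,592 s ≈ 43 min 12 s
budget = error_budget_seconds(0.999, 30)
left = budget_remaining(0.999, 30, downtime_seconds=1296)  # half the budget spent
```

A dashboard that shows `budget_remaining` directly is what makes the freeze/allow policy above actionable for the team.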
4) SLO examples for key flows
Payments API:
- Availability ≥ 99.9%/30d
- Latency p95 `/deposit` ≤ 250 ms/30d
- Payment conversion ≥ baseline − 0.3%/24h
- TTW p95 (withdrawals) ≤ 3 min/24h
Games:
- Game init success ≥ 99.5%/7d; p95 game init ≤ 600 ms/7d
Backoffice jobs:
- Job success ≥ 99%/7d; lag < 5 min (peak windows tracked separately).
5) Measurement: formulas and PromQL (ideas)
Success ratio of requests:
```promql
  sum(rate(http_requests_total{status=~"2..|3..",service="payments-api"}[5m]))
/
  sum(rate(http_requests_total{service="payments-api"}[5m]))
```
p95 latency:
```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="payments-api",route="/deposit"}[5m])))
```
TTW p95 (event histogram):
```promql
histogram_quantile(0.95,
  sum by (le) (rate(ttw_seconds_bucket{flow="withdrawal"}[15m])))
```
Payment conversion:
```promql
sum(rate(payments_success_total[15m])) / sum(rate(payments_attempt_total[15m]))
```
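What `histogram_quantile` does under the hood can be sketched in Python: linear interpolation inside cumulative histogram buckets, the same estimation Prometheus applies (bucket bounds and counts below are illustrative):

```python
def quantile_from_buckets(q, buckets):
    """Estimate a quantile from a cumulative histogram.

    buckets: list of (upper_bound, cumulative_count), sorted by bound;
    the last bound may be float('inf'). Interpolates linearly within
    the bucket that contains the target rank, as Prometheus does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# /deposit latency buckets (seconds) with cumulative request counts
buckets = [(0.1, 600), (0.25, 900), (0.5, 980), (1.0, 1000), (float('inf'), 1000)]
p95 = quantile_from_buckets(0.95, buckets)  # rank 950 lands in the 0.25-0.5 s bucket
```

This is also why bucket boundaries matter: a p95 SLO of 250 ms needs a bucket edge at or near 0.25 s, or the interpolation error hides real regressions.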
6) Burn-rate alerts (multi-window)
The idea: compare the current rate of error-budget consumption against the allowed rate.
Example for SLO 99.9%:
- Fast burn: 14× budget consumption over 1 hour → page within 5-15 minutes.
- Slow burn: 6× budget consumption over 24 hours → ticket and root-cause analysis.
```yaml
# recording rule job:http:success_ratio is defined in advance
- alert: SLOFastBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 14
  for: 10m
  labels: { severity: "page" }

- alert: SLOSlowBurn
  expr: (1 - job:http:success_ratio{job="payments-api"}) > (1 - 0.999) * 6
  for: 1h
  labels: { severity: "ticket" }
```
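Outside Prometheus, the same multi-window logic reduces to a few lines of Python (thresholds follow the 14×/6× example above; names are illustrative):

```python
SLO = 0.999
BUDGET = 1 - SLO  # allowed error ratio per window

def burn_rate(error_ratio: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return error_ratio / BUDGET

def classify(error_ratio_1h: float, error_ratio_24h: float) -> str:
    """Multi-window burn-rate decision: page, ticket, or no action."""
    if burn_rate(error_ratio_1h) > 14:
        return "page"    # fast burn: page on-call within minutes
    if burn_rate(error_ratio_24h) > 6:
        return "ticket"  # slow burn: file a ticket, analyze the cause
    return "ok"

print(classify(0.02, 0.004))  # 2% errors in 1h -> burn rate 20x -> "page"
```

Pairing a short and a long window is what keeps this both fast on real outages and quiet on brief blips.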
7) Dashboards: "SLO cards" and the operating model
Top level (map):
- Service cards: Availability, p95, Error rate, Burn rate, remaining Error Budget.
- Filters: `env`, `region`, `tenant`, `version`.
- Release annotations: Git SHA, type (canary/blue-green), switch time.
- Stable vs canary comparison.
- Breakdown by PSP/game providers.
- Links to exemplars (trace_id) and related logs.
- Queue lag and saturation (USE metrics).
8) SLO processes: gates, freeze, escalations
Gates in CD: canary promotion is allowed only when the SLO proxies pass (availability, p95, conversion).
Freeze: on fast burn or a zero budget balance, stop releases until recovery.
Escalation: SEV matrix (SEV1 payments/deposits, SEV2 games, SEV3 back office).
RCA: blameless analysis; update tests, limits, and feature flags.
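The CD gate above can be sketched as a pure function over measured SLO proxies (thresholds mirror section 4; all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    availability: float      # success ratio over the evaluation window
    p95_deposit_ms: float    # p95 latency of /deposit, milliseconds
    conversion_delta: float  # canary conversion minus stable baseline

def allow_promotion(m: CanaryMetrics) -> bool:
    """Promote the canary only if every SLO proxy passes."""
    return (m.availability >= 0.999
            and m.p95_deposit_ms <= 250
            and m.conversion_delta >= -0.003)

healthy = CanaryMetrics(availability=0.9995, p95_deposit_ms=210, conversion_delta=-0.001)
degraded = CanaryMetrics(availability=0.9995, p95_deposit_ms=310, conversion_delta=-0.001)
```

Keeping the gate a pure function of metrics makes the promotion decision auditable and trivially unit-testable in the pipeline.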
9) Data/ML-SLO (for recommenders/LLM)
Latency: model response p95 ≤ 300 ms (or tokens/s ≥ N).
Quality proxies: share of valid responses, low toxicity, share of helpful answers.
Freshness: age of features/data ≤ X hours.
Cost per 1k events: spend stays within budget.
SLO gates are integrated into model releases (A/B/canary rollout).
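The freshness SLO, for example, reduces to a timestamp comparison (a minimal sketch; the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(feature_updated_at: datetime, max_age_hours: float = 6) -> bool:
    """Data/feature freshness SLO: age must not exceed max_age_hours."""
    age = datetime.now(timezone.utc) - feature_updated_at
    return age <= timedelta(hours=max_age_hours)

stale = datetime.now(timezone.utc) - timedelta(hours=8)
print(is_fresh(stale))  # False: 8h old exceeds the 6h freshness SLO
```

Wired into the model-release gate, a failed freshness check blocks the rollout the same way an availability breach does.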
10) SLA design based on SLO
Choose conservative SLOs as the basis for SLAs.
Define exceptions (planned maintenance, external provider dependencies, incident procedures).
Define remedies per violation tier (service credits/discounts), plus reporting and verification mechanisms.
11) Frequent errors and anti-patterns
No SLO, just "uptime 100%" - unrealistic, demotivating, and hides risk.
Alerts on "every metric" instead of burn rate - alert fatigue and ignored pages.
PII mixed into metrics/logs used for SLOs - compliance risk.
Cardinality explosions: `user_id`/`session_id` as labels.
No release annotations - degradations are hard to associate with changes.
Opaque error budget - the team does not know when it is safe to take risks.
SLOs not tied to the business - technical metrics are "green" while revenue is "red."
12) Implementation checklist
1. Approve the basic SLIs (Availability, p95/p99, Error-rate, TTW, Conversion).
2. Set the SLO on the 30/7/1 day windows and count the Error Budget.
3. Add recording rules and burn-rate alerts (fast/slow).
4. Build an SLO map with release annotations and canary/stable comparisons.
5. Add gates to CD: no SLO pass, no promotion.
6. Introduce freeze procedures and an SEV escalation matrix.
7. Link SLOs to business metrics (conv, TTW) and payment routes.
8. For Data/ML, define latency/quality/freshness-SLO.
9. Regular RCAs and SLO/threshold revisions (quarterly).
10. Document SLAs only after SLO has stabilized.
13) Examples of "ready" goals (as a start)
API (general): Availability 99.9%/30d; p95 ≤ 250 ms/30d; Error rate ≤ 0.3%/30d
Payments: Conversion ≥ baseline − 0.3%/24h; TTW p95 ≤ 3 min/24h
Games init: Success ≥ 99.5%/7d; p95 ≤ 600 ms/7d
Backoffice jobs: Success ≥ 99%/7d; lag ≤ 5 min/7d
LLM/Reco: tokens/s ≥ N; toxicity violations ≤ 0.5%/7d; freshness ≤ 6h.
Summary
The SLO/SLA approach turns reliability from "better than yesterday" into a measurable discipline: transparent SLIs, an understandable error budget, burn-rate alerts, clear dashboards, and quality gates built into releases. This framework gives an iGaming platform predictable p95/p99, stable payments and TTW, and therefore better revenue and fewer incidents during peak hours.