GambleHub

Error Budgets and SLO Management

1) Why SLO and error budget

An SLO (Service Level Objective) is the target level of quality as the user perceives it; an SLI (Service Level Indicator) is the metric that measures it; the error budget is the tolerated amount of deviation per window (usually 30 or 90 days).
The error budget turns reliability from an "abstraction" into a managed resource: when the budget burns quickly, we freeze features and fix things; when the budget is healthy, releases can speed up.

2) Choosing SLIs: what counts as "good"

Criterion: "successful from the user's point of view."

2.1 Classic SLIs

Availability: the percentage of successful requests (excluding those canceled by the client).
`success = http_status ∈ {2xx, 3xx, 404} and no timeout` (a 404 can count as a success for a read API when it is the expected semantics).
Latency: the share of requests that complete faster than a threshold (for example, 300 ms; requiring 95% of requests under it is the same as p95 ≤ 300 ms).
`good_latency = duration_ms ≤ 300`.
Freshness/Staleness: "data no older than X minutes" (caches, reference data, odds).
Quality: correctness of the result (passes business validators/backend invariants).
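The availability and latency definitions above can be sketched as two predicates. This is a minimal illustration, not a reference implementation: the `Request` record, its field names, and the `expected_404` switch are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; the field names are assumptions.
    status: int
    duration_ms: float
    timed_out: bool = False
    client_cancelled: bool = False


def is_available(req: Request, expected_404: bool = False) -> bool:
    """success = status in {2xx, 3xx, (404 if expected)} and no timeout."""
    if req.timed_out:
        return False
    if 200 <= req.status < 400:
        return True
    return expected_404 and req.status == 404


def is_fast(req: Request, threshold_ms: float = 300) -> bool:
    """good_latency = duration_ms <= threshold."""
    return req.duration_ms <= threshold_ms


def availability(requests, expected_404: bool = False):
    """Share of good requests; client-cancelled requests are excluded."""
    counted = [r for r in requests if not r.client_cancelled]
    if not counted:
        return None
    return sum(is_available(r, expected_404) for r in counted) / len(counted)


sample = [
    Request(200, 120),
    Request(404, 80),
    Request(500, 50),
    Request(200, 900, timed_out=True),
    Request(200, 10, client_cancelled=True),
]
read_api_availability = availability(sample, expected_404=True)
strict_availability = availability(sample, expected_404=False)
```

Whether 404 counts as "good" is decided per route class, which is why it is a parameter here rather than a constant.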

2.2 Boundaries and segments

SLIs should be computed per important slice: `route`, `tenant/brand`, `region/jurisdiction`, `payment_provider`. That way the failure of one critical endpoint is not "smeared" across the whole system.
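A small sketch of why segmentation matters, using made-up traffic: a global ratio can look acceptable while one route is half broken. The event shape (a dict with a `route` label and a boolean `good`) and the `/v1/catalog` route are assumptions for illustration.

```python
from collections import defaultdict


def sli_by_segment(events, key):
    """Return {segment value: good / total} for the given label key."""
    good = defaultdict(int)
    total = defaultdict(int)
    for event in events:
        total[event[key]] += 1
        good[event[key]] += int(event["good"])
    return {segment: good[segment] / total[segment] for segment in total}


# Synthetic traffic: the catalog route dominates, payments fails half the time.
events = (
    [{"route": "/v1/catalog", "good": True}] * 9900
    + [{"route": "/v1/payments", "good": True}] * 50
    + [{"route": "/v1/payments", "good": False}] * 50
)

global_sli = sum(e["good"] for e in events) / len(events)
per_route = sli_by_segment(events, "route")
```

Here the global figure is 99.5% while `/v1/payments` sits at 50%, which is the "smearing" effect described above.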

3) Formulas and budget

3.1 Request-based (preferred for APIs)


SLO_availability = good_requests / total_requests
Error_budget = 1 - SLO_target
Burn = 1 - SLO_actual    (SLO_actual is the measured SLO_availability; burn rate = Burn / Error_budget)
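As a worked example of these formulas, the sketch below computes how much of the budget a given request count has used. It assumes the counts cover the SLO window; the function names and the sample numbers are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of bad requests, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget spent for the given request counts."""
    if total == 0:
        return 0.0
    burn = 1.0 - good / total  # Burn = 1 - SLO_actual
    return burn / error_budget(slo_target)


# 600 bad requests out of 1,000,000 against a 99.9% target (1,000 allowed).
consumed = budget_consumed(good=999_400, total=1_000_000, slo_target=0.999)
remaining = 1.0 - consumed
```

With these numbers roughly 60% of the budget is spent and 40% remains.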

3.2 Time-based (for background services/streaming)


SLO_uptime = good_minutes / total_minutes

3.3 Example targets

General API: 99.9% availability over 30 days → budget = 0.1%.
Critical payment endpoints: 99.95%; catalogs/search: 99.5%.
Latency: p95 ≤ 300 ms on `/v1/payments`, p99 ≤ 800 ms.
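To make the availability targets above tangible, they can be converted into minutes of full outage allowed per window. A minimal sketch, assuming a 30-day window and treating the budget as pure downtime:

```python
def allowed_downtime_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of total outage that fit into the error budget."""
    window_minutes = window_days * 24 * 60
    return (100.0 - target_pct) / 100.0 * window_minutes


for target in (99.9, 99.95, 99.5):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

So 99.9% leaves about 43 minutes per 30 days, 99.95% about 22 minutes, and 99.5% about 3.6 hours.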

4) SLI instrumentation

4.1 Principles

Logs/traces → RED (Rate/Errors/Duration) metrics with explicit buckets.
Always attach the labels `tenant`, `region`, `route_class` (no PII).

Keep two counters per SLI: "success" and "total" for availability, "fast" and "total" for latency.
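The "two counters per SLI" principle can be sketched with plain in-memory counters. This is an illustration only: a real service would use its metrics client, and the `SliCounters` class and label values here are hypothetical.

```python
from collections import Counter


class SliCounters:
    """Keeps total/success/fast counts per (tenant, region, route_class)."""

    def __init__(self, fast_threshold_ms: float = 300):
        self.fast_threshold_ms = fast_threshold_ms
        self.counts = Counter()

    def observe(self, *, tenant, region, route_class, success, duration_ms):
        labels = (tenant, region, route_class)  # no PII in labels
        self.counts[("total", labels)] += 1
        if success:
            self.counts[("success", labels)] += 1
        if duration_ms <= self.fast_threshold_ms:
            self.counts[("fast", labels)] += 1

    def ratio(self, kind, labels):
        total = self.counts[("total", labels)]
        return self.counts[(kind, labels)] / total if total else None


counters = SliCounters()
labels = ("brand_a", "eu", "payments")
for success, duration_ms in [(True, 120), (True, 450), (True, 900), (False, 80)]:
    counters.observe(tenant="brand_a", region="eu", route_class="payments",
                     success=success, duration_ms=duration_ms)

availability_ratio = counters.ratio("success", labels)
latency_ok_ratio = counters.ratio("fast", labels)
```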

4.2 Prometheus example (rate per 5m)

promql
# Recording rules, written as name = expression.
# Availability (successes / total)
sli:success:rate5m = sum by(region, route)(
  rate(http_requests_total{code=~"2..|3.."}[5m])
)
sli:total:rate5m = sum by(region, route)(
  rate(http_requests_total[5m])
)
sli:availability:ratio5m = sli:success:rate5m / sli:total:rate5m

# Latency (fraction faster than 300 ms)
sli:fast:rate5m = sum by(region, route)(
  rate(http_request_duration_seconds_bucket{le="0.3"}[5m])
)
sli:latency_ok:ratio5m = sli:fast:rate5m / sli:total:rate5m

5) Alerts by burn rate (multi-window, multi-burn)

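5.1 Idea

Burn rate is how fast the budget is being spent relative to the pace that would use it up exactly at the end of the window (burn rate = 1). If the rate is high on both a short and a long window, we alert: the long window shows the burn is significant, the short one shows it is still happening.

5.2 Threshold profiles (for SLO 99.9%, 30-day window)

Paging: burn rate ≥ 14.4× on the 5m and 1h windows (at this pace 2% of the budget is gone in 1 hour).
Ticket: burn rate ≥ 6× on the 6h and 24h windows (5% of the budget in 6 hours, 20% in 24 hours).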
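The relation between a burn rate, an alert window and the share of the budget spent is simple arithmetic. A minimal sketch, assuming a 30-day SLO window and a constant burn rate:

```python
def budget_spent(burn_rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget consumed at a constant burn rate."""
    return burn_rate * window_hours / (slo_window_days * 24)


def hours_to_exhaustion(burn_rate: float, slo_window_days: int = 30) -> float:
    """How long the full budget lasts at a constant burn rate."""
    return slo_window_days * 24 / burn_rate


page_1h = budget_spent(14.4, 1)     # share of budget burned in 1 hour at 14.4x
ticket_6h = budget_spent(6, 6)      # share of budget burned in 6 hours at 6x
ticket_24h = budget_spent(6, 24)    # share of budget burned in 24 hours at 6x
```

At 14.4× the whole 30-day budget lasts about 50 hours; at 6× it lasts 5 days.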

5.3 Example rules (Prometheus, pseudo)

promql
# Error ratio on two windows
short = 1 - (sum(rate(http_requests_total{code=~"2..|3.."}[5m])) /
             sum(rate(http_requests_total[5m])))
long  = 1 - (sum(rate(http_requests_total{code=~"2..|3.."}[1h])) /
             sum(rate(http_requests_total[1h])))

# For SLO = 99.9%, error_budget = 0.001. BurnRate = error_ratio / 0.001
burn_short = short / 0.001
burn_long  = long / 0.001

# Paging: both windows exceed 14.4×
alert: SLOErrorBudgetBurnRateHigh
expr: burn_short > 14.4 and burn_long > 14.4
for: 5m
labels:
  severity: page
annotations:
  summary: "SLO burn rate high (short & long windows)"
  runbook: "slo/runbooks/payments.md"

The ticket rule is built the same way on the 6h/24h windows.
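The same multi-window condition, expressed as a plain function for clarity. This is a sketch of the logic only; the function names and default values (99.9% target, 14.4× threshold) mirror the example above and are otherwise assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """BurnRate = error_ratio / error_budget."""
    return error_ratio / (1.0 - slo_target)


def should_alert(short_error_ratio: float, long_error_ratio: float,
                 slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Fire only when BOTH the short and the long window exceed the threshold."""
    return (burn_rate(short_error_ratio, slo_target) > threshold
            and burn_rate(long_error_ratio, slo_target) > threshold)


sustained_outage = should_alert(0.02, 0.02)     # roughly 20x on both windows
short_spike_only = should_alert(0.05, 0.005)    # long window is only about 5x
already_recovered = should_alert(0.001, 0.02)   # short window is back to about 1x
```

Requiring both windows filters out brief spikes and stops paging soon after the service recovers.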

6) Budget management: processes

6.1 Release gates

If less than 25% of the budget remains and the trend is negative, freeze feature work and prioritize SRE/stability tasks.
Canary releases must have a separate SLO slice (`deployment.environment="canary"`).
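A release gate of this kind can be reduced to a small decision function. A minimal sketch: `trend_per_day` (the change in remaining budget per day, negative when the budget is shrinking) and the returned strings are assumptions made for the example.

```python
def release_gate(budget_remaining: float, trend_per_day: float,
                 freeze_below: float = 0.25) -> str:
    """Return 'freeze' when little budget is left and it keeps shrinking."""
    if budget_remaining < freeze_below and trend_per_day < 0:
        return "freeze"
    return "allow"


healthy = release_gate(budget_remaining=0.60, trend_per_day=-0.01)
low_but_recovering = release_gate(budget_remaining=0.20, trend_per_day=0.02)
low_and_falling = release_gate(budget_remaining=0.20, trend_per_day=-0.03)
```

In a CI/CD pipeline such a check would typically run before the deploy step, with the budget figures read from the SLO dashboard's data source.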

6.2 Prioritizing the backlog

Allocate team capacity in proportion to burn rate and revenue impact.

Justify technical debt work with metrics: "fixing X will reduce the burn rate by Y%."
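One possible reading of "in proportion to burn rate and revenue impact" is to weight each service by the product of the two. That weighting is an assumption of this sketch, as are the service names and numbers.

```python
def allocate_capacity(services: dict, capacity: float) -> dict:
    """Split capacity by weight = burn_rate * revenue_share (assumed weighting)."""
    weights = {name: s["burn_rate"] * s["revenue_share"]
               for name, s in services.items()}
    total = sum(weights.values())
    if total == 0:
        # Nothing is burning: split evenly.
        return {name: capacity / len(services) for name in services}
    return {name: capacity * w / total for name, w in weights.items()}


services = {
    "payments": {"burn_rate": 4.0, "revenue_share": 0.6},
    "catalog": {"burn_rate": 2.0, "revenue_share": 0.3},
}
plan = allocate_capacity(services, capacity=10)  # e.g. 10 engineer-days
```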

6.3 Post-incident

Every incident gets an RCA and an actionable "fix that cannot be rolled back", followed by a check that the service "has returned to SLO."

7) SLO as code

7.1 SLO manifest example (YAML)

yaml
service: payments-api
owner: team-payments
slis:
  - name: availability
    type: request_based
    success_query: sum(rate(http_requests_total{svc="pay",code=~"2..|3.."}[5m]))
    total_query: sum(rate(http_requests_total{svc="pay"}[5m]))
    objective: 99.95
    window: 30d
    segments: ["region", "tenant", "route"]
  - name: latency_p95_300ms
    type: latency_threshold
    good_query: sum(rate(http_request_duration_seconds_bucket{svc="pay",le="0.3"}[5m]))
    total_query: sum(rate(http_request_duration_seconds_count{svc="pay"}[5m]))
    objective: 99.0
    window: 30d
alerts:
  - name: burn_high_page
    windows: ["5m", "1h"]
    threshold_burn: 14.4
    severity: page

7.2 Rule generation

Use generators (slo-generator, pyrra, sloth) to automatically create rules, dashboards and reports.
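To show what such a generator does in principle, here is a toy version that turns a manifest entry into a burn-rate alert rule. It does not reproduce the input format or output of slo-generator, pyrra or sloth; the dict layout mirrors the manifest fields used in this article, and the `[5m]` placeholder substitution is an assumption of the sketch.

```python
from decimal import Decimal


def error_budget(objective_pct) -> Decimal:
    """99.95 -> Decimal('0.0005'); Decimal keeps the generated text free of float noise."""
    return (Decimal(100) - Decimal(str(objective_pct))) / Decimal(100)


def burn_rate_expr(sli: dict, window: str) -> str:
    """Build '(1 - success/total) / budget' for one window."""
    success = sli["success_query"].replace("[5m]", f"[{window}]")
    total = sli["total_query"].replace("[5m]", f"[{window}]")
    return f"(1 - ({success}) / ({total})) / {error_budget(sli['objective'])}"


def page_rule(sli: dict, alert: dict) -> dict:
    """Combine the short and long window into one alert rule."""
    short, long_ = alert["windows"]
    threshold = alert["threshold_burn"]
    expr = (f"{burn_rate_expr(sli, short)} > {threshold} and "
            f"{burn_rate_expr(sli, long_)} > {threshold}")
    return {"alert": alert["name"], "expr": expr,
            "labels": {"severity": alert["severity"]}}


sli = {
    "success_query": 'sum(rate(http_requests_total{svc="pay",code=~"2..|3.."}[5m]))',
    "total_query": 'sum(rate(http_requests_total{svc="pay"}[5m]))',
    "objective": 99.95,
}
alert = {"name": "burn_high_page", "windows": ["5m", "1h"],
         "threshold_burn": 14.4, "severity": "page"}
rule = page_rule(sli, alert)
```

The point of generating rules is that the objective and thresholds live in one reviewed manifest, and every rule and dashboard is derived from it.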

8) Degradation and SLO protection

Load shedding: at peak load, return fast responses that skip "expensive" dependencies.
Cache & stale: `stale-while-revalidate` for reads.
Rate/concurrency limits: protect backends; important routes get priority.
Circuit breakers/timeouts: short timeouts and "fallback" branches.
Feature flags: switch off heavy features with a single toggle.
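As one illustration of these mechanisms, the sketch below combines a concurrency limit with priority for important routes: part of the capacity is reserved, so low-priority requests are shed first. The `LoadShedder` class is hypothetical and not thread-safe; a real service would guard the counter with a lock or use its framework's limiter.

```python
class LoadShedder:
    """Concurrency limiter that reserves slots for priority routes."""

    def __init__(self, max_in_flight: int, reserved_for_priority: int):
        self.max_in_flight = max_in_flight
        self.reserved = reserved_for_priority
        self.in_flight = 0

    def try_acquire(self, priority: bool) -> bool:
        # Non-priority traffic may only use the unreserved part of the capacity.
        limit = self.max_in_flight if priority else self.max_in_flight - self.reserved
        if self.in_flight >= limit:
            return False  # shed: the caller returns a fast fallback response
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(max_in_flight=3, reserved_for_priority=1)
results = [
    shedder.try_acquire(priority=False),  # accepted
    shedder.try_acquire(priority=False),  # accepted
    shedder.try_acquire(priority=False),  # shed: only the reserved slot is left
    shedder.try_acquire(priority=True),   # accepted into the reserved slot
    shedder.try_acquire(priority=True),   # shed: at full capacity
]
```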

9) Observability for SLO

Dashboards: SLO actual vs target, budget balance, burn rate, contribution by routes/providers.
Correlation: from a dip in the SLO → to an exemplar → a specific trace → logs/profiles.
Reports: weekly - trends, budget consumption, top reasons for degradation.

10) Antipatterns

One "global" SLO for the whole system masks problems. Segment.
"99.99% on everything" with no regard for cost and reality. Derive targets from user experience.
Alerting on CPU/heap instead of burn rate/SLO.
Ignoring 4xx responses and long redirects that spoil UX.

An unspecified window type (rolling vs calendar) leads to apples-to-oranges comparisons.

11) Specifics of iGaming/Finance

Money paths (deposits/withdrawals): individual SLOs above the overall level (e.g. 99.95% availability; p95 ≤ 250 ms).
PSP/KYC providers: SLO for each provider + alerts for their contribution to burn; automatic route switching (smart routing).
Jurisdictions/tenants: SLOs and budgets by `region/jurisdiction/brand` so that a local problem does not "flood" the global metric.
Responsible gaming: an SLO on the time it takes to apply limits/self-exclusion (compliance formulas).
Audit/regulatory: keep SLO and incident reports; transparency for internal audits.
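The per-provider view described above can be sketched in two steps: each provider's share of the bad requests (its contribution to burn), and a naive routing choice based on error ratio. The provider names, the `(bad, total)` input shape and the `min_requests` cutoff are assumptions for illustration; real smart routing would also weigh cost, limits and jurisdiction.

```python
def burn_contribution(stats: dict) -> dict:
    """stats: {provider: (bad, total)} -> each provider's share of all bad requests."""
    total_bad = sum(bad for bad, _ in stats.values())
    if total_bad == 0:
        return {provider: 0.0 for provider in stats}
    return {provider: bad / total_bad for provider, (bad, _) in stats.items()}


def pick_provider(stats: dict, min_requests: int = 100):
    """Choose the lowest error ratio among providers with enough traffic to judge."""
    eligible = {provider: bad / total
                for provider, (bad, total) in stats.items()
                if total >= min_requests}
    return min(eligible, key=eligible.get) if eligible else None


stats = {"psp_a": (30, 1000), "psp_b": (10, 1000), "psp_c": (0, 20)}
shares = burn_contribution(stats)
preferred = pick_provider(stats)
```

Here `psp_a` accounts for 75% of the burn, and `psp_c` is ignored for routing because 20 requests are too few to trust its clean record.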

12) Prod Readiness Checklist

  • SLIs (availability/latency/quality/freshness) and segments (route/tenant/region) are defined.
  • Targets (SLOs) are realistic and agreed with the business; rolling 30/90-day windows are in place.
  • Burn-rate alerts use multiple windows (paging/ticket).
  • SLO as code (rule/dashboard generator); budget reports.
  • Release gates are tied to the remaining budget; canaries have their own slices.
  • Degradation mechanisms (shedder, cache stale, circuit, limits) are implemented and tested.
  • Metrics ↔ traces correlation, clear runbooks.
  • Money paths and jurisdictions have separate SLOs/alerts; PSP/KYC are broken down per provider.
  • Regular incident retro and burn-based reliability investments.

13) TL;DR

Define SLIs by user value, set realistic SLOs, and manage them through the error budget and multi-window burn-rate alerts. Adopt SLO-as-code, release gates and planned degradation. Segment by route/tenant/region; for money paths keep tighter targets and separate alerts.
