Error Budgets and SLO Management
1) Why SLOs and error budgets
SLO (Service Level Objective): the target quality level as perceived by the user. SLI (Service Level Indicator): the measured metric. Error budget: the amount of failure tolerated per window (usually 30 or 90 days).
The error budget turns reliability from an abstraction into a managed resource: when the budget burns down quickly, we freeze features and fix reliability; when the budget is healthy, we can release faster.
2) Choosing SLIs: what counts as "good"
Criterion: "successful from the user's point of view."
2.1 Classic SLIs
Availability: the percentage of successful requests (excluding those canceled by the client).
`success = http_status ∈ {2xx, 3xx, 404} and no timeout` (404 can count as a success for a read API when it is the expected semantics).
Latency: the proportion of requests faster than a threshold (for example, p95 ≤ 300 ms).
`good_latency = duration_ms ≤ 300`.
Freshness/staleness: "data no older than X minutes" (caches, catalogs, coefficients).
Quality: correctness of the result (passing business validators/backend invariants).
2.2 Boundaries and segments
SLIs should be computed per important slice: `route`, `tenant/brand`, `region/jurisdiction`, `payment_provider`. This way the failure of one critical endpoint is not "smeared" across the whole system.
3) Formulas and budgets
3.1 Request-based (preferred for APIs)
SLO_availability = good_requests / total_requests
Error_budget = 1 - SLO_target
Burn = 1 - SLO_actual
3.2 Time-based (for background services/streaming)
SLO_uptime = good_minutes / total_minutes
3.3 Example targets
General API: 99.9% availability over 30 days → budget = 0.1%.
Critical payment endpoints: 99.95%; catalogs/search: 99.5%.
Latency: p95 ≤ 300 ms on `/v1/payments`, p99 ≤ 800 ms.
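The request-based formulas above can be sketched in a few lines; this is a minimal illustration, and the traffic numbers are invented for the example, not from the original text.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail within the window."""
    return 1.0 - slo_target

def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the budget burns: 1.0 means failing exactly at budget pace."""
    error_ratio = 1.0 - good / total
    return error_ratio / error_budget(slo_target)

# 99.9% target: 0.1% of requests may fail
print(error_budget(0.999))
# 2000 failures out of 1M against a 99.9% target burns 2x faster than budgeted
print(burn_rate(good=998_000, total=1_000_000, slo_target=0.999))
```

A burn rate of 1.0 means the service will exhaust exactly its budget by the end of the window; anything sustained above 1.0 means the SLO will be missed.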
4) SLI instrumentation
4.1 Principles
Derive RED (Rate/Errors/Duration) metrics with explicit buckets from logs/traces.
Always label with `tenant`, `region`, `route_class` (without PII).
Count two metrics: "success" and "total"; for latency, "fast" and "total."
4.2 Prometheus example (rate over 5m)

```promql
# Availability (successes / total)
sli:success:rate5m = sum by(region, route)(
  rate(http_requests_total{code=~"2..|3.."}[5m])
)
sli:total:rate5m = sum by(region, route)(
  rate(http_requests_total[5m])
)
sli:availability:ratio5m = sli:success:rate5m / sli:total:rate5m

# Latency (fraction faster than 300 ms)
sli:fast:rate5m = sum by(region, route)(
  rate(http_request_duration_seconds_bucket{le="0.3"}[5m])
)
sli:latency_ok:ratio5m = sli:fast:rate5m / sli:total:rate5m
```
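The "two counters" pattern from 4.1 can be shown without any metrics library; this stdlib-only sketch counts success and total per (region, route) slice and derives the availability ratio, mirroring what the PromQL recording rules compute. The success predicate follows the 2.1 definition (2xx/3xx, plus 404 for read APIs).

```python
from collections import Counter

success: Counter = Counter()
total: Counter = Counter()

def observe(region: str, route: str, status: int) -> None:
    """Record one request into the per-slice success/total counters."""
    key = (region, route)
    total[key] += 1
    # 2xx/3xx (and expected 404 on read APIs) count as user-visible success
    if 200 <= status < 400 or status == 404:
        success[key] += 1

def availability(region: str, route: str) -> float:
    """Success ratio for one slice; 1.0 when there is no traffic yet."""
    key = (region, route)
    return success[key] / total[key] if total[key] else 1.0

observe("eu", "/v1/payments", 200)
observe("eu", "/v1/payments", 500)
print(availability("eu", "/v1/payments"))  # 0.5
```

In production these counters would be exported via a metrics client and aggregated server-side; the point here is only that each slice carries its own pair.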
5) Alerts by burn rate (multi-window, multi-burn)
5.1 Idea
We watch how fast the budget burns relative to the target. If the burn rate is high on both a short and a long window, we alert.
5.2 Threshold profiles (for SLO 99.9%, 30-day window)
Paging: burn rate ≥ 14.4× (consumes 2% of the budget in 1 hour) and ≥ 6× (5% of the budget in 6 hours).
Ticket: burn rate ≥ 3× (10% of the budget in 24 hours) and ≥ 1× (10% of the budget in 3 days).
5.3 Example rules (Prometheus, pseudo)

```promql
# Error ratio over two windows
short = 1 - (sum(rate(http_requests_total{code=~"2..|3.."}[5m])) /
             sum(rate(http_requests_total[5m])))
long  = 1 - (sum(rate(http_requests_total{code=~"2..|3.."}[1h])) /
             sum(rate(http_requests_total[1h])))

# For SLO = 99.9%, error_budget = 0.001; burn rate = error_ratio / error_budget
burn_short = short / 0.001
burn_long  = long  / 0.001
```

```yaml
# Paging: both windows exceed 14.4x
alert: SLOErrorBudgetBurnRateHigh
expr: burn_short > 14.4 and burn_long > 14.4
for: 5m
labels: { severity: page }
annotations:
  summary: "SLO burn rate high (short & long windows)"
  runbook: "slo/runbooks/payments.md"
```

Do the same with 6h/24h windows for the ticket severity.
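The multi-window paging decision reduces to a simple predicate; this sketch assumes the error ratios for both windows arrive from your metrics backend, and uses the 99.9% target and 14.4× threshold from the text.

```python
SLO_TARGET = 0.999
BUDGET = 1.0 - SLO_TARGET  # 0.001 for a 99.9% target

def should_page(error_ratio_short: float, error_ratio_long: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn faster than the threshold.

    The short window makes the alert reset quickly once the incident is over;
    the long window filters out brief spikes that never threaten the budget.
    """
    burn_short = error_ratio_short / BUDGET
    burn_long = error_ratio_long / BUDGET
    return burn_short > threshold and burn_long > threshold

print(should_page(0.02, 0.016))  # True: both windows burn well above 14.4x
print(should_page(0.02, 0.002))  # False: long window is only ~2x, likely a spike
```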
6) Budget management: processes
6.1 Release gates
If the remaining budget is below 25% and the trend is negative, declare a feature code-freeze and prioritize SRE/stability work.
Canary releases must have a separate SLO slice (`deployment.environment="canary"`).
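The release gate above is a one-line policy; this sketch makes it explicit. The 25% threshold comes from the text, while the input values (remaining budget fraction, trend flag) are assumed to come from your SLO reporting.

```python
def release_allowed(budget_remaining: float, trend_negative: bool,
                    freeze_below: float = 0.25) -> bool:
    """Gate feature releases on error-budget health.

    budget_remaining: fraction of the window's error budget still unspent.
    trend_negative: whether the budget balance is currently declining.
    """
    if budget_remaining < freeze_below and trend_negative:
        return False  # code-freeze: only reliability fixes ship
    return True

print(release_allowed(0.40, trend_negative=True))   # True: enough budget left
print(release_allowed(0.10, trend_negative=True))   # False: freeze features
print(release_allowed(0.10, trend_negative=False))  # True: low but recovering
```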
6.2 Prioritizing the backlog
Distribute team capacity in proportion to burn rate and revenue impact.
Justify technical-debt work with metrics: "fix X will reduce the burn rate by Y%."
6.3 Post-incident
Every incident gets an RCA with actionable fixes and a follow-up check that the service has returned to its SLO.
7) SLO as code
7.1 Example SLO manifest (YAML)

```yaml
service: payments-api
owner: team-payments
slis:
  - name: availability
    type: request_based
    success_query: sum(rate(http_requests_total{svc="pay",code=~"2..|3.."}[5m]))
    total_query: sum(rate(http_requests_total{svc="pay"}[5m]))
    objective: 99.95
    window: 30d
    segments: ["region", "tenant", "route"]
  - name: latency_p95_300ms
    type: latency_threshold
    good_query: sum(rate(http_request_duration_seconds_bucket{svc="pay",le="0.3"}[5m]))
    total_query: sum(rate(http_request_duration_seconds_count{svc="pay"}[5m]))
    objective: 99.0
    window: 30d
alerts:
  - name: burn_high_page
    windows: ["5m", "1h"]
    threshold_burn: 14.4
    severity: page
```
7.2 Rule generation
Use generators (slo-generator, pyrra, sloth) to automatically create rules, dashboards and reports.
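To show the idea behind such generators, here is a toy sketch that turns a manifest shaped like the one in 7.1 (inlined as a dict to stay self-contained) into a burn-rate alert expression. The `error_ratio{window=...}` series name is a placeholder invented for this example; real generators like slo-generator, pyrra, or sloth emit full recording and alerting rules from richer templates.

```python
# Manifest fields mirror the YAML example in 7.1
manifest = {
    "service": "payments-api",
    "slis": [{"name": "availability", "objective": 99.95}],
    "alerts": [{"name": "burn_high_page", "windows": ["5m", "1h"],
                "threshold_burn": 14.4, "severity": "page"}],
}

def render_alert(sli: dict, alert: dict) -> str:
    """Render a multi-window burn-rate expression for one SLI/alert pair."""
    budget = 1.0 - sli["objective"] / 100.0  # 99.95% -> 0.0005
    clauses = [
        f'(error_ratio{{window="{w}"}} / {budget:.6f}) > {alert["threshold_burn"]}'
        for w in alert["windows"]
    ]
    # Multi-window: every window must exceed the threshold to fire
    return " and ".join(clauses)

expr = render_alert(manifest["slis"][0], manifest["alerts"][0])
print(expr)
```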
8) Degradation mechanisms and SLO protection
Load shedding: serve fast answers without "expensive" dependencies at peak load.
Cache & stale: `stale-while-revalidate` for reads.
Rate/concurrency limits: protect backends; give important routes priority.
Circuit breakers/timeouts: fast timeouts and fallback branches.
Feature flags: disable heavy features with one switch.
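One mechanism from the list above, a circuit breaker with a fast fallback, can be sketched minimally; the thresholds and the fallback value here are illustrative, not from the original text.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        # While open (too many recent failures), skip the dependency entirely
        # so the user gets a fast degraded answer instead of a slow error.
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback
            self.failures = 0  # half-open: allow one trial call through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback

def flaky():
    raise TimeoutError("dependency timed out")

cb = CircuitBreaker(max_failures=2)
print([cb.call(flaky, fallback="cached") for _ in range(4)])
# ['cached', 'cached', 'cached', 'cached'] - after 2 failures the breaker
# opens and later calls never touch the dependency until the reset window.
```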
9) Observability for SLO
Dashboards: SLO actual vs target, budget balance, burn rate, contribution by routes/providers.
Correlation: from an SLO dip → to an exemplar → to a specific trace → to logs/profiles.
Reports: weekly - trends, budget consumption, top reasons for degradation.
10) Antipatterns
One "global" SLO for the whole system masks problems; segment instead.
"99.99% for everything," ignoring cost and reality. Choose targets from the user experience.
CPU/heap alerts instead of burn-rate/SLO alerts.
Ignoring 4xx errors and long redirect chains that spoil UX.
An ambiguous window definition (rolling vs calendar) leads to comparing apples with oranges.
11) Specifics of iGaming/finance
Money paths (deposits/withdrawals): dedicated SLOs stricter than the overall level (e.g. 99.95% availability; p95 ≤ 250 ms).
PSP/KYC providers: an SLO per provider, plus alerts on each provider's contribution to burn; automatic route switching (smart routing).
Jurisdictions/tenants: SLOs and budgets per `region/jurisdiction/brand`, so that a local problem does not "flood" the global metric.
Responsible gaming: an SLO for the time to apply limits/self-exclusion (compliance-critical).
Audit/regulatory: retain SLO and incident reports; keep them transparent for internal audits.
12) Prod Readiness Checklist
- SLIs (availability/latency/quality/freshness) and segments (route/tenant/region) are defined.
- Targets (SLOs) are realistic and aligned with the business; rolling 30/90-day windows are used.
- Alerts by burn rate with multi-windows (paging/ticket).
- SLO as code (rule/dashboard generator); budget reports.
- Release gates are tied to the remaining budget; canary slices exist.
- Degradation mechanisms (shedder, cache stale, circuit, limits) are implemented and tested.
- Metrics ↔ traces correlation, clear runbooks.
- Financial/jurisdictional paths have separate SLOs/alerts; PSP/KYC providers are broken out individually.
- Regular incident retros and burn-driven reliability investments.
13) TL;DR
Define SLIs by user value, set realistic SLOs, and manage reliability via the error budget and multi-window burn rates. Adopt SLO-as-code, release gates, and planned degradation. Segment by route/tenant/region; for money paths keep tighter targets and separate alerts.