SLA, SLO and Reliability KPI
1) Terms and differences
SLI (Service Level Indicator) - a measurable indicator of quality (for example, the proportion of successful requests, p95 latency).
SLO (Service Level Objective) - a target SLI value over a time window (for example, "success ≥ 99.9% over 28 days").
Error Budget - the allowed amount of unreliability: 1 − SLO.
SLA (Service Level Agreement) - contractual obligations with fines/credits (external).
Reliability KPIs - operational process metrics (MTTD/MTTA/MTTR/MTBF, % of auto-mitigated incidents, alert coverage, etc.).
2) How to choose SLI (based on Golden Signals)
1. Latency - p95/p99 for key endpoints.
2. Traffic - RPS/RPM/message flow.
3. Errors - the share of 5xx/business errors (for example, excluding payment declines that are the PSP's fault).
4. Saturation - resource saturation (CPU/RAM/IO/lag).
Criteria for a good SLI (a sketch follows this list):
- Correlates with user-perceived experience.
- Technically available and stable to measure.
- Within our control (actions for improvement are possible).
- Low collection cost.
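For illustration, a minimal PromQL sketch of such an SLI - a success ratio restricted to user-facing endpoints (the handler label and the route values here are assumptions, not a fixed standard):

```promql
# Success ratio for key user-facing endpoints over 5 minutes
sum(rate(http_requests_total{handler=~"/login|/deposit", status!~"5.."}[5m]))
  / sum(rate(http_requests_total{handler=~"/login|/deposit"}[5m]))
```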
3) Formulas and Examples
3.1 Availability
Availability = successful requests / all requests
Error Budget (per period) = 1 − SLO
Example: SLO 99.9% over 30 days → error budget = 0.1%, which is equivalent to 43 min 12 s of unavailability.
3.2 Latency
A latency SLO is formulated as the proportion of requests that fit within a threshold:
Latency SLI = proportion of requests with duration ≤ T
Example SLO: 99% of requests ≤ 300 ms (rolling 28d)
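A minimal PromQL sketch of this latency SLI, assuming the request-duration histogram used later in this document has a 0.3 s bucket:

```promql
# Share of requests completing within 300 ms over the last 5 minutes
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
```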
3.3 Payments (business level)
Payment Success SLI = (successful transactions − external PSP failures) / all attempts
4) Error budget and burn rate
The error budget is your "fuel tank" for innovation (releases, experiments).
Burn rate is the speed at which the budget is consumed:
- fast channel (detection within ~1 h),
- slow channel (trend over ~6-12 h / 24 h).
- If burn rate > 14.4 over 1 hour - SEV-1 (the daily budget will be consumed in ~100 minutes).
- If burn rate > 6 over 6 hours - SEV-2 (rapid degradation).
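These thresholds follow from the definition of burn rate; the relationship can be written as:

```latex
\text{burn rate} = \frac{\text{observed error rate}}{1 - \text{SLO}}, \qquad
T_{\text{exhaust}} = \frac{\text{SLO window}}{\text{burn rate}}
```

With a 30-day (720 h) window, a burn rate of 14.4 exhausts the whole budget in 720 / 14.4 = 50 h and the one-day share of the budget in 24 h / 14.4 ≈ 100 min, which is where the SEV-1 threshold above comes from.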
5) Alerting by SLO (multi-window, multi-burn)
Error indicator: proportion of 5xx or latency violations.
Examples of PromQL (generalized):

```promql
# Error ratio over 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Fast burn (1m window); SLO is a placeholder for the numeric target, e.g. 0.999
(
  sum(rate(http_requests_total{status=~"5.."}[1m]))
    / sum(rate(http_requests_total[1m]))
) / (1 - SLO) > 14.4

# Slow burn (30m window)
(
  sum(rate(http_requests_total{status=~"5.."}[30m]))
    / sum(rate(http_requests_total[30m]))
) / (1 - SLO) > 2
```
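Each expression above uses a single window; a stricter multi-window form requires a short and a long window to exceed the same factor at once, which suppresses brief spikes. A minimal sketch, assuming SLO = 0.999 and a 5m/1h pair:

```promql
# Fast-burn alert that fires only when both windows exceed the 14.4x factor
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  > 14.4 * (1 - 0.999)
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
  > 14.4 * (1 - 0.999)
)
```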
For SLO by latency, use percentile histograms:
```promql
# p95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
6) SLI/SLO Examples by Domain
6.1 API Gateway / Edge
SLI-Errors: 5xx rate < 0.1% (28d).
SLI-Latency: p95 ≤ 250 ms (day).
SLO: availability ≥ 99.95% (quarter).
6.2 Payments
SLI-Success: payment success rate (excluding client-caused failures) ≥ 99.8% (28d).
SLI-Latency: authorization ≤ 2 s for 99% of requests (day).
SLO: Time-to-Wallet p95 ≤ 3 min (24h).
6.3 Databases (PostgreSQL)
SLI-Lag: replication lag p95 ≤ 1 s (day).
SLI-Errors: query error rate ≤ 0.05% (28d).
SLO: cluster availability ≥ 99.95%.
6.4 Queues / Streaming (Kafka)
SLI-Lag: consumer lag p95 ≤ N messages (hour).
SLI-Durability: acknowledged writes ≥ 99.99% (28d).
SLO: broker availability ≥ 99.9%.
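As an illustration of 6.4, a PromQL sketch for the consumer-lag SLI, assuming a kafka_consumergroup_lag metric from a typical Kafka exporter (metric and label names vary between exporters):

```promql
# p95 of per-consumer-group lag over the last hour (subquery sampled every 1m)
quantile_over_time(0.95,
  (sum by (consumergroup) (kafka_consumergroup_lag))[1h:1m]
)
```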
7) Reliability process KPI
MTTD (Mean Time To Detect)
MTTA (… To Acknowledge)
MTTR (… To Restore)
MTBF (… Between Failures)
% of incidents with automatic mitigation
SLO/alert coverage of top traffic paths (target ≥ 95%)
Share of releases with canary stage
Error budget consumption by teams/features
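To keep the time-based KPIs comparable across teams, fix the calculation convention explicitly; one common convention (an assumption here, not a mandated formula) is:

```latex
\text{MTTR} = \frac{\sum_i \left( t_i^{\text{restored}} - t_i^{\text{start}} \right)}{\text{number of incidents}}, \qquad
\text{MTBF} = \frac{\text{total operating time}}{\text{number of failures}}
```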
8) How to set realistic SLOs
1. Measure current baseline reliability (3-4 weeks).
2. Define "sensitive" user paths (login, deposit, game).
3. Consider the cost of each deviation (time, money, reputation).
4. Choose an ambitious but achievable goal (10-30% improvement on the baseline).
5. Review quarterly.
Anti-patterns:
- Jumping straight to "five nines" without justification.
- SLOs based on metrics users cannot perceive (for example, CPU with no link to UX).
- Too many SLOs → diluted focus.
9) SLO and budget reporting
Standard report (weekly/monthly):
- Attainment per SLO: actual vs target, trends, confidence.
- Error budget consumption summary: how much budget was burned and by what (release/incident).
- Top five causes of degradation, CAPA plan and task status.
- Business impact: conversion, ND, retention, LTV.
10) Link to release policy
Error budget consumed < 50% → releases proceed freely.
50-80% consumed → "cautious mode": only low-risk changes / canary rollouts.
Over 80% consumed → release freeze except fixes and stabilization work.
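To drive this policy from monitoring data, the consumed share of the budget can be computed directly; a sketch assuming a 28d window and SLO = 0.999:

```promql
# Fraction of the 28-day error budget already consumed (1.0 = fully burned)
(
  sum(increase(http_requests_total{status=~"5.."}[28d]))
    / sum(increase(http_requests_total[28d]))
) / (1 - 0.999)
```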
11) SLA (contractual) - clause templates
Availability obligation: for example, 99.9% per month.
Force majeure: DDoS beyond reasonable control, third-party providers.
Measurement window and areas of responsibility: metric sources, calculation method.
Credits/penalties: a tiered table (for example, 60-120 minutes of unavailability → credit of X%).
Escalation and notification procedures: deadlines, channels.
Data and privacy: masking, storage, Legal Hold.
Recurrence prevention plan (CAPA) in case of a violation.
12) Measurement tools
Passive metrics: Prometheus/Mimir/Thanos, exporters.
Logs: Loki/ELK for counting successes/errors at the business level.
Synthetics: active probes (login/deposit/game) on a schedule (cron).
Tracing: Tempo/Jaeger for p99 bottlenecks.
Payment/Finance: ground truth sources for payment SLI.
13) Query examples (templates)
Percentage of successful API requests (excluding 4xx, which count as client errors):

```promql
1 - (
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
)
```
SLO card:
```yaml
slo:
  name: "API Availability"
  window: "28d"
  target: 0.999
  sli: "1 - 5xx%"
  owner: "Platform SRE"
  alerting:
    fast_burn: {window: "1h", factor: 14.4}
    slow_burn: {window: "6h", factor: 6}
```
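A minimal sketch of Prometheus alerting rules implementing the card's fast/slow burn policy (rule names, severities and the hard-coded 0.999 target are illustrative assumptions):

```yaml
groups:
  - name: slo-api-availability
    rules:
      - alert: APIAvailabilityFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999) > 14.4
        for: 5m
        labels:
          severity: critical
      - alert: APIAvailabilitySlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) / (1 - 0.999) > 6
        for: 30m
        labels:
          severity: warning
```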
Payment success (based on business events in logs/stream):

```logql
# success_rate: successful payments vs. all attempts over 5 minutes
sum(count_over_time({app="payments"} |= "status=success" [5m]))
  / sum(count_over_time({app="payments"} |~ "status=(success|fail)" [5m]))
```

Note: refine the filters to exclude "decline by customer" cases.
14) FinOps and reliability
Cost per 9: the cost of adding another nine grows exponentially.
Benefit curve: the optimum is where the gain in revenue / reduction in losses ≥ the cost of the additional "9".
SLO portfolio: different levels for different paths (critical payments are "more expensive," reporting is "cheaper").
15) SLO/Alert Quality - Checklist
- SLI correlates with UX and business metrics.
- Window and aggregation are consistent (rolling 28d/quarter).
- Multi-window alerts, no flapping, role-based routing.
- Documentation: owner, formula, sources, runbook.
- SLO dashboard with error budget and burn-rate indicators.
- Regularly review goals (quarterly).
- Synthetics tests on key scenarios.
16) Implementation plan (4 iterations)
1. Week 1: inventory of user paths, SLI drafts, basic dashboards.
2. Week 2: SLO formalization, budgeting, alerts (fast/slow burn).
3. Week 3: integration with the incident/release process, freeze rules.
4. Week 4+: contractual SLAs, quarterly reviews, "cost per 9" FinOps model.
17) Mini-FAQ
Do I need to have one SLO per service?
Better 2-3 key ones (success + latency) instead of dozens of secondary ones.
What if the budget is exhausted?
Freezing releases, focusing on stabilization and CAPA, removing experimental features.
How to avoid a conflict between release speed and reliability?
Plan releases "within the budget", use canary rollouts and feature flags.
Result
Reliability is managed not through a set of disparate metrics but as a system: SLI → SLO → error budget → burn-rate alerts → incident process → CAPA → SLA. Standardize definitions, data sources and reporting, link goals to user experience and economics, and regularly review the nines based on real-world ROI.