SLA, SLO and Reliability KPI
1) Terms and differences
SLI (Service Level Indicator) - a measurable indicator of quality (for example, the proportion of successful requests, p95 latency).
SLO (Service Level Objective) - a target SLI value over a time window (for example, "success ≥ 99.9% over 28 days").
Error Budget - the allowed amount of unreliability: 1 − SLO.
SLA (Service Level Agreement) - contractual obligations with fines/credits (external).
Reliability KPIs - operational process metrics (MTTD/MTTA/MTTR/MTBF, % of auto-mitigated incidents, alert coverage, etc.).
2) How to choose SLI (based on Golden Signals)
1. Latency - p95/p99 for key endpoints.
2. Traffic - RPS/RPM/message flow.
3. Errors - the share of 5xx/business errors (for example, excluding payment declines that are the PSP's fault).
4. Saturation - resource saturation (CPU/RAM/IO/lag).
Criteria for a good SLI (a sketch follows this list):
- Correlates with user-perceived experience.
- Technically available and stable to measure.
- Within our control (actions for improvement are possible).
- Low collection cost.
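For illustration, a minimal PromQL sketch of such an SLI - a success ratio restricted to user-facing endpoints (the handler label and the route values here are assumptions, not a fixed standard):

```promql
# Success ratio for key user-facing endpoints over 5 minutes
sum(rate(http_requests_total{handler=~"/login|/deposit", status!~"5.."}[5m]))
  / sum(rate(http_requests_total{handler=~"/login|/deposit"}[5m]))
```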
3) Formulas and Examples
3.1 Availability
Availability = successful requests / all requests
Error Budget (per period) = 1 − SLO
Example: SLO 99.9% over 30 days → error budget = 0.1%, which is equivalent to 43 min 12 s of unavailability.
3.2 Latency
A latency SLO is formulated as the proportion of requests that fit within a threshold:
Latency SLI = proportion of requests with duration ≤ T
Example SLO: 99% of requests ≤ 300 ms (rolling 28d)
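A minimal PromQL sketch of this latency SLI, assuming the request-duration histogram used later in this document has a 0.3 s bucket:

```promql
# Share of requests completing within 300 ms over the last 5 minutes
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))
```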
3.3 Payments (business level)
Payment Success SLI = (successful transactions − external PSP failures) / all attempts
4) Error budget and burn rate
The error budget is your "fuel tank" for innovation (releases, experiments).
Burn rate is the speed at which the budget is consumed:
- fast channel (detection within ~1 h),
- slow channel (trend over ~6-12 h / 24 h).
- If burn rate > 14.4 over 1 hour - SEV-1 (the daily budget will be consumed in ~100 minutes).
- If burn rate > 6 over 6 hours - SEV-2 (rapid degradation).
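These thresholds follow from the definition of burn rate; the relationship can be written as:

```latex
\text{burn rate} = \frac{\text{observed error rate}}{1 - \text{SLO}}, \qquad
T_{\text{exhaust}} = \frac{\text{SLO window}}{\text{burn rate}}
```

With a 30-day (720 h) window, a burn rate of 14.4 exhausts the whole budget in 720 / 14.4 = 50 h and the one-day share of the budget in 24 h / 14.4 ≈ 100 min, which is where the SEV-1 threshold above comes from.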
5) Alerting by SLO (multi-window, multi-burn)
Error indicator: proportion of 5xx or latency violations.
Examples of PromQL (generalized):

```promql
# Error ratio over 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Fast burn (1m window); SLO is a placeholder for the numeric target, e.g. 0.999
(
  sum(rate(http_requests_total{status=~"5.."}[1m]))
    / sum(rate(http_requests_total[1m]))
) / (1 - SLO) > 14.4

# Slow burn (30m window)
(
  sum(rate(http_requests_total{status=~"5.."}[30m]))
    / sum(rate(http_requests_total[30m]))
) / (1 - SLO) > 2
```
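Each expression above uses a single window; a stricter multi-window form requires a short and a long window to exceed the same factor at once, which suppresses brief spikes. A minimal sketch, assuming SLO = 0.999 and a 5m/1h pair:

```promql
# Fast-burn alert that fires only when both windows exceed the 14.4x factor
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
  > 14.4 * (1 - 0.999)
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
  > 14.4 * (1 - 0.999)
)
```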
For SLO by latency, use percentile histograms:
```promql
# p95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
6) SLI/SLO Examples by Domain
6.1 API Gateway / Edge
SLI-Errors: 5xx rate < 0.1% (28d).
SLI-Latency: p95 ≤ 250 ms (day).
SLO: availability ≥ 99.95% (quarter).
6.2 Payments
SLI-Success: payment success rate (excluding client-caused failures) ≥ 99.8% (28d).
SLI-Latency: authorization ≤ 2 s for 99% of requests (day).
SLO: Time-to-Wallet p95 ≤ 3 min (24h).
6.3 Databases (PostgreSQL)
SLI-Lag: replication lag p95 ≤ 1 s (day).
SLI-Errors: query error rate ≤ 0.05% (28d).
SLO: cluster availability ≥ 99.95%.
6.4 Queues / Streaming (Kafka)
SLI-Lag: consumer lag p95 ≤ N messages (hour).
SLI-Durability: acknowledged writes ≥ 99.99% (28d).
SLO: broker availability ≥ 99.9%.
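As an illustration of 6.4, a PromQL sketch for the consumer-lag SLI, assuming a kafka_consumergroup_lag metric from a typical Kafka exporter (metric and label names vary between exporters):

```promql
# p95 of per-consumer-group lag over the last hour (subquery sampled every 1m)
quantile_over_time(0.95,
  (sum by (consumergroup) (kafka_consumergroup_lag))[1h:1m]
)
```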
7) Reliability process KPI
MTTD (Mean Time To Detect)
MTTA (… To Acknowledge)
MTTR (… To Restore)
MTBF (… Between Failures)
% of incidents with automatic mitigation
SLO/alert coverage of top traffic paths (target ≥ 95%)
Share of releases with canary stage
Error budget consumption by teams/features
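To keep the time-based KPIs comparable across teams, fix the calculation convention explicitly; one common convention (an assumption here, not a mandated formula) is:

```latex
\text{MTTR} = \frac{\sum_i \left( t_i^{\text{restored}} - t_i^{\text{start}} \right)}{\text{number of incidents}}, \qquad
\text{MTBF} = \frac{\text{total operating time}}{\text{number of failures}}
```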
8) How to set realistic SLOs
1. Measure current baseline reliability (3-4 weeks).
2. Define "sensitive" user paths (login, deposit, game).
3. Consider the cost of each deviation (time, money, reputation).
4. Choose an ambitious but achievable goal (10-30% improvement on the baseline).
5. Review quarterly.
Anti-patterns:
- Jumping straight to "five nines" without justification.
- SLOs based on metrics users cannot perceive (for example, CPU with no link to UX).
- Too many SLOs → diluted focus.
9) SLO and budget reporting
Standard report (weekly/monthly):
- Attainment per SLO: actual vs target, trends, confidence.
- Error budget consumption summary: how much budget was burned and by what (release/incident).
- Top five causes of degradation, CAPA plan and task status.
- Business impact: conversion, ND, retention, LTV.
10) Link to release policy
Error budget consumed < 50% → releases proceed freely.
50-80% consumed → "cautious mode": only low-risk changes / canary rollouts.
Over 80% consumed → release freeze except fixes and stabilization work.
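To drive this policy from monitoring data, the consumed share of the budget can be computed directly; a sketch assuming a 28d window and SLO = 0.999:

```promql
# Fraction of the 28-day error budget already consumed (1.0 = fully burned)
(
  sum(increase(http_requests_total{status=~"5.."}[28d]))
    / sum(increase(http_requests_total[28d]))
) / (1 - 0.999)
```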
11) SLA (contractual) - clause templates
Availability obligation: for example, 99.9% per month.
Force majeure: DDoS beyond reasonable control, third-party providers.
Measurement window and areas of responsibility: metric sources, calculation method.
Credits/penalties: a tiered table (for example, 60-120 minutes of unavailability → credit of X%).
Escalation and notification procedures: deadlines, channels.
Data and privacy: masking, storage, Legal Hold.
Recurrence prevention plan (CAPA) in case of a violation.
12) Measurement tools
Passive metrics: Prometheus/Mimir/Thanos, exporters.
Logs: Loki/ELK for counting successes/errors at the business level.
Synthetics: active probes (login/deposit/game) on a schedule (cron).
Tracing: Tempo/Jaeger for p99 bottlenecks.
Payment/Finance: ground truth sources for payment SLI.
13) Query examples (templates)
Percentage of successful API requests (excluding 4xx, which count as client errors):

```promql
1 - (
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
)
```
SLO card:
```yaml
slo:
  name: "API Availability"
  window: "28d"
  target: 0.999
  sli: "1 - 5xx%"
  owner: "Platform SRE"
  alerting:
    fast_burn: {window: "1h", factor: 14.4}
    slow_burn: {window: "6h", factor: 6}
```
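A minimal sketch of Prometheus alerting rules implementing the card's fast/slow burn policy (rule names, severities and the hard-coded 0.999 target are illustrative assumptions):

```yaml
groups:
  - name: slo-api-availability
    rules:
      - alert: APIAvailabilityFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999) > 14.4
        for: 5m
        labels:
          severity: critical
      - alert: APIAvailabilitySlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) / (1 - 0.999) > 6
        for: 30m
        labels:
          severity: warning
```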
Payment success (based on business events in logs/stream):

```logql
# success_rate: successful payments vs. all attempts over 5 minutes
sum(count_over_time({app="payments"} |= "status=success" [5m]))
  / sum(count_over_time({app="payments"} |~ "status=(success|fail)" [5m]))
```

Note: refine the filters to exclude "decline by customer" cases.
14) FinOps and reliability
Cost per 9: the cost of adding another nine grows exponentially.
Benefit curve: the optimum is where the gain in revenue / reduction in losses ≥ the cost of the additional "9".
SLO portfolio: different levels for different paths (critical payments are "more expensive," reporting is "cheaper").
15) SLO/Alert Quality - Checklist
- SLI correlates with UX and business metrics.
- Window and aggregation are consistent (rolling 28d/quarter).
- Multi-window alerts, no flapping, role-based routing.
- Documentation: owner, formula, sources, runbook.
- SLO dashboard with error budget and burn-rate indicators.
- Regularly review goals (quarterly).
- Synthetics tests on key scenarios.
16) Implementation plan (4 iterations)
1. Week 1: inventory of user paths, SLI drafts, basic dashboards.
2. Week 2: SLO formalization, budgeting, alerts (fast/slow burn).
3. Week 3: integration with the incident/release process, freeze rules.
4. Week 4+: contractual SLAs, quarterly reviews, "cost per 9" FinOps model.
17) Mini-FAQ
Do I need to have one SLO per service?
Better 2-3 key ones (success + latency) instead of dozens of secondary ones.
What if the budget is exhausted?
Freezing releases, focusing on stabilization and CAPA, removing experimental features.
How to avoid a conflict between release speed and reliability?
Plan releases "within the budget", use canary rollouts and feature flags.
Result
Reliability is managed not through a set of disparate metrics but as a system: SLI → SLO → error budget → burn-rate alerts → incident process → CAPA → SLA. Standardize definitions, data sources and reporting, link goals to user experience and economics, and regularly review the nines based on real-world ROI.