SLO/SLA and Metrics
1) Terms and hierarchy
SLI (Service Level Indicator) - a measurable indicator of the service "as the user sees it": the share of successful requests, p95 latency, data freshness, the share of successfully processed batches, etc.
SLO (Service Level Objective) - a target SLI value over an observation window (28/30/90 days). Example: "99.9% of requests to /pay complete in ≤ 400 ms."
Error budget - 1 − SLO. At SLO 99.9%, the error budget is 0.1% of time/requests.
SLA (Service Level Agreement) - a legally binding service-level commitment: includes the SLO, the measurement method, exceptions, and compensations/penalties.
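As a quick sanity check, the relationship between an SLO, its error budget, and the downtime that budget allows can be sketched in a few lines of Python (the numbers are illustrative):

```python
# Convert an SLO into its error budget and the downtime it allows per window.
def error_budget(slo: float) -> float:
    """Error budget as a fraction: 1 - SLO."""
    return 1.0 - slo

def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime a time-based SLO allows over the window."""
    return error_budget(slo) * window_days * 24 * 60

# SLO 99.9% over 30 days -> 0.1% budget, roughly 43.2 minutes of downtime
print(error_budget(0.999), allowed_downtime_minutes(0.999, 30))
```

The same arithmetic shows why each extra nine cuts the allowed downtime by 10x: 99.99% over 30 days leaves only about 4.3 minutes.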
2) Design principles
Symptoms > internal metrics. SLIs should reflect the actual user experience, not implementation details.
A small number of key SLIs. For a service, 2-5 main ones: success, latency, throughput/freshness, correctness.
Coverage of critical paths. Each business scenario (checkout, login, webhook, ETL load) gets its own set of SLIs/SLOs.
Strict "success" semantics. Not "HTTP 200," but "the user received a valid response on time."
Separation of external and internal SLOs. Internal SLOs are stricter; the external SLA is set 1-2 nines looser.
3) SLI catalog (reference)
3.1 API/Online Services
Success: `SLI_success = 1 − (5xx + timeouts + business_errors) / all_requests`
Latency: p95/p99 of `http_server_duration_seconds` by route/method/tenant
Throughput: `RPS` / limits / quotas
Correctness: share of valid responses (signatures, schemas, invariants)
3.2 Webhooks/Asynchronous Deliveries
Delivery: share of events acknowledged within T seconds and ≤ N retries
Customers: share of subscribers without long delays (per tenant)
3.3 Data/ETL/DWH
Freshness: `now − last_successful_ingest_ts`
Completeness: `ingested_rows / expected_rows`
Correctness: share of records that pass quality checks
Pipelines: share of jobs completed before the deadline
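A minimal sketch (Python; the timestamps and row counts are illustrative) of how the freshness and completeness SLIs above can be computed from ingest metadata:

```python
# Compute the data SLIs above from hypothetical ingest metadata.
def freshness_seconds(last_successful_ingest_ts: float, now: float) -> float:
    """Freshness: now - last_successful_ingest_ts (in seconds)."""
    return now - last_successful_ingest_ts

def completeness(ingested_rows: int, expected_rows: int) -> float:
    """Completeness: ingested_rows / expected_rows."""
    return ingested_rows / expected_rows if expected_rows else 1.0

now = 1_700_000_000.0   # fixed "current" timestamp for a reproducible example
last_ok = now - 540     # last successful ingest 9 minutes ago
print(freshness_seconds(last_ok, now) / 60)  # 9.0 minutes -> within a 10-min SLO
print(completeness(9_950, 10_000))           # 0.995
```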
3.4 Mobile/Client SDKs
Client success: share of sessions without fatal errors
Round-trip latency: p95 from request to render
Cache hits: share of responses served from cache (as a performance symptom)
4) Formulas and example targets
Availability (request-based):
- `SLI_req_avail = 1 − (failed_requests / total_requests)`
- `SLO_req_avail = 99.95%` over 30 days → error budget = 0.05% of requests.
Availability (time-based):
- `uptime = (obs_window − downtime) / obs_window`
Latency:
- `SLO_latency = p95(route="/pay") ≤ 400 ms` over 7-day slices, excluding cache warm-ups (1%).
Freshness:
- `SLO_freshness(dataset="orders") ≤ 10 min` at p99 over 24 hours.
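The request-based and time-based availability formulas above, as a runnable Python sketch (the counts are made up):

```python
# Request-based vs time-based availability, per the formulas above.
def sli_req_avail(failed_requests: int, total_requests: int) -> float:
    """SLI_req_avail = 1 - failed_requests / total_requests."""
    return 1.0 - failed_requests / total_requests

def uptime(obs_window_seconds: float, downtime_seconds: float) -> float:
    """uptime = (obs_window - downtime) / obs_window."""
    return (obs_window_seconds - downtime_seconds) / obs_window_seconds

month = 30 * 24 * 3600
print(sli_req_avail(400, 1_000_000))  # 0.9996 -> meets SLO_req_avail = 99.95%
print(uptime(month, 120))             # two minutes of downtime in a 30-day window
```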
5) Error budget and change management
Budget (B): `B = 1 − SLO`.
Burn - the ratio of actual errors to the errors allowed by the budget.
- Overspend (burn > 1) → feature freeze, focus on reliability.
- At burn rate > X in a short window - declare an incident and take mitigation measures.
- Planning: the share of the sprint devoted to reliability scales with burn over the past period.
6) Alerting: burn rate and multi-window rules
The idea: catch both fast burns and slow drift.
Example (SLO 99.9%, budget 0.1%):
- Critical: "2% of the budget in 1 hour" (fast fire).
- Warning: "5% of the budget in 6 hours" (creeping degradation).
- Short window (minutes to an hour) for fast incidents.
- Long window (6-24 hours) for trends.
- Latency: alert on p99 > threshold for ≥ 5 min, with flapping suppression and links to trace exemplars.
```
error_ratio_5m = errors[5m] / requests[5m]
error_ratio_1h = errors[1h] / requests[1h]
burn_5m = error_ratio_5m / error_budget_fraction
burn_1h = error_ratio_1h / error_budget_fraction

alert_critical if burn_1h > 14 and burn_5m > 14
alert_warning  if burn_6h > 6  and burn_30m > 6
```
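The multi-window rules can be sketched in Python. The 14 and 6 thresholds are the commonly used defaults for a 99.9% SLO; the window names and counts below are illustrative:

```python
# Sketch of multi-window, multi-burn-rate alert classification.
def burn_rate(errors, requests, error_budget_fraction):
    """Burn = observed error ratio / allowed error ratio (budget fraction)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / error_budget_fraction

def classify(windows, budget):
    """windows: dict mapping window name ("5m", "1h", ...) to (errors, requests)."""
    burn = {w: burn_rate(e, r, budget) for w, (e, r) in windows.items()}
    if burn.get("1h", 0) > 14 and burn.get("5m", 0) > 14:
        return "critical"   # fast, severe burn
    if burn.get("6h", 0) > 6 and burn.get("30m", 0) > 6:
        return "warning"    # slow, creeping burn
    return "ok"

budget = 1 - 0.999  # 0.1% error budget for SLO 99.9%
print(classify({"5m": (30, 1000), "1h": (300, 15000)}, budget))  # -> critical
```

Requiring both the short and the long window to exceed the threshold is what suppresses flapping: a brief spike trips only the short window and fires no alert.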
7) Multi-tenancy and segmentation
SLIs/SLOs are computed per tenant/plan/region; otherwise the overall median will "cover up" individual failures.
Require a minimum number of events for statistical significance (guard-rails).
SLAs can differ by plan (for example, Pro 99.9%, Free 99.5%).
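A sketch of the per-tenant guard-rail (Python; `MIN_EVENTS` and the tenant data are made up): tenants below the event threshold report `None` instead of a statistically meaningless number.

```python
from typing import Dict, Optional, Tuple

MIN_EVENTS = 1000  # illustrative significance threshold; tune per traffic profile

def per_tenant_sli(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Optional[float]]:
    """counts maps tenant -> (good_events, total_events).

    Returns the success SLI per tenant, or None when the sample is too
    small to be statistically meaningful (guard-rail).
    """
    return {
        tenant: (good / total if total >= MIN_EVENTS else None)
        for tenant, (good, total) in counts.items()
    }

print(per_tenant_sli({"acme": (99_950, 100_000), "tiny": (49, 50)}))
# -> {'acme': 0.9995, 'tiny': None}
```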
8) Link with observability and traces
SLI metrics come from histograms/counters with exemplars → jump directly to the "bad" traces.
Logs are the source of causes: timeouts, business error codes, limits.
For data, a link with lineage: "which job delayed the freshness metric."
9) Contracts and SLAs
SLA contents:
- Definitions of SLIs / measurement method / windows.
- Exceptions (planned maintenance, force majeure).
- Incident and communication procedure (status page, RFO/RCA).
- Service credits and the claims process.
- Jurisdiction, validity period, revision terms.
- Never publicly promise SLOs stricter than your architecture and operational practices allow.
- Separate internal SLOs from external SLAs.
10) Cost and prioritization
The price of nines does not grow linearly: "99.9% → 99.99%" is a different architecture class (N+1, multi-zone, active-active).
Put SLOs on the most valuable user actions.
Control telemetry cost: downsampling, quotas, replication, and storage classes.
11) Procedures and Reporting
Weekly reports: SLO attainment by service/tenant, budget spend, top causes, improvement plans.
Post-incident RCA: attribute incidents to their share of the budget; file tasks to eliminate root causes.
Feature freeze: entry/exit criteria.
12) Templates (for a quick start)
12.1 SLO card (example)

```
Service: Checkout API
SLI:
  success: 1 - (5xx + timeouts + biz_failures) / all
  latency_p95: p95(http_server_duration_seconds{route="/pay"})
SLO:
  success: 99.95% / 30d
  latency_p95: ≤ 400ms / 7d
Windows:
  primary: 30d rolling
  secondary: 7d rolling
Burn alerts:
  critical: burn 1h/5m > 14
  warning: burn 6h/30m > 6
Owner: Team Checkout
Tenancy: per-tenant (≥ 1k req/day threshold)
Dashboards: RED + trace exemplars
```
12.2 SLO Maturity Table
13) Example rules (fragments)
PromQL: success/errors/latency:

```promql
# Error ratio (5xx) on /pay
sum(rate(http_requests_total{route="/pay",code=~"5.."}[5m]))
/ sum(rate(http_requests_total{route="/pay"}[5m]))

# p99 latency
histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket{route="/pay"}[5m])) by (le))
```

Burn-rate alerts (idea for rules):

```promql
# error_budget_fraction = 0.001 for SLO 99.9%
(err_rate_5m / 0.001 > 14) and (err_rate_1h / 0.001 > 14)   # critical
(err_rate_30m / 0.001 > 6) and (err_rate_6h / 0.001 > 6)    # warning
```
Data freshness:

```promql
# Orders data lag (minutes)
(time() - max(last_ingest_ts_seconds{dataset="orders"})) / 60
```
14) SLOs for data and ML (specifics)
End-to-end data SLOs: p99 freshness, p99 completeness, reprocessing time after a failure.
Data contracts: expected schemas, volumes, deadlines; a contract violation → incident.
ML: SLOs for inference latency, an SLA for feature store availability, drift monitoring (model quality is a separate topic, outside the SLA).
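A minimal sketch of a data-contract check as described above (Python; the column names, thresholds, and violation messages are all illustrative):

```python
# Check a batch against a data contract: expected schema, volume, deadline.
# Any returned violation would open an incident per the policy above.
def check_contract(batch_columns, row_count, arrived_ts,
                   expected_columns, min_rows, deadline_ts):
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    missing = set(expected_columns) - set(batch_columns)
    if missing:
        violations.append("missing columns: %s" % sorted(missing))
    if row_count < min_rows:
        violations.append("volume too low: %d < %d" % (row_count, min_rows))
    if arrived_ts > deadline_ts:
        violations.append("deadline missed")
    return violations

# A late, incomplete batch violates all three clauses of the contract.
print(check_contract({"id", "amount"}, 500, 110.0,
                     {"id", "amount", "ts"}, 1000, 100.0))
```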
15) Integration with security and privacy
SLI logs without PII/secrets; tokenization/masking.
Audit SLO/SLA changes and report publications in immutable logs.
For regulated paths (payments/PII): separate, stricter SLOs.
16) Checklists
Before launching a service/feature
- Success/latency/throughput/freshness SLIs defined.
- SLOs and windows defined; the error budget is calculated.
- Burn-rate alerts set up (short + long windows).
- RED dashboards + exemplars → traces; incident runbooks.
- Multi-tenant slices and significance thresholds.
- Feature-freeze and reporting procedures.
Operation
- Weekly SLO/burn report, hardening plans.
- Re-evaluate SLOs when architecture/load changes.
- Periodic incident drills and runbook updates.
- Monitor telemetry cost and the number of SLIs.
17) Runbooks
Runbook: sudden p99 growth on /pay
1. Alert on p99 > threshold → open the dashboard → follow an exemplar to the trace.
2. Find the bottleneck CLIENT/SERVER span; compare regions/versions.
3. Enable degradation (cache/limits/fallback); notify the dependency's owning team.
4. After stabilization: RCA, optimization tasks, updated SLO measurements.
Runbook: budget spend > 50% in a week
1. Freeze features, raise the priority of reliability work.
2. Cluster errors by route/tenant/dependency.
3. Roll out fixes → confirm the trend recovers.
4. Retrospective and alert/threshold adjustment.
18) FAQ
Q: How many SLOs do you need?
A: A minimum covering critical user scenarios: success + latency. Everything else as needed.
Q: Which is better: availability by time or by requests?
A: Request-based is the more user-centric metric. Time-based is convenient for network/infra components.
Q: Why p95 and not the average?
A: The average hides the tail; users feel p95/p99.
Q: How do you avoid over-tightening the targets?
A: Start with realistic targets (from historical data), then tighten as you mature.
- "Observability: logs, metrics, traces"
- "Distributed Traces"
- "Audit and immutable logs"
- "Webhook Delivery Guarantees"
- "In Transit/At Rest Encryption"
- "Data Origin (Lineage)"