GambleHub

SLA and SLO monitoring

1) Terms and roles

SLA (Service Level Agreement) - external contractual obligation to the client (penalty clauses, credits).
SLO (Service Level Objective) - target internal service level that supports SLA execution.
SLI (Service Level Indicator) - measured indicator, on the basis of which SLO/SLA are evaluated.
Error Budget - the allowed share of errors/unavailability for the period: `Budget = 1 − SLO`.
Scope: measure through the user's eyes (end-to-end). In microservices, measure both at the component level and along end-to-end paths.
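To make the `Budget = 1 − SLO` relation concrete, here is a minimal sketch (the function name and values are illustrative) that converts an SLO target into an error budget and the equivalent full-downtime allowance over a 28-day window:

```python
# Sketch: converting an SLO target into an error budget.
def error_budget(slo: float) -> float:
    """Budget = 1 - SLO, as a fraction of requests (or of time)."""
    return 1.0 - slo

budget = error_budget(0.999)        # 99.9% SLO -> 0.1% budget
window_minutes = 28 * 24 * 60       # 28-day window = 40 320 minutes
downtime_allowance = budget * window_minutes
print(f"{budget:.4%} budget ~= {downtime_allowance:.1f} min of full downtime per 28d")
```

At 99.9% over 28 days this yields roughly 40 minutes of total allowed downtime, which is why a single badly handled incident can consume most of a monthly budget.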

2) SLI selection: what exactly to measure

The criterion is correlation with user experience and business value.

Typical SLIs:
  • Availability: share of successful requests. `SLI = successful / all`.
  • Latency: share of requests faster than a threshold T. `SLI = P(latency ≤ T)`.
  • Quality: share of correct responses (no 5xx or business-logic errors).
  • Data freshness: replication/ETL lag ≤ X minutes.
  • Business process performance: share of successful payments/registrations.

Anti-patterns: counting only HTTP 200 as "success" while ignoring business-level errors; measuring from a test network instead of the user's network.
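The anti-pattern above can be illustrated with a small sketch (the request records and the `business_ok` field are hypothetical): an availability SLI that counts business-level failures as errors, not just non-2xx HTTP codes:

```python
# Sketch: availability SLI that treats business-level failures as errors,
# not only HTTP-level ones (hypothetical request records).
requests = [
    {"status": 200, "business_ok": True},
    {"status": 200, "business_ok": False},  # HTTP 200, but the payment failed
    {"status": 503, "business_ok": False},
    {"status": 200, "business_ok": True},
]

good = sum(1 for r in requests if r["status"] < 500 and r["business_ok"])
sli = good / len(requests)
print(f"SLI = {sli:.0%}")  # 50% here, vs 75% if only HTTP status were counted
```

The gap between the two numbers (50% vs 75%) is exactly the degradation the naive "200 = success" SLI would hide.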

3) Formulas and observation windows

Availability per window:
  • `Availability = (OK_requests / All_requests) × 100%`.
SLO by latency:
  • `P95 ≤ T` is better formulated as a share: `SLI = % of requests ≤ T`.
  • Example: "99% of search queries ≤ 300 ms over 28 days."
  • Sliding window: 28 or 30 days (a balance of sensitivity and stability). For incidents, additional windows: 1 h, 6 h, 24 h.
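The "share of requests ≤ T" formulation can be sketched directly (the latency samples and the 300 ms threshold are illustrative, matching the search example above):

```python
# Sketch: latency SLI as "share of requests <= T" (hypothetical samples, ms).
latencies_ms = [120, 250, 180, 310, 290, 900, 220, 260, 270, 240]
T = 300  # threshold from the search-query example

sli = sum(1 for latency in latencies_ms if latency <= T) / len(latencies_ms)
print(f"SLI = {sli:.0%} of requests <= {T} ms")  # 80%
```

Unlike a raw P95 number, this share composes cleanly across windows and feeds directly into error-budget and burn-rate math.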

4) Error Budget and change rate control

Calculation: at `SLO = 99.9%`, the budget is `0.1%` of errors/unavailability per period.

Policies:
  • Budget > 50%: normal releases and planned experiments.
  • Budget 10-50%: only low-risk releases; tighten canaries.
  • Budget < 10%: release freeze, root-cause work, reliability improvements.
  • Link to progressive delivery: canary releases and feature flags "eat" the budget in doses, with auto-rollback on degradation.
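The policy tiers above can be encoded as a simple release gate (a sketch; the function name and return labels are illustrative, the thresholds come from this section):

```python
# Sketch of the error-budget release gate described above.
def release_policy(budget_remaining: float) -> str:
    """budget_remaining: fraction of the error budget still unspent (0..1)."""
    if budget_remaining > 0.50:
        return "normal"         # releases and planned experiments
    if budget_remaining >= 0.10:
        return "low-risk-only"  # tighten canaries
    return "freeze"             # root-cause work, reliability improvements

print(release_policy(0.62))  # normal
print(release_policy(0.25))  # low-risk-only
print(release_policy(0.04))  # freeze
```

Wiring such a gate into CI/CD is what makes the error budget an actual control on change velocity rather than a reporting number.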

5) Alert policies: from thresholds to burn rate

Why "the SLO dropped - raise an alert" is not enough: by then it is too late. Alerting must be proactive.

Burn Rate (BR) - budget burn rate:
  • `BR = observed error rate in a short window / error rate allowed in that window`.
  • If `BR > 1`, the budget is being consumed faster than allowed.
Two-window alerts (SRE best practice):
  • Fast alert (sensitive but noisier; catches disasters): window 5-10 minutes, BR threshold 14-20×.
  • Slow alert (catches creeping degradation): window 1-6 hours, BR threshold 2-4×.
  • Combine the conditions: if either the fast or the slow alert fires, page on-call.
  • Severity levels: pager for user-facing SLOs, tickets/notifications for gray degradation of internal SLIs.
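The two-window check can be sketched as follows (thresholds are the ones from this section; the error-rate inputs and function names are hypothetical):

```python
# Sketch of the two-window burn-rate paging condition described above.
def burn_rate(error_rate: float, slo: float) -> float:
    """BR = observed error rate / error rate allowed by the SLO."""
    return error_rate / (1.0 - slo)

def should_page(err_fast: float, err_slow: float, slo: float = 0.999) -> bool:
    fast = burn_rate(err_fast, slo) >= 14.0  # 5-10 min window: catastrophic burn
    slow = burn_rate(err_slow, slo) >= 3.0   # 1-6 h window: creeping degradation
    return fast or slow

print(should_page(err_fast=0.02, err_slow=0.0005))  # True: fast-window BR = 20x
print(should_page(err_fast=0.001, err_slow=0.001))  # False: BR ~ 1x in both windows
```

The fast window reacts within minutes to outages; the slow window catches a steady 3x burn that would quietly exhaust the budget in about a third of the period.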

6) Observability and sources of truth

Logs - diagnosing causes.
Metrics - numerical SLIs (success/error counts, latency percentiles, ratios, counters).
Traces - end-to-end paths, localizing "hot" segments.
Synthetics - active probes from the edge (region-aware).
Real user events - RUM/client telemetry, business metrics (conversion, successful payments).

Requirements: a single picture across release and incident dashboards, with "version/canary/flag" annotations.

7) SLO design: step-by-step template

1. Describe the critical path (for example, "deposit by card").
2. Define SLI: success/error, latency threshold, completeness.
3. Agree SLO: 28-day target + exceptions (scheduled windows).
4. Link to the SLA: the legal obligation must be ≤ the actual SLO.
5. Assign a service owner, RACI and alert channel.
6. Define alert policies (two-window BR) and auto-rollbacks.
7. Implement reporting: weekly budget reviews, post-incident reviews.
8. Review SLOs quarterly (load/architecture change).
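The steps above can be captured as a structured SLO record (a sketch; the field names are illustrative, not a standard schema, and the "deposit by card" values are the example from step 1):

```python
# Sketch: an SLO definition record following the template above.
from dataclasses import dataclass

@dataclass
class SLO:
    service: str            # critical path (step 1)
    sli: str                # what is measured (step 2)
    target: float           # agreed target, e.g. 0.999 (step 3)
    window_days: int = 28
    owner: str = ""         # service owner (step 5)
    alert_channel: str = ""

    @property
    def error_budget(self) -> float:
        return 1.0 - self.target

deposit = SLO(
    service="deposit-by-card",
    sli="successful authorizations / all attempts",
    target=0.999,
    owner="payments-team",
    alert_channel="#oncall-payments",
)
print(f"{deposit.service}: budget = {deposit.error_budget:.3%} over {deposit.window_days}d")
```

Keeping such records in version control (SLO-as-code) also provides the audit evidence mentioned in section 12.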

8) SLO examples (templates)

Payment API:
  • Availability: `≥ 99.95%` (28d, excluding announced maintenance windows ≤ 30 min/month).
  • Latency: `≥ 99%` of responses `≤ 400 ms`.
  • Business success: `≥ 98.5%` successful authorizations (fraud filters taken into account).
Search for games/content:
  • Latency: `≥ 99%` of requests `≤ 300 ms`.
  • Cache freshness: `≤ 5 min` lag 99% of the time.
Streaming events (KYC/AML):
  • Delivery: `≥ 99.9%` within `≤ 60 s` (end-to-end, including retries).
  • Loss: `≤ 0.01%` of messages (with idempotency/deduplication enabled).

9) Multi-region and multi-tenant

SLO "by cohort": country, payment provider, VIP segment, device.
Local SLOs at the edge: metrics from the points closest to the user (edge/PoP).
Aggregation: Total SLO should not hide failures across important cohorts.
Switching providers: automatic fallback routes at the SLO gate level.
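Why aggregation must not hide cohort failures can be shown with a small sketch (the country-level counts are hypothetical):

```python
# Sketch: per-cohort availability so a healthy global average cannot
# mask a failing cohort (hypothetical counts).
cohorts = {
    "DE":  {"ok": 99_800, "total": 100_000},
    "BR":  {"ok": 49_900, "total": 50_000},
    "VIP": {"ok": 880,    "total": 1_000},  # small but business-critical
}

global_ok = sum(c["ok"] for c in cohorts.values())
global_total = sum(c["total"] for c in cohorts.values())
print(f"global: {global_ok / global_total:.2%}")  # 99.72%, looks fine

for name, c in cohorts.items():
    availability = c["ok"] / c["total"]
    flag = "  <-- below a 99% SLO" if availability < 0.99 else ""
    print(f"{name}: {availability:.2%}{flag}")
```

The global figure passes a 99% target while the VIP cohort sits at 88%: exactly the failure mode that per-cohort SLOs are meant to expose.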

10) Dashboards and reporting

Release dashboard: version, canary (% of traffic), SLIs (success/latency), BR, flag annotations.
Operations dashboard: budget burn-down by day, top incidents, MTTR, problem cohorts.
Weekly reports: budget balance, BR trends, technical debt (bottlenecks), improvement plan.

11) Processes: Incidents, RCAs and Improvements

Incident management: alert → assess BR → adjust canary scale/feature flags → rollback/fix.
RCA (root cause analysis): facts/timelines/hypotheses/corrections/verifying the effect via SLIs.
Lessons learned: blameless post-mortems, mandatory action items with owners and deadlines.
Closing the loop: changes to tests, feature flags, limits, caches, retries, quotas.

12) Compliance and audit

SLO/SLI as control artifacts (policy-as-code, immutable logs).
Link to requirements (for example, availability of payment transactions).
Evidence: alert minutes, budget reports, release/rollback logs.

13) Frequent mistakes and how to avoid them

"99.99% or death": unattainable targets → constant alert noise. Choose realistic SLOs.
Global averages hide local dips → introduce cohorts.
Metrics are not end-to-end: SLOs look high while clients actually degrade → add RUM/synthetics.
Single-threshold alerts → switch to the two-window burn rate.
No link to changes → annotate releases and add auto-rollback.

14) Mini Implementation Checklist

  • Critical paths and their SLI/SLO are described.
  • Monitoring and exclusion windows are set up.
  • Two-window BR alerts (fast and slow) are configured.
  • Release and operations dashboards with version/flag annotations.
  • The error budget policy affects releases.
  • Regular budget reviews and post-incident RCAs.
  • Documentation and scorecard owners.

15) Calculation example (specifics)

API availability SLO: 99.9% over 28 days → budget = 0.1%.
Over 7 days, 0.06% of errors accumulated → 60% of the budget is used.
Over a short 15-minute window, 2% of requests are failing. At an even burn, that window's allowance is only `0.1% × (15 min / 40 320 min) ≈ 0.000037%` of requests.
Burn Rate = `2% / 0.1% = 20×` ≫ 1 → the fast pager fires, the canary rolls back to 1%, the degrade-payments-UX feature flag is enabled, and an RCA starts.
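The arithmetic of this example can be reproduced directly (same numbers as in this section):

```python
# Reproducing the section-15 calculation.
slo = 0.999
budget = 1.0 - slo                 # 0.1% over 28 days
window_min = 28 * 24 * 60          # 40 320 minutes

# 7-day accumulation: 0.06% of errors observed.
used = 0.0006 / budget
print(f"budget used after 7 days: {used:.0%}")  # 60%

# 15-minute spike: 2% of requests failing.
window_share = budget * 15 / window_min         # allowance at an even burn
print(f"allowed in 15 min at even burn: {window_share:.6%}")  # 0.000037%
br = 0.02 / budget                               # burn rate vs the allowed rate
print(f"burn rate: {br:.0f}x")                   # 20x -> fast pager fires
```

Note the two different quantities: `window_share` is how much of the total budget an evenly burning 15-minute window may spend, while `br` compares the observed error rate against the allowed rate and drives the paging decision.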

16) The bottom line

SLA/SLO monitoring is not just numbers in a report but a mechanism for managing change risk and service quality. Correct SLIs, realistic SLOs, error-budget management, two-window burn-rate alerts, and end-to-end observability turn metrics into working decisions: ship value faster while keeping the user experience predictable.
