SLO/SLA and Metrics
1) Terms and hierarchy
SLI (Service Level Indicator) - a measurable indicator of the service "as the user sees it": the share of successful requests, p95 latency, data freshness, the share of successfully processed batches, etc.
SLO (Service Level Objective) - a target SLI value over an observation window (28/30/90 days). Example: "99.9% of requests to /pay complete in ≤ 400 ms."
Error budget - 1 − SLO. At SLO 99.9%, the error budget is 0.1% of time/requests.
SLA (Service Level Agreement) - a legally binding service-level commitment: includes the SLO, the measurement method, exceptions, and compensations/penalties.
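As a quick sanity check, the relationship between an SLO, its error budget, and the downtime that budget allows can be sketched in a few lines of Python (the numbers are illustrative):

```python
# Convert an SLO into its error budget and the downtime it allows per window.
def error_budget(slo: float) -> float:
    """Error budget as a fraction: 1 - SLO."""
    return 1.0 - slo

def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime a time-based SLO allows over the window."""
    return error_budget(slo) * window_days * 24 * 60

# SLO 99.9% over 30 days -> 0.1% budget, roughly 43.2 minutes of downtime
print(error_budget(0.999), allowed_downtime_minutes(0.999, 30))
```

The same arithmetic shows why each extra nine cuts the allowed downtime by 10x: 99.99% over 30 days leaves only about 4.3 minutes.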
2) Design principles
Symptoms > internal metrics. SLIs should reflect the actual user experience, not implementation details.
A small number of key SLIs. For a service, 2-5 main ones: success, latency, throughput/freshness, correctness.
Coverage of critical paths. Each business scenario (checkout, login, webhook, ETL load) gets its own set of SLIs/SLOs.
Strict "success" semantics. Not "HTTP 200," but "the user received a valid response on time."
Separation of external and internal SLOs. Internal SLOs are stricter; the external SLA is set 1-2 nines looser.
3) SLI catalog (reference)
3.1 API/Online Services
Success: `SLI_success = 1 − (5xx + timeouts + business_errors) / all_requests`
Latency: p95/p99 of `http_server_duration_seconds` by route/method/tenant
Throughput: `RPS` / limits / quotas
Correctness: share of valid responses (signatures, schemas, invariants)
3.2 Webhooks/Asynchronous Deliveries
Delivery: share of events acknowledged within T seconds and ≤ N retries
Customers: share of subscribers without long delays (per tenant)
3.3 Data/ETL/DWH
Freshness: `now − last_successful_ingest_ts`
Completeness: `ingested_rows / expected_rows`
Correctness: share of records that pass quality checks
Pipelines: share of jobs completed before the deadline
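A minimal sketch (Python; the timestamps and row counts are illustrative) of how the freshness and completeness SLIs above can be computed from ingest metadata:

```python
# Compute the data SLIs above from hypothetical ingest metadata.
def freshness_seconds(last_successful_ingest_ts: float, now: float) -> float:
    """Freshness: now - last_successful_ingest_ts (in seconds)."""
    return now - last_successful_ingest_ts

def completeness(ingested_rows: int, expected_rows: int) -> float:
    """Completeness: ingested_rows / expected_rows."""
    return ingested_rows / expected_rows if expected_rows else 1.0

now = 1_700_000_000.0   # fixed "current" timestamp for a reproducible example
last_ok = now - 540     # last successful ingest 9 minutes ago
print(freshness_seconds(last_ok, now) / 60)  # 9.0 minutes -> within a 10-min SLO
print(completeness(9_950, 10_000))           # 0.995
```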
3.4 Mobile/Client SDKs
Client success: share of sessions without fatal errors
Round-trip latency: p95 from request to render
Cache hits: share of responses served from cache (as a performance symptom)
4) Formulas and example targets
Availability (request-based):
- `SLI_req_avail = 1 − (failed_requests / total_requests)`
- `SLO_req_avail = 99.95%` over 30 days → error budget = 0.05% of requests.
Availability (time-based):
- `uptime = (obs_window − downtime) / obs_window`
Latency:
- `SLO_latency = p95(route="/pay") ≤ 400 ms` over 7-day slices, excluding cache warm-ups (1%).
Freshness:
- `SLO_freshness(dataset="orders") ≤ 10 min` at p99 over 24 hours.
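The request-based and time-based availability formulas above, as a runnable Python sketch (the counts are made up):

```python
# Request-based vs time-based availability, per the formulas above.
def sli_req_avail(failed_requests: int, total_requests: int) -> float:
    """SLI_req_avail = 1 - failed_requests / total_requests."""
    return 1.0 - failed_requests / total_requests

def uptime(obs_window_seconds: float, downtime_seconds: float) -> float:
    """uptime = (obs_window - downtime) / obs_window."""
    return (obs_window_seconds - downtime_seconds) / obs_window_seconds

month = 30 * 24 * 3600
print(sli_req_avail(400, 1_000_000))  # 0.9996 -> meets SLO_req_avail = 99.95%
print(uptime(month, 120))             # two minutes of downtime in a 30-day window
```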
5) Error budget and change management
Budget (B): `B = 1 − SLO`.
Burn - the ratio of actual errors to the errors allowed by the budget.
- Overspend (burn > 1) → feature freeze, focus on reliability.
- At burn rate > X in a short window - declare an incident and take mitigation measures.
- Planning: the share of the sprint devoted to reliability scales with burn over the past period.
6) Alerting: burn rate and multi-window rules
The idea: catch both fast burns and slow drift.
Example (SLO 99.9%, budget 0.1%):
- Critical: "2% of the budget in 1 hour" (fast fire).
- Warning: "5% of the budget in 6 hours" (creeping degradation).
- Short window (minutes to an hour) for fast incidents.
- Long window (6-24 hours) for trends.
- Latency: alert on p99 > threshold for ≥ 5 min, with flapping suppression and links to trace exemplars.
```
error_ratio_5m = errors[5m] / requests[5m]
error_ratio_1h = errors[1h] / requests[1h]
burn_5m = error_ratio_5m / error_budget_fraction
burn_1h = error_ratio_1h / error_budget_fraction

alert_critical if burn_1h > 14 and burn_5m > 14
alert_warning  if burn_6h > 6  and burn_30m > 6
```
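The multi-window rules can be sketched in Python. The 14 and 6 thresholds are the commonly used defaults for a 99.9% SLO; the window names and counts below are illustrative:

```python
# Sketch of multi-window, multi-burn-rate alert classification.
def burn_rate(errors, requests, error_budget_fraction):
    """Burn = observed error ratio / allowed error ratio (budget fraction)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / error_budget_fraction

def classify(windows, budget):
    """windows: dict mapping window name ("5m", "1h", ...) to (errors, requests)."""
    burn = {w: burn_rate(e, r, budget) for w, (e, r) in windows.items()}
    if burn.get("1h", 0) > 14 and burn.get("5m", 0) > 14:
        return "critical"   # fast, severe burn
    if burn.get("6h", 0) > 6 and burn.get("30m", 0) > 6:
        return "warning"    # slow, creeping burn
    return "ok"

budget = 1 - 0.999  # 0.1% error budget for SLO 99.9%
print(classify({"5m": (30, 1000), "1h": (300, 15000)}, budget))  # -> critical
```

Requiring both the short and the long window to exceed the threshold is what suppresses flapping: a brief spike trips only the short window and fires no alert.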
7) Multi-tenancy and segmentation
SLIs/SLOs are computed per tenant/plan/region; otherwise the overall median will "cover up" individual failures.
Require a minimum number of events for statistical significance (guard-rails).
SLAs can differ by plan (for example, Pro 99.9%, Free 99.5%).
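A sketch of the per-tenant guard-rail (Python; `MIN_EVENTS` and the tenant data are made up): tenants below the event threshold report `None` instead of a statistically meaningless number.

```python
from typing import Dict, Optional, Tuple

MIN_EVENTS = 1000  # illustrative significance threshold; tune per traffic profile

def per_tenant_sli(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Optional[float]]:
    """counts maps tenant -> (good_events, total_events).

    Returns the success SLI per tenant, or None when the sample is too
    small to be statistically meaningful (guard-rail).
    """
    return {
        tenant: (good / total if total >= MIN_EVENTS else None)
        for tenant, (good, total) in counts.items()
    }

print(per_tenant_sli({"acme": (99_950, 100_000), "tiny": (49, 50)}))
# -> {'acme': 0.9995, 'tiny': None}
```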
8) Link with observability and traces
SLI metrics come from histograms/counters with exemplars → jump directly to the "bad" traces.
Logs are the source of causes: timeouts, business error codes, limits.
For data, a link with lineage: "which job delayed the freshness metric."
9) Contracts and SLAs
SLA contents:
- Definitions of SLIs / measurement method / windows.
- Exceptions (planned maintenance, force majeure).
- Incident and communication procedure (status page, RFO/RCA).
- Service credits and the claims process.
- Jurisdiction, validity period, revision terms.
- Never publicly promise SLOs stricter than your architecture and operational practices allow.
- Separate internal SLOs from external SLAs.
10) Cost and prioritization
The price of nines does not grow linearly: "99.9% → 99.99%" is a different architecture class (N+1, multi-zone, active-active).
Put SLOs on the most valuable user actions.
Control telemetry cost: downsampling, quotas, replication, and storage classes.
11) Procedures and Reporting
Weekly reports: SLO attainment by service/tenant, budget spend, top causes, improvement plans.
Post-incident RCA: attribute incidents to their share of the budget; file tasks to eliminate root causes.
Feature freeze: entry/exit criteria.
12) Templates (for a quick start)
12.1 SLO card (example)

```
Service: Checkout API
SLI:
  success: 1 - (5xx + timeouts + biz_failures) / all
  latency_p95: p95(http_server_duration_seconds{route="/pay"})
SLO:
  success: 99.95% / 30d
  latency_p95: ≤ 400ms / 7d
Windows:
  primary: 30d rolling
  secondary: 7d rolling
Burn alerts:
  critical: burn 1h/5m > 14
  warning: burn 6h/30m > 6
Owner: Team Checkout
Tenancy: per-tenant (≥ 1k req/day threshold)
Dashboards: RED + trace exemplars
```
12.2 SLO Maturity Table
13) Example rules (fragments)
PromQL: success/errors/latency:

```promql
# Error ratio (5xx) on /pay
sum(rate(http_requests_total{route="/pay",code=~"5.."}[5m]))
/ sum(rate(http_requests_total{route="/pay"}[5m]))

# p99 latency
histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket{route="/pay"}[5m])) by (le))
```

Burn-rate alerts (idea for rules):

```promql
# error_budget_fraction = 0.001 for SLO 99.9%
(err_rate_5m / 0.001 > 14) and (err_rate_1h / 0.001 > 14)   # critical
(err_rate_30m / 0.001 > 6) and (err_rate_6h / 0.001 > 6)    # warning
```
Data freshness:

```promql
# Orders data lag (minutes)
(time() - max(last_ingest_ts_seconds{dataset="orders"})) / 60
```
14) SLOs for data and ML (specifics)
End-to-end data SLOs: p99 freshness, p99 completeness, reprocessing time after a failure.
Data contracts: expected schemas, volumes, deadlines; a contract violation → incident.
ML: SLOs for inference latency, an SLA for feature store availability, drift monitoring (model quality is a separate topic, outside the SLA).
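A minimal sketch of a data-contract check as described above (Python; the column names, thresholds, and violation messages are all illustrative):

```python
# Check a batch against a data contract: expected schema, volume, deadline.
# Any returned violation would open an incident per the policy above.
def check_contract(batch_columns, row_count, arrived_ts,
                   expected_columns, min_rows, deadline_ts):
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    missing = set(expected_columns) - set(batch_columns)
    if missing:
        violations.append("missing columns: %s" % sorted(missing))
    if row_count < min_rows:
        violations.append("volume too low: %d < %d" % (row_count, min_rows))
    if arrived_ts > deadline_ts:
        violations.append("deadline missed")
    return violations

# A late, incomplete batch violates all three clauses of the contract.
print(check_contract({"id", "amount"}, 500, 110.0,
                     {"id", "amount", "ts"}, 1000, 100.0))
```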
15) Integration with security and privacy
SLI logs without PII/secrets; tokenization/masking.
Audit SLO/SLA changes and report publications in immutable logs.
For regulated paths (payments/PII): separate, stricter SLOs.
16) Checklists
Before launching a service/feature
- Success/latency/throughput/freshness SLIs defined.
- SLOs and windows defined; the error budget is calculated.
- Burn-rate alerts set up (short + long windows).
- RED dashboards + exemplars → traces; incident runbooks.
- Multi-tenant slices and significance thresholds.
- Feature-freeze and reporting procedures.
Operation
- Weekly SLO/burn report, hardening plans.
- Re-evaluate SLOs when architecture/load changes.
- Periodic incident drills and runbook updates.
- Monitor telemetry cost and the number of SLIs.
17) Runbooks
Runbook: sudden p99 growth on /pay
1. Alert on p99 > threshold → open the dashboard → follow an exemplar to the trace.
2. Find the bottleneck CLIENT/SERVER span; compare regions/versions.
3. Enable degradation (cache/limits/fallback); notify the dependency's owning team.
4. After stabilization: RCA, optimization tasks, updated SLO measurements.
Runbook: budget spend > 50% in a week
1. Freeze features, raise the priority of reliability work.
2. Cluster errors by route/tenant/dependency.
3. Roll out fixes → confirm the trend recovers.
4. Retrospective and alert/threshold adjustment.
18) FAQ
Q: How many SLOs do you need?
A: A minimum covering critical user scenarios: success + latency. Everything else as needed.
Q: Which is better: availability by time or by requests?
A: Request-based is the more user-centric metric. Time-based is convenient for network/infra components.
Q: Why p95 and not the average?
A: The average hides the tail; users feel p95/p99.
Q: How do you avoid over-tightening the targets?
A: Start with realistic targets (from historical data), then tighten as you mature.
- "Observability: logs, metrics, traces"
- "Distributed Traces"
- "Audit and immutable logs"
- "Webhook Delivery Guarantees"
- "In Transit/At Rest Encryption"
- "Data Origin (Lineage)"