GH GambleHub

SLO/SLA and Metrics

1) Terms and hierarchy

SLI (Service Level Indicator) — a measurable indicator of the service "as the user sees it": share of successful requests, p95 latency, data freshness, share of successfully processed batches, etc.

SLO (Service Level Objective) — the target SLI value over an observation window (28/30/90 days). Example: "99.9% of requests to /pay complete in ≤ 400 ms."

Error budget — 1 − SLO. At SLO 99.9%, the error budget is 0.1% of time/requests.

SLA (Service Level Agreement) — a legally binding service level: includes the SLO, the measurement method, exclusions, and compensations/penalties.
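To make the error-budget arithmetic concrete, a minimal Python sketch (the traffic numbers are illustrative):

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Allowed number of failed requests for a request-based SLO."""
    return round(total_requests * (1 - slo))

# SLO 99.9% over 10M requests in the window:
# the budget is 0.1% of requests.
print(error_budget(0.999, 10_000_000))  # 10000 failed requests allowed
```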

2) Design principles

Symptoms over internal metrics. SLIs should reflect the actual user experience.
A small number of key SLIs. Per service, 2–5 core ones: success, latency, throughput/freshness, correctness.
Coverage of critical paths. Each business scenario (checkout, login, webhook, ETL load) gets its own set of SLIs/SLOs.

Strict "success" semantics. Not "HTTP 200," but "the user received a valid result on time."

Separation of external and internal SLOs. Internal ones are stricter; the external SLA is 1–2 nines looser.

3) SLI catalog (reference)

3.1 API/Online Services

Success: `SLI_success = 1 − (5xx + timeouts + business_errors) / all_requests`

Latency: p95/p99 of `http_server_duration_seconds` by route/method/tenant

Throughput: `RPS` vs. limits/quotas

Correctness: proportion of valid responses (signatures, schemas, invariants)

3.2 Webhooks/Asynchronous Deliveries

Delivery: share of events acknowledged within T seconds and ≤ N retries

Consumers: share of subscribers without a long backlog (per tenant)
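The delivery SLI above can be sketched in Python; the event fields (`ack_latency_s`, `retries`) and the thresholds are illustrative assumptions, not an actual schema:

```python
def webhook_delivery_sli(events, t_seconds=5.0, max_retries=3):
    """Share of events acknowledged within T seconds and <= N retries."""
    ok = sum(1 for e in events
             if e["ack_latency_s"] <= t_seconds and e["retries"] <= max_retries)
    return ok / len(events)

events = [
    {"ack_latency_s": 0.8, "retries": 0},
    {"ack_latency_s": 2.1, "retries": 1},
    {"ack_latency_s": 9.0, "retries": 5},  # late and over the retry cap
]
# 2 of 3 events meet the delivery SLI
print(webhook_delivery_sli(events))
```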

3.3 Data/ETL/DWH

Freshness: `now − last_successful_ingest_ts`

Completeness: `ingested_rows / expected_rows`

Correctness: the proportion of records that passed quality checks

Pipelines: share of jobs completed before deadline
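The freshness and completeness SLIs can be sketched directly from the formulas above (timestamps and row counts are illustrative):

```python
def freshness_seconds(last_successful_ingest_ts: float, now: float) -> float:
    """Freshness SLI: now - last_successful_ingest_ts."""
    return now - last_successful_ingest_ts

def completeness(ingested_rows: int, expected_rows: int) -> float:
    """Completeness SLI: ingested_rows / expected_rows."""
    if expected_rows == 0:
        return 1.0  # nothing was expected, so nothing is missing
    return ingested_rows / expected_rows

# Data is 9 minutes old; 995 of 1000 expected rows arrived
print(freshness_seconds(1_000_000.0, now=1_000_540.0))  # 540.0 (seconds)
print(completeness(995, 1000))                          # 0.995
```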

3.4 Mobile/Client SDKs

Client success: proportion of sessions without fatal errors

Round-trip latency: p95 from request to render

Cache hits: percentage served from cache (as a symptom of performance)

4) Formulas and examples of goals

Availability (per request):
  • `SLI_req_avail = 1 − (failed_requests / total_requests)`
  • `SLO_req_avail = 99.95%` over 30 days → error budget = 0.05% of requests.
Availability (time-based):
  • `uptime = (obs_window − downtime) / obs_window`
Latency:
  • `SLO_latency = p95(route="/pay") ≤ 400 ms` on 7-day slices, excluding cache warm-up (1%)
Data freshness:
  • `SLO_freshness(dataset="orders") ≤ 10 min` at p99 over 24 hours.
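The availability formulas can be checked with a minimal Python sketch (the traffic and downtime numbers are illustrative):

```python
def sli_req_avail(failed: int, total: int) -> float:
    """Request-based availability: 1 - failed/total."""
    return 1 - failed / total

def uptime(obs_window_s: float, downtime_s: float) -> float:
    """Time-based availability over an observation window."""
    return (obs_window_s - downtime_s) / obs_window_s

# 50 failed out of 100k requests -> 0.9995, exactly a 99.95% SLO
print(sli_req_avail(50, 100_000))
# 259.2 s of downtime in a 30-day window -> the 99.99% boundary
print(uptime(30 * 24 * 3600, 259.2))
```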

5) Error budget and change management

Budget (B): `B = 1 − SLO`.
Burn: the ratio of actual errors to allowable errors.

Policies:
  • Budget overspent (burn > 1) → feature freeze, focus on reliability.
  • Burn rate > X in a short window → declare an incident and take mitigation measures.
  • Planning: the share of the sprint devoted to reliability scales with burn over the past period.

6) Alerting: burn rate and multi-window rules

The idea: catch both fast budget burn and slow drift.

Example (SLO 99.9%, budget 0.1%):
  • Critical: "2% of the budget in 1 hour" (fast burn).
  • Warning: "5% of the budget in 6 hours" (creeping degradation).
Rules:
  • Short windows (minutes to an hour) for fast incidents.
  • Long windows (6–24 hours) for trends.
  • Latency: alert on p99 > threshold held for ≥ 5 min, with flap suppression and links to trace exemplars.
Example expressions (logic):

```
error_ratio_5m = errors[5m] / requests[5m]
error_ratio_1h = errors[1h] / requests[1h]
burn_5m = error_ratio_5m / error_budget_fraction
burn_1h = error_ratio_1h / error_budget_fraction
alert_critical if burn_1h > 14 and burn_5m > 14
alert_warning  if burn_6h > 6 and burn_30m > 6
```
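The same multi-window logic as a small Python sketch (thresholds taken from the example above; error ratios are illustrative):

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate: 1.0 means errors arrive exactly at budget speed."""
    return error_ratio / (1 - slo)

def multi_window_alert(err_long: float, err_short: float,
                       slo: float, threshold: float) -> bool:
    """Fire only if BOTH the long and short windows exceed the threshold,
    which filters out short spikes (flapping)."""
    return (burn_rate(err_long, slo) > threshold and
            burn_rate(err_short, slo) > threshold)

slo = 0.999  # error budget fraction = 0.001
# 1.5% errors in both the 1h and 5m windows -> burn ~15 -> critical
print(multi_window_alert(0.015, 0.015, slo, threshold=14))   # True
# A brief 5m spike with a clean 1h window does not fire
print(multi_window_alert(0.0005, 0.015, slo, threshold=14))  # False
```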

7) Multi-tenant and segmentation

SLI/SLO are computed per tenant/plan/region, otherwise the aggregate will "paper over" failures.
Require a minimum number of events for statistical significance (guard-rails).
The SLA can differ by plan (e.g., Pro 99.9%, Free 99.5%).
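The guard-rail can be sketched as a per-tenant SLI that refuses to judge thin traffic; the threshold value is an illustrative assumption:

```python
from typing import Optional

MIN_EVENTS = 1000  # guard-rail: below this, the ratio is statistically noisy

def tenant_success_sli(errors: int, requests: int) -> Optional[float]:
    """Per-tenant success SLI; None means 'not enough data to judge'."""
    if requests < MIN_EVENTS:
        return None
    return 1 - errors / requests

print(tenant_success_sli(2, 50))      # None -> skip this tenant this window
print(tenant_success_sli(3, 10_000))  # ~0.9997
```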

8) Association with observability and traces

SLI metrics come from histograms/counters with exemplars → jump straight to the "bad" traces.
Logs are the source of causes: timeouts, business error codes, limits.

For data: a link with lineage ("which job delayed the freshness metric").

9) Contracts and SLAs

SLA content:
  • SLI/measurement method/window definitions.
  • Exceptions (planned work, force majeure).
  • Incident and communication procedure (status page, RFO/RCA).
  • Service credits and the claim procedure.
  • Jurisdiction, period of validity, terms of revision.
Recommendations:
  • Never publicly promise SLOs stricter than architecture and operational practices allow.
  • Separate internal SLOs and external SLAs.

10) Cost and prioritization

The price of nines grows non-linearly. "99.9% → 99.99%" is a different architecture class (N+1, multi-zone, active-active).
Put SLOs on the most valuable user actions.
Control the cost of telemetry: downsampling, quotas, replication, and storage classes.

11) Procedures and Reporting

Weekly reports: SLO attainment by service/tenant, budget spend, top causes, improvement plans.
Post-incident RCA: attribute the incident to its share of the budget; file tasks to eliminate root causes.
Feature freeze: entry/exit criteria.

12) Templates (for a quick start)

12.1 SLO card (example)


```
Service: Checkout API
SLI:
  success: 1 - (5xx + timeouts + biz_failures) / all
  latency_p95: p95(http_server_duration_seconds{route="/pay"})
SLO:
  success: 99.95% / 30d
  latency_p95: ≤ 400 ms / 7d
Windows:
  primary: 30d rolling
  secondary: 7d rolling
Burn alerts:
  critical: burn 1h/5m > 14
  warning: burn 6h/30m > 6
Owner: Team Checkout
Tenancy: per-tenant (≥ 1k req/day threshold)
Dashboards: RED + trace exemplars
```

12.2 SLO Maturity Table

| Level | Characteristics |
| --- | --- |
| 0 | No SLIs; CPU/memory alerts |
| 1 | SLIs exist; simple thresholds |
| 2 | SLOs with burn-rate alerts; reporting |
| 3 | Multi-tenant SLOs, feature freeze, planned reliability investment |
| 4 | End-to-end SLOs (client → backend → data), auto-remediation, canary SLOs |

13) Examples of rules (fragments)

PromQL: success/errors/latency:

```promql
# Error ratio (5xx) for /pay
sum(rate(http_requests_total{route="/pay",code=~"5.."}[5m]))
  / sum(rate(http_requests_total{route="/pay"}[5m]))

# p99 latency for /pay
histogram_quantile(0.99,
  sum(rate(http_server_duration_seconds_bucket{route="/pay"}[5m])) by (le))
```

Burn-rate alerts (idea for rules):

```promql
# error_budget_fraction = 0.001 for SLO 99.9%
(err_rate_5m / 0.001 > 14) and (err_rate_1h / 0.001 > 14)   # critical
(err_rate_30m / 0.001 > 6) and (err_rate_6h / 0.001 > 6)    # warning
```

Data freshness:

```promql
# Lag of the "orders" dataset (minutes)
(time() - max(last_ingest_ts_seconds{dataset="orders"})) / 60
```

14) SLO for data and ML (features)

End-to-end data SLOs: p99 freshness, p99 completeness, reprocessing time after a failure.
Data contracts: expected schemas, volumes, deadlines; a contract violation → incident.
ML: SLOs for inference latency, SLA for feature store availability, drift monitoring (model quality is a separate topic, outside the SLA).

15) Integration with security and privacy

SLI logs without PII/secrets; tokenization/masking.
Audit changes to SLOs/SLAs and report publications in immutable logs.
For regulated paths (payments/PII): separate, stricter SLOs.

16) Checklists

Before starting the service/features

  • Success/latency/throughput/freshness SLIs defined.
  • SLOs and windows defined; error budget calculated.
  • Burn-rate alerts set up (short + long windows).
  • RED dashboards + exemplars → traces; incident runbooks.
  • Multi-tenant slices and significance thresholds.
  • Feature-freeze and reporting procedure.

Operation

  • Weekly SLO/burn report, hardening plans.
  • Re-evaluate SLOs when architecture/load changes.
  • Periodic incident drills and runbook updates.
  • Monitor telemetry cost and SLI count.

17) Runbooks

Runbook: rapid p99 growth on /pay

1. Alert p99 > threshold → open the dashboard → follow an exemplar to the trace.
2. Find the slow CLIENT/SERVER span; compare regions/versions.
3. Enable graceful degradation (cache/limits/fallback); notify the dependency's owning team.
4. After stabilization: RCA, optimization tasks, update of SLO measurements.

Runbook: error budget spend > 50% in a week

1. Freeze features; raise the priority of reliability work.
2. Cluster errors by route/tenant/dependency.
3. Roll out fixes → confirm the trend recovers.
4. Retrospective and alert/threshold adjustment.
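Step 2 (error clustering) can be sketched with a counter over error events; the field names are illustrative assumptions:

```python
from collections import Counter

errors = [
    {"route": "/pay", "tenant": "acme", "cause": "timeout"},
    {"route": "/pay", "tenant": "acme", "cause": "timeout"},
    {"route": "/login", "tenant": "beta", "cause": "5xx"},
]

# Group error events by (route, tenant, cause) to find the dominant cluster
clusters = Counter((e["route"], e["tenant"], e["cause"]) for e in errors)
top, count = clusters.most_common(1)[0]
print(top, count)  # ('/pay', 'acme', 'timeout') 2
```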

18) FAQ

Q: How many SLOs do you need?
A: A minimum on critical user scenarios: success + latency. Add everything else only as needed.

Q: Which is better: time-based or request-based availability?
A: Request-based is the more user-centric metric. Time-based is convenient for network components/infra.

Q: Why p95, not average?
A: The average hides the tail; users feel p95/p99.
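A tiny sketch of why the average misleads (latencies are illustrative):

```python
lat = sorted([100] * 95 + [5000] * 5)  # 5% of requests hit a 5 s tail

def pct(values, q):
    """Nearest-rank percentile on a sorted list (simple illustration)."""
    return values[min(len(values) - 1, int(q * len(values)))]

mean = sum(lat) / len(lat)
print(mean)            # 345.0 -- the average looks almost healthy
print(pct(lat, 0.95))  # 5000 -- p95 exposes the tail the user feels
```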

Q: How do we avoid setting targets that are too strict?
A: Start with realistic goals (based on historical data), then tighten them as you mature.

Related Materials:
  • "Observability: logs, metrics, traces"
  • "Distributed Traces"
  • "Audit and immutable logs"
  • "Webhook Delivery Guarantees"
  • "In Transit/At Rest Encryption"
  • "Data Origin (Lineage)"