GambleHub

SLA and SLO monitoring

1) Terms and roles

SLA (Service Level Agreement) - external contractual obligation to the client (penalty clauses, credits).
SLO (Service Level Objective) - target internal service level that supports SLA execution.
SLI (Service Level Indicator) - measured indicator, on the basis of which SLO/SLA are evaluated.
Error Budget - the allowed share of errors/unavailability for the period: `Budget = 1 − SLO`.
Scope: measure through the user's eyes (end-to-end). In microservices, measure both at the component level and along end-to-end paths.
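To make the `Budget = 1 − SLO` relation concrete, here is a minimal sketch (the function name and values are illustrative) that converts an SLO target into an error budget and the equivalent full-downtime allowance over a 28-day window:

```python
# Sketch: converting an SLO target into an error budget.
def error_budget(slo: float) -> float:
    """Budget = 1 - SLO, as a fraction of requests (or of time)."""
    return 1.0 - slo

budget = error_budget(0.999)        # 99.9% SLO -> 0.1% budget
window_minutes = 28 * 24 * 60       # 28-day window = 40 320 minutes
downtime_allowance = budget * window_minutes
print(f"{budget:.4%} budget ~= {downtime_allowance:.1f} min of full downtime per 28d")
```

At 99.9% over 28 days this yields roughly 40 minutes of total allowed downtime, which is why a single badly handled incident can consume most of a monthly budget.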

2) SLI selection: what exactly to measure

The criterion is correlation with user experience and business value.

Typical SLIs:
  • Availability: share of successful requests. `SLI = successful / all`.
  • Latency: share of requests faster than a threshold T. `SLI = P(latency ≤ T)`.
  • Quality: share of correct responses (no 5xx or business-logic errors).
  • Data freshness: replication/ETL lag ≤ X minutes.
  • Business process performance: share of successful payments/registrations.

Anti-patterns: counting only HTTP 200 as "success" while ignoring business-level errors; measuring from a test network instead of the user's network.
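The anti-pattern above can be illustrated with a small sketch (the request records and the `business_ok` field are hypothetical): an availability SLI that counts business-level failures as errors, not just non-2xx HTTP codes:

```python
# Sketch: availability SLI that treats business-level failures as errors,
# not only HTTP-level ones (hypothetical request records).
requests = [
    {"status": 200, "business_ok": True},
    {"status": 200, "business_ok": False},  # HTTP 200, but the payment failed
    {"status": 503, "business_ok": False},
    {"status": 200, "business_ok": True},
]

good = sum(1 for r in requests if r["status"] < 500 and r["business_ok"])
sli = good / len(requests)
print(f"SLI = {sli:.0%}")  # 50% here, vs 75% if only HTTP status were counted
```

The gap between the two numbers (50% vs 75%) is exactly the degradation the naive "200 = success" SLI would hide.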

3) Formulas and observation windows

Availability per window:
  • `Availability = (OK_requests / All_requests) × 100%`.
SLO by latency:
  • `P95 ≤ T` is better formulated as a share: `SLI = % of requests ≤ T`.
  • Example: "99% of search queries ≤ 300 ms over 28 days."
  • Sliding window: 28 or 30 days (a balance of sensitivity and stability). For incidents, additional windows: 1 h, 6 h, 24 h.
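The "share of requests ≤ T" formulation can be sketched directly (the latency samples and the 300 ms threshold are illustrative, matching the search example above):

```python
# Sketch: latency SLI as "share of requests <= T" (hypothetical samples, ms).
latencies_ms = [120, 250, 180, 310, 290, 900, 220, 260, 270, 240]
T = 300  # threshold from the search-query example

sli = sum(1 for latency in latencies_ms if latency <= T) / len(latencies_ms)
print(f"SLI = {sli:.0%} of requests <= {T} ms")  # 80%
```

Unlike a raw P95 number, this share composes cleanly across windows and feeds directly into error-budget and burn-rate math.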

4) Error Budget and change rate control

Calculation: at `SLO = 99.9%`, the budget is `0.1%` of errors/unavailability per period.

Policies:
  • Budget > 50%: normal releases and planned experiments.
  • Budget 10-50%: only low-risk releases; tighten canaries.
  • Budget < 10%: release freeze, root-cause work, reliability improvements.
  • Link to progressive delivery: canary releases and feature flags "eat" the budget in doses, with auto-rollback on degradation.
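The policy tiers above can be encoded as a simple release gate (a sketch; the function name and return labels are illustrative, the thresholds come from this section):

```python
# Sketch of the error-budget release gate described above.
def release_policy(budget_remaining: float) -> str:
    """budget_remaining: fraction of the error budget still unspent (0..1)."""
    if budget_remaining > 0.50:
        return "normal"         # releases and planned experiments
    if budget_remaining >= 0.10:
        return "low-risk-only"  # tighten canaries
    return "freeze"             # root-cause work, reliability improvements

print(release_policy(0.62))  # normal
print(release_policy(0.25))  # low-risk-only
print(release_policy(0.04))  # freeze
```

Wiring such a gate into CI/CD is what makes the error budget an actual control on change velocity rather than a reporting number.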

5) Alert policies: from thresholds to burn rate

Why "the SLO dropped - raise an alert" is not enough: by then it is too late. Alerting must be proactive.

Burn Rate (BR) - budget burn rate:
  • `BR = observed error rate in a short window / error rate allowed in that window`.
  • If `BR > 1`, the budget is being consumed faster than allowed.
Two-window alerts (SRE best practice):
  • Fast alert (sensitive but noisier; catches disasters): window 5-10 minutes, BR threshold 14-20×.
  • Slow alert (catches creeping degradation): window 1-6 hours, BR threshold 2-4×.
  • Combine the conditions: if either the fast or the slow alert fires, page on-call.
  • Severity levels: pager for user-facing SLOs, tickets/notifications for gray degradation of internal SLIs.
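The two-window check can be sketched as follows (thresholds are the ones from this section; the error-rate inputs and function names are hypothetical):

```python
# Sketch of the two-window burn-rate paging condition described above.
def burn_rate(error_rate: float, slo: float) -> float:
    """BR = observed error rate / error rate allowed by the SLO."""
    return error_rate / (1.0 - slo)

def should_page(err_fast: float, err_slow: float, slo: float = 0.999) -> bool:
    fast = burn_rate(err_fast, slo) >= 14.0  # 5-10 min window: catastrophic burn
    slow = burn_rate(err_slow, slo) >= 3.0   # 1-6 h window: creeping degradation
    return fast or slow

print(should_page(err_fast=0.02, err_slow=0.0005))  # True: fast-window BR = 20x
print(should_page(err_fast=0.001, err_slow=0.001))  # False: BR ~ 1x in both windows
```

The fast window reacts within minutes to outages; the slow window catches a steady 3x burn that would quietly exhaust the budget in about a third of the period.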

6) Observability and sources of truth

Logs - diagnosing causes.
Metrics - numerical SLIs (success/error counts, latency percentiles, ratios, counters).
Traces - end-to-end paths, localizing "hot" segments.
Synthetics - active probes from the edge (region-aware).
Real user events - RUM/client telemetry, business metrics (conversion, successful payments).

Requirements: a single picture across release and incident dashboards, with "version/canary/flag" annotations.

7) SLO design: step-by-step template

1. Describe the critical path (for example, "deposit by card").
2. Define SLI: success/error, latency threshold, completeness.
3. Agree SLO: 28-day target + exceptions (scheduled windows).
4. Link to the SLA: the legal obligation must be ≤ the actual SLO.
5. Assign a service owner, RACI and alert channel.
6. Define alert policies (two-window BR) and auto-rollbacks.
7. Implement reporting: weekly budget reviews, post-incident reviews.
8. Review SLOs quarterly (load/architecture change).
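The steps above can be captured as a structured SLO record (a sketch; the field names are illustrative, not a standard schema, and the "deposit by card" values are the example from step 1):

```python
# Sketch: an SLO definition record following the template above.
from dataclasses import dataclass

@dataclass
class SLO:
    service: str            # critical path (step 1)
    sli: str                # what is measured (step 2)
    target: float           # agreed target, e.g. 0.999 (step 3)
    window_days: int = 28
    owner: str = ""         # service owner (step 5)
    alert_channel: str = ""

    @property
    def error_budget(self) -> float:
        return 1.0 - self.target

deposit = SLO(
    service="deposit-by-card",
    sli="successful authorizations / all attempts",
    target=0.999,
    owner="payments-team",
    alert_channel="#oncall-payments",
)
print(f"{deposit.service}: budget = {deposit.error_budget:.3%} over {deposit.window_days}d")
```

Keeping such records in version control (SLO-as-code) also provides the audit evidence mentioned in section 12.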

8) SLO examples (templates)

Payment API:
  • Availability: `≥ 99.95%` (28d, excluding announced maintenance windows ≤ 30 min/month).
  • Latency: `≥ 99%` of responses `≤ 400 ms`.
  • Business success: `≥ 98.5%` successful authorizations (fraud filters taken into account).
Search for games/content:
  • Latency: `≥ 99%` of requests `≤ 300 ms`.
  • Cache freshness: `≤ 5 min` lag 99% of the time.
Streaming events (KYC/AML):
  • Delivery: `≥ 99.9%` within `≤ 60 s` (end-to-end, including retries).
  • Loss: `≤ 0.01%` of messages (with idempotency/deduplication enabled).

9) Multi-region and multi-tenant

SLO "by cohort": country, payment provider, VIP segment, device.
Local SLOs at the edge: metrics from the points closest to the user (edge/PoP).
Aggregation: Total SLO should not hide failures across important cohorts.
Switching providers: automatic fallback routes at the SLO gate level.
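Why aggregation must not hide cohort failures can be shown with a small sketch (the country-level counts are hypothetical):

```python
# Sketch: per-cohort availability so a healthy global average cannot
# mask a failing cohort (hypothetical counts).
cohorts = {
    "DE":  {"ok": 99_800, "total": 100_000},
    "BR":  {"ok": 49_900, "total": 50_000},
    "VIP": {"ok": 880,    "total": 1_000},  # small but business-critical
}

global_ok = sum(c["ok"] for c in cohorts.values())
global_total = sum(c["total"] for c in cohorts.values())
print(f"global: {global_ok / global_total:.2%}")  # 99.72%, looks fine

for name, c in cohorts.items():
    availability = c["ok"] / c["total"]
    flag = "  <-- below a 99% SLO" if availability < 0.99 else ""
    print(f"{name}: {availability:.2%}{flag}")
```

The global figure passes a 99% target while the VIP cohort sits at 88%: exactly the failure mode that per-cohort SLOs are meant to expose.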

10) Dashboards and reporting

Release dashboard: version, canary (% of traffic), SLIs (success/latency), BR, flag annotations.
Operations dashboard: budget burn-down by day, top incidents, MTTR, problem cohorts.
Weekly reports: budget balance, BR trends, technical debt (bottlenecks), improvement plan.

11) Processes: Incidents, RCAs and Improvements

Incident management: alert → assess BR → adjust canary scale/feature flags → rollback/fix.
RCA (root cause analysis): facts/timelines/hypotheses/corrections/verifying the effect via SLIs.
Lessons learned: blameless post-mortems, mandatory action items with owners and deadlines.
Closing the loop: changes to tests, feature flags, limits, caches, retries, quotas.

12) Compliance and audit

SLO/SLI as control artifacts (policy-as-code, immutable logs).
Link to requirements (for example, availability of payment transactions).
Evidence: alert minutes, budget reports, release/rollback logs.

13) Frequent mistakes and how to avoid them

"99.99% or death": unattainable targets → constant alert noise. Choose realistic SLOs.
Global averages hide local dips → introduce cohorts.
Metrics are not end-to-end: SLOs look high while clients actually degrade → add RUM/synthetics.
Single-threshold alerts → switch to the two-window burn rate.
No link to changes → annotate releases and add auto-rollback.

14) Mini Implementation Checklist

  • Critical paths and their SLI/SLO are described.
  • Monitoring and exclusion windows are set up.
  • Two-window BR alerts (fast and slow) are configured.
  • Release and operations dashboards with version/flag annotations.
  • The error budget policy affects releases.
  • Regular budget reviews and post-incident RCAs.
  • Documentation and scorecard owners.

15) Calculation example (specifics)

API availability SLO: 99.9% over 28 days → budget = 0.1%.
Over 7 days, 0.06% of errors accumulated → 60% of the budget is used.
Over a short 15-minute window, 2% of requests are failing. At an even burn, that window's allowance is only `0.1% × (15 min / 40 320 min) ≈ 0.000037%` of requests.
Burn Rate = `2% / 0.1% = 20×` ≫ 1 → the fast pager fires, the canary rolls back to 1%, the degrade-payments-UX feature flag is enabled, and an RCA starts.
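The arithmetic of this example can be reproduced directly (same numbers as in this section):

```python
# Reproducing the section-15 calculation.
slo = 0.999
budget = 1.0 - slo                 # 0.1% over 28 days
window_min = 28 * 24 * 60          # 40 320 minutes

# 7-day accumulation: 0.06% of errors observed.
used = 0.0006 / budget
print(f"budget used after 7 days: {used:.0%}")  # 60%

# 15-minute spike: 2% of requests failing.
window_share = budget * 15 / window_min         # allowance at an even burn
print(f"allowed in 15 min at even burn: {window_share:.6%}")  # 0.000037%
br = 0.02 / budget                               # burn rate vs the allowed rate
print(f"burn rate: {br:.0f}x")                   # 20x -> fast pager fires
```

Note the two different quantities: `window_share` is how much of the total budget an evenly burning 15-minute window may spend, while `br` compares the observed error rate against the allowed rate and drives the paging decision.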

16) The bottom line

SLA/SLO monitoring is not just numbers in a report but a mechanism for managing change risk and service quality. Correct SLIs, realistic SLOs, error-budget management, two-window burn-rate alerts, and end-to-end observability turn metrics into working decisions: ship value faster while keeping the user experience predictable.
