Uptime Reports and SLA Audits
1) Why do we need a formal uptime reporting process?
Customer confidence and contract transparency - a single measurement methodology and repeatable calculations.
SLO and error budget management - tying measured availability to releases and incidents.
Correct SLA credits - objective formulas, predictable payouts/offsets.
Legal defensibility - an evidence base, independent audit, Legal Hold.
2) Terms and boundaries
Availability SLI - the percentage of successful checks/transactions over a period.
SLO - internal target (e.g. 99.95% over 28 days).
SLA - external commitment (e.g. 99.9%/month + service credits).
Measurement window - calendar month (SLA) and rolling window (SLO).
Scope - which components are included in the calculation (edge, API, payments) and which are not (admin portal, non-prod).
3) Sources of truth (and when which one is in charge)
1. Synthetics (blackbox/headless) - the primary SLI for availability as the user sees it.
2. Logs/metrics - confirm the scale and nature of the failure.
3. Business events - transaction success (for example, payment authorized).
4. Status page - public communication; reconciled against sources 1-3.
In case of discrepancies: synthetics with a valid quorum from ≥2 regions take priority.
4) Availability calculation methodology
4.1 Basic formula
Availability = successful_checks / total_checks
ErrorBudget = 1 − SLO
Downtime(min) = (1 − Availability) × period_duration_minutes
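As a sanity check: a 99.9% SLA over a 30-day month (43,200 minutes) leaves an error budget of 0.1%, i.e. about 43.2 minutes of allowed downtime. A minimal Python sketch of the same formulas (function names are illustrative):

```python
def downtime_budget_minutes(slo: float, period_minutes: int) -> float:
    """Error budget expressed as allowed downtime in minutes."""
    return (1 - slo) * period_minutes


def downtime_minutes(availability: float, period_minutes: int) -> float:
    """Downtime implied by the measured availability."""
    return (1 - availability) * period_minutes


month = 30 * 24 * 60  # 43,200 minutes
print(round(downtime_budget_minutes(0.999, month), 1))  # 43.2
print(round(downtime_minutes(0.9993, month), 1))        # 30.2
```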
4.2 Multi-region quorum
An incident is counted if ≥N independent regions/ASNs simultaneously record a failure.
Recommended: N = 2 of 3 (EU/NA/APAC).
4.3 SLI types
HTTP SLI: status 2xx/3xx, latency ≤ T.
DNS/TLS SLI: NXDOMAIN/SERVFAIL/certificate expiry.
Business SLI: successful transactions / all attempts (excluding client-side failures).
4.4 Exceptions (documented)
Scheduled maintenance windows announced N hours in advance and observed.
Force majeure as defined in the SLA (for example, a provider- or IX-level disaster) - only with evidence and a public notice.
Client errors/limits (quota exceeded, 4xx).
5) Maintenance window policy
Time slots agreed in the contract (e.g. Sun 02:00-04:00 UTC).
`maintenance = true` markers on alerts/panels → excluded from the SLI (see the sketch after this list).
Notification lead time: at least 5 business days (or as specified in the contract).
Outside the window - counts toward SLA impact.
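A minimal sketch of excluding declared maintenance windows from the SLI, assuming each check carries a UTC timestamp and each window is a (start, end) pair; all names are illustrative:

```python
from datetime import datetime, timezone

# Declared maintenance windows (UTC), e.g. taken from the contract calendar
MAINTENANCE_WINDOWS = [
    (datetime(2025, 10, 5, 2, 0, tzinfo=timezone.utc),
     datetime(2025, 10, 5, 4, 0, tzinfo=timezone.utc)),
]


def in_maintenance(ts: datetime) -> bool:
    return any(start <= ts < end for start, end in MAINTENANCE_WINDOWS)


def availability(checks: list[tuple[datetime, bool]]) -> float:
    """checks: (timestamp, success) pairs; maintenance samples are excluded."""
    scored = [ok for ts, ok in checks if not in_maintenance(ts)]
    return sum(scored) / len(scored) if scored else 1.0
```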
6) Edge cases and rounding rules
Brownout (partial degradation): count the failure ratio (weighted downtime), not "0/1" (see the sketch after this list).
Flapping: the minimum unit of account is the sample interval (for example, 30-60 seconds) plus hysteresis (e.g. 2-5 minutes).
Clock drift: all times in UTC and ISO-8601; NTP synchronization.
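A sketch of weighted downtime with simple hysteresis, assuming a per-interval failure ratio is available (thresholds are illustrative):

```python
def weighted_downtime_minutes(
    samples: list[float],      # failure ratio (0.0-1.0) per interval
    interval_min: float = 1.0,
    open_after: int = 2,       # consecutive bad intervals to open an outage
    close_after: int = 5,      # consecutive clean intervals to close it
) -> float:
    """Brownouts contribute failure_ratio * interval, not the whole interval;
    the open/close streaks suppress flapping."""
    downtime, bad, good, is_open = 0.0, 0, 0, False
    for ratio in samples:
        if ratio > 0:
            bad, good = bad + 1, 0
            if not is_open and bad >= open_after:
                is_open = True
        else:
            good, bad = good + 1, 0
            if is_open and good >= close_after:
                is_open = False
        if is_open:
            downtime += ratio * interval_min
    return downtime


# Three 50%-failure intervals; the first is debounced -> 1.0 weighted minute
print(weighted_downtime_minutes([0.5, 0.5, 0.5, 0, 0, 0, 0, 0]))
```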
7) Examples of PromQL (synthetics → uptime)
HTTP probe success:
```promql
probe_success{job="blackbox-http"} == 1
```
p95 latency:
```promql
histogram_quantile(0.95, sum by (le, target) (rate(probe_http_duration_seconds_bucket[5m])))
```
SLA uptime for the month (share of successful checks; `probe_success` is 0 or 1):
```promql
# average of a 0/1 series over 30 days = fraction of successful probes
avg_over_time(probe_success[30d])
```
Failure quorum (≥2 of 3 regions within the last 3 minutes):
```promql
# "== bool 0" maps failures to 1 so they can be summed across regions
sum by (target) (max_over_time((probe_success == bool 0)[3m:])) >= 2
```
8) Examples of SQL (report aggregation)
Monthly uptime and downtime:
```sql
with checks as (
  select target, ts, success  -- success: 1/0
  from synthetic_checks
  where ts >= :from and ts < :to
),
agg as (
  select date_trunc('month', ts) as m, target,
         sum(success)::float / count(*) as availability
  from checks
  group by 1, 2
)
select m, target, availability,
       (1 - availability) * extract(epoch from (m + interval '1 month') - m) / 60 as downtime_minutes
from agg;
```
Status page reconciliation (incidents):
```sql
select a.m, a.target, a.downtime_minutes,
       s.incident_id, s.start_utc, s.end_utc
from monthly_downtime a
left join statuspage_incidents s
  on tstzrange(s.start_utc, s.end_utc) && tstzrange(a.m, a.m + interval '1 month');
```
9) Monthly report template (Customer-friendly)
```yaml
period: "2025-10-01..2025-10-31 (UTC)"
services:
  - name: "API Edge"
    sla: "99.90%"
    measured_availability: "99.93%"
    downtime:
      total: "30m 14s"
      windows:
        - start: "2025-10-12T03:12Z"
          end: "2025-10-12T03:38Z"
          impact: "EU+NA, HTTP 5xx spike, p95 > 2s"
          root_cause: "DB connection pool exhaustion"
          rca_link: "INC-20251012-0312"
    slo_budget:
      period_target: "0.10%"
      consumed: "0.07%"
  - name: "Payments API"
    sla: "99.95%"
    measured_availability: "99.97%"
summary:
  sla_breaches: 0
  service_credits: 0
  maintenance:
    announced: 2
    total_duration: "48m"
signatures:
  generated_at: "2025-11-01T10:00Z"
  report_id: "SLA-2025-10-API"
```
10) SLA credits: calculation and application
Credit table: for example, 99.0-99.5% → 5% of MRR; 98.0-99.0% → 10%, etc.
True-up: the credit is applied as a credit note to the next invoice.
Automation: if `measured_availability < sla` → `credit_note_rule.create()` (see the sketch after this list).
Customer-facing view: an "SLA credits balance" card in the portal.
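A minimal sketch of the tier lookup, using the example table above (the tiers and MRR are illustrative, and `credit_note_rule.create()` would stand in for the billing call):

```python
# (lower bound, upper bound, credit as a share of MRR) - example tiers
CREDIT_TIERS = [
    (0.990, 0.995, 0.05),  # 99.0-99.5%  -> 5% of MRR
    (0.980, 0.990, 0.10),  # 98.0-99.0%  -> 10% of MRR
    (0.000, 0.980, 0.25),  # below 98.0% -> 25% of MRR (example)
]


def sla_credit(measured: float, sla: float, mrr: float) -> float:
    """Credit-note amount for the month; 0 if the SLA was met."""
    if measured >= sla:
        return 0.0
    for low, high, share in CREDIT_TIERS:
        if low <= measured < high:
            return round(mrr * share, 2)
    return 0.0


print(sla_credit(measured=0.9932, sla=0.999, mrr=10_000))  # 500.0
```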
11) Audit, Evidence and Legal Hold
Audit trail: who/what/when calculated, methodology version, checksums.
Raw data is immutable (append-only); adjustments are made as separate records.
Legal Hold: freezing the data range (samples, logs, incident cards, alerts).
Archive replicas - WORM/S3 Object Lock (see the sketch below).
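A sketch of archiving a report snapshot under S3 Object Lock with boto3 (the bucket must be created with Object Lock enabled; the bucket, key, and retention here are illustrative):

```python
import hashlib
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
body = open("SLA-2025-10-API.pdf", "rb").read()

s3.put_object(
    Bucket="sla-evidence-archive",  # illustrative; Object Lock enabled at creation
    Key="reports/2025/10/SLA-2025-10-API.pdf",
    Body=body,
    ObjectLockMode="COMPLIANCE",    # WORM: retention cannot be shortened or removed
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=3 * 365),
    Metadata={"sha256": hashlib.sha256(body).hexdigest()},  # checksum for the audit trail
)
```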
12) Reconciliation with public status page
An incident on the status page must have a timeline and affected components.
A time/scope mismatch → a discrepancy record is created and an RCA is published.
The report summary includes a Reconciliation Notes section.
13) Incidents and Reporting
Each downtime window maps to an INC card (ID, SEV, owner, RCA, CAPA).
In the report: link to the INC, short root cause, CAPA status.
For SEV-1: postmortem ≤ 48 hours from closure.
14) Data quality control
Sample hygiene: > 99% successful agent scrapes, no gaps > 5 minutes.
Anti-noise: quorum + multi-window, debounce.
Trace/log sampling rates are recorded and documented.
Method tests: unit tests for the calculations, golden files based on historical data (see the sketch below).
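A sketch of a golden-file method test (pytest style; `sla_report.compute_availability` and the file layout are hypothetical):

```python
import json
from pathlib import Path

from sla_report import compute_availability  # hypothetical calculation module


def test_availability_matches_golden_file():
    """Recompute a frozen historical month and compare with the stored result."""
    case = json.loads(Path("golden/2025-10-api-edge.json").read_text())
    result = compute_availability(case["checks"])
    assert abs(result - case["expected_availability"]) < 1e-6
```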
15) Security and privacy
TLS/mTLS for ingest, payload signing (HMAC) - see the sketch after this list.
PII redaction in logs/reports; the SLA report must not disclose personal data.
RBAC/ABAC on reports; access is written to the audit log.
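A sketch of HMAC payload signing for ingest/webhooks (the secret source and the `X-Signature` header name are illustrative):

```python
import hashlib
import hmac
import os

SIGNING_KEY = os.environ["SLA_INGEST_HMAC_KEY"].encode()  # shared secret (illustrative)


def sign(payload: bytes) -> str:
    """Hex digest to send, e.g. in an X-Signature header."""
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str) -> bool:
    """Receiver side: constant-time comparison against the recomputed digest."""
    return hmac.compare_digest(sign(payload), signature)
```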
16) Dashboards and SLO widgets (what to show)
Overall availability by service for the month/quarter.
Downtime windows with severity and detection channel.
Error budget burn (fast/slow) and trends.
Releases overlay - release annotations on the availability charts.
SLA credits forecast - based on the current trend.
17) Implementation plan (3 iterations)
1. Model and data (2 weeks): fix the SLI/SLO/SLA definitions, enable quorum synthetics, land raw check data in the DWH.
2. Calculation and report (2-3 weeks): formulas, SQL/PromQL, YAML/PDF templates, customer portal, auto-credits.
3. Audit and automation (3-4 weeks): Legal Hold, reconciliation with status page, signed webhooks, dispute regulations.
18) Report quality checklist
- Scope, SLI, method and measurement window defined.
- Quorum and multi-window checks in place; flapping is suppressed.
- Exceptions (maintenance/force majeure) are documented.
- Each downtime window is associated with INC and RCA.
- SLA credits calculated and reflected in billing.
- The report is reproducible (formula/data versions).
- Audit trail and Legal Hold are included.
- The public status page is reconciled.
19) Mini-FAQ
Why is synthetics the main source?
It is closest to the user path and covers the perimeter (DNS/CDN/WAF). Metrics/logs clarify the cause.
How to count partial degradation?
Weighted downtime: failure ratio × window duration, not "all or nothing."
Do I need to store "raw" checks?
Yes. Raw data is required for audits and for recalculation in a dispute.
Result
Uptime reports and SLA audits are not a "number at the end of the month" but a reproducible system of measurements, rules, and evidence: correct SLIs, quorum checks, transparent formulas, linkage to incidents and billing, exception control, and Legal Hold. Write down the methodology, automate the calculation and the credits, keep the audit trail - and your SLAs will become manageable, understandable, and defensible.