Auditing Metrics and SLAs

1) Why you need it

If the metrics are wrong, the decisions based on them will be wrong too: SLAs may be violated only "on paper" or, conversely, look fine on paper while hiding real problems. Auditing metrics and SLAs ensures that promises to users and partners are comparable, reliable, and legally defensible.

Objectives:

Provide a single "source of truth" (SSOT) and reproducible calculations.
Reduce discrepancies between dashboards/reports/billing.
Make SLAs evidence-based.
Detect degradation in the measurements themselves as early as degradation in the services.

2) Basic concepts and boundaries of responsibility

Metric: measured quantity (RPS, p95, CR, GGR, Success Rate).
KPI/OKR: targets to which metrics are linked.
SLO: target quality of service (for example, "p99 ≤ 400 ms for 99.9% of the time").
SLA: external promise; legally significant, based on SLO.
OLA: internal agreement between teams/vendors, supports SLA.
SSOT: system/storage whose data is considered a reference for reporting.

3) Taxonomy of metrics (layers)

1. Infrastructure: CPU/Memory/IO/Net, pods/nodes, HPA/VPA.
2. Platform: queues/streams (lag, throughput), DB/caches (connections, hit), API (p95/p99, 5xx).
3. Business flows: deposits/withdrawals, bets, game launches, authorizations, KYC.
4. Product/marketing: conversions, ARPPU/LTV, campaigns.
5. Quality of processes: MTTA/MTTR, Change Failure Rate, checklist coverage.

Rule: Each metric must have a layer, owner, and formula.
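The rule above can be enforced mechanically when a catalog entry is added. The following is a minimal sketch, not a reference implementation: the `MetricDefinition` class, the `catalog_violations` helper, and the layer identifiers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Layer identifiers mirror the five layers of the taxonomy; the exact
# spelling is an assumption made for this sketch.
LAYERS = {"infrastructure", "platform", "business", "product", "process"}


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    layer: str
    owner: str
    formula: str


def catalog_violations(metric: MetricDefinition) -> list[str]:
    """Return human-readable rule violations for one catalog entry."""
    problems = []
    if metric.layer not in LAYERS:
        problems.append(f"{metric.name}: unknown layer {metric.layer!r}")
    if not metric.owner.strip():
        problems.append(f"{metric.name}: owner is missing")
    if not metric.formula.strip():
        problems.append(f"{metric.name}: formula is missing")
    return problems


good = MetricDefinition("api_success_rate", "platform",
                        "squad-observability", "success / requests")
bad = MetricDefinition("deposit_cr", "business", "", "")

print(catalog_violations(good))  # []
print(catalog_violations(bad))   # two violations: owner and formula
```

A check like this could run in CI for the catalog repository, so that an entry without a layer, owner, or formula never reaches a dashboard.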

4) Data sources and "truth"

Online telemetry: Prometheus/OTel, logs (ELK/ClickHouse), traces.
Events and accounting: Kafka/Outbox, DWH/data marts (BigQuery/ClickHouse).
Manual artifacts: post-mortems, tickets, incident registers.
External registries: provider reports (PSP/KYC/studios), billing.

Conflict resolution: when "online vs DWH" figures disagree, a documented priority rule applies (for example, SLA calculations use DWH aggregates with source traceability).

5) Metrics audit process (control loop)

1. Inventory: a catalog of metrics/SLOs/SLAs (name, owner, layer, formula, source, calculation frequency).
2. Formula verification: reconcile SQL/PromQL queries with the definition (unit tests of calculations).
3. Sampling and rechecking: sample events/log lines and reconcile them manually.
4. Pipeline comparison: compare online dashboards with DWH reports.
5. Change control: review formulas on every schema/logic release.
6. SLA audit: verify that aggregates and exceptions (planned maintenance, force majeure) are correct.
7. Report and improvements: a list of detected discrepancies and fixes with deadlines.

6) Definitions and formulas (samples)

Success Rate (API):

`success = requests - (5xx + timeouts + circuit_open)`
`success_rate = success / requests`
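As a worked example of the formula above, here is a small sketch in Python. It assumes the three failure categories are disjoint (each failed request is counted in exactly one of them); the function name and the sample numbers are illustrative.

```python
from typing import Optional


def success_rate(requests: int, errors_5xx: int, timeouts: int,
                 circuit_open: int) -> Optional[float]:
    """success = requests - (5xx + timeouts + circuit_open); rate = success / requests.

    Returns None when there was no traffic, so that "no data" is not
    reported as either 0% or 100%.
    """
    failures = errors_5xx + timeouts + circuit_open
    if min(requests, errors_5xx, timeouts, circuit_open) < 0:
        raise ValueError("counts must be non-negative")
    if failures > requests:
        raise ValueError("more failures than requests: categories overlap or data is broken")
    if requests == 0:
        return None
    return (requests - failures) / requests


# 10,000 requests with 12 + 5 + 3 = 20 failures.
print(success_rate(10_000, errors_5xx=12, timeouts=5, circuit_open=3))  # 0.998
```

Raising on `failures > requests` is a deliberate choice here: such input usually indicates double counting, which is exactly what a metrics audit is meant to surface.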

Latency p95/p99:

The SSOT fixes a single definition of the window (rolling 5m/1h) and of the aggregation method (HDR histogram/t-digest).
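Why the window and aggregation have to be fixed in one place can be shown with a few lines of Python. This sketch uses an exact nearest-rank percentile rather than HDR or t-digest, and the two latency windows are synthetic; it only illustrates that averaging per-window p99 values is not the same as taking p99 over all requests.

```python
import math


def percentile_nearest_rank(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Two synthetic windows of latencies in ms: a quiet one and a busy one.
quiet = [100] * 100                   # 100 requests, all fast
busy = [100] * 900 + [900] * 100      # 1000 requests, 10% slow

pooled_p99 = percentile_nearest_rank(quiet + busy, 99)
mean_of_window_p99 = (percentile_nearest_rank(quiet, 99)
                      + percentile_nearest_rank(busy, 99)) / 2

print(pooled_p99)          # 900
print(mean_of_window_p99)  # 500.0
```

Both numbers could legitimately be labelled "p99" on a dashboard, which is how two panels end up disagreeing about the same metric.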

SLO (example):

`SLO_availability_month = (uptime - allowable_exceptions) / total_time`

SLA (example for provider):

`SLA_month = 99.90% uptime over a UTC window, excluding planned windows (T-48 notification) and provable incidents at transit operators (documents).`
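To make a target such as 99.90% tangible, it helps to convert it into an allowed downtime budget. The sketch below does only this arithmetic; the function name is hypothetical and a 30-day month is assumed for the example.

```python
from datetime import timedelta


def downtime_budget(target: float, period: timedelta) -> timedelta:
    """Maximum downtime that still meets an availability target over a period."""
    if not 0 < target <= 1:
        raise ValueError("target must be a fraction in (0, 1]")
    return period * (1 - target)


budget = downtime_budget(0.999, timedelta(days=30))
print(budget)
```

For a 30-day month the budget at 99.90% is 0.1% of 43,200 minutes, that is about 43 minutes 12 seconds. Time excluded under the contract (planned windows, documented transit incidents) does not consume this budget.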

7) Data quality: checks and alerts

Quality checks:

Completeness: `received_events / expected_events ≥ 0.99`.
Timeliness: load lag ≤ N minutes.
Uniqueness: no duplicate keys (idempotency-key).
Consistency: amounts/currencies/signs agree.
Monotonicity: counters never "roll back."
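Three of the checks above (completeness, uniqueness, monotonic counters) are easy to express as small pure functions. This is a hedged sketch with hypothetical function names and toy inputs; a production pipeline would run the equivalent logic inside the DWH or the streaming layer.

```python
from collections import Counter


def completeness(received: int, expected: int) -> float:
    """Share of expected events that actually arrived."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return received / expected


def duplicate_keys(keys):
    """Idempotency keys that occur more than once, sorted for stable reports."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


def counter_rollbacks(samples):
    """Indices at which a cumulative counter is lower than the previous sample."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


print(completeness(990, 1000) >= 0.99)              # True
print(duplicate_keys(["a", "b", "a", "c", "b"]))    # ['a', 'b']
print(counter_rollbacks([1, 5, 9, 3, 4]))           # [3]
```

A rollback is not always an error (a process restart resets in-memory counters), so the result is better treated as a list of points to investigate than as an automatic failure.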

Alerts on measurement quality (ideas):


```
# Sketches in legacy Prometheus 1.x rule syntax; metric names are illustrative.
ALERT MetricsIngestionLagHigh
IF dwh_ingest_lag_minutes > 15 FOR 10m

ALERT EventsCompletenessDrop
IF (events_received / events_expected) < 0.99 FOR 15m

ALERT DuplicateEventsSpike
IF rate(events_duplicates_total[10m]) > baseline_7d * 2
```

8) SLA/OLA Audit: Methodology

1. Collect the calendar of exceptions: planned windows, agreed degradation, vendor statements.
2. Calculate uptime: in a single time zone, based on the SSOT.
3. Reconcile with incidents: timeline, tickets, post-mortems.
4. Attribution: own failures, provider, transit, DDoS, routine maintenance.
5. SLA scope: user experience (E2E) vs. one specific API.
6. Reporting: a monthly/quarterly report with actuals, deviations, compensations (if applicable), and corrective measures.
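Steps 1 and 2 of the methodology can be sketched as interval arithmetic in UTC. The code below assumes one possible convention: outage time that overlaps an agreed exclusion window is not counted, while the denominator stays the full period. A real contract may define this differently (for example, by shrinking the denominator), and the incidents shown are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc(*args) -> datetime:
    return datetime(*args, tzinfo=timezone.utc)


def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap(a, b) -> timedelta:
    """Length of the intersection of two intervals (zero if disjoint)."""
    return max(min(a[1], b[1]) - max(a[0], b[0]), timedelta(0))


def availability(period, outages, exclusions) -> float:
    """1 - counted_downtime / total_time, with excused overlap removed."""
    total = period[1] - period[0]
    excused_windows = merge(exclusions)
    counted = timedelta(0)
    for outage in merge(outages):
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        if clipped[0] >= clipped[1]:
            continue  # outage lies entirely outside the reporting period
        excused = sum((overlap(clipped, w) for w in excused_windows), timedelta(0))
        counted += (clipped[1] - clipped[0]) - excused
    return 1 - counted / total


october = (utc(2025, 10, 1), utc(2025, 11, 1))
outages = [
    (utc(2025, 10, 3, 2, 0), utc(2025, 10, 3, 3, 0)),       # 60 min, half of it planned
    (utc(2025, 10, 10, 12, 0), utc(2025, 10, 10, 12, 20)),  # 20 min, unplanned
]
planned = [(utc(2025, 10, 3, 2, 0), utc(2025, 10, 3, 2, 30))]

print(f"{availability(october, outages, planned):.4%}")  # 99.8880%
```

In this example 50 minutes are counted against a 31-day month, so the result falls just below a 99.90% target, while ignoring the planned window entirely would have counted 80 minutes.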

9) Check of calculation reproducibility

Formula versioning: a Git repository with SQL/PromQL/doc specifications.
Unit tests of metrics: on synthetic data (edge cases: gaps, duplicates, date boundaries).
Data lineage: from the dashboard back to source tables and events.
Snapshots: freeze data at the cutoff so that recalculations are comparable.
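The edge cases named above (gaps, duplicates, date boundaries) translate directly into tests on synthetic events. The sketch below uses a hypothetical `daily_event_counts` function as the metric under test and plain `assert`-based test functions that a runner such as pytest would discover.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_event_counts(events):
    """Count unique events per UTC date.

    events: iterable of (idempotency_key, timezone-aware datetime).
    The first occurrence of a key wins; later duplicates are ignored.
    """
    seen = set()
    counts = Counter()
    for key, ts in events:
        if key in seen:
            continue
        seen.add(key)
        counts[ts.astimezone(timezone.utc).date().isoformat()] += 1
    return dict(counts)


def utc(*args) -> datetime:
    return datetime(*args, tzinfo=timezone.utc)


def test_duplicates_are_counted_once():
    events = [("k1", utc(2025, 10, 1, 10, 0)), ("k1", utc(2025, 10, 1, 10, 0, 5))]
    assert daily_event_counts(events) == {"2025-10-01": 1}


def test_date_boundary_is_evaluated_in_utc():
    plus3 = timezone(timedelta(hours=3))
    events = [
        ("k1", utc(2025, 10, 1, 23, 59, 59)),
        ("k2", utc(2025, 10, 2, 0, 0, 0)),
        # 01:30 local time at UTC+3 is still 22:30 on the previous UTC day.
        ("k3", datetime(2025, 10, 2, 1, 30, tzinfo=plus3)),
    ]
    assert daily_event_counts(events) == {"2025-10-01": 2, "2025-10-02": 1}


def test_gap_days_are_absent_not_zero():
    events = [("k1", utc(2025, 10, 1, 12, 0)), ("k2", utc(2025, 10, 3, 12, 0))]
    assert "2025-10-02" not in daily_event_counts(events)
```

The last test documents a behaviour rather than prescribing it: whether a day without events should appear as zero or be missing is a definition that belongs in the metrics catalog.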

10) Sampling

Daily: 10-20 events from key flows (deposit/bet/KYC) - manual reconciliation of traces ↔ DWH.
Weekly: 1% sample to compare "online vs DWH" across aggregates.
Monthly: set of incidents with SLA effect - detailed reconstruction.
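For the weekly 1% sample, it is useful if the online and DWH sides select exactly the same events. One way to get that, sketched here under assumptions, is hash-based sampling on the event identifier; the function name and the salt value are hypothetical.

```python
import hashlib


def in_sample(event_id: str, percent: int = 1, salt: str = "audit-sample") -> bool:
    """Deterministically decide whether an event belongs to the audit sample.

    The same event_id and salt always give the same answer, so two systems
    can select identical events without exchanging a list of ids.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{event_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


ids = [f"evt-{i}" for i in range(100_000)]
sampled = [event_id for event_id in ids if in_sample(event_id)]
print(len(sampled))  # close to 1000; the exact count depends on the hash values
```

Changing the salt between audit periods rotates the sample, so the same events are not rechecked every week.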

Sample report template (brief):


```
Date/Window: 2025-10-01..2025-10-07
Metric: SLO_api_p99
Source A: Prometheus (rolling 5m)
Source B: DWH snapshot (1h buckets)
Deviation: +6.2% (A above B)
Reason: different aggregation windows
Action: align the window in both pipelines to 5m/rolling
Due date/Owner: 2025-11-10/squad-observability
```
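The "Deviation" field of the report can be computed the same way every time. A minimal sketch, assuming the deviation is expressed relative to source B and using invented p99 values of 424.8 ms and 400 ms:

```python
def relative_deviation(a: float, b: float) -> float:
    """Signed deviation of source A from source B, as a fraction of B."""
    if b == 0:
        raise ValueError("reference value must be non-zero")
    return (a - b) / b


def exceeds_threshold(a: float, b: float, threshold: float = 0.02) -> bool:
    """True when the absolute deviation is above the accepted threshold."""
    return abs(relative_deviation(a, b)) > threshold


prometheus_p99_ms = 424.8   # hypothetical value from source A
dwh_p99_ms = 400.0          # hypothetical value from source B

print(f"{relative_deviation(prometheus_p99_ms, dwh_p99_ms):+.1%}")  # +6.2%
print(exceeds_threshold(prometheus_p99_ms, dwh_p99_ms))             # True
```

Which source serves as the denominator should be recorded together with the threshold, otherwise the same pair of values can be reported as two different percentages.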

11) Audit of dashboards and alerts

Unified dictionary of metrics: glossary right on the dashboard.
Annotations of releases/events: to see the cause of deviations.
Pre/post release comparison: automatic regression panels.
Duplicates/discrepancies: find "two different p99s" and fix the formulas/windows.
Panel availability: access rights, backups, link/version control.

12) Metric Change Management

RFC process: any change to a formula, window, or source goes through an RFC with an assessment of its impact on SLAs and reporting.

Migration "expand → migrate → contract": temporarily keep both versions, compare, then turn off the old one.
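During the "migrate" phase both versions of a formula run on the same inputs, which makes the expected shift measurable before the old version is switched off. The sketch below is illustrative only: the old and new success-rate formulas and the daily counts are assumptions, not definitions from a real catalog.

```python
def old_success_rate(day: dict) -> float:
    # Old definition (assumed): only 5xx responses count as failures.
    return (day["requests"] - day["errors_5xx"]) / day["requests"]


def new_success_rate(day: dict) -> float:
    # New definition (assumed): timeouts count as failures too.
    return (day["requests"] - day["errors_5xx"] - day["timeouts"]) / day["requests"]


def dual_run(days, old, new):
    """Return (date, old_value, new_value, shift) for every day of the overlap period."""
    return [(d["date"], old(d), new(d), new(d) - old(d)) for d in days]


days = [
    {"date": "2025-10-01", "requests": 10_000, "errors_5xx": 20, "timeouts": 30},
    {"date": "2025-10-02", "requests": 12_000, "errors_5xx": 12, "timeouts": 60},
]

report = dual_run(days, old_success_rate, new_success_rate)
max_shift = max(abs(row[3]) for row in report)

for date, old_value, new_value, shift in report:
    print(f"{date}: old={old_value:.4f} new={new_value:.4f} shift={shift:+.4f}")
print(f"max shift: {max_shift:.4f}")  # 0.0050
```

The per-day shift is the number to include in the RFC and in the advance notice to product and business, since it shows how reported values will move under the new method.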

Communications: notify product/business in advance that values will shift "under the new method."

13) iGaming/fintech specifics

Demand peaks: metrics must withstand explosive loads (aggregations must not stall).
Providers: the SLA depends on vendor OLAs → store their reports, incident statuses, and quotas.
Cost: `cost_per_1k_calls` and "cost of success" are mandatory panels.
Antifraud/risk: these metrics are sensitive to delays and to false positives.

14) Audit dashboards (minimum set)

Metrics Health: completeness/timeliness/duplicates, ingest lag, ETL errors.
SLO/SLA Evidence: calculated SLO, actual SLA, exceptions, references to incidents/acts.
Online vs DWH Compare: p95/p99/Success Rate, deviations and trends.
Vendor SLA: uptime/quotas/timeouts/cost by provider.
Release Impact: metric regressions after rollouts/feature enablement.

15) Audit checklist (operational)

The metrics/SLO/SLA catalog with owners and formulas is up to date.
SSOT is defined for each report/panel.
Unit tests of formulas are green, calculation pipelines are documented.
Data quality alerts are active (completeness/timeliness/duplicates).
"Online vs DWH" discrepancy ≤ acceptable threshold (e.g. ≤2%).
Agreed SLA exceptions are documented and attached to the report.
Control samples have been taken and their reports written up.
All formula changes have passed RFC and migration.

16) Examples (fragments)

PromQL - pre-/post-release p99 comparison:


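```
# ignoring(release) is required: the two sides differ in the release label and would not match otherwise.
api_p99_ms:release:ratio =
  api_latency_p99_ms{release="after"} / ignoring(release) api_latency_p99_ms{release="before"}
```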

SQL - Event Completeness Control:

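```sql
-- Received events are pre-aggregated so that expected_count is not multiplied by the join.
-- Assumes expected_events has one row per (event_date, event_type).
SELECT event_date,
       COALESCE(SUM(r.received), 0) AS received,
       SUM(x.expected_count) AS expected,
       COALESCE(SUM(r.received), 0)::decimal / NULLIF(SUM(x.expected_count), 0) AS completeness
FROM expected_events AS x
LEFT JOIN (
    SELECT event_date, event_type, COUNT(*) AS received
    FROM events
    GROUP BY event_date, event_type
) AS r USING (event_date, event_type)
WHERE event_type IN ('deposit', 'bet_placed', 'kyc_completed')
  AND event_date BETWEEN :from AND :to
GROUP BY event_date;
```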

Prometheus alerting rule - divergence between the online and DWH pipelines:


```
ALERT DwhVsOnlineDrift
IF abs(dwh_kpis{metric="api_p99"} - online_kpis{metric="api_p99"}) > 0.02 * online_kpis{metric="api_p99"}
FOR 30m
LABELS {severity="warning", team="observability"}
```

17) Anti-patterns

Two different formulas for the "same" metric on different panels.
Changing a metric without migration and notification, which causes "jumps" in OKRs/SLAs.
Reports in a local Excel file treated as the "truth" (non-reproducible).
Mixing time zones and calendars in SLA calculations.
SLA exceptions are not documented.
There are no alerts on the quality of measurements.

18) Measurement maturity KPIs

Drift Rate Online↔DWH (target ≤2%).
Metrics Health Uptime.
Time-to-Fix Formula.
SLA Dispute Rate.
Coverage SLO/SLA (proportion of critical paths with formally described SLO/SLA).

19) Roles and responsibilities

Owner of the metric/service: formula, source, dashboard, alerts.
Observability/SRE: SSOT/platform, formula tests, data quality alerts.
Data/BI: DWH, report reproducibility, lineage.
Legal/partner managers: SLA agreements and exceptions.
Incident Manager: attribution and linking of incidents to SLAs.

20) Quick start (30 days)

Week 1: inventory metrics/SLOs/SLAs and their owners; assign an SSOT.
Week 2: enable data quality alerts and the "Online vs DWH" panel.
Week 3: run control samples, align the p95/p99 window.
Week 4: formalize the RFC process for formulas, prepare a monthly SLA report with attachments.

21) FAQ

Q: What is SSOT for SLA?
A: A store with reproducible calculations (the DWH) and full lineage; online panels are for operational control, not for legal documents.

Q: How to deal with "two p99s"?
A: Fix the window/aggregation method in the metrics catalog, migrate the panels, and add a drift alert.

Q: How should planned maintenance be accounted for?
A: Maintain a calendar of exceptions and automatically deduct them from the SLA according to the rules of the contract; store the supporting artifacts.
