
Operational Analytics

1) What is operational analytics and why is it needed

Operational Analytics (Ops Analytics) is the systematic assembly of signals from observability (metrics/logs/traces), ITSM (incidents/problems/changes), CI/CD (releases/configs), providers (PSP/KYC/CDN/cloud), FinOps (costs), and business SLIs (payment success, registration), turned into unified data marts and dashboards for decision-making.

Objectives:
  • reduce MTTD/MTTR through early detection and correct attribution of causes;
  • keep SLOs and error budgets under control;
  • link changes → impact (releases/configs → SLI/SLO/complaints/costs);
  • give self-service analytics to teams and management.

2) Sources and canonical data layer

Telemetry: metrics (SLI/resources), logs (sampling/PII redaction), traces (trace_id/span_id, release tags).
ITSM/Incident modules: SEV, T0/Detected/Ack/Declared/Mitigated/Recovered timestamps, RCA/CAPA.
CI/CD & Config: versions, commits, canary/blue-green deployments, flag state, target configs.
Providers: statuses/SLAs, delays, error codes, route weights.
FinOps: cost by tags/accounts/tenants, $ per unit (per 1k operations).
DataOps: data mart freshness, DQ errors, lineage.

The key principle is a single correlation layer via shared identifiers: `service`, `region`, `tenant`, `release_id`, `change_id`, `incident_id`, `provider`, `trace_id`.

3) Single data model (simplified framework)


dim_service(service_id, owner, tier, slo_targets…)
dim_time(ts, date, hour, tz)
dim_region(region_id, country, cloud)
dim_provider(provider_id, type, sla)
fact_sli(ts, service_id, region_id, tenant, metric, value, target, window)
fact_incident(incident_id, service_id, sev, t0, t_detected, t_ack, t_declared, t_mitigated, t_recovered, root_cause, trigger_id, burn_minutes)
fact_change(change_id, type(code|config|infra), service_id, region_id, started_at, finished_at, canary_pct, outcome(ok|rollback), annotations)
fact_cost(ts, service_id, region_id, tenant, cost_total, cost_per_1k)
fact_provider(ts, provider_id, region_id, metric(latency|error|status), value)
fact_dq(ts, dataset, freshness_min, dq_errors)
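
As an illustration, a minimal DDL sketch for the fact_incident table above (PostgreSQL-style types; the column set follows the model, everything else is an assumption to adapt to your warehouse):

sql
-- Hypothetical DDL for fact_incident; types and constraints are illustrative.
CREATE TABLE fact_incident (
    incident_id   TEXT PRIMARY KEY,
    service_id    TEXT NOT NULL,        -- FK to dim_service in practice
    sev           TEXT NOT NULL,        -- 'SEV-1' .. 'SEV-4'
    t0            TIMESTAMPTZ NOT NULL, -- start of impact
    t_detected    TIMESTAMPTZ,
    t_ack         TIMESTAMPTZ,
    t_declared    TIMESTAMPTZ,
    t_mitigated   TIMESTAMPTZ,
    t_recovered   TIMESTAMPTZ,
    root_cause    TEXT,                 -- code | config | infra | provider | ...
    trigger_id    TEXT,                 -- change_id or alert that triggered the incident
    burn_minutes  NUMERIC               -- error-budget minutes consumed
);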

4) SLI/SLO and business metrics

Business SLIs: `payment_success_ratio`, `signup_completion`, `deposit_latency`.
Technical SLIs: `availability`, `http_p95`, `error_rate`, `queue_depth`.
SLO layer: targets plus burn rate over short/long windows, automatic annotation of violations (see the burn-rate sketch below).
Normalization: indicators per 1k successful operations/users/units of traffic.
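
A minimal burn-rate sketch over fact_sli, assuming `value` and `target` are stored as ratios in [0,1]; the 1h/6h windows and the alert thresholds are assumptions, not fixed policy:

sql
-- Burn rate = observed error rate / allowed error rate (1 - SLO target).
-- Short (1h) and long (6h) windows; alert only when both exceed their thresholds.
WITH w AS (
  SELECT service_id,
         1 - AVG(value) FILTER (WHERE ts >= NOW() - INTERVAL '1 hour')  AS err_1h,
         1 - AVG(value) FILTER (WHERE ts >= NOW() - INTERVAL '6 hours') AS err_6h,
         1 - MAX(target)                                                AS budget
  FROM fact_sli
  WHERE metric = 'payment_success_ratio'
    AND ts >= NOW() - INTERVAL '6 hours'
  GROUP BY service_id
)
SELECT service_id,
       err_1h / NULLIF(budget, 0) AS burn_rate_1h,
       err_6h / NULLIF(budget, 0) AS burn_rate_6h
FROM w
WHERE err_1h / NULLIF(budget, 0) > 14  -- fast-burn threshold (assumption)
  AND err_6h / NULLIF(budget, 0) > 6;  -- slow-burn confirmation (assumption)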

5) Correlations and attribution of causes

Releases/configs ↔ SLI/SLO: annotations on graphs; cause-and-effect reports (share of change-induced incidents and their MTTR); see the before/after sketch below.
Providers ↔ business SLI: route weights vs. latency/errors, each provider's contribution to SLO misses.
Capacity/resources ↔ latency: pool saturation → p95 growth → conversion impact.
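
A sketch of change attribution: compare an SLI one hour before vs. one hour after each recent change (table and column names follow the model above; using `http_p95` as the metric is an assumption):

sql
-- Before/after comparison of an SLI around each change of the last 24 hours.
WITH impact AS (
  SELECT c.change_id,
         c.service_id,
         AVG(s.value) FILTER (WHERE s.ts <  c.started_at)  AS p95_before,
         AVG(s.value) FILTER (WHERE s.ts >= c.finished_at) AS p95_after
  FROM fact_change c
  JOIN fact_sli s
    ON s.service_id = c.service_id
   AND s.region_id  = c.region_id
   AND s.metric     = 'http_p95'
   AND s.ts BETWEEN c.started_at - INTERVAL '1 hour'
                AND c.finished_at + INTERVAL '1 hour'
  WHERE c.started_at >= NOW() - INTERVAL '24 hours'
  GROUP BY c.change_id, c.service_id
)
SELECT change_id, service_id, p95_before, p95_after,
       p95_after - p95_before AS delta_p95
FROM impact
ORDER BY delta_p95 DESC NULLS LAST;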

6) Anomalies and forecasting

Anomaly detection: seasonality + percentile thresholds + change-point features (before/after release).
Forecasting: weekly/seasonal load patterns, error-budget burn-down forecast, cost prediction ($ per unit).
Guardrails: alert only when a quorum of sources agrees (synthetic + RUM + business SLI); see the quorum sketch below.
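
A sketch of the quorum guardrail: raise an alert only when at least two of the three independent signal families (synthetic, RUM, business SLI) look unhealthy over the last 5 minutes; the metric names and thresholds are assumptions:

sql
-- Count how many independent signal families currently breach their thresholds.
WITH signals AS (
  SELECT
    MAX(CASE WHEN metric = 'synthetic_error_rate'  AND value > 0.02  THEN 1 ELSE 0 END) AS synthetic_bad,
    MAX(CASE WHEN metric = 'rum_error_rate'        AND value > 0.02  THEN 1 ELSE 0 END) AS rum_bad,
    MAX(CASE WHEN metric = 'payment_success_ratio' AND value < 0.985 THEN 1 ELSE 0 END) AS business_bad
  FROM fact_sli
  WHERE service_id = 'payments-api'
    AND ts >= NOW() - INTERVAL '5 minutes'
)
SELECT COALESCE(synthetic_bad, 0) + COALESCE(rum_bad, 0) + COALESCE(business_bad, 0) >= 2 AS should_alert
FROM signals;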

7) Data marts and dashboards (reference)

1. Executive 28d: SEV mix, median MTTR/MTTD, SLO adherence, $/unit, top reasons.
2. SRE Ops: SLI/SLO + burn-rate, Page Storm, Actionable %, Change Failure Rate.
3. Change Impact: releases/configs ↔ SLI/SLO/complaints, rollbacks and their effect.
4. Providers: PSP/KYC/CDN status lines, impacts on business SLI, response times.
5. FinOps: cost per 1k txn, logs/egress, cost anomalies, recommendations (sampling, storage).
6. DataOps: data mart freshness, DQ errors, pipeline SLAs, backfill success.

8) Data quality and governance

Event contracts: explicit schemas for incidents/releases/SLIs (mandatory fields, uniform time zones).
DQ checks: completeness, uniqueness of keys, timeline consistency (t0 ≤ detected ≤ ack ≤ ...); see the sketch below.
Lineage: traceable from any dashboard back to its sources.
PII/secrets: redaction/masking per policy; WORM storage for evidence.
Freshness SLA: Ops data marts lag by no more than 5 min.
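
A sketch of the timeline-consistency check from the DQ list above; rows with NULL timestamps (states not yet reached) drop out of the comparisons automatically:

sql
-- Flag incidents whose lifecycle timestamps are out of order
-- (expected: t0 <= detected <= ack <= mitigated <= recovered).
SELECT incident_id
FROM fact_incident
WHERE t_detected  < t0
   OR t_ack       < t_detected
   OR t_mitigated < t_ack
   OR t_recovered < t_mitigated;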

9) Operational analytics maturity metrics

Coverage: % of critical services covered by data marts and SLO boards (target ≥ 95%).
Freshness: share of widgets with freshness ≤ 5 minutes (target ≥ 95%).
Actionability: % of dashboard views that lead to an action (playbook/SOP/ticket) ≥ 90%.
Detection Coverage: ≥ 85% of incidents detected by automation.
Attribution Rate: % of incidents with a confirmed cause and trigger ≥ 90% (see the sketch below).
Change Impact Share: share of incidents related to changes (trend is monitored).
Data Quality: DQ errors per week, trending down quarter over quarter.
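
A sketch for two of these metrics over the last 28 days; the `chg-` prefix used to recognize change-triggered incidents is an assumption about the trigger_id naming convention:

sql
-- Attribution Rate: % of incidents with a confirmed root cause and trigger.
-- Change Impact Share: % of incidents whose trigger is a change.
SELECT 100.0 * AVG(CASE WHEN root_cause IS NOT NULL
                         AND trigger_id IS NOT NULL THEN 1 ELSE 0 END)  AS attribution_rate,
       100.0 * AVG(CASE WHEN trigger_id LIKE 'chg-%' THEN 1 ELSE 0 END) AS change_impact_share
FROM fact_incident
WHERE t0 >= NOW() - INTERVAL '28 days';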

10) Process: from data to action

1. Collection → cleaning → normalization → data marts (ETL/ELT, feature layer for ML).
2. Detection/forecasting → escalation per the matrix (IC/P1/P2/Comms).
3. Action: playbook/SOP, release gate, feature flag, provider switch.
4. Evidence and AAR/RCA: timeline, graphs, links to releases/logs/traces.
5. CAPA and product decisions: prioritization by burn minutes and $ impact (see the sketch below).
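
A sketch of the prioritization in step 5: rank root causes by accumulated error-budget burn (joining in $ impact from FinOps data is left out of this sketch):

sql
-- Root causes of the last 90 days ranked by total error-budget burn.
SELECT root_cause,
       COUNT(*)          AS incidents,
       SUM(burn_minutes) AS burn_total_min
FROM fact_incident
WHERE t0 >= NOW() - INTERVAL '90 days'
GROUP BY root_cause
ORDER BY burn_total_min DESC;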

11) Query examples (illustrative)

11.1 Impact of releases on SLO (24h)

sql
SELECT r.change_id,
       COUNT(i.incident_id) AS incidents,
       SUM(i.burn_minutes) AS burn_total_min,
       AVG(CASE WHEN i.root_cause='code' THEN 1 ELSE 0 END) AS code_ratio
FROM fact_change r
LEFT JOIN fact_incident i
       ON i.trigger_id = r.change_id
WHERE r.started_at >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY burn_total_min DESC;

11.2 Share of incidents caused by providers, by region

sql
SELECT region_id, provider_id,
       SUM(CASE WHEN root_cause='provider' THEN 1 ELSE 0 END) AS prov_inc,
       COUNT(*) AS all_inc,
       100.0 * SUM(CASE WHEN root_cause='provider' THEN 1 ELSE 0 END) / COUNT(*) AS pct
FROM fact_incident
WHERE t0 >= DATE_TRUNC('month', NOW())
GROUP BY 1, 2
ORDER BY pct DESC;

11.3 Cost per 1k successful payments

sql
SELECT DATE(ts) AS d,
       SUM(cost_total) / NULLIF(SUM(success_payments) / 1000.0, 0) AS cost_per_1k
FROM fact_cost c
JOIN biz_payments b USING (ts, service_id, region_id, tenant)
GROUP BY d
ORDER BY d DESC;

12) Artifact patterns

12.1 Incident event schema (JSON, fragment)

json
{
  "incident_id": "2025-11-01-042",
  "service": "payments-api",
  "region": "eu",
  "sev": "SEV-1",
  "t0": "2025-11-01T12:04:00Z",
  "detected": "2025-11-01T12:07:00Z",
  "ack": "2025-11-01T12:09:00Z",
  "declared": "2025-11-01T12:11:00Z",
  "mitigated": "2025-11-01T12:24:00Z",
  "recovered": "2025-11-01T12:48:00Z",
  "root_cause": "provider",
  "trigger_id": "chg-7842",
  "burn_minutes": 18
}

12.2 Metrics catalog (YAML, fragment)

yaml
metric: biz.payment_success_ratio
owner: team-payments
type: sli
target: 99.5
windows: ["5m", "1h", "6h", "28d"]
tags: [tier0, region:eu]
pii: false

12.3 Executive report card (sections)


1) SEV mix and MTTR/MTTD trends
2) SLO adherence and error-budget burn risks
3) Change Impact (CFR)
4) Providers: degradations and switchovers
5) FinOps: $/unit, log/egress anomalies
6) CAPA: status and deadlines

13) Tools and architectural patterns

Data Lake + DWH: raw layer for telemetry, curated data marts for decision-making.
Stream processing: near-real-time SLI/burn-rate, online features for anomaly detection.
Feature Store: reuse of features (canary, seasonality, provider signals).
Semantic Layer/Metric Store: uniform metric definitions (SLO, MTTR, ...); see the sketch after this list.
Access Control: RBAC/ABAC, row-level security for tenants/regions.
Catalog/Lineage: search, descriptions, dependencies, owners.
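
One way to pin a single definition in the semantic layer is a shared view that every dashboard reads; a sketch for MTTR (minutes from t0 to recovery, per service and month; the view name is illustrative):

sql
-- Canonical MTTR definition shared by all dashboards (sketch).
CREATE OR REPLACE VIEW metric_mttr_monthly AS
SELECT service_id,
       DATE_TRUNC('month', t0) AS month,
       AVG(EXTRACT(EPOCH FROM (t_recovered - t0)) / 60.0) AS mttr_minutes
FROM fact_incident
WHERE t_recovered IS NOT NULL
GROUP BY service_id, DATE_TRUNC('month', t0);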

14) Checklists

14.1 Launching operational analytics

  • Approved dictionaries: SLI/SLO, SEV levels, root-cause categories, change types.
  • Event schemas and uniform time zones.
  • Connectors for telemetry, ITSM, CI/CD, providers, billing.
  • Data marts: SLI/SLO, Incidents, Changes, Providers, FinOps.
  • Executive/SRE/Change/Providers dashboards are available.
  • Quorum alerts and suppression during maintenance windows are configured.

14.2 Weekly Ops Review

  • SEV trends, MTTR/MTTD, SLO misses, burn minutes.
  • Change Impact and CFR, rollback status.
  • Provider incidents and reaction times.
  • FinOps: $/unit, log anomalies/egress.
  • CAPA status, overdue items, priorities.

15) Anti-patterns

"Wall of graphs" without going to action.
Different definitions of metrics for commands (no semantic layer).
Lack of release/window annotations - weak attribution of causes.
Medium orientation instead of p95/p99.

There is no normalization for volume - large services "seem worse."

PII in logs/storefronts, retension impairment.
Data "stagnates" (> 5-10 min for real-time widgets).

16) Implementation Roadmap (4-8 weeks)

1. Week 1: agree on metric dictionaries, event schemas, and ID correlation; connect SLI/SLO and ITSM.
2. Week 2: Incidents/Changes/Providers data marts, release annotations; Executive and SRE dashboards.
3. Week 3: FinOps layer ($/unit), linkage with SLIs; anomaly detection with quorum alerting.
4. Week 4: self-service (semantic layer/metric store), catalog and lineage.
5. Weeks 5-6: load/cost forecasting, provider reports, CAPA data mart.
6. Weeks 7-8: coverage of ≥ 95% of Tier-0/1 services, freshness SLA ≤ 5 min, regular Ops reviews.

17) The bottom line

Operational analytics is a decision machine: uniform metric definitions, fresh data marts, correct attribution of causes, and direct paths to playbooks and SOPs. With such a system, the team quickly detects and explains deviations, accurately assesses the impact of releases and providers, manages costs, and systematically reduces risk, while users get a stable service.
