Shift and performance analytics
1) Purpose and value
Shift analytics is a measurement system that makes managing 24×7 operations predictable: it confirms SLO coverage, identifies bottlenecks (night slots, congested domains), prevents burnout, and improves handover quality. For iGaming this directly affects deposit/settlement speed, KYC/AML deadlines, and reputation.
2) Taxonomy of metrics
2.1 Coverage and readiness
Coverage Rate - % of hours with full staffing (by role/domain/region).
On-Call Readiness - share of shifts with an assigned IC/CL and valid contacts.
Handover SLA - compliance with the transfer window (10-15 min) and the checklist.
2.2 Reaction and resolution speed
MTTA/MTTR (by Day/Swing/Night slot, by domain): median, p90.
Detection Lead - lag between SLI degradation and the first action.
Post-Release Monitoring Time - actual time spent observing a release.
2.3 Handover quality
Handover Defect Rate - share of blank checklist items.
Info Drift - discrepancies between the war room, ITSM, and the status channel.
Action Carryover - share of tasks carried over without an owner/ETA.
2.4 Load and fatigue
Pager Fatigue: alerts/person/week, night pages, P1s/person/shift.
Escalation Density: share of incidents escalated to L2/L3 (vs. runbook fixes at L1).
Idle vs. Busy Ratio: waiting time vs. time under live load.
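As an illustration of how the fatigue metrics above can be derived from raw pager events, here is a minimal Python sketch. The record shape, the names `fatigue_summary`, and the 22:00-06:00 night window are assumptions, not a reference to any real system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical alert records as (person, timestamp) pairs; field names are illustrative.
alerts = [
    ("alice", datetime(2024, 3, 4, 2, 15)),   # night page
    ("alice", datetime(2024, 3, 4, 14, 0)),
    ("alice", datetime(2024, 3, 6, 3, 40)),   # night page
    ("bob",   datetime(2024, 3, 5, 11, 30)),
]

def fatigue_summary(alerts, night_start=22, night_end=6):
    """Per-person totals: all pages, plus pages landing in the night window."""
    total = defaultdict(int)
    night = defaultdict(int)
    for person, ts in alerts:
        total[person] += 1
        if ts.hour >= night_start or ts.hour < night_end:
            night[person] += 1
    return {p: {"pages": total[p], "night_pages": night[p]} for p in total}

print(fatigue_summary(alerts))
# alice: 3 pages, 2 at night; bob: 1 page, 0 at night
```

Aggregated per person per week, these counts feed directly into the Pager Fatigue percentiles used later in the thresholds section.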
2.5 Efficiency and automation
Auto-Fix Rate - incidents resolved by auto-actions/bot.
Runbook Usage - % of alerts closed via standard scenarios.
First Contact Resolution (FCR) - closed at L1 without escalation.
Mean Time Between Incidents (MTBI) - stability per domain/slot.
2.6 Fairness and sustainability
Fair-Share Index - evenness of nights/weekends across people.
Replacement SLA - replacements confirmed ≥48 hours before the shift.
Training Coverage - share of shifts with a shadow slot for onboarding.
2.7 Business linkage
SLO Impact Score - how long the shift kept SLOs in the green.
Revenue at Risk (proxy) - estimate of revenue lost to the shift's P1/P2 incidents.
Partner Latency/Declines - contribution of PSP/KYC partners to shift incidents.
3) Data model
3.1 Event grain
shift_event: start/end, lineup, roles (IC/CL/L1/L2), region, domains.
alert_event: signal, priority, owner, closure, runbook/auto-action.
incident_event: P1-P4, timelines, IC/CL, status publications.
handover_check: checklist marks + defects/comments.
release_watch: observation windows, gates, auto-rollbacks.
worklog: productive minutes (diagnostics, fixes, comms updates, post-mortems).
fatigue_signal: page/night frequency, hours worked.
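To make the event grain concrete, here is a sketch of two of the event types as Python dataclasses. The field names mirror the list above but are assumptions; a real schema would live in the event lake, not in application code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Illustrative event grain; field names follow the list above but are assumed.
@dataclass
class ShiftEvent:
    start: datetime
    end: datetime
    region: str
    slot: str                    # Day / Swing / Night
    roles: dict                  # e.g. {"IC": "alias1", "CL": "alias2"}
    domains: list = field(default_factory=list)

@dataclass
class AlertEvent:
    emitted_at: datetime
    acked_at: Optional[datetime]
    severity: str                # P1-P4
    domain: str
    owner: str                   # alias only, never raw PII (per the PII policy below)
    runbook_id: Optional[str] = None
    auto_fixed: bool = False

ev = AlertEvent(datetime(2024, 3, 4, 2, 0), datetime(2024, 3, 4, 2, 4),
                "P1", "payments", "alias1", runbook_id="RB-12")
mtta_seconds = (ev.acked_at - ev.emitted_at).total_seconds()
print(mtta_seconds)  # 240.0
```

Storing emit and ack timestamps on the same event record is what makes MTTA a simple per-row subtraction downstream.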
3.2 Schema (simplified)
Keys: `timestamp`, `tenant`, `region`, `environment`, `domain`, `role`, `severity`.
Storage options: event lake (Parquet/Iceberg) + pre-aggregates in a DWH/TSDB.
PII policy: aggregates and aliases only; e-mails/IDs are masked.
4) Data collection (ETL)
1. ChatOps/bot: commands `/handover`, `/incident`, `/runbook` → WORM journal.
2. ITSM: incident/ticket statuses, linkage to war rooms.
3. Metrics API: SLI/SLO (auth success, bet→settle p99, error rate), KRI (queue lag, PSP declines).
4. Shift planner: calendars, replacements, roles, shadow slots.
5. CI/CD: releases, observation windows, auto-rollbacks.
ETL normalizes events, adds `shift_slot` (Day/Swing/Night), and computes derived metrics (MTTA/MTTR, Fair-Share).
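The `shift_slot` derivation is the simplest of these enrichment steps; a minimal sketch follows. The 08/16/24 boundaries are illustrative assumptions; real teams set their own local-time windows and must handle time zones.

```python
from datetime import datetime

# Assumed slot boundaries (local time); adjust to the team's actual rota.
def shift_slot(ts: datetime) -> str:
    """Map an event timestamp to a Day / Swing / Night slot."""
    if 8 <= ts.hour < 16:
        return "Day"
    if 16 <= ts.hour < 24:
        return "Swing"
    return "Night"        # 00:00-08:00

print(shift_slot(datetime(2024, 3, 4, 9, 0)))   # Day
print(shift_slot(datetime(2024, 3, 4, 23, 0)))  # Swing
print(shift_slot(datetime(2024, 3, 4, 3, 0)))   # Night
```

Tagging every event with its slot at ETL time is what lets every downstream metric (MTTA, fatigue, Fair-Share) be sliced by slot without re-deriving it.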
5) Dashboards
5.1 Exec (weekly/monthly review)
CFR, MTTR, Auto-Fix Rate, SLO Impact, Revenue-at-Risk (proxy).
Heatmap of slot and domain overload.
5.2 Ops/SRE (per shift/daily)
Real-time panel: open P1-P4, burn rate, queues/replication, guardrails.
Handover card: checklist status and defects.
Fatigue panel: pages/person, nights/person (trailing 4 weeks), warnings.
5.3 Team/Domain
MTTA/MTTR by domain, FCR, Runbook Usage, share of L2/L3 escalations.
Fair-Share and Replacement SLA for the specific team.
6) Formulas and thresholds
Coverage Rate = covered hours / 168. Target ≥ 99%.
Handover SLA = % of shifts where the transfer is completed and the checklist closed within ≤ 15 minutes (target ≥ 95%).
Pager Fatigue (weekly): p95 alerts/person ≤ target; warning at > p90.
Fair-Share Index = 1 − (σ_nights / target_nights). Target ≥ 0.8.
Auto-Fix Rate ≥ 40% for L1 per quarter (target depends on maturity).
Runbook Usage ≥ 70% for recurring alerts (top 10 signals).
Control charts (X-MR, p-charts) for MTTA/MTTR and Defect Rate; alert when values exceed control limits.
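Two of these formulas can be sketched directly in Python; the function names and the sample inputs are illustrative, and a real implementation would read from the pre-aggregates described earlier.

```python
import statistics

def coverage_rate(covered_hours: float) -> float:
    """Share of the 168-hour week with full staffing."""
    return covered_hours / 168

def fair_share_index(nights_per_person, target_nights: float) -> float:
    """1 - (std deviation of night shifts across people / target nights per person)."""
    sigma = statistics.pstdev(nights_per_person)
    return 1 - sigma / target_nights

# One uncovered hour out of 168:
print(round(coverage_rate(167), 4))                   # 0.994 -> below the 99% target? No: 99.4%
# Four people with 4-5 night shifts each against a 4.5-night target:
print(round(fair_share_index([4, 5, 4, 5], 4.5), 3))  # 0.889 -> above the 0.8 target
```

Note the Fair-Share Index rewards low dispersion, not low totals: a team where everyone works many nights can still score 1.0, which is why it is read together with the fatigue metrics.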
7) Analytical methods
Anomalies: STL/ESD/CUSUM on alert volume and MTTA/MTTR; flag outliers and their causes (release, provider).
Load forecasting: Prophet/ARIMA on alerts and P1/P2 per slot → FTE scheduling.
Result attribution: uplift modeling of process changes (e.g., a new handover template) → MTTR.
Controlled experiments: A/B on internal processes (checklist version, new runbook).
Cohort analysis: performance of newcomers (shadow→solo) vs. experienced staff.
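Of the methods above, CUSUM is simple enough to sketch in a few lines. This is a one-sided (upward-drift) variant for MTTA; the slack `k` and threshold `h` values are illustrative, and real deployments tune them from historical variance.

```python
# Minimal one-sided CUSUM sketch for detecting upward MTTA drift; parameters are illustrative.
def cusum_high(series, target, k=0.5, h=4.0):
    """Return indices where the cumulative excess over (target + k) crosses threshold h."""
    s, alarms = 0.0, []
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target - k))  # accumulate only positive drift
        if s > h:
            alarms.append(i)
            s = 0.0                         # reset after raising an alarm
    return alarms

# MTTA (minutes) per alert; a drift begins around the sixth value.
mtta_minutes = [4, 5, 4, 5, 4, 9, 10, 11, 9, 4, 5]
print(cusum_high(mtta_minutes, target=5))  # -> [6, 7]
```

Unlike a plain threshold, CUSUM accumulates small sustained deviations, so a run of slightly slow acks triggers even when no single value looks alarming.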
8) Integrations
Incident bot: posts shift metrics, reminds about an unclosed handover, kicks off retros.
Release portal: links release windows to load peaks; auto-pauses on red SLOs.
Metrics API: ready-made SLO views + exemplars (trace_id) for RCA.
HR/PTO: shrinkage factors → fair-share planning and analytics.
9) Policies and RACI
Ops Analytics Owner (SRE/Platform): data model, dashboards, metric accuracy.
Service Owners: interpretation of domain signals, improvement plans.
Duty Manager: weekly KPI/KRI analysis, slot balance.
Compliance/Sec: Compliance with PII/SoD in telemetry and reporting.
Training Lead: Onboarding plans from analytics findings.
10) Artifact templates
10.1 Metrics catalog (YAML)

```yaml
apiVersion: ops.analytics/v1
kind: MetricCatalog
items:
  - id: coverage_rate
    owner: "SRE"
    formula: "covered_hours / 168"
    slice: ["region", "slot", "domain"]
    target: ">=0.99"
  - id: mtta_p50
    owner: "Ops"
    formula: "median(ack_ts - alert_ts)"
    slice: ["slot", "severity", "domain"]
    target: "<=5m (P1)"
  - id: handover_defect_rate
    owner: "Ops"
    formula: "defects / handovers"
    target: "<=5%"
  - id: pager_fatigue_p95
    owner: "SRE"
    formula: "p95(alerts_per_person_week)"
    target: "<=team_threshold"
```
10.2 Query example (SQL aggregate)

```sql
SELECT slot, domain,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY ack_s - emit_s) AS mtta_p50,
       percentile_cont(0.9) WITHIN GROUP (ORDER BY ack_s - emit_s) AS mtta_p90,
       AVG(auto_fix)::float AS autofix_rate
FROM alerts_fact
WHERE ts BETWEEN :from AND :to AND severity IN ('P1', 'P2')
GROUP BY slot, domain;
```
10.3 Handover checklist (quality signals)
SLO/SLI summary attached
Open incidents have owners/ETA
Planned maintenance/releases are linked
Provider risks are recorded
Comm drafts are ready
On-call contacts are up to date
Watchlist updated
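This checklist is also the data source for the Handover Defect Rate metric defined earlier. A minimal sketch of that computation follows; the key names mirror the checklist items but are assumptions.

```python
# Illustrative handover record: True = item completed; key names are assumed.
checklist = {
    "slo_summary_attached": True,
    "incidents_have_owner_eta": True,
    "releases_linked": False,        # defect: planned releases not linked
    "provider_risks_recorded": True,
    "comm_drafts_ready": True,
    "oncall_contacts_valid": True,
    "watchlist_updated": False,      # defect
}

def defect_rate(checklists) -> float:
    """Handover Defect Rate = blank items / total items across all handovers."""
    total = sum(len(c) for c in checklists)
    defects = sum(1 for c in checklists for done in c.values() if not done)
    return defects / total

print(round(defect_rate([checklist]), 3))  # 2 of 7 items blank -> 0.286
```

Because the bot records each checklist tick as a `handover_check` event, this rate can be computed per shift, per team, or per slot without extra instrumentation.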
11) Risk & Improvement Management
KRI: DLQ/queue-lag growth in the night slot, FCR dropping below target, Info Drift spike.
Improvement plan: weekly Ops Plan with owners/ETA for the top 3 problem areas.
Shift post-mortem discipline: retros on handover defects and alert flapping.
Process A/B: verifying the impact of new procedures on MTTR/Auto-Fix.
12) KPI/OKR examples (quarter)
KR1: MTTR P1 (median) ↓ from 22 min to 15 min.
KR2: Handover SLA ≥ 95% across all three slots.
KR3: Auto-Fix Rate ≥ 45% for the top 10 alerting rules.
KR4: Pager Fatigue p95 ↓ by 20% (after alert optimization).
KR5: Fair-Share Index ≥ 0.85 across all teams.
13) Implementation Roadmap (6-10 weeks)
Weeks 1-2: event schemas, ETL from bot/ITSM/Metrics API, first metrics catalog, basic dashboards.
Weeks 3-4: control charts and thresholds, fatigue panel, handover quality, release linkage.
Weeks 5-6: load forecasting (slots/domains), fair-share and replacement analytics.
Weeks 7-8: auto-suggestions (which runbooks to automate), auto-fix ROI reports, retro templates.
Weeks 9-10: process experiments (A/B checklists), KPIs on exec panels, team training.
14) Antipatterns
Judging "shift success" only by the number of closed tickets (without MTTR/SLO context).
Ignoring handover defects ("it's obvious anyway").
Metrics not normalized by traffic volume/seasonal peaks.
Personalization and "people rankings" that ignore complexity and input conditions.
No fair-share → burnout and more errors.
No correlation with releases/experiments → false conclusions.
Data without a WORM audit trail and without a PII policy.
Result
Shift and performance analytics is a production measurement system on top of ChatOps, ITSM, and telemetry: a clear KPI/KRI taxonomy, correct data models, role-specific dashboards, statistical methods, and linkage to SLOs and business impact. This approach balances load, speeds up response, reduces burnout, and predictably improves the operational quality of an iGaming platform.