Shift and performance analytics
1) Purpose and value
Shift analytics is a measurement system that makes managing 24×7 operations predictable: it confirms SLO coverage, identifies bottlenecks (night slots, congested domains), prevents burnout, and improves handover quality. For iGaming this directly affects deposit/settlement speed, KYC/AML deadlines, and reputation.
2) Taxonomy of metrics
2.1 Coverage and readiness
Coverage Rate - % of hours with full staffing (by role/domain/region).
On-Call Readiness - share of shifts with an assigned IC/CL and valid contacts.
Handover SLA - compliance with the transfer window (10-15 min) and the checklist.
2.2 Reaction and resolution speed
MTTA/MTTR (by Day/Swing/Night slot, by domain): median, p90.
Detection Lead - lag between SLI degradation and the first action.
Post-Release Monitoring Time - actual time spent observing a release.
2.3 Handover quality
Handover Defect Rate - share of blank checklist items.
Info Drift - discrepancies between the war room, ITSM, and the status channel.
Action Carryover - share of tasks carried over without an owner/ETA.
2.4 Load and fatigue
Pager Fatigue: alerts/person/week, night pages, P1s/person/shift.
Escalation Density: share of incidents escalated to L2/L3 (vs. runbook fixes at L1).
Idle vs. Busy Ratio: waiting time vs. time under live load.
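As an illustration of how the fatigue metrics above can be derived from raw pager events, here is a minimal Python sketch. The record shape, the names `fatigue_summary`, and the 22:00-06:00 night window are assumptions, not a reference to any real system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical alert records as (person, timestamp) pairs; field names are illustrative.
alerts = [
    ("alice", datetime(2024, 3, 4, 2, 15)),   # night page
    ("alice", datetime(2024, 3, 4, 14, 0)),
    ("alice", datetime(2024, 3, 6, 3, 40)),   # night page
    ("bob",   datetime(2024, 3, 5, 11, 30)),
]

def fatigue_summary(alerts, night_start=22, night_end=6):
    """Per-person totals: all pages, plus pages landing in the night window."""
    total = defaultdict(int)
    night = defaultdict(int)
    for person, ts in alerts:
        total[person] += 1
        if ts.hour >= night_start or ts.hour < night_end:
            night[person] += 1
    return {p: {"pages": total[p], "night_pages": night[p]} for p in total}

print(fatigue_summary(alerts))
# alice: 3 pages, 2 at night; bob: 1 page, 0 at night
```

Aggregated per person per week, these counts feed directly into the Pager Fatigue percentiles used later in the thresholds section.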
2.5 Efficiency and automation
Auto-Fix Rate - incidents resolved by auto-actions/bot.
Runbook Usage - % of alerts closed via standard scenarios.
First Contact Resolution (FCR) - closed at L1 without escalation.
Mean Time Between Incidents (MTBI) - stability per domain/slot.
2.6 Fairness and sustainability
Fair-Share Index - evenness of nights/weekends across people.
Replacement SLA - replacements confirmed ≥48 hours before the shift.
Training Coverage - share of shifts with a shadow slot for onboarding.
2.7 Business linkage
SLO Impact Score - how long the shift kept SLOs in the green.
Revenue at Risk (proxy) - estimate of revenue lost to the shift's P1/P2 incidents.
Partner Latency/Declines - contribution of PSP/KYC partners to shift incidents.
3) Data model
3.1 Event grain
shift_event: start/end, lineup, roles (IC/CL/L1/L2), region, domains.
alert_event: signal, priority, owner, closure, runbook/auto-action.
incident_event: P1-P4, timelines, IC/CL, status publications.
handover_check: checklist marks + defects/comments.
release_watch: observation windows, gates, auto-rollbacks.
worklog: productive minutes (diagnostics, fixes, comms updates, post-mortems).
fatigue_signal: page/night frequency, hours worked.
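To make the event grain concrete, here is a sketch of two of the event types as Python dataclasses. The field names mirror the list above but are assumptions; a real schema would live in the event lake, not in application code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Illustrative event grain; field names follow the list above but are assumed.
@dataclass
class ShiftEvent:
    start: datetime
    end: datetime
    region: str
    slot: str                    # Day / Swing / Night
    roles: dict                  # e.g. {"IC": "alias1", "CL": "alias2"}
    domains: list = field(default_factory=list)

@dataclass
class AlertEvent:
    emitted_at: datetime
    acked_at: Optional[datetime]
    severity: str                # P1-P4
    domain: str
    owner: str                   # alias only, never raw PII (per the PII policy below)
    runbook_id: Optional[str] = None
    auto_fixed: bool = False

ev = AlertEvent(datetime(2024, 3, 4, 2, 0), datetime(2024, 3, 4, 2, 4),
                "P1", "payments", "alias1", runbook_id="RB-12")
mtta_seconds = (ev.acked_at - ev.emitted_at).total_seconds()
print(mtta_seconds)  # 240.0
```

Storing emit and ack timestamps on the same event record is what makes MTTA a simple per-row subtraction downstream.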
3.2 Schema (simplified)
Keys: `timestamp`, `tenant`, `region`, `environment`, `domain`, `role`, `severity`.
Storage options: event lake (Parquet/Iceberg) + pre-aggregates in a DWH/TSDB.
PII policy: aggregates and aliases only; e-mails/IDs are masked.
4) Data collection (ETL)
1. ChatOps/bot: commands `/handover`, `/incident`, `/runbook` → WORM journal.
2. ITSM: incident/ticket statuses, linkage to war rooms.
3. Metrics API: SLI/SLO (auth success, bet→settle p99, error rate), KRI (queue lag, PSP declines).
4. Shift planner: calendars, replacements, roles, shadow slots.
5. CI/CD: releases, observation windows, auto-rollbacks.
ETL normalizes events, adds `shift_slot` (Day/Swing/Night), and computes derived metrics (MTTA/MTTR, Fair-Share).
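The `shift_slot` derivation is the simplest of these enrichment steps; a minimal sketch follows. The 08/16/24 boundaries are illustrative assumptions; real teams set their own local-time windows and must handle time zones.

```python
from datetime import datetime

# Assumed slot boundaries (local time); adjust to the team's actual rota.
def shift_slot(ts: datetime) -> str:
    """Map an event timestamp to a Day / Swing / Night slot."""
    if 8 <= ts.hour < 16:
        return "Day"
    if 16 <= ts.hour < 24:
        return "Swing"
    return "Night"        # 00:00-08:00

print(shift_slot(datetime(2024, 3, 4, 9, 0)))   # Day
print(shift_slot(datetime(2024, 3, 4, 23, 0)))  # Swing
print(shift_slot(datetime(2024, 3, 4, 3, 0)))   # Night
```

Tagging every event with its slot at ETL time is what lets every downstream metric (MTTA, fatigue, Fair-Share) be sliced by slot without re-deriving it.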
5) Dashboards
5.1 Exec (weekly/monthly review)
CFR, MTTR, Auto-Fix Rate, SLO Impact, Revenue-at-Risk (proxy).
Heatmap of slot and domain overload.
5.2 Ops/SRE (per shift/daily)
Real-time panel: open P1-P4, burn rate, queues/replication, guardrails.
Handover card: checklist status and defects.
Fatigue panel: pages/person, nights/person (trailing 4 weeks), warnings.
5.3 Team/Domain
MTTA/MTTR by domain, FCR, Runbook Usage, share of L2/L3 escalations.
Fair-Share and Replacement SLA for the specific team.
6) Formulas and thresholds
Coverage Rate = covered hours / 168. Target ≥ 99%.
Handover SLA = % of shifts where the transfer is completed and the checklist closed within ≤ 15 minutes (target ≥ 95%).
Pager Fatigue (weekly): p95 alerts/person ≤ target; warning at > p90.
Fair-Share Index = 1 − (σ_nights / target_nights). Target ≥ 0.8.
Auto-Fix Rate ≥ 40% for L1 per quarter (target depends on maturity).
Runbook Usage ≥ 70% for recurring alerts (top 10 signals).
Control charts (X-MR, p-charts) for MTTA/MTTR and Defect Rate; alert when values exceed control limits.
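Two of these formulas can be sketched directly in Python; the function names and the sample inputs are illustrative, and a real implementation would read from the pre-aggregates described earlier.

```python
import statistics

def coverage_rate(covered_hours: float) -> float:
    """Share of the 168-hour week with full staffing."""
    return covered_hours / 168

def fair_share_index(nights_per_person, target_nights: float) -> float:
    """1 - (std deviation of night shifts across people / target nights per person)."""
    sigma = statistics.pstdev(nights_per_person)
    return 1 - sigma / target_nights

# One uncovered hour out of 168:
print(round(coverage_rate(167), 4))                   # 0.994 -> below the 99% target? No: 99.4%
# Four people with 4-5 night shifts each against a 4.5-night target:
print(round(fair_share_index([4, 5, 4, 5], 4.5), 3))  # 0.889 -> above the 0.8 target
```

Note the Fair-Share Index rewards low dispersion, not low totals: a team where everyone works many nights can still score 1.0, which is why it is read together with the fatigue metrics.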
7) Analytical methods
Anomalies: STL/ESD/CUSUM on alert volume and MTTA/MTTR; flag outliers and their causes (release, provider).
Load forecasting: Prophet/ARIMA on alerts and P1/P2 per slot → FTE scheduling.
Result attribution: uplift modeling of process changes (e.g., a new handover template) → MTTR.
Controlled experiments: A/B on internal processes (checklist version, new runbook).
Cohort analysis: performance of newcomers (shadow→solo) vs. experienced staff.
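Of the methods above, CUSUM is simple enough to sketch in a few lines. This is a one-sided (upward-drift) variant for MTTA; the slack `k` and threshold `h` values are illustrative, and real deployments tune them from historical variance.

```python
# Minimal one-sided CUSUM sketch for detecting upward MTTA drift; parameters are illustrative.
def cusum_high(series, target, k=0.5, h=4.0):
    """Return indices where the cumulative excess over (target + k) crosses threshold h."""
    s, alarms = 0.0, []
    for i, x in enumerate(series):
        s = max(0.0, s + (x - target - k))  # accumulate only positive drift
        if s > h:
            alarms.append(i)
            s = 0.0                         # reset after raising an alarm
    return alarms

# MTTA (minutes) per alert; a drift begins around the sixth value.
mtta_minutes = [4, 5, 4, 5, 4, 9, 10, 11, 9, 4, 5]
print(cusum_high(mtta_minutes, target=5))  # -> [6, 7]
```

Unlike a plain threshold, CUSUM accumulates small sustained deviations, so a run of slightly slow acks triggers even when no single value looks alarming.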
8) Integrations
Incident bot: posts shift metrics, reminds about an unclosed handover, kicks off retros.
Release portal: links release windows to load peaks; auto-pauses on red SLOs.
Metrics API: ready-made SLO views + exemplars (trace_id) for RCA.
HR/PTO: shrinkage factors → fair-share planning and analytics.
9) Policies and RACI
Ops Analytics Owner (SRE/Platform): data model, dashboards, metric accuracy.
Service Owners: interpretation of domain signals, improvement plans.
Duty Manager: weekly KPI/KRI analysis, slot balance.
Compliance/Sec: Compliance with PII/SoD in telemetry and reporting.
Training Lead: Onboarding plans from analytics findings.
10) Artifact templates
10.1 Metrics catalog (YAML)

```yaml
apiVersion: ops.analytics/v1
kind: MetricCatalog
items:
  - id: coverage_rate
    owner: "SRE"
    formula: "covered_hours / 168"
    slice: ["region", "slot", "domain"]
    target: ">=0.99"
  - id: mtta_p50
    owner: "Ops"
    formula: "median(ack_ts - alert_ts)"
    slice: ["slot", "severity", "domain"]
    target: "<=5m (P1)"
  - id: handover_defect_rate
    owner: "Ops"
    formula: "defects / handovers"
    target: "<=5%"
  - id: pager_fatigue_p95
    owner: "SRE"
    formula: "p95(alerts_per_person_week)"
    target: "<=team_threshold"
```
10.2 Query example (SQL aggregate)

```sql
SELECT slot, domain,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY ack_s - emit_s) AS mtta_p50,
       percentile_cont(0.9) WITHIN GROUP (ORDER BY ack_s - emit_s) AS mtta_p90,
       AVG(auto_fix)::float AS autofix_rate
FROM alerts_fact
WHERE ts BETWEEN :from AND :to AND severity IN ('P1', 'P2')
GROUP BY slot, domain;
```
10.3 Handover checklist (quality signals)
SLO/SLI summary attached
Open incidents have owners/ETA
Planned maintenance/releases are linked
Provider risks are recorded
Comm drafts are ready
On-call contacts are up to date
Watchlist updated
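This checklist is also the data source for the Handover Defect Rate metric defined earlier. A minimal sketch of that computation follows; the key names mirror the checklist items but are assumptions.

```python
# Illustrative handover record: True = item completed; key names are assumed.
checklist = {
    "slo_summary_attached": True,
    "incidents_have_owner_eta": True,
    "releases_linked": False,        # defect: planned releases not linked
    "provider_risks_recorded": True,
    "comm_drafts_ready": True,
    "oncall_contacts_valid": True,
    "watchlist_updated": False,      # defect
}

def defect_rate(checklists) -> float:
    """Handover Defect Rate = blank items / total items across all handovers."""
    total = sum(len(c) for c in checklists)
    defects = sum(1 for c in checklists for done in c.values() if not done)
    return defects / total

print(round(defect_rate([checklist]), 3))  # 2 of 7 items blank -> 0.286
```

Because the bot records each checklist tick as a `handover_check` event, this rate can be computed per shift, per team, or per slot without extra instrumentation.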
11) Risk & Improvement Management
KRI: DLQ/queue-lag growth in the night slot, FCR dropping below target, Info Drift spike.
Improvement plan: weekly Ops Plan with owners/ETA for the top 3 problem areas.
Shift post-mortem discipline: retros on handover defects and alert flapping.
Process A/B: verifying the impact of new procedures on MTTR/Auto-Fix.
12) KPI/OKR examples (quarter)
KR1: MTTR P1 (median) ↓ from 22 min to 15 min.
KR2: Handover SLA ≥ 95% across all three slots.
KR3: Auto-Fix Rate ≥ 45% for the top 10 alerting rules.
KR4: Pager Fatigue p95 ↓ by 20% (after alert optimization).
KR5: Fair-Share Index ≥ 0.85 across all teams.
13) Implementation Roadmap (6-10 weeks)
Weeks 1-2: event schemas, ETL from bot/ITSM/Metrics API, first metrics catalog, basic dashboards.
Weeks 3-4: control charts and thresholds, fatigue panel, handover quality, release linkage.
Weeks 5-6: load forecasting (slots/domains), fair-share and replacement analytics.
Weeks 7-8: auto-suggestions (which runbooks to automate), auto-fix ROI reports, retro templates.
Weeks 9-10: process experiments (A/B checklists), KPIs on exec panels, team training.
14) Antipatterns
Judging "shift success" only by the number of closed tickets (without MTTR/SLO context).
Ignoring handover defects ("it's obvious anyway").
Metrics not normalized by traffic volume/seasonal peaks.
Personalization and "people rankings" that ignore complexity and input conditions.
No fair-share → burnout and more errors.
No correlation with releases/experiments → false conclusions.
Data without a WORM audit trail and without a PII policy.
Result
Shift and performance analytics is a production measurement system on top of ChatOps, ITSM, and telemetry: a clear KPI/KRI taxonomy, correct data models, role-specific dashboards, statistical methods, and linkage to SLOs and business impact. This approach balances load, speeds up response, reduces burnout, and predictably improves the operational quality of an iGaming platform.