Transferring context between shifts
1) Why you need it
A new shift arrives while the system is already running. Handover quality directly affects MTTR, alert noise, and release stability. A good handover is a quick orientation with clear risks and clear next steps.
Objectives:
- Prevent loss of context on incidents, releases, and providers.
- Cut the new shift's ramp-up time to minutes, not hours.
- Keep SLOs stable on critical paths (deposit, bet, game launch, withdrawal).
- Make communications predictable and verifiable.
2) Good handover principles
1. Standardized form (one template, one terminology).
2. Uniform artifacts (links to the same dashboards, tickets, and runbooks).
3. Timebox (a short briefing plus a written long read).
4. Actionable: it ends with an explicit list of who/what/when tasks.
5. SLO-oriented: SLO and error status, not an event log.
6. Traceable: every fact is backed by an artifact.
3) Roles and responsibilities
Shift lead (outgoing): prepares the handover package and runs the briefing.
Shift lead (receiving): records questions and risks, confirms acceptance.
Incident manager: keeps the incident timeline and channel up to date, monitors the update SLA.
Domain owners (Payments/Bets/Games/KYC): provide status and risks for their sections.
SRE/Observability: maintains the artifacts (dashboards, release annotations, alerts).
4) Timing and channels
T-30 min: the outgoing shift freezes the status and updates the template.
T-10 min: quick briefing (15-20 min maximum) on a voice/video channel.
T+0: the handover package is published in the shared channel #ops-handover.
T+15 min: the receiving shift acknowledges receipt and clarifies open questions.
Escalation: all "red" items go immediately to the channel of the responsible team.
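The timeline above can be turned into concrete clock times for a given shift change. The following is a minimal sketch: the offsets come from the list above, while the milestone names and the `handover_schedule` function are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

# Offsets in minutes relative to the shift change (T), taken from the timing list.
# The milestone names are illustrative, not part of any existing tool.
HANDOVER_OFFSETS = {
    "freeze_status": -30,
    "briefing": -10,
    "publish_package": 0,
    "ack_deadline": 15,
}


def handover_schedule(shift_change: datetime) -> dict[str, datetime]:
    """Return the wall-clock time of each handover milestone."""
    return {
        name: shift_change + timedelta(minutes=offset)
        for name, offset in HANDOVER_OFFSETS.items()
    }


if __name__ == "__main__":
    # Example: a shift change at 20:00 local time.
    for name, at in handover_schedule(datetime(2024, 1, 1, 20, 0)).items():
        print(f"{at:%H:%M}  {name}")
```

A schedule like this could feed calendar reminders or a chat bot, so that the freeze and the acknowledgement deadline do not depend on someone remembering them.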
5) Handover package structure (template)
Handoff - <date, time, TZ>
Shift: <outgoing> → <receiving>
Overall SLO status (last 4h):
- API p95/p99: <values/trends>
- Error rate: <values/trends>
- Queue lag/DB connections/Cache: <brief>
Critical incidents:
- <INC-123>: status, impact, next update ETA, links (ticket, channel, postmortem draft)
Providers (PSP/KYC/studios):
- PSP-X: quotas/errors/failover <links>
- KYC-A: Webhook delays <links>
Releases/Features:
- In progress: <service>, stage (canary X%), gate/metrics, risk
- Scheduled: windows/locks/dependencies
Risks and observations:
- <briefly, with links and graphs>
Action items (before <time>):
- [Owner] <task>, readiness criterion
Useful links:
- Dashboard overview, dependency map, escalation matrix, runbooks
On-call contacts:
- Domains/Names/Channels
6) Handover Mini SOP
1. The outgoing shift updates release annotations and dashboards (SLO, providers, queues).
2. It reviews the "red" alerts from the last 4 hours and records the status and cause of each.
3. It updates the "Risks and observations" section (trends and suspicions rather than established facts).
4. It fills in the Action items with deadlines and owners.
5. It holds the briefing: 10-15 minutes, strictly following the template.
6. The receiving shift asks questions and, if necessary, escalates immediately to the owners.
7. The receiving shift confirms acceptance ("received, questions: yes/no") and lists its first steps.
7) Handover Quality Metrics (KPI)
Handoff Quality Score (HQS): a 0-100 score for the package, based on a checklist.
Handoff Time: briefing duration (target range 10-20 min).
Acknowledgement SLA: ≤ 15 minutes.
Missing Context Rate: the share of incidents with a "loss of context" after a shift change.
Post-Handoff Incident Spike: the increase in alerts/incidents in the first 60 minutes after the handover.
Action Items SLA: the share of tasks closed on time after the shift change.
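Several of these KPIs are simple ratios over handover records. The sketch below shows one way to compute them. The `HandoffRecord` fields and the `handoff_kpis` function are hypothetical; the thresholds (15 minutes, 10-20 minutes) are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class HandoffRecord:
    """One handover; all field names are illustrative assumptions."""
    briefing_minutes: float
    ack_minutes: float            # publication -> acknowledgement
    incidents_after: int          # incidents handled after the shift change
    incidents_lost_context: int   # of those, flagged as "loss of context"
    action_items_total: int
    action_items_on_time: int


def ratio(part: float, whole: float) -> float:
    """Safe division: returns 0.0 when there is nothing to measure."""
    return part / whole if whole else 0.0


def handoff_kpis(records: list[HandoffRecord]) -> dict[str, float]:
    n = len(records)
    return {
        # Share of handovers acknowledged within 15 minutes.
        "ack_sla_met": ratio(sum(r.ack_minutes <= 15 for r in records), n),
        # Share of briefings inside the 10-20 minute target range.
        "briefing_in_range": ratio(
            sum(10 <= r.briefing_minutes <= 20 for r in records), n
        ),
        "missing_context_rate": ratio(
            sum(r.incidents_lost_context for r in records),
            sum(r.incidents_after for r in records),
        ),
        "action_items_sla": ratio(
            sum(r.action_items_on_time for r in records),
            sum(r.action_items_total for r in records),
        ),
    }
```

How an incident gets flagged as "loss of context" is a process decision (for example, a postmortem tag) and is outside the scope of this sketch.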
8) Package quality checklist (HQS assessment)
- SLOs and key metrics for the last 4 hours are filled in, with trends.
- All "red" alerts are listed with causes and links.
- Incidents: number, status, impact, time of next update.
- Providers: quotas/errors/failover, latest changes.
- Releases/Features: stage, risks, gates/canary.
- Action items: owner, deadline, readiness criterion.
- Links: dashboards, channels, runbooks, escalation matrix.
- On-call contacts and backup contact channels.
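The checklist above can be scored mechanically. The text does not specify how checklist items map to the 0-100 scale, so this sketch assumes equal weights for the eight items. The item keys and the `hqs` function are hypothetical.

```python
# One key per checklist item above; the names are illustrative.
CHECKLIST = (
    "slo_metrics_4h_with_trends",
    "red_alerts_with_causes",
    "incidents_complete",
    "providers_status",
    "releases_status",
    "action_items_complete",
    "links_present",
    "oncall_contacts",
)


def hqs(passed: set[str]) -> int:
    """Handoff Quality Score, assuming every checklist item weighs the same."""
    unknown = passed - set(CHECKLIST)
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    return round(100 * len(passed) / len(CHECKLIST))
```

If some items matter more (for example, incidents and action items), the equal weights could be replaced with a per-item weight table that sums to 100.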
9) Handover dashboards (minimum)
Operations Overview: p95/p99, error rate, capacity headroom, queue lag.
Incidents Board: open incidents, update ETAs, impact.
Release & Feature: canaries, before/after comparison, autogates.
Providers Panel: quotas, timeouts, cost/1k calls, switches.
Dependency Map: latency/errors/retries.
10) Handover quality alerts (ideas)
ALERT HandoffNotPublished
IF handoff_published == 0 AND within(10m, shift_change) == true
LABELS {severity="warning", team="ops"}
ALERT HandoffAckSLA
IF handoff_ack_minutes > 15
LABELS {severity="warning", team="ops"}
ALERT MissingActionOwners
IF count_over_time(handoff_action_items{owner=""}[1h]) > 0
LABELS {severity="warning", team="ops"}
ALERT PostHandoffIncidentSpike
IF incidents_rate_60m_after_shift > baseline_14d * 1.5
LABELS {severity="info", team="ops"}
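The PostHandoffIncidentSpike condition can also be checked outside the alerting system, for example in a weekly audit script. The sketch below assumes the baseline is the mean incident count in the first 60 minutes after each shift change over the previous 14 days. That definition and the function name are assumptions; the factor 1.5 comes from the rule above.

```python
from statistics import mean

SPIKE_FACTOR = 1.5  # multiplier from the PostHandoffIncidentSpike rule


def is_post_handoff_spike(
    incidents_first_60m: int,
    baseline_counts: list[int],
    factor: float = SPIKE_FACTOR,
) -> bool:
    """True if the first-hour incident count exceeds baseline * factor.

    baseline_counts: first-hour incident counts for earlier shift changes
    (assumed to cover the previous 14 days).
    """
    if not baseline_counts:
        # No history yet: do not raise a spike on an undefined baseline.
        return False
    return incidents_first_60m > mean(baseline_counts) * factor
```

The comparison is strict, matching the `>` in the rule: a count exactly at 1.5 times the baseline does not count as a spike.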
11) Communications and update format
Short update template (for the shared channel):
[HH:MM] Handoff published. SLO OK/Degraded. Incidents: INC-123 (ETA 18:30), releases: bets-api canary 10%. Risks: PSP-X 85% quota. Action items: @squad-payments to verify failover by 19:00.
Rules:
- No private chats for critical items; use shared channels only.
- Any "red" zone gets an immediate thread with the owners.
- All decisions and trade-offs are recorded in writing, with a reference to the data.
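Generating the short update from structured fields keeps its wording identical from shift to shift. The following is a minimal sketch with a hypothetical `short_update` function. Its punctuation differs slightly from the template above: each section is a separate sentence, and empty sections are written as "none".

```python
def short_update(
    time_hhmm: str,
    slo_status: str,
    incidents: list[str],
    releases: list[str],
    risks: list[str],
    actions: list[str],
) -> str:
    """Build a one-line handover update for the shared channel."""

    def section(label: str, items: list[str]) -> str:
        # An empty section is stated explicitly rather than silently dropped.
        return f"{label}: {', '.join(items) if items else 'none'}."

    return " ".join([
        f"[{time_hhmm}] Handoff published.",
        f"SLO {slo_status}.",
        section("Incidents", incidents),
        section("Releases", releases),
        section("Risks", risks),
        section("Action items", actions),
    ])
```

Stating "none" explicitly lets the receiving shift tell "nothing to report" apart from "someone forgot to fill this in".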
12) Domain specifics (iGaming)
Payments: deposit conversion and authorization time are the priority; PSP failover routes; per-provider limits.
Bets: odds and cache updates, streaming/queue load, settlement delay.
Games/Live: broadcast events (jackpots/streams), website limits, UI degradation.
KYC/AML: verification queue, provider SLAs, sensitivity to peaks.
13) Anti-patterns
Free-form handover (everyone writes however they like).
No deadline for acknowledging acceptance.
A package without action items and owners.
The handover turns into reading out logs instead of covering SLOs and risks.
Decisions made in private chats, with no traceability.
The template contains no links to artifacts, so nothing can be verified.
14) Integrations and artifacts
Release annotations on graphs, with automatic links to the handover.
Link unfurling: links to dashboards/tickets are inserted with a preview of key metrics.
Runbook bindings: each "red" zone has a direct link to a specific runbook.
Escalation matrix: the template links to a single up-to-date document.
15) Retention policy and audit
Handovers are stored centrally (geo, date/time, authors).
Weekly HQS audit and spot reviews of poor handovers.
Template review: quarterly, or following postmortem findings.
16) Quick start (30 days)
Week 1: approve the template, roles, and timing; start a pilot on a single line (for example, Payments).
Week 2: enable the handover dashboards and the HandoffNotPublished/HandoffAckSLA alerts.
Week 3: introduce the HQS score and audit 10% of handovers.
Week 4: expand to Bets/Games/KYC, hold a retrospective, update the SOP.
17) Example of a "risk card" for a package
Risk: PSP-X hits 90% quota in prime time
Impact: rise in declined deposits, payments SLO at risk
Signals: outbound_error_rate, quota_usage_ratio
Mitigation: shift up to 20% of traffic to PSP-Y in advance, enable token cache
Owner/ETA: integrations@oncall / by 18:00
18) FAQ
Q: What if the briefing drags on?
A: Enforce a strict timebox and an "in the thread after the briefing" rule. The package should contain everything needed for asynchronous review.
Q: How do we deal with "different versions of the truth"?
A: Unify the artifacts: shared dashboards, release annotations, a single source of truth (SSOT) for SLAs; link only to those.
Q: Does the briefing need to be recorded?
A: Yes, for disputed cases and training. But the recording does not replace the standardized written package.