Transferring context between shifts

1) Why it matters

When a new shift starts, the system is already running. Handover quality directly affects MTTR, alert noise and release stability. A good handover gives a quick overview, clear risks and understandable next steps.

Objectives:

Eliminate loss of context for incidents, releases and providers.
Reduce the "entry time" of a new shift to minutes, not hours.
Stabilize SLO-critical paths (deposit, bet, game launch, withdrawal).
Make communications predictable and verifiable.

2) Good handover principles

1. Standardized form (one template, one terminology).
2. Uniform artifacts (links to the same dashboards/tickets/runbooks).
3. Timebox (a short briefing plus a written long read).
4. Actionable: the package ends with an explicit list of "who/what/when" tasks.
5. SLO orientation: SLO/error status, not an "event log."
6. Traceability: any fact is confirmed by an artifact.

3) Roles and responsibilities

Shift lead (outgoing): prepares the handover package, holds the briefing.
Shift lead (receiving): records questions/risks, confirms acceptance.
Incident manager: updates the incident timeline/channel, monitors the update SLA.
Domain owners (Payments/Bets/Games/KYC): provide "status and risk" for their sections.
SRE/Observability: maintains the artifacts (dashboards, release annotations, alerts).

4) Timing and channels

T-30 min: the outgoing shift freezes the status and updates the template.
T-10 min: quick briefing (15-20 min maximum) on a voice/video channel.
T+0: the handover package is published in the shared channel #ops-handover.
T+15 min: the receiving shift confirms receipt and clarifies open questions.
Escalation: all "red" items go immediately to the channel of the corresponding team.
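
The checkpoints above can be derived mechanically from the shift-change time. The following is a minimal sketch, not part of the original process description; the function name `handover_schedule` and the checkpoint keys are hypothetical, while the minute offsets come from the timeline above.

```python
from datetime import datetime, timedelta

# Offsets in minutes relative to the shift change (T), taken from the timeline.
# The key names are illustrative, not an established convention.
OFFSETS = {
    "freeze_status": -30,
    "briefing_start": -10,
    "publish_package": 0,
    "ack_deadline": 15,
}


def handover_schedule(shift_change):
    """Return the wall-clock time of each handover checkpoint."""
    return {
        name: shift_change + timedelta(minutes=minutes)
        for name, minutes in OFFSETS.items()
    }


schedule = handover_schedule(datetime(2024, 1, 15, 20, 0))
for name, when in schedule.items():
    print(f"{when:%H:%M}  {name}")
```

A scheduler or chat bot could use such a table to post reminders at each checkpoint.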

5) Handover package structure (template)


Handoff - <date, time, TZ>
Shift: <outgoing> → <receiving>
Overall SLO status (last 4h):
- API p95/p99: <values/trends>
- Error rate: <values/trends>
- Queue lag/DB connections/Cache: <brief>
Critical incidents:
- <INC-123>: status, impact, next update ETA, links (ticket, channel, postmortem draft)
Providers (PSP/KYC/studios):
- PSP-X: quotas/errors/failover <links>
- KYC-A: Webhook delays <links>
Releases/Features:
- In progress: <service>, stage (canary X%), gate/metrics, risk
- Scheduled: windows/locks/dependencies
Risks and observations:
- <briefly, with links and graphs>
Action items (before <time>):
- [Owner] <task>, readiness criterion
Useful links:
- Overview dashboard, dependency map, escalation matrix, runbooks
On-call contacts:
- Domains/Names/Channels

6) Handover Mini SOP

1. The outgoing shift updates release annotations and dashboards (SLO, providers, queues).
2. Checks the "red" alerts for the last 4 hours and records the status/reason.
3. Updates the "Risks and observations" section (trends and suspicions, as opposed to confirmed facts).
4. Fills in Action items with deadlines and owners.
5. Holds a briefing: 10-15 minutes, strictly according to the template.
6. The receiving shift asks questions; if necessary - instant escalation to the owners.
7. The receiving shift confirms acceptance: "received, questions: yes/no," plus a list of first steps.

7) Handover Quality Metrics (KPI)

Handoff Quality Score (HQS) - a package score (0-100) based on a checklist.
Handoff Time - briefing duration (target range 10-20 min).
Acknowledgement SLA - acceptance confirmed within 15 minutes.
Missing Context Rate - the proportion of incidents with a "loss of context" after a shift change.
Post-Handoff Incident Spike - an increase in alerts/incidents in the first 60 minutes after a shift change.
Action Items SLA - the proportion of tasks closed on time after the shift change.
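
Three of these KPIs are plain ratios and are easy to compute from per-handover records. The sketch below is illustrative only: the `HandoffRecord` fields and the `handoff_kpis` function are assumptions about how the data might be stored, not an existing schema; only the 15-minute threshold comes from the list above.

```python
from dataclasses import dataclass

ACK_SLA_MINUTES = 15  # Acknowledgement SLA from the KPI list


@dataclass
class HandoffRecord:
    # Hypothetical per-handover record; field names are assumptions.
    ack_minutes: float           # time from publication to acknowledgement
    incidents_total: int         # incidents open across the shift change
    incidents_lost_context: int  # of those, flagged as "context lost"
    actions_total: int           # action items in the package
    actions_on_time: int         # of those, closed before their deadline


def ratio(part, whole):
    """Safe division: an empty denominator yields 0.0 instead of raising."""
    return part / whole if whole else 0.0


def handoff_kpis(records):
    """Aggregate ratio-style KPIs over a list of handover records."""
    return {
        "ack_sla_met": ratio(
            sum(r.ack_minutes <= ACK_SLA_MINUTES for r in records), len(records)
        ),
        "missing_context_rate": ratio(
            sum(r.incidents_lost_context for r in records),
            sum(r.incidents_total for r in records),
        ),
        "action_items_sla": ratio(
            sum(r.actions_on_time for r in records),
            sum(r.actions_total for r in records),
        ),
    }


records = [HandoffRecord(10, 4, 1, 5, 4), HandoffRecord(20, 6, 1, 5, 5)]
result = handoff_kpis(records)
print(result)
```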

8) Package quality checklist (HQS assessment)

SLOs/key metrics for the last 4 hours are filled in, with trends.
All "red" alerts are listed with reasons/links.
Incidents: number, status, impact, next update (time).
Providers: quotas/errors/failover, latest changes.
Releases/Features: stage, risks, gates/canary.
Action items: owner, deadline, readiness criterion.
Links: dashboards, channels, runbooks, escalation matrix.
On-call contacts and backup links.
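
One simple way to turn this checklist into a 0-100 HQS is to score the share of items satisfied. This is a sketch under assumptions: the source does not say how items are weighted, so equal weights are used here, and the item keys and the `hqs_score` function are hypothetical names.

```python
# One key per checklist item above; names are illustrative.
CHECKLIST = [
    "slo_metrics_4h",
    "red_alerts_listed",
    "incidents_complete",
    "providers_status",
    "releases_status",
    "action_items_complete",
    "links_present",
    "oncall_contacts",
]


def hqs_score(passed):
    """Return a 0-100 score: the share of checklist items satisfied.

    Equal weighting is an assumption; a team may prefer to weight
    incidents or action items more heavily.
    """
    unknown = set(passed) - set(CHECKLIST)
    if unknown:
        raise ValueError(f"unknown checklist items: {sorted(unknown)}")
    return round(100 * len(set(passed)) / len(CHECKLIST))


score = hqs_score({
    "slo_metrics_4h",
    "red_alerts_listed",
    "incidents_complete",
    "providers_status",
    "links_present",
    "oncall_contacts",
})
print(score)  # 6 of 8 items satisfied -> 75
```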

9) Dashboards "for handover" (minimum)

Operations Overview: p95/p99, error rate, capacity headroom, queue lag.
Incidents Board: open incidents, update ETAs, impact.
Release & Feature: canaries, before/after comparison, autogates.
Providers Panel: quotas, timeouts, cost/1k calls, switchovers.
Dependency Map: latency/errors/retries.

10) Alerts on the quality of handovers (ideas)


ALERT HandoffNotPublished
IF handoff_published == 0 AND within(10m, shift_change) == true
LABELS {severity="warning", team="ops"}

ALERT HandoffAckSLA
IF handoff_ack_minutes > 15
LABELS {severity="warning", team="ops"}

ALERT MissingActionOwners
IF count_over_time(handoff_action_items{owner=""}[1h]) > 0
LABELS {severity="warning", team="ops"}

ALERT PostHandoffIncidentSpike
IF incidents_rate_60m_after_shift > baseline_14d * 1.5
LABELS {severity="info", team="ops"}
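
The same four conditions can be prototyped outside the alerting system before committing to rule syntax. The sketch below is an assumption-laden illustration: `HandoffState` and `firing_alerts` are hypothetical names, "within 10m of the shift change" is read as an absolute distance of at most 10 minutes, and the spike threshold is read as the 14-day baseline multiplied by 1.5.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffState:
    # Hypothetical snapshot of handover state; field names are assumptions.
    published: bool
    minutes_from_shift_change: float  # negative before the change, positive after
    ack_minutes: float                # minutes elapsed until (or still waiting for) the ack
    action_owners: list = field(default_factory=list)  # one owner string per action item
    incidents_60m: float = 0.0        # incident rate in the 60 min after the change
    baseline_14d: float = 0.0         # 14-day baseline for the same rate


def firing_alerts(s):
    """Return the names of the handover-quality alerts that would fire."""
    alerts = []
    if not s.published and abs(s.minutes_from_shift_change) <= 10:
        alerts.append("HandoffNotPublished")
    if s.ack_minutes > 15:
        alerts.append("HandoffAckSLA")
    if any(not owner.strip() for owner in s.action_owners):
        alerts.append("MissingActionOwners")
    if s.incidents_60m > s.baseline_14d * 1.5:
        alerts.append("PostHandoffIncidentSpike")
    return alerts


bad = HandoffState(False, 5, 20, ["payments", ""], incidents_60m=4, baseline_14d=2)
good = HandoffState(True, 5, 10, ["payments"], incidents_60m=2, baseline_14d=2)
print(firing_alerts(bad))
print(firing_alerts(good))
```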

11) Communications and update format

Short update template (to shared channel):


[HH:MM] Handoff published. SLO OK/Degraded. Incidents: INC-123 (ETA 18:30). Releases: bets-api canary 10%. Risks: PSP-X 85% quota. Action items: @squad-payments to verify failover by 19:00.
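
A message in this shape can be generated from structured fields so that every shift posts it identically. The following is a minimal sketch; the `short_update` function and its parameters are hypothetical, and the sample values are the ones used in the template line above.

```python
def _join(items):
    """Comma-join a list, or say "none" so the section is never silently empty."""
    return ", ".join(items) if items else "none"


def short_update(time_hhmm, slo_status, incidents, releases, risks, actions):
    """Build the one-line handover announcement for the shared channel."""
    return (
        f"[{time_hhmm}] Handoff published. SLO {slo_status}. "
        f"Incidents: {_join(incidents)}. "
        f"Releases: {_join(releases)}. "
        f"Risks: {_join(risks)}. "
        f"Action items: {_join(actions)}."
    )


message = short_update(
    "18:00",
    "OK",
    incidents=["INC-123 (ETA 18:30)"],
    releases=["bets-api canary 10%"],
    risks=["PSP-X 85% quota"],
    actions=["@squad-payments to verify failover by 19:00"],
)
print(message)
```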

Rules:

No private chats for critical items - shared channels only.
Any "red" zone gets an immediate thread with the owners.
All decisions/trade-offs are recorded in writing, with a link to the data.

12) Domain Features (iGaming)

Payments: the priorities are deposit conversion and authorization time, PSP failover routes, per-provider limits.
Bets: odds/cache updates, streaming/queue load, settlement delay.
Games/Live: broadcast events (jackpots/streams), website limits, UI degradation.
KYC/AML: check queue, provider SLAs, sensitivity to peaks.

13) Anti-patterns

Free-form handover (everyone writes however they like).
No deadline for confirming acceptance.
Package without action items and owners.
Handover turns into reading the log aloud instead of covering SLOs/risks.
Decisions made in private chats - no traceability.
The template contains no links to artifacts - there is nothing to verify against.

14) Integrations and artifacts

Annotations of releases on graphs, auto-links to handover.
Link unfurling: inserting links to dashboards/tickets with a preview of key metrics.
Runbook bindings: each "red" zone with a direct link to a specific runbook.
Escalation matrix: a single up-to-date document, linked from the template.

15) Retention policy and audit

Handovers are stored centrally (geos, date/time, authors).
Weekly HQS audit and spot checks of bad handovers.
Revision of the template - quarterly or based on the results of post-mortems.

16) Quick start (30 days)

Week 1: approve the template, roles and timing; start a pilot on one line (for example, Payments).
Week 2: enable the handover dashboards and the HandoffNotPublished/AckSLA alerts.
Week 3: introduce the HQS score and an audit of 10% of handovers.
Week 4: expand to Bets/Games/KYC, hold a retrospective, update the SOP.

17) Example of a "risk card" for a package


Risk: PSP-X hits 90% quota in prime time
Impact: rise in deposit declines, payments SLO at risk
Signals: outbound_error_rate, quota_usage_ratio
Mitigation: raise PSP-Y up to 20% of traffic in advance, enable token cache
Owner/ETA: integrations@oncall / by 18:00

18) FAQ

Q: What if the briefing drags on?
A: Enforce a strict timebox and a "take it to the thread after the briefing" rule. The package should contain everything needed for asynchronous reading.

Q: How do we deal with "different versions of the truth"?
A: Unify artifacts: unified dashboards, release annotations, SSOT for SLA; link only to them.

Q: Does the briefing need to be recorded?
A: Yes, for controversial cases and training. But the record does not replace the standardized written package.
