Shift handover and task transfer
1) Why formalize shift handovers
A shift change is a high-risk moment: context gets lost, reaction time increases, and actions get duplicated. A formalized process reduces MTTA/MTTR, eliminates forgotten loose ends, and supports compliance by recording who accepted responsibility and when.
2) Roles and coverage model
Primary on-call (P1) - first response, triage, and coordination until the IC arrives.
Secondary on-call (P2) - backup; joins during overload or escalation.
Duty Manager/IC-of-the-day - incident leader for SEV-1+.
Follow-the-sun (coverage across multiple time zones) or follow-the-moon (night coverage handled by other regions).
Time windows: avoid releases and risky work within ±30 minutes of a shift change.
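The ±30-minute rule can be enforced mechanically, for example as a gate in a deploy pipeline. The following is a minimal sketch, assuming an 8-hour rotation with handovers at 02:00, 10:00 and 18:00 UTC; the boundary hours and the function name `in_handover_freeze` are illustrative assumptions, not part of any existing tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: an 8-hour rotation with handovers at 02:00, 10:00 and 18:00 UTC.
SHIFT_BOUNDARIES_UTC = (2, 10, 18)
FREEZE = timedelta(minutes=30)


def in_handover_freeze(ts, boundaries=SHIFT_BOUNDARIES_UTC, freeze=FREEZE):
    """Return True if a timezone-aware `ts` is within ±freeze of a shift boundary."""
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Look at yesterday, today and tomorrow so that a window straddling
    # midnight (for example around a 00:00 boundary) is still detected.
    for day_offset in (-1, 0, 1):
        for hour in boundaries:
            boundary = midnight + timedelta(days=day_offset, hours=hour)
            if abs(ts - boundary) <= freeze:
                return True
    return False
```

A release job could call `in_handover_freeze(datetime.now(timezone.utc))` and refuse to start, or require explicit IC approval, when it returns True.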
3) Rotation schedules (examples)
24/7, 8-hour shifts: morning/day/night, 3 teams, P1 + P2.
24/7, 12-hour shifts: fewer handovers but a higher risk of fatigue, so "recovery windows" are needed.
5 × 8 (workdays) + weekend pool: weekday primary coverage by the product team, weekends by platform/SRE.
Hybrid: weekdays during office hours; nights/weekends via follow-the-sun.
Fairness rules: calendar rotation, holiday/vacation accounting, maximum N night shifts per period.
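The fairness rules can be expressed as constraints on a schedule generator. Below is a greedy sketch for the P1 slot of a 24/7, 8-hour rotation: each slot goes to the least-loaded engineer who is available, has not already worked that day, and is under the night-shift cap. The function name `build_rotation`, the shift names, and the tie-breaking rule are assumptions made for illustration; a real scheduler would also need P2 assignment and rest-time rules.

```python
from collections import Counter
from datetime import date, timedelta

SHIFTS = ("morning", "day", "night")


def build_rotation(engineers, start, days, max_nights, unavailable=None):
    """Greedy P1 schedule: {(date, shift): engineer}.

    `unavailable` maps an engineer to a set of dates off (holidays/vacation).
    `max_nights` caps night shifts per engineer over the whole period.
    """
    unavailable = unavailable or {}
    total = Counter()   # shifts assigned so far, per engineer
    nights = Counter()  # night shifts assigned so far, per engineer
    schedule = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        on_duty_today = set()
        for shift in SHIFTS:
            eligible = [
                e for e in engineers
                if e not in on_duty_today
                and day not in unavailable.get(e, set())
                and (shift != "night" or nights[e] < max_nights)
            ]
            if not eligible:
                raise ValueError(f"no eligible engineer for {day} {shift}")
            # Fewest shifts so far wins; list order breaks ties.
            pick = min(eligible, key=lambda e: total[e])
            schedule[(day, shift)] = pick
            total[pick] += 1
            if shift == "night":
                nights[pick] += 1
            on_duty_today.add(pick)
    return schedule
```

The greedy approach can fail with a `ValueError` on schedules that a smarter search would solve; that failure is deliberate, since an unfillable slot should be surfaced to a human rather than silently assigned.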
4) Shift Handover Card
Minimum content standard:
- When and who: date/time (UTC and local), outgoing → incoming engineer; P1/P2 contacts.
- Systems status: SLO/SLA summary, active alerts, known degradations.
- Open incidents: ID, SEV, current step, owner, next action/ETA.
- Risks for the shift window: planned work, releases, migrations, near-limit conditions (provider quotas).
- Critical tickets/tasks: priority, blockers, deadlines.
- External communications: active posts on the status page/client updates.
- Known workarounds: enabled degradation feature flags, temporary limits.
- Domain specifics: payment providers/KYC/CDN - their statuses and routing.
- Housekeeping: who is on call tomorrow, windows when people are unavailable (meetings/flights).
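A card with a fixed structure can be checked for completeness before the handover is accepted. The sketch below models a subset of the fields listed above and reports which ones are empty; the class names, field names, and the `card_gaps` helper are hypothetical and would need to be mapped to whatever the card actually lives in (wiki page, ticket, bot form).

```python
from dataclasses import dataclass, field, fields


@dataclass
class Incident:
    id: str
    sev: str
    owner: str
    next_step: str
    next_update_utc: str


@dataclass
class HandoverCard:
    shift_window_utc: str
    outgoing: str
    incoming: str
    slo_summary: str
    incidents: list = field(default_factory=list)    # list of Incident
    risks: list = field(default_factory=list)        # planned work, releases
    workarounds: list = field(default_factory=list)  # enabled flags, limits


REQUIRED_TEXT_FIELDS = ("shift_window_utc", "outgoing", "incoming", "slo_summary")


def card_gaps(card):
    """Return the names of required fields that are empty or whitespace-only."""
    gaps = [name for name in REQUIRED_TEXT_FIELDS
            if not getattr(card, name).strip()]
    # Every open incident must carry SEV, owner, next step and next update time.
    for inc in card.incidents:
        for f in fields(inc):
            if not getattr(inc, f.name).strip():
                gaps.append(f"{inc.id or '<no id>'}.{f.name}")
    return gaps
```

An empty list from `card_gaps` means the minimum standard is met; a handover bot could refuse to record the transfer until it is.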
5) "Hand over shift" checklist (outgoing party)
- Updated the shift card (all fields) and pinned the link in the `#oncall-handover` channel.
- Moved "verbal knowledge" into tickets/notes; no tasks left "in someone's head."
- All incidents have: SEV, owner, next step, next update time.
- The status page and client updates correspond to the actual status.
- Disabled noisy/false alerts (following the procedure) or flagged them on the card.
- Checked external provider quotas/limits for the next shift window.
- Synced by voice/video for 5-10 minutes (if a SEV-1+ incident is active).
- Recorded the handover (bot/ticket) and named the receiver.
6) "I accept shift" checklist (receiving party)
- Read the card and clarified open questions.
- Checked SLO/alert dashboards for the last 2-4 hours.
- Confirmed the P1/P2 role in the bot (assign) and verified the pager's sound and channels.
- Assumed ownership of active incidents and reset their update timers.
- Checked planned work/releases and canceled risky operations for the first 30 minutes.
- Posted an "echo message" to the channel: "I took over the shift, active incidents: ..., next update at ..."
7) Communication standards
Channels: `#oncall`, `#incident-warroom-<ID>`, `#statuspage`.
Update intervals: SEV-0: 15 min, SEV-1: 30 min, SEV-2+: 60 min.
Update format: Impact - Diagnostics - Actions - Next update (time).
Escalation: no progress in N minutes → bring in TL/Platform/DB/Sec per the escalation matrix.
Clarity of ownership: every action has an owner and an ETA.
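The update intervals translate directly into a deadline calculation that a bot can use for reminders. This is a minimal sketch; it assumes that any severity other than SEV-0 and SEV-1 falls under the 60-minute interval, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Intervals from the communication standard, in minutes.
UPDATE_INTERVAL_MIN = {"SEV-0": 15, "SEV-1": 30}
DEFAULT_INTERVAL_MIN = 60  # assumption: SEV-2 and anything less severe


def next_update_due(sev, last_update):
    """Return the time by which the next incident update must be posted."""
    minutes = UPDATE_INTERVAL_MIN.get(sev, DEFAULT_INTERVAL_MIN)
    return last_update + timedelta(minutes=minutes)


def is_update_overdue(sev, last_update, now):
    """True if the update deadline for this incident has already passed."""
    return now > next_update_due(sev, last_update)
```

The receiving engineer's "reset update timers" step then amounts to recording a fresh `last_update` for each inherited incident.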
8) Transfer of tasks (non-incident)
Transfer criteria: the task blocks an SLO/release/compliance requirement or is close to its deadline.
Format: a ticket with a "definition of next step" and the expected result; all artifacts (logs/screenshots/graphs) are attached.
Prioritization: a Kanban swimlane "On-call Handover."
Deadlines: transferred tasks have due dates; delays are escalated to the service owner.
9) Automation and integration
Rotation calendar: synchronization with pager; the bot publishes "who is on duty" at the beginning of the shift.
ChatOps: `/handover start`, auto-population of the card from sources (SLO statuses, open incidents, releases).
Ticketing: automatic owner assignment based on P1/P2; "handover" tags.
Status page: bridge to public updates with templates.
Audit: handover log (who accepted and when), linked to SEV incidents and reports.
10) Fatigue Management
Limits: maximum X pages/hour and Y in a row at night; beyond that, hand off to P2/escalate.
Quiet hours for non-critical alerts (tickets instead of paging).
After-hours compensation and post-incident rest.
Training and shadowing for new on-call engineers.
Retrospectives of noisy shifts → tuning of alerts and playbooks.
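The pages-per-hour limit can be checked with a sliding window over page timestamps. The sketch below covers only the "X pages/hour" part of the rule; the threshold value and the function name `should_escalate_to_p2` are assumptions, and the "Y in a row at night" condition is left out because its exact definition depends on the team.

```python
from datetime import datetime, timedelta, timezone


def should_escalate_to_p2(page_times, now, max_per_hour):
    """True if P1 received more than `max_per_hour` pages in the last hour."""
    window_start = now - timedelta(hours=1)
    # Count pages in the half-open window (now - 1h, now].
    recent = [t for t in page_times if window_start < t <= now]
    return len(recent) > max_per_hour
```

For example, with `max_per_hour=3`, a fourth page within the same hour would trigger the handoff to P2.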
11) Quality metrics for shifts and handovers
Handover Defect Rate: proportion of incidents with context loss during a shift change.
MTTA around shift change: median/peaks within ±30 min of the switch.
Missed/late updates: overdue SEV updates.
Alert hygiene: % of false pages; alerts without a runbook/owner.
Load per shift: pages/hour, average duration of active work.
Satisfaction: shift NPS (on-call survey), self-reported fatigue on a scale.
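Two of these metrics are simple enough to compute from raw page and incident data. The sketch below assumes pages are available as `(fired_at, acked_at)` pairs and that the denominator of the defect rate is the number of incidents that crossed a handover; both the data shape and that choice of denominator are assumptions to adapt locally.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def mtta_near_switch(pages, switch, window=timedelta(minutes=30)):
    """Median time-to-acknowledge, in seconds, for pages fired within
    ±window of the shift switch. Returns None if there are no such pages."""
    delays = [
        (acked - fired).total_seconds()
        for fired, acked in pages
        if abs(fired - switch) <= window
    ]
    return median(delays) if delays else None


def handover_defect_rate(handed_over, with_context_loss):
    """Share of handed-over incidents where context was lost (0.0 if none)."""
    return with_context_loss / handed_over if handed_over else 0.0
```

Comparing `mtta_near_switch` against the MTTA for the rest of the shift shows whether the handover itself is slowing down acknowledgement.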
12) Link to Incident Management and RCA
Active incidents are not closed at the moment of the shift change; responsibility is explicitly transferred and recorded.
The RCA must include a "Shift Impact" section: was there context drift, a late update, or a duplicated action?
CAPA: improvements to the card, checklists, automation, and training.
13) Security, compliance and confidentiality
PII/secrets are prohibited in the free text of cards; link to secure storage instead.
Temporary access: on-call rights are issued for the shift window (JIT/JEA), with key rotation.
Audit trail: immutable log of who read/changed the card and the status page.
Regulatory: client notification deadlines are tracked in the shift card.
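One common way to approximate an immutable audit trail is a hash chain, where each entry stores the hash of the previous one so that later edits become detectable. The sketch below illustrates the idea with the standard library only; the entry fields and function names are assumptions, and the result is tamper-evident rather than truly immutable (that would also require write-once storage or an external anchor).

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    # Canonical JSON (sorted keys) so the same body always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, actor, action, ts_utc):
    """Append a handover audit entry linked to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "ts_utc": ts_utc, "prev": prev}
    log.append({**body, "hash": _digest(body)})


def verify_chain(log):
    """Return True if no entry has been altered, removed or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("actor", "action", "ts_utc", "prev")}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

Truncating the tail of the log is not detected by this scheme on its own; periodically publishing the latest hash to a separate system is one way to close that gap.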
14) Anti-patterns
"I'll pass it on verbally" without a card/ticket.
Release exactly at shift change without an IC and backup.
Pager held by someone "on a plane/in the subway" without a P2.
Card as a wall of text without next step/ETA.
Triage in private chats - information is lost, auditing is impossible.
No record of the handover - disputes over who was responsible.
15) Templates
Shift card template (condensed)
Shift: 2025-11-01 18:00-02:00 UTC (local: Europe/Kyiv 20:00-04:00)
P1: @duty-alex P2: @duty-olga IC: @ic-of-day
SLO Summary: API ok, Payments p95↑ by 12% (observation)
Active Incidents:
- INC-3421 (SEV-2): KYC success rate is falling in the TR region. Owner: @p1. Next step: switch 20% of traffic to provider B, update at 20:30 UTC.
Risks/work: 22:00 UTC - index migration to ClickHouse (read-only), owner @data-ivan.
Providers: PSP-A green, KYC-A partially degraded in TR.
Status page: post from 17:50 UTC; next update 20:30 UTC.
Next steps P1: 1) Check the effect of the KYC switch; 2) Prepare a 5% canary for payments v2.14.
Acceptance echo template
[Took over shift] 18:02 UTC. Active: INC-3421 (SEV-2). Next update 18:30 UTC.
Checked alerts for the last 2h - no new P1s. Status page is up to date.
16) Embedding in daily practice
Daily shift ritual: a 5-10 minute voice sync when incidents are active.
Weekly card audit: spot-check completeness/relevance.
Game days: simulated shift changes with many parallel events.
Docs catalog: card/checklist templates live in the repository and are reviewed like code.
17) The bottom line
Well-organized shifts and handovers are the lubricant of the entire operations machine. A shift card, short syncs, strict checklists, automation, and care for the team's resilience turn risky moments into routine without loss of quality: context is preserved, reaction time stays stable, and users do not notice the shift change at all.