Migration playbooks
1) Classification of migrations
DB schemas: adding/changing columns, indexes, sharding, changing key types.
Data: mass backfill/cleanup, normalization, retention/archiving.
Services and APIs: changing endpoints, versioning, contract refactoring.
Queues/buses: moving topics, changing partition keys, event formats.
Infrastructure: moving to a new cluster/K8s/cloud/region, changing secrets/KMS.
Storage and analytics: changing the engine (OLTP/OLAP), format/partitioning of datasets.
Security/compliance: key rotation, on-the-fly re-encryption, data geo-localization.
2) Principles of successful migration
1. Expand → Migrate → Contract. First expand the schema/behavior (compatibly), then move the data/traffic, then remove the old path.
2. Shadow & Dual. Shadow reads/writes and dual-write for validation.
3. Feature flags and the "red button." Fast shutdown, step-by-step activation (by percentage/tenant/region).
4. Idempotency and repeatability. Scripts and jobs can be restarted without side effects.
5. Observability before changes. Dashboards/alerts in advance, migration markers in logs/traces.
6. Documented rollback. The rollback runbook is as detailed as the forward plan.
7. Small batches and pauses. Migrate in small portions, checking SLIs and business invariants.
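The first three principles can be sketched in a few lines of Python. This is a minimal illustration, not a production flag system: the flag store, `read_v1`/`read_v2`, and the order payloads are all invented for the example.

```python
# Sketch: flag-gated read path with a "red button" (all names hypothetical).
FLAGS = {"read_from_v2": False}  # in practice: a feature-flag service


def read_v1(key):
    return {"id": key, "source": "v1"}


def read_v2(key):
    return {"id": key, "source": "v2"}


def read_order(key):
    """Expand phase: both paths exist; the flag picks one, v1 is the fallback."""
    if FLAGS["read_from_v2"]:
        try:
            return read_v2(key)
        except Exception:
            return read_v1(key)  # red button: degrade to the old path
    return read_v1(key)


assert read_order(1)["source"] == "v1"   # flag off: old path
FLAGS["read_from_v2"] = True             # step-by-step activation
assert read_order(1)["source"] == "v2"   # flag on: new path, v1 kept as fallback
```

The Contract phase is the moment `read_v1` and the flag itself are deleted, after the new path has held for the safety period.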
3) Inventory and dependency analysis
Consumer map: services, jobs, reports, external partners, BI/ETL, webhooks.
Contracts and schemas: API/event versions, backward/forward compatibility.
Access/secrets: who reads/writes what, where the caches/queues live.
Domain invariants: uniqueness, balance, idempotency, reporting day.
Volumes/speeds: data size, RPS, peak windows, RPO/RTO.
4) Canonical playbook template (YAML skeleton)
playbook: "migrate-orders-to-v2"
owner: "orders-team"
stakeholders: ["platform", "data", "security", "support"]
change_type: ["schema", "data", "api"]
risk_level: "high"
preconditions:
  - "Dashboards ready: latency/error/lag"
  - "Rollback runbook validated on stage"
  - "Backups verified (restore tested)"
plan:
  phase_1_prepare:
    steps:
      - "Add new nullable columns (expand)"
      - "Deploy code with dual-write (flag off)"
      - "Enable CDC stream to target"
  phase_2_shadow:
    steps:
      - "Shadow-read v2, compare with v1 (1%)"
      - "Fix discrepancies; iterate"
  phase_3_dual_write:
    steps:
      - "Enable dual-write (10%→50%→100%)"
      - "Start backfill in batches (size=10k, sleep=200ms)"
  phase_4_cutover:
    steps:
      - "Switch reads to v2 by tenants (canary)"
      - "Monitor SLI 30m; expand scope"
  phase_5_contract:
    steps:
      - "Drop old indices/columns after T+14d"
      - "Disable old topic/api; update docs/SDK"
guardrails:
  abort_if:
    - "error_rate > 0.5% for 5m"
    - "p95 > baseline * 1.5 for 10m"
    - "data_mismatch > 0.01%"
rollback:
  steps:
    - "Flip flag: reads back to v1"
    - "Stop backfill; continue dual-write to v1"
    - "Replay missed events (DLQ→v1)"
validation:
  checks:
    - "Row counts match within epsilon"
    - "Business invariants hold (balances, limits)"
comms:
  - channel: "on-call-bridge"
  - status_updates: "T-24h, T-1h, start, cutover, finish"
window: "low-traffic Sun 02:00–05:00 UTC"
5) Migration patterns
5.1 DB schemas (RDBMS/NoSQL)
Add, don't change. New nullable columns/indexes → code reads both old and new.
Online rebuilds. Use online index builds / concurrent DDL.
Serialization versions. Version the payload in JSON/Proto/Avro columns.
Key migration. When changing the PK, keep a temporary mapping table plus a trigger/CDC.
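The "add, don't change" rule can be shown end to end with an in-memory SQLite database (table and column names are invented; a production RDBMS would use online/concurrent DDL instead of a plain ALTER):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 100)")

# Expand: add a nullable column; existing rows keep NULL, nothing breaks.
conn.execute("ALTER TABLE orders ADD COLUMN amount_v2 INTEGER")

# Code reads both old and new: COALESCE prefers the new column once filled.
row = conn.execute(
    "SELECT COALESCE(amount_v2, amount) FROM orders WHERE id = 1").fetchone()
assert row[0] == 100  # old row is still readable

# Migrate: backfill the new column.
conn.execute("UPDATE orders SET amount_v2 = 105 WHERE id = 1")
row = conn.execute(
    "SELECT COALESCE(amount_v2, amount) FROM orders WHERE id = 1").fetchone()
assert row[0] == 105  # new value wins after backfill
```

Contract would then drop `amount` after the safety period, once no code path reads it.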
5.2 Data (backfill/cleanup)
CDC + backfill. First the change stream (to keep up), then the batch backfill.
Batches and checkpoints. Small batches with lag control, checkpoints, and restartability.
Idempotent updates. Upsert by natural keys/versions.
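An idempotent upsert by natural key and version, sketched with SQLite (assuming SQLite ≥ 3.24 for `ON CONFLICT ... DO UPDATE`; the `target` table and its columns are invented). Replaying a batch, in any order, converges to the same state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE target (
    nat_key TEXT PRIMARY KEY, payload TEXT, version INTEGER)""")


def upsert(rows):
    """Idempotent upsert: a replayed or stale batch never overwrites newer data."""
    conn.executemany(
        """INSERT INTO target (nat_key, payload, version) VALUES (?, ?, ?)
           ON CONFLICT(nat_key) DO UPDATE SET
               payload = excluded.payload, version = excluded.version
           WHERE excluded.version > target.version""",
        rows)


upsert([("k1", "old", 1)])
upsert([("k1", "old", 1)])      # replay of the same batch: no effect
upsert([("k1", "new", 2)])      # newer version wins
upsert([("k1", "stale", 1)])    # stale replay is ignored
assert conn.execute("SELECT payload, version FROM target").fetchone() == ("new", 2)
```

With this property, a crashed backfill job can simply be restarted from its last checkpoint.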
5.3 Events and queues
Event versioning. 'event_type@vN'; consumers ignore unfamiliar fields.
Topic moves. Dual-publish; consumers read from both topics until things stabilize, then the old one is cut off.
Partition keys. Migrate keys via re-emission with a mapping table and idempotency.
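A tolerant, version-aware consumer can be sketched as a dispatch table keyed by (type, version). Everything here is hypothetical (event names, field names, the DLQ remark); the point is that unknown fields and unknown versions never crash the consumer:

```python
import json

# Dispatch table: one handler per (event type, schema version).
HANDLERS = {
    ("order_created", 1): lambda e: ("v1", e["order_id"]),
    ("order_created", 2): lambda e: ("v2", e["order_id"], e.get("tenant")),
}


def consume(raw):
    """Tolerant reader: dispatch by (type, version), ignore unknown fields,
    skip unfamiliar versions instead of crashing."""
    event = json.loads(raw)
    handler = HANDLERS.get((event["type"], event["version"]))
    if handler is None:
        return None  # unknown version: ack and move on (or route to a DLQ)
    return handler(event)


# v1 event with an extra, unknown field: the field is simply ignored.
assert consume('{"type": "order_created", "version": 1, '
               '"order_id": 7, "extra": true}') == ("v1", 7)
# Future version the consumer does not know yet: skipped, not crashed.
assert consume('{"type": "order_created", "version": 9, "order_id": 7}') is None
```

During a topic move, the same `consume` can be attached to both the old and the new topic, since idempotent handling makes duplicate delivery safe.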
5.4 Services and APIs
Blue/Green/Canary. Pool warm-up, partial traffic, fast rollback.
Feature flags. By tenant/region/percentage, with observable rollout.
Contracts. Consumer-driven contract and compatibility tests before switching.
5.5 Regions/Clouds
Geo dual-write. Data is written to two regions; reads go to the nearest one.
State transfer. Snapshot + replication; RPO as a hard limit, DNS/Anycast failover.
Jurisdictions. Consent/data localization, lists of datasets prohibited from cross-border transfer.
6) Execution phases (detailed)
1. Preparation
Dashboards, alerts, limits, feature flags, backups with a recovery test, run on a stage.
2. Shadow (shadow check)
Mirror requests/writes to the new system without affecting users. Compare responses/states.
3. Dual-write / Dual-read
We write to both systems. Reads move to the new system gradually. Mismatch logs are analyzed.
4. Backfill
We load historical data in batches, controlling CDC lag and watching the load on storage/caches.
5. Cutover (switching)
Canary by segment (tenants/regions/percentages). Keep a fast rollback ready.
6. Contract (cleaning)
Cut off old paths; delete outdated fields/indexes/topics after the safety period.
7. Verification and retro
Report, metrics, lessons, update playbook/checklists.
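The shadow phase above can be sketched as a comparator that serves from v1 while mirroring a sample of reads to v2 and recording discrepancies. All names (`read_v1`, `read_v2`, the stores) are invented for the example; the invariant is that v2 errors never reach users:

```python
import logging


def shadow_read(key, read_v1, read_v2, mismatches, sample=lambda k: True):
    """Serve from v1; mirror a sample of reads to v2 and record discrepancies.
    Users are never affected by v2 failures during the shadow phase."""
    primary = read_v1(key)
    if sample(key):
        try:
            shadow = read_v2(key)
            if shadow != primary:
                mismatches.append((key, primary, shadow))
        except Exception:
            logging.exception("shadow read failed for key %s", key)
    return primary


store_v1 = {1: "a", 2: "b"}
store_v2 = {1: "a", 2: "B"}          # seeded discrepancy for the demo
seen = []
assert shadow_read(1, store_v1.get, store_v2.get, seen) == "a"
assert shadow_read(2, store_v1.get, store_v2.get, seen) == "b"  # user still sees v1
assert seen == [(2, "b", "B")]       # discrepancy recorded for analysis
```

The `sample` predicate is where the "1%" from the playbook skeleton would plug in; mismatches feed the "fix discrepancies; iterate" loop.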
7) Observability and SLO during migration
Technical SLIs: p50/p95/p99, error rate, retries/timeouts, utilization, CDC lag, queue depth.
Business SLIs: transaction/conversion success, invariants (balances, limits, duplicates).
Migration labels: 'migration_id', 'phase', 'tenant', 'flag_state'.
Alert guards: thresholds on tails and errors, auto-stop (abort) on SLO breach.
Comparison panels: v1 vs v2, delta by key metrics.
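The auto-stop guard can be sketched as a pure function over a window of metric samples. The thresholds mirror the playbook skeleton's `abort_if` block, but the sample shape, the baseline, and the "every sample in the window breached" rule are assumptions for the example:

```python
def should_abort(window, error_rate_limit=0.005, p95_factor=1.5,
                 baseline_p95=120.0):
    """Abort guardrail: trigger only on a sustained breach, i.e. when every
    sample in the observation window violates an error or latency threshold."""
    breaches = [
        s for s in window
        if s["error_rate"] > error_rate_limit
        or s["p95_ms"] > baseline_p95 * p95_factor
    ]
    return len(window) > 0 and len(breaches) == len(window)


# Five one-minute samples ≈ "for 5m" from the guardrails.
healthy = [{"error_rate": 0.001, "p95_ms": 130.0}] * 5
degraded = [{"error_rate": 0.02, "p95_ms": 300.0}] * 5
assert should_abort(healthy) is False   # within both thresholds
assert should_abort(degraded) is True   # sustained breach → auto-stop
```

Requiring the whole window to breach avoids aborting a migration on a single noisy sample.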
8) Rollback and emergency scenarios
Logical rollback: flip flags/traffic routing back, freeze the backfill.
Data: "compensation" (Saga), event replay, DLQ → source system.
Secrets/keys: return to the previous key/certificate (dual-key).
DNS/traffic: reverse the Anycast/ALB shift; keep TTLs short in the migration window.
Communications: pre-agreed channel and status format.
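The "DLQ → source system" replay can be sketched as an idempotent loop: deduplicate by event id so the replay itself can be safely re-run after a crash. The event shape and `applied_ids` store are hypothetical (in practice the seen-set would be durable):

```python
def replay_dlq(dlq, apply_to_v1, applied_ids):
    """Replay missed events from the DLQ into the source system (v1).
    Dedupe by event id so a re-run of the replay is a no-op."""
    replayed = 0
    for event in dlq:
        if event["id"] in applied_ids:
            continue                  # already applied: idempotent replay
        apply_to_v1(event)
        applied_ids.add(event["id"])
        replayed += 1
    return replayed


state, seen = [], set()
dlq = [{"id": "e1", "op": "credit"},
       {"id": "e2", "op": "debit"},
       {"id": "e1", "op": "credit"}]   # duplicate delivery
assert replay_dlq(dlq, state.append, seen) == 2   # duplicate e1 skipped
assert replay_dlq(dlq, state.append, seen) == 0   # full re-run is a no-op
assert len(state) == 2
```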
9) Security, privacy, compliance
Data minimization. Transfer only the necessary fields; anonymization profiles on copies.
Cryptography. Encryption in transit and at rest, KMS rotation, a log of key operations.
Time-boxed access. Temporary roles for migration jobs; revoke rights after completion.
Logs/traces. PII masking in logs/traces, export restrictions.
10) Change Management and Communications
RACI: who approves, who performs, who is informed.
Freeze periods: prohibit irrelevant releases in the migration window.
Statuses: T-24h, T-1h, start, canary, cutover, finish, post-mortem.
External partners: compatibility windows, advance notification letters, a test sandbox.
11) Runbook templates
11.1 Backfill (pseudocode)
for batch in paginate(ids, size=10_000):
    try:
        rows = read_v1(batch)
        upsert_v2(rows)               # idempotent
        mark_checkpoint(batch.end)    # restart from here after a failure
        sleep(jitter_ms(100..300))
    except Throttle:
        sleep(5s)                     # respect backpressure
    except Fatal as e:
        alert("backfill-failed", e, context=batch)
        abort_if_needed()
11.2 Consistency check (snapshot/sample)
sample = random_ids(n=10_000, stratify=[tenant, timestamp])
v1 = fetch_v1(sample); v2 = fetch_v2(sample)
assert schema_compatible(v2)
assert key_invariants_hold(v1, v2)   # sums, statuses, versions
mismatch_rate = diff(v1, v2).rate()
abort_if(mismatch_rate > 0.0001)
11.3 Switching reads
flag.enable("read_from_v2", segment="tenants:cohort_A")
monitor(30m)
if SLO_ok(): expand_segment()
else: rollback_segment()
12) Anti-patterns
"Big bang" instead of expand-migrate-contract.
Backfill without CDC → eternal catch-up and drift.
No idempotency → duplicates/dirty data.
Manual steps without scripts → human errors.
Migration without dashboards/guards → "blind flight."
Unrehearsed rollback → the rollback does not work when needed.
Ignoring consumers (BI/partners) → broken reports/integrations.
13) Architect checklist
1. Goal, boundaries, migration type and outcome invariants defined?
2. Consumer and contract map drawn up, compatibility tests green?
3. Dashboards, alerts, and 'migration_id' labels prepared; SLOs/guardrails set?
4. Implemented shadow and/or dual-write, backfill idempotent?
5. Rollback runbook rehearsed, recovery from backup verified?
6. Window/coordination/communication agreed, freeze on?
7. Step-by-step plan with canary and expansion/stop criteria ready?
8. Security/compliance: keys, access, PII sanitization?
9. Is the documentation/SDK/spec updated in the same release cycle?
10. Post-mortem and playbook update after completion planned?
Conclusion
Migration playbooks are an architectural practice of risk management: small reversible steps, transparent metrics, a ready rollback, and expand-migrate-contract discipline. Following the templates above, you can migrate schemas, data, services, and regions without downtime or surprises, while preserving business invariants and user trust.