Migration playbooks
1) Classification of migrations
DB schemas: adding/changing columns, indexes, sharding, changing key types.
Data: mass backfill/cleanup, normalization, retention/archiving.
Services and APIs: changing endpoints, versioning, contract refactoring.
Queues/buses: moving topics, changing partition keys, event formats.
Infrastructure: moving to a new cluster/K8s/cloud/region, changing secrets/KMS.
Storage and analytics: changing the engine (OLTP/OLAP), format/partitioning of datasets.
Security/compliance: key rotation, on-the-fly re-encryption, data geo-localization.
2) Principles of successful migration
1. Expand → Migrate → Contract. First expand the schema/behavior (compatibly), then move the data/traffic, then remove the old path.
2. Shadow & Dual. Shadow reads/writes and dual-write for validation.
3. Feature flags and the "red button." Fast shutdown, step-by-step activation (by percentage/tenant/region).
4. Idempotency and repeatability. Scripts and jobs can be restarted without side effects.
5. Observability before changes. Dashboards/alerts in advance, migration markers in logs/traces.
6. Documented rollback. The rollback runbook is as detailed as the forward plan.
7. Small batches and pauses. Migrate in small portions, checking SLIs and business invariants.
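The first three principles can be sketched in a few lines of Python. This is a minimal illustration, not a production flag system: the flag store, `read_v1`/`read_v2`, and the order payloads are all invented for the example.

```python
# Sketch: flag-gated read path with a "red button" (all names hypothetical).
FLAGS = {"read_from_v2": False}  # in practice: a feature-flag service


def read_v1(key):
    return {"id": key, "source": "v1"}


def read_v2(key):
    return {"id": key, "source": "v2"}


def read_order(key):
    """Expand phase: both paths exist; the flag picks one, v1 is the fallback."""
    if FLAGS["read_from_v2"]:
        try:
            return read_v2(key)
        except Exception:
            return read_v1(key)  # red button: degrade to the old path
    return read_v1(key)


assert read_order(1)["source"] == "v1"   # flag off: old path
FLAGS["read_from_v2"] = True             # step-by-step activation
assert read_order(1)["source"] == "v2"   # flag on: new path, v1 kept as fallback
```

The Contract phase is the moment `read_v1` and the flag itself are deleted, after the new path has held for the safety period.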
3) Inventory and dependency analysis
Consumer map: services, jobs, reports, external partners, BI/ETL, webhooks.
Contracts and schemas: API/event versions, backward/forward compatibility.
Access/secrets: who reads/writes what, where the caches/queues live.
Domain invariants: uniqueness, balance, idempotency, reporting day.
Volumes/speeds: data size, RPS, peak windows, RPO/RTO.
4) Canonical playbook template (YAML skeleton)
playbook: "migrate-orders-to-v2"
owner: "orders-team"
stakeholders: ["platform", "data", "security", "support"]
change_type: ["schema", "data", "api"]
risk_level: "high"
preconditions:
  - "Dashboards ready: latency/error/lag"
  - "Rollback runbook validated on stage"
  - "Backups verified (restore tested)"
plan:
  phase_1_prepare:
    steps:
      - "Add new nullable columns (expand)"
      - "Deploy code with dual-write (flag off)"
      - "Enable CDC stream to target"
  phase_2_shadow:
    steps:
      - "Shadow-read v2, compare with v1 (1%)"
      - "Fix discrepancies; iterate"
  phase_3_dual_write:
    steps:
      - "Enable dual-write (10%→50%→100%)"
      - "Start backfill in batches (size=10k, sleep=200ms)"
  phase_4_cutover:
    steps:
      - "Switch reads to v2 by tenants (canary)"
      - "Monitor SLI 30m; expand scope"
  phase_5_contract:
    steps:
      - "Drop old indices/columns after T+14d"
      - "Disable old topic/api; update docs/SDK"
guardrails:
  abort_if:
    - "error_rate > 0.5% for 5m"
    - "p95 > baseline * 1.5 for 10m"
    - "data_mismatch > 0.01%"
rollback:
  steps:
    - "Flip flag: reads back to v1"
    - "Stop backfill; continue dual-write to v1"
    - "Replay missed events (DLQ→v1)"
validation:
  checks:
    - "Row counts match within epsilon"
    - "Business invariants hold (balances, limits)"
comms:
  - channel: "on-call-bridge"
  - status_updates: "T-24h, T-1h, start, cutover, finish"
window: "low-traffic Sun 02:00–05:00 UTC"
5) Migration patterns
5.1 DB schemas (RDBMS/NoSQL)
Add, don't change. New nullable columns/indexes → code reads both old and new.
Online rebuilds. Use online index builds / concurrent DDL.
Serialization versions. Version the payload in JSON/Proto/Avro columns.
Key migration. When changing the PK, keep a temporary mapping table plus a trigger/CDC.
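The "add, don't change" rule can be shown end to end with an in-memory SQLite database (table and column names are invented; a production RDBMS would use online/concurrent DDL instead of a plain ALTER):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 100)")

# Expand: add a nullable column; existing rows keep NULL, nothing breaks.
conn.execute("ALTER TABLE orders ADD COLUMN amount_v2 INTEGER")

# Code reads both old and new: COALESCE prefers the new column once filled.
row = conn.execute(
    "SELECT COALESCE(amount_v2, amount) FROM orders WHERE id = 1").fetchone()
assert row[0] == 100  # old row is still readable

# Migrate: backfill the new column.
conn.execute("UPDATE orders SET amount_v2 = 105 WHERE id = 1")
row = conn.execute(
    "SELECT COALESCE(amount_v2, amount) FROM orders WHERE id = 1").fetchone()
assert row[0] == 105  # new value wins after backfill
```

Contract would then drop `amount` after the safety period, once no code path reads it.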
5.2 Data (backfill/cleanup)
CDC + backfill. First the change stream (to keep up), then the batch backfill.
Batches and checkpoints. Small batches with lag control, checkpoints, and restartability.
Idempotent updates. Upsert by natural keys/versions.
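An idempotent upsert by natural key and version, sketched with SQLite (assuming SQLite ≥ 3.24 for `ON CONFLICT ... DO UPDATE`; the `target` table and its columns are invented). Replaying a batch, in any order, converges to the same state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE target (
    nat_key TEXT PRIMARY KEY, payload TEXT, version INTEGER)""")


def upsert(rows):
    """Idempotent upsert: a replayed or stale batch never overwrites newer data."""
    conn.executemany(
        """INSERT INTO target (nat_key, payload, version) VALUES (?, ?, ?)
           ON CONFLICT(nat_key) DO UPDATE SET
               payload = excluded.payload, version = excluded.version
           WHERE excluded.version > target.version""",
        rows)


upsert([("k1", "old", 1)])
upsert([("k1", "old", 1)])      # replay of the same batch: no effect
upsert([("k1", "new", 2)])      # newer version wins
upsert([("k1", "stale", 1)])    # stale replay is ignored
assert conn.execute("SELECT payload, version FROM target").fetchone() == ("new", 2)
```

With this property, a crashed backfill job can simply be restarted from its last checkpoint.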
5.3 Events and queues
Event versioning. 'event_type@vN'; consumers ignore unfamiliar fields.
Topic moves. Dual-publish; consumers read from both topics until things stabilize, then the old one is cut off.
Partition keys. Migrate keys via re-emission with a mapping table and idempotency.
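A tolerant, version-aware consumer can be sketched as a dispatch table keyed by (type, version). Everything here is hypothetical (event names, field names, the DLQ remark); the point is that unknown fields and unknown versions never crash the consumer:

```python
import json

# Dispatch table: one handler per (event type, schema version).
HANDLERS = {
    ("order_created", 1): lambda e: ("v1", e["order_id"]),
    ("order_created", 2): lambda e: ("v2", e["order_id"], e.get("tenant")),
}


def consume(raw):
    """Tolerant reader: dispatch by (type, version), ignore unknown fields,
    skip unfamiliar versions instead of crashing."""
    event = json.loads(raw)
    handler = HANDLERS.get((event["type"], event["version"]))
    if handler is None:
        return None  # unknown version: ack and move on (or route to a DLQ)
    return handler(event)


# v1 event with an extra, unknown field: the field is simply ignored.
assert consume('{"type": "order_created", "version": 1, '
               '"order_id": 7, "extra": true}') == ("v1", 7)
# Future version the consumer does not know yet: skipped, not crashed.
assert consume('{"type": "order_created", "version": 9, "order_id": 7}') is None
```

During a topic move, the same `consume` can be attached to both the old and the new topic, since idempotent handling makes duplicate delivery safe.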
5.4 Services and APIs
Blue/Green/Canary. Pool warm-up, partial traffic, fast rollback.
Feature flags. By tenant/region/percentage, with observable rollout.
Contracts. Consumer-driven contract and compatibility tests before switching.
5.5 Regions/Clouds
Geo dual-write. Data is written to two regions; reads go to the nearest one.
State transfer. Snapshot + replication; RPO as a hard limit, DNS/Anycast failover.
Jurisdictions. Consent/data localization, lists of datasets prohibited from cross-border transfer.
6) Execution phases (detailed)
1. Preparation
Dashboards, alerts, limits, feature flags, backups with a recovery test, run on a stage.
2. Shadow (shadow check)
Mirror requests/writes to the new system without affecting users. Compare responses/states.
3. Dual-write / Dual-read
We write to both systems. Reads move to the new system gradually. Mismatch logs are analyzed.
4. Backfill
We load historical data in batches, controlling CDC lag and watching the load on storage/caches.
5. Cutover (switching)
Canary by segment (tenants/regions/percentages). Keep a fast rollback ready.
6. Contract (cleaning)
Cut off old paths; delete outdated fields/indexes/topics after the safety period.
7. Verification and retro
Report, metrics, lessons, update playbook/checklists.
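The shadow phase above can be sketched as a comparator that serves from v1 while mirroring a sample of reads to v2 and recording discrepancies. All names (`read_v1`, `read_v2`, the stores) are invented for the example; the invariant is that v2 errors never reach users:

```python
import logging


def shadow_read(key, read_v1, read_v2, mismatches, sample=lambda k: True):
    """Serve from v1; mirror a sample of reads to v2 and record discrepancies.
    Users are never affected by v2 failures during the shadow phase."""
    primary = read_v1(key)
    if sample(key):
        try:
            shadow = read_v2(key)
            if shadow != primary:
                mismatches.append((key, primary, shadow))
        except Exception:
            logging.exception("shadow read failed for key %s", key)
    return primary


store_v1 = {1: "a", 2: "b"}
store_v2 = {1: "a", 2: "B"}          # seeded discrepancy for the demo
seen = []
assert shadow_read(1, store_v1.get, store_v2.get, seen) == "a"
assert shadow_read(2, store_v1.get, store_v2.get, seen) == "b"  # user still sees v1
assert seen == [(2, "b", "B")]       # discrepancy recorded for analysis
```

The `sample` predicate is where the "1%" from the playbook skeleton would plug in; mismatches feed the "fix discrepancies; iterate" loop.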
7) Observability and SLO during migration
Technical SLIs: p50/p95/p99, error rate, retries/timeouts, utilization, CDC lag, queue depth.
Business SLIs: transaction/conversion success, invariants (balances, limits, duplicates).
Migration labels: 'migration_id', 'phase', 'tenant', 'flag_state'.
Alert guards: thresholds on tails and errors, auto-stop (abort) on SLO breach.
Comparison panels: v1 vs v2, delta by key metrics.
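The auto-stop guard can be sketched as a pure function over a window of metric samples. The thresholds mirror the playbook skeleton's `abort_if` block, but the sample shape, the baseline, and the "every sample in the window breached" rule are assumptions for the example:

```python
def should_abort(window, error_rate_limit=0.005, p95_factor=1.5,
                 baseline_p95=120.0):
    """Abort guardrail: trigger only on a sustained breach, i.e. when every
    sample in the observation window violates an error or latency threshold."""
    breaches = [
        s for s in window
        if s["error_rate"] > error_rate_limit
        or s["p95_ms"] > baseline_p95 * p95_factor
    ]
    return len(window) > 0 and len(breaches) == len(window)


# Five one-minute samples ≈ "for 5m" from the guardrails.
healthy = [{"error_rate": 0.001, "p95_ms": 130.0}] * 5
degraded = [{"error_rate": 0.02, "p95_ms": 300.0}] * 5
assert should_abort(healthy) is False   # within both thresholds
assert should_abort(degraded) is True   # sustained breach → auto-stop
```

Requiring the whole window to breach avoids aborting a migration on a single noisy sample.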
8) Rollback and emergency scenarios
Logical rollback: flip flags/traffic routing back, freeze the backfill.
Data: "compensation" (Saga), event replay, DLQ → source system.
Secrets/keys: return to the previous key/certificate (dual-key).
DNS/traffic: reverse the Anycast/ALB shift; keep TTLs short in the migration window.
Communications: pre-agreed channel and status format.
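The "DLQ → source system" replay can be sketched as an idempotent loop: deduplicate by event id so the replay itself can be safely re-run after a crash. The event shape and `applied_ids` store are hypothetical (in practice the seen-set would be durable):

```python
def replay_dlq(dlq, apply_to_v1, applied_ids):
    """Replay missed events from the DLQ into the source system (v1).
    Dedupe by event id so a re-run of the replay is a no-op."""
    replayed = 0
    for event in dlq:
        if event["id"] in applied_ids:
            continue                  # already applied: idempotent replay
        apply_to_v1(event)
        applied_ids.add(event["id"])
        replayed += 1
    return replayed


state, seen = [], set()
dlq = [{"id": "e1", "op": "credit"},
       {"id": "e2", "op": "debit"},
       {"id": "e1", "op": "credit"}]   # duplicate delivery
assert replay_dlq(dlq, state.append, seen) == 2   # duplicate e1 skipped
assert replay_dlq(dlq, state.append, seen) == 0   # full re-run is a no-op
assert len(state) == 2
```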
9) Security, privacy, compliance
Data minimization. Transfer only the necessary fields; anonymization profiles on copies.
Cryptography. Encryption in transit and at rest, KMS rotation, a log of key operations.
Time-boxed access. Temporary roles for migration jobs; revoke rights after completion.
Logs/traces. PII masking in logs/traces, export restrictions.
10) Change Management and Communications
RACI: who approves, who performs, who is informed.
Freeze periods: prohibit irrelevant releases in the migration window.
Statuses: T-24h, T-1h, start, canary, cutover, finish, post-mortem.
External partners: compatibility windows, advance notification letters, a test sandbox.
11) Runbook templates
11.1 Backfill (pseudocode)
for batch in paginate(ids, size=10_000):
    try:
        rows = read_v1(batch)
        upsert_v2(rows)               # idempotent
        mark_checkpoint(batch.end)    # restart from here after a failure
        sleep(jitter_ms(100..300))
    except Throttle:
        sleep(5s)                     # respect backpressure
    except Fatal as e:
        alert("backfill-failed", e, context=batch)
        abort_if_needed()
11.2 Consistency check (snapshot/sample)
sample = random_ids(n=10_000, stratify=[tenant, timestamp])
v1 = fetch_v1(sample); v2 = fetch_v2(sample)
assert schema_compatible(v2)
assert key_invariants_hold(v1, v2)   # sums, statuses, versions
mismatch_rate = diff(v1, v2).rate()
abort_if(mismatch_rate > 0.0001)
11.3 Switching reads
flag.enable("read_from_v2", segment="tenants:cohort_A")
monitor(30m)
if SLO_ok(): expand_segment()
else: rollback_segment()
12) Anti-patterns
"Big bang" instead of expand-migrate-contract.
Backfill without CDC → eternal catch-up and drift.
No idempotency → duplicates/dirty data.
Manual steps without scripts → human errors.
Migration without dashboards/guards → "blind flight."
Unrehearsed rollback → the rollback does not work when needed.
Ignoring consumers (BI/partners) → broken reports/integrations.
13) Architect checklist
1. Goal, boundaries, migration type and outcome invariants defined?
2. Consumer and contract map drawn up, compatibility tests green?
3. Dashboards, alerts, and 'migration_id' labels prepared; SLOs/guardrails set?
4. Implemented shadow and/or dual-write, backfill idempotent?
5. Rollback runbook rehearsed, recovery from backup verified?
6. Window/coordination/communication agreed, freeze on?
7. Step-by-step plan with canary and expansion/stop criteria ready?
8. Security/compliance: keys, access, PII sanitization?
9. Is the documentation/SDK/spec updated in the same release cycle?
10. Post-mortem and playbook update after completion planned?
Conclusion
Migration playbooks are an architectural practice of risk management: small reversible steps, transparent metrics, a ready rollback, and expand-migrate-contract discipline. Following the templates above, you can migrate schemas, data, services, and regions without downtime or surprises, while preserving business invariants and user trust.