Disaster Recovery Scenarios
1) Why DR is needed and what it is for
Disaster Recovery (DR) is the set of architectures, processes, and drills for restoring services after a disaster (data-center/region failure, data loss, mass configuration errors). The goal of DR is to meet target RTO/RPO at controlled cost and risk while preserving customer trust and regulatory compliance.
Recovery Time Objective (RTO): allowed downtime.
Recovery Point Objective (RPO): allowable data loss (time since the last consistent point).
Recovery Level Objective (RLO): the level of functionality that must return first (minimum viable service).
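For intuition, a minimal sketch (all timestamps are illustrative) showing that actual RTO and RPO are just time deltas measured against these targets:

    from datetime import datetime

    outage_start          = datetime(2024, 5, 1, 12, 0)   # failure begins / disaster declared
    service_restored      = datetime(2024, 5, 1, 12, 11)  # SLIs green again
    last_consistent_point = datetime(2024, 5, 1, 11, 57)  # last replicated/backed-up state

    actual_rto = service_restored - outage_start           # downtime, compared to the target RTO
    actual_rpo = outage_start - last_consistent_point      # data-loss window, compared to the target RPO
    print(actual_rto, actual_rpo)                          # 0:11:00 0:03:00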
2) Classification of systems by criticality
Tier 0 (vital): payments, login, KYC, core transactions - RTO ≤ 15 min, RPO ≤ 1-5 min.
Tier 1 (high): operational dashboards, D-1 reports - RTO ≤ 1 h, RPO ≤ 15-60 min.
Tier 2 (medium): back office, near-real-time analytics - RTO ≤ 4-8 h, RPO ≤ 4-8 h.
Tier 3 (low): non-critical auxiliary services - RTO ≤ 24-72 h, RPO ≤ 24 h.
Assign a Tier and target RTO/RPO to every service in the service catalog; check architecture decisions and budgets against them (a catalog-entry sketch follows below).
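A minimal sketch of such a catalog entry (field and service names are illustrative, not a prescribed schema):

    from dataclasses import dataclass, field
    from datetime import timedelta

    @dataclass
    class ServiceEntry:
        name: str
        tier: int                      # 0..3 per the classification above
        rto: timedelta                 # target recovery time
        rpo: timedelta                 # target data-loss window
        owner: str
        dependencies: list[str] = field(default_factory=list)

    catalog = [
        ServiceEntry("payments", tier=0, rto=timedelta(minutes=15), rpo=timedelta(minutes=5),
                     owner="team-payments", dependencies=["auth", "primary-db", "psp-gateway"]),
        ServiceEntry("back-office", tier=2, rto=timedelta(hours=8), rpo=timedelta(hours=8),
                     owner="team-ops"),
    ]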
3) Threat model and scenarios
Technical: AZ/region/provider failure, network or DNS degradation, database/storage failure, a mass release bug.
Human factor: erroneous configs/IaC changes, accidental data deletion, key compromise.
Natural/external: fire/flood, power outages, regulatory blocking.
For each threat, assess probability and impact and link it to a DR scenario and playbook.
4) DR architecture patterns
1. Active-Active (Multi-Region): both regions serve traffic.
Pros: minimal RTO/RPO, high resilience.
Cons: data-consistency complexity, high cost.
Where it fits: read-heavy and cached workloads, stateless services, multi-master DBs (with strict conflict-resolution rules).
2. Active-Passive (Hot Standby): the passive region holds a fully warmed copy.
RTO: minutes; RPO: minutes. Requires automated failover and continuous replication.
3. Warm Standby: a subset of resources is kept warm and scaled up on failover.
RTO: tens of minutes; RPO: 15-60 minutes. More economical, but slower.
4. Pilot Light: a minimal core (metadata/images/scripts) plus rapid provisioning on demand.
RTO: hours; RPO: hours. Cheap, suitable for Tier 2-3.
5. Backup & Restore: offline backups plus manual restore and warm-up.
RTO/RPO: hours to days. Only for low-criticality services and archives. (A tier-to-pattern selection sketch follows this list.)
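A rough selection helper matching the tier targets above (thresholds and pattern names are illustrative; real choices also weigh cost and the data model):

    from datetime import timedelta

    def suggest_pattern(rto: timedelta, rpo: timedelta) -> str:
        if rto <= timedelta(minutes=15) and rpo <= timedelta(minutes=5):
            return "active-active or hot standby"     # Tier 0
        if rto <= timedelta(hours=1):
            return "hot or warm standby"              # Tier 1
        if rto <= timedelta(hours=8):
            return "pilot light"                      # Tier 2
        return "backup & restore"                     # Tier 3

    print(suggest_pattern(timedelta(minutes=15), timedelta(minutes=5)))  # active-active or hot standby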
5) Data and consistency
Database replication:
- Synchronous: near-zero RPO, but higher latency and cost.
- Asynchronous: better performance, RPO > 0 (the tail of the log can be lost).
- Consistency: choose a model (strong/eventual/causal). For payments, strong; for analytics, eventual is acceptable.
- Snapshots: create consistent points regularly and retain change logs (WAL/redo).
- Cross-region transactions: avoid 2PC; use idempotent operations, retries with deduplication, and event sourcing.
- Queues/buses: replication/mirroring, DLQs, ordering guarantees, and idempotent consumers (see the consumer sketch after this list).
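A minimal sketch of an idempotent consumer with deduplication (the in-memory set and message shape are illustrative; a production system would use a durable dedup store such as a DB table or Redis key with TTL):

    import json

    processed_ids: set[str] = set()            # stand-in for a durable dedup store

    def apply_business_effect(event: dict) -> None:
        print("processed", event["event_id"])  # e.g. post a ledger entry

    def handle(message: bytes) -> None:
        event = json.loads(message)
        event_id = event["event_id"]           # producer assigns a stable ID per business operation
        if event_id in processed_ids:
            return                             # duplicate delivery (retry, mirror switch) is a safe no-op
        apply_business_effect(event)           # must be atomic with recording the ID in production
        processed_ids.add(event_id)

    msg = json.dumps({"event_id": "tx-42", "amount": 10}).encode()
    handle(msg)
    handle(msg)                                # redelivery after a failover is deduplicated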
6) Network, traffic and DNS
GSLB/Anycast/DNS: failover/failback policies, low TTLs (but not so low that resolvers get overloaded), health checks from several regions (see the quorum-probe sketch after this list).
L7 routing: regional routing maps, degradation flags (feature restriction).
Private links/VPN: backup channels to providers (PSP/KYC/CDN).
Rate limiting: protection against retry storms during recovery.
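A minimal quorum-probe sketch (endpoints and the quorum size are illustrative; real probes run from independent vantage points):

    import urllib.request

    PROBES = {
        "probe-eu": "https://eu.example.com/healthz",
        "probe-us": "https://us-view.example.com/healthz",
        "probe-ap": "https://ap-view.example.com/healthz",
    }

    def region_is_down(quorum: int = 2, timeout: float = 2.0) -> bool:
        failures = 0
        for name, url in PROBES.items():
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status != 200:
                        failures += 1
            except Exception:
                failures += 1
        return failures >= quorum    # only a quorum of independent probes may trigger failover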
7) Stateful vs Stateless
Stateless services can be redeployed by scripts/autoscaling; stateful services need a consistent data strategy (replication, snapshots, replica promotion, quorum).
Cache/sessions: externalize them (Redis/Memcached) with cross-region replication or re-seed from logs; keep sessions in tokens (JWT) or shared storage (see the token sketch below).
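A minimal sketch of token-held sessions, assuming the PyJWT library and a signing key shared across regions (key handling here is illustrative; production keys live in a KMS/secret manager):

    from datetime import datetime, timedelta, timezone
    import jwt                                   # PyJWT

    SECRET = "shared-signing-key"                # illustrative; distribute via KMS/secret manager

    def issue_session(user_id: str) -> str:
        claims = {"sub": user_id, "exp": datetime.now(timezone.utc) + timedelta(hours=1)}
        return jwt.encode(claims, SECRET, algorithm="HS256")

    def load_session(token: str) -> dict:
        # Any region can validate the token without a shared session database.
        return jwt.decode(token, SECRET, algorithms=["HS256"])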
8) DR triggers and automation
SLO guardrails and quorum probes → an automated region-failover runbook (see the trigger sketch after this list).
Change freeze during an incident: block non-essential releases/migrations.
Infrastructure as Code: deploy standby manifests, check for configuration drift.
Role promotion: automatically promote the replica DB, re-point writers, and rotate secrets/endpoints.
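A sketch of the trigger logic and runbook skeleton (thresholds and steps are illustrative stubs, not a particular tool's API):

    def should_fail_over(probe_failures: int, probe_quorum: int,
                         slo_burn_rate: float, burn_threshold: float = 14.4) -> bool:
        # Require both an infrastructure signal (probe quorum) and a user-impact signal (SLO burn rate).
        return probe_failures >= probe_quorum and slo_burn_rate >= burn_threshold

    def run_region_failover() -> None:
        for step in ("freeze changes", "shift traffic 25/50/100%",
                     "promote replica DB", "rotate endpoints/secrets"):
            print("runbook step:", step)         # each step maps to an automated runbook action

    if should_fail_over(probe_failures=3, probe_quorum=2, slo_burn_rate=20.0):
        run_region_failover()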
9) Communications and Compliance
War room: IC/TL/Comms/Scribe roles; update intervals per SEV level.
Status page: affected geography, ETA, workarounds.
Regulatory: notification deadlines, data protection, immutable evidence storage.
Partners/providers: confirmed contacts, a dedicated communication channel.
10) DR tests and exercises
Tabletop: walk through the scenario and decisions on paper.
Game Day (staging/prod-light): simulate AZ/region failure, provider outage, DNS switchover.
Restore tests: periodically restore backups in isolation and validate integrity (see the sketch after this list).
Chaos/failure injection: controlled network/node/dependency failures.
Exercise KPIs: achieved RTO/RPO, playbook defects, CAPA items.
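A minimal restore-validation sketch (SQLite stands in for the isolated restore target; the checks and table are illustrative):

    import sqlite3

    def validate_restore(conn: sqlite3.Connection) -> bool:
        ok = conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"   # engine-level consistency
        rows = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]  # plus business invariants
        return ok and rows > 0

    restored = sqlite3.connect(":memory:")       # stand-in for the isolated restored instance
    restored.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL)")
    restored.execute("INSERT INTO payments VALUES (1, 10.0)")
    print("restore valid:", validate_restore(restored))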
11) Finance and Strategy Selection (FinOps)
Price out lower RPO/RTO targets: the tighter the targets, the more expensive the links, licenses, and reserved capacity.
Hybrid approach: Tier 0 - active-active/hot standby; Tier 1 - warm standby; Tier 2-3 - pilot light/backup & restore.
Expensive data: use cold tiers (archive storage, e.g. S3 Glacier), incremental snapshots, deduplication.
Periodically review DR infrastructure costs and certificates/licenses. (A break-even sketch follows below.)
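A simple break-even sketch with purely illustrative numbers:

    hot_standby_cost_per_year = 240_000   # extra region, licenses, replication traffic
    downtime_cost_per_hour    = 50_000    # lost revenue plus penalties for a Tier 0 service
    expected_outage_hours     = 6         # expected annual downtime without a hot standby

    expected_loss = downtime_cost_per_hour * expected_outage_hours
    print("hot standby pays off:", hot_standby_cost_per_year < expected_loss)   # True: 240k < 300k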
12) DR Maturity Metrics
RTO (actual) and RPO (actual) for each Tier.
DR Coverage: % of services with a designed scenario/playbook/test.
Backup Success & Restore Success: daily success rates of backups and verified restores.
Time-to-Declare Disaster: how quickly the failover decision is made.
Failback Time: time to return to the normal topology.
Exercise Defect Rate: gaps/lessons found per exercise.
Compliance Evidence Completeness. (A metrics sketch follows below.)
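A sketch computing two of these metrics from catalog and backup data (the records are illustrative):

    services = [
        {"name": "payments",    "has_playbook": True,  "dr_tested": True},
        {"name": "back-office", "has_playbook": True,  "dr_tested": False},
        {"name": "reporting",   "has_playbook": False, "dr_tested": False},
    ]
    backup_jobs = [True, True, True, False, True]    # daily backup outcomes

    dr_coverage = sum(s["has_playbook"] and s["dr_tested"] for s in services) / len(services)
    backup_success = sum(backup_jobs) / len(backup_jobs)
    print(f"DR coverage: {dr_coverage:.0%}, backup success: {backup_success:.0%}")   # 33%, 80%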
13) Checklists
Before DR implementation
- Service catalog contains Tier, RTO/RPO, dependencies, and owners.
- DR pattern selected (AA/AP/WS/PL/BR) per Tier and budget.
- Consistency and replication decisions are documented.
- GSLB/DNS/routing and health checks configured and tested.
- Backups, snapshots, and change logs enabled and restore-tested.
- DR playbooks and provider contacts are up to date.
During the incident (brief)
- Declare a SEV and assemble a war room; freeze releases.
- Check the probe quorum; record the impact and affected geography.
- Execute the failover runbook: traffic, DB promotion, queues, cache.
- Enable degraded UX/limits; publish status updates per SLA.
- Collect evidence (timeline, graphs, logs, commands).
After the incident
- Observe SLOs for N intervals; execute failback as planned.
- Conduct an AAR/RCA; issue CAPA items.
- Update playbooks, the alert catalog, and DR test cases.
- Report to stakeholders/regulators (if required).
14) Templates
14.1 DR scenario card (example)
ID: DR-REGION-FAILOVER-01
Scope: prod EU ↔ prod US
Tier: 0 (Payments, Auth)
Targets: RTO ≤ 15m, RPO ≤ 5m
Trigger: quorum(probes EU, US) + burn-rate breach + provider status=red
Actions:
- Traffic: GSLB shift EU→US (25→50→100% with green SLIs)
- DB: promote US-replica to primary; re-point writers; freeze schema changes
- MQ: mirror switch; drain EU DLQ; idempotent reprocess
- Cache: invalidate region-specific keys; warm critical sets
- Features: enable degrade_payments_ux
- Comms: status page update every 15m; notify partners
Guardrails: payment_success ≥ 98%, p95 ≤ 300ms
Rollback/Failback: EU green 60m → 25→50→100% with guardrails
Owners: IC @platform, DB @data, Network @netops, Comms @support
14.2 Runbook "Promote replica database" (fragment)
1) Freeze writes; verify WAL applied (lag ≤ 30s)
2) Promote replica; update cluster VIP / writer endpoint
3) Rotate app secrets/endpoints via remote config
4) Validate: read/write checks, consistency, replication restart to new secondary
5) Lift freeze, monitor errors p95/5xx for 30m
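A sketch of steps 1-2, assuming PostgreSQL streaming replication and the psycopg2 driver (connection details and the lag threshold are illustrative):

    import psycopg2

    replica = psycopg2.connect(host="db-replica.internal", dbname="app", user="dr_operator")
    replica.autocommit = True
    cur = replica.cursor()

    # Step 1: verify replay lag is within the allowed window (≤ 30s in the runbook).
    cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
    lag_seconds = cur.fetchone()[0] or 0
    assert lag_seconds <= 30, f"replay lag too high: {lag_seconds}s"

    # Step 2: promote the replica to primary (pg_promote is available in PostgreSQL 12+).
    cur.execute("SELECT pg_promote(wait => true)")
    # Endpoint/VIP updates, secret rotation, and validation follow in the remaining runbook steps.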
14.3 DR exercise plan (brief)
Purpose: verify Tier 0 RTO/RPO during an EU region failure
Scenario: EU incoming LB down + 60s replication delay
Success criteria: 100% traffic in US ≤ 12m; RPO ≤ 5m; SLI green 30m
Artifacts: switching logs, SLI graphs, step times, command output
15) Anti-patterns
"There are backups" without regular restore tests.
Secrets/endpoints are not automatically switched.
No idempotency → duplicate/lost transactions on redelivery.
Identical configs for regions without degradation feature flags.
Long Time-to-Declare for fear of "false alarm."
Monoregional providers (PSP/KYC) with no alternative.
There is no failback plan - we live in an emergency topology "forever."
16) Implementation Roadmap (6-10 weeks)
1. Weeks 1-2: classify services by Tier, set target RTO/RPO, choose DR patterns.
2. Weeks 3-4: set up replication/backups, GSLB/DNS, and promotion procedures; write playbooks and runbooks.
3. Weeks 5-6: first DR exercises (tabletop → staging); record metrics and CAPA items.
4. Weeks 7-8: prod-light exercise with restricted traffic; automate failover.
5. Weeks 9-10: cost optimization (FinOps), move Tier 0 to hot standby/active-active, establish a quarterly exercise and reporting cadence.
17) The bottom line
Effective DR is not just backups. It is coherent architecture, failover/failback automation, data discipline (idempotency, replication), regular drills, and transparent communication. When RTO/RPO targets are realistic, playbooks are rehearsed, and exercises are regular, a disaster becomes a controlled event after which services return to normal quickly and predictably.