DR Strategies and RTO/RPO
1) Basic principles
1. Goals before means. First, we formulate RTO/RPO and critical scenarios, then select the technology.
2. Segmentation by importance. Not all services require "gold"; divide by business criticality.
3. Data is the core of DR. Consistency, replication, corruption detection, and recovery point are more important than hardware.
4. Automation and verifiability. DR is meaningless without IaC, recovery regression tests, and telemetry.
5. Drills and evidence. A plan without regular game days is an illusion of readiness.
6. Security and compliance. Encryption, isolation, WORM/immutable backups, DPA/jurisdictions.
2) Terms and correspondences
RTO (Recovery Time Objective) - time from the moment of the event until the service is restored to "normal."
RPO (Recovery Point Objective) - the "age" of the last healthy data point at recovery.
RLO (Recovery Level Objective) - the level of functionality that must be restored (minimum viable service).
MTD (Maximum Tolerable Downtime) - the threshold after which the business suffers unacceptable damage.
RTA/RPA (Actual) - the actual recovery time/point measured in drills.
Relationships: RTO ≤ MTD; RPA ≤ RPO. The gap between targets and actuals is the subject of post-mortems and improvements.
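The relationships above can be checked mechanically after every drill. A minimal sketch (the field and function names are illustrative, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass
class DrTargets:
    rto_min: float   # target recovery time, minutes
    rpo_min: float   # target recovery point age, minutes
    mtd_min: float   # maximum tolerable downtime, minutes

@dataclass
class DrResult:
    rta_min: float   # actual recovery time from the drill
    rpa_min: float   # actual recovery point age from the drill

def gaps(result: DrResult, targets: DrTargets) -> list[str]:
    """Return the target-vs-actual gaps that feed the post-mortem."""
    issues = []
    if targets.rto_min > targets.mtd_min:
        issues.append("RTO exceeds MTD: targets are inconsistent")
    if result.rta_min > targets.rto_min:
        issues.append(f"RTA {result.rta_min}m exceeds RTO {targets.rto_min}m")
    if result.rpa_min > targets.rpo_min:
        issues.append(f"RPA {result.rpa_min}m exceeds RPO {targets.rpo_min}m")
    return issues
```

An empty result means the drill met its goals; each entry is a concrete improvement item.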
3) DR strategy classes (readiness levels)
From cheapest and slowest to most expensive and fastest:
Backup & Restore - only data and IaC are kept; rebuild on demand, hours-to-days RTO.
Pilot Light - core data is replicated, minimal infrastructure idles; scale up on failover.
Warm Standby - a reduced-capacity copy runs permanently; minutes-to-hours RTO.
Active/Passive - a full-size standby takes traffic on failover.
Active/Active - all regions serve traffic; near-zero RTO at the highest cost and complexity.
Choose the level per service based on its RTO/RPO, not uniformly for everything.
4) Scenarios against which we defend
Loss of a region/cloud/data center (power, network, provider).
Data corruption/operator error (deletion, broken replicas, logical corruption).
Malware/ransomware.
Release/configuration defect (mass outage).
Failure of a dependency (KMS, DNS, secrets, payment provider).
Legal events (blocking, prohibition of data export from the jurisdiction).
For each scenario, specify the RTO/RPO, DR level, playbook, and responsible persons.
5) Data strategies (key to RPO)
5.1 Backups
Full + incremental + transaction logs (for DB).
Immutable/WORM storages and offline copies ("air-gapped").
Catalog of backups with metadata and crypto signatures; scheduled test restores.
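A backup catalog entry can pair metadata with a content digest so that silent corruption is detected before a restore depends on the copy. A minimal sketch (the record layout is illustrative; real catalogs would also carry a KMS-backed signature over the digest):

```python
import hashlib
import time

def catalog_entry(path: str, payload: bytes, kind: str) -> dict:
    """Build a catalog record: metadata plus a SHA-256 content digest."""
    return {
        "path": path,
        "kind": kind,                  # full | incremental | wal
        "created_at": time.time(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "last_restore_test": None,     # filled in by scheduled test restores
    }

def verify(entry: dict, payload: bytes) -> bool:
    """Detect silent corruption before relying on a backup."""
    return hashlib.sha256(payload).hexdigest() == entry["sha256"]
```

A scheduled test restore would call `verify` on the fetched bytes and then stamp `last_restore_test`, which the policy gates below can check.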
5.2 Replication
Synchronous (low RPO, higher latency, risk of propagating corruption).
Asynchronous (low impact on performance, RPO > 0; combine with corruption detection).
CDC (Change Data Capture) for streaming replication and state reconstruction.
5.3 Protection against logical corruption
Versioning / point-in-time recovery (PITR) with a window ≥ N days.
Invariant signatures (balances, sums, checksums) for early detection of "broken" data.
"Slow" replication channels (apply delay of 15-60 minutes) as a buffer against instant corruption.
```python
def pick_restore_point(pitr_points, anomaly_signals, max_age):
    # Keep only points taken before the first anomaly signal
    # and not older than max_age.
    healthy = [
        p for p in pitr_points
        if not anomaly_signals.after(p.time) and now() - p.time <= max_age
    ]
    # The newest healthy point minimizes data loss (RPA).
    return max(healthy, key=lambda p: p.time) if healthy else None
```
6) Application, state, cache
Stateless layer - scale and restart in any region (images/charts/manifests in Git).
State (DB/caches/queues): the source of truth is the database; caches and indexes are regenerable.
Idempotence and replay - re-delivery of events must be safe; use outbox/inbox, deduplication, and versioning.
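The idempotence requirement above can be sketched as an inbox-style consumer: processed event IDs are deduplicated and per-aggregate versions reject stale replays. All names here are illustrative, not a specific framework's API:

```python
def handle_event(event: dict, inbox: set, versions: dict, state: dict) -> bool:
    """Apply an at-least-once-delivered event safely.

    inbox   - set of already-processed event ids (dedup)
    versions - last applied version per aggregate (rejects stale replays)
    Returns True if the event was applied, False if it was skipped.
    """
    if event["id"] in inbox:
        return False                        # duplicate delivery: skip
    key = event["aggregate"]
    if event["version"] <= versions.get(key, 0):
        return False                        # stale replay: skip
    state[key] = event["payload"]           # apply the change
    versions[key] = event["version"]
    inbox.add(event["id"])                  # remember: never apply twice
    return True
```

With this shape, replaying the whole queue after a failover is harmless: duplicates and already-applied versions are silently skipped.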
7) Network and entry point
GSLB/DNS failover: latency/health-based routing, short TTLs to shrink the failover window.
Anycast/L7 proxy: single IP, health-based regional routing.
Regional domains and jurisdictional policies (geo-pinning for PII).
Certificate/KMS failover: spare certificate chains, dual-key setup.
```python
if slo_breach("region-a") or health("region-a") == DOWN:
    # Canary the shift: move traffic in 20% steps
    route.shift(traffic, from_="region-a", to="region-b", step=20)
    enable_readonly_if_needed()
```
8) Operating model and automation
IaC/GitOps: second-region infrastructure = code, "single-button" deployment.
Policy as Code: gate releases - "no DR manifests/backups/alerts, no release."
Runbooks: step-by-step instructions and a "red button" identical in both regions.
Secrets: short-lived credentials, OIDC federation, a compromise/revocation plan.
```rego
package dr

deny["Missing PITR >= 7d"] {
    input.db.pitr_window_days < 7
}

deny["No restore test in 30d"] {
    # last_restore_test is assumed to be a Unix timestamp in nanoseconds
    time.now_ns() - input.db.last_restore_test > 30 * 24 * 3600 * 1000000000
}
```
9) Exercises and tests (Game Days)
Scenario table: database loss, "broken" data, KMS failure, region drop, sudden egress limit.
Frequency: quarterly for mission-critical; once every six months - for the rest.
Exercise metrics: RTA/RPA vs goals, proportion of automatic steps, number of manual interventions, playbook errors.
Chaos-smoke in releases: dependency degradation should not "break" DR paths.
An example drill timeline:
T0: cut off the primary database (firewall drop)
T+2m: GSLB shifts 20% of traffic, then 100% once SLO is OK
T+6m: check business invariants and replication lag
T+10m: post-drill: record RTA/RPA, improve playbooks
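The exercise metrics named above can be derived directly from drill step logs. A sketch, assuming each step is logged with an `automated` flag and an optional `error` flag (field names are illustrative):

```python
def drill_metrics(steps: list[dict], t_start: float, t_recovered: float) -> dict:
    """Summarize a game day: RTA plus the share of automated steps."""
    automatic = sum(1 for s in steps if s["automated"])
    return {
        "rta_min": (t_recovered - t_start) / 60,          # actual recovery time
        "auto_ratio": automatic / len(steps),             # automation coverage
        "manual_interventions": len(steps) - automatic,
        "playbook_errors": sum(1 for s in steps if s.get("error")),
    }
```

Comparing `rta_min` against the playbook's RTO, drill after drill, is what turns game days into evidence rather than ritual.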
10) Playbooks (canonical template)
```yaml
playbook: "dr-failover-region-a-to-b"
owner: "platform-sre"
rto: "15m"
rpo: "5m"
triggers:
  - "health(region-a)==down"
  - "slo_breach(payments)"
prechecks:
  - "backup_catalog ok; last_restore_test < 30d"
  - "pitr_window >= 7d"
steps:
  - "Announce incident; open war-room; assign IC"
  - "Freeze writes in region-a (flag write_readonly)"
  - "Promote db-b to primary; verify replication stopped cleanly"
  - "Shift GSLB 20% -> 50% -> 100%; monitor p95/error"
  - "Enable compensations and replay queues"
validation:
  - "Business invariants (balances, duplicate_checks)"
  - "Synthetic tests green; dashboards stable 30m"
rollback:
  - "If db-b unhealthy: revert traffic; engage restore from PITR T-Δ"
comms:
  - "Status updates every 15m; external note if SEV1"
```
11) DR observability metrics
Replica lag (seconds), RPO drift (difference between target and actual RPO).
Restore SLI: cold/warm recovery time per environment.
Coverage: % of services with playbooks/backups/PITR window ≥ N days.
Drill score: share of automated steps, RTA distribution, error rate.
Immutability: % of backups in WORM/air-gapped storage.
Event metrics: queue length and replay rate after a failover.
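RPO drift, the first metric above, reduces to simple timestamp arithmetic on the replica's last applied change. A minimal sketch (argument names are illustrative):

```python
def rpo_drift_seconds(last_applied_ts: float, target_rpo_s: float,
                      now_ts: float) -> float:
    """How far the replica lag exceeds the RPO budget.

    Positive drift: the replica is further behind than the RPO allows
    (a failover right now would lose more data than agreed).
    Negative drift: remaining RPO headroom.
    """
    lag = now_ts - last_applied_ts
    return lag - target_rpo_s
```

Alerting on sustained positive drift, rather than on raw lag, ties the signal directly to the business commitment.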
12) Cost and trade-offs
CapEx/OpEx: Warm Standby is cheaper than Active/Active but more expensive than Pilot Light.
Egress: cross-region/cross-cloud replication costs money; use caching/compression/local aggregates.
RTO/RPO vs $: each extra "nine" of availability and each second shaved off RPO costs disproportionately more - agree on targets with the business.
Green windows: schedule batch replication in cheap/"green" hours.
13) Security and compliance
Encryption at rest and in transit; separate KMS domains per region.
Immutable backups, ransomware protection: "3-2-1" (3 copies, 2 media, 1 offline), MFA-delete.
Jurisdictions: geo-pinning for PII, backup localization, Legal Hold on top of TTL.
Time-boxed access: temporary roles for DR operations, an audit log.
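The "3-2-1" rule above is easy to enforce as a check over the backup catalog. A minimal sketch, assuming each copy records its storage medium and an offline/air-gapped flag (the record shape is illustrative):

```python
def satisfies_3_2_1(copies: list[dict]) -> bool:
    """Check the 3-2-1 rule: >= 3 copies, >= 2 media types, >= 1 offline copy."""
    media = {c["medium"] for c in copies}          # distinct storage media
    offline = [c for c in copies if c.get("offline")]
    return len(copies) >= 3 and len(media) >= 2 and len(offline) >= 1
```

Wired into a Policy-as-Code gate, such a check makes the "no backups - no release" rule concrete per service.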
14) Anti-patterns
"Let's write a plan later" - DR without exercises.
Replication without protection against logical corruption - will instantly multiply the error.
One KMS/secrets region - no feilover possible.
Backups without regular restores - "Shredinger" DR.
Closely related synchronous transactions between regions are cascade latency/fall.
No prioritization: the same DR level for everything (expensive and useless).
15) Architect checklist
1. Defined RTO/RPO/RLO by service and scenario?
2. Classified data: source of truth, PITR/window, WORM/immutable?
3. Is the DR level (Backup/Restore, Pilot Light, Warm Standby, A/P, A/A) selected per service?
4. Network: GSLB/Anycast, certificates/keys with a margin, read-only flags?
5. App: idempotence, outbox/inbox, compensating transactions?
6. IaC/GitOps/Policy as Code: one click to roll out the second region?
7. Drills: schedule, RTA/RPA KPIs, post-drill action items?
8. Monitoring: lag, RPO-drift, restore-SLI, drill-score, immutable backups?
9. Security/Compliance: KMS Domains, Jurisdictions, Legal Hold?
10. Cost: egress budget, green windows, economically sound level?
16) Mini recipes and sketches
16.1 PITR for Postgres (idea):

```bash
# Daily base backup + continuous WAL archiving
pg_basebackup -D /backups/base/$(date +%F)
# postgresql.conf:
#   archive_command = 'aws s3 cp %p s3://bucket/wal/%f --sse'
# Restore: lay down the base backup, then replay WAL up to a point in time:
#   restore_command      = 'aws s3 cp s3://bucket/wal/%f %p'
#   recovery_target_time = '2025-10-31 13:21:00Z'
```
16.2 Protection against logical corruption (delayed replica):

```yaml
replication:
  mode: async
  apply_delay: "30m"  # window to roll back on corruption
```
16.3 Traffic switching (GSLB pseudo-API):

```bash
gslb set-weight api.example.com region-a 0
gslb set-weight api.example.com region-b 100
```
16.4 Invariant checks after failover (pseudocode):

```python
assert total_balance(all_accounts) == snapshot_total
assert no_duplicates(events_since(t_failover))
```
Conclusion
DR is the ability to make technical and organizational decisions faster than the damage grows. Define realistic RTO/RPO targets, choose a sufficient readiness level, automate the infrastructure and its checks, drill regularly, and measure actual RTA/RPA. Then an accident becomes not a disaster but a controlled incident with a predictable outcome.