DR Strategies and RTO/RPO
1) Basic principles
1. Goals before means. First, we formulate RTO/RPO and critical scenarios, then select the technology.
2. Segmentation by importance. Not all services require "gold"; divide by business criticality.
3. Data is the core of DR. Consistency, replication, corruption detection, and recovery point are more important than hardware.
4. Automation and verifiability. DR is meaningless without IaC, recovery regression tests, and telemetry.
5. Drills and evidence. A plan without regular game days is an illusion of readiness.
6. Security and compliance. Encryption, isolation, WORM/immutable backups, DPA/jurisdictions.
2) Terms and correspondences
RTO (Recovery Time Objective) - time from the moment of the event until the service is restored to "normal."
RPO (Recovery Point Objective) - the "age" of the last healthy data point at recovery.
RLO (Recovery Level Objective) - the level of functionality that must be restored (minimum viable service).
MTD (Maximum Tolerable Downtime) - the threshold after which the business suffers unacceptable damage.
RTA/RPA (Actual) - the actual recovery time/point measured in drills.
Relationships: RTO ≤ MTD; RPA ≤ RPO. The gap between targets and actuals is the subject of post-mortems and improvements.
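The relationships above can be checked mechanically after every drill. A minimal sketch (the field and function names are illustrative, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass
class DrTargets:
    rto_min: float   # target recovery time, minutes
    rpo_min: float   # target recovery point age, minutes
    mtd_min: float   # maximum tolerable downtime, minutes

@dataclass
class DrResult:
    rta_min: float   # actual recovery time from the drill
    rpa_min: float   # actual recovery point age from the drill

def gaps(result: DrResult, targets: DrTargets) -> list[str]:
    """Return the target-vs-actual gaps that feed the post-mortem."""
    issues = []
    if targets.rto_min > targets.mtd_min:
        issues.append("RTO exceeds MTD: targets are inconsistent")
    if result.rta_min > targets.rto_min:
        issues.append(f"RTA {result.rta_min}m exceeds RTO {targets.rto_min}m")
    if result.rpa_min > targets.rpo_min:
        issues.append(f"RPA {result.rpa_min}m exceeds RPO {targets.rpo_min}m")
    return issues
```

An empty result means the drill met its goals; each entry is a concrete improvement item.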
3) DR strategy classes (readiness levels)
From cheapest and slowest to most expensive and fastest:
Backup & Restore - only data and IaC are kept; rebuild on demand, hours-to-days RTO.
Pilot Light - core data is replicated, minimal infrastructure idles; scale up on failover.
Warm Standby - a reduced-capacity copy runs permanently; minutes-to-hours RTO.
Active/Passive - a full-size standby takes traffic on failover.
Active/Active - all regions serve traffic; near-zero RTO at the highest cost and complexity.
Choose the level per service based on its RTO/RPO, not uniformly for everything.
4) Scenarios against which we defend
Loss of a region/cloud/data center (power, network, provider).
Data corruption/operator error (deletion, broken replicas, logical corruption).
Malware/ransomware.
Release/configuration defect (mass outage).
Failure of a dependency (KMS, DNS, secrets, payment provider).
Legal events (blocking, prohibition of data export from the jurisdiction).
For each scenario, specify the RTO/RPO, DR level, playbook, and responsible persons.
5) Data strategies (key to RPO)
5.1 Backups
Full + incremental + transaction logs (for DB).
Immutable/WORM storages and offline copies ("air-gapped").
Catalog of backups with metadata and crypto signatures; scheduled test restores.
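A backup catalog entry can pair metadata with a content digest so that silent corruption is detected before a restore depends on the copy. A minimal sketch (the record layout is illustrative; real catalogs would also carry a KMS-backed signature over the digest):

```python
import hashlib
import time

def catalog_entry(path: str, payload: bytes, kind: str) -> dict:
    """Build a catalog record: metadata plus a SHA-256 content digest."""
    return {
        "path": path,
        "kind": kind,                  # full | incremental | wal
        "created_at": time.time(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "last_restore_test": None,     # filled in by scheduled test restores
    }

def verify(entry: dict, payload: bytes) -> bool:
    """Detect silent corruption before relying on a backup."""
    return hashlib.sha256(payload).hexdigest() == entry["sha256"]
```

A scheduled test restore would call `verify` on the fetched bytes and then stamp `last_restore_test`, which the policy gates below can check.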
5.2 Replication
Synchronous (low RPO, higher latency, risk of propagating corruption).
Asynchronous (low impact on performance, RPO > 0; combine with corruption detection).
CDC (Change Data Capture) for streaming replication and state reconstruction.
5.3 Protection against logical corruption
Versioning / point-in-time recovery (PITR) with a window ≥ N days.
Invariant signatures (balances, sums, checksums) for early detection of "broken" data.
"Slow" replication channels (apply delay of 15-60 minutes) as a buffer against instant corruption.
```python
def pick_restore_point(pitr_points, anomaly_signals, max_age):
    # Keep only points taken before the first anomaly signal
    # and not older than max_age.
    healthy = [
        p for p in pitr_points
        if not anomaly_signals.after(p.time) and now() - p.time <= max_age
    ]
    # The newest healthy point minimizes data loss (RPA).
    return max(healthy, key=lambda p: p.time) if healthy else None
```
6) Application, state, cache
Stateless layer - scale and restart in any region (images/charts/manifests in Git).
State (DB/caches/queues): the source of truth is the database; caches and indexes are regenerable.
Idempotence and replay - re-delivery of events must be safe; use outbox/inbox, deduplication, and versioning.
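The idempotence requirement above can be sketched as an inbox-style consumer: processed event IDs are deduplicated and per-aggregate versions reject stale replays. All names here are illustrative, not a specific framework's API:

```python
def handle_event(event: dict, inbox: set, versions: dict, state: dict) -> bool:
    """Apply an at-least-once-delivered event safely.

    inbox   - set of already-processed event ids (dedup)
    versions - last applied version per aggregate (rejects stale replays)
    Returns True if the event was applied, False if it was skipped.
    """
    if event["id"] in inbox:
        return False                        # duplicate delivery: skip
    key = event["aggregate"]
    if event["version"] <= versions.get(key, 0):
        return False                        # stale replay: skip
    state[key] = event["payload"]           # apply the change
    versions[key] = event["version"]
    inbox.add(event["id"])                  # remember: never apply twice
    return True
```

With this shape, replaying the whole queue after a failover is harmless: duplicates and already-applied versions are silently skipped.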
7) Network and entry point
GSLB/DNS failover: latency/health-based routing, short TTLs to shrink the failover window.
Anycast/L7 proxy: single IP, health-based regional routing.
Regional domains and jurisdictional policies (geo-pinning for PII).
Certificate/KMS failover: spare certificate chains, dual-key setup.
```python
if slo_breach("region-a") or health("region-a") == DOWN:
    # Canary the shift: move traffic in 20% steps
    route.shift(traffic, from_="region-a", to="region-b", step=20)
    enable_readonly_if_needed()
```
8) Operating model and automation
IaC/GitOps: second-region infrastructure = code, "single-button" deployment.
Policy as Code: gate releases - "no DR manifests/backups/alerts, no release."
Runbooks: step-by-step instructions and a "red button" identical in both regions.
Secrets: short-lived credentials, OIDC federation, a compromise/revocation plan.
```rego
package dr

deny["Missing PITR >= 7d"] {
    input.db.pitr_window_days < 7
}

deny["No restore test in 30d"] {
    # last_restore_test is assumed to be a Unix timestamp in nanoseconds
    time.now_ns() - input.db.last_restore_test > 30 * 24 * 3600 * 1000000000
}
```
9) Exercises and tests (Game Days)
Scenario table: database loss, "broken" data, KMS failure, region drop, sudden egress limit.
Frequency: quarterly for mission-critical; once every six months - for the rest.
Exercise metrics: RTA/RPA vs goals, proportion of automatic steps, number of manual interventions, playbook errors.
Chaos-smoke in releases: dependency degradation should not "break" DR paths.
An example drill timeline:
T0: cut off the primary database (firewall drop)
T+2m: GSLB shifts 20% of traffic, then 100% once SLO is OK
T+6m: check business invariants and replication lag
T+10m: post-drill: record RTA/RPA, improve playbooks
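The exercise metrics named above can be derived directly from drill step logs. A sketch, assuming each step is logged with an `automated` flag and an optional `error` flag (field names are illustrative):

```python
def drill_metrics(steps: list[dict], t_start: float, t_recovered: float) -> dict:
    """Summarize a game day: RTA plus the share of automated steps."""
    automatic = sum(1 for s in steps if s["automated"])
    return {
        "rta_min": (t_recovered - t_start) / 60,          # actual recovery time
        "auto_ratio": automatic / len(steps),             # automation coverage
        "manual_interventions": len(steps) - automatic,
        "playbook_errors": sum(1 for s in steps if s.get("error")),
    }
```

Comparing `rta_min` against the playbook's RTO, drill after drill, is what turns game days into evidence rather than ritual.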
10) Playbooks (canonical template)
```yaml
playbook: "dr-failover-region-a-to-b"
owner: "platform-sre"
rto: "15m"
rpo: "5m"
triggers:
  - "health(region-a)==down"
  - "slo_breach(payments)"
prechecks:
  - "backup_catalog ok; last_restore_test < 30d"
  - "pitr_window >= 7d"
steps:
  - "Announce incident; open war-room; assign IC"
  - "Freeze writes in region-a (flag write_readonly)"
  - "Promote db-b to primary; verify replication stopped cleanly"
  - "Shift GSLB 20% -> 50% -> 100%; monitor p95/error"
  - "Enable compensations and replay queues"
validation:
  - "Business invariants (balances, duplicate_checks)"
  - "Synthetic tests green; dashboards stable 30m"
rollback:
  - "If db-b unhealthy: revert traffic; engage restore from PITR T-Δ"
comms:
  - "Status updates every 15m; external note if SEV1"
```
11) DR observability metrics
Replica lag (seconds), RPO drift (difference between target and actual RPO).
Restore SLI: cold/warm recovery time per environment.
Coverage: % of services with playbooks/backups/PITR window ≥ N days.
Drill score: share of automated steps, RTA distribution, error rate.
Immutability: % of backups in WORM/air-gapped storage.
Event metrics: queue length and replay rate after a failover.
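RPO drift, the first metric above, reduces to simple timestamp arithmetic on the replica's last applied change. A minimal sketch (argument names are illustrative):

```python
def rpo_drift_seconds(last_applied_ts: float, target_rpo_s: float,
                      now_ts: float) -> float:
    """How far the replica lag exceeds the RPO budget.

    Positive drift: the replica is further behind than the RPO allows
    (a failover right now would lose more data than agreed).
    Negative drift: remaining RPO headroom.
    """
    lag = now_ts - last_applied_ts
    return lag - target_rpo_s
```

Alerting on sustained positive drift, rather than on raw lag, ties the signal directly to the business commitment.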
12) Cost and trade-offs
CapEx/OpEx: Warm Standby is cheaper than Active/Active but more expensive than Pilot Light.
Egress: cross-region/cross-cloud replication costs money; use caching/compression/local aggregates.
RTO/RPO vs $: each extra "nine" of availability and each second shaved off RPO costs disproportionately more - agree on targets with the business.
Green windows: schedule batch replication in cheap/"green" hours.
13) Security and compliance
Encryption at rest and in transit; separate KMS domains per region.
Immutable backups, ransomware protection: "3-2-1" (3 copies, 2 media, 1 offline), MFA-delete.
Jurisdictions: geo-pinning for PII, backup localization, Legal Hold on top of TTL.
Time-boxed access: temporary roles for DR operations, an audit log.
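The "3-2-1" rule above is easy to enforce as a check over the backup catalog. A minimal sketch, assuming each copy records its storage medium and an offline/air-gapped flag (the record shape is illustrative):

```python
def satisfies_3_2_1(copies: list[dict]) -> bool:
    """Check the 3-2-1 rule: >= 3 copies, >= 2 media types, >= 1 offline copy."""
    media = {c["medium"] for c in copies}          # distinct storage media
    offline = [c for c in copies if c.get("offline")]
    return len(copies) >= 3 and len(media) >= 2 and len(offline) >= 1
```

Wired into a Policy-as-Code gate, such a check makes the "no backups - no release" rule concrete per service.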
14) Anti-patterns
"Let's write a plan later" - DR without exercises.
Replication without protection against logical corruption - will instantly multiply the error.
One KMS/secrets region - no feilover possible.
Backups without regular restores - "Shredinger" DR.
Closely related synchronous transactions between regions are cascade latency/fall.
No prioritization: the same DR level for everything (expensive and useless).
15) Architect checklist
1. Defined RTO/RPO/RLO by service and scenario?
2. Classified data: source of truth, PITR/window, WORM/immutable?
3. Is the DR level (Backup/Restore, Pilot Light, Warm Standby, A/P, A/A) selected per service?
4. Network: GSLB/Anycast, certificates/keys with a margin, read-only flags?
5. App: idempotence, outbox/inbox, compensating transactions?
6. IaC/GitOps/Policy as Code: one click to roll out the second region?
7. Drills: schedule, RTA/RPA KPIs, post-drill action items?
8. Monitoring: lag, RPO-drift, restore-SLI, drill-score, immutable backups?
9. Security/Compliance: KMS Domains, Jurisdictions, Legal Hold?
10. Cost: egress budget, green windows, economically sound level?
16) Mini recipes and sketches
16.1 PITR for Postgres (idea):

```bash
# Daily base backup + continuous WAL archiving
pg_basebackup -D /backups/base/$(date +%F)
# postgresql.conf:
#   archive_command = 'aws s3 cp %p s3://bucket/wal/%f --sse'
# Restore: lay down the base backup, then replay WAL up to a point in time:
#   restore_command      = 'aws s3 cp s3://bucket/wal/%f %p'
#   recovery_target_time = '2025-10-31 13:21:00Z'
```
16.2 Protection against logical corruption (delayed replica):

```yaml
replication:
  mode: async
  apply_delay: "30m"  # window to roll back on corruption
```
16.3 Traffic switching (GSLB pseudo-API):

```bash
gslb set-weight api.example.com region-a 0
gslb set-weight api.example.com region-b 100
```
16.4 Invariant checks after failover (pseudocode):

```python
assert total_balance(all_accounts) == snapshot_total
assert no_duplicates(events_since(t_failover))
```
Conclusion
DR is the ability to make technical and organizational decisions faster than the damage grows. Define realistic RTO/RPO targets, choose a sufficient readiness level, automate the infrastructure and its checks, drill regularly, and measure actual RTA/RPA. Then an accident becomes not a disaster but a controlled incident with a predictable outcome.