Disaster Recovery and cold-backups
Brief Summary
DR is the ability to restore business functions after a major disaster. Cold-backups are the "last line of defense": immutable, isolated copies suitable for recovery after a complete site power loss or a compromise. The strategy is built around RTO/RPO, system prioritization, regular DR exercises and strict operational discipline (catalogs, keys, checks).
Terms and objectives
RPO (Recovery Point Objective) - maximum allowable data loss (e.g. ≤ 15 min).
RTO (Recovery Time Objective) - maximum allowable recovery time (e.g. ≤ 2 hours).
Black-start - bare metal recovery: hardware/cluster/secrets/data/DNS.
Air-gap - physical/logical isolation of copies (tape/disconnected account/offline media).
Immutability (WORM) - immutable storage (tape/object with Lock/Retention).
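As a rough illustration of how the two objectives turn into checks, the sketch below compares the age of the newest restorable point with the RPO and a measured restore duration with the RTO. The function names are hypothetical, and the 15-minute and 2-hour targets are simply the example values from the definitions above.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)  # example target from the definitions above
RTO = timedelta(hours=2)     # example target from the definitions above


def rpo_met(last_recovery_point: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True if the data lost by restoring right now would fit within the RPO."""
    return now - last_recovery_point <= rpo


def rto_met(restore_started: datetime, service_restored: datetime, rto: timedelta = RTO) -> bool:
    """True if a real or drill restore finished within the RTO."""
    return service_restored - restore_started <= rto


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(rpo_met(now - timedelta(minutes=10), now))  # True
print(rpo_met(now - timedelta(minutes=40), now))  # False
print(rto_met(now, now + timedelta(hours=3)))     # False
```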
DR availability levels
Cold Site - infrastructure is missing/frozen; RTO: hours-days; cheapest CAPEX/OPEX.
Warm Site - templates/images/partially provisioned services; RTO: tens of minutes to hours.
Hot Site - active replicas; RTO: minutes; more expensive and more complicated.
Hybrid: core → hot/warm, everything else → cold (with priority at startup).
Where cold-backups are indispensable
Mass ransomware infection/domain compromise.
Data corruption that has propagated to all replicas.
Loss of region/data center, force majeure (fire, flood).
Intentional deletion/sabotage via privileged accounts.
Cold-backups topology
1. Media/Storage Classes
Tapes (LTO-8/9): low cost, default air-gap, high capacity, sequential access.
Offline disks/NAS: kept in "safe cases", connected only for the backup/restore window.
Archive object storage classes (Glacier-like): low storage price, longer retrieval time.
2. Placement
Other site/region; other provider/account; separate keys/administrators.
3. Immutability
WORM tapes/Object Lock (Compliance/Governance modes) with retention periods and Legal Hold.
Policy 3-2-1-1-0 (with focus on cold)
3 copies of data (prod + local backup + offsite).
2 different media (disk/tape/object).
1 offsite (other site/cloud).
1 immutable (WORM/air-gap).
0 verification errors (checksums/periodic test restores).
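The five rules above can be checked mechanically against an inventory of copies. The following is a minimal sketch under assumed names: the `Copy` record, its fields, and `check_32110` are hypothetical and not taken from any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str          # e.g. "disk", "tape", "object"
    offsite: bool       # stored at another site/cloud
    immutable: bool     # WORM / Object Lock / air-gapped
    verify_errors: int  # errors from the last checksum or test restore


def check_32110(copies: list[Copy]) -> dict[str, bool]:
    """Evaluate each of the five 3-2-1-1-0 rules for a set of copies."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


copies = [
    Copy("prod", "disk", offsite=False, immutable=False, verify_errors=0),
    Copy("local-backup", "disk", offsite=False, immutable=False, verify_errors=0),
    Copy("offsite-tape", "tape", offsite=True, immutable=True, verify_errors=0),
]
result = check_32110(copies)
print(result)
print("policy satisfied:", all(result.values()))  # policy satisfied: True
```

Reporting each rule separately, rather than a single pass/fail flag, makes it clear which part of the policy a given dataset is missing.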
Catalogs, Metadata, and Integrity Control
Backup catalog: what, where, when, version, keys, checksums, retention period.
Asset catalog: service → dependencies → volumes/buckets → priority.
Checksums and manifest files: recorded at write time and reconciled at restore time.
Canary files: regular restore for early detection of media problems.
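A manifest-based integrity check can be sketched as follows: checksums are recorded when the backup is written and reconciled after a restore. The function names and the JSON manifest layout are assumptions made for illustration; the same routine could serve a regular canary-file restore.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest_path: Path) -> dict[str, str]:
    """Record a checksum for every file under backup_dir at write time."""
    manifest = {
        p.relative_to(backup_dir).as_posix(): sha256_file(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def verify_restore(restore_dir: Path, manifest_path: Path) -> list[str]:
    """Return relative paths that are missing or differ after a restore."""
    manifest = json.loads(manifest_path.read_text())
    bad = []
    for rel, expected in manifest.items():
        target = restore_dir / rel
        if not target.is_file() or sha256_file(target) != expected:
            bad.append(rel)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    backup, restore = Path(tmp, "backup"), Path(tmp, "restore")
    backup.mkdir()
    restore.mkdir()
    (backup / "canary.txt").write_text("canary v1")
    (restore / "canary.txt").write_text("canary v1")
    manifest_file = Path(tmp, "manifest.json")  # kept outside the backup tree
    write_manifest(backup, manifest_file)
    clean = verify_restore(restore, manifest_file)
    (restore / "canary.txt").write_text("bit rot")  # simulate media damage
    damaged = verify_restore(restore, manifest_file)

print(clean)    # []
print(damaged)  # ['canary.txt']
```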
Encryption and Keys
Encryption at rest (tape/object) and in flight (copying).
KMS/Vault with dual-control, offline safes for master keys, rotation.
Separate keys for prod/backups/archives (minimizing blast radius).
Documented key access process during DR (requirements, roles, log).
DR Plan Prioritization and Sequencing
Priority map (example):
1. Identity and access: IdP (minimum zone), Vault/KMS, network core.
2. Data and control planes: Kubernetes etcd, configs, secrets, image registries, deploy artifacts.
3. Transaction databases/wallet: logs + latest full/incremental.
4. Payment/integration gateways: keys, certificates, IP/DNS.
5. Web/API frontends: canary launch, static content from object storage.
6. Analytics/reporting: after the core has been restored.
Restore sequence (black-start):
1. Infrastructure: network, DNS/Anycast, core IAM, base images/cluster.
2. Secrets/certificates: restore Vault/KMS from cold-backup, distribute bootstrap secrets.
3. Control plane: etcd/control plane/registries/repositories.
4. Data: deploy the database from cold-backup + PITR from logs (per RPO).
5. Applications: start in dependency-tree order, warm up caches/CDN.
6. Tests and validation: health tests, consistency, checksums.
7. Traffic switching: DNS/routing/balancers (phased/canary).
8. Post-checks: no leaks or outstanding debts; event log and formal DR report.
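One way to derive such a start order from the asset catalog is a topological sort over service dependencies. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the dependency graph and service names are hypothetical examples, not a prescribed layout.

```python
from graphlib import TopologicalSorter

# service -> services that must be up first (hypothetical example graph)
DEPENDS_ON = {
    "network-dns": set(),
    "iam-core": {"network-dns"},
    "vault-kms": {"network-dns", "iam-core"},
    "control-plane": {"vault-kms"},
    "image-registry": {"vault-kms"},
    "database": {"vault-kms", "control-plane"},
    "applications": {"database", "control-plane", "image-registry"},
    "traffic-switch": {"applications"},
}


def restore_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; everything in one wave can start in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(restore_waves(DEPENDS_ON), start=1):
    print(number, wave)
```

Grouping into waves rather than a flat list shows which components can be restored in parallel, which matters when the RTO is tight.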
Cold-restore procedures (typical)
Tapes: inventory, load, parallel streams, file map → catalogs → restore jobs; account for seek and rewind times.
Archive classes: retrieval request (minutes to hours), staging to hot storage, restore by manifest.
Offline disks: read-only connection, checksum verification → copying.
Practice: restore into an isolated sandbox first, then transfer to the production environment.
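Because tape is sequential, restore time is dominated by streaming throughput plus per-tape handling. The estimate below is a back-of-the-envelope sketch: the function name, the sustained throughput, and the per-tape overhead are assumptions to be replaced with figures measured in your own drills.

```python
import math


def tape_restore_hours(
    data_tb: float,
    drives: int,
    throughput_mb_s: float,
    tape_capacity_tb: float,
    overhead_min_per_tape: float,
) -> float:
    """Rough estimate: streaming time plus per-tape load/seek/rewind overhead."""
    tapes = math.ceil(data_tb / tape_capacity_tb)
    tapes_per_drive = math.ceil(tapes / drives)
    # 1 TB = 1_000_000 MB (decimal units, as used on tape datasheets)
    stream_seconds = data_tb * 1_000_000 / (throughput_mb_s * drives)
    overhead_seconds = tapes_per_drive * overhead_min_per_tape * 60
    return (stream_seconds + overhead_seconds) / 3600


# Assumed scenario: 60 TB on 12 TB cartridges, 4 drives in parallel,
# 300 MB/s sustained per drive, 5 minutes of handling per tape.
hours = tape_restore_hours(60, drives=4, throughput_mb_s=300,
                           tape_capacity_tb=12, overhead_min_per_tape=5)
print(round(hours, 1))  # 14.1
```

An estimate like this is useful mainly as a sanity check: if the result already exceeds the RTO, the data belongs in a warmer tier or needs more drives.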
Communications and org. structure in DR
Roles: Incident Commander, Tech Lead (Infra), DB Lead, App Lead, Comms, Security.
Channels: backup (outside the corporate domain), voice/chat, SecureDocs.
Message templates: to clients/partners/regulators; update frequency; a single "source of truth."
Unified event log: timeline, decisions, owners.
DNS, Networks and Traffic
Split-brain-protection: "DR-mode" flags in the configuration; feature-flags for limited functionality.
DNS strategy: low TTL set in advance, independent DNS provider; staged change of A/AAAA/CNAME records, CDN warm-up.
Routing: Anycast/Geo, BGP announcement from the DR site; ACLs/firewalls are rebuilt from IaC.
SLO for DR
RPO met ≥ 99% of the time (log/increment lag within target).
RTO black-start (full scenario) ≤ target (for example, 4 hours), confirmed in quarterly tests.
Success of DR exercises - 100% of critical tasks are completed in the window.
Immutability - the share of backups with Retention/Lock = 100%.
Integrity checks - 100% as per schedule; media failure → migration ticket.
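The first SLO can be computed directly from periodic lag measurements. This is a minimal sketch that assumes the lag of logs/increments is sampled at a fixed interval; the function name and the sample data are hypothetical.

```python
def rpo_compliance(lag_samples_s: list[float], rpo_s: float) -> float:
    """Share of samples (0..1) where the log/increment lag was within the RPO."""
    if not lag_samples_s:
        raise ValueError("no lag samples collected")
    within = sum(1 for lag in lag_samples_s if lag <= rpo_s)
    return within / len(lag_samples_s)


# 200 samples: 199 healthy, one 20-minute lag that breaches a 15-minute RPO.
lags = [60.0] * 199 + [1200.0]
share = rpo_compliance(lags, rpo_s=900)
print(f"{share:.1%}", share >= 0.99)  # 99.5% True
```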
Tests and exercises
Table-top: scenarios, roles, checklists, contact list.
Technical: selective recovery of databases/files/secrets to the sandbox with verification of checksums and consistency.
Black-start drill: once a quarter (or every six months) - full launch of the core at the DR site.
Post-mortem: facts, bottlenecks, improvement plan (SLO/processes/automation).
Automation and Artifacts
IaC: clusters, networks, stacks - in code; DR branches/parameters.
Runbooks: component by component (Vault/KMS, etcd, DB, gateways, frontends).
DR package: offline copy of key documents (contacts, diagrams, safe passphrases), physical access instructions.
Canary-restore: daily small restore and checksum reconciliation.
Tags/labels: "DR-critical," "Warm-only," "Cold-only" for services/volumes.
Implementation checklist
- Data classes and their RPOs/RTOs are aligned with the business; recovery priorities are defined.
- Implemented cold-backups: media, immutability (WORM/Object Lock), offsite/air-gap.
- Catalogs: assets, backups, keys; checksums and version control.
- Black-start procedures: networks/DNS, IdP/Vault/KMS, control plane, data, application layer.
- Exercises: table-top quarterly; canary restores daily; black-start once a quarter or every six months.
- Communications and regulatory templates; separate communication channels.
- SLO/metrics/alerts for DR; reports to management.
- Agreements with providers (tapes/archive classes/DNS/CDN), SLA confirmed.
- Finance: media/archive budget, logistics, scheduled media replacement.
Common errors
"We have a replica, so no backup is needed" → a logical error or ransomware propagates to every replica.
No immutability/air-gap → a single attack vector compromises all copies.
No catalogs/checksums → "something" gets restored, but not the right thing.
DNS TTL is too large → multi-day traffic migration.
Keys/KMS in the same domain/account → access is locked out during an incident.
Exercises only "on paper" → RTO/RPO are not confirmed.
iGaming/fintech specific
Wallet/payment core: strict RPO (≤ 1-5 minutes) and RTO (≤ 15-60 minutes); logs to object storage with WORM; a "read-only balance" DR mode for transparent communication.
PSP/content providers: pre-agreed DR IPs/domains, whitelists, certificates, HMAC/mTLS keys - copies in the DR package.
Reporting/regulators: notification templates, immutable archives, provable integrity, activity log.
Peaks and events: DR readiness is checked before major tournaments/promotions; canary restore and CDN warming.
Mini Runbook Templates
1) Vault/KMS black-start (concept):
1. Initialize the DR cluster, load the unseal keys (dual-control).
2. Restore the storage backup (cold copy).
3. Verify policies, issue bootstrap secrets for CI/CD/K8s.
2) PostgreSQL DR (PITR from cold-backup):
1. Deploy an empty instance, restore the full backup from cold storage.
2. Replay WAL (incremental logs) up to the target point in time.
3. Consistency check, enable replication, open read-only, then read-write.
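For PostgreSQL 12 and later, the replay step is driven by recovery settings in `postgresql.conf` plus an empty `recovery.signal` file in the data directory. The sketch below only renders those settings and creates the signal file; the helper names and the WAL archive path are hypothetical, and the `cp`-based `restore_command` assumes the archived WAL has already been staged on a local filesystem.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def render_pitr_settings(wal_archive_dir: str, target_time: datetime) -> str:
    """Render recovery settings for PostgreSQL 12+ point-in-time recovery."""
    if target_time.tzinfo is None:
        raise ValueError("target_time must be timezone-aware")
    return "\n".join([
        f"restore_command = 'cp {wal_archive_dir}/%f %p'",
        f"recovery_target_time = '{target_time.isoformat(sep=' ')}'",
        # 'pause' holds the server in read-only recovery at the target,
        # so consistency checks can run before it is promoted.
        "recovery_target_action = 'pause'",
    ]) + "\n"


def prepare_pitr(data_dir: Path, settings: str) -> None:
    """Append the settings and create recovery.signal in a restored data dir."""
    with (data_dir / "postgresql.conf").open("a") as conf:
        conf.write("\n# PITR settings added for the DR restore\n" + settings)
    (data_dir / "recovery.signal").touch()


target = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)
settings = render_pitr_settings("/mnt/cold-restore/wal", target)
print(settings)

with tempfile.TemporaryDirectory() as tmp:  # stands in for the real data dir
    prepare_pitr(Path(tmp), settings)
    created = sorted(p.name for p in Path(tmp).iterdir())
print(created)  # ['postgresql.conf', 'recovery.signal']
```

Pausing at the target matches the read-only-then-read-write order of step 3: the instance can be inspected before recovery is ended and writes are allowed.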
3) DNS/traffic:
1. Lower TTL 24-72 hours ahead of planned risks (or keep it low permanently).
2. Switch A/AAAA/CNAME records by checklist, with error/latency monitoring.
3. Gradual traffic growth (canary 5% → 25% → 100%).
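The canary ramp can be expressed as a small control loop that raises the DR traffic share step by step and falls back when monitoring looks bad. This is a sketch under assumptions: `set_weight` and `error_rate` are hypothetical callbacks standing in for your DNS/load-balancer API and monitoring, and the 1% error threshold is an arbitrary example.

```python
from typing import Callable

RAMP_STEPS = (5, 25, 100)  # percent of traffic sent to the DR site


def ramp_traffic(
    set_weight: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
    steps: tuple[int, ...] = RAMP_STEPS,
) -> int:
    """Raise the DR traffic share step by step.

    On a bad error reading, fall back to the last healthy step and stop.
    Returns the final weight in percent. A real loop would also wait for a
    soak period between applying a step and reading the error rate.
    """
    healthy = 0
    for percent in steps:
        set_weight(percent)
        if error_rate() > max_error_rate:
            set_weight(healthy)
            return healthy
        healthy = percent
    return healthy


applied: list[int] = []
readings = iter([0.001, 0.002, 0.001])
final = ramp_traffic(applied.append, lambda: next(readings))
print(applied, final)  # [5, 25, 100] 100
```

Falling back to the last healthy step, rather than to zero, keeps the portion of traffic that was already served correctly on the DR site while the regression is investigated.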
Result
A reliable DR based on cold-backups is: immutable isolated copies, formalized black-start procedures, clear RPO/RTOs, regular exercises, a well-thought-out DNS/network strategy, and key discipline. Commit everything to IaC and runbooks, automate integrity checks and canary restores - and you will always have a controlled path to recovery even after a worst-case scenario.