Disaster Recovery and cold-backups

Brief Summary

DR is the ability to restore business functions after a major disaster. Cold-backups are the "last line of defense": immutable/isolated copies suitable for recovery after a complete site power loss or a compromise. The strategy is built around RTO/RPO, system prioritization, regular DR exercises and strict operational discipline (catalogs, keys, checks).

Terms and objectives

RPO (Recovery Point Objective) - maximum allowable data loss (e.g. ≤ 15 min).
RTO (Recovery Time Objective) - maximum allowable recovery time (e.g. ≤ 2 hours).
Black-start - bare metal recovery: hardware/cluster/secrets/data/DNS.
Air-gap - physical/logical isolation of copies (tape/disabled account/offline media).
Immutability (WORM) - immutable storage (tape/object with Lock/Retention).

DR availability levels

Cold Site - infrastructure is missing/frozen; RTO: hours to days; cheapest CAPEX/OPEX.
Warm Site - templates/images/partially prepared services; RTO: tens of minutes to hours.
Hot Site - active replicas; RTO: minutes; more expensive and more complex.
Hybrid: core → hot/warm, everything else → cold (with priority at startup).

Where cold-backups are indispensable

Large-scale ransomware infection/domain compromise.
Data corruption that has propagated to all replicas.
Loss of region/data center, force majeure (fire, flood).
Intentional deletion/sabotage by privileged accounts.

Cold-backups topology

1. Media/Storage Classes

Tapes (LTO-8/9): low cost, air-gapped by default, high capacity, sequential access.
Offline disks/NAS: kept in a safe, connected only during the backup/restore window.
Archive object classes (Glacier-like): low storage price, longer retrieval time.

2. Placement

Another site/region; another provider/account; separate keys/administrators.

3. Immutability

WORM tapes/Object Lock (Compliance/Governance modes) with retention and Legal Hold.

Policy 3-2-1-1-0 (with focus on cold)

3 copies of data (prod + local backup + offsite).
2 different media (disk/tape/object).
1 offsite (other site/cloud).
1 immutable (WORM/air-gap).
0 verification errors (checksums/periodic test restores).
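
The five rules above can be checked mechanically against a backup catalog. The following is a minimal sketch, assuming a hypothetical `Copy` record per copy of a dataset; the field names and the sample data are illustrative and do not come from any particular backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset (hypothetical catalog record)."""
    name: str
    media: str          # e.g. "disk", "tape", "object"
    offsite: bool       # stored at another site/cloud
    immutable: bool     # WORM / Object Lock / air-gapped
    verify_errors: int  # errors found by the last checksum or test restore


def check_32110(copies: list[Copy]) -> dict[str, bool]:
    """Return a pass/fail flag for each rule of the 3-2-1-1-0 policy."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


# Sample data (assumed): prod + local backup + offsite immutable tape.
copies = [
    Copy("prod", "disk", offsite=False, immutable=False, verify_errors=0),
    Copy("local-backup", "disk", offsite=False, immutable=False, verify_errors=0),
    Copy("cold-tape", "tape", offsite=True, immutable=True, verify_errors=0),
]
result = check_32110(copies)
print(result)
```

A check like this could run on every catalog update, so that a dataset losing its only immutable or offsite copy is flagged before an incident rather than during one.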

Catalogs, Metadata, and Integrity Control

Backup catalog: what, where, when, version, keys, checksums, retention period.
Asset catalog: service → dependencies → volumes/buckets → priority.
Checksums and manifest files: verification at write time and at restore time.
Canary files: regular restore for early detection of media problems.
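
Checksum manifests and canary files can be illustrated with the standard library alone. This is a minimal sketch, assuming a JSON manifest that maps relative paths to SHA-256 digests; the function names, the manifest layout and the demo file are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> dict[str, str]:
    """Record a checksum for every file under root (run at backup time)."""
    entries = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    manifest.write_text(json.dumps(entries, indent=2))
    return entries


def verify_manifest(root: Path, manifest: Path) -> list[str]:
    """Return relative paths that are missing or whose checksum differs."""
    expected = json.loads(manifest.read_text())
    bad = []
    for rel, digest in expected.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != digest:
            bad.append(rel)
    return bad


# Demo with a canary file: verify once untouched, once after corruption.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "backup"
    root.mkdir()
    (root / "canary.txt").write_text("known content")
    manifest = Path(tmp) / "manifest.json"  # kept outside the backup tree
    write_manifest(root, manifest)
    clean = verify_manifest(root, manifest)
    (root / "canary.txt").write_text("corrupted")
    damaged = verify_manifest(root, manifest)
print(clean, damaged)
```

Storing the manifest separately from the data it describes matters here: if both sit on the same medium, damage to that medium can take out the evidence along with the copy.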

Encryption and Keys

Encryption at rest (tape/object) and in flight (copying).
KMS/Vault with dual-control, offline safes for master keys, rotation.
Separate keys for prod/backups/archives (minimizing blast radius).
Documented key access process during DR (requirements, roles, log).

DR Plan Prioritization and Consistency

Priority map (example):

1. Identification and access: IdP (minimum zone), Vault/KMS, network core.

2. Data and control planes: etcd K8s, configs, secrets, image registries, deploy artifacts.

3. Transaction databases/wallet: logs + latest full/incremental.

4. Payment/integration gateways: keys, certificates, IP/DNS.

5. Web/api fronts: canary launch, static content from the object.

6. Analytics/Reporting: after the core has been restored.

Restore sequence (black-start):

1. Infrastructure: network, DNS/Anycast, core IAM, base images/cluster.

2. Secrets/certificates: restore Vault/KMS from cold-backup, distribute bootstrap secrets.

3. Control plane: etcd/Control Plane/registries/repositories.

4. Data: deploy database from cold-backup + PITR from logs (by RPO).

5. Applications: launch in dependency-tree order, warm up caches/CDN.

6. Tests and validation: health tests, consistency, checksums.

7. Traffic switching: DNS/routing/balancers (phased/canary).

8. Post-checks: no leaks/outstanding issues, logging and a DR report.
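
The ordering logic behind the priority map and the restore sequence is a topological sort over the asset catalog's dependency graph. The sketch below uses the standard `graphlib` module (Python 3.9+); the service names and their dependencies are hypothetical and only loosely mirror the steps above.

```python
from graphlib import TopologicalSorter

# service -> services it depends on (hypothetical asset catalog extract)
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "iam": {"network"},
    "vault": {"iam"},
    "etcd": {"vault"},
    "registry": {"vault"},
    "database": {"vault", "etcd"},
    "app": {"database", "registry", "dns"},
}


def restore_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; each wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose deps are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = restore_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Generating the waves from the catalog, rather than maintaining the order by hand, also surfaces circular dependencies (for example, a secrets store that needs a database that needs the secrets store) before a real black-start.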

Cold-restore procedures (typical)

Tapes: inventory, loading, parallel streams, file map → catalogs → restore jobs; account for seek and rewind times.
Archive classes: retrieval request (minutes → hours), staging to hot storage, restore by manifest.
Offline disks: read-only connection, checksum checks → copying.
Practice: an isolated sandbox for restoration, then transfer to the production environment.

Communications and org. structure in DR

Roles: Incident Commander, Tech Lead (Infra), DB Lead, App Lead, Comms, Security.
Channels: backup (outside the corporate domain), voice/chat, SecureDocs.

Message templates: to clients/partners/regulators; update frequency; a single "source of truth."

Unified event log: timeline, decisions, owners.

DNS, Networks and Traffic

Split-brain protection: "DR-mode" flags in the configuration; feature flags for limited functionality.
DNS strategy: low TTL set in advance, independent DNS provider; staged change of A/AAAA/CNAME records, CDN warm-up.
Routing: Anycast/Geo, BGP announcement from the DR site; ACLs/firewalls are rebuilt from IaC.

SLO for DR

RPO met ≥ 99% of the time (log/increment lag within target).
Black-start RTO (full scenario) ≤ target (for example, 4 hours), confirmed in quarterly tests.
Success of DR exercises - 100% of critical tasks are completed in the window.
Immutability - the share of backups with Retention/Lock = 100%.
Integrity checks - 100% as per schedule; media failure → migration ticket.
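
The first two objectives reduce to simple arithmetic over measurements. Below is a minimal sketch, assuming that log/increment lag is sampled periodically and that drill durations are recorded; the function names and the sample numbers are hypothetical.

```python
from datetime import timedelta


def rpo_compliance(lag_samples: list[timedelta], rpo: timedelta) -> float:
    """Share of lag samples (0.0-1.0) that stayed within the RPO target."""
    if not lag_samples:
        raise ValueError("no lag samples to evaluate")
    within = sum(1 for lag in lag_samples if lag <= rpo)
    return within / len(lag_samples)


def rto_met(drill_duration: timedelta, rto: timedelta) -> bool:
    """True when a black-start drill finished within the RTO target."""
    return drill_duration <= rto


# Hypothetical measurements: 995 samples at 5 min lag, 5 samples at 20 min lag.
samples = [timedelta(minutes=5)] * 995 + [timedelta(minutes=20)] * 5
share = rpo_compliance(samples, rpo=timedelta(minutes=15))
print(f"RPO met {share:.1%} of the time, SLO ok: {share >= 0.99}")
print("RTO ok:", rto_met(timedelta(hours=3, minutes=20), rto=timedelta(hours=4)))
```

The same pattern extends to the remaining objectives, for example the share of backups that carry a Retention/Lock setting.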

Tests and exercises

Table-top: scenarios, roles, checklists, contact list.
Technical: selective recovery of databases/files/secrets to the sandbox with verification of checksums and consistency.
Black-start drill: once per quarter (or once every six months) - full core launch at the DR site.
Post-mortem: facts, bottlenecks, improvement plan (SLO/processes/automation).

Automation and Artifacts

IaC: clusters, networks, stacks - in code; DR branches/parameters.
Runbooks: component by component (Vault/KMS, etcd, DB, gateways, fronts).
DR package: offline copy of key documents (contacts, diagrams, passwords/passphrases for safes), physical access instructions.
Canary-restore: daily small restore and checksum reconciliation.
Tags/labels: "DR-critical," "Warm-only," "Cold-only" for services/volumes.

Implementation checklist

Data classes and their RPOs/RTOs are aligned with the business; recovery priorities are defined.
Implemented cold-backups: media, immutability (WORM/Object Lock), offsite/air-gap.
Catalogs: assets, backups, keys; checksums and version control.
Black-start procedures: networks/DNS, IdP/Vault/KMS, control plane, data, application layer.
Exercises: table-top quarterly; canary restores daily; black-start once every three to six months.
Communications and regulatory templates; separate communication channels.
SLO/metrics/alerts for DR; reports to management.
Agreements with providers (tapes/archive classes/DNS/CDN), SLA confirmed.
Finance: media/archive budget, logistics, scheduled media replacement.

Common errors

"There is a replica, so no backup is needed" → a logical error or ransomware propagates to every replica.
There is no immutability/air-gap → a single vector for compromising all copies.
No catalogs/checksums → "something" gets restored, but not the right thing.
DNS TTL is too high → multi-day traffic migration.
Keys/KMS in the same domain/account → access to them is blocked during an incident.
Exercises only "on paper" → RTO/RPO are not confirmed.

iGaming/fintech specific

Wallet/payment core: strict RPO (≤ 1-5 minutes) and RTO (≤ 15-60 minutes); logs to object storage with WORM; DR function "read-only balance" for transparent communication.
PSP/content providers: pre-agreed DR IPs/domains, whitelists, certificates, HMAC/mTLS keys - copies in the DR package.
Reporting/regulators: notification templates, immutable archives, provable integrity, activity log.
Peaks and events: DR readiness is checked before major tournaments/promotions; canary restore and CDN warming.

Mini Runbook Templates

1) Vault/KMS black-start (concept):

1. Initializing the DR cluster, loading unseal (dual-control) keys.

2. Restore storage backup (cold-copy).

3. Checking policies, issuing bootstrap secrets for CI/CD/K8s.

2) PostgreSQL DR (PITR from cold-backup):

1. Deploy an empty instance, restore the full backup from cold storage.

2. Replay WAL logs (increments) up to the target point in time.

3. Consistency check, enable replication, open read-only, then read-write.
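
For step 2, PostgreSQL 12 and later read the recovery target from ordinary configuration settings and enter targeted recovery when a `recovery.signal` file exists in the data directory. The sketch below only renders those settings as text; the helper name and the WAL archive path are assumptions, and the `cp`-based `restore_command` is the simple form for an archive that is already mounted locally.

```python
from datetime import datetime, timezone


def render_recovery_settings(wal_archive_dir: str, target_time: datetime) -> str:
    """Render PITR settings to append to postgresql.conf (PostgreSQL 12+)."""
    if target_time.tzinfo is None:
        raise ValueError("target_time must be timezone-aware")
    stamp = target_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    lines = [
        # %f is the WAL file name, %p the path PostgreSQL wants it copied to.
        f"restore_command = 'cp {wal_archive_dir}/%f \"%p\"'",
        f"recovery_target_time = '{stamp}'",
        # 'pause' stops replay at the target and keeps the server read-only
        # (with hot_standby on) so consistency can be checked first.
        "recovery_target_action = 'pause'",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical archive path and target moment.
settings = render_recovery_settings(
    "/mnt/wal-archive", datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
)
print(settings)
```

With `recovery_target_action = 'pause'`, the instance stays read-only at the target point, which matches the "read-only, then read-write" order in step 3; calling `pg_wal_replay_resume()` afterwards ends recovery and opens the database for writes.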

3) DNS/traffic:

1. Reduce TTL 24-72 hours before planned risk events (or keep it low permanently).

2. Switching A/AAAA/CNAME by checklist, error/latency monitoring.

3. Gradual traffic growth (canary 5% → 25% → 100%).
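
The canary ramp in step 3 can be expressed as a small decision function driven by the monitoring from step 2. This is a minimal sketch: the 1% error threshold and the choice to roll back to 0% on breach are assumptions, not part of the runbook itself.

```python
def next_traffic_share(
    current: int,
    error_rate: float,
    max_error_rate: float = 0.01,
    steps: tuple[int, ...] = (5, 25, 100),
) -> int:
    """Return the next percentage of traffic to send to the DR site.

    Rolls back to 0 when the observed error rate exceeds the threshold
    (an assumed policy); otherwise advances to the next canary step.
    """
    if error_rate > max_error_rate:
        return 0
    for step in steps:
        if step > current:
            return step
    return steps[-1]  # already at the final step, stay there


share = 0
history = []
for observed_error_rate in (0.0, 0.002, 0.001):  # hypothetical readings
    share = next_traffic_share(share, observed_error_rate)
    history.append(share)
print(history)
```

In practice each step would also wait for a soak period and check latency as well as errors before advancing.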

Result

A reliable DR based on cold-backups is: immutable isolated copies, formalized black-start procedures, clear RPO/RTOs, regular exercises, a well-thought-out DNS/network strategy, and key discipline. Commit everything to IaC and runbooks, automate integrity checks and canary restores - and you will always have a controlled path to recovery even after a worst-case scenario.