Backups and Disaster Recovery
1) Definitions and objectives
Backup - a consistent copy of data/configuration made for later recovery (from accidental deletions, bugs, cryptolockers, disasters).
DR (Disaster Recovery) - the process of restoring infrastructure/services to working SLOs after a major accident (fire, loss of region, massive compromise).
RPO (Recovery Point Objective) - maximum allowable data loss in time (for example, 15 minutes).
RTO (Recovery Time Objective) - service recovery time target (for example, 30 minutes).
Key principle: replication ≠ backup. Replication propagates errors and ransomware encryption to all copies almost instantly. A backup is an isolated, verified, potentially immutable copy.
2) Data classification and criticality levels
Divide assets into classes:
- Tier-0 (vital): transactional databases, payments, ledger/accounting, secrets/PKI.
- Tier-1 (critical): service configs, queues, CI/CD artifacts, container registries.
- Tier-2 (important): analytics, reports, secondary indexes, log archives.
- Tier-3 (auxiliary): caches, time data (can be restored by reconstruction).
For each class, define the RPO/RTO, retention period, immutability requirements, and location.
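The class-to-targets mapping can be captured as data, so alerting and catalog validation read a single source of truth. A minimal Python sketch; the tier numbers follow the list above, while the concrete RPO/RTO/retention values are illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierPolicy:
    rpo_minutes: int       # maximum tolerated data loss
    rto_minutes: int       # target time to restore service
    retention_days: int    # how long copies are kept
    immutable: bool        # WORM/Object Lock required?
    offsite: bool          # copy in another region/cloud required?

# Hypothetical targets -- real numbers come from business requirements.
POLICIES = {
    0: TierPolicy(rpo_minutes=15,   rto_minutes=30,   retention_days=90, immutable=True,  offsite=True),
    1: TierPolicy(rpo_minutes=60,   rto_minutes=240,  retention_days=35, immutable=True,  offsite=True),
    2: TierPolicy(rpo_minutes=1440, rto_minutes=1440, retention_days=14, immutable=False, offsite=True),
    3: TierPolicy(rpo_minutes=0,    rto_minutes=0,    retention_days=0,  immutable=False, offsite=False),  # rebuildable
}

def policy_for(tier: int) -> TierPolicy:
    return POLICIES[tier]
```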
3) Retention Strategies: Rule 3-2-1-1-0
3 copies of data (prod + 2 backups).
2 different media/storage types.
1 offsite copy (different region/cloud).
1 immutable/air-gap (WORM/Object Lock/Tape).
0 errors in recovery checks (regular tests).
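The rule lends itself to an automated audit over a backup inventory. A hedged Python sketch; the inventory schema (`media`, `location`, `immutable`) is hypothetical:

```python
def satisfies_3_2_1_1_0(copies, last_restore_test_ok: bool) -> bool:
    """Check a list of backup copies against the 3-2-1-1-0 rule.

    Each copy is a dict like:
      {"media": "s3", "location": "eu-west-1", "immutable": True}
    The production data itself counts as one of the 3 copies.
    """
    total = 1 + len(copies)                                   # prod + backups >= 3
    media_types = {c["media"] for c in copies}                # >= 2 media types
    offsite = any(c["location"] != "primary" for c in copies) # >= 1 offsite copy
    immutable = any(c["immutable"] for c in copies)           # >= 1 immutable copy
    return (total >= 3 and len(media_types) >= 2
            and offsite and immutable
            and last_restore_test_ok)                         # 0 errors in tests
```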
4) Types of backups
Full - a complete copy. Slow and expensive, but the base for every strategy.
Incremental - the difference from the most recent backup of any type. Smallest in volume.
Differential - the difference from the last full. Faster recovery, more space.
Snapshot - a point-in-time image of a volume/disk (EBS/ZFS/LVM). App-consistent snapshots (quiesce) are required.
PITR (Point-in-Time Recovery) - basic backup + logs (WAL/binlog) for rollback to exact time/LSN.
Object/file/image-based - for specific data types (VM images, S3 objects, DB dumps).
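The difference between incremental and differential shows up at restore time in the length of the chain you must replay. A small Python sketch, assuming incrementals are taken against the immediately preceding backup of any kind:

```python
def restore_chain(backups, target):
    """backups: chronological list of (name, kind), kind in {"full", "diff", "incr"}.
    Return the backups to apply, oldest first, to restore `target`."""
    idx = next(i for i, (name, _) in enumerate(backups) if name == target)
    chain = []
    i = idx
    while i >= 0:
        name, kind = backups[i]
        chain.append(name)
        if kind == "full":
            break                       # a full backup stands alone
        if kind == "diff":
            # a differential depends only on the most recent full before it
            i = max(j for j in range(i) if backups[j][1] == "full")
        else:
            i -= 1                      # incremental: needs the previous backup
    return list(reversed(chain))
```

Note how the differential "shortcuts" the chain back to the last full, which is exactly why it restores faster than a long run of incrementals.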
5) Consistency of backups
Crash-consistent: as after a sudden shutdown - suitable for stateless/journaled FS.
App-consistent: the application "freezes" operations (fsfreeze/pre-post scripts) → guaranteed integrity.
Database consistency: the backup tool's API (pgBackRest, XtraBackup), hot-backup modes, checkpoint handling.
6) Encryption, keys and access
At-rest and in-transit encryption for all copies.
Keys in KMS/HSM, rotation by policy (90/180 days), separate keys by environment.
Separation of duties: who creates/removes backups ≠ who can decrypt/read them.
Do not keep decryption keys in the same trust domain as the target copies.
7) Unmodifiable copies and ransomware protection
Object Lock/WORM (Compliance/Governance) with retention and Legal Hold.
Air-gap: isolated/offline storage (tape, an offline cloud/account).
"Delayed activation" deletion policies, MFA-Delete, separate account for backup-buckets, prohibition of public access.
Scan for malware/indicators of compromise before mounting.
8) Frequency, schedule and retention
GFS (Grandfather-Father-Son): daily increments, weekly full/diff, monthly full with long storage.
RPO dictates the frequency of increments and WAL/binlog archiving (for example, every 5-15 minutes).
Retention: critical - ≥ 35-90 days + monthly copies for 12-36 months (legal requirements).
Seasonal peaks warrant separate checkpoints (before promotions/releases).
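A GFS schedule is easy to express as a retention function over backup dates; a cleanup job keeps exactly what the function returns. A Python sketch with illustrative parameters (7 daily "sons", 4 weekly "fathers", 12 monthly "grandfathers"):

```python
from datetime import date, timedelta

def gfs_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """GFS retention sketch: keep the last `daily` daily backups (sons),
    the last `weekly` Sunday backups (fathers), and the last `monthly`
    first-of-month backups (grandfathers). Parameters are illustrative."""
    ordered = sorted(backup_dates, reverse=True)                 # newest first
    keep = set(ordered[:daily])
    keep.update([d for d in ordered if d.weekday() == 6][:weekly])  # Sundays
    keep.update([d for d in ordered if d.day == 1][:monthly])       # 1st of month
    return keep
```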
9) DR models and scenarios
Active-Active: both regions serve traffic. Minimal RTO; data reconciliation requires a strict conflict-resolution policy.
Active-Passive (hot/warm): hot - deployed and synchronized (RTO in minutes), warm - partially ready (RTO in hours).
Cold: store copies and Terraform/Ansible/images, stand up on demand (RTO a day or more).
DRaaS: provider orchestration of VMs/networks/addresses in another zone.
10) Failover orchestration and recovery priorities
Startup priority: network/VPN/DNS → secrets/KMS → databases/clusters → queues/cache → applications → perimeter/CDN → analytics.
Automation: scripts/runbook actions, Terraform/Ansible/Helm/ArgoCD profiles for DR environment.
Data: DB PITR → reindexing/replicas → cache warm-up → start services with schema-compatibility flags.
DNS/GSLB: lower TTLs in advance, switchover scenarios with validation.
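The priority chain can be encoded as ordered waves so a runbook executor brings services up in a deterministic order. A Python sketch (service names and the readiness callback are placeholders, not a real orchestrator API):

```python
# Waves mirror the priority chain above; names are placeholders.
STARTUP_WAVES = [
    ["network", "vpn", "dns"],
    ["secrets", "kms"],
    ["databases"],
    ["queues", "cache"],
    ["applications"],
    ["perimeter", "cdn"],
    ["analytics"],
]

def startup_order(is_healthy):
    """Flatten the waves into a start order, refusing to advance past an
    unhealthy service; `is_healthy` stands in for a real readiness probe."""
    order = []
    for wave in STARTUP_WAVES:
        for svc in wave:
            if not is_healthy(svc):
                raise RuntimeError(f"halt: {svc} failed readiness check")
            order.append(svc)
    return order
```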
11) Backup verification tests
Restore tests on a schedule: sampling N% of backups, sandbox deployment, automatic schema/invariant checks.
Full DR-drill (game-day): disabling region/AZ, checking RTO/RPO on live traffic (or traffic shadows).
Integrity tests: hash catalogs, checksums, test-reading every layer (full backup + incremental chain).
Document the report: time, steps, anomalies, gap from targets, corrective actions.
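The integrity checks can be automated by recording a hash manifest at backup time and replaying it before any restore. A stdlib-only Python sketch:

```python
import hashlib
import pathlib

def sha256_file(path):
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup_dir(root, manifest):
    """Compare files under `root` against a {relative_path: sha256}
    manifest recorded at backup time; return the list of mismatches."""
    mismatches = []
    for rel, expected in manifest.items():
        p = pathlib.Path(root) / rel
        if not p.is_file() or sha256_file(p) != expected:
            mismatches.append(rel)
    return mismatches
```

In practice the manifest itself should live in the backup catalog, not next to the data it protects.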
12) Practice for core technologies
Databases
PostgreSQL: base backup + WAL archive (PITR); tools: pgBackRest/Barman; replication slots, monitoring `lsn`.
MySQL/MariaDB: Percona XtraBackup/Enterprise Backup, binlog archiving.
MongoDB: `mongodump` for logical copies + snapshots for large data sets; oplog for PITR.
Redis: RDB/AOF when critical (if Redis is not just a cache); more often, logical reconstruction from the source plus snapshots for accidents.
Kafka/Pulsar: metadata backup (ZK/Kraft/BookKeeper), disk snapshots, topic/log mirroring.
Kubernetes
etcd snapshot + Velero for resources/volumes (CSI snapshots).
Backup secrets/PKI separately (Vault snapshot).
Separate registry of images: immutable tags.
VMs and File Systems
ZFS: `zfs snapshot` + incremental `zfs send | zstd` piped to `zfs recv` on the receiver; verify with `scrub`.
LVM/EBS snapshots with pre/post scripts (app-consistent).
Object stores - versioning + Object Lock.
13) Cataloging and version control of backups
Catalog (metadata): what, where, when, with which tool, hashes, KMS key, owner, retention period.
Labels/tags: `env=prod|stage`, `system=db|k8s|vm`, `tier=0|1|2`, `retention=35d|1y`.
Gold checkpoints: before migrations/DDL/large-scale releases.
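One catalog record per backup is enough to answer "what, where, when, with what, which hash, how long" during an incident. A Python sketch; the field names are illustrative, not a standard schema:

```python
import hashlib
from datetime import datetime, timezone

def catalog_entry(system, tier, env, tool, location, payload, retention):
    """Build one backup-catalog record (illustrative schema)."""
    return {
        "system": system,          # e.g. "db", "k8s", "vm"
        "tier": tier,              # 0..3 criticality class
        "env": env,                # "prod" | "stage"
        "tool": tool,              # e.g. "pgbackrest"
        "location": location,      # where the copy lives
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # content hash for verification
        "retention": retention,    # e.g. "35d", "1y"
    }
```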
14) Observability and metrics
Job success rate: % successful/failed, failure reasons.
Backup/restore time, window width.
RPO-actual: p95 of log (WAL/binlog) archiving lag.
Integrity: proportion of chains tested, hash reconciliation errors.
Cost: storage capacity by class, deduplication/compression ratio.
DR-readiness: frequency and result of exercises (pass/fail).
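The RPO-actual metric is just a percentile over archiving-lag samples, so a monitoring job can compute it directly. A Python sketch (the 900-second default mirrors the 15-minute example earlier):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of lag samples (seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]

def rpo_actual_ok(archive_lag_s, rpo_target_s=900):
    """RPO-actual check: p95 of WAL/binlog archiving lag vs. the target."""
    return p95(archive_lag_s) <= rpo_target_s
```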
15) Access and compliance policies
Separate accounts/projects for backup storage; least-privilege access (deletion/encryption must not be possible from production accounts).
Access/change logs (audit trail), alerts on mass deletions or retention changes.
Compliance: GDPR (right to delete vs archives), PCI DSS (encryption, keys, segmentation), local regulators.
16) Anti-patterns
"There is a replica, which means you don't need a backup."
No immutable/air-gap: one error/malware erases everything.
Backups in the same account/region as prod.
Recovery is never tested (a backup is "dead until verified").
No cataloging and version control → chaos in an accident.
Shared encryption keys for all environments.
Snapshots without app-consistent mode for databases.
The backup window intersects with peaks (affects p99 and SLO).
17) Implementation checklist (0-60 days)
0-10 days
Inventory of systems/data, criticality classes.
Set RPO/RTO targets by class.
Enable full + incremental for Tier-0/1, WAL/binlog archive.
Store backups in a separate region/account + enable KMS encryption.
11-30 days
Configure immutable (Object Lock/WORM) for critical copies.
Introduce cataloging, tags, reporting; alerts on failures and log (WAL/binlog) archiving lag.
First DR-drill: restore a separate service from a backup in an isolated environment.
31-60 days
Automate runbooks: Terraform/Ansible/Helm DR profiles.
Regular restore tests (weekly/monthly) + a quarterly full DR scenario.
Optimize cost: deduplication, compression, storage lifecycles.
18) Maturity metrics
Restore tests: ≥ 1/week for Tier-0 (selective), ≥ 1/month - full scenario.
Immutable coverage for Tier-0/1 = 100%.
RPO-actual p95 ≤ target (e.g. ≤ 15 min).
RTO-actual on DR-exercises ≤ target (e.g. ≤ 30 min).
Directory completeness = 100% (each backup is described and checked).
Incident-to-restore: time from detection to the start of recovery.
19) Examples (snippets)
PostgreSQL - PITR policy (idea):

```bash
# base backup once a day
pgbackrest --stanza=prod --type=full backup
# archive WAL every 5 minutes (invoked by PostgreSQL via archive_command)
pgbackrest --stanza=prod archive-push %p
# restore to a point in time
pgbackrest --stanza=prod restore --type=time --target="2025-11-03 14:00:00+02"
```
MySQL - incremental loop:

```bash
# full backup
xtrabackup --backup --target-dir=/backup/full-2025-11-01
# incremental on top of the full
xtrabackup --backup --incremental-basedir=/backup/full-2025-11-01 --target-dir=/backup/inc-2025-11-02
# prepare: apply the full (redo only), then merge the incremental
xtrabackup --prepare --apply-log-only --target-dir=/backup/full-2025-11-01
xtrabackup --prepare --target-dir=/backup/full-2025-11-01 --incremental-dir=/backup/inc-2025-11-02
```
Kubernetes - Velero (manifest sketch):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: prod-daily
spec:
  includedNamespaces: ["prod-*"]
  ttl: 720h
  storageLocation: s3-immutable
```
S3 Object Lock (sample lifecycle policy):

```json
{
  "Rules": [{
    "ID": "prod-immutable",
    "Status": "Enabled",
    "NoncurrentVersionExpiration": { "NoncurrentDays": 365 }
  }]
}
```
20) Communications and operational roles
Incident Commander, Comms Lead, Ops Lead, DB Lead, Security.
Message templates for stakeholders/regulators/users.
Post-mortem with actions: where they lost minutes, where to improve automation.
21) Conclusion
A reliable backup and DR loop is not a one-off "make a copy" but a cycle: classification → RPO/RTO goals → multi-tier and immutable copies → automated runbooks → regular restores and exercises. Adhere to 3-2-1-1-0, separate replication from backups, encrypt and isolate keys, document and verify. Then even a "black swan" becomes a manageable process with predictable downtime and minimal data loss.