Backups and Disaster Recovery
1) Definitions and objectives
Backup - a consistent copy of data/configuration made for later recovery (from accidental deletions, bugs, cryptolockers, disasters).
DR (Disaster Recovery) - the process of restoring infrastructure/services to working SLOs after a major accident (fire, loss of region, massive compromise).
RPO (Recovery Point Objective) - maximum allowable data loss in time (for example, 15 minutes).
RTO (Recovery Time Objective) - service recovery time target (for example, 30 minutes).
Key principle: replication ≠ backup. Replication propagates errors and ransomware encryption to all copies almost instantly. A backup is an isolated, verified, potentially immutable copy.
2) Data classification and criticality levels
Divide assets into classes:
- Tier-0 (vital): transactional databases, payments, ledger/accounting, secrets/PKI.
- Tier-1 (critical): service configs, queues, CI/CD artifacts, container registries.
- Tier-2 (important): analytics, reports, secondary indexes, log archives.
- Tier-3 (auxiliary): caches, time data (can be restored by reconstruction).
For each class, define the RPO/RTO, retention period, immutability requirements, and location.
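The class-to-targets mapping can be captured as data, so alerting and catalog validation read a single source of truth. A minimal Python sketch; the tier numbers follow the list above, while the concrete RPO/RTO/retention values are illustrative assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierPolicy:
    rpo_minutes: int       # maximum tolerated data loss
    rto_minutes: int       # target time to restore service
    retention_days: int    # how long copies are kept
    immutable: bool        # WORM/Object Lock required?
    offsite: bool          # copy in another region/cloud required?

# Hypothetical targets -- real numbers come from business requirements.
POLICIES = {
    0: TierPolicy(rpo_minutes=15,   rto_minutes=30,   retention_days=90, immutable=True,  offsite=True),
    1: TierPolicy(rpo_minutes=60,   rto_minutes=240,  retention_days=35, immutable=True,  offsite=True),
    2: TierPolicy(rpo_minutes=1440, rto_minutes=1440, retention_days=14, immutable=False, offsite=True),
    3: TierPolicy(rpo_minutes=0,    rto_minutes=0,    retention_days=0,  immutable=False, offsite=False),  # rebuildable
}

def policy_for(tier: int) -> TierPolicy:
    return POLICIES[tier]
```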
3) Retention Strategies: Rule 3-2-1-1-0
3 copies of data (prod + 2 backups).
2 different media/storage types.
1 offsite copy (different region/cloud).
1 immutable/air-gap (WORM/Object Lock/Tape).
0 errors in recovery checks (regular tests).
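The rule lends itself to an automated audit over a backup inventory. A hedged Python sketch; the inventory schema (`media`, `location`, `immutable`) is hypothetical:

```python
def satisfies_3_2_1_1_0(copies, last_restore_test_ok: bool) -> bool:
    """Check a list of backup copies against the 3-2-1-1-0 rule.

    Each copy is a dict like:
      {"media": "s3", "location": "eu-west-1", "immutable": True}
    The production data itself counts as one of the 3 copies.
    """
    total = 1 + len(copies)                                   # prod + backups >= 3
    media_types = {c["media"] for c in copies}                # >= 2 media types
    offsite = any(c["location"] != "primary" for c in copies) # >= 1 offsite copy
    immutable = any(c["immutable"] for c in copies)           # >= 1 immutable copy
    return (total >= 3 and len(media_types) >= 2
            and offsite and immutable
            and last_restore_test_ok)                         # 0 errors in tests
```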
4) Types of backups
Full - a complete copy. Slow and expensive, but the base for every strategy.
Incremental - the difference from the most recent backup of any type. Smallest in volume.
Differential - the difference from the last full. Faster recovery, more space.
Snapshot - a point-in-time image of a volume/disk (EBS/ZFS/LVM). App-consistent snapshots (quiesce) are required.
PITR (Point-in-Time Recovery) - basic backup + logs (WAL/binlog) for rollback to exact time/LSN.
Object/file/image-based - for specific data types (VM images, S3 objects, DB dumps).
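The difference between incremental and differential shows up at restore time in the length of the chain you must replay. A small Python sketch, assuming incrementals are taken against the immediately preceding backup of any kind:

```python
def restore_chain(backups, target):
    """backups: chronological list of (name, kind), kind in {"full", "diff", "incr"}.
    Return the backups to apply, oldest first, to restore `target`."""
    idx = next(i for i, (name, _) in enumerate(backups) if name == target)
    chain = []
    i = idx
    while i >= 0:
        name, kind = backups[i]
        chain.append(name)
        if kind == "full":
            break                       # a full backup stands alone
        if kind == "diff":
            # a differential depends only on the most recent full before it
            i = max(j for j in range(i) if backups[j][1] == "full")
        else:
            i -= 1                      # incremental: needs the previous backup
    return list(reversed(chain))
```

Note how the differential "shortcuts" the chain back to the last full, which is exactly why it restores faster than a long run of incrementals.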
5) Consistency of backups
Crash-consistent: as after a sudden shutdown - suitable for stateless/journaled FS.
App-consistent: the application "freezes" operations (fsfreeze/pre-post scripts) → guaranteed integrity.
Database consistency: the backup tool's API (pgBackRest, XtraBackup), hot-backup modes, checkpoint handling.
6) Encryption, keys and access
At-rest and in-transit encryption for all copies.
Keys in KMS/HSM, rotation by policy (90/180 days), separate keys by environment.
Separation of duties: who creates/removes backups ≠ who can decrypt/read them.
Do not keep decryption keys in the same trust domain as the target copies.
7) Unmodifiable copies and ransomware protection
Object Lock/WORM (Compliance/Governance) with retention and Legal Hold.
Air-gap: isolated/offline storage (tape, an offline cloud/account).
"Delayed activation" deletion policies, MFA-Delete, separate account for backup-buckets, prohibition of public access.
Scan for malware/indicators of compromise before mounting.
8) Frequency, schedule and retention
GFS (Grandfather-Father-Son): daily increments, weekly full/diff, monthly full with long storage.
RPO dictates the frequency of increments and WAL/binlog archiving (for example, every 5-15 minutes).
Retention: critical - ≥ 35-90 days + monthly copies for 12-36 months (legal requirements).
Seasonal peaks warrant separate checkpoints (before promotions/releases).
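A GFS schedule is easy to express as a retention function over backup dates; a cleanup job keeps exactly what the function returns. A Python sketch with illustrative parameters (7 daily "sons", 4 weekly "fathers", 12 monthly "grandfathers"):

```python
from datetime import date, timedelta

def gfs_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """GFS retention sketch: keep the last `daily` daily backups (sons),
    the last `weekly` Sunday backups (fathers), and the last `monthly`
    first-of-month backups (grandfathers). Parameters are illustrative."""
    ordered = sorted(backup_dates, reverse=True)                 # newest first
    keep = set(ordered[:daily])
    keep.update([d for d in ordered if d.weekday() == 6][:weekly])  # Sundays
    keep.update([d for d in ordered if d.day == 1][:monthly])       # 1st of month
    return keep
```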
9) DR models and scenarios
Active-Active: both regions serve traffic. Minimal RTO; data reconciliation requires a strict conflict-resolution policy.
Active-Passive (hot/warm): hot - deployed and synchronized (RTO in minutes), warm - partially ready (RTO in hours).
Cold: store copies and Terraform/Ansible/images, stand up on demand (RTO a day or more).
DRaaS: provider orchestration of VMs/networks/addresses in another zone.
10) Failover orchestration and recovery priorities
Startup priority: network/VPN/DNS → secrets/KMS → databases/clusters → queues/cache → applications → perimeter/CDN → analytics.
Automation: scripts/runbook actions, Terraform/Ansible/Helm/ArgoCD profiles for DR environment.
Data: DB PITR → reindexing/replicas → cache warm-up → start services with schema-compatibility flags.
DNS/GSLB: lower TTLs in advance, switchover scenarios with validation.
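The priority chain can be encoded as ordered waves so a runbook executor brings services up in a deterministic order. A Python sketch (service names and the readiness callback are placeholders, not a real orchestrator API):

```python
# Waves mirror the priority chain above; names are placeholders.
STARTUP_WAVES = [
    ["network", "vpn", "dns"],
    ["secrets", "kms"],
    ["databases"],
    ["queues", "cache"],
    ["applications"],
    ["perimeter", "cdn"],
    ["analytics"],
]

def startup_order(is_healthy):
    """Flatten the waves into a start order, refusing to advance past an
    unhealthy service; `is_healthy` stands in for a real readiness probe."""
    order = []
    for wave in STARTUP_WAVES:
        for svc in wave:
            if not is_healthy(svc):
                raise RuntimeError(f"halt: {svc} failed readiness check")
            order.append(svc)
    return order
```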
11) Backup verification tests
Restore tests on a schedule: sampling N% of backups, sandbox deployment, automatic schema/invariant checks.
Full DR-drill (game-day): disabling region/AZ, checking RTO/RPO on live traffic (or traffic shadows).
Integrity tests: hash catalogs, checksums, test-reading every layer (full backup + incremental chain).
Document the report: time, steps, anomalies, gap from targets, corrective actions.
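The integrity checks can be automated by recording a hash manifest at backup time and replaying it before any restore. A stdlib-only Python sketch:

```python
import hashlib
import pathlib

def sha256_file(path):
    """Stream a file through SHA-256 without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup_dir(root, manifest):
    """Compare files under `root` against a {relative_path: sha256}
    manifest recorded at backup time; return the list of mismatches."""
    mismatches = []
    for rel, expected in manifest.items():
        p = pathlib.Path(root) / rel
        if not p.is_file() or sha256_file(p) != expected:
            mismatches.append(rel)
    return mismatches
```

In practice the manifest itself should live in the backup catalog, not next to the data it protects.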
12) Practice for core technologies
Databases
PostgreSQL: base backup + WAL archive (PITR); tools: pgBackRest/Barman; replication slots, monitoring `lsn`.
MySQL/MariaDB: Percona XtraBackup/Enterprise Backup, binlog archiving.
MongoDB: `mongodump` for logical copies + snapshots for large data sets; oplog for PITR.
Redis: RDB/AOF when critical (if Redis is not just a cache); more often, logical reconstruction from the source plus snapshots for accidents.
Kafka/Pulsar: metadata backup (ZK/Kraft/BookKeeper), disk snapshots, topic/log mirroring.
Kubernetes
etcd snapshot + Velero for resources/volumes (CSI snapshots).
Backup secrets/PKI separately (Vault snapshot).
Separate registry of images: immutable tags.
VMs and File Systems
ZFS: `zfs snapshot` + incremental `zfs send | zstd` piped to `zfs recv` on the receiver; verify with `scrub`.
LVM/EBS snapshots with pre/post scripts (app-consistent).
Object stores - versioning + Object Lock.
13) Cataloging and version control of backups
Catalog (metadata): what, where, when, with which tool, hashes, KMS key, owner, retention period.
Labels/tags: `env=prod|stage`, `system=db|k8s|vm`, `tier=0|1|2`, `retention=35d|1y`.
Gold checkpoints: before migrations/DDL/large-scale releases.
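One catalog record per backup is enough to answer "what, where, when, with what, which hash, how long" during an incident. A Python sketch; the field names are illustrative, not a standard schema:

```python
import hashlib
from datetime import datetime, timezone

def catalog_entry(system, tier, env, tool, location, payload, retention):
    """Build one backup-catalog record (illustrative schema)."""
    return {
        "system": system,          # e.g. "db", "k8s", "vm"
        "tier": tier,              # 0..3 criticality class
        "env": env,                # "prod" | "stage"
        "tool": tool,              # e.g. "pgbackrest"
        "location": location,      # where the copy lives
        "created_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # content hash for verification
        "retention": retention,    # e.g. "35d", "1y"
    }
```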
14) Observability and metrics
Job success rate: % successful/failed, failure reasons.
Backup/restore time, window width.
RPO-actual: p95 of log (WAL/binlog) archiving lag.
Integrity: proportion of chains tested, hash reconciliation errors.
Cost: storage capacity by class, deduplication/compression ratio.
DR-readiness: frequency and result of exercises (pass/fail).
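The RPO-actual metric is just a percentile over archiving-lag samples, so a monitoring job can compute it directly. A Python sketch (the 900-second default mirrors the 15-minute example earlier):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of lag samples (seconds)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]

def rpo_actual_ok(archive_lag_s, rpo_target_s=900):
    """RPO-actual check: p95 of WAL/binlog archiving lag vs. the target."""
    return p95(archive_lag_s) <= rpo_target_s
```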
15) Access and compliance policies
Separate accounts/projects for backup storage; least-privilege access (deletion/encryption must not be possible from production accounts).
Access/change logs (audit trail), alerts on mass deletions or retention changes.
Compliance: GDPR (right to delete vs archives), PCI DSS (encryption, keys, segmentation), local regulators.
16) Anti-patterns
"There is a replica, which means you don't need a backup."
No immutable/air-gap: one error/malware erases everything.
Backups in the same account/region as prod.
Recovery is never tested (a backup is "dead until verified").
No cataloging and version control → chaos in an accident.
Shared encryption keys for all environments.
Snapshots without app-consistent mode for databases.
The backup window intersects with peaks (affects p99 and SLO).
17) Implementation checklist (0-60 days)
0-10 days
Inventory of systems/data, criticality classes.
Set RPO/RTO targets by class.
Enable full + incremental for Tier-0/1, WAL/binlog archive.
Store backups in a separate region/account + enable KMS encryption.
11-30 days
Configure immutable (Object Lock/WORM) for critical copies.
Introduce cataloging, tags, reporting; alerts on failures and log (WAL/binlog) archiving lag.
First DR-drill: restore a separate service from a backup in an isolated environment.
31-60 days
Automate runbooks: Terraform/Ansible/Helm DR profiles.
Regular restore tests (weekly/monthly) + a quarterly full DR scenario.
Optimize cost: deduplication, compression, storage lifecycles.
18) Maturity metrics
Restore tests: ≥ 1/week for Tier-0 (selective), ≥ 1/month - full scenario.
Immutable coverage for Tier-0/1 = 100%.
RPO-actual p95 ≤ target (e.g. ≤ 15 min).
RTO-actual on DR-exercises ≤ target (e.g. ≤ 30 min).
Directory completeness = 100% (each backup is described and checked).
Incident-to-restore: time from detection to the start of recovery.
19) Examples (snippets)
PostgreSQL - PITR policy (idea):

```bash
# base backup once a day
pgbackrest --stanza=prod --type=full backup
# archive WAL every 5 minutes (invoked by PostgreSQL via archive_command)
pgbackrest --stanza=prod archive-push %p
# restore to a point in time
pgbackrest --stanza=prod restore --type=time --target="2025-11-03 14:00:00+02"
```
MySQL - incremental loop:

```bash
# full backup
xtrabackup --backup --target-dir=/backup/full-2025-11-01
# incremental on top of the full
xtrabackup --backup --incremental-basedir=/backup/full-2025-11-01 --target-dir=/backup/inc-2025-11-02
# prepare: apply the full (redo only), then merge the incremental
xtrabackup --prepare --apply-log-only --target-dir=/backup/full-2025-11-01
xtrabackup --prepare --target-dir=/backup/full-2025-11-01 --incremental-dir=/backup/inc-2025-11-02
```
Kubernetes - Velero (manifest sketch):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: prod-daily
spec:
  includedNamespaces: ["prod-*"]
  ttl: 720h
  storageLocation: s3-immutable
```
S3 Object Lock (sample lifecycle policy):

```json
{
  "Rules": [{
    "ID": "prod-immutable",
    "Status": "Enabled",
    "NoncurrentVersionExpiration": { "NoncurrentDays": 365 }
  }]
}
```
20) Communications and operational roles
Incident Commander, Comms Lead, Ops Lead, DB Lead, Security.
Message templates for stakeholders/regulators/users.
Post-mortem with actions: where they lost minutes, where to improve automation.
21) Conclusion
A reliable backup and DR loop is not a one-off "make a copy" but a cycle: classification → RPO/RTO goals → multi-tier and immutable copies → automated runbooks → regular restores and exercises. Adhere to 3-2-1-1-0, separate replication from backups, encrypt and isolate keys, document and verify. Then even a "black swan" becomes a manageable process with predictable downtime and minimal data loss.