Backups and disaster recovery

1) Definitions and objectives

Backup - a consistent copy of data/configurations kept for later recovery (from accidental deletion, bugs, ransomware, disasters).
DR (Disaster Recovery) - the process of restoring infrastructure/services to their working SLOs after a major incident (fire, loss of a region, large-scale compromise).
RPO (Recovery Point Objective) - maximum allowable data loss in time (for example, 15 minutes).
RTO (Recovery Time Objective) - service recovery time target (for example, 30 minutes).
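
As a rough illustration of how these two objectives are measured after an incident, the sketch below compares achieved values against targets. The timestamps and function names are hypothetical; the 15 and 30 minute targets are the examples used above.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: everything written after the last recoverable point is lost."""
    return failure_time - last_recoverable_point


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: from the failure until the service is back within its SLO."""
    return service_restored - failure_time


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    return achieved <= objective


# Hypothetical incident timeline.
failure = datetime(2025, 11, 3, 14, 0)
last_archived_log = datetime(2025, 11, 3, 13, 52)
restored = datetime(2025, 11, 3, 14, 41)

rpo = achieved_rpo(last_archived_log, failure)  # 8 minutes of data lost
rto = achieved_rto(failure, restored)           # 41 minutes of downtime

print("RPO met:", meets_objective(rpo, timedelta(minutes=15)))
print("RTO met:", meets_objective(rto, timedelta(minutes=30)))
```

In this made-up timeline the RPO target is met while the RTO target is missed, which is the kind of gap a DR report should record.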

Key principle: replication ≠ backup. Replication quickly propagates errors and ransomware encryption to every copy. A backup is an isolated, verified, ideally immutable copy.

2) Data classification and criticality levels

Divide assets into classes:
  • Tier-0 (vital): transactional databases, payments, balance ledgers, secrets/PKI.
  • Tier-1 (critical): service configs, queues, CI/CD artifacts, container registries.
  • Tier-2 (important): analytics, reports, secondary indexes, log archives.
  • Tier-3 (auxiliary): caches, temporary data (can be rebuilt from source).

For each class, define the RPO/RTO, retention period, immutability requirements, and location.
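
One way to keep these decisions explicit is a small policy table in code or config. The sketch below is only an illustration: the Tier-0 RPO/RTO reuse the 15 and 30 minute examples from section 1, and every other number is an assumed placeholder to be replaced with your own targets.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class TierPolicy:
    rpo: timedelta
    rto: timedelta
    retention_days: int
    immutable: bool   # must a WORM/Object Lock copy exist?
    offsite: bool     # must a copy live in another region/account?


# Illustrative values only; all of them are assumptions except Tier-0 RPO/RTO.
POLICIES = {
    0: TierPolicy(timedelta(minutes=15), timedelta(minutes=30), 90, True, True),
    1: TierPolicy(timedelta(hours=1), timedelta(hours=4), 35, True, True),
    2: TierPolicy(timedelta(hours=24), timedelta(hours=24), 35, False, True),
    3: TierPolicy(timedelta(days=7), timedelta(days=3), 7, False, False),
}


def policy_for(tier: int) -> TierPolicy:
    """Look up the policy for a tier, failing loudly on an unclassified asset."""
    if tier not in POLICIES:
        raise ValueError(f"unknown tier {tier}: classify the asset first")
    return POLICIES[tier]


print(policy_for(0))
```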

3) Storage strategy: the 3-2-1-1-0 rule

3 copies of data (prod + 2 backups).
2 different media/storage types.
1 offsite copy (different region/cloud).
1 immutable/air-gap (WORM/Object Lock/Tape).
0 errors in recovery checks (regular tests).
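
The rule is easy to turn into an automated check over an inventory of copies. The following is a minimal sketch; the `DataCopy` fields and the example inventory are hypothetical and would normally come from a backup catalog.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str                      # e.g. "block", "object", "tape"
    offsite: bool = False
    immutable: bool = False
    is_prod: bool = False
    restore_errors: Optional[int] = None  # None means "never restore-tested"


def check_3_2_1_1_0(copies):
    """Return a pass/fail flag for each part of the 3-2-1-1-0 rule."""
    backups = [c for c in copies if not c.is_prod]
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in backups),
        "1 immutable": any(c.immutable for c in backups),
        # An untested backup counts as a failure, not as zero errors.
        "0 restore errors": bool(backups) and all(c.restore_errors == 0 for c in backups),
    }


inventory = [
    DataCopy("prod-db", media="block", is_prod=True),
    DataCopy("local-backup", media="object", restore_errors=0),
    DataCopy("dr-backup", media="object", offsite=True, immutable=True),
]

for rule, ok in check_3_2_1_1_0(inventory).items():
    print(f"{rule}: {'ok' if ok else 'FAIL'}")
```

In this example the only failing item is the last one, because the offsite copy has never been restore-tested.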

4) Types of backups

Full - a complete copy. Slow and expensive, but the base for every strategy.
Incremental - changes since the last backup of any type. Smallest volume, longest restore chain.
Differential - changes since the last full backup. Faster recovery, more space.
Snapshot - a point-in-time image of a volume/disk (EBS/ZFS/LVM). App-consistent snapshots (with quiesce) are required.
PITR (Point-in-Time Recovery) - base backup + logs (WAL/binlog) to roll forward to an exact time/LSN.
Object/file/image-level - for specific data types (VM images, S3 objects, DB dumps).
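
The practical difference between incremental and differential backups shows up in the restore chain. The sketch below computes which backups must be applied, in order, to reach a target. The `Backup` type and the weekday ids are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    id: str
    kind: str  # "full", "incr" (since last backup of any type) or "diff" (since last full)


def restore_chain(history, target_id):
    """Return backup ids to restore, oldest first. `history` is ordered oldest to newest."""
    ids = [b.id for b in history]
    if target_id not in ids:
        raise ValueError(f"unknown backup {target_id!r}")
    i = ids.index(target_id)
    chain = []
    while i >= 0:
        backup = history[i]
        chain.append(backup.id)
        if backup.kind == "full":
            return chain[::-1]
        if backup.kind == "diff":
            # A differential only needs the most recent full before it.
            i -= 1
            while i >= 0 and history[i].kind != "full":
                i -= 1
        else:
            # An incremental needs the backup immediately before it.
            i -= 1
    raise ValueError(f"no full backup precedes {target_id!r}")


incrementals = [Backup("sun", "full"), Backup("mon", "incr"),
                Backup("tue", "incr"), Backup("wed", "incr")]
differentials = [Backup("sun", "full"), Backup("mon", "diff"),
                 Backup("tue", "diff"), Backup("wed", "diff")]

print(restore_chain(incrementals, "wed"))   # every link is needed
print(restore_chain(differentials, "wed"))  # only the full and the last differential
```

A longer chain means more files that must all be intact, which is why incremental schemes depend heavily on integrity checks.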

5) Consistency of backups

Crash-consistent: the state after a sudden shutdown - acceptable for stateless services and journaled file systems.
App-consistent: the application quiesces its operations (fsfreeze/pre-post scripts) → guaranteed integrity.
Database consistency: use the backup tool's API (pgBackRest, XtraBackup), hot-backup modes, checkpoint freezing.

6) Encryption, keys and access

At-rest and in-transit encryption for all copies.
Keys in KMS/HSM, rotation by policy (90/180 days), separate keys by environment.
Separation of duties: who creates/removes backups ≠ who can decrypt/read them.
Do not keep decryption keys in the same trust domain as the target copies.

7) Immutable copies and ransomware protection

Object Lock/WORM (Compliance/Governance modes) with retention and Legal Hold.
Air-gap: isolated/offline storage (tape, offline cloud/account).
Delayed-deletion policies, MFA Delete, a separate account for backup buckets, no public access.
Scan for malware/indicators of compromise before mounting.

8) Frequency, schedule and retention

GFS (Grandfather-Father-Son): daily incrementals, weekly full/differential, monthly full with long-term retention.
RPO dictates the frequency of incrementals and WAL/binlog archiving (for example, every 5-15 minutes).
Retention: critical data - at least 35-90 days, plus monthly copies for 12-36 months (legal requirements).
Seasonal peaks: take separate checkpoints (before promotions/releases).
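
A GFS-style retention decision can be sketched as a pure function over backup dates. This is one simple interpretation (keep the newest N dailies, the newest backup in each of the last N ISO weeks, and the newest in each of the last N months); the function name and default counts are assumptions, not a standard.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain under a simple GFS scheme."""
    ordered = sorted(set(backup_dates), reverse=True)  # newest first

    def newest_per(period_key, limit):
        newest = {}
        for d in ordered:
            key = period_key(d)
            if key not in newest:
                if len(newest) == limit:
                    break  # older periods fall outside the retention window
                newest[key] = d
        return set(newest.values())

    keep = set(ordered[:daily])                                        # sons
    keep |= newest_per(lambda d: tuple(d.isocalendar())[:2], weekly)   # fathers
    keep |= newest_per(lambda d: (d.year, d.month), monthly)           # grandfathers
    return keep


# Sixty consecutive daily backups ending on 2025-11-03.
last = date(2025, 11, 3)
all_backups = [last - timedelta(days=n) for n in range(60)]

kept = gfs_keep(all_backups, daily=7, weekly=4, monthly=3)
to_delete = sorted(set(all_backups) - kept)
print(f"keep {len(kept)}, delete {len(to_delete)}")
```

Anything in `to_delete` should still be checked against Object Lock and legal-hold rules before removal.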

9) DR models and scenarios

Active-Active: both regions serve traffic. Minimal RTO, but data conflicts require a strict conflict-resolution policy.
Active-Passive (hot/warm): hot - deployed and synchronized (RTO in minutes); warm - partially ready (RTO in hours).
Cold: store copies plus Terraform/Ansible/images and bring the environment up on demand (RTO of a day or more).
DRaaS: provider orchestration of VMs/networks/addresses in another zone.

10) Failover orchestration and recovery priorities

Startup priority: network/VPN/DNS → secrets/KMS → databases/clusters → queues/cache → applications → perimeter/CDN → analytics.
Automation: scripts/runbook actions, Terraform/Ansible/Helm/ArgoCD profiles for DR environment.
Data: DB PITR → reindex/replica → warm cache → launching services with schema compatibility flags.
DNS/GSLB: lower TTLs in advance; scripted switchover scenarios with validation.
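
The startup priority above is a dependency chain, so a runbook can derive the order instead of hard-coding it. A minimal sketch using the standard-library `graphlib` module (Python 3.9+); the dictionary simply encodes the chain listed in this section, and the helper name is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each component maps to the components that must be up before it starts.
DEPENDS_ON = {
    "network/VPN/DNS": set(),
    "secrets/KMS": {"network/VPN/DNS"},
    "databases/clusters": {"secrets/KMS"},
    "queues/cache": {"databases/clusters"},
    "applications": {"queues/cache"},
    "perimeter/CDN": {"applications"},
    "analytics": {"perimeter/CDN"},
}


def startup_order(depends_on):
    """Return components in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency in DR plan: {exc.args[1]}") from exc


for step, component in enumerate(startup_order(DEPENDS_ON), start=1):
    print(step, component)
```

Keeping the dependencies as data makes a circular dependency fail at plan time instead of in the middle of a failover.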

11) Backup verification tests

Restore tests on a schedule: sampling N% of backups, sandbox deployment, automatic schema/invariant checks.
Full DR drill (game day): disable a region/AZ and check RTO/RPO on live traffic (or shadow traffic).
Integrity tests: hash catalogs, checksums, an attempt to read every layer (full + chain).
Documented report: timings, steps, anomalies, gap against targets, corrective actions.
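
Two of the checks above, sampling N% of backups and comparing checksums, fit in a few lines. This is a sketch under simple assumptions: backups are files on a local path and the expected SHA-256 comes from a catalog. All names are hypothetical, and a checksum match does not replace an actual restore.

```python
import hashlib
import math
import random
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(backup_ids, percent, seed=None):
    """Choose at least one backup, roughly `percent`% of the total, for a restore test."""
    if not backup_ids:
        return []
    count = max(1, math.ceil(len(backup_ids) * percent / 100))
    return random.Random(seed).sample(sorted(backup_ids), count)


def verify_backup(path: Path, expected_sha256: str) -> bool:
    return path.is_file() and sha256_of(path) == expected_sha256.lower()


# Demo with a throwaway file standing in for a backup.
with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "demo.bak"
    backup_file.write_bytes(b"backup payload")
    recorded = sha256_of(backup_file)               # stored in the catalog at backup time
    print("intact:", verify_backup(backup_file, recorded))
    backup_file.write_bytes(b"tampered payload")    # simulate corruption
    print("after corruption:", verify_backup(backup_file, recorded))

print(pick_sample([f"backup-{n:02d}" for n in range(40)], percent=10, seed=1))
```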

12) Practice for core technologies

Databases

PostgreSQL: base backup + WAL archive (PITR), pgBackRest/Barman tools; replication slots, monitoring of `lsn`.
MySQL/MariaDB: Percona XtraBackup/Enterprise Backup, binlog archiving.
MongoDB: `mongodump` for a logical copy + snapshots for large data sets; oplog for PITR.
Redis: RDB/AOF for critical data (when Redis is more than a cache); more often, logical reconstruction from the source plus a snapshot for emergencies.
Kafka/Pulsar: metadata backup (ZK/KRaft/BookKeeper), disk snapshots, topic/log mirroring.

Kubernetes

etcd snapshot + Velero for resources/volumes (CSI snapshots).
Backup secrets/PKI separately (Vault snapshot).
Separate image registry with immutable tags.

VMs and File Systems

ZFS: `zfs snapshot` + incremental `zfs send` / `zfs recv` streams (optionally compressed with `zstd`), integrity checked with `zpool scrub`.
LVM/EBS snapshots with pre/post scripts (app-consistent).
Object stores: versioning + Object Lock.

13) Cataloging and version control of backups

Catalog (metadata): what, where, when, how it was made (tool), hashes, KMS key, owner, retention period.
Labels/tags: `env=prod|stage`, `system=db|k8s|vm`, `tier=0|1|2`, `retention=35d|1y`.
Gold checkpoints: before migrations/DDL/large-scale releases.
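
A catalog entry can be as simple as a typed record serialized to JSON. The sketch below mirrors the fields and tags listed in this section; the record layout, ids, bucket path, and key alias are hypothetical examples rather than a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CatalogEntry:
    backup_id: str
    source: str       # what was backed up
    location: str     # where the copy lives
    created_at: str   # when (ISO 8601, UTC)
    tool: str         # how it was made
    sha256: str
    kms_key: str
    owner: str
    tags: dict = field(default_factory=dict)


def find(entries, **wanted_tags):
    """Return entries whose tags match every requested key/value pair."""
    return [e for e in entries
            if all(e.tags.get(k) == v for k, v in wanted_tags.items())]


catalog = [
    CatalogEntry("bk-0001", "payments-db", "s3://example-backups/payments/bk-0001",
                 "2025-11-03T02:00:00Z", "pgbackrest", "<sha256 hex>",
                 "alias/example-backup-prod", "db-team",
                 {"env": "prod", "system": "db", "tier": "0", "retention": "1y"}),
    CatalogEntry("bk-0002", "stage-cluster", "s3://example-backups/k8s/bk-0002",
                 "2025-11-03T03:00:00Z", "velero", "<sha256 hex>",
                 "alias/example-backup-stage", "platform-team",
                 {"env": "stage", "system": "k8s", "tier": "1", "retention": "35d"}),
]

for entry in find(catalog, env="prod", tier="0"):
    print(json.dumps(asdict(entry), indent=2))
```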

14) Observability and metrics

Job success rate: % successful/failed, with reasons.
Backup/restore duration, backup window width.
RPO-actual: log archiving lag (WAL/binlog), p95.
Integrity: proportion of chains tested, hash reconciliation errors.
Cost: storage capacity by class, deduplication/compression ratio.
DR-readiness: frequency and result of exercises (pass/fail).
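
RPO-actual can be derived from observed log-archiving lag. The sketch below uses the nearest-rank percentile method on lag samples in seconds; the sample values are invented, and the 15 minute target is the example from section 1.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical WAL/binlog archive lag samples, in seconds.
archive_lag_seconds = [60] * 18 + [300, 1200]

rpo_actual_p95 = percentile(archive_lag_seconds, 95)
rpo_target_seconds = 15 * 60

print(f"RPO-actual p95 = {rpo_actual_p95}s, "
      f"target met: {rpo_actual_p95 <= rpo_target_seconds}")
```

A p95 hides the worst outliers (the 1200 s sample here), so it is worth alerting on the maximum as well.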

15) Access and compliance policies

Separate accounts/projects for backup storage; least-privilege access (production accounts must not be able to delete or encrypt backups).
Access/change logs (audit trail), alerts on mass deletions and retention changes.
Compliance: GDPR (right to erasure vs. archives), PCI DSS (encryption, keys, segmentation), local regulators.

16) Anti-patterns

"There is a replica, which means you don't need a backup."

No immutable/air-gap: one error/malware erases everything.
Backups in the same account/region as prod.
Recovery is never tested (the backup was dead long before anyone checked).
No cataloging and version control → chaos in an accident.
Shared encryption keys for all environments.
Snapshots of databases without app-consistent mode.
The backup window intersects with peaks (affects p99 and SLO).

17) Implementation checklist (0-60 days)

0-10 days

Inventory of systems/data, criticality classes.
Set RPO/RTO targets by class.
Enable full + incremental for Tier-0/1, WAL/binlog archive.
Place backups in a separate region/account and enable KMS encryption.

11-30 days

Configure immutable (Object Lock/WORM) for critical copies.
Introduce cataloging, tags, reporting; alerts on failures and log-archive lag.
First DR-drill: restore a separate service from a backup in an isolated environment.

31-60 days

Automate runbooks: Terraform/Ansible/Helm profiles for DR.
Regular restore tests (weekly/monthly) + a quarterly full DR scenario.
Optimize cost: deduplication, compression, storage lifecycle rules.

18) Maturity metrics

Restore tests: ≥ 1/week for Tier-0 (selective), ≥ 1/month - full scenario.
Immutable coverage for Tier-0/1 = 100%.
RPO-actual p95 ≤ target (e.g. ≤ 15 min).
RTO-actual on DR-exercises ≤ target (e.g. ≤ 30 min).
Catalog completeness = 100% (each backup is described and checked).
Incident-to-restore - Time from detection to start of recovery.

19) Examples (snippets)

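PostgreSQL - PITR policy (idea):
```bash
# Base backup once a day
pgbackrest --stanza=prod --type=full backup

# WAL archiving is driven by PostgreSQL itself (postgresql.conf):
#   archive_mode = on
#   archive_command = 'pgbackrest --stanza=prod archive-push %p'
#   archive_timeout = 300   # switch WAL segments at least every 5 minutes

# Restore to a point in time
pgbackrest --stanza=prod --type=time --target="2025-11-03 14:00:00+02" restore
```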
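MySQL - incremental cycle:
```bash
# Full backup
xtrabackup --backup --target-dir=/backup/full-2025-11-01

# Incremental backup based on the full one
xtrabackup --backup --incremental-basedir=/backup/full-2025-11-01 --target-dir=/backup/inc-2025-11-02

# Prepare the full backup without rolling back uncommitted transactions
xtrabackup --prepare --apply-log-only --target-dir=/backup/full-2025-11-01

# Apply the incremental backup on top of the prepared full backup
xtrabackup --prepare --target-dir=/backup/full-2025-11-01 --incremental-dir=/backup/inc-2025-11-02
```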
Kubernetes - Velero (manifest idea):
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: prod-daily
spec:
  includedNamespaces: ["prod-"]
  ttl: 720h
  storageLocation: s3-immutable
```
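S3 with Object Lock (sample lifecycle policy). Object Lock retention itself is configured on the bucket separately; this rule only expires noncurrent versions after 365 days:
```json
{
  "Rules": [{
    "ID": "prod-immutable",
    "Status": "Enabled",
    "Filter": {},
    "NoncurrentVersionExpiration": { "NoncurrentDays": 365 }
  }]
}
```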

20) Communications and operational roles

Incident Commander, Comms Lead, Ops Lead, DB Lead, Security.
Message templates for stakeholders/regulators/users.
Post-mortem with action items: where minutes were lost, where automation can be improved.

21) Conclusion

A reliable backup and DR practice is not "make a copy" but a cycle: classification → RPO/RTO targets → multi-level and immutable copies → automated runbooks → regular restores and drills. Adhere to 3-2-1-1-0, keep replication separate from backups, encrypt and isolate keys, document and verify. Then even a "black swan" turns into a manageable process with predictable downtime and minimal data loss.
