Backup and replication strategies
Brief Summary
A reliable data strategy rests on three pillars: backup, replication, and recovery. Replication reduces RTO (recovery time); backups bound RPO (data loss) and protect against logical errors and ransomware. Basic principles: the 3-2-1-1-0 rule (3 copies, 2 types of media, 1 offsite, 1 immutable, 0 errors in verification), regular DR tests, and immutability for critical backup sets.
Terms and objectives
RPO - how much data can be lost (for example, ≤ 5 minutes).
RTO - how much time is allowed to restore (for example, ≤ 15 minutes).
PITR (Point-in-Time Recovery) - recovery to a chosen moment by replaying logs on top of a base backup.
Data SLO - service level objectives for RPO/RTO and for the success rate of backup jobs.
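The RPO and RTO targets above can be expressed as simple checks. The sketch below is a minimal illustration, assuming you can obtain the timestamp of the last safely archived log segment and the measured duration of a test restore. The function names are hypothetical, and the 5 and 15 minute thresholds are just the example values from the definitions.

```python
from datetime import datetime, timedelta


def rpo_met(last_archived: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the data written since the last archived log fits within the RPO."""
    return now - last_archived <= rpo


def rto_met(restore_duration: timedelta, rto: timedelta) -> bool:
    """True if a measured test restore finished within the RTO."""
    return restore_duration <= rto


now = datetime(2024, 1, 1, 12, 0, 0)
# Last log segment was archived 3 minutes ago, target RPO is 5 minutes.
print(rpo_met(datetime(2024, 1, 1, 11, 57, 0), now, timedelta(minutes=5)))
# The test restore took 22 minutes, target RTO is 15 minutes.
print(rto_met(timedelta(minutes=22), timedelta(minutes=15)))
```

The first check passes and the second fails, which is the typical situation where replication or a faster restore path is needed to bring RTO down.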
Fault tolerance and replication models
Topology Options
Active-Passive (hot/warm/cold): simpler, with predictable failovers.
Active-Active: high availability, but conflict resolution and consistency are harder.
Multi-zone/region/cloud: a trade-off between latency and egress cost.
Synchronous vs asynchronous
Synchronous: RPO ≈ 0, but higher write latency and a limit on the distance between sites.
Asynchronous: little impact on write latency, low but non-zero RPO (up to minutes), works across regions and clouds.
Hybrid: synchronous within a zone, asynchronous to a remote region.
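The hybrid rule can be sketched as a per-replica decision. This is a minimal illustration, assuming that a synchronous commit costs at least one network round trip to the replica. The function name, the replica names, the RTT figures, and the 5 ms write latency budget are all hypothetical.

```python
def choose_mode(rtt_ms: float, write_latency_budget_ms: float) -> str:
    """Pick synchronous replication only when the round trip fits the latency budget."""
    return "sync" if rtt_ms <= write_latency_budget_ms else "async"


# Hypothetical round-trip times from the primary to each replica, in milliseconds.
replicas = {"same-zone": 0.5, "other-zone": 2.0, "remote-region": 70.0}
plan = {name: choose_mode(rtt, write_latency_budget_ms=5.0) for name, rtt in replicas.items()}
print(plan)
```

With these numbers the nearby replicas are synchronous and the remote region is asynchronous, which matches the hybrid layout.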
Replica ≠ backup
A replica faithfully repeats errors and deletions made on the source. A backup is an off-path copy with versioning, verification, and isolation.
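A toy simulation makes the difference concrete. It is only a sketch: the dictionaries stand in for a primary, its replica, and a point-in-time backup, and the keys and the `apply` helper are hypothetical.

```python
primary = {"acct:1": 100, "acct:2": 250}
replica = dict(primary)            # the replica mirrors the primary
backup_versions = [dict(primary)]  # the backup is an isolated point-in-time copy


def apply(op: str, key: str, value=None) -> None:
    """Apply a write to the primary and replicate it to the replica."""
    for store in (primary, replica):
        if op == "delete":
            store.pop(key, None)
        else:
            store[key] = value


apply("delete", "acct:2")  # logical error: an accidental delete
replica_has_key = "acct:2" in replica
backup_value = backup_versions[-1]["acct:2"]
print(replica_has_key)  # False: the replica followed the source
print(backup_value)     # 250: the value is still recoverable from the backup
```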
Policy 3-2-1-1-0 and immutability
3 copies (prod + local backup + offsite).
2 types of media (block/NAS/object/tape).
1 offsite (other site/cloud/tape).
1 immutable copy (WORM: Object Lock, immutable snapshots/tape).
0 errors: regular integrity checks (checksum/verify/restore tests).
- Enable versioning and Object Lock (Compliance/Governance) on buckets that hold critical backups.
- For NAS/block storage: immutable snapshots with retention, with deletion blocked until the retention period expires.
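The five conditions of 3-2-1-1-0 can be checked mechanically against an inventory of copies. The sketch below assumes a simple hand-maintained inventory. The `Copy` record, its fields, and the sample entries are hypothetical and not tied to any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str          # e.g. "block", "nas", "object", "tape"
    offsite: bool
    immutable: bool
    verify_errors: int  # errors found by the last integrity check


def check_32110(copies: list[Copy]) -> dict[str, bool]:
    """Evaluate each of the five 3-2-1-1-0 conditions for a set of copies."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        "0_errors": all(c.verify_errors == 0 for c in copies),
    }


inventory = [
    Copy("prod", "block", offsite=False, immutable=False, verify_errors=0),
    Copy("local-backup", "nas", offsite=False, immutable=False, verify_errors=0),
    Copy("offsite-bucket", "object", offsite=True, immutable=True, verify_errors=0),
]
result = check_32110(inventory)
print(result)
print("compliant:", all(result.values()))
```

Dropping the offsite bucket from the inventory fails the copy count, the offsite condition, and the immutability condition at once, which shows how much rests on that one copy.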
Types of backups and schedules
Full - full copy.
Incremental - only the changes since the previous backup (of any type).
Differential - all changes since the last full backup.
Forever-incremental with a GFS plan (Grandfather-Father-Son): daily incrementals, weekly and monthly synthetic fulls.
Recommendation (example):
- Prod DB: daily full (or synthetic full), incrementals/logs every 5-15 minutes (PITR).
- File servers: weekly full, daily incremental, monthly archives.
- Object storage: lifecycle rules + versioning; cold data goes to an archive storage class or tape.
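The backup types differ mainly in what a restore needs. The sketch below computes a restore chain under the definitions above, assuming that an incremental is relative to the previous backup of any type and a differential is relative to the last full. The `Backup` record and the `restore_chain` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # position in time, larger is newer
    kind: str  # "full", "diff" or "incr"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups to apply, oldest first, to restore the state at target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target")
    base = fulls[-1]
    after_base = [b for b in usable if b.day > base.day]
    diffs = [b for b in after_base if b.kind == "diff"]
    # A differential already holds everything since the last full, so only
    # incrementals taken after the newest differential are still needed.
    start = diffs[-1] if diffs else base
    chain = [base] if start is base else [base, start]
    chain += [b for b in after_base if b.kind == "incr" and b.day > start.day]
    return chain


history = [Backup(0, "full"), Backup(1, "incr"), Backup(2, "incr"),
           Backup(3, "diff"), Backup(4, "incr"), Backup(5, "incr"), Backup(7, "full")]
for day in (2, 5, 7):
    print(day, [(b.day, b.kind) for b in restore_chain(history, day)])
```

A longer chain means more pieces that must all be intact, which is one reason periodic or synthetic fulls are worth their storage cost.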
Applications and Databases: PITR Practices
PostgreSQL
Enable WAL archiving and take base backups; PITR replays archived WAL via 'restore_command'.
Tools: 'pgBackRest', 'wal-g' (object storage), 'pg_basebackup' for full backups.
Split volumes: data and WAL; write WAL to fast NVMe with PLP (power-loss protection).
MySQL/MariaDB
Binary log for PITR; full backups via 'Percona XtraBackup' (hot backup).
GTID replication; for DR - asynchronous to another region/cloud.
MongoDB
Oplog for PITR; storage-level snapshots plus 'mongodump' for logical copies.
Verify that the replica is consistent before taking a backup from it.
Redis/Caches
A cache is not a source of truth: keep RDB/AOF files with an offsite copy; restore as a warm cache or repopulate from the source of truth.
Kubernetes and containers
etcd cluster - a separate critical backup target (frequent snapshots, offsite copies).
Velero: backup of manifests/resources + CSI snapshots of PVs; storage in an S3-compatible bucket (with Object Lock).
Stateful workloads: app-consistent snapshots (pre/post hooks); otherwise the snapshot is only crash-consistent.
Versioning of object artifacts (models/media) - at the bucket level.
Virtualization and file servers
VM snapshots: use CBT (Changed Block Tracking), store copies offsite, and periodically take guest-aware quiesced snapshots (VSS for Windows).
File servers (NAS): snapshots + replica, plus regular directory restore tests (file sampling).
Backup security
Encryption at rest (LUKS/ZFS/cloud KMS/Vault) and in transit (TLS/mTLS).
Key management: separate roles, dual control, rotation, offline storage of master keys.
Isolation: backup software accounts have no rights to delete immutable copies; separate networks/VLANs.
Ransomware resistance: immutable copies, air gap (tapes/isolated account/lab).
Audit: log of backup system operations, notifications on deletion or reduction of retention.
Window and bandwidth planning
Backup window vs load: I/O and network throttling, deduplication, compression.
Network: incrementals every N minutes, dedicated links/VPN, replication at night or continuously with QoS.
Changed Block Tracking/CDC to reduce traffic.
Large databases: parallel streams, multipart uploads to object storage.
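Whether an incremental fits into its window is a matter of arithmetic. The sketch below is a rough estimate that ignores protocol overhead and deduplication. The function name and all input figures (10 TB dataset, 5% daily change, 2:1 compression, a 1 Gbit/s link throttled to 50%) are hypothetical.

```python
def transfer_hours(size_gb: float, change_rate: float, compression_ratio: float,
                   bandwidth_mbit_s: float) -> float:
    """Hours needed to ship one incremental of size_gb * change_rate after compression."""
    payload_gb = size_gb * change_rate / compression_ratio
    payload_mbit = payload_gb * 8 * 1000  # 1 GB = 8000 Mbit in decimal units
    return payload_mbit / bandwidth_mbit_s / 3600


# 10 TB dataset, 5% daily change, 2:1 compression, 1 Gbit/s link throttled to 50%.
hours = transfer_hours(10_000, 0.05, 2.0, 1000 * 0.5)
print(f"{hours:.2f} h")
```

If the result exceeds the available window, the levers are the ones listed above: less throttling, better compression or deduplication, parallel streams, or more frequent and therefore smaller increments.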
Monitoring, Metrics and SLO
Tech metrics:
- Success of backup/replication jobs (%), duration, throughput, log lag (WAL/binlog/oplog).
- Backup storage usage, deduplication ratio, other costs.
- Duration and success of test restores.
SLO (examples):
- Backup success ≥ 99.9% over 30 days.
- RPO met ≥ 99% of the time (log lag ≤ target).
- RTO (test restore) ≤ 15 min for the wallet service, ≤ 1 h for reporting.
- Monthly DR drill: 100% of routine scenarios completed.
Alerts:
- Missed or failed backup, PITR log lag above threshold, drop in deduplication ratio, low free space, change of retention policy, no recent test restore.
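The success and RPO targets above can be evaluated from job history. The sketch below assumes each job run records whether it succeeded and the log lag observed at that time. The `JobRun` record, the function names, and the synthetic history are hypothetical, and the thresholds mirror the example figures of 99.9% and 99%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    ok: bool
    log_lag_s: float  # WAL/binlog/oplog archive lag observed during the run


def backup_success_rate(runs: list[JobRun]) -> float:
    """Share of successful backup jobs, in percent."""
    return 100.0 * sum(r.ok for r in runs) / len(runs)


def rpo_compliance(runs: list[JobRun], rpo_s: float) -> float:
    """Share of observations where log lag stayed within the RPO target, in percent."""
    return 100.0 * sum(r.log_lag_s <= rpo_s for r in runs) / len(runs)


def slo_alerts(runs: list[JobRun], rpo_s: float,
               success_slo: float = 99.9, rpo_slo: float = 99.0) -> list[str]:
    """Compare measured rates with the SLO targets and describe any breach."""
    found = []
    success = backup_success_rate(runs)
    within_rpo = rpo_compliance(runs, rpo_s)
    if success < success_slo:
        found.append(f"backup success {success:.2f}% is below {success_slo}%")
    if within_rpo < rpo_slo:
        found.append(f"RPO met {within_rpo:.2f}% of the time, below {rpo_slo}%")
    return found


# Synthetic history: 1000 runs, one failure, 20 runs with a 10 minute log lag.
runs = [JobRun(ok=(i != 0), log_lag_s=(600.0 if i < 20 else 30.0)) for i in range(1000)]
print(slo_alerts(runs, rpo_s=300.0))
```

In this history the job success target is just met while the RPO target is missed, which illustrates why log lag deserves its own alert next to job status.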
DR drills and recovery checks
Table-top: role coordination, contacts, communications.
Technical: sandbox recovery, RTO measurement, checksum/data comparison.
Black-start: full recovery onto bare metal or a clean cluster.
Runbook catalog: documented recovery steps (runbooks) for each system class.
Automation: periodic "canary" restore and verification of checksums.
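A canary restore check can be as simple as comparing checksums of restored files against a manifest recorded at backup time. The sketch below uses only the Python standard library and simulates the restore with temporary directories. The function names, the manifest format, and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its checksum."""
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return the relative paths that are missing or differ after a restore."""
    bad = []
    for rel, expected in manifest.items():
        target = restored_root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "source"), Path(tmp, "restored")
    for root in (source, restored):
        root.mkdir()
        (root / "a.txt").write_bytes(b"ledger")
        (root / "b.txt").write_bytes(b"events")
    manifest = build_manifest(source)
    (restored / "b.txt").write_bytes(b"corrupted")  # simulate a damaged restore
    mismatches = verify_restore(manifest, restored)
    print(mismatches)
```

In a real drill the manifest would be stored alongside the backup, ideally in the immutable copy, so that a compromised backup host cannot rewrite both the data and its checksums.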
Practical templates
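1) PostgreSQL (pgBackRest + WAL archive to object storage)
```ini
[global]
repo1-type=s3
repo1-path=/pgbackups
repo1-s3-endpoint=minio.local:9000
repo1-s3-bucket=pg-wal
# Assumption: pgBackRest requires a region for S3 repositories; this value is a placeholder.
repo1-s3-region=us-east-1
# Assumption: path-style addressing, which MinIO deployments commonly need.
repo1-s3-uri-style=path
repo1-s3-key=ACCESSKEY
repo1-s3-key-secret=SECRET
repo1-retention-full=8
start-fast=y
compress-type=zst
```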
2) wal-g (ENV example)
```bash
export WALG_S3_PREFIX=s3://pg-wal/prod
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export WALG_COMPRESSION_METHOD=zstd
```
3) Velero (K8s - object + immutability of the bucket)
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata: { name: default, namespace: velero }
spec:
  provider: aws
  objectStorage:
    bucket: k8s-backups
  config:
    s3Url: https://minio.example
    s3ForcePathStyle: "true"
    publicUrl: https://minio.example
```
4) Object Lock policy (example 'mc')
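```bash
# Object Lock has to be enabled when the bucket is created (for example, mc mb --with-lock).
mc version enable my/backups
mc retention set --default COMPLIANCE 365d my/backups
```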
5) Example of GFS schedule (concept)
Daily: incrementals (logs) every 15 min, plus a daily synthetic full.
Weekly: one full (synthetic), kept for 8 weeks.
Monthly: one full, kept for 12-24 months (archive/tape).
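The retention side of such a schedule can be sketched as a selection over the dates of existing fulls. The function below is a minimal illustration, assuming one full per day and treating the newest backup of each ISO week and of each calendar month as the weekly and monthly copy. The name `gfs_keep` and the default of 7 daily copies are hypothetical. The 8 weeks and 12 months come from the schedule above.

```python
from datetime import date, timedelta


def gfs_keep(backup_days: list[date], daily: int = 7, weekly: int = 8,
             monthly: int = 12) -> set[date]:
    """Select which daily fulls to retain under a Grandfather-Father-Son scheme."""
    days = sorted(backup_days, reverse=True)  # newest first
    keep = set(days[:daily])                  # sons: the most recent dailies
    newest_per_week, newest_per_month = {}, {}
    for d in days:
        # setdefault keeps the first (newest) backup seen for each period.
        newest_per_week.setdefault(tuple(d.isocalendar())[:2], d)
        newest_per_month.setdefault((d.year, d.month), d)
    keep.update(list(newest_per_week.values())[:weekly])    # fathers
    keep.update(list(newest_per_month.values())[:monthly])  # grandfathers
    return keep


last = date(2024, 6, 30)
history = [last - timedelta(days=i) for i in range(400)]
kept = gfs_keep(history)
print(len(history), "->", len(kept))
```

Everything outside the returned set is a candidate for expiry. With immutable storage the actual deletion still has to wait until the lock period ends.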
Implementation checklist
- Defined data classes, owners, RPO/RTO/SLO.
- Replication (sync/async) and topology (AZ/Region/Cloud) models selected.
- Backups configured: full/incremental/PITR, schedules, catalogs.
- Immutability enabled (WORM/Object Lock/immutable snapshots), plus offsite/air-gapped copies.
- Encryption with KMS/Vault, separate roles, key rotation.
- Monitoring: job success, log lag, free space, test restores; alerts.
- Recovery and failover runbooks; contacts, escalations, communication templates.
- Monthly DR drills with a report; plans adjusted based on the results.
- Budget and FinOps: storage and egress costs, archiving/tiering plan.
Common errors
"There is a replica, so no backup is needed": logical deletions and ransomware propagate to the replica.
No restore tests: the backup exists only "in theory."
No immutability and no offsite copy: a single point of failure.
The same account/keys for production and backups: one compromise means losing everything.
Backup windows that are too long and collide with peak load; no throttling or QoS.
PITR without log lag monitoring.
Ignoring app-consistent snapshots: restored volumes come back in a dirty, inconsistent state.
iGaming/fintech specifics
Wallet/payment core: RPO ≤ 1-5 min, RTO ≤ 15 min; logs (WAL/binlog) shipped to object storage with WORM; synchronous within the zone + asynchronous to a remote region.
Reporting/regulatory: immutable repositories, long retention (years), verifiable integrity, clear procedures for handing data over to regulators.
Logs/raw events/anti-fraud: cheap long-lived storage (object) + lifecycle; indexes and data marts kept separately.
Peaks (matches/tournaments): backup windows outside peaks, throttling; DR plans for the event period; canary restores before promotions.
Conclusion
Data protection is an architectural discipline: 3-2-1-1-0, versioning and immutability, RPO/RTO as SLOs, regular DR exercises, and hands-on recovery testing. Combine replication for uptime and fast failover with backups for logical errors and security compromises. Automate, measure, document - and you will always have a working path back, even on the worst day.