Backup and replication strategies

Brief Summary

A reliable data strategy rests on three pillars: backup, replication, and recovery. Replication shortens RTO (recovery time); backups bound RPO (data loss) and protect against logical errors and ransomware. Basic principles: the 3-2-1-1-0 rule (3 copies, 2 types of media, 1 offsite, 1 immutable, 0 errors in verification), regular DR tests, and immutability for critical backup sets.

Terms and objectives

RPO - how much data can be lost (for example, ≤ 5 minutes).
RTO - how much time is allowed to restore (for example, ≤ 15 minutes).
PITR (Point-in-Time Recovery) - recovery to a chosen moment by replaying logs on top of a base backup.
Data SLO - service level objectives for RPO/RTO and for the success of backup jobs.

Matrix example:

Data class	RPO	RTO	Notes
Transactions/Wallet	≤ 1-5 min	≤ 5-15 min	Logs + Synchronous Core Replica
Reporting/PII	≤ 1 hour	≤ 1 hour	WORM/immutability, archives
Logs/Raw Events	≤ 24 h	≤ 4 h	Object storage, lifecycle rules
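
The matrix above can also be kept as data and checked against measurements. The following is a minimal sketch, not a definitive implementation: the class keys ("wallet", "reporting", "raw_events"), the `Objective` type and the `check` function are illustrative names, and the targets are simply the upper bounds from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated restore time


# Upper bounds from the matrix; the dictionary keys are hypothetical names.
OBJECTIVES = {
    "wallet": Objective(rpo=timedelta(minutes=5), rto=timedelta(minutes=15)),
    "reporting": Objective(rpo=timedelta(hours=1), rto=timedelta(hours=1)),
    "raw_events": Objective(rpo=timedelta(hours=24), rto=timedelta(hours=4)),
}


def check(data_class: str, log_lag: timedelta, restore_time: timedelta) -> list[str]:
    """Return the violated objectives; an empty list means compliant."""
    target = OBJECTIVES[data_class]
    violations = []
    if log_lag > target.rpo:
        violations.append("RPO")
    if restore_time > target.rto:
        violations.append("RTO")
    return violations
```

Here `log_lag` stands for the measured age of the newest recoverable data and `restore_time` for the duration of the last test restore; how both are measured depends on the system.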

Fault tolerance and replication models

Topology Options

Active-Passive (hot/warm/cold): simpler, predictable failovers.
Active-Active: higher availability, but conflict resolution and consistency are harder.
Multi-Zone/Region/Cloud: a trade-off between latency and egress cost.

Synchronous vs asynchronous

Synchronous: RPO ≈ 0, but higher write latency and a distance limit.
Asynchronous: low but non-zero RPO (typically minutes), little added write latency, works across regions/clouds.
Hybrid: synchronous within a zone, asynchronous to a remote region.

Replica ≠ backup

A replica propagates errors and deletions from the source. A backup is an off-path copy with versioning, verification, and isolation.

Policy 3-2-1-1-0 and immutability

3 copies (prod + local backup + offsite).
2 types of media (block/NAS/object/tape).
1 offsite (other site/cloud/tape).
1 immutable copy (WORM: Object Lock, immutable snapshots/tape).
0 errors: regular integrity checks (checksum/verify/restore tests).
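
The rule above can be checked mechanically against an inventory of copies. This is a sketch under assumptions: the `Copy` fields and the sample inventory are hypothetical, and a real check would read this data from the backup system's catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    media: str          # e.g. "block", "nas", "object", "tape"
    offsite: bool
    immutable: bool
    verify_errors: int  # errors found by the last integrity/restore check


def check_3_2_1_1_0(copies: list[Copy]) -> dict[str, bool]:
    """Evaluate each element of the 3-2-1-1-0 rule for a list of copies."""
    return {
        "3 copies": len(copies) >= 3,
        "2 media types": len({c.media for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in copies),
        "1 immutable": any(c.immutable for c in copies),
        "0 errors": all(c.verify_errors == 0 for c in copies),
    }


# Hypothetical inventory: prod + local backup + offsite copy with Object Lock.
inventory = [
    Copy("prod", "block", offsite=False, immutable=False, verify_errors=0),
    Copy("local-backup", "nas", offsite=False, immutable=False, verify_errors=0),
    Copy("offsite-object-lock", "object", offsite=True, immutable=True, verify_errors=0),
]
report = check_3_2_1_1_0(inventory)
print(report)
```

Reporting each element separately, rather than a single pass/fail flag, shows which part of the rule is broken when a copy is lost or a verification run fails.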

Practice:

Enable versioning and Object Lock (Compliance/Governance) on buckets that hold critical backups.
For NAS/block storage - immutable snapshots with a retention period and deletion blocked until it expires.

Types of backups and schedules

Full - full copy.
Incremental - only changes from the previous backup.
Differential - changes since the last full backup.

Forever-incremental with a GFS plan (Grandfather-Father-Son): daily increments, with weekly and monthly "synthetic fulls" assembled from them.
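
The practical difference between incremental and differential backups shows up at restore time: an incremental needs every link back to the last full, a differential needs only the full. The sketch below illustrates that dependency walk; the `Backup` type, its `seq` numbering and the `restore_chain` function are hypothetical, and real tools track these dependencies in their own catalogs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    seq: int   # position in the backup history (higher = newer)
    kind: str  # "full", "incr" or "diff"


def restore_chain(history: list[Backup], target_seq: int) -> list[int]:
    """Return the seq numbers to restore, oldest first, to reach target_seq."""
    upto = sorted((b for b in history if b.seq <= target_seq), key=lambda b: b.seq)
    if not upto or upto[-1].seq != target_seq:
        raise ValueError("no backup with that seq")
    chain = []
    i = len(upto) - 1
    while i >= 0:
        b = upto[i]
        chain.append(b.seq)
        if b.kind == "full":
            return chain[::-1]
        i -= 1
        if b.kind == "diff":
            # A differential depends only on the latest full before it.
            while i >= 0 and upto[i].kind != "full":
                i -= 1
        # An incremental depends on the backup immediately before it,
        # so the loop simply continues with the previous entry.
    raise ValueError("no full backup at the start of the chain")
```

For a history of one full followed by three incrementals, the chain for the last backup contains all four; with three differentials instead, it contains only the full and the last differential.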

Recommendation (example):

Prod DB: daily full (or synthetic full), increments/logs every 5-15 minutes (PITR).
File servers: weekly full, daily incremental, monthly archives.
Object storage: lifecycle rules + versioning; cold data goes to an archive storage class or tape.

Applications and Databases: PITR Practices

PostgreSQL

Enable WAL archiving and take base backups; PITR via 'restore_command'.
Tools: 'pgBackRest', 'wal-g' (object storage), 'pg_basebackup' for full backups.
Split volumes: data and WAL; put WAL on fast NVMe with PLP (power-loss protection).
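
PITR starts from the latest base backup taken before the target moment and then replays archived WAL up to that moment. The sketch below picks such a base backup and prints the matching recovery settings. The function names and the archive path are hypothetical, and the `cp`-based 'restore_command' is only the simplest form; with pgBackRest or wal-g the tool's own fetch command would take its place. Time zone handling is omitted.

```python
from datetime import datetime
from typing import Optional


def pick_base_backup(base_backups: list[datetime], target: datetime) -> Optional[datetime]:
    """Latest base backup completed at or before the recovery target."""
    candidates = [t for t in base_backups if t <= target]
    return max(candidates) if candidates else None


def recovery_settings(target: datetime, archive_dir: str) -> str:
    """Render PostgreSQL recovery parameters for a time-based PITR."""
    return "\n".join([
        # %f = WAL file name, %p = destination path (expanded by PostgreSQL)
        f"restore_command = 'cp {archive_dir}/%f %p'",
        f"recovery_target_time = '{target:%Y-%m-%d %H:%M:%S}'",
        "recovery_target_action = 'promote'",
    ])


backups = [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)]
target = datetime(2024, 1, 2, 13, 30)
print(pick_base_backup(backups, target))
print(recovery_settings(target, "/archive"))  # "/archive" is a placeholder path
```

If no base backup precedes the target, the function returns None: PITR cannot reach a moment earlier than the oldest retained base backup, which is one reason retention and RPO have to be planned together.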

MySQL/MariaDB

Binary log for PITR; full backups via 'Percona XtraBackup' (hot backup).
GTID replication; for DR - asynchronous replication to another region/cloud.

MongoDB

Oplog for PITR; storage-level snapshots + 'mongodump' for logical copies.
Test the consistency of the replica before the backup.

Redis/Caches

Cache persistence is not a backup: keep RDB/AOF files and an offsite copy; after a failure, restore as a warm cache or rebuild from the source of truth.

Kubernetes and containers

etcd cluster - a separate critical backup target (frequent snapshots, offsite).
Velero: backup of manifests/resources + CSI snapshots/PV; storage in an S3-compatible bucket (with Object Lock).
Stateful workloads: app-consistent snapshots (pre/post hooks), otherwise - crash-consistent.
Versioning of object artifacts (models/media) - at the level of buckets.

Virtualization and file servers

VM snapshots: use CBT (Changed Block Tracking), store offsite, periodically do guest-aware quiesce (VSS for Windows).
File servers (NAS): snapshots + replica, plus regular directory restore tests (file sampling).

Backup security

Encryption at rest (LUKS/ZFS/cloud KMS/Vault) and during transmission (TLS/mTLS).
Key management: separate roles, dual control, rotation, offline storage of master keys.
Isolation: backup software accounts have no rights to delete immutable copies; dedicated networks/VLANs.
Ransomware resistance: immutable copies and an air gap (tapes/isolated account/lab).
Audit: log of backup system operations, notifications about deletion/reduction of retention.

Window and bandwidth planning

Backup window vs load: I/O and network throttling, deduplication, compression.
Network: increments every N minutes, dedicated links/VPN, replication at night or continuously with QoS.
Changed Block Tracking/CDC to reduce traffic.
Large databases: parallel streams, multipart uploads to object storage.
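
Whether a backup fits its window is mostly arithmetic: changed data, divided by the dedup/compression ratio, divided by the throttled share of the link. The sketch below makes that estimate explicit. All numbers in the example are illustrative assumptions, and the function ignores protocol overhead and source-side I/O limits.

```python
def transfer_hours(changed_gb: float, reduction_ratio: float,
                   link_mbit_s: float, share: float = 1.0) -> float:
    """Hours needed to ship changed_gb over a throttled link.

    reduction_ratio: combined dedup/compression ratio (4.0 means 4:1).
    share: fraction of the link the backup may use (QoS/throttling).
    """
    if reduction_ratio <= 0 or link_mbit_s <= 0 or not 0 < share <= 1:
        raise ValueError("ratio and link must be positive, share in (0, 1]")
    payload_gb = changed_gb / reduction_ratio
    # Mbit/s -> GB/h in decimal units: /8 to MB/s, /1000 to GB/s, *3600 to GB/h
    gb_per_hour = link_mbit_s * share / 8 / 1000 * 3600
    return payload_gb / gb_per_hour


def fits_window(hours_needed: float, window_hours: float) -> bool:
    """True if the estimated transfer fits into the backup window."""
    return hours_needed <= window_hours


# Illustrative: 2 TB changed, 4:1 reduction, 1 Gbit/s link throttled to 50%.
needed = transfer_hours(2000, 4.0, 1000, share=0.5)
print(round(needed, 2), fits_window(needed, window_hours=4))
```

In this example 500 GB travel at 225 GB/h, which takes about 2.2 hours and fits a four-hour window; halving the reduction ratio would push the same job past it.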

Monitoring, Metrics and SLO

Tech metrics:

Success of backup/replication tasks (%), duration, speed, log lag (WAL/binlog/oplog).
Backup storage usage, deduplication ratio, and related costs.
Time and success of test recoveries.

SLO (example):

Backup success rate ≥ 99.9% over 30 days.
RPO met ≥ 99% of the time (log lag ≤ target).
RTO (test-restore) ≤ 15 min for wallet, ≤ 1 h for reporting.
Monthly DR-drill: 100% of routine scenarios completed.
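
The first two SLOs above reduce to simple ratios over a reporting window. The sketch below computes them from raw samples. The input format (one boolean per backup job, one lag sample in seconds per measurement) is an assumption; in practice these values would come from the monitoring system.

```python
def backup_success_rate(results: list[bool]) -> float:
    """Percentage of successful backup jobs in the window."""
    if not results:
        raise ValueError("no backup jobs in the window")
    return 100.0 * sum(results) / len(results)


def rpo_compliance(lag_samples_s: list[float], rpo_target_s: float) -> float:
    """Percentage of log-lag samples at or below the RPO target."""
    if not lag_samples_s:
        raise ValueError("no lag samples in the window")
    within = sum(1 for lag in lag_samples_s if lag <= rpo_target_s)
    return 100.0 * within / len(lag_samples_s)


# Illustrative: one job every 15 minutes for 30 days = 2880 jobs, one failure.
jobs = [True] * 2879 + [False]
print(backup_success_rate(jobs) >= 99.9)
print(rpo_compliance([30, 60, 400, 120], rpo_target_s=300))
```

With jobs every 15 minutes, a 99.9% target over 30 days tolerates two failed jobs out of 2880 but not three, which is worth knowing before the target is committed to.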

Alerts:

Missed/failed backup, PITR log lag above threshold, drop in deduplication ratio, low free space, change in retention policy, no recent test restore.

DR drills and recovery checks

Table-top: role coordination, contacts, communications.
Technical: sandbox recovery, RTO measurement, checksum/data comparison.
Black-start: full recovery onto bare metal/a clean cluster.
Runbook catalog: documented recovery steps (runbooks) for each system class.
Automation: periodic "canary" restore and verification of checksums.
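
The checksum step of a canary restore can be as simple as comparing a manifest taken at backup time with the restored tree. This is a minimal sketch using SHA-256 from the standard library; the function names and the manifest format (relative path to hex digest) are assumptions, not the format of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return relative paths that are missing or differ after the restore."""
    restored = build_manifest(restored_root)
    return sorted(name for name, digest in manifest.items()
                  if restored.get(name) != digest)
```

An empty result means every file in the manifest was restored bit-for-bit. Files that exist only in the restored tree are ignored here; whether they should count as an error depends on the system.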

Practical templates

1) PostgreSQL (pgBackRest + WAL archive to object)

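```ini
[global]
# S3-compatible repository (here: a MinIO endpoint)
repo1-type=s3
repo1-path=/pgbackups
repo1-s3-endpoint=minio.local:9000
repo1-s3-bucket=pg-wal
repo1-s3-key=ACCESSKEY
repo1-s3-key-secret=SECRET
# Keep the last 8 full backups
repo1-retention-full=8
# Force a checkpoint so the backup starts immediately
start-fast=y
compress-type=zst
```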

2) wal-g (ENV example)

```bash
export WALG_S3_PREFIX=s3://pg-wal/prod
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export WALG_COMPRESSION_METHOD=zstd
```

3) Velero (K8s - object + immutability of the bucket)

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata: { name: default, namespace: velero }
spec:
  provider: aws
  objectStorage:
    bucket: k8s-backups
  config:
    s3Url: https://minio.example
    s3ForcePathStyle: "true"
    publicUrl: https://minio.example
```

4) Object Lock policy (example 'mc')

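```bash
# Enable versioning on the bucket (alias "my", bucket "backups")
mc version enable my/backups
# Default retention for new objects: COMPLIANCE mode, 365 days
# (the bucket typically needs Object Lock enabled when it is created)
mc retention set --default COMPLIANCE 365d my/backups
```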

5) Example of GFS schedule (concept)

Daily: log increments every 15 min, plus a daily synthetic full.
Weekly: one synthetic full, kept for 8 weeks.
Monthly: one full, kept for 12-24 months (archive/tape).
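
A GFS schedule like this one turns into a retention rule that decides, for each daily full, whether it is still needed. The sketch below is one possible reading, with assumptions stated loudly: dailies are kept for 7 days (the schedule does not say), the weekly copy is the Sunday backup, the monthly copy is the one taken on the 1st, and a month is approximated as 31 days.

```python
from datetime import date, timedelta


def gfs_keep(backup_days: list[date], today: date, daily_days: int = 7,
             weekly_weeks: int = 8, monthly_months: int = 12) -> set[date]:
    """Return the backup dates a GFS retention policy would keep."""
    keep = set()
    for d in backup_days:
        age = (today - d).days
        if age < 0:
            continue  # ignore dates in the future
        if age < daily_days:
            keep.add(d)  # son: recent dailies
        elif d.weekday() == 6 and age < weekly_weeks * 7:
            keep.add(d)  # father: Sunday backups (assumption)
        elif d.day == 1 and age < monthly_months * 31:
            keep.add(d)  # grandfather: first-of-month backups (assumption)
    return keep


today = date(2024, 6, 30)
days = [today - timedelta(days=i) for i in range(400)]
kept = gfs_keep(days, today)
print(len(kept), "of", len(days), "daily fulls retained")
```

Everything outside the returned set is a candidate for deletion once its immutability period has expired; the retention rule and the Object Lock period need to agree, otherwise pruning will fail or copies will be unlocked too early.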

Implementation checklist

Data classes, owners, and RPO/RTO/SLO are defined.
Replication model (sync/async) and topology (AZ/Region/Cloud) are selected.
Backups are configured: full/incremental/PITR, schedules, catalogs.
Immutability (WORM/Object Lock/immutable snapshots) and offsite/air-gap copies are enabled.
Encryption with KMS/Vault, separate roles, and key rotation are in place.
Monitoring covers job success, log lag, free space, and test restores; alerts are configured.
Recovery and failover runbooks are written; contacts, escalations, and communication templates are ready.
Monthly DR drills with a report; plans are adjusted afterwards.
Budget and FinOps: storage/egress cost, archiving/tiering plan.

Common errors

"There is a replica, so no backup is needed": logical deletions and ransomware propagate to the replica.
No recovery tests - the backup exists only "in theory."
No immutability or offsite copy - a single point of failure.
The same account/keys for prod and backups - one compromise means losing everything.
Backup windows that are too long collide with peak load; no throttling or QoS.
PITR without log lag monitoring.
Ignoring app-consistent snapshots - restored volumes come back inconsistent.

iGaming/fintech specific

Wallet/payment core: RPO ≤ 1-5 min, RTO ≤ 15 min; logs (WAL/binlog) to object storage with WORM; synchronous within the zone + asynchronous to a remote region.
Reporting/regulatory: immutable repositories, long retention (years), verifiable integrity, clear procedures for providing data to regulators.
Logs/raw events/anti-fraud: cheap long-lived object storage + lifecycle rules; indexes and data marts are handled separately.
Peaks (matches/tournaments): backup windows outside peaks, throttling; DR plans for the event period; canary restores before promotions.

Conclusion

Data protection is an architectural discipline: 3-2-1-1-0, versioning and immutability, RPO/RTO expressed as SLOs, regular DR exercises, and hands-on recovery testing. Combine replication for uptime and fast failover with backups for logical errors and security compromises. Automate, measure, document - and you will have a working path back even on the worst day.