GambleHub

Disaster Recovery Scenarios

1) Why DR is needed and what it is for

Disaster Recovery (DR) is a set of architectural decisions, processes, and drills for restoring services after major failures (data center/region outage, data loss, mass configuration errors). The goal of DR is to meet target RTO/RPO at a controlled cost and risk while maintaining customer trust and regulatory compliance.

RTO (Recovery Time Objective): allowed downtime.
RPO (Recovery Point Objective): allowed data loss (time since the last consistent point).
RLO (Recovery Level Objective): the level of functionality that must return first (minimum viable service).

2) Classification of systems by criticality

Tier 0 (vital): payments, login, KYC, core transactions - RTO ≤ 15 min, RPO ≤ 1-5 min.
Tier 1 (high): operator panels, D-1 reports - RTO ≤ 1 h, RPO ≤ 15-60 min.
Tier 2 (medium): back office, near-real-time analytics - RTO ≤ 4-8 h, RPO ≤ 4-8 h.
Tier 3 (low): non-critical auxiliary systems - RTO ≤ 24-72 h, RPO ≤ 24 h.

Assign a Tier and target RTO/RPO to each service in the service catalog; check architectural decisions and budgets against them.
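
A minimal sketch of how such a catalog entry might be modeled. The field names, the example service, and the helper are illustrative assumptions, not an existing GambleHub schema.

from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class ServiceDRProfile:
    """One service-catalog entry with its DR targets (illustrative schema)."""
    name: str
    tier: int                    # 0..3, per the classification above
    rto: timedelta               # maximum allowed downtime
    rpo: timedelta               # maximum allowed data loss
    owner: str
    dependencies: list[str] = field(default_factory=list)
    dr_pattern: str = "backup-restore"  # active-active / hot / warm / pilot-light / backup-restore

# Hypothetical Tier 0 entry: payments must meet RTO <= 15 min, RPO <= 5 min.
payments = ServiceDRProfile(
    name="payments-api",
    tier=0,
    rto=timedelta(minutes=15),
    rpo=timedelta(minutes=5),
    owner="@platform",
    dependencies=["auth", "primary-db", "psp-gateway"],
    dr_pattern="active-active",
)

def violates_targets(p: ServiceDRProfile, actual_rto: timedelta, actual_rpo: timedelta) -> bool:
    """Compare measured recovery figures against the declared targets."""
    return actual_rto > p.rto or actual_rpo > p.rpo

# Example check: an 12-minute outage with 3 minutes of data loss stays within targets.
assert not violates_targets(payments, timedelta(minutes=12), timedelta(minutes=3))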

3) Threat model and scenarios

Technical: AZ/region/provider failure, network/DNS degradation, database/storage failure, a mass release bug.
Human factor: erroneous configs/IaC, data deletion, key compromise.
Natural/external: fire/flood, power outages, legal blocking.
For each threat, assess probability and impact and link it to a DR scenario and playbook.

4) DR architecture patterns

1. Active-Active (Multi-Region): Both regions serve traffic.

Pros: minimal RTO/RPO, high resilience.
Cons: data/consistency complexity, high cost.
Where it fits: read-heavy and cached workloads, stateless services, multi-master databases (with strict conflict-resolution rules).

2. Active-Passive (Hot Standby): the passive region keeps a fully warmed-up copy.

RTO: minutes; RPO: minutes. Requires automated failover and replication.

3. Warm Standby: a subset of resources is kept warm and scaled out during an incident.

RTO: tens of minutes; RPO: 15-60 minutes. Cheaper, but slower to recover.

4. Pilot Light: a minimal core (metadata/images/scripts) is kept ready, with rapid scale-out on demand.

RTO: hours; RPO: hours. Cheap; suitable for Tier 2-3.

5. Backup & Restore: offline backups + manual provisioning.

RTO/RPO: hours to days. Only for low-criticality systems and archives.

5) Data and consistency

Database replication:
  • Synchronous - near-zero RPO, but higher latency and cost.
  • Asynchronous - better performance, but RPO > 0 (the tail of the log can be lost).
  • Consistency: choose a model (strong/eventual/causal) - strong for payments, eventual for analytics.
  • Snapshots: create consistent points regularly and retain change logs (WAL/redo).
  • Cross-region transactions: avoid 2PC; use idempotent operations, retries with deduplication, and event sourcing (see the sketch after this list).
  • Queues/buses: replication/mirroring, DLQ, ordering and consumer idempotency.
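
To make the retry-with-deduplication point concrete, here is a minimal sketch: each operation carries an idempotency key, and a store of processed keys prevents double-applying a payment that is redelivered after a regional failover. The function names and the in-memory set are assumptions for illustration; in production the key store would be a durable, replicated table with a unique constraint.

import uuid

# In production this would be a durable, replicated store; an in-memory set is
# used here only for illustration.
_processed_keys: set[str] = set()

def apply_payment(account: str, amount_cents: int) -> None:
    """Placeholder for the actual side effect (ledger write, PSP call, etc.)."""
    print(f"applied {amount_cents} to {account}")

def process_payment(idempotency_key: str, account: str, amount_cents: int) -> str:
    """Apply the payment at most once, no matter how many times it is (re)delivered."""
    if idempotency_key in _processed_keys:
        return "duplicate-ignored"          # safe to ack the redelivered message
    apply_payment(account, amount_cents)
    _processed_keys.add(idempotency_key)    # must be atomic with the side effect in real life
    return "applied"

# The producer generates the key once and reuses it on every retry/redelivery.
key = str(uuid.uuid4())
assert process_payment(key, "acc-42", 1500) == "applied"
assert process_payment(key, "acc-42", 1500) == "duplicate-ignored"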

6) Network, traffic and DNS

GSLB/Anycast/DNS: failover/failback policies, low TTL (but not excessively low), health checks from several regions.
L7 routing: regional routing maps, degradation flags (feature restriction).
Private links/VPN: backup channels to providers (PSP/KYC/CDN).
Rate limiting: protection against request storms during recovery (see the sketch below).
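
As an illustration of the rate-limiting point, a minimal token-bucket sketch that could shield a recovering backend from a reconnect storm; the class, limits, and usage are assumptions, not part of any specific gateway.

import time

class TokenBucket:
    """Simple token bucket: sustain `rate` requests/second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# During failover, e.g. cap re-login traffic at 200 rps sustained, bursts up to 400.
login_limiter = TokenBucket(rate=200.0, capacity=400.0)
if not login_limiter.allow():
    pass  # respond 429 / serve the degraded UX instead of overloading the warming region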

7) Stateful vs Stateless

Stateless services can be redeployed by script/autoscaling; stateful services require a consistent data strategy (replication, snapshots, replica promotion, quorum).
Cache/sessions: external stores (Redis/Memcached) with cross-region replication or re-seeding from logs; keep sessions in tokens (JWT) or shared storage (see the sketch below).
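
A minimal sketch of keeping the session in a signed token so that any region can validate it without a shared session store, using the PyJWT library; the claim names, TTL, and secret handling are simplified assumptions.

# pip install PyJWT
import time
import jwt  # PyJWT

SECRET = "replace-with-a-key-distributed-to-all-regions"  # e.g. via a secrets manager

def issue_session(user_id: str, ttl_seconds: int = 900) -> str:
    """Issue a short-lived, self-contained session token."""
    now = int(time.time())
    claims = {"sub": user_id, "iat": now, "exp": now + ttl_seconds, "region": "eu"}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def validate_session(token: str) -> dict:
    """Any region holding the key can validate the token - no session DB lookup needed."""
    return jwt.decode(token, SECRET, algorithms=["HS256"])

token = issue_session("user-123")
print(validate_session(token)["sub"])  # works identically in EU and US after failover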

8) DR triggers and automation

SLO guardrails and quorum probes → an automatic region-failover runbook (see the trigger sketch below).
Change freeze during an incident: block irrelevant releases/migrations.
Infrastructure as Code: deploy standby manifests, check for drift.
Role promotion: automatically promote the replica DB, re-point writers, and switch secrets/endpoints.
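
A sketch of the trigger logic only (not the failover itself): declare a regional failover when a quorum of independent probes sees the region as unhealthy and the error-budget burn rate is breached. Probe sources, thresholds, and the runbook hook are illustrative assumptions.

def region_unhealthy(probe_results: dict[str, bool], quorum: int = 2) -> bool:
    """probe_results maps probe location -> 'saw the region as healthy?'."""
    failures = sum(1 for healthy in probe_results.values() if not healthy)
    return failures >= quorum

def burn_rate_breached(error_rate: float, slo_error_budget: float, window_factor: float = 14.4) -> bool:
    """Classic fast-burn check: the error rate consumes the budget window_factor times too fast."""
    return error_rate > slo_error_budget * window_factor

def should_trigger_failover(probe_results: dict[str, bool],
                            error_rate: float,
                            slo_error_budget: float) -> bool:
    # Require BOTH signals to avoid failing over on a single flaky probe.
    return region_unhealthy(probe_results) and burn_rate_breached(error_rate, slo_error_budget)

# Example: two of three vantage points see EU as down, and the 5xx rate is far
# above the 99.9% SLO budget (0.1%).
probes = {"us-east": False, "apac": False, "external": True}
if should_trigger_failover(probes, error_rate=0.05, slo_error_budget=0.001):
    print("invoke region-failover runbook (DR-REGION-FAILOVER-01)")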

9) Communications and Compliance

War room: IC/TL/Comms/Scribe roles; SEV update intervals.
Status page: affected geography, ETA, workarounds.
Regulatory: notification deadlines, data security, immutable evidence storage.
Partners/providers: confirmed contacts, a dedicated channel.

10) DR tests and exercises

Tabletop: walk through the scenario and decisions.
Game Day (stage/prod-light): simulate AZ/region failure, provider shutdown, DNS reset.
Restore tests: periodically restore backups in isolation and validate integrity (see the sketch after this list).
Chaos/failure injection: controlled network/node/dependency failures.
Exercise KPIs: achieved RTO/RPO, playbook defects, CAPA.
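
A sketch of what an automated restore test might check once a backup has been restored into an isolated instance: compare row counts of critical tables (in practice against figures recorded at backup time, simplified here to a direct comparison with the source) and fail loudly on divergence. The DSNs, the table list, and the psycopg2 usage are assumptions for a PostgreSQL-style setup.

# pip install psycopg2-binary
import psycopg2

CRITICAL_TABLES = ["payments", "accounts", "kyc_documents"]  # illustrative list

def row_counts(dsn: str, tables: list[str]) -> dict[str, int]:
    """Count rows per table on one database (source or restored copy)."""
    counts = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in tables:
            cur.execute(f"SELECT count(*) FROM {table}")  # table names come from a trusted list
            counts[table] = cur.fetchone()[0]
    return counts

def validate_restore(source_dsn: str, restored_dsn: str) -> bool:
    """A restore 'succeeds' only if the restored copy matches on critical tables."""
    src = row_counts(source_dsn, CRITICAL_TABLES)
    dst = row_counts(restored_dsn, CRITICAL_TABLES)
    mismatches = {t: (src[t], dst[t]) for t in CRITICAL_TABLES if src[t] != dst[t]}
    if mismatches:
        print(f"restore test FAILED: {mismatches}")
        return False
    print("restore test passed")
    return True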

11) Finance and Strategy Selection (FinOps)

Put a price on lower RPO/RTO: the tighter the targets, the more expensive the channels, licenses, and reserve capacity.
Hybrid approach: Tier 0 - active-active/hot; Tier 1 - warm; Tier 2-3 - pilot light/backup.
Bulky or rarely accessed data: use cold storage tiers (archive/S3 Glacier), incremental snapshots, deduplication.
Periodically review DR infrastructure costs and certificates/licenses.

12) DR Maturity Metrics

RTO (actual) and RPO (actual) for each Tier.
DR Coverage: % of services with a designed scenario/playbook/test (see the computation sketch after this list).
Backup Success & Restore Success: daily success rate of backups and verified restores.
Time-to-Declare Disaster: how quickly the failover decision is made.
Failback Time: time to return to the normal topology.
Exercise Defect Rate: gaps found per exercise.
Compliance Evidence Completeness.
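
A sketch of how the coverage and actual RTO/RPO metrics might be computed from a service catalog and an incident record; the record shapes and values are assumptions for illustration.

from datetime import datetime

# Hypothetical records: what the service catalog and incident log might provide.
services = [
    {"name": "payments-api", "tier": 0, "has_playbook": True,  "last_dr_test": "2024-05-10"},
    {"name": "reporting",    "tier": 1, "has_playbook": True,  "last_dr_test": None},
    {"name": "back-office",  "tier": 2, "has_playbook": False, "last_dr_test": None},
]

incident = {
    "detected_at":  datetime(2024, 6, 1, 12, 1),
    "declared_at":  datetime(2024, 6, 1, 12, 7),
    "restored_at":  datetime(2024, 6, 1, 12, 19),
    "last_consistent_point": datetime(2024, 6, 1, 12, 0),
}

def dr_coverage(services: list[dict]) -> float:
    """Share of services that have both a playbook and at least one executed DR test."""
    covered = sum(1 for s in services if s["has_playbook"] and s["last_dr_test"])
    return covered / len(services)

actual_rto = incident["restored_at"] - incident["detected_at"]            # 18 minutes
actual_rpo = incident["detected_at"] - incident["last_consistent_point"]  # 1 minute
time_to_declare = incident["declared_at"] - incident["detected_at"]       # 6 minutes

print(f"DR coverage: {dr_coverage(services):.0%}, actual RTO: {actual_rto}, "
      f"actual RPO: {actual_rpo}, time-to-declare: {time_to_declare}")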

13) Checklists

Before DR implementation

  • Service catalog contains Tier, RTO/RPO, dependencies, and owners.
  • Selected pattern (AA/AP/WS/PL/BR) by Tier and budget.
  • Consistency and replication agreements are documented.
  • GSLB/DNS/routing and health-checks configured and tested.
  • Backups, snapshots, and change logs are enabled and restore-tested.
  • DR playbooks and provider contacts are up to date.

During the accident (briefly)

  • Declare a SEV and assemble a war-room; freeze releases.
  • Check quorum of probes; record the impact/geography.
  • Execute the failover runbook: traffic, DB promotion, queues, cache.
  • Enable degraded UX/limits; publish updates per the SLA.
  • Collect evidence (timeline, graphs, logs, commands).

After the accident

  • Observe SLOs for N intervals; execute failback as planned.
  • Conduct an AAR/RCA; issue CAPA items.
  • Update playbooks, alert catalogs, and DR test cases.
  • Report to stakeholders/regulators (if necessary).

14) Templates

14.1 DR scenario card (example)


ID: DR-REGION-FAILOVER-01
Scope: prod EU ↔ prod US
Tier: 0 (Payments, Auth)
Targets: RTO ≤ 15m, RPO ≤ 5m
Trigger: quorum(probes EU, US) + burn-rate breach + provider status=red
Actions:
- Traffic: GSLB shift EU→US (25→50→100% with green SLIs)
- DB: promote US-replica to primary; re-point writers; freeze schema changes
- MQ: mirror switch; drain EU DLQ; idempotent reprocess
- Cache: invalidate region-specific keys; warm critical sets
- Features: enable degrade_payments_ux
- Comms: status page update q=15m; partners notify
Guardrails: payment_success ≥ 98%, p95 ≤ 300ms
Rollback/Failback: EU green 60m → 25→50→100% with guardrails
Owners: IC @platform, DB @data, Network @netops, Comms @support
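
The traffic step in the card above ("25→50→100% with green SLIs") can be expressed as a small gated loop. The weight-setting and SLI-reading functions below are placeholders for whatever GSLB and monitoring APIs are actually in use; the soak time and thresholds mirror the card's guardrails.

import time

STAGES = [25, 50, 100]  # percent of traffic shifted to the surviving region

def set_gslb_weight(region: str, percent: int) -> None:
    """Placeholder for the real GSLB/DNS API call."""
    print(f"shifting {percent}% of traffic to {region}")

def slis_green() -> bool:
    """Placeholder: read payment_success and p95 latency from monitoring."""
    payment_success, p95_ms = 0.987, 240   # would come from the metrics backend
    return payment_success >= 0.98 and p95_ms <= 300

def staged_failover(target_region: str, soak_seconds: int = 300) -> bool:
    """Shift traffic in stages, holding at each stage until SLIs stay green."""
    for percent in STAGES:
        set_gslb_weight(target_region, percent)
        time.sleep(soak_seconds)           # let the new weight take effect (DNS TTL, warm caches)
        if not slis_green():
            print(f"guardrail breached at {percent}% - holding / rolling back per runbook")
            return False
    return True

# staged_failover("us")  # invoked from the region-failover runbook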

14.2 Runbook "Promote replica database" (fragment)


1) Freeze writes; verify WAL applied (lag ≤ 30s)
2) Promote replica; update cluster VIP / writer endpoint
3) Rotate app secrets/endpoints via remote config
4) Validate: read/write checks, consistency, replication restart to new secondary
5) Lift freeze, monitor errors p95/5xx for 30m
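
For a PostgreSQL-style setup, steps 1-2 of the fragment might look like the sketch below; pg_is_in_recovery(), pg_last_xact_replay_timestamp(), and pg_promote() are real PostgreSQL functions, while the DSN, the lag threshold, and the surrounding workflow are assumptions.

# pip install psycopg2-binary
import psycopg2

REPLICA_DSN = "host=us-replica dbname=core user=dr_admin"  # illustrative DSN
MAX_LAG_SECONDS = 30

def replication_lag_seconds(cur) -> float:
    """Approximate apply lag on the replica."""
    cur.execute("SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)")
    return float(cur.fetchone()[0])

def promote_replica() -> None:
    with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
        lag = replication_lag_seconds(cur)
        if lag > MAX_LAG_SECONDS:
            raise RuntimeError(f"replay lag {lag:.0f}s exceeds {MAX_LAG_SECONDS}s - do not promote yet")
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            raise RuntimeError("node is already a primary - aborting to avoid split brain")
        cur.execute("SELECT pg_promote()")  # PostgreSQL 12+
        # Next steps (outside this sketch): update the writer endpoint/VIP and rotate app config.

# promote_replica()  # called from the runbook after the write freeze in step 1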

14.3 DR exercise plan (brief)


Purpose: verify Tier 0 RTO/RPO during an EU region failure
Scenario: EU incoming LB down + 60s replication delay
Success criteria: 100% traffic in US ≤ 12m; RPO ≤ 5m; SLI green 30m
Artifacts: switching logs, SLI graphs, step times, command output

15) Anti-patterns

"There are backups" without regular restore tests.
Secrets/endpoints are not switched automatically.
No idempotency → duplicated/lost transactions on redelivery.
Identical configs across regions without degradation feature flags.
Long Time-to-Declare out of fear of a "false alarm."
Single-region providers (PSP/KYC) with no alternative.
No failback plan - living in the emergency topology "forever."

16) Implementation Roadmap (6-10 weeks)

1. Weeks 1-2: classify services by Tier, set target RTO/RPO, choose DR patterns.
2. Weeks 3-4: set up replication/backups, GSLB/DNS, and promotion procedures; write playbooks and runbooks.
3. Weeks 5-6: first DR exercises (tabletop → stage), capture metrics and CAPA.
4. Weeks 7-8: prod-light exercise with restricted traffic; automate failover.
5. Weeks 9-10: cost optimization (FinOps), move Tier 0 to hot/active-active, establish a quarterly exercise and reporting cadence.

17) The bottom line

Effective DR is not just about backups. It is a combination of consistent architecture, failover/failback automation, data discipline (idempotency, replication), regular training, and transparent communication. When RTO/RPO targets are realistic, playbooks are rehearsed, and exercises are regular, a disaster becomes a controlled event after which services return to normal quickly and predictably.
