GambleHub

Disaster Recovery Scenarios

1) Why DR is needed and what it is for

Disaster Recovery (DR) is a set of architectures, processes, and drills for restoring services after disasters (data-center/region failure, data loss, mass configuration errors). DR's goal is to meet target RTO/RPO at controlled cost and risk while maintaining customer trust and regulatory compliance.

Recovery Time Objective (RTO): allowed downtime.
Recovery Point Objective (RPO): allowable data loss (time since the last consistent point).
Recovery Level Objective (RLO): the level of functionality that must return first (minimum viable service).

2) Classification of systems by criticality

Tier 0 (vital): payments, login, KYC, core transactions - RTO ≤ 15 min, RPO ≤ 1-5 min.
Tier 1 (high): operational dashboards, D-1 reports - RTO ≤ 1 h, RPO ≤ 15-60 min.
Tier 2 (medium): back office, near-real-time analytics - RTO ≤ 4-8 h, RPO ≤ 4-8 h.
Tier 3 (low): non-critical auxiliary - RTO ≤ 24-72 h, RPO ≤ 24 h.

Assign a Tier and target RTO/RPO to each service in the service catalog; check decisions and budgets against them.
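A minimal sketch of such a catalog check, assuming illustrative service names and the tier ceilings from section 2 (all limits in minutes; the entries and helper are hypothetical, not an existing GambleHub API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceEntry:
    name: str
    tier: int      # 0 = vital … 3 = low
    rto_min: int   # target RTO, minutes
    rpo_min: int   # target RPO, minutes

# Illustrative catalog entries
CATALOG = [
    ServiceEntry("payments", 0, 15, 5),
    ServiceEntry("reports-d1", 1, 60, 60),
    ServiceEntry("backoffice", 2, 480, 480),
]

# Tier ceilings (max RTO, max RPO) from section 2, in minutes
TIER_LIMITS = {0: (15, 5), 1: (60, 60), 2: (480, 480), 3: (4320, 1440)}

def violates_tier(entry: ServiceEntry) -> bool:
    """True if a service's targets are looser than its Tier allows."""
    max_rto, max_rpo = TIER_LIMITS[entry.tier]
    return entry.rto_min > max_rto or entry.rpo_min > max_rpo
```

A check like this can run in CI against the service catalog so that budget or architecture decisions that loosen a Tier 0 target fail loudly.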

3) Threat model and scenarios

Technical: AZ/region/provider failure, network/DNS degradation, database/storage failure, mass release bug.
Human factor: erroneous configs/IaC, data deletion, key compromise.
Natural/external: fire/flood, power outages, legal blockages.
For each threat, evaluate probability/impact and link it to a DR scenario and playbook.

4) DR architecture patterns

1. Active-Active (Multi-Region): both regions serve traffic.

Pros: minimal RTO/RPO, high availability.
Cons: data-consistency complexity, high cost.
Where: read-heavy/cached workloads, stateless services, multi-master DBs (with strict conflict-resolution rules).

2. Active-Passive (Hot Standby): the passive region holds a fully warmed copy.

RTO: minutes; RPO: minutes. Requires automated failover and replication.

3. Warm Standby: a subset of resources is kept warm and scaled up during an incident.

RTO: tens of minutes; RPO: 15-60 minutes. Cheaper, but slower to recover.

4. Pilot Light: a minimal "spark" (metadata/images/scripts) + rapid scale-out.

RTO: hours; RPO: hours. Cheap; suitable for Tier 2-3.

5. Backup & Restore: offline backups + manual restore and warm-up.

RTO/RPO: hours/days. Only for low-criticality systems and archives.

5) Data and consistency

Database replication:
  • Synchronous - near-zero RPO, but higher latency and cost.
  • Asynchronous - better performance, RPO > 0 (the tail of the log can be lost).
  • Consistency: choose a model (strong/eventual/causal). For payments - strong; for analytics - eventual.
  • Snapshots: create consistent points regularly and retain transaction logs (WAL/redo).
  • Cross-region transactions: avoid 2PC; use idempotent operations, retries with deduplication, event sourcing.
  • Queues/buses: replication/mirroring, DLQ, ordering, and idempotent consumers.

6) Network, traffic and DNS

GSLB/Anycast/DNS: failover/failback policies, low TTLs (but not too low), health checks from several regions.
L7 routing: regional routing maps, degradation flags (feature restriction).
Private links/VPN: backup channels to providers (PSP/KYC/CDN).
Rate limiting: protection against request storms during recovery.
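The rate-limiting point can be illustrated with a classic token bucket, which caps bursts while warming caches and rehydrating replicas; a minimal sketch (the class and parameters are illustrative, not an existing component):

```python
class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second up to `capacity`;
    each allowed request consumes one token."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0  # timestamp of the last call, seconds

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

During recovery the capacity can be set well below normal throughput and raised step by step as SLIs turn green.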

7) Stateful vs Stateless

Stateless services recover via scripted redeploy/autoscale; stateful services require a consistent data strategy (replication, snapshots, replica promotion, quorum).
Cache/sessions: external stores (Redis/Memcached) with cross-region replication or re-seeding from logs; keep sessions in tokens (JWT) or shared storage.
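The token-based session idea can be sketched with a signed, self-contained token (the same principle as a JWT), so any region can validate a session without shared session storage. A minimal stdlib-only illustration; the secret and payload are placeholders, and a real deployment would use a managed secret and a standard JWT library:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # illustrative; use a managed secret in production

def issue_token(payload: dict) -> str:
    """Encode the payload and sign it, producing 'body.signature'."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def verify_token(token: str):
    """Return the payload if the signature is valid, else None."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))
```

Because validation needs only the shared secret, a failover region can accept existing sessions immediately, with no cross-region session replication.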

8) DR triggers and automation

SLO guardrails and quorum probes → an automated region-failover runbook.
Change freeze during an incident: block non-essential releases/migrations.
Infrastructure as Code: deploy standby manifests, check for drift.
Role promotion: automatically promote the replica DB, re-point writers, and rotate secrets.
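The quorum-probe trigger above can be sketched as a pure decision function: failover is declared only when enough independent external probes agree the region is down AND the SLO burn rate is breached, guarding against a single flaky probe. The function name and signature are illustrative:

```python
def should_declare_failover(probe_results: dict,
                            burn_rate_breached: bool,
                            quorum: int = 2) -> bool:
    """probe_results maps probe location -> True (region reachable) / False.
    Declare failover only on a failure quorum plus an SLO burn-rate breach."""
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures >= quorum and burn_rate_breached
```

In practice this decision would gate the automated runbook (traffic shift, DB promotion), with the quorum size and burn-rate window tuned per Tier.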

9) Communications and Compliance

War-room: IC/TL/Comms/Scribe; SEV update intervals.
Status page: impact geography, ETA, workarounds.
Regulatory: notification deadlines, data protection, immutable evidence storage.
Partners/providers: confirmed contacts, dedicated channel.

10) DR tests and exercises

Tabletop: walk through the scenario and decisions on paper.
Game Day (stage/prod-light): simulate AZ/region failure, provider outage, DNS failover.
Restore tests: periodically restore backups in isolation and validate integrity.
Chaos/failure injection: controlled network/node/dependency failures.
Exercise KPIs: achieved RTO/RPO, playbook defects, CAPAs.
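A restore test can be reduced to a small validation step: compare the restored data against the backup's recorded checksum and expected row count, and report defects for the exercise KPI. A minimal sketch with illustrative inputs:

```python
import hashlib

def validate_restore(backup_checksum: str, restored_bytes: bytes,
                     expected_rows: int, restored_rows: int) -> list:
    """Return a list of defects found; an empty list means the restore passed."""
    defects = []
    if hashlib.sha256(restored_bytes).hexdigest() != backup_checksum:
        defects.append("checksum mismatch")
    if restored_rows != expected_rows:
        defects.append(f"row count {restored_rows} != {expected_rows}")
    return defects
```

Running this on every periodic restore turns "we have backups" into "we have proven restores", feeding the Backup Success & Restore Success metric in section 12.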

11) Finance and Strategy Selection (FinOps)

Price out each reduction in RPO/RTO: the lower the targets, the more expensive the channels, licenses, and reserves.
Hybrid approach: Tier 0 - active-active/hot; Tier 1 - warm; Tier 2-3 - pilot light/backup.
Expensive data: use cold tiers (archive storage such as S3 Glacier), incremental snapshots, deduplication.
Periodically review DR infrastructure costs and certificates/licenses.

12) DR Maturity Metrics

RTO (actual) and RPO (actual) for each Tier.
DR Coverage: % of services with a designed scenario/playbook/test.
Backup Success & Restore Success: daily backup success rate and verified restores.
Time-to-Declare Disaster: speed of the failover decision.
Failback Time: time to return to the normal topology.
Exercise Defect Rate: gaps found and lessons learned per exercise.
Compliance Evidence Completeness.
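The DR Coverage metric above can be computed mechanically from the service catalog; a sketch, assuming each service record carries hypothetical `scenario`/`playbook`/`tested` flags:

```python
def dr_coverage(services: list) -> float:
    """Percentage of services with a designed scenario, a playbook, AND a
    passed DR test; returns 0.0 for an empty catalog."""
    if not services:
        return 0.0
    covered = sum(1 for s in services
                  if s.get("scenario") and s.get("playbook") and s.get("tested"))
    return round(100 * covered / len(services), 1)
```

Requiring all three flags keeps the metric honest: a playbook that was never exercised does not count as coverage.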

13) Checklists

Before DR implementation

  • The service catalog contains Tier, RTO/RPO, dependencies, and owners.
  • Selected pattern (AA/AP/WS/PL/BR) by Tier and budget.
  • Consistency and replication agreements are documented.
  • GSLB/DNS/routing and health-checks configured and tested.
  • Backups, snapshots, and change logs are enabled and restore-tested.
  • DR playbooks and provider contacts are up to date.

During the accident (briefly)

  • Declare a SEV and assemble the war-room; freeze releases.
  • Check the probe quorum; record the impact/geography.
  • Execute the failover runbook: traffic, DB promotion, queues, cache.
  • Enable degraded-UX/limits; publish status updates per the SLA.
  • Collect evidence (timeline, graphs, logs, commands).

After the accident

  • Observe SLOs for N intervals; execute failback as planned.
  • Conduct an AAR/RCA; issue CAPAs.
  • Update playbooks, alert catalogs, and DR test cases.
  • Report to stakeholders/regulators (if required).

14) Templates

14.1 DR scenario card (example)


ID: DR-REGION-FAILOVER-01
Scope: prod EU ↔ prod US
Tier: 0 (Payments, Auth)
Targets: RTO ≤ 15m, RPO ≤ 5m
Trigger: quorum(probes EU, US) + burn-rate breach + provider status=red
Actions:
- Traffic: GSLB shift EU→US (25→50→100% with green SLIs)
- DB: promote US-replica to primary; re-point writers; freeze schema changes
- MQ: mirror switch; drain EU DLQ; idempotent reprocess
- Cache: invalidate region-specific keys; warm critical sets
- Features: enable degrade_payments_ux
- Comms: status page update q=15m; partners notify
Guardrails: payment_success ≥ 98%, p95 ≤ 300ms
Rollback/Failback: EU green 60m → 25→50→100% with guardrails
Owners: IC @platform, DB @data, Network @netops, Comms @support
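The 25→50→100% traffic shift with guardrails from the card above can be sketched as a simple progressive loop. `check_slis` is a hypothetical callback that returns True while guardrail SLIs (e.g. payment_success ≥ 98%, p95 ≤ 300ms) stay green at the given percentage:

```python
def shift_traffic(check_slis, steps=(25, 50, 100)):
    """Shift traffic step by step, advancing only while SLIs are green.
    Returns (percentages applied, True if the full shift completed)."""
    applied = []
    for pct in steps:
        applied.append(pct)          # e.g. apply GSLB weight = pct
        if not check_slis(pct):
            return applied, False    # halt; operators decide on rollback
    return applied, True
```

Stopping, rather than auto-reverting, keeps a human in the loop when guardrails turn red mid-shift, which matches the war-room model in section 9.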

14.2 Runbook "Promote replica database" (fragment)


1) Freeze writes; verify WAL applied (lag ≤ 30s)
2) Promote replica; update cluster VIP / writer endpoint
3) Rotate app secrets/endpoints via remote config
4) Validate: read/write checks, consistency, replication restart to new secondary
5) Lift freeze, monitor errors p95/5xx for 30m
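The runbook fragment above can be wrapped in an orchestration guard that refuses to start promotion when replication lag would blow the RPO budget. A sketch only: the step names are illustrative labels, and real commands depend on the DBMS (e.g. `pg_promote` in PostgreSQL):

```python
def promote_replica(replication_lag_s: float, max_lag_s: float = 30.0) -> list:
    """Return the ordered promotion steps, or raise if the replica lags more
    than the allowed budget (step 1 of the runbook: verify WAL applied)."""
    if replication_lag_s > max_lag_s:
        raise RuntimeError(
            f"replication lag {replication_lag_s}s exceeds {max_lag_s}s budget")
    return [
        "freeze_writes",
        "promote_replica",
        "repoint_writer_endpoint",
        "rotate_secrets",
        "validate_read_write",
        "lift_freeze",
    ]
```

Encoding the steps as data also makes the runbook auditable: the orchestrator can log each step with a timestamp for the evidence collection required in section 9.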

14.3 DR exercise plan (brief)


Purpose: validate Tier 0 RTO/RPO under an EU region failure
Scenario: EU ingress LB down + 60s replication delay
Success criteria: 100% traffic in US ≤ 12m; RPO ≤ 5m; SLI green 30m
Artifacts: switching logs, SLI graphs, step times, command output

15) Anti-patterns

"We have backups" without regular restore tests.
Secrets/endpoints are not switched automatically.
No idempotency → duplicated/lost transactions on redelivery.
Identical configs across regions without degradation feature flags.
Long Time-to-Declare out of fear of a "false alarm."
Single-region providers (PSP/KYC) with no alternative.
No failback plan - the emergency topology becomes permanent.

16) Implementation Roadmap (6-10 weeks)

1. Weeks 1-2: classify services by Tier, set target RTO/RPO, choose DR patterns.
2. Weeks 3-4: set up replication/backups, GSLB/DNS, promotion procedures; write playbooks and runbooks.
3. Weeks 5-6: first DR exercises (tabletop → stage), capture metrics and CAPAs.
4. Weeks 7-8: traffic-restricted prod-light exercise; failover automation.
5. Weeks 9-10: cost optimization (FinOps), move Tier 0 to hot/AA, establish a quarterly exercise and reporting cadence.

17) The bottom line

Effective DR is not just about backups. It is consistent architecture, failover/failback automation, data discipline (idempotency/replication), regular training, and transparent communications. When RTO/RPO targets are realistic, playbooks are rehearsed, and exercises are regular, a disaster becomes a controlled event after which services return to normal quickly and predictably.
