
Disaster Recovery Plan

1) Purpose, scope and principles

Goal: ensure timely recovery of the IT platform after disasters (technical, cyber, vendor, geopolitical) without violating regulatory requirements, contracts, or player expectations.
Scope: production environments (gaming loop, payments, KYC/AML, anti-fraud, DWH/BI data marts), integrations (PSP, KYC, CDN, studios/aggregators), infrastructure (cloud/K8s, networks, secrets/keys), data (databases, files, logs).
Principles: safety-first, RTO/RPO minimization, automation and reproducibility (IaC), "provability by default," regular exercises.


2) System classification and recovery objectives

2.1 Criticality levels

Tier-1 (vital): payments/cashouts, core games, login/authentication, KYC/sanctions.
Tier-2: real-time analytics, marketing/CRM, DWH reporting.
Tier-3: internal portals, auxiliary services.

2.2 Targets

RTO (Recovery Time Objective): target time to restore a service after a disruption.
RPO (Recovery Point Objective): maximum tolerable data loss, expressed as a time window.
RTA (Recovery Time Actual)/RPA (Recovery Point Actual): actual values, recorded in reports.
MTO/MBCO: maximum tolerable outage/minimum business continuity objective (the minimum acceptable service level in degraded mode).

Example goals (for reference):
  • Tier-1: RTO ≤ 30-60 min, RPO ≤ 15 min; Tier-2: RTO ≤ 4 h, RPO ≤ 1 h; Tier-3: RTO ≤ 24 h, RPO ≤ 24 h.
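
Comparing measured RTA/RPA against the tier targets can be sketched as follows. This is a minimal illustration, not a prescribed tool: the `TARGETS` table simply copies the reference goals above (using the stricter 30-minute end of the Tier-1 RTO range), and the function name is hypothetical.

```python
from datetime import timedelta

# Reference goals from this section; Tier-1 uses the stricter end of the
# 30-60 min RTO range. Replace with the values from your own BIA.
TARGETS = {
    1: {"rto": timedelta(minutes=30), "rpo": timedelta(minutes=15)},
    2: {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    3: {"rto": timedelta(hours=24), "rpo": timedelta(hours=24)},
}


def meets_objectives(tier: int, rta: timedelta, rpa: timedelta) -> tuple[bool, bool]:
    """Return (rto_met, rpo_met) for values measured in a test or incident."""
    target = TARGETS[tier]
    return rta <= target["rto"], rpa <= target["rpo"]
```

A result such as `(False, True)` would mean recovery took too long even though data loss stayed within the allowed window, which would normally be recorded as a deviation in the test report.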

3) DR Strategies and Architecture

3.1 Topologies

Active-Active (multi-region): minimal RTO/RPO, requires consistency and conflict-resolution.
Active-Standby (hot/warm/cold): cost/speed balance.
Geo-separation of data and keys: KMS/HSM per-region, BYOK, independent replication paths.

3.2 Data and backups

PITR (point-in-time recovery): transaction logs, archiving intervals ≤ 5-15 minutes for Tier-1.
Snapshots/full backups: daily/hourly, storage according to the 3-2-1 scheme (3 copies, 2 media, 1 offline/offsite).
Immutability: WORM/object locks, signature/hash chains of artifacts.
Recovery catalog: backup inventory, integrity, expiration date, test decryptions.
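
The "hash chain of artifacts" idea can be illustrated with a short sketch. It assumes each backup artifact already has a SHA-256 digest recorded in the recovery catalog; the function names and the seed value are hypothetical, and a real setup would also sign the final link.

```python
import hashlib


def chain_hashes(artifact_digests, seed=b"backup-chain-v1"):
    """Link digests so that changing or reordering one entry alters every later link."""
    links = []
    prev = hashlib.sha256(seed).hexdigest()
    for digest in artifact_digests:
        # Each link commits to the previous link and the current artifact digest.
        prev = hashlib.sha256((prev + digest).encode("ascii")).hexdigest()
        links.append(prev)
    return links


def verify_chain(artifact_digests, links, seed=b"backup-chain-v1"):
    """Recompute the chain and compare it with the stored links."""
    return chain_hashes(artifact_digests, seed) == links
```

Storing the links on WORM media means a later tampering attempt with any artifact is detectable by recomputing the chain during a restore test.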

3.3 Applications and integrations

Stateless services: rapid deployment via IaC/CI.
Stateful components: consistent snapshots, orchestration of the launch sequence.
Integrations (PSP/KYC/aggregators): dual credentials, fallback endpoints, signed webhooks, re-delivery control (idempotency).
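
Signed webhooks with re-delivery control could look roughly like the sketch below. It assumes an HMAC-SHA256 signature over the raw body and an idempotency key supplied by the sender; the class name is hypothetical and the in-memory set stands in for a durable store.

```python
import hashlib
import hmac


class WebhookReceiver:
    """Verifies signatures and drops re-delivered events by idempotency key."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen = set()  # assumption: a durable store in a real deployment

    def sign(self, body: bytes) -> str:
        return hmac.new(self._secret, body, hashlib.sha256).hexdigest()

    def handle(self, idempotency_key: str, body: bytes, signature: str) -> str:
        # Constant-time comparison avoids leaking signature prefixes.
        if not hmac.compare_digest(self.sign(body), signature):
            return "rejected"
        if idempotency_key in self._seen:
            return "duplicate"
        self._seen.add(idempotency_key)
        return "processed"
```

During failover the same event may arrive from both the primary and the fallback endpoint; the key check keeps the second delivery from being applied twice.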


4) Recovery order (general runbook)

1. Declare a DR scenario → assign the DR Incident Commander (DR-IC), open a war-room.
2. Damage assessment: affected regions/subsystems, current RTA/RPA, decision to activate failover.
3. Isolation/containment: block the root causes (network ACLs, secrets, disconnecting the provider).
4. DR initialization:
  • network/secrets/KMS →
  • DB/Vault/Cache →
  • API/services → front/CDN → external integrations.
5. Integrity check: checksums, dry-run requests, health probes.
6. Reconciliation of finance/games: reconciliation of payments, bets, balances; idempotent replay of transactions.
7. Communications: status page, players/partners/regulators; timeline updates.
8. Observation and stabilization: lift degraded-mode restrictions as the system normalizes.
9. Post-mortem: RCA, CAPA, DRP update.
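
The bring-up sequence in the DR initialization step is a dependency ordering, which can be derived rather than hard-coded. The sketch below uses Python's standard `graphlib`; the component names are illustrative stand-ins for the layers listed above, not an actual inventory.

```python
from graphlib import TopologicalSorter

# component -> components that must be running first (illustrative names)
DEPENDS_ON = {
    "network": set(),
    "secrets_kms": {"network"},
    "database": {"secrets_kms"},
    "vault": {"secrets_kms"},
    "cache": {"secrets_kms"},
    "api": {"database", "vault", "cache"},
    "frontend_cdn": {"api"},
    "external_integrations": {"frontend_cdn"},
}


def startup_order(depends_on):
    """Return one valid bring-up order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(depends_on).static_order())
```

Keeping the dependency map next to the service passports lets the runbook order be regenerated whenever a dependency changes, and a cycle is reported before the incident rather than during it.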

5) Specialist runbooks (snippets)

5.1 Failover Active-Standby → Standby

trigger: "loss_of_region_primary OR quorum_fail >= 5m"
prechecks:
- "secondary region green"
- "replication_lag <= 15m"
steps:
- DR-IC approves region_failover
- Platform: GSLB switch → secondary
- Data: promote replicas, enable PITR streams
- Apps: redeploy with region vars; warm caches
- QA: smoke tests (login, deposit, bet, payout)
- Comms: status-page + partner notice
rollback: "switch-back after 60m stability window"
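
The rollback condition in the runbook above ("switch-back after 60m stability window") could be evaluated along these lines. This is a sketch under assumptions: health samples for the primary region arrive as `(timestamp, healthy)` pairs, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def can_switch_back(samples, now, window=timedelta(minutes=60)):
    """True only if observations cover the whole window and none of them is unhealthy."""
    if not samples:
        return False
    start = now - window
    # Require at least one observation at or before the window start,
    # so a short run of healthy samples cannot pass as a full window.
    covers_window = min(ts for ts, _ in samples) <= start
    all_healthy = all(ok for ts, ok in samples if ts >= start)
    return covers_window and all_healthy
```

Requiring full coverage of the window is a deliberate choice here: missing data is treated as "not yet stable" rather than as healthy.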

5.2 Database corruption/recovery from PITR

trigger: "data_corruption_detected OR accidental_drop"
steps:
- Freeze writes (feature flag), snapshot evidence
- Restore to timestamp T (<= RPO)
- Reindex/consistency checks
- Replay idempotent events from queue (from T)
- Reopen writes in throttle mode
validation: ["checksum_ok", "balance_diff=0", "orders_gap=0"]
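
Choosing the restore timestamp T for the PITR runbook above can be sketched as picking the latest recoverable point before the corruption and checking the resulting data loss against the RPO. The function and parameter names are hypothetical; the list of recoverable points is assumed to come from the recovery catalog.

```python
from datetime import datetime, timedelta


def pick_restore_point(recovery_points, corruption_at, rpo):
    """Return (target, within_rpo); target is None if nothing predates the corruption."""
    candidates = [p for p in recovery_points if p < corruption_at]
    if not candidates:
        return None, False
    target = max(candidates)
    # Data written between target and corruption_at is lost unless replayed from queues.
    return target, (corruption_at - target) <= rpo
```

If `within_rpo` comes back false, the replay of idempotent events from the queue becomes the main way to close the gap, and the deviation belongs in the post-mortem.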

5.3 PSP degradation in DR mode

trigger: "auth_rate_psp1 < baseline-3σ for 15m"
steps:
- Route X%→psp2, cap payouts, enable manual VIP
- Reconciliation plan T+0, alerts Finance
- Notify players in cashier; vendor escalation
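
The trigger in the runbook above ("below baseline minus 3σ for 15 minutes") can be expressed as a small check. The sketch assumes one authorization-rate sample per minute and a baseline series from a normal period; the function name is hypothetical.

```python
from statistics import mean, stdev


def psp_degraded(baseline_rates, recent_rates, sigmas=3.0, minutes=15):
    """True if each of the last `minutes` samples is below mean - sigmas * stdev of the baseline."""
    # stdev() needs at least two baseline samples.
    threshold = mean(baseline_rates) - sigmas * stdev(baseline_rates)
    if len(recent_rates) < minutes:
        return False  # not enough data to claim a sustained drop
    return all(rate < threshold for rate in recent_rates[-minutes:])
```

Requiring every sample in the window to breach the threshold keeps a single bad minute from rerouting traffic; a looser rule (for example, most samples) would trigger earlier at the cost of more false positives.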

6) Data integrity and reconciliation

Finance: reconciliations of deposits/payments/commissions, re-sending notifications and webhooks with deduplication (idempotency-keys).
Game loop: recovery of round states, re-running settlements if necessary, protection against double debits/credits.
Logs/audits: before/after comparison of WORM logs, signatures/hashes, consistency reports.
DPO/Compliance report: if PII is affected, record the scale, timeline, and notifications.
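
A financial reconciliation of the kind described above can be reduced to comparing two keyed sets of transactions. The sketch below assumes both sides are available as `transaction id → amount` mappings with `Decimal` amounts; all names are hypothetical.

```python
from decimal import Decimal


def reconcile(ledger, psp):
    """Compare internal ledger entries with PSP records and list every discrepancy."""
    ledger_ids, psp_ids = set(ledger), set(psp)
    return {
        "missing_in_psp": sorted(ledger_ids - psp_ids),
        "missing_in_ledger": sorted(psp_ids - ledger_ids),
        "amount_mismatch": sorted(
            t for t in ledger_ids & psp_ids if ledger[t] != psp[t]
        ),
    }


def is_zero_diff(report):
    """True when no category contains a discrepancy."""
    return not any(report.values())
```

Using `Decimal` rather than floats avoids rounding artifacts showing up as false mismatches in monetary amounts.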


7) DR for key technologies (examples)

DBMS (relational): synchronous/asynchronous replication, WAL slots, fast-promote, hot standbys.
NoSQL/caches: multi-cluster setups, TTL invalidation, cold cache warm-up, no cross-region writes without conflict resolution.
Queues/streams: mirrored topics/clusters, offset control, consumer deduplication.
Object storage: versioning, replication to an isolated "bunker" location, object inventory, and retention policies.
CI/CD/artifacts: replicas of registries, signature of artifacts, offline copies of critical containers.
Secrets/keys: KMS per-region, independent root keys, break-glass with logging and TTL.


8) Security and privacy in DR

Least privilege: DR access through dedicated roles/profiles (JIT/PAM).
Immutable backups: offline/offsite, with restore and decryption tests.
Regulatory windows: record the event and decide on notifications (regulator/bank/PSP/users) together with Legal/DPO.
Traceability: full activity log of the DR team, signed timeline.


9) Exercises and types of tests

Walkthrough/Review: Document/Role/Contact Review (Quarterly).
Tabletop: dry-run walk-through of scenarios, including conflict resolution.
Technical partial: recovery of a single service/database.
Full failover/switch-over - transfer of traffic and data to the backup region.
Chaos days (controlled): injection of failures to test the automation.

Each test → a report with an RTA/RPA, deviation list, CAPA, and DRP update.


10) Metrics (KPI/KRI)

RTA/RPA vs RTO/RPO (Tier-1): ≥ 95% compliance.
DR Test Coverage: ≥ 2 complete DR tests/year + regular partial.
Time-to-First-Status: ≤ 15 min after DR announcement.
Reconciliation Zero-Diff: all cash and game reconciliations without discrepancies.
Backup Integrity: 100% of spot restores are successful in a quarter.
Config Drift: 0 drift between primary/secondary (IaC comparison).
Security in DR: 100% DR activities with log and confirmation.
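
The Config Drift metric can be approximated by diffing the rendered configuration of the two regions. The sketch below assumes each region's configuration has been flattened to a `key → value` mapping and that some keys (here `region`) are expected to differ; the names are hypothetical.

```python
def config_drift(primary, secondary, ignore=frozenset({"region"})):
    """Return {key: (primary_value, secondary_value)} for every unexpected difference."""
    keys = (set(primary) | set(secondary)) - set(ignore)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }
```

An empty result corresponds to the "0 drift" target; anything else is a candidate CAPA item before the next failover test.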


11) RACI (high-level)

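| Activity | DR-IC | Platform/SRE | Data/DBA | Security/DPO | Payments | Risk/KYC | Product/Eng | Comms/PR | Legal/Compliance |
|---|---|---|---|---|---|---|---|---|---|
| DR announcement | A/R | C | C | C | C | C | C | C | C |
| Failover/bring-up | C | A/R | R | C | C | C | R | I | I |
| Validation/health | C | R | A/R | C | C | C | R | I | I |
| Reconciliation | I | R | A/R | I | R | R | R | I | I |
| Communications | I | I | I | C | C | C | I | A/R | C |
| Regulators/PSP | I | I | I | A/R | R | R | I | C | R |
| Post-mortem/CAPA | A/R | R | R | R | R | R | R | C | C |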

12) Checklists

12.1 DR readiness

  • DR Team/Vendor/Regulator contacts updated
  • Replication green, PITR enabled, test decryption of backups
  • JIT/PAM accesses, break-glass verified
  • Failover playbooks and environment variables are valid
  • PSP/KYC dual credentials/webhooks, alternate routes
  • Status Page/Message Templates Ready

12.2 During DR

  • DR-IC assigned, war-room open, event timeline
  • Cause isolation, scenario selection, running runbooks
  • Integrity checks, health tests, smoke tests
  • First public update ≤ 15 min; notifications to partners/regulators per SLA
  • Capturing artifacts for investigation

12.3 After DR

  • Complete reconciliation of money/games and logs
  • Post-mortem, RCA, CAPA with dates and owners
  • DRP/BIA/Contact/IaC Update
  • Retest plan for fixes

13) Templates (fragments)

13.1 Service card (DR passport)

service: payments-api
tier: 1
dependencies: [auth, ledger-db, psp1, psp2, kms-eu]
rto: "45m"
rpo: "15m"
backups: {pitr: true, snapshots: "hourly", immutability: "7d"}
failover: {mode: "active-standby", regions: ["eu1","eu2"]}
runbooks: ["rb_failover_region", "rb_psp_degradation"]
health_checks: ["/healthz","/readyz"]

13.2 DR test report (excerpt)

test_id: DR-2025-10
scope: "Full switch-over eu1→eu2"
rta: "27m"
rpa: "11m"
issues:
- {id: CAPA-117, desc: "slow cache warm-up", due: 2025-11-20, owner: SRE}
- {id: CAPA-118, desc: "outdated PSP#2 webhook", due: 2025-11-12, owner: Payments}
reconciliation: {finance: "ok", games: "ok"}
management_signoff: "2025-11-02"

13.3 Status message template


[UTC+02] An emergency failover to the backup region is in progress. Games are available; withdrawals are temporarily limited. Player funds are safe. Next update in 15 minutes.

14) Implementation Roadmap (6-8 weeks)

Weeks 1-2: inventory of services and dependencies, Tier classification, RTO/RPO goals, topology selection, DR passports.
Weeks 3-4: implementation of backups/PITR/immutability, secret replication/KMS, preparation of runbooks and the status page.
Weeks 5-6: partial technical tests (database/cache/queues), tabletop exercises for PSP/KYC/region scenarios.
Weeks 7-8: full switch-over (if possible), report with RTA/RPA, CAPA, DRP update and regular test plan.


15) Integration with other wiki sections

Link to: BCP, Risk Register, Incident Management, Log Policy (WORM), TPRM and SLA, ISO 27001/27701, SOC 2, PCI DSS, RBAC/Least Privilege, Password Policy and MFA, Change/Release Management.


TL;DR

A working DRP = clear RTO/RPO by Tier → Active-Active/Standby architecture + immutable backups/PITR → repeatable runbooks and failover → reconciliation of money/games → regular exercises and CAPAs. With these in place, a major failure becomes a manageable procedure with predictable recovery times and no surprises for regulators and players.
