Disaster Recovery Plan
1) Purpose, scope and principles
Goal: ensure timely recovery of the IT platform after disasters (technical, cyber, vendor, geopolitical) without violating regulatory requirements, contracts, or player expectations.
Scope: production environments (game platform, payments, KYC/AML, anti-fraud, DWH/BI data marts), integrations (PSP, KYC, CDN, studios/aggregators), infrastructure (cloud/K8s, networks, secrets/keys), data (databases, files, logs).
Principles: safety first, minimizing RTO/RPO, automation and reproducibility (IaC), "provability by default", regular exercises.
2) System classification and recovery objectives
2.1 Criticality levels
Tier-1 (vital): payments/cashouts, core games, login/authentication, KYC/sanctions.
Tier-2: real-time analytics, marketing/CRM, DWH reporting.
Tier-3: internal portals, auxiliary services.
2.2 Targets
RTO (Recovery Time Objective): target time within which service must be restored.
RPO (Recovery Point Objective): maximum acceptable data loss, expressed as a time window.
RTA (Recovery Time Actual) / RPA (Recovery Point Actual): the values actually achieved, recorded in reports.
MTO/MBCO: maximum tolerable outage / minimum acceptable service level (degraded mode).
Targets by tier: Tier-1: RTO ≤ 30-60 min, RPO ≤ 15 min; Tier-2: RTO ≤ 4 h, RPO ≤ 1 h; Tier-3: RTO ≤ 24 h, RPO ≤ 24 h.
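As an illustration of how RTA/RPA can be compared with the tier targets above, here is a minimal Python sketch. The names `OBJECTIVES`, `RecoveryResult`, and `check_objectives` are hypothetical, and the Tier-1 RTO is assumed to be the upper bound of the 30-60 min range.

```python
from dataclasses import dataclass
from datetime import timedelta

# Tier targets from this section as (RTO, RPO).
# Assumption: Tier-1 RTO uses the upper bound of the 30-60 min range.
OBJECTIVES = {
    1: (timedelta(minutes=60), timedelta(minutes=15)),
    2: (timedelta(hours=4), timedelta(hours=1)),
    3: (timedelta(hours=24), timedelta(hours=24)),
}


@dataclass(frozen=True)
class RecoveryResult:
    service: str
    tier: int
    rta: timedelta  # Recovery Time Actual
    rpa: timedelta  # Recovery Point Actual


def check_objectives(result: RecoveryResult) -> list[str]:
    """Return a list of breached objectives; an empty list means compliant."""
    rto, rpo = OBJECTIVES[result.tier]
    breaches = []
    if result.rta > rto:
        breaches.append(f"RTA {result.rta} exceeds RTO {rto}")
    if result.rpa > rpo:
        breaches.append(f"RPA {result.rpa} exceeds RPO {rpo}")
    return breaches
```

A report generator could call `check_objectives` for every service after a test and list the breaches next to the CAPA items.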
3) DR Strategies and Architecture
3.1 Topologies
Active-Active (multi-region): minimal RTO/RPO, requires consistency and conflict-resolution.
Active-Standby (hot/warm/cold): cost/speed balance.
Geo-separation of data and keys: KMS/HSM per-region, BYOK, independent replication paths.
3.2 Data and backups
PITR (point-in-time recovery): transaction logs, archiving intervals ≤ 5-15 minutes for Tier-1.
Snapshots/full backups: daily/hourly, storage according to the 3-2-1 scheme (3 copies, 2 media, 1 offline/offsite).
Immutability: WORM/object locks, signature/hash chains of artifacts.
Recovery catalog: backup inventory, integrity, expiration date, test decryptions.
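The "hash chains of artifacts" idea can be sketched as follows: each catalog entry's hash covers the previous entry's hash, so altering or reordering any earlier backup changes every later link. This is a minimal illustration, not a specific product's format; `GENESIS`, `chain_entry`, `build_chain`, and `verify_chain` are hypothetical names.

```python
import hashlib

# Assumption: the chain starts from a fixed all-zero "genesis" value.
GENESIS = "0" * 64


def chain_entry(prev_hash: str, artifact: bytes) -> str:
    """Hash the artifact digest together with the previous chain link."""
    digest = hashlib.sha256(artifact).hexdigest()
    return hashlib.sha256((prev_hash + digest).encode("ascii")).hexdigest()


def build_chain(artifacts: list[bytes]) -> list[str]:
    """Return one chain hash per artifact, in order."""
    hashes, prev = [], GENESIS
    for artifact in artifacts:
        prev = chain_entry(prev, artifact)
        hashes.append(prev)
    return hashes


def verify_chain(artifacts: list[bytes], recorded: list[str]) -> bool:
    """Recompute the chain and compare it with the recorded hashes."""
    return build_chain(artifacts) == list(recorded)
```

Storing the recorded hashes in WORM storage, separately from the backups themselves, is what makes a later mismatch meaningful.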
3.3 Applications and integrations
Stateless services: rapid redeployment via IaC/CI.
Stateful components: consistent snapshots, orchestrated startup sequence.
Integrations (PSP/KYC/aggregators): dual credentials, fallback endpoints, signed webhooks, re-delivery control (idempotency).
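Signed webhooks with re-delivery control can be sketched as below. The sketch assumes an HMAC-SHA256 signature over the raw body and a unique event id per webhook; real PSP/KYC providers define their own header names and signing schemes, so `WebhookReceiver` and its in-memory `_seen` set are purely illustrative.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


class WebhookReceiver:
    """Applies each signed event at most once, keyed by event id."""

    def __init__(self, secret: bytes):
        self._secret = secret
        # Assumption: in production this would be a durable, replicated store.
        self._seen: set[str] = set()
        self.applied: list[str] = []

    def handle(self, event_id: str, body: bytes, signature_hex: str) -> str:
        if not verify_signature(self._secret, body, signature_hex):
            return "rejected"
        if event_id in self._seen:
            return "duplicate"  # re-delivery after failover is a no-op
        self._seen.add(event_id)
        self.applied.append(event_id)
        return "applied"
```

With this shape, a provider re-sending its backlog after a region switch cannot credit a deposit twice.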
4) Recovery order (general runbook)
1. Declare a DR scenario → assign a DR Incident Commander (DR-IC), open a war room.
2. Damage assessment: affected regions/subsystems, current RTA/RPA, decision on whether to activate failover.
3. Isolation/containment: block the root causes (network ACLs, secrets, disconnecting the provider).
4. Restore components in order:
- network/secrets/KMS →
- DB/Vault/Cache →
- API/services → front/CDN → external integrations.
5. Integrity check: checksums, dry-run requests, health probes.
6. Finance/game reconciliation: reconcile payments, bets, and balances; replay transactions idempotently.
7. Communications: status page, players/partners/regulators; keep the timeline updated.
8. Monitoring and stabilization: lift degraded-mode restrictions as the system normalizes.
9. Post-mortem: RCA, CAPA, DRP update.
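The restore order in the runbook above (network/secrets/KMS, then DB/Vault/Cache, then API, front/CDN, and external integrations) can be derived from a dependency map rather than maintained by hand. The sketch below uses Python's standard `graphlib`; the `DEPENDENCIES` map and component names are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
DEPENDENCIES = {
    "network": [],
    "secrets-kms": ["network"],
    "database": ["secrets-kms"],
    "vault": ["secrets-kms"],
    "cache": ["secrets-kms"],
    "api": ["database", "vault", "cache"],
    "frontend-cdn": ["api"],
    "external-integrations": ["frontend-cdn"],
}


def restore_waves(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group components into waves; everything in a wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Each inner list is one wave: the orchestrator starts all of its members, waits for their health checks, and only then moves to the next wave.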
5) Specialized runbooks (fragments)
5.1 Region failover (Active-Standby)
trigger: "loss_of_region_primary OR quorum_fail >= 5m"
prechecks:
  - "secondary region green"
  - "replication_lag <= 15m"
steps:
  - DR-IC approves region_failover
  - Platform: GSLB switch → secondary
  - Data: promote replicas, enable PITR streams
  - Apps: redeploy with region vars; warm caches
  - QA: smoke tests (login, deposit, bet, payout)
  - Comms: status-page + partner notice
rollback: "switch-back after 60m stability window"
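The trigger and prechecks of this runbook can be evaluated mechanically before the DR-IC is asked to approve the switch. The following sketch assumes the monitoring system can supply the four fields of a hypothetical `RegionState`; the thresholds mirror the runbook (quorum failure ≥ 5 min, replication lag ≤ 15 min).

```python
from dataclasses import dataclass
from datetime import timedelta

MAX_REPLICATION_LAG = timedelta(minutes=15)
QUORUM_FAIL_THRESHOLD = timedelta(minutes=5)


@dataclass(frozen=True)
class RegionState:
    primary_lost: bool
    quorum_failed_for: timedelta
    secondary_healthy: bool
    replication_lag: timedelta


def failover_decision(state: RegionState) -> tuple[bool, list[str]]:
    """Return (may_proceed, blockers) for the region failover runbook."""
    triggered = state.primary_lost or state.quorum_failed_for >= QUORUM_FAIL_THRESHOLD
    if not triggered:
        return False, ["trigger not met"]
    blockers = []
    if not state.secondary_healthy:
        blockers.append("secondary region not green")
    if state.replication_lag > MAX_REPLICATION_LAG:
        blockers.append(f"replication lag {state.replication_lag} exceeds 15m")
    return not blockers, blockers
```

A non-empty blocker list does not forbid failover; it tells the DR-IC which risk (for example, exceeding RPO) is being accepted.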
5.2 Database corruption / recovery from PITR
trigger: "data_corruption_detected OR accidental_drop"
steps:
  - Freeze writes (feature flag), snapshot evidence
  - Restore to timestamp T (<= RPO)
  - Reindex/consistency checks
  - Replay idempotent events from queue (from T)
  - Reopen writes in throttle mode
validation: ["checksum_ok", "balance_diff=0", "orders_gap=0"]
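The "replay idempotent events from T" step can be sketched as follows. The sketch assumes each queued event carries a unique `id`, a timestamp `ts`, an `account`, and an `amount` in minor currency units, and that the restored database exposes the set of event ids it has already applied; all of these names are illustrative.

```python
def replay_events(balances: dict, events: list, restore_point: int, applied_ids: set) -> dict:
    """Apply events at or after restore_point exactly once, in timestamp order.

    Events older than the restore point are already in the restored snapshot;
    events whose id is in applied_ids are skipped, so a re-run is harmless.
    """
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] < restore_point or event["id"] in applied_ids:
            continue
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
        applied_ids.add(event["id"])
    return balances
```

Because the function mutates `applied_ids`, running it twice over the same queue leaves balances unchanged, which is the property the `balance_diff=0` validation relies on.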
5.3 PSP degradation in DR mode
trigger: "auth_rate_psp1 < baseline-3σ for 15m"
steps:
  - Route X%→psp2, cap payouts, enable manual VIP
  - Reconciliation plan T+0, alerts Finance
  - Notify players in cashier; vendor escalation
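The trigger `auth_rate_psp1 < baseline-3σ for 15m` can be read as: every sample in the last 15 minutes is below the baseline mean minus three standard deviations. The sketch below assumes one authorization-rate sample per minute and a baseline window chosen by the operator; `psp_degraded` is a hypothetical name.

```python
from statistics import mean, stdev


def psp_degraded(baseline_rates: list[float], recent_rates: list[float], sigmas: float = 3.0) -> bool:
    """True when all recent samples fall below mean - sigmas * stdev of the baseline.

    baseline_rates needs at least two samples (statistics.stdev requirement).
    An empty recent window never triggers.
    """
    threshold = mean(baseline_rates) - sigmas * stdev(baseline_rates)
    return bool(recent_rates) and all(rate < threshold for rate in recent_rates)
```

Requiring all samples in the window to breach the threshold, rather than any single one, keeps a one-minute dip from rerouting payment traffic.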
6) Data integrity and reconciliation
Finance: reconciliation of deposits, payments, and fees; re-sending notifications and webhooks with deduplication (idempotency keys).
Game platform: recovery of round states, re-running settlements where necessary, protection against double debits/credits.
Logs/audits: comparison of WORM logs before and after the incident, signatures/hashes, consistency reports.
DPO/Compliance report: if PII is affected, record the scale, timeline, and notifications.
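A financial reconciliation pass can be reduced to comparing two keyed sets of transactions. The sketch below assumes both the internal ledger and the PSP report are available as `{transaction_id: amount_in_minor_units}` mappings; `reconcile` and `is_zero_diff` are hypothetical helper names.

```python
def reconcile(ledger: dict[str, int], psp_report: dict[str, int]) -> dict[str, list[str]]:
    """Compare the internal ledger with a PSP report keyed by transaction id."""
    ledger_ids, psp_ids = set(ledger), set(psp_report)
    return {
        "missing_in_psp": sorted(ledger_ids - psp_ids),
        "missing_in_ledger": sorted(psp_ids - ledger_ids),
        "amount_mismatch": sorted(
            tx for tx in ledger_ids & psp_ids if ledger[tx] != psp_report[tx]
        ),
    }


def is_zero_diff(report: dict[str, list[str]]) -> bool:
    """True when reconciliation found no discrepancies at all."""
    return not any(report.values())
```

Using integer minor units avoids floating-point noise showing up as false mismatches.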
7) DR for key technologies (examples)
DBMS (relational): synchronous/asynchronous replication, WAL slots, fast-promote, hot standbys.
NoSQL/caches: multi-cluster setups, TTL-based invalidation, cold cache fill, no cross-region writes without conflict resolution.
Queues/streams: mirrored topics/clusters, offset control, consumer-side deduplication.
Object storage: versioning, replication to a "bunker" location, object inventory, and retention policies.
CI/CD/artifacts: registry replicas, artifact signing, offline copies of critical containers.
Secrets/keys: KMS per-region, independent root keys, break-glass with logging and TTL.
8) Security and privacy in DR
Least privilege: DR access through dedicated roles/profiles (JIT/PAM).
Immutable backups: offline/offsite, with restore and decryption tests.
Regulatory windows: record the event and decide on notifications (regulator/bank/PSP/users) together with Legal/DPO.
Traceability: full log of DR team actions, signed timeline.
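Break-glass access "with logging and TTL" can be illustrated with a small in-memory model: every grant expires automatically and every grant or access check is written to an audit trail. This is a sketch only; a real deployment would rely on a PAM product, and `BreakGlass` and its fields are hypothetical.

```python
from datetime import datetime, timedelta


class BreakGlass:
    """Time-limited emergency access with an append-only audit trail."""

    def __init__(self, ttl: timedelta):
        self._ttl = ttl
        self._grants: dict[str, datetime] = {}
        # Each entry: (timestamp, user, action, reason)
        self.audit_log: list[tuple[datetime, str, str, str]] = []

    def grant(self, user: str, reason: str, now: datetime) -> datetime:
        """Grant access until now + ttl and record who asked and why."""
        expires = now + self._ttl
        self._grants[user] = expires
        self.audit_log.append((now, user, "grant", reason))
        return expires

    def is_allowed(self, user: str, now: datetime) -> bool:
        """Check access; every check, allowed or denied, is logged."""
        expires = self._grants.get(user)
        allowed = expires is not None and now < expires
        self.audit_log.append((now, user, "allow" if allowed else "deny", ""))
        return allowed
```

Passing `now` explicitly keeps the logic testable; in practice the caller would supply the current UTC time.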
9) Exercises and types of tests
Walkthrough/review: review of documents, roles, and contacts (quarterly).
Tabletop: dry run of scenarios, including conflict resolution.
Technical partial: recovery of a single service/database.
Full failover/switch-over: transfer of traffic and data to the backup region.
Chaos days (controlled): injection of outages/faults to verify automation.
Each test → a report with RTA/RPA, a list of deviations, CAPA, and a DRP update.
10) Metrics (KPI/KRI)
RTA/RPA vs RTO/RPO (Tier-1): ≥ 95% compliance.
DR Test Coverage: ≥ 2 full DR tests per year, plus regular partial tests.
Time-to-First-Status: ≤ 15 min after DR declaration.
Reconciliation Zero-Diff: all financial and game reconciliations complete without discrepancies.
Backup Integrity: 100% of sample restores succeed each quarter.
Config Drift: 0 drift between primary/secondary (IaC comparison).
Security in DR: 100% of DR actions logged and approved.
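Two of these metrics lend themselves to simple calculation. The sketch below computes the share of tests that met RTO/RPO and lists configuration keys that differ between regions. It assumes test results are `(rta_minutes, rpa_minutes)` pairs and that rendered IaC configuration is available as flat dictionaries; function names and the `region` key are hypothetical.

```python
def compliance_rate(results: list[tuple[float, float]], rto_min: float, rpo_min: float) -> float:
    """Percentage of (RTA, RPA) results that met both RTO and RPO."""
    if not results:
        raise ValueError("no test results to evaluate")
    met = sum(1 for rta, rpa in results if rta <= rto_min and rpa <= rpo_min)
    return 100.0 * met / len(results)


def config_drift(primary: dict, secondary: dict, region_specific=frozenset({"region"})) -> dict:
    """Return {key: (primary_value, secondary_value)} for keys that differ.

    Keys listed in region_specific are expected to differ and are ignored.
    """
    keys = (set(primary) | set(secondary)) - set(region_specific)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }
```

An empty dictionary from `config_drift` corresponds to the "0 drift" target; a `compliance_rate` below 95 for Tier-1 would breach the first KPI.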
11) RACI (high-level)
12) Checklists
12.1 DR readiness
- DR Team/Vendor/Regulator contacts updated
- Replication green, PITR enabled, test decryption of backups done
- JIT/PAM access and break-glass verified
- Failover playbooks and environment variables are valid
- Dual credentials/webhooks for PSP/KYC, alternate routes
- Status Page/Message Templates Ready
12.2 During DR
- DR-IC assigned, war-room open, event timeline
- Cause isolation, scenario selection, runbook execution
- Integrity checks, health tests, smoke tests
- First public update ≤ 15 min; notifications to partners/regulators within SLA
- Capturing artifacts for investigation
12.3 After DR
- Complete reconciliation of money/games and logs
- Post-mortem, RCA, CAPA with dates and owners
- DRP/BIA/contacts/IaC updated
- Retest plan for fixes
13) Templates (fragments)
13.1 Service card (DR passport)
service: payments-api
tier: 1
dependencies: [auth, ledger-db, psp1, psp2, kms-eu]
rto: "45m"
rpo: "15m"
backups: {pitr: true, snapshots: "hourly", immutability: "7d"}
failover: {mode: "active-standby", regions: ["eu1","eu2"]}
runbooks: ["rb_failover_region", "rb_psp_degradation"]
health_checks: ["/healthz","/readyz"]
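A DR passport like the one above can be validated automatically, for example in CI. The sketch below assumes the YAML has already been loaded into a Python dictionary, that durations use the `<number><m|h|d>` form seen in the passport, and that Tier-1 RTO is capped at 60 minutes; `REQUIRED_FIELDS`, `TIER_LIMITS`, and the function names are hypothetical.

```python
import re
from datetime import timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
REQUIRED_FIELDS = ("service", "tier", "rto", "rpo", "runbooks", "health_checks")
# Assumption: limits follow the tier targets, with Tier-1 RTO at the 60 min upper bound.
TIER_LIMITS = {1: ("60m", "15m"), 2: ("4h", "1h"), 3: ("24h", "24h")}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '45m', '4h', or '7d'."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def validate_passport(passport: dict) -> list[str]:
    """Return validation errors; an empty list means the passport is acceptable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in passport]
    if errors:
        return errors
    max_rto, max_rpo = TIER_LIMITS[passport["tier"]]
    if parse_duration(passport["rto"]) > parse_duration(max_rto):
        errors.append(f"rto {passport['rto']} exceeds tier limit {max_rto}")
    if parse_duration(passport["rpo"]) > parse_duration(max_rpo):
        errors.append(f"rpo {passport['rpo']} exceeds tier limit {max_rpo}")
    return errors
```

Running such a check on every passport change keeps declared targets from silently drifting above the tier limits.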
13.2 DR test report (excerpt)
test_id: DR-2025-10
scope: "Full switch-over eu1→eu2"
rta: "27m"
rpa: "11m"
issues:
  - {id: CAPA-117, desc: "slow cache warm-up", due: 2025-11-20, owner: SRE}
  - {id: CAPA-118, desc: "outdated PSP#2 webhook", due: 2025-11-12, owner: Payments}
reconciliation: {finance: "ok", games: "ok"}
management_signoff: "2025-11-02"
13.3 Status message template
[UTC+02] Emergency failover to the backup region is in progress. Games are available; withdrawals are temporarily limited. Player funds are safe. Next update in 15 minutes.
14) Implementation Roadmap (6-8 weeks)
Weeks 1-2: inventory of services and dependencies, Tier classification, RTO/RPO goals, topology selection, DR passports.
Weeks 3-4: implementation of backups/PITR/immutability, secret/KMS replication, preparation of runbooks and the status page.
Weeks 5-6: partial technical tests (database/cache/queues), tabletop exercises for PSP/KYC/region scenarios.
Weeks 7-8: full switch-over (if possible), report with RTA/RPA, CAPA, DRP update and regular test plan.
15) Integration with other wiki sections
Link to: BCP, Risk Register, Incident Management, Log Policy (WORM), TPRM and SLA, ISO 27001/27701, SOC 2, PCI DSS, RBAC/Least Privilege, Password Policy and MFA, Change/Release Management.
TL;DR
A working DRP = clear RTO/RPO by tier → Active-Active/Standby architecture + immutable backups/PITR → rehearsed runbooks and failover → reconciliation of money/games → regular exercises and CAPAs. With these in place, any major failure becomes a manageable procedure with predictable recovery times and no surprises for regulators and players.