Disaster Recovery Plan

1) Purpose, scope and principles

Purpose: ensure timely recovery of the IT platform after disasters (technical, cyber, vendor, geopolitical) without violating regulatory requirements, contracts, or player expectations.
Scope: production environments (game platform, payments, KYC/AML, anti-fraud, DWH/BI data marts), integrations (PSP, KYC, CDN, studios/aggregators), infrastructure (cloud/K8s, networks, secrets/keys), data (databases, files, logs).
Principles: safety first, RTO/RPO minimization, automation and reproducibility (IaC), "provability by default", regular exercises.

2) System classification and recovery objectives

2.1 Criticality levels

Tier-1 (vital): payments/cashouts, core games, login/authentication, KYC/sanctions.
Tier-2: real-time analytics, marketing/CRM, DWH reporting.
Tier-3: internal portals, auxiliary services.

2.2 Targets

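RTO (Recovery Time Objective): target time to restore a service after a failure.
RPO (Recovery Point Objective): maximum tolerable data loss, expressed as a time window.
RTA (Recovery Time Actual) / RPA (Recovery Point Actual): the values actually achieved, recorded in reports.
MTO (Maximum Tolerable Outage) / MBCO (Minimum Business Continuity Objective): maximum tolerated downtime / minimum acceptable service level (degraded mode).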

Example goals (for reference):

Tier-1: RTO ≤ 30-60 min, RPO ≤ 15 min; Tier-2: RTO ≤ 4 h, RPO ≤ 1 h; Tier-3: RTO ≤ 24 h, RPO ≤ 24 h.
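
As an illustration, the comparison of actual values (RTA/RPA) against the example targets above can be sketched in a few lines of Python. The names `Objective`, `TARGETS` and `meets_objectives` are hypothetical, and Tier-1 is assumed to use the upper bound of its 30-60 min range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta  # Recovery Time Objective
    rpo: timedelta  # Recovery Point Objective


# Assumption: upper bounds of the example goals; Tier-1 uses the 60 min ceiling.
TARGETS = {
    1: Objective(rto=timedelta(minutes=60), rpo=timedelta(minutes=15)),
    2: Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    3: Objective(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def meets_objectives(tier: int, rta: timedelta, rpa: timedelta) -> dict:
    """Compare actual recovery values (RTA/RPA) with the tier's targets."""
    target = TARGETS[tier]
    return {"rto_met": rta <= target.rto, "rpo_met": rpa <= target.rpo}
```

For example, `meets_objectives(1, timedelta(minutes=27), timedelta(minutes=11))` reports both targets as met.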

3) DR Strategies and Architecture

3.1 Topologies

Active-Active (multi-region): minimal RTO/RPO, requires consistency and conflict-resolution.
Active-Standby (hot/warm/cold): cost/speed balance.
Geo-separation of data and keys: KMS/HSM per-region, BYOK, independent replication paths.

3.2 Data and backups

PITR (point-in-time recovery): transaction logs, archiving intervals ≤ 5-15 minutes for Tier-1.
Snapshots/full backups: daily/hourly, storage according to the 3-2-1 scheme (3 copies, 2 media, 1 offline/offsite).
Immutability: WORM/object locks, signature/hash chains of artifacts.
Recovery catalog: backup inventory, integrity, expiration date, test decryptions.
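
One possible way to implement the "hash chains of artifacts" idea is sketched below: each catalog entry commits to the previous entry's hash, so altering any backup breaks every later link. This is a minimal sketch using only the standard library; the entry layout and the function names are assumptions, not a prescribed format.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for the first link


def chain_entry(prev_hash: str, name: str, data: bytes) -> dict:
    """Build a catalog entry that commits to the artifact and to the previous link."""
    content_hash = hashlib.sha256(data).hexdigest()
    link = hashlib.sha256(f"{prev_hash}:{name}:{content_hash}".encode()).hexdigest()
    return {"name": name, "content_hash": content_hash, "prev": prev_hash, "link": link}


def build_chain(artifacts):
    """artifacts: ordered list of (name, bytes) pairs."""
    chain, prev = [], GENESIS
    for name, data in artifacts:
        entry = chain_entry(prev, name, data)
        chain.append(entry)
        prev = entry["link"]
    return chain


def verify_chain(chain, artifacts) -> bool:
    """Recompute every link from the artifacts and compare with the stored chain."""
    if len(chain) != len(artifacts):
        return False
    prev = GENESIS
    for entry, (name, data) in zip(chain, artifacts):
        if entry != chain_entry(prev, name, data):
            return False
        prev = entry["link"]
    return True
```

In practice the chain itself would be stored on WORM storage and signed, so that it cannot be rebuilt silently after tampering.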

3.3 Applications and integrations

Stateless services: rapid deployment via IaC/CI.
Stateful components: consistent snapshots, orchestration of the launch sequence.
Integrations (PSP/KYC/aggregators): dual credentials, fallback endpoints, signed webhooks, re-delivery control (idempotency).
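
The combination of signed webhooks and idempotent re-delivery can be sketched as follows. This is a minimal illustration assuming an HMAC-SHA256 hex signature and a caller-supplied idempotency key; real PSP/KYC providers define their own signature schemes, and the class `WebhookProcessor` is hypothetical.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


class WebhookProcessor:
    """Applies each webhook at most once, keyed by its idempotency key."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen = set()   # assumption: a durable store in a real system
        self.applied = []   # stands in for the actual business effect

    def handle(self, idempotency_key: str, body: bytes, signature_hex: str) -> str:
        if not verify_signature(self.secret, body, signature_hex):
            return "rejected"
        if idempotency_key in self.seen:
            return "duplicate"  # re-delivery after failover: acknowledge, do nothing
        self.seen.add(idempotency_key)
        self.applied.append(body)
        return "applied"
```

With this shape, a provider that re-sends the same webhook after a failover gets an acknowledgement, while the balance effect is applied only once.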

4) Recovery order (general runbook)

1. Declare the DR scenario → assign the DR Incident Commander (DR-IC), open a war room.
2. Damage assessment: affected regions/subsystems, current RTA/RPA, decision on whether to activate failover.
3. Isolation/containment: block the root causes (network ACLs, secrets, disconnecting the provider).
4. Initialize DR in dependency order: network/secrets/KMS → DB/Vault/Cache → API/services → front/CDN → external integrations.
5. Integrity check: checksums, dry-run requests, health probes.
6. Finance/game reconciliation: payments, bets, balances; idempotent replay of transactions.
7. Communications: status page, players/partners/regulators; keep the timeline updated.
8. Observation and stabilization: lift degraded modes as the system normalizes.
9. Post-mortem: RCA, CAPA, DRP update.
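
The start-up order in step 4 is a dependency graph, and it can be derived rather than hard-coded. The sketch below uses Python's standard `graphlib` module; the component names and the `DEPENDENCIES` map are illustrative assumptions that mirror the layers listed in the runbook.

```python
from graphlib import TopologicalSorter

# component -> components that must be healthy first (assumed layering)
BASE = {"network", "secrets", "kms"}
DATA = {"database", "vault", "cache"}
DEPENDENCIES = {
    **{name: set() for name in BASE},
    **{name: set(BASE) for name in DATA},
    "api_services": set(DATA),
    "frontend_cdn": {"api_services"},
    "external_integrations": {"frontend_cdn"},
}


def startup_waves(dependencies):
    """Group components into waves; everything in one wave can start in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `startup_waves(DEPENDENCIES)` yields five waves, from the network/secrets/KMS layer down to external integrations, which an orchestrator could run one after another with health checks between them.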

5) Specialist runbooks (snippets)

5.1 Active-Standby: failover to the standby region

trigger: "loss_of_region_primary OR quorum_fail >= 5m"
prechecks:
- "secondary region green"
- "replication_lag <= 15m"
steps:
- DR-IC approves region_failover
- Platform: GSLB switch → secondary
- Data: promote replicas, enable PITR streams
- Apps: redeploy with region vars; warm caches
- QA: smoke tests (login, deposit, bet, payout)
- Comms: status-page + partner notice
rollback: "switch-back after 60m stability window"

5.2 Database corruption / recovery from PITR

trigger: "data_corruption_detected OR accidental_drop"
steps:
- Freeze writes (feature flag), snapshot evidence
- Restore to timestamp T (<= RPO)
- Reindex/consistency checks
- Replay idempotent events from queue (from T)
- Reopen writes in throttle mode
validation: ["checksum_ok", "balance_diff=0", "orders_gap=0"]
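
The "replay idempotent events from queue (from T)" step could look roughly like the sketch below. It assumes every event carries a unique `id` and a timestamp `ts`, and that the set of already-applied ids survives the restore; the function `replay_events` and the event layout are hypothetical.

```python
def replay_events(events, restore_point, applied_ids, apply):
    """Re-apply events at or after the restore point, skipping ones already applied.

    events        -- iterable of dicts with at least "id" and "ts"
    restore_point -- timestamp T the database was restored to
    applied_ids   -- mutable set of event ids already reflected in the database
    apply         -- callback that performs the actual state change
    """
    replayed = 0
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] < restore_point or event["id"] in applied_ids:
            continue  # older than T, or already applied: never apply twice
        apply(event)
        applied_ids.add(event["id"])
        replayed += 1
    return replayed
```

Running the replay a second time over the same queue applies nothing, which is the property that makes it safe to retry after a partial failure.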

5.3 PSP degradation in DR mode

trigger: "auth_rate_psp1 < baseline-3σ for 15m"
steps:
- Route X%→psp2, cap payouts, enable manual VIP
- Reconciliation plan T+0, alerts Finance
- Notify players in cashier; vendor escalation
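
The trigger condition above (authorization rate below baseline minus three standard deviations, sustained for 15 minutes) can be evaluated as in the following sketch. It assumes one sample per minute and a population standard deviation over a historical baseline window; the function name `psp_degraded` and its parameters are illustrative.

```python
from statistics import mean, pstdev


def psp_degraded(baseline_rates, recent_rates, window=15, sigmas=3.0):
    """Return True if the last `window` samples are all below baseline - sigmas * σ.

    baseline_rates -- historical per-minute authorization rates (0..1)
    recent_rates   -- most recent per-minute rates, oldest first
    """
    threshold = mean(baseline_rates) - sigmas * pstdev(baseline_rates)
    if len(recent_rates) < window:
        return False  # not enough data to claim a sustained drop
    return all(rate < threshold for rate in recent_rates[-window:])
```

Requiring every sample in the window to breach the threshold avoids rerouting traffic on a single noisy minute.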

6) Data integrity and reconciliation

Finance: reconciliation of deposits/payments/commissions; re-sending of notifications and webhooks with deduplication (idempotency keys).
Game platform: recovery of round states, re-running settlements where necessary, protection against double debits/credits.
Logs/audits: before/after comparison of WORM logs, signatures/hashes, consistency reports.
DPO/Compliance Report: In case of PII impact, capture scale, timeline and notifications.
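
A finance reconciliation pass can be reduced to comparing two views of the same transactions. The sketch below compares an internal ledger with a PSP report, both assumed to be already loaded as `{transaction_id: amount}` mappings; the function `reconcile` and the report shape are hypothetical.

```python
from decimal import Decimal


def reconcile(ledger, psp_report):
    """Compare internal ledger and PSP report, both {transaction_id: Decimal amount}."""
    ledger_ids, psp_ids = set(ledger), set(psp_report)
    return {
        "missing_in_psp": sorted(ledger_ids - psp_ids),
        "missing_in_ledger": sorted(psp_ids - ledger_ids),
        "amount_mismatch": sorted(
            tx for tx in ledger_ids & psp_ids if ledger[tx] != psp_report[tx]
        ),
    }


def is_zero_diff(result) -> bool:
    """True when no discrepancies of any kind were found."""
    return not any(result.values())
```

Amounts are kept as `Decimal` rather than floats so that equal monetary values always compare equal.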

7) DR for key technologies (examples)

DBMS (relational): synchronous/asynchronous replication, WAL slots, fast-promote, hot standbys.
NoSQL/caches: multi-cluster setups, TTL-based invalidation, cold-start warm-up, no cross-region writes without conflict resolution.
Queues/streams: mirrored topics/clusters, offset control, consumer deduplication.
Object Storage: versioning, replication to an isolated "bunker" copy, object inventory, and retention policies.
CI/CD/artifacts: replicas of registries, signature of artifacts, offline copies of critical containers.
Secrets/keys: KMS per-region, independent root keys, break-glass with logging and TTL.

8) Security and privacy in DR

Least privilege: DR access through dedicated roles/profiles (JIT/PAM).
Immutable backups: offline/offsite, recovery and decryption test.
Regulatory windows: event capture and notification decision (regulator/bank/PSP/users) along with Legal/DPO.
Traceability: complete log of DR team actions, signed timeline.

9) Exercises and types of tests

Walkthrough/Review: Document/Role/Contact Review (Quarterly).
Tabletop: dry-run walk-through of scenarios, including conflict resolution.
Technical partial: recovery of a single service/database.
Full failover/switch-over: transfer of traffic and data to the backup region.
Chaos days (controlled): injection of failures/faults to verify automation.

Each test → a report with an RTA/RPA, deviation list, CAPA, and DRP update.

10) Metrics (KPI/KRI)

RTA/RPA vs RTO/RPO (Tier-1): targets met in ≥ 95% of cases.
DR Test Coverage: ≥ 2 complete DR tests/year + regular partial.
Time-to-First-Status: ≤ 15 min after DR announcement.
Reconciliation Zero-Diff: all cash and game reconciliations without discrepancies.
Backup Integrity: 100% of spot restores are successful in a quarter.
Config Drift: 0 drift between primary/secondary (IaC comparison).
Security in DR: 100% of DR actions logged and confirmed.
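
The Config Drift metric needs a definition of "drift" that tolerates legitimately region-specific values. One possible approach, sketched below, flattens both rendered configurations and compares them key by key while ignoring an explicit allow-list; the functions `flatten` and `config_drift` and the sample keys are assumptions.

```python
def flatten(config, prefix=""):
    """Turn a nested dict into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_drift(primary, secondary, ignore=frozenset()):
    """Return {path: (primary_value, secondary_value)} for every differing key."""
    a, b = flatten(primary), flatten(secondary)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if path not in ignore and a.get(path) != b.get(path)
    }
```

An empty result corresponds to the "0 drift" target; any returned path is a candidate CAPA item.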

11) RACI (high level)

Activity	DR-IC	Platform/SRE	Data/DBA	Security/DPO	Payments	Risk/KYC	Product/Eng	Comms/PR	Legal/Compliance
DR Announcement	A/R	C	C	C	C	C	C	C	C
Failover/Bring-up	C	A/R	R	C	C	C	R	I	I
Validation/Health	C	R	A/R	C	C	C	R	I	I
Reconciliation	I	R	A/R	I	R	R	R	I	I
Communications	I	I	I	C	C	C	I	A/R	C
Regulators/PSP	I	I	I	A/R	R	R	I	C	R
Post-mortem/CAPA	A/R	R	R	R	R	R	R	C	C

12) Checklists

12.1 DR readiness

DR Team/Vendor/Regulator contacts updated
Replication green, PITR enabled, test decryption of backups
JIT/PAM accesses, break-glass verified
Failover playbooks and environment variables are valid
PSP/KYC dual credentials/webhooks, alternate routes
Status Page/Message Templates Ready

12.2 During DR

DR-IC assigned, war-room open, event timeline
Cause isolation, scenario selection, running runbooks
Integrity checks, health tests, smoke tests
First public update ≤ 15 min; notifications to partners/regulators within SLA
Capturing artifacts for investigation

12.3 After DR

Complete reconciliation of money/games and logs
Post-mortem, RCA, CAPA with dates and owners
DRP/BIA/Contact/IaC update
Retest plan for fixes

13) Templates (fragments)

13.1 Service card (DR passport)

service: payments-api
tier: 1
dependencies: [auth, ledger-db, psp1, psp2, kms-eu]
rto: "45m"
rpo: "15m"
backups: {pitr: true, snapshots: "hourly", immutability: "7d"}
failover: {mode: "active-standby", regions: ["eu1","eu2"]}
runbooks: ["rb_failover_region", "rb_psp_degradation"]
health_checks: ["/healthz","/readyz"]
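
A DR passport is only useful if it is complete and machine-checkable. The sketch below validates a card that has already been loaded into a dict (the YAML loading itself is out of scope here); the required-field list, the `"45m"`-style duration format and the function names are assumptions based on the template above.

```python
import re
from datetime import timedelta

REQUIRED = ("service", "tier", "rto", "rpo", "backups", "failover", "runbooks", "health_checks")
UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "45m", "4h" or "7d"."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if not match:
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(**{UNITS[match.group(2)]: int(match.group(1))})


def validate_passport(card: dict) -> list:
    """Return a list of problems; an empty list means the card passes."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in card]
    for name in ("rto", "rpo"):
        if name in card:
            try:
                parse_duration(card[name])
            except ValueError as exc:
                problems.append(f"{name}: {exc}")
    return problems
```

A check like this could run in CI so that a service cannot ship without a valid passport.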

13.2 DR test report (excerpt)

test_id: DR-2025-10
scope: "Full switch-over eu1→eu2"
rta: "27m"
rpa: "11m"
issues:
- {id: CAPA-117, desc: "long cache warm-up", due: 2025-11-20, owner: SRE}
- {id: CAPA-118, desc: "outdated PSP#2 webhook", due: 2025-11-12, owner: Payments}
reconciliation: {finance: "ok", games: "ok"}
management_signoff: "2025-11-02"

13.3 Status message template


[UTC+02] Failover to backup region in progress. Games are available, withdrawals are temporarily limited. Player funds are safe. Next update in 15 minutes.

14) Implementation Roadmap (6-8 weeks)

Weeks 1-2: inventory of services and dependencies, Tier classification, RTO/RPO goals, topology selection, DR passports.
Weeks 3-4: implementation of backups/PITR/immutability, secret/KMS replication, preparation of runbooks and the status page.
Weeks 5-6: partial technical tests (database/cache/queues), tabletop exercises for PSP/KYC/region scenarios.
Weeks 7-8: full switch-over (if possible), report with RTA/RPA, CAPA, DRP update and regular test plan.

15) Integration with other wiki sections

Link to: BCP, Risk Register, Incident Management, Log Policy (WORM), TPRM and SLA, ISO 27001/27701, SOC 2, PCI DSS, RBAC/Least Privilege, Password Policy and MFA, Change/Release Management.

TL;DR

A working DRP = clear RTO/RPO by tier → Active-Active/Standby architecture + immutable backups/PITR → executable runbooks and failover → reconciliation of money/games → regular exercises and CAPAs. With these in place, any major failure becomes a manageable procedure with predictable recovery times and no surprises for regulators and players.
