Business Continuity Plan
1) Purpose, scope and principles
Purpose: to keep critical services (deposits, bets/games, withdrawals, KYC/AML, support) running during failures and to recover quickly without violating licenses and contracts.
Scope: online platform, payment infrastructure, anti-fraud/KYC, DWH/BI, support, operational and legal functions, key vendors (PSP/KYC/cloud/CDN/studios/aggregators).
Principles: safety first, player first, regulatory correctness, RTO/RPO minimization, simple degradation modes, auditability and regular exercises.
2) BIA - Business Impact Analysis
Identify critical processes, inputs/outputs, dependencies, manual alternatives, and target RTO/RPOs.
Example of a BIA fragment (YAML):
process: payouts
owner: head_of_payments
criticality: tier1
dependencies: [psp1, psp2, bank_api, kyc_service, ledger_db]
rto: "4h"
rpo: "15m"
manual_workaround: "limited manual VIP payments when the PSP is completely unavailable"
max_tolerable_downtime: "8h"
legal_constraints: ["AML/KYC check before payout", "regulatory notification windows"]
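A BIA entry like the one above can be sanity-checked automatically. The following is a minimal sketch, assuming the entry has already been loaded into a Python dict; the helper names `parse_duration` and `check_bia_entry` and the specific rules (RTO must not exceed the maximum tolerable downtime, Tier-1 processes need a manual workaround) are illustrative assumptions, not part of the plan itself.

```python
import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '15m' or '4h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def check_bia_entry(entry: dict) -> list[str]:
    """Return a list of consistency problems found in one BIA entry."""
    problems = []
    rto = parse_duration(entry["rto"])
    mtd = parse_duration(entry["max_tolerable_downtime"])
    if rto > mtd:
        problems.append("rto exceeds max_tolerable_downtime")
    # Assumed rule: every tier1 process needs a documented manual workaround.
    if entry.get("criticality") == "tier1" and not entry.get("manual_workaround"):
        problems.append("tier1 process has no manual workaround")
    if not entry.get("dependencies"):
        problems.append("no dependencies listed")
    return problems


payouts = {
    "process": "payouts",
    "criticality": "tier1",
    "dependencies": ["psp1", "psp2", "bank_api", "kyc_service", "ledger_db"],
    "rto": "4h",
    "rpo": "15m",
    "manual_workaround": "limited manual VIP payments when the PSP is completely unavailable",
    "max_tolerable_downtime": "8h",
}
print(check_bia_entry(payouts))  # []
```

Running such a check in CI for the repository that stores the BIA would catch entries whose targets contradict each other before they reach a drill.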
3) Risk → Impact → Response
Technical: cloud region outage, database failure, cluster loss, DDoS attacks, CDN failure.
Vendors: PSP/KYC degradation, loss of connection to a game aggregator, unavailability of anti-fraud/sanctions screening.
Cyber: account/key compromise, ransomware, PII leak.
Processes/people: strikes/illness, departure of key specialists, release errors.
Geo/force majeure: telecom/power outages, military/sanctions risks, domain/traffic blocking.
For each: triggers, escalation threshold, control measures, service degradation and communication templates.
4) Resilience architecture and strategies
Active-active/active-standby across regions; infrastructure as code for fast rebuilds.
Degradation modes: read-only storefront (lobby), disabling non-critical game providers, payment limits, "deposits only" with deferred cashouts (if legally permissible), reduced analytics/ETL frequency.
Traffic management: Anycast CDN, geo-balancing, health-checks, canary-routing.
Data: PITR backups, change logs, inter-region replication, cryptographic integrity (hashes/WORM).
Keys/secrets: independent KMS per-region, "break-glass" with logging.
PSP/KYC multi-homing: automatic failover, SLA/latency routing.
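The PSP multi-homing item can be illustrated with a small routing sketch. It assumes health data is already collected per provider; the `PspHealth` fields, the `choose_psp` function and the default thresholds (80% approval rate, 2000 ms p95 latency) are hypothetical values chosen for the example, not figures from this plan.

```python
from dataclasses import dataclass


@dataclass
class PspHealth:
    name: str
    healthy: bool           # result of the latest health check
    auth_rate: float        # share of approved transactions, 0..1
    p95_latency_ms: float


def choose_psp(providers, min_auth_rate=0.80, max_latency_ms=2000.0):
    """Pick the best eligible PSP, or None if deposits should be degraded."""
    eligible = [
        p for p in providers
        if p.healthy
        and p.auth_rate >= min_auth_rate
        and p.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    # Prefer the higher approval rate; break ties on lower latency.
    return max(eligible, key=lambda p: (p.auth_rate, -p.p95_latency_ms))


providers = [
    PspHealth("psp1", healthy=True, auth_rate=0.62, p95_latency_ms=900),
    PspHealth("psp2", healthy=True, auth_rate=0.91, p95_latency_ms=1200),
]
best = choose_psp(providers)
print(best.name if best else "no PSP available: enter degradation mode")  # psp2
```

Returning `None` rather than a "least bad" provider is a deliberate choice in this sketch: it forces the caller to enter an explicit degradation mode instead of silently routing payments to a failing PSP.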
5) Incident Command System
Incident Commander (IC) - a single decision point.
Ops Lead (SRE/Platform) - technical stabilization, failover, metrics.
Business Continuity Lead - coordination of processes/manual procedures.
Comms Lead - external/internal notifications (players, partners, regulators).
Security/DPO - cyber incidents/privacy, regulatory windows.
Payments/KYC Leads - PSP/KYC scenarios.
Liaisons: Legal, Support, VIP/CRM, Data/BI.
Rule: one IC per incident, clear channels and decision logs.
6) Communications Plan
Channels: war room (chat/bridge), backup channels (phone/radio/alternative messenger), pre-verified PSP/KYC/bank contacts.
External message templates: status page, social networks, email/push; tone - facts, timing, next steps.
Regulators and partners: pre-set recipient addresses, notification SLAs, agreed wording.
Players: transparent ETAs, compensation/bonuses (if applicable), FAQs for the degradation period.
7) Operational Plans (Runbooks)
Examples of fragments:
7.1 Failover to another region
trigger: "loss of primary availability >= 5m, p95_latency > threshold"
steps:
  - IC approves region_failover
  - SRE: flip traffic via GSLB to secondary
  - Data: verify replication lag < RPO
  - Apps: switch env vars/secrets; warm caches
  - QA: smoke tests
  - Business: announce status
rollback: "switch back after 60m of stability"
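The go/no-go part of this runbook can be expressed as a small check. This is a sketch under stated assumptions: it models only the availability part of the trigger (5 minutes), takes the RPO as 15 minutes, and treats the replication-lag check as a pre-flight gate, which is a design choice of the example rather than the step order given in the runbook. `FailoverInputs` and `failover_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FailoverInputs:
    primary_unavailable_s: int   # seconds the primary region has been failing
    replication_lag_s: int       # lag of the secondary replica, in seconds
    ic_approved: bool            # Incident Commander approval recorded


def failover_decision(inp, outage_threshold_s=300, rpo_s=900):
    """Return (go, reasons); reasons lists every blocking condition."""
    reasons = []
    if inp.primary_unavailable_s < outage_threshold_s:
        reasons.append("trigger not met: outage shorter than threshold")
    if not inp.ic_approved:
        reasons.append("IC approval missing")
    if inp.replication_lag_s >= rpo_s:
        reasons.append("replication lag is not below RPO")
    return (not reasons, reasons)


go, reasons = failover_decision(FailoverInputs(420, 30, True))
print(go, reasons)  # True []
```

Collecting all blocking reasons, instead of stopping at the first one, gives the IC a complete picture for the decision log.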
7.2 PSP degradation
trigger: "auth_rate_psp1 < baseline - 3σ for 15m"
steps:
  - Payments: route X% → psp2, enable limits
  - Comms: banner at the checkout, status page
  - Finance: reconciliation plan for T+0
  - Legal: notification log and SLA letter
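The trigger for this scenario (authorization rate below baseline minus three standard deviations, sustained for 15 minutes) could be evaluated roughly as follows. The sketch assumes one authorization-rate sample per minute and a baseline window supplied by the caller; the function name `psp_degraded` and the sample values are illustrative.

```python
from statistics import mean, stdev


def psp_degraded(baseline, recent, sigmas=3.0):
    """True if every recent sample is below mean(baseline) - sigmas * stdev(baseline)."""
    threshold = mean(baseline) - sigmas * stdev(baseline)
    # Require a non-empty window so missing data never fires the trigger.
    return bool(recent) and all(x < threshold for x in recent)


# Hypothetical per-minute authorization rates.
baseline = [0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.91, 0.90]
last_15m = [0.70] * 15
print(psp_degraded(baseline, last_15m))  # True
```

Requiring all samples in the window to breach the threshold matches the "sustained for 15 minutes" wording and avoids rerouting traffic on a single bad minute.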
7.3 KYC provider unavailable
trigger: "kyc_sla_breach for 30m"
steps:
  - Risk: temporary limits on deposits/bets
  - Ops: manual checks for VIP/high-risk players
  - Comms: notice of longer KYC processing times
  - Vendor: escalation, switch to backup provider
8) IT and Data Recovery (DR)
System categories: Tier-1 (platform/payments/KYC), Tier-2 (games/analytics), Tier-3 (internal).
Bring-up order: network → secrets/KMS → DB → cache → API → frontend/CDN → integrations → analytics.
Integrity checks: checksums, log/replication verification, transaction reconciliation.
DR tests: full (switch-over) annually, partial quarterly; record the actual RTO/RPO achieved.
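The bring-up sequence in this section can be derived from a dependency map rather than maintained by hand. Below is a minimal sketch using the standard-library `graphlib` module (Python 3.9+); the component names and the strictly linear dependencies in `DEPENDS_ON` are assumptions that mirror the listed order, and a real platform would likely have a richer graph.

```python
from graphlib import TopologicalSorter

# component -> components that must be up first (assumed, linear for illustration)
DEPENDS_ON = {
    "network": set(),
    "secrets_kms": {"network"},
    "database": {"secrets_kms"},
    "cache": {"database"},
    "api": {"cache"},
    "frontend_cdn": {"api"},
    "integrations": {"frontend_cdn"},
    "analytics": {"integrations"},
}

# static_order() yields each component only after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(order))
# network -> secrets_kms -> database -> cache -> api -> frontend_cdn -> integrations -> analytics
```

If someone introduces a circular dependency, `static_order()` raises `graphlib.CycleError`, which is a useful guard when the map is edited during an architecture change.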
9) People, offices and logistics
Remote-ready: spare laptops/modems, access via SSO/MFA, emergency ("red") access for the IC.
Alternative locations: spare offices/coworking spaces, pass lists, evacuation plan.
Shift rotation: competence matrix, backups for key roles, replacement plan.
Critical telecom/power providers: contacts, SLAs, generators/UPS (if relevant).
10) Vendors and Supply Chain
BCP/DR requirements in contracts: RTO/RPO, mandatory tests, audit rights and joint exercises.
Register of sub-processors: contacts, outage plans, confirmation of data deletion/export when offboarding.
Quarterly Tier-1 reviews: incidents, DR test reports, certification status, SLAs.
11) Training, drills and testing
Tabletop exercises once a quarter: PSP/KYC/cloud/cyber scenarios.
Tech exercises: partial/full DR; DDoS/CDN switching; kill-switch for provider SDKs.
Communication drills: press releases/status updates/regulatory letters.
Retrospectives: timeline, RCA, CAPA, updates to runbooks and the BIA.
12) Metrics (KPI/KRI)
Actual RTO/RPO (Tier-1): targets met in ≥ 95% of cases.
MTTD/MTTR: downward trend; MTTR of critical incidents ≤ target.
Failover success: no loss of data/orders/bets, ≤ X minutes of degradation.
Exercise coverage: ≥ 2 full DR tests/year + 4 tabletops.
Communications: time to first update ≤ 15 minutes; update frequency according to policy.
Vendor resilience: the share of Tier-1 with confirmed DR tests in 12 months is 100%.
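Several of these indicators can be computed directly from incident records. The sketch below is illustrative: the `Incident` fields are hypothetical, and it measures both detection and recovery from the start of the incident, which is only one of several common definitions of MTTD/MTTR.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    detected_min: float    # minutes from incident start to detection
    recovered_min: float   # minutes from incident start to full recovery
    rto_target_min: float  # RTO target of the affected process


def rto_compliance(incidents):
    """Share of incidents recovered within their RTO target."""
    met = sum(1 for i in incidents if i.recovered_min <= i.rto_target_min)
    return met / len(incidents)


def mttd(incidents):
    """Mean time to detect, in minutes."""
    return mean(i.detected_min for i in incidents)


def mttr(incidents):
    """Mean time to recover, in minutes."""
    return mean(i.recovered_min for i in incidents)


# Hypothetical sample data for illustration only.
incidents = [Incident(5, 40, 240), Incident(10, 300, 240), Incident(3, 20, 30)]
print(f"RTO compliance: {rto_compliance(incidents):.0%}")  # RTO compliance: 67%
print(mttd(incidents), mttr(incidents))
```

With the sample data, two of three incidents met their RTO, so this period would miss a 95% goal and should feed into the retrospective.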
13) RACI (high-level)
14) Checklists
14.1 Failover readiness
- Current IC/Vendor/Regulator contacts
- Replication health, regular PITR backup
- SDK/Webhook kill-switch verified
- Traffic Manager (GSLB/CDN) with validated health-checks
- Status/letter templates and publishing rights
- Runbooks and accesses (SSO/MFA) reviewed monthly
14.2 During the incident
- IC assigned, war room open, decision log started
- Classification (P1/P2), scenario and degradation mode selected
- Technical actions (failover/limits/disconnections)
- First public update ≤ 15 minutes
- Regulator/partner notifications within SLA
- Artifacts captured for the post-mortem
14.3 After the incident
- Post-mortem with RCA and CAPA
- Updated BIA/thresholds/procedures
- Training, retest of fixes, board report
- Financial reconciliation
15) Templates (fragments)
15.1 Scenario card
scenario: "Region outage: cloud-eu1"
triggers: ["error_rate>5%", "loss of quorum", "cdn health fail"]
degradation: ["disable live-casino", "payments=psp2 only", "payouts=VIP manual"]
rto_target: "30m"
rpo_target: "15m"
contacts: {cloud: "...", isp: "...", regulator: "..."}
comms_templates: ["status_page_v1", "partner_notice_v2"]
15.2 Status page message
[UTC+02] We are seeing degraded payments through PSP #1. Transactions are automatically routed through an alternative provider. Player funds are safe. The next update is in 15 minutes.
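Status messages of this shape can be rendered from a template so that the time zone prefix and the next-update time are never typed by hand during an incident. This is a minimal sketch; `render_status` is a hypothetical helper, and it assumes whole-hour UTC offsets (fractional offsets would be truncated).

```python
from datetime import datetime, timedelta, timezone


def render_status(summary: str, now: datetime, next_update_min: int = 15) -> str:
    """Format a status-page entry with a UTC-offset prefix and next-update time."""
    offset = now.utcoffset()
    if offset is None:
        raise ValueError("now must be timezone-aware")
    hours = int(offset.total_seconds() // 3600)
    next_at = now + timedelta(minutes=next_update_min)
    return f"[UTC{hours:+03d}] {summary} Next update by {next_at:%H:%M}."


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))
print(render_status("Payments via PSP #1 are degraded. Player funds are safe.", now))
# [UTC+02] Payments via PSP #1 are degraded. Player funds are safe. Next update by 12:15.
```

Stating an absolute time for the next update, rather than "in 15 minutes", keeps the message accurate for readers who open the status page later.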
16) Document and version management
Versioning BCP/Runbooks in the repository, change-log, document owner.
Revision period (quarterly for Tier-1), checks that offline copies are available.
Storing drill/incident artifacts and performance metrics.
17) Implementation Roadmap (6-8 weeks)
Weeks 1-2: BIA and critical processes, RTO/RPO goals, list of scenarios and owners.
Weeks 3-4: resilience architecture and degradation modes, runbooks, communication templates, contacts.
Weeks 5-6: vendor integration (PSP/KYC/cloud), pilot exercises (tabletop + partial DR), adjustments.
Weeks 7-8: full DR test (if possible), launch of quarterly exercise cycle, board report and regulatory package (if required).
18) Related wiki sections
Risk Register, Incidents and Leaks, DR/BCP tests, TPRM and SLA, ISO 27001/27701, SOC 2, PCI DSS, IGA/RBAC/Least Privilege, Log Policy/WORM - together they form a single framework for resilience and auditability.
TL;DR
Effective BCP = BIA → RTO/RPO → scenarios and degradation modes → multi-vendor/multi-region + clear Incident Command, communications and exercises. Keep the document alive and test it regularly, and even a major outage won't stop the business or put licenses at risk.