Business Continuity Plan

1) Purpose, scope and principles

Purpose: to keep critical services (deposits, bets/games, withdrawals, KYC/AML, support) running during failures and to recover quickly without breaching licenses or contracts.
Scope: online platform, payment stack, anti-fraud/KYC, DWH/BI, support, operational and legal functions, key vendors (PSP/KYC/cloud/CDN/studios/aggregators).
Principles: safety first, player first, regulatory compliance, RTO/RPO minimization, simple degradation modes, provability (auditable evidence) and regular exercises.

2) BIA - Business Impact Analysis

Identify critical processes, inputs/outputs, dependencies, manual alternatives, and target RTO/RPOs.

Example of BIA fragment (YAML):

process: payouts
owner: head_of_payments
criticality: tier1
dependencies: [psp1, psp2, bank_api, kyc_service, ledger_db]
rto: "4h"
rpo: "15m"
manual_workaround: "limited manual VIP payments when the PSP is completely unavailable"
max_tolerable_downtime: "8h"
legal_constraints: ["AML/KYC check before payout", "regulatory notification windows"]
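
A BIA entry like the one above can be checked mechanically before it goes into the register. The following is a minimal sketch, not a prescribed tool: the helper names (`to_minutes`, `check_bia_entry`) and the three rules it applies (RTO must not exceed the maximum tolerable downtime, a tier1 process needs a manual workaround, dependencies must be listed) are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: converts strings such as "15m" or "4h" to minutes.
_UNITS = {"m": 1, "h": 60, "d": 1440}


def to_minutes(value: str) -> int:
    match = re.fullmatch(r"(\d+)([mhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_bia_entry(entry: dict) -> list:
    """Return a list of consistency problems found in one BIA entry."""
    problems = []
    rto = to_minutes(entry["rto"])
    mtd = to_minutes(entry["max_tolerable_downtime"])
    # Assumed rule: the recovery target must fit inside the tolerable downtime.
    if rto > mtd:
        problems.append("rto exceeds max_tolerable_downtime")
    # Assumed rule: every tier1 process documents a manual alternative.
    if entry.get("criticality") == "tier1" and not entry.get("manual_workaround"):
        problems.append("tier1 process has no manual workaround")
    if not entry.get("dependencies"):
        problems.append("no dependencies listed")
    return problems


payouts = {
    "process": "payouts",
    "criticality": "tier1",
    "dependencies": ["psp1", "psp2", "bank_api", "kyc_service", "ledger_db"],
    "rto": "4h",
    "rpo": "15m",
    "manual_workaround": "limited manual VIP payments",
    "max_tolerable_downtime": "8h",
}
print(check_bia_entry(payouts))  # prints []
```

Running such a check in the repository that stores the BIA would catch entries whose targets contradict each other before an exercise does.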

3) Risk → Impact → Response

Technical: cloud region outage, database failure, cluster loss, DDoS attacks, CDN failure.
Vendors: PSP/KYC degradation, loss of a game aggregator connection, unavailability of anti-fraud/sanctions screening.
Cyber: account/key compromise, ransomware, PII leak.
Processes/people: strikes/illness, departure of key specialists, release errors.
Geo/force majeure: communications/power outages, military/sanctions risks, domain/traffic blocking.

For each: triggers, escalation threshold, control measures, service degradation and communication templates.

4) Resilience architecture and strategies

Active-active/active-standby across regions; infrastructure as code for fast rebuilds.
Degradation modes: read-only storefront, disabling non-critical game providers, payment limits, "deposits only" with deferred cashouts (if legally permissible), reduced analytics/ETL frequency.
Traffic management: Anycast CDN, geo-balancing, health-checks, canary-routing.
Data: PITR backups, change logs, inter-region replication, cryptographic integrity (hashes/WORM).
Keys/secrets: an independent KMS per region, "break-glass" access with logging.
PSP/KYC multi-homing: automatic failover, SLA/latency routing.
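
The SLA/latency routing mentioned for PSP multi-homing can be sketched as a simple selection function. This is an illustration under stated assumptions: the `PspStatus` fields, the function name `pick_psp` and the default thresholds (80% approval rate, 2000 ms p95 latency) are hypothetical and would come from your own monitoring and contracts.

```python
from dataclasses import dataclass


@dataclass
class PspStatus:
    name: str
    healthy: bool          # result of the latest health check
    auth_rate: float       # share of approved transactions, 0..1
    p95_latency_ms: float


def pick_psp(statuses, min_auth_rate=0.80, max_latency_ms=2000.0):
    """Return the name of the best eligible PSP, or None if none qualifies."""
    eligible = [
        s for s in statuses
        if s.healthy
        and s.auth_rate >= min_auth_rate
        and s.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None  # caller falls back to a degradation mode
    # Prefer the highest approval rate; break ties on lower latency.
    best = max(eligible, key=lambda s: (s.auth_rate, -s.p95_latency_ms))
    return best.name


statuses = [
    PspStatus("psp1", True, 0.62, 900.0),   # approval rate below threshold
    PspStatus("psp2", True, 0.91, 1200.0),
]
print(pick_psp(statuses))  # prints psp2
```

Returning `None` rather than the "least bad" provider keeps the decision to enter a degradation mode explicit instead of hiding it inside the router.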

5) Incident Command System

Incident Commander (IC) - a single decision point.
Ops Lead (SRE/Platform) - technical stabilization, failover, metrics.
Business Continuity Lead - coordination of processes/manual procedures.
Comms Lead - external/internal notifications (players, partners, regulators).
Security/DPO - cyber incidents/privacy, regulatory windows.
Payments/KYC Leads - PSP/KYC scenarios.
Liaisons: Legal, Support, VIP/CRM, Data/BI.

Rule: one IC per incident, clear channels and decision logs.
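
One way to make the decision log provable after the fact is to chain entries by hash, so that any later edit is detectable. The sketch below is an assumption about how this could be done, not a description of a specific tool: the entry fields (`ts`, `author`, `decision`, `prev`, `hash`) and the function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body: dict) -> str:
    # Serialize with sorted keys so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, timestamp: str, author: str, decision: str) -> list:
    """Append a decision that commits to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"ts": timestamp, "author": author, "decision": decision, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain is unbroken."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("ts", "author", "decision", "prev")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "2024-01-01T12:00:00Z", "IC", "declare P1")
append_entry(log, "2024-01-01T12:06:00Z", "IC", "approve region_failover")
print(verify(log))  # prints True
```

A hash chain only detects tampering; keeping a copy in write-once storage is still needed to prevent the whole log from being replaced.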

6) Communications Plan

Channels: war-room (chat/bridge), backup connections (phone/radio/alt-messenger), pre-checked PSP/KYC/bank contacts.
External message templates: status page, social networks, email/push; tone - facts, timing, next steps.
Regulators and partners: pre-registered contact addresses, notification SLAs, pre-agreed wording.
Players: transparent ETAs, compensations/bonuses (if applicable), FAQs for the degradation period.

7) Operational Plans (Runbooks)

Examples of fragments:

7.1 Failover to another region

trigger: "loss of primary availability >= 5m, p95_latency > threshold"
steps:
- IC approves region_failover
- SRE: flip traffic via GSLB to secondary
- Data: verify replication lag < RPO
- Apps: switch env vars/secrets; warm caches
- QA: smoke tests
- Business: announce status
rollback: "switch-back on 60m stability"
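
The conditions in this runbook can be expressed as a pre-flight gate that the on-call tooling evaluates before traffic is moved. This is a sketch only: the function names and return labels are hypothetical, and checking replication lag before the traffic flip (rather than after it, as the step order lists it) is a design choice assumed here to avoid switching onto stale data.

```python
from datetime import timedelta


def failover_decision(outage, replication_lag, rpo, ic_approved,
                      min_outage=timedelta(minutes=5)):
    """Return the next action for a regional failover request."""
    if outage < min_outage:
        return "wait"                      # trigger threshold not reached yet
    if not ic_approved:
        return "await_ic_approval"         # only the IC approves region_failover
    if replication_lag >= rpo:
        return "escalate_data_loss_risk"   # secondary is staler than the RPO allows
    return "proceed"


def can_switch_back(stable_for, required=timedelta(minutes=60)):
    """Rollback rule: switch back only after a sustained period of stability."""
    return stable_for >= required


print(failover_decision(
    outage=timedelta(minutes=7),
    replication_lag=timedelta(minutes=2),
    rpo=timedelta(minutes=15),
    ic_approved=True,
))  # prints proceed
```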

7.2 PSP degradation

trigger: "auth_rate_psp1 < baseline-3σ for 15m"
steps:
- Payments: route X% → psp2, enable limits
- Comms: banner at checkout, status page
- Finance: reconciliation plan for T+0
- Legal: notification log and SLA letter
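
The 3σ trigger in this runbook can be evaluated as follows. The sketch assumes one approval-rate sample per minute, so the last 15 samples cover the 15-minute window, and it uses the sample standard deviation of a baseline period; the function name `psp_degraded` and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def psp_degraded(baseline, recent, sigmas=3.0):
    """True if every recent sample is below mean(baseline) - sigmas * stdev(baseline)."""
    if len(baseline) < 2 or not recent:
        return False  # not enough data to judge
    threshold = mean(baseline) - sigmas * stdev(baseline)
    return all(sample < threshold for sample in recent)


# Illustrative approval rates for a normal period (one value per minute).
baseline = [0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91, 0.92]

print(psp_degraded(baseline, [0.80] * 15))           # prints True
print(psp_degraded(baseline, [0.80] * 14 + [0.90]))  # prints False
```

Requiring every sample in the window to breach the threshold avoids rerouting on a single bad minute; a softer rule (for example, most samples) would react faster at the cost of more false alarms.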

7.3 KYC provider unavailable

trigger: "kyc_sla_breach for 30m"
steps:
- Risk: temporary limits on deposits/bets
- Ops: manual checks for VIP/high-risk players
- Comms: notice of longer KYC processing times
- Vendor: escalation, switch to backup provider

8) IT and Data Recovery (DR)

System categories: Tier-1 (platform/payments/KYC), Tier-2 (games/analytics), Tier-3 (internal).
Recovery order: network → secrets/KMS → DB → cache → API → front end/CDN → integrations → analytics.
Integrity checks: checksums, log/replication verification, transaction reconciliation.
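
A fixed recovery order is easiest to enforce when the tooling refuses to start a layer until the one before it is healthy. The sketch below reads the order above as network, secrets/KMS, database, cache, API, front end/CDN, integrations, analytics; the layer identifiers and the `bring_up` callback are hypothetical stand-ins for your own automation.

```python
RECOVERY_ORDER = [
    "network", "secrets_kms", "database", "cache",
    "api", "frontend_cdn", "integrations", "analytics",
]


def recover(bring_up, order=RECOVERY_ORDER):
    """Bring layers up in order and stop at the first failure.

    bring_up(layer) is expected to start the layer, run its health and
    integrity checks, and return True on success.
    Returns (recovered_layers, failed_layer_or_None).
    """
    recovered = []
    for layer in order:
        if not bring_up(layer):
            return recovered, layer
        recovered.append(layer)
    return recovered, None


# Example: the cache layer fails its health check.
done, failed = recover(lambda layer: layer != "cache")
print(done, failed)  # prints ['network', 'secrets_kms', 'database'] cache
```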

DR tests: full (switch-over) annually, partial quarterly; record the actual RTO/RPO achieved.

9) People, offices and logistics

Remote-ready: spare laptops/modems, access via SSO/MFA, break-glass ("red") access for the IC.
Alternative locations: spare offices/coworking spaces, pass lists, evacuation plan.
Shift rotation: competence matrix, backups for key roles, replacement plan.
Critical communication/power providers: contacts, SLA, generators/UPS (if relevant).

10) Vendors and Supply Chain

BCP/DR requirements in contracts: RTO/RPO, mandatory tests, audit rights and joint exercises.
Register of sub-processors: contacts, outage plans, confirmation of data deletion/export when offboarding.
Quarterly Tier-1 vendor reviews: incidents, DR test reports, certification status, SLAs.

11) Training, drills and testing

Tabletop once a quarter: PSP/KYC/cloud/cyber scenarios.
Tech exercises: partial/full DR; DDoS/CDN switching; kill-switch for provider SDKs.
Communication drills: press releases/status updates/regulatory letters.
Retrospectives: timeline, RCA, CAPA, updates to runbooks and the BIA.

12) Metrics (KPI/KRI)

Actual RTO/RPO (Tier-1): targets met in ≥ 95% of cases.
MTTD/MTTR: downward trend; MTTR of critical incidents ≤ target.
Failover success: no loss of data/orders/bets, ≤ X minutes of degradation.
Exercise coverage: ≥ 2 full DR tests/year + 4 tabletops.
Communications: time to first update ≤ 15 minutes; update frequency per policy.
Vendor resilience: 100% of Tier-1 vendors with a confirmed DR test in the last 12 months.
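
Several of these indicators can be computed directly from incident records. The following sketch assumes each incident contributes a recovery time in minutes and a delay to its first public update; the function name `kpi_report`, the output keys and the sample numbers are hypothetical.

```python
from datetime import timedelta
from statistics import mean


def kpi_report(recovery_minutes, rto_target_minutes, first_update_delays):
    """Summarize RTO compliance, MTTR and first-update timeliness."""
    met = [minutes <= rto_target_minutes for minutes in recovery_minutes]
    rto_share = sum(met) / len(met)
    return {
        "rto_share": rto_share,                       # share of incidents within RTO
        "rto_goal_met": rto_share >= 0.95,            # goal: at least 95%
        "mttr_minutes": mean(recovery_minutes),
        "first_update_ok": all(
            delay <= timedelta(minutes=15) for delay in first_update_delays
        ),
    }


report = kpi_report(
    recovery_minutes=[180, 200, 250, 120],   # illustrative values, RTO = 4h
    rto_target_minutes=240,
    first_update_delays=[timedelta(minutes=10), timedelta(minutes=14)],
)
print(report["rto_share"], report["rto_goal_met"], report["mttr_minutes"])
# prints 0.75 False 187.5
```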

13) RACI (high-level)

Activity	IC	SRE/Platform	Security/DPO	Payments	Risk/KYC	Product	Support/CRM	Legal/Compliance	Comms/PR	Data/BI
Declaration of incident	A/R	R	R	R	R	C	C	C	C	C
Technical stabilization/failover	C	A/R	C	C	C	C	I	I	I	C
PSP/KYC routing	C	C	I	A/R	A/R	C	I	I	I	I
Communications	A	I	C	C	C	C	C	C	R	I
Regulatory Notices	I	I	A/R	C	C	I	I	R	I	I
Post-mortem/CAPA	A/R	R	R	R	R	R	R	C	C	R
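
A RACI matrix can be linted for ownership ambiguity. The sketch below transcribes the table above and lists activities that do not have exactly one Accountable role. That "exactly one A" rule is a common RACI convention assumed here, not a requirement stated in this plan, and shared accountability may be intentional; the names `ROLES`, `RACI` and `raci_findings` are hypothetical.

```python
ROLES = ["IC", "SRE/Platform", "Security/DPO", "Payments", "Risk/KYC",
         "Product", "Support/CRM", "Legal/Compliance", "Comms/PR", "Data/BI"]

# One string per activity, cells in the same order as ROLES.
RACI = {
    "Declaration of incident":          "A/R R R R R C C C C C",
    "Technical stabilization/failover": "C A/R C C C C I I I C",
    "PSP/KYC routing":                  "C C I A/R A/R C I I I I",
    "Communications":                   "A I C C C C C C R I",
    "Regulatory Notices":               "I I A/R C C I I R I I",
    "Post-mortem/CAPA":                 "A/R R R R R R R C C R",
}


def accountable_roles(cells: str) -> list:
    """Return the roles whose cell contains an 'A'."""
    return [role for role, cell in zip(ROLES, cells.split()) if "A" in cell.split("/")]


def raci_findings(raci: dict) -> dict:
    """Map each activity without exactly one Accountable role to its 'A' roles."""
    findings = {}
    for activity, cells in raci.items():
        owners = accountable_roles(cells)
        if len(owners) != 1:
            findings[activity] = owners
    return findings


print(raci_findings(RACI))  # prints {'PSP/KYC routing': ['Payments', 'Risk/KYC']}
```

If joint accountability for PSP/KYC routing is deliberate, it is worth recording who makes the final call when the two leads disagree during an incident.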

14) Checklists

14.1 Ready-to-Failover

Current IC/Vendor/Regulator contacts
Replication health, regular PITR backup
SDK/Webhook kill-switch verified
Traffic Manager (GSLB/CDN) with validated health-checks
Status/letter templates and publishing rights
Runbooks and accesses (SSO/MFA) reviewed monthly

14.2 During the incident

IC assigned, war-room open, decision log started
Classification (P1/P2), scenario and degradation mode selected
Technical actions (failover/limits/disconnections)
First public update ≤ 15 minutes
Regulator/partner notifications within SLA
Capturing artifacts for post-mortem

14.3 After the incident

Post-mortem with RCA and CAPA
Updated BIA/thresholds/runbooks
Training, retest of fixes, board report
Financial reconciliation

15) Templates (fragments)

15.1 Scenario card

scenario: "Region outage: cloud-eu1"
triggers: ["error_rate>5%", "loss of quorum", "cdn health fail"]
degradation: ["disable live-casino", "payments=psp2 only", "payouts=VIP manual"]
rto_target: "30m"
rpo_target: "15m"
contacts: {cloud: "...", isp: "...", regulator: "..."}
comms_templates: ["status_page_v1", "partner_notice_v2"]

15.2 Status page message

[UTC+02] We are seeing degraded payments through PSP #1. Transactions are automatically routed through an alternative provider. Player funds are safe. The next update is in 15 minutes.
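
Status messages are easier to publish within the 15-minute window when they are rendered from a template. This is a minimal sketch: the template wording follows the message above, while the function name `render_status`, the timestamp format and the choice to state an absolute next-update time are assumptions.

```python
from datetime import datetime, timedelta, timezone

TEMPLATE = (
    "[{stamp}] We are seeing degraded payments through {psp}. "
    "Transactions are automatically routed through an alternative provider. "
    "Player funds are safe. The next update is at {next_update}."
)


def render_status(now: datetime, psp: str, interval_minutes: int = 15) -> str:
    """Fill the template with the current time and the promised next update."""
    next_update = now + timedelta(minutes=interval_minutes)
    return TEMPLATE.format(
        stamp=now.strftime("%H:%M UTC%z"),
        psp=psp,
        next_update=next_update.strftime("%H:%M"),
    )


# Illustrative timestamp in a UTC+02 zone.
now = datetime(2024, 1, 1, 14, 5, tzinfo=timezone(timedelta(hours=2)))
print(render_status(now, "PSP #1"))
```

Stating an absolute time for the next update avoids ambiguity when players read the message several minutes after it was posted.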

16) Document and version management

Versioning BCP/Runbooks in the repository, change-log, document owner.
Review cycle (quarterly for Tier-1); checks that offline copies are available.
Storing drill/incident artifacts and performance metrics.

17) Implementation Roadmap (6-8 weeks)

Weeks 1-2: BIA and critical processes, RTO/RPO goals, list of scenarios and owners.
Weeks 3-4: resilience architecture and degradation modes, runbooks, communication templates, contacts.
Weeks 5-6: vendor integration (PSP/KYC/cloud), pilot exercises (tabletop + partial DR), adjustments.
Weeks 7-8: full DR test (if possible), launch of quarterly exercise cycle, board report and regulatory package (if required).

18) Related wiki sections

Risk Register, Incidents and Leaks, DR/BCP tests, TPRM and SLA, ISO 27001/27701, SOC 2, PCI DSS, IGA/RBAC/Least Privilege, Log Policy/WORM - together they form a single framework of resilience and provability.

TL;DR

Effective BCP = BIA → RTO/RPO → scenarios and degradation modes → multi-vendor/multi-region + clear Incident Command, communications and exercises. Keep the document alive and test regularly, and even a major outage won't stop the business or put licenses at risk.
