Data warehouses
1) Purpose and role of DWH in iGaming
DWH is the central data consolidation and serving layer for reporting, analytics, compliance and ML. It provides:- Common metric definitions (GGR/NGR, ARPPU, Retention, Churn).
- Reproducible reports for regulators and internal stakeholders.
- Fast storefronts for BI/operating panels and sources for models.
- Platform-level quality control, lineage, and safety.
2) Architectural options
2. 1 Classic DWH
ETL → DWH → BI.
Pros: Manageable models, strong consistency.
Cons: expensive downloads, complex backfill, limited flexibility.
2. 2 Lakehouse DWH
Bronze/Silver/Gold on ACID tables (Delta/Iceberg/Hudi) + SQL/MPP engine.
Pros: unified storage, time-travel, simple reprocessing.
Cons: requires discipline of layers and DQ, mature orchestration.
2. 3 Hybrid
Lakehouse as a "source of truth" (Bronze/Silver), DWH-March in MPP (ClickHouse/Pinot/Druid/Cloud DWH) for high-speed reading.
Pros: balance of cost and performance, flexible storefronts.
Cons: dual support for circuits and skating, synchronization is needed.
Recommendation: for iGaming - Lakehouse + DWH-March (hybrid). Bronze/Silver - standardize, Gold/Real-time marts - serve reading loads.
3) Data modeling
3. 1 Star and Snowflake
Fact tables: narrow, event-driven: 'fact _ bets', 'fact _ payouts', 'fact _ payments'.
Dimensions: 'dim _ users' (SCD), 'dim _ games', 'dim _ providers', 'dim _ markets'.
Snowflake is appropriate in Silver (normalization), Star - in Gold (reading).
3. 2 Data Vault 2. 0 (integration core)
Hubs (business keys), Links (relationships), Satellites (context/history).
Apply in Silver for long-lived provider/PSP integrations.
3. 3 SCD I/II/III
SCD II for RG/KYC/channels and game attributes (RTP/volatility).
Strict intervals' valid _ from/valid _ to ', correct join in time.
4) Load: ETL/ELT, CDC and increments
ELT approach: loading in Silver → transformation in DWH.
CDC: Debezium/log replication from OLTP; merzhi are idempotent.
Increments: by time water ('updated _ at> max_loaded_ts') and/or hash delta.
Backfill/Reprocessing: time-travel, ranges, quotas, dry-run comparisons.
sql
MERGE INTO silver. payments s
USING stage. payments_delta d
ON s. transaction_id = d. transaction_id
WHEN MATCHED THEN UPDATE SET
WHEN NOT MATCHED THEN INSERT;
5) Semantic layer and metrics
Metrics Store/Semantic Layer: uniform formulas GGR/NGR/Conversion/LTV.
Versioning metrics and "as-of" computation for reproducibility.
Conventions are metric names, units, currency (base EUR), and 'fx _ source'.
6) Storefronts and Serving
Gold showcases: denormalized, SLA ready (for example, until 06:00 lock.) .
Operational Marts: ClickHouse/Pinot/Druid for 1-5 minute panels.
Export: CSV/JSON/PDF + hash; immutable packets (WORM) for regulators.
sql
CREATE OR REPLACE VIEW gold. ggr_daily AS
SELECT
DATE(b. event_time) AS event_date,
b. market,
g. provider_id,
SUM(b. stake_base) AS stakes_eur,
SUM(p. amount_base) AS payouts_eur,
SUM(b. stake_base) - SUM(p. amount_base) AS ggr_eur
FROM silver. fact_bets b
LEFT JOIN silver. fact_payouts p
ON p. user_pseudo_id = b. user_pseudo_id
AND p. game_id = b. game_id
AND DATE(p. event_time) = DATE(b. event_time)
JOIN dim. games g ON g. game_id = b. game_id
GROUP BY 1,2,3;
7) Data quality (DQ) and contracts
Schema-first: JSON/Avro registry + compatibility tests (consumer-driven).
DQ-как-код: completeness/validity/uniqueness/FK/range/temporal.
Reaction policies: critical → fail + DLQ; major/minor → tag and report.
DQ observability: Freshness/Completeness/Validity dashboards, lost records funnel.
8) Security, privacy and residency
PII minimization: users via pseudo-ID; mappings separately.
RLS/CLS: Line-by-line/post-table access by role and jurisdiction.
Encryption: TLS in-transit; at-rest - KMS/CMK with rotation.
Data Residency: separate directories and keys for EEA/UK/BR; banning cross-regional joins without reason.
DSAR/RTBF: computable projections and selective edits; Legal Hold on reporting artifacts.
9) Performance and Cost (Cost Engineering)
Partitioning: by date/market/tenant; clustering/Z-order by 'market', 'provider _ id', 'game _ id', 'user _ pseudo _ id'.
Formats: Parquet + statistics and compression; OPTIMIZE/VACUUM on schedule.
Materialization: stable aggregates and summary tables; avoid "fat" joins on the fly.
Quotas/Chargeback: budgets for heavy requests/replays; reports cost/query, cost/GB.
Tiered storage: hot/warm/cold; clear recovery SLAs.
10) Observability and management
Pipeline metrics: duration, volumes, retrays, lags, fault tolerance.
DWH metrics: response time/competitiveness/cache hits/value.
Lineage: graph from sources to reports; impact analysis on changes.
SLO: Freshness Silver p95 ≤ 15 мин; Gold daily - ready until 06:00; Validity ≥ 99. 9%; Completeness ≥ 99. 5%; availability ≥ 99. 9%.
11) Multi-tenancy and domain isolation
Division by schema/database/catalog into tenant/market.
Quotas and resource groups; limiting "noisy neighbors."
Export/import policies between tenants, standardized contracts.
12) Data register and documentation
Data Catalog: owner, SLA, schema, examples, DQ rules, lineage.
Metrics/dashboards: cards with formulas and responsible.
Change Log: versions of logic, migrations, impact.
13) Processes and RACI
R (Responsible): Data Engineering (models Silver/Gold, DAG'i), Data Platform (infra, registry, DQ).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/Legal/DPO, Finance (FX/GGR), Risk (RG/AML), SRE (SLO/стоимость).
I (Informed): BI, Product, Marketing, Operations.
14) Implementation Roadmap
MVP (4-6 weeks):1. Lakehouse Bronze/Silver (ACID tables), CDC/increments for Payments/Gameplay.
2. First Gold showcases (GGR Daily, conversion), SLAs until 06:00.
3. DQ-like-code (10-15 rules) + Freshness/Completeness dashboards.
4. Data catalog and base semantic layer of metrics.
Phase 2 (6-12 weeks):- SCD II для users/games/providers; domain expansion.
- Online March (ClickHouse/Pinot) for real-time/near-real-time panels.
- Lineage/impact analysis, DSAR/RTBF procedures, regionalization (EEA/UK).
- Auto-simulation of changes (dry-run), replay and comparison of metrics.
- Chargeback/quotas, cost-dashboards; DR exercises and time-travel recovery.
- Auto-generation of showcase documentation and metrics cards.
15) Examples of SQL templates
Actual rates (Silver, 3NF):sql
CREATE TABLE silver. fact_bets (
bet_id STRING PRIMARY KEY,
user_pseudo_id STRING NOT NULL,
game_id STRING NOT NULL,
stake_ccy DECIMAL(18,2) NOT NULL,
currency CHAR(3) NOT NULL,
stake_base DECIMAL(18,2) NOT NULL,
market CHAR(2) NOT NULL,
event_time TIMESTAMP NOT NULL
);
Connection to SCD II (get RG status at the time of bet):
sql
SELECT b. bet_id, u. rg_status
FROM silver. fact_bets b
JOIN dim. users_scd u
ON u. user_pseudo_id = b. user_pseudo_id
AND b. event_time >= u. valid_from
AND (u. valid_to IS NULL OR b. event_time < u. valid_to);
Completeness control by market:
sql
SELECT market, DATE(event_time) d, COUNT() n
FROM silver. fact_bets
GROUP BY market, DATE(event_time)
HAVING n = 0;
16) Pre-sale checklist
- Schemes and contracts in the registry, compatibility tests are green.
- CDC/increments and MERGE procedures are idempotent.
- Gold showcases have SLAs, metric formulas are fixed.
- DQ rules are active (critical → fail + DLQ), Freshness/Completeness dashboards.
- RBAC/ABAC, encryption, residency by region, access logs.
- Lineage/impact enabled; time-travel/backup/DR checked.
- Cost under control: parties, clustering, materialization, quotas.
17) Anti-patterns and risks
"One fat DWH without layers": a mixture of raw and reported data → chaos and expensive fixes.
Full reload daily needlessly: use increments/CDC.
Gold without owner and formulas: lack of a single version of the truth → disputes and regressions.
PII in analytical layers: keep mappings separate, CLS/RLS.
No DQ/lineage: no evidence for regulators/audit.
Unmanageable cost: no batches/optimizations/quotas.
18) Glossary (brief)
DWH is a data warehouse for consolidation and analytics.
Lakehouse - data lake + ACID tables and SQL engine.
CDC - Capture changes from OLTP.
SCD - slowly changing measurements (I/II/III).
Gold showcase - ready-to-consume report sheet/presentation.
Semantic Layer - uniform definitions of metrics and attributes.
19) The bottom line
Modern DWH for iGaming is not a "big table," but a manageable platform: Bronze/Silver/Gold layers, strict contracts and DQ, uniform metrics and lineage, privacy and residency, performance and efficiency. By building a Lakehouse + DWH-March hybrid, you will have fast and verifiable decision making ready for audit, scale and new markets.