Data Stores and OLAP Models
(Section: Technology and Infrastructure)
Brief Summary
Data warehouse (DWH) is iGaming's core analytics layer: regulatory reports, profitability by product/market, cohort LTV, anti-fraud analytics, CRM segmentation and real-time dashboards. A sustainable DWH rests on a clear data model (Star/Snowflake/Data Vault), robust integration (ETL/ELT + CDC), deliberate performance engineering (columnar engines, partitioning, materialized views), strict metric semantics, security/PII handling, and cost management.
Architectural approaches
Classic DWH (Kimball vs Inmon)
Kimball (Dimensional/Star/Snowflake): fast reporting use cases; focus on facts and dimensions, SCD history. Quick time-to-value.
Inmon (Corporate Information Factory): normalized core + data marts; slower to build, but strict and centralized.
Data Vault 2.0
Hubs-Links-Satellites: a scalable "raw" model for source integration and change auditing. Star-schema marts are built on top.
Data Lake / Lakehouse
Data Lake: raw files (Parquet/ORC) + a catalog (Hive Metastore/Glue/Unity Catalog).
Lakehouse: a single layer for batch and streaming with ACID tables (Delta/Iceberg/Hudi), time-travel, upsert/merge, file compaction, Z-order/clustering.
Medallion (Bronze–Silver–Gold)
Bronze: raw data (raw) + CDC.
Silver: cleansed and conformed data.
Gold: business cases/metrics/cubes.
Suits hybrid setups (Kafka → Bronze; Silver in the Lakehouse; Gold in ClickHouse/BigQuery/Snowflake).
OLAP models: Star, Snowflake, Data Vault
Star Schema
Fact tables: transactions (bets, deposits, sessions).
Dimensions: player, game, provider, date/time, geo, acquisition channel.
Pros: simple joins, predictable performance.
Snowflake
Normalized dimensions (country/region/city hierarchies, product hierarchies).
Pros: less duplication; cons: more joins.
Data Vault → Star
Land raw changes in the DV layer (audit, full reproducibility) and build reporting marts as Star/Snowflake on top; a DDL sketch follows.
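A minimal DDL sketch of the Hub/Satellite pattern (table and column names are illustrative, generic SQL):
sql
-- Hub: stable business key plus a surrogate hash key
CREATE TABLE hub_player (
  player_hk CHAR(32) PRIMARY KEY, -- e.g. md5 of the business key
  player_id BIGINT NOT NULL,
  load_ts TIMESTAMP NOT NULL,
  record_src VARCHAR(50) NOT NULL
);
-- Satellite: descriptive attributes with full change history
CREATE TABLE sat_player_profile (
  player_hk CHAR(32) NOT NULL REFERENCES hub_player(player_hk),
  load_ts TIMESTAMP NOT NULL,
  country VARCHAR(2),
  vip_level VARCHAR(20),
  hash_diff CHAR(32) NOT NULL, -- detects attribute changes between loads
  PRIMARY KEY (player_hk, load_ts)
);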
Integration: ETL/ELT, CDC, slow changes
Pipeline
Outbox/CDC from OLTP (Postgres/MySQL) → Kafka → connectors → Bronze.
ELT: cleansing, dedup, normalization in Silver (see the dedup sketch after this list).
Business logic and aggregations in Gold/marts.
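A minimal Silver-step dedup sketch: CDC can deliver the same row several times, so keep only the newest version per business key (bronze_bets and updated_at are assumed names):
sql
-- keep only the latest CDC version of every bet
CREATE TABLE silver_bets AS
SELECT *
FROM (
  SELECT b.*,
         ROW_NUMBER() OVER (PARTITION BY bet_id
                            ORDER BY updated_at DESC) AS rn
  FROM bronze_bets b
) t
WHERE rn = 1;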
SCD (Slowly Changing Dimensions)
Type 1: overwrite in place (for non-essential fields).
Type 2: history via dated versions; the standard for profiles/channels/prices.
Type 3: keeps previous and current values side by side (rare).
A generic SCD2 upsert sketch (gen_sk() stands in for a surrogate-key generator):
sql
-- insert a new dimension version for new or changed players
INSERT INTO dim_player (player_sk, player_id, country, valid_from, valid_to, is_current)
SELECT gen_sk(), s.player_id, s.country, now(), '9999-12-31', true
FROM staging_player s
LEFT JOIN dim_player d
  ON d.player_id = s.player_id AND d.is_current = true
WHERE d.player_id IS NULL OR d.country <> s.country;

-- close the previous version
UPDATE dim_player d
SET valid_to = now(), is_current = false
FROM staging_player s
WHERE d.player_id = s.player_id
  AND d.is_current = true
  AND d.country <> s.country;
Semantic layer and a single source of truth for metrics
Introduce a single semantic layer with canonical definitions: GGR, NGR, Net Deposits, ARPPU, LTV, Churn, Retention Cohorts.
Metrics as code (dbt metrics/LookML/Semantic Layer) → the same formulas in every report (see the view sketch after this list).
Calendar: a date/hour table with time-zone, region, weekend/holiday and campaign attributes.
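One way to pin the definitions down is a single view that every dashboard reads. A sketch below; bonus_cost and fee are assumed columns of fact_bets:
sql
-- canonical GGR/NGR definitions: all reports select from this view
CREATE VIEW metrics_daily AS
SELECT
  DATE(ts) AS d,
  SUM(stake - win) AS ggr,
  SUM(stake - win) - SUM(bonus_cost) - SUM(fee) AS ngr
FROM fact_bets
GROUP BY DATE(ts);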
Storage and engines: choosing by workload profile
Column and Cloud DWH
ClickHouse: ultrafast scans/aggregations, materialized views, projections; excellent for events/telemetry and marketing marts.
BigQuery: serverless, scales automatically, automatic caching/clustering; pay-per-scan pricing; convenient for mixed workloads and ad-hoc analysis.
Snowflake: compute/storage separation, on-demand virtual warehouses, time-travel; convenient for isolating the workloads of different teams.
Redshift/Vertica/Pinot/Druid: options for OLAP/real-time.
Tuning for the workload
Partitioning by date/region/channel.
Clustering/sorting by filter and join keys.
Dictionary compression and encoding.
Pre-aggregations (rollups, cubes), materialized views.
Approximate functions (HyperLogLog/approx_distinct) for cheap distinct-count estimates; see the sketch after this list.
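A sketch of the trade-off (APPROX_COUNT_DISTINCT is the BigQuery/Spark spelling; ClickHouse calls it uniq):
sql
-- exact: shuffles every distinct player_sk, expensive on large facts
SELECT COUNT(DISTINCT player_sk) FROM fact_bets;
-- approximate: HyperLogLog sketch, ~1-2% error at a fraction of the cost
SELECT APPROX_COUNT_DISTINCT(player_sk) FROM fact_bets;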
Performance Engineering
Partitioning and clustering
A partition is the pruning boundary; use day/hour partitions for event data.
Clustering (sort keys/Z-order) accelerates range scans and joins; a DDL sketch follows.
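A BigQuery-flavored DDL sketch of partitioning plus clustering (dataset and column names are illustrative):
sql
-- day partitions bound the scan; clustering sorts rows inside each partition
CREATE TABLE analytics.fact_bets (
  bet_id INT64,
  player_sk INT64,
  country STRING,
  stake NUMERIC,
  win NUMERIC,
  ts TIMESTAMP
)
PARTITION BY DATE(ts)
CLUSTER BY country, player_sk;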
Materialized Views (MVs)
Pre-aggregated GGR/NGR by day/country/product.
Incremental updates from the CDC stream.
sql
-- ClickHouse: SummingMergeTree sums rows sharing the same ORDER BY key on merge
CREATE MATERIALIZED VIEW mv_ggr_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(d)
ORDER BY (country, product_id, d)
AS
SELECT toDate(ts) AS d,
       country,
       product_id,
       sum(stake) AS stake_sum,
       sum(win) AS win_sum,
       sum(stake - win) AS ggr
FROM bets
GROUP BY d, country, product_id;
Incremental Models (dbt/ELT)
Strategies: insert_overwrite by partition, merge by CDC keys, watermark on updated_at. A dbt-style sketch follows.
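A dbt-style sketch of a merge-strategy incremental model with an updated_at watermark (model and source names are assumptions):
sql
-- models/silver_bets.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='bet_id'
) }}
SELECT *
FROM {{ source('bronze', 'bets') }}
{% if is_incremental() %}
  -- watermark: only rows newer than what the target already holds
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}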
Join-strategies
Replicate dimensions to every fact shard (denormalization).
Broadcast small dims; shuffle large facts sorted by the join key (see the hint sketch below).
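In Spark SQL terms, the broadcast side can be forced with a hint; a sketch:
sql
-- ship the small dimension to every executor instead of shuffling the fact
SELECT /*+ BROADCAST(c) */
       f.bet_id, c.region, f.stake
FROM fact_bets f
JOIN dim_country c ON c.country_sk = f.country_sk;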
Cost: control and optimization
BigQuery/Snowflake: limit scan size (design partitions/clusters), enable result caches/materialized views, cap BI auto-refresh queries.
ClickHouse: part sizes, merge frequency, storage budget (TTL for raw events, long-lived aggregates; a TTL sketch follows this list).
A single metrics semantics avoids computing the same numbers twice.
Data pruning: retention for Bronze, aggregation for Gold.
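A ClickHouse TTL sketch for the "raw expires, aggregates live long" policy (the 90-day window is an assumption):
sql
-- raw events are dropped after 90 days; mv_ggr_daily keeps the daily history
ALTER TABLE bets
  MODIFY TTL ts + INTERVAL 90 DAY;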
Data quality (DQ), catalog, lineage
DQ checks: completeness, uniqueness, value ranges, business rules (for example, GGR ≥ 0 in aggregates); a sketch follows this list.
Data Catalog & Lineage: table/field descriptions, owners, PII classification, report-to-source tracing.
Schema control: contracts for events/CDC, alerts on incompatible changes.
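A minimal DQ assertion in plain SQL (dbt tests or Great Expectations wrap the same idea); it checks the mart defined later in this section:
sql
-- must return zero rows; any hit is a DQ violation worth alerting on
SELECT d, region, ggr
FROM mart_ggr_daily
WHERE ggr < 0;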
Safety, compliance and multi-tenancy
PII segmentation: separate zones, masking/pseudonymization, column-level KMS encryption.
RBAC/ABAC: roles at the project/schema/table/row (RLS) level, need-to-know access rules; an RLS sketch follows this list.
Data residency: regional buckets/warehouses (EU/TR/LATAM).
Access audit: who read or changed marts and models.
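A Postgres-flavored RLS sketch, assuming the Gold mart lands in a regular table and each session is tagged with an app.region setting (both assumptions):
sql
-- analysts only see rows for their own region
ALTER TABLE mart_ggr_daily ENABLE ROW LEVEL SECURITY;
CREATE POLICY region_isolation ON mart_ggr_daily
  USING (region = current_setting('app.region'));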
DR, backups and reproducibility
Data-code versioning (dbt/git), Dev/QA/Prod environments.
Metastore/catalog snapshots + table time-travel.
Retention/TTL per layer (Bronze/Silver/Gold); export of critical marts.
Game days: restore the marts, verify the integrity of metrics.
Real-time and hybrid storefronts
Stream-to-OLAP: Kafka → ClickHouse/Pinot/Druid for minute-level marts (a Kafka-engine sketch follows this list).
Materialized views + CDC for near-online updates (5-15 min).
The semantic layer stays the same: metrics are identical in real-time and batch.
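A ClickHouse sketch of the Kafka → OLAP leg (broker, topic and consumer-group names are assumptions):
sql
-- Kafka engine table: a streaming consumer exposed as a table
CREATE TABLE bets_queue (
  bet_id UInt64,
  country String,
  product_id UInt32,
  stake Decimal(18,2),
  win Decimal(18,2),
  ts DateTime
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'bets',
         kafka_group_name = 'ch_bets_consumer',
         kafka_format = 'JSONEachRow';
-- a materialized view pumps messages from the queue into the bets table
CREATE MATERIALIZED VIEW bets_consumer TO bets AS
SELECT * FROM bets_queue;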
Example mart: GGR by day and region (generic SQL)
sql
CREATE TABLE fact_bets (
  bet_id BIGINT,
  player_sk BIGINT,
  game_sk BIGINT,
  country_sk BIGINT,
  stake DECIMAL(18,2),
  win DECIMAL(18,2),
  ts TIMESTAMP
) PARTITION BY DATE(ts);

CREATE TABLE dim_country (
  country_sk BIGINT PRIMARY KEY,
  iso2 STRING,
  region STRING,
  valid_from TIMESTAMP,
  valid_to TIMESTAMP,
  is_current BOOL
);

-- the mart itself
CREATE MATERIALIZED VIEW mart_ggr_daily AS
SELECT
  DATE(ts) AS d,
  c.region,
  SUM(stake) AS stake_sum,
  SUM(win) AS win_sum,
  SUM(stake - win) AS ggr
FROM fact_bets f
JOIN dim_country c ON c.country_sk = f.country_sk AND c.is_current
GROUP BY d, c.region;
Implementation checklist
1. Define sources and domains, fix the metric dictionary.
2. Select the model: DV for the raw/audited layer + Star for the marts.
3. Design partitions/clusters around the key queries and time windows.
4. Configure CDC/ELT, SCD policy, and surrogate keys.
5. Introduce a semantic layer (metrics as code) and a date/hour calendar.
6. Create MVs/preaggregations for expensive reports.
7. Enable DQ/directory/lineage and schema control.
8. Define RBAC/PII/localization, encryption, auditing.
9. Set up p95/p99 and cost monitoring, with alerts on degradation and overruns.
10. Regular DR exercises and reproducibility of environments.
Anti-patterns
"One gigantic fact without parties" → terabyte scans and the score is growing.
Inconsistent definitions of metrics in different dashboards.
Lack of SCD2 where business requires historicity.
Premature normalization of measurements: unnecessary joynes and slow reports.
Raw data without DQ checks and lineage → "nothing" reports.
Absence of permission/TTL → storage of garbage and explosion of cost.
Summary
A reliable iGaming DWH rests on a clear model (DV → Star), a single metric dictionary, correct partitioning/clustering, materialized marts, strict DQ/lineage, and RBAC/PII/data residency. Add hybrid streaming for freshness, ELT discipline and cost control, and you get a sustainable analytics platform that scales to tournaments, regulatory reports and ad-hoc research with no surprises in p99 or the budget.