
Data Stores and OLAP Models

(Section: Technology and Infrastructure)

Brief Summary

A data warehouse (DWH) is iGaming's core analytics layer: regulatory reports, profitability by product/market, cohort LTV, anti-fraud analytics, CRM segmentation, and real-time dashboards. A sustainable DWH is built on a clear data model (Star/Snowflake/Data Vault), robust integration (ETL/ELT + CDC), deliberate performance engineering (columnar engines, partitioning, materialized views), strict metric semantics, security/PII controls, and cost management.

Architectural approaches

Classic DWH (Kimball vs Inmon)

Kimball (Dimensional: Star/Snowflake): fast delivery of reporting marts; focus on facts and dimensions with SCD history. Quick time-to-value.
Inmon (Corporate Information Factory): a normalized core plus downstream marts; slower to build, but strict and centralized.

Data Vault 2.0

Hubs-Links-Satellites: a scalable "raw" model for integrating sources and auditing changes. Star-schema marts are built on top.
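A minimal Hub-Link-Satellite sketch in generic SQL (the layout follows the usual DV 2.0 conventions; all names and types are illustrative):

sql
-- Hub: stable business key
CREATE TABLE hub_player (
  player_hk  BINARY(16) PRIMARY KEY,  -- hash key over the business key
  player_id  BIGINT,                  -- business key from the source
  load_ts    TIMESTAMP,
  record_src STRING
);

-- Satellite: descriptive attributes with full change history
CREATE TABLE sat_player_profile (
  player_hk  BINARY(16),
  load_ts    TIMESTAMP,
  country    STRING,
  vip_level  STRING,
  hash_diff  BINARY(16),              -- change detection over the attributes
  PRIMARY KEY (player_hk, load_ts)
);

-- Link: relationship between hubs
CREATE TABLE lnk_player_game (
  player_game_hk BINARY(16) PRIMARY KEY,
  player_hk      BINARY(16),
  game_hk        BINARY(16),
  load_ts        TIMESTAMP
);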

Data Lake / Lakehouse

Data Lake: raw files (Parquet/ORC) + catalogs (Hive Metastore/Glue/Unity Catalog).
Lakehouse: a single layer for batch/stream, ACID tables (Delta/Iceberg/Hudi), time travel, upsert/merge, file compaction, Z-ordering/clustering.
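A hedged upsert sketch using the MERGE statement that Delta Lake and Iceberg both support (`silver_bets`, `staging_bets`, and the `op` change flag are illustrative):

sql
-- Apply a CDC batch to an ACID lakehouse table
MERGE INTO silver_bets t
USING staging_bets s
  ON t.bet_id = s.bet_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
  stake = s.stake, win = s.win, updated_at = s.updated_at
WHEN NOT MATCHED AND s.op <> 'D' THEN
  INSERT (bet_id, player_id, stake, win, updated_at)
  VALUES (s.bet_id, s.player_id, s.stake, s.win, s.updated_at);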

Medallion (Bronze–Silver–Gold)

Bronze: raw data as ingested + CDC.
Silver: cleansed and conformed data.
Gold: business marts/metrics/cubes.
Works well in hybrids (Kafka → Bronze; Silver in the Lakehouse; Gold in ClickHouse/BigQuery/Snowflake).

OLAP models: Star, Snowflake, Data Vault

Star Schema

Fact tables: transactions (bets, deposits, sessions).
Dimensions: player, game, provider, date/time, geo, acquisition channel.
Pros: simple joins, predictable performance (see the sample query below).
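A typical star query in generic SQL: one fact table, two dimension joins, one aggregate (`dim_channel`, `dim_game`, and their columns are illustrative names):

sql
-- Monthly stakes by acquisition channel and game provider
SELECT ch.channel_name,
       g.provider,
       SUM(f.stake) AS stake_sum
FROM fact_bets f
JOIN dim_channel ch ON ch.channel_sk = f.channel_sk
JOIN dim_game g     ON g.game_sk = f.game_sk
WHERE f.ts >= DATE '2024-01-01' AND f.ts < DATE '2024-02-01'
GROUP BY ch.channel_name, g.provider;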

Snowflake

Normalized dimensions (country/region/city hierarchies, product hierarchies).
Pros: less duplication; cons: more joins.

Data Vault → Star

Land raw changes in DV (audit, full reproducibility), then build reporting marts on top as Star/Snowflake.

Integration: ETL/ELT, CDC, and Slowly Changing Dimensions

Pipeline

Outbox/CDC from OLTP (Postgres/MySQL) → Kafka → connectors → Bronze.
ELT: cleansing, deduplication, normalization in Silver (see the dedup sketch after this list).
Business logic and aggregations in Gold/marts.
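A minimal sketch of CDC deduplication in the Silver layer, keeping only the latest version per business key (generic SQL; `bronze_bets_cdc` and its columns are illustrative names):

sql
-- Keep the newest CDC row per bet_id, ranked by source timestamp
CREATE TABLE silver_bets AS
SELECT bet_id, player_id, stake, win, updated_at
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY bet_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM bronze_bets_cdc
) t
WHERE rn = 1;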

SCD (Slowly Changing Dimensions)

Type 1: overwrite (for non-essential fields).
Type 2: full history (dated versions), the standard for profiles/channels/prices.
Type 3: store the previous and current value side by side (rare).

SCD2 example (generic SQL):
sql
-- insert the new dimension version
INSERT INTO dim_player (player_sk, player_id, country, valid_from, valid_to, is_current)
SELECT gen_sk(), s.player_id, s.country, now(), '9999-12-31', true
FROM staging_player s
LEFT JOIN dim_player d
  ON d.player_id = s.player_id AND d.is_current = true
WHERE d.player_id IS NULL OR d.country <> s.country;

-- close the old version
UPDATE dim_player d
SET valid_to = now(), is_current = false
FROM staging_player s
WHERE d.player_id = s.player_id
  AND d.is_current = true
  AND d.country <> s.country;

Semantic layer and trustworthy metrics

Introduce a single semantic layer: shared definitions for GGR, NGR, Net Deposits, ARPPU, LTV, Churn, Retention Cohorts.
Metrics as code (dbt metrics/LookML/Semantic Layer) → the same formulas in every report (a view sketch follows below).
Calendar: a date/hour table with TZ/region/weekend/campaign attributes.
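One way to enforce a single source of truth is to define the formulas once in SQL and point every report at the result (a sketch; the NGR formula shown, GGR minus bonus costs and taxes, is a common convention, and `fact_financial_daily` is an illustrative name):

sql
-- Metric definitions live in exactly one place
CREATE VIEW metrics_daily AS
SELECT d,
       country,
       SUM(stake) - SUM(win)                              AS ggr,
       SUM(stake) - SUM(win) - SUM(bonus_cost) - SUM(tax) AS ngr,
       SUM(deposit) - SUM(withdrawal)                     AS net_deposits
FROM fact_financial_daily
GROUP BY d, country;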

Storage and engines: choosing by workload profile

Columnar and Cloud DWH

ClickHouse: ultra-fast scans/aggregations, materialized views, projections; excellent for events/telemetry and marketing marts.
BigQuery: serverless, scale, automatic caching/clustering; pay-per-scan pricing; convenient for mixed loads and ad-hoc queries.
Snowflake: compute/storage separation, on-demand clusters, time travel; works cleanly across multiple teams.
Redshift/Vertica/Pinot/Druid: further options for OLAP/real-time.

Tuning for the workload profile

Partitioning by date/region/channel.
Clustering/sorting on filter/join keys.
Compression and dictionary encoding.
Pre-aggregations (rollups, cubes), materialized views.
Approximate functions (HyperLogLog/approx_distinct) for cheap estimates (see the example below).
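For example, ClickHouse's HLL-based `uniqCombined` gives an approximate distinct count at a fraction of the cost of `uniqExact` (the `bets` events table is an illustrative name):

sql
-- Approximate daily active players: much cheaper than an exact count
SELECT toDate(ts) AS d,
       uniqCombined(player_id) AS dau_approx
FROM bets
GROUP BY d
ORDER BY d;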

Performance Engineering

Partitioning and clustering

A partition is the pruning boundary for scans: use daily/hourly partitions for events.
Clustering (sort keys/Z-order) accelerates range scans and joins (a partitioned DDL sketch follows below).
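A BigQuery-flavored DDL sketch combining daily partitioning with clustering (PARTITION BY and CLUSTER BY are real BigQuery clauses; the dataset and columns are illustrative):

sql
-- Scans prune to the requested days; clustering speeds selective filters/joins
CREATE TABLE analytics.events (
  event_id   STRING,
  player_id  INT64,
  country    STRING,
  product_id INT64,
  ts         TIMESTAMP
)
PARTITION BY DATE(ts)
CLUSTER BY country, product_id;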

Materialized Views (MVs)

Pre-aggregated GGR/NGR by day/country/product.
Incremental refresh from the CDC stream.

ClickHouse example (an MV backed by SummingMergeTree, which merges partial sums in the background):
sql
-- partition/sort keys must reference the MV's output columns, hence d
CREATE MATERIALIZED VIEW mv_ggr_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(d)
ORDER BY (country, product_id, d)
AS
SELECT toDate(ts) AS d,
       country,
       product_id,
       sum(stake)       AS stake_sum,
       sum(win)         AS win_sum,
       sum(stake - win) AS ggr
FROM bets
GROUP BY d, country, product_id;

Incremental Models (dbt/ELT)

Strategies: `insert_overwrite` by partition, `merge` on CDC keys, watermarks on `updated_at` (a dbt-style sketch follows below).
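A dbt-style incremental model sketch with a merge strategy and an `updated_at` watermark (`config`, `is_incremental()`, `source()`, and `this` are standard dbt constructs; the model and column names are illustrative):

sql
-- models/silver_bets.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='bet_id'
) }}

SELECT bet_id, player_id, stake, win, updated_at
FROM {{ source('bronze', 'bets_cdc') }}
{% if is_incremental() %}
  -- watermark: only rows newer than the target's high-water mark
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}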

Join strategies

Replicate dimensions to every shard/segment (denormalization).
Broadcast small dimensions; shuffle large facts sorted by key.

Cost control and optimization

BigQuery/Snowflake: limit scanned bytes (design partitions/clusters), enable result caches/materialized views, rein in BI auto-refresh queries.
ClickHouse: batch sizes, merge frequency, storage budget (TTL for raw events, aggregates are kept long-term; see the TTL sketch below).
A single metrics semantic layer avoids duplicate computation.
Data pruning: retention for Bronze, aggregation for Gold.
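A ClickHouse TTL sketch for the storage budget: raw events expire while aggregates built on top are kept (MODIFY TTL is standard ClickHouse syntax; the 90-day window is an illustrative policy):

sql
-- Raw bets are dropped after 90 days; mv_ggr_daily keeps the aggregates
ALTER TABLE bets MODIFY TTL ts + INTERVAL 90 DAY;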

Data quality (DQ), catalog, lineage

DQ checks: completeness, uniqueness, ranges, business rules (e.g., GGR ≥ 0 in aggregates; see the SQL check below).
Data Catalog & Lineage: table/field descriptions, owners, PII classification, report-to-source tracing.
Schema control: contracts for events/CDC, alerts on incompatible changes.
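A minimal DQ check in plain SQL that a scheduler can alert on; frameworks like dbt tests or Great Expectations wrap the same idea (the rule mirrors the example above; `mart_ggr_daily` is an illustrative mart):

sql
-- Non-empty result = data quality incident: GGR must be >= 0 in aggregates
SELECT d, region, ggr
FROM mart_ggr_daily
WHERE ggr < 0;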

Security, compliance and multi-tenancy

PII segregation: separate zones, masking/pseudonymization, KMS-encrypted columns.

RBAC/ABAC: roles at the project/schema/table/row (RLS) level, need-to-know rules (an RLS sketch follows below).

Data residency: regional buckets/warehouses (EU/TR/LATAM).
Access audit: who read or changed marts and models.
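A row-level security sketch in PostgreSQL syntax (CREATE POLICY is standard Postgres; the `app.region` session setting is an assumed convention):

sql
-- Analysts see only rows for their own region
ALTER TABLE mart_ggr_daily ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_need_to_know ON mart_ggr_daily
  FOR SELECT
  USING (region = current_setting('app.region'));

-- per session: SET app.region = 'EU';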

DR, backups and reproducibility

Versioned data code (dbt/git), Dev/QA/Prod environments.
Metastore/catalog snapshots + table time travel.
Retention/TTL per layer (Bronze/Silver/Gold); export of critical marts.
Game days: restore marts, verify metric integrity.

Real-time and hybrid marts

Stream-to-OLAP: Kafka → ClickHouse/Pinot/Druid for minute-level marts (a Kafka-engine sketch follows below).
Materialized views + CDC for near-online refresh (5-15 min).
The semantic layer stays the same: metrics are identical in real time and in batch.
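A sketch of the Kafka → ClickHouse leg using ClickHouse's Kafka table engine (the engine and its settings are real ClickHouse features; broker, topic, and schema are illustrative):

sql
-- Consumer table reads the topic; the MV pumps rows into the durable table
CREATE TABLE bets_queue (
  bet_id     UInt64,
  player_id  UInt64,
  country    String,
  product_id UInt64,
  stake      Decimal(18, 2),
  win        Decimal(18, 2),
  ts         DateTime
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'bets',
         kafka_group_name  = 'ch-bets',
         kafka_format      = 'JSONEachRow';

CREATE MATERIALIZED VIEW bets_consumer TO bets AS
SELECT * FROM bets_queue;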

Example Mart: GGR by Day and Region (Generic SQL)

sql
CREATE TABLE fact_bets (
  bet_id     BIGINT,
  player_sk  BIGINT,
  game_sk    BIGINT,
  country_sk BIGINT,
  stake      DECIMAL(18,2),
  win        DECIMAL(18,2),
  ts         TIMESTAMP
) PARTITION BY DATE(ts);

CREATE TABLE dim_country (
  country_sk BIGINT PRIMARY KEY,
  iso2       STRING,
  region     STRING,
  valid_from TIMESTAMP,
  valid_to   TIMESTAMP,
  is_current BOOL
);

-- the mart itself
CREATE MATERIALIZED VIEW mart_ggr_daily AS
SELECT DATE(ts)         AS d,
       c.region,
       SUM(stake)       AS stake_sum,
       SUM(win)         AS win_sum,
       SUM(stake - win) AS ggr
FROM fact_bets f
JOIN dim_country c
  ON c.country_sk = f.country_sk AND c.is_current
GROUP BY d, c.region;

Implementation checklist

1. Define sources and domains; freeze the metrics dictionary.
2. Choose the model: DV for the raw/audited layer + Star for marts.
3. Design partitions/clusters around the key queries and time windows.
4. Set up CDC/ELT, the SCD policy, and surrogate keys.
5. Introduce a semantic layer (metrics as code) and a date/hour calendar.
6. Create MVs/pre-aggregations for expensive reports.
7. Enable DQ/catalog/lineage and schema control.
8. Define RBAC/PII/data residency, encryption, auditing.
9. Set up p95/p99 monitoring, cost tracking, and alerts on degradation and overruns.
10. Run regular DR exercises and keep environments reproducible.

Anti-patterns

"One gigantic fact without parties" → terabyte scans and the score is growing.
Inconsistent definitions of metrics in different dashboards.
Lack of SCD2 where business requires historicity.
Premature normalization of measurements: unnecessary joynes and slow reports.
Raw data without DQ checks and lineage → "nothing" reports.
Absence of permission/TTL → storage of garbage and explosion of cost.

Summary

A reliable iGaming DWH rests on a clear model (DV → Star), a single metrics dictionary, correct partitioning/clustering, materialized marts, strict DQ/lineage, and RBAC/PII/data residency. Add hybrid streaming for freshness plus ELT and cost discipline, and you get a sustainable analytics platform that scales to tournaments, regulatory reporting, and ad-hoc research with no surprises in p99 latency or the budget.
