GH GambleHub

Data lakes and flow aggregation

1) Purpose and value

Data Lake/Lakehouse - a reference layer of long-term storage and large-scale reading, where:
  • Streams from products/games/payments land in Bronze "as is."
  • Silver normalizes and enriches, providing consistent keys and quality.
  • Gold - aggregated showcases (including real-/near-real-time) for BI, regulator, anti-fraud/RG.

Aggregation of flows on Lakehouse yields: low report latency, predictable cost, reproducibility, and forensics.

2) Reference architecture

1. Ingest/Edge: HTTP/gRPC, OTel, batch endpoints → шина (Kafka/Redpanda).
2. Bronze (append-only): object storage + ACID tables (Delta/Iceberg/Hudi), partitions by date/market/tenant; storage of the original payload.
3. Stream Compute: Flink/Spark/Beam - window units, CEP, deadup, online-lookups.
4. Silver (clean/conform): currency normalization/timezone, FK/directories, SCD for measurements.
5. Serving/OLAP: ClickHouse/Pinot/Druid - materialized minute/second aggregates for panels.
6. Gold (serve): daily/hourly display cases, regulatory slices, immutable export packages (WORM).
7. Control loops: Schema Registry, DQ-as-code, lineage, directories, secrets/KMS, RBAC/ABAC.

3) Contracts and schemes

Schema-first: JSON/Avro/Protobuf; required fields are 'event _ time (UTC)', 'event _ id', 'trace _ id', 'user _ pseudo _ id', 'market', 'schema _ version'.
Evolution: back-compatible → adding nullable; breaking → '/v2 '+ double entry.
Directory: domain description, owner, fresh SLA, DQ rules, lineage.

4) Landing streams into the lake

Exactly-once at the bottom: at-least-once publication + idempotent sink (MERGE/upsert by 'event _ id').
Dedup: stateful in stream + uniqueness in Silver.
File compression: small files → regular OPTIMIZE/VACUUM for reading and cost.
Time-travel: includes debugging, replay and audit.

Example of Iceberg partitioning (DDL idea):
sql
CREATE TABLE bronze. payment_events (
event_id STRING, user_pseudo_id STRING, currency STRING,
amount DECIMAL(18,2), market STRING, event_time TIMESTAMP, payload STRING
)
PARTITIONED BY (days(event_time), market);

5) Stream aggregation: windows and watermarks

Windows:
  • Tumbling - fixed (e.g. 1 min/5 min) for stable panels.
  • Hopping - overlapping (step
  • Session - behavioral gaps in inactivity.
  • Watermarks: late data control (usually 2-5 minutes), pre-issue/correction rules.
Flink SQL - 1 minute deposits by market:
sql
SELECT market,
TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS ts_min,
COUNT() AS deposits_1m,
SUM(amount_base) AS sum_1m
FROM silver. payments
GROUP BY market, TUMBLE(event_time, INTERVAL '1' MINUTE);

6) Materialization of aggregates

OLAP engine (ClickHouse/Pinot/Druid): stores minute/second aggregates for dashboards and operational analytics.
Lakehouse Gold: keeps daily/hourly slices for reporting and reconciliation (reproducibility).

ClickHouse - materialized view (every minute GGR):
sql
CREATE MATERIALIZED VIEW mv_ggr_1m
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(event_time)
ORDER BY (toStartOfMinute(event_time), market, provider_id) AS
SELECT toStartOfMinute(event_time) AS ts_min,
market,
provider_id,
sumState(stake_base) AS s_stake,
sumState(payout_base) AS s_payout
FROM stream. game_events
GROUP BY ts_min, market, provider_id;
Gold - day slice (Lakehouse):
sql
CREATE OR REPLACE VIEW gold. ggr_daily AS
SELECT
DATE(event_time) AS event_date,
market, provider_id,
SUM(stake_base) AS stakes_eur,
SUM(payout_base) AS payouts_eur,
SUM(stake_base) - SUM(payout_base) AS ggr_eur
FROM silver. fact_game_financials
GROUP BY 1,2,3;

7) Silver: normalization and reconciliation

Time and currency: 'event _ time (UTC)', 'amount _ base', 'fx _ rate _ used', 'fx _ source'.
Keys/directories: 'user _ pseudo _ id', 'game _ id', 'provider _ id', 'market'.
SCD II: dimension historization (users/games/providers/RG/KYC).
DQ rules: key uniqueness, directories, amount ranges, temporal validity.

8) Register of units and "correct" definitions

Semantic Layer: uniform GGR/NGR formulas, bets/wins, conversion, ARPPU, latency p95.
Versioning metrics: 'metric _ version' and "as-of" calculations.
Dockcards: owner, formula, sources, SLA readiness.

9) Exactly-once/idempotency and order

Bus: at-least-once + partitioning (local order).
Processing: dedup by 'event _ id' (TTL 24-72h), CEP/window operators with adjustments.
Sink: transactional commits or idempotent upsert/merge.
Outbox/Inbox: publishing domain events from OLTP with warranty.

10) Late data and adjustments

Allowed lateness: 2-5 min for operational displays; daily reassembly for Gold.
Corrections: additional emissions in OLAP and resampling of Gold (idempotent).
Flags: 'late = true', 'correction _ of = <event _ id>' for audit.

11) Observability and DQ

SLI/SLO (landmarks):
  • p95 ingest→1 -min showcase ≤ 2-5 s; Gold daily is ready until 06:00 lock.
  • Completeness ≥ 99. 5%; Schema validity ≥ 99. 9%; Trace coverage ≥ 98%.
  • Pipeline metrics: lag/throughput/busy time/state size, late-ratio, dup-rate.
  • DQ-dashboards: Freshness/Completeness/Validity, loss funnel, hot key card.
  • Lineage: the way from Bronze to Gold/exports; impact analysis on changes.

12) Privacy, residency, security

PII minimization: pseudonymization, separate protected mapping.
Residency: EEA/UK/BR - separate directories and encryption keys; banning cross-regional joins without reason.
Encryption: TLS in-transit; KMS/CMK at-rest; export signatures + WORM at throttling.
DSAR/RTBF/Legal Hold: selective edits, deletion freezes, audited accesses.

13) Performance and cost

Partitioning: by date/market/tenant; clustering/Z-order by frequently filtered attributes.
Compaction: small files elimination, regular OPTIMIZE/VACUUM.
Materialization: minutes/seconds - in OLAP; day/hours - in Gold.
Tiered storage: hot/warm/cold, SLA recovery, chargeback by command (cost/GB, cost/query).
Preaggregation/sketches: HyperLogLog/approx-distinct where acceptable.

14) Examples (fragments)

Flink CEP - deposit structuring (10 min):
python if count_deposits(window=10MIN) >= 3 \
and sum_deposits(window=10MIN) > THRESH \
and all(d. amount < REPORTING_LIMIT for d in window_events):
emit_alert("AML_STRUCTURING", user_id, snapshot())
SQL - dedup when loaded into Silver:
sql
CREATE TABLE silver. payments AS
SELECT EXCEPT(rn) FROM (
SELECT p., ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) rn
FROM bronze. payment_events p
) WHERE rn = 1;
Iceberg/Delta - MERGE idempotent:
sql
MERGE INTO silver. fact_bets s
USING stage. fact_bets_delta d
ON s. bet_id = d. bet_id
WHEN MATCHED THEN UPDATE SET
WHEN NOT MATCHED THEN INSERT;

15) Processes and RACI

R (Responsible):
  • Data Platform (Lakehouse/catalog/ACID, compaction),
  • Streaming (units/CEP/dedup),
  • Domain Analytics (metrics/Gold).
  • A (Accountable): Head of Data/CDO.
  • C (Consulted): Compliance/Legal/DPO (PII/residency/Legal Hold), Finance (FX/GGR), SRE (SLO/стоимость), Security.
  • I (Informed): BI/Product/Marketing/Operations.

16) Implementation Roadmap

MVP (3-5 weeks):

1. Lakehouse Bronze/Silver (ACID tables), ingest from Kafka, registry schemes.

2. Basic stream units (1-5 minutes) in OLAP; showcase Gold. ggr_daily (D + 1 until 06:00).

3. DQ-as-code for Payments/Gameplay, Freshness/Completeness dashboards.

4. Compaction/OPTIMIZE, minimum cost metrics and lag/late/dup alerts.

Phase 2 (5-10 weeks):
  • Silver extension (SCD II for users/games/providers), lineage and impact analysis.
  • Asynchronous lookups (RG/KYC/ASN/BIN), late correction control.
  • Semantic layer of metrics, export regulations (WORM/signatures).
Phase 3 (10-16 weeks):
  • Multi-region, DR/replay simulator, auto-tuning windows and watermarks.
  • Cost-dashboards, chargeback/quotas, tiered storage and archiving.
  • Auto-generation of showcase documentation and metrics cards.

17) Pre-sale checklist

  • Schemes and contracts in the register; back-compat tests are green.
  • Dedup, watermark/allowed lateness, DLQ included.
  • /OPTIMIZE/VACUUM is configured on schedule.
  • SLO: p95 ingest→minute-view, Gold до 06:00; alerts lag/late/dup/state size.
  • DQ rules are active; lineage is visible from Bronze to exports.
  • RBAC/ABAC и KMS; residency and DSAR/RTBF/Legal Hold tested.
  • Cost under control (cost/GB, cost/query, cold share), limits on replays.

18) Anti-patterns and risks

Mixing raw and reported data in the same table: violates reproducibility.
Lack of compression: small files explosion → expensive requests.
Calculating FX "retroactively": Breaks history and reports.

No watermarks/late policies: storefronts and alerts "float."

Full reload needlessly: use/MERGE increments and adjustments.
PII in Analytics: Keep mappings separate, enable CLS/RLS.

19) Glossary (brief)

Lakehouse - data lake + ACID tables and SQL engine.
Bronze/Silver/Gold - raw/normalized/serving layers.
Watermark - window readiness limit by event-time.
Materialized View is a pre-calculated showcase for fast reading.
Time-travel - reading historical versions of tables.
WORM - unchangeable storage of export artifacts.

20) The bottom line

A data lake with proper stream aggregation is a discipline of layers and contracts: Bronze "as is," Silver for normalization and quality, OLAP for minute panels, Gold for reproducible reports. Managing windows and watermarks, deduplication and compression, privacy and cost, you get fast, verifiable and compliant storefronts for product, compliance and operational management.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Telegram
@Gamble_GC
Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.