Data Lake and centralized storage
(Section: Technology and Infrastructure)
Brief Summary
Data Lake is the foundational layer of centralized storage for raw and consolidated datasets. In iGaming it ingests betting/payment/game-log events, affiliate uploads, and CDC from OLTP systems, and serves them to analytics, anti-fraud, CRM and BI. Modern practice is the Lakehouse: open columnar formats + an ACID table layer + a unified catalog + transactions/data versioning. The keys to success are schema and partitioning discipline, cost management, PII protection, and a strict operating culture (DQ, lineage, DR).
Data Lake's Role in the iGaming Platform
A single point of truth for analytics: raw and cleansed data stored regardless of source and format.
Flexibility: support for batch and streaming (CDC/connectors, event streams).
Evolution: from raw Bronze through conformed Silver to business-ready Gold marts.
Separation of responsibilities: production services write to the event bus/staging area; analytics/ML consumes from the lake layers.
Architectural models: Lake vs Lakehouse
Data Lake (S3/ADLS/GCS + Parquet/ORC): schema-on-read, cheap storage, flexible formats.
Lakehouse (Delta/Iceberg/Hudi over Parquet): ACID transactions, upsert/merge, time-travel, compact files, vacuum, indexing/clustering.
In practice, a Lakehouse works well for iGaming as the main layer, with external OLAP engines (ClickHouse/BigQuery/Snowflake/Pinot) serving as data marts and specialized engines.
Medallion layer model
Bronze (Raw/Staging): raw files from sources (CDC, log dumps, affiliate CSV, webhooks). Minimal validation, "as is."
Silver (Conformed): cleansing/dedup, normalization of currencies/time zones, typing, SCD dimensions, consistent keys.
Gold (Marts/Serving): aggregates for GGR/NGR/LTV/Retention, materialized marts for BI/CRM/anti-fraud (a sketch of such an aggregate follows this list).
TTL: aggressive on Bronze, moderate on Silver, long-term on Gold marts.
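As a minimal sketch of a Gold aggregate, the query below builds a daily GGR mart from a Silver bets table like the one defined in the examples later in this section; the gold.mart_ggr_daily name and the GGR formula (stakes minus wins) are illustrative assumptions.
sql
-- Illustrative Gold mart: daily GGR and active players per country,
-- built from the Silver bets table shown in the examples below.
CREATE TABLE gold.mart_ggr_daily AS
SELECT
  CAST(event_ts AS DATE)    AS dt,
  country,
  SUM(stake) - SUM(win)     AS ggr,             -- GGR here = stakes minus wins
  COUNT(DISTINCT player_id) AS active_players
FROM silver.bets
GROUP BY CAST(event_ts AS DATE), country;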
Formats and table layers
Columnar formats: Parquet (de facto standard), ORC.
Open table formats (ACID):
- Delta Lake - transactions, 'MERGE', time-travel, optimize/vacuum, Z-order.
- Apache Iceberg - tables with manifests/snapshots, hidden partitioning, 'MERGE/DELETE/UPDATE', time-travel.
- Apache Hudi - copy-on-write/merge-on-read, upsert optimization, incremental pulls.
- Choose based on your ecosystem and the requirements for upserts, streaming, and schema-evolution flexibility.
Catalog and metastore
A unified catalog (Hive Metastore/Unity Catalog/Glue/platform catalogs) stores schemas, partitions, versions, and access rights.
Requirements: transactional consistency with the table layer, support for multiple engines (Spark, Trino/Presto, Flink, dbt), audit/lineage.
Schemas and evolution
Schema contract: fix mandatory fields, types, and semantics; version sources ('schema_version').
Evolution: adding optional fields is allowed; breaking changes are prohibited without migrations; automatic schema checks in pipelines (see the sketch after this list).
PII segmentation: sensitive fields go into separate columns/tables with encryption and separate access rights.
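A minimal sketch of a non-breaking schema change, assuming Spark SQL over a Delta or Iceberg table (exact ALTER syntax varies by engine and format); the bonus_amount column is an illustrative example.
sql
-- Backward-compatible evolution: an optional column is added and existing
-- readers keep working; dropping or retyping a column would instead require
-- a migration and a bump of the source's schema_version contract.
ALTER TABLE silver.bets ADD COLUMNS (bonus_amount DECIMAL(18,2));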
Data partitioning and lay-out
Date/hour is the base partition key for events; optional keys: 'country', 'product', 'tenant_id'.
Hive-style path: `s3://lake/bronze/payments/source=pspA/dt=2025-11-05/hour=13/part-0001.parquet`.
Clustering/sorting: Z-order/sort keys on frequently filtered fields (player_id, country); see the sketch after this list.
File size: aim for 128-1024 MB; avoid "small files" (see below).
Hidden partitioning via partition transforms/generated columns (Iceberg/Delta).
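A layout sketch, assuming Iceberg's Spark SQL extensions (WRITE ORDERED BY); Delta users would typically achieve the same clustering with OPTIMIZE ... ZORDER BY, shown in the compaction sketch further below.
sql
-- Ask writers to sort data files by common filter columns, so lookups
-- by player_id/country touch fewer files during scans.
ALTER TABLE silver.bets WRITE ORDERED BY player_id, country;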
The small-files problem and compaction
Sources stream small chunks → scan performance and metadata handling degrade.
Solution: periodic optimize/compaction (coalesce), a scheduler for compaction jobs, micro-batching on ingestion, 'autoOptimize' where available (see the sketch after this list).
The merge-on-read vs copy-on-write policy is a trade-off between write latency and read speed.
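A compaction sketch, assuming Delta Lake SQL (OPTIMIZE/ZORDER) and the Iceberg rewrite_data_files procedure; the catalog name 'lake', the table, and the target file size are illustrative values.
sql
-- Delta Lake: bin-pack small files and co-locate a hot filter column.
OPTIMIZE bronze.payments ZORDER BY (player_id);

-- Iceberg: rewrite small data files toward a ~512 MB target
-- (the procedure is invoked through the catalog; 'lake' is a placeholder).
CALL lake.system.rewrite_data_files(
  table => 'bronze.payments',
  options => map('target-file-size-bytes', '536870912')
);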
Ingest: batch, stream, CDC
CDC from OLTP (Debezium/connectors) → Bronze (minute freshness).
Stream (Kafka/Flink/Spark Structured Streaming) → Silver/Gold incrementally (upsert/merge).
Batch (partner reports/CSV/JSON) - through landing zones with manifests and duplicate control by checksum.
Idempotency: keys (idempotency_key), dedup by (key, ts), watermarks for late-arriving records (see the sketch after this list).
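A minimal idempotent-load sketch (table and column names such as bronze.payments, staging.payments_batch, and idempotency_key are assumptions): rows whose key already exists in Bronze are skipped, so re-delivered batches do not create duplicates.
sql
-- Insert only events not seen before; re-running the same batch is a no-op.
MERGE INTO bronze.payments t
USING staging.payments_batch s
  ON t.idempotency_key = s.idempotency_key
WHEN NOT MATCHED THEN INSERT *;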
Data quality (DQ) and lineage
DQ checks: completeness, key uniqueness, value ranges, referential integrity (country/currency reference lists), business rules (GGR ≥ 0); see the example after this list.
Lineage: a dependency graph from report back to source, plus the model code version and the table snapshot.
Schema control: automatic backward/forward-compatibility tests that block breaking changes.
Load auditing: who/when/how much, rejected batches, retries.
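Illustrative DQ queries over the Silver bets table from the examples below; in practice such rules usually live in a DQ framework (dbt tests, Great Expectations, etc.), and the specific rules here are assumptions.
sql
-- Key uniqueness: any returned rows are duplicates to investigate.
SELECT bet_id, COUNT(*) AS dup_count
FROM silver.bets
GROUP BY bet_id
HAVING COUNT(*) > 1;

-- Business rule: stakes and wins must be non-negative.
SELECT COUNT(*) AS bad_rows
FROM silver.bets
WHERE stake < 0 OR win < 0;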
Serving and Access
SQL engines: Spark/Trino/Presto for ad-hoc and transformations; dbt for ELT models.
Real-time/near-real-time: Pinot/Druid/ClickHouse as serving marts; the Lakehouse feeds them through incremental sinks.
Data sharing: sharing tables/snapshots with external teams without copies (if the format supports it).
Security, PII and multi-tenancy
Encryption: at-rest (KMS) and in-transit (TLS).
IAM/RBAC/ABAC: roles at directory/table/column/row level (masking, dynamic policies).
Segmentation by region (EU/Turkey/LatAm localization): isolation of buckets and compute pools.
Multi-tenancy: namespaces/catalogs and path prefixes, filters by 'tenant_id', optionally row-level policies (sketched below).
Access auditing: logs of metadata reads/changes, retention, and immutable log storage.
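A minimal sketch of column masking plus a tenant-level row filter implemented as a view; engines with native dynamic masking/row policies express the same idea declaratively. The silver.players columns, the serving.players_masked view, and the current_tenant() helper are hypothetical.
sql
-- Expose only a hashed form of PII and only the caller's tenant.
CREATE VIEW serving.players_masked AS
SELECT
  player_id,
  country,
  sha2(email, 256) AS email_hash,       -- PII never leaves in clear text
  registration_ts
FROM silver.players
WHERE tenant_id = current_tenant();     -- hypothetical tenant-resolution UDF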
Cost management
Storage classes: hot (frequently read) data in the standard class, archives in cold/Glacier classes with TTL policies.
Partitioning/clustering reduces scans → lower cost.
Materialized marts for expensive reports; caching of BI results.
Compression and the "right" file size → less metadata and I/O.
Quotas and budgeting: limits on compute clusters/jobs, cost reports per dataset/team.
Garbage collection: 'VACUUM'/rewrite in table formats, TTL on Bronze (see the sketch after this list).
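A housekeeping sketch for Bronze, assuming Delta Lake SQL (for Iceberg, the expire_snapshots/remove_orphan_files procedures play the analogous role); the 30-day TTL matches the TTL example below, and the 7-day retention is an illustrative value.
sql
-- Row-level TTL: drop Bronze rows older than 30 days
-- (dt is assumed to be a DATE partition column).
DELETE FROM bronze.payments
WHERE dt < date_sub(current_date(), 30);

-- Reclaim files no longer referenced by the table (keeps ~7 days of time-travel).
VACUUM bronze.payments RETAIN 168 HOURS;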
DR and reproducibility
Time-travel via table versioning and catalog snapshots (a query sketch follows this list).
Cross-region replication of buckets and metadata.
PITR: storage of table transaction logs (Delta/Iceberg/Hudi) and pipeline logs.
Game days: regular recovery exercises and region-failover drills.
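Time-travel reads make reports reproducible and support point-in-time recovery; the sketch below shows Delta Lake and Iceberg Spark SQL syntax (the timestamp is illustrative).
sql
-- Delta Lake: read the table as of a past point in time.
SELECT COUNT(*) FROM silver.bets TIMESTAMP AS OF '2025-11-05 00:00:00';

-- Iceberg (Spark SQL): equivalent point-in-time read.
SELECT COUNT(*) FROM silver.bets FOR TIMESTAMP AS OF '2025-11-05 00:00:00';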
Observability and SLO
SLO freshness: Bronze ≤ 5 min, Silver ≤ 15-30 min, Gold ≤ 60 min (example).
Metrics: data volume/file counts, average Parquet file size, scan time, share of failed batches, compaction frequency, cost per dataset, DQ errors, late data (a file-size query is sketched after this list).
Alerts: small-file surges, cost growth, p95/p99 degradation, DQ/schema violations, streaming-ingest lag.
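An illustrative file-size health check using Iceberg's files metadata table (metadata table names and availability vary by format/engine; the 32 MB "small file" threshold is an assumption).
sql
-- Average file size and count of "small" files for one table.
SELECT
  COUNT(*)                                                       AS file_count,
  AVG(file_size_in_bytes) / 1048576                              AS avg_file_mb,
  SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS small_files
FROM silver.bets.files;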
Naming conventions and paths (template)
s3://<lake>/<layer>/<domain>/<dataset>/
  source=<sys>/      # for Bronze
  dt=YYYY-MM-DD/
  hour=HH/
  country=XX/
Dataset names: 'bets_raw', 'payments_cdc', 'players_silver', 'mart_ggr_daily'.
Metadata columns: 'ingest_ts', 'source', 'schema_version', 'trace_id', 'tenant_id'.
Examples (generalized)
1) Iceberg: Silver table with hidden partitioning by date
sql
CREATE TABLE silver.bets (
bet_id BIGINT,
player_id BIGINT,
country STRING,
stake DECIMAL(18,2),
win DECIMAL(18,2),
event_ts TIMESTAMP,
ingest_ts TIMESTAMP,
schema_version INT
)
PARTITIONED BY (days(event_ts))
TBLPROPERTIES ('format-version'='2');
2) Delta: Incremental upsert from CDC
sql
MERGE INTO silver.players t
USING bronze.players_cdc s
ON t.player_id = s.player_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
3) TTL policy for Bronze (idea)
bronze/: keep 30 days
silver/: keep 365 days (non-PII), 90 days (PII masked)
gold/marts/: keep 2–3 years (aggregated)
Implementation checklist
1. Select the table format (Delta/Iceberg/Hudi) and catalog; align with engines (Spark/Trino/Flink/dbt).
2. Define medallion layers, TTL rules, and team responsibility.
3. Capture schema contracts, evolution control, PII segmentation, and encryption.
4. Design the layout: partitions, sort keys, target file size; enable compaction.
5. Configure ingest (CDC/stream/batch) with idempotency and deduplication.
6. Enable DQ/lineage, metadata catalog, and auditing.
7. Define freshness/cost SLOs, dashboards of metrics and alerts.
8. Organize DR: snapshots/replication/recovery + regular exercises.
9. Standardize naming and paths, and metadata columns ('ingest_ts', 'source', 'schema_version').
10. Serve Gold marts and real-time workloads from the appropriate OLAP/RT engines.
Anti-patterns
One shared bucket without layers and TTL → chaos and runaway cost.
Partitioning by time only, ignoring country/product → heavy scans.
Streams that create thousands of small files per hour without compaction.
No schema control or DQ → breaking changes and distrust of reports.
Mixing PII into Gold marts without masking and separation of rights.
Hardcoding access rights at the bucket level instead of catalog- and table-level policies.
Summary
A modern Data Lake for iGaming is a Lakehouse with an open table format, a unified catalog, and a medallion model. Schema and partition discipline, compaction against small files, DQ/lineage, PII protection, and cost hygiene turn the lake layer into a sustainable foundation: cheap to store, fast to read, predictable in SLO, and ready for DR. Such a foundation scales to tournament peaks and supports both batch analytics and near-real-time serving marts.