GH GambleHub

ETL/ELT processes

1) Purpose and context

ETL/ELT pipelines provide predictable loading, transformation and publishing of data for reporting (GGR/NGR, regulators), analytics/ML and operational panels.

ETL: transform before loading into DWH/Lakehouse (less often in modern stacks).
ELT: first load into Lakehouse (Bronze/Silver), then transform SQL/engines (recommended).

2) Reference architecture

1. Ingest/Edge: HTTP/gRPC/Batch, CDC from OLTP, provider upload S3/FTP.
2. Bronze (raw, append-only): immutable payloads, parties by date/market/tenant.
3. Silver (clean/conform): normalization, dedup, directories, SCD, FX/timezones.
4. Gold (serve): denormalized storefronts for BI/regulator/models.
5. Orchestration: Airflow/Dagster/Prefect (DAG'i, SLA, Retrai, Shifts).
6. DQ/Contracts: Schema Registry + DQ-как-код, consumer-driven tests.
7. Observability: pipeline metrics, lineage, logs, cost-dashboards.

3) ETL vs ELT selection

CriterionETLELT (recommended)
Flexibility of allocationslowhigh (time-travel, reprocessing)
Costmore expensive when growingoptimal when scaling
Quality controlon ingestto Silver/Gold + DQ-as-code
Historicity/forensicslimitedfull (Bronze append-only)

Practice: in iGaming - ELT + CDC: load quickly, then standardize and count.

4) Increments and CDC

Delta approaches:
  • CDC (Debezium/log replication): OLTP changes → Bronze → MERGE in Silver.
  • Watermark by time: 'updated _ at> max_loaded_ts'.
  • Hash diff: 'md5 (row)' comparison for change detection.
  • Upsert/MERGE: idempotency of downloads.
Example MERGE (Delta/Iceberg):
sql
MERGE INTO silver. payments s
USING stage. payments_delta d
ON s. transaction_id = d. transaction_id
WHEN MATCHED THEN UPDATE SET
WHEN NOT MATCHED THEN INSERT;

5) Contracts and schemes

Schema-first: JSON/Avro/Protobuf in Registry; 'schema _ version' in events/files.
Evolution: back-compatible (nullable additions); breaking - '/v2 '+ double entry.
Required fields are 'event _ time (UTC)', 'event _ id', 'trace _ id', 'user _ pseudo _ id', 'market'.

6) DQ-as-code (minimum set)

yaml table: silver. payments owner: data-payments slo:
freshness_minutes: 15 completeness_percent: 99. 5 rules:
- name: unique_tx # uniqueness of transactions type: unique columns: [transaction_id]
severity: critical
- name: currency_whitelist type: in_set column: currency set: [EUR,USD,GBP,TRY,BRL]
severity: major
- name: amount_positive type: range column: amount_base min: 0. 01 severity: critical
- name: fk_user type: foreign_key column: user_pseudo_id ref_table: dim. users_scd severity: critical

7) Orchestration: DAG 'and, dependencies, SLA

DAG design: from sources to storefronts; explicit dependencies between tasks.
Retrai and idempotence: backoff, "clean" replays, checkpoints.
Catchup: a neat catch of missed periods.
SLA: Gold, for example. daily is ready before 06:00 local time; alerts on violations.
Parameterization: markets/tenants/dates via vars; a single job template.

8) Idempotence and exactly-once

In ingest: duplicates are possible → dedup by '(event_id, source)'.
In processing: upsert/merge; "pure" transformation functions.

In sink: transactional commits or idempotent writes; control of "double counting."

Outbox/Inbox: transactional publishing of domain events from OLTP.

9) Backfill и reprocessing

Backfill: primary fill/historical ranges.
Reprocessing - recalculation when logic changes/corrections.
Guardrails: range limits, quotas, time windows, dry-run with metric comparison.
Marking: 'logic _ version', 'reprocessed _ at', 'recalc _ reason'.

10) Silver/Gold Modeling

Silver (3NF/BCNF): facts' fact _ bets/payments/payouts', dimensions' dim _ users/games/providers/markets (SCD II) ', currency standardization/timezones.
Gold: denormalized storefronts for BI/regulator/models; immutable export packages (WORM) + signature.

Gold example: GGR Daily

sql
CREATE OR REPLACE VIEW gold. ggr_daily AS
SELECT
DATE(b. event_time) AS event_date,
b. market,
g. provider_id,
SUM(b. stake_base) AS stakes_eur,
SUM(p. amount_base) AS payouts_eur,
SUM(b. stake_base) - SUM(p. amount_base) AS ggr_eur
FROM silver. fact_bets b
LEFT JOIN silver. fact_payouts p
ON p. user_pseudo_id = b. user_pseudo_id
AND p. game_id = b. game_id
AND DATE(p. event_time) = DATE(b. event_time)
JOIN dim. games g ON g. game_id = b. game_id
GROUP BY 1,2,3;

11) Privacy and residency

PII minimization: tokenization; mappings of real IDs in the isolated loop.
RLS/CLS: access policies by role/jurisdiction, masking.
Residency: separate directories/keys for EEA/UK/BR; banning cross-regional joins without reason.
DSAR/RTBF & Legal Hold: selective edits, WORM archives for reporting, export audits.

12) Observability and SLO

SLI/SLO benchmarks:
  • Freshness Silver p95 ≤ 15 min; Gold daily is ready until 06:00 lock. time.
  • Completeness ≥ 99. 5%, Validity (scheme) ≥ 99. 9%.
  • The success of jobs ≥ 99. 0%, MTTR incidents ≤ 24-48 h.

Dashboards: Freshness heatmap, DQ loss funnel, cost/query & cost/GB, lineage graph.

13) Performance and cost

Partitioning: date/market/tenant; clustering/Z-order by filters.
Formats: Parquet + ACID (Delta/Iceberg/Hudi), compression and statistics.
Compaction: fighting small files (OPTIMIZE/VACUUM).
Materialization: stable aggregates; avoid giant on-the-fly joins.
Chargeback: budgets, replay quotas/backfill; scheduling in low load windows.

14) Examples of typical DAG tasks (Airflow pseudocode)

python with DAG("elt_payments_daily", schedule="@daily", start_date=..., catchup=True) as dag:
extract = BashOperator(task_id="extract_cdc", bash_command="run_cdc_to_bronze. sh {{ ds }}")
load  = BashOperator(task_id="load_to_silver", bash_command="sql/run_merge_silver. sql {{ ds }}")
dq   = BashOperator(task_id="dq_checks", bash_command="dq/run_checks. sh silver. payments {{ ds }}")
gold  = BashOperator(task_id="build_gold_ggr", bash_command="sql/build_gold_ggr. sql {{ ds }}")
export = BashOperator(task_id="export_regulator", bash_command="export/run_worm_pack. sh {{ ds }}")

extract >> load >> dq >> gold >> export

15) Processes and RACI

R (Responsible): Data Engineering (DAG, Silver/Gold models), Data Platform (infra, Registry, DQ).
A (Accountable): Head of Data/CDO.
C (Consulted): Compliance/Legal/DPO (PII/residency/Legal Hold), Finance (FX/GGR), Risk (RG/AML), SRE (SLO/стоимость).
I (Informed): BI/Product/Marketing/Operations.

16) Implementation Roadmap

MVP (3-5 weeks):

1. Lakehouse Bronze/Silver (ACID) + CDC/increments for Payments/Gameplay.

2. DQ-like-code (10-15 rules) and basic Freshness/Completeness dashboards.

3. First Gold Showcase (GGR Daily) with SLA "until 06:00," WORM export with signature.

4. DAG and alert orchestration on SLA/DQ.

Phase 2 (5-10 weeks):
  • Domain extension, SCD II for users/games/providers.
  • Semantic layer of metrics; lineage/impact analysis; backfill/reprocessing procedures.
  • Regionalization (EEA/UK), RLS/CLS, cost control (quotas/chargeback).
Phase 3 (10-16 weeks):
  • Replay simulator (what-if), auto-generation of showcase/metrics documentation.
  • Cost optimization (clustering, materialization, TTL, compression).
  • DR exercises and time-travel recovery.

17) Pre-sale checklist

  • Contracts/schemas in Registry, compatibility tests green.
  • CDC/increments and MERGE are idempotent; dedup to ingest.
  • DQ rules are active (critical → fail + DLQ), SLA dashboards are configured.
  • Gold showcases are documented, metric formulas in the semantic layer.
  • RBAC/ABAC, encryption, residency, DSAR/RTBF/Legal Hold verified.
  • Compression/OPTIMIZE/VACUUM on a schedule; backfill/replay limits.
  • Runbook 'and incidents and reprocessing, audit exports (WORM + hash).

18) Anti-patterns and risks

Full reload "just in case": use CDC/increments.
Mixing raw and reported data: Keep Bronze/Silver/Gold separate.
Lack of DQ and lineage: no provability and reproducibility.
PII in analytical layers: isolate mappings, apply CLS/RLS.
Monolithic "night" jabs: crush, parallel in batches.
Ignore cost: keep an eye on small files, materialize aggregates, introduce quotas.

19) Glossary (brief)

ETL/ELT - extraction/transformation/loading (before/after loading).
CDC - Capture change.
SCD - measurement historization (I/II/III).
WORM - unchangeable storage of report packages.
Time-travel - reading historical versions of tables.

20) The bottom line

Modern ETL/ELT is not scripts, but a managed platform: contracts and DQ, idempotent increments/CDC, Bronze/Silver/Gold layer discipline, observability and SLO, privacy and economy. By following this guide, you will get reproducible and auditable pipelines that consistently power reporting, product and models at scale and without surprises.

Contact

Get in Touch

Reach out with any questions or support needs.We are always ready to help!

Telegram
@Gamble_GC
Start Integration

Email is required. Telegram or WhatsApp — optional.

Your Name optional
Email optional
Subject optional
Message optional
Telegram optional
@
If you include Telegram — we will reply there as well, in addition to Email.
WhatsApp optional
Format: +country code and number (e.g., +380XXXXXXXXX).

By clicking this button, you agree to data processing.