GambleHub

Analytics and ETL pipelines

(Section: Technology and Infrastructure)

Brief Summary

The analytics pipeline turns raw iGaming operational events (bets, deposits, PSP webhooks, game logs) into stable metric marts (GGR/NGR, LTV, retention, anti-fraud signals). Supporting principles: a single layered model (Bronze/Silver/Gold), disciplined DQ and lineage tooling, incrementality and idempotence, observability and SLOs, and cost control. Decisions take into account the load profile (tournament peaks), regulatory requirements (PII/localization), and business requirements for data freshness.

1) Architectures: ETL vs ELT, batch vs stream

ETL (Extract → Transform → Load): transforms before loading into the DWH. Suitable where transformations require a controlled environment/secrets before data reaches the cloud.

ELT (Extract → Load → Transform): raw data lands in the Lake/Lakehouse/DWH, then transformations run in SQL/the engine (dbt/SQL scripts). Convenient for columnar engines and flexible iteration.
Batch: scheduled windows (every 5/15/60 minutes, nightly). Cheap and predictable.
Stream: near real-time (Kafka → Flink/ksqlDB → OLAP). For near-real-time windows (5-60 seconds) and anti-fraud/CRM signals.
Hybrid: Bronze is filled by streaming, Silver/Gold by incremental batch models.

Recommendation: in iGaming, prefer ELT + streaming: events via CDC/outbox → Bronze (minute-level freshness), incremental transformations in Silver/Gold.

2) Medallion

Bronze (Raw): raw events/CDC without business logic. Parquet/ORC formats, schemas as-is, minimal validation.
Silver (Conformed): cleaning, deduplication, ID normalization, SCD handling for dimensions, currency/time zone unification.
Gold (Marts): business marts (facts/dimensions, cubes), materialized views, preaggregations (by day/country/product).

Pros: reproducibility, transparent evolution, different SLOs and TTLs by layer.
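
A minimal sketch of how the three layers might be laid out as tables; the names, columns, and types below are illustrative and the dialect is generic:
sql
-- Bronze: raw CDC events stored as-is, partitioned by ingestion date
CREATE TABLE bronze_bets_cdc (
  payload        STRING,     -- raw event JSON, no business logic applied
  schema_version STRING,     -- schema version declared by the producer
  ingest_ts      TIMESTAMP,
  dt             DATE        -- partition column
);

-- Silver: cleaned and deduplicated, unified currency and time zone
CREATE TABLE silver_bets (
  bet_id    BIGINT,
  player_sk BIGINT,
  stake_eur DECIMAL(18,2),
  win_eur   DECIMAL(18,2),
  ts_utc    TIMESTAMP,
  dt        DATE
);

-- Gold: business mart, preaggregated by day/country
CREATE TABLE mart_ggr_daily_by_country (
  d            DATE,
  country_code STRING,
  ggr          DECIMAL(18,2)
);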

3) Sources and loading: CDC, outbox, files

CDC (Change Data Capture): change streams from OLTP (Postgres/MySQL) with guaranteed ordering and idempotency.
Outbox pattern: events are written to an outbox table/collection within the service transaction → a connector publishes them to the bus/lake (see the sketch below).
File upload: PSP uploads, partner reports; use manifests, checksums, and landing directories.

Practices: sources are versioned (schema version); each source has a contract of fields and quality expectations.
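
A hedged sketch of the outbox pattern on the OLTP side (the table, columns, and literal values are illustrative): the business change and the outbox event are committed in one transaction, and a connector later publishes unprocessed rows to the bus.
sql
-- Outbox table lives next to the service's OLTP tables
CREATE TABLE payment_outbox (
  event_id     BIGINT PRIMARY KEY,     -- normally from a sequence; the literal below is illustrative
  aggregate_id BIGINT NOT NULL,        -- e.g. deposit_id
  event_type   VARCHAR(64) NOT NULL,   -- e.g. 'deposit_confirmed'
  payload      TEXT NOT NULL,          -- serialized event
  created_at   TIMESTAMP NOT NULL,
  published_at TIMESTAMP               -- set by the connector after publishing
);

-- Business change and outbox event are committed atomically
BEGIN;
UPDATE deposits SET status = 'confirmed' WHERE deposit_id = 42;
INSERT INTO payment_outbox (event_id, aggregate_id, event_type, payload, created_at)
VALUES (1001, 42, 'deposit_confirmed', '{"deposit_id": 42, "amount": 100.00}', CURRENT_TIMESTAMP);
COMMIT;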

4) Orchestration: DAG, dependencies, deploy

DAGs: explicit dependencies (raw → staging → dims → facts → marts).
Task idempotency: rerun without side effects (partition overwrite, MERGE/upsert).
Separation of environments: Dev/Stage/Prod, promotion of artifacts, "manual approval" for expensive backfills.
Scheduling: cron/time windows + event triggers (on arrival of files/batches).
Secrets: from a secret manager; no secrets in DAG code.

Example of an abstract DAG (pseudocode):
python
# Airflow-style pseudocode; the cron expression and task functions are illustrative
with DAG("dwh_daily", schedule="0 2 * * *") as dag:
    bronze = ingest_cdc(source="payments", partition=hour())
    silver = dedup_normalize(input=bronze)
    dims   = build_dimensions(input=silver)
    facts  = build_facts(input=silver, dims=dims)
    marts  = build_marts(input=facts)
    # facts depend on both silver and dims; marts depend on facts
    bronze >> silver >> dims >> facts >> marts

5) Data quality (DQ) and lineage

DQ checks: completeness (row counts, late arrivals), uniqueness of keys, range/domain rules (amount ≥ 0, currency present in the reference dictionary); see the freshness sketch below.
Trigger thresholds: hard stop or soft-fail with an alert, depending on the criticality of the table.
Lineage/catalog: from report to source (tables, columns, metrics), owners, documentation, PII classification.
Schema control: automatic compatibility tests (backward-/forward-compatible), alert for "breaking" changes.
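
For instance, freshness and late-arrival checks can be expressed directly in SQL; the table name, columns, and thresholds below are assumptions, and the syntax is generic/PostgreSQL-flavoured:
sql
-- Freshness: returns a row (i.e. fires an alert) if Silver is older than 15 minutes
SELECT MAX(ingest_ts) AS last_ingest
FROM silver_bets
HAVING MAX(ingest_ts) < CURRENT_TIMESTAMP - INTERVAL '15' MINUTE;

-- Late arrivals: share of today's rows that landed more than 1 hour after the event time
SELECT AVG(CASE WHEN ingest_ts > ts_utc + INTERVAL '1' HOUR THEN 1.0 ELSE 0.0 END) AS late_share
FROM silver_bets
WHERE dt = CURRENT_DATE;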

6) Modeling: SCD, surrogate keys, normalization

SCD2 for dimensions: valid_from/valid_to/is_current columns, a surrogate key (_sk) and a natural key (_id); see the sketch after this list.
SCD1: overwrite for minor attributes (for example, interface locale).
Surrogate keys: stable _sk for joins, natural keys for uniqueness.
Dimension normalization: snowflake where hierarchies are deep; otherwise star for speed.
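
A hedged SCD2 sketch for a player dimension (dim_player, staging_players, and their attribute columns are assumptions; the UPDATE ... FROM form is PostgreSQL-flavoured): close the current version when tracked attributes change, then insert new current versions.
sql
-- Step 1: close current versions whose tracked attributes changed
UPDATE dim_player d
SET valid_to = s.changed_at, is_current = FALSE
FROM staging_players s
WHERE d.player_id = s.player_id
  AND d.is_current
  AND (d.country_code <> s.country_code OR d.vip_level <> s.vip_level);

-- Step 2: insert new current versions for new or just-closed players
-- (player_sk is assumed to be an identity/sequence column)
INSERT INTO dim_player (player_id, country_code, vip_level, valid_from, valid_to, is_current)
SELECT s.player_id, s.country_code, s.vip_level, s.changed_at, NULL, TRUE
FROM staging_players s
LEFT JOIN dim_player d ON d.player_id = s.player_id AND d.is_current
WHERE d.player_id IS NULL;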

7) Incremental models and partitioning

Watermark (updated_at, ingest_ts): read only new/changed rows.
Incremental strategies: MERGE by business keys, INSERT OVERWRITE by partition, DELETE + INSERT for small partitions.
Partitioning: by date/hour/region; clustering (sort keys/Z-order) on filter and join keys.
Materialized views: GGR/NGR preaggregation, cache for popular slices.
Approximate aggregates: HLL/approx_distinct for cheap top-N marts (see the sketch after the MERGE example below).

Example of an incremental MERGE (generic):
sql
MERGE INTO fact_deposits f
USING staging_deposits s
ON (f.deposit_id = s.deposit_id)
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (...)
VALUES (...);
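
For the approximate-aggregates point above, a hedged top-N sketch; the exact function name varies by engine (approx_distinct, APPROX_COUNT_DISTINCT, HLL sketches), and the player_sk column on fact_bets is an assumption:
sql
-- Top 10 countries by approximate daily active players
SELECT c.country_code,
       APPROX_COUNT_DISTINCT(b.player_sk) AS approx_dau   -- engine-specific function name
FROM fact_bets b
JOIN dim_country c ON c.country_sk = b.country_sk AND c.is_current
WHERE b.ts >= CURRENT_DATE
GROUP BY c.country_code
ORDER BY approx_dau DESC
LIMIT 10;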

8) Backfill, reprocessing and story management

Backfill: separate DAGs with resource limits and windows; a clear "window of truth" (e.g. 2024-01-01..2025-11-05). A sketch follows below.
Reprocessing: deterministic transformations → repeated runs give the same result. Log the version of the model code.

Time-travel/table versions: convenient for investigations and for recovering from "logical errors" (DR).

Retraction: a policy for retracting (deleting/fixing) data, with logging.
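
A hedged sketch of an idempotent backfill over an explicit window (Spark/Hive-style INSERT OVERWRITE; table names and the window are illustrative): only the partitions inside the window are rewritten, so reruns give the same result.
sql
-- Recompute only the partitions inside the backfill window
INSERT OVERWRITE TABLE fact_bets PARTITION (dt)
SELECT *
FROM silver_bets                -- deterministic upstream layer (illustrative)
WHERE dt BETWEEN DATE '2024-01-01' AND DATE '2024-01-31';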

9) Pipeline SLI/SLA/SLO

Freshness: Bronze ≤ 1-5 min, Silver ≤ 15 min, Gold ≤ 60 min (example).
Reliability: DAG pass rate ≥ 99.x%.
Performance: p95/p99 task duration; time budget per batch.

Lag monitoring: lag of the ingest stream, queue depth, share of "late data."

Alerts: freshness/volume violations, DQ failures, growth in scan cost, degradation of materialized views.
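
Freshness SLOs from this section can be checked with a simple query over a metadata table of layer load times; the _meta_layer_freshness table and its columns are assumptions:
sql
-- Any row returned is an SLO violation and should trigger an alert
SELECT layer,
       last_loaded_at,
       CURRENT_TIMESTAMP - last_loaded_at AS lag
FROM _meta_layer_freshness
WHERE (layer = 'bronze' AND last_loaded_at < CURRENT_TIMESTAMP - INTERVAL '5'  MINUTE)
   OR (layer = 'silver' AND last_loaded_at < CURRENT_TIMESTAMP - INTERVAL '15' MINUTE)
   OR (layer = 'gold'   AND last_loaded_at < CURRENT_TIMESTAMP - INTERVAL '60' MINUTE);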

10) Cost: forecasting and optimization

Partitioning and clustering minimize scan volume.
Materialize hot marts (by day/country/product).
Results cache/MVs for frequently used dashboards.
Monitor refresh frequency (no "every 5 minutes" without a reason).
TTL: aggressive Bronze retention, medium Silver, long Gold (aggregates only); see the sketch below.
Capacity planning: catalog metrics, forecast of tournament/campaign peaks.
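
Retention by layer can be as simple as dropping old partitions on a schedule; the retention windows and table names below are illustrative (they follow the Medallion sketch above):
sql
-- Aggressive Bronze retention: keep 30 days of raw partitions
DELETE FROM bronze_bets_cdc WHERE dt < CURRENT_DATE - INTERVAL '30' DAY;

-- Medium Silver retention: keep 13 months
DELETE FROM silver_bets WHERE dt < CURRENT_DATE - INTERVAL '13' MONTH;

-- Gold keeps a long history, but aggregates only (no row-level detail)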

11) Safety, PII and localization

Data classification: PII/financial/operational.
Encryption: at rest and in transit; KMS/role-based access.
De-identification: hashing/masking, separate columns with keys.
RLS/row-level policies for multi-tenancy (by tenant_id); see the sketch below.
Localization: storage and processing areas by region (EU/TR/LATAM); export only to allowed locations.
Auditing: reads/writes to critical tables, directory access.
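
For engines with row-level security, tenant isolation by tenant_id might look like this (PostgreSQL-style syntax; the tenant_id column on fact_deposits and the app.tenant_id setting are assumptions):
sql
-- PostgreSQL-style RLS: each session sees only its tenant's rows
ALTER TABLE fact_deposits ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON fact_deposits
  USING (tenant_id = current_setting('app.tenant_id')::bigint);

-- The application sets the tenant for the session before querying:
-- SET app.tenant_id = '42';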

12) Observability: metrics, logs, traces

Pipeline metrics: task duration, queue depth, errors, retries, bytes/rows processed, cost.
Logs: structured; correlated by trace_id/run_id.
Tracing: from source to mart (ingest → transform → load → BI).
Dashboards: freshness of layers, DAG success, top expensive queries, p95/p99.
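
If task runs are logged to a metadata table, p95 duration and failure rate per DAG can be computed directly in SQL; the _meta_task_runs table and its columns are assumptions:
sql
-- p95 task duration and failure share per DAG over the last 7 days
SELECT dag_id,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_sec) AS p95_duration_sec,
       AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END)     AS failure_rate
FROM _meta_task_runs
WHERE started_at >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY dag_id
ORDER BY p95_duration_sec DESC;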

13) Tooling (reference points by role)

Orchestration: DAG orchestrators (with scheduler, retries, alerts, secrets).
Transformations: SQL modeling ("models as code"), unit tests of models, documentation.
DQ/contracts: validation frameworks and SLAs on datasets.
Lineage/Catalog: automatic dependency graph, owner lookup.
Streaming: window/aggregation processors, sink/source connectors.

(Specific vendors are selected for the company's stack and security requirements.)

14) Sample templates

GGR mart template (generic SQL)

sql
CREATE OR REPLACE TABLE mart_ggr_daily AS
SELECT
  DATE(b.ts) AS d,
  c.country_code,
  SUM(b.stake) AS stake_sum,
  SUM(b.win)   AS win_sum,
  SUM(b.stake - b.win) AS ggr
FROM fact_bets b
JOIN dim_country c ON c.country_sk = b.country_sk AND c.is_current
WHERE b.ts >= DATE_SUB(CURRENT_DATE, INTERVAL 60 DAY)
GROUP BY d, c.country_code;

Incremental watermarked model

sql
INSERT INTO fact_bets PARTITION (dt)
SELECT *
FROM staging_bets
WHERE updated_at > (SELECT COALESCE(MAX(watermark), '1970-01-01') FROM _meta_watermarks WHERE table_name = 'fact_bets');
-- then update watermark
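
The "then update watermark" step could look like this (using the same _meta_watermarks table as above); the watermark is advanced only after the load has committed:
sql
UPDATE _meta_watermarks
SET watermark = (SELECT MAX(updated_at) FROM staging_bets)
WHERE table_name = 'fact_bets';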

DQ checks (idea)

sql
-- 1) key uniqueness
SELECT deposit_id FROM fact_deposits GROUP BY deposit_id HAVING COUNT(*) > 1;

-- 2) negative amounts (error)
SELECT * FROM fact_deposits WHERE amount < 0;

15) Implementation checklist

1. Define the metrics dictionary (GGR/NGR/LTV/Retention) and owners.
2. Record SLO freshness across Bronze/Silver/Gold layers.
3. Standardize source contracts (schemas, DQ, SLA).
4. Construct a DAG graph with idempotent steps and isolated secrets.
5. Implement incrementality (MERGE/overwrite by partition) and watermarks.
6. Include DQ (critical/soft checks), lineage and a data catalog.
7. Set up observability (metrics, logs, traces) and alerts.
8. Enter retention/TTL and backfill/reprocessing policy.
9. Provide PII control, encryption, RLS, and localization.
10. Run a game day: simulate a source outage, breaking schema changes, and a mass backfill.

16) Antipatterns

"One Night ETL for Everything" with no parties and no incrementality.
Lack of DQ and lineage → conflicting reports and ghost hunting.
Full table rescans on every run (cost explosion).
Tight real-time coupling without buffers/retries.
Mixing PII with public marts without segregation and masking.
No retraction/deletion policy (errors cannot be corrected).

Summary

A robust analytics pipeline in iGaming is ELT + streaming ingestion into a layered model with strict DQ/lineage, incremental models, a transparent orchestrator, and measurable SLOs. Add cost control, a PII/localization policy, and regular backfill/DR exercises, and the analytical platform will reliably scale to tournament peaks, serving the business data of the required freshness and quality.
