Analytics and ETL pipelines
(Section: Technology and Infrastructure)
Brief Summary
The analytics pipeline turns raw iGaming operational events (bets, deposits, PSP webhooks, game logs) into stable metric marts (GGR/NGR, LTV, retention, anti-fraud signals). Supporting principles: a layered model (Bronze/Silver/Gold), disciplined DQ and lineage tooling, incrementality and idempotence, observability and SLOs, cost control. Decisions take into account the load profile (tournament peaks), regulatory constraints (PII/localization), and business requirements for data freshness.
1) Architectures: ETL vs ELT, batch vs stream
ETL (Extract → Transform → Load): transforms before loading into the DWH. Suitable where transformations require a controlled environment/secrets before the "cloud."
ELT (Extract → Load → Transform): raw data lands in the Lake/Lakehouse/DWH, then SQL/engine transformations (dbt/SQL scripts). Convenient for columnar engines and flexible iteration.
Batch: scheduled windows (every 5/15/60 minutes, nightly). Cheap and predictable.
Stream: near real-time (Kafka → Flink/ksqlDB → OLAP). For second-level windows (5-60 seconds) and anti-fraud/CRM signals.
Hybrid: Bronze is filled by the stream; Silver/Gold are incremental batch models.
Recommendation: in iGaming, keep ELT + streaming: events via CDC/outbox → Bronze (minute-level freshness), incremental transformations in Silver/Gold.
2) Medallion
Bronze (Raw): raw events/CDC without business logic. Parquet/ORC formats, schemas as-is, minimal validation.
Silver (Conformed): cleaning, deduplication, ID normalization, SCD for dimensions, currency/time-zone unification.
Gold (Marts): business marts (facts/dimensions, cubes), materialized views, preaggregations (by day/country/product).
Pros: reproducibility, transparent evolution, different SLOs and TTLs per layer.
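An illustrative DDL sketch of the three layers for a payments source (generic SQL; the table and column names are assumptions, and partition/format clauses are engine-specific and omitted):
sql
-- Bronze: raw CDC payloads as delivered, minimal validation
CREATE TABLE bronze_payments_cdc (
  payload        VARCHAR,    -- raw record (JSON/Avro as text)
  source_schema  VARCHAR,    -- producer schema version
  ingest_ts      TIMESTAMP,
  dt             DATE        -- partition column
);

-- Silver: cleaned, deduplicated, conformed types and currency
CREATE TABLE silver_payments (
  payment_id  BIGINT,
  player_id   BIGINT,
  amount_eur  DECIMAL(18,2),
  status      VARCHAR,
  updated_at  TIMESTAMP,
  dt          DATE
);

-- Gold: preaggregated mart for BI
CREATE TABLE mart_payments_daily AS
SELECT dt, COUNT(*) AS payments_cnt, SUM(amount_eur) AS amount_eur_sum
FROM silver_payments
GROUP BY dt;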
3) Sources and loading: CDC, outbox, files
CDC (Change Data Capture): change streams from OLTP (Postgres/MySQL) with guaranteed ordering and idempotency.
Outbox pattern: events are written to an outbox table/collection in the service transaction → a connector publishes them to the bus/lake.
File ingestion: PSP uploads, partner reports; use manifests, checksums, and landing directories.
Practices: sources are versioned (schema version); each source has a contract of fields and quality expectations.
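A minimal outbox sketch (Postgres-style syntax; 'deposits' and 'outbox_events' are hypothetical service tables). The connector later reads 'outbox_events' and publishes to the bus/lake:
sql
BEGIN;

-- 1) the business write
INSERT INTO deposits (deposit_id, player_id, amount, currency, status, updated_at)
VALUES (1001, 42, 50.00, 'EUR', 'confirmed', now());

-- 2) the event, in the same transaction: both rows commit or neither does
INSERT INTO outbox_events (event_id, aggregate_type, aggregate_id, event_type, payload, created_at)
VALUES (gen_random_uuid(), 'deposit', '1001', 'deposit_confirmed',
        '{"deposit_id": 1001, "amount": 50.00, "currency": "EUR"}', now());

COMMIT;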
4) Orchestration: DAG, dependencies, deploy
DAGs: explicit dependencies (raw → staging → dims → facts → marts).
Task idempotency: reruns without side effects (partition overwrite, 'MERGE'/upsert).
Separation of environments: Dev/Stage/Prod, promotion of artifacts, manual approval for expensive backfills.
Scheduling: cron/time windows + event triggers (on arrival of files/batches).
Secrets: from a secret manager; no secrets in DAG code.
python
# Minimal Airflow-style sketch; ingest_cdc, dedup_normalize, build_* and hour()
# are project-specific task factories assumed to be defined elsewhere.
from airflow import DAG

with DAG("dwh_daily", schedule="0 2 * * *") as dag:  # example: nightly at 02:00
    bronze = ingest_cdc(source="payments", partition=hour())
    silver = dedup_normalize(input=bronze)
    dims = build_dimensions(input=silver)
    facts = build_facts(input=silver, dims=dims)
    marts = build_marts(input=facts)
    bronze >> silver >> dims >> facts >> marts
5) Data quality (DQ) and lineage
DQ checks: completeness (counts, late arrivals), key uniqueness, range/domain rules (amount ≥ 0, currency present in the reference table).
Trigger thresholds: hard stop or soft-fail with an alert, depending on the criticality of the table.
Lineage/catalog: from report back to source (tables, columns, metrics), owners, documentation, PII classification.
Schema control: automatic compatibility tests (backward-/forward-compatible), alerts on breaking changes.
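Complementing the uniqueness/range checks in the templates section, a hedged sketch of a completeness check (loaded vs. declared row counts) and a referential check; 'source_control_counts' is a hypothetical manifest table and the fact/dimension names are illustrative:
sql
-- completeness: loaded rows must match the source's declared count per day
SELECT m.dt, m.declared_rows, COALESCE(t.loaded_rows, 0) AS loaded_rows
FROM source_control_counts m
LEFT JOIN (
  SELECT DATE(ts) AS dt, COUNT(*) AS loaded_rows
  FROM fact_bets
  GROUP BY DATE(ts)
) t ON t.dt = m.dt
WHERE m.declared_rows <> COALESCE(t.loaded_rows, 0);

-- referential integrity: every fact row must resolve to a dimension row
SELECT COUNT(*) AS orphan_rows
FROM fact_bets b
LEFT JOIN dim_country c ON c.country_sk = b.country_sk
WHERE c.country_sk IS NULL;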
6) Modeling: SCD, surrogate keys, normalization
SCD2 for dimensions: 'valid_from'/'valid_to'/'is_current', a surrogate key ('_sk') and a natural key ('_id').
SCD1: overwrite for minor attributes (for example, interface locale).
Surrogate keys: stable '_sk' for joins, natural keys for uniqueness.
Dimension normalization: snowflake where hierarchies are deep; otherwise a star schema for speed.
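A hedged two-statement SCD2 sketch for a hypothetical 'dim_player' (the tracked attributes 'country_code' and 'vip_level' are illustrative; NULL-safe comparisons and surrogate-key generation are engine-specific, and many warehouses express the same logic as a single MERGE):
sql
-- 1) close out current versions whose tracked attributes changed
UPDATE dim_player d
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
WHERE d.is_current
  AND EXISTS (
    SELECT 1 FROM staging_players s
    WHERE s.player_id = d.player_id
      AND (s.country_code <> d.country_code OR s.vip_level <> d.vip_level)
  );

-- 2) insert new versions for changed rows and brand-new natural keys
-- (player_sk is assumed to be an identity/sequence column)
INSERT INTO dim_player (player_id, country_code, vip_level, valid_from, valid_to, is_current)
SELECT s.player_id, s.country_code, s.vip_level, CURRENT_TIMESTAMP, NULL, TRUE
FROM staging_players s
LEFT JOIN dim_player d ON d.player_id = s.player_id AND d.is_current
WHERE d.player_id IS NULL;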
7) Incremental models and partitioning
Watermark ('updated_at', 'ingest_ts'): read only new/changed rows.
Incremental strategies: 'MERGE' by business keys, 'INSERT OVERWRITE' by partition, 'DELETE + INSERT' for small partitions.
Partitioning: by date/hour/region; clustering (sort keys/Z-order) on filter and join keys.
Materialized views: GGR/NGR preaggregations, caches for popular slices.
Approximate aggregates: HLL/approx_distinct for cheap top-N marts (see the sketch after the MERGE example below).
sql
MERGE INTO fact_deposits f
USING staging_deposits s
ON (f.deposit_id = s.deposit_id)
WHEN MATCHED THEN UPDATE SET amount = s.amount, status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (...)
VALUES (...);
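And the sketch referenced in the approximate-aggregates bullet above: a cheap top-N of games by unique players. The exact function name (e.g. approx_distinct, APPROX_COUNT_DISTINCT) and interval syntax depend on the engine; table and column names are illustrative:
sql
SELECT
  game_id,
  approx_distinct(player_id) AS approx_unique_players  -- engine-specific approximate count
FROM fact_bets
WHERE dt >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY game_id
ORDER BY approx_unique_players DESC
LIMIT 20;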
8) Backfill, reprocessing and history management
Backfill: dedicated DAGs with resource limits and windows; a clear "window of truth" (e.g. 2024-01-01..2025-11-05).
Reprocessing: deterministic transformations → repeated runs give the same result. Log the model code version.
Time travel/table versions: convenient for investigations and DR after "logical errors."
Retraction: a retract (delete/correct) policy with logging.
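A hedged single-partition backfill step, using Spark/Hive-style 'INSERT OVERWRITE' so reruns stay idempotent; the orchestrator loops it over the truth window with resource limits (table and column names are illustrative):
sql
-- rerun-safe: the whole dt partition is replaced, never appended to
INSERT OVERWRITE TABLE fact_bets PARTITION (dt = '2024-03-15')
SELECT bet_id, player_id, game_id, country_sk, stake, win, ts, updated_at
FROM silver_bets
WHERE DATE(ts) = DATE '2024-03-15';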
9) Pipeline SLA/SLO
Freshness: Bronze ≤ 1-5 min, Silver ≤ 15 min, Gold ≤ 60 min (example).
Reliability: DAG pass rate ≥ 99.x%.
Performance: p95/p99 task duration; a time budget per batch.
Lag monitoring: ingest stream lag, queue depth, share of late-arriving data.
Alerts: freshness/volume violations, DQ failures, growth in scan cost, MV degradation.
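A hedged freshness-check sketch, reusing the illustrative 'bronze_payments_cdc'/'silver_payments' tables from the Medallion sketch and assuming each layer carries an ingest/update timestamp (thresholds mirror the example SLOs above; interval syntax varies by engine):
sql
-- one row per monitored table
SELECT 'bronze_payments_cdc' AS table_name,
       MAX(ingest_ts)        AS last_event_ts,
       CASE WHEN MAX(ingest_ts) < CURRENT_TIMESTAMP - INTERVAL 5 MINUTE
            THEN 'FRESHNESS_VIOLATION' ELSE 'OK' END AS status
FROM bronze_payments_cdc
UNION ALL
SELECT 'silver_payments',
       MAX(updated_at),
       CASE WHEN MAX(updated_at) < CURRENT_TIMESTAMP - INTERVAL 15 MINUTE
            THEN 'FRESHNESS_VIOLATION' ELSE 'OK' END
FROM silver_payments;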
10) Cost: forecasting and optimization
Partitioning and clustering minimize scan volume.
Materialize hot marts (by day/country/product).
Result caches/MVs for frequently used dashboards.
Watch re-run frequency (no "every 5 minutes" without a reason).
TTL: short Bronze retention, medium Silver, long Gold (aggregates only).
Capacity planning: catalog metrics, forecasts of tournament/campaign peaks.
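A hedged retention sweep for the TTL bullet, reusing the illustrative tables above and their date partitions; many engines do the same via partition drops or table-level TTL properties, and the windows shown are only examples:
sql
-- Bronze: short retention (assuming raw data can be re-ingested from the lake/source)
DELETE FROM bronze_payments_cdc
WHERE dt < CURRENT_DATE - INTERVAL 14 DAY;

-- Gold: keep aggregates much longer
DELETE FROM mart_ggr_daily
WHERE d < CURRENT_DATE - INTERVAL 730 DAY;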
11) Security, PII and localization
Data classification: PII/financial/operational.
Encryption: at rest and in transit; KMS/role-based access.
De-identification: hashing/masking, with re-identification keys stored in separate columns.
RLS/views for multi-tenancy (by 'tenant_id').
Localization: storage and processing regions per market (EU/TR/LATAM); export only to allowed locations.
Auditing: reads/writes on critical tables, catalog access.
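A hedged de-identification sketch: an analyst-facing view over a hypothetical 'dim_player' with 'email'/'phone' columns. Hash and string functions (SHA2/sha256, RIGHT) vary by engine, and the salt should come from a secret manager rather than being inlined as it is here:
sql
CREATE OR REPLACE VIEW v_players_masked AS
SELECT
  player_sk,
  player_id,
  SHA2(CONCAT(email, ':', 'env_salt'), 256) AS email_hash,   -- pseudonymized join key
  CONCAT('******', RIGHT(phone, 2))         AS phone_masked, -- keep only the last 2 digits
  country_code
FROM dim_player;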
12) Observability: metrics, logs, traces
Pipeline metrics: task duration, queueing, errors, retries, bytes/rows processed, cost.
Logs: structured; correlated by 'trace_id'/'run_id'.
Tracing: from source to mart (ingest → transform → load → BI).
Dashboards: layer freshness, DAG success, top expensive queries, p95/p99.
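A hedged sketch over a hypothetical 'pipeline_runs' metadata table feeding these dashboards; PERCENTILE_CONT support and syntax vary by engine, and the column names are assumptions:
sql
SELECT
  dag_id,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_sec) AS p95_duration_sec,
  AVG(CASE WHEN status = 'failed' THEN 1.0 ELSE 0.0 END)     AS failure_rate,
  SUM(bytes_processed)                                       AS total_bytes
FROM pipeline_runs
WHERE started_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY dag_id;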
13) Tooling (reference points by role)
Orchestration: DAG orchestrators (with scheduler, retries, alerts, secrets).
Transformations: SQL modeling ("models as code"), unit tests of models, documentation.
DQ/contracts: validation frameworks and SLAs on datasets.
Lineage/Catalog: automatic dependency graph, owner lookup.
Streaming: window/aggregation processors, sink/source connectors.
(Specific vendors are selected for the company's stack and security requirements.)
14) Sample templates
GGR Mart Template (Generic SQL)
sql
CREATE OR REPLACE TABLE mart_ggr_daily AS
SELECT
DATE(b.ts) AS d,
c.country_code,
SUM(b.stake) AS stake_sum,
SUM(b.win) AS win_sum,
SUM(b.stake - b.win) AS ggr
FROM fact_bets b
JOIN dim_country c ON c.country_sk = b.country_sk AND c.is_current
WHERE b.ts >= DATE_SUB(CURRENT_DATE, INTERVAL 60 DAY)
GROUP BY d, c.country_code;
Incremental watermarked model
sql
INSERT INTO fact_bets PARTITION (dt)
SELECT *
FROM staging_bets
WHERE updated_at > (SELECT COALESCE(MAX(watermark), '1970-01-01')
                    FROM _meta_watermarks WHERE table_name = 'fact_bets');
-- then advance the watermark:
UPDATE _meta_watermarks
SET watermark = (SELECT MAX(updated_at) FROM staging_bets)
WHERE table_name = 'fact_bets';
DQ checks (idea)
sql
-- 1) key uniqueness
SELECT deposit_id FROM fact_deposits GROUP BY deposit_id HAVING COUNT(*) > 1;
-- 2) negative amounts (error)
SELECT * FROM fact_deposits WHERE amount < 0;
15) Implementation checklist
1. Define the metrics dictionary (GGR/NGR/LTV/Retention) and owners.
2. Record SLO freshness across Bronze/Silver/Gold layers.
3. Standardize source contracts (schemas, DQ, SLA).
4. Construct a DAG graph with idempotent steps and isolated secrets.
5. Implement incrementality (MERGE/overwrite by partition) and watermarks.
6. Enable DQ (critical/soft checks), lineage, and a data catalog.
7. Set up observability (metrics, logs, traces) and alerts.
8. Introduce retention/TTL and a backfill/reprocessing policy.
9. Provide PII controls, encryption, RLS, and localization.
10. Run a game day: simulate a source outage, breaking schema changes, and a mass backfill.
16) Antipatterns
"One Night ETL for Everything" with no parties and no incrementality.
Lack of DQ and lineage → conflicting reports and ghost hunting.
Complete revision of tables at each launch (explosion of cost).
Hard bundle in real time without buffers/retraces.
Blending PII and public storefronts without segmentation and masking.
No retraction/deletion policy (errors cannot be corrected).
Summary
A robust analytics pipeline in iGaming is ELT plus streaming ingestion into a layered model with strict DQ/lineage, incremental models, a transparent orchestrator, and measurable SLOs. Add cost control, a PII/localization policy, and regular backfill/DR exercises, and the analytical platform will scale reliably to tournament peaks while serving the business data of the required freshness and quality.