GambleHub

Batch processing

1) Purpose and value

Batch pipelines build reliable daily/hourly data marts for:
  • Regulatory and financial reporting (GGR/NGR, taxes, RG/AML registries).
  • BI and product analytics (cohorts, LTV, conversion funnels).
  • Reconciliation (OLTP↔DWH, providers/PSPs) and historization (SCD).
  • Preparation of features and training sets for ML.

Key properties: predictability, completeness, reproducibility, low cost per unit of data.

2) Architecture (reference)

1. Ingest (raw capture): HTTP/gRPC, CDC from OLTP, provider uploads → Bronze.
2. Lakehouse: Bronze (raw, append-only) → Silver (clean/conform) → Gold (serve).
3. Orchestration: Airflow/Dagster/Prefect (DAGs, dependencies, retries, SLAs).
4. Processing: Spark/Trino/DBT/SQL engines; partitioning and ACID formats (Delta/Iceberg/Hudi).
5. DQ and Contracts: Schema Registry, DQ rules (YAML/SQL), consumer contract tests.
6. Serving: BI/semantic layer, regulatory exports (CSV/PDF/JSON + hash), API/GraphQL.
7. Observability: pipeline metrics, lineage, logs, cost (cost/GB, cost/query).

3) Frequencies and SLAs

Daily (D+1, locked by 06:00): GGR reports, regulatory uploads, reconciliations.
Hourly/near-real-time: operational dashboards for Ops/Finance.
Weekly/monthly: financial consolidation, models, and retrospective recalculations.

Recommended SLOs:
  • Gold daily marts ready by 06:00 local time.
  • Silver freshness p95 ≤ 15 min for micro-batches / ≤ 2 h for daily loads.
  • Completeness ≥ 99.5%, validity (schema) ≥ 99.9%.
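As a lightweight illustration of how such an SLO can be checked, here is a nearest-rank p95 over freshness lag samples; the lag values and their sampling source are invented for the sketch:

```python
# Hedged sketch: checking the Silver freshness SLO (p95 <= 15 min) against
# lag samples. The samples and their source are illustrative assumptions.
def p95(values):
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    idx = max(0, int(round(0.95 * len(s))) - 1)
    return s[idx]

lags_min = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # minutes behind the source
budget_min = 15  # micro-batch freshness budget from the SLO above
ok = p95(lags_min) <= budget_min
print(p95(lags_min), ok)  # 12 True
```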

4) Incremental Loads and CDC

Approaches:
  • CDC (Change Data Capture): Debezium/log replication → Bronze → increments in Silver.
  • Watermark by time: `updated_at > max_loaded_ts`.
  • Hash comparison: `md5(row)` for change detection.
  • Upsert/Merge: Idempotent Silver/Gold updates.
Example MERGE (Delta/Iceberg):
```sql
MERGE INTO silver.payments AS s
USING staging.payments_delta AS d
  ON s.transaction_id = d.transaction_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
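The watermark + upsert combination can be sketched end-to-end in Python, using SQLite's `ON CONFLICT` upsert as a stand-in for MERGE; the table and column names are illustrative:

```python
# Sketch of a watermark-based incremental load with an idempotent upsert.
# SQLite stands in for the real engine; schema and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_payments (transaction_id TEXT PRIMARY KEY,
                              amount REAL, updated_at TEXT);
CREATE TABLE staging_payments (transaction_id TEXT,
                               amount REAL, updated_at TEXT);
INSERT INTO staging_payments VALUES
  ('tx1', 10.0, '2024-01-01T00:00:00'),
  ('tx2', 25.0, '2024-01-02T00:00:00');
""")

def load_increment(conn, watermark):
    """Upsert only rows newer than the watermark; rerunning is safe."""
    conn.execute("""
        INSERT INTO silver_payments (transaction_id, amount, updated_at)
        SELECT transaction_id, amount, updated_at
        FROM staging_payments WHERE updated_at > ?
        ON CONFLICT(transaction_id) DO UPDATE SET
          amount = excluded.amount, updated_at = excluded.updated_at
    """, (watermark,))
    # Advance the watermark to the newest updated_at loaded so far.
    return conn.execute(
        "SELECT MAX(updated_at) FROM silver_payments").fetchone()[0]

wm = load_increment(conn, "1970-01-01T00:00:00")
wm = load_increment(conn, wm)  # rerun: no duplicates thanks to the upsert
rows = conn.execute("SELECT COUNT(*) FROM silver_payments").fetchone()[0]
print(rows, wm)
```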

5) SCD (dimension historization)

SCD I: overwrite (spelling fixes, minor corrections).
SCD II: full history (`valid_from`/`valid_to`/`is_current`).
SCD III: "before/after" columns for quick comparisons.

SCD II (example):
```sql
-- First pass: close the current version when a tracked attribute changes,
-- and insert rows for brand-new keys. Changed keys still need their new
-- current version inserted in a follow-up statement.
MERGE INTO dim.users_scd t
USING stage.users u
  ON t.user_pseudo_id = u.user_pseudo_id AND t.is_current = TRUE
WHEN MATCHED AND (t.country <> u.country OR t.rg_status <> u.rg_status)
  THEN UPDATE SET t.is_current = FALSE, t.valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED
  THEN INSERT (user_pseudo_id, country, rg_status, valid_from, valid_to, is_current)
  VALUES (u.user_pseudo_id, u.country, u.rg_status, CURRENT_TIMESTAMP, NULL, TRUE);
```
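For intuition, the same SCD II behavior can be modeled in plain Python; this is an illustrative in-memory sketch, not an engine, and the tracked attributes mirror the example above:

```python
# Minimal SCD Type II sketch: close the current row when a tracked
# attribute changes, then append a new current version.
from datetime import datetime, timezone

def scd2_apply(history, incoming, keys=("country", "rg_status")):
    """Apply one incoming snapshot row to an SCD II history (list of dicts)."""
    now = datetime.now(timezone.utc).isoformat()
    current = next((r for r in history
                    if r["user_pseudo_id"] == incoming["user_pseudo_id"]
                    and r["is_current"]), None)
    if current and all(current[k] == incoming[k] for k in keys):
        return history  # no change: identical snapshots are a no-op
    if current:
        current["is_current"] = False
        current["valid_to"] = now
    history.append({**incoming, "valid_from": now,
                    "valid_to": None, "is_current": True})
    return history

hist = []
scd2_apply(hist, {"user_pseudo_id": "u1", "country": "DE", "rg_status": "ok"})
scd2_apply(hist, {"user_pseudo_id": "u1", "country": "DE", "rg_status": "ok"})
scd2_apply(hist, {"user_pseudo_id": "u1", "country": "FR", "rg_status": "ok"})
print(len(hist), sum(r["is_current"] for r in hist))  # 2 versions, 1 current
```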

6) Backfill and Reprocessing

Backfill: initial fill / historical load of past periods.
Reprocessing: recalculating marts after logic changes or data corrections.

Principles:
  • Idempotency (MERGE/upsert), Bronze immutability, logic versioning.
  • Time-travel / metadata snapshots for repeatable runs.
  • Guardrails: limits on ranges, quotas, and concurrent jobs.
  • Documentation: runbook with steps and completion criteria.
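A guardrail of the kind listed above can be as simple as a range-limited chunker for backfill runs; the quota and chunk size below are assumed values:

```python
# Illustrative guardrail for a backfill runner: process bounded date chunks
# so a reprocessing job cannot sweep an unlimited range in one run.
from datetime import date, timedelta

MAX_DAYS_PER_RUN = 31  # assumed quota; tune per engine/budget

def backfill_chunks(start, end, chunk_days=7):
    """Split [start, end] into chunks, refusing ranges over the quota."""
    total = (end - start).days + 1
    if total > MAX_DAYS_PER_RUN:
        raise ValueError(f"range of {total} days exceeds quota {MAX_DAYS_PER_RUN}")
    chunks, cur = [], start
    while cur <= end:
        stop = min(cur + timedelta(days=chunk_days - 1), end)
        chunks.append((cur, stop))
        cur = stop + timedelta(days=1)
    return chunks

chunks = backfill_chunks(date(2024, 1, 1), date(2024, 1, 20), chunk_days=7)
print(chunks)  # three chunks: Jan 1-7, 8-14, 15-20
```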

7) Layer modeling

Bronze:
  • Append-only; partitioned by `event_date`, `jurisdiction`, `tenant`.
  • Store the original payload (for forensics); record `ingested_at`.
Silver:
  • Normalization and standardization: FK/directories, dedup, FX/timezones.
  • Fact/dimension tables (3NF/BCNF), SCD for key dimensions.
Gold:
  • Denormalized marts for BI/regulatory/finance with SLA-backed readiness.
  • Materialization of aggregates; immutable export artifacts (hash + WORM).

8) Data quality (DQ-as-code)

An example of YAML rules for Silver:
```yaml
table: silver.payments
slo:
  freshness_minutes: 15
  completeness_percent: 99.5
rules:
  - name: amount_positive
    type: range
    column: amount_base
    min: 0.01
    severity: critical
  - name: currency_whitelist
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
    severity: major
  - name: unique_tx
    type: unique
    columns: [transaction_id]
    severity: critical
  - name: fk_user
    type: foreign_key
    column: user_pseudo_id
    ref_table: dim.users_scd
    severity: critical
```

Reaction policies: critical → fail job + DLQ; major/minor → tag + report.
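A minimal sketch of the evaluator behind such rules: the rule schema mirrors the YAML example, while the engine itself (`evaluate_rules`) is an invented illustration:

```python
# Hedged sketch of a DQ rule evaluator; a real engine would also handle
# foreign_key rules, severities, and DLQ routing.
def evaluate_rules(rows, rules):
    """Return violations; the caller fails the job if a critical rule fires."""
    violations = []
    for rule in rules:
        if rule["type"] == "range":
            bad = [r for r in rows if r[rule["column"]] < rule["min"]]
        elif rule["type"] == "in_set":
            bad = [r for r in rows if r[rule["column"]] not in rule["set"]]
        elif rule["type"] == "unique":
            seen, bad = set(), []
            for r in rows:
                key = tuple(r[c] for c in rule["columns"])
                if key in seen:
                    bad.append(r)
                seen.add(key)
        else:
            continue  # unknown rule types are skipped in this sketch
        if bad:
            violations.append((rule["name"], rule["severity"], len(bad)))
    return violations

rows = [
    {"transaction_id": "t1", "amount_base": 5.0, "currency": "EUR"},
    {"transaction_id": "t1", "amount_base": -1.0, "currency": "XXX"},
]
rules = [
    {"name": "amount_positive", "type": "range", "column": "amount_base",
     "min": 0.01, "severity": "critical"},
    {"name": "currency_whitelist", "type": "in_set", "column": "currency",
     "set": {"EUR", "USD", "GBP", "TRY", "BRL"}, "severity": "major"},
    {"name": "unique_tx", "type": "unique", "columns": ["transaction_id"],
     "severity": "critical"},
]
violations = evaluate_rules(rows, rules)
print(violations)  # all three rules fire on the second (bad) row
```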

9) Semantic layer and reporting

Unified definitions of metrics (GGR/NGR, ARPPU, Retention) in the semantic-layer/metrics-store.

Metric versioning; integration with BI and export packages.

Reports: CSV/JSON/PDF + sha256 digest, export log, and Legal Hold where required.
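The export + digest step can be sketched as follows; the report rows and fields are illustrative:

```python
# Minimal sketch of producing an export artifact with a sha256 digest so
# consumers can verify integrity. Data and columns are invented examples.
import csv, hashlib, io

rows = [("2024-01-01", "DE", "1250.00"), ("2024-01-01", "FR", "980.50")]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["event_date", "market", "ggr_eur"])
writer.writerows(rows)
payload = buf.getvalue().encode("utf-8")

# Ship the artifact together with its digest (e.g., report.csv + report.sha256).
digest = hashlib.sha256(payload).hexdigest()
print(len(digest))  # 64 hex characters
```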

10) Privacy, residency, security

PII minimization: pseudonymize users; keep the mapping in a separate protected perimeter.
Data residency: separate directories/keys for EEA/UK/BR; no cross-regional joins without a legal basis.
Encryption: TLS in transit; KMS/CMK at rest; export controls.
DSAR/RTBF: recomputable projections, selective deletion; access audit.
Legal Hold: WORM archives for regulatory artifacts.

11) Performance and cost

Partitioning by date/market/tenant; Z-order/cluster by frequent predicates.
Formats: Parquet + ACID tables; compression/statistics, OPTIMIZE/VACUUM.
Materialization: stable aggregations in Gold; avoid "monolithic" jobs.
Quotas/budgets: chargeback per team; limits on backfills and heavy queries.
Scheduling: low load windows (night/weekend), queue priorities.

12) Observability and management

Pipeline metrics: duration, success rate, retries, rows processed, cost/query.
DQ metrics: completeness, validity, uniqueness, FK errors, drift.
Freshness heatmap: by domain and market; SLA dashboards.
Lineage: from Bronze sources to reports; impact analysis before changes.
Alerts: SLO budgets, DQ degradation, delays, cost growth.

13) SQL/Model Examples

Currency normalization (Silver):
```sql
CREATE OR REPLACE TABLE silver.payments AS
SELECT p.transaction_id,
       p.user_pseudo_id,
       p.currency,
       p.amount_orig,
       r.rate AS fx_rate_used,
       p.amount_orig * r.rate AS amount_base,
       p.market,
       CAST(p.event_time AS TIMESTAMP) AS event_time
FROM bronze.payment_events p
JOIN dim.fx_rates r
  ON r.date = DATE(p.event_time)
 AND r.ccy_from = p.currency
 AND r.ccy_to = 'EUR';
```
GGR daily mart (Gold):
```sql
CREATE OR REPLACE VIEW gold.ggr_daily AS
SELECT
  DATE(b.event_time) AS event_date,
  b.market,
  g.provider_id,
  SUM(b.stake_base) AS stakes_eur,
  SUM(p.amount_base) AS payouts_eur,
  SUM(b.stake_base) - SUM(p.amount_base) AS ggr_eur
FROM silver.fact_bets b
LEFT JOIN silver.fact_payouts p
  ON p.user_pseudo_id = b.user_pseudo_id
 AND p.game_id = b.game_id
 AND DATE(p.event_time) = DATE(b.event_time)
JOIN dim.games g ON g.game_id = b.game_id
GROUP BY 1, 2, 3;
```
Completeness control (DQ SQL):
```sql
-- A partition present in the table can never have COUNT(*) = 0, so missing
-- market/day partitions are found by anti-joining an expected calendar
-- (dim.calendar_markets is an illustrative helper table).
SELECT c.market, c.event_date
FROM dim.calendar_markets c
LEFT JOIN silver.fact_bets b
  ON b.market = c.market
 AND DATE(b.event_time) = c.event_date
GROUP BY c.market, c.event_date
HAVING COUNT(b.transaction_id) = 0;
```

14) Processes and RACI

R (Responsible): Data Engineering (DAGs, Silver/Gold models), Data Platform (infra, schema registry, DQ).
A (Accountable): Head of Data / Chief Data Officer.
C (Consulted): Compliance/Legal/DPO (PII/retention), Finance (FX/GGR), Risk (RG/AML), SRE (SLO/cost).
I (Informed): BI/Product/Marketing/Operations.

15) Implementation Roadmap

MVP (4-6 weeks):

1. Lakehouse Bronze/Silver (ACID format), CDC/increments for 2-3 domains.

2. DQ-as-code: 10-15 rules for Payments/Gameplay + CI validation.

3. First Gold mart (GGR daily) with a 06:00 SLA; regulatory export + hash.

4. Freshness/Completeness/Cost dashboards, basic alerts.

Phase 2 (6-12 weeks):
  • SCD II for users/games/providers; domain expansion.
  • Semantic layer for metrics; reconciliations with OLTP/providers (accuracy).
  • Backfill/reprocessing procedures, lineage and impact analysis, regionalization (EEA/UK).
Phase 3 (12 + weeks):
  • Auto-simulation of changes (dry-run), budgets/quotas, chargeback.
  • Automatic documentation (data product pages), DR exercises and time-travel-recovery.
  • Cost optimization (clustering, materialization, TTL, vacuum).

16) Pre-release checklist

  • Contracts and schemas in Registry, compatibility tests are green.
  • Incremental loads/CDC work, MERGE is idempotent.
  • DQ rules are active; critical → fail + DLQ; report on violations.
  • SLA/freshness/completeness dashboards; alerts are set up.
  • PII/DSAR/RTBF/Legal Hold policies confirmed by Legal/DPO.
  • Backfill/reprocessing/DR runbooks tested.
  • Cost under control (cost/query, cost/GB, quotas).

17) Anti-patterns and how to avoid

Monolithic nightly jobs: split into independent steps, parallelize by partition.
Needless full reloads: use increments/CDC/merge.
PII blended into analytics: keep mappings separate, apply CLS/RLS.
No DQ/lineage: adopt DQ-as-code and trace data provenance.
"Manual" backfills: automate and document them, limit ranges.
Unmanaged cost: clustering, materialization, retention policies.

18) Glossary (brief)

CDC - Capture changes from OLTP.
SCD - slowly changing dimensions (Type I/II/III).
Lakehouse - data lake + ACID tables.
MERGE/Upsert - idempotent update operations.
Time-travel - reading historical versions of tables.
WORM - write-once (immutable) storage of artifacts.

19) The bottom line

Batch processing is a discipline of predictable, reproducible, and complete pipelines. By following schema-first principles, increments/CDC, SCD historization, DQ-as-code, observability, and deliberate cost management, you get stable Gold marts and reports, verified by reconciliations and ready for audit at any time.
