Batch processing
1) Purpose and value
Batch pipelines build reliable daily/hourly data marts for:
- Regulatory and financial reporting (GGR/NGR, taxes, RG/AML registries).
- BI and product analytics (cohorts, LTV, conversion funnels).
- Reconciliation (OLTP↔DWH, providers/PSPs) and historization (SCD).
- Preparing features and training sets for ML.
Key properties: predictability, completeness, reproducibility, low cost per unit of data.
2) Architecture (reference)
1. Ingest (raw capture): HTTP/gRPC, CDC from OLTP, provider uploads → Bronze.
2. Lakehouse: Bronze (raw, append-only) → Silver (clean/conform) → Gold (serve).
3. Orchestration: Airflow/Dagster/Prefect (DAGs, dependencies, retries, SLAs).
4. Processing: Spark/Trino/DBT/SQL engines; partitioning and ACID formats (Delta/Iceberg/Hudi).
5. DQ and Contracts: Schema Registry, DQ rules (YAML/SQL), consumer-tests.
6. Serving: BI/semantic layer, report exports (CSV/PDF/JSON + hash), API/GraphQL.
7. Observability: pipeline metrics, lineage, logs, cost (cost/GB, cost/query).
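At its core, the orchestration layer (point 3) runs steps in dependency order with retries. A minimal framework-free sketch of that idea (the task names and the `run_pipeline` helper are illustrative, not a real Airflow/Dagster API):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, max_retries=2):
    """Run callables in dependency order with simple retries.

    tasks: {name: callable}; deps: {name: set of upstream task names}.
    Returns the execution order actually used.
    """
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted -> fail the run (and alert)
    return order

# Illustrative Bronze -> Silver -> Gold chain
log = []
tasks = {
    "ingest_bronze": lambda: log.append("bronze"),
    "build_silver": lambda: log.append("silver"),
    "publish_gold": lambda: log.append("gold"),
}
deps = {"build_silver": {"ingest_bronze"}, "publish_gold": {"build_silver"}}
order = run_pipeline(tasks, deps)
print(log)  # dependencies always run first
```

Real orchestrators add scheduling, SLA monitoring, and backoff between retries on top of exactly this dependency-ordered execution.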
3) Frequencies and SLAs
Daily (D+1, locked by 06:00): GGR reports, regulatory uploads, reconciliations.
Hourly/near-real-time: operational dashboards for Ops/Finance.
Weekly/monthly: financial consolidation, model retraining, and reprocessing.
- Daily Gold marts are ready by 06:00 local time.
- Silver freshness p95 ≤ 15 min for micro-batches / ≤ 2 h for daily loads.
- Completeness ≥ 99.5%, validity (schema) ≥ 99.9%.
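The D+1 06:00 lock above can be expressed as a simple readiness check (timezone handling simplified; function names are illustrative):

```python
from datetime import date, datetime, time, timedelta

def gold_deadline(event_date: date) -> datetime:
    """Daily Gold marts for day D must be ready by 06:00 local on D+1."""
    return datetime.combine(event_date + timedelta(days=1), time(6, 0))

def met_sla(event_date: date, finished_at: datetime) -> bool:
    """True when the mart build finished inside its SLA window."""
    return finished_at <= gold_deadline(event_date)

# A run for 2024-03-01 finishing at 05:30 on 2024-03-02 is inside the SLA
print(met_sla(date(2024, 3, 1), datetime(2024, 3, 2, 5, 30)))  # True
print(met_sla(date(2024, 3, 1), datetime(2024, 3, 2, 6, 1)))   # False
```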
4) Incremental loads and CDC
Approaches:
- CDC (Change Data Capture): Debezium/log replication → Bronze → increments in Silver.
- Time watermark: 'updated_at > max_loaded_ts'.
- Hash comparison: 'md5(row)' for change detection.
- Upsert/Merge: idempotent Silver/Gold updates.
sql
MERGE INTO silver.payments AS s
USING staging.payments_delta AS d
ON s.transaction_id = d.transaction_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
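The watermark + upsert pattern can also be shown as a toy in-memory sketch (field names and the `load_increment` helper are assumptions for illustration):

```python
def load_increment(source_rows, target, watermark):
    """Watermark-based increment: take rows newer than the last loaded
    timestamp, upsert them by key, and advance the watermark.

    source_rows: iterable of dicts with 'transaction_id' and 'updated_at'.
    target: dict keyed by transaction_id (stands in for silver.payments).
    Returns the new watermark; re-running with it is a no-op (idempotent).
    """
    new_wm = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:         # updated_at > max_loaded_ts
            target[row["transaction_id"]] = row   # MERGE-style upsert
            new_wm = max(new_wm, row["updated_at"])
    return new_wm

rows = [
    {"transaction_id": "t1", "updated_at": 100, "amount": 10.0},
    {"transaction_id": "t2", "updated_at": 120, "amount": 5.0},
]
silver = {}
wm = load_increment(rows, silver, watermark=90)
print(wm, len(silver))  # 120 2
wm = load_increment(rows, silver, watermark=wm)  # re-run: nothing changes
```

Persisting the returned watermark between runs is what makes the job restartable without a full reload.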
5) SCD (dimension historization)
SCD I: overwrite in place (typos, minor corrections).
SCD II: full history ('valid_from'/'valid_to'/'is_current').
SCD III: "before/after" columns for brief comparisons.
sql
MERGE INTO dim.users_scd t
USING stage.users u
ON t.user_pseudo_id = u.user_pseudo_id AND t.is_current = TRUE
WHEN MATCHED AND (t.country <> u.country OR t.rg_status <> u.rg_status)
THEN UPDATE SET t.is_current = FALSE, t.valid_to = CURRENT_TIMESTAMP
WHEN NOT MATCHED
THEN INSERT (user_pseudo_id, country, rg_status, valid_from, valid_to, is_current)
VALUES (u.user_pseudo_id, u.country, u.rg_status, CURRENT_TIMESTAMP, NULL, TRUE);
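Note that a MERGE like this closes the old version of a changed key but still needs a second pass to insert the new version. A small in-memory sketch of the full SCD II upsert, using the same columns (the `scd2_apply` helper is illustrative):

```python
from datetime import datetime

def scd2_apply(dim_rows, incoming, now=None):
    """SCD II upsert: close the current row when tracked attributes change,
    then append a new current version (the 'second pass' a MERGE needs).

    dim_rows: list of dicts with valid_from/valid_to/is_current.
    incoming: dict with user_pseudo_id, country, rg_status.
    """
    now = now or datetime.now()
    current = next(
        (r for r in dim_rows
         if r["user_pseudo_id"] == incoming["user_pseudo_id"] and r["is_current"]),
        None,
    )
    if current:
        if (current["country"], current["rg_status"]) == (
            incoming["country"], incoming["rg_status"]
        ):
            return dim_rows  # no tracked attribute changed
        current["is_current"] = False   # close the old version
        current["valid_to"] = now
    dim_rows.append({**incoming, "valid_from": now, "valid_to": None,
                     "is_current": True})
    return dim_rows

dim = []
scd2_apply(dim, {"user_pseudo_id": "u1", "country": "DE", "rg_status": "ok"})
scd2_apply(dim, {"user_pseudo_id": "u1", "country": "MT", "rg_status": "ok"})
print(len(dim), sum(r["is_current"] for r in dim))  # two versions, one current
```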
6) Backfill and reprocessing
Backfill: initial fill / historical backfill.
Reprocessing: recalculating marts after logic changes or data corrections.
- Idempotency (MERGE/upsert), Bronze immutability, logic versioning.
- Time-travel and metadata snapshots for repeatable runs.
- Guardrails: limited ranges, quotas on concurrent jobs.
- Documentation: a runbook with steps and completion criteria.
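The range-limiting guardrail can be sketched as a chunking driver: split a backfill into bounded date windows that each run as an independent, idempotent job (the `backfill_ranges` helper is illustrative):

```python
from datetime import date, timedelta

def backfill_ranges(start: date, end: date, max_days: int = 7):
    """Split a backfill into bounded date chunks (guardrail: limited ranges).

    Each chunk can be run as its own idempotent MERGE job and retried
    alone, so one failure never forces a full restart.
    """
    if end < start:
        raise ValueError("end before start")
    chunks, cur = [], start
    while cur <= end:
        chunk_end = min(cur + timedelta(days=max_days - 1), end)
        chunks.append((cur, chunk_end))
        cur = chunk_end + timedelta(days=1)
    return chunks

chunks = backfill_ranges(date(2024, 1, 1), date(2024, 1, 20), max_days=7)
print(len(chunks), chunks[0], chunks[-1])  # 3 bounded windows
```

Concurrency quotas then cap how many of these chunks run at once.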
7) Layer modeling
Bronze:
- Append-only; partitioned by 'event_date', 'jurisdiction', 'tenant'.
- Store the original payload (for forensics); record 'ingested_at'.
Silver:
- Normalization and standardization: FKs/reference data, deduplication, FX/timezones.
- Fact and dimension tables (3NF/BCNF), SCD for key dimensions.
Gold:
- Denormalized marts for BI/regulatory/finance with readiness SLAs.
- Materialized aggregates; immutable export artifacts (hash + WORM).
8) Data quality (DQ-as-code)
An example of YAML rules for Silver:
yaml
table: silver.payments
slo:
  freshness_minutes: 15
  completeness_percent: 99.5
rules:
  - name: amount_positive
    type: range
    column: amount_base
    min: 0.01
    severity: critical
  - name: currency_whitelist
    type: in_set
    column: currency
    set: [EUR, USD, GBP, TRY, BRL]
    severity: major
  - name: unique_tx
    type: unique
    columns: [transaction_id]
    severity: critical
  - name: fk_user
    type: foreign_key
    column: user_pseudo_id
    ref_table: dim.users_scd
    severity: critical
Reaction policies: critical → fail job + DLQ; major/minor → tag + report.
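A minimal evaluator for rule types like these could look as follows (a sketch, not a real DQ framework; rule and field names follow the YAML example above):

```python
def evaluate_rules(rows, rules):
    """Tiny DQ-as-code evaluator. Returns (rule_name, severity) pairs;
    per the policy above, a 'critical' hit fails the job and routes
    offending rows to a DLQ."""
    violations = []
    for rule in rules:
        if rule["type"] == "range":
            bad = [r for r in rows if r[rule["column"]] < rule["min"]]
        elif rule["type"] == "in_set":
            bad = [r for r in rows if r[rule["column"]] not in rule["set"]]
        elif rule["type"] == "unique":
            keys = [tuple(r[c] for c in rule["columns"]) for r in rows]
            bad = [k for k in keys if keys.count(k) > 1]
        else:
            continue  # foreign_key etc. need a reference table lookup
        if bad:
            violations.append((rule["name"], rule["severity"]))
    return violations

rows = [
    {"transaction_id": "t1", "amount_base": 12.0, "currency": "EUR"},
    {"transaction_id": "t1", "amount_base": -1.0, "currency": "XXX"},
]
rules = [
    {"name": "amount_positive", "type": "range", "column": "amount_base",
     "min": 0.01, "severity": "critical"},
    {"name": "currency_whitelist", "type": "in_set", "column": "currency",
     "set": {"EUR", "USD", "GBP", "TRY", "BRL"}, "severity": "major"},
    {"name": "unique_tx", "type": "unique", "columns": ["transaction_id"],
     "severity": "critical"},
]
print(evaluate_rules(rows, rules))  # all three rules fire on this sample
```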
9) Semantic layer and reporting
Unified definitions of metrics (GGR/NGR, ARPPU, Retention) in the semantic-layer/metrics-store.
Versioning of metrics; integration with BI and export packages.
Reports: CSV/JSON/PDF + sha256, an export log, and Legal Hold where required.
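Pairing an export with its sha256 makes the artifact verifiable byte-for-byte at audit time. A sketch with stdlib only (the `export_with_hash` helper and sample columns are illustrative):

```python
import csv
import hashlib
import io

def export_with_hash(rows, fieldnames):
    """Produce a CSV artifact plus its sha256 digest, so the report can
    be stored immutably (WORM) and later verified byte-for-byte."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    payload = buf.getvalue().encode("utf-8")
    return payload, hashlib.sha256(payload).hexdigest()

payload, digest = export_with_hash(
    [{"event_date": "2024-03-01", "market": "DE", "ggr_eur": 1234.56}],
    ["event_date", "market", "ggr_eur"],
)
# Verification at audit time: recompute and compare the digest
assert hashlib.sha256(payload).hexdigest() == digest
print(len(digest))  # 64 hex characters
```

The digest goes into the export log next to who downloaded what and when.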
10) Privacy, residency, security
PII minimization: pseudonymization of users; the id mapping lives in a separate protected perimeter.
Data residency: separate directories/keys for EEA/UK/BR; prohibition of cross-regional joins without legal grounds.
Encryption: TLS in-transit; KMS/CMK at-rest; export controls.
DSAR/RTBF: computable projections, selective deletion; access audits.
Legal Hold: WORM archives for regulatory artifacts.
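One common way to pseudonymize (an assumption here, not prescribed by the text above) is a keyed HMAC: analytics sees only the digest, while the key and any id↔pseudo-id mapping stay in the protected perimeter.

```python
import hashlib
import hmac

def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Keyed pseudonymization: deterministic for joins, irreversible
    without the key. Destroying/rotating the key supports RTBF flows."""
    return hmac.new(secret_key, user_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

key = b"kept-in-kms-not-in-code"  # illustrative; fetch from KMS in practice
p1 = pseudonymize("user-42", key)
p2 = pseudonymize("user-42", key)
print(p1 == p2, "user-42" in p1)  # stable across runs, raw id never leaks
```

Unlike a bare hash, the HMAC resists dictionary attacks on small id spaces as long as the key stays secret.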
11) Performance and cost
Partitioning by date/market/tenant; Z-order/cluster by frequent predicates.
Formats: Parquet + ACID tables; compression/statistics, OPTIMIZE/VACUUM.
Materialization: stable aggregations in Gold; avoid "monolithic" jobs.
Quotas/budgets: chargeback per team; limits on backfills and heavy queries.
Scheduling: low load windows (night/weekend), queue priorities.
12) Observability and management
Pipeline metrics: duration, success rate, retries, rows processed, cost/query.
DQ metrics: completeness, validity, uniqueness, FK errors, drift.
Freshness heatmap: by domain and market; SLA dashboards.
Lineage: Bronze origins to reports; impact analysis before changes.
Alerts: SLO budgets, DQ degradation, delays, cost growth.
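The freshness SLO from section 3 can be monitored with a simple nearest-rank p95 over observed ingestion lags (a sketch; the helper names are illustrative):

```python
import math

def freshness_p95(lags_minutes):
    """Nearest-rank p95 of observed ingestion lags, for SLO dashboards."""
    ordered = sorted(lags_minutes)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def breaches_slo(lags_minutes, slo_minutes=15):
    """True when Silver freshness p95 exceeds the 15-minute micro-batch SLO."""
    return freshness_p95(lags_minutes) > slo_minutes

# One slow micro-batch among 100 does not burn the p95 budget...
lags = [3, 4, 5, 6, 40] + [5] * 95
print(freshness_p95(lags), breaches_slo(lags))  # 5 False
# ...but a sustained 20-minute lag does
print(breaches_slo([20] * 10))  # True
```

The same shape works for completeness and cost SLOs: compute the statistic, compare to budget, alert on breach.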
13) SQL/Model Examples
Currency normalization (Silver):
sql
CREATE OR REPLACE TABLE silver.payments AS
SELECT p.transaction_id,
       p.user_pseudo_id,
       p.currency,
       p.amount_orig,
       r.rate AS fx_rate_used,
       p.amount_orig * r.rate AS amount_base,
       p.market,
       CAST(p.event_time AS TIMESTAMP) AS event_time
FROM bronze.payment_events p
JOIN dim.fx_rates r
  ON r.date = DATE(p.event_time)
 AND r.ccy_from = p.currency AND r.ccy_to = 'EUR';
GGR daily mart (Gold):
sql
CREATE OR REPLACE VIEW gold.ggr_daily AS
SELECT
    DATE(b.event_time) AS event_date,
    b.market,
    g.provider_id,
    SUM(b.stake_base) AS stakes_eur,
    SUM(p.amount_base) AS payouts_eur,
    SUM(b.stake_base) - SUM(p.amount_base) AS ggr_eur
FROM silver.fact_bets b
LEFT JOIN silver.fact_payouts p
  ON p.user_pseudo_id = b.user_pseudo_id
 AND p.game_id = b.game_id
 AND DATE(p.event_time) = DATE(b.event_time)
JOIN dim.games g ON g.game_id = b.game_id
GROUP BY 1, 2, 3;
Completeness control (DQ SQL):
sql
-- Grouping only existing rows can never yield a zero count, so missing
-- days must be detected against a calendar spine (dim.calendar_markets
-- is an assumed market x date reference table):
SELECT c.market, c.event_date
FROM dim.calendar_markets c
LEFT JOIN silver.fact_bets b
  ON b.market = c.market
 AND DATE(b.event_time) = c.event_date
GROUP BY c.market, c.event_date
HAVING COUNT(b.market) = 0;
14) Processes and RACI
R (Responsible): Data Engineering (DAGs, Silver/Gold models), Data Platform (infra, schema registry, DQ).
A (Accountable): Head of Data / Chief Data Officer.
C (Consulted): Compliance/Legal/DPO (PII/retention), Finance (FX/GGR), Risk (RG/AML), SRE (SLO/cost).
I (Informed): BI/Product/Marketing/Operations.
15) Implementation Roadmap
MVP (4-6 weeks):
1. Lakehouse Bronze/Silver (ACID format), CDC/increments for 2-3 domains.
2. DQ-as-code: 10-15 rules for Payments/Gameplay + CI validation.
3. First Gold mart (GGR daily) with a 06:00 SLA; report export + hash.
4. Freshness/Completeness/Cost dashboards, basic alerts.
Phase 2 (6-12 weeks):
- SCD II for users/games/providers; domain expansion.
- Semantic layer for metrics; reconciliation with OLTP/providers (accuracy).
- Backfill/reprocessing procedures, lineage and impact analysis, regionalization (EEA/UK).
- Automated change simulation (dry-run), budgets/quotas, chargeback.
- Automatic documentation (data product pages), DR exercises and time-travel recovery.
- Cost optimization (clustering, materialization, TTL, vacuum).
16) Pre-production checklist
- Contracts and schemas in Registry, compatibility tests are green.
- Incremental loads/CDC work; MERGE is idempotent.
- DQ rules are active; critical → fail + DLQ; report on violations.
- SLA/freshness/completeness dashboards; alerts are set up.
- PII/DSAR/RTBF/Legal Hold policies confirmed by Legal/DPO.
- Backfill/reprocessing/DR runbooks tested.
- Cost under control (cost/query, cost/GB, quotas).
17) Anti-patterns and how to avoid them
Monolithic nightly jobs: split into independent steps, parallelize by partition.
Full reloads without need: use increments/CDC/merge.
Mixing PII into analytics: keep mappings separate, apply CLS/RLS.
No DQ/lineage: introduce DQ-as-code and lineage tracking.
"Manual" backfills: automate and document them, limit ranges.
Unmanaged cost: clustering, materialization, retention policies.
18) Glossary (brief)
CDC - change data capture from OLTP.
SCD - slowly changing dimensions (I/II/III).
Lakehouse - data lake + ACID tables.
MERGE/Upsert - idempotent update operations.
Time-travel - reading historical versions of tables.
WORM - write-once, immutable storage of artifacts.
19) The bottom line
Batch processing is a discipline of predictable, reproducible, and verifiable pipelines. By following schema-first principles, increments/CDC, SCD historization, DQ-as-code, observability, and conscious cost management, you get stable Gold marts and reports, verified by hashes and ready for audit at any time.