GambleHub

Stream vs Batch analysis

1) Brief gist

Stream - continuous processing of events in seconds: anti-fraud/AML, RG triggers, SLA alerts, operational panels.
Batch - periodic recalculation with full reproducibility: regulatory reporting (GGR/NGR), financial documents, ML datasets.

Landmarks: Stream p95 e2e 0.5-5 s; Batch ready by D+1 06:00 (lock).

2) Selection matrix (TL;DR)

Criterion | Stream | Batch
SLA reaction | seconds/minutes | hours/days
Completeness | high, but late fixes possible | very high, controlled at D+1
Reproducibility ("as-of") | harder (replay) | easier (time-travel/snapshots)
Cost per unit | more expensive per event online | cheaper at volume
Typical tasks | AML/RG alerts, SRE, real-time marts | reports, reconciliations, offline ML
Historization (SCD) | limited | full
Regulatory/WORM | via Gold review | native (Gold/D+1)

80/20 rule: anything that does not require a reaction within 5 minutes goes to Batch; everything else goes to Stream, backed by nightly Batch validation.

3) Architectures

3.1 Lambda

Stream for online + Batch for consolidation. Plus: flexibility. Minus: two codebases with duplicated logic.

3.2 Kappa

Everything is a stream; Batch = replay via the log. Plus: a single codebase. Minus: replay complexity and cost.

3.3 Lakehouse-Hybrid (recommended)

Stream → online OLAP marts (minute-level freshness) and Bronze/Silver; Batch reassembles Gold (D+1) and publishes reports.

4) Data and time

Stream

Windows: tumbling/hopping/session.
Watermarks: 2-5 min; late data is flagged and handled via a late-data path.
Stateful processing: CEP, dedup, state TTL (see the sketch after this list).
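
A minimal, engine-agnostic sketch of the three mechanics above: tumbling-window assignment, a watermark that flags late events, and dedup state with a TTL. It is plain Python rather than a Flink/Beam job, and names such as Event, WINDOW_SIZE and WATERMARK_DELAY are illustrative assumptions.

python
# Illustrative only: tumbling windows, watermark-based lateness, dedup with TTL.
from dataclasses import dataclass

WINDOW_SIZE = 600        # 10-minute tumbling window, in seconds
WATERMARK_DELAY = 180    # 3-minute allowed lateness (landmark above: 2-5 min)
DEDUP_TTL = 3600         # keep (event_id, source) keys for 1 hour

@dataclass
class Event:
    event_id: str
    source: str
    user_id: str
    event_time: float    # epoch seconds
    amount: float

seen: dict = {}          # dedup state: (event_id, source) -> event_time of first sighting
max_event_time = 0.0     # drives the watermark

def process(e: Event):
    global max_event_time
    max_event_time = max(max_event_time, e.event_time)
    watermark = max_event_time - WATERMARK_DELAY

    # dedup on (event_id, source); evict state older than the TTL
    cutoff = max_event_time - DEDUP_TTL
    for k in [k for k, t in seen.items() if t < cutoff]:
        del seen[k]
    key = (e.event_id, e.source)
    if key in seen:
        return ("duplicate", None)
    seen[key] = e.event_time

    # assign to a tumbling window; events behind the watermark go to the late-data path
    window_start = (e.event_time // WINDOW_SIZE) * WINDOW_SIZE
    if e.event_time < watermark:
        return ("late", window_start)
    return ("on_time", window_start)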

Batch

Increments/CDC: 'updated_at' cursor, log-based replication (see the sketch after this list).
SCD I/II/III: attribute history.
Snapshots: daily/monthly layers for "as-of" queries.
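
A minimal sketch of the 'updated_at' increment pattern: select only rows changed since the last high-water mark, upsert them, and advance the mark after a successful commit. It assumes a generic DB-API connection and SQLite-style placeholders/upsert; table and column names are illustrative. In the Lakehouse itself this upsert step is the MERGE shown in section 9.3.

python
# Illustrative incremental load driven by an 'updated_at' high-water mark.
from datetime import datetime

def load_increment(conn, last_mark: datetime) -> datetime:
    cur = conn.cursor()
    # 1) pull only rows changed since the previous successful run
    cur.execute(
        "SELECT transaction_id, user_id, amount_base, status, updated_at "
        "FROM payments WHERE updated_at > ?",
        (last_mark,),
    )
    rows = cur.fetchall()

    # 2) upsert into Silver (SQLite-style syntax, for illustration only)
    cur.executemany(
        "INSERT OR REPLACE INTO silver_payments "
        "(transaction_id, user_id, amount_base, status, updated_at) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

    # 3) advance the high-water mark only after the commit succeeds
    return max((row[-1] for row in rows), default=last_mark)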

5) Application patterns in iGaming

AML/Antifraud: Stream (velocity/structuring) + Batch reconciliations and cases.
Responsible Gaming: Stream control of limits/self-exclusions; Batch reporting registers.
Operations/SRE: Stream SLA alerts; Batch post-incident analysis and trends.
Product/Marketing: Stream personalization/missions; Batch cohorts/LTV.
Finance/reports: Batch (Gold D+1, WORM packages); Stream for operational panels.

6) DQ, reproducibility, replay

Stream DQ: schema validation, dedup on '(event_id, source)', window completeness, late-ratio, dup-rate; critical failures → DLQ (see the sketch after this list).
Batch DQ: uniqueness/FK/range/temporal checks, reconciliations with OLTP/providers; critical failures → fail the job + report.
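
A short sketch of how per-window stream DQ counters might roll up into late-ratio and dup-rate and trigger an alert. The 1% late-ratio follows the section 10 landmark; the dup-rate threshold is an assumption.

python
# Illustrative window-level DQ rollup for late-ratio and dup-rate.
LATE_RATIO_MAX = 0.01    # late-ratio <= 1% (section 10 landmark)
DUP_RATE_MAX = 0.001     # dup-rate <= 0.1% (assumed threshold)

def window_dq(total_events: int, late_events: int, duplicates: int) -> dict:
    late_ratio = late_events / total_events if total_events else 0.0
    dup_rate = duplicates / total_events if total_events else 0.0
    return {
        "late_ratio": late_ratio,
        "dup_rate": dup_rate,
        "alert": late_ratio > LATE_RATIO_MAX or dup_rate > DUP_RATE_MAX,
    }

print(window_dq(10_000, 85, 4))   # 0.85% late, 0.04% dup -> no alert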

Reproducibility:
  • Stream: replay of topics by range + deterministic transformations.
  • Batch: time-travel/logic versions ('logic_version') + Gold snapshots (see the sketch below).
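
A toy sketch of the Batch side of reproducibility: recompute a fixed event-time range deterministically and tag each output row with the 'logic_version' that produced it. The aggregation and the version label are illustrative.

python
# Illustrative deterministic recomputation tagged with logic_version.
LOGIC_VERSION = "deposits_sum_v3"   # example version label

def recompute(events: list, range_start: float, range_end: float) -> list:
    in_range = [e for e in events if range_start <= e["event_time"] < range_end]
    # deterministic ordering so re-runs over the same inputs are byte-identical
    in_range.sort(key=lambda e: (e["event_time"], e["event_id"]))
    totals: dict = {}
    for e in in_range:
        totals[e["user_id"]] = totals.get(e["user_id"], 0.0) + e["amount_base"]
    return [
        {"user_id": u, "sum_base": s, "logic_version": LOGIC_VERSION}
        for u, s in sorted(totals.items())
    ]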

7) Privacy and residency

Stream: pseudonymization (see the sketch below), online masking, regional pipelines (EEA/UK/BR), timeouts on external PII lookups.
Batch: isolated PII mapping, RLS/CLS, DSAR/RTBF, Legal Hold, WORM archives.
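
A sketch of keyed pseudonymization on the Stream side: a per-region HMAC of the raw player id, so analytical layers only ever see a stable pseudo-id. Key handling (a KMS in practice) and region routing are simplified assumptions.

python
# Illustrative keyed pseudonymization; keys would come from a KMS, not literals.
import hashlib
import hmac

REGION_KEYS = {"EEA": b"eea-secret", "UK": b"uk-secret", "BR": b"br-secret"}

def pseudonymize(user_id: str, region: str) -> str:
    key = REGION_KEYS[region]   # per-region keys keep pipelines separable by residency
    digest = hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{region}:{digest[:32]}"   # stable, not reversible without the key

print(pseudonymize("player-12345", "EEA"))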

8) Cost-engineering

Stream: avoid "hot" keys via salting (see the sketch below), limit async lookups, TTL on state, pre-aggregation.
Batch: partitioning/clustering, small-file compaction, materialization of stable aggregates, quotas/launch windows.
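
A sketch of the hot-key salting mentioned above: a skewed user_id is spread over N sub-keys so no single worker holds all of its state, and the partial aggregates are merged back afterwards. N_SALTS and the hash choice are tuning assumptions.

python
# Illustrative hot-key salting and merge of partial aggregates.
import hashlib

N_SALTS = 8   # sub-keys per hot key (tuning assumption)

def salted_key(user_id: str, event_id: str) -> str:
    bucket = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % N_SALTS
    return f"{user_id}#{bucket}"   # aggregate per sub-key upstream...

def merge_partials(partials: dict) -> dict:
    totals: dict = {}
    for key, value in partials.items():
        user_id = key.rsplit("#", 1)[0]   # ...then merge back to the real key downstream
        totals[user_id] = totals.get(user_id, 0.0) + value
    return totals

print(merge_partials({"u1#0": 10.0, "u1#3": 5.0, "u2#1": 2.0}))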

9) Examples

9.1 Stream - Flink SQL (10-min deposit velocity)

sql
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
       COUNT(*) AS deposits_10m,
       SUM(amount_base) AS sum_10m
FROM stream.payments
GROUP BY user_id, TUMBLE(event_time, INTERVAL '10' MINUTE);

9.2 Stream - CEP (AML structuring, pseudocode)

python
# Pseudocode: several sub-limit deposits in a short window whose total exceeds the alert threshold.
if count_deposits(window="10min") >= 3 \
        and sum_deposits(window="10min") > THRESH \
        and all(d.amount < REPORTING_LIMIT for d in window):
    emit_alert("AML_STRUCTURING", user_id, snapshot())

9.3 Batch - MERGE (Silver increment)

sql
MERGE INTO silver.payments s
USING stage.delta_payments d
ON s.transaction_id = d.transaction_id
WHEN MATCHED THEN UPDATE SET *      -- Delta Lake-style shorthand for "update all columns"
WHEN NOT MATCHED THEN INSERT *;

9.4 Batch - Gold GGR (D+1)

sql
CREATE OR REPLACE VIEW gold.ggr_daily AS
SELECT
    DATE(b.event_time)                     AS event_date,
    b.market,
    g.provider_id,
    SUM(b.stake_base)                      AS stakes_eur,
    SUM(p.amount_base)                     AS payouts_eur,
    SUM(b.stake_base) - SUM(p.amount_base) AS ggr_eur
FROM silver.fact_bets b
LEFT JOIN silver.fact_payouts p
       ON p.user_pseudo_id = b.user_pseudo_id
      AND p.game_id = b.game_id
      AND DATE(p.event_time) = DATE(b.event_time)
JOIN dim.games g ON g.game_id = b.game_id
GROUP BY 1, 2, 3;

10) Metrics and SLO

Stream (landmarks)

p95 ingest→alert ≤ 2-5 s
window completeness ≥ 99.5%
schema-errors ≤ 0.1%
late-ratio ≤ 1%
availability ≥ 99.9%

Batch (landmarks)

Gold daily marts ready by the 06:00 lock
completeness ≥ 99.5%
validity ≥ 99.9%
MTTR for DQ incidents ≤ 24-48 hours

11) Testing and releases

Contracts/schemas: consumer-driven tests; backward-compatibility checks in CI.
Stream: canary rules, dark launch, replay simulator.
Batch: dry-run on samples, comparison of metrics, reconciliation.

12) Anti-patterns

Duplicated logic: Stream and Batch computing the same metric with unaligned formulas.
Synchronous external APIs in the Stream hot path without cache/timeouts.
Full reload "just in case" instead of increments.
No watermarks/late policies.
PII in analytical layers; no CLS/RLS.
Gold marts that "mutate" retroactively.

13) Recommended hybrid (playbook)

1. Stream loop: ingest → bus → Flink/Beam (watermarks, dedup, CEP) → OLAP (ClickHouse/Pinot) for 1-5 min panels + Bronze/Silver (append).
2. Batch loop: increments/CDC → Silver normalization/SCD → Gold daily marts/reports (WORM).
3. Alignment: a single semantic layer of metrics; nightly Stream↔Batch reconciliation; discrepancies > threshold → tickets (see the sketch below).
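
A sketch of the nightly reconciliation in step 3: compare the same daily metric from the Stream and Batch paths, treating Gold (D+1) as the reference, and raise a ticket when the relative discrepancy exceeds a threshold. The 0.5% threshold and the ticket format are assumptions.

python
# Illustrative Stream vs Batch reconciliation of a daily metric.
THRESHOLD = 0.005   # 0.5% relative discrepancy (assumed)

def reconcile(stream_vals: dict, batch_vals: dict) -> list:
    tickets = []
    for day, batch_v in batch_vals.items():   # Batch (Gold D+1) is the reference
        stream_v = stream_vals.get(day, 0.0)
        diff = abs(stream_v - batch_v) / batch_v if batch_v else abs(stream_v)
        if diff > THRESHOLD:
            tickets.append(f"{day}: stream={stream_v:.2f} batch={batch_v:.2f} diff={diff:.2%}")
    return tickets

print(reconcile({"2024-05-01": 1010.0}, {"2024-05-01": 1000.0}))   # 1.00% -> one ticket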

14) RACI

R (Responsible): Streaming Platform (Stream pipelines), Data Engineering (Batch models), Domain Analytics (metrics/rules), MLOps (features/Feature Store).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/Legal/DPO, Finance (FX/GGR), Risk (RG/AML), SRE (SLO/cost).
I (Informed): BI/Product/Marketing/Operations.

15) Roadmap

MVP (2-4 weeks):

1. Kafka/Redpanda + 2 critical topics ('payments', 'auth').
2. Flink job: watermarks + dedup + 1 CEP rule (AML or RG).
3. OLAP mart with 1-5 min freshness + lag/late/dup dashboards.
4. Lakehouse Silver (ACID) and the first Gold mart, gold.ggr_daily (D+1 by 06:00).

Phase 2 (4-8 weeks):
  • Increments/CDC by domain, SCD II, semantic metrics layer.
  • Streaming DQ and nightly Stream↔Batch reconciliation.
  • Regionalisation (EEA/UK/BR), DSAR/RTBF, Legal Hold.
Phase 3 (8-12 weeks):
  • Replay simulator, canary/A-B releases of rules/metrics.
  • Cost dashboards and quotas; tiered storage; DR drills.
  • Auto-generated mart/metric documentation and lineage.

16) Implementation checklist

  • Schemas/contracts in the schema registry; backward-compatibility tests green.
  • Stream: watermarks/allowed lateness, dedup, DLQ; OLAP panels in prod.
  • Batch: increments/CDC, SCD II, Gold D+1 with WORM exports.
  • Single semantic layer of metrics; nightly Stream↔Batch reconciliation.
  • Freshness/completeness/validity DQ boards; lag/late/dup alerts.
  • RBAC/ABAC, encryption, residency; DSAR/RTBF/Legal Hold.
  • Cost under control (cost/GB, cost/query, state size; replays under quotas).

17) The bottom line

Stream and Batch are not competitors but two gears of the same drive: Stream gives the "here and now" reaction, while Batch delivers verifiable truth by the morning. A hybrid Lakehouse approach, a single metrics layer and DQ/lineage discipline let you build fast, reproducible and compliant analytical loops that are optimal in both SLA and cost.
