Stream vs Batch analysis
1) Brief gist
Stream - continuous processing of events in seconds: anti-fraud/AML, RG triggers, SLA alerts, operational panels.
Batch - periodic recalculation with full reproducibility: regulatory reporting (GGR/NGR), financial documents, ML datasets.
Landmarks: Stream p95 e2e 0.5–5 s; Batch ready by D+1 06:00 (lock).
2) Selection matrix (TL;DR)
80/20 rule: anything that does not require a reaction within 5 minutes goes to Batch; the rest goes to Stream, with nightly Batch validation.
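The 80/20 rule can be encoded as a trivial routing helper; a minimal sketch, where the function name and the 5-minute cut-off as a hard constant are assumptions for illustration:

```python
from datetime import timedelta

def route_workload(required_reaction: timedelta) -> str:
    """Apply the 80/20 rule: anything tolerating >= 5 min latency goes to Batch,
    everything faster goes to Stream (with nightly Batch validation)."""
    return "batch" if required_reaction >= timedelta(minutes=5) else "stream"

# e.g. an AML velocity check needs seconds; a GGR daily report tolerates a day
aml = route_workload(timedelta(seconds=2))    # "stream"
ggr = route_workload(timedelta(hours=24))     # "batch"
```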
3) Architectures
3.1 Lambda
Stream for online + Batch for consolidation. Pros: flexibility. Cons: two codebases with logic to keep in sync.
3.2 Kappa
Everything as streams; Batch = "replay" through the log. Pros: a single codebase. Cons: replay complexity and cost.
3.3 Lakehouse hybrid (recommended)
Stream → online OLAP marts (minutes) and Bronze/Silver; Batch rebuilds Gold (D+1) and publishes reports.
4) Data and time
Stream
Windows: tumbling/hopping/session.
Watermarks: 2–5 min; late data is flagged and handled under an allowed-lateness policy.
Stateful: CEP, dedup, TTL.
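The three stream primitives above (tumbling windows, watermarks, stateful dedup) can be illustrated in a toy single-process sketch; the 10-minute window and 3-minute watermark are assumed values within the ranges the text gives, and this only mimics what Flink/Beam do with managed state:

```python
from collections import defaultdict

WINDOW_S = 600      # 10-minute tumbling window (assumed for illustration)
WATERMARK_S = 180   # 3-minute watermark, within the 2-5 min landmark above

def process(events):
    """Tumbling-window count with dedup on (event_id, source) and late flagging.

    events: iterable of dicts with event_time (epoch seconds), event_id, source.
    Returns (counts per window index, list of late events)."""
    seen, counts, late = set(), defaultdict(int), []
    max_ts = 0
    for ev in events:
        key = (ev["event_id"], ev["source"])
        if key in seen:                      # stateful dedup
            continue
        seen.add(key)
        max_ts = max(max_ts, ev["event_time"])
        if ev["event_time"] < max_ts - WATERMARK_S:
            late.append(ev)                  # behind the watermark: flag, do not count
            continue
        counts[ev["event_time"] // WINDOW_S] += 1
    return dict(counts), late
```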
Batch
Increments/CDC: `updated_at`, log-based replication.
SCD I/II/III: attribute history.
Snapshots: daily/monthly layers for "as-of" queries.
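SCD Type II, mentioned above, keeps attribute history by closing the open version of a row and appending a new one. A minimal in-memory sketch, with the row shape and function name assumed for illustration:

```python
def scd2_apply(current_rows, changes, as_of):
    """Minimal SCD Type II: close the open version, append the new one.

    current_rows: list of dicts with key, attrs, valid_from, valid_to (None = open).
    changes: dict key -> new attrs. Returns the updated history."""
    out = []
    for row in current_rows:
        if (row["valid_to"] is None and row["key"] in changes
                and row["attrs"] != changes[row["key"]]):
            out.append({**row, "valid_to": as_of})              # close old version
            out.append({"key": row["key"], "attrs": changes[row["key"]],
                        "valid_from": as_of, "valid_to": None})  # open new version
        else:
            out.append(row)
    return out
```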
5) Application patterns in iGaming
AML/Antifraud: Stream (velocity/structuring) + Batch reconciliations and cases.
Responsible Gaming: Stream control of limits/self-exclusions; Batch reporting registers.
Operations/SRE: Stream alerts SLA; Batch post-analysis of incidents and trends.
Product/Marketing: Stream Personalization/Missions; Batch cohorts/LTV.
Finance/reports: Batch (Gold D+1, WORM packages); Stream for operational panels.
6) DQ, reproducibility, replay
Stream DQ: schema validation, dedup on `(event_id, source)`, window completeness, late-ratio, dup-rate; critical failures → DLQ.
Batch DQ: uniqueness/FK/range/temporal checks, reconciliations with OLTP/providers; critical failures → fail the job + report.
- Stream: replay topics by offset/time range + deterministic transformations.
- Batch: time-travel / versioned logic (`logic_version`) + Gold snapshots.
7) Privacy and residency
Stream: pseudonymization, online masking, regional pipelines (EEA/UK/BR), timeouts to external PII-lookups.
Batch: PII mapping isolation, RLS/CLS, DSAR/RTBF, Legal Hold, WORM archives.
8) Cost-engineering
Stream: avoid hot keys (salting), limit async lookups, TTL on state, pre-aggregation.
Batch: partitioning/clustering, small-file compaction, materialization of stable aggregates, quotas/launch windows.
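Key salting, recommended above for hot keys, spreads one heavy key across several partitions and merges results in a second cheap reduce. A sketch, where the salt count and the idea of deriving the salt from `event_id` (to stay deterministic under replays) are assumptions:

```python
import hashlib

N_SALTS = 8  # assumed fan-out factor; tune to partition count

def salted_key(user_id: str, event_id: str) -> str:
    """Spread a hot user key across N_SALTS sub-keys deterministically.

    The salt is derived from event_id, so replays route the same event to the
    same sub-key; partial aggregates are merged in a second reduce step keyed
    by the bare user_id."""
    salt = int(hashlib.sha256(event_id.encode()).hexdigest(), 16) % N_SALTS
    return f"{user_id}#{salt}"
```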
9) Examples
9.1 Stream — Flink SQL (10-min deposit velocity)
```sql
SELECT user_id,
       TUMBLE_START(event_time, INTERVAL '10' MINUTE) AS win_start,
       COUNT(*) AS deposits_10m,
       SUM(amount_base) AS sum_10m
FROM stream.payments
GROUP BY user_id, TUMBLE(event_time, INTERVAL '10' MINUTE);
```
9.2 Stream — CEP (AML pseudocode)
```python
# pseudocode: flag structuring — many sub-limit deposits in a short window
if count_deposits(10MIN) >= 3 and sum_deposits(10MIN) > THRESH \
        and all(d.amount < REPORTING_LIMIT for d in window):
    emit_alert("AML_STRUCTURING", user_id, snapshot())
```
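The pseudocode above can be made runnable as a small sliding-window detector; the threshold values, class name, and deque-based state with TTL eviction are assumptions for illustration, not the production rule:

```python
from collections import deque

THRESH = 2000.0           # assumed aggregate threshold (base currency)
REPORTING_LIMIT = 1000.0  # assumed per-transaction reporting limit
WINDOW_S = 600            # 10-minute sliding window

class StructuringDetector:
    """Sliding-window version of the CEP rule: >= 3 sub-limit deposits whose
    sum crosses THRESH within 10 minutes."""

    def __init__(self):
        self.window = deque()  # (event_time, amount), oldest first

    def on_deposit(self, event_time: float, amount: float) -> bool:
        self.window.append((event_time, amount))
        while self.window and self.window[0][0] < event_time - WINDOW_S:
            self.window.popleft()            # evict expired state (TTL)
        amounts = [a for _, a in self.window]
        return (len(amounts) >= 3 and sum(amounts) > THRESH
                and all(a < REPORTING_LIMIT for a in amounts))
```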
9.3 Batch — MERGE (Silver increment)
```sql
MERGE INTO silver.payments s
USING stage.delta_payments d
ON s.transaction_id = d.transaction_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
9.4 Batch — Gold GGR (D+1)
```sql
CREATE OR REPLACE VIEW gold.ggr_daily AS
SELECT
  DATE(b.event_time)                      AS event_date,
  b.market,
  g.provider_id,
  SUM(b.stake_base)                       AS stakes_eur,
  SUM(p.amount_base)                      AS payouts_eur,
  SUM(b.stake_base) - SUM(p.amount_base)  AS ggr_eur
FROM silver.fact_bets b
LEFT JOIN silver.fact_payouts p
  ON p.user_pseudo_id = b.user_pseudo_id
 AND p.game_id = b.game_id
 AND DATE(p.event_time) = DATE(b.event_time)
JOIN dim.games g ON g.game_id = b.game_id
GROUP BY 1, 2, 3;
```
10) Metrics and SLO
Stream (landmarks)
p95 ingest→alert ≤ 2–5 s
window completeness ≥ 99.5%
schema-errors ≤ 0.1%
late-ratio ≤ 1%
availability ≥ 99.9%
Batch (landmarks)
Gold daily marts ready by 06:00 (lock).
completeness ≥ 99.5%
validity ≥ 99.9%
MTTR for a DQ incident ≤ 24–48 hours
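The batch landmarks above lend themselves to an automated check; a sketch, where the function name, the `HH:MM` string input, and returning a list of violation labels are assumptions:

```python
def check_batch_slo(ready_at_hhmm: str, completeness: float,
                    validity: float) -> list:
    """Return the violated batch SLOs, using the landmark thresholds above.

    ready_at_hhmm: zero-padded 'HH:MM', so lexicographic comparison is safe."""
    violations = []
    if ready_at_hhmm > "06:00":
        violations.append("gold_daily_late")
    if completeness < 0.995:
        violations.append("completeness")
    if validity < 0.999:
        violations.append("validity")
    return violations
```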
11) Testing and releases
Contracts/schemes: consumer-driven tests; back-compat CI.
Stream: canary rules, dark launch, replay simulator.
Batch: dry-run on samples, comparison of metrics, reconciliation.
12) Anti-patterns
Duplicate logic: different Stream and Batch calculations without formula alignment.
Synchronous external APIs in the Stream hot path without cache/timeouts.
Full reload "just in case" instead of increments.
No watermarks/late policies.
PII in analytical layers; no CLS/RLS.
Gold marts that "mutate" retroactively.
13) Recommended hybrid (playbook)
1. Stream loop: ingest → bus → Flink/Beam (watermarks, dedup, CEP) →
OLAP (ClickHouse/Pinot) for 1–5 min panels + Bronze/Silver (append-only).
2. Batch loop: increments/CDC → Silver normalization/SCD → Gold daily marts/reports (WORM).
3. Reconciliation: a single semantic metrics layer; nightly Stream↔Batch reconciliation; discrepancies > threshold → tickets.
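The nightly Stream↔Batch reconciliation in step 3 reduces to a per-metric relative-difference check; a sketch, with the function name, dict-of-metrics interface, and the 0.5% default threshold assumed for illustration:

```python
def reconcile(stream_metrics: dict, batch_metrics: dict,
              rel_threshold: float = 0.005) -> list:
    """Compare stream-mart values against the batch Gold recomputation.

    Returns (metric, reason) tickets for metrics missing from the stream side
    or diverging by more than rel_threshold relative to the batch value."""
    tickets = []
    for name, batch_val in batch_metrics.items():
        stream_val = stream_metrics.get(name)
        if stream_val is None:
            tickets.append((name, "missing_in_stream"))
            continue
        rel = abs(stream_val - batch_val) / max(abs(batch_val), 1e-9)
        if rel > rel_threshold:
            tickets.append((name, f"rel_diff={rel:.4f}"))
    return tickets
```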
14) RACI
R (Responsible): Streaming Platform (stream pipelines), Data Engineering (Batch models), Domain Analytics (metrics/rules), MLOps (features/Feature Store).
A (Accountable): Head of Data / CDO.
C (Consulted): Compliance/Legal/DPO, Finance (FX/GGR), Risk (RG/AML), SRE (SLO/cost).
I (Informed): BI/Product/Marketing/Operations.
15) Roadmap
MVP (2–4 weeks):
1. Kafka/Redpanda + 2 critical topics (`payments`, `auth`).
2. Flink job: watermark + dedup + 1 CEP rule (AML or RG).
3. OLAP mart at 1–5 min latency + lag/late/dup dashboards.
4. Lakehouse Silver (ACID), the first Gold mart `gold.ggr_daily` (D+1 by 06:00).
Phase 2 (4–8 weeks):
- Increments/CDC per domain, SCD II, semantic metrics layer.
- Streaming DQ and nightly Stream↔Batch reconciliation.
- Regionalisation (EEA/UK/BR), DSAR/RTBF, Legal Hold.
- Replay simulator, canary/A-B releases of rules/metrics.
- Cost dashboards and quotas; tiered storage; DR drills.
- Auto-generated mart/metric documentation and lineage.
16) Implementation checklist
- Schemes/contracts in Registry; back-compat tests are green.
- Stream: watermarks/allowed-lateness, dedup, DLQ; OLAP panels in prod.
- Batch: increments/CDC, SCD II, Gold D+1 with WORM exports.
- Single semantic layer of metrics; nightly Stream↔Batch reconciliation.
- Freshness/Completeness/Validity DQ dashboards; lag/late/dup alerts.
- RBAC/ABAC, encryption, residency; DSAR/RTBF/Legal Hold.
- Cost under control (cost/GB, cost/query, state size; replays are quota-limited).
17) The bottom line
Stream and Batch are not competitors but two gears of the same drive: Stream delivers the "here and now" reaction, Batch the verifiable "morning-after" truth. A hybrid Lakehouse approach, a single metrics layer, and DQ/lineage discipline let you build fast, reproducible, compliant analytical loops that are optimal in both SLA and cost.