GH GambleHub

Batch vs Stream: When to Use Which

Why choose at all

Any data system trades off latency, cost, operational complexity, and reliability.
Batch - periodic "chunks" of data with high throughput and low cost per record.
Stream - continuous processing of events with minimal delay and state kept in memory or local stores.


Briefly about models

Batch

Source: files/tables/snapshots.
Trigger: schedule (hourly/daily) or condition (e.g., a new Parquet file).
Strengths: simplicity, determinism, full data context, cheap large recalculations.
Weaknesses: nothing happens online, high latency, windows of staleness with no real-time signals.

Stream

Source: brokers (Kafka/NATS/Pulsar), CDC, queues.

Trigger: event.
Strengths: low latency, reactivity, natural integration with the product.
Weaknesses: time semantics (event time vs processing time), ordering and duplicates, state management, operations.


Decision: a selection matrix

Criterion                         | Batch                  | Stream
Required freshness                | Minutes/hours and up   | Seconds/sub-second
Recalculation volume              | Large, historical      | Incremental
Cost                              | Lower at high volumes  | Higher for constant readiness
Complexity                        | Lower                  | Higher (state, windows, watermarks)
Retroactive corrections           | Natural                | Require retract/upsert
Input format stability            | High                   | "Dirty" events possible
"Exactly one effect" criticality  | Easy via transactions  | Requires idempotency/EOS
Product UX (real time)            | Unsuitable             | Natural

80/20 rule: if the SLA tolerates minute/hour delays and there are no reactive features, take batch. If reacting "here and now" is critical or you need live marts, take stream (often plus an additional nightly batch for reconciliation).


Typical scenarios

Batch - when it is better:
  • Daily reporting, period-based billing, ML training, large joins, deduplication over the full dataset.
  • Medallion model (bronze/silver/gold) with deep validations.
  • Mass backtests and mart rebuilds.
Stream - when it is better:
  • Anti-fraud/monitoring, SRE alerts, real-time balances/missions, "right now" recommendations.
  • Event-as-fact (CDC) integrations, materialized-view updates (CQRS).
  • Microservices: notifications, webhooks, reactions to business events.
Hybrid - most often:
  • The stream feeds operational marts and signals; the nightly batch handles reconciliation, the warehouse, and cheap historical recalculations.

Architecture

Lambda (Stream + Batch)

Stream for increment and online; Batch for completeness and corrections.
Pros: flexibility and SLA coverage. Cons: duplicated logic across two codebases.

Kappa (everything is Stream + Replay)

A single log as a source of truth; batch-recalculations = replay.
Pros: one codebase, single semantics. Cons: harder to operate, log-retention requirements.

Hybrid-Pragmatic

A streaming operational layer plus periodic batch jobs for heavy joins, ML, and corrections.
In practice, it is the most common option.


Time, order, windows (for Stream)

Rely on event time, not processing time.
Manage watermarks and allowed_lateness; support retractions/upserts for late events.

Partition by business keys; plan for hot keys.
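The mechanics above can be sketched in a few lines of Python. This is a toy event-time aggregator; `WINDOW`, `ALLOWED_LATENESS`, and the DLQ comment are illustrative assumptions, not any specific engine's API:

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (illustrative)
ALLOWED_LATENESS = 30  # accept events at most this far behind the watermark

class EventTimeWindows:
    """Counts events per tumbling window keyed by *event* time."""

    def __init__(self):
        self.windows = defaultdict(int)  # window_start -> event count
        self.watermark = 0               # highest event time seen so far
        self.dropped = 0                 # events too late even for lateness

    def on_event(self, event_time: int) -> None:
        # The watermark advances on event time, never on arrival time.
        self.watermark = max(self.watermark, event_time)
        if event_time < self.watermark - ALLOWED_LATENESS:
            self.dropped += 1  # in production: route to DLQ or emit a retraction
            return
        window_start = event_time - event_time % WINDOW
        self.windows[window_start] += 1
```

Feeding out-of-order timestamps [5, 70, 50, 200, 10] accepts the event at t=50 (still within lateness while the watermark is 70) but drops t=10 once the watermark has jumped to 200.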


Reliability and semantics of effects

Batch

Database transactions or atomic replacement of batches/tables.
Idempotency via deterministic computation and overwrite/insert-overwrite.
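The overwrite approach can be sketched as follows (a hypothetical `run_batch`, with a plain dict standing in for a partitioned table):

```python
def run_batch(table: dict, partition_key: str, source_rows: list) -> None:
    """Idempotent batch job sketch: deterministically recompute one partition
    from raw rows, then atomically replace it (insert-overwrite semantics),
    so reruns and retries are safe."""
    totals: dict = {}
    # Deterministic aggregation: sum amounts per id over the full input.
    for row in sorted(source_rows, key=lambda r: r["id"]):
        totals[row["id"]] = totals.get(row["id"], 0) + row["amount"]
    table[partition_key] = totals  # whole-partition swap, not row-by-row append
```

Because the job rewrites the whole partition rather than appending, running it twice produces the same state as running it once.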

Stream

At-least-once delivery + idempotent sinks (upsert/merge, versioned aggregates).
Transactional "read, process, commit offset" for exactly-once effects (EOS).
Deduplication tables keyed by event_id/operation_id.
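A minimal sketch of such a sink (a hypothetical `IdempotentSink`; a real deployment would keep the dedup set in a keyed store with a TTL rather than in memory):

```python
class IdempotentSink:
    """At-least-once consumer made effectively-once via event_id dedup."""

    def __init__(self):
        self.seen_ids = set()  # dedup table; in production a keyed store with TTL
        self.balances = {}     # the projection this sink maintains

    def apply(self, event: dict) -> bool:
        """Apply an event; return False if it was a duplicate delivery."""
        if event["event_id"] in self.seen_ids:
            return False  # redelivery after a restart: skip the side effect
        self.seen_ids.add(event["event_id"])
        acct = event["account"]
        self.balances[acct] = self.balances.get(acct, 0) + event["delta"]
        return True
```

Redelivering the same event (as at-least-once brokers do after restarts) leaves the balance unchanged.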


Storage and formats

Batch

Data Lake (Parquet/Delta/Iceberg), OLAP (ClickHouse/BigQuery), object storage.
ACID table formats for atomic replacement and time travel.

Stream

Logs/topics in brokers, state stores (RocksDB/embedded), KV/Redis, OLTP for projections.
Schema registry (Avro/JSON/Proto), compatibility modes.


Cost and SLO

Batch: you pay per batch, which is economical at large volumes, but latency is at least the schedule interval.
Stream: constant runtime resources and peak cost at high QPS, but SLA in seconds.
Track p95/p99 latency, end-to-end lag, cost per event, and support TCO.


Testing

Common: golden sets, property-based invariants, dirty-input generation.
Batch: determinism, idempotent restarts, before/after comparison of marts.
Stream: out-of-order/duplicate inputs, fault injection between the effect and the offset commit, replay tests.
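One such out-of-order/duplicate test can be written as a property check (the `project` fold is a hypothetical example; the invariant holds because the fold is commutative and deduplicated by event_id):

```python
import random

def project(events):
    """Fold events into an account-balance projection with event_id dedup."""
    seen, balances = set(), {}
    for e in events:
        if e["event_id"] in seen:
            continue
        seen.add(e["event_id"])
        balances[e["acct"]] = balances.get(e["acct"], 0) + e["delta"]
    return balances

def check_out_of_order_and_duplicates():
    clean = [{"event_id": i, "acct": "a", "delta": 1} for i in range(100)]
    dirty = clean + random.sample(clean, 20)  # inject duplicate deliveries
    random.shuffle(dirty)                     # inject reordering
    # Invariant: a dirty delivery order yields the same projection.
    assert project(dirty) == project(clean)
```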


Observability

Batch: job duration, failure/retry rate, mart freshness, scan cost.
Stream: time and message lag, watermarks, late-event rate, state size and checkpoint frequency, DLQ rate.
Everywhere: trace_id, event_id, schema and pipeline versions.


Security and data

PII/PCI: minimize, encrypt at rest and in flight, mark fields in schemas (x-pii).
For Stream: protect state and checkpoints, ACLs on topics.
GDPR/right to be forgotten: in Stream, crypto-erasure and redaction in projections; in Batch, recomputation of partitions.
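Crypto-erasure can be illustrated with a toy sketch: each user's data is encrypted with a per-user key kept outside the immutable log, and "forgetting" the user means deleting the key. The XOR keystream below is purely illustrative; production code should use a real AEAD cipher such as AES-GCM:

```python
import os
from hashlib import sha256

# Per-user keys live in a small, deletable key store, separate from the
# (immutable) event log.
keys: dict = {}  # user_id -> 32-byte key

def _keystream(key: bytes, n: int) -> bytes:
    # Derive a deterministic keystream of n bytes from the key (toy construction).
    out, counter = b"", 0
    while len(out) < n:
        out += sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def crypt_for_user(user_id: str, data: bytes) -> bytes:
    """XOR-encrypt (or, applied twice, decrypt) data under the user's key."""
    key = keys.setdefault(user_id, os.urandom(32))
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

def forget_user(user_id: str) -> None:
    keys.pop(user_id, None)  # key gone -> stored ciphertext is unreadable
```

The stored events never need rewriting: once `forget_user` removes the key, the ciphertext in the log is irrecoverable.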


Transition strategies

Batch → Stream: start by publishing events (Outbox/CDC) and stand up a small real-time mart without touching the existing warehouse.
Stream → Batch: add daily warehouse loads for reporting/reconciliation and to reduce load on streaming sinks.


Anti-patterns

"Everything in Stream" for fashion's sake: expensive and complex without a real need.
One giant nightly batch against a sub-5-minute freshness requirement.
Using processing time for business metrics.
Raw CDC as public events: tight coupling, painful schema evolution.
No idempotency in sinks → duplicate effects on restarts.


Selection checklist

  • Freshness SLO: how many seconds/minutes/hours are acceptable?
  • Input stability: are there out-of-order events/duplicates?
  • Are online reactions/marts needed?
  • Cost: 24/7 runtime vs a scheduled window.
  • Correction method: retract/upsert or nightly recalculation?
  • Team and operational maturity (observability, on-call).
  • Requirements for "exactly one effect."
  • PII policies/retention/right to be forgotten.

Reference patterns

Operational mart (hybrid):
  • Stream: CDC → projections (KV/Redis, OLTP) for the UI, idempotent upserts.
  • Batch: nightly load into OLAP, reconciliation, ML features.
Anti-fraud:
  • Stream: session windows, CEP rules, alerts in 1-5 s.
  • Batch: model retraining, offline validation.
Marketing/CRM:
  • Stream: triggers, real-time segments.
  • Batch: scoring, LTV models, reports.

FAQ

Is it possible to get "almost real-time" on batch?
Yes: microbatches/triggered jobs (every 1-5 minutes) are a compromise that avoids the complexity of windows and late events.
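A microbatch loop can be sketched as follows (the `list_files` and `process` callbacks are hypothetical; a real job would also persist the processed set):

```python
import time

def microbatch_loop(list_files, process, interval_s=300, max_iters=None):
    """Near-real-time on batch tooling: poll for new inputs every few minutes
    and process only what has not been seen yet."""
    processed = set()
    i = 0
    while max_iters is None or i < max_iters:
        for path in list_files():
            if path not in processed:  # skip inputs already handled
                process(path)
                processed.add(path)
        i += 1
        if max_iters is None or i < max_iters:
            time.sleep(interval_s)  # the "micro" batch interval
    return processed
```

Each iteration is an ordinary deterministic batch over new files, so the idempotency story stays as simple as in the daily case.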

Is the Lambda approach needed everywhere?
No. If the stream covers all tasks and you can do replay, Kappa is simpler in the long run. Otherwise, go hybrid.

How to estimate the cost?
Sum compute + storage + ops. For Stream, add the price of 24/7 runtime and incident response; for Batch, the price of stale data.


Result

Choose Batch when low cost, simplicity, and periodic warehouse loads matter; choose Stream when reactivity and freshness are critical. In practice a hybrid wins: the stream handles online serving and signals, the batch handles completeness and cheap historical recalculations. The key is to set SLOs, ensure idempotency and observability, and design the correction path in advance.
