Batch vs Stream: When to Use Which
Why choose at all
Any data system trades off latency, cost, operational complexity, and reliability.
Batch - periodic "chunks" of data with high throughput and low cost per record.
Stream - continuous processing of events with minimal delay and state kept in memory or local stores.
Briefly about models
Batch
Source: files/tables/snapshots.
Trigger: a schedule (hourly/daily) or a condition (e.g., a new Parquet file arrives).
Strengths: simplicity, determinism, full data context, cheap large-scale recomputation.
Weaknesses: no "online" reactions, high latency, no real-time signals between runs.
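To make the batch model concrete, here is a minimal sketch in Python: read a full snapshot on a schedule, transform it deterministically, and atomically replace the output. The paths and CSV schema are illustrative assumptions, not a prescribed layout.

```python
import csv
import os
import tempfile
from pathlib import Path

def run_daily_batch(input_path: Path, output_path: Path) -> None:
    # Full data context: the entire snapshot is visible at once.
    with open(input_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Deterministic transformation: total amount per customer.
    totals: dict[str, float] = {}
    for row in rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])

    # Atomic replacement: write to a temp file, then rename over the
    # target, so restarting the job is idempotent (see "Reliability" below).
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent)
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "total"])
        for cid, total in sorted(totals.items()):
            writer.writerow([cid, total])
    os.replace(tmp_name, output_path)
```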
Stream
Source: brokers (Kafka/NATS/Pulsar), CDC, queues.
Trigger: the event itself.
Strengths: low latency, reactivity, natural integration with the product.
Weaknesses: time semantics (event time vs processing time), ordering/duplicates, state management, operational burden.
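And a minimal counterpart for the stream model: an always-running consumer that reacts to each event and keeps its state in memory. The in-process queue is a stand-in for a real broker; all names are illustrative.

```python
import queue
import threading
import time

events: "queue.Queue[dict]" = queue.Queue()  # stands in for a broker topic
state: dict[str, float] = {}                 # running balance, kept in memory

def consume() -> None:
    while True:
        event = events.get()                 # the trigger is the event itself
        key = event["account_id"]
        state[key] = state.get(key, 0.0) + event["amount"]
        print(f"{key} balance={state[key]:.2f}")  # low-latency reaction

threading.Thread(target=consume, daemon=True).start()
events.put({"account_id": "a1", "amount": 10.0})
events.put({"account_id": "a1", "amount": -3.5})
time.sleep(0.1)  # give the demo consumer a moment to drain the queue
```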
Decision: a selection matrix
The 80/20 rule: if the SLA tolerates minute- or hour-level delays and there are no reactive features, take batch. If reacting "here and now" is critical or you need live data marts, take stream (often plus a nightly batch for reconciliation).
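The rule can be sketched as a tiny decision helper; the thresholds below are illustrative assumptions, not normative values.

```python
def choose_processing_model(freshness_slo_s: float,
                            needs_online_reaction: bool) -> str:
    # Reactive features or sub-minute SLOs push toward streaming.
    if needs_online_reaction or freshness_slo_s < 60:
        return "stream (plus a nightly batch for reconciliation)"
    # Minute-level SLOs can often be met with micro-batches (see FAQ).
    if freshness_slo_s < 15 * 60:
        return "micro-batch every 1-5 minutes"
    return "batch on a schedule"

print(choose_processing_model(freshness_slo_s=3600, needs_online_reaction=False))
# -> batch on a schedule
```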
Typical scenarios
Batch - when better:
- Daily reporting, billing by period, ML training, large joins, deduplication over the full dataset.
- Medallion model (bronze/silver/gold) with deep validation.
- Mass backtests and data mart rebuilds.
Stream - when better:
- Anti-fraud/monitoring, SRE alerts, real-time balances/missions, "right now" recommendations.
- Event-as-fact (EDA) integrations, materialized view updates (CQRS).
- Microservices: notifications, webhooks, reactions to business events.
Hybrid - when better:
- The stream produces operational views and signals; the nightly batch handles reconciliation, warehousing, and cheap historical recomputation.
Architecture
Lambda (Stream + Batch)
Stream for increments and online serving; Batch for completeness and corrections.
Pros: flexibility and per-path SLAs. Cons: dual logic and code duplication.
Kappa (everything is Stream + Replay)
A single log as the source of truth; batch recomputation = replaying the log.
Pros: one code base, uniform semantics. Cons: harder to operate, long log retention requirements.
Hybrid-Pragmatic
Streaming "operating system" + periodic batch jobs for heavy joins/ML/corrections.
In practice, it is the most common option.
Time, order, windows (for Stream)
Rely on event time, not processing time.
Manage the watermark and 'allowed_lateness'; support retractions/upserts for late events.
Partition by entity keys; plan for hot keys.
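A toy sketch of these ideas, assuming epoch-second event times: tumbling windows keyed by event time, a watermark that tracks the maximum time seen, and an allowed-lateness budget after which late events are rejected. The window size and policies are illustrative assumptions.

```python
WINDOW_S = 60            # tumbling window size, in event-time seconds
ALLOWED_LATENESS_S = 30  # how long after the watermark a window stays open

windows: dict[int, int] = {}  # window start -> event count
closed: set[int] = set()      # windows already emitted and sealed
watermark = 0                 # simplified: max event time seen so far

def on_event(event_time: int) -> None:
    global watermark
    watermark = max(watermark, event_time)
    start = event_time - event_time % WINDOW_S  # event time, not wall clock
    if start in closed:
        # Too late even for allowed_lateness: route to a DLQ or retract.
        print(f"late drop: {event_time}")
        return
    windows[start] = windows.get(start, 0) + 1
    # Close every window whose end + allowed lateness the watermark passed.
    for s in list(windows):
        if s + WINDOW_S + ALLOWED_LATENESS_S <= watermark:
            print(f"window {s}-{s + WINDOW_S}: {windows.pop(s)} events")
            closed.add(s)

for t in [5, 20, 61, 10, 130, 3]:  # note the out-of-order 10 and 3
    on_event(t)
```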
Reliability and semantics of effects
Batch
Database transactions or atomic replacement of batches/tables.
Idempotency via deterministic computation and overwrite/INSERT OVERWRITE.
Stream
At-least-once delivery + idempotent sinks (upsert/merge, versioned aggregates).
A transactional "read, process, commit offset" loop for exactly-once semantics (EOS) at the level of effects.
Deduplication tables keyed by 'event_id'/'operation_id'.
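A minimal sketch of this "exactly-once effect" recipe, using SQLite as the sink: the effect (an upsert), the dedup record, and the consumer offset all commit in one local transaction, so at-least-once redelivery becomes a no-op. Table and field names are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE balances  (account_id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE processed (event_id TEXT PRIMARY KEY);       -- dedup table
    CREATE TABLE offsets   (partition_id INTEGER PRIMARY KEY, pos INTEGER);
""")

def apply_event(event: dict, partition_id: int, pos: int) -> None:
    with db:  # BEGIN ... COMMIT: effect + dedup + offset move together
        # At-least-once redelivery hits the PRIMARY KEY and is skipped.
        try:
            db.execute("INSERT INTO processed VALUES (?)", (event["event_id"],))
        except sqlite3.IntegrityError:
            return  # duplicate: the effect was already applied
        db.execute(
            "INSERT INTO balances VALUES (?, ?) "
            "ON CONFLICT(account_id) DO UPDATE SET amount = amount + excluded.amount",
            (event["account_id"], event["amount"]),
        )
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                   (partition_id, pos))

apply_event({"event_id": "e1", "account_id": "a1", "amount": 10.0}, 0, 1)
apply_event({"event_id": "e1", "account_id": "a1", "amount": 10.0}, 0, 1)  # no-op
print(db.execute("SELECT amount FROM balances").fetchone())  # (10.0,)
```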
Storage and formats
Batch
Data Lake (Parquet/Delta/Iceberg), OLAP (ClickHouse/BigQuery), object storage.
ACID table formats for atomic replacement and time travel.
Stream
Logs/topics in brokers, state stores (RocksDB/embedded), KV/Redis, OLTP for projections.
Schema registry (Avro/JSON/Proto), compatibility modes.
Cost and SLO
Batch: you pay per run, which is economical at large volumes, but latency is at least the schedule interval.
Stream: always-on runtime resources and peak costs at high QPS, but SLAs in seconds.
Measure p95/p99 latency, end-to-end lag, cost per event, and support TCO.
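A back-of-the-envelope comparison; every price and volume below is an illustrative assumption, not a benchmark.

```python
events_per_day = 50_000_000

stream_daily_usd = 0.40 * 3 * 24   # 3 always-on nodes at $0.40/node-hour
batch_daily_usd = 6.00             # one nightly run

for name, cost, freshness in [
    ("stream", stream_daily_usd, "seconds"),
    ("batch", batch_daily_usd, "up to 24 h"),
]:
    per_million = cost / (events_per_day / 1_000_000)
    print(f"{name}: ${cost:.2f}/day, ${per_million:.3f}/M events, "
          f"freshness {freshness}")
```

At this made-up volume the stream costs roughly five times more per event; the gap is the price of second-level freshness.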
Testing
Common: golden sets, property-based invariants, generated dirty inputs.
Batch: determinism, idempotent restarts, before/after comparison of marts.
Stream: out-of-order/duplicate inputs, fault injection between the effect and the offset commit, replay tests.
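A property-style test sketch for the stream case: an idempotent, order-insensitive aggregate must return the same result for any shuffle and any duplication of its input. Only the standard library is used; the event shape is an assumption.

```python
import random

def aggregate(events: list[dict]) -> dict[str, float]:
    seen: set[str] = set()
    totals: dict[str, float] = {}
    for e in events:
        if e["event_id"] in seen:  # dedup by event_id
            continue
        seen.add(e["event_id"])
        totals[e["key"]] = totals.get(e["key"], 0.0) + e["amount"]
    return totals

base = [{"event_id": f"e{i}", "key": "k", "amount": float(i)} for i in range(100)]
expected = aggregate(base)

for seed in range(20):
    rng = random.Random(seed)
    noisy = base + rng.choices(base, k=30)  # inject duplicates
    rng.shuffle(noisy)                      # inject out-of-order delivery
    assert aggregate(noisy) == expected, f"invariant broken for seed {seed}"
print("order/duplicate invariants hold")
```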
Observability
Batch: job duration, failure/retry rate, mart freshness, scan cost.
Stream: lag in time/messages, watermark progress, late-event rate, state size/checkpoint frequency, DLQ rate.
Everywhere: 'trace_id', 'event_id', schema/pipeline versions.
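A sketch of how the basic stream health numbers might be derived; the field names are illustrative assumptions.

```python
def lag_metrics(head_offset: int, committed_offset: int,
                last_event_time: float, now: float,
                late_events: int, total_events: int) -> dict:
    # Offset lag: messages written but not yet processed.
    # Time lag: how far behind the wall clock the processed stream is.
    # Late rate: the share of events arriving behind the watermark.
    return {
        "lag_messages": head_offset - committed_offset,
        "lag_seconds": max(0.0, now - last_event_time),
        "late_rate": late_events / total_events if total_events else 0.0,
    }

print(lag_metrics(head_offset=1_000_500, committed_offset=1_000_000,
                  last_event_time=1_700_000_000.0, now=1_700_000_042.0,
                  late_events=12, total_events=10_000))
```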
Security and data
PII/PCI: minimize, encrypt at rest/in flight, mark fields in schemas ('x-pii').
For Stream: protect state/checkpoints, ACLs on topics.
GDPR/right to be forgotten: in Stream, crypto-erasure and redaction in projections; in Batch, recomputation of the affected batches.
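A sketch of crypto-erasure ("crypto-shredding"): PII is encrypted with a per-user key, and deleting the key makes every copy in logs, state, and backups unreadable without rewriting them. Fernet comes from the third-party cryptography package; the in-memory key store is a stand-in for a real one.

```python
from cryptography.fernet import Fernet

user_keys: dict[str, bytes] = {}  # stands in for a real key store

def encrypt_pii(user_id: str, value: str) -> bytes:
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(value.encode())

def decrypt_pii(user_id: str, token: bytes) -> str | None:
    key = user_keys.get(user_id)
    if key is None:
        return None  # key shredded: the data is effectively erased
    return Fernet(key).decrypt(token).decode()

token = encrypt_pii("u42", "jane@example.com")
print(decrypt_pii("u42", token))  # jane@example.com
del user_keys["u42"]              # "right to be forgotten"
print(decrypt_pii("u42", token))  # None
```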
Transition strategies
Batch → Stream: start by publishing events (Outbox/CDC) and stand up a small real-time mart without touching the existing warehouse (see the Outbox sketch after this list).
Stream → Batch: add daily marts for reporting/reconciliation and to reduce load on streaming sinks.
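A sketch of the Outbox pattern mentioned above, with SQLite standing in for the OLTP database: the business write and the outgoing event commit in one transaction, and a separate relay publishes from the outbox table. The schema and the publish stub are assumptions.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id: str, total: float) -> None:
    with db:  # state change and event are atomic: no lost or phantom events
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderCreated",
                                "order_id": order_id, "total": total}),))

def relay_once(publish) -> None:
    # Polls unpublished rows and marks them after a successful publish.
    rows = db.execute("SELECT id, payload FROM outbox "
                      "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

create_order("o1", 99.90)
relay_once(lambda p: print("publish:", p))
```

Because the relay may crash between publishing and marking a row, delivery is at-least-once; pair it with the idempotent sinks described earlier.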
Anti-patterns
"All in Stream" for the sake of fashion: expensive and difficult without real need.
"One giant night batch" with requirements <5 minutes.
Use processing time for business metrics.
Raw CDCs as Public Events: Tight Connectivity, Pain in Evolution.
No idempotency in sinks → double effects on restarts.
Selection checklist
- Freshness SLO: how many seconds/minutes/hours of delay are acceptable?
- Input stability: are there out-of-order events/duplicates?
- Are online reactions/live marts needed?
- Cost: a 24/7 runtime vs a "scheduled window."
- Correction method: retract/upsert or nightly recomputation?
- Team and operational maturity (observability, on-call).
- Requirements for "exactly-once" effects.
- PII policies/retention/right to be forgotten.
Reference patterns
Operational mart (hybrid):
- Stream: CDC/events → projections (KV/Redis, OLTP) for the UI, idempotent upsert.
- Batch: nightly mart in OLAP, reconciliation, ML features.
Anti-fraud and monitoring (hybrid):
- Stream: session windows, CEP rules, alerts within 1-5 s.
- Batch: model retraining, offline validation.
Marketing activation (hybrid):
- Stream: triggers, real-time segments.
- Batch: scoring, LTV models, reports.
FAQ
Can you get "almost real-time" on batch?
Yes: micro-batches/triggered jobs (every 1-5 minutes) are a compromise, without the complexity of windows/late events.
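A sketch of such a micro-batch loop, assuming new files land in a directory; the interval and layout are illustrative assumptions.

```python
import time
from pathlib import Path

INPUT_DIR = Path("landing")  # new files are assumed to land here
INTERVAL_S = 120             # a 2-minute micro-batch cadence
processed: set[str] = set()  # persist this in real use, for restart safety

def handle_file(path: Path) -> None:
    # Ordinary batch logic, just over a small, fresh input.
    print(f"processing {path} as one small batch")

def micro_batch_loop() -> None:
    while True:
        for path in sorted(INPUT_DIR.glob("*.csv")):
            if path.name not in processed:
                handle_file(path)
                processed.add(path.name)
        time.sleep(INTERVAL_S)  # freshness SLO ≈ the interval, not seconds
```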
Is the Lambda approach needed everywhere?
No. If the stream covers all your tasks and you know how to do replay, Kappa is simpler in the long run. Otherwise, a hybrid.
How do you count the cost?
Sum compute + storage + ops. For Stream, add the price of a 24/7 runtime and nighttime incidents; for Batch, the price of stale data.
Bottom line
Choose Batch when low cost, simplicity, and periodic marts matter; choose Stream when reactivity and freshness are critical. In practice the hybrid wins: the stream for online reactions and signals, the batch for completeness and cheap historical recomputation. The key is to set the SLOs, ensure idempotency and observability, and design the correction path in advance.