Batch vs Stream: When to Use Which
Why choose at all
Any data system trades off latency, cost, operational complexity, and reliability.
Batch - periodic "chunks" of data with high throughput and low cost per record.
Stream - continuous processing of events with minimal delay and state kept in memory or local stores.
Briefly about models
Batch
Source: files/tables/snapshots.
Trigger: a schedule (hourly/daily) or a condition (e.g., a new Parquet file arrives).
Strengths: simplicity, determinism, full data context, cheap large-scale recomputation.
Weaknesses: no "online" reactions, high latency, no real-time signals between runs.
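To make the batch model concrete, here is a minimal sketch in Python: read a full snapshot on a schedule, transform it deterministically, and atomically replace the output. The paths and CSV schema are illustrative assumptions, not a prescribed layout.

```python
import csv
import os
import tempfile
from pathlib import Path

def run_daily_batch(input_path: Path, output_path: Path) -> None:
    # Full data context: the entire snapshot is visible at once.
    with open(input_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Deterministic transformation: total amount per customer.
    totals: dict[str, float] = {}
    for row in rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])

    # Atomic replacement: write to a temp file, then rename over the
    # target, so restarting the job is idempotent (see "Reliability" below).
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent)
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "total"])
        for cid, total in sorted(totals.items()):
            writer.writerow([cid, total])
    os.replace(tmp_name, output_path)
```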
Stream
Source: brokers (Kafka/NATS/Pulsar), CDC, queues.
Trigger: the event itself.
Strengths: low latency, reactivity, natural integration with the product.
Weaknesses: time semantics (event time vs processing time), ordering/duplicates, state management, operational burden.
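And a minimal counterpart for the stream model: an always-running consumer that reacts to each event and keeps its state in memory. The in-process queue is a stand-in for a real broker; all names are illustrative.

```python
import queue
import threading
import time

events: "queue.Queue[dict]" = queue.Queue()  # stands in for a broker topic
state: dict[str, float] = {}                 # running balance, kept in memory

def consume() -> None:
    while True:
        event = events.get()                 # the trigger is the event itself
        key = event["account_id"]
        state[key] = state.get(key, 0.0) + event["amount"]
        print(f"{key} balance={state[key]:.2f}")  # low-latency reaction

threading.Thread(target=consume, daemon=True).start()
events.put({"account_id": "a1", "amount": 10.0})
events.put({"account_id": "a1", "amount": -3.5})
time.sleep(0.1)  # give the demo consumer a moment to drain the queue
```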
Decision: a selection matrix
The 80/20 rule: if the SLA tolerates minute- or hour-level delays and there are no reactive features, take batch. If reacting "here and now" is critical or you need live data marts, take stream (often plus a nightly batch for reconciliation).
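The rule can be sketched as a tiny decision helper; the thresholds below are illustrative assumptions, not normative values.

```python
def choose_processing_model(freshness_slo_s: float,
                            needs_online_reaction: bool) -> str:
    # Reactive features or sub-minute SLOs push toward streaming.
    if needs_online_reaction or freshness_slo_s < 60:
        return "stream (plus a nightly batch for reconciliation)"
    # Minute-level SLOs can often be met with micro-batches (see FAQ).
    if freshness_slo_s < 15 * 60:
        return "micro-batch every 1-5 minutes"
    return "batch on a schedule"

print(choose_processing_model(freshness_slo_s=3600, needs_online_reaction=False))
# -> batch on a schedule
```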
Typical scenarios
Batch - when better:
- Daily reporting, billing by period, ML training, large joins, deduplication over the full dataset.
- Medallion model (bronze/silver/gold) with deep validation.
- Mass backtests and data mart rebuilds.
Stream - when better:
- Anti-fraud/monitoring, SRE alerts, real-time balances/missions, "right now" recommendations.
- Event-as-fact (EDA) integrations, materialized view updates (CQRS).
- Microservices: notifications, webhooks, reactions to business events.
Hybrid - when better:
- The stream produces operational views and signals; the nightly batch handles reconciliation, warehousing, and cheap historical recomputation.
Architecture
Lambda (Stream + Batch)
Stream for increments and online serving; Batch for completeness and corrections.
Pros: flexibility and per-path SLAs. Cons: dual logic and code duplication.
Kappa (everything is Stream + Replay)
A single log as the source of truth; batch recomputation = replaying the log.
Pros: one code base, uniform semantics. Cons: harder to operate, long log retention requirements.
Hybrid-Pragmatic
Streaming "operating system" + periodic batch jobs for heavy joins/ML/corrections.
In practice, it is the most common option.
Time, order, windows (for Stream)
Rely on event time, not processing time.
Manage the watermark and 'allowed_lateness'; support retractions/upserts for late events.
Partition by entity keys; plan for hot keys.
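A toy sketch of these ideas, assuming epoch-second event times: tumbling windows keyed by event time, a watermark that tracks the maximum time seen, and an allowed-lateness budget after which late events are rejected. The window size and policies are illustrative assumptions.

```python
WINDOW_S = 60            # tumbling window size, in event-time seconds
ALLOWED_LATENESS_S = 30  # how long after the watermark a window stays open

windows: dict[int, int] = {}  # window start -> event count
closed: set[int] = set()      # windows already emitted and sealed
watermark = 0                 # simplified: max event time seen so far

def on_event(event_time: int) -> None:
    global watermark
    watermark = max(watermark, event_time)
    start = event_time - event_time % WINDOW_S  # event time, not wall clock
    if start in closed:
        # Too late even for allowed_lateness: route to a DLQ or retract.
        print(f"late drop: {event_time}")
        return
    windows[start] = windows.get(start, 0) + 1
    # Close every window whose end + allowed lateness the watermark passed.
    for s in list(windows):
        if s + WINDOW_S + ALLOWED_LATENESS_S <= watermark:
            print(f"window {s}-{s + WINDOW_S}: {windows.pop(s)} events")
            closed.add(s)

for t in [5, 20, 61, 10, 130, 3]:  # note the out-of-order 10 and 3
    on_event(t)
```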
Reliability and semantics of effects
Batch
Database transactions or atomic replacement of batches/tables.
Idempotency via deterministic computation and overwrite/INSERT OVERWRITE.
Stream
At-least-once delivery + idempotent sinks (upsert/merge, versioned aggregates).
A transactional "read, process, commit offset" loop for exactly-once semantics (EOS) at the level of effects.
Deduplication tables keyed by 'event_id'/'operation_id'.
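A minimal sketch of this "exactly-once effect" recipe, using SQLite as the sink: the effect (an upsert), the dedup record, and the consumer offset all commit in one local transaction, so at-least-once redelivery becomes a no-op. Table and field names are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE balances  (account_id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE processed (event_id TEXT PRIMARY KEY);       -- dedup table
    CREATE TABLE offsets   (partition_id INTEGER PRIMARY KEY, pos INTEGER);
""")

def apply_event(event: dict, partition_id: int, pos: int) -> None:
    with db:  # BEGIN ... COMMIT: effect + dedup + offset move together
        # At-least-once redelivery hits the PRIMARY KEY and is skipped.
        try:
            db.execute("INSERT INTO processed VALUES (?)", (event["event_id"],))
        except sqlite3.IntegrityError:
            return  # duplicate: the effect was already applied
        db.execute(
            "INSERT INTO balances VALUES (?, ?) "
            "ON CONFLICT(account_id) DO UPDATE SET amount = amount + excluded.amount",
            (event["account_id"], event["amount"]),
        )
        db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                   (partition_id, pos))

apply_event({"event_id": "e1", "account_id": "a1", "amount": 10.0}, 0, 1)
apply_event({"event_id": "e1", "account_id": "a1", "amount": 10.0}, 0, 1)  # no-op
print(db.execute("SELECT amount FROM balances").fetchone())  # (10.0,)
```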
Storage and formats
Batch
Data Lake (Parquet/Delta/Iceberg), OLAP (ClickHouse/BigQuery), object storage.
ACID table formats for atomic replacement and time travel.
Stream
Logs/topics in brokers, state stores (RocksDB/embedded), KV/Redis, OLTP for projections.
Schema registry (Avro/JSON/Proto), compatibility modes.
Cost and SLO
Batch: you pay per run, which is economical at large volumes, but latency is at least the schedule interval.
Stream: always-on runtime resources and peak costs at high QPS, but SLAs in seconds.
Measure p95/p99 latency, end-to-end lag, cost per event, and support TCO.
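A back-of-the-envelope comparison; every price and volume below is an illustrative assumption, not a benchmark.

```python
events_per_day = 50_000_000

stream_daily_usd = 0.40 * 3 * 24   # 3 always-on nodes at $0.40/node-hour
batch_daily_usd = 6.00             # one nightly run

for name, cost, freshness in [
    ("stream", stream_daily_usd, "seconds"),
    ("batch", batch_daily_usd, "up to 24 h"),
]:
    per_million = cost / (events_per_day / 1_000_000)
    print(f"{name}: ${cost:.2f}/day, ${per_million:.3f}/M events, "
          f"freshness {freshness}")
```

At this made-up volume the stream costs roughly five times more per event; the gap is the price of second-level freshness.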
Testing
Common: golden sets, property-based invariants, generated dirty inputs.
Batch: determinism, idempotent restarts, before/after comparison of marts.
Stream: out-of-order/duplicate inputs, fault injection between the effect and the offset commit, replay tests.
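A property-style test sketch for the stream case: an idempotent, order-insensitive aggregate must return the same result for any shuffle and any duplication of its input. Only the standard library is used; the event shape is an assumption.

```python
import random

def aggregate(events: list[dict]) -> dict[str, float]:
    seen: set[str] = set()
    totals: dict[str, float] = {}
    for e in events:
        if e["event_id"] in seen:  # dedup by event_id
            continue
        seen.add(e["event_id"])
        totals[e["key"]] = totals.get(e["key"], 0.0) + e["amount"]
    return totals

base = [{"event_id": f"e{i}", "key": "k", "amount": float(i)} for i in range(100)]
expected = aggregate(base)

for seed in range(20):
    rng = random.Random(seed)
    noisy = base + rng.choices(base, k=30)  # inject duplicates
    rng.shuffle(noisy)                      # inject out-of-order delivery
    assert aggregate(noisy) == expected, f"invariant broken for seed {seed}"
print("order/duplicate invariants hold")
```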
Observability
Batch: job duration, failure/retry rate, mart freshness, scan cost.
Stream: lag in time/messages, watermark progress, late-event rate, state size/checkpoint frequency, DLQ rate.
Everywhere: 'trace_id', 'event_id', schema/pipeline versions.
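A sketch of how the basic stream health numbers might be derived; the field names are illustrative assumptions.

```python
def lag_metrics(head_offset: int, committed_offset: int,
                last_event_time: float, now: float,
                late_events: int, total_events: int) -> dict:
    # Offset lag: messages written but not yet processed.
    # Time lag: how far behind the wall clock the processed stream is.
    # Late rate: the share of events arriving behind the watermark.
    return {
        "lag_messages": head_offset - committed_offset,
        "lag_seconds": max(0.0, now - last_event_time),
        "late_rate": late_events / total_events if total_events else 0.0,
    }

print(lag_metrics(head_offset=1_000_500, committed_offset=1_000_000,
                  last_event_time=1_700_000_000.0, now=1_700_000_042.0,
                  late_events=12, total_events=10_000))
```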
Security and data
PII/PCI: minimize, encrypt at rest/in flight, mark fields in schemas ('x-pii').
For Stream: protect state/checkpoints, ACLs on topics.
GDPR/right to be forgotten: in Stream, crypto-erasure and redaction in projections; in Batch, recomputation of the affected batches.
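A sketch of crypto-erasure ("crypto-shredding"): PII is encrypted with a per-user key, and deleting the key makes every copy in logs, state, and backups unreadable without rewriting them. Fernet comes from the third-party cryptography package; the in-memory key store is a stand-in for a real one.

```python
from cryptography.fernet import Fernet

user_keys: dict[str, bytes] = {}  # stands in for a real key store

def encrypt_pii(user_id: str, value: str) -> bytes:
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(value.encode())

def decrypt_pii(user_id: str, token: bytes) -> str | None:
    key = user_keys.get(user_id)
    if key is None:
        return None  # key shredded: the data is effectively erased
    return Fernet(key).decrypt(token).decode()

token = encrypt_pii("u42", "jane@example.com")
print(decrypt_pii("u42", token))  # jane@example.com
del user_keys["u42"]              # "right to be forgotten"
print(decrypt_pii("u42", token))  # None
```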
Transition strategies
Batch → Stream: start by publishing events (Outbox/CDC) and stand up a small real-time mart without touching the existing warehouse (see the Outbox sketch after this list).
Stream → Batch: add daily marts for reporting/reconciliation and to reduce load on streaming sinks.
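A sketch of the Outbox pattern mentioned above, with SQLite standing in for the OLTP database: the business write and the outgoing event commit in one transaction, and a separate relay publishes from the outbox table. The schema and the publish stub are assumptions.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def create_order(order_id: str, total: float) -> None:
    with db:  # state change and event are atomic: no lost or phantom events
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderCreated",
                                "order_id": order_id, "total": total}),))

def relay_once(publish) -> None:
    # Polls unpublished rows and marks them after a successful publish.
    rows = db.execute("SELECT id, payload FROM outbox "
                      "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

create_order("o1", 99.90)
relay_once(lambda p: print("publish:", p))
```

Because the relay may crash between publishing and marking a row, delivery is at-least-once; pair it with the idempotent sinks described earlier.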
Anti-patterns
"All in Stream" for the sake of fashion: expensive and difficult without real need.
"One giant night batch" with requirements <5 minutes.
Use processing time for business metrics.
Raw CDCs as Public Events: Tight Connectivity, Pain in Evolution.
No idempotency in sinks → double effects on restarts.
Selection checklist
- Freshness SLO: how many seconds/minutes/hours of delay are acceptable?
- Input stability: are there out-of-order events/duplicates?
- Are online reactions/live marts needed?
- Cost: a 24/7 runtime vs a "scheduled window."
- Correction method: retract/upsert or nightly recomputation?
- Team and operational maturity (observability, on-call).
- Requirements for "exactly-once" effects.
- PII policies/retention/right to be forgotten.
Reference patterns
Operational mart (hybrid):
- Stream: CDC/events → projections (KV/Redis, OLTP) for the UI, idempotent upsert.
- Batch: nightly mart in OLAP, reconciliation, ML features.
Anti-fraud and monitoring (hybrid):
- Stream: session windows, CEP rules, alerts within 1-5 s.
- Batch: model retraining, offline validation.
Marketing activation (hybrid):
- Stream: triggers, real-time segments.
- Batch: scoring, LTV models, reports.
FAQ
Can you get "almost real-time" on batch?
Yes: micro-batches/triggered jobs (every 1-5 minutes) are a compromise, without the complexity of windows/late events.
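A sketch of such a micro-batch loop, assuming new files land in a directory; the interval and layout are illustrative assumptions.

```python
import time
from pathlib import Path

INPUT_DIR = Path("landing")  # new files are assumed to land here
INTERVAL_S = 120             # a 2-minute micro-batch cadence
processed: set[str] = set()  # persist this in real use, for restart safety

def handle_file(path: Path) -> None:
    # Ordinary batch logic, just over a small, fresh input.
    print(f"processing {path} as one small batch")

def micro_batch_loop() -> None:
    while True:
        for path in sorted(INPUT_DIR.glob("*.csv")):
            if path.name not in processed:
                handle_file(path)
                processed.add(path.name)
        time.sleep(INTERVAL_S)  # freshness SLO ≈ the interval, not seconds
```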
Is the Lambda approach needed everywhere?
No. If the stream covers all your tasks and you know how to do replay, Kappa is simpler in the long run. Otherwise, a hybrid.
How do you count the cost?
Sum compute + storage + ops. For Stream, add the price of a 24/7 runtime and nighttime incidents; for Batch, the price of stale data.
Bottom line
Choose Batch when low cost, simplicity, and periodic marts matter; choose Stream when reactivity and freshness are critical. In practice the hybrid wins: the stream for online reactions and signals, the batch for completeness and cheap historical recomputation. The key is to set the SLOs, ensure idempotency and observability, and design the correction path in advance.