Lucy

Posted on Jun 5 • Originally published at lucentinnovation.com

Batch vs Streaming Pipelines: How I Actually Choose Between Them

Every data pipeline starts with one big question before a single line of code gets written.

Should I process data in scheduled chunks? Or should I process it the moment events arrive?

That is the batch vs streaming decision. It sounds simple. But in real projects, it shapes everything: which tools you pick, how much you spend each month, what guarantees you can make about fresh data, and how many nights you spend fixing production incidents.

I have seen teams pick streaming when batch would have worked just fine. I have also seen the opposite. Both mistakes are expensive. This post walks through how I think about it.

What Batch Processing Actually Means

Batch processing collects data over a time window and then processes it all at once when a scheduled trigger fires.

Think about doing laundry. You do not wash one shirt the moment it gets dirty. You wait until you have a full load, then run the machine. The shirts pile up during the week. On Sunday, the machine runs.

Batch data pipelines work the same way. Source data builds up in a staging area. At a set time, usually overnight or hourly, a job picks up everything that arrived, runs the transformations, and loads the results into the destination.

The batch job has a clear start. It has a clear end. When it finishes, the destination has a snapshot of data as of the run time. Between runs, nothing changes.
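
To make the "clear start, clear end" idea concrete, here is a minimal sketch of a batch run over one time window. The function name `run_batch`, the row fields, and the per-customer total are all hypothetical choices for illustration, not any particular tool's API.

```python
from datetime import datetime, timedelta


def run_batch(staged_rows, window_start, window_end):
    """Process every staged row with a timestamp in [window_start, window_end)."""
    selected = [r for r in staged_rows if window_start <= r["ts"] < window_end]
    # Transformation step (illustrative): total amount per customer.
    totals = {}
    for row in selected:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


# Hypothetical staging area contents.
staged = [
    {"ts": datetime(2024, 6, 1, 9, 0), "customer": "a", "amount": 10},
    {"ts": datetime(2024, 6, 1, 17, 30), "customer": "a", "amount": 5},
    {"ts": datetime(2024, 6, 2, 8, 0), "customer": "b", "amount": 7},
]

day = datetime(2024, 6, 1)
snapshot = run_batch(staged, day, day + timedelta(days=1))
rerun = run_batch(staged, day, day + timedelta(days=1))
```

Because the job is a pure function of the window, running it again over the same window gives the same snapshot, which is what makes reprocessing after a logic fix straightforward.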

What batch is great at:

Batch handles complex transformations well because there is zero time pressure per record. A batch job can join across tables with hundreds of millions of rows. It can run expensive multi-level calculations. It can apply feature engineering for machine learning without worrying about processing each event in milliseconds.

Batch pipelines are also much easier to test, debug, and rerun. When a transformation gives wrong results, you fix the logic and reprocess the affected time window. The worst thing that happens is a delayed job, not a production fire.

Where batch falls short:

Batch produces stale data. How stale depends on the schedule. Nightly jobs produce data up to 24 hours old. Hourly jobs produce data up to 60 minutes old.
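
The arithmetic can be sketched as follows, assuming results only become visible when the job finishes, so the run time adds to the schedule interval. The function name and the 10-minute run time are assumptions for illustration.

```python
from datetime import timedelta


def worst_case_staleness(interval, run_duration=timedelta(0)):
    """Oldest the destination can be just before the next run lands."""
    return interval + run_duration


nightly = worst_case_staleness(timedelta(hours=24))
# Assumed example: an hourly job that takes 10 minutes to finish.
hourly = worst_case_staleness(timedelta(hours=1), timedelta(minutes=10))
```

With a negligible run time this matches the figures above. A slow job pushes the worst case past the schedule interval.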

For use cases where decisions depend on what is happening right now, that staleness is a real problem.

A fraud detection system that runs on a nightly batch schedule is not a fraud detection system. It is a fraud reporting system. The fraud already happened hours ago.

What Stream Processing Actually Means

Streaming treats data as a continuous flow of individual events. Each event gets processed the moment it arrives, without waiting for others to pile up first.

Think about a moving walkway at an airport. People step onto the walkway as they arrive. Each person moves forward right away. Nobody waits for 500 people to gather before the walkway starts moving. The walkway runs all day whether one person is on it or ten thousand.

A streaming pipeline works the same way. An event source like Apache Kafka, Amazon Kinesis, or Google Pub/Sub delivers events in real time. The stream processor picks up each event, applies the transformation logic, and writes the result downstream within milliseconds to seconds. The pipeline runs 24 hours a day, seven days a week.
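
As a rough sketch of that loop, here is the per-event shape in plain Python. A generator stands in for the broker. The names `event_source` and `stream_processor` and the `is_large` rule are hypothetical. A real consumer would block on Kafka, Kinesis, or Pub/Sub instead.

```python
def event_source():
    """Stand-in for a broker subscription; yields events one at a time."""
    yield {"id": 1, "amount": 50}
    yield {"id": 2, "amount": 5000}


def stream_processor(events, sink):
    # Each event is transformed and written as soon as it is received,
    # without waiting for other events to accumulate.
    for event in events:
        enriched = {**event, "is_large": event["amount"] > 1000}
        sink.append(enriched)


out = []
stream_processor(event_source(), out)
```

The structural difference from batch is that there is no window and no end. The loop only stops when the source does.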

What streaming is great at:

Streaming is right when the output of the pipeline needs to trigger an action or update a system in real time.

Fraud detection needs to check whether a transaction looks suspicious before approving it. That decision cannot wait 60 minutes for the next batch run.

An e-commerce recommendation engine that adapts to clicks, cart additions, and browsing behavior as they happen gives a fundamentally different experience than one running on overnight batch data.

Infrastructure health dashboards that catch CPU spikes, error rate increases, or latency anomalies need second-level data, not hourly summaries.

Where streaming falls short:

Streaming infrastructure is a lot more complex to run than batch.

Stream processing introduces distributed processing requirements, state management, and fault tolerance mechanisms that batch engineers rarely deal with. Systems consume compute resources at all times rather than only during defined job windows.

Two failure modes in streaming catch teams off guard. The first is backpressure: incoming events exceed processing capacity, lag builds up, and outputs start describing events from minutes ago instead of seconds ago.
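
A small simulation shows how lag builds under backpressure. The rates are made-up numbers, and the model (fixed arrival and processing rates per second) is a deliberate simplification.

```python
def simulate_lag(arrivals_per_sec, capacity_per_sec, seconds):
    """Return the backlog size at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # Whatever cannot be processed this second stays in the backlog.
        backlog = max(0, backlog + arrivals_per_sec - capacity_per_sec)
        history.append(backlog)
    return history


healthy = simulate_lag(800, 1000, 5)
overloaded = simulate_lag(1200, 1000, 5)
# Approximate delay for the newest event: backlog divided by capacity.
delay_seconds = overloaded[-1] / 1000
```

In the overloaded case the backlog grows by 200 events every second and never recovers on its own, which is why consumer lag is usually the first metric worth alerting on.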

The second is silent correctness drift. Streaming systems often keep running even when data quality issues occur. Duplicate events, missing events, or schema changes can slowly corrupt outputs while dashboards still show active data.
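
One way to make this kind of drift visible is to count duplicates and gaps alongside the normal output. The sketch below assumes each event carries a unique `id` and a per-source sequence number `seq`, and that events arrive in order. Both field names are hypothetical.

```python
def quality_report(events):
    """Count duplicate ids and gaps in the sequence numbers."""
    seen = set()
    duplicates = 0
    missing = 0
    last_seq = None
    for event in events:
        if event["id"] in seen:
            duplicates += 1
            continue
        seen.add(event["id"])
        # A jump in the sequence number means events were skipped.
        if last_seq is not None and event["seq"] > last_seq + 1:
            missing += event["seq"] - last_seq - 1
        last_seq = event["seq"]
    return {"duplicates": duplicates, "missing": missing}


report = quality_report([
    {"id": "a", "seq": 1},
    {"id": "b", "seq": 2},
    {"id": "b", "seq": 2},  # redelivered event
    {"id": "e", "seq": 5},  # seq 3 and 4 never arrived
])
```

A dashboard fed by these four events would still look alive. Only the report shows that one event was duplicated and two were lost.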

The Comparison at a Glance

Dimension	Batch	Streaming
How data moves	Collects over time, processes in one run	Each event processed the moment it arrives
Latency	Minutes to hours	Milliseconds to seconds
Infrastructure	Compute spins up for the job, shuts down after	Always on, always running
Cost	Lower baseline, pay only when jobs run	Higher baseline, persistent infrastructure
Complexity	Lower, simpler error handling	Higher, state management and fault tolerance required
Failure mode	Delayed job, rerun and recover	Production incident, live intervention needed
Debugging	Rerun the job on the failed time window	Replay events from the message queue checkpoint
Schema change	Pipeline breaks loudly on next run	Can cause silent issues if not monitored

The One Question That Decides It

One question cuts through most of the debate: what happens if the data is one hour old?

If the answer is nothing meaningful, batch is probably the right choice.

If the answer is a real business loss, streaming earns its complexity.

Streaming is justified when the output triggers action. If the output only feeds retrospective analysis, batch is usually sufficient.

Four Questions to Ask Before Picking

1. How fresh does the data need to be to be useful?

Most analytics use cases tolerate data that is a few hours old. A weekly revenue report does not need second-level freshness. A fraud detection engine does. Know the actual freshness requirement before assuming you need streaming.

2. Does stale data cause a real business loss?

If a customer gets a product recommendation based on yesterday's browsing instead of what they clicked five minutes ago, does that cost the business money? If yes, streaming may be worth it. If it is a marginal difference, batch is almost certainly the right choice.

3. What is the operational capacity of your team?

Streaming infrastructure needs engineers who understand state management, checkpointing, exactly-once delivery semantics, and how to respond to backpressure incidents at midnight. If your team is small or your use case does not demand real-time results, that complexity is cost without benefit.

4. Is real-time the actual requirement, or is faster batch enough?

Stakeholders often say they want real-time when what they mean is they want data more current than nightly. A pipeline that runs every 15 minutes often satisfies that requirement at a fraction of the cost and complexity of a true streaming system.

When stakeholders say "real-time" but would accept hourly updates without meaningful business impact, they want faster batch, not streaming.

Real Use Cases: When Each Pattern Wins

When Batch Is the Right Answer

Nightly financial reporting. A bank's end-of-day ledger reconciliation processes every transaction from the day against regulatory limits and account balances. The job needs to run across the full day's dataset, apply complex multi-table joins, and produce a validated snapshot. Batch runs at end of day. Streaming adds nothing here.

ML model training. Training a machine learning model requires a large, static dataset processed multiple times. Streaming the training data adds enormous complexity without improving model quality.

Large-scale historical ETL. Migrating three years of transactional data into a new warehouse schema is a batch workload. The data already exists. There is no real-time requirement. Batch processes it once and moves on.

Compliance reporting. Monthly, quarterly, or annual regulatory reports that pull and aggregate data across long time windows are batch workloads. The business cost of a slightly delayed report is low. The complexity of a streaming system is not justified.

When Streaming Is the Right Answer

Fraud detection. Payment authorization systems need to evaluate whether a transaction is fraudulent before it clears, typically in under 500 milliseconds. A batch pipeline running every 30 minutes would approve or deny transactions without the context of what happened in the last 30 minutes.

Real-time feature serving for ML inference. When a deployed ML model needs features computed from recent user behavior to make a prediction, streaming pipelines update the feature store in real time. A recommendation model running on features from last night's batch is operating blind to today's context.

Live operational dashboards. A supply chain control tower showing current inventory levels, in-transit shipments, and order status across hundreds of warehouses needs second-level freshness. An overnight batch job cannot surface a stockout until the next morning.

IoT and sensor telemetry. In manufacturing, logistics, and energy, IoT devices generate continuous streams of sensor data that batch pipelines were not built to ingest or process. Predictive maintenance models that detect equipment issues before failure require streaming ingestion of live sensor data.

The Middle Ground Teams Often Miss: Micro-Batch

Between batch and streaming sits micro-batch processing. It is the pattern that Apache Spark Structured Streaming uses by default, and it solves most "near real-time" requirements without the full complexity of continuous streaming.

Micro-batch runs the same pipeline logic as streaming but on a very short fixed interval: every 30 seconds, every minute, every 5 minutes. Data builds up for the interval, then the batch processes it. Latency is measured in seconds to low minutes rather than hours.
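
The grouping step can be sketched in a few lines. This is an illustration of the idea only, not how Spark implements it. Events are assumed to be `(timestamp_seconds, value)` pairs, and `micro_batches` is a hypothetical helper.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Group (timestamp_seconds, value) events into fixed-interval batches."""
    batches = defaultdict(list)
    for ts, value in events:
        # Every event lands in the interval that contains its timestamp.
        batch_start = (ts // interval_seconds) * interval_seconds
        batches[batch_start].append(value)
    return dict(sorted(batches.items()))


events = [(3, "a"), (29, "b"), (31, "c"), (65, "d")]
batches = micro_batches(events, 30)
```

Each batch is then handed to ordinary batch logic, so the worst-case latency is roughly one interval plus the processing time.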

Most use cases that stakeholders describe as "real-time" actually tolerate micro-batch latency. A dashboard that refreshes every minute looks real-time to most users. A data freshness requirement of "under 5 minutes" is achievable with micro-batch at a fraction of the streaming infrastructure cost.

Here is how the decision tree actually looks in practice:

Hours of latency are fine: standard batch on a schedule
Minutes of latency are fine: micro-batch with short trigger intervals
Sub-minute latency is required and the output triggers action: true streaming with Spark Structured Streaming
Sub-second latency is required: Real-Time Mode on Databricks Spark Structured Streaming
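
The tree above can be written as a small function. The exact cutoffs (1 second, 60 seconds, one hour) are my own assumptions for turning "sub-second", "sub-minute", and "hours" into numbers. So is the choice to route sub-minute requirements that do not trigger an action to micro-batch.

```python
def choose_pattern(max_latency_seconds, triggers_action):
    """Map a latency requirement to a processing pattern (assumed cutoffs)."""
    if max_latency_seconds < 1:
        return "real-time mode streaming"
    if max_latency_seconds < 60 and triggers_action:
        return "streaming"
    if max_latency_seconds < 3600:
        return "micro-batch"
    return "batch"


fraud_check = choose_pattern(0.5, triggers_action=True)
alerting = choose_pattern(10, triggers_action=True)
dashboard = choose_pattern(300, triggers_action=False)
weekly_report = choose_pattern(7200, triggers_action=False)
```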

The Real Cost of Streaming: What Teams Underestimate

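A simple batch ETL pipeline costs between $15,000 and $50,000 to build. A production streaming pipeline with proper monitoring costs between $50,000 and $200,000 or more. Comparing low end to low end and high end to high end, that is roughly a 3x to 4x difference at the build stage alone, and the gap widens for the largest streaming builds.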

Operational cost compounds on top of that. Streaming systems need always-on compute, persistent state storage, continuous monitoring for lag and backpressure, and engineers who can respond to incidents at any hour.

Three costs teams consistently underestimate:

State management. Streaming pipelines that compute windowed aggregations, sessionization, or joins across event streams must maintain state across every event. State grows with data volume. Managing state storage, checkpointing, and cleanup is a continuous engineering concern with no equivalent in batch.
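
Here is a minimal sketch of what "maintaining state" means for a tumbling-window count, with cleanup driven by a watermark. Timestamps are plain integers in seconds, and the names and the lateness rule are simplifying assumptions. Real engines handle this with checkpointed state stores.

```python
def windowed_counts(event_times, window_seconds, allowed_lateness):
    """Tumbling-window counts; windows behind the watermark are finalized."""
    state = {}    # window_start -> running count (this is the state to manage)
    emitted = {}  # finalized windows
    watermark = 0
    for ts in event_times:
        watermark = max(watermark, ts - allowed_lateness)
        start = (ts // window_seconds) * window_seconds
        if start + window_seconds <= watermark:
            continue  # too late: this window was already finalized
        state[start] = state.get(start, 0) + 1
        # Cleanup: emit and drop every window that closed before the watermark.
        for closed in [w for w in state if w + window_seconds <= watermark]:
            emitted[closed] = state.pop(closed)
    return emitted, state


emitted, open_state = windowed_counts([1, 5, 12, 3, 25, 40], 10, 5)
```

Without the cleanup step the `state` dictionary would grow forever. With it, only windows that can still receive events stay in memory, which is the trade-off between completeness and state size that batch jobs never have to make.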

Exactly-once delivery. Guaranteeing that each event is processed exactly once, not duplicated or dropped, requires careful coordination between the message queue, the stream processor, and the output destination. Getting this wrong means silent duplicate records or missing events in production.
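
A common way to get exactly-once results without fully coordinated transactions is to accept at-least-once delivery and make the write idempotent. The sketch below shows that substitute technique, not the coordinated approach described above. `IdempotentSink` and the event fields are hypothetical.

```python
class IdempotentSink:
    """Destination that upserts by event id, so redelivery is harmless."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        # Keyed upsert: a duplicate overwrites the same row instead of
        # appending a second one.
        self.rows[event["id"]] = event


sink = IdempotentSink()
deliveries = [
    {"id": "t1", "amount": 20},
    {"id": "t2", "amount": 35},
    {"id": "t1", "amount": 20},  # redelivered after a retry
]
for event in deliveries:
    sink.write(event)

total = sum(row["amount"] for row in sink.rows.values())
```

An append-only sink would have reported 75 here. This approach only works when every event carries a stable, unique key.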

Schema evolution. When a source system changes its event schema, a batch pipeline usually fails loudly on the next scheduled run. A streaming pipeline may silently accept the new schema, produce corrupt output, and keep running for days before anyone notices.
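
One way to make a streaming pipeline fail loudly too is to validate every event against an expected schema before processing it. This is a minimal sketch with a hypothetical schema and field names. In practice a schema registry or a validation library would usually do this job.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"order_id": str, "amount": float, "currency": str}


def validate(event, schema=EXPECTED_SCHEMA):
    """Return a list of schema problems; an empty list means the event is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"wrong type for {field}: {type(event[field]).__name__}"
            )
    for field in event:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = validate({"order_id": "o-1", "amount": 9.5, "currency": "EUR"})
bad = validate({"order_id": "o-2", "amount": "9.5", "total_currency": "EUR"})
```

Routing events with a non-empty problem list to a dead-letter destination and alerting on its volume turns a silent schema change into a visible one.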

None of this means streaming is wrong. It means streaming should be chosen when the use case justifies the cost, not because it sounds more modern than batch.

Lambda vs Kappa: Two Ways to Run Both at Once

Many production systems need both patterns. Two architectural approaches define how teams organize that combination.

Lambda Architecture

Lambda runs two parallel pipelines. A batch layer reprocesses the full historical dataset on a schedule and produces accurate, complete results. A speed layer processes real-time events and produces approximate but current results. A serving layer merges the two: it answers queries from the batch output where that exists and from the speed layer for the period the batch layer has not yet covered.

The batch layer produces trusted, complete data. The speed layer fills in the gap between now and the last batch run. When the batch layer catches up, it overrides the speed layer's approximate output.
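
The merge in the serving layer can be sketched like this. The data shapes are assumptions: the batch view is a dictionary of totals valid up to `batch_cutoff`, and the speed view is a list of `(timestamp, key, amount)` events.

```python
def serve(batch_view, speed_view, batch_cutoff):
    """Combine batch totals with speed-layer events newer than the cutoff."""
    merged = dict(batch_view)
    for ts, key, amount in speed_view:
        # Events at or before the cutoff are already counted in the batch view.
        if ts > batch_cutoff:
            merged[key] = merged.get(key, 0) + amount
    return merged


batch_view = {"a": 100, "b": 40}
speed_view = [(90, "a", 5), (105, "a", 10), (110, "c", 7)]
result = serve(batch_view, speed_view, batch_cutoff=100)
```

When the next batch run completes, the cutoff moves forward and the older speed-layer events drop out of the merge, which is the override behavior described above.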

Lambda works well when accuracy matters for historical data but approximate freshness is acceptable for recent data. The real cost is operational: two separate pipelines to build, test, and maintain.

Kappa Architecture

Kappa replaces the dual-pipeline design with a single streaming pipeline that handles everything. All data, historical and real-time, flows through the same stream processor.

Historical reprocessing works by replaying events from a durable message queue like Apache Kafka, which retains events for a configurable window. To reprocess, you replay from the beginning of the queue through the same pipeline code. No separate batch layer needed.
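
Here is a toy version of that replay. `EventLog` is an in-memory stand-in for a retained Kafka topic, and both pipeline functions are hypothetical. The point is only that new logic runs over the same stored events.

```python
class EventLog:
    """Stand-in for a durable, replayable log such as a Kafka topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset=0):
        # Replay: iterate from any retained offset, as often as needed.
        yield from self._events[offset:]


def pipeline_v1(event):
    return event["amount_cents"]


def pipeline_v2(event):
    # Hypothetical revised logic: add a 20% surcharge, in integer cents.
    return event["amount_cents"] * 120 // 100


log = EventLog()
for cents in (1000, 2000, 3000):
    log.append({"amount_cents": cents})

original = [pipeline_v1(e) for e in log.read_from(0)]
reprocessed = [pipeline_v2(e) for e in log.read_from(0)]
```

If the log had already expired the first event, the replay would start from whatever is left, so the retention period sets how far back reprocessing can go.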

Kappa is simpler to maintain but requires your message queue to retain data long enough to support replays. It also requires that your transformation logic works correctly as a streaming pipeline, which rules out certain types of complex, multi-pass batch transformations.

Quick Reference: Which Pattern for Which Use Case

Use Case	Pattern	Why
Nightly revenue reporting	Batch	Data freshness within hours is fine
ML model training	Batch	Requires full static dataset
Historical data migration	Batch	Data already exists, no real-time constraint
Fraud detection	Streaming	Decision must happen before transaction clears
Real-time ML feature serving	Streaming	Model inference needs current behavioral context
IoT anomaly detection	Streaming	Equipment failure cannot wait for next batch
Live inventory dashboards	Streaming	Stockout response needs current state
Monthly compliance reports	Batch	Fixed window, no freshness urgency

My Rule of Thumb

Before you write a line of code, ask: does the output of this pipeline trigger an action, or does it inform analysis?

If it triggers an action and that action loses value after a few minutes, build streaming.

If it informs analysis and the insights hold up for a few hours, build batch.

And if your stakeholders say "real-time" but can actually accept updates every few minutes, build micro-batch. It gives you most of the freshness at a fraction of the cost.

The goal is not to use the most impressive technology. The goal is to ship the simplest system that meets the actual latency requirement and does not wake anyone up at 3 AM.

This post is part of a series on modern data engineering. For more on how these patterns connect to ETL vs ELT design choices, how Databricks handles both batch and streaming in one platform, and how to design for schema evolution at scale, check out the Modern Data Engineering Guide.
