Every data pipeline makes one foundational decision before a single line of code is written.
Does it process data in scheduled chunks, or does it process data as events arrive?
That is the batch versus streaming decision. It looks simple on paper. In practice, it shapes everything: the tools you use, the infrastructure you maintain, the guarantees you can make about data freshness, and the cost you pay every month to keep it running.
Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that did not require it. Teams that build batch when their use case demands real-time discover the gap at the worst possible moment.
This post gives you a decision framework, a side-by-side comparison, and working code examples for both patterns. The full deep-dive with architecture diagrams, Lambda vs Kappa tradeoffs, and Databricks Real-Time Mode is at the original article on Lucent Innovation.
How Batch Works (and Where It Breaks Down)
Batch collects data over a window of time, then processes it all at once when a scheduled trigger fires.
Think of it like doing laundry. You do not wash one shirt the moment it gets dirty. You wait for a full load, then run the machine. Data accumulates throughout the day. At a scheduled time, the job picks up everything, transforms it, and loads the output.
Where batch wins:
- Complex multi-table joins and heavy aggregations with no time pressure per record
- ML model training on large, static datasets
- Nightly financial reconciliation and compliance reporting
- Historical data migrations and backfills
Where batch breaks down:
- Use cases where decisions depend on data that is happening right now
- A fraud detection system running on nightly batch is not a fraud system. It is a fraud reporting system.
Batch Pipeline Example (PySpark)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, date_trunc
spark = SparkSession.builder.appName("NightlyRevenueRollup").getOrCreate()
# Read from Delta Lake Bronze table
raw_orders = spark.read.format("delta").load("/mnt/bronze/orders")
# Transform: aggregate daily revenue per product
daily_revenue = (
raw_orders
.filter(col("status") == "completed")
.withColumn("order_date", date_trunc("day", col("created_at")))
.groupBy("order_date", "product_id")
.agg(sum("amount").alias("total_revenue"))
)
# Write to Delta Lake Gold table
(
daily_revenue.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", "order_date = current_date()")
.save("/mnt/gold/daily_revenue")
)
This job runs on a schedule (nightly, hourly, whatever the SLA allows). It reads a full time window, transforms it, and overwrites the output partition. Simple failure mode: if it breaks, you fix the logic and rerun the window.
How Streaming Works (and Where It Costs You)
Streaming treats data as a continuous flow of individual events. Each event is processed the moment it arrives, without waiting for others to accumulate.
Think of it like a moving walkway at an airport. Nobody waits for 500 people to gather before the walkway starts. It runs continuously. Each person moves forward the moment they step on.
A streaming pipeline runs 24 hours a day, 7 days a week, processing each event within milliseconds to seconds of arrival.
Where streaming wins:
- Fraud detection before a transaction clears
- Real-time personalization based on current browsing behavior
- Operational dashboards that need second-level granularity
- IoT and sensor telemetry for predictive maintenance
Where streaming costs you:
- Always-on compute, persistent state storage, and continuous monitoring
- State management across windowed aggregations grows with data volume
- Schema changes from source systems can corrupt output silently while the pipeline keeps running
- A simple batch ELT pipeline costs $15,000–$50,000 to build. A production streaming pipeline with proper monitoring: $50,000–$200,000+
Streaming Pipeline Example (Spark Structured Streaming)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
spark = SparkSession.builder.appName("FraudSignalStream").getOrCreate()
# Define schema for incoming payment events
payment_schema = (
StructType()
.add("transaction_id", StringType())
.add("user_id", StringType())
.add("amount", DoubleType())
.add("merchant_id", StringType())
.add("event_time", TimestampType())
)
# Read from Kafka topic
raw_stream = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:9092")
.option("subscribe", "payment_events")
.option("startingOffsets", "latest")
.load()
.select(from_json(col("value").cast("string"), payment_schema).alias("data"))
.select("data.*")
)
# Compute 5-minute rolling spend per user for anomaly detection
windowed_spend = (
raw_stream
.withWatermark("event_time", "10 minutes")
.groupBy(
col("user_id"),
window(col("event_time"), "5 minutes")
)
.agg(sum("amount").alias("total_spend_5min"))
)
# Write results to Delta Lake for downstream fraud scoring
(
windowed_spend.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/mnt/checkpoints/fraud_signal")
.start("/mnt/silver/fraud_signals")
)
This pipeline runs continuously. The checkpointLocation ensures exactly-once processing if the job restarts. The watermark handles late-arriving events. The windowed aggregation maintains rolling state in memory.
The Decision Framework: One Question Answers It
What happens if the data is one hour old?
If the answer is nothing meaningful, batch is the right choice.
If the answer is a real business loss, streaming earns its complexity.
Four follow-up questions to confirm:
| Question | If Yes | If No |
|---|---|---|
| Does stale data cause a direct business loss? | Streaming | Batch |
| Does the output trigger a real-time action? | Streaming | Batch |
| Does your team have streaming ops experience? | Streaming feasible | Stick to batch |
| Would hourly refreshes satisfy the requirement? | Micro-batch | Streaming |
Micro-Batch: The Middle Ground Most Teams Overlook
Between batch and streaming sits micro-batch. It is the pattern Spark Structured Streaming uses by default and the one that solves most "near real-time" requirements without full streaming complexity.
Micro-batch runs the same pipeline logic as streaming but on a short fixed interval: every 30 seconds, every minute, every 5 minutes.
Most stakeholders who say they want "real-time" data would be fully satisfied with a dashboard that refreshes every minute. That is micro-batch, not streaming, and it costs a fraction of the infrastructure.
Latency requirement → Pattern
─────────────────────────────────────────────
Hours → Batch (scheduled)
Minutes → Micro-batch (short trigger)
Sub-minute + action → Streaming (Structured Streaming)
Sub-second + action → Real-Time Mode (Databricks RTM)
Micro-Batch in Practice
# Micro-batch: trigger every 60 seconds instead of continuously
(
windowed_spend.writeStream
.format("delta")
.outputMode("append")
.trigger(processingTime="60 seconds") # <-- this is the only change
.option("checkpointLocation", "/mnt/checkpoints/fraud_signal")
.start("/mnt/silver/fraud_signals")
)
One line change. 60-second freshness. Dramatically simpler operational model.
Both Patterns in One Place: Databricks Lakeflow
On Databricks, batch and streaming live in the same pipeline definition using Lakeflow Spark Declarative Pipelines. No separate tools. No second codebase.
import dlt
from pyspark.sql.functions import col, sum, window
# STREAMING TABLE: processes each event as it arrives
@dlt.table(
name="silver_payment_events",
comment="Cleaned payment events, streamed from Kafka"
)
def silver_payment_events():
return (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:9092")
.option("subscribe", "payment_events")
.load()
)
# MATERIALIZED VIEW: runs as batch, re-computes on a schedule
@dlt.table(
name="gold_daily_revenue",
comment="Daily revenue rollup, refreshed hourly"
)
def gold_daily_revenue():
return (
dlt.read("silver_payment_events")
.groupBy("merchant_id", "order_date")
.agg(sum("amount").alias("total_revenue"))
)
The streaming table (silver_payment_events) processes events continuously. The materialized view (gold_daily_revenue) refreshes on a trigger or schedule. Both are governed by Unity Catalog. Both write to Delta Lake. One pipeline definition, two patterns, zero context switching.
Real-World Pattern Map
| Use Case | Pattern | Why |
|---|---|---|
| Nightly revenue reporting | Batch | Hourly freshness acceptable |
| ML model training | Batch | Full static dataset required |
| Historical data migration | Batch | No real-time constraint |
| Fraud detection | Streaming | Decision before transaction clears |
| Live inventory dashboard | Streaming | Stockout response requires current state |
| IoT anomaly detection | Streaming | Equipment failure cannot wait |
| Stakeholder dashboard "refreshed often" | Micro-batch | Minutes of freshness, batch cost |
| Compliance reporting | Batch | Fixed time window, no urgency |
The Full Picture
This post covers the decision framework and working code examples for both patterns. The full article at Lucent Innovation goes deeper on:
- Lambda Architecture vs Kappa Architecture for systems that need both patterns simultaneously
- Databricks Real-Time Mode (GA March 2026) and how it achieves single-digit millisecond P99 latency without switching to Flink
- The three hidden streaming costs teams consistently underestimate: state management, exactly-once delivery, and silent schema drift
- How ELT handles streaming more naturally than ETL due to the pre-transformation bottleneck
Read the full guide here:
Batch vs Streaming Pipelines: How to Choose in 2026
Which pattern are you currently running in production? Drop it in the comments.
Top comments (0)