Krunal Kanojiya

Posted on Jun 26

Batch vs Real-Time Streaming: When to Use Each (with Examples)

#dataengineering #python #productivity #databricks

Every data pipeline makes one foundational decision before a single line of code is written.

Does it process data in scheduled chunks, or does it process data as events arrive?

That is the batch versus streaming decision. It looks simple on paper. In practice, it shapes everything: the tools you use, the infrastructure you maintain, the guarantees you can make about data freshness, and the cost you pay every month to keep it running.

Teams that build streaming pipelines when batch would have sufficed end up maintaining complex infrastructure for a problem that did not require it. Teams that build batch when their use case demands real-time discover the gap at the worst possible moment.

This post gives you a decision framework, a side-by-side comparison, and working code examples for both patterns. The full deep-dive with architecture diagrams, Lambda vs Kappa tradeoffs, and Databricks Real-Time Mode is at the original article on Lucent Innovation.

How Batch Works (and Where It Breaks Down)

Batch collects data over a window of time, then processes it all at once when a scheduled trigger fires.

Think of it like doing laundry. You do not wash one shirt the moment it gets dirty. You wait for a full load, then run the machine. Data accumulates throughout the day. At a scheduled time, the job picks up everything, transforms it, and loads the output.

Where batch wins:

Complex multi-table joins and heavy aggregations with no time pressure per record
ML model training on large, static datasets
Nightly financial reconciliation and compliance reporting
Historical data migrations and backfills

Where batch breaks down:

Use cases where decisions depend on data that is happening right now
A fraud detection system running on nightly batch is not a fraud system. It is a fraud reporting system.

Batch Pipeline Example (PySpark)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, date_trunc

spark = SparkSession.builder.appName("NightlyRevenueRollup").getOrCreate()

# Read from Delta Lake Bronze table
raw_orders = spark.read.format("delta").load("/mnt/bronze/orders")

# Transform: aggregate daily revenue per product
daily_revenue = (
    raw_orders
    .filter(col("status") == "completed")
    .withColumn("order_date", date_trunc("day", col("created_at")))
    .groupBy("order_date", "product_id")
    .agg(sum("amount").alias("total_revenue"))
)

# Write to Delta Lake Gold table
(
    daily_revenue.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "order_date = current_date()")
    .save("/mnt/gold/daily_revenue")
)

This job runs on a schedule (nightly, hourly, whatever the SLA allows). It reads a full time window, transforms it, and overwrites the output partition. Simple failure mode: if it breaks, you fix the logic and rerun the window.

How Streaming Works (and Where It Costs You)

Streaming treats data as a continuous flow of individual events. Each event is processed the moment it arrives, without waiting for others to accumulate.

Think of it like a moving walkway at an airport. Nobody waits for 500 people to gather before the walkway starts. It runs continuously. Each person moves forward the moment they step on.

A streaming pipeline runs 24 hours a day, 7 days a week, processing each event within milliseconds to seconds of arrival.

Where streaming wins:

Fraud detection before a transaction clears
Real-time personalization based on current browsing behavior
Operational dashboards that need second-level granularity
IoT and sensor telemetry for predictive maintenance

Where streaming costs you:

Always-on compute, persistent state storage, and continuous monitoring
State management across windowed aggregations grows with data volume
Schema changes from source systems can corrupt output silently while the pipeline keeps running
A simple batch ELT pipeline costs $15,000–$50,000 to build. A production streaming pipeline with proper monitoring: $50,000–$200,000+

Streaming Pipeline Example (Spark Structured Streaming)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("FraudSignalStream").getOrCreate()

# Define schema for incoming payment events
payment_schema = (
    StructType()
    .add("transaction_id", StringType())
    .add("user_id", StringType())
    .add("amount", DoubleType())
    .add("merchant_id", StringType())
    .add("event_time", TimestampType())
)

# Read from Kafka topic
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "payment_events")
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), payment_schema).alias("data"))
    .select("data.*")
)

# Compute 5-minute rolling spend per user for anomaly detection
windowed_spend = (
    raw_stream
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        col("user_id"),
        window(col("event_time"), "5 minutes")
    )
    .agg(sum("amount").alias("total_spend_5min"))
)

# Write results to Delta Lake for downstream fraud scoring
(
    windowed_spend.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/fraud_signal")
    .start("/mnt/silver/fraud_signals")
)

This pipeline runs continuously. The checkpointLocation ensures exactly-once processing if the job restarts. The watermark handles late-arriving events. The windowed aggregation maintains rolling state in memory.

The Decision Framework: One Question Answers It

What happens if the data is one hour old?

If the answer is nothing meaningful, batch is the right choice.

If the answer is a real business loss, streaming earns its complexity.

Four follow-up questions to confirm:

Question	If Yes	If No
Does stale data cause a direct business loss?	Streaming	Batch
Does the output trigger a real-time action?	Streaming	Batch
Does your team have streaming ops experience?	Streaming feasible	Stick to batch
Would hourly refreshes satisfy the requirement?	Micro-batch	Streaming

Micro-Batch: The Middle Ground Most Teams Overlook

Between batch and streaming sits micro-batch. It is the pattern Spark Structured Streaming uses by default and the one that solves most "near real-time" requirements without full streaming complexity.

Micro-batch runs the same pipeline logic as streaming but on a short fixed interval: every 30 seconds, every minute, every 5 minutes.

Most stakeholders who say they want "real-time" data would be fully satisfied with a dashboard that refreshes every minute. That is micro-batch, not streaming, and it costs a fraction of the infrastructure.

Latency requirement   →   Pattern
─────────────────────────────────────────────
Hours                 →   Batch (scheduled)
Minutes               →   Micro-batch (short trigger)
Sub-minute + action   →   Streaming (Structured Streaming)
Sub-second + action   →   Real-Time Mode (Databricks RTM)

Micro-Batch in Practice

# Micro-batch: trigger every 60 seconds instead of continuously
(
    windowed_spend.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(processingTime="60 seconds")   # <-- this is the only change
    .option("checkpointLocation", "/mnt/checkpoints/fraud_signal")
    .start("/mnt/silver/fraud_signals")
)

One line change. 60-second freshness. Dramatically simpler operational model.

Both Patterns in One Place: Databricks Lakeflow

On Databricks, batch and streaming live in the same pipeline definition using Lakeflow Spark Declarative Pipelines. No separate tools. No second codebase.

import dlt
from pyspark.sql.functions import col, sum, window

# STREAMING TABLE: processes each event as it arrives
@dlt.table(
    name="silver_payment_events",
    comment="Cleaned payment events, streamed from Kafka"
)
def silver_payment_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka-broker:9092")
        .option("subscribe", "payment_events")
        .load()
    )

# MATERIALIZED VIEW: runs as batch, re-computes on a schedule
@dlt.table(
    name="gold_daily_revenue",
    comment="Daily revenue rollup, refreshed hourly"
)
def gold_daily_revenue():
    return (
        dlt.read("silver_payment_events")
        .groupBy("merchant_id", "order_date")
        .agg(sum("amount").alias("total_revenue"))
    )

The streaming table (silver_payment_events) processes events continuously. The materialized view (gold_daily_revenue) refreshes on a trigger or schedule. Both are governed by Unity Catalog. Both write to Delta Lake. One pipeline definition, two patterns, zero context switching.

Real-World Pattern Map

Use Case	Pattern	Why
Nightly revenue reporting	Batch	Hourly freshness acceptable
ML model training	Batch	Full static dataset required
Historical data migration	Batch	No real-time constraint
Fraud detection	Streaming	Decision before transaction clears
Live inventory dashboard	Streaming	Stockout response requires current state
IoT anomaly detection	Streaming	Equipment failure cannot wait
Stakeholder dashboard "refreshed often"	Micro-batch	Minutes of freshness, batch cost
Compliance reporting	Batch	Fixed time window, no urgency

The Full Picture

This post covers the decision framework and working code examples for both patterns. The full article at Lucent Innovation goes deeper on:

Lambda Architecture vs Kappa Architecture for systems that need both patterns simultaneously
Databricks Real-Time Mode (GA March 2026) and how it achieves single-digit millisecond P99 latency without switching to Flink
The three hidden streaming costs teams consistently underestimate: state management, exactly-once delivery, and silent schema drift
How ELT handles streaming more naturally than ETL due to the pre-transformation bottleneck

Read the full guide here:
Batch vs Streaming Pipelines: How to Choose in 2026

Which pattern are you currently running in production? Drop it in the comments.