Batch processing offers throughput and completeness; stream processing offers low latency. Your architecture must decide which metric survives production requirements.
What We're Building
We are designing a financial transaction monitoring system. This system requires real-time fraud detection for immediate blocking but also daily reconciliation for regulatory compliance. We cannot choose one paradigm exclusively. We must structure our data pipelines to handle high-volume historical data efficiently while processing live events with low latency. The core trade-off involves computational resources, consistency models, and operational complexity.
Step 1 — Define Latency Requirements
The first step is distinguishing between micro-batch and real-time needs. Micro-batches are small chunks processed periodically. Real-time implies processing each record as it arrives. We define a latency threshold T. If user experience demands T < 100 ms, batch is insufficient. If the system tolerates T of up to five minutes, batch becomes viable for cost savings.
# Latency configuration struct
from dataclasses import dataclass

@dataclass
class ProcessingConfig:
    batch_interval_seconds: int
    stream_latency_threshold_ms: int
    consistency_model: str

config = ProcessingConfig(
    batch_interval_seconds=60,
    stream_latency_threshold_ms=100,
    consistency_model="at_least_once",
)
Defining these constraints upfront prevents over-engineering the infrastructure with unnecessary state stores.
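The decision rule above can be sketched as a small helper. This is a hypothetical function, not part of any library; the 100 ms and five-minute cutoffs mirror the thresholds discussed above.

```python
# Hypothetical helper: map a latency budget to a processing paradigm.
# The default thresholds mirror the ones discussed in this step.
def choose_paradigm(latency_budget_ms: int,
                    stream_threshold_ms: int = 100,
                    batch_interval_s: int = 300) -> str:
    """Return 'stream', 'micro-batch', or 'batch' for a given budget."""
    if latency_budget_ms <= stream_threshold_ms:
        return "stream"       # per-record processing required
    if latency_budget_ms < batch_interval_s * 1000:
        return "micro-batch"  # small periodic chunks suffice
    return "batch"            # full batch is viable and cheaper

print(choose_paradigm(50))       # fraud blocking -> "stream"
print(choose_paradigm(600_000))  # daily reconciliation -> "batch"
```

Encoding the rule this way makes the latency contract explicit and testable before any infrastructure is provisioned.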
Step 2 — Implement Batch Layer
For the reconciliation task, we ingest terabytes of historical logs. Batch processing allows us to parallelize work across a cluster without worrying about message ordering for every single event. We use Apache Spark to aggregate data over fixed windows. This approach maximizes throughput by utilizing local disk storage and minimizing network overhead for heavy computation.
# Pseudo-code for a Spark batch job
def process_daily_batch(input_path, output_path):
    df = spark.read.parquet(input_path)
    windowed_data = df.groupBy("date").sum("amount")
    windowed_data.write.parquet(output_path)
    windowed_data.write.mode("overwrite").saveAsTable("daily_report")
Batch processing is cost-effective for large datasets where immediate insight is not critical to business logic.
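The same group-by-date aggregation can be expressed as a minimal, dependency-free sketch, which is handy for unit-testing the logic before running it on a cluster. The field names ("date", "amount") mirror the Spark example; the records themselves are illustrative.

```python
# Dependency-free sketch of the fixed-window (per-date) aggregation.
from collections import defaultdict

def aggregate_by_date(records):
    """Group transaction records by date and sum their amounts."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["date"]] += rec["amount"]
    return dict(totals)

txns = [
    {"date": "2024-01-01", "amount": 120.0},
    {"date": "2024-01-01", "amount": 80.0},
    {"date": "2024-01-02", "amount": 50.0},
]
print(aggregate_by_date(txns))  # {'2024-01-01': 200.0, '2024-01-02': 50.0}
```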
Step 3 — Implement Stream Layer
For fraud detection, we need a continuous consumer. We subscribe to a message broker like Apache Kafka. The stream processor maintains state in memory or in RocksDB to detect patterns within sliding windows. This requires handling backpressure and ensuring exactly-once semantics if financial integrity is non-negotiable.
# Pseudo-code for a stream consumer
def process_stream_events(consumer_group):
    for event in kafka_consumer:
        risk_score = risk_model.predict(event)
        if risk_score > threshold:
            alert_service.notify(event.user_id)
            sink.write("alert_log", event)
Stream processing introduces overhead but enables immediate intervention, critical for high-stakes environments like finance.
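The sliding-window state mentioned above can be sketched with an in-memory tracker of per-user spend, standing in for a RocksDB-backed store in a real stream processor. The 60-second window and the velocity threshold here are illustrative, not recommendations.

```python
# Sketch: sliding-window spend tracker, a stand-in for RocksDB-backed state.
from collections import defaultdict, deque

class SlidingWindowMonitor:
    def __init__(self, window_seconds=60, max_total=1000.0):
        self.window = window_seconds
        self.max_total = max_total
        self.events = defaultdict(deque)  # user_id -> deque of (ts, amount)

    def observe(self, user_id, ts, amount):
        """Record an event; return True if the windowed total is suspicious."""
        q = self.events[user_id]
        q.append((ts, amount))
        while q and q[0][0] <= ts - self.window:  # evict expired events
            q.popleft()
        return sum(a for _, a in q) > self.max_total

monitor = SlidingWindowMonitor()
print(monitor.observe("u1", 0, 600.0))    # False: 600 within limit
print(monitor.observe("u1", 30, 600.0))   # True: 1200 in one window
print(monitor.observe("u1", 120, 100.0))  # False: old events expired
```

Note why state is unavoidable here: each decision depends on events the consumer has already seen, exactly the context a stateless consumer discards.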
Step 4 — Manage Stateful Processing
State management is the hidden cost of both paradigms. Batch layers often store state in Parquet files on object storage. Stream layers often require managed systems like Flink or Kafka Streams to retain checkpoints. Without managing state, you lose context required for windowing. A stateless stream consumer cannot aggregate totals or detect sequence anomalies across a session.
[BATCH]  -> [SINK STORE]  -> [LOAD ON DEMAND]
[STREAM] <- [STATE STORE] <- [CHECKPOINT]
Separating state management from computation simplifies scaling and debugging during failures.
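The checkpoint/restore cycle can be illustrated with a tiny key-value store, mimicking what Flink or Kafka Streams does with far more machinery. This is a toy sketch; real systems snapshot to durable storage, not a second in-memory dict.

```python
# Toy sketch: state store with checkpoint/restore, per the diagram above.
import copy

class CheckpointedStore:
    def __init__(self):
        self._state = {}
        self._checkpoint = {}

    def put(self, key, value):
        self._state[key] = value

    def get(self, key, default=None):
        return self._state.get(key, default)

    def checkpoint(self):
        """Snapshot current state so a failed worker can resume from it."""
        self._checkpoint = copy.deepcopy(self._state)

    def restore(self):
        """Roll back to the last checkpoint after a failure."""
        self._state = copy.deepcopy(self._checkpoint)

store = CheckpointedStore()
store.put("u1:total", 900.0)
store.checkpoint()
store.put("u1:total", 1500.0)  # update lost in a simulated crash
store.restore()
print(store.get("u1:total"))   # 900.0
```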
Step 5 — Operational Trade-offs
Finally, consider the operational reality. Batch jobs are easier to pause and restart but introduce a cold start latency. Stream jobs require constant monitoring of consumer lag and state store health. Hybrid architectures must orchestrate both via a service bus. Overlapping state stores can lead to data duplication. You must design APIs that unify these views.
# Hybrid orchestration logic
def orchestrate_system():
    if current_time % BATCH_INTERVAL == 0:
        trigger_batch_job()
    else:
        continue_stream_loop()
Balancing these requirements ensures robustness without sacrificing performance.
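The unified-view API can be sketched in the spirit of a Lambda architecture's serving layer: fresh stream deltas are layered over the last completed batch aggregate. Both inputs are plain dicts here; in production they would be a warehouse table and a state store.

```python
# Sketch: serving-layer merge of batch and stream views (Lambda-style).
def unified_totals(batch_view, stream_deltas):
    """Merge a daily batch aggregate with not-yet-batched stream updates."""
    merged = dict(batch_view)
    for key, delta in stream_deltas.items():
        merged[key] = merged.get(key, 0.0) + delta
    return merged

batch = {"acct-1": 900.0, "acct-2": 40.0}
live = {"acct-1": 25.0, "acct-3": 10.0}
print(unified_totals(batch, live))
# {'acct-1': 925.0, 'acct-2': 40.0, 'acct-3': 10.0}
```

Merging at read time is one way to avoid the data duplication that overlapping state stores invite: each event is counted in exactly one of the two views at any moment.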
Key Takeaways
- Latency vs. Throughput — Batch wins on volume and cost; stream wins on time-to-insight.
- Consistency Models — Batch offers strong consistency per run; streams typically settle for eventual consistency with at-least-once or exactly-once delivery.
- State Management — Batch persists state in durable storage; streams keep it in memory or local stores backed by checkpoints.
- Schema Evolution — Batch can reprocess everything after a schema change; streams must handle evolving schemas in flight.
- Failure Handling — Batch recovers by re-running the job; streams recover by replaying from a durable log.
What's Next
Next, you might explore event-driven architectures using Apache Kafka and Apache Flink. You can look into optimizing micro-batch latency for near real-time needs. Consider whether a Lambda architecture or a Kappa architecture best fits your team's constraints.
Further Reading
- Designing Data-Intensive Applications (Kleppmann) — The canonical reference on batch vs. stream trade-offs, replication, and consistency.
- A Philosophy of Software Design (Ousterhout) — Covers managing complexity and abstraction barriers in large codebases.
Part of the Architecture Patterns series.