Distributed Spring Batch Coordination – Part 3: Fault Tolerance and Node Recovery

#springbatch #java #opensource #cloudnative

In this part of the series, we’ll look at how the framework stays reliable by detecting node failures and automatically rebalancing the workload across a dynamic cluster of nodes. There is no centralized scheduler or messaging broker dependency: node health and task recovery are coordinated entirely through the database, which keeps the mechanism lightweight and configurable.

🔥 Failure Happens — Let’s Handle It Gracefully

In traditional Spring Batch setups, a node failure often leaves a job partially executed or requires manual intervention. This coordination framework treats failure as an expected event, so node recovery is built in.

The system uses a two-step failure detection mechanism:

✅ Step 1: Detecting Unreachable Nodes

Every active node updates its timestamp in the BATCH_NODES table at regular intervals (heartbeat).

If a node fails to update within a configurable timeout (e.g., 30 seconds), it is marked as UNREACHABLE.

This state gives the node time to recover from temporary issues like GC pauses or transient network glitches.

✅ Step 2: Deregistering Stale Nodes

If the node remains unreachable beyond a second threshold (e.g., 2–3 minutes), it is considered stale and removed from the coordination table.
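
The two thresholds can be read as one classification over how long a node has been silent. Here is a minimal Java sketch of that idea, not the framework's actual code: the class name `NodeHealthClassifier`, the `ACTIVE` and `STALE` enum values, and the method signature are assumptions made for illustration. Only the `UNREACHABLE` state and the example timeouts come from the description above.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: classifies a node by the age of its last heartbeat. */
final class NodeHealthClassifier {

    /** UNREACHABLE is named in the article; ACTIVE and STALE are assumed labels. */
    enum NodeStatus { ACTIVE, UNREACHABLE, STALE }

    private final Duration unreachableAfter; // first threshold, e.g. 30 seconds
    private final Duration staleAfter;       // second threshold, e.g. 2 minutes

    NodeHealthClassifier(Duration unreachableAfter, Duration staleAfter) {
        if (staleAfter.compareTo(unreachableAfter) <= 0) {
            throw new IllegalArgumentException("staleAfter must be longer than unreachableAfter");
        }
        this.unreachableAfter = unreachableAfter;
        this.staleAfter = staleAfter;
    }

    /** Returns the status implied by the time elapsed since the last heartbeat. */
    NodeStatus classify(Instant lastHeartbeat, Instant now) {
        Duration silence = Duration.between(lastHeartbeat, now);
        if (silence.compareTo(staleAfter) > 0) {
            return NodeStatus.STALE;        // candidate for deregistration
        }
        if (silence.compareTo(unreachableAfter) > 0) {
            return NodeStatus.UNREACHABLE;  // grace period, node may still recover
        }
        return NodeStatus.ACTIVE;
    }
}
```

With thresholds of 30 seconds and 2 minutes, a node that last reported 45 seconds ago would be classified as `UNREACHABLE`, and one that has been silent for 5 minutes as `STALE`.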

At this point, any tasks (partitions) it was executing become eligible for reassignment — but only if they were marked as is_transferrable = true when created. This allows for fine-grained control over which partitions can move between nodes and which are sticky by design.
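
To make the transferable versus sticky distinction concrete, here is a rough Java sketch of a reassignment pass. It is an assumption-laden illustration rather than the framework's implementation: the `Partition` record, the `PartitionReassigner` class, and the round-robin choice of new owner are all hypothetical. Only the `is_transferrable` flag is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: moves transferable partitions off a deregistered node. */
final class PartitionReassigner {

    /** Hypothetical view of a partition row; 'transferrable' mirrors is_transferrable. */
    record Partition(String id, String ownerNodeId, boolean transferrable) {}

    /**
     * Returns a new list in which every transferable partition owned by the stale
     * node is handed to a surviving node (round-robin, an assumed policy).
     * Sticky partitions and partitions owned by other nodes are returned unchanged.
     */
    static List<Partition> reassign(List<Partition> partitions,
                                    String staleNodeId,
                                    List<String> survivingNodeIds) {
        if (survivingNodeIds.isEmpty()) {
            return List.copyOf(partitions); // nobody to take over; leave for a later round
        }
        List<Partition> result = new ArrayList<>(partitions.size());
        int next = 0;
        for (Partition p : partitions) {
            if (p.ownerNodeId().equals(staleNodeId) && p.transferrable()) {
                String newOwner = survivingNodeIds.get(next % survivingNodeIds.size());
                next++;
                result.add(new Partition(p.id(), newOwner, true));
            } else {
                result.add(p);
            }
        }
        return result;
    }
}
```

In a database-coordinated setup this logic would more likely be a conditional UPDATE on the partition table, but the in-memory version shows the rule itself: a partition moves only if its owner is gone and it was created as transferable.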

🔄 Node Rejoin Logic

If the node recovers after being marked unreachable or deregistered, it can:

- Re-register itself into the BATCH_NODES table
- Participate in future partition assignment rounds
- Remain fully stateless from a master node’s perspective
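
One way to picture the rejoin behaviour is to treat every heartbeat as an upsert: if the node's row is still present it is refreshed, and if it was removed it is simply created again. The Java sketch below models this with an in-memory map standing in for the BATCH_NODES table. The class `InMemoryNodeRegistry`, the `NodeRow` record, and the status strings are assumptions for illustration and do not reflect the framework's real schema or API.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: an in-memory stand-in for the BATCH_NODES table. */
final class InMemoryNodeRegistry {

    /** Hypothetical row shape; column names are not taken from the framework. */
    record NodeRow(String nodeId, Instant lastHeartbeat, String status) {}

    private final Map<String, NodeRow> rows = new ConcurrentHashMap<>();

    /** Upsert: refreshes an existing row or re-creates one that was deregistered. */
    void heartbeat(String nodeId, Instant now) {
        rows.put(nodeId, new NodeRow(nodeId, now, "ACTIVE"));
    }

    /** Marks a known node as UNREACHABLE without touching its last heartbeat. */
    void markUnreachable(String nodeId) {
        rows.computeIfPresent(nodeId,
                (id, row) -> new NodeRow(id, row.lastHeartbeat(), "UNREACHABLE"));
    }

    /** Removes a stale node from the table. */
    void deregister(String nodeId) {
        rows.remove(nodeId);
    }

    boolean isRegistered(String nodeId) {
        return rows.containsKey(nodeId);
    }

    /** Returns the node's status, or null if it is not registered. */
    String statusOf(String nodeId) {
        NodeRow row = rows.get(nodeId);
        return row == null ? null : row.status();
    }
}
```

Because the recovering node needs nothing beyond its own ID to rejoin, the coordinating node does not have to remember anything about it in the meantime.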

🧠 Masterless but Intelligent Coordination

This design avoids a single point of failure because no node is permanently in charge of coordination:

- The node that launches the job acts as the temporary “master” for coordination
- Other nodes participate in a shared decision-making protocol using the database
- No Zookeeper, Redis, or Kafka needed

📊 Built-in Observability in v2.0.0

This version includes Actuator-based monitoring endpoints:

/actuator/health:
- batchCluster: shows total and active nodes
- batchClusterNode: current node ID, load, heartbeat, status, uptime
/actuator/batch-cluster:
- Lists all registered nodes with status, load, heartbeat timestamps, and host info

This gives real-time visibility into the state of the cluster without requiring any additional dashboards or monitoring tools.
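
For a quick check from code rather than a browser, the endpoints can be polled with the JDK's built-in HTTP client. The sketch below is a minimal example under stated assumptions: the class name `ClusterStatusClient` is made up, the base URL (for example `http://localhost:8080`) depends on your deployment, and the response is returned as a raw string because the exact JSON shape is not described here.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Illustrative only: fetches the cluster endpoints as raw JSON text. */
final class ClusterStatusClient {

    /** Builds a GET request for an actuator path such as "/actuator/batch-cluster". */
    static HttpRequest request(String baseUrl, String path) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .timeout(Duration.ofSeconds(5))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body without parsing it. */
    static String fetch(String baseUrl, String path) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request(baseUrl, path), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `ClusterStatusClient.fetch("http://localhost:8080", "/actuator/batch-cluster")` against a running node should return the node list. Two standard Spring Boot settings are worth checking if the output looks empty: component details under /actuator/health are only shown when `management.endpoint.health.show-details` allows it, and a non-default endpoint such as batch-cluster typically has to be exposed through `management.endpoints.web.exposure.include`.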

💡 What's Next?

In Part 4, we’ll walk through how to build a real-world distributed Spring Batch job using this coordination framework. You'll learn:

- How to integrate the framework into your Spring Boot project using Maven
- How to launch a multi-node cluster and observe active node discovery
- How round-robin and fixed-node partitioning work in practice
- How task distribution is calculated dynamically based on the current cluster size

We'll demonstrate this with a practical ETL example, where the master node splits a large CSV file into logical partitions (e.g., by line number or ID range), and each worker node processes its assigned chunk by transforming it into XML.

This simulates real-world scenarios like customer data migration or financial transaction reporting, helping you confidently apply the framework to production-scale workloads.

Stay tuned — the next part is hands-on and code-heavy!