In this part of the series, we’ll dive into how the framework ensures reliability and resilience by detecting failures and automatically rebalancing the workload across a dynamic cluster of nodes. With no centralized scheduler or messaging broker dependency, node health and task recovery are entirely database-coordinated, lightweight, and highly configurable.
🔥 Failure Happens — Let’s Handle It Gracefully
In traditional Spring Batch setups, node failure often leads to partial execution or requires manual intervention. With this coordination framework, failure is a first-class citizen, and node recovery is built-in.
The system uses a two-step failure detection mechanism:
✅ Step 1: Detecting Unreachable Nodes
Every active node updates its timestamp in the BATCH_NODES table at regular intervals (the heartbeat).
If a node fails to update within a configurable timeout (e.g., 30 seconds), it is marked as UNREACHABLE.
This state gives the node time to recover from temporary issues like GC pauses or transient network glitches.
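To make the mechanics concrete, here is a minimal sketch of what such a heartbeat loop and unreachable check could look like, assuming a JdbcTemplate-based setup with scheduling enabled. The column names (NODE_ID, LAST_HEARTBEAT, STATUS) and property keys are illustrative assumptions, not the framework's actual schema or configuration:

```java
// Hypothetical heartbeat writer and unreachable-node detector.
// Assumes @EnableScheduling is active and that BATCH_NODES has
// NODE_ID, LAST_HEARTBEAT, and STATUS columns (illustrative names).
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.sql.Timestamp;
import java.time.Instant;

@Component
public class NodeHeartbeat {

    private final JdbcTemplate jdbc;
    private final String nodeId = System.getenv().getOrDefault("NODE_ID", "node-1"); // assumed ID source

    public NodeHeartbeat(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Step 1a: each node refreshes its own heartbeat timestamp at a fixed interval.
    @Scheduled(fixedDelayString = "${batch.cluster.heartbeat-interval:10000}")
    public void beat() {
        jdbc.update("UPDATE BATCH_NODES SET LAST_HEARTBEAT = ? WHERE NODE_ID = ?",
                Timestamp.from(Instant.now()), nodeId);
    }

    // Step 1b: any node can flag peers whose heartbeat is older than the timeout.
    @Scheduled(fixedDelayString = "${batch.cluster.check-interval:15000}")
    public void markUnreachable() {
        Timestamp cutoff = Timestamp.from(Instant.now().minusSeconds(30)); // 30s timeout from above
        jdbc.update("UPDATE BATCH_NODES SET STATUS = 'UNREACHABLE' " +
                    "WHERE STATUS = 'ACTIVE' AND LAST_HEARTBEAT < ?", cutoff);
    }
}
```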
✅ Step 2: Deregistering Stale Nodes
If the node remains unreachable beyond a second threshold (e.g., 2–3 minutes), it is considered stale and removed from the coordination table.
At this point, any tasks (partitions) it was executing become eligible for reassignment — but only if they were marked as is_transferrable = true
when created. This allows for fine-grained control over which partitions can move between nodes and which are sticky by design.
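A rough sketch of what this second sweep could look like, again under assumed table and column names (BATCH_PARTITIONS, ASSIGNED_NODE, is_transferrable); the framework's real schema and queries may differ:

```java
// Hypothetical stale-node sweep: frees transferrable partitions held by
// stale nodes, then deregisters those nodes. Table/column names are assumptions.
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.sql.Timestamp;
import java.time.Instant;

@Component
public class StaleNodeSweeper {

    private final JdbcTemplate jdbc;

    public StaleNodeSweeper(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Scheduled(fixedDelayString = "${batch.cluster.stale-check-interval:60000}")
    public void sweep() {
        Timestamp staleBefore = Timestamp.from(Instant.now().minusSeconds(180)); // ~3 minutes

        // Release only partitions explicitly marked as transferrable so that
        // sticky partitions stay pinned to their original node.
        jdbc.update(
            "UPDATE BATCH_PARTITIONS SET ASSIGNED_NODE = NULL " +
            "WHERE is_transferrable = true AND ASSIGNED_NODE IN " +
            "  (SELECT NODE_ID FROM BATCH_NODES WHERE STATUS = 'UNREACHABLE' AND LAST_HEARTBEAT < ?)",
            staleBefore);

        // Remove the stale nodes from the coordination table.
        jdbc.update(
            "DELETE FROM BATCH_NODES WHERE STATUS = 'UNREACHABLE' AND LAST_HEARTBEAT < ?",
            staleBefore);
    }
}
```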
🔄 Node Rejoin Logic
If the node recovers after being marked unreachable or deregistered, it can:
- Re-register itself into the BATCH_NODES table (see the sketch after this list)
- Participate in future partition assignment rounds
- Remain fully stateless from a master node’s perspective
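Rejoining can be as simple as upserting the node's own row back into BATCH_NODES. The sketch below assumes the same illustrative columns as above and is not the framework's actual registration code:

```java
// Hypothetical rejoin logic: try to reactivate an existing row, and if the
// node was already deregistered, insert a fresh one. Column names are assumptions.
import org.springframework.jdbc.core.JdbcTemplate;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.sql.Timestamp;
import java.time.Instant;

public class NodeRegistrar {

    private final JdbcTemplate jdbc;

    public NodeRegistrar(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    public void register(String nodeId) throws UnknownHostException {
        String host = InetAddress.getLocalHost().getHostName();
        Timestamp now = Timestamp.from(Instant.now());

        int updated = jdbc.update(
            "UPDATE BATCH_NODES SET STATUS = 'ACTIVE', LAST_HEARTBEAT = ?, HOST = ? WHERE NODE_ID = ?",
            now, host, nodeId);

        if (updated == 0) {
            // The row was removed during deregistration: re-register from scratch.
            jdbc.update(
                "INSERT INTO BATCH_NODES (NODE_ID, HOST, STATUS, LAST_HEARTBEAT) VALUES (?, ?, 'ACTIVE', ?)",
                nodeId, host, now);
        }
    }
}
```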
🧠 Masterless but Intelligent Coordination
This design avoids a single point of failure by decentralizing the intelligence:
- The node that launches the job acts as the temporary “master” for coordination
- Other nodes participate in a shared decision-making protocol using the database
- No Zookeeper, Redis, or Kafka needed
📊 Built-in Observability in v2.0.0
This version includes Actuator-based monitoring endpoints:
- /actuator/health:
  - batchCluster: shows total and active nodes
  - batchClusterNode: current node ID, load, heartbeat, status, uptime
- /actuator/batch-cluster:
  - Lists all registered nodes with status, load, heartbeat timestamps, and host info
This gives real-time visibility into the state of the cluster without requiring any additional dashboards or monitoring tools.
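The framework ships these endpoints out of the box, but for readers curious about the shape of such a contribution, here is a small sketch of how a batchCluster health detail could be produced with a standard Spring Boot HealthIndicator. The bean name and query are assumptions for illustration only:

```java
// Illustrative HealthIndicator contributing a "batchCluster" entry to /actuator/health.
// Not the framework's actual implementation; the query and up/down rule are assumptions.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component("batchClusterHealthIndicator") // bean name yields the "batchCluster" health component
public class BatchClusterHealthIndicator implements HealthIndicator {

    private final JdbcTemplate jdbc;

    public BatchClusterHealthIndicator(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override
    public Health health() {
        Integer total = jdbc.queryForObject("SELECT COUNT(*) FROM BATCH_NODES", Integer.class);
        Integer active = jdbc.queryForObject(
            "SELECT COUNT(*) FROM BATCH_NODES WHERE STATUS = 'ACTIVE'", Integer.class);

        Health.Builder builder = (active != null && active > 0) ? Health.up() : Health.down();
        return builder
            .withDetail("totalNodes", total)
            .withDetail("activeNodes", active)
            .build();
    }
}
```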
💡 What's Next?
In Part 4, we’ll walk through how to build a real-world distributed Spring Batch job using this coordination framework. You'll learn:
- How to integrate the framework into your Spring Boot project using Maven
- How to launch a multi-node cluster and observe active node discovery
- How round-robin and fixed-node partitioning work in practice
- How task distribution is calculated dynamically based on the current cluster size
We'll demonstrate this with a practical ETL example, where the master node splits a large CSV file into logical partitions (e.g., by line number or ID range), and each worker node processes its assigned chunk by transforming it into XML.
This simulates real-world scenarios like customer data migration or financial transaction reporting, helping you confidently apply the framework to production-scale workloads.
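As a taste of what is coming, a line-range partitioner for that CSV scenario might look roughly like the sketch below. The context keys (startLine, endLine) and the constructor argument are assumptions; Part 4 will show the real wiring:

```java
// Preview sketch: split a CSV of totalLines rows into gridSize line ranges.
// Key names and the sizing strategy are illustrative assumptions.
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

import java.util.HashMap;
import java.util.Map;

public class LineRangePartitioner implements Partitioner {

    private final long totalLines;

    public LineRangePartitioner(long totalLines) {
        this.totalLines = totalLines;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long chunk = (totalLines + gridSize - 1) / gridSize; // ceiling division

        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("startLine", i * chunk + 1);
            ctx.putLong("endLine", Math.min((i + 1) * chunk, totalLines));
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```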
Stay tuned — the next part is hands-on and code-heavy!