Janardhan Chejarla

Posted on Aug 3

Distributed Spring Batch Coordination, Part 5: Monitoring, Observability, and Health Checks

#springbatch #java #opensource #cloudnative

🔍 Introduction

As distributed jobs scale across nodes, observability becomes essential. In this part, we explore how the spring-batch-db-cluster-partitioning framework exposes real-time cluster health, task loads, and node statuses — giving developers the visibility needed for debugging, performance tuning, and production readiness.

📊 Exposing Cluster State with Actuator Endpoints

Spring Boot Actuator endpoints provide a natural interface to expose cluster state. This framework contributes two custom health indicators to `/actuator/health` and adds a dedicated `/actuator/batch-cluster` endpoint:

✅ `/actuator/health`

The health response includes two framework-specific components, `batchCluster` and `batchClusterNode`:

"batchCluster": {
  "status": "UP",
  "details": {
    "Total Active Nodes": "3",
    "Total Nodes in Cluster": "3"
  }
},
"batchClusterNode": {
  "status": "UP",
  "details": {
    "Current Load (number of live tasks)": "2",
    "Node Status": "ACTIVE",
    "Node Id": "worker-1",
    "Last Heartbeat Update Time": "2025-08-03T02:43:07.133+00:00",
    "Start Time": "2025-08-03T02:42:32.092+00:00"
  }
}

This helps you instantly verify:

Overall cluster health, including how many nodes are active and how many are registered in total
If the current node is active and responsive
How many tasks it's executing
When it last sent a heartbeat
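The timestamps in the health details are ISO-8601 strings with an offset, so the age of a heartbeat can be derived with the JDK alone. The following is a minimal sketch, not part of the framework: the class name `HeartbeatAge`, its methods, and the idea of a caller-supplied staleness threshold are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.OffsetDateTime;

/** Hypothetical helper for illustration; not part of the framework. */
final class HeartbeatAge {

    /** Time elapsed between two ISO-8601 offset timestamps, e.g. "Start Time" and "Last Heartbeat Update Time". */
    static Duration between(String earlier, String later) {
        OffsetDateTime start = OffsetDateTime.parse(earlier);
        OffsetDateTime end = OffsetDateTime.parse(later);
        return Duration.between(start, end);
    }

    /** True when the heartbeat is older than maxAge relative to now (threshold chosen by the caller). */
    static boolean isStale(String lastHeartbeat, OffsetDateTime now, Duration maxAge) {
        Duration age = Duration.between(OffsetDateTime.parse(lastHeartbeat), now);
        return age.compareTo(maxAge) > 0;
    }
}
```

With the sample values above, `HeartbeatAge.between(startTime, lastHeartbeat)` covers the roughly 35 seconds the node had been running when it last reported in. A sensible `maxAge` depends on the heartbeat interval configured for your cluster.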

🌐 `/actuator/batch-cluster` – Full Cluster Snapshot

This custom endpoint provides a full view of the entire cluster state:

{
  "nodes": [
    {
      "Started At": "2025-08-03T02:42:32.092+00:00",
      "Node Id": "worker-1",
      "Current Load (# of tasks)": 2,
      "Host": "worker-1.company.local",
      "Last Heartbeat": "2025-08-03T02:43:07.133+00:00",
      "Status": "ACTIVE"
    }
  ],
  "totalNodes": 3
}

It includes:

All registered nodes
Current task count
Last heartbeat timestamp
Node status (ACTIVE, UNREACHABLE, etc.)
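To pull this snapshot from outside the application, a plain HTTP GET is enough. The sketch below uses the JDK's `java.net.http` client. It assumes the endpoint is exposed over HTTP under the default `/actuator` base path; the class name `ClusterSnapshotClient`, the five-second timeout, and the base URL passed in by the caller are illustrative choices, not framework settings.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical client for illustration; not part of the framework. */
final class ClusterSnapshotClient {

    /** Builds a GET request for the cluster snapshot; baseUrl is e.g. "http://localhost:8080" (assumed). */
    static HttpRequest buildRequest(String baseUrl) {
        // Strip a trailing slash so the path separator is not doubled.
        String root = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return HttpRequest.newBuilder(URI.create(root + "/actuator/batch-cluster"))
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw JSON body, failing on any non-200 status. */
    static String fetch(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

The returned string can then be handed to whichever JSON library your tooling already uses.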

📈 Node Load Metrics

Cluster load is computed based on:

Active tasks being executed per node
Heartbeat freshness
Task reassignment if a node becomes unreachable

This allows external monitoring tools to:

Detect load imbalance
Alert on stale heartbeats
Audit execution trends over time
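As a rough illustration of the first two checks, an external monitor could apply logic along these lines once it has parsed the snapshot. This is a sketch under assumptions: the `NodeSnapshot` record, the method names, and the "busiest minus idlest" definition of imbalance are hypothetical and are not how the framework itself computes load.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.IntSummaryStatistics;
import java.util.List;

/** Hypothetical monitoring checks for illustration; not part of the framework. */
final class ClusterLoadChecks {

    /** Assumed view of one node entry from the cluster snapshot. */
    record NodeSnapshot(String nodeId, int currentLoad, Instant lastHeartbeat) {}

    /** IDs of nodes whose last heartbeat is older than maxAge at the given instant. */
    static List<String> staleNodes(List<NodeSnapshot> nodes, Instant now, Duration maxAge) {
        return nodes.stream()
                .filter(n -> Duration.between(n.lastHeartbeat(), now).compareTo(maxAge) > 0)
                .map(NodeSnapshot::nodeId)
                .toList();
    }

    /** Difference in task count between the busiest and the idlest node; 0 for an empty list. */
    static int loadSpread(List<NodeSnapshot> nodes) {
        if (nodes.isEmpty()) {
            return 0;
        }
        IntSummaryStatistics stats =
                nodes.stream().mapToInt(NodeSnapshot::currentLoad).summaryStatistics();
        return stats.getMax() - stats.getMin();
    }
}
```

An alerting rule might fire when `staleNodes` is non-empty or when `loadSpread` stays above a threshold for several polls. Both thresholds are deployment-specific.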

📌 Best Practices

📦 Use static node IDs (e.g., worker-1, worker-2) for easier observability
🛠️ Integrate with Prometheus/Grafana using custom endpoints or intermediate exporters
🧪 Monitor node health to detect failures before partitions get stuck
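For the Prometheus route, an intermediate exporter mostly has to translate per-node load into the Prometheus text exposition format. Here is a minimal sketch of that translation. The metric name `batch_cluster_node_load`, the `node_id` label, and the `NodeLoad` record are assumptions for illustration; the framework does not define them.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical exporter helper for illustration; not part of the framework. */
final class PrometheusNodeLoadFormatter {

    /** Assumed minimal view of a node: its ID and its current task count. */
    record NodeLoad(String nodeId, int currentLoad) {}

    /** Renders one gauge sample per node in the Prometheus text exposition format. */
    static String format(List<NodeLoad> nodes) {
        String header = "# TYPE batch_cluster_node_load gauge\n";
        if (nodes.isEmpty()) {
            return header;
        }
        String samples = nodes.stream()
                .map(n -> "batch_cluster_node_load{node_id=\"" + escape(n.nodeId()) + "\"} "
                        + n.currentLoad())
                .collect(Collectors.joining("\n", "", "\n"));
        return header + samples;
    }

    /** Escapes backslash, double quote, and newline, as the text format requires for label values. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

Static node IDs pay off here: a stable `node_id` label keeps each worker's time series continuous across restarts, so Grafana panels do not fragment into short-lived series.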

✅ What’s Next (Part 6 Preview)

In the next part, we’ll cover:

⚠️ Failure handling and retries
🧯 What happens when a node crashes mid-job
🔄 How tasks are reassigned or resumed
🧠 Retry strategies to ensure data consistency

⭐ Want More?

Explore the code: GitHub – spring-batch-db-cluster-partitioning
Read earlier parts in the series: Dev.to article series
