
Janardhan Chejarla


Distributed Spring Batch Coordination, Part 5: Monitoring, Observability, and Health Checks

🔍 Introduction

As distributed jobs scale across nodes, observability becomes essential. In this part, we explore how the spring-batch-db-cluster-partitioning framework exposes real-time cluster health, task loads, and node statuses, giving developers the visibility needed for debugging, performance tuning, and production readiness.


📊 Exposing Cluster State with Actuator Endpoints

Spring Boot Actuator endpoints provide a natural interface to expose cluster state. This framework adds two custom indicators:

✅ /actuator/health

This includes:

"batchCluster": {
  "status": "UP",
  "details": {
    "Total Active Nodes": "3",
    "Total Nodes in Cluster": "3"
  }
},
"batchClusterNode": {
  "status": "UP",
  "details": {
    "Current Load (number of live tasks)": "2",
    "Node Status": "ACTIVE",
    "Node Id": "worker-1",
    "Last Heartbeat Update Time": "2025-08-03T02:43:07.133+00:00",
    "Start Time": "2025-08-03T02:42:32.092+00:00"
  }
}

This helps you instantly verify:

  • Overall cluster health, including how many nodes are active out of the total
  • If the current node is active and responsive
  • How many tasks it's executing
  • When it last sent a heartbeat
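
To act on the heartbeat timestamp rather than just read it, a monitoring script can turn it into an age and compare that against a threshold. The following is a minimal sketch, not part of the framework: the class name `HeartbeatCheck` and the threshold you pass in are assumptions, and the only thing taken from the response above is the ISO-8601 timestamp format.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.OffsetDateTime;

/** Hypothetical helper for illustration; the framework does not ship this class. */
final class HeartbeatCheck {

    /** Parses an ISO-8601 offset timestamp such as "2025-08-03T02:43:07.133+00:00". */
    static Instant parseTimestamp(String value) {
        return OffsetDateTime.parse(value).toInstant();
    }

    /** Returns true when the heartbeat is older than maxAge, measured at 'now'. */
    static boolean isStale(Instant lastHeartbeat, Instant now, Duration maxAge) {
        return Duration.between(lastHeartbeat, now).compareTo(maxAge) > 0;
    }
}
```

Passing `now` explicitly instead of calling `Instant.now()` inside the method keeps the check deterministic and easy to test. Pick a `maxAge` comfortably larger than the heartbeat interval your nodes are configured with.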

🌐 /actuator/batch-cluster – Full Cluster Snapshot

This custom endpoint provides a full view of the entire cluster state:

{
  "nodes": [
    {
      "Started At": "2025-08-03T02:42:32.092+00:00",
      "Node Id": "worker-1",
      "Current Load (# of tasks)": 2,
      "Host": "worker-1.company.local",
      "Last Heartbeat": "2025-08-03T02:43:07.133+00:00",
      "Status": "ACTIVE"
    }
  ],
  "totalNodes": 3
}

It includes:

  • All registered nodes
  • Current task count
  • Last heartbeat timestamp
  • Node status (ACTIVE, UNREACHABLE, etc.)
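
If you want to poll this snapshot from a small Java tool, the JDK's built-in HTTP client is enough. This sketch only builds the request. The base URL, the 5-second timeout, and the class name `ClusterSnapshotRequest` are assumptions, and it presumes the endpoint is exposed over HTTP under the default `/actuator` base path.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper for illustration; adjust the base URL to your deployment. */
final class ClusterSnapshotRequest {

    /** Builds a GET request for the cluster snapshot endpoint under the given base URL. */
    static HttpRequest forBaseUrl(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/batch-cluster"))
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(5)) // assumed value; tune for your network
                .GET()
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, and the returned body parsed with whichever JSON library your project already uses.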

📈 Node Load Metrics

Cluster load is computed based on:

  • Active tasks being executed per node
  • Heartbeat freshness
  • Task reassignment if a node becomes unreachable

This allows external monitoring tools to:

  • Detect load imbalance
  • Alert on stale heartbeats
  • Audit execution trends over time
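
As one way to picture load-imbalance detection, an external tool could compare the busiest and idlest nodes from the per-node task counts. This is a minimal sketch under stated assumptions: the `NodeLoad` record, the "spread" measure, and the threshold are illustrative choices, not how the framework itself computes or balances load.

```java
import java.util.IntSummaryStatistics;
import java.util.List;

/** Hypothetical helper for illustration; not part of the framework. */
final class LoadImbalance {

    /** One node's live task count, as reported in the cluster snapshot. */
    record NodeLoad(String nodeId, int currentLoad) {}

    /** Difference between the highest and lowest task count; 0 for an empty list. */
    static int spread(List<NodeLoad> nodes) {
        if (nodes.isEmpty()) {
            return 0;
        }
        IntSummaryStatistics stats = nodes.stream()
                .mapToInt(NodeLoad::currentLoad)
                .summaryStatistics();
        return stats.getMax() - stats.getMin();
    }

    /** Flags the cluster as imbalanced when the spread exceeds maxSpread. */
    static boolean isImbalanced(List<NodeLoad> nodes, int maxSpread) {
        return spread(nodes) > maxSpread;
    }
}
```

A plain max-minus-min spread is deliberately simple. If your partitions vary a lot in size, a ratio against the mean load may be a more useful signal.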

📌 Best Practices

  • 📦 Use static node IDs (e.g., worker-1, worker-2) for easier observability
  • 🛠️ Integrate with Prometheus/Grafana using custom endpoints or intermediate exporters
  • 🧪 Monitor node health to detect failures before partitions get stuck
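
For the intermediate-exporter route, the core job is translating per-node load into the Prometheus text exposition format. Here is a minimal sketch of that translation. The metric name `batch_cluster_node_load`, the `node_id` label, and the class name are made up for illustration, and the input map stands in for whatever you parse out of the cluster snapshot.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical exporter helper for illustration; metric and label names are assumptions. */
final class PrometheusNodeLoadFormatter {

    static final String METRIC = "batch_cluster_node_load";

    /** Renders one gauge sample per node, sorted by node ID for stable output. */
    static String format(Map<String, Integer> loadByNode) {
        StringBuilder out = new StringBuilder();
        out.append("# HELP ").append(METRIC).append(" Live tasks currently assigned to a node.\n");
        out.append("# TYPE ").append(METRIC).append(" gauge\n");
        for (Map.Entry<String, Integer> entry : new TreeMap<>(loadByNode).entrySet()) {
            out.append(METRIC)
               .append("{node_id=\"").append(escapeLabel(entry.getKey())).append("\"} ")
               .append(entry.getValue())
               .append('\n');
        }
        return out.toString();
    }

    /** Escapes backslash, double quote, and newline, as label values require. */
    static String escapeLabel(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

Serving this string from a small HTTP endpoint gives Prometheus something to scrape. Static node IDs keep the `node_id` label values stable across restarts, which avoids a new time series every time a worker comes back up.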

✅ What's Next (Part 6 Preview)

In the next part, we'll cover:

  • ⚠️ Failure handling and retries
  • 🧯 What happens when a node crashes mid-job
  • 🔄 How tasks are reassigned or resumed
  • 🧠 Retry strategies to ensure data consistency

