Udayan Sawant

Availability - Heartbeats (2)

Heartbeats

Recap
We introduced heartbeats as periodic "I'm alive" messages in distributed systems, unpacked how they support failure detection and cluster membership, and compared different heartbeat topologies: centralized monitors, peer-to-peer checks, and gossip-based designs.
We also talked about how intervals, timeouts, and simple failure detection logic turn into a real trade-off between fast detection and noisy false positives. With that mental model in place, we're ready to build a small system, examine its failure modes, and refine it toward something production-worthy.


Heartbeats Are More Than "I'm Alive"
Once you have a periodic signal, you can sneak in extra metadata.
Common piggybacked fields:

  • Current load (CPU, memory, request rate)
  • Version or build hash (for safe rolling deployments)
  • Epoch/term info (for consensus / leader election)
  • Shard ownership or partition state

Examples in real systems:

  • Load balancers: health checks may include not just "HTTP 200" but also whether the instance is overloaded.
  • Kubernetes: readiness and liveness probes gate scheduling/traffic. The kubelet periodically reports node status to the control plane.
  • Consensus protocols: Raft leaders send periodic heartbeats (AppendEntries RPCs, even empty) to assert leadership and prevent elections.

The heartbeat becomes a low-bandwidth control channel for the cluster.
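To make that concrete, here's a minimal sketch of what a heartbeat payload with piggybacked metadata might look like. The field names are purely illustrative, not any particular system's wire format:

heartbeat = {
    "node_id": "node-7",
    "ts": 1730000000.0,            # send time (sender's clock)
    # piggybacked metadata:
    "cpu_load": 0.42,              # current load
    "requests_per_sec": 118,
    "build": "a1b2c3d",            # version/build hash for rolling deploys
    "epoch": 12,                   # term/generation for consensus or rejoin logic
    "owned_shards": ["s3", "s9"],  # partition/shard ownership
}

A few extra fields cost almost nothing on top of a message you're already sending every second.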


A Tiny Heartbeat System in Python (for Intuition)

Let's sketch a simple heartbeat system in Python using asyncio.

Toy Model

  • Each worker node: keeps sending heartbeats to a central monitor over HTTP.
  • The monitor: tracks last-seen times and marks nodes as "suspected dead" if they go silent.

This is not production-ready, but it maps the theory to something concrete.

Monitor (Central Failure Detector)

import time
from typing import Dict
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is considered dead
last_seen: Dict[str, float] = {}  # node_id -> timestamp of the most recent heartbeat

class Heartbeat(BaseModel):
    node_id: str
    ts: float

@app.post("/heartbeat")
async def heartbeat(hb: Heartbeat):
    last_seen[hb.node_id] = hb.ts
    return {"status": "ok"}

@app.get("/status")
async def status():
    now = time.time()
    status = {}
    for node, ts in last_seen.items():
        delta = now - ts
        status[node] = {
            "last_seen": ts,
            "age": delta,
            "alive": delta < HEARTBEAT_TIMEOUT,
        }
    return status

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Worker (Node) Sending Heartbeats

import asyncio
import time
import httpx

MONITOR_URL = "http://localhost:8000/heartbeat"
NODE_ID = "node-1"
INTERVAL = 1.0  # seconds

async def send_heartbeats():
    async with httpx.AsyncClient() as client:
        while True:
            payload = {"node_id": NODE_ID, "ts": time.time()}
            try:
                await client.post(MONITOR_URL, json=payload, timeout=1.0)
            except Exception as e:
                # In real systems, you'd log this and possibly back off
                print(f"Failed to send heartbeat: {e}")
            await asyncio.sleep(INTERVAL)

if __name__ == "__main__":
    asyncio.run(send_heartbeats())

Run the monitor, start a couple of workers, and then kill one worker process. Within ~5 seconds, /status will show it as not alive.
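For example, a few seconds after killing node-2, GET /status might return something like this (timestamps illustrative):

{
  "node-1": {"last_seen": 1730000012.3, "age": 0.8, "alive": true},
  "node-2": {"last_seen": 1730000004.1, "age": 9.0, "alive": false}
}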

You just implemented:

  • A heartbeat sender
  • A central monitor
  • Timeouts and liveness calculation

In real systems, this evolves into:

  • Redundant monitors (no single point of failure)
  • Persistent state or shared stores, so status survives restarts (see the sketch below)
  • Gossip instead of centralization
  • Smarter failure detectors

But the mental model stays the same.
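As one example of the "persistent state" step, the monitor's in-memory dict could be swapped for a shared store. Here's a minimal sketch using Redis hashes, assuming a local Redis instance and the redis-py client; the key name is made up:

import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
HEARTBEAT_TIMEOUT = 5.0

def record_heartbeat(node_id: str) -> None:
    # Store the receive time in a shared hash so any monitor replica can read it
    r.hset("heartbeats:last_seen", node_id, time.time())

def liveness_report() -> dict:
    now = time.time()
    raw = r.hgetall("heartbeats:last_seen")  # {node_id: "timestamp", ...}
    return {
        node: {"age": now - float(ts), "alive": now - float(ts) < HEARTBEAT_TIMEOUT}
        for node, ts in raw.items()
    }

Now a monitor restart (or a second monitor replica behind the same store) sees the same state.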

Figure: request/response view of the toy implementation


Pitfalls: Where Heartbeats Get You in Trouble
Heartbeats are simple; failure detection is not.

1. Network Partitions vs Crashes
If a node stops sending heartbeats, did it:

  • Crash?
  • Lose network connectivity in one direction?
  • Hit a local resource issue (GC freeze, kernel stall)?
  • Suffer a partial partition where only some peers can see it?

From the cluster's point of view, all of these look similar: no heartbeat.
This is why systems often distinguish:

  • suspected vs definitely dead
  • transient vs permanent failure

And why many protocols allow nodes to rejoin after being declared dead, usually with higher "generation" or epoch numbers.
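A minimal sketch of the generation idea, layered on the toy monitor: each node includes a generation number that it bumps on every restart, and the monitor ignores heartbeats from stale generations (field and variable names are illustrative):

from typing import Dict, Tuple

# node_id -> (generation, last_seen_timestamp)
members: Dict[str, Tuple[int, float]] = {}

def on_heartbeat(node_id: str, generation: int, ts: float) -> None:
    known_gen, _ = members.get(node_id, (-1, 0.0))
    if generation < known_gen:
        # Stale sender: an old incarnation that was already declared dead
        return
    if generation > known_gen:
        # The node restarted and is rejoining with a fresh incarnation
        print(f"{node_id} rejoined with generation {generation}")
    members[node_id] = (generation, ts)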

2. False Positives (Flapping)
If your timeout is too aggressive, you end up in the nightmare scenario:

  • Node is alive but slow.
  • You mark it dead.
  • Failover kicks in.
  • The node comes back.
  • Now you have duplicate leaders or conflicting state.

Figure: false positive due to a pause

To avoid this, production systems often:

  • Require multiple missed heartbeats before declaring failure.
  • Use suspicion levels rather than booleans (both sketched after this list).
  • Back off decisions if there's a known network issue.
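A minimal sketch of the first two ideas, counting missed intervals and returning a suspicion level instead of a boolean; the thresholds and names are illustrative:

import time
from typing import Dict

INTERVAL = 1.0          # expected heartbeat interval
SUSPECT_AFTER = 3       # missed intervals before we suspect a node
DEAD_AFTER = 8          # missed intervals before we declare it dead

last_seen: Dict[str, float] = {}

def on_heartbeat(node_id: str) -> None:
    # A fresh heartbeat clears any suspicion immediately
    last_seen[node_id] = time.time()

def classify(node_id: str) -> str:
    ts = last_seen.get(node_id)
    if ts is None:
        return "unknown"
    missed = (time.time() - ts) / INTERVAL
    if missed < SUSPECT_AFTER:
        return "alive"
    if missed < DEAD_AFTER:
        return "suspected"  # don't fail over yet; maybe just a pause or slow network
    return "dead"

Accrual detectors like Cassandra's φ-accrual take this further by turning the gap into a continuous suspicion score, but the shape is the same.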
3. Scalability and Overhead
In very large clusters, heartbeats aren't free:

  • A fully connected graph (everyone heartbeating everyone) is O(N²).
  • Even centralized monitoring can become a bottleneck in big deployments.

Mitigations:

  • Gossip / partial views instead of full meshes.
  • Hierarchical monitors (local agents report to regional controllers).
  • Adaptive intervals (idle components heartbeat less often; sketched below).
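A minimal sketch of the adaptive-interval idea on the worker side: heartbeat more slowly when idle, snap back to the base interval when busy (the thresholds are illustrative):

BASE_INTERVAL = 1.0
MAX_INTERVAL = 30.0

def next_interval(current: float, requests_last_interval: int) -> float:
    if requests_last_interval == 0:
        # Idle: back off exponentially, capped so the node never goes fully silent
        return min(current * 2, MAX_INTERVAL)
    # Busy again: return to fast heartbeats so failures are detected quickly
    return BASE_INTERVAL

# In the worker loop:
#     await asyncio.sleep(interval)
#     interval = next_interval(interval, requests_seen)

The monitor has to know (or be told) about the longer interval, so the timeout can scale with it instead of flagging idle nodes as dead.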

Heartbeats in Systems You Already Know
This isn't an academic pattern - you've already met it in many places:

  • Kubernetes: Nodes and pods are constantly being probed; readiness/liveness checks and node status reporting are heartbeat-flavored under the hood.
  • Distributed Databases (Cassandra, etcd, ZooKeeper): Use heartbeats for membership, leader election, and ensuring quorum health. Cassandra combines gossip + φ-accrual detectors to avoid premature death certificates. 
  • Service Meshes / API Gateways: Sidecars and control planes trade health info to know where to route traffic.
  • Load Balancers & Health Checks: From AWS ALB to Nginx, health checks (active or passive) are heartbeat cousins: same idea, different framing.

Design Checklist for Heartbeats (In the Real World)
When you add heartbeats to a system, ask yourself:

  • Who monitors what? Central node? Peer-to-peer? Gossip?
  • What's the interval and timeout? How fast do you need detection vs how noisy is the environment?
  • What exactly happens on failure? Do you remove from load balancer, trigger leader election, alert humans?
  • How do nodes rejoin? Can a previously-dead node come back safely (with a new epoch/generation)?
  • What's the scale? 10 nodes, 1000 nodes, or 100k IoT devices with flaky connections?
  • Do you piggyback metadata? Version, load, shard info, etc.


If you can answer these, your heartbeat design is already ahead of many real production setups.


TL;DR
Heartbeats are the kind of thing you rarely brag about in postmortems or blog posts - until they break.
They're just small, repetitive, almost boring messages. But they give distributed systems something like a nervous system: a way to sense which parts are alive, which are failing, and when to adapt.
Design them carelessly, and you get false alarms, flapping nodes, and mysterious outages. Design them thoughtfully, and your system can lose machines, racks, and zones while users barely notice.
In a distributed world, silence is ambiguity. Heartbeats turn that silence into information.
