Picture this: you’re on-call, it’s 3 a.m., and a cluster node silently dies.
No crash loop. No helpful logs. Just… absence.
In a distributed system, absence is deadly. A single node going missing can stall leader election, corrupt data, or make your clients hang indefinitely. You don’t get stack traces from a dead machine. You just get silence.
Heartbeats are how we turn that silence into a signal.
They’re stupidly simple — tiny “I’m alive” messages — but they sit right in the critical path of availability, failover, and system correctness. Let’s walk through them like system designers, not checkbox-monitoring enjoyers.
What is a Heartbeat, Really?
In computing, a heartbeat is a periodic signal from one component to another that says:
“I’m still here, and I’m (probably) fine.”
It might be a UDP packet, an HTTP request, a gRPC call, or even a row update in a database table. The payload is often tiny — sometimes just a timestamp or status flag. If the receiver doesn’t see a heartbeat within some window (a timeout), it starts suspecting that node is unhealthy or dead.
That’s all. No magic. Just a repeating pulse.
Yet that pulse powers:
- Cluster membership
- Load balancer health checks
- Leader election
- Failure detection in consensus algorithms
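To make the "tiny payload" point concrete, here is a minimal sketch in Python of what such a pulse might look like. The field names are illustrative assumptions, not any particular system's wire format:

```python
import json
import time

# Hypothetical heartbeat payload: just an ID, a timestamp, and a status flag.
# A real system might send this over UDP, HTTP, or gRPC, or write it to a
# database row instead of serializing JSON.
def make_heartbeat(node_id: str) -> bytes:
    return json.dumps({
        "node_id": node_id,      # who is reporting in
        "sent_at": time.time(),  # when it was sent
        "status": "ok",          # optional status / metadata
    }).encode("utf-8")

print(make_heartbeat("node-42"))
```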
Why Distributed Systems Need Heartbeats
Monoliths don’t worry much about “is this process alive?” — if it dies, everything is obviously dead. In distributed systems, the failure of a machine you’ve never heard of can stall the whole system. Heartbeats give us a way to notice and react quickly.
Common uses:
- Failure detection: Nodes or a central monitor track who is “alive.” Once a node misses several heartbeats, it’s marked as failed and removed from routing, quorums, or replicas.
- Cluster membership: Heartbeats feed into membership protocols: which nodes are “in the cluster”? This is crucial for consistent hashing, sharding, and quorum calculations.
- Leader and coordinator health: Leaders send heartbeats to followers (e.g., Raft’s AppendEntries with no-op payloads), letting them know the leader is still in charge and preventing unnecessary elections.
- Load balancer / service discovery: Load balancers and service registries use heartbeats (or active health checks) to decide which backend instances are healthy enough to receive traffic.
Under the hood, most of these boil down to the same core pattern: periodic liveness signals + timeouts + some failure detection logic.
The Minimal Anatomy of a Heartbeat System
Let’s deconstruct the pattern into a few building blocks. Different systems change the details, but the shape is usually the same.
- Sender (the node being monitored)
  - Periodically sends a heartbeat.
  - Often includes: its ID, a timestamp, and optional metadata (load, version, epoch, etc.)
- Receiver (the monitor)
  - Tracks the last time it heard from each node.
  - Stores something like: {node_id: last_heartbeat_timestamp}.
- Interval
  - How often heartbeats are sent: every 100 ms? 500 ms? 5 seconds?
  - Smaller interval = faster failure detection but more overhead.
- Timeout
  - How long the receiver waits before declaring “this node might be dead.”
  - Usually multiple intervals, e.g. timeout = 3 * interval + slack.
- Failure detection logic
  - Naive version: “If last heartbeat older than timeout ⇒ node is dead.”
  - Smarter versions use suspicion levels, probabilistic detectors, or multiple missed heartbeats before flipping to failed.
Almost every heartbeat implementation is just tweaking those parameters and adding guardrails.
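To see how those building blocks fit together, here is a minimal sketch in Python. It is a toy, assuming a single process: the sender calls the monitor directly instead of going over the network, and the names (Monitor, record_heartbeat, dead_nodes) are made up for illustration.

```python
import threading
import time

INTERVAL = 1.0                  # seconds between heartbeats
TIMEOUT = 3 * INTERVAL + 0.5    # timeout = 3 * interval + slack

class Monitor:
    """Receiver side: tracks {node_id: last_heartbeat_timestamp}."""

    def __init__(self):
        self.last_seen = {}
        self.lock = threading.Lock()

    def record_heartbeat(self, node_id: str) -> None:
        with self.lock:
            self.last_seen[node_id] = time.time()

    def dead_nodes(self) -> list[str]:
        # Naive failure detection: no heartbeat within TIMEOUT => suspected dead.
        now = time.time()
        with self.lock:
            return [n for n, t in self.last_seen.items() if now - t > TIMEOUT]

def sender(node_id: str, monitor: Monitor, beats: int) -> None:
    """Sender side: periodically reports in. A real sender would use
    UDP/HTTP/gRPC rather than calling the monitor in-process."""
    for _ in range(beats):
        monitor.record_heartbeat(node_id)
        time.sleep(INTERVAL)

if __name__ == "__main__":
    monitor = Monitor()
    t = threading.Thread(target=sender, args=("node-a", monitor, 3))
    t.start()
    t.join()
    time.sleep(TIMEOUT + 0.1)    # let node-a go silent past the timeout
    print(monitor.dead_nodes())  # ['node-a']
```

Notice that the only real decisions here are the two constants at the top, which is exactly the trade-off the next section is about.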
The Big Trade-Off: Detection Speed vs Noise
Heartbeats look easy until you have to pick the numbers.
Say your interval is 1 second and your timeout is 3 seconds. That means:
- You detect failures in ≤ 3 seconds
- You risk marking nodes as dead during brief hiccups, GC pauses, or short network stalls
If you bump the timeout to 30 seconds:
- Far fewer false positives
- Much slower failover
- (Imagine waiting 30 seconds for your primary database to be declared dead…)
Typical Formula
Many systems use something like:
timeout = k * interval + safety_margin
Where k might be 3–5.
- Small k: Fast detection, higher false positives.
- Large k: Slower detection, more stability.
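Plugging illustrative numbers into that formula (the interval and safety margin below are arbitrary choices, not recommendations) shows the trade-off directly:

```python
# timeout = k * interval + safety_margin, with illustrative values
interval, safety_margin = 1.0, 0.5   # seconds

for k in (3, 5):
    timeout = k * interval + safety_margin
    print(f"k={k}: a node is declared dead after {timeout:.1f}s of silence")
# k=3: declared dead after 3.5s of silence  (faster detection, more false positives)
# k=5: declared dead after 5.5s of silence  (slower detection, more stability)
```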
More advanced designs use adaptive or probabilistic timeouts, like the φ-accrual failure detector (used in systems like Cassandra) that outputs a suspicion level instead of a binary “dead/alive.”
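Here is a toy sketch of that idea. To keep the math to one line it assumes exponentially distributed heartbeat inter-arrival times; the real detector used in Cassandra fits a distribution to a window of recent inter-arrival samples, so treat this as the shape of the idea rather than the actual algorithm. The class and method names are made up.

```python
import math
import time

class PhiAccrualDetector:
    """Simplified phi-accrual sketch: output a suspicion level (phi)
    instead of a binary dead/alive verdict."""

    def __init__(self):
        self.last_arrival = None
        self.mean_interval = 1.0   # running estimate of the interval, seconds

    def heartbeat(self) -> None:
        now = time.time()
        if self.last_arrival is not None:
            sample = now - self.last_arrival
            # crude exponential moving average of observed intervals
            self.mean_interval = 0.8 * self.mean_interval + 0.2 * sample
        self.last_arrival = now

    def phi(self) -> float:
        if self.last_arrival is None:
            return 0.0
        elapsed = time.time() - self.last_arrival
        # Under the exponential assumption:
        #   P(silence this long | node alive) = exp(-elapsed / mean_interval)
        #   phi = -log10 of that probability, so it grows smoothly as silence stretches.
        return elapsed / (self.mean_interval * math.log(10))

# Callers pick a suspicion threshold (e.g. phi > 8) instead of a hard timeout.
```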
Topologies: Who Heartbeats to Whom?
Heartbeats aren’t just about what you send but also who you send it to.
Let’s look at some common patterns.
1. Centralized Monitor
One obvious design: a single monitoring service that all nodes send heartbeats to.
- Each node → sends heartbeat to monitor
- Monitor → maintains a map of node → last seen
- Clients or other services → query the monitor for cluster health

Pros:
- Simple to reason about
- Great for small clusters or control planes
Cons:
- Single point of failure (unless replicated)
- Can become a bottleneck as node count grows
Imagine 1000 nodes sending heartbeats every 500 ms to a central monitor. That’s 2000 heartbeats per second arriving at the monitor (and as many responses going back out), just for health checks, which can compete with actual traffic in a busy system if designed poorly.
2. Peer-to-Peer Heartbeats
Instead of a central brain, nodes can monitor each other:
- Each node pings a subset of other nodes.
- If one node suspects another, it spreads suspicion via gossip or a membership protocol.
This reduces central bottlenecks and improves fault tolerance but complicates the logic: who monitors whom, and what happens if monitors fail?
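A toy sketch of the pattern, with made-up node names and one node hard-coded as dead, might look like this:

```python
import random

NODES = ["a", "b", "c", "d", "e"]
DEAD = {"e"}     # pretend node 'e' has crashed
FANOUT = 2       # how many random peers each node checks per round

def ping(peer: str) -> bool:
    """Stand-in for a real network ping."""
    return peer not in DEAD

def run_round(suspicions: dict[str, set[str]]) -> None:
    for node in NODES:
        if node in DEAD:
            continue  # a dead node sends nothing
        peers = random.sample([n for n in NODES if n != node], FANOUT)
        for peer in peers:
            if not ping(peer):
                # In a real system this suspicion would be spread via gossip
                # or a membership protocol instead of staying local.
                suspicions[node].add(peer)

suspicions = {n: set() for n in NODES}
for _ in range(3):
    run_round(suspicions)
print(suspicions)  # after a few rounds, most live nodes suspect 'e'
```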
3. Gossip-Based Heartbeats
Gossip protocols spread membership and heartbeat data gradually, like rumors at a party:
- Each node periodically talks to a random peer.
- They exchange:
- Who they think is alive
- Who they think is dead
- Versioned membership info

Cassandra is a classic example: it uses gossip + heartbeat-based failure detection and φ-accrual detectors to avoid snap decisions about node death.
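A gossip round can be sketched in a few lines. In this toy version, each node keeps a map of heartbeat counters, bumps its own, and merges views with one random peer per round; the data layout and merge rule are simplified for illustration and are not Cassandra’s actual protocol.

```python
import random

def merge(mine: dict[str, int], theirs: dict[str, int]) -> None:
    """Keep the freshest (highest) heartbeat counter seen for each node."""
    for node, counter in theirs.items():
        mine[node] = max(mine.get(node, 0), counter)

def gossip_round(states: dict[str, dict[str, int]]) -> None:
    for node, state in states.items():
        state[node] = state.get(node, 0) + 1            # bump own heartbeat counter
        peer = random.choice([n for n in states if n != node])
        merge(states[peer], state)                      # push my view to a random peer
        merge(state, states[peer])                      # and pull their view back

states = {n: {} for n in ("a", "b", "c", "d")}
for _ in range(5):
    gossip_round(states)
print(states["a"])  # node 'a' eventually learns everyone's latest counters
```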
So far we’ve treated heartbeats as the basic pulse of a distributed system: tiny periodic signals, timeouts, and topologies that decide who talks to whom. We’ve looked at how they detect failures, how they shape cluster membership, and how different designs (centralized, peer-to-peer, gossip) come with different trade-offs.
In Part 2, we’ll get more hands-on: we’ll build a tiny heartbeat system in Python, explore real-world pitfalls like false positives and partitions, connect this pattern to systems you already use (Kubernetes, Cassandra, etc.), and translate all of that into the kind of thinking that shines in system design interviews.
