Heartbeat monitoring for AI agent pipelines

#ai #mcp #agents #devtools

You deploy an AI agent to run nightly. It summarises data, writes a report, sends a Slack message. You set up uptime monitoring on the endpoint. The monitor stays green. Three days later you notice the Slack messages stopped. The agent hasn't run since Tuesday — and nothing alerted you.

This is the failure mode heartbeat monitoring is designed to catch. Here's how it works and why it's particularly important for AI agent pipelines.

The dead man's switch pattern

A dead man's switch alerts when something stops happening. Traditional monitoring alerts when something starts happening — a server goes down, an error rate spikes, a response time increases.

For AI agents, the dangerous failure is silence. The agent stops running. No error is thrown. No endpoint goes down. The work just quietly ceases. A dead man's switch catches this by expecting a regular signal — if the signal stops, something is wrong.

The implementation is straightforward: at the end of every successful agent run, after the real work is done, send a ping to a heartbeat URL. If no ping arrives within the expected window, you get an alert.
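To make the pattern concrete, here is a minimal sketch of the check a heartbeat monitor might run on its side. It assumes the alert deadline is simply the last ping time plus the interval plus the grace window. The function name and that rule are illustrative assumptions, not Tickstem's actual implementation.

```javascript
// Hypothetical monitor-side check; not Tickstem's actual implementation.
// All times are in seconds since the Unix epoch.
function isOverdue(lastPingAt, intervalSecs, graceSecs, now) {
  // The next ping is due one interval after the last one; the grace
  // window extends that deadline before an alert fires.
  const deadline = lastPingAt + intervalSecs + graceSecs
  return now > deadline
}

const lastPing = 1_700_000_000
console.log(isOverdue(lastPing, 86400, 3600, lastPing + 86400))        // false
console.log(isOverdue(lastPing, 86400, 3600, lastPing + 86400 + 3601)) // true
```

The check looks only at elapsed time. The monitor needs no knowledge of what the agent does, which is why it still works when the agent fails without raising an error.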

Why AI agents need this more than traditional jobs

Traditional cron jobs fail loudly — a non-zero exit code, an exception in the logs, a failed database write. You usually know something went wrong.

AI agents fail quietly. The model might hit a rate limit and return a graceful fallback response. A tool call might silently fail and the agent continues without it. The task might complete but produce empty or corrupted output — and your application code never raises an error because it got a valid HTTP response.

In all these cases, the endpoint is up, the job "ran," and traditional monitoring sees nothing. The heartbeat catches these cases because the agent itself decides whether to send the ping, and it pings only on genuine success.
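What "genuine success" means depends on your agent, but the quiet failures described above can usually be turned into explicit checks. The sketch below assumes a hypothetical result object with `output`, `usedFallback`, and `toolCalls` fields. Your agent framework will expose these differently, so treat the field names as placeholders.

```javascript
// Hypothetical result shape (adapt to your framework):
// { output: string, usedFallback: boolean, toolCalls: [{ name: string, ok: boolean }] }
function findSilentFailures(result, minOutputLength = 100) {
  const problems = []
  // A graceful fallback (e.g. after a rate limit) is still a failed run.
  if (result.usedFallback) problems.push("model returned a fallback response")
  // Any tool call that failed means the agent continued without its data.
  for (const call of result.toolCalls ?? []) {
    if (!call.ok) problems.push(`tool call failed: ${call.name}`)
  }
  // Empty or very short output counts as no output.
  const output = (result.output ?? "").trim()
  if (output.length < minOutputLength) problems.push("output is empty or too short")
  return problems
}
```

Send the heartbeat ping only when the returned list is empty. Otherwise log the problems and let the missed ping raise the alert.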

Wiring it up

Create the heartbeat once and save the token:

curl -X POST https://api.tickstem.dev/v1/heartbeats \
  -H "Authorization: Bearer $TICKSTEM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "nightly-summary-agent", "interval_secs": 86400, "grace_secs": 3600}'
# → {"token": "your-64-char-token", ...}

Then in your agent's task handler, ping only after all the real work is verified complete:

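// assumes the token from the create call is stored in an environment variable
const HEARTBEAT_TOKEN = process.env.HEARTBEAT_TOKEN

async function runNightlySummaryAgent() {
  const summary = await generateSummary()

  // treat empty or suspiciously short output as a failure
  if (!summary || summary.length < 100) {
    throw new Error("summary generation failed or returned empty output")
  }

  await postToSlack(summary)
  await writeToDatabase(summary)

  // only ping after everything succeeded; ping failures are logged, never thrown
  try {
    const res = await fetch(`https://api.tickstem.dev/v1/heartbeats/${HEARTBEAT_TOKEN}/ping`, {
      method: "POST",
      signal: AbortSignal.timeout(5000) // don't let a hung ping stall the agent
    })
    // fetch resolves on HTTP error statuses, so check the status explicitly
    if (!res.ok) console.error("heartbeat ping returned HTTP", res.status)
  } catch (err) {
    console.error("heartbeat ping failed:", err) // non-fatal
  }
}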

The ping is best-effort: an error on the ping is logged and swallowed, so it never causes an otherwise successful run to fail.
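One consequence of a best-effort ping is that a single transient network error can cause a false alert. A few retries reduce that risk. The helper below is a sketch, not part of any Tickstem SDK: the name `pingWithRetry`, its options, and the injectable `fetchImpl` parameter are all assumptions made for illustration.

```javascript
// Best-effort ping with a few retries. `fetchImpl` is injectable so the
// helper can be exercised without network access (hypothetical design).
async function pingWithRetry(url, { attempts = 3, delayMs = 1000, fetchImpl = fetch } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await fetchImpl(url, { method: "POST" })
      if (res.ok) return true
      console.error(`heartbeat ping returned HTTP ${res.status} (attempt ${attempt})`)
    } catch (err) {
      console.error(`heartbeat ping failed (attempt ${attempt}):`, err)
    }
    // Wait before the next attempt, but not after the last one.
    if (attempt < attempts) await new Promise(resolve => setTimeout(resolve, delayMs))
  }
  return false // never throws: a failed ping must not fail the agent run
}
```

The helper returns a boolean instead of throwing, so the caller can log the outcome without wrapping the call in its own error handling.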

Setting the right interval and grace window

The interval is how often you expect the agent to run. The grace window absorbs variance.

A practical starting point:

Hourly agents: interval 3600s, grace 600s
Daily agents: interval 86400s, grace 3600s
Weekly agents: interval 604800s, grace 7200s

Once you have a week or so of run history (longer for weekly agents), check your actual execution durations and tighten the grace window to 2-3x your p95 runtime.
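If you log run durations, the p95-based grace window can be computed directly. This sketch uses a nearest-rank percentile and a default multiplier of 3, the upper end of the 2-3x rule of thumb. The function names and the sample durations are made up for illustration.

```javascript
// Nearest-rank percentile over a list of run durations (seconds).
function percentile(values, p) {
  if (values.length === 0) throw new Error("no durations recorded")
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Grace window as a multiple of the p95 runtime.
function suggestGraceSecs(durationsSecs, multiplier = 3) {
  return Math.ceil(percentile(durationsSecs, 95) * multiplier)
}

const lastWeek = [310, 295, 340, 1200, 305, 320, 315] // hypothetical runtimes
console.log(suggestGraceSecs(lastWeek)) // 3600
```

With only seven samples, the p95 is just the slowest run, so one outlier dominates the result. Recompute as more history accumulates.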

Multi-step pipelines

For agents that run a pipeline — fetch data, process it, write results, notify downstream — consider a heartbeat per stage if any stage can fail silently. One heartbeat at the end of the full pipeline tells you the pipeline completed. Individual stage heartbeats tell you exactly where it stopped.

A useful rule: the heartbeat ping should only fire after your agent has verified its own output — database write succeeded, Slack message delivered, output passes a sanity check. Not before.
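A per-stage version of that rule might look like the sketch below. The stage shape (`name`, `token`, `run`, `verify`) and the injected `ping` function are assumptions made for illustration. In production, `ping` would POST to the ping URL for that stage's heartbeat token.

```javascript
// Each stage is { name, token, run, verify } (hypothetical shape).
// `ping(token)` is injected by the caller.
async function runPipeline(stages, ping) {
  let data
  for (const stage of stages) {
    data = await stage.run(data)
    // Verify the stage's own output before signalling success.
    if (!(await stage.verify(data))) {
      throw new Error(`stage "${stage.name}" produced output that failed verification`)
    }
    // Ping failures are logged but never abort the pipeline.
    try {
      await ping(stage.token)
    } catch (err) {
      console.error(`heartbeat ping failed for stage "${stage.name}":`, err)
    }
  }
  return data
}
```

When a stage fails verification, its heartbeat and every later one go quiet. The first missing ping tells you where the pipeline stopped.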

Pausing during deployments

Deployments are a common source of false heartbeat alerts, because the agent may miss a scheduled run while it is being replaced. Pause before deploying:

# before deploy
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/pause \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"

# after deploy completes
curl -s -X POST https://api.tickstem.dev/v1/heartbeats/$HEARTBEAT_ID/resume \
  -H "Authorization: Bearer $TICKSTEM_API_KEY"
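If your deploy is scripted in JavaScript, the same pause and resume calls can be wrapped so the heartbeat is resumed even when the deploy throws. This sketch reuses the two endpoints from the curl commands above. The wrapper name, its parameters, and the injectable `fetchImpl` are hypothetical, and the sketch does not inspect the API's responses.

```javascript
const API = "https://api.tickstem.dev/v1/heartbeats"

// Pause the heartbeat, run the deploy, and always resume afterwards.
async function withPausedHeartbeat(heartbeatId, apiKey, deploy, fetchImpl = fetch) {
  const call = action =>
    fetchImpl(`${API}/${heartbeatId}/${action}`, {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    })

  await call("pause")
  try {
    return await deploy()
  } finally {
    await call("resume") // runs even if the deploy throws
  }
}
```

The `finally` block is the important part. A heartbeat left paused after a failed deploy would hide the very outage it exists to detect.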

Via MCP

If you're using Claude Code or another MCP-compatible client, the Tickstem MCP server exposes create_heartbeat and ping_heartbeat as native tools. The agent can set up its own dead man's switch during the initial scaffolding step.

Tickstem provides heartbeat monitoring, uptime checks, cron scheduling, and email verification under one API key. Free tier at app.tickstem.dev — no credit card required.