Register Named Health Checks for Your Agent So You Know What's Broken

#hermeschallenge #ai #python #agents

Production agents depend on external services: LLM providers, databases, vector stores, APIs. Any of them can go down. Without health checks, you find out when users report errors.

agent-health-check is a registry of named checks. Each check is a callable that returns True (OK) or False (FAIL). The registry runs them all and gives you a structured report. Wire it to a /health endpoint. Call it before a batch job starts. Log it with your observability stack.

The Shape of the Fix

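import anthropic
from agent_health_check import HealthRegistry, HealthStatus

# Reads ANTHROPIC_API_KEY from the environment. db and vector_db below are
# your application's own client handles.
client = anthropic.Anthropic()
registry = HealthRegistry()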

@registry.check("anthropic_api")
def check_anthropic() -> bool:
    try:
        client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=1,
            messages=[{"role": "user", "content": "ping"}],
        )
        return True
    except Exception:
        return False

@registry.check("database")
def check_db() -> bool:
    return db.ping()

@registry.check("vector_store")
def check_vectors() -> bool:
    return vector_db.is_healthy()

# Run all checks
report = registry.run_all()
print(report.status)   # HealthStatus.HEALTHY or HealthStatus.DEGRADED or HealthStatus.UNHEALTHY
print(report.summary)  # {"anthropic_api": True, "database": True, "vector_store": False}

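One call, structured result. HEALTHY means all checks passed. DEGRADED means some failed. UNHEALTHY means all failed. Only required checks count toward the status; every check is required unless you register it with required=False.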

What It Does NOT Do

agent-health-check does not repair failing dependencies. It reports their status. Repair logic is your responsibility.

It does not run checks continuously. Each run_all() is a point-in-time snapshot. For continuous monitoring, run it on a schedule with the schedule package or a cron job.
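
If you would rather not add a dependency, a background thread is enough for periodic snapshots. The sketch below uses only the standard library. run_periodically, its parameters, and fake_run_all are hypothetical names for illustration; in real use you would pass registry.run_all as run_checks.

```python
import threading
import time
from typing import Callable


def run_periodically(
    run_checks: Callable[[], object],
    on_report: Callable[[object], None],
    interval_seconds: float,
    stop_event: threading.Event,
) -> threading.Thread:
    """Call run_checks() every interval_seconds until stop_event is set.

    Hypothetical helper, not part of agent-health-check.
    """

    def loop() -> None:
        while not stop_event.is_set():
            on_report(run_checks())
            # wait() returns early once stop_event is set, so shutdown is prompt
            stop_event.wait(interval_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


# Stand-in for registry.run_all so the sketch runs without any dependencies
def fake_run_all() -> dict:
    return {"database": True, "vector_store": False}


reports: list = []
stop = threading.Event()
worker = run_periodically(fake_run_all, reports.append, interval_seconds=0.01, stop_event=stop)
time.sleep(0.1)  # let a few snapshots accumulate
stop.set()
worker.join(timeout=1.0)
```

In production, on_report would typically log the report or push it to a metrics backend instead of appending to a list.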

It does not distinguish between recoverable and unrecoverable failures. A database that is momentarily overloaded and a database that is misconfigured both return False. Distinguishing them requires more context than a boolean check can provide.

Inside the Library

The registry stores each check as a HealthCheck record: name, callable, required flag, and timeout. required=True means a failure pulls the overall status down to DEGRADED or UNHEALTHY; required=False makes the check advisory-only.

class HealthRegistry:
    def __init__(self):
        self._checks: list[HealthCheck] = []

    def check(self, name: str, required: bool = True, timeout: float = 5.0):
        def decorator(fn):
            self._checks.append(HealthCheck(name=name, fn=fn, required=required, timeout=timeout))
            return fn
        return decorator

    def run_all(self) -> HealthReport:
        results = {}
        for check in self._checks:
            try:
                with timeout_guard(check.timeout):
                    results[check.name] = check.fn()
            except TimeoutError:
                results[check.name] = False
            except Exception:
                results[check.name] = False

        # Only required checks decide the overall status
        required_results = [results[c.name] for c in self._checks if c.required]

        if all(required_results):
            status = HealthStatus.HEALTHY      # every required check passed
        elif any(required_results):
            status = HealthStatus.DEGRADED     # some required checks failed
        else:
            status = HealthStatus.UNHEALTHY    # every required check failed

        return HealthReport(status=status, summary=results)

The timeout parameter on each check prevents a slow dependency from blocking the entire health check run. Default is 5 seconds per check.
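
timeout_guard is not shown in the excerpt above. One portable way to get a similar effect is to run the check in a worker thread and stop waiting once the deadline passes. The sketch below does that with concurrent.futures. run_with_timeout is a hypothetical helper, and the library's own implementation may work differently.

```python
import concurrent.futures
import time
from typing import Callable


def run_with_timeout(fn: Callable[[], bool], timeout: float) -> bool:
    """Return fn()'s result as a bool, or False if it raises or overruns timeout.

    Hypothetical helper. Python cannot kill a running thread, so a check that
    overruns keeps running in the background; only the caller stops waiting.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return bool(future.result(timeout=timeout))
    except concurrent.futures.TimeoutError:
        return False  # the check took longer than its deadline
    except Exception:
        return False  # the check itself raised
    finally:
        # wait=False so a stuck check does not block the caller here
        executor.shutdown(wait=False)


def fast_check() -> bool:
    return True


def slow_check() -> bool:
    time.sleep(0.5)  # simulates a dependency that hangs
    return True


print(run_with_timeout(fast_check, timeout=0.2))   # True
print(run_with_timeout(slow_check, timeout=0.05))  # False
```

The trade-off is that an overrunning check still holds its thread until it finishes, so checks should also set their own client-side timeouts where the client supports them.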

Checks registered with required=False show up in report.summary but do not affect the overall status. Use this for advisory checks (vector store is nice to have but not required for basic function).
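
To make the effect of advisory checks concrete, here is a small standalone sketch of the status rule as described earlier (all pass, some fail, all fail), applied to required checks only. Status and overall_status are stand-ins written for this example, and the enum values are assumptions, not the library's own definitions.

```python
from enum import Enum


class Status(Enum):  # stand-in for the library's HealthStatus
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(results: dict[str, bool], required: set[str]) -> Status:
    """Derive the overall status from required checks only."""
    required_results = [ok for name, ok in results.items() if name in required]
    if all(required_results):      # also true when nothing is required
        return Status.HEALTHY
    if any(required_results):      # mixed: some required passed, some failed
        return Status.DEGRADED
    return Status.UNHEALTHY        # every required check failed


required = {"anthropic", "postgres"}

# The advisory redis failure is reported but does not change the status
print(overall_status({"anthropic": True, "postgres": True, "redis": False}, required))
# Status.HEALTHY
print(overall_status({"anthropic": True, "postgres": False, "redis": True}, required))
# Status.DEGRADED
print(overall_status({"anthropic": False, "postgres": False, "redis": True}, required))
# Status.UNHEALTHY
```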

The check function interface is simple: () -> bool. No special return types, and no exception handling required inside the check. The registry catches all exceptions and treats them as failures.
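
Because the interface is just a zero-argument callable, checks are easy to build from small factories. As an illustration, the sketch below builds a TCP reachability check using only the standard library. make_tcp_check is a hypothetical helper, and the demo runs against a throwaway local listener rather than a real service.

```python
import socket
from typing import Callable


def make_tcp_check(host: str, port: int, connect_timeout: float = 1.0) -> Callable[[], bool]:
    """Build a () -> bool check that passes if a TCP connection can be opened.

    Hypothetical helper. It proves reachability only, not that the service
    behind the port is answering requests correctly.
    """

    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            return False

    return check


# Demo against a throwaway local listener so the sketch needs no real service
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

check_local = make_tcp_check("127.0.0.1", port)
up = check_local()    # listener is open
listener.close()
down = check_local()  # listener is gone
print(up, down)
```

Given the decorator shown above, registering such a check should look like registry.check("postgres_port")(make_tcp_check("db.internal", 5432)), where the host name is a placeholder.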

When to Use It

Use it for any agent that depends on external services. The /health endpoint pattern: mount the registry's run_all() behind a GET endpoint and call it from your load balancer or k8s readiness probe.

Use it before starting long batch jobs. A batch job that fails at item 500 because the vector store was down is worse than a batch job that fails immediately with "vector store health check failed."
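
A preflight gate for batch jobs can be sketched in a few lines. run_batch and PreflightError are hypothetical names, and run_checks stands in for something like lambda: registry.run_all().summary. Note that this version blocks on any failed check, advisory or not; filter the summary first if that is too strict for your job.

```python
from typing import Callable, Iterable


class PreflightError(RuntimeError):
    """Raised when health checks fail before a batch job starts."""


def run_batch(
    items: Iterable[str],
    process: Callable[[str], None],
    run_checks: Callable[[], dict[str, bool]],
) -> int:
    """Process items only if every check passes; return the number processed."""
    failed = sorted(name for name, ok in run_checks().items() if not ok)
    if failed:
        # Fail before item 1 rather than at item 500
        raise PreflightError(f"health check failed: {', '.join(failed)}")
    count = 0
    for item in items:
        process(item)
        count += 1
    return count


processed: list[str] = []
try:
    run_batch(["a", "b"], processed.append, lambda: {"database": True, "vector_store": False})
except PreflightError as exc:
    print(exc)  # health check failed: vector_store
```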

Use it in your agent startup sequence. Before accepting any work, run the health registry. If required checks fail, log the summary and exit rather than running in a degraded state.
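
The startup gate might look like the sketch below. startup_exit_code is a hypothetical helper; it takes the status as a plain string so the example runs on its own, and assumes that in real use you would pass report.status.name and report.summary.

```python
import logging
from typing import Mapping

logger = logging.getLogger("agent.startup")


def startup_exit_code(status: str, summary: Mapping[str, bool]) -> int:
    """Return 0 if the agent may start, 1 if it should exit.

    status is the overall status name: "HEALTHY", "DEGRADED", or "UNHEALTHY".
    """
    if status == "HEALTHY":
        return 0
    failed = [name for name, ok in summary.items() if not ok]
    logger.error("refusing to start: status=%s failed=%s", status, failed)
    return 1


code = startup_exit_code("DEGRADED", {"anthropic": True, "postgres": False})
# In a real entry point: raise SystemExit(code)
```

Because advisory checks do not affect the status, a failing advisory check alone still lets the agent start.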

Skip it for local development and simple scripts. Health checks are an operational concern. They add value when there is an operations team monitoring them.

Install

pip install git+https://github.com/MukundaKatta/agent-health-check

from agent_health_check import HealthRegistry, HealthStatus
from fastapi import FastAPI, HTTPException

registry = HealthRegistry()
app = FastAPI()

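@registry.check("anthropic", required=True, timeout=3.0)
def check_anthropic():
    # anthropic_client is your own wrapper here; is_available() is a placeholder,
    # not a method of the Anthropic SDK
    return anthropic_client.is_available()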

@registry.check("postgres", required=True, timeout=2.0)
def check_postgres():
    # rowcount is not reliable for SELECT across DB-API drivers; fetch the row instead
    return db.execute("SELECT 1").fetchone() is not None

@registry.check("redis", required=False, timeout=1.0)
def check_redis():
    return cache.ping()

@app.get("/health")
def health():
    report = registry.run_all()
    if report.status == HealthStatus.UNHEALTHY:
        raise HTTPException(status_code=503, detail=report.summary)
    return {"status": report.status.value, "checks": report.summary}

Sibling Libraries

| Library | What it solves |
| --- | --- |
| `llm-circuit-breaker-py` | Open circuit after N failures, stop sending to broken provider |
| `llm-fallback-chain` | Route to backup provider when primary is unhealthy |
| `agent-deadline` | Time-bound the health check run itself |
| `agentsnap` | Track provider availability over time |
| `llm-stop-conditions` | Stop agent loop if dependencies become unavailable |

The operational pattern: agent-health-check at startup and in the readiness probe, llm-circuit-breaker-py at runtime to stop sending to providers that start failing, llm-fallback-chain to route to backups when the circuit opens.

What's Next

Async health checks for async applications. The current run_all() runs checks sequentially in a thread pool, so a run takes as long as the sum of its checks. An async_run_all() built on asyncio.gather() would run them concurrently and fit naturally into async applications.
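
A rough sketch of what that could look like, using only asyncio. This is not the library's API: async_run_all here is a free function taking a dict of hypothetical async checks, and it applies one shared timeout rather than a per-check one.

```python
import asyncio
from typing import Awaitable, Callable

AsyncCheck = Callable[[], Awaitable[bool]]


async def async_run_all(checks: dict[str, AsyncCheck], timeout: float = 5.0) -> dict[str, bool]:
    """Run all async checks concurrently; a timeout or exception counts as False."""

    async def run_one(check: AsyncCheck) -> bool:
        try:
            return bool(await asyncio.wait_for(check(), timeout=timeout))
        except Exception:
            # asyncio.wait_for raises a timeout error on overrun; treat it as a failure
            return False

    names = list(checks)
    outcomes = await asyncio.gather(*(run_one(checks[name]) for name in names))
    return dict(zip(names, outcomes))


async def ok_check() -> bool:
    await asyncio.sleep(0.01)
    return True


async def hanging_check() -> bool:
    await asyncio.sleep(1.0)  # simulates a dependency that does not answer in time
    return True


async def broken_check() -> bool:
    raise ConnectionError("connection refused")


summary = asyncio.run(
    async_run_all(
        {"database": ok_check, "vector_store": hanging_check, "cache": broken_check},
        timeout=0.1,
    )
)
print(summary)  # {'database': True, 'vector_store': False, 'cache': False}
```

With this shape, total latency is bounded by the slowest check (or the timeout) instead of the sum of all checks.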

Detailed failure context: right now checks return bool. A HealthResult return type with optional detail string would let checks explain what failed: "connection refused on port 5432" vs "query timed out after 2s".
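
One possible shape for that return type is sketched below. HealthResult and run_check are hypothetical; the point is that a normalizing wrapper lets existing bool-returning checks keep working while new checks return a detail string.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class HealthResult:
    """Hypothetical richer return type: pass/fail plus an optional explanation."""
    ok: bool
    detail: str = ""


def run_check(fn: Callable[[], Union[bool, HealthResult]]) -> HealthResult:
    """Normalize a check's outcome so bool-returning checks keep working."""
    try:
        outcome = fn()
    except Exception as exc:
        # Keep the exception text so the report says why, not just that it failed
        return HealthResult(ok=False, detail=f"{type(exc).__name__}: {exc}")
    if isinstance(outcome, HealthResult):
        return outcome
    return HealthResult(ok=bool(outcome))


def check_db() -> HealthResult:
    return HealthResult(ok=False, detail="query timed out after 2s")


def check_cache() -> bool:
    raise ConnectionRefusedError("connection refused on port 6379")


print(run_check(check_db))
print(run_check(check_cache).detail)  # ConnectionRefusedError: connection refused on port 6379
```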

Built-in LLM provider checks: pre-built check callables for Anthropic, OpenAI, and Gemini that make minimal API calls and validate responses. Users could use these without writing the check themselves.

Built as part of the agent-stack family: composable Python primitives for production LLM agents.