The hardest part of an autonomous AI agent is the unhappy path

#llm #ai #python #machinelearning

Most demos of AI agents show you the happy path: a clean question, a tidy answer, everyone claps. The interesting engineering is everywhere else. What does your agent do when the API it depends on is down? When the model would happily keep looping, and your credit card is attached to every step? When it has no data, but is perfectly capable of writing something that looks like data anyway?

I built an autonomous agent for a domain where those questions are not academic, and getting the unhappy path right turned out to be most of the work. Here is what I learned.

The project is github.com/gbadedata/bioagent.

What it does

BioAgent is an autonomous quality-control analyst for a genomics pipeline. You give it a sample ID and it does the rest on its own: it pulls concordance and reproducibility metrics from a live pipeline API through a set of tools, works out what the numbers mean against benchmark thresholds, builds a targeted PubMed query from the actual findings, searches the literature, and writes a structured, clinical-grade quality report. It streams the whole thing into a Streamlit chat as it reasons, and exposes a FastAPI endpoint a scheduler can call.

It is built with LangGraph and Claude. Why LangGraph, and not a plain "here is a list of tools" agent, is the whole point of this post.

Why a graph, and why bounded

A plain agent takes a question, maybe calls some tools, and answers. BioAgent has to make decisions in sequence: fetch data, then decide from what came back whether the literature is even worth searching; if the search is empty, broaden it and retry; if the pipeline is unreachable, stop and say so clearly. That is a state machine with cycles and conditional routing, which is exactly what LangGraph models.

%%{init: {'theme':'base','themeVariables':{'primaryColor':'#eef2f7','primaryBorderColor':'#1b2a4a','primaryTextColor':'#1b2a4a','lineColor':'#4c78a8','fontFamily':'Segoe UI, sans-serif'}}}%%
flowchart TD
    START([sample_id]) --> FETCH["fetch_data<br/>call 5 pipeline API tools"]
    FETCH -->|data collected| ANALYSE["analyse<br/>LLM builds a targeted PubMed query"]
    FETCH -->|"critical tools failed,<br/>retry budget remains"| FETCH
    FETCH -->|"critical tools failed,<br/>retries spent"| DEGRADE["graceful_degradation<br/>report what failed, invent nothing"]
    ANALYSE --> SEARCH["search_literature<br/>query PubMed, broaden and retry if empty"]
    SEARCH --> REPORT["synthesise_report<br/>LLM writes the QC report"]
    REPORT --> DONE([END])
    DEGRADE --> DONE

The property that matters most is that the graph is bounded. Every cycle has a hard retry limit, so the agent cannot loop forever. When your agent calls paid APIs on every step, "cannot loop forever" is not a nice-to-have; it is a safety requirement.

And here is a single run, end to end:

%%{init: {'theme':'base','themeVariables':{'primaryColor':'#eef2f7','actorBkg':'#eef2f7','actorBorder':'#1b2a4a','actorTextColor':'#1b2a4a','signalColor':'#4c78a8','signalTextColor':'#1b2a4a','noteBkgColor':'#f4f7fb','noteBorderColor':'#4c78a8'}}}%%
sequenceDiagram
    participant U as User / API
    participant G as LangGraph
    participant P as Pipeline API
    participant C as Claude
    participant L as PubMed
    U->>G: analyse(sample_id)
    G->>P: runs, concordance, reproducibility, alerts
    P-->>G: metrics (or structured errors)
    alt critical data still missing after a retry
        G-->>U: graceful-degradation report (no invented data)
    else data collected
        G->>C: build a PubMed query from the metric values
        C-->>G: query
        G->>L: search, broaden and retry if empty
        L-->>G: citations and abstracts
        G->>C: synthesise the QC report from data and abstracts
        C-->>G: structured report
        G-->>U: report, citations, and tool trace
    end
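The "search, broaden and retry if empty" step in the diagram is itself a loop, so it needs its own bound. The following is a minimal sketch of that idea, not the code in the repo: `MAX_SEARCH_ATTEMPTS`, `broaden`, `search_with_broadening` and `fake_search` are all hypothetical names, and the broadening rule (drop the last AND-ed term) is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

MAX_SEARCH_ATTEMPTS = 3  # assumed bound; the post does not state the real value


def broaden(query: str) -> str:
    """Hypothetical broadening rule: drop the last AND-ed term, if any."""
    terms = query.split(" AND ")
    return " AND ".join(terms[:-1]) if len(terms) > 1 else query


def search_with_broadening(
    query: str,
    search: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Run `search`, broadening the query after each empty result.

    Returns (results, queries_tried). The loop is bounded by
    MAX_SEARCH_ATTEMPTS and also stops once the query cannot be broadened.
    """
    tried: List[str] = []
    for _ in range(MAX_SEARCH_ATTEMPTS):
        tried.append(query)
        results = search(query)
        if results:
            return results, tried
        broader = broaden(query)
        if broader == query:  # nothing left to drop
            break
        query = broader
    return [], tried


def fake_search(query: str) -> List[str]:
    # Stand-in for a PubMed call: only the broadest query matches.
    return ["PMID-placeholder"] if query == "variant calling" else []
```

Passing the search function in as an argument keeps the bound testable without touching the network, and returning the list of queries tried gives the report step something honest to say when nothing was found.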

Lesson 1: bounding a loop is easy to get subtly wrong

Here is the routing after the data-fetch step:

def route_after_fetch(state):
    # Tools the report cannot be written without.
    critical = {"get_concordance_summary", "get_pipeline_runs"}
    critical_failed = critical.intersection(state["failed_tools"])

    if not critical_failed:
        return "analyse"
    if state["fetch_retries"] > MAX_FETCH_RETRIES:
        return "graceful_degradation"   # budget spent: stop and say so
    return "fetch_data"                 # retry, bounded

The idea is simple: if the critical tools failed and there is retry budget left, try again; if the budget is spent, give up gracefully; otherwise carry on.

The subtlety is that a bound is only a bound if the counter actually moves. If the node doing the work forgets to increment the retry count, the router keeps seeing "budget remains" forever, and the graceful exit is never reached. The agent loops until LangGraph's recursion limit trips and raises a GraphRecursionError, which is the exact opposite of failing safely. The fix is a single line in the fetch node:

return {
    ...
    "fetch_retries": state.get("fetch_retries", 0) + 1,   # the bound only works if this moves
}
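To see why that one line matters, here is a framework-free simulation of the fetch loop with the pipeline API permanently down. It is a sketch under stated assumptions: `MAX_FETCH_RETRIES = 1` and `STEP_CAP = 25` are values picked for the example, `fetch_node` and `run` are hypothetical helpers, and a plain `RecursionError` stands in for the framework's own recursion-limit error.

```python
MAX_FETCH_RETRIES = 1   # assumed value; the post does not state it
STEP_CAP = 25           # stand-in for a framework-level recursion limit
CRITICAL = {"get_concordance_summary", "get_pipeline_runs"}


def route_after_fetch(state):
    # Same decision as the router above, written with a single failure check.
    if CRITICAL & set(state["failed_tools"]):
        if state["fetch_retries"] > MAX_FETCH_RETRIES:
            return "graceful_degradation"
        return "fetch_data"
    return "analyse"


def fetch_node(state, increment=True):
    # Simulates a pipeline API that is down: every critical tool fails.
    update = {"failed_tools": sorted(CRITICAL)}
    if increment:
        update["fetch_retries"] = state.get("fetch_retries", 0) + 1
    return update


def run(increment):
    """Drive fetch -> route until the router leaves the loop or the cap trips."""
    state = {"failed_tools": [], "fetch_retries": 0}
    for step in range(1, STEP_CAP + 1):
        state.update(fetch_node(state, increment))
        nxt = route_after_fetch(state)
        if nxt != "fetch_data":
            return nxt, step
    raise RecursionError(f"still looping after {STEP_CAP} steps")
```

With the increment in place, `run(True)` leaves the loop on the second fetch and lands on `graceful_degradation`. With it removed, `run(False)` never sees the counter move and only stops when the step cap raises.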

This is the kind of bug that never appears while you are building, because you are always testing the happy path where the API is up. It only shows up when the dependency breaks. So the real fix is not the one-line increment; it is a test that runs the agent with the API forced down and asserts that it degrades rather than loops:

def test_full_run_degrades_when_api_down(mock_pipeline_api_down):
    result = run_agent("HG001")
    assert result["status"] == "degraded"
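There is more than one way to force the API down for a test like this. One option, sketched here with the standard library's `unittest.mock`, is to patch the client method so every call raises the error a dead server would produce. `PipelineClient`, `collect` and `collect_with_api_down` are hypothetical names for this example; the real fixture in the repo may work differently.

```python
from unittest import mock


class PipelineClient:
    """Hypothetical stand-in for the client that calls the pipeline API."""

    def get(self, tool_name: str) -> dict:
        return {"tool": tool_name, "ok": True}


def collect(client: PipelineClient, tool_names):
    """Call each tool; record failures as data instead of letting them raise."""
    results, failed = {}, []
    for name in tool_names:
        try:
            results[name] = client.get(name)
        except ConnectionError as exc:
            failed.append(name)
            results[name] = {"error": str(exc)}
    return results, failed


def collect_with_api_down(tool_names):
    # Patch the client's method so every call fails the way a dead server would.
    with mock.patch.object(
        PipelineClient, "get", side_effect=ConnectionError("connection refused")
    ):
        return collect(PipelineClient(), tool_names)
```

Turning the exception into a structured error plus an entry in the failed list is what lets a router make a decision about the outage instead of crashing on it.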

The full suite, including that degrade-not-loop test, runs green in CI.

Test the unhappy path, or you have not tested the part that matters.

Lesson 2: an agent with no data must not write a report

The most dangerous failure for this kind of system is not a crash. It is a confident, clinical-looking report generated from nothing. So when the critical tools cannot be reached, the graph routes to a dedicated node that reports exactly which tools failed, says what could not be retrieved, tells you how to start the API, and stops. It never fills the gap with plausible numbers.

That is not just a prompt instruction; it is enforced by a test that asserts the degraded report contains no fabricated metrics:

def test_report_does_not_hallucinate_metrics():
    result = graceful_degradation(state_with_failed_tools)
    assert "0.99" not in result["report"]
    assert "F1"   not in result["report"]
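Those two substring checks catch specific values. A broader variant, sketched below, rejects anything in the degraded report that looks like a metric at all. It assumes the degraded report is built from a fixed template rather than by the model, which is one way to make the guarantee hold by construction. `degraded_report`, `metric_like_values` and the report wording are hypothetical, written for this example only.

```python
import re

# Matches decimals such as 0.99 or 12.5 and percentages such as 98%.
# Plain integers are allowed so that sample IDs like HG001 still pass.
METRIC_LIKE = re.compile(r"\d+\.\d+|\d+(?:\.\d+)?\s*%")


def metric_like_values(report: str) -> list:
    """Return every substring of the report that looks like a metric value."""
    return METRIC_LIKE.findall(report)


def degraded_report(sample_id: str, failed_tools: list) -> str:
    """Build the report from its inputs only; no model call, no numbers."""
    lines = [
        f"QC report unavailable for {sample_id}.",
        "The following tools could not be reached:",
    ]
    lines += [f"  - {name}" for name in failed_tools]
    lines.append("No metrics were retrieved, so none are reported.")
    return "\n".join(lines)
```

A pattern like this will also flag version strings or dates that contain a decimal point, so it suits a report that has no business containing numbers in the first place.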

A rule the model is asked to follow is a hope. A rule a test enforces is a guarantee.

Lesson 3: ground the model in what it actually retrieved

The agent cites PubMed papers in its report, which is a quiet invitation to invent relevance. Early on it fetched abstracts and then discarded them, passing only the PMIDs downstream, so the model was asked to explain how papers supported the findings without ever seeing what those papers said. That is exactly the kind of shortcut that produces confident nonsense.

The fix was to carry the retrieved abstract text through to the report step and tell the model to ground its literature section only in the abstracts provided, and to say so plainly if none were retrieved. Retrieval is only grounding if the retrieved text actually reaches the place that writes the words.
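In code, carrying the abstracts through can be as small as the helper below. This is a minimal sketch, assuming each retrieved paper arrives as a dict with `pmid` and `abstract` keys. `build_literature_context` and `NO_LITERATURE` are hypothetical names, and the instruction wording is illustrative, not the prompt the repo uses.

```python
from typing import Dict, List

NO_LITERATURE = "No abstracts were retrieved. State this plainly and cite nothing."


def build_literature_context(papers: List[Dict[str, str]]) -> str:
    """Format retrieved abstracts for the report prompt.

    Each paper is assumed to be a dict with "pmid" and "abstract" keys.
    Papers without abstract text are dropped, since a bare PMID gives the
    model nothing to ground a claim in.
    """
    usable = [p for p in papers if p.get("abstract", "").strip()]
    if not usable:
        return NO_LITERATURE
    blocks = [f"[PMID {p['pmid']}]\n{p['abstract'].strip()}" for p in usable]
    return (
        "Ground the literature section ONLY in the abstracts below.\n\n"
        + "\n\n".join(blocks)
    )
```

The empty case is handled explicitly so that "nothing was retrieved" reaches the model as an instruction instead of as a blank section it might feel obliged to fill.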

Takeaways

Bound every loop, and prove it. In an autonomous agent, an unbounded retry is a runaway bill. A bound only counts if the counter advances, so test that it does.
Test the unhappy path. The happy path is the part that was always going to work. Force the dependency down and assert the agent fails safely.
No data, no report. Make "do not invent" a tested guarantee, not a polite request in a prompt.
Grounding means the retrieved text reaches the writer. Fetching abstracts and then ignoring them is worse than not fetching at all.

The code, the full LangGraph state machine, the tests, and the architecture diagrams are all in the repo: github.com/gbadedata/bioagent. If you want the MCP-server-plus-tool-using-agent take on the same ideas, I wrote that up separately at github.com/gbadedata/mcp-research-agent.