DEV Community

Amit
Amit

Posted on • Originally published at artificialcuriositylabs.ai

The Test Isn't Whether the Agent Picks the Right Tool. It's What Happens When the Tool Fails.

TL;DR

  • A hybrid agent (outer Opus 4.8 + browser tool + computer use tool) recovered from tool failures on its own — twice across two runs, with different failure modes each time.
  • A router-with-retry sends the same instruction twice. What the hybrid did was generate a genuinely new instruction conditional on why the first failed (GUI fighting back → switch to CLI).
  • This works because of layering: the outer model only sees clean text summaries from sub-agents, never raw screenshots. Failures are legible, so they're recoverable.
  • The test: make one of your agent's tools fail in a way the developer didn't anticipate. If the next instruction is identical to the first, you have a router with retry, not an agent.

Post 1 argued for the hierarchy: Native API → Existing connector → Browser-based automation → Computer use → Human last resort. Post 2 showed how to ship the cheap version of whichever tier you pick. This post is about a property the hybrid pattern gets that single-tier setups don't: the agent recovers from tool failures on its own.

I built a hybrid agent — outer Opus 4.8 with two Strands tools, one for an AgentCore Browser session and one for Anthropic computer use against a containerized Linux desktop. The pitch I wrote myself for the demo: "the outer model picks the right tool per task." That was supposed to be the headline.

It wasn't. The routing worked, sure — the model called the browser tool for Yelp and the desktop tool for LibreOffice, the way a sane orchestrator would. But the run that made me actually pay attention was the one where the desktop tool failed. The outer model noticed the failure, called the desktop tool a second time with a completely different approach, and finished the task. No retry library. No exception handler. The model decided.

I re-ran the same prompt two days later to make sure that wasn't luck. Different specific failure mode inside the desktop sub-agent. Same orchestration response. The recovery isn't a one-off. It's the stack's default behavior under failure.

This post is about that — what router-vs-agent actually means once you've watched a model handle a failed tool call, and why most "agent" demos don't show this.

The setup

A single Strands agent with three tools:

  • browse_with_agentcore(instruction) — opens a fresh AgentCore Browser session, runs a browser-use sub-agent on a self-contained instruction, returns the result. Sub-agent uses Sonnet 4.5; outer agent never sees the screenshots.
  • desktop_action(instruction) — boots an ephemeral Linux container with Xvfb + LibreOffice + a file manager + a terminal, runs an Anthropic computer-use loop against it, returns the result. Container is torn down after.
  • save_file_to_disk(path, content) — host filesystem write, for the user's final artifact.

The outer model was Opus 4.8 on Bedrock. The system prompt told it to route browser-shaped tasks to the browser tool, desktop-shaped tasks to the desktop tool, and to save the final output via the file tool.

The test prompt was a real cross-app workflow:

"Find 3 highly-rated Mexican restaurants near 525 Market Street on Yelp. Then create a spreadsheet at /tmp/lunch.ods inside the desktop container with columns Name | Rating | Walking time, filled with the 3 restaurants."

Browser tier for the search. Desktop tier for the spreadsheet. The exact case the hierarchy in Post 1 calls out as one of the legitimate uses of computer use — a workflow that genuinely spans tools no single tier covers.

Run 1: the desktop tool failed. The agent didn't.

Step 1 went cleanly. The outer model called browse_with_agentcore once with a self-contained Yelp instruction, got back three candidates with ratings and walking times, processed the result.

Step 2 was where things got interesting. The outer model called desktop_action to drive LibreOffice Calc through the GUI — open the app, click into a cell, type the headers, fill the data, save as ODS. The inner sub-agent hit max_turns wrestling the GUI. Recovery dialogs popped up. Empty windows. The kind of state browser-only agents never see, because browsers don't have desktop session detritus.

The container tore down. desktop_action returned to the outer agent with what amounted to "I tried but didn't finish." /tmp/lunch.ods did not exist.

A scripted automation would have crashed here. A naive agent would have called desktop_action again with the same instruction and hit the same wall.

What actually happened: the outer model noticed the failure was about driving the GUI, and called desktop_action a second time with a completely different approach:

"Use libreoffice --headless from the command line. Build the spreadsheet from a CSV. Don't use the GUI."

The second call worked. The inner sub-agent ran a bash script that wrote a CSV to /tmp/, ran libreoffice --headless --convert-to ods /tmp/lunch.csv, verified the file existed with ls -la, and returned success. Outer agent then reported what it did, including the failed first attempt, and flagged that the file lives inside the ephemeral container.

Total wall: 21:52. Cost: $5.44. Two desktop_action calls — one failed, one succeeded — orchestrated by an outer model that wasn't told to retry.

Run 2 confirmed the pattern

I re-ran the exact same prompt two days later. Different failure mode inside the desktop sub-agent (this time a stuck recovery dialog instead of max_turns), but the same orchestration behavior: the outer model recognized it as a GUI problem and pivoted to a non-GUI approach on the second call.

Total wall time was longer (30:18 vs 21:52) due to a harder Yelp session, but cost was nearly identical ($5.41 vs $5.44). The caching benefit scaled with the longer browser session.

Two runs. Two different internal failures. Two different recoveries. Same cost. The behavior wasn't luck — it was the architecture.

The real test

Most production "AI agents" fail one specific test that the hybrid run passed: when a tool fails, they don't write a different instruction conditional on why it failed.

They retry the same call, return a generic error, or escalate to a human. The hybrid run did something different — the inner sub-agent reported "max_turns hit fighting the GUI," and the outer model wrote a new instruction that skipped the GUI entirely.

A router-with-retry sends the same instruction twice. A router-with-smart-recovery follows a predeclared fallback. What the hybrid architecture demonstrated is the model generating a genuinely new instruction whose content depends on the specific failure that just occurred.

This is the property most vendor demos imply but don't deliver. The happy path (model picks tool, tool succeeds) is easy to show. Take the happy path away and most systems collapse, retry verbatim, or hand off to a human. The hybrid run generated a new approach on the fly — twice, across different failure modes — because the architecture made the failure legible to the layer above it.

The test isn't whether the agent picks the right tool. It's whether the model writes a new instruction conditional on what just failed.

Why this works (and where it breaks)

The recovery isn't magic. It falls out of the architecture. The hybrid stack has three distinct layers, each doing one job:

┌──────────────────────────────────────────────────────┐
│  OUTER MODEL  (Opus 4.8 via Bedrock)                 │
│  Strategy. Picks tools. Handles failures.            │
│  Reads text reports from sub-agents.                 │
│  Never sees a screenshot.                            │
└──────────────────────┬───────────────────────────────┘
                       │  ONE tool call per sub-task
                       │  (instruction in, summary out)
                       ▼
┌──────────────────────────────────────────────────────┐
│  INNER SUB-AGENT  (Sonnet 4.5 in browser-use,        │
│                    Opus 4.8 in computer-use)         │
│  Tactics. Screenshot → reason → click → screenshot.  │
│  All pixel reasoning lives here.                     │
│  Hundreds of screenshots in context per call.        │
└──────────────────────┬───────────────────────────────┘
                       │  CDP commands / xdotool actions
                       ▼
┌──────────────────────────────────────────────────────┐
│  SANDBOX  (managed Chromium / ephemeral container)   │
│  Execution. No reasoning. Pure action surface.       │
│  Torn down after each tool call. No shared state.    │
└──────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Three properties of this layering give you self-recovery:

The outer model never sees the screenshots. Inside desktop_action, hundreds of screenshots stack in the inner sub-agent's context. The outer Opus 4.8 only sees the inner sub-agent's text report at the end. When the inner agent reports "hit max_turns wrestling the GUI," the outer model is reasoning over a clean text summary, not a megabyte of pixel context. Failures are summarized, which makes them analyzable.

Each tool call is one round trip with a fresh sandbox. No shared state across calls. The first failed desktop_action call left no residue — the container was torn down, a new one booted. The outer model could call again with completely different instructions and not worry about contaminated state. Stateful retry systems often fail because the state from the first attempt corrupts the second.

The outer model is reasoning over text, not pixels. When the inner sub-agent reports "the GUI is fighting me," the outer model doesn't need to interpret a screenshot of the GUI. It needs to read English. Models are much better at the latter, and the retry decision is being made by the much-better-at-it half of the system.

The decision flow under failure looks like this:

   inner sub-agent fails
            │
            ▼
   text summary returns to outer model
            │
            ▼
   ┌────────────────────────────────┐
   │ outer model reads:              │
   │ "max_turns. GUI fought me."     │
   └────────────┬────────────────────┘
                │
                ▼
   outer model reasons: "GUI is the problem"
                │
                ▼
   outer model writes new instruction:
   "Skip the GUI. Use the CLI."
                │
                ▼
   second tool call → fresh sandbox → success
Enter fullscreen mode Exit fullscreen mode

A router-with-retry would re-send the same instruction. An agent generates a new one based on why the first failed.

This works in 2026 because Opus 4.8 is good enough to read a failure report and choose a different approach. It would not have worked reliably on Sonnet 3.5 in 2024 — that model would have either retried verbatim or thrown up its hands. Self-recovery is a capability that scaled with the model.

Where it breaks: failures the inner sub-agent doesn't report cleanly. If the sub-agent silently produces wrong output (rather than failing visibly), the outer model has no signal that something went wrong. Garbage in, summarized garbage out. The recovery loop only works if failure is legible to the layer above. Most production agents fail not because the model crashes but because the model produces plausible-looking output that's wrong, and nothing checks. The hybrid architecture doesn't fix that problem; it makes one specific class of failures (visible tool failures) recoverable.

How to test what you actually have

Most systems sold as "agents" in 2026 still fail the test the hybrid run passed.

To find out what yours actually does, make one of its tools fail on purpose in a way the developer didn't anticipate. Return a clear failure message ("I couldn't complete this step") and watch the next instruction the model generates.

  • If it retries the exact same call → you have a router with retry.
  • If it follows a pre-written fallback → you have a workflow with branches.
  • If it generates a genuinely new instruction whose content depends on why the previous one failed → you have the property most marketing copy promises and most systems lack.

The hybrid architecture surfaces this behavior because failures are summarized as text to a capable outer model, and each tool call runs in a fresh sandbox. Most production stacks collapse the layers or hide the failure signal, so the model never gets a clean chance to adapt.

This is a lower bar than full agency. It is also the bar most current "agent" products still miss.

The thing I'm still uncertain about

The recovery pattern held across two runs with different failure modes. That's enough to treat the behavior as real for this architecture.

The open question is the failure mode of self-recovery itself.

Two risks stand out:

  • Recovery loops: the outer model keeps trying new approaches that also fail, burning cost in a loop. A hard cap on recovery attempts would contain this, but I haven't implemented one.
  • False-positive recovery: the outer model misreads a successful (but verbose) sub-agent report as failure and re-runs work unnecessarily.

Both are real risks worth designing against. I simply haven’t run enough volume on this architecture yet to know how frequently they appear.

So what

Three things, depending on what you're building.

If you're shipping a product called "agentic," be honest about which bar your system clears. A workflow with predeclared retry isn't routing-with-smart-recovery. Routing-with-smart-recovery isn't full agency. The marketing doesn't have to lie if it picks the right verb. Vendors that overclaim today will be embarrassed when the next round of evals separates the categories.

If you're evaluating someone else's "agent," fail one of its tools on purpose in a way the developer didn't anticipate. Not a hardcoded fallback path. A real off-script failure. Watch what the next instruction looks like — same as the first, mechanically retried, or genuinely new and conditional on the failure report. The latter is rare and worth paying for.

If you're building this kind of hybrid setup, the layering (strong outer model on clean text summaries + fresh sandboxes per sub-task) is what produces the self-recovery behavior for free. Collapsing the layers (putting computer use directly in the outer loop) removes that property. Every screenshot then stacks in the outer model's context, failures become harder to summarize, and the model loses the clean signal it needs to adapt.

The model picking the right tool is table stakes. The model writing a genuinely new instruction when the chosen tool fails — that's the test most things sold as agents in 2026 still don't pass.


Part 3 of the Agent Tooling series.
← Part 2: Harness optimization

Top comments (0)