Most LLM proxies handle failures the same way — retry the request, fall back to another provider, or crash. None of them ask the more important question: what did the agent already complete before it failed?
That's the gap I built Trooper to fill.
The Problem
If you're running multi-agent workflows, you've probably hit this scenario:
- A subagent starts a long task — reviewing PRs, processing documents, running analysis
- It completes steps 1, 2, and 3
- On step 4 it hits a quota error, rate limit, or provider failure
- Your orchestration layer has no idea what completed
- You restart from scratch and repeat all the work
Most proxies handle the failure. Nobody handles the recovery.
What Trooper Does Differently
Trooper is a Go-based LLM proxy that sits between your agents and your LLM providers. It already handled fallback routing — if Claude hits quota, it falls back to local Ollama automatically.
But the new /recovery/{session_id} endpoint goes further. It tracks every step your agent completes in real time and tells your orchestration layer exactly where to resume.
The Recovery Endpoint
When your agent sends requests through Trooper, it captures every assistant response and extracts completed steps as they happen. When something fails, you call:
GET http://localhost:3000/recovery/{session_id}
And Trooper returns:
{
"session_id": "subagent-demo-1779630533",
"completed_steps": [
"completed pr #1",
"completed pr #2",
"completed pr #3"
],
"resume_from": 4,
"recovery_hint": "Resume from step 4"
}
Your parent agent now knows exactly what the subagent finished and where to restart it. No repeated work. No lost progress.
Demo
An agent reviewing 8 pull requests hits quota on PR #4. Trooper intercepts, returns the recovery payload, and the agent resumes from PR #4 using local Ollama.
[https://www.youtube.com/watch?v=NN2uwQZDCck]
How It Works
Trooper uses a two-tier memory system:
Anchor — the first two turns of a session, always preserved verbatim.
Tail — the most recent turns, stored in a rolling window.
When you call /recovery, Trooper scans all stored assistant messages for completion signals — words like "completed", "finished", "done", "merged", "deployed". It extracts one completed step per message, deduplicates by task identifier, and returns the ordered list.
The resume_from field is simply len(completed_steps) + 1 — telling your orchestration layer which step to restart the subagent on.
How to Use It
1. Start Trooper
git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=your-key
go run .
2. Point your agent at Trooper instead of Claude
# Instead of https://api.anthropic.com/v1/messages
POST http://localhost:3000/v1/chat/completions
3. Pass a session ID with each request
curl -X POST http://localhost:3000/v1/chat/completions \
-H "X-Session-ID: my-agent-session-1" \
-H "Content-Type: application/json" \
-d '{"model":"claude-haiku-4-5","max_tokens":100,"messages":[...]}'
4. Call recovery when something fails
curl http://localhost:3000/recovery/my-agent-session-1
What's Next
The recovery endpoint is the foundation for proper subagentic orchestration. Upcoming work:
- Parent agent integration — the recovery payload feeds directly back into the orchestration layer to automatically restart subagents from the right step
- Structured step tracking — support for agents that emit structured JSON progress instead of natural language
- Session replay — rewind any session to any point and branch from there
The Bigger Picture
As agent workflows get longer and more complex, failure recovery becomes a first-class concern. Trooper's approach — track everything as it happens, make recovery queryable — is a different philosophy from retry-and-hope.
Local-first by default. Cloud when you choose. And now recoverable when things go wrong.
GitHub: https://github.com/shouvik12/trooper
How are you handling agent failures in your workflows today? Drop a comment genuinely curious what patterns people are using.
Top comments (0)