DEV Community

Shouvik Palit
Shouvik Palit

Posted on

I Added a /recovery Endpoint to My LLM Proxy So Agents Never Lose Progress Mid-Task

Most LLM proxies handle failures the same way — retry the request, fall back to another provider, or crash. None of them ask the more important question: what did the agent already complete before it failed?

That's the gap I built Trooper to fill.


The Problem

If you're running multi-agent workflows, you've probably hit this scenario:

  • A subagent starts a long task — reviewing PRs, processing documents, running analysis
  • It completes steps 1, 2, and 3
  • On step 4 it hits a quota error, rate limit, or provider failure
  • Your orchestration layer has no idea what completed
  • You restart from scratch and repeat all the work

Most proxies handle the failure. Nobody handles the recovery.


What Trooper Does Differently

Trooper is a Go-based LLM proxy that sits between your agents and your LLM providers. It already handled fallback routing — if Claude hits quota, it falls back to local Ollama automatically.

But the new /recovery/{session_id} endpoint goes further. It tracks every step your agent completes in real time and tells your orchestration layer exactly where to resume.


The Recovery Endpoint

When your agent sends requests through Trooper, it captures every assistant response and extracts completed steps as they happen. When something fails, you call:

GET http://localhost:3000/recovery/{session_id}
Enter fullscreen mode Exit fullscreen mode

And Trooper returns:

{
  "session_id": "subagent-demo-1779630533",
  "completed_steps": [
    "completed pr #1",
    "completed pr #2",
    "completed pr #3"
  ],
  "resume_from": 4,
  "recovery_hint": "Resume from step 4"
}
Enter fullscreen mode Exit fullscreen mode

Your parent agent now knows exactly what the subagent finished and where to restart it. No repeated work. No lost progress.


Demo

An agent reviewing 8 pull requests hits quota on PR #4. Trooper intercepts, returns the recovery payload, and the agent resumes from PR #4 using local Ollama.

[https://www.youtube.com/watch?v=NN2uwQZDCck]


How It Works

Trooper uses a two-tier memory system:

Anchor — the first two turns of a session, always preserved verbatim.

Tail — the most recent turns, stored in a rolling window.

When you call /recovery, Trooper scans all stored assistant messages for completion signals — words like "completed", "finished", "done", "merged", "deployed". It extracts one completed step per message, deduplicates by task identifier, and returns the ordered list.

The resume_from field is simply len(completed_steps) + 1 — telling your orchestration layer which step to restart the subagent on.


How to Use It

1. Start Trooper

git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=your-key
go run .
Enter fullscreen mode Exit fullscreen mode

2. Point your agent at Trooper instead of Claude

# Instead of https://api.anthropic.com/v1/messages
POST http://localhost:3000/v1/chat/completions
Enter fullscreen mode Exit fullscreen mode

3. Pass a session ID with each request

curl -X POST http://localhost:3000/v1/chat/completions \
  -H "X-Session-ID: my-agent-session-1" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-haiku-4-5","max_tokens":100,"messages":[...]}'
Enter fullscreen mode Exit fullscreen mode

4. Call recovery when something fails

curl http://localhost:3000/recovery/my-agent-session-1
Enter fullscreen mode Exit fullscreen mode

What's Next

The recovery endpoint is the foundation for proper subagentic orchestration. Upcoming work:

  • Parent agent integration — the recovery payload feeds directly back into the orchestration layer to automatically restart subagents from the right step
  • Structured step tracking — support for agents that emit structured JSON progress instead of natural language
  • Session replay — rewind any session to any point and branch from there

The Bigger Picture

As agent workflows get longer and more complex, failure recovery becomes a first-class concern. Trooper's approach — track everything as it happens, make recovery queryable — is a different philosophy from retry-and-hope.

Local-first by default. Cloud when you choose. And now recoverable when things go wrong.

GitHub: https://github.com/shouvik12/trooper


How are you handling agent failures in your workflows today? Drop a comment genuinely curious what patterns people are using.

Top comments (0)