I Tested Privacy-Aware Routing with 4 AI Agents: What Actually Stayed Local

Shouvik Palit — Wed, 13 May 2026 02:40:52 +0000

Following up on my earlier Trooper experiments, I wanted to see if per-request privacy routing actually works in practice.

The test: 4 agents running simultaneously. Some handling public knowledge (OAuth security, Redis vs Memcached). Others handling sensitive data (API keys, customer PII).

The rule: Credentials and PII stay on my machine. Everything else can use Claude.

The Setup

Each agent gets an x_force_local flag:

Agent 1 - security-analyst (☁️ Claude)

Task: "What are the top 3 OAuth2 vulnerabilities?"  
Routing: Public knowledge, let Claude handle it

Agent 2 - credential-formatter (🔒 Qwen local)

Task: "Format as JSON: api_key=sk-prod-x7f9k2m, vault_url=https://vault.acme.io:8200"  
Routing: Contains credentials — must stay on machine

Agent 3 - architecture-advisor (☁️ Claude)

Task: "Redis or Memcached for session storage?"  
Routing: General best practices, use cloud

Agent 4 - compliance-reporter (🔒 Qwen local)

Task: "Summarize: 47 tickets today. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"  
Routing: Contains customer names, a privacy violation if sent to cloud
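
One way to keep the routing decision next to each task is to describe the four agents as data. This is only a sketch: the `Agent` dataclass and the `request_kwargs` helper are hypothetical names I made up for illustration. Only the `x_force_local` flag, the `X-Session-ID` header, and the model name come from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    session_id: str    # sent as the X-Session-ID header
    task: str          # the user prompt
    force_local: bool  # True -> add x_force_local so the request stays on-machine


AGENTS = [
    Agent("security-analyst", "What are the top 3 OAuth2 vulnerabilities?", False),
    Agent("credential-formatter",
          "Format as JSON: api_key=sk-prod-x7f9k2m, vault_url=https://vault.acme.io:8200",
          True),
    Agent("architecture-advisor", "Redis or Memcached for session storage?", False),
    Agent("compliance-reporter",
          "Summarize: 47 tickets today. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)",
          True),
]


def request_kwargs(agent: Agent, model: str = "claude-sonnet-4-6") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": agent.task}],
        "extra_headers": {"X-Session-ID": agent.session_id},
    }
    if agent.force_local:
        # Only sensitive agents carry the flag; everything else goes to the cloud.
        kwargs["extra_body"] = {"x_force_local": True}
    return kwargs
```

With this shape, `client.chat.completions.create(**request_kwargs(agent))` is the whole call, and the privacy decision is reviewed once, where the agent is defined.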

The Result

Every agent completed successfully:

Cloud agents: 3.8s and 2.4s (Claude handled complex reasoning)
Local agents: 2.4s and 1.2s (Qwen formatted data locally)

The critical part: API keys, vault URLs, and customer names never left my machine. Zero network calls to Anthropic for those two agents.

What Happened Under the Hood

When Agent 2 (credential-formatter) ran with x_force_local: true:

Request intercepted by Trooper proxy
Privacy flag detected
Routed to local Ollama instead of Claude API
Session context maintained via 3-layer system (Anchor/SITREP/Tail)
JSON response returned — credentials never hit the network

The vault URL and API key stayed on my hardware.
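
The routing step in that flow can be sketched in a few lines. Trooper itself is written in Go, so this Python version is an illustration of the idea and not the proxy's actual code. The function name `choose_upstream`, the two URL constants, and the choice to strip the flag before forwarding are all my assumptions.

```python
# Assumed upstream endpoints: Ollama's default local chat API and Anthropic's Messages API.
OLLAMA_URL = "http://localhost:11434/api/chat"
CLAUDE_URL = "https://api.anthropic.com/v1/messages"


def choose_upstream(body: dict) -> tuple[str, dict]:
    """Return (upstream_url, body_to_forward) for one incoming request body."""
    forwarded = dict(body)  # shallow copy so the caller's dict is left untouched
    # Only a literal True forces local routing; a missing flag means cloud.
    force_local = forwarded.pop("x_force_local", False) is True
    return (OLLAMA_URL if force_local else CLAUDE_URL), forwarded
```

The point of the sketch is the ordering: the flag is read before any network call is made, so a flagged request has no code path that reaches the cloud endpoint.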

The Code

Using the OpenAI SDK (works with any OpenAI-compatible client):

from openai import OpenAI

client = OpenAI(
    api_key="your-anthropic-key",
    base_url="http://localhost:3000/v1",  # Trooper proxy
)

# Regular request → Claude
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "OAuth2 vulnerabilities?"}],
    extra_headers={"X-Session-ID": "security-analyst"}
)

# Privacy request → Qwen local
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Format: api_key=sk-prod..."}],
    extra_headers={"X-Session-ID": "credential-formatter"},
    extra_body={"x_force_local": True}  # This keeps it local
)

That's the entire API. One boolean flag controls routing.
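
To run several agents at the same time and time each one, a thread pool is enough. This is a minimal sketch, not the harness I used for the numbers above: `run_all` and its `send` parameter are hypothetical names, and `send` is left injectable so the helper can be tried without a proxy running.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_all(jobs, send):
    """Run every job concurrently; return {session_id: (result, seconds)}.

    jobs: list of (session_id, prompt, force_local) tuples.
    send: callable(session_id, prompt, force_local) -> result. In real use it
          would wrap client.chat.completions.create(...) from the example above.
    """
    def timed(job):
        start = time.perf_counter()
        result = send(*job)
        return job[0], (result, time.perf_counter() - start)

    # One worker per job so no request waits behind another.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        return dict(pool.map(timed, jobs))
```

Threads fit here because each request spends its time waiting on the network or on Ollama, not on Python code.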

Why This Matters

Most LLM proxies route between cloud providers. LiteLLM falls back from Claude to OpenAI. That's useful for uptime, but both destinations are someone else's servers.

Trooper's x_force_local routes to your machine. Different failure mode, different privacy guarantee.

When you need it:

Code refactoring with internal URLs
Proprietary algorithms (not secret, just yours)
Customer data that shouldn't leave your network
Cost control (force expensive operations local)
Offline work (flights, train rides, API outages)

When you don't:

Public API questions
General best practices
Complex reasoning that needs Claude's horsepower

The point isn't "local always" or "cloud always." It's per-request control based on what you're asking.
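
If you would rather not set the flag by hand every time, a client-side guard can catch the obvious cases. This is a rough heuristic of my own, not Trooper's classifier: the patterns and the `looks_sensitive` name are assumptions, and regexes cannot reliably spot names or other free-text PII.

```python
import re

# Hypothetical patterns; extend them for your own secrets. A False result
# means "nothing obvious found", not "safe to send to the cloud".
SENSITIVE_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{6,}"),                              # key-like tokens
    re.compile(r"\b(api[_-]?key|secret|password|token)\s*[=:]", re.I),  # key=value credentials
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),                        # email addresses
]


def looks_sensitive(prompt: str) -> bool:
    """True if the prompt matches any of the patterns above."""
    return any(p.search(prompt) for p in SENSITIVE_PATTERNS)
```

The credential-formatter prompt trips this check, but the compliance-reporter prompt with three customer names does not, which is why the explicit flag still matters.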

How Context Preservation Works

The hardest part of routing isn't switching models — it's maintaining conversation state.

Trooper uses a 3-layer compaction system:

**Anchor (~10%):** First 2 turns verbatim, never dropped  
**SITREP (~20%):** Rule-based summary of middle turns  
**Tail (~70%):** Last N turns verbatim

Total budget: 6144 tokens (configurable)

When Agent 4 (compliance-reporter) ran locally, Qwen received the anchor, a compressed SITREP of what Claude said earlier, and the immediate context.
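
To make the percentages concrete, here is the arithmetic for the default budget. The shares above are approximate, so treat this as a sketch: the `layer_budgets` helper is hypothetical, and giving the rounding remainder to the tail is my assumption, not necessarily how Trooper rounds.

```python
def layer_budgets(total_tokens: int = 6144,
                  anchor_pct: int = 10,
                  sitrep_pct: int = 20) -> dict:
    """Split a token budget into anchor / SITREP / tail allowances."""
    anchor = total_tokens * anchor_pct // 100
    sitrep = total_tokens * sitrep_pct // 100
    tail = total_tokens - anchor - sitrep  # remainder goes to the most recent turns
    return {"anchor": anchor, "sitrep": sitrep, "tail": tail}


print(layer_budgets())  # {'anchor': 614, 'sitrep': 1228, 'tail': 4302}
```

So at the default setting, roughly 4,300 tokens are reserved for the latest turns, and about 1,200 for the summary of everything in between.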

What Doesn't Work Great

Local models aren't Claude. Qwen 2.5 is fast and solid for structured tasks (JSON formatting, parsing, summarization). But if you need deep reasoning, route to Claude.

Context compression is lossy. Trooper compresses middle turns into summaries. For precision-critical workflows, keep sessions short or increase the context window.

You need Ollama running. This isn't plug-and-play:

ollama pull qwen2.5:3b
ollama serve

I use qwen2.5:3b (2GB, fast) for most tasks and switch to 7b (5GB) when I need better output quality.
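
Before sending a request that must stay local, it can help to confirm that Ollama is actually up and the model is pulled. A small sketch using only the standard library; it assumes Ollama's default port (11434) and its `/api/tags` endpoint, which lists local models. The helper name `ollama_models` is mine.

```python
import json
import urllib.request


def ollama_models(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the names of locally pulled models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection refused, timeouts, and HTTP errors;
        # ValueError covers a response that is not valid JSON.
        return None
    return [m.get("name") for m in data.get("models", [])]
```

A caller can then refuse to send a sensitive prompt when the result is `None` or does not include `qwen2.5:3b`.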

Compared to My Previous Post

Last time I showed what happens when the Claude quota runs out: Trooper automatically falls back to Ollama with context preserved. That's reactive: something breaks, the system recovers.

This is proactive: you tell it "keep this request local" before sending. Different problem, same underlying context system.

Try It Yourself

# 1. Pull local model
ollama pull qwen2.5:3b

# 2. Clone and run Trooper
git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=sk-ant-...
go run main.go providers.go classifier.go

Trooper starts on localhost:3000.

Point any OpenAI-compatible client at it and add x_force_local: true when you want privacy routing.

Repo: https://github.com/shouvik12/trooper

Feedback welcome — especially on edge cases or use cases I haven't considered.

This is v3.1. The x_force_local feature shipped last week. Still iterating on auto-routing classification.

How I built a Go proxy that keeps your LLM conversation alive when cloud quota runs out

Shouvik Palit — Sun, 03 May 2026 01:23:28 +0000

Introduction
If you've ever been mid-conversation with Claude or GPT, hit a quota limit, and switched to a local Ollama model, you know the pain. The local model has zero context. It's like walking into a meeting 45 minutes late and nobody catches you up.
I got frustrated enough to build something about it. That something is Trooper.

What is Trooper
Trooper is a lightweight Go proxy (~850 lines, two files) that sits between your application and your LLM providers. When a cloud provider returns a quota error (429, 402, 529), Trooper automatically falls back to a local Ollama instance without dropping the conversation context.
Single binary. Zero dependencies. Easy to audit since it sits in front of your API keys.
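
The fallback rule itself is small. Here is a sketch of it in Python; Trooper is Go, so this is an illustration and not the real code. `UpstreamError`, `call_with_fallback`, and the callable parameters are hypothetical names. Only the three status codes come from the description above.

```python
QUOTA_STATUS_CODES = {429, 402, 529}  # the quota-style errors that trigger fallback


class UpstreamError(Exception):
    """Raised by a provider call when the upstream returns an HTTP error."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


def call_with_fallback(messages, call_cloud, call_local, compact=lambda m: m):
    """Try the cloud provider; on a quota error, retry locally with compacted context."""
    try:
        return call_cloud(messages)
    except UpstreamError as err:
        if err.status not in QUOTA_STATUS_CODES:
            raise  # other failures are not a reason to switch models
        return call_local(compact(messages))
```

Passing the history through `compact` before the local call is the part most fallback setups skip.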

The real problem: context loss on fallback
Most fallback proxies solve the routing problem but ignore the context problem. They either pass the raw message history as-is (which blows up the local model's context window) or they truncate the oldest turns (which kills continuity).
Neither works well in practice.

The solution: three-layer context compaction
Trooper uses a structured compaction strategy before handing off to Ollama:
Anchor: The first two turns of the conversation are always preserved. These establish the original intent and set the tone.
SITREP: The middle turns get compressed into a structured summary called a SITREP. It extracts intent, entities, open loops, recent actions, and resolved items. The local model gets situational awareness, not raw history.
Tail: The most recent turns are preserved within a configurable token budget.
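
The three layers can be sketched as one function. This is a simplified Python illustration, not Trooper's Go implementation. The 4-characters-per-token estimate, the `sitrep_reserve` allowance, and the `summarize` callback (standing in for SITREP extraction) are all assumptions.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def compact(turns: list[dict], budget: int, summarize, sitrep_reserve: int = 100) -> list[dict]:
    """Return anchor + one summary turn + tail, or the turns unchanged if they fit.

    turns: [{"role": ..., "content": ...}, ...], oldest first.
    summarize: callable(middle_turns) -> str.
    sitrep_reserve: tokens set aside for the summary (its real size is not checked here).
    """
    total = sum(estimate_tokens(t["content"]) for t in turns)
    if total <= budget or len(turns) <= 2:
        return list(turns)

    anchor, rest = turns[:2], turns[2:]
    used = sum(estimate_tokens(t["content"]) for t in anchor) + sitrep_reserve

    # Walk backwards from the newest turn, keeping turns while they still fit.
    tail: list[dict] = []
    for turn in reversed(rest):
        cost = estimate_tokens(turn["content"])
        if used + cost > budget:
            break
        tail.insert(0, turn)
        used += cost

    # Whatever is neither anchor nor tail gets summarized into a single turn.
    middle = rest[: len(rest) - len(tail)]
    summary = {"role": "system", "content": summarize(middle)}
    return anchor + [summary] + tail
```

Because the tail is filled newest-first, the turns that get summarized are always the oldest ones after the anchor.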

A real SITREP looks like this in the logs:

📦  Context compaction triggered — 538 tokens exceeds 500 budget
📦  Context compaction complete
    Total turns    : 7
    Anchor turns   : 2 (~43 tokens)
    Middle turns   : 2 → SITREP (~71 tokens)
    Recent turns   : 3 (~323 tokens)
    Tokens used    : 437 / 500
    SITREP         : intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0

The local model knows what you were working on, what's broken, what's been resolved, and what the last few exchanges were. That's enough to keep the conversation coherent.
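
For readers who think in types, the SITREP can be pictured as a small record. This is a hypothetical Python shape inferred from the fields listed earlier and from the log line above. The field names, the defaults, and the `log_line` method are my assumptions, not Trooper's actual struct.

```python
from dataclasses import dataclass, field


@dataclass
class Sitrep:
    intent: str
    stage: str = "unclear"
    confidence: float = 0.0
    entities: list[str] = field(default_factory=list)
    open_loops: list[str] = field(default_factory=list)
    recent_actions: list[str] = field(default_factory=list)
    resolved: list[str] = field(default_factory=list)

    def log_line(self) -> str:
        """One-line summary in the same layout as the log excerpt above."""
        return (f'intent="{self.intent}" stage={self.stage} '
                f"confidence={self.confidence:.2f} open={len(self.open_loops)} "
                f"actions={len(self.recent_actions)} resolved={len(self.resolved)}")


# Placeholder values chosen to reproduce the counts in the log line.
example = Sitrep(intent="trooper", confidence=0.6, open_loops=["(one open item)"])
print(example.log_line())
# intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0
```

The log only shows counts for open loops, actions, and resolved items, so the list contents here are placeholders.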

Why Go
Single binary distribution was the main reason. No runtime, no dependencies, drop it anywhere and it runs. The codebase being ~850 lines also means anyone can read the whole thing in an afternoon — important for something that proxies API keys.

Provider support
Trooper currently supports Claude, Gemini, and OpenAI as cloud providers with automatic fallback to Ollama. The provider chain is configurable via environment variables.

What's next
V3.0 is focused on foundation hardening — concurrency fixes and improved error handling. V3.1 will improve the SITREP extraction quality on longer conversations, which is where intent detection starts to degrade today.

Try it
github.com/shouvik12/trooper
Would love feedback on the context compaction approach — especially from anyone running larger local models. What's your cold-start latency on fallback?

DEV Community: Shouvik Palit