Shouvik Palit

I Tested Privacy-Aware Routing with 4 AI Agents: What Actually Stayed Local

Following up on my earlier Trooper experiments, I wanted to see if per-request privacy routing actually works in practice.

The test: 4 agents running simultaneously. Some handling public knowledge (OAuth security, Redis vs Memcached). Others handling sensitive data (API keys, customer PII).

The rule: Credentials and PII stay on my machine. Everything else can use Claude.

The Setup

Each agent gets an x_force_local flag:

Agent 1 - security-analyst (☁️ Claude)

Task: "What are the top 3 OAuth2 vulnerabilities?"  
Routing: Public knowledge, let Claude handle it

Agent 2 - credential-formatter (🔒 Qwen local)

Task: "Format as JSON: api_key=sk-prod-x7f9k2m, vault_url=https://vault.acme.io:8200"  
Routing: Contains credentials, must stay on the machine

Agent 3 - architecture-advisor (☁️ Claude)

Task: "Redis or Memcached for session storage?"  
Routing: General best practices, use cloud

Agent 4 - compliance-reporter (🔒 Qwen local)

Task: "Summarize: 47 tickets today. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"  
Routing: Contains customer names — privacy violation if sent to cloud
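
Taken together, the four configs boil down to one boolean each. Here's that routing table as a sketch (the dict structure is mine for illustration; x_force_local is the only field Trooper actually reads):

# Illustrative routing table for the four agents above.
AGENTS = {
    "security-analyst":     {"x_force_local": False},  # public knowledge -> Claude
    "credential-formatter": {"x_force_local": True},   # credentials -> Qwen local
    "architecture-advisor": {"x_force_local": False},  # best practices -> Claude
    "compliance-reporter":  {"x_force_local": True},   # customer PII -> Qwen local
}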

The Result

Every agent completed successfully:

  • Cloud agents: 3.8s and 2.4s (Claude handled complex reasoning)
  • Local agents: 2.4s and 1.2s (Qwen formatted data locally)

The critical part: API keys, vault URLs, and customer names never left my machine. Zero network calls to Anthropic for those two agents.

What Happened Under the Hood

When Agent 2 (credential-formatter) ran with x_force_local: true:

  1. Request intercepted by Trooper proxy
  2. Privacy flag detected
  3. Routed to local Ollama instead of Claude API
  4. Session context maintained via 3-layer system (Anchor/SITREP/Tail)
  5. JSON response returned — credentials never hit the network

The vault URL and API key stayed on my hardware.
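
The decision itself is a single branch. Here's a conceptual sketch in Python (Trooper itself is written in Go; the upstream URLs below are my assumptions, using Ollama's default port):

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port
CLOUD_URL = "https://api.anthropic.com/v1/messages"        # Claude API

def pick_upstream(body: dict) -> str:
    # The privacy flag wins unconditionally: no classifier, no fallback
    if body.get("x_force_local"):
        return OLLAMA_URL  # request never leaves the machine
    return CLOUD_URL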

The Code

Using the OpenAI SDK (works with any OpenAI-compatible client):

from openai import OpenAI

client = OpenAI(
    api_key="your-anthropic-key",
    base_url="http://localhost:3000/v1",  # Trooper proxy
)

# Regular request → Claude
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "OAuth2 vulnerabilities?"}],
    extra_headers={"X-Session-ID": "security-analyst"}
)

# Privacy request → Qwen local
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Format: api_key=sk-prod..."}],
    extra_headers={"X-Session-ID": "credential-formatter"},
    extra_body={"x_force_local": True}  # This keeps it local
)

That's the entire API. One boolean flag controls routing.
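
The OpenAI SDK merges extra_body keys into the top level of the request JSON, so the proxy sees the flag inline. The equivalent raw request would look like this (my reconstruction, not code from the repo):

import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",
    headers={"X-Session-ID": "credential-formatter"},
    json={
        "model": "claude-sonnet-4-6",
        "messages": [{"role": "user", "content": "Format: api_key=sk-prod..."}],
        "x_force_local": True,  # same flag, now visible in the request body
    },
)
print(resp.json()["choices"][0]["message"]["content"])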

Why This Matters

Most LLM proxies route between cloud providers. LiteLLM falls back from Claude to OpenAI. That's useful for uptime, but both destinations are someone else's servers.

Trooper's x_force_local routes to your machine. Different failure mode, different privacy guarantee.

When you need it:

  • Code refactoring with internal URLs
  • Proprietary algorithms (not secret, just yours)
  • Customer data that shouldn't leave your network
  • Cost control (force expensive operations local)
  • Offline work (flights, train rides, API outages)

When you don't:

  • Public API questions
  • General best practices
  • Complex reasoning that needs Claude's horsepower

The point isn't "local always" or "cloud always." It's per-request control based on what you're asking.

How Context Preservation Works

The hardest part of routing isn't switching models — it's maintaining conversation state.

Trooper uses a 3-layer compaction system:

  • Anchor (~10%): First 2 turns verbatim, never dropped
  • SITREP (~20%): Rule-based summary of middle turns
  • Tail (~70%): Last N turns verbatim

Total budget: 6144 tokens (configurable)

When Agent 4 (compliance-reporter) ran locally, Qwen received the anchor, a compressed SITREP of what Claude said earlier, and the immediate context.
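
In code, the layering works out to roughly this (a sketch of the split, assuming the percentages above; the summarizer stub and function names are mine, not Trooper's internals):

BUDGET = 6144  # total token budget: ~614 anchor, ~1229 SITREP, ~4301 tail

def summarize(turns: list[str]) -> str:
    # Stand-in for Trooper's rule-based summarizer
    return " | ".join(t[:80] for t in turns)

def compact(turns: list[str], tail_n: int = 4) -> list[str]:
    anchor = turns[:2]                         # Anchor: first 2 turns, verbatim
    cut = max(2, len(turns) - tail_n)
    middle, tail = turns[2:cut], turns[cut:]   # Tail: last N turns, verbatim
    layers = list(anchor)
    if middle:
        layers.append("[SITREP] " + summarize(middle))  # compressed middle
    return layers + tail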

What Doesn't Work Great

Local models aren't Claude. Qwen 2.5 is fast and solid for structured tasks (JSON formatting, parsing, summarization). But if you need deep reasoning, route to Claude.

Context compression is lossy. Trooper compresses middle turns into summaries. For precision-critical workflows, keep sessions short or increase the context window.

You need Ollama running. This isn't plug-and-play:

ollama pull qwen2.5:3b
ollama serve
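
A quick sanity check that Ollama is reachable before launching agents (11434 is Ollama's default port):

import requests

try:
    requests.get("http://localhost:11434", timeout=2).raise_for_status()
    print("Ollama is up")
except requests.RequestException:
    print("Ollama not reachable: run `ollama serve` first")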

I use qwen2.5:3b (2GB, fast) for most tasks. Switch to 7b (5GB) when I need better output quality.

Compared to My Previous Post

Last time I showed what happens when Claude quota runs out: Trooper automatically falls back to Ollama with context preserved. That's reactive — something breaks, the system recovers.

This is proactive: you tell it "keep this request local" before sending. Different problem, same underlying context system.

Try It Yourself

# 1. Pull local model
ollama pull qwen2.5:3b

# 2. Clone and run Trooper
git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=sk-ant-...
go run main.go providers.go classifier.go

Trooper starts on localhost:3000.

Point any OpenAI-compatible client at it and add x_force_local: true when you want privacy routing.

Repo: https://github.com/shouvik12/trooper

Feedback welcome — especially on edge cases or use cases I haven't considered.


This is v3.1. The x_force_local feature shipped last week. Still iterating on auto-routing classification.
