Anil Prasad

Posted on • Originally published at open.substack.com

I put one proxy in front of every AI tool my team uses: 85% cache hits, 75% lower cost

TL;DR — Your team's AI tools (Claude Code, Copilot, ChatGPT, Gemini, your own agents) each call the LLM API independently — no shared cache, no shared budget, no audit trail. AgentMesh is an open-source proxy that sits in front of all of them and runs every call through a three-layer cache, per-team quotas, cheapest-model routing, and a tamper-evident audit log. You point your tools at it with two env vars. On a reproducible benchmark (no API keys): 85% cache hits, 75% lower cost. Apache 2.0. → pip install agentmesh-proxy

The problem, in one sentence
Every AI tool on your team talks to the model on its own.
Claude Code has its own connection. Copilot has its own. The ChatGPT tab in someone's browser has its own. Your LangGraph service has its own. None of them share a cache, a budget, or an audit log — so the same prompt gets paid for over and over, a runaway loop in one service is invisible to the others, and nobody can answer "what did we send to third-party APIs last quarter?"
This isn't a discipline problem. It's a missing layer. So I built it.

60-second quickstart

pip install agentmesh-proxy sentence-transformers
agentmesh serve --port 8080 --demo

Point any tool at it — no code changes:

# Claude Code, or any Anthropic SDK tool
export ANTHROPIC_BASE_URL=http://localhost:8080

# Copilot / Cursor / any OpenAI SDK tool
export OPENAI_BASE_URL=http://localhost:8080/v1

Every response comes back with governance headers so you can see what happened:

X-AgentMesh-Cache:     hit          # exact | semantic | miss
X-AgentMesh-Tokens:    0            # 0 on a cache hit
X-AgentMesh-Cost-USD:  0.000000     # $0 on a cache hit

That's the whole integration. The agent code never knows the proxy is there.
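If you want to log those headers from a script, here is a minimal sketch. The header names come from the example above; `read_governance` and `Governance` are hypothetical helpers (not part of AgentMesh), and the sketch assumes the values arrive as plain strings formatted like the ones shown.

```python
from typing import Mapping, NamedTuple


class Governance(NamedTuple):
    cache: str       # cache status string as reported by the proxy
    tokens: int      # tokens billed for this call
    cost_usd: float  # cost billed for this call, in USD


def read_governance(headers: Mapping[str, str]) -> Governance:
    """Pull the three X-AgentMesh-* values out of a response's headers."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    return Governance(
        cache=lowered.get("x-agentmesh-cache", "unknown"),
        tokens=int(lowered.get("x-agentmesh-tokens", "0")),
        cost_usd=float(lowered.get("x-agentmesh-cost-usd", "0")),
    )


# Headers copied from the cache-hit example above.
example = {
    "X-AgentMesh-Cache": "hit",
    "X-AgentMesh-Tokens": "0",
    "X-AgentMesh-Cost-USD": "0.000000",
}
info = read_governance(example)
print(info)
```

Any HTTP client that exposes response headers as a mapping should be able to feed this function.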

How it works
Every call, whether it comes through the proxy or the SDK, runs the same ordered pipeline.

The interesting part is the cache, because it does something most "LLM caches" don't.

Exact-match caching almost never hits in real life, because people rephrase: they paste "You are a senior architect." in front of the question, switch between "optimise" and "optimize", wrap things in markdown. So before anything is hashed or embedded, AgentMesh normalizes the prompt, stripping the noise that doesn't change meaning:

from agentmesh.optimizer.normalizer import normalize_prompt

normalize_prompt("You are a senior architect. **Please** review this design...")
# -> "review this design ..."   (persona prefix, markdown, filler removed)
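To make the idea concrete, here is a rough sketch of what a normalizer like this could look like. It is not AgentMesh's implementation: `sketch_normalize` and every pattern below are assumptions chosen to cover the kinds of noise described above (persona prefix, markdown, filler words, spelling variants).

```python
import re

# Assumed noise patterns, for illustration only; the real rules may differ.
PERSONA_PREFIX = re.compile(r"^\s*you are (?:a|an|the) [^.]*\.\s*", re.IGNORECASE)
MARKDOWN_MARKS = re.compile(r"[*`#]+")
FILLER_WORDS = re.compile(r"\b(?:please|kindly|thanks|thank you)\b", re.IGNORECASE)
SPELLING_VARIANTS = {"optimise": "optimize", "analyse": "analyze"}


def sketch_normalize(prompt: str) -> str:
    """Strip noise that does not change what the prompt is asking for."""
    text = PERSONA_PREFIX.sub("", prompt)   # drop a leading "You are a ..." sentence
    text = MARKDOWN_MARKS.sub(" ", text)    # drop emphasis, code, and heading marks
    text = FILLER_WORDS.sub(" ", text)      # drop politeness filler
    text = text.lower()
    for variant, canonical in SPELLING_VARIANTS.items():
        text = re.sub(rf"\b{variant}\b", canonical, text)
    return " ".join(text.split())           # collapse runs of whitespace


print(sketch_normalize("You are a senior architect. **Please** review this design"))
print(sketch_normalize("Optimise   this query"))
```

The point of the design is that two prompts differing only in this kind of noise collapse to the same string, so they hash to the same exact-match key before any embedding is needed.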

The normalized prompt is then embedded and compared against cached prompts by cosine similarity:

from agentmesh import SemanticCache

cache = SemanticCache(similarity_threshold=0.70)   # tune per workload
cache.put("Review this microservices design for scaling issues", response)

# Different wording, same intent -> still a hit
hit = cache.get("Analyze this distributed system design")
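For readers who want to see the mechanics, here is a self-contained sketch of a threshold-based similarity cache. It is not the AgentMesh `SemanticCache`: `SketchSemanticCache` and `toy_embed` are hypothetical names, and the bag-of-words vector is a deliberately crude stand-in for a sentence-embedding model, so it only matches prompts that share words rather than prompts that share intent.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SketchSemanticCache:
    def __init__(self, similarity_threshold: float = 0.70) -> None:
        self.similarity_threshold = similarity_threshold
        self._entries = []  # list of (vector, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((toy_embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the closest cached response if it clears the threshold."""
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.similarity_threshold else None


cache = SketchSemanticCache(similarity_threshold=0.70)
cache.put("review this microservices design for scaling issues", "cached review")
near = cache.get("review this microservices design for scaling problems")
far = cache.get("write a haiku about autumn")
print(near, far)
```

The threshold is the knob that trades hit rate against the risk of returning an answer to a different question, which is why it is worth tuning per workload.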

Normalize, then embed: that's the whole trick. It's the difference between a cache that almost never hits and one that hits 85% of the time on the benchmark below.

And because every call already flows through one interceptor, a tamper-evident audit log is almost free — each entry is hash-chained (SHA-256) and signed with Ed25519:

from agentmesh import AuditTrail
trail = AuditTrail()
# ... calls happen ...
assert trail.verify()   # walks the chain, checks every prev_hash + signature
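The hash-chaining half of that idea fits in a few lines of standard-library Python. This is a sketch under assumptions, not the AgentMesh `AuditTrail`: `SketchAuditChain`, `entry_hash`, and the all-zero genesis hash are hypothetical, and the Ed25519 signing step is left out because it needs a third-party library.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder prev_hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus this entry's payload."""
    body = json.dumps({"prev_hash": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class SketchAuditChain:
    def __init__(self) -> None:
        self.entries = []

    def append(self, payload: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        self.entries.append(
            {"prev_hash": prev_hash, "payload": payload,
             "hash": entry_hash(prev_hash, payload)}
        )

    def verify(self) -> bool:
        """Walk the chain and recompute every link."""
        expected_prev = GENESIS_HASH
        for entry in self.entries:
            if entry["prev_hash"] != expected_prev:
                return False  # link to the previous entry is broken
            if entry["hash"] != entry_hash(entry["prev_hash"], entry["payload"]):
                return False  # payload was changed after it was logged
            expected_prev = entry["hash"]
        return True


chain = SketchAuditChain()
chain.append({"team": "platform", "tokens": 120})
chain.append({"team": "platform", "tokens": 0})
ok_before = chain.verify()
chain.entries[0]["payload"]["tokens"] = 999  # simulate tampering with an old entry
ok_after = chain.verify()
print(ok_before, ok_after)
```

A hash chain alone only catches edits by someone who does not recompute the later hashes; signing each entry is what stops a party without the private key from rewriting the whole chain.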

The benchmark (run it yourself, no API keys)

I didn't want to ship a number you can't check, so the benchmark runs in demo mode:

python examples/benchmark.py


Total requests          20
Exact cache hits         2  (10%)
Semantic cache hits     15  (75%)
Total misses             3  (15%)

Cost WITHOUT AgentMesh  $0.0030
Cost WITH AgentMesh     $0.0008
Savings                 $0.0023  (75%)

20 requests: 5 topics, 4 phrasings each. 17 of them (85%) never reached the model; the 3 misses are cold-start first calls, made before anything similar was in the cache, which is exactly what you'd expect.

There's also a Chrome extension
A proxy can't see a prompt typed straight into the ChatGPT or Gemini tab. So there's an extension: declarativeNetRequest reroutes api.anthropic.com / api.openai.com to localhost:8080, and content scripts show a governance overlay before the prompt is sent. Stats persist across service-worker restarts.

What's deliberately not built yet

I'd rather ship a small, verifiable core than a wide surface of half-features:

The cache is in-memory and single-process: great for a local proxy, not yet a fleet. Redis is next.
No native VS Code panel (env vars + the Chrome extension for now).
No SAML/SSO identity propagation; quotas key on a team header.

None of these are research problems; they're scope. PRs welcome, especially the Redis backend.

Try it / contribute

pip install agentmesh-proxy sentence-transformers
python examples/benchmark.py     # 85% cache hits, 75% lower cost

Repo (star it): https://github.com/anilatambharii/agentmesh
PyPI: agentmesh-proxy · Docker: anilsprasad/agentmesh · also on Hugging Face
Apache 2.0

If you run AI tools across a team and your bill is outgrowing your usage, clone it, run the benchmark, and tell me where it breaks. What would you build on top of this?
