A lot of MCP servers I see in the wild look like this:
```python
@mcp.tool()
async def get_thing(id: str):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/things/{id}")
    return resp.json()
```
Fetch, forward, done. A thin HTTP proxy with a JSON Schema wrapper. For some
use cases, that's enough.
The servers I keep coming back to do something different. They hold state and
pre-compute answers. An agent hitting a thin wrapper might need three round
trips and 30 seconds. The same agent hitting a server that does real work gets
its answer in one call, under a millisecond.
## Preloaded in-memory index
Here's a failure mode I run into constantly: the agent needs to find something
but doesn't know the exact ID. Most APIs only support exact lookups. No ID, no
result. The conversation dead-ends with "I couldn't find that resource" and the
user gives up.
I built a server that wraps a CDN management API. Hundreds of properties, and
the agent regularly needs to find which one handles a given hostname. The API
has a search endpoint, but it's slow, requires exact matches, and sometimes
returns 403 depending on account permissions.
So the server loads every property into memory at startup:
```python
import asyncio
from dataclasses import dataclass, field

from rapidfuzz import fuzz, process

@dataclass
class PropertyIndex:
    _entries: list[PropertyEntry] = field(default_factory=list)
    _name_index: dict[str, int] = field(default_factory=dict)

    async def load(self, refresh_interval: int = 300):
        await self._build_index()
        self._refresh_task = asyncio.create_task(self._refresh_loop())

    def search_by_name(self, query: str, limit: int = 50):
        names = [e.property_name for e in self._entries]
        matches = process.extract(
            query, names, scorer=fuzz.WRatio,
            limit=limit, score_cutoff=50,
        )
        return [self._entries[idx].to_dict() for _, score, idx in matches]
```
Builds once by fanning out parallel API calls, deduplicates, refreshes every
five minutes in the background. Lookups take under a millisecond.
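The refresh loop itself isn't shown above, so here's a minimal sketch of the build-once, refresh-in-background shape. The class and method names are hypothetical, and the parallel API fan-out is stubbed with a static list:

```python
import asyncio

class RefreshingIndex:
    """Build an in-memory index at startup, then rebuild it periodically
    in the background. (Hypothetical sketch; the real server fans out
    parallel API calls and deduplicates in _build_index.)"""

    def __init__(self, refresh_interval: float = 300):
        self._refresh_interval = refresh_interval
        self._entries: list[dict] = []

    async def _build_index(self):
        # Stand-in for the real fan-out + dedupe against the CDN API.
        self._entries = [{"name": "checkout-cdn"}, {"name": "assets-cdn"}]

    async def load(self):
        await self._build_index()  # block until the first build completes
        self._task = asyncio.create_task(self._refresh_loop())

    async def _refresh_loop(self):
        while True:
            await asyncio.sleep(self._refresh_interval)
            try:
                await self._build_index()  # swap in the fresh index
            except Exception:
                pass  # on failure, keep serving the stale index
```

The one design point that matters: `load()` awaits the first build before returning, so the server never advertises tools it can't yet answer, while later rebuilds happen off the request path.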
Without this, the agent guesses at exact property names, picks the wrong one,
retries, burns three turns. With the index, someone types "the CDN config for
checkout" and gets the right answer first try. That's the kind of difference
that decides whether people keep using the agent or go back to doing it
manually.
I did the same thing for a CI/CD server. The API lets you fetch a build config
by ID, but there's no fuzzy search. If you don't know the ID, you're stuck. The
server caches all build configurations at startup, runs fuzzy matching against
them. The agent says "find the deploy job for the payments service" and gets a
ranked list instantly, even though the CI system itself can't do that.
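As a sketch of what that lookup tool does, here's the same idea with stdlib `difflib` standing in for rapidfuzz, against a hypothetical cache of build configs (IDs and names invented for illustration):

```python
import difflib

# Hypothetical startup cache: opaque CI config IDs -> human-readable names.
BUILD_CONFIGS = {
    "bc-1029": "payments-service-deploy",
    "bc-2210": "payments-service-test",
    "bc-3417": "checkout-frontend-deploy",
}

def find_build_config(query: str, limit: int = 5) -> list[dict]:
    """Fuzzy-match a free-text query against cached config names,
    returning ranked {id, name} candidates the agent can pick from."""
    names = list(BUILD_CONFIGS.values())
    matches = difflib.get_close_matches(query, names, n=limit, cutoff=0.3)
    by_name = {v: k for k, v in BUILD_CONFIGS.items()}
    return [{"id": by_name[m], "name": m} for m in matches]
```

The agent never needs to know an ID exists; it asks in the words the user used and gets back the ID as part of the answer.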
## Embedded analytical database
I have another server that sits in front of a relational database. Some tables
have 20 million rows. The agent needs to answer analytical questions, things
like "which providers have the highest volume in this region?" or "show me the
top performers for a given category."
The database wasn't designed for these queries. It was built for a web UI with
narrow, well-indexed lookups. The agent's access patterns are different: it asks
broad analytical questions that require joins across tables the application
never joins. Adding indexes wasn't an option either, because the database is
owned by another team and optimizing it for an AI agent's query patterns wasn't
on anyone's roadmap. Some of these queries took 10 to 30 seconds on a read
replica, and in an agent loop where that latency gets multiplied by however many
tool calls the agent needs, the conversation times out before it gets anywhere.
The server embeds DuckDB in-process and loads pre-aggregated views and lookup
tables at startup. Some are straight copies of small reference tables. Others
are materialized summaries that flatten joins the source database was never
designed to run efficiently, the kind of cross-table aggregations that make
sense for an analytical question but would be expensive on a schema built for
transactional web UI lookups:
```python
class DuckDBCache:
    async def start(self):
        self._conn = duckdb.connect(":memory:")
        for key, config in fast_configs.items():
            await self._load_table(key, config)
        self._ready = True
        self._deferred_task = asyncio.create_task(
            self._load_deferred(deferred_configs)
        )
        self._refresh_task = asyncio.create_task(self._refresh_loop())
```
Each table has a fingerprint query (a cheap COUNT(*) or checksum) that the
refresh loop checks before doing a full reload. Large tables load in the
background after the server is already taking requests. If something asks for a
table that hasn't loaded yet, it falls back to the source database.
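A minimal sketch of that fingerprint check, with a hash over the rows standing in for the real COUNT(*)/checksum SQL (class and method names are hypothetical):

```python
import hashlib

class TableRefresher:
    """Skip a full table reload when a cheap fingerprint says nothing
    changed upstream. (Sketch; the real check is a COUNT(*)/checksum
    query against the source database, not a hash of fetched rows.)"""

    def __init__(self):
        self._fingerprints: dict[str, str] = {}
        self.reloads = 0

    def _fingerprint(self, rows) -> str:
        return hashlib.md5(repr(rows).encode()).hexdigest()

    def maybe_reload(self, table: str, rows) -> bool:
        fp = self._fingerprint(rows)
        if self._fingerprints.get(table) == fp:
            return False  # unchanged: skip the expensive copy into DuckDB
        self._fingerprints[table] = fp
        self.reloads += 1  # here the real server re-copies the table
        return True
```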
The 30-second query now takes under a millisecond. The agent can actually have a
back-and-forth with the user instead of timing out after the first question.
There's a query-result cache on top of this too. It has a prewarm manifest,
basically a list of common queries that run at startup so the first person to
use the agent on Monday morning doesn't sit through a cold start.
```python
class QueryCache:
    async def get_or_compute(self, cache_key, compute_fn, ttl=None):
        cached = self.get(cache_key)
        if cached is not None:
            return cached
        result = await compute_fn()
        if "error" not in result:
            self._put(cache_key, result, ttl or self._default_ttl)
        return result
```
It skips caching error responses. If a query fails because the database is
temporarily overloaded, you don't want that failure served for the next hour.
That one took a production outage to figure out.
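The prewarm manifest might look something like this sketch, with hypothetical query names and a plain dict in place of the real TTL cache:

```python
import asyncio

# Hypothetical manifest: queries to run at startup so Monday morning's
# first user doesn't sit through a cold start.
PREWARM_MANIFEST = [
    "top_providers_by_region",
    "volume_last_30_days",
]

class PrewarmedCache:
    def __init__(self):
        self._store: dict[str, object] = {}

    async def get_or_compute(self, key, compute_fn):
        if key in self._store:
            return self._store[key]
        result = await compute_fn()
        if "error" not in result:  # never cache failures
            self._store[key] = result
        return result

    async def prewarm(self, run_query):
        # Fire all manifest queries concurrently during startup.
        await asyncio.gather(
            *(self.get_or_compute(q, lambda q=q: run_query(q))
              for q in PREWARM_MANIFEST)
        )
```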
## Data transformation
Every server I build strips the upstream API response before returning it.
Token usage scales with response size, and most APIs return 10x more data than
the agent will ever look at.
One API I work with returns objects with 60+ fields. The server keeps maybe 8:
```python
def _slim_record(r: dict):
    return _strip_nulls({
        "id": r.get("id"),
        "name": r.get("name"),
        "total_value": _cents_to_major(r.get("total_value_cents")),
        "annual_value": _cents_to_major(r.get("annual_value_cents")),
        "start_date": r.get("start_date"),
        "end_date": r.get("end_date"),
        "status": _effective_status(r),
    })
```
`_cents_to_major` converts cents to dollars. The raw API stores monetary values in cents, and before I added this conversion every dollar amount in the agent's reports was off by a factor of 100: a $2,000 contract showed up as $200,000 because the agent treated cents as dollars. No amount of prompt engineering fixed it reliably. Moving the conversion into the server did.
`_effective_status` is the other one worth mentioning. The API's status field can say "active" on a record that ended three months ago. The platform's own UI derives the real status from multiple fields, so the MCP server does the same:
```python
from datetime import date

def _effective_status(r: dict) -> str:
    stage = r.get("stage")
    if stage in ("terminated", "not_renewed"):
        return "inactive"
    if r.get("end_date_not_applicable") or r.get("renewal_type") == "perpetual":
        return r.get("status", "undetermined")
    end_date = r.get("end_date")
    if end_date and date.fromisoformat(end_date) < date.today():
        return "inactive"
    return r.get("status", "undetermined")
```
Now the agent gives the same answer a human would get looking at the UI.
Stripping nulls across a list of 50 records also saves a few thousand tokens
per response, which adds up.
A log aggregation server I built does something similar: it auto-appends `| json auto` to queries that lack a field-extraction operator, truncates raw log lines to 500 characters, and converts epoch-millisecond timestamps to ISO 8601. Small fixes that add up to the agent not wasting turns fighting the format.
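A sketch of those normalizations. The operator check here is a stand-in heuristic and the function names are hypothetical; the real server knows its query language's actual operators:

```python
from datetime import datetime, timezone

def normalize_query(q: str) -> str:
    """Append `| json auto` when the query has no field-extraction stage.
    (Heuristic stand-in: checks for a couple of operator spellings.)"""
    if "| json" not in q and "| parse" not in q:
        return q + " | json auto"
    return q

def slim_log_line(raw: str, epoch_ms: int, max_len: int = 500) -> dict:
    return {
        # Epoch milliseconds -> ISO 8601 so the agent never does the math.
        "timestamp": datetime.fromtimestamp(
            epoch_ms / 1000, tz=timezone.utc
        ).isoformat(),
        # Truncate so one noisy log line can't flood the context window.
        "line": raw[:max_len],
    }
```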
## Download once, serve from cache
Some data is expensive to fetch. PDF documents. Code bundles in tgz archives.
The pattern: download on first access, extract the text, build a line offset
index, serve everything from memory after that.
```python
class CachedFile:
    def __init__(self, content: str):
        self.content = content
        # Offset of the start of each line, for O(1) line-range reads.
        offsets = [0]
        pos = 0
        while True:
            pos = content.find("\n", pos)
            if pos == -1:
                break
            pos += 1
            if pos < len(content):
                offsets.append(pos)
        self._offsets = offsets

    def _line_end_offset(self, line: int) -> int:
        # A line ends where the next one starts; the last line ends at EOF.
        if line < len(self._offsets):
            return self._offsets[line]
        return len(self.content)

    def get_lines(self, start: int, end: int) -> str:
        # 1-indexed, inclusive line range.
        return self.content[self._offsets[start - 1] : self._line_end_offset(end)]
```
I use this for CDN edge function code bundles and PDF documents (extracted with
PyMuPDF). After the first download, the agent reads by line range, searches
with regex, lists the file tree. No repeat downloads. Reading through a
200-page document becomes "just read" instead of "download, extract, read" on
every question.
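The regex search over cached content might look like this sketch, returning line numbers the agent can feed straight back into a line-range read (function name hypothetical):

```python
import re

def search_cached(content: str, pattern: str) -> list[dict]:
    """Regex search over cached file content. Returns 1-indexed line
    numbers plus truncated text, so the agent can follow up with a
    get_lines(start, end) call instead of re-downloading anything."""
    rx = re.compile(pattern)
    hits = []
    for i, line in enumerate(content.splitlines(), start=1):
        if rx.search(line):
            hits.append({"line": i, "text": line[:500]})
    return hits
```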
## When thin is fine
Not everything needs this treatment. A server that translates natural language
to a query language and passes it to an API is fine as a thin wrapper. The
translation is the value there. Same for simple lookup tools.
The questions I ask: does the agent hit the same data twice? Does the API return more than the agent needs? Is the API slow enough that the agent loop feels broken? If the answer to any of these is yes, the server should be doing work.
## The multiplier
When a person uses a web UI, they look at a page, think, click something else.
One request at a time, processed by a human brain. An agent works differently.
It makes five tool calls, stuffs all five responses into its context window, and
reasons over them at once. A slow response gets multiplied by every call. A
60-field JSON blob gets multiplied by every call. It adds up fast.
I've measured the difference. CDN property lookups went from three agent turns
to one once the fuzzy index was in place. Analytical queries went from timing
out at 30 seconds to returning in under a millisecond from DuckDB. And every
single dollar amount in every report was wrong until the server started
converting cents for the agent.
You can try to fix that last one with prompt engineering. I tried for weeks. The agent still got it wrong often enough that I couldn't trust the output.