We've all been there. Claude hits an MCP error, tries a different approach, hits another one, tries again — and eventually figures it out. You wait a minute, maybe two, it's fine.
But here's the thing nobody talks about: when a whole team uses Claude Code daily, those minutes stack. One person watches the agent spin through three wrong SQL dialects before landing on the right one. Another waits while it retries the same failed tool call four times. Someone else loses the thread of a complex session because query results flooded the context. Multiply that by four people, every day, and you're not talking about a minor inconvenience anymore — you're talking about hours of engineering time burned while Claude figures out things it should have known from the start.
I started tracking it. Two weeks, 23+ of these incidents across my analytics team. And the worst part? The agent wasn't being stupid. It was being sabotaged by the server we gave it.
We were using the official aws-dataprocessing MCP server from AWS Labs. Good project, well-maintained, 34 tools covering Glue, EMR, Athena, IAM, S3. We needed Athena. That's it. Five tools out of thirty-four.
I should've noticed sooner.
The thing is, it worked — sometimes. When it worked it was great. But when it didn't, it failed in the most demoralizing way possible: the agent would try something, get an error, try a variation, get a different error, try again, get the first error back. You'd watch it spin for ten minutes on something a junior analyst could fix in thirty seconds.
After a while I started looking at why it failed. Not the specific errors — the underlying reasons. And it turned out there were five of them, all happening at once.
The first thing I noticed was the metadata blindspot.
Athena is strict about types. If your column is varchar and you write a query treating it like a date, you get an error. Simple. But Claude didn't know our column types — there was no mechanism to tell it upfront. So it guessed. And when it guessed wrong and got an error, it... guessed again. I watched it try three different CAST approaches on the same column, each one wrong in a different way, before I just typed event_date is STRING in the chat and it immediately fixed everything. The information was always there — we just never gave it to the agent.
The second thing was the context bill.
34 tools × ~600 tokens each = roughly 20K tokens just for tool descriptions. Before the agent runs a single query, a fifth of its context window is gone. On a normal session that's annoying. With parallel sub-agents — which we use constantly — it's a disaster. Each sub-agent gets the full 34-tool payload. When someone on the team was running parallel agents on a complex analysis task, each one was starting with almost no usable context. Dozens of failures in a single session — and now it made sense.
Third: dialect hallucinations.
Claude kept writing TIMESTAMP_SUB. That's BigQuery syntax. Athena runs Presto/Trino, which uses DATE_ADD('day', -N, CURRENT_DATE). Every single time someone ran a date filter, the agent defaulted to what it knew best. Because nothing in the tool description said "hey, this isn't BigQuery."
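For reference, here is the same seven-day filter in both dialects; the second line is the one Athena actually accepts (table and column names are made up):

```python
# The same "last 7 days" filter, side by side.
# BigQuery syntax -- what the agent kept defaulting to:
bigquery_filter = "event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)"

# Presto/Trino syntax -- what Athena actually runs:
athena_filter = "event_date >= DATE_ADD('day', -7, CURRENT_DATE)"
```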
Fourth: the DROP VIEW trap.
The server has a built-in SQL analyzer that blocks write operations. Makes sense as a safety feature. Except it classified DROP VIEW as a write operation and blocked it — even though we had --allow-write enabled. Drop a view before recreating it, like you do in any dbt workflow, and you get a permissions error that makes zero sense. Eight attempts. Eight times the same wall.
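One way to unblock this without opening the door to arbitrary writes is to special-case view DDL on the write flag. A minimal sketch; the regexes and function name are illustrative, not the upstream analyzer's actual code:

```python
import re

# Statements that mutate data: blocked unless --allow-write is set.
WRITE_PATTERN = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|DROP\s+TABLE|CREATE\s+TABLE|ALTER)\b",
    re.IGNORECASE,
)

# View DDL is metadata-only. dbt-style workflows drop and recreate
# views constantly, so it should ride on the same write flag.
VIEW_DDL_PATTERN = re.compile(
    r"^\s*(DROP|CREATE(\s+OR\s+REPLACE)?)\s+VIEW\b", re.IGNORECASE
)

def is_query_allowed(sql: str, allow_write: bool) -> bool:
    if VIEW_DDL_PATTERN.match(sql):
        return allow_write  # allowed when writes are enabled, not hard-blocked
    if WRITE_PATTERN.match(sql):
        return allow_write
    return True  # plain SELECTs always pass
```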
Fifth: the data dump problem.
No default row limit on query results. One SELECT * on a large table dumps 1,000 rows of JSON straight into context. Multiply by parallel agents running simultaneously, and the useful context just disappears.
So. 160 lines of Python later, here's what we changed — and why each thing actually matters.
We cut 33 tools.
This sounds obvious but it isn't. You can't just disable tools in the config. We wrote a new entry point, server_athena.py, that imports the original handler but only registers one tool instead of thirty-four. Same codebase, same logic, different surface area. Context overhead dropped from ~20K tokens to ~1K. That single change made parallel sub-agents viable again.
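The shape of that entry point, reduced to plain Python. This is a sketch: `ToolServer` stands in for the MCP SDK's registration machinery, and the handler is stubbed rather than imported from the upstream codebase:

```python
# server_athena.py -- same codebase, different surface area (sketch).
# The upstream entry point registers all 34 tools; this one registers
# the single tool we actually use.

class ToolServer:
    """Minimal stand-in for an MCP server's tool registry."""
    def __init__(self):
        self.tools = {}

    def register(self, name, handler, description):
        self.tools[name] = {"handler": handler, "description": description}

def execute_query_tool(sql: str) -> dict:
    # In the real server this delegates to the upstream Athena handler.
    return {"sql": sql, "status": "SUBMITTED"}

def build_server() -> ToolServer:
    server = ToolServer()
    # One registration instead of 34: ~1K tokens of descriptions, not ~20K.
    server.register(
        "execute-query",
        execute_query_tool,
        "Run an Athena SQL query and return results inline.",
    )
    return server
```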
We merged three calls into one.
The previous flow was: start-query-execution → poll get-query-execution until it finishes → get-query-results. Three tool calls minimum, each one a decision point where the agent could drift. We added an execute-query operation that handles all of it internally — 500ms polling, 30 second timeout, returns results directly. For 95% of queries, it's one call. For slow queries, it returns the execution ID so you can check back.
The agent stops forgetting what it was doing mid-query.
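The merged flow can be sketched against boto3's Athena client. The boto3 method names and response shapes below are the real API; the polling constants are the ones above, and the 100-row cap is an assumption carried over from our defaults:

```python
import time

POLL_INTERVAL = 0.5   # 500ms between status checks
TIMEOUT = 30.0        # give up polling after 30 seconds
DEFAULT_MAX_ROWS = 100

def execute_query(athena, sql: str, workgroup: str = "primary") -> dict:
    """start -> poll -> fetch, in one tool call. `athena` is a boto3 Athena client."""
    qid = athena.start_query_execution(
        QueryString=sql, WorkGroup=workgroup
    )["QueryExecutionId"]

    deadline = time.monotonic() + TIMEOUT
    while time.monotonic() < deadline:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            rows = athena.get_query_results(
                QueryExecutionId=qid, MaxResults=DEFAULT_MAX_ROWS
            )["ResultSet"]["Rows"]
            return {"status": "SUCCEEDED", "rows": rows}
        if state in ("FAILED", "CANCELLED"):
            return {"status": state, "query_execution_id": qid}
        time.sleep(POLL_INTERVAL)

    # Slow query: hand back the ID so the agent can check on it later.
    return {"status": "RUNNING", "query_execution_id": qid}
```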
We preloaded the schema.
At startup, the server scans all databases and caches every table, every column, every type. It also downloads our dbt manifest from S3 — so it knows which model owns which table, who's responsible for it, how fresh the data is supposed to be. This all gets baked into the server's system instructions before Claude sees anything.
Then we added get-all-schemas with two modes: a cheap compact mode that returns table names and descriptions (so the agent can orient), and a deep mode where you ask for specific tables and get full column types plus dbt lineage. The agent orients cheaply, drills when it needs to.
No more guessing column types. The metadata is just there.
We made errors actually helpful.
This was the biggest change in practice. The old behavior: query fails, agent gets a raw error string, guesses what went wrong. New behavior: the server parses the error type and returns the specific context needed to fix it.
TABLE_NOT_FOUND → full list of available tables
COLUMN_NOT_FOUND → all columns for tables in the query
TYPE_MISMATCH → column types with correct CAST suggestion
PARTITION_MISMATCH → partition keys and their actual types
QUERY_EXHAUSTED → hints: add LIMIT, use APPROX_DISTINCT
SYNTAX_ERROR → "this is Presto, use DATE_ADD not DATE_SUB"
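Under the hood this is just a classifier over Athena's error text that attaches the relevant slice of the schema cache. A condensed sketch; the match strings and cache shape are illustrative:

```python
def enrich_error(error_message: str, schema_cache: dict) -> dict:
    """Map a raw Athena error to the context the agent needs to fix it."""
    msg = error_message.upper()
    if "TABLE_NOT_FOUND" in msg:
        return {"error": error_message, "available_tables": sorted(schema_cache)}
    if "COLUMN" in msg and "RESOLVE" in msg:  # "Column 'x' cannot be resolved"
        return {"error": error_message,
                "columns": {t: m["columns"] for t, m in schema_cache.items()}}
    if "TYPE_MISMATCH" in msg or "CANNOT BE APPLIED" in msg:
        return {"error": error_message,
                "hint": "Check the column types below and CAST explicitly.",
                "columns": {t: m["columns"] for t, m in schema_cache.items()}}
    if "EXHAUSTED" in msg:
        return {"error": error_message,
                "hint": "Add a LIMIT, or use APPROX_DISTINCT over COUNT(DISTINCT)."}
    if "SYNTAX" in msg or "MISMATCHED INPUT" in msg:
        return {"error": error_message,
                "hint": "Athena runs Presto/Trino: DATE_ADD('day', -7, "
                        "CURRENT_DATE), not TIMESTAMP_SUB."}
    return {"error": error_message}
```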
Most errors now resolve in one retry. The spiraling stopped almost immediately.
We compressed the output.
JSON query results are verbose. We switched to TOON (Token-Oriented Object Notation) — pipe-separated tabular format. Same data, 74% fewer tokens on our actual queries. Sounds like micro-optimization. At 50 queries deep into a complex analysis session, it's the difference between the agent remembering what it was asked and losing the thread.
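The core idea is easy to show: name the columns once, then emit delimited values per row. A simplified sketch of that tabular form, not the actual TOON spec (which also handles nesting and quoting); the sample rows are made up:

```python
import json

def to_tabular(rows: list[dict]) -> str:
    """Emit keys once as a header line, then pipe-separated values per row."""
    if not rows:
        return ""
    cols = list(rows[0])
    lines = ["|".join(cols)]
    lines += ["|".join(str(r[c]) for c in cols) for r in rows]
    return "\n".join(lines)

rows = [{"event_date": "2024-06-01", "users": 1423},
        {"event_date": "2024-06-02", "users": 1391}]

# JSON repeats every key on every row; the tabular form names them once.
json_len = len(json.dumps(rows))
toon_len = len(to_tabular(rows))
```

The savings grow with row count, since the per-row key repetition is what JSON spends most of its characters on.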
The results:
| | Before | After |
|---|---|---|
| Tools in context | 34 | 1 |
| Context overhead | ~20K tokens | ~1K tokens |
| Calls per query | 3+ | 1 |
| Default row limit | none | 100 |
| Error response | raw string | schema + hint |
| Result format | JSON | TOON (-74%) |
The wasted cycles stopped. The team runs parallel agents on complex tasks without babysitting them. The agent gets it right on the first or second try.
The thing I keep coming back to is that none of this required a smarter model. Claude was fine the whole time. We handed it a server built for maximum coverage — 34 tools, every AWS data service, zero assumptions about your use case — and expected it to perform like a specialist. AWS Labs built that server for everyone. We needed it to work for us.
If your agent keeps spinning in circles, before blaming the model, check what you're loading into its context:
- How many tools does it see? Do you actually use all of them?
- What does it know about your data model before the first call?
- When it gets an error, does it get a hint or a wall?
- How big are the responses coming back?
We write MCP servers as access layers. But they're also the agent's cognitive environment. Design them badly and even a great model will look stupid.
If you're just starting to think about MCP servers — what they're for, when they make sense, how to structure them — I wrote about that here. This post is what happens two levels deeper, when the theory meets production.