Craig Mac
How API Data Bloat is Ruining Your AI Agents (And How I Cut Token Usage by 98% in Python)

If you are building autonomous AI agents right now using OpenAI, Anthropic, or local models, you have probably run into the exact same wall I did.

You build a smart agent. You give it access to a few API tools (web search, database queries, a CRM integration). You set it loose.

It works great for about four turns. Then, suddenly, it forgets its core instructions. It starts hallucinating. And when you check your API dashboard the next morning, your token usage has spiked so high you think your API key got leaked.

What happened? API Data Bloat.

The 50KB JSON Problem
Here is the dirty secret of agentic workflows: APIs were built for traditional software, not for LLM context windows.

When your AI agent decides to call a tool—let's say it searches for a user profile in a database—the API doesn't just return the user's name and email. It returns a massive 40KB wall of raw JSON containing timestamps, nested metadata, tracking IDs, and null fields.

Your AI only needed about 120 bytes of that data to answer the user's question. But because of how most agent frameworks operate, the entire 40KB payload gets dumped directly into the active context window.
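To make the waste concrete, here is a small sketch using a hypothetical (and heavily trimmed) CRM-style response — the field names are illustrative, not from any real API — measuring how much of the serialized payload the agent actually needs:

```python
import json

# Hypothetical raw API response: the agent only needs name and email,
# but the payload also carries timestamps, tracking IDs, and null fields.
raw_response = {
    "id": "usr_8f3a2c",
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "created_at": "2021-03-14T09:26:53.589Z",
    "updated_at": "2024-11-02T17:45:10.001Z",
    "tracking": {"request_id": "req_01", "trace_id": "trc_02", "span_id": "spn_03"},
    "preferences": {"theme": None, "locale": None, "timezone": None},
    "metadata": {"source": "crm_sync", "version": 7, "flags": [None] * 20},
}

# What the agent actually needed to answer the user's question.
needed = {"name": raw_response["name"], "email": raw_response["email"]}

full_size = len(json.dumps(raw_response))
needed_size = len(json.dumps(needed))
print(f"full payload: {full_size} bytes, needed: {needed_size} bytes")
print(f"waste: {100 * (1 - needed_size / full_size):.0f}%")
```

A real 40KB response just scales this ratio up: the signal stays tiny while the noise grows.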

This causes two massive problems:

The Cost: You are paying for tens of thousands of useless tokens on every single tool call.
Context Compaction: LLMs have finite memory. When you shove 40KB of junk JSON into the chat history, the LLM is forced to push out its original system prompt and early conversation history. The agent gets "dumb" because its working memory is full of tracking IDs.
The Flawed Solution: "Just use a cheaper model"
When developers see their API bills explode, their first instinct is to swap out GPT-4o or Claude 3.5 Sonnet for a cheaper, smaller model to save money.

But cheap models deliver cheap reasoning. The problem isn't that the smart models are too expensive; the problem is that you are feeding them garbage data they didn't ask for.

I got tired of this, so I built a middleware fix.

Enter: The OpenClaw Context Saver
I built and open-sourced a drop-in tool called the OpenClaw Context Saver. It is pure Python, has zero external dependencies, and acts as a protective shield for your LLM's context window.

It cuts agent token usage by 70% to 98% by solving the data bloat problem before the data ever reaches the AI.

Here is how it works under the hood:

  1. Sandboxed Execution (ctx_run)
    Instead of the LLM calling the API directly and eating the response, the LLM calls my ctx_run sandbox. The sandbox executes the API call in an isolated layer.

  2. Intent-Driven Filtering
    Before passing the data back to the LLM, the Context Saver intercepts the massive JSON payload. It shrinks it down, extracting only the specific data points the agent actually needs to complete its current reasoning step.

  3. Session Continuity (The Magic Trick)
    What if the agent needs the rest of that data later?
    Instead of throwing the extra data away, the Context Saver indexes the full payload in a lightweight background database (SQLite). It passes a tiny, 120-byte summary into the active context window, along with a reference ID. If the agent realizes it needs more details three turns later, it can instantly retrieve them from the background index without re-running the API call.
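The three steps above can be sketched in a few lines of pure Python. This is a minimal illustration of the pattern, not the Context Saver's actual internals — `ctx_run`'s real signature, the filtering logic, and the SQLite schema here are all assumptions:

```python
import json
import sqlite3
import uuid

# Lightweight background index for full payloads (the real tool's schema
# may differ; this is a sketch of the pattern).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payloads (ref_id TEXT PRIMARY KEY, body TEXT)")

def ctx_run(api_call, wanted_fields):
    """Run the API call in isolation; return a tiny summary plus a reference ID."""
    payload = api_call()                                  # 1. sandboxed execution
    summary = {k: payload.get(k) for k in wanted_fields}  # 2. intent-driven filtering
    ref_id = uuid.uuid4().hex[:8]
    db.execute("INSERT INTO payloads VALUES (?, ?)",      # 3. index the full payload
               (ref_id, json.dumps(payload)))
    return {"ref": ref_id, **summary}  # only this reaches the context window

def ctx_fetch(ref_id, field):
    """Retrieve a field from the indexed payload without re-running the API call."""
    (body,) = db.execute("SELECT body FROM payloads WHERE ref_id = ?",
                         (ref_id,)).fetchone()
    return json.loads(body)[field]

# Usage: the LLM sees a ~100-byte summary; the bulky original stays queryable.
fake_api = lambda: {"name": "Ada", "email": "ada@example.com",
                    "audit_log": ["..."] * 500}
result = ctx_run(fake_api, ["name", "email"])
later = ctx_fetch(result["ref"], "audit_log")  # three turns later, no re-fetch
```

The key design choice is that the reference ID is cheap to carry in context, so "forgetting" the bulk of a payload is reversible instead of destructive.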

The Real-World Impact
Let's look at the difference on a standard background agent task:

❌ WITHOUT Context Saver:

Agent calls API ➔ 20 KB raw JSON floods context.
Agent calls API again ➔ 30 KB raw JSON floods context.
Result: Session memory maxes out, working state is lost, and you burn ~750,000 tokens a day just on background noise.
✅ WITH Context Saver:

Agent calls ctx_run ➔ 120-byte summary enters context (full data indexed in the background).
Agent calls ctx_batch ➔ 500-byte combined summary enters context.
Result: Massive cost savings, perfect memory retention, and you can afford to keep using the smartest models available.
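The back-of-envelope math behind those numbers, assuming roughly 4 characters per token (a common heuristic; real tokenizers vary) and 100 background tool calls per day — both inputs are illustrative:

```python
# Rough token arithmetic; ~4 chars/token is a heuristic, not exact.
CHARS_PER_TOKEN = 4
calls_per_day = 100

raw_bytes = 30_000       # typical raw JSON payload per tool call
summary_bytes = 120      # filtered summary per tool call

raw_tokens = calls_per_day * raw_bytes // CHARS_PER_TOKEN
summary_tokens = calls_per_day * summary_bytes // CHARS_PER_TOKEN
saved_pct = 100 * (1 - summary_tokens / raw_tokens)

print(f"{raw_tokens:,} tokens/day raw vs {summary_tokens:,} filtered "
      f"({saved_pct:.1f}% reduction)")
```

Real workloads land lower than this idealized case (some payloads are lean, some summaries are fatter), which is where the 70%–98% range comes from.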
Stop burning tokens.
If you are optimizing AI agents, building autonomous systems, or just looking to drastically reduce your LLM API costs without sacrificing reasoning quality, drop this into your architecture today.

💻 I just open-sourced it. You can grab the code, check out the examples, and star the repo here:
https://github.com/tlancas25/openclaw-context-saver

I'm a solo dev, so I'd love to hear your feedback. Drop a comment if you've been struggling with context limits, or open an issue on GitHub if you want to see a specific feature!

Top comments (1)

Apex Stack

The session continuity pattern you described — indexing full payloads in SQLite and passing tiny summaries with reference IDs — is essentially what we had to build for a different reason on a financial data platform.

We pull data from yfinance for 8,000+ stock tickers and generate analysis pages with a local LLM. The raw API response for a single ticker (financials, holders, earnings, dividends, news) can be 50-80KB of JSON. When we pipe that into Llama 3 for content generation, we hit exactly the problem you're describing — the model starts losing track of its analysis framework because the context is drowning in raw numbers.

Our solution was similar in spirit but different in execution: we pre-extract a "ticker profile" (about 2KB of the most analytically relevant fields) and pass that to the LLM instead of the full payload. The full data lives in Supabase and the LLM can reference specific fields if it needs to drill deeper.
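As a rough sketch, the pre-extraction looks something like this — the field names below are illustrative (loosely yfinance-shaped), not our exact schema:

```python
def build_ticker_profile(raw: dict) -> dict:
    """Reduce a 50-80KB raw ticker payload to a ~2KB analytically relevant profile.

    Field names are illustrative; adapt to the actual payload shape.
    """
    info = raw.get("info", {})
    return {
        "symbol": info.get("symbol"),
        "sector": info.get("sector"),
        "market_cap": info.get("marketCap"),
        "trailing_pe": info.get("trailingPE"),
        "dividend_yield": info.get("dividendYield"),
        "revenue_growth": info.get("revenueGrowth"),
        # last four quarterly EPS figures, newest first
        "recent_eps": raw.get("quarterly_eps", [])[:4],
    }

# Usage with a toy payload; the real one carries far more metadata.
raw = {
    "info": {"symbol": "ACME", "sector": "Industrials", "marketCap": 1.2e10,
             "trailingPE": 18.4, "dividendYield": 0.021, "revenueGrowth": 0.07},
    "quarterly_eps": [1.02, 0.98, 0.95, 0.91, 0.88],
    "news": ["..."] * 200,  # noise the model never sees
}
profile = build_ticker_profile(raw)
```

The model gets `profile`; everything else stays in Supabase behind a lookup.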

The 98% reduction claim tracks with our experience. For financial data specifically, the signal-to-noise ratio in API responses is absurdly low — most of the payload is metadata, timestamps, and null optional fields that the model doesn't need.

One thing I'd be curious about: how does the Context Saver handle cases where the "intent" is ambiguous? In financial analysis, sometimes the model doesn't know it needs a specific data point until it's partway through reasoning. Does the SQLite retrieval add meaningful latency in those cases?