DEV Community

Jason (AKA SEM)

Originally published at Medium

Stop Feeding Your AI the Entire Filing Cabinet. It Doesn’t Need It.

The most expensive architectural mistake in agent systems isn’t the model you chose. It’s how much context you’re shipping on every single call.

I have been a software developer since 1994. I have spent the last eighteen months building ArgentOS — an intent-native multi-agent operating system with 18 specialized agents, persistent memory, tool harnesses, and a guardrail system that forces agents to prove their work or get looped back. I run it every day. I build real client deliverables with it. I have burned through enough API tokens to know exactly where the money goes.

And I’m here to tell you: most of it is wasted.

Not on bad prompts. Not on hallucinations. Not on the wrong model. On resending context the model has already seen, will immediately forget, and doesn’t need for the task at hand.

This is the architectural flaw sitting underneath every agent system that treats frontier APIs like a chat interface. And fixing it changes everything — your cost structure, your latency, your privacy model, and what hardware your product actually needs to run on.

The Dumb Loop

Here’s how most agent systems work today, including — honestly — how I was running parts of ArgentOS until recently.

You send a prompt to the API. The model responds. You append that response to the conversation history. Next turn, you send the whole history again — including the response you just got back — plus the new message. The model processes it all, responds, you append again. Repeat.

Every turn, the payload grows. Every turn, you’re paying to re-process information the model already generated and will never recall. The model is stateless. It retains nothing between calls. You know this intellectually. But the API’s chat interface design makes it feel like a conversation, so you treat it like one — shipping the full transcript every time as if the model is sitting there reading its own notes.

It’s an O(n²) cost pattern: turn k re-sends all k prior turns, so total input tokens grow quadratically with conversation length, when each call should need only O(1) retrieval from local state.
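To make the growth concrete, here is a minimal sketch of that loop. The API call is stubbed out and tokens are approximated by whitespace splitting; both are simplifications for illustration only.

```python
# Sketch of the naive chat loop: every turn resends the full transcript.
# `call_model` is a stub standing in for a frontier API; token counts are
# approximated by whitespace-splitting, just to make the growth visible.

def count_tokens(text: str) -> int:
    return len(text.split())

def call_model(history: list[str]) -> str:
    # Stub: a real implementation would send `history` to the API here.
    return f"response to turn {len(history)}"

def naive_loop(turns: int) -> int:
    history: list[str] = []
    total_input_tokens = 0
    for i in range(turns):
        history.append(f"user message {i}")
        # The whole transcript ships on every call -- this is the O(n^2) part.
        total_input_tokens += sum(count_tokens(m) for m in history)
        history.append(call_model(history))
    return total_input_tokens
```

Doubling the number of turns roughly quadruples the input tokens shipped, which is exactly the quadratic blowup described above.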

For a single-agent chatbot, this is annoying but manageable. For a multi-agent orchestration system — 18 agents, tool call chains, persistent memory lookups, concurrent workflows — it’s a financial sinkhole. I’ve watched single complex workflows burn through 250K+ input tokens when the actual reasoning work needed maybe 40K.

That’s not a rounding error. That’s 6x overspend. On every workflow. Every day.

Separate Context From Compute

The fix isn’t better prompting. It isn’t a cheaper model. It’s an architectural separation that seems obvious once you see it but almost nobody is implementing cleanly.

Context is a local asset. The frontier model is a remote compute service. Treat it accordingly.

Your harness should own, store, index, and retrieve all context locally. When frontier-grade reasoning is required, the harness should assemble a minimal context packet — just what the model needs for this specific call — and send only that. The response comes back, gets integrated into local state, and the cycle continues. The model never sees the full picture. It sees a curated briefing every time.

Think about how you’d work with an outside consultant. You don’t ship them your entire company drive and say “figure it out.” You prepare a briefing. You include the relevant background, the specific question, the constraints, and the deliverable format. Everything else stays in your filing cabinet. The consultant does their work with what you gave them and hands back a result.

That’s the architecture. Your harness is the organization. The frontier model is the consultant. The briefing is the context packet.
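The cycle can be sketched in a few lines. Everything here (the store, the keyword-match retrieval stub, the packet shape) is hypothetical and illustrative; only the shape matters: context stays local, a briefing goes out, and the result comes back into local state.

```python
# Minimal sketch of the harness cycle: retrieve locally, brief remotely,
# integrate the result. All names here are hypothetical.

def run_intent(intent: str, store: dict[str, str]) -> str:
    # 1. Retrieve only the chunks relevant to this intent (stubbed here
    #    as a naive keyword match against a local store).
    relevant = [v for k, v in store.items() if k in intent]

    # 2. Assemble a minimal briefing: intent + retrieved context, nothing else.
    packet = {"intent": intent, "context": relevant}

    # 3. Call the remote compute service with just the packet (stubbed --
    #    a real implementation would make the API call here).
    result = f"analysis of {packet['intent']} using {len(packet['context'])} chunks"

    # 4. Integrate the result back into local state; the model keeps nothing.
    store[f"result:{intent}"] = result
    return result
```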

The Harness Doesn’t Need a Brain. It Needs a Librarian.

Here’s where I got stuck for a while, and I think a lot of builders get stuck in the same place.

If the harness needs to decide what context is relevant before calling the API, doesn’t that mean the harness needs its own intelligence? And if I put a local LLM in the harness to make that decision, won’t it be too dumb to do it well? I’m trying to avoid calling the frontier for every interaction, but the local model isn’t smart enough to orchestrate. It feels like a loop — I want frontier intelligence without frontier cost, and there’s no clean way to get both.

The way out is realizing that the local layer doesn’t need to be a thinker. It needs to be a librarian.

Deciding “what context is relevant to this intent” is not a reasoning task. It’s a retrieval and ranking task. You embed the intent, run a vector similarity search against your context stores, score the results, and assemble the top-ranked chunks into a packet. That’s not an LLM workload. That’s a database query with an embedding step.

An embedding model like all-MiniLM-L6-v2 runs on CPU. No GPU. Sub-100 milliseconds. It'll run on a MacBook Air. It'll run on whatever hardware your customers happen to have. The vector search runs against PostgreSQL with pgvector — standard database infrastructure. The packet assembly is pure application logic. Token budgeting, sliding window management, priority filling — that's just code.

No local LLM required. No GPU required. The retrieval pipeline handles context projection, and the frontier API handles reasoning. Each one does what it’s good at.
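The ranking step really is just code. Here is a pure-Python sketch of it; in production the vectors would come from an embedding model like all-MiniLM-L6-v2 and the search would be a pgvector query (something like `ORDER BY embedding <=> $1 LIMIT k`), but the logic is the same.

```python
import math

# Pure-Python sketch of the "librarian": rank stored chunks against an
# intent embedding and keep the top k. In production the embeddings come
# from a model and the nearest-neighbor search from pgvector; the ranking
# and assembly logic is plain application code.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_chunks(intent_vec, chunks, k=3):
    # chunks: list of (text, embedding) pairs from the local store.
    scored = [(cosine(intent_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for score, text in scored[:k]]
```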

The TOON Context Packet

So you’ve separated context from compute. Your harness retrieves relevant information locally and assembles a briefing for the frontier model. The next question is: what format does that briefing take?

Right now, most systems serialize context as JSON. Some use raw text. Both are wasteful. JSON is verbose — repeated keys, braces, brackets, quotation marks on every value. When you’re assembling a context packet with agent state, retrieved memories, tool definitions, and entity references, JSON’s structural overhead adds up fast.

This is where I started looking at TOON — Token-Oriented Object Notation. It’s an open format designed specifically for LLM input. It encodes the same JSON data model but strips the syntactic noise, using YAML-style indentation for nesting and CSV-style tabular layout for uniform arrays.

The benchmarks caught my attention: approximately 40% fewer tokens than JSON with equal or better LLM comprehension accuracy. That’s not a tradeoff. That’s a free lunch — cheaper and the model understands it better.

The reason is the schema-aware header syntax. When you encode an array of objects in TOON, you declare the field names once in a header — [N]{field1,field2,field3} — and then each object is just a row of values. The model sees the schema explicitly declared upfront and then parses rows against it. That's structurally easier to follow than JSON, where the model has to infer the schema by reading repeated key-value pairs.
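A minimal encoder for that tabular form might look like this. It handles only flat, uniform arrays of objects and skips the quoting, escaping, and nesting rules the real TOON spec defines.

```python
# Minimal sketch of TOON's tabular array form: declare the fields once in
# a [N]{fields} header, then emit one comma-separated row per object.
# This covers only uniform arrays of flat objects, not the full spec.

def to_toon_array(name: str, rows: list[dict]) -> str:
    fields = list(rows[0].keys())
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + lines)
```

Fed a list of agent-state dicts, this emits the schema header once and then plain value rows, which is where the token savings over repeated JSON keys come from.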

For a context packet, this is ideal. Think about what’s in the packet:

Agent state for multiple agents? Uniform array of objects — agent ID, status, current task, last action. Declare the fields once, stream the rows.

Retrieved memory chunks? Uniform array — timestamp, relevance score, content, source. One header, N rows.

Available tools? Uniform array — name, description, parameters. One header, N rows.

Conversation turns? Tabular — role, timestamp, content. One header, N rows.

This is exactly the data shape where TOON compresses hardest. You’re not just saving tokens on the wire — you’re giving the model a cleaner, more parseable input that produces better results.

What Goes in the Packet

I’ve been working through what the minimal viable context packet looks like for ArgentOS. Seven fields. Everything the frontier model needs to do its work. Nothing it doesn’t.

Intent. What are we trying to accomplish. One clear statement. This is the task specification, not a conversation.

Constraints. What rules apply. Compliance frameworks, operator preferences, output restrictions. Non-negotiable guardrails that must carry through to the response.

Entities. The specific nouns involved. A client name, a server IP, a domain, a ticker symbol. Just the concrete references the model needs to work with.

Context. Retrieved knowledge relevant to this intent. Memory hits scored by the projection layer for relevance. Not everything in the store — just the top-ranked chunks that fit the token budget.

State. Where we are in a workflow. What’s been tried, what succeeded, what failed. The model needs trajectory to avoid repeating failed approaches.

Tools. Available tools for this specific call. Not every tool in the system — just the ones scoped to this intent. Narrowing the tool set improves selection accuracy.

Response Format. What shape the output should take. Structured report, decision with confidence score, tool call plan, generated content. The harness needs to parse the response programmatically, so tell the model the shape upfront.

Seven fields. All structured. All compressible via TOON. The frontier model gets a clean briefing, does its work, and returns a result. The harness indexes that result into local memory and moves on.
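As an illustration, a packet along these lines might look like the following. Every value here is invented, and the layout loosely follows the header-and-rows style described above rather than being spec-exact TOON.

```
intent: summarize Q3 infrastructure incidents for client review
constraints[2]{rule}:
  redact client names in all output
  cite evidence for every assertion
entities[2]{type,value}:
  client,acme
  domain,status.example.com
context[2]{score,source,content}:
  0.91,memory,three outages traced to DNS failover
  0.84,ticket,incident resolved via config rollback
state: step 3 of 5; diagnosis complete, summary pending
tools[1]{name,description}:
  report_writer,renders structured markdown reports
response_format: structured report with per-finding confidence scores
```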

Token Budgeting

The packet operates under a fixed token budget. Not “send as much as fits in the context window” — a deliberately constrained budget that forces the projection layer to prioritize.

The filling order matters:

System prompt and intent come first — the model must know what to do.

Constraints and entities come next — non-negotiable context.

Tools and state fill the middle tier.

Retrieved context fills whatever budget remains, ranked by relevance score, lowest scores dropped first.

If the budget is tight, context chunks get trimmed. The model always knows the task, the rules, and the entities. Depth is variable. This is the opposite of how most systems work, where everything gets sent and the model sorts through it. Here, the harness sorts through it before the model ever sees it.
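A sketch of that filling order, with token counts approximated by whitespace splitting and all names hypothetical:

```python
# Sketch of priority-ordered budget filling: fixed tiers ship first, then
# ranked context chunks fill whatever budget remains, best scores first.
# Token counts are whitespace-split approximations for illustration.

def tokens(text: str) -> int:
    return len(text.split())

def fill_packet(intent: str, constraints: str, ranked_chunks, budget: int):
    # Tier 1-2: intent and constraints always ship.
    packet = [intent, constraints]
    remaining = budget - sum(tokens(p) for p in packet)

    # Tier 3: retrieved context fills what's left, highest-scored first;
    # chunks that don't fit the remaining budget are dropped.
    for score, chunk in sorted(ranked_chunks, reverse=True):
        cost = tokens(chunk)
        if cost <= remaining:
            packet.append(chunk)
            remaining -= cost
    return packet
```

The invariant is the one described above: the task, rules, and entities always arrive intact, and only the depth of retrieved context flexes with the budget.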

The Operator Model Implication

This architecture has a second-order effect that’s potentially more significant than the cost savings.

If the frontier model only gets called for actual reasoning work — complex analysis, multi-step planning, nuanced generation — then the interactive layer, the thing the human operator actually talks to, doesn’t need to be frontier-grade.

It needs to understand natural language intent. It needs to route to the correct agent. It needs to present results clearly. It does not need to be the smartest model on the market. What makes a lesser model viable at the interactive layer isn’t raw intelligence — it’s the harness constraining it.

I’ve spent months building guardrail systems in ArgentOS — evidential proof checks that force agents to show their work, tool-use enforcement that won’t let an agent claim it did something without the harness confirming the tool was actually called, anti-hallucination loops that check assertions against available evidence and re-route when they don’t hold up.

Those guardrails are what make a non-frontier model reliable at the conversational layer. The discipline comes from the system, not the model. The frontier API becomes a service you call when you need heavy reasoning — not the thing powering every keystroke.
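The tool-use enforcement piece can be sketched simply: because the harness owns the invocation path, it has ground truth about what actually ran. Everything here is a hypothetical simplification of that pattern, not ArgentOS's actual implementation.

```python
# Sketch of tool-use enforcement: the harness records every tool
# invocation itself, so an agent's claim of having run a tool is only
# accepted if the harness log confirms the call actually happened.

class ToolLog:
    def __init__(self):
        self.calls: list[str] = []

    def invoke(self, tool_name: str, fn, *args):
        # The harness, not the agent, records the call before running it.
        self.calls.append(tool_name)
        return fn(*args)

    def verify_claim(self, claimed_tool: str) -> bool:
        # Reject assertions about tools the harness never saw run.
        return claimed_tool in self.calls
```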

The product implications are significant. If your operator-facing model can run locally on customer hardware — CPU only, no GPU, no special requirements — and the frontier API only gets called for surgical reasoning tasks with compressed TOON packets, you’ve fundamentally changed the deployment model. The customer’s cost drops. Their privacy improves — sensitive context stays local. Their latency improves — the interactive layer doesn’t round-trip to an API on every turn. And your product runs anywhere, on anything.

The Experiment, Not the Pivot

I’m not rewriting ArgentOS around this idea tomorrow. That would be reckless. But I am running an experiment.

The approach is simple: pick one agent interaction that currently ships full context to the API. Build the projection layer for just that case. Assemble the TOON packet, send it, and compare the results and token cost against the current approach. Same intent, both paths, side by side.

If the projected path maintains quality with significantly fewer tokens — and the benchmarks suggest it will — that’s the proof point. Expand from there. If it degrades, figure out why. Is it a retrieval problem? A compression problem? A budget problem? Fix the specific failure and re-test.
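A sketch of the side-by-side measurement, with the model call omitted and the projection layer crudely stubbed as "keep the last k turns" (a real run would use relevance-ranked retrieval and would score output quality, not just payload size):

```python
# Sketch of the one-seam experiment: send the same intent down the
# full-history path and the projected-packet path and compare payload
# sizes. All names are hypothetical; tokens are whitespace-approximated.

def tokens(text: str) -> int:
    return len(text.split())

def full_history_payload(history: list[str], intent: str) -> str:
    # Current approach: the entire transcript plus the new intent.
    return "\n".join(history + [intent])

def packet_payload(history: list[str], intent: str, k: int = 2) -> str:
    # Crude stand-in for the projection layer: keep only the k most
    # recent turns instead of a real relevance-ranked retrieval.
    return "\n".join(history[-k:] + [intent])

def compare(history, intent):
    full = tokens(full_history_payload(history, intent))
    packed = tokens(packet_payload(history, intent))
    return full, packed
```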

This is how you validate an architectural thesis without betting the farm on it. One seam. One measurement. One decision.

The Bigger Picture

I’ve written before about the harness layer being the real moat — that anyone can call an API, but the intelligence of what you send to it is the defensible advantage. I’ve written about organizational memory being the compounding asset that a fresh install can never replicate.

This is the next layer of that argument.

The moat isn’t just what you remember. It’s how efficiently you deploy what you remember into the narrow window of a frontier API call. Two organizations with identical memory stores and identical model access will get dramatically different results if one is shipping the full filing cabinet while the other is shipping a curated briefing.

Context projection is the skill of the harness. TOON is the wire format that makes it cost-effective. And the separation of interactive intelligence from reasoning intelligence is the deployment model that makes it accessible.

The harness owns context. The frontier model rents it. And the compression format for the lease matters more than most people think.

Jason Brashear is the creator of ArgentOS, an intent-native multi-agent operating system, and a partner at Titanium Computing. He has been a software developer since 1994 and writes about intent engineering, agentic architecture, and frontier operations. Find him on GitHub at webdevtodayjason.
