What You Ask the Model to Do Matters More Than Which Model You Use
Most advice about AI agent costs starts and ends with tokens. Cache your prompts. Batch your requests. Use a cheaper model. And those tactics help, the same way compressing images helps a slow website. They’re optimizations at the wrong layer.
The bigger problem is architectural. Teams building multi-agent systems default to routing everything through an LLM because it’s the easiest pattern, not because it’s the right one. Every status check, every file validation, every data comparison, every formatted notification goes through a model that charges per token and introduces the possibility of hallucination on every call. The convenience of “just let the AI figure it out” becomes a tax on every operation in the system.
Gartner predicts that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs and unclear value. The escalating costs are addressable. The unclear value problem is a different challenge, requiring better product-market fit and outcome measurement. This article addresses the cost side: the architecture decisions that make AI agent systems expensive to operate, and the framework for fixing them. The question nobody asks is whether those LLM calls should be LLM calls at all.
We run a multi-agent production system with dozens of recurring LLM sessions across research, content, analytics, and infrastructure tasks. When we audited our own operations, the biggest cost savings didn’t come from model downgrades or prompt compression. They came from identifying entire sessions that had no business being LLM calls in the first place. File-existence checks running on premium reasoning engines. Status notifications routed through models that cost 100x what a formatted string costs. Structured data comparisons wrapped in conversational AI sessions.
We built a framework for that audit. We call it the Four Axes of Agent Efficiency: Script-It, Ground-It, Skill-It, Slim-It.
The Four Axes of Agent Efficiency
Each axis addresses a different category of misallocated LLM usage. Together, they form an audit lens for any multi-agent system, from a single agent with scheduled tasks to a coordinated team of dozens.
The framework targets precision, not reduction. Use AI where it adds genuine value, and use simpler tools everywhere else. An agent writing editorial content genuinely needs a capable model. That same system doesn’t need a premium model to check whether a file exists.
Axis 1: Script-It — Replace Deterministic Sessions with Scripts
The pattern is straightforward. An AI agent runs on a schedule, follows the exact same steps every time, reads structured data, applies fixed rules, and outputs structured results. The LLM adds no novel reasoning. It just follows instructions.
We had a cron error triage system running as two separate LLM sessions. The first analyzed error logs by reading JSON and classifying entries through pattern matching. The second ran on our most expensive model to apply fixes like increasing timeouts or updating configuration values. Both tasks are entirely deterministic. The analysis logic already existed in a script; the LLM was wrapping shell commands and formatting a Discord message.
The fix: we enhanced the existing Python script with two flags. One applies low-risk configuration fixes directly. The other posts a formatted notification. One system cron job replaced two LLM sessions. Identical behavior, zero AI cost, and faster execution because there’s no model inference latency.
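As an illustration, the replacement can be that small. This is a sketch, not our actual script: the flag names, error-log schema, and fix rules here are all hypothetical stand-ins for whatever your triage sessions were doing.

```python
import argparse
import json

# Hypothetical mapping of error types to the config key a low-risk fix
# doubles. Real rules would mirror whatever the triage sessions applied.
FIX_RULES = {
    "TimeoutError": "timeout_seconds",
    "RateLimitError": "retry_delay_seconds",
}

def classify(entries):
    """Split log entries into auto-fixable and needs-human buckets."""
    fixable = [e for e in entries if e.get("error_type") in FIX_RULES]
    manual = [e for e in entries if e.get("error_type") not in FIX_RULES]
    return fixable, manual

def apply_fix(config, error_type):
    """Apply the fixed, low-risk remediation: double the relevant value."""
    key = FIX_RULES[error_type]
    config[key] = config.get(key, 30) * 2
    return config

def format_notification(fixable, manual):
    """Deterministic status message: string formatting, not reasoning."""
    return f"Cron triage: {len(fixable)} auto-fixed, {len(manual)} need review."

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--fix", action="store_true")
    parser.add_argument("--notify", action="store_true")
    args, _ = parser.parse_known_args()
    if args.fix or args.notify:
        with open("error_log.json") as f:
            fixable, manual = classify(json.load(f))
        if args.fix:
            with open("config.json") as f:
                config = json.load(f)
            for entry in fixable:
                apply_fix(config, entry["error_type"])
            with open("config.json", "w") as f:
                json.dump(config, f, indent=2)
        if args.notify:
            print(format_notification(fixable, manual))
```

Everything here is a fixed rule applied to structured input, which is exactly why the LLM added nothing.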
This is the most common pattern we see in production agent systems. An LLM call that exists because it was the fastest thing to build, not because it needs reasoning. The agent architecture made it easy to route everything through the model, so everything got routed through the model. The audit catches these by asking a simple question: does this task produce different outputs for different inputs based on judgment, or is it following a fixed procedure?
Scripts don’t hallucinate.
Identification checklist:
- The process reads structured data (JSON, config files, databases) and outputs structured data
- No natural language generation in the output
- The same input always produces the same output
- The LLM session is short with predictable tool calls
- The task is pure validation, comparison, or aggregation
Axis 2: Ground-It — Move State and Decisions into Structured Data
Two agents communicate state through prose. Or an agent reads a large markdown file to determine what stage a work item has reached. Or an agent interprets unstructured text to make a decision that should be based on an explicit data field.
This one matters beyond cost. When Agent A writes a status update in natural language and Agent B interprets it, there’s an interpretation gap. Agent B might misread the status, miss a nuance, or make a different assumption about what “nearly done” means. A JSON field that says "status": "awaiting-review" is unambiguous.
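A minimal sketch of what grounded state looks like in practice. The field names and status values here are illustrative, not a prescribed schema:

```python
import json

# Illustrative tracker entry; the fields are examples, not a standard.
item = json.loads("""{
    "id": "brief-142",
    "status": "awaiting-review",
    "stage": "draft-complete",
    "updated_at": "2025-06-01T14:30:00Z"
}""")

def is_awaiting_review(entry):
    """Every agent reading this field gets the same answer, every time."""
    return entry["status"] == "awaiting-review"

def advance(entry, new_status, timestamp):
    """Write state by setting fields, not by describing progress in prose."""
    entry["status"] = new_status
    entry["updated_at"] = timestamp
    return entry
```

There is nothing for a downstream agent to interpret: it reads a field, not a paragraph.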
The grounding mechanism depends on your system’s scale. For smaller single-host systems, JSON files are the right starting point: they’re simple, human-readable, and require no infrastructure beyond the file system. A database such as PostgreSQL (or a hosted option like Supabase) becomes necessary for larger multi-agent or multi-host deployments, where state needs to be queried concurrently, filtered, or joined across agents and machines. The principle is the same either way: state lives in explicit, structured fields that any agent can read without interpretation.
We migrated a content pipeline’s status tracking from a markdown file that agents would read and interpret to a JSON file with explicit status fields, timestamps, and stage data. The migration took roughly half a day and required updating the read/write functions in three agents. That single change eliminated an entire class of state-interpretation bugs. Items stopped getting stuck because one agent misread which stage they were in.
The cost savings from Ground-It are real, but the reliability improvement is the bigger win. When five agents all read the same status field and get the same answer every time, the system’s behavior becomes predictable. When they each interpret a paragraph of prose, you get five slightly different interpretations. In production, “slightly different” means items get processed twice, skipped entirely, or stuck in limbo.
Structured data doesn’t get misinterpreted.
Identification checklist:
- An agent reads a large file to find one data point
- Two agents communicate state through prose instead of structured data
- State is implicit (file exists in folder X means status Y) rather than explicit in a tracker
- An agent makes a decision based on interpreting another agent’s natural language output
Axis 3: Skill-It — Codify Repeated Processes
An agent performs the same multi-step operation regularly but works it out from scratch each session. It reads documentation, figures out the API format, discovers the correct file paths, and assembles the procedure. Every session burns context and tokens on re-discovery.
This axis has a direct accuracy payoff. A codified skill with explicit steps, file paths, and expected outputs doesn’t just save tokens. It eliminates the errors that come from improvisation. An agent following a skill file doesn’t guess the wrong endpoint, doesn’t try a deprecated API, doesn’t format a file incorrectly. Every error an agent doesn’t make is context it doesn’t waste on retries.
One of our agents was re-reading 20KB of style guides and reference files every session to calibrate its output voice for social media posting. A pre-computed 2KB checklist with extracted hard rules and representative examples provides the same calibration at roughly 10% of the context cost. It also eliminates the risk of the agent focusing on an irrelevant section of a lengthy reference document.
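The distillation step itself can be scripted. This sketch assumes a hypothetical convention where hard rules and examples are marked with `RULE:` and `EXAMPLE:` prefixes in the source guides; adapt the markers to your own documents:

```python
def distill_checklist(guide_texts, marker="RULE:", max_examples=3):
    """Pre-compute a compact calibration checklist from larger style guides."""
    rules, examples = [], []
    for text in guide_texts:
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(marker):
                rules.append(line[len(marker):].strip())
            elif line.startswith("EXAMPLE:") and len(examples) < max_examples:
                examples.append(line[len("EXAMPLE:"):].strip())
    # Deduplicate rules while preserving order, then emit a short checklist.
    seen = set()
    rules = [r for r in rules if not (r in seen or seen.add(r))]
    lines = ["# Voice checklist (auto-distilled)"]
    lines += [f"- {r}" for r in rules]
    lines += [f"- e.g. {e}" for e in examples]
    return "\n".join(lines)
```

Run once whenever the style guides change, and the agent loads the 2KB output instead of the 20KB sources every session.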
The compound effect matters here. If a process runs three times a day and burns 20KB of context each time versus 2KB, that’s 54KB of wasted context daily from a single agent. Across a multi-agent system with dozens of recurring tasks, the savings from codifying repeated processes can dwarf what you’d get from switching to a cheaper model.
Codified skills don’t forget steps.
Identification checklist:
- An agent does the same multi-step process regularly but no skill file exists
- An agent “discovers” how to do something by reading documentation each session
- A process requires specific tool calls in a specific order (codifiable sequence)
- Error logs show repeated mistakes in a process that should be routine
Axis 4: Slim-It — Reduce Unnecessary Context and Call Count
An agent session loads large context files that are only partially relevant, makes multiple tool calls to gather data it doesn’t use, or makes a follow-up LLM call for something trivial.
This is often the easiest axis to act on, and the one with the fastest payback. A content pipeline stage running on the most expensive model (for genuine quality reasons) was making a separate LLM call just to post a templated notification: “Draft ready for review — Brief [ID], edit link: [URL].” The brief ID and URL are known variables. The notification requires zero reasoning. It’s string formatting. Moving it to a scripted step after the stage completes eliminates an unnecessary call on the most expensive model in the system.
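The scripted replacement is a few lines. This sketch posts via a standard Discord webhook; the message builder is the entirety of the “reasoning” the old LLM call performed, and the brief ID, URL, and webhook address are placeholders:

```python
import json
import urllib.request

def build_notification(brief_id, edit_url):
    """The notification the LLM was generating: pure string formatting."""
    return f"Draft ready for review — Brief {brief_id}, edit link: {edit_url}"

def post_to_discord(webhook_url, message):
    """POST to a Discord webhook (JSON body with a 'content' field)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"content": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A usage sketch: `post_to_discord("https://discord.com/api/webhooks/<id>/<token>", build_notification("B-142", edit_url))`, called as a scripted step after the stage completes.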
Context bloat is the subtler form of this problem. An agent loads a 15KB reference document when it only needs three fields from it. Over dozens of sessions per day, that surplus context costs real money and, more importantly, dilutes the model’s attention. Smaller, focused context means better output quality, not just lower cost.
Identification checklist:
- Session token count is much higher than output token count (reading a lot, producing little)
- Agent loads full context files when it only needs a subset
- Agent makes a follow-up LLM call for a status update or notification
- Multiple tool calls gather data that feeds a single decision
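The first checklist item can be checked mechanically from usage logs. This sketch assumes a hypothetical log shape with per-session input and output token counts; map it to whatever your provider’s usage reporting actually returns:

```python
def flag_slim_candidates(sessions, ratio_threshold=20):
    """Flag sessions that read far more context than they produce."""
    flagged = []
    for s in sessions:
        ratio = s["input_tokens"] / max(s["output_tokens"], 1)
        if ratio >= ratio_threshold:
            flagged.append({**s, "input_output_ratio": round(ratio, 1)})
    return flagged
```

The threshold is a judgment call per system; the point is that the first pass of the audit is a sort, not a debate.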
The Five-Step Audit Process
The framework becomes practical through a repeatable audit methodology. This is the process a technical leader can run on their own system starting next week.
- Inventory all processes. List every scheduled job, every recurring agent task, every inter-agent workflow. Include the model each process uses and how often it runs. You can’t optimize what you haven’t mapped.
- Measure before optimizing. Know what each process actually costs: tokens multiplied by model price multiplied by frequency. A $0.02/day process isn’t worth rewriting. A $5/day process is.
- Score each process on the four axes. For each recurring task, ask: Is this deterministic (Script-It)? Is it interpreting state that should be structured (Ground-It)? Is it rediscovering steps that should be codified (Skill-It)? Is it loading or doing more than necessary (Slim-It)?
- Prioritize by frequency times cost. Daily sessions on expensive models come first. When two candidates have similar totals, favor the higher-frequency one, because its savings compound faster.
- Implement in tiers. Quick wins first: model downgrades, script replacements, notification templating. Architectural changes later: database-backed state tracking, context budgets, skill libraries.
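Steps 2 and 4 are arithmetic, and can be sketched in a few lines. The process inventory and per-token price below are made-up illustrative numbers, not real provider pricing:

```python
# Illustrative inventory; token counts, prices, and run frequencies
# are placeholder values for the sketch.
processes = [
    {"name": "file-check", "avg_tokens": 2_000,
     "price_per_token": 1e-5, "runs_per_day": 48},
    {"name": "editorial-draft", "avg_tokens": 50_000,
     "price_per_token": 1e-5, "runs_per_day": 1},
]

def daily_cost(proc):
    """Step 2: tokens multiplied by model price multiplied by frequency."""
    return proc["avg_tokens"] * proc["price_per_token"] * proc["runs_per_day"]

def prioritize(procs):
    """Step 4: highest daily spend first."""
    return sorted(procs, key=daily_cost, reverse=True)

ranked = prioritize(processes)
```

Run against a real inventory, this ranking tends to surface the low-glamour, high-frequency tasks (the file-check pattern) at the top rather than the obviously expensive creative work.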
The audit itself is valuable even before implementation. Just categorizing your agent workload across the four axes reveals how much of your LLM spend goes to tasks that don’t require language model capabilities. In our experience, the inventory alone tends to surface surprises. Most teams discover that a significant portion of their scheduled agent work is deterministic, and they simply never questioned it because the system was working.
One caveat: the four-axis audit optimizes steady-state costs. It does not protect against acute cost spikes from runaway agents, retry loops, or configuration bugs. For that, you need a separate cost circuit breaker system — the two systems address different failure modes and work best in combination.
“Working” and “efficient” are different things. A premium reasoning model can absolutely check whether a file exists, and it will get it right every time. The savings ceiling appears significant: one survey found 80 to 90% cost reductions when optimization strategies are applied systematically. Our own first-pass results (eliminating 10 to 12 daily sessions) align with that direction, though we haven’t completed a full-system audit to measure a final percentage. The four-axis audit identifies which strategies apply where.
What the Results Look Like
We applied this framework to our production system. The first pass focused on our infrastructure agent.
Six LLM cron sessions were replaced by five system scripts. That single change eliminated roughly 10 to 12 LLM sessions per day. Two of those sessions were running on our most expensive model, and their entire job was checking whether a file existed and then exiting. That’s a premium reasoning engine doing the work of os.path.exists().
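For reference, here is the entire job of those two sessions as a script. The flag path is a placeholder; wire the check into cron however your pipeline triggers downstream work:

```python
import os
import tempfile

READY_FLAG = "/data/pipeline/ready.flag"  # placeholder path

def check_ready(path=READY_FLAG):
    """Everything those two premium-model sessions were doing."""
    return os.path.exists(path)

# Self-demonstration against a temporary file:
with tempfile.NamedTemporaryFile() as tmp:
    assert check_ready(tmp.name)       # file exists while the context is open
assert not check_ready(tmp.name)       # gone after the context closes
```

In a cron entry, the script’s exit status chains the next stage (`check_ready && run_next_stage`), with no inference latency and no failure mode beyond the file system itself.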
Beyond the first batch, we identified 17 additional high-cost sessions across remaining agents with clear categorization: which sessions genuinely need their current model (creative work, editorial judgment) versus which are over-provisioned for the task (procedural coordination, data lookups, formatted notifications). Of the 17, roughly 8 were Script-It candidates — deterministic processes wrapped in LLM sessions. Four were Slim-It opportunities — context bloat or unnecessary follow-up calls. Three were genuine LLM tasks that could move to a less expensive model without quality loss. Two required more investigation before categorizing. The distribution skews toward Script-It, which is consistent with the pattern we see across agent systems: the easiest waste to accumulate is also the easiest to eliminate.
The numbers shift based on your system’s scale and model choices, but the pattern is consistent. AI agent costs can explode when multi-agent systems hit production scale, with monthly bills potentially 10x higher than projected. Most of that explosion comes from architectural decisions, not token pricing. The same model that costs $0.05 per session for a quick lookup costs $2 to $5 per session for a complex editorial task. When you’re running dozens of the former that should be scripts, the waste compounds fast.
We found context reduction opportunities throughout: unnecessary file loads during blog post writing, status notifications routed through expensive models, and dedup checks running on items with no new content to compare. Each one was a small win individually. Together, they reshaped the cost profile of the entire system.
The pattern holds across different system architectures. Whether you’re running agents on OpenAI, Anthropic, Google, or open-source models, the question is the same: is this task using a reasoning engine for reasoning, or for convenience? The audit framework is model-agnostic because it targets the architecture, not the provider.
Efficiency Is Accuracy
Efficiency and accuracy aren’t competing goals. They’re the same goal.
Every unnecessary LLM call is an opportunity for an error. Every time an agent improvises a procedure that should be codified, it might get a step wrong. Every time state is communicated through prose instead of structured data, there’s an interpretation gap waiting to cause a failure.
The most reliable AI systems are the ones that use AI the least for things AI isn’t needed for.
Scripts don’t hallucinate. JSON fields don’t get misinterpreted. Codified skills don’t forget steps. The AI becomes more effective precisely because it’s freed from busywork to focus on the tasks that actually require intelligence: editorial judgment, creative synthesis, ambiguous decision-making, novel problem-solving.
This mirrors a pattern across the gap between conversational AI and agentic systems. Conversational AI needs to handle anything a user might say. Agentic systems should do the opposite: constrain everything that can be constrained, and reserve the model’s reasoning capacity for the genuinely ambiguous work.
What Should Stay as LLM Calls
The framework is an audit tool for identifying where AI reasoning adds genuine value. Some work demands a language model: creative writing, editorial judgment, novel problem-solving, and any natural language output meant for human consumption where tone and clarity matter. The question for each process isn’t “can an LLM do this?” It’s “does this task benefit from reasoning?” If the answer is no, there’s a simpler, cheaper, more reliable tool for the job.
FAQ
How much can you save with this framework?
Our first batch eliminated 10 to 12 LLM sessions per day and replaced 6 scheduled sessions with 5 system scripts. Exact savings depend on your model costs and session frequency. Start with the highest-frequency tasks running on your most expensive model. Those tend to yield the largest immediate savings.
Where should you start the audit?
Start with Slim-It. It produces the easiest wins because you’re cutting waste without rewriting anything. Then Script-It, where candidates are the clearest. Ground-It comes next for its reliability impact. Skill-It has the longest payoff but delivers the most context savings over time.
What if my team doesn’t have the engineering capacity to write replacement scripts?
Start with the audit itself. Just identifying which processes are misallocated is valuable for planning and budgeting. When you do start replacing, Script-It candidates are typically 20 to 50 line scripts. The frequency-times-cost prioritization ensures you’re tackling the highest-ROI items first, not rewriting everything at once.
Does this only apply to large multi-agent systems?
The framework applies to any system making recurring LLM API calls, even a single agent with scheduled tasks. The principles (don’t use reasoning for deterministic work, don’t use prose for structured state) are universal. A solo agent with 10 cron jobs has the same optimization surface as a team of 7 with 60.
How do you measure whether an LLM call is “unnecessary”?
Apply the identification checklists. If the same input always produces the same output, if the task is pure validation or comparison, if the agent is reading structured data to find one field, those are candidates. Prioritize by cost times frequency. The measurement isn’t subjective; it’s mechanical.
Won’t AI models eventually become cheap enough that this doesn’t matter?
Cost is only half the argument. The reliability gains persist regardless of token pricing. Scripts don’t hallucinate at any price point. Structured data doesn’t develop interpretation gaps when models get cheaper. A file-existence check routed through an LLM is still an unnecessary failure surface, whether the call costs $0.50 or $0.005. Cheaper models don’t fix architectural decisions.
Getting Started
The audit is the first step, and it doesn’t require any code changes. Map your processes. Measure their costs. Score them on the four axes. The framework will show you where your system is spending reasoning capacity on work that doesn’t require reasoning.
If you’re at the stage of evaluating whether to build agent systems at all, the audit framework can inform your architecture from day one. Teams that design with the four axes in mind, reserving LLM calls for genuine reasoning and building scripts, structured data, and skills from the start, avoid the cost curve that catches teams who default everything to the model and optimize later.
The agents get better not just through better prompts or newer models, but through a disciplined architecture that matches every task to the right tool.
The most reliable AI systems use AI the least for things AI isn’t needed for. That’s not a limitation. That’s the design goal — and the four-axis audit is how you measure whether your architecture is actually achieving it.

