Where Does an Agent's Money Go?
A cost breakdown of one agent invocation:
| Category | Component | Behavior |
|---|---|---|
| Input tokens | System prompt | Fixed; paid on every single call |
| Input tokens | Tool schemas | Fixed; one entry per registered tool |
| Input tokens | Conversation history | Grows linearly with turns |
| Input tokens | Retrieved context | Dynamic |
| Output tokens | Reasoning traces | The Thought steps in ReAct |
| Output tokens | Tool call arguments | One per tool invocation |
| Output tokens | Final response | What the user actually sees |
| Latency | LLM inference | Usually > 90% of total latency |
| Latency | Tool execution | Usually < 10%, but stacks up when sequential |
Every optimization falls into one of two buckets: reduce token count or reduce wait time. The four experiments below quantify what each strategy actually delivers.
Demo 1: Token Cost Breakdown — Trim the System Prompt
The system prompt is sent to the model on every call. It's the most overlooked fixed cost.
Two versions compared:
MINIMAL_PROMPT = "You are a helpful assistant."
# → 6 tokens
VERBOSE_PROMPT = """You are an extremely helpful, knowledgeable, and professional AI assistant
for WonderLab's enterprise software platform. You specialize in providing accurate weather
information... Always be thorough, comprehensive, and leave no important detail unexplained."""
# → 107 tokens
Token counts:
Minimal ( 6 tokens): 'You are a helpful assistant.'
Verbose (107 tokens): 'You are an extremely helpful...'
Extra per call: 101 tokens
101 tokens might sound small. At GPT-4o input pricing ($2.50 / 1M tokens):
- 10K calls/day → 1.01M extra tokens, about $2.53/day
- 1M calls/day → 101M extra tokens, about $252.50/day, or roughly $7,575/month
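The same arithmetic can be rerun for any prompt size and traffic level. A minimal sketch, assuming a flat per-million-token input price and a 30-day month; the function name and its parameters are illustrative and not part of the demo code:

```python
def extra_prompt_cost(
    extra_tokens: int,
    calls_per_day: int,
    usd_per_million_tokens: float,
    days_per_month: int = 30,
) -> tuple[float, float]:
    """Return (daily, monthly) USD cost of sending extra_tokens on every call."""
    daily = extra_tokens * calls_per_day * usd_per_million_tokens / 1_000_000
    return daily, daily * days_per_month


if __name__ == "__main__":
    # 101 extra tokens per call at an assumed $2.50 per 1M input tokens
    for calls in (10_000, 1_000_000):
        daily, monthly = extra_prompt_cost(101, calls, 2.50)
        print(f"{calls:>9,} calls/day: ${daily:,.2f}/day, ${monthly:,.2f}/month")
```

Plugging in your own token delta and call volume shows quickly whether trimming the prompt is worth the engineering time.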
Latency measurement (2 runs, same query):
| Agent | Run 1 | Run 2 | Avg | Answer |
|---|---|---|---|---|
| Minimal | 6.90s | 3.39s | 5.15s | The current weather in Beijing is 25°C... |
| Verbose | 3.10s | 4.21s | 3.66s | The current weather in Beijing is 25°C... |
Verbose averaged lower latency than Minimal — counter-intuitive.
The explanation: 2 samples are nowhere near enough to measure LLM latency. API response time varies ±50% or more depending on server load. You need at least 10–20 samples and a median to see a stable pattern. The apparent difference here is pure noise.
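One way to collect enough samples is a small timing harness that reports the median. This is a sketch, not the measurement code used for the table above; measure_latency and its parameters are hypothetical names, and the sleep call stands in for a real agent invocation:

```python
import statistics
import time
from typing import Callable


def measure_latency(fn: Callable[[], object], samples: int = 15) -> dict[str, float]:
    """Call fn repeatedly and summarize wall-clock latency in seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call into your own agent.
    stats = measure_latency(lambda: time.sleep(0.01), samples=10)
    print(f"median={stats['median']:.3f}s  min={stats['min']:.3f}s  max={stats['max']:.3f}s")
```

Reporting min and max next to the median makes the spread visible, which is exactly what a two-run average hides.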
System prompt trimming saves token cost, not latency. Latency optimization requires different tools.
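Prompt Caching (advanced): Anthropic's Claude API supports explicit prompt caching through cache_control markers, while OpenAI's API applies prompt caching automatically to sufficiently long prompts with no marker required. In Claude: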
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; value chosen for illustration
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark as cacheable
    }],
    messages=[...],
)
# First call: writes to cache (billed at a premium over normal input, about 1.25x for the default 5-minute cache)
# Subsequent calls with same prefix: cache hit, ~90% cost discount
print(response.usage.cache_read_input_tokens)      # tokens served from cache
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
For systems with 10K+ token system prompts (RAG results, tool docs, background knowledge), Prompt Caching is the single highest-leverage optimization available.
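Whether caching pays off depends on how often the cached prefix is actually reused. A rough estimate, assuming cache writes cost 1.25x the base input price and cache reads cost 0.10x (treat both multipliers as assumptions and check them against the provider's current pricing); the function and parameter names are hypothetical:

```python
def cached_prompt_cost(
    prefix_tokens: int,
    calls: int,
    usd_per_million_tokens: float,
    hit_rate: float,
    write_multiplier: float = 1.25,  # assumed premium for writing to the cache
    read_multiplier: float = 0.10,   # assumed price of a cache read (~90% discount)
) -> tuple[float, float]:
    """Return (uncached_usd, cached_usd) for sending the same prefix on every call."""
    base = prefix_tokens * usd_per_million_tokens / 1_000_000  # one uncached send
    hits = calls * hit_rate
    misses = calls - hits  # each miss rewrites the cache at the write price
    cached = hits * base * read_multiplier + misses * base * write_multiplier
    return calls * base, cached


if __name__ == "__main__":
    # Illustrative numbers only: 10K-token prefix, 1,000 calls, $3.00 per 1M tokens
    for rate in (0.0, 0.5, 0.9):
        uncached, cached = cached_prompt_cost(10_000, 1_000, 3.00, rate)
        print(f"hit rate {rate:.0%}: uncached ${uncached:.2f}, cached ${cached:.2f}")
```

Under these assumptions a cache that never hits costs more than no cache at all, so sparse or bursty traffic that lets entries expire between calls can erase the benefit.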
Demo 2: Model Routing — Skip Agent Overhead for Simple Queries
Core idea: spend one cheap classification call to decide whether a query actually needs an agent. Queries that don't need tools get answered directly, skipping the multi-turn ReAct loop.
ROUTING_SYSTEM = """Classify the user query. Reply with ONLY one word:
- "direct" if answerable from general knowledge (no real-time data)
- "agent" if requires a tool call (weather, product pricing, calculation)"""
def classify_query(query: str) -> str:
    resp = llm.invoke([SystemMessage(ROUTING_SYSTEM), HumanMessage(query)])
    return "agent" if "agent" in resp.content.lower() else "direct"

def routed_run(query: str):
    route = classify_query(query)
    if route == "direct":
        resp = llm.invoke([HumanMessage(query)])  # direct answer, no agent overhead
    else:
        result = full_agent.invoke(...)  # full agent execution
Five test queries:
| Query | Route | Total | Tools |
|---|---|---|---|
| What is the capital of France? | direct | 2194ms | [] |
| Explain machine learning in one sentence. | direct | 2011ms | [] |
| What's the weather in Shanghai right now? | agent | 4213ms | ['get_weather'] |
| How much does WonderBot Pro cost per month? | agent | 6033ms | ['get_product_info'] |
| What is 299 multiplied by 12? | agent | 3878ms | ['calculator'] |
Classification accuracy: 5/5. But look at the numbers:
- direct queries: ~2000ms total, which includes the routing call (~1s) plus the direct LLM answer (~1s)
- agent queries: 4000–6000ms total, which is the routing call (~1s) plus the full agent (~3–5s)
The hidden cost of routing: every routing decision is an extra LLM call. For queries that must go through the agent, routing adds ~1 second of overhead with no benefit.
The ROI of routing depends on the ratio of "toolless queries" in your workload:
If > 40% of queries need no tools:
routing saves: (direct_query_count × agent_overhead)
routing costs: (all_queries × routing_call_cost)
→ net positive
If < 20% of queries need no tools:
routing is mostly overhead — skip it
Measure your actual workload distribution before deploying a routing layer.
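The trade-off can be written as a break-even calculation. A sketch under a simplifying assumption: without a router every query goes through the agent, and a toolless query costs agent_cost_simple there versus direct_cost when answered directly. All names and the example numbers are illustrative; the cost unit can be seconds or dollars as long as it is used consistently:

```python
def routing_net_saving(
    toolless_fraction: float,
    routing_cost: float,
    agent_cost_simple: float,
    direct_cost: float,
) -> float:
    """Average saving per query from adding a router (negative means a net loss)."""
    # Only toolless queries save anything; every query pays the routing call.
    return toolless_fraction * (agent_cost_simple - direct_cost) - routing_cost


def break_even_fraction(
    routing_cost: float, agent_cost_simple: float, direct_cost: float
) -> float:
    """Fraction of toolless queries at which routing starts to pay off."""
    saving_per_toolless = agent_cost_simple - direct_cost
    if saving_per_toolless <= 0:
        return float("inf")  # the direct path is no cheaper, so routing never pays off
    return routing_cost / saving_per_toolless


if __name__ == "__main__":
    # Hypothetical latencies in seconds: 1.0 routing, 3.5 agent on a simple query, 1.0 direct
    print(f"break-even toolless fraction: {break_even_fraction(1.0, 3.5, 1.0):.2f}")
    print(f"net saving at 20% toolless: {routing_net_saving(0.2, 1.0, 3.5, 1.0):+.2f}s per query")
```

A break-even fraction above 1.0 means no workload mix can justify the router at those costs.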
Demo 3: Parallel Tool Calls — 3.0x Speedup
When two or more tool calls are independent, there's no reason to run them sequentially.
import asyncio
import time

async def fetch_weather_async(city: str) -> str:
    await asyncio.sleep(0.1)  # 100ms simulated I/O latency
    ...

async def run_parallel(cities: list[str]) -> list[str]:
    return await asyncio.gather(*[fetch_weather_async(c) for c in cities])

def run_sequential(cities: list[str]) -> list[str]:
    for city in cities:
        time.sleep(0.1)  # sequential, blocking
        ...
3 cities × 100ms latency, 3 runs each:
Sequential avg: 300.4ms (expected ~300ms) ✓
Parallel avg: 101.4ms (expected ~100ms) ✓
Speedup : 3.0x (66% faster)
Exactly what theory predicts. N independent tool calls in parallel: latency drops from N×t to t.
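The N×t to t result holds only when every call is independent. In a mixed case, a dependent chain still runs in order while unrelated calls overlap with it, so total latency is the longest chain rather than the sum. A sketch with hypothetical tool names and simulated I/O:

```python
import asyncio
import time


async def call_tool(name: str, arg: str, delay_s: float = 0.1) -> str:
    """Hypothetical tool: sleeps to simulate I/O, then echoes its input."""
    await asyncio.sleep(delay_s)
    return f"{name}({arg})"


async def lookup_then_price(product: str, delay_s: float) -> str:
    # Dependent chain: the second call needs the first call's output.
    product_id = await call_tool("lookup_id", product, delay_s)
    return await call_tool("get_price", product_id, delay_s)


async def run_mixed(
    product: str, city: str, delay_s: float = 0.1
) -> tuple[list[str], float]:
    start = time.perf_counter()
    # The chain and the weather call do not depend on each other, so they overlap.
    results = await asyncio.gather(
        lookup_then_price(product, delay_s),
        call_tool("get_weather", city, delay_s),
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_mixed("WonderBot Pro", "Shanghai"))
    print(results, f"{elapsed * 1000:.0f}ms")
```

Three calls of duration t finish in roughly 2t here instead of 3t, because the two-step chain sets the floor.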
LangGraph handles this natively. When the LLM emits multiple tool_calls in a single response turn, create_react_agent executes them in parallel automatically. No asyncio boilerplate is needed inside the tools: declare your tool functions as async def and run the agent through its async entry point (for example ainvoke):
@lc_tool
async def get_weather(city: str) -> str:  # ← async declaration
    """Get current weather for a city."""
    result = await weather_api.fetch(city)  # non-blocking I/O
    return result
The prerequisite: the LLM needs to recognize that multiple tool calls are independent and emit them together in one turn. Weaker models may still call them one by one.
Demo 4: Tool Result Cache — 0ms vs 100ms
When it applies: the same tool is called multiple times with the same arguments within a short window (user asks the same city's weather twice, or a multi-step agent needs the same data at different points in reasoning).
import time

_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_S = 60.0

def get_weather_cached(city: str) -> tuple[str, bool]:
    key = f"weather:{city.lower()}"
    now = time.time()
    if key in _cache:
        result, ts = _cache[key]
        if now - ts < CACHE_TTL_S:
            return result, True  # cache hit
    # miss: call the real tool
    time.sleep(0.1)  # simulated tool latency
    result = fetch_from_api(city)
    _cache[key] = (result, now)
    return result, False
Six calls, 3 unique cities:
| City | Status | Time | Note |
|---|---|---|---|
| Beijing | MISS | 100.2ms | 1st call |
| Shanghai | MISS | 100.2ms | 1st call |
| Beijing | HIT ✓ | 0.0ms | 2nd call |
| Shenzhen | MISS | 100.2ms | 1st call |
| Shanghai | HIT ✓ | 0.0ms | 3rd call |
| Beijing | HIT ✓ | 0.0ms | 4th call |
Hit rate: 3/6 = 50%
Miss latency: ~100ms (real tool call)
Hit latency: < 1ms (dict lookup)
TTL guidelines:
| Data type | TTL | Reasoning |
|---|---|---|
| Weather | 5–15 min | Changes slowly; users don't need realtime precision |
| Product pricing | Hours | Rarely changes |
| Inventory / stock | < 1 min or none | Business-critical freshness |
| Write operations | Never | Side effects must not be replayed |
Never cache tools with side effects (file writes, emails, database mutations). Replaying the same call with the same parameters will produce duplicate side effects.
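These guidelines can be folded into one cache that takes a TTL per tool and refuses to cache anything not explicitly listed. A minimal in-memory sketch; the class, the tool names, and the TTL values are illustrative, and arguments are assumed to be hashable:

```python
import time
from typing import Callable, Optional


class ToolResultCache:
    """In-memory cache keyed by (tool, args) with a per-tool TTL in seconds."""

    def __init__(
        self,
        ttl_by_tool: dict[str, Optional[float]],
        clock: Callable[[], float] = time.monotonic,
    ):
        # A tool missing from ttl_by_tool, or mapped to None, is never cached.
        self._ttl = ttl_by_tool
        self._clock = clock
        self._store: dict[tuple, tuple[object, float]] = {}

    def call(self, tool: str, fn: Callable[..., object], *args):
        """Return (result, was_cache_hit)."""
        ttl = self._ttl.get(tool)
        if ttl is None:
            return fn(*args), False  # side-effect or unlisted tool: always execute
        key = (tool, args)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < ttl:
            return entry[0], True
        result = fn(*args)
        self._store[key] = (result, now)
        return result, False


if __name__ == "__main__":
    cache = ToolResultCache(
        {"get_weather": 600.0, "get_product_info": 3600.0, "send_email": None}
    )
    fake_weather = lambda city: f"{city}: (simulated weather)"
    print(cache.call("get_weather", fake_weather, "Beijing"))
    print(cache.call("get_weather", fake_weather, "Beijing"))
```

Making caching opt-in per tool means a newly registered write tool is safe by default; the injectable clock exists only to make expiry testable.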
Design Checklist
Token optimization
- [ ] Count system prompt tokens; question every sentence ("is this actually needed?")
- [ ] Move static reference docs (product manuals, API docs) to RAG retrieval — inject only on demand
- [ ] Cap conversation history at 10–20 recent turns; summarize what's pruned
- [ ] High-volume + large system prompts → evaluate Claude/OpenAI Prompt Caching ROI
Model routing
- [ ] Measure the fraction of your queries that need no tools before building a router
- [ ] Routing classifier prompt must reflect your actual intent boundary — not a generic direct/agent split
- [ ] Measure routing overhead (one extra LLM call) against the agent overhead it avoids
Parallel tool calls
- [ ] Declare tool functions as async def; LangGraph parallelizes independent calls automatically
- [ ] Identify "must-be-sequential" tools (output of A feeds input of B) vs truly independent ones
- [ ] Multi-provider scenario: parallel calls to different services → total latency = max(individual latencies), not sum
Tool result caching
- [ ] Prioritize idempotent tools (same input → same output)
- [ ] Set TTL per tool based on data freshness requirements; don't use one TTL for everything
- [ ] Never cache write/side-effect tools
- [ ] Production: use Redis instead of an in-memory dict for multi-instance cache sharing
Summary
Five core takeaways:
- Token cost is the most controllable cost: system prompt, tool schemas, conversation history — each can be measured and reduced without changing the model or architecture
- Latency measurement needs sufficient samples: 2-run results can be misleading (verbose was "faster" here); stabilize with 10+ samples and median values
- Model routing has hidden overhead: routing adds one LLM call per query; it only turns positive when > 40% of queries need no tools
- Parallel tool calls are the cleanest optimization: N independent calls → latency from N×t to t; LangGraph supports this natively with async tool functions
- Cache ROI depends on hit rate: below 30% hit rate, the complexity of caching outweighs the benefit; TTL design matters more than the cache implementation itself
Up next: Harness Engineering — Complete System — expanding from the five-element introduction to the full 8-layer framework, including the action space registry, permission budget system, and a complete threat model.
References
- Anthropic Prompt Caching documentation
- LangGraph tool calling concepts
- Full demo code for this series: agent-17-cost-optimization