DEV Community

Cover image for Agent Series (18): Cost & Performance Optimization — Cheaper and Faster
WonderLab
WonderLab

Posted on

Agent Series (18): Cost & Performance Optimization — Cheaper and Faster

Where Does an Agent's Money Go?

A cost breakdown of one agent invocation:

Input tokens:
  System prompt         Fixed — paid on every single call
  Tool schemas          Fixed — one entry per registered tool
  Conversation history  Grows linearly with turns
  Retrieved context     Dynamic

Output tokens:
  Reasoning traces      The Thought steps in ReAct
  Tool call arguments   One per tool invocation
  Final response        What the user actually sees

Latency breakdown:
  LLM inference         Usually > 90% of total latency
  Tool execution        Usually < 10%, but stacks up when sequential
Enter fullscreen mode Exit fullscreen mode

Every optimization falls into one of two buckets: reduce token count or reduce wait time. Four experiments ahead to quantify what each strategy actually delivers.


Demo 1: Token Cost Breakdown — Trim the System Prompt

The system prompt is sent to the model on every call. It's the most overlooked fixed cost.

Two versions compared:

MINIMAL_PROMPT = "You are a helpful assistant."
# → 6 tokens

VERBOSE_PROMPT = """You are an extremely helpful, knowledgeable, and professional AI assistant
for WonderLab's enterprise software platform. You specialize in providing accurate weather
information... Always be thorough, comprehensive, and leave no important detail unexplained."""
# → 107 tokens
Enter fullscreen mode Exit fullscreen mode

Token counts:

  Minimal  (  6 tokens): 'You are a helpful assistant.'
  Verbose  (107 tokens): 'You are an extremely helpful...'
  Extra per call: 101 tokens
Enter fullscreen mode Exit fullscreen mode

101 tokens might sound small. At GPT-4o input pricing ($2.50 / 1M tokens):

  • 10K calls/day → $0.25/day extra
  • 1M calls/day → $25/day, $750/month

Latency measurement (2 runs, same query):

  Agent       Run 1    Run 2     Avg   Answer
  Minimal     6.90s    3.39s   5.15s  The current weather in Beijing is 25°C...
  Verbose     3.10s    4.21s   3.66s  The current weather in Beijing is 25°C...
Enter fullscreen mode Exit fullscreen mode

Verbose averaged lower latency than Minimal — counter-intuitive.

The explanation: 2 samples are nowhere near enough to measure LLM latency. API response time varies ±50% or more depending on server load. You need at least 10–20 samples and a median to see a stable pattern. The apparent difference here is pure noise.

System prompt trimming saves token cost, not latency. Latency optimization requires different tools.

Prompt Caching (advanced): Claude and OpenAI APIs support explicit prompt caching. In Claude:

response = client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LARGE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark as cacheable
    }],
    messages=[...],
)
# First call: writes to cache (billed normally)
# Subsequent calls with same prefix: cache hit, ~90% cost discount
print(response.usage.cache_read_input_tokens)     # tokens served from cache
print(response.usage.cache_creation_input_tokens) # tokens written to cache
Enter fullscreen mode Exit fullscreen mode

For systems with 10K+ token system prompts (RAG results, tool docs, background knowledge), Prompt Caching is the single highest-leverage optimization available.


Demo 2: Model Routing — Skip Agent Overhead for Simple Queries

Core idea: spend one cheap classification call to decide whether a query actually needs an agent. Queries that don't need tools get answered directly, skipping the multi-turn ReAct loop.

ROUTING_SYSTEM = """Classify the user query. Reply with ONLY one word:
- "direct"  if answerable from general knowledge (no real-time data)
- "agent"   if requires a tool call (weather, product pricing, calculation)"""

def classify_query(query: str) -> str:
    resp = llm.invoke([SystemMessage(ROUTING_SYSTEM), HumanMessage(query)])
    return "agent" if "agent" in resp.content.lower() else "direct"

def routed_run(query: str):
    route = classify_query(query)
    if route == "direct":
        resp = llm.invoke([HumanMessage(query)])   # direct answer, no agent overhead
    else:
        result = full_agent.invoke(...)             # full agent execution
Enter fullscreen mode Exit fullscreen mode

Five test queries:

  Query                                              Route    Total    Tools
  What is the capital of France?                     direct   2194ms   []
  Explain machine learning in one sentence.          direct   2011ms   []
  What's the weather in Shanghai right now?          agent    4213ms   ['get_weather']
  How much does WonderBot Pro cost per month?        agent    6033ms   ['get_product_info']
  What is 299 multiplied by 12?                      agent    3878ms   ['calculator']
Enter fullscreen mode Exit fullscreen mode

Classification accuracy: 5/5. But look at the numbers:

  • direct queries: ~2000ms total — this includes the routing call (~1s) plus the direct LLM answer (~1s)
  • agent queries: 4000–6000ms total — routing call (~1s) plus full agent (~3–5s)

The hidden cost of routing: every routing decision is an extra LLM call. For queries that must go through the agent, routing adds ~1 second of overhead with no benefit.

The ROI of routing depends on the ratio of "toolless queries" in your workload:

If > 40% of queries need no tools:
  routing saves: (direct_query_count × agent_overhead)
  routing costs: (all_queries × routing_call_cost)
  → net positive

If < 20% of queries need no tools:
  routing is mostly overhead — skip it
Enter fullscreen mode Exit fullscreen mode

Measure your actual workload distribution before deploying a routing layer.


Demo 3: Parallel Tool Calls — 3.0x Speedup

When two or more tool calls are independent, there's no reason to run them sequentially.

async def fetch_weather_async(city: str) -> str:
    await asyncio.sleep(0.1)   # 100ms simulated I/O latency
    ...

async def run_parallel(cities: list[str]) -> list[str]:
    return await asyncio.gather(*[fetch_weather_async(c) for c in cities])

def run_sequential(cities: list[str]) -> list[str]:
    for city in cities:
        time.sleep(0.1)   # sequential, blocking
        ...
Enter fullscreen mode Exit fullscreen mode

3 cities × 100ms latency, 3 runs each:

  Sequential  avg:  300.4ms   (expected ~300ms) ✓
  Parallel    avg:  101.4ms   (expected ~100ms) ✓
  Speedup        :  3.0x     (66% faster)
Enter fullscreen mode Exit fullscreen mode

Exactly what theory predicts. N independent tool calls in parallel: latency drops from N×t to t.

LangGraph handles this natively. When the LLM emits multiple tool_calls in a single response turn, create_react_agent executes them in parallel automatically. No asyncio boilerplate needed — just declare your tool functions as async def:

@lc_tool
async def get_weather(city: str) -> str:     # ← async declaration
    """Get current weather for a city."""
    result = await weather_api.fetch(city)   # non-blocking I/O
    return result
Enter fullscreen mode Exit fullscreen mode

The prerequisite: the LLM needs to recognize that multiple tool calls are independent and emit them together in one turn. Weaker models may still call them one by one.


Demo 4: Tool Result Cache — 0ms vs 100ms

When it applies: the same tool is called multiple times with the same arguments within a short window (user asks the same city's weather twice, or a multi-step agent needs the same data at different points in reasoning).

_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_S = 60.0

def get_weather_cached(city: str) -> tuple[str, bool]:
    key = f"weather:{city.lower()}"
    now = time.time()

    if key in _cache:
        result, ts = _cache[key]
        if now - ts < CACHE_TTL_S:
            return result, True      # cache hit

    # miss — call real tool
    time.sleep(0.1)
    result = fetch_from_api(city)
    _cache[key] = (result, now)
    return result, False
Enter fullscreen mode Exit fullscreen mode

Six calls, 3 unique cities:

  City         Status             Time  Note
  Beijing      MISS            100.2ms  1st call
  Shanghai     MISS            100.2ms  1st call
  Beijing      HIT  ✓            0.0ms  2nd call
  Shenzhen     MISS            100.2ms  1st call
  Shanghai     HIT  ✓            0.0ms  3rd call
  Beijing      HIT  ✓            0.0ms  4th call

  Hit rate:  3/6 = 50%
  Miss latency:  ~100ms  (real tool call)
  Hit  latency:  < 1ms   (dict lookup)
Enter fullscreen mode Exit fullscreen mode

TTL guidelines:

Data type TTL Reasoning
Weather 5–15 min Changes slowly; users don't need realtime precision
Product pricing Hours Rarely changes
Inventory / stock < 1 min or none Business-critical freshness
Write operations Never Side effects must not be replayed

Never cache tools with side effects (file writes, emails, database mutations). Replaying the same call with the same parameters will produce duplicate side effects.


Design Checklist

Token optimization

  • [ ] Count system prompt tokens; question every sentence ("is this actually needed?")
  • [ ] Move static reference docs (product manuals, API docs) to RAG retrieval — inject only on demand
  • [ ] Cap conversation history at 10–20 recent turns; summarize what's pruned
  • [ ] High-volume + large system prompts → evaluate Claude/OpenAI Prompt Caching ROI

Model routing

  • [ ] Measure the fraction of your queries that need no tools before building a router
  • [ ] Routing classifier prompt must reflect your actual intent boundary — not a generic direct/agent split
  • [ ] Measure routing overhead (one extra LLM call) against the agent overhead it avoids

Parallel tool calls

  • [ ] Declare tool functions as async def; LangGraph parallelizes independent calls automatically
  • [ ] Identify "must-be-sequential" tools (output of A feeds input of B) vs truly independent ones
  • [ ] Multi-provider scenario: parallel calls to different services → total latency = max(individual latencies), not sum

Tool result caching

  • [ ] Prioritize idempotent tools (same input → same output)
  • [ ] Set TTL per tool based on data freshness requirements; don't use one TTL for everything
  • [ ] Never cache write/side-effect tools
  • [ ] Production: use Redis instead of an in-memory dict for multi-instance cache sharing

Summary

Five core takeaways:

  1. Token cost is the most controllable cost: system prompt, tool schemas, conversation history — each can be measured and reduced without changing the model or architecture
  2. Latency measurement needs sufficient samples: 2-run results can be misleading (verbose was "faster" here); stabilize with 10+ samples and median values
  3. Model routing has hidden overhead: routing adds one LLM call per query; it only turns positive when > 40% of queries need no tools
  4. Parallel tool calls are the cleanest optimization: N independent calls → latency from N×t to t; LangGraph supports this natively with async tool functions
  5. Cache ROI depends on hit rate: below 30% hit rate, the complexity of caching outweighs the benefit; TTL design matters more than the cache implementation itself

Up next: Harness Engineering — Complete System — expanding from the five-element introduction to the full 8-layer framework, including the action space registry, permission budget system, and a complete threat model.


References


Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.

Find more useful knowledge and interesting products on my Homepage

Top comments (0)