<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ROHITH SARGUNAN</title>
    <description>The latest articles on DEV Community by ROHITH SARGUNAN (@rohith_sargunan_56d47729f).</description>
    <link>https://dev.to/rohith_sargunan_56d47729f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863361%2Fff9c40e8-1caa-40f5-aae6-ff8ea3edaeb9.jpg</url>
      <title>DEV Community: ROHITH SARGUNAN</title>
      <link>https://dev.to/rohith_sargunan_56d47729f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohith_sargunan_56d47729f"/>
    <language>en</language>
    <item>
      <title>The Hidden Cost of Running LLM Applications at Scale</title>
      <dc:creator>ROHITH SARGUNAN</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:11:43 +0000</pubDate>
      <link>https://dev.to/rohith_sargunan_56d47729f/the-hidden-cost-of-running-llm-applications-at-scale-56lg</link>
      <guid>https://dev.to/rohith_sargunan_56d47729f/the-hidden-cost-of-running-llm-applications-at-scale-56lg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1ic8tx5aqzvp6u2lnnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1ic8tx5aqzvp6u2lnnl.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everyone watches the model bill.&lt;/p&gt;

&lt;p&gt;You pick a model, check the pricing page, estimate your request volume, do the math. The number looks fine. You ship.&lt;/p&gt;

&lt;p&gt;Six months later the bill is three times what you expected and nobody can explain exactly where it went.&lt;/p&gt;

&lt;p&gt;I have been there. And after building multi-tenant LLM systems in production — serving enterprise clients across multiple services, two different LLM providers, an agentic orchestration layer, and a full retrieval pipeline — I can tell you exactly where the money goes.&lt;/p&gt;

&lt;p&gt;It is not the model. Not directly.&lt;/p&gt;

&lt;p&gt;It is five decisions you made before you ever called the model, most of which felt completely reasonable at the time.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. You are using the same model for every task
&lt;/h2&gt;

&lt;p&gt;This is the most common and most expensive mistake.&lt;/p&gt;

&lt;p&gt;When you start a new service you pick a model. It works. You move on. That model ends up handling every request type — simple lookups, complex reasoning, structured output, ambiguous multi-turn conversations — all running through the same expensive inference endpoint.&lt;/p&gt;

&lt;p&gt;We run two production services built on the same orchestration framework. One is a streaming Q&amp;amp;A chatbot over complex enterprise WMS documentation. Multi-turn, ambiguous queries, large documents, 20+ enterprise tenants. It needs a capable frontier model. It gets one.&lt;/p&gt;

&lt;p&gt;The other is a structured flow bot. Narrow scope. Well-defined process. Predictable input shape. It runs on &lt;strong&gt;Amazon Nova Micro&lt;/strong&gt;. A fraction of the cost. Faster. For what it actually does, it performs identically to the expensive alternative.&lt;/p&gt;

&lt;p&gt;Same tool registry. Same orchestrator. Same prompt templates. Different model injected at startup. The cost difference, multiplied across every request, every tenant, every month, is not small.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question most engineers never ask:&lt;/strong&gt; what is the cheapest model that reliably does this specific job?&lt;/p&gt;

&lt;p&gt;That question changes per use case. Per tenant. Sometimes per request type. The only way to act on it without a rewrite every time is to build the abstraction layer first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# Provider abstraction — same interface regardless of model or vendor
&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMBase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_completion_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_completion_response_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# OpenAI and Bedrock both implement this interface
&lt;/span&gt; &lt;span class="c1"&gt;# The orchestrator never knows which one it's talking to
&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# or "openai"
&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenaiOrchestratorAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Swapping Nova Micro for GPT-4o-mini is a config change. If you are tightly coupled to one provider it is a rewrite. That flexibility is what lets you make cost decisions without architectural consequences.&lt;/p&gt;
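
&lt;p&gt;A minimal runnable sketch of what that registry can look like (the class names and stub responses here are illustrative, not the production code):&lt;/p&gt;

```python
# Illustrative provider registry: model choice becomes a config value,
# not an architectural decision. Stub classes stand in for real clients.
class LLMBase:
    def agent_completion_response(self, request):
        raise NotImplementedError

class BedrockClient(LLMBase):
    def agent_completion_response(self, request):
        return f"bedrock:{request}"   # a real client would call the Bedrock API

class OpenAIClient(LLMBase):
    def agent_completion_response(self, request):
        return f"openai:{request}"    # a real client would call the OpenAI API

class LLM:
    _providers = {"bedrock": BedrockClient, "openai": OpenAIClient}

    @classmethod
    def get_provider(cls, name):
        return cls._providers[name]()

# Swapping providers is a one-line config change.
llm = LLM.get_provider("bedrock")
```

&lt;p&gt;The orchestrator takes an &lt;code&gt;LLMBase&lt;/code&gt; and never learns which vendor sits behind it.&lt;/p&gt;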


&lt;h2&gt;
  
  
  2. Your agentic loop has no ceiling
&lt;/h2&gt;

&lt;p&gt;A single LLM call has a predictable cost. An agentic loop does not.&lt;/p&gt;

&lt;p&gt;In a tool-use orchestrator the model calls a tool, gets a result, decides what to do next. Normal runs take two or three iterations. Edge cases can take more. Without a hard cap, one unexpected input becomes a runaway process.&lt;/p&gt;

&lt;p&gt;We hit this during testing. A tool was returning output the model did not know how to handle. No ceiling in place. The model called the same tool seven times trying to recover. It never did. The API timed out. The cost of that single test run was a useful reminder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OpenaiOrchestratorAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_loops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;
 &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_loops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_loops&lt;/span&gt; &lt;span class="c1"&gt;# Hard ceiling. Not optional.
&lt;/span&gt;
 &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_non_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
 &lt;span class="n"&gt;loop_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
 &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;loop_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_loops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="n"&gt;loop_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
 &lt;span class="c1"&gt;# ... call model, execute tools ...
&lt;/span&gt;
 &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maximum iteration loops reached&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten is a generous ceiling for most agentic workflows. If your agent consistently needs more than ten iterations to answer a question, that is a design problem, not a reason to raise the cap.&lt;/p&gt;

&lt;p&gt;The loop cap is your billing safety valve. Set it before you need it.&lt;/p&gt;
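
&lt;p&gt;The same pattern reduces to a few lines. In this sketch the model call is a stub you pass in; the exception name and return shapes are invented for illustration:&lt;/p&gt;

```python
# Minimal capped agent loop. The ceiling is enforced by the loop itself,
# so a confused model cannot iterate past it.
class MaxLoopsExceeded(Exception):
    pass

def run_agent(call_model, max_loops=10):
    for _ in range(max_loops):        # hard ceiling, not optional
        action, payload = call_model()
        if action == "final":
            return payload
        # otherwise: execute the requested tool and feed the result back
    raise MaxLoopsExceeded("Maximum iteration loops reached")

# A model stub that never recovers hits the ceiling, not the billing page.
def runaway_model():
    return ("tool", "same_tool_again")
```
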




&lt;h2&gt;
  
  
  3. Your retrieval is pulling too many chunks
&lt;/h2&gt;

&lt;p&gt;Every retrieval result is input tokens. Every input token costs money.&lt;/p&gt;

&lt;p&gt;If your knowledge base returns 10 chunks when 3 would answer the question, you are paying for 7 chunks' worth of unnecessary input tokens on every single request. Across every user, every tenant, every conversation turn.&lt;/p&gt;

&lt;p&gt;Most teams accept the default retrieval count because it is convenient. &lt;code&gt;numberOfResults&lt;/code&gt; gets set once during initial setup and never revisited.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# Bedrock KB search config
&lt;/span&gt; &lt;span class="n"&gt;search_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectorSearchConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numberOfResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Deliberate. Not default.
&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overrideSearchType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HYBRID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# HYBRID vs SEMANTIC per use case
&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tenant_filter&lt;/span&gt; &lt;span class="c1"&gt;# Hard tenant isolation
&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The right number depends on your chunk size, your query complexity, and your context window budget. For a narrow structured Q&amp;amp;A flow, 3 is often enough. For complex multi-document reasoning, 8 might be appropriate. The point is to have a reason for the number — not to accept whatever the SDK defaults to.&lt;/p&gt;

&lt;p&gt;Retrieval configuration is a cost decision. Treat it like one.&lt;/p&gt;
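
&lt;p&gt;The waste is easy to put a number on. A back-of-envelope sketch, with assumed figures for chunk size, traffic, and price (not measurements from our system):&lt;/p&gt;

```python
# Back-of-envelope cost of over-retrieving. All numbers are illustrative.
def monthly_retrieval_cost(chunks_per_query, tokens_per_chunk,
                           queries_per_month, price_per_1k_input_tokens):
    tokens = chunks_per_query * tokens_per_chunk * queries_per_month
    return tokens / 1000 * price_per_1k_input_tokens

# 10 chunks vs 3 chunks, 400-token chunks, 500k queries/month, $0.003/1k tokens
default_cost = monthly_retrieval_cost(10, 400, 500_000, 0.003)
tuned_cost = monthly_retrieval_cost(3, 400, 500_000, 0.003)
waste = default_cost - tuned_cost   # tokens the model reads and ignores
```

&lt;p&gt;With those assumptions the delta is thousands of dollars a month, for a single config value.&lt;/p&gt;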


&lt;h2&gt;
  
  
  4. Your tool schemas are carrying weight the model does not need
&lt;/h2&gt;

&lt;p&gt;Every tool you register with the LLM becomes part of the input on every call where that tool is available. Every field in every tool schema is tokens. This multiplies across every iteration of every loop of every request.&lt;/p&gt;

&lt;p&gt;The mistake is putting everything in the schema — including context that belongs to the system, not the model.&lt;/p&gt;

&lt;p&gt;Tenant identifiers. Knowledge base names. Session IDs. Internal routing parameters. None of these are decisions the LLM needs to make. They are facts the system already knows. Putting them in the model-facing schema means the model reasons over them, the schema grows, and the token count on every call goes up.&lt;/p&gt;

&lt;p&gt;The fix is to split tool parameters into two categories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge_qna&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the knowledge base and answer the question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
 &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="c1"&gt;# LLM provides: query
&lt;/span&gt; &lt;span class="c1"&gt;# System injects: kb_name, filter_config, session_id
&lt;/span&gt; &lt;span class="n"&gt;internal_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kb_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter_config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;knowledge_qna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kb_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filter_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM sees one parameter: &lt;code&gt;query&lt;/code&gt;. The service layer injects the rest at runtime. The model schema stays lean. The internal parameters never appear in the tool schema that gets sent to the model on every call.&lt;/p&gt;

&lt;p&gt;At 15 tools with bloated schemas, this is not a minor optimisation. It compounds.&lt;/p&gt;
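
&lt;p&gt;One way to wire that split at the execution layer, sketched with an invented &lt;code&gt;SYSTEM_CONTEXT&lt;/code&gt; and a simplified registry shape:&lt;/p&gt;

```python
# Sketch of injecting system-owned parameters at execution time.
# The model-facing schema carries only "query"; the rest never reaches the LLM.
SYSTEM_CONTEXT = {
    "kb_name": "wms_docs",              # illustrative values
    "filter_config": {"tenant": "acme"},
    "session_id": "abc123",
}

def execute_tool(fn, llm_args, internal_params, system_context):
    # Merge what the model decided with what the system already knows.
    injected = {name: system_context[name] for name in internal_params}
    return fn(**llm_args, **injected)

def knowledge_qna(query, kb_name, filter_config, session_id):
    return f"searched {kb_name} for {query!r} as {session_id}"
```
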




&lt;h2&gt;
  
  
  5. You are synthesising when you don't need to
&lt;/h2&gt;

&lt;p&gt;The default agentic pattern is: call a tool → get a result → pass the result back to the LLM → let the model compose a final answer. This is correct when the LLM needs to reason over the result, combine it with other context, or explain it to the user.&lt;/p&gt;

&lt;p&gt;It is not correct when the tool already generated the final output.&lt;/p&gt;

&lt;p&gt;If a tool produces a deterministic result — a formatted document, a structured XML payload, a direct lookup value — routing that back through the LLM for synthesis costs tokens, adds latency, and introduces a layer of unpredictability you do not need. The model might rephrase it. It might add hedging language. It might change the format. None of that is useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modify_label_xml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Modify the label XML template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
 &lt;span class="n"&gt;direct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Result is the response. Bypasses LLM synthesis.
&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;labelgen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;modify_label_xml&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
 &lt;span class="c1"&gt;# Tool generates the final output directly
&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;updated_xml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="c1"&gt;# In the orchestrator — direct tools short-circuit the loop
&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;direct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Done. No synthesis step.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;direct&lt;/code&gt; flag tells the orchestrator that the tool result is the response. The loop stops. No additional LLM call. No synthesis tokens. The result is exactly what the tool produced.&lt;/p&gt;

&lt;p&gt;Not every tool should be direct. But every tool should have an explicit answer to the question: does the LLM need to do anything with this result, or is the result already the answer?&lt;/p&gt;
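
&lt;p&gt;For completeness, a runnable reduction of that dispatch decision (the &lt;code&gt;ToolEntry&lt;/code&gt; shape is illustrative, not our actual registry):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

# Sketch of a dispatch step that honours a per-tool "direct" flag.
@dataclass
class ToolEntry:
    fn: Callable
    direct: bool = False

def dispatch(tool_entry, args):
    result = tool_entry.fn(**args)
    if tool_entry.direct:
        # Result IS the response: skip the synthesis call entirely.
        return ("final", str(result))
    # Otherwise feed the result back to the model for another iteration.
    return ("continue", str(result))
```
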




&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;None of these are advanced optimisations. They are decisions that look small in isolation and expensive in aggregate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Cost impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Using a frontier model for every task&lt;/td&gt;
&lt;td&gt;5-20x per request vs a smaller model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No loop cap&lt;/td&gt;
&lt;td&gt;Unbounded. One edge case can cost more than a day of normal traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default retrieval count&lt;/td&gt;
&lt;td&gt;2-3x token waste on every retrieval call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bloated tool schemas&lt;/td&gt;
&lt;td&gt;Multiplicative: every extra schema token, across every tool, on every call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unnecessary synthesis&lt;/td&gt;
&lt;td&gt;One extra LLM call per tool invocation that didn't need it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The engineers whose AI systems stay affordable at scale are not the ones who picked the cheapest model. They are the ones who made deliberate decisions about all five of these things before the bill told them they had to.&lt;/p&gt;

&lt;p&gt;Cost is not something you optimise later. By the time you are optimising it reactively, you have already explained the numbers to someone you did not want to.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this was useful, I write about building production AI systems — RAG, agentic workflows, multi-tenant architecture, and the engineering decisions that don't show up in tutorials. Follow along.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agenticai</category>
      <category>llm</category>
      <category>production</category>
    </item>
  </channel>
</rss>
