<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David AMARA</title>
    <description>The latest articles on DEV Community by David AMARA (@david_amara_e9b61428737e0).</description>
    <link>https://dev.to/david_amara_e9b61428737e0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4006696%2F520dd1e7-518c-47d3-bf32-96b84061d555.png</url>
      <title>DEV Community: David AMARA</title>
      <link>https://dev.to/david_amara_e9b61428737e0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/david_amara_e9b61428737e0"/>
    <language>en</language>
    <item>
      <title>Per-agent GPU cost: what LangSmith can't tell you</title>
      <dc:creator>David AMARA</dc:creator>
      <pubDate>Sun, 28 Jun 2026 15:54:29 +0000</pubDate>
      <link>https://dev.to/david_amara_e9b61428737e0/per-agent-gpu-cost-what-langsmith-cant-tell-you-52fo</link>
      <guid>https://dev.to/david_amara_e9b61428737e0/per-agent-gpu-cost-what-langsmith-cant-tell-you-52fo</guid>
      <description>&lt;p&gt;Your AI agents are running. Your GPU bill arrives: &lt;strong&gt;$47,000 this month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The CTO asks: &lt;em&gt;"Which agent is responsible for what?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You open LangSmith. It says your pricing agent used 18 million tokens. Helpful — but what does that &lt;strong&gt;cost&lt;/strong&gt; in GPU?&lt;/p&gt;

&lt;p&gt;The answer: you don't know. And neither does LangSmith.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every agent observability tool — LangSmith, Arize Phoenix, Helicone, Datadog LLM Obs — counts the same thing: &lt;strong&gt;tokens&lt;/strong&gt;. Prompt tokens in, completion tokens out, maybe a latency percentile.&lt;/p&gt;

&lt;p&gt;But tokens are not cost.&lt;/p&gt;

&lt;p&gt;The same 1,000 tokens on Llama-70B cost &lt;strong&gt;14x more GPU&lt;/strong&gt; than on Mistral-7B. One runs on 2× H100 ($7/hr). The other fits on a single L4 ($0.80/hr). Your token counter treats them as identical.&lt;/p&gt;

&lt;p&gt;When you host your own LLMs — on GKE, on bare metal, in a colo — the cost isn't per-token. It's per-GPU-hour. Your GPUs are reserved 24/7 whether they process 10 requests or 10,000.&lt;/p&gt;

&lt;p&gt;The question isn't "how many tokens" — it's &lt;strong&gt;"how many GPU-hours is this agent consuming, and at what rate?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a DSI actually needs to know
&lt;/h2&gt;

&lt;p&gt;After deploying 50 agents on an on-prem GPU fleet, the questions are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which agent costs the most in GPU &lt;strong&gt;this month&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;Can I &lt;strong&gt;cap&lt;/strong&gt; an agent's spend before it blows the budget?&lt;/li&gt;
&lt;li&gt;Which agents use the expensive model when a cheaper one would work?&lt;/li&gt;
&lt;li&gt;If I migrate from Llama-70B to Mistral-7B, &lt;strong&gt;which agents break&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;Is there an agent that's &lt;strong&gt;running away&lt;/strong&gt; right now?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No token counter answers these. You need a layer that sits between your agents and your LLMs — at the GPU level.&lt;/p&gt;

&lt;h2&gt;
  
  
  The missing layer: an LLM inference proxy
&lt;/h2&gt;

&lt;p&gt;We built an OpenAI-compatible proxy that sits between any AI agent and any LLM server (vLLM, Ollama, TGI). It's transparent — your agents don't know it's there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;OPENAI_BASE_URL&lt;/span&gt;=&lt;span class="n"&gt;http&lt;/span&gt;://&lt;span class="n"&gt;vllm&lt;/span&gt;:&lt;span class="m"&gt;8000&lt;/span&gt;/&lt;span class="n"&gt;v1&lt;/span&gt;

&lt;span class="c"&gt;# After — one URL change
&lt;/span&gt;&lt;span class="n"&gt;OPENAI_BASE_URL&lt;/span&gt;=&lt;span class="n"&gt;http&lt;/span&gt;://&lt;span class="n"&gt;vibops&lt;/span&gt;-&lt;span class="n"&gt;proxy&lt;/span&gt;:&lt;span class="m"&gt;8004&lt;/span&gt;/&lt;span class="n"&gt;v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent adds one header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;X-VibOps-Agent-Id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pricing-agent-v2&lt;/span&gt;
&lt;span class="py"&gt;X-VibOps-Team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;supply-chain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No SDK. No code change. Works with n8n, LangChain, CrewAI, Dify, or a raw &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens inside the proxy
&lt;/h2&gt;

&lt;p&gt;Every request goes through 8 steps, all under 5ms overhead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt; — who is this agent? (cached 60s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget check&lt;/strong&gt; — has this agent exceeded its monthly limit? → 429&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy check&lt;/strong&gt; — is this agent allowed to use this model? → 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — match model name to backend (vLLM, Ollama, TGI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward&lt;/strong&gt; — transparent proxy, streaming supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure&lt;/strong&gt; — tokens, latency, time-to-first-token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — GPU-hours × cluster rate, not token × price&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log&lt;/strong&gt; — async batch to PostgreSQL (non-blocking)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result: a FinOps dashboard that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent                    Model         GPU-hrs   Cost
supply-chain-optimizer   llama-70b     651h      $4,559
pricing-agent-v2         llama-70b     307h      $2,150
pricing-agent-v2         mistral-7b    181h        $218
marketing-content        llama-70b     132h        $923
rh-screening-bot         mistral-7b    226h        $271
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can see that &lt;code&gt;supply-chain-optimizer&lt;/code&gt; is 54% of your GPU budget — and it only uses the most expensive model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget enforcement: the feature nobody else has
&lt;/h2&gt;

&lt;p&gt;Set a monthly limit per agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Set a $1,500/month budget on marketing-content-writer"
→ Budget created. Currently at 76% ($1,145 / $1,500).
  Alert at $1,200 (80%). Block at $1,500 (100%).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent hits the limit, the proxy returns HTTP 429. The agent stops consuming GPU. No human intervention needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model policy: which agent gets which LLM
&lt;/h2&gt;

&lt;p&gt;Not every agent needs a 70B model. Enforce it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"RH agents can only use Mistral models"
→ Rule created: rh-* → allowed: mistral-*
  rh-onboarding-assistant will be blocked on Llama-70B (403).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One glob pattern. Immediate enforcement. No code change on the agent side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependency graph: impact analysis before you migrate
&lt;/h2&gt;

&lt;p&gt;Before swapping a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"If we migrate Llama-70B, which agents are impacted?"

llama-3.1-70b
├── supply-chain-optimizer   100% dependent   $4,559/mo
├── pricing-agent-v2          41% dependent   $2,150/mo
├── marketing-content         32% dependent     $923/mo
└── rh-onboarding            100% dependent     $118/mo
    (already blocked by rh-* policy)

Total cost at risk: $7,750/month
Estimated saving if migrated to Mistral-7B: -$7,037 (-92%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What this is — and what it isn't
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; an agent observability tool. It doesn't trace reasoning chains, version prompts, or evaluate hallucinations. LangSmith does that well.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;infrastructure control plane&lt;/strong&gt; for your LLM fleet. It answers the question that nobody else can: &lt;em&gt;how much does each agent cost in real GPU, and how do I control it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The two are complementary. LangSmith tells you &lt;strong&gt;what&lt;/strong&gt; the agent decided. VibOps tells you &lt;strong&gt;how much it cost&lt;/strong&gt; to decide it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The MCP server is open-source (MIT, 74 tools):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/VibOpsai/vibops-mcp.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add it to Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add vibops vibops-mcp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VIBOPS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://your-instance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VIBOPS_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/VibOpsai/vibops-mcp" rel="noopener noreferrer"&gt;VibOpsai/vibops-mcp&lt;/a&gt;&lt;br&gt;
Website: &lt;a href="https://vibops.ai" rel="noopener noreferrer"&gt;vibops.ai&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by a team that got tired of explaining GPU bills to finance.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gpu</category>
      <category>finops</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
