DEV Community

David AMARA
David AMARA

Posted on

Per-agent GPU cost: what LangSmith can't tell you

Your AI agents are running. Your GPU bill arrives: $47,000 this month.

The CTO asks: "Which agent is responsible for what?"

You open LangSmith. It says your pricing agent used 18 million tokens. Helpful — but what does that cost in GPU?

The answer: you don't know. And neither does LangSmith.

The gap nobody talks about

Every agent observability tool — LangSmith, Arize Phoenix, Helicone, Datadog LLM Obs — counts the same thing: tokens. Prompt tokens in, completion tokens out, maybe a latency percentile.

But tokens are not cost.

The same 1,000 tokens on Llama-70B cost 14x more GPU than on Mistral-7B. One runs on 2× H100 ($7/hr). The other fits on a single L4 ($0.80/hr). Your token counter treats them as identical.

When you host your own LLMs — on GKE, on bare metal, in a colo — the cost isn't per-token. It's per-GPU-hour. Your GPUs are reserved 24/7 whether they process 10 requests or 10,000.

The question isn't "how many tokens" — it's "how many GPU-hours is this agent consuming, and at what rate?"

What a DSI actually needs to know

After deploying 50 agents on an on-prem GPU fleet, the questions are:

  1. Which agent costs the most in GPU this month?
  2. Can I cap an agent's spend before it blows the budget?
  3. Which agents use the expensive model when a cheaper one would work?
  4. If I migrate from Llama-70B to Mistral-7B, which agents break?
  5. Is there an agent that's running away right now?

No token counter answers these. You need a layer that sits between your agents and your LLMs — at the GPU level.

The missing layer: an LLM inference proxy

We built an OpenAI-compatible proxy that sits between any AI agent and any LLM server (vLLM, Ollama, TGI). It's transparent — your agents don't know it's there.

# Before
OPENAI_BASE_URL=http://vllm:8000/v1

# After — one URL change
OPENAI_BASE_URL=http://vibops-proxy:8004/v1
Enter fullscreen mode Exit fullscreen mode

Each agent adds one header:

X-VibOps-Agent-Id: pricing-agent-v2
X-VibOps-Team: supply-chain
Enter fullscreen mode Exit fullscreen mode

That's it. No SDK. No code change. Works with n8n, LangChain, CrewAI, Dify, or a raw curl.

What happens inside the proxy

Every request goes through 8 steps, all under 5ms overhead:

  1. Identify — who is this agent? (cached 60s)
  2. Budget check — has this agent exceeded its monthly limit? → 429
  3. Policy check — is this agent allowed to use this model? → 403
  4. Route — match model name to backend (vLLM, Ollama, TGI)
  5. Forward — transparent proxy, streaming supported
  6. Measure — tokens, latency, time-to-first-token
  7. Cost — GPU-hours × cluster rate, not token × price
  8. Log — async batch to PostgreSQL (non-blocking)

The result: a FinOps dashboard that looks like this:

Agent                    Model         GPU-hrs   Cost
supply-chain-optimizer   llama-70b     651h      $4,559
pricing-agent-v2         llama-70b     307h      $2,150
pricing-agent-v2         mistral-7b    181h        $218
marketing-content        llama-70b     132h        $923
rh-screening-bot         mistral-7b    226h        $271
Enter fullscreen mode Exit fullscreen mode

Now you can see that supply-chain-optimizer is 54% of your GPU budget — and it only uses the most expensive model.

Budget enforcement: the feature nobody else has

Set a monthly limit per agent:

"Set a $1,500/month budget on marketing-content-writer"
→ Budget created. Currently at 76% ($1,145 / $1,500).
  Alert at $1,200 (80%). Block at $1,500 (100%).
Enter fullscreen mode Exit fullscreen mode

When the agent hits the limit, the proxy returns HTTP 429. The agent stops consuming GPU. No human intervention needed.

Model policy: which agent gets which LLM

Not every agent needs a 70B model. Enforce it:

"RH agents can only use Mistral models"
→ Rule created: rh-* → allowed: mistral-*
  rh-onboarding-assistant will be blocked on Llama-70B (403).
Enter fullscreen mode Exit fullscreen mode

One glob pattern. Immediate enforcement. No code change on the agent side.

Dependency graph: impact analysis before you migrate

Before swapping a model:

"If we migrate Llama-70B, which agents are impacted?"

llama-3.1-70b
├── supply-chain-optimizer   100% dependent   $4,559/mo
├── pricing-agent-v2          41% dependent   $2,150/mo
├── marketing-content         32% dependent     $923/mo
└── rh-onboarding            100% dependent     $118/mo
    (already blocked by rh-* policy)

Total cost at risk: $7,750/month
Estimated saving if migrated to Mistral-7B: -$7,037 (-92%)
Enter fullscreen mode Exit fullscreen mode

What this is — and what it isn't

This is not an agent observability tool. It doesn't trace reasoning chains, version prompts, or evaluate hallucinations. LangSmith does that well.

This is the infrastructure control plane for your LLM fleet. It answers the question that nobody else can: how much does each agent cost in real GPU, and how do I control it?

The two are complementary. LangSmith tells you what the agent decided. VibOps tells you how much it cost to decide it.

Try it

The MCP server is open-source (MIT, 74 tools):

pip install git+https://github.com/VibOpsai/vibops-mcp.git
Enter fullscreen mode Exit fullscreen mode

Or add it to Claude Code:

claude mcp add vibops vibops-mcp \
  -e VIBOPS_URL=https://your-instance \
  -e VIBOPS_TOKEN=your-token
Enter fullscreen mode Exit fullscreen mode

GitHub: VibOpsai/vibops-mcp
Website: vibops.ai


Built by a team that got tired of explaining GPU bills to finance.

Top comments (1)

Collapse
 
alexshev profile image
Alex Shev

Per-agent cost attribution is the layer most teams skip until the bill makes it painful. I would want the cost record tied to the agent role, tool path, model, retry count, and task outcome. Otherwise the number is just infrastructure spend, not a signal you can use to improve the workflow.