Your AI agents are running. Your GPU bill arrives: $47,000 this month.
The CTO asks: "Which agent is responsible for what?"
You open LangSmith. It says your pricing agent used 18 million tokens. Helpful — but what does that cost in GPU?
The answer: you don't know. And neither does LangSmith.
The gap nobody talks about
Every agent observability tool — LangSmith, Arize Phoenix, Helicone, Datadog LLM Obs — counts the same thing: tokens. Prompt tokens in, completion tokens out, maybe a latency percentile.
But tokens are not cost.
The same 1,000 tokens on Llama-70B cost 14x more GPU than on Mistral-7B. One runs on 2× H100 ($7/hr). The other fits on a single L4 ($0.80/hr). Your token counter treats them as identical.
When you host your own LLMs — on GKE, on bare metal, in a colo — the cost isn't per-token. It's per-GPU-hour. Your GPUs are reserved 24/7 whether they process 10 requests or 10,000.
The question isn't "how many tokens" — it's "how many GPU-hours is this agent consuming, and at what rate?"
What a DSI actually needs to know
After deploying 50 agents on an on-prem GPU fleet, the questions are:
- Which agent costs the most in GPU this month?
- Can I cap an agent's spend before it blows the budget?
- Which agents use the expensive model when a cheaper one would work?
- If I migrate from Llama-70B to Mistral-7B, which agents break?
- Is there an agent that's running away right now?
No token counter answers these. You need a layer that sits between your agents and your LLMs — at the GPU level.
The missing layer: an LLM inference proxy
We built an OpenAI-compatible proxy that sits between any AI agent and any LLM server (vLLM, Ollama, TGI). It's transparent — your agents don't know it's there.
# Before
OPENAI_BASE_URL=http://vllm:8000/v1
# After — one URL change
OPENAI_BASE_URL=http://vibops-proxy:8004/v1
Each agent adds one header:
X-VibOps-Agent-Id: pricing-agent-v2
X-VibOps-Team: supply-chain
That's it. No SDK. No code change. Works with n8n, LangChain, CrewAI, Dify, or a raw curl.
What happens inside the proxy
Every request goes through 8 steps, all under 5ms overhead:
- Identify — who is this agent? (cached 60s)
- Budget check — has this agent exceeded its monthly limit? → 429
- Policy check — is this agent allowed to use this model? → 403
- Route — match model name to backend (vLLM, Ollama, TGI)
- Forward — transparent proxy, streaming supported
- Measure — tokens, latency, time-to-first-token
- Cost — GPU-hours × cluster rate, not token × price
- Log — async batch to PostgreSQL (non-blocking)
The result: a FinOps dashboard that looks like this:
Agent Model GPU-hrs Cost
supply-chain-optimizer llama-70b 651h $4,559
pricing-agent-v2 llama-70b 307h $2,150
pricing-agent-v2 mistral-7b 181h $218
marketing-content llama-70b 132h $923
rh-screening-bot mistral-7b 226h $271
Now you can see that supply-chain-optimizer is 54% of your GPU budget — and it only uses the most expensive model.
Budget enforcement: the feature nobody else has
Set a monthly limit per agent:
"Set a $1,500/month budget on marketing-content-writer"
→ Budget created. Currently at 76% ($1,145 / $1,500).
Alert at $1,200 (80%). Block at $1,500 (100%).
When the agent hits the limit, the proxy returns HTTP 429. The agent stops consuming GPU. No human intervention needed.
Model policy: which agent gets which LLM
Not every agent needs a 70B model. Enforce it:
"RH agents can only use Mistral models"
→ Rule created: rh-* → allowed: mistral-*
rh-onboarding-assistant will be blocked on Llama-70B (403).
One glob pattern. Immediate enforcement. No code change on the agent side.
Dependency graph: impact analysis before you migrate
Before swapping a model:
"If we migrate Llama-70B, which agents are impacted?"
llama-3.1-70b
├── supply-chain-optimizer 100% dependent $4,559/mo
├── pricing-agent-v2 41% dependent $2,150/mo
├── marketing-content 32% dependent $923/mo
└── rh-onboarding 100% dependent $118/mo
(already blocked by rh-* policy)
Total cost at risk: $7,750/month
Estimated saving if migrated to Mistral-7B: -$7,037 (-92%)
What this is — and what it isn't
This is not an agent observability tool. It doesn't trace reasoning chains, version prompts, or evaluate hallucinations. LangSmith does that well.
This is the infrastructure control plane for your LLM fleet. It answers the question that nobody else can: how much does each agent cost in real GPU, and how do I control it?
The two are complementary. LangSmith tells you what the agent decided. VibOps tells you how much it cost to decide it.
Try it
The MCP server is open-source (MIT, 74 tools):
pip install git+https://github.com/VibOpsai/vibops-mcp.git
Or add it to Claude Code:
claude mcp add vibops vibops-mcp \
-e VIBOPS_URL=https://your-instance \
-e VIBOPS_TOKEN=your-token
GitHub: VibOpsai/vibops-mcp
Website: vibops.ai
Built by a team that got tired of explaining GPU bills to finance.
Top comments (1)
Per-agent cost attribution is the layer most teams skip until the bill makes it painful. I would want the cost record tied to the agent role, tool path, model, retry count, and task outcome. Otherwise the number is just infrastructure spend, not a signal you can use to improve the workflow.