Christopher Maher
The $0 Problem: Why Every Tool Says Your On-Prem Inference is Free

If you run LLMs on your own hardware, every cost tracking tool in the ecosystem has the same answer for what it costs: $0.

OpenCost sees your GPU pods but has no concept of tokens. LiteLLM tracks tokens per user but hardcodes on-prem cost to zero. Langfuse traces requests but only prices cloud APIs. The FinOps Foundation's own working group explicitly says on-premises AI cost is "outside the scope."

Meanwhile, your GPUs cost real money. The H100s draw 700 watts each. Your electricity bill is real. The three-year amortization on $280K of hardware is real. But no tool computes:

```
true cost per token = (hourly hardware amortization + electricity rate × GPU power draw) / tokens per hour
```
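In code, that calculation is only a few lines. A minimal sketch (the input numbers below are illustrative, not measurements from any specific deployment):

```python
def cost_per_million_tokens(
    purchase_price_usd: float,   # total hardware cost
    amortization_years: float,   # straight-line amortization window
    rate_per_kwh: float,         # electricity price
    power_draw_watts: float,     # average GPU power draw (e.g. from DCGM)
    tokens_per_hour: float,      # measured inference throughput
) -> float:
    # Hardware cost spread evenly over every hour of the amortization window
    amortization_per_hour = purchase_price_usd / (amortization_years * 365 * 24)
    # Electricity cost for one hour at the observed power draw
    electricity_per_hour = rate_per_kwh * (power_draw_watts / 1000)
    hourly_cost = amortization_per_hour + electricity_per_hour
    return hourly_cost / tokens_per_hour * 1_000_000

# Illustrative inputs: $960 of GPUs over 3 years, $0.08/kWh, 200 W draw
print(round(cost_per_million_tokens(960, 3, 0.08, 200, 130_000), 2))
```

The key property is that the fixed amortization term dominates at low throughput and the per-token cost collapses as throughput rises.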

We built InferCost to fix this.

What InferCost does

InferCost is an open-source Kubernetes operator (Apache 2.0) that computes the true cost of running AI inference on your own hardware. It's a single controller pod: no database, no UI to host. It plugs into the Prometheus and Grafana you already run.

You declare your hardware economics in a CRD:

```yaml
apiVersion: finops.infercost.ai/v1alpha1
kind: CostProfile
metadata:
  name: gpu-cluster
spec:
  hardware:
    gpuModel: "NVIDIA GeForce RTX 5060 Ti"
    gpuCount: 2
    purchasePriceUSD: 960
    amortizationYears: 3
  electricity:
    ratePerKWh: 0.08
    pueFactor: 1.0
```

InferCost reads real-time GPU power draw from DCGM, scrapes token counts from your inference engine (llama.cpp, vLLM), does the math, and tells you what your inference actually costs. Per model. Per team. Per token.
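The scraping step boils down to parsing Prometheus text exposition. `DCGM_FI_DEV_POWER_USAGE` is the DCGM exporter's per-GPU power gauge; the token counter name depends on your engine (vLLM, for example, exposes counters such as `vllm:generation_tokens_total`). A rough sketch of that step, using a hardcoded sample scrape for illustration (metric names and labels are assumptions that vary by exporter version):

```python
import re

# Sample Prometheus text exposition, as a scrape of DCGM / vLLM might return
SCRAPE = """\
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 178.4
DCGM_FI_DEV_POWER_USAGE{gpu="1"} 182.1
vllm:generation_tokens_total{model="qwen3-32b"} 4815162342
"""

def parse_metric(text: str, name: str) -> list[float]:
    """Return every sample value for a metric in a text-format scrape."""
    pattern = re.compile(
        rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)\s*$", re.M
    )
    return [float(m.group(1)) for m in pattern.finditer(text)]

# Sum per-GPU gauges to get total power draw for the node
total_power_watts = sum(parse_metric(SCRAPE, "DCGM_FI_DEV_POWER_USAGE"))
print(total_power_watts)
```

In practice you would query these series over time windows rather than one scrape, but the shape of the data is the same.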

What we found on real hardware

We deployed InferCost on a homelab running Qwen3-32B on 2x RTX 5060 Ti GPUs. Here are the real numbers:

  • Hourly infrastructure cost: $0.053 (amortization + electricity at actual GPU power draw)
  • Cost per million tokens: $0.41 under sustained load
  • Monthly projected: $38
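As a sanity check, the hourly figure decomposes into amortization plus electricity. Note the ~206 W combined average draw below is back-derived from the published hourly cost, not a separately published measurement, and the 720-hour month is one of several monthly conventions:

```python
HOURS_PER_MONTH = 720          # 30-day month; conventions vary
purchase_price = 960           # 2x RTX 5060 Ti, from the CostProfile above
amortization_years = 3
rate_per_kwh = 0.08
avg_draw_kw = 0.206            # ASSUMPTION: ~206 W combined average draw

amortization_hr = purchase_price / (amortization_years * 365 * 24)
electricity_hr = rate_per_kwh * avg_draw_kw
hourly = amortization_hr + electricity_hr
print(f"${hourly:.3f}/hr, ${hourly * HOURS_PER_MONTH:.0f}/mo")
```

Under those assumptions the amortization term (~$0.037/hr) is more than twice the electricity term, which is why utilization matters so much below.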

Then we compared against cloud APIs (verified pricing as of March 2026):

| Provider | Cloud cost ($/1M tokens) | On-prem cost ($/1M tokens) | Savings |
| --- | --- | --- | --- |
| Claude Opus 4.6 | $9.82 | $0.62 | 94% |
| GPT-5.4 | $5.83 | $0.62 | 89% |
| Gemini 2.5 Pro | $3.84 | $0.62 | 84% |
| GPT-5.4-nano | $0.41 | $0.62 | Cloud 24% cheaper |

That last row matters. When the cheapest cloud model is actually cheaper than your hardware, InferCost tells you. The point is not to prove on-prem always wins. The point is to give you the real numbers so you can decide.

A note on how we calculate cost

The $28/month on-prem number is your total infrastructure cost: hardware amortization plus electricity, running 24/7. Your GPUs cost money whether or not they're serving requests. The $0.41 per million tokens is the marginal cost during active inference (what each token costs when the system is busy).

The savings comparison uses total infrastructure cost because that's the honest number. If your GPUs sit idle half the time, that idle time still costs you. This is the same logic as any hardware TCO calculation: you amortize the full purchase price, not just the hours you used it.

This means your actual savings percentage depends on utilization. At high utilization (GPUs busy most of the day), the savings are dramatic. At low utilization, the math shifts toward cloud APIs for cheap models. InferCost shows you both realities so you can make the right call for each workload.
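One way to operationalize that: compute the monthly token volume at which cloud spend crosses your fixed infrastructure cost. A sketch using the $28/month total-cost figure from the note above and the per-million-token cloud prices from the comparison table:

```python
fixed_monthly = 28.0  # total on-prem infrastructure cost per month

# Cloud price per 1M tokens, from the comparison table
cloud_prices = {
    "claude-opus-4-6": 9.82,
    "gpt-5.4": 5.83,
    "gemini-2.5-pro": 3.84,
    "gpt-5.4-nano": 0.41,
}

for model, price in cloud_prices.items():
    # Millions of tokens/month where cloud spend equals the fixed on-prem cost
    breakeven_m = fixed_monthly / price
    print(f"{model}: on-prem wins above {breakeven_m:.1f}M tokens/month")
```

The spread is wide: a few million tokens a month already beats Opus-class pricing, while beating a nano-class model takes tens of millions, and your hardware's own throughput ceiling caps how far on-prem can stretch.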

The CLI

```shell
$ brew install defilantech/tap/infercost
$ infercost compare --monthly

PROVIDER    MODEL              CLOUD/MONTH  ON-PREM/MONTH  SAVINGS/MONTH
Anthropic   claude-opus-4-6    $409         $28            $381 (93%)
OpenAI      gpt-5.4            $242         $28            $214 (88%)
Google      gemini-2.5-pro     $159         $28            $131 (82%)
Google      gemini-2.5-flash   $40          $28            $12 (30%)
OpenAI      gpt-5.4-nano       $20          $28            -$8 (cloud cheaper)
```

What InferCost is NOT

It is not a cloud API cost tracker. If you want to monitor your OpenAI bill, tools like Helicone and LangSmith do that well. InferCost solves a different problem: the cost of running inference on hardware you own, where the economics involve amortization schedules and electricity bills, not API invoices.

It is also not locked to any specific inference stack. It works with LLMKube, but also with any Kubernetes deployment that runs llama.cpp or vLLM with Prometheus metrics exposed.

Why open source

The organizations that need on-prem cost tracking the most (healthcare, defense, finance, government) are the same ones that can't send cost data to a SaaS dashboard. They chose on-prem for data sovereignty. A cost tracking tool that phones home defeats the purpose.

InferCost runs entirely in your cluster. Your cost data never leaves your infrastructure. Apache 2.0, no telemetry, no cloud dependency.

Get started

```shell
# Install the CLI
brew install defilantech/tap/infercost

# Or deploy via Helm
helm repo add infercost https://defilantech.github.io/infercost
helm install infercost infercost/infercost \
  --set dcgm.endpoint=http://dcgm-exporter:9400/metrics
```

If you're running inference on your own hardware and want to know what it actually costs, give it a try. Issues and PRs welcome.
