Introduction
At some point in an AI company's growth, the GPU bill stops making sense, and you're looking at a cluster that ran at 3 a.m. for a model that never shipped.
That's the bill that eventually lands on someone's desk, and the first instinct is a cleanup: identify waste, kill orphaned resources. That approach worked when cloud spend drifted slowly enough for a monthly review to keep up, but by 2025, AI infrastructure spending was growing 166% year over year.
A job runs, and the bill for it arrives two weeks later. By then, the same misconfigured job has run again and again. Bill review turns into historical reconstruction: what was the job supposed to do, and who approved it? By the time anyone asks, the people who could answer have moved on to the next experiment.
Why AI Costs Break Normal Cost Logic
A standard cloud bill is predictable, because you spend more when you do more. AI workloads cost the same whether they're working or idle, and an idle GPU doesn't throw alerts the way a failed process does; it just runs, or rather doesn't, at full price. The costs build in the background while the dashboards stay quiet.
The Bill Behaves Less Predictably
When GPUs are involved, you can run the same cluster for two weeks with a different job schedule and get a different bill each time. GPU infrastructure is 5-10x more expensive than standard compute, so calling the gap between those two bills "noticeable" would be putting it mildly.
Inference is the major cost driver in AI workflows: Gartner puts inference at 55% of AI-optimized IaaS spending by 2026 and expects it to reach 65% by 2029. Unlike training jobs, inference has no natural shutdown point, and as it becomes the majority of spend, unmanaged cost per query multiplies the bill with every new user added.
Low GPU Usage Gets Expensive Fast
The AI Infrastructure Alliance's 2024 survey found that only 7% of organizations exceed 85% GPU usage at peak, while 53% sit between 51% and 70%, and 15% never break 50%. Most of the idle capacity comes from clusters sized for worst-case demand that never arrives, and from training jobs that have finished but keep their environments active in case someone needs them again soon.
H100 capacity runs $2–4 per GPU-hour, billed whether the cluster is active or not. At 70% usage, an 8-GPU cluster carries roughly $3,700 a month in idle cost on a specialized provider, and around $7,000 on a major cloud.
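A quick back-of-the-envelope check of that math, assuming an 8-GPU cluster billed for every hour of the month and 30% of GPU-hours sitting idle; the per-hour rates below are illustrative points within the $2–4 range, not quotes from any specific provider:

```python
# Back-of-the-envelope idle cost for an 8-GPU H100 cluster.
# Rates are illustrative; real pricing varies by provider and commitment.
GPUS = 8
HOURS_PER_MONTH = 24 * 30          # 720 billable hours per GPU
USAGE = 0.70                       # 70% of GPU-hours do useful work

for label, rate_per_gpu_hour in [("specialized provider", 2.15), ("major cloud", 4.00)]:
    total_gpu_hours = GPUS * HOURS_PER_MONTH
    idle_gpu_hours = total_gpu_hours * (1 - USAGE)
    idle_cost = idle_gpu_hours * rate_per_gpu_hour
    print(f"{label}: {idle_gpu_hours:.0f} idle GPU-hours, about ${idle_cost:,.0f}/month")

# specialized provider: 1728 idle GPU-hours, about $3,715/month
# major cloud: 1728 idle GPU-hours, about $6,912/month
```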
Where The Money Leaks
For nine years, cloud waste has been the top optimization priority, and for five of those years it was actually declining. Flexera's 2026 State of the Cloud Report shows that this year, cloud waste grew from 27% to 29% of spend, with AI workloads as the main driver. The table below runs through the most common cloud waste categories.
The next model version is often already in training while the environments from the previous one are still running. Shutdown schedules and TTLs would help, but configuring them is rarely at the top of anyone's priority list. According to Harness's FinOps in Focus 2025 report, 68% of developers don't have fully automated cost-saving practices in place, and 86% say it takes at least a week to find idle and orphaned resources and act on them.
The State of FinOps 2025 report states that 63% of organizations are actively managing AI spend; however, FinOps in Focus reports that only 39% of developers have full visibility into unused resources.
This shows that while cost visibility has grown, most organizations still haven't built the attribution layer that lets them act on it. Without attribution, cost visibility is just watching the dashboard more closely and wondering why the bill doesn't move, which is a long way from a traceable, controlled bill.
What FinOps Looks Like in Practice
Before spinning up a training job, engineers can check the history of similar jobs on the same model at roughly the same data volume and estimate its cost before committing resources. If a running job is tracking 3x over the estimate, it can be killed mid-run before it blows up the bill.
This is how FinOps works: engineers see the financial consequences of their decisions in real time. Spend is traceable at the moment it's created, oversized jobs can be stopped early, and the final bill stops being a surprise.
Per-job attribution makes this possible, and it has to exist before any job runs. Without it, the next engineer deciding whether to rerun a job has no way of knowing that the last one cost $800, or that three nearly identical runs already happened this month.
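A minimal sketch of what that pre-run check could look like in a job launcher, assuming per-job costs from previous tagged runs are already queryable; the history format, the similarity window, and the 3x kill threshold are illustrative, not any particular tool's API:

```python
import statistics

OVERSPEND_FACTOR = 3.0  # kill threshold from the example above; tune per team


def estimate_job_cost(model: str, dataset_rows: int, history: list[dict]) -> float:
    """Estimate cost from prior runs of the same model at a similar data volume.

    `history` is assumed to come from your attribution store: one dict per past
    run with 'model', 'dataset_rows', and 'cost_usd' fields.
    """
    similar = [
        run["cost_usd"]
        for run in history
        if run["model"] == model
        and 0.5 <= run["dataset_rows"] / max(dataset_rows, 1) <= 2.0
    ]
    if not similar:
        raise ValueError("No comparable runs; require a manual cost review instead.")
    return statistics.median(similar)


def should_kill(spend_so_far: float, estimate: float) -> bool:
    """Flag a running job once its spend tracks past the kill threshold."""
    return spend_so_far > OVERSPEND_FACTOR * estimate
```

On launch, the estimate can be shown right next to the submit action; a periodic check then compares live spend against it and alerts the job owner before anything gets killed.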
Start with idle infrastructure
Non-production environments are the easiest place to start. They don't serve users, shutting them down automatically won't affect product performance, and most platforms support it natively. The reason it doesn't happen: restarting a GPU environment takes time, and the engineer who ran the job expects to come back to it.
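As one example of what that automation can look like on AWS, here is a small script meant to run on a schedule (cron, EventBridge) that stops running GPU instances in non-production environments overnight. The tag keys and values are assumptions; adapt them to whatever your environments actually carry.

```python
# Nightly shutdown of non-production GPU instances.
# Tag names and values here are assumptions, not a universal convention.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def stop_idle_dev_gpus() -> list[str]:
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:env", "Values": ["dev", "staging"]},
            {"Name": "tag:workload", "Values": ["gpu-training"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        # Stopped (not terminated) instances keep their EBS volumes,
        # so the engineer can resume the environment in the morning.
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```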
Reduce the cost of live workloads
In many GenAI workloads, inference can account for 80-90% of total spend. If every request is routed to the most expensive model path by default, cost per query stays high whether or not the task needs that level of reasoning. We ran into exactly that with one of our clients: simple lookups were taking the same expensive path as the work that actually needed the model.
Tracking What Runs
Enforce tagging at the pipeline level: model version and experiment ID as required fields. For resources already running without tags, match costs using pipeline logs and timestamps; historical spend without attribution is largely unrecoverable, and the clock starts when instrumentation goes in.
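One way to enforce this is a submission-time check that rejects any job missing the required attribution tags. The two fields above are the core; the extra keys and the helper shown here are illustrative additions, not a prescribed schema:

```python
REQUIRED_TAGS = ("model_version", "experiment_id", "team", "owner")


def validate_job_tags(tags: dict[str, str]) -> None:
    """Reject a job submission unless every required attribution tag is set."""
    missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
    if missing:
        raise ValueError(
            f"Job rejected: missing required cost-attribution tags {missing}. "
            "Every run must be traceable to a model version and experiment."
        )


# Called from the pipeline's submit step, before any resource is provisioned:
validate_job_tags({"model_version": "v2.3", "experiment_id": "exp-118",
                   "team": "nlp", "owner": "j.doe"})
```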
ClearML, Weights & Biases, and cloud-native tools like AWS Cost Explorer surface per-job cost data accurately once that metadata is consistently in place. The metrics worth tracking: cost per training run, GPU usage by job, and time-to-detection for idle resources.
How this played out in real systems
Neither of these cases started as a cost project: the cost results showed up because the underlying infrastructure problem got fixed. When the infrastructure stops working against itself, the bill reflects it.
400,000 customers, one infrastructure standard
The original brief was compliance: PCI DSS, ISO, and HIPAA across every AWS region. Meeting those standards required every region to be built on identical configurations.
SciForce moved the client's infrastructure to a single repeatable standard using Terraform and Terragrunt, so every region was built and managed from the same source. Deployments were automated through a Jenkins-to-Concourse transition, and Wavefront monitoring was added to catch deviations early.
As a result, configuration and migration time dropped by 52%, and deployments onto new compute resources became 63% faster. Once the infrastructure stopped drifting from region to region, the cost picture got easier to control, and total infrastructure TCO improved by 50%.
Query routing decision that cut AI processing costs by 39%
The client's AI assistant was answering every question the same way: routing all queries through the LLM regardless of what was being asked. Pulling a sales figure for last quarter costs roughly the same as summarizing six months of trend data if both go through GPT-4. One of those queries needs the model. The other doesn't.
SciForce built a hybrid processing layer that separated the two. Simple lookups, such as employee stats and sales figures, went through vector search and rule-based retrieval. Summarization and trend analysis went to the LLM. In practice, if a query was pulling a specific number from a known source, it didn’t need the model. If it needed the model to think, it went there.
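A stripped-down sketch of that routing decision, with the query classifier and both back-ends reduced to placeholders rather than the client's actual implementation:

```python
# Simplified hybrid routing: deterministic lookups go to vector search plus
# rules, open-ended questions go to the LLM. Everything here is illustrative.
LOOKUP_KEYWORDS = ("how many", "what was", "sales figure", "headcount")


def is_simple_lookup(query: str) -> bool:
    """Cheap heuristic stand-in for the real query classifier."""
    q = query.lower()
    return any(kw in q for kw in LOOKUP_KEYWORDS)


def vector_search_with_rules(query: str) -> str:
    # Placeholder: in production this hits the vector index and rule engine.
    return f"[retrieved answer for: {query}]"


def call_llm(query: str, model: str) -> str:
    # Placeholder: in production this calls the hosted model API.
    return f"[{model} answer for: {query}]"


def answer(query: str) -> str:
    if is_simple_lookup(query):
        # Pulls a specific number from a known source; no model call needed.
        return vector_search_with_rules(query)
    # Summarization and trend analysis actually need the model to reason.
    return call_llm(query, model="gpt-4o-mini")
```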
After assessing seven models on speed, cost, and response quality, SciForce chose GPT-4o-mini for the LLM-routed queries because it held up on quality at a fraction of the cost of larger models. Guardrails were added to filter queries and validate responses, reducing hallucinations and costs.
The financial result: LLM usage dropped by up to 46%, and the cost of AI query processing fell by 39%. Query routing also improved overall tool performance: simple lookups are now processed 32% faster, and answers contain 68% fewer hallucinations.
Conclusion
The bill arrives. You can't explain it. And because you can't explain this one, you can't prevent the same mistakes from reappearing next month.
FinOps breaks this loop by putting a price tag on each job at provisioning time. Attribution lets you predict a job's cost by comparing it to similar past jobs before committing to it. If a job is already running but overspending, you can catch it early and stop it before it compounds the bill.
Which training job drove last month's GPU spend? If that takes more than a few minutes to answer, the attribution layer isn't there yet. SciForce can help build it.



