Originally published on AI Tech Connect.
What you need to know

Cheaper tokens, bigger bills. Per-token inference costs have fallen roughly 1,000x over three years for comparable capability (a widely reported estimate). Usage-based pricing means the savings rarely reach the invoice: demand grows to fill the headroom.

Inference is the spend. It reportedly consumes around 80–90% of total compute dollars over a model's production life; one estimate puts inference at roughly $15–20 billion for every $1 billion spent on training.

Utilisation is the elephant. A widely cited figure puts average GPU utilisation near 5%, framing a roughly $401 billion idle-infrastructure problem. Most deployed capacity simply sits there, metered and idle.

Hardware keeps moving. New A5X bare-metal instances on NVIDIA Vera Rubin NVL72 rack-scale systems…
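The cost arithmetic behind the points above is simple enough to sketch. The Python below is a minimal illustration under stated assumptions, not the article's own model: the function names and the token prices and volumes are hypothetical, chosen only to show how a 1,000x price cut can still yield a larger usage-based bill, why roughly 5% utilisation multiplies the effective cost of useful GPU time by about 20, and how an inference-to-training spend ratio converts into a share of total spend.

```python
def monthly_bill(price_per_million_tokens: float, tokens_millions: float) -> float:
    """Usage-based bill: unit price times volume (hypothetical helper)."""
    return price_per_million_tokens * tokens_millions


def effective_cost_per_busy_hour(list_price_per_hour: float, utilisation: float) -> float:
    """Cost of one hour of useful GPU work when the GPU is busy only a
    fraction `utilisation` of the time (0 < utilisation <= 1)."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return list_price_per_hour / utilisation


def inference_share(inference_per_training_dollar: float) -> float:
    """Fraction of combined training + inference spend that goes to
    inference, given an inference-to-training spend ratio r: r / (r + 1)."""
    r = inference_per_training_dollar
    return r / (r + 1)


if __name__ == "__main__":
    # Hypothetical figures: unit price falls 1,000x (60 -> 0.06),
    # while usage grows 2,000x (1,000 -> 2,000,000 million tokens).
    before = monthly_bill(price_per_million_tokens=60.0, tokens_millions=1_000)
    after = monthly_bill(price_per_million_tokens=0.06, tokens_millions=2_000_000)
    print(f"bill before: ${before:,.0f}, after: ${after:,.0f}")

    # At ~5% average utilisation, each busy hour costs ~20x the list rate
    # (list price normalised to 1.0 per hour for illustration).
    print(effective_cost_per_busy_hour(list_price_per_hour=1.0, utilisation=0.05))

    # The "$15-20B of inference per $1B of training" estimate as a share.
    print(f"{inference_share(15):.1%} to {inference_share(20):.1%}")
```

With these made-up numbers the bill doubles even though the unit price fell 1,000x, because volume grew faster than price fell; the same division explains why idle capacity dominates the economics, since every unused hour is paid for by the hours that do useful work.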