As AI adoption grows, keeping up with the technology isn't your only concern; your budget also needs a watchful eye, especially since inference workloads can scale quickly, and costs scale with them. Your AI inference bill comes down to three things: the hardware you use, the scale you need, and how fast it generates output.
If you're looking to lower LLM inference spending, here are three tips to reduce your overall AI costs as you scale:
1. Diversify your hardware
Hardware is a major reason AI has historically been expensive: GPUs have long been the only practical processing units for these workloads, and demand has exceeded supply, driving up costs. This is true for consumer-grade GPUs, where it's not uncommon to see prices two or three times above MSRP, and data center GPU scarcity is even worse.
For a long time, NVIDIA held a dominant market share with its physical hardware and its proprietary Compute Unified Device Architecture (CUDA) software stack. AMD has since introduced the open-source ROCm platform, making it easier for teams to expand the hardware types they can use for their AI workloads, increasing effective GPU supply and reducing vendor lock-in.
2. Configuration (Model + KV cache) and quantization
When running LLM inference, pay attention to GPU memory capacity and speed, as they affect overall performance. You need a minimum amount of memory just to load and run a model. Additional capacity beyond that allows for a bigger KV cache, which is critical to high-throughput performance: the KV cache stores the attention history of each conversation for every user the GPU is currently serving. Without it, the GPU must recompute attention over the full history for every new token, and generation slows down. With it, you can serve more users at once and keep token generation steady.
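As a back-of-envelope sketch of why KV cache size matters, the memory per conversation grows linearly with token count and with the model's layer and head dimensions. The model shape below (32 layers, 32 KV heads of dimension 128) is a hypothetical example, not a figure from this article:

```python
# Rough KV-cache sizing sketch; all model dimensions here are illustrative.
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Per-conversation KV cache: one key and one value vector
    per token, per layer, per KV head (hence the factor of 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# A hypothetical 32-layer model with 32 KV heads of dim 128 at 16-bit:
gb = kv_cache_bytes(tokens=5000, layers=32, kv_heads=32, head_dim=128) / 1e9
print(f"{gb:.2f} GB")  # prints "2.62 GB" for a single 5000-token conversation
```

Multiply that per-conversation figure by the number of concurrent users, and it becomes clear why spare VRAM beyond the model weights translates directly into serving capacity.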
Beyond using a KV cache and optimizing your model, consider quantization. This practice reduces numerical precision so less GPU memory (VRAM) is required. A 5000-token conversation, for example, can take several gigabytes of VRAM to cache, because the cache holds a massive number of values that the GPU reuses during inference. Each value requires 2 bytes at the default 16-bit precision; at 8-bit precision, each needs only 1 byte, halving the overall memory requirement, though your hardware must support 8-bit formats for this to work effectively.
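The halving is simple arithmetic over bytes per value. Here is a minimal sketch, using a hypothetical 7-billion-parameter model as the example (the function and parameter names are illustrative, not from any particular library):

```python
# Illustrative only: memory needed to hold a model's weights at a
# given precision. Real deployments add activation and cache overhead.
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Total weight storage in gigabytes (1 GB = 1e9 bytes here)."""
    return num_params * bytes_per_param / 1e9

params = 7e9                          # a hypothetical 7B-parameter model
fp16_gb = weight_memory_gb(params, 2) # 16-bit: 14.0 GB
int8_gb = weight_memory_gb(params, 1) # 8-bit:   7.0 GB, exactly half
```

The same 2x reduction applies to the KV cache when its values are stored at 8-bit precision, which is why quantization frees room for more concurrent conversations on the same card.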
3. Optimize your parallelism setup
AI production workloads are massive, requiring gigabytes (or even terabytes) of memory just to load models. Even if a single 8-bit model fits on one GPU, there's no guarantee you'll have enough memory left over for its activations (the intermediate calculations the LLM performs during inference). This is where tensor parallelism and data parallelism come in.
When you spread a model across multiple GPUs, you reduce the memory (and computation) required per GPU, leaving room for activations and the KV cache. If you choose to apply this technique, account for the technical overhead of coordinating and synchronizing data between GPUs.
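A quick fit check makes the trade-off concrete. In this simplified sketch, weights and KV cache shard evenly across the tensor-parallel group, while activation memory is approximated as a fixed per-GPU cost; all the numbers (140 GB of weights, 80 GB cards) are hypothetical:

```python
# Back-of-envelope memory fit check under tensor parallelism.
# Simplification: weights and KV cache shard evenly across GPUs;
# activations are treated as a fixed per-GPU cost.
def fits(weights_gb: float, kv_cache_gb: float, activations_gb: float,
         gpu_capacity_gb: float, tp_degree: int) -> bool:
    per_gpu = (weights_gb + kv_cache_gb) / tp_degree + activations_gb
    return per_gpu <= gpu_capacity_gb

# A hypothetical 140 GB model with 40 GB of KV cache on 80 GB GPUs:
print(fits(140, 40, 10, 80, tp_degree=1))  # False: 190 GB won't fit on one GPU
print(fits(140, 40, 10, 80, tp_degree=4))  # True: 55 GB per GPU, with headroom
```

In practice the sharding is not perfectly even and inter-GPU communication adds latency, which is the coordination overhead mentioned above, but the sizing logic is the same.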
If you're curious to see a practical application of these techniques, you can read our full Character.ai case study for a technical deep dive. With these workflows in place, the company reduced its inference costs by 50% while continuing to support an app with tens of millions of users.