TL;DR — GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like — and when CPU inference makes sense.
Why inference is where your budget actually disappears
Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously — and cost per query scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.
The hardware decision for inference is also different from training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.
The two metrics that actually matter for LLM inference
When a prompt hits an LLM, two stages happen:
- Prefill — the model ingests the prompt, running all input tokens through its layers in parallel and building the KV cache. Compute-bound. Ends when the first output token is generated.
- Decode — the model generates each subsequent token one at a time, reading from the KV cache. Memory-bandwidth-bound.
These stages have different performance profiles, which is why benchmarks report two numbers:
- Time to first token (TTFT) — elapsed time from prompt submission to first output token. Lower is better.
- Tokens per second (tok/s) — decode throughput. Higher is better, especially for batch and streaming workloads.
For TTS, the standard metric is real-time factor (RTF) — the ratio of processing time to audio duration. RTF below 1.0 means the model generates audio faster than real time. Above 1.0 and it can't keep up.
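To make the three metrics concrete, here is a minimal sketch of how they fall out of raw timings. The numbers are illustrative placeholders chosen to echo the results below, not measurements:

```python
# How TTFT, decode throughput and RTF are derived from raw timings.
# All values below are illustrative placeholders, not benchmark data.

# LLM: timestamps captured around a streaming generation
prompt_submitted_at = 0.0      # seconds
first_token_at = 4.1           # seconds
generation_finished_at = 8.7   # seconds
tokens_generated = 128

ttft = first_token_at - prompt_submitted_at                            # time to first token
decode_tok_s = tokens_generated / (generation_finished_at - first_token_at)

# TTS: real-time factor
processing_time = 1.62   # seconds spent synthesising
audio_duration = 10.0    # seconds of audio produced
rtf = processing_time / audio_duration   # < 1.0 means faster than real time

print(f"TTFT: {ttft:.1f}s, decode: {decode_tok_s:.1f} tok/s, RTF: {rtf:.3f}")
```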
Hardware and software setup
| Component | Specification |
|---|---|
| CPU | AMD EPYC 9334 × 2 (dual socket) |
| Architecture | Zen 4 |
| Cores / threads per socket | 32 / 64 |
| Base clock | 2.7 GHz |
| L3 cache | 128 MB per socket |
| TDP | 210W per socket |
| Memory | 64 GB DDR5 |
Two tools were used: llama-bench (part of llama.cpp) for local model evaluation, and OpenLLM with llmperf for API-level throughput testing.
Test configuration — LLM: 512-token prompt, 128 generated tokens, 24 CPU threads. TTS: 180-character input, 32 CPU threads, 30 inference runs per model.
Models tested
| Model | Parameters | Type |
|---|---|---|
| DeepSeek-R1-0528-Qwen3-8B-Q4_K_M | 8B | LLM |
| GPT-OSS-20B | 20B | LLM |
| Llama-2-7b-Q4_K_M | 7B | LLM |
| Mistral-7B-Instruct-v0.2-Q4_K_M | 7B | LLM |
| Kokoro (ONNX Runtime) | 82M | TTS |
| Microsoft SpeechT5 | 150M | TTS |
| Coqui XTTS-v2 | 400M | TTS |
LLM results
Time to first token
| Model | Quantisation | TTFT |
|---|---|---|
| DeepSeek-R1-8B | Q4_K_M | 4.1s |
| DeepSeek-R1-8B | FP16 | 8.1s |
| GPT-OSS-20B | Q4 | 3.6s |
| GPT-OSS-20B | FP16 | 3.6s |
| Llama-2-7B | Q4_K_M | 4.8s |
| Mistral-7B | Q4_K_M | ~4.5s |
Switching GPT-OSS-20B to FP16 had minimal effect on TTFT. For DeepSeek, the same switch nearly doubled it (4.1s → 8.1s).
Decode throughput
| Model | Quantisation | Throughput |
|---|---|---|
| DeepSeek-R1-8B | Q4_K_M | 27.8 tok/s |
| DeepSeek-R1-8B | FP16 | 8.1 tok/s |
| GPT-OSS-20B | Q4 | 18.3 tok/s |
| GPT-OSS-20B | FP16 | 26.2 tok/s |
| Llama-2-7B | Q4_K_M | ~22 tok/s |
| Mistral-7B | Q4_K_M | ~20 tok/s |
The Q4 vs FP16 gap is significant for DeepSeek — a 3.4× throughput drop. For sustained batch workloads on CPU, Q4 quantisation is the practical default.
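As a back-of-the-envelope illustration of what that gap means for a batch job (using the measured DeepSeek decode rates; prefill and scheduling overhead are ignored):

```python
# Rough wall-clock estimate for a fixed batch at the measured
# DeepSeek-R1-8B decode rates. Ignores prefill and scheduling overhead.
tokens_to_generate = 1_000_000

for label, tok_s in [("Q4_K_M", 27.8), ("FP16", 8.1)]:
    hours = tokens_to_generate / tok_s / 3600
    print(f"{label}: ~{hours:.0f} h to generate {tokens_to_generate:,} tokens")
```

That works out to roughly 10 hours at Q4_K_M versus roughly 34 hours at FP16 for the same job.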
CPU and memory utilisation
CPU utilisation stayed in the 20–30% range across all runs. Q4 models leave substantial DRAM headroom — useful for multi-tenant deployments where you want concurrent instances on the same node. DeepSeek at FP16 consumed close to 16 GB, which limits that option considerably.
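A rough way to put numbers on that headroom — a sketch only, assuming ~4.85 bits per weight for Q4_K_M and ignoring KV cache and runtime overhead:

```python
# Back-of-the-envelope: how many copies of an 8B model fit in 64 GB of DRAM.
# Assumes ~4.85 bits/weight for Q4_K_M and 16 bits/weight for FP16; ignores
# KV cache, OS and runtime overhead, so treat the counts as upper bounds.
dram_gb = 64
params_b = 8  # billions of parameters

q4_gb = params_b * 4.85 / 8    # ~4.9 GB of weights per instance
fp16_gb = params_b * 16 / 8    # ~16 GB of weights per instance

print(f"Q4_K_M: ~{q4_gb:.1f} GB/instance -> up to {int(dram_gb // q4_gb)} instances")
print(f"FP16:   ~{fp16_gb:.1f} GB/instance -> up to {int(dram_gb // fp16_gb)} instances")
```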
GPU reference point
For comparison, the same FP16 throughput test ran on an Nvidia L4 GPU. The L4 produced 16.7 tok/s on DeepSeek-R1-8B and 58.6 tok/s on GPT-OSS-20B, versus 8.1 and 26.2 tok/s on the EPYC 9334 — roughly double the throughput. If throughput is your primary constraint, that gap matters. If cost predictability or workload type is the constraint, the CPU case still holds.
TTS results
| Model | RTF | Memory | Verdict |
|---|---|---|---|
| Kokoro (82M, ONNX) | 0.162 | ~0.5 GB | 6× faster than real time. Tight p50/p95 spread. |
| Microsoft SpeechT5 (150M) | 0.6 | ~1.4 GB | Comfortably real time. Good for single-speaker synthesis. |
| Coqui XTTS-v2 (400M) | 1.41 | ~4 GB | Cannot serve real-time audio. Strong fit for batch jobs. |
Kokoro is the standout — 82M parameters, RTF of 0.162, and consistent latency under load. XTTS-v2 is the most capable (voice cloning, multilingual) but at RTF 1.41 it belongs in overnight queues or batch audio generation, not streaming pipelines.
How to reproduce this
The commands below mirror the setup described above; treat them as a template and check flag names against the versions of the tools you have installed.
```bash
# LLM benchmark — llama-bench (part of llama.cpp)
# -m: GGUF model file, -p: prompt (prefill) tokens, -n: tokens to generate, -t: CPU threads
llama-bench \
  -m /path/to/model.gguf \
  -p 512 \
  -n 128 \
  -t 24
```
TTS benchmark — 30 iterations per model:
- Kokoro: ONNX Runtime
- SpeechT5 and XTTS-v2: standard Python inference loop (sketched below)
- Input: 180-character text string, 32 CPU threads
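A minimal sketch of that inference loop, with a hypothetical `synthesize()` call standing in for each model's own API (Kokoro via ONNX Runtime, SpeechT5 and XTTS-v2 via their Python packages):

```python
# Minimal RTF benchmark harness. `synthesize` is a hypothetical placeholder
# for the model-specific call and is assumed to return (samples, sample_rate).
import statistics
import time

TEXT = "A fixed 180-character test sentence goes here ..."  # same input every run
RUNS = 30

def benchmark(synthesize):
    rtfs = []
    for _ in range(RUNS):
        start = time.perf_counter()
        samples, sample_rate = synthesize(TEXT)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        rtfs.append(elapsed / audio_seconds)     # RTF for this run
    rtfs.sort()
    p50 = statistics.median(rtfs)
    p95 = rtfs[int(0.95 * (len(rtfs) - 1))]      # approximate p95
    print(f"RTF p50={p50:.3f}  p95={p95:.3f}")
```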
```bash
# API-level throughput — OpenLLM + llmperf
# Serve the model behind an OpenAI-compatible endpoint, then drive it with llmperf
openllm start /path/to/model.gguf --backend llama-cpp

llmperf run \
  --model <model-name> \
  --num-concurrent-requests 1 \
  --num-output-tokens 128 \
  --num-input-tokens 512
```
Models were sourced from Hugging Face. Search the model name directly (e.g. bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF) and pull the Q4_K_M variant for the llama.cpp tests.
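For example, pulling only the Q4_K_M file with `huggingface_hub` (the repo shown is the one referenced above; adjust for other models):

```python
# Download just the Q4_K_M GGUF from the repo referenced above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF",
    allow_patterns=["*Q4_K_M*"],
    local_dir="./models",
)
```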
No special system prep was applied — no NUMA pinning or hugepage configuration. Results reflect default OS settings on the HPE ProLiant DL385 Gen11.
When to use CPU vs GPU for inference
CPU inference is a good fit for:
- Batch summarisation and document processing
- Audio transcription queues
- Overnight report generation
- Lightweight TTS (Kokoro, SpeechT5)
- Edge deployments with cost or availability constraints
- Multi-tenant setups with 7B–20B Q4 models
GPU is still the right call for:
- Real-time, latency-critical workloads at scale
- High-concurrency serving (maximise throughput)
- Models above 20B without quantisation
- Real-time TTS with complex models (XTTS-v2)
- Streaming use cases where TTFT < 1s is required
Takeaway
The EPYC 9334 handles 7B–20B parameter models at Q4 quantisation with predictable throughput and acceptable latency for a broad class of production workloads. It doesn't replace a GPU for every inference job. For the workloads listed above, it doesn't need to.
If you're running batch inference or TTS queues and paying GPU rates, it's worth running these numbers against your actual workload before assuming a GPU is necessary.