RubberDuckOps for Leaseweb

CPU Inference on AMD EPYC 9334: Real Numbers for LLM and TTS Workloads

TL;DR — GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like — and when CPU inference makes sense.


Why inference is where your budget actually disappears

Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously — and cost per query scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.

The hardware decision for inference is also different from training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.


The two metrics that actually matter for LLM inference

When a prompt hits an LLM, two stages happen:

  • Prefill — the model converts input tokens, runs them through its layers, and builds a KV cache. Compute-bound. Ends when the first output token is generated.
  • Decode — the model generates each subsequent token one at a time, reading from the KV cache. Memory-bandwidth-bound.

These stages have different performance profiles, which is why benchmarks report two numbers:

  • Time to first token (TTFT) — elapsed time from prompt submission to first output token. Lower is better.
  • Tokens per second (tok/s) — decode throughput. Higher is better, especially for batch and streaming workloads.
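
As a rough illustration, both numbers can be recovered from token arrival timestamps on any streaming client. The `measure_stream` helper below is a sketch, not part of llama.cpp or any specific API; it only assumes an iterable that yields output tokens as they are generated:

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and decode throughput from a stream of output tokens.

    `token_iter` is a stand-in for any streaming client that yields
    tokens as they arrive; it is a hypothetical interface, not a
    specific library's API.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # prefill ends when the first token arrives
        count += 1
    total = time.perf_counter() - start
    # Decode throughput: tokens after the first, over the decode window only
    decode_tps = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, decode_tps
```

Separating the two windows matters: folding prefill time into a single tokens-per-second figure understates decode throughput for long prompts.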

For TTS, the standard metric is real-time factor (RTF) — the ratio of processing time to audio duration. RTF below 1.0 means the model generates audio faster than real time. Above 1.0 and it can't keep up.
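
RTF is a plain ratio, which makes it easy to sanity-check. A minimal sketch, using the Kokoro figure reported in the results below as the worked example:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# Kokoro-like case: ~1.62 s of compute to synthesise 10 s of audio
print(real_time_factor(1.62, 10.0))  # 0.162 → roughly 6× faster than real time
```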


Hardware and software setup

| Component | Specification |
| --- | --- |
| CPU | AMD EPYC 9334 × 2 (dual socket) |
| Architecture | Zen 4 |
| Cores / threads per socket | 32 / 64 |
| Base clock | 2.7 GHz |
| L3 cache | 128 MB per socket |
| TDP | 210 W per socket |
| Memory | 64 GB DDR5 |

Two tools were used: llama-bench (part of llama.cpp) for local model evaluation, and OpenLLM with llmperf for API-level throughput testing.

Test configuration for LLM: 512-token prompt, 128 generated tokens, 24 CPU threads. For TTS: 180-character input, 32 CPU threads, 30 inference runs per model.


Models tested

| Model | Parameters | Type |
| --- | --- | --- |
| DeepSeek-R1-0528-Qwen3-8B-Q4_K_M | 8B | LLM |
| gpt-oss-20b | 20B | LLM |
| Llama-2-7b-Q4_K_M | 7B | LLM |
| Mistral-7B-Instruct-v0.2-Q4_K_M | 7B | LLM |
| Kokoro (ONNX Runtime) | 82M | TTS |
| Microsoft SpeechT5 | 150M | TTS |
| Coqui XTTS-v2 | 400M | TTS |

LLM results

Time to first token

| Model | Quantisation | TTFT |
| --- | --- | --- |
| DeepSeek-R1-8B | Q4_K_M | 4.1 s |
| DeepSeek-R1-8B | FP16 | 8.1 s |
| GPT-OSS-20B | Q4 | 3.6 s |
| GPT-OSS-20B | FP16 | 3.6 s |
| Llama-2-7B | Q4_K_M | 4.8 s |
| Mistral-7B | Q4_K_M | ~4.5 s |

Switching GPT-OSS-20B to FP16 had minimal effect on TTFT. For DeepSeek, the same switch more than doubled it.

Decode throughput

| Model | Quantisation | Throughput |
| --- | --- | --- |
| DeepSeek-R1-8B | Q4_K_M | 27.8 tok/s |
| DeepSeek-R1-8B | FP16 | 8.1 tok/s |
| GPT-OSS-20B | Q4 | 18.3 tok/s |
| GPT-OSS-20B | FP16 | 26.2 tok/s |
| Llama-2-7B | Q4_K_M | ~22 tok/s |
| Mistral-7B | Q4_K_M | ~20 tok/s |

The Q4 vs FP16 gap is significant for DeepSeek — a 3.4× throughput drop. For sustained batch workloads on CPU, Q4 quantisation is the practical default.
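
That gap is consistent with decode being memory-bandwidth-bound: each generated token streams roughly the full weight set through memory, so shrinking the weights raises the throughput ceiling almost proportionally. A back-of-envelope sketch, where the bandwidth and model-size figures are illustrative assumptions rather than measurements from this test:

```python
def decode_upper_bound(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough decode ceiling: each token reads ~the full weight set once,
    so throughput is capped near effective bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative figures only: ~5 GB for an 8B Q4_K_M model, ~16 GB at FP16,
# and an assumed 150 GB/s of effective (not theoretical peak) bandwidth.
print(decode_upper_bound(5, 150))   # ~30 tok/s ceiling
print(decode_upper_bound(16, 150))  # ~9 tok/s ceiling
```

Under those assumptions the ceilings land close to the measured 27.8 and 8.1 tok/s, which is why Q4 is the practical default on CPU.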

CPU and memory utilisation

CPU utilisation stayed between 20% and 30% across all runs. Q4 models leave substantial DRAM headroom — useful for multi-tenant deployments where you want concurrent instances on the same node. DeepSeek at FP16 consumed close to 16 GB, which limits that option considerably.
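
For rough multi-tenant sizing, a simple RAM budget is a reasonable first pass. All figures in the sketch below are illustrative assumptions, not measurements:

```python
import math

def max_instances(total_ram_gb: float, os_reserve_gb: float,
                  model_gb: float, overhead_gb: float) -> int:
    """How many concurrent model instances fit in RAM (rough sizing only)."""
    per_instance = model_gb + overhead_gb  # weights + KV cache / runtime overhead
    return math.floor((total_ram_gb - os_reserve_gb) / per_instance)

# Illustrative: 64 GB node, 8 GB reserved for the OS,
# ~5 GB of Q4 weights plus ~1 GB of KV cache and runtime overhead
print(max_instances(64, 8, 5, 1))   # 9 instances
# An FP16 model at ~16 GB leaves far less room
print(max_instances(64, 8, 16, 1))  # 3 instances
```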

GPU reference point

For comparison, the same FP16 throughput test ran on an Nvidia L4 GPU. The L4 produced 16.7 tok/s on DeepSeek-R1-8B and 58.6 tok/s on GPT-OSS-20B, versus 8.1 and 26.2 on the EPYC 9334. That is roughly double the throughput in each case. If throughput is your primary constraint, that gap matters. If cost predictability or workload type is the constraint, the CPU case still holds.


TTS results

| Model | RTF | Memory | Verdict |
| --- | --- | --- | --- |
| Kokoro (82M, ONNX) | 0.162 | ~0.5 GB | 6× faster than real time. Tight p50/p95 spread. |
| Microsoft SpeechT5 (150M) | 0.6 | ~1.4 GB | Comfortably real time. Good for single-speaker synthesis. |
| Coqui XTTS-v2 (400M) | 1.41 | ~4 GB | Cannot serve real-time audio. Strong fit for batch jobs. |

Kokoro is the standout — 82M parameters, RTF of 0.162, and consistent latency under load. XTTS-v2 is the most capable (voice cloning, multilingual) but at RTF 1.41 it belongs in overnight queues or batch audio generation, not streaming pipelines.


How to reproduce this

📌 Placeholder — confirm the exact commands used before publishing.

```shell
# LLM benchmark — llama-bench (part of llama.cpp)
llama-bench \
  -m /path/to/model.gguf \
  -p 512 \
  -n 128 \
  -t 24
```

```shell
# TTS benchmark — run per model, 30 iterations
# Kokoro: ONNX Runtime
# SpeechT5 + XTTS-v2: standard Python inference loop
# Input: 180-character text string, 32 threads
```

```shell
# API-level throughput — OpenLLM + llmperf
openllm start /path/to/model.gguf --backend llama-cpp

llmperf run \
  --model <model-name> \
  --num-concurrent-requests 1 \
  --num-output-tokens 128 \
  --num-input-tokens 512
```

Models sourced from HuggingFace. Search the model name directly (e.g. bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF) and pull the Q4_K_M variant for llama.cpp tests.

No special system prep was applied — no NUMA pinning or hugepage configuration. Results reflect default OS settings on the HPE ProLiant DL385 Gen11.


When to use CPU vs GPU for inference

CPU inference is a good fit for:

  • Batch summarisation and document processing
  • Audio transcription queues
  • Overnight report generation
  • Lightweight TTS (Kokoro, SpeechT5)
  • Edge deployments with cost or availability constraints
  • Multi-tenant setups with 7B–20B Q4 models

GPU is still the right call for:

  • Real-time, latency-critical workloads at scale
  • High-concurrency serving (maximise throughput)
  • Models above 20B without quantisation
  • Real-time TTS with complex models (XTTS-v2)
  • Streaming use cases where TTFT < 1s is required

Takeaway

The EPYC 9334 handles 7B–20B parameter models at Q4 quantisation with predictable throughput and acceptable latency for a broad class of production workloads. It doesn't replace a GPU for every inference job. For the workloads listed above, it doesn't need to.

If you're running batch inference or TTS queues and paying GPU rates, it's worth running these numbers against your actual workload before assuming a GPU is necessary.
