RubberDuckOps for Leaseweb

CPU Inference on AMD EPYC 9334: Real Numbers for LLM and TTS Workloads

TL;DR — GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like — and when CPU inference makes sense.


Why inference is where your budget actually disappears

Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously — and cost per query scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.

The hardware decision for inference is also different from training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.


The two metrics that actually matter for LLM inference

When a prompt hits an LLM, two stages happen:

  • Prefill — the model converts input tokens, runs them through its layers, and builds a KV cache. Compute-bound. Ends when the first output token is generated.
  • Decode — the model generates each subsequent token one at a time, reading from the KV cache. Memory-bandwidth-bound.

These stages have different performance profiles, which is why benchmarks report two numbers:

  • Time to first token (TTFT) — elapsed time from prompt submission to first output token. Lower is better.
  • Tokens per second (tok/s) — decode throughput. Higher is better, especially for batch and streaming workloads.
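
As a rough illustration, both numbers can be recovered from token arrival timestamps on any streaming client. The `measure_stream` helper below is a sketch, not part of llama.cpp or any specific API; it only assumes an iterable that yields output tokens as they are generated:

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and decode throughput from a stream of output tokens.

    `token_iter` is a stand-in for any streaming client that yields
    tokens as they arrive; it is a hypothetical interface, not a
    specific library's API.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # prefill ends when the first token arrives
        count += 1
    total = time.perf_counter() - start
    # Decode throughput: tokens after the first, over the decode window only
    decode_tps = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, decode_tps
```

Separating the two windows matters: folding prefill time into a single tokens-per-second figure understates decode throughput for long prompts.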

For TTS, the standard metric is real-time factor (RTF) — the ratio of processing time to audio duration. RTF below 1.0 means the model generates audio faster than real time. Above 1.0 and it can't keep up.
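
RTF is a plain ratio, which makes it easy to sanity-check. A minimal sketch, using the Kokoro figure reported in the results below as the worked example:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# Kokoro-like case: ~1.62 s of compute to synthesise 10 s of audio
print(real_time_factor(1.62, 10.0))  # 0.162 → roughly 6× faster than real time
```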


Hardware and software setup

| Component | Specification |
| --- | --- |
| CPU | AMD EPYC 9334 × 2 (dual socket) |
| Architecture | Zen 4 |
| Cores / threads per socket | 32 / 64 |
| Base clock | 2.7 GHz |
| L3 cache | 128 MB per socket |
| TDP | 210 W per socket |
| Memory | 64 GB DDR5 |

Two tools were used: llama-bench (part of llama.cpp) for local model evaluation, and OpenLLM with llmperf for API-level throughput testing.

Test configuration for LLM: 512-token prompt, 128 generated tokens, 24 CPU threads. For TTS: 180-character input, 32 CPU threads, 30 inference runs per model.


Models tested

| Model | Parameters | Type |
| --- | --- | --- |
| DeepSeek-R1-0528-Qwen3-8B-Q4_K_M | 8B | LLM |
| gpt-oss-20b | 20B | LLM |
| Llama-2-7b-Q4_K_M | 7B | LLM |
| Mistral-7B-Instruct-v0.2-Q4_K_M | 7B | LLM |
| Kokoro (ONNX Runtime) | 82M | TTS |
| Microsoft SpeechT5 | 150M | TTS |
| Coqui XTTS-v2 | 400M | TTS |

LLM results

Time to first token

| Model | Quantisation | TTFT |
| --- | --- | --- |
| DeepSeek-R1-8B | Q4_K_M | 4.1 s |
| DeepSeek-R1-8B | FP16 | 8.1 s |
| GPT-OSS-20B | Q4 | 3.6 s |
| GPT-OSS-20B | FP16 | 3.6 s |
| Llama-2-7B | Q4_K_M | 4.8 s |
| Mistral-7B | Q4_K_M | ~4.5 s |

Switching GPT-OSS-20B to FP16 had minimal effect on TTFT. For DeepSeek, the same switch more than doubled it.

Decode throughput

| Model | Quantisation | Throughput |
| --- | --- | --- |
| DeepSeek-R1-8B | Q4_K_M | 27.8 tok/s |
| DeepSeek-R1-8B | FP16 | 8.1 tok/s |
| GPT-OSS-20B | Q4 | 18.3 tok/s |
| GPT-OSS-20B | FP16 | 26.2 tok/s |
| Llama-2-7B | Q4_K_M | ~22 tok/s |
| Mistral-7B | Q4_K_M | ~20 tok/s |

The Q4 vs FP16 gap is significant for DeepSeek — a 3.4× throughput drop. For sustained batch workloads on CPU, Q4 quantisation is the practical default.
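
That gap is consistent with decode being memory-bandwidth-bound: each generated token streams roughly the full weight set through memory, so shrinking the weights raises the throughput ceiling almost proportionally. A back-of-envelope sketch, where the bandwidth and model-size figures are illustrative assumptions rather than measurements from this test:

```python
def decode_upper_bound(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Rough decode ceiling: each token reads ~the full weight set once,
    so throughput is capped near effective bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Illustrative figures only: ~5 GB for an 8B Q4_K_M model, ~16 GB at FP16,
# and an assumed 150 GB/s of effective (not theoretical peak) bandwidth.
print(decode_upper_bound(5, 150))   # ~30 tok/s ceiling
print(decode_upper_bound(16, 150))  # ~9 tok/s ceiling
```

Under those assumptions the ceilings land close to the measured 27.8 and 8.1 tok/s, which is why Q4 is the practical default on CPU.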

CPU and memory utilisation

CPU utilisation stayed between 20% and 30% across all runs. Q4 models leave substantial DRAM headroom — useful for multi-tenant deployments where you want concurrent instances on the same node. DeepSeek at FP16 consumed close to 16 GB, which limits that option considerably.
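
For rough multi-tenant sizing, a simple RAM budget is a reasonable first pass. All figures in the sketch below are illustrative assumptions, not measurements:

```python
import math

def max_instances(total_ram_gb: float, os_reserve_gb: float,
                  model_gb: float, overhead_gb: float) -> int:
    """How many concurrent model instances fit in RAM (rough sizing only)."""
    per_instance = model_gb + overhead_gb  # weights + KV cache / runtime overhead
    return math.floor((total_ram_gb - os_reserve_gb) / per_instance)

# Illustrative: 64 GB node, 8 GB reserved for the OS,
# ~5 GB of Q4 weights plus ~1 GB of KV cache and runtime overhead
print(max_instances(64, 8, 5, 1))   # 9 instances
# An FP16 model at ~16 GB leaves far less room
print(max_instances(64, 8, 16, 1))  # 3 instances
```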

GPU reference point

For comparison, the same FP16 throughput test ran on an Nvidia L4 GPU. The L4 produced 16.7 tok/s on DeepSeek-R1-8B and 58.6 tok/s on GPT-OSS-20B, versus 8.1 and 26.2 on the EPYC 9334. That is roughly double the throughput in each case. If throughput is your primary constraint, that gap matters. If cost predictability or workload type is the constraint, the CPU case still holds.


TTS results

| Model | RTF | Memory | Verdict |
| --- | --- | --- | --- |
| Kokoro (82M, ONNX) | 0.162 | ~0.5 GB | 6× faster than real time. Tight p50/p95 spread. |
| Microsoft SpeechT5 (150M) | 0.6 | ~1.4 GB | Comfortably real time. Good for single-speaker synthesis. |
| Coqui XTTS-v2 (400M) | 1.41 | ~4 GB | Cannot serve real-time audio. Strong fit for batch jobs. |

Kokoro is the standout — 82M parameters, RTF of 0.162, and consistent latency under load. XTTS-v2 is the most capable (voice cloning, multilingual) but at RTF 1.41 it belongs in overnight queues or batch audio generation, not streaming pipelines.


How to reproduce this

📌 Placeholder — confirm the exact commands used before publishing.

```shell
# LLM benchmark — llama-bench (part of llama.cpp)
llama-bench \
  -m /path/to/model.gguf \
  -p 512 \
  -n 128 \
  -t 24
```

```shell
# TTS benchmark — run per model, 30 iterations
# Kokoro: ONNX Runtime
# SpeechT5 + XTTS-v2: standard Python inference loop
# Input: 180-character text string, 32 threads
```

```shell
# API-level throughput — OpenLLM + llmperf
openllm start /path/to/model.gguf --backend llama-cpp

llmperf run \
  --model <model-name> \
  --num-concurrent-requests 1 \
  --num-output-tokens 128 \
  --num-input-tokens 512
```

Models sourced from HuggingFace. Search the model name directly (e.g. bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF) and pull the Q4_K_M variant for llama.cpp tests.

No special system prep was applied — no NUMA pinning or hugepage configuration. Results reflect default OS settings on the HPE ProLiant DL385 Gen11.


When to use CPU vs GPU for inference

CPU inference is a good fit for:

  • Batch summarisation and document processing
  • Audio transcription queues
  • Overnight report generation
  • Lightweight TTS (Kokoro, SpeechT5)
  • Edge deployments with cost or availability constraints
  • Multi-tenant setups with 7B–20B Q4 models

GPU is still the right call for:

  • Real-time, latency-critical workloads at scale
  • High-concurrency serving (maximise throughput)
  • Models above 20B without quantisation
  • Real-time TTS with complex models (XTTS-v2)
  • Streaming use cases where TTFT < 1s is required

Takeaway

The EPYC 9334 handles 7B–20B parameter models at Q4 quantisation with predictable throughput and acceptable latency for a broad class of production workloads. It doesn't replace a GPU for every inference job. For the workloads listed above, it doesn't need to.

If you're running batch inference or TTS queues and paying GPU rates, it's worth running these numbers against your actual workload before assuming a GPU is necessary.
