DEV Community

Discussion on: Running Local LLMs, CPU vs. GPU - a Quick Speed Test

Maciej Wakuła

This depends a lot on the settings. I tried the same model with the example query "tell me about Mars". My hardware: a Ryzen 3900 PRO CPU (12 cores, 24 threads; I got it for less than half the price of the 3900X) and an AMD RX 6700 (non-XT), which I also got cheap. RAM is pretty cheap as well, so 128 GB is within reach for most. Using koboldcpp-rocm, the results varied widely with the (GPU layers, CPU threads) split: (14, 14) gave 6 tokens per second, (28, 14) gave 15 T/s, (30, 24) gave 4.43 T/s, and finally (35, 24) consumed 7.3 GB of VRAM in total and gave 34.61 T/s.

I'm writing this to show that the results depend very much on the settings.
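The sweep above can be tabulated in a quick Python sketch (the numbers are the ones reported in the comment; the tuple layout and variable names are my own):

```python
# Reported koboldcpp-rocm results: (gpu_layers, cpu_threads, tokens_per_sec)
results = [
    (14, 14, 6.00),
    (28, 14, 15.00),
    (30, 24, 4.43),
    (35, 24, 34.61),
]

# Pick the fastest (gpu_layers, cpu_threads) split by throughput
best = max(results, key=lambda r: r[2])
print(f"Best split: {best[0]} GPU layers, {best[1]} CPU threads -> {best[2]} T/s")
# prints: Best split: 35 GPU layers, 24 CPU threads -> 34.61 T/s
```

Note the sweep is not monotonic: (30, 24) was far slower than (28, 14), so it pays to benchmark each split rather than assume "more GPU layers is always faster".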

Maxim Saplin

JIC, I tested the pure cases: 100% CPU and 100% offloading to GPU.

Orlando Arroyo

How did you get it to use 100% of the CPU? Which config or settings did you use?

Maciej Wakuła • Edited

You can offload all layers to the GPU (CUDA, ROCm) or use a CPU implementation (e.g. HIPS). Just run LM Studio for your first steps. Run koboldcpp or koboldcpp-rocm as a second step. Then try Python and transformers. From there you should know enough about the basics to choose your direction. And remember that offloading everything to the GPU still consumes CPU.
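In koboldcpp, the CPU/GPU split is controlled from the command line. A sketch, assuming a current build; `model.gguf` and the numeric values are placeholders, so check `--help` for your version:

```shell
# Offload 35 layers to the GPU and use 24 CPU threads (placeholder values)
python koboldcpp.py model.gguf --gpulayers 35 --threads 24

# Pure CPU run: offload no layers to the GPU
python koboldcpp.py model.gguf --gpulayers 0 --threads 24
```

Even with all layers offloaded, token sampling and orchestration still run on the CPU, which is why you see CPU load during generation.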

[screenshot]

This is the peak when using full ROCm (GPU) offloading. See the CPU usage on the left (the initial CPU load is from starting the tools; the LLM ran during the peak at the end, where there is GPU usage but the CPU is used as well).
[screenshot]

And this is Windows; ROCm is still very limited on other operating systems :/