Discussion on: Running Local LLMs, CPU vs. GPU - a Quick Speed Test

This depends a lot on the settings. I tried the same model and the same example query, "tell me about Mars". My hardware: a Ryzen 3900 PRO CPU (12 cores, 24 threads; I got it for less than half the price of a 3900X) and an AMD RX 6700 (the non-XT card), which I also got cheap. RAM is pretty cheap as well, so 128GB is within reach for most people. I used koboldcpp-rocm. With 14 layers on the GPU and 14 CPU threads it gave 6 tokens per second (T/s). (28, 14) gave 15 T/s. (30, 24) gave 4.43 T/s. Finally, 35 layers and 24 CPU threads used 7.3GB of GPU memory in total and gave 34.61 T/s.

I'm writing this to show that the results depend very much on the settings.
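
To compare settings like these systematically, one option is to generate a launch command for each (GPU layers, CPU threads) pair and run them one at a time. Below is a minimal sketch, not the exact setup used above. The script name `koboldcpp.py` and the model file `model.gguf` are hypothetical placeholders, and the `--model`, `--gpulayers` and `--threads` flag names are written from memory, so check them against `--help` for your build. A backend selection flag is probably also needed for GPU offloading; it is left out here because it differs between builds.

```python
import shlex

# Hypothetical placeholders: point these at your own script and model file.
KOBOLDCPP = "koboldcpp.py"
MODEL = "model.gguf"

# (GPU layers, CPU threads) pairs taken from the comment above.
CONFIGS = [(14, 14), (28, 14), (30, 24), (35, 24)]


def build_command(gpu_layers: int, threads: int) -> list[str]:
    """Build the argv list for one run.

    The flag names are an assumption; verify them with --help.
    """
    return [
        "python", KOBOLDCPP,
        "--model", MODEL,
        "--gpulayers", str(gpu_layers),
        "--threads", str(threads),
    ]


def sweep_commands(configs=CONFIGS) -> list[str]:
    """Return one shell-quoted command line per configuration."""
    return [shlex.join(build_command(g, t)) for g, t in configs]


if __name__ == "__main__":
    # Only prints the commands; nothing is launched here.
    for line in sweep_commands():
        print(line)
```

The script only prints the commands, so you can run each one by hand, send the same prompt every time, and note the T/s figure the tool reports.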

Maxim Saplin • Mar 27 '24

JIC, I tested the pure cases: 100% CPU and 100% offloading to GPU.

Orlando Arroyo • Apr 3 '24

How did you get it to use 100% of the CPU? Which config or settings did you use?

Maciej Wakuła • Apr 4 '24 • Edited

You can offload all layers to the GPU (CUDA, ROCm) or use a CPU implementation (e.g. HIPS). Just run LM Studio for your first steps. Run koboldcpp or koboldcpp-rocm second. Then try Python and transformers. From there you should know enough about the basics to choose your direction. And remember that offloading everything to the GPU still consumes CPU.

This is the peak when using full ROCm (GPU) offloading. See the CPU usage on the left: the initial CPU load is from starting the tools, and the LLM was running during the peak at the end. There is GPU usage, but the CPU is used as well.

And this is Windows - ROCm is still very limited on other operating systems :/