DEV Community

Running Local LLMs, CPU vs. GPU - a Quick Speed Test

Maxim Saplin on March 11, 2024

This is the 1st part of my investigations of local LLM inference speed. Here're the 2nd and 3rd ones May 12 Update Putting together a table wit...

Read full post

Maciej Wakuła • Mar 27 '24

This depends much on the settings. I tried the same model and example query "tell me about Mars". Having Ryzen 3900 PRO CPU (12 cores, 24 threads, I got it for less than half price of 3900x), AMD RX 6700 (without x) which I also got cheap. RAM is pretty cheap as well so 128GB is in range of most. Using kobald-cpp rocm. With (14 layers on gpu, 14 cpu threads) it gave 6 tokens per second. (28,14) gave 15 T/s. (30,24) gave 4.43 T/s. Finally 35 layers, 24 CPU threads consumed total 7.3GB on GPU giving 34.61 T/s.

I'm writing to show that results depends very much on the settings.

Maxim Saplin • Mar 27 '24

JIC, I tested pure cases, 100% CPU and 100% offloading to GPU

Orlando Arroyo • Apr 3 '24

How did you get to use 100% of the CPU?, which config or settings did you have?

Maciej Wakuła • Apr 4 '24 • Edited

You can offload all layers to GPU (CUDA, ROCm) or use CPU implementation (ex. HIPS). Just run LM Studio for your first steps. Run kobaldcpp or kobapldcpp-ROCm as second. Then try to use python and transformers. From there you should know enough about the basics to choose your directions. And remember that offloading all to GPU still consumes CPU

This is a peak when using full ROCm (GPU) offloading. See CPU usage on the left (initial CPU load is to start the tools, LLM was used on the peak at the end - there is GPU usage but also CPU used)

And this is windows - ROCm still is very limited on other operating systems :/

Orlando Arroyo • Mar 16 '24 • Edited

Just for fun, here are some additional results:

iPad Pro M1 256GB, using LLM Farm to load the model: 12.05tok/s
Asus ROG Ally Z1 Extreme (CPU): 5.25 tok/s using the 25W preset, 5.05tok/s using the 15W preset

Update:
Asked a friend with a M3 Pro 12core CPU 18GB. Running from CPU: 17.93tok/s, GPU: 21.1tok/s

Maxim Saplin • Mar 16 '24

The CPU result for ROG is close to the one from 7840U, after all they almost identical CPUs

clegger • Mar 16 '24

The ROG Ally has a Ryzen Z1 Extreme which appears to be nearly identical to the 7840U, but from what I can discern, the NPU is disabled. So if / when LM Studio gets around to implementing support for that AI accelerator the 7840U should be faster at inferencing workloads.

Maxim Saplin • Mar 16 '24

AMD GPU seems to be an underdog in the ML world, when compared to Nvidia... I doubt that AMD's NPU will see better compatibility with ML stack than it's GPUs

Ricardo Meleschi • Sep 30 '24 • Edited

If you let me know what settings / template you used for this test, I'll run a similar test on my M4 iPad with 16GB Ram. I get wildly different tok/s depending on which LLM and which template I'm using now.

As of right now, with the fine-tuned LLM and the "TinyLLaMa 1B" template being used I get the following:

M4 iPad w 16GB Ram / 2TB Storage: 15.52t/s

Nicolay • Apr 30 '24 • Edited

On my rtx 3050 the speed was 28.6 tok/s.
Based on the comments above, I made a table.

RTX 3050         8gb    28.6 tok/s
RTX 3070 TI     8gb    41,75
RTX 4060         8gb    37.9 tok/s
RTX 4070         12gb   58.2tok
RTX 4080         8gb     78.1

Maxim Saplin • May 4 '24

Are all those videocards desktop ones?

Red Book • May 12 '24 • Edited

I came across your benchmark. It's very useful. Here is a result from my machine:

Ryzen 5 7600 128GB + MSI RX 7900 XTX 70.1 tok/s

The total system power draw 478 watts, idle 95 watts.

using Mistral Orca Dpo V2 Instruct v0.2 Slerp 7B Q6_K

Best,

PS I've been thinking to get the M4 Pro 96GB when it's available, just to run 70B models.

This benchmark shows a difference.
twitter.com/ronaldmannak/status/17...

Stefan • Jan 6 • Edited

Thanks for this :)

I think I'm skipping on a new GPU and stick with my RTX 2070S for now.

Initially I was planning on getting a "cheap" RTX 4060 Ti 16GB,
but with the 128 bit memory bus it's probably not much faster
and the RTX 4070 Ti SUPER or RTX 4080 are a bit too expensive for now IMHO.

LM Studio 0.3.6 (Build 8)
model: mistral-7b-finetuned-orca-dpo-v2-mistral-7b-instruct-v0.2-slerp
prompt: Tell me about Mars!

CPU only AMD Ryzen 9 7900@4000 Mhz fixed, DDR5@4800
12.41 tok/sec | 634 tokens | 0.25s to first token

CPU only AMD Ryzen 9 7900@4000 Mhz fixed, DDR5@5400
13.45 tok/sec | 638 tokens | 0.26s to first token

GPU only RTX 2070 SUPER 8GB @ Stock
65.40 tok/sec | 617 tokens | 0.20s to first token
66.35 tok/sec | 621 tokens | 0.06s to first token
10 runs total - these are the min and max in terms of tok/s

I guess the 256 bit memory bus keeps the RTX 2070S somewhat in the game =)

... tested some more ... 1060ti 6GB was a bit disappointing xD

GPU only RTX 1060ti 6GB @ Stock
21.91 tok/sec | 695 tokens | 0.09s to first token
23.01 tok/sec | 618 tokens | 0.04s to first token
min and max of 5 runs by tok/sec

GPU only RTX 3070 8GB @ Stock
63.10 tok/sec | 707 tokens | 0.26s to first token
63.25 tok/sec | 590 tokens | 0.02s to first token
min and max of 5 runs by tok/sec

Curious ... I did expect the 3070 to perform slightly better than the 2070S ... both have the same membus width of 256bit, 8GB and both run at 1750 (14000) Mhz for the memory.

Maxim Saplin • Jan 6

Indeed, memory bandwidth seems to be the key factor

Bharath B • Mar 17 '24

Intel i7 14700k - 9.82 token/s with no GPU offloading(peaked at 35% CPU usage in LMStudio. Guessing issue with multithreading)
Zotac Trinity non-OC 4080 Super - 71.61 tokens/s max GPU offloading

All numbers measured on non-overclocked factory default setup

Maxim Saplin • Mar 17 '24

Thanks for sharing the numbers!

Orlando Arroyo • Mar 20 '24

Indeed there’s something odd with the multithreading of the CPUs

Orlando Arroyo • Mar 15 '24

Adding some info here:

Running on a Razer Blade 2021 with a Ryzen 5900HX, a GF 3070Ti and 16GB RAM, I got 41.75tok/s. I used the same test as you, asking about Mars on the same model.

Hope that adds information to this very interesting topic.

Maxim Saplin • Mar 15 '24

Thanks for the contribution! I assume you used 100% GPU off-loading , right? Just checking:)

Orlando Arroyo • Mar 16 '24

Indeed, 100%GPU off-loading.

I also tested an Ryzen 7950X with 0% off loading, but there’s something odd. I set 32 threads but CPU use is not going beyond 60% and only gets 7tok/s. Any thoughts how about possible cause?

Just for fun, I’ll check with an Asus ROG Ally later (Z1 Extreme version).

Maxim Saplin • Mar 16 '24

Seems the threads param is ignored, I saw same behaviour when testing CPU inference

Orlando Arroyo • Apr 5 '24

Just a quick update: using a RTX 4070 Super gets 58.2tok/s

Oleksandr Davyskyba • May 17 '24

And RTX 4070 TI Super get 62tok/s

Maxim Saplin • May 12 '24

Is that a desktop card?

Melroy van den Berg • May 30 '24

Thank you for testing! Helped me a lot! AMD RX 7900 XTX is doing good..!

Melroy van den Berg • May 30 '24

Anybody with an AMD W7900?

Oliver Stutz • Apr 30 '24

78.51 tok/s with AMD 7900 XTX on RoCm Supported Version of LM Studio with llama 3
33 gpu layers (all while sharing the card with the screen rendering)

Jacek Skowroński • May 10

i5-12600k, Nvidia 1070 8GB.
with GPU: 30.8T/s
without GPU: 9.25T/s
default settings

Jared Goodpasture • Oct 31 '24

Thanks so much for keeping this post up to date 🙏

Gunter Spranz • May 21

AMD 9950X with 9070 XT: 108.34 tok/sec

Oleksandr Davyskyba • Sep 5 '24

Got thinkpad p14s with 7840u and 64gb lpddr5x and with mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF got 15 T/s

Luk • Oct 22 '24

@maximsaplin It seems your table could use an update.

You mention AMD Radeon 780M iGPU.
I have it in my Ryzen 7840hs paired with LPDDR5x-6400 32gb and I am getting for the same model as you consistently ~ 10 T/s with 100% gpu offload. @oleksandr_davyskyba_5a399 is getting 15 T/s which means there is a big variance for this igpu.

Aside from that, I also tested AMD RX 6800XT 16GB GPU using Razer Core X Chroma via USB4/Thunderbolt connected to the same laptop and I am getting consistently ~50 T/s on it so only a very small difference compared to the result you posted, despite using via Thunderbolt.

Maxim Saplin • Oct 25 '24

Thanks! Thunderbolt os PCI-E doesn't seem that important for GPU, for LLMs most of the memory intensive operations are happening inside VRAM with little communication with the outer world (CPU, System RAM)

Juraj Bádal • Jul 16

Your output indicates that Ollama is running the model on your AMD RX 9070 eGPU and everything looks correct:

Total duration: 15.6s
Load duration: 79ms (very fast, typical for a local model already in VRAM)
Prompt eval count: 22 tokens
Prompt eval duration: 113ms
Prompt eval rate: 193.86 tokens/s (prompt tokens are usually processed faster)
Eval count: 1452 tokens
Eval duration: 15.41s
Eval rate: 94.22 tokens/s

Andy Harris • May 5 '24

Here's some additional config data for the list.

Laptop - 7940HS + 32gb RAM + RTX 4070 (8g)
GPU only - RTX 4070 mobile (8GB) = 30.69 T/S
CPU only - 7940HS + 32gb RAM = 8.28 T/S
Note. I'm not sure why the 4070 is posting lower than the 4060 mobile.

Desktop - R5 3600x + 80gb RAM + RX 6800XT (16gb)
GPU only - Radeon RX 6800 XT (16gb) = 52.92 T/S
CPU only - R5 3600x + 80gb RAM = 4.07 T/S

Maxim Saplin • May 7 '24

There're different power levels for 4xxx mobile GPUs - 40-140w. 4070 might be coming with a thinner laptop with TGP at arpind 40w. My 4060 Mobile has 105w TGP