Maxim Saplin

llama.cpp: CPU vs GPU, shared VRAM and Inference Speed

NVIDIA GPUs offer a Shared GPU Memory feature for Windows users, which allocates up to 50% of system RAM as virtual VRAM. If your GPU runs out of dedicated video memory, the driver can implicitly spill into system memory instead of throwing out-of-memory errors, so application execution is not interrupted. Yet there is a performance toll.

Memory is the key constraint when dealing with LLMs, and VRAM is far more expensive than ordinary DDR4/DDR5 system memory. E.g. an RTX 4090 with 24GB of GDDR6X on board costs around $1,700, while an RTX 6000 with 48GB of GDDR6 goes above $5,000. Two sticks of G.Skill DDR5 with a total capacity of 96GB will cost you around $300. My workstation has an RTX 4090 and 96GB of RAM, making 72GB of video memory available to the video card (24GB dedicated + 48GB shared). Does it make sense to fill your PC with as much RAM as possible and have your LLM workloads use Shared GPU Memory?

(Screenshot: total video memory)

I have already tested how GPU memory overflowing into RAM influences LLM training speed. This time I've tried inference via LM Studio/llama.cpp using a 4-bit quantized Llama 3.1 70B, which takes up 42.5GB.
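As a quick sanity check on that figure: "4-bit" quants don't store exactly 4 bits per weight; popular mixed quants average a bit under 5 bits across tensors, which is why a 70B model lands around 42GB instead of 35GB. A minimal sketch (the ~4.85 bits/weight value is an assumed average, not a property of any specific quant):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
# The ~4.85 bits/weight below is an assumed average for a "4-bit" mixed quant.
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"{quantized_size_gb(70, 4.85):.1f} GB")  # ~42.4 GB, close to the 42.5GB file
```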

LM Studio (a wrapper around llama.cpp) offers a setting for selecting the number of layers to offload to the GPU, with 100% making the GPU the sole processor. At the same time, you can choose to keep some of the layers in system RAM and have the CPU do part of the computations; the main purpose is to avoid VRAM overflows. Yet, as already mentioned, on Windows (not Linux) it is possible to overflow VRAM into shared memory.

(Screenshot: LM Studio offload setting)
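If you drive llama.cpp directly instead of through LM Studio, the same knob is exposed as "n_gpu_layers" ("-ngl" on the CLI). Here is a minimal llama-cpp-python sketch; the model filename is a placeholder, and the layer counts assume Llama 3.1 70B's 80 transformer layers:

```python
# Selecting GPU offload when using llama.cpp via llama-cpp-python.
# The model path is a placeholder; layer counts assume Llama 3.1 70B's 80 layers.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1: offload everything to the GPU ("100% GPU")
    # n_gpu_layers=40, # ~half of the layers on the GPU ("50/50 GPU/CPU")
    # n_gpu_layers=0,  # keep all layers on the CPU ("100% CPU")
    n_ctx=4096,
)

out = llm("Write one sentence about VRAM.", max_tokens=32)
print(out["choices"][0]["text"])
```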

I tried 3 offload settings:

  • 100% GPU - roughly half of the model weights stayed in dedicated VRAM while the other half overflowed into shared memory (system RAM)
  • 50% GPU and 50% CPU - this setting filled VRAM almost completely without overflowing into shared memory
  • 100% CPU

Here's my hardware setup:

  • Intel Core i5 13600KF (OC to 5.5GHz)
  • 96GB DDR5 RAM at 4800MT/s (CL 30, RCD 30, RCDW 30, RP 30 ~ 70GB/s Read/Write/Copy at AIDA)
  • RTX 4090 24GB VRAM (OC, core at 3030MHz, VRAM +1600MHz ~37000 GPU Score at TimeSpy)

And here are the results:

                 Tokens/s   Time to first token (s)   RAM used (GB)
100% GPU          0.69       4.66                      60
50/50 GPU/CPU     2.32       0.42                      42
100% CPU          1.42       0.71                      42

Please note that for the time to the first token I used the "warm" metric, i.e. the time measured on the second generation (load the model, generate a completion, then click "regenerate"). For the cold time to the first token I got:

  • 100% GPU ~6.9s
  • 50/50 CPU/GPU ~2.4s
  • 100% CPU ~30s

Besides, with 100% GPU offload, ~20GB more system RAM was used (regardless of whether "use_mlock" was set or not).
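A back-of-the-envelope sanity check, assuming token generation is purely memory-bandwidth bound (all 42.5GB of weights read once per generated token): the 100% CPU result sits close to the ceiling the ~70GB/s DDR5 allows, while the 100% GPU run with overflow ends up far below it, presumably because the spilled half of the weights has to come over the PCIe bus on every token.

```python
# Upper bound on tokens/s if generation is purely memory-bandwidth bound:
# bandwidth / bytes read per token (here: the whole 4-bit model per token).
ram_bandwidth_gb_s = 70.0   # AIDA read/copy figure for this DDR5 setup
model_size_gb = 42.5        # 4-bit Llama 3.1 70B

print(f"~{ram_bandwidth_gb_s / model_size_gb:.1f} tok/s")  # ~1.6; measured on CPU: 1.42
```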

Apparently, there's not much point in Shared GPU Memory for inference: explicitly keeping the overflow layers on the CPU (50/50) was more than 3x faster than letting the driver spill VRAM into system RAM (100% GPU).

P.S. Screenshots:

  • 100% GPU

  • 50/50 CPU/GPU

  • 100% CPU
