Maxim Saplin

DDR5 Speed, CPU and LLM Inference

This is the 3rd part of my investigation into local LLM inference speed. Here are the 1st and 2nd parts.

The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.
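One common way to read "memory-bound" is that generating each token requires streaming roughly all of the model weights from RAM once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below applies that rule of thumb to figures reported later in this post (the AIDA read bandwidth at 6000MT/s and the model file sizes). It is a back-of-the-envelope assumption rather than a measurement: it ignores KV cache traffic, CPU caches, and compute time, it treats 1 GB as 1000 MB, and `bandwidth_bound_tps` is a hypothetical helper name.

```python
def bandwidth_bound_tps(read_mb_per_s: float, model_size_gb: float) -> float:
    """Rough ceiling on tokens/s if every token reads all weights once.

    Assumptions: 1 GB = 1000 MB; KV cache traffic and compute are ignored.
    """
    return read_mb_per_s / (model_size_gb * 1000.0)


# Figures taken from this post: AIDA read bandwidth at 6000MT/s,
# model file sizes, and measured tokens per second at 6000MT/s.
read_6000 = 87342.67
models = [
    ("Mistral 7B Q6_K", 5.94, 11.34),
    ("Llama 3.1 8B F16", 16.07, 4.74),
]

for name, size_gb, measured_tps in models:
    ceiling = bandwidth_bound_tps(read_6000, size_gb)
    share = measured_tps / ceiling
    print(f"{name}: ceiling ~{ceiling:.1f} tok/s, "
          f"measured {measured_tps} tok/s ({share:.0%} of ceiling)")
```

Under these assumptions the measured speeds sit below the estimated ceiling but reach most of it, which is what one would expect if memory bandwidth is the main limit.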

Test Environment

  • OS: Windows 11 23H2 (22631.4371)
  • LLM Inference: LM Studio 0.3.4 (Build 3); 12 threads were used when testing 100% CPU off-load, and Flash Attention was enabled when testing 100% GPU off-load
  • CPU: Intel Core i5 13600KF, overclocked (performance core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs stock multipliers of 51x)
  • RAM: DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB*
  • Motherboard: Z790 PG Lightning
  • GPU: RTX 4090 24GB VRAM, overclocked (+1440MHz mem frequency, +150MHz core) and power limited to 84% (~390W)

*I ran a few tests with 2x16GB and 2x32GB installed together, for a total of 96GB. Due to CPU/motherboard limitations, XMP frequencies could not be reached with all 4 slots occupied; the maximum stable speed was 4800MT/s with timings of 29-30-30-76. Most of the tests used the 2x32GB config.

Models

  • Mistral 7B: 6 bit Q6_K, 5.94GB, mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
  • Llama 3.1 8B: 16 bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of supported 128K) to avoid VRAM overflows when measuring GPU for comparison

Results

Bumping DDR5 speed from 4800MT/s to 6000MT/s brought a +20.3% and +23.0% generation speedup (Mistral and Llama, respectively).

Mistral 7B

| DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
|---|---|---|---|---|---|---|---|
| 4800 (4 sticks, 96GB) | 0.89 | 0.11 | 9.42 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
| 4800 (2 sticks, 64GB) | 0.88 | 0.11 | 9.66 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
| 6000 (2 sticks, 64GB) | 0.66 | 0.09 | 11.34 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
| 6200 (2 sticks, 64GB) | 0.84 | 0.09 | 11.93 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
| Correl (with TPS) | | | | 0.99600 | 0.99640 | 0.99644 | -0.98861 |
| R^2 | | | | 0.99202 | 0.99282 | 0.99290 | 0.97734 |

Llama 3.1

| DDR5 | TTFT (Cold), s | TTFT (Warm), s | TPS | READ, MB/s | WRITE, MB/s | COPY, MB/s | Latency, ns |
|---|---|---|---|---|---|---|---|
| 4800 (4 sticks, 96GB) | 2.46 | 0.30 | 3.86 | 69019.00 | 68482.00 | 69815.67 | 76.93 |
| 4800 (2 sticks, 64GB) | 2.38 | 0.26 | 4.00 | 71032.67 | 71582.67 | 72058.00 | 77.70 |
| 6000 (2 sticks, 64GB) | 2.78 | 0.22 | 4.74 | 87342.67 | 85591.00 | 85535.33 | 70.43 |
| 6200 (2 sticks, 64GB) | 2.73 | 0.21 | 4.87 | 90268.00 | 88714.00 | 88178.67 | 68.57 |
| Correl (with TPS) | | | | 0.99924 | 0.99969 | 0.99983 | -0.98161 |
| R^2 | | | | 0.99849 | 0.99939 | 0.99966 | 0.96356 |

  • Faster DDR5 means faster generation speed
  • There's a STRONG linear correlation between tokens per second and AIDA-reported memory speeds (in my case read, write, and copy speeds were also correlated with each other, so the data can't tell which particular metric matters most)
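As an illustration of how the "Correl" and "R^2" rows can be obtained, here is a minimal sketch that computes the Pearson correlation between TPS and two of the AIDA columns for Mistral 7B. It uses the rounded values from the table above, so the result may differ from the reported figures in the later decimal places; the `pearson` helper is written out by hand for illustration and is not necessarily how the original numbers were produced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Mistral 7B rows from the table above, in order:
# 4800 (4 sticks), 4800 (2 sticks), 6000, 6200.
tps = [9.42, 9.66, 11.34, 11.93]
read_mb_s = [69019.00, 71032.67, 87342.67, 90268.00]
latency_ns = [76.93, 77.70, 70.43, 68.57]

r_read = pearson(tps, read_mb_s)
r_latency = pearson(tps, latency_ns)
print(f"TPS vs READ:    r = {r_read:.3f}, R^2 = {r_read ** 2:.3f}")
print(f"TPS vs Latency: r = {r_latency:.3f}, R^2 = {r_latency ** 2:.3f}")
```

With only four data points per model, these coefficients show a clear trend but should not be read as a precise statistical estimate.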

AIDA Memory Tests

Do Cores/Threads Matter?

Not that much. You might be better off with fewer/slower cores yet faster memory:

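| Threads | TPS | vs 6 threads |
|---|---|---|
| 1 | 3.18 | |
| 2 | 5.46 | |
| 3 | 7.70 | 73.0% |
| 4 | 9.42 | |
| 5 | 10.3 | |
| 6 | 10.55 | (baseline) |
| 8 | 10.83 | |
| 10 | 11.04 | |
| 12 | 11.35 | 107.58% |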

3 cores/threads delivered 73% of the 6 cores/threads result. 12 threads (the extra ones relied on hyper-threading rather than on more physical cores) brought an additional 7.6% boost over the 6-core baseline.
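To make the diminishing returns easier to see, the following sketch computes the speedup over a single thread and the per-thread efficiency (speedup divided by thread count) from the thread table above. The `scaling` helper is a hypothetical name introduced for illustration.

```python
# Threads -> tokens per second, copied from the table above.
tps_by_threads = {1: 3.18, 2: 5.46, 3: 7.70, 4: 9.42, 5: 10.3,
                  6: 10.55, 8: 10.83, 10: 11.04, 12: 11.35}


def scaling(results):
    """Return {threads: (speedup vs 1 thread, per-thread efficiency)}."""
    base = results[1]
    return {n: (tps / base, tps / base / n) for n, tps in results.items()}


for n, (speedup, efficiency) in scaling(tps_by_threads).items():
    print(f"{n:>2} threads: {speedup:.2f}x speedup, "
          f"{efficiency:.0%} per-thread efficiency")
```

Per-thread efficiency falls from about 74% at 4 threads to about 30% at 12 threads, which is consistent with memory bandwidth, not core count, being the limiting factor.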

CPU vs GPU

For reference, here's a comparison of the 6200MT/s CPU results with the RTX 4090 GPU:

| Model | CPU TPS | GPU TPS |
|---|---|---|
| Mistral 7B | 11.93 | 112.23 |
| Llama 3.1 8B | 4.87 | 55.46 |

Approach, Notes

  • After changing the memory config, I ran the AIDA Memory Tests 3 times and averaged the results in the final table
  • For each model I used the same dialog, each time regenerating the reply to the last message, "Tell me about Mars"
  • Recorded 4 results for each model and averaged them
    • TTFT Cold - time to first token during the first generation right after the model was loaded
    • TTFT Warm - time to first token in subsequent generations
    • I actually did only 2 measurements of Llama 3.1 at 6200 and got exhausted waiting for the results; in any case, they barely fluctuated
  • The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware, you are unlikely to get speeds above 4800MT/s with 4 sticks due to motherboard and CPU memory controller limitations. Always try using 2 slots.
  • 6200 was an unstable overclock; it failed the OCCT stress test

LM Studio Screenshot
