Maxim Saplin

Posted on Oct 12, 2024 • Edited on Feb 17

DDR5 Speed, CPU and LLM Inference

#ai #machinelearning #chatgpt #llm

This is the 3rd part of my investigation into local LLM inference speed. Here are the 1st and 2nd parts.

The speed of LLM inference is memory-bound. But what exactly does this mean? Is there a difference between standard JEDEC 4800MT/s and faster 6000MT/s XMP DDR5 sticks? Let's find out.
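As a back-of-the-envelope reference, the theoretical peak bandwidth of a memory kit can be estimated from its transfer rate. The sketch below assumes a typical consumer dual-channel setup with 64 data bits (8 bytes) per channel; the function name and defaults are illustrative, not taken from any tool used in this post.

```python
def peak_bandwidth_mb_s(mt_per_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> int:
    """Theoretical peak in MB/s: mega-transfers per second * bytes per transfer * channels.

    Assumption: dual-channel DDR5 with a 64-bit (8-byte) data bus per channel.
    """
    return mt_per_s * bytes_per_transfer * channels


for speed in (4800, 6000, 6200):
    print(f"DDR5-{speed}: {peak_bandwidth_mb_s(speed):,} MB/s theoretical peak")
```

Measured bandwidth (for example, as reported by AIDA) will land below these figures, so they are best treated as a ceiling rather than an expectation.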

Test Environment


OS	Windows 11 23H2 (22631.4371)
LLM Inference	LM Studio 0.3.4 (Build 3); 12 threads were used for the 100% CPU off-load tests, Flash Attention was enabled for the 100% GPU off-load tests
CPU	Intel Core i5 13600KF overclocked (performance core multipliers 57x, 56x, 54x, 53x and 2 cores at 54x vs stock multipliers of 51x)
RAM	DDR5 G.Skill 6000MT/s 36-36-36-96, 2x32GB and 2x16GB*
Motherboard	Z790 PG Lightning
GPU	RTX 4090 24GB VRAM, overclocked (+1440MHz mem frequency, +150MHz core) and power limited to 84% (~390W)

*I ran a few tests with 2x16GB and 2x32GB installed together, 96GB in total. Due to CPU/motherboard limitations, XMP frequencies were not achievable with all 4 slots occupied: the maximum stable speed was 4800MT/s with timings 29-30-30-76. Most of the tests used the 2x32GB config.

Models

Mistral 7B: 6 bit Q6_K, 5.94GB mistral-7b-finetuned-orca-dpo-v2-Mistral-7B-Instruct-v0.2-slerp-GGUF, used with 32K context
Llama 3.1 8B: 16 bit, 16.07GB, meta-llama-3.1-8b-instruct.f16.gguf, used with 32K context (instead of supported 128K) to avoid VRAM overflows when measuring GPU for comparison

Results

Bumping DDR5 speed from 4800MT/s to 6000MT/s brought a +20.3% and +23.0% generation speedup (Mistral and Llama respectively).
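One rough way to read these numbers is to assume that generating each token requires streaming all model weights from RAM once. That is a common simplification, not something measured in this post. Under that assumption, read bandwidth divided by model size gives an upper bound on tokens per second. The sketch below uses the model sizes from the Models section and the AIDA read speeds and TPS of the 2-stick rows in the tables below; it treats 1 GB as 1000 MB, which is also an assumption.

```python
# Figures copied from this post (decimal commas converted to points).
MODEL_SIZE_GB = {"Mistral 7B Q6_K": 5.94, "Llama 3.1 8B F16": 16.07}
AIDA_READ_MB_S = {4800: 71032.67, 6000: 87342.67}  # 2-stick configurations
MEASURED_TPS = {
    ("Mistral 7B Q6_K", 4800): 9.66,
    ("Mistral 7B Q6_K", 6000): 11.34,
    ("Llama 3.1 8B F16", 4800): 4.00,
    ("Llama 3.1 8B F16", 6000): 4.74,
}


def tps_upper_bound(read_mb_s: float, model_size_gb: float) -> float:
    """Tokens/s ceiling if every token reads the whole model once (assumption)."""
    return read_mb_s / (model_size_gb * 1000.0)


for (model, speed), measured in MEASURED_TPS.items():
    bound = tps_upper_bound(AIDA_READ_MB_S[speed], MODEL_SIZE_GB[model])
    print(f"{model} @ {speed}MT/s: bound {bound:.2f} t/s, "
          f"measured {measured:.2f} t/s ({measured / bound:.0%} of bound)")
```

If the measured speeds sit reasonably close to this ceiling, memory bandwidth rather than compute is the likely limiting factor, which is what "memory-bound" means in practice.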

Mistral 7B

DDR5	TTFT (Cold), s	TTFT (Warm), s	TPS	READ, MB/s	WRITE, MB/s	COPY, MB/s	Latency, ns
4800 (4 sticks, 96GB)	0,89	0,11	9,42	69019,00	68482,00	69815,67	76,93
4800 (2 sticks, 64GB)	0,88	0,11	9,66	71032,67	71582,67	72058,00	77,70
6000 (2 sticks, 64GB)	0,66	0,09	11,34	87342,67	85591,00	85535,33	70,43
6200 (2 sticks, 64GB)	0,84	0,09	11,93	90268,00	88714,00	88178,67	68,57

			Correl	0,99600	0,99640	0,99644	-0,98861
			R^2	0,99202	0,99282	0,99290	0,97734

Llama 3.1

DDR5	TTFT (Cold), s	TTFT (Warm), s	TPS	READ, MB/s	WRITE, MB/s	COPY, MB/s	Latency, ns
4800 (4 sticks, 96GB)	2,46	0,30	3,86	69019,00	68482,00	69815,67	76,93
4800 (2 sticks, 64GB)	2,38	0,26	4,00	71032,67	71582,67	72058,00	77,70
6000 (2 sticks, 64GB)	2,78	0,22	4,74	87342,67	85591,00	85535,33	70,43
6200 (2 sticks, 64GB)	2,73	0,21	4,87	90268,00	88714,00	88178,67	68,57

			Correl	0,99924	0,99969	0,99983	-0,98161
			R^2	0,99849	0,99939	0,99966	0,96356

Faster DDR5 means faster generation speed
There's a strong linear correlation between tokens per second and AIDA-reported memory speeds. In my case read, write, and copy speeds also correlated with each other, so the data can't tell which particular metric matters most.
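The Correl and R^2 rows can be approximately reproduced with a plain Pearson correlation between the TPS column and each AIDA column. The sketch below uses the rounded Mistral 7B values from the table above, so the output may differ slightly from the figures in the table, which were presumably computed from unrounded data.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Mistral 7B rows: 4800 (4 sticks), 4800 (2 sticks), 6000, 6200
tps = [9.42, 9.66, 11.34, 11.93]
aida = {
    "READ, MB/s": [69019.00, 71032.67, 87342.67, 90268.00],
    "WRITE, MB/s": [68482.00, 71582.67, 85591.00, 88714.00],
    "COPY, MB/s": [69815.67, 72058.00, 85535.33, 88178.67],
    "Latency, ns": [76.93, 77.70, 70.43, 68.57],
}

for name, column in aida.items():
    r = pearson(tps, column)
    print(f"{name}: r = {r:.5f}, R^2 = {r * r:.5f}")
```

With only four data points per model, these coefficients indicate a trend rather than a statistically robust relationship.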

Do Cores/Threads Matter

Not that much. You might be better off with fewer/slower cores yet faster memory:

Threads	TPS	vs 6 threads
1	3,18
2	5,46
3	7,70	73,0%
4	9,42
5	10,30
6	10,55
8	10,83
10	11,04
12	11,35	107,58%

3 threads delivered 73% of the 6-thread speed. Going to 12 threads (which relied on hyper-threading rather than on more physical cores) brought an additional 7.6% boost over the 6-thread baseline.
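The percentages above, and how quickly the gains flatten out, can be checked with a few lines of arithmetic. The sketch below uses the TPS values from the table; the "scaling efficiency" column (speed relative to perfect linear scaling from 1 thread) is an extra illustrative metric, not something reported in this post.

```python
# Threads -> tokens per second, copied from the table above.
TPS_BY_THREADS = {1: 3.18, 2: 5.46, 3: 7.70, 4: 9.42, 5: 10.30,
                  6: 10.55, 8: 10.83, 10: 11.04, 12: 11.35}


def relative_speed(threads: int, baseline_threads: int = 6) -> float:
    """Speed at `threads` as a fraction of the speed at `baseline_threads`."""
    return TPS_BY_THREADS[threads] / TPS_BY_THREADS[baseline_threads]


def scaling_efficiency(threads: int) -> float:
    """Measured speed divided by ideal linear scaling from the 1-thread result."""
    return TPS_BY_THREADS[threads] / (threads * TPS_BY_THREADS[1])


for threads in TPS_BY_THREADS:
    print(f"{threads:>2} threads: {relative_speed(threads):7.2%} of 6-thread speed, "
          f"scaling efficiency {scaling_efficiency(threads):.0%}")
```

The falling efficiency at higher thread counts is consistent with memory bandwidth, not core count, being the bottleneck.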

CPU vs GPU

For reference, here's a comparison of the 6200MT/s CPU results with the RTX 4090 GPU:

	CPU TPS	GPU TPS
Mistral 7B	11,93	112,23
Llama 3.1 8B	4,87	55,46
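To put the gap into a single number, the GPU-to-CPU speed ratio can be computed directly from the table above; this is plain arithmetic on the reported values.

```python
# (CPU TPS at 6200MT/s, GPU TPS on RTX 4090), copied from the table above.
RESULTS = {
    "Mistral 7B": (11.93, 112.23),
    "Llama 3.1 8B": (4.87, 55.46),
}


def gpu_speedup(cpu_tps: float, gpu_tps: float) -> float:
    """How many times faster GPU generation is compared to CPU generation."""
    return gpu_tps / cpu_tps


for model, (cpu_tps, gpu_tps) in RESULTS.items():
    print(f"{model}: GPU is {gpu_speedup(cpu_tps, gpu_tps):.1f}x faster than CPU")
```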

Approach, Notes

After changing the memory config I ran AIDA Memory Tests 3 times and averaged them in the final table
For each model I used the same dialog, every time regenerating the last message "Tell me about Mars"
Recorded 4 results for each model and averaged them
- TTFT Cold - time to first token during the first generation right after the model was loaded
- TTFT Warm - time to the first token in subsequent generations
- For Llama 3.1 at 6200 I actually took only 2 measurements, as I got exhausted waiting for the results; they barely fluctuated anyway
The 4-stick configuration is slower than the 2-stick configuration even with the same speed and timings. Additionally, on consumer hardware, you are unlikely to get any speeds above 4800MT/s with 4 sticks due to motherboard and CPU memory controller limitations. Always try using 2 slots.
The 6200 configuration was an unstable overclock: it failed the OCCT stress test

Top comments (5)

kha84 • Oct 24 '24

Well, there are just three steps left to do:

switch from Windows to Linux
abandon proprietary LMStudio in favor of free and open source Jan.AI (or cortex.cpp)
give it another try with Tensor-RT as an inference backend to get +50% TPS boost to your 4090 figures

Maxim Saplin • Oct 25 '24

Jan.AI supports TensorRT, any benchmarks putting it up against llama.cpp?

kha84 • Oct 28 '24

jan.ai/post/benchmarking-nvidia-te...

Maxim Saplin • Oct 29 '24

The benchmarks look impressive. I installed Jan and the TensorRT dependencies and discovered that there's only a handful of prebuilt models available in Jan; there are also no Llama 3 models in there, which was a surprise. I also tried to figure out how to get Nvidia's Nemotron 70B (supposedly should be easy) and failed after 10 minutes of research. With that kind of model support it's no competitor to the llama.cpp ecosystem :(

kha84 • Oct 31 '24

To be honest, I haven't had a chance to play with it myself yet - I just lost access to a good rig with a powerful Nvidia GPU :( I've read that models can be somehow "compiled" or processed to be used with Tensor-RT. Of course the list of supported models is not that wide compared to gguf, but imho it needs to be tested in real life, and the speed bump may be worth it.
I will rent a machine with a GPU to test it out myself.