The 70B model is available now, but you need 8 GPU cards to run it. An AWS g4dn.metal instance has that (plus 96 CPU cores). However, monitoring with nvidia-smi, I see that my GPUs are only 35% utilized with the 70B model (and even less with the smaller models). I also notice that if I leave -t unspecified, it uses 96 threads, which actually slows things down drastically. I found that -t 4 is about as good as it gets, leaving me with 92 CPU cores that I can't use but pay dearly for! Any idea how to use the resources more fully, what causes the apparent contention at high thread counts, or why GPU utilization can't reach at least 80%?
If you are running at that scale, it may be better to use Hugging Face Transformers with all the optimizations (cross-GPU inference via Optimum), or to host an OpenAI-compatible server across all 8 GPUs with vLLM.
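For reference, here is a minimal sketch of what sharding across all 8 GPUs with vLLM's offline Python API might look like; the model id and sampling settings are illustrative, not taken from this thread:

```python
# Hypothetical example: shard a 70B model across 8 GPUs using vLLM's
# tensor parallelism. Model id and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # assumed model; substitute your own
    tensor_parallel_size=8,             # split the model across all 8 GPUs
)

params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible server takes the same tensor-parallel setting (in older releases it was launched via the `vllm.entrypoints.openai.api_server` module with `--tensor-parallel-size 8`); check the vLLM docs for the exact entrypoint in your version.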
Llama.cpp is more about running LLMs on machines that otherwise couldn't, whether because of CPU limitations, lack of memory, GPU limitations, or some combination of these. If you can afford a machine with 8 GPUs and are going to run it at scale, vLLM or cross-GPU inference via Transformers and Optimum are your best options.
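As a rough sketch of the cross-GPU route, assuming plain Transformers with Accelerate's `device_map="auto"` sharding rather than Optimum specifically (the model id and generation settings are placeholders):

```python
# Hypothetical example: spread a large model across all visible GPUs with
# Transformers + Accelerate. Note: on 8x 16 GB T4s a 70B model in fp16 will
# not fit, so 8-bit or 4-bit quantization (e.g. load_in_8bit=True) would
# likely be needed; that detail is an assumption, not from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halve the memory footprint vs. fp32
    device_map="auto",          # let Accelerate shard layers over the GPUs
)

inputs = tokenizer("The llama is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```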