Discussion on: How to run Llama 2 on anything

Chandler

If you're running at that scale, you're better off using Hugging Face Transformers with its optimizations (cross-GPU inference via Optimum) or hosting an OpenAI-compatible server across all 8 GPUs with vLLM.
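
As a rough illustration, here's a minimal sketch of sharding a model across 8 GPUs with vLLM's Python API; the checkpoint name and sampling settings are placeholder assumptions, not recommendations:

```python
# Minimal sketch, assuming an 8-GPU node and a Llama 2 checkpoint
# you have access to (the model name here is just an example).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # example checkpoint, swap in yours
    tensor_parallel_size=8,             # shard the model across all 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible server route is similar, just launched from the CLI instead: something like `python -m vllm.entrypoints.openai.api_server --model <model> --tensor-parallel-size 8`.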

Llama.cpp is more about running LLMs on machines that otherwise couldn't: limited CPU, not enough memory, a weak GPU or none at all, or some combination of those (see the sketch below). If you can afford a machine with 8 GPUs and are going to run it at scale, vLLM or cross-GPU inference via Transformers and Optimum are your best options.
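
For contrast, here's a minimal sketch of the resource-constrained case using the llama-cpp-python bindings with a quantized GGUF model; the file path is hypothetical:

```python
# Minimal sketch, assuming llama-cpp-python is installed and a quantized
# GGUF model is on disk (the path below is a hypothetical example).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # example quantized model file
    n_ctx=2048,       # context window size
    n_gpu_layers=0,   # 0 = pure CPU; raise this to offload layers to a GPU
)

out = llm("Q: What is llama.cpp good for? A:", max_tokens=128)
print(out["choices"][0]["text"])
```

The point of the quantized GGUF format is that a 7B model can fit in a few GB of RAM, which is exactly the "machine that otherwise couldn't" scenario.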