Discussion on: How to run Llama 2 on anything

Chandler

If you're running at that scale, you're better off using Hugging Face Transformers with its optimizations (cross-GPU inference via Optimum) or hosting an OpenAI-compatible server across all 8 GPUs with vLLM.
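
As a rough illustration, here's a minimal sketch of sharding a model across 8 GPUs with vLLM's Python API; the checkpoint name and sampling settings are placeholder assumptions, not recommendations:

```python
# Minimal sketch, assuming an 8-GPU node and a Llama 2 checkpoint
# you have access to (the model name here is just an example).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # example checkpoint, swap in yours
    tensor_parallel_size=8,             # shard the model across all 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible server route is similar, just launched from the CLI instead: something like `python -m vllm.entrypoints.openai.api_server --model <model> --tensor-parallel-size 8`.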

Llama.cpp is more about running LLMs on machines that otherwise couldn't: limited CPU, not enough memory, a weak GPU or none at all, or some combination of those (see the sketch below). If you can afford a machine with 8 GPUs and are going to run it at scale, vLLM or cross-GPU inference via Transformers and Optimum are your best options.
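
For contrast, here's a minimal sketch of the resource-constrained case using the llama-cpp-python bindings with a quantized GGUF model; the file path is hypothetical:

```python
# Minimal sketch, assuming llama-cpp-python is installed and a quantized
# GGUF model is on disk (the path below is a hypothetical example).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # example quantized model file
    n_ctx=2048,       # context window size
    n_gpu_layers=0,   # 0 = pure CPU; raise this to offload layers to a GPU
)

out = llm("Q: What is llama.cpp good for? A:", max_tokens=128)
print(out["choices"][0]["text"])
```

The point of the quantized GGUF format is that a 7B model can fit in a few GB of RAM, which is exactly the "machine that otherwise couldn't" scenario.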