Running Large Language Models on the CPU

Youdiowei Eteimorde

ChatGPT is at capacity

At some point while using ChatGPT, you might encounter the error shown above. It appears because running a large language model like ChatGPT demands substantial computational resources, which are typically provided by specialized hardware known as Graphics Processing Units (GPUs).

Unlike regular programs, large language models cannot run efficiently on Central Processing Units (CPUs). Operating models of this size is expensive, an expense that usually only large corporations like Google and Microsoft can afford.

There are two computationally intensive tasks involved in working with a language model:

  • Training
  • Inference

Training is the process of teaching a language model how to perform its intended task. It is the more computationally demanding of the two, consuming both time and money.

Inference, on the other hand, is the use of a trained large language model. Whenever you interact with ChatGPT, you are using it in inference mode. Although less computationally intensive than training, inference on LLMs is still relatively expensive because it requires GPUs, especially at the scale of ChatGPT.

Considering all of this, you might be pondering whether it's feasible to run Large Language Models on a CPU. The answer is yes, at least for inference. In this article, we will delve into the recent advancements, techniques, and technology that have enabled LLMs to operate using nothing more than regular CPUs.

Introducing LLaMa


LLaMa, or Large Language Model Meta AI, is a lightweight and efficient open-source Large Language Model developed by Meta. It is designed to deliver performance similar to models that are ten times its size.

An essential aspect of Large Language Models (LLMs) is their parameters, which play a crucial role in determining their performance. LLaMa introduces various versions with differing parameter counts, including a variant with 7 billion parameters and another boasting around 65 billion parameters.

Historically, enhancing LLM performance involved augmenting the number of parameters. This adheres to the scaling law, a strategy that has proven successful for models like the GPT series.

However, this strategy poses a challenge due to the inherent relationship between parameter increase and computational demands. As parameter count grows, so does the computational workload, presenting a trade-off between model complexity and processing efficiency.

Recent research has indicated a shift in the approach to enhancing LLM performance. Instead of solely increasing the number of parameters, an effective alternative is to train on more data. LLaMa adopts this strategy, relying on a larger volume of training data to achieve its performance.

LLaMa marked a groundbreaking achievement in the realm of LLMs. It was trained on 2048 A100 GPUs, but the big breakthrough wasn't its training; it was its inference step. The LLaMa model can be run on a single GPU during inference, a distinct advantage compared to other LLMs that demand multiple GPUs for operation.

Although this single-GPU capability was remarkable, it is still a far cry from running on the CPU. To enable a lightweight LLM like LLaMa to run on the CPU, a clever technique known as quantization comes into play. But before we dive into the concept of quantization, let's first understand how LLMs store their parameters.

How are LLM parameters stored?

The parameters of a Large Language Model (LLM) are commonly stored as floating-point numbers. The majority of LLMs utilize 32-bit floating-point numbers, also known as single precision. However, certain layers of the model may employ different numerical precision levels, such as 16-bit.

Let's perform a simple mathematical calculation:

8 \text{ bits} = 1 \text{ byte} \quad 32 \text{ bits} = 4 \text{ bytes}

There are 8 bits in one byte, which means 32 bits equals 4 bytes. Consider a LLaMa model with 7 billion parameters, where these parameters are stored as 32-bit floats. To calculate the total memory required, we can multiply 4 bytes by 7 billion:

4 \, \text{bytes} \times 7 \, \text{billion} = 28 \text{ billion bytes}

28 billion bytes is equivalent to 28 gigabytes (GB).

This implies that utilizing the model would require a memory capacity of 28 gigabytes (GB). However, consumer devices typically do not have that much RAM available. We need to reduce the memory consumption of the LLM, and this is where quantization becomes relevant.
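As a quick sanity check, here is the same back-of-the-envelope arithmetic as a few lines of Python. It only counts the weights themselves and ignores activations and other runtime overhead:

params = 7_000_000_000          # LLaMa 7B
bytes_per_param = 32 // 8       # 32-bit floats -> 4 bytes per parameter

total_bytes = params * bytes_per_param
print(total_bytes / 1e9, "GB")  # 28.0 GB just to hold the weights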

What is Quantization?

Quantization is the process of reducing the precision of the parameters of a LLM. In this context, it involves taking the existing parameters, which are typically stored as 32-bit floating-point numbers, and converting them to 4-bit integers. This reduction in precision significantly decreases the memory consumption required to store the model. While there might be a minor degradation in the model's performance, this impact is generally quite small and often negligible or unnoticeable.

Consider an array of values in floating-point representation:

[0.333, 0.98, 0.234]

When we apply quantization, these values are converted to integers:

[2, 15, 0]

While the two sets of numbers are different, they retain their meaning. For instance, the second value remains the largest in both arrays, and the third value is still the smallest.

Quantization can be likened to expressing values in a manner similar to using percentages. For example, someone might express a need for "3200 out of the 10000 cans". This expression can be simplified by stating the need as "32 percent of the cans". Although the specific numerical values differ, the conveyed information remains identical. This analogy shows how quantization simplifies value representation while retaining their inherent meaning.
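To make the idea concrete, here is a minimal sketch of one simple scheme, min-max (affine) quantization, applied to the array above. The real formats used by llama.cpp (such as q4_1) work block by block and store a scale and offset per block, but the principle is the same, and this sketch reproduces the mapping shown earlier:

values = [0.333, 0.98, 0.234]

# Map the floats onto the 16 levels a 4-bit integer can represent (0..15).
lo, hi = min(values), max(values)
scale = (hi - lo) / 15

quantized = [round((v - lo) / scale) for v in values]
print(quantized)                              # [2, 15, 0]

# At inference time the weights are "dequantized" back to approximate floats.
dequantized = [round(q * scale + lo, 3) for q in quantized]
print(dequantized)                            # [0.333, 0.98, 0.234] (approximately)

Each original value is recovered only approximately, which is where the small loss in model quality comes from.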

Applying 4-bit quantization to LLaMa 7B

Let's apply 4-bit quantization to the LLaMa model, which has 7 billion parameters, assuming that all parameters are originally stored as 32-bit floats.

8 \text{ bits} = 1 \text{ byte} \quad 4 \text{ bits} = 0.5 \text{ bytes}

Given that 1 byte equals 8 bits, 4 bits are equivalent to 0.5 bytes. Therefore, to determine the memory consumption after applying 4-bit quantization to a Large Language Model with 7 billion parameters, we multiply 0.5 bytes by 7 billion.

7,000,000,000 \times 0.5 \, \text{bytes} = 3,500,000,000 \, \text{bytes}

After performing our calculation, we find that the memory consumption is 3.5 billion bytes, equivalent to 3.5 GB. This indicates that it's possible to run the LLaMa 7-billion-parameter model on a device with a little more than 3.5 GB of available RAM.
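Repeating the earlier Python estimate with 4 bits per parameter shows the eightfold reduction:

params = 7_000_000_000          # LLaMa 7B
bytes_per_param = 4 / 8         # 4-bit integers -> 0.5 bytes per parameter

print(params * bytes_per_param / 1e9, "GB")   # 3.5 GB, down from 28 GB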

The GGML ecosystem

Another key technology that has contributed to running LLMs on CPUs is GGML, which stands for Georgi Gerganov Machine Learning, named after its creator.

GGML is a machine learning library that implements tensor operations in C. Due to its implementation in C, it can facilitate the execution of ML models on diverse platforms. Additionally, GGML offers support for model quantization.

GGML was initially used to develop Whisper.cpp, a project designed to enable the execution of OpenAI's Whisper model on compact devices such as smartphones. This laid the foundation for a subsequent project named llama.cpp, which makes it possible to run LLaMa models on similarly modest hardware.

Furthermore, GGML serves as a file format that enables the storage of model information within a single file.

Introduction to Llama.cpp

Llama.cpp is a runtime for LLaMa-based models that enables inference to be performed on the CPU, provided that the device has sufficient memory to load the model. It is written in C++ and utilizes the GGML library to execute tensor operations and carry out quantization processes.

Although llama.cpp is written in C++, bindings exist for several programming languages. The Python binding is referred to as llama-cpp-python.

Working with llama-cpp-python

Let's put into practice everything we've learned so far. We will utilize llama-cpp-python to execute a LLaMa 2-7B model. You can run this example locally if you have approximately 6.71 GB of available memory. If you don't, you can use the link below to run the code in Google Colab.

Open In Colab

Firstly, we need to obtain our model, convert it into GGML format, and then proceed to quantize it. However, there's no need to be concerned about this step since we can readily acquire an already quantized model from HuggingFace. A user named TheBloke has quantized numerous models.

wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_1.bin

I used the command-line tool wget to download the model. This specific variant of LLaMa has undergone fine-tuning similar to ChatGPT.

pip install llama-cpp-python

Next, we proceed to install llama-cpp-python. Once the installation is complete, you have the option to either create a Python file to contain your code or utilize the Python interpreter for execution.

from llama_cpp import Llama

LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_1.bin")

We import the Llama class and instantiate it with the path to our downloaded model, which loads the weights into memory. Following that, we pass our prompt to the model.

prompt = "Tell me about the Python programming language? "

output = LLM(prompt)

Llama-cpp-python offers an API that is similar to the one provided by OpenAI. Here's the response generated for our prompt:

{'id': 'cmpl-4c10c54e-f6a1-4d80-87af-f63f19ce96c2',
 'object': 'text_completion',
 'created': 1692538402,
 'model': './llama-2-7b-chat.ggmlv3.q4_1.bin',
 'choices': [{'text': '\npython programming language\nThe Python programming language is a popular, high-level programming language that is used for a wide range of applications, such as web development, data analysis, artificial intelligence, scientific computing, and more. Here are some key features and benefits of Python:\n\n1. **Easy to learn**: Python has a simple syntax and is relatively easy to learn, making it a great language for beginners.\n2. **High-level language**: Python is a high-level language, meaning it abstracts away many low-level details, allowing developers to focus on the logic of their code rather than',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 10, 'completion_tokens': 128, 'total_tokens': 138}}

The text returned from the model is this:

python programming language
The Python programming language is a popular, high-level programming language that is used for a wide range of applications, such as web development, data analysis, artificial intelligence, scientific computing, and more. Here are some key features and benefits of Python:

  1. Easy to learn: Python has a simple syntax and is relatively easy to learn, making it a great language for beginners.
  2. High-level language: Python is a high-level language, meaning it abstracts away many low-level details, allowing developers to focus on the logic of their code rather than

The response received is actually quite satisfactory. However, it's essential to bear in mind that this particular model isn't equivalent to ChatGPT in terms of capabilities. For better quality, consider using a model quantized at higher precision, such as the 8-bit version. For even more advanced results, the 13-billion-parameter models could be beneficial. Prompt engineering techniques can also enhance the quality of the responses obtained.
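One more practical note: the response above was cut short because generation hit the default token limit (notice that finish_reason is 'length' and completion_tokens is 128). If your version of llama-cpp-python exposes the same call parameters as the one used here (max_tokens, temperature and stop are the usual names, but treat them as an assumption to check against your installed version), you can tune the generation roughly like this:

from llama_cpp import Llama

LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_1.bin")

output = LLM(
    "Tell me about the Python programming language? ",
    max_tokens=256,      # allow a longer completion so the answer isn't cut off
    temperature=0.7,     # lower values make the output more focused and repeatable
    stop=["\n\n\n"],     # optional: stop early if the model starts rambling
)

print(output["choices"][0]["text"])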


The ability to run Large Language Models on the CPU represents a significant breakthrough in the field. It opens the door to various applications, benefiting small businesses, researchers, hobbyists, and individuals who prefer not to share their data with third-party organizations. This area is developing quickly, and in the coming years, if not months, we can expect LLMs to become even more accessible.

Here are a few useful resources to help expand your knowledge on running LLMs on CPU.

  • The blog post and paper that introduced LLaMa.
  • Check out the ggml website and read their manifesto to learn more about the project's philosophy.
  • Watch this video to have an understanding of the hardware ChatGPT runs on.
  • To get a more in-depth overview of quantization, watch this video.
  • To understand the GGML file format read this.

Top comments (4)

David (scenaristeur)

ggml does not seem to work with llama2:

gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from ./llama-2-7b-chat.ggmlv3.q4_1.bin

llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/smag/dev/llm_on_cpu/first_test.py", line 3, in <module>
    LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_1.bin")
  File "/home/smag/.local/lib/python3.10/site-packages/llama_cpp/llama.py", line 323, in __init__
    assert self.model is not None
AssertionError

perhaps should migrate to GGUF: huggingface.co/TheBloke/Llama-2-13...

Youdiowei Eteimorde (eteimz)

Hi David, thanks for the heads up. I will update the article.

cedric (sedigo)

How do I get the binary from a model on Hugging Face? I am trying to download the binary of the GGUF model, in line with the suggestion to switch from GGML to GGUF, but I am unable to see how I can download the GGUF version of the same model you used.

Youdiowei Eteimorde (eteimz) • Edited

Hi @sedigo, sorry for the late reply. The wget option I used in the article doesn't appear to be working anymore, so you have to use huggingface_hub. Here's an example:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="TheBloke/Llama-2-13B-GGUF", filename="llama-2-13b.Q4_0.gguf")

What you need is the Hugging Face repo ID and the file name you want to download. Here's the repo I used for this example; I then selected the file I wanted to download.

When you download it, it gives you the file path of the model. This was my file path: /root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-GGUF/snapshots/b106d1c018ac999af9130b83134fb6b7c5331dea/llama-2-13b.Q4_0.gguf. Here's an example of using the model for inference.

from llama_cpp import Llama

model_path = "/root/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-GGUF/snapshots/b106d1c018ac999af9130b83134fb6b7c5331dea/llama-2-13b.Q4_0.gguf"

LLM = Llama(model_path=model_path)

prompt = "Tell me about the Python programming language? "
output = LLM(prompt)