It's been fun playing with LLMs on a CPU. However, the novelty wears off as I watch the completion slowly appear word by word. Enter the GPU: I have an older Ubuntu gaming laptop with a GPU that I purchased for machine learning (I haven't played a game since Doom in the early 90s). Enabling LLM software to run on GPUs can be tricky because it is system- and hardware-dependent. This article shows how I run llamafile on an NVIDIA RTX 2060. The examples use llamafile, NVIDIA CUDA, Ubuntu 22.04, and Docker.
Check the GPU and NVIDIA CUDA software
Check if CUDA is installed. NVIDIA provides a utility to show the status of the GPU and CUDA.
% nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 44C P8 8W / 90W | 1322MiB / 6144MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
If you see similar output, CUDA is installed. Note the CUDA version. The CUDA version and compute capability determine which base image to use when building your image. NVIDIA provides charts of the compute capabilities of its devices. Choose the menu item for your device; for example, I have a CUDA-enabled GeForce card, an RTX 2060.
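If your driver is recent enough, nvidia-smi can also report the compute capability directly. This query field is an assumption about newer driver releases, so fall back to NVIDIA's charts if your driver doesn't support it.
% nvidia-smi --query-gpu=name,compute_cap --format=csv
For reference, the RTX 2060 is a Turing card with compute capability 7.5.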
Configuring Docker
The NVIDIA Container Toolkit enables Docker to use the GPU. NVIDIA provides detailed instructions for installing the Container Toolkit. If you're impatient, the abbreviated steps are below.
% distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
% sudo apt-get update
% sudo apt-get install -y nvidia-docker2
In Linux, update /etc/docker/daemon.json to configure the Docker Engine daemon and register the NVIDIA container runtime. If daemon.json is absent, create it with the following content.
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
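Recent releases of the Container Toolkit also ship an nvidia-ctk utility that can write this configuration for you. If it is available on your system, the following should be equivalent to editing the file by hand.
% sudo nvidia-ctk runtime configure --runtime=docker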
Restart Docker to apply the changes.
% sudo systemctl restart docker
Test that the runtime is working.
% docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
If it's working correctly, you should see output similar to the following.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2060 Off | 00000000:01:00.0 On | N/A |
| N/A 44C P8 10W / 90W | 1530MiB / 6144MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Building the Llamafile Image
Let's break down the Dockerfile. I'll go through each part and explain the choices I made. The Dockerfile is a multi-stage build, and we'll start with building the llamafile binaries.
FROM debian:trixie as builder
WORKDIR /download
RUN mkdir out && \
apt-get update && \
apt-get install -y curl git gcc make && \
git clone https://github.com/Mozilla-Ocho/llamafile.git && \
curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
chmod 755 unzip && mv unzip /usr/local/bin && \
cd llamafile && make -j8 && \
make install PREFIX=/download/out
This part of the Dockerfile builds the llamafile binaries. Like many open-source projects, llamafile is active, with new features and bug fixes delivered weekly. For this reason, I build llamafile from source instead of using binaries. In addition, llamafile can be built into a single executable that includes a model. To keep things simple, I omitted this step.
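Incidentally, this is roughly what the omitted single-file step looks like. The llamafile repository ships a zipalign tool for embedding a model (and an optional .args file of default arguments) into the executable. The file names below are illustrative and the flags are from memory, so check the llamafile README for the current syntax.
% cp /download/out/bin/llamafile codellama.llamafile
% printf -- '-m\ncodellama-7b-instruct.Q4_K_M.gguf\n' > .args
% zipalign -j0 codellama.llamafile codellama-7b-instruct.Q4_K_M.gguf .args
The resulting codellama.llamafile can then be run directly, with the model baked in.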
The next part of the multi-stage build is the image that runs llamafile with the host system's GPU. NVIDIA provides base images for CUDA that we can use for GPU-enabled applications like llamafile. Earlier, we took note of the CUDA version and the driver version: NVIDIA states that the base CUDA image must match the CUDA version on the host machine. If you have a later-model video card, choose the latest image for the host platform. You can choose among several Linux distributions, and between a development image that includes a software toolchain for building applications and a runtime image for deploying a prebuilt application.
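If a prebuilt image does match your setup, you can pull it directly instead of building your own. The tag below illustrates the naming scheme rather than the exact tag you need; browse the nvidia/cuda tags for your CUDA version and distribution.
% docker pull nvidia/cuda:12.2.2-base-ubuntu22.04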
NOTE: The GeForce RTX 2060 on my laptop was incompatible with the latest CUDA image. I could have fallen back to an earlier image, but the NVIDIA CUDA repository maintains Dockerfiles for different CUDA versions. I built an image specific to my CUDA version and Linux distribution. If you want to build your own CUDA image, download the Dockerfile and build it:
% docker build -t cuda-12.2-base-ubuntu-22.04 .
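For reference, NVIDIA organizes these Dockerfiles by CUDA version, Linux distribution, and image flavor in its container-images repository. A path roughly like the one below held the Ubuntu 22.04 / CUDA 12.2 base Dockerfile at the time of writing; the layout may change, so browse the repository if it differs. Run the docker build command above from that directory.
% git clone https://gitlab.com/nvidia/container-images/cuda.git
% cd cuda/dist/12.2.2/ubuntu2204/base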
This section installs and configures the CUDA toolkit in the CUDA base image, which llamafile needs in order to use the GPU. Replace the base image with one that matches your CUDA version and Linux distribution. In addition, a user is created so that the container does not run as root.
FROM cuda-12.2-base-ubuntu-22.04 as out
RUN apt-get update && \
apt-get install -y linux-headers-$(uname -r) && \
apt-key del 7fa2af80 && \
apt-get update && \
apt-get install -y clang && \
apt-get install -y cuda-toolkit && \
addgroup --gid 1000 user && \
adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user
USER user
The following section copies the llamafile binaries and man pages from the builder image, along with the LLM model. In this example, the model, codellama-7b-instruct.Q4_K_M.gguf, was downloaded from Hugging Face. You can use any llamafile (or llama.cpp) compatible model in the GGUF format. Note that llamafile is started as a server that includes an OpenAI API endpoint, and GPU usage is enabled with -ngl 9999.
WORKDIR /usr/local
COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share/man
COPY codellama-7b-instruct.Q4_K_M.gguf /model/codellama-7b-instruct.Q4_K_M.gguf
# Don't write log file
ENV LLAMA_DISABLE_LOGS=1
# Expose 8080 port.
EXPOSE 8080
# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
# Set default command.
CMD ["--server", "--nobrowser", "-ngl", "9999", "--host", "0.0.0.0", "-m", "/model/codellama-7b-instruct.Q4_K_M.gguf"]
Build and tag the image.
% docker build -t llamafile-codellama-gpu .
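Before moving on, a quick way to confirm the GPU is visible inside the new image is to override the entrypoint and run nvidia-smi, which the NVIDIA container runtime mounts into the container. This is just a sanity check I like to run; adjust the tag if yours differs.
% docker run --rm --gpus all --entrypoint nvidia-smi llamafile-codellama-gpu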
This is the complete Dockerfile.
FROM debian:trixie as builder
WORKDIR /download
RUN mkdir out && \
apt-get update && \
apt-get install -y curl git gcc make && \
git clone https://github.com/Mozilla-Ocho/llamafile.git && \
curl -L -o ./unzip https://cosmo.zip/pub/cosmos/bin/unzip && \
chmod 755 unzip && mv unzip /usr/local/bin && \
cd llamafile && make -j8 && \
make install PREFIX=/download/out
FROM cuda-12.2-base-ubuntu-22.04 as out
RUN apt-get update && \
apt-get install -y linux-headers-$(uname -r) && \
apt-key del 7fa2af80 && \
apt-get update && \
apt-get install -y clang && \
apt-get install -y cuda-toolkit && \
addgroup --gid 1000 user && \
adduser --uid 1000 --gid 1000 --disabled-password --gecos "" user
USER user
WORKDIR /usr/local
COPY --from=builder /download/out/bin ./bin
COPY --from=builder /download/out/share ./share/man
COPY codellama-7b-instruct.Q4_K_M.gguf /model/codellama-7b-instruct.Q4_K_M.gguf
# Don't write log file.
ENV LLAMA_DISABLE_LOGS=1
# Expose 8080 port.
EXPOSE 8080
# Set entrypoint.
ENTRYPOINT ["/bin/sh", "/usr/local/bin/llamafile"]
# Set default command.
CMD ["--server", "--nobrowser", "-ngl", "9999", "--host", "0.0.0.0", "-m", "/model/codellama-7b-instruct.Q4_K_M.gguf"]
Enjoy the Performance Boost
I ran a prompt using continue.dev, an open-source LLM coding VS Code extension, to test GPU-enabled llamafile with codellama-7b-instruct.Q4_K_M.gguf. Continue.dev can use any LLM with an OpenAI-compatible API. I ran llamafile in server mode on the laptop with the GPU and VS Code on an M2 Mac mini. I used the following.
Prompt:
"Depth-first search for a URL and print the pages as a tree."
Completion:
import requests
from bs4 import BeautifulSoup

def depth_first_search(url):
    """Depth-first search for an URL and print the pages as a tree."""
    # Initialize a set to keep track of visited URLs
    visited = set()
    # Create a queue to store the URLs to be visited
    queue = [url]
    while queue:
        # Get the next URL from the queue
        url = queue.pop(0)
        # If the URL has not been visited, visit it and add its children to the queue
        if url not in visited:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a'):
                queue.append(link.get('href'))
            print(f"{url} -> {', '.join(queue)}")
            visited.add(url)
For a baseline, I ran the llamafile executable on the CPU.
% llamafile --server --host 0.0.0.0 -m codellama-7b-instruct.Q4_K_M.gguf
With the CPU, prompt evaluation processed 24 tokens per second, and the total time for the completion was over a minute.
print_timings: prompt eval time = 4963.79 ms / 121 tokens ( 41.02 ms per token, 24.38 tokens per second)
print_timings: eval time = 64191.60 ms / 472 runs ( 136.00 ms per token, 7.35 tokens per second)
print_timings: total time = 69155.39 ms
Next, I ran the llamafile executable with the GPU enabled.
% llamafile --server --host 0.0.0.0 -ngl 9999 -m codellama-7b-instruct.Q4_K_M.gguf
With the GPU, prompt evaluation was 17 times faster than on the CPU, processing 426 tokens per second. The completion returned in 11 seconds, a significant improvement in response time.
print_timings: prompt eval time = 283.92 ms / 121 tokens ( 2.35 ms per token, 426.18 tokens per second)
print_timings: eval time = 11134.10 ms / 470 runs ( 23.69 ms per token, 42.21 tokens per second)
print_timings: total time = 11418.02 ms
I ran the llamafile container with the GPU enabled to see if containerization affected performance.
% docker run -it --gpus all --runtime nvidia -p 8111:8080 llamafile-codellama-gpu
Surprisingly, containerization did not hurt performance; prompt evaluation was even slightly faster than with the GPU-enabled executable running directly on the host.
print_timings: prompt eval time = 257.56 ms / 121 tokens ( 2.13 ms per token, 469.80 tokens per second)
print_timings: eval time = 11498.98 ms / 470 runs ( 24.47 ms per token, 40.87 tokens per second)
print_timings: total time = 11756.53 ms
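Because the server exposes an OpenAI-compatible API, you can also exercise it directly over HTTP. Below is a minimal request, assuming the container is published on port 8111 as above and that your llamafile build serves the /v1/chat/completions route; replace localhost with the laptop's address if you call it from another machine.
% curl http://localhost:8111/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "codellama-7b-instruct", "messages": [{"role": "user", "content": "Write a function that reverses a string."}]}'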
Summary
Running llamafile with the GPU enabled changes it from a toy application for experimentation and learning to a practical component in your software toolchain. In addition, containerizing an LLM lets anyone run the LLM without installing and downloading binaries. Users can pull an LLM from a repository and launch it with minimum setup. Containerization also opens up other avenues for deploying LLMs in orchestration frameworks.