NVIDIA Jetson devices are powerful platforms designed for edge AI applications, offering excellent GPU acceleration capabilities to run compute-intensive tasks like language model inference.
With official support for NVIDIA Jetson devices, Ollama brings the ability to manage and serve Large Language Models (LLMs) locally, ensuring privacy, performance, and offline operation. By integrating Open WebUI, you can enhance your workflow with an intuitive web interface for managing these models.
It is important to note that the original NVIDIA Jetson Nano, with only 4GB of memory, can run only smaller LLaMA-family models, and even 7B-parameter models fit only after optimizations such as quantization to reduce memory usage.
For instance, 4-bit quantization can make it feasible to run these models on the Jetson Nano, though performance will still be constrained compared to more powerful hardware. Additionally, some users have reported difficulty getting GPU acceleration with pre-built binaries on the Jetson Nano, so building from source may be necessary for optimal performance.
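As a concrete illustration, Ollama publishes quantized variants of many models as tags, so on a memory-constrained board you would typically pull a 4-bit build rather than the full-precision weights. The tag below is illustrative only; check the Ollama library page for the tags that actually exist for your chosen model:
# Illustrative: pull a 4-bit quantized 7B build instead of the default tag
ollama pull llama2:7b-chat-q4_0
# Inspect the size of the models downloaded locally
ollama list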
This guide will walk you through setting up Ollama on your Jetson device, integrating it with Open WebUI, and configuring the system for optimal GPU utilization. Whether you're a developer or an AI enthusiast, this setup allows you to harness the full potential of LLMs right on your Jetson device.
Prerequisites
Hardware
- Jetson Orin Nano
- A 5V 4A power supply
- 64GB SD card
- WiFi Adapter
- Wireless Keyboard
- Wireless mouse
Software
- Download the Jetson SD card image from this link
- Raspberry Pi Imager installed on your local system
Preparing Your Jetson Nano
- Unzip the SD card image.
- Insert the SD card into your system.
- Use the Raspberry Pi Imager tool to flash the image onto the SD card.
Prerequisite
- Ensure that JetPack 6.0 is installed on your Jetson Orin Nano. If it is not, download SDK Manager on a separate Windows or Linux host and follow the tutorial on the official NVIDIA Developer site.
Step 1. Verify L4T Version
To check the L4T (Linux for Tegra) version on your NVIDIA Jetson device (e.g., Jetson Nano, Jetson Xavier), follow these steps:
Run the following command to retrieve your current L4T version.
head -n 1 /etc/nv_tegra_release
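The first line should look something like the example below (the revision, GCID, and date will differ on your board); the R number and REVISION field together give the L4T version, so this example corresponds to L4T 36.3.0:
# Example output (values vary by device and JetPack release)
# R36 (release), REVISION: 3.0, GCID: ..., BOARD: generic, EABI: aarch64, DATE: ...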
Here is the list of supported L4T versions:
- 35.3.1
- 35.4.1
- 35.5.0
- 36.3.0
If your L4T version is not one of the supported versions listed above, you will need to re-flash the system on your NVIDIA Jetson device using SDK Manager on another computer. You can download SDK Manager and follow the tutorial from the official NVIDIA Developer site.
Step 2. Keep apt up to date:
sudo apt update && sudo apt upgrade
Step 3. Install JetPack:
sudo apt install nvidia-jetpack
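You can confirm that the JetPack components were installed, for example by querying the meta-package and checking the bundled CUDA compiler. The package name and CUDA path below are the usual defaults on JetPack 6.x, but may differ on your setup:
# Show the installed JetPack meta-package version
apt show nvidia-jetpack
# Verify the CUDA toolchain that ships with JetPack
/usr/local/cuda/bin/nvcc --version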
Step 4. Add your user to the docker group
Add your user to the docker group and restart the Docker service to apply the change:
sudo usermod -aG docker $USER && \
newgrp docker && \
sudo systemctl daemon-reload && \
sudo systemctl restart docker
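To confirm that your user can now talk to the Docker daemon without sudo, run a quick test:
# Should list running containers (empty at this point) without a permission error
docker ps
# Optional sanity check that containers can actually run
docker run --rm hello-world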
Step 5. Install jetson-examples:
pip3 install jetson-examples
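Note that pip typically places the reComputer entry point from jetson-examples in your user's local bin directory, and whether it is immediately on your PATH depends on your shell setup. The path below is the usual default for per-user pip installs and is given here as an assumption, not a guarantee:
# jetson-examples usually places the reComputer command here for user installs
ls ~/.local/bin/reComputer
# Add it to PATH for the current shell session if the command is not found
export PATH="$HOME/.local/bin:$PATH"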
Step 6. Reboot system
sudo reboot
Step 7. Install Ollama
reComputer run ollama
Optional: If you run the above command via SSH and encounter the error "command not found: reComputer", you can resolve this by executing the following command:
source ~/.profile
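Once the installer has finished, a quick way to confirm that Ollama is working is to check the client version and query the local API, which listens on port 11434 by default:
# Check that the Ollama client is on the PATH
ollama --version
# The Ollama server should answer on its default port
curl http://localhost:11434/api/version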
Step 8. Run a model
One of the smallest LLaMA-style models available through Ollama is TinyLlama, a compact 1.1-billion-parameter model. Despite its reduced size, TinyLlama performs remarkably well across a variety of tasks, making it suitable for applications with limited computational resources. You can also access TinyLlama through its GitHub repository or via Hugging Face.
Let's run the tinyllama model and perform tasks like generating Python code:
ollama run tinyllama
>>> Can you write a Python script to calculate the factorial of a number?
Sure! Here’s the code:
def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

num = int(input("Enter a number: "))
print(f"The factorial of {num} is {factorial(num)}")
Step 9. Install models (e.g. llama3.2) from the Ollama library
ollama pull llama3.2
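After pulling a model, you can also drive it through Ollama's REST API, which is the same interface Open WebUI uses behind the scenes:
# List the models available locally
curl http://localhost:11434/api/tags
# Generate a completion with llama3.2 (streaming disabled for readable output)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'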
Step 10. Install and run Open WebUI through Docker
docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda
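After the container starts, it is worth checking that it is running and that it detected the GPU before opening the browser (the container name matches the one used in the command above):
# Confirm the container is up and port 3000 is mapped to 8080
docker ps --filter name=open-webui
# Inspect the startup logs for CUDA/GPU detection and the listening server
docker logs open-webui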
Step 11. Access Open WebUI
Once the installation is finished, you can access the GUI by visiting http://YOUR_SERVER_IP:3000 in your browser.
Access the API endpoints by navigating to YOUR_SERVER_IP/ollama/docs#/. For comprehensive documentation, please refer to the official resources: the Ollama API Documentation (recommended) and Open WebUI API Endpoints.
Using GPU
This installation method uses a single container image that bundles Open WebUI with Ollama, allowing for a streamlined setup via a single command. Choose the appropriate command based on your hardware setup:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
Using CPU only
If you're not using a GPU, use this command instead:
sudo docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
Both commands facilitate a built-in, hassle-free installation of both Open WebUI and Ollama, ensuring that you can get everything up and running swiftly.
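Whichever variant you choose, a quick sanity check before opening the browser is to confirm that both services respond. Note that the second command assumes the native Ollama install from Step 7; the bundled container commands above do not publish port 11434 on the host:
# Open WebUI should answer on port 3000
curl -sI http://localhost:3000 | head -n 1
# The natively installed Ollama server answers on port 11434
curl -s http://localhost:11434/api/tags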
Conclusion
Once configured, Open WebUI can be accessed at http://localhost:3000, while Ollama operates at http://localhost:11434. This setup provides a seamless and GPU-accelerated environment for running and managing LLMs locally on NVIDIA Jetson devices.
This guide showcases the power and versatility of NVIDIA Jetson devices when paired with Ollama and Open WebUI, enabling advanced AI workloads at the edge with ease and efficiency.