Atharva Shirdhankar

Posted on May 31

Breathing Life into the Pi: Deploying Gemma 4 2B on a Raspberry Pi 5

#gemma #ai #raspberrypi #podman

Hi Everyone,
I’m back with a brand new project, and this one has been a long time coming.

For a while now, I’ve had this persistent urge to build my own dedicated, local AI server. But if you’ve ever tried running Large Language Models locally, you already know the universal roadblock every developer hits almost instantly: resource constraints. A few months ago, I was experimenting with the Gemma 3 1B model on my primary workstation. To be honest, I was incredibly impressed. For daily text-based tasks and general brainstorming, that tiny model punched way above its weight class. Because my main machine has a pretty decent hardware layout, Gemma 3 ran flawlessly, entirely local, and without a hint of lag.

But that experience left me with a lingering question. Can we take this optimization a step further? Could we move this workload completely off my main workstation and onto a tiny, low-power single-board computer?

I hadn’t tested any LLMs on my Raspberry Pi 5 yet. So, this time, I decided to skip the ultra-lightweight models and push the hardware to its absolute limit. In this tutorial, we are going to try and get the brand new Gemma 4 2B (8-bit Quantized) model up and running on a Raspberry Pi 5 with just 4GB of RAM. Can a credit-card-sized board with 4GB of memory actually host a production-ready, modern 2-billion parameter AI server? Let’s find out.

For this project I have below mentioned hardware,

Raspberry pi 5 4GB with Active Cooler installed
NVMe 500GB instead of SD card.

Honestly, it is only because I have this cooling fan and fast SSD storage that I am willing to take a risk and try running this model on a 4GB RAM board. Without these upgrades, trying to run a 2B model on a board this small would be a huge struggle and might not even work because of the limited memory.

But with our hardware ready to go, we are finally ready for the software side. Let's boot up the OS and clean up the system for maximum speed.

Steps to Create Local RPi AI Server

1. Boot RPi OS lite :

Since we are operating under tight hardware resource constraints, our system architecture needs to be as optimized and lightweight as possible. To achieve this, we are choosing the official Raspberry Pi OS Lite (64-bit) variant instead of the standard desktop edition.

Operating in a completely headless mode, meaning no graphical user interface (GUI) gives us a pure command-line environment. This ensures that every single megabyte of RAM and every spare CPU cycle on our Raspberry Pi 5 is preserved entirely for running our local LLM workload.

Lets first update packages

Update Linux packages

sudo apt update && sudo apt upgrade -y

Because Raspberry Pi OS Lite is stripped down for optimization, it does not come with the full Vim text editor pre-installed.
We can install it manually.

Install vim

sudo apt install vim -y

2. Setting up Podman :

Yes, we are using Podman instead of Docker for this build.

The main reason for this choice comes down to optimization. Podman is completely rootless and daemonless. Unlike Docker, which requires a heavy background service constantly running in your system memory, Podman has zero idle background RAM consumption on our host. When the container isn't actively working, it takes up zero overhead.

There is also another massive advantage to using Podman that makes it a perfect fit for this project, but I will save that surprise for the Services section later in this blog!

Install podman

Let's install Podman using our package manager:

sudo apt -y install podman

3. Download Gemma 4 2B Model from Hugging Face:

We are opting for the Q8_0 (8-bit quantized) variant of Gemma 4 E2B model packaged by the GGML team in the GGUF format. This gives us an incredible balance,it preserves almost all of the model's native intelligence and reasoning accuracy while dramatically lowering the RAM requirements so it runs flawlessly on our Raspberry Pi 5 without choking the system. Since we are using Llama.cpp here, GGUF is the only format it can actually use to run local inference

Since we are using headless OS mode we will install the model using wget linux command.

Command Structure :
wget -c -P /path/to/download/dir url-to-model-file

-c (Continue): If your SSH session drops or your internet hiccups halfway through this heavy multi-gigabyte download, adding this flag allows wget to automatically resume right where it left off instead of restarting from scratch.

-P (Prefix Directory): This automatically routes the incoming download stream straight into our newly created ~/models/ directory, keeping our home directory clean.
If we dont have ~/models/ directory existed, wget command creates new one for us.

Download Gemma 4 2B model :

wget -c -P ~/models/ https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q8_0.gguf

4. Llama Server Container :

Since we are building close to production ready server. We will go with llama server container image which is recommended for production servers. It only has tools which are needed to run our gemma model. No extra packages and tools are included in it. Since we are also keeping a optimized system in our mind. This container image will help us with that idealogy.

Running a Podman Test

Before we automate everything, let's run a quick manual test in the background to make sure Podman can download the image, find our model file, and start the server correctly.

podman run -d \
  --name llama-server \
  -p 8080:8080 \
  -v $HOME/models:/models:Z \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/gemma-4-E2B-it-Q8_0.gguf \
  -c 1024 \
  -t 4 \
  --host 0.0.0.0 \
  --port 8080

This podman command can be broken down into two parts for explaination -

Part 1: Podman Configuration Flags

These settings tell Podman how to create and isolate the container on your Raspberry Pi:

podman run: The core command that tells Podman to pull down the container image (if you don't have it yet) and start it up.

-d (Detached): Runs the container silently in the background. This keeps your terminal open so you can keep working.

--name llama-server: Assigns a friendly, custom name to our container so we can easily start, stop, or check its logs later without typing out a random ID string.

-p 8080:8080 (Port Mapping): Opens a bridge between your physical Raspberry Pi and the container. It takes traffic coming into the Pi's port 8080 and passes it straight into the container's port 8080.

-v $HOME/models:/models:Z (Volume Mount): Links the ~/models folder on your Pi's high-speed NVMe drive to a folder named /models inside the container. The :Z flag is incredibly important here,it tells Podman to configure SELinux security permissions so the rootless container is allowed to read your model file.

ghcr.io/ggml-org/llama.cpp:server: This is the exact address of the official, lightweight container image hosted on the GitHub Container Registry.

Part 2: Llama Server Internal Arguments

Everything written after the image name is passed directly to the AI application running inside the container:

-m /models/gemma-4-E2B-it-Q8_0.gguf (Model Path): Tells the server exactly which model file to load into memory from our shared folder.

-c 1024 (Context Size): Sets the maximum token limit (words and spaces) the model can handle in a single conversation. Restricting this to 1024 keeps our RAM usage low and optimized for the Pi 4GB.
Think of the context size as the short-term memory of the AI.
You can actually tweak this number all the way up to 8192 specifically for the Gemma model, but a larger memory requires a lot more RAM. For my Raspberry Pi 5 setup, I could push this up to 2048. However, keep in mind that raising this number means the system will likely need to use a swap partition on the NVMe drive. While relying on swap space might slow down the AI's response time slightly, it will absolutely work!

-t 4 (Threads): Dedicates exactly 4 CPU threads to calculate the AI responses. This perfectly matches the 4 physical cores of the Raspberry Pi 5 for maximum speed without overheating.

--host 0.0.0.0: Tells the AI server to accept connections from any IP address on your local network, rather than locking it down to just the Pi itself. Since I want to deploy this LLM server in my lab and easily access it from my main desktop or laptop over the LAN, this flag is absolutely essential.

--port 8080: Configures the web server inside the container to listen for requests on port 8080.

Once we run the Podman command, our AI server container will spin up in the background. Before we try talking to our new LLM, we should verify that the server initialized correctly and the Gemma model successfully loaded into the memory.

We can easily check the real-time status of our server using a simple curl command to ping its built-in health endpoint:

curl http://localhost:8080/health

What to expect from the output:
{"status": "ok"}: If you get a 200 OK response with this message, congratulations! Your lightweight server is fully healthy and standing ready to process prompts.

{"status": "loading model"}: If you get a 503 status, don't panic. It just means the server is still reading the 2B model file off your NVMe drive. Give it a few seconds and try again.

Automating with Podman Quadlets

Our manual test worked perfectly, and our server is healthy. But let’s be honest: running commands manually isn’t a great long-term solution.

Think about what happens if our Raspberry Pi pushes past its 4GB limit during a heavy task. The Linux kernel might trigger a sudden reboot to protect itself. Or, what if you simply shut down your Pi for the night and want to use your AI server the next morning? You would have to log back in and type out that massive podman run command all over again. That feels like a frustrating bottleneck.

As developers, we want to automate this so that our AI server starts up automatically every single time the Raspberry Pi boots.

This is where the massive advantage of Podman comes into play. Unlike Docker, Podman integrates natively with Linux systemd services using a tool called Quadlets. Instead of writing a messy script to manage our container, we can just create a single, clean .container file. Podman's built-in engine reads this file and automatically turns it into a background linux service.

Creating the Configuration File
First, we need to create the specific directory where Podman looks for these automated files:

mkdir -p $HOME/.config/containers/systemd/

Next, open a new file named llama-server.container inside that folder using Vim:

vim $HOME/.config/containers/systemd/llama-server.container

[Unit]
Description=Llama server for Gemma 4-2BQ Model
After=network-online.target

[Container]
Image=ghcr.io/ggml-org/llama.cpp:server
ContainerName=llama-server
PublishPort=8080:8080
Volume=%h/models:/models:Z
Exec=-m /models/gemma-4-E2B-it-Q8_0.gguf  -c 1024 -t 4 --host 0.0.0.0 --port 8080

[Service]
Restart=always

[Install]
WantedBy=default.target

How this configuration works:

The complete config does the same task as the podman run command which we used earlier, but with a catch. Whenever our raspberry pi boots up, our LLM server will start automatically without our intervention.

After=network-online.target: This tells the system to wait until the Raspberry Pi is fully connected to your network before trying to start the AI server.

[Container]: This is where Quadlets shine. Instead of a long, confusing command line, you get to list your container settings clearly line-by-line.

Volume=%h/models:/models:Z: In systemd configuration, %h is a neat shortcut that automatically grabs your user home directory path. This connects our NVMe storage folder safely.

Restart=always: This is our ultimate safety net. If the container crashes because it runs out of memory, or if the Pi reboots unexpectedly, Linux will step in and restart the server automatically.

Now that our configuration file is sitting in the Quadlet folder, we need to tell systemd to scan for changes, recognize our new file, and fire up the LLM server in the background.

Before we do that, we need to stop the manual Podman container we started earlier so it doesn't conflict with our new setup:

podman stop llama-server

Since we are running this as a rootless user setup, we will use the --user flag for all our systemd commands. Run the following to let systemd process the new configuration:

systemctl --user daemon-reload

Now, fire up the background service:

systemctl --user start llama-server.container

Check the service status :

systemctl --user status llama-server.container

This will show us the live logs of our llm server.

If you want an extra layer of confirmation, you can also run our familiar Podman command to verify that the container is alive

Testing on Mobile (LAN connection):

To be completely upfront, the output I got from running this Gemma model on the Raspberry Pi 5 was pretty decent, keeping in mind the compact device I was using. The model was smart enough for basic tasks. But yes, I will agree that the response time was pretty slow atleast in the first couple of runs while the model was still loading into memory. Because we are relying entirely on the Pi's CPU and dipping into a swap partition on the NVMe drive to manage our tight memory limits, token generation definitely takes some patience.

But as a proof of concept. It is an absolute win. It proves that you can host a fully functional, self-contained AI server on a tiny credit-card-sized computer for next to nothing.

Now, lot of people might be wondering: Do I need a Raspberry Pi 5 to build this local LLM setup?

The answer is simply no. You don’t need a Pi at all! You can run this exact same setup on a Linux cloud instance, a home server, or even an old Linux laptop you have lying around. In fact, if that old laptop or cloud server has a basic NVIDIA GPU installed in it, you are in for a massive upgrade.

If you have an NVIDIA graphics card, you just need to tweak a few lines in your configuration file. By shifting the heavy math from your CPU to your GPU, your LLM server will generate responses much, much faster.

All you have to do is update the [Container] section in your .container file like this:

[Container]
Image=ghcr.io/ggml-org/llama.cpp:server-cuda
ContainerName=llama-server
PublishPort=8080:8080
Volume=%h/models:/models:Z
Nvidia=all
Exec=-m /models/gemma-4-E2B-it-Q8_0.gguf -c 1024 -t 4 --host 0.0.0.0 --port 8080 -ngl 99

The Image Tag: We changed the container image from :server to :server-cuda so the software inside actually knows how to talk to NVIDIA hardware.

Nvidia=all: This is a specific Podman Quadlet setting that safely passes your graphics card directly into the container.

-ngl 99: The ngl flag stands for Number of GPU Layers. You can tweak this number anywhere from 1 to 99 to decide the partnership ratio between your GPU and CPU. Setting it to 99 tells the server to shove the entire model into your GPU’s ultra-fast Video RAM (VRAM), bypassing the slower CPU entirely!

I actually wish I could showcase a past setup I built using an older Gemma 1B model on my personal HP laptop. That machine only had 16GB of RAM and a very basic, entry-level NVIDIA MX110 GPU.

Back then, I didn't use Podman or automate anything with systemd services; I just ran a quick manual test using Docker. But even on that older, modest laptop graphics card, shifting the workload to the GPU worked beautifully. I did ranned into a problem where my LLM model wasn't able to detect GPU but somehow I sorted that out. But it's a proof that you don't need a high-end data center or a flagship graphics card to start experimenting with local AI!

What Next...?

Frankly speaking, this project is still a work in progress. My main goal for setting up this local server was to lay the foundation for something much bigger.

Moving forward, I want to build a Minimum Viable Product (MVP) application on top of this setup. By using this local Raspberry Pi server, I can easily test and integrate my APIs without spending a single dollar on cloud hosting. Once I feel the MVP application is working flawlessly and is ready for the world, I can seamlessly move the entire setup to a powerful cloud server, point my application's API endpoints to the cloud, and scale it up.

Aside from the development side, this project was also a personal milestone for me. The entire Podman and systemd automation methodology I used here came directly from my recent RHCSA (Red Hat Certified System Administrator) certification preparation. It was incredibly satisfying to take those core Linux system administration concepts out of the study guide and apply them directly to a modern LLMOps pipeline.

Linkedin Post :

redhat linux cert | Atharva Shirdhankar | 14 comments

300/300 on the RHCSA! 🚀 Finally a win worth sharing. I’m officially a Red Hat Certified System Administrator, and honestly, seeing all that time spent grinding in the terminal pay off feels incredible. This is just the start of my DevOps journey, but it feels good to have a solid foundation down. Now it's time to see how much further I can take this! 🐧💻 Certificate : https://lnkd.in/dg773hv2 #RHCSA #Linux #RedHat #DevOps #CareerGrowth #Learning | 14 comments on LinkedIn

linkedin.com

Thank you so much for reading all the way to the end!

If you enjoyed this deep dive into running local LLMs and automating containers with systemd, let's keep the conversation going. I love connecting with fellow developers, Linux enthusiasts, and anyone building in the AI infrastructure space.

Feel free to reach out, share your thoughts, or show off your own local server setups!

Let's connect:

LinkedIn: https://www.linkedin.com/in/atharvashirdhankar/

DEV Community