Running Local AI on Linux With GPU: Ollama + Open WebUI + Gemma

Hello, I'm Maneshwar. I'm currently working on FreeDevTools, building **one place for all dev tools, cheat codes, and TLDRs**: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.

Running modern small LLMs locally on Linux has become insanely easy with GPU acceleration, Docker, and Ollama, but there are still a few gotchas.

This post walks through the entire real-world setup:

  • Choosing the right models for a 4GB GPU
  • Installing Phi & Gemma with Ollama
  • Fixing NVIDIA Docker GPU runtime
  • Evaluating WebUI options (Jan vs AnythingLLM vs Open WebUI)
  • Running Open WebUI fully GPU-accelerated
  • Fixing Ollama networking issues

Let’s go step by step.

1. Choosing the right small models

You have:

  • GTX 1650 (4GB VRAM)
  • 16GB RAM

So the “small but surprisingly powerful” models are ideal.

Phi-3 / Phi-2 (Ollama: phi3, phi)

  • Extremely fast
  • Great reasoning for size
  • Runs smoothly on 4GB

Gemma 2 (2B) (gemma2:2b)

  • Google quality
  • Very clean outputs
  • Heavier than Phi but still fits VRAM

TinyLlama (1.1B)

  • Ultra fast
  • Basic reasoning
  • Barebones but usable

Qwen 1.8B

  • Strong multilingual
  • Very fast
  • Great value model

Ranking for everyday use:

| Model | Speed | Quality | GPU RAM | Notes |
| --- | --- | --- | --- | --- |
| Phi-3 | 🚀 | ⭐⭐⭐ | fits | Best small model |
| Gemma 2B | | ⭐⭐⭐⭐ | fits | Better answers, slower |
| Qwen 1.8B | 🚀 | ⭐⭐⭐ | fits | Multilingual beast |
| TinyLlama 1.1B | 🛸 | ⭐⭐ | tiny | Only for basic chat |
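
Before pulling anything, it can help to see how much VRAM is actually free; this quick check is my own suggestion rather than part of the original setup, and the numbers will vary with whatever your desktop session is already using:

# report total, used, and free VRAM on the GPU
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv

A quantized 2B-class model typically wants around 1.5–2GB of that 4GB, so leaving some headroom for the desktop and the context cache keeps things comfortable.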

2. Installing LLMs using Ollama

Ollama makes model management dead simple.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pull models:

ollama pull phi:latest
ollama pull gemma2:2b

Verify:

ollama list

The output should list both phi and gemma2:2b with their sizes.

Works.
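
Before wiring up any UI, a quick smoke test (my own habit, not a required step) confirms the models actually generate, either from the terminal or via Ollama's HTTP API on its default port 11434:

# one-off prompt from the CLI
ollama run gemma2:2b "Say hello in one short sentence."

# or call the local API directly
curl http://localhost:11434/api/generate -d '{"model": "phi", "prompt": "Why is the sky blue?", "stream": false}'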

3. Installing NVIDIA Container Toolkit (for GPU support)

Docker cannot use your GPU until this is installed correctly.

NVIDIA's apt repo setup occasionally breaks, so start from a clean slate (it's fine if these files don't exist yet):

sudo rm /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo rm /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

Add clean source:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/libnvidia-container.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
  sed 's#signed-by=.*#signed-by=/usr/share/keyrings/libnvidia-container.gpg#' |
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install:

sudo apt update
sudo apt install -y nvidia-container-toolkit

Enable Docker GPU runtime:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Test:

docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

If you see your GPU listed inside the container, you're done.
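
If the test fails instead (typically a "could not select device driver" style error), it's worth confirming Docker actually registered the NVIDIA runtime; these checks assume the default config location:

# "nvidia" should appear in the runtime list
docker info | grep -i runtimes

# nvidia-ctk writes its runtime entry here
cat /etc/docker/daemon.json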

4. Choosing a Web UI for Local LLMs

1. Jan AI (JanHQ) — 39.3k stars

Repo: https://github.com/janhq/jan

A clean Electron-based desktop app focused on being an offline ChatGPT replacement.

2. AnythingLLM — 51k stars

Repo: https://github.com/Mintplex-Labs/anything-llm

A full RAG framework and knowledge-base system with a UI on top.

3. Open WebUI — 115k stars (the one I used)

Repo: https://github.com/open-webui/open-webui

A fast, modern UI for LLMs with full GPU + Ollama support.

I went with Open WebUI because:

  • It integrates best with Ollama
  • Has GPU-optimized builds
  • Has the most features
  • Supports future scaling (agents, workflows, extensions)
  • Huge community (115k+ stars)

Running Open WebUI with CUDA support

Open WebUI is the best local AI UI and supports Ollama beautifully.

Run the CUDA container. 172.17.0.1 is the default Docker bridge gateway, i.e. the host machine as seen from inside the container:

docker run -d \
  -p 3000:8080 \
  --gpus all \
  -e OLLAMA_HOST=http://172.17.0.1:11434 \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
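
As an aside, and not the route I took: recent Docker releases (20.10+) can map host.docker.internal to the host gateway on Linux, which avoids hard-coding the bridge IP. A sketch of that variant, using the same OLLAMA_BASE_URL variable; Ollama still has to listen on a non-loopback address either way (see Fix A below):

docker run -d \
  -p 3000:8080 \
  --gpus all \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda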

5. Fixing “Open WebUI cannot connect to Ollama”

This was the painful part.
You will absolutely hit this error:

Cannot connect to host host.docker.internal:11434

Even though:

  • Ollama server is running
  • Models exist
  • Curl works inside container

Root causes I fixed:

Fix A — Ollama was only listening on 127.0.0.1

By default, Ollama binds to loopback (127.0.0.1) only.

We updated the systemd service:

sudo vim /etc/systemd/system/ollama.service

Add this under the [Service] section:

Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload:

sudo systemctl daemon-reload
sudo systemctl restart ollama
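
Before re-checking the port, you can confirm systemd actually picked up the override (a quick sanity check, not in the original steps); the output should include OLLAMA_HOST=0.0.0.0:11434:

# show the environment the unit will run with
systemctl show ollama --property=Environment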

Verify:

sudo ss -tulpn | grep 11434

It must show Ollama listening on 0.0.0.0:11434 (or *:11434), not 127.0.0.1:11434.

Fix B — Open WebUI DB saved wrong host (host.docker.internal)

Earlier runs stored the bad value in SQLite config.

We wiped the volume:

docker rm -f open-webui
docker volume rm open-webui

Then recreated the container cleanly with the same docker run command from step 4.

Fix C — Browser localStorage had the wrong URL saved

Open WebUI saves connection settings in the browser!

We went into:

Settings → Connections → Ollama

http://localhost:3000/admin/settings/connections

It showed:

http://host.docker.internal:11434

Changed to:

http://172.17.0.1:11434

If the UI refused to update:

Chrome DevTools → Application → Local Storage → Clear All
or, from the DevTools console:

localStorage.clear()
sessionStorage.clear()

Refresh page.

Finally, WebUI started using the correct IP.

6. Verifying everything

From the container:

docker exec -it open-webui curl http://172.17.0.1:11434/api/tags

If you see phi + gemma JSON → success.

Then in Open WebUI:

✔ Models appear
✔ Chat works
✔ GPU is used
✔ Everything is finally stable
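
To see the last point for yourself, keep nvidia-smi running while you send a prompt; GPU utilization and memory should jump while a response is generating (just a sanity check I like to run):

# refresh GPU stats every second
watch -n 1 nvidia-smi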

Conclusion

After a lot of debugging, the final working setup required fixing THREE layers:

  1. Ollama network binding (listening on 0.0.0.0)
  2. Docker environment overrides
  3. Open WebUI's internal + browser-stored connection configs

Once aligned, everything worked flawlessly.

FreeDevTools

👉 Check out: FreeDevTools

Feedback and contributions are welcome!

It’s online, open-source, and ready for anyone to use.

⭐ Star it on GitHub: freedevtools

Top comments (1)

Sawyer Wolfe

Great guide! Binding Ollama to 0.0.0.0 and clearing WebUI storage are clutch. On 4GB, Qwen2.5-1.5B Q4_K_M and Phi-3 Q4 run well. What other models or Docker/WebUI tweaks helped reduce VRAM spikes?