Hello, I'm Maneshwar. I'm currently building FreeDevTools online, **one place for all dev tools, cheat codes, and TLDRs** — a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.
Running modern small LLMs locally has become insanely easy on Linux with GPU acceleration, Docker, and Ollama, but there are a few gotchas.
This post walks through the entire real-world setup:
- Choosing the right models for a 4GB GPU
- Installing Phi & Gemma with Ollama
- Fixing NVIDIA Docker GPU runtime
- Evaluating WebUI options (Jan vs AnythingLLM vs Open WebUI)
- Running Open WebUI fully GPU-accelerated
- Fixing Ollama networking issues
Let’s go step by step.
1. Choosing the right small models
You have:
- GTX 1650 (4GB VRAM)
- 16GB RAM
So the “small but surprisingly powerful” models are ideal.
Phi-3 / Phi-2 (Ollama: phi3, phi)
- Extremely fast
- Great reasoning for size
- Runs smoothly on 4GB
Gemma 2 (2B) (gemma2:2b)
- Google quality
- Very clean outputs
- Heavier than Phi but still fits VRAM
TinyLlama (1.1B)
- Ultra fast
- Basic reasoning
- Barebones but usable
Qwen 1.8B
- Strong multilingual
- Very fast
- Great value model
Ranking for everyday use:
| Model | Speed | Quality | GPU RAM | Notes |
|---|---|---|---|---|
| Phi-3 | 🚀 | ⭐⭐⭐ | fits | Best small model |
| Gemma 2B | ⚡ | ⭐⭐⭐⭐ | fits | Better answers, slower |
| Qwen 1.8B | 🚀 | ⭐⭐⭐ | fits | Multilingual beast |
| TinyLlama 1.1B | 🛸 | ⭐⭐ | tiny | Only for basic chat |
2. Installing LLMs using Ollama
Ollama makes model management dead simple.
Install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull models:
```bash
ollama pull phi:latest
ollama pull gemma2:2b
```
Verify:
```bash
ollama list
```
Both models should now show up in the list. Works.
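Before adding any UI on top, it's worth a quick smoke test against Ollama's HTTP API, plus a check that the model actually lands on the GPU. A minimal sketch (the prompt is arbitrary; `ollama ps` is available in recent Ollama releases):

```bash
# one-off, non-streaming generation via Ollama's REST API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "phi", "prompt": "Say hi in five words.", "stream": false}'

# see what's loaded and whether it runs on GPU or CPU
ollama ps

# confirm VRAM usage on the card
nvidia-smi
```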
3. Installing NVIDIA Container Toolkit (for GPU support)
Docker cannot use your GPU until this is installed correctly.
The NVIDIA apt repo config occasionally breaks, so start from a clean slate:
```bash
sudo rm /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo rm /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
```
Add clean source:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/libnvidia-container.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#signed-by=[^]]*#signed-by=/usr/share/keyrings/libnvidia-container.gpg#' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
Install:
```bash
sudo apt update
sudo apt install -y nvidia-container-toolkit
```
Enable Docker GPU runtime:
```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
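For reference, `nvidia-ctk runtime configure` simply registers the runtime in `/etc/docker/daemon.json`. After running it, the file should contain something along these lines (exact path may differ by distro):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```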
Test:
```bash
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
```

If you see your GPU inside the container, you're done.
4. Choosing a Web UI for Local LLMs
1. Jan AI (JanHQ) — 39.3k stars
Repo: https://github.com/janhq/jan
A clean Electron-based desktop app focused on being an offline ChatGPT replacement.
2. AnythingLLM — 51k stars
Repo: https://github.com/Mintplex-Labs/anything-llm
A full RAG framework and knowledge-base system with a UI on top.
3. Open WebUI — 115k stars (the one I used)
Repo: https://github.com/open-webui/open-webui
A fast, modern UI for LLMs with full GPU + Ollama support.
I went with Open WebUI because:
- It integrates best with Ollama
- Has GPU-optimized builds
- Has the most features
- Supports future scaling (agents, workflows, extensions)
- Huge community (115k+ stars)
Running Open WebUI with CUDA support
Open WebUI is the best local AI UI and supports Ollama beautifully.
Run container:
```bash
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```
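If you prefer Compose, a roughly equivalent `docker-compose.yml` would look like the sketch below. It assumes the same bridge-gateway address and the CUDA image tag from the command above; bring it up with `docker compose up -d`.

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:cuda
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://172.17.0.1:11434
    volumes:
      - open-webui:/app/backend/data
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  open-webui:
```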
5. Fixing “Open WebUI cannot connect to Ollama”
This was the painful part.
You will absolutely hit this error:
```
Cannot connect to host host.docker.internal:11434
```
Even though:
- Ollama server is running
- Models exist
- Curl works inside container
Root causes I fixed:
Fix A — Ollama was only listening on 127.0.0.1
Default Ollama binds to loopback only.
We updated systemd service:
```bash
sudo vim /etc/systemd/system/ollama.service
```
Add under the `[Service]` section:

```ini
Environment="OLLAMA_HOST=0.0.0.0:11434"
```
Reload:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify:
```bash
sudo ss -tulpn | grep 11434
```

It must show Ollama listening on `0.0.0.0:11434`, not `127.0.0.1`.
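You can also prove the binding from the container side before blaming Open WebUI. A quick check from a throwaway container, assuming the default Docker bridge where the host is reachable at 172.17.0.1:

```bash
# should print Ollama's version JSON if the binding is correct
docker run --rm curlimages/curl -s http://172.17.0.1:11434/api/version
```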
Fix B — Open WebUI DB saved wrong host (host.docker.internal)
Earlier runs stored the bad value in SQLite config.
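If you want to confirm the stale value first, you can grep the data volume directly. This assumes Open WebUI keeps its settings in a `webui.db` SQLite file inside the data directory:

```bash
# look for the bad host inside the existing Open WebUI volume
docker run --rm -v open-webui:/data alpine \
  sh -c "grep -qa 'host.docker.internal' /data/webui.db && echo 'stale URL found' || echo 'clean'"
```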
We wiped the volume:
```bash
docker rm -f open-webui
docker volume rm open-webui
```
Recreated container clean.
Fix C — Browser localStorage had the wrong URL saved
Open WebUI saves connection settings in the browser!
We went into:
Settings → Connections → Ollama
http://localhost:3000/admin/settings/connections
It showed:
http://host.docker.internal:11434
Changed to:
http://172.17.0.1:11434 (the Docker bridge gateway, so the container can reach Ollama on the host)
If UI refused to update:
Chrome DevTools → Application → Local Storage → Clear All
or:
```js
localStorage.clear()
sessionStorage.clear()
```
Refresh page.
Finally, WebUI started using the correct IP.
6. Verifying everything
From inside the container:

```bash
docker exec -it open-webui curl http://172.17.0.1:11434/api/tags
```
If you see phi + gemma JSON → success.
Then in Open WebUI:
✔ Models appear
✔ Chat works
✔ GPU is used (quick check below)
✔ Everything is finally stable
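To tick the "GPU is used" box concretely, watch the card while a chat is running; the ollama process should be holding VRAM:

```bash
# refresh every second while you send a prompt from Open WebUI
watch -n 1 nvidia-smi

# or ask Ollama which device the loaded model is on
ollama ps
```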
Conclusion
After a lot of debugging, the final working setup required fixing THREE layers:
- Ollama network binding (listening on 0.0.0.0)
- Docker environment overrides
- Open WebUI's internal + browser-stored connection configs
Once aligned, everything worked flawlessly.
👉 Check out: FreeDevTools
Any feedback or contributors are welcome!
It’s online, open-source, and ready for anyone to use.
⭐ Star it on GitHub: freedevtools