Hello, I'm Maneshwar. I'm currently building FreeDevTools online, **one place for all dev tools, cheat codes, and TLDRs** — a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.
Running modern small LLMs locally has become insanely easy on Linux with GPU acceleration, Docker, and Ollama, but there are a few gotchas.
This post walks through the entire real-world setup:
- Choosing the right models for a 4GB GPU
- Installing Phi & Gemma with Ollama
- Fixing NVIDIA Docker GPU runtime
- Evaluating WebUI options (Jan vs AnythingLLM vs Open WebUI)
- Running Open WebUI fully GPU-accelerated
- Fixing Ollama networking issues
Let’s go step by step.
1. Choosing the right small models
You have:
- GTX 1650 (4GB VRAM)
- 16GB RAM
So the “small but surprisingly powerful” models are ideal.
Phi-3 / Phi-2 (Ollama: phi3, phi)
- Extremely fast
- Great reasoning for size
- Runs smoothly on 4GB
Gemma 2 (2B) (gemma2:2b)
- Google quality
- Very clean outputs
- Heavier than Phi but still fits VRAM
TinyLlama (1.1B)
- Ultra fast
- Basic reasoning
- Barebones but usable
Qwen 1.8B
- Strong multilingual
- Very fast
- Great value model
Ranking for everyday use:
| Model | Speed | Quality | GPU RAM | Notes |
|---|---|---|---|---|
| Phi-3 | 🚀 | ⭐⭐⭐ | fits | Best small model |
| Gemma 2B | ⚡ | ⭐⭐⭐⭐ | fits | Better answers, slower |
| Qwen 1.8B | 🚀 | ⭐⭐⭐ | fits | Multilingual beast |
| TinyLlama 1.1B | 🛸 | ⭐⭐ | tiny | Only for basic chat |
2. Installing LLMs using Ollama
Ollama makes model management dead simple.
Install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Pull models:
```bash
ollama pull phi:latest
ollama pull gemma2:2b
```
Verify:
```bash
ollama list
```
Both models should now show up in the list. Works.
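Before adding any UI on top, it's worth a quick smoke test against Ollama's HTTP API, plus a check that the model actually lands on the GPU. A minimal sketch (the prompt is arbitrary; `ollama ps` is available in recent Ollama releases):

```bash
# one-off, non-streaming generation via Ollama's REST API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "phi", "prompt": "Say hi in five words.", "stream": false}'

# see what's loaded and whether it runs on GPU or CPU
ollama ps

# confirm VRAM usage on the card
nvidia-smi
```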
3. Installing NVIDIA Container Toolkit (for GPU support)
Docker cannot use your GPU until this is installed correctly.
The NVIDIA apt repo config occasionally breaks, so start from a clean slate:
```bash
sudo rm /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo rm /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
```
Add clean source:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/libnvidia-container.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#signed-by=[^]]*#signed-by=/usr/share/keyrings/libnvidia-container.gpg#' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
Install:
```bash
sudo apt update
sudo apt install -y nvidia-container-toolkit
```
Enable Docker GPU runtime:
```bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
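For reference, `nvidia-ctk runtime configure` simply registers the runtime in `/etc/docker/daemon.json`. After running it, the file should contain something along these lines (exact path may differ by distro):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```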
Test:
```bash
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
```

If you see your GPU inside the container, you're done.
4. Choosing a Web UI for Local LLMs
1. Jan AI (JanHQ) — 39.3k stars
Repo: https://github.com/janhq/jan
A clean Electron-based desktop app focused on being an offline ChatGPT replacement.
2. AnythingLLM — 51k stars
Repo: https://github.com/Mintplex-Labs/anything-llm
A full RAG framework and knowledge-base system with a UI on top.
3. Open WebUI — 115k stars (the one I used)
Repo: https://github.com/open-webui/open-webui
A fast, modern UI for LLMs with full GPU + Ollama support.
I went with Open WebUI because:
- It integrates best with Ollama
- Has GPU-optimized builds
- Has the most features
- Supports future scaling (agents, workflows, extensions)
- Huge community (115k+ stars)
Running Open WebUI with CUDA support
Open WebUI is the best local AI UI and supports Ollama beautifully.
Run container:
```bash
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -e OLLAMA_BASE_URL=http://172.17.0.1:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:cuda
```
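If you prefer Compose, a roughly equivalent `docker-compose.yml` would look like the sketch below. It assumes the same bridge-gateway address and the CUDA image tag from the command above; bring it up with `docker compose up -d`.

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:cuda
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://172.17.0.1:11434
    volumes:
      - open-webui:/app/backend/data
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  open-webui:
```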
5. Fixing “Open WebUI cannot connect to Ollama”
This was the painful part.
You will absolutely hit this error:
```
Cannot connect to host host.docker.internal:11434
```
Even though:
- Ollama server is running
- Models exist
- Curl works inside container
Root causes I fixed:
Fix A — Ollama was only listening on 127.0.0.1
Default Ollama binds to loopback only.
We updated systemd service:
```bash
sudo vim /etc/systemd/system/ollama.service
```
Add under the `[Service]` section:

```ini
Environment="OLLAMA_HOST=0.0.0.0:11434"
```
Reload:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Verify:
```bash
sudo ss -tulpn | grep 11434
```

It must show Ollama listening on `0.0.0.0:11434`, not `127.0.0.1`.
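You can also prove the binding from the container side before blaming Open WebUI. A quick check from a throwaway container, assuming the default Docker bridge where the host is reachable at 172.17.0.1:

```bash
# should print Ollama's version JSON if the binding is correct
docker run --rm curlimages/curl -s http://172.17.0.1:11434/api/version
```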
Fix B — Open WebUI DB saved wrong host (host.docker.internal)
Earlier runs stored the bad value in SQLite config.
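If you want to confirm the stale value first, you can grep the data volume directly. This assumes Open WebUI keeps its settings in a `webui.db` SQLite file inside the data directory:

```bash
# look for the bad host inside the existing Open WebUI volume
docker run --rm -v open-webui:/data alpine \
  sh -c "grep -qa 'host.docker.internal' /data/webui.db && echo 'stale URL found' || echo 'clean'"
```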
We wiped the volume:
```bash
docker rm -f open-webui
docker volume rm open-webui
```
Recreated container clean.
Fix C — Browser localStorage had the wrong URL saved
Open WebUI saves connection settings in the browser!
We went into:
Settings → Connections → Ollama
http://localhost:3000/admin/settings/connections
It showed:
http://host.docker.internal:11434
Changed to:
http://172.17.0.1:11434 (the Docker bridge gateway, so the container can reach Ollama on the host)
If UI refused to update:
Chrome DevTools → Application → Local Storage → Clear All
or:
```js
localStorage.clear()
sessionStorage.clear()
```
Refresh page.
Finally, WebUI started using the correct IP.
6. Verifying everything
From inside the container:

```bash
docker exec -it open-webui curl http://172.17.0.1:11434/api/tags
```
If you see phi + gemma JSON → success.
Then in Open WebUI:
✔ Models appear
✔ Chat works
✔ GPU is used (quick check below)
✔ Everything is finally stable
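To tick the "GPU is used" box concretely, watch the card while a chat is running; the ollama process should be holding VRAM:

```bash
# refresh every second while you send a prompt from Open WebUI
watch -n 1 nvidia-smi

# or ask Ollama which device the loaded model is on
ollama ps
```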
Conclusion
After a lot of debugging, the final working setup required fixing THREE layers:
- Ollama network binding (listening on 0.0.0.0)
- Docker environment overrides
- Open WebUI's internal + browser-stored connection configs
Once aligned, everything worked flawlessly.
👉 Check out: FreeDevTools
Any feedback or contributors are welcome!
It’s online, open-source, and ready for anyone to use.
⭐ Star it on GitHub: freedevtools