Build, Run, Chat: Creating a Self-Hosted LLM Setup

Over the past couple of years, self-hosting large language models (LLMs) has gone from being a niche experiment to a serious option for developers, researchers, and even small teams. Instead of depending on cloud APIs, you can run models like Llama 3, Mistral, or Gemma directly on your own system. This comes with three big advantages: your data stays private, you avoid API costs, and you can customize the environment however you want.

What’s even more encouraging is that modern tools have simplified the process to the point where you don’t need deep DevOps expertise. With Docker, Ollama for model management, and Open WebUI for a user-friendly interface, you can set up a local ChatGPT-like environment in just a few steps.

Let’s break down the process.

Why Bother Self-Hosting?

Before diving into the how, it’s worth revisiting the why.

  1. Privacy and Control: Everything stays on your machine or server, which is critical for sensitive or proprietary data.
  2. Cost Efficiency: While cloud-based APIs look affordable at first, costs can balloon with heavy usage. Once your hardware is set up, local inference has no per-token fees.
  3. Flexibility: You can switch models, run multiple ones side by side, fine-tune them for niche tasks, and integrate them directly into your workflows.

Prerequisites

Self-hosting doesn’t require an advanced setup to get started.

  • Hardware: Small models like Llama 3.2 3B run on machines with 8 GB of RAM, while larger ones may demand 40 GB or more. A GPU helps but isn’t mandatory.
  • Software: Docker is the only must-have. Since everything runs in containers, you don’t clutter your system with dependencies.

Step 1: Install Docker

Head to docker.com and download Docker Desktop for your OS. Installation is straightforward: run the installer, follow the prompts, and restart if required. Once installed, check it with:

docker --version

Allocate enough resources in Docker’s settings—at least 8 GB RAM and a few CPU cores are recommended for smooth performance.
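
If you want to confirm what the Docker engine actually has to work with, docker info reports the CPU count and total memory it can see. The grep filter below is just a convenience and assumes a Unix-like shell:

docker info | grep -E 'CPUs|Total Memory'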

Step 2: Run Ollama in Docker

Ollama makes managing models painless. To spin it up in a container, run:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

This command sets up persistent storage, exposes Ollama’s API, and runs it in the background. Verify it’s running with:

docker ps
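
If you have an NVIDIA GPU and the NVIDIA Container Toolkit installed, you can pass the GPU through to the container for much faster inference. This is a variant of the same command; remove the CPU-only container first if you already started one:

docker rm -f ollama
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama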

Step 3: Download a Model

Now that Ollama is running, it’s time to pull in a model. For example:

docker exec -it ollama ollama pull llama3.2:3b

This downloads the Llama 3.2 3B model (about 2 GB). You can try heavier options later, such as an 8B model (~4.7 GB) or even a 70B one (~40 GB).
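
Before wiring up a web interface, you can sanity-check the model directly in your terminal. ollama run opens an interactive chat session inside the container (type /bye to exit):

docker exec -it ollama ollama run llama3.2:3b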

Step 4: Add a Chat Interface with Open WebUI

Working through the command line is functional, but not very friendly. Open WebUI provides a clean, browser-based interface. Launch it with:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Once it’s running, visit http://localhost:3000 and create an admin account. You’ll then have a familiar chat-style interface with dropdowns for selecting different models.
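
If your Ollama instance runs on a different machine, you can point Open WebUI at it with the OLLAMA_BASE_URL environment variable. The address below is only a placeholder for your own server:

docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://192.168.1.50:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main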

Sharing Beyond Your Local Machine

Sometimes you’ll want teammates or collaborators to access your setup remotely. Tools like Pinggy allow you to securely expose your local interface without fiddling with router settings. With a simple command, you get a temporary public HTTPS link that can be shared and revoked anytime.

Run this command to share your Open WebUI interface:

ssh -p 443 -R0:localhost:3000 free.pinggy.io

It will generate a public HTTPS URL like https://abc123.pinggy.link that you can share with others.
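
Free tunnels are temporary, and the connection can drop if it sits idle. Adding a standard OpenSSH keep-alive option (not a Pinggy-specific flag) makes the session more stable:

ssh -p 443 -R0:localhost:3000 -o ServerAliveInterval=30 free.pinggy.io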

(Screenshots: the Open WebUI chat interface and the Pinggy public URL.)

Where to Go Next

At this stage, you’ve got a fully functional LLM running locally with a modern web interface. From here, you can:

  • Experiment with different models (e.g., Mistral for reasoning, CodeLlama for programming).
  • Fine-tune or adapt models for domain-specific tasks.
  • Connect Ollama’s API to other applications for automation (a minimal example follows below).
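
As a starting point for that last item, Ollama exposes a simple HTTP API on port 11434. A minimal, non-streaming request to the generate endpoint looks like this; the prompt is just an example:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize what a Docker volume does in one sentence.",
  "stream": false
}'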

The beauty of this approach is its flexibility. You can start with a modest setup and scale up as your needs grow—whether that means switching to larger models, running multiple containers, or deploying on dedicated hardware.

Conclusion

Self-hosting LLMs used to be a complex project reserved for AI labs, but containerized tools like Ollama and Open WebUI have brought it within reach for almost anyone. With a little time and curiosity, you can build a private AI environment tailored to your needs—one that grows with you, saves money at scale, and keeps your data where it belongs: in your own hands.

