Why I Built My Own Local AI Stack: Prioritizing Privacy & ROI
Integrating AI into a development workflow usually starts with a compromise: you either send your proprietary code to a third-party API (a Data Privacy and Compliance risk) or you watch your "pay-per-token" bill spiral out of control (Operational Overhead).
As a Systems Administrator, I prefer a third option: Data Sovereignty. I wanted a private, secure, and fully observable AI environment, eliminating data leak risks (GDPR/KVKK compliance) while achieving significant Long-term ROI by running on my own hardware (Arch Linux + NVIDIA RTX 3050 Ti).
The real challenge wasn't just downloading an LLM; it was engineering a Scalable AI Infrastructure that runs efficiently on consumer hardware. I'm documenting how I orchestrated this microservices-based stack with Docker Compose, optimized Resource Management for limited VRAM, and established Full-Stack Observability with Grafana and Prometheus.
Scalable Microservices Architecture
I went with a modular, containerized approach to ensure internal Scalability and keep the host system clean:
- Inference Management: Ollama
- Secure Hardware Integration: NVIDIA Container Toolkit
- User Experience (UX): OpenWebUI (for a polished RAG-capable interface)
- The Observability Layer:
  - NVIDIA DCGM Exporter for real-time hardware telemetry
  - Prometheus & Grafana for SLA monitoring and data retention
Infrastructure & GPU Passthrough
Before deploying the containers, we must ensure the host operating system (Arch Linux in my case) allows Docker to access the GPU hardware. This is not enabled by default.
1.1 The Prerequisites: NVIDIA Container Toolkit
The bridge between Docker containers and the physical GPU is the NVIDIA Container Toolkit. Without this, the containers would only see the CPU, resulting in painfully slow inference speeds (on the order of 0.5 tokens/sec).
Since I am running Arch Linux, the setup was straightforward:
# Install the toolkit
sudo pacman -S nvidia-container-toolkit
# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker to apply changes
sudo systemctl restart docker
To verify the passthrough is working, I ran a quick ephemeral container. If nvidia-smi prints the GPU stats inside Docker, we are green.
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
Container Orchestration
I believe in "defining once, running everywhere." Instead of running disparate docker run commands, I defined the entire stack in a single docker-compose.yml file.
This file handles network isolation (creating a private ai-net), volume persistence (so our chat history isn't lost on reboot), and most importantly, GPU resource reservation.
Here is the complete configuration:
services:
  # 🧠 1. AI ENGINE: Ollama
  # This is the backend that runs the LLM inference.
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_storage:/root/.ollama
    # Critical: This section reserves the NVIDIA GPU for this container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - ai-net

  # 💻 2. INTERFACE: OpenWebUI
  # A user-friendly frontend that connects to Ollama.
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui_storage:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped
    networks:
      - ai-net

  # 🕵️ 3. METRICS EXPORTER: NVIDIA DCGM
  # This container scrapes GPU metrics (Temp, Power, Utilization).
  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - DCGM_EXPORTER_NO_HOSTNAME=1
    ports:
      - "9400:9400"
    restart: unless-stopped
    networks:
      - ai-net

  # 🗄️ 4. TIME-SERIES DB: Prometheus
  # Collects the metrics exposed by dcgm-exporter.
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    restart: unless-stopped
    networks:
      - ai-net

  # 📊 5. VISUALIZATION: Grafana
  # Displays the metrics in a dashboard.
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000"
    volumes:
      - grafana_storage:/var/lib/grafana
    restart: unless-stopped
    networks:
      - ai-net

volumes:
  ollama_storage:
  openwebui_storage:
  prometheus_data:
  grafana_storage:

networks:
  ai-net:
    driver: bridge
Prometheus Scraper Configuration
Prometheus needs to know exactly where to pull metrics from. I configured a 5-second scrape_interval. While 15s-30s is more common for production, a 5s interval is better for a local lab where we want to track immediate power spikes during token generation.
global:
  scrape_interval: 5s # Scrape often for real-time visibility

scrape_configs:
  - job_name: "gpu-metrics"
    static_configs:
      - targets: ["dcgm-exporter:9400"]
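One practical consequence of the 5s interval is sample volume. A quick back-of-envelope check (plain Python, no external dependencies; the 30s comparison interval is just an illustrative production baseline) shows how many samples per series Prometheus ingests per day:

```python
# Back-of-envelope: samples ingested per time series per day
# for a given Prometheus scrape_interval (in seconds).

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def samples_per_day(scrape_interval_s: int) -> int:
    """Number of scrapes (and thus samples per series) in one day."""
    return SECONDS_PER_DAY // scrape_interval_s

lab = samples_per_day(5)    # 5s lab interval  -> 17280 samples/series/day
prod = samples_per_day(30)  # 30s prod interval -> 2880 samples/series/day
print(lab, prod, lab // prod)  # the 5s interval stores 6x more data
```

For a single-GPU lab this extra volume is negligible, which is why I kept the aggressive interval.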
With the configuration files in place, a single command boots the entire stack:
docker compose up -d
Resource Management: Optimizing for 4GB VRAM
The infrastructure setup is the foundation, but Cost-Sensitive Resource Allocation (selecting a model that runs effectively on limited hardware) is where the real value lies. I optimized this build for an NVIDIA RTX 3050 Ti with 4GB of VRAM.
The VRAM Constraint
Popular models like DeepSeek R1 or Llama 3 (8B) usually require 5GB to 6GB of VRAM just to load. Offloading these models to system RAM via PCIe on a 4GB card results in unusable generation speeds, often dropping to 1-2 tokens per second.
I needed a model that fits entirely within the 4GB ceiling while remaining capable for coding tasks.
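Before downloading anything, a rough estimate helps rule models in or out. The sketch below is a back-of-envelope calculation, not a measurement: the ~4.5 bits/weight figure approximates a typical 4-bit quantization (Q4_K_M-style), and the fixed overhead allowance for CUDA context and KV cache is an assumption.

```python
def estimate_vram_gb(params_billions: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 0.8) -> float:
    """Rough VRAM estimate: quantized weights + fixed runtime overhead.

    overhead_gb is an assumed allowance for CUDA context, activations
    and a modest KV cache -- real usage grows with context length.
    """
    weights_gb = params_billions * bits_per_weight / 8  # bits -> bytes, in GB
    return weights_gb + overhead_gb

print(estimate_vram_gb(3.0))  # ~2.5 GB -> fits a 4 GB card
print(estimate_vram_gb(8.0))  # ~5.3 GB -> does not fit
```

The numbers line up with the observed behavior: 8B models spill past 4GB, while a 3B model leaves comfortable headroom.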
Choosing the right model: Qwen 2.5 Coder (3B)
I used Hugging Face to cross-reference benchmarks and VRAM requirements for various 1B, 3B, and 7B models. After evaluating the trade-offs between parameter count and inference speed, I settled on Qwen 2.5 Coder (3B Instruct).
Since it only occupies roughly 2.2 GB of VRAM, it leaves enough headroom for the context window without triggering a bottleneck. It's significantly faster than larger models that would force the system to swap to system RAM.
I specifically used the Instruct version; it's much better at actual code logic than the base model.
You can pull it with:
docker exec -it ollama ollama pull qwen2.5-coder:3b
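Once the model is pulled, you don't have to go through OpenWebUI: Ollama exposes its HTTP API on port 11434. A minimal standard-library sketch (the /api/generate endpoint and payload shape follow Ollama's documented API; adjust the host and model name to your setup):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Non-streaming generation request for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama container and return the reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires the ollama container to be running):
# print(generate("qwen2.5-coder:3b", "Write a one-line bash ping check."))
```

This is handy for scripting batch tasks against the same model the UI uses.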
Verifying Telemetry
Once the containers are up, query the DCGM exporter directly to ensure the GPU is communicating correctly with Docker:
curl http://localhost:9400/metrics
You'll see keys like DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_POWER_USAGE. Now we need to visualize these in Grafana.
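If you want those values programmatically (say, for a cron-based alert) rather than via Grafana, the Prometheus text exposition format is easy to parse. A minimal sketch that ignores labels and keeps the first sample per metric (a production setup would query Prometheus's HTTP API instead):

```python
import re

def parse_exposition(text: str) -> dict:
    """Parse Prometheus text exposition format into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = re.match(r"([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)", line)
        if m and m.group(1) not in metrics:
            metrics[m.group(1)] = float(m.group(3))
    return metrics

sample = """
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-xxxx"} 64
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-xxxx"} 20.5
"""
print(parse_exposition(sample))
# {'DCGM_FI_DEV_GPU_TEMP': 64.0, 'DCGM_FI_DEV_POWER_USAGE': 20.5}
```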
Dashboard Configuration
Since Grafana and Prometheus are on the same Docker network (ai-net), they communicate via container names.
1. Log in to Grafana (http://localhost:3001, default credentials admin/admin).
2. Add Prometheus as a data source with the URL http://prometheus:9090.
3. Import Dashboard ID 12239 (NVIDIA DCGM Exporter).
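Those clicks can also be scripted: Grafana's HTTP API accepts a data-source definition at POST /api/datasources. A hedged standard-library sketch (the admin:admin credentials and URLs match the defaults above; change them if yours differ):

```python
import base64
import json
import urllib.request

GRAFANA_URL = "http://localhost:3001"
AUTH = base64.b64encode(b"admin:admin").decode()  # default credentials

def datasource_payload() -> dict:
    """Prometheus data source, reachable via its ai-net container name."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus:9090",
        "access": "proxy",
        "isDefault": True,
    }

def create_datasource() -> dict:
    """Register the data source via Grafana's HTTP API."""
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(datasource_payload()).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {AUTH}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires the grafana container to be running):
# print(create_datasource())
```

Useful if you ever rebuild the stack from scratch and want the dashboard wired up without manual steps.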
The result is a comprehensive command center for your local AI hardware.
Real-World Business Value & Metrics
I tested the stack with a typical DevOps automation request:
"Write a bash script that detects zombie processes and logs the action."
Full-Stack Observability Analysis
Checking the Grafana dashboard during inference gives the ultimate validation of the setup, proving System Reliability:
- VRAM Efficiency: Usage peaked at 2.8 GB, confirming that the 3B model is the sweet spot: fluent generation with Zero Licensing Overhead.
- Scalability Benchmarks: ~438 prompt tokens/s and ~10 generation tokens/s, proving that practical AI performance is possible without cloud dependency.
- Operational Sustainability: The GPU pulled a steady 20W and stayed at 64°C, making this a Low-Cost/High-Performance local solution.
Summary: Driving Innovation Internally
Building a local AI stack isn't just a technical exercise; it's a strategic move for Business Continuity and Data Security. By combining Docker, NVIDIA's toolkit, and a comprehensive observability layer, I've created a dev environment that's both secure and cost-efficient, proving that significant AI ROI can be achieved on existing on-premise hardware.