Hakan İSMAİL
Local LLM Ops: Building an Observable, GPU-Accelerated AI Cloud at Home with Docker & Grafana

Why I Built My Own Local AI Stack: Prioritizing Privacy & ROI

Integrating AI into a development workflow usually starts with a compromise: you either send your proprietary code to a third-party API (risking Data Privacy and Compliance) or you watch your "pay-per-token" bill spiral out of control (Operational Overhead).

As a Systems Administrator, I prefer a third option: Data Sovereignty. I wanted a private, secure, and fully observable AI environment, eliminating data leak risks (GDPR/KVKK compliance) while achieving significant Long-term ROI by running on my own hardware (Arch Linux + NVIDIA RTX 3050 Ti).

The real challenge wasn't just downloading an LLM; it was engineering a Scalable AI Infrastructure that runs efficiently on consumer hardware. I'm documenting how I orchestrated this microservices-based stack with Docker Compose, optimized Resource Management for limited VRAM, and established Full-Stack Observability with Grafana and Prometheus.


Scalable Microservices Architecture

I went with a modular, containerized approach to ensure internal Scalability and keep the host system clean:

  • Inference Management: Ollama
  • Secure Hardware Integration: NVIDIA Container Toolkit
  • User Experience (UX): OpenWebUI (for a polished RAG-capable interface).
  • The Observability Layer:
    • NVIDIA DCGM Exporter for real-time hardware telemetry.
    • Prometheus & Grafana for SLA monitoring and data retention.

Infrastructure & GPU Passthrough

Before deploying the containers, we must ensure the host operating system (Arch Linux in my case) allows Docker to access the GPU hardware. This is not enabled by default.

1.1 The Prerequisites: NVIDIA Container Toolkit

The bridge between Docker containers and the physical GPU is the NVIDIA Container Toolkit. Without this, the containers would only see the CPU, resulting in painfully slow inference speeds (0.5 tokens/sec).

Since I am running Arch Linux, the setup was straightforward:

# Install the toolkit
sudo pacman -S nvidia-container-toolkit

# Configure the Docker daemon to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker to apply changes
sudo systemctl restart docker
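Before moving on to Compose, it's worth confirming that the daemon actually registered the new runtime. Here is a small sketch of that check; the `check_runtime_json` helper is my own naming, not part of the toolkit:

```shell
# Verify Docker's daemon config now lists the "nvidia" runtime.
check_runtime_json() {
    # expects the output of `docker info --format '{{json .Runtimes}}'` on stdin
    if grep -q '"nvidia"'; then
        echo "nvidia runtime registered"
    else
        echo "nvidia runtime missing - re-run nvidia-ctk and restart docker"
    fi
}

docker info --format '{{json .Runtimes}}' 2>/dev/null | check_runtime_json
```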

To verify the passthrough is working, I ran a quick ephemeral container. If nvidia-smi prints the GPU stats inside Docker, we are green.

docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi

GPU Passthrough Verification (NVIDIA 590.48.01 / CUDA 13.1 on Arch Linux)

Container Orchestration

I believe in "defining once, running everywhere." Instead of running disparate docker run commands, I defined the entire stack in a single docker-compose.yml file.

This file handles network isolation (creating a private ai-net), volume persistence (so our chat history isn't lost on reboot), and most importantly, GPU resource reservation.

Here is the complete configuration:

services:
  # 🧠 1. AI ENGINE: Ollama
  # This is the backend that runs the LLM inference.
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_storage:/root/.ollama
    # Critical: This section reserves the NVIDIA GPU for this container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - ai-net

  # 💻 2. INTERFACE: OpenWebUI
  # A user-friendly frontend that connects to Ollama.
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - openwebui_storage:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped
    networks:
      - ai-net

  # 🕵️ 3. METRICS EXPORTER: NVIDIA DCGM
  # This container scrapes GPU metrics (Temp, Power, Utilization).
  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - DCGM_EXPORTER_NO_HOSTNAME=1
    ports:
      - "9400:9400"
    restart: unless-stopped
    networks:
      - ai-net

  # 🗄️ 4. TIME-SERIES DB: Prometheus
  # Collects the metrics exposed by dcgm-exporter.
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
    restart: unless-stopped
    networks:
      - ai-net

  # 📊 5. VISUALIZATION: Grafana
  # Displays the metrics in a dashboard.
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000"
    volumes:
      - grafana_storage:/var/lib/grafana
    restart: unless-stopped
    networks:
      - ai-net

volumes:
  ollama_storage:
  openwebui_storage:
  prometheus_data:
  grafana_storage:

networks:
  ai-net:
    driver: bridge

Prometheus Scraper Configuration

Prometheus needs to know exactly where to pull metrics from. I configured a 5-second scrape_interval. While 15s-30s is more common for production, a 5s interval is better for a local lab where we want to track immediate power spikes during token generation.

global:
  scrape_interval: 5s # Scrape often for real-time visibility

scrape_configs:
  - job_name: "gpu-metrics"
    static_configs:
      - targets: ["dcgm-exporter:9400"]

With the configuration files in place, a single command boots the entire cloud infrastructure:

docker compose up -d
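After the stack comes up, I like a quick smoke test across every published port. This is a sketch: the ports match the compose file above, and the paths use the standard health endpoints for each service (`/api/version` for Ollama, `/-/ready` for Prometheus, `/api/health` for Grafana):

```shell
# Poke every service the compose file publishes on localhost.
endpoints="
http://localhost:11434/api/version
http://localhost:3000
http://localhost:9400/metrics
http://localhost:9090/-/ready
http://localhost:3001/api/health
"

for url in $endpoints; do
    # print the HTTP status for each endpoint; 000 means unreachable
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" 2>/dev/null)
    echo "$url -> HTTP ${code:-000}"
done
```

Anything other than a 2xx/3xx status is a cue to check `docker compose logs` for that container.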

Stack Deployment (Ollama, DCGM Exporter, Prometheus, Grafana, OpenWebUI)

Resource Management: Optimizing for 4GB VRAM

The infrastructure setup is the foundation, but the real value lies in Cost-Sensitive Resource Allocation: selecting a model that runs effectively on limited hardware. I optimized this build for an NVIDIA RTX 3050 Ti with 4GB of VRAM.

The VRAM Constraint

Popular models like DeepSeek R1 or Llama 3 (8B) usually require 5GB to 6GB of VRAM just to load. Offloading these models to system RAM via PCIe on a 4GB card results in unusable generation speeds, often dropping to 1-2 tokens per second.

I needed a model that fits entirely within the 4GB ceiling while remaining capable for coding tasks.
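My triage used an informal rule of thumb (my own approximation, not an official formula): a Q4-quantized model costs roughly 0.55 bytes per parameter, plus about 1 GB of headroom for the KV cache and runtime. A tiny helper makes the filtering quick:

```shell
# Rough VRAM estimate for a Q4-quantized model (informal rule of thumb).
# $1 = parameter count in billions; prints an estimate in GB.
est_vram_gb() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 0.55 + 1.0 }'
}

est_vram_gb 3   # well under the 4 GB ceiling
est_vram_gb 8   # over the ceiling: spills into system RAM
```

The numbers line up with practice: 3B-class models leave context-window headroom on a 4 GB card, while 8B-class models force PCIe offloading.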

Choosing the right model: Qwen 2.5 Coder (3B)

I used Hugging Face to cross-reference benchmarks and VRAM requirements for various 1B, 3B, and 7B models. After evaluating the trade-offs between parameter count and inference speed, I settled on Qwen 2.5 Coder (3B Instruct).

Since it only occupies roughly 2.2 GB of VRAM, it leaves enough headroom for the context window without triggering a bottleneck. It's significantly faster than larger models that would force the system to swap to system RAM.

I specifically used the Instruct version; it's much better at actual code logic than the base model.

Qwen Model Card

You can pull it and start an interactive session inside the Ollama container with:

docker exec -it ollama ollama run qwen2.5-coder:3b
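With the model pulled, you can also talk to Ollama over HTTP instead of the CLI. A sketch, assuming the stack is up on the default port; `extract_response` is a crude helper of my own (use `jq` for anything serious):

```shell
# Non-streaming generate request against Ollama's HTTP API.
extract_response() {
    # naive grab of the "response" field from a single-line JSON body
    sed -n 's/.*"response":"\([^"]*\)".*/\1/p'
}

curl -s http://localhost:11434/api/generate \
    -d '{"model":"qwen2.5-coder:3b","prompt":"Reply with OK","stream":false}' \
    2>/dev/null | extract_response
```

This is the same endpoint OpenWebUI calls under the hood via `OLLAMA_BASE_URL`.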

Verifying Telemetry

Once the containers are up, query the DCGM exporter directly to ensure the GPU is communicating correctly with Docker:

curl http://localhost:9400/metrics

Raw DCGM Metrics Exposer (Port 9400)

You'll see keys like DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_POWER_USAGE. Now we need to visualize these in Grafana.
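To cut the payload down to the values the dashboard will actually chart, a simple filter works. A sketch: `DCGM_FI_DEV_FB_USED` is the framebuffer (VRAM) usage counter, and the helper name is mine:

```shell
# Keep only the headline GPU gauges from the exporter's payload.
headline_metrics() {
    grep -E '^DCGM_FI_DEV_(GPU_TEMP|POWER_USAGE|FB_USED)'
}

curl -s --max-time 3 http://localhost:9400/metrics 2>/dev/null \
    | headline_metrics || echo "dcgm-exporter not reachable on :9400"
```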


Dashboard Configuration

Since Grafana and Prometheus are on the same Docker network (ai-net), they communicate via container names.

  1. Log in to Grafana (http://localhost:3001, default admin/admin).
  2. Add Prometheus as a data source and use the URL: http://prometheus:9090.

Grafana Data Source Connection

  3. Import Dashboard ID 12239 (NVIDIA DCGM Exporter).

NVIDIA Dashboard Import

The result is a comprehensive command center for your local AI hardware.
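The panels are just PromQL underneath, so the same series can be pulled by hand from Prometheus' HTTP API. A sketch: `latest_value` is my own quick-and-dirty parser; `jq '.data.result[0].value[1]'` is the robust version:

```shell
# Instant query for the current GPU temperature.
latest_value() {
    # pull the sample value out of an instant-query JSON reply
    sed -n 's/.*"value":\[[^,]*,"\([^"]*\)".*/\1/p'
}

curl -s --max-time 3 \
    'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_TEMP' \
    2>/dev/null | latest_value
```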

Active GPU Dashboard (Idle State: ~54.1°C / 10.8 W)


Real-World Business Value & Metrics

I tested the stack with a typical DevOps automation request:

"Write a bash script that detects zombie processes and logs the action."

Full-Stack Observability Analysis

Checking the Grafana dashboard during inference gives the ultimate validation of the setup, proving System Reliability:

Final Hardware Analysis (Peak Inference: 64°C / 20.0 W / 2.8 GB VRAM)

  • VRAM Efficiency: Usage peaked at 2.8 GB, comfortably inside the 4 GB ceiling. This confirms the 3B model is the sweet spot: fluent generation with Zero Licensing Overhead.

  • Scalability Benchmarks: ~438 prompt tokens/s and ~10 generation tokens/s, proving that high-performance AI is possible without cloud dependency.

Scalability Benchmarks

  • Operational Sustainability: The GPU pulled a steady 20W and stayed at 64°C. This is a Low-Cost/High-Performance local solution.
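The generation-rate figures above come straight out of Ollama's response metadata: every non-streaming `/api/generate` reply carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds). Converting them is one line of awk (the helper name is mine):

```shell
# tokens/sec = eval_count / eval_duration * 1e9
tokens_per_sec() {
    awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", c / d * 1e9 }'
}

tokens_per_sec 100 10000000000   # 100 tokens in 10s -> 10.0
```

The same fields exist for the prompt phase (`prompt_eval_count` / `prompt_eval_duration`), which is where the ~438 prompt tokens/s number comes from.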

Summary: Driving Innovation Internally

Building a local AI stack isn't just a technical exercise; it's a strategic move for Business Continuity and Data Security. By combining Docker, NVIDIA's toolkit, and a comprehensive observability layer, I've created a dev environment that's both secure and cost-efficient, proving that significant AI ROI can be achieved on existing on-premise hardware.
