12 GPU Checks That Cut My Local AI Agent Setup Time by 75%
Running a local AI agent on a model like qwen3.5:9b with a consumer GPU often fails with "Out of VRAM" or "model loading failed" errors, and the cause is usually misconfiguration, not insufficient hardware. My RTX 5070 Ti 16GB initially seemed like overkill, but testing showed that VRAM needs don't scale linearly with model size.
Model Weight vs. Actual VRAM Usage
- qwen3.5:9b (Q4_K_M): 6.6GB (model weights) + KV cache + working memory + framework overhead (Ollama)
- Peak VRAM Usage with 4K context: Easily exceeds 10GB, risking OOM on 12GB GPUs
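The breakdown above can be turned into a back-of-envelope estimate. Note that the KV-cache and overhead figures below are my rough assumptions for a ~9B model, not measured constants; substitute your own numbers:

```shell
# Rough peak-VRAM estimate; kv_gb_per_1k and overhead_gb are assumptions,
# not measured constants -- substitute your own numbers.
est=$(awk 'BEGIN {
  model_gb     = 6.6    # Q4_K_M weight file size (from the breakdown above)
  ctx_tokens   = 4096   # planned context length
  kv_gb_per_1k = 0.5    # assumed KV-cache cost per 1k tokens for a ~9B model
  overhead_gb  = 1.5    # assumed Ollama + CUDA context overhead
  printf "%.1f", model_gb + ctx_tokens / 1024 * kv_gb_per_1k + overhead_gb
}')
echo "Estimated peak VRAM: ${est} GB"   # → Estimated peak VRAM: 10.1 GB
```

At 4K context this lands just above 10GB, which is exactly why a 12GB card that "should" fit the 6.6GB weights can still OOM.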
Code to Check Actual VRAM Usage (NVIDIA)
nvidia-smi --query-gpu=memory.used --format=csv,noheader
Bigger Isn't Always Better
- Mid-range newer GPUs (e.g., RTX 4060 Ti 16GB, RX 7700 XT) often outperform older high-end cards due to better architecture.
- Use Case Determines VRAM Need:
- Simple tasks: 6-8GB (e.g., RTX 3060 12GB)
- Longer contexts: 10-12GB+
- Near-cloud tasks: 16GB+ (but overkill for most)
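The tiers above can be expressed as a tiny helper that maps free VRAM (in MiB, as nvidia-smi reports it) to a recommendation. The thresholds are this article's rough figures, not hard limits:

```shell
# Map free VRAM (MiB) to the article's rough use-case tiers.
# Thresholds are approximate guidance, not hard limits.
vram_tier() {
  if   [ "$1" -ge 16000 ]; then echo "near-cloud tasks (16GB+)"
  elif [ "$1" -ge 10000 ]; then echo "longer contexts (10-12GB+)"
  elif [ "$1" -ge 6000  ]; then echo "simple tasks (6-8GB)"
  else                          echo "below recommended minimum"
  fi
}

vram_tier 12288   # → longer contexts (10-12GB+)
```

Feed it real numbers with `vram_tier "$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)"`.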
Honesty Moment: I initially wasted money on an overpowered GPU before realizing a 12GB mid-range card sufficed.
GPU Selection Beyond Specs
- Driver & Framework Support:
- NVIDIA: Solid CUDA support (especially RTX 30/40 series)
- AMD: ROCm support, but limited for advanced features
- Example Compatibility Issue: qwen3.5:9b Q4_K_M runs on both RTX 4060 Ti and RX 7700 XT, but NVIDIA offers better stability.
- Quantization Compatibility:
- Q4_K_M: Robust (CUDA 11.7+)
- Q5_K_M: Newer drivers required
- Q6_K, Extreme Quantizations: Limited to newer/higher-end cards
- Real-World Impact: Testing Q2_K on an older GTX 1080 resulted in consistent segfaults.
Safe Quantization Starter
Note that `ollama run` has no `--quantization` flag; the quantization is encoded in the model tag (exact tag names vary per model, so check `ollama list` or the model library):
ollama run qwen3.5:9b-q4_K_M
Pre-Flight Environment Checks
Skip these at your peril; they save hours of debugging:
# 1. GPU Driver Check
nvidia-smi
# 2. CUDA Version Check
nvidia-smi | grep "CUDA Version"
# 3. OS Type (WSL2 vs. Native Linux)
uname -r
# 4. Free VRAM Check
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# 5. Docker GPU Support Check
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
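The five checks can be bundled into one script that degrades gracefully when a tool is missing. The WSL2 detection via the kernel release string is a common heuristic, not an official API:

```shell
# Bundled pre-flight check: verify each tool exists before probing it.
probe() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1 found"
  else
    echo "MISSING: $1 (install it before continuing)"
  fi
}

probe nvidia-smi   # driver, CUDA version, and VRAM checks all need this
probe docker       # needed for the containerized setup below
probe curl         # needed to talk to the Ollama API

# WSL2 kernels embed "microsoft" in the release string (heuristic).
case "$(uname -r)" in
  *[Mm]icrosoft*) echo "Environment: WSL2" ;;
  *)              echo "Environment: native Linux (or other)" ;;
esac
```

Run it before every fresh setup; a MISSING line here is almost always the real cause of a later "model loading failed".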
Docker Setup for Reproducible Environments
Why Docker:
- Environment isolation
- Easy backup & migration
- Resource limits
- Fast recovery
Minimal Viable docker-compose.yml (NVIDIA)
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    shm_size: '1gb'

volumes:
  ollama_data:
Installing NVIDIA Container Toolkit (Ubuntu 22.04 Example)
# ... (installation steps as provided in the chapter)
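For reference, the standard steps per NVIDIA's documentation look roughly like the following. Verify against the current docs before running, since the repository layout changes occasionally:

```shell
# NVIDIA Container Toolkit on Ubuntu 22.04 (standard documented steps;
# check NVIDIA's current install guide before running).
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # wires the runtime into Docker
sudo systemctl restart docker
```

After the restart, re-run check #5 above (the `docker run --gpus all` test) to confirm containers can see the GPU.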
Ollama Installation & Basic Operations
Method One: Via Docker (Recommended)
docker-compose up -d
Method Two: Direct Host Installation
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
Essential Ollama Commands
- Download & Run Model:
ollama run qwen3.5:9b
- List Models:
ollama list
- Remove Model:
ollama rm <model_name>
- Model Info:
ollama show <model_name>
- API Call Example (the "model" field is required):
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:9b", "prompt": "Hello, how are you?", "stream": false}'
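The API returns JSON. If jq isn't installed, a sed one-liner can pull out the `response` field; it's shown here against a canned sample reply so the snippet runs without a live server:

```shell
# Extract the "response" field from an /api/generate reply.
# Canned sample used so this runs without a live server; in practice,
# pipe the curl output into the same sed command.
sample='{"model":"qwen3.5:9b","response":"Hello! How can I help?","done":true}'
text=$(printf '%s' "$sample" | sed -n 's/.*"response":"\([^"]*\)".*/\1/p')
echo "$text"   # → Hello! How can I help?
```

This quick hack breaks on responses containing escaped quotes; for anything beyond smoke tests, use `jq -r '.response'` instead.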
Resources
- Product Link for Advanced Setup Guides: https://jacksonfire526.gumroad.com?utm_source=devto&utm_medium=article&utm_campaign=2026-04-02-local-agent-playbook
- Free Resource: GPU Compatibility Checker Script: https://jacksonfire526.gumroad.com/l/cdliu?utm_source=devto&utm_medium=article&utm_campaign=2026-04-02-local-agent-playbook
Your Turn: What's the most common GPU misconfiguration you've encountered when setting up a local AI agent, and how did you resolve it?