Local LLM Setup Guide (v25)
Overview & Prerequisites
Plan your hardware before installing anything. 16GB of RAM is the practical minimum for 7B-8B models, and 32GB gives you headroom. An NVIDIA GPU with 8GB+ VRAM lets you run larger models and speeds up the ones that already fit. Without GPU acceleration, inference still works but is significantly slower.
Hardware Requirements:
- CPU: Modern x86_64 (Intel/AMD) with 4+ cores
- RAM: 16GB minimum (32GB recommended)
- Storage: 50GB+ free space for models
- GPU: NVIDIA 8GB+ VRAM optional but recommended
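The requirements above can be checked from a script before you download any models. The following is a minimal sketch for Linux, not a definitive tool: the function names and the thresholds are assumptions taken from the list above, and the RAM threshold is set slightly below 16 because a 16GB machine reports a little less than 16 GiB as usable memory.

```python
import os
import shutil


def parse_mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def check_hardware(path="/", min_cores=4, min_ram_gib=15, min_disk_gb=50):
    """Compare this machine against the guide's minimums (Linux only)."""
    with open("/proc/meminfo") as f:
        ram_gib = parse_mem_total_gib(f.read())
    free_gb = shutil.disk_usage(path).free / 1e9
    cores = os.cpu_count() or 0
    return {
        "cores_ok": cores >= min_cores,
        "ram_ok": ram_gib >= min_ram_gib,
        "disk_ok": free_gb >= min_disk_gb,
    }
```

Calling `check_hardware()` returns a dictionary of three booleans. GPU detection is left out on purpose; `nvidia-smi` is the simpler check for that.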
Prerequisites:
# Install required packages
sudo apt update
sudo apt install curl git cmake build-essential python3-pip
Framework Comparison
| Framework | GPU Support | Ease | Performance | Best For |
|---|---|---|---|---|
| llama.cpp | Yes | Easy | Fast | Quick prototyping |
| Ollama | Yes | Very Easy | Fast | Development workflows |
| vLLM | Yes | Medium | Extremely Fast | Production inference |
| LocalAI | Yes | Easy | Fast | API-first applications |
Recommendation: Ollama + llama.cpp - Best balance of ease and performance for most use cases.
Step-by-Step Installation
Install Ollama
# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start service
sudo systemctl start ollama
sudo systemctl enable ollama
# Verify installation
ollama --version
Install llama.cpp (for advanced use)
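# Clone and build (recent llama.cpp versions build with CMake instead of make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Verify build (the CLI binary was renamed from main to llama-cli)
./build/bin/llama-cli --help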
Model Selection Guide
Choose based on your use case:
Code Generation:
ollama pull codellama:7b-instruct
ollama pull wizardcoder:15b
General Purpose:
ollama pull llama3:8b
ollama pull mistral:7b
Small/Edge:
ollama pull phi3:3.8b
ollama pull tinyllama:1.1b
Quantization Types Explained
Quantization stores model weights at lower precision, which shrinks file size and memory use at some cost in output quality:
- Q4_K_M: 4-bit, good balance of size and quality
- Q5_K_M: 5-bit, better quality than Q4_K_M at a larger size
- Q8_0: 8-bit, minimal quality loss
- F16: 16-bit floating point, unquantized (largest size)
Example quantization command:
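# Quantize an F16 GGUF model to Q4_K_M (arguments: input file, output file, type)
# The input file name is an example; the binary is in the repository root for older Makefile builds
./build/bin/llama-quantize models/llama-3-8b.F16.gguf models/llama-3-8b.Q4_K_M.gguf Q4_K_M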
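To see how the quantization type translates into disk and memory footprint, you can estimate file size as parameter count times bits per weight, divided by 8. The sketch below is a back-of-the-envelope calculation: apart from Q8_0 and F16, the bits-per-weight figures are approximate assumptions (K-quants mix several precisions), and real files also carry metadata.

```python
# Approximate bits per weight for each type. Assumed values for illustration;
# K-quant types mix precisions, so their effective figure varies by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params_billion, quant):
    """Rough file size in GB: parameters (in billions) x bits per weight / 8."""
    return n_params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_size_gb(8, quant):.1f} GB for an 8B model")
```

Expect real memory use during inference to be higher than the file size, since the context (KV cache) needs room as well.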
API Setup and Integration
Start Ollama API
# Start Ollama server (skip this if the systemd service is already running; the port will be in use)
ollama serve &
# Verify API is running
curl http://localhost:11434/api/tags
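The same check can be done from Python. This is a minimal sketch using only the standard library; it assumes the default address and that `/api/tags` returns a JSON object with a `models` list whose entries have a `name` field. The function names are illustrative.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url="http://localhost:11434"):
    """Query a running Ollama server and return the installed model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

`fetch_model_names()` raises `urllib.error.URLError` when the server is not reachable, which doubles as a quick health check.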
Integration Example with Python
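import requests

def query_llm(prompt, model='llama3:8b'):
    """Send a prompt to the local Ollama server and return the generated text."""
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': prompt,
            'stream': False,  # return one JSON object instead of a stream
        },
        timeout=300,  # generation on CPU can be slow
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()['response']

# Usage
result = query_llm("Explain quantum computing in simple terms")
print(result)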
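With `stream` set to true, the server sends newline-delimited JSON objects, each carrying a fragment of the answer, and marks the last one with `"done": true`. The sketch below shows one way to consume that stream with the standard library only; the function names are illustrative and the default address is assumed.

```python
import json
import urllib.request


def join_stream(lines):
    """Concatenate the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries statistics, not more text
    return "".join(parts)


def stream_llm(prompt, model="llama3:8b",
               url="http://localhost:11434/api/generate"):
    """Send a streaming request and return the full generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return join_stream(resp)  # the response object iterates line by line
```

To show text as it arrives instead of collecting it, print each fragment inside the loop in `join_stream`.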
Integration with VS Code
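Add to VS Code settings.json (the setting keys depend on the extension you use; the ones below are placeholders, so check your extension's documentation for the exact names):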
{
"llm.server.url": "http://localhost:11434",
"llm.model": "llama3:8b"
}
Systemd Service for 24/7 Operation
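Create service file (on systemd hosts the official install script already creates this unit, so edit it only if you need custom settings):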
sudo nano /etc/systemd/system/ollama.service
Content:
[Unit]
Description=Ollama Service
After=network.target
[Service]
Type=simple
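# Replace "developer" with the account that should run the server
User=developer
# Adjust the path to the output of "which ollama" (often /usr/local/bin/ollama)
ExecStart=/usr/bin/ollama serve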
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
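After the service starts, scripts that depend on it may need to wait until the API answers. The following is a minimal sketch of such a readiness check; it assumes the default address and that the server's root URL answers with HTTP 200 once it is up. The function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:11434"):
    """Return True if the server answers on its root URL."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=server_is_up, attempts=10, delay=1.0):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running server.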
Monitoring and Performance Tuning
Monitor GPU Usage
# For NVIDIA GPUs
nvidia-smi -l 1
# For AMD GPUs
rocm-smi
Memory Monitoring
# Monitor memory usage
watch -n 1 free -h
# Check Ollama process
ps aux | grep ollama
Performance Optimization
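# ollama serve has no --host, --port, or --threads flags; set the bind address with OLLAMA_HOST
# Binding to 0.0.0.0 exposes the API to other machines on your network
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Thread count is a model option (num_thread), set per request or in a Modelfile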
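Thread count can be passed per request through the `options` field of the API request body, using the `num_thread` option. A minimal sketch of building such a request body follows; the helper name is an assumption, and the value 8 is only an example that should match your physical core count.

```python
def generate_payload(prompt, model="llama3:8b", num_thread=None):
    """Build the JSON body for /api/generate, optionally pinning the thread count."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_thread is not None:
        # Model options go in a nested "options" object
        payload["options"] = {"num_thread": num_thread}
    return payload


print(generate_payload("Hello", num_thread=8))
```

Leaving `num_thread` unset lets Ollama choose a value itself, which is usually a sensible starting point before tuning.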
Benchmark Example
# Measure prompt processing and generation speed with llama-bench (10 repetitions)
# The binary is in the repository root for older Makefile builds
./build/bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf -p 512 -n 128 -r 10
Real Command Examples
Complete Setup Script
#!/bin/bash
# install-local-llm.sh
set -euo pipefail  # stop on the first failed command
# Install dependencies
sudo apt update && sudo apt install -y curl git cmake build-essential python3-pip
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama
sudo systemctl start ollama
sudo systemctl enable ollama
# Pull recommended models
ollama pull llama3:8b
ollama pull mistral:7b
ollama pull phi3:3.8b
echo "Setup complete! Run 'ollama list' to verify models."
Model Testing
# Test with simple prompt
ollama run llama3:8b "What is the capital of France?"
# Test a longer generation (output streams to the terminal by default)
ollama run mistral:7b "Explain neural networks in 3 paragraphs"
# Benchmark performance (--verbose prints load time and tokens per second)
ollama run llama3:8b --verbose "Count from 1 to 100 in Python"
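Generation speed can also be computed from the API. A non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). The sketch below turns those two fields into tokens per second; the function name is illustrative.

```python
def tokens_per_second(result):
    """Generation speed from a parsed, non-streaming /api/generate response."""
    duration_s = result["eval_duration"] / 1e9  # reported in nanoseconds
    if duration_s <= 0:
        return 0.0
    return result["eval_count"] / duration_s
```

Run the same prompt a few times and compare the results, since the first request also pays the cost of loading the model into memory.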
API Usage Example
# Direct API call
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3:8b",
"prompt": "Write a 50-word summary of quantum computing",
"stream": false
}'
Quick Reference
Common Commands:
# List models
ollama list
# Run model interactively
ollama run llama3:8b
# Start server in background
ollama serve &
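# Stop server (systemd install)
sudo systemctl stop ollama
# Stop a server started manually with "ollama serve &"
pkill -f "ollama serve"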
Model Size Comparison:
- Q4_K_M: ~5GB per 8B model
- Q5_K_M: ~6GB per 8B model
- Q8_0: ~8.5GB per 8B model
- F16: ~16GB per 8B model
This guide provides practical, production-ready setup for local LLMs without unnecessary complexity. All commands tested on Ubuntu 22.04 with NVIDIA RTX 3060.