Local LLM Setup Guide (v31)
Practical guide for developers running LLMs locally
1. Overview & Prerequisites
Running LLMs locally means working within hardware limits and tuning the software stack to match. For developers targeting mid-range hardware (8GB+ RAM, NVIDIA GPU), llama.cpp combined with Ollama (which builds on llama.cpp for inference) provides the best balance of performance, flexibility, and ease of deployment.
Hardware Requirements:
- Minimum: 8GB RAM, 1GB VRAM (for 4-bit 7B models; a Q4_K_M 7B file is about 4.5GB, so most layers stay on the CPU)
- Recommended: 16GB RAM, 8GB VRAM (for 13B models)
- GPU: NVIDIA RTX 30xx/40xx series preferred
- OS: Ubuntu 20.04+/Debian 11+/Arch Linux
Prerequisites:
# Install dependencies
sudo apt update
sudo apt install git cmake build-essential python3-pip
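Before building, it can help to confirm that the tools are actually on PATH. The following is a minimal Python sketch; the function name `missing_tools` and the tool list are illustrative assumptions derived from the packages installed above, not part of llama.cpp or Ollama.

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    # Assumed tool names: build-essential provides gcc and make on Debian/Ubuntu.
    required = ["git", "cmake", "gcc", "make", "python3", "pip3"]
    missing = missing_tools(required)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All build prerequisites found.")
```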
2. Framework Comparison
| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Fast inference, no dependencies, excellent quantization | Manual setup, limited API features | Production deployment, research |
| Ollama | Simple API, easy model management, Docker integration | Higher memory usage | Development, prototyping |
| vLLM | High throughput, multi-GPU support | Complex setup, Python-only | High-volume inference |
| LocalAI | OpenAI API compatible, multi-model support | Less performant than native llama.cpp | API-first workflows |
Recommendation: Use Ollama for development and quick model management, and run llama.cpp directly when you need control over quantization and runtime flags.
3. Step-by-Step Installation
Install llama.cpp
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
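# Build with CUDA support (if available).
# LLAMA_CUBLAS=1 is the Makefile switch in releases that produce ./main and ./server.
# Current releases build with CMake instead (cmake -B build -DGGML_CUDA=ON, then
# cmake --build build --config Release) and name the binaries llama-cli and llama-server.
make clean
make -j$(nproc) LLAMA_CUBLAS=1
# Verify installation
./main --help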
Install Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama
# Verify service
ollama --version
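`ollama --version` only confirms that the CLI is installed. To check that the service itself answers, one option is to query its HTTP API. The sketch below assumes the default address `http://localhost:11434` and the `/api/version` endpoint; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

def parse_version(body: bytes):
    """Extract the "version" field from the JSON body of /api/version."""
    return json.loads(body.decode("utf-8")).get("version")

def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server version string, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return None

if __name__ == "__main__":
    version = ollama_version()
    print(f"Ollama is up (version {version})" if version else "Ollama is not reachable")
```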
4. Model Selection Guide
For Development/Testing:
- Mistral-7B-v0.1 (Q4_K_M): 7B parameters, good balance of size vs performance
- Phi-3-mini (Q4_K_M): 3.8B parameters, optimized for speed
For Production:
- Llama-3-8B (Q5_K_M): 8B parameters, high quality, moderate size
- Mixtral-8x7B (Q5_K_M): 47B parameters, excellent for complex tasks
Download and convert models:
# Using Ollama (recommended)
ollama pull mistral:7b
ollama pull phi3:mini
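# Or manually using llama.cpp
# Option A: download a pre-quantized GGUF file from Hugging Face (no conversion needed)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
# Option B: convert a local Hugging Face checkpoint to f16 GGUF, then quantize it
# (the convert script does not emit Q4_K_M directly)
python3 convert-hf-to-gguf.py mistral-7b-v0.1 --outtype f16 --outfile mistral-7b-v0.1-f16.gguf
./quantize mistral-7b-v0.1-f16.gguf mistral-7b-v0.1-q4k.gguf Q4_K_M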
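A failed or redirected download often leaves an HTML page or a truncated file where the model should be. A quick sanity check is to read the GGUF header. This sketch assumes the v2+ header layout (4-byte `GGUF` magic, little-endian uint32 version, then two uint64 counts); the function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Raises ValueError if the file does not start with the GGUF magic bytes.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 bytes
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} is not a GGUF file (truncated download or HTML error page?)")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("mistral-7b-v0.1.Q4_K_M.gguf")` on a valid download should return a small tuple of integers rather than raise.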
5. Quantization Types Explained
| Quantization | Approx. size (7-8B model) | Accuracy (approx.) | Use Case |
|---|---|---|---|
| Q4_K_M | 4.5GB | 92% | General purpose, 7B models |
| Q5_K_M | 5.5GB | 95% | Higher-quality output when memory allows |
| Q6_K | 6.5GB | 97% | Near-Q8_0 quality with less RAM |
| Q8_0 | 8.5GB | 99% | Maximum accuracy, high RAM |
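The sizes in the table roughly track parameter count times bits per weight. The sketch below makes that arithmetic explicit, using the nominal bits-per-weight figures of the base formats; the function name is hypothetical, and the result covers weights only (no KV cache or runtime overhead).

```python
# Nominal bits per weight for the base quantization formats. The _M mixes keep
# a few tensors at higher precision, so real files come out slightly larger.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5}

def estimate_size_gb(n_params, bits_per_weight):
    """Rough weight-only size in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{estimate_size_gb(8e9, bpw):.1f} GB for an 8B-parameter model")
```

The same function scales to other models: pass the parameter count of a 13B or 47B model to see why higher-precision quantizations quickly exceed mid-range hardware.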
Example benchmark:
# Test model performance
./main -m mistral-7b-v0.1-q4k.gguf -p "Why is the sky blue?" --repeat-penalty 1.1 --temp 0.7
# Benchmark with 100 tokens (wall-clock time includes model loading)
time ./main -m mistral-7b-v0.1-q4k.gguf -p "The quick brown fox jumps over the lazy dog." -n 100
6. API Setup and Integration
Using Ollama API:
# Optional: open an interactive session (the API loads the model on demand)
ollama run mistral:7b
# API test curl command
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b",
    "prompt": "Explain quantum computing in simple terms",
    "stream": false
  }'
Python integration:
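import requests

def query_llm(prompt):
    """Send a prompt to the local Ollama server and return the generated text."""
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'mistral:7b',
            'prompt': prompt,
            'stream': False
        },
        timeout=300,  # local generation can take a while on CPU
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()['response']

# Usage
result = query_llm("Write a haiku about programming")
print(result)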
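With `"stream": true`, Ollama returns one JSON object per line, each carrying a text fragment in `response`, and a final object with `"done": true`. The sketch below shows one way to consume that stream. It uses only the standard library instead of `requests`, and the function names are hypothetical.

```python
import json
import urllib.request

def iter_stream_text(lines):
    """Yield text fragments from an iterable of NDJSON lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break

def stream_llm(prompt, model="mistral:7b", url="http://localhost:11434/api/generate"):
    """Print tokens as they arrive instead of waiting for the full response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        for text in iter_stream_text(resp):  # the response object iterates line by line
            print(text, end="", flush=True)
    print()
```

Keeping the parsing in `iter_stream_text` separate from the network call makes it easy to test against captured output.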
Direct llama.cpp API:
# Run with API endpoint (the HTTP server is the separate ./server binary; ./main has no --port option)
./server -m mistral-7b-v0.1-q4k.gguf --port 8080 -a "llama.cpp API"
7. Systemd Service for 24/7 Operation
Create systemd service for automatic startup:
sudo nano /etc/systemd/system/llm-api.service
[Unit]
Description=Local LLM API Service
After=network.target
[Service]
Type=simple
User=your_user
WorkingDirectory=/home/your_user/llama.cpp
ExecStart=/home/your_user/llama.cpp/server -m /home/your_user/models/mistral-7b-v0.1-q4k.gguf --port 8080
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable llm-api
sudo systemctl start llm-api
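After the unit starts, the model can take a while to load, so dependent scripts may want to wait until the API answers. A minimal polling sketch follows; the `/health` path in the usage note is an assumption about the llama.cpp server build in use, and both helper names are hypothetical.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `probe()` until it returns True or `attempts` run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))` blocks for up to about a minute and returns False if the service never comes up.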
8. Monitoring and Performance Tuning
Memory monitoring:
# Check GPU memory usage
nvidia-smi
# Monitor system resources
htop
# Memory usage for specific process
ps aux | grep llama
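For logging or alerting, the same GPU figures can be read programmatically. This sketch shells out to `nvidia-smi` with its CSV query options and parses the result; the function names are hypothetical, and it returns an empty list when no NVIDIA driver is present.

```python
import shutil
import subprocess

def parse_gpu_memory(csv_text):
    """Parse `used, total` MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_memory(result.stdout) if result.returncode == 0 else []

if __name__ == "__main__":
    for index, (used, total) in enumerate(gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB used")
```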
Performance optimization flags:
# Optimized startup command
# (sampling settings such as temperature and repeat penalty are sent per request)
./server -m mistral-7b-v0.1-q4k.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080 \
  --log-format json
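Choosing a value for `--n-gpu-layers` is mostly a question of how much of the model fits in free VRAM. The heuristic below is a rough sketch, not how llama.cpp itself decides: it assumes all layers are the same size, and `reserve_bytes` is a hypothetical safety margin for the KV cache and scratch buffers.

```python
def layers_that_fit(model_bytes, n_layers, free_vram_bytes, reserve_bytes=1 * 1024**3):
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    if n_layers <= 0 or model_bytes <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(0, free_vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))
```

Treat the result as a starting point: if the server still reports CUDA out-of-memory errors, lower the value or reduce `--ctx-size`.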
Benchmark script:
#!/bin/bash
# benchmark.sh (run from the llama.cpp directory)
MODEL_PATH="/home/user/models/mistral-7b-v0.1-q4k.gguf"
PROMPT="The future of artificial intelligence will be..."
echo "Benchmarking model: $MODEL_PATH"
# Wall-clock time includes model loading, not just generation
time ./main -m "$MODEL_PATH" -p "$PROMPT" -n 50
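Wall-clock timing mixes model loading with generation. When the model is served through Ollama, the non-streaming `/api/generate` response already reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which gives a cleaner tokens-per-second figure. A small sketch, with a hypothetical function name:

```python
def tokens_per_second(result):
    """Generation speed from an Ollama /api/generate response dict.

    Prompt processing and model load are reported in separate fields and are
    not included here.
    """
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) * 1e9 / duration_ns
```

Pass it the parsed JSON body, for example the value of `response.json()` from a `requests.post` call to the Ollama endpoint.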
9. Real Command Examples
Complete setup example:
# 1. Create directory structure
mkdir -p ~/models
cd ~
# 2. Build llama.cpp with CUDA (Makefile builds; current releases use CMake with -DGGML_CUDA=ON)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make clean && make -j$(nproc) LLAMA_CUBLAS=1
# 3. Download a pre-quantized model (no conversion needed)
wget -P ~/models https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
# 4. Start the server with optimized parameters (run from ~/llama.cpp)
./server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080 \
  --log-format json
Test API endpoint:
# The llama.cpp server exposes /completion; /api/generate is Ollama's endpoint
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to create a REST API in Python",
    "n_predict": 256,
    "stream": false,
    "temperature": 0.7,
    "repeat_penalty": 1.1
  }'
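The llama.cpp server on port 8080 can also be called from Python. Its native endpoint is `/completion`, which returns the generated text in a `content` field. This sketch uses only the standard library; the function names and default values are illustrative assumptions.

```python
import json
import urllib.request

def build_completion_request(prompt, n_predict=256, temperature=0.7, repeat_penalty=1.1):
    """Build the JSON body for the llama.cpp server's /completion endpoint."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": temperature,
        "repeat_penalty": repeat_penalty,
        "stream": False,
    }

def complete(prompt, base_url="http://localhost:8080", **options):
    """POST a prompt to /completion and return the generated text."""
    body = json.dumps(build_completion_request(prompt, **options)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]
```

With the server running, `complete("Explain how to create a REST API in Python", n_predict=128)` returns the completion as a string.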
10. Troubleshooting
Common issues and solutions:
- CUDA out of memory: Reduce --n-gpu-layers or --ctx-size
- Slow inference: Increase --threads parameter
- Model not found: Verify file paths