Local LLM Setup Guide (v18)
1. Overview & Prerequisites
Running LLMs locally requires minimal hardware but careful resource management. This guide assumes:
- Ubuntu 20.04/22.04 or Debian 11/12
- 8GB+ RAM (16GB+ recommended)
- NVIDIA GPU with CUDA support (RTX 3060+), or CPU-only setup
- 20GB+ free disk space for models
For GPU-accelerated inference, install CUDA:
# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-535
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-4
2. Framework Comparison
| Framework | GPU Support | Ease of Use | Performance | Best For |
|---|---|---|---|---|
| llama.cpp | Yes | Medium | Fast | Quick prototyping |
| Ollama | Yes | Easy | Fast | Development/testing |
| vLLM | Yes | Medium | Fastest | Production inference |
| LocalAI | Yes | Easy | Fast | API-first workflows |
Recommendation: Use llama.cpp with Ollama for development workflow.
3. Step-by-Step Installation
Install llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Install Ollama (for easier model management):
curl -fsSL https://ollama.com/install.sh | sh
Test installation:
ollama run llama3:8b
Setup model directory:
mkdir -p ~/llm-models
cd ~/llm-models
4. Model Selection Guide
Use Case: Code Generation
- Model:
codellama:7borphi3:3.8b - RAM: 8GB minimum
- Command:
ollama run codellama:7b
Use Case: Chatbot
- Model:
llama3:8bormistral:7b - RAM: 8GB minimum
- Command:
ollama run llama3:8b
Use Case: High Precision
- Model:
llama3:70bormixtral:8x7b - RAM: 16GB minimum
- Command:
ollama run llama3:70b
5. Quantization Types Explained
Quantization reduces model size while maintaining performance:
- Q4_K_M: 4-bit quantization, 4.5GB for 7B model
- Q5_K_M: 5-bit quantization, 5.5GB for 7B model
- Q8_0: 8-bit quantization, 8GB for 7B model
- F16: Full precision, 16GB for 7B model
Example: Download and convert model:
# Download 7B model
ollama pull llama3:8b
# Convert to Q4_K_M (smallest size, good performance)
ollama run llama3:8b --quantize Q4_K_M
6. API Setup and Integration
Create API server with llama.cpp:
# Start llama.cpp server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 1234 \
--threads 8 \
--ctx-size 8192
Test API:
curl http://localhost:1234/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a Python function to reverse a string.",
"temperature": 0.7,
"max_tokens": 100
}'
Integrate with Python:
import requests
def llm_query(prompt):
response = requests.post(
'http://localhost:1234/completion',
json={
'prompt': prompt,
'temperature': 0.7,
'max_tokens': 200
}
)
return response.json()['content']
# Usage
result = llm_query("Explain quantum computing in simple terms")
7. Systemd Service for 24/7 Operation
Create service file:
sudo nano /etc/systemd/system/llm-server.service
Content:
[Unit]
Description=Local LLM Server
After=network.target
[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username/llama.cpp
ExecStart=/home/your_username/llama.cpp/server \
-m /home/your_username/llm-models/llama3-8b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 1234 \
--threads 8 \
--ctx-size 8192
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server
8. Monitoring and Performance Tuning
Monitor GPU usage:
nvidia-smi -l 1 # Update every second
Monitor memory usage:
watch -n 1 free -h
Benchmark inference:
# Test 100 token generation
time ./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
--prompt "The future of AI is" \
--max-tokens 100 \
--threads 8
Performance tuning parameters:
- --ctx-size: 8192 for 8B models, 16384 for 70B models
- --threads: CPU cores / 2 for optimal performance
- --n-gpu-layers: Number of layers on GPU (default: 100 for 8B models)
9. Real Command Examples
Full workflow example:
# 1. Install dependencies
sudo apt update
sudo apt install git cmake build-essential
# 2. Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# 3. Download model
ollama pull llama3:8b
# 4. Start server
./server -m ~/llm-models/llama3-8b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 1234 \
--threads 8 \
--ctx-size 8192 \
--n-gpu-layers 100
# 5. Test API
curl http://localhost:1234/completion \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello world", "max_tokens": 10}'
Production-ready startup script:
#!/bin/bash
# ~/start-llm.sh
MODEL_PATH="$HOME/llm-models/llama3-8b-Q4_K_M.gguf"
PORT=1234
if [ ! -f "$MODEL_PATH" ]; then
echo "Model not found at $MODEL_PATH"
exit 1
fi
echo "Starting LLM server on port $PORT..."
./server \
-m "$MODEL_PATH" \
--host 0.0.0.0 \
--port $PORT \
--threads 8 \
--ctx-size 8192 \
--n-gpu-layers 100
This setup provides a production-ready local LLM infrastructure with minimal hardware requirements and optimal performance. The combination of llama.cpp for low-level control and Ollama for easy model management gives developers the best of both worlds for local LLM development.
📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
Top comments (0)