Local LLM Setup Guide (v42)
Developer's Guide to Local LLM Deployment
Overview & Prerequisites
Running LLMs locally does not require specialized hardware, but it does need plenty of RAM. 8GB is enough for 4-bit quantized models up to about 7B parameters; 16GB+ is recommended for larger models.
Prerequisites:
- Linux 64-bit system (Ubuntu 20.04+ recommended)
- 8GB+ RAM (16GB+ for larger models)
- NVIDIA GPU with CUDA support (optional but highly recommended)
- 20GB+ disk space
- Python 3.8+
# Check system requirements
lscpu | grep -i "model name"
free -h
nvidia-smi # if GPU available
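The same checks can be scripted. The following is a minimal sketch for Linux, assuming the thresholds from the prerequisites list (8GB RAM, 20GB disk); the function names are illustrative and not part of any tool mentioned in this guide.

```python
import os
import shutil


def total_ram_gb():
    """Total physical RAM in GiB (Linux: page size times number of pages)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def free_disk_gb(path="."):
    """Free disk space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_prerequisites(min_ram_gb=8, min_disk_gb=20, path="."):
    """Compare this machine against the guide's minimums (assumed thresholds)."""
    return {
        "ram_ok": total_ram_gb() >= min_ram_gb,
        "disk_ok": free_disk_gb(path) >= min_disk_gb,
        # Only checks that the tool is on PATH, not that a GPU is usable.
        "gpu_tool_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A machine sold as "8GB" usually reports slightly less than 8 GiB to the OS, so you may want to pass a lower `min_ram_gb` (for example 7) to avoid a false negative.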
Framework Comparison
| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Lightweight, native C++ | Limited features, manual setup | Developers wanting full control |
| Ollama | Simple API, easy deployment | Less flexible, slower startup | Quick prototyping |
| vLLM | Highest throughput | Complex setup, requires Python | Production high-throughput |
| LocalAI | Multi-model support, REST API | Resource-heavy | Enterprise use cases |
Recommendation: Use llama.cpp when you want full control over the build and runtime, or Ollama when you want a working API quickly.
Step-by-Step Installation
1. Install Dependencies
# Update system
sudo apt update && sudo apt upgrade -y
# Install build tools
sudo apt install build-essential git cmake python3-pip -y
# Install CUDA (if using GPU)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-11-8 -y
2. Install llama.cpp
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
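# Build with GPU support (omit LLAMA_CUDA=1 for a CPU-only build)
# Flag and binary names depend on the llama.cpp revision: recent checkouts build with CMake and name the CLI binary llama-cli instead of main
make clean
make -j$(nproc) LLAMA_CUDA=1
# Verify installation
./main --help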
3. Download Model
# Create a model directory next to llama.cpp (later commands are run from this parent directory)
cd ..
mkdir -p models
# Download a 7B model (example: Llama 2 7B Chat, Q4_K_M)
wget -P models https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
# Verify download
ls -la models/llama-2-7b-chat.Q4_K_M.gguf
Model Selection Guide
| Model | Approx. size (4-bit GGUF) | Best For |
|---|---|---|
| LLaMA-2-7B | 4GB | General purpose, chat |
| Mistral-7B | 4GB | Coding, instruction following |
| Phi-2 | 2GB | Lightweight, fast inference |
| TinyLlama | 1GB | Quick testing, demo |
# Example: Run LLaMA-2-7B
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
-c 2048 \
-n 128 \
--temp 0.7 \
-p "Q: What is the capital of France? A:"
Quantization Types Explained
- Q4_K_M: 4-bit k-quant, medium variant (best balance of size and quality)
- Q5_K_M: 5-bit k-quant, medium variant (better quality, slightly larger)
- Q8_0: 8-bit quantization (close to unquantized quality, largest quantized file)
- F16: Unquantized 16-bit float (half precision, largest files)
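File size follows roughly from parameter count times bits per weight. The sketch below works through that arithmetic; the bits-per-weight figures are approximate averages (k-quants mix several block types), and the 6.74 billion parameter count for a "7B" Llama 2 model is an assumption used for illustration.

```python
# Approximate average bits stored per weight for each format (assumed values).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,   # 32 weights x 8 bits plus one 16-bit scale per block
    "F16": 16.0,
}


def estimated_size_gb(n_params, quant):
    """Estimated file size in GB (10^9 bytes): params * bits / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    n_params = 6.74e9  # assumed parameter count for a 7B Llama 2 model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimated_size_gb(n_params, quant):.1f} GB")
```

The estimate ignores metadata and tokenizer data in the file, so treat it as a planning figure for disk and RAM rather than an exact size.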
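# Convert a model to a different quantization
# Quantize from an unquantized (F16) GGUF; requantizing an already-quantized file such as Q4_K_M loses quality.
# The F16 filename below is a placeholder for a file you have converted or downloaded yourself.
./llama.cpp/quantize models/llama-2-7b-chat.F16.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M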
# Benchmark different quantizations
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 --temp 0.7 -p "Hello world"
API Setup and Integration
Ollama Setup (Recommended for API)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server (skip this if the installer already registered and started it as a systemd service)
ollama serve &
# Pull model
ollama pull llama2:7b-chat
# Test API
curl http://localhost:11434/api/generate -d '{
"model": "llama2:7b-chat",
"prompt": "Why is the sky blue?",
"stream": false
}'
Custom API Integration
# api_client.py
import requests


class LocalLLM:
    """Minimal client for the Ollama HTTP API."""

    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, prompt, model="llama2:7b-chat"):
        # stream=False returns one JSON object instead of a line-per-chunk stream
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=300,
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return response.json()["response"]


# Usage
llm = LocalLLM()
result = llm.generate("Extract key information from this invoice: [INVOICE DATA]")
print(result)
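With "stream" set to true, Ollama sends one JSON object per line, each carrying a piece of text in "response" and a final object with "done": true. The following is a minimal sketch of reading that stream; `iter_stream_text` and `stream_generate` are illustrative names, and the timeout value is an assumption.

```python
import json


def iter_stream_text(lines):
    """Yield text pieces from an iterable of newline-delimited JSON chunks."""
    for raw in lines:
        if not raw:  # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


def stream_generate(prompt, model="llama2:7b-chat",
                    base_url="http://localhost:11434"):
    """Stream a completion from a running Ollama server, piece by piece."""
    import requests  # imported here so the parser above has no dependencies

    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(f"{base_url}/api/generate", json=payload,
                       stream=True, timeout=300) as resp:
        resp.raise_for_status()
        yield from iter_stream_text(resp.iter_lines())


if __name__ == "__main__":
    for piece in stream_generate("Why is the sky blue?"):
        print(piece, end="", flush=True)
    print()
```

Streaming lets a caller show partial output while generation is still running, which matters more on slow local hardware than on a hosted API.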
Systemd Service for 24/7 Operation
Create a service file so the server starts automatically at boot:
# Create service file
sudo nano /etc/systemd/system/llm.service
# Service configuration
[Unit]
Description=Local LLM Server
After=network.target
[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
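# main exits after a single completion, so a long-running unit should start llama.cpp's HTTP server binary instead
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/llama-2-7b-chat.Q4_K_M.gguf -c 2048 --host 127.0.0.1 --port 8080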
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable llm.service
sudo systemctl start llm.service
sudo systemctl status llm.service
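`systemctl status` shows that the process is running, not that it is accepting connections. A small sketch of a readiness check is shown here, assuming the service listens on a TCP port; the example uses Ollama's default port 11434 from earlier in this guide, and the retry count and delay are arbitrary choices.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll until the port accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 11434)
    print("server ready" if ready else "server not reachable")
```

Model loading can take a while on first start, which is why the check polls instead of testing once.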
Monitoring and Performance Tuning
Performance Monitoring
# Monitor GPU usage
nvidia-smi -l 1
# Monitor CPU and memory
htop
# Benchmark inference time
time ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 -p "Test prompt"
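Wall-clock time is easier to compare across runs once it is converted to tokens per second. The sketch below wraps a command with a timer; it assumes the run actually produced the requested number of tokens (128 here, matching `-n 128`), and the measured time includes model loading, so the result understates pure generation speed.

```python
import subprocess
import time


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def tokens_per_second(n_tokens, seconds):
    """Convert a token count and a duration into throughput."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


if __name__ == "__main__":
    n_tokens = 128  # must match the -n value passed below
    argv = ["./llama.cpp/main", "-m", "models/llama-2-7b-chat.Q4_K_M.gguf",
            "-n", str(n_tokens), "-p", "Test prompt"]
    elapsed = time_command(argv)
    print(f"{tokens_per_second(n_tokens, elapsed):.1f} tokens/s "
          f"(including model load)")
```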
Tuning Parameters
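# Favor speed: the short output cap (-n) is what reduces run time; the low temperature makes output more deterministic
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 32 \
  --temp 0.1 \
  --repeat-penalty 1.1 \
  -p "Test prompt"
# Favor quality: longer output and more varied sampling
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 256 \
  --temp 0.8 \
  --repeat-penalty 1.2 \
  -p "Test prompt"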
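To see what `--temp` changes, it helps to look at the underlying arithmetic: logits are divided by the temperature before the softmax, so a low value concentrates probability on the top token. This is a conceptual sketch of that step only, not llama.cpp's actual sampler code, and the example logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for temp in (0.1, 0.8):
        probs = softmax_with_temperature(logits, temp)
        print(temp, [round(p, 3) for p in probs])
```

At 0.1 nearly all probability lands on the highest-scoring token, which suits extraction tasks; at 0.8 the alternatives keep a meaningful share, which gives more varied text.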
Real Command Examples
Document Processing Pipeline
# Extract text from PDF
pdftotext -layout document.pdf extracted.txt
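# Process with LLM
# --no-display-prompt (available in recent llama.cpp builds) stops main from echoing the prompt into output.json
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -n 256 \
  --no-display-prompt \
  -p "Extract structured data from this receipt text and answer with a single JSON object only: $(cat extracted.txt)" \
  --temp 0.3 > output.json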
# Validate JSON
python3 -m json.tool output.json
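Model output often wraps the JSON in extra text, in which case `json.tool` rejects the whole file. One way to cope, sketched below, is to scan for the first position where a valid JSON object starts; the function name and the sample string are illustrative.

```python
import json


def extract_first_json(text):
    """Return the first JSON object embedded in `text`, or raise ValueError."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one value starting at pos and ignores the rest
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            pos = text.find("{", pos + 1)
    raise ValueError("no JSON object found in model output")


if __name__ == "__main__":
    with open("output.json", encoding="utf-8") as fh:
        data = extract_first_json(fh.read())
    print(json.dumps(data, indent=2))
```

This only recovers syntactically valid JSON. It does not check that the expected fields are present, so a schema check is still worth adding before the data is used downstream.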
Batch Processing Script
#!/bin/bash
# batch_processor.sh
for file in receipts/*.pdf; do
  pdftotext -layout "$file" temp.txt
  ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
    -n 256 \
    -p "Extract invoice data: $(cat temp.txt)" \
    --temp 0.1 > "${file%.pdf}.json"
  rm temp.txt
done
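The same loop can be written in Python, which avoids shell quoting problems when the extracted text contains quotes or backticks. This is a sketch under assumptions: the binary and model paths mirror the commands in this guide, and `max_chars=4000` is a rough, arbitrary guard to keep long documents inside a 2048-token context.

```python
import subprocess
from pathlib import Path


def build_command(text, binary="./llama.cpp/main",
                  model="models/llama-2-7b-chat.Q4_K_M.gguf", max_chars=4000):
    """Build the argv list; the prompt is one argument, so no shell escaping."""
    prompt = "Extract invoice data: " + text[:max_chars]
    return [binary, "-m", model, "-n", "256", "--temp", "0.1", "-p", prompt]


def output_path(pdf_path):
    """receipts/a.pdf -> receipts/a.json"""
    return Path(pdf_path).with_suffix(".json")


def process_all(folder="receipts"):
    """Convert each PDF to text, run the model, and save the raw output."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # "-" as the output file makes pdftotext write to stdout
        text = subprocess.run(["pdftotext", "-layout", str(pdf), "-"],
                              check=True, capture_output=True, text=True).stdout
        result = subprocess.run(build_command(text), check=True,
                                capture_output=True, text=True)
        output_path(pdf).write_text(result.stdout, encoding="utf-8")


if __name__ == "__main__":
    process_all()
```

Passing the prompt as a list element also removes the need for a shared `temp.txt`, so two runs in the same directory cannot overwrite each other's intermediate file.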
Troubleshooting Tips
- Memory issues: Reduce context size (-c parameter)
- Slow startup: Use smaller models (Phi-2, TinyLlama)
- GPU memory issues: If you offload layers with -ngl (--n-gpu-layers), lower the value; -ngl 0 keeps the whole model on the CPU
- Permission denied: Check file permissions and user groups
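Reducing the context size helps because the key/value cache grows linearly with `-c`. The sketch below estimates that cache, assuming Llama 2 7B's published shape (32 layers, 32 KV heads, head dimension 128) and a 16-bit cache; other models and cache types will differ.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Size of the key/value cache: one K and one V vector per layer per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for n_ctx in (512, 1024, 2048, 4096):
        mib = kv_cache_bytes(n_ctx) / 1024**2
        print(f"-c {n_ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions, halving the context from 2048 to 1024 frees about 512 MiB, on top of whatever the model weights occupy.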
# Common troubleshooting commands
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -c 1024 -n 64 --temp 0.7 -p "Test"
# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
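For scripted checks, the CSV output of that query can be parsed directly. The sketch below assumes the default CSV format, with a header row and values suffixed by "MiB"; the sample text in the usage section is illustrative, not captured from a real GPU.

```python
import csv
import io
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows into a list of (used_mib, total_mib) tuples."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    result = []
    for index, row in enumerate(rows):
        if index == 0 or len(row) < 2:  # skip the header and blank lines
            continue
        used = int(row[0].split()[0])   # "1234 MiB" -> 1234
        total = int(row[1].split()[0])
        result.append((used, total))
    return result


def query_gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    sample = "memory.used [MiB], memory.total [MiB]\n1234 MiB, 8192 MiB\n"
    for used, total in parse_gpu_memory(sample):
        print(f"{used} / {total} MiB used ({100 * used / total:.0f}%)")
```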
This guide provides a complete foundation for local LLM deployment that balances performance, cost, and usability.