matias yoon

Posted on May 25

Local LLM Setup Guide (v31)

#ai #llm #developers #tutorial

Practical guide for developers running LLMs locally

1. Overview & Prerequisites

Running LLMs locally means fitting a model into the memory you have and choosing a software stack that uses your hardware well. For developers on mid-range hardware (8GB+ RAM, NVIDIA GPU), this guide pairs llama.cpp, which gives direct control over builds and inference flags, with Ollama, which is built on llama.cpp and adds simple model management and an HTTP API.

Hardware Requirements:

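Minimum: 8GB RAM; a GPU is optional (a 7B model at Q4_K_M is about 4.4GB, and any layers that do not fit in VRAM run on the CPU)
Recommended: 16GB RAM, 8GB VRAM (enough to fully offload a 7B or 8B model; 13B models need partial CPU offload)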
GPU: NVIDIA RTX 30xx/40xx series preferred
OS: Ubuntu 20.04+/Debian 11+/Arch Linux

Prerequisites:

# Install dependencies
sudo apt update
sudo apt install git cmake build-essential python3-pip

2. Framework Comparison

Framework	Pros	Cons	Best For
llama.cpp	Fast inference, no dependencies, excellent quantization	Manual setup, limited API features	Production deployment, research
Ollama	Simple API, easy model management, Docker integration	Higher memory usage	Development, prototyping
vLLM	High throughput, multi-GPU support	Complex setup, Python-only	High-volume inference
LocalAI	OpenAI API compatible, multi-model support	Less performant than native llama.cpp	API-first workflows

Recommendation: Use Ollama for day-to-day development and its HTTP API, and use llama.cpp directly when you need control over build flags, quantization, or server parameters.

3. Step-by-Step Installation

Install llama.cpp

# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

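# Build with CUDA support (drop -DGGML_CUDA=ON for a CPU-only build)
# Recent llama.cpp releases build with CMake; the old Makefile build is no longer supported
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Verify installation (the CLI binary was renamed from ./main to llama-cli)
./build/bin/llama-cli --help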

Install Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify service
ollama --version

4. Model Selection Guide

For Development/Testing:

Mistral-7B-v0.1 (Q4_K_M): 7B parameters, good balance of size vs performance
Phi-3-mini (Q4_K_M): 3.8B parameters, optimized for speed

For Production:

Llama-3-8B (Q5_K_M): 8B parameters, high quality, moderate size
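Mixtral-8x7B (Q5_K_M): 47B total parameters (mixture of experts), excellent for complex tasks, but the file is roughly 32GB, well beyond the recommended hardware above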

Download and convert models:

# Using Ollama (recommended)
ollama pull mistral:7b
ollama pull phi3:mini

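# Or manually using llama.cpp
# Option A: download a pre-quantized GGUF from Hugging Face (no conversion needed)
wget -O mistral-7b-v0.1-q4k.gguf https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Option B: convert a local Hugging Face checkpoint directory, then quantize
# (run from the llama.cpp checkout after: pip install -r requirements.txt)
# The converter cannot write K-quants directly; Q4_K_M comes from llama-quantize
python3 convert_hf_to_gguf.py mistral-7b-v0.1 --outtype f16 --outfile mistral-7b-v0.1-f16.gguf
./build/bin/llama-quantize mistral-7b-v0.1-f16.gguf mistral-7b-v0.1-q4k.gguf Q4_K_M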

5. Quantization Types Explained

Quantization	Size (7B model, approx.)	Quality vs. FP16 (rough estimate)	Use Case
Q4_K_M	4.4GB	92%	General-purpose default for 7B models
Q5_K_M	5.1GB	95%	Higher-quality output when a little more memory is available
Q6_K	5.9GB	97%	Near-original quality at a larger footprint
Q8_0	7.7GB	99%	Maximum accuracy, high RAM
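The sizes in the table follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic. The helper name is hypothetical, and the bits-per-weight figures are approximations: Q8_0 and Q6_K follow from their block layouts, while the Q4_K_M and Q5_K_M values are rough averages because those formats mix tensor types.

```python
# Approximate bits per weight for common GGUF quantization types.
# Assumption: the K-quant "_M" mixes vary per model, so these are rough averages.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.68,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Estimate the model file size in GB (10^9 bytes) for a quantization type."""
    bits = BITS_PER_WEIGHT[quant]
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits / 8


if __name__ == "__main__":
    # Mistral 7B has roughly 7.24 billion parameters.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(7.24, quant):.1f} GB")
```

Runtime memory is higher than the file size because the KV cache grows with the context length, so leave some headroom beyond these numbers.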

Example benchmark:

# Test model performance (run from the llama.cpp checkout)
./build/bin/llama-cli -m mistral-7b-v0.1-q4k.gguf -p "Why is the sky blue?" --repeat-penalty 1.1 --temp 0.7

# Benchmark with 100 tokens (the wall-clock time includes model loading)
time ./build/bin/llama-cli -m mistral-7b-v0.1-q4k.gguf -p "The quick brown fox jumps over the lazy dog." -n 100

6. API Setup and Integration

Using Ollama API:

# Optional: open an interactive chat session
# (the background Ollama service already serves the API on port 11434)
ollama run mistral:7b

# API test curl command
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b",
    "prompt": "Explain quantum computing in simple terms",
    "stream": false
  }'

Python integration:

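import requests

def query_llm(prompt, timeout=120):
    # Send a non-streaming generation request to the local Ollama server
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'mistral:7b',
            'prompt': prompt,
            'stream': False
        },
        timeout=timeout  # avoid hanging forever if the server is down
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()['response']

# Usage
result = query_llm("Write a haiku about programming")
print(result)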
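For long answers you may prefer to stream tokens as they arrive. With "stream" set to true, Ollama sends newline-delimited JSON objects, each carrying a "response" fragment, and a final object with "done" set to true. This is a rough sketch using only the standard library; the function names are hypothetical, and the model tag and URL are the ones used above.

```python
import json
import urllib.request


def iter_response_text(lines):
    """Yield text fragments from newline-delimited JSON chunks until 'done'."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        text = chunk.get("response", "")
        if text:
            yield text
        if chunk.get("done"):
            break


def stream_llm(prompt, model="mistral:7b",
               url="http://localhost:11434/api/generate"):
    """Send a streaming request and yield text fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True})
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # The HTTP response object can be iterated line by line (as bytes).
        yield from iter_response_text(response)


# Usage (requires a running Ollama service):
# for piece in stream_llm("Explain quantum computing in simple terms"):
#     print(piece, end="", flush=True)
```

Keeping the parsing in its own function makes it easy to test without a server.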

Direct llama.cpp API:

# Run the HTTP server (llama-server; the CLI binary does not serve an API)
./build/bin/llama-server -m mistral-7b-v0.1-q4k.gguf --port 8080 -a "mistral-7b-v0.1"

7. Systemd Service for 24/7 Operation

Create systemd service for automatic startup:

sudo nano /etc/systemd/system/llm-api.service

[Unit]
Description=Local LLM API Service
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/home/your_user/llama.cpp
ExecStart=/home/your_user/llama.cpp/build/bin/llama-server -m /home/your_user/models/mistral-7b-v0.1-q4k.gguf --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable llm-api
sudo systemctl start llm-api
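After starting the service it helps to wait until the model has finished loading before sending requests. The sketch below assumes the service runs llama.cpp's HTTP server (llama-server), which exposes a /health endpoint that returns 200 once it is ready and an error status while the model loads. The function names and the retry defaults are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure


def wait_until_ready(url="http://localhost:8080/health", attempts=30,
                     delay=2.0, get_status=http_status, sleep=time.sleep):
    """Poll the health URL until it returns 200; return True on success."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return False


# Usage (after: sudo systemctl start llm-api):
# if not wait_until_ready():
#     print("Service did not become ready; check: journalctl -u llm-api")
```

The status and sleep functions are passed in as parameters so the retry logic can be exercised without a live server.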

8. Monitoring and Performance Tuning

Memory monitoring:

# Check GPU memory usage
nvidia-smi

# Monitor system resources
htop

# Memory usage for specific process
ps aux | grep llama
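If you want these numbers in a script rather than on screen, nvidia-smi can print machine-readable output. A small sketch, assuming an NVIDIA driver is installed; the function names are hypothetical, while the --query-gpu and --format options are standard nvidia-smi flags.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "used_mib": used,
            "total_mib": total,
            "free_mib": total - used,
        })
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage in MiB."""
    output = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(output)


# Usage (requires an NVIDIA GPU and driver):
# for index, gpu in enumerate(query_gpu_memory()):
#     print(f"GPU {index}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```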

Performance optimization flags:

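# Optimized startup command (llama-server is the HTTP server binary)
# --log-format json is omitted because recent builds may not accept it
./build/bin/llama-server -m mistral-7b-v0.1-q4k.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080 \
  --temp 0.7 \
  --repeat-penalty 1.1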
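Picking a value for --n-gpu-layers is mostly a question of how many layers fit in free VRAM. The following is a rough heuristic, not how llama.cpp itself decides: it assumes all layers are the same size and sets aside a hypothetical reserve for the KV cache and CUDA buffers. Treat the result as a starting point and lower it if you hit out-of-memory errors.

```python
def estimate_gpu_layers(model_size_gb, n_layers, free_vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is an assumed allowance for the KV cache and runtime buffers;
    it grows with --ctx-size, so raise it for long contexts.
    """
    usable_gb = free_vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    per_layer_gb = model_size_gb / n_layers
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # A 7B Q4_K_M model: about 4.4 GB across 32 transformer layers.
    for free_vram in (2.0, 4.0, 8.0):
        layers = estimate_gpu_layers(4.4, 32, free_vram)
        print(f"{free_vram:.0f} GB free VRAM: try --n-gpu-layers {layers}")
```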

Benchmark script:

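#!/bin/bash
# benchmark.sh (run from the llama.cpp checkout)
set -euo pipefail

MODEL_PATH="/home/user/models/mistral-7b-v0.1-q4k.gguf"
PROMPT="The future of artificial intelligence will be..."

echo "Benchmarking model: $MODEL_PATH"
# Wall-clock time includes model loading and prompt processing
time ./build/bin/llama-cli -m "$MODEL_PATH" -p "$PROMPT" -n 50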
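Timing the whole process mixes model loading with generation. If you benchmark through the Ollama API from section 6 instead, the non-streaming response includes eval_count (generated tokens) and eval_duration (in nanoseconds), which give generation speed directly. A small sketch with a hypothetical helper name:

```python
def tokens_per_second(result):
    """Compute generation speed from an Ollama /api/generate response dict."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0  # field missing or no tokens were generated
    # eval_duration is reported in nanoseconds
    return result.get("eval_count", 0) * 1e9 / duration_ns


if __name__ == "__main__":
    # Sample values for illustration only, not a measured result.
    sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Pass it the parsed JSON body of a request made with "stream" set to false.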

9. Real Command Examples

Complete setup example:

# 1. Create the model directory
mkdir -p ~/models

# 2. Build llama.cpp with CUDA
git clone https://github.com/ggerganov/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# 3. Download a pre-quantized model (no conversion needed)
wget -P ~/models https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# 4. Start the server with optimized parameters
~/llama.cpp/build/bin/llama-server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080

Test API endpoint:

# llama-server uses /completion (the /api/generate path belongs to Ollama)
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to create a REST API in Python",
    "n_predict": 256,
    "stream": false,
    "temperature": 0.7,
    "repeat_penalty": 1.1
  }'
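The same request can be made from Python. This sketch assumes the server on port 8080 is llama.cpp's llama-server, whose /completion endpoint takes a prompt plus sampling options and returns the generated text in a "content" field. The function names and default values are hypothetical.

```python
import json
import urllib.request


def build_completion_payload(prompt, n_predict=256, temperature=0.7,
                             repeat_penalty=1.1):
    """Build the JSON body for a non-streaming /completion request."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,  # maximum number of tokens to generate
        "temperature": temperature,
        "repeat_penalty": repeat_penalty,
        "stream": False,
    }


def complete(prompt, base_url="http://localhost:8080", **options):
    """POST a prompt to llama-server and return the generated text."""
    body = json.dumps(build_completion_payload(prompt, **options))
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["content"]


# Usage (requires a running llama-server):
# print(complete("Explain how to create a REST API in Python", n_predict=128))
```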

10. Troubleshooting

Common issues and solutions:

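CUDA out of memory: Reduce --n-gpu-layers or --ctx-size
Slow inference: Offload more layers with --n-gpu-layers if VRAM allows, and set --threads to the number of physical CPU cores (more threads than cores usually slows things down)
Model not found: Verify file paths (systemd does not expand ~, so use absolute paths in the unit file)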
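A quick way to rule out path and download problems is to check that the file exists and starts with the 4-byte GGUF magic. This is a minimal sketch with a hypothetical helper name; it checks only the header, not whether the rest of the file is intact.

```python
import os

GGUF_MAGIC = b"GGUF"  # every GGUF file begins with these four bytes


def check_model_file(path):
    """Return 'ok', 'missing', or 'not-gguf' for a model path."""
    path = os.path.expanduser(path)  # allow ~/models/... style paths
    if not os.path.isfile(path):
        return "missing"
    with open(path, "rb") as handle:
        magic = handle.read(4)
    return "ok" if magic == GGUF_MAGIC else "not-gguf"


if __name__ == "__main__":
    print(check_model_file("~/models/mistral-7b-v0.1-q4k.gguf"))
```

A "not-gguf" result often means the download saved an HTML error page instead of the model, in which case downloading again usually fixes it.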

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
