DEV Community

matias yoon

Local LLM Setup Guide (v31)

Practical guide for developers running LLMs locally

1. Overview & Prerequisites

Running LLMs locally means working within your hardware limits and choosing a software stack that makes the most of them. For developers on mid-range hardware (8GB+ RAM, NVIDIA GPU), llama.cpp for inference combined with Ollama for model management and a simple API offers a good balance of performance, flexibility, and ease of deployment.

Hardware Requirements:

  • Minimum: 8GB RAM, 1GB VRAM (for 7B models; with this little VRAM most layers run on the CPU)
  • Recommended: 16GB RAM, 8GB VRAM (for 13B models)
  • GPU: NVIDIA RTX 30xx/40xx series preferred
  • OS: Ubuntu 20.04+/Debian 11+/Arch Linux
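
A rough way to sanity-check these requirements is to estimate weight size as parameter count times bits per weight. The sketch below is only an approximation: the function names, the 1 GB overhead allowance, and the 4.5 bits-per-weight figure are assumptions for illustration, and real usage also depends on context length and runtime buffers.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(params_billion: float, bits_per_weight: float,
                   available_gb: float, overhead_gb: float = 1.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in available memory."""
    needed = estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= available_gb


if __name__ == "__main__":
    # Assumed ~4.5 bits per weight for a 4-bit quantization; adjust for your file.
    for size in (7, 13):
        gb = estimate_weights_gb(size, 4.5)
        ok = fits_in_memory(size, 4.5, available_gb=8)
        print(f"{size}B at 4.5 bits/weight: ~{gb:.1f} GB, fits in 8 GB: {ok}")
```

If the estimate exceeds your VRAM, the model can still run with only part of it offloaded to the GPU, at lower speed.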

Prerequisites:

# Install dependencies
sudo apt update
sudo apt install git cmake build-essential python3-pip

2. Framework Comparison

| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Fast inference, minimal dependencies, excellent quantization | Manual setup, limited API features | Production deployment, research |
| Ollama | Simple API, easy model management, Docker integration | Higher memory usage | Development, prototyping |
| vLLM | High throughput, multi-GPU support | Complex setup, Python-only | High-volume inference |
| LocalAI | OpenAI API compatible, multi-model support | Less performant than native llama.cpp | API-first workflows |

Recommendation: Use llama.cpp with Ollama for optimal balance of performance and usability.

3. Step-by-Step Installation

Install llama.cpp

# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

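# Build with CUDA support (current releases build with CMake;
# drop -DGGML_CUDA=ON for a CPU-only build)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"

# Verify installation (the former `main` binary is now `llama-cli`)
./build/bin/llama-cli --help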

Install Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify service
ollama --version

4. Model Selection Guide

For Development/Testing:

  • Mistral-7B-v0.1 (Q4_K_M): 7B parameters, good balance of size vs performance
  • Phi-3-mini (Q4_K_M): 3.8B parameters, optimized for speed

For Production:

  • Llama-3-8B (Q5_K_M): 8B parameters, high quality, moderate size
  • Mixtral-8x7B (Q5_K_M): 47B parameters, excellent for complex tasks

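Download and convert models:

# Using Ollama (recommended)
ollama pull mistral:7b
ollama pull phi3:mini

# Or manually: download a pre-quantized GGUF from Hugging Face (no conversion needed)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Only when starting from an original Hugging Face checkpoint directory:
# convert to an f16 GGUF first, then quantize it as a separate step
# (binary path assumes a CMake build in ./build)
python3 convert_hf_to_gguf.py mistral-7b-v0.1 --outtype f16 --outfile mistral-7b-v0.1-f16.gguf
./build/bin/llama-quantize mistral-7b-v0.1-f16.gguf mistral-7b-v0.1-q4k.gguf Q4_K_M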
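
Before pointing llama.cpp at a downloaded file, it can help to confirm the download is really a GGUF file and not, say, an HTML error page. A minimal sketch, assuming only the documented GGUF header layout (4-byte magic `GGUF` followed by a little-endian 32-bit version); the function name is illustrative:

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Bytes 4..8 hold the format version as an unsigned little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_version("mistral-7b-v0.1.Q4_K_M.gguf")` should return a small integer for a valid file and raise for a truncated or mislabeled one.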

5. Quantization Types Explained

| Quantization | Size (approx.) | Accuracy (approx.) | Use Case |
|---|---|---|---|
| Q4_K_M | 4.5GB | 92% | General purpose, 7B models |
| Q5_K_M | 5.5GB | 95% | High-quality output, 13B models |
| Q6_K | 6.5GB | 97% | Best quality, limited RAM |
| Q8_0 | 8.5GB | 99% | Maximum accuracy, high RAM |

Example benchmark:

# Test model performance
# (llama-cli is the current name of the former `main` binary; path assumes a CMake build)
./build/bin/llama-cli -m mistral-7b-v0.1-q4k.gguf -p "Why is the sky blue?" --repeat-penalty 1.1 --temp 0.7

# Benchmark with 100 tokens
time ./build/bin/llama-cli -m mistral-7b-v0.1-q4k.gguf -p "The quick brown fox jumps over the lazy dog." -n 100

6. API Setup and Integration

Using Ollama API:

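# Optional: open an interactive session to confirm the model loads (exit with /bye).
# The Ollama service already serves the API on port 11434.
ollama run mistral:7b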

# API test curl command
curl http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b",
    "prompt": "Explain quantum computing in simple terms",
    "stream": false
  }'
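
With `"stream": true`, Ollama returns one JSON object per line, each carrying a partial `response` and a final object with `"done": true`. The sketch below shows one way to reassemble those lines; the helper name `collect_stream` is an assumption, and it accepts any iterable of lines so it can be fed from `requests` or from a saved log.

```python
import json
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[str, bytes]]) -> str:
    """Join the partial 'response' fields of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)


# Possible usage with the requests library (assumes the Ollama service is running):
#   resp = requests.post("http://localhost:11434/api/generate",
#                        json={"model": "mistral:7b", "prompt": "Hi", "stream": True},
#                        stream=True)
#   print(collect_stream(resp.iter_lines()))
```

Streaming mainly helps interactive use, where showing tokens as they arrive matters more than total latency.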

Python integration:

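import requests

def query_llm(prompt):
    """Send a prompt to the local Ollama server and return the generated text."""
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'mistral:7b',
            'prompt': prompt,
            'stream': False
        },
        timeout=120,  # the first request can be slow while the model loads
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()['response']

# Usage
result = query_llm("Write a haiku about programming")
print(result)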
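
The same non-streaming response also carries timing fields, which gives a quick throughput figure without a separate benchmark run. A small sketch, assuming the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that Ollama includes in its final response object; the function name is illustrative:

```python
def tokens_per_second(result: dict) -> float:
    """Compute generation speed from an Ollama /api/generate response dict."""
    count = result["eval_count"]          # number of generated tokens
    duration_ns = result["eval_duration"]  # generation time in nanoseconds
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return count / (duration_ns / 1e9)
```

To use it, keep the full `response.json()` dictionary rather than only its `'response'` field and pass that dictionary in.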

Direct llama.cpp API:

# Run the HTTP server (`llama-server`; the CLI binary does not serve an API)
./build/bin/llama-server -m mistral-7b-v0.1-q4k.gguf --port 8080 -a "llama.cpp API"
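
For calling this endpoint from Python without extra dependencies, a sketch along the following lines may be enough. It assumes llama.cpp's HTTP server is listening on port 8080 and uses its `/completion` route, which takes a `prompt` and `n_predict` and returns the text in a `content` field; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt: str, n_predict: int = 128,
                             base_url: str = "http://localhost:8080"):
    """Build (but do not send) a POST request for the /completion route."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 128) -> str:
    """Send the request and return the generated text."""
    request = build_completion_request(prompt, n_predict)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["content"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.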

7. Systemd Service for 24/7 Operation

Create systemd service for automatic startup:

sudo nano /etc/systemd/system/llm-api.service
[Unit]
Description=Local LLM API Service
After=network.target

[Service]
Type=simple
User=your_user
WorkingDirectory=/home/your_user/llama.cpp
ExecStart=/home/your_user/llama.cpp/build/bin/llama-server -m /home/your_user/models/mistral-7b-v0.1-q4k.gguf --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable llm-api
sudo systemctl start llm-api
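
After starting the service, the model can take a while to load, so a client that connects immediately may fail. The sketch below polls until the server answers, assuming the service runs llama.cpp's HTTP server, which exposes a `/health` route; the function names, retry count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str = "http://localhost:8080/health") -> bool:
    """Return True if the server answers the health route with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx replies while loading.
        return False


def wait_until_ready(probe=probe_health, attempts: int = 30,
                     delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

The `probe` and `sleep` parameters are injectable so the retry logic can be exercised without a running server.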

8. Monitoring and Performance Tuning

Memory monitoring:

# Check GPU memory usage
nvidia-smi

# Monitor system resources
htop

# Memory usage for specific process
ps aux | grep llama
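
For logging GPU memory over time instead of reading it by eye, `nvidia-smi` can emit machine-readable CSV. A minimal sketch, assuming the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`; the helper names are hypothetical and values are in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(output: str):
    """Parse 'used, total' CSV lines into a list of (used_mib, total_mib) tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `gpu_memory()` before and after loading a model gives a direct measurement of how much VRAM the chosen settings consume.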

Performance optimization flags:

# Optimized startup command
# (--log-format is omitted here: recent llama-server builds may not accept it; check --help)
./build/bin/llama-server -m mistral-7b-v0.1-q4k.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080 \
  --temp 0.7 \
  --repeat-penalty 1.1
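
Choosing a value for `--n-gpu-layers` is mostly a matter of how many layers fit in VRAM. The heuristic below is a rough sketch, not how llama.cpp itself decides: it assumes layers are roughly equal in size and reserves a fixed amount of VRAM (1.5 GB here, an arbitrary assumption) for the KV cache and buffers.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many layers can be offloaded given the available VRAM."""
    per_layer_gb = model_gb / n_layers          # assume equally sized layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # keep headroom for cache/buffers
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # A 4.5 GB file with 32 layers on GPUs of different sizes.
    for vram in (1, 4, 8):
        print(f"{vram} GB VRAM: offload about {layers_that_fit(4.5, 32, vram)} layers")
```

Treat the result as a starting point and lower it if `nvidia-smi` shows memory close to the limit.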

Benchmark script:

#!/bin/bash
# benchmark.sh
MODEL_PATH="/home/user/models/mistral-7b-v0.1-q4k.gguf"
PROMPT="The future of artificial intelligence will be..."

echo "Benchmarking model: $MODEL_PATH"
# Quote the path so it survives spaces; -n sets the number of tokens to generate
time ./build/bin/llama-cli -m "$MODEL_PATH" -p "$PROMPT" -n 50
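
A single `time` run can be noisy, especially when the first run also pays for loading the model from disk. The sketch below repeats a command and reports the mean and spread; the function names and default run count are assumptions, and the command list is whatever benchmark invocation you already use.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs: int = 3):
    """Run cmd `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation); spread is 0.0 for a single run."""
    mean = statistics.mean(durations)
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, spread
```

A call such as `summarize(time_command(["./build/bin/llama-cli", "-m", model_path, "-p", prompt, "-n", "50"]))` would then give a more stable figure than one timing; the binary path here assumes a CMake build.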

9. Real Command Examples

Complete setup example:

# 1. Create directory structure
mkdir -p ~/models
cd ~

# 2. Build llama.cpp with CUDA
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"

# 3. Download a pre-quantized model (no conversion needed)
wget -P ~/models https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# 4. Start service with optimized parameters (run from ~/llama.cpp)
./build/bin/llama-server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
  --ctx-size 2048 \
  --n-gpu-layers 32 \
  --threads 8 \
  --port 8080

Test API endpoint:

# The llama.cpp server uses /completion; /api/generate is the Ollama route on port 11434
curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to create a REST API in Python",
    "n_predict": 256,
    "stream": false,
    "temperature": 0.7,
    "repeat_penalty": 1.1
  }'

10. Troubleshooting

Common issues and solutions:

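  1. CUDA out of memory: Reduce --n-gpu-layers (fewer layers on the GPU) or --ctx-size (smaller KV cache)
  2. Slow inference: Set --threads to the number of physical CPU cores and raise --n-gpu-layers as far as VRAM allows
  3. Model not found: Verify file paths; prefer absolute paths, especially in the systemd unit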
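
To see why lowering --ctx-size helps with out-of-memory errors, it is useful to estimate the KV cache, which grows linearly with context length. The sketch below uses the standard formula (keys plus values, per layer, per KV head); the Mistral-7B dimensions in the example (32 layers, 8 KV heads, head size 128) and the 16-bit cache are assumptions to check against your model's metadata.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the key/value cache: 2 tensors (K and V) per layer per position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    for ctx in (2048, 4096, 8192):
        mib = kv_cache_bytes(32, ctx, 8, 128) / (1024 ** 2)
        print(f"ctx {ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions a 2048-token context costs about 256 MiB, and each doubling of the context doubles that figure.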

