matias yoon

Local LLM Setup Guide (v42)

Developer's Guide to Local LLM Deployment

Overview & Prerequisites

Running LLMs locally does not need specialized hardware, but it does need enough RAM to hold the model weights. 8GB of RAM is sufficient for small models and for 4-bit quantized 7B models; for larger models, 16GB or more is recommended.

Prerequisites:

  • Linux 64-bit system (Ubuntu 20.04+ recommended)
  • 8GB+ RAM (16GB+ for larger models)
  • NVIDIA GPU with CUDA support (optional but highly recommended)
  • 20GB+ disk space
  • Python 3.8+
# Check system requirements
lscpu | grep -i "model name"
free -h
nvidia-smi  # if GPU available
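
The same checks can be scripted so they run before an install. The following is a minimal sketch, assuming a Linux host; the function names are illustrative, and the thresholds are simply the 8GB RAM and 20GB disk figures from the prerequisites list.

```python
import os
import shutil

# Thresholds taken from the prerequisites list above.
MIN_RAM_GB = 8
MIN_DISK_GB = 20


def total_ram_gb() -> float:
    """Total physical memory in GiB (page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def free_disk_gb(path: str = ".") -> float:
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def check_prerequisites(path: str = ".") -> dict:
    """Return a pass/fail flag for each prerequisite this sketch can test."""
    return {
        "ram": total_ram_gb() >= MIN_RAM_GB,
        "disk": free_disk_gb(path) >= MIN_DISK_GB,
        # nvidia-smi on PATH is only a hint that an NVIDIA driver is installed.
        "gpu_tool": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing or below threshold'}")
```

A failed "gpu_tool" check is not fatal, since the GPU is optional in this guide.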

Framework Comparison

| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Lightweight, native C++ | Limited features, manual setup | Developers wanting full control |
| Ollama | Simple API, easy deployment | Less flexible, slower startup | Quick prototyping |
| vLLM | Highest throughput | Complex setup, requires Python | Production high-throughput |
| LocalAI | Multi-model support, REST API | Resource-heavy | Enterprise use cases |

Recommendation: Use llama.cpp when you want a minimal install with full control over build and runtime flags, or Ollama when you want a ready-made HTTP API with little setup.

Step-by-Step Installation

1. Install Dependencies

# Update system
sudo apt update && sudo apt upgrade -y

# Install build tools
sudo apt install build-essential git cmake python3-pip -y

# Install CUDA (if using GPU). The repo path below targets Ubuntu 20.04 on x86_64; adjust it for other releases.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-11-8 -y

2. Install llama.cpp

# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

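# Build with GPU support
# Note: this Makefile target and the ./main binary match older checkouts.
# Recent versions build with CMake instead and name the binary llama-cli:
#   cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
make clean
make -j$(nproc) LLAMA_CUDA=1

# Verify installation
./main --help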

3. Download Model

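# Create the model directory next to llama.cpp/ (later commands assume this layout)
cd ..   # leave the llama.cpp checkout from step 2; skip if you are already in its parent directory
mkdir -p models

# Download a 7B model (example: Llama 2 7B Chat, Q4_K_M) into models/
wget -P models https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# Verify download
ls -lh models/llama-2-7b-chat.Q4_K_M.gguf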
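
Listing the file confirms that something was saved, but not that it is a model. As a rough sketch, the header can be checked in Python: GGUF files begin with the four bytes `GGUF` followed by a little-endian 32-bit version number. The function name is illustrative. This catches an HTML error page saved under a `.gguf` name; it does not detect a truncated download, for which comparing the file size or checksum against the model page is the better test.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the header is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a GGUF header")
    # The version is a little-endian unsigned 32-bit integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("models/llama-2-7b-chat.Q4_K_M.gguf")` should return a small integer for a valid download.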

Model Selection Guide

| Model | Approx. file size (4-bit quantized) | Best For |
|---|---|---|
| LLaMA-2-7B | 4GB | General purpose, chat |
| Mistral-7B | 4GB | Coding, instruction following |
| Phi-2 | 2GB | Lightweight, fast inference |
| TinyLlama | 1GB | Quick testing, demo |

# Example: Run LLaMA-2-7B
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 128 \
  --temp 0.7 \
  -p "Q: What is the capital of France? A:"

Quantization Types Explained

  • Q4_K_M: 4-bit k-quant, medium variant (best balance of size and quality)
  • Q5_K_M: 5-bit k-quant, medium variant (better quality, slightly larger)
  • Q8_0: 8-bit quantization (near-lossless, largest of the quantized formats)
  • F16: Unquantized 16-bit half-precision floats (reference quality, largest files)
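# Convert a model to a different quantization.
# Quantize from an F16 GGUF, not from an already-quantized file; the input
# filename below is a placeholder for an F16 export you have on disk.
./llama.cpp/quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M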

# Benchmark different quantizations
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 --temp 0.7 -p "Hello world"
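
To see how the quantization types listed above trade size for quality, file size can be estimated as parameters times bits per weight. This is a back-of-the-envelope sketch: the Q8_0 and F16 figures follow from the formats, while the K-quant figures are approximate averages assumed here for illustration and vary by model.

```python
# Approximate bits per weight. Q8_0 and F16 follow from the formats
# themselves; the K-quant figures are rough averages and vary by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Estimated weight size in GB (10^9 bytes) for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(7e9, quant):.1f} GB for 7B parameters")
```

Runtime memory use is higher than the file size because the context buffer is allocated on top of the weights.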

API Setup and Integration

Ollama Setup (Recommended for API)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server (skip this if the installer already started it as a systemd service)
ollama serve &

# Pull model
ollama pull llama2:7b-chat

# Test API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Custom API Integration

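# api_client.py
import requests


class LocalLLM:
    """Minimal client for the Ollama /api/generate endpoint."""

    def __init__(self, base_url="http://localhost:11434", timeout=120):
        self.base_url = base_url
        self.timeout = timeout  # seconds; generation on CPU can be slow

    def generate(self, prompt, model="llama2:7b-chat"):
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,
            },
            timeout=self.timeout,
        )
        response.raise_for_status()  # surface HTTP errors such as an unknown model
        return response.json()["response"]


# Usage
if __name__ == "__main__":
    llm = LocalLLM()
    result = llm.generate("Extract key information from this invoice: [INVOICE DATA]")
    print(result)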
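
For long generations it can help to print text as it arrives. With "stream" set to true, the endpoint sends one JSON object per line, each carrying a "response" fragment, and marks the last one with "done": true. The sketch below uses only the standard library; the function names are illustrative, and the default model and URL are the ones used earlier in this guide.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_response_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON objects."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break


def stream_generate(prompt: str, model: str = "llama2:7b-chat",
                    base_url: str = "http://localhost:11434") -> Iterator[str]:
    """POST to /api/generate with streaming on and yield text as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # An HTTP response object iterates line by line.
        yield from iter_response_chunks(response)
```

Typical use is `for chunk in stream_generate("Why is the sky blue?"): print(chunk, end="", flush=True)`. Keeping the parser separate from the network call makes it testable without a running server.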

Systemd Service for 24/7 Operation

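Create a service file so the server starts automatically at boot:

# Create service file
sudo nano /etc/systemd/system/llm.service

# Service configuration
# ExecStart uses llama.cpp's HTTP server binary (named llama-server in newer
# builds) rather than main, which exits after one generation and would be
# restarted in a loop. The host and port are example values.
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/llama-2-7b-chat.Q4_K_M.gguf -c 2048 --host 127.0.0.1 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target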

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm.service
sudo systemctl start llm.service
sudo systemctl status llm.service

Monitoring and Performance Tuning

Performance Monitoring

# Monitor GPU usage
nvidia-smi -l 1

# Monitor CPU and memory
htop

# Benchmark inference time
time ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 -p "Test prompt"
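
The wall-clock time reported by `time` includes loading the model, so dividing 128 tokens by it understates generation speed. One way around this, sketched below with illustrative function names, is to time two runs that differ only in `-n` and divide the extra tokens by the extra time. It assumes both runs generate their full token budget and do not stop early.

```python
import subprocess
import time
from typing import List


def timed_run(cmd: List[str]) -> float:
    """Run a command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Average rate over a whole run; includes load time if `seconds` does."""
    return n_tokens / seconds


def generation_rate(n1: int, t1: float, n2: int, t2: float) -> float:
    """Rate from two runs with different token counts; load time cancels out."""
    return (n2 - n1) / (t2 - t1)
```

For example, time the command above once with `-n 32` and once with `-n 128`, then pass both token counts and timings to `generation_rate`.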

Tuning Parameters

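# Optimize for speed: generate fewer tokens (-n 32) with near-greedy sampling.
# Generation time scales with -n; --temp changes which tokens are picked, not how fast.
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 32 \
  --temp 0.1 \
  --repeat-penalty 1.1

# Optimize for quality: allow longer answers (-n 256) with more varied sampling
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 256 \
  --temp 0.8 \
  --repeat-penalty 1.2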
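
To make the effect of `--temp` and `--repeat-penalty` concrete, here is a small sketch of the underlying arithmetic on a toy list of logits. It illustrates the commonly described scheme (logits divided by the temperature before softmax; logits of already-seen tokens pushed down by the penalty) and is not a copy of llama.cpp's sampler code.

```python
import math
from typing import Iterable, List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repeat_penalty(logits: List[float], seen: Iterable[int],
                         penalty: float) -> List[float]:
    """Make already-generated token ids less likely.

    Positive logits are divided by the penalty and non-positive ones are
    multiplied, so both move toward "less probable" when penalty > 1.
    """
    out = list(logits)
    for token_id in set(seen):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out
```

With logits `[2.0, 1.0, 0.0]`, a temperature of 0.1 puts almost all probability on the first token, while 0.8 leaves noticeable mass on the others, which is why low temperatures give more repeatable output.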

Real Command Examples

Document Processing Pipeline

# Extract text from PDF
pdftotext -layout document.pdf extracted.txt

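# Process with LLM
# --no-display-prompt keeps the echoed prompt out of the saved output;
# 2>/dev/null hides llama.cpp's log lines from the terminal
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 256 \
  --temp 0.3 \
  --no-display-prompt \
  -p "Extract structured data from this receipt text as JSON: $(cat extracted.txt)" \
  > output.json 2>/dev/null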

# Validate JSON
python3 -m json.tool output.json
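
Chat models often wrap the JSON in extra sentences, in which case the validation step fails even though a usable object is present. A possible workaround, sketched here with an illustrative function name, is to scan the output for the first span that decodes as a JSON object.

```python
import json
from typing import Any


def extract_first_json(text: str) -> Any:
    """Return the first decodable JSON object embedded in `text`."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # try the next opening brace
    raise ValueError("no JSON object found in model output")
```

Reading `output.json` as text and passing it through `extract_first_json` yields a Python dict that can be re-serialized with `json.dump`.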

Batch Processing Script

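#!/bin/bash
# batch_processor.sh: run every PDF in receipts/ through the model
set -euo pipefail
shopt -s nullglob   # skip the loop entirely if no PDFs match

for file in receipts/*.pdf; do
    # "-" makes pdftotext write to stdout, so no shared temp file is needed
    text=$(pdftotext -layout "$file" -)
    # --no-display-prompt keeps the echoed prompt out of the saved output
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -c 2048 \
      -n 256 \
      --temp 0.1 \
      --no-display-prompt \
      -p "Extract invoice data as JSON: $text" \
      > "${file%.pdf}.json" 2>/dev/null
done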
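
The script pastes the whole PDF text into the prompt, but the prompt and the reply share one context window (`-c`, 2048 tokens in the earlier examples). A long document can overflow it. The sketch below trims the document to a rough character budget; the function names are illustrative, and the four-characters-per-token figure is an assumed heuristic for English text, not a tokenizer measurement.

```python
def max_prompt_chars(ctx_tokens: int = 2048, reply_tokens: int = 256,
                     chars_per_token: float = 4.0) -> int:
    """Rough character budget for the prompt, leaving room for the reply."""
    return int((ctx_tokens - reply_tokens) * chars_per_token)


def build_prompt(instruction: str, document: str, **budget) -> str:
    """Join instruction and document, trimming the document to fit the budget."""
    limit = max_prompt_chars(**budget) - len(instruction) - 1
    if limit <= 0:
        raise ValueError("instruction alone exceeds the prompt budget")
    return f"{instruction} {document[:limit]}"
```

For exact limits, count tokens with the model's own tokenizer instead of estimating from characters.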

Troubleshooting Tips

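  1. Memory issues: Reduce the context size (-c parameter) or switch to a smaller quantization
  2. Slow startup: Use smaller models (Phi-2, TinyLlama)
  3. GPU memory issues: Offload fewer layers with -ngl (--n-gpu-layers); -ngl 0 keeps the model on the CPU
  4. Permission denied: Check file permissions and user groups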
# Common troubleshooting commands
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -c 1024 -n 64 --temp 0.7 -p "Test"

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
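
The CSV output of that query is easy to parse when GPU memory needs to be checked from a script. This is a minimal sketch with an illustrative function name; it assumes the default CSV layout, where the first row is a header and values carry a unit suffix such as "MiB".

```python
import csv
import io
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Return (used_mib, total_mib) per GPU from nvidia-smi CSV output."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(rows)  # skip the header row
    result = []
    for row in rows:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        # Each cell looks like "1234 MiB"; keep the leading number.
        used, total = (int(cell.split()[0]) for cell in row[:2])
        result.append((used, total))
    return result
```

The input text can come from `subprocess.run([...], capture_output=True, text=True).stdout` with the same arguments as the command above.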

This guide provides a complete foundation for local LLM deployment that balances performance, cost, and usability. The recommended setup supports real-world


📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
