matias yoon

Posted on May 25

Local LLM Setup Guide (v42)

#ai #llm #developers #tutorial

Developer's Guide to Local LLM Deployment

Overview & Prerequisites

Running LLMs locally does not require specialized hardware, but it does need plenty of RAM. 8GB is enough for small quantized models such as a 4-bit 7B model; 16GB or more is recommended for larger models.

Prerequisites:

Linux 64-bit system (Ubuntu 20.04+ recommended)
8GB+ RAM (16GB+ for larger models)
NVIDIA GPU with CUDA support (optional but highly recommended)
20GB+ disk space
Python 3.8+

# Check system requirements
lscpu | grep -i "model name"
free -h
nvidia-smi  # if GPU available
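
The same pre-flight check can be scripted. The following is a minimal sketch, assuming a Linux system; the function names and the 7.5 GiB default (an "8GB" machine usually reports slightly less than 8 GiB) are illustrative choices, not part of any tool mentioned in this guide.

```python
import os
import shutil

GIB = 1024 ** 3


def system_summary(path="."):
    """Return CPU count, total RAM, and free disk space (GiB) for a quick check."""
    # Physical memory = page size * number of physical pages (Linux sysconf names).
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_free_bytes = shutil.disk_usage(path).free
    return {
        "cpu_count": os.cpu_count(),
        "ram_gib": round(ram_bytes / GIB, 1),
        "disk_free_gib": round(disk_free_bytes / GIB, 1),
    }


def meets_minimum(summary, min_ram_gib=7.5, min_disk_gib=20.0):
    """Compare a summary against the prerequisites (thresholds are assumptions)."""
    return (
        summary["ram_gib"] >= min_ram_gib
        and summary["disk_free_gib"] >= min_disk_gib
    )


if __name__ == "__main__":
    summary = system_summary()
    print(summary, "OK" if meets_minimum(summary) else "below recommended minimum")
```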

Framework Comparison

Framework	Pros	Cons	Best For
llama.cpp	Lightweight, native C++	Limited features, manual setup	Developers wanting full control
Ollama	Simple API, easy deployment	Less flexible, slower startup	Quick prototyping
vLLM	Highest throughput	Complex setup, requires Python	Production high-throughput
LocalAI	Multi-model support, REST API	Resource-heavy	Enterprise use cases

Recommendation: Use llama.cpp if you want a minimal footprint and full control, or Ollama if you want a ready-to-use HTTP API with little setup.

Step-by-Step Installation

1. Install Dependencies

# Update system
sudo apt update && sudo apt upgrade -y

# Install build tools
sudo apt install build-essential git cmake python3-pip -y

# Install CUDA (if using GPU; this repository path is for Ubuntu 20.04)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-11-8 -y

2. Install llama.cpp

# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

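# Build with GPU support
# (the flag name depends on the llama.cpp version: older trees use LLAMA_CUBLAS=1,
# newer ones build with CMake and -DGGML_CUDA=ON)
make clean
make -j$(nproc) LLAMA_CUDA=1

# Verify installation (newer releases name this binary llama-cli)
./main --help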

3. Download Model

# Create the model directory next to the llama.cpp checkout
cd ..
mkdir -p models
cd models

# Download a 7B model (example: Llama 2 7B Chat, Q4_K_M)
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf

# Verify download (the file should be about 4GB)
ls -lh llama-2-7b-chat.Q4_K_M.gguf
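
Listing the file only confirms that something was downloaded. A slightly stronger check is to read the file header: GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below assumes that layout; the function name is hypothetical, and it does not validate the rest of the file.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF version number, or raise ValueError if the header is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)  # 4 magic bytes + uint32 version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        # An HTML error page or a truncated download typically fails here.
        raise ValueError(f"{path} does not start with a GGUF header")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Example (path taken from the download step above):
# print(read_gguf_version("llama-2-7b-chat.Q4_K_M.gguf"))
```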

Model Selection Guide

Model	Size (approx., 4-bit quantized)	Best For
LLaMA-2-7B	4GB	General purpose, chat
Mistral-7B	4GB	Coding, instruction following
Phi-2	2GB	Lightweight, fast inference
TinyLlama	1GB	Quick testing, demo

# Example: run LLaMA-2-7B (from the directory that contains llama.cpp/ and models/)
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 128 \
  --temp 0.7 \
  -p "Q: What is the capital of France? A:"

Quantization Types Explained

Q4_K_M: 4-bit k-quant, medium variant (best balance of size and quality)
Q5_K_M: 5-bit k-quant, medium variant (better quality, slightly larger)
Q8_0: 8-bit quantization (close to the original quality, about twice the size of Q4_K_M)
F16: Unquantized 16-bit floating point (largest files)
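
File size follows roughly from parameter count times bits per weight. The sketch below works through that arithmetic; the bits-per-weight figures are approximate assumptions for illustration (the k-quants mix block types, so the real average varies by model), and the function name is hypothetical.

```python
# Approximate average bits per weight for each format (assumed values).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,   # 8-bit weights plus one 16-bit scale per 32-weight block
    "F16": 16.0,
}


def estimate_file_size_gb(n_params, quant):
    """Estimate model file size in GB (10^9 bytes), ignoring metadata overhead."""
    bits = n_params * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


if __name__ == "__main__":
    # A 7B-parameter model at each quantization level.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_file_size_gb(7e9, quant):.1f} GB")
```

For a 7B model this lands near the 4GB figure in the model table for 4-bit files and about 14GB for F16.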

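# Convert a model to a different quantization
# (start from an F16 GGUF; the F16 filename below is a placeholder. Re-quantizing
# an already quantized file such as Q4_K_M compounds the quality loss.)
./llama.cpp/quantize models/llama-2-7b-chat.F16.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M

# Benchmark a quantization (repeat with each file and compare the timings llama.cpp prints)
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 --temp 0.7 -p "Hello world"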

API Setup and Integration

Ollama Setup (Recommended for API)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama service
ollama serve &

# Pull model
ollama pull llama2:7b-chat

# Test API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:7b-chat",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Custom API Integration

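# api_client.py
import requests


class LocalLLM:
    """Minimal client for the Ollama /api/generate endpoint."""

    def __init__(self, base_url="http://localhost:11434", timeout=120):
        self.base_url = base_url
        self.timeout = timeout  # seconds; CPU-only generation can be slow

    def generate(self, prompt, model="llama2:7b-chat"):
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,  # one JSON object instead of a stream
            },
            timeout=self.timeout,
        )
        response.raise_for_status()  # surface HTTP errors instead of a KeyError
        return response.json()["response"]


# Usage
if __name__ == "__main__":
    llm = LocalLLM()
    result = llm.generate("Extract key information from this invoice: [INVOICE DATA]")
    print(result)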
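
The client above sets "stream" to False. With streaming enabled, Ollama's /api/generate returns newline-delimited JSON, where each line carries a "response" fragment and the last line has "done" set to true. The following sketch shows one way to reassemble such a stream; `collect_stream` is a hypothetical helper, and it accepts any iterable of lines so it can be tried without a running server.

```python
import json


def collect_stream(lines):
    """Join the "response" fragments of an NDJSON stream until "done" is true."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        b'{"response": "The sky ", "done": false}',
        b'{"response": "is blue.", "done": false}',
        b'{"response": "", "done": true}',
    ]
    print(collect_stream(sample))
```

With the requests library, the lines could come from `requests.post(..., stream=True).iter_lines()`, which yields bytes.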

Systemd Service for 24/7 Operation

Create service file for automatic startup:

# Create service file
sudo nano /etc/systemd/system/llm.service

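# Service configuration
# (runs the llama.cpp HTTP server; the main binary exits after one generation,
# so Restart=always would restart it in a loop. Port 8080 is an example value.)
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server -m /home/developer/models/llama-2-7b-chat.Q4_K_M.gguf -c 2048 --host 127.0.0.1 --port 8080
Restart=always
RestartSec=10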

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm.service
sudo systemctl start llm.service
sudo systemctl status llm.service
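
systemctl status reports that the process started, not that it is ready to answer requests. For a service that listens on a TCP port, such as the Ollama API on port 11434 used earlier, a script can poll until the port accepts connections. This is a minimal sketch; the function name, timeout, and polling interval are illustrative assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 11434, timeout=5.0)
    print("service is accepting connections" if ready else "service not reachable")
```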

Monitoring and Performance Tuning

Performance Monitoring

# Monitor GPU usage
nvidia-smi -l 1

# Monitor CPU and memory
htop

# Benchmark inference time
time ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -n 128 -p "Test prompt"
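
Wall-clock time becomes easier to compare across models once it is converted to tokens per second. The sketch below times an arbitrary command and does that division; the helper names are hypothetical, and because the measurement includes model loading it understates the pure generation speed.

```python
import subprocess
import time


def time_command(cmd):
    """Run a command (list of arguments) and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start


def tokens_per_second(n_tokens, elapsed_seconds):
    """Convert a token count and a duration into throughput."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return n_tokens / elapsed_seconds


# Example, mirroring the benchmark command above (requests 128 tokens with -n 128):
# elapsed = time_command([
#     "./llama.cpp/main", "-m", "models/llama-2-7b-chat.Q4_K_M.gguf",
#     "-n", "128", "-p", "Test prompt",
# ])
# print(f"{tokens_per_second(128, elapsed):.1f} tokens/s (including load time)")
```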

Tuning Parameters

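# Optimize for speed: cap the output at 32 tokens and keep sampling near-deterministic
# (generation time scales with the number of tokens, not with temperature)
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 32 \
  --temp 0.1 \
  --repeat-penalty 1.1 \
  -p "Test prompt"

# Optimize for quality: allow longer, more varied output
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -c 2048 \
  -n 256 \
  --temp 0.8 \
  --repeat-penalty 1.2 \
  -p "Test prompt"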
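
To see what the --temp values above change, it helps to look at how temperature is usually applied: the logits are divided by the temperature before the softmax, so a low value concentrates probability on the top token and a high value spreads it out. The sketch below illustrates that standard formula with made-up logits; it is not llama.cpp's sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return token probabilities after scaling logits by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
    for temp in (0.1, 0.8):
        probs = softmax_with_temperature(logits, temp)
        print(temp, [round(p, 3) for p in probs])
```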

Real Command Examples

Document Processing Pipeline

# Extract text from PDF
pdftotext -layout document.pdf extracted.txt

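# Process with LLM (ask explicitly for JSON; the model may still add extra text)
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -n 256 \
  -p "Extract structured data from this receipt text and answer with JSON only: $(cat extracted.txt)" \
  --temp 0.3 > output.json

# Validate JSON (fails if the output contains anything besides one JSON document)
python3 -m json.tool output.json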
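
Model output often wraps the JSON in extra prose, in which case json.tool rejects the whole file. One workaround, sketched below under that assumption, is to scan for the first position where a valid JSON object starts and decode only that part. The function name is hypothetical.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


if __name__ == "__main__":
    raw = 'Sure! Here is the data: {"vendor": "ACME", "total": 12.5} Let me know.'
    print(extract_first_json_object(raw))
```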

Batch Processing Script

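#!/bin/bash
# batch_processor.sh
set -euo pipefail
shopt -s nullglob  # skip the loop entirely if no PDFs match

tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT  # clean up even if a step fails

for file in receipts/*.pdf; do
    pdftotext -layout "$file" "$tmp"
    ./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf \
      -n 256 \
      -p "Extract invoice data: $(cat "$tmp")" \
      --temp 0.1 > "${file%.pdf}.json"
done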

Troubleshooting Tips

Memory issues: Reduce context size (-c parameter)
Slow startup: Use smaller models (Phi-2, TinyLlama)
GPU memory issues: Offload fewer layers to the GPU by lowering -ngl (--n-gpu-layers); -ngl 0 keeps the whole model on the CPU
Permission denied: Check file permissions and user groups
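
Reducing the context size helps with memory because the key/value cache grows linearly with it. The sketch below works through the arithmetic, assuming standard multi-head attention and a 16-bit cache; the Llama 2 7B dimensions (32 layers, embedding size 4096) are supplied here for illustration, and models with grouped-query attention need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_embd, n_ctx, bytes_per_element=2):
    """Estimate KV cache size: keys and values, one n_ctx x n_embd tensor each per layer."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_element


if __name__ == "__main__":
    GIB = 1024 ** 3
    for n_ctx in (2048, 1024):  # the -c values used in this guide
        size = kv_cache_bytes(n_layers=32, n_embd=4096, n_ctx=n_ctx)
        print(f"-c {n_ctx}: {size / GIB:.2f} GiB")
```

Under these assumptions, halving -c from 2048 to 1024 halves the cache, from 1 GiB to 0.5 GiB.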

# Retry with a smaller context and shorter output
./llama.cpp/main -m models/llama-2-7b-chat.Q4_K_M.gguf -c 1024 -n 64 --temp 0.7 -p "Test"

# Check GPU memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

This guide provides a complete foundation for local LLM deployment that balances performance, cost, and usability. The recommended setup supports real-world

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
