matias yoon

Local LLM Setup Guide (v41)

Overview & Prerequisites

Running local LLMs requires hardware with adequate compute resources. Minimum specs:

  • CPU: 4+ cores, 3.2GHz+ (Intel i7 or AMD Ryzen 7 recommended)
  • RAM: 16GB minimum (32GB preferred for larger models)
  • Storage: 20GB+ SSD (fast NVMe preferred)
  • GPU: NVIDIA RTX 3060 or better (12GB+ VRAM) for acceleration

CPU-only setups work, but expect inference to run roughly 2-5x slower than on a GPU. For the best performance, use an NVIDIA GPU with CUDA support.
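
The minimum specs above can be checked with a short script before installing anything. This is a minimal sketch, assuming Linux (the RAM lookup relies on `os.sysconf`); the function names and threshold constants are illustrative and not part of any tool in this guide. It does not check for a GPU.

```python
import os
import shutil

# Thresholds copied from the minimum specs listed above.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 20


def check_specs(cores, ram_gb, free_disk_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores found, {MIN_CORES}+ recommended")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f}GB found, {MIN_RAM_GB}GB minimum")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {free_disk_gb:.1f}GB free, {MIN_DISK_GB}GB+ needed")
    return problems


def detect_specs(path="."):
    """Read logical core count, total RAM, and free disk space (Linux)."""
    cores = os.cpu_count() or 1
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_disk_bytes = shutil.disk_usage(path).free
    return cores, ram_bytes / 1e9, free_disk_bytes / 1e9


if __name__ == "__main__":
    for problem in check_specs(*detect_specs()):
        print("WARNING:", problem)
```

Note that `os.cpu_count()` reports logical cores, so a 4-core CPU with hyperthreading shows up as 8.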

Framework Comparison

| Framework | Pros | Cons | Best For |
|---|---|---|---|
| llama.cpp | Native C++, no dependencies, minimal memory usage | Manual compilation required, limited API | Developers wanting full control |
| Ollama | Easy installation, built-in model management, API-ready | Less customizable | Quick prototyping |
| vLLM | Ultra-fast inference, optimized for large models | Complex setup, Python-heavy | High-throughput applications |
| LocalAI | Multi-model support, OpenAI-compatible API | Resource-heavy, more complex | Production deployments |

Recommendation: Use llama.cpp for maximum performance and control, or Ollama for rapid prototyping.

Step-by-Step Installation

Install llama.cpp

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

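# Build the project
# (newer llama.cpp releases dropped the Makefile; there, build with
#  `cmake -B build && cmake --build build --config Release` instead)
make

# Verify installation
# (newer releases name this binary llama-cli and place it under build/bin/)
./main --help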

Download a model

# Create model directory
mkdir -p models
cd models

# Download a 7B model (e.g. Mistral-7B)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Return to project root
cd ..
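
A failed or interrupted download can leave an HTML error page or a truncated file where the model should be. GGUF files start with the four magic bytes `GGUF` followed by a little-endian 32-bit version number, so a quick header check catches the most obvious failures. A minimal sketch; the function name is illustrative, and passing this check does not prove the rest of the file is intact.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the header is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4-7 hold the format version as an unsigned little-endian integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Usage: `read_gguf_version("models/mistral-7b-v0.1.Q4_K_M.gguf")` either returns the version number or raises before you spend time loading the file.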

Model Selection Guide

| Use Case | Recommended Model | Reason |
|---|---|---|
| General chat | Mistral-7B | Balanced performance |
| Code generation | CodeLlama-7B | Better programming context |
| Research/academic | Llama-2-70B | Most comprehensive (needs far more memory than the 7B models) |
| Lightweight inference | Phi-2 | Fastest, smallest |

# Example: Run Mistral-7B
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 200

Quantization Types Explained

Quantization shrinks a model by storing its weights at lower precision, trading a small amount of accuracy for less memory and faster inference:

  • Q4_K_M: 4-bit, most balanced tradeoff (recommended for most use cases)
  • Q5_K_M: 5-bit, slightly better accuracy at cost of size
  • Q6_K: 6-bit, high accuracy but larger file size
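  • Q8_0: 8-bit, near full precision

# Requantize an f16 GGUF file with the quantize tool built alongside main
# (named llama-quantize in newer releases)
./quantize models/7B/ggml-model-f16.gguf models/7B/ggml-model-Q4_K_M.gguf Q4_K_M

# Check the result: the file size reflects the quantization level
ls -lh models/mistral-7b-v0.1.Q4_K_M.gguf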
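
To get a feel for the size tradeoff between these quantization types, file size can be estimated as parameter count times bits per weight. A rough sketch: the bits-per-weight figures below are approximate averages (the K-quants mix several precisions internally), and the 7.24 billion parameter count for Mistral-7B is likewise an approximation. Treat the output as a ballpark, not an exact file size.

```python
# Approximate average bits stored per weight for each quantization type.
# These are assumptions for illustration; real files vary slightly by model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params, quant):
    """Estimate the model file size in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    mistral_7b_params = 7.24e9  # approximate
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(mistral_7b_params, quant):.1f} GB")
```

Runtime memory use is higher than the file size, because the context (KV) cache and scratch buffers are allocated on top of the weights.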

API Setup and Integration

Simple HTTP server

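# Run model with HTTP API (llama.cpp)
# Note: --host 0.0.0.0 listens on all network interfaces;
# use --host 127.0.0.1 to accept connections from this machine only
./server -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -c 2048 \
  --host 0.0.0.0 \
  --port 1234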

Test with curl

curl http://localhost:1234/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a 50-word explanation of machine learning",
    "n_predict": 100,
    "temperature": 0.7
  }'

Integrate with existing tools

# config.yaml for tool integration
llm:
  endpoint: http://localhost:1234
  model: mistral-7b-v0.1.Q4_K_M.gguf
  max_tokens: 200
  temperature: 0.7
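
The config above uses `max_tokens`, while the `/completion` endpoint shown earlier expects `n_predict`, so a tool that reads this config has to translate between the two. A minimal sketch of that mapping, assuming the YAML has already been loaded into a dict (for example with PyYAML); the function name is hypothetical, and the `model` key is treated here as informational because the server answers with whichever model it was started with.

```python
def build_completion_request(config, prompt):
    """Turn the llm section of the config into a URL and JSON payload."""
    llm = config["llm"]
    url = llm["endpoint"].rstrip("/") + "/completion"
    payload = {
        "prompt": prompt,
        # The config calls this max_tokens; the server calls it n_predict.
        "n_predict": llm.get("max_tokens", 200),
        "temperature": llm.get("temperature", 0.7),
    }
    return url, payload


# The same values as config.yaml, written as the dict a YAML loader would return.
CONFIG = {
    "llm": {
        "endpoint": "http://localhost:1234",
        "model": "mistral-7b-v0.1.Q4_K_M.gguf",
        "max_tokens": 200,
        "temperature": 0.7,
    }
}

if __name__ == "__main__":
    url, payload = build_completion_request(CONFIG, "Hello")
    print(url, payload)
```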

Systemd Service for 24/7 Operation

Create service file:

sudo nano /etc/systemd/system/llm.service

Content:

[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llama.cpp
ExecStart=/home/developer/llama.cpp/server \
  -m /home/developer/llama.cpp/models/mistral-7b-v0.1.Q4_K_M.gguf \
  -c 2048 \
  --host 0.0.0.0 \
  --port 1234
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable llm.service
sudo systemctl start llm.service
sudo systemctl status llm.service
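
`systemctl status` reports that the process is running, not that the server is accepting requests yet, and loading a multi-gigabyte model can take a while. A small sketch that polls the port until it accepts TCP connections; the function name and defaults are illustrative, and an open port is only a rough readiness signal, since the server may still be loading the model.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=1234, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    print("server reachable" if wait_for_port(timeout=5.0) else "server not reachable")
```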

Monitoring and Performance Tuning

Benchmark script

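#!/bin/bash
# benchmark.sh

MODEL_PATH="models/mistral-7b-v0.1.Q4_K_M.gguf"
PROMPT="The future of artificial intelligence is"

echo "Running benchmark..."
# Quote the variables so paths or prompts containing spaces survive intact.
# Note: the measured time includes model loading, not just generation.
time ./main -m "$MODEL_PATH" -p "$PROMPT" -n 100 --temp 0.7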
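
The same measurement can be turned into a tokens-per-second figure and averaged over a few runs. A minimal sketch: `generate` is a stand-in for any callable that produces `n_tokens` tokens for a prompt (a hypothetical parameter, not a llama.cpp API), and the calculation assumes the model actually emits the full `n_tokens` rather than stopping early.

```python
import statistics
import time


def measure_throughput(generate, prompt, n_tokens, runs=3):
    """Call generate(prompt, n_tokens) several times; return mean tokens/sec."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, n_tokens)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.mean(rates)


if __name__ == "__main__":
    # Stand-in generator that just sleeps; swap in a real client call.
    def fake_generate(prompt, n_tokens):
        time.sleep(0.01 * n_tokens)

    rate = measure_throughput(fake_generate, "The future of AI is", 20)
    print(f"{rate:.1f} tokens/sec")
```

Averaging several runs smooths out one-off effects such as a cold disk cache on the first load.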

Memory usage monitoring

# Check GPU usage (if using CUDA)
nvidia-smi

# Monitor system resources
htop

# Check model memory usage
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf -c 2048 --verbose

Performance parameters

# Optimize for speed
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -c 1024 \
  --threads 8 \
  --batch-size 512 \
  --temp 0.3

# Optimize for quality
./main -m models/mistral-7b-v0.1.Q4_K_M.gguf \
  -c 2048 \
  --threads 4 \
  --temp 0.7
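
The context size (`-c`) affects memory as well as speed, because the key/value cache grows linearly with the number of context tokens. A back-of-the-envelope sketch: the defaults below are Mistral-7B's published architecture (32 layers, 8 key/value heads, head dimension 128) with a 16-bit cache, all stated here as assumptions; other models need different numbers.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate key/value cache size in bytes for a given context length."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for n_ctx in (1024, 2048):
        mib = kv_cache_bytes(n_ctx) / 2**20
        print(f"-c {n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions, halving the context from 2048 to 1024 halves the cache, which is why the speed-oriented command above uses the smaller value.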

Real Command Examples

Complete workflow for deployment

# 1. Setup environment
mkdir -p ~/llm/{models,logs}
cd ~/llm

# 2. Get and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# 3. Download model
cd ~/llm/models
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# 4. Test inference
cd ~/llm/llama.cpp
./main -m ../models/mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" -n 50

# 5. Start service (the paths in the unit file must match this ~/llm layout)
sudo systemctl start llm.service

Integration with Python application

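# llm_client.py
import requests


class LLMClient:
    """Minimal client for the llama.cpp server's /completion endpoint."""

    def __init__(self, base_url="http://localhost:1234", timeout=120):
        self.base_url = base_url
        self.timeout = timeout  # seconds; generation on CPU can be slow

    def generate(self, prompt, max_tokens=100):
        response = requests.post(
            f"{self.base_url}/completion",
            json={
                "prompt": prompt,
                "n_predict": max_tokens,  # the server's name for the token limit
                "temperature": 0.7,
            },
            timeout=self.timeout,
        )
        # Surface HTTP errors instead of trying to parse an error body.
        response.raise_for_status()
        return response.json()


# Usage
client = LLMClient()
result = client.generate("Explain quantum computing")
print(result["content"])  # the generated text is returned under "content"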

Key Performance Benchmarks

| Model | Quantization | Inference Speed | Memory Usage |
|---|---|---|---|
| Mistral-7B | Q4_K_M | 15-20 tokens/sec | 5GB |
| Llama-2-7B | Q4_K_M | 12-18 tokens/sec | 6GB |
| Phi-2 | Q4_K_M | 25-30 tokens/sec | 3GB |

Security Considerations

  1. Firewall: Limit access to localhost only
  2. Authentication: Add token-based auth in production
  3. Data handling: Never pass sensitive data through public APIs
  4. File permissions: Secure model files with proper ownership

# Secure model files
chmod 600 models/*.gguf
chown developer:developer models/*.gguf
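
Permissions tend to drift as new models are downloaded, so it can help to audit the directory from time to time. A small sketch that flags any `.gguf` file granting access to group or others; the function name is illustrative, and it checks mode bits only, not ownership.

```python
import os
import stat


def insecure_model_files(directory):
    """Return (path, mode) pairs for .gguf files accessible by group or others."""
    flagged = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".gguf"):
            continue
        path = os.path.join(directory, name)
        mode = stat.S_IMODE(os.stat(path).st_mode)
        # Any group or other permission bit means the file is looser than 600.
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            flagged.append((path, oct(mode)))
    return flagged


if __name__ == "__main__":
    if os.path.isdir("models"):
        for path, mode in insecure_model_files("models"):
            print(f"{path}: mode {mode}, expected 0o600")
```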

Next Steps

  1. Scale up: Add more models for diverse use cases
  2. Load balancing: Use multiple instances for high volume
  3. Caching: Implement response caching for repeated queries
  4. Monitoring: Add logging and metrics collection
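
As a starting point for the caching step, repeated prompts can be served from a small in-memory cache in front of the client. A minimal sketch, assuming a `generate(prompt, max_tokens)` callable such as the client method shown earlier; the class name and eviction policy (least recently used) are illustrative choices. Keep in mind that with a temperature above 0 the model's output is random, so caching changes behavior by always returning the first answer.

```python
from collections import OrderedDict


class CachedClient:
    """Wrap a generate(prompt, max_tokens) callable with a small LRU cache."""

    def __init__(self, generate, max_entries=256):
        self._generate = generate
        self._max_entries = max_entries
        self._cache = OrderedDict()

    def generate(self, prompt, max_tokens=100):
        key = (prompt, max_tokens)
        if key in self._cache:
            # Mark the entry as recently used and skip the model call.
            self._cache.move_to_end(key)
            return self._cache[key]
        result = self._generate(prompt, max_tokens)
        self._cache[key] = result
        if len(self._cache) > self._max_entries:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return result
```

Usage: `cached = CachedClient(LLMClient().generate)` and then call `cached.generate(...)` in place of the client method.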


