matias yoon

Posted on May 25

Local LLM Setup Guide (v30)

#ai #llm #developers #tutorial

Overview & Prerequisites

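Running LLMs locally removes the dependency on cloud services, which keeps your data on your own hardware and cuts costs. This is a practical guide to setting up an optimized local LLM on a Linux system.

Prerequisites:

64-bit Linux system (Ubuntu 20.04 or later recommended)
GPU (NVIDIA RTX 30xx or newer recommended)
At least 16GB RAM (32GB recommended)
At least 10GB of disk space (varies by model)
git, curl, and build-essential installed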

# Install the prerequisites
sudo apt update && sudo apt install -y git curl build-essential
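
To confirm the prerequisites before building anything, a small script can check for the tools and read the RAM and disk figures. This is a minimal sketch: the function names are hypothetical, and it assumes `gcc` and `make` are the tools that `build-essential` provides on your system.

```python
import shutil

# gcc and make are installed by build-essential (assumption for this sketch)
REQUIRED_TOOLS = ["git", "curl", "gcc", "make"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def mem_total_gib(meminfo_text):
    """Parse the MemTotal line of /proc/meminfo (reported in kB) into GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def free_disk_gib(path="."):
    """Free space on the filesystem that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / (1024 ** 3)
```

On Linux you would pass the contents of `/proc/meminfo` to `mem_total_gib`. Note that a machine sold as 16GB usually reports slightly less than 16 GiB, so compare against a threshold a little below the nominal figure.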

Framework Comparison

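Framework	Pros	Cons	Best for
llama.cpp	High performance, minimal dependencies, C++ based	Command-line driven, complex configuration	Performance-focused developers
Ollama	Easy installation and model management, Docker-like workflow	Slightly lower performance	Quick experiments
vLLM	Fast inference, multi-GPU support	Complex installation, high memory requirements	Large-scale deployment
LocalAI	Broad API compatibility, easy integration	Slow initial load	API-centric development

Recommendation: llama.cpp combined with systemd, for the best mix of performance and manageability.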

Step-by-Step Installation

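1. Install llama.cpp

# Clone the source and build with CMake (recent llama.cpp versions no longer build with plain `make`)
sudo apt install -y cmake
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build        # add -DGGML_CUDA=ON for NVIDIA GPU offload (requires the CUDA toolkit)
cmake --build build --config Release -j

# Test run (requires a downloaded model)
# Binaries land in build/bin; older Makefile builds named the CLI ./main
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf
./build/bin/llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf -p "Hello world" -n 10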

2. Download and prepare the model

# Create the model directory
mkdir -p ~/models
cd ~/models

# Download the Mistral-7B model (Q4_K_M quantization)
wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Set up the service folder structure
mkdir -p ~/llama_service/{models,logs,config}
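
A failed download sometimes leaves an HTML error page or an empty file under the `.gguf` name. The sketch below reads the first bytes of the file and checks the GGUF magic. It assumes the little-endian header layout used by GGUF v2 and later (4-byte magic, `uint32` version, `uint64` tensor count, `uint64` metadata count), and `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Raises ValueError if the file is too short or the magic does not match.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values (20 bytes)
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

This only catches files that are not GGUF at all. It does not detect a download that was cut off partway, so comparing the file size or checksum against the model page is still worthwhile.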

Model Selection Guide

Model	Use case	Recommended specs	Performance
Mistral-7B	General reasoning, code generation	16GB RAM, RTX 3060	85%
Llama-3-8B	High-accuracy reasoning	24GB RAM, RTX 4070	90%
Phi-3-Mini	Lightweight tasks	8GB RAM, CPU	70%
Mixtral-8x7B	Complex tasks	48GB RAM, RTX 4090	95%

# Example performance test
./build/bin/llama-cli -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf -p "What is your purpose?" -n 100 -e
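
The recommended RAM figures in the table can be turned into a simple lookup. The sketch below picks the most demanding model whose recommended RAM fits the machine. It uses only the RAM column and ignores the GPU column, so treat the result as a starting point. The function name is hypothetical.

```python
# (model name, recommended RAM in GB), copied from the table above
MODELS = [
    ("Phi-3-Mini", 8),
    ("Mistral-7B", 16),
    ("Llama-3-8B", 24),
    ("Mixtral-8x7B", 48),
]


def largest_model_for(ram_gb):
    """Return the model with the highest recommended RAM that still fits, or None."""
    fitting = [model for model in MODELS if model[1] <= ram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda model: model[1])[0]


print(largest_model_for(32))
```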

Quantization Types Explained

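# Quantization types compared (approximate size reduction relative to FP16)
# Q4_K_M: 4-bit K-quant, about 70% smaller; a common size/quality trade-off
# Q5_K_M: 5-bit K-quant, about 65% smaller
# Q6_K: 6-bit K-quant, about 60% smaller
# Q8_0: 8-bit quantization, about 47% smaller; highest quality of the four

# Example conversion: convert the Hugging Face checkpoint to an FP16 GGUF, then quantize it
# (the converter's --outtype does not accept K-quants such as q4_k_m)
python3 convert_hf_to_gguf.py ~/models/Mistral-7B-v0.1/ --outtype f16 --outfile mistral-7b-v0.1.f16.gguf
./build/bin/llama-quantize mistral-7b-v0.1.f16.gguf mistral-7b-v0.1.Q4_K_M.gguf Q4_K_M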
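
The effect of each quantization type on file size follows from its bits per weight. The sketch below estimates the size of a quantized model and the saving relative to FP16. The bits-per-weight values are rough averages assumed for illustration (K-quants mix block types, so real files differ slightly), and the 7.24 billion parameter count for Mistral-7B is likewise approximate.

```python
# Approximate bits per weight; rough averages assumed for this sketch
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def estimated_size_gb(n_params, quant):
    """Estimated model file size in GB (10^9 bytes) for a parameter count."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def saving_vs_f16(quant):
    """Fraction of the FP16 size saved by the given quantization type."""
    return 1 - BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["F16"]


for quant in ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"):
    size = estimated_size_gb(7.24e9, quant)
    print(f"{quant}: ~{size:.1f} GB, ~{saving_vs_f16(quant):.0%} smaller than FP16")
```

Fewer bits per weight means a smaller file and lower memory use, at some cost in output quality.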

API Setup and Integration

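1. Start the API server

# Start the HTTP server. llama-server (./server in older builds) serves OpenAI-compatible endpoints under /v1.
# A system prompt is sent by the client as a "system" message, not passed on the command line.
./build/bin/llama-server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080

# Or require clients to present an API key
./build/bin/llama-server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080 --api-key your-api-key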

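2. Python client example

# client.py (requires: pip install openai)
import openai

# Point the OpenAI SDK at the local server instead of the hosted API
client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key"  # must match the server's --api-key if one was set
)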

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Hello, world"}],
    max_tokens=100
)

print(response.choices[0].message.content)
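
If you would rather not install the `openai` package, the same request can be built with the standard library. This is a sketch under the assumption that the server exposes the OpenAI-style `/v1/chat/completions` route and accepts a bearer token. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, user_text, system_text=None, max_tokens=100):
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    messages = []
    if system_text:
        # The system prompt travels with each request as the first message
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    body = json.dumps(
        {"model": "mistral", "messages": messages, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("http://localhost:8080/v1", "your-api-key", "Hello, world", system_text="Answer briefly."))` returns the reply text.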

Systemd Service for 24/7 Operation

1. Create the service file

# /etc/systemd/system/llama.service
sudo tee /etc/systemd/system/llama.service << EOF
[Unit]
Description=Local LLM Service
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/llama.cpp
ExecStart=/home/ubuntu/llama.cpp/build/bin/llama-server -m /home/ubuntu/models/mistral-7b-v0.1.Q4_K_M.gguf --host 127.0.0.1 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

2. Manage the service

# Start and enable the service
sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
sudo systemctl status llama.service

# Follow the logs
sudo journalctl -u llama.service -f
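
`systemctl status` reports the process as active as soon as it starts, but the model can take a while to load before the server answers requests. The sketch below polls an HTTP endpoint until it returns 200. It assumes the llama.cpp server's `/health` route on the host and port used in this guide, and `wait_until_ready` is a hypothetical helper.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url="http://127.0.0.1:8080/health", timeout_s=120, interval_s=2.0):
    """Poll `url` until it answers HTTP 200; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, or an error status while the model is still loading
            pass
        time.sleep(interval_s)
    return False
```

A deployment script could call `wait_until_ready()` right after `systemctl start` and fail loudly when it returns `False`.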

Monitoring and Performance Tuning

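1. System monitoring script

#!/bin/bash
# monitor.sh: print memory, GPU, and CPU usage every 30 seconds
while true; do
    echo "=== Memory Usage ==="
    free -h
    echo "=== GPU Usage ==="
    # No -l flag here: it would make nvidia-smi loop forever and block the rest of the script
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv
    echo "=== CPU Usage ==="
    top -b -n 1 | head -20
    sleep 30
done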
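
For logging or alerting, the same GPU query is easier to work with as structured data. The sketch below runs `nvidia-smi` with `--format=csv,noheader,nounits` and parses each line into a dictionary. It assumes every field is a plain integer. Some GPUs report `[N/A]` for certain fields, which this sketch does not handle. The function names are hypothetical.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse 'util, used, total' lines (one per GPU) into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on a machine with the NVIDIA driver installed returns one entry per GPU, with memory in MiB.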

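2. Performance tuning parameters

# Tuned launch command
# --n-gpu-layers: layers offloaded to the GPU; --ctx-size: context window in tokens
./build/bin/llama-server -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
      --host 127.0.0.1 \
      --port 8080 \
      --n-gpu-layers 30 \
      --ctx-size 8192 \
      --temp 0.7 \
      --top-p 0.9 \
      --n-predict 100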
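
The `--ctx-size` value has a direct memory cost because the KV cache grows linearly with the context length. The sketch below estimates that cost. The defaults assume Mistral-7B's published architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. Other models need their own values, and the function name is hypothetical.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    Each layer stores two tensors (K and V) of shape n_ctx x n_kv_heads x head_dim.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


for n_ctx in (4096, 8192):
    print(n_ctx, kv_cache_bytes(n_ctx) / 2**30, "GiB")
```

Under these assumptions a context of 8192 tokens needs about 1 GiB for the cache on top of the model weights, and halving the context halves that figure.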

3. Benchmark test

# Benchmark command (timing statistics are printed when the run finishes)
./build/bin/llama-cli -m ~/models/mistral-7b-v0.1.Q4_K_M.gguf \
      -p "Explain quantum computing in simple terms." \
      -n 100

# Example output (simplified)
# Time for prompt:  18.25 ms
# Time for completion:  124.82 ms
# Total tokens: 100
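
Raw timings are easier to compare across models and settings once they are converted to tokens per second. A minimal sketch, with a hypothetical function name, applied to the sample figures above:

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / (elapsed_ms / 1000.0)


print(f"{tokens_per_second(100, 124.82):.1f} tokens/s")
```

Run the benchmark several times and compare the averages, since the first run includes model loading and cache warm-up.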

Real Command Examples

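1. Model conversion and optimization

# Convert an HF model to an FP16 GGUF, then quantize it to Q4_K_M
python3 convert_hf_to_gguf.py /home/ubuntu/models/Llama-3-8B \
    --outtype f16 \
    --outfile llama3-8b.f16.gguf
./build/bin/llama-quantize llama3-8b.f16.gguf llama3-8b.Q4_K_M.gguf Q4_K_M

# Reduce memory usage: offload fewer layers and use a smaller context
./build/bin/llama-cli -m llama3-8b.Q4_K_M.gguf --n-gpu-layers 28 --ctx-size 4096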

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
