matias yoon

Posted on May 26

Local LLM Setup Guide (v43)

#ai #llm #developers #tutorial

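1. Overview and Prerequisites

Running an LLM (large language model) locally suits developers who want data to stay on their own machines, lower costs, and low-latency responses. This guide assumes a Linux system with an NVIDIA GPU.

Prerequisites:

Linux (Ubuntu 22.04 or later recommended)
NVIDIA GPU (CUDA 11.8 or later)
At least 16GB RAM (32GB or more recommended)
At least 100GB of free disk space

Tools used:

# Check system information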
nvidia-smi
free -h
lscpu
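
The same checks can be scripted. The following is a minimal sketch, assuming a Linux host where /proc/meminfo is readable; the function names and the choice to check free space on / are illustrative, and the thresholds are the ones from the prerequisites list.

```python
import shutil

# Thresholds taken from the prerequisites list above.
MIN_RAM_GB = 16
MIN_DISK_GB = 100


def read_total_ram_gb(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB by parsing the MemTotal line (value is in kB)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def unmet_requirements(ram_gb, free_disk_gb, has_nvidia_smi):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.1f} GB is below {MIN_RAM_GB} GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.1f} GB is below {MIN_DISK_GB} GB")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH")
    return problems


if __name__ == "__main__":
    try:
        ram = read_total_ram_gb()
    except OSError:
        ram = 0.0  # /proc/meminfo unavailable (not Linux); reported as a failed check
    disk = shutil.disk_usage("/").free / (1024 ** 3)
    gpu_tool = shutil.which("nvidia-smi") is not None
    for line in unmet_requirements(ram, disk, gpu_tool) or ["all checks passed"]:
        print(line)
```

Finding nvidia-smi on PATH only shows that the driver tools are installed; the CUDA version still has to be read from the nvidia-smi output itself.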

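2. Framework Comparison

Framework	Pros	Cons	Best for
llama.cpp	Fast, minimal dependencies	Slower when running CPU-only	Quick tests
Ollama	Easy install, GUI support	High resource usage	Development testing
vLLM	Very fast inference	Requires advanced configuration	Real-time applications
LocalAI	API compatibility, multiple models	Complex setup	Enterprise

Top recommendation: the vLLM + llama.cpp combination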

3. Installation Guide (vLLM + llama.cpp)

Installing vLLM:

# Set up a Python environment
python3 -m venv vllm-env
source vllm-env/bin/activate
pip install vllm

# Verify GPU support
python3 -c "import torch; print(torch.cuda.is_available())"

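Installing llama.cpp:

# Clone the source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with make (CPU build; newer releases build with CMake only)
make
# To disable host-specific CPU optimizations: make LLAMA_NATIVE=0

# Or build with CMake and CUDA enabled
# (newer releases use -DGGML_CUDA=ON; older ones used -DLLAMA_CUBLAS=ON)
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j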

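4. Model Selection Guide

Use cases by model:

Qwen2-7B (128K context)

Code generation, document summarization
Inference speed: 40 tokens/sec
Recommended RAM: 32GB

GLM-4-9B

Document processing, question answering
Inference speed: 35 tokens/sec
Recommended RAM: 16GB

MiniMax-12B

Complex reasoning tasks
Inference speed: 30 tokens/sec
Recommended RAM: 24GB

# Example model download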
wget https://huggingface.co/Qwen/Qwen2-7B-GGUF/resolve/main/qwen2-7b-q4_k_m.gguf

5. Quantization Types Explained

Type	Quality	Size	Use case
Q4_K_M	95%	3.8GB	General applications
Q5_K_M	98%	4.8GB	Precise inference
Q6_K	99%	5.5GB	High-precision models
Q8_0	100%	7.5GB	Near-original quality

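# Convert to GGUF (f16), then quantize with llama.cpp
# (newer releases rename these tools to convert_hf_to_gguf.py and llama-quantize)
python3 convert-hf-to-gguf.py ./models/Qwen2-7B --outfile ./models/qwen2-7b-f16.gguf --outtype f16
./quantize ./models/qwen2-7b-f16.gguf ./models/qwen2-7b-q4_k_m.gguf Q4_K_M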
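
The sizes in the table can be sanity-checked with simple arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes a model with exactly 7 billion parameters. The bits-per-weight values for Q8_0 and Q6_K follow from their block layouts, while the Q4_K_M and Q5_K_M values are rough assumptions, since those mixes vary by model.

```python
# Approximate bits per weight for each quantization type.
# Q8_0 and Q6_K come from their block layouts; the two "_M" mixes are
# rough assumptions and differ somewhat from model to model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}


def estimate_size_gb(n_params, quant):
    """Estimate file size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {estimate_size_gb(7e9, quant):.2f} GB")
```

The estimates land within a few hundred megabytes of the table. File metadata and the model's real parameter count account for some of the difference.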

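6. API Setup and Integration

vLLM API server:

# Start the OpenAI-compatible server (it provides /v1/completions)
# --tensor-parallel-size 2 requires two GPUs; drop it on a single-GPU machine
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2-7B \
    --tensor-parallel-size 2

# Test (the "model" field must match the name passed to --model)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2-7B",
        "prompt": "Explain an example that uses a for loop in Python.",
        "max_tokens": 100,
        "temperature": 0.7
    }'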

llama.cpp API:

# Run the server (the binary is named llama-server in newer releases)
./server -m ./models/qwen2-7b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --threads 8 \
    --ctx-size 8192

# Call the API
curl http://localhost:8001/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "Code generation example:",
        "n_predict": 50
    }'
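
The same llama.cpp call can be made from Python with only the standard library. This is a sketch, assuming the server above is listening on localhost:8001 and returns the generated text in the "content" field of its JSON response. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: the llama.cpp server from the example above listens here.
LLAMA_URL = "http://localhost:8001/completion"


def build_payload(prompt, n_predict=50):
    """Build the JSON body for /completion as UTF-8 bytes."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")


def extract_text(response_body):
    """Return the generated text from a /completion response body."""
    return json.loads(response_body)["content"]


def complete(prompt, n_predict=50, url=LLAMA_URL, timeout=60):
    """POST a prompt to the server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(prompt, n_predict),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(response.read())
```

With the server running, `print(complete("Code generation example:"))` should print the model's continuation.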

7. Systemd Service Setup

/etc/systemd/system/llm-server.service

[Unit]
Description=LLM Server Service
After=network.target

[Service]
Type=simple
User=developer
WorkingDirectory=/home/developer/llm-app
ExecStart=/home/developer/llm-app/run.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

run.sh

#!/bin/bash
cd /home/developer/llm-app
source vllm-env/bin/activate
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2-7B \
    --tensor-parallel-size 2

# Register the service
sudo systemctl daemon-reload
sudo systemctl enable llm-server
sudo systemctl start llm-server
sudo systemctl status llm-server
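
systemctl status only shows that the process started. Loading the model can take a while, so a script that depends on the service may want to wait until the port actually accepts connections. A minimal sketch follows. The helper name and default timings are illustrative, and an open port does not by itself guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return True once a TCP connection succeeds, False if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the service above, `wait_for_port("localhost", 8000)` would block for up to two minutes before giving up.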

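8. Monitoring and Performance Tuning

Checking performance:

# GPU usage
nvidia-smi -l 1

# CPU usage
htop

# API response time (the endpoint only accepts POST, so send a small request body)
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2-7B", "prompt": "Hello", "max_tokens": 16}'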

curl-format.txt

    time_namelookup:  %{time_namelookup}\n
       time_connect:  %{time_connect}\n
    time_appconnect:  %{time_appconnect}\n
   time_pretransfer:  %{time_pretransfer}\n
      time_redirect:  %{time_redirect}\n
 time_starttransfer:  %{time_starttransfer}\n
                    ----------\n
         time_total:  %{time_total}\n

Memory optimization:

# vLLM tuning options
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model Qwen/Qwen2-7B \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8 \
    --swap-space 8
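
To see why --max-model-len matters for memory, it helps to estimate the key/value cache for one full-length sequence. The sketch below uses the standard formula of 2 (keys and values) x layers x KV heads x head dimension x tokens x bytes per value. The Qwen2-7B shape used here (28 layers, 4 KV heads, head dimension 128, fp16 cache) is an assumption to check against the model's config.json.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


if __name__ == "__main__":
    # Assumed Qwen2-7B shape: 28 layers, 4 KV heads, head dimension 128, fp16 values.
    per_sequence = kv_cache_bytes(28, 4, 128, num_tokens=8192)
    print(f"{per_sequence / 2**20:.0f} MiB per 8192-token sequence")  # prints 448 MiB
```

Under these assumptions each concurrent 8192-token sequence needs about 448 MiB of cache, so halving --max-model-len halves the worst-case cache per sequence.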

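9. Practical Command Examples

1. Build and run the model:

# Build llama.cpp, then convert and quantize the model
cd llama.cpp
make clean
make LLAMA_NATIVE=1
python3 convert-hf-to-gguf.py ./models/Qwen2-7B --outfile ./models/qwen2-7b-f16.gguf --outtype f16
./quantize ./models/qwen2-7b-f16.gguf ./models/qwen2-7b-q4_k_m.gguf Q4_K_M

# Run the server
./server -m ./models/qwen2-7b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --ctx-size 8192 \
    --threads 8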

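2. Performance test:

# Single-request test (llama.cpp's /completion takes n_predict, not max_tokens)
curl -H "Content-Type: application/json" \
     -d '{"prompt": "Explain how to create a dictionary in Python.", "n_predict": 50}' \
     http://localhost:8001/completion

# Concurrent test: 100 POST requests, 10 at a time
# (payload.json is a placeholder file holding the JSON body shown above)
ab -n 100 -c 10 -p payload.json -T application/json http://localhost:8001/completion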
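
If ab is not installed, a similar concurrent test can be sketched in Python with a thread pool. This is an illustrative harness, not a replacement for a real benchmarking tool. send_request is a hypothetical zero-argument callable that performs one POST to the server, and the p95 value is a simple nearest-rank estimate.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run send_request once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def run_load_test(send_request, total=100, concurrency=10):
    """Issue `total` calls with up to `concurrency` in flight; return latency stats."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, [send_request] * total))
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "p50": statistics.median(latencies),
        "p95": latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))],
    }
```

The defaults mirror the ab invocation above (100 requests, 10 at a time). Any exception raised by send_request propagates out of run_load_test, so failed requests are not silently counted.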

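3. API integration example:

# Call the API from Python
import requests
import json

def call_llm(prompt):
    response = requests.post(
        'http://localhost:8000/v1/completions',
        headers={'Content-Type': 'application/json'},
        data=json.dumps({
            'model': 'Qwen/Qwen2-7B',
            'prompt': prompt,
            # The original listing is cut off here; the remaining lines are an
            # assumed completion that mirrors the curl test in section 6.
            'max_tokens': 100,
            'temperature': 0.7
        }),
        timeout=60
    )
    response.raise_for_status()
    return response.json()['choices'][0]['text']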

---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
