matias yoon

Posted on May 25

Local LLM Setup Guide (v25)

#ai #llm #developers #tutorial

Overview & Prerequisites

Running LLMs locally takes some hardware planning. For most developers, 16GB of RAM is the practical minimum. An NVIDIA GPU with 8GB or more of VRAM lets you run larger models at usable speeds. Without GPU acceleration, inference still works but is significantly slower.

Hardware Requirements:

CPU: Modern x86_64 (Intel/AMD) with 4+ cores
RAM: 16GB minimum (32GB recommended)
Storage: 50GB+ free space for models
GPU: NVIDIA 8GB+ VRAM optional but recommended
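
A quick way to compare a machine against the list above is a small script. The following is a minimal sketch, not part of any tool mentioned in this guide: the function names are hypothetical, the thresholds are copied from the list, and the RAM lookup assumes Linux (it relies on `os.sysconf`).

```python
import os
import shutil

# Thresholds copied from the hardware list above.
MIN_RAM_GB = 16
MIN_CORES = 4
MIN_FREE_DISK_GB = 50


def check_requirements(ram_gb, cores, free_disk_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.1f} GB found, {MIN_RAM_GB} GB needed")
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores found, {MIN_CORES} needed")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"Disk: {free_disk_gb:.1f} GB free, {MIN_FREE_DISK_GB} GB needed")
    return problems


def read_system_specs(path="/"):
    """Read RAM, core count, and free disk space (Linux only).

    Sizes are in decimal GB (10^9 bytes), because the kernel reports
    slightly less than the nominal amount of installed RAM.
    """
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    return ram_bytes / 1e9, os.cpu_count() or 1, free_bytes / 1e9
```

Calling `check_requirements(*read_system_specs())` returns the list of unmet requirements for the current machine. GPU detection is left out on purpose, since the GPU is optional.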

Prerequisites:

# Install required packages
sudo apt update
sudo apt install -y git cmake build-essential python3-pip curl

Framework Comparison

Framework	GPU Support	Ease of Use	Performance	Best For
llama.cpp	Yes	Easy	Fast	Quick prototyping
Ollama	Yes	Very Easy	Fast	Development workflows
vLLM	Yes	Medium	Extremely Fast	Production inference
LocalAI	Yes	Easy	Fast	API-first applications

Recommendation: Ollama plus llama.cpp. Together they offer the best balance of ease and performance for most use cases.

Step-by-Step Installation

Install Ollama

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify installation
ollama --version

Install llama.cpp (for advanced use)

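# Clone and build (current llama.cpp builds with CMake; the Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Verify build (the CLI binary is llama-cli; older releases called it main)
./build/bin/llama-cli --help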

Model Selection Guide

Choose based on your use case:

Code Generation:

ollama pull codellama:7b-instruct
ollama pull wizardcoder:15b

General Purpose:

ollama pull llama3:8b
ollama pull mistral:7b

Small/Edge:

ollama pull phi3:3.8b
ollama pull tinyllama:1.1b

Quantization Types Explained

Quantization stores model weights at lower precision, which reduces file size and memory use at some cost in output quality:

Q4_K_M: 4-bit, good balance of size and quality
Q5_K_M: 5-bit, better quality than Q4
Q8_0: 8-bit, minimal quality loss
F16: 16-bit floating point, unquantized (largest size)
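
To get a feel for what each type means on disk, you can estimate file size as parameters times bits per weight. This is a rough sketch under stated assumptions: the bits-per-weight values are approximate averages I am assuming for illustration (K-quants mix several block types), and the function name is hypothetical. Real files vary by model architecture.

```python
# Approximate bits per weight for each format. These are assumed averages
# for illustration, not exact values for any specific model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(n_params, quant):
    """Estimate model file size in GB (10^9 bytes) for n_params weights."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


def print_size_table(n_params=8e9):
    """Print one estimate per quantization type for a model of n_params weights."""
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(n_params, quant):.1f} GB")
```

Calling `print_size_table()` prints the estimates for an 8B model. Leave a couple of GB of headroom on top of the file size, since the context cache also needs memory at runtime.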

Example quantization command:

# Quantize an F16 GGUF to Q4_K_M (CMake builds place the binary in build/bin/)
./build/bin/llama-quantize models/llama-3-8b.F16.gguf models/llama-3-8b.Q4_K_M.gguf Q4_K_M

API Setup and Integration

Start Ollama API

# Start Ollama server (skip this if the systemd service is already running, since the port will be in use)
ollama serve &

# Verify API is running
curl http://localhost:11434/api/tags
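
The same check can be done from Python. The sketch below uses only the standard library and assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field. The helper names are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_local_models(base_url="http://localhost:11434"):
    """Ask a running Ollama server which models are installed locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return model_names(json.load(response))
```

`list_local_models()` raises an `OSError` subclass if the server is not reachable, which makes it usable as a simple liveness check in scripts.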

Integration Example with Python

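import requests

OLLAMA_URL = 'http://localhost:11434/api/generate'

def query_llm(prompt, model='llama3:8b'):
    """Send a prompt to the local Ollama server and return the generated text."""
    response = requests.post(
        OLLAMA_URL,
        json={
            'model': model,
            'prompt': prompt,
            'stream': False  # return one JSON object instead of a stream
        },
        timeout=120  # local inference can be slow, especially on CPU
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()['response']

# Usage
result = query_llm("Explain quantum computing in simple terms")
print(result)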
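
For long answers you may prefer to print text as it is generated. With `"stream": true`, the generate endpoint returns one JSON object per line, each with a `response` fragment and a `done` flag. Here is a minimal sketch of consuming that stream. It uses the standard library instead of `requests` to avoid an extra dependency, and the function names are hypothetical.

```python
import json
import urllib.request


def iter_tokens(lines):
    """Yield text fragments from an iterable of newline-delimited JSON chunks."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk marks the end of generation


def stream_llm(prompt, model="llama3:8b", base_url="http://localhost:11434"):
    """Stream a completion from a running Ollama server, fragment by fragment."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        # The response object iterates line by line, one JSON chunk per line.
        yield from iter_tokens(response)
```

Typical usage is a loop such as `for piece in stream_llm("Explain GGUF"): print(piece, end="", flush=True)`. Keeping the parser separate from the network call makes it easy to test without a server.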

Integration with VS Code

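Add to VS Code settings.json. The exact keys depend on the extension you use, so treat the ones below as placeholders and check your extension's documentation: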

{
    "llm.server.url": "http://localhost:11434",
    "llm.model": "llama3:8b"
}

Systemd Service for 24/7 Operation

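Create service file (the Ollama install script normally creates this unit already, so this step is only needed for a manual install or to customize it):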

sudo nano /etc/systemd/system/ollama.service

Content:

[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
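# Replace with the account that should run Ollama
User=developer
# Adjust to the output of `which ollama` (the install script commonly uses /usr/local/bin)
ExecStart=/usr/bin/ollama serve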
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
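
After a restart the server needs a moment before it accepts requests, so scripts that depend on it should wait for it. The following is a sketch of a simple readiness poll. The function names and the retry defaults are my own assumptions, and it treats a successful `/api/tags` response as "ready".

```python
import time
import urllib.request


def server_is_up(base_url="http://localhost:11434"):
    """Return True if the Ollama API answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors.
        return False


def wait_until_ready(probe=server_is_up, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deployment script can call `wait_until_ready()` right after `systemctl start ollama` and exit with an error if it returns `False`. The `probe` and `sleep` parameters exist so the loop can be exercised without a real server.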

Monitoring and Performance Tuning

Monitor GPU Usage

# For NVIDIA GPUs
nvidia-smi -l 1

# For AMD GPUs
rocm-smi
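
If you want GPU memory numbers inside a script rather than on screen, `nvidia-smi` can emit CSV. The sketch below assumes an NVIDIA GPU with `nvidia-smi` on the PATH. It uses the `--query-gpu` and `--format=csv,noheader,nounits` options, and the helper names are hypothetical.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return memory usage for every visible GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(output)
```

Logging `query_gpu_memory()` before and after loading a model shows roughly how much VRAM that model occupies.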

Memory Monitoring

# Monitor memory usage
watch -n 1 free -h

# Check Ollama process
ps aux | grep ollama
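
For scripted checks, the numbers behind `free -h` come from `/proc/meminfo`. Here is a minimal, Linux-only sketch that reads the `MemAvailable` field, which is the kernel's estimate of memory that can be used without swapping. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Return {field: value in KiB} parsed from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(path="/proc/meminfo"):
    """Return the memory currently available to new processes, in GiB."""
    with open(path) as meminfo:
        return parse_meminfo(meminfo.read())["MemAvailable"] / 1024**2
```

Comparing `available_gib()` with the size of the model you plan to load is a cheap guard against pushing the machine into swap.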

Performance Optimization

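# Bind address and port are set through the OLLAMA_HOST environment variable
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Thread count is a per-request model option, e.g. "options": {"num_thread": 8} in the API body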
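
Thread count can also be controlled per request through the `options` field of the generate API, using the `num_thread` model option. The sketch below shows one way to do that. The function names are hypothetical and the default of 8 threads is only an example, so match it to your physical core count.

```python
import json
import urllib.request


def build_generate_payload(prompt, model="llama3:8b", num_thread=8):
    """Build a non-streaming /api/generate body with an explicit thread count."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_thread": num_thread},
    }


def generate(prompt, base_url="http://localhost:11434", **kwargs):
    """POST the payload to a running Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, **kwargs)).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["response"]
```

For example, `generate("Summarize GGUF in one sentence", num_thread=4)` runs the request with four threads. Setting the option per request lets you compare thread counts without restarting the server.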

Benchmark Example

# Test inference speed with llama.cpp's benchmark tool (10 repetitions)
./build/bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf -r 10

Real Command Examples

Complete Setup Script

#!/bin/bash
# install-local-llm.sh
set -euo pipefail  # stop on the first failed command

# Install dependencies
sudo apt update && sudo apt install -y git cmake build-essential python3-pip

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama
sudo systemctl start ollama
sudo systemctl enable ollama

# Pull recommended models
ollama pull llama3:8b
ollama pull mistral:7b
ollama pull phi3:3.8b

echo "Setup complete! Run 'ollama list' to verify models."

Model Testing

# Test with simple prompt
ollama run llama3:8b "What is the capital of France?"

# Test with a longer prompt (the CLI streams output by default)
ollama run mistral:7b "Explain neural networks in 3 paragraphs"

# Benchmark performance (--verbose prints token counts and timing after the response)
ollama run --verbose llama3:8b "Count from 1 to 100 in Python"

API Usage Example

# Direct API call
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "prompt": "Write a 50-word summary of quantum computing",
    "stream": false
  }'
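
The non-streaming JSON response also includes timing fields that make a handy benchmark. Here is a small sketch that turns them into tokens per second, assuming the response carries `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds). The function name is hypothetical.

```python
def tokens_per_second(result):
    """Compute generation speed from a decoded /api/generate response."""
    duration_ns = result.get("eval_duration", 0)
    if not duration_ns:
        return 0.0  # timing fields missing or zero, so no speed can be derived
    # eval_duration is in nanoseconds, so scale to seconds.
    return result.get("eval_count", 0) * 1e9 / duration_ns
```

Pass it the parsed JSON from the call above, for example `tokens_per_second(response.json())` when using `requests`. Run the same prompt a few times and average the results, since the first request also pays the cost of loading the model.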

Quick Reference

Common Commands:

# List models
ollama list

# Run model interactively
ollama run llama3:8b

# Start server in background
ollama serve &

# Stop server (systemd service; for a background job started with &, use `kill %1`)
sudo systemctl stop ollama

Model Size Comparison:

Q4_K_M: ~5GB per 8B model
Q5_K_M: ~6GB per 8B model
Q8_0: ~8.5GB per 8B model
F16: ~16GB per 8B model

This guide provides practical, production-ready setup for local LLMs without unnecessary complexity. All commands tested on Ubuntu 22.04 with NVIDIA RTX 3060.

📥 Get the full guide on Gumroad: https://gumroad.com/l/auto ($7)
