llama.cpp is the engine that powers most local AI inference. Written in plain C/C++ with no heavyweight dependencies, it runs LLMs efficiently on both CPU and GPU, and it ships with a built-in HTTP server that exposes an OpenAI-compatible API.
What Makes llama.cpp Special?
- Pure C/C++ — no Python, no PyTorch dependencies
- CPU optimized — AVX, AVX2, AVX-512 support
- Quantization — run 70B models on consumer hardware
- Metal/CUDA/Vulkan — GPU acceleration on all platforms
- Built-in server — OpenAI-compatible HTTP API
The Hidden API: Built-in Server
# Start the server
./llama-server -m models/llama-3-8b.gguf \
    --host 0.0.0.0 --port 8080 \
    --n-gpu-layers 35 \
    --ctx-size 4096
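Before sending requests, it helps to confirm the server has finished loading the model. llama-server exposes a GET /health route for exactly this; the polling helper below (`wait_for_server` is my own name, not part of llama.cpp) is one way to wait for it:

```python
# Sketch: poll llama-server's /health endpoint until the model is loaded.
# wait_for_server is a hypothetical helper name, not part of llama.cpp itself.
import time
import requests

def wait_for_server(base_url: str, timeout: float = 60.0) -> bool:
    """Return True once GET /health answers 200, or False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet; retry
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("ready:", wait_for_server("http://localhost:8080"))
```

Large models can take tens of seconds to load, so a generous timeout is worth it.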
import requests
# Chat completion API
response = requests.post('http://localhost:8080/v1/chat/completions', json={
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain Docker briefly.'}
    ],
    'temperature': 0.7,
    'max_tokens': 200
})
print(response.json()['choices'][0]['message']['content'])
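The same endpoint also supports streaming: pass 'stream': True and the server replies with server-sent events, one "data: {...}" line per token chunk, ending with "data: [DONE]". A minimal sketch (the `parse_sse_line` helper is my own, not part of any library):

```python
# Streaming sketch for the OpenAI-compatible endpoint.
# parse_sse_line is a hypothetical helper for pulling text out of SSE lines.
import json
import requests

def parse_sse_line(line: str):
    """Return the delta text from one 'data: {...}' SSE line, else None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

if __name__ == "__main__":
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": "Explain Docker briefly."}],
            "stream": True,
        },
        stream=True,
    )
    for raw in resp.iter_lines(decode_unicode=True):
        text = parse_sse_line(raw) if raw else None
        if text:
            print(text, end="", flush=True)
```

Streaming is what makes the token-by-token "typing" effect possible in chat UIs.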
# Completion API
response = requests.post('http://localhost:8080/completion', json={
    'prompt': 'The quick brown fox',
    'n_predict': 50,
    'temperature': 0.8
})
print(response.json()['content'])
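The /completion endpoint can also constrain output with a GBNF grammar via the 'grammar' field. As a toy sketch, the grammar below forces the model to answer only "yes" or "no" (the `build_payload` helper is my own illustration):

```python
# Sketch: grammar-constrained generation via /completion's 'grammar' field.
# build_payload is a hypothetical helper, not part of llama.cpp.
import requests

# GBNF grammar allowing exactly "yes" or "no"
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'

def build_payload(prompt: str) -> dict:
    """Assemble a /completion body whose output is restricted by the grammar."""
    return {"prompt": prompt, "n_predict": 4, "grammar": YES_NO_GRAMMAR}

if __name__ == "__main__":
    r = requests.post("http://localhost:8080/completion",
                      json=build_payload("Is the sky blue? Answer: "))
    print(r.json()["content"])
```

Grammars are how llama.cpp guarantees structured output (JSON, enums, etc.) without any post-hoc parsing.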
# Embeddings
response = requests.post('http://localhost:8080/embedding', json={
    'content': 'What is machine learning?'
})
print(f"Dimensions: {len(response.json()['embedding'])}")
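Once you have embeddings, semantic similarity is just cosine similarity between vectors. A small sketch (note: llama-server must be launched with the embeddings flag for this route to be enabled; `embed` and `cosine` are my own helper names):

```python
# Sketch: semantic similarity from llama-server embeddings.
# embed and cosine are hypothetical helpers, not part of llama.cpp.
import math
import requests

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embed(text: str):
    """Fetch an embedding vector from the local server."""
    r = requests.post("http://localhost:8080/embedding", json={"content": text})
    return r.json()["embedding"]

if __name__ == "__main__":
    score = cosine(embed("What is machine learning?"),
                   embed("Explain ML to a beginner."))
    print(f"similarity: {score:.3f}")
```

This is the core building block for local semantic search and RAG pipelines.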
# Tokenize
response = requests.post('http://localhost:8080/tokenize', json={
    'content': 'Hello World'
})
print(f"Tokens: {response.json()['tokens']}")
Model Quantization API
# Quantize models for efficient inference
./llama-quantize models/llama-3-8b-f16.gguf models/llama-3-8b-q4_k_m.gguf Q4_K_M
# Quantization levels:
# Q2_K — smallest, lowest quality
# Q4_K_M — good balance (recommended)
# Q5_K_M — higher quality
# Q8_0 — near full precision
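To pick a quantization level, a back-of-envelope size estimate helps. The sketch below multiplies parameter count by approximate effective bits per weight (the numbers are my rough estimates, including quantization scales; real GGUF files add a few percent of metadata overhead):

```python
# Rough size estimator for quantized GGUF files.
# Bits-per-weight values are approximations, not official llama.cpp figures.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}

def est_gib(params_billion: float, quant: str) -> float:
    """Estimated file size in GiB: params * bits-per-weight / 8 bits per byte."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 2**30

if __name__ == "__main__":
    for q in BITS_PER_WEIGHT:
        print(f"{q:7s} ~{est_gib(8, q):5.1f} GiB for an 8B model")
```

For an 8B model this puts Q4_K_M around 4-5 GiB versus roughly 15 GiB at F16, which is why Q4_K_M fits comfortably in consumer VRAM.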
Quick Start
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build -j
./build/bin/llama-server -m model.gguf --port 8080
Why llama.cpp Powers Everything
A developer shared: "Ollama, LM Studio, Jan, GPT4All — they all use llama.cpp under the hood. Running it directly gives you the most control. I serve Llama 3 70B quantized to Q4 on a single RTX 4090, getting 30 tokens/sec."
Running local AI? Email spinov001@gmail.com or check my AI tools.
llama.cpp directly or through Ollama? What's your setup?