DEV Community

Alex Spinov
llama.cpp Has a Free API You've Never Heard Of

llama.cpp is the engine that powers most local AI inference. Written in C/C++, it runs LLMs on CPU and GPU with incredible efficiency. And it includes a built-in HTTP server with an OpenAI-compatible API.

What Makes llama.cpp Special?

  • Pure C/C++ — no Python, no PyTorch dependencies
  • CPU optimized — AVX, AVX2, AVX-512 support
  • Quantization — run 70B models on consumer hardware
  • Metal/CUDA/Vulkan — GPU acceleration on all platforms
  • Built-in server — OpenAI-compatible HTTP API

The Hidden API: Built-in Server

```bash
# Start the server (adjust --n-gpu-layers to fit your VRAM)
./llama-server -m models/llama-3-8b.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 35 \
  --ctx-size 4096
```
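Before sending requests, it helps to wait for the model to finish loading. llama-server exposes a `GET /health` endpoint for this. A minimal readiness poll, assuming the default host and port from the command above:

```python
# Sketch: wait for llama-server to finish loading the model.
# /health returns {"status": "ok"} once ready (503 while still loading).
import time
import requests

def is_ready(body: dict) -> bool:
    # Pure check on the /health response body
    return body.get("status") == "ok"

def wait_for_server(url="http://localhost:8080/health", timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(url, timeout=2)
            if r.ok and is_ready(r.json()):
                return True
        except requests.ConnectionError:
            pass  # server process not up yet
        time.sleep(1)
    return False

if __name__ == "__main__":
    print("ready" if wait_for_server() else "timed out")
```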
```python
import requests

# Chat completion API (OpenAI-compatible)
response = requests.post('http://localhost:8080/v1/chat/completions', json={
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain Docker briefly.'}
    ],
    'temperature': 0.7,
    'max_tokens': 200
})
print(response.json()['choices'][0]['message']['content'])

# Native completion API
response = requests.post('http://localhost:8080/completion', json={
    'prompt': 'The quick brown fox',
    'n_predict': 50,
    'temperature': 0.8
})
print(response.json()['content'])

# Embeddings (start the server with --embedding to enable this endpoint)
response = requests.post('http://localhost:8080/embedding', json={
    'content': 'What is machine learning?'
})
print(f"Dimensions: {len(response.json()['embedding'])}")

# Tokenize
response = requests.post('http://localhost:8080/tokenize', json={
    'content': 'Hello World'
})
print(f"Tokens: {response.json()['tokens']}")
```
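The OpenAI-compatible endpoint also supports streaming: pass `"stream": true` and the server sends Server-Sent Events, one `data: ` line per token chunk, ending with `data: [DONE]`. A sketch, assuming the same local server as above:

```python
# Sketch: streaming tokens from the chat endpoint via SSE.
import json
import requests

def parse_sse_line(line: str):
    """Return the decoded JSON payload of one SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    return json.loads(payload)

def stream_chat(prompt, url="http://localhost:8080/v1/chat/completions"):
    with requests.post(url, json={
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }, stream=True) as r:
        for raw in r.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(raw) if raw else None
            if chunk:
                # Each chunk carries an incremental delta, not the full text
                delta = chunk["choices"][0]["delta"].get("content", "")
                print(delta, end="", flush=True)

if __name__ == "__main__":
    stream_chat("Explain Docker briefly.")
```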

Model Quantization API

```bash
# Quantize a model for efficient inference
./llama-quantize models/llama-3-8b-f16.gguf models/llama-3-8b-q4_k_m.gguf Q4_K_M

# Quantization levels:
# Q2_K   — smallest, lowest quality
# Q4_K_M — good balance (recommended)
# Q5_K_M — higher quality
# Q8_0   — near full precision
```
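To see why Q4_K_M is the usual recommendation, it helps to do the arithmetic. A back-of-the-envelope size estimate for an 8B-parameter model; the bits-per-weight figures below are rough approximations, not exact GGUF sizes:

```python
# Sketch: approximate file sizes per quantization level.
# Bits-per-weight values are ballpark figures (K-quants mix block sizes).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 3.4,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    # params * bits-per-weight / 8 bits-per-byte, reported in GB
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{approx_size_gb(8e9, quant):.1f} GB")
```

By this estimate an 8B model drops from ~16 GB at F16 to roughly 5 GB at Q4_K_M, which is what makes large models fit on consumer GPUs.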

Quick Start

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent releases build with CMake (the old Makefile build was removed)
cmake -B build && cmake --build build --config Release -j
./build/bin/llama-server -m model.gguf --port 8080
```

Why llama.cpp Powers Everything

A developer shared: "Ollama, LM Studio, Jan, GPT4All — they all use llama.cpp under the hood. Running it directly gives you the most control. I serve Llama 3 70B quantized to Q4 on a single RTX 4090, getting 30 tokens/sec."


Running local AI? Email spinov001@gmail.com or check my AI tools.

llama.cpp directly or through Ollama? What's your setup?
