vLLM is a high-throughput LLM serving engine. Using PagedAttention and continuous batching, it delivers up to 24x the throughput of Hugging Face Transformers, per the original vLLM benchmarks.
What Is vLLM?
vLLM is an open-source library for fast LLM inference and serving. It uses PagedAttention to efficiently manage GPU memory.
Features:
- Up to 24x higher throughput than HF Transformers
- OpenAI-compatible API
- PagedAttention for memory efficiency
- Continuous batching
- Tensor/pipeline parallelism
- LoRA support
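These features are also available without the HTTP server, through vLLM's offline Python API. A minimal sketch of batch generation (the helper name is mine; running it requires a CUDA-capable machine with vllm installed, so the import is kept inside the function):

```python
def run_offline_batch(prompts, model="meta-llama/Llama-3.2-3B-Instruct"):
    """Generate one completion per prompt with vLLM's offline API."""
    # Imported lazily: vllm needs a GPU environment to initialize.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)  # PagedAttention and continuous batching are built in
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Because vLLM batches internally, passing all prompts in one `generate` call is faster than looping one prompt at a time.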
Quick Start
pip install vllm
# Start server
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
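`vllm serve` also takes flags for scaling and memory control; a configuration sketch with commonly used options (the values here are illustrative, not recommendations):

```shell
# Spread the model across 2 GPUs, cap memory use, and limit context length
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```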
OpenAI-Compatible API
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","messages":[{"role":"user","content":"What is Docker?"}]}'
# Completions
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","prompt":"Python is","max_tokens":50}'
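The same completions call can be made from Python with only the standard library. A sketch (endpoint, model, and payload copied from the curl example; the helper name is mine):

```python
import json
import urllib.request

def build_completion_request(prompt, model="meta-llama/Llama-3.2-3B-Instruct",
                             base_url="http://localhost:8000/v1", max_tokens=50):
    """Build a POST request for vLLM's /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send it like this:
# with urllib.request.urlopen(build_completion_request("Python is")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```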
Use with OpenAI SDK
from openai import OpenAI
# The API key is ignored unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Explain Kubernetes"}]
)
print(response.choices[0].message.content)
Use Cases
- LLM serving — production-grade inference
- Cost reduction — serve more with fewer GPUs
- Self-hosted AI — private LLM deployment
- Batch inference — process large datasets
- API gateway — OpenAI-compatible endpoint
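For the batch-inference use case, a common pattern is to fire requests concurrently and let vLLM's continuous batching interleave them on the GPU. A sketch using the OpenAI client configured above (helper name and worker count are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_complete(client, prompts, model="meta-llama/Llama-3.2-3B-Instruct",
                   max_workers=8):
    """Send prompts concurrently; the server batches them together."""
    def one(prompt):
        resp = client.completions.create(model=model, prompt=prompt, max_tokens=50)
        return resp.choices[0].text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, prompts))
```

Unlike a sequential loop, concurrent submission keeps the GPU saturated, which is where the throughput gains come from.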
Need web data at scale? Check out my scraping tools on Apify or email spinov001@gmail.com for custom solutions.