vLLM is a high-throughput LLM serving engine. Using PagedAttention and continuous batching, it delivers up to 24x the throughput of Hugging Face Transformers, per the original vLLM benchmarks.
What Is vLLM?
vLLM is an open-source library for fast LLM inference and serving. It uses PagedAttention to efficiently manage GPU memory.
Features:
- Up to 24x higher throughput than HF Transformers
- OpenAI-compatible API
- PagedAttention for memory efficiency
- Continuous batching
- Tensor/pipeline parallelism
- LoRA support
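These features are also available without the HTTP server, through vLLM's offline Python API. A minimal sketch of batch generation (the helper name is mine; running it requires a CUDA-capable machine with vllm installed, so the import is kept inside the function):

```python
def run_offline_batch(prompts, model="meta-llama/Llama-3.2-3B-Instruct"):
    """Generate one completion per prompt with vLLM's offline API."""
    # Imported lazily: vllm needs a GPU environment to initialize.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)  # PagedAttention and continuous batching are built in
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Because vLLM batches internally, passing all prompts in one `generate` call is faster than looping one prompt at a time.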
Quick Start
pip install vllm
# Start server
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
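`vllm serve` also takes flags for scaling and memory control; a configuration sketch with commonly used options (the values here are illustrative, not recommendations):

```shell
# Spread the model across 2 GPUs, cap memory use, and limit context length
vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```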
OpenAI-Compatible API
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","messages":[{"role":"user","content":"What is Docker?"}]}'
# Completions
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.2-3B-Instruct","prompt":"Python is","max_tokens":50}'
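The same completions call can be made from Python with only the standard library. A sketch (endpoint, model, and payload copied from the curl example; the helper name is mine):

```python
import json
import urllib.request

def build_completion_request(prompt, model="meta-llama/Llama-3.2-3B-Instruct",
                             base_url="http://localhost:8000/v1", max_tokens=50):
    """Build a POST request for vLLM's /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send it like this:
# with urllib.request.urlopen(build_completion_request("Python is")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```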
Use with OpenAI SDK
from openai import OpenAI
# The API key is ignored unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.2-3B-Instruct",
messages=[{"role": "user", "content": "Explain Kubernetes"}]
)
print(response.choices[0].message.content)
Use Cases
- LLM serving — production-grade inference
- Cost reduction — serve more with fewer GPUs
- Self-hosted AI — private LLM deployment
- Batch inference — process large datasets
- API gateway — OpenAI-compatible endpoint
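For the batch-inference use case, a common pattern is to fire requests concurrently and let vLLM's continuous batching interleave them on the GPU. A sketch using the OpenAI client configured above (helper name and worker count are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_complete(client, prompts, model="meta-llama/Llama-3.2-3B-Instruct",
                   max_workers=8):
    """Send prompts concurrently; the server batches them together."""
    def one(prompt):
        resp = client.completions.create(model=model, prompt=prompt, max_tokens=50)
        return resp.choices[0].text

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, prompts))
```

Unlike a sequential loop, concurrent submission keeps the GPU saturated, which is where the throughput gains come from.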
Need web data at scale? Check out my scraping tools on Apify or email spinov001@gmail.com for custom solutions.