vLLM is one of the fastest open-source LLM inference engines, reporting 2-24x higher throughput than HuggingFace Transformers in its launch benchmarks. It uses PagedAttention for efficient KV-cache memory management and powers inference at companies like Anyscale, Mistral, and Databricks.
It is free, open source, and ships with a built-in OpenAI-compatible API server.
## Why Use vLLM?
- Fastest throughput — PagedAttention + continuous batching
- OpenAI-compatible — drop-in replacement for OpenAI API
- Any HF model — Llama, Mistral, Qwen, Phi, Gemma, and more
- Multi-GPU — tensor parallelism across GPUs
- Structured output — JSON schema enforcement
- Speculative decoding — even faster with draft models
## Quick Setup
### 1. Install

```shell
pip install vllm

# Or Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3
```
### 2. Start API Server

```shell
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192

# With multiple GPUs
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# With quantization
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq
```
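Flags like `--max-model-len` and `--tensor-parallel-size` trade GPU memory against context length, most of which goes to the KV cache that PagedAttention manages in fixed-size blocks. A back-of-the-envelope sizing sketch, using Mistral-7B-v0.3's published config (32 layers, 8 KV heads, head dim 128) and fp16 weights as assumptions:

```python
# Rough KV-cache sizing for one sequence. PagedAttention allocates this
# cache in fixed-size blocks instead of one contiguous slab per sequence,
# which is where most of vLLM's memory efficiency comes from.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # 2x for the separate K and V tensors at every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Mistral-7B-v0.3: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)                  # 131072 bytes = 128 KiB per token
print(per_token * 8192 / 2**30)   # 1.0 GiB for one full 8192-token sequence
```

So a single full-length sequence at `--max-model-len 8192` costs about 1 GiB of cache on top of the model weights, which is why lowering that flag is the usual first fix for out-of-memory errors.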
### 3. Chat Completion

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is web scraping?"}],
    "temperature": 0.7,
    "max_tokens": 200
  }' | jq '.choices[0].message.content'
```
### 4. Batch Inference

```shell
# Process multiple prompts efficiently
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": ["Translate to French: Hello", "Translate to Spanish: Hello", "Translate to German: Hello"],
    "max_tokens": 50
  }' | jq '.choices[] | {index: .index, text: .text}'
```
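Completions for a prompt list come back in one `choices` array, with each entry's `index` pointing to the originating prompt. A small stdlib-only sketch that pairs results back with their prompts; the URL and payload follow the request above, and the network call is guarded so the script also runs without a live server:

```python
import json
import urllib.request

def parse_batch(response: dict, prompts: list) -> dict:
    """Map each prompt to its completion using the 'index' field."""
    return {prompts[c["index"]]: c["text"] for c in response["choices"]}

prompts = ["Translate to French: Hello", "Translate to Spanish: Hello"]
payload = json.dumps({
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": prompts,
    "max_tokens": 50,
}).encode()

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(parse_batch(json.load(resp), prompts))
except OSError as e:
    print(f"server not reachable: {e}")
```

Note that `index` order is not guaranteed to match response order, which is why the helper maps by index rather than zipping the two lists.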
### 5. Structured Output

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Extract: John Smith works at Google as a senior engineer in NYC"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "title": {"type": "string"},
            "city": {"type": "string"}
          },
          "required": ["name", "company"]
        }
      }
    }
  }' | jq '.choices[0].message.content | fromjson'
```
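Schema enforcement constrains decoding so the output parses as JSON matching the schema, but a cheap client-side sanity check on required keys is still worthwhile before trusting extracted data downstream. A minimal sketch using the same schema as the request above (the sample output string is illustrative, not a recorded server response):

```python
import json

person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "company": {"type": "string"},
        "title": {"type": "string"},
        "city": {"type": "string"},
    },
    "required": ["name", "company"],
}

def missing_required(raw: str, schema: dict) -> list:
    """Return the schema's required keys that are absent from the model's JSON output."""
    obj = json.loads(raw)
    return [k for k in schema["required"] if k not in obj]

# Illustrative output for the extraction prompt above:
sample = '{"name": "John Smith", "company": "Google", "title": "senior engineer", "city": "NYC"}'
print(missing_required(sample, person_schema))  # []
```

For full validation (types, nesting, formats) a library like `jsonschema` is the usual choice; the helper above only checks key presence.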
## Python Example

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Chat
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Best practices for ethical web scraping"}],
    temperature=0.5,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Compare Scrapy vs BeautifulSoup"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings -- /v1/embeddings requires serving an embedding model,
# e.g.: vllm serve BAAI/bge-base-en-v1.5
# (an instruct chat model like Mistral-7B will not serve this endpoint)
emb = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",
    input="web scraping best practices",
)
print(f"Embedding dim: {len(emb.data[0].embedding)}")
```
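Under the hood, the streaming loop above consumes server-sent events: each chunk arrives as a `data: {json}` line carrying a `delta`, and the stream ends with `data: [DONE]`. A stdlib-only sketch of that parsing, useful if you ever read the raw HTTP stream without the `openai` client (the sample lines are illustrative, not captured server output):

```python
import json

def stream_text(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":
            return  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

# Illustrative chunk lines in the shape the server sends:
raw = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Scrapy is a "}}]}',
    'data: {"choices": [{"delta": {"content": "full framework."}}]}',
    "data: [DONE]",
]
print("".join(stream_text(raw)))  # Scrapy is a full framework.
```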
## Key Endpoints
| Endpoint | Description |
|---|---|
| /v1/chat/completions | Chat (OpenAI-compatible) |
| /v1/completions | Text completion |
| /v1/embeddings | Text embeddings |
| /v1/models | List loaded models |
| /health | Health check |
| /metrics | Prometheus metrics |
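`/health` returns HTTP 200 once the model is loaded, so a deployment script can gate traffic on it instead of sleeping for a fixed interval. A stdlib-only polling sketch; the URL and timeout values are assumptions to adjust for your setup:

```python
import time
import urllib.request

def wait_for_server(base_url: str, deadline_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll /health until it returns HTTP 200 or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # not up yet: connection refused or timed out
        time.sleep(interval_s)
    return False

print(wait_for_server("http://localhost:8000", deadline_s=2.0))
```

Large models can take minutes to load, so size `deadline_s` generously in real deployments; `/metrics` then provides Prometheus counters for ongoing monitoring.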
## Performance vs Alternatives
| Engine | Throughput | Features |
|---|---|---|
| vLLM | Highest | Full OpenAI API, structured output |
| TGI | High | Grammar support, Hugging Face native |
| Ollama | Medium | Easiest setup, desktop-friendly |
| llama.cpp | Medium | CPU-friendly, GGUF format |
Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors