vLLM is a high-throughput LLM serving engine. Its core innovation, PagedAttention, manages the KV cache in fixed-size blocks instead of large contiguous buffers, and the vLLM team reports up to 24x the throughput of HuggingFace Transformers. It also exposes an OpenAI-compatible API, so existing OpenAI client code can point at a vLLM server with a one-line change.
What Makes vLLM Special?
- Up to 24x throughput — PagedAttention-based KV-cache management
- OpenAI-compatible — drop-in replacement API
- Continuous batching — serves multiple requests efficiently
- Multi-GPU — tensor parallelism across GPUs
- Free — Apache 2.0 license
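The paged-KV-cache idea is easiest to see in miniature. The sketch below is a toy allocator in pure Python, not vLLM's actual implementation (the block size and class names are illustrative): tokens are stored in fixed-size blocks drawn from a shared free pool, so a sequence only ever holds the memory it has actually used, and finished sequences return their blocks immediately.

```python
BLOCK_SIZE = 16  # tokens per block (illustrative; vLLM's real default differs)

class PagedKVCache:
    """Toy paged KV-cache allocator: blocks come from a shared free pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):
    cache.append_token("req-1")          # 20 tokens fit in 2 blocks of 16
print(len(cache.block_tables["req-1"]))  # 2
```

The contrast with naive serving is the point: reserving one max-length contiguous buffer per request wastes most of it on short outputs, while block-level allocation keeps fragmentation near zero and lets more sequences share the same GPU memory.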
The Hidden API: OpenAI-Compatible Server
```bash
# Start the vLLM OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000
```
```python
from openai import OpenAI

# Drop-in replacement for OpenAI: point the official client at the local
# server; any non-empty api_key works unless the server enforces one.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='token')

response = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Explain Docker in 3 sentences.'}],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
```
```python
# Streaming: tokens arrive as they are generated
stream = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Write a haiku about coding.'}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or '', end='')
```
```python
# Embeddings: this endpoint needs a model served in embedding mode —
# a chat model like Llama 3 won't work. An embedding model vLLM supports
# (e.g. intfloat/e5-mistral-7b-instruct) must be loaded instead.
embedding = client.embeddings.create(
    model='intfloat/e5-mistral-7b-instruct',
    input='What is vLLM?'
)
```
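Once the embeddings come back (each one a plain list of floats), comparing them takes only standard-library Python. This cosine-similarity helper is generic, not vLLM-specific:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. rank documents against a query embedding:
# scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```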
Offline Batch Inference API
```python
from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
params = SamplingParams(temperature=0.7, max_tokens=200)

prompts = [
    'Explain quantum computing.',
    'Write a Python sort function.',
    'What is machine learning?'
]

# All prompts are submitted in one call and scheduled together on the GPU
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)
```
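The reason batch calls like this scale so well is continuous batching: rather than waiting for an entire batch to finish, the scheduler retires completed requests and admits waiting ones at every decode step. The toy loop below illustrates only the scheduling idea, nothing like vLLM's internals (request format and `max_batch` are made up for the example):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy decode loop: each request is (id, tokens_to_generate). Finished
    requests leave the batch immediately; waiting ones join mid-flight."""
    waiting = deque(requests)
    running, finished_order, steps = [], [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit new work
            running.append(list(waiting.popleft()))
        steps += 1                                    # one decode step for all
        for req in running:
            req[1] -= 1
        finished_order += [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]    # retire done requests
    return finished_order, steps

order, steps = continuous_batching([("a", 3), ("b", 1), ("c", 2)])
print(order, steps)  # ['b', 'a', 'c'] 3
```

A static batcher over the same three requests would need 5 steps (3 for the first pair, then 2 more for "c"), and the gap widens quickly when output lengths are uneven, which is the common case in real serving.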
Quick Start
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
```
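Multi-GPU serving uses the same entrypoint: `--tensor-parallel-size` shards the model across GPUs. A sketch assuming a node with 4 GPUs:

```shell
# Shard the model across 4 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000
```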
Why ML Teams Choose vLLM
An ML engineer shared: "We served Llama 3 with HuggingFace at 5 requests/sec. Switched to vLLM — 120 requests/sec on the same GPU. PagedAttention is a game changer for memory efficiency. Our inference costs dropped 90%."
Building AI infrastructure? Email spinov001@gmail.com or check my AI tools.
How do you serve LLMs? vLLM vs TGI vs Ollama?