
Alex Spinov

vLLM Has a Free API You've Never Heard Of

vLLM is a high-throughput LLM serving engine. Its core idea, PagedAttention, manages the KV cache in fixed-size blocks, and the vLLM team's benchmarks report up to 24x higher throughput than HuggingFace Transformers. It also exposes an OpenAI-compatible API — point your existing OpenAI client code at vLLM and get the speedup without rewriting anything.

What Makes vLLM Special?

  • Up to 24x higher throughput — PagedAttention for efficient KV-cache memory management
  • OpenAI-compatible — drop-in replacement API
  • Continuous batching — serves multiple requests efficiently
  • Multi-GPU — tensor parallelism across GPUs
  • Free — Apache 2.0 license
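Continuous batching is the scheduling trick behind much of that throughput: instead of waiting for an entire batch to finish, vLLM evicts finished sequences and admits new requests mid-flight. Here is a toy pure-Python simulation of the difference (illustrative only — this is not vLLM's scheduler, just a step-counting sketch):

```python
def static_batch_steps(lengths, batch_size):
    # Static batching: each batch runs until its longest sequence finishes,
    # so short sequences idle alongside long ones.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    # Continuous batching: a finished sequence's slot is refilled
    # immediately, so every decode step does useful work.
    pending = list(lengths)
    slots = []
    steps = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))
        steps += 1
        slots = [n - 1 for n in slots if n > 1]
    return steps

lengths = [10, 2, 2, 2]  # generation lengths in tokens
print(static_batch_steps(lengths, 2))      # 12 (10 + 2)
print(continuous_batch_steps(lengths, 2))  # 10
```

With one long request and several short ones, the static scheduler pays for the stragglers while continuous batching finishes as soon as the longest request does.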

The Hidden API: OpenAI-Compatible Server

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --port 8000
from openai import OpenAI

# api_key can be any string unless the server was started with --api-key
client = OpenAI(base_url='http://localhost:8000/v1', api_key='token')

# Drop-in replacement for OpenAI!
response = client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Explain Docker in 3 sentences.'}],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model='meta-llama/Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Write a haiku about coding.'}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or '', end='')

# Embeddings — requires serving an embedding model (e.g. BAAI/bge-base-en-v1.5);
# a decoder-only chat model like Llama 3 will not serve this endpoint
embedding = client.embeddings.create(
    model='BAAI/bge-base-en-v1.5',
    input='What is vLLM?'
)
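The streaming loop above just prints deltas as they arrive. If you instead want the assembled message, a small helper like this works with any iterable of OpenAI-SDK-shaped chunks (the helper name and the stand-in chunk objects are my own, for illustration):

```python
def collect_stream(stream):
    # Accumulate streamed chat-completion deltas into the final text.
    # The last chunk's delta.content is typically None, so skip falsy deltas.
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

# Quick sanity check with stand-in chunk objects (no server needed):
from types import SimpleNamespace as NS
fake = [NS(choices=[NS(delta=NS(content=c))]) for c in ["Hel", "lo", None]]
print(collect_stream(fake))  # Hello
```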

Offline Batch Inference API

from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Llama-3-8B-Instruct')
params = SamplingParams(temperature=0.7, max_tokens=200)

prompts = [
    'Explain quantum computing.',
    'Write a Python sort function.',
    'What is machine learning?'
]

# vLLM schedules all prompts together via continuous batching
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)
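Both the server and the offline API get their memory efficiency from PagedAttention: the KV cache is stored in fixed-size blocks allocated on demand, like virtual-memory pages, instead of one contiguous max-length buffer per sequence. A toy block allocator conveys the idea (this is a sketch of the concept, not vLLM's implementation — the class and block size are invented for illustration):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default is also 16)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, position):
        # Allocate a new block only when a sequence crosses a block boundary,
        # so memory grows with actual length, not the worst case.
        if position % BLOCK_SIZE == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def blocks_used(self, seq_id):
        return len(self.tables.get(seq_id, []))

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):             # generate a 40-token sequence
    alloc.append_token("req-1", pos)
print(alloc.blocks_used("req-1"))  # 3 blocks — ceil(40/16) — rather than a
                                   # preallocated max-context buffer
```

Because memory is claimed block by block, the waste per sequence is at most one partially filled block, which is what lets vLLM pack many more concurrent requests onto the same GPU.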

Quick Start

pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8B-Instruct

Why ML Teams Choose vLLM

An ML engineer shared: "We served Llama 3 with HuggingFace at 5 requests/sec. Switched to vLLM — 120 requests/sec on the same GPU. PagedAttention is a game changer for memory efficiency. Our inference costs dropped 90%."


Building AI infrastructure? Email spinov001@gmail.com or check my AI tools.

How do you serve LLMs? vLLM vs TGI vs Ollama?
