
Alex Spinov


vLLM Has a Free API — Serve LLMs 24x Faster

vLLM is a high-throughput LLM serving engine. Using PagedAttention and continuous batching, it can serve models with up to 24x the throughput of Hugging Face Transformers.

What Is vLLM?

vLLM is an open-source library for fast LLM inference and serving. It uses PagedAttention to efficiently manage GPU memory.
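
To build intuition for how PagedAttention manages memory, here is a simplified pure-Python sketch of the block-table idea (this is illustrative only, not vLLM's actual implementation): each sequence's KV cache lives in fixed-size blocks allocated on demand, so memory is never reserved up front for a sequence's maximum possible length.

```python
# Simplified sketch of PagedAttention's block-table idea (not vLLM's code).
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Pool of fixed-size physical blocks, handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical -> physical block mapping
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):          # generate 40 tokens
    seq.append_token()
print(len(seq.block_table))  # 3 blocks cover 40 tokens (ceil(40 / 16))
```

The point is that a 40-token sequence holds only 3 blocks instead of a contiguous max-length slab, which is what lets vLLM pack many more sequences into the same GPU memory.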

Features:

  • Up to 24x higher throughput than Hugging Face Transformers
  • OpenAI-compatible API
  • PagedAttention for memory efficiency
  • Continuous batching
  • Tensor/pipeline parallelism
  • LoRA support
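
Continuous batching is easiest to see in a toy scheduler. The sketch below (again illustrative, not vLLM's scheduler) shows the key property: finished sequences leave the batch and waiting requests join immediately, instead of the whole batch draining before new work starts.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Count decode steps to finish all requests.

    requests: list of (request_id, tokens_to_generate).
    """
    waiting = deque(requests)
    running = {}              # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Admit new requests into free batch slots before every step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]   # slot frees up immediately
        steps += 1
    return steps

# Three short requests, one long one, one more waiting: the long request
# never blocks the waiting one, which is admitted as soon as a slot frees.
print(continuous_batching([("a", 2), ("b", 2), ("c", 2), ("d", 8), ("e", 3)]))  # 8
```

With static batching, request "e" would wait for the full 8-step batch to finish before starting, for 11 steps total; here it rides along in a freed slot and everything completes in 8.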

Quick Start

```bash
pip install vllm

# Start an OpenAI-compatible server
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
```

OpenAI-Compatible API

```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.2-3B-Instruct","messages":[{"role":"user","content":"What is Docker?"}]}'

# Completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.2-3B-Instruct","prompt":"Python is","max_tokens":50}'
```

Use with OpenAI SDK

```python
from openai import OpenAI

# vLLM ignores the API key by default, but the SDK requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain Kubernetes"}]
)
print(response.choices[0].message.content)
```

Use Cases

  1. LLM serving — production-grade inference
  2. Cost reduction — serve more traffic with fewer GPUs
  3. Self-hosted AI — private LLM deployment
  4. Batch inference — process large datasets
  5. API gateway — OpenAI-compatible endpoint
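
The cost-reduction case above is just arithmetic: a throughput multiplier divides the number of GPUs needed for a fixed load. The numbers below are made up for illustration; real speedups depend on model, hardware, and workload.

```python
import math

def gpus_needed(target_tps, per_gpu_tps, speedup=1.0):
    """GPUs required to hit a target tokens/sec at a given per-GPU rate."""
    return math.ceil(target_tps / (per_gpu_tps * speedup))

# Hypothetical numbers: 24,000 tokens/sec of traffic, 1,000 tokens/sec
# per GPU at baseline, and an assumed best-case 24x throughput gain.
baseline = gpus_needed(target_tps=24_000, per_gpu_tps=1_000)               # 24
with_vllm = gpus_needed(target_tps=24_000, per_gpu_tps=1_000, speedup=24)  # 1
print(baseline, with_vllm)
```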

Need web data at scale? Check out my scraping tools on Apify or email spinov001@gmail.com for custom solutions.
