
Alex Spinov

vLLM Has a Free API — The Fastest Open-Source LLM Inference Engine

vLLM is the fastest open-source LLM inference engine, achieving 2-24x higher throughput than Hugging Face Transformers. It uses PagedAttention for efficient KV-cache memory management and powers inference at companies like Anyscale, Mistral, and Databricks.

Free, open source, with a built-in OpenAI-compatible API server.

Why Use vLLM?

  • Fastest throughput — PagedAttention + continuous batching
  • OpenAI-compatible — drop-in replacement for OpenAI API
  • Any HF model — Llama, Mistral, Qwen, Phi, Gemma, and more
  • Multi-GPU — tensor parallelism across GPUs
  • Structured output — JSON schema enforcement
  • Speculative decoding — even faster with draft models

Quick Setup

1. Install

pip install vllm

# Or Docker
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3

2. Start API Server

vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192

# With multiple GPUs
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

# With quantization
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq
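Model weights can take minutes to load on first start, so it's worth waiting for readiness before sending traffic. A minimal sketch using only the standard library, assuming the default `http://localhost:8000` address from the commands above (`/health` is vLLM's built-in health endpoint; the helper name is mine):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll vLLM's /health endpoint until the server answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False

# Usage against the server started above:
# if wait_for_server("http://localhost:8000", timeout=300):
#     print("ready for requests")
```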

3. Chat Completion

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is web scraping?"}],
    "temperature": 0.7,
    "max_tokens": 200
  }' | jq '.choices[0].message.content'

4. Batch Inference

# Process multiple prompts efficiently
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "prompt": ["Translate to French: Hello", "Translate to Spanish: Hello", "Translate to German: Hello"],
    "max_tokens": 50
  }' | jq '.choices[] | {index: .index, text: .text}'
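vLLM's continuous batching already interleaves requests server-side, so client-side batching is mainly about keeping individual request bodies to a sane size. A minimal sketch of that split (the chunk size and both helper names are my own, not part of vLLM):

```python
import json

def chunk_prompts(prompts: list[str], size: int = 32) -> list[list[str]]:
    """Split a long prompt list into batches of at most `size` prompts each."""
    return [prompts[i:i + size] for i in range(0, len(prompts), size)]

def build_batch_request(model: str, prompts: list[str], max_tokens: int = 50) -> str:
    """Serialize one /v1/completions body with a list-valued "prompt" field."""
    return json.dumps({"model": model, "prompt": prompts, "max_tokens": max_tokens})

batches = chunk_prompts([f"Translate to French: sentence {i}" for i in range(70)])
print(len(batches))  # 3 batches: 32 + 32 + 6
```

Each serialized body can then be POSTed to `/v1/completions` exactly as in the curl call above.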

5. Structured Output

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Extract: John Smith works at Google as a senior engineer in NYC"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "company": {"type": "string"},
            "title": {"type": "string"},
            "city": {"type": "string"}
          },
          "required": ["name", "company"]
        }
      }
    }
  }' | jq '.choices[0].message.content | fromjson'
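Schema enforcement constrains the token stream to valid JSON matching the schema, but it is still worth validating the decoded object before using it downstream. A minimal post-check matching the `required` list in the schema above (the helper name is mine):

```python
import json

REQUIRED_KEYS = ("name", "company")  # mirrors "required" in the schema above

def parse_person(raw: str) -> dict:
    """Decode the model's JSON reply and verify the required keys are present."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data

sample = '{"name": "John Smith", "company": "Google", "title": "senior engineer", "city": "NYC"}'
print(parse_person(sample)["company"])  # Google
```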

Python Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Chat
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Best practices for ethical web scraping"}],
    temperature=0.5
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Compare Scrapy vs BeautifulSoup"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings — requires serving an embedding model; a chat model like
# Mistral-7B-Instruct will not return embeddings. The model below is one
# example of an embedding model vLLM supports.
emb = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input="web scraping best practices"
)
print(f"Embedding dim: {len(emb.data[0].embedding)}")

Key Endpoints

| Endpoint | Description |
| --- | --- |
| `/v1/chat/completions` | Chat (OpenAI-compatible) |
| `/v1/completions` | Text completion |
| `/v1/embeddings` | Text embeddings |
| `/v1/models` | List loaded models |
| `/health` | Health check |
| `/metrics` | Prometheus metrics |
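The `/metrics` endpoint serves Prometheus text exposition format, which is easy to inspect without a full Prometheus setup. A minimal parser for the simple `name value` lines (it deliberately skips comments and labeled series; the sample metric name is illustrative):

```python
def parse_prom_metrics(text: str) -> dict[str, float]:
    """Parse unlabeled Prometheus text-format lines ("name value") into a dict."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # skip comments and labeled series in this sketch
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # not a simple name/value pair
    return metrics

sample = "# HELP vllm:num_requests_running ...\nvllm:num_requests_running 2.0\n"
print(parse_prom_metrics(sample))  # {'vllm:num_requests_running': 2.0}
```

Against a live server, `urllib.request.urlopen("http://localhost:8000/metrics")` supplies the text to parse.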

Performance vs Alternatives

| Engine | Throughput | Features |
| --- | --- | --- |
| vLLM | Highest | Full OpenAI API, structured output |
| TGI | High | Grammar support, Hugging Face native |
| Ollama | Medium | Easiest setup, desktop-friendly |
| llama.cpp | Medium | CPU-friendly, GGUF format |

Need a custom data extraction or scraping solution? I build production-grade scrapers for any website. Email: Spinov001@gmail.com | My Apify Actors
