Alan West

Running LLMs on Windows: Native vLLM vs WSL vs llama.cpp Compared

The Windows local LLM story just got interesting. Someone recently demonstrated Qwen3's 27B model running at 72 tokens per second on an RTX 3090 — natively on Windows. No WSL. No Docker. Just a portable vLLM launcher.

If you've been running local models on Windows, you know the pain. Let me break down how the landscape has shifted and help you pick the right inference stack.

Why This Comparison Matters Now

For the longest time, running vLLM on Windows meant one of two things: spinning up WSL2 or wrestling with Docker Desktop. Both add overhead, complexity, and weird networking quirks. Native Windows support changes the calculus entirely.

I've been running local models for inference on my dev machine for months — mostly through llama.cpp and Ollama. When I saw native vLLM hitting 72 tok/s on a 3090 with a 27B parameter model, I had to dig in.

The Contenders

Here's what we're comparing:

  • Native vLLM on Windows — the new kid, portable launcher approach
  • vLLM via WSL2 — the established "proper" way
  • llama.cpp (direct) — the GGUF Swiss army knife
  • Ollama — the "just works" option

Setup Complexity

Native vLLM (Windows)

From what's been shared, the portable installer handles CUDA dependencies and sets up vLLM without requiring a Linux subsystem:

# Reportedly as simple as:
./vllm-launcher.exe --model Qwen/Qwen3-27B --gpu-memory-utilization 0.95

# The launcher handles:
# - CUDA toolkit detection/bundling
# - Python environment isolation
# - Model downloading and caching

The "portable" aspect is key — no global Python installation conflicts, no PATH pollution.
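
Once the launcher reports the server is running, a quick way to confirm it is actually serving is to hit the OpenAI-compatible models endpoint. This is a minimal sketch and assumes the launcher exposes vLLM's standard server on port 8000 (I haven't verified the launcher's defaults, so adjust the URL if yours differs):

import requests

# Assumes the portable launcher starts vLLM's usual OpenAI-compatible
# server on localhost:8000; change the port if your setup differs.
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])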

vLLM via WSL2

# First, ensure WSL2 is set up with CUDA passthrough
wsl --install -d Ubuntu-22.04

# Inside WSL:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-27B \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95

Works well, but you're maintaining a full Linux userspace. GPU passthrough occasionally breaks after Windows updates. Ask me how I know.
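
When passthrough breaks, vLLM usually just fails to see the GPU at all. A quick sanity check I'd run from inside WSL after any Windows update, assuming PyTorch is installed in the same environment (it is if vLLM is):

# Run inside WSL to confirm CUDA passthrough is still working.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM free: {free / 1e9:.1f} / {total / 1e9:.1f} GB")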

llama.cpp

# Download a GGUF quantized model
# Run the server with CUDA acceleration
./llama-server.exe -m qwen3-27b-q4_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

# -ngl 99: offload all layers to GPU
# -c 8192: context window size

Native Windows binary. No fuss. But you're using quantized models (usually Q4 or Q5), which trades some quality for speed and memory savings.
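
One thing that makes comparisons easier: llama-server also speaks the OpenAI chat-completions dialect, so you can reuse the same client code shown later in this post. A minimal sketch against the server started above (the model name is a placeholder; llama-server serves whatever GGUF you pointed -m at):

import openai

# llama-server exposes an OpenAI-compatible endpoint; the port matches
# the --port 8080 flag from the command above. No API key is required,
# but the openai client wants a non-empty string.
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-27b",  # placeholder; largely ignored by llama-server
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4 quantization."}],
    max_tokens=256,
)
print(response.choices[0].message.content)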

Ollama

# Literally just:
ollama run qwen3:27b

# Or serve it as an API:
ollama serve
# Then: curl http://localhost:11434/api/generate -d '{"model": "qwen3:27b", "prompt": "hello"}'

Ollama wins on simplicity every single time. It's the brew install of local LLMs.

Performance Comparison (RTX 3090, 24GB VRAM)

Stack        | Model Format       | ~Throughput   | VRAM Usage            | Quality
Native vLLM  | FP16/BF16          | ~72 tok/s     | ~22GB                 | Full precision
WSL vLLM     | FP16/BF16          | ~65-70 tok/s  | ~22GB + WSL overhead  | Full precision
llama.cpp    | Q4_K_M GGUF        | ~45-55 tok/s  | ~16GB                 | Slight quality loss
Ollama       | Q4_K_M (internal)  | ~40-50 tok/s  | ~16GB                 | Slight quality loss

Note: These are approximate numbers based on community reports. Your mileage will vary with context length, batch size, and the silicon lottery of your particular card.

The native vLLM numbers are impressive because you're getting full-precision inference without the WSL tax. That 5-10% overhead from the virtualization layer adds up.

When to Use What

Choose native vLLM if:

  • You need maximum throughput with full precision
  • You're building production-adjacent inference pipelines
  • You want PagedAttention and continuous batching (see the concurrency sketch after these lists)
  • You don't want to maintain a WSL environment

Choose WSL vLLM if:

  • You need the full vLLM ecosystem (already battle-tested on Linux)
  • You're comfortable with WSL and already have it configured
  • You need features that might not be in the Windows port yet

Choose llama.cpp if:

  • You want maximum flexibility with model formats
  • You're fine with quantized models (honestly, Q5_K_M is barely distinguishable from FP16 for most tasks)
  • You need to run on machines with less VRAM
  • You want one static binary with zero dependencies

Choose Ollama if:

  • You want zero configuration
  • You're prototyping or doing local development
  • You need quick model switching
  • You're not chasing maximum throughput
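
The continuous-batching point is easiest to appreciate with a concrete test: fire a batch of concurrent requests and watch aggregate throughput. Here's a rough sketch using the async OpenAI client; the port and model name assume vLLM's defaults, so adjust them for whichever server you're pointing it at.

import asyncio
import time
import openai

# Fire N requests concurrently and measure wall time plus aggregate tok/s.
# Port 8000 assumes vLLM's default OpenAI-compatible server; any non-empty
# api_key works unless the server was started with --api-key.
client = openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(prompt: str):
    return await client.chat.completions.create(
        model="Qwen/Qwen3-27B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )

async def main():
    prompts = [f"Write a haiku about request number {i}" for i in range(16)]
    start = time.perf_counter()
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(r.usage.completion_tokens for r in results)
    print(f"{len(prompts)} requests in {elapsed:.1f}s, ~{total_tokens / elapsed:.0f} tok/s aggregate")

asyncio.run(main())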

Migration: From Ollama/llama.cpp to Native vLLM

If you're currently using Ollama or llama.cpp and want to try native vLLM for better throughput:

Step 1: Check Your VRAM Budget

A 27B parameter model in FP16 needs roughly 54GB for the weights alone, so aggressive KV-cache management on its own won't squeeze it into 24GB; the reported 3090 setups presumably involve some form of weight quantization (FP8, AWQ, GPTQ) or a smaller variant. Check what the launcher actually loads, and confirm your GPU can handle it.
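
The back-of-the-envelope math, if you want to sanity-check any model/quantization combination before downloading tens of gigabytes:

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only estimate; ignores KV cache and activation overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.5 bits per weight in practice.
for label, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("Q4_K_M (~4.5 bpw)", 4.5)]:
    print(f"27B @ {label}: ~{weight_memory_gb(27, bits):.0f} GB for weights alone")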

Step 2: Swap Your API Calls

vLLM exposes an OpenAI-compatible API, so migration is straightforward:

import openai

# Before (Ollama):
client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't validate this
)

# After (native vLLM):
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"  # only checked if vLLM was started with --api-key; any placeholder otherwise
)

# Your actual inference code stays the same
response = client.chat.completions.create(
    model="Qwen/Qwen3-27B",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    temperature=0.7
)

Since both expose OpenAI-compatible endpoints, your application code barely changes.

Step 3: Benchmark YOUR Workload

Don't trust anyone's benchmarks (including mine). Run your actual prompts:

import time

prompts = load_your_actual_prompts()  # Use real data

start = time.perf_counter()
for prompt in prompts:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-27B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512
    )
elapsed = time.perf_counter() - start
print(f"Total: {elapsed:.1f}s for {len(prompts)} prompts")
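
Wall time is what your application feels, but if you want a number comparable to the 72 tok/s headline figure, count generated tokens via the usage field in the response (vLLM's OpenAI-compatible server reports it, and Ollama's does too as far as I've seen). Reusing client and prompts from above:

# Tokens-per-second variant of the same loop; client, prompts, and the
# time import come from the earlier snippets.
completion_tokens = 0
start = time.perf_counter()
for prompt in prompts:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-27B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    completion_tokens += response.usage.completion_tokens
elapsed = time.perf_counter() - start
print(f"~{completion_tokens / elapsed:.0f} tok/s (sequential, one request at a time)")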

The Bigger Picture

Native Windows support for vLLM is a big deal for the local inference ecosystem. The WSL requirement was a genuine barrier — not because it's hard to set up, but because it adds a layer of indirection that complicates deployment, debugging, and resource management.

That said, I wouldn't abandon llama.cpp or Ollama. They solve different problems. If you're running quantized models on consumer hardware and don't need continuous batching, llama.cpp remains excellent. If you want a five-second setup for prototyping, Ollama is unbeatable.

But if you're building anything that needs to serve multiple concurrent requests with full-precision models on Windows — native vLLM just became the obvious choice.

I'm planning to do more thorough benchmarks once the portable launcher stabilizes. For now, the early numbers are promising enough that it's worth keeping on your radar.
