DEV Community

Alberto Nieto

Posted on • Originally published at alberto.codes

Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1

The hardest part of GPU inference isn't the model — it's the environment. CUDA versions, driver compatibility, pip dependency conflicts. You can have a working quantization plugin and still spend an hour getting it to run on a fresh machine.

turboquant-vllm v1.1.0 ships a Containerfile that eliminates that setup. It extends the official vLLM image, installs the TQ4 compression plugin from PyPI, and verifies the plugin entry point at build time — not at runtime when you're debugging a silent fallback to uncompressed attention.

What Changed in v1.1

Container support. A single Containerfile bakes turboquant-vllm into the official vllm-openai image:

git clone https://github.com/Alberto-Codes/turboquant-vllm.git
cd turboquant-vllm
podman build -t vllm-turboquant -f infra/Containerfile.vllm .

Then serve a vision-language model with compressed KV cache:

podman run --rm \
  --device nvidia.com/gpu=all \
  --shm-size=8g \
  -p 8000:8000 \
  vllm-turboquant \
  --model allenai/Molmo2-8B \
  --attention-backend CUSTOM

One flag: --attention-backend CUSTOM. That's it.

Documentation site. Auto-generated API reference from docstrings, usage guides for vLLM, HuggingFace, and container deployment — including Quadlet examples for running as a systemd service. Deployed to GitHub Pages on every release.
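For the systemd route, a Quadlet `.container` unit looks roughly like this. This is a hypothetical sketch, not the maintained example from the docs site; the image tag, device syntax, and `Exec` arguments mirror the podman commands in this post and may need adjusting for your host:

```ini
# ~/.config/containers/systemd/vllm-turboquant.container (hypothetical)
[Unit]
Description=vLLM with TQ4 compressed KV cache

[Container]
Image=localhost/vllm-turboquant:latest
PublishPort=8000:8000
AddDevice=nvidia.com/gpu=all
ShmSize=8g
Exec=--model allenai/Molmo2-8B --attention-backend CUSTOM

[Install]
WantedBy=default.target
```

After `systemctl --user daemon-reload`, the generated `vllm-turboquant.service` starts the container like any other systemd unit.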

Per-layer quality tests. 12 new cosine similarity tests verify compression fidelity at each of the 36 transformer layers, not just end-to-end output. This catches layer-specific precision degradation that whole-model tests miss.
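The shape of such a per-layer check can be sketched in a few lines. This is an illustration, not the project's actual test code: `tq4_roundtrip` below is a stand-in symmetric 4-bit quantize/dequantize, and the real tests exercise the TQ4 kernels against real KV-cache tensors:

```python
# Sketch of a per-layer cosine-similarity fidelity check.
# 'tq4_roundtrip' is a hypothetical stand-in for TQ4 compress -> decompress.
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tq4_roundtrip(x):
    # Crude symmetric 4-bit quantization: 16 integer levels scaled to the data.
    scale = max(abs(v) for v in x) / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

rng = random.Random(0)
for layer in range(36):  # one assertion per transformer layer
    kv = [rng.gauss(0.0, 1.0) for _ in range(1024)]
    sim = cosine_similarity(kv, tq4_roundtrip(kv))
    assert sim > 0.95, f"layer {layer}: cosine similarity {sim:.4f} below threshold"
```

Asserting per layer rather than on final logits pinpoints which layer's quantization drifted, instead of reporting one aggregate failure.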

Why This Design

The Containerfile is deliberately minimal — 11 lines. It installs from PyPI (not from source) and verifies the plugin entry point at build time:

FROM docker.io/vllm/vllm-openai:v0.18.0
ARG TURBOQUANT_VERSION=1.1.0

RUN pip install --no-cache-dir "turboquant-vllm[vllm]==${TURBOQUANT_VERSION}"

RUN python3 -c "\
import importlib.metadata; \
eps = [e for e in importlib.metadata.entry_points(group='vllm.general_plugins') \
       if e.name == 'tq4_backend']; \
assert len(eps) == 1, 'TQ4 entry point not found'"

Build-time verification matters because vLLM's plugin discovery is silent. If the entry point isn't registered, vLLM falls back to its default attention backend without any error. You'd serve uncompressed inference thinking you had 3.76x compression. The assert in the Containerfile makes that failure loud and early.

The TURBOQUANT_VERSION build arg means you can pin or upgrade versions without editing the file.

Getting Started

Install from PyPI if you don't need the container:

pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

Or build the container for reproducible deployment:

podman build -t vllm-turboquant -f infra/Containerfile.vllm .
podman run --rm --device nvidia.com/gpu=all -p 8000:8000 \
  vllm-turboquant --model allenai/Molmo2-8B --attention-backend CUSTOM
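Once the server is up, a quick smoke test against the OpenAI-compatible API confirms the endpoint responds. A standard-library-only sketch, assuming the server from the command above is listening on localhost:8000:

```python
# Hypothetical smoke test; the path and payload follow the OpenAI-compatible
# chat completions API that vllm-openai images expose.
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    return json.dumps({
        "model": "allenai/Molmo2-8B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()

def chat(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```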

Verify compression is active in the logs:

INFO [cuda.py:257] Using AttentionBackendEnum.CUSTOM backend.

What's Next

  • Upstream vLLM contribution (vllm#38171 — 49 upvotes)
  • Full Flash Attention-style kernel fusion for multi-layer correctness
  • Stacking with token pruning (VL-Cache) for multiplicative VLM savings

PyPI | Docs | GitHub
