DEV Community

Alberto Nieto

Posted on • Originally published at alberto.codes

Compressed VLM inference from a single Containerfile — turboquant-vllm v1.1

The hardest part of GPU inference isn't the model — it's the environment. CUDA versions, driver compatibility, pip dependency conflicts. You can have a working quantization plugin and still spend an hour getting it to run on a fresh machine.

turboquant-vllm v1.1.0 ships a Containerfile that eliminates that setup. It extends the official vLLM image, installs the TQ4 compression plugin from PyPI, and verifies the plugin entry point at build time — not at runtime when you're debugging a silent fallback to uncompressed attention.

What Changed in v1.1

Container support. A single Containerfile bakes turboquant-vllm into the official vllm-openai image:

git clone https://github.com/Alberto-Codes/turboquant-vllm.git
cd turboquant-vllm
podman build -t vllm-turboquant -f infra/Containerfile.vllm .

Then serve a vision-language model with compressed KV cache:

podman run --rm \
  --device nvidia.com/gpu=all \
  --shm-size=8g \
  -p 8000:8000 \
  vllm-turboquant \
  --model allenai/Molmo2-8B \
  --attention-backend CUSTOM

One flag: --attention-backend CUSTOM. That's it.

Documentation site. Auto-generated API reference from docstrings, usage guides for vLLM, HuggingFace, and container deployment — including Quadlet examples for running as a systemd service. Deployed to GitHub Pages on every release.
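For the systemd route, a Quadlet `.container` unit looks roughly like this. This is a hypothetical sketch, not the maintained example from the docs site; the image tag, device syntax, and `Exec` arguments mirror the podman commands in this post and may need adjusting for your host:

```ini
# ~/.config/containers/systemd/vllm-turboquant.container (hypothetical)
[Unit]
Description=vLLM with TQ4 compressed KV cache

[Container]
Image=localhost/vllm-turboquant:latest
PublishPort=8000:8000
AddDevice=nvidia.com/gpu=all
ShmSize=8g
Exec=--model allenai/Molmo2-8B --attention-backend CUSTOM

[Install]
WantedBy=default.target
```

After `systemctl --user daemon-reload`, the generated `vllm-turboquant.service` starts the container like any other systemd unit.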

Per-layer quality tests. 12 new cosine similarity tests verify compression fidelity at each of the 36 transformer layers, not just end-to-end output. This catches layer-specific precision degradation that whole-model tests miss.
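The shape of such a per-layer check can be sketched in a few lines. This is an illustration, not the project's actual test code: `tq4_roundtrip` below is a stand-in symmetric 4-bit quantize/dequantize, and the real tests exercise the TQ4 kernels against real KV-cache tensors:

```python
# Sketch of a per-layer cosine-similarity fidelity check.
# 'tq4_roundtrip' is a hypothetical stand-in for TQ4 compress -> decompress.
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tq4_roundtrip(x):
    # Crude symmetric 4-bit quantization: 16 integer levels scaled to the data.
    scale = max(abs(v) for v in x) / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]

rng = random.Random(0)
for layer in range(36):  # one assertion per transformer layer
    kv = [rng.gauss(0.0, 1.0) for _ in range(1024)]
    sim = cosine_similarity(kv, tq4_roundtrip(kv))
    assert sim > 0.95, f"layer {layer}: cosine similarity {sim:.4f} below threshold"
```

Asserting per layer rather than on final logits pinpoints which layer's quantization drifted, instead of reporting one aggregate failure.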

Why This Design

The Containerfile is deliberately minimal — 11 lines. It installs from PyPI (not from source) and verifies the plugin entry point at build time:

FROM docker.io/vllm/vllm-openai:v0.18.0
ARG TURBOQUANT_VERSION=1.1.0

RUN pip install --no-cache-dir "turboquant-vllm[vllm]==${TURBOQUANT_VERSION}"

RUN python3 -c "\
import importlib.metadata; \
eps = [e for e in importlib.metadata.entry_points(group='vllm.general_plugins') \
       if e.name == 'tq4_backend']; \
assert len(eps) == 1, 'TQ4 entry point not found'"

Build-time verification matters because vLLM's plugin discovery is silent. If the entry point isn't registered, vLLM falls back to its default attention backend without any error. You'd serve uncompressed inference thinking you had 3.76x compression. The assert in the Containerfile makes that failure loud and early.

The TURBOQUANT_VERSION build arg means you can pin or upgrade versions without editing the file.

Getting Started

Install from PyPI if you don't need the container:

pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

Or build the container for reproducible deployment:

podman build -t vllm-turboquant -f infra/Containerfile.vllm .
podman run --rm --device nvidia.com/gpu=all -p 8000:8000 \
  vllm-turboquant --model allenai/Molmo2-8B --attention-backend CUSTOM
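Once the server is up, a quick smoke test against the OpenAI-compatible API confirms the endpoint responds. A standard-library-only sketch, assuming the server from the command above is listening on localhost:8000:

```python
# Hypothetical smoke test; the path and payload follow the OpenAI-compatible
# chat completions API that vllm-openai images expose.
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    return json.dumps({
        "model": "allenai/Molmo2-8B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()

def chat(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```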

Verify compression is active in the logs:

INFO [cuda.py:257] Using AttentionBackendEnum.CUSTOM backend.

What's Next

  • Upstream vLLM contribution (vllm#38171 — 49 upvotes)
  • Full Flash Attention-style kernel fusion for multi-layer correctness
  • Stacking with token pruning (VL-Cache) for multiplicative VLM savings

PyPI | Docs | GitHub
