Jovan Chan

Posted on Jul 1 • Originally published at aifoss.dev

Open-Source Vision Language Models 2026: Which to Self-Host

#vlm #visionlanguagemodels #qwen3vl #selfhosted

TL;DR: A 7–8B vision language model now fits in 8GB of VRAM and reads charts, screenshots, and scanned PDFs well enough for real work. Qwen3-VL is the best generalist you can actually run at home; DeepSeek-OCR is the specialist when the job is purely documents. The frontier (GLM-4.6V at 106B) needs a server, not a gaming GPU.

	Qwen3-VL 8B	InternVL3.5 8B	Gemma 3 4B	DeepSeek-OCR 2
Best for	General VQA + OCR + UI	Reasoning over images	Multilingual, low VRAM	Pure document parsing
Min VRAM (Q4)	~8GB	~8GB	~6GB	~8–10GB
License	Apache 2.0	MIT (check backbone)	Gemma Terms	MIT
The catch	Newer, fewer guides	Backbone license varies	Not Apache-clean	OCR only, not chat

Honest take: Start with Qwen3-VL 8B on Ollama. It's Apache 2.0, fits a single 8–12GB card, and covers OCR, charts, tables, and UI grounding in one model. Reach for DeepSeek-OCR only when you're processing documents at volume.

The open-source vision language model (VLM) field moved fast over the last year. The models people still cite in old blog posts — LLaVA, the original Qwen2-VL, PaliGemma, Idefics — are 2024-vintage. As of June 2026 the practical shortlist for self-hosters is shorter and far more capable. This is the comparison for someone who owns a consumer GPU (or rents one) and wants to pick one model for image understanding, OCR, or multi-modal RAG.

What a VLM actually does (and what to ignore)

A vision language model takes images plus text and answers in text. The useful tasks split into four buckets, and which model wins depends entirely on which bucket you care about:

Visual question answering (VQA) — "what's in this image," "describe this chart."
OCR and document parsing — turning a scanned invoice or PDF page into structured text or Markdown.
UI grounding — pointing at the right button in a screenshot, which agent frameworks lean on.
Multi-modal RAG — embedding images and pages so a retrieval pipeline can pull them back.

Ignore the leaderboard chasing. A model topping MMMU by two points means nothing if it doesn't fit your GPU or its license blocks commercial use. The two questions that actually decide your choice are: does it run on the VRAM I have, and can I legally ship what I build.
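
Those two questions translate directly into a filter. The sketch below hard-codes the Q4 VRAM floors and licenses from the comparison table at the top of this article, taking the upper end of the 8–10GB range for DeepSeek-OCR 2. The `Model` class and `shortlist` function are illustrative names, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    license: str
    min_vram_gb: int  # approximate Q4 floor, copied from the table above


MODELS = [
    Model("Qwen3-VL 8B", "Apache 2.0", 8),
    Model("InternVL3.5 8B", "MIT (check backbone)", 8),
    Model("Gemma 3 4B", "Gemma Terms", 6),
    Model("DeepSeek-OCR 2", "MIT", 10),
]


def shortlist(vram_gb, licenses=("Apache 2.0", "MIT")):
    """Names of models that load in vram_gb and carry an accepted license.

    The default accepts only plain Apache 2.0 or MIT, so entries whose
    license needs a manual check are excluded unless listed explicitly.
    """
    return [
        m.name
        for m in MODELS
        if m.min_vram_gb <= vram_gb and m.license in licenses
    ]


print(shortlist(8))
print(shortlist(6, licenses=("Apache 2.0", "MIT", "Gemma Terms")))
```

With the defaults, an 8GB card leaves only Qwen3-VL 8B. Widening the accepted licenses is a deliberate, visible choice rather than an accident.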

The 2026 shortlist

Qwen3-VL — the default generalist

Alibaba's Qwen3-VL family is the one to beat for self-hosters. It spans 2B, 4B, 8B, 30B-A3B (MoE), 32B, and 235B-A22B, and every size shares the same Apache 2.0 license, a 262,144-token context window, and the same core skills: document OCR, chart extraction, table parsing, UI grounding, and video understanding.

The Apache 2.0 license is the headline. Unlike Gemma or Llama Vision, there's no use-case carve-out and no acceptable-use addendum to read — you can build a commercial product on it without a lawyer. The 8B at Q4_K_M is about 6.1GB on disk and loads on an 8GB card, though you'll want 12–16GB to avoid memory pressure when you feed it large images. The 4B (~3.3GB at Q4_K_M) runs on practically anything with 6GB.

# Qwen3-VL 8B on Ollama (vision-capable tag)
ollama pull qwen3-vl:8b
# The Ollama CLI has no --image flag; put the image path inside the prompt instead
ollama run qwen3-vl:8b "Extract the table in this image as Markdown. ./invoice.png"

Expected output is a clean Markdown table reproducing the line items — not a paragraph describing the image. That distinction (structured extraction vs. vague description) is where the 2026 models pulled ahead of the LLaVA generation.
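
For scripted use, the same request can go through Ollama's local REST API instead of the CLI. This is a minimal sketch, assuming an Ollama server is running on its default port (11434) and the model has already been pulled. The helper names `build_payload` and `extract_markdown` are illustrative.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is serving locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "qwen3-vl:8b") -> dict:
    """Build a non-streaming /api/generate request carrying one image.

    The API expects images as base64-encoded strings in an "images" list.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def extract_markdown(image_path: str, prompt: str) -> str:
    """Send one image to the running server and return the model's text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `extract_markdown("invoice.png", "Extract the table in this image as Markdown.")` should return the same kind of Markdown table as the CLI command; `invoice.png` is a placeholder file name.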

InternVL3.5 — the reasoning specialist

OpenGVLab's InternVL3.5 (released August 2025) is the model to pick when the task is reasoning over an image rather than just reading it — math diagrams, multi-step chart questions, science figures. The 8B scores 73.4 on MMMU and the flagship 241B-A28B hits 77.7, with 82.7 on MathVista, which puts it at or near the top of the open-source field and within reach of closed commercial systems.

The license is the catch. InternVL's own code is MIT, but each model variant pairs the InternViT vision encoder with a separate LLM backbone (Qwen2.5, etc.), and the weights inherit that backbone's license. Most variants land on Apache 2.0 or MIT in practice, but you should read the specific HuggingFace model card before shipping — don't assume.
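
One way to make that check routine is to read the declared license out of the model card before a variant goes into a build. The sketch below assumes the card follows the usual Hugging Face layout: a README that opens with a YAML front-matter block containing a `license:` key. The function name is illustrative, and the declared value is a starting point for a review, not a substitute for one.

```python
from typing import Optional


def card_license(readme_text: str) -> Optional[str]:
    """Return the `license:` value from a model card's front matter, if any.

    Only the top-level `license` key inside the leading `---` block is read;
    a card without front matter yields None.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of the front-matter block
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "license":
            return value.strip().strip("'\"") or None
    return None


# Made-up card text for demonstration, not copied from a real repository.
sample = "---\nlicense: apache-2.0\ntags:\n- vision\n---\n# Some model\n"
print(card_license(sample))
```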

GLM-4.6V — the frontier, if you have the hardware

Z.ai's GLM-4.5V (106B total / 12B active MoE, MIT-licensed) posted state-of-the-art results across 42 benchmarks when it landed in August 2025, beating Qwen2.5-VL-72B and trading blows with Gemini 2.5 Flash. The follow-up GLM-4.6V (December 2025) added a 128K context window and native multimodal tool-calling, which is useful if you're wiring the VLM into an agent that calls functions.

Here's the honest part: 106B parameters do not fit a consumer GPU. Even heavily quantized, you're looking at a multi-GPU server or a rented cloud instance. If you want to test GLM-4.6V's frontier quality without buying an 8×GPU box, rent one by the hour on RunPod and tear it down when you're done. For local-only deployment, the 9B GLM-4.6V-Flash is the variant that fits a single card and keeps the native tool-calling — that's the one to pull for a home lab. Pair either with a real GPU build; runaihome.com has the hardware breakdowns.

Gemma 3 — multilingual on a budget

Google's Gemma 3 is multimodal across its 4B, 12B, and 27B sizes (the 270M and 1B are text-only), with a 128K context window and support for 140+ languages. The 4B runs at full precision on 8GB of VRAM; the 12B fits at Q4. Ollama supports it natively (ollama run gemma3:4b).

Gemma's strength is multilingual OCR and broad language coverage. Its weakness for this audience is licensing: Gemma ships under Google's Gemma Terms of Use, not Apache or MIT. It's permissive enough for most uses and allows commercial deployment, but it carries a prohibited-use policy you're agreeing to — which is why FOSS purists reach for Qwen3-VL first.

DeepSeek-OCR 2 — the document specialist

If your only job is documents — invoices, contracts, scanned archives, multilingual PDFs — a generalist VLM is the wrong tool. DeepSeek-OCR 2 (open-sourced January 27, 2026, MIT-licensed) is a 3B model built specifically for optical character recognition and layout parsing. It scored 91.09% on OmniDocBench v1.5, and its MoE decoder runs at roughly 570M active parameters per token, so a single A100-40G processes around 200,000 pages a day.
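
To turn the throughput figure into a planning number, the arithmetic is simple. This sketch takes the 200,000 pages/day figure quoted above at face value and assumes throughput scales linearly with GPU count, which real batching and I/O may not deliver.

```python
import math

PAGES_PER_DAY = 200_000  # figure quoted above for a single A100-40G
SECONDS_PER_DAY = 86_400


def pages_per_second(pages_per_day: int = PAGES_PER_DAY) -> float:
    """Average sustained rate implied by a daily throughput figure."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_process(total_pages: int, gpus: int = 1,
                    pages_per_day: int = PAGES_PER_DAY) -> int:
    """Whole days needed for an archive, assuming linear scaling across GPUs."""
    return math.ceil(total_pages / (pages_per_day * gpus))


print(f"{pages_per_second():.2f} pages/s")
print(days_to_process(1_000_000), "days for 1M pages on one GPU")
```

At that rate a one-million-page archive is about five days on one card, so a one-off backlog is a rental job rather than a hardware purchase.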

It runs on 8–10GB of VRAM in base mode. The trade-off is that it's not a chatbot — you don't have a conversation with it, you feed it pages and get structured text back. For a document-heavy local RAG pipeline, running DeepSeek-OCR for ingestion and a general LLM for the chat layer beats forcing one model to do both.
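
The split described above can be sketched as two small stages. Everything here is a placeholder: `ocr` and `chat` stand in for whatever calls your OCR model and chat LLM, and the keyword-overlap ranking is a deliberately naive substitute for a real embedding retriever.

```python
from typing import Callable, Iterable, List


def ingest(pages: Iterable[bytes], ocr: Callable[[bytes], str]) -> List[str]:
    """Stage 1: run the OCR specialist once per page and keep the text."""
    return [ocr(page) for page in pages]


def answer(question: str, corpus: List[str], chat: Callable[[str], str],
           top_k: int = 2) -> str:
    """Stage 2: pick the most relevant pages, then ask the chat model.

    Relevance is plain word overlap with the question, used here only
    to keep the sketch self-contained.
    """
    words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    context = "\n\n".join(ranked[:top_k])
    return chat(f"Context:\n{context}\n\nQuestion: {question}")
```

The point of the design is that the expensive page-to-text step runs once at ingestion time, and the chat layer only ever sees text.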

The decision table

Model	Sizes	License	Min VRAM (usable)	Ollama	Best at
Qwen3-VL	2B–235B	Apache 2.0	~6GB (4B) / ~8GB (8B)	Yes	Generalist OCR, charts, UI, video
InternVL3.5	1B–241B	MIT / backbone	~8GB (8B)	Partial	Reasoning over images, MMMU
GLM-4.6V	9B + 106B	MIT	~12GB (9B Flash)	Partial	Frontier quality, tool-calling
Gemma 3	4B–27B	Gemma Terms	~6GB (4B)	Yes	Multilingual, low-VRAM
DeepSeek-OCR 2	3B	MIT	~8–10GB	Via llama.cpp	High-volume document parsing

A note on "usable" VRAM: these are the figures to load a Q4 quant with a modest context. Push to long context or large input images and real serving VRAM climbs because of the KV cache and the vision encoder's activations. Budget headroom. If you're tight on memory, the GGUF quantization guide explains which quant level trades the least quality for the most savings.
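
To see why long context eats headroom, here is a rough estimate of the KV cache on top of the quantized weights. The layer and head counts below are an illustrative shape, assumed for the example rather than read from any model's config. The cache is assumed to be fp16, and vision-encoder activations are not modeled at all.

```python
GIB = 2**30


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    The leading 2 counts one key tensor and one value tensor per layer;
    bytes_per_value=2 corresponds to an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def estimate_gib(weights_gb_on_disk: float, **kv_args) -> float:
    """Quantized weights plus KV cache, in GiB (no vision activations)."""
    weights_bytes = weights_gb_on_disk * 1e9
    return (weights_bytes + kv_cache_bytes(**kv_args)) / GIB


# Illustrative shape only: assumed values, not taken from a model card.
shape = dict(n_layers=36, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 262_144):
    total = estimate_gib(6.1, context_tokens=ctx, **shape)
    print(f"{ctx:>7} tokens: ~{total:.1f} GiB")
```

Under these assumptions the cache alone grows from about 1.1 GiB at 8K tokens to 36 GiB at the full 262,144-token window, which is why a model that loads on an 8GB card cannot use its whole advertised context there.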

A real problem you'll hit: the image just gets described, not read

The most common failure when people first run a local
