Ollama supports vision models. LLaVA, Gemma 3, Moondream, Llama 3.2 Vision - pull them the same way you pull any other model. The inference works. The problem is the interface.
Here's what using a vision model through Ollama's API looks like:
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
}'
```
That `images` field expects base64. For a typical screenshot, that's 50,000-200,000 characters pasted into a terminal command. Generate it with `base64 -i screenshot.png`, paste it into the JSON payload. It works. Nobody does it twice voluntarily.
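A less painful version is to let the shell do the encoding and splice the base64 into the payload for you. A sketch, assuming `jq` is installed (the model name and filename are just examples):

```shell
# Build the /api/generate payload without hand-pasting base64.
# screenshot.png is a placeholder name; create a demo file so the sketch runs standalone.
[ -f screenshot.png ] || printf 'placeholder' > screenshot.png

IMG=$(base64 < screenshot.png | tr -d '\n')   # portable encode, one long line
jq -n --arg img "$IMG" \
  '{model: "llava", prompt: "What is in this image?", images: [$img]}' \
  > payload.json
# With Ollama running:
#   curl http://localhost:11434/api/generate -d @payload.json
```

The `--arg` binding keeps the base64 out of your shell history and lets `jq` handle the JSON escaping.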
The CLI `ollama run llava` supports a file-path shorthand, but it's still a text-only workflow for a fundamentally visual task.
What vision models can do
Vision-language models process an image and a text prompt together. They don't just classify. They reason.
- Image Q&A: "What's the error in this screenshot?" "How many people are in this photo?"
- Document understanding: point the model at a chart, table, or handwritten note and ask it to extract data or describe relationships. This goes further than OCR - vision models understand layout and context.
- UI analysis: screenshot a web page and ask the model to identify elements, describe layout, or spot accessibility issues.
- Scene description: detailed descriptions for accessibility narration, content tagging, or creative prompts.
All of these work with Ollama's vision models on your Mac. The capability is there. What's missing is a way to use it that doesn't involve base64 strings.
The drag-and-drop approach
ModelPiper handles the encoding automatically. Drag an image onto the chat window. Type your question. The model sees both the image and the text, and responds in the same thread.
If the model runs through Ollama, the request goes to Ollama's `/api/generate` with the image payload. If it runs through ModelPiper's built-in engine, it goes to local llama.cpp. Either way, no base64 in sight.
The chat shows the image alongside the response, so you can compare what the model said to what's actually in the picture. For iterative work - "now describe just the chart in the upper right" - reference the same image across multiple messages.
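For reference, the same multi-turn pattern maps onto Ollama's `/api/chat` endpoint: the image rides on one user message, and follow-ups stay text-only. A sketch, with an illustrative model name and a placeholder image:

```shell
# Multi-turn chat payload: the image is attached once, later turns reference it.
[ -f screenshot.png ] || printf 'placeholder' > screenshot.png  # demo file
IMG=$(base64 < screenshot.png | tr -d '\n')
jq -n --arg img "$IMG" '{
  model: "llava",
  stream: false,
  messages: [
    {role: "user", content: "Describe this screenshot.", images: [$img]},
    {role: "assistant", content: "It shows a dashboard with several charts."},
    {role: "user", content: "Now describe just the chart in the upper right."}
  ]
}' > chat.json
# With Ollama running:
#   curl http://localhost:11434/api/chat -d @chat.json
```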
Which models to use
| Model | Size | RAM | Best for |
|---|---|---|---|
| Moondream | 1.6B | ~1.5GB | Simple descriptions, 8GB Macs |
| Gemma 3 | 4B | ~3GB | Balanced quality/speed |
| LLaVA 1.6 | 7B | ~5GB | General-purpose image Q&A |
| LLaVA 1.6 | 13B | ~9GB | Complex visual reasoning |
| Llama 3.2 Vision | 11B | ~7GB | Strong reasoning, documents |
| Llama 3.2 Vision | 90B | ~48GB+ | Near-cloud quality (64GB+ Mac) |
Pull any of these with `ollama pull <model>`. They appear in ModelPiper's model selector when Ollama is connected as a provider.
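Before pulling one of the larger models, it's worth checking total RAM against the table. A rough shell check covering both macOS and Linux (the 9GB threshold is just an example, taken from the LLaVA 13B row):

```shell
# Compare total RAM to a model's approximate footprint from the table above.
need_gb=9   # e.g. LLaVA 13B
if [ "$(uname)" = "Darwin" ]; then
  total_bytes=$(sysctl -n hw.memsize)                              # macOS: bytes
else
  total_bytes=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))  # Linux: kB -> bytes
fi
total_gb=$(( total_bytes / 1024 / 1024 / 1024 ))
echo "total RAM: ${total_gb}GB, model needs ~${need_gb}GB"
[ "$total_gb" -ge "$need_gb" ] && echo "should fit" || echo "expect heavy swapping"
```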
Combining vision with pipelines
Vision models get more useful when chained:
- Vision + OCR: Apple Vision OCR extracts raw text, then a chat model summarizes or analyzes. More reliable than asking a vision model to read dense text directly.
- Vision + TTS: Describe an image, pipe the description to text-to-speech. Audio descriptions of visual content.
- Vision + Translation: Describe in English, translate to another language.
These multi-step workflows are where a pipeline builder earns its complexity.
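As a sketch of what such a chain looks like over raw Ollama, assuming `jq` and illustrative model names (the helper functions here are hypothetical, not ModelPiper's API):

```shell
# Two-step chain: a vision model describes an image, a text model summarizes.
[ -f photo.png ] || printf 'placeholder' > photo.png  # demo file

describe_payload() {  # image file -> JSON payload for the vision step
  jq -n --arg img "$(base64 < "$1" | tr -d '\n')" \
    '{model: "llava", prompt: "Describe this image.", images: [$img], stream: false}'
}
summarize_payload() {  # description text -> JSON payload for the text step
  jq -n --arg t "$1" \
    '{model: "llama3.2", prompt: ("Summarize in one sentence: " + $t), stream: false}'
}

describe_payload photo.png > step1.json
# With Ollama running, the chain would be:
#   DESC=$(curl -s http://localhost:11434/api/generate -d @step1.json | jq -r .response)
#   summarize_payload "$DESC" | curl -s http://localhost:11434/api/generate -d @-
summarize_payload "A bar chart comparing model sizes." > step2.json
```

Each step's output is just text, which is what makes the handoff to TTS or translation equally mechanical.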
Limitations worth knowing
Smaller models miss details. Moondream and even LLaVA 7B will miss fine text in screenshots, misread chart numbers, and sometimes hallucinate details. For text extraction, Apple Vision OCR is more reliable.
Image downscaling. Vision models typically resize images internally to 336x336 or 672x672 pixels. Fine details smaller than that working resolution are lost. Crop to the relevant portion before sending.
Memory pressure. Vision models are larger than text-only models at the same parameter count because they include a vision encoder. LLaVA 7B uses more memory than a text-only 7B model such as Mistral 7B.
Cloud is still better for hard tasks. GPT-4 Vision and Claude outperform local models on complex document analysis and multi-object reasoning. For quick descriptions and simple Q&A, local models are good enough.