DEV Community

Ben Racicot

Posted on • Originally published at modelpiper.com

Using LLaVA With Ollama on Mac - Without the Base64 Encoding

Ollama supports vision models. LLaVA, Gemma 3, Moondream, Llama 3.2 Vision - pull them the same way you pull any other model. The inference works. The problem is the interface.

Here's what using a vision model through Ollama's API looks like:

curl http://localhost:11434/api/generate -d '{
  "model": "llava",
  "prompt": "What is in this image?",
  "images": ["iVBORw0KGgoAAAANSUhEUgAA..."]
}'

That images field expects base64. For a typical screenshot, that's 50,000-200,000 characters pasted into a terminal command. Generate it with base64 -i screenshot.png, paste it into the JSON payload. It works. Nobody does it twice voluntarily.
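Scripting the encoding removes the copy-paste step entirely. A minimal sketch in Python using only the standard library — the file name, prompt, and model name are placeholders, and Ollama must be running on its default port:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def encode_image(path: str) -> str:
    """Base64-encode an image file, the way Ollama's images field expects."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

def ask_vision_model(image_path: str, prompt: str, model: str = "llava") -> str:
    """Send an image plus a prompt to Ollama and return the text response."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a stream
        "images": [encode_image(image_path)],
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires Ollama running and an image on disk:
# print(ask_vision_model("screenshot.png", "What is in this image?"))
```

That's workable for scripting, but it's still a terminal round trip per question.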

The CLI (ollama run llava, with an image file path included in the prompt) offers a shorthand, but it's still a text-only workflow for a fundamentally visual task.

What vision models can do

Vision-language models process an image and a text prompt together. They don't just classify. They reason.

Image Q&A. "What's the error in this screenshot?" "How many people are in this photo?"

Document understanding. Point at a chart, table, or handwritten note. Ask it to extract data or describe relationships. This goes further than OCR - vision models understand layout and context.

UI analysis. Screenshot a web page and ask the model to identify elements, describe layout, or spot accessibility issues.

Scene description. Detailed descriptions for accessibility narration, content tagging, or creative prompts.

All of these work with Ollama's vision models on your Mac. The capability is there. What's missing is a way to use it that doesn't involve base64 strings.

The drag-and-drop approach

ModelPiper handles the encoding automatically. Drag an image onto the chat window. Type your question. The model sees both the image and text, responds in the same thread.

If the model runs through Ollama, the request goes to Ollama's /api/generate with the image payload. If it runs through ModelPiper's built-in engine, it goes to local llama.cpp. Either way, no base64 in sight.

The chat shows the image alongside the response, so you can compare what the model said to what's actually in the picture. For iterative work - "now describe just the chart in the upper right" - reference the same image across multiple messages.

Which models to use

Model             Size   RAM      Best for
Moondream         1.6B   ~1.5GB   Simple descriptions, 8GB Macs
Gemma 3           4B     ~3GB     Balanced quality/speed
LLaVA 1.6         7B     ~5GB     General-purpose image Q&A
LLaVA 1.6         13B    ~9GB     Complex visual reasoning
Llama 3.2 Vision  11B    ~7GB     Strong reasoning, documents
Llama 3.2 Vision  90B    ~48GB+   Near-cloud quality (64GB+ Mac)

Pull any of these with ollama pull <model>. They appear in ModelPiper's model selector when Ollama is connected as a provider.
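To confirm what's actually installed, Ollama lists pulled models at its /api/tags endpoint. A quick stdlib check, assuming the default port 11434:

```python
import json
import urllib.request

def list_ollama_models(base_url: str = "http://localhost:11434"):
    """Return the names of locally pulled models, or None if Ollama isn't reachable."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:
        return None

models = list_ollama_models()
print(models if models is not None else "Ollama is not reachable")
```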

Combining vision with pipelines

Vision models get more useful when chained:

  • Vision + OCR: Apple Vision OCR extracts raw text, then a chat model summarizes or analyzes. More reliable than asking a vision model to read dense text directly.
  • Vision + TTS: Describe an image, pipe the description to text-to-speech. Audio descriptions of visual content.
  • Vision + Translation: Describe in English, translate to another language.

These multi-step workflows are where a pipeline builder earns its complexity.
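Under the hood, that kind of chaining is just function composition. A sketch with stand-in stages (describe_image and translate here are placeholders, not real model calls):

```python
def pipeline(*stages):
    """Compose stages left to right; each stage is a plain function."""
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

# Stand-in stages -- in a real workflow these would call a vision model,
# an OCR pass, a translator, or a TTS engine.
def describe_image(path):
    return f"A description of {path}"

def translate(text):
    return "[fr] " + text

describe_then_translate = pipeline(describe_image, translate)
print(describe_then_translate("chart.png"))  # [fr] A description of chart.png
```

Each stage only has to agree on passing text along, which is what makes the steps swappable.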

Limitations worth knowing

Smaller models miss details. Moondream and even LLaVA 7B will miss fine text in screenshots, misread chart numbers, and sometimes hallucinate details. For text extraction, Apple Vision OCR is more reliable.

Image downscaling. Vision models resize images internally to 336x336 or 672x672 pixels. Fine details below that resolution are lost. Crop to the relevant portion before sending.
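The arithmetic makes the cropping advice concrete. Assuming a model that resizes the long side to 672 pixels (the numbers below are illustrative):

```python
def scaled_size(detail_px, long_side_px, model_side_px=672):
    """Pixels a detail occupies after the image's long side is resized to model_side_px."""
    return detail_px * model_side_px / long_side_px

# A 12 px glyph in a 3024 px-wide Retina screenshot...
print(round(scaled_size(12, 3024), 1))  # 2.7 px -- effectively unreadable
# ...versus the same glyph after cropping to the relevant 800 px region.
print(round(scaled_size(12, 800), 1))   # 10.1 px -- legible
```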

Memory pressure. Vision models are larger than text-only models at the same parameter count because they include a vision encoder. LLaVA 7B uses more memory than a text-only 7B model.

Cloud is still better for hard tasks. GPT-4 Vision and Claude outperform local models on complex document analysis and multi-object reasoning. For quick descriptions and simple Q&A, local models are good enough.

Full article with setup steps and pipeline examples
