Timothy Fosteman

Multimodal Visual Understanding in Swift (aka: "why is this still so hard on-device?")

I’ve been spending a lot of time lately thinking about one thing: how to get good image-to-text understanding running locally, inside the Swift / Apple ecosystem, without it turning into a science project that never ships.

The whole world kinda moved from "computer vision does one thing" (detect faces, read text, classify a cat) to "a model that can look + reason + talk about it." That’s the MLLM / VLM shift. And once that door opened, it became obvious how limiting the classic Vision framework approach is. Vision is fast, and it’s rock solid for specific tasks... but it doesn’t have "world knowledge." It gives you metadata. It doesn’t understand.

So if the goal is: "take an image and infer a high-quality description", and maybe even produce JSON output, in Swift, today? There are basically two roads, and neither one fully gets the job done.

1) The “Apple research demo” road (FastVLM etc.)
2) The “open model ecosystem” road (Gemma 3, Qwen-VL, etc.), which is better quality but is not at all integrated.
Even llama.cpp, at the moment of writing, doesn't support image input.

Frustration comes first, results later.

FastVLM proves it’s possible to run high-res VLM inference on-device (and it can be fast), but the public checkpoints are broken and produce garbagegarbagegarbagegarbagegarbagegarbage. Try it for yourself: https://github.com/apple/ml-fastvlm

Meanwhile, Gemma 3 can be a lot more capable, but getting it into a clean Swift pipeline takes more work, and sometimes it’s like... congrats, now I own the whole stack.

Anyway. Here’s where my head is at.


The "old world": Vision framework was great, but narrow

On Apple platforms, visual understanding used to mean Vision framework requests (faces, text, barcodes, image classification). It’s super practical. But it’s not an open-ended reasoning engine. It won’t give you a rich narrative description. That’s not a criticism; it’s just a specific tool.
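
Here's what I mean, as a tiny sketch (stock Vision API, nothing exotic; the confidence threshold is an arbitrary choice of mine): you get back labels and confidences, never a sentence.

```swift
import Vision
import Foundation

// Classic Vision framework usage: one request, one narrow answer.
// This classifies an image and returns labels; it will never "describe" the scene.
func classify(imageURL: URL) throws -> [String] {
    let handler = VNImageRequestHandler(url: imageURL, options: [:])
    let request = VNClassifyImageRequest()
    try handler.perform([request])
    let observations = request.results ?? []
    return observations
        .filter { $0.confidence > 0.3 }   // arbitrary cutoff, tune to taste
        .map { "\($0.identifier) (\(String(format: "%.2f", $0.confidence)))" }
}
```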

VLMs fix that by projecting visual tokens into the LLM embedding space. So the language model gets to “see” (through the encoder) and then it can talk about it.
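
If you've never peeked inside one, the mechanism is almost disappointingly simple. This is a conceptual toy, not any real model's code (the dimensions, names, and shapes are all made up just to show the idea):

```swift
// The "projector" is a learned linear map from the vision encoder's embedding
// size to the LLM's embedding size. After projection, image tokens sit in the
// same space as text token embeddings and get fed to the LLM as a prefix.
struct Projector {
    let weights: [[Float]]   // [llmDim][visionDim], learned during training

    func project(_ visionToken: [Float]) -> [Float] {
        weights.map { row in zip(row, visionToken).reduce(0) { $0 + $1.0 * $1.1 } }
    }
}

// The LLM then "sees": [projected image tokens] + [text prompt embeddings]
func buildInputSequence(imageTokens: [[Float]],
                        textEmbeddings: [[Float]],
                        projector: Projector) -> [[Float]] {
    imageTokens.map(projector.project) + textEmbeddings
}
```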


FastVLM: the Apple Silicon, research-grade approach

FastVLM is interesting because it’s not just “throw a ViT at it and pray.” ViTs scale badly with resolution (quadratic cost), so most pipelines downsample images to 224/336/etc and then... yeah, good luck reading tiny text or fine details.
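
To put numbers on that "quadratic cost" bit (assuming a typical 14-pixel patch size; the exact size varies by model):

```swift
// Back-of-envelope: why ViT cost explodes with resolution.
// Token count grows with the square of the image side, and self-attention
// cost grows with the square of the token count.
let patch = 14
for side in [224, 336, 672, 1024] {
    let tokens = (side / patch) * (side / patch)
    print("side \(side): \(tokens) visual tokens, attention ~\(tokens * tokens) pairs")
}
// 224 → 256 tokens; 1024 → 5,329 tokens. That's roughly 400× more attention work.
```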

Check out Apple's own blog here: https://machinelearning.apple.com/research/fast-vision-language-models

FastVLM tries to keep resolution high without blowing up latency by using a hybrid backbone (FastViTHD): convolution early, then some transformer blocks. The whole point is: fewer tokens, less waste, faster time-to-first-token.

And honestly, as an engineering direction, I like it. It feels “designed for Apple hardware” instead of “ported from elsewhere.”

So, time for a real question: can I take the FastVLM-style inference engine and pair it with weights that actually reason well? That’s where everything becomes... complicated.

Gemma 3: higher quality multimodal, but now it’s your job to integrate it

Gemma 3 being multimodal (and open weights) is a big deal for on-device dev. The family is bigger (4B/12B/27B, etc.), it’s capable, and it has actual vision support baked in.

Also, Gemma 3 uses SigLIP as the vision encoder (instead of the older CLIP patterns). And the pan-and-scan technique is a real “practical detail” that matters: fixed 896×896 style inputs are nice, until the image is wide, or has tiny details, or is a weird aspect ratio. Pan-and-scan crops the image into multiple segments so the model can “zoom.” It’s basically admitting reality: one crop isn’t enough.
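
To make that concrete, here's my own crude simplification of the idea (not Gemma's exact recipe): tile the image at the model's native input size, crop each tile, resize it, and also keep one downscaled global view so the model gets both context and detail.

```swift
import CoreGraphics

// A rough sketch of the pan-and-scan idea. Each returned rect would be cropped
// out of the original image, resized to the model's input size (e.g. 896×896),
// and encoded as its own set of visual tokens, alongside a global downscaled view.
func panAndScanRects(imageSize: CGSize, tile: CGFloat = 896) -> [CGRect] {
    var rects: [CGRect] = []
    let cols = Int(ceil(imageSize.width / tile))
    let rows = Int(ceil(imageSize.height / tile))
    for r in 0..<rows {
        for c in 0..<cols {
            let origin = CGPoint(x: CGFloat(c) * tile, y: CGFloat(r) * tile)
            let rect = CGRect(origin: origin, size: CGSize(width: tile, height: tile))
            rects.append(rect.intersection(CGRect(origin: .zero, size: imageSize)))
        }
    }
    return rects
}
```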

There’s also that "E2B"-style idea (effective parameter loading / caching tricks), which is exactly the kind of thing that makes or breaks on-device viability. The 4GB of RAM on my iPhone Pro Max tells the story.

But again: great model doesn’t automatically mean great Swift experience. You still need an inference stack that loads it, quantizes it, streams tokens, and doesn’t set the phone on fire.


MLX + Swift: the bridge that actually feels usable

Right now, if I’m being honest, MLX-Swift is the closest thing to a "professional" path for doing this without waiting for Apple to expose a full multimodal native API.

Why did I do it? Because, ahem, I couldn't wait.

The MLX Swift examples have the abstractions you want: load a VLM container, prepare inputs (prompt + images), call generate, stream tokens. No magic. You even get to manage the "end" token. Painstakingly so.
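
Roughly, the flow looks like the sketch below. I'm writing this from memory of the mlx-swift-examples VLM code, and the exact type and method names drift between releases, so treat every identifier here as an assumption rather than gospel:

```swift
import Foundation
import MLXLMCommon
import MLXVLM

// Sketch of the load → prepare → generate → stream loop, based on my reading
// of the MLXVLM examples. Names and signatures may differ in your version.
func describeImage(at url: URL, modelID: String) async throws -> String {
    // 1. Load the model container (downloads / memory-maps the quantized weights).
    let container = try await VLMModelFactory.shared.loadContainer(
        configuration: ModelConfiguration(id: modelID))

    // 2. Prepare inputs (prompt + image) and run generation inside the container.
    return try await container.perform { context in
        let input = try await context.processor.prepare(
            input: UserInput(prompt: "Describe this image in detail.",
                             images: [.url(url)]))

        var output = ""
        _ = try MLXLMCommon.generate(
            input: input,
            parameters: GenerateParameters(temperature: 0.3),
            context: context
        ) { tokens in
            // 3. Stream tokens; you decide when to stop. This is where you end up
            //    hand-managing the "end" condition yourself.
            output = context.tokenizer.decode(tokens: tokens)
            return tokens.count < 512 ? .more : .stop
        }
        return output
    }
}
```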


Foundation Models framework: nice direction, but visual input is still... not really there

Apple’s Foundation Models framework is super promising for text. Type-safe structured outputs, guided generation, etc. But the thing everyone (me) wants is: "let me pass an image to the session like a normal citizen."
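
The text side genuinely is nice. It looks roughly like this (paraphrased from the public API surface as I remember it, so take the exact names with a grain of salt; note there's no image parameter anywhere):

```swift
import FoundationModels

// Guided generation: the model is constrained to produce exactly this shape.
@Generable
struct SceneDescription {
    @Guide(description: "One-sentence summary of the scene")
    var summary: String

    @Guide(description: "Short keyword tags")
    var tags: [String]
}

func describe(_ caption: String) async throws -> SceneDescription {
    let session = LanguageModelSession()
    // Today you can only hand it text. What I want is to hand it the image itself.
    let response = try await session.respond(
        to: "Turn this caption into a structured description: \(caption)",
        generating: SceneDescription.self)
    return response.content
}
```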

Right now it’s not really exposed as a general-purpose image-to-text API in the public SDK (at least not in the way I want it). There are system-level "Visual Intelligence" features, but programmatic access is still limited.

So the Foundation Models framework is like: the future looks bright... but for now, the VLM path is external.


Exporting models to Core ML is still pain

In theory: train/convert/export, done.

In practice: conversion pipelines hit missing ops, unimplemented bitwise nonsense, or weird runtime errors. And that’s the problem: if the conversion breaks on one operator, the entire "Core ML pipeline" becomes a dead end until someone fixes it upstream.

So a lot of times MLX wins simply because it can load weights more directly and avoid the worst of conversion.

Quantization is not optional

If this is going on device, quantization is not "nice to have," it’s a matter of survival. 4-bit, sometimes lower, maybe even deal in booleans, or it won't fit. And it’s not just memory; latency matters too. The phone needs to respond like a product, not like a research notebook.
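
The arithmetic is brutal and worth doing once (rough numbers for a hypothetical 4B-parameter model, ignoring KV cache and activation overhead):

```swift
// Quick memory math: weight storage only, before runtime overhead.
let params = 4_000_000_000.0
let gib = 1_073_741_824.0

let fp16Bytes = params * 2.0     // 2 bytes per weight
let q4Bytes   = params * 0.5     // ~0.5 bytes per weight at 4-bit

print(String(format: "fp16: %.2f GiB", fp16Bytes / gib))  // ≈ 7.45 GiB — dead on arrival on a phone
print(String(format: "4-bit: %.2f GiB", q4Bytes / gib))   // ≈ 1.86 GiB — plausible, with room left for the vision encoder
```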

Also: unified memory is a big deal on Apple Silicon. When the vision encoder and the LLM can share memory and caches cleanly, time-to-first-token gets better. This stuff matters more than people think.


So what’s the practical path today?

If I had to summarize what I’d actually do, in real life:

1) If I need multimodal now: MLX-Swift + a solid open VLM (Gemma 3, Qwen-VL, etc.), quantized. Good luck converting those models, sir. I tip my hat to anyone who's up to try. (Please send me your project if you're by chance doing it open source - I want your weights.)

2) Set up Google Alerts for Foundation Models framework patterns (structured outputs, tool calling), because eventually the native API will get more capable. And you want to be the first one to use it.

3) Don’t bet the whole product on "Core ML conversion will be easy." It ain't.

And yeah, if Apple ships a real multimodal Foundation Models pipeline where images are first-class inputs tomorrow, that’s the end of this moaning blog post and its entire purpose.

Anyway. That’s the state of it from where I’m sitting. It’s a weird time: the capability is there, the hardware is there, but the ecosystem is still stitching together the "last mile" for developers.

Go, go, Apple.
