Your phone has a camera and a processor powerful enough to run multimodal AI models. You can point it at a receipt, a document, a math problem, or anything else and ask questions about it — without uploading the image to any server.
Off Grid is a free, open-source app that runs vision AI entirely on your Android phone. No internet connection after the initial model download. No account. No data leaving your device. This guide covers how to set it up, which vision models work best, and what to expect on your hardware.
## What Vision AI Can Do on Your Phone
Vision AI means the model can see images, not just read text. You give it a photo and a question, and it answers based on what it sees.
- **Document analysis.** Photograph a receipt, invoice, or contract. Ask the model to extract line items, totals, or specific clauses. No need to type anything out manually.
- **Homework and math.** Point your camera at a math problem or a diagram. The model reads it and walks you through the solution.
- **Code on a screen.** Take a photo of code on a whiteboard or another screen. Ask the model to explain it, find bugs, or convert it to a different language.
- **Scene description.** Photograph anything and ask what it is, what's happening, or for details about specific objects.
- **Accessibility.** Describe images, read text in photos, identify objects. All on device, all private.
## What You Need
**Minimum hardware:** 6GB RAM, ARM64 processor. The smallest vision model (SmolVLM 500M) fits on most modern phones.

**Recommended hardware:** 8GB+ RAM, Snapdragon 8 Gen 2 or newer. This lets you run the 2B+ vision models that produce much more detailed and accurate responses.

**What's different from cloud vision AI:** Cloud services like ChatGPT with vision and Google Lens run massive models on servers. Your phone runs smaller vision-language models (500M to 8B parameters). They're less capable on edge cases but surprisingly good for everyday tasks. The tradeoff is privacy: your images never leave your device.
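The hardware tiers above map naturally to model choices. Here's a minimal sketch of that mapping — the thresholds come from the guidance above, but the function itself is illustrative, not part of Off Grid:

```python
def recommended_model(ram_gb: int) -> str:
    """Pick a vision model tier from available RAM.

    Thresholds follow the hardware guidance above; the mapping
    is an illustrative sketch, not Off Grid's actual logic.
    """
    if ram_gb >= 8:
        return "SmolVLM 2.2B"   # larger model, much more detailed answers
    if ram_gb >= 6:
        return "SmolVLM 500M"   # smallest vision model, fits most modern phones
    return "unsupported"        # below the 6GB minimum

print(recommended_model(8))  # → SmolVLM 2.2B
```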
## Which Vision Models to Use
Off Grid automatically downloads the companion file (mmproj) that vision models need alongside the main model. You don't need to configure anything.
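In GGUF-based stacks, a vision model typically ships as two files: the main language-model weights and a separate multimodal projector (mmproj). A rough sketch of that pairing — the file names and naming convention here are hypothetical examples, since actual names vary by model and quantization:

```python
from pathlib import Path

def vision_model_files(model_dir: str, base_name: str) -> dict:
    """Return the pair of files a GGUF vision model needs.

    File names are illustrative -- real names vary by model and
    quantization, and Off Grid resolves the pair automatically.
    """
    d = Path(model_dir)
    return {
        "model": d / f"{base_name}.Q4_K_M.gguf",       # language-model weights
        "mmproj": d / f"mmproj-{base_name}.f16.gguf",  # vision projector
    }

files = vision_model_files("/sdcard/models", "SmolVLM-500M")
print(files["mmproj"].name)  # → mmproj-SmolVLM-500M.f16.gguf
```

Both files must be present at inference time; the projector alone can't answer questions, and the language model alone can't see.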
- **SmolVLM 500M** — the speed pick. About 600MB total download. Responds in roughly 7 seconds on flagship devices, 15 seconds on mid-range. Good enough for document reading, basic scene description, and text extraction. Start here.
- **SmolVLM 2.2B** — the quality pick for 8GB+ phones. Much more detailed answers. About 10 to 15 seconds on flagship.
- **Qwen3-VL 2B** — strong multilingual vision understanding. If you need to analyze documents in languages other than English, this is the model.
- **Gemma 3n E4B** — Google's mobile-optimized multimodal model. Vision plus audio understanding with selective activation to save memory.
## How It Works
Vision-language models combine two systems: a vision encoder that "sees" the image and converts it into tokens the model understands, and a language model that reasons about those tokens alongside your text prompt.
When you attach an image and send a message, Off Grid loads both the main model and its multimodal projector (mmproj). The image is processed locally through the vision encoder, the resulting tokens are combined with your text prompt, and the language model generates a response. The entire pipeline runs on your phone.
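The pipeline can be sketched in a few lines. The helper names below (`encode_image`, `tokenize`, `generate`) are illustrative stand-ins for the real components, not Off Grid's actual API — the stubs just mirror the data flow described above:

```python
def encode_image(image_bytes: bytes) -> list[int]:
    # Vision encoder stub: turns pixels into "image tokens" the language
    # model understands. Real encoders emit hundreds of tokens per image;
    # here we fake a fixed-size placeholder sequence.
    return [0] * 64

def tokenize(prompt: str) -> list[int]:
    # Text tokenizer stub: one token per word, for illustration only.
    return list(range(len(prompt.split())))

def generate(tokens: list[int]) -> str:
    # Language model stub: in the real app, the model decodes a response
    # conditioned on the combined token sequence.
    return f"<response conditioned on {len(tokens)} tokens>"

def answer(image_bytes: bytes, prompt: str) -> str:
    # 1. Encode the image locally through the vision encoder.
    image_tokens = encode_image(image_bytes)
    # 2. Combine the image tokens with the text prompt's tokens.
    combined = image_tokens + tokenize(prompt)
    # 3. Generate -- the entire pipeline stays on the phone.
    return generate(combined)

print(answer(b"...jpeg bytes...", "Extract the total from this receipt"))
```

The key point the sketch captures: the image never exists anywhere except as bytes and tokens in local memory.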
## Real-World Performance
| Model | Flagship (8 Gen 3) | Mid-range |
|---|---|---|
| SmolVLM 500M | ~7 seconds | ~15 seconds |
| SmolVLM 2.2B | ~12 seconds | ~30 seconds |
| Qwen3-VL 2B | ~15 seconds | ~35 seconds |
These times are for the full inference — from sending the message to seeing the complete response. The first token appears faster.
## Privacy for Sensitive Documents
This is where local vision AI matters most. Think about what you photograph:
Medical documents, insurance forms, tax returns, contracts, personal letters, ID cards, bank statements. Every time you upload one of these to a cloud AI for analysis, a copy exists on someone else's server.
With Off Grid, the image goes from your camera to your phone's RAM to the model and back to your screen. No network request. No server. No copy anywhere you don't control.
Off Grid is open source. Verify the code yourself.
## Tips for Better Results
- **Good lighting matters.** The model can only work with what the camera captures. A well-lit, sharp photo produces much better results than a blurry, dark one.
- **Crop to the relevant area.** If you're analyzing a receipt, crop out everything except the receipt. Less visual noise means more accurate answers.
- **Be specific with your prompt.** "What does this say?" is fine. "Extract all line items with prices from this receipt as a list" is better.
- **Use the right model size.** For simple text extraction, SmolVLM 500M is fast and accurate. For complex reasoning about what's in an image, use a larger model.
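The "be specific" tip generalizes: a prompt that names the document type and the exact fields you want consistently beats "what does this say?". A small helper like this (purely illustrative, not part of Off Grid) shows the pattern:

```python
def extraction_prompt(doc_type: str, fields: list[str]) -> str:
    """Build a specific extraction prompt from a document type and
    a list of target fields. Illustrative helper, not Off Grid API."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following from this {doc_type}: {field_list}. "
        "Return one item per line as 'field: value'."
    )

print(extraction_prompt("receipt", ["merchant name", "date", "total"]))
```

Asking for a fixed output shape ("one item per line") also makes small models noticeably more reliable than open-ended questions do.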
## Getting Started
1. Install Off Grid from the Play Store
2. Download a vision model (SmolVLM 500M to start — about 600MB)
3. Tap the camera icon in chat to capture or select an image
4. Type your question about the image
5. Get your answer — entirely on device
Off Grid also runs text generation, image generation, voice transcription, tool calling, and document analysis. All locally, all in the same app. Check the GitHub for the latest releases.