Mohammed Ali Chherawalla

Posted on Mar 1 • Edited on Mar 2

How to Run LLMs Locally on Your iPhone in 2026 (Completely Offline, No Subscription)

#ios #ai #privacy #llm

Apple's Neural Engine can process 35 trillion operations per second on the A17 Pro. Most of that power sits unused while you pay monthly subscriptions to ask questions on someone else's server.

Off Grid is a free, open-source app that runs large language models directly on your iPhone. No internet after the first download. No iCloud. No Apple Intelligence required. Just your phone and a model.

App Store | GitHub

What You Need

Minimum: iPhone 12 or newer (A14 chip), iOS 17+, 4GB+ RAM. Smaller models (0.6B to 1B) will run fine.

Recommended: iPhone 15 Pro or newer (A17 Pro or later), 8GB RAM. This is where on-device AI gets genuinely useful. 3B to 7B models run smoothly with hardware acceleration via Metal and the Apple Neural Engine.

Storage note: iPhones don't have expandable storage. Models range from 80MB to 4GB+. A 64GB iPhone with lots of photos might not have room for multiple large models. Check your available storage before downloading.

What Off Grid Can Do on iPhone

Six AI capabilities in one app, all running on your phone's silicon:

Text generation. Run Qwen 3, Llama 3.2, Gemma 3, Phi-4, or any GGUF model. Uses llama.cpp via Metal for GPU acceleration. Streaming responses with markdown rendering. 15 to 30 tokens per second on A17 Pro and later.

Image generation. On-device Stable Diffusion through Apple's ml-stable-diffusion pipeline with Core ML and Neural Engine acceleration. 8 to 15 seconds per image on iPhone 15 Pro. 20+ models available.

Vision AI. Attach a photo or use your camera and ask questions about what you see. SmolVLM and Qwen3-VL supported.

Voice transcription. On-device Whisper speech to text. Real-time partial transcription as you speak. No audio leaves your phone.

Tool calling. The model can chain web search, calculator, date/time, and device info together in an automatic loop. Works with models that support function calling format.

Document analysis. Attach PDFs, code files, CSVs, and more to your conversations.

Onboarding	Text Generation	Image Generation
Vision	Attachments

Which Models to Use on iPhone

Off Grid's model browser filters by your device so you never download something that won't run. Here's what works:

iPhone 12/13 (4GB RAM): Qwen 3 0.6B or SmolLM3 360M. Expect 8 to 15 tokens per second. Good for short answers and simple tasks.

iPhone 14/15 (6GB RAM): Qwen 3 1.5B or Phi-4 Mini. Noticeably better quality. 15 to 25 tokens per second with Metal acceleration.

iPhone 15 Pro/16 Pro (8GB RAM): The sweet spot. Llama 3.2 3B, Qwen 3 4B, or Gemma 3 run well. Quality at this size is genuinely useful for drafting, summarization, coding help, and analysis. 20 to 30+ tokens per second.

Quantization: Q4_K_M gives you the best balance of size, speed, and quality. Don't go below Q3 unless storage is very tight.

How iOS Hardware Acceleration Works

Apple's chips have three compute paths and Off Grid uses them automatically:

Metal (GPU): Available on all modern iPhones. Handles general purpose parallel computation. This is what llama.cpp uses for GPU-accelerated text inference.

Apple Neural Engine (ANE): A dedicated AI accelerator. Extremely fast and power efficient. Core ML targets the ANE directly for image generation.

CPU: Always available as a fallback. Slower but works for smaller models.

The advantage of iOS over Android: Apple's hardware and software stack is tightly integrated. If Off Grid works on one iPhone 15 Pro, it works on all of them. No fragmentation.

The KV Cache Trick That Triples Your Speed

Off Grid lets you configure KV cache quantization in settings. The KV cache stores your conversation context. By default it uses f16 (16-bit). Switching to q4_0 (4-bit) roughly triples inference speed with minimal quality impact.

The app nudges you to optimize after your first generation. This is the single biggest performance improvement you can make.

Memory Management on iOS

iOS is more aggressive about killing background apps than Android. Off Grid handles this with lifecycle-independent services. Text and image generation continue running even when you navigate away from the chat screen. But if you leave the app for a long time and iOS reclaims memory, you may need to reload the model.

The RAM budget is tighter on iOS. On an 8GB iPhone, you realistically have 4 to 5GB available. A 7B Q4 model needs about 5.5GB at runtime. It will fit but just barely.

Practical advice: start with a 1.5B to 3B model. If it runs smoothly, try the next size up. If the app closes unexpectedly, the model is too large for your device.

Privacy: Stronger Than Apple Intelligence

Apple Intelligence uses Private Cloud Compute for tasks that exceed on-device capability. Apple says it's end to end encrypted. You're trusting Apple.

Off Grid is private in a stronger sense. There is no cloud component. The computation happens entirely on your phone. No network requests after model download. Verify it yourself: turn on airplane mode and everything works. The code is open source, MIT licensed. No analytics, no telemetry, no accounts.

For people handling sensitive data (medical, legal, financial, proprietary business information, personal journaling), the difference between "a company promises privacy" and "there is no server to send data to" matters.

Getting Started

Install Off Grid from the App Store
Browse recommended models filtered for your device
Download a model over WiFi
Enable airplane mode to verify offline capability
Start chatting

Switch KV cache quantization to q4_0 in settings for the best speed. The quality difference is negligible for most conversations.

What's Coming

Apple's Neural Engine gets more powerful with every chip generation. The A18 has a 16-core Neural Engine. Off Grid ships updates weekly, with tool calling, configurable KV cache, and vision support all added in the last month.

Check the GitHub for the latest releases and the roadmap.

The gap between a 3B model on your iPhone and a 70B model in the cloud is real today. But for the tasks you actually do on your phone, local models are already good enough. And they're getting better every quarter.

Top comments (2)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.