🚀 Introduction
Running large language models on mobile devices has always felt out of reach. Most real-world applications rely heavily on cloud APIs due to hardware limitations.
But what if we could run modern LLMs directly on a phone?
In this post, I’ll share how I got Gemma 4 running locally on an iPhone 13 Pro, and the open-source Swift wrapper that makes it possible.
👉 https://github.com/mylovelycodes/LiteRTLM-Swift
⸻
💡 Why On-Device LLMs Matter
Cloud-based LLMs are powerful, but they come with trade-offs:
• Network latency
• API costs
• Privacy concerns
• No offline support
On-device inference changes the equation:
• ⚡ Lower latency (no network round trips)
• 🔒 Better privacy (data stays on device)
• 📶 Works offline
• 💰 Zero API cost
The challenge? Hardware constraints.
⸻
📱 The Experiment: Gemma 4 on iPhone 13 Pro
I wanted to explore how far we can push mobile hardware.
Surprisingly, with the right setup, it’s possible to run Gemma 4 locally on an iPhone 13 Pro.
Not perfectly—but well enough to be useful.
⸻
🛠️ The Approach
To make this work, I built:
👉 LiteRTLM-Swift
A lightweight Swift interface for running LiteRT-based LLMs on-device.
Key design goals:
• Native Swift API
• Minimal dependencies
• Easy integration into iOS/macOS apps
Instead of building a full framework, I focused on making something simple and practical.
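If the package ships a standard Swift Package Manager manifest (an assumption — check the repo for the actual integration instructions), pulling it into an app would look roughly like this:

```swift
// Package.swift — hypothetical SPM integration; the version number
// and product name are placeholders, not taken from the actual repo.
dependencies: [
    .package(url: "https://github.com/mylovelycodes/LiteRTLM-Swift", from: "0.1.0")
]
```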
⸻
⚙️ How It Works (High-Level)
The system relies on:
• A lightweight runtime (LiteRT)
• Model optimization (quantization, smaller variants)
• Efficient memory handling
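Quantization is what makes the memory side tractable. A back-of-envelope estimate (the helper below is illustrative; a real runtime also needs memory for activations and the KV cache on top of the weights) shows why 4-bit weights matter on a phone with 6 GB of RAM:

```swift
// Rough estimate of the RAM needed just to hold model weights.
// Illustrative helper only — actual runtime memory is higher.
func approxWeightMemoryGB(parameters: Double, bitsPerWeight: Double) -> Double {
    // bits -> bytes -> GiB (1 GiB = 2^30 bytes)
    return parameters * bitsPerWeight / 8.0 / 1_073_741_824.0
}

// A ~2B-parameter model at 4-bit quantization: under 1 GiB of weights.
print(approxWeightMemoryGB(parameters: 2e9, bitsPerWeight: 4))   // ≈ 0.93

// The same model at fp16: nearly 4 GiB — little headroom on a 6 GB device.
print(approxWeightMemoryGB(parameters: 2e9, bitsPerWeight: 16))  // ≈ 3.73
```

The gap between those two numbers is the difference between a model that loads and one that gets killed by the OS.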
At a high level:
1. Load the model into device memory
2. Run inference locally
3. Stream the generated tokens back to the app
Everything happens on-device — no server involved.
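The flow above can be sketched in Swift. To be clear, this is a hypothetical API shape for illustration — the type and method names (`LiteRTModel`, `generate`) and the model filename are assumptions, not the actual LiteRTLM-Swift interface; see the repo for the real API.

```swift
import Foundation

// Hypothetical API sketch — names are illustrative, not the real interface.

// 1. Load the model into device memory
let modelURL = Bundle.main.url(forResource: "gemma-q4", withExtension: "tflite")!
let model = try LiteRTModel(contentsOf: modelURL)

// 2. Run inference locally, 3. stream tokens back to the app
for try await token in model.generate(prompt: "Summarize my notes:") {
    print(token, terminator: "")
}
```

Streaming tokens as an `AsyncSequence` fits the on-device case well: the UI can render partial output immediately instead of waiting for the full completion.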
⸻
📊 Performance & Constraints
Let’s be honest — this is not comparable to cloud-scale inference.
What works well:
• Smaller / quantized models
• Offline inference
• Simple generation tasks
What doesn’t:
• Large models (memory limits)
• Long context windows
• High-throughput workloads
Key bottlenecks:
• 🧠 RAM constraints
• 🌡️ Thermal throttling
• ⏱️ Latency
Still, for many use cases, it’s already usable.
⸻
🧪 Real-World Use Cases
Even with limitations, on-device LLMs unlock interesting possibilities:
• Offline AI assistants
• Private note summarization
• On-device code tools
• AI features inside native apps
These are scenarios where privacy and availability matter more than raw speed.
⸻
🔓 Open Source
I’ve open-sourced the project here:
👉 https://github.com/mylovelycodes/LiteRTLM-Swift
The goal is to make on-device LLMs more accessible for Swift developers.
⸻
🤔 What’s Next
There’s still a lot to explore:
• Better model optimization
• Improved memory efficiency
• Support for more architectures
• Real-world app integrations
On-device AI is still early — but moving fast.
⸻
💬 Final Thoughts
Running Gemma 4 on an iPhone 13 Pro might sound surprising today.
But it’s a glimpse of where things are heading:
👉 AI moving closer to the user
👉 Less reliance on centralized infrastructure
👉 More control for developers
I’d love to hear how others are experimenting with local LLMs — especially on constrained devices.