How to combine on-device inference with cloud sync — without paying a cent in API fees
The Unspoken Reality of AI API Costs
Here’s the moment every indie developer dreads. You’ve shipped your AI feature. Users love it. It’s working. Then you open your billing dashboard.
Every question your users ask costs you money. Every summary generated and every classification run comes straight out of your pocket. You’ve built a successful product, and its very popularity is now draining your resources.
What if there was a way to run powerful AI in your Flutter app with zero ongoing API costs? No per-request charges. No server bills. No privacy risk from data leaving the device.
That’s exactly what the combination of Gemma 4 and Firebase unlocks. Gemma does the thinking, entirely on the user’s phone. Firebase handles persistence and sync. The result is an AI-powered app that scales to thousands of users at near-zero marginal cost.
I’m building my own product on this exact stack. Here’s how it works.
Why cloud AI is a trap for indie developers
Cloud AI APIs are seductive. You make one API call, get a response, ship the feature. But the cost model is brutal for bootstrapped products:
- Costs scale directly with usage — success punishes you
- User data leaves the device — a privacy and trust problem
- No connection means broken features — poor offline experience
- You’re dependent on a third-party API that can change pricing or availability
On-device AI solves all four of these problems at once. The model runs locally, the user’s data never hits a network, it works offline, and you own the stack.
The stack: Gemma 4 + flutter_gemma + Firebase
Three components, each with a clear job:
Gemma 4 (the brain)
Gemma 4 is Google’s latest edge-optimized open model, built on the same research as Gemini but designed to run efficiently on mobile hardware. The E2B variant (Effective 2B parameters) runs in under 2GB of RAM thanks to 4-bit quantization. It’s practical for real devices, not just flagship phones. Critically, Gemma 4 supports native function calling, which means you can wire it directly into your app’s logic without any prompt engineering hacks.
flutter_gemma (the bridge)
The flutter_gemma package wraps the native LiteRT-LM inference engine in a clean Dart API. You get GPU acceleration, streaming responses, multimodal vision support, and function calling, all from Flutter. It supports Android, iOS, Web, and Desktop from a single codebase.
Firebase (the backbone)
Firebase handles everything that needs to live in the cloud: Firestore for storing and syncing AI outputs across devices, Firebase Auth for user identity, and optionally Firebase Storage if you need to persist larger assets. The key insight is that your AI inference never touches Firebase; only the results do.
The architecture: who does what
Here’s the principle that makes this work: AI never touches the network. Only the output does.
The flow looks like this:
- User creates content in the app (a note, a photo, a voice memo)
- Gemma processes it on-device — tagging, summarizing, classifying
- The AI output (not the raw input) is written to Firestore
- Firebase syncs the output to the user’s other devices
The user’s raw thoughts, photos, or voice never leave their phone. The model never makes a network call. But their data is still available everywhere they need it, because Firebase syncs the processed result.
Building it: a private AI notes app
Let’s make this concrete. We’ll build a simple notes app where Gemma automatically tags and summarizes each note on-device, then Firebase syncs the result across the user’s devices. Here’s the step-by-step.
Step 1: Set up flutter_gemma
Add the dependency to your pubspec.yaml:
```yaml
dependencies:
  flutter_gemma: ^0.13.1
  cloud_firestore: ^5.0.0
  firebase_auth: ^5.0.0
```
Download the Gemma 4 E2B model from Hugging Face (you’ll need to accept the license terms), and store your Hugging Face access token securely; never hardcode it. On first launch, the app downloads the model to local storage. This is a one-time operation of around 1.5GB.
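Here’s a minimal sketch of that first-launch install, assuming flutter_gemma’s model-manager API (the `FlutterGemmaPlugin.instance.modelManager` entry point and its network-download method; verify the exact names against the package’s current README):

```dart
import 'package:flutter_gemma/flutter_gemma.dart';

/// One-time setup on first launch: download the model file to local storage.
/// `modelUrl` points at your hosted Gemma .task file; supply your Hugging
/// Face token the way the package docs describe instead of hardcoding it.
Future<void> installModel(String modelUrl) async {
  final modelManager = FlutterGemmaPlugin.instance.modelManager;
  await modelManager.downloadModelFromNetwork(modelUrl);
}
```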
Step 2: Run on-device inference
Initialize the model and run a simple summarization:
```dart
import 'package:flutter_gemma/flutter_gemma.dart';

// Load the model (the first call of a session is the slow one;
// see the cold-start note below).
final model = await FlutterGemma.getActiveModel(
  maxTokens: 512,
  preferredBackend: PreferredBackend.gpu,
);

// A chat session keeps context across turns; we only need one turn here.
final chat = await model.createChat();

await chat.addQueryChunk(
  Message.text(
    text: 'Summarize this note in one sentence and suggest 2-3 tags: '
        '$noteContent',
    isUser: true,
  ),
);

final response = await chat.generateChatResponse();
```
That’s it. The inference runs entirely on-device. No API key, no network call, no cost. The response gives you a summary and tags you can parse and store.
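Parsing is up to you. One low-friction approach (a sketch, not the package’s API) is to tweak the prompt to request JSON, then decode it defensively, since small models occasionally wander off-format:

```dart
import 'dart:convert';

/// Expects a reply shaped like: {"summary": "...", "tags": ["a", "b"]}
/// (adjust the Step 2 prompt to ask for exactly that).
({String summary, List<String> tags}) parseNoteAnalysis(String raw) {
  try {
    // Models often wrap JSON in prose or code fences; grab the
    // outermost {...} span before decoding.
    final start = raw.indexOf('{');
    final end = raw.lastIndexOf('}');
    final map =
        jsonDecode(raw.substring(start, end + 1)) as Map<String, dynamic>;
    return (
      summary: map['summary'] as String,
      tags: (map['tags'] as List).cast<String>(),
    );
  } catch (_) {
    // Fallback: keep the whole reply as the summary rather than losing it.
    return (summary: raw.trim(), tags: const []);
  }
}
```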
Step 3: Write the output to Firestore
Take the AI output and persist it to Firestore:
```dart
import 'package:cloud_firestore/cloud_firestore.dart';
import 'package:firebase_auth/firebase_auth.dart';

await FirebaseFirestore.instance.collection('notes').doc(noteId).set({
  // Optional: include the raw note only if your app needs it in the cloud.
  'originalContent': noteContent,
  'summary': parsedSummary,
  'tags': parsedTags,
  'userId': FirebaseAuth.instance.currentUser!.uid,
  'createdAt': FieldValue.serverTimestamp(),
});
```
Notice what goes to Firestore: the AI-generated summary and tags, not necessarily the raw note. You can store the original too if your app requires it — but the point is you’re in control of what leaves the device.
Step 4: Read it back and sync
Firestore’s real-time listeners handle the sync automatically. Open the app on a second device and your tagged, summarized notes appear instantly — without any polling logic on your part. Firebase does the heavy lifting.
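Under the hood, that’s a standard Firestore snapshot listener. A minimal sketch, assuming the schema from Step 3:

```dart
import 'package:cloud_firestore/cloud_firestore.dart';
import 'package:firebase_auth/firebase_auth.dart';

final uid = FirebaseAuth.instance.currentUser!.uid;

// Emits a fresh snapshot whenever any of the user's devices writes a note,
// including offline writes that flush once connectivity returns.
final notesStream = FirebaseFirestore.instance
    .collection('notes')
    .where('userId', isEqualTo: uid)
    .orderBy('createdAt', descending: true)
    .snapshots();

notesStream.listen((snapshot) {
  for (final doc in snapshot.docs) {
    final data = doc.data();
    print('${data['summary']} (tags: ${data['tags']})');
  }
});
```

One wrinkle: combining `where` with `orderBy` requires a composite index; the first run logs an error containing a console link that creates it for you.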
Real talk: the gotchas
I won’t sell you a perfect picture. Here’s what you’ll actually run into:
Model download size
The Gemma 4 E2B model is ~1.5GB. This is a one-time download, but you need to handle it gracefully. Show a clear progress indicator on first launch. Consider letting users choose to download over WiFi only. The flutter_gemma package provides download progress callbacks — use them.
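For example, assuming the package exposes the download as a percentage stream (as its `downloadModelFromNetworkWithProgress` variant does at the time of writing; double-check the stream’s element type in your version), a first-launch screen is a few lines:

```dart
// First-launch UI: modelUrl is the hosted model file from Step 1.
StreamBuilder<int>(
  stream: FlutterGemmaPlugin.instance.modelManager
      .downloadModelFromNetworkWithProgress(modelUrl),
  builder: (context, snapshot) {
    final pct = snapshot.data ?? 0;
    return Column(
      mainAxisSize: MainAxisSize.min,
      children: [
        LinearProgressIndicator(value: pct / 100),
        const SizedBox(height: 8),
        Text('Preparing your AI: $pct%'),
      ],
    );
  },
)
```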
Cold start time
Loading the model into memory takes a few seconds on the first inference of a session. Show a loading state. After the model is warm, subsequent responses are fast. Set user expectations early — frame it as the app ‘preparing your AI’ rather than ‘loading.’
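One mitigation, sketched with the same call from Step 2: start loading during app startup so the model is usually warm before the user’s first request. (`InferenceModel` and `MyApp` stand in for whatever your version of the package returns and your existing root widget.)

```dart
import 'dart:async';

// `late final` makes this lazy; loading begins the first time
// warmModel is referenced.
late final Future<InferenceModel> warmModel = FlutterGemma.getActiveModel(
  maxTokens: 512,
  preferredBackend: PreferredBackend.gpu,
);

void main() {
  WidgetsFlutterBinding.ensureInitialized();
  unawaited(warmModel); // start loading in the background, don't block launch
  runApp(const MyApp());
}
```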
Know when to use cloud AI instead
On-device Gemma 4 is excellent for targeted tasks: summarization, classification, short generation, tagging. It’s not the right tool for complex multi-step reasoning, very long context windows, or tasks where you need GPT-4-level capability. Scope your AI features accordingly: use Gemma on-device for the 80% of tasks it handles well, and route the genuinely complex queries to a cloud model when needed.
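In code, that split can live behind one seam in your service layer. A sketch, where `runOnDeviceGemma` wraps the Step 2 flow and `askCloudModel` is a hypothetical fallback against whatever cloud API you pick:

```dart
enum AiTask { summarize, classify, tag, complexReasoning }

Future<String> runAi(AiTask task, String input) {
  switch (task) {
    // The 80%: fast, private, free, on-device.
    case AiTask.summarize:
    case AiTask.classify:
    case AiTask.tag:
      return runOnDeviceGemma(input);
    // The 20%: long-context, multi-step work goes to a cloud model.
    case AiTask.complexReasoning:
      return askCloudModel(input); // hypothetical cloud fallback
  }
}
```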
The payoff
Here’s what you get with this architecture:
- Zero ongoing AI cost. No per-request charges, ever. The model is on the user’s device.
- Full privacy. User data is processed locally. You can credibly promise your users their data stays on their phone.
- Works offline. AI features work without a connection. Firebase syncs when connectivity returns.
- Scales for free. Your 1,000th user costs you no more in AI inference than your first.
Firebase gives you the sync and persistence layer without building a backend. Gemma gives you the AI without paying per token. Together, they let you ship an AI-powered product that’s genuinely sustainable for an indie developer.
What I’m building with this
This is the exact architecture I’m using for my own product — built on Flutter, Gemma, and Firebase. I’ll be writing more about specific implementation decisions as I go, including how I’m using Gemma’s multimodal vision support and how I structure Firestore for AI-generated data.
If you’re building something similar — or thinking about it — hit reply. I’d love to hear what you’re working on.
Next up: Adding multimodal vision to your Flutter + Gemma app — letting the model read images directly, entirely on-device.
