<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Carol Bolger</title>
    <description>The latest articles on DEV Community by Carol Bolger (@bolgercarol).</description>
    <link>https://dev.to/bolgercarol</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F108863%2F7777cd75-1957-4fee-8741-88cacc6ec39c.png</url>
      <title>DEV Community: Carol Bolger</title>
      <link>https://dev.to/bolgercarol</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bolgercarol"/>
    <language>en</language>
    <item>
      <title>Legible - I built an on-device document helper for immigrants using Gemma 4</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Fri, 15 May 2026 16:56:42 +0000</pubDate>
      <link>https://dev.to/bolgercarol/legible-i-built-an-on-device-document-helper-for-immigrants-using-gemma-4-25fe</link>
      <guid>https://dev.to/bolgercarol/legible-i-built-an-on-device-document-helper-for-immigrants-using-gemma-4-25fe</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg64xh1a82r2iqcwar2s9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg64xh1a82r2iqcwar2s9.jpg" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Imagine receiving an official envelope and not being able to read it. Not because you're careless, but because English isn't your first language. Worse yet, the letter is written in bureaucratic legalese that would confuse many native speakers, too. Is it urgent? Do you need to respond? Who do you call?&lt;br&gt;
That's the problem Legible tries to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;Legible lets you photograph or upload any official document: a tax notice, school letter, eviction warning, lease, utility bill, or government letter. It returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A plain-language explanation in your own language&lt;/li&gt;
&lt;li&gt;A deadline countdown ("You have 23 days to respond")&lt;/li&gt;
&lt;li&gt;Numbered next steps — concrete, specific actions to take&lt;/li&gt;
&lt;li&gt;A glossary of legal or bureaucratic terms found in the document, with simple definitions&lt;/li&gt;
&lt;li&gt;An encrypted local history of past scans, so you can refer back without re-uploading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eleven languages supported: Spanish, Chinese (Simplified), Tagalog, Vietnamese, Arabic, Hindi, Korean, French, Portuguese, Russian, and English.&lt;/p&gt;




&lt;h2&gt;
  
  
  The model choice: why Gemma 4
&lt;/h2&gt;

&lt;p&gt;This is a Build with Gemma 4 submission, and I want to be specific about why Gemma 4 — running locally via Ollama — is the right model here, not just a convenient one.&lt;br&gt;
I ran gemma4:latest, which Ollama resolved to the Gemma 4 Effective 4B (E4B) instruction-tuned model. The privacy and offline arguments below apply equally to any Gemma 4 variant served locally, but document parsing with structured XML output and multilingual generation is exactly the kind of task where the extra capacity of E4B shows. The key point is that nothing leaves the machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy is the core value proposition.&lt;/strong&gt; The documents this app handles contain some of the most sensitive information a person owns: Social Security numbers, tax IDs, case numbers, home addresses, and medical details. These are exactly the fields that get scraped, leaked, or sold when you upload documents to a cloud API.&lt;br&gt;
Gemma 4 runs entirely on the user's machine via Ollama. The image never leaves the device. There's no API key, no usage logs, and no third-party server seeing your user's tax notice. It’s not just a privacy policy, it's a privacy architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal input is essential.&lt;/strong&gt; Real documents aren't clean PDFs. They're photos taken at an angle under fluorescent lighting, or scans of crumpled letters that have been in someone's bag. Gemma 4's native image understanding handles this skillfully. The model reads the document directly from the photo rather than depending on a separate OCR pipeline that might fail on non-Latin scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offline capability matters for this audience.&lt;/strong&gt; Immigrant communities often rely on metered mobile data or shared Wi-Fi. The model downloads once, then runs indefinitely with no internet connection.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h9ep6qklr1m5c0hkuwl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5h9ep6qklr1m5c0hkuwl.jpg" alt=" " width="632" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The backend is a FastAPI app that proxies a streaming request to Ollama's local API. The frontend is a single HTML file — no build step, no framework, no dependencies to install beyond Python. Open the browser at localhost:8000 and it works.&lt;/p&gt;
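
&lt;p&gt;For illustration, here is a rough sketch of what that proxy endpoint can look like. The Ollama URL and the &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;messages&lt;/code&gt;, &lt;code&gt;images&lt;/code&gt; and &lt;code&gt;stream&lt;/code&gt; fields come from Ollama's documented chat API; the route name, form fields, and prompt text are simplified stand-ins, not the exact code in the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the FastAPI proxy: accept an uploaded photo, forward it to the local
# Ollama server as a base64 image, and stream the response chunks back to the browser.
import base64

import httpx
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/chat"

@app.post("/analyze")
async def analyze(image: UploadFile = File(...), language: str = Form("English")):
    image_b64 = base64.b64encode(await image.read()).decode()
    payload = {
        "model": "gemma4:latest",
        "stream": True,
        "messages": [{
            "role": "user",
            "content": f"Explain this document in {language}.",
            "images": [image_b64],  # Ollama accepts base64-encoded images here
        }],
    }

    async def stream():
        # Forward Ollama's newline-delimited JSON chunks as they arrive.
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", OLLAMA_URL, json=payload) as resp:
                async for line in resp.aiter_lines():
                    if line:
                        yield line + "\n"

    return StreamingResponse(stream(), media_type="application/x-ndjson")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;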

&lt;p&gt;&lt;strong&gt;The system prompt&lt;/strong&gt;&lt;br&gt;
Getting structured output from a vision model reliably is mostly a prompting problem. I landed on asking Gemma 4 to respond in a fixed XML schema with these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document type: Tax Notice | Lease Agreement | School Letter | ...&lt;/li&gt;
&lt;li&gt;Deadline: YYYY-MM-DD or none&lt;/li&gt;
&lt;li&gt;Days remaining: integer or none&lt;/li&gt;
&lt;li&gt;Summary: Plain-language summary in {target_language}&lt;/li&gt;
&lt;li&gt;Next steps: 1. action\n2. action...&lt;/li&gt;
&lt;li&gt;Glossary: English Term | Simple explanation in {target_language}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Injecting today's date into the prompt and asking the model to calculate days remaining directly (rather than doing it in code) turned out to be more reliable than parsing a date string and computing the delta separately. Gemma 4 handles this arithmetic accurately.&lt;br&gt;
The full system prompt is pinned to the user's chosen language so that the glossary definitions, explanation, and next steps all arrive in one language, no mixing.&lt;/p&gt;
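
&lt;p&gt;Here is a small sketch of how that prompt assembly can look. The tag names in the schema are illustrative placeholders; the exact tags and wording live in the repo's system prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of building the system prompt: inject today's date and pin the target
# language. The XML tag names below are placeholders, not the repo's exact schema.
from datetime import date

SCHEMA = (
    "&amp;lt;document_type&amp;gt;Tax Notice | Lease Agreement | School Letter | ...&amp;lt;/document_type&amp;gt;\n"
    "&amp;lt;deadline&amp;gt;YYYY-MM-DD or none&amp;lt;/deadline&amp;gt;\n"
    "&amp;lt;days_remaining&amp;gt;integer or none&amp;lt;/days_remaining&amp;gt;\n"
    "&amp;lt;summary&amp;gt;Plain-language summary in {target_language}&amp;lt;/summary&amp;gt;\n"
    "&amp;lt;next_steps&amp;gt;1. action 2. action ...&amp;lt;/next_steps&amp;gt;\n"
    "&amp;lt;glossary&amp;gt;English Term | Simple explanation in {target_language}&amp;lt;/glossary&amp;gt;"
)

def build_system_prompt(target_language):
    today = date.today().isoformat()
    return (
        f"Today's date is {today}. Respond ONLY with the XML schema below, "
        f"with every field written in {target_language}. If the document mentions "
        "a deadline, calculate the number of days remaining from today's date yourself.\n\n"
        + SCHEMA.format(target_language=target_language)
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;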

&lt;p&gt;&lt;strong&gt;Encrypted history&lt;/strong&gt;&lt;br&gt;
Past scans are stored as Fernet-encrypted files in .history/records/. The encryption key lives in .history/key.bin and is generated fresh on first run. Both paths are .gitignored. Deleting the .history/ folder clears everything.&lt;br&gt;
Each record stores a JPEG thumbnail of the document alongside the parsed results, so the history panel shows what the document looked like without re-processing it.&lt;/p&gt;
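
&lt;p&gt;A minimal sketch of that storage layer, using the &lt;code&gt;cryptography&lt;/code&gt; package's Fernet API. The file layout matches the description above; the helper names are illustrative rather than the repo's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the encrypted history: key generated on first run, records encrypted
# with Fernet, everything wiped by deleting the .history/ folder.
import json
from pathlib import Path

from cryptography.fernet import Fernet

HISTORY_DIR = Path(".history")
KEY_FILE = HISTORY_DIR / "key.bin"
RECORDS_DIR = HISTORY_DIR / "records"

def _fernet():
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    if not KEY_FILE.exists():
        KEY_FILE.write_bytes(Fernet.generate_key())  # fresh key on first run
    return Fernet(KEY_FILE.read_bytes())

def save_record(record_id, parsed):
    data = json.dumps(parsed).encode()
    (RECORDS_DIR / f"{record_id}.bin").write_bytes(_fernet().encrypt(data))

def load_record(record_id):
    blob = (RECORDS_DIR / f"{record_id}.bin").read_bytes()
    return json.loads(_fernet().decrypt(blob))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;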

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fati1jsb3o1kr0v1gp5lb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fati1jsb3o1kr0v1gp5lb.jpg" alt=" " width="578" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running it yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Ollama installed and running.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pull the model (one-time, ~5 GB)
ollama pull gemma4:latest

# Install Python dependencies
pip install -r requirements.txt

# Start the app
uvicorn main:app --reload
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Open &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt; — that's it.&lt;br&gt;
Environment variables to override defaults:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r8gqgbdpw2iykxw8d39.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r8gqgbdpw2iykxw8d39.jpg" alt=" " width="686" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;Local multimodal inference has crossed a usability threshold. A year ago, running a vision model locally meant wrestling with quantization, drivers, and memory issues. With Ollama and Gemma 4, &lt;code&gt;ollama pull gemma4:latest&lt;/code&gt; and &lt;code&gt;uvicorn main:app&lt;/code&gt; are the entire setup. That simplicity matters enormously for a tool meant to be shared with non-technical communities.&lt;/p&gt;

&lt;p&gt;Structured output from vision models is still a prompt engineering problem. Gemma 4 followed the XML schema reliably once I made the format explicit and gave it examples of what "none" should look like for optional fields. Before that, it occasionally invented its own tags or wrapped the XML in markdown code fences — easy to handle in the parser, but cleaner to prevent at the prompt level.&lt;/p&gt;
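
&lt;p&gt;If you want the parser to stay defensive anyway, a tiny pre-clean step is enough. This is a sketch of the idea, not the repo's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Strip the markdown fences the model occasionally wraps around its XML
# before handing the text to the real parser.
import re

def strip_code_fences(raw):
    cleaned = raw.strip()
    cleaned = re.sub(r"^```[a-zA-Z]*\s*", "", cleaned)  # leading ```xml / ``` fence
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing ``` fence
    return cleaned.strip()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;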

&lt;p&gt;The privacy architecture is the product. For this use case, "runs locally" isn't a feature — it's the reason the tool is trustworthy enough to use with sensitive documents. That framing changed how I thought about the whole design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mobile-optimised layout for direct phone camera capture&lt;/li&gt;
&lt;li&gt;Support for multi-page documents (PDF input)&lt;/li&gt;
&lt;li&gt;Offline-first PWA packaging so it can be installed like an app&lt;/li&gt;
&lt;li&gt;Optional audio read-aloud of the explanation for users with low literacy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;Source code, setup instructions, and the full system prompt are in the repository -&amp;gt; &lt;a href="https://github.com/RealWorldApplications/legible" rel="noopener noreferrer"&gt;https://github.com/RealWorldApplications/legible&lt;/a&gt;. Questions welcome in the comments.&lt;br&gt;
Built for the DEV × Google Gemma 4 Challenge — Build With Gemma 4 track.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>google</category>
    </item>
    <item>
      <title>Building a Zero-Cost AI Feature in Flutter with Gemma 4 + Firebase</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Mon, 11 May 2026 16:33:36 +0000</pubDate>
      <link>https://dev.to/bolgercarol/building-a-zero-cost-ai-feature-in-flutter-with-gemma-4-firebase-4gkd</link>
      <guid>https://dev.to/bolgercarol/building-a-zero-cost-ai-feature-in-flutter-with-gemma-4-firebase-4gkd</guid>
      <description>&lt;p&gt;How to combine on-device inference with cloud sync — without paying a cent in API fees&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbworo35ns7cm6xwfa0wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbworo35ns7cm6xwfa0wm.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unspoken Reality of AI API Costs
&lt;/h2&gt;

&lt;p&gt;Here’s the moment every indie developer dreads. You’ve shipped your AI feature. Users love it. It’s working. Then you open your billing dashboard.&lt;/p&gt;

&lt;p&gt;Every question your users ask costs you money. Every summary generated, every classification run, you’re paying for it. You have created a successful product, but its popularity is now draining your resources.&lt;/p&gt;

&lt;p&gt;What if there was a way to run powerful AI in your Flutter app with zero ongoing API costs? No per-request charges. No server bills. No privacy risk from data leaving the device.&lt;/p&gt;

&lt;p&gt;That’s exactly what the combination of Gemma 4 and Firebase unlocks. Gemma does the thinking, entirely on the user’s phone. Firebase handles persistence and sync. The result is an AI-powered app that scales to thousands of users at near-zero marginal cost.&lt;/p&gt;

&lt;p&gt;I’m building my own product on this exact stack. Here’s how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cloud AI is a trap for indie developers
&lt;/h2&gt;

&lt;p&gt;Cloud AI APIs are seductive. You make one API call, get a response, ship the feature. But the cost model is brutal for bootstrapped products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Costs scale directly with usage — success punishes you&lt;/li&gt;
&lt;li&gt;User data leaves the device — a privacy and trust problem&lt;/li&gt;
&lt;li&gt;No connection means broken features — poor offline experience&lt;/li&gt;
&lt;li&gt;You’re dependent on a third-party API that can change pricing or availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On-device AI solves all four of these problems at once. The model runs locally, the user’s data never hits a network, it works offline, and you own the stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack: Gemma 4 + flutter_gemma + Firebase
&lt;/h2&gt;

&lt;p&gt;Three components, each with a clear job:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 (the brain)&lt;/strong&gt;&lt;br&gt;
Gemma 4 is Google’s latest edge-optimized open model, built on the same research as Gemini but designed to run efficiently on mobile hardware. The E2B variant (Effective 2B parameters) runs in under 2GB of RAM thanks to 4-bit quantization. It’s practical for real devices, not just flagship phones. Critically, Gemma 4 supports native function calling, which means you can wire it directly into your app’s logic without any prompt engineering hacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flutter_gemma (the bridge)&lt;/strong&gt;&lt;br&gt;
The flutter_gemma package wraps the native LiteRT-LM inference engine in a clean Dart API. You get GPU acceleration, streaming responses, multimodal vision support, and function calling, all from Flutter. It supports Android, iOS, Web, and Desktop from a single codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firebase (the backbone)&lt;/strong&gt;&lt;br&gt;
Firebase handles everything that needs to live in the cloud: Firestore for storing and syncing AI outputs across devices, Firebase Auth for user identity, and optionally Firebase Storage if you need to persist larger assets. The key insight is that your AI inference never touches Firebase; only the results do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: who does what
&lt;/h2&gt;

&lt;p&gt;Here’s the principle that makes this work: AI never touches the network. Only the output does.&lt;/p&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User creates content in the app (a note, a photo, a voice memo)&lt;/li&gt;
&lt;li&gt;Gemma processes it on-device — tagging, summarizing, classifying&lt;/li&gt;
&lt;li&gt;The AI output (not the raw input) is written to Firestore&lt;/li&gt;
&lt;li&gt;Firebase syncs the output to the user’s other devices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The user’s raw thoughts, photos, or voice never leave their phone. The model never makes a network call. But their data is still available everywhere they need it, because Firebase syncs the processed result.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building it: a private AI notes app
&lt;/h2&gt;

&lt;p&gt;Let’s make this concrete. We’ll build a simple notes app where Gemma automatically tags and summarizes each note on-device, then Firebase syncs the result across the user’s devices. Here’s the step-by-step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Set up flutter_gemma&lt;/strong&gt;&lt;br&gt;
Add the dependency to your pubspec.yaml:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dependencies:&lt;br&gt;
  flutter_gemma: ^0.13.1&lt;br&gt;
  cloud_firestore: ^5.0.0&lt;br&gt;
  firebase_auth: ^5.0.0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Download the Gemma 4 E2B model from Hugging Face (you’ll need to accept the license terms). Store your HuggingFace token securely; never hardcode it. On first launch, the app downloads the model to local storage. This is a one-time operation of around 1.5GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Run on-device inference&lt;/strong&gt;&lt;br&gt;
Initialize the model and run a simple summarization:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;final model = await FlutterGemma.getActiveModel(&lt;br&gt;
  maxTokens: 512,&lt;br&gt;
  preferredBackend: PreferredBackend.gpu,&lt;br&gt;
);&lt;br&gt;
final chat = await model.createChat();&lt;br&gt;
await chat.addQueryChunk(&lt;br&gt;
  Message.text(&lt;br&gt;
    text: 'Summarize this note in one sentence and suggest 2-3 tags: $noteContent',&lt;br&gt;
    isUser: true,&lt;br&gt;
  ),&lt;br&gt;
);&lt;br&gt;
final response = await chat.generateChatResponse();&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s it. The inference runs entirely on-device. No API key, no network call, no cost. The response gives you a summary and tags you can parse and store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Write the output to Firestore&lt;/strong&gt;&lt;br&gt;
Take the AI output and persist it to Firestore:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;await FirebaseFirestore.instance&lt;br&gt;
  .collection('notes')&lt;br&gt;
  .doc(noteId)&lt;br&gt;
  .set({&lt;br&gt;
    'originalContent': noteContent, // stays on-device or stored per your preference&lt;br&gt;
    'summary': parsedSummary,&lt;br&gt;
    'tags': parsedTags,&lt;br&gt;
    'userId': FirebaseAuth.instance.currentUser!.uid,&lt;br&gt;
    'createdAt': FieldValue.serverTimestamp(),&lt;br&gt;
  });&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Notice what goes to Firestore: the AI-generated summary and tags, not necessarily the raw note. You can store the original too if your app requires it — but the point is you’re in control of what leaves the device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Read it back and sync&lt;/strong&gt;&lt;br&gt;
Firestore’s real-time listeners handle the sync automatically. Open the app on a second device and your tagged, summarized notes appear instantly — without any polling logic on your part. Firebase does the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real talk: the gotchas
&lt;/h2&gt;

&lt;p&gt;I won’t sell you a perfect picture. Here’s what you’ll actually run into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model download size&lt;/strong&gt;&lt;br&gt;
The Gemma 4 E2B model is ~1.5GB. This is a one-time download, but you need to handle it gracefully. Show a clear progress indicator on first launch. Consider letting users choose to download over WiFi only. The flutter_gemma package provides download progress callbacks — use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold start time&lt;/strong&gt;&lt;br&gt;
Loading the model into memory takes a few seconds on the first inference of a session. Show a loading state. After the model is warm, subsequent responses are fast. Set user expectations early — frame it as the app ‘preparing your AI’ rather than ‘loading.’&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know when to use cloud AI instead&lt;/strong&gt;&lt;br&gt;
On-device Gemma 4 is excellent for targeted tasks: summarization, classification, short generation, tagging. It’s not the right tool for complex multi-step reasoning, very long context windows, or tasks where you need GPT-4-level capability. Design your AI features to scope appropriately — use Gemma on-device for the 80% of tasks it handles well, and route complex queries to a cloud model if needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The payoff
&lt;/h2&gt;

&lt;p&gt;Here’s what you get with this architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero ongoing AI cost.&lt;/strong&gt; No per-request charges, ever. The model is on the user’s device.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full privacy.&lt;/strong&gt; User data is processed locally. You can credibly promise your users their data stays on their phone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Works offline.&lt;/strong&gt; AI features work without a connection. Firebase syncs when connectivity returns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scales for free.&lt;/strong&gt; Your 1,000th user costs you no more in AI inference than your first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Firebase gives you the sync and persistence layer without building a backend. Gemma gives you the AI without paying per token. Together, they let you ship an AI-powered product that’s genuinely sustainable for an indie developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I’m building with this
&lt;/h2&gt;

&lt;p&gt;This is the exact architecture I’m using for my own product — built on Flutter, Gemma, and Firebase. I’ll be writing more about specific implementation decisions as I go, including how I’m using Gemma’s multimodal vision support and how I structure Firestore for AI-generated data.&lt;/p&gt;

&lt;p&gt;If you’re building something similar — or thinking about it — hit reply. I’d love to hear what you’re working on.&lt;/p&gt;

&lt;p&gt;Next up: Adding multimodal vision to your Flutter + Gemma app — letting the model read images directly, entirely on-device.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mobile</category>
      <category>firebase</category>
      <category>flutter</category>
    </item>
    <item>
      <title>On-device or cloud? Building hybrid AI inference into your Android app with Firebase AI Logic</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Wed, 29 Apr 2026 22:12:36 +0000</pubDate>
      <link>https://dev.to/bolgercarol/on-device-or-cloud-building-hybrid-ai-inference-into-your-android-app-with-firebase-ai-logic-3p1i</link>
      <guid>https://dev.to/bolgercarol/on-device-or-cloud-building-hybrid-ai-inference-into-your-android-app-with-firebase-ai-logic-3p1i</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Not every prompt needs the cloud
&lt;/h2&gt;

&lt;p&gt;Every time a user taps a button in your Android app and Gemini responds, something happens in the background you might not think about: a round trip to Google's servers. Data leaves the device, gets processed in the cloud, and comes back. For most prompts, that's fine. But what about a health journaling app where the prompt contains symptoms? A notes app where the query is someone's private thought? Or a user on a shaky connection in the middle of nowhere?&lt;/p&gt;

&lt;p&gt;The assumption baked into most AI-powered apps is that inference lives in the cloud. That made sense when on-device models were too limited to be useful. That assumption is now worth revisiting.&lt;/p&gt;

&lt;p&gt;At Google Cloud Next '26, Firebase announced hybrid inference for Firebase AI Logic on Android — currently experimental, powered by Gemini Nano via ML Kit's Prompt API under the hood. The idea is straightforward: run inference locally on the device when it can handle it, and fall back to cloud-hosted Gemini when it can't. From your Kotlin code, the API looks nearly identical either way. You configure a preference, write a prompt, get a response. The SDK figures out the routing.&lt;/p&gt;

&lt;p&gt;This article walks through how hybrid inference works, why it matters for Android developers, and how to wire it into a real app today — including the honest caveats you won't find in the announcement post.&lt;/p&gt;




&lt;h2&gt;
  
  
  How hybrid inference works
&lt;/h2&gt;

&lt;p&gt;At its core, hybrid inference is a routing decision the SDK makes on your behalf. When your app sends a prompt, Firebase AI Logic checks whether the device supports Gemini Nano and has sufficient resources to run it. If yes, inference runs locally — never leaving the device. If not, the request is forwarded to cloud-hosted Gemini, exactly as it would in a standard setup.&lt;/p&gt;

&lt;p&gt;The routing behaviour is controlled by four &lt;code&gt;InferenceMode&lt;/code&gt; values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PREFER_ON_DEVICE&lt;/code&gt;&lt;/strong&gt; — try on-device first, fall back to cloud if unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PREFER_IN_CLOUD&lt;/code&gt;&lt;/strong&gt; — try cloud first, fall back to on-device if offline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ONLY_ON_DEVICE&lt;/code&gt;&lt;/strong&gt; — on-device only; throws an exception if unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ONLY_IN_CLOUD&lt;/code&gt;&lt;/strong&gt; — cloud only; throws an exception if offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most apps &lt;code&gt;PREFER_ON_DEVICE&lt;/code&gt; is the right starting point. It gives you the speed and privacy benefits of local inference where available, without ever leaving users stranded.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of it like a CDN for model inference. The SDK picks the most capable compute available at the moment the request is made — and your Kotlin code stays the same regardless of which path runs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Under the hood, on-device inference uses ML Kit's Prompt API running Gemini Nano — a smaller, quantized model optimized for mobile hardware, not the full cloud Gemini model. That trade-off is worth being upfront about: on-device responses will be faster and free of API cost, but for complex reasoning tasks, cloud Gemini produces higher quality output. Hybrid inference gives the SDK the ability to make that call at runtime rather than forcing you to hardcode it at build time.&lt;/p&gt;

&lt;p&gt;Three reasons this matters for Android developers in particular:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; On-device inference removes the network round trip entirely. For short prompts on capable hardware, responses feel near-instant — which opens up UX patterns that feel too slow over the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy.&lt;/strong&gt; When inference runs locally, the prompt never leaves the device. For apps handling sensitive input — health data, personal notes, financial details — that's a meaningful architectural guarantee, not just a talking point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Every request handled on-device is a Gemini API call you're not making. At scale, or in apps with high inference volume, this adds up — especially for short, repetitive prompts.&lt;/p&gt;

&lt;p&gt;One important note: the on-device model isn't bundled with your APK — it downloads in the background after first launch via ML Kit. Until it's cached, all requests fall back to cloud. For apps where on-device is a hard requirement rather than a preference, you'll want to gate the feature on download completion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building MemoMind: record, summarize, save
&lt;/h2&gt;

&lt;p&gt;We're going to build a focused Android app called MemoMind that records a voice memo, transcribes it using Android's built-in speech recognizer, runs a Gemini summary using hybrid inference, and saves everything to Firestore — including which backend (on-device or cloud) handled the request. By the end you'll have something genuinely usable, not just a proof of concept.&lt;/p&gt;

&lt;p&gt;The full flow: the user taps record, speaks, taps stop. The app transcribes the audio using &lt;code&gt;SpeechRecognizer&lt;/code&gt;, sends the transcript to Firebase AI Logic with hybrid inference configured, parses the structured JSON response into a summary and action items, then writes the result to Firestore.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;The full source for MemoMind is on GitHub at &lt;strong&gt;&lt;a href="https://github.com/RealWorldApplications/memo-mind-post" rel="noopener noreferrer"&gt;https://github.com/RealWorldApplications/memo-mind-post&lt;/a&gt;&lt;/strong&gt;. You can clone it and follow along, or build from scratch using the steps below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tested with:&lt;/strong&gt; &lt;code&gt;firebase-ai:17.11.0&lt;/code&gt;, &lt;code&gt;firebase-ai-ondevice:16.0.0-beta01&lt;/code&gt; on a Pixel 9 running Android 15. Import paths and model names for experimental APIs can shift between releases — treat the GitHub repo as the authoritative reference if anything here doesn't compile.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll need a Firebase project with the Gemini Developer API enabled under Build → AI Logic, and Firestore enabled. If you're starting from scratch, the &lt;a href="https://firebase.google.com/docs/android/setup" rel="noopener noreferrer"&gt;Firebase Android setup guide&lt;/a&gt; covers the project creation steps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; After cloning, add your own &lt;code&gt;google-services.json&lt;/code&gt; to the &lt;code&gt;app/&lt;/code&gt; directory — it's gitignored in the repo. Download it from your Firebase project console under Project settings → Your apps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 1 — Add dependencies
&lt;/h3&gt;

&lt;p&gt;Hybrid inference requires two separate libraries — the standard Firebase AI library and the on-device extension. Note that &lt;strong&gt;&lt;code&gt;firebase-ai-ondevice&lt;/code&gt; is not yet in the Firebase Android BoM&lt;/strong&gt;, so you need to pin the version explicitly.&lt;/p&gt;

&lt;p&gt;You also need the Kotlin serialization plugin (used in &lt;code&gt;MemoService&lt;/code&gt; for JSON parsing) and &lt;code&gt;material-icons-extended&lt;/code&gt; for the &lt;code&gt;Mic&lt;/code&gt; and &lt;code&gt;Stop&lt;/code&gt; icons — both are easy to miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// build.gradle.kts (app level)&lt;/span&gt;
&lt;span class="nf"&gt;plugins&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ... your existing plugins ...&lt;/span&gt;
    &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.jetbrains.kotlin.plugin.serialization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="s"&gt;"2.0.21"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Standard Firebase AI Logic&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.google.firebase:firebase-ai:17.11.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// On-device extension — NOT in the BoM yet, pin manually&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.google.firebase:firebase-ai-ondevice:16.0.0-beta01"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Firestore&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.google.firebase:firebase-firestore:25.1.1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Compose&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.compose:compose-bom:2024.09.00"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.compose.ui:ui"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.compose.material3:material3"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.compose.material:material-icons-extended"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// Mic/Stop icons&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.activity:activity-compose:1.9.2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// Coroutines&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"androidx.lifecycle:lifecycle-viewmodel-compose:2.8.6"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;// JSON parsing — used in MemoService&lt;/span&gt;
    &lt;span class="nf"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also add microphone permission to &lt;code&gt;AndroidManifest.xml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;uses-permission&lt;/span&gt; &lt;span class="na"&gt;android:name=&lt;/span&gt;&lt;span class="s"&gt;"android.permission.RECORD_AUDIO"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2 — Define the data model
&lt;/h3&gt;

&lt;p&gt;Before any AI or UI code, define a clean data model. Gemini will return a structured JSON response that maps directly to this class, and Firestore will store it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;Memo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;inferredBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// "on_device" or "cloud"&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;currentTimeMillis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;toMap&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"transcript"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"summary"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"actionItems"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"inferredBy"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;inferredBy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"createdAt"&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;createdAt&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;inferredBy&lt;/code&gt; field is what powers the backend indicator chip in the UI. We'll populate it based on which path the SDK actually took.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Configure Firebase AI Logic with hybrid inference
&lt;/h3&gt;

&lt;p&gt;This is the heart of the tutorial. The &lt;code&gt;@OptIn&lt;/code&gt; annotation is required because hybrid inference is currently experimental. Beyond that, notice how little the setup differs from standard Firebase AI — the only meaningful addition is &lt;code&gt;onDeviceConfig&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.Firebase&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.ai.ai&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.ai.InferenceMode&lt;/span&gt;        &lt;span class="c1"&gt;// not .ondevice — re-exported to root package&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.ai.OnDeviceConfig&lt;/span&gt;        &lt;span class="c1"&gt;// not .ondevice — re-exported to root package&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.ai.type.GenerativeBackend&lt;/span&gt; &lt;span class="c1"&gt;// lives in .type, not root&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.google.firebase.ai.type.PublicPreviewAPI&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;kotlinx.serialization.json.Json&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;kotlinx.serialization.json.jsonArray&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;kotlinx.serialization.json.jsonObject&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;kotlinx.serialization.json.jsonPrimitive&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MemoService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Experimental opt-in required for hybrid inference&lt;/span&gt;
    &lt;span class="nd"&gt;@OptIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PublicPreviewAPI&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Firebase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeBackend&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;googleAI&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;modelName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gemini-3-flash-preview"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;onDeviceConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OnDeviceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InferenceMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PREFER_ON_DEVICE&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@OptIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PublicPreviewAPI&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;SummaryResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
            You are a memo summarizer. Given the transcript below, respond ONLY
            with a valid JSON object — no markdown, no explanation. Use this shape:
            {
              "summary": "2-sentence summary of the memo",
              "actionItems": ["item one", "item two"],
              "inferredBy": "on_device"
            }

            Set "inferredBy" to "on_device" if you are running locally,
            or "cloud" if you are a cloud-hosted model.

            Transcript:
            $transcript
        """&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimIndent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;raw&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;parseResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;parseResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;SummaryResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;cleaned&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"```
&lt;/span&gt;&lt;span class="p"&gt;{%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="p"&gt;%}&lt;/span&gt;
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="s"&gt;", "")
&lt;/span&gt;                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"
&lt;/span&gt;&lt;span class="p"&gt;{%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="p"&gt;%}&lt;/span&gt;
&lt;span class="err"&gt;```&lt;/span&gt;&lt;span class="s"&gt;", "")
&lt;/span&gt;                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseToJsonElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;jsonObject&lt;/span&gt;
            &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;jsonPrimitive&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;actionItems&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"actionItems"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;jsonArray&lt;/span&gt;
                    &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jsonPrimitive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"inferredBy"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;jsonPrimitive&lt;/span&gt;&lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;"cloud"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actionItems&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cloud"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;data class&lt;/span&gt; &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;,&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;inferredBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;companion&lt;/span&gt; &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SummaryResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;actionItems&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;emptyList&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cloud"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A note on inferredBy:&lt;/strong&gt; The Firebase AI Logic Android SDK doesn't yet expose a property to read which backend handled a response after the fact. As a practical workaround, we ask the model to self-report in the JSON. Treat this as a heuristic rather than a guarantee: the model has no direct way to inspect where it's actually running, so verify the behaviour with your own testing and adjust if needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 4 — Transcribe with Android's SpeechRecognizer
&lt;/h3&gt;

&lt;p&gt;Android's built-in &lt;code&gt;SpeechRecognizer&lt;/code&gt; handles transcription entirely on-device using the platform's native speech engine. No third-party package needed, and audio never leaves the phone at this stage.&lt;/p&gt;

&lt;p&gt;The key pattern is wrapping &lt;code&gt;SpeechRecognizer&lt;/code&gt;'s listener callbacks in &lt;code&gt;suspendCancellableCoroutine&lt;/code&gt; so they integrate cleanly with the coroutine-based ViewModel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TranscriptionService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;suspend&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;suspendCancellableCoroutine&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;continuation&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;recognizer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SpeechRecognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createSpeechRecognizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;intent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RecognizerIntent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ACTION_RECOGNIZE_SPEECH&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;putExtra&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RecognizerIntent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EXTRA_LANGUAGE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;RecognizerIntent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LANGUAGE_MODEL_FREE_FORM&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;putExtra&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RecognizerIntent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EXTRA_MAX_RESULTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;recognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRecognitionListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;object&lt;/span&gt; &lt;span class="err"&gt;: &lt;/span&gt;&lt;span class="nc"&gt;RecognitionListener&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Bundle&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
                    &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;getStringArrayList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SpeechRecognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RESULTS_RECOGNITION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;firstOrNull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
                &lt;span class="n"&gt;continuation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;recognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;continuation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resumeWithException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Speech recognition error: $error"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;recognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="c1"&gt;// RecognitionListener requires several no-op overrides — see full source on GitHub&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;recognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startListening&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;continuation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invokeOnCancellation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;recognizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SpeechRecognizer&lt;/code&gt; must be created and used on the main thread. Because &lt;code&gt;viewModelScope&lt;/code&gt; runs on &lt;code&gt;Dispatchers.Main&lt;/code&gt;, calling &lt;code&gt;transcribe()&lt;/code&gt; from the ViewModel is safe without any explicit dispatcher switch.&lt;/p&gt;
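
&lt;p&gt;If you ever trigger transcription from a coroutine that is not already on the main dispatcher (say, from a repository running on &lt;code&gt;Dispatchers.IO&lt;/code&gt;), hop to Main explicitly. A minimal sketch; &lt;code&gt;transcribeFromAnywhere&lt;/code&gt; is a hypothetical helper, not something in the MemoMind repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Sketch only: SpeechRecognizer must be created and driven on the main thread,
// so callers on a background dispatcher should switch before calling transcribe().
suspend fun transcribeFromAnywhere(service: TranscriptionService): String =
    withContext(Dispatchers.Main) {
        service.transcribe()
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;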

&lt;h3&gt;
  
  
  Step 5 — ViewModel: wire it all together
&lt;/h3&gt;

&lt;p&gt;The ViewModel exposes a &lt;code&gt;MemoUiState&lt;/code&gt; sealed class (&lt;code&gt;Idle&lt;/code&gt;, &lt;code&gt;Recording&lt;/code&gt;, &lt;code&gt;Processing&lt;/code&gt;, &lt;code&gt;Done&lt;/code&gt;, &lt;code&gt;Error&lt;/code&gt;) as a &lt;code&gt;StateFlow&lt;/code&gt; and orchestrates the three-stage pipeline in &lt;code&gt;startRecording()&lt;/code&gt;. The full class including the &lt;code&gt;ViewModelProvider.Factory&lt;/code&gt; is in &lt;a href="//[YOUR_GITHUB_LINK]/blob/main/app/src/main/java/com/example/memomind/MemoViewModel.kt"&gt;&lt;code&gt;MemoViewModel.kt&lt;/code&gt;&lt;/a&gt; — the pipeline itself is the part worth reading here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;startRecording&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_uiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoUiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Recording&lt;/span&gt;

    &lt;span class="n"&gt;viewModelScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcriptionService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// on-device speech engine&lt;/span&gt;
            &lt;span class="n"&gt;_uiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoUiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Processing&lt;/span&gt;

            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;result&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memoService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;// hybrid inference&lt;/span&gt;

            &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;memo&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Memo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;firestore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"memos"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;document&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;actionItems&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inferredBy&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;firestore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"memos"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toMap&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;await&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;_uiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoUiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;_uiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoUiState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;?:&lt;/span&gt; &lt;span class="s"&gt;"Unknown error"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines do the meaningful work: &lt;code&gt;transcribe()&lt;/code&gt;, &lt;code&gt;summarize()&lt;/code&gt;, and &lt;code&gt;set()&lt;/code&gt;. Everything else is state management.&lt;/p&gt;
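
&lt;p&gt;For context, the types the pipeline touches can be reconstructed from the description above. A minimal sketch; the definitions in the repo may differ in detail:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;// Sketch of the state and data types referenced by startRecording().
sealed class MemoUiState {
    object Idle : MemoUiState()
    object Recording : MemoUiState()
    object Processing : MemoUiState()
    data class Done(val memo: Memo) : MemoUiState()
    data class Error(val message: String) : MemoUiState()
}

data class Memo(
    val id: String,
    val transcript: String,
    val summary: String,
    val actionItems: List&lt;String&gt;,
    val inferredBy: String // "on_device" or "cloud"
) {
    // Firestore-friendly map consumed by set() in startRecording()
    fun toMap(): Map&lt;String, Any&gt; = mapOf(
        "transcript" to transcript,
        "summary" to summary,
        "actionItems" to actionItems,
        "inferredBy" to inferredBy
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;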

&lt;h3&gt;
  
  
  Step 6 — Compose UI with the backend indicator chip
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MemoScreen&lt;/code&gt; collects &lt;code&gt;uiState&lt;/code&gt; and renders a &lt;code&gt;FilledIconButton&lt;/code&gt; that toggles between mic and stop, a status label, a &lt;code&gt;CircularProgressIndicator&lt;/code&gt; during summarization, and the &lt;code&gt;SummaryCard&lt;/code&gt; once done. The full screen composable is in &lt;a href="//[YOUR_GITHUB_LINK]/blob/main/app/src/main/java/com/example/memomind/ui/MemoScreen.kt"&gt;&lt;code&gt;MemoScreen.kt&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
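
&lt;p&gt;The state-handling skeleton of that screen is short enough to show here. A sketch, trimmed to the branching that matters; the real composable adds layout, padding, and the record button's toggle behavior:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import androidx.compose.material3.CircularProgressIndicator
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.runtime.collectAsState
import androidx.compose.runtime.getValue

// Sketch of MemoScreen's state handling; the branch bodies are placeholders.
@Composable
fun MemoScreen(viewModel: MemoViewModel) {
    // Collect the ViewModel's StateFlow as Compose state; recomposes on each transition.
    val uiState by viewModel.uiState.collectAsState()

    when (val state = uiState) {
        MemoUiState.Idle, MemoUiState.Recording -&gt; { /* mic/stop FilledIconButton + status label */ }
        MemoUiState.Processing -&gt; CircularProgressIndicator()
        is MemoUiState.Done -&gt; SummaryCard(memo = state.memo, onReset = { /* back to Idle */ })
        is MemoUiState.Error -&gt; Text(text = state.message)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;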

&lt;p&gt;The part worth looking at closely is &lt;code&gt;SummaryCard&lt;/code&gt; — specifically the backend indicator chip, which is the whole point of exposing &lt;code&gt;inferredBy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Composable&lt;/span&gt;
&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;SummaryCard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Memo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;onReset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Unit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoundedCornerShape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

            &lt;span class="c1"&gt;// Summary header + backend chip&lt;/span&gt;
            &lt;span class="nc"&gt;Row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;verticalAlignment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Alignment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CenterVertically&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Summary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;titleMedium&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;// Backend indicator chip&lt;/span&gt;
                &lt;span class="nc"&gt;Surface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RoundedCornerShape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"on_device"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xFFE1F5EE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;else&lt;/span&gt;
                        &lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xFFE6F1FB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"on_device"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="s"&gt;"On-device"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s"&gt;"Cloud"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;horizontal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vertical&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;labelSmall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inferredBy&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"on_device"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xFF085041&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;else&lt;/span&gt;
                            &lt;span class="nc"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0xFF0C447C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="nc"&gt;Spacer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bodyMedium&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;lineHeight&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bodyMedium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineHeight&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isNotEmpty&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;Spacer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Action items"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;titleMedium&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nc"&gt;Spacer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;memo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actionItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
                    &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"• $item"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;style&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MaterialTheme&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;typography&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bodyMedium&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertical&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="nc"&gt;Spacer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nc"&gt;OutlinedButton&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;onClick&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;onReset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;modifier&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Modifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillMaxWidth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Record another"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7 — Run it and watch the chip
&lt;/h3&gt;

&lt;p&gt;Build and run on a physical Android device that supports Gemini Nano — a Pixel 6 or newer is a safe bet. Record a short memo, something like: &lt;em&gt;"remind me to email Sarah about the Q2 report and book a dentist appointment."&lt;/em&gt; Stop and watch the card appear.&lt;/p&gt;

&lt;p&gt;On the first run you'll likely see "Cloud" — the on-device Gemini Nano model downloads in the background after first use. Record a second memo and you should see the chip flip to "On-device", with a noticeably faster response time.&lt;/p&gt;

&lt;p&gt;That chip is the screenshot worth capturing for your article header. A real summary card showing "On-device" with action items is worth more than any diagram.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this actually means for Android developers
&lt;/h2&gt;

&lt;p&gt;After building MemoMind, here is what stands out: genuine reasons to be excited about hybrid inference, and honest caveats worth knowing before you architect something around it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What works well
&lt;/h3&gt;

&lt;p&gt;The API design is the real win here. Firebase's decision to wrap hybrid routing in &lt;code&gt;OnDeviceConfig&lt;/code&gt; rather than forcing you to maintain two separate model instances means you're not writing conditional execution paths throughout your codebase. The SDK absorbs the routing complexity. For a feature that's still experimental, the ergonomics are surprisingly clean.&lt;/p&gt;

&lt;p&gt;The privacy story is also more meaningful than it might first appear. When inference runs locally, it's not just that data doesn't get logged somewhere — it's that the architecture of your app changes. You can make stronger guarantees to users, design features that handle genuinely sensitive input, and avoid the legal grey areas that come with sending personal data to a third-party API. For anyone building in health, fitness, or productivity, that's a real design unlock.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The backend indicator chip in MemoMind isn't just a demo trick. In a production app, surfacing this to users — even subtly — builds a kind of trust that's hard to communicate through a privacy policy alone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Current limitations worth knowing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It's experimental.&lt;/strong&gt; The &lt;code&gt;@OptIn(PublicPreviewAPI::class)&lt;/code&gt; annotation isn't just boilerplate — it's a signal that the API surface can change in backwards-incompatible ways without deprecation notice. Don't build a production release around this without a contingency plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini Nano capability gap.&lt;/strong&gt; The on-device model is smaller and quantized. For MemoMind's use case — short transcripts, structured JSON output — it performs well. For complex reasoning, longer context, or nuanced instruction-following, you'll notice the quality gap compared to cloud Gemini. Know your prompt's complexity profile before relying on on-device quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model download on first run.&lt;/strong&gt; Gemini Nano downloads in the background after first launch via ML Kit. Until it's cached, every request goes to cloud. For apps where on-device is a hard privacy requirement rather than a preference, you'll need to listen to the model download state and gate the feature accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Device support is not universal.&lt;/strong&gt; On-device inference requires a device that supports Gemini Nano. Pixel 6 and newer, and a growing range of Samsung devices, qualify — but this is not all Android devices. Your &lt;code&gt;PREFER_ON_DEVICE&lt;/code&gt; config will silently fall back to cloud on unsupported hardware, which is fine for most cases but worth tracking in your analytics.&lt;/p&gt;
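
&lt;p&gt;If you do want that visibility, recording the backend per memo is a one-liner at save time. A sketch using the Firebase Analytics KTX helpers (assumes you add the &lt;code&gt;firebase-analytics-ktx&lt;/code&gt; dependency; the event and parameter names here are my own, not anything MemoMind ships with):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import com.google.firebase.analytics.ktx.analytics
import com.google.firebase.analytics.ktx.logEvent
import com.google.firebase.ktx.Firebase

// Sketch: log which backend actually served each summary so the on-device vs. cloud
// split is visible in production. Event and parameter names are arbitrary choices.
fun logInferenceBackend(inferredBy: String) {
    Firebase.analytics.logEvent("memo_summarized") {
        param("inferred_by", inferredBy) // "on_device" or "cloud"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;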

&lt;p&gt;&lt;strong&gt;Self-reporting inferredBy is a workaround.&lt;/strong&gt; Asking the model to report its own execution context in the JSON works in practice but isn't a guaranteed contract. The official SDK doesn't yet expose a post-response property for which backend ran. Watch the &lt;code&gt;firebase-ai-ondevice&lt;/code&gt; changelog for when this is added properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The bigger picture
&lt;/h3&gt;

&lt;p&gt;Hybrid inference is one piece of a broader shift in what mobile AI can look like. The ability to run meaningful inference locally — even in a limited form — changes the kinds of apps you can design. Features that felt too latency-sensitive for a cloud round trip, or too privacy-sensitive to send off-device, become viable. The on-device model will improve over time. Device support will grow. The API will stabilize.&lt;/p&gt;

&lt;p&gt;The developers who understand this stack now, rough edges and all, will have a meaningful head start when it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to go from here
&lt;/h2&gt;

&lt;p&gt;MemoMind as we've built it is a solid foundation. The transcribe → summarize → save pipeline is generic enough to adapt to a wide range of use cases — meeting notes, workout logs, daily journals, field reports. The structured JSON prompt and the Firestore schema transfer cleanly to any domain.&lt;/p&gt;

&lt;p&gt;A few natural next steps if you want to keep building:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a memo list screen.&lt;/strong&gt; A simple Firestore real-time listener showing past memos with their on-device/cloud chip makes the app feel complete — and gives you a live view of how often on-device inference wins in practice across your test devices.&lt;/p&gt;
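
&lt;p&gt;The data layer for that list screen is a single snapshot listener wrapped in a &lt;code&gt;callbackFlow&lt;/code&gt;. A minimal sketch; the collection name matches the one used above, and error handling is deliberately thin:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import com.google.firebase.firestore.FirebaseFirestore
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Sketch of a real-time memo list: emits the whole collection on every change.
fun observeMemos(firestore: FirebaseFirestore): Flow&lt;List&lt;Map&lt;String, Any&gt;&gt;&gt; = callbackFlow {
    val registration = firestore.collection("memos")
        .addSnapshotListener { snapshot, error -&gt;
            if (error != null) {
                close(error) // propagate the failure to collectors
                return@addSnapshotListener
            }
            trySend(snapshot?.documents?.mapNotNull { it.data } ?: emptyList())
        }
    awaitClose { registration.remove() } // detach the listener when collection stops
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;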

&lt;p&gt;&lt;strong&gt;Gate on model download completion.&lt;/strong&gt; ML Kit exposes a download state API for Gemini Nano. Listening to it and showing a subtle "AI ready" indicator once it's cached is a small touch that makes a real UX difference for privacy-sensitive apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add the model device support check.&lt;/strong&gt; Before showing the on-device privacy promise to users, check at runtime whether their device actually supports Gemini Nano inference and surface that information appropriately.&lt;/p&gt;
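
&lt;p&gt;Both of those checks reduce to the same pattern: probe availability once, expose it as state, and let the UI gate the privacy messaging on it. A sketch of that pattern follows; &lt;code&gt;isNanoModelReady&lt;/code&gt; is a hypothetical stand-in for whatever status check you wire up (ML Kit's download state, a device capability check, or both), not a real SDK call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.launch

// Sketch of availability gating. isNanoModelReady() is a placeholder for your actual
// check (model download state plus device support), NOT an existing SDK function.
class OnDeviceAvailability(
    private val scope: CoroutineScope,
    private val isNanoModelReady: suspend () -&gt; Boolean
) {
    private val _ready = MutableStateFlow(false)
    val ready: StateFlow&lt;Boolean&gt; = _ready // show the "AI ready" indicator when true

    fun refresh() {
        scope.launch { _ready.value = isNanoModelReady() }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;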

&lt;blockquote&gt;
&lt;p&gt;iOS support for hybrid inference is not yet available; watch the Firebase changelog. The shape of the Android experimental API suggests a Dart/Flutter SDK will follow once the underlying infrastructure is proven on the native platforms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The full source for MemoMind is on GitHub at &lt;strong&gt;&lt;a href="https://github.com/RealWorldApplications/memo-mind-post" rel="noopener noreferrer"&gt;https://github.com/RealWorldApplications/memo-mind-post&lt;/a&gt;&lt;/strong&gt;. If you run into anything unexpected with the hybrid routing on your device, or see the chip behave differently than expected, I'd like to hear about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://firebase.google.com/docs/ai-logic/hybrid/android/get-started?api=dev" rel="noopener noreferrer"&gt;Firebase AI Logic hybrid inference for Android (official docs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://firebase.blog/posts/2026/04/cloud-next-2026-ai-logic" rel="noopener noreferrer"&gt;What's new from Firebase at Cloud Next '26&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://android-developers.googleblog.com/2026/04/Hybrid-inference-and-new-AI-models-are-coming-to-Android.html" rel="noopener noreferrer"&gt;Android Developers Blog: Experimental hybrid inference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/android/ai-samples/tree/main/samples/gemini-hybrid" rel="noopener noreferrer"&gt;Official hybrid inference sample (android/ai-samples)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mvnrepository.com/artifact/com.google.firebase/firebase-ai-ondevice" rel="noopener noreferrer"&gt;firebase-ai-ondevice on Maven&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
      <category>android</category>
    </item>
    <item>
      <title>Weekend Challenge: Earth Day Edition</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Mon, 20 Apr 2026 00:59:00 +0000</pubDate>
      <link>https://dev.to/bolgercarol/weekend-challenge-earth-day-edition-2ogp</link>
      <guid>https://dev.to/bolgercarol/weekend-challenge-earth-day-edition-2ogp</guid>
<description>&lt;p&gt;&lt;em&gt;This is a submission for the Weekend Challenge: Earth Day Edition&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic Acres&lt;/strong&gt; is an intelligent, multimodal permaculture design assistant built to accelerate regenerative agriculture and make ecological landscaping accessible to everyone. &lt;/p&gt;

&lt;p&gt;My primary goal was to take the overwhelming complexity of designing sustainable ecosystems (knowing which plants support each other, which fix nitrogen, and which accumulate nutrients) and boil it down to a single photograph. Users simply snap a picture of their yard or planting site (or upload one from their gallery) and provide their location.&lt;/p&gt;

&lt;p&gt;The application instantly assesses the site and engineers a complete, climate-specific &lt;strong&gt;Plant Guild&lt;/strong&gt;. Instead of endless research, users are presented with a gorgeous, dynamic dashboard that outlines their primary canopy tree, nitrogen fixers, dynamic accumulators, and ground covers. All are precisely tailored to their local environment and the exact physical constraints shown in the photo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Try the web app:&lt;br&gt;
&lt;a href="https://agentic-acres.web.app" rel="noopener noreferrer"&gt;https://agentic-acres.web.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See a demo video on YouTube:&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/yMg1gkwFy6o"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/RealWorldApplications/agentic-acres" rel="noopener noreferrer"&gt;https://github.com/RealWorldApplications/agentic-acres&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built It
&lt;/h2&gt;

&lt;p&gt;I built Agentic Acres as a fully responsive Web Application using &lt;strong&gt;Flutter&lt;/strong&gt;, and deployed it using &lt;strong&gt;Firebase Hosting&lt;/strong&gt;.  I built the entire app in Google Antigravity and used Gemini 3.1 as my AI assistant.&lt;/p&gt;

&lt;p&gt;The core intelligence of the application is driven by &lt;strong&gt;gemini-3.1-flash-lite-preview&lt;/strong&gt; using the &lt;code&gt;google_generative_ai&lt;/code&gt; SDK. I used Gemini's multimodal vision capabilities to achieve something unique: the app passes the user-uploaded image alongside their geographic coordinates directly into the model. I implemented strict prompt engineering so that Gemini would output a highly constrained JSON schema rather than conversational text. This allows the frontend to confidently parse the AI's ecological assessment and dynamically construct the localized "Bento Box" UI.&lt;/p&gt;

&lt;p&gt;To make the app feel alive and premium, I incorporated several other key technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Geolocation &amp;amp; Geocoding&lt;/strong&gt;: I integrated the &lt;code&gt;geolocator&lt;/code&gt; package to allow users to pull their exact GPS coordinates with a single click.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dynamic Media Pipelines&lt;/strong&gt;: Because Gemini returns common plant names rather than hardcoded URLs, I integrated the &lt;strong&gt;Wikipedia (Wikimedia) Action API&lt;/strong&gt;. As the dashboard renders, the app asynchronously executes fuzzy searches against Wikipedia's backend, ripping the main 400px thumbnail of the matched plant and fading it into the glassmorphic UI cards in real-time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Modern Glassmorphism Design&lt;/strong&gt;: I styled the frontend using Flutter widget compositions to build a dark-mode, frosted-glass interface that feels modern, premium, and instantly trustworthy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prize Categories
&lt;/h2&gt;

&lt;p&gt;Best Use of Google Gemini&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>weekendchallenge</category>
      <category>flutter</category>
      <category>antigravity</category>
    </item>
    <item>
      <title>Gemma 4: A Practical Guide to Running Frontier AI on Your Own Hardware</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:44:37 +0000</pubDate>
      <link>https://dev.to/bolgercarol/gemma-4-a-practical-guide-to-running-frontier-ai-on-your-own-hardware-5h9l</link>
      <guid>https://dev.to/bolgercarol/gemma-4-a-practical-guide-to-running-frontier-ai-on-your-own-hardware-5h9l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejrip2rozhto0uac1aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejrip2rozhto0uac1aj.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;"This article was originally published on my Substack."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There’s a quiet assumption baked into the way most of us use AI today: you type a prompt, it leaves your machine, travels to a data center somewhere, gets processed on hardware you don’t own, and the answer comes back. For most of the last three years, “using AI” has meant “renting AI.” Your data leaves. You hope for the best.&lt;/p&gt;

&lt;p&gt;Gemma 4 is Google DeepMind’s clearest challenge to that model yet. Recently released under an Apache 2.0 license, it’s a family of four open-weight models. They range from a 2-billion-parameter edge model that fits on a phone to a 31-billion-parameter dense model that runs on a single consumer GPU. These aren’t research toys. The 31B variant currently ranks as the #3 open model in the world on the Arena AI text leaderboard, outcompeting models twenty times its size. The 26B model sits at #6.&lt;/p&gt;

&lt;p&gt;Built on the same research and technology behind Gemini 3, these models handle multi-step reasoning, native function calling, code generation, and multimodal input across text, images, video, and audio. They support over 140 languages out of the box. And they do all of it on hardware you already own or could afford to.&lt;/p&gt;

&lt;p&gt;Let’s break down what that actually means in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Under the Hood
&lt;/h2&gt;

&lt;p&gt;Gemma 4 ships in four sizes, each designed for a different deployment scenario. Understanding the differences matters because picking the right model is the single most important decision you’ll make.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;31B Dense&lt;/strong&gt; model is the quality leader. Every one of its 31 billion parameters activates on every inference pass, which means maximum reasoning depth at the cost of higher compute. If you’re fine-tuning for a specialized task such as legal analysis, medical summarization, or domain-specific code generation, then this is your foundation. It fits on a single 80GB NVIDIA H100 in full bfloat16 precision, or on consumer GPUs in quantized form.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;26B Mixture-of-Experts (MoE)&lt;/strong&gt; model takes a different approach. It contains 26 billion total parameters but only activates roughly 3.8 billion of them during any given inference pass. Think of it as a team of specialists: instead of running every expert on every query, the model routes each token to the most relevant subset. This model delivers significantly accelerated token generation, matching the performance of much smaller architectures while preserving the core reasoning strengths of the dense version. It is the ideal choice when prioritizing low latency over absolute benchmark perfection.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;E4B and E2B&lt;/strong&gt; edge models are purpose-built for phones, IoT devices, and anything where RAM and battery life are constraints. These are multimodal out of the box. They handle text, images, video, and native audio input. They’re designed to run completely offline with near-zero latency on devices like the Raspberry Pi, NVIDIA Jetson Orin Nano, and Android phones. For Android developers specifically, these models are compatible with the AICore Developer Preview for forward compatibility with Gemini Nano 4.&lt;/p&gt;

&lt;p&gt;All four models support context windows of 128K tokens (edge) to 256K tokens (26B and 31B). To put that in practical terms: 256K tokens is roughly 500 pages of text. That’s an entire codebase, a full legal contract, or a quarter’s worth of financial filings processed in a single prompt with no chunking required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why “Local” Is a Business Decision, Not Just a Technical One
&lt;/h2&gt;

&lt;p&gt;If you’re a business owner reading this and wondering why you should care about where a model runs, the answer comes down to three things: data control, cost structure, and global reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data stays on your hardware.&lt;/strong&gt; Every time your team sends customer data, internal documents, or proprietary information to a cloud AI provider, you’re trusting a third party with that data. For industries governed by HIPAA, GDPR, SOC 2, or similar regulations, that trust comes with compliance overhead and significant risk. Running Gemma 4 locally means sensitive information never leaves your premises. There’s no API call to audit, no third-party data processing agreement to negotiate, no residual data sitting on someone else’s servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost model flips.&lt;/strong&gt; Cloud AI pricing is usage-based: you pay per token, per request, per minute. A customer service bot handling thousands of queries a day, or a document analysis pipeline processing hundreds of contracts a week, can get expensive fast. A local deployment has a fixed hardware cost and near-zero marginal cost per inference: once the GPU is paid for, every additional query is essentially free. For high-volume, predictable workloads, the math favors local deployment within months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;140+ languages, no translation API required.&lt;/strong&gt; Gemma 4 is natively trained on over 140 languages. If you’re serving customers in São Paulo, Jakarta, and Berlin, you don’t need to add a translation layer or maintain separate model deployments. A single model handles multilingual input and output natively, which dramatically simplifies localization for global products.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Developers: Agents, Not Chatbots
&lt;/h2&gt;

&lt;p&gt;The most significant shift in Gemma 4 isn’t a benchmark number; it’s the native support for agentic workflows. This isn’t a model that just answers questions. It’s a model designed to use tools, call functions, produce structured JSON output, and follow multi-step plans.&lt;/p&gt;

&lt;p&gt;In practical terms, that means you can build an agent that reads a GitHub issue, checks out the relevant branch, identifies the bug in context (thanks to the 256K window), writes a fix, runs the test suite, and opens a pull request. This is all orchestrated by the model’s own reasoning, with each step involving a structured function call to an external tool. Google has built this capability in natively, not as a wrapper or a prompt hack.&lt;/p&gt;

&lt;p&gt;For local development specifically, the 31B model is positioned as an offline coding assistant. Quantized versions run on consumer GPUs such as an RTX 4090 or RTX 5090, turning your workstation into a self-contained AI development environment with no internet dependency. Google and NVIDIA have collaborated on optimizations so that these models leverage Tensor Cores for accelerated inference out of the box, with day-one support from tools like Ollama, llama.cpp, vLLM, and LM Studio.&lt;/p&gt;

&lt;p&gt;The hardware partnership story extends beyond NVIDIA. Google has worked with Qualcomm Technologies and MediaTek on mobile optimization, and with Arm on efficient edge deployment. The goal is to be able to run Gemma 4 anywhere you have compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: The Simplest Path
&lt;/h2&gt;

&lt;p&gt;There are several ways to get Gemma 4 running. Here’s the fastest.&lt;/p&gt;

&lt;p&gt;If you want to test before committing, open &lt;strong&gt;Google AI Studio&lt;/strong&gt;. The 31B and 26B MoE models are available there for immediate experimentation: nothing to download, nothing to set up, and no GPU required. For the edge models, Google’s AI Edge Gallery app on Android lets you test E4B and E2B directly on your phone.&lt;/p&gt;

&lt;p&gt;If you want to run locally, the most straightforward path is Ollama. Install it, then pull the model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s it. You’re running a frontier-class model locally. If you want more control, such as quantization options, specific model variants, and GPU configuration, then download GGUF weights from Hugging Face and run them through llama.cpp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’re building for production&lt;/strong&gt;, the model weights are available on Hugging Face, Kaggle, and Ollama, with day-one integration support from Hugging Face Transformers, vLLM, SGLang, NVIDIA NIM, and a long list of other frameworks. For cloud-scale deployment, Vertex AI, Cloud Run, and GKE are all supported paths on Google Cloud.&lt;/p&gt;


&lt;p&gt;The Apache 2.0 license means there are no usage restrictions, no reporting requirements, and no commercial limitations. You can fine-tune, redistribute, and deploy without asking permission.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Use Cases Worth Building
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The offline retail assistant&lt;/strong&gt;. Picture a phone app that uses the E4B model to “see” products through the camera, answer customer questions about specifications, check local inventory, and suggest alternatives, all without an internet connection. In a warehouse, a retail floor, or a remote pop-up shop, this works where cloud-dependent solutions don’t.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The enterprise document agent&lt;/strong&gt;. A 256K context window means you can feed an entire quarterly report, or several, into a single prompt. Pair that with native function calling, and you have an agent that reads the filing, extracts key metrics, compares them against last quarter’s numbers (pulled via a structured API call), flags anomalies, and drafts a summary. The entire pipeline runs on-premises, with no customer or financial data leaving your network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The autonomous code reviewer&lt;/strong&gt;. Point the 31B model at a pull request. It reads the diff in context of the full repository (256K tokens covers a lot of code), identifies potential bugs, checks for style violations, suggests performance improvements, and posts its review all as a local CI step that adds seconds, not minutes, to your pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Gemma 4 Still Falls Short
&lt;/h2&gt;

&lt;p&gt;No model is perfect at everything, and intellectual honesty about limitations builds more trust than uncritical hype.&lt;/p&gt;

&lt;p&gt;Gemma 4’s largest model is 31 billion parameters. That’s excellent for its size, but for the heaviest workloads (frontier-level math, advanced scientific reasoning, highly nuanced long-form writing) a large cloud model will still do better. The faster MoE variant is a trade-off: you gain speed and give up a small amount of quality. And while the edge models are impressive, a 2B model on your phone won’t replace a datacenter-class model for critical tasks.&lt;/p&gt;

&lt;p&gt;Local deployment also shifts operational responsibility to you. There’s no managed service handling uptime, scaling, or security patches. If you’re running Gemma 4 in production, you own the infrastructure. This is the point for data sovereignty, but it does mean your team needs the capacity to manage it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The Gemma 4 release represents something worth paying attention to beyond the model itself. Google is releasing frontier-competitive models under Apache 2.0 at the same moment some other AI labs are pulling back from open releases. That’s a strategic bet on ecosystem growth over model lock-in, and it matters for anyone building products on top of open AI infrastructure.&lt;/p&gt;

&lt;p&gt;We’re watching a shift from “AI as a service you rent” to “AI as infrastructure you own.” While Gemma 4 won't entirely eliminate your reliance on cloud services, it represents a significant step toward that goal.&lt;/p&gt;

&lt;p&gt;A model that ranks among the top open models in the world, runs on a consumer GPU, handles multimodal input in 140 languages, and ships under a permissive open-source license is a genuinely new thing in this space.&lt;/p&gt;

&lt;p&gt;The question worth sitting with: if you can run this level of intelligence on your own hardware, with your data never leaving your control, what does that make possible that wasn’t before?&lt;/p&gt;

&lt;p&gt;Gemma 4 models are available:&lt;br&gt;
&lt;a href="https://huggingface.co/collections/google/gemma-4" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, &lt;a href="https://www.kaggle.com/models/google/gemma-4" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;, and &lt;a href="https://ollama.com/library/gemma4" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Try them in &lt;a href="https://aistudio.google.com/prompts/new_chat?model=gemma-4-31b-it" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt; or explore the edge models in the &lt;a href="https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery" rel="noopener noreferrer"&gt;AI Edge Gallery&lt;/a&gt; on Android.&lt;br&gt;
Thanks for reading!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>softwaredevelopment</category>
      <category>productivity</category>
    </item>
    <item>
      <title>5 Ways to Optimize Your AI Workflows ⚡️</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Mon, 06 Apr 2026 14:00:14 +0000</pubDate>
      <link>https://dev.to/bolgercarol/5-ways-to-optimize-your-ai-workflows-3ll3</link>
      <guid>https://dev.to/bolgercarol/5-ways-to-optimize-your-ai-workflows-3ll3</guid>
      <description>&lt;div&gt;
    &lt;iframe src="https://www.youtube.com/embed/7QUmE7YgnLA"&gt;
    &lt;/iframe&gt;
  &lt;/div&gt;


</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>buildinpublic</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How do you actually manage your content's SEO performance?</title>
      <dc:creator>Carol Bolger</dc:creator>
      <pubDate>Sat, 04 Apr 2026 22:30:14 +0000</pubDate>
      <link>https://dev.to/bolgercarol/how-do-you-actually-manage-your-contents-seo-performance-4jn0</link>
      <guid>https://dev.to/bolgercarol/how-do-you-actually-manage-your-contents-seo-performance-4jn0</guid>
      <description>&lt;p&gt;I write about tech and I've been frustrated with piecing together Google Search Console, spreadsheets, and random tools to figure out what's working. Curious how others handle this — do you have a system that actually works?&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>discuss</category>
      <category>marketing</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
