Carol Bolger

On-device or cloud? Building hybrid AI inference into your Android app with Firebase AI Logic

Google Cloud NEXT '26 Challenge Submission

This is a submission for the Google Cloud NEXT Writing Challenge

Not every prompt needs the cloud

Every time a user taps a button in your Android app and Gemini responds, something happens in the background you might not think about: a round trip to Google's servers. Data leaves the device, gets processed in the cloud, and comes back. For most prompts, that's fine. But what about a health journaling app where the prompt contains symptoms? A notes app where the query is someone's private thought? Or a user on a shaky connection in the middle of nowhere?

The assumption baked into most AI-powered apps is that inference lives in the cloud. That made sense when on-device models were too limited to be useful. That assumption is now worth revisiting.

At Google Cloud Next '26, Firebase announced hybrid inference for Firebase AI Logic on Android — currently experimental, powered by Gemini Nano via ML Kit's Prompt API under the hood. The idea is straightforward: run inference locally on the device when it can handle it, and fall back to cloud-hosted Gemini when it can't. From your Kotlin code, the API looks nearly identical either way. You configure a preference, write a prompt, get a response. The SDK figures out the routing.

This article walks through how hybrid inference works, why it matters for Android developers, and how to wire it into a real app today — including the honest caveats you won't find in the announcement post.


How hybrid inference works

At its core, hybrid inference is a routing decision the SDK makes on your behalf. When your app sends a prompt, Firebase AI Logic checks whether the device supports Gemini Nano and has sufficient resources to run it. If yes, inference runs locally — never leaving the device. If not, the request is forwarded to cloud-hosted Gemini, exactly as it would in a standard setup.

The routing behaviour is controlled by four InferenceMode values:

  • PREFER_ON_DEVICE — try on-device first, fall back to cloud if unavailable
  • PREFER_IN_CLOUD — try cloud first, fall back to on-device if offline
  • ONLY_ON_DEVICE — on-device only; throws an exception if unavailable
  • ONLY_IN_CLOUD — cloud only; throws an exception if offline

For most apps PREFER_ON_DEVICE is the right starting point. It gives you the speed and privacy benefits of local inference where available, without ever leaving users stranded.

Think of it like a CDN for model inference. The SDK picks the most capable compute available at the moment the request is made — and your Kotlin code stays the same regardless of which path runs.

Under the hood, on-device inference uses ML Kit's Prompt API running Gemini Nano — a smaller, quantized model optimized for mobile hardware, not the full cloud Gemini model. That trade-off is worth being upfront about: on-device responses will be faster and free of API cost, but for complex reasoning tasks, cloud Gemini produces higher quality output. Hybrid inference gives the SDK the ability to make that call at runtime rather than forcing you to hardcode it at build time.
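When a feature has a hard requirement rather than a preference, the two ONLY_* modes make the constraint explicit and force you to handle the failure path. A minimal sketch, reusing the OnDeviceConfig surface shown in Step 3 below (this is an experimental API, so names may shift between releases):

// Strict privacy path: this prompt must never leave the device.
@OptIn(PublicPreviewAPI::class)
val privateModel = Firebase.ai(backend = GenerativeBackend.googleAI())
    .generativeModel(
        modelName = "gemini-3-flash-preview",
        onDeviceConfig = OnDeviceConfig(mode = InferenceMode.ONLY_ON_DEVICE)
    )

suspend fun summarizePrivately(prompt: String): String? = try {
    privateModel.generateContent(prompt).text
} catch (e: Exception) {
    // ONLY_ON_DEVICE throws when Gemini Nano isn't available on this device;
    // surface a "not supported on this device" state instead of crashing.
    null
}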

Three reasons this matters for Android developers in particular:

Speed. On-device inference removes the network round trip entirely. For short prompts on capable hardware, responses feel near-instant, which opens up UX patterns that would otherwise be too slow over the network.

Privacy. When inference runs locally, the prompt never leaves the device. For apps handling sensitive input — health data, personal notes, financial details — that's a meaningful architectural guarantee, not just a talking point.

Cost. Every request handled on-device is a Gemini API call you're not making. At scale, or in apps with high inference volume, this adds up — especially for short, repetitive prompts.

One important note: the on-device model isn't bundled with your APK — it downloads in the background after first launch via ML Kit. Until it's cached, all requests fall back to cloud. For apps where on-device is a hard requirement rather than a preference, you'll want to gate the feature on download completion.


Building MemoMind: record, summarize, save

We're going to build a focused Android app called MemoMind that records a voice memo, transcribes it using Android's built-in speech recognizer, runs a Gemini summary using hybrid inference, and saves everything to Firestore — including which backend (on-device or cloud) handled the request. By the end you'll have something genuinely usable, not just a proof of concept.

The full flow: the user taps record, speaks, taps stop. The app transcribes the audio using SpeechRecognizer, sends the transcript to Firebase AI Logic with hybrid inference configured, parses the structured JSON response into a summary and action items, then writes the result to Firestore.

Prerequisites

The full source for MemoMind is on GitHub at https://github.com/RealWorldApplications/memo-mind-post. You can clone it and follow along, or build from scratch using the steps below.

Tested with: firebase-ai:17.11.0, firebase-ai-ondevice:16.0.0-beta01 on a Pixel 9 running Android 15. Import paths and model names for experimental APIs can shift between releases — treat the GitHub repo as the authoritative reference if anything here doesn't compile.

You'll need a Firebase project with the Gemini Developer API enabled under Build → AI Logic, and Firestore enabled. If you're starting from scratch, the Firebase Android setup guide covers the project creation steps.

Important: After cloning, add your own google-services.json to the app/ directory — it's gitignored in the repo. Download it from your Firebase project console under Project settings → Your apps.
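Also worth double-checking: google-services.json only takes effect if the Google services Gradle plugin is applied. If your project doesn't have it yet (the version number here is just a recent release; use whatever your project already pins):

// build.gradle.kts (project level)
plugins {
    id("com.google.gms.google-services") version "4.4.2" apply false
}

// build.gradle.kts (app level)
plugins {
    id("com.google.gms.google-services")
}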

Step 1 — Add dependencies

Hybrid inference requires two separate libraries — the standard Firebase AI library and the on-device extension. Note that firebase-ai-ondevice is not yet in the Firebase Android BoM, so you need to pin the version explicitly.

You also need the Kotlin serialization plugin (used in MemoService for JSON parsing) and material-icons-extended for the Mic and Stop icons — both are easy to miss:

// build.gradle.kts (app level)
plugins {
    // ... your existing plugins ...
    id("org.jetbrains.kotlin.plugin.serialization") version "2.0.21"
}

dependencies {
    // Standard Firebase AI Logic
    implementation("com.google.firebase:firebase-ai:17.11.0")

    // On-device extension — NOT in the BoM yet, pin manually
    implementation("com.google.firebase:firebase-ai-ondevice:16.0.0-beta01")

    // Firestore
    implementation("com.google.firebase:firebase-firestore:25.1.1")

    // Compose
    implementation(platform("androidx.compose:compose-bom:2024.09.00"))
    implementation("androidx.compose.ui:ui")
    implementation("androidx.compose.material3:material3")
    implementation("androidx.compose.material:material-icons-extended") // Mic/Stop icons
    implementation("androidx.activity:activity-compose:1.9.2")

    // Coroutines
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.1")
    implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.8.6")

    // JSON parsing — used in MemoService
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.7.3")
}

Also add microphone permission to AndroidManifest.xml:

<uses-permission android:name="android.permission.RECORD_AUDIO" />
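One step the manifest entry doesn't cover: RECORD_AUDIO is a runtime permission, so the app has to request it before the first recording. A minimal Compose-side sketch using the standard Activity Result API (the repo may wire this slightly differently; viewModel.startRecording() is the ViewModel call defined in Step 5):

// Inside MemoScreen, before triggering a recording
val permissionLauncher = rememberLauncherForActivityResult(
    contract = ActivityResultContracts.RequestPermission()
) { granted ->
    if (granted) viewModel.startRecording()
}

FilledIconButton(onClick = {
    // Returns immediately with granted = true if the permission is already held
    permissionLauncher.launch(Manifest.permission.RECORD_AUDIO)
}) {
    Icon(Icons.Default.Mic, contentDescription = "Record")
}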

Step 2 — Define the data model

Before any AI or UI code, define a clean data model. Gemini will return a structured JSON response that maps directly to this class, and Firestore will store it.

data class Memo(
    val id: String = "",
    val transcript: String = "",
    val summary: String = "",
    val actionItems: List<String> = emptyList(),
    val inferredBy: String = "", // "on_device" or "cloud"
    val createdAt: Long = System.currentTimeMillis()
) {
    fun toMap(): Map<String, Any> = mapOf(
        "transcript" to transcript,
        "summary" to summary,
        "actionItems" to actionItems,
        "inferredBy" to inferredBy,
        "createdAt" to createdAt
    )
}

The inferredBy field is what powers the backend indicator chip in the UI. We'll populate it based on which path the SDK actually took.

Step 3 — Configure Firebase AI Logic with hybrid inference

This is the heart of the tutorial. The @OptIn annotation is required because hybrid inference is currently experimental. Beyond that, notice how little the setup differs from standard Firebase AI — the only meaningful addition is onDeviceConfig.

import com.google.firebase.Firebase
import com.google.firebase.ai.ai
import com.google.firebase.ai.InferenceMode        // not .ondevice — re-exported to root package
import com.google.firebase.ai.OnDeviceConfig        // not .ondevice — re-exported to root package
import com.google.firebase.ai.type.GenerativeBackend // lives in .type, not root
import com.google.firebase.ai.type.PublicPreviewAPI
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.jsonArray
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive

class MemoService {

    // Experimental opt-in required for hybrid inference
    @OptIn(PublicPreviewAPI::class)
    private val model = Firebase.ai(backend = GenerativeBackend.googleAI())
        .generativeModel(
            modelName = "gemini-3-flash-preview",
            onDeviceConfig = OnDeviceConfig(
                mode = InferenceMode.PREFER_ON_DEVICE
            )
        )

    @OptIn(PublicPreviewAPI::class)
    suspend fun summarize(transcript: String): SummaryResult {
        val prompt = """
            You are a memo summarizer. Given the transcript below, respond ONLY
            with a valid JSON object — no markdown, no explanation. Use this shape:
            {
              "summary": "2-sentence summary of the memo",
              "actionItems": ["item one", "item two"],
              "inferredBy": "on_device"
            }

            Set "inferredBy" to "on_device" if you are running locally,
            or "cloud" if you are a cloud-hosted model.

            Transcript:
            $transcript
        """.trimIndent()

        return try {
            val response = model.generateContent(prompt)
            val raw = response.text ?: return SummaryResult.empty(transcript)
            parseResponse(raw)
        } catch (e: Exception) {
            SummaryResult.empty(transcript)
        }
    }

    private fun parseResponse(raw: String): SummaryResult {
        return try {
            val cleaned = raw
                .replace("```json", "")
                .replace("```", "")
                .trim()
            val json = Json.parseToJsonElement(cleaned).jsonObject
            SummaryResult(
                summary = json["summary"]?.jsonPrimitive?.content ?: "",
                actionItems = json["actionItems"]?.jsonArray
                    ?.map { it.jsonPrimitive.content } ?: emptyList(),
                inferredBy = json["inferredBy"]?.jsonPrimitive?.content ?: "cloud"
            )
        } catch (e: Exception) {
            SummaryResult(summary = raw, actionItems = emptyList(), inferredBy = "cloud")
        }
    }
}

data class SummaryResult(
    val summary: String,
    val actionItems: List<String>,
    val inferredBy: String
) {
    companion object {
        fun empty(transcript: String) = SummaryResult(
            summary = transcript,
            actionItems = emptyList(),
            inferredBy = "cloud"
        )
    }
}

A note on inferredBy: The Firebase AI Logic Android SDK doesn't yet expose a property to read which backend handled a response after the fact. As a practical workaround, we ask the model to self-report in the JSON. This has been reliable in testing, but a model's claim about its own execution context isn't a guaranteed contract. Verify it on your own devices and adjust if needed.
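If you want a second signal, you can cross-check the self-report against the device's connectivity: a response that arrives while the device has no validated internet connection can only have come from the on-device path. A rough sketch using the standard ConnectivityManager API (the helper name is made up, and it needs ACCESS_NETWORK_STATE in the manifest):

// Hypothetical helper: true when the device currently has validated internet access
fun hasInternet(context: Context): Boolean {
    val cm = context.getSystemService(Context.CONNECTIVITY_SERVICE) as ConnectivityManager
    val caps = cm.getNetworkCapabilities(cm.activeNetwork) ?: return false
    return caps.hasCapability(NetworkCapabilities.NET_CAPABILITY_VALIDATED)
}

// After summarize(): a result produced while offline must have run on-device,
// regardless of what the model claimed about itself.
val inferredBy = if (!hasInternet(context)) "on_device" else result.inferredBy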

Step 4 — Transcribe with Android's SpeechRecognizer

Android's built-in SpeechRecognizer handles transcription entirely on-device using the platform's native speech engine. No third-party package needed, and audio never leaves the phone at this stage.

The key pattern is wrapping SpeechRecognizer's listener callbacks in suspendCancellableCoroutine so they integrate cleanly with the coroutine-based ViewModel:

import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.suspendCancellableCoroutine

class TranscriptionService(private val context: Context) {

    suspend fun transcribe(): String = suspendCancellableCoroutine { continuation ->
        val recognizer = SpeechRecognizer.createSpeechRecognizer(context)

        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
            putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1)
        }

        recognizer.setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle?) {
                val transcript = results
                    ?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                    ?.firstOrNull() ?: ""
                // Guard against double-resume: some devices fire onError after onResults
                if (continuation.isActive) continuation.resume(transcript)
                recognizer.destroy()
            }

            override fun onError(error: Int) {
                if (continuation.isActive) {
                    continuation.resumeWithException(Exception("Speech recognition error: $error"))
                }
                recognizer.destroy()
            }

            // RecognitionListener requires several no-op overrides — see full source on GitHub
        })

        recognizer.startListening(intent)
        continuation.invokeOnCancellation { recognizer.destroy() }
    }
}
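For reference, these are the no-op overrides the comment in the listener refers to. MemoMind has no use for them, so empty stubs are enough:

// Add these inside the object : RecognitionListener { ... } block above
override fun onReadyForSpeech(params: Bundle?) {}
override fun onBeginningOfSpeech() {}
override fun onRmsChanged(rmsdB: Float) {}
override fun onBufferReceived(buffer: ByteArray?) {}
override fun onEndOfSpeech() {}
override fun onPartialResults(partialResults: Bundle?) {}
override fun onEvent(eventType: Int, params: Bundle?) {}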

SpeechRecognizer must be created and used on the main thread. Because viewModelScope runs on Dispatchers.Main, calling transcribe() from the ViewModel is safe without any explicit dispatcher switch.

Step 5 — ViewModel: wire it all together

The ViewModel exposes a MemoUiState sealed class (Idle, Recording, Processing, Done, Error) as a StateFlow and orchestrates the three-stage pipeline in startRecording(). The full class including the ViewModelProvider.Factory is in MemoViewModel.kt — the pipeline itself is the part worth reading here:

fun startRecording() {
    _uiState.value = MemoUiState.Recording

    viewModelScope.launch {
        try {
            val transcript = transcriptionService.transcribe() // on-device speech engine
            _uiState.value = MemoUiState.Processing

            val result = memoService.summarize(transcript)     // hybrid inference

            val memo = Memo(
                id = firestore.collection("memos").document().id,
                transcript = transcript,
                summary = result.summary,
                actionItems = result.actionItems,
                inferredBy = result.inferredBy
            )

            firestore.collection("memos").document(memo.id).set(memo.toMap()).await()
            _uiState.value = MemoUiState.Done(memo)
        } catch (e: Exception) {
            _uiState.value = MemoUiState.Error(e.message ?: "Unknown error")
        }
    }
}

Three lines do the meaningful work: transcribe(), summarize(), and set(). Everything else is state management.
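The MemoUiState values the pipeline moves through are a plain sealed class. A sketch reconstructed from the states used above; MemoViewModel.kt in the repo is the exact definition:

// Reconstructed sketch of the UI state machine
sealed class MemoUiState {
    object Idle : MemoUiState()
    object Recording : MemoUiState()
    object Processing : MemoUiState()
    data class Done(val memo: Memo) : MemoUiState()
    data class Error(val message: String) : MemoUiState()
}

// Exposed from the ViewModel as a StateFlow
private val _uiState = MutableStateFlow<MemoUiState>(MemoUiState.Idle)
val uiState: StateFlow<MemoUiState> = _uiState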

Step 6 — Compose UI with the backend indicator chip

MemoScreen collects uiState and renders a FilledIconButton that toggles between mic and stop, a status label, a CircularProgressIndicator during summarization, and the SummaryCard once done. The full screen composable is in MemoScreen.kt.

The part worth looking at closely is SummaryCard — specifically the backend indicator chip, which is the whole point of exposing inferredBy:

@Composable
fun SummaryCard(memo: Memo, onReset: () -> Unit) {
    Card(
        modifier = Modifier.fillMaxWidth(),
        shape = RoundedCornerShape(16.dp)
    ) {
        Column(modifier = Modifier.padding(16.dp)) {

            // Summary header + backend chip
            Row(
                modifier = Modifier.fillMaxWidth(),
                verticalAlignment = Alignment.CenterVertically
            ) {
                Text(
                    text = "Summary",
                    style = MaterialTheme.typography.titleMedium,
                    modifier = Modifier.weight(1f)
                )
                // Backend indicator chip
                Surface(
                    shape = RoundedCornerShape(20.dp),
                    color = if (memo.inferredBy == "on_device")
                        Color(0xFFE1F5EE)
                    else
                        Color(0xFFE6F1FB)
                ) {
                    Text(
                        text = if (memo.inferredBy == "on_device") "On-device" else "Cloud",
                        modifier = Modifier.padding(horizontal = 10.dp, vertical = 4.dp),
                        style = MaterialTheme.typography.labelSmall,
                        color = if (memo.inferredBy == "on_device")
                            Color(0xFF085041)
                        else
                            Color(0xFF0C447C)
                    )
                }
            }

            Spacer(modifier = Modifier.height(8.dp))
            Text(
                text = memo.summary,
                style = MaterialTheme.typography.bodyMedium,
                lineHeight = MaterialTheme.typography.bodyMedium.lineHeight
            )

            if (memo.actionItems.isNotEmpty()) {
                Spacer(modifier = Modifier.height(16.dp))
                Text(
                    text = "Action items",
                    style = MaterialTheme.typography.titleMedium
                )
                Spacer(modifier = Modifier.height(4.dp))
                memo.actionItems.forEach { item ->
                    Text(
                        text = "• $item",
                        style = MaterialTheme.typography.bodyMedium,
                        modifier = Modifier.padding(vertical = 2.dp)
                    )
                }
            }

            Spacer(modifier = Modifier.height(16.dp))
            OutlinedButton(
                onClick = onReset,
                modifier = Modifier.fillMaxWidth()
            ) {
                Text("Record another")
            }
        }
    }
}

Step 7 — Run it and watch the chip

Build and run on a physical Android device that supports Gemini Nano — this article's setup was tested on a Pixel 9, and recent Pixel and Samsung flagships are the safest bet. Record a short memo, something like: "remind me to email Sarah about the Q2 report and book a dentist appointment." Stop and watch the card appear.

On the first run you'll likely see "Cloud" — the on-device Gemini Nano model downloads in the background after first use. Record a second memo and you should see the chip flip to "On-device", with a noticeably faster response time.

That chip is the screenshot worth capturing for your article header. A real summary card showing "On-device" with action items is worth more than any diagram.


What this actually means for Android developers

After building MemoMind, a few things stand out — both as genuine reasons to be excited about hybrid inference and as honest caveats worth knowing before you architect something around it.

What works well

The API design is the real win here. Firebase's decision to wrap hybrid routing in OnDeviceConfig rather than forcing you to maintain two separate model instances means you're not writing conditional execution paths throughout your codebase. The SDK absorbs the routing complexity. For a feature that's still experimental, the ergonomics are surprisingly clean.

The privacy story is also more meaningful than it might first appear. When inference runs locally, it's not just that data doesn't get logged somewhere — it's that the architecture of your app changes. You can make stronger guarantees to users, design features that handle genuinely sensitive input, and avoid the legal grey areas that come with sending personal data to a third-party API. For anyone building in health, fitness, or productivity, that's a real design unlock.

The backend indicator chip in MemoMind isn't just a demo trick. In a production app, surfacing this to users — even subtly — builds a kind of trust that's hard to communicate through a privacy policy alone.

Current limitations worth knowing

It's experimental. The @OptIn(PublicPreviewAPI::class) annotation isn't just boilerplate — it's a signal that the API surface can change in backwards-incompatible ways without deprecation notice. Don't build a production release around this without a contingency plan.

Gemini Nano capability gap. The on-device model is smaller and quantized. For MemoMind's use case — short transcripts, structured JSON output — it performs well. For complex reasoning, longer context, or nuanced instruction-following, you'll notice the quality gap compared to cloud Gemini. Know your prompt's complexity profile before relying on on-device quality.

Model download on first run. Gemini Nano downloads in the background after first launch via ML Kit. Until it's cached, every request goes to cloud. For apps where on-device is a hard privacy requirement rather than a preference, you'll need to listen to the model download state and gate the feature accordingly.

Device support is not universal. On-device inference requires a device that supports Gemini Nano. Recent Pixel flagships and a growing range of Samsung devices qualify, but that's still a small slice of the Android install base. Your PREFER_ON_DEVICE config will silently fall back to cloud on unsupported hardware, which is fine for most cases but worth tracking in your analytics.
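Tracking that split is cheap if you're already on Firebase: one analytics event per summary shows how often each device class actually runs on-device. A sketch assuming the firebase-analytics dependency is added alongside the others:

// Log which backend produced each summary (KTX-style event builder)
Firebase.analytics.logEvent("memo_summarized") {
    param("inferred_by", memo.inferredBy) // "on_device" or "cloud"
    param("device_model", Build.MODEL)
}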

Self-reporting inferredBy is a workaround. Asking the model to report its own execution context in the JSON works in practice but isn't a guaranteed contract. The official SDK doesn't yet expose a post-response property for which backend ran. Watch the firebase-ai-ondevice changelog for when this is added properly.

The bigger picture

Hybrid inference is one piece of a broader shift in what mobile AI can look like. The ability to run meaningful inference locally — even in a limited form — changes the kinds of apps you can design. Features that felt too latency-sensitive for a cloud round trip, or too privacy-sensitive to send off-device, become viable. The on-device model will improve over time. Device support will grow. The API will stabilize.

The developers who understand this stack now, rough edges and all, will have a meaningful head start when it does.


Where to go from here

MemoMind as we've built it is a solid foundation. The transcribe → summarize → save pipeline is generic enough to adapt to a wide range of use cases — meeting notes, workout logs, daily journals, field reports. The structured JSON prompt and the Firestore schema transfer cleanly to any domain.

A few natural next steps if you want to keep building:

Add a memo list screen. A simple Firestore real-time listener showing past memos with their on-device/cloud chip makes the app feel complete — and gives you a live view of how often on-device inference wins in practice across your test devices. (A minimal listener sketch follows these steps.)

Gate on model download completion. ML Kit exposes a download state API for Gemini Nano. Listening to it and showing a subtle "AI ready" indicator once it's cached is a small touch that makes a real UX difference for privacy-sensitive apps.

Add the model device support check. Before showing the on-device privacy promise to users, check at runtime whether their device actually supports Gemini Nano inference and surface that information appropriately.
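For the memo list screen mentioned above, the real-time listener is only a few lines. A sketch assuming a list ViewModel that exposes a MutableStateFlow<List<Memo>> called _memos:

// Live "memos" collection, newest first
firestore.collection("memos")
    .orderBy("createdAt", Query.Direction.DESCENDING)
    .addSnapshotListener { snapshot, error ->
        if (error != null || snapshot == null) return@addSnapshotListener
        _memos.value = snapshot.documents.map { doc ->
            Memo(
                id = doc.id,
                transcript = doc.getString("transcript") ?: "",
                summary = doc.getString("summary") ?: "",
                actionItems = (doc.get("actionItems") as? List<*>)
                    ?.filterIsInstance<String>() ?: emptyList(),
                inferredBy = doc.getString("inferredBy") ?: "cloud",
                createdAt = doc.getLong("createdAt") ?: 0L
            )
        }
    }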

iOS support for hybrid inference is not yet available — watch the Firebase changelog. The Android-first experimental API suggests the Dart/Flutter SDK will follow once the underlying infrastructure is proven on the native platforms.

The full source for MemoMind is on GitHub at https://github.com/RealWorldApplications/memo-mind-post. If you run into anything unexpected with the hybrid routing on your device, or see the chip behave differently than expected, I'd like to hear about it.


Resources
