DEV Community

eleonorarocchi
Local LLM with Google Gemma: On-Device Inference Between Theory and Practice

TL;DR

Running an LLM locally on a smartphone is now possible—and it’s not even that exotic anymore. The interesting part is no longer whether it can be done, but how it’s done and what trade-offs actually emerge: model format, runtime, performance, and distribution.

To understand this better, I built a small Flutter app that performs on-device inference using LiteRT-LM and a Gemma 3n E2B model.

The Starting Point

Anyone working with LLMs already knows: local inference isn’t new. Between quantization, smaller models, and optimized runtimes, running models directly on devices is a real path.

So the interesting question today is no longer “can it be done?”, but rather: what does this integration actually look like when you bring it to mobile?

To answer that, I chose a deliberately simple setup: a Flutter app, a textarea, a button, and a response generated locally by the model. No backend, no API, no remote calls. Just the app and the model.

Why LiteRT-LM

It’s worth pausing here, because the runtime significantly changes the kind of work you’re doing.

LiteRT-LM is not the only option for on-device inference. In the mobile local-model landscape, alternatives like llama.cpp (with GGUF models, widely used for quantized LLMs), ONNX Runtime (more focused on cross-platform portability), and ExecuTorch (the mobile runtime from the PyTorch ecosystem, still maturing) offer different approaches depending on the model type and target hardware.

The main advantage of LiteRT-LM is its native integration with the Android ecosystem and its direct support for hardware delegates such as the device's GPU and NPU. That makes it the most straightforward choice for on-device inference, with no format conversions or external dependencies to manage.

That said, there is a trade-off: the approach is less flexible than others. You can’t just use “any” model on the fly—you either use models already prepared for LiteRT or handle the conversion yourself.

Why Gemma 3n E2B

For the model, I used this variant:

https://huggingface.co/litert-community/Gemma-3n-E2B-it-litert-lm

The choice is not random. The Gemma 3n family includes different variants designed to balance capability and computational requirements. The E2B version is interesting because it sits at a sensible middle ground: it's not the largest model in the family, but it's capable enough to produce useful output while still being compact enough to make sense on a smartphone.

In other words: it’s a practical choice—not because it’s “the best ever,” but because it represents the kind of compromise that makes sense when constraints include not just output quality, but also memory, loading time, and inference speed.

The First Thing You Notice: Size

The file you download from Hugging Face weighs about 2.4 GB.

That’s not automatically a deal-breaker. Today, app stores and distribution systems offer various strategies for handling large assets: dynamic downloads, splits, additional modules, local caching...

Still, it’s important to be aware of this when thinking about production, because you’ll definitely need to reason concretely about how to package and distribute your app.

For a simple experiment like this, the easiest approach is to include the model in the app assets and then copy it to the local filesystem on first launch.

If you’re wondering why the model needs to be copied to the local filesystem, the reason is simple: LiteRT-LM (and many ML runtimes in general) require a file path on disk because they need direct access to the model file. During inference, the runtime constantly jumps between different parts of the model and accesses specific blocks (layers, weights, cache), often reusing data or working in parallel. This requires fast random access. Also, the model is not fully loaded into memory but memory-mapped as needed. None of this is feasible with a stream from assets, which only provides sequential access.

A Step-by-Step Guide

1. Create the Flutter project

From the terminal:

flutter create edge_llm_app
cd edge_llm_app
flutter run

At this point, you’ll see the classic default Flutter app with the counter.

2. Add LiteRT-LM to the Android project

This step adds the Android runtime required to run the model on-device.

Open the file:

android/app/build.gradle.kts

If there’s no dependencies block, you can add one at the end of the file. Inside it, insert:

dependencies {
    implementation("com.google.ai.edge.litertlm:litertlm-android:latest.release")
}

3. Enable the native library for GPU backend

OpenCL is what lets the app use the GPU (and other accelerators) for general-purpose computation rather than graphics, which is exactly what the heavy workloads of a language model need. Of course, this only works if the device supports it.

Open the file:

android/app/src/main/AndroidManifest.xml

Find the <application> tag and add this line inside it:

<uses-native-library
    android:name="libOpenCL.so"
    android:required="false" />

This allows the app to use OpenCL if the device supports it.

4. Download the model

Download the .litertlm file from:

https://huggingface.co/litert-community/Gemma-3n-E2B-it-litert-lm

In the Files and versions tab, you’ll find the model file. For simplicity, you can rename it:

gemma.litertlm

5. Copy the model into the right folder

Create the assets folder if it doesn’t exist:

android/app/src/main/assets

Then place the downloaded file inside:

android/app/src/main/assets/gemma.litertlm

6. Create the Flutter ↔ Android bridge

In the Flutter project, create this file:

lib/llm_service.dart

And paste this code:

import 'package:flutter/services.dart';

class LlmService {
  static const _channel = MethodChannel('llm');

  static Future<void> init() async {
    await _channel.invokeMethod('init');
  }

  static Future<String> ask(String prompt) async {
    final result = await _channel.invokeMethod<String>('ask', {
      'prompt': prompt,
    });
    return result ?? '';
  }
}

This file is the bridge between the Flutter UI and the native Android code that will actually run the model.

7. Modify MainActivity.kt

Open:

android/app/src/main/kotlin/com/example/edge_llm_app/MainActivity.kt

(The exact path may vary slightly depending on your package name.)

Replace the content with a version that:

  • initializes the engine
  • copies the model from assets
  • exposes two methods to Flutter: init and ask

For example:

// (code unchanged)

This is the core of the integration. The model is copied from assets to the filesystem, the runtime is initialized, and the prompt is passed to the model.
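The exact code depends on the LiteRT-LM version you use, so here is only a sketch of the shape of that MainActivity. The channel name (`llm`) matches the Dart side; `LlmEngine` and its `create`/`generate` methods are hypothetical placeholders for the actual LiteRT-LM Kotlin API, whose real names you should take from the official documentation:

```kotlin
package com.example.edge_llm_app

import io.flutter.embedding.android.FlutterActivity
import io.flutter.embedding.engine.FlutterEngine
import io.flutter.plugin.common.MethodChannel
import java.io.File

class MainActivity : FlutterActivity() {

    // Placeholder type: substitute the real LiteRT-LM engine class here.
    private var engine: LlmEngine? = null

    override fun configureFlutterEngine(flutterEngine: FlutterEngine) {
        super.configureFlutterEngine(flutterEngine)

        MethodChannel(flutterEngine.dartExecutor.binaryMessenger, "llm")
            .setMethodCallHandler { call, result ->
                when (call.method) {
                    "init" -> {
                        val modelFile = ensureModelOnDisk()
                        // Hypothetical factory call: use the real LiteRT-LM API.
                        engine = LlmEngine.create(modelFile.absolutePath)
                        result.success(null)
                    }
                    "ask" -> {
                        val prompt = call.argument<String>("prompt") ?: ""
                        // Hypothetical inference call; in a real app, run this
                        // off the main thread so the UI doesn't freeze.
                        val answer = engine?.generate(prompt) ?: "Engine not initialized"
                        result.success(answer)
                    }
                    else -> result.notImplemented()
                }
            }
    }

    // One-time copy from APK assets to a real file path the runtime can mmap.
    private fun ensureModelOnDisk(): File {
        val modelFile = File(filesDir, "gemma.litertlm")
        if (!modelFile.exists()) {
            assets.open("gemma.litertlm").use { input ->
                modelFile.outputStream().use { output -> input.copyTo(output) }
            }
        }
        return modelFile
    }
}
```

The structure is what matters: a single MethodChannel, an `init` call that prepares the model file and engine, and an `ask` call that forwards the prompt and returns the generated text.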

8. Replace the default Flutter UI

Open:

lib/main.dart

Replace its content with something simple but usable, for example:

// (code unchanged)

At this point, you have a minimal UI that’s sufficient to test inference.
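If you want a concrete starting point, a minimal version could look like the following Dart sketch (an illustration wired to the `LlmService` from step 6, not the app's original code):

```dart
import 'package:flutter/material.dart';

import 'llm_service.dart';

void main() => runApp(const EdgeLlmApp());

class EdgeLlmApp extends StatelessWidget {
  const EdgeLlmApp({super.key});

  @override
  Widget build(BuildContext context) {
    return const MaterialApp(home: PromptPage());
  }
}

class PromptPage extends StatefulWidget {
  const PromptPage({super.key});

  @override
  State<PromptPage> createState() => _PromptPageState();
}

class _PromptPageState extends State<PromptPage> {
  final _controller = TextEditingController();
  String _answer = '';
  bool _busy = false;

  @override
  void initState() {
    super.initState();
    LlmService.init(); // copies the model and initializes the engine
  }

  Future<void> _send() async {
    setState(() => _busy = true);
    final answer = await LlmService.ask(_controller.text);
    setState(() {
      _answer = answer;
      _busy = false;
    });
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Edge LLM')),
      body: Padding(
        padding: const EdgeInsets.all(16),
        child: Column(
          children: [
            TextField(controller: _controller, maxLines: 4),
            const SizedBox(height: 12),
            ElevatedButton(
              onPressed: _busy ? null : _send,
              child: Text(_busy ? 'Thinking…' : 'Send'),
            ),
            const SizedBox(height: 12),
            Expanded(child: SingleChildScrollView(child: Text(_answer))),
          ],
        ),
      ),
    );
  }
}
```

A text field, a button, and a text area for the answer: just enough UI to exercise the native side.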

9. Run the app on your phone

Now you can run:

flutter run

This is where you see the difference compared to an API call.

When you press “Send,” the phone does the work. The UI may freeze for a few seconds, then the response arrives (the UI can definitely be improved, but that’s not the goal here).

From the logs, you can clearly see the different phases of inference: prefill, generation, output.

And most importantly: everything happens locally!

What This Exercise Really Shows

In the end, the interesting point is not proving that you can run an LLM on a phone. That’s already established.

The real insight is understanding what kind of integration you are building.

LiteRT-LM simplifies execution on mobile but requires you to accept a specific ecosystem. Gemma 3n E2B makes sense because it sits in a realistic range for this type of use. And the model size is not so much an absolute deal-breaker as it is an architectural variable you need to manage.

The biggest difference, however, is conceptual: when working with APIs, AI is an external service. Here, it becomes part of the application itself. You start reasoning in terms of filesystem, memory, initialization time, hardware, and acceleration.

You’re no longer just making a request.

You’re executing something locally.

And that’s the most interesting paradigm shift of all.
