Real benchmarks. No server. Just a Galaxy A35 5G and some patience.
Most "on-device AI" articles are written by people who tested on a Pixel 9 Pro or an iPhone 15 Pro Max, ran 10 prompts, and called it a day. This isn't that. I've been daily-driving Google's Gemma 4 E2B and E4B models on a Samsung Galaxy A35 5G (8GB RAM) for a few weeks — using them to work through Python logic puzzles and generate hardware scripts — across two different apps, two different inference backends, with close attention to what was actually limiting performance. This wasn't a quick stress test. My main focus was the E4B, because that's where things get interesting.
What Is Gemma 4, Really?
Before we get into numbers, let's make sure we're on the same page about what "E2B" and "E4B" actually mean — because it's not obvious.
The "E" stands for "Effective" parameters, not just "Edge." These models use a technique called Per-Layer Embeddings (PLE), where instead of a single embedding lookup at the input layer like a standard transformer, each decoder layer gets its own dedicated embedding vector for every token. This makes a 2B-active-parameter model punch significantly above its weight — the E2B has roughly 2.3B effective parameters but closer to 5B total stored parameters, and the E4B sits around 4.5B effective. That's why the Q4_K_M quantized file for E4B comes in at around 4.8 GB — larger than you'd expect from a "4 billion parameter" model.
Both models are genuinely multimodal out of the box: text, images, and audio — all natively, no cloud required. They use a hybrid attention mechanism that mixes local sliding window attention (512-token window) with full global attention at key layers, giving them long-context awareness without the memory cost of full attention on every layer. Context window support goes up to 128K tokens on paper.
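That hybrid attention pattern is easy to picture as a mask. One common way to define it (again, my own sketch, not Gemma's code):

```python
import numpy as np

def attention_mask(seq_len: int, window: int = 512, global_layer: bool = False):
    """Causal mask: full attention on 'global' layers, a sliding window of
    the last `window` tokens on 'local' layers. Illustrative only."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # can never attend to the future
    if global_layer:
        return causal                 # every past token is visible
    return causal & (i - j < window)  # only the last `window` tokens

# A local layer at position 1000 sees just 512 tokens; a global layer sees all.
m = attention_mask(1024, window=512, global_layer=False)
print(m[1000].sum())  # -> 512
```

The memory saving comes from the local layers: their KV cache stays bounded at the window size instead of growing with the full context.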
My Setup
Device: Samsung Galaxy A35 5G — Exynos 1380, 4× Cortex-A78 performance cores, 4× Cortex-A55 efficiency cores, Mali-G68 MP5 GPU, 8GB LPDDR4X RAM.
Models tested:
- Gemma 4 E4B — Q4_K_M quantization, downloaded from Google's official Hugging Face repo, 4.8 GB
- Gemma 4 E2B — Q4_K_M quantization, same source
Apps used:
- PocketPal AI — ran local GGUF files, CPU inference only (more on why below)
- LLM Hub — downloaded the model in-app, used GPU acceleration
Settings: 4096-token context length and 4096-token max generation, identical on both models and both apps.
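For anyone who wants to reproduce these settings outside a phone app: PocketPal is llama.cpp-based, so something like the following llama-cpp-python setup is a rough desktop equivalent. The GGUF filename here is a placeholder; adjust it to whatever you downloaded.

```python
# Rough desktop equivalent of my PocketPal settings, via llama-cpp-python.
# The model filename is a placeholder -- point it at your own GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-e4b-Q4_K_M.gguf",  # ~4.8 GB file from the HF repo
    n_ctx=4096,       # context length used in all my tests
    n_threads=4,      # 4 threads, matching the A78 performance-core count
    n_gpu_layers=0,   # CPU-only path, matching PocketPal on this device
)

out = llm("Write a Python MD5 hashing script.", max_tokens=4096)
print(out["choices"][0]["text"])
```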
The GPU Problem: Mali-G68 MP5 Isn't Supported in PocketPal
Here's the first real-world catch that no spec sheet will tell you: PocketPal AI does not support the Mali-G68 MP5 GPU on the Galaxy A35. That means all inference in PocketPal ran entirely on the CPU, and specifically on the 4 Cortex-A78 performance cores, with the efficiency cores left out of the equation.
This isn't a flaw in Gemma 4. It's a compatibility gap between the inference runtime (llama.cpp-based) and Mali's OpenCL/Vulkan implementation on this particular chip. Snapdragon and Apple Silicon users have a much smoother experience here — the A35's Exynos 1380 sits outside the well-tested hardware matrix for most on-device AI apps right now.
Benchmark Results
Gemma 4 E4B — The Main Event
| Backend | App | Tokens/sec |
|---|---|---|
| CPU only (4× A78) | PocketPal | 3.4 – 3.6 |
| GPU (Mali-G68 MP5) | LLM Hub | ~3.8 |
The GPU gave a small but real boost, roughly 5–12%. But it came at a cost: noticeably higher battery drain. Worse, LLM Hub eventually became unstable during extended sessions; it crashed and then failed to reload the model into memory at all. That sealed it: Mali-G68 MP5 GPU support is still too buggy to trust as a daily driver. The CPU path in PocketPal, while slower on paper, was the only backend that stayed reliable across weeks of testing. For a gain this modest paired with that level of instability, there's no contest.
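If you want to sanity-check numbers like these yourself, a minimal harness along these lines (llama-cpp-python again, same placeholder filename as above) reports both time-to-first-token and overall throughput. It counts streamed chunks, which for this API is approximately one per token:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e4b-Q4_K_M.gguf", n_ctx=4096, n_threads=4)

start = time.perf_counter()
ttft = None
n_tokens = 0
for chunk in llm("Explain memory bandwidth briefly.", max_tokens=256, stream=True):
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start   # time-to-first-token, includes prompt processing
    n_tokens += 1            # ~one streamed chunk per generated token
total = time.perf_counter() - start

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"{n_tokens} tokens in {total:.1f}s -> {n_tokens / total:.2f} tok/s")
```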
Thermal & Short-Script Advantage
Sustained generation tells only part of the story. For short, bursty tasks, like asking Gemma 4 E4B to write a Python MD5 hashing script, the phone can briefly exceed its sustained throughput ceiling. After letting the device cool down and clearing background RAM, I hit a peak of 4.10 tokens/second with an 870ms time-to-first-token (TTFT). That's a noticeably snappier experience than the 3.4–3.6 tok/s average suggests. Thermal throttling and background memory pressure are real enemies here; managing them buys you meaningful headroom on short prompts without changing any model settings.
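For reference, the script that prompt asks for is tiny, which is exactly why it makes a good burst test. Here's my own minimal version (not the model's verbatim output):

```python
# Minimal MD5 file-hashing script -- the sort of short task I used for the
# burst benchmarks. Note: MD5 is fine for checksums, not for security.
import hashlib
import sys

def md5_of_file(path: str, chunk_size: int = 8192) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    print(md5_of_file(sys.argv[1]))
```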
Gemma 4 E2B — For Comparison
Running E2B on the same device produced a much more comfortable 5–7 tokens/second on CPU. That's the difference between "reads like a human typing" and "feels like AI lag." E2B is fast enough for real interactive use on this hardware. However, it fails to match the E4B in quality and reasoning, making it better suited for simple tasks and quick queries.
The Real Bottleneck: Memory Bandwidth
Here's the most important finding from this whole experiment, and it's something that doesn't get talked about enough in mobile AI discussions:
Neither the CPU nor the GPU was the limiting factor. Memory bandwidth was.
During inference, both the CPU and GPU were running at or near 100% utilization, yet token throughput still plateaued. The E4B's weights are large, and on every single token generation step the system has to stream them from RAM across the memory bus. The Galaxy A35's LPDDR4X memory subsystem simply can't push data fast enough to keep the compute units fed.
This is a fundamental constraint of running large quantized models on consumer mobile hardware. The compute is there. The bandwidth isn't. It explains why going from CPU to GPU gave only a marginal improvement — you moved the computation, but the data still has to travel the same slow road.
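A back-of-envelope calculation makes the point. Assume each decode step streams most of the 4.8 GB of quantized weights, and assume an LPDDR4X peak somewhere around 17 GB/s for this class of chip (my estimate, not a measured figure). The ceiling lands almost exactly where the benchmarks did:

```python
# Back-of-envelope bandwidth ceiling. Both inputs are assumptions, not
# measurements: weight bytes streamed per token, and peak LPDDR4X bandwidth.
weights_gb = 4.8        # ~full Q4_K_M weight file read per decoded token
peak_bw_gbps = 17.0     # assumed LPDDR4X peak for this class of chip

max_tok_s = peak_bw_gbps / weights_gb
print(f"Bandwidth-bound ceiling: ~{max_tok_s:.1f} tok/s")  # -> ~3.5 tok/s
```

That ~3.5 tok/s ceiling sits right on top of the 3.4–3.6 tok/s I measured, which is about as clean a signature of a bandwidth-bound workload as you'll get.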
Devices with LPDDR5X or unified memory architectures (like Apple Silicon or newer Snapdragon 8 Elite) show much better performance on the same models precisely because they have higher memory bandwidth, not because their CPUs or GPUs are proportionally faster.
What This Means for Mid-Range Android Users
The Galaxy A35 sits in a realistic slice of the Android market — it's not a budget phone, but it's not flagship either. Running E4B at 3.4–3.6 tokens/second is functional, but not smooth. Here's the honest breakdown:
- E2B at 5–7 tok/s → usable for back-and-forth conversation, lightweight reasoning, quick queries
- E4B at 3.4–3.8 tok/s → better output quality, but you'll wait. Good for non-interactive tasks: summarization, analysis, drafting
- GPU acceleration on Mali-G68 MP5 → limited by app support right now. Don't expect miracles even when it works
If you want to run Gemma 4 on a mid-range Android device today, E2B is the pragmatic choice. E4B is worth running if output quality matters more than speed — but set your expectations accordingly.
Why Not Just Use the Giants?
A fair question. The Gemma 4 family also includes a 26B A4B Mixture-of-Experts model and a 31B Dense model — and I tested both via cloud inference. The reasoning quality is genuinely impressive; the 31B in particular shows that Google can compete at the frontier level in the cloud. But benchmarking those misses the entire point of what makes the E2B and E4B remarkable.
The 31B proves Google can build a world-class cloud model. The E2B and E4B prove something far more consequential: offline, multimodal AI is finally trickling down to the devices the rest of the world actually owns.
Think about what it means to run a 4.8 GB multimodal model natively on a mid-range phone — no Wi-Fi, no API key, no monthly subscription — in a country where data costs money and flagship hardware is out of reach for most developers. That's not a footnote. That's a shift. The gap between "AI you can experiment with" and "AI you can build on" just got a lot smaller for anyone who doesn't have a MacBook Pro or a cloud budget to burn.
Should You Try This?
Yes — with context.
Gemma 4's edge models are a genuine achievement. Multimodal input (vision + audio + text), function calling, 128K context, all in a model small enough to fit on a phone. A year ago this capability profile didn't exist at any size you could run locally.
But the Galaxy A35 — and phones like it — are memory-bandwidth constrained in a way that no amount of quantization fully solves. The models run. They produce good output. They just don't run fast.
If you're on a flagship device with LPDDR5X or an NPU with proper driver support, your experience will be materially different. For everyone else on mid-range hardware: E2B is your friend, E4B is your weekend project.
Tested on Samsung Galaxy A35 5G (8GB RAM, Exynos 1380). Models: Gemma 4 E2B and E4B, Q4_K_M quantization from Google's Hugging Face repository. Inference via PocketPal AI (CPU) and LLM Hub (GPU). Settings: 4096 context / 4096 token generation.