Submitted for the Gemma 4 Challenge: Write About Gemma 4
A 2B model running entirely on my local machine, no cloud, no API key, produced a correctly rendered interactive UI layout on the first attempt. Not what I expected going into this.
I ran all four Gemma 4 variants through OpenUI, a generative UI framework that turns model output directly into rendered components. The two smaller models ran locally via Ollama. The larger two came through OpenRouter and Ollama Cloud.
# Specifications
$os -> Windows
$ram -> 16GB DDR5
$gpu -> RTX 3050 Ti Laptop GPU (4GB VRAM, 90W)
$inference -> Ollama + OpenRouter
The question I wanted to answer: are the smaller Gemma 4 models actually useful for structured generation tasks, or just impressive for their size in a way that falls apart the moment you ask them to do real work?
Short answer: more capable than I expected, with a ceiling that is real but higher than most people assume.
Why OpenUI Makes a Better Test Than Standard Benchmarks
Most model benchmarks are forgiving. A model can hedge, partially answer, or pad a response and still score well.
OpenUI is not forgiving. The framework uses a declarative language called openui-lang where the model's output maps directly to rendered UI components. Every variable referenced in a layout must be defined. Arguments are positional; named syntax silently breaks things. Every component name has to match the schema exactly. A wrong component name returns a diagnostic. An undefined reference drops the section without any warning.
A prompt like "create a sales dashboard with a stats table and follow-up suggestions" requires the model to:
- Output `root = Card(...)` first so the UI shell renders immediately during streaming
- Reference every defined variable from its parent, or it gets silently dropped
- Use only positional arguments: `Table([col1, col2])`, not `Table(columns=[col1, col2])`
Here is what correct output looks like for a simple dashboard:
root = Card([header, statsTable, suggestions])
header = CardHeader("Sales Overview", "Q4 2025")
statsTable = Table([regionCol, revenueCol, growthCol])
regionCol = Col("Region", ["North", "South", "East", "West"])
revenueCol = Col("Revenue", [142000, 98000, 176000, 115000])
growthCol = Col("Growth %", [12, 7, 21, 9])
suggestions = FollowUpBlock([fu1, fu2])
fu1 = FollowUpItem("Break this down by month")
fu2 = FollowUpItem("Show the lowest performing region")
Miss any of those constraints and you get a partial render or nothing. That is why OpenUI generation gives you a concrete pass/fail result instead of a vibes-based quality assessment.
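To make those constraints concrete, here is a minimal sketch of a pre-render check in Python. It is not OpenUI's parser: the `lint` function, its regexes, and its messages are hypothetical, and it assumes every statement fits on one line in the `name = Component(...)` form shown above, with `true`, `false`, and `null` treated as literals.

```python
import re

# Hypothetical helper patterns; openui-lang's real grammar may differ.
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
STATEMENT = re.compile(r'^([A-Za-z_]\w*)\s*=\s*([A-Za-z_]\w*)\((.*)\)$')
NAMED_ARG = re.compile(r'\b([A-Za-z_]\w*)\s*=(?!=)')
# An identifier that is neither a component call nor a named-argument key.
IDENT = re.compile(r'\b([A-Za-z_]\w*)\b(?!\s*[(=])')
LITERALS = {"true", "false", "null"}  # assumption about literal keywords


def lint(source: str) -> list[str]:
    """Return a list of problems found in openui-lang-style output."""
    problems = []
    refs = {}    # variable name -> names it references
    order = []   # variable names in definition order
    for number, raw in enumerate(source.splitlines(), start=1):
        # Blank out string contents so words inside quotes are not parsed.
        line = STRING.sub('""', raw).strip()
        if not line:
            continue
        match = STATEMENT.match(line)
        if match is None:
            problems.append(f"line {number}: not a 'name = Component(...)' statement")
            continue
        name, _component, args = match.groups()
        for named in NAMED_ARG.findall(args):
            problems.append(
                f"line {number}: named argument '{named}=' is not allowed; "
                "arguments are positional"
            )
        refs[name] = [i for i in IDENT.findall(args) if i not in LITERALS]
        order.append(name)

    if order and order[0] != "root":
        problems.append("first statement must define 'root'")

    for name in order:
        for ref in refs[name]:
            if ref not in refs:
                problems.append(f"'{name}' references undefined variable '{ref}'")

    # Walk the reference graph from root to find definitions nothing uses.
    reachable = set()
    stack = ["root"] if "root" in refs else []
    while stack:
        current = stack.pop()
        if current in reachable:
            continue
        reachable.add(current)
        stack.extend(r for r in refs[current] if r in refs)
    for name in order:
        if name not in reachable:
            problems.append(f"'{name}' is never referenced from root and would be dropped")
    return problems
```

An empty list means the output satisfies the three rules listed above. It says nothing about whether component names exist in the schema, which is the renderer's job.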
Test Setup
| Model | Type | Inference |
|---|---|---|
| Gemma 4 E2B | Small MoE | Ollama local |
| Gemma 4 E4B | Small MoE | Ollama local |
| Gemma 4 26B | MoE | OpenRouter |
| Gemma 4 31B | Dense | OpenRouter / Ollama Cloud |
I ran each model through a range of prompts: simple (single stat card, basic table, short follow-up list) through to complex (multi-section dashboard, accordion with nested content, form with validation). About 15 prompts per model across complexity levels.
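For anyone who wants to repeat this kind of tally, a pass-rate loop can be sketched in a few lines of Python. This is an illustration rather than the exact script behind the numbers in this post: `pass_rates`, `generate`, and `is_valid` are hypothetical names, where `generate` stands in for whatever calls the model and `is_valid` for whatever decides that a layout rendered correctly.

```python
from collections import defaultdict
from typing import Callable, Iterable


def pass_rates(
    prompts: Iterable[tuple[str, str]],
    generate: Callable[[str], str],
    is_valid: Callable[[str], bool],
    attempts: int = 1,
) -> dict[str, float]:
    """Return the fraction of valid outputs per complexity level.

    `prompts` yields (level, prompt) pairs, e.g. ("simple", "single stat card").
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for level, prompt in prompts:
        for _ in range(attempts):
            total[level] += 1
            if is_valid(generate(prompt)):
                passed[level] += 1
    return {level: passed[level] / total[level] for level in total}
```

Keeping the model call and the validity check as injected functions means the same loop works for a local Ollama model and a hosted one.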
For full local setup: Setting Up OpenUI with Ollama
E2B: Useful at the Low End, Not Beyond It
On simple prompts (single-card layouts, basic tables, short lists), E2B completed correctly about 7 out of 10 times. The model got the basic openui-lang structure right, followed component names reliably, and produced usable output.
On complex prompts (multi-section dashboards, nested structures, anything requiring consistency across more than a dozen variable definitions), that dropped to about 2 or 3 out of 10. The layout shell would start correctly, then lose coherence midway, leaving a valid outer frame with broken or missing inner components.
The working output was genuinely usable. Not "impressive for 2B" usable, but actually usable. For a simple dashboard or form-based prototype, you can get working UI output offline on consumer hardware with a model small enough to run alongside other applications.
If you have a 16GB machine and want to try E2B today: it works for simple, well-scoped prompts. Keep your layouts shallow and your variable chains short.
E4B: Better Quality, Memory Is the Real Constraint
E4B was a step up on layout consistency. Component hierarchies held together longer. Moderately complex prompts that E2B frequently failed on came through correctly.
The constraint on a 16GB system is RAM. E4B pushes memory hard. During larger generations I watched utilization climb toward the ceiling, and the failures had a specific and frustrating pattern: data sections disappeared, layout blocks became incomplete, the model stopped mid-output. It was not a crash but a quiet failure, where the rendered UI looks fine at first glance until you notice entire sections are just absent.
It took me a while to diagnose because it did not look like a failure; it looked like the model had decided not to render certain components. Monitoring RAM during generation was what clarified it. E4B peaked at 14–15GB on my machine during complex generations.
Rough thresholds from what I observed:
- 16GB: E4B is inconsistent on anything complex
- 32GB: E4B should be reliable across most prompt types
- 64GB+: comfortable for the larger models locally
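The RAM monitoring described above can be approximated with a small sampler that records peak system memory while a generation runs. This is a sketch under assumptions: `PeakMemorySampler` and `system_used_bytes` are hypothetical names, and the default reader relies on the third-party psutil package, which has to be installed separately.

```python
import threading
from typing import Callable


def system_used_bytes() -> int:
    # Assumption: psutil is installed (pip install psutil); it is not stdlib.
    import psutil
    return psutil.virtual_memory().used


class PeakMemorySampler:
    """Context manager that records the highest memory reading seen while active."""

    def __init__(
        self,
        read_used_bytes: Callable[[], int] = system_used_bytes,
        interval_s: float = 0.5,
    ):
        self._read = read_used_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_bytes = 0

    def _sample(self) -> None:
        self.peak_bytes = max(self.peak_bytes, self._read())

    def _run(self) -> None:
        # Event.wait returns False on timeout, so this samples once per interval.
        while not self._stop.wait(self._interval_s):
            self._sample()

    def __enter__(self) -> "PeakMemorySampler":
        self._sample()
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> bool:
        self._stop.set()
        self._thread.join()
        self._sample()
        return False
```

Wrapping a generation call in `with PeakMemorySampler() as sampler:` and reading `sampler.peak_bytes` afterwards gives a number to compare against installed RAM. A peak close to the ceiling alongside missing sections suggests memory pressure rather than a prompt problem.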
26B: Where Reliability Kicks In
Switching to 26B through OpenRouter was an immediate change. Layouts that E4B would drop sections from, 26B completed on the first attempt. The model held structure across longer generations without degrading. Prompts that needed multiple retries locally just worked.
The instruction-following across longer output sequences is different in kind, not just degree. On complex dashboard prompts that require a dozen or more correct variable references, 26B kept them all consistent.
One practical note: 26B is too heavy for 16GB RAM locally, and there is no free tier on OpenRouter. You are paying API costs for serious use.
31B: The Most Consistent Results
The 31B dense model was the most reliable across every prompt type. Simple layouts, complex dashboards, nested structures, longer generations: the output held together in all of them.
The 31B is available for local download but will not run on 16GB RAM. I used it through OpenRouter and Ollama Cloud.
Ollama Cloud is worth knowing about: it is free to use, which means you get 31B-quality output at no cost. The catch is rate limits. It is practical for testing and moderate use, not for anything needing high throughput. Both cloud options mean your prompts are leaving your machine, which matters if you are working with anything sensitive.
How Each Model Actually Fails
This was the most useful thing I learned from the whole test. The failures were not random, and each model size failed in its own way. Knowing which failure pattern you are dealing with changes how you respond to it.
E2B: Structural breakdown in longer outputs. The model starts a layout correctly, then loses coherence in nested sections. You get a valid shell with broken inner components. The fix is simpler prompts, not retries.
E4B: Memory-pressure truncation. The model generates correct output until RAM runs out, then stops. The rendered UI looks complete until you notice missing sections. Monitor RAM. The fix is either more memory or smaller prompts.
26B and 31B: Semantic errors rather than structural ones, such as a wrong component name or a mismatched prop type. These are fixable because the renderer returns specific diagnostics like `unknown-component: DataGrid, available: Table, Col, BarChart`, so you can tell the model exactly what to correct. One follow-up prompt usually fixes it.
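The follow-up step for the larger models can be sketched as a function that turns a renderer diagnostic into a correction prompt. The diagnostic format is taken from the example above. The `repair_prompt` name and the wording of the generated prompt are hypothetical, not part of OpenUI.

```python
import re

# Matches diagnostics shaped like the example quoted in this post.
UNKNOWN_COMPONENT = re.compile(
    r'^unknown-component:\s*(?P<bad>\w+),\s*available:\s*(?P<available>.+)$'
)


def repair_prompt(diagnostic: str, previous_output: str) -> str:
    """Build a follow-up prompt asking the model to fix one reported error."""
    diagnostic = diagnostic.strip()
    match = UNKNOWN_COMPONENT.match(diagnostic)
    if match is None:
        # Unrecognized format: pass the diagnostic through verbatim.
        instruction = f"The renderer reported: {diagnostic}"
    else:
        bad = match.group("bad")
        options = ", ".join(n.strip() for n in match.group("available").split(","))
        instruction = f"The component '{bad}' does not exist. Use one of: {options}."
    return (
        f"{instruction}\n"
        "Return the full corrected layout and change nothing else.\n\n"
        f"Previous output:\n{previous_output}"
    )
```

Sending the previous output back with one specific instruction keeps the correction narrow, which matches how a single follow-up was usually enough in these tests.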
Which Model for Which Situation
| Model | Best For | Constraint |
|---|---|---|
| E2B | Simple local prototyping, 16GB RAM, no cloud | Breaks on complex layouts; ~7/10 on simple prompts |
| E4B | Better local quality | Needs 32GB+ for reliable results; silent failures at 16GB |
| 26B | Reliable structured generation via API | Too heavy for 16GB RAM; no free tier |
| 31B | Best consistency; free via Ollama Cloud | Too heavy for local 16GB; rate limits on free tier |
What to Actually Take Away
Before running these tests, I assumed 2B–4B parameter models were useful for quick experiments and not for structured generation tasks that require strict schema adherence.
That assumption was wrong in a specific way. For well-scoped, simple prompts, E2B produced correct structured UI output. Not output that was impressive given its size, but output that was usable for a real prototyping task. The gap between "small local model" and "requires cloud API" is narrower than it was a year ago, and Gemma 4 is a meaningful part of why.
For anything complex, 26B and 31B are in a different category. But if you are on a consumer machine and want to prototype a simple dashboard or form-based tool without touching a cloud API, E2B is a practical starting point today.
Start simple. Know where the ceiling is. Work within it.
Resources
Additional setup guide, configs, and testing resources:
👉 GitHub & OpenUI Setup Walkthrough







