Carlos Barbero for Google Developer Experts

Posted on May 13 • Edited on May 26

Local-First AI Done Right: How Gemma 4 E2B and 'Thinking Mode' Powered DiagramFlowAI

#devchallenge #gemmachallenge #gemma #opensource

Gemma 4 Challenge: Build With Gemma 4 Submission

Over the past few months, the tech world has shifted from "AI-capable" to "Agentic-driven". But as developers, we face a major challenge: How do we build autonomous, code-executing AI agents that respect privacy, run locally, and don't cost a fortune in API credits?

In this article, I’ll show you how I built DiagramFlowAI, a local-first desktop app that uses Gemma 4 E2B and E4B (via flutter_gemma) with Thinking Mode to generate architecture diagrams from natural language, running everything on-device.

🚀 Live Repo: github.com/carlosrgomes/DiagramFlowAI

💡 If you find this architecture useful, don't forget to leave a star!

What I Built

DiagramFlowAI is a local-first desktop application (macOS, Windows, and Linux) that transforms natural language descriptions into production-ready architecture diagrams. It intelligently generates standard Mermaid syntax for general workflows, or outputs structured commands mapping icons for cloud architectures.

The application solves a very specific tension in modern software engineering: privacy versus productivity. When architects and engineers sketch out internal systems—such as authentication flows, proprietary data pipelines, or secure cloud perimeters—sending that data to a cloud-based LLM endpoint is often a compliance deal-breaker.

DiagramFlowAI is designed to be completely self-contained. Powered by flutter_gemma and LiteRT, it runs 100% locally. After the initial model download, it requires zero internet connection, uses no API keys, and has no telemetry. It’s an AI diagramming studio that respects your company’s security posture.

Demo

Code

github.com/carlosrgomes/DiagramFlowAI

How I Used Gemma 4

Most AI showcases default to the largest model available. I did the exact opposite. I deliberately built DiagramFlowAI around Gemma 4 E2B and E4B—the edge variants—and intentionally skipped the 31B Dense and 26B MoE models. Here is why the smallest variants were the secret to making this desktop app work, and how Gemma 4's "Thinking Mode" unlocked capabilities I didn't expect.

The Unfashionable Choice: Small over Large

If you're building a high-throughput backend, the 31B Dense or 26B MoE are obvious choices. However, my deployment constraints pointed in a completely different direction:

Democratic Hardware Requirements: A 31B dense model in 4-bit quantization demands around 16-20 GB of RAM. The E4B model comfortably fits within 4-6 GB and runs smoothly on integrated GPUs. That’s the difference between an app anyone can use and a toy restricted to high-end workstations.
Frictionless Onboarding: The moment a user has to paste an API key, onboarding conversion plummets. Because E2B and E4B are open weights, users can simply click "download" and start diagramming. No auth walls, no billing setups.
Snappy Cold Starts: In a desktop app, the first interaction needs to feel immediate. The E2B model loads and responds in seconds on modern M-series Macs and modern PCs, keeping the user in their flow state.

To give users flexibility, I built in a toggle between E2B (faster) and E4B (more accurate on complex syntax), rather than hardcoding a single option.
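The memory figure for the 31B dense model can be sanity-checked with simple arithmetic. The snippet below is a minimal sketch in Python (not the app's Dart code); `weight_memory_gb` is a hypothetical helper, and it counts only the quantized weights, so the KV cache and runtime overhead come on top of the result.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 31 billion parameters stored at 4 bits each.
dense_31b = weight_memory_gb(31, 4)
print(f"31B @ 4-bit: {dense_31b:.1f} GB of weights before cache and overhead")
```

Roughly 15.5 GB of weights is consistent with the 16-20 GB range quoted above once working memory is added. The same estimate is not applied to E2B or E4B here, because their stored parameter counts are not given in this article.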

The Underrated Superpower: Thinking Mode

If there is one thing every developer building with Gemma 4 should internalize, it's the power of the reasoning trace. The flutter_gemma SDK exposes Gemma 4's internal reasoning as a distinct stream of ThinkingResponse chunks.

For diagram generation, this is a game-changer. Mermaid syntax is notoriously fragile—a stray colon, an unquoted string, or a missing end tag will break the entire render. Without Thinking Mode, a 4B parameter model will often confidently output syntactically broken DSLs in one shot.

With Thinking Mode enabled, the model spends a few hundred tokens planning its structure first ("OK, this is a sequence diagram, I need actor -> participant -> arrow -> response..."). Consequently, the final output is dramatically more reliable.

In the UI, I expose this trace as a collapsed accordion (e.g., "Thinking · 2.4s"). This subtle UX choice builds user trust and makes the generation wait feel highly productive, without overwhelming them with raw logs.
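To make the idea concrete, here is a minimal sketch of how a mixed stream of reasoning and answer chunks could be separated and timed for the accordion label. It is written in Python for brevity, not taken from the app: `ThinkingChunk`, `TextChunk`, `split_stream`, and `accordion_label` are hypothetical stand-ins, not the flutter_gemma API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Union


@dataclass
class ThinkingChunk:
    """Stand-in for one chunk of the model's reasoning trace."""
    text: str


@dataclass
class TextChunk:
    """Stand-in for one chunk of the final answer."""
    text: str


@dataclass
class SplitResult:
    thinking: str
    answer: str
    thinking_seconds: float


def split_stream(
    chunks: Iterable[Union[ThinkingChunk, TextChunk]],
    clock: Callable[[], float] = time.monotonic,
) -> SplitResult:
    """Separate reasoning from answer and time the reasoning phase."""
    start = clock()
    thinking_end: Optional[float] = None
    thinking_parts: List[str] = []
    answer_parts: List[str] = []
    for chunk in chunks:
        if isinstance(chunk, ThinkingChunk):
            thinking_parts.append(chunk.text)
        else:
            # The first answer chunk marks the end of the thinking phase.
            if thinking_end is None:
                thinking_end = clock()
            answer_parts.append(chunk.text)
    if thinking_end is None:
        thinking_end = clock()  # stream ended without any answer chunk
    return SplitResult(
        "".join(thinking_parts), "".join(answer_parts), thinking_end - start
    )


def accordion_label(seconds: float) -> str:
    """Format the collapsed-accordion title."""
    return f"Thinking · {seconds:.1f}s"
```

Keeping the two kinds of chunk in separate buffers means the reasoning text can sit behind the collapsed accordion while only the answer is passed on to the diagram parser.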

Pragmatic Patterns for 4B Models

Fighting with the model for a few weeks led me to a few hard-won architectural patterns:

Treat the System Prompt as a Grammar, Not a Personality: Small models pattern-match exceptionally well. My 500-line system prompt isn't about making the AI "helpful"; it's an output contract. I use explicit delimiters (<DIAGRAM>...</DIAGRAM>) and provide "syntax cards" showing the most common parser failures (e.g., NEVER write X). Teaching the model what not to do prevents entire classes of bugs.
Trust the Contract over Regex: Instead of fighting fragile markdown fences with complex Regex, I rely on the XML-style delimiters defined in the system prompt. Even when the model decides to write an explanatory paragraph, the actual code is safely wrapped and easily extracted.
Engineer the Recovery Loop: Even with Thinking Mode, complex diagrams might occasionally fail to parse. Instead of trying to prompt-engineer my way to a 100% success rate (which is near impossible at 4B), I built a small ReAct-style retry loop. If the Mermaid parser throws an error, the app feeds the exact error message back into a follow-up turn. The model almost always fixes its syntax on the second attempt.
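
The delimiter contract and the recovery loop fit together roughly as follows. This is a minimal sketch in Python under stated assumptions, not the app's Dart implementation: `send` stands for any function that posts one turn to an ongoing chat session and returns the reply, `validate` stands for a Mermaid parser wrapper that returns an error message or `None`, and `max_attempts` is an arbitrary choice.

```python
from typing import Callable, Optional

OPEN, CLOSE = "<DIAGRAM>", "</DIAGRAM>"


def extract_diagram(reply: str) -> Optional[str]:
    """Return the text between the delimiters, or None if they are missing."""
    start = reply.find(OPEN)
    if start == -1:
        return None
    start += len(OPEN)
    end = reply.find(CLOSE, start)
    if end == -1:
        return None
    return reply[start:end].strip()


def generate_with_recovery(
    prompt: str,
    send: Callable[[str], str],                # hypothetical: one chat turn in, reply out
    validate: Callable[[str], Optional[str]],  # hypothetical: error message or None
    max_attempts: int = 3,
) -> str:
    """Ask for a diagram and feed parser errors back as follow-up turns."""
    turn = prompt
    last_error = "no attempt made"
    for _ in range(max_attempts):
        diagram = extract_diagram(send(turn))
        if diagram is None:
            last_error = f"missing {OPEN}...{CLOSE} block"
        else:
            error = validate(diagram)
            if error is None:
                return diagram
            last_error = error
        # The follow-up turn carries the exact error message back to the model.
        turn = (
            f"The previous diagram failed to parse:\n{last_error}\n"
            f"Return the corrected diagram inside {OPEN}...{CLOSE}."
        )
    raise ValueError(f"no valid diagram after {max_attempts} attempts: {last_error}")
```

Plain string search is enough here because the delimiters are fixed by the system prompt, so any explanatory prose around them is simply ignored. Capping the attempts keeps a stubborn failure from looping forever; the right cap depends on how long users are willing to wait.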

Gemma 4 E2B and E4B prove that you don't need a massive, cloud-hosted LLM to ship a genuinely useful, structured AI application. If you map your deployment constraints, lean hard on the system prompt, enable Thinking Mode, and engineer a smart recovery loop, these edge models become a feature, not a compromise.

Do you think "Local-first AI" is the future for enterprise dev tools, or is the convenience of Cloud LLMs too hard to beat? Also, if you have ideas on how to improve the Mermaid recovery loop, let’s chat in the comments!

Top comments (59)

Max Quimby • May 16

The local-first argument gets stronger every quarter, but I think it deserves a sharper qualifier: it works beautifully for tasks where you can bound the input. Diagram generation is a near-perfect fit because the schema is constrained and the model is allowed to be slow — a user will happily wait 4 seconds for a clean mermaid graph.

Where it falls over is open-ended conversation or long-context reasoning, where Gemma 4 E2B's headroom runs out fast against frontier-tier models. So I read "local-first" less as "always" and more as "use the smallest model that hits the task's quality bar, and a surprising number of tasks have a low bar." Did you do any A/B testing where you ran the same diagram requests through a cloud model and compared output? I'm curious how visible the gap is to actual users vs. how visible it is on a benchmark.

Carlos Barbero Google Developer Experts • May 16

Excellent observation. You captured the exact sweet spot of this approach: reading 'local-first' as 'using the smallest model that hits the task's quality bar' is the perfect definition for pragmatic AI integration.
You are completely right about its limits with long-context and open-ended conversations. For diagrams, the bounded scope and the Thinking Mode make up for the model's smaller size, and users gladly accept a few seconds of latency in exchange for keeping their architectural data private.
Regarding the A/B testing and the visible gap: the biggest difference for the actual user isn't usually in the final syntax (the error-recovery loop handles that pretty well), but in dealing with ambiguous descriptions. Frontier cloud models are much better at filling in context gaps when a user gives vague or unstructured instructions. Smaller models like Gemma 4 E2B/E4B, on the other hand, require more direct and explicit prompts to stay on track.
That is exactly why, in the future, I intend to make other models available within the tool to dive deeper into these tests. The goal is to allow real A/B comparisons and give users the freedom to choose their preferred trade-off between absolute privacy and frontier reasoning capabilities. Thanks for the great insights!

Julio Saraiva • May 25

Excellent article, Carlos! It's very inspiring to see a practical application that demystifies the idea that we always need gigantic cloud models for structured tasks. Your approach of treating the System Prompt as a rigid grammar specification and implementing the ReAct loop for self-correction is pure quality software engineering.

I have a technical question about the user experience (UX): the Thinking Mode of the Gemma 4 Edge models (E2B/E4B) adds an overhead of tokens generated before the final response. In your tests with very complex diagrams (with multiple subgraphs or many nodes), did the increase in generation time (Time to First Token of the actual Mermaid output) become a point of friction for the user, or was the visual streaming of the reasoning log (the accordion in the UI) enough to mask that latency and preserve the flow state? I'd really like to understand how you evaluated that tolerable limit.

Carlos Barbero Google Developer Experts • May 25

Thanks, Julio! In the app I added a visual log showing the reasoning so the user isn't left waiting without knowing what is happening. Another point is that the better the machine you are using, the faster the model runs.

Julio Saraiva • May 25

Thanks, Carlos, and thank you for sharing your knowledge! Very inspiring!

Lucas Fernandes • May 23

Congratulations on the article, Carlos! Fantastic solution. Using Thinking Mode on a local Gemma 4 proves that edge models can handle highly structured data. I'm already designing an adaptation of this architecture for the tattoo market: using the local model to parse chaotic, descriptive client briefings and turn them into standardized spec sheets (style, elements, anatomical constraints) with full privacy. Your article was an excellent blueprint for applying rigid constraints locally!

Carlos Barbero Google Developer Experts • May 24

Thank you!

PracHub • May 13

Using DiagramFlowAI locally to keep sensitive architecture data private is smart. I'm curious about "Thinking Mode" and how effectively it reduces errors in Mermaid syntax. Does it really help with complex workflows? I found a specific bank for system design on PracHub that matched what I saw in the OA. It's been more useful for structured prompts than going through random Glassdoor threads.

Carlos Barbero Google Developer Experts • May 13

Your point is brilliant! You captured exactly the essence of what we're building. Privacy in DiagramFlowAI isn't an optional feature; it's what I call the project's "Firmware"—it's at the foundation of everything. In the corporate world, designing a proprietary database architecture or authentication flow in a cloud-based SaaS is a compliance risk that many can't afford to take. Running 100% locally with Gemma 4 definitively solves this.

Regarding your question about Thinking Mode, the implementation in the code (which you can check in lib/models/ai_engine_service.dart) treats ThinkingResponse as a prioritized data flow before TextResponse.

For small models like Gemma 4 E2B/E4B, this is a game-changer for three technical reasons I observed in the project:

Reduction of "Syntactic Hallucination": Mermaid is sensitive. A parenthesis ( instead of a square bracket [ breaks the renderer. Thinking Mode forces the model to "map" the hierarchy of the nodes before writing the code. It plans the subgraph and the connections (-->) first.
Deterministic Grammar: Since 2B/4B models have less "working memory," I treat the System Prompt almost like a compiler grammar (see in lib/data/prompt_templates.dart how the prompts are structured as rigid contracts: "Output ONLY a flowchart... do not output prose"). Internal reasoning ensures that this contract is fulfilled.
Complexity in Flows: In complex sequence diagrams (like the OAuth login diagram we have in the templates), the model uses "thinking" to ensure that the actors are declared before the interactions. Without this, small models often get lost in the middle of the flow.

And about PracHub, you're absolutely right! Structured prompt banks are the "High Octane Fuel" for what I call Vibe Coding. In DiagramFlowAI, instead of relying on generic prompts, we use templates that function as "Syntax Cards," teaching the model exactly what not to do to avoid common parser errors.

The future is local, private, and, with Gemma 4's reasoning, extremely intelligent.

Harjot Singh • Jun 1

really interesting take on local-first architecture and how it tackles privacy concerns in AI. the use of E2B sandboxes sounds like a smart move. at moonshift, we help you get a full next.js + postgres + auth app deployed in about 7 minutes, and you own the code on your github. if you're curious, happy to give you a complimentary run to check it out.

Maria Leticia • May 23

The market often focuses only on gigantic cloud models, but your project proves the value of pragmatic engineering: using the smaller Gemma 4 variants (E2B and E4B) not only democratizes hardware access, but Thinking Mode also provides the robustness that Mermaid syntax requires.

Congratulations on the project and on sharing these practical lessons!

Carlos Barbero Google Developer Experts • May 23

Thank you!

André Favretto • May 23

I had the opportunity to try the preview version of this application that master Barbero developed in a corporate environment.

That earlier version of the diagram tool did not run locally, but it already showed the potential.

Seeing that the whole architecture has now evolved into a model that is more efficient, uses local processing, and still delivers the same output quality is really incredible.

The optimization achieved, the choice of methods, and above all the main limiting factor, token processing, were resolved by the optimized adoption of Gemma. Retries are essential when you want to guarantee consistency before delivering the final result, and the context combined with the XML recovery method allows future additions and increments on top of the output.

Congratulations again, Barbero, you are an inspiration 💪

Carlos Barbero Google Developer Experts • May 23

Thank you!

mariana lima dias • May 13

A project aimed squarely at a real challenge: using AI without compromising sensitive data. Choosing to prioritize local execution and the smaller Gemma 4 models not only reduces technical friction but also removes a critical adoption barrier in corporate environments. Here, architecture matters more than brute force. Combining Thinking Mode with output control and a correction loop shows important maturity: the point is not just generating with AI, but guaranteeing consistency and reliability in the result. It is a clear example of how AI applied with strategy beats AI applied with excess scale. Congratulations.

Carlos Barbero Google Developer Experts • May 13

Thank you!

Bedenego Quintino Junior • May 19

Excellent project! Congratulations on the initiative and the pragmatic approach. Today, AI tools that generate diagrams and architectures save architects and developers hours of work. However, the financial cost and the risk of exposing sensitive data tend to be major obstacles.

Carlos Barbero Google Developer Experts • May 19

Thank you!

Lucas Damasceno • May 18

It's incredible how far we've come.. and on top of that, being able to run it anywhere without internet (on a plane during a flight, like some folks are already doing) — awesome!!! Congrats on the great work, Barbero!!

Carlos Barbero Google Developer Experts • May 19

Thanks!
