DEV Community

Local-First AI Done Right: How Gemma 4 E2B and 'Thinking Mode' Powered DiagramFlowAI

Carlos Barbero on May 13, 2026

Over the past few months, the tech world has shifted from "AI-capable" to "Agentic-driven". But as developers, we face a major challenge: How do we...

Max Quimby • May 16

The local-first argument gets stronger every quarter, but I think it deserves a sharper qualifier: it works beautifully for tasks where you can bound the input. Diagram generation is a near-perfect fit because the schema is constrained and the model is allowed to be slow — a user will happily wait 4 seconds for a clean mermaid graph.

Where it falls over is open-ended conversation or long-context reasoning, where Gemma 4 E2B's headroom runs out fast against frontier-tier models. So I read "local-first" less as "always" and more as "use the smallest model that hits the task's quality bar, and a surprising number of tasks have a low bar." Did you do any A/B testing where you ran the same diagram requests through a cloud model and compared output? I'm curious how visible the gap is to actual users vs. how visible it is on a benchmark.

Carlos Barbero Google Developer Experts • May 16

Excellent observation. You captured the exact sweet spot of this approach: reading 'local-first' as 'using the smallest model that hits the task's quality bar' is the perfect definition for pragmatic AI integration.
You are completely right about its limits with long-context and open-ended conversations. For diagrams, the bounded scope and the Thinking Mode make up for the model's smaller size, and users gladly accept a few seconds of latency in exchange for keeping their architectural data private.
Regarding the A/B testing and the visible gap: the biggest difference for the actual user isn't usually in the final syntax (the error-recovery loop handles that pretty well), but in dealing with ambiguous descriptions. Frontier cloud models are much better at filling in context gaps when a user gives vague or unstructured instructions. Smaller models like Gemma 4 E2B/E4B, on the other hand, require more direct and explicit prompts to stay on track.
That is exactly why, in the future, I intend to make other models available within the tool to dive deeper into these tests. The goal is to allow real A/B comparisons and give users the freedom to choose their preferred trade-off between absolute privacy and frontier reasoning capabilities. Thanks for the great insights!
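
For readers who want to picture the error-recovery loop mentioned above, here is a minimal sketch in Python (not the project's Dart code). It assumes a hypothetical `generate` callable that wraps the local model and a hypothetical `validate` callable that returns an error message or `None`; neither name comes from DiagramFlowAI itself.

```python
from typing import Callable, Optional


def generate_with_recovery(
    generate: Callable[[str], str],             # hypothetical wrapper around the local model
    validate: Callable[[str], Optional[str]],   # hypothetical checker: error text, or None if valid
    request: str,
    max_attempts: int = 3,
) -> str:
    """Ask for a diagram and, on a validation error, re-prompt with the error text."""
    prompt = request
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(prompt)
        last_error = validate(candidate)
        if last_error is None:
            return candidate
        # Feed the parser error back so the next attempt can correct it.
        prompt = (
            f"{request}\n\nYour previous output was rejected: {last_error}\n"
            "Return only the corrected diagram."
        )
    raise ValueError(f"no valid diagram after {max_attempts} attempts: {last_error}")
```

Capping the attempts keeps worst-case latency bounded, which matters more on local hardware than on a cloud endpoint.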

Julio Saraiva • May 25

Excellent article, Carlos! It is very inspiring to see a practical application that demystifies the idea that we always need gigantic cloud models for structured tasks. Your approach of treating the System Prompt as a rigid grammar specification and implementing the ReAct loop for self-correction is pure, quality software engineering.

I have a technical question about the user experience (UX): the Thinking Mode of the Gemma 4 Edge models (E2B/E4B) adds an overhead of tokens generated before the final answer. In your tests with very complex diagrams (multiple subgraphs or many nodes), did the longer generation time (Time to First Token of the actual Mermaid output) become a point of friction for the user, or was the visual streaming of the reasoning log (the accordion in the UI) enough to mask that latency and keep the user in a flow state? I would really like to understand how you assessed that tolerable limit.

Carlos Barbero Google Developer Experts • May 25

Thanks, Julio! In the application I added a visual log showing the reasoning so the user is not left waiting without knowing what is happening. Another point is that the better the machine you are using, the faster the model runs.

Julio Saraiva • May 25

Thanks, Carlos, and thank you for sharing your knowledge! Very inspiring!

Lucas Fernandes • May 23

Congratulations on the article, Carlos! Fantastic solution. The use of Thinking Mode in local Gemma 4 proves that edge models can handle highly structured data. I am already designing an adaptation of this architecture for the tattoo market: using the local model to parse chaotic, descriptive client briefs and turn them into standardized spec sheets (style, elements, anatomical constraints) with total privacy. Your article was an excellent blueprint for applying rigid constraints locally!

Carlos Barbero Google Developer Experts • May 24

Thank you!

PracHub • May 13

Using DiagramFlowAI locally to keep sensitive architecture data private is smart. I'm curious about "Thinking Mode" and how effectively it reduces errors in Mermaid syntax. Does it really help with complex workflows? I found a specific bank for system design on PracHub that matched what I saw in the OA. It's been more useful for structured prompts than going through random Glassdoor threads.

Carlos Barbero Google Developer Experts • May 13

Your point is brilliant! You captured exactly the essence of what we're building. Privacy in DiagramFlowAI isn't an optional feature; it's what I call the project's "Firmware"—it's at the foundation of everything. In the corporate world, designing a proprietary database architecture or authentication flow in a cloud-based SaaS is a compliance risk that many can't afford to take. Running 100% locally with Gemma 4 definitively solves this.

Regarding your question about Thinking Mode, the implementation in the code (which you can check in lib/models/ai_engine_service.dart) treats ThinkingResponse as a prioritized data flow before TextResponse.
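
To illustrate the idea of handling reasoning events separately from answer events, here is a small Python sketch (the project itself is Dart, and this is not the code in ai_engine_service.dart). The class names `ThinkingResponse` and `TextResponse` are taken from the description above; the `text` field and the `split_stream` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple, Union


@dataclass
class ThinkingResponse:
    text: str  # assumed field: a chunk of the model's reasoning


@dataclass
class TextResponse:
    text: str  # assumed field: a chunk of the final answer (the Mermaid code)


Event = Union[ThinkingResponse, TextResponse]


def split_stream(events: Iterable[Event]) -> Tuple[str, str]:
    """Route reasoning chunks to a log and answer chunks to the diagram source."""
    reasoning: List[str] = []
    answer: List[str] = []
    for event in events:
        if isinstance(event, ThinkingResponse):
            reasoning.append(event.text)  # would feed the reasoning log in the UI
        else:
            answer.append(event.text)     # would feed the Mermaid renderer
    return "".join(reasoning), "".join(answer)
```

Keeping the two channels apart means the reasoning can be shown to the user while it streams, and only the answer channel ever reaches the Mermaid parser.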

For small models like Gemma 4 E2B/E4B, this is a game-changer for three technical reasons I observed in the project:

Reduction of "Syntactic Hallucination": Mermaid is sensitive. A parenthesis ( instead of a square bracket [ breaks the renderer. Thinking Mode forces the model to "map" the hierarchy of the nodes before writing the code. It plans the subgraph and the connections (-->) first.

Deterministic Grammar: Since 2B/4B models have less "working memory," I treat the System Prompt almost like a compiler grammar (see in lib/data/prompt_templates.dart how the prompts are structured as rigid contracts: "Output ONLY a flowchart... do not output prose"). Internal reasoning ensures that this contract is fulfilled.

Complexity in Flows: In complex sequence diagrams (like the OAuth login diagram we have in the templates), the model uses "thinking" to ensure that the actors are declared before the interactions. Without this, small models often get lost in the middle of the flow.
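
The "actors declared before the interactions" rule from the last point can also be checked mechanically after generation. The following Python sketch is an illustration only, not part of DiagramFlowAI: the function name and the regular expressions are assumptions, and they cover only simple one-word participant names and the common arrow types. Mermaid itself also accepts participants that are introduced implicitly by their first message, so this is a stricter house rule rather than a parser requirement.

```python
import re
from typing import List

# "participant Name" or "actor Name", optionally followed by "as Alias".
DECLARATION = re.compile(r"^(?:participant|actor)\s+(\w+)")
# "Sender->>Receiver: text", covering the common sequence-diagram arrows.
MESSAGE = re.compile(
    r"^(\w+)\s*(?:-->>|->>|-->|->|--x|-x|--\)|-\))\s*[+-]?(\w+)\s*:"
)


def undeclared_participants(diagram: str) -> List[str]:
    """Report every message that uses a participant before it is declared."""
    declared = set()
    problems: List[str] = []
    for number, raw in enumerate(diagram.splitlines(), start=1):
        line = raw.strip()
        declaration = DECLARATION.match(line)
        if declaration:
            declared.add(declaration.group(1))
            continue
        message = MESSAGE.match(line)
        if message:
            for name in message.groups():
                if name not in declared:
                    problems.append(
                        f"line {number}: '{name}' used before declaration"
                    )
    return problems
```

A non-empty result could be fed back to the model as the error text for a retry, so the check and the recovery loop work together.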

And about PracHub, you're absolutely right! Structured prompt banks are the "High Octane Fuel" for what I call Vibe Coding. In DiagramFlowAI, instead of relying on generic prompts, we use templates that function as "Syntax Cards," teaching the model exactly what not to do to avoid common parser errors.

The future is local, private, and, with Gemma 4's reasoning, extremely intelligent.

Harjot Singh • Jun 1

really interesting take on local-first architecture and how it tackles privacy concerns in AI. the use of E2B sandboxes sounds like a smart move. at moonshift, we help you get a full next.js + postgres + auth app deployed in about 7 minutes, and you own the code on your github. if you're curious, happy to give you a complimentary run to check it out.

Maria Leticia • May 23

The market often focuses only on gigantic cloud models, but your project proves the value of pragmatic engineering: using the smaller Gemma 4 variants (E2B and E4B) not only democratizes access in terms of hardware, but Thinking Mode also provides the robustness that Mermaid syntax requires.

Congratulations on the project and on sharing these practical lessons!

Carlos Barbero Google Developer Experts • May 23

Thank you!

André Favretto • May 23

I had the chance to try the earlier version of this application that master Barbero developed in a corporate environment.

That version of the diagram tool did not run locally at the time, but it already showed the potential.

Now, seeing that the whole architecture has evolved into a model that is more efficient, uses local processing, and still delivers the same output quality is truly incredible.

The optimization achieved, the choice of methods, and above all the main limiting factor, token processing, were resolved by the optimized adoption of Gemma. Retries are essential when you want to ensure consistency before delivering the final result, and the context combined with the XML recovery method allows future additions and increments on top of the output.

Congratulations again, Barbero, you are an inspiration 💪

Carlos Barbero Google Developer Experts • May 23

Thank you!

mariana lima dias • May 13

A project extremely well aimed at a real challenge: using AI without compromising sensitive data. The choice to prioritize local execution and the smaller Gemma 4 models not only reduces technical friction but also removes a critical adoption barrier in corporate environments. Here, architecture matters more than brute force. The use of Thinking Mode combined with output control and a correction loop shows important maturity: it is not just about generating with AI, it is about ensuring consistency and reliability in the result. It is a clear example of how AI applied with strategy beats AI applied with excess scale. Congratulations.

Carlos Barbero Google Developer Experts • May 13

Thank you!

Bedenego Quintino Junior • May 19

Excellent project! Congratulations on the initiative and the pragmatic approach. Today, AI tools that generate diagrams and architectures save architects and developers hours of work. However, the financial cost and the risk of exposing sensitive data tend to be major obstacles.

Carlos Barbero Google Developer Experts • May 19

Thank you!

Lucas Damasceno • May 18

It's incredible how far we've come.. and on top of that, being able to run it anywhere without internet (on a plane during a flight, like some folks are already doing) — awesome!!! Congrats on the great work, Barbero!!

Carlos Barbero Google Developer Experts • May 19

Thanks!

Luan Sousa • May 20

Great read! It’s fascinating to see how compact models like Gemma 2 2B, when paired with techniques like 'Thinking Mode,' can deliver the kind of structured reasoning we previously only expected from much larger models. The 'Local-First' approach for DiagramFlowAI is a massive differentiator for both privacy and latency. Thanks for sharing such a detailed technical breakdown!

Carlos Barbero Google Developer Experts • May 20

thanks

Daniel Sousa • May 26

Great read, Carlos! As a Google Cloud Support Manager, I deal daily with the constant tension between the need for AI-driven productivity and the strict security and compliance policies that enterprise clients must adhere to.
DiagramFlowAI hits the nail on the head with its 'local-first' approach.

Carlos Barbero Google Developer Experts • May 26

Thanks Daniel!

William Miranda • May 15 • Edited

The DiagramFlowAI project is a brilliant example of how AI can be applied pragmatically and strategically to solve real engineering problems. The decision to adopt a "local-first" approach is spot on, because it directly addresses the tension between productivity and privacy, letting architects work with sensitive data without the compliance risks associated with cloud LLMs.

It is impressive how the technical choice of the Gemma 4 E2B and E4B models turned what could have been a limitation into a competitive advantage: by prioritizing smaller models, the solution democratizes access (running on common hardware and integrated GPUs) and delivers a smooth user experience, without the barrier of API keys or billing setup.
In addition, the maturity of the development is evident in the use of "Thinking Mode". This feature not only improves the reliability of Mermaid syntax, which is notoriously fragile, but also educates the user by exposing the AI's reasoning in the interface, turning waiting time into something productive.
The additional refinement with a recovery loop (ReAct) and the treatment of the system prompt as a rigid grammar show that the project's effectiveness comes from architectural intelligence, not just from the model's brute force.
Congratulations on creating a tool that truly respects companies' security posture while elevating developers' workflow.

Carlos Barbero Google Developer Experts • May 16

Thank you!

Vinicius Inacio • May 23

This is a truly fascinating step forward in technology. Google's Antigravity system has the potential to completely redefine how we think about innovation, engineering, and the future of mobility. The level of creativity and ambition behind this project is incredibly inspiring — excited to see how this evolves in the coming years!

Carlos Barbero Google Developer Experts • May 23

Thanks!

Diego Martinez L. • May 22

Excellent article, your project is very inspiring. 🚀

Carlos Barbero Google Developer Experts • May 23

Thanks

Marcos Barbero • May 16

Great article, I’ll explore some of it in my projects

Carlos Barbero Google Developer Experts • May 16

Thanks!

Maicon Romario • May 19

Great article! Building this locally with the smaller Gemma models was a really smart choice. Congratulations! 🚀

Carlos Barbero Google Developer Experts • May 19

thanks!

Jonatas Onca • May 21

Congratulations Barbero! Very well detailed!

Carlos Barbero Google Developer Experts • May 21

Thanks!

Fernanda Oliveira • May 21

Wonderful project!
Always contributing valuable knowledge and insights.
Congratulations!!!!

Carlos Barbero Google Developer Experts • May 21

Thank you!

Jonas Barletta • May 21

Excellent project! Congratulations!! 👏👏

Carlos Barbero Google Developer Experts • May 21

Thank you!

Pedro Henrique • May 21

A standout project, Barbero, I really liked it. I'll replicate it here and play around with it a bit. So cool!!!

Carlos Barbero Google Developer Experts • May 21

Thank you!

Ícaro Joel Moura Pinto • May 21

The fact that it runs locally, with no chance of exposing sensitive data over the network, is a step forward, given that a local machine can process the context needed for a good result.

Carlos Barbero Google Developer Experts • May 21

Exactly!

Carlos Takeshi Sato • May 21

Great Article, Barbero!!

Carlos Barbero Google Developer Experts • May 21

Thanks!

Valmil Candido da Silva Junior • May 18

Congrats, Barbero!

It's a great article on how to use dialogflow with local benefits.

I'll try to implement here!

Carlos Barbero Google Developer Experts • May 18

Thanks!

Angela Bruna • May 23

Thank you very much, Carlos

Carlos Barbero Google Developer Experts • May 24

Glad you liked it!

Poliana Micheline • May 24

Great article! Loved the local-first approach and the practical use of Gemma 4 + thinking mode. Nice balance between privacy, autonomy and intelligent workflows

Carlos Barbero Google Developer Experts • May 24

very good!

Lucas Mazim • May 26

Yes, I think Local-first AI will be the future! I see that companies are looking for more sustainable ways to work with LLM's and looking for more accountability.

Carlos Barbero Google Developer Experts • May 26

Interesting! I believe we'll have a balance between on-premises and cloud-based models.

Kauê Lima • Jun 8

Very good article, Carlos, great teaching style. I'll explore some of these points in my projects too.