Over the past few months, the tech world has shifted from "AI-capable" to "Agentic-driven". But as developers, we face a major challenge: How do we build autonomous, code-executing AI agents that respect privacy, run locally, and don't cost a fortune in API credits?
In this article, I’ll show you how I built DiagramFlowAI, a local-first application that uses Gemma 4 (via flutter_gemma) with Thinking Mode to generate architecture diagrams dynamically, running everything on-device with the E2B and E4B edge models.
🚀 Live Repo: github.com/carlosrgomes/DiagramFlowAI
💡 If you find this architecture useful, don't forget to leave a star!
What I Built
DiagramFlowAI is a local-first desktop application (macOS, Windows, and Linux) that transforms natural language descriptions into production-ready architecture diagrams. It generates standard Mermaid syntax for general workflows, and for cloud architectures it outputs structured commands that map each component to an icon.
The application solves a very specific tension in modern software engineering: privacy versus productivity. When architects and engineers sketch out internal systems—such as authentication flows, proprietary data pipelines, or secure cloud perimeters—sending that data to a cloud-based LLM endpoint is often a compliance deal-breaker.
DiagramFlowAI is designed to be completely self-contained. Powered by flutter_gemma and LiteRT, it runs 100% locally. After the initial model download, it requires zero internet connection, uses no API keys, and has no telemetry. It’s an AI diagramming studio that respects your company’s security posture.
Demo
Code
github.com/carlosrgomes/DiagramFlowAI
How I Used Gemma 4
Most AI showcases default to the largest model available. I did the exact opposite. I deliberately built DiagramFlowAI around Gemma 4 E2B and E4B—the edge variants—and intentionally skipped the 31B Dense and 26B MoE models. Here is why the smallest variants were the secret to making this desktop app work, and how Gemma 4's "Thinking Mode" unlocked capabilities I didn't expect.
The Unfashionable Choice: Small over Large
If you're building a high-throughput backend, the 31B Dense or 26B MoE are obvious choices. However, my deployment constraints pointed in a completely different direction:
- Democratic Hardware Requirements: A 31B dense model in 4-bit quantization demands around 16-20 GB of RAM. The E4B model comfortably fits within 4-6 GB and runs smoothly on integrated GPUs. That’s the difference between an app anyone can use and a toy restricted to high-end workstations.
- Frictionless Onboarding: The moment a user has to paste an API key, onboarding conversion plummets. Because E2B and E4B are open weights, users can simply click "download" and start diagramming. No auth walls, no billing setups.
- Snappy Cold Starts: In a desktop app, the first interaction needs to feel immediate. The E2B model loads and responds in seconds on M-series Macs and recent PCs, keeping the user in their flow state.
To give users flexibility, I built in a toggle between E2B (faster) and E4B (more accurate on complex syntax), rather than hardcoding a single option.
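The RAM figure in the first bullet can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an illustration (the function name is hypothetical, and it counts the weights alone): parameters times bits per weight, divided by eight, gives bytes. The KV cache, activations, and runtime overhead come on top, which is consistent with the 16-20 GB range quoted for a 31B model at 4-bit.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes.

    Assumption: every parameter is stored at the same bit width.
    KV cache, activations, and runtime overhead are NOT included.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # 31B dense model at 4-bit quantization: weights only.
    print(weight_memory_gb(31, 4))  # 15.5
    # A nominal 4B-parameter model at 4-bit, for comparison.
    print(weight_memory_gb(4, 4))   # 2.0
```

The gap between the 2 GB computed for a nominal 4B model and the 4-6 GB quoted above presumably reflects everything this formula leaves out, plus how the model is packaged, so treat the output as a lower bound rather than a requirement.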
The Underrated Superpower: Thinking Mode
If there is one thing every developer building with Gemma 4 should internalize, it's the power of the reasoning trace. The flutter_gemma SDK exposes Gemma 4's internal reasoning as a distinct stream of ThinkingResponse chunks.
For diagram generation, this is a game-changer. Mermaid syntax is notoriously fragile—a stray colon, an unquoted string, or a missing end tag will break the entire render. Without Thinking Mode, a 4B parameter model will often confidently output syntactically broken DSLs in one shot.
With Thinking Mode enabled, the model spends a few hundred tokens planning its structure first ("OK, this is a sequence diagram, I need actor -> participant -> arrow -> response..."). Consequently, the final output is dramatically more reliable.
In the UI, I expose this trace as a collapsed accordion (e.g., "Thinking · 2.4s"). This subtle UX choice builds user trust and makes the generation wait feel highly productive, without overwhelming them with raw logs.
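To make the idea of two separate streams concrete, here is a minimal, language-agnostic sketch written in Python. It does not reproduce the flutter_gemma API: `ThinkingChunk`, `TextChunk`, `split_stream`, and `accordion_label` are hypothetical stand-ins for whatever the SDK and the app actually use. It shows only the routing logic: reasoning chunks go to the collapsed accordion, answer chunks go to the diagram pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass
class ThinkingChunk:
    """Hypothetical stand-in for a chunk of the model's reasoning trace."""
    text: str


@dataclass
class TextChunk:
    """Hypothetical stand-in for a chunk of the final answer."""
    text: str


Chunk = Union[ThinkingChunk, TextChunk]


def split_stream(chunks: Iterable[Chunk]) -> Tuple[str, str]:
    """Route each chunk by type and return (reasoning, answer)."""
    thinking, answer = [], []
    for chunk in chunks:
        if isinstance(chunk, ThinkingChunk):
            thinking.append(chunk.text)
        else:
            answer.append(chunk.text)
    return "".join(thinking), "".join(answer)


def accordion_label(seconds: float) -> str:
    """Format the collapsed header, e.g. 'Thinking · 2.4s'."""
    return f"Thinking · {seconds:.1f}s"


if __name__ == "__main__":
    stream = [
        ThinkingChunk("This is a sequence diagram. "),
        ThinkingChunk("Declare participants, then arrows."),
        TextChunk("<DIAGRAM>sequenceDiagram</DIAGRAM>"),
    ]
    reasoning, answer = split_stream(stream)
    print(accordion_label(2.4))
    print(answer)
```

In a real UI the chunks would be handled as they arrive rather than collected first; the batch version here just keeps the routing rule easy to read.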
Pragmatic Patterns for 4B Models
Fighting with the model for a few weeks led me to a few hard-won architectural patterns:
- Treat the System Prompt as a Grammar, Not a Personality: Small models pattern-match exceptionally well. My 500-line system prompt isn't about making the AI "helpful"; it's an output contract. I use explicit delimiters (<DIAGRAM>...</DIAGRAM>) and provide "syntax cards" showing the most common parser failures (e.g., NEVER write X). Teaching the model what not to do prevents entire classes of bugs.
- Trust the Contract over Regex: Instead of fighting fragile markdown fences with complex Regex, I rely on the XML-style delimiters defined in the system prompt. Even when the model decides to write an explanatory paragraph, the actual code is safely wrapped and easily extracted.
- Engineer the Recovery Loop: Even with Thinking Mode, complex diagrams might occasionally fail to parse. Instead of trying to prompt-engineer my way to a 100% success rate (which is near impossible at 4B), I built a small ReAct-style retry loop. If the Mermaid parser throws an error, the app feeds the exact error message back into a follow-up turn. The model almost always fixes its syntax on the second attempt.
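The delimiter contract and the recovery loop from the list above can be sketched together. This is a minimal illustration in Python, not the app's Dart code: `extract_diagram` and `generate_with_recovery` are hypothetical names, `generate` stands in for the model call, `validate` stands in for the Mermaid parser (returning `None` on success or an error message on failure), and the limit of three attempts and the wording of the follow-up prompt are assumptions. It is a plain validate-and-retry loop in the spirit of the ReAct-style loop described above.

```python
from typing import Callable, Optional

OPEN_TAG, CLOSE_TAG = "<DIAGRAM>", "</DIAGRAM>"


def extract_diagram(reply: str) -> Optional[str]:
    """Return the text between the delimiters, or None if they are missing."""
    start = reply.find(OPEN_TAG)
    if start == -1:
        return None
    body_start = start + len(OPEN_TAG)
    end = reply.find(CLOSE_TAG, body_start)
    if end == -1:
        return None
    return reply[body_start:end].strip()


def generate_with_recovery(
    generate: Callable[[str], str],             # stand-in for the model call
    validate: Callable[[str], Optional[str]],   # stand-in for the parser
    request: str,
    max_attempts: int = 3,                      # assumed limit, not from the article
) -> Optional[str]:
    """Ask for a diagram; on a parse error, feed the error back and retry."""
    prompt = request
    for _ in range(max_attempts):
        diagram = extract_diagram(generate(prompt))
        if diagram is None:
            error = "no <DIAGRAM> block found"
        else:
            error = validate(diagram)
        if error is None:
            return diagram
        # Follow-up turn: repeat the request and include the exact error.
        prompt = (
            f"{request}\n\nYour previous diagram failed to parse:\n"
            f"{error}\nReturn a corrected diagram."
        )
    return None


if __name__ == "__main__":
    canned = iter([
        "<DIAGRAM>\nflowchart TD\n  A --> \n</DIAGRAM>",
        "Fixed it.\n<DIAGRAM>\nflowchart TD\n  A --> B\n</DIAGRAM>",
    ])
    fake_model = lambda prompt: next(canned)
    fake_parser = lambda d: None if d.endswith("B") else "Parse error on line 2"
    print(generate_with_recovery(fake_model, fake_parser, "Draw A pointing to B"))
```

Because the extraction only looks for two fixed strings, any prose the model adds around the block is ignored, and a missing block is treated as just another error to feed back.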
Gemma 4 E2B and E4B prove that you don't need a massive, cloud-hosted LLM to ship a genuinely useful, structured AI application. If you map your deployment constraints, lean hard on the system prompt, enable Thinking Mode, and engineer a smart recovery loop, these edge models become a feature, not a compromise.
Do you think "Local-first AI" is the future for enterprise dev tools, or is the convenience of Cloud LLMs too hard to beat? Also, if you have ideas on how to improve the Mermaid recovery loop, let’s chat in the comments!



Top comments (58)
The local-first argument gets stronger every quarter, but I think it deserves a sharper qualifier: it works beautifully for tasks where you can bound the input. Diagram generation is a near-perfect fit because the schema is constrained and the model is allowed to be slow — a user will happily wait 4 seconds for a clean mermaid graph.
Where it falls over is open-ended conversation or long-context reasoning, where Gemma 4 E2B's headroom runs out fast against frontier-tier models. So I read "local-first" less as "always" and more as "use the smallest model that hits the task's quality bar, and a surprising number of tasks have a low bar." Did you do any A/B testing where you ran the same diagram requests through a cloud model and compared output? I'm curious how visible the gap is to actual users vs. how visible it is on a benchmark.
Excellent observation. You captured the exact sweet spot of this approach: reading 'local-first' as 'using the smallest model that hits the task's quality bar' is the perfect definition for pragmatic AI integration.
You are completely right about its limits with long-context and open-ended conversations. For diagrams, the bounded scope and the Thinking Mode make up for the model's smaller size, and users gladly accept a few seconds of latency in exchange for keeping their architectural data private.
Regarding the A/B testing and the visible gap: the biggest difference for the actual user isn't usually in the final syntax (the error-recovery loop handles that pretty well), but in dealing with ambiguous descriptions. Frontier cloud models are much better at filling in context gaps when a user gives vague or unstructured instructions. Smaller models like Gemma 4 E2B/E4B, on the other hand, require more direct and explicit prompts to stay on track.
That is exactly why, in the future, I intend to make other models available within the tool to dive deeper into these tests. The goal is to allow real A/B comparisons and give users the freedom to choose their preferred trade-off between absolute privacy and frontier reasoning capabilities. Thanks for the great insights!
Using DiagramFlowAI locally to keep sensitive architecture data private is smart. I'm curious about "Thinking Mode" and how effectively it reduces errors in Mermaid syntax. Does it really help with complex workflows? I found a specific bank for system design on PracHub that matched what I saw in the OA. It's been more useful for structured prompts than going through random Glassdoor threads.
Your point is brilliant! You captured exactly the essence of what we're building. Privacy in DiagramFlowAI isn't an optional feature; it's what I call the project's "Firmware"—it's at the foundation of everything. In the corporate world, designing a proprietary database architecture or authentication flow in a cloud-based SaaS is a compliance risk that many can't afford to take. Running 100% locally with Gemma 4 definitively solves this.
Regarding your question about Thinking Mode, the implementation in the code (which you can check in lib/models/ai_engine_service.dart) treats ThinkingResponse as a prioritized data flow before TextResponse.
For small models like Gemma 4 E2B/E4B, this is a game-changer for three technical reasons I observed in the project:
- Reduction of "Syntactic Hallucination": Mermaid is sensitive. A parenthesis ( instead of a square bracket [ breaks the renderer. Thinking Mode forces the model to "map" the hierarchy of the nodes before writing the code. It plans the subgraph and the connections (-->) first.
- Deterministic Grammar: Since 2B/4B models have less "working memory," I treat the System Prompt almost like a compiler grammar (see in lib/data/prompt_templates.dart how the prompts are structured as rigid contracts: "Output ONLY a flowchart... do not output prose"). Internal reasoning ensures that this contract is fulfilled.
- Complexity in Flows: In complex sequence diagrams (like the OAuth login diagram we have in the templates), the model uses "thinking" to ensure that the actors are declared before the interactions. Without this, small models often get lost in the middle of the flow.
And about PracHub, you're absolutely right! Structured prompt banks are the "High Octane Fuel" for what I call Vibe Coding. In DiagramFlowAI, instead of relying on generic prompts, we use templates that function as "Syntax Cards," teaching the model exactly what not to do to avoid common parser errors.
The future is local, private, and, with Gemma 4's reasoning, extremely intelligent.
really interesting take on local-first architecture and how it tackles privacy concerns in AI. the use of E2B sandboxes sounds like a smart move. at moonshift, we help you get a full next.js + postgres + auth app deployed in about 7 minutes, and you own the code on your github. if you're curious, happy to give you a complimentary run to check it out.
Excellent article, Carlos! It's really inspiring to see a practical application that dispels the idea that we always need giant cloud models for structured tasks. Your approach of treating the System Prompt as a rigid grammar specification and implementing the ReAct loop for self-correction is pure quality software engineering.
I have a technical question about the user experience (UX): the Thinking Mode of the Gemma 4 edge models (E2B/E4B) adds an overhead of tokens generated before the final response. In your testing with very complex diagrams (multiple subgraphs or many nodes), did the increased generation time (time to first token of the actual Mermaid output) become a point of friction for the user, or was the visual streaming of the reasoning log (the accordion in the UI) enough to mask that latency and keep the user in a flow state? I'd really like to understand how you assessed that tolerable limit.
Thanks, Julio! In the app I added a visual log showing the reasoning so the user isn't left waiting without knowing what is happening. Another point is that the better the machine you are using, the faster the model runs.
Thanks, Carlos, for sharing your knowledge! Very inspiring!
Congratulations on the article, Carlos! Fantastic solution. The use of Thinking Mode in local Gemma 4 proves that edge models can handle highly structured data. I'm already designing an adaptation of this architecture for the tattoo market: using the local model to parse clients' chaotic, descriptive briefs and turn them into standardized spec sheets (style, elements, anatomical constraints) with full privacy. Your article was an excellent blueprint for applying rigid constraints locally!
Thank you!
A project extremely well aimed at a real challenge: using AI without compromising sensitive data. The choice to prioritize local execution and the smaller Gemma 4 models not only reduces technical friction but also removes a critical adoption barrier in corporate environments. Here, architecture matters more than brute force. The use of Thinking Mode combined with output control and a correction loop shows important maturity: it's not just about generating with AI, it's about ensuring consistency and reliability in the result. It's a clear example of how AI applied with strategy beats AI applied with excess scale. Congratulations.
Thank you!
The DiagramFlowAI project is a brilliant example of how AI can be applied pragmatically and strategically to solve real engineering problems. The decision to adopt a "local-first" approach is spot on, because it directly addresses the tension between productivity and privacy, letting architects work with sensitive data without the compliance risks associated with cloud LLMs.
It's impressive how the technical choice of the Gemma 4 E2B and E4B models turned what could have been a limitation into a competitive advantage: by prioritizing smaller models, the solution democratizes access (running on common hardware and integrated GPUs) and delivers a fluid user experience, without the barrier of API keys or billing setup.
In addition, the maturity of the development is evident in the use of "Thinking Mode". This feature not only improves the reliability of Mermaid syntax, which is notoriously fragile, but also educates the user by exposing the AI's reasoning in the interface, turning wait time into something productive.
The additional refinement with a recovery loop (ReAct) and the treatment of the system prompt as a rigid grammar show that the project's effectiveness comes from architectural intelligence, and not just from the model's brute force.
Congratulations on creating a tool that truly respects companies' security posture while elevating developers' workflow.
Thank you!
It's incredible how far we've come.. and on top of that, being able to run it anywhere without internet (on a plane during a flight, like some folks are already doing) — awesome!!! Congrats on the great work, Barbero!!
Thanks!
Excellent project! Congratulations on the initiative and the pragmatic approach. Today, AI tools that generate diagrams and architectures save architects and developers hours of work. However, the financial cost and the risk of exposing sensitive data tend to be major obstacles.
Thank you!
Great read! It’s fascinating to see how compact models like Gemma 2 2B, when paired with techniques like 'Thinking Mode,' can deliver the kind of structured reasoning we previously only expected from much larger models. The 'Local-First' approach for DiagramFlowAI is a massive differentiator for both privacy and latency. Thanks for sharing such a detailed technical breakdown!
thanks