<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: System Rationale</title>
    <description>The latest articles on DEV Community by System Rationale (@system_rationale).</description>
    <link>https://dev.to/system_rationale</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F404282%2F25c84903-9f4f-4a6c-a886-c324eebf901c.png</url>
      <title>DEV Community: System Rationale</title>
      <link>https://dev.to/system_rationale</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/system_rationale"/>
    <language>en</language>
    <item>
      <title>RAG vs GraphRAG: When to Use What (From a Builder’s Perspective)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 13 Apr 2026 23:27:00 +0000</pubDate>
      <link>https://dev.to/system_rationale/rag-vs-graphrag-when-to-use-what-from-a-builders-perspective-132b</link>
      <guid>https://dev.to/system_rationale/rag-vs-graphrag-when-to-use-what-from-a-builders-perspective-132b</guid>
      <description>&lt;p&gt;I wasted time overengineering a GraphRAG system…&lt;br&gt;
when a simple RAG pipeline would’ve done the job better.&lt;/p&gt;

&lt;p&gt;If you’re building with LLMs, you’ll hit this question:&lt;/p&gt;

&lt;p&gt;“Should I use RAG or GraphRAG?”&lt;/p&gt;

&lt;p&gt;Let’s break it down without hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ What RAG actually is (in real systems)
&lt;/h2&gt;

&lt;p&gt;RAG is simple:&lt;br&gt;
    1. Chunk your data&lt;br&gt;
    2. Convert to embeddings&lt;br&gt;
    3. Store in vector DB&lt;br&gt;
    4. Retrieve top-k chunks&lt;br&gt;
    5. Send to LLM&lt;/p&gt;

&lt;h1&gt;
  
  
  simplified flow
&lt;/h1&gt;

&lt;p&gt;query_embedding = embed(query)&lt;br&gt;
docs = vector_db.search(query_embedding, top_k=5)&lt;br&gt;
response = llm.generate(query, context=docs)&lt;/p&gt;

&lt;p&gt;👉 That’s it.&lt;/p&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;This solves 80–90% of real-world use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 What GraphRAG adds (and why it exists)
&lt;/h2&gt;

&lt;p&gt;GraphRAG introduces:&lt;br&gt;
    • entities&lt;br&gt;
    • relationships&lt;br&gt;
    • graph traversal&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“Find similar text”&lt;/p&gt;

&lt;p&gt;It does:&lt;/p&gt;

&lt;p&gt;“Find related concepts and how they connect”&lt;/p&gt;

&lt;p&gt;This enables:&lt;br&gt;
    • multi-hop reasoning&lt;br&gt;
    • cross-document understanding&lt;br&gt;
    • better context stitching&lt;/p&gt;

&lt;p&gt;But there’s a catch 👇&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠️ The hidden cost nobody talks about
&lt;/h2&gt;

&lt;p&gt;GraphRAG is NOT just “RAG + graph”&lt;/p&gt;

&lt;p&gt;You now need:&lt;br&gt;
    • entity extraction pipelines&lt;br&gt;
    • relationship modeling&lt;br&gt;
    • graph database (Neo4j etc.)&lt;br&gt;
    • community detection / summaries&lt;br&gt;
    • sync between vector + graph&lt;/p&gt;

&lt;p&gt;👉 This is real engineering overhead.&lt;/p&gt;

&lt;p&gt;And in many cases… unnecessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 When you should use RAG
&lt;/h2&gt;

&lt;p&gt;Use RAG if your problem is:&lt;br&gt;
    • “Find answer from documents”&lt;br&gt;
    • “Summarize this content”&lt;br&gt;
    • “Search internal knowledge base”&lt;br&gt;
    • “Answer FAQ / support queries”&lt;/p&gt;

&lt;p&gt;👉 RAG is faster, cheaper, easier&lt;/p&gt;

&lt;p&gt;Also:&lt;br&gt;
    • updates = reindex&lt;br&gt;
    • no schema headache&lt;/p&gt;

&lt;h2&gt;
  
  
  When GraphRAG actually makes sense
&lt;/h2&gt;

&lt;p&gt;Use GraphRAG ONLY if:&lt;br&gt;
    • relationships matter more than text&lt;br&gt;
    • queries require multi-step reasoning&lt;br&gt;
    • data is highly interconnected&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
    • fraud detection (who is linked to whom)&lt;br&gt;
    • research analysis (connecting papers, concepts)&lt;br&gt;
    • enterprise knowledge graphs&lt;br&gt;
    • supply chain / dependency mapping&lt;/p&gt;

&lt;p&gt;👉 If your question is:&lt;/p&gt;

&lt;p&gt;“How are A, B, and C connected?”&lt;/p&gt;

&lt;p&gt;You need GraphRAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔥 The mistake most devs make
&lt;/h3&gt;

&lt;p&gt;They do this:&lt;/p&gt;

&lt;p&gt;“GraphRAG is more advanced → I should use it”&lt;/p&gt;

&lt;p&gt;Wrong.&lt;/p&gt;

&lt;p&gt;GraphRAG is:&lt;br&gt;
    • slower&lt;br&gt;
    • more expensive&lt;br&gt;
    • harder to maintain&lt;/p&gt;

&lt;p&gt;And for simple Q&amp;amp;A…&lt;/p&gt;

&lt;p&gt;👉 it can perform worse than RAG  ￼&lt;/p&gt;

&lt;h2&gt;
  
  
  The real-world architecture (what actually works)
&lt;/h2&gt;

&lt;p&gt;Best systems don’t choose.&lt;/p&gt;

&lt;p&gt;They combine:&lt;br&gt;
    • RAG → for fast retrieval&lt;br&gt;
    • Graph → for reasoning&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Query → Vector Search → Relevant chunks&lt;br&gt;&lt;br&gt;
        ↓&lt;br&gt;&lt;br&gt;
     Graph traversal → relationships&lt;br&gt;&lt;br&gt;
        ↓&lt;br&gt;&lt;br&gt;
     LLM → final answer&lt;/p&gt;

&lt;p&gt;👉 Hybrid is where things get powerful  ￼&lt;/p&gt;

&lt;p&gt;⚠️** One more thing (security)**&lt;/p&gt;

&lt;p&gt;RAG systems can be attacked via:&lt;br&gt;
    • prompt injection in documents&lt;/p&gt;

&lt;p&gt;So always:&lt;br&gt;
    • sanitize inputs&lt;br&gt;
    • separate instructions from data&lt;/p&gt;

&lt;p&gt;** Final takeaway**&lt;br&gt;
    • Start with RAG&lt;br&gt;
    • Add Graph only if needed&lt;br&gt;
    • Don’t overengineer early&lt;/p&gt;

&lt;p&gt;Useful resources&lt;/p&gt;

&lt;p&gt;RAG&lt;br&gt;
    • &lt;a href="https://www.ibm.com/think/topics/retrieval-augmented-generation" rel="noopener noreferrer"&gt;https://www.ibm.com/think/topics/retrieval-augmented-generation&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://weaviate.io/blog/introduction-to-rag" rel="noopener noreferrer"&gt;https://weaviate.io/blog/introduction-to-rag&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://www.pinecone.io/learn/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;https://www.pinecone.io/learn/retrieval-augmented-generation/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GraphRAG&lt;br&gt;
    • &lt;a href="https://microsoft.github.io/graphrag/" rel="noopener noreferrer"&gt;https://microsoft.github.io/graphrag/&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/" rel="noopener noreferrer"&gt;https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/neo4j/neo4j-graphrag-python" rel="noopener noreferrer"&gt;https://github.com/neo4j/neo4j-graphrag-python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Frameworks&lt;br&gt;
    • &lt;a href="https://docs.langchain.com/oss/python/langchain/rag" rel="noopener noreferrer"&gt;https://docs.langchain.com/oss/python/langchain/rag&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://developers.llamaindex.ai/" rel="noopener noreferrer"&gt;https://developers.llamaindex.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👋 If you’re building&lt;/p&gt;

&lt;p&gt;I’m building AI systems in public — sharing:&lt;br&gt;
    • what works&lt;br&gt;
    • what breaks&lt;br&gt;
    • what scales&lt;/p&gt;

&lt;p&gt;Let’s connect if you’re in the same space.&lt;/p&gt;

&lt;p&gt;you can follow me on x: [&lt;a href="https://x.com/systemRationale" rel="noopener noreferrer"&gt;https://x.com/systemRationale&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Running AI Fully Offline on Mobile with Gemma 4 (Android + iOS)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:01:00 +0000</pubDate>
      <link>https://dev.to/system_rationale/running-ai-fully-offline-on-mobile-with-gemma-4-android-ios-486f</link>
      <guid>https://dev.to/system_rationale/running-ai-fully-offline-on-mobile-with-gemma-4-android-ios-486f</guid>
      <description>&lt;p&gt;I used to think “AI in apps” meant calling an API.&lt;br&gt;
Then I tried running the model inside the app itself.&lt;/p&gt;

&lt;p&gt;No network. No latency spikes. No sending user data anywhere.&lt;/p&gt;

&lt;p&gt;That’s when things started to feel… different.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why this shift actually matters
&lt;/h2&gt;

&lt;p&gt;Most mobile AI apps today work like this:&lt;/p&gt;

&lt;p&gt;User → App → API → Cloud Model → Response&lt;/p&gt;

&lt;p&gt;Which means:&lt;br&gt;
    • unpredictable latency&lt;br&gt;
    • ongoing cost&lt;br&gt;
    • user data leaves the device&lt;/p&gt;

&lt;p&gt;Now compare that with:&lt;/p&gt;

&lt;p&gt;User → App → Local Model → Response&lt;/p&gt;

&lt;p&gt;No round trips. No dependency.&lt;/p&gt;

&lt;p&gt;That’s what Gemma 4 enables with its edge-optimized variants (E2B / E4B).&lt;/p&gt;
&lt;h2&gt;
  
  
  ⚙️ How you actually run it on mobile
&lt;/h2&gt;

&lt;p&gt;There are two real paths here. Everything else is noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Android (Best path): System-level AI via AICore&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re targeting modern Android:&lt;br&gt;
    • The model runs as part of the system&lt;br&gt;
    • You don’t bundle anything heavy&lt;br&gt;
    • OS handles optimization (CPU/GPU scheduling)&lt;/p&gt;

&lt;p&gt;👉 This is the cleanest architecture:&lt;br&gt;
    • smaller APK&lt;br&gt;
    • better performance&lt;br&gt;
    • less maintenance&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cross-platform: MediaPipe / AI Edge (Android + iOS)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where most devs will start.&lt;/p&gt;

&lt;p&gt;You:&lt;br&gt;
    • download a Gemma model (optimized format)&lt;br&gt;
    • run it locally via inference API&lt;br&gt;
    • stream responses into your UI&lt;/p&gt;
&lt;h2&gt;
  
  
  What the code actually looks like
&lt;/h2&gt;

&lt;p&gt;Let’s keep it real — not pseudo code.&lt;/p&gt;

&lt;p&gt;🔹 Android (Kotlin)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;llm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmInference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createFromOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;appContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmInferenceOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setModelPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/data/user/0/app/files/gemma-4-E2B.litertlm"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setMaxTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Run inference off main thread&lt;/span&gt;
&lt;span class="nc"&gt;CoroutineScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dispatchers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IO&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Explain event-driven architecture simply"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;withContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Dispatchers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Main&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;textView&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Important:&lt;br&gt;
    • Never run this on main thread&lt;br&gt;
    • Keep responses streamed if possible&lt;/p&gt;

&lt;p&gt;🔹 iOS (Swift)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;LlmInference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;modelPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gemma-4-E2B.litertlm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kt"&gt;DispatchQueue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;qos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userInitiated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Explain microservices vs monolith"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kt"&gt;DispatchQueue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputLabel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;👉 Same rule applies:&lt;br&gt;
    • background execution is mandatory&lt;br&gt;
    • UI must stay responsive&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚡ Performance reality (this is where most fail)
&lt;/h2&gt;

&lt;p&gt;Let’s be honest — running LLMs on phones is not “free”.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model size is your first constraint&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even optimized models:&lt;br&gt;
    • can be hundreds of MB&lt;/p&gt;

&lt;p&gt;👉 Practical approach:&lt;br&gt;
    • Default → E2B&lt;br&gt;
    • Optional → E4B (for high-end devices)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First response latency matters more than speed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Users don’t care about tokens/sec.&lt;br&gt;
They care about:&lt;/p&gt;

&lt;p&gt;“How fast did I get the first answer?”&lt;/p&gt;

&lt;p&gt;👉 Fix:&lt;br&gt;
    • warm up model with a tiny prompt&lt;br&gt;
    • preload when app is idle&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPU / Metal is not optional&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you rely only on CPU:&lt;br&gt;
    • performance drops hard&lt;br&gt;
    • battery drains faster&lt;/p&gt;

&lt;p&gt;👉 Always enable:&lt;br&gt;
    • GPU backend (Android)&lt;br&gt;
    • Metal (iOS)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Threading mistakes will break your app&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you run inference on UI thread:&lt;br&gt;
    • frame drops&lt;br&gt;
    • ANRs&lt;br&gt;
    • crashes&lt;/p&gt;

&lt;p&gt;👉 Treat model inference like:&lt;br&gt;
    • network call&lt;br&gt;
    • or heavy computation&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy becomes a feature (finally)
&lt;/h2&gt;

&lt;p&gt;This is the part most people underestimate.&lt;/p&gt;

&lt;p&gt;When everything runs locally:&lt;br&gt;
    • user input stays on device&lt;br&gt;
    • no logs&lt;br&gt;
    • no external dependency&lt;/p&gt;

&lt;p&gt;👉 This unlocks real use cases:&lt;br&gt;
    • private note summarization&lt;br&gt;
    • personal AI assistants&lt;br&gt;
    • sensitive chat analysis&lt;br&gt;
    • offline learning tools&lt;/p&gt;

&lt;h2&gt;
  
  
  App size strategy (critical decision)
&lt;/h2&gt;

&lt;p&gt;This is where many implementations go wrong.&lt;/p&gt;

&lt;p&gt;❌ Don’t do this&lt;br&gt;
    • bundle model inside APK/IPA&lt;br&gt;
    • force download during install&lt;/p&gt;

&lt;p&gt;👉 You’ll kill install conversion.&lt;/p&gt;

&lt;p&gt;✅ Do this instead&lt;br&gt;
    • download model after user opts in&lt;br&gt;
    • store in app-specific storage&lt;br&gt;
    • allow deletion / re-download&lt;/p&gt;

&lt;p&gt;🧠 Even better (Android)&lt;/p&gt;

&lt;p&gt;If available:&lt;br&gt;
    • use system model (AICore)&lt;/p&gt;

&lt;p&gt;👉 Zero model shipping&lt;br&gt;
👉 Zero storage overhead&lt;/p&gt;

&lt;p&gt;** Where this actually makes sense**&lt;/p&gt;

&lt;p&gt;Not every app needs on-device AI.&lt;/p&gt;

&lt;p&gt;But for these, it’s a serious advantage:&lt;br&gt;
    • EdTech (offline tutor, quizzes)&lt;br&gt;
    • Productivity (notes, summaries)&lt;br&gt;
    • Messaging (privacy-first features)&lt;br&gt;
    • Dating apps (local intelligence, no data leak)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Hard truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not magic.&lt;/p&gt;

&lt;p&gt;Avoid if:&lt;br&gt;
    • targeting low-end devices&lt;br&gt;
    • need heavy multi-agent orchestration&lt;br&gt;
    • require massive context windows&lt;/p&gt;

&lt;p&gt;🚀 What changed for me&lt;/p&gt;

&lt;p&gt;After experimenting with this setup, one thing became clear:&lt;/p&gt;

&lt;p&gt;The future of mobile AI isn’t “better APIs”&lt;br&gt;
It’s “less APIs”&lt;/p&gt;

&lt;p&gt;🔚 So We’re moving from:&lt;/p&gt;

&lt;p&gt;“Send data → wait → receive response”&lt;/p&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;p&gt;“Compute locally → respond instantly”&lt;/p&gt;

&lt;p&gt;And the teams that design for this early&lt;br&gt;
will build products that feel fundamentally faster and more trustworthy.&lt;/p&gt;

&lt;p&gt;👋 If you’re building in this space&lt;/p&gt;

&lt;p&gt;I’m building an AI-powered learning system in public.&lt;/p&gt;

&lt;p&gt;Sharing:&lt;br&gt;
    • what I build&lt;br&gt;
    • what breaks&lt;br&gt;
    • what actually scales&lt;/p&gt;

&lt;p&gt;If that’s your space too → let’s connect.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>android</category>
      <category>llm</category>
      <category>mobile</category>
    </item>
    <item>
      <title>Running AI in the Browser with Gemma 4 (No API, No Server)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Sat, 11 Apr 2026 12:48:50 +0000</pubDate>
      <link>https://dev.to/system_rationale/running-ai-in-the-browser-with-gemma-4-no-api-no-server-3en2</link>
      <guid>https://dev.to/system_rationale/running-ai-in-the-browser-with-gemma-4-no-api-no-server-3en2</guid>
      <description>&lt;p&gt;Most “AI apps” today are just API wrappers.&lt;br&gt;
That’s fine… until you care about latency, cost, or privacy.&lt;/p&gt;

&lt;p&gt;I’ve been exploring what it actually takes to run LLMs inside the browser, and Gemma 4 completely changes what’s possible.&lt;/p&gt;

&lt;p&gt;This is not theory  this is what actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemma 4 is different
&lt;/h2&gt;

&lt;p&gt;Gemma 4 isn’t just another model release.&lt;/p&gt;

&lt;p&gt;It’s designed for:&lt;br&gt;
    • on-device inference&lt;br&gt;
    • agentic workflows&lt;br&gt;
    • multimodal tasks (text, audio, vision)&lt;/p&gt;

&lt;p&gt;The important part?&lt;/p&gt;

&lt;p&gt;👉 The E2B / E4B variants are small enough to run inside a browser tab.&lt;/p&gt;

&lt;p&gt;No backend required.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚙️ How it actually runs in the browser
&lt;/h3&gt;

&lt;p&gt;Let’s cut the hype.&lt;/p&gt;

&lt;p&gt;There are only 2 real approaches:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. MediaPipe LLM Inference (Recommended)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• WebAssembly + WebGPU under the hood
• Load model like:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;const llm = await LlmInference.createFromOptions({&lt;br&gt;
  modelAssetPath: "/models/gemma-4-E2B.litertlm",&lt;br&gt;
});&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;You now have:&lt;br&gt;
    • streaming responses&lt;br&gt;
    • token control&lt;br&gt;
    • temperature, top-k, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. WebGPU (Transformers.js style)
&lt;/h2&gt;

&lt;p&gt;More control, more pain.&lt;br&gt;
    • You host quantized model&lt;br&gt;
    • Run inference via WebGPU&lt;br&gt;
    • Manage decoding loop yourself&lt;/p&gt;

&lt;p&gt;👉 Only use this if you need custom pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚡ Performance Reality (What nobody tells you)
&lt;/h2&gt;

&lt;p&gt;Running LLMs in browser ≠ free magic.&lt;/p&gt;

&lt;p&gt;Here’s what actually matters:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Model size will kill you if you’re careless
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;• Raw models → GBs
• Optimized (4-bit) → hundreds of MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;👉 Rule:&lt;br&gt;
    • E2B → default&lt;br&gt;
    • E4B → only for high-end devices&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Token limits = UX
&lt;/h3&gt;

&lt;p&gt;Don’t blindly use 128K context.&lt;/p&gt;

&lt;p&gt;You’ll:&lt;br&gt;
    • increase latency&lt;br&gt;
    • kill memory&lt;br&gt;
    • freeze UI&lt;/p&gt;

&lt;p&gt;👉 Cap aggressively:&lt;/p&gt;

&lt;p&gt;maxTokens: 512&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Main thread blocking = bad UX
&lt;/h3&gt;

&lt;p&gt;If you don’t handle this:&lt;br&gt;
    • UI freezes&lt;br&gt;
    • typing lag&lt;br&gt;
    • users drop&lt;/p&gt;

&lt;p&gt;👉 Always:&lt;br&gt;
    • stream tokens&lt;br&gt;
    • use Web Workers if custom setup&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;## You need device intelligence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don’t assume every device can handle it.&lt;/p&gt;

&lt;p&gt;👉 Do this:&lt;br&gt;
    • Check WebGPU support&lt;br&gt;
    • Estimate memory&lt;br&gt;
    • Fallback → API model&lt;/p&gt;

&lt;h2&gt;
  
  
  🔐 Privacy = Your biggest advantage
&lt;/h2&gt;

&lt;p&gt;This is where things get interesting.&lt;/p&gt;

&lt;p&gt;With browser-based Gemma:&lt;br&gt;
    • No API calls&lt;br&gt;
    • No prompt logging&lt;br&gt;
    • No server dependency&lt;/p&gt;

&lt;h3&gt;
  
  
  Your pitch becomes :
&lt;/h3&gt;

&lt;p&gt;“Your data never leaves your device.”&lt;/p&gt;

&lt;p&gt;That’s not marketing — that’s architecture.&lt;/p&gt;

&lt;p&gt;** How to keep your app lightweight**&lt;/p&gt;

&lt;p&gt;If you mess this up, your app is dead.&lt;/p&gt;

&lt;p&gt;❌ Wrong approach:&lt;br&gt;
    • Bundle model in JS&lt;br&gt;
    • Load on startup&lt;/p&gt;

&lt;p&gt;✅ Correct approach:&lt;br&gt;
    1.  Lazy load model&lt;/p&gt;

&lt;p&gt;&lt;code&gt;if (userClicksAI) {&lt;br&gt;
  loadModel();&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Separate asset hosting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;• /models/gemma-4-E2B.litertlm&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cache aggressively&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;• long cache headers&lt;br&gt;
   • avoid re-download&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Progressive upgrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;• start small → offer bigger model later&lt;/p&gt;

&lt;p&gt;🧠 Real use cases (not demos)&lt;/p&gt;

&lt;p&gt;Where this actually makes sense:&lt;br&gt;
    • Private note summarizer&lt;br&gt;
    • Offline AI assistant&lt;br&gt;
    • In-browser coding helper&lt;br&gt;
    • Document parsing (OCR + reasoning)&lt;/p&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Brutal truth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is NOT for:&lt;br&gt;
 • low-end phones&lt;br&gt;
 • heavy reasoning tasks&lt;br&gt;
 • large-scale SaaS&lt;/p&gt;

&lt;p&gt;🚀 Where this fits in real products&lt;/p&gt;

&lt;p&gt;If you’re building something like:&lt;br&gt;
 • productivity tools&lt;br&gt;
 • education apps&lt;br&gt;
 • private assistants&lt;/p&gt;

&lt;p&gt;This is a massive differentiator.&lt;/p&gt;

&lt;p&gt;🔚 Final thought&lt;/p&gt;

&lt;p&gt;We’re moving from:&lt;/p&gt;

&lt;p&gt;“AI as API” to “AI as runtime”&lt;/p&gt;

&lt;p&gt;And browsers are becoming compute platforms.&lt;/p&gt;

&lt;p&gt;If you’re building something real (not demos),&lt;br&gt;
this shift matters more than any model benchmark.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agent workflows&lt;/li&gt;
&lt;li&gt;on-device AI&lt;/li&gt;
&lt;li&gt;system design decisions&lt;/li&gt;
&lt;li&gt;mistakes &amp;amp; trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Follow me on X: [(&lt;a href="https://x.com/systemRationale)" rel="noopener noreferrer"&gt;https://x.com/systemRationale)&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Part 3 — Making Gemma 4 Agents Production-Ready: Guardrails, Structured Outputs, and Self-Healing Systems</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Fri, 10 Apr 2026 02:58:00 +0000</pubDate>
      <link>https://dev.to/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</link>
      <guid>https://dev.to/system_rationale/part-3-making-gemma-4-agents-production-ready-guardrails-structured-outputs-and-self-healing-575n</guid>
      <description>&lt;p&gt;The uncomfortable truth about AI agents&lt;/p&gt;

&lt;p&gt;By the time most teams reach this stage, they’ve already built:&lt;br&gt;
    • a multi-step workflow&lt;br&gt;
    • a supervisor + worker setup&lt;br&gt;
    • integration with tools and APIs&lt;/p&gt;

&lt;p&gt;And yet, the system still fails in production.&lt;/p&gt;

&lt;p&gt;Not because the model is weak.&lt;/p&gt;

&lt;p&gt;But because the system is non-deterministic.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Where reliability actually breaks&lt;/p&gt;

&lt;p&gt;In real deployments, failures don’t come from “bad reasoning”.&lt;/p&gt;

&lt;p&gt;They come from:&lt;br&gt;
    • malformed outputs (invalid JSON, missing fields)&lt;br&gt;
    • inconsistent decisions across steps&lt;br&gt;
    • uncontrolled retries and loops&lt;br&gt;
    • unsafe or duplicated side effects&lt;/p&gt;

&lt;p&gt;You can’t patch these with better prompts.&lt;/p&gt;

&lt;p&gt;You need contracts, validation, and control layers.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;From probabilistic outputs → deterministic contracts&lt;/p&gt;

&lt;p&gt;The first shift is simple but critical:&lt;/p&gt;

&lt;p&gt;Treat every model output as untrusted input&lt;/p&gt;

&lt;p&gt;Instead of accepting free-form text, define strict schemas using&lt;br&gt;
Pydantic or&lt;br&gt;
PydanticAI.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example: Root Cause Contract&lt;/p&gt;

&lt;p&gt;class RootCause(BaseModel):&lt;br&gt;
    service: str&lt;br&gt;
    confidence: float&lt;br&gt;
    error_type: Literal["OOM", "MemoryLeak", "Config", "Network"]&lt;br&gt;
    evidence: list[str]&lt;br&gt;
    next_steps: list[str]&lt;/p&gt;

&lt;p&gt;This does three things:&lt;br&gt;
    1.  Forces the model into a structured format&lt;br&gt;
    2.  Enables automatic validation&lt;br&gt;
    3.  Creates a stable interface between system components&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this looks like in practice&lt;/p&gt;

&lt;p&gt;A production pipeline becomes:&lt;/p&gt;

&lt;p&gt;LLM Output → Schema Validation → Accept / Reject → Retry / Escalate&lt;/p&gt;

&lt;p&gt;This is no longer “AI responding”.&lt;/p&gt;

&lt;p&gt;It’s a controlled data pipeline.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The self-healing loop&lt;/p&gt;

&lt;p&gt;Validation is only half the system.&lt;/p&gt;

&lt;p&gt;The real reliability comes from how you handle failure.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Controlled retry pattern&lt;br&gt;
    1.  Generate output&lt;br&gt;
    2.  Validate against schema&lt;br&gt;
    3.  Capture validation error&lt;br&gt;
    4.  Feed error back into model&lt;br&gt;
    5.  Retry with constraints&lt;br&gt;
    6.  Stop after N attempts&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example failure feedback&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“Try again”&lt;/p&gt;

&lt;p&gt;You send:&lt;/p&gt;

&lt;p&gt;“Field confidence must be a float between 0 and 1.&lt;br&gt;
error_type must be one of [OOM, MemoryLeak, Config, Network].&lt;br&gt;
Fix the JSON.”&lt;/p&gt;

&lt;p&gt;This transforms the model into a self-correcting system.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why Gemma 4 fits this model well&lt;/p&gt;

&lt;p&gt;With Gemma 4, this loop becomes practical at scale.&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • thinking mode improves structured reasoning&lt;br&gt;
    • MoE architecture reduces cost per retry&lt;br&gt;
    • long context allows passing validation history&lt;br&gt;
    • tool calling aligns with structured outputs&lt;/p&gt;

&lt;p&gt;This is critical.&lt;/p&gt;

&lt;p&gt;Self-healing systems require multiple attempts.&lt;br&gt;
Cost-efficient inference makes that viable.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Guardrails are not optional&lt;/p&gt;

&lt;p&gt;Without guardrails, your system will eventually:&lt;br&gt;
    • loop indefinitely&lt;br&gt;
    • call the wrong tools&lt;br&gt;
    • execute unsafe actions&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Minimum guardrail layer&lt;/p&gt;

&lt;p&gt;You should implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Step limits&lt;br&gt;
• Hard cap on number of node executions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error classification&lt;br&gt;
• Retry: timeouts, rate limits&lt;br&gt;
• Fail: schema errors, auth issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Circuit breakers&lt;br&gt;
• Stop calling failing dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Human-in-the-loop&lt;br&gt;
• Required for destructive actions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Visualizing guardrails in the system&lt;/p&gt;

&lt;p&gt;Think of your system as:&lt;/p&gt;

&lt;p&gt;State Machine&lt;br&gt;
   ↓&lt;br&gt;
Validation Layer&lt;br&gt;
   ↓&lt;br&gt;
Guardrails&lt;br&gt;
   ↓&lt;br&gt;
Execution&lt;/p&gt;

&lt;p&gt;Each layer reduces uncertainty.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Going beyond validation: adaptive systems with DSPy&lt;/p&gt;

&lt;p&gt;Validation ensures correctness.&lt;/p&gt;

&lt;p&gt;But how do you improve the system over time?&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Enter DSPy&lt;/p&gt;

&lt;p&gt;DSPy treats your pipeline as a program:&lt;br&gt;
    • inputs → outputs&lt;br&gt;
    • defined signatures&lt;br&gt;
    • measurable metrics&lt;/p&gt;

&lt;p&gt;It allows you to:&lt;br&gt;
    • run evaluation datasets&lt;br&gt;
    • measure output quality&lt;br&gt;
    • optimize prompts automatically&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this unlocks&lt;/p&gt;

&lt;p&gt;Instead of manual tuning:&lt;br&gt;
    • the system detects failures&lt;br&gt;
    • adjusts prompts / examples&lt;br&gt;
    • improves over time&lt;/p&gt;

&lt;p&gt;This is the missing layer in most agent systems.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Combining everything: the deterministic stack&lt;/p&gt;

&lt;p&gt;A production-ready Gemma 4 system looks like:&lt;/p&gt;

&lt;p&gt;State Graph (LangGraph)&lt;br&gt;
      ↓&lt;br&gt;
Supervisor (Gemma 4 thinking mode)&lt;br&gt;
      ↓&lt;br&gt;
Workers (task-specific agents)&lt;br&gt;
      ↓&lt;br&gt;
Pydantic Validation&lt;br&gt;
      ↓&lt;br&gt;
Guardrails&lt;br&gt;
      ↓&lt;br&gt;
DSPy Evaluation + Optimization&lt;/p&gt;

&lt;p&gt;Each layer solves a specific failure mode.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Real-world application: autonomous DevOps agent&lt;/p&gt;

&lt;p&gt;Example workflow:&lt;/p&gt;

&lt;p&gt;Trace&lt;br&gt;
    • collect logs, metrics, events&lt;/p&gt;

&lt;p&gt;RootCause&lt;br&gt;
    • detect anomalies (OOMKilled, memory leaks)&lt;/p&gt;

&lt;p&gt;Plan&lt;br&gt;
    • decide corrective action&lt;/p&gt;

&lt;p&gt;Fix&lt;br&gt;
    • restart pods, scale services, or open PR&lt;/p&gt;

&lt;p&gt;Verify&lt;br&gt;
    • confirm system recovery&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why this works&lt;/p&gt;

&lt;p&gt;Because:&lt;br&gt;
    • every step is validated&lt;br&gt;
    • every action is controlled&lt;br&gt;
    • every failure is recoverable&lt;/p&gt;

&lt;p&gt;This is not an “AI agent”.&lt;/p&gt;

&lt;p&gt;It’s a deterministic system with AI inside it.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Practical implementation stack&lt;/p&gt;

&lt;p&gt;If you’re building this today:&lt;br&gt;
    • Model: Gemma 4 (26B MoE)&lt;br&gt;
    • Orchestration: LangGraph&lt;br&gt;
    • Validation: Pydantic / PydanticAI&lt;br&gt;
    • Guardrails: custom + middleware&lt;br&gt;
    • Evaluation: DSPy&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;/p&gt;

&lt;p&gt;Core&lt;br&gt;
    • &lt;a href="https://github.com/google-deepmind/gemma" rel="noopener noreferrer"&gt;https://github.com/google-deepmind/gemma&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/google/gemma_pytorch" rel="noopener noreferrer"&gt;https://github.com/google/gemma_pytorch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Orchestration&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph-example" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph-example&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Validation &amp;amp; Guardrails&lt;br&gt;
    • &lt;a href="https://github.com/pydantic/pydantic-ai" rel="noopener noreferrer"&gt;https://github.com/pydantic/pydantic-ai&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/jagreehal/pydantic-ai-guardrails" rel="noopener noreferrer"&gt;https://github.com/jagreehal/pydantic-ai-guardrails&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evaluation &amp;amp; Optimization&lt;br&gt;
    • &lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;https://github.com/stanfordnlp/dspy&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/Scale3-Labs/dspy-examples" rel="noopener noreferrer"&gt;https://github.com/Scale3-Labs/dspy-examples&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world systems&lt;br&gt;
    • &lt;a href="https://github.com/qicesun/SRE-Agent-App" rel="noopener noreferrer"&gt;https://github.com/qicesun/SRE-Agent-App&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Final perspective&lt;/p&gt;

&lt;p&gt;Most teams are still chasing:&lt;br&gt;
    • better prompts&lt;br&gt;
    • better models&lt;br&gt;
    • better outputs&lt;/p&gt;

&lt;p&gt;That’s not where reliability comes from.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Reliability comes from:&lt;br&gt;
    • explicit state&lt;br&gt;
    • strict contracts&lt;br&gt;
    • controlled execution&lt;br&gt;
    • continuous evaluation&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productionagent</category>
      <category>agentdesign</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Designing Multi-Agent Systems with Gemma 4: Supervisor and Worker Pattern</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:49:00 +0000</pubDate>
      <link>https://dev.to/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</link>
      <guid>https://dev.to/system_rationale/designing-multi-agent-systems-with-gemma-4-supervisor-and-worker-pattern-2ckh</guid>
      <description>&lt;p&gt;Most agent implementations fail for a simple reason:&lt;/p&gt;

&lt;p&gt;They try to make one model do everything.&lt;/p&gt;

&lt;p&gt;That approach does not scale.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The limitation of single-agent systems&lt;/p&gt;

&lt;p&gt;When one agent is responsible for:&lt;br&gt;
    • understanding context&lt;br&gt;
    • making decisions&lt;br&gt;
    • calling tools&lt;br&gt;
    • validating outputs&lt;br&gt;
    • executing actions&lt;/p&gt;

&lt;p&gt;you introduce uncontrolled complexity.&lt;/p&gt;

&lt;p&gt;The result is:&lt;br&gt;
    • inconsistent behavior&lt;br&gt;
    • hallucinated decisions&lt;br&gt;
    • poor failure recovery&lt;/p&gt;

&lt;p&gt;This is not a model limitation. It’s a design issue.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The correct pattern: separation of responsibilities&lt;/p&gt;

&lt;p&gt;A more stable architecture separates concerns into two layers:&lt;/p&gt;

&lt;p&gt;Worker agents&lt;/p&gt;

&lt;p&gt;Each worker is narrowly scoped:&lt;br&gt;
    • log analysis&lt;br&gt;
    • root cause detection&lt;br&gt;
    • code or PR generation&lt;br&gt;
    • infrastructure interaction&lt;/p&gt;

&lt;p&gt;Workers should be predictable and task-specific.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Supervisor agent&lt;/p&gt;

&lt;p&gt;The supervisor coordinates the system.&lt;/p&gt;

&lt;p&gt;With Gemma 4, this becomes significantly more powerful due to its thinking mode.&lt;/p&gt;

&lt;p&gt;The supervisor:&lt;br&gt;
    • reads the global system state&lt;br&gt;
    • decides which worker to invoke&lt;br&gt;
    • validates outputs before progressing&lt;br&gt;
    • handles retries and escalation&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why thinking mode matters&lt;/p&gt;

&lt;p&gt;Gemma 4 introduces structured reasoning behavior, often referred to as a “thinking” phase.&lt;/p&gt;

&lt;p&gt;In practice, this allows the supervisor to:&lt;br&gt;
    1.  evaluate multiple possible actions&lt;br&gt;
    2.  internally reason about risks and outcomes&lt;br&gt;
    3.  select the next state transition&lt;/p&gt;

&lt;p&gt;This creates a separation between:&lt;br&gt;
    • internal reasoning&lt;br&gt;
    • external actions&lt;/p&gt;

&lt;p&gt;That separation is critical for reliability.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Putting it together: state-driven execution&lt;/p&gt;

&lt;p&gt;A typical flow looks like this:&lt;br&gt;
    • Trace — collect logs, metrics, events&lt;br&gt;
    • RootCause — identify likely issue&lt;br&gt;
    • Plan — decide next action&lt;br&gt;
    • Fix / Escalate — execute or request approval&lt;br&gt;
    • Verify — confirm resolution&lt;/p&gt;

&lt;p&gt;Each step is a node in a state machine.&lt;/p&gt;

&lt;p&gt;The supervisor controls transitions between nodes.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What this architecture fixes&lt;/p&gt;

&lt;p&gt;This approach eliminates common issues:&lt;br&gt;
    • uncontrolled loops → bounded by state transitions&lt;br&gt;
    • inconsistent decisions → centralized in supervisor&lt;br&gt;
    • retry chaos → handled explicitly in graph&lt;br&gt;
    • unclear execution → traceable at each node&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What most teams still get wrong&lt;/p&gt;

&lt;p&gt;Even with this architecture, many implementations fail because they:&lt;br&gt;
    • skip output validation&lt;br&gt;
    • allow unlimited retries&lt;br&gt;
    • treat tool calls as always safe&lt;br&gt;
    • don’t distinguish between reversible and irreversible actions&lt;/p&gt;

&lt;p&gt;These are not optional concerns.&lt;/p&gt;

&lt;p&gt;They define whether your system is production-ready.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Resources&lt;br&gt;
    • &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/langgraph&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://github.com/emarco177/langgraph-course" rel="noopener noreferrer"&gt;https://github.com/emarco177/langgraph-course&lt;/a&gt;&lt;br&gt;
    • &lt;a href="https://codelabs.developers.google.com/aidemy-multi-agent/instructions" rel="noopener noreferrer"&gt;https://codelabs.developers.google.com/aidemy-multi-agent/instructions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Next&lt;/p&gt;

&lt;p&gt;In the final part:&lt;/p&gt;

&lt;p&gt;How to make Gemma 4 agents deterministic using structured outputs, guardrails, and self-healing pipelines&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>agentdesign</category>
      <category>agentworker</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Gemma 4 MoE: frontier quality at 1/10th the API cost</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Tue, 07 Apr 2026 02:43:00 +0000</pubDate>
      <link>https://dev.to/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</link>
      <guid>https://dev.to/system_rationale/gemma-4-moe-frontier-quality-at-110th-the-api-cost-2oan</guid>
      <description>&lt;p&gt;Gemma 4 MoE: frontier quality at 1/10th the API cost&lt;/p&gt;

&lt;h1&gt;
  
  
  gemma4 #moe #llm #openweights #aiinfra
&lt;/h1&gt;

&lt;p&gt;Continuing from Part 1 — once you have a proper state machine architecture, the next question is: which model runs inside it?&lt;/p&gt;

&lt;p&gt;For high-volume agent workloads, my pick is Gemma 4 26B MoE.&lt;/p&gt;

&lt;p&gt;Here's the actual reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What MoE means (no marketing)
&lt;/h2&gt;

&lt;p&gt;Most LLMs are dense. A 30B dense model activates 30B parameters per token — every single one, every single call.&lt;/p&gt;

&lt;p&gt;Mixture-of-Experts works differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total parameters: ~25B&lt;/li&gt;
&lt;li&gt;Active parameters per token: ~3.8B&lt;/li&gt;
&lt;li&gt;A router picks 8 experts out of 128 per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Near-30B quality. ~4B compute per token.&lt;/p&gt;

&lt;p&gt;Not a trick. Just a better architecture for inference-heavy workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The real cost math
&lt;/h2&gt;

&lt;p&gt;GPT-4o: $2.50 per 1M input tokens, $10 per 1M output tokens.&lt;/p&gt;

&lt;p&gt;Gemma 4 is open-weight. Host it yourself on an A100. At volume — thousands of agent runs per day — the math flips hard in your favor.&lt;/p&gt;

&lt;p&gt;This matters specifically for agents because agents are token-heavy. One agent run might involve 5–20 LLM calls, each with a full context window. At GPT-4o pricing, that adds up fast. On self-hosted Gemma 4, it stays manageable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Gemma 4 gives you specifically for agents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;256K context window — feed full log files, traces, conversation history in one shot&lt;/li&gt;
&lt;li&gt;Native function calling — no wrapper hacks for tool use&lt;/li&gt;
&lt;li&gt;Thinking mode — model reasons privately before acting (critical for Supervisor agents — Part 3)&lt;/li&gt;
&lt;li&gt;Multimodal input — pass Grafana screenshots directly to it&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When GPT-4o still wins
&lt;/h2&gt;

&lt;p&gt;Being honest here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need sub-second latency, don't control infra → GPT-4o&lt;/li&gt;
&lt;li&gt;Need best reasoning with zero setup → GPT-4o&lt;/li&gt;
&lt;li&gt;Running under 10k tokens/day → pricing doesn't matter, use anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need cost control at volume&lt;/li&gt;
&lt;li&gt;Data can't leave your infra (regulated, private)&lt;/li&gt;
&lt;li&gt;You're comfortable with GPU infra or a cloud GPU provider&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:26b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local testing done. For production throughput, pair with vLLM.&lt;/p&gt;




&lt;p&gt;Part 3 is the architecture — Supervisor + Worker agents using Gemma 4's thinking mode inside a LangGraph state machine. That's where 99.9% reliability actually becomes achievable.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gemma 4 on Mobile: Which Model to Load (E2B vs E4B) + Real Implementation Guide</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 18:09:21 +0000</pubDate>
      <link>https://dev.to/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</link>
      <guid>https://dev.to/system_rationale/gemma-4-on-mobile-which-model-to-load-e2b-vs-e4b-real-implementation-guide-2i1k</guid>
      <description>&lt;p&gt;Hey devs 👋&lt;br&gt;
I’ve been hands-on with Gemma 4 since it dropped 4 days ago and honestly — the E2B and E4B variants are the first models that actually feel practical for real mobile apps.&lt;br&gt;
Here’s the no-BS guide I wish I had: which model to load for your use case + exactly how to load it on Android, iOS, React Native, and web.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Which Gemma 4 model should you actually load?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;E2B (≈5.1B total params, only 2.3B active thanks to Per-Layer Embeddings)&lt;br&gt;
→ Your default for phones.&lt;br&gt;
Use cases: offline tutor, smart replies, chat rephrasing, note summarization, safety filters, anything battery/RAM sensitive.&lt;br&gt;
Cold start is fast, runs smooth on mid-range devices.&lt;br&gt;
E4B (≈8B total, 4.5B effective)&lt;br&gt;
→ Sweet spot for flagship phones or when you need noticeably better reasoning + native audio + image understanding.&lt;br&gt;
Use cases: multimodal (photo → description), longer context tasks, or when E2B feels a bit “light”.&lt;br&gt;
26B A4B MoE or 31B&lt;br&gt;
→ Skip these on mobile. Only for laptops, desktops, or server-side heavy lifting.&lt;/p&gt;

&lt;p&gt;Rule of thumb I use: start with E2B. Only bump to E4B if users complain about quality or you need audio/image input.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to actually load the model (the part that matters)
Android&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Easiest path: AICore Developer Preview (system-wide Gemma 4, zero weights to ship).&lt;br&gt;
Just call the ML Kit GenAI Prompt API — Google handles hardware delegation (NPU/GPU).&lt;br&gt;
For full control in your app: LiteRT-LM&lt;br&gt;
Download the quantized .task file (4-bit) from HF&lt;br&gt;
Use on-demand Play Asset Delivery so your APK stays &amp;lt;100 MB&lt;br&gt;
Load in background with Coroutines → never block UI&lt;br&gt;
Use streaming callback so tokens appear live&lt;/p&gt;

&lt;p&gt;iOS&lt;/p&gt;

&lt;p&gt;MediaPipe LLM Inference API is the official way.&lt;br&gt;
Convert to MediaPipe task format → memory-map the weights → Metal/MPS acceleration.&lt;br&gt;
Warm up the model during app idle time so first token feels instant.&lt;/p&gt;

&lt;p&gt;React Native&lt;/p&gt;

&lt;p&gt;Native TurboModule (Kotlin + Swift) is non-negotiable.&lt;br&gt;
Keep the entire model + inference in native code.&lt;br&gt;
Expose only generateResponse(prompt, options) and onToken events back to JS.&lt;br&gt;
Never run inference on the JS thread — you will OOM and crash.&lt;/p&gt;

&lt;p&gt;Web&lt;/p&gt;

&lt;p&gt;MediaPipe + WebGPU (works surprisingly well in Chrome).&lt;/p&gt;

&lt;p&gt;Universal tips that saved my ass:&lt;/p&gt;

&lt;p&gt;Always use 4-bit quantized version (Q4_K_M or LiteRT equivalent)&lt;br&gt;
Never bundle the full model in the APK/IPA — download on first user opt-in&lt;br&gt;
Cap context at 4K–8K for mobile (128K is possible but eats RAM)&lt;br&gt;
Stream tokens. Always. Users hate staring at a blank screen.&lt;/p&gt;

&lt;p&gt;Security bonus: because E2B/E4B run 100% offline, user data (exam answers, private notes, photos) never touches your servers. Huge privacy win.&lt;br&gt;
I’m using this exact stack right now for an offline-first tutor app and it’s buttery smooth.&lt;br&gt;
Drop your use case below and I’ll tell you which variant + exact loading path I’d pick for it.&lt;br&gt;
Useful resources (all fresh as of April 2026):&lt;/p&gt;

&lt;p&gt;Official Gemma 4 announcement: &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/" rel="noopener noreferrer"&gt;https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/&lt;/a&gt;&lt;br&gt;
Model card + sizes: &lt;a href="https://ai.google.dev/gemma/docs/core/model_card_4" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core/model_card_4&lt;/a&gt;&lt;br&gt;
Full model overview (E2B/E4B details): &lt;a href="https://ai.google.dev/gemma/docs/core" rel="noopener noreferrer"&gt;https://ai.google.dev/gemma/docs/core&lt;/a&gt;&lt;br&gt;
Android AICore + ML Kit guide: &lt;a href="https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html" rel="noopener noreferrer"&gt;https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html&lt;/a&gt;&lt;br&gt;
LiteRT-LM mobile deployment: &lt;a href="https://ai.google.dev/edge/litert-lm" rel="noopener noreferrer"&gt;https://ai.google.dev/edge/litert-lm&lt;/a&gt;&lt;br&gt;
Hugging Face E2B/E4B quantized models: &lt;a href="https://huggingface.co/google/gemma-4-E2B-it" rel="noopener noreferrer"&gt;https://huggingface.co/google/gemma-4-E2B-it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Who’s actually shipping Gemma 4 on device right now? Show me your stack 🙌&lt;/p&gt;

</description>
      <category>gemma4</category>
      <category>ondeviceai</category>
      <category>mobile</category>
      <category>offlineai</category>
    </item>
    <item>
      <title>Why your LLM agent fails at 3 AM (and how state machines fix it)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:35:26 +0000</pubDate>
      <link>https://dev.to/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</link>
      <guid>https://dev.to/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</guid>
      <description>&lt;p&gt;Why your LLM agent fails at 3 AM (and how state machines fix it)&lt;/p&gt;

&lt;h1&gt;
  
  
  agents #llm #langgraph #systemdesign #aiinfra
&lt;/h1&gt;

&lt;p&gt;I've been reading postmortems from teams running LLM agents in production.&lt;/p&gt;

&lt;p&gt;Same failure every time.&lt;/p&gt;

&lt;p&gt;Not model quality. Not prompt engineering. The architecture.&lt;/p&gt;

&lt;p&gt;Most AI agents today still look like this:&lt;/p&gt;

&lt;p&gt;User Input → LLM Call → Tool Call → LLM Call → Output&lt;/p&gt;

&lt;p&gt;A chain. Linear. Stateless. Hopeful.&lt;/p&gt;

&lt;p&gt;Works great in a notebook. Breaks under real load.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 4 ways chains die in production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Infinite loops&lt;/strong&gt;&lt;br&gt;
Agent calls a tool → tool fails → agent retries → tool fails → agent retries.&lt;br&gt;
No exit condition. You're burning tokens at 3 AM while sleeping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. No checkpoint on failure&lt;/strong&gt;&lt;br&gt;
Step 7 of 10 fails. You restart from step 1. Every. Single. Time.&lt;br&gt;
Duplicate side effects — emails, API writes, deploys — retried blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Opaque debugging&lt;/strong&gt;&lt;br&gt;
You see the final error. Not which step poisoned the state.&lt;br&gt;
No trace. No replay. Just vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mixed mutation semantics&lt;/strong&gt;&lt;br&gt;
Read-only and write steps treated identically.&lt;br&gt;
A retry re-applies a deployment or a payment. You've now deployed twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The mental model shift
&lt;/h2&gt;

&lt;p&gt;Stop thinking: "prompt chain"&lt;br&gt;
Start thinking: "distributed system with state"&lt;/p&gt;

&lt;p&gt;A state machine models your workflow as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;States — Idle, Planning, Executing, Validating, Recovering&lt;/li&gt;
&lt;li&gt;Transitions — conditional, guarded, audited&lt;/li&gt;
&lt;li&gt;Persisted state — survives crashes, enables checkpointing, replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangGraph made this practical. Every node writes to a shared state object. Every edge is conditional.&lt;/p&gt;

&lt;p&gt;If a node fails → resume from the last checkpoint. Not from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this actually looks like
&lt;/h2&gt;

&lt;p&gt;Chain:  A → B → C → D → Error (restart from A)&lt;/p&gt;

&lt;p&gt;Graph:  A → B → C → Error → Retry(C) → D&lt;br&gt;
                    ↓&lt;br&gt;
               HumanApproval → D&lt;/p&gt;

&lt;p&gt;The graph knows where it failed. It knows what to do next.&lt;br&gt;
The chain just panics.&lt;/p&gt;




&lt;p&gt;This is Part 1 of a series on building deterministic, production-grade multi-agent systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; Why I'm using Gemma 4 26B MoE as the reasoning engine — and how it compares to GPT-4o on real cost.&lt;/p&gt;

&lt;p&gt;If you're building AI systems that need to work under an SLA — follow along.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
