<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: System Rationale</title>
    <description>The latest articles on DEV Community by System Rationale (@system_rationale).</description>
    <link>https://dev.to/system_rationale</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F404282%2F25c84903-9f4f-4a6c-a886-c324eebf901c.png</url>
      <title>DEV Community: System Rationale</title>
      <link>https://dev.to/system_rationale</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/system_rationale"/>
    <language>en</language>
    <item>
      <title>Why your LLM agent fails at 3 AM (and how state machines fix it)</title>
      <dc:creator>System Rationale</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:35:26 +0000</pubDate>
      <link>https://dev.to/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</link>
      <guid>https://dev.to/system_rationale/why-your-llm-agent-fails-at-3-am-and-how-state-machines-fix-it-3691</guid>
      <description>&lt;p&gt;#agents #llm #langgraph #systemdesign #aiinfra&lt;/p&gt;

&lt;p&gt;I've been reading postmortems from teams running LLM agents in production.&lt;/p&gt;

&lt;p&gt;Same failure every time.&lt;/p&gt;

&lt;p&gt;Not model quality. Not prompt engineering. The architecture.&lt;/p&gt;

&lt;p&gt;Most AI agents today still look like this:&lt;/p&gt;

&lt;p&gt;User Input → LLM Call → Tool Call → LLM Call → Output&lt;/p&gt;

&lt;p&gt;A chain. Linear. Stateless. Hopeful.&lt;/p&gt;

&lt;p&gt;Works great in a notebook. Breaks under real load.&lt;/p&gt;
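
&lt;p&gt;That pipeline fits in a few lines of plain Python. &lt;code&gt;llm_call&lt;/code&gt; and &lt;code&gt;tool_call&lt;/code&gt; below are hypothetical stand-ins, not any real SDK; the point is the shape, not the API.&lt;/p&gt;

```python
# Minimal sketch of the linear chain above. No state, no checkpoint,
# no exit strategy: every run starts from zero and hopes for the best.
def llm_call(prompt):
    return "search('status')"   # stand-in: pretend the model asked for a tool

def tool_call(action):
    return "service is up"      # stand-in: pretend the tool returned a result

def run_agent(user_input):
    # One straight line: input → LLM → tool → LLM → output.
    plan = llm_call(user_input)
    observation = tool_call(plan)
    answer = llm_call(observation)
    return answer

print(run_agent("Is the service healthy?"))
```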




&lt;h2&gt;The 4 ways chains die in production&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Infinite loops&lt;/strong&gt;&lt;br&gt;
Agent calls a tool → tool fails → agent retries → tool fails → agent retries.&lt;br&gt;
No exit condition. You're burning tokens at 3 AM while you sleep.&lt;/p&gt;
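
&lt;p&gt;The simplest exit condition is a retry budget: cap attempts and fail closed. A minimal sketch, with a hypothetical &lt;code&gt;ToolError&lt;/code&gt; standing in for whatever your tool raises:&lt;/p&gt;

```python
# A retry budget: the loop is bounded, and exhausting it surfaces the
# error upward instead of burning tokens forever.
class ToolError(Exception):
    pass

def call_with_budget(tool, args, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(args)
        except ToolError:
            if attempt == max_attempts:
                raise  # fail closed: escalate, don't loop
```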

&lt;p&gt;&lt;strong&gt;2. No checkpoint on failure&lt;/strong&gt;&lt;br&gt;
Step 7 of 10 fails. You restart from step 1. Every. Single. Time.&lt;br&gt;
Duplicate side effects — emails, API writes, deploys — retried blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Opaque debugging&lt;/strong&gt;&lt;br&gt;
You see the final error. Not which step poisoned the state.&lt;br&gt;
No trace. No replay. Just vibes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Mixed mutation semantics&lt;/strong&gt;&lt;br&gt;
Read-only and write steps treated identically.&lt;br&gt;
A retry re-applies a deployment or a payment. You've now deployed twice.&lt;/p&gt;
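
&lt;p&gt;One common guard for write steps is an idempotency key. This is a sketch only: &lt;code&gt;deploy&lt;/code&gt; is a hypothetical side effect and the in-memory dict stands in for a durable store.&lt;/p&gt;

```python
# Write steps carry an idempotency key: a retried "deploy" returns the
# recorded result instead of deploying a second time.
applied = {}   # idempotency_key → result (stand-in for a durable store)

def deploy(version):
    return "deployed " + version   # hypothetical side effect

def run_write_step(key, fn, arg):
    if key in applied:
        return applied[key]        # retry path: no second side effect
    result = fn(arg)
    applied[key] = result
    return result
```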




&lt;h2&gt;The mental model shift&lt;/h2&gt;

&lt;p&gt;Stop thinking: "prompt chain"&lt;br&gt;
Start thinking: "distributed system with state"&lt;/p&gt;

&lt;p&gt;A state machine models your workflow as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;States — Idle, Planning, Executing, Validating, Recovering&lt;/li&gt;
&lt;li&gt;Transitions — conditional, guarded, audited&lt;/li&gt;
&lt;li&gt;Persisted state — survives crashes, enables checkpointing, replay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangGraph made this practical. Every node writes to a shared state object. Every edge is conditional.&lt;/p&gt;

&lt;p&gt;If a node fails → resume from the last checkpoint. Not from scratch.&lt;/p&gt;
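
&lt;p&gt;The checkpoint-and-resume idea doesn't need a framework to demonstrate. This is a plain-Python sketch, not LangGraph's actual API: nodes write to a shared state dict, and the state is persisted after every successful node, so a crashed run resumes from the last completed step.&lt;/p&gt;

```python
# Plain-Python sketch of a checkpointed state machine: each node mutates a
# shared state dict, and the whole state is saved to disk after every node,
# so a restart skips everything already marked done.
import json

def plan(state):
    state["plan"] = ["fetch", "summarize"]

def execute(state):
    state["result"] = "raw data"

def validate(state):
    state["valid"] = True

NODES = [("Planning", plan), ("Executing", execute), ("Validating", validate)]

def run(state, checkpoint_path="checkpoint.json"):
    done = set(state.get("done", []))
    for name, node in NODES:
        if name in done:
            continue                  # already checkpointed: skip on resume
        node(state)
        done.add(name)
        state["done"] = sorted(done)
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)       # persist after every successful node
    return state
```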




&lt;h2&gt;What this actually looks like&lt;/h2&gt;

&lt;p&gt;Chain:  A → B → C → D → Error (restart from A)&lt;/p&gt;

&lt;pre&gt;Graph:  A → B → C → Error → Retry(C) → D
                      ↓
               HumanApproval → D
&lt;/pre&gt;

&lt;p&gt;The graph knows where it failed. It knows what to do next.&lt;br&gt;
The chain just panics.&lt;/p&gt;
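
&lt;p&gt;That "knows what to do next" is just a router on the error edge. A sketch in plain Python; the &lt;code&gt;Retry&lt;/code&gt; and &lt;code&gt;HumanApproval&lt;/code&gt; names are illustrative, not any framework's vocabulary:&lt;/p&gt;

```python
# Router for the error edge in the graph above: retry C until the budget
# runs out, then park the run for a human instead of looping.
def route_after_error(state, max_retries=2):
    if state["retries"] >= max_retries:
        return "HumanApproval"        # too many failures: escalate
    return "Retry"                    # otherwise, try C again

def handle_failure(state):
    decision = route_after_error(state)
    if decision == "Retry":
        state["retries"] += 1
    return decision
```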




&lt;p&gt;This is Part 1 of a series on building deterministic, production-grade multi-agent systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next up:&lt;/strong&gt; Why I'm using Gemma 4 26B MoE as the reasoning engine — and how it compares to GPT-4o on real cost.&lt;/p&gt;

&lt;p&gt;If you're building AI systems that need to work under an SLA — follow along.&lt;/p&gt;

&lt;p&gt;— System Rationale&lt;/p&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
