DEV Community

Why your LLM agent fails at 3 AM (and how state machines fix it)

System Rationale

#agents #llm #langgraph #systemdesign #aiinfra

I've been reading postmortems from teams running LLM agents in production.

Same failure every time.

Not model quality. Not prompt engineering. The architecture.

Most AI agents today still look like this:

User Input → LLM Call → Tool Call → LLM Call → Output

A chain. Linear. Stateless. Hopeful.

Works great in a notebook. Breaks under real load.
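The fragility is easy to see in code. Here is a minimal sketch of the linear pattern above — `call_llm` and `call_tool` are illustrative stand-ins, not a real API. If the tool call raises, the whole run is lost: no checkpoint, no retry policy, no record of progress.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a model call; real code would hit an LLM API.
    return f"llm({prompt})"

def call_tool(plan: str) -> str:
    # Stand-in for a tool call. If this raises, the whole chain dies.
    return f"tool({plan})"

def run_chain(user_input: str) -> str:
    plan = call_llm(user_input)   # LLM call
    result = call_tool(plan)      # tool call
    return call_llm(result)       # final LLM call
```

Three function calls, zero safety rails. Everything below follows from that.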


The 4 ways chains die in production

1. Infinite loops
Agent calls a tool → tool fails → agent retries → tool fails → agent retries.
No exit condition. You're burning tokens at 3 AM while you sleep.
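The fix is an explicit exit condition. A sketch of a bounded retry wrapper (the helper name and defaults are mine, not from any particular library):

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=0.1):
    """Retry a flaky call with a hard exit condition and
    exponential backoff, instead of looping forever."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception as exc:
            if attempt == max_attempts:
                # Give up loudly instead of burning tokens all night.
                raise RuntimeError(f"giving up after {attempt} attempts") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))  # backoff before retrying
```

Three failed attempts and you get a real error you can alert on — not an infinite loop.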

2. No checkpoint on failure
Step 7 of 10 fails. You restart from step 1. Every. Single. Time.
Duplicate side effects — emails, API writes, deploys — retried blindly.

3. Opaque debugging
You see the final error. Not which step poisoned the state.
No trace. No replay. Just vibes.

4. Mixed mutation semantics
Read-only and write steps treated identically.
A retry re-applies a deployment or a payment. You've now deployed twice.
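One common guard is an idempotency key on every write step, so a retry can't re-apply a deploy or a payment. A sketch — the in-memory `executed` set stands in for a durable store (a database table, Redis set, etc.):

```python
# Record of write steps that have already been applied.
executed: set[str] = set()

def run_write_step(key: str, action):
    """Run a side-effecting action at most once per idempotency key."""
    if key in executed:
        return "skipped (already applied)"
    result = action()
    executed.add(key)   # record only after the side effect succeeds
    return result
```

Read-only steps can retry freely; write steps go through a gate like this.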


The mental model shift

Stop thinking: "prompt chain"
Start thinking: "distributed system with state"

A state machine models your workflow as:

  • States — Idle, Planning, Executing, Validating, Recovering
  • Transitions — conditional, guarded, audited
  • Persisted state — survives crashes, enables checkpointing, replay

LangGraph made this practical. Every node writes to a shared state object. Every edge is conditional.

If a node fails → resume from the last checkpoint. Not from scratch.


What this actually looks like

Chain: A → B → C → D → Error (restart from A)

Graph: A → B → C → Error → Retry(C) → D
                 Error → HumanApproval → D

The graph knows where it failed. It knows what to do next.
The chain just panics.
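"Knows what to do next" is just a conditional edge: a router that inspects the shared state and names the next node. A sketch, with my own illustrative node names and retry limit:

```python
def route_after_c(state: dict) -> str:
    """Decide the next node after step C based on shared state."""
    if state.get("error") is None:
        return "D"                 # happy path
    if state.get("retries", 0) < 2:
        return "Retry(C)"          # bounded retry
    return "HumanApproval"         # escalate instead of looping
```

Every failure path is an explicit transition you can test, audit, and replay.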


This is Part 1 of a series on building deterministic, production-grade multi-agent systems.

Next up: Why I'm using Gemma 4 26B MoE as the reasoning engine — and how it compares to GPT-4o on real cost.

If you're building AI systems that need to work under an SLA — follow along.

— System Rationale
