DEV Community

vishal-dehurdle

I Used Lyapunov Stability Theory to Monitor LLM Agents — Here's What Actually Worked and What Didn't

The Elephant in the Room: "Isn't This Just max_iterations?"

Let me address this up front.

If you're building a ReAct loop with a single LLM and 10 tool calls, you do not need a physics-inspired monitoring library. Set max_iterations=10, add a budget cap, and move on. LangGraph, CrewAI, and every modern agent framework already support this natively.

I built state-harness because I ran into a problem that max_iterations doesn't solve. And after benchmarking it across 2,367 runs, I also learned what it can't do — which I'll be equally transparent about.


The Problem That max_iterations Doesn't Solve

There are two specific scenarios where simple iteration caps fall short:

1. Search-Tree Agents (MCTS, Beam Search)

Advanced coding agents — the kind that solve SWE-bench tasks, or the architecture behind tools like Devin — don't run a flat loop. They explore a search tree. Each node branches into multiple candidate solutions. A node that spirals doesn't just waste one turn; it inflates every downstream branch.

In a 50-node search tree, you can't set max_iterations=50 and call it a day. The agent isn't iterating — it's branching. Token usage grows quadratically. A single stuck branch can burn thousands of tokens before the tree-level budget cap even notices, because the per-branch cost looks normal in isolation.
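To make the growth concrete, here is a minimal sketch of the cost of one branch. It assumes every node re-sends the full history of its path and that each step adds a fixed number of tokens; both are simplifications for illustration, and `branch_tokens` is a hypothetical helper, not part of any framework.

```python
def branch_tokens(depth: int, tokens_per_step: int) -> int:
    """Total prompt tokens along one branch when node d re-sends d steps of history."""
    return sum(d * tokens_per_step for d in range(1, depth + 1))


# Doubling the depth roughly quadruples the cumulative cost of the branch.
for depth in (5, 10, 20):
    print(depth, branch_tokens(depth, 500))
```

Under these assumptions each individual node only costs a little more than its parent, which is why a stuck branch can look normal step by step while its total keeps climbing.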

2. Failure Pattern Aggregation at Scale

If you run 100 agent tasks a day, you open LangSmith, look at the traces of the 5 that failed, and debug them manually. That works.

If you run 10,000+ tasks a day, manual trace inspection is impossible. Your observability bill alone (storing and indexing millions of multi-turn traces) becomes significant. What you actually need is: classify the failure pattern at the edge, at zero cost, and export it as a structured attribute to your metrics pipeline. Then your Grafana dashboard shows: "This week, 40% of failures are retry storms on the SQL tool → add exponential backoff."

That's not something max_iterations gives you. It's not something LangSmith gives you (at least not without paying for indexing every trace). It's what state-harness was designed for.
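As a rough sketch of what the aggregation side could look like once each failed run carries a pattern label: the records below are hypothetical, and only the `state_harness.pattern` attribute name comes from the library's export. The label values and the `tool` field are made up for illustration.

```python
from collections import Counter


def pattern_shares(failures: list[dict]) -> dict:
    """Share of failures per (pattern, tool) pair, most common first."""
    counts = Counter((f["state_harness.pattern"], f.get("tool")) for f in failures)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.most_common()}


# Hypothetical week of failures, already classified at the edge.
failures = (
    [{"state_harness.pattern": "retry_storm", "tool": "sql"}] * 4
    + [{"state_harness.pattern": "context_accumulation_spiral"}] * 3
    + [{"state_harness.pattern": "other"}] * 3
)

for (pattern, tool), share in pattern_shares(failures).items():
    print(f"{share:.0%}  {pattern}  tool={tool}")
```

In practice this counting would happen in the metrics backend; the point is that a small structured label per run is enough to drive it.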


The Core Insight: Growth-Ratio Normalization

In physics and control theory, Lyapunov stability analysis asks whether a dynamical system will return to equilibrium or diverge.

I modeled LLM agent token consumption as a dynamical system where the "energy" V(k) at turn k is a function of cumulative token growth. The stability criterion is straightforward: if the step-to-step change in energy, ΔV, is ≥ 0 for consecutive steps, the system is diverging.

The problem: In any multi-turn conversation, token usage grows monotonically because the context window accumulates history. A naive Lyapunov monitor would trip on every healthy conversation — you'd get 100% false positives.
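A toy version of that naive monitor shows the failure mode. It assumes V(k) is simply the cumulative token count, which is a simplification for illustration and not necessarily the energy function state-harness uses; `naive_monitor_trips` is a hypothetical name.

```python
def naive_monitor_trips(tokens_per_turn: list[int], patience: int = 3) -> bool:
    """Trip when the energy V(k) fails to decrease for `patience` consecutive steps."""
    energy = 0
    streak = 0
    for tokens in tokens_per_turn:
        new_energy = energy + tokens      # assumed: V(k+1) = V(k) + tokens this turn
        delta_v = new_energy - energy     # always positive when tokens > 0
        streak = streak + 1 if delta_v >= 0 else 0
        if streak >= patience:
            return True
        energy = new_energy
    return False


# A perfectly healthy conversation with slowly growing context still trips.
print(naive_monitor_trips([800, 950, 1100, 1250]))
```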

The solution: Instead of monitoring raw token counts, normalize each turn against a running baseline to compute a growth ratio:

  • Growth ratio ≈ 1.0 → the agent is consuming tokens at its expected rate (stable)
  • Growth ratio > 2.0× for 3+ consecutive turns → the agent is consuming disproportionately more each turn (diverging)

This normalization is the key insight. It's analogous to the distinction between intensive and extensive quantities in thermodynamics — monitoring density (ratio) rather than mass (absolute count).
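A minimal sketch of the idea, under stated assumptions: the baseline here is the running mean of all previous turns, and the guard trips after three consecutive turns above 2.0×. The post does not spell out how the real baseline is computed, so `GrowthRatioSketch` and `first_trip` are hypothetical stand-ins rather than the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GrowthRatioSketch:
    threshold: float = 2.0   # ratio above which a turn counts as diverging
    patience: int = 3        # consecutive diverging turns needed to trip
    history: list = field(default_factory=list)
    streak: int = 0

    def record(self, tokens: int) -> bool:
        """Record one turn; return True once the guard has tripped."""
        if self.history:
            baseline = sum(self.history) / len(self.history)  # assumed baseline
            ratio = tokens / baseline
        else:
            ratio = 1.0  # the first turn defines the baseline
        self.history.append(tokens)
        self.streak = self.streak + 1 if ratio > self.threshold else 0
        return self.streak >= self.patience


def first_trip(tokens_per_turn):
    """Index of the turn where the guard trips, or None if it never does."""
    guard = GrowthRatioSketch()
    for turn, tokens in enumerate(tokens_per_turn):
        if guard.record(tokens):
            return turn
    return None


healthy = [1000 + 200 * i for i in range(20)]        # steady context growth
diverging = [1000, 1100, 1200, 3000, 7000, 16000]    # runaway growth

print(first_trip(healthy))
print(first_trip(diverging))
```

The healthy conversation grows on every turn but stays below 2.0× its own baseline, so it never trips; the runaway one trips on its sixth turn.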


Integration: A Few Lines of Code

```python
from state_harness import GrowthRatioGuard, FailureReport

guard = GrowthRatioGuard(token_budget=50_000)

# agent_loop and llm are placeholders for your own agent loop and LLM client.
with guard:
    for turn in agent_loop:
        result = llm.invoke(turn.prompt)
        # Report each turn's token usage to the guard.
        guard.record_step(tokens_used=result.usage.total_tokens)

report = FailureReport.from_guard(guard)
print(report)
```

When the guard trips, the diagnostic report classifies the failure pattern — at zero cost, with no LLM calls:

```text
⚠️  STABILITY TRIPPED at turn 12

Pattern: Context Accumulation Spiral (confidence: 92%)
  • Last 5 turns all exceeded 1.5× baseline (4/4 were accelerating).
  • Peak growth ratio: 5.2× baseline.
  • Without intervention, projected cost was $0.0396 (actual: $0.0039).

Suggested actions:
  🔴 1. Enable history compression in your agent loop.
  🟡 2. Lower the growth ratio threshold to 1.8×.
  🟢 3. Add a sliding-window context strategy.
```

The classified pattern and suggested actions export cleanly to OpenTelemetry:

```python
from opentelemetry import trace

span = trace.get_current_span()
span.set_attributes(report.to_otel_attributes())
# Adds: state_harness.pattern, state_harness.confidence, etc.
```

Framework Integrations

LangGraph:

```python
from langgraph.prebuilt import create_react_agent
from state_harness.adapters import monitor_graph

# model, search, and calculate are placeholders for your own model and tools.
agent = create_react_agent(model, tools=[search, calculate])
safe = monitor_graph(agent, token_budget=100_000)
result = safe.invoke({"messages": [("user", "Fix the login bug")]})
print(safe.report)
```

CrewAI:

```python
from crewai import Crew
from state_harness.adapters import CrewAICallback

callback = CrewAICallback(token_budget=200_000)
# Fill in your own agents and tasks.
crew = Crew(agents=[...], tasks=[...], step_callback=callback.step_callback)
result = crew.kickoff()
print(callback.report)
```

Benchmarks: What Worked and What Didn't

We evaluated state-harness across 2,367 total runs with a 5-condition ablation study on three benchmarks.

What worked: Zero false positives on stable tasks

Across 1,136 MINT runs (short-loop reasoning) and 750 τ³-bench runs (medium-loop customer service), state-harness never tripped once. The growth-ratio normalization correctly identified these as stable conversations and introduced <2% token overhead.

This is the most important result. A monitoring tool that interferes with healthy agents is worse than no monitoring at all.

What worked: Compute savings on search trees

On SWE-bench Verified (37 Django instances, Moatless-tools SearchTree agent, Gemini 2.5 Flash):

| Condition | Compute (nodes) | Reduction |
|---|---|---|
| A. Baseline (no monitoring) | 945 | |
| B. + Lyapunov monitor only | 620 | 34.4% |
| D. Full-stack (Lyapunov+RG+VSA) | 580 | 38.6% |

The monitor eliminated all max-budget burnout events (7 tasks hitting the 50-node ceiling → 0) and reduced wall time by 30%.
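For readers who want to check the Reduction column, the arithmetic is just the relative drop in explored nodes against the baseline. `reduction_pct` is a throwaway helper written for this check, not part of state-harness.

```python
def reduction_pct(baseline_nodes: int, monitored_nodes: int) -> float:
    """Relative compute reduction versus the baseline, in percent."""
    return 100.0 * (baseline_nodes - monitored_nodes) / baseline_nodes


print(round(reduction_pct(945, 620), 1))  # 34.4 (Lyapunov monitor only)
print(round(reduction_pct(945, 580), 1))  # 38.6 (full stack)
```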

What didn't work: Improving resolve rates

This is the honest part that most open-source projects would hide.

We ran 3 independent trials per condition (333 total runs) to measure nondeterminism:

| Condition | Resolve rate (mean ± σ) |
|---|---|
| A. Baseline | 44.1% ± 4.1% |
| D. Full-stack | 40.5% ± 2.7% |
| E. Naive Cap | 45.9% ± 5.4% |

A naive budget cap achieves comparable resolve rates. The spread across conditions (2.9%) is smaller than the run-to-run nondeterminism within a condition (4.1%), so the differences between conditions cannot be distinguished from noise. state-harness doesn't make agents smarter; it makes failures diagnosable.
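As an illustration of how a mean ± σ figure like these comes out of three trials, the sketch below uses the sample standard deviation over per-trial resolve rates. The per-trial counts are hypothetical, chosen only because they land on the reported baseline row; they are not the raw data from the benchmark.

```python
from statistics import mean, stdev

INSTANCES = 37  # SWE-bench Verified Django instances per trial (from the text)

# Hypothetical resolved-task counts for three trials; NOT the author's raw data.
resolved_per_trial = [15, 16, 18]

rates = [100.0 * n / INSTANCES for n in resolved_per_trial]
summary = f"{mean(rates):.1f}% ± {stdev(rates):.1f}%"
print(summary)  # 44.1% ± 4.1%
```

With only 37 instances, one extra resolved task moves the rate by about 2.7 points, which is part of why run-to-run noise is so visible at this scale.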

Bonus finding: The nondeterminism floor

Both τ³-bench and SWE-bench converged on a ~4–5% intrinsic nondeterminism floor for Gemini 2.5 Flash on these tasks. This means any single-run benchmark comparison reporting performance deltas under 8% (roughly twice that floor) is statistically unreliable. If you see a paper claiming "our agent is 6% better," ask them how many trials they ran.
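That rule of thumb can be written down as a simple two-sigma check. This is a rough heuristic under the assumption that run-to-run noise is about one standard deviation per run, not a substitute for a proper significance test; `exceeds_noise_floor` is a hypothetical helper.

```python
def exceeds_noise_floor(delta_pct: float, sigma_pct: float, k: float = 2.0) -> bool:
    """True when a reported delta is larger than k standard deviations of run-to-run noise."""
    return abs(delta_pct) > k * sigma_pct


print(exceeds_noise_floor(6.0, 4.0))   # False: a 6-point claim against a 4-point floor
print(exceeds_noise_floor(10.0, 4.0))  # True
```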


The Three Mechanisms (and an Honest Ablation)

state-harness has three components, all written in Rust and exposed to Python via PyO3:

  1. Lyapunov Monitor (~1μs/step): The growth-ratio energy function described above.
  2. RG Decimator (~100μs/compress): TF-IDF-based history compression inspired by Renormalization Group theory.
  3. Holographic Engine (~10μs/check): VSA-based semantic drift detection using 10,000-dimensional bipolar vectors.

The honest ablation result: Lyapunov alone delivers ~90% of the total benefit (34.4% out of 38.6%). RG and VSA add incremental value. If you want maximum simplicity, just use the GrowthRatioGuard with default settings and ignore the rest.
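For readers unfamiliar with VSA, here is a pure-Python sketch of the general technique behind drift detection with bipolar hypervectors. It is a bag-of-words toy written for illustration: the encoding, the function names, and the example sentences are all assumptions, and the actual Holographic Engine is written in Rust and may work quite differently.

```python
import math
import random

DIM = 10_000  # dimensionality mentioned in the text


def token_vector(token: str) -> list[int]:
    """Deterministic random bipolar (+1/-1) hypervector for a token."""
    rng = random.Random(token)  # a string seed yields the same vector on every run
    return [1 if rng.random() < 0.5 else -1 for _ in range(DIM)]


def encode(text: str) -> list[int]:
    """Bundle token vectors by element-wise addition."""
    total = [0] * DIM
    for token in text.lower().split():
        for i, component in enumerate(token_vector(token)):
            total[i] += component
    return total


def cosine(a: list[int], b: list[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


task = encode("fix the login bug in the auth module")
on_topic = encode("the login bug is in the auth token check")
drifted = encode("weather forecast for paris tomorrow")

# Shared vocabulary keeps similarity high; unrelated text lands near zero.
print(cosine(task, on_topic) > cosine(task, drifted))
```

Because random high-dimensional vectors are nearly orthogonal, a turn that shares no vocabulary with the task scores close to zero, and a monitor can flag drift when similarity falls below a chosen threshold.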


Who Should (and Shouldn't) Use This

| If you're... | Use state-harness? |
|---|---|
| Building a chatbot or RAG pipeline | ❌ No. These don't spiral. |
| Running a simple ReAct agent (<10 turns) | ❌ No. max_iterations is enough. |
| Running coding/DevOps agents with search trees | ✅ Yes. Branch explosion is real. |
| Running 1000+ agent tasks/day in production | ✅ Yes. Edge-classified failure patterns at zero cost. |
| Benchmarking agents and publishing results | ✅ Yes. The nondeterminism floor matters. |

Try It

I built state-harness as a research project exploring whether control theory can provide useful runtime guarantees for stochastic software. If you're running agents at scale and want zero-cost failure diagnostics, or if you're just curious about applying physics to AI systems, I'd love your feedback.
