DEV Community

gentic news

Originally published at gentic.news

AgingBench: AI Agents Lose Reliability Over Time as Memory Fails

A UT Austin paper finds that AI agents degrade over time through memory errors, and proposes AgingBench to measure reliability decay across sessions.

University of Texas researchers found AI agents quietly degrade over time. Their new paper proposes AgingBench, a benchmark measuring reliability decay across sessions.

Key facts

  • Paper from University of Texas on arXiv
  • Identifies 4 failure modes: summary drift, memory interference, stale updates, maintenance bugs
  • Proposes AgingBench for multi-session reliability testing
  • Agents can sound competent while becoming less exact
  • Code and dataset not yet publicly released

A new paper from the University of Texas, posted on arXiv, argues that AI agents suffer from 'aging' — a slow, silent decline in reliability after deployment, even when the underlying language model remains unchanged. The core problem, according to the researchers, is that agents are typically evaluated in a single clean session, but real-world agents accumulate state: they summarize old chats, store memories, update facts, and undergo maintenance. Each step can introduce errors that compound.

The paper, titled "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems," identifies four primary failure modes:

  • Summary drift: key details are dropped or distorted when old conversations are compressed.
  • Memory interference: similar client records or facts blur together.
  • Stale updates: after a fact is corrected, an older, incorrect version still overrides it.
  • Maintenance bugs: cleanup passes can accidentally delete or corrupt stored data.
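
To make one of these failure modes concrete, here is a minimal sketch of a stale update in a toy key-value memory. Nothing in it comes from the paper: the class names, the `observed_at` field, and the `client_42.city` key are all hypothetical, and the "reject older writes" guard is just one plausible mitigation, not the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class NaiveMemory:
    """Hypothetical store: last write wins, however old the written value is."""
    facts: dict = field(default_factory=dict)

    def write(self, key, value, observed_at):
        self.facts[key] = (value, observed_at)

    def read(self, key):
        return self.facts[key][0]


@dataclass
class VersionedMemory(NaiveMemory):
    """Hypothetical store: ignores a write older than what is already stored."""

    def write(self, key, value, observed_at):
        current = self.facts.get(key)
        if current is None or observed_at >= current[1]:
            self.facts[key] = (value, observed_at)


def replay(memory):
    # Session 1: the agent learns a client's city.
    memory.write("client_42.city", "Austin", observed_at=1)
    # Session 2: the client corrects it.
    memory.write("client_42.city", "Dallas", observed_at=2)
    # Session 3: an old summary is re-ingested and carries the outdated value.
    memory.write("client_42.city", "Austin", observed_at=1)
    return memory.read("client_42.city")
```

Under these assumptions, `replay(NaiveMemory())` returns the outdated `"Austin"`, while `replay(VersionedMemory())` keeps the corrected `"Dallas"`.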

The authors propose AgingBench, a benchmark that simulates multi-session agent interactions to measure how reliability degrades. The benchmark tests each failure mode separately, aiming to provide a structured way to evaluate agent longevity.
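
Since the AgingBench code is not public, the following is only a rough sketch of what a multi-session reliability measurement could look like. It does not reproduce the benchmark. The `run_sessions` harness, the `TruncatingAgent` class, and its `remember`/`recall`/`end_session` methods are invented for illustration, and the "compaction" here is plain truncation standing in for lossy summarization.

```python
def run_sessions(agent_factory, facts_per_session, num_sessions):
    """Return recall accuracy over ALL facts seen so far, after each session."""
    agent = agent_factory()
    seen = {}
    accuracy = []
    for s in range(num_sessions):
        for i in range(facts_per_session):
            key, value = f"fact_{s}_{i}", f"value_{s}_{i}"
            agent.remember(key, value)
            seen[key] = value
        agent.end_session()  # state is compacted between sessions
        correct = sum(agent.recall(k) == v for k, v in seen.items())
        accuracy.append(correct / len(seen))
    return accuracy


class TruncatingAgent:
    """Toy agent: end-of-session compaction keeps only the newest facts."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.memory = {}

    def remember(self, key, value):
        self.memory[key] = value

    def recall(self, key):
        return self.memory.get(key)

    def end_session(self):
        # dicts preserve insertion order, so the tail holds the newest facts.
        newest = list(self.memory.items())[-self.capacity:]
        self.memory = dict(newest)
```

With `run_sessions(TruncatingAgent, facts_per_session=2, num_sessions=4)`, accuracy stays at 1.0 for the first two sessions and then falls to about 0.67 and 0.5. A single-session evaluation of this toy agent would never show that decline.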

The paper's distinctive claim is that 'give it more memory' is often the wrong fix, because the right remedy depends on where a fact was lost. If a fact was never written, retrieval cannot save it. If it was crowded out, better summarization won't help. If it's present but unused, the problem is not storage but the agent's decision to trust or ignore what it retrieved.
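
One way to read this argument is as a triage over where a fact was lost. The sketch below is an assumption-laden illustration, not the paper's procedure: the `Failure` enum, the `triage` function, and its inputs (a write log, the current store, and the set of retrieved keys) are all hypothetical. The `NOT_RETRIEVED` case is an extra step added here for completeness and is not named in the article.

```python
from enum import Enum


class Failure(Enum):
    NONE = "answer correct"
    NEVER_WRITTEN = "fact never reached memory"
    CROWDED_OUT = "fact was written but is no longer stored"
    NOT_RETRIEVED = "fact is stored but was not retrieved"  # added case
    UNUSED = "fact was retrieved but the answer ignored it"


def triage(key, expected, answer, write_log, store, retrieved):
    """Locate the earliest stage at which the expected fact was lost.

    write_log: list of (key, value) pairs ever written to memory
    store:     dict holding the current memory contents
    retrieved: set of keys handed to the agent for this answer
    """
    if answer == expected:
        return Failure.NONE
    if (key, expected) not in write_log:
        return Failure.NEVER_WRITTEN
    if store.get(key) != expected:
        return Failure.CROWDED_OUT
    if key not in retrieved:
        return Failure.NOT_RETRIEVED
    return Failure.UNUSED
```

For example, `triage("city", "Dallas", "Austin", [("city", "Dallas")], {"city": "Dallas"}, {"city"})` returns `Failure.UNUSED`: the fact was stored and retrieved, so adding more memory would not change the outcome.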

The researchers emphasize that deployed agents behave less like static models and more like aging infrastructure — a system that requires ongoing monitoring, not just a one-time evaluation.

The paper does not disclose specific benchmark numbers or compare AgingBench against existing agent evaluation frameworks. The AgingBench code and dataset have not yet been publicly released, though the authors state they plan to release them.

What to watch

Watch for the public release of the AgingBench code and dataset, and for whether major agent platforms (Anthropic, OpenAI, Google) adopt multi-session reliability as a standard evaluation metric in their developer documentation or benchmarks.


