I’m the founder of Remoroo, so full disclosure: this is from something I built.
One thing that became obvious while working on long-running agent systems is that the hard part is not getting the model to generate code. The hard part is keeping the system coherent after hours of tool use.
In a long run, the agent accumulates file reads, shell output, logs, edits, metric checks, and intermediate conclusions. At some point, the useful context becomes larger than what the model can reliably keep in its context window. Once that starts happening, the behavior degrades in predictable ways: it forgets the original objective, loses track of the baseline, re-reads files it already inspected, and sometimes retries approaches that already failed.
What did not work for us was naive truncation. Dropping older context made the run cheaper, but it also made it much less coherent.
What worked better was treating memory more like a systems problem. We built a demand-paging approach where important context is retained, less relevant context is compressed, and older material can be recalled when needed instead of staying permanently in the active window.
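The mechanism can be sketched roughly like this. This is a minimal illustration of the demand-paging idea, not Remoroo's actual code: the class names, the whitespace token count, and the `summarize` stub are all my assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MemoryItem:
    key: str
    text: str
    tokens: int
    pinned: bool = False           # e.g. the original objective or the baseline
    summary: Optional[str] = None  # None means the full text is "in-window"

class PagedMemory:
    """Toy demand-paged context store: full text is kept up to a token
    budget, overflow is compressed to summaries, and compressed items
    stay addressable so they can be paged back in later."""

    def __init__(self, budget: int, summarize: Callable[[str], str]):
        self.budget = budget        # max tokens of full text kept in-window
        self.summarize = summarize  # in practice an LLM call; stubbed here
        self.items: List[MemoryItem] = []

    def add(self, key: str, text: str, pinned: bool = False) -> None:
        # Crude token count by whitespace; a real system uses a tokenizer.
        self.items.append(MemoryItem(key, text, len(text.split()), pinned))
        self._evict()

    def _active_tokens(self) -> int:
        return sum(i.tokens for i in self.items if i.summary is None)

    def _evict(self) -> None:
        # Compress the oldest unpinned full-text items until under budget.
        while self._active_tokens() > self.budget:
            victim = next((i for i in self.items
                           if not i.pinned and i.summary is None), None)
            if victim is None:
                break  # everything left is pinned
            victim.summary = self.summarize(victim.text)

    def recall(self, key: str) -> Optional[str]:
        # "Page in" a compressed item: restore its full text and move it
        # to the end so oldest-first eviction targets older material.
        for idx, item in enumerate(self.items):
            if item.key == key:
                item.summary = None
                self.items.append(self.items.pop(idx))
                self._evict()
                return item.text
        return None

    def render(self) -> str:
        # What the agent sees: summaries stand in for evicted full text.
        return "\n".join(i.summary if i.summary is not None else i.text
                         for i in self.items)
```

The two properties that matter here are that pinned items (objective, baseline) can never be evicted, and that eviction compresses rather than deletes, so older material remains recallable instead of being truncated away.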
This is still imperfect. Retrieval mistakes still matter, and memory policies can bias what the agent “remembers.” But for long-running experiment loops, it worked much better than simple truncation.
Technical write-up / docs:
https://www.remoroo.com/blog/how-remoroo-works