Lizzie Siegle

Posted on Jun 23 • Edited on Jun 24

Agents write code, but they don't remember

#ai #sdlc #llms #agents

Intent as the spine of a new SDLC

For twenty years, the software development lifecycle (SDLC) was like a relay race. (I love relays--something about the teamwork and team sports!) It involved a team: one person wrote a ticket, another designed, someone built, and another reviewed. Each handoff had its own tool, artifact, and meeting.

With the rise of AI agents, that process has been unevenly compressed.

Addy Osmani makes this point very well in his write-up of Google's new SDLC paper--worth reading in full, IMO: implementation time drops from weeks to hours, but requirements, architecture, and verification remain slow because they're judgment work.

Generation is pretty much solved, and what's left is specification, verification, and the systems that hold them together.

This blog post will go over how those systems keep dropping context.

The 80% problem is a context problem

Addy calls the ceiling the "80% problem": agents get the first 80% of a feature quickly, and the last 20% (i.e., edge cases and seams between systems) still needs context the models don't usually have.

I'd phrase it slightly differently. That last 20% is hard not just because the code is hard, but because the reasoning that produced the first 80% is already gone. When an agent builds something, the why evaporates when the session ends, leaving developers/builders with a large diff and a fuzzy memory of what they asked for. Did the agent make the right call on an edge case? Did the plan change halfway through?

Let's say you ask an agent to help you ship a feature. A month later, a teammate gets paged: a big customer is getting throttled. She opens the code and has a pile of questions the commit can't answer. The agent that made all those calls (and had a reason for each, because you watched it reason through them and you're a 10x engineer!) is gone, and so is the session.

So she does the only thing she can do: reverse-engineer a decision that was already carefully made. An hour of agent reasoning got compressed into a diff that keeps the conclusion and throws away the argument. The agent actually had it right, but none of its reasoning survived.

We've gotten very good at generating code, but also very bad at remembering how the code got where it did--leaving the hardest 20% to someone debugging code they didn't write, with none of the context that produced it.

Verification needs the trajectory, not just the output

The most useful distinction in Addy's post is between two kinds of evaluation. Output evaluation asks whether the final result is correct. Trajectory evaluation asks whether the path it took (i.e., tool calls and reasoning) was sound. You want both. An answer that looks correct but skipped its checks is more dangerous than one that's clearly broken.
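To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Step` and `Run` shapes, the tool names, and the three labels are hypothetical, not taken from Addy's post or from any real library.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded action in an agent session (hypothetical shape)."""
    tool: str        # e.g. "edit_file" or "run_tests"
    succeeded: bool


@dataclass
class Run:
    """A finished agent run: the final output plus the path that produced it."""
    output: str
    steps: list[Step] = field(default_factory=list)


def output_ok(run: Run, expected: str) -> bool:
    """Output evaluation: is the final result correct?"""
    return run.output == expected


def trajectory_ok(run: Run, required_tools: set[str]) -> bool:
    """Trajectory evaluation: did every required check run and succeed?"""
    passed = {step.tool for step in run.steps if step.succeeded}
    return required_tools <= passed


def evaluate(run: Run, expected: str, required_tools: set[str]) -> str:
    """Combine both evaluations into one of three labels."""
    if not output_ok(run, expected):
        return "broken"        # clearly wrong, so easy to catch
    if not trajectory_ok(run, required_tools):
        return "unverified"    # looks correct but skipped its checks
    return "sound"
```

The "unverified" case is the dangerous one: the output alone can't tell it apart from "sound", which is why the steps need to be kept around.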

This is the difference between a box score and game film. The box score tells us the result, whereas the film tells us whether the result was earned or lucky. Most of our developer tooling only keeps the box score. The PR queue we all still use was built for human-speed output: a diff, a description, and a thumbs-up. It displays the output and throws the path away, while agents ship at volume. This gives developers two not-so-great options: review every diff and become the bottleneck, or ship blindly and hope.

A diff can't answer the question that really matters: did we build the right thing the right way? That answer lives in the path the agent took--and for most teams at the moment, that lives nowhere.

Where this goes next

I don't believe that the SDLC will die. I think it will invert.

Today, code is the artifact and intent is a ticket nobody reopens. That will flip: intent becomes the spine, and code turns into just one layer you drill into. The unit of work is no longer the PR--it's the whole arc: the ask, the decisions, the path, and the evidence that it works.

That's the bet we're making at Entire. Capture the reasoning chain, attach it to the code in git, and make that what you review, not a wall of diff. It's also how you close the last 20%: not by waiting for a smarter model, but by never losing the context that makes the hard part hard.
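As a rough illustration of "attach it to the code in git", here is one way the idea could look using plain `git notes`. This is a sketch under stated assumptions, not a description of how Entire works: the `IntentRecord` fields, the `intent` notes ref, and the helper names are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IntentRecord:
    """Hypothetical record of one unit of agent work."""
    ask: str
    decisions: list[str] = field(default_factory=list)
    path: list[str] = field(default_factory=list)      # tool calls, in order
    evidence: list[str] = field(default_factory=list)  # e.g. test runs


def to_note(record: IntentRecord) -> str:
    """Serialize the record as stable, readable JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def git_notes_command(commit: str, record: IntentRecord,
                      ref: str = "intent") -> list[str]:
    """Build (but do not run) a `git notes` command for the given commit."""
    return ["git", "notes", f"--ref={ref}", "add", "-f",
            "-m", to_note(record), commit]
```

Passing the returned list to `subprocess.run(cmd, check=True)` inside a repository would attach the note, and `git notes --ref=intent show <commit>` would read it back. Notes live beside commits without rewriting them, so the record can travel with the code instead of dying with the session.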

The future of building software isn't "agents write code faster." That's already here (41% of new code is AI-generated, per the numbers in the post)--and it's not enough. The future is teams that can understand, continue, and trust the work an agent did, even after the session ended.

The agents sped up. Now it's our turn to give the work a memory.

Top comments (40)

UnitBuilds • Jun 23

I actually faced this a week ago. I wanted to find a way for an LLM to retain knowledge, not just memories. So I used the graph DB I built to replace Neo4j in MCP-Lite, wired it up as a knowledge graph, and gave the LLM a system instruction to make it curious and continuously question its own knowledge when it's idle. Surprisingly, it ran for 3 days and achieved a state of zen, where it just accepted that seeking knowledge is pointless. That being said, it runs much better than the base model (Gemma 4 E4B). Still needs some work: I want to make the system capable of learning off source code and documents, so it can study source to code cleaner, but first I've got about 5 different projects taking priority before that.

Lizzie Siegle • Jun 23

Love that! "Retaining knowledge, not just memories" is a tighter version of what I tried to get at.

UnitBuilds • Jun 24

Similar to a project I tried out in Feb: I wanted to create a type 1 hypervisor where an empty AI would live, that just learns. Wrote a whole new AI, built on fractals, no matmul, purely nearest neighbor graph traversal. Worked shockingly well at knowing, but terrible at understanding it, because somehow I'd need to build it a tokenizer for an ever-changing graph of fractals. So it got shelved and is currently just training in the background. Either way, the tensor train compression system ended up being more viable, giving a 40% size reduction while maintaining 99.9999% of knowledge, with no need for a new tokenizer, and I could still add new knowledge on to it without too much trouble. The Fractal 1 was a way for it to declutter from tensors and exist as something new that learns natively, all the time, without taking up GB worth of VRAM.

Eva • Jun 24

Right now, reviewing code generated by AI agents is a massive bottleneck for teams and solo builders alike. The agent spits out a massive PR in seconds, but as the reviewer, you're just staring at a static wall of diffs with zero clue how the model arrived there. You either have to blindly merge it and pray, or spend an hour reverse-engineering the logic line by line, which completely defeats the speed advantage of using AI in the first place.

Lizzie Siegle • Jun 24

It is! The static diff problem is def a missing-context problem: git tells you what changed, and Entire helps tell you why, i.e., the /explain skill traces any function or file back to the original session/convo that created it, not just the commit.

What you describe isn't quite a code review bottleneck; I'd say it's a legibility issue. The code is fine, but what's missing is the reasoning chain behind it, and Entire helps with that!

xulingfeng • Jun 23

The trajectory vs. output distinction hits hard. Wrote a whole series about this — the AI in my stories never failed because it generated wrong code. It failed because the reasoning behind the 450ms retry timeout was a 3AM postmortem note that never made it into the training data. The code was right. The context evaporated. 🤝

Lizzie Siegle • Jun 23 • Edited

"The code was right. The context evaporated"--you summarized my post in 2 phrases! The 450ms retry timeout being a 3AM postmortem note that never survived is perfect, proving the failure isn't the model, it's that the reasoning lived somewhere it could die. Where's the link for the series?!

xulingfeng • Jun 24

Haha guilty as charged 😅 Two links for you:
The 450ms story
The whole old series —15 AI Stories collection
Let me know which one hits hardest 👀

Mike Czerwinski • Jun 23

The inversion from code-as-artifact to intent-as-spine feels right. I’d add one uncomfortable constraint: preserving the trajectory gives us evidence, but not necessarily independent evidence. A beautifully retained reasoning chain can still document one coherent mistake.

The durable unit probably needs five parts: ask, decisions, trajectory, outcome, and an externally authored verifier that the same session could not rewrite. Then we can ask both: “Why did the agent do this?” and “What outside its reasoning path proved that it was right?”

Memory closes the context gap. Independence closes the trust gap. Without both, we may preserve the whole game film—and still discover every camera shared the same blind spot.

Lizzie Siegle • Jun 23

Best pushback I've gotten--you're so right! I treat memory as the whole fix, but you've split it much more cleanly: memory closes the context gap while independence closes the trust gap. Where's your blog post on this?!

Mike Czerwinski • Jun 24

Ha thank you. Honestly the two-axis split is sharper in your phrasing than in anything I have written. The closest I have shipped is "A quorum costume" from this week, which lives almost entirely on the independence side. The memory side I worked through in "Thin half of memory" a bit earlier. The explicit "context gap vs trust gap, two different fixes, do not collapse them" post is unwritten, and now I have to write it, because "memory closes the context gap while independence closes the trust gap" is the spine of it. Crediting your framing if I ship it.

Mateo Ruiz • Jun 24

The point about agents not remembering is underrated. Most teams are focused on generation speed, but the bigger bottleneck is preserving decision context after the session ends.

We've seen a similar challenge in AI-assisted development projects where code gets generated quickly, but weeks later nobody can explain why certain architectural or implementation decisions were made. The code survives, the reasoning doesn't.

Capturing intent, decision history, and verification evidence alongside the output feels like a much bigger unlock than simply making models faster. Without that layer, every future change risks turning into a reverse-engineering exercise.

The idea of treating intent as a first-class artifact rather than just another ticket is an interesting direction for making agent-driven development more maintainable at scale.

Theo Valmis • Jun 24

The reframing that the last 20% is hard because the reasoning behind the first 80% is already gone is the sharpest version of this I have read. Generation being solved just relocates the problem to persistence: the diff survives, the why doesn't. This is exactly the thing we are building Mneme around, treating the decisions as the durable artifact and the code as the layer you drill into, so an agent picking the work back up two weeks later isn't starting from a fuzzy memory. Intent as the spine is the right call.

Joshua Brackin • Jun 24

I've tried a bunch of graph-based tools as a memory. And honestly, nothing has worked for me quite as well as Anthropic's own memory and documentation guidelines. Keeping Claude.md to 200 lines or less, and using it as an index instead of a long-form record. My flow recently has been to set up this memory and documentation system at the very beginning of each session. I do find that after 20-30 longer sessions, I need to do a maintenance pass on it to get things back to their proper sizes. Although that has been getting better as the models keep improving. But it's worked remarkably well. And I use a doc-sync skill to update all applicable documentation at the end of each session to prevent drift. So far this system is working really well for me.

Lizzie Siegle • Jun 24

This is helpful to hear! My claude.md def sometimes gets more like a long-form record

Charan Nihaal R • Jun 24

The “code gen is solved, memory isn’t” line makes so much sense. I love the idea of turning successful sessions into actual reusable recipes/procedural memory instead of just letting good solutions disappear into chat history. Definitely stealing that idea for my own projects. I did do Obsidian mcp and made a skill that saves notes on good changes that I approve of in my generated code but it falls short in minimizing token usage and efficiency.

Charan Nihaal R • Jun 24

I did work with graphifyy (I'm also intrigued by mixing it with Entire skills like search and explain), but I don't understand how to do it, so I've been asking AI to help decode the Entire skill repo so I can make one that fits seamlessly.

Giorgi Kobaidze • Jun 27

This is such a great perspective, and it's incredibly relatable. "Verification needs the trajectory, not just the output" is probably the most important quote here. I know so many people who judge a solution solely by its final output, and in my opinion, that's a huge mistake.

I've seen plenty of developers get stuck because of LLM memory limitations, and when you think about it, it makes perfect sense. That's why I always try to design things upfront with proper specifications and requirement documents. Beyond that, I also keep technical documentation that captures the most important architectural and implementation details of the software.

That way, even if you hit the model's context limit or start a completely new session, you can restore the full context and continue where you left off. It definitely takes more time in the beginning, but investing that time upfront usually translates into exponentially less time spent fixing problems and re-establishing context later on.

Slava • Jun 29

Trajectory evaluation asks whether the path it took (ie tool calls and reasoning) was sound

I made an open source lib which addresses exactly this issue: it reduces a trajectory to an event sequence and uses a specific database search to extract traces with similar patterns. Turned out it significantly outperforms basic RAG in agent failure prediction (benchmark available). Check out: github
