DEV Community

Cover image for Building a Memory Agent That Actually Forgets (And the Three Bugs That Taught Me Why That's Hard)
Nidhi
Nidhi

Posted on

Building a Memory Agent That Actually Forgets (And the Three Bugs That Taught Me Why That's Hard)

By Nidhi: Built for the Global AI Hackathon Series with Qwen Cloud, Track 1: MemoryAgent

When I signed up for Track 1, I figured memory was basically a vector database problem. Store embeddings, retrieve similar messages, done. The first prototype worked fine. Then I started actually trying to use it, and that's when things got interesting.

The idea

Most AI chat is stateless. Close the tab, and the next conversation starts from zero. That's a strange limitation for something we keep calling "intelligent": a person who knew you well wouldn't forget your name, your job, or what you told them you were working on, three days ago. So the plan was: build a memory layer underneath a chat agent. Every turn would recall relevant past context, respond with it in mind, and extract new facts worth remembering. Important things stick around. Trivial things expire.

That part came together quickly. FastAPI for the backend, split into two services: a memory core handling storage and intelligence, and a thin chat layer on top calling into it. Qwen Cloud for both the chat model and the embeddings. Neon Postgres with pgvector for the actual vector storage, Upstash Redis for caching, all of it deployed on Alibaba Cloud ECS. Within a couple of days I had a working chat UI that stored memories, scored them for importance, and recalled them on the next message.

It looked done. It was not done.

Getting there wasn't clean, either. Two FastAPI services that needed to actually talk to each other, a Postgres instance on Neon that needed pgvector enabled before anything would even insert, a Redis cache that silently no-ops if it can't connect instead of telling you it failed, and then the actual deploy — SSH into an Alibaba Cloud ECS box, realize the venv isn't activated by forgetting which environment you are in, watch both services crash with ModuleNotFoundError, activate it, restart, repeat. None of that shows up in an architecture diagram. All of it ate real time.

Bug one: the system was punishing specificity

I started running real conversations through it, not test data, just talking to it the way a user would. At one point I told it my name. A day later, I asked what it remembered, and my name had a 24-hour expiration timer on it. Separately, I'd mentioned growing cherry tomatoes and basil, and that specific fact scored lower than the vague, generic statement "I'm an indoor gardener" that I'd said in the same conversation.

That's backwards in a way that matters. The entire value of a memory system is recalling specifics — "user grows cherry tomatoes" is useful; "user likes gardening" barely is. But the scoring prompt I'd written only gave Qwen three anchor points: trivial greetings at 0.0, vague preferences at 0.5, and core facts at 1.0. There was no guidance at all for "specific personal fact" or "person's own name," so it was filling in the gap with something closer to vibes than calibration.

The fix was to stop assuming the model would infer the right priorities and instead write them down explicitly: a name should never score below 0.6. A specific instance of something should never score lower than the general category it belongs to. Once I added those as hard calibration rules rather than implicit expectations, both problems disappeared. The lesson that stuck with me: when you're using an LLM as a scoring function, it will dutifully fill in whatever gaps you leave in your rubric, in whatever direction is easiest- not necessarily the direction you'd want.

Bug two: "save this" did nothing, and said it worked anyway

This one was sneakier because it looked like success. I'd ask the agent for something — a training plan, a morning routine — and then say "save this for me." It would reply with a perfectly confident confirmation: "I've updated my memory with your specific training plan." And then I'd check the actual memory panel. Nothing relevant was there.

The bug was structural. My extraction step — the part that pulls durable facts out of a conversation turn to store — only ever looked at the current turn: the user's message and the assistant's reply, nothing else. When the user's message was "save this," there was nothing in that extraction call's view of the world to actually save. The real content — the training plan — had been generated several messages earlier, in a different turn the extractor never saw. The chat reply that generated the confirmation had full conversation history available to it, so it sounded confident and specific. The extraction call that was supposed to act on that confirmation had none of that context. Two parts of the same pipeline were operating with completely different amounts of information, and nothing in the architecture made that mismatch visible until I went looking for it by hand.

The fix was to feed recent conversation history into the extraction prompt too, with explicit instructions: if the user is pointing back at something said earlier rather than stating a new fact directly, resolve the reference using that history before deciding what to extract. After the fix, the same "save this" request correctly pulled out the actual content of whatever it was referring to.

What stuck with me here wasn't the fix — it was how convincing the failure was. A wrong answer that sounds tentative gets double-checked. A wrong answer delivered with total confidence usually doesn't. I only caught this because I happened to check the memory panel out of habit, not because anything in the system signaled a problem.

Bug three: the feature that had never once worked

Smart Forget was supposed to be one of the more interesting pieces.Instead of blindly deleting memories the moment a timer expired, it would gather anything past its expiration window and ask Qwen to make one more judgment call: does this still matter, even though its time technically ran out? I built it, tested it, and every single run reported back reviewed: 0, deleted: 0. I assumed that meant nothing had expired yet in my short test windows, which was plausible — TTLs were measured in days, and I'd only been testing for a few hours at a time.

It wasn't that. It was a crash, quietly swallowed by my own error handling. The database returns timestamps as timezone-aware datetime objects. My code was comparing them against datetime.utcnow(), which is timezone-naive. Python refuses to subtract one from the other, throws a TypeError, and my try/except block caught that error and defaulted to "keep." The scary part: nothing crashed loudly enough to notice. My own error handling turned a real failure into something that looked exactly like cautious, conservative behavior — when really, the feature had never run a single actual comparison.

I only found this because I built a real test suite, not unit tests with mocks, but a script that makes live HTTP calls against the actual running deployment and checks real outcomes. And even then, the test initially passed, because 0 deleted wasn't technically wrong; it just wasn't meaningfully right either. The thing that actually caught it was going one level further: manually pushing a memory's expiration into the past directly in the database, triggering Smart Forget, and watching the server logs in real time. That's when the actual TypeError showed up, instead of being hidden behind a graceful-looking fallback.

The fix itself was small: switch to timezone-aware datetimes everywhere, fix the underlying schema too so the mismatch couldn't quietly recur. The bigger fix was procedural: a passing test and a correctly-behaving system are not the same thing, and the gap between them is exactly where things like this hide.

What this actually taught me

None of these three bugs threw an error message I noticed, or failed a test on the first pass. Two of them gave me a confident-sounding reply while doing nothing underneath. The third one looked cautious and careful while actually just being broken. Turns out a smooth reply, a 200 status code, and a "0 deleted" that I assumed meant "nothing needed deleting" are not the same thing as "this is actually working."

What caught all three, every time: stop trusting the explanation and go check the data. Open the actual memory panel instead of reading the reply text. Tail the actual server logs instead of trusting the status code. Manufacture the edge case by hand instead of waiting for it to show up on its own. None of that is clever. It's just being willing to look one layer past whatever the system is telling you, instead of taking its word for it.

The project that came out the other side does what the brief asked: it remembers what matters, lets go of what doesn't, and reconciles contradictions instead of accumulating them forever. But the part I'd actually want to talk about in an interview isn't the architecture diagram. It's the three different ways a system can look like it's working when it isn't, and what it took to notice each one.


Built with Qwen Cloud (qwen-plus and text-embedding-v3), FastAPI, Neon PostgreSQL with pgvector, Upstash Redis, and deployed on Alibaba Cloud ECS, for Track 1: MemoryAgent of the Global AI Hackathon Series with Qwen Cloud.

Top comments (2)

Collapse
 
nazar_boyko profile image
Nazar Boyko

An LLM rubric will quietly fill any gap you leave, and it fills it in whatever direction is easiest rather than the one you meant, so "specific fact scores lower than the vague version" makes painful sense. Writing the hard floor of "a name never scores below 0.6" as an explicit rule instead of hoping the model infers it is the fix that generalizes way past this project. The thread running through all three, that a confident reply and a working system aren't the same thing, is worth the whole read.

Collapse
 
hereforlolz profile image
Nidhi

This is exactly what I ran into in practice. The model doesn’t leave gaps empty, it fills them with safe generalizations unless you explicitly define ordering. I ended up fixing it by adding hard floors and precedence rules for specificity, and that immediately stopped the drift into vague summaries.

Next step is moving away from a single scalar score and treating memory as structured types with explicit ranking rules instead of collapsing everything into one value.
Still seeing edge cases, but the pattern is pretty consistent so far.