I built an RL-native memory engine for LLM agents (every memory decision becomes a training signal)

#ai #llm #opensource #python

Here's something that bothered me for a long time.

Every production agent has some form of memory. You store facts, constraints, user preferences. The agent retrieves them. Tasks succeed or fail. You iterate.

But there's a gap in the loop: you never know which memory decisions were good.

Did that constraint actually help? Did retrieving that preference reduce tokens wasted? Did keeping that old session fact cause the task to fail silently? You have no idea. The agent makes memory decisions constantly, and none of them leave a trace you can learn from.

I started thinking about this differently. Every time an agent decides what to remember, that's a decision. And decisions can be treated as actions in a reinforcement learning sense — with states, rewards, and outcomes you can measure.

So I built memcell-rl.

What it actually does
Every memory cell in memcell-rl is typed. It's not just a string — it has a cell_type (constraint, preference, fact, episode), a scope (global, session, task), a criticality score, and a sensitivity level. This metadata is what the policy uses to make decisions.

When your agent asks "what should I remember right now?", the system runs a policy, selects cells, and — this is the key part — logs the full transition:

State: what cells were available, token budget, cell types in context
Action: which cells were selected or suppressed
Reward: did the task succeed? Was there a stale memory error? How many tokens were used?
Next state: what memory looks like after
Over time, this builds a dataset. You can export it and train a policy that learns from real outcomes — not just rules someone wrote upfront.

The current policy is a baseline
baseline_v0 is rule-based. Hard suppression for expired or deleted cells. Quarantine checks before anything sensitive gets used. Token budget enforcement by priority. It works, but it's static.

The interesting question — the one I'm working toward — is what a learned policy looks like. One that's been trained on real agent sessions and knows, from data, which memory decisions actually lead to better outcomes.

The repo exports completed transitions at /v1/rl/dataset. The format is clean enough to plug into DQN, behavioral cloning, or anything else. That's the long-term goal.

It runs locally, no SaaS, no vendor
Everything is SQLite. The API is plain HTTP. You can run it in a Docker container, on a laptop, inside a CI environment. No data leaves your machine.

The test suite has 42 tests covering the full API — policy enforcement, reward computation, RL transitions, dataset export. They all pass without an API key.

pytest tests/ -q
42 passed in 1.83s
I'm still early on this. The hardest part isn't the API — it's figuring out the right reward function for memory decisions, and whether the learned policy actually generalizes across different agent workloads.

If you're building agents with real memory requirements and have run into this problem, I'd genuinely like to talk. Drop a comment or open an issue.

→ github.com/adu3110/memcell-rl