DEV Community

Cover image for One prompt, three memories: building a memory-governance agent with Qwen
Ghosty.AI
Ghosty.AI

Posted on

One prompt, three memories: building a memory-governance agent with Qwen

I gave the same caregiver prompt to my app three times and got three very different answers.

"What's the plan for tomorrow's clinic visit?"

Answer 1 — No Memory. Safe, and useless. It can't tell me the appointment time, who's driving, or what to bring, because it knows nothing about this family.

Answer 2 — Raw Memory. Detailed, and unsafe. I dumped every stored note into the prompt, so the model happily surfaced a routine we dropped weeks ago, let contradictions slip through, and echoed a private insurance ID straight into the reply.

Answer 3 — ERINYS + Qwen. Detailed and safe. Only governed context reached the model, every memory that made it in came with a stated reason, and the three private identifiers I planted never appeared.

Same model. Same data. The only variable was which memories were allowed to reach the prompt. That's the whole project.

This is my Track 1: MemoryAgent submission for the Global AI Hackathon Series with Qwen Cloud. It's a hackathon demo on synthetic family-care data — no real patient data anywhere.

Memory isn't storage

The tempting fix for "the agent forgot" is a bigger context window. But a care assistant that runs for months doesn't have a capacity problem, it has a trust problem. Its memory fills up with current plans, stale routines, contradictions, and private identifiers all mixed together. Raw Memory above is the failure mode made concrete: dump it all in and you get a confident, detailed, wrong, leaky answer.

So I stopped treating memory as a store and started treating it as a decision layer. Before anything reaches Qwen, something has to decide which memories are trustworthy right now — and say why. That "something" is ERINYS.

ERINYS governs memory. Qwen generates the answer.

The four decision states

The heart of the app is a deterministic policy that reads six signals off each memory — sensitivity, staleness, conflict, importance, recency, relevance — and sorts it into one of four states, each carrying a stated reason:

  • selected — trusted and relevant; goes into the prompt.
  • conflicted — contradicts another memory; flagged instead of silently included.
  • demoted — real but low-value right now (stale or off-topic); kept out to reduce noise.
  • blocked — must not reach generation; e.g. a sensitive identifier.

The two failures from the opening map straight onto these states. The stale morning routine gets demoted because a newer memory supersedes it, so it never competes for space in the prompt. The insurance number gets blocked on sensitivity, so it never reaches the model at all.

In the demo I seed three synthetic private IDs — SYNTH-INSURANCE-9001, SYNTH-PORTAL-4420, SYNTH-DOOR-1122. In Raw Memory they leak into the answer. Under ERINYS they land in blocked, and 0 leaked in the governed reply.

The governed prompt does end up smaller than the raw one, but that's a side effect, not the point. The win is governance, not token trimming. A shorter prompt that still leaked an ID would be a failure.

Why deterministic, on purpose

I could have asked a model to decide what's safe to remember. I chose not to — and I want to be clear that this is a design choice, not a limitation I'm dressing up.

The policy doesn't learn and it doesn't reason. It applies fixed rules to those six signals and emits a reproducible reason for every decision. For medical-shaped memory, that property matters more than cleverness: a reviewer can open the audit trail and see exactly why a memory reached the prompt or was kept out — and get the same verdict on a re-run. "The model decided" is not an answer you want to give when a private identifier leaks.

That's also my honest answer to "is this really an agent?" It's two agents with one contract: ERINYS is the memory-governance agent, Qwen is the generation agent. ERINYS selects, demotes, and blocks before generation; Qwen writes the reply from what survives.

The policy is content-agnostic, too. Swap the seed memories and the same four-state machine applies to another domain. You don't have to take my word for it — the live app lets you save your own care memory and rerun, so governance runs against your data, not just my seed set.

Building on Alibaba Cloud + DashScope

I kept the stack deliberately boring so the governance logic is the only interesting part:

  • Backend: Python standard-library HTTP server. No web framework, just JSON endpoints.
  • Frontend: vanilla HTML/CSS/JS.
  • LLM: Qwen Cloud qwen3.7-plus through the DashScope OpenAI-compatible endpoint.
  • Deploy: Docker on Alibaba Cloud ECS (Singapore, ap-southeast-1).

The easy part was Qwen itself. The OpenAI-compatible endpoint meant no bespoke client — just a base URL and a key, and the three modes were calling real Qwen.

The fiddly part was making the demo honest under every condition. When a key is configured, all three modes call real Qwen, so the contrast is real generation, not a staged screenshot. Without a key, a deterministic fallback returns the same governance trace, so the four-state decisions stay visible and the demo never hard-fails on a judge's laptop. pytest covers the policy states, so I know a blocked stays blocked. Getting those two paths to agree took more care than the model call did.

Honest limitations, and what's next

The one I want to state plainly: there is no automatic PII detection on free-text input. The seed memories are pre-labeled, and a memory you save in the app defaults to not-sensitive. So the blocked state works because the sensitivity signal is already attached — not because the app read raw text and figured out it was a portal login. Automatic detection of sensitive content in arbitrary input is future work, and it's the next thing I'd build.

Other next steps: richer conflict resolution (right now conflicted flags rather than reconciles), and per-domain policy tuning so the six-signal thresholds can shift by use case.

Try it

Give it the same prompt three ways and watch the third answer stay both detailed and safe — the insurance number vanishing between Raw and Governed. That contrast is the whole pitch.

ERINYS governs memory. Qwen generates the answer.

Top comments (0)