
Li-Hsuan Lung

Posted on • Originally published at blog.projectbrain.tools

Memory Curation — Keeping the Knowledge Base Honest

The idea I could never get my team to follow

I have always loved the concept of Architecture Decision Records.

The idea is simple: whenever your team makes a non-obvious technical decision, you write a short document. The decision, the context, the alternatives you considered, and why you chose what you chose. You commit it to the repository alongside the code. Future teammates can read it and understand not just what was built, but why.

It is a great idea in theory. But I could never get anyone to actually do it consistently, including myself.

When the decision is fresh in your head, writing it down feels like overhead. When you are under deadline pressure, the ADR file seems like the first thing to skip. By the time the decision feels worth documenting, you have forgotten half the context. And then a new engineer joins, or you revisit the codebase six months later, and you are left reading code with no memory of the reasoning behind it.

ProjectBrain's knowledge base is, at its core, an attempt to make this idea stick — and to extend it beyond just architecture decisions.

Three types of knowledge

ProjectBrain stores three types of knowledge entries:

Decisions are the direct heir of the ADR concept. A decision captures a non-obvious choice, the rationale behind it, and the alternatives that were rejected. The rationale is the most valuable part — it is the part that disappears fastest from human memory and git history.

Here are a few real decisions in our own project's knowledge base:

Adopted Gherkin-style (Given/When/Then) structure for static LLM prompts
Gherkin-style prompts (Given/When/Then) provide a more deterministic structure for LLMs, minimizing ambiguity by clearly separating persona (Given), triggers (When), and expected behavior (Then). Positive framing and scenario isolation have been established as best practices to improve LLM adherence to instructions.

Memory curator v1 is non-destructive
The curator should generate recommendations (refresh, supersede, merge, archive) but should not auto-mutate memory records in v1. Explicit review/resolution preserves safety and auditability.

LLM semantic pass is the centerpiece of the memory curator
The rule-based pass (title normalization, staleness, supersession) acts only as a cheap pre-filter to surface candidates. The LLM pass is the primary signal: it scores semantic duplicate pairs, detects quality issues, and provides the reasoning needed to make recommendations actionable.

Facts are verifiable truths about the project, environment, or system. They have a shorter half-life than decisions — configurations change, services get renamed, constraints shift. A fact that was true last quarter may be silently wrong today.

A few real facts from our knowledge base:

Render render.yaml: dockerContext is relative to rootDir, not repo root
When a service has rootDir set, dockerContext and dockerfilePath are resolved relative to rootDir — not the repo root. Setting dockerContext: ./curator with rootDir: ./curator produces the path curator/curator (not found). The correct value is dockerContext: . when you want the rootDir itself as the build context.
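Under that rule, a service rooted in a subdirectory would be configured like this (a minimal sketch — the service name and Dockerfile location are assumptions, not taken from our actual config):

```yaml
services:
  - type: web
    name: curator                 # hypothetical service name
    rootDir: ./curator
    dockerfilePath: ./Dockerfile  # resolved relative to rootDir
    dockerContext: .              # rootDir itself is the build context
```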

MCP responses support three modes: human, json, both
All tools accept a response_mode parameter ("human" | "json" | "both", default "human"). Human mode returns readable markdown. JSON mode returns a structured envelope: {ok, data, error, meta: {tool, response_mode, query?}}. Both mode returns human text followed by "---" and the JSON envelope. Validate response_mode early via _validate_response_mode() and return the error string if invalid. Always return a string — MCP protocol requirement.

Alembic env.py wraps all migrations in one transaction — a single failure rolls back all

Skills are reusable procedures — the "how we do this here" knowledge that never makes it into a README. Setup guides, debugging playbooks, deployment checklists. The kind of knowledge that lives in a senior engineer's head and gets re-explained to every new teammate.

The combination creates something more complete than ADRs alone: a living record of what is true, what was decided and why, and how things are done.

The Knowledge tab, showing a project's decisions and facts. Each entry includes a title and body. Agents and humans write to this store throughout their work sessions.

Agents write without friction

The original ADR problem was that writing felt like a burden. ProjectBrain removes that friction almost entirely for agents — they log knowledge as a natural side effect of doing work. When an agent resolves a bug, it logs the root cause as a fact. When it makes an architectural choice, it logs a decision. When it figures out a deployment step, it logs a skill.

Humans can do the same, and the UI makes it fast. But the real leverage is that agents do it continuously, in the background, without needing to be reminded.

This solves the write problem. But it creates a different one.

The cost of stale memory

When an AI agent reads from a knowledge base, it treats what it finds as ground truth. It does not apply skepticism the way a senior engineer would when stumbling on an old wiki page. It reasons from what it is given.

That is fine when the knowledge base is accurate. It is a serious problem when it is not.

Stale context compounds quietly. An agent reads an old fact about a database schema that was changed three months ago. It proceeds to write a migration against the wrong table structure. Another agent reads a superseded decision about an API design and implements a pattern the team moved away from weeks earlier. The work looks correct on the surface. The errors only surface in review — or in production.

This is worse than no documentation. A missing fact causes the agent to ask a question or make an assumption it flags. A wrong fact causes it to proceed confidently in the wrong direction.

The problem gets worse as the knowledge base grows. More entries means more signal to retrieve. But it also means more outdated entries that look authoritative. The noise is invisible. It is indistinguishable from good signal until something breaks.

How teams typically approach memory pruning

This is not a new problem. A few patterns have emerged, each with real drawbacks.

TTL-based expiry. Give each entry a maximum age. Simple to implement, but crude. A fact about your CI environment might be stale in a week. A foundational architectural decision might be valid for five years. Fixed TTLs either over-prune or under-prune.

Supersession tracking. New entries explicitly mark old ones as superseded. Clean and auditable, but it depends on the writer knowing what the new entry supersedes.

Human-in-the-loop review. Surface aged entries periodically and ask a human to confirm, update, or delete them. The most reliable method, but it does not scale. A team generating dozens of entries per week would spend all its time reviewing the queue.

None of these works well in isolation. The honest answer is that memory pruning requires a mix of strategies — automatic signals to surface candidates, semantic analysis to catch what rules miss, and human judgment for the cases where confidence is uncertain.

What the curator does

ProjectBrain's curator is our current attempt at that mix. We are actively learning what works, and the approach will evolve as we get more data from real usage.

The curator runs on a schedule, currently every 30 minutes, and on each pass it samples a window of recent knowledge entries, applies a rule-based filter as a cheap pre-pass, then sends candidates to an LLM for semantic analysis.
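In outline, a pass might look like the sketch below. Every name here is an illustrative assumption (`curator_pass`, the entry dict shape, the injected `llm_semantic_pass`), not ProjectBrain's actual API:

```python
def curator_pass(entries, llm_semantic_pass, max_age_days=90):
    """One scheduled pass over a window of recent knowledge entries.

    Rule hits are only a cheap pre-filter; the LLM pass is the primary
    signal and produces the actual recommendations.
    """
    candidates = []
    for entry in entries:
        reasons = []
        if not entry.get("body", "").strip():
            reasons.append("empty body")   # quality rule
        if entry.get("age_days", 0) > max_age_days:
            reasons.append("stale")        # staleness rule
        if reasons:
            candidates.append((entry, reasons))
    return llm_semantic_pass(candidates)
```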

The output is a queue of recommendations:

MERGE — two entries that cover the same ground. Duplicates happen often: one agent logs a fact at the end of a session, another logs the same fact at the start of the next. Or two team members capture the same architectural decision independently after a long discussion.

FLAG — a single entry with a quality problem. A decision with no rationale. A skill with a title but no steps. A fact that references something that no longer exists.

Humans or agents review the queue and act: accept a merge, edit or remove a flagged entry, or dismiss the recommendation if it was wrong.
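The queue entries can be modeled as simple records. A sketch with assumed field names (none of this is ProjectBrain's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    action: str            # "MERGE" or "FLAG"
    entity_ids: list[str]  # two ids for a MERGE, one for a FLAG
    confidence: float      # 0.0-1.0, from the LLM pass
    reason: str            # one-sentence explanation from the model
    status: str = "pending"

def resolve(rec: Recommendation, decision: str) -> Recommendation:
    """Reviewers accept or dismiss; the curator never auto-applies."""
    if decision not in ("accepted", "dismissed"):
        raise ValueError(f"unknown decision: {decision}")
    rec.status = decision
    return rec
```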

The Memory Health tab showing a queue of pending curation recommendations. Each card shows the entry type (fact / decision / skill), recommendation action (MERGE or FLAG), confidence percentage, and the affected entry title.

For FLAG recommendations, the review card offers three actions: Delete, Edit, or Keep. The curator does not know whether a flagged entry should be deleted or just improved — it flags the problem and leaves the decision to the team.

The prompt

The curator's LLM pass sends records as a JSON array and asks for a structured response. The full system prompt:

```
Given you are a knowledge base curator for a software project management tool
And you review knowledge records (facts, decisions, skills) written by human team members and AI agents
When you receive a JSON array of records containing id, entity_type, title, and body
Then you must return a single JSON object with exactly two keys: "duplicates" and "quality_issues"
And you must output ONLY valid JSON without any markdown formatting or preamble

Scenario: Identifying duplicate records
Given two records have the same meaning but different wording
Then you must include them in the "duplicates" array
And each item must be formatted as:
  {
    "entity_a_id": "...",
    "entity_b_id": "...",
    "confidence": 0.0–1.0,
    "reason": "one sentence on why these are the same",
    "suggested_merged_body": "clean merged content combining the best of both"
  }
And you must only include pairs with confidence >= 0.75
And you must prefer false negatives over false positives
And you must treat two records on related but distinct topics as NOT duplicates

Scenario: Identifying quality issues
Given a record has a genuine quality problem such as:
  - Facts with no body and a title so vague it conveys nothing actionable
  - Decisions with no rationale (just a title, no explanation of why)
  - Skills with no steps or procedure (title only, or body is a single vague sentence)
Then you must flag it in the "quality_issues" array
And each item must be formatted as:
  {
    "entity_id": "...",
    "severity": "low" | "medium" | "high",
    "issue": "one sentence on the specific problem"
  }
```

A couple of design choices worth noting:

Prefer false negatives over false positives on duplicates. A missed duplicate is low cost — you can find it on the next pass. A false positive that merges two distinct entries destroys information and erodes trust.

Suggest the merged body. For duplicate pairs, the model proposes a merged version combining the best of both entries. This gives the reviewer something to work from rather than asking them to write a new entry from scratch.
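Both choices show up when handling the model's reply. A minimal parsing sketch (`parse_curator_response` is a hypothetical name; the JSON shape and the 0.75 threshold come from the prompt above):

```python
import json

def parse_curator_response(raw: str, min_confidence: float = 0.75):
    """Parse the model's JSON reply and drop low-confidence duplicate pairs.

    Dropping pairs below the threshold is the 'prefer false negatives'
    choice: a missed duplicate resurfaces on the next pass, while a bad
    merge destroys information.
    """
    parsed = json.loads(raw)  # the prompt demands bare JSON, no markdown
    duplicates = [d for d in parsed.get("duplicates", [])
                  if d.get("confidence", 0.0) >= min_confidence]
    quality_issues = parsed.get("quality_issues", [])
    return duplicates, quality_issues
```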

Tuning curation behavior

The curator's behavior is configurable per project. You can set a confidence threshold — recommendations below it are suppressed entirely — and a freshness window to flag entries that have not been reviewed within a certain period.

The Memory Curation settings panel, showing the confidence threshold slider and the freshness window input for controlling how aggressively the curator flags entries.

Both settings address the same underlying tradeoff: how much noise you are willing to see in exchange for catching more real problems. A team with a high-churn knowledge base might lower the threshold and accept more false positives. A smaller, stable team might raise it to keep the queue quiet.
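Applied together, the two knobs reduce to a simple filter. A sketch with assumed shapes (recommendations carry a `confidence` float, entries a `last_reviewed` timestamp — both names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def apply_project_settings(recs, entries, threshold=0.8, freshness_days=60):
    """Suppress low-confidence recommendations; surface unreviewed entries.

    A higher threshold keeps the queue quiet; a shorter freshness window
    flags more entries for review. Both trade noise for coverage.
    """
    now = datetime.now(timezone.utc)
    visible = [r for r in recs if r["confidence"] >= threshold]
    stale = [e for e in entries
             if now - e["last_reviewed"] > timedelta(days=freshness_days)]
    return visible, stale
```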

What the curator is not

The curator does not auto-apply changes. It does not delete entries or rewrite them without review. All recommendations require confirmation.

We considered auto-merging obvious duplicates, but the false positive cost is too high. Two entries that look nearly identical might cover different contexts. The review step is fast and the downside of getting it wrong is not.

The curator stays in a supporting role. It surfaces candidates. The team decides.

Closing

ADRs work when teams follow them. The hard part has never been the format. It has been the habit.

What ProjectBrain tries to do is make that habit automatic: log continuously as a side effect of work, and let a background process handle the maintenance. The knowledge base stays roughly honest without requiring anyone to remember to tend it.

We are still figuring out the right balance — what to flag, how aggressively to deduplicate, when to trust the LLM's judgment and when to be more conservative. If you are building systems where agents produce persistent memory, the write problem is easy. Plan for curation from the start, and expect to keep tuning it.
