<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Li-Hsuan Lung</title>
    <description>The latest articles on DEV Community by Li-Hsuan Lung (@lihsuanlung).</description>
    <link>https://dev.to/lihsuanlung</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821511%2Fb055c88c-86bd-4246-a48a-e6c7221c65c5.png</url>
      <title>DEV Community: Li-Hsuan Lung</title>
      <link>https://dev.to/lihsuanlung</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lihsuanlung"/>
    <language>en</language>
    <item>
      <title>Semantic Search — How ProjectBrain Finds What You Mean</title>
      <dc:creator>Li-Hsuan Lung</dc:creator>
      <pubDate>Fri, 27 Mar 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/lihsuanlung/semantic-search-how-projectbrain-finds-what-you-mean-325p</link>
      <guid>https://dev.to/lihsuanlung/semantic-search-how-projectbrain-finds-what-you-mean-325p</guid>
      <description>&lt;h2&gt;
  
  
  The filing cabinet problem
&lt;/h2&gt;

&lt;p&gt;Imagine your project's knowledge base as a massive library with thousands of books, each one containing facts, decisions, and lessons learned by your team. The challenge? There’s no universal catalog. Every book is shelved by whatever label the author thought made sense at the time.&lt;/p&gt;

&lt;p&gt;When you need to find something, you rarely remember the exact phrase that was used. You search for "token expiration" and miss the entry titled "auth session handling." You search for "rate limit" and miss the fact logged as "API throttling ceiling is 1000 req/min." The answer is there. You just can’t reach it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How most search works — and where it falls short
&lt;/h2&gt;

&lt;p&gt;Most search systems operate on exact word matching. The technical term is &lt;em&gt;lexical search&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple: take the words in your query, find documents that contain those words, and rank them by how often the words appear.&lt;/p&gt;

&lt;p&gt;If you search for "rate limit," you get back entries that literally contain the words "rate" and "limit." If someone logged a fact called "API throttling ceiling is 1000 requests per minute," you won't find it — even though it's exactly what you were looking for.&lt;/p&gt;

&lt;p&gt;Lexical search has real strengths. It's fast, reliable, and perfect for exact identifiers. If you need to find a specific ticket number, an error code, or a function name, word-matching is what you want.&lt;/p&gt;

&lt;p&gt;But for a knowledge base full of human-authored notes, decisions, and procedures, literal word matching misses half the content.&lt;/p&gt;
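&lt;p&gt;To make the failure mode concrete, here is a toy sketch of word-overlap scoring. This is not any real search engine's implementation; the function and the example scores are purely illustrative:&lt;/p&gt;

```python
# Toy sketch of lexical (word-overlap) scoring -- purely illustrative,
# not a real search engine's ranking function.
def lexical_score(query: str, document: str) -> int:
    """Count how many query words appear in the document."""
    doc_words = set(document.lower().split())
    return sum(word in doc_words for word in query.lower().split())

fact = "API throttling ceiling is 1000 requests per minute"
print(lexical_score("rate limit", fact))           # 0: no shared words
print(lexical_score("requests per minute", fact))  # 3: exact words match
```

&lt;p&gt;The fact is exactly what the "rate limit" query is looking for, but with zero shared vocabulary the score is zero.&lt;/p&gt;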

&lt;h2&gt;
  
  
  A different approach: search by meaning
&lt;/h2&gt;

&lt;p&gt;In recent years, semantic search powered by vector embeddings has become accessible and practical for most teams.&lt;/p&gt;

&lt;p&gt;Here is the idea. Modern AI models can read a piece of text and produce a numerical fingerprint — a list of hundreds of numbers that represents the &lt;em&gt;meaning&lt;/em&gt; of the text. Similar meanings produce similar fingerprints. Different meanings produce very different ones.&lt;/p&gt;

&lt;p&gt;When you store a fact in ProjectBrain, we run it through OpenAI's embedding model and save this numerical fingerprint alongside the text. When you search, we fingerprint your query the same way. Then we find the stored entries whose fingerprints are most similar to yours.&lt;/p&gt;

&lt;p&gt;Because the fingerprints encode meaning rather than words, this works even when the vocabulary is completely different. "Rate limit," "API throttling ceiling," and "maximum requests per minute" all point to the same region in meaning-space. The search finds all of them.&lt;/p&gt;
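&lt;p&gt;A toy sketch of the idea, with made-up three-dimensional "fingerprints" standing in for real embeddings (which have hundreds of dimensions) and cosine similarity as the distance measure:&lt;/p&gt;

```python
import math

# Toy sketch: comparing "fingerprints" (embedding vectors) by cosine
# similarity. Real embeddings have hundreds of dimensions; these
# three-dimensional vectors are made up for illustration.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

rate_limit   = [0.8, 0.1, 0.2]  # "rate limit"
throttling   = [0.7, 0.2, 0.2]  # "API throttling ceiling"
deploy_guide = [0.1, 0.9, 0.3]  # "deployment checklist"

# Synonymous phrases land close together; unrelated text does not.
print(cosine_similarity(rate_limit, throttling))    # high (about 0.99)
print(cosine_similarity(rate_limit, deploy_guide))  # low (about 0.29)
```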

&lt;p&gt;Here's a real example from our own knowledge base. We logged this fact:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Docker test stage must reset ENTRYPOINT inherited from production stage&lt;/strong&gt;&lt;br&gt;
When a Dockerfile test stage extends a production base stage that sets ENTRYPOINT, the test stage inherits it. This causes docker compose run to pass the test command as arguments to the production entrypoint instead of executing it directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you search for &lt;em&gt;"run tests locally docker compose"&lt;/em&gt;, a lexical search on that query finds it because "docker" and "compose" appear in the title. But if you search for &lt;em&gt;"test container starts server instead of running pytest"&lt;/em&gt; — which is the actual symptom someone debugging this would type — a lexical search finds nothing. Semantic search finds it immediately, because the meaning of those two descriptions is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with semantic-only search
&lt;/h2&gt;

&lt;p&gt;Semantic search sounds perfect. Why not just use it for everything?&lt;/p&gt;

&lt;p&gt;Because it has its own blind spots.&lt;/p&gt;

&lt;p&gt;Semantic search relies on your embeddings being up to date. A newly added entry needs to be indexed before it can be found. And the embedding model sometimes misses on very technical content — exact identifiers, version numbers, and project-specific abbreviations that have no semantic neighborhood in the training data.&lt;/p&gt;

&lt;p&gt;If someone on our team logged a fact about migration revision &lt;code&gt;053_task_id_facts_skills&lt;/code&gt;, a semantic search for that exact string might rank it lower than other migration-related entries. Lexical search would nail it immediately.&lt;/p&gt;

&lt;p&gt;The two approaches are genuinely complementary.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we combined them
&lt;/h2&gt;

&lt;p&gt;ProjectBrain's search uses both — and then ranks the combined results using four signals. The weights below were tuned empirically against real search sessions on our own knowledge base:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic similarity (55%)&lt;/strong&gt; is the dominant factor when embeddings are available. It captures meaning, synonyms, and conceptual proximity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lexical overlap (25%)&lt;/strong&gt; handles exact matches — identifiers, code snippets, specific error messages. This is our Elasticsearch-style fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recency (15%)&lt;/strong&gt; gives newer entries a boost. A fact logged last week is more likely to be current than one from six months ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task linkage (5%)&lt;/strong&gt; is a small tiebreaker: entries linked to specific tasks in the project rank slightly higher than general, free-floating knowledge.&lt;/p&gt;
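&lt;p&gt;As a sketch, the blend looks something like this. The weights are the ones above; the per-signal scores (each normalized to 0 to 1) are placeholders, since the real scoring functions are not shown here:&lt;/p&gt;

```python
# Sketch of the four-signal blend described above. The weights match the
# post; the individual signal scores (each normalized to 0..1) are
# placeholders -- the real scoring functions are not shown here.
WEIGHTS = {"semantic": 0.55, "lexical": 0.25, "recency": 0.15, "task": 0.05}

def combined_score(signals: dict[str, float]) -> float:
    """Weighted sum of the available signals; missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

entry = {"semantic": 0.72, "lexical": 0.10, "recency": 0.60, "task": 1.0}
print(round(combined_score(entry), 2))  # 0.56
```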

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here are two real searches we ran against ProjectBrain's own knowledge base after building this feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query: "run tests locally docker compose"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Entry&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Run tests locally using docker compose (matches CI)&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Containerise CI test runs with docker compose&lt;/td&gt;
&lt;td&gt;Decision&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Docker test stage must reset ENTRYPOINT inherited from production stage&lt;/td&gt;
&lt;td&gt;Fact&lt;/td&gt;
&lt;td&gt;0.52&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The top three results are exactly the three entries we logged earlier that day. They weren't the most recent entries in the system, and they didn't use the same phrasing as the query. But they matched the meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query: "git hooks enforce lint before push"&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Entry&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Store git hooks in .githooks/ and activate via core.hooksPath&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 10-point gap to the next result. No other entry came close.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why transparency matters
&lt;/h2&gt;

&lt;p&gt;One thing we were careful about: every search result includes a score breakdown. You can see exactly how much of the score came from semantic similarity, lexical overlap, recency, and task linkage.&lt;/p&gt;

&lt;p&gt;This matters for a couple of reasons.&lt;/p&gt;

&lt;p&gt;First, it builds trust. When an agent retrieves knowledge and acts on it, you want to understand why that entry was selected. "Semantic similarity: 72%, also linked to the current task" is a lot more trustworthy than "it came from the search."&lt;/p&gt;

&lt;p&gt;Second, it makes the system debuggable. If a result that should rank first is coming in third, the breakdown tells you exactly which signal is dragging it down. Maybe the entry is old and needs refreshing. That's a fixable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for agents
&lt;/h2&gt;

&lt;p&gt;For AI agents working through ProjectBrain, the search improvement has a direct effect on session startup quality.&lt;/p&gt;

&lt;p&gt;When an agent begins a session with an intent — say, "implement the new billing flow" — the context tool now runs a semantic search behind the scenes. Instead of returning the most recently logged entries, it returns the entries most relevant to &lt;em&gt;billing&lt;/em&gt;: the rate limit facts, the payment gateway decisions, the deployment skill for this service.&lt;/p&gt;

&lt;p&gt;The agent starts with the right context instead of the most recent context. In practice, that means fewer cases of an agent re-discovering something the team already knew, and fewer cases of contradicting a decision that was logged months ago.&lt;/p&gt;




&lt;p&gt;If you're already using ProjectBrain, your existing knowledge base is already indexed. The next agent session you run will pull in the most contextually relevant entries for whatever it's working on — not the most recent ones, the most relevant ones. You don't need to do anything.&lt;/p&gt;

&lt;p&gt;If you're not yet using ProjectBrain, &lt;a href="https://app.projectbrain.tools" rel="noopener noreferrer"&gt;get started here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Memory Curation — Keeping the Knowledge Base Honest</title>
      <dc:creator>Li-Hsuan Lung</dc:creator>
      <pubDate>Sun, 22 Mar 2026 17:19:32 +0000</pubDate>
      <link>https://dev.to/lihsuanlung/memory-curation-keeping-the-knowledge-base-honest-29fh</link>
      <guid>https://dev.to/lihsuanlung/memory-curation-keeping-the-knowledge-base-honest-29fh</guid>
      <description>&lt;h2&gt;
  
  
  The idea I could never get my team to follow
&lt;/h2&gt;

&lt;p&gt;I have always loved the concept of Architecture Decision Records.&lt;/p&gt;

&lt;p&gt;The idea is simple: whenever your team makes a non-obvious technical decision, you write a short document. The decision, the context, the alternatives you considered, and why you chose what you chose. You commit it to the repository alongside the code. Future teammates can read it and understand not just what was built, but why.&lt;/p&gt;

&lt;p&gt;It is a great idea in theory. But I could never get anyone to actually do it consistently, including myself.&lt;/p&gt;

&lt;p&gt;When the decision is fresh in your head, writing it down feels like overhead. When you are under deadline pressure, the ADR file seems like the first thing to skip. By the time the decision feels worth documenting, you have forgotten half the context. And then a new engineer joins, or you revisit the codebase six months later, and you are left reading code with no memory of the reasoning behind it.&lt;/p&gt;

&lt;p&gt;ProjectBrain's knowledge base is, at its core, an attempt to make this idea stick — and to extend it beyond just architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three types of knowledge
&lt;/h2&gt;

&lt;p&gt;ProjectBrain stores three types of knowledge entries:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decisions&lt;/strong&gt; are the direct heir of the ADR concept. A decision captures a non-obvious choice, the rationale behind it, and the alternatives that were rejected. The rationale is the most valuable part — it is the part that disappears fastest from human memory and git history.&lt;/p&gt;

&lt;p&gt;Here are a few real decisions in our own project's knowledge base:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Adopted Gherkin-style (Given/When/Then) structure for static LLM prompts&lt;/strong&gt;&lt;br&gt;
Gherkin-style prompts (Given/When/Then) provide a more deterministic structure for LLMs, minimizing ambiguity by clearly separating persona (Given), triggers (When), and expected behavior (Then). Positive framing and scenario isolation have been established as best practices to improve LLM adherence to instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory curator v1 is non-destructive&lt;/strong&gt;&lt;br&gt;
The curator should generate recommendations (refresh, supersede, merge, archive) but should not auto-mutate memory records in v1. Explicit review/resolution preserves safety and auditability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM semantic pass is the centerpiece of the memory curator&lt;/strong&gt;&lt;br&gt;
The rule-based pass (title normalization, staleness, supersession) acts only as a cheap pre-filter to surface candidates. The LLM pass is the primary signal: it scores semantic duplicate pairs, detects quality issues, and provides the reasoning needed to make recommendations actionable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Facts&lt;/strong&gt; are verifiable truths about the project, environment, or system. They have a shorter half-life than decisions — configurations change, services get renamed, constraints shift. A fact that was true last quarter may be silently wrong today.&lt;/p&gt;

&lt;p&gt;A few real facts from our knowledge base:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Render render.yaml: dockerContext is relative to rootDir, not repo root&lt;/strong&gt;&lt;br&gt;
When a service has rootDir set, dockerContext and dockerfilePath are resolved relative to rootDir — not the repo root. Setting dockerContext: ./curator with rootDir: ./curator produces the path curator/curator (not found). The correct value is dockerContext: . when you want the rootDir itself as the build context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP responses support three modes: human, json, both&lt;/strong&gt;&lt;br&gt;
All tools accept a response_mode parameter ("human" | "json" | "both", default "human"). Human mode returns readable markdown. JSON mode returns a structured envelope: {ok, data, error, meta: {tool, response_mode, query?}}. Both mode returns human text followed by "---" and the JSON envelope. Validate response_mode early via _validate_response_mode() and return the error string if invalid. Always return a string — MCP protocol requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alembic env.py wraps all migrations in one transaction — a single failure rolls back all&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
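&lt;p&gt;That response-mode fact is concrete enough to sketch. The following is an illustrative reconstruction from the fact's own description, not ProjectBrain's actual code; the helper name is made up:&lt;/p&gt;

```python
import json

VALID_MODES = {"human", "json", "both"}

# Illustrative reconstruction of the envelope described in the fact
# above; the function name is made up, not ProjectBrain's actual code.
def build_response(tool: str, human_text: str, data: dict,
                   response_mode: str = "human") -> str:
    # Validate response_mode early; always return a string (MCP requirement).
    if response_mode not in VALID_MODES:
        return f"Error: invalid response_mode '{response_mode}'"
    envelope = json.dumps({
        "ok": True, "data": data, "error": None,
        "meta": {"tool": tool, "response_mode": response_mode},
    })
    if response_mode == "json":
        return envelope
    if response_mode == "both":
        return human_text + "\n---\n" + envelope
    return human_text  # "human" (default): readable markdown only
```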

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt; are reusable procedures — the "how we do this here" knowledge that never makes it into a README. Setup guides, debugging playbooks, deployment checklists. The kind of knowledge that lives in a senior engineer's head and gets re-explained to every new teammate.&lt;/p&gt;

&lt;p&gt;The combination creates something more complete than ADRs alone: a living record of what is true, what was decided and why, and how things are done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7mpf8458hbaqczw8zsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7mpf8458hbaqczw8zsc.png" alt="The Knowledge tab, showing a project's decisions and facts. Each entry includes a title and body. Agents and humans write to this store throughout their work sessions." width="800" height="793"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents write without friction
&lt;/h2&gt;

&lt;p&gt;The original ADR problem was that writing felt like a burden. ProjectBrain removes that friction almost entirely for agents — they log knowledge as a natural side effect of doing work. When an agent resolves a bug, it logs the root cause as a fact. When it makes an architectural choice, it logs a decision. When it figures out a deployment step, it logs a skill.&lt;/p&gt;

&lt;p&gt;Humans can do the same, and the UI makes it fast. But the real leverage is that agents do it continuously, in the background, without needing to be reminded.&lt;/p&gt;

&lt;p&gt;This solves the write problem. But it creates a different one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of stale memory
&lt;/h2&gt;

&lt;p&gt;When an AI agent reads from a knowledge base, it treats what it finds as ground truth. It does not apply skepticism the way a senior engineer would when stumbling on an old wiki page. It reasons from what it is given.&lt;/p&gt;

&lt;p&gt;That is fine when the knowledge base is accurate. It is a serious problem when it is not.&lt;/p&gt;

&lt;p&gt;Stale context compounds quietly. An agent reads an old fact about a database schema that was changed three months ago. It proceeds to write a migration against the wrong table structure. Another agent reads a superseded decision about an API design and implements a pattern the team moved away from weeks earlier. The work looks correct on the surface. The errors only surface in review — or in production.&lt;/p&gt;

&lt;p&gt;This is worse than no documentation. A missing fact causes the agent to ask a question or make an assumption it flags. A wrong fact causes it to proceed confidently in the wrong direction.&lt;/p&gt;

&lt;p&gt;The problem gets worse as the knowledge base grows. More entries means more signal to retrieve. But it also means more outdated entries that look authoritative. The noise is invisible. It is indistinguishable from good signal until something breaks.&lt;/p&gt;

&lt;h3&gt;
  
  
  How teams typically approach memory pruning
&lt;/h3&gt;

&lt;p&gt;This is not a new problem. A few patterns have emerged, each with real drawbacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL-based expiry.&lt;/strong&gt; Give each entry a maximum age. Simple to implement, but crude. A fact about your CI environment might be stale in a week. A foundational architectural decision might be valid for five years. Fixed TTLs either over-prune or under-prune.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supersession tracking.&lt;/strong&gt; New entries explicitly mark old ones as superseded. Clean and auditable, but it depends on the writer knowing what the new entry supersedes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop review.&lt;/strong&gt; Surface aged entries periodically and ask a human to confirm, update, or delete them. The most reliable method, but it does not scale. A team generating dozens of entries per week would spend all its time reviewing the queue.&lt;/p&gt;

&lt;p&gt;None of these works well in isolation. The honest answer is that memory pruning requires a mix of strategies — automatic signals to surface candidates, semantic analysis to catch what rules miss, and human judgment for the cases where confidence is low.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the curator does
&lt;/h2&gt;

&lt;p&gt;ProjectBrain's curator is our current attempt at that mix. We are actively learning what works, and the approach will evolve as we get more data from real usage.&lt;/p&gt;

&lt;p&gt;The curator runs on a schedule, currently every 30 minutes, and on each pass it samples a window of recent knowledge entries, applies a rule-based filter as a cheap pre-pass, then sends candidates to an LLM for semantic analysis.&lt;/p&gt;
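&lt;p&gt;In outline, the two-pass flow looks something like this sketch. The rule checks and the LLM call are stand-ins; only the shape (a cheap pre-filter, then a semantic pass on the survivors) follows the description above:&lt;/p&gt;

```python
# Sketch of the two-pass curation flow described above. The rule checks
# and the LLM call are stand-ins; only the overall shape (cheap
# pre-filter, then a semantic pass on the survivors) mirrors the post.
def rule_based_prefilter(entries: list[dict]) -> list[dict]:
    """Cheap heuristics surface candidates: empty bodies, aged entries."""
    return [e for e in entries
            if not e.get("body") or e.get("age_days", 0) > 90]

def curate(entries: list[dict], llm_pass) -> list[dict]:
    """Run the pre-filter, then hand candidates to the LLM pass."""
    candidates = rule_based_prefilter(entries)
    return llm_pass(candidates) if candidates else []
```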

&lt;p&gt;The output is a queue of recommendations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MERGE&lt;/strong&gt; — two entries that cover the same ground. Duplicates happen often: one agent logs a fact at the end of a session, another logs the same fact at the start of the next. Or two team members capture the same architectural decision independently after a long discussion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FLAG&lt;/strong&gt; — a single entry with a quality problem. A decision with no rationale. A skill with a title but no steps. A fact that references something that no longer exists.&lt;/p&gt;

&lt;p&gt;Humans or agents review the queue and act: accept a merge, edit or remove a flagged entry, or dismiss the recommendation if it was wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="/screenshots/memory-health-merge.png" class="article-body-image-wrapper"&gt;&lt;img src="/screenshots/memory-health-merge.png" alt="The Memory Health tab showing a queue of pending curation recommendations. Each card shows the entry type (fact / decision / skill), recommendation action (MERGE or FLAG), confidence percentage, and the affected entry title."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For FLAG recommendations, the review card offers three actions: Delete, Edit, or Keep. The curator does not know whether a flagged entry should be deleted or just improved — it flags the problem and leaves the decision to the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompt
&lt;/h2&gt;

&lt;p&gt;The curator's LLM pass sends records as a JSON array and asks for a structured response. The full system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="nf"&gt;Given &lt;/span&gt;you are a knowledge base curator for a software project management tool
&lt;span class="nf"&gt;And &lt;/span&gt;you review knowledge records (facts, decisions, skills) written by human team members and AI agents
&lt;span class="nf"&gt;When &lt;/span&gt;you receive a JSON array of records containing id, entity_type, title, and body
&lt;span class="err"&gt;Then you must return a single JSON object with exactly two keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"duplicates"&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="err"&gt;"quality_issues"&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;you must output ONLY valid JSON without any markdown formatting or preamble

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Identifying duplicate records
&lt;span class="nf"&gt;Given &lt;/span&gt;two records have the same meaning but different wording
&lt;span class="nf"&gt;Then &lt;/span&gt;you must include them in the &lt;span class="s"&gt;"duplicates"&lt;/span&gt; array
&lt;span class="err"&gt;And each item must be formatted as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="err"&gt;{&lt;/span&gt;
    &lt;span class="err"&gt;"entity_a_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"...",&lt;/span&gt;
    &lt;span class="err"&gt;"entity_b_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"...",&lt;/span&gt;
    &lt;span class="err"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;0.0–1.0,&lt;/span&gt;
    &lt;span class="err"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"one&lt;/span&gt; &lt;span class="err"&gt;sentence&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="err"&gt;why&lt;/span&gt; &lt;span class="err"&gt;these&lt;/span&gt; &lt;span class="err"&gt;are&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;same",&lt;/span&gt;
    &lt;span class="err"&gt;"suggested_merged_body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"cle&lt;/span&gt;&lt;span class="nf"&gt;an &lt;/span&gt;merged content combining the best of both"
  &lt;span class="err"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;And &lt;/span&gt;you must only include pairs with confidence &amp;gt;= 0.75
&lt;span class="nf"&gt;And &lt;/span&gt;you must prefer false negatives over false positives
&lt;span class="nf"&gt;And &lt;/span&gt;you must treat two records on related but distinct topics as NOT duplicates

&lt;span class="kn"&gt;Scenario&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Identifying quality issues
&lt;span class="err"&gt;Given a record has a genuine quality problem such as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Facts&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;no&lt;/span&gt; &lt;span class="err"&gt;body&lt;/span&gt; &lt;span class="err"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;a &lt;/span&gt;title so vague it conveys nothing actionable
  &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Decisions&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;no&lt;/span&gt; &lt;span class="err"&gt;rationale&lt;/span&gt; &lt;span class="err"&gt;(just&lt;/span&gt; &lt;span class="nf"&gt;a &lt;/span&gt;title, no explanation of why)
  &lt;span class="err"&gt;-&lt;/span&gt; &lt;span class="err"&gt;Skills&lt;/span&gt; &lt;span class="err"&gt;with&lt;/span&gt; &lt;span class="err"&gt;no&lt;/span&gt; &lt;span class="err"&gt;steps&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;procedure&lt;/span&gt; &lt;span class="err"&gt;(title&lt;/span&gt; &lt;span class="err"&gt;only,&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;body&lt;/span&gt; &lt;span class="err"&gt;is&lt;/span&gt; &lt;span class="nf"&gt;a &lt;/span&gt;single vague sentence)
&lt;span class="nf"&gt;Then &lt;/span&gt;you must flag it in the &lt;span class="s"&gt;"quality_issues"&lt;/span&gt; array
&lt;span class="err"&gt;And each item must be formatted as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="err"&gt;{&lt;/span&gt;
    &lt;span class="err"&gt;"entity_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"...",&lt;/span&gt;
    &lt;span class="err"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;"low"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;"medium"&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;"high",&lt;/span&gt;
    &lt;span class="n"&gt;"issue":&lt;/span&gt; &lt;span class="n"&gt;"one&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="n"&gt;—&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;specific&lt;/span&gt; &lt;span class="n"&gt;problem"&lt;/span&gt;
  &lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of design choices worth noting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prefer false negatives over false positives on duplicates.&lt;/strong&gt; A missed duplicate is low cost — you can find it on the next pass. A false positive that merges two distinct entries destroys information and erodes trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Suggest the merged body.&lt;/strong&gt; For duplicate pairs, the model proposes a merged version combining the best of both entries. This gives the reviewer something to work from rather than asking them to write a new entry from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning curation behavior
&lt;/h2&gt;

&lt;p&gt;The curator's behavior is configurable per project. You can set a confidence threshold — recommendations below it are suppressed entirely — and a freshness window to flag entries that have not been reviewed within a certain period.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9v225rh5k01ovl534k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9v225rh5k01ovl534k3.png" alt="The Memory Curation settings panel, showing the confidence threshold slider and the freshness window input for controlling how aggressively the curator flags entries." width="800" height="793"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both settings address the same underlying tradeoff: how much noise you are willing to see in exchange for catching more real problems. A team with a high-churn knowledge base might lower the threshold and accept more false positives. A smaller, stable team might raise it to keep the queue quiet.&lt;/p&gt;
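&lt;p&gt;A minimal sketch of how the two settings could gate the queue. The field and function names here are illustrative, not the actual settings schema:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Sketch of how the two settings could gate the curation queue. Field
# and function names are illustrative, not the actual settings schema.
def visible_recommendations(recs, confidence_threshold=0.75):
    """Suppress recommendations below the project's confidence threshold."""
    return [r for r in recs if r["confidence"] >= confidence_threshold]

def needs_freshness_review(last_reviewed: datetime, freshness_days: int) -> bool:
    """Flag entries not reviewed within the freshness window."""
    return datetime.utcnow() - last_reviewed > timedelta(days=freshness_days)
```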

&lt;h2&gt;
  
  
  What the curator is not
&lt;/h2&gt;

&lt;p&gt;The curator does not auto-apply changes. It does not delete entries or rewrite them without review. All recommendations require confirmation.&lt;/p&gt;

&lt;p&gt;We considered auto-merging obvious duplicates, but the false positive cost is too high. Two entries that look nearly identical might cover different contexts. The review step is fast and the downside of getting it wrong is not.&lt;/p&gt;

&lt;p&gt;The curator stays in a supporting role. It surfaces candidates. The team decides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;ADRs work when teams follow them. The hard part has never been the format. It has been the habit.&lt;/p&gt;

&lt;p&gt;What ProjectBrain tries to do is make that habit automatic: log continuously as a side effect of work, and let a background process handle the maintenance. The knowledge base stays roughly honest without requiring anyone to remember to tend it.&lt;/p&gt;

&lt;p&gt;We are still figuring out the right balance — what to flag, how aggressively to deduplicate, when to trust the LLM's judgment and when to be more conservative. If you are building systems where agents produce persistent memory, the write problem is the easy part. Plan for curation from the start, and expect to keep tuning it.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>productivity</category>
      <category>softwareengineering</category>
      <category>writing</category>
    </item>
    <item>
      <title>A Workflow Engine That Coordinates Work and Makes It Visible</title>
      <dc:creator>Li-Hsuan Lung</dc:creator>
      <pubDate>Wed, 18 Mar 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/lihsuanlung/a-workflow-engine-that-coordinates-work-and-makes-it-visible-3amg</link>
      <guid>https://dev.to/lihsuanlung/a-workflow-engine-that-coordinates-work-and-makes-it-visible-3amg</guid>
      <description>&lt;p&gt;&lt;strong&gt;"The future belongs to artificial intelligence."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ke Jie said this around the time of his 2017 match against AlphaGo. He was the world’s top-ranked Go player, and AlphaGo still swept the series 3-0 (games played on May 23, 25, and 27, 2017), leaving Ke Jie visibly emotional after the final game.&lt;/p&gt;

&lt;p&gt;That story matters to me because AI may end up far closer to solving software development than we expect. For a long time, top-level Go was treated as an especially hard frontier where human intuition was expected to hold out much longer. Then, suddenly, the gap closed. I want to treat software development with the same humility and learn from what the systems actually do, not from old assumptions.&lt;/p&gt;

&lt;p&gt;That is why this post is about workflow design and visibility.&lt;/p&gt;

&lt;p&gt;When people talk about agent workflows, they usually mean one thing: moving tasks from one stage to the next.&lt;/p&gt;

&lt;p&gt;In Project Brain, we are building the workflow engine around two goals at the same time: coordinate work reliably, and make agent behavior visible and explainable.&lt;/p&gt;

&lt;p&gt;If a task moved from "in progress" to "in review," we should know &lt;em&gt;how&lt;/em&gt; it moved, &lt;em&gt;why&lt;/em&gt; it moved, and &lt;em&gt;what assumptions&lt;/em&gt; were used during that handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is interesting about our workflow engine
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow is a real system object, not prompt text
&lt;/h3&gt;

&lt;p&gt;Our workflow is modeled directly in the platform as stages, statuses, and stage policies. That means teams can edit process behavior in the product itself, instead of hiding process rules inside long prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage policy makes behavior explicit
&lt;/h3&gt;

&lt;p&gt;Each stage can define what should happen after successful work: advance and delegate, advance only, terminal completion, or (optionally) reject work back to an earlier stage. In plain terms, we do not hardcode every route in the agent runtime. We store routing intent as workflow policy and execute against it.&lt;/p&gt;
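&lt;p&gt;To make that concrete, here is a minimal sketch of workflow-as-data. The policy names mirror the ones above (advance and delegate, advance only, terminal, reject), but the stage names and structures are illustrative, not Project Brain's actual schema.&lt;/p&gt;

```python
from enum import Enum

class OnSuccess(Enum):
    ADVANCE_AND_DELEGATE = "advance_and_delegate"
    ADVANCE = "advance"
    TERMINAL = "terminal"

# Routing intent lives in data, not in the agent runtime or a prompt.
WORKFLOW = {
    "plan":      {"next": "implement", "on_success": OnSuccess.ADVANCE_AND_DELEGATE},
    "implement": {"next": "review",    "on_success": OnSuccess.ADVANCE_AND_DELEGATE},
    "review":    {"next": "done",      "on_success": OnSuccess.ADVANCE,
                  "reject_to": "implement"},
    "done":      {"next": None,        "on_success": OnSuccess.TERMINAL},
}

def route(stage, outcome):
    """Return the next stage for completed or rejected work."""
    policy = WORKFLOW[stage]
    if outcome == "rejected" and "reject_to" in policy:
        return policy["reject_to"]
    if policy["on_success"] is OnSuccess.TERMINAL:
        return stage
    return policy["next"]
```

&lt;p&gt;Editing process behavior then means editing the &lt;code&gt;WORKFLOW&lt;/code&gt; table, not hunting through prompt text.&lt;/p&gt;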

&lt;h3&gt;
  
  
  Visibility is designed in, not added later
&lt;/h3&gt;

&lt;p&gt;We attach structured metadata to handoffs and status transitions so task history is reconstructable. The focus is not just "current status." The focus is also "execution trace."&lt;/p&gt;

&lt;p&gt;That trace is what helps teams improve prompts, policies, and role boundaries over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real examples from team chat (cross-stage communication)
&lt;/h2&gt;

&lt;p&gt;These are real excerpts from Project Brain team messages, showing planner/implementer/reviewer flow. The screenshot thread captures a full loop: blocker report, fix handoff, and approval.&lt;/p&gt;

&lt;p&gt;In the screenshots, the reviewer first moved the task back to in-progress with specific blockers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaoczknq8b4nhrresimf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiaoczknq8b4nhrresimf.png" alt="Reviewer feedback listing concrete blockers before re-review." width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementer then replied with a concrete fix list and commit hash:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rmd012lj8viehy6os2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rmd012lj8viehy6os2p.png" alt="Implementer response with commit hash and explicit fixes." width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the reviewer confirmed re-review and test outcome:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax9ud5t7xufx2f4rb2f2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax9ud5t7xufx2f4rb2f2.png" alt="Re-review approval confirming blockers are fixed and tests pass." width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is exactly the visibility model we want: not just final status, but the full reasoning chain from failure to resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this framing matters
&lt;/h2&gt;

&lt;p&gt;Agents are not programmed in the traditional deterministic sense. You do not write a fixed function and always get the same output. What you can do is influence behavior through constraints, context, and feedback loops. That is why workflow management is so interesting to me: it is a way to shape behavior reliably even when outputs are probabilistic.&lt;/p&gt;

&lt;p&gt;If AI systems can improve through iterative play and feedback loops, then our job is to build environments where those loops are observable, testable, and improvable, instead of hidden.&lt;/p&gt;

&lt;p&gt;That is exactly why we designed this workflow engine around both coordination and visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you are not using Project Brain: how to apply this anyway
&lt;/h2&gt;

&lt;p&gt;You can apply the same workflow principles in any stack (Jira + Slack, Linear + GitHub, custom tools, etc.).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Define stage outcomes explicitly.&lt;br&gt;&lt;br&gt;
For each stage, write what "done" means and what should happen next (advance, delegate, reject, or stop).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use machine-checkable transition guards.&lt;br&gt;&lt;br&gt;
Require expected state/version fields on status changes so race conditions become explicit conflicts instead of silent corruption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standardize handoff metadata.&lt;br&gt;&lt;br&gt;
At minimum: task ID, from-stage, to-stage, actor, and reason for handoff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Treat review feedback as structured data.&lt;br&gt;&lt;br&gt;
Capture blocker reason, fix commit, verification command, and verification result in one thread.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimize for replayability.&lt;br&gt;&lt;br&gt;
A new person (or agent) should be able to read a thread and answer: What happened? Why? What changed? Is it verified?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
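&lt;p&gt;Steps 2 and 3 above can be sketched together: an optimistic-concurrency transition guard plus standardized handoff metadata. The field names here are illustrative assumptions, not a fixed schema.&lt;/p&gt;

```python
import time

class ConflictError(Exception):
    pass

def transition(task, to_status, expected_version, actor, reason):
    """Apply a status change only if the caller saw the latest version."""
    if task["version"] != expected_version:
        # Someone else moved the task first: an explicit conflict,
        # not silent corruption.
        raise ConflictError(f"task {task['id']} is at v{task['version']}")
    handoff = {
        "task_id": task["id"],
        "from_stage": task["status"],
        "to_stage": to_status,
        "actor": actor,
        "reason": reason,
        "at": time.time(),
    }
    task["status"] = to_status
    task["version"] += 1
    task.setdefault("history", []).append(handoff)
    return handoff
```

&lt;p&gt;The appended &lt;code&gt;history&lt;/code&gt; records are what make step 5, replayability, possible: the thread of who moved what, and why, is reconstructable from data.&lt;/p&gt;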

&lt;p&gt;If you can do those five things consistently, you will get most of the value of workflow orchestration plus visibility, even outside Project Brain.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>softwaredevelopment</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>MCP Design in the Real World</title>
      <dc:creator>Li-Hsuan Lung</dc:creator>
      <pubDate>Sun, 15 Mar 2026 01:13:45 +0000</pubDate>
      <link>https://dev.to/lihsuanlung/mcp-design-in-the-real-world-3446</link>
      <guid>https://dev.to/lihsuanlung/mcp-design-in-the-real-world-3446</guid>
      <description>&lt;p&gt;When we started Project Brain's MCP server, we followed a common pattern: every new need got a new tool. It felt productive at first. But after a while, the tool menu became crowded, the rules around each tool grew, and the agent got slower at choosing what to call.&lt;/p&gt;

&lt;p&gt;That led to more retries, higher token usage, and slower progress on simple tasks. The big lesson was straightforward: after a certain point, adding more tools hurts more than it helps.&lt;/p&gt;

&lt;p&gt;In this post, I’ll share what changed for us and what we learned from running this in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation: why reduce the number of tools?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tool selection gets harder for the model
&lt;/h3&gt;

&lt;p&gt;Every new tool is another decision branch. Instead of focusing on the task, the model spends effort deciding which tool is "most correct," whether parameters are supported, and whether an older tool has been replaced by a newer one. Those extra decisions show up as wrong calls and wasted turns.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context gets bloated
&lt;/h3&gt;

&lt;p&gt;Each tool adds descriptions, argument rules, and examples. That all has to fit into model context. As context grows, signal gets diluted. Even a strong model performs worse when it has to sift through too many similar options.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Day-to-day operations get heavier
&lt;/h3&gt;

&lt;p&gt;Tool growth also creates maintenance overhead: more permission paths to secure, more old behavior to support, more telemetry to monitor, and more docs to keep aligned with reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real examples from Project Brain
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. We moved to a five-tool interface
&lt;/h3&gt;

&lt;p&gt;Instead of exposing many narrow tools, we grouped the public API into five domain entrypoints: &lt;code&gt;projects(...)&lt;/code&gt;, &lt;code&gt;context(...)&lt;/code&gt;, &lt;code&gt;tasks(...)&lt;/code&gt;, &lt;code&gt;knowledge(...)&lt;/code&gt;, and &lt;code&gt;collaboration(...)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This made the system easier to reason about. The agent picks the right domain first, then chooses an action inside that domain. Fewer top-level choices meant less routing confusion.&lt;/p&gt;
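&lt;p&gt;The shape of one such domain entrypoint, sketched minimally (the handler names and return values are placeholders, not our real API):&lt;/p&gt;

```python
# One dispatcher per domain; actions are resolved inside it,
# so the top-level tool count stays fixed at five.
TASK_ACTIONS = {
    "list": lambda **kw: {"ok": True, "data": []},
    "context": lambda **kw: {"ok": True, "data": {"task": kw.get("task_id")}},
}

def tasks(action, **params):
    """Single top-level entrypoint for the task domain."""
    handler = TASK_ACTIONS.get(action)
    if handler is None:
        return {"ok": False, "error": f"unknown action: {action}"}
    return handler(**params)
```

&lt;p&gt;New capability lands as a new entry in the action table, not as a new top-level tool the model must weigh on every turn.&lt;/p&gt;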

&lt;h3&gt;
  
  
  2. We added per-turn shortlist routing
&lt;/h3&gt;

&lt;p&gt;We introduced &lt;code&gt;context(action="shortlist", q, limit, full_tool_mode)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Think of it as: "for this user request, show me the best few tool-actions first." For example, if a user asks about milestone planning, shortlist pushes milestone-related operations to the top instead of making the model scan the full catalog every time.&lt;/p&gt;

&lt;p&gt;This improved first-call accuracy and reduced unnecessary context.&lt;/p&gt;
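&lt;p&gt;A toy version of shortlist routing, under the assumption of simple keyword overlap (a production system would more likely use embeddings; the catalog entries here are invented):&lt;/p&gt;

```python
# Map each tool-action to descriptive keywords, score against the
# request, and surface only the top few candidates.
CATALOG = {
    "tasks.list": "list filter search tasks status milestone",
    "tasks.update": "update status move task stage",
    "knowledge.create": "log fact decision knowledge",
}

def shortlist(query, limit=2):
    q = set(query.lower().split())
    scored = []
    for name, keywords in CATALOG.items():
        overlap = len(q.intersection(keywords.split()))
        if overlap:
            scored.append((overlap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]
```

&lt;p&gt;The model then reasons over two or three relevant candidates instead of the full catalog, which is where the first-call accuracy gain comes from.&lt;/p&gt;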

&lt;h3&gt;
  
  
  3. We made responses predictable for machines
&lt;/h3&gt;

&lt;p&gt;For key reads, we added &lt;code&gt;response_mode&lt;/code&gt; (&lt;code&gt;human | json | both&lt;/code&gt;). In JSON mode, the output always follows the same envelope: &lt;code&gt;ok&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;, &lt;code&gt;meta&lt;/code&gt;, &lt;code&gt;error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That simple consistency removed a lot of brittle text parsing and made automation much more reliable.&lt;/p&gt;
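&lt;p&gt;The envelope itself is almost trivially simple, which is the point. A sketch of the idea (our actual field contents differ; the four keys are what matters):&lt;/p&gt;

```python
def envelope(data=None, error=None, meta=None):
    # Every JSON-mode response has the same four keys,
    # so consumers never branch on shape, only on "ok".
    return {
        "ok": error is None,
        "data": data,
        "meta": meta or {},
        "error": error,
    }
```

&lt;p&gt;A client checks &lt;code&gt;ok&lt;/code&gt;, reads &lt;code&gt;data&lt;/code&gt; or &lt;code&gt;error&lt;/code&gt;, and never scrapes prose.&lt;/p&gt;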

&lt;h3&gt;
  
  
  4. We expanded one task listing tool instead of adding many search tools
&lt;/h3&gt;

&lt;p&gt;Rather than creating separate tools for each search style, we added richer filters to task listing with &lt;code&gt;q_any&lt;/code&gt;, &lt;code&gt;q_all&lt;/code&gt;, and &lt;code&gt;q_not&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In plain language, one call can now express "must include these terms, can include these terms, and exclude these terms." We got more power without growing top-level tool count.&lt;/p&gt;
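&lt;p&gt;The filter semantics can be sketched as a single predicate (a simplified stand-in for the real server-side query, using substring matching for brevity):&lt;/p&gt;

```python
def matches(text, q_any=None, q_all=None, q_not=None):
    """q_all: every term required. q_not: any term excludes.
    q_any: at least one term required, if given."""
    t = text.lower()
    if q_all and not all(term in t for term in q_all):
        return False
    if q_not and any(term in t for term in q_not):
        return False
    if q_any and not any(term in t for term in q_any):
        return False
    return True
```

&lt;p&gt;Three parameters on one listing call cover what would otherwise have been several near-duplicate search tools.&lt;/p&gt;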

&lt;h2&gt;
  
  
  Challenges you should expect in MCP design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Knowing which attributes are actually available
&lt;/h3&gt;

&lt;p&gt;If input rules are vague, agents guess. In our case, task queries used to fail or return partial results because fields looked plausible but were not actually supported.&lt;/p&gt;

&lt;p&gt;The fix was to make contracts explicit: clear filter fields, explicit response mode behavior, and stable output shape. Agents should not need to guess what inputs are valid or what outputs will look like.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Public listing behind auth token execution
&lt;/h3&gt;

&lt;p&gt;Discovery and execution are different security concerns. During MCP directory integration, we wanted clients to discover capabilities quickly, but we still needed strict auth for real data access.&lt;/p&gt;

&lt;p&gt;So we separated the two concerns in middleware: process auth headers and token validation for protected calls, while allowing a small public allowlist for low-risk discovery endpoints. In practice: show the menu publicly, lock the kitchen.&lt;/p&gt;
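&lt;p&gt;Stripped of framework details, the middleware decision reduces to a few lines. This is a sketch of the pattern, not our actual middleware; the paths are invented and &lt;code&gt;validate_token&lt;/code&gt; stands in for whatever auth check your stack uses.&lt;/p&gt;

```python
# Low-risk discovery endpoints that may be listed without credentials.
PUBLIC_ALLOWLIST = {"/mcp/capabilities", "/mcp/health"}

def authorize(path, token, validate_token):
    """Return True if the request may proceed."""
    if path in PUBLIC_ALLOWLIST:
        return True               # show the menu publicly
    return validate_token(token)  # lock the kitchen
```

&lt;p&gt;Keeping the allowlist small and explicit is what makes the boundary auditable: anything not on it requires a valid token, full stop.&lt;/p&gt;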

&lt;h3&gt;
  
  
  3. Backward compatibility and sprawl
&lt;/h3&gt;

&lt;p&gt;Every new top-level tool creates long-term support cost. Our early habit of adding tiny tools for each new query made routing and maintenance harder over time.&lt;/p&gt;

&lt;p&gt;A better pattern was to keep the top-level interface stable and grow capability inside existing domains via actions and parameters. Internal complexity can grow in code modules without forcing public API sprawl.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP design best practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep the public tool surface small and stable.&lt;/li&gt;
&lt;li&gt;Route with a shortlist when possible.&lt;/li&gt;
&lt;li&gt;Prefer extending existing tools over creating new top-level ones.&lt;/li&gt;
&lt;li&gt;Keep discovery and execution security boundaries separate.&lt;/li&gt;
&lt;li&gt;Make input and output contracts explicit.&lt;/li&gt;
&lt;li&gt;Return predictable machine-readable shapes.&lt;/li&gt;
&lt;li&gt;Prune low-value tools regularly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Good MCP design is not about exposing everything. It is about exposing the right minimum, clearly.&lt;/p&gt;

&lt;p&gt;If your agents feel flaky, reducing and clarifying your tool surface area is often the fastest way to improve reliability.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>mcp</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The side projects that accidentally started Project Brain</title>
      <dc:creator>Li-Hsuan Lung</dc:creator>
      <pubDate>Fri, 13 Mar 2026 05:56:31 +0000</pubDate>
      <link>https://dev.to/lihsuanlung/i-kept-losing-context-with-coding-agents-so-i-built-project-brain-1703</link>
      <guid>https://dev.to/lihsuanlung/i-kept-losing-context-with-coding-agents-so-i-built-project-brain-1703</guid>
      <description>&lt;p&gt;This started with two weekend projects: an AI-powered text adventure experiment and a 3D-printed shelf for my daughter's growing Tomica collection after our trip to Japan. Both sounded fun. Both turned into the same context-management problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern I kept hitting
&lt;/h2&gt;

&lt;p&gt;For each new feature, my agent would create more markdown files: TODO lists, feature notes, architecture plans, README updates, and more. At first it felt productive. Then it became its own maintenance project.&lt;/p&gt;

&lt;p&gt;Docs drifted from reality faster than I expected. I would see a TODO marked &lt;code&gt;in_progress&lt;/code&gt; even though the implementation had already shipped.&lt;/p&gt;

&lt;p&gt;When a session crashed, context crashed with it. The next run had no idea where things left off, so momentum disappeared right when I wanted to keep building.&lt;/p&gt;

&lt;p&gt;Switching tools meant retraining from scratch every time. Different interface, same project, full re-brief.&lt;/p&gt;

&lt;p&gt;That was the moment it clicked: the problem was not model quality. The problem was memory architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Yes, we are dogfooding Project Brain to build Project Brain
&lt;/h2&gt;

&lt;p&gt;We now run our own development through Project Brain. It is extremely meta and only slightly suspicious.&lt;/p&gt;

&lt;p&gt;I can swap models and tools without losing momentum because the project state lives in one place, not in whichever chat window happened to be open.&lt;/p&gt;

&lt;p&gt;Prompt I use when switching models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use Project Brain as your source of truth
1) context(action="session", project_id)
2) tasks(action="context", task_id)
3) tasks(action="list", project_id, status="in_progress")
4) knowledge(entity="fact", action="list", project_id)
5) knowledge(entity="decision", action="list", project_id)

Then summarize:
- current goal
- active tasks
- important constraints/facts
- key decisions and rationale
- immediate next steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Skills became a force multiplier
&lt;/h2&gt;

&lt;p&gt;One of my favorite side effects has been watching reusable workflows turn into skills. Instead of repeating instructions in every task, we publish them once and any agent can follow them. Two examples so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API key auth for service accounts (FastAPI)&lt;/li&gt;
&lt;li&gt;Implement GitHub OAuth SSO in FastAPI + React SPA&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Following along got dramatically easier
&lt;/h2&gt;

&lt;p&gt;Facts and decisions gave me a clean trail of what changed and why. Instead of diffing stale markdown docs, I can see durable constraints and rationale in structured records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc7kwwcm826hpnw5y9c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyc7kwwcm826hpnw5y9c8.png" alt="Real project facts and decisions excerpts from Project Brain" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Team chat made delegation less chaotic
&lt;/h2&gt;

&lt;p&gt;The team chat flow helps agents delegate with context instead of vague instructions. That means fewer handoff bugs and less "wait, what are we doing again?" between planner and implementer agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dxw9996am87fp2taz06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dxw9996am87fp2taz06.png" alt="Delegation thread example with explicit task IDs, rationale, and expected outcome." width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I keep saying "we"
&lt;/h2&gt;

&lt;p&gt;I say "we" because the agent has genuinely been a partner in building this product. I regularly ask what would make agents more efficient, what creates friction during execution, and what tooling is missing. Those conversations have directly shaped the roadmap and produced ideas I probably would not have reached alone.&lt;/p&gt;




&lt;p&gt;If your agents keep rewriting context and you keep re-explaining the same project, Project Brain was built for that exact pain.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>devtool</category>
      <category>buildinpublic</category>
    </item>
  </channel>
</rss>
