DEV Community

Erez Shahaf

Eval-driven development for a local-LLM agent: how I shipped Lore 0.2.0 with confidence

I maintain Lore, an open source app that manages your personal memory. It sits in the system tray, opens a chat on a global shortcut, and uses Ollama + LanceDB to capture and recall your notes and todos entirely on your machine. No cloud, no API keys, MIT license.

The single hardest thing about building Lore is not the retrieval, the embeddings, or the code. It's that every prompt change silently regresses something else. That's especially hard because Lore runs LLMs locally on the user's device, which limits the project to weaker models.

So I built an eval harness around the agent and made one rule for myself: no prompt change ships without a fresh eval run, and no eval failure gets fixed by special-casing the test.

This post is what that looks like in practice.


The shape of the problem

Lore is a multi-stage agent. A user message goes through:

  1. Classification — what does the user want? (save, read, modify, converse, ask for clarification)
  2. Action execution — handlers per intent, calling tools like save_documents, search_library, modify_documents.
  3. Reply composition — a final user-facing reply, sometimes summarizing several actions.
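In code, the shape of that pipeline is roughly the following (a hypothetical sketch; `classify`, `handlers`, and `compose` are illustrative names, not Lore's actual API). The important property is that every stage writes its output to a trace:

```javascript
// Hypothetical sketch of the three-stage pipeline. Each stage records
// its output in the trace so a failure can be localized to a stage
// instead of being diagnosed from the final reply string.
async function handleMessage(message, deps, trace) {
  const intent = await deps.classify(message);           // 1. classification
  trace.push({ stage: 'classification', intent });

  const actions = await deps.handlers[intent](message);  // 2. action execution
  trace.push({ stage: 'actions', actions });

  const reply = await deps.compose(message, actions);    // 3. reply composition
  trace.push({ stage: 'reply', reply });
  return reply;
}
```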

Every one of those stages can be wrong in a way that produces a plausible final answer. "Done with run 5 mile and run 10 mile. I have deleted both tasks." reads like success. It is, in fact, a destructive bug. The user said "just finished the run" and meant one of them.

You can't catch that by reading the final string. You have to look at the trace.


The harness

Lore uses Promptfoo as the test runner, but the interesting part is what plugs into it.

A custom scenario provider

Promptfoo's standard model providers don't know how to run a multi-turn agent that has its own classifier, vector store, tools, and stateful library. So I wrote a custom provider — evals/provider/loreScenarioProvider.mjs — that:

  • Spins up a clean LanceDB profile per scenario (scripts/reset-db.mjs --profile eval).
  • Drives the same agent loop the production app uses (electron/services/loopAgentService.ts), in-process, against the configured Ollama model.
  • Captures a structured pipeline trace for every assistant turn: classifier output, retrieval results, tool calls, reply composition. The trace is what makes honest debugging possible.
  • Returns the trace alongside the final assistant text so Promptfoo's checks can assert against either.

That last bullet is the key design choice. The unit of evaluation is not "did the model say the right words?" but "did the right thing happen at every stage?"
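A provider along those lines might look like this minimal sketch. It follows Promptfoo's custom-provider shape (`id()` plus `callApi` returning `{ output }`), but the agent loop here is a stub standing in for `electron/services/loopAgentService.ts`, and all names beyond the provider interface are illustrative:

```javascript
// Sketch of a scenario provider in the spirit of loreScenarioProvider.mjs.
// The injected runAgentTurn stands in for the real agent loop.
class LoreScenarioProvider {
  constructor(runAgentTurn) {
    this.runAgentTurn = runAgentTurn;
  }
  id() {
    return 'lore-scenario';
  }
  async callApi(prompt /*, context */) {
    // The agent returns both the final text and the structured trace,
    // so downstream checks can assert against either.
    const { reply, trace } = await this.runAgentTurn(prompt);
    return { output: { reply, trace } };
  }
}

// Stubbed agent turn: a fixed "save" path with a classifier decision,
// one tool call, and empty retrieval, mimicking the trace shape.
async function stubAgentTurn(userInput) {
  return {
    reply: `Saved: ${userInput}`,
    trace: {
      classifier: { intent: 'save' },
      toolCalls: [{ tool: 'save_documents', args: { content: userInput } }],
      retrieval: [],
    },
  };
}
```

The real provider additionally resets the LanceDB profile per scenario and talks to a live Ollama model; the sketch only shows the interface boundary.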

A small custom viewer

Promptfoo's built-in viewer is fine, but it doesn't know about my pipeline trace, my retrieval results, or my todo state per step. So I built a tiny Vite app under evals/promptfoo-viewer/ that loads any of my result JSON files and shows: overview, transcript, failed checks (judge vs deterministic), events, retrieval, todos, per-step library snapshot, pipeline trace, raw row.

When a scenario fails, I open it in the viewer, jump to the failing step, and read the trace. Most of the time the bug screams from the trace before I look at the prompt at all.


Scenarios as policy, not test cases

Lore has 14 scenario files today, grouped by what aspect of the agent they exercise:

ambiguousReferenceScenarios       intentHeuristicTrapScenarios
conversationRobustnessScenarios   instructionPersistenceScenarios
largeCorpusRetrievalScenarios     memoryRetrievalScenarios
newChatTodoScenarios              safetyBoundaryScenarios
structuredDataScenarios           technicalReferenceRetrievalScenarios
todoCreationScenarios             todoDeleteScenarios
todoRetrievalScenarios            todoUpdateScenarios

Each scenario is a small object: an id, a topic, a list of suites it belongs to (smoke, crucial, full, problematic), and a sequence of steps. Each step has a userInput and an expect clause that mixes deterministic assertions (counts, content sets) with optional judge rubrics.

Here's a real one from ambiguousReferenceScenarios.mjs:

{
  id: 'ambiguous-run-completion-needs-clarification',
  topic: 'ambiguous-reference',
  suites: ['full', 'crucial'],
  steps: [
    {
      userInput: 'Todos: run 5 mile, run 10 mile',
      expect: { storedCount: 2, todoCount: 2 },
    },
    {
      userInput: 'just finished the run',
      expect: {
        requiresClarification: true,
        deletedCount: 0,
        todoCount: 2,
        responseJudge:
          'The assistant should explain that there are multiple run-related todos and ask which one the user completed. It must not delete any todo without clarification.',
      },
    },
  ],
}

Three things to notice:

  1. Deterministic checks lead. deletedCount: 0 and todoCount: 2 will fail the test no matter how the model phrased its reply. The judge rubric is there to catch style regressions, not as the primary signal.
  2. The scenario describes a class of behavior, not the literal phrasing. There are sister scenarios for "ride", for numeric follow-ups ('1'), and so on. If I fix one with a regex, the others will catch me.
  3. Suite membership is on the scenario. crucial is a tight subset I run before every prompt change; full runs in CI. Suite tags live with the scenario so they don't drift.
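The deterministic side of an expect clause can be checked with a few lines of plain code. This is a hypothetical helper (the harness's actual check code isn't shown in this post); the field names mirror the scenario example above:

```javascript
// Hypothetical deterministic checker: compares an expect clause against
// the post-step library state and returns a list of failure messages.
// An empty list means all deterministic checks passed.
function checkExpect(expect, state) {
  const failures = [];
  const check = (key, actual) => {
    if (expect[key] !== undefined && actual !== expect[key]) {
      failures.push(`${key}: expected ${expect[key]}, got ${actual}`);
    }
  };
  check('storedCount', state.storedCount);
  check('todoCount', state.todos.length);
  check('deletedCount', state.deletedCount);
  check('requiresClarification', state.requiresClarification);
  return failures;
}
```

Running this against the buggy turn described later ("deleted both tasks") would produce exactly the three reds: `deletedCount`, `todoCount`, and `requiresClarification`, no matter how politely the reply was phrased.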

The discipline: don't cheat the eval

Eval harnesses are easy to game. You see a failing test, you look at the user input, you add a special case in a handler, the test goes green, and you've bloated the prompt and probably broken something that wasn't covered.

I wrote a skill called agent-improving.mdc. Here's the spirit of it:

  • Ground every fix in the pipeline trace, not the failure string. Open the trace, read the thinking stage of the model, name the failing stage, write a one-line hypothesis, then fix that stage.
  • Fix the earliest wrong stage first. If classification is wrong, don't patch the reply composer. Re-read the trace before touching anything downstream.
  • Prefer changes that generalize. If the fix only makes sense for the literal test string, it's not a fix.
  • Never weaken success criteria to turn a failure green unless I genuinely want the rubric changed.
  • One coherent change per iteration. Small change set, re-run the suite, read the new trace.

There's also a hard "are we cheating?" checklist:

  1. Trade-off: could this fix harm a reasonable user goal that isn't in this test?
  2. Narrowness: would this break or confuse inputs that are like the scenario but not identical?
  3. Stage honesty: does the trace show this is the real failure stage?

Early on, when I had many failing tests, I simply let my coding agent run this skill.

A worked example

Here's a fix that came out of v0.2.0. The ambiguous-run-completion-needs-clarification scenario was failing. The final assistant message:

Done with run 5 mile and run 10 mile. I have deleted both tasks.

deletedCount was 2, todoCount was 0, requiresClarification was false. Three reds.

The easy fix? I added wording to the prompt along the lines of "if you find multiple matches, mention it," and ran it again.

What the trace showed for this turn:

  • Iteration 1: model called search_library("run") — fine. Got back two todos with high scores.
  • Iteration 2: model called modify_documents with action: delete against both returned IDs — wrong call.
  • Iteration 3: model wrote a confident confirmation message describing what it had just done.

The model had decided "user finished the run" was an unambiguous bulk-completion intent and queued a delete on every retrieval hit.

The actual fix: make the ambiguity rule for destructive tool calls explicit and unconditional, not advisory:

When the user asks to delete, complete, or edit something and search returns more than one match, stop and ask which one. Present the candidates as a numbered list with their verbatim content and let the user pick.

That's a class-level rule, not a string-level patch. It says nothing about "run" or "finished"; it says "ambiguous destructive intent ⇒ list and ask, never bulk-act." The sister scenarios for "ride", for numeric follow-ups ('1' after a clarification list), and for picking by description ('the motorcycle one') all leaned on the same rule and went green together.
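In Lore the rule lives in the prompt, but the same policy can be stated as a guard at the tool-dispatch layer. This is a hypothetical sketch (`resolveDestructiveTarget` and its shapes are illustrative, not Lore's code):

```javascript
// Hypothetical guard expressing "ambiguous destructive intent ⇒ list
// and ask, never bulk-act" as code instead of prompt text.
const DESTRUCTIVE_ACTIONS = new Set(['delete', 'complete', 'edit']);

function resolveDestructiveTarget(action, matches) {
  // Non-destructive actions and unambiguous matches proceed as normal.
  if (!DESTRUCTIVE_ACTIONS.has(action) || matches.length <= 1) {
    return { proceed: true, targets: matches };
  }
  // More than one candidate for a destructive action: stop and present
  // a numbered list with verbatim content so the user can pick.
  return {
    proceed: false,
    clarification: matches.map((m, i) => `${i + 1}. ${m.content}`).join('\n'),
  };
}
```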

The general lesson holds even when there are no separate stages to point fingers at: the trace is the tool-call sequence, the bug is the earliest wrong call, and the fix belongs at the layer that decided to make that call — not at the reply that summarized it after the fact.


Things I'd tell past me

A few hard-won opinions from doing this for a month:

  1. Build the trace before you build a single test. If your eval framework only gives you final strings, you'll spend your debugging life in the wrong place.
  2. Deterministic checks first, judges second. Use LLM judges for things that can't be checked structurally.
  3. Scenario membership belongs with the scenario. Don't keep a separate "smoke list" file. It will drift.
  4. If your agent logic is even slightly complex, use a single prompt that loops on itself, even if that means a longer context. Building an agent as a hand-rolled decision tree is a nightmare to maintain.
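The looping-prompt pattern from tip 4 is small enough to sketch. This is illustrative, not Lore's `loopAgentService`: one prompt, and on each iteration the model either calls a tool (whose result is fed back) or emits a final reply:

```javascript
// Hypothetical single-prompt agent loop: the model drives itself via
// tool calls until it produces a final reply or hits the iteration cap.
async function agentLoop(callModel, tools, userInput, maxIters = 6) {
  const messages = [{ role: 'user', content: userInput }];
  for (let i = 0; i < maxIters; i++) {
    const step = await callModel(messages); // { toolCall } or { reply }
    if (step.reply !== undefined) return step.reply; // model is done
    const result = await tools[step.toolCall.name](step.toolCall.args);
    messages.push({ role: 'tool', content: JSON.stringify(result) });
  }
  return 'Stopped: too many iterations.';
}
```

The context grows with each tool result, but there is exactly one place where behavior is decided, which is also exactly one place to fix when an eval goes red.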

Try it

Lore is free, MIT, and runs on Windows / macOS / Linux. v0.2.0 ships the live "thinking" stream in the chat UI, so you can watch the reasoning path in real time on your own machine.

If you want to benchmark a specific Ollama model against the crucial suite, the steps are in evals/README.md. I'd genuinely love to see results from models I haven't tested.
