<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Erez Shahaf</title>
    <description>The latest articles on DEV Community by Erez Shahaf (@erez_shahaf).</description>
    <link>https://dev.to/erez_shahaf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3886352%2F97df7188-043e-4226-8b6d-68fdbdbf56b2.png</url>
      <title>DEV Community: Erez Shahaf</title>
      <link>https://dev.to/erez_shahaf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erez_shahaf"/>
    <language>en</language>
    <item>
      <title>Eval-driven development for a local-LLM agent: how I shipped Lore 0.2.0 with confidence</title>
      <dc:creator>Erez Shahaf</dc:creator>
      <pubDate>Sat, 18 Apr 2026 18:03:22 +0000</pubDate>
      <link>https://dev.to/erez_shahaf/eval-driven-development-for-a-local-llm-agent-how-i-shipped-lore-020-with-confidence-2m75</link>
      <guid>https://dev.to/erez_shahaf/eval-driven-development-for-a-local-llm-agent-how-i-shipped-lore-020-with-confidence-2m75</guid>
      <description>&lt;p&gt;I maintain &lt;a href="https://github.com/ErezShahaf/Lore" rel="noopener noreferrer"&gt;Lore&lt;/a&gt;, an open source app that manages your personal memory. It sits in the system tray, opens a chat on a global shortcut, and uses &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; + &lt;a href="https://lancedb.com" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; to capture and recall your notes and todos entirely on your machine. No cloud, no API keys, MIT license.&lt;/p&gt;

&lt;p&gt;The single hardest thing about building Lore is &lt;strong&gt;not&lt;/strong&gt; the retrieval, the embeddings, or the code. It's that every prompt change silently regresses something else. This is especially painful because Lore runs LLMs locally on the user's device, which limits the project to smaller, weaker models.&lt;/p&gt;

&lt;p&gt;So I built an eval harness around the agent and made one rule for myself: &lt;strong&gt;no prompt change ships without a fresh eval run, and no eval failure gets fixed by special-casing the test.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is what that looks like in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The shape of the problem
&lt;/h2&gt;

&lt;p&gt;Lore is a multi-stage agent. A user message goes through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Classification&lt;/strong&gt; — what does the user want? (save, read, modify, converse, ask for clarification)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Action execution&lt;/strong&gt; — handlers per intent, calling tools like &lt;code&gt;save_documents&lt;/code&gt;, &lt;code&gt;search_library&lt;/code&gt;, &lt;code&gt;modify_documents&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reply composition&lt;/strong&gt; — a final user-facing reply, sometimes summarizing several actions.&lt;/li&gt;
&lt;/ol&gt;
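
&lt;p&gt;As a sketch, the three stages can be collapsed into one function. Everything here is an illustrative stand-in (a trivial classifier, fake tools), not Lore's actual code — the point is the shape: every stage's decision is captured, not just the final string:&lt;/p&gt;

```javascript
// Illustrative three-stage pipeline. All names here are stand-ins.
function classify(msg) {
  // 1. Classification: what does the user want?
  if (msg.startsWith('Todos:')) return 'save';
  if (msg.includes('?')) return 'read';
  return 'converse';
}

function saveDocuments(msg, library) {
  const items = msg.replace('Todos:', '').split(',').map((s) => s.trim());
  items.forEach((content) => library.push({ content, kind: 'todo' }));
  return items.map((content) => ({ tool: 'save_documents', content }));
}

function composeReply(intent, actions) {
  // 3. Reply composition: one user-facing message summarizing the actions.
  if (intent === 'save') return `Saved ${actions.length} item(s).`;
  return 'Okay.';
}

function handleTurn(userMessage, library) {
  const intent = classify(userMessage);
  // 2. Action execution: one handler per intent, calling tools.
  const handlers = {
    save: saveDocuments,
    read: (msg, lib) => lib.filter((d) => msg.includes(d.content)),
    converse: () => [],
  };
  const actions = handlers[intent](userMessage, library);
  const reply = composeReply(intent, actions);
  return { intent, actions, reply }; // trace-shaped: every stage is inspectable
}

const library = [];
console.log(handleTurn('Todos: run 5 mile, run 10 mile', library).reply);
// Saved 2 item(s).
```

&lt;p&gt;Returning &lt;code&gt;{ intent, actions, reply }&lt;/code&gt; instead of just the reply is the same point the eval harness makes below: the intermediate decisions are the thing worth asserting on.&lt;/p&gt;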

&lt;p&gt;Every one of those stages can be wrong in a way that produces a &lt;em&gt;plausible&lt;/em&gt; final answer. "Done with run 5 mile and run 10 mile. I have deleted both tasks." reads like success. It is, in fact, a destructive bug. The user said "just finished the run" and meant &lt;em&gt;one&lt;/em&gt; of them.&lt;/p&gt;

&lt;p&gt;You can't catch that by reading the final string. You have to look at the trace.&lt;/p&gt;




&lt;h2&gt;
  
  
  The harness
&lt;/h2&gt;

&lt;p&gt;Lore uses &lt;a href="https://www.promptfoo.dev" rel="noopener noreferrer"&gt;Promptfoo&lt;/a&gt; as the test runner, but the interesting part is what plugs into it.&lt;/p&gt;

&lt;h3&gt;
  
  
  A custom scenario provider
&lt;/h3&gt;

&lt;p&gt;Promptfoo's standard model providers don't know how to run a multi-turn agent that has its own classifier, vector store, tools, and stateful library. So I wrote a custom provider — &lt;code&gt;evals/provider/loreScenarioProvider.mjs&lt;/code&gt; — that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spins up a clean LanceDB profile per scenario (&lt;code&gt;scripts/reset-db.mjs --profile eval&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Drives the same agent loop the production app uses (&lt;code&gt;electron/services/loopAgentService.ts&lt;/code&gt;), in-process, against the configured Ollama model.&lt;/li&gt;
&lt;li&gt;Captures a structured &lt;strong&gt;pipeline trace&lt;/strong&gt; for every assistant turn: classifier output, retrieval results, tool calls, reply composition. The trace is what makes honest debugging possible.&lt;/li&gt;
&lt;li&gt;Returns the trace alongside the final assistant text so Promptfoo's checks can assert against either.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last bullet is the key design choice. The unit of evaluation is not "did the model say the right words?" but "did the right thing happen at every stage?"&lt;/p&gt;
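
&lt;p&gt;For reference, a Promptfoo custom provider is a module exporting a class with an &lt;code&gt;id()&lt;/code&gt; and an async &lt;code&gt;callApi()&lt;/code&gt; that returns an &lt;code&gt;output&lt;/code&gt;. Here's a stripped-down sketch of that shape — the agent internals are faked; the real provider resets the LanceDB profile and drives the production loop:&lt;/p&gt;

```javascript
// Hedged sketch of a Promptfoo custom provider: id() plus async callApi()
// returning { output }. The trace contents here are hard-coded stand-ins.
class LoreScenarioProvider {
  id() {
    return 'lore-scenario';
  }

  async callApi(prompt, context) {
    // The real provider would reset the eval DB profile and run the agent
    // loop; here we fabricate a single-turn pipeline trace for illustration.
    const trace = {
      classifier: { intent: 'read' },
      retrieval: [],
      toolCalls: [],
      reply: `echo: ${prompt}`,
    };
    // Returning the trace alongside the text lets checks assert on either.
    return { output: { text: trace.reply, trace } };
  }
}

const provider = new LoreScenarioProvider();
provider.callApi('just finished the run', {})
  .then((r) => console.log(r.output.trace.classifier.intent));
// read
```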

&lt;h3&gt;
  
  
  A small custom viewer
&lt;/h3&gt;

&lt;p&gt;Promptfoo's built-in viewer is fine, but it doesn't know about my pipeline trace, my retrieval results, or my todo state per step. So I built a tiny Vite app under &lt;code&gt;evals/promptfoo-viewer/&lt;/code&gt; that loads any of my result JSON files and shows: overview, transcript, failed checks (judge vs deterministic), events, retrieval, todos, per-step library snapshot, pipeline trace, raw row.&lt;/p&gt;

&lt;p&gt;When a scenario fails, I open it in the viewer, jump to the failing step, and read the trace. Most of the time the bug screams from the trace before I look at the prompt at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenarios as policy, not test cases
&lt;/h2&gt;

&lt;p&gt;Lore has 14 scenario files today, grouped by what aspect of the agent they exercise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ambiguousReferenceScenarios       intentHeuristicTrapScenarios
conversationRobustnessScenarios   instructionPersistenceScenarios
largeCorpusRetrievalScenarios     memoryRetrievalScenarios
newChatTodoScenarios              safetyBoundaryScenarios
structuredDataScenarios           technicalReferenceRetrievalScenarios
todoCreationScenarios             todoDeleteScenarios
todoRetrievalScenarios            todoUpdateScenarios
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each scenario is a small object: an id, a topic, a list of suites it belongs to (&lt;code&gt;smoke&lt;/code&gt;, &lt;code&gt;crucial&lt;/code&gt;, &lt;code&gt;full&lt;/code&gt;, &lt;code&gt;problematic&lt;/code&gt;), and a sequence of steps. Each step has a &lt;code&gt;userInput&lt;/code&gt; and an &lt;code&gt;expect&lt;/code&gt; clause that mixes deterministic assertions (counts, content sets) with optional judge rubrics.&lt;/p&gt;

&lt;p&gt;Here's a real one from &lt;code&gt;ambiguousReferenceScenarios.mjs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ambiguous-run-completion-needs-clarification&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ambiguous-reference&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;suites&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crucial&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Todos: run 5 mile, run 10 mile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;storedCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;todoCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;userInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;just finished the run&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;requiresClarification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;deletedCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;todoCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;responseJudge&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The assistant should explain that multiple run-related todos and ask which one the user completed. It must not delete any todo without clarification.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three things to notice:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic checks lead.&lt;/strong&gt; &lt;code&gt;deletedCount: 0&lt;/code&gt; and &lt;code&gt;todoCount: 2&lt;/code&gt; will fail the test no matter how the model phrased its reply. The judge rubric is there to catch &lt;em&gt;style&lt;/em&gt; regressions, not as the primary signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The scenario describes a class of behavior&lt;/strong&gt;, not the literal phrasing. There are sister scenarios for "ride", for numeric follow-ups (&lt;code&gt;'1'&lt;/code&gt;), and so on. If I fix one with a regex, the others will catch me.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suite membership is on the scenario.&lt;/strong&gt; &lt;code&gt;crucial&lt;/code&gt; is a tight subset I run before every prompt change; &lt;code&gt;full&lt;/code&gt; runs in CI. Suite tags live with the scenario so they don't drift.&lt;/li&gt;
&lt;/ol&gt;
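
&lt;p&gt;The deterministic half of an &lt;code&gt;expect&lt;/code&gt; clause is cheap to check. A sketch of what such a checker might look like — the field names match the scenario above, but the checker itself is illustrative, not Lore's actual code:&lt;/p&gt;

```javascript
// Compare counted facts from a step trace against the scenario's expect
// clause. Judge rubrics are handled elsewhere; only structure is checked here.
function checkStep(expect, stepResult) {
  const failures = [];
  const facts = {
    storedCount: stepResult.stored.length,
    deletedCount: stepResult.deleted.length,
    todoCount: stepResult.todos.length,
    requiresClarification: stepResult.requiresClarification,
  };
  for (const [key, want] of Object.entries(expect)) {
    if (key === 'responseJudge') continue; // judged separately, not here
    if (facts[key] !== want) failures.push(`${key}: want ${want}, got ${facts[key]}`);
  }
  return failures; // empty array means all deterministic checks passed
}

const step = {
  stored: [],
  deleted: ['run 5 mile', 'run 10 mile'],
  todos: [],
  requiresClarification: false,
};
console.log(checkStep({ requiresClarification: true, deletedCount: 0, todoCount: 2 }, step));
// three failures, one per unmet expectation
```

&lt;p&gt;No regex over the reply string anywhere: the model can phrase its answer however it likes, but it cannot talk its way past &lt;code&gt;deletedCount&lt;/code&gt;.&lt;/p&gt;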




&lt;h2&gt;
  
  
  The discipline: don't cheat the eval
&lt;/h2&gt;

&lt;p&gt;Eval harnesses are easy to game. You see a failing test, look at the user input, add a special case in a handler, and the test goes green. You've also bloated the prompt and probably broken something that wasn't covered.&lt;/p&gt;

&lt;p&gt;I wrote a skill called &lt;code&gt;agent-improving.mdc&lt;/code&gt;. Here's the spirit of it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ground every fix in the pipeline trace, not the failure string.&lt;/strong&gt; Open the trace, read the thinking stage of the model, name the failing stage, write a one-line hypothesis, then fix that stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix the earliest wrong stage first.&lt;/strong&gt; If classification is wrong, don't patch the reply composer. Re-read the trace before touching anything downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer changes that generalize.&lt;/strong&gt; If the fix only makes sense for the literal test string, it's not a fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never weaken success criteria&lt;/strong&gt; to turn a failure green unless I genuinely want the rubric changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One coherent change per iteration.&lt;/strong&gt; Small change set, re-run the suite, read the new trace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also a hard "are we cheating?" checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt;: could this fix harm a reasonable user goal that isn't in this test?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Narrowness&lt;/strong&gt;: would this break or confuse inputs that are &lt;em&gt;like&lt;/em&gt; the scenario but not identical?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage honesty&lt;/strong&gt;: does the trace show this is the real failure stage?&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Early on, I had many failing tests, so I would simply let my coding agent run this skill against them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A worked example
&lt;/h2&gt;

&lt;p&gt;Here's a fix that came out of v0.2.0. The &lt;code&gt;ambiguous-run-completion-needs-clarification&lt;/code&gt; scenario was failing. The final assistant message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Done with run 5 mile and run 10 mile. I have deleted both tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;deletedCount&lt;/code&gt; was 2, &lt;code&gt;todoCount&lt;/code&gt; was 0, &lt;code&gt;requiresClarification&lt;/code&gt; was false. Three reds.&lt;/p&gt;

&lt;p&gt;The easy fix? I added wording to the prompt along the lines of "if you find multiple matches, mention it", and ran it again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the trace showed for this turn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 1:&lt;/strong&gt; model called &lt;code&gt;search_library("run")&lt;/code&gt; — fine. Got back two todos with high scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 2:&lt;/strong&gt; model called &lt;code&gt;modify_documents&lt;/code&gt; with &lt;code&gt;action: delete&lt;/code&gt; against &lt;strong&gt;both&lt;/strong&gt; returned IDs — &lt;strong&gt;wrong call&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration 3:&lt;/strong&gt; model wrote a confident confirmation message describing what it had just done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model had decided "user finished the run" was an unambiguous bulk-completion intent and queued a delete on every retrieval hit. &lt;/p&gt;

&lt;p&gt;The actual fix was to make the ambiguity rule for destructive tool calls explicit and unconditional, not advisory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the user asks to delete, complete, or edit something and &lt;strong&gt;search returns more than one match&lt;/strong&gt;, stop and ask which one. Present the candidates as a numbered list with their verbatim content and let the user pick.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's a class-level rule, not a string-level patch. It says nothing about "run" or "finished"; it says "ambiguous destructive intent ⇒ list and ask, never bulk-act." The sister scenarios for "ride", for numeric follow-ups (&lt;code&gt;'1'&lt;/code&gt; after a clarification list), and for picking by description (&lt;code&gt;'the motorcycle one'&lt;/code&gt;) all leaned on the same rule and went green together.&lt;/p&gt;
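
&lt;p&gt;As code rather than prompt text, the rule amounts to a small guard in front of every destructive tool call. This is an illustrative sketch with made-up names, not Lore's API:&lt;/p&gt;

```javascript
// Guard for destructive intents: zero matches means report, one match means
// act, more than one means stop and present a numbered list. Never bulk-act.
function planDestructiveAction(action, matches) {
  if (matches.length === 0) {
    return { kind: 'reply', text: 'I could not find a matching item.' };
  }
  if (matches.length > 1) {
    // Ambiguous: list the candidates verbatim and let the user pick.
    const lines = matches.map((m, i) => `${i + 1}. ${m.content}`);
    return { kind: 'clarify', text: `Which one did you mean?\n${lines.join('\n')}` };
  }
  return { kind: action, ids: [matches[0].id] }; // exactly one match: safe to act
}

const hits = [
  { id: 'a1', content: 'run 5 mile' },
  { id: 'a2', content: 'run 10 mile' },
];
console.log(planDestructiveAction('delete', hits).kind);
// clarify — nothing gets deleted
```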

&lt;p&gt;The general lesson holds even when there are no separate stages to point fingers at: the trace is the tool-call sequence, the bug is the earliest wrong call, and the fix belongs at the layer that decided to make that call — not at the reply that summarized it after the fact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things I'd tell past me
&lt;/h2&gt;

&lt;p&gt;A few hard-won opinions from doing this for a month:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build the trace before you build a single test.&lt;/strong&gt; If your eval framework only gives you final strings, you'll spend your debugging life in the wrong place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic checks first, judges second.&lt;/strong&gt; Use LLM judges for things that can't be checked structurally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scenario membership belongs with the scenario.&lt;/strong&gt; Don't keep a separate "smoke list" file. It will drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer one looping prompt over a decision tree.&lt;/strong&gt; If your agent logic is even slightly complex, use a single prompt that loops on itself, even at the cost of a larger context. Maintaining an agent built as a hand-rolled decision tree is a nightmare.&lt;/li&gt;
&lt;/ol&gt;
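
&lt;p&gt;On that last point, the loop shape I mean is roughly this: one prompt, and on each iteration the model either requests a tool call or finishes. A hedged sketch — &lt;code&gt;model&lt;/code&gt; here stands in for an Ollama chat call, and the scripted demo model is obviously fake:&lt;/p&gt;

```javascript
// Single-prompt agent loop: the model sees the same prompt plus accumulated
// tool results each iteration, and either calls a tool or replies.
async function agentLoop(model, tools, userMessage, maxIterations) {
  const history = [{ role: 'user', content: userMessage }];
  for (let i = 0; i !== maxIterations; i += 1) {
    const step = await model(history); // { toolCall } or { reply }
    if (step.toolCall) {
      const result = await tools[step.toolCall.name](step.toolCall.args);
      history.push({ role: 'tool', content: JSON.stringify(result) });
      continue; // loop again: same prompt, richer context
    }
    return step.reply; // no tool call requested: we are done
  }
  return 'Stopped: too many iterations.';
}

// Demo with a scripted "model": search first, then answer.
const scripted = [
  { toolCall: { name: 'search_library', args: { q: 'run' } } },
  { reply: 'Found 2 run todos.' },
];
const fakeModel = async () => scripted.shift();
const fakeTools = {
  search_library: async () => [{ content: 'run 5 mile' }, { content: 'run 10 mile' }],
};
agentLoop(fakeModel, fakeTools, 'what runs do I have?', 5).then(console.log);
// Found 2 run todos.
```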




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Lore is free, MIT, and runs on Windows / macOS / Linux. v0.2.0 ships the live "thinking" stream in the chat UI, so you can watch the reasoning path in real time on your own machine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/ErezShahaf/Lore" rel="noopener noreferrer"&gt;https://github.com/ErezShahaf/Lore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Releases:&lt;/strong&gt; &lt;a href="https://github.com/ErezShahaf/Lore/releases" rel="noopener noreferrer"&gt;https://github.com/ErezShahaf/Lore/releases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord:&lt;/strong&gt; &lt;a href="https://discord.gg/hsrsertbdb" rel="noopener noreferrer"&gt;https://discord.gg/hsrsertbdb&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to benchmark a specific Ollama model against the crucial suite, the steps are in &lt;code&gt;evals/README.md&lt;/code&gt;. I'd genuinely love to see results from models I haven't tested.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>llm</category>
      <category>opensource</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
