<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oscar Rieken</title>
    <description>The latest articles on DEV Community by Oscar Rieken (@orieken).</description>
    <link>https://dev.to/orieken</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F409515%2F4b302a29-5754-4f45-a9ad-a8b955d04751.jpeg</url>
      <title>DEV Community: Oscar Rieken</title>
      <link>https://dev.to/orieken</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/orieken"/>
    <language>en</language>
    <item>
      <title>Making LLM outputs auditable: the provider abstraction pattern</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:28:08 +0000</pubDate>
      <link>https://dev.to/orieken/making-llm-outputs-auditable-the-provider-abstraction-pattern-5c7e</link>
      <guid>https://dev.to/orieken/making-llm-outputs-auditable-the-provider-abstraction-pattern-5c7e</guid>
      <description>&lt;h2&gt;
  
  
  The problem with calling an LLM directly
&lt;/h2&gt;

&lt;p&gt;NumPath's teacher dashboard generates per-student insights — one-sentence observations like "Emma skips borrowing in 9 of 11 recent subtraction attempts" with a suggested action. The obvious implementation is to import the Anthropic SDK, call &lt;code&gt;messages.create()&lt;/code&gt;, and return the result.&lt;/p&gt;

&lt;p&gt;That works until you need to test it. Or run it offline. Or swap providers. Or audit where the insight came from.&lt;/p&gt;

&lt;p&gt;This post covers how NumPath abstracts the LLM behind a protocol interface, tests with a deterministic stub, and structures the insight pipeline so the evidence is assembled from database reads — not generated by the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Protocol: 6 lines
&lt;/h2&gt;

&lt;p&gt;The entire LLM abstraction is a Python &lt;code&gt;Protocol&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runtime_checkable&lt;/span&gt;

&lt;span class="nd"&gt;@runtime_checkable&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No base class. No ABC. No framework. Any object with an &lt;code&gt;async def complete(self, system, user, max_tokens)&lt;/code&gt; method satisfies this interface — that's structural typing via &lt;code&gt;Protocol&lt;/code&gt;. The &lt;code&gt;@runtime_checkable&lt;/code&gt; decorator lets you write &lt;code&gt;isinstance(provider, LLMProvider)&lt;/code&gt; if you need a runtime check, though in practice the type checker catches mismatches at lint time.&lt;/p&gt;

&lt;p&gt;The signature is deliberately narrow: one system prompt, one user message, one token limit. No conversation history, no tool use, no streaming. NumPath's insight generator makes a single completion call per request. If multi-turn conversation becomes necessary in Phase 3, the protocol gains a new method — existing implementations aren't broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two implementations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ClaudeProvider&lt;/strong&gt; — the production implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;StubProvider&lt;/strong&gt; — deterministic, zero dependencies, zero API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StubProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic LLM stub for tests and local dev without API keys.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Student is building foundational numeracy skills &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;with consistent effort.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggested_action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Try place value &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exercises with physical manipulatives to reinforce digit positioning.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The stub returns a fixed JSON string that matches the expected response schema. Tests assert against this exact output. If someone changes the response schema, the stub breaks, the tests break, and the problem is caught before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiring: one environment variable
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_llm_provider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;LLMProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LLM_PROVIDER&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ClaudeProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StubProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;LLM_PROVIDER&lt;/code&gt; defaults to &lt;code&gt;"stub"&lt;/code&gt;. Running &lt;code&gt;uv run pytest&lt;/code&gt; requires zero environment variables — no API key, no network. Production sets &lt;code&gt;LLM_PROVIDER=claude&lt;/code&gt; and provides &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. The config uses &lt;code&gt;Literal["claude", "stub"]&lt;/code&gt; so a typo like &lt;code&gt;"Claude"&lt;/code&gt; fails at startup.&lt;/p&gt;

&lt;p&gt;The use case receives the provider through its constructor, not through a global:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GenerateInsightUseCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LLMProvider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router wires it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/students/{student_id}/insight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InsightResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_student_insight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;require_teacher&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;InsightResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_llm_provider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;use_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerateInsightUseCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evidence is not generated — it's assembled
&lt;/h2&gt;

&lt;p&gt;This is the design decision that matters most for a research project. When a teacher sees an insight, they need to trust it — and "trust" in an educational context means "I can check this against the data."&lt;/p&gt;

&lt;p&gt;The insight prompt receives two blocks of structured data, both assembled from database queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KC states:
- SUB_BORROW: Novice (p_mastery=0.18, 8 attempts)
- PLACE_VALUE: Developing (p_mastery=0.45, 3 attempts)
- NUMBER_LINE: Novice (p_mastery=0.15, 1 attempt)

Recent attempts (last 10, most recent first):
1. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "52 − 27 = ?"
2. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "31 − 14 = ?"
3. Skill: PLACE_VALUE | Correct: Yes | Mistake: none | Q: "Which is larger: 47 or 74?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM generates two fields: &lt;code&gt;summary&lt;/code&gt; (what's happening) and &lt;code&gt;suggested_action&lt;/code&gt; (what to do). It does &lt;em&gt;not&lt;/em&gt; generate the evidence — the KC codes, mastery percentages, mistake counts, and attempt records are all server-side data. The LLM synthesises a narrative from that data, but the data itself is verifiable.&lt;/p&gt;

&lt;p&gt;The prompt enforces this structurally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a specialist math learning advisor for primary school teachers.
Given their Knowledge Component mastery states and recent attempt history,
generate a JSON response with exactly two fields:
- "summary": one sentence (max 20 words) describing the student's current learning state
- "suggested_action": one concrete teaching action (max 20 words) the teacher can take today

Respond with only the JSON object. No explanation, no markdown, no code fences.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strict JSON. Word limits. No room for hallucinated statistics or invented KC codes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graceful fallback
&lt;/h2&gt;

&lt;p&gt;LLMs produce unpredictable output. The response parser handles malformed JSON without crashing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FALLBACK_INSIGHT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InsightResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Insight temporarily unavailable.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;suggested_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review the student&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s recent attempts for patterns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_insight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;InsightResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;InsightResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;suggested_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggested_action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insight_parse_failed_using_fallback raw=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_FALLBACK_INSIGHT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fallback is a valid &lt;code&gt;InsightResponse&lt;/code&gt; — the teacher sees a neutral message, not a 500 error. The warning log captures the first 200 characters of the raw response for debugging without logging the entire LLM output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not LangChain?
&lt;/h2&gt;

&lt;p&gt;This was an explicit decision, documented in ADR-003. LangChain adds 50+ transitive dependencies and significant abstraction cost for what NumPath actually needs: one completion call with a system prompt and a user message. The protocol-based approach is 6 lines of interface, 8 lines of stub, 9 lines of production implementation. The total abstraction surface is smaller than LangChain's &lt;code&gt;ChatModel&lt;/code&gt; base class alone.&lt;/p&gt;

&lt;p&gt;If NumPath needed retrieval-augmented generation, multi-step chains, or agent loops, LangChain would earn its weight. For two structured completion calls (insight generation and hint narration), it would be accidental complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fitness function
&lt;/h2&gt;

&lt;p&gt;ADR-003 specifies a concrete test: &lt;code&gt;uv run pytest&lt;/code&gt; must pass using &lt;code&gt;StubProvider&lt;/code&gt; with no environment variables set. This means every LLM-dependent code path has a test that runs offline. If someone adds a new LLM feature and writes a test that requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, CI fails — not because the test is wrong, but because it violates the architectural constraint that the test suite runs without external dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The current provider interface handles single-turn completions. Phase 3 may need multi-turn conversation for interactive teacher coaching. When that happens, the protocol gains a second method — &lt;code&gt;complete()&lt;/code&gt; stays unchanged, and a new &lt;code&gt;converse()&lt;/code&gt; method handles the multi-turn case. Existing implementations get a &lt;code&gt;NotImplementedError&lt;/code&gt; default until they're updated. The key is that the interface extends forward without breaking backward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Protocol-based abstraction costs 6 lines and buys full test isolation&lt;/strong&gt; — &lt;code&gt;StubProvider&lt;/code&gt; returns deterministic output; no API key, no network, no flaky tests; the type checker enforces the contract at lint time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidence must be assembled from data, not generated by the model&lt;/strong&gt; — the LLM writes the narrative but doesn't produce the numbers; KC codes, mastery percentages, and mistake counts come from database queries and are independently verifiable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful fallback is a first-class design requirement&lt;/strong&gt; — a teacher sees "insight temporarily unavailable" and a neutral suggestion, never a stack trace; the warning log captures the raw output for debugging without exposing it to the user&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>python</category>
      <category>llm</category>
      <category>claude</category>
    </item>
    <item>
      <title>60 hand-crafted math problems: what I learned writing seed data for an adaptive tutor</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:27:18 +0000</pubDate>
      <link>https://dev.to/orieken/60-hand-crafted-math-problems-what-i-learned-writing-seed-data-for-an-adaptive-tutor-2dob</link>
      <guid>https://dev.to/orieken/60-hand-crafted-math-problems-what-i-learned-writing-seed-data-for-an-adaptive-tutor-2dob</guid>
      <description>&lt;h2&gt;
  
  
  Why hand-author anything?
&lt;/h2&gt;

&lt;p&gt;The obvious approach for seeding an adaptive math tutor is to generate problems programmatically. Pick two random numbers, subtract them, done. I tried this first and it failed for a specific reason: generated problems don't have meaningful hints.&lt;/p&gt;

&lt;p&gt;A hint like "Try subtracting the ones column first" is generic. A hint like "2 ones minus 9 is impossible without borrowing — take a ten from the 3 tens" is diagnostic. It names the exact step where a dyscalculic student is likely to get stuck, and it names the operation they need to perform. That second kind of hint requires a human who understands the problem.&lt;/p&gt;

&lt;p&gt;NumPath's Phase 1 seeds 100 problems across 5 Knowledge Components, each with two progressive hints, a calibrated difficulty score, and structured metadata that the &lt;code&gt;MistakeClassifier&lt;/code&gt; uses to diagnose errors. Every one is hand-authored.&lt;/p&gt;

&lt;h2&gt;
  
  
  The content schema
&lt;/h2&gt;

&lt;p&gt;Each problem is a JSONB column in Postgres. The schema is intentionally flat — no nested objects, no polymorphism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subtraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32 − 9 = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;23&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;difficulty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2 ones − 9 is impossible without borrowing. Take a ten from the 3 tens.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now you have 12 ones. 12 − 9 = 3. You have 2 tens left. Answer: 23.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three fields deserve explanation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;operands&lt;/code&gt; / &lt;code&gt;choices&lt;/code&gt;&lt;/strong&gt; — these aren't shown to the student. They exist for the &lt;code&gt;MistakeClassifier&lt;/code&gt;. When a student answers "41" instead of "23" on a subtraction problem, the classifier checks whether the answer matches subtracting the digits in the wrong direction (&lt;code&gt;3 - 2 = 1&lt;/code&gt;, &lt;code&gt;9 - 0 = 9&lt;/code&gt; → &lt;code&gt;91&lt;/code&gt;... no). It checks whether the answer omits borrowing (&lt;code&gt;32 - 9&lt;/code&gt; without regrouping gives &lt;code&gt;33&lt;/code&gt;... no). It checks for digit reversal (&lt;code&gt;23&lt;/code&gt; → &lt;code&gt;32&lt;/code&gt;... close, but the student wrote &lt;code&gt;41&lt;/code&gt;). Each check operates on the operands, not the question string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;difficulty&lt;/code&gt;&lt;/strong&gt; — a float from 0.1 to 0.9, calibrated by hand. This is the initial difficulty estimate. The adaptive engine uses it to match students to problems at their current level. I'll explain the calibration logic below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;hints&lt;/code&gt;&lt;/strong&gt; — always exactly two, always progressive. The first hint names the obstacle. The second hint walks through the solution. Students reveal hints one at a time, voluntarily. Hints are never forced — forcing hints on students who don't want them creates learned helplessness, which is the opposite of what we're trying to study.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five skill areas
&lt;/h2&gt;

&lt;p&gt;Each skill has 20 problems covering a difficulty gradient from 0.1 to 0.9:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill Code&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Example at 0.1&lt;/th&gt;
&lt;th&gt;Example at 0.9&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUB_BORROW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Subtraction&lt;/td&gt;
&lt;td&gt;11 − 4 = ?&lt;/td&gt;
&lt;td&gt;1003 − 567 = ?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PLACE_VALUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number sense&lt;/td&gt;
&lt;td&gt;Which is larger: 3 or 8?&lt;/td&gt;
&lt;td&gt;What does the 6 represent in 3,641?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMBER_LINE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number sense&lt;/td&gt;
&lt;td&gt;What number comes after 3?&lt;/td&gt;
&lt;td&gt;What is halfway between 250 and 350?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMBER_SENSE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number sense&lt;/td&gt;
&lt;td&gt;Which is more: 2 or 5?&lt;/td&gt;
&lt;td&gt;Order from smallest: 892, 829, 928, 289&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OPERATION_SIGN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Arithmetic&lt;/td&gt;
&lt;td&gt;2 + 3 = ?&lt;/td&gt;
&lt;td&gt;15 − 7 + 3 = ?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The difficulty gradient is not linear. The jump from 0.1 to 0.3 (single-digit to simple two-digit) is smaller than the jump from 0.7 to 0.9 (two-digit with borrowing across zeros to three-digit with cascading borrows). This mirrors what the dyscalculia research literature reports: difficulty is not proportional to number size. It's proportional to the number of cognitive steps, particularly steps that require regrouping or holding intermediate results in working memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hint design: what I got wrong
&lt;/h2&gt;

&lt;p&gt;My first draft of hints was procedural — they described &lt;em&gt;what&lt;/em&gt; to do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Borrow from the tens column. Subtract. Write the answer."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is useless for a student with dyscalculia. The difficulty isn't knowing &lt;em&gt;what&lt;/em&gt; borrowing is — it's executing the procedure without losing track of which column they're in. The second draft of every hint follows two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Name the specific obstacle.&lt;/strong&gt; Not "this is tricky" — rather "2 ones minus 9 is impossible without borrowing."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Walk through the state change.&lt;/strong&gt; Not "borrow and subtract" — rather "Take a ten from the 3 tens. Now you have 12 ones. 12 − 9 = 3. You have 2 tens left."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second rule matters because dyscalculic students often lose the intermediate state — they borrow correctly but then forget what changed. The hint reconstructs the full number after regrouping so the student can see where they are.&lt;/p&gt;

&lt;p&gt;This pattern held across all five skill areas. Place value hints name the specific column ("the tens digit is the second from the right"). Number line hints name the direction and distance ("7 is to the right of 4 — count 3 steps forward"). Operation sign hints name the symbols and their meaning ("the − sign means subtract — take the second number away from the first").&lt;/p&gt;

&lt;h2&gt;
  
  
  Difficulty calibration
&lt;/h2&gt;

&lt;p&gt;Difficulty scores are not arbitrary. They follow a rubric I developed after the first round of testing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score range&lt;/th&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.1 – 0.2&lt;/td&gt;
&lt;td&gt;Single-digit or simple two-digit; one cognitive step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.25 – 0.4&lt;/td&gt;
&lt;td&gt;Two-digit; requires one borrowing or comparison step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.45 – 0.6&lt;/td&gt;
&lt;td&gt;Two-digit with borrowing across columns, or three-digit without borrowing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.65 – 0.8&lt;/td&gt;
&lt;td&gt;Three-digit with borrowing; or problems requiring intermediate computation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.85 – 0.9&lt;/td&gt;
&lt;td&gt;Three-digit with cascading borrows (e.g., borrowing from hundreds when tens is 0)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The adaptive engine uses a &lt;code&gt;DIFFICULTY_BAND&lt;/code&gt; of 0.15 around the target difficulty when selecting problems. So a student at target difficulty 0.5 sees problems between 0.35 and 0.65. This means each difficulty tier overlaps with its neighbors — a student improving from 0.4 to 0.6 transitions gradually rather than hitting a cliff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The seed script
&lt;/h2&gt;

&lt;p&gt;The seed is idempotent — safe to run on every deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;seed_problems&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill_id_map&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;skill_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;problems&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROBLEMS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;skill_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skill_id_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;problems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;difficulty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;stmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nf"&gt;pg_insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Problem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;skill_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skill_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;difficulty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="n"&gt;problem_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on_conflict_do_nothing&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;on_conflict_do_nothing()&lt;/code&gt; means re-running the seed doesn't duplicate problems. The &lt;code&gt;difficulty&lt;/code&gt; field is stored both inside the JSONB &lt;code&gt;content&lt;/code&gt; and as a top-level column on the &lt;code&gt;Problem&lt;/code&gt; model — the column is indexed for the adaptive engine's range queries, while the JSONB copy preserves the original specification.&lt;/p&gt;

&lt;p&gt;The full seed runs inside a single transaction: skills first (because problems have a foreign key to skills), then problems, then test accounts. If any step fails, nothing is committed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Two things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More problems per skill.&lt;/strong&gt; Twenty problems with a 0.15 difficulty band means some bands have only 2–3 candidates. When the adaptive engine excludes recently-seen problems, it can run out of fresh options at a specific difficulty level. The fallback chain handles this gracefully (widen the band, then allow repeats), but 30 problems per skill would eliminate most fallback cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine-assisted hint generation.&lt;/strong&gt; The hints are the bottleneck — each one took 2–3 minutes to write well. For Phase 2, I plan to generate candidate hints with Claude and then manually review them. The human is still in the loop, but the first draft comes faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generated problems are easy; generated hints are not&lt;/strong&gt; — an adaptive tutor's value is in the scaffolding, not the arithmetic; hand-authoring hints that name the specific obstacle and walk through the state change is what makes the system useful for dyscalculia&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficulty is not proportional to number size&lt;/strong&gt; — it's proportional to cognitive steps, particularly regrouping and intermediate state; a three-digit problem with no borrowing (350 − 120) is easier than a two-digit problem with cascading borrows (100 − 67)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent seeds inside a transaction are non-negotiable&lt;/strong&gt; — &lt;code&gt;on_conflict_do_nothing()&lt;/code&gt; plus a single transaction means the seed runs safely on every deployment, fresh clone, and CI pipeline&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>dyscalculia</category>
      <category>python</category>
      <category>education</category>
    </item>
    <item>
      <title>Clean Architecture in a FastAPI + Vue 3 monorepo</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:26:44 +0000</pubDate>
      <link>https://dev.to/orieken/clean-architecture-in-a-fastapi-vue-3-monorepo-38p3</link>
      <guid>https://dev.to/orieken/clean-architecture-in-a-fastapi-vue-3-monorepo-38p3</guid>
      <description>&lt;h2&gt;
  
  
  Why architecture matters in a research project
&lt;/h2&gt;

&lt;p&gt;Most research prototypes are throwaway code. NumPath is not. It needs to survive four phases over 30 weeks, accumulate real student data for a randomised controlled trial, and remain testable without live infrastructure at every step. That means the architecture has to enforce rules that hold up under pressure — not just conventions someone remembers to follow.&lt;/p&gt;

&lt;p&gt;This post walks through how NumPath uses Clean Architecture to keep a FastAPI backend, a Vue 3 frontend, and a Python ML module in a single repository without coupling them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The monorepo layout
&lt;/h2&gt;

&lt;p&gt;The project lives in a single repo with a clear namespace boundary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;phd-research/
├── numpath/
│   ├── backend/      # Python 3.12 + FastAPI + SQLAlchemy
│   ├── frontend/     # Vue 3 + Tailwind CSS + Pinia
│   └── ml/           # BKT, DKT, adaptive engine
├── docs/
│   ├── adrs/         # Architecture Decision Records
│   ├── architecture/ # System design, feature specs
│   └── posts/        # This blog series
└── DOMAIN_DICTIONARY.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alternative was three separate repos. For a solo PhD project where a data model change touches the migration, the API schema, the ML engine, and the Vue component in the same commit, separate repos mean coordinated PRs across three remotes. That's overhead with no benefit when one person owns all three layers.&lt;/p&gt;

&lt;p&gt;The escape hatch is clean: if NumPath ever needs to become a standalone repo, the &lt;code&gt;numpath/&lt;/code&gt; directory lifts out intact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependency direction: the one rule that matters
&lt;/h2&gt;

&lt;p&gt;Clean Architecture has many principles, but only one that I enforce mechanically: &lt;strong&gt;inner layers never import from outer layers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In NumPath's backend, the layers are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Domain (models)  →  Use Cases  →  Adapters (routers, DB, LLM)  →  Frameworks (FastAPI, SQLAlchemy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A use case like &lt;code&gt;GetNextProblemUseCase&lt;/code&gt; receives a database session — but it does not import FastAPI, does not know about HTTP, and does not call &lt;code&gt;Depends()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GetNextProblemUseCase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NextProblemResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;kc_states&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_kc_states&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;recent_attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_fetch_recent_attempts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;recent_mistakes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_fetch_recent_mistakes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;selection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProblemSelection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;select_next_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;kc_states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;kc_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;recent_correctness&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_correct&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recent_attempts&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;current_difficulty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recent_attempts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;difficulty&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recent_attempts&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;recent_mistakes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recent_mistakes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;problem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_select_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
        &lt;span class="c1"&gt;# ... return NextProblemResponse
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router — the adapter layer — is the only file that knows about FastAPI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/next-problem/{student_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NextProblemResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_next_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AsyncSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_db&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;require_student&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NextProblemResponse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;use_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GetNextProblemUseCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;use_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router does three things: parse the request, inject dependencies, and delegate to the use case. No business logic. If I replaced FastAPI with Litestar tomorrow, I'd rewrite the routers and touch nothing else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration as a boundary
&lt;/h2&gt;

&lt;p&gt;Settings are another place where framework details leak into domain code if you're not careful. NumPath uses Pydantic's &lt;code&gt;BaseSettings&lt;/code&gt; with a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SettingsConfigDict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql+asyncpg://numpath:numpath@localhost:5432/numpath&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;LLM_PROVIDER&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;development&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;CORS_ORIGINS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5173&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every secret has a default that works locally. &lt;code&gt;LLM_PROVIDER&lt;/code&gt; defaults to &lt;code&gt;"stub"&lt;/code&gt; so that tests and local dev never require an API key. The &lt;code&gt;Literal&lt;/code&gt; type annotation means a typo in the &lt;code&gt;.env&lt;/code&gt; file fails at startup, not at runtime when a teacher clicks "Generate insight."&lt;/p&gt;

&lt;h2&gt;
  
  
  The ML module as a pure function boundary
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ml/&lt;/code&gt; directory is a separate Python package (&lt;code&gt;numpath-ml&lt;/code&gt;) with its own &lt;code&gt;pyproject.toml&lt;/code&gt;. The backend depends on it, but the dependency is narrow: two functions and a dataclass.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpath_ml.adaptive_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;select_next_problem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProblemSelection&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;numpath_ml.bkt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KCState&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;select_next_problem()&lt;/code&gt; takes dictionaries and lists — no SQLAlchemy models, no async, no database. It returns a &lt;code&gt;ProblemSelection&lt;/code&gt; with a &lt;code&gt;skill_code&lt;/code&gt;, &lt;code&gt;target_difficulty&lt;/code&gt;, and &lt;code&gt;reason&lt;/code&gt; string. The use case translates between database rows and these pure data structures.&lt;/p&gt;

&lt;p&gt;This boundary exists because the ML code changes on a different cadence than the web application. When I replace the rule-based engine with Deep Knowledge Tracing in Phase 2, the use case stays the same — only the function it calls changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frontend: same principle, different language
&lt;/h2&gt;

&lt;p&gt;The Vue 3 frontend mirrors the same layering:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API client&lt;/strong&gt; — a thin Axios wrapper that handles auth tokens and 401 redirects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/v1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;apiClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;interceptors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;localStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Authorization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stores&lt;/strong&gt; — Pinia stores manage state. The auth store handles login/logout and persists the JWT to localStorage. Views consume stores, not the API client directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Views&lt;/strong&gt; — &lt;code&gt;PracticeView.vue&lt;/code&gt;, &lt;code&gt;TeacherView.vue&lt;/code&gt;, &lt;code&gt;LoginView.vue&lt;/code&gt;. Each view composes API calls and store access. No view imports another view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Router&lt;/strong&gt; — role-based guards redirect students and teachers to their respective views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useAuthStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requiresAuth&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;isAuthenticated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;teacher&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/teacher&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/practice&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Docker Compose as the integration layer
&lt;/h2&gt;

&lt;p&gt;The four services — Postgres, Redis, backend, frontend — are composed with health checks so the backend waits for the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+asyncpg://numpath:numpath@postgres:5432/numpath&lt;/span&gt;
  &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
    &lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./backend:/app/backend&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./ml:/app/ml&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Volume-mounting &lt;code&gt;backend/&lt;/code&gt; and &lt;code&gt;ml/&lt;/code&gt; means hot reload works inside Docker — change a use case, save, and the server restarts. The port mapping (&lt;code&gt;5433:5432&lt;/code&gt; for Postgres) avoids collisions with a local Postgres install.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this buys you
&lt;/h2&gt;

&lt;p&gt;Three concrete benefits I've already seen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test isolation&lt;/strong&gt; — use cases are testable with a real async database session and no HTTP server. The test creates a &lt;code&gt;GetNextProblemUseCase(db)&lt;/code&gt; directly. No FastAPI test client needed for business logic tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM swappability&lt;/strong&gt; — &lt;code&gt;GenerateInsightUseCase&lt;/code&gt; receives an &lt;code&gt;LLMProvider&lt;/code&gt; protocol. In tests it gets &lt;code&gt;StubProvider&lt;/code&gt;. In production it gets &lt;code&gt;ClaudeProvider&lt;/code&gt;. The use case doesn't know which one it has.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safe ML replacement&lt;/strong&gt; — when BKT gives way to DKT, only &lt;code&gt;numpath_ml&lt;/code&gt; changes. The use case calls the same &lt;code&gt;select_next_problem()&lt;/code&gt; function with the same signature. The router doesn't change. The frontend doesn't change.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;I wouldn't change the layering, but I'd add one thing from day one: a fitness function that statically checks import direction. Right now the rule is "use cases don't import routers" — but it's enforced by code review (i.e., me reviewing my own code). A linter rule or CI check that fails on &lt;code&gt;from backend.routers&lt;/code&gt; inside &lt;code&gt;use_cases/&lt;/code&gt; would catch drift automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency direction is the only architectural rule worth enforcing mechanically&lt;/strong&gt; — inner layers never import outer layers; everything else is convention that erodes under deadline pressure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A monorepo is the right default for a solo research project&lt;/strong&gt; — coordinated PRs across three repos is overhead without benefit when one person owns all layers and changes cut across them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pure function boundaries between modules pay for themselves&lt;/strong&gt; — the ML module exports two functions and a dataclass; the web layer translates between database rows and those pure structures, making the ML code replaceable without touching the application&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>cleanarchitecture</category>
      <category>python</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>From Bayesian to deep knowledge tracing — upgrading NumPath's student model with a PyTorch LSTM</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:26:00 +0000</pubDate>
      <link>https://dev.to/orieken/from-bayesian-to-deep-knowledge-tracing-upgrading-numpaths-student-model-with-a-pytorch-lstm-2ikm</link>
      <guid>https://dev.to/orieken/from-bayesian-to-deep-knowledge-tracing-upgrading-numpaths-student-model-with-a-pytorch-lstm-2ikm</guid>
      <description>&lt;p&gt;BKT told us how well a student knows subtraction-with-borrowing. It had no idea that a student who reverses digits on subtraction problems probably also reverses them on place value problems — because BKT treats every Knowledge Component as an island.&lt;/p&gt;

&lt;p&gt;Deep Knowledge Tracing (DKT) fixes that. Instead of four independent scalar parameters per KC, it maintains a shared LSTM hidden vector across all KCs and learns the dependencies from data. This is Phase 3 of NumPath: swapping out the Markov model for a neural sequence model.&lt;/p&gt;

&lt;p&gt;Here's what we built, the design decision that almost made us reach for a transformer, and the student simulator we had to build first to test it without any real students.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;Two components that feed each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Student simulator&lt;/strong&gt; — five named personas that generate realistic attempt sequences for testing. Each persona has a per-KC accuracy curve and weighted mistake preferences drawn from the dyscalculia ITS literature:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;SUB_BORROW accuracy&lt;/th&gt;
&lt;th&gt;Characteristic errors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ConfidentLearner&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;td&gt;Rare, careless (OFF_BY_TEN)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;StrugglingSUB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.35&lt;/td&gt;
&lt;td&gt;Frequent BORROW_SKIP, slow timing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PlaceValueGap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;DIGIT_REVERSAL across skill areas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FrustrationLoop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;Fast random guessing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FastMaster&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;td&gt;Near-zero mistakes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;DKT model&lt;/strong&gt; — a single-layer LSTM that takes a sequence of &lt;code&gt;(skill, correctness)&lt;/code&gt; interactions and predicts P(correct on skill k) at each subsequent step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DKTModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_skills&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initial_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Student answers SUB_BORROW correctly
&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill_idx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_correct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query mastery on any KC
&lt;/span&gt;&lt;span class="n"&gt;p_mastery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skill_idx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# → float in (0, 1)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Design Decision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not stay with BKT?
&lt;/h3&gt;

&lt;p&gt;BKT's four parameters — p_mastery, p_learn, p_guess, p_slip — are per-KC and independent. A student who has &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; on subtraction problems and &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; on place value problems is modelled as having two unrelated problems. BKT cannot learn that these are the same underlying representational gap.&lt;/p&gt;

&lt;p&gt;DKT's hidden state is shared. After the student makes a digit-reversal error on subtraction, the LSTM adjusts its hidden vector in a way that also shifts the place value prediction. It learns the cross-KC structure from data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not a transformer?
&lt;/h3&gt;

&lt;p&gt;The sequence lengths we're working with are short — 10 to 30 attempts per session. Transformers need longer sequences to exploit their attention mechanism meaningfully. An LSTM is a better fit: it handles variable-length sequences natively, trains faster on small datasets, and produces interpretable per-step hidden states we can inspect.&lt;/p&gt;

&lt;p&gt;More importantly: the Piech et al. (2015) DKT paper established LSTMs as the baseline for knowledge tracing. Improving on the baseline is Phase 4 work; Phase 3 is implementing it correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The encoding
&lt;/h3&gt;

&lt;p&gt;The input encoding follows Piech et al. exactly. At each step t, the input is a one-hot vector of size &lt;code&gt;2 × n_skills&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x[k]             = 1  if skill k was answered CORRECTLY
x[k + n_skills]  = 1  if skill k was answered INCORRECTLY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For three skills (SUB_BORROW=0, PLACE_VALUE=1, NUMBER_LINE=2), a correct subtraction answer encodes as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1, 0, 0,  0, 0, 0]
  ↑ correct half    ↑ incorrect half
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An incorrect subtraction answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0, 0, 0,  1, 0, 0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LSTM sees this 6-dimensional input and updates its hidden state. The output layer projects the hidden state back to 3 dimensions — one P(correct) per KC.&lt;/p&gt;

&lt;h2&gt;
  
  
  The training objective
&lt;/h2&gt;

&lt;p&gt;The model learns to predict the NEXT response from the current history. At step t, given the encoded interaction x_t, the LSTM outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ŷ_t[k] = σ(W × h_t + b)[k]  =  P(student answers skill k correctly at t+1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loss at each step uses only the skill that was actually asked next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# At step t, the next question has skill_idx q and correctness r
&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="n"&gt;pred&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training is one sequence at a time with Adam and gradient clipping. Small dataset — no need for batching yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the simulator came first
&lt;/h2&gt;

&lt;p&gt;We can't train DKT on real data until the pilot delivers ≥150 attempt records. But we can validate the architecture right now with the student simulator.&lt;/p&gt;

&lt;p&gt;The final integration test runs both pipelines end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate 30 sequences from &lt;code&gt;StrugglingSUB&lt;/code&gt; (35% accuracy on SUB_BORROW)&lt;/li&gt;
&lt;li&gt;Generate 30 sequences from &lt;code&gt;FastMaster&lt;/code&gt; (90% accuracy on SUB_BORROW)&lt;/li&gt;
&lt;li&gt;Train two separate DKT models on each persona's sequences&lt;/li&gt;
&lt;li&gt;Simulate 6 practice steps with each model&lt;/li&gt;
&lt;li&gt;Assert &lt;code&gt;FastMaster&lt;/code&gt;'s model predicts higher mastery than &lt;code&gt;StrugglingSUB&lt;/code&gt;'s
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From test_dkt.py
&lt;/span&gt;&lt;span class="n"&gt;fast_mastery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mastery_after_steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_fast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;    &lt;span class="c1"&gt;# 5/6 correct
&lt;/span&gt;
&lt;span class="n"&gt;struggling_mastery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mastery_after_steps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_struggling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# 2/6 correct
&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;fast_mastery&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;struggling_mastery&lt;/span&gt;      &lt;span class="c1"&gt;# ✓ passes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us confidence the model learns the right signal before we hand it real children's data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters for the Research
&lt;/h2&gt;

&lt;p&gt;BKT's independence assumption is a known limitation in the ITS literature. It was acceptable for Phase 1 and 2 because we didn't have cross-KC interaction data. Now that the mistake classifier is generating &lt;code&gt;BORROW_SKIP&lt;/code&gt; and &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; events consistently, we have a sequence model that can learn from them.&lt;/p&gt;

&lt;p&gt;The specific research claim that DKT enables: &lt;strong&gt;a student's error pattern on one KC predicts their likely error pattern on a related KC&lt;/strong&gt;. If DKT learns this and BKT doesn't, that's measurable evidence that the LSTM captures structure that the Markov model misses — and a direct contribution to the Phase 4 RCT analysis.&lt;/p&gt;

&lt;p&gt;The upgrade path is explicit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pilot delivers ≥150 attempts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;train_dkt(sequences_from_db)&lt;/code&gt; on the full dataset&lt;/li&gt;
&lt;li&gt;Evaluate against BKT's predictions using held-out sessions&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;update_bkt&lt;/code&gt; in &lt;code&gt;SubmitAttemptUseCase&lt;/code&gt; when DKT's per-KC accuracy exceeds BKT's&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The ADR for this transition is on the backlog.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The student simulator is the missing test fixture for ITS research.&lt;/strong&gt; Standard software testing assumes you can construct any input you need. In adaptive tutoring, your input is a real child's learning trajectory. The simulator bridges that gap — it's not a replacement for real data, but it lets you test that the model responds in the right &lt;em&gt;direction&lt;/em&gt; before you commit to an ethical review and a cohort of participants.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BKT and DKT coexist cleanly at the domain layer.&lt;/strong&gt; &lt;code&gt;KCState&lt;/code&gt; stays unchanged. &lt;code&gt;DKTState&lt;/code&gt; is a separate dataclass with a different shape. The backend currently uses &lt;code&gt;KCState&lt;/code&gt;; swapping in &lt;code&gt;DKTState&lt;/code&gt; is an interface change at &lt;code&gt;SubmitAttemptUseCase&lt;/code&gt; and &lt;code&gt;GetNextProblemUseCase&lt;/code&gt; — two files, no schema migration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient clipping mattered more than I expected.&lt;/strong&gt; Early training runs without &lt;code&gt;clip_grad_norm_&lt;/code&gt; diverged on the frustration-loop persona (all-incorrect sequences). Clipping at &lt;code&gt;max_norm=1.0&lt;/code&gt; stabilised training across all five personas.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Backend wiring: load the trained DKT model at startup, store hidden state vectors in Redis per student, and swap the two use cases. That's the integration step that puts DKT into the live adaptive loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DKT's shared LSTM hidden state captures cross-KC dependencies that BKT's independent scalar parameters cannot — a student with DIGIT_REVERSAL on subtraction is more likely to have it on place value, and DKT learns this from data&lt;/li&gt;
&lt;li&gt;Build the student simulator before the model: testing an adaptive learning architecture requires synthetic student trajectories, and the simulator lets you validate directional correctness before any ethics review or pilot recruitment&lt;/li&gt;
&lt;li&gt;LSTM beats transformer for short sequences (10–30 steps): attention needs length to work; LSTMs handle variable-length sequences natively and train faster on the small datasets typical of ITS research&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>adaptivelearning</category>
      <category>pytorch</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a mistake taxonomy for dyscalculia — 8 error patterns, rule-based, no ML required</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:14:58 +0000</pubDate>
      <link>https://dev.to/orieken/building-a-mistake-taxonomy-for-dyscalculia-8-error-patterns-rule-based-no-ml-required-3707</link>
      <guid>https://dev.to/orieken/building-a-mistake-taxonomy-for-dyscalculia-8-error-patterns-rule-based-no-ml-required-3707</guid>
      <description>&lt;p&gt;"Wrong" isn't a diagnosis.&lt;/p&gt;

&lt;p&gt;When a student answers 32 − 9 = 37, they didn't randomly guess. They subtracted in the wrong direction in the ones column — a specific, named error called a borrow-skip. A tutor that just marks it incorrect and moves on has wasted the most informative signal in the attempt: &lt;em&gt;why&lt;/em&gt; the student got it wrong.&lt;/p&gt;

&lt;p&gt;NumPath's Phase 2 mistake classifier turns wrong answers into structured &lt;code&gt;MistakeEvent&lt;/code&gt; records. Here's how we built it, what we got wrong the first time, and why rule-based classifiers beat a neural network for this job at this stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;Eight rule-based classifiers covering all three of NumPath's Phase 1 skill areas:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DIGIT_REVERSAL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SUB_BORROW / NUMBER_LINE&lt;/td&gt;
&lt;td&gt;2-digit answer with digits transposed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WRONG_OPERATION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SUB_BORROW&lt;/td&gt;
&lt;td&gt;Student added instead of subtracted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BORROW_SKIP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SUB_BORROW&lt;/td&gt;
&lt;td&gt;Ones subtracted in reverse — no borrow taken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OFF_BY_TEN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SUB_BORROW&lt;/td&gt;
&lt;td&gt;Result ±10 from correct (borrow applied to wrong column)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PLACE_VALUE_CONFUSION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PLACE_VALUE&lt;/td&gt;
&lt;td&gt;Compared units digits only, ignored tens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAGNITUDE_MISJUDGE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PLACE_VALUE&lt;/td&gt;
&lt;td&gt;Chose the smaller number as larger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMBER_LINE_DIRECTION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NUMBER_LINE&lt;/td&gt;
&lt;td&gt;Said "left" when answer is "right"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OFF_BY_ONE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NUMBER_LINE&lt;/td&gt;
&lt;td&gt;Numeric answer ±1 from correct (miscounted steps)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each classifier is a pure Python predicate — no external dependencies, no DB imports, testable in isolation. The main function runs them in priority order and returns the first match.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design Decision
&lt;/h2&gt;

&lt;p&gt;The first question was: classify with rules or train a model?&lt;/p&gt;

&lt;p&gt;The case for rules: we don't have labelled training data yet. Phase 1 just shipped. We have zero &lt;code&gt;MistakeEvent&lt;/code&gt; records. Training a classifier on nothing produces nothing.&lt;/p&gt;

&lt;p&gt;The case for ML: rules are brittle. A student might make a novel error we didn't anticipate, and rule-based code silently returns &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We went with rules for Phase 2 because the error patterns for dyscalculia are well-documented in the ITS literature — specifically in the work of VanLehn (1982) on subtraction bugs and the later SIERRA system. "Borrow-skip" and "digit reversal" aren't our taxonomy; they're 40-year-old findings from cognitive science. A rule that detects them is more reliable than a model trained on 150 attempts.&lt;/p&gt;

&lt;p&gt;The ML path opens in Phase 3 once the &lt;code&gt;mistake_events&lt;/code&gt; table has enough volume. The rule-based classifier generates the labelled training data that Phase 3 will learn from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The BORROW_SKIP bug
&lt;/h2&gt;

&lt;p&gt;The Phase 1 classifier had a &lt;code&gt;BORROW_SKIP&lt;/code&gt; function. It was wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Phase 1 — incorrect
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_borrow_skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;problem_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;operands&lt;/span&gt;
    &lt;span class="n"&gt;no_borrow_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# ← adds a + b, not the borrow-skip result
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;no_borrow_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This detected addition (32 − 9 → 41) and called it &lt;code&gt;BORROW_SKIP&lt;/code&gt;. But addition is a completely different error — confusing +/− signs, not misapplying the borrowing algorithm. The mistake was labelled wrong in every event record.&lt;/p&gt;

&lt;p&gt;The real borrow-skip pattern: when ones(a) &amp;lt; ones(b), the student skips borrowing and instead subtracts in the wrong direction in the ones column.&lt;/p&gt;

&lt;p&gt;For 32 − 9:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correct: borrow a ten → 12 − 9 = 3 ones, 2 tens → &lt;strong&gt;23&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Borrow-skip: ones = 9 − 2 = 7, tens = 3 (unchanged) → &lt;strong&gt;37&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Phase 2 — correct
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_borrow_skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ones_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ones_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ones_a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;ones_b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# no borrow needed — pattern doesn't apply
&lt;/span&gt;    &lt;span class="n"&gt;borrow_skip_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ones_b&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ones_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;borrow_skip_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verified: 32 − 9 → 37 ✓, 43 − 18 → 35 ✓, 31 − 14 → 23 ✓&lt;/p&gt;

&lt;p&gt;The old code was shipping the wrong signal for every borrow-skip attempt. This is exactly why &lt;code&gt;MistakeEvent&lt;/code&gt; records are useless until the classifier is correct — the adaptive engine was routing "borrow-skip" students to the wrong remediation path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The priority ordering problem
&lt;/h2&gt;

&lt;p&gt;Multiple patterns can fire for the same wrong answer. For 43 − 16 = 27, the student wrote 72. That's a &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; (27 reversed). But priority ordering becomes meaningful when patterns genuinely overlap.&lt;/p&gt;

&lt;p&gt;The classifier runs a hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subtraction problems:
  1. DIGIT_REVERSAL    ← most specific free-form error
  2. WRONG_OPERATION   ← added instead of subtracted
  3. BORROW_SKIP       ← skipped borrowing algorithm
  4. OFF_BY_TEN        ← borrow applied to wrong column

place_value problems (multiple-choice — no free-form digit writing):
  1. PLACE_VALUE_CONFUSION  ← compared units digits only (more specific)
  2. MAGNITUDE_MISJUDGE     ← picked the smaller number (less specific)

number_line problems:
  1. NUMBER_LINE_DIRECTION  ← wrong direction word
  2. DIGIT_REVERSAL         ← transposed digits in numeric answer
  3. OFF_BY_ONE             ← miscounted steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Place value problems are multiple-choice, so &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; doesn't apply there — the student picks from a given set, they don't write digits freely. Scoping by problem type prevents false positives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters for the Research
&lt;/h2&gt;

&lt;p&gt;Every &lt;code&gt;MistakeEvent&lt;/code&gt; record becomes a training signal twice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now&lt;/strong&gt;: the adaptive engine reads the last &lt;code&gt;MISTAKE_WINDOW&lt;/code&gt; (3) events. Two &lt;code&gt;BORROW_SKIP&lt;/code&gt; codes in a row triggers remediation mode — the engine drops difficulty and targets &lt;code&gt;SUB_BORROW&lt;/code&gt; problems specifically. Correct classification = correct remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Later&lt;/strong&gt;: Phase 3 will train a logistic regression (and eventually a transformer) on the mistake events table. The rule-based classifier generates the initial labelled dataset. If the rules are wrong — as &lt;code&gt;BORROW_SKIP&lt;/code&gt; was — the ML model learns the wrong pattern from poisoned labels.&lt;/p&gt;

&lt;p&gt;For a dyscalculia intervention study, this matters more than it would in a general tutoring system. Dyscalculia-specific errors like borrow-skip and digit reversal appear in the ITS literature as distinct cognitive profiles. Getting them right means the model can eventually distinguish students who have a procedural gap (&lt;code&gt;BORROW_SKIP&lt;/code&gt;) from students who have a representational gap (&lt;code&gt;PLACE_VALUE_CONFUSION&lt;/code&gt;) — a distinction that should affect the instructional intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule-based classifiers need domain literature, not just intuition.&lt;/strong&gt; The original &lt;code&gt;BORROW_SKIP&lt;/code&gt; implementation was plausible — "student added instead of subtracting" — but wrong. VanLehn's subtraction bug taxonomy makes the actual pattern explicit. Reading the paper would have saved months of mislabelled data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Priority ordering is a design document.&lt;/strong&gt; The order in which classifiers run encodes assumptions about what matters more. We chose "most specific fires first" — but that could be wrong. Maybe &lt;code&gt;WRONG_OPERATION&lt;/code&gt; (a conceptual error) should always beat &lt;code&gt;DIGIT_REVERSAL&lt;/code&gt; (a transcription error) regardless of specificity, because they imply different interventions. We don't have the data to answer that yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;50 tests is the right investment for a classifier that labels training data.&lt;/strong&gt; A wrong label propagates forward through every model that trains on it. Testing every predicate in isolation, including priority ordering and edge cases, is not over-engineering — it's protecting the integrity of the entire data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;mistake_events&lt;/code&gt; table is now correctly populated with each session. Once the pilot delivers ≥150 records, Phase 3 can fit a logistic regression on the labelled events — using the rule-based codes as ground truth — and eventually replace the rules with a model that generalises to error patterns we haven't seen yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Rule-based mistake classifiers are the right first step when training data doesn't exist yet — they generate the labelled dataset that trains the eventual ML model&lt;/li&gt;
&lt;li&gt;The real borrow-skip pattern (subtract ones in reverse: 32−9=37) is different from wrong-operation (add instead of subtract: 32+9=41) — getting this wrong poisons every downstream model that trains on the events table&lt;/li&gt;
&lt;li&gt;Classifier priority ordering is a design decision that encodes instructional theory; document it explicitly and treat it as something to validate with data&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>dyscalculia</category>
      <category>adaptivelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a FastAPI + Vue 3 research platform: the 4 bugs that almost broke Phase 1</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Mon, 01 Jun 2026 02:11:33 +0000</pubDate>
      <link>https://dev.to/orieken/building-a-fastapi-vue-3-research-platform-the-4-bugs-that-almost-broke-phase-1-10de</link>
      <guid>https://dev.to/orieken/building-a-fastapi-vue-3-research-platform-the-4-bugs-that-almost-broke-phase-1-10de</guid>
      <description>&lt;p&gt;Phase 1 of NumPath is done. Seven of eight Definition of Done items are checked — the eighth requires real children completing pilot sessions, which no amount of code will substitute for. The stack runs cleanly in Docker Compose, 56 unit tests pass, and a student can log in, answer ten problems, and see their knowledge state update in real time.&lt;/p&gt;

&lt;p&gt;What the commit history doesn't show is the afternoon I spent fighting four bugs that don't appear in any FastAPI or Vue tutorial. This post is that afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;NumPath is an adaptive math tutor for children with dyscalculia. Phase 1 ships the minimum research instrument: a student practice loop, a rule-based adaptive engine, and a read-only teacher dashboard. No ML yet — just clean infrastructure and a data collection pipeline capable of generating the 150+ attempt records that Phase 2 needs to train the BKT model.&lt;/p&gt;

&lt;p&gt;The stack: FastAPI 0.110 + SQLAlchemy 2 + Alembic + asyncpg on the backend; Vue 3 + Tailwind + Pinia on the frontend; PostgreSQL 16 + Redis 7 in Docker Compose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 1: passlib AttributeError on bcrypt ≥4.0
&lt;/h2&gt;

&lt;p&gt;The symptom was immediate on first login attempt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AttributeError: module 'bcrypt' has no attribute '__about__'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;passlib&lt;/code&gt; has a version check that reads &lt;code&gt;bcrypt.__about__.__version__&lt;/code&gt;. bcrypt 4.0 removed the &lt;code&gt;__about__&lt;/code&gt; module. The libraries have been incompatible for two years and &lt;code&gt;passlib&lt;/code&gt; is effectively unmaintained.&lt;/p&gt;

&lt;p&gt;The fix: delete &lt;code&gt;passlib&lt;/code&gt; entirely. Replace it with three lines of direct &lt;code&gt;bcrypt&lt;/code&gt; calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/auth/password.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bcrypt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bcrypt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hashpw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;bcrypt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gensalt&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_password&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bcrypt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;checkpw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;hashed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;: swap &lt;code&gt;"passlib[bcrypt]&amp;gt;=1.7.4"&lt;/code&gt; for &lt;code&gt;"bcrypt&amp;gt;=4.0.0"&lt;/code&gt;. Done. Don't reach for passlib on new Python projects — the dependency is dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 2: pnpm 10 security policies blocking Docker builds
&lt;/h2&gt;

&lt;p&gt;The frontend Dockerfile used &lt;code&gt;node:20-slim&lt;/code&gt; and installed the latest pnpm via corepack. When pnpm 10 shipped, the build started failing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERR_PNPM_PREPARE_PKG_FAILURE  Error when preparing the package
 Blocked by policy: electron-to-chromium@1.5.134 is not allowed
 because it was released 0 days ago (policy: minimumReleaseAge=3 days)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pnpm 10 introduced release-age security policies that refuse to install packages published within the last N days. A reasonable feature in production — a CI-breaking surprise when your lock file pins a package that was published yesterday.&lt;/p&gt;

&lt;p&gt;Two separate policies hit us: &lt;code&gt;minimumReleaseAge&lt;/code&gt; and &lt;code&gt;ignored-builds&lt;/code&gt; (which blocks &lt;code&gt;esbuild&lt;/code&gt; and &lt;code&gt;vue-demi&lt;/code&gt; unless explicitly allowed). The &lt;code&gt;package.json&lt;/code&gt; &lt;code&gt;"pnpm"&lt;/code&gt; field that's supposed to configure these policies is silently ignored in pnpm 10 — it logs a warning and reads nothing.&lt;/p&gt;

&lt;p&gt;The fix: pin to pnpm 9:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:22-slim&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;corepack &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; corepack prepare pnpm@9.15.9 &lt;span class="nt"&gt;--activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pnpm 9 has no release-age policies. The upgrade to pnpm 10 can wait until the project has a proper CI environment to absorb the breaking change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 3: FastAPI container connecting to localhost instead of postgres
&lt;/h2&gt;

&lt;p&gt;The backend started cleanly. Every database call returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;asyncpg.exceptions.ConnectionRefusedError: connection refused (host 127.0.0.1, port 5432)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;DATABASE_URL&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt; was &lt;code&gt;postgresql+asyncpg://numpath:numpath@localhost:5432/numpath&lt;/code&gt;. Inside a Docker Compose network, &lt;code&gt;localhost&lt;/code&gt; is the container's own loopback — not the postgres service. The postgres container is reachable by its service name.&lt;/p&gt;

&lt;p&gt;The fix: override the env var at the service level in &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;env_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;../.env&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql+asyncpg://numpath:numpath@postgres:5432/numpath&lt;/span&gt;
    &lt;span class="na"&gt;REDIS_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis://redis:6379/0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;environment&lt;/code&gt; block wins over &lt;code&gt;env_file&lt;/code&gt;, so local development (which uses &lt;code&gt;localhost&lt;/code&gt;) keeps working. Containers talk to each other by service name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 4: SQLAlchemy column defaults not applied at construction time
&lt;/h2&gt;

&lt;p&gt;This one cost the most time. &lt;code&gt;POST /attempts&lt;/code&gt; returned a 500:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BKT update equation was subtracting from &lt;code&gt;p_learn&lt;/code&gt;, which was &lt;code&gt;None&lt;/code&gt;. The &lt;code&gt;KCStateRecord&lt;/code&gt; model had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KCStateRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;p_learn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p_guess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p_slip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;Mapped&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mapped_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bug: SQLAlchemy's &lt;code&gt;default=&lt;/code&gt; is a &lt;strong&gt;server-side or flush-time default&lt;/strong&gt;. When you construct &lt;code&gt;KCStateRecord()&lt;/code&gt; in Python and haven't flushed to the database yet, those columns are &lt;code&gt;None&lt;/code&gt; on the Python object. The domain code ran immediately after construction, before any flush.&lt;/p&gt;

&lt;p&gt;The fix: set defaults explicitly in the constructor, then flush and refresh before returning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KCStateRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skill_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skill_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_mastery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_learn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_guess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p_slip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;opportunity_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;         &lt;span class="c1"&gt;# write to DB so defaults are applied
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# re-read the DB-populated values
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule: if you use a newly constructed SQLAlchemy model object before any flush, assume every &lt;code&gt;default=&lt;/code&gt; column is &lt;code&gt;None&lt;/code&gt;. Either set defaults in the constructor or flush first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the BKT update looks like in practice
&lt;/h2&gt;

&lt;p&gt;With those bugs cleared, the full attempt flow works end to end. A correct answer on a &lt;code&gt;SUB_BORROW&lt;/code&gt; problem with a fresh KCState shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before: p_mastery=0.100, opportunity_count=0
after:  p_mastery=0.533, opportunity_count=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 0.1 → 0.533 jump is the Bayesian update working — prior &lt;code&gt;p_mastery&lt;/code&gt; combines with &lt;code&gt;p_learn&lt;/code&gt;, corrected for &lt;code&gt;p_guess&lt;/code&gt; and &lt;code&gt;p_slip&lt;/code&gt;. The math is covered in detail in &lt;a href="https://dev.to/orieken/bayesian-knowledge-tracing-in-37-lines-of-python"&gt;Bayesian Knowledge Tracing in 37 lines of Python&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters for the Research
&lt;/h2&gt;

&lt;p&gt;Phase 1's job was never to be elegant — it was to be &lt;em&gt;instrumented&lt;/em&gt;. Every attempt record written to the &lt;code&gt;attempts&lt;/code&gt; table is a training signal for Phase 2's BKT parameter estimation. We need ≥150 records (5 students × 3 sessions × 10+ problems) before Phase 2 can begin.&lt;/p&gt;

&lt;p&gt;The bugs above are why research-grade software is harder than it looks. Each one silently corrupts data in a different way: password hashing fails outright (detectable), Docker networking fails silently on every write (detectable but subtle), SQLAlchemy defaults produce &lt;code&gt;None&lt;/code&gt; BKT parameters (corrupts ML inputs, hard to detect in test data).&lt;/p&gt;

&lt;p&gt;The fix for all of them is the same: run the full stack. Not unit tests. Not &lt;code&gt;import my_function; print(my_function())&lt;/code&gt;. Start the containers, log in as a real user, and watch what happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;The honest retrospective:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seed data is harder than it looks.&lt;/strong&gt; Writing 60 hand-crafted math problems at three difficulty levels takes longer than writing the adaptive engine. Every problem needs a machine-checkable answer, a hint, and a calibrated difficulty score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Compose &lt;code&gt;env_file&lt;/code&gt; + &lt;code&gt;environment&lt;/code&gt; is the right pattern.&lt;/strong&gt; &lt;code&gt;env_file&lt;/code&gt; carries the defaults; &lt;code&gt;environment&lt;/code&gt; carries container-specific overrides. The pattern is obvious in hindsight and invisible until you need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;flush()&lt;/code&gt; + &lt;code&gt;refresh()&lt;/code&gt; pattern is load-bearing for async SQLAlchemy.&lt;/strong&gt; Any code that creates an ORM object and immediately passes it to domain logic needs an explicit flush. The async path doesn't auto-flush the way the synchronous ORM used to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 2: BKT parameter estimation from real student data, and a mistake classifier that categorises subtraction errors beyond "wrong." The &lt;code&gt;attempts&lt;/code&gt; table is waiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;passlib&lt;/code&gt; is dead — use &lt;code&gt;bcrypt&lt;/code&gt; directly; it's three functions and no transitive dependency risk&lt;/li&gt;
&lt;li&gt;Docker Compose containers reach each other by service name, not &lt;code&gt;localhost&lt;/code&gt;; override &lt;code&gt;DATABASE_URL&lt;/code&gt; in the &lt;code&gt;environment&lt;/code&gt; block rather than the &lt;code&gt;env_file&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;SQLAlchemy &lt;code&gt;default=&lt;/code&gt; columns are &lt;code&gt;None&lt;/code&gt; on a freshly constructed Python object until after a &lt;code&gt;flush()&lt;/code&gt; + &lt;code&gt;refresh()&lt;/code&gt; — always set constructor defaults explicitly when domain code runs immediately after creation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>fastapi</category>
      <category>docker</category>
      <category>python</category>
    </item>
    <item>
      <title>Bayesian Knowledge Tracing in 37 lines of Python — how NumPath models what a student knows</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Wed, 27 May 2026 05:23:15 +0000</pubDate>
      <link>https://dev.to/orieken/bayesian-knowledge-tracing-in-37-lines-of-python-how-numpath-models-what-a-student-knows-1if8</link>
      <guid>https://dev.to/orieken/bayesian-knowledge-tracing-in-37-lines-of-python-how-numpath-models-what-a-student-knows-1if8</guid>
      <description>&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;NumPath maintains a &lt;code&gt;KCState&lt;/code&gt; for every student × Knowledge Component pair. After every attempt, &lt;code&gt;update_bkt()&lt;/code&gt; revises the probability that the student has mastered that KC. That probability — &lt;code&gt;p_mastery&lt;/code&gt; — is what the adaptive engine reads to pick the next problem and what the teacher dashboard displays as a progress bar.&lt;/p&gt;

&lt;p&gt;The entire model is 37 lines. Here it is unabridged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="n"&gt;MASTERY_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;KCState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p_mastery&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;p_learn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;p_guess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;p_slip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;opportunity_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_mastered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p_mastery&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MASTERY_THRESHOLD&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_bkt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KCState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_correct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;KCState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Standard Bayesian Knowledge Tracing update (Corbett &amp;amp; Anderson, 1995).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p_mastery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p_learn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p_guess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;p_slip&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_correct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;posterior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;posterior&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;p_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;posterior&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;posterior&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;KCState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;p_mastery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_new&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="n"&gt;p_learn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;p_guess&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;G&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;p_slip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;opportunity_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;opportunity_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Four Parameters
&lt;/h2&gt;

&lt;p&gt;BKT models each KC with four parameters, all probabilities between 0 and 1:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;NumPath default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_mastery&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P(student has learned this KC)&lt;/td&gt;
&lt;td&gt;0.10 (prior — low, conservative)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_learn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P(learning occurs on this attempt, given not yet learned)&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_guess&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P(correct answer given KC &lt;strong&gt;not&lt;/strong&gt; learned)&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_slip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;P(incorrect answer given KC &lt;strong&gt;is&lt;/strong&gt; learned)&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are Phase 1 seed values — not calibrated against real student data yet. The parameter estimation problem (fitting &lt;code&gt;p_learn&lt;/code&gt;, &lt;code&gt;p_guess&lt;/code&gt;, &lt;code&gt;p_slip&lt;/code&gt; per KC from observed attempts) is a Phase 4 task once the RCT produces enough data. For now they are reasonable priors from the BKT literature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Update Equations
&lt;/h2&gt;

&lt;p&gt;After observing an answer, two steps happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Bayesian update&lt;/strong&gt; (prior → posterior):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Correct:   posterior = p(1 - S) / [p(1 - S) + (1 - p)G]
Incorrect: posterior = pS       / [pS       + (1 - p)(1 - G)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is straight Bayes. A correct answer raises the posterior &lt;em&gt;unless&lt;/em&gt; the student is likely to have guessed. An incorrect answer lowers it &lt;em&gt;unless&lt;/em&gt; the student is likely to have slipped. A correct answer from a student with &lt;code&gt;p_mastery=0.95&lt;/code&gt; and &lt;code&gt;p_slip=0.10&lt;/code&gt; barely moves the needle — the model already thinks they know it. A correct answer from a student with &lt;code&gt;p_mastery=0.10&lt;/code&gt; and &lt;code&gt;p_guess=0.20&lt;/code&gt; moves it less than you might expect — the model discounts lucky guesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Learning update&lt;/strong&gt; (posterior → next prior):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p_new = posterior + (1 - posterior) × p_learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if the student answered incorrectly, there's a &lt;code&gt;p_learn&lt;/code&gt; probability that learning occurred anyway. The posterior is never the final state — the learning update always nudges &lt;code&gt;p_mastery&lt;/code&gt; upward slightly, reflecting that every attempt is an opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design Decision
&lt;/h2&gt;

&lt;p&gt;We evaluated three approaches before choosing standard BKT:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item Response Theory (IRT)&lt;/strong&gt; — models item difficulty as well as student ability. More expressive, but requires calibrated item parameters we don't have. Rejected for Phase 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep Knowledge Tracing (DKT)&lt;/strong&gt; — replaces the parametric model with an LSTM that learns latent student state from sequences of attempts. Better at capturing cross-KC transfer. Rejected for Phase 1 because it needs training data we haven't collected yet. It's on the Phase 2 roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy streak&lt;/strong&gt; — raise difficulty after 3 correct in a row, lower after 3 wrong. This is what most commercial apps do. Rejected because it gives you no probability estimate, no per-KC granularity, and no way to distinguish a guesser from a learner.&lt;/p&gt;

&lt;p&gt;Standard BKT is 30 years old and still the right choice when you're instrument-building before data collection. It gives you a per-KC probability estimate with interpretable parameters, it's fast to compute, and its failure modes are well understood.&lt;/p&gt;

&lt;p&gt;One implementation choice worth noting: &lt;code&gt;KCState&lt;/code&gt; is a frozen dataclass. &lt;code&gt;update_bkt()&lt;/code&gt; returns a new &lt;code&gt;KCState&lt;/code&gt; rather than mutating the existing one. This makes the update function a pure function — easy to test, easy to replay, and safe to call in parallel if we ever need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters for the Research
&lt;/h2&gt;

&lt;p&gt;The RCT compares learning outcomes for students using NumPath against a control group using static worksheets. To measure a difference, you need a measurement instrument. &lt;code&gt;p_mastery&lt;/code&gt; is that instrument.&lt;/p&gt;

&lt;p&gt;After a session, the teacher dashboard shows each student's &lt;code&gt;p_mastery&lt;/code&gt; per KC as a progress bar. The adaptive engine uses it to pick the next problem. The LLM insight generator reads it to produce explanations like &lt;em&gt;"Aiden's &lt;code&gt;p_mastery&lt;/code&gt; on &lt;code&gt;SUB_BORROW&lt;/code&gt; is 0.18 — the model has seen 11 attempts and is not converging."&lt;/em&gt; All three downstream consumers depend on the same number being meaningful.&lt;/p&gt;

&lt;p&gt;BKT's key property for research purposes: it's falsifiable. If a student's &lt;code&gt;p_mastery&lt;/code&gt; stays low after 20 correct answers, that's a signal worth investigating — either the parameters are wrong, or the student is consistently guessing, or there's a measurement problem. An accuracy percentage doesn't give you that.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;The model is simple. Getting the parameters right is not.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;p_learn=0.30&lt;/code&gt; means the student has a 30% chance of learning the KC on any given attempt. That sounds reasonable. But it implies that after 10 attempts, a student who has not yet learned the KC has a 97% cumulative chance of learning it — which is almost certainly too optimistic. The seed parameters will need calibration.&lt;/p&gt;

&lt;p&gt;The other thing we learned: &lt;code&gt;opportunity_count&lt;/code&gt; is load-bearing. The adaptive engine uses it as a tiebreaker and the teacher dashboard shows it alongside &lt;code&gt;p_mastery&lt;/code&gt;. It's not computed from the BKT model — it's just a counter that increments on every &lt;code&gt;update_bkt()&lt;/code&gt; call. The frozen dataclass pattern makes this safe: the count in the database is always the count from the last &lt;code&gt;update_bkt()&lt;/code&gt; return value, never a stale mutation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 2 adds a DKT model alongside BKT — trained on the data collected during the pilot. The two models will run in parallel so we can compare their predictions against observed outcomes before the RCT begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BKT separates learning from performance&lt;/strong&gt; — &lt;code&gt;p_guess&lt;/code&gt; and &lt;code&gt;p_slip&lt;/code&gt; let the model discount lucky correct answers and unlucky wrong ones; a 70% accuracy rate means something different depending on what the model thinks caused it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;p_mastery&lt;/code&gt; is the measurement instrument for the RCT&lt;/strong&gt; — every downstream consumer (adaptive engine, teacher dashboard, LLM insights) reads the same number, so getting it right matters more than getting it fast&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frozen dataclass + pure function = safe update chain&lt;/strong&gt; — &lt;code&gt;update_bkt()&lt;/code&gt; returns a new &lt;code&gt;KCState&lt;/code&gt;; there's no shared mutable state, the update is replayable, and the test suite can verify every case in isolation&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>numpath</category>
      <category>adaptivelearning</category>
      <category>bayesian</category>
      <category>python</category>
    </item>
    <item>
      <title>Two Cross-Platform Bugs in Our Go CLI (And How We Fixed Them)</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Wed, 27 May 2026 05:22:56 +0000</pubDate>
      <link>https://dev.to/orieken/two-cross-platform-bugs-in-our-go-cli-and-how-we-fixed-them-4h6l</link>
      <guid>https://dev.to/orieken/two-cross-platform-bugs-in-our-go-cli-and-how-we-fixed-them-4h6l</guid>
      <description>&lt;p&gt;Go's cross-platform story is genuinely good. Write code once, compile for any target, mostly just works. But "mostly" hides a couple of sharp edges that bit us while building TestSmith. Both bugs were invisible on macOS and Linux, only surfaced on Windows CI, and had the same root cause: assumptions about path separators and filesystem traversal boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 1: The Detector Boundary Escape
&lt;/h2&gt;

&lt;p&gt;TestSmith has five language drivers, each responsible for detecting whether a directory is a project of its type. The Python driver walks upward from the starting directory, looking for &lt;code&gt;pyproject.toml&lt;/code&gt; or &lt;code&gt;setup.py&lt;/code&gt;. The Go driver looks for &lt;code&gt;go.mod&lt;/code&gt;. And so on.&lt;/p&gt;

&lt;p&gt;The bug: every driver would happily walk past a &lt;code&gt;.git&lt;/code&gt; directory belonging to a &lt;em&gt;different&lt;/em&gt; project and claim files in an ancestor project.&lt;/p&gt;

&lt;p&gt;Here's what happened in practice. Our example projects live at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testsmith/                  ← Go repo root (.git here)
  examples/
    python-service/          ← Python example project
      pyproject.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you ran &lt;code&gt;testsmith generate&lt;/code&gt; from inside &lt;code&gt;examples/python-service/&lt;/code&gt;, the Python driver would detect it correctly. But when you ran it from &lt;code&gt;examples/go-service/&lt;/code&gt; and the Python driver was tried first during registry detection, it would walk upward, find no Python markers in &lt;code&gt;go-service/&lt;/code&gt;, then continue upward, find no markers in &lt;code&gt;examples/&lt;/code&gt;, then continue upward... find &lt;code&gt;conftest.py&lt;/code&gt; at the testsmith repo root (left over from a previous test run), and claim the entire testsmith repo as a Python project.&lt;/p&gt;

&lt;p&gt;The naive fix is "stop when you see &lt;code&gt;.git&lt;/code&gt;." But that's wrong too — a legitimate project root can have both &lt;code&gt;pyproject.toml&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; a &lt;code&gt;.git&lt;/code&gt; directory. If you stop at the first &lt;code&gt;.git&lt;/code&gt; you see, you'd refuse to detect projects that are also VCS roots.&lt;/p&gt;

&lt;p&gt;The correct rule: check VCS stop markers only at &lt;strong&gt;ancestor&lt;/strong&gt; directories, not at the starting directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;findRoot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startDir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;startDir&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Only check VCS boundaries at ancestor dirs — the starting dir&lt;/span&gt;
        &lt;span class="c"&gt;// may legitimately have both a project marker and a .git directory.&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;startDir&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;stopMarkers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrProjectNotFound&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;rootMarkers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrProjectNotFound&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We applied this pattern to all five drivers. The key insight: &lt;code&gt;.git&lt;/code&gt; is a traversal-stopping sentinel when found in an &lt;em&gt;ancestor&lt;/em&gt;, but it's perfectly normal at the &lt;em&gt;project root itself&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bug 2: Hardcoded Path Separators
&lt;/h2&gt;

&lt;p&gt;The Windows test failure was more direct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;    analyzer_test.go:179: DeriveTestPath("/proj/src/services/payment.py"):
        got "\\proj\\tests\\src\\services\\test_payment.py",
        want "/proj/tests/services/test_payment.py"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things wrong in that output: backslashes (expected on Windows, handled by the test via &lt;code&gt;filepath.ToSlash&lt;/code&gt;), and &lt;code&gt;src&lt;/code&gt; appearing in the output path when it should have been stripped.&lt;/p&gt;

&lt;p&gt;The code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;deriveTestPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sourcePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sourcePath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Strip src/ prefix if present.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"src/"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c"&gt;// ← BUG&lt;/span&gt;
        &lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;filepath.Rel&lt;/code&gt; on Windows returns &lt;code&gt;src\services\payment.py&lt;/code&gt;. The prefix check looks for &lt;code&gt;src/&lt;/code&gt; with a forward slash. On Windows, it never matches. The &lt;code&gt;src&lt;/code&gt; component stays in the path, so the output becomes &lt;code&gt;tests\src\services\test_payment.py&lt;/code&gt; instead of &lt;code&gt;tests\services\test_payment.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix normalises to forward slashes before the check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Normalise to forward slashes for the prefix check so this works on&lt;/span&gt;
&lt;span class="c"&gt;// Windows (where filepath.Rel returns backslash-separated paths).&lt;/span&gt;
&lt;span class="n"&gt;slashed&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToSlash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slashed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"src/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;slashed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slashed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;rel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FromSlash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slashed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;filepath.ToSlash&lt;/code&gt; converts &lt;code&gt;\&lt;/code&gt; to &lt;code&gt;/&lt;/code&gt;. &lt;code&gt;strings.HasPrefix(slashed, "src/")&lt;/code&gt; works correctly on all platforms. &lt;code&gt;filepath.FromSlash&lt;/code&gt; converts back to the OS-native separator for the subsequent &lt;code&gt;filepath.Join&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;The same pattern applied to &lt;code&gt;deriveModulePath&lt;/code&gt;, which had the identical bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Both bugs share a structure: an algorithm that works correctly on the development platform (macOS/Linux) but silently produces wrong results on Windows because it makes assumptions about the filesystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug 1&lt;/strong&gt;: assumes &lt;code&gt;.git&lt;/code&gt; presence implies "not a project root" (wrong at the starting dir)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2&lt;/strong&gt;: assumes &lt;code&gt;filepath.Rel&lt;/code&gt; uses forward slashes (wrong on Windows)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The remedies are similarly structured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug 1&lt;/strong&gt;: be explicit about &lt;em&gt;which&lt;/em&gt; directories the invariant applies to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2&lt;/strong&gt;: normalise to a known format before string operations, then convert back for OS operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go's &lt;code&gt;filepath&lt;/code&gt; package is excellent — &lt;code&gt;filepath.Rel&lt;/code&gt;, &lt;code&gt;filepath.Join&lt;/code&gt;, &lt;code&gt;filepath.Dir&lt;/code&gt;, &lt;code&gt;filepath.Base&lt;/code&gt; all do the right thing. The problems arise when you mix &lt;code&gt;filepath&lt;/code&gt; results with hardcoded string literals (like &lt;code&gt;"src/"&lt;/code&gt;) that embed platform assumptions. The rule: use &lt;code&gt;filepath&lt;/code&gt; functions for path &lt;em&gt;operations&lt;/em&gt;, &lt;code&gt;filepath.ToSlash&lt;/code&gt; to convert &lt;em&gt;before&lt;/em&gt; any string matching, and &lt;code&gt;filepath.FromSlash&lt;/code&gt; to convert &lt;em&gt;back&lt;/em&gt; before passing to OS calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI as the Detector
&lt;/h2&gt;

&lt;p&gt;Neither bug would have been caught by running tests locally on macOS. The Windows CI job was the only place they surfaced.&lt;/p&gt;

&lt;p&gt;This is the case for a real cross-platform test matrix. It's not just about supporting Windows users — it's about finding any code that makes implicit platform assumptions. If your tests only run on one platform, that class of bug is invisible until a user reports it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TestSmith is open source at &lt;a href="https://github.com/orieken/testsmith" rel="noopener noreferrer"&gt;github.com/orieken/testsmith&lt;/a&gt;. The full CI matrix runs on Ubuntu, macOS, and Windows with &lt;code&gt;-race&lt;/code&gt; enabled on all three.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>debugging</category>
      <category>windows</category>
      <category>testing</category>
    </item>
    <item>
      <title>Two Knowledge Hierarchies: Structuring Context for AI Agents and LLMs</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Wed, 27 May 2026 05:22:10 +0000</pubDate>
      <link>https://dev.to/orieken/two-knowledge-hierarchies-structuring-context-for-ai-agents-and-llms-2o6c</link>
      <guid>https://dev.to/orieken/two-knowledge-hierarchies-structuring-context-for-ai-agents-and-llms-2o6c</guid>
      <description>&lt;p&gt;TestSmith has two distinct audiences that need context about the project: AI agents that work &lt;em&gt;on&lt;/em&gt; the TestSmith codebase (helping develop and extend it), and the LLM that generates test code &lt;em&gt;for your project&lt;/em&gt; at runtime. These are different problems with different solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Agent Context — CLAUDE.md Hierarchies
&lt;/h2&gt;

&lt;p&gt;When an AI agent opens TestSmith to fix a bug or add a feature, it needs to understand the codebase structure without reading every file. A single large context file doesn't work well — an agent fixing a retry bug doesn't need to know the Java driver's fixture generation logic.&lt;/p&gt;

&lt;p&gt;The solution is a &lt;code&gt;CLAUDE.md&lt;/code&gt; hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLAUDE.md                              ← package map, invariants, dependency direction
internal/domain/CLAUDE.md             ← interfaces, key types, "add a field" checklist
internal/generation/CLAUDE.md         ← pipeline data flow, verifier selection
internal/llm/CLAUDE.md                ← middleware stack, batch vs fan-out, cache key
internal/projectknowledge/CLAUDE.md   ← TESTSMITH.md hierarchy, budget tiers
internal/drivers/CLAUDE.md            ← how to add an adapter or language driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root file is the map. The per-package files are the territory. An agent touching the LLM retry logic loads &lt;code&gt;internal/llm/CLAUDE.md&lt;/code&gt; — it never sees the driver or generation docs.&lt;/p&gt;

&lt;p&gt;The root file contains three things that every agent needs regardless of task:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Package map&lt;/strong&gt; — what each internal package does and which files to read first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependency direction&lt;/strong&gt; — the hard architectural constraint (&lt;code&gt;domain&lt;/code&gt; never imports other internal packages; &lt;code&gt;drivers&lt;/code&gt; never import &lt;code&gt;generation&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invariants&lt;/strong&gt; — things that must remain true across all changes (e.g., &lt;code&gt;GeneratedFile.Language&lt;/code&gt; must always be set; &lt;code&gt;resolveAction&lt;/code&gt; has specific rules for fixture vs. non-fixture files)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Per-package files contain the "read this before touching this package" context: data flow diagrams for the pipeline, the middleware stack for the LLM layer, the adapter registration pattern for drivers.&lt;/p&gt;

&lt;p&gt;When Claude Code loads a file in a package, it automatically reads that package's &lt;code&gt;CLAUDE.md&lt;/code&gt;. The agent gets exactly what it needs, nothing more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Runtime LLM Context — TESTSMITH.md
&lt;/h2&gt;

&lt;p&gt;This is what TestSmith injects into prompts when generating tests for &lt;em&gt;your&lt;/em&gt; project. It's a conventions file you maintain alongside your source code.&lt;/p&gt;

&lt;p&gt;Two levels are merged at generation time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;project-root&amp;gt;/TESTSMITH.md     ← always loaded; project-wide framework, mock style
&amp;lt;source-dir&amp;gt;/TESTSMITH.md       ← optional; package-level overrides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example root &lt;code&gt;TESTSMITH.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project conventions&lt;/span&gt;

Framework: pytest
Mock style: pytest-mock (use &lt;span class="sb"&gt;`mocker.patch`&lt;/span&gt;, not &lt;span class="sb"&gt;`unittest.mock.patch`&lt;/span&gt;)
Assertion style: plain assert statements

&lt;span class="gh"&gt;# Module structure&lt;/span&gt;
Services are in &lt;span class="sb"&gt;`src/services/`&lt;/span&gt;. Each service has a single public class.
Tests go in &lt;span class="sb"&gt;`tests/`&lt;/span&gt; mirroring the &lt;span class="sb"&gt;`src/`&lt;/span&gt; structure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example per-directory override in &lt;code&gt;src/services/payment/TESTSMITH.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Payment service conventions&lt;/span&gt;
This module integrates with Stripe. Mock all &lt;span class="sb"&gt;`stripe.*`&lt;/span&gt; calls.
Use &lt;span class="sb"&gt;`pytest.mark.vcr`&lt;/span&gt; for HTTP interaction tests.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root file is loaded once at startup and cached in &lt;code&gt;ProjectContext&lt;/code&gt;. The per-directory file is merged lazily — only when a file in that directory is being generated. A large monorepo never loads context it doesn't need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both files go into the system prompt, not the user prompt.&lt;/strong&gt; This matters because the user prompt is subject to a configurable token budget (&lt;code&gt;PromptTokenBudget&lt;/code&gt;, default 6,000 tokens) with a priority-based trim:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Dropped when?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (never)&lt;/td&gt;
&lt;td&gt;Source code&lt;/td&gt;
&lt;td&gt;Never&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Internal dep signatures&lt;/td&gt;
&lt;td&gt;Budget exceeded after source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Style snippet from nearby tests&lt;/td&gt;
&lt;td&gt;Dropped first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Project knowledge is exempt from this budget entirely — it stays in the system prompt regardless of how large the source file is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamically Mined Conventions
&lt;/h2&gt;

&lt;p&gt;Beyond &lt;code&gt;TESTSMITH.md&lt;/code&gt;, TestSmith also mines conventions from existing tests in the same directory — up to 5 files, capped at 80 lines total. This gives the model real examples of the project's test style without requiring the developer to maintain a conventions doc.&lt;/p&gt;

&lt;p&gt;This is cheaper and more accurate than a hand-written guide: it automatically reflects the actual test patterns in use, and it updates itself as tests evolve. If your team starts using a new assertion pattern, the next generation run picks it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dependency Signature Index
&lt;/h2&gt;

&lt;p&gt;The third piece is the dep index: at the start of a &lt;code&gt;--all&lt;/code&gt; run, TestSmith analyses every source file once and builds a &lt;code&gt;modulePath → SourceAnalysis&lt;/code&gt; map. When generating tests for &lt;code&gt;payment.go&lt;/code&gt;, it can pull the public API signature of &lt;code&gt;discount.go&lt;/code&gt; (which &lt;code&gt;payment.go&lt;/code&gt; imports) from memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// In the prompt:&lt;/span&gt;
&lt;span class="c"&gt;// Internal dependency signatures:&lt;/span&gt;
&lt;span class="c"&gt;// discount.ApplyPromoCode(order Order, code string) (Order, error)&lt;/span&gt;
&lt;span class="c"&gt;// discount.ValidateCode(code string) bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells the model what the real interface looks like so it generates test doubles that match the actual signatures — not invented ones.&lt;/p&gt;

&lt;p&gt;In watch mode, when a file changes, only that file's entry is refreshed. The rest of the index stays warm between regens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Separation Matters
&lt;/h2&gt;

&lt;p&gt;The two layers solve different problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent context&lt;/strong&gt; is about &lt;em&gt;development-time&lt;/em&gt; navigation. It's hierarchical, human-readable, and loaded selectively. It describes architecture and invariants. It lives in the repo and is maintained alongside the code it describes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runtime LLM context&lt;/strong&gt; is about &lt;em&gt;generation-time&lt;/em&gt; quality. It's merged from two levels, injected into system prompts, and exempt from token budgets. It describes conventions and patterns specific to the target project — things an LLM can't infer from source code alone.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conflating the two leads to either bloated system prompts (dumping agent context into every generation request) or under-informed agents (giving them only the user-facing conventions doc with no architectural guidance). Keeping them separate means each audience gets exactly what it needs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next: the cross-platform bugs we hit shipping a Go CLI — detector boundary escapes and Windows path separators.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudeai</category>
      <category>llm</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Sat, 23 May 2026 16:34:37 +0000</pubDate>
      <link>https://dev.to/orieken/making-llm-calls-reliable-retry-semaphore-cache-and-batch-46fi</link>
      <guid>https://dev.to/orieken/making-llm-calls-reliable-retry-semaphore-cache-and-batch-46fi</guid>
      <description>&lt;p&gt;When TestSmith generates tests with &lt;code&gt;--llm&lt;/code&gt;, it calls an LLM for every public member of every source file being processed. A project with 20 files and 5 public functions each means up to 100 API calls in a single run. That's a lot of surface area for things to go wrong.&lt;/p&gt;

&lt;p&gt;Here's the reliability stack we built, layer by layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Retry with Exponential Backoff
&lt;/h2&gt;

&lt;p&gt;LLM APIs fail transiently. Rate limits, timeouts, occasional 5xx responses — all of these are recoverable if you wait and retry. We built a retry middleware that wraps any &lt;code&gt;Provider&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RetryProvider&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;inner&lt;/span&gt;      &lt;span class="n"&gt;Provider&lt;/span&gt;
    &lt;span class="n"&gt;maxRetries&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RetryProvider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CompletionResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;wait&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CompletionResponse&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;lastErr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CompletionResponse&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"after %d attempts: %w"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lastErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MaxRetryAttempts&lt;/code&gt; defaults to 3. With exponential backoff: attempt 1 is immediate, attempt 2 waits 200ms, attempt 3 waits 400ms. Total worst-case wait per call is under a second — acceptable latency for a background tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Semaphore for Concurrency Control
&lt;/h2&gt;

&lt;p&gt;With up to 100 calls to make, goroutine fan-out is the obvious approach. But hitting an LLM API with 100 concurrent requests triggers rate limiting immediately. A semaphore caps the in-flight calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;SemaphoreProvider&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;
    &lt;span class="n"&gt;sem&lt;/span&gt;   &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewSemaphoreProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxConcurrent&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SemaphoreProvider&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;SemaphoreProvider&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;maxConcurrent&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SemaphoreProvider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CompletionResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CompletionResponse&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;MaxConcurrentCalls&lt;/code&gt; defaults to 5. Each retry attempt acquires its own semaphore slot — this is important. If retry logic held a slot while waiting between attempts, other goroutines would be blocked unnecessarily. The retry wrapper is the outer layer; semaphore is the inner layer.&lt;/p&gt;

&lt;p&gt;The middleware stack assembled by the factory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;retry → semaphore → raw provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Result Cache
&lt;/h2&gt;

&lt;p&gt;Many test generation runs touch the same files repeatedly — watch mode is the extreme case. Calling the LLM for the same source code twice is wasteful. A content-addressed cache avoids it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ResultCache&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;      &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RWMutex&lt;/span&gt;
    &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="n"&gt;BodyGenResult&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt;    &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;misses&lt;/span&gt;  &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;BodyGenRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MemberName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SourceCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Framework&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is a SHA-256 hash of the language, member name, source code, and framework. If the source file changes, the hash changes and the cache misses — you always get fresh results for changed code.&lt;/p&gt;

&lt;p&gt;After a run, &lt;code&gt;--verbose&lt;/code&gt; prints the cache stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM cache — hits: 12  misses: 8  entries: 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 4: Batch Generation
&lt;/h2&gt;

&lt;p&gt;The fan-out approach makes one API call per public member. For a file with 10 functions, that's 10 calls. Batch generation collapses this to one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;LLMBodyGenerator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GenerateBatchBodies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reqs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;BodyGenRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;BodyGenResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;buildBatchPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reqs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompletionRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;SystemPrompt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;batchSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;UserPrompt&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;          &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MaxTokens&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxTokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reqs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c"&gt;// scale with request count&lt;/span&gt;
        &lt;span class="n"&gt;Temperature&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ResponseFormat&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"json_object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c"&gt;// structured output&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use OpenAI's &lt;code&gt;response_format: {"type": "json_object"}&lt;/code&gt; to get structured output. The model returns a JSON envelope with one entry per member:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProcessPayment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"func TestProcessPayment(t *testing.T) { ... }"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RefundPayment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"func TestRefundPayment(t *testing.T) { ... }"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We parse that with a primary JSON parser, with a fallback to a delimiter-regex parser for providers that don't support structured output.&lt;/p&gt;

&lt;p&gt;The pipeline checks for the &lt;code&gt;BatchBodyGenerator&lt;/code&gt; interface via type assertion. If the generator implements it, batch mode is used. If not (or if the driver explicitly opts out), it falls back to goroutine fan-out with individual calls. This keeps the interface opt-in and backward compatible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability: Cache Stats
&lt;/h2&gt;

&lt;p&gt;With all this happening in the background, it's useful to know what actually ran. The &lt;code&gt;cacheStatsReporter&lt;/code&gt; interface lets the CLI query stats without importing the &lt;code&gt;llm&lt;/code&gt; package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// In cmd/testsmith/generate.go — avoids importing internal/llm from the CLI layer&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;cacheStatsReporter&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;CacheStats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;misses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;printCacheStats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bg&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BodyGenerator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cacheStatsReporter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;misses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CacheStats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"LLM cache — hits: %d  misses: %d  entries: %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;misses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the interface segregation principle at work: the CLI knows about &lt;code&gt;domain.BodyGenerator&lt;/code&gt; (which it needs for the pipeline) and &lt;code&gt;cacheStatsReporter&lt;/code&gt; (which it needs for stats output). It doesn't need to know anything else about the LLM implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;In practice, on a mid-size Go project with 40 source files and an average of 6 public functions each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without batch&lt;/strong&gt;: 240 API calls, ~4 minutes at 5 concurrent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With batch&lt;/strong&gt;: 40 API calls (one per file), ~45 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second run with warm cache&lt;/strong&gt;: near-instant for unchanged files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cache and batch generation together turn what would be a "go make coffee" operation into something you can run while you're still in the flow.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next: how we structure context for both AI agents working on TestSmith itself and for the LLM generating tests for your project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Language-Agnostic Code Generation: The Driver Plugin Model</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Sat, 23 May 2026 16:29:11 +0000</pubDate>
      <link>https://dev.to/orieken/language-agnostic-code-generation-the-driver-plugin-model-2ip0</link>
      <guid>https://dev.to/orieken/language-agnostic-code-generation-the-driver-plugin-model-2ip0</guid>
      <description>&lt;p&gt;TestSmith generates test scaffolds for five languages: Go, Python, TypeScript, Java, and C#. Each language has its own project structure conventions, test frameworks, import styles, and code patterns. The naive implementation would be a big &lt;code&gt;switch&lt;/code&gt; statement throughout the codebase. We chose a plugin model instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Hardcoded Branches
&lt;/h2&gt;

&lt;p&gt;When a codebase switches on language in multiple places, every new language requires touching every branch point. Miss one and you get a silent bug — the new language falls through to some default behavior that doesn't apply to it. This is the classic Open-Closed violation: you have to modify existing code to extend it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LanguageDriver Interface
&lt;/h2&gt;

&lt;p&gt;Every language in TestSmith implements a single interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LanguageDriver&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Detection&lt;/span&gt;
    &lt;span class="n"&gt;DetectProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;FileExtensions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;

    &lt;span class="c"&gt;// Analysis&lt;/span&gt;
    &lt;span class="n"&gt;AnalyzeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SourceAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ClassifyDependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dep&lt;/span&gt; &lt;span class="n"&gt;ImportInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;DependencyCategory&lt;/span&gt;
    &lt;span class="n"&gt;DeriveTestPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sourcePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeriveModulePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sourcePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Generation&lt;/span&gt;
    &lt;span class="n"&gt;GenerateTestFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SourceAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="n"&gt;GenerateOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GeneratedFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;GenerateFixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dep&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SourceAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="n"&gt;GenerateOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GeneratedFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;GenerateBootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GenerationPlan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GeneratedFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Framework config&lt;/span&gt;
    &lt;span class="n"&gt;GetTestFrameworkConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;TestFrameworkConfig&lt;/span&gt;
    &lt;span class="n"&gt;SelectAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;TestAdapter&lt;/span&gt;

    &lt;span class="c"&gt;// LLM integration&lt;/span&gt;
    &lt;span class="n"&gt;LLMContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;LLMVocabulary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;

    &lt;span class="c"&gt;// Migration and validation&lt;/span&gt;
    &lt;span class="n"&gt;ListMigrators&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Migrator&lt;/span&gt;
    &lt;span class="n"&gt;ValidateFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;ValidationIssue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generation pipeline, the CLI commands, and the watch mode all work against this interface. They never import a specific driver package.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Detection Works
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;testsmith generate&lt;/code&gt;, the first step is figuring out what language you're in. The registry tries each registered driver in turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LanguageDriver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;drivers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DetectProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrProjectNotFound&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each driver's &lt;code&gt;DetectProject&lt;/code&gt; walks upward from the starting directory looking for its own project markers — &lt;code&gt;go.mod&lt;/code&gt; for Go, &lt;code&gt;pyproject.toml&lt;/code&gt; or &lt;code&gt;setup.py&lt;/code&gt; for Python, &lt;code&gt;package.json&lt;/code&gt; for TypeScript, &lt;code&gt;pom.xml&lt;/code&gt; or &lt;code&gt;build.gradle&lt;/code&gt; for Java, &lt;code&gt;.csproj&lt;/code&gt; or &lt;code&gt;.sln&lt;/code&gt; for C#.&lt;/p&gt;

&lt;p&gt;One subtle requirement: a driver must not claim an ancestor project that belongs to a different language. If you run TestSmith from inside an example project that lives inside a Go repo, the Python driver shouldn't walk up past the Go project's &lt;code&gt;.git&lt;/code&gt; boundary and claim the repo root. We solve this by checking VCS stop markers (&lt;code&gt;.git&lt;/code&gt;, &lt;code&gt;.hg&lt;/code&gt;, &lt;code&gt;.svn&lt;/code&gt;) at ancestor directories only — not at the starting directory itself, since a legitimate project root can have both a project marker and a &lt;code&gt;.git&lt;/code&gt; directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;findRoot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startDir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;startDir&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// At ancestor dirs, stop at VCS boundaries first.&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;startDir&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;stopMarkers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrProjectNotFound&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c"&gt;// Then check for project markers.&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;rootMarkers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrProjectNotFound&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Adapter Layer
&lt;/h2&gt;

&lt;p&gt;Within a language, there can be multiple test frameworks. TypeScript has Jest, Vitest, and Mocha. Java has JUnit 4, JUnit 5, TestNG, and Spring Boot Test. Each framework has its own import style, mock library, assertion syntax, and file naming conventions.&lt;/p&gt;

&lt;p&gt;We model this with a &lt;code&gt;TestAdapter&lt;/code&gt; interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TestAdapter&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;FileNamingConvention&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;FileNaming&lt;/span&gt;
    &lt;span class="n"&gt;ImportStyle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;ImportStyle&lt;/span&gt;
    &lt;span class="n"&gt;MockLibrary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;AssertionStyle&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;LLMVocabulary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each driver has a registry of adapters and a &lt;code&gt;SelectAdapter&lt;/code&gt; method that reads the project config (or sniffs &lt;code&gt;package.json&lt;/code&gt; devDependencies, &lt;code&gt;pom.xml&lt;/code&gt; dependencies, etc.) to pick the right one. The LLM prompt gets the vocabulary from the selected adapter — so the model knows to generate &lt;code&gt;expect(x).toBe(y)&lt;/code&gt; for Jest but &lt;code&gt;assert.Equal(t, x, y)&lt;/code&gt; for Go's testify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding a New Language
&lt;/h2&gt;

&lt;p&gt;Because everything flows through the interface, adding a new language driver is isolated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new package under &lt;code&gt;internal/drivers/&amp;lt;lang&amp;gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Implement &lt;code&gt;domain.LanguageDriver&lt;/code&gt; — the compiler tells you exactly what's missing&lt;/li&gt;
&lt;li&gt;Register it in &lt;code&gt;internal/registry/registry.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally add a &lt;code&gt;Verifier&lt;/code&gt; in &lt;code&gt;internal/generation/verify.go&lt;/code&gt; for post-write compile checking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No other files change. The existing drivers are untouched. The pipeline, CLI, and watch mode pick it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dependency Direction
&lt;/h2&gt;

&lt;p&gt;The plugin model enforces a strict dependency direction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmd → generation → domain ← drivers
                           ← llm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;domain&lt;/code&gt; defines the interfaces. &lt;code&gt;drivers&lt;/code&gt; implement them. &lt;code&gt;generation&lt;/code&gt; uses them via the interface. Neither &lt;code&gt;generation&lt;/code&gt; nor &lt;code&gt;drivers&lt;/code&gt; imports the other. This is the Dependency Inversion Principle applied at the package level — and it's enforced by Go's import cycle detector.&lt;/p&gt;

&lt;p&gt;When you add a new driver, it's impossible to accidentally reach into the generation pipeline or the LLM layer — Go won't compile it. The architecture is self-enforcing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next in this series: making LLM calls reliable when you're hitting them for every public member of every source file.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>architecture</category>
      <category>designpatterns</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Why We Rewrote Our Python CLI in Go (and What We Gained)</title>
      <dc:creator>Oscar Rieken</dc:creator>
      <pubDate>Sat, 23 May 2026 16:23:38 +0000</pubDate>
      <link>https://dev.to/orieken/why-we-rewrote-our-python-cli-in-go-and-what-we-gained-21bg</link>
      <guid>https://dev.to/orieken/why-we-rewrote-our-python-cli-in-go-and-what-we-gained-21bg</guid>
      <description>&lt;p&gt;TestSmith v1 was a Python CLI. It worked. Users could &lt;code&gt;pip install testsmith&lt;/code&gt;, point it at a source file, and get a test scaffold back. But every team that tried to wire it into CI hit the same wall: Python environments.&lt;/p&gt;

&lt;p&gt;The problem wasn't Python itself — it was distribution. A static analysis tool that requires a matching Python version, a virtual environment, and a pinned dependency tree is a hard sell for a step that runs on every push. We were shipping a tool, not a library. Tools should be frictionless.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision
&lt;/h2&gt;

&lt;p&gt;We rewrote TestSmith v2 in Go. The goal was a single static binary with no runtime dependencies — something you could drop into any CI runner, any Docker image, any developer's PATH, and it would just work.&lt;/p&gt;

&lt;p&gt;Go was the right choice for three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single binary.&lt;/strong&gt; &lt;code&gt;go build&lt;/code&gt; produces one self-contained executable. No pip, no venv, no &lt;code&gt;requirements.txt&lt;/code&gt;. Users download a binary or &lt;code&gt;brew install&lt;/code&gt; it and they're done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-platform with one build.&lt;/strong&gt; The v1 CI matrix was a headache — different Python versions across Ubuntu, macOS, and Windows, with slightly different behavior on each. Go's cross-compilation gave us &lt;code&gt;linux/amd64&lt;/code&gt;, &lt;code&gt;darwin/amd64&lt;/code&gt;, &lt;code&gt;darwin/arm64&lt;/code&gt;, and &lt;code&gt;windows/amd64&lt;/code&gt; from a single build step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native concurrency.&lt;/strong&gt; Test generation is embarrassingly parallel — each file is independent. Go's goroutines and channels made the fan-out generation and the debounced file watcher straightforward to implement without pulling in async libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;The command surface was cleaned up in the same pass. v1 used flags for everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;testsmith &amp;lt;file&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="c"&gt;# generate&lt;/span&gt;
&lt;span class="gp"&gt;testsmith --all          #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;generate all
&lt;span class="gp"&gt;testsmith --graph        #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;dependency graph
&lt;span class="gp"&gt;testsmith --prune        #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;prune stale fixtures
&lt;span class="gp"&gt;testsmith --watch        #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;watch mode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;v2 uses proper subcommands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;testsmith generate &amp;lt;file&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="go"&gt;testsmith generate --all
testsmith graph
testsmith prune
testsmith watch
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made shell completion, help text, and per-command flags much cleaner. Cobra's built-in completion generator gives us bash, zsh, fish, and PowerShell completion for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Change
&lt;/h2&gt;

&lt;p&gt;v1 was monolithic Python — one codebase with hardcoded branches for each language it supported. Adding a new language meant editing multiple core files.&lt;/p&gt;

&lt;p&gt;v2 uses a &lt;code&gt;LanguageDriver&lt;/code&gt; interface. Each language (Go, Python, TypeScript, Java, C#) is a separate package that implements the interface. The core generation pipeline never knows which language it's dealing with — it just calls through the interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LanguageDriver&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;DetectProject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;AnalyzeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SourceAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeriveTestPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sourcePath&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ProjectContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;GenerateTestFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SourceAnalysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="n"&gt;GenerateOpts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GeneratedFile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ... and more&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding a new language is now a matter of creating a new package and registering it — no changes to the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Kept
&lt;/h2&gt;

&lt;p&gt;The v1 Python package didn't disappear. It lives in &lt;code&gt;archive/v1/&lt;/code&gt; and continues to receive bug fixes during the transition period. Teams already using v1 in production don't need to migrate immediately. The v2 binary is a clean break for new users; v1 stays stable for existing ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Was It Worth It?
&lt;/h2&gt;

&lt;p&gt;Yes, unambiguously. The CI story went from "install Python, set up venv, pin deps" to "download one binary." The Windows test matrix went from flaky (Python path issues) to clean. And the plugin architecture means we can add a Ruby or Rust driver without touching the core generation logic.&lt;/p&gt;

&lt;p&gt;The rewrite took longer than a feature would have. But distribution friction is a silent project killer — nobody files a bug for "this was annoying to set up," they just stop using the tool.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TestSmith is an open-source CLI for generating test scaffolds across Go, Python, TypeScript, Java, and C#. The source is at &lt;a href="https://github.com/orieken/testsmith" rel="noopener noreferrer"&gt;github.com/orieken/testsmith&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>python</category>
      <category>cli</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
