DEV Community: Oscar Rieken

Toward an AI Operating System: Context Engineering as the First Runtime Primitive

Oscar Rieken — Wed, 08 Jul 2026 01:55:33 +0000

After publishing the first few posts about ai-assistant-dot-files, the obvious next question is:

Where does this go?

The tempting answer is "more agents."

I do not think that is the right answer.

The more interesting direction is an AI Operating System: not an OS in the kernel-and-device-driver sense, but a runtime model for governed agentic work.

That distinction matters. The goal is not to wrap every task in bigger orchestration theater. The goal is to ask what runtime primitives agentic systems actually need if they are going to be useful, inspectable, and safe enough to trust with real software work.

The first primitive is context.

Why context comes first

In the current framework, Context Engineering is already doing operating-system-shaped work.

The context-engineer agent builds a context-manifest.md before the delivery pipeline starts. That manifest scopes relevant files, prior deliveries, Knowledge Items, ADRs, and token budget pressure.

That is not just prompt hygiene.

It is resource management.

An operating system decides what a process can see, which resources it can access, and how much room it has to work before it starts corrupting other work. Context Engineering plays a similar role for agents.

It answers:

What should this agent see?
What should it not see?
Which prior knowledge is relevant?
Which stale artifact should be summarized instead of loaded in full?
How much of the context budget should this phase consume?
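A minimal sketch of what answering those questions as resource management could look like. The `ManifestEntry` and `build_manifest` names, the relevance score, the paths, and the token numbers are all illustrative assumptions, not the repo's actual manifest format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    path: str         # hypothetical artifact path
    tokens: int       # estimated cost of loading the artifact in full
    relevance: float  # 0.0 to 1.0, assumed to come from a scoping step


def build_manifest(
    entries: list[ManifestEntry], budget: int
) -> tuple[list[ManifestEntry], list[ManifestEntry]]:
    """Greedily admit the most relevant entries that still fit the budget.

    Returns (included, deferred). Deferred entries are candidates for
    summarizing instead of loading in full.
    """
    included: list[ManifestEntry] = []
    deferred: list[ManifestEntry] = []
    remaining = budget
    for entry in sorted(entries, key=lambda e: e.relevance, reverse=True):
        if entry.tokens <= remaining:
            included.append(entry)
            remaining -= entry.tokens
        else:
            deferred.append(entry)
    return included, deferred


entries = [
    ManifestEntry("docs/adrs/ADR-001.md", tokens=1200, relevance=0.9),
    ManifestEntry("src/billing/service.py", tokens=3000, relevance=0.8),
    ManifestEntry("docs/old-delivery-notes.md", tokens=6000, relevance=0.3),
]
included, deferred = build_manifest(entries, budget=5000)
```

The greedy pass is deliberately naive. The point is only that "what should this agent see?" becomes an explicit, inspectable decision with a number attached.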

That is why I keep coming back to the phrase: treat the context window like a budget, not a junk drawer.

If an AOS exists, context is one of its schedulable resources.

The current system is not AOS yet

The repo today is a Context Engineering Framework.

It has one canonical shared/ layer, 24 agents, 53 skills, inter-agent contracts, a memory lifecycle, and platform projections for Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini/Antigravity, and Codex.

That is real.

The AOS idea is earlier.

The notes in docs/aos/AOS_Governance_Design_Pack.zip describe design seeds: capability, governance, learning, memory engineering, context engineering, and continuous improvement. They sketch pairs like:

Context Engineer ↔ Context Auditor
Memory Engineer ↔ Memory Auditor
Prompt Architect ↔ Prompt Evaluator
Orchestrator ↔ Scheduler
Learning Engine ↔ Forgetting Engine
Cost Optimizer ↔ Quality Optimizer

That is not a shipped runtime.

It is a question set.

And honestly, that is the part I trust most. A premature AOS would be easy to overbuild. A useful AOS has to start by finding which governance gaps are real.

Governance is the operating model

The strongest thing in the current framework is not the number of agents.

It is that each important handoff has some kind of counterbalance.

docs/AGENT_REFERENCE.md names four kinds:

structural contracts
downstream agent review
human approval gates
aggregate or delayed metrics

That model is small, but it changes the conversation.

Instead of asking "Can the agent do the task?" we ask "What checks the agent's work?"

That is the AOS-flavored question.

An operating system is not just a place where programs run. It is a place where programs run under rules: permissions, scheduling, memory boundaries, process isolation, accounting, cleanup.

For agents, the equivalent rules are things like:

Does this agent have the right context and only the right context?
Is its output structurally valid?
Is there an independent reviewer where judgment matters?
Are irreversible actions gated by human approval?
Are repeated failures converted into learning rather than buried in chat history?
Is stale or duplicated knowledge expired instead of retrieved forever?

Those questions are less flashy than "autonomous swarm."

Good. Flash is usually where the bugs breed.

Memory is the second runtime primitive

If context is what an agent sees now, memory is what survives the run.

The current memory model uses a promotion lifecycle:

Capture -> Candidate -> Audit -> Approve -> Index -> Retrieve -> Expire

That lifecycle is important because it refuses to treat memory as a pile of saved notes.
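One way to make "not a pile of saved notes" concrete is to treat the lifecycle as a small state machine. This is a sketch under a simplifying assumption: every stage advances only to the next one, in the order listed above. The real lifecycle also has exits (a rejected candidate never reaches Index), which this version leaves out. The `Stage` and `advance` names are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    CAPTURE = "capture"
    CANDIDATE = "candidate"
    AUDIT = "audit"
    APPROVE = "approve"
    INDEX = "index"
    RETRIEVE = "retrieve"
    EXPIRE = "expire"


# Enum members iterate in definition order, so each stage maps to its successor.
_ORDER = list(Stage)
NEXT_STAGE = dict(zip(_ORDER, _ORDER[1:]))


def advance(current: Stage, target: Stage) -> Stage:
    """Move to `target` only if it is the immediate successor of `current`."""
    if NEXT_STAGE.get(current) is not target:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

With this shape, a capture cannot jump straight to Index: `advance(Stage.CAPTURE, Stage.INDEX)` raises instead of silently writing to durable memory.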

In an AOS model, memory needs governance the same way context does.

Not every observation deserves to become durable. Not every durable item deserves to live forever. Not every retrieved item deserves to enter the active context window.

The AOS notes call out the pair:

Memory Engineer ↔ Memory Auditor

That feels right. One side promotes and organizes. The other side asks whether the memory is reusable, non-duplicative, supported by evidence, and still true.

Memory without forgetting is just entropy with a search box.

Entropy management might be the underrated subsystem

One of the AOS notes sketches an Entropy Manager.

Its job:

remove duplicate knowledge
detect stale docs
detect unused agents
merge overlapping rules
reduce repository entropy
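The first item on that list is the easiest to sketch. The following is an assumption about how duplicate detection might start, not something the AOS notes specify: hash normalized content and group files that collide. The `normalize` and `find_duplicates` names are hypothetical, and exact-match hashing will miss near-duplicates that a real Entropy Manager would care about.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(text.lower().split())


def find_duplicates(items: dict[str, str]) -> list[list[str]]:
    """Group item names whose normalized content is identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, text in items.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


duplicates = find_duplicates(
    {
        "a.md": "Use  symlinks",
        "b.md": "use symlinks\n",
        "c.md": "generate files",
    }
)
```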

That is not glamorous.

It might be essential.

Any agent framework that learns will also accumulate. It will accumulate rules, prompts, skills, memories, exceptions, platform quirks, and "temporary" workarounds that quietly become permanent.

Without an entropy-management function, learning turns into clutter.

The forgotten Cursor symlink story is a small example. The repo already had .cursor/agents and .cursor/skills symlinked to shared/, but the decision was not documented, checked, or integrated into the platform model. The fix was not only to make it work. The fix was to make it maintained.

That is AOS territory too.

Not "can the system do the thing once?"

"Can the system preserve the reason it does the thing?"

What I do not want AOS to become

There are a few traps I want to avoid.

First: AOS should not become a cooler name for a giant prompt library.

Second: it should not pretend every tool has the same capabilities. The current framework already learned that lesson through its platform tier system. Claude Code, Cursor, Copilot, Gemini, Windsurf, and Codex do not expose the same runtime primitives.

Third: it should not optimize for maximum autonomy by default.

Autonomy without counterbalances is not maturity. It is just speed with a longer blast radius.

The goal is governed agency: more work can happen through agents because the system knows where to place boundaries, reviews, summaries, approvals, and forgetting.

The direction

If the current framework answers:

"How do we define agents, skills, rules, context, memory, and governance once, then project them into many AI coding tools?"

Then AOS asks:

"What runtime model lets those agents operate with explicit context, memory, scheduling, permissions, fitness functions, and entropy control?"

That is the north star.

But the path there should stay grounded:

Improve Context Engineering first.
Add Context Auditing before adding more autonomy.
Keep Memory Engineering evidence-based and approval-driven.
Track fitness functions like context precision, retrieval quality, token efficiency, memory quality, and entropy.
Treat every new subsystem as guilty until it proves it closes a real gap.

The point is not to build an AI Operating System because the metaphor sounds good.

The point is to discover which parts of software delivery become safer and more comprehensible when agents run inside a governed runtime instead of a chat transcript.

That is the direction I want to explore.

Source trail

README.md — current framework shape: canonical shared/ layer, 24 agents, 53 skills, six platform targets.
docs/runbooks/context-engineering.md (https://github.com/orieken/ai-assistant-dot-files/tree/main/docs/runbooks/context-engineering.md) — Context/Memory/Learning distinction and context manifest role.
docs/runbooks/memory-engineering.md — memory promotion lifecycle.
docs/AGENT_REFERENCE.md — counterbalance model for every current agent.
docs/aos/AOS_Governance_Design_Pack.zip — exploratory AOS notes, especially 00-AOS-Vision.md, 01-Governance-Checks-and-Balances.md, 02-Context-Governance.md, 08-Fitness-Functions.md, and 09-Entropy-Manager.md.
docs/features/context-engineering-framework/TODO.md — Epic 30 forgotten symlink story and the concrete drift-prevention fix.

The Forgotten Symlink: Why 'It Works' Is Not the Same as 'It Is Maintained'

Oscar Rieken — Wed, 08 Jul 2026 01:13:08 +0000

The best bug in a tooling repo is the one where you discover someone already solved your problem.

The worst version is realizing nobody remembered.

While adding native Cursor agent and skill support to ai-assistant-dot-files, we found that .cursor/agents and .cursor/skills already existed in the repo as symlinks.

They pointed to ../shared/agents and ../shared/skills.

That was exactly the design we needed.

They had been committed on 2026-04-09 in commit d0b54d3, with the message "expanded to work for all platforms."

Then they faded out of the system's working memory.

The context

The framework has one canonical source of truth: shared/.

Agents, skills, rules, contracts, Knowledge Items, the domain dictionary, team topology, and platform registry all live there. Platform-specific files are either symlinked or generated from that canonical layer.

That is how one repo can project the same rules into Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini/Antigravity, and OpenAI Codex.

The repo has a capability tier system because those tools do not all support the same primitives.

Claude Code has full native agent orchestration. Cursor used to be treated mainly as a rules/persona target, with generated .mdc files. Then Cursor shipped native Agent Skills and subagents:

.cursor/skills/*/SKILL.md
.cursor/agents/*.md

That changed the design.

The right move was not to generate Cursor-specific copies of the agents and skills. The right move was to symlink Cursor directly to shared/, just like Claude Code already did.

And then we found out the repo already had those symlinks.

The missing part was not the symlink

The symlinks existed.

What did not exist was a system that made them visible, verified, and meaningful.

They were created before the capability tier system explained why they mattered. They were not part of the parity check. They were not clearly represented in the platform registry. They were not woven into the install flow as a first-class design choice.

So the fix was not "create symlinks."

The fix was:

document Cursor's mixed strategy in docs/ARCHITECTURE.md
update shared/platform-registry.json
teach install.sh to symlink .cursor/agents and .cursor/skills
update scripts/check-parity.sh so those symlinks cannot silently disappear
update the README capability matrix with the real Cursor behavior

That last part matters most.

If a behavior is important but not checked, it is folklore.

Folklore decays.
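To show what "checked" means here, this is a Python stand-in for the kind of check scripts/check-parity.sh performs. The real script is shell and its contents are not reproduced in this post, so treat `check_symlinks` and `EXPECTED_LINKS` as hypothetical names. Only the two link paths and their targets come from the story above.

```python
import os
import tempfile
from pathlib import Path

# Expected link -> target pairs, relative to the repo root.
EXPECTED_LINKS = {
    ".cursor/agents": "shared/agents",
    ".cursor/skills": "shared/skills",
}


def check_symlinks(repo_root: Path) -> list[str]:
    """Return human-readable problems; an empty list means parity holds."""
    problems = []
    for link, target in EXPECTED_LINKS.items():
        link_path = repo_root / link
        if not link_path.is_symlink():
            problems.append(f"{link} is missing or is not a symlink")
        elif link_path.resolve() != (repo_root / target).resolve():
            problems.append(f"{link} does not point at {target}")
    return problems


# Demo in a throwaway directory (assumes a platform that allows symlinks).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "shared" / "agents").mkdir(parents=True)
    (root / "shared" / "skills").mkdir(parents=True)
    (root / ".cursor").mkdir()
    os.symlink("../shared/agents", root / ".cursor" / "agents")
    problems_before = check_symlinks(root)  # skills link not created yet
    os.symlink("../shared/skills", root / ".cursor" / "skills")
    problems_after = check_symlinks(root)
```

A check like this would fail the moment someone replaces a symlink with a copied directory, which is exactly the quiet kind of drift the story is about.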

Drift is not only duplication

When people talk about configuration drift, they usually mean duplicated text getting out of sync.

That is one kind.

This was another: a correct structural decision existed, but the rest of the system did not know how to protect it.

The repo already had a parity script because earlier versions had copied instructions into .cursorrules, copilot-instructions.md, and CLAUDE.md independently. That drift was easy to see once you knew to look for it.

The forgotten symlink was quieter.

It did not fail loudly. It just stopped influencing future design.

"It works" is a weak invariant

There is a tempting maintenance posture that says: if the file exists and the app loads it, we are done.

This story made the opposite case.

For cross-tool AI configuration, "it works" is not enough. The stronger invariant is:

the source of truth is named
each projection mechanism is documented
platform capability differences are explicit
parity checks fail when the projection drifts
install behavior recreates the intended shape

That turns a one-off fix into a maintained feature.

The useful lesson

The lesson is not "use symlinks."

Sometimes copying is right. Sometimes generation is right. Sometimes a platform cannot follow references, so inlining is the only honest option. Cursor rules still need generated .mdc files with inlined content, even though Cursor agents and skills can symlink directly to shared/.

The lesson is: make the maintenance contract match the platform's actual capabilities.

For Cursor, the result is now mixed:

agents and skills: symlink to shared/
rules: generate fully inlined .mdc files

That mixed strategy is less elegant than pretending everything works the same way.

It is also truer.

And in tooling, true ages better than elegant.

Source trail

docs/features/context-engineering-framework/TODO.md — Epic 30 Cursor native skills/agents parity and the 2026-04-09 symlink rediscovery.
docs/ARCHITECTURE.md — Cursor mixed strategy and capability tier update.
shared/platform-registry.json — Cursor platform notes and install strategy.
scripts/check-parity.sh — explicit .cursor/agents and .cursor/skills symlink checks.
README.md — updated platform capability matrix.

Memory Engineering Is a Promotion Pipeline, Not a Pile of Notes

Oscar Rieken — Wed, 08 Jul 2026 00:08:10 +0000

A lot of AI memory systems start with the same temptation:

"Just save the useful thing."

That sounds harmless until the knowledge base becomes a junk drawer. Half the notes are too specific, a few are duplicates, some are obsolete, and nobody knows which ones the agent should trust.

In ai-assistant-dot-files, the memory system is deliberately slower.

It uses a promotion lifecycle:

Capture -> Candidate -> Audit -> Approve -> Index -> Retrieve -> Expire

That lifecycle is documented in docs/runbooks/memory-engineering.md, and the important word is not "capture."

It is "candidate."

Nothing writes directly to memory

The framework has a durable memory layer: Knowledge Items in shared/knowledge/, ADRs in docs/adrs/, the domain dictionary, team topology, a feature archive, and a registry at shared/memory-registry.json.

But a lesson from a delivery does not jump straight into shared/knowledge/.

It first becomes a Candidate Record.

That record has required fields:

Source
Type
Evidence
Tags
Expiration condition
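As a sketch, those required fields map naturally onto a small record type with a completeness check. The field names mirror the list above; the `CandidateRecord` class, the `missing_fields` helper, and the example values are assumptions for illustration, not the repo's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CandidateRecord:
    source: str
    type: str
    evidence: str
    tags: tuple[str, ...]
    expiration_condition: str


def missing_fields(record: CandidateRecord) -> list[str]:
    """Names of required fields that are empty (blank string or no tags)."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(f.name)
    return missing


incomplete = CandidateRecord(
    source="delivery: hypothetical-feature",
    type="pattern",
    evidence="",  # no evidence attached yet
    tags=("cursor", "symlink"),
    expiration_condition="Cursor changes how it loads agents",
)
```

Here `missing_fields(incomplete)` reports the empty evidence field, so the record would not be ready for audit.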

Then memory-engineer audits it:

Is it reusable?
Is it already covered?
Is it too speculative?
Does it belong as a Knowledge Item, or should it become a rule change, prompt edit, or ADR instead?

Only after that does a human approve the destination.

The design is intentionally similar to code review. Durable memory changes future behavior, so changes to it deserve a paper trail.

Rejection is a feature

One of my favorite parts of the memory runbook is that it has explicit rejection rules.

Do not promote a memory when it is:

a one-off
already covered
too speculative

That makes "zero candidates promoted this cycle" a healthy result, not a failure.
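The rejection rules are simple enough to sketch as a function. The three reasons come from the runbook list above; how each one is detected is my assumption. In particular, the "seen fewer than two times" threshold for a one-off and the `AuditInput` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditInput:
    occurrences: int       # how many deliveries showed the same lesson
    already_covered: bool  # an existing KI, rule, or ADR already says this
    has_evidence: bool     # concrete evidence is attached


def rejection_reason(audit: AuditInput) -> Optional[str]:
    """Return why a candidate is rejected, or None if it may be promoted."""
    if audit.occurrences < 2:
        return "one-off"
    if audit.already_covered:
        return "already covered"
    if not audit.has_evidence:
        return "too speculative"
    return None


cycle = [
    AuditInput(occurrences=1, already_covered=False, has_evidence=True),
    AuditInput(occurrences=3, already_covered=True, has_evidence=True),
    AuditInput(occurrences=2, already_covered=False, has_evidence=False),
]
promoted = [c for c in cycle if rejection_reason(c) is None]
```

In this example cycle `promoted` is empty, and nothing went wrong: every candidate was rejected for a stated reason.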

This is where memory engineering starts to look less like note-taking and more like gardening. The point is not to preserve every leaf. The point is to keep the soil useful.

Expiration matters

The lifecycle also includes expiration.

A Knowledge Item can become stale when the underlying code, agent, or pattern changes. It can be superseded by a better KI. Or usage analytics can show that it never appears in context manifests, which may mean it is not useful or just badly tagged.

The repo does not delete those blindly. Expired KIs move to shared/knowledge/expired/ with a note.
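A minimal sketch of "move with a note" using only the standard library. The `expired/` directory name comes from the post; the note format, the file name, and the `expire_knowledge_item` helper are assumptions.

```python
import tempfile
from datetime import date
from pathlib import Path


def expire_knowledge_item(ki_path: Path, reason: str, today: date) -> Path:
    """Move a KI into an `expired/` sibling directory and prepend a note."""
    expired_dir = ki_path.parent / "expired"
    expired_dir.mkdir(exist_ok=True)
    destination = expired_dir / ki_path.name
    note = f"> Expired {today.isoformat()}: {reason}\n\n"
    original = ki_path.read_text(encoding="utf-8")
    destination.write_text(note + original, encoding="utf-8")
    ki_path.unlink()  # remove the active copy only after the archive is written
    return destination


# Demo in a throwaway directory with a hypothetical KI file.
with tempfile.TemporaryDirectory() as tmp:
    ki = Path(tmp) / "KI-example.md"
    ki.write_text("# Example\n", encoding="utf-8")
    moved = expire_knowledge_item(ki, "superseded by a newer KI", date(2026, 7, 8))
    original_gone = not ki.exists()
    expired_dir_name = moved.parent.name
    expired_text = moved.read_text(encoding="utf-8")
```

The archived file keeps the full original text below the note, so the old belief stays readable as evidence.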

That choice matters because a wrong memory is still evidence. It tells you what the team used to believe.

Why not build the big retrieval system now?

There is a runbook for LightRAG integration at docs/runbooks/lightrag-integration.md.

There is intentionally no implementation.

That is not an omission. It is a YAGNI decision.

The current retrieval path is smaller: search-ki searches Knowledge Items and ADRs; query-memory works across the broader memory registry. The repo currently has 4 portable Knowledge Items, so building a bigger retrieval subsystem before the corpus needs it would add moving parts without solving an observed bottleneck.

The runbook exists so the future integration has a shape if the need becomes real.

That is the kind of "not yet" I trust: documented, intentional, and reversible.

Governance by design, not vibes

The memory system fits into a larger governance model.

docs/AGENT_REFERENCE.md lists every one of the 24 agents and names what checks its work: a structural contract, a downstream reviewer, a human approval gate, or an aggregate metric.

Some gaps are real and stated plainly. For example, test-driven-developer deliberately bypasses the full review chain for speed. The doc does not pretend otherwise.

That same honesty shows up in memory.

The system does not claim every remembered thing is true forever.

It asks:

Where did this come from?
What evidence supports it?
Who approved it?
What would make it expire?

Those are small questions, but they change the shape of the system.

What I would copy into another project

If you are adding memory to an AI workflow, I would start here:

Do not let agents write directly to durable memory.
Make every memory a candidate first.
Require evidence and an expiration condition.
Treat rejection as a valid outcome.
Periodically compress duplicates instead of making retrieval disambiguate them forever.

The hard part of memory is not remembering.

It is staying worth remembering.

Source trail

docs/runbooks/memory-engineering.md — full memory lifecycle and Candidate/Audit fields.
docs/runbooks/context-engineering.md — distinction between Context, Memory, and Learning.
shared/memory-registry.json — registry of durable memory sources and retrieval backends.
docs/AGENT_REFERENCE.md — agent counterbalances and explicit gaps.
docs/features/context-engineering-framework/TODO.md — Epic 22 memory engineering and Epic 26 documentation-manager boundary cleanup.
docs/runbooks/lightrag-integration.md — documented future path, intentionally not implemented yet.

Treat the Context Window Like a Budget, Not a Junk Drawer

Oscar Rieken — Tue, 07 Jul 2026 23:54:12 +0000

Most AI coding workflows treat context as something that happens accidentally.

You open a few files. Paste a stack trace. Ask the model to inspect a directory. Then another. Then the chat grows heavy, the model starts missing earlier instructions, and everyone pretends the problem is "the model got weird."

In ai-assistant-dot-files (I probably need to rename this now, as it's grown into something else), I wanted to treat the context window as a budget.

Not a vibe. Not a giant bucket. A budget.

The repo now ships a Context Engineering Framework that defines one canonical set of agents, skills, and rules in shared/, then projects them into six AI coding tools: Claude Code, Cursor, Windsurf, GitHub Copilot, Gemini/Antigravity, and OpenAI Codex. The current repo has 24 agents, 53 skills, 13 inter-agent contracts, and 6 platform targets.

The core idea is simple: before an agent does serious work, another agent should decide what belongs in the room.

The context-engineer agent

The framework has a dedicated context-engineer agent. Its job is not to implement anything.

Its job is to produce a context-manifest.md before the rest of the pipeline starts.

That manifest scopes the bounded context, identifies relevant files, surfaces Knowledge Items and ADRs, notes prior related deliveries, and estimates token budget pressure for the downstream agents.

This matters because the feature-delivery pipeline is not one prompt. It is a sequence:

spec-writing
product review
context engineering
analysis
architecture
performance review
data review
development
code review
accessibility review
security review
QA
SRE review
documentation
DevOps

If the first few steps load the wrong material, every later agent pays for it.

The important design move is that the context manifest is itself a governed artifact. It has a contract in shared/contracts/context-manifest-contract.md, and the validate-artifact skill checks that required sections are present before the pipeline moves forward.
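A sketch of what a structural check like that might do. The section names below are hypothetical placeholders (the real list lives in the contract file, which is not quoted here), and `missing_sections` is not the actual validate-artifact implementation.

```python
import re

# Hypothetical section names; the real ones are defined by the contract.
REQUIRED_SECTIONS = [
    "Bounded Context",
    "Relevant Files",
    "Knowledge Items",
    "Token Budget",
]


def missing_sections(markdown: str, required: list[str] = REQUIRED_SECTIONS) -> list[str]:
    """Return required section titles that have no matching markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    }
    return [title for title in required if title.lower() not in headings]


manifest = """# Context Manifest

## Bounded Context
billing

## Relevant Files
- src/billing/service.py

## Token Budget
roughly 4k tokens
"""
gaps = missing_sections(manifest)
```

A non-empty `gaps` list is the signal to stop the pipeline before a downstream agent inherits an incomplete handoff.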

Context is not just "whatever the chat accumulated."

It is an explicit handoff.

Context, memory, and learning are different problems

One of the most useful distinctions in the repo lives in docs/runbooks/context-engineering.md:

Context is what is loaded into the model right now.
Memory is durable knowledge that outlives the current run.
Learning is a feedback loop that changes future behavior.

People often flatten all three into "RAG."

That loses important design pressure.

Context is a working set. It should be small, relevant, and current.

Memory is a durable corpus. It should be curated, searchable, and allowed to expire.

Learning is a loop. It should turn repeated delivery evidence into changed rules, changed prompts, or new Knowledge Items.

The context-engineer reads memory to build better context, but it does not automatically rewrite memory. That separation keeps a bad or noisy run from polluting the durable layer.

Context decay

The framework also uses "context decay."

An artifact more than two pipeline phases old should be read as a summary, not in full. The target is a small gist, roughly 200 words, produced through the summarize-artifact skill.
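The rule fits in a few lines. This sketch assumes phases are numbered in pipeline order; the `load_mode` name and the integer phase indexes are illustrative, and only the "more than two phases old" threshold comes from the framework.

```python
def load_mode(artifact_phase: int, current_phase: int, max_age: int = 2) -> str:
    """Return "full" or "summary" for an artifact, based on its age in phases."""
    if artifact_phase > current_phase:
        raise ValueError("artifact cannot come from a future phase")
    age = current_phase - artifact_phase
    # "More than two phases old" means strictly greater than max_age.
    return "summary" if age > max_age else "full"


two_phases_old = load_mode(artifact_phase=3, current_phase=5)
three_phases_old = load_mode(artifact_phase=3, current_phase=6)
```

An artifact exactly two phases old is still read in full; one phase later it is replaced by its summary.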

That is an intentionally boring mechanism, and that is why I like it.

Most context-window failures do not need a magic retrieval system. They need fewer stale artifacts loaded verbatim.

If the developer is five phases downstream from the analyst, they probably need the current acceptance criteria, edge cases, and constraints. They do not need every sentence of the analyst's intermediate reasoning still floating in the model's attention.

Platform reality changes the design

The repo does not pretend every AI coding tool has the same capabilities.

docs/ARCHITECTURE.md defines a capability tier system. Claude Code has full agent orchestration. Cursor now has real .cursor/agents/ and .cursor/skills/ loading, but its rule files still need fully inlined content. Windsurf and Copilot get persona/rule projections. Gemini/Antigravity reads AGENTS.md and has confirmed skill invocation. Codex gets an inlined .openai.md.

That means the framework has to distinguish "agent" from "persona."

An agent can have tool access and participate in a multi-step process. A persona is a context frame: useful, but not autonomous.

The practical result is a shared/ canonical layer, plus generation and parity checks for each platform. scripts/check-parity.sh exists because hand-copying instructions across tools is how drift wins.

What I would copy into another project

If you do not need a 24-agent framework, I would still steal these ideas:

Write down the difference between Context, Memory, and Learning.
Add a pre-flight context manifest before complex work starts.
Treat old artifacts as summaries by default.
Make context handoffs structural, not conversational.
Add a parity check for any instruction copied across multiple tools.

The context window is not just a bigger prompt.
It is a scarce design surface.
Use it like one.

Source trail

README.md — canonical shared/ layer, 24-agent roster, 53-skill catalog, six platform targets.
docs/ARCHITECTURE.md — capability tiers, platform projection model, and context flow.
docs/runbooks/context-engineering.md — Context/Memory/Learning distinction, context decay, manifest role.
docs/AGENT_REFERENCE.md — context-engineer role and counterbalances.
docs/features/context-engineering-framework/TODO.md — Epic 5 contract work and Epic 23 contract closure.

Making LLM outputs auditable: the provider abstraction pattern

Oscar Rieken — Mon, 01 Jun 2026 02:28:08 +0000

The problem with calling an LLM directly

NumPath's teacher dashboard generates per-student insights — one-sentence observations like "Emma skips borrowing in 9 of 11 recent subtraction attempts" with a suggested action. The obvious implementation is to import the Anthropic SDK, call messages.create(), and return the result.

That works until you need to test it. Or run it offline. Or swap providers. Or audit where the insight came from.

This post covers how NumPath abstracts the LLM behind a protocol interface, tests with a deterministic stub, and structures the insight pipeline so the evidence is assembled from database reads — not generated by the model.

The Protocol: 6 lines

The entire LLM abstraction is a Python Protocol:

from typing import Protocol, runtime_checkable

@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...

No base class. No ABC. No framework. Any object with an async def complete(self, system, user, max_tokens) method satisfies this interface — that's structural typing via Protocol. The @runtime_checkable decorator lets you write isinstance(provider, LLMProvider) if you need a runtime check, though in practice the type checker catches mismatches at lint time.
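A small demonstration of the structural check. `EchoProvider` and `NotAProvider` are hypothetical classes made up for this example. One caveat worth knowing: the runtime `isinstance` check only verifies that a `complete` attribute exists. It does not compare signatures or confirm the method is async, so the static type checker remains the real contract.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...


class EchoProvider:
    """Hypothetical provider: no inheritance, just a matching method."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return user


class NotAProvider:
    """Has a differently named method, so it does not satisfy the protocol."""

    def generate(self, prompt: str) -> str:
        return prompt


is_provider = isinstance(EchoProvider(), LLMProvider)
is_not_provider = isinstance(NotAProvider(), LLMProvider)
echoed = asyncio.run(EchoProvider().complete(system="ignored", user="hello"))
```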

The signature is deliberately narrow: one system prompt, one user message, one token limit. No conversation history, no tool use, no streaming. NumPath's insight generator makes a single completion call per request. If multi-turn conversation becomes necessary in Phase 3, the protocol gains a new method — existing implementations aren't broken.

Two implementations

ClaudeProvider — the production implementation:

import anthropic

# `settings` is the project's config object; its import is not shown in this post.


class ClaudeProvider:
    def __init__(self) -> None:
        self._client = anthropic.AsyncAnthropic(api_key=settings.ANTHROPIC_API_KEY)

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        message = await self._client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        # The first content block carries the text of the reply.
        return message.content[0].text

StubProvider — deterministic, zero dependencies, zero API calls:

class StubProvider:
    """Deterministic LLM stub for tests and local dev without API keys."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"summary": "Student is building foundational numeracy skills '
            'with consistent effort.", "suggested_action": "Try place value '
            'exercises with physical manipulatives to reinforce digit positioning."}'
        )

The stub returns a fixed JSON string that matches the expected response schema. Tests assert against this exact output. If someone changes the response schema, the stub breaks, the tests break, and the problem is caught before deployment.
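A sketch of what such a test could look like. It uses `asyncio.run` so it needs no pytest plugins, and it checks the response shape rather than the exact string. The test name is hypothetical, and NumPath's real tests may differ.

```python
import asyncio
import json


class StubProvider:
    """Copy of the stub above, repeated so this example is self-contained."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"summary": "Student is building foundational numeracy skills '
            'with consistent effort.", "suggested_action": "Try place value '
            'exercises with physical manipulatives to reinforce digit positioning."}'
        )


def test_stub_matches_response_schema() -> None:
    raw = asyncio.run(StubProvider().complete(system="ignored", user="ignored"))
    data = json.loads(raw)  # raises if the stub ever stops returning valid JSON
    assert set(data) == {"summary", "suggested_action"}
    assert all(isinstance(value, str) and value for value in data.values())
```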

Wiring: one environment variable

def get_llm_provider() -> LLMProvider:
    if settings.LLM_PROVIDER == "claude":
        return ClaudeProvider()
    return StubProvider()

LLM_PROVIDER defaults to "stub". Running uv run pytest requires zero environment variables — no API key, no network. Production sets LLM_PROVIDER=claude and provides ANTHROPIC_API_KEY. The config uses Literal["claude", "stub"] so a typo like "Claude" fails at startup.
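The post does not show the config class, so here is a standard-library sketch of the same fail-fast idea. Settings libraries that validate `Literal` fields do this for you; `read_provider` is a hypothetical hand-rolled equivalent, not NumPath's code.

```python
from collections.abc import Mapping
from typing import Literal, get_args

ProviderName = Literal["claude", "stub"]


def read_provider(env: Mapping[str, str]) -> str:
    """Validate LLM_PROVIDER against the Literal's allowed values."""
    value = env.get("LLM_PROVIDER", "stub")  # default keeps tests offline
    allowed = get_args(ProviderName)
    if value not in allowed:
        raise ValueError(f"LLM_PROVIDER must be one of {allowed}, got {value!r}")
    return value
```

Calling `read_provider({"LLM_PROVIDER": "Claude"})` raises immediately, which is the behavior you want at startup rather than at the first teacher request.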

The use case receives the provider through its constructor, not through a global:

class GenerateInsightUseCase:
    def __init__(self, db: AsyncSession, llm: LLMProvider) -> None:
        self._db = db
        self._llm = llm

The router wires it:

@router.get("/students/{student_id}/insight", response_model=InsightResponse)
async def get_student_insight(
    student_id: uuid.UUID,
    db: AsyncSession = Depends(get_db),
    _: dict = Depends(require_teacher),
) -> InsightResponse:
    llm = get_llm_provider()
    use_case = GenerateInsightUseCase(db, llm)
    return await use_case.execute(student_id)

Evidence is not generated — it's assembled

This is the design decision that matters most for a research project. When a teacher sees an insight, they need to trust it — and "trust" in an educational context means "I can check this against the data."

The insight prompt receives two blocks of structured data, both assembled from database queries:

KC states:
- SUB_BORROW: Novice (p_mastery=0.18, 8 attempts)
- PLACE_VALUE: Developing (p_mastery=0.45, 3 attempts)
- NUMBER_LINE: Novice (p_mastery=0.15, 1 attempt)

Recent attempts (last 10, most recent first):
1. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "52 − 27 = ?"
2. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "31 − 14 = ?"
3. Skill: PLACE_VALUE | Correct: Yes | Mistake: none | Q: "Which is larger: 47 or 74?"

The LLM generates two fields: summary (what's happening) and suggested_action (what to do). It does not generate the evidence — the KC codes, mastery percentages, mistake counts, and attempt records are all server-side data. The LLM synthesises a narrative from that data, but the data itself is verifiable.
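To make "assembled, not generated" concrete, here is a sketch of how the KC block could be rendered from rows the server already holds. The `KCState` dataclass and `format_kc_states` function are assumptions; the output format copies the example block above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KCState:
    code: str         # e.g. "SUB_BORROW"
    level: str        # e.g. "Novice"
    p_mastery: float  # mastery estimate between 0 and 1
    attempts: int


def format_kc_states(states: list[KCState]) -> str:
    """Render database rows as the plain-text block the prompt receives."""
    lines = ["KC states:"]
    for state in states:
        noun = "attempt" if state.attempts == 1 else "attempts"
        lines.append(
            f"- {state.code}: {state.level} "
            f"(p_mastery={state.p_mastery:.2f}, {state.attempts} {noun})"
        )
    return "\n".join(lines)


block = format_kc_states(
    [
        KCState("SUB_BORROW", "Novice", 0.18, 8),
        KCState("NUMBER_LINE", "Novice", 0.15, 1),
    ]
)
```

Because the numbers are formatted by code, a teacher can trace every figure in the prompt back to a row, regardless of what the model writes around it.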

The prompt enforces this structurally:

You are a specialist math learning advisor for primary school teachers.
Given their Knowledge Component mastery states and recent attempt history,
generate a JSON response with exactly two fields:
- "summary": one sentence (max 20 words) describing the student's current learning state
- "suggested_action": one concrete teaching action (max 20 words) the teacher can take today

Respond with only the JSON object. No explanation, no markdown, no code fences.

Strict JSON. Word limits. No room for hallucinated statistics or invented KC codes.

Graceful fallback

LLMs produce unpredictable output. The response parser handles malformed JSON without crashing:

import json
import logging

# Assumes a standard-library logger; the post does not show how `logger` is created.
logger = logging.getLogger(__name__)

_FALLBACK_INSIGHT = InsightResponse(
    summary="Insight temporarily unavailable.",
    suggested_action="Review the student's recent attempts for patterns.",
)

def _parse_insight(raw: str) -> InsightResponse:
    try:
        data = json.loads(raw)
        return InsightResponse(
            summary=data["summary"],
            suggested_action=data["suggested_action"],
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # Log only a prefix of the raw output, then degrade to the neutral fallback.
        logger.warning("insight_parse_failed_using_fallback raw=%s", raw[:200])
        return _FALLBACK_INSIGHT

The fallback is a valid InsightResponse — the teacher sees a neutral message, not a 500 error. The warning log captures the first 200 characters of the raw response for debugging without logging the entire LLM output.

Why not LangChain?

This was an explicit decision, documented in ADR-003. LangChain adds 50+ transitive dependencies and significant abstraction cost for what NumPath actually needs: one completion call with a system prompt and a user message. The protocol-based approach is 6 lines of interface, 8 lines of stub, 9 lines of production implementation. The total abstraction surface is smaller than LangChain's ChatModel base class alone.

If NumPath needed retrieval-augmented generation, multi-step chains, or agent loops, LangChain would earn its weight. For two structured completion calls (insight generation and hint narration), it would be accidental complexity.

The fitness function

ADR-003 specifies a concrete test: uv run pytest must pass using StubProvider with no environment variables set. This means every LLM-dependent code path has a test that runs offline. If someone adds a new LLM feature and writes a test that requires ANTHROPIC_API_KEY, CI fails — not because the test is wrong, but because it violates the architectural constraint that the test suite runs without external dependencies.

What's next

The current provider interface handles single-turn completions. Phase 3 may need multi-turn conversation for interactive teacher coaching. When that happens, the protocol gains a second method — complete() stays unchanged, and a new converse() method handles the multi-turn case. Existing implementations get a NotImplementedError default until they're updated. The key is that the interface extends forward without breaking backward.
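One possible shape for that extension, offered as an assumption rather than the project's plan. A class that only matches a Protocol structurally does not inherit default method bodies from it, so this sketch leaves `LLMProvider` untouched, puts `converse()` on a sub-protocol, and raises `NotImplementedError` from a helper. `ConversationalProvider`, `OldProvider`, `ChatProvider`, and the message format are all hypothetical.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...


@runtime_checkable
class ConversationalProvider(LLMProvider, Protocol):
    async def converse(
        self, system: str, messages: list[dict[str, str]], max_tokens: int = 256
    ) -> str: ...


class OldProvider:
    """Single-turn only: still a valid LLMProvider after the extension."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return "single-turn reply"


class ChatProvider(OldProvider):
    """Adds the multi-turn method on top of the single-turn one."""

    async def converse(
        self, system: str, messages: list[dict[str, str]], max_tokens: int = 256
    ) -> str:
        return f"reply to {len(messages)} messages"


async def converse(
    provider: LLMProvider, system: str, messages: list[dict[str, str]], max_tokens: int = 256
) -> str:
    """Dispatch to converse() if supported, else fail with a clear error."""
    if not isinstance(provider, ConversationalProvider):
        raise NotImplementedError(f"{type(provider).__name__} does not support converse()")
    return await provider.converse(system, messages, max_tokens)


history = [{"role": "user", "content": "hi"}]
chat_reply = asyncio.run(converse(ChatProvider(), "sys", history))
```

The design choice here is that old implementations keep passing both the type checker and `isinstance(provider, LLMProvider)`, and the missing capability surfaces only where multi-turn is actually requested.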

Key Takeaways

Protocol-based abstraction costs 6 lines and buys full test isolation — StubProvider returns deterministic output; no API key, no network, no flaky tests; the type checker enforces the contract at lint time
Evidence must be assembled from data, not generated by the model — the LLM writes the narrative but doesn't produce the numbers; KC codes, mastery percentages, and mistake counts come from database queries and are independently verifiable
Graceful fallback is a first-class design requirement — a teacher sees "insight temporarily unavailable" and a neutral suggestion, never a stack trace; the warning log captures the raw output for debugging without exposing it to the user

60 hand-crafted math problems: what I learned writing seed data for an adaptive tutor

Oscar Rieken — Mon, 01 Jun 2026 02:27:18 +0000

Why hand-author anything?

The obvious approach for seeding an adaptive math tutor is to generate problems programmatically. Pick two random numbers, subtract them, done. I tried this first and it failed for a specific reason: generated problems don't have meaningful hints.

A hint like "Try subtracting the ones column first" is generic. A hint like "2 ones minus 9 is impossible without borrowing — take a ten from the 3 tens" is diagnostic. It names the exact step where a dyscalculic student is likely to get stuck, and it names the operation they need to perform. That second kind of hint requires a human who understands the problem.

NumPath's Phase 1 seeds 100 problems across 5 Knowledge Components, each with two progressive hints, a calibrated difficulty score, and structured metadata that the MistakeClassifier uses to diagnose errors. Every one is hand-authored.

The content schema

Each problem is a JSONB column in Postgres. The schema is intentionally flat — no nested objects, no polymorphism:

{
    "type": "subtraction",
    "question": "32 − 9 = ?",
    "answer": "23",
    "difficulty": 0.25,
    "operands": [32, 9],
    "hints": [
        "2 ones − 9 is impossible without borrowing. Take a ten from the 3 tens.",
        "Now you have 12 ones. 12 − 9 = 3. You have 2 tens left. Answer: 23."
    ]
}

Three fields deserve explanation:

operands / choices: these aren't shown to the student. They exist for the MistakeClassifier. When a student answers "41" instead of "23" on 32 − 9, the classifier runs a series of checks against the operands. Does the answer match subtracting the ones in the wrong direction without borrowing (9 − 2 = 7 ones, 3 tens → 37)? No. Does it reverse the digits of the correct answer (23 → 32)? No. Does it match adding instead of subtracting (32 + 9 = 41)? Yes. Each check operates on the operands, not the question string.

difficulty — a float from 0.1 to 0.9, calibrated by hand. This is the initial difficulty estimate. The adaptive engine uses it to match students to problems at their current level. I'll explain the calibration logic below.

hints — always exactly two, always progressive. The first hint names the obstacle. The second hint walks through the solution. Students reveal hints one at a time, voluntarily. Hints are never forced — forcing hints on students who don't want them creates learned helplessness, which is the opposite of what we're trying to study.

The five skill areas

Each skill has 20 problems covering a difficulty gradient from 0.1 to 0.9:

Skill Code	Domain	Example at 0.1	Example at 0.9
`SUB_BORROW`	Subtraction	11 − 4 = ?	1003 − 567 = ?
`PLACE_VALUE`	Number sense	Which is larger: 3 or 8?	What does the 6 represent in 3,641?
`NUMBER_LINE`	Number sense	What number comes after 3?	What is halfway between 250 and 350?
`NUMBER_SENSE`	Number sense	Which is more: 2 or 5?	Order from smallest: 892, 829, 928, 289
`OPERATION_SIGN`	Arithmetic	2 + 3 = ?	15 − 7 + 3 = ?

The difficulty gradient is not linear. The jump from 0.1 to 0.3 (single-digit to simple two-digit) is smaller than the jump from 0.7 to 0.9 (two-digit with borrowing across zeros to three-digit with cascading borrows). This mirrors what the dyscalculia research literature reports: difficulty is not proportional to number size. It's proportional to the number of cognitive steps, particularly steps that require regrouping or holding intermediate results in working memory.

Hint design: what I got wrong

My first draft of hints was procedural — they described what to do:

"Borrow from the tens column. Subtract. Write the answer."

This is useless for a student with dyscalculia. The difficulty isn't knowing what borrowing is — it's executing the procedure without losing track of which column they're in. The second draft of every hint follows two rules:

Name the specific obstacle. Not "this is tricky" — rather "2 ones minus 9 is impossible without borrowing."
Walk through the state change. Not "borrow and subtract" — rather "Take a ten from the 3 tens. Now you have 12 ones. 12 − 9 = 3. You have 2 tens left."

The second rule matters because dyscalculic students often lose the intermediate state — they borrow correctly but then forget what changed. The hint reconstructs the full number after regrouping so the student can see where they are.

This pattern held across all five skill areas. Place value hints name the specific column ("the tens digit is the second from the right"). Number line hints name the direction and distance ("7 is to the right of 4 — count 3 steps forward"). Operation sign hints name the symbols and their meaning ("the − sign means subtract — take the second number away from the first").

Difficulty calibration

Difficulty scores are not arbitrary. They follow a rubric I developed after the first round of testing:

Score range	Criteria
0.1 – 0.2	Single-digit or simple two-digit; one cognitive step
0.25 – 0.4	Two-digit; requires one borrowing or comparison step
0.45 – 0.6	Two-digit with borrowing across columns, or three-digit without borrowing
0.65 – 0.8	Three-digit with borrowing; or problems requiring intermediate computation
0.85 – 0.9	Three-digit with cascading borrows (e.g., borrowing from hundreds when tens is 0)

The adaptive engine uses a DIFFICULTY_BAND of 0.15 around the target difficulty when selecting problems. So a student at target difficulty 0.5 sees problems between 0.35 and 0.65. This means each difficulty tier overlaps with its neighbors — a student improving from 0.4 to 0.6 transitions gradually rather than hitting a cliff.
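Here is a rough sketch of that band filter, together with the fallback chain mentioned later in this post (widen the band, then allow repeats). DIFFICULTY_BAND is the value quoted above; the Candidate class, the function names, and the choice to double the band when widening are assumptions for illustration, not the engine's actual code.

```python
from dataclasses import dataclass

DIFFICULTY_BAND = 0.15  # value quoted in the post
_EPS = 1e-9             # tolerance so band edges survive float rounding


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a seeded problem row."""
    problem_id: str
    difficulty: float


def candidates_in_band(problems, target, recently_seen=frozenset(), band=DIFFICULTY_BAND):
    """Return unseen problems whose difficulty lies within +/- band of target."""
    low, high = target - band - _EPS, target + band + _EPS
    return [
        p for p in problems
        if low <= p.difficulty <= high and p.problem_id not in recently_seen
    ]


def pick_pool(problems, target, recently_seen=frozenset()):
    """Fallback chain: normal band, then a wider band, then allow repeats."""
    pool = candidates_in_band(problems, target, recently_seen)
    if not pool:
        # Assumed widening factor: twice the normal band.
        pool = candidates_in_band(problems, target, recently_seen, band=2 * DIFFICULTY_BAND)
    if not pool:
        pool = candidates_in_band(problems, target, band=2 * DIFFICULTY_BAND)
    return pool
```

With a target of 0.5, the first call keeps problems rated 0.35 through 0.65, which is where the overlap between neighbouring tiers comes from.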

The seed script

The seed is idempotent — safe to run on every deployment:

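from sqlalchemy.dialects.postgresql import insert as pg_insert

async def seed_problems(session, skill_id_map: dict[str, str]) -> int:
    """Insert the seed problems and return how many rows were newly inserted."""
    inserted = 0
    for skill_code, problems in PROBLEMS.items():
        skill_id = skill_id_map.get(skill_code)
        for p in problems:
            # difficulty gets its own column, so it is left out of the JSONB content
            content = {k: v for k, v in p.items() if k != "difficulty"}
            stmt = (
                pg_insert(Problem)
                .values(
                    skill_id=skill_id,
                    content=content,
                    difficulty=p["difficulty"],
                    problem_type=p["type"],
                )
                .on_conflict_do_nothing()
            )
            result = await session.execute(stmt)
            inserted += result.rowcount  # 0 when the row was skipped as a conflict
    return inserted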

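on_conflict_do_nothing() means re-running the seed doesn't duplicate problems, as long as the problems table has a unique constraint for the insert to collide with; with only an auto-generated primary key, nothing conflicts and every run adds new rows. Note that the comprehension above strips difficulty out of content, so in this snippet the value is stored only as a top-level column on the Problem model, where it is indexed for the adaptive engine's range queries.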

The full seed runs inside a single transaction: skills first (because problems have a foreign key to skills), then problems, then test accounts. If any step fails, nothing is committed.

What I'd do differently

Two things:

More problems per skill. Twenty problems with a 0.15 difficulty band means some bands have only 2–3 candidates. When the adaptive engine excludes recently-seen problems, it can run out of fresh options at a specific difficulty level. The fallback chain handles this gracefully (widen the band, then allow repeats), but 30 problems per skill would eliminate most fallback cases.

Machine-assisted hint generation. The hints are the bottleneck — each one took 2–3 minutes to write well. For Phase 2, I plan to generate candidate hints with Claude and then manually review them. The human is still in the loop, but the first draft comes faster.

Key Takeaways

Generated problems are easy; generated hints are not — an adaptive tutor's value is in the scaffolding, not the arithmetic; hand-authoring hints that name the specific obstacle and walk through the state change is what makes the system useful for dyscalculia
Difficulty is not proportional to number size: it's proportional to cognitive steps, particularly regrouping and intermediate state; a three-digit problem with no borrowing (350 − 120) is easier than a numerically smaller one with cascading borrows (100 − 67)
Idempotent seeds inside a transaction are non-negotiable — on_conflict_do_nothing() plus a single transaction means the seed runs safely on every deployment, fresh clone, and CI pipeline

Clean Architecture in a FastAPI + Vue 3 monorepo

Oscar Rieken — Mon, 01 Jun 2026 02:26:44 +0000

Why architecture matters in a research project

Most research prototypes are throwaway code. NumPath is not. It needs to survive four phases over 30 weeks, accumulate real student data for a randomised controlled trial, and remain testable without live infrastructure at every step. That means the architecture has to enforce rules that hold up under pressure — not just conventions someone remembers to follow.

This post walks through how NumPath uses Clean Architecture to keep a FastAPI backend, a Vue 3 frontend, and a Python ML module in a single repository without coupling them together.

The monorepo layout

The project lives in a single repo with a clear namespace boundary:

phd-research/
├── numpath/
│   ├── backend/      # Python 3.12 + FastAPI + SQLAlchemy
│   ├── frontend/     # Vue 3 + Tailwind CSS + Pinia
│   └── ml/           # BKT, DKT, adaptive engine
├── docs/
│   ├── adrs/         # Architecture Decision Records
│   ├── architecture/ # System design, feature specs
│   └── posts/        # This blog series
└── DOMAIN_DICTIONARY.md

The alternative was three separate repos. For a solo PhD project where a data model change touches the migration, the API schema, the ML engine, and the Vue component in the same commit, separate repos mean coordinated PRs across three remotes. That's overhead with no benefit when one person owns all three layers.

The escape hatch is clean: if NumPath ever needs to become a standalone repo, the numpath/ directory lifts out intact.

Dependency direction: the one rule that matters

Clean Architecture has many principles, but only one that I enforce mechanically: inner layers never import from outer layers.

In NumPath's backend, the layers are:

Domain (models)  →  Use Cases  →  Adapters (routers, DB, LLM)  →  Frameworks (FastAPI, SQLAlchemy)

A use case like GetNextProblemUseCase receives a database session — but it does not import FastAPI, does not know about HTTP, and does not call Depends():

class GetNextProblemUseCase:
    def __init__(self, db: AsyncSession) -> None:
        self._db = db

    async def execute(self, student_id: uuid.UUID) -> NextProblemResponse:
        kc_states = await self._build_kc_states(student_id)
        recent_attempts = await self._fetch_recent_attempts(student_id)
        recent_mistakes = await self._fetch_recent_mistakes(student_id)

        selection: ProblemSelection = select_next_problem(
            kc_states=kc_states,
            recent_correctness=[row.is_correct for row in recent_attempts],
            current_difficulty=recent_attempts[0].difficulty if recent_attempts else 0.3,
            recent_mistakes=recent_mistakes,
        )

        problem = await self._select_problem(selection, ...)
        # ... return NextProblemResponse

The router — the adapter layer — is the only file that knows about FastAPI:

@router.get("/next-problem/{student_id}", response_model=NextProblemResponse)
async def get_next_problem(
    student_id: uuid.UUID,
    db: AsyncSession = Depends(get_db),
    _: dict = Depends(require_student),
) -> NextProblemResponse:
    use_case = GetNextProblemUseCase(db)
    return await use_case.execute(student_id)

The router does three things: parse the request, inject dependencies, and delegate to the use case. No business logic. If I replaced FastAPI with Litestar tomorrow, I'd rewrite the routers and touch nothing else.

Configuration as a boundary

Settings are another place where framework details leak into domain code if you're not careful. NumPath uses BaseSettings from the pydantic-settings package with a .env file:

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    DATABASE_URL: str = "postgresql+asyncpg://numpath:numpath@localhost:5432/numpath"
    LLM_PROVIDER: Literal["claude", "stub"] = "stub"
    ANTHROPIC_API_KEY: str = ""
    ENVIRONMENT: Literal["development", "production", "test"] = "development"
    CORS_ORIGINS: list[str] = ["http://localhost:5173"]

Every secret has a default that works locally. LLM_PROVIDER defaults to "stub" so that tests and local dev never require an API key. The Literal type annotation means a typo in the .env file fails validation at startup, not later when a teacher clicks "Generate insight."

The ML module as a pure function boundary

The ml/ directory is a separate Python package (numpath-ml) with its own pyproject.toml. The backend depends on it, but the dependency is narrow: two functions and a dataclass.

from numpath_ml.adaptive_engine import select_next_problem, ProblemSelection
from numpath_ml.bkt import KCState

select_next_problem() takes dictionaries and lists — no SQLAlchemy models, no async, no database. It returns a ProblemSelection with a skill_code, target_difficulty, and reason string. The use case translates between database rows and these pure data structures.

This boundary exists because the ML code changes on a different cadence than the web application. When I replace the rule-based engine with Deep Knowledge Tracing in Phase 3, the use case stays the same: only the function it calls changes.

The frontend: same principle, different language

The Vue 3 frontend mirrors the same layering:

API client — a thin Axios wrapper that handles auth tokens and 401 redirects:

export const apiClient = axios.create({
  baseURL: '/api/v1',
})

apiClient.interceptors.request.use((config) => {
  const token = localStorage.getItem('token')
  if (token) config.headers.Authorization = `Bearer ${token}`
  return config
})

Stores — Pinia stores manage state. The auth store handles login/logout and persists the JWT to localStorage. Views consume stores, not the API client directly.

Views — PracticeView.vue, TeacherView.vue, LoginView.vue. Each view composes API calls and store access. No view imports another view.

Router — role-based guards redirect students and teachers to their respective views:

router.beforeEach((to) => {
  const auth = useAuthStore()
  if (to.meta.requiresAuth && !auth.isAuthenticated) return '/login'
  if (to.meta.role && auth.role !== to.meta.role) {
    return auth.role === 'teacher' ? '/teacher' : '/practice'
  }
})

Docker Compose as the integration layer

The four services — Postgres, Redis, backend, frontend — are composed with health checks so the backend waits for the database:

backend:
  environment:
    DATABASE_URL: postgresql+asyncpg://numpath:numpath@postgres:5432/numpath
  depends_on:
    postgres:
      condition: service_healthy
    redis:
      condition: service_healthy
  volumes:
    - ./backend:/app/backend
    - ./ml:/app/ml
  command: uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload

Volume-mounting backend/ and ml/ means hot reload works inside Docker — change a use case, save, and the server restarts. The port mapping (5433:5432 for Postgres) avoids collisions with a local Postgres install.

What this buys you

Three concrete benefits I've already seen:

Test isolation — use cases are testable with a real async database session and no HTTP server. The test creates a GetNextProblemUseCase(db) directly. No FastAPI test client needed for business logic tests.
LLM swappability — GenerateInsightUseCase receives an LLMProvider protocol. In tests it gets StubProvider. In production it gets ClaudeProvider. The use case doesn't know which one it has.
Safe ML replacement — when BKT gives way to DKT, only numpath_ml changes. The use case calls the same select_next_problem() function with the same signature. The router doesn't change. The frontend doesn't change.

What I'd do differently

I wouldn't change the layering, but I'd add one thing from day one: a fitness function that statically checks import direction. Right now the rule is "use cases don't import routers" — but it's enforced by code review (i.e., me reviewing my own code). A linter rule or CI check that fails on from backend.routers inside use_cases/ would catch drift automatically.
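A sketch of what such a check could look like, using only the standard library's ast module. The forbidden prefixes and the directory layout are assumptions based on the names used in this post; a real version would need to match the actual package structure.

```python
import ast
from pathlib import Path

# Assumed outer-layer modules that use cases must never import.
FORBIDDEN_PREFIXES = ("backend.routers", "fastapi")


def forbidden_imports(source: str, forbidden=FORBIDDEN_PREFIXES) -> list[str]:
    """Return the forbidden module names imported by one source file."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the module itself or any submodule, but not lookalike names.
            if any(name == p or name.startswith(p + ".") for p in forbidden):
                found.append(name)
    return found


def check_directory(root: str) -> dict[str, list[str]]:
    """Map each offending file under root to its forbidden imports."""
    violations = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = forbidden_imports(path.read_text(encoding="utf-8"))
        if hits:
            violations[str(path)] = hits
    return violations
```

A test could then assert that check_directory() returns an empty dict for the use_cases/ directory, which turns the rule into something CI can fail on. This sketch matches absolute module names only; relative imports such as from ..routers import x would need extra handling.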

Key Takeaways

Dependency direction is the only architectural rule worth enforcing mechanically — inner layers never import outer layers; everything else is convention that erodes under deadline pressure
A monorepo is the right default for a solo research project: coordinating PRs across three repos is overhead without benefit when one person owns all layers and changes cut across them
Pure function boundaries between modules pay for themselves — the ML module exports two functions and a dataclass; the web layer translates between database rows and those pure structures, making the ML code replaceable without touching the application

From Bayesian to deep knowledge tracing — upgrading NumPath's student model with a PyTorch LSTM

Oscar Rieken — Mon, 01 Jun 2026 02:26:00 +0000

BKT told us how well a student knows subtraction-with-borrowing. It had no idea that a student who reverses digits on subtraction problems probably also reverses them on place value problems — because BKT treats every Knowledge Component as an island.

Deep Knowledge Tracing (DKT) fixes that. Instead of four independent scalar parameters per KC, it maintains a shared LSTM hidden vector across all KCs and learns the dependencies from data. This is Phase 3 of NumPath: swapping out the Markov model for a neural sequence model.

Here's what we built, the design decision that almost made us reach for a transformer, and the student simulator we had to build first to test it without any real students.

What We Built

Two components that feed each other:

Student simulator — five named personas that generate realistic attempt sequences for testing. Each persona has a per-KC accuracy curve and weighted mistake preferences drawn from the dyscalculia ITS literature:

Persona	SUB_BORROW accuracy	Characteristic errors
`ConfidentLearner`	0.80	Rare, careless (OFF_BY_TEN)
`StrugglingSUB`	0.35	Frequent BORROW_SKIP, slow timing
`PlaceValueGap`	0.60	DIGIT_REVERSAL across skill areas
`FrustrationLoop`	0.30	Fast random guessing
`FastMaster`	0.90	Near-zero mistakes

DKT model — a single-layer LSTM that takes a sequence of (skill, correctness) interactions and predicts P(correct on skill k) at each subsequent step.

model = DKTModel(n_skills=3, hidden_size=64)
state = model.initial_state()

# Student answers SUB_BORROW correctly
state = model.step(state, skill_idx=0, is_correct=True)

# Query mastery on any KC
p_mastery = model.predict(state, skill_idx=0)  # → float in (0, 1)

The Design Decision

Why not stay with BKT?

BKT's four parameters — p_mastery, p_learn, p_guess, p_slip — are per-KC and independent. A student who has DIGIT_REVERSAL on subtraction problems and DIGIT_REVERSAL on place value problems is modelled as having two unrelated problems. BKT cannot learn that these are the same underlying representational gap.

DKT's hidden state is shared. After the student makes a digit-reversal error on subtraction, the LSTM adjusts its hidden vector in a way that also shifts the place value prediction. It learns the cross-KC structure from data.
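For comparison, this is the textbook BKT update for a single KC. It is the standard formulation from the knowledge tracing literature, not NumPath's own update_bkt (whose code isn't shown here), and the class and function names are made up for the sketch. Note that nothing in it refers to any other KC, which is the independence assumption in code form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    """Per-KC parameters; p_mastery is tracked separately as the running state."""
    p_learn: float
    p_guess: float
    p_slip: float


def bkt_update(p_mastery: float, is_correct: bool, params: BKTParams) -> float:
    """Return the updated mastery estimate for one KC after one attempt."""
    if is_correct:
        evidence = p_mastery * (1 - params.p_slip)
        total = evidence + (1 - p_mastery) * params.p_guess
    else:
        evidence = p_mastery * params.p_slip
        total = evidence + (1 - p_mastery) * (1 - params.p_guess)
    posterior = evidence / total  # Bayes' rule on the observed response
    # Learning transition: an unmastered skill may become mastered after practice.
    return posterior + (1 - posterior) * params.p_learn
```

Updating SUB_BORROW this way leaves the PLACE_VALUE estimate untouched, no matter what kind of error the student made.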

Why not a transformer?

The sequence lengths we're working with are short — 10 to 30 attempts per session. Transformers need longer sequences to exploit their attention mechanism meaningfully. An LSTM is a better fit: it handles variable-length sequences natively, trains faster on small datasets, and produces interpretable per-step hidden states we can inspect.

More importantly: the Piech et al. (2015) DKT paper established LSTMs as the baseline for knowledge tracing. Improving on the baseline is Phase 4 work; Phase 3 is implementing it correctly.

The encoding

The input encoding follows Piech et al. exactly. At each step t, the input is a one-hot vector of size 2 × n_skills:

x[k]             = 1  if skill k was answered CORRECTLY
x[k + n_skills]  = 1  if skill k was answered INCORRECTLY

For three skills (SUB_BORROW=0, PLACE_VALUE=1, NUMBER_LINE=2), a correct subtraction answer encodes as:

[1, 0, 0,  0, 0, 0]
  ↑ correct half    ↑ incorrect half

An incorrect subtraction answer:

[0, 0, 0,  1, 0, 0]

The LSTM sees this 6-dimensional input and updates its hidden state. The output layer projects the hidden state back to 3 dimensions — one P(correct) per KC.
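The encoding is small enough to write out in full. This is a plain-Python sketch of the scheme described above; the function name is hypothetical, and the real model presumably builds a tensor rather than a list.

```python
def encode_interaction(skill_idx: int, is_correct: bool, n_skills: int) -> list[int]:
    """One-hot encode a (skill, correctness) pair into a vector of length 2 * n_skills."""
    if not 0 <= skill_idx < n_skills:
        raise ValueError(f"skill_idx must be in [0, {n_skills})")
    vector = [0] * (2 * n_skills)
    offset = 0 if is_correct else n_skills  # correct half first, incorrect half second
    vector[skill_idx + offset] = 1
    return vector


print(encode_interaction(0, True, 3))   # [1, 0, 0, 0, 0, 0]
print(encode_interaction(0, False, 3))  # [0, 0, 0, 1, 0, 0]
```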

The training objective

The model learns to predict the NEXT response from the current history. At step t, given the encoded interaction x_t, the LSTM outputs:

ŷ_t[k] = σ(W × h_t + b)[k]  =  P(student answers skill k correctly at t+1)

The loss at each step uses only the skill that was actually asked next:

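import torch
import torch.nn.functional as F

# At step t, the next question has skill_idx q and correctness r
target = torch.tensor([float(r)])
pred   = logits[0, t, q].unsqueeze(0)   # raw logit, before the sigmoid
loss_t = F.binary_cross_entropy_with_logits(pred, target)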

Training is one sequence at a time with Adam and gradient clipping. Small dataset — no need for batching yet.
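To make the "only the skill that was asked next" part concrete, here is a framework-free sketch of the per-sequence loss. It works on probabilities rather than logits, and averaging over steps is an assumption (summing would work equally well for a sketch); the function name is hypothetical.

```python
import math


def sequence_loss(predictions: list[list[float]], interactions: list[tuple[int, bool]]) -> float:
    """Mean binary cross-entropy over one sequence (needs at least two interactions).

    predictions[t][k] is the predicted P(correct on skill k at step t + 1).
    interactions[t] is the (skill_idx, is_correct) pair observed at step t.
    """
    losses = []
    for t in range(len(interactions) - 1):
        next_skill, next_correct = interactions[t + 1]
        p = predictions[t][next_skill]  # every other skill's prediction is ignored
        losses.append(-math.log(p) if next_correct else -math.log(1.0 - p))
    return sum(losses) / len(losses)
```

The last interaction in a sequence has no "next" response, so a sequence of n interactions contributes n − 1 loss terms.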

Why the simulator came first

We can't train DKT on real data until the pilot delivers ≥150 attempt records. But we can validate the architecture right now with the student simulator.

The final integration test runs both pipelines end to end:

Generate 30 sequences from StrugglingSUB (35% accuracy on SUB_BORROW)
Generate 30 sequences from FastMaster (90% accuracy on SUB_BORROW)
Train a separate DKT model on each persona's sequences (two models in total)
Simulate 6 practice steps with each model
Assert FastMaster's model predicts higher mastery than StrugglingSUB's

# From test_dkt.py
fast_mastery = mastery_after_steps(result_fast.model,
    [True, True, True, True, True, False])    # 5/6 correct

struggling_mastery = mastery_after_steps(result_struggling.model,
    [True, False, False, True, False, False]) # 2/6 correct

assert fast_mastery > struggling_mastery      # ✓ passes

This gives us confidence the model learns the right signal before we hand it real children's data.
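The persona sequences for a test like this could be generated along these lines. It is a deliberately simplified sketch: it uses the flat SUB_BORROW accuracies from the persona table instead of per-KC accuracy curves, it ignores mistake preferences and timing, and the function name is hypothetical.

```python
import random

# SUB_BORROW accuracies quoted in the persona table.
PERSONA_ACCURACY = {
    "ConfidentLearner": 0.80,
    "StrugglingSUB": 0.35,
    "PlaceValueGap": 0.60,
    "FrustrationLoop": 0.30,
    "FastMaster": 0.90,
}


def simulate_attempts(persona: str, n_attempts: int, seed: int = 0) -> list[bool]:
    """Draw n_attempts independent correct/incorrect outcomes for one persona."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    accuracy = PERSONA_ACCURACY[persona]
    return [rng.random() < accuracy for _ in range(n_attempts)]
```

Seeding the generator is what makes the simulator usable as a test fixture: the same persona and seed always produce the same sequence.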

Why It Matters for the Research

BKT's independence assumption is a known limitation in the ITS literature. It was acceptable for Phase 1 and 2 because we didn't have cross-KC interaction data. Now that the mistake classifier is generating BORROW_SKIP and DIGIT_REVERSAL events consistently, we have a sequence model that can learn from them.

The specific research claim that DKT enables: a student's error pattern on one KC predicts their likely error pattern on a related KC. If DKT learns this and BKT doesn't, that's measurable evidence that the LSTM captures structure that the Markov model misses — and a direct contribution to the Phase 4 RCT analysis.

The upgrade path is explicit:

Pilot delivers ≥150 attempts
train_dkt(sequences_from_db) on the full dataset
Evaluate against BKT's predictions using held-out sessions
Replace update_bkt in SubmitAttemptUseCase when DKT's per-KC accuracy exceeds BKT's

The ADR for this transition is on the backlog.

What We Learned

The student simulator is the missing test fixture for ITS research. Standard software testing assumes you can construct any input you need. In adaptive tutoring, your input is a real child's learning trajectory. The simulator bridges that gap — it's not a replacement for real data, but it lets you test that the model responds in the right direction before you commit to an ethical review and a cohort of participants.

BKT and DKT coexist cleanly at the domain layer. KCState stays unchanged. DKTState is a separate dataclass with a different shape. The backend currently uses KCState; swapping in DKTState is an interface change at SubmitAttemptUseCase and GetNextProblemUseCase — two files, no schema migration.

Gradient clipping mattered more than I expected. Early training runs without clip_grad_norm_ diverged on the frustration-loop persona (all-incorrect sequences). Clipping at max_norm=1.0 stabilised training across all five personas.

What's Next

Backend wiring: load the trained DKT model at startup, store hidden state vectors in Redis per student, and swap the two use cases. That's the integration step that puts DKT into the live adaptive loop.

Key Takeaways

DKT's shared LSTM hidden state captures cross-KC dependencies that BKT's independent scalar parameters cannot — a student with DIGIT_REVERSAL on subtraction is more likely to have it on place value, and DKT learns this from data
Build the student simulator before the model: testing an adaptive learning architecture requires synthetic student trajectories, and the simulator lets you validate directional correctness before any ethics review or pilot recruitment
LSTM beats transformer for short sequences (10–30 steps): attention needs length to work; LSTMs handle variable-length sequences natively and train faster on the small datasets typical of ITS research

Building a mistake taxonomy for dyscalculia — 8 error patterns, rule-based, no ML required

Oscar Rieken — Mon, 01 Jun 2026 02:14:58 +0000

"Wrong" isn't a diagnosis.

When a student answers 32 − 9 = 37, they didn't randomly guess. They subtracted in the wrong direction in the ones column — a specific, named error called a borrow-skip. A tutor that just marks it incorrect and moves on has wasted the most informative signal in the attempt: why the student got it wrong.

NumPath's Phase 2 mistake classifier turns wrong answers into structured MistakeEvent records. Here's how we built it, what we got wrong the first time, and why rule-based classifiers beat a neural network for this job at this stage.

What We Built

Eight rule-based classifiers covering all three of NumPath's Phase 1 skill areas:

Code	Skill	Pattern
`DIGIT_REVERSAL`	SUB_BORROW / NUMBER_LINE	2-digit answer with digits transposed
`WRONG_OPERATION`	SUB_BORROW	Student added instead of subtracted
`BORROW_SKIP`	SUB_BORROW	Ones subtracted in reverse — no borrow taken
`OFF_BY_TEN`	SUB_BORROW	Result ±10 from correct (borrow applied to wrong column)
`PLACE_VALUE_CONFUSION`	PLACE_VALUE	Compared units digits only, ignored tens
`MAGNITUDE_MISJUDGE`	PLACE_VALUE	Chose the smaller number as larger
`NUMBER_LINE_DIRECTION`	NUMBER_LINE	Said "left" when answer is "right"
`OFF_BY_ONE`	NUMBER_LINE	Numeric answer ±1 from correct (miscounted steps)

Each classifier is a pure Python predicate — no external dependencies, no DB imports, testable in isolation. The main function runs them in priority order and returns the first match.

The Design Decision

The first question was: classify with rules or train a model?

The case for rules: we don't have labelled training data yet. Phase 1 just shipped. We have zero MistakeEvent records. Training a classifier on nothing produces nothing.

The case for ML: rules are brittle. A student might make a novel error we didn't anticipate, and rule-based code silently returns None.

We went with rules for Phase 2 because the error patterns for dyscalculia are well-documented in the ITS literature — specifically in the work of VanLehn (1982) on subtraction bugs and the later SIERRA system. "Borrow-skip" and "digit reversal" aren't our taxonomy; they're 40-year-old findings from cognitive science. A rule that detects them is more reliable than a model trained on 150 attempts.

The ML path opens in Phase 3 once the mistake_events table has enough volume. The rule-based classifier generates the labelled training data that Phase 3 will learn from.

The BORROW_SKIP bug

The Phase 1 classifier had a BORROW_SKIP function. It was wrong.

# Phase 1 — incorrect
def _is_borrow_skip(problem_content: dict, given: str) -> bool:
    a, b = problem_content["operands"]
    no_borrow_result = str(a + b)    # ← adds a + b, not the borrow-skip result
    return given == no_borrow_result

This detected addition (32 − 9 → 41) and called it BORROW_SKIP. But addition is a completely different error — confusing +/− signs, not misapplying the borrowing algorithm. The mistake was labelled wrong in every event record.

The real borrow-skip pattern: when ones(a) < ones(b), the student skips borrowing and instead subtracts in the wrong direction in the ones column.

For 32 − 9:

Correct: borrow a ten → 12 − 9 = 3 ones, 2 tens → 23
Borrow-skip: ones = 9 − 2 = 7, tens = 3 (unchanged) → 37

# Phase 2 — correct
def _is_borrow_skip(a: int, b: int, given: str) -> bool:
    ones_a, ones_b = a % 10, b % 10
    if ones_a >= ones_b:
        return False  # no borrow needed — pattern doesn't apply
    borrow_skip_result = (a // 10 - b // 10) * 10 + (ones_b - ones_a)
    return given == str(borrow_skip_result)

Verified: 32 − 9 → 37 ✓, 43 − 18 → 35 ✓, 31 − 14 → 23 ✓

The old code was shipping the wrong signal for every borrow-skip attempt. This is exactly why MistakeEvent records are useless until the classifier is correct — the adaptive engine was routing "borrow-skip" students to the wrong remediation path.

The priority ordering problem

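Multiple patterns can fire for the same wrong answer. Often only one does: for 43 − 16 = 27, a student who writes 72 matches DIGIT_REVERSAL (27 reversed) and nothing else. Priority ordering matters when patterns genuinely overlap. For 31 − 16 = 15, the answer 25 matches both BORROW_SKIP (ones reversed: 6 − 1 = 5; tens: 3 − 1 = 2) and OFF_BY_TEN (15 + 10), so the order decides which label is recorded.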

The classifier runs a hierarchy:

subtraction problems:
  1. DIGIT_REVERSAL    ← most specific free-form error
  2. WRONG_OPERATION   ← added instead of subtracted
  3. BORROW_SKIP       ← skipped borrowing algorithm
  4. OFF_BY_TEN        ← borrow applied to wrong column

place_value problems (multiple-choice — no free-form digit writing):
  1. PLACE_VALUE_CONFUSION  ← compared units digits only (more specific)
  2. MAGNITUDE_MISJUDGE     ← picked the smaller number (less specific)

number_line problems:
  1. NUMBER_LINE_DIRECTION  ← wrong direction word
  2. DIGIT_REVERSAL         ← transposed digits in numeric answer
  3. OFF_BY_ONE             ← miscounted steps

Place value problems are multiple-choice, so DIGIT_REVERSAL doesn't apply there — the student picks from a given set, they don't write digits freely. Scoping by problem type prevents false positives.

Why It Matters for the Research

Every MistakeEvent record becomes a training signal twice:

Now: the adaptive engine reads the last MISTAKE_WINDOW (3) events. Two BORROW_SKIP codes in a row trigger remediation mode: the engine drops difficulty and targets SUB_BORROW problems specifically. Correct classification means correct remediation.

Later: Phase 3 will train a logistic regression (and eventually a transformer) on the mistake events table. The rule-based classifier generates the initial labelled dataset. If the rules are wrong — as BORROW_SKIP was — the ML model learns the wrong pattern from poisoned labels.
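The "Now" rule is simple enough to sketch. MISTAKE_WINDOW is the value quoted above; the function name and the assumption that events arrive ordered oldest to newest are mine.

```python
MISTAKE_WINDOW = 3  # value quoted in the post


def should_remediate(recent_codes: list[str], code: str = "BORROW_SKIP") -> bool:
    """True if the last MISTAKE_WINDOW events contain `code` twice in a row.

    recent_codes is assumed to be ordered oldest to newest.
    """
    window = recent_codes[-MISTAKE_WINDOW:]
    return any(first == second == code for first, second in zip(window, window[1:]))
```

This is also where a mislabelled event does its damage: if the classifier emits BORROW_SKIP for what was really an addition error, the rule fires for the wrong students.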

For a dyscalculia intervention study, this matters more than it would in a general tutoring system. Dyscalculia-specific errors like borrow-skip and digit reversal appear in the ITS literature as distinct cognitive profiles. Getting them right means the model can eventually distinguish students who have a procedural gap (BORROW_SKIP) from students who have a representational gap (PLACE_VALUE_CONFUSION) — a distinction that should affect the instructional intervention.

What We Learned

Rule-based classifiers need domain literature, not just intuition. The original BORROW_SKIP implementation was plausible — "student added instead of subtracting" — but wrong. VanLehn's subtraction bug taxonomy makes the actual pattern explicit. Reading the paper would have saved months of mislabelled data.

Priority ordering is a design document. The order in which classifiers run encodes assumptions about what matters more. We chose "most specific fires first" — but that could be wrong. Maybe WRONG_OPERATION (a conceptual error) should always beat DIGIT_REVERSAL (a transcription error) regardless of specificity, because they imply different interventions. We don't have the data to answer that yet.

50 tests is the right investment for a classifier that labels training data. A wrong label propagates forward through every model that trains on it. Testing every predicate in isolation, including priority ordering and edge cases, is not over-engineering — it's protecting the integrity of the entire data pipeline.

What's Next

The mistake_events table is now correctly populated with each session. Once the pilot delivers ≥150 records, Phase 3 can fit a logistic regression on the labelled events — using the rule-based codes as ground truth — and eventually replace the rules with a model that generalises to error patterns we haven't seen yet.

Key Takeaways

Rule-based mistake classifiers are the right first step when training data doesn't exist yet — they generate the labelled dataset that trains the eventual ML model
The real borrow-skip pattern (subtract ones in reverse: 32−9=37) is different from wrong-operation (add instead of subtract: 32+9=41) — getting this wrong poisons every downstream model that trains on the events table
Classifier priority ordering is a design decision that encodes instructional theory; document it explicitly and treat it as something to validate with data

Building a FastAPI + Vue 3 research platform: the 4 bugs that almost broke Phase 1

Oscar Rieken — Mon, 01 Jun 2026 02:11:33 +0000

Phase 1 of NumPath is done. Seven of eight Definition of Done items are checked — the eighth requires real children completing pilot sessions, which no amount of code will substitute for. The stack runs cleanly in Docker Compose, 56 unit tests pass, and a student can log in, answer ten problems, and see their knowledge state update in real time.

What the commit history doesn't show is the afternoon I spent fighting four bugs that don't appear in any FastAPI or Vue tutorial. This post is that afternoon.

What We Built

NumPath is an adaptive math tutor for children with dyscalculia. Phase 1 ships the minimum research instrument: a student practice loop, a rule-based adaptive engine, and a read-only teacher dashboard. No ML yet — just clean infrastructure and a data collection pipeline capable of generating the 150+ attempt records that Phase 2 needs to train the BKT model.

The stack: FastAPI 0.110 + SQLAlchemy 2 + Alembic + asyncpg on the backend; Vue 3 + Tailwind + Pinia on the frontend; PostgreSQL 16 + Redis 7 in Docker Compose.

Bug 1: passlib AttributeError on bcrypt ≥4.0

The symptom was immediate on first login attempt:

AttributeError: module 'bcrypt' has no attribute '__about__'

passlib has a version check that reads bcrypt.__about__.__version__. bcrypt 4.0 removed the __about__ module. The libraries have been incompatible for two years and passlib is effectively unmaintained.

The fix: delete passlib entirely. Replace it with two small functions that call bcrypt directly:

# backend/auth/password.py
import bcrypt

def hash_password(plain: str) -> str:
    return bcrypt.hashpw(plain.encode(), bcrypt.gensalt()).decode()

def verify_password(plain: str, hashed: str) -> bool:
    return bcrypt.checkpw(plain.encode(), hashed.encode())

pyproject.toml: swap "passlib[bcrypt]>=1.7.4" for "bcrypt>=4.0.0". Done. Don't reach for passlib on new Python projects — the dependency is dead.

Bug 2: pnpm 10 security policies blocking Docker builds

The frontend Dockerfile used node:20-slim and installed the latest pnpm via corepack. When pnpm 10 shipped, the build started failing with:

ERR_PNPM_PREPARE_PKG_FAILURE  Error when preparing the package
 Blocked by policy: electron-to-chromium@1.5.134 is not allowed
 because it was released 0 days ago (policy: minimumReleaseAge=3 days)

pnpm 10 introduced release-age security policies that refuse to install packages published within the last N days. A reasonable feature in production — a CI-breaking surprise when your lock file pins a package that was published yesterday.

Two separate policies hit us: minimumReleaseAge and ignored-builds (which blocks esbuild and vue-demi unless explicitly allowed). The package.json "pnpm" field that's supposed to configure these policies is silently ignored in pnpm 10 — it logs a warning and reads nothing.

The fix: pin to pnpm 9:

FROM node:22-slim
RUN corepack enable && corepack prepare pnpm@9.15.9 --activate

pnpm 9 has no release-age policies. The upgrade to pnpm 10 can wait until the project has a proper CI environment to absorb the breaking change.

Bug 3: FastAPI container connecting to localhost instead of postgres

The backend started cleanly. Every database call returned:

asyncpg.exceptions.ConnectionRefusedError: connection refused (host 127.0.0.1, port 5432)

The DATABASE_URL in .env was postgresql+asyncpg://numpath:numpath@localhost:5432/numpath. Inside a Docker Compose network, localhost is the container's own loopback — not the postgres service. The postgres container is reachable by its service name.

The fix: override the env var at the service level in docker-compose.yml:

backend:
  env_file: ../.env
  environment:
    DATABASE_URL: postgresql+asyncpg://numpath:numpath@postgres:5432/numpath
    REDIS_URL: redis://redis:6379/0

The environment block wins over env_file, so local development (which uses localhost) keeps working. Containers talk to each other by service name.
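From the application's side the override is invisible: the backend reads one variable and Compose decides its value. A sketch of that lookup, assuming the backend falls back to the localhost URL when DATABASE_URL is unset (the function name and the fallback are hypothetical, not the project's actual settings module):

```python
import os
from typing import Mapping
from urllib.parse import urlsplit

# Local-development default, matching the URL quoted from .env above.
LOCAL_DATABASE_URL = "postgresql+asyncpg://numpath:numpath@localhost:5432/numpath"


def resolve_database_url(environ: Mapping[str, str] = os.environ) -> str:
    """Return DATABASE_URL from the process environment, else the local default."""
    return environ.get("DATABASE_URL", LOCAL_DATABASE_URL)


# Inside the container, Compose's environment block supplies the service name.
container_env = {"DATABASE_URL": "postgresql+asyncpg://numpath:numpath@postgres:5432/numpath"}
print(urlsplit(resolve_database_url(container_env)).hostname)  # postgres
print(urlsplit(resolve_database_url({})).hostname)             # localhost
```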

Bug 4: SQLAlchemy column defaults not applied at construction time

This one cost the most time. POST /attempts returned a 500:

TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

The BKT update equation was subtracting a BKT parameter from 1 (as in 1 - p_slip), and that parameter was None. The KCStateRecord model had:

class KCStateRecord(Base):
    p_learn: Mapped[float] = mapped_column(Float, default=0.3)
    p_guess: Mapped[float] = mapped_column(Float, default=0.2)
    p_slip:  Mapped[float] = mapped_column(Float, default=0.1)

The bug: SQLAlchemy's default= is a client-side default that is applied when the INSERT is emitted at flush time, not when the object is constructed (server_default= is the database-side variant). When you construct KCStateRecord() in Python and haven't flushed to the database yet, those columns are None on the Python object. The domain code ran immediately after construction, before any flush.

The fix: set defaults explicitly in the constructor, then flush and refresh before returning:

record = KCStateRecord(
    student_id=student_id,
    skill_id=skill_id,
    p_mastery=0.1,
    p_learn=0.3,
    p_guess=0.2,
    p_slip=0.1,
    opportunity_count=0,
)
self._db.add(record)
await self._db.flush()         # write to DB so defaults are applied
await self._db.refresh(record) # re-read the DB-populated values

The rule: if you use a newly constructed SQLAlchemy model object before any flush, assume every default= column is None. Either set defaults in the constructor or flush first.
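One way to make that rule fail fast instead of surfacing as a TypeError deep in the math is a guard at the boundary of the domain code. This is a sketch under assumptions: KCParams is a plain stand-in for the ORM record, and both helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

BKT_FIELDS = ("p_mastery", "p_learn", "p_guess", "p_slip")


@dataclass
class KCParams:
    """Stand-in for an ORM record whose columns are still None before a flush."""
    p_mastery: Optional[float] = None
    p_learn: Optional[float] = None
    p_guess: Optional[float] = None
    p_slip: Optional[float] = None


def missing_bkt_params(record: object) -> list[str]:
    """List the BKT fields that are unset (None or absent) on the record."""
    return [name for name in BKT_FIELDS if getattr(record, name, None) is None]


def require_bkt_params(record: object) -> None:
    """Raise a descriptive error before any arithmetic touches a None value."""
    missing = missing_bkt_params(record)
    if missing:
        raise ValueError(f"BKT parameters not set: {', '.join(missing)}")


print(missing_bkt_params(KCParams(p_mastery=0.1)))  # ['p_learn', 'p_guess', 'p_slip']
```

Calling require_bkt_params() at the top of the update path turns a silent construction-order mistake into an error message that names the missing columns.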

What the BKT update looks like in practice

With those bugs cleared, the full attempt flow works end to end. A correct answer on a SUB_BORROW problem with a fresh KCState shows:

before: p_mastery=0.100, opportunity_count=0
after:  p_mastery=0.533, opportunity_count=1

That 0.1 → 0.533 jump is the Bayesian update working — prior p_mastery combines with p_learn, corrected for p_guess and p_slip. The math is covered in detail in Bayesian Knowledge Tracing in 37 lines of Python.
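The arithmetic behind that jump can be checked by hand. This sketch plugs the seed values from the constructor above (p_mastery=0.1, p_learn=0.3, p_guess=0.2, p_slip=0.1) into the two BKT steps for a correct answer; the variable names are illustrative.

```python
p, learn, guess, slip = 0.10, 0.30, 0.20, 0.10

# Step 1: Bayes update after observing a correct answer.
posterior = (p * (1 - slip)) / (p * (1 - slip) + (1 - p) * guess)

# Step 2: learning transition applied to the posterior.
p_new = posterior + (1 - posterior) * learn

print(f"posterior={posterior:.3f} p_new={p_new:.3f}")  # posterior=0.333 p_new=0.533
```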

Why It Matters for the Research

Phase 1's job was never to be elegant — it was to be instrumented. Every attempt record written to the attempts table is a training signal for Phase 2's BKT parameter estimation. We need ≥150 records (5 students × 3 sessions × 10+ problems) before Phase 2 can begin.

The bugs above are why research-grade software is harder than it looks. Each one fails in a different way: password hashing fails outright on the first login (detectable), the pnpm policy breaks the image build (detectable), Docker networking refuses every database connection from inside the container (detectable, but only once the containers are running), and SQLAlchemy defaults produce None BKT parameters (corrupts ML inputs, hard to detect in test data).

The fix for all of them is the same: run the full stack. Not unit tests. Not import my_function; print(my_function()). Start the containers, log in as a real user, and watch what happens.

What We Learned

The honest retrospective:

Seed data is harder than it looks. Writing 60 hand-crafted math problems at three difficulty levels takes longer than writing the adaptive engine. Every problem needs a machine-checkable answer, a hint, and a calibrated difficulty score.

Docker Compose env_file + environment is the right pattern. env_file carries the defaults; environment carries container-specific overrides. The pattern is obvious in hindsight and invisible until you need it.

The flush() + refresh() pattern is load-bearing for async SQLAlchemy. Any code that creates an ORM object and immediately passes it to domain logic needs either an explicit flush or explicit constructor values. Autoflush only runs before a query is emitted, so nothing populates default= columns between construction and the first domain call.

What's Next

Phase 2: BKT parameter estimation from real student data, and a mistake classifier that categorises subtraction errors beyond "wrong." The attempts table is waiting.

Key Takeaways

passlib is dead: use bcrypt directly; it's two short functions and no transitive dependency risk
Docker Compose containers reach each other by service name, not localhost; override DATABASE_URL in the environment block rather than the env_file
SQLAlchemy default= columns are None on a freshly constructed Python object until after a flush() + refresh() — always set constructor defaults explicitly when domain code runs immediately after creation

Bayesian Knowledge Tracing in 37 lines of Python — how NumPath models what a student knows

Oscar Rieken — Wed, 27 May 2026 05:23:15 +0000

What We Built

NumPath maintains a KCState for every student × Knowledge Component pair. After every attempt, update_bkt() revises the probability that the student has mastered that KC. That probability — p_mastery — is what the adaptive engine reads to pick the next problem and what the teacher dashboard displays as a progress bar.

The entire model is 37 lines. Here it is unabridged.

from dataclasses import dataclass

MASTERY_THRESHOLD = 0.80

@dataclass(frozen=True)
class KCState:
    p_mastery: float
    p_learn: float
    p_guess: float
    p_slip: float
    opportunity_count: int = 0

    @property
    def is_mastered(self) -> bool:
        return self.p_mastery >= MASTERY_THRESHOLD


def update_bkt(state: KCState, is_correct: bool) -> KCState:
    """Standard Bayesian Knowledge Tracing update (Corbett & Anderson, 1995)."""
    p, L, G, S = state.p_mastery, state.p_learn, state.p_guess, state.p_slip

    if is_correct:
        posterior = (p * (1 - S)) / (p * (1 - S) + (1 - p) * G)
    else:
        posterior = (p * S) / (p * S + (1 - p) * (1 - G))

    p_new = posterior + (1 - posterior) * L

    return KCState(
        p_mastery=min(1.0, max(0.0, p_new)),
        p_learn=L,
        p_guess=G,
        p_slip=S,
        opportunity_count=state.opportunity_count + 1,
    )

The Four Parameters

BKT models each KC with four parameters, all probabilities between 0 and 1:

Parameter	Meaning	NumPath default
`p_mastery`	P(student has learned this KC)	0.10 (prior — low, conservative)
`p_learn`	P(learning occurs on this attempt, given not yet learned)	0.30
`p_guess`	P(correct answer given KC not learned)	0.20
`p_slip`	P(incorrect answer given KC is learned)	0.10

These are Phase 1 seed values — not calibrated against real student data yet. The parameter estimation problem (fitting p_learn, p_guess, p_slip per KC from observed attempts) is a Phase 4 task once the RCT produces enough data. For now they are reasonable priors from the BKT literature.

The Update Equations

After observing an answer, two steps happen.

Step 1 — Bayesian update (prior → posterior):

Correct:   posterior = p(1 - S) / [p(1 - S) + (1 - p)G]
Incorrect: posterior = pS       / [pS       + (1 - p)(1 - G)]

This is straight Bayes. A correct answer raises the posterior, but by less when a lucky guess is plausible. An incorrect answer lowers it, but by less when a slip is plausible. A correct answer from a student with p_mastery=0.95 and p_slip=0.10 barely moves the needle, because the model already thinks they know it. A correct answer from a student with p_mastery=0.10 and p_guess=0.20 moves it less than you might expect, because the model discounts lucky guesses.

Step 2 — Learning update (posterior → next prior):

p_new = posterior + (1 - posterior) × p_learn

Even if the student answered incorrectly, there's a p_learn probability that learning occurred anyway. The posterior is never the final state — the learning update always nudges p_mastery upward slightly, reflecting that every attempt is an opportunity.
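A compact restatement of the two steps, written over plain floats so the cases discussed above can be compared side by side. The function name bkt_step is hypothetical, and the parameters are the Phase 1 seed values from the table.

```python
def bkt_step(p: float, learn: float, guess: float, slip: float,
             is_correct: bool) -> tuple[float, float]:
    """Return (posterior, p_new) for one observed attempt."""
    if is_correct:
        posterior = (p * (1 - slip)) / (p * (1 - slip) + (1 - p) * guess)
    else:
        posterior = (p * slip) / (p * slip + (1 - p) * (1 - guess))
    # The learning transition is applied regardless of correctness.
    return posterior, posterior + (1 - posterior) * learn


for prior, is_correct in [(0.95, True), (0.10, True), (0.10, False)]:
    posterior, p_new = bkt_step(prior, 0.30, 0.20, 0.10, is_correct)
    print(f"prior={prior:.2f} correct={is_correct} "
          f"posterior={posterior:.3f} p_new={p_new:.3f}")
# prior=0.95 correct=True posterior=0.988 p_new=0.992
# prior=0.10 correct=True posterior=0.333 p_new=0.533
# prior=0.10 correct=False posterior=0.014 p_new=0.310
```

The last row is worth noticing: with p_learn=0.30, an incorrect answer from a low prior still ends at a higher p_mastery than it started, because the learning transition outweighs the drop in the posterior.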

The Design Decision

We evaluated three approaches before choosing standard BKT:

Item Response Theory (IRT) — models item difficulty as well as student ability. More expressive, but requires calibrated item parameters we don't have. Rejected for Phase 1.

Deep Knowledge Tracing (DKT) — replaces the parametric model with an LSTM that learns latent student state from sequences of attempts. Better at capturing cross-KC transfer. Rejected for Phase 1 because it needs training data we haven't collected yet. It's on the Phase 2 roadmap.

Accuracy streak — raise difficulty after 3 correct in a row, lower after 3 wrong. This is what most commercial apps do. Rejected because it gives you no probability estimate, no per-KC granularity, and no way to distinguish a guesser from a learner.

Standard BKT is 30 years old and still the right choice when you're instrument-building before data collection. It gives you a per-KC probability estimate with interpretable parameters, it's fast to compute, and its failure modes are well understood.

One implementation choice worth noting: KCState is a frozen dataclass. update_bkt() returns a new KCState rather than mutating the existing one. This makes the update function a pure function — easy to test, easy to replay, and safe to call in parallel if we ever need to.

Why It Matters for the Research

The RCT compares learning outcomes for students using NumPath against a control group using static worksheets. To measure a difference, you need a measurement instrument. p_mastery is that instrument.

After a session, the teacher dashboard shows each student's p_mastery per KC as a progress bar. The adaptive engine uses it to pick the next problem. The LLM insight generator reads it to produce explanations like "Aiden's p_mastery on SUB_BORROW is 0.18 — the model has seen 11 attempts and is not converging." All three downstream consumers depend on the same number being meaningful.

BKT's key property for research purposes: it's falsifiable. If a student's p_mastery stays low after 20 attempts, that's a signal worth investigating: either the parameters are wrong, or the student is consistently guessing, or there's a measurement problem. An accuracy percentage doesn't give you that.

What We Learned

The model is simple. Getting the parameters right is not.

p_learn=0.30 means the student has a 30% chance of learning the KC on any given attempt. That sounds reasonable. But it implies that after 10 attempts, a student who has not yet learned the KC has a 97% cumulative chance of learning it — which is almost certainly too optimistic. The seed parameters will need calibration.
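The 97% figure follows from treating each attempt as an independent chance to learn, which is the learning transition on its own with the Bayes step ignored. A quick check, with a hypothetical helper name:

```python
def cumulative_learn_probability(p_learn: float, attempts: int) -> float:
    """P(learned at least once in `attempts` tries) = 1 - (1 - p_learn) ** attempts."""
    return 1 - (1 - p_learn) ** attempts


for attempts in (1, 3, 5, 10):
    print(attempts, f"{cumulative_learn_probability(0.30, attempts):.3f}")
# 1 0.300
# 3 0.657
# 5 0.832
# 10 0.972
```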

The other thing we learned: opportunity_count is load-bearing. The adaptive engine uses it as a tiebreaker and the teacher dashboard shows it alongside p_mastery. It's not computed from the BKT model — it's just a counter that increments on every update_bkt() call. The frozen dataclass pattern makes this safe: the count in the database is always the count from the last update_bkt() return value, never a stale mutation.

What's Next

Phase 2 adds a DKT model alongside BKT — trained on the data collected during the pilot. The two models will run in parallel so we can compare their predictions against observed outcomes before the RCT begins.

Key Takeaways

BKT separates learning from performance — p_guess and p_slip let the model discount lucky correct answers and unlucky wrong ones; a 70% accuracy rate means something different depending on what the model thinks caused it
p_mastery is the measurement instrument for the RCT — every downstream consumer (adaptive engine, teacher dashboard, LLM insights) reads the same number, so getting it right matters more than getting it fast
Frozen dataclass + pure function = safe update chain — update_bkt() returns a new KCState; there's no shared mutable state, the update is replayable, and the test suite can verify every case in isolation

Two Cross-Platform Bugs in Our Go CLI (And How We Fixed Them)

Oscar Rieken — Wed, 27 May 2026 05:22:56 +0000

Go's cross-platform story is genuinely good. Write code once, compile for any target, and it mostly just works. But "mostly" hides a couple of sharp edges that bit us while building TestSmith. Both bugs were invisible on macOS and Linux, surfaced only on Windows CI, and came from the same kind of mistake: implicit assumptions about filesystem traversal boundaries and path separators.

Bug 1: The Detector Boundary Escape

TestSmith has five language drivers, each responsible for detecting whether a directory is a project of its type. The Python driver walks upward from the starting directory, looking for pyproject.toml or setup.py. The Go driver looks for go.mod. And so on.

The bug: every driver would happily walk past a .git directory belonging to a different project and claim files in an ancestor project.

Here's what happened in practice. Our example projects live at:

testsmith/                  ← Go repo root (.git here)
  examples/
    python-service/          ← Python example project
      pyproject.toml

When you ran testsmith generate from inside examples/python-service/, the Python driver would detect it correctly. But when you ran it from examples/go-service/ and the Python driver was tried first during registry detection, it would walk upward, find no Python markers in go-service/, then continue upward, find no markers in examples/, then continue upward... find conftest.py at the testsmith repo root (left over from a previous test run), and claim the entire testsmith repo as a Python project.

The naive fix is "stop when you see .git." But that's wrong too — a legitimate project root can have both pyproject.toml and a .git directory. If you stop at the first .git you see, you'd refuse to detect projects that are also VCS roots.

The correct rule: check VCS stop markers only at ancestor directories, not at the starting directory.

func findRoot(startDir string) (string, error) {
    dir := startDir
    for {
        // Only check VCS boundaries at ancestor dirs — the starting dir
        // may legitimately have both a project marker and a .git directory.
        if dir != startDir {
            for _, stop := range stopMarkers {
                if _, err := os.Stat(filepath.Join(dir, stop)); err == nil {
                    return "", domain.ErrProjectNotFound
                }
            }
        }
        for _, marker := range rootMarkers {
            if _, err := os.Stat(filepath.Join(dir, marker)); err == nil {
                return dir, nil
            }
        }
        parent := filepath.Dir(dir)
        if parent == dir {
            break
        }
        dir = parent
    }
    return "", domain.ErrProjectNotFound
}

We applied this pattern to all five drivers. The key insight: .git is a traversal-stopping sentinel when found in an ancestor, but it's perfectly normal at the project root itself.

Bug 2: Hardcoded Path Separators

The Windows test failure was more direct:

    analyzer_test.go:179: DeriveTestPath("/proj/src/services/payment.py"):
        got "\\proj\\tests\\src\\services\\test_payment.py",
        want "/proj/tests/services/test_payment.py"

Two things wrong in that output: backslashes (expected on Windows, handled by the test via filepath.ToSlash), and src appearing in the output path when it should have been stripped.

The code:

func deriveTestPath(sourcePath string, ctx *domain.ProjectContext) (string, error) {
    rel, err := filepath.Rel(ctx.Root, sourcePath)
    if err != nil {
        return "", err
    }

    // Strip src/ prefix if present.
    if len(rel) > 4 && rel[:4] == "src/" {  // ← BUG
        rel = rel[4:]
    }
    // ...
}

filepath.Rel on Windows returns src\services\payment.py. The prefix check looks for src/ with a forward slash. On Windows, it never matches. The src component stays in the path, so the output becomes tests\src\services\test_payment.py instead of tests\services\test_payment.py.

The fix normalises to forward slashes before the check:

// Normalise to forward slashes for the prefix check so this works on
// Windows (where filepath.Rel returns backslash-separated paths).
slashed := filepath.ToSlash(rel)
if strings.HasPrefix(slashed, "src/") {
    slashed = slashed[4:]
}
rel = filepath.FromSlash(slashed)

filepath.ToSlash converts \ to /. strings.HasPrefix(slashed, "src/") works correctly on all platforms. filepath.FromSlash converts back to the OS-native separator for the subsequent filepath.Join call.

The same pattern applied to deriveModulePath, which had the identical bug.

The Pattern

Both bugs share a structure: an algorithm that works correctly on the development platform (macOS/Linux) but silently produces wrong results on Windows because it makes assumptions about the filesystem:

Bug 1: assumes .git presence implies "not a project root" (wrong at the starting dir)
Bug 2: assumes filepath.Rel uses forward slashes (wrong on Windows)

The remedies are similarly structured:

Bug 1: be explicit about which directories the invariant applies to
Bug 2: normalise to a known format before string operations, then convert back for OS operations

Go's filepath package is excellent — filepath.Rel, filepath.Join, filepath.Dir, filepath.Base all do the right thing. The problems arise when you mix filepath results with hardcoded string literals (like "src/") that embed platform assumptions. The rule: use filepath functions for path operations, filepath.ToSlash to convert before any string matching, and filepath.FromSlash to convert back before passing to OS calls.

CI as the Detector

Neither bug would have been caught by running tests locally on macOS. The Windows CI job was the only place they surfaced.

This is the case for a real cross-platform test matrix. It's not just about supporting Windows users — it's about finding any code that makes implicit platform assumptions. If your tests only run on one platform, that class of bug is invisible until a user reports it.

TestSmith is open source at github.com/orieken/testsmith. The full CI matrix runs on Ubuntu, macOS, and Windows with -race enabled on all three.