What We Built
NumPath now generates a teacher insight on demand: click "Generate insight" on any student's panel and the system reads that student's KC mastery states plus their 10 most recent attempts, sends them to Claude, and returns a structured response: an actionable observation (text), a severity signal (type: warn/good/info), and a traceable evidence block pointing to the specific KC, p_mastery value, and mistake count that drove the insight. The evidence isn't generated by the LLM — it's assembled server-side from DB reads. Teachers get an interpretation layer backed by auditable data.
The backend piece is GenerateInsightUseCase — three DB queries, one LLM call, one JSON parse, one server-side evidence assembly. This post is about the schema design, the prompt engineering, and everything that can go wrong with the parse step.
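To make the response shape concrete, here is a minimal sketch of the schema. The field names are taken from the fallback constant shown later in this post; the use of frozen dataclasses, the `InsightType` alias, and the example `window` string are assumptions for illustration (the real models may well be Pydantic classes).

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Assumed alias for the three severity signals named in the post.
InsightType = Literal["warn", "good", "info"]


@dataclass(frozen=True)
class InsightEvidence:
    """Assembled server-side from DB reads, never from LLM output."""
    kc: str                      # Knowledge Component id, e.g. "SUB_BORROW"
    p_mastery: float             # mastery estimate for that KC
    mistake_type: Optional[str]  # e.g. "BORROW_SKIP", or None if no pattern
    mistake_count: int           # occurrences inside the attempt window
    window: str                  # human-readable window (format is assumed)


@dataclass(frozen=True)
class InsightResponse:
    text: str                    # LLM-generated observation
    type: InsightType            # LLM-generated severity signal
    evidence: InsightEvidence    # DB-derived, auditable


example = InsightResponse(
    text="Targeted regrouping practice recommended for SUB_BORROW.",
    type="warn",
    evidence=InsightEvidence(
        kc="SUB_BORROW",
        p_mastery=0.18,
        mistake_type="BORROW_SKIP",
        mistake_count=9,
        window="last 10 attempts",
    ),
)
```

Only `text` and `type` ever come from the model; everything under `evidence` can be checked against the database.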
The Design Decision
Structured JSON vs. free-form text
The original insight.txt prompt asked for a single sentence. Simple, but not structured enough for a UI that needs to display a summary and a suggested action as separate elements.
We rewrote the prompt — but here's the key design decision: the LLM only generates two fields. The evidence block is assembled server-side from DB reads before the call, so it's auditable regardless of what the model says.
You are a specialist math learning advisor for primary school teachers.
Given a student's Knowledge Component mastery data and recent attempt history,
generate a JSON response with exactly two fields:
- "text": one actionable sentence (max 30 words) citing a specific KC or mistake type
- "type": one of "warn", "good", or "info" based on urgency
Respond with only the JSON object. No explanation, no markdown, no code fences.
That last line ("no markdown, no code fences") is the most important one. Without it, Claude wraps the JSON in a Markdown code fence (three backticks plus a json tag), and json.loads() raises JSONDecodeError on the leading backticks. Every time.
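The failure mode is easy to reproduce offline. The sketch below shows `json.loads()` rejecting a fenced payload, plus a hypothetical `strip_code_fence` helper that could act as a second line of defence. The helper is an assumption for illustration, not something NumPath is described as using; the prompt instruction remains the primary fix.

```python
import json

FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable


def strip_code_fence(raw: str) -> str:
    """Remove one surrounding Markdown code fence, if present (hypothetical helper)."""
    text = raw.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        # Drop the opening fence line, which may carry a language tag such as "json".
        first_line, sep, rest = body.partition("\n")
        if sep and not first_line.strip().startswith("{"):
            body = rest
        return body.strip()
    return text


fenced = FENCE + 'json\n{"text": "x", "type": "info"}\n' + FENCE

# Parsing the fenced payload directly fails on the first backtick.
try:
    json.loads(fenced)
    raw_parse_failed = False
except json.JSONDecodeError:
    raw_parse_failed = True

# After stripping the fence, the same payload parses normally.
parsed = json.loads(strip_code_fence(fenced))
```

Unfenced input passes through `strip_code_fence` unchanged apart from whitespace trimming, so the helper is safe to apply unconditionally.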
Defensive fallback, never a 500
LLM output can't be trusted unconditionally. We introduced a module-level fallback constant and a pure parse function:
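import json
import logging
from typing import Optional

logger = logging.getLogger(__name__)  # assumed: a standard-library logger

# Returned whenever the LLM output cannot be parsed; the evidence is deliberately empty.
_FALLBACK_INSIGHT = InsightResponse(
    text="Insight temporarily unavailable.",
    type="info",
    evidence=InsightEvidence(kc="", p_mastery=0.0, mistake_type=None, mistake_count=0, window=""),
)

def _parse_llm_fields(raw: str) -> tuple[Optional[str], Optional[str]]:
    """Parse only the LLM-generated fields: text + type."""
    try:
        data = json.loads(raw)
        return data["text"], data["type"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # (None, None) tells the caller to use the fallback instead of raising.
        logger.warning("insight_parse_failed_using_fallback raw=%s", raw[:200])
        return None, None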
The evidence block is never parsed from the LLM output; it is built from the same DB data that was sent to the model. If parsing fails, the parse function returns (None, None) and the use case returns the fallback instead, so the endpoint always returns 200 with a clean InsightResponse. The warning log is the signal to watch for prompt drift after model updates.
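A minimal sketch of how the parse result and the DB-derived evidence might be combined is shown below. The `build_insight` function, the `ALLOWED_TYPES` tuple, and the dataclass stand-ins are hypothetical names for illustration. The extra check that `type` is one of the three allowed values is an assumption the post does not describe, added because valid JSON can still carry an unexpected value.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_TYPES = ("warn", "good", "info")  # assumed validation set


@dataclass(frozen=True)
class InsightEvidence:
    kc: str
    p_mastery: float
    mistake_type: Optional[str]
    mistake_count: int
    window: str


@dataclass(frozen=True)
class InsightResponse:
    text: str
    type: str
    evidence: InsightEvidence


FALLBACK = InsightResponse(
    text="Insight temporarily unavailable.",
    type="info",
    evidence=InsightEvidence(kc="", p_mastery=0.0, mistake_type=None,
                             mistake_count=0, window=""),
)


def parse_llm_fields(raw: str) -> tuple[Optional[str], Optional[str]]:
    """Return (text, type) from the raw LLM output, or (None, None) on failure."""
    try:
        data = json.loads(raw)
        return data["text"], data["type"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None, None


def build_insight(raw: str, evidence: InsightEvidence) -> InsightResponse:
    """Combine LLM fields with DB-derived evidence, falling back on bad output."""
    text, kind = parse_llm_fields(raw)
    if not isinstance(text, str) or kind not in ALLOWED_TYPES:
        return FALLBACK
    return InsightResponse(text=text, type=kind, evidence=evidence)


db_evidence = InsightEvidence(kc="SUB_BORROW", p_mastery=0.18,
                              mistake_type="BORROW_SKIP", mistake_count=9,
                              window="last 10 attempts")
good = build_insight('{"text": "Practice regrouping.", "type": "warn"}', db_evidence)
bad = build_insight("Sure! Here is the JSON you asked for:", db_evidence)
```

Because `build_insight` never raises on malformed model output, the calling endpoint has no error path to translate into a 500.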
StubProvider updated for the new schema
The StubProvider had to be updated to match — the old stub returned a plain string, which would now fail parsing:
class StubProvider:
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"text": "Student is consistently skipping the borrowing step — targeted regrouping practice recommended.", '
            '"type": "warn"}'
        )
The stub returns only text and type — the evidence block is assembled by the use case from DB data, not by the provider. All tests run offline with LLM_PROVIDER=stub — no API key needed, no network dependency, deterministic results. The provider abstraction (ADR-003) pays off here.
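One way to keep the stub honest is a contract check that runs it through the same expectations the prompt places on the real model. The sketch below is an assumption about how such a test could look: `check_stub_matches_schema` is a hypothetical name, the stub text is shortened, and the 30-word limit is taken from the prompt above.

```python
import asyncio
import json


class StubProvider:
    """Offline stand-in mirroring the provider interface shown in the post."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"text": "Student keeps skipping the borrowing step; '
            'targeted regrouping practice recommended.", '
            '"type": "warn"}'
        )


async def check_stub_matches_schema() -> dict:
    """Fail loudly if the stub drifts from what the prompt asks the model for."""
    raw = await StubProvider().complete(system="(system prompt)", user="(student data)")
    data = json.loads(raw)                          # must be raw JSON, no fences
    assert set(data) == {"text", "type"}            # exactly two fields
    assert data["type"] in ("warn", "good", "info")
    assert len(data["text"].split()) <= 30          # prompt caps text at 30 words
    return data


result = asyncio.run(check_stub_matches_schema())
```

If the prompt schema changes and the stub does not, this check fails immediately instead of letting tests pass against an outdated fixture.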
Why It Matters for the Research
The MacLellan NTI Framework (2018) requires that every AI insight be explainable and traceable: not just "this student needs more practice" but "this student has p_mastery=0.18 on SUB_BORROW with BORROW_SKIP classified in 9 of the last 10 attempts." The evidence field makes that traceability structural, not aspirational.
The Teacher-in-the-Loop principle is now fully implemented across three phases:
- Phase 1: KC mastery bars — where is the student stuck?
- Phase 2: Attempt history — what exactly did they do wrong?
- Phase 3: LLM insight — what should the teacher do about it?
The distinction between Phase 2 and Phase 3 matters for the RCT. Phase 2 gives teachers raw evidence — they can verify it, disagree with it, or draw their own conclusions. Phase 3 adds an interpretation. The research question becomes: do teachers who see the LLM interpretation intervene differently from those who only see the raw data? We don't know yet. But we'll be able to measure it.
What We Learned
The hardest part wasn't the LLM call — it was making the output reliable enough to drive UI state. A streaming response or a multi-turn conversation would require a different interface entirely (ADR-003 explicitly deferred that). For a single completion call, the pattern is: instruct precisely, parse defensively, fall back gracefully.
The insight quality degrades predictably for students with few attempts — a student who's done 3 problems gets a much vaguer insight than one who's done 30. That's expected and honest, but it's worth surfacing to teachers: "This insight is based on 3 attempts" would make the confidence level explicit. That's a Phase 4 enhancement.
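As a rough illustration of that Phase 4 idea, the note could be derived from the attempt count alone. The function name `attempt_count_note`, the threshold of 5, and the wording of the caveat are all assumptions, not part of the current system.

```python
def attempt_count_note(n_attempts: int, low_data_threshold: int = 5) -> str:
    """Describe how much data an insight rests on (hypothetical helper)."""
    noun = "attempt" if n_attempts == 1 else "attempts"
    note = f"This insight is based on {n_attempts} {noun}."
    if n_attempts < low_data_threshold:
        # Assumed cutoff: flag insights built on very little data.
        note += " Treat it as a preliminary signal."
    return note


sparse_note = attempt_count_note(3)
rich_note = attempt_count_note(30)
```

Since the count comes from the same DB read that feeds the evidence block, the note would stay auditable in the same way.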
What's Next
The LLM provider abstraction supports multiple models — the next natural step is A/B testing Claude Sonnet vs. Haiku on insight quality vs. cost. That's not planned for MVP, but the infrastructure already supports it.
Key Takeaways
- "No markdown, no code fences" is load-bearing prompt text: without it, Claude wraps JSON in backtick blocks and json.loads() fails; always explicitly instruct raw output
- The fallback constant pattern beats exception propagation: a named _FALLBACK_INSIGHT at the module level makes the graceful degradation path explicit, testable, and readable
- The StubProvider must match the current schema: when you change what the LLM is expected to return, update the stub immediately or you'll have tests that pass against outdated fixtures