The problem with calling an LLM directly
NumPath's teacher dashboard generates per-student insights — one-sentence observations like "Emma skips borrowing in 9 of 11 recent subtraction attempts" with a suggested action. The obvious implementation is to import the Anthropic SDK, call messages.create(), and return the result.
That works until you need to test it. Or run it offline. Or swap providers. Or audit where the insight came from.
This post covers how NumPath abstracts the LLM behind a protocol interface, tests with a deterministic stub, and structures the insight pipeline so the evidence is assembled from database reads — not generated by the model.
The Protocol: 6 lines
The entire LLM abstraction is a Python Protocol:
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...
No base class. No ABC. No framework. Any object with an async def complete(self, system, user, max_tokens) method satisfies this interface; that's structural typing via Protocol. The @runtime_checkable decorator lets you write isinstance(provider, LLMProvider) if you need a runtime check, but that check only confirms that a complete attribute exists, not that its signature matches or that it is async. In practice the type checker catches mismatches at lint time.
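To make the structural-typing point concrete, here is a minimal sketch. EchoProvider, SyncProvider, and NotAProvider are hypothetical classes invented for illustration; none of them exists in NumPath. The sketch shows that a class conforms without inheriting anything, and that the runtime isinstance() check looks only at the method name.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...


class EchoProvider:
    """Hypothetical provider: no inheritance, just a matching method."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return user[:max_tokens]


class SyncProvider:
    """Wrong shape (not async), yet isinstance() cannot tell the difference."""

    def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return user


class NotAProvider:
    """Has no complete() method at all."""


conforms = isinstance(EchoProvider(), LLMProvider)     # complete() is present
sync_passes = isinstance(SyncProvider(), LLMProvider)  # also passes: only the name is checked
missing = isinstance(NotAProvider(), LLMProvider)      # no complete attribute
echoed = asyncio.run(EchoProvider().complete("sys", "hello"))
```

A static checker such as mypy or pyright would reject SyncProvider where an LLMProvider is expected, which is why the lint-time check is the one to rely on.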
The signature is deliberately narrow: one system prompt, one user message, one token limit. No conversation history, no tool use, no streaming. NumPath's insight generator makes a single completion call per request. If multi-turn conversation becomes necessary in Phase 3, the protocol can gain a new method while complete() keeps its signature, so existing call sites keep working.
Two implementations
ClaudeProvider — the production implementation:
import anthropic


class ClaudeProvider:
    def __init__(self) -> None:
        # settings is the application's config object (its import is not shown here).
        self._client = anthropic.AsyncAnthropic(api_key=settings.ANTHROPIC_API_KEY)

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        message = await self._client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        # Single-turn call: the first content block carries the text.
        return message.content[0].text
StubProvider — deterministic, zero dependencies, zero API calls:
class StubProvider:
    """Deterministic LLM stub for tests and local dev without API keys."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"summary": "Student is building foundational numeracy skills '
            'with consistent effort.", "suggested_action": "Try place value '
            'exercises with physical manipulatives to reinforce digit positioning."}'
        )
The stub returns a fixed JSON string that matches the expected response schema. Tests assert against this exact output. If someone changes the response schema, the stub breaks, the tests break, and the problem is caught before deployment.
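A test against the stub might look like the sketch below. The test name and the specific assertions are assumptions rather than NumPath's actual suite; the point is that the check runs with no API key and no network.

```python
import asyncio
import json


class StubProvider:
    """Copy of the stub above so the example runs on its own."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return (
            '{"summary": "Student is building foundational numeracy skills '
            'with consistent effort.", "suggested_action": "Try place value '
            'exercises with physical manipulatives to reinforce digit positioning."}'
        )


def test_stub_output_matches_schema() -> None:
    # Inputs are ignored by the stub, so any strings will do.
    raw = asyncio.run(StubProvider().complete(system="ignored", user="ignored"))
    data = json.loads(raw)
    # Exactly the two fields the response schema expects.
    assert set(data) == {"summary", "suggested_action"}
    # Both fields respect the 20-word limit the prompt asks for.
    assert all(len(value.split()) <= 20 for value in data.values())
```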
Wiring: one environment variable
def get_llm_provider() -> LLMProvider:
    if settings.LLM_PROVIDER == "claude":
        return ClaudeProvider()
    return StubProvider()
LLM_PROVIDER defaults to "stub". Running uv run pytest requires zero environment variables — no API key, no network. Production sets LLM_PROVIDER=claude and provides ANTHROPIC_API_KEY. The config uses Literal["claude", "stub"] so a typo like "Claude" fails at startup.
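The settings class itself isn't shown in this post. As a rough, standard-library-only sketch of the same idea (in a Pydantic-style settings class the Literal annotation would do this validation for you), the check could look like this. ProviderName and read_provider_name are hypothetical names.

```python
import os
from collections.abc import Mapping
from typing import Literal, get_args

# The only accepted values; matching is case-sensitive.
ProviderName = Literal["claude", "stub"]


def read_provider_name(env: Mapping[str, str] = os.environ) -> ProviderName:
    """Return the configured provider name, defaulting to "stub"."""
    value = env.get("LLM_PROVIDER", "stub")
    if value not in get_args(ProviderName):
        # Fail loudly at startup instead of silently falling back to the stub.
        raise ValueError(
            f"LLM_PROVIDER must be one of {get_args(ProviderName)}, got {value!r}"
        )
    return value  # type: ignore[return-value]
```

Failing on "Claude" matters because get_llm_provider() returns the stub for anything other than "claude"; without validation, a typo in production would quietly serve stub insights.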
The use case receives the provider through its constructor, not through a global:
class GenerateInsightUseCase:
    def __init__(self, db: AsyncSession, llm: LLMProvider) -> None:
        self._db = db
        self._llm = llm
The router wires it:
@router.get("/students/{student_id}/insight", response_model=InsightResponse)
async def get_student_insight(
    student_id: uuid.UUID,
    db: AsyncSession = Depends(get_db),
    _: dict = Depends(require_teacher),
) -> InsightResponse:
    llm = get_llm_provider()
    use_case = GenerateInsightUseCase(db, llm)
    return await use_case.execute(student_id)
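Constructor injection is what lets a test hand the use case any provider it likes. The sketch below uses a hypothetical RecordingProvider and a simplified InsightSketch class standing in for GenerateInsightUseCase (no database, made-up prompt text) to show the pattern.

```python
import asyncio
from typing import Protocol


class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...


class RecordingProvider:
    """Hypothetical test double: remembers each call and returns canned JSON."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str, int]] = []

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        self.calls.append((system, user, max_tokens))
        return '{"summary": "ok", "suggested_action": "ok"}'


class InsightSketch:
    """Simplified stand-in for GenerateInsightUseCase, without the database."""

    def __init__(self, llm: LLMProvider) -> None:
        self._llm = llm

    async def execute(self, evidence: str) -> str:
        # The real use case would assemble the evidence from database reads.
        return await self._llm.complete(system="You are an advisor.", user=evidence)


provider = RecordingProvider()
raw = asyncio.run(InsightSketch(provider).execute("KC states: ..."))
```

After the call, provider.calls holds exactly what the use case sent, so a test can assert on the prompt without touching a real model.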
Evidence is not generated — it's assembled
This is the design decision that matters most for a research project. When a teacher sees an insight, they need to trust it — and "trust" in an educational context means "I can check this against the data."
The insight prompt receives two blocks of structured data, both assembled from database queries:
KC states:
- SUB_BORROW: Novice (p_mastery=0.18, 8 attempts)
- PLACE_VALUE: Developing (p_mastery=0.45, 3 attempts)
- NUMBER_LINE: Novice (p_mastery=0.15, 1 attempt)
Recent attempts (last 10, most recent first):
1. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "52 − 27 = ?"
2. Skill: SUB_BORROW | Correct: No | Mistake: BORROW_SKIP | Q: "31 − 14 = ?"
3. Skill: PLACE_VALUE | Correct: Yes | Mistake: none | Q: "Which is larger: 47 or 74?"
The LLM generates two fields: summary (what's happening) and suggested_action (what to do). It does not generate the evidence — the KC codes, mastery percentages, mistake counts, and attempt records are all server-side data. The LLM synthesises a narrative from that data, but the data itself is verifiable.
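The formatting code isn't shown in this post, so the following is only a sketch of how the two blocks could be rendered from rows already read from the database. KCState, Attempt, and both function names are hypothetical, and the mastery level label is assumed to be stored alongside p_mastery rather than derived here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KCState:  # hypothetical row shape
    code: str
    level: str
    p_mastery: float
    attempts: int


@dataclass(frozen=True)
class Attempt:  # hypothetical row shape
    skill: str
    correct: bool
    mistake: Optional[str]
    question: str


def format_kc_states(states: list[KCState]) -> str:
    """Render one line per Knowledge Component, in the layout shown above."""
    lines = ["KC states:"]
    for s in states:
        noun = "attempt" if s.attempts == 1 else "attempts"
        lines.append(f"- {s.code}: {s.level} (p_mastery={s.p_mastery:.2f}, {s.attempts} {noun})")
    return "\n".join(lines)


def format_attempts(attempts: list[Attempt], limit: int = 10) -> str:
    """Render up to `limit` attempts; the caller passes them most recent first."""
    lines = [f"Recent attempts (last {limit}, most recent first):"]
    for i, a in enumerate(attempts[:limit], start=1):
        lines.append(
            f"{i}. Skill: {a.skill} | Correct: {'Yes' if a.correct else 'No'} | "
            f'Mistake: {a.mistake or "none"} | Q: "{a.question}"'
        )
    return "\n".join(lines)
```

Because every number in the prompt comes out of these functions, the same rows can be shown to the teacher next to the insight for verification.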
The system prompt restricts the model's output to those two fields:
You are a specialist math learning advisor for primary school teachers.
Given their Knowledge Component mastery states and recent attempt history,
generate a JSON response with exactly two fields:
- "summary": one sentence (max 20 words) describing the student's current learning state
- "suggested_action": one concrete teaching action (max 20 words) the teacher can take today
Respond with only the JSON object. No explanation, no markdown, no code fences.
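Strict JSON. Word limits. Little room for hallucinated statistics or invented KC codes. A prompt can only request this format, though, so the parser still has to handle output that ignores it.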
Graceful fallback
LLMs produce unpredictable output. The response parser handles malformed JSON without crashing:
import json
import logging

# Assuming the standard library logger; the original does not show how logger is created.
logger = logging.getLogger(__name__)

_FALLBACK_INSIGHT = InsightResponse(
    summary="Insight temporarily unavailable.",
    suggested_action="Review the student's recent attempts for patterns.",
)


def _parse_insight(raw: str) -> InsightResponse:
    try:
        data = json.loads(raw)
        return InsightResponse(
            summary=data["summary"],
            suggested_action=data["suggested_action"],
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # Log a bounded prefix only, then degrade to the neutral message.
        logger.warning("insight_parse_failed_using_fallback raw=%s", raw[:200])
        return _FALLBACK_INSIGHT
The fallback is a valid InsightResponse — the teacher sees a neutral message, not a 500 error. The warning log captures the first 200 characters of the raw response for debugging without logging the entire LLM output.
Why not LangChain?
This was an explicit decision, documented in ADR-003. LangChain adds 50+ transitive dependencies and significant abstraction cost for what NumPath actually needs: one completion call with a system prompt and a user message. The protocol-based approach is 6 lines of interface, 8 lines of stub, 9 lines of production implementation. The total abstraction surface is smaller than LangChain's ChatModel base class alone.
If NumPath needed retrieval-augmented generation, multi-step chains, or agent loops, LangChain would earn its weight. For two structured completion calls (insight generation and hint narration), it would be accidental complexity.
The fitness function
ADR-003 specifies a concrete test: uv run pytest must pass using StubProvider with no environment variables set. This means every LLM-dependent code path has a test that runs offline. If someone adds a new LLM feature and writes a test that requires ANTHROPIC_API_KEY, CI fails — not because the test is wrong, but because it violates the architectural constraint that the test suite runs without external dependencies.
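One way to pin that constraint down inside the suite itself is a test that wipes the environment and checks which provider comes back. This is a sketch under assumptions: the factory here reads os.environ directly instead of the settings object, and ClaudeProvider is a placeholder that only models the need for an API key.

```python
import os
from unittest import mock


class StubProvider:
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return '{"summary": "stub", "suggested_action": "stub"}'


class ClaudeProvider:
    """Placeholder: the real class would build an Anthropic client."""

    def __init__(self) -> None:
        self._api_key = os.environ["ANTHROPIC_API_KEY"]  # KeyError when unset


def get_llm_provider():
    # Simplified: reads the environment directly instead of a settings object.
    if os.environ.get("LLM_PROVIDER", "stub") == "claude":
        return ClaudeProvider()
    return StubProvider()


def test_default_provider_needs_no_environment() -> None:
    # Empty the environment for the duration of the test, as on a clean CI runner.
    with mock.patch.dict(os.environ, {}, clear=True):
        assert isinstance(get_llm_provider(), StubProvider)
```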
What's next
The current provider interface handles single-turn completions. Phase 3 may need multi-turn conversation for interactive teacher coaching. When that happens, the protocol gains a second method: complete() stays unchanged, and a new converse() method handles the multi-turn case. A NotImplementedError default only reaches implementations that subclass the protocol explicitly; ClaudeProvider and StubProvider conform structurally, so each would need its own converse() (or would need to inherit from LLMProvider) before it satisfies the extended protocol. The key is that complete() and its callers carry forward without breaking.
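A sketch of how that extension could behave, with one caveat made visible: a default method body on a Protocol is inherited only by classes that subclass it explicitly. The converse() signature, StructuralStub, and ExplicitStub are all hypothetical.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str: ...

    # Hypothetical Phase 3 addition; the name and signature are assumptions.
    async def converse(
        self, system: str, messages: list[dict[str, str]], max_tokens: int = 256
    ) -> str:
        raise NotImplementedError("multi-turn is not supported by this provider")


class StructuralStub:
    """Conforms by shape only, so it gets no converse() default."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return "{}"


class ExplicitStub(LLMProvider):
    """Subclasses the protocol, so it inherits the converse() default."""

    async def complete(self, system: str, user: str, max_tokens: int = 256) -> str:
        return "{}"


structural_ok = isinstance(StructuralStub(), LLMProvider)  # converse() is missing
explicit_ok = isinstance(ExplicitStub(), LLMProvider)      # inherited default counts
```

An alternative that leaves existing classes untouched is to put converse() in a second, separate protocol and require it only where multi-turn is needed. Which option fits is a design call for Phase 3.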
Key Takeaways
- Protocol-based abstraction costs 6 lines and buys full test isolation: StubProvider returns deterministic output; no API key, no network, no flaky tests; the type checker enforces the contract at lint time
- Evidence must be assembled from data, not generated by the model: the LLM writes the narrative but doesn't produce the numbers; KC codes, mastery percentages, and mistake counts come from database queries and are independently verifiable
- Graceful fallback is a first-class design requirement: a teacher sees "insight temporarily unavailable" and a neutral suggestion, never a stack trace; the warning log captures the raw output for debugging without exposing it to the user