We Used Character Count as Comprehension — Hindsight Disagreed
We shipped a feature that told students "Excellent conceptual understanding!" — and the entire basis for that judgment was that they had typed more than 240 characters and used the word "therefore" at least once.
That is not a metaphor. That is the literal scoring function inside src/pages/TeachBack.tsx. And the reason we didn't catch it for weeks is that the feedback looked completely convincing — structured, warm, specific — and there was no mechanism connecting what we told students about their understanding to whether they actually performed better afterward. That feedback loop is what Hindsight is built to close, and the absence of it let a broken metric run unchecked behind a polished UI.
What Teach-Back Is Supposed to Do
The Teach-Back feature is one of the most pedagogically sound ideas in the project. The concept — explain a topic in your own words as if teaching it to someone else — is a well-established learning technique. If a student can accurately explain photosynthesis without referring to notes, they probably understand it. If they can't, no amount of re-reading will help as much as trying to explain it and discovering where their explanation breaks down.
The flow in TeachBack.tsx is clean:
Student picks a topic from a fixed list of six (Photosynthesis, Stack, Newton's Second Law, Water Cycle, DNA Replication, Recursion)
They type a free-form explanation in a textarea with a character progress bar
They submit, see a 2.5-second "AI is evaluating…" loading state
They receive a score out of 10, a Strengths list, an Areas to Improve list, and numbered tips
The UI is genuinely good. The circular progress ring animates in. The color changes based on score — green above 8, yellow above 6, red below 4. The feedback panels look considered and authoritative. A student seeing this for the first time would reasonably believe something intelligent had just evaluated their understanding.
What Actually Scores the Explanation
Here is the complete scoring function:
```ts
const submit = () => {
  if (!explanation.trim() || explanation.length < 20) return;
  setPhase("evaluating");
  setTimeout(() => {
    const len = explanation.length;
    let baseScore = Math.min(10, Math.floor(len / 40) + 3);
    const keywords = ["because", "example", "means", "process", "result",
      "therefore", "however", "function", "step"];
    const keywordCount = keywords.filter((k) =>
      explanation.toLowerCase().includes(k)
    ).length;
    baseScore = Math.min(10, baseScore + Math.floor(keywordCount / 2));
    const finalScore = Math.max(2, Math.min(10, baseScore));
    setScore(finalScore);
    const template = feedbackTemplates.find((t) => finalScore >= t.minScore)
      || feedbackTemplates[3];
    setFeedback(template);
    setPhase("feedback");
  }, 2500);
};
```
The scoring model has two components. First, base score from length: every 40 characters adds 1 point, starting from a floor of 3. A 240-character response scores 3 + floor(240/40) = 3 + 6 = 9. Before we've read a single word, a student who types 240 characters is already at 9/10. Second, bonus from keywords: for every 2 of the 9 connective words present, add 1 point, capped at 10.
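Extracted from the component into a standalone function, the arithmetic is easy to demonstrate. This is a pure-function sketch of the same logic, with the constants copied from submit() above:

```typescript
// Same constants and logic as submit() in TeachBack.tsx, as a pure function.
const KEYWORDS = ["because", "example", "means", "process", "result",
  "therefore", "however", "function", "step"];

function scoreExplanation(explanation: string): number {
  // Base score: 1 point per 40 characters, starting from a floor of 3.
  let baseScore = Math.min(10, Math.floor(explanation.length / 40) + 3);
  // Keyword bonus: +1 point per 2 connective words found anywhere in the text.
  const keywordCount = KEYWORDS.filter((k) =>
    explanation.toLowerCase().includes(k)
  ).length;
  baseScore = Math.min(10, baseScore + Math.floor(keywordCount / 2));
  return Math.max(2, Math.min(10, baseScore));
}

// 240 characters of gibberish already scores 9/10; sprinkle in two
// connective words and it maxes out.
console.log(scoreExplanation("x".repeat(240)));                        // 9
console.log(scoreExplanation("x".repeat(240) + " because therefore")); // 10
```

No part of the function ever compares the explanation to anything true about the topic.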
The 2.5-second setTimeout before the score appears is not processing time. There is no processing. The score computation takes microseconds inside the timeout callback; the delay exists purely so the loading animation has time to play. We made it wait to feel like evaluation.
Why This Breaks in Practice
The proxy metric problem here is not subtle. Consider these two student responses for "Explain Recursion":
Student A (scores ~9/10):
"Recursion is a process where a function calls itself because it needs to repeat a step. For example, calculating factorial means you call the function step by step therefore reducing the problem each time. The result builds up through each function call however you need a base case or it loops forever."
Student B (scores ~4/10):
"A function calls itself with a smaller input until it hits the base case, which stops it. Without the base case, you get a stack overflow."
Student A's response runs about 300 characters and contains nearly every word on the keyword list. Student B's runs about 140 characters and contains just one ("function"). By our metric, Student A understands recursion better. But Student B's explanation is cleaner, more accurate, and demonstrates a more precise mental model. Student A padded with connective tissue ("therefore", "however", "means") that sounds structured but adds nothing.
The feedback Student A receives: "Excellent conceptual understanding. Clear and logical structure. Great use of analogies." The feedback Student B receives: "Missing some key details. Could use more concrete examples." Both are pulled from hardcoded feedbackTemplates arrays keyed by score threshold — there is no dynamic feedback at all.
```ts
const feedbackTemplates = [
  {
    minScore: 9,
    strengths: ["Excellent conceptual understanding", "Clear and logical structure",
      "Great use of analogies"],
    weaknesses: ["Could add more edge cases"],
    tips: ["Try explaining to a real person next", "Challenge yourself with a harder topic"],
  },
  {
    minScore: 7,
    strengths: ["Good grasp of core concepts", "Decent explanation flow"],
    weaknesses: ["Missing some key details", "Could use more concrete examples"],
    tips: ["Re-read the section on underlying mechanisms", ...],
  },
  // ...
];
```
Every student who scores a 9 gets identical feedback. Every student who scores a 7 gets identical feedback. The feedback is not about their explanation — it is about their score bucket. And the score bucket is determined by how many characters they typed.
What Hindsight Would Surface
The core problem is that we have no way to know whether our scoring function is measuring what it claims to measure. Is a student who scores 8/10 on Teach-Back actually more prepared for the quiz on that topic than a student who scored 4/10? We have no idea. We never measured it.
This is the feedback loop that Hindsight is designed to close. Hindsight's model for agent memory and learning is built around accumulating signal across sessions — not just logging individual responses, but tracking whether those responses produced the outcomes they were supposed to produce. Applied here, the pattern would be:
Log each Teach-Back attempt: topic, explanation text, score assigned
Log each subsequent Quiz result on the same topic
Measure correlation: do high Teach-Back scores predict better quiz performance on the same topic?
If the correlation is strong, our proxy metric is working. If students who score 9/10 on Teach-Back still fail the recursion quiz at the same rate as students who scored 4/10, the metric is broken and we need to know that. Without Hindsight-style cross-session tracing, we are producing confident-looking feedback with no idea whether any of it is helping.
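In sketch form, the correlation check itself is small. The record shape and field names here are illustrative, not Hindsight's actual API; the point is only what question the data would let us answer:

```typescript
// Illustrative log record: one Teach-Back attempt paired with the student's
// later quiz result on the same topic. Names are hypothetical.
interface AttemptPair {
  topic: string;
  teachBackScore: number; // 2-10 from the Teach-Back scorer
  quizScore: number;      // 0-100 on the subsequent quiz for the same topic
}

// Pearson correlation between Teach-Back scores and quiz outcomes.
function pearson(pairs: AttemptPair[]): number {
  const xs = pairs.map((p) => p.teachBackScore);
  const ys = pairs.map((p) => p.quizScore);
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < pairs.length; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// A correlation near 0 would mean a 9/10 Teach-Back score predicts nothing
// about quiz performance, i.e. the proxy metric is broken.
```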
The Deeper Problem: Metric Gaming Is Inevitable
Even if our current students don't game the metric deliberately, the metric will shape behavior over time. A character progress bar that turns green at 150 characters teaches students that 150 characters is the goal. A keyword list that rewards "therefore" and "however" teaches students to insert those words regardless of whether their argument needs them.
We already built the coaching into the UI without realizing it:
```tsx
{charCount < minLength
  ? `${minLength - charCount} more chars needed`
  : charCount < goodLength
  ? "Keep going…"
  : "Great length! ✨"}
```
The progress bar literally says "Great length! ✨" at 150 characters. We are rewarding verbosity and calling it comprehension. We built the Goodhart's Law trap directly into the interface.
What We'd Do Differently
Use a real LLM to score explanations. A single API call to evaluate a student's explanation against a rubric — accuracy, coverage of key concepts, presence of a concrete example, correct use of terminology — would cost fractions of a cent per submission and produce feedback that is actually about what the student wrote. The UI scaffolding is already perfect for it.
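A sketch of what that could look like, assuming an OpenAI-style chat completions endpoint. The rubric wording, model name, and JSON-reply convention are all our assumptions, not anything the project ships:

```typescript
// Rubric-based LLM scoring sketch. The endpoint shape follows the OpenAI
// chat completions API; everything else here is illustrative.
const RUBRIC = [
  "factual accuracy of the explanation",
  "coverage of the topic's key concepts",
  "presence of a concrete example",
  "correct use of terminology",
];

function buildPrompt(topic: string, explanation: string): string {
  return [
    `Score this student's explanation of "${topic}" from 1-10 against:`,
    ...RUBRIC.map((r, i) => `${i + 1}. ${r}`),
    'Reply as JSON: {"score": number, "strengths": string[], "weaknesses": string[]}.',
    "",
    `Explanation: ${explanation}`,
  ].join("\n");
}

async function scoreWithLLM(topic: string, explanation: string, apiKey: string) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gpt-4o-mini", // placeholder; any capable model works
      messages: [{ role: "user", content: buildPrompt(topic, explanation) }],
    }),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content);
}
```

The strengths and weaknesses that come back would be about the student's actual sentences, which is the entire difference.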
Separate score from feedback template. The hardcoded feedbackTemplates arrays mean every student at a given score bucket gets the same feedback regardless of what they actually wrote. Even with a proxy metric, feedback should reference the specific explanation — at minimum, which keywords were present and which were missing.
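Even keeping the keyword proxy, the feedback could at least reference the student's actual text. A minimal sketch, with the keyword list copied from submit() above and a helper name of our own invention:

```typescript
const KEYWORDS = ["because", "example", "means", "process", "result",
  "therefore", "however", "function", "step"];

// Report which connective words this explanation used and which it didn't,
// so feedback is about the submission rather than the score bucket.
function keywordFeedback(explanation: string): { present: string[]; missing: string[] } {
  const lower = explanation.toLowerCase();
  return {
    present: KEYWORDS.filter((k) => lower.includes(k)),
    missing: KEYWORDS.filter((k) => !lower.includes(k)),
  };
}
```

Crude, but "you used 'because' and 'example'; try stating the result of each step" is at least tied to what was written.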
Measure outcomes, not proxies. Track whether Teach-Back scores predict quiz performance. If they don't, the scoring function is wrong. This requires cross-session data, which requires something like Hindsight's memory layer to make it tractable.
Be honest in the UI about what is being measured. "Great length!" is coaching students toward the wrong goal. If we're measuring length, say so. Better: don't measure length.
The Teach-Back feature has the right pedagogical instinct. The scoring just needs to actually measure what it claims to — and the only way to know if it does is to close the feedback loop between what we tell students and what they subsequently learn.