Our AI scored a wrong answer 9/10.
Here's how it happened — and what we fixed.
We built a "Teach-Back" feature for StudySpark: students explain a concept in their own words and get a score out of 10 with feedback.
The scoring logic? Character count + connective keywords. Type 280 characters and you score 10/10.
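Roughly, the old scorer amounted to this (a reconstruction, not our exact code; the weights and keyword list are assumptions):

```typescript
// Reconstruction of the old heuristic scorer. Weights and keyword list are guesses.
const CONNECTIVES = ["because", "therefore", "so", "which means"];

function scoreTeachBack(explanation: string): number {
  // Length proxy: roughly one point per 28 characters, capped at 10.
  const lengthScore = Math.min(Math.floor(explanation.length / 28), 10);

  // Keyword proxy: any "reasoning-sounding" connective adds a point,
  // with no check that it actually connects anything.
  const text = explanation.toLowerCase();
  const keywordScore = CONNECTIVES.filter((w) => text.includes(w)).length;

  return Math.min(lengthScore + keywordScore, 10);
}

// 280 characters of anything scores 10/10, including
// "therefore therefore therefore ..." pasted until the box is full.
```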
We didn't know it was broken because the UI looked great. The feedback panels looked authoritative. The circular score ring animated in. The student moved on feeling confident.
Nothing threw an error. Nothing returned null. From the outside, the feature was working perfectly.
What we were missing: a connection between what we told students about their understanding and what they actually demonstrated afterward.
We replaced the heuristic with a real LLM eval — rubric-based, dimensional, fraction of a cent per call. It catches "therefore therefore therefore" for what it is.
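The shape of the new grader, sketched out (hedged: the rubric dimensions, prompt wording, and `LLM` type are illustrative placeholders, not our production code):

```typescript
// Illustrative rubric-based grader. Dimensions, prompt, and the LLM type are
// placeholders; wire in whatever model client you actually use.
type LLM = (prompt: string) => Promise<string>;

interface RubricResult {
  accuracy: number;      // 0-10: is the explanation factually correct?
  completeness: number;  // 0-10: does it cover the key ideas?
  ownWords: number;      // 0-10: genuine paraphrase, or keyword filler?
  feedback: string;      // shown in the existing feedback panel
}

async function gradeTeachBack(
  llm: LLM,
  concept: string,
  explanation: string,
): Promise<RubricResult> {
  const prompt = `You are grading a student's teach-back of "${concept}".
Score each dimension from 0 to 10 and respond with JSON only:
{"accuracy": n, "completeness": n, "ownWords": n, "feedback": "..."}
Filler such as repeated connectives ("therefore therefore therefore")
should score near 0 on every dimension.

Student explanation:
${explanation}`;

  return JSON.parse(await llm(prompt)) as RubricResult;
}
```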
The UI scaffolding was already right. We just put the wrong thing inside it.
Biggest lesson: proxy metrics don't fail loudly. They just quietly reward the wrong behavior until you go looking.