So I finally got the topic classifier to a place where it doesn't actively embarrass me — 73% accuracy on the validation set, which is honestly higher than I expected after three weeks of mostly guessing. I used LangSmith for the eval runs mostly because I saw it mentioned in a thread here and the logging UI saved me from going blind in the terminal. The dataset is still a mess — had to relabel about 600 examples by hand after realizing our previous annotator was marking "technology" as "science" about half the time, which explained a lot.
The weird part is that I'm now unsure whether 73% is actually enough for the MVP demo next Friday. Maybe it's fine and I'm overthinking it. The classifier works fine on clean inputs but I'm watching it choke on typos, which feels solvable but also not something I want to debug on a deadline. The client's been quiet about the scope anyway — they keep adding notes to the Figma files but haven't answered my last two emails about acceptance criteria, which I try not to read too much into but do anyway.
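One way to put a number on the typo problem before the demo is to score the same validation examples twice, once clean and once with synthetic typos, and look at the gap. This is a rough sketch, not the setup described above: `add_typos`, `typo_gap`, and `toy_predict` are made-up names, the adjacent-letter swap is only a crude stand-in for real typos, and the two example rows are placeholders rather than real data.

```python
import random

def add_typos(text, rate, rng):
    """Swap adjacent letters at roughly `rate` of positions (a crude typo model)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)

def accuracy(predict, examples):
    """Fraction of (text, label) pairs where predict(text) == label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

def typo_gap(predict, examples, rate=0.1, seed=0):
    """Return (clean accuracy, accuracy on typo-perturbed copies of the same texts)."""
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    noisy = [(add_typos(text, rate, rng), label) for text, label in examples]
    return accuracy(predict, examples), accuracy(predict, noisy)

# Hypothetical stand-in for the real classifier: any callable text -> label works.
def toy_predict(text):
    return "technology" if "software" in text.lower() else "science"

# Placeholder rows; swap in the real validation set.
examples = [
    ("New software release", "technology"),
    ("Telescope finds a planet", "science"),
]

clean_acc, noisy_acc = typo_gap(toy_predict, examples, rate=0.3)
print(f"clean={clean_acc:.2f} noisy={noisy_acc:.2f}")
```

If the noisy number falls well below the clean one at a low `rate`, that suggests the typo issue is worth fixing before Friday; if the two stay close, it can probably wait.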
Anyway, for those of you running classification models with small training sets: did you find a threshold where accuracy differences became noticeable in actual usage, rather than just showing up in metrics? I'm not sure whether users would care about another 10 points on top of 73%, or whether they only notice when it drops below 60%.
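A related check is how much wiggle room the 73% itself has, since on a small validation set the uncertainty can be wider than the gap being asked about. Below is a minimal sketch of a Wilson score interval for an accuracy figure. The validation set size is not stated above, so the sizes in the loop are placeholders, and `wilson_interval` is just a helper name chosen for this example.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Placeholder validation set sizes; replace with the real count.
for n in (100, 500, 2000):
    correct = round(0.73 * n)
    low, high = wilson_interval(correct, n)
    print(f"n={n}: {correct}/{n} correct, interval {low:.3f} to {high:.3f}")
```

The interval narrows as `n` grows, so with only a few hundred validation examples a difference of a few points between two runs may be noise rather than a real improvement.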