Liam Steiner
I added a local eval loop to my personal AI assistant — here's what 800 scored interactions taught me

I'd been using my self-hosted assistant daily for a few months. Long enough to have a sense that some interactions were useful and some weren't. Not long enough to do anything about it.

The problem: no feedback mechanism. I could recognize a bad response when I saw one, but nothing accumulated into a signal I could act on.
So I added one.

Every interaction now gets scored by a local Ollama model, fast enough not to be annoying, on three axes: accuracy, relevance, and appropriate confidence.
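A minimal sketch of that scoring step, assuming the `ollama` Python client. The judge model name, rubric wording, and function names here are my own illustrations, not from the repo:

```python
import json

# Hypothetical rubric; the actual prompt in the project may differ.
RUBRIC = (
    "Score the assistant reply on three axes from 1 to 5: accuracy, "
    "relevance, and appropriate confidence. Reply with JSON only, e.g. "
    '{"accuracy": 4, "relevance": 5, "confidence": 3}.'
)

def parse_scores(raw: str) -> dict:
    """Pull the JSON object out of the judge's reply, tolerating stray text."""
    start, end = raw.find("{"), raw.rfind("}") + 1
    return json.loads(raw[start:end])

def score_interaction(question: str, answer: str, judge: str = "llama3.1") -> float:
    """Grade one interaction with a local Ollama model; return the mean score."""
    import ollama  # requires a running Ollama server

    resp = ollama.chat(
        model=judge,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nUser: {question}\nAssistant: {answer}",
        }],
    )
    scores = parse_scores(resp["message"]["content"])
    return sum(scores.values()) / len(scores)
```

Asking for JSON-only output and then extracting the first `{...}` span keeps the parse robust when small local models wrap the scores in chatter.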
Interactions below a threshold trigger a reflection prompt: the model looks at the interaction and generates a short analysis of what went wrong.
Those reflections feed into DSPy, which periodically re-optimizes the underlying system prompts once enough new data has accumulated.

After around 800 scored interactions, patterns started to emerge.
The most consistent one: the assistant was overconfident in its estimates of timelines, complexity, and quantities, and systematically biased toward underestimating them.
Not something I'd have caught session by session.

Shorter, more direct answers also consistently scored better than thorough ones. Useful to know.

Honest caveats: the Ollama scoring model is imperfect, and DSPy convergence is slow on a single-user dataset.

This is genuinely more experiment than finished feature.
But having a feedback loop at all changes how you think about the system.

https://github.com/sliamh11/Deus
