Hey folks,
I wanted to build out my own personal list of evaluations, early on into putting this together I realised I wanted a way to not just evaluate the model but also the agentic harness that the model is running within, as I find the majority of my use of LLMs is more and more inside of a suite of CLI agentic harnesses.
I've listed in the video a multitutde of motivations for why I built this, but the primary ones were all the hype announcements and wanting a way to see for myself what models and their capabilities were like in the actual tools I use.
A paper by Google over on Kaggle recently went as far as to state that which LLM being used inside of an agentic harness perhaps only contributes 10% towards how effecitve that harness will be for a given task. I am not sure I agree with the figure, but I do agree with the sentiment.
One question I keep asking myself is when do I need to switch from my qwen3.6-27b that I am running locally on my twin 3090 setup, to a cloud model. At the moment I am making this decision on vibes/gut feel, and I think that might be okay for when I am working closely with the model but I am using these cli tools headlessly in quite a few workflows now and not just personally but professionally, so I want to make sure I am picking the right combo for the task.
The repo can be found here: https://github.com/ScottRBK/eval-harness, there is an explanation of the architecture. I have added example evaluations as I built it out to help me think about the different patterns I have utilise for evaluations.
The evaluations are quite easy as they are about resources contained within the model weights. The idea behind it is though I (and anyone else whom might want to fork the repo and curate their own) will build out a private list of evals that are held away from public that people can use to evaluate existing and new models and harnesses as they are released.
I also spent a good bit of time seeing how well cli agents themselves are able to build evaluations and have put together a list of skills that they can use alongside the tool. They do an okay job, but you need to really step through the logic of whatever they produce, they often produce quite brittle evaluations, so try getting them to stick to the example patterns already provided helps quite a bit.
An ideal position for me will be having the ability to ask the agent to generate an evaluation using the skills having just finished a session where I found a particular agent was struggling to complete a task. Theres often been a time where I've come across a problem that the agent has struggled to resolve and I've wished at that point I could make an evaluation out of it, but you are often in the middle of something and it ends up just as another item on my ever growing TODO: list.
This is my first time building an actual evaluation suite or framework of this kind for that matter. I have previously used existing frameworks, such as deepeval, so I was not toally unfamiliar with the topic but as with the other motivations already listed I built this as a learning exercise as well as to get a tool out of it.
If it is useful for you please get in touch and let me know, any feedback as well is also appreciated, as this is my first go and this kind of framework - i expect there is a lot that can be improved and I have potentially got wrong.
Enjoy the rest of your sunday folks.
Top comments (0)