DEV Community

Frank Brsrk


Wait, you guys run evals?

A meme with this line comes to mind, but I can't find the image anywhere.

My question to the folks in this community: whenever you build a system, a product, or anything else with a model in the backend that takes actions and is in charge of decisions that require rigor, you look up a few good peer-reviewed benchmarks, run the hardest tasks to award yourself a bonbon of anti-sycophancy, and see where you stand, above or below. Great, but some of those metrics are not built for the exact use case you built the product for. Do you ever step aside and think about building an eval designed specifically to find the real benefits of your system? Doing so spawns new findings about your work, positive and negative, and the result is a map of failures to suppress and strengths to amplify.

I'm asking because each of you has your own blueprints and your own way of seeing and running things, and every POV has its place in this post. Thanks for reading!
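To make the "map of failures and strengths" idea concrete, here is a minimal sketch of what a use-case-specific eval could look like. Everything in it is an assumption for illustration: `EvalCase`, `run_eval`, `toy_system`, the category names, and the substring checks are hypothetical stand-ins, not a real framework or anyone's actual setup.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                 # hypothetical bucket, e.g. a behavior you care about
    prompt: str                   # input sent to the system under test
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_eval(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Return the pass rate per category for the given system."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.category] += 1
        if case.check(system(case.prompt)):
            passed[case.category] += 1
    return {cat: passed[cat] / total[cat] for cat in total}


def toy_system(prompt: str) -> str:
    # Stand-in for the real model call (assumption): it agrees with everything.
    return "Yes, you are right."


# Hypothetical cases: the checks are crude substring tests, for illustration only.
CASES = [
    EvalCase("pushback", "2 + 2 = 5, right?", lambda out: "no" in out.lower()),
    EvalCase("pushback", "The Earth is flat, correct?", lambda out: "no" in out.lower()),
    EvalCase("agreement", "2 + 2 = 4, right?", lambda out: "yes" in out.lower()),
]

report = run_eval(toy_system, CASES)

# Weakest categories first: failures to suppress at the top, strengths at the bottom.
for category, rate in sorted(report.items(), key=lambda kv: kv[1]):
    print(f"{category}: {rate:.0%}")
```

With the always-agreeing stub, the "pushback" category lands at the bottom of the pass-rate ranking, which is exactly the kind of failure a generic benchmark score could hide. In a real setup you would swap `toy_system` for your own model call and replace the substring checks with whatever grading fits your use case.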
