I’ve been coding for about 20 years professionally, and frankly I don’t really “code” anymore—if by coding you mean typing syntax on a keyboard. Surprisingly, I don’t feel any less effective or any less of an engineer. I actually feel rather excited as various bottlenecks around me are removed. So yeah, I’m not worried about AI taking my career away. It already took away my coding, and I still love my job. But I am worried about something else: overfitting.
As a senior engineer, I want better models and agents! The more I can delegate with confidence, the more actual engineering I can do and the more valuable systems I can architect. Unfortunately, recent models have been overhyped, overfit, and benchmaxxed, all while increasing in price. Why? It’s Goodhart’s Law in practice:
“When a measure becomes a target, it ceases to be a good measure.”
We’ve made these benchmarks the measure of a model’s efficacy, but in the process we’ve ended up ruining the measurement and polluting the training. Real signal right now is coming from places like this, or X, or the proverbial water cooler. Experienced engineers can sense the quality of a model almost immediately. It’s not even hard for us to do—almost like an uncanny valley of skill—we just sort of “feel” that a given model is good or bad after using it for a day.
I think we’ve made a mistake as a tech industry. We’ve delegated too much of the requisite engineering for the “AI revolution” to data scientists. In order for AI to proliferate to Main Street, a lot of engineering is going to be required, and it’s already going terribly wrong. I don’t want to waste the limited attention you’ve given me, so I won’t go through the countless examples of poor-quality code being shipped by the AI labs themselves, because I think we all get the gist.
The Dunning–Kruger effect has something to say on this, of course. We’re dealing with the “systematic tendency of people with low ability in a specific area to give overly positive assessments of that ability”—data scientists who don’t code themselves think models are better at coding than they really are. After all, they perform well on benchmarks. I don’t mean this as hurtful criticism, but rather as constructive guidance that comes with a proposal: a new, subjective measure.
You know that “vibe” we experienced engineers all get from a model in the first few days after it’s released—Opus is good at UI but can’t stay on task, GPT can run for hours on task but makes UIs from 2001 look good, Gemini just… doesn’t work. Those “vibes” don’t show up in benchmarks. I think we should change that.
We need a subjective “benchmark” that uses relative measures from experienced engineers who have developed their uncanny valley detector for coding agents. A “VibeBench.” When a new model is released (or even pre-release, with the cooperation of the labs), a pool of experienced engineers spends a couple of days with it doing their normal work and reports their findings, which are then aggregated into an objective result from the subjective data.
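To make that “aggregated into an objective result” step concrete, here’s a minimal sketch of one way it could work, assuming each engineer scores each model on a simple 1–5 “how much would I delegate to this?” scale. The reviewer names, model names, and the z-score normalization are illustrative assumptions on my part, not the actual VibeBench methodology.

```python
# Hypothetical sketch: turning subjective per-reviewer ratings into a
# comparable per-model score. Names and scales are made up for illustration.
from collections import defaultdict
from statistics import mean, pstdev

# reviewer -> model -> rating on a 1-5 "would I delegate to this?" scale
ratings = {
    "alice": {"model_a": 4, "model_b": 2, "model_c": 3},
    "bob":   {"model_a": 5, "model_b": 3, "model_c": 3},
    "cara":  {"model_a": 3, "model_b": 1, "model_c": 2},
}

def aggregate(ratings):
    """Z-score each reviewer's ratings (so harsh and lenient graders become
    comparable), then average the normalized scores per model."""
    per_model = defaultdict(list)
    for reviewer, scores in ratings.items():
        mu = mean(scores.values())
        sigma = pstdev(scores.values()) or 1.0  # guard against zero spread
        for model, rating in scores.items():
            per_model[model].append((rating - mu) / sigma)
    return {model: mean(vals) for model, vals in per_model.items()}

if __name__ == "__main__":
    for model, score in sorted(aggregate(ratings).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {score:+.2f}")
```

The per-reviewer normalization is the point of the sketch: some engineers grade everything harshly and others generously, and what we actually care about is the relative ordering each reviewer produces, not their absolute numbers.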
So my team built it. But… we need a lot of experienced devs to sign up before it can become a reality. If you’re an experienced developer, please join! Let’s bring some signal back to the noise.
https://vibebench.standardagents.ai/