A few years ago, I worked on a Generative AI project, a customer-facing AI assistant. The company had great data and was convinced AI could turn it into something valuable.
We built a prototype fast. Users were excited.
Iteration was quick. Each tweak made the AI feel better.
Then we hit a wall.
We kept changing things, but… was it actually getting better? Or just different?
We didn't know.
When "Iterating" Is Just Making Random Changes
At first, improving the AI felt obvious. We spotted issues, fixed them, and saw real progress. But suddenly, everything slowed down.
- Some changes made things better, but we weren't sure why.
- Other changes made things worse, but we couldn't explain how.
- Sometimes, things just felt… different, not actually better.
It took me way too long to realize: we weren't iterating. We were guessing.
We were tweaking prompts, adjusting retrieval parameters, fine-tuning the model… but none of it was measured. We were just testing on a few cherry-picked examples and convincing ourselves that it felt better.
And that's exactly how most AI teams get stuck.
Better on a Few Examples Isn't Better
When you're close to a project, it's easy to think you can tell when something improves. You run a few tests. The output looks better. So you assume progress.
But:
- Did it actually improve across the board?
- Did it break something else in the process?
- Are you fixing what users actually care about, or just what you noticed?
Most teams think they're iterating. They're just moving in random directions.
Iterate Without Measurement... and Fail!
And that's the real problem.
Most teams, when they hit this wall, do what we did: try more things.
- More prompt tweaks.
- More model adjustments.
- More retrieval fine-tuning.
But real iteration isn't about making changes. It's about knowing, at every step, whether those changes actually work.
Without that, you're just optimizing in the dark.
So What's the Fix?
The teams that move past this don't just build better models; they build better ways to measure what "better" means.
Instead of relying on gut feeling, they:
- Define clear success criteria. What actually makes an answer useful?
- Measure changes systematically, not just on a few cherry-picked examples (a minimal sketch follows below).
- Make sure improvements don't break what already works.
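To make that concrete, here's a minimal sketch of such an evaluation harness in Python. Everything in it is illustrative: `ask_assistant` is a placeholder for your own prompt + retrieval + model pipeline, the two test cases stand in for a larger set of real user questions, and the keyword check is a deliberately crude stand-in for whatever success criteria you define.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    must_mention: list[str]  # crude proxy for "a useful answer"

# A small, fixed test set drawn from real user questions, versioned alongside
# the code, so every change is scored against the same benchmark.
TEST_SET = [
    TestCase("How do I reset my password?", ["reset link", "email"]),
    TestCase("What is your refund policy?", ["30 days"]),
]

def ask_assistant(question: str) -> str:
    # Placeholder: swap this for your real prompt + retrieval + model call.
    return "Click the reset link we email you. Refunds are accepted within 30 days."

def passes(answer: str, case: TestCase) -> bool:
    # Deliberately simple criterion: does the answer mention the key facts?
    return all(term.lower() in answer.lower() for term in case.must_mention)

def pass_rate() -> float:
    results = [passes(ask_assistant(c.question), c) for c in TEST_SET]
    return sum(results) / len(results)

if __name__ == "__main__":
    # Run this before and after every change. If the number drops,
    # the "improvement" broke something that already worked.
    print(f"Pass rate: {pass_rate():.0%}")
```

The specific check matters far less than the habit: the same fixed test set, scored the same way, before and after every change. That turns "it feels better" into a number you can actually compare, and it flags regressions the moment a tweak breaks something that used to work.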
The Bottom Line
Most AI teams don't struggle to build AI. They struggle to improve it.
I learned this the hard way. But once I started treating iteration as something that needs clear feedback loops, not gut feeling, everything changed.
In a follow-up article, I'll break down how to actually measure AI improvement without getting trapped by misleading metrics.
Follow to get notified when it's out.
In the meantime, if you want to go deeper on AI iteration and continuous improvement, check out my Blog.