One of the strangest things about AI engineering is that your test suite can be 100% green while your product is getting worse.
Traditional software taught us to think in absolutes:
- pass/fail
- correct/incorrect
- deterministic/non-deterministic
AI systems force us to think in distributions, thresholds, confidence intervals, and trade-offs.
It's a subtle shift, but it changes how you build, test, and deploy software.
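To make that shift concrete, here is a minimal sketch of what a test looks like once it stops being pass/fail. It assumes you run the same eval case many times and count how many runs pass; the function names (`wilson_interval`, `passes_gate`) and the 0.85 threshold are illustrative assumptions, not taken from any particular eval framework. The gate compares the lower bound of a Wilson score interval against the threshold, rather than the raw pass rate.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def passes_gate(passes: int, total: int, threshold: float) -> bool:
    """Hypothetical release gate: the interval's lower bound must clear the threshold."""
    lower, _ = wilson_interval(passes, total)
    return lower >= threshold


if __name__ == "__main__":
    # 46 of 50 runs passed: the point estimate is 0.92, but the sample is small.
    lower, upper = wilson_interval(46, 50)
    print(f"pass rate 0.92, interval [{lower:.3f}, {upper:.3f}]")
    print("gate at 0.85:", passes_gate(46, 50, threshold=0.85))
```

With only 50 runs, a 92% pass rate does not clear an 85% gate, because the interval is wide enough to dip below the threshold. That is the trade-off in miniature: more runs cost more, but fewer runs tell you less.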
I collected some of the lessons I've learned while building AI-powered applications and learning evals.
https://kig.re/2026/06/22/writing-evals-for-ai-powered-apps.html
Curious how others are approaching evaluation and regression testing in production AI systems.
And here is a summary of the most recent research on the subject, so you don't have to dig through it yourself :-)