I've been building AI agents for a while now. Customer support, task automation, the usual stuff. And for the longest time I had the same problem everyone else seems to have — the agent would work fine in testing, go live, and within a few weeks I'd notice it kept making the same wrong decisions on the same types of tasks.
The frustrating part wasn't that it failed. It was that it failed the same way, over and over, with no way to improve without me manually going in and rewriting prompts or hardcoding rules.
I logged everything. I had LangSmith traces, I had application logs, I had all the data. But none of it told me which action was actually correct for which task. It told me what happened. Not whether it was right.
So I built something for my own agents. Nothing fancy at first — just a small layer that tracked which action was taken on which task type, scored the outcome after the fact, and used that history to recommend better actions the next time a similar task came in.
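Roughly, the core of it can be sketched like this. This is a minimal paraphrase of the idea, not my actual code; the names (`OutcomeStore`, `record`, `recommend`) and the mean-score ranking are illustrative assumptions:

```python
from collections import defaultdict

class OutcomeStore:
    """Tracks which action was taken on which task type, the scored
    outcome, and recommends the historically best action."""

    def __init__(self):
        # (task_type, action) -> list of outcome scores in [0, 1]
        self.history = defaultdict(list)

    def record(self, task_type, action, score):
        """Log the after-the-fact score for an action on a task type."""
        self.history[(task_type, action)].append(score)

    def recommend(self, task_type):
        """Return the action with the highest mean outcome score,
        or None if there's no history for this task type."""
        means = {
            action: sum(scores) / len(scores)
            for (tt, action), scores in self.history.items()
            if tt == task_type and scores
        }
        if not means:
            return None
        return max(means, key=means.get)

# Usage: score outcomes as they come in, then ask before the next run.
store = OutcomeStore()
store.record("refund_request", "escalate", 0.2)
store.record("refund_request", "auto_refund", 0.9)
print(store.recommend("refund_request"))  # auto_refund
```

The whole trick is in the scoring step: as long as you can assign even a rough 0-to-1 score to an outcome after the fact, the ranking takes care of itself.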
Three things surprised me:
The cold start problem is real but solvable. The first 20-30 runs are basically random exploration. Once there's enough outcome history, the recommendations become genuinely reliable. In my own testing, the correct-action rate went from around 70% to 92% after enough runs: not because the model changed, but because the decision layer learned what worked.
Knowing when NOT to act is as important as knowing what to do. I added confidence gating — if the system doesn't have enough history on a task type, it steps aside and lets the base model decide rather than pushing a low-confidence recommendation. This alone reduced bad decisions significantly on edge cases.
The feedback loop compounds. This is the part I didn't expect. Every run makes the next run slightly better. After a few hundred outcomes, the system has a clear picture of what actions work in which contexts, and the recommendations become very reliable.
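The gating logic from the second point is simple to sketch. This is an assumption-laden sketch, not the real implementation: the `MIN_SAMPLES` threshold, the `history` shape, and the `base_model_decide` callback are all names I'm inventing here to show the shape of it:

```python
MIN_SAMPLES = 20  # hypothetical threshold; roughly the 20-30 cold-start runs

def choose_action(history, task_type, base_model_decide):
    """Use the learned recommendation only when there's enough outcome
    history for this task type; otherwise defer to the base model."""
    samples = {
        action: scores
        for (tt, action), scores in history.items()
        if tt == task_type and scores
    }
    total = sum(len(s) for s in samples.values())
    if total < MIN_SAMPLES:
        # Not enough evidence: step aside rather than push a
        # low-confidence recommendation.
        return base_model_decide(task_type)
    # Enough history: pick the action with the best mean outcome.
    return max(samples, key=lambda a: sum(samples[a]) / len(samples[a]))
```

The important design choice is that the fallback is the base model, not a random guess, so during cold start you're never worse than the agent you started with.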
I've been running this on my own agents for a while now. Not sure if others have hit this wall — curious what people are doing to handle decision quality in production agents. Are you manually reviewing logs? Building your own scoring systems? Just accepting the failure rate?