I scored 3/50 on a take-home benchmark for a job application. And I still got the job.
At the time, I hadn't built a fully agentic system before. While I had worked with LLM pipelines and small AI tools, an entirely autonomous architecture was completely new to me. And this taught me a few important lessons — not just about AI agents, but how to approach unfamiliar engineering problems.
After showing my results in the job interview, my (now) CTO mentioned that he had noticed I was using a cheap mini LLM (endless testing had racked up quite a bill!) and that he had tried out my agent with the frontier Opus model. Funnily enough, the agent actually performed worse, scoring only 2/50!
The problem was not the model. I had architected a clean system with plugin-based tooling, consistent interfaces, and a semi-autonomous pipeline that enforced structure around the agent. My plan was to start constrained — give the agent specific tools to achieve a subset of questions and slowly expand the agent with new tools around my well-architected abstractions until it could solve everything.
On paper, the code looked solid. In practice, it couldn't solve the problems.
During the interview, I learned that if I had simply built an agent loop with a Python sandbox and allowed the agent to execute arbitrary (well not entirely arbitrary) code, it would have scored 80% on the benchmark. No prompt engineering needed, no special tools, no advanced abstractions. Just an LLM, a while loop, and a Python environment.
This is not to say that everyone should just release production AI agents that execute arbitrary code on their systems! The point is that I had built before knowing. I assumed I knew the right abstractions, knew which tools to build, and knew how to appropriately constrain the agent. In fact, I did not. I was not even an inkling familiar with the domain of the benchmark.
Looking back, the right approach was to start simple.
Build an agent loop and let the agent safely execute Python code. From there, you already have a working baseline. Everything else — tooling, constraints, abstractions — only exist to improve reliability, cost, or latency.
More importantly, this experience let me properly internalize what an agent is. In traditional software, if you want a system that solves many different problems, you explicitly encode that complexity. Agents flip this assumption because the complexity lives in the model itself.
For me, I see two main takeaways.
The first is practical. There is a trend towards more autonomy for agents. And (somewhat) ironically, I don't think that coding agents themselves have caught onto this. Claude tended to steer me toward less autonomy, more "pipeliney" designs. If we give agents ways to safely execute Python, Python is a tool. And a flexible and powerful one at that. And as frameworks like Pydantic Monty appear on the scene, this is becoming easier than ever.
The second takeaway is about how to approach engineering. Start simple. Understand the domain. Understand the failure modes. And build your system around that. You can't guess powerful frameworks into existence — constraints should be learned, not assumed. This goes for AI agents and beyond.
I've seen other people make the same errors with over-microservicing, over-normalising, and over-abstracting. My mistake will help me avoid this classic engineering trap next time.
The winning pattern is never the most elegant on day one but the simplest one that could possibly work.
Top comments (0)