Agents That Build and Improve Agents — Building Autonomous Browsing with Claude Code
There's a way of working with AI agents that's becoming an increasingly common pattern. But it requires a different posture, one that feels unnatural to most developers: a willingness to be comfortable with not knowing exactly where you're going. Let me explain.
I've spent the last year researching and experimenting with agents, building my own performance-specialized agent based on a fork of DevTools. And since the holidays last year, after much research, I felt it was time to venture into a larger-scope project: the browser. I felt there was more to agents, browsers, and the future of the web than what I saw in the currently available options, and decided to dig deep into the problem myself.
After setting up some of the initial service layers, it was time to build the automation service layer. I've grown to love using Claude Code over a regular IDE, and since the early release of sub-agents, some emerging patterns began to form in my workflow. But let's begin by better describing what I'm building this time around.
This new side project is meant to be the convergence of the best personal assistant and a browser, and an experiment with the future of the web and its interfaces.
It has many layers, like a gen-AI pipeline for reimagining websites while keeping their branding, and a multilayered memory system that abstracts interests over time. But for this writeup I'll describe the early R&D for the autonomous browsing service. The concept for this service is no different from any current AI browser: the user makes a request, the browser navigates and executes a plan, and reports back with what it found.
After five months (at the time of writing) working at a company that handles browser automation for agentic QA (QA.tech), I now know better what it takes to drive browsers with the help of AI agents. Working with codebases painstakingly built over years teaches one to appreciate the evolution of code built for such a complex task.
So I was curious how a codebase built for a similar purpose would be conceived nowadays, with models and tools that have evolved significantly in capability. I decided to try a different approach, one where I was genuinely unsure how well it would work.
What surprised me wasn't that it worked. It was how quickly I got to an MVP and the learnings I developed.
Sandboxes as a learning ground
Developers are now getting used to thinking of, and using, AI assistants as execution engines. You know what you want, you describe it, and the AI writes the code. After which you review, iterate, and ship.
What I ended up leaning towards is treating the agent as a research partner, where you have a direction, not necessarily a destination.
Here's a concrete example. When I built the first draft of the browsing agent, I was trying to make it create pages in Notion without hardcoded instructions. The obvious first approach would be to prompt your way into teaching the agent what to click and what to look for, so the agent could drive that session to a given goal.
It failed. The agent was incapable of even getting to the new page after many tries, and when it finally did manage it, it could not do much more than enter a title. Notion was picked as a testing ground because of how complex its underlying DOM structure and interactions are.
The first session tried to type into buttons. Pasted full markdown content including the title twice. Didn't verify anything between actions. It assumed structure instead of observing it.
If you've used agent-browser by Vercel, you've probably noticed that the 'baseline' intelligence of agents driving web pages is quite powerful now. With nothing but a simple --help and a few iterations, agents can learn how to drive web pages without much prompting at all. But when you are building a service, you don't have the luxury of letting the agent spend a few iterations learning how to use the browser every time. That's what system prompts are for.
So I tried something different. Instead of telling the agent what to do, I asked: what if the agent could figure it out? I know I just said that I don't have that luxury when a user is trying to use the service. But what if I could pre-compute that?
So "the agent" in this case became actually not just one but many. I used Claude Code with agent-browser CLI and tasked Opus to a session where it was the orchestrator, leading Sonnet models to drive CDP sessions based on tasks with a goal and a few vague step suggestions based on it. Nothing trying to 'tell' the browsing agent what to do, but instead just some tips to steer it towards the goal, but nothing technical or too scripted. The goal was to 'distil' the accumulated knowledge into designing the right CoT prompt and tools to drive a CDP session. Creating evals along the way. And these would in turn be used by Gemini 3 Flash as the actual 'brains' of the service. So the first sessions became fixtures to design for evals that ended up being ran by Gemini to guague and measure the iterations, but the design of it and 'lab' version of it would be entirely via Claude Code via different scritps to allow for some basic tokenomics and each sub-agent reporting their 'experience' along the way.
After a few iterations, based on how the sub-agents experienced the session, they fed their findings back to Opus, and together they created a chain-of-thought prompt and tools that successfully drove some basic interactions around Notion. Without any input from me on how to drive the session or what to look out for. The curious part was that they arrived at a mental model very similar to how I've approached similar problems at work and elsewhere.
This was quite a moment for me.
I had 3-5 sub-agents trying to independently find their way around Notion to execute the same task. Each one exploring, failing, learning. Then reporting back what worked.
And from that chaos, a pattern emerged. Not one I designed, one they collectively discovered.
For the most experienced out there, there might be questions about overfitting as an outcome. Though I cannot be 100% certain just yet without running a larger experiment, the design and guardrails were structured to avoid it.
I had 'hand-written' code for agents before, when I forked and extracted parts of DevTools to implement my own [performance-specialized agent](https://github.com/PerfLab-io/perfagent). But this was different: the mindset, the process, and the pace of iteration were something I had to rethink, rebuilding my entire development process around it.
From coding agent to research assistant
The sub-agents' findings were more structured than what I first started with, following a certain pattern. One that looks familiar to anyone building for browser automation:
OBSERVE → REASON → ACT → VERIFY
- OBSERVE: Take a snapshot to understand current state
- REASON: Decide what to do based on what's actually there
- ACT: Perform ONE action only. The action itself might be a combination of keystrokes or commands, like Ctrl+Z for undo.
- VERIFY: Take a fresh snapshot to confirm the result
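Sketched as code, the loop might look like this. It's a minimal sketch: the `Browser` and `Reasoner` interfaces are hypothetical stand-ins for the real CDP tooling and model calls, and only the control flow is the point:

```typescript
// Minimal sketch of the OBSERVE → REASON → ACT → VERIFY loop.
// `Browser` and `Reasoner` are hypothetical interfaces standing in for
// real CDP tooling and model calls.
interface Snapshot { text: string }
interface Action { kind: "click" | "type" | "key"; target: string; value?: string }

interface Browser {
  snapshot(): Snapshot;           // OBSERVE / VERIFY
  perform(action: Action): void;  // ACT
}

interface Reasoner {
  decide(goal: string, state: Snapshot): Action | null; // REASON (null = done)
  confirm(action: Action, after: Snapshot): boolean;    // did the action land?
}

function runLoop(goal: string, browser: Browser, agent: Reasoner, maxSteps = 20): boolean {
  for (let step = 0; step < maxSteps; step++) {
    const state = browser.snapshot();          // OBSERVE: fresh state, never cached
    const action = agent.decide(goal, state);  // REASON: based on what's actually there
    if (action === null) return true;          // goal reached
    browser.perform(action);                   // ACT: exactly one action
    const after = browser.snapshot();          // VERIFY: fresh snapshot
    if (!agent.confirm(action, after)) {
      // Verification failed: don't compound the mistake; re-observe next iteration.
      continue;
    }
  }
  return false; // step budget exhausted
}
```

The important constraint is that every iteration re-observes; nothing from a previous snapshot is trusted after an action.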
Opus, as the coordinating agent, arrived at a form of schema to map actions to tool call schemas, plus error-avoidance rules like the ones below, extracted from one of the reports:
- Never fill a button: buttons are for clicking
- Never use stale refs: always refresh after mutations
- Never paste complex content at once: break it into simple paragraphs
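Rules of this kind can be encoded as a pre-flight check on each proposed action. This is an illustrative sketch, not the agents' actual schema; the role names follow the accessibility tree, but the rule set and shapes here are assumptions:

```typescript
// Illustrative pre-flight validation for a proposed action, encoding
// error-avoidance rules of the kind the agents converged on.
type Role = "button" | "textbox" | "link";

interface ProposedAction {
  kind: "click" | "fill";
  targetRole: Role;
  refSnapshotId: number; // which snapshot the element ref came from
  value?: string;
}

function validate(action: ProposedAction, currentSnapshotId: number): string[] {
  const errors: string[] = [];
  // Never fill a button: buttons are for clicking.
  if (action.kind === "fill" && action.targetRole === "button") {
    errors.push("buttons are for clicking, not filling");
  }
  // Never use stale refs: refs are only valid against the latest snapshot.
  if (action.refSnapshotId !== currentSnapshotId) {
    errors.push("stale ref: re-snapshot before acting");
  }
  // Never paste complex content at once: break it into simple paragraphs.
  if (action.kind === "fill" && (action.value ?? "").includes("\n")) {
    errors.push("multi-paragraph paste: split into single paragraphs");
  }
  return errors;
}
```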
I was expecting the pattern to arrive at something using more vision context to extract coordinates for actions, but the agents leaned into a combination of signals. They very quickly 'decided' to use the accessibility tree that browsers expose for screen readers as a core signal, alongside screenshots as a validation layer.
The accessibility tree is ground truth.
That phrase came from one of the sub-agents' own analysis notes. The agents discovered that the accessibility tree contained what they needed to understand the page. No CSS selectors, no coordinates, just the meaning embedded in the interface itself. Granted, the a11y tree is often incomplete and neglected, but in conjunction with screenshots it makes a good dual layer for driving autonomous browsing.
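As a rough illustration of why the tree works as ground truth, here's a sketch that condenses an accessibility-style node tree into the flat role/name lines an agent can reason over directly. The node shape loosely mirrors a CDP `AXNode`, simplified for illustration; it is an assumption, not the project's actual format:

```typescript
// Condense an accessibility-style tree into flat `role "name"` lines:
// the kind of ground-truth text an agent can reason over directly.
interface AXNode {
  role: string;
  name?: string;
  children?: AXNode[];
}

function condense(node: AXNode, depth = 0, out: string[] = []): string[] {
  // Keep only nodes with an accessible name; skip purely structural
  // wrappers, but keep descending into their children.
  if (node.name) {
    out.push(`${"  ".repeat(depth)}${node.role} "${node.name}"`);
    depth++;
  }
  for (const child of node.children ?? []) condense(child, depth, out);
  return out;
}
```

Note there are no selectors or coordinates in the output: just roles and names, which is exactly the semantic layer the agents latched onto.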
This kinda represents one of the most interesting arcs in web dev to me: we'll end up making web interfaces more accessible, but humans won't be doing most of the clicking around.
Another important learning from the notes was that "observation isn't free, but mistakes are more expensive". The cost of taking an extra snapshot is negligible; the cost of acting on stale information and course-correcting is exponentially worse.
One of the agents noted that "Element refs are ephemeral" and to "Treat them as valid only until the next mutation". It discovered this while dealing with the dynamic nature of the Notion interface: after pressing Enter, the same textbox might have a different ref number.
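The "refs are ephemeral" rule can be modeled as a ref store keyed by snapshot generation. This is a hypothetical sketch of the invariant, not the actual implementation:

```typescript
// Model element refs as valid only for the snapshot generation that
// minted them. Any mutation bumps the generation, invalidating every
// outstanding ref until the next snapshot.
class RefStore {
  private generation = 0;
  private refs = new Map<string, number>(); // ref id -> generation minted in

  snapshot(ids: string[]): void {
    this.generation++; // fresh snapshot: new generation
    for (const id of ids) this.refs.set(id, this.generation);
  }

  mutate(): void {
    this.generation++; // DOM changed: all prior refs are now stale
  }

  isValid(id: string): boolean {
    return this.refs.get(id) === this.generation;
  }
}
```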
Using Claude Code as a research lab, with a lead agent orchestrating smaller agents in a form of 'knowledge distillation' and autonomous learning, was not in my playbook until recently. I had a hunch from experience that generic approaches might beat specific ones, that certain patterns are powerful. But seeing those 'simulation engines' arrive at their own conclusions was very exciting, and something I reused for other, more advanced features that will become the topic of future writeups.
Moving with the process
There's a passage from Dune, in the movie adaptation's version of the quote:
"The mystery of life isn't a problem to be solved, but a reality to experience; a process that cannot be understood by stopping it. We must move with the flow of the process, we must join it, we must flow with it."
Though the irony of the backstory the series is built on does not escape me, this feels surprisingly relevant to this new moment.
The instinct most developers have is to control the process: trust nothing you can't immediately verify. This makes sense when you're writing deterministic code. But it can slow down development when you're working with systems that are more autonomous by nature.
The better approach was to join the flow. Set a direction, provide constraints, then watch what patterns emerge.
I've seen more than once people questioning whether we are going to end up 'dumber' or less capable than before as engineers. And though it's hard to answer for sure what will happen, similar movements have happened before. We are moving to higher forms of abstraction, where we as software engineers are, for the most part, more involved in architecting, researching, and validating than in directly writing code: ensuring the correct abstractions and safeguards for the systems, and leaving implementation details more as a 'read' operation than a 'write' operation.
This is quite similar to the move from lower-level languages to higher-level ones, and from focusing on standard libraries to using frameworks as we consolidated best practices and common abstractions.
As a result, we as engineers can focus less on (re)building basic feature details and more on complex features and capabilities, getting closer and closer to the product side.
A good example is another research task where I launched a "sub-agent committee" to find better ways to handle the handover between agents (the conversation and browsing agents): four parallel research agents investigating different aspects of the same problem.
The premise was simple: the imbalance between input and output tokens is quite significant. So if you can avoid loading into context and 'repeating' the same tokens on each handover between agents, you arrive at a more structured, faster, and also cheaper outcome. The problem is not the storage/handover format, but how the receiving agent consumes the context.
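A back-of-envelope sketch of why this matters. The prices and token counts below are hypothetical placeholders; only the ratio between the two strategies is the point:

```typescript
// Back-of-envelope comparison of handover strategies. Token counts and
// the per-million-token price are hypothetical placeholders.
interface Handover {
  inputTokens: number; // what the receiving agent must load into context
}

function costUSD(h: Handover, pricePerMTokInput: number): number {
  return (h.inputTokens / 1_000_000) * pricePerMTokInput;
}

// Full-transcript handover: the receiver re-reads everything.
const fullTranscript: Handover = { inputTokens: 80_000 };

// Progressive disclosure: a compact summary up front, with details
// fetched only when the receiver actually needs them (here, two lookups).
const summaryPlusLookups: Handover = { inputTokens: 4_000 + 2 * 3_000 };
```

With these made-up numbers the progressive-disclosure handover is several times cheaper, and the receiver's context stays focused on what it actually needs.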
The more autonomous process meant I could iterate, with an orchestration layer of sub-agents, over different possible alternatives to arrive at the best possible implementation. Part of the discovery came from progressive disclosure and how SKILLS.md leverages this principle. But those details will be explored in another writeup as well.
These insights emerged because I let them emerge. I defined the problem space, designed experiments to prove or disprove different premises, provided the tools, and reviewed the outcome on the other side.
Both research efforts involved designing features and patterns more complex than some of the code I've written by hand before, and I arrived at an early draft much sooner than I would have otherwise.
Thinking machines and the future
Working with agents as research partners changes something subtle but important: your relationship to not knowing the outcome upfront.
Developers tend to treat uncertainty as a problem to eliminate. You research until you know, then you build. Knowledge precedes action.
But some problems we'll be trying to solve on these new systems can't be known in advance. The solution space is too large and the interactions are too complex. You have to act your way into understanding.
This is where agents become a multiplying factor. They let you explore faster than your own 'context window' allows. They're the external actor that makes iteration possible at the speed insight requires.
There's a common anxiety in tech right now: what happens to developers when AI can code?
Well, it already can (and for quite some time now). So what then?
I already see very smart people arriving at the same conclusion: the craft is evolving, not dying.
Ten years ago, the craft was knowing syntax and language quirks. Then it was knowing frameworks and ecosystems. Then it was knowing distributed systems and infrastructure.
Now it's knowing how to work with agents in your workflow and understanding how to better shape context for the best outcome. It's knowing when to specify and when to explore, developing intuition for what agents are good at and where they need guidance.
The result is not about how much better I would have done it myself, but about how much more I can experiment with, how reproducible 'good' code can be, and how fast I can iterate over ideas.
It's still the same craft, but different.
So experiment away.