Mrunmay Shelar
How Deep Agents Actually Work: A Browsr Architecture Walkthrough

Deep agents don’t fail loudly — they drift.

Once an agent runs for 10–50 steps, debugging becomes guesswork. You don’t know which tool call caused the issue, why the plan changed, or where cost and context exploded.

In this post, we’ll break down how a real deep agent works under the hood by walking through the architecture of Browsr, a browser-based deep agent, and observing its execution step by step.

Browsr Agent Architecture

What Makes an Agent “Deep”

  • Keeps a running plan / TODO list of what still needs to be done.
  • Uses tools (like a browser, shell, APIs) to act in the world step by step.
  • Stores persistent memory (artifacts, notes, intermediate results) so it doesn’t forget earlier work.
  • Regularly evaluates its own progress, adjusts the plan, and retries when something fails.
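The four properties above can be sketched as a single control loop. This is a toy illustration of the plan/act/remember/evaluate cycle, not Browsr's actual implementation; the `act` and `evaluate` functions are stand-ins for real tool calls and LLM self-checks.

```python
def run_deep_agent(goals, act, evaluate, max_steps=20):
    todo = list(goals)      # running plan / TODO list
    memory = []             # persistent record of intermediate results
    steps = 0
    while todo and steps < max_steps:
        goal = todo[0]
        result = act(goal, memory)      # use a tool to act in the world
        memory.append((goal, result))   # remember earlier work
        if evaluate(goal, result):      # self-check progress
            todo.pop(0)                 # goal achieved, move to the next
        steps += 1                      # otherwise retry the same goal
    return memory, todo

# Toy example: each "page action" succeeds only on the second attempt,
# so the agent has to notice the failure and retry.
attempts = {}
def act(goal, memory):
    attempts[goal] = attempts.get(goal, 0) + 1
    return "ok" if attempts[goal] >= 2 else "retry"

memory, remaining = run_deep_agent(
    ["open page", "extract data"], act, lambda g, r: r == "ok"
)
```

The key point is that failure doesn't end the run: the loop keeps the goal at the head of the TODO list and tries again, with the failed attempt preserved in memory.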

Because it can plan, remember, and correct itself, a deep agent can run for tens or hundreds of steps without losing the thread of the task.

Let’s debug and observe Browsr using vLLora (a tool for agent observability) and see what happens under the hood.

Browsr

Browsr is a headless browser agent that builds action sequences using a deep agent pattern, then hands you the payloads to run over APIs at scale. It also exports website data as structured output or LLM-friendly markdown.

At a high level, Browsr is a deep agent that:

  • Plans its next action explicitly
  • Executes browser commands in controlled steps
  • Persists state between iterations
  • Evaluates progress before continuing

You can explore the definition and related configurations in this repo.

Note: Always respect the copyright rules and terms of the sites you scrape.


Debugging with vLLora

To make the execution observable, we’ll inspect the agent using request-level traces and timelines captured during execution.

vLLora lets you debug and observe your agents locally. It helps you understand the agent’s architecture, inspect individual tool calls, and view the full agent timeline. It also works with all popular models.

Browsr iterates in single steps of 1–3 command bursts, saves context to artifacts along the way, and completes the task with a final tool call.

  • Driver: browser_step is the main executor; every turn runs 1–3 browser commands with explicit thinking, evaluation_previous_goal, memory, and next_goal.
  • Context control: Large tool outputs are written to disk so the model can drop token-heavy responses and reload them on demand.
  • Stateful loop: Up to eight iterations, each grounded in the latest observation block (DOM + screenshot) to avoid hallucinating.
  • Strict tool contract: Exactly one tool call per reply (no free text), keeping the agent deterministic and debuggable.
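The strict tool contract in the last bullet can be enforced with a small validator. The reply shape below (a `content` string plus a `tool_calls` list, as in common chat-completion APIs) is an assumption for illustration, not Browsr's actual wire format.

```python
def validate_reply(reply: dict) -> dict:
    """Reject model replies that break the one-tool-call contract."""
    if reply.get("content"):                  # free text is not allowed
        raise ValueError("reply contains free text")
    calls = reply.get("tool_calls", [])
    if len(calls) != 1:                       # exactly one call per turn
        raise ValueError(f"expected 1 tool call, got {len(calls)}")
    return calls[0]

# A conforming reply: no free text, exactly one tool call.
call = validate_reply({
    "content": "",
    "tool_calls": [
        {"name": "browser_step", "args": {"commands": ["click('#btn')"]}}
    ],
})
```

Because every turn is exactly one structured tool call, each step in a trace maps to one decision, which is what makes long runs deterministic and debuggable.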

Let’s examine the tool definitions in more detail.

Browsr Tool Definitions

browser_step is the driver between steps. The system prompt forces the model to read the latest DOM and screenshot, report the current state, and then decide what to do next. Each turn must include:

  • thinking: Reasoning about the current state.
  • evaluation_previous_goal: Verdict on the last step.
  • next_goal: The next immediate goal, in one sentence.
  • commands: Array of commands to execute.
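One way to model a `browser_step` turn is a small record with exactly these four fields. The field names follow the post; the dataclass itself and the sample values are illustrative, not Browsr's internal types.

```python
from dataclasses import dataclass, field

@dataclass
class BrowserStep:
    thinking: str                  # reasoning about the current state
    evaluation_previous_goal: str  # verdict on the last step
    next_goal: str                 # next immediate goal, one sentence
    commands: list = field(default_factory=list)  # 1-3 browser commands

step = BrowserStep(
    thinking="The search results page has loaded with ten links.",
    evaluation_previous_goal="Success: navigation completed.",
    next_goal="Click the first result link.",
    commands=["click('a.result')"],
)
```

Forcing every turn through this shape means the model must state its verdict on the previous step before it is allowed to act, which is what keeps the loop grounded in the latest observation.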

You can check out the full agent definition here.

Example: In one representative run, Browsr used the available context to navigate in step one, click in step two, and then run a JS evaluation to return structured data from the page.

Example invocation of Steps

Sample Traces


Average cost and number of steps using gpt-4.1-mini

  • Average cost per trace ≈ $0.0303 per run
  • Average steps ≈ 10.5 steps per run
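A quick back-of-the-envelope check on these averages gives the cost per individual step:

```python
# Trace averages from the gpt-4.1-mini runs above.
avg_cost_per_trace = 0.0303   # USD per run
avg_steps_per_trace = 10.5    # steps per run

cost_per_step = avg_cost_per_trace / avg_steps_per_trace
print(f"~${cost_per_step:.4f} per step")   # ~$0.0029 per step
```

So each `browser_step` turn costs roughly three tenths of a cent, which is the number to watch when a run's step count starts to climb.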

Why Observability Is Critical for Deep Agents

Once agents move beyond single-shot prompts, debugging stops being straightforward.

Engineers often find themselves tweaking system prompts, stepping through tool calls, and guessing what went wrong somewhere in the middle of a long run. When an agent executes 50+ steps and makes hundreds of decisions, failures rarely have a single obvious cause.

This is where observability becomes essential.

  • Drift over time

    An agent may start out doing exactly what you expect, then gradually veer off course due to noisy context, misinterpreted instructions, or a small mistake early on that compounds across later steps.

  • Cost and context visibility

    Without traces, it’s hard to see where tokens spike, context balloons, or expensive branches are triggered — especially when comparing behavior across different models.

  • Traceable decisions

    Lining up what the agent read, decided, and executed at each step makes cause-and-effect visible instead of speculative.

  • End-to-end execution clarity

    Long-running agents blur where time and money are spent: planning, tool execution, retries, or extraction. Observability provides the full picture.

Tools like vLLora make this practical by exposing request-level traces and timelines, allowing you to see what a deep agent is actually doing across an entire run — not just the final output.
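The kind of per-step aggregation a trace viewer makes easy can be sketched in a few lines. The trace records below are invented sample data, not vLLora's actual trace format.

```python
# Hypothetical per-request trace records for one agent run.
trace = [
    {"step": 1, "tool": "browser_step", "tokens": 1200, "cost": 0.0021},
    {"step": 2, "tool": "browser_step", "tokens": 5400, "cost": 0.0094},
    {"step": 3, "tool": "extract",      "tokens": 900,  "cost": 0.0016},
]

# Aggregate tokens and cost by tool to see where spend concentrates.
by_tool = {}
for rec in trace:
    t = by_tool.setdefault(rec["tool"], {"tokens": 0, "cost": 0.0})
    t["tokens"] += rec["tokens"]
    t["cost"] += rec["cost"]
```

Even on this toy data, the token spike at step 2 stands out immediately; across a 50-step run, this view is what turns "context exploded somewhere" into "context exploded here."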

If you want to discuss observability patterns, agent anatomy, or agent tooling in more detail, join the vLLora Slack community to connect with other developers.


Key Takeaways

  • Deep agents fail gradually, not catastrophically
  • Observability turns debugging from guesswork into inspection
  • Cost, context, and behavior are architectural concerns
  • Deterministic tool execution makes long runs understandable

As deep agents become more common, observability isn’t optional — it’s the difference between hoping an agent works and knowing why it does.
