web4browser

Why AI Browser Agents Need a Runbook Before They Need More Prompts

When an AI browser agent fails, the first instinct is often to rewrite the prompt.

Make it clearer.

Add more steps.

Add more warnings.

Tell the agent to be careful.

That can help sometimes. But in real browser workflows, especially workflows involving logged-in accounts, persistent browser profiles, proxies, and human review, the problem is often not the prompt.

The problem is that the agent has no runbook.

A prompt tells the agent what you want.

A runbook tells the agent how to operate inside a real browser environment.

That distinction matters.

A browser agent that can click buttons is useful. A browser agent that knows which account it is using, which profile is loaded, which proxy should be active, when to stop, when not to retry, and what evidence to save is much more useful.

This article is about that missing layer.

Not more prompts.

Better browser operations.


A prompt is not an operating model

A prompt is good for expressing intent.

For example:

Check this account and summarize any issues.

That is understandable.

But it does not answer the operational questions:

Which account?
Which browser profile?
Which proxy?
Which region?
What can be changed?
What must never be changed?
When should the agent stop?
How many retries are allowed?
What evidence should be saved?
Who reviews risky steps?

For a public page, this may not matter much.

For a logged-in browser profile, it matters a lot.

The browser is no longer just a runtime. It is carrying account state: cookies, local storage, permissions, previous sessions, extensions, proxy assumptions, language settings, and sometimes team history.

If the agent is operating inside that environment, the environment needs rules.

Putting all of those rules into one giant prompt usually creates a brittle workflow.

A better pattern is:

Prompt = task intent
Runbook = operating rules

The prompt can stay short.

The runbook carries the boundaries.


Why browser agents fail in real workflows

AI browser agents rarely fail in just one way.

They fail at the edges between automation, identity, and operations.

Wrong account context

The agent opens the correct page, but the wrong account is logged in.

The task may still appear successful. The dashboard loads. The agent extracts data. The summary looks reasonable.

But the result belongs to the wrong account.

That is worse than a visible failure.

Profile drift

A persistent browser profile slowly changes over time.

Cookies expire. Local storage changes. Timezone settings drift. Proxy bindings are updated. Locale assumptions become outdated. Extensions may be enabled or disabled.

The agent is still using a profile, but not necessarily the profile state you expected.

Prompt overreach

A human writes:

Find the problem and fix it.

The agent interprets “fix it” broadly.

It changes settings, retries logins, clicks recovery flows, or updates account details.

The original goal may have been inspection. The actual behavior became account modification.

Silent retry loops

Network timeouts can be retried.

Temporary 5xx errors can often be retried.

But login failure, verification prompts, permission errors, and region mismatches should usually stop the run.

Without retry rules, an agent may keep trying and turn a small issue into a bigger one.

No human checkpoint

Some actions should not be fully automatic:

  • payment
  • credential entry
  • wallet action
  • security setting change
  • account recovery
  • password reset
  • identity verification

A workflow that does not define human review points is relying on the model to improvise.

That is not a safety strategy.

No evidence trail

A run fails and the only output is:

Error: timeout

That does not tell the team whether the issue came from the page, the profile, the proxy, the task instruction, the account state, or the agent’s reasoning.

Without evidence, the same failure will happen again.


What a browser agent runbook should contain

A browser agent runbook does not need to be complicated.

It only needs to make the hidden assumptions explicit.

Here are the fields I would define before letting an AI browser agent operate inside a logged-in profile.


1. Account context

Do not give the agent only a URL.

Give it an account context.

{
  "account_id": "acct_us_042",
  "profile_id": "profile_us_042",
  "account_group": "us-social-review"
}

The key field is account_id.

Everything else should map around it.

The agent should know:

This is the account I am operating for.
This is the browser profile attached to it.
This is the account group or workflow category.

This prevents a common failure: correct page, wrong account.

For multi-account workflows, account context should not live in someone’s memory or a spreadsheet note. It should be part of the run.
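
One way to make that concrete is a small check that runs before the agent touches the page. This is only a minimal sketch in Python: `AccountContext` and `verify_account_context` are hypothetical names, and it assumes your own tooling can report which account and profile it actually observed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountContext:
    """Account fields from the runbook (names mirror the JSON above)."""
    account_id: str
    profile_id: str
    account_group: str


def verify_account_context(expected: AccountContext,
                           observed_account_id: str,
                           observed_profile_id: str) -> list[str]:
    """Return a list of mismatches; an empty list means the context matches.

    How the observed values are obtained (for example, by reading the
    logged-in handle from the page) is left to your own tooling.
    """
    problems = []
    if observed_account_id != expected.account_id:
        problems.append(
            f"account mismatch: expected {expected.account_id}, "
            f"observed {observed_account_id}"
        )
    if observed_profile_id != expected.profile_id:
        problems.append(
            f"profile mismatch: expected {expected.profile_id}, "
            f"observed {observed_profile_id}"
        )
    return problems
```

If the returned list is not empty, the run should stop before any task step is attempted.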


2. Environment assumptions

A browser run often depends on environment assumptions.

For example:

{
  "expected_country": "US",
  "timezone": "America/New_York",
  "locale": "en-US",
  "proxy_id": "proxy_us_07"
}

These fields are not decoration.

They define the expected operating environment.

If expected_country is US, but the current exit IP is somewhere else, the agent should not continue blindly.

If the profile assumes America/New_York, but the browser timezone does not match, that should be visible before the task starts.

In many browser automation failures, the page is not the problem.

The environment is.

A runbook should make proxy, timezone, locale, and region assumptions checkable.
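
"Checkable" can be as simple as a field-by-field comparison. The sketch below is an illustration rather than a finished tool: `check_environment` is a hypothetical helper, and it assumes the observed values are collected by your own tooling and reported under the same keys as the runbook.

```python
EXPECTED_ENV = {
    "expected_country": "US",
    "timezone": "America/New_York",
    "locale": "en-US",
    "proxy_id": "proxy_us_07",
}


def check_environment(expected: dict, observed: dict) -> dict:
    """Compare expected and observed values field by field.

    Returns {field: (expected_value, observed_value)} for every mismatch.
    A field missing from `observed` counts as a mismatch.
    """
    mismatches = {}
    for field, want in expected.items():
        got = observed.get(field)
        if got != want:
            mismatches[field] = (want, got)
    return mismatches
```

An empty result means the environment matches the runbook. Anything else is worth showing to a human before the task starts.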


3. Task scope

The agent needs to know what kind of task it is performing.

A read-only inspection is different from an account-changing action.

{
  "task_type": "read-only-inspection",
  "allowed_actions": [
    "inspect",
    "summarize",
    "export_report"
  ],
  "blocked_actions": [
    "payment",
    "password_change",
    "security_settings"
  ]
}

This is more reliable than writing:

Be careful.

“Be careful” is vague.

blocked_actions is explicit.

For browser agents, task scope is one of the most important runbook fields because agents are flexible by design. They can adapt, interpret, and recover.

That flexibility needs a boundary.
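
In code, that boundary can be a deny-by-default check in front of every action the agent proposes. This is a sketch under assumptions: `is_action_permitted` is a hypothetical function, and it assumes your tool layer labels each proposed action with a name from the runbook.

```python
TASK_SCOPE = {
    "task_type": "read-only-inspection",
    "allowed_actions": ["inspect", "summarize", "export_report"],
    "blocked_actions": ["payment", "password_change", "security_settings"],
}


def is_action_permitted(scope: dict, action: str) -> bool:
    """Deny by default.

    A blocked action is always refused. Anything else is permitted only
    if it is explicitly listed in allowed_actions.
    """
    if action in scope.get("blocked_actions", []):
        return False
    return action in scope.get("allowed_actions", [])
```

Note the design choice: an action that appears in neither list is refused. That keeps a new, unanticipated action from slipping through just because nobody thought to block it.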


4. Stop conditions

A good agent is not one that always continues.

A good agent knows when to stop.

{
  "stop_if": [
    "verification_prompt",
    "unexpected_login_page",
    "payment_page",
    "proxy_region_mismatch",
    "repeated_failed_attempts"
  ]
}

Stop conditions are especially important for logged-in workflows.

The agent should stop if:

A verification prompt appears.
A login page appears unexpectedly.
A payment page appears.
The proxy region does not match the expected region.
The same action fails repeatedly.
The page asks for sensitive account recovery.

Stopping is not failure.

Stopping is part of the workflow.

A runbook makes that behavior predictable.
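
A possible shape for that check, as a sketch only: `first_stop_condition` is a hypothetical helper, and it assumes something upstream (a page classifier, a proxy check, a failure counter) turns what the agent sees into event names that match the runbook.

```python
from typing import Optional

STOP_IF = [
    "verification_prompt",
    "unexpected_login_page",
    "payment_page",
    "proxy_region_mismatch",
    "repeated_failed_attempts",
]


def first_stop_condition(stop_if: list[str],
                         observed_events: set[str]) -> Optional[str]:
    """Return the first runbook stop condition present in observed_events.

    Conditions are checked in runbook order, so the reported reason is
    deterministic. Returns None when the run may continue.
    """
    for condition in stop_if:
        if condition in observed_events:
            return condition
    return None
```

Returning the matched condition, rather than a bare boolean, gives the evidence trail a concrete reason for the stop.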


5. Retry policy

Retries are useful.

Unbounded retries are not.

A runbook should define what can be retried and what should stop immediately.

{
  "retry_policy": {
    "max_attempts": 2,
    "retry_on": [
      "network_timeout",
      "temporary_5xx"
    ],
    "do_not_retry_on": [
      "login_failed",
      "verification_required",
      "permission_denied"
    ]
  }
}

This keeps the agent from treating every error as a temporary obstacle.

A network timeout is not the same as a failed login.

A 502 is not the same as a permission denial.

A verification challenge is not something to brute-force with more clicks.

Retry policy is boring.

That is why it is useful.

It turns panic behavior into predictable behavior.
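
Here is one way the policy above might be evaluated. It is a sketch with two assumptions that are mine, not part of the policy itself: `max_attempts` counts total attempts including the first one, and an error kind that appears in neither list is not retried.

```python
RETRY_POLICY = {
    "max_attempts": 2,
    "retry_on": ["network_timeout", "temporary_5xx"],
    "do_not_retry_on": ["login_failed", "verification_required", "permission_denied"],
}


def should_retry(policy: dict, error_kind: str, attempts_made: int) -> bool:
    """Decide whether another attempt is allowed after a failure.

    attempts_made is the number of attempts already performed,
    including the one that just failed.
    """
    if error_kind in policy["do_not_retry_on"]:
        return False
    if error_kind not in policy["retry_on"]:
        return False  # unknown errors are treated as non-retryable
    return attempts_made < policy["max_attempts"]
```

With this policy, a network timeout on the first attempt allows one more try, and a second timeout ends the run.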


6. Human review rule

Human-in-the-loop is not a weakness.

For browser automation, it is often the safety layer.

{
  "human_review_required_for": [
    "credential_entry",
    "wallet_action",
    "payment",
    "account_recovery",
    "security_change"
  ]
}

This tells the agent:

You may inspect.
You may summarize.
You may prepare.
But you may not cross these lines without review.

That matters because browser agents operate in environments where some clicks have real consequences.

A review point should not depend on the model deciding whether something “feels risky.”

It should be defined before the run starts.
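
A review gate defined up front could look like the sketch below. `review_decision` and the `approved_by` parameter are hypothetical, and the sketch assumes each proposed action has already been mapped to one of the runbook's categories.

```python
from typing import Optional

REVIEW_REQUIRED = [
    "credential_entry",
    "wallet_action",
    "payment",
    "account_recovery",
    "security_change",
]


def review_decision(action_category: str,
                    review_required: list[str],
                    approved_by: Optional[str] = None) -> str:
    """Return "proceed" or "needs_review" for a proposed action.

    The decision depends only on the runbook list and on whether a
    named reviewer has approved, never on the model's own risk estimate.
    """
    if action_category not in review_required:
        return "proceed"
    if approved_by:
        return "proceed"
    return "needs_review"
```

Recording the reviewer's name in `approved_by` also gives the evidence trail a record of who signed off.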


7. Evidence requirements

Every run should leave enough evidence for review.

{
  "evidence": {
    "save_screenshot": true,
    "save_dom_snapshot": false,
    "save_console_log": true,
    "save_proxy_check": true,
    "save_final_summary": true
  }
}

Evidence does not need to be excessive.

But it should answer the basic questions:

Which account was used?
Which profile was loaded?
Which proxy was active?
What did the agent observe?
Where did it stop?
What error appeared?
What did it summarize?

For development teams, this feels similar to test artifacts.

A failed CI run without logs is frustrating.

A failed browser agent run without evidence is worse, because it may involve account state, browser state, proxy state, and model decisions at the same time.
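
The evidence flags become more useful when something checks them at the end of a run. A minimal sketch, assuming a naming convention of my own in which the flag `save_screenshot` corresponds to an artifact called `screenshot`; `missing_evidence` is a hypothetical helper.

```python
EVIDENCE = {
    "save_screenshot": True,
    "save_dom_snapshot": False,
    "save_console_log": True,
    "save_proxy_check": True,
    "save_final_summary": True,
}


def missing_evidence(evidence_config: dict, saved_artifacts: set) -> list:
    """List required artifacts that were not saved.

    A flag such as "save_screenshot" maps to the artifact name
    "screenshot". Flags set to False are ignored.
    """
    missing = []
    for flag, required in evidence_config.items():
        artifact = flag[len("save_"):] if flag.startswith("save_") else flag
        if required and artifact not in saved_artifacts:
            missing.append(artifact)
    return missing
```

A run that ends with a non-empty list has not met its evidence requirements, even if the task itself succeeded.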


8. Completion criteria

An agent should not decide that a task is done just because it reached a plausible stopping point.

Define what done means.

{
  "done_when": [
    "account_status_collected",
    "no_blocking_error_found",
    "summary_saved",
    "evidence_attached"
  ]
}

This makes completion verifiable.

For example, a status inspection is not complete until:

The account status was collected.
No blocking error was found.
The summary was saved.
Required evidence was attached.

Without completion criteria, an agent may produce a confident summary for a half-finished task.

That is one of the easiest ways to get a polished but unreliable result.
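
To keep the agent's own judgment out of the "done" decision, completion can be computed from recorded facts. The sketch below uses hypothetical helpers (`unmet_criteria`, `is_done`) and assumes the run records each criterion by name as it is satisfied.

```python
DONE_WHEN = [
    "account_status_collected",
    "no_blocking_error_found",
    "summary_saved",
    "evidence_attached",
]


def unmet_criteria(done_when: list[str], completed: set[str]) -> list[str]:
    """Return the completion criteria that have not been recorded yet."""
    return [criterion for criterion in done_when if criterion not in completed]


def is_done(done_when: list[str], completed: set[str]) -> bool:
    """A run is done only when every criterion has been recorded."""
    return not unmet_criteria(done_when, completed)
```

When `is_done` is false, the unmet list says exactly what the summary should not claim.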


A minimal browser agent runbook template

Here is a compact template you can adapt.

{
  "run_id": "run_2026_05_20_001",

  "account": {
    "account_id": "acct_us_042",
    "profile_id": "profile_us_042",
    "account_group": "us-social-review"
  },

  "environment": {
    "expected_country": "US",
    "timezone": "America/New_York",
    "locale": "en-US",
    "proxy_id": "proxy_us_07"
  },

  "task": {
    "task_type": "read-only-inspection",
    "allowed_actions": [
      "inspect",
      "summarize",
      "export_report"
    ],
    "blocked_actions": [
      "payment",
      "password_change",
      "security_settings"
    ]
  },

  "stop_if": [
    "verification_prompt",
    "unexpected_login_page",
    "payment_page",
    "proxy_region_mismatch",
    "repeated_failed_attempts"
  ],

  "retry_policy": {
    "max_attempts": 2,
    "retry_on": [
      "network_timeout",
      "temporary_5xx"
    ],
    "do_not_retry_on": [
      "login_failed",
      "verification_required",
      "permission_denied"
    ]
  },

  "human_review_required_for": [
    "credential_entry",
    "payment",
    "account_recovery",
    "security_change"
  ],

  "evidence": {
    "save_screenshot": true,
    "save_console_log": true,
    "save_proxy_check": true,
    "save_final_summary": true
  },

  "done_when": [
    "account_status_collected",
    "summary_saved",
    "evidence_attached"
  ]
}

The important part is not the exact schema.

The important part is that the agent is no longer operating in a vague environment.

It has a declared account, environment, task scope, stop logic, retry policy, review rule, evidence requirement, and completion definition.
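
Since the exact schema is flexible, a light structural check is usually enough to catch an incomplete runbook before a run starts. This sketch is one possible approach, not a required format: `REQUIRED_SECTIONS` and `validate_runbook` are hypothetical names, and the section list simply mirrors the template above.

```python
REQUIRED_SECTIONS = (
    "account",
    "environment",
    "task",
    "stop_if",
    "retry_policy",
    "human_review_required_for",
    "evidence",
    "done_when",
)


def validate_runbook(runbook: dict) -> list[str]:
    """Return a list of structural problems; an empty list means usable.

    Checks that every top-level section exists and that no action is
    listed as both allowed and blocked.
    """
    errors = [
        f"missing section: {name}"
        for name in REQUIRED_SECTIONS
        if name not in runbook
    ]
    task = runbook.get("task", {})
    overlap = set(task.get("allowed_actions", [])) & set(task.get("blocked_actions", []))
    for action in sorted(overlap):
        errors.append(f"action is both allowed and blocked: {action}")
    return errors
```

In practice the runbook would be loaded with `json.load` from a file kept next to the profile it describes, then passed through this check.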


How this changes the prompt

Without a runbook, the prompt often becomes overloaded:

Check this account and fix any issues. Be careful. Do not do anything risky. If something seems wrong, stop. Make sure to save useful information.

That sounds reasonable, but it is vague.

With a runbook, the prompt can be shorter:

Use the attached runbook.
Perform only read-only inspection.
Stop if verification, payment, login failure, or proxy mismatch appears.
Save evidence and summarize only what was observed.

Now the prompt is not carrying the entire operating model.

It is only invoking it.

This is easier to review, easier to reuse, and easier to debug.


Where Playwright, MCP, and browser-use fit

A runbook does not replace browser automation tools.

It gives them operating rules.

A simple way to think about the layers:

Playwright controls the browser.
MCP exposes browser capabilities.
The agent decides the next step.
The runbook defines what is allowed.

These layers solve different problems.

Playwright is good at deterministic browser control.

MCP or a tool layer can expose browser actions to an AI agent.

An agent framework can plan and adapt.

But none of those automatically defines account boundaries, retry rules, stop conditions, human review points, or evidence requirements.

That is what the runbook is for.

If your workflow depends on persistent login state, it is also worth understanding the difference between Playwright's storageState and a persistent context. The more your automation depends on long-lived account continuity, the more important the operating layer becomes.


When a simple script is still better

Not every workflow needs an AI browser agent.

Sometimes a script is better.

Use a normal Playwright or Puppeteer script when:

The page is public.
The task is deterministic.
There is no persistent account identity.
There is no sensitive state.
There is no human review step.
There are no high-risk actions.
The workflow is short-lived.
The expected result is easy to assert.

Examples:

Take screenshots of public pages.
Run a CI smoke test.
Check whether a landing page loads.
Submit a staging form.
Validate a basic UI flow.

In those cases, adding an AI agent may only make the system harder to reason about.

If the task is deterministic, low-risk, and short-lived, a script is usually better than an agent.


When a browser workspace becomes useful

A workspace layer becomes useful when the browser environment itself becomes part of the workflow.

That usually happens when you have:

  • multiple long-lived accounts
  • persistent browser profiles
  • proxy-region mapping
  • recurring account checks
  • MCP or reusable browser skills
  • human review
  • execution logs
  • team handoff
  • headless and headed modes used together

At that point, the problem is no longer only browser control.

The problem is coordination.

You need to keep the runbook close to the real operating environment:

Account
Profile
Proxy
Task
Permission
Review
Evidence

For teams moving from single scripts to repeatable account-aware browser workflows, an account-aware browser workspace can make runbooks easier to keep close to profiles, proxies, tasks, logs, and review steps.

The workspace layer does not replace Playwright.

It gives Playwright and AI agents a more reliable place to operate.


A practical pre-run checklist

Before the agent starts, ask:

[ ] Is the correct account selected?
[ ] Is the correct browser profile loaded?
[ ] Does the proxy match the expected region?
[ ] Do timezone and locale match the account assumptions?
[ ] Is the task scope read-only or action-taking?
[ ] Are blocked actions clearly defined?
[ ] Are stop conditions defined?
[ ] Is the retry policy safe?
[ ] Are human review points defined?
[ ] Will screenshots, logs, or summaries be saved?
[ ] Is done clearly defined?

This checklist is simple.

That is the point.

A browser agent should not need to guess the operating model every time it runs.
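
The checklist can also be run as code instead of being read by a person. The sketch below is a hypothetical aggregator: `run_preflight` takes named checks as zero-argument callables, and what each check actually does (query the proxy, read the profile, inspect the runbook) is left to your own tooling.

```python
from typing import Callable


def run_preflight(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every named check and return the names of those that failed.

    A check that raises an exception is treated as failed, so a broken
    check can never be mistaken for a passing one.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Example wiring with placeholder checks; real ones would query the
# browser profile, proxy, and runbook.
example_checks = {
    "correct_account_selected": lambda: True,
    "proxy_matches_region": lambda: False,
    "stop_conditions_defined": lambda: True,
}
```

The agent starts only when the returned list is empty. Otherwise the failed names go straight into the run's evidence.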


Final thought

Better prompts can help an AI browser agent follow instructions.

But prompts alone do not create reliable operations.

For logged-in browser workflows, the missing layer is often a runbook:

Account context
Environment assumptions
Task scope
Stop conditions
Retry policy
Human review
Evidence
Completion criteria

The future of AI browser automation is not just agents that can click.

It is agents that understand the rules of the environment they are operating in.
