What if you could test your web app by just describing what to test in plain English? No Selenium scripts, no Cypress configs — just tell an AI agent "test the signup flow" and watch it navigate, click, type, and verify results in a real browser.
That's exactly what we're building in this article: a QA testing agent that combines Agent Browser for headless browser control, the Vercel AI SDK for tool-calling orchestration, and LLM Gateway as the unified LLM provider.
The full source code is available at github.com/theopenco/llmgateway-templates/templates/qa-agent.
How It Works
The architecture is straightforward:
- User describes a test in natural language (e.g., "Navigate to the login page, enter invalid credentials, and verify an error is shown")
- Vercel AI SDK sends the prompt to an LLM via LLM Gateway with browser tools attached
- The LLM decides which browser actions to take — navigate, snapshot, click, type, etc.
- Agent Browser executes those actions on a real headless Chromium instance
- Results stream back in real time as NDJSON — each step, each screenshot, and a final test summary
The LLM acts as the brain, Agent Browser provides the hands, and LLM Gateway lets you swap between models (Claude, GPT-4o, Gemini) with a single string change.
The Tech Stack
- Next.js 16 (App Router) — framework
- Vercel AI SDK v6 — generateText with tool calling and stepCountIs for limiting agent loops
- @llmgateway/ai-sdk-provider — LLM Gateway's native AI SDK provider
- agent-browser — headless browser automation with accessibility snapshots
- Zod — tool input schema validation
Setting Up
Install Dependencies
npm install @llmgateway/ai-sdk-provider ai agent-browser zod next react
Environment
# .env.local
LLMGATEWAY_API_KEY=your_api_key_here
Get your API key from llmgateway.io.
Building the Browser Tools
The key insight is that Agent Browser provides accessibility snapshots — a text-based tree of the page with element refs like @e1, @e3. The LLM reads these snapshots to understand the page, then uses refs to click and type. No CSS selectors, no XPaths — just semantic understanding.
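To make this concrete, here is what a snapshot of a simple login page might look like. This excerpt is illustrative — the exact tree format depends on the agent-browser version:

```
- heading "Sign in" [ref=e1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- button "Log in" [ref=e4]
- link "Forgot password?" [ref=e5]
```

From a tree like this, the agent can type into @e2 and @e3 and click @e4 without ever seeing a pixel or a CSS selector.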
Here's how we define the browser tools using the AI SDK's tool() helper:
import { tool } from "ai";
import { BrowserManager } from "agent-browser/dist/browser.js";
import { z } from "zod";
function createBrowserTools(browser: BrowserManager) {
return {
browser_navigate: tool({
description: "Navigate the browser to a URL",
inputSchema: z.object({
url: z.string().describe("The URL to navigate to"),
}),
execute: async ({ url }) => {
const page = browser.getPage();
await page.goto(url, { waitUntil: "domcontentloaded" });
return { url, title: await page.title() };
},
}),
browser_snapshot: tool({
description:
"Get an accessibility snapshot of the current page. Returns a text tree with element refs (e.g. [ref=e1]) that you can use with browser_click and browser_type.",
inputSchema: z.object({}),
execute: async () => {
const snapshot = await browser.getSnapshot({ interactive: true });
const tree =
typeof snapshot.tree === "string"
? snapshot.tree
: JSON.stringify(snapshot.tree);
// Truncate to avoid blowing up context
const maxChars = 30_000;
if (tree.length > maxChars) {
return { snapshot: tree.slice(0, maxChars) + "\n... (truncated)" };
}
return { snapshot: tree };
},
}),
browser_click: tool({
description: "Click an element using its ref from a snapshot (e.g. @e1)",
inputSchema: z.object({
ref: z.string().describe("The ref of the element to click"),
}),
execute: async ({ ref }) => {
const locator = browser.getLocator(ref);
await locator.click();
return { clicked: ref };
},
}),
browser_type: tool({
description: "Type text into an input field using its ref",
inputSchema: z.object({
ref: z.string().describe("The ref of the input element"),
text: z.string().describe("The text to type"),
clear: z.boolean().optional().describe("Clear first (default: true)"),
}),
execute: async ({ ref, text, clear = true }) => {
const locator = browser.getLocator(ref);
if (clear) {
await locator.fill(text);
} else {
await locator.pressSequentially(text);
}
return { typed: text, into: ref };
},
}),
browser_press_key: tool({
description: "Press a keyboard key (e.g. Enter, Tab, Escape)",
inputSchema: z.object({
key: z.string().describe("The key to press"),
}),
execute: async ({ key }) => {
const page = browser.getPage();
await page.keyboard.press(key);
return { pressed: key };
},
}),
browser_scroll: tool({
description: "Scroll the page in a direction",
inputSchema: z.object({
direction: z.enum(["up", "down", "left", "right"]),
amount: z.number().optional().describe("Pixels (default: 500)"),
}),
execute: async ({ direction, amount = 500 }) => {
const page = browser.getPage();
const deltaX =
direction === "left" ? -amount : direction === "right" ? amount : 0;
const deltaY =
direction === "up" ? -amount : direction === "down" ? amount : 0;
await page.mouse.wheel(deltaX, deltaY);
return { scrolled: direction, amount };
},
}),
browser_hover: tool({
description: "Hover over an element using its ref",
inputSchema: z.object({
ref: z.string().describe("The ref of the element to hover"),
}),
execute: async ({ ref }) => {
const locator = browser.getLocator(ref);
await locator.hover();
return { hovered: ref };
},
}),
browser_go_back: tool({
description: "Go back to the previous page",
inputSchema: z.object({}),
execute: async () => {
const page = browser.getPage();
await page.goBack();
return { url: page.url() };
},
}),
};
}
Each tool has a clear description (the LLM reads this to decide when to use it), a Zod inputSchema (for validated structured input), and an execute function (the actual browser action).
The API Route: Orchestrating Everything
The core is a Next.js API route that launches the browser, wires up tools, and streams results back:
import { createLLMGateway } from "@llmgateway/ai-sdk-provider";
import { generateText, stepCountIs } from "ai";
import { BrowserManager } from "agent-browser/dist/browser.js";
export const maxDuration = 120;
export async function POST(request: Request) {
const { instruction, model, targetUrl } = await request.json();
const llmgateway = createLLMGateway({
apiKey: process.env.LLMGATEWAY_API_KEY,
});
const browser = new BrowserManager();
const encoder = new TextEncoder();
let stepCount = 0;
const stream = new ReadableStream({
async start(controller) {
const emit = (data: Record<string, unknown>) =>
controller.enqueue(encoder.encode(JSON.stringify(data) + "\n"));
try {
emit({ type: "status", message: "Launching headless browser..." });
await browser.launch({
id: "qa",
action: "launch",
headless: true,
});
// Stream live screenshots to the frontend
await browser.startScreencast(
(frame) => {
emit({ type: "screenshot", imageData: frame.data });
},
{
format: "jpeg",
quality: 50,
maxWidth: 1280,
maxHeight: 720,
everyNthFrame: 2,
}
);
emit({ type: "status", message: "Browser ready. Running test..." });
const tools = createBrowserTools(browser);
const result = await generateText({
model: llmgateway(model || "anthropic/claude-sonnet-4-5"),
tools,
stopWhen: stepCountIs(25),
system: `You are a QA testing agent. Your task is to test a web application by interacting with it through a browser.
INSTRUCTIONS:
1. First, navigate to: ${targetUrl}
2. Use browser_snapshot to read the current page state before interacting
3. Execute the test described by the user step by step
4. Use browser_click to click elements (use the ref attribute from snapshots, e.g. @e1)
5. Use browser_type to type text into input fields
6. After completing the test, provide a clear summary of results — what passed, what failed, and why
Be methodical: always snapshot the page before acting so you know what elements are available.`,
prompt: instruction,
onStepFinish({ toolCalls, text }) {
if (toolCalls?.length) {
for (const call of toolCalls) {
stepCount++;
emit({
type: "action",
step: stepCount,
tool: call.toolName,
args: call.input,
status: "done",
});
}
}
if (text) {
emit({ type: "text", content: text });
}
},
});
emit({ type: "result", summary: result.text });
} catch (err) {
const message =
err instanceof Error ? err.message : String(err);
emit({ type: "error", message });
} finally {
await browser.stopScreencast();
await browser.close();
controller.close();
}
},
});
return new Response(stream, {
headers: {
"Content-Type": "application/x-ndjson",
"Transfer-Encoding": "chunked",
},
});
}
Key Details
stopWhen: stepCountIs(25) — This is a safety guardrail from the AI SDK. It prevents the agent from running indefinitely by capping it at 25 tool-calling steps.
onStepFinish — This callback fires after each agent step, letting us stream actions to the frontend in real time. Users see each click and each navigation as it happens.
Live screencast — Agent Browser streams JPEG frames of the browser viewport via startScreencast. These are forwarded to the frontend as NDJSON events, giving users a live preview of what the agent sees.
NDJSON streaming — Each event is a newline-delimited JSON object. The frontend reads them incrementally to build a real-time action timeline.
Switching Models
Because we're using LLM Gateway, switching the underlying LLM is trivial:
// Anthropic Claude
model: llmgateway("anthropic/claude-sonnet-4-5")
// OpenAI GPT-4o
model: llmgateway("openai/gpt-4o")
// Google Gemini
model: llmgateway("google/gemini-2.5-pro")
Same code, same tools, same agent — different brain. This is great for comparing which model performs best at QA tasks for your specific app.
The Response Stream
The frontend receives NDJSON events like:
{"type":"status","message":"Launching headless browser..."}
{"type":"status","message":"Browser ready. Running test..."}
{"type":"action","step":1,"tool":"browser_navigate","args":{"url":"http://localhost:3000"},"status":"done"}
{"type":"action","step":2,"tool":"browser_snapshot","args":{},"status":"done"}
{"type":"action","step":3,"tool":"browser_click","args":{"ref":"@e5"},"status":"done"}
{"type":"screenshot","imageData":"base64..."}
{"type":"text","content":"I can see the signup form with email and password fields."}
{"type":"result","summary":"Test passed: signup flow works correctly."}
You can render this as a step-by-step timeline — each action shows what tool was called, what arguments were used, and what happened.
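On the client, the main subtlety is that network chunks can split a JSON line in two, so you need to buffer the trailing partial line between reads. Here is a minimal sketch; the event shape matches the examples above, while the `/api/test` path, request body, and `runTest`/`parseNDJSONChunk` names are illustrative assumptions about your frontend:

```typescript
// Shape of one NDJSON event from the agent stream.
type AgentEvent = { type: string; [key: string]: unknown };

// Incremental NDJSON parsing: split on newlines and carry the (possibly
// incomplete) final line over to the next chunk.
function parseNDJSONChunk(
  buffer: string,
  chunk: string,
): { events: AgentEvent[]; rest: string } {
  const lines = (buffer + chunk).split("\n");
  const rest = lines.pop() ?? ""; // may be a partial JSON line
  const events = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as AgentEvent);
  return { events, rest };
}

// Hypothetical wiring: POST the instruction, then read the response body
// chunk by chunk and surface each parsed event via a callback.
async function runTest(
  instruction: string,
  onEvent: (e: AgentEvent) => void,
): Promise<void> {
  const res = await fetch("/api/test", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ instruction }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let rest = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const parsed = parseNDJSONChunk(rest, decoder.decode(value, { stream: true }));
    rest = parsed.rest;
    parsed.events.forEach(onEvent);
  }
}
```

Each event handed to `onEvent` can then append a row to the timeline UI, or swap the live screenshot when `type` is "screenshot".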
Example Test Instructions
Here are some test prompts that work well:
- "Test the signup flow and verify a confirmation message appears"
- "Navigate to the login page, enter invalid credentials, and verify an error is shown"
- "Add an item to the cart and verify the cart count updates"
- "Go to the settings page, change the display name, save, and verify it persists after a page refresh"
- "Test keyboard navigation on the main form — tab through all fields and submit with Enter"
The agent handles all the details: finding the right elements, filling in forms, waiting for page transitions, and verifying outcomes.
Why This Architecture Works
Accessibility snapshots > screenshots for tool calling. Instead of sending expensive screenshots to the LLM and hoping it understands pixel coordinates, Agent Browser provides a semantic text tree. The LLM reads element labels, roles, and refs — much cheaper and more reliable.
Streaming > polling. NDJSON gives you real-time visibility into every agent step. No waiting for the entire test to finish before seeing results.
Provider-agnostic. LLM Gateway means you're not locked into one provider. Claude is great at following complex multi-step instructions, but GPT-4o might be faster for simple tests. Try both without changing code.
Guardrails built in. stepCountIs(25) prevents runaway agents. The 120-second maxDuration on the API route adds a hard timeout. Both are essential for production use.
Running the Template
git clone https://github.com/theopenco/llmgateway-templates.git
cd llmgateway-templates/templates/qa-agent
pnpm install
cp .env.example .env.local
# Add your LLMGATEWAY_API_KEY to .env.local
pnpm dev
Open http://localhost:3001, enter your app's URL, describe a test, and hit Run.
What's Next
This template is a starting point. You could extend it with:
- Test suites — run multiple test instructions sequentially and aggregate results
- Visual regression — capture screenshots at key points and compare against baselines
- CI integration — run QA agents as part of your GitHub Actions pipeline
- Custom assertions — add tools that check specific DOM states or API responses
The combination of AI tool calling + browser automation opens up a lot of possibilities beyond traditional test frameworks.
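As one example, the test-suite idea above can be sketched as a sequential runner. Everything here is an illustrative assumption, not part of the template: `runSuite`, the `TestResult` shape, the keyword-based pass/fail heuristic, and the injected `runTest` function, which stands in for a call to the API route from this article:

```typescript
// Hypothetical result record for one test instruction.
type TestResult = { instruction: string; passed: boolean; summary: string };

// Run instructions one at a time (the agent holds a single browser session,
// so parallel runs would interfere) and aggregate pass/fail per instruction.
async function runSuite(
  instructions: string[],
  runTest: (instruction: string) => Promise<string>,
): Promise<TestResult[]> {
  const results: TestResult[] = [];
  for (const instruction of instructions) {
    const summary = await runTest(instruction);
    // Crude heuristic on the agent's free-text summary; a real setup would
    // add explicit assertion tools instead of keyword matching.
    results.push({
      instruction,
      passed: /passed/i.test(summary) && !/failed/i.test(summary),
      summary,
    });
  }
  return results;
}
```

A CI job could then fail the build when any `TestResult` has `passed: false`.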