DEV Community

smakosh

Build an AI-Powered QA Agent with Agent Browser, Vercel AI SDK, and LLM Gateway

What if you could test your web app by just describing what to test in plain English? No Selenium scripts, no Cypress configs — just tell an AI agent "test the signup flow" and watch it navigate, click, type, and verify results in a real browser.

That's exactly what we're building in this article: a QA testing agent that combines Agent Browser for headless browser control, the Vercel AI SDK for tool-calling orchestration, and LLM Gateway as the unified LLM provider.

The full source code is available in the llmgateway-templates repo at github.com/theopenco/llmgateway-templates, under templates/qa-agent.

How It Works

The architecture is straightforward:

  1. User describes a test in natural language (e.g., "Navigate to the login page, enter invalid credentials, and verify an error is shown")
  2. Vercel AI SDK sends the prompt to an LLM via LLM Gateway with browser tools attached
  3. The LLM decides which browser actions to take — navigate, snapshot, click, type, etc.
  4. Agent Browser executes those actions on a real headless Chromium instance
  5. Results stream back in real-time as NDJSON — each step, each screenshot, and a final test summary

The LLM acts as the brain, Agent Browser provides the hands, and LLM Gateway lets you swap between models (Claude, GPT-4o, Gemini) with a single string change.

The Tech Stack

  • Next.js 16 (App Router) — framework
  • Vercel AI SDK v6 — generateText with tool calling and stepCountIs for limiting agent loops
  • @llmgateway/ai-sdk-provider — LLM Gateway's native AI SDK provider
  • agent-browser — headless browser automation with accessibility snapshots
  • Zod — tool input schema validation

Setting Up

Install Dependencies

npm install @llmgateway/ai-sdk-provider ai agent-browser zod next react

Environment

# .env.local
LLMGATEWAY_API_KEY=your_api_key_here

Get your API key from llmgateway.io.

Building the Browser Tools

The key insight is that Agent Browser provides accessibility snapshots — a text-based tree of the page with element refs like @e1, @e3. The LLM reads these snapshots to understand the page, then uses refs to click and type. No CSS selectors, no XPaths — just semantic understanding.
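To make this concrete, a snapshot of a simple login page might look something like the tree below. This is a hypothetical illustration — the exact output format depends on your agent-browser version — but it conveys the idea: labeled elements with refs, no pixels.

```
- heading "Sign in" [ref=e1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- button "Log in" [ref=e4]
- link "Forgot password?" [ref=e5]
```

The LLM reads this tree, then calls browser_type with ref @e2 to fill the email field, and browser_click with @e4 to submit.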

Here's how we define the browser tools using the AI SDK's tool() helper:

import { tool } from "ai";
import { BrowserManager } from "agent-browser/dist/browser.js";
import { z } from "zod";

function createBrowserTools(browser: BrowserManager) {
  return {
    browser_navigate: tool({
      description: "Navigate the browser to a URL",
      inputSchema: z.object({
        url: z.string().describe("The URL to navigate to"),
      }),
      execute: async ({ url }) => {
        const page = browser.getPage();
        await page.goto(url, { waitUntil: "domcontentloaded" });
        return { url, title: await page.title() };
      },
    }),

    browser_snapshot: tool({
      description:
        "Get an accessibility snapshot of the current page. Returns a text tree with element refs (e.g. [ref=e1]) that you can use with browser_click and browser_type.",
      inputSchema: z.object({}),
      execute: async () => {
        const snapshot = await browser.getSnapshot({ interactive: true });
        const tree =
          typeof snapshot.tree === "string"
            ? snapshot.tree
            : JSON.stringify(snapshot.tree);
        // Truncate to avoid blowing up context
        const maxChars = 30_000;
        if (tree.length > maxChars) {
          return { snapshot: tree.slice(0, maxChars) + "\n... (truncated)" };
        }
        return { snapshot: tree };
      },
    }),

    browser_click: tool({
      description: "Click an element using its ref from a snapshot (e.g. @e1)",
      inputSchema: z.object({
        ref: z.string().describe("The ref of the element to click"),
      }),
      execute: async ({ ref }) => {
        const locator = browser.getLocator(ref);
        await locator.click();
        return { clicked: ref };
      },
    }),

    browser_type: tool({
      description: "Type text into an input field using its ref",
      inputSchema: z.object({
        ref: z.string().describe("The ref of the input element"),
        text: z.string().describe("The text to type"),
        clear: z.boolean().optional().describe("Clear first (default: true)"),
      }),
      execute: async ({ ref, text, clear = true }) => {
        const locator = browser.getLocator(ref);
        if (clear) {
          await locator.fill(text);
        } else {
          await locator.pressSequentially(text);
        }
        return { typed: text, into: ref };
      },
    }),

    browser_press_key: tool({
      description: "Press a keyboard key (e.g. Enter, Tab, Escape)",
      inputSchema: z.object({
        key: z.string().describe("The key to press"),
      }),
      execute: async ({ key }) => {
        const page = browser.getPage();
        await page.keyboard.press(key);
        return { pressed: key };
      },
    }),

    browser_scroll: tool({
      description: "Scroll the page in a direction",
      inputSchema: z.object({
        direction: z.enum(["up", "down", "left", "right"]),
        amount: z.number().optional().describe("Pixels (default: 500)"),
      }),
      execute: async ({ direction, amount = 500 }) => {
        const page = browser.getPage();
        const deltaX =
          direction === "left" ? -amount : direction === "right" ? amount : 0;
        const deltaY =
          direction === "up" ? -amount : direction === "down" ? amount : 0;
        await page.mouse.wheel(deltaX, deltaY);
        return { scrolled: direction, amount };
      },
    }),

    browser_hover: tool({
      description: "Hover over an element using its ref",
      inputSchema: z.object({
        ref: z.string().describe("The ref of the element to hover"),
      }),
      execute: async ({ ref }) => {
        const locator = browser.getLocator(ref);
        await locator.hover();
        return { hovered: ref };
      },
    }),

    browser_go_back: tool({
      description: "Go back to the previous page",
      inputSchema: z.object({}),
      execute: async () => {
        const page = browser.getPage();
        await page.goBack();
        return { url: page.url() };
      },
    }),
  };
}

Each tool has a clear description (the LLM reads this to decide when to use it), a Zod inputSchema (for validated structured input), and an execute function (the actual browser action).

The API Route: Orchestrating Everything

The core is a Next.js API route that launches the browser, wires up tools, and streams results back:

import { createLLMGateway } from "@llmgateway/ai-sdk-provider";
import { generateText, stepCountIs } from "ai";
import { BrowserManager } from "agent-browser/dist/browser.js";

export const maxDuration = 120;

export async function POST(request: Request) {
  const { instruction, model, targetUrl } = await request.json();

  const llmgateway = createLLMGateway({
    apiKey: process.env.LLMGATEWAY_API_KEY,
  });

  const browser = new BrowserManager();
  const encoder = new TextEncoder();
  let stepCount = 0;

  const stream = new ReadableStream({
    async start(controller) {
      const emit = (data: Record<string, unknown>) =>
        controller.enqueue(encoder.encode(JSON.stringify(data) + "\n"));

      try {
        emit({ type: "status", message: "Launching headless browser..." });

        await browser.launch({
          id: "qa",
          action: "launch",
          headless: true,
        });

        // Stream live screenshots to the frontend
        await browser.startScreencast(
          (frame) => {
            emit({ type: "screenshot", imageData: frame.data });
          },
          {
            format: "jpeg",
            quality: 50,
            maxWidth: 1280,
            maxHeight: 720,
            everyNthFrame: 2,
          }
        );

        emit({ type: "status", message: "Browser ready. Running test..." });

        const tools = createBrowserTools(browser);

        const result = await generateText({
          model: llmgateway(model || "anthropic/claude-sonnet-4-5"),
          tools,
          stopWhen: stepCountIs(25),
          system: `You are a QA testing agent. Your task is to test a web application by interacting with it through a browser.

INSTRUCTIONS:
1. First, navigate to: ${targetUrl}
2. Use browser_snapshot to read the current page state before interacting
3. Execute the test described by the user step by step
4. Use browser_click to click elements (use the ref attribute from snapshots, e.g. @e1)
5. Use browser_type to type text into input fields
6. After completing the test, provide a clear summary of results — what passed, what failed, and why

Be methodical: always snapshot the page before acting so you know what elements are available.`,
          prompt: instruction,
          onStepFinish({ toolCalls, text }) {
            if (toolCalls?.length) {
              for (const call of toolCalls) {
                stepCount++;
                emit({
                  type: "action",
                  step: stepCount,
                  tool: call.toolName,
                  args: call.input,
                  status: "done",
                });
              }
            }
            if (text) {
              emit({ type: "text", content: text });
            }
          },
        });

        emit({ type: "result", summary: result.text });
      } catch (err) {
        const message =
          err instanceof Error ? err.message : String(err);
        emit({ type: "error", message });
      } finally {
        await browser.stopScreencast();
        await browser.close();
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "application/x-ndjson",
      "Transfer-Encoding": "chunked",
    },
  });
}

Key Details

stopWhen: stepCountIs(25) — This is a safety guardrail from the AI SDK. It prevents the agent from running indefinitely by capping it at 25 tool-calling steps.

onStepFinish — This callback fires after each agent step, letting us stream actions to the frontend in real-time. Users see each click, each navigation as it happens.

Live screencast — Agent Browser streams JPEG frames of the browser viewport via startScreencast. These are forwarded to the frontend as NDJSON events, giving users a live preview of what the agent sees.
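On the frontend, a screenshot event's base64 payload can be displayed without any decoding step. A minimal sketch — the imageData field name matches the emit call above; wrapping it in a data URL is one rendering choice, not something the template mandates:

```typescript
// Turn a screencast frame's base64 JPEG payload into a data URL that can be
// assigned directly to an <img> element's src attribute.
function frameToDataUrl(base64Jpeg: string): string {
  return `data:image/jpeg;base64,${base64Jpeg}`;
}
```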

NDJSON streaming — Each event is a newline-delimited JSON object. The frontend reads them incrementally to build a real-time action timeline.
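Splitting the byte stream into complete JSON events takes a little buffering, since a network chunk can end mid-line. Here's a minimal parser sketch — the QaEvent shape mirrors the events this route emits, but the function itself is ours, not part of the template:

```typescript
type QaEvent = { type: string; [key: string]: unknown };

// Parse as many complete NDJSON lines as possible from the buffer, returning
// the parsed events plus any trailing partial line to carry into the next chunk.
function parseNdjson(buffer: string): { events: QaEvent[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element may be an incomplete line
  const events = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as QaEvent);
  return { events, rest };
}
```

On each chunk from the fetch reader, append the decoded text to a buffer string, call parseNdjson, dispatch the returned events, and keep rest as the new buffer.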

Switching Models

Because we're using LLM Gateway, switching the underlying LLM is trivial:

// Anthropic Claude
model: llmgateway("anthropic/claude-sonnet-4-5")

// OpenAI GPT-4o
model: llmgateway("openai/gpt-4o")

// Google Gemini
model: llmgateway("google/gemini-2.5-pro")

Same code, same tools, same agent — different brain. This is great for comparing which model performs best at QA tasks for your specific app.

The Response Stream

The frontend receives NDJSON events like:

{"type":"status","message":"Launching headless browser..."}
{"type":"status","message":"Browser ready. Running test..."}
{"type":"action","step":1,"tool":"browser_navigate","args":{"url":"http://localhost:3000"},"status":"done"}
{"type":"action","step":2,"tool":"browser_snapshot","args":{},"status":"done"}
{"type":"action","step":3,"tool":"browser_click","args":{"ref":"@e5"},"status":"done"}
{"type":"screenshot","imageData":"base64..."}
{"type":"text","content":"I can see the signup form with email and password fields."}
{"type":"result","summary":"Test passed: signup flow works correctly."}

You can render this as a step-by-step timeline — each action shows what tool was called, what arguments were used, and what happened.
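For instance, a tiny formatter for action events — the event fields match the stream above, while the output format itself is an arbitrary choice:

```typescript
type ActionEvent = { step: number; tool: string; args: Record<string, unknown> };

// Render one action event as a single timeline line,
// e.g. `#3 browser_click {"ref":"@e5"}`.
function formatAction(event: ActionEvent): string {
  return `#${event.step} ${event.tool} ${JSON.stringify(event.args)}`;
}
```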

Example Test Instructions

Here are some test prompts that work well:

  • "Test the signup flow and verify a confirmation message appears"
  • "Navigate to the login page, enter invalid credentials, and verify an error is shown"
  • "Add an item to the cart and verify the cart count updates"
  • "Go to the settings page, change the display name, save, and verify it persists after a page refresh"
  • "Test keyboard navigation on the main form — tab through all fields and submit with Enter"

The agent handles all the details: finding the right elements, filling in forms, waiting for page transitions, and verifying outcomes.

Why This Architecture Works

Accessibility snapshots > screenshots for tool calling. Instead of sending expensive screenshots to the LLM and hoping it understands pixel coordinates, Agent Browser provides a semantic text tree. The LLM reads element labels, roles, and refs — much cheaper and more reliable.

Streaming > polling. NDJSON gives you real-time visibility into every agent step. No waiting for the entire test to finish before seeing results.

Provider-agnostic. LLM Gateway means you're not locked into one provider. Claude is great at following complex multi-step instructions, but GPT-4o might be faster for simple tests. Try both without changing code.

Guardrails built in. stepCountIs(25) prevents runaway agents. The 120-second maxDuration on the API route adds a hard timeout. Both are essential for production use.
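If you want a belt-and-braces cutoff inside the handler as well, the Web-standard AbortSignal.timeout can be passed to generateText via its abortSignal option — a sketch, assuming Node 18+ and the AI SDK's standard abortSignal support:

```typescript
// A Web-standard AbortSignal that fires after 120 seconds (Node 18+).
// Passed as `abortSignal` to generateText, it cancels the agent loop and
// rejects the call with an AbortError when the deadline is reached.
const timeoutSignal = AbortSignal.timeout(120_000);
```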

Running the Template

git clone https://github.com/theopenco/llmgateway-templates.git
cd llmgateway-templates/templates/qa-agent
pnpm install
cp .env.example .env.local
# Add your LLMGATEWAY_API_KEY to .env.local
pnpm dev

Open http://localhost:3001, enter your app's URL, describe a test, and hit Run.

What's Next

This template is a starting point. You could extend it with:

  • Test suites — run multiple test instructions sequentially and aggregate results
  • Visual regression — capture screenshots at key points and compare against baselines
  • CI integration — run QA agents as part of your GitHub Actions pipeline
  • Custom assertions — add tools that check specific DOM states or API responses

The combination of AI tool calling + browser automation opens up a lot of possibilities beyond traditional test frameworks.
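As a taste of the custom-assertions idea: the check itself can be a small pure helper that an AI SDK tool() then wraps. A sketch — assertContains and its result shape are ours, not part of the template; inside a tool's execute you would feed it something like the result of `await browser.getPage().textContent("body")`:

```typescript
// Hypothetical assertion helper: checks whether the page body text contains
// an expected string and returns a structured pass/fail result the LLM can
// fold into its final test summary.
function assertContains(bodyText: string | null, expected: string) {
  const passed = (bodyText ?? "").includes(expected);
  return { assertion: `page contains "${expected}"`, passed };
}
```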

