DEV Community

Phạm Thanh Hằng

Building Verifai: How We Used 3 Gemini Models to Create an AI QA Agent That Finds Real Bugs

An inside look at building an autonomous QA testing agent with Gemini Computer Use, multi-model architecture, and Google Cloud — for the Gemini Live Agent Challenge.


QA engineers spend hours clicking through the same flows after every sprint. They write the same Jira tickets. They attach the same screenshots. They catch the obvious bugs, but the subtle ones — the ones that only show up with specific user accounts or edge-case data — slip through.

We built Verifai to change that. It's an AI agent that reads your Jira tickets, opens a real browser, tests your application the way a human QA engineer would, and files Jira tickets for the bugs it finds — complete with screenshots and reproduction steps.

This post walks through exactly how we built it using Google's AI models and cloud services, the architectural decisions that made it work, and the mistakes we made along the way.


The Core Idea: An Agent That Sees Before It Acts

Most browser automation tools follow a script. Playwright runs a sequence of commands. Selenium clicks selectors. If the page layout changes or an element moves, the test breaks.

Verifai doesn't run scripts. For every single action it takes, it follows this loop:

  1. Screenshot the current browser state
  2. Send the screenshot to Gemini with the Computer Use tool
  3. Gemini analyzes what's on screen and decides what to do next
  4. Execute that one action in the browser
  5. Screenshot again and verify whether the expected outcome happened
  6. Repeat until the test step passes, fails, or can't be completed

The AI sees the live page before every decision. If a button moves, the login form looks different, or an unexpected popup appears, Gemini adapts. This is fundamentally different from running a pre-written test script — it's a Computer Use agent that happens to do QA.
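The loop above can be sketched as a small state machine. In this sketch every helper is a stub standing in for a Playwright or Gemini call, so the control flow runs without a browser; none of these names are Verifai's real API.

```typescript
// Minimal sketch of the observe -> decide -> act -> verify loop.
type StepResult = "passed" | "failed" | "incomplete";

interface BrowserState { typed: string }

// Stub for steps 1-3: screenshot + Gemini Computer Use decision.
function decideNextAction(state: BrowserState): { kind: "type"; text: string } | { kind: "done" } {
  return state.typed === "" ? { kind: "type", text: "standard_user" } : { kind: "done" };
}

// Stub for step 4: Playwright executes exactly one action.
function executeAction(state: BrowserState, action: { kind: "type"; text: string }): BrowserState {
  return { ...state, typed: action.text };
}

// Stub for steps 5-6: fresh screenshot + verification verdict.
function verifyOutcome(state: BrowserState): boolean {
  return state.typed === "standard_user";
}

function runStep(maxIterations = 10): StepResult {
  let state: BrowserState = { typed: "" };
  for (let i = 0; i < maxIterations; i++) {
    const action = decideNextAction(state);
    if (action.kind === "done") {
      return verifyOutcome(state) ? "passed" : "failed";
    }
    state = executeAction(state, action);
  }
  return "incomplete"; // budget exhausted: the step can't be completed
}
```

The iteration budget is what turns "the agent is stuck" into a clean terminal state instead of an infinite loop.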


Three Gemini Models, Three Distinct Jobs

One of our most important design decisions was splitting work across three specialized Gemini models instead of using one model for everything. Each model was chosen for its specific capability:

Gemini 3 Flash — The Agent's Eyes and Hands

This model powers the core agentic loop. Using the native Computer Use tool, Gemini 3 Flash looks at a screenshot and returns a structured action: "click at pixel coordinates (640, 350)" or "type 'standard_user' into the field at (400, 280)."

The Computer Use tool is critical because it returns coordinate-based actions through a proper tool-calling protocol. The model isn't generating JSON text that we parse and hope is valid — it's using a structured tool interface that returns typed actions. This matters enormously for reliability.

Here's what a single action decision looks like in the code:

import { GoogleGenAI, Environment } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [{
    role: "user",
    parts: [
      { inlineData: { mimeType: "image/jpeg", data: screenshotBase64 } },
      { text: `You are a QA browser agent looking at a live screenshot.
               Task: ${step.text}
               Expected: ${step.expectedBehavior}
               Decide the NEXT single action.` }
    ],
  }],
  config: {
    tools: [{
      computerUse: {
        environment: Environment.ENVIRONMENT_BROWSER,
      },
    }],
  },
});

The model sees the actual rendered page — not the DOM, not the HTML source — the pixels on screen. It decides where to click based on what it sees, just like a human tester would.
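On the other side of that tool call, the returned action has to be replayed through the browser layer. Here is a sketch of that dispatch; the action names (`click_at`, `type_text_at`) follow Google's Computer Use documentation, and `PageLike` is a stand-in for Playwright's `Page` so the dispatcher can be exercised without a real browser.

```typescript
// Replay a Computer Use tool action through a browser-like interface.
interface PageLike {
  click(x: number, y: number): void;
  type(text: string): void;
}

interface ComputerUseAction {
  name: string;
  args: { x?: number; y?: number; text?: string };
}

function dispatch(page: PageLike, action: ComputerUseAction): void {
  switch (action.name) {
    case "click_at":
      page.click(action.args.x ?? 0, action.args.y ?? 0);
      break;
    case "type_text_at":
      page.click(action.args.x ?? 0, action.args.y ?? 0); // focus the field first
      page.type(action.args.text ?? "");
      break;
    default:
      // Unknown actions should surface loudly rather than be silently skipped.
      throw new Error(`Unhandled Computer Use action: ${action.name}`);
  }
}
```

With real Playwright, `page.mouse.click(x, y)` and `page.keyboard.type(text, { delay: 30 })` cover the two cases; a small keystroke delay keeps typing human-paced.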

Gemini 2.5 Flash Lite — The Agent's Brain

Every task that doesn't require Computer Use goes to Flash Lite. This includes:

Spec parsing: When you give Verifai a Jira ticket or Confluence page, Flash Lite reads the text and generates a sequential test plan — 5 to 8 atomic browser actions with expected outcomes for each.

Step verification: After each action executes, Flash Lite gets a fresh screenshot and checks: "Did the expected behavior happen?" It returns a structured verdict with a finding description and severity rating.

Bug description enrichment: When a step fails, Flash Lite writes a detailed bug title and description from the screenshot — suitable for a Jira ticket that a developer can actually act on.

Real-time narration: During execution, Flash Lite generates one-sentence narration lines for the live transcript panel: "Clicking the login button — expecting redirect to inventory page."

By routing all of these tasks to Flash Lite, we keep Computer Use calls reserved exclusively for browser action decisions.
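For the verification and enrichment tasks, the key is getting a machine-readable verdict back. The field names below are assumptions, not Verifai's actual schema, and even when the model is constrained by a JSON response schema (`config.responseSchema` in the Gemini API), it pays to validate before trusting the output:

```typescript
// Illustrative verdict shape for step verification, plus a defensive parse.
interface StepVerdict {
  passed: boolean;
  finding: string;                                  // one-line description for the report
  severity: "low" | "medium" | "high" | "critical"; // later mapped to Jira priority
}

const SEVERITIES = ["low", "medium", "high", "critical"];

function parseVerdict(raw: string): StepVerdict | null {
  try {
    const v = JSON.parse(raw);
    if (typeof v.passed !== "boolean") return null;
    if (typeof v.finding !== "string") return null;
    if (!SEVERITIES.includes(v.severity)) return null;
    return v as StepVerdict;
  } catch {
    return null; // non-JSON output: treat as unverifiable, not as a bug
  }
}
```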

Gemini 2.5 Flash TTS — The Agent's Voice

The most memorable demo feature: the agent speaks aloud during test execution. At key moments — session start, each step beginning, bug discovery, session end — we send a text narration to Gemini TTS and stream the audio to the frontend.

// "ai" is the same GoogleGenAI client shown earlier
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-tts-preview",
  contents: [{
    role: "user",
    parts: [{
      text: `You are Verifai, a professional QA agent. 
             Narrate: "Bug found — cart badge not updating after add to cart"`
    }],
  }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

Voice is entirely fire-and-forget — it never blocks step execution or slows down the session. If TTS fails or is rate-limited, text narration continues normally. But when it works, the effect is striking: the AI narrates its own testing in real time.


Google Cloud: The Infrastructure Layer

Verifai uses four Google Cloud services, each solving a specific problem:

Cloud Run — Hosting the Agent

The agent server runs on Cloud Run with specific configuration for our use case. Playwright needs memory (Chromium is hungry), WebSocket connections need session affinity, and test sessions can take several minutes. Our Cloud Run config:

  • 2Gi memory / 2 CPU — headroom for Chromium + screenshot processing
  • 10 minute timeout — sessions with many steps need room
  • Session affinity — WebSocket connections must stick to one instance
  • Low concurrency (5 per instance) — each session runs its own browser

Firestore — Report Persistence

Every test session generates a report with the tri-state results (passed, failed, incomplete), bug details, step outcomes, and metadata. These are saved to Firestore so users can browse test history, re-open past reports, and track trends over time.

The report structure is denormalized — each document contains the full step list, all bugs, and computed metrics. This keeps reads simple (one document fetch per report) at the cost of larger documents, which is the right tradeoff for a QA reporting tool where writes happen once and reads happen many times.
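As an illustration of that tradeoff, here is what a denormalized report document might look like (field names are assumptions, not Verifai's schema); metrics are computed once at write time so reads never aggregate:

```typescript
// One Firestore read returns everything the report page needs.
type StepStatus = "passed" | "failed" | "incomplete";

interface ReportDoc {
  sessionId: string;
  createdAt: string;  // ISO timestamp
  status: StepStatus; // overall tri-state status
  steps: { id: string; text: string; status: StepStatus }[];
  bugs: { title: string; severity: string; screenshotUrl: string }[];
  metrics: { total: number; passed: number; failed: number; incomplete: number };
}

// Computed at write time, embedded in the document.
function computeMetrics(steps: ReportDoc["steps"]): ReportDoc["metrics"] {
  return {
    total: steps.length,
    passed: steps.filter(s => s.status === "passed").length,
    failed: steps.filter(s => s.status === "failed").length,
    incomplete: steps.filter(s => s.status === "incomplete").length,
  };
}
```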

Cloud Storage — Bug Screenshots

When Verifai finds a bug, the screenshot evidence needs a permanent home. We upload each bug screenshot to GCS and include the public URL in the Jira ticket. The file path includes the session ID and step ID for organization:

gs://verifai-screenshots/screenshots/{sessionId}/{stepId}-{timestamp}.jpg

These URLs go directly into Jira ticket descriptions, so developers can see exactly what the AI saw when it identified the bug.
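The path convention reads directly as a helper. `buildScreenshotPath` is a hypothetical name, but the layout matches the `gs://` pattern above; with `@google-cloud/storage`, the upload itself would be roughly `bucket.file(path).save(buffer, { contentType: "image/jpeg" })`.

```typescript
// Object path: session and step IDs keep screenshots browsable per run.
function buildScreenshotPath(sessionId: string, stepId: string, timestamp: number): string {
  return `screenshots/${sessionId}/${stepId}-${timestamp}.jpg`;
}
```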

Cloud Build — CI/CD

A cloudbuild.yaml handles the deployment pipeline: build the Docker image (including Playwright's Chromium and all its system dependencies), push to Artifact Registry, deploy to Cloud Run. The Dockerfile is carefully constructed — Chromium needs a specific set of system libraries that are easy to miss:

RUN apt-get update && apt-get install -y \
    libnss3 libnspr4 libdbus-1-3 libatk1.0-0 libatk-bridge2.0-0 \
    libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \
    libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 \
    libasound2 libatspi2.0-0 libwayland-client0 fonts-liberation

Miss one library and Chromium silently fails to launch. We learned this the hard way.


The Vision Loop in Detail

The vision loop is the heart of Verifai. Here's what actually happens for a single test step — say, "Enter username 'standard_user' in the login form":

Step 1: Observe. Playwright takes a screenshot, downscaled to 1024px width and JPEG-compressed for token efficiency, and captures the accessibility tree (AOM snapshot). Both are sent to Gemini 3 Flash.

Step 2: Decide. Gemini sees the screenshot, reads the accessibility tree, and uses the Computer Use tool to decide: "type 'standard_user' at coordinates (640, 280)." It returns a structured action, not free-form text.

Step 3: Highlight. Before executing, we inject a red circle overlay at the target coordinates and take another screenshot. This streams to the frontend so the operator can see exactly what the AI is targeting — a visual confirmation of intent.

Step 4: Execute. Playwright clicks at the coordinates, then types with a 30ms delay between keystrokes (simulating human typing speed).

Step 5: Wait. A brief pause for the page to settle — DOM updates, network requests, animations.

Step 6: Verify. A fresh screenshot goes to Gemini 2.5 Flash Lite with the question: "Expected behavior: 'Username field populated with standard_user.' Did this happen?" The model returns a structured pass/fail verdict.

If the action fails (wrong coordinates, element not clickable), the self-heal kicks in: Gemini sees the error context and the new screenshot, then tries a different approach — different coordinates, a different element, or a different action type entirely.
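The self-heal behavior can be sketched as a retry wrapper. Here `attemptAction` stands in for one full pass of the vision loop; the only real idea is that the previous error is carried into the next attempt so the model can choose a different approach.

```typescript
// Retry with error context fed back into the next decision.
interface AttemptContext { previousError?: string }

function withSelfHeal(
  attemptAction: (ctx: AttemptContext) => { ok: boolean; error?: string },
  maxAttempts = 3,
): boolean {
  let ctx: AttemptContext = {};
  for (let i = 0; i < maxAttempts; i++) {
    const result = attemptAction(ctx);      // re-screenshot + re-decide inside
    if (result.ok) return true;
    ctx = { previousError: result.error };  // next prompt sees what went wrong
  }
  return false; // exhausted: execution noise, typically reported as "incomplete"
}
```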


Tri-State Reporting: What Happened vs. What's a Bug

Early in development, we had binary pass/fail. Then reality hit: what happens when a step times out? Or the page takes too long to load? Or Chromium crashes? Those aren't bugs — they're infrastructure noise. But in a binary system, they look like failures.

Our solution: tri-state reporting.

  • Passed — the step executed and Gemini verified the expected outcome appeared on screen
  • Failed — Gemini verified that something is wrong (a real product bug, with evidence)
  • Incomplete — the step couldn't be assessed (timeout, crash, rate limit, user skip)

The overall report status follows clear rules: if any step failed, the report status is "Failed." If no steps failed but some are incomplete, it's "Incomplete." Only when everything passes is it "Passed."
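Those rules collapse into a short precedence check (a sketch; names are illustrative):

```typescript
// Report-status aggregation: any failure wins, then incompleteness,
// and only a clean sweep is a pass.
type StepStatus = "passed" | "failed" | "incomplete";

function overallStatus(steps: StepStatus[]): StepStatus {
  if (steps.some(s => s === "failed")) return "failed";
  if (steps.some(s => s === "incomplete")) return "incomplete";
  return "passed";
}
```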

This distinction matters because the report tells you "we found 2 real bugs and couldn't check 1 step due to a timeout" instead of "3 things failed." Developers trust the results because failures always mean verified bugs, never infrastructure noise.


Human-in-the-Loop: The AI Knows What It Doesn't Know

Full automation is impressive, but responsible AI requires admitting uncertainty. Verifai includes a Human-in-the-Loop system that activates when the AI encounters situations it can't handle confidently.

Every action decision includes a confidence score (0.0 to 1.0). When confidence drops below a configurable threshold, the agent pauses and asks the human operator for guidance. A modal appears with the current screenshot, the AI's question, and context-appropriate decision buttons.

The options change based on why the agent paused:

  • Low confidence action: "Does this look right? Proceed / Skip / Re-analyze"
  • Destructive action detected: "This might delete data. Allow / Skip / Abort"
  • Ambiguous verification: "I can't tell if this passed. Mark Passed / Mark Failed / Re-verify"
  • Authentication wall: "I've detected a login page. I've Logged In / Skip / Abort"

Every human intervention is logged with a timestamp, the question asked, the decision made, an optional human note, and how long the operator took to decide. This audit trail is included in the report — judges (and future auditors) can see that the AI operated with appropriate human oversight.
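The pause gate and audit record might be shaped like this; the reason values mirror the four cases above, but the names and the 0.7 default threshold are assumptions.

```typescript
// Decide whether to pause for a human, and what gets logged when we do.
type PauseReason =
  | "destructive_action"
  | "ambiguous_verification"
  | "auth_wall";

function shouldPause(confidence: number, reason: PauseReason | null, threshold = 0.7): boolean {
  if (reason !== null) return true; // explicit triggers always pause
  return confidence < threshold;    // otherwise gate on confidence alone
}

interface InterventionRecord {
  timestamp: string;  // when the agent paused
  question: string;   // what it asked the operator
  decision: string;   // e.g. "proceed" | "skip" | "abort"
  note?: string;      // optional human note
  decisionMs: number; // how long the operator took to decide
}
```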


Integration: Jira and Confluence

Verifai plugs into the tools QA teams already use.

Input: Test specs come from three sources — Jira tickets (summary, description, and acceptance criteria are extracted via the REST API), Confluence pages (HTML storage format is converted to plain text, with optional child page inclusion), or free-form text pasted directly.

Output: When a bug is found, Verifai auto-creates a Jira ticket in the configured project. The ticket includes the bug title and description (enriched by Gemini), expected vs. actual behavior, severity-based priority mapping, a GCS screenshot link, and labels for traceability (verifai-auto, source-{ticket}, failure-{type}).

This closes the loop: spec in Jira → Verifai tests → bugs back in Jira. The QA engineer reviews the results instead of performing the testing.
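The auto-created ticket payload could be assembled along these lines. The severity-to-priority map and helper are assumptions; note that Jira Cloud's v3 REST API additionally expects the description wrapped in Atlassian Document Format rather than plain text.

```typescript
// Build the "fields" object for Jira's create-issue endpoint.
const PRIORITY_BY_SEVERITY: Record<string, string> = {
  critical: "Highest",
  high: "High",
  medium: "Medium",
  low: "Low",
};

interface FoundBug {
  title: string;
  description: string;
  severity: string;
  screenshotUrl: string;
  sourceTicket: string; // e.g. the Jira key the spec came from
  failureType: string;  // e.g. "assertion" or "timeout"
}

function buildJiraFields(bug: FoundBug, projectKey: string) {
  return {
    project: { key: projectKey },
    issuetype: { name: "Bug" },
    summary: bug.title,
    description: `${bug.description}\n\nScreenshot: ${bug.screenshotUrl}`,
    priority: { name: PRIORITY_BY_SEVERITY[bug.severity] ?? "Medium" },
    labels: ["verifai-auto", `source-${bug.sourceTicket}`, `failure-${bug.failureType}`],
  };
}
```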


What We'd Do Differently

Start with the vision loop. We wasted time on a "generate plan, execute blindly" architecture before realizing the agent must see the browser before every action. This should have been the starting assumption.

Model specialization from day one. Our first version used one model for everything. Splitting into three models (Computer Use, verification, voice) should have been the architecture from the start — each model has a fundamentally different job.

Build the reporting system early. Tri-state reporting required touching nearly every file in the codebase when we added it. If we'd designed the type system with three states from the beginning, it would have saved significant refactoring.


Try It

Verifai is open source.

Built with Gemini 3 Flash, Gemini 2.5 Flash Lite, Gemini 2.5 Flash TTS, Cloud Run, Firestore, Cloud Storage, and Cloud Build for the Gemini Live Agent Challenge.


Verifai was built for the Gemini Live Agent Challenge hackathon (UI Navigator category). The source code and all implementation prompts are available in the GitHub repository.
