<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Phạm Thanh Hằng</title>
    <description>The latest articles on DEV Community by Phạm Thanh Hằng (@phamthanhhang208).</description>
    <link>https://dev.to/phamthanhhang208</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818721%2F7561a9fd-2cb6-4959-bf73-579675878cb1.jpeg</url>
      <title>DEV Community: Phạm Thanh Hằng</title>
      <link>https://dev.to/phamthanhhang208</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/phamthanhhang208"/>
    <language>en</language>
    <item>
      <title>VaultRoom: A DeFi Risk Agent Where Notion Is the Control Plane</title>
      <dc:creator>Phạm Thanh Hằng</dc:creator>
      <pubDate>Wed, 25 Mar 2026 04:16:01 +0000</pubDate>
      <link>https://dev.to/phamthanhhang208/vaultroom-a-defi-risk-agent-where-notion-is-the-control-plane-2nb1</link>
      <guid>https://dev.to/phamthanhhang208/vaultroom-a-defi-risk-agent-where-notion-is-the-control-plane-2nb1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;DeFi operators managing lending positions across multiple blockchains face a familiar problem: alerts scattered across Discord bots, risk tracking in spreadsheets, and critical decisions made via Telegram messages. There's no single source of truth, no structured escalation workflow, and no clean way for a human to stay in the loop when an AI agent flags something dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VaultRoom&lt;/strong&gt; is a multi-chain DeFi risk monitoring agent that turns Notion into a &lt;strong&gt;bidirectional control plane&lt;/strong&gt;. The agent monitors on-chain positions across Cardano and Ethereum, detects anomalies using rule-based checks and Gemini 2.5 Pro analysis, and manages a living Notion workspace — where human operators set thresholds, approve escalations, and receive AI-written daily digests.&lt;/p&gt;

&lt;p&gt;Every single Notion interaction flows through the &lt;strong&gt;remote Notion MCP server&lt;/strong&gt; via OAuth. Zero &lt;code&gt;@notionhq/client&lt;/code&gt; SDK usage. MCP is the protocol layer between the agent and Notion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key idea: Notion isn't a dashboard. It's the control plane. Data flows both ways.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the agent does:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Polls &lt;strong&gt;Cardano&lt;/strong&gt; (Blockfrost) and &lt;strong&gt;Ethereum&lt;/strong&gt; (ethers.js) for wallet balances and transactions&lt;/li&gt;
&lt;li&gt;Detects risk signals: health factor drops, whale movements, balance anomalies&lt;/li&gt;
&lt;li&gt;Calls &lt;strong&gt;Gemini 2.5 Pro&lt;/strong&gt; to analyze critical signals and generate plain-English risk assessments&lt;/li&gt;
&lt;li&gt;Writes risk events, position updates, and alerts to Notion databases via MCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalates critical alerts&lt;/strong&gt; and waits for human approval in Notion before resolving&lt;/li&gt;
&lt;li&gt;Leaves &lt;strong&gt;comments&lt;/strong&gt; on Notion pages to acknowledge human decisions&lt;/li&gt;
&lt;li&gt;Generates rich &lt;strong&gt;daily digest pages&lt;/strong&gt; with tables, callouts, and toggles — all via Notion-flavored Markdown through MCP&lt;/li&gt;
&lt;/ul&gt;
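&lt;p&gt;The rule-based layer boils down to pure checks over a position snapshot. Here's an illustrative sketch (not VaultRoom's actual code; the &lt;code&gt;Position&lt;/code&gt; shape and the 50% balance-move threshold are assumptions):&lt;/p&gt;

```typescript
// Illustrative sketch of rule-based risk checks (not VaultRoom's actual code).
// The Position shape and thresholds here are assumptions for the example.
interface Position {
  chain: "cardano" | "ethereum";
  wallet: string;
  healthFactor: number;   // < 1.0 means liquidatable on most lending protocols
  balance: number;        // native-token balance this cycle
  prevBalance: number;    // balance at the previous polling cycle
}

interface RiskSignal {
  severity: "info" | "warning" | "critical";
  kind: "health_factor" | "balance_anomaly";
  message: string;
}

function detectSignals(p: Position, hfThreshold = 1.2): RiskSignal[] {
  const signals: RiskSignal[] = [];

  // Health factor: critical below 1.0, warning below the configured threshold.
  if (p.healthFactor < 1.0) {
    signals.push({ severity: "critical", kind: "health_factor",
      message: `Health factor ${p.healthFactor} is below 1.0 — liquidation risk` });
  } else if (p.healthFactor < hfThreshold) {
    signals.push({ severity: "warning", kind: "health_factor",
      message: `Health factor ${p.healthFactor} is below threshold ${hfThreshold}` });
  }

  // Balance anomaly: flag moves larger than 50% between polling cycles.
  if (p.prevBalance > 0) {
    const change = Math.abs(p.balance - p.prevBalance) / p.prevBalance;
    if (change > 0.5) {
      signals.push({ severity: "warning", kind: "balance_anomaly",
        message: `Balance moved ${(change * 100).toFixed(0)}% in one cycle` });
    }
  }
  return signals;
}
```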

&lt;h3&gt;
  
  
  What the human does (in Notion):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Configures which wallets to monitor and sets alert thresholds&lt;/li&gt;
&lt;li&gt;Adds protocols to a watchlist&lt;/li&gt;
&lt;li&gt;Reviews AI-generated risk analysis&lt;/li&gt;
&lt;li&gt;Approves or rejects escalated alerts by changing a status field&lt;/li&gt;
&lt;li&gt;Reads daily portfolio digests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4ewgn8o3schqbl9hs4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4ewgn8o3schqbl9hs4v.png" alt="Risk Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wpt5bz66rsxsvysag8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wpt5bz66rsxsvysag8b.png" alt="Daily Digest"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2cZroZTegYI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/phamthanhhang208" rel="noopener noreferrer"&gt;
        phamthanhhang208
      &lt;/a&gt; / &lt;a href="https://github.com/phamthanhhang208/vault-room" rel="noopener noreferrer"&gt;
        vault-room
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🏦 VaultRoom&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Multi-chain DeFi risk agent with Notion as the control plane — powered by Notion MCP.&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built for the &lt;a href="https://dev.to/challenges/notion-2026-03-04" rel="nofollow"&gt;Notion MCP Challenge&lt;/a&gt; · March 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;DeFi operators managing positions across Cardano and Ethereum rely on scattered tools — Discord bots for alerts, spreadsheets for tracking, Telegram for coordination. There's no single source of truth, no structured escalation workflow, and no human-in-the-loop approval for critical actions.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Solution&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;VaultRoom turns Notion into a DeFi risk control plane. An AI agent monitors on-chain positions, detects anomalies, and writes structured risk analysis directly into Notion databases — &lt;strong&gt;entirely through Notion MCP&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Human operators configure thresholds, approve escalations, and receive AI-written daily digests. All without leaving Notion.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Notion isn't a dashboard. It's the control plane. MCP is the protocol.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;

&lt;/div&gt;

  &lt;div class="js-render-enrichment-target"&gt;
    &lt;div class="render-plaintext-hidden"&gt;
      &lt;pre&gt;graph TB
    subgraph USER["👤 Human Operator"]
        NOTION_UI["Notion Workspace&amp;lt;br/&amp;gt;(Browser / App)"]
    end
    subgraph NOTION_MCP["☁️ Notion MCP Server&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;mcp.notion.com&amp;lt;/i&amp;gt;"]
        MCP_API["MCP Protocol&amp;lt;br/&amp;gt;Streamable HTTP +&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/phamthanhhang208/vault-room" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;This is where I went deep. VaultRoom connects to Notion's &lt;strong&gt;remote hosted MCP server&lt;/strong&gt; (&lt;code&gt;https://mcp.notion.com/mcp&lt;/code&gt;) via OAuth 2.0 with PKCE and uses &lt;strong&gt;7 MCP tools&lt;/strong&gt; as its core integration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MCP Tool&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;How VaultRoom Uses It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Finds databases by name, locates existing position pages for upsert logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Reads Config DB to get thresholds, reads Watchlist, polls Risk Dashboard for human approvals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-pages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Creates risk events, position entries, alert log items, and daily digest pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-update-page&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Updates position data on re-scan, resolves escalated events after human approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-database&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Scaffolds the entire 6-database Notion workspace using SQL DDL syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-comment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Agent leaves acknowledgment comments on escalated items after human approves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-get-comments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Reads discussion threads on escalated risk events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
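&lt;p&gt;The upsert logic behind &lt;code&gt;notion-search&lt;/code&gt; plus &lt;code&gt;notion-create-pages&lt;/code&gt; / &lt;code&gt;notion-update-page&lt;/code&gt; reduces to a simple decision. A sketch (the tool names match the table above; the argument shapes are assumptions for illustration):&lt;/p&gt;

```typescript
// Sketch of the upsert decision: search for an existing position page first,
// then pick the right MCP write tool. Illustrative only — tool names match
// the table in the post, but the argument shapes are assumptions.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function upsertPositionCall(
  existingPageUrl: string | undefined,  // result of a prior notion-search
  dataSourceId: string,                 // collection:// ID of the Positions DB
  properties: Record<string, string>,
): ToolCall {
  if (existingPageUrl) {
    // Page already exists: update it in place.
    return { name: "notion-update-page",
             arguments: { page_url: existingPageUrl, properties } };
  }
  // No page yet: create a new row in the database's data source.
  return { name: "notion-create-pages",
           arguments: { parent: { data_source_id: dataSourceId },
                        pages: [{ properties }] } };
}
```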

&lt;h3&gt;
  
  
  The Bidirectional Loop
&lt;/h3&gt;

&lt;p&gt;Most MCP integrations I've seen are one-directional — an agent writes to Notion. VaultRoom goes both ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human → Agent&lt;/strong&gt; (Notion → MCP → VaultRoom):&lt;/p&gt;

&lt;p&gt;The agent reads the Config and Watchlist databases every monitoring cycle. If a human changes a health factor threshold from 1.2 to 1.5 in the Notion UI, the agent picks it up on the next cycle and adjusts its detection sensitivity. No redeployment, no config files — the human edits Notion, the agent adapts.&lt;/p&gt;
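&lt;p&gt;A sketch of that per-cycle refresh: parse the raw property strings from a fetched Config page into typed thresholds, falling back to defaults. The property names here are made up for the example:&lt;/p&gt;

```typescript
// Sketch of the per-cycle config refresh: turn raw Notion property strings
// into typed thresholds with safe defaults. Property names are illustrative.
interface AgentConfig {
  healthFactorThreshold: number;
  whaleMoveUsd: number;
}

const DEFAULTS: AgentConfig = { healthFactorThreshold: 1.2, whaleMoveUsd: 100_000 };

function parseConfig(props: Record<string, string | undefined>): AgentConfig {
  // Reject missing, non-numeric, or non-positive values and keep the default.
  const num = (v: string | undefined, fallback: number) => {
    const n = Number(v);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };
  return {
    healthFactorThreshold: num(props["HF Threshold"], DEFAULTS.healthFactorThreshold),
    whaleMoveUsd: num(props["Whale Move USD"], DEFAULTS.whaleMoveUsd),
  };
}
```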

&lt;p&gt;&lt;strong&gt;Agent → Human&lt;/strong&gt; (VaultRoom → MCP → Notion):&lt;/p&gt;

&lt;p&gt;Risk events, positions, and alerts are written to structured databases. But the key showcase is the escalation flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facxvu69mbt42bxbaduhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facxvu69mbt42bxbaduhg.png" alt="Escalation flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Risk engine detects a critical signal (e.g., health factor drops below 1.0)&lt;/li&gt;
&lt;li&gt;Agent creates a Risk Dashboard entry via &lt;code&gt;notion-create-pages&lt;/code&gt; with status set to "Escalated"&lt;/li&gt;
&lt;li&gt;Agent polls the dashboard via &lt;code&gt;notion-search&lt;/code&gt; every cycle, waiting for status change&lt;/li&gt;
&lt;li&gt;Human reviews the AI analysis in Notion and changes status to "Approved"&lt;/li&gt;
&lt;li&gt;Agent detects the change, updates status to "Resolved" via &lt;code&gt;notion-update-page&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Agent leaves a comment on the page via &lt;code&gt;notion-create-comment&lt;/code&gt;: &lt;em&gt;"✅ VaultRoom agent acknowledged approval. Escalation resolved."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step — the agent commenting on a Notion page — is what makes this feel like a real conversation between the AI and the human, all inside Notion.&lt;/p&gt;
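&lt;p&gt;The polling logic can be sketched as a tiny state machine: given the status a human set in Notion, decide the agent's next MCP calls. The status names follow the flow above; the rest is illustrative:&lt;/p&gt;

```typescript
// Sketch of the escalation state machine. Status names follow the flow in
// the post; the action shapes are illustrative, not VaultRoom's actual code.
type Status = "Escalated" | "Approved" | "Rejected" | "Resolved";

function nextActions(status: Status): { tool: string; note?: string }[] {
  if (status === "Approved") {
    // Human signed off: mark resolved, then acknowledge with a comment.
    return [
      { tool: "notion-update-page", note: "status → Resolved" },
      { tool: "notion-create-comment",
        note: "✅ VaultRoom agent acknowledged approval. Escalation resolved." },
    ];
  }
  if (status === "Rejected") {
    return [{ tool: "notion-create-comment", note: "Escalation rejected by operator" }];
  }
  // "Escalated": still waiting — poll again next cycle. "Resolved": done.
  return [];
}
```

&lt;p&gt;Because the function is a pure mapping from status to actions, re-running it on an already-resolved event is a no-op, which keeps the polling loop idempotent.&lt;/p&gt;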

&lt;h3&gt;
  
  
  Notion-Flavored Markdown for Rich Pages
&lt;/h3&gt;

&lt;p&gt;The daily digest is the visual payoff. Instead of constructing JSON block arrays, the agent writes Notion-flavored Markdown through MCP. The server converts it to rich Notion blocks automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;gt;&lt;/code&gt; blockquotes become &lt;strong&gt;callouts&lt;/strong&gt; with portfolio snapshots&lt;/li&gt;
&lt;li&gt;Standard Markdown tables become &lt;strong&gt;Notion tables&lt;/strong&gt; with position data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; tags become &lt;strong&gt;toggles&lt;/strong&gt; containing the full Gemini analysis&lt;/li&gt;
&lt;li&gt;Numbered lists become recommendation items&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One MCP call, one Markdown string, and Notion renders a professional portfolio briefing with callouts, tables, and expandable sections. This is significantly cleaner than building block trees through the REST API.&lt;/p&gt;
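&lt;p&gt;A sketch of assembling such a digest as a single Markdown string. The structure mirrors the list above; VaultRoom's actual layout may differ:&lt;/p&gt;

```typescript
// Sketch of building the daily digest as one Notion-flavored Markdown string:
// a blockquote (→ callout), a table (→ Notion table), a <details> (→ toggle).
// The exact layout is illustrative.
interface Digest {
  totalUsd: number;
  positions: { name: string; chain: string; healthFactor: number }[];
  analysis: string; // full Gemini write-up
}

function digestMarkdown(d: Digest): string {
  const rows = d.positions
    .map(p => `| ${p.name} | ${p.chain} | ${p.healthFactor.toFixed(2)} |`)
    .join("\n");
  return [
    `> 💰 Portfolio value: $${d.totalUsd.toLocaleString("en-US")}`, // becomes a callout
    "",
    "| Position | Chain | Health Factor |",
    "| --- | --- | --- |",
    rows,                                                           // becomes a Notion table
    "",
    "<details><summary>Full AI analysis</summary>",                 // becomes a toggle
    "",
    d.analysis,
    "",
    "</details>",
  ].join("\n");
}
```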

&lt;h3&gt;
  
  
  Monitor Cycle Data Flow
&lt;/h3&gt;

&lt;p&gt;Each monitoring cycle follows a four-phase pattern — reading config from Notion, fetching on-chain data, detecting risks, and writing results back:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf33i4435tc0rdfsspcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf33i4435tc0rdfsspcg.png" alt="Data Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetlml4v0xf4jemvls069.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetlml4v0xf4jemvls069.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Notion Database Schema
&lt;/h3&gt;

&lt;p&gt;VaultRoom manages 6 interconnected databases, all created programmatically via MCP using SQL DDL syntax:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76d9ms46f4snsbue6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76d9ms46f4snsbue6e.png" alt="ER Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why DeFi + Notion MCP?
&lt;/h3&gt;

&lt;p&gt;I build DeFi products professionally — lending protocols and yield platforms on Cardano. The risk scenarios in VaultRoom aren't hypothetical. Health factor monitoring, whale movement detection, and liquidation risk assessment are problems I deal with daily.&lt;/p&gt;

&lt;p&gt;DeFi is territory few challenge submissions venture into. VaultRoom brings genuine domain expertise to a real problem, and Notion MCP turns out to be a surprisingly natural fit for operational risk management: structured databases for tracking, rich pages for reporting, comments for human-agent communication, and the whole thing accessible from a phone without building a custom UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned: Hosted MCP Quirks
&lt;/h3&gt;

&lt;p&gt;Building a custom MCP client against the hosted Notion MCP server taught me things the docs don't mention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The hosted MCP has its own OAuth&lt;/strong&gt; — You can't use a Notion REST API token. The MCP server uses PKCE with dynamic client registration. I implemented the full discovery flow (RFC 9728 protected resource metadata → RFC 8414 authorization server metadata).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL DDL for database creation&lt;/strong&gt; — &lt;code&gt;CREATE TABLE&lt;/code&gt; syntax with custom types: &lt;code&gt;TITLE&lt;/code&gt;, &lt;code&gt;RICH_TEXT&lt;/code&gt;, &lt;code&gt;SELECT('opt1', 'opt2')&lt;/code&gt;, &lt;code&gt;MULTI_SELECT&lt;/code&gt;, &lt;code&gt;CHECKBOX&lt;/code&gt;. Not the JSON schema from the REST API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Property values are SQLite-flavored&lt;/strong&gt; — Checkboxes are &lt;code&gt;"__YES__"&lt;/code&gt; / &lt;code&gt;"__NO__"&lt;/code&gt; (not booleans). Dates need expanded keys like &lt;code&gt;"date:Field Name:start"&lt;/code&gt;. Multi-selects are JSON array strings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pages in databases use &lt;code&gt;data_source_id&lt;/code&gt;&lt;/strong&gt; — When creating rows, you reference the &lt;code&gt;collection://&lt;/code&gt; ID, not the database page ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Notion-flavored Markdown just works&lt;/strong&gt; — Blockquotes → callouts, tables → rich Notion tables, &lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; → toggles. One string, one MCP call, beautiful output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
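&lt;p&gt;Those property-value quirks are easy to centralize in one encoder. A sketch, assuming the formats listed above (field names are examples):&lt;/p&gt;

```typescript
// Sketch of encoding JS values into the SQLite-flavored property strings the
// hosted MCP server expects, per the quirks above. Field names are examples.
function encodeProperties(fields: {
  checkbox?: { name: string; value: boolean };
  date?: { name: string; start: string };
  multiSelect?: { name: string; values: string[] };
}): Record<string, string> {
  const props: Record<string, string> = {};
  if (fields.checkbox) {
    // Checkboxes are "__YES__" / "__NO__", not booleans.
    props[fields.checkbox.name] = fields.checkbox.value ? "__YES__" : "__NO__";
  }
  if (fields.date) {
    // Dates use expanded keys like "date:Field Name:start".
    props[`date:${fields.date.name}:start`] = fields.date.start;
  }
  if (fields.multiSelect) {
    // Multi-selects are JSON array strings.
    props[fields.multiSelect.name] = JSON.stringify(fields.multiSelect.values);
  }
  return props;
}
```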

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node.js 20 + TypeScript (strict)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Client&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; (Streamable HTTP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notion MCP&lt;/td&gt;
&lt;td&gt;Remote hosted &lt;code&gt;mcp.notion.com&lt;/code&gt; (OAuth 2.0 PKCE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cardano&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@blockfrost/blockfrost-js&lt;/code&gt; (Preprod testnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ethereum&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ethers&lt;/code&gt; v6 (Sepolia testnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@google/generative-ai&lt;/code&gt; (Gemini 2.5 Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler&lt;/td&gt;
&lt;td&gt;&lt;code&gt;node-cron&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zod&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;winston&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Built solo for the Notion MCP Challenge · March 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building Verifai: How We Used 3 Gemini Models to Create an AI QA Agent That Finds Real Bugs</title>
      <dc:creator>Phạm Thanh Hằng</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:31:52 +0000</pubDate>
      <link>https://dev.to/phamthanhhang208/building-verifai-how-we-used-3-gemini-models-to-create-an-ai-qa-agent-that-finds-real-bugs-16m7</link>
      <guid>https://dev.to/phamthanhhang208/building-verifai-how-we-used-3-gemini-models-to-create-an-ai-qa-agent-that-finds-real-bugs-16m7</guid>
      <description>&lt;p&gt;&lt;em&gt;An inside look at building an autonomous QA testing agent with Gemini Computer Use, multi-model architecture, and Google Cloud — for the Gemini Live Agent Challenge.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;QA engineers spend hours clicking through the same flows after every sprint. They write the same Jira tickets. They attach the same screenshots. They catch the obvious bugs, but the subtle ones — the ones that only show up with specific user accounts or edge-case data — slip through.&lt;/p&gt;

&lt;p&gt;We built Verifai to change that. It's an AI agent that reads your Jira tickets, opens a real browser, tests your application the way a human QA engineer would, and files Jira tickets for the bugs it finds — complete with screenshots and reproduction steps.&lt;/p&gt;

&lt;p&gt;This post walks through exactly how we built it using Google's AI models and cloud services, the architectural decisions that made it work, and the mistakes we made along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: An Agent That Sees Before It Acts
&lt;/h2&gt;

&lt;p&gt;Most browser automation tools follow a script. Playwright runs a sequence of commands. Selenium clicks selectors. If the page layout changes or an element moves, the test breaks.&lt;/p&gt;

&lt;p&gt;Verifai doesn't run scripts. For every single action it takes, it follows this loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot&lt;/strong&gt; the current browser state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send the screenshot&lt;/strong&gt; to Gemini with the Computer Use tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini analyzes&lt;/strong&gt; what's on screen and decides what to do next&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; that one action in the browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot again&lt;/strong&gt; and &lt;strong&gt;verify&lt;/strong&gt; whether the expected outcome happened&lt;/li&gt;
&lt;li&gt;Repeat until the test step passes, fails, or can't be completed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI sees the live page before every decision. If a button moves, the login form looks different, or an unexpected popup appears, Gemini adapts. This is fundamentally different from running a pre-written test script — it's a Computer Use agent that happens to do QA.&lt;/p&gt;
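&lt;p&gt;With the browser and model injected as functions, the loop's control flow fits in a few lines. An illustrative sketch (Verifai's real loop tracks more state):&lt;/p&gt;

```typescript
// Sketch of the screenshot → decide → act → verify loop described above,
// with the browser and model injected so the control flow is visible.
// Illustrative only — Verifai's real loop tracks more state.
interface Deps {
  screenshot: () => Promise<string>;                             // base64 JPEG
  decide: (shot: string) => Promise<{ done: boolean; action?: string }>;
  execute: (action: string) => Promise<void>;
  verify: (shot: string) => Promise<boolean>;
}

async function runStep(deps: Deps, maxActions = 10): Promise<"passed" | "failed" | "incomplete"> {
  for (let i = 0; i < maxActions; i++) {
    const shot = await deps.screenshot();           // 1. capture current state
    const decision = await deps.decide(shot);       // 2-3. model picks next action
    if (decision.done) return "passed";
    if (!decision.action) return "failed";          // model can't proceed
    await deps.execute(decision.action);            // 4. perform one action
    const after = await deps.screenshot();          // 5. re-capture…
    if (await deps.verify(after)) return "passed";  //    …and verify the outcome
  }
  return "incomplete";                              // 6. gave up after maxActions
}
```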




&lt;h2&gt;
  
  
  Three Gemini Models, Three Distinct Jobs
&lt;/h2&gt;

&lt;p&gt;One of our most important design decisions was splitting work across three specialized Gemini models instead of using one model for everything. Each model was chosen for its specific capability:&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini 3 Flash — The Agent's Eyes and Hands
&lt;/h3&gt;

&lt;p&gt;This model powers the core agentic loop. Using the native Computer Use tool, Gemini 3 Flash looks at a screenshot and returns a structured action: "click at pixel coordinates (640, 350)" or "type 'standard_user' into the field at (400, 280)."&lt;/p&gt;

&lt;p&gt;The Computer Use tool is critical because it returns coordinate-based actions through a proper tool-calling protocol. The model isn't generating JSON text that we parse and hope is valid — it's using a structured tool interface that returns typed actions. This matters enormously for reliability.&lt;/p&gt;

&lt;p&gt;Here's what a single action decision looks like in the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;inlineData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;screenshotBase64&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are a QA browser agent looking at a live screenshot.
               Task: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
               Expected: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expectedBehavior&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
               Decide the NEXT single action.`&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;computerUse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENVIRONMENT_BROWSER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees the actual rendered page — not the DOM, not the HTML source — the pixels on screen. It decides where to click based on what it sees, just like a human tester would.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini 2.5 Flash Lite — The Agent's Brain
&lt;/h3&gt;

&lt;p&gt;Every task that doesn't require Computer Use goes to Flash Lite. This includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec parsing:&lt;/strong&gt; When you give Verifai a Jira ticket or Confluence page, Flash Lite reads the text and generates a sequential test plan — 5 to 8 atomic browser actions with expected outcomes for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step verification:&lt;/strong&gt; After each action executes, Flash Lite gets a fresh screenshot and checks: "Did the expected behavior happen?" It returns a structured verdict with a finding description and severity rating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug description enrichment:&lt;/strong&gt; When a step fails, Flash Lite writes a detailed bug title and description from the screenshot — suitable for a Jira ticket that a developer can actually act on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time narration:&lt;/strong&gt; During execution, Flash Lite generates one-sentence narration lines for the live transcript panel: "Clicking the login button — expecting redirect to inventory page."&lt;/p&gt;

&lt;p&gt;By routing all of these tasks to Flash Lite, we keep Computer Use calls reserved exclusively for browser action decisions.&lt;/p&gt;
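&lt;p&gt;The routing rule itself is trivial, which is the point: one branch decides which model handles a task. A sketch (the Flash Lite model ID is an assumption; Computer Use calls follow the snippet above):&lt;/p&gt;

```typescript
// Sketch of the model routing rule: Computer Use calls go to Gemini 3 Flash,
// everything else to Flash Lite. The Flash Lite model ID is an assumption.
type Task = "browser_action" | "spec_parsing" | "verification" | "enrichment" | "narration";

function modelFor(task: Task): string {
  // Only live browser decisions need the Computer Use tool.
  return task === "browser_action"
    ? "gemini-3-flash-preview"
    : "gemini-2.5-flash-lite";
}
```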

&lt;h3&gt;
  
  
  Gemini 2.5 Flash TTS — The Agent's Voice
&lt;/h3&gt;

&lt;p&gt;The most memorable demo feature: the agent speaks aloud during test execution. At key moments — session start, each step beginning, bug discovery, session end — we send a text narration to Gemini TTS and stream the audio to the frontend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-flash-tts-preview&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are Verifai, a professional QA agent. 
             Narrate: "Bug found — cart badge not updating after add to cart"`&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;responseModalities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;AUDIO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;speechConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;voiceConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;prebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;voiceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Kore&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voice is entirely fire-and-forget — it never blocks step execution or slows down the session. If TTS fails or is rate-limited, text narration continues normally. But when it works, the effect is striking: the AI narrates its own testing in real time.&lt;/p&gt;
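&lt;p&gt;The fire-and-forget pattern is worth showing because it's easy to get wrong: kick off the TTS call without awaiting it, and attach an error handler so a rejection can never crash the session. A sketch with the TTS call injected:&lt;/p&gt;

```typescript
// Sketch of the fire-and-forget narration pattern: start the TTS call without
// awaiting it, and swallow failures so voice never blocks test execution.
function speakInBackground(
  synthesize: (text: string) => Promise<void>,  // TTS call (injected)
  text: string,
  onError: (e: unknown) => void = () => {},     // e.g. log and continue
): void {
  // Deliberately not awaited — the test step proceeds immediately.
  // .catch() guarantees a rejected TTS promise can't become an unhandled error.
  void synthesize(text).catch(onError);
}
```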




&lt;h2&gt;
  
  
  Google Cloud: The Infrastructure Layer
&lt;/h2&gt;

&lt;p&gt;Verifai uses four Google Cloud services, each solving a specific problem:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run — Hosting the Agent
&lt;/h3&gt;

&lt;p&gt;The agent server runs on Cloud Run, configured for an unusual workload: Playwright needs memory (Chromium is hungry), WebSocket connections need session affinity, and test sessions can run for several minutes. Our Cloud Run config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2Gi memory / 2 CPU — headroom for Chromium + screenshot processing&lt;/li&gt;
&lt;li&gt;10-minute timeout — sessions with many steps need room&lt;/li&gt;
&lt;li&gt;Session affinity — WebSocket connections must stick to one instance&lt;/li&gt;
&lt;li&gt;Low concurrency (5 per instance) — each session runs its own browser&lt;/li&gt;
&lt;/ul&gt;
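&lt;p&gt;As a hypothetical &lt;code&gt;gcloud&lt;/code&gt; invocation matching those settings (service and image names are placeholders, not Verifai's actual deployment):&lt;/p&gt;

```shell
# Illustrative deploy command reflecting the configuration above.
gcloud run deploy verifai-agent \
  --image=us-central1-docker.pkg.dev/my-project/verifai/agent:latest \
  --memory=2Gi \
  --cpu=2 \
  --timeout=600 \
  --concurrency=5 \
  --session-affinity
```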

&lt;h3&gt;
  
  
  Firestore — Report Persistence
&lt;/h3&gt;

&lt;p&gt;Every test session generates a report with tri-state step results (passed, failed, incomplete), bug details, and metadata. These are saved to Firestore so users can browse test history, re-open past reports, and track trends over time.&lt;/p&gt;

&lt;p&gt;The report structure is denormalized — each document contains the full step list, all bugs, and computed metrics. This keeps reads simple (one document fetch per report) at the cost of larger documents, which is the right tradeoff for a QA reporting tool where writes happen once and reads happen many times.&lt;/p&gt;
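&lt;p&gt;The write-once, read-many shape can be sketched like this; field names are illustrative, not Verifai's actual schema:&lt;/p&gt;

```javascript
// Denormalized report document: the full step list and all bugs are embedded,
// and metrics are computed once at write time, so a read is one document fetch.
function buildReportDoc(sessionId, steps, bugs) {
  return {
    sessionId,
    steps, // full step list embedded, no subcollection reads needed
    bugs,  // all bug details embedded
    metrics: {
      total: steps.length,
      passed: steps.filter((s) => s.status === "passed").length,
      failed: steps.filter((s) => s.status === "failed").length,
      incomplete: steps.filter((s) => s.status === "incomplete").length,
    },
    createdAt: Date.now(),
  };
}

// Persisting is then a single set, and reading back is a single get, e.g.:
//   await db.collection("reports").doc(sessionId).set(buildReportDoc(...));
```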

&lt;h3&gt;
  
  
  Cloud Storage — Bug Screenshots
&lt;/h3&gt;

&lt;p&gt;When Verifai finds a bug, the screenshot evidence needs a permanent home. We upload each bug screenshot to GCS and include the public URL in the Jira ticket. The file path includes the session ID and step ID for organization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs://verifai-screenshots/screenshots/{sessionId}/{stepId}-{timestamp}.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These URLs go directly into Jira ticket descriptions, so developers can see exactly what the AI saw when it identified the bug.&lt;/p&gt;
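&lt;p&gt;Building that path and uploading it can be sketched as follows; the bucket name matches the example above, but the upload helper and URL format are assumptions based on the standard &lt;code&gt;@google-cloud/storage&lt;/code&gt; client:&lt;/p&gt;

```javascript
// Build the object path shown above: screenshots/{sessionId}/{stepId}-{timestamp}.jpg
function screenshotPath(sessionId, stepId, timestamp) {
  return `screenshots/${sessionId}/${stepId}-${timestamp}.jpg`;
}

// Upload with @google-cloud/storage (assumed dependency), then hand the
// public URL to the Jira ticket, e.g.:
//   const file = storage.bucket("verifai-screenshots").file(screenshotPath(id, step, ts));
//   await file.save(jpegBuffer, { contentType: "image/jpeg" });
//   const url = `https://storage.googleapis.com/verifai-screenshots/${file.name}`;
```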

&lt;h3&gt;
  
  
  Cloud Build — CI/CD
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;cloudbuild.yaml&lt;/code&gt; handles the deployment pipeline: build the Docker image (including Playwright's Chromium and all its system dependencies), push to Artifact Registry, deploy to Cloud Run. The Dockerfile is carefully constructed — Chromium needs a specific set of system libraries that are easy to miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    libnss3 libnspr4 libdbus-1-3 libatk1.0-0 libatk-bridge2.0-0 &lt;span class="se"&gt;\
&lt;/span&gt;    libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 &lt;span class="se"&gt;\
&lt;/span&gt;    libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 &lt;span class="se"&gt;\
&lt;/span&gt;    libasound2 libatspi2.0-0 libwayland-client0 fonts-liberation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Miss one library and Chromium silently fails to launch. We learned this the hard way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision Loop in Detail
&lt;/h2&gt;

&lt;p&gt;The vision loop is the heart of Verifai. Here's what actually happens for a single test step — say, "Enter username 'standard_user' in the login form":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Observe.&lt;/strong&gt; Playwright takes a JPEG screenshot (compressed to 1024px width for token efficiency) and captures the accessibility tree (AOM snapshot). Both are sent to Gemini 3 Flash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Decide.&lt;/strong&gt; Gemini sees the screenshot, reads the accessibility tree, and uses the Computer Use tool to decide: "type 'standard_user' at coordinates (640, 280)." It returns a structured action, not free-form text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Highlight.&lt;/strong&gt; Before executing, we inject a red circle overlay at the target coordinates and take another screenshot. This streams to the frontend so the operator can see exactly what the AI is targeting — a visual confirmation of intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Execute.&lt;/strong&gt; Playwright clicks at the coordinates, then types with a 30ms delay between keystrokes (simulating human typing speed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Wait.&lt;/strong&gt; A brief pause for the page to settle — DOM updates, network requests, animations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Verify.&lt;/strong&gt; A fresh screenshot goes to Gemini 2.5 Flash Lite with the question: "Expected behavior: 'Username field populated with standard_user.' Did this happen?" The model returns a structured pass/fail verdict.&lt;/p&gt;

&lt;p&gt;If the action fails (wrong coordinates, element not clickable), the self-heal kicks in: Gemini sees the error context and the new screenshot, then tries a different approach — different coordinates, a different element, or a different action type entirely.&lt;/p&gt;
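&lt;p&gt;The six steps plus self-heal reduce to a small loop. This is a sketch, not Verifai's actual code: the four callbacks stand in for the Playwright and Gemini calls described above, and the error context from a failed attempt is fed back into the next decision:&lt;/p&gt;

```javascript
// Observe → decide → execute → verify, with self-heal retries on failed actions.
async function runStep(step, deps, maxAttempts = 2) {
  const { observe, decide, execute, verify } = deps;
  let lastError = null;
  for (let attempt = 0; attempt !== maxAttempts; attempt++) {
    const context = await observe();                       // screenshot + AOM snapshot
    const action = await decide(step, context, lastError); // structured Computer Use action
    try {
      await execute(action);                               // click/type via Playwright
    } catch (err) {
      lastError = err;                                     // self-heal: retry with error context
      continue;
    }
    const after = await observe();                         // fresh screenshot of the result
    return (await verify(step.expected, after)) ? "passed" : "failed";
  }
  return "incomplete";                                     // step couldn't be assessed
}
```

&lt;p&gt;Note the return values map directly onto the tri-state results discussed below: a verified outcome yields passed or failed, while exhausting retries yields incomplete rather than a fake failure.&lt;/p&gt;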




&lt;h2&gt;
  
  
  Tri-State Reporting: What Happened vs. What's a Bug
&lt;/h2&gt;

&lt;p&gt;Early in development, we had binary pass/fail. Then reality hit: what happens when a step times out? Or the page takes too long to load? Or Chromium crashes? Those aren't bugs — they're infrastructure noise. But in a binary system, they look like failures.&lt;/p&gt;

&lt;p&gt;Our solution: tri-state reporting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Passed&lt;/strong&gt; — the step executed and Gemini verified the expected outcome appeared on screen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt; — Gemini verified that something is wrong (a real product bug, with evidence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete&lt;/strong&gt; — the step couldn't be assessed (timeout, crash, rate limit, user skip)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall report status follows clear rules: if any step failed, the report status is "Failed." If no steps failed but some are incomplete, it's "Incomplete." Only when everything passes is it "Passed."&lt;/p&gt;
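&lt;p&gt;Those rules fit in a few lines; names are illustrative:&lt;/p&gt;

```javascript
// Overall report status: any failure wins, then any incomplete, else passed.
function reportStatus(steps) {
  if (steps.some((s) => s.status === "failed")) return "Failed";
  if (steps.some((s) => s.status === "incomplete")) return "Incomplete";
  return "Passed";
}
```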

&lt;p&gt;This distinction matters because the report tells you "we found 2 real bugs and couldn't check 1 step due to a timeout" instead of "3 things failed." Developers trust the results because failures always mean verified bugs, never infrastructure noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human-in-the-Loop: The AI Knows What It Doesn't Know
&lt;/h2&gt;

&lt;p&gt;Full automation is impressive, but responsible AI requires admitting uncertainty. Verifai includes a Human-in-the-Loop system that activates when the AI encounters situations it can't handle confidently.&lt;/p&gt;

&lt;p&gt;Every action decision includes a confidence score (0.0 to 1.0). When confidence drops below a configurable threshold, the agent pauses and asks the human operator for guidance. A modal appears with the current screenshot, the AI's question, and context-appropriate decision buttons.&lt;/p&gt;

&lt;p&gt;The options change based on why the agent paused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low confidence action:&lt;/strong&gt; "Does this look right? Proceed / Skip / Re-analyze"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive action detected:&lt;/strong&gt; "This might delete data. Allow / Skip / Abort"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous verification:&lt;/strong&gt; "I can't tell if this passed. Mark Passed / Mark Failed / Re-verify"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication wall:&lt;/strong&gt; "I've detected a login page. I've Logged In / Skip / Abort"&lt;/li&gt;
&lt;/ul&gt;
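&lt;p&gt;A sketch of the confidence gate and the reason-to-options mapping (the option labels come from the list above; everything else, including field names and the audit shape, is illustrative):&lt;/p&gt;

```javascript
// Decision buttons shown to the operator, keyed by why the agent paused.
const OPTIONS = {
  low_confidence: ["Proceed", "Skip", "Re-analyze"],
  destructive: ["Allow", "Skip", "Abort"],
  ambiguous_verification: ["Mark Passed", "Mark Failed", "Re-verify"],
  auth_wall: ["I've Logged In", "Skip", "Abort"],
};

// Below the threshold, block on a human answer and record an audit entry.
async function maybeAskHuman(action, threshold, askHuman) {
  if (action.confidence >= threshold) return { paused: false };
  const askedAt = Date.now();
  const decision = await askHuman(action.question, OPTIONS[action.reason]);
  return {
    paused: true,
    audit: {
      askedAt,
      question: action.question,
      decision,
      tookMs: Date.now() - askedAt, // how long the operator took to decide
    },
  };
}
```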

&lt;p&gt;Every human intervention is logged with a timestamp, the question asked, the decision made, an optional human note, and how long the operator took to decide. This audit trail is included in the report — judges (and future auditors) can see that the AI operated with appropriate human oversight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration: Jira and Confluence
&lt;/h2&gt;

&lt;p&gt;Verifai plugs into the tools QA teams already use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Test specs come from three sources — Jira tickets (summary, description, and acceptance criteria are extracted via the REST API), Confluence pages (HTML storage format is converted to plain text, with optional child page inclusion), or free-form text pasted directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; When a bug is found, Verifai auto-creates a Jira ticket in the configured project. The ticket includes the bug title and description (enriched by Gemini), expected vs. actual behavior, severity-based priority mapping, a GCS screenshot link, and labels for traceability (&lt;code&gt;verifai-auto&lt;/code&gt;, &lt;code&gt;source-{ticket}&lt;/code&gt;, &lt;code&gt;failure-{type}&lt;/code&gt;).&lt;/p&gt;
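&lt;p&gt;The resulting ticket payload, sketched against the standard Jira REST issue-creation shape (the field values follow the description above, but the exact severity mapping and bug fields are illustrative):&lt;/p&gt;

```javascript
// Build the Jira create-issue payload for a bug Verifai found.
function bugTicketPayload(projectKey, bug, sourceTicket) {
  return {
    fields: {
      project: { key: projectKey },
      issuetype: { name: "Bug" },
      summary: bug.title, // enriched by Gemini
      description:
        `${bug.description}\n\n` +
        `Expected: ${bug.expected}\nActual: ${bug.actual}\n\n` +
        `Screenshot: ${bug.screenshotUrl}`, // GCS evidence link
      priority: { name: bug.severity === "critical" ? "Highest" : "Medium" },
      labels: ["verifai-auto", `source-${sourceTicket}`, `failure-${bug.failureType}`],
    },
  };
}

// The payload is then POSTed to /rest/api/2/issue on the configured Jira site.
```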

&lt;p&gt;This closes the loop: spec in Jira → Verifai tests → bugs back in Jira. The QA engineer reviews the results instead of performing the testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the vision loop.&lt;/strong&gt; We wasted time on a "generate plan, execute blindly" architecture before realizing the agent must see the browser before every action. This should have been the starting assumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model specialization from day one.&lt;/strong&gt; Our first version used one model for everything. Splitting into three models (Computer Use, verification, voice) should have been the architecture from the start — each model has a fundamentally different job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the reporting system early.&lt;/strong&gt; Tri-state reporting required touching nearly every file in the codebase when we added it. If we'd designed the type system with three states from the beginning, it would have saved significant refactoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Verifai is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/phamthanhhang208/verifai" rel="noopener noreferrer"&gt;https://github.com/phamthanhhang208/verifai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built with Gemini 3 Flash, Gemini 2.5 Flash Lite, Gemini 2.5 Flash TTS, Cloud Run, Firestore, Cloud Storage, and Cloud Build for the Gemini Live Agent Challenge.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verifai was built for the Gemini Live Agent Challenge hackathon (UI Navigator category). The source code and all implementation prompts are available in the GitHub repository.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>geminiliveagentchallenge</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
