<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Phạm Thanh Hằng</title>
    <description>The latest articles on DEV Community by Phạm Thanh Hằng (@phamthanhhang208).</description>
    <link>https://dev.to/phamthanhhang208</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3818721%2F7561a9fd-2cb6-4959-bf73-579675878cb1.jpeg</url>
      <title>DEV Community: Phạm Thanh Hằng</title>
      <link>https://dev.to/phamthanhhang208</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/phamthanhhang208"/>
    <language>en</language>
    <item>
      <title>VaultRoom: A DeFi Risk Agent Where Notion Is the Control Plane</title>
      <dc:creator>Phạm Thanh Hằng</dc:creator>
      <pubDate>Wed, 25 Mar 2026 04:16:01 +0000</pubDate>
      <link>https://dev.to/phamthanhhang208/vaultroom-a-defi-risk-agent-where-notion-is-the-control-plane-2nb1</link>
      <guid>https://dev.to/phamthanhhang208/vaultroom-a-defi-risk-agent-where-notion-is-the-control-plane-2nb1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;DeFi operators managing lending positions across multiple blockchains face a familiar problem: alerts scattered across Discord bots, risk tracking in spreadsheets, and critical decisions made via Telegram messages. There's no single source of truth, no structured escalation workflow, and no clean way for a human to stay in the loop when an AI agent flags something dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VaultRoom&lt;/strong&gt; is a multi-chain DeFi risk monitoring agent that turns Notion into a &lt;strong&gt;bidirectional control plane&lt;/strong&gt;. The agent monitors on-chain positions across Cardano and Ethereum, detects anomalies using rule-based checks and Gemini 2.5 Pro analysis, and manages a living Notion workspace — where human operators set thresholds, approve escalations, and receive AI-written daily digests.&lt;/p&gt;

&lt;p&gt;Every single Notion interaction flows through the &lt;strong&gt;remote Notion MCP server&lt;/strong&gt; via OAuth. Zero &lt;code&gt;@notionhq/client&lt;/code&gt; SDK usage. MCP is the protocol layer between the agent and Notion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key idea: Notion isn't a dashboard. It's the control plane. Data flows both ways.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the agent does:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Polls &lt;strong&gt;Cardano&lt;/strong&gt; (Blockfrost) and &lt;strong&gt;Ethereum&lt;/strong&gt; (ethers.js) for wallet balances and transactions&lt;/li&gt;
&lt;li&gt;Detects risk signals: health factor drops, whale movements, balance anomalies&lt;/li&gt;
&lt;li&gt;Calls &lt;strong&gt;Gemini 2.5 Pro&lt;/strong&gt; to analyze critical signals and generate plain-English risk assessments&lt;/li&gt;
&lt;li&gt;Writes risk events, position updates, and alerts to Notion databases via MCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalates critical alerts&lt;/strong&gt; and waits for human approval in Notion before resolving&lt;/li&gt;
&lt;li&gt;Leaves &lt;strong&gt;comments&lt;/strong&gt; on Notion pages to acknowledge human decisions&lt;/li&gt;
&lt;li&gt;Generates rich &lt;strong&gt;daily digest pages&lt;/strong&gt; with tables, callouts, and toggles — all via Notion-flavored Markdown through MCP&lt;/li&gt;
&lt;/ul&gt;
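&lt;p&gt;The rule-based layer boils down to pure checks over a position snapshot. Here's an illustrative sketch (not VaultRoom's actual code; the &lt;code&gt;Position&lt;/code&gt; shape and the 50% balance-move threshold are assumptions):&lt;/p&gt;

```typescript
// Illustrative sketch of rule-based risk checks (not VaultRoom's actual code).
// The Position shape and thresholds here are assumptions for the example.
interface Position {
  chain: "cardano" | "ethereum";
  wallet: string;
  healthFactor: number;   // < 1.0 means liquidatable on most lending protocols
  balance: number;        // native-token balance this cycle
  prevBalance: number;    // balance at the previous polling cycle
}

interface RiskSignal {
  severity: "info" | "warning" | "critical";
  kind: "health_factor" | "balance_anomaly";
  message: string;
}

function detectSignals(p: Position, hfThreshold = 1.2): RiskSignal[] {
  const signals: RiskSignal[] = [];

  // Health factor: critical below 1.0, warning below the configured threshold.
  if (p.healthFactor < 1.0) {
    signals.push({ severity: "critical", kind: "health_factor",
      message: `Health factor ${p.healthFactor} is below 1.0 — liquidation risk` });
  } else if (p.healthFactor < hfThreshold) {
    signals.push({ severity: "warning", kind: "health_factor",
      message: `Health factor ${p.healthFactor} is below threshold ${hfThreshold}` });
  }

  // Balance anomaly: flag moves larger than 50% between polling cycles.
  if (p.prevBalance > 0) {
    const change = Math.abs(p.balance - p.prevBalance) / p.prevBalance;
    if (change > 0.5) {
      signals.push({ severity: "warning", kind: "balance_anomaly",
        message: `Balance moved ${(change * 100).toFixed(0)}% in one cycle` });
    }
  }
  return signals;
}
```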

&lt;h3&gt;
  
  
  What the human does (in Notion):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Configures which wallets to monitor and sets alert thresholds&lt;/li&gt;
&lt;li&gt;Adds protocols to a watchlist&lt;/li&gt;
&lt;li&gt;Reviews AI-generated risk analysis&lt;/li&gt;
&lt;li&gt;Approves or rejects escalated alerts by changing a status field&lt;/li&gt;
&lt;li&gt;Reads daily portfolio digests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4ewgn8o3schqbl9hs4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4ewgn8o3schqbl9hs4v.png" alt="Risk Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wpt5bz66rsxsvysag8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wpt5bz66rsxsvysag8b.png" alt="Daily Digest"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2cZroZTegYI"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/phamthanhhang208" rel="noopener noreferrer"&gt;
        phamthanhhang208
      &lt;/a&gt; / &lt;a href="https://github.com/phamthanhhang208/vault-room" rel="noopener noreferrer"&gt;
        vault-room
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🏦 VaultRoom&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Multi-chain DeFi risk agent with Notion as the control plane — powered by Notion MCP.&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Built for the &lt;a href="https://dev.to/challenges/notion-2026-03-04" rel="nofollow"&gt;Notion MCP Challenge&lt;/a&gt; · March 2026&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;DeFi operators managing positions across Cardano and Ethereum rely on scattered tools — Discord bots for alerts, spreadsheets for tracking, Telegram for coordination. There's no single source of truth, no structured escalation workflow, and no human-in-the-loop approval for critical actions.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Solution&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;VaultRoom turns Notion into a DeFi risk control plane. An AI agent monitors on-chain positions, detects anomalies, and writes structured risk analysis directly into Notion databases — &lt;strong&gt;entirely through Notion MCP&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Human operators configure thresholds, approve escalations, and receive AI-written daily digests. All without leaving Notion.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Notion isn't a dashboard. It's the control plane. MCP is the protocol.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;

&lt;/div&gt;

  &lt;div class="js-render-enrichment-target"&gt;
    &lt;div class="render-plaintext-hidden"&gt;
      &lt;pre&gt;graph TB
    subgraph USER["👤 Human Operator"]
        NOTION_UI["Notion Workspace&amp;lt;br/&amp;gt;(Browser / App)"]
    end
    subgraph NOTION_MCP["☁️ Notion MCP Server&amp;lt;br/&amp;gt;&amp;lt;i&amp;gt;mcp.notion.com&amp;lt;/i&amp;gt;"]
        MCP_API["MCP Protocol&amp;lt;br/&amp;gt;Streamable HTTP +&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/phamthanhhang208/vault-room" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;This is where I went deep. VaultRoom connects to Notion's &lt;strong&gt;remote hosted MCP server&lt;/strong&gt; (&lt;code&gt;https://mcp.notion.com/mcp&lt;/code&gt;) via OAuth 2.0 with PKCE and uses &lt;strong&gt;7 MCP tools&lt;/strong&gt; as its core integration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MCP Tool&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;How VaultRoom Uses It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Finds databases by name, locates existing position pages for upsert logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Reads Config DB to get thresholds, reads Watchlist, polls Risk Dashboard for human approvals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-pages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Creates risk events, position entries, alert log items, and daily digest pages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-update-page&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Updates position data on re-scan, resolves escalated events after human approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-database&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Scaffolds the entire 6-database Notion workspace using SQL DDL syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-create-comment&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Write&lt;/td&gt;
&lt;td&gt;Agent leaves acknowledgment comments on escalated items after human approves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notion-get-comments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read&lt;/td&gt;
&lt;td&gt;Reads discussion threads on escalated risk events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
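&lt;p&gt;The upsert logic behind &lt;code&gt;notion-search&lt;/code&gt; plus &lt;code&gt;notion-create-pages&lt;/code&gt; / &lt;code&gt;notion-update-page&lt;/code&gt; reduces to a simple decision. A sketch (the tool names match the table above; the argument shapes are assumptions for illustration):&lt;/p&gt;

```typescript
// Sketch of the upsert decision: search for an existing position page first,
// then pick the right MCP write tool. Illustrative only — tool names match
// the table in the post, but the argument shapes are assumptions.
type ToolCall = { name: string; arguments: Record<string, unknown> };

function upsertPositionCall(
  existingPageUrl: string | undefined,  // result of a prior notion-search
  dataSourceId: string,                 // collection:// ID of the Positions DB
  properties: Record<string, string>,
): ToolCall {
  if (existingPageUrl) {
    // Page already exists: update it in place.
    return { name: "notion-update-page",
             arguments: { page_url: existingPageUrl, properties } };
  }
  // No page yet: create a new row in the database's data source.
  return { name: "notion-create-pages",
           arguments: { parent: { data_source_id: dataSourceId },
                        pages: [{ properties }] } };
}
```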

&lt;h3&gt;
  
  
  The Bidirectional Loop
&lt;/h3&gt;

&lt;p&gt;Most MCP integrations I've seen are one-directional — an agent writes to Notion. VaultRoom goes both ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human → Agent&lt;/strong&gt; (Notion → MCP → VaultRoom):&lt;/p&gt;

&lt;p&gt;The agent reads the Config and Watchlist databases every monitoring cycle. If a human changes a health factor threshold from 1.2 to 1.5 in the Notion UI, the agent picks it up on the next cycle and adjusts its detection sensitivity. No redeployment, no config files — the human edits Notion, the agent adapts.&lt;/p&gt;
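&lt;p&gt;A sketch of that per-cycle refresh: parse the raw property strings from a fetched Config page into typed thresholds, falling back to defaults. The property names here are made up for the example:&lt;/p&gt;

```typescript
// Sketch of the per-cycle config refresh: turn raw Notion property strings
// into typed thresholds with safe defaults. Property names are illustrative.
interface AgentConfig {
  healthFactorThreshold: number;
  whaleMoveUsd: number;
}

const DEFAULTS: AgentConfig = { healthFactorThreshold: 1.2, whaleMoveUsd: 100_000 };

function parseConfig(props: Record<string, string | undefined>): AgentConfig {
  // Reject missing, non-numeric, or non-positive values and keep the default.
  const num = (v: string | undefined, fallback: number) => {
    const n = Number(v);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };
  return {
    healthFactorThreshold: num(props["HF Threshold"], DEFAULTS.healthFactorThreshold),
    whaleMoveUsd: num(props["Whale Move USD"], DEFAULTS.whaleMoveUsd),
  };
}
```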

&lt;p&gt;&lt;strong&gt;Agent → Human&lt;/strong&gt; (VaultRoom → MCP → Notion):&lt;/p&gt;

&lt;p&gt;Risk events, positions, and alerts are written to structured databases. But the key showcase is the escalation flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facxvu69mbt42bxbaduhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facxvu69mbt42bxbaduhg.png" alt="Escalation flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Risk engine detects a critical signal (e.g., health factor drops below 1.0)&lt;/li&gt;
&lt;li&gt;Agent creates a Risk Dashboard entry via &lt;code&gt;notion-create-pages&lt;/code&gt; with status set to "Escalated"&lt;/li&gt;
&lt;li&gt;Agent polls the dashboard via &lt;code&gt;notion-search&lt;/code&gt; every cycle, waiting for status change&lt;/li&gt;
&lt;li&gt;Human reviews the AI analysis in Notion and changes status to "Approved"&lt;/li&gt;
&lt;li&gt;Agent detects the change, updates status to "Resolved" via &lt;code&gt;notion-update-page&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Agent leaves a comment on the page via &lt;code&gt;notion-create-comment&lt;/code&gt;: &lt;em&gt;"✅ VaultRoom agent acknowledged approval. Escalation resolved."&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last step — the agent commenting on a Notion page — is what makes this feel like a real conversation between the AI and the human, all inside Notion.&lt;/p&gt;
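&lt;p&gt;The polling logic can be sketched as a tiny state machine: given the status a human set in Notion, decide the agent's next MCP calls. The status names follow the flow above; the rest is illustrative:&lt;/p&gt;

```typescript
// Sketch of the escalation state machine. Status names follow the flow in
// the post; the action shapes are illustrative, not VaultRoom's actual code.
type Status = "Escalated" | "Approved" | "Rejected" | "Resolved";

function nextActions(status: Status): { tool: string; note?: string }[] {
  if (status === "Approved") {
    // Human signed off: mark resolved, then acknowledge with a comment.
    return [
      { tool: "notion-update-page", note: "status → Resolved" },
      { tool: "notion-create-comment",
        note: "✅ VaultRoom agent acknowledged approval. Escalation resolved." },
    ];
  }
  if (status === "Rejected") {
    return [{ tool: "notion-create-comment", note: "Escalation rejected by operator" }];
  }
  // "Escalated": still waiting — poll again next cycle. "Resolved": done.
  return [];
}
```

&lt;p&gt;Because the function is a pure mapping from status to actions, re-running it on an already-resolved event is a no-op, which keeps the polling loop idempotent.&lt;/p&gt;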

&lt;h3&gt;
  
  
  Notion-Flavored Markdown for Rich Pages
&lt;/h3&gt;

&lt;p&gt;The daily digest is the visual payoff. Instead of constructing JSON block arrays, the agent writes Notion-flavored Markdown through MCP. The server converts it to rich Notion blocks automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;gt;&lt;/code&gt; blockquotes become &lt;strong&gt;callouts&lt;/strong&gt; with portfolio snapshots&lt;/li&gt;
&lt;li&gt;Standard Markdown tables become &lt;strong&gt;Notion tables&lt;/strong&gt; with position data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; tags become &lt;strong&gt;toggles&lt;/strong&gt; containing the full Gemini analysis&lt;/li&gt;
&lt;li&gt;Numbered lists become recommendation items&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One MCP call, one Markdown string, and Notion renders a professional portfolio briefing with callouts, tables, and expandable sections. This is significantly cleaner than building block trees through the REST API.&lt;/p&gt;
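&lt;p&gt;A sketch of assembling such a digest as a single Markdown string. The structure mirrors the list above; VaultRoom's actual layout may differ:&lt;/p&gt;

```typescript
// Sketch of building the daily digest as one Notion-flavored Markdown string:
// a blockquote (→ callout), a table (→ Notion table), a <details> (→ toggle).
// The exact layout is illustrative.
interface Digest {
  totalUsd: number;
  positions: { name: string; chain: string; healthFactor: number }[];
  analysis: string; // full Gemini write-up
}

function digestMarkdown(d: Digest): string {
  const rows = d.positions
    .map(p => `| ${p.name} | ${p.chain} | ${p.healthFactor.toFixed(2)} |`)
    .join("\n");
  return [
    `> 💰 Portfolio value: $${d.totalUsd.toLocaleString("en-US")}`, // becomes a callout
    "",
    "| Position | Chain | Health Factor |",
    "| --- | --- | --- |",
    rows,                                                           // becomes a Notion table
    "",
    "<details><summary>Full AI analysis</summary>",                 // becomes a toggle
    "",
    d.analysis,
    "",
    "</details>",
  ].join("\n");
}
```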

&lt;h3&gt;
  
  
  Monitor Cycle Data Flow
&lt;/h3&gt;

&lt;p&gt;Each monitoring cycle follows a four-phase pattern — reading config from Notion, fetching on-chain data, detecting risks, and writing results back:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf33i4435tc0rdfsspcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuf33i4435tc0rdfsspcg.png" alt="Data Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetlml4v0xf4jemvls069.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetlml4v0xf4jemvls069.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Notion Database Schema
&lt;/h3&gt;

&lt;p&gt;VaultRoom manages 6 interconnected databases, all created programmatically via MCP using SQL DDL syntax:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76d9ms46f4snsbue6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76d9ms46f4snsbue6e.png" alt="ER Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why DeFi + Notion MCP?
&lt;/h3&gt;

&lt;p&gt;I build DeFi products professionally — lending protocols and yield platforms on Cardano. The risk scenarios in VaultRoom aren't hypothetical. Health factor monitoring, whale movement detection, and liquidation risk assessment are problems I deal with daily.&lt;/p&gt;

&lt;p&gt;DeFi is territory few challenge submissions venture into. VaultRoom brings genuine domain expertise to a real problem, and Notion MCP turns out to be a surprisingly natural fit for operational risk management: structured databases for tracking, rich pages for reporting, comments for human-agent communication, and the whole thing accessible from a phone without building a custom UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lessons Learned: Hosted MCP Quirks
&lt;/h3&gt;

&lt;p&gt;Building a custom MCP client against the hosted Notion MCP server taught me things the docs don't mention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The hosted MCP has its own OAuth&lt;/strong&gt; — You can't use a Notion REST API token. The MCP server uses PKCE with dynamic client registration. I implemented the full discovery flow (RFC 9728 protected resource metadata → RFC 8414 authorization server metadata).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL DDL for database creation&lt;/strong&gt; — &lt;code&gt;CREATE TABLE&lt;/code&gt; syntax with custom types: &lt;code&gt;TITLE&lt;/code&gt;, &lt;code&gt;RICH_TEXT&lt;/code&gt;, &lt;code&gt;SELECT('opt1', 'opt2')&lt;/code&gt;, &lt;code&gt;MULTI_SELECT&lt;/code&gt;, &lt;code&gt;CHECKBOX&lt;/code&gt;. Not the JSON schema from the REST API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Property values are SQLite-flavored&lt;/strong&gt; — Checkboxes are &lt;code&gt;"__YES__"&lt;/code&gt; / &lt;code&gt;"__NO__"&lt;/code&gt; (not booleans). Dates need expanded keys like &lt;code&gt;"date:Field Name:start"&lt;/code&gt;. Multi-selects are JSON array strings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pages in databases use &lt;code&gt;data_source_id&lt;/code&gt;&lt;/strong&gt; — When creating rows, you reference the &lt;code&gt;collection://&lt;/code&gt; ID, not the database page ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Notion-flavored Markdown just works&lt;/strong&gt; — Blockquotes → callouts, tables → rich Notion tables, &lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; → toggles. One string, one MCP call, beautiful output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
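&lt;p&gt;Those property-value quirks are easy to centralize in one encoder. A sketch, assuming the formats listed above (field names are examples):&lt;/p&gt;

```typescript
// Sketch of encoding JS values into the SQLite-flavored property strings the
// hosted MCP server expects, per the quirks above. Field names are examples.
function encodeProperties(fields: {
  checkbox?: { name: string; value: boolean };
  date?: { name: string; start: string };
  multiSelect?: { name: string; values: string[] };
}): Record<string, string> {
  const props: Record<string, string> = {};
  if (fields.checkbox) {
    // Checkboxes are "__YES__" / "__NO__", not booleans.
    props[fields.checkbox.name] = fields.checkbox.value ? "__YES__" : "__NO__";
  }
  if (fields.date) {
    // Dates use expanded keys like "date:Field Name:start".
    props[`date:${fields.date.name}:start`] = fields.date.start;
  }
  if (fields.multiSelect) {
    // Multi-selects are JSON array strings.
    props[fields.multiSelect.name] = JSON.stringify(fields.multiSelect.values);
  }
  return props;
}
```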

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runtime&lt;/td&gt;
&lt;td&gt;Node.js 20 + TypeScript (strict)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Client&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@modelcontextprotocol/sdk&lt;/code&gt; (Streamable HTTP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notion MCP&lt;/td&gt;
&lt;td&gt;Remote hosted &lt;code&gt;mcp.notion.com&lt;/code&gt; (OAuth 2.0 PKCE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cardano&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@blockfrost/blockfrost-js&lt;/code&gt; (Preprod testnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ethereum&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ethers&lt;/code&gt; v6 (Sepolia testnet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@google/generative-ai&lt;/code&gt; (Gemini 2.5 Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler&lt;/td&gt;
&lt;td&gt;&lt;code&gt;node-cron&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zod&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;winston&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Built solo for the Notion MCP Challenge · March 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building Verifai: How We Used 3 Gemini Models to Create an AI QA Agent That Finds Real Bugs</title>
      <dc:creator>Phạm Thanh Hằng</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:31:52 +0000</pubDate>
      <link>https://dev.to/phamthanhhang208/building-verifai-how-we-used-3-gemini-models-to-create-an-ai-qa-agent-that-finds-real-bugs-16m7</link>
      <guid>https://dev.to/phamthanhhang208/building-verifai-how-we-used-3-gemini-models-to-create-an-ai-qa-agent-that-finds-real-bugs-16m7</guid>
      <description>&lt;p&gt;&lt;em&gt;An inside look at building an autonomous QA testing agent with Gemini Computer Use, multi-model architecture, and Google Cloud — for the Gemini Live Agent Challenge.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;QA engineers spend hours clicking through the same flows after every sprint. They write the same Jira tickets. They attach the same screenshots. They catch the obvious bugs, but the subtle ones — the ones that only show up with specific user accounts or edge-case data — slip through.&lt;/p&gt;

&lt;p&gt;We built Verifai to change that. It's an AI agent that reads your Jira tickets, opens a real browser, tests your application the way a human QA engineer would, and files Jira tickets for the bugs it finds — complete with screenshots and reproduction steps.&lt;/p&gt;

&lt;p&gt;This post walks through exactly how we built it using Google's AI models and cloud services, the architectural decisions that made it work, and the mistakes we made along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: An Agent That Sees Before It Acts
&lt;/h2&gt;

&lt;p&gt;Most browser automation tools follow a script. Playwright runs a sequence of commands. Selenium clicks selectors. If the page layout changes or an element moves, the test breaks.&lt;/p&gt;

&lt;p&gt;Verifai doesn't run scripts. For every single action it takes, it follows this loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot&lt;/strong&gt; the current browser state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send the screenshot&lt;/strong&gt; to Gemini with the Computer Use tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini analyzes&lt;/strong&gt; what's on screen and decides what to do next&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; that one action in the browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot again&lt;/strong&gt; and &lt;strong&gt;verify&lt;/strong&gt; whether the expected outcome happened&lt;/li&gt;
&lt;li&gt;Repeat until the test step passes, fails, or can't be completed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI sees the live page before every decision. If a button moves, the login form looks different, or an unexpected popup appears, Gemini adapts. This is fundamentally different from running a pre-written test script — it's a Computer Use agent that happens to do QA.&lt;/p&gt;
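&lt;p&gt;With the browser and model injected as functions, the loop's control flow fits in a few lines. An illustrative sketch (Verifai's real loop tracks more state):&lt;/p&gt;

```typescript
// Sketch of the screenshot → decide → act → verify loop described above,
// with the browser and model injected so the control flow is visible.
// Illustrative only — Verifai's real loop tracks more state.
interface Deps {
  screenshot: () => Promise<string>;                             // base64 JPEG
  decide: (shot: string) => Promise<{ done: boolean; action?: string }>;
  execute: (action: string) => Promise<void>;
  verify: (shot: string) => Promise<boolean>;
}

async function runStep(deps: Deps, maxActions = 10): Promise<"passed" | "failed" | "incomplete"> {
  for (let i = 0; i < maxActions; i++) {
    const shot = await deps.screenshot();           // 1. capture current state
    const decision = await deps.decide(shot);       // 2-3. model picks next action
    if (decision.done) return "passed";
    if (!decision.action) return "failed";          // model can't proceed
    await deps.execute(decision.action);            // 4. perform one action
    const after = await deps.screenshot();          // 5. re-capture…
    if (await deps.verify(after)) return "passed";  //    …and verify the outcome
  }
  return "incomplete";                              // 6. gave up after maxActions
}
```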




&lt;h2&gt;
  
  
  Three Gemini Models, Three Distinct Jobs
&lt;/h2&gt;

&lt;p&gt;One of our most important design decisions was splitting work across three specialized Gemini models instead of using one model for everything. Each model was chosen for its specific capability:&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini 3 Flash — The Agent's Eyes and Hands
&lt;/h3&gt;

&lt;p&gt;This model powers the core agentic loop. Using the native Computer Use tool, Gemini 3 Flash looks at a screenshot and returns a structured action: "click at pixel coordinates (640, 350)" or "type 'standard_user' into the field at (400, 280)."&lt;/p&gt;

&lt;p&gt;The Computer Use tool is critical because it returns coordinate-based actions through a proper tool-calling protocol. The model isn't generating JSON text that we parse and hope is valid — it's using a structured tool interface that returns typed actions. This matters enormously for reliability.&lt;/p&gt;

&lt;p&gt;Here's what a single action decision looks like in the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-3-flash-preview&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;inlineData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image/jpeg&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;screenshotBase64&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are a QA browser agent looking at a live screenshot.
               Task: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
               Expected: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expectedBehavior&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
               Decide the NEXT single action.`&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;computerUse&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENVIRONMENT_BROWSER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model sees the actual rendered page — not the DOM, not the HTML source — the pixels on screen. It decides where to click based on what it sees, just like a human tester would.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gemini 2.5 Flash Lite — The Agent's Brain
&lt;/h3&gt;

&lt;p&gt;Every task that doesn't require Computer Use goes to Flash Lite. This includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spec parsing:&lt;/strong&gt; When you give Verifai a Jira ticket or Confluence page, Flash Lite reads the text and generates a sequential test plan — 5 to 8 atomic browser actions with expected outcomes for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step verification:&lt;/strong&gt; After each action executes, Flash Lite gets a fresh screenshot and checks: "Did the expected behavior happen?" It returns a structured verdict with a finding description and severity rating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug description enrichment:&lt;/strong&gt; When a step fails, Flash Lite writes a detailed bug title and description from the screenshot — suitable for a Jira ticket that a developer can actually act on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time narration:&lt;/strong&gt; During execution, Flash Lite generates one-sentence narration lines for the live transcript panel: "Clicking the login button — expecting redirect to inventory page."&lt;/p&gt;

&lt;p&gt;By routing all of these tasks to Flash Lite, we keep Computer Use calls reserved exclusively for browser action decisions.&lt;/p&gt;
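&lt;p&gt;The routing rule itself is trivial, which is the point: one branch decides which model handles a task. A sketch (the Flash Lite model ID is an assumption; Computer Use calls follow the snippet above):&lt;/p&gt;

```typescript
// Sketch of the model routing rule: Computer Use calls go to Gemini 3 Flash,
// everything else to Flash Lite. The Flash Lite model ID is an assumption.
type Task = "browser_action" | "spec_parsing" | "verification" | "enrichment" | "narration";

function modelFor(task: Task): string {
  // Only live browser decisions need the Computer Use tool.
  return task === "browser_action"
    ? "gemini-3-flash-preview"
    : "gemini-2.5-flash-lite";
}
```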

&lt;h3&gt;
  
  
  Gemini 2.5 Flash TTS — The Agent's Voice
&lt;/h3&gt;

&lt;p&gt;The most memorable demo feature: the agent speaks aloud during test execution. At key moments — session start, each step beginning, bug discovery, session end — we send a text narration to Gemini TTS and stream the audio to the frontend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-flash-tts-preview&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`You are Verifai, a professional QA agent. 
             Narrate: "Bug found — cart badge not updating after add to cart"`&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;responseModalities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;AUDIO&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;speechConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;voiceConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;prebuiltVoiceConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;voiceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Kore&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voice is entirely fire-and-forget — it never blocks step execution or slows down the session. If TTS fails or is rate-limited, text narration continues normally. But when it works, the effect is striking: the AI narrates its own testing in real time.&lt;/p&gt;
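&lt;p&gt;The fire-and-forget pattern is worth showing because it's easy to get wrong: kick off the TTS call without awaiting it, and attach an error handler so a rejection can never crash the session. A sketch with the TTS call injected:&lt;/p&gt;

```typescript
// Sketch of the fire-and-forget narration pattern: start the TTS call without
// awaiting it, and swallow failures so voice never blocks test execution.
function speakInBackground(
  synthesize: (text: string) => Promise<void>,  // TTS call (injected)
  text: string,
  onError: (e: unknown) => void = () => {},     // e.g. log and continue
): void {
  // Deliberately not awaited — the test step proceeds immediately.
  // .catch() guarantees a rejected TTS promise can't become an unhandled error.
  void synthesize(text).catch(onError);
}
```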




&lt;h2&gt;
  
  
  Google Cloud: The Infrastructure Layer
&lt;/h2&gt;

&lt;p&gt;Verifai uses four Google Cloud services, each solving a specific problem:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Run — Hosting the Agent
&lt;/h3&gt;

&lt;p&gt;The agent server runs on Cloud Run, configured for an unusual workload: Playwright needs memory (Chromium is hungry), WebSocket connections need session affinity, and test sessions can run for several minutes. Our Cloud Run config:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2Gi memory / 2 CPU — headroom for Chromium + screenshot processing&lt;/li&gt;
&lt;li&gt;10-minute timeout — sessions with many steps need room&lt;/li&gt;
&lt;li&gt;Session affinity — WebSocket connections must stick to one instance&lt;/li&gt;
&lt;li&gt;Low concurrency (5 per instance) — each session runs its own browser&lt;/li&gt;
&lt;/ul&gt;
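&lt;p&gt;As a hypothetical &lt;code&gt;gcloud&lt;/code&gt; invocation matching those settings (service and image names are placeholders, not Verifai's actual deployment):&lt;/p&gt;

```shell
# Illustrative deploy command reflecting the configuration above.
gcloud run deploy verifai-agent \
  --image=us-central1-docker.pkg.dev/my-project/verifai/agent:latest \
  --memory=2Gi \
  --cpu=2 \
  --timeout=600 \
  --concurrency=5 \
  --session-affinity
```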

&lt;h3&gt;
  
  
  Firestore — Report Persistence
&lt;/h3&gt;

&lt;p&gt;Every test session generates a report with tri-state step results (passed, failed, incomplete), bug details, and metadata. These are saved to Firestore so users can browse test history, re-open past reports, and track trends over time.&lt;/p&gt;

&lt;p&gt;The report structure is denormalized — each document contains the full step list, all bugs, and computed metrics. This keeps reads simple (one document fetch per report) at the cost of larger documents, which is the right tradeoff for a QA reporting tool where writes happen once and reads happen many times.&lt;/p&gt;
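&lt;p&gt;The write-once, read-many shape can be sketched like this; field names are illustrative, not Verifai's actual schema:&lt;/p&gt;

```javascript
// Denormalized report document: the full step list and all bugs are embedded,
// and metrics are computed once at write time, so a read is one document fetch.
function buildReportDoc(sessionId, steps, bugs) {
  return {
    sessionId,
    steps, // full step list embedded, no subcollection reads needed
    bugs,  // all bug details embedded
    metrics: {
      total: steps.length,
      passed: steps.filter((s) => s.status === "passed").length,
      failed: steps.filter((s) => s.status === "failed").length,
      incomplete: steps.filter((s) => s.status === "incomplete").length,
    },
    createdAt: Date.now(),
  };
}

// Persisting is then a single set, and reading back is a single get, e.g.:
//   await db.collection("reports").doc(sessionId).set(buildReportDoc(...));
```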

&lt;h3&gt;
  
  
  Cloud Storage — Bug Screenshots
&lt;/h3&gt;

&lt;p&gt;When Verifai finds a bug, the screenshot evidence needs a permanent home. We upload each bug screenshot to GCS and include the public URL in the Jira ticket. The file path includes the session ID and step ID for organization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs://verifai-screenshots/screenshots/{sessionId}/{stepId}-{timestamp}.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These URLs go directly into Jira ticket descriptions, so developers can see exactly what the AI saw when it identified the bug.&lt;/p&gt;
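&lt;p&gt;Building that path and uploading it can be sketched as follows; the bucket name matches the example above, but the upload helper and URL format are assumptions based on the standard &lt;code&gt;@google-cloud/storage&lt;/code&gt; client:&lt;/p&gt;

```javascript
// Build the object path shown above: screenshots/{sessionId}/{stepId}-{timestamp}.jpg
function screenshotPath(sessionId, stepId, timestamp) {
  return `screenshots/${sessionId}/${stepId}-${timestamp}.jpg`;
}

// Upload with @google-cloud/storage (assumed dependency), then hand the
// public URL to the Jira ticket, e.g.:
//   const file = storage.bucket("verifai-screenshots").file(screenshotPath(id, step, ts));
//   await file.save(jpegBuffer, { contentType: "image/jpeg" });
//   const url = `https://storage.googleapis.com/verifai-screenshots/${file.name}`;
```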

&lt;h3&gt;
  
  
  Cloud Build — CI/CD
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;cloudbuild.yaml&lt;/code&gt; handles the deployment pipeline: build the Docker image (including Playwright's Chromium and all its system dependencies), push to Artifact Registry, deploy to Cloud Run. The Dockerfile is carefully constructed — Chromium needs a specific set of system libraries that are easy to miss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    libnss3 libnspr4 libdbus-1-3 libatk1.0-0 libatk-bridge2.0-0 &lt;span class="se"&gt;\
&lt;/span&gt;    libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 &lt;span class="se"&gt;\
&lt;/span&gt;    libxfixes3 libxrandr2 libgbm1 libpango-1.0-0 libcairo2 &lt;span class="se"&gt;\
&lt;/span&gt;    libasound2 libatspi2.0-0 libwayland-client0 fonts-liberation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Miss one library and Chromium silently fails to launch. We learned this the hard way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision Loop in Detail
&lt;/h2&gt;

&lt;p&gt;The vision loop is the heart of Verifai. Here's what actually happens for a single test step — say, "Enter username 'standard_user' in the login form":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Observe.&lt;/strong&gt; Playwright takes a JPEG screenshot (compressed to 1024px width for token efficiency) and captures the accessibility tree (AOM snapshot). Both are sent to Gemini 3 Flash.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Decide.&lt;/strong&gt; Gemini sees the screenshot, reads the accessibility tree, and uses the Computer Use tool to decide: "type 'standard_user' at coordinates (640, 280)." It returns a structured action, not free-form text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Highlight.&lt;/strong&gt; Before executing, we inject a red circle overlay at the target coordinates and take another screenshot. This streams to the frontend so the operator can see exactly what the AI is targeting — a visual confirmation of intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Execute.&lt;/strong&gt; Playwright clicks at the coordinates, then types with a 30ms delay between keystrokes (simulating human typing speed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Wait.&lt;/strong&gt; A brief pause for the page to settle — DOM updates, network requests, animations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Verify.&lt;/strong&gt; A fresh screenshot goes to Gemini 2.5 Flash Lite with the question: "Expected behavior: 'Username field populated with standard_user.' Did this happen?" The model returns a structured pass/fail verdict.&lt;/p&gt;

&lt;p&gt;If the action fails (wrong coordinates, element not clickable), the self-heal kicks in: Gemini sees the error context and the new screenshot, then tries a different approach — different coordinates, a different element, or a different action type entirely.&lt;/p&gt;
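&lt;p&gt;The six steps plus self-heal reduce to a small loop. This is a sketch, not Verifai's actual code: the four callbacks stand in for the Playwright and Gemini calls described above, and the error context from a failed attempt is fed back into the next decision:&lt;/p&gt;

```javascript
// Observe → decide → execute → verify, with self-heal retries on failed actions.
async function runStep(step, deps, maxAttempts = 2) {
  const { observe, decide, execute, verify } = deps;
  let lastError = null;
  for (let attempt = 0; attempt !== maxAttempts; attempt++) {
    const context = await observe();                       // screenshot + AOM snapshot
    const action = await decide(step, context, lastError); // structured Computer Use action
    try {
      await execute(action);                               // click/type via Playwright
    } catch (err) {
      lastError = err;                                     // self-heal: retry with error context
      continue;
    }
    const after = await observe();                         // fresh screenshot of the result
    return (await verify(step.expected, after)) ? "passed" : "failed";
  }
  return "incomplete";                                     // step couldn't be assessed
}
```

&lt;p&gt;Note the return values map directly onto the tri-state results discussed below: a verified outcome yields passed or failed, while exhausting retries yields incomplete rather than a fake failure.&lt;/p&gt;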




&lt;h2&gt;
  
  
  Tri-State Reporting: What Happened vs. What's a Bug
&lt;/h2&gt;

&lt;p&gt;Early in development, we had binary pass/fail. Then reality hit: what happens when a step times out? Or the page takes too long to load? Or Chromium crashes? Those aren't bugs — they're infrastructure noise. But in a binary system, they look like failures.&lt;/p&gt;

&lt;p&gt;Our solution: tri-state reporting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Passed&lt;/strong&gt; — the step executed and Gemini verified the expected outcome appeared on screen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt; — Gemini verified that something is wrong (a real product bug, with evidence)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete&lt;/strong&gt; — the step couldn't be assessed (timeout, crash, rate limit, user skip)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The overall report status follows clear rules: if any step failed, the report status is "Failed." If no steps failed but some are incomplete, it's "Incomplete." Only when everything passes is it "Passed."&lt;/p&gt;
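&lt;p&gt;Those rules fit in a few lines; names are illustrative:&lt;/p&gt;

```javascript
// Overall report status: any failure wins, then any incomplete, else passed.
function reportStatus(steps) {
  if (steps.some((s) => s.status === "failed")) return "Failed";
  if (steps.some((s) => s.status === "incomplete")) return "Incomplete";
  return "Passed";
}
```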

&lt;p&gt;This distinction matters because the report tells you "we found 2 real bugs and couldn't check 1 step due to a timeout" instead of "3 things failed." Developers trust the results because failures always mean verified bugs, never infrastructure noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Human-in-the-Loop: The AI Knows What It Doesn't Know
&lt;/h2&gt;

&lt;p&gt;Full automation is impressive, but responsible AI requires admitting uncertainty. Verifai includes a Human-in-the-Loop system that activates when the AI encounters situations it can't handle confidently.&lt;/p&gt;

&lt;p&gt;Every action decision includes a confidence score (0.0 to 1.0). When confidence drops below a configurable threshold, the agent pauses and asks the human operator for guidance. A modal appears with the current screenshot, the AI's question, and context-appropriate decision buttons.&lt;/p&gt;

&lt;p&gt;The options change based on why the agent paused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low confidence action:&lt;/strong&gt; "Does this look right? Proceed / Skip / Re-analyze"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive action detected:&lt;/strong&gt; "This might delete data. Allow / Skip / Abort"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguous verification:&lt;/strong&gt; "I can't tell if this passed. Mark Passed / Mark Failed / Re-verify"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication wall:&lt;/strong&gt; "I've detected a login page. I've Logged In / Skip / Abort"&lt;/li&gt;
&lt;/ul&gt;
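&lt;p&gt;A sketch of the confidence gate and the reason-to-options mapping (the option labels come from the list above; everything else, including field names and the audit shape, is illustrative):&lt;/p&gt;

```javascript
// Decision buttons shown to the operator, keyed by why the agent paused.
const OPTIONS = {
  low_confidence: ["Proceed", "Skip", "Re-analyze"],
  destructive: ["Allow", "Skip", "Abort"],
  ambiguous_verification: ["Mark Passed", "Mark Failed", "Re-verify"],
  auth_wall: ["I've Logged In", "Skip", "Abort"],
};

// Below the threshold, block on a human answer and record an audit entry.
async function maybeAskHuman(action, threshold, askHuman) {
  if (action.confidence >= threshold) return { paused: false };
  const askedAt = Date.now();
  const decision = await askHuman(action.question, OPTIONS[action.reason]);
  return {
    paused: true,
    audit: {
      askedAt,
      question: action.question,
      decision,
      tookMs: Date.now() - askedAt, // how long the operator took to decide
    },
  };
}
```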

&lt;p&gt;Every human intervention is logged with a timestamp, the question asked, the decision made, an optional human note, and how long the operator took to decide. This audit trail is included in the report — judges (and future auditors) can see that the AI operated with appropriate human oversight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integration: Jira and Confluence
&lt;/h2&gt;

&lt;p&gt;Verifai plugs into the tools QA teams already use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt; Test specs come from three sources — Jira tickets (summary, description, and acceptance criteria are extracted via the REST API), Confluence pages (HTML storage format is converted to plain text, with optional child page inclusion), or free-form text pasted directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; When a bug is found, Verifai auto-creates a Jira ticket in the configured project. The ticket includes the bug title and description (enriched by Gemini), expected vs. actual behavior, severity-based priority mapping, a GCS screenshot link, and labels for traceability (&lt;code&gt;verifai-auto&lt;/code&gt;, &lt;code&gt;source-{ticket}&lt;/code&gt;, &lt;code&gt;failure-{type}&lt;/code&gt;).&lt;/p&gt;
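&lt;p&gt;The resulting ticket payload, sketched against the standard Jira REST issue-creation shape (the field values follow the description above, but the exact severity mapping and bug fields are illustrative):&lt;/p&gt;

```javascript
// Build the Jira create-issue payload for a bug Verifai found.
function bugTicketPayload(projectKey, bug, sourceTicket) {
  return {
    fields: {
      project: { key: projectKey },
      issuetype: { name: "Bug" },
      summary: bug.title, // enriched by Gemini
      description:
        `${bug.description}\n\n` +
        `Expected: ${bug.expected}\nActual: ${bug.actual}\n\n` +
        `Screenshot: ${bug.screenshotUrl}`, // GCS evidence link
      priority: { name: bug.severity === "critical" ? "Highest" : "Medium" },
      labels: ["verifai-auto", `source-${sourceTicket}`, `failure-${bug.failureType}`],
    },
  };
}

// The payload is then POSTed to /rest/api/2/issue on the configured Jira site.
```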

&lt;p&gt;This closes the loop: spec in Jira → Verifai tests → bugs back in Jira. The QA engineer reviews the results instead of performing the testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the vision loop.&lt;/strong&gt; We wasted time on a "generate plan, execute blindly" architecture before realizing the agent must see the browser before every action. This should have been the starting assumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model specialization from day one.&lt;/strong&gt; Our first version used one model for everything. Splitting into three models (Computer Use, verification, voice) should have been the architecture from the start — each model has a fundamentally different job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the reporting system early.&lt;/strong&gt; Tri-state reporting required touching nearly every file in the codebase when we added it. If we'd designed the type system with three states from the beginning, it would have saved significant refactoring.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Verifai is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/phamthanhhang208/verifai" rel="noopener noreferrer"&gt;https://github.com/phamthanhhang208/verifai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Built with Gemini 3 Flash, Gemini 2.5 Flash Lite, Gemini 2.5 Flash TTS, Cloud Run, Firestore, Cloud Storage, and Cloud Build for the Gemini Live Agent Challenge.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verifai was built for the Gemini Live Agent Challenge hackathon (UI Navigator category). The source code and all implementation prompts are available in the GitHub repository.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>geminiliveagentchallenge</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
