Built for the Gemini Live Agent Challenge · **UI Navigator** track
The pitch: Say "Compare pricing for the top 5 CRM tools" → get a spoken briefing and a comparison table in minutes. No selectors. No DOM scraping. Just natural language and pixels.
Table of contents
- The idea
- Our agent loop (Plan → Navigate → Extract → Verify)
- Gemini vision: the "see the page" layer
- ElevenLabs: Vera's voice
- Firecrawl + Perplexity
- Stack and deployment
- What we'd do next
- Try it
The idea
Voyance is an AI research agent that turns a sentence into competitive intelligence. Here’s what it does, end to end:
| Step | What happens |
|---|---|
| 1 | You give a natural-language query (e.g. "Top 5 CRM tools"). |
| 2 | The agent plans which sites to visit (Perplexity + Gemini). |
| 3 | Playwright opens real pages and captures screenshots — no DOM access. |
| 4 | Firecrawl is tried first for speed. |
| 5 | When Firecrawl fails, Gemini vision "reads" the screenshot and extracts structured data. |
| 6 | Perplexity verifies key claims so we don’t ship hallucinations. |
| 7 | You get a sortable table, CSV/HTML export, and Vera (ElevenLabs) reading the briefing aloud. |
No site-specific scrapers. No brittle selectors. It works on any site, through redesigns, forever — because it sees the page like a human does.
Hackathon alignment: We use the Google GenAI SDK (google-generativeai) for Gemini — planning, vision, and synthesis. The agent is a custom async loop (plan → navigate → extract → verify → report), not the Google ADK library. Backend is on Google Cloud Run. That satisfies the requirement: "Agents must be built using either Google GenAI SDK or ADK."
1. Our agent loop (Plan → Navigate → Extract → Verify)
We built a custom agent loop powered by the Google GenAI SDK for all Gemini calls:
Plan → Navigate → Extract → Verify → Report
Everything runs in one async loop and streams updates over WebSockets so the UI stays live.
┌─────────────┐
│ User query │
└──────┬──────┘
│
▼ PLAN Perplexity / Gemini → target URLs + data points
▼ NAV Playwright visits each URL, captures screenshot
▼ EXTRACT Firecrawl (fast) or Gemini Vision (fallback)
▼ VERIFY Perplexity: "Company X pricing is $49/mo" ✓/✗
▼ REPORT Gemini narrative → ElevenLabs (Vera) speaks it
│
▼
┌─────────────┐
│ Table + │
│ Vera audio │
└─────────────┘
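The loop above can be sketched as a single async function. This is a minimal illustration with every stage stubbed out — in the real agent each stub wraps a service call (Perplexity/Gemini for planning, Playwright for screenshots, Firecrawl/Gemini for extraction, Perplexity for verification):

```python
import asyncio

async def plan(query):          # → target URLs (Perplexity + Gemini in the real agent)
    return ["https://example.com/pricing"]

async def navigate(url):        # → screenshot bytes (Playwright in the real agent)
    return b"\x89PNG-stub"

async def extract(url, shot):   # → structured record (Firecrawl or Gemini vision)
    return {"url": url, "company": "Example", "pricing": "$49/mo"}

async def verify(record):       # → record plus a Perplexity-backed flag
    return {**record, "verified": True}

async def run_agent(query):
    records = []
    for url in await plan(query):
        shot = await navigate(url)
        records.append(await verify(await extract(url, shot)))
    # This dict is what the REPORT stage (Gemini narrative → ElevenLabs) consumes.
    return {"query": query, "records": records}

result = asyncio.run(run_agent("Top 5 CRM tools"))
```

In production each iteration also streams a status update over the WebSocket so the UI stays live between stages.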
Planning
We first ask Perplexity for competitor URLs (e.g. "top 5 CRM tools") so we get real, up-to-date links. If that returns nothing, Gemini generates a research plan: intent, target_sites, data_points, search_queries. We also have keyword-based fallbacks for CRM, voice AI, project management, and more — so we never end up with zero sites.
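The keyword fallback is the simplest of the three planning tiers. A toy version (the category map here is illustrative — the real list covers more categories and more URLs per category):

```python
# Last-resort planning tier: if Perplexity and Gemini both return nothing,
# match the query against a hand-curated keyword → URLs map.
FALLBACK_SITES = {
    "crm": ["https://www.salesforce.com/pricing", "https://www.hubspot.com/pricing"],
    "voice ai": ["https://elevenlabs.io/pricing"],
}

def fallback_plan(query):
    q = query.lower()
    for keyword, urls in FALLBACK_SITES.items():
        if keyword in q:
            return urls
    return []  # caller reports "no sites found" instead of crashing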
Navigate
For each URL we use Playwright (headless Chromium) to load the page and capture a screenshot. No DOM, no selectors — just pixels. That screenshot is the only input we send to Gemini when we need the vision path.
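A sketch of that capture step with Playwright's async API (viewport size and timeout are illustrative defaults, not the shipped values; `capture` needs `playwright install chromium` to have been run):

```python
import re

def slugify(url):
    """Turn a URL into a safe per-session screenshot filename."""
    return re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")

async def capture(url):
    # Imported lazily so the sketch loads without the dependency installed.
    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1280, "height": 2000})
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        shot = await page.screenshot()  # pixels only — we never touch the DOM
        await browser.close()
        return shot
```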
Extract
We try Firecrawl first with a small schema (company, pricing tiers, features, segment). If it succeeds, we use it. If it fails (SPA, paywall, rate limit), we send the screenshot to Gemini 2.0 Flash and ask for the same fields in JSON. Fast path = Firecrawl; fallback = Gemini vision. Zero DOM dependency.
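The control flow reduces to a few lines. In this sketch the two extractors are passed in as callables so the decision logic stands alone; `usable()` encodes the "did we get a company and pricing?" check described above:

```python
def usable(data):
    """Firecrawl output counts as usable when it names the company and pricing."""
    return bool(data and data.get("company_name") and data.get("pricing_tiers"))

def extract(url, screenshot, firecrawl_extract, gemini_vision_extract):
    try:
        data = firecrawl_extract(url)          # fast path on "easy" pages
        if usable(data):
            return data
    except Exception:
        pass                                   # SPA, paywall, rate limit…
    return gemini_vision_extract(screenshot)   # pixels-only fallback
```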
Verify
For each competitor with pricing, we call Perplexity with a claim like "Company X pricing is $49/month". A short fact-checker prompt + low temperature gives us a verified flag and confidence — we use that for verified / unconfirmed / low badges in the UI.
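A sketch of the claim-building and badge-mapping logic (the confidence thresholds here are illustrative, not the shipped values):

```python
def make_claim(record):
    # e.g. {"company": "Company X", "price": "$49/month"} → the claim we fact-check
    return f"{record['company']} pricing is {record['price']}"

def badge_for(verified, confidence):
    """Map the fact-checker's answer to a UI badge. Thresholds are assumptions."""
    if verified and confidence >= 0.7:
        return "verified"
    if verified is None or confidence < 0.4:
        return "low"
    return "unconfirmed"
```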
Report
We aggregate all records, ask Gemini to synthesize a short narrative, and send that text to ElevenLabs as the final Vera briefing.
Interrupts (e.g. "skip this site" or "focus on HubSpot") are supported: we store the instruction and, at the start of the next site iteration, re-run Gemini to re-plan and adjust the URL list.
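The mechanics are consume-once: store the instruction when it arrives over the WebSocket, take it at the top of the next iteration. A toy version — the real agent re-runs Gemini to re-plan, while this stand-in only handles "skip":

```python
class InterruptBox:
    """Holds at most one pending user instruction; reading it clears it."""
    def __init__(self):
        self._pending = None

    def set(self, instruction):
        self._pending = instruction

    def take(self):
        instruction, self._pending = self._pending, None
        return instruction

def apply_interrupt(urls, instruction):
    # Stand-in for the Gemini re-planning call: drop URLs matching "skip <name>".
    if instruction and instruction.startswith("skip "):
        needle = instruction.removeprefix("skip ").lower()
        return [u for u in urls if needle not in u.lower()]
    return urls
```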
Key features in the UI: natural-language input, 3–5 sites per task, sortable comparison table, CSV/HTML export, and Vera (ElevenLabs) reading the briefing aloud — plus Perplexity-backed fact verification for pricing and claims.
2. Gemini vision: the "see the page" layer
Gemini 2.0 Flash is the backbone of our "UI Navigator" story. We use it in three places:
① Research plan
From the user’s sentence we get a JSON plan: intent, confirmation message, target URLs, data points, exclusions, and Perplexity search queries. We strip markdown if needed and use it to drive the loop.
② Screenshot analysis (the core)
When Firecrawl doesn’t return usable data, we base64 the screenshot and send it to Gemini with a prompt that asks for:
- company_name, page_type, pricing_tiers, key_features, target_segment, confidence
We tell the model to infer company from logo/domain, include "Contact sales" or "Free trial" as tiers when no numbers are visible, and never return "Unknown" unless the page is unreadable. We parse JSON robustly (strip markdown, extract {...}). If company is still missing, we derive it from the page URL (e.g. assetpanda.com → "Assetpanda"). That’s how we get real names and usable tiers even on enterprise "Contact us" pages.
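Those two post-processing steps — tolerant JSON parsing and the URL-based company fallback — look roughly like this:

```python
import json
import re
from urllib.parse import urlparse

def parse_model_json(text):
    """Strip markdown fences, grab the outermost {...}, then json.loads it."""
    text = re.sub(r"```(?:json)?", "", text)
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0)) if match else {}

def company_from_url(url):
    """assetpanda.com → 'Assetpanda' when the model can't name the company."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return host.split(".")[0].capitalize()
```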
③ Report synthesis
After we have all competitor objects, we ask Gemini to write a short, conversational summary for Vera to speak. We instruct it to be specific (mention actual prices and tiers) and avoid generic "review the table" phrasing.
All of this uses the Google GenAI SDK (google-generativeai) — the official SDK for Gemini. One model from env (GEMINI_MODEL, default gemini-2.0-flash). No DOM, no scraping, no ADK dependency — prompts and images only. That’s how we meet the hackathon’s “Google GenAI SDK or ADK” requirement with the SDK alone.
3. ElevenLabs: Vera's voice
We wanted the output to feel like a real briefing, not a robotic readout. ElevenLabs powers our persona Vera: one voice for confirmations, step narration, and the final summary.
How we call it:
- Voice: Fixed ID (Rachel by default; overridable via env)
- Model: Multilingual v2
- Tuning: Stability, similarity boost, style, speaker boost
The backend returns base64 MP3; the frontend plays it in an <audio> element. We also generate short "narrate step" phrases (e.g. "Visiting HubSpot pricing page…") so Vera speaks during the run, not only at the end. That keeps the experience live and aligned with the hackathon’s "see, hear, speak" theme.
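The backend call is a single POST. This is a sketch, not the shipped code — the voice-settings values are illustrative, and the voice ID default is Rachel's public ElevenLabs ID:

```python
import base64
import os

VOICE_ID = os.environ.get("ELEVENLABS_VOICE_ID", "21m00Tcm4TlvDq8ikWAM")  # Rachel

def speak(text):
    import requests  # imported lazily; only needed when actually calling the API
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={
            "text": text,
            "model_id": "eleven_multilingual_v2",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75,
                               "style": 0.3, "use_speaker_boost": True},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return base64.b64encode(resp.content).decode("ascii")  # base64 MP3 for the frontend

def audio_src(b64_mp3):
    """What the frontend drops into <audio src=...>."""
    return f"data:audio/mpeg;base64,{b64_mp3}"
```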
4. Firecrawl + Perplexity: speed and grounding
Firecrawl
First extraction attempt: POST to /v1/scrape with formats: ["markdown", "extract"] and a small JSON schema. When we get structured data (especially company_name and pricing), we use it and skip the vision call. Low latency on "easy" pages. When Firecrawl fails or isn’t configured, we fall back to Gemini on the screenshot — so the agent still works everywhere.
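An assumed shape for that request — the endpoint and `formats` follow the description above, and the schema keys mirror what the vision fallback returns:

```python
import os

SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "pricing_tiers": {"type": "array", "items": {"type": "string"}},
        "key_features": {"type": "array", "items": {"type": "string"}},
        "target_segment": {"type": "string"},
    },
}

def build_payload(url):
    return {"url": url, "formats": ["markdown", "extract"],
            "extract": {"schema": SCHEMA}}

def scrape(url):
    import requests  # imported lazily; only needed when actually calling the API
    resp = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
        json=build_payload(url),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("extract")  # None → vision fallback
```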
Perplexity
We use it in two ways:
- URL discovery — "Top 5 CRM tools" → list of companies with URLs.
- Claim verification — For each competitor we send a claim (e.g. "Company X pricing is $49/month") and ask Perplexity to verify. We use the sonar model, citations on, low temperature. The result drives our confidence levels and verified badges in the UI.
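Both uses go through the same chat-completions call; only the user message differs. The request shape below is an assumption based on the settings named above (sonar model, citations on, low temperature):

```python
import os

def verification_prompt(company, price):
    return f'Is this claim true? "{company} pricing is {price}". Answer TRUE, FALSE, or UNSURE.'

def ask_perplexity(prompt):
    import requests  # imported lazily; only needed when actually calling the API
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={
            "model": "sonar",
            "temperature": 0.1,
            "return_citations": True,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Discovery:    ask_perplexity("List the top 5 CRM tools with their pricing-page URLs.")
# Verification: ask_perplexity(verification_prompt("Company X", "$49/month"))
```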
Bottom line: Firecrawl = speed where possible. Perplexity = grounding so we don’t present wrong pricing or features.
5. Stack and deployment
| Layer | Tech |
|---|---|
| Backend | FastAPI, WebSockets, Python 3.11. Docker on Google Cloud Run (infra/cloudbuild.yaml + Terraform). 1 GiB memory, 1 CPU (Playwright/Chromium need headroom; 512 MiB default was too low). |
| Frontend | React, Vite, Tailwind. VITE_API_URL set at build time (e.g. Vercel env). No trailing slash to avoid WebSocket path issues. |
| Browser | Playwright (Chromium) in the same container; navigation + screenshots only. |
The backend is on Cloud Run to satisfy "use at least one Google Cloud service" and "agents hosted on Google Cloud." All API keys (Gemini, ElevenLabs, Firecrawl, Perplexity) live in env vars (or Secret Manager in production).
Production lessons: WebSockets can drop after ~10s (load-balancer idle timeout), so we ping every 5s and, if the connection drops, the frontend polls for results. We normalize request paths so a double slash (//api/...) from a bad base URL doesn't return 403. For extraction, we derive the company from the URL when the screenshot doesn't show it, so you get real names and tiers like "Contact sales" instead of "Unknown" and "N/A."
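The path fix is one regex applied to the request path (not the full URL, which would clobber the `https://` scheme). A minimal sketch, with the keepalive interval alongside:

```python
import re

PING_INTERVAL_S = 5  # stay under the ~10 s load-balancer idle timeout

def normalize_path(path):
    """Collapse runs of slashes so //api/... matches the /api/... routes."""
    return re.sub(r"/{2,}", "/", path)
```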
What we'd do next
- Gemini Live API — True end-to-end voice (user speaks → Gemini transcribes and interprets → agent starts). Right now we use Web Speech / optional Gemini transcribe for the initial query.
- Stronger verification — e.g. structured Perplexity output or multiple claims per company.
- Screenshot replay — We already store screenshots per session; we’d expose a "view sources" carousel so every row links back to the screenshot we used for extraction.
Try it
| Resource | Link |
|---|---|
| Live demo | voyance-beta.vercel.app |
| Source code | github.com/ibtisamafzal/voyance |
Clone the repo, add API keys (Gemini, ElevenLabs, Firecrawl, Perplexity — see backend/.env.example), start the backend and frontend, and try something like "Compare pricing for the top 5 CRM tools." The README includes a Features list, Hero + Output screenshots, and clear hackathon alignment (GenAI SDK + custom loop, Cloud Run).
Team: GenAI-Innovators · Connect: Ibtisam Afzal (LinkedIn)
Thanks to the Gemini Live Agent Challenge for the push to build something that really "sees" the web.
#GeminiLiveAgentChallenge