Vani Chitkara
Building Parallax: The Vision-Powered UI Navigator Agent

This piece of content was created specifically for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge

Traditional automated testing is broken. It relies on "cheating" by looking at the underlying HTML code (the DOM) to find buttons and links. But humans don’t browse the web by reading code; we browse by seeing pixels.

When we (my teammate and I) set out to build Parallax, we wanted to create a truly human-centric testing agent. We didn't want a scraper; we wanted an agent with "eyes." To achieve this, we turned to the cutting-edge capabilities of Google Gemini 2.5 Flash and the Google Cloud ecosystem.

🧠 The Core: A Vision-to-Action Brain

At the heart of Parallax is the Gemini 2.5 Flash model. We chose this model specifically for its industry-leading multimodal performance and low latency.

In Parallax, we don't send a single line of HTML to the AI. Instead, our agent loop performs the following:

  • Capture: Using Playwright, we snap a high-resolution screenshot of the browser viewport.
  • Analyze: We send that raw image to gemini-2.5-flash with a specific user persona context (e.g., "You are Martha, a 72-year-old with low tech literacy").
  • Act: The model "sees" the UI elements and returns raw pixel coordinates for the next action: a click, a scroll, or typed text.

Because the agent works from pixels alone, it can identify UX friction that code-based tests ignore, such as poor color contrast, overlapping elements, or confusing visual hierarchies.
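The loop above can be sketched roughly as follows. This is a simplified illustration, not Parallax's actual code: the JSON action schema, the prompt wording, and the `parse_action` helper are our own assumptions, while the model name and the Capture/Analyze/Act structure come from the post. The Gemini call follows the `google-genai` SDK and the browser calls follow Playwright's Python API.

```python
import json

def parse_action(reply_text: str) -> dict:
    """Parse the model's JSON reply into an action dict.
    Assumes the prompt asks Gemini for e.g. {"action": "click", "x": 412, "y": 180}."""
    action = json.loads(reply_text)
    if action["action"] not in {"click", "scroll", "type"}:
        raise ValueError(f"unsupported action: {action['action']}")
    return action

def vision_step(page, persona: str) -> dict:
    """One Capture -> Analyze -> Act iteration (sketch)."""
    # Capture: Playwright screenshot of the current viewport (PNG bytes).
    screenshot = page.screenshot()

    # Analyze: send the raw image plus the persona context to Gemini.
    from google import genai
    from google.genai import types
    client = genai.Client()
    reply = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=screenshot, mime_type="image/png"),
            f"{persona} Return the next UI action as JSON: "
            '{"action": "click|scroll|type", "x": <px>, "y": <px>, "text": ""}',
        ],
    )
    action = parse_action(reply.text)

    # Act: replay the pixel coordinates back into the browser.
    if action["action"] == "click":
        page.mouse.click(action["x"], action["y"])
    elif action["action"] == "scroll":
        page.mouse.wheel(0, action.get("dy", 400))  # dy is an assumed field
    elif action["action"] == "type":
        page.keyboard.type(action.get("text", ""))
    return action
```

Note that no DOM selectors appear anywhere: the only inputs are pixels, and the only outputs are coordinates and keystrokes.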

🛠️ Multi-Agent Orchestration with Google ADK

Parallax doesn't just run one test; it runs a "swarm" of diverse perspectives. We used the Google Agent Development Kit (ADK) to orchestrate these independent persona agents. The ADK let us give each persona a distinct cognitive model, so that "Martha" (our 72-year-old grandmother persona), "Raj" (our 28-year-old power user), and five other diverse personas can navigate the same site simultaneously, each reporting unique findings shaped by their specific technological background.
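The fan-out could look something like this. To keep the sketch self-contained we show the shape with plain dataclasses and asyncio rather than the ADK itself; the persona names and traits come from the post, while the `Persona` fields, `build_instruction`, and `run_swarm` are illustrative.

```python
import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record; field layout is our own."""
    name: str
    age: int
    tech_literacy: str

PERSONAS = [
    Persona("Martha", 72, "low"),
    Persona("Raj", 28, "high"),
    # ...five more personas in the real swarm
]

def build_instruction(p: Persona) -> str:
    """System prompt giving each agent its distinct cognitive model."""
    return (
        f"You are {p.name}, a {p.age}-year-old with {p.tech_literacy} tech "
        "literacy. Navigate the site as this person would and report any "
        "UX friction you encounter."
    )

async def run_swarm(url: str) -> list[str]:
    """Run every persona against the same site concurrently."""
    async def run_one(p: Persona) -> str:
        # Placeholder: in Parallax this spawns a browser plus the
        # vision loop for the persona, wired up through the ADK.
        return f"{p.name}: no blocking issues found on {url}"
    return list(await asyncio.gather(*(run_one(p) for p in PERSONAS)))
```

Because each agent carries only its own instruction and state, findings from "Martha" never bleed into "Raj"'s run.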

📈 Scaling on Google Cloud

To handle the intensive compute requirements of headless browsers and high-frequency AI calls, we built a serverless architecture on Google Cloud:

  • Google Cloud Run: Our FastAPI backend is fully containerized and deployed on Cloud Run. This allows us to scale horizontally as more agents are spawned, ensuring that the "Vision Loop" remains snappy and responsive.
  • Google Cloud Firestore: We use Firestore for real-time state management. As agents find issues, they are instantly streamed to a live dashboard, allowing developers to watch the "thinking process" of the AI in real-time.
  • Google Cloud Storage (GCS): Every multimodal artifact, meaning every screenshot the agent "saw", is persisted in GCS. This creates a visual audit trail that is invaluable for debugging UX failures.
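The GCS and Firestore pieces fit together in a few lines. This is a minimal sketch using the official `google-cloud-storage` and `google-cloud-firestore` clients; the bucket name, collection layout, and document fields are hypothetical, not Parallax's actual schema.

```python
import time

def finding_doc(persona: str, issue: str, screenshot_path: str) -> dict:
    """Shape of a single finding as streamed to the live dashboard.
    Field names are illustrative."""
    return {
        "persona": persona,
        "issue": issue,
        "screenshot": screenshot_path,
        "ts": time.time(),
    }

def publish_finding(run_id: str, persona: str, issue: str,
                    screenshot: bytes) -> dict:
    """Persist the screenshot to GCS, then stream the finding to Firestore."""
    from google.cloud import firestore, storage  # official GCP clients

    # Visual audit trail: one object per screenshot the agent "saw".
    bucket = storage.Client().bucket("parallax-artifacts")  # hypothetical bucket
    blob_path = f"runs/{run_id}/{persona}-{int(time.time())}.png"
    bucket.blob(blob_path).upload_from_string(screenshot,
                                              content_type="image/png")

    # Real-time state: the dashboard subscribes to this subcollection
    # (e.g. via on_snapshot), so new findings appear instantly.
    doc = finding_doc(persona, issue, blob_path)
    firestore.Client().collection("runs").document(run_id) \
        .collection("findings").add(doc)
    return doc
```

Storing only the GCS path in Firestore keeps the documents small while the dashboard lazily loads screenshots on demand.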

💡 Conclusion

Parallax represents a shift from "testing code" to "testing experiences." By combining the multimodal power of Gemini 2.5 Flash with the reliability of Google Cloud, we’ve built a tool that helps developers see their apps through the eyes of their most diverse users.

Check out the live project here: https://bit.ly/parallax-agent
