We’re still on a mission to build a true QA autopilot - one that needs only a URL to get going. In Part 1, we covered our shift from vision-based RL to self-hosted LLMs and all the tweaks that made it reliable. Now in Part 2, we’re scaling up: testing OpenAI and Gemini APIs, plugging in RAG to keep tests focused, and evolving our multi-agent flow. Let’s jump in!
1. API Era: OpenAI & Gemini
By early 2025, self-hosting small LLMs couldn’t meet scale or latency needs. We prototyped:
- OpenAI GPT-3.5/4 APIs: good quality, large context windows, ~0.5–1 s per call.
- Google Gemini API:
  - Latency: ~0.8 s per 1k tokens.
  - Quality: superior on complex navigation and multi-step wizards.
  - Cost: comparable to GPT-3.5, with more generous quotas.
We decided to continue experimenting with Gemini.
Why Gemini?
- Bypassed local VRAM constraints.
- Handled thousands of tokens (full-page JSON) without major trimming.
- Scaled horizontally: we spun up multiple API workers in Docker Swarm.
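For reference, this is roughly what a single generation call looks like from one of those workers, using the google-generativeai SDK. It's a minimal sketch: the model name, prompt wording, and `page_json` shape are illustrative assumptions, not our exact production prompt.

```python
import json
import os

import google.generativeai as genai

# Assumption: the API key is supplied via an environment variable / secret store.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model name

def generate_test_cases(page_json: dict) -> str:
    """Send the full-page nested JSON to Gemini and ask for structured test cases."""
    prompt = (
        "You are a QA engineer. Given this page serialized as nested DOM JSON, "
        "return test cases as a JSON list with fields: title, steps, expected.\n\n"
        + json.dumps(page_json)
    )
    response = model.generate_content(prompt)
    return response.text  # raw model output; parsed and validated downstream
```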
That enabled us to test:
1. Multi-step Forms. “Part 1 on Step 1 → Click Next → Part 2 on Step 2 → Submit” flows worked reliably, since Gemini tracked state across contexts.
2. CRM CRUD Workflows. Complex user flows - like “Create record → Edit → Delete” - across nested tables and modals can now be executed in a single prompt. This is a crucial step forward: for example, to properly test the delete functionality for a "Lead" in a CRM, you first need to create a Lead, then delete it, ensuring you're not interfering with any existing data. Treegress now makes this kind of isolated, end-to-end scenario possible with one command.
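To make the isolation concrete, here is a sketch of the kind of self-contained scenario a single prompt produces. The schema and field names are our illustration rather than the exact Treegress format; the point is that the test creates its own data, operates only on it, and cleans up after itself.

```python
import uuid

# Unique test data, so the scenario never touches existing records.
lead_name = f"qa-lead-{uuid.uuid4().hex[:8]}"

crud_scenario = {
    "title": "Lead CRUD: create, edit, delete",
    "steps": [
        {"action": "create", "entity": "Lead", "data": {"name": lead_name}},
        {"action": "edit", "entity": "Lead", "target": lead_name, "data": {"status": "Qualified"}},
        {"action": "delete", "entity": "Lead", "target": lead_name},
    ],
    "expected": f"Lead '{lead_name}' no longer appears in the list view",
}
```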
2. Introducing RAG for Larger Pages
Generating a large number of test cases is pointless if they aren’t clear and meaningful. We rely on our team’s software engineering experience and QA subject-matter expertise to make sure every test adds real value, so we built a basic RAG pipeline:
- QA-Curated Test Templates. QA engineers wrote generic test-case templates for common page types (Login, Contact Us, CRUD lists).
- Retrieval at Runtime. When our system knows it’s on a Contact Us page, it pulls that template from our document store.
- LLM calls → Tests. Because we lock onto QA templates, hallucinations drop, and coverage improves.
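A minimal sketch of that retrieval step, assuming templates are stored as plain files keyed by page type (the directory layout and prompt wording here are illustrative, not our production code):

```python
from pathlib import Path

TEMPLATE_DIR = Path("qa_templates")  # e.g. qa_templates/contact_us.md, qa_templates/login.md

def retrieve_template(page_type: str) -> str:
    """Pull the QA-curated template matching the detected page type."""
    path = TEMPLATE_DIR / f"{page_type}.md"
    return path.read_text() if path.exists() else ""

def compose_prompt(page_type: str, page_json: str) -> str:
    """Template + full-page JSON give the LLM enough context to customize tests per page."""
    template = retrieve_template(page_type)
    return (
        f"Follow these QA test-case templates strictly:\n{template}\n\n"
        f"Adapt them to this page (nested DOM JSON):\n{page_json}"
    )
```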
Result:
- Predictability.
- No fluff—just real, meaningful test cases.
- Easier to cover edge cases without relying solely on the LLM's creativity.
- Templates + full-page JSON give the LLM enough context to customize tests per page.
3. What Failed & Why
- Vision+Text RL Models: massive and slow.
- Flat List JSON: lost parent/child relations, which led to poor element detection.
  - Fix: switch to nested JSON that reflects the true DOM tree (see the sketch after this list).
- Local vLLM + Large Models: OOMs, crashes, high latency.
  - Lesson: if you need low latency and high concurrency, self-hosting big models can be a dead end; APIs often win.
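To illustrate the nested-JSON fix, here is a minimal serializer sketch that preserves parent/child relations. The field selection is an illustrative assumption, not our production format, and our real crawler does much more.

```python
from bs4 import BeautifulSoup

def serialize_node(node) -> dict:
    """Recursively turn a DOM element into nested JSON that mirrors the tree."""
    children = node.find_all(recursive=False)
    return {
        "tag": node.name,
        "attrs": dict(node.attrs),
        "text": node.get_text(strip=True) if not children else "",
        "children": [serialize_node(child) for child in children],
    }

html = "<form><label>Email</label><input type='email'/><button>Send</button></form>"
tree = serialize_node(BeautifulSoup(html, "html.parser").form)
# tree["children"] keeps label, input, and button under the same <form> parent,
# which a flat element list would have thrown away.
```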
Conclusion
We set out to build an AI agent that “sees” a page, “understands” its purpose, and “generates” or “validates” test cases with minimal manual work. Our path led us from massive vision+RL efforts to lean JSON+LLM setups, and ultimately to multi-agent architecture, a hybrid of self-hosted LLMs, external APIs, and RAG. Each dead end taught us something crucial: raw vision is heavy (but still can be helpful), flat JSON loses context, and local LLM hosting runs into VRAM walls.
Today, our production flow looks like this:
- Website Crawler & Analyzer – Parses and serializes each page into nested JSON. Thanks to our unique DOM handling, it includes a built-in self-healing engine (a topic we’ll cover in a separate article).
- AI Agent for Test Case Generation – Uses a RAG layer to fetch QA templates from a file store based on the page type. These are combined into a prompt and sent to the Gemini API, which returns structured test cases or direct verdicts.
- AI Agent for Test Case Verification – Reviews the generated test cases for accuracy and relevance.
- Backend Services with Playwright – Executes the tests against the live environment.
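As an illustration of that last step, here is a stripped-down sketch of how a structured test case can be replayed with Playwright. The step schema, selectors, and URL are placeholders; the real execution engine also plugs into the self-healing DOM handling mentioned above.

```python
from playwright.sync_api import sync_playwright

def run_steps(url: str, steps: list[dict]) -> None:
    """Replay one structured test case against a live environment."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for step in steps:
            if step["action"] == "fill":
                page.fill(step["selector"], step["value"])
            elif step["action"] == "click":
                page.click(step["selector"])
            elif step["action"] == "expect_text":
                assert step["value"] in page.inner_text(step["selector"])
        browser.close()

run_steps("https://example.com/contact", [
    {"action": "fill", "selector": "#email", "value": "qa@example.com"},
    {"action": "click", "selector": "button[type=submit]"},
    {"action": "expect_text", "selector": ".toast", "value": "Thank you"},
])
```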
It’s a long journey - not just to generate test cases, but to build a platform that truly understands your website, generates meaningful, context-aware test cases, and ensures reliability with predictable outcomes. But let us be clear: we didn’t set out to build a Copilot for QA automation — we’re building an Autopilot for testing.
If you’d like to be part of that journey, we’d love for you to try Treegress and share your thoughts.