The VLA Testing Pipeline in Mano-AFK: When AI Agents QA Their Own Work

#ai #agents #opensource #testing

AI coding tools have gotten remarkably good at generating code. You describe what you want, and within minutes you have functions, components, even entire applications scaffolded out. But there's a question that rarely gets asked in the excitement: who tests it?

Writing code accounts for maybe 30% of shipping software. The remaining 70% (defining requirements, deploying, testing, finding bugs, fixing them, and verifying the fixes) is where most projects quietly stall. Most AI coding assistants today stop at some variation of "here's the code, good luck." The developer is still left to deploy it, test it manually, discover the bugs, explain the bugs back to the AI, wait for fixes, and re-test.

That workflow isn't autonomous development. It's autocomplete with extra steps.

The Testing Gap Nobody Talks About

Most engineering teams rely on a layered testing strategy: linting catches syntax errors, unit tests verify individual functions, and API tests confirm that endpoints return the right data. These layers are well-understood, well-automated, and widely adopted.

But here's the uncomfortable reality: all three can pass while the application is completely broken for end users.

A button's onClick handler might correctly call an API endpoint that returns valid JSON — and the unit test, API test, and linter will all report green. Meanwhile, the button itself is hidden behind a CSS overflow, or renders off-screen on mobile, or navigates to a blank page because the frontend routing is misconfigured. The backend works. The tests pass. The user sees nothing.

This is the E2E testing gap. It's the difference between "the code compiles" and "the software ships." And it's the hardest layer to automate, because it requires something most test frameworks don't have: the ability to actually look at the application and interact with it the way a human would.

Why Traditional E2E Testing Falls Short

Tools like Selenium and Playwright have been the go-to for browser-based E2E testing for years. They work by programmatically controlling a browser through DOM selectors — clicking elements by their CSS class, filling inputs by their HTML id, asserting text content by XPath.

The problem is fragility. DOM-based selectors break whenever the UI changes. A designer renames a class, a framework update restructures the component tree, a developer switches from a <div> to a <button> — and the entire test suite fails, not because the application is broken, but because the selectors are stale.
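To make the fragility concrete, here is a minimal sketch that uses only Python's standard library rather than Selenium or Playwright. The helpers `find_by_class` and `find_by_text` and the class names `btn-primary` and `cta-main` are hypothetical. Looking an element up by its visible label is only a stand-in for "what the user sees"; it is not a VLA model.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (tag, class attribute, direct text) for each closed element."""

    def __init__(self):
        super().__init__()
        self._stack = []    # open elements as [tag, class, text parts]
        self.elements = []  # closed elements as (tag, class, text)

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, dict(attrs).get("class") or "", []])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, cls, parts = self._stack.pop()
            self.elements.append((name, cls, "".join(parts).strip()))


def parse(html):
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements


def find_by_class(html, class_name):
    """Selector-style lookup: depends on markup details."""
    return [e for e in parse(html) if class_name in e[1].split()]


def find_by_text(html, label):
    """Label-style lookup: depends only on the text the user reads."""
    return [e for e in parse(html) if e[2] == label]


# The same "Submit" control before and after a purely cosmetic refactor.
before = '<div class="btn-primary">Submit</div>'
after = '<button class="cta-main">Submit</button>'

print(len(find_by_class(before, "btn-primary")))  # selector matches
print(len(find_by_class(after, "btn-primary")))   # selector is now stale
print(len(find_by_text(after, "Submit")))         # label lookup still matches
```

The application behaves identically in both versions, yet the class-based lookup finds nothing after the refactor.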

This creates a maintenance burden that grows with application complexity. Large teams often assign QA engineers full-time just to keep Selenium tests from becoming red noise. Smaller teams often skip E2E testing altogether.

There's a more fundamental issue, too. DOM-based assertions verify what's programmatically accessible. A typical assertion checks that a text node contains "Success", but it won't tell you that the success message is rendered in white text on a white background. It confirms that an image element exists, but usually not that the image actually loaded. It operates on structure, not on what the user actually sees.
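The white-on-white case is a good illustration of what a pixel-level check adds. The sketch below computes the WCAG 2.x contrast ratio between two sRGB colors. It assumes the text and background colors have already been sampled from a screenshot, and the function names are illustrative. It shows one kind of visual check, not how Mano-P works internally.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Ranges from 1.0 (identical colors) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def is_readable(fg, bg, minimum=4.5):
    """4.5:1 is the WCAG AA threshold for normal-size text."""
    return contrast_ratio(fg, bg) >= minimum


WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

print(is_readable(WHITE, WHITE))  # False: the text is there but invisible
print(is_readable(BLACK, WHITE))  # True
```

A text assertion on "Success" passes in both cases; only the check on rendered colors distinguishes them.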

VLA: Giving Agents Eyes

Vision-Language-Action (VLA) models change this equation. A VLA model takes a screenshot of the application, understands what it sees through visual reasoning, and generates concrete actions — click coordinates, text input, scroll directions — based on that understanding.
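The screenshot, reason, act cycle can be sketched as a small loop. Everything here is an assumption made for illustration: the action types, the `run_episode` function, and the three callables it takes (`take_screenshot`, `propose_action`, `perform`). A real model would sit behind `propose_action`; this sketch uses a scripted stand-in so that it runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    dy: int


@dataclass(frozen=True)
class Done:
    success: bool
    reason: str


Action = Union[Click, TypeText, Scroll, Done]


def run_episode(
    goal: str,
    take_screenshot: Callable[[], bytes],
    propose_action: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe, ask the model for one action, perform it, and repeat."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = propose_action(goal, screenshot, history)
        history.append(action)
        if isinstance(action, Done):
            return history
        perform(action)
    # The model never declared completion within the step budget.
    history.append(Done(False, "step budget exhausted"))
    return history


def scripted_model(script):
    """Stand-in for a VLA model: replays a fixed list of actions."""
    remaining = iter(script)
    return lambda goal, screenshot, history: next(remaining)


performed: List[Action] = []
history = run_episode(
    goal="submit the login form",
    take_screenshot=lambda: b"",  # placeholder image bytes
    propose_action=scripted_model(
        [Click(120, 340), TypeText("demo"), Done(True, "form submitted")]
    ),
    perform=performed.append,
)
print(performed)
```

Note that the loop passes only pixels and the goal to the model; no selectors or DOM handles appear anywhere in the interface.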

The key difference from DOM-based automation: VLA operates on pixels, not selectors. It doesn't need to know that the "Submit" button is a <button class="btn-primary">. It sees a button labeled "Submit" and clicks it, much as a human tester would. If the button moves to a different position on the page, the VLA model can still find it. If the framework changes from React to Vue, the tests still work as long as the visual interface stays the same.

This makes VLA-based testing more robust to markup changes than selector-based approaches. It also enables something selector-based assertions don't provide on their own: visual validation that understands content. Screenshot-comparison features in tools such as Playwright can flag pixels that differ from a stored baseline, but they can't judge whether a screen is correct. A VLA model can verify that a chart actually renders with the correct data, that a color-coded status indicator is the right color, that a modal overlay is visible and properly positioned. It tests what the user experiences, not what the DOM describes.

[Figure: Mano-P's benchmark performance across multiple evaluation dimensions, including GUI grounding and visual understanding tasks.]

The Full Pipeline: Build → Test → Fix → Repeat

Individual testing capability is useful. But the real value emerges when visual testing becomes part of a fully autonomous development pipeline — where an AI agent doesn't just write code, but also deploys it, tests it with real browser interactions, and fixes whatever breaks.

Here's what that pipeline looks like in practice:

Step 1: Requirements first. Before a single line of code is written, a structured PRD (Product Requirements Document) is generated with acceptance criteria. Every test case traces back to a specific requirement. Every bug fix maps to an AC number. This eliminates the most common failure mode of AI-generated code: "it works, but it doesn't match the intent."
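The traceability described in Step 1 can be sketched as a simple data structure. The class names, the `AC-n` identifier format, and the two helper functions are assumptions for illustration, not Mano-AFK's actual PRD schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AcceptanceCriterion:
    ac_id: str        # hypothetical ID format, e.g. "AC-1"
    description: str


@dataclass(frozen=True)
class CheckCase:
    name: str
    ac_id: str        # the criterion this check traces back to


def untested_criteria(criteria: List[AcceptanceCriterion],
                      checks: List[CheckCase]) -> List[str]:
    """Criteria that no check covers yet."""
    covered = {c.ac_id for c in checks}
    return [ac.ac_id for ac in criteria if ac.ac_id not in covered]


def orphan_checks(criteria: List[AcceptanceCriterion],
                  checks: List[CheckCase]) -> List[str]:
    """Checks that point at a criterion the PRD does not define."""
    known = {ac.ac_id for ac in criteria}
    return [c.name for c in checks if c.ac_id not in known]


criteria = [
    AcceptanceCriterion("AC-1", "User can add an expense"),
    AcceptanceCriterion("AC-2", "Monthly total updates after each entry"),
]
checks = [
    CheckCase("add expense via form", "AC-1"),
    CheckCase("export to CSV", "AC-9"),
]

print(untested_criteria(criteria, checks))  # ['AC-2']
print(orphan_checks(criteria, checks))      # ['export to CSV']
```

Both lists should be empty before the build starts: an untested criterion is a requirement nobody verifies, and an orphan check is work that no requirement asked for.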

Step 2: Build and deploy. Code is generated, dependencies are installed, and the application is deployed to a local development server — all without human intervention.

Step 3: Layered testing. The pipeline runs lint checks first (fast, catches syntax issues), then API tests (verifies backend logic), then E2E tests using a VLA model to open the app in a browser, navigate through user flows, and verify that the interface matches the acceptance criteria.
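The ordering in Step 3 can be sketched as a runner that executes the cheapest layer first. Stopping at the first failing layer is an assumption of this sketch, as are the names `StageResult` and `run_layers` and the example failure message.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageResult:
    name: str
    passed: bool
    details: str = ""


def run_layers(
    stages: List[Tuple[str, Callable[[], Tuple[bool, str]]]]
) -> List[StageResult]:
    """Run stages in order; skip slower layers once one fails."""
    results: List[StageResult] = []
    for name, check in stages:
        passed, details = check()
        results.append(StageResult(name, passed, details))
        if not passed:
            break
    return results


# Stubbed checks standing in for real lint, API, and VLA-driven E2E runs.
stages = [
    ("lint", lambda: (True, "")),
    ("api", lambda: (False, "GET /expenses returned 500")),
    ("e2e", lambda: (True, "")),
]

results = run_layers(stages)
print([(r.name, r.passed) for r in results])  # [('lint', True), ('api', False)]
```

Failing fast keeps the expensive browser-driven layer from running against a backend that is already known to be broken.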

Step 4: Fix loop. When tests fail, the agent reads the failure report, inspects the relevant code, makes targeted fixes, re-deploys, and re-tests. This loop can run for multiple iterations — catching not just the initial bug but also regressions introduced by the fix itself.
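The fix loop in Step 4 can be sketched as a bounded retry. The callables `run_tests`, `apply_fix`, and `redeploy` are hypothetical hooks, and the simulated regression exists only to show why the whole suite is re-run after every fix. The default cap of 10 rounds mirrors the limit this article gives for Mano-AFK.

```python
from typing import Callable, List, Tuple


def fix_loop(
    run_tests: Callable[[], List[str]],
    apply_fix: Callable[[List[str]], None],
    redeploy: Callable[[], None],
    max_rounds: int = 10,
) -> Tuple[bool, int, List[str]]:
    """Returns (all_passed, rounds_used, remaining_failures)."""
    failures = run_tests()
    rounds = 0
    while failures and rounds < max_rounds:
        apply_fix(failures)
        redeploy()
        failures = run_tests()  # full re-run, so regressions surface too
        rounds += 1
    return (not failures, rounds, failures)


# Simulated project: the first fix introduces a regression.
open_bugs = ["AC-2: total not updated"]
regressed = []


def run_tests() -> List[str]:
    return list(open_bugs)


def apply_fix(failures: List[str]) -> None:
    open_bugs.remove(failures[0])
    if not regressed:
        open_bugs.append("AC-1: form no longer submits")
        regressed.append(True)


result = fix_loop(run_tests, apply_fix, redeploy=lambda: None)
print(result)  # (True, 2, [])
```

If the cap is reached with failures still open, the third element of the result is what a final report would list.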

The entire cycle — from "build me a budget tracker" to "here's your running app with a test report" — runs without human involvement.

Adversary Review: Why the Builder Shouldn't Test Itself

There's a well-known principle in software engineering: the person who writes the code shouldn't be the only one testing it. Developers have blind spots about their own work. They rarely test the edge cases they didn't think of during implementation.

The same principle applies to AI agents. When a single agent builds and tests, it tends to generate tests that validate its own assumptions rather than challenging them. The tests pass not because the code is correct, but because the tests are aligned with the same reasoning that produced the code.

A more robust approach uses separation of concerns:

A Build Agent writes the code, handles deployment, and fixes bugs.
An Adversary Agent independently reviews the PRD and source code to find problems the builder missed.
A Main Agent triages each finding through code inspection, API tests, or E2E verification.

The adversary operates without knowledge of the reasoning behind the builder's implementation decisions. It reads the requirements, reads the code, and asks: "What could go wrong that the builder didn't consider?" This catches usability gaps, data integrity issues, inconsistent behavior across features, and missing edge cases that automated tests alone would miss.
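The triage step can be sketched as routing each adversary finding to one of the three verification methods named above. The `Finding` fields, the `Verification` enum, and the stubbed verifiers are assumptions for illustration; in a real pipeline each verifier would inspect code, call an endpoint, or drive the browser.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Verification(Enum):
    CODE_INSPECTION = "code inspection"
    API_TEST = "API test"
    E2E = "E2E verification"


@dataclass(frozen=True)
class Finding:
    summary: str
    ac_id: str                  # requirement the finding relates to
    verify_with: Verification


def triage(
    findings: List[Finding],
    verifiers: Dict[Verification, Callable[[Finding], bool]],
) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into confirmed bugs and dismissed false positives."""
    confirmed: List[Finding] = []
    dismissed: List[Finding] = []
    for finding in findings:
        reproduced = verifiers[finding.verify_with](finding)
        (confirmed if reproduced else dismissed).append(finding)
    return confirmed, dismissed


findings = [
    Finding("Negative amounts are accepted", "AC-1", Verification.API_TEST),
    Finding("Delete button has no confirmation", "AC-4", Verification.E2E),
]

# Stub verifiers: pretend only the API finding reproduces.
verifiers = {
    Verification.CODE_INSPECTION: lambda f: False,
    Verification.API_TEST: lambda f: True,
    Verification.E2E: lambda f: False,
}

confirmed, dismissed = triage(findings, verifiers)
print([f.summary for f in confirmed])
```

Only confirmed findings go back to the build agent, which keeps unverified adversary claims from triggering unnecessary fixes.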

Self-Evolution: Getting Smarter Over Projects

Most AI coding tools treat every project as a fresh start. The context window resets, lessons from previous sessions are lost, and the same mistakes get repeated.

A self-evolving pipeline maintains persistent knowledge across projects through two mechanisms:

Build rules — When a bug takes multiple fix iterations to resolve, the lesson is extracted and applied to all future projects. "Always add loading states to async data fetches" isn't a generic best practice; it's a specific rule learned from a specific failure.
Preference accumulation — Layout patterns, color schemes, component choices, and architectural preferences converge over time. The tenth project reflects accumulated understanding of what the developer actually wants, not just what they described in a single prompt.

This is a meaningful shift from stateless code generation to something that develops institutional memory.
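The build-rules mechanism can be sketched as a small persistent store. The JSON file format, the `record_fix` function, and the threshold of two fix rounds are all assumptions; the source says only that a lesson is extracted when a bug takes multiple iterations to resolve.

```python
import json
from pathlib import Path
from typing import List


def load_rules(rules_path: str) -> List[str]:
    """Read the persisted rules, or an empty list on the first project."""
    path = Path(rules_path)
    return json.loads(path.read_text()) if path.exists() else []


def record_fix(rules_path: str, lesson: str, fix_rounds: int,
               threshold: int = 2) -> List[str]:
    """Persist a lesson only if the bug needed at least `threshold` rounds."""
    rules = load_rules(rules_path)
    if fix_rounds >= threshold and lesson not in rules:
        rules.append(lesson)
        Path(rules_path).write_text(json.dumps(rules, indent=2))
    return rules
```

A later project would call `load_rules` before generating code and include the returned lessons in the build agent's instructions. Bugs fixed in a single round leave the file unchanged, so the rule set grows only from failures that were genuinely hard to resolve.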

Mano-AFK: An Open-Source Implementation

At Mininglamp, we built Mano-AFK as an open-source implementation of this full pipeline. It takes a natural language description, generates a PRD with acceptance criteria, builds the application, deploys it locally, runs layered testing (lint → API → E2E → adversary review), and iterates through fix loops — up to 10 rounds — until all tests pass or a detailed report is generated.

The E2E testing layer is powered by Mano-P, Mininglamp's on-device VLA model. Mano-P runs entirely on local hardware: the 4B quantized model achieves 76 tokens/s decode speed on an M4 Pro with just 4.3 GB peak memory. No screenshots leave the device, no API keys are required, and there is no per-test API cost. It uses pure vision to understand GUIs without relying on DOM parsing or accessibility trees, which means it works across web apps, desktop software, and any application with a visual interface.

[Figure: Mano-P's GUI grounding benchmark results. The ability to accurately locate and interact with UI elements is foundational to reliable visual testing.]

For teams that prefer cloud-based testing, Mano-AFK also supports Claude CUA as an alternative backend. The local mode with Mano-P is recommended for development workflows where privacy, latency, and cost matter.

What This Means for Development Workflows

The combination of VLA-based visual testing, adversary review, and self-evolving build rules points toward a future where "AI-assisted development" means more than code generation. It means AI agents that can participate in the full software lifecycle — including the 70% that happens after the code is written.

We're still early. VLA models aren't perfect at visual understanding, adversary review can produce false positives, and self-evolution needs many project cycles to show meaningful improvement. But the direction is clear: autonomous development pipelines that close the loop between writing code and shipping software.

Both Mano-AFK and Mano-P are open source and available on GitHub. If this approach to autonomous testing resonates with your workflow, we'd welcome you to try them out and share your experience. ⭐