I had a test spec to run against a web app. A couple of dozen test cases covering login, navigation, data persistence, drag-and-drop interactions, and settings. The kind of document that sits in a folder and gets run manually before each release.
Running it manually takes most of a morning. So I decided to automate it properly.
The first attempt
I built a browser agent using a local Ollama model and pointed it at the app. Gave it the test cases as a prompt and let it run.
On the first run it navigated to the login page, identified the email field, and typed the credentials correctly. Then it spent ten minutes navigating in circles before ending up somewhere completely unrelated, having lost track of the task entirely.
The second attempt made it past login before forgetting what it was doing and trying to log in again. The third produced a markdown table with verdicts that had no relationship to what was actually on screen.
The problem was not the model. It was the architecture.
What was actually going wrong
I was treating the AI like a QA engineer, handing it a list of tasks and expecting methodical progress. Language models don't work that way in a browser. Every page change brings a new wall of DOM noise. React apps are particularly bad for this. By step five or six, the model has lost track of where it is. It navigates to verify things it already checked two steps ago.
I tried local models first and assumed the problem was capability, so I switched to the Gemini API. Same result. The cloud model was faster and more articulate about what it was doing, but it still lost the thread after a handful of steps. It would confidently describe elements that were not on screen and navigate to pages that had nothing to do with the test it was supposed to be running. The issue was not model quality. A smarter model in a bad architecture just fails more eloquently.
Browser agents work well for one-shot tasks. They fall apart on multi-step regression testing where you need to tick off a couple of dozen cases in sequence without losing the thread.
The approach that worked
I stopped trying to make the AI act and started using it only to observe.
Deterministic code handles everything requiring reliability: logging in, navigating to each URL, clicking buttons, typing text. All scripted, no model in the loop.
For each test: take a screenshot, send it to a local vision model, ask one specific yes or no question. "Is the login form visible with no blank white screen?" "Is the user profile page showing the correct name and email fields?"
The model never navigates. Never clicks. It looks at a static screenshot and answers one question. No agent loop, no state to lose.
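The observer step is a single request per test. As a sketch of the pattern (not lookout's actual code — the llava model name and the default Ollama endpoint are assumptions), a check against a local Ollama vision model might look like:

```python
import base64
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_check(screenshot_path: str, question: str, model: str = "llava") -> dict:
    """Build the request payload for one yes/no screenshot check."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        # Constrain the model to a one-word verdict so the answer is parseable.
        "prompt": f"{question} Answer with exactly one word: yes or no.",
        "images": [image_b64],
        "stream": False,
    }

def run_check(payload: dict) -> bool:
    """POST the payload to Ollama; anything other than 'yes' counts as a fail."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"]
    return answer.strip().lower().startswith("yes")
```

Forcing a one-word answer is part of the design: the verdict is parsed deterministically, so a rambling model response fails closed instead of being misread as a pass.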
A couple of dozen tests, a few minutes end to end, all green.
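The whole architecture reduces to a short loop: scripted navigation, one static screenshot, one question. A minimal sketch with the browser and model behind injected callables (names are illustrative, not lookout's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    url: str       # deterministic navigation target
    question: str  # the single yes/no question for the vision model

def run_suite(
    checks: list,
    navigate: Callable,
    screenshot: Callable,
    ask: Callable,
) -> dict:
    """Drive each check: scripted navigation, one capture, one question.

    The model never touches the browser -- it only ever sees a static
    image, so there is no agent loop and no state for it to lose.
    """
    results = {}
    for check in checks:
        navigate(check.url)      # deterministic code, no model in the loop
        image = screenshot()     # static capture of the current page
        results[check.name] = ask(image, check.question)
    return results
```

Because the model is behind a plain function boundary, swapping a local Ollama model for a cloud vision API changes one callable and nothing else.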
Things that were harder than expected
The login flow has two steps: you enter an email, click a Continue button, and only then does the password field appear. This is exactly the kind of thing a browser agent handles inconsistently. A small block of deterministic code handles it every time.
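That block is unglamorous but rigid. A sketch in Playwright style (the selectors are hypothetical, and `page` is any Playwright-like page object):

```python
def login_two_step(page, email: str, password: str) -> None:
    """Scripted two-step login: email, Continue, then the revealed password field."""
    page.fill("input[type=email]", email)
    page.click("button:has-text('Continue')")
    # The password field only exists after the Continue click, so wait for it
    # explicitly instead of hoping it is already in the DOM.
    page.wait_for_selector("input[type=password]")
    page.fill("input[type=password]", password)
    page.click("button[type=submit]")
```

The explicit wait between the two steps is the entire trick: an agent has to rediscover that ordering on every run, while the script encodes it once.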
The app uses a React drag-and-drop library that does not respond to standard drag commands. You have to dispatch raw mouse events with a hold delay before React registers the interaction. Scripted code handles it. An agent trying to drag things fails most of the time.
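The shape of that workaround: press, hold, then move in many small increments so the library sees a stream of mousemove events rather than one teleporting jump. A sketch using a Playwright-style mouse API (the hold and step values are assumptions to tune per app):

```python
def drag_points(x0, y0, x1, y1, steps: int):
    """Intermediate points for a drag, one mousemove event per point.

    React dnd libraries often ignore a single large jump; a stream of
    small moves looks like a real pointer.
    """
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps)
        for i in range(1, steps + 1)
    ]

def drag(page, x0, y0, x1, y1, hold_ms: int = 150, steps: int = 20):
    """Drag with an explicit hold delay so the press registers before moving."""
    page.mouse.move(x0, y0)
    page.mouse.down()
    page.wait_for_timeout(hold_ms)  # let the dnd library register the press
    for x, y in drag_points(x0, y0, x1, y1, steps):
        page.mouse.move(x, y)
    page.mouse.up()
```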
How you phrase the question to the vision model matters more than which model you use. "Is the submit button visible?" and "Is there a green button labeled Submit at the bottom of the form, below the terms and conditions checkbox?" produce very different results. Being specific is the work.
One thing that matters in practice: the apps worth testing usually sit behind SSO with 2FA. Automating that login is not realistic. So lookout has a lookout auth command that opens a headed browser, lets you log in by hand once, and saves the session. Every subsequent run reuses it. Not glamorous, but it is the difference between a demo and a tool.
What came out of it
I packaged this into a CLI tool called lookout. You write a YAML spec defining your tests, point it at your app, and run it. In headed mode you can watch it click through the browser in real time while a GPU stats panel shows your vision model working through each screenshot. When it finishes, it generates an HTML report with pass/fail results and an embedded screenshot for every test, and opens it in your browser automatically. The whole run from launch to reading the report takes a few minutes.
Single Go binary, local Ollama model by default with cloud vision models as an opt-in for teams that want higher accuracy. It also ships with a prompt to convert an existing PDF test spec into YAML, useful if you inherit a document rather than write one from scratch. CI-ready exit codes and JUnit XML output so it slots into a pipeline rather than being something you run manually.
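The JUnit XML side is simple because the verdicts are already booleans. lookout itself is Go, but the mapping from pass/fail results to a report any CI system understands is the same in any language; a sketch:

```python
import xml.etree.ElementTree as ET

def junit_xml(results: dict) -> str:
    """Render {test_name: passed} verdicts as a minimal JUnit XML report."""
    suite = ET.Element(
        "testsuite",
        name="ui-checks",
        tests=str(len(results)),
        failures=str(sum(1 for ok in results.values() if not ok)),
    )
    for name, ok in results.items():
        case = ET.SubElement(suite, "testcase", name=name)
        if not ok:
            # A failed visual check becomes a standard <failure> element.
            ET.SubElement(case, "failure", message="visual check answered no")
    return ET.tostring(suite, encoding="unicode")
```

Pair this with a process exit code of 1 when any failure exists and the tool slots into a pipeline with no adapter.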
It is open source at github.com/alexmchughdev/lookout. It ships with a demo spec you can run out of the box, but getting it working reliably against your own app will take some time dialling in the selectors and question phrasing for your specific UI. That tuning is where the real work is, and it is different for every app. If you try it out or want to contribute, feel free to open an issue or a PR.
The actual takeaway
The instinct when you hear "AI QA automation" is to picture an agent that reads your test spec and handles everything. That capability is closer than it was a year ago, but for complex multi-step regression testing across a React app it still breaks down in practice.
Treating the AI as a fast and consistent observer, and using deterministic code for everything else, produces something genuinely reliable. The model does not need to be smart. It needs to be consistent. Those are different requirements.
