DEV Community

anjo zulaybar

We stopped writing Playwright selectors and let AI figure it out

The problem with selector-based testing

If you've maintained a Playwright or Cypress test suite for more than a few months, you know the drill. A designer renames a class, a developer restructures a form, and suddenly 30 tests are broken — not because the feature broke, but because .submit-btn became [data-action="submit"].

You end up in a loop: fix selectors, ship, selectors break, fix selectors. The tests stop being useful because nobody trusts them.

What we built

We built Confidence Gate — an AI-powered test execution engine where you describe test steps in plain English and the system figures out the rest.

Instead of:

await page.locator('[data-testid="email-input"]').fill('user@example.com');

await page.locator('button[type="submit"]').click();

await expect(page).toHaveURL('/dashboard');

You write:

{ "action": "enter the email from the test data in the email field",
  "expected": "the email field contains the entered address" }

{ "action": "click the submit button",
  "expected": "the dashboard is displayed and the login form is gone" }

The engine translates each step into a typed intent, resolves the target element from the accessibility tree, executes it in a real Playwright browser, takes a screenshot, and verifies the outcome visually.

How the execution engine works

Each step goes through four stages:

1. Intent generation: The natural-language action is converted to structured JSON, for example { "action": "click", "target": { "label": "Sign In", "role": "button" }, "value": null }. This separates intent from implementation.

2. Element resolution: A multi-tier resolver finds the element: accessibility tree first (fast, reliable), CSS heuristics second, AI-assisted fallback third.

3. Execution + behavior detection: Playwright executes the action. A mutation observer watches for DOM changes, URL changes, and value changes to confirm something actually happened.

4. Verification: A vision model compares the post-action screenshot against the expected result. If behavior was detected but verification fails, the engine assumes it acted on the wrong element, blacklists that selector, and retries.
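
To make the four stages more concrete, here is a minimal Python sketch of how they could fit together. It is not the actual Confidence Gate code: the names Intent, parse_intent, resolve and run_step are hypothetical, the resolver tiers, execute and verify are plain callables standing in for the accessibility-tree lookup, Playwright and the vision model, and returning "inconclusive" when no behavior is detected is an assumption, since the stages above only describe the case where behavior was detected.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Intent:
    """Typed form of the structured JSON produced in stage 1."""
    action: str
    label: str
    role: Optional[str]
    value: Optional[str]


def parse_intent(raw: str) -> Intent:
    """Stage 1: turn the JSON intent into a typed object."""
    data = json.loads(raw)
    target = data.get("target") or {}
    return Intent(data["action"], target.get("label", ""),
                  target.get("role"), data.get("value"))


# A tier receives the intent and the blacklisted selectors and returns
# a selector string, or None if it cannot find a usable element.
Resolver = Callable[[Intent, frozenset], Optional[str]]


def resolve(intent: Intent, tiers: Sequence[Resolver],
            blacklist: frozenset) -> Optional[str]:
    """Stage 2: try each tier in order; skip blacklisted selectors."""
    for tier in tiers:
        selector = tier(intent, blacklist)
        if selector is not None and selector not in blacklist:
            return selector
    return None


def run_step(intent: Intent, tiers: Sequence[Resolver],
             execute: Callable[[str, Intent], bool],
             verify: Callable[[], bool], max_attempts: int = 3) -> str:
    """Stages 3 and 4: execute, verify, retry with bad selectors blacklisted."""
    blacklist: set = set()
    for _ in range(max_attempts):
        selector = resolve(intent, tiers, frozenset(blacklist))
        if selector is None:
            return "inconclusive"
        changed = execute(selector, intent)  # True if behavior was detected
        if changed and verify():
            return "pass"
        if not changed:
            return "inconclusive"  # assumption: nothing happened, nothing to judge
        blacklist.add(selector)  # something happened, but on the wrong element
    return "fail"
```

In this sketch the tier order is just the order of the list passed in, so putting an accessibility-tree resolver first and an AI-assisted one last would mirror the priority described in stage 2.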

Self-healing selectors

When a selector stops working between deploys, the repair loop kicks in. It re-queries the accessibility tree, scores candidate elements against the original target description, and picks the best match. The new selector is cached so the next run is fast.
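
As an illustration of the repair loop, the sketch below scores candidates by comparing their accessible name to the original target label and caches the winner. The Candidate type, the scoring weights, and the min_score cut-off are all hypothetical, and string similarity via Python's difflib is a stand-in for whatever scoring the engine really uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One element read from the accessibility tree (hypothetical shape)."""
    selector: str
    role: str
    name: str  # accessible name


def score(candidate: Candidate, target_label: str, target_role: str) -> float:
    """Name similarity in [0, 1], plus a small bonus for a matching role."""
    similarity = SequenceMatcher(
        None, candidate.name.lower(), target_label.lower()
    ).ratio()
    role_bonus = 0.25 if candidate.role == target_role else 0.0
    return similarity + role_bonus


def repair(cache: Dict[str, str], step_id: str,
           candidates: Sequence[Candidate], target_label: str,
           target_role: str, min_score: float = 0.6) -> Optional[str]:
    """Pick the best-scoring candidate and cache its selector for the next run."""
    best = max(candidates,
               key=lambda c: score(c, target_label, target_role),
               default=None)
    if best is None or score(best, target_label, target_role) < min_score:
        return None  # nothing close enough; leave the cache untouched
    cache[step_id] = best.selector
    return best.selector
```

Refusing to repair below a threshold matters here: a repair loop that always picks the least-bad candidate would quietly turn a genuinely missing button into a click on something else.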

The confidence score

After a run, every step result feeds into a score (0–100) built from:

  • Pass / fail / inconclusive ratio
  • Flakiness history (tests that flip between runs)
  • Selector stability (how often repair had to run)
  • AI risk analysis against a PRD (optional)

The score maps to a gate decision: ship, caution, or block. You can call the API from CI and fail a deployment if the score drops below your threshold.
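
A rough sketch of how such a score and gate could be computed is shown below. The weights (0.6 / 0.2 / 0.2), the thresholds (85 and 60), and the RunStats fields are illustrative assumptions rather than the project's real formula, and the optional PRD risk analysis is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    """Aggregated results of one run (hypothetical field names)."""
    passed: int
    failed: int
    inconclusive: int
    flaky_tests: int     # tests whose result flipped between recent runs
    total_tests: int
    repaired_steps: int  # steps where selector repair had to run


def confidence_score(stats: RunStats) -> int:
    """Combine pass ratio, flakiness and selector stability into 0-100."""
    steps = stats.passed + stats.failed + stats.inconclusive
    if steps == 0:
        return 0  # no evidence, no confidence
    pass_ratio = stats.passed / steps
    flake_free = 1 - stats.flaky_tests / stats.total_tests if stats.total_tests else 1.0
    selector_stability = 1 - stats.repaired_steps / steps
    # Assumed weights: passing steps dominate, the other two signals share the rest.
    raw = 0.6 * pass_ratio + 0.2 * flake_free + 0.2 * selector_stability
    return round(100 * raw)


def gate(score: int, ship_at: int = 85, block_below: int = 60) -> str:
    """Map a score to ship / caution / block using assumed thresholds."""
    if score >= ship_at:
        return "ship"
    if score < block_below:
        return "block"
    return "caution"
```

In CI, a script could compute or fetch the score, call gate(), and exit with a non-zero status on "block" so the deployment step fails.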

Stack and setup

  • Backend: Python 3.11 · FastAPI · Celery · MongoDB · Redis · MinIO · Playwright
  • Frontend: Next.js 15 · React 19 · TypeScript · Tailwind CSS v4
  • AI: pluggable — OpenAI, Anthropic Claude, or Ollama for local models
  • Auth: Firebase or local JWT (no Firebase account needed for dev)

git clone https://github.com/OaktreeInnovations/confidence-gate.git
cd confidence-gate

cp .env.example .env

make up

Open http://localhost:3001 and you're running.

What's next

We're working on four things in order:

  1. Stabilising execution (fewer inconclusive steps, better handling of edge cases)
  2. Better PRD coverage analysis (requirement-level traceability, not just a score)
  3. Browser recording (record your actions → auto-generate test cases)
  4. Test generation from PRD (upload a spec → get a full test suite)

The repo is MIT licensed and open to contributions. If any of this is interesting to you — especially the browser recording or the AI execution engine — come say hi on GitHub.

https://github.com/OaktreeInnovations/confidence-gate
