Most UI tests today still look like this:
- you code the steps,
- you hard-code selectors (IDs, XPath, CSS),
- you pray they don’t break on the next release.
This works… until:
- the accessibility tree is a mess,
- the app runs inside a WebView,
- the UI is legacy or hybrid,
- or there is no reliable locator at all.
At that point, traditional automation just gives up.
I’ve been working on a different approach:
Instead of hard-coding selectors and steps,
let an AI agent build the locator and the action at runtime.
This is what our project, AI Agent, is about.
🚀 AI Agent is open source & early-stage.
If this resonates with you, please ⭐ star the repo:
👉 https://github.com/aidriventesting/Agent
It helps a lot with visibility and future sponsorship.
What is AI Agent?
AI Agent is an open-source project that plugs into your existing tests and tools, and moves the “intelligence” to runtime:
- You give an instruction (in natural language or structured form).
- The agent analyzes the current UI.
- It decides what element to interact with and what action to perform.
- Then it calls Appium / Playwright Robot Framework keywords behind the scenes.
For accessible apps, the agent can still use locators — but it builds them on the fly, instead of you hard-coding them.
For non-accessible apps (no IDs, no labels, weird trees), it can switch to a vision-based mode and work directly from the screenshot.
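To make the contrast concrete, here is a minimal sketch of the same step written with a hard-coded locator and with a runtime instruction. The XPath and app details below are made up purely for illustration:

*** Test Cases ***
Classic Step (Hard-Coded Locator)
    # Breaks as soon as the ID, text, or layout changes.
    Click Element    xpath=//android.widget.Button[@resource-id='com.example:id/login_btn']

Agent Step (Runtime Decision)
    # The agent inspects the current UI and builds the locator (or falls back to vision) on the fly.
    Agent.Do    Tap the login button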
Why vision-based?
Some apps are just not testable with classic locators:
- custom rendering,
- games,
- kiosk / embedded UIs,
- “designer” apps with no semantic structure,
- broken or incomplete accessibility.
In those cases, “find element by ID” is not an option.
That’s where a vision agent comes in:
- it receives a screenshot,
- it detects interactive regions (buttons, inputs, icons…),
- it understands text and layout,
- it chooses where to click / type based on the screen, not the DOM.
Right now, AI Agent integrates OmniParser for this, and the plan is to support more models and eventually a dedicated model tuned for interactive zones in mobile & web UIs.
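As a rough sketch, a vision-driven test could look like the example below. The mode=vision argument and keyword layout are hypothetical illustrations of screenshot-only operation, not the library's confirmed API:

*** Settings ***
Library    AIAgentLibrary

*** Test Cases ***
Checkout In A Canvas-Rendered Kiosk App
    # No usable accessibility tree or DOM here: every decision is made from the screenshot.
    Agent.Do       Tap the cart icon in the top-right corner        mode=vision
    Agent.Do       Tap the "Pay now" button at the bottom           mode=vision
    Agent.Check    The confirmation screen shows an order number    mode=vision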
Two ways to use AI Agent
AI Agent is not “all or nothing”.
You can use it in two complementary ways.
1. Agent mode: Agent.Do and Agent.Check
This is the “agentic” interface.
You give it a step-level goal, and it decides what to do on the current screen:
- Agent.Do → perform an action based on the instruction and the UI
- Agent.Check → verify something visually or semantically on the current screen
Example (simplified Robot Framework style):
*** Settings ***
Library    AIAgentLibrary

*** Test Cases ***
Login With Runtime Agent
    Open Application    my_app
    Agent.Do       Tap the login button
    Agent.Do       Type "user@example.com" into the email field
    Agent.Do       Type "Secret123" into the password field
    Agent.Do       Submit the login form
    Agent.Check    Verify that the home screen is visible and shows the username
No hard-coded XPath.
The agent looks at the UI / accessibility / screenshot and makes a decision in the moment.
Today, this is step-by-step.
The roadmap includes Agent.Autonomous for multi-step flows in one shot.
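Purely as a forward-looking sketch (this keyword does not exist yet), a multi-step flow might then collapse into a single instruction, with the agent planning the intermediate steps itself:

*** Test Cases ***
Login In One Shot (Planned)
    # Hypothetical syntax for the planned Agent.Autonomous keyword.
    Agent.Autonomous    Log in as user@example.com with password Secret123 and verify that the home screen shows the username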
2. AI-in-the-loop tools for any test
You don’t have to rewrite your whole suite to use AI.
AI Agent also provides small, focused keywords/tools that you can drop into any existing test. For example:
- Locate GUI element visually → get bounding box / description of an element on the screen.
- Explain what is on this screen → useful for debugging and test failure analysis.
- Report a bug with visual context → capture screenshot + regions + description.
- Suggest a locator → propose a more robust selector based on the UI.
This “AI in the loop” mode is meant to augment your traditional tests, not replace them.
You keep your framework, your asserts, your structure — and use AI only where it actually helps.
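For example, an existing SeleniumLibrary test could pull in one or two AI keywords without touching the rest. The keyword names and return values below follow the list above but are illustrative; the actual library API may differ:

*** Settings ***
Library    SeleniumLibrary
Library    AIAgentLibrary

*** Test Cases ***
Existing Checkout Test With AI In The Loop
    # Classic, selector-based steps stay exactly as they are.
    Open Browser    https://shop.example.com    chrome
    Input Text      id=search           wireless mouse
    Click Button    id=search-submit
    # AI is only pulled in where a selector is unreliable or extra context helps.
    ${region}=     Locate GUI Element Visually    the "Add to cart" button on the first result
    ${summary}=    Explain What Is On This Screen
    Log    ${summary}
    [Teardown]    Run Keyword If Test Failed    Report A Bug With Visual Context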
How it works (high level)
Under the hood, AI Agent has three main parts:
1. UI understanding from structure
- uses whatever is available: accessibility tree, DOM, widget hierarchy, etc.
- can build locators dynamically and choose good candidates.
2. UI understanding from vision
- uses models like OmniParser (for now) to parse screenshots into blocks, text, regions.
- future: dedicated model for “interactive zones” (tappable, typable, etc.).
3. Decision layer
- takes the current instruction + perceived UI,
- picks a target element and an action,
- dispatches to Appium / WebDriver / Robot Framework.
The focus right now is on reliable per-step decisions, clear logs, and reproducible behavior — not on creating a mysterious black-box “magic agent”.
📢 AI Agent at RoboCon 2026
AI Agent will be presented at RoboCon 2026 in Helsinki, the main Robot Framework community conference:
👉 https://www.robocon.io/agenda/helsinki#what-if-robot-framework-have-a-brain
The talk will explore:
- why runtime locator generation matters,
- how vision-based perception fits into real-world testing,
- how Agent.Do / Agent.Check and the future Agent.Autonomous can live together with classic Robot Framework suites.
If you’re attending RoboCon 2026, come say hi and bring your weirdest UI problems. 😄
Roadmap
Short-term
- Improve runtime locator generation from accessibility / DOM.
- Strengthen the OmniParser integration and add alternative vision backends.
- Provide robust Agent.Do / Agent.Check implementations with good logging.
- Expose useful “AI-in-the-loop” keywords for common use cases:
- visual location,
- smart attachments for bug reports,
- visual checks.
Mid-term
- Agent.Autonomous for multi-step flows.
- A custom model for interactive UI zones.
- Benchmarks for agent-based vs selector-based testing.
- Better support for non-accessible and legacy apps.
⭐ How you can support the project right now
If you want this direction to exist for real and stay open source:
⭐ Star the GitHub repo (this is the most important signal)
👉 https://github.com/aidriventesting/Agent
- Share the project with:
- your QA / automation team,
- anyone fighting with fragile locators,
- people working on vision/agentic testing.
- Open issues with:
- your use cases,
- screenshots of hard-to-test UIs,
- ideas for AI-in-the-loop keywords.
Support the project financially via Open Collective (infra, models, device farms):
👉 https://opencollective.com/ai-testing-agent
Get involved
I’m especially interested in feedback from:
- mobile & web test engineers,
- people dealing with inaccessible / legacy UIs,
- researchers working on UI understanding or agents,
- teams that want to bring “just enough AI” into existing test suites.
Comments, critiques, weird edge cases… all welcome.
Let’s see how far we can push runtime UI automation with an AI agent in the loop.