DEV Community

Christian Potvin

Persona-based testing with AI agents: find the UX gaps your E2E tests can't see

TL;DR: Give an AI agent a user persona and no map of your product. Make it state what it expects before every click. Compare expectation to result. Anything that doesn't match is a UX gap your functional tests will never find. At the end of the session, ask it for a score. The scores are humbling.


The blind spot you can't fix by writing more tests

Your E2E test suite passes. Every assertion is green. The sign-up flow works, the confirmation email arrives, the dashboard loads.

And then a real user lands on your product and gets confused at the third step.

Not because it's broken. Because it wasn't legible.

This is the gap that functional testing can't close. A test asserts that clicking a button triggers an event. It doesn't assert that a new user understood what the button was for before clicking it. That gap lives entirely outside the test runner — and it only surfaces in production, when someone who doesn't think like you tries to use something you built while thinking like yourself.

Persona-based AI testing is an attempt to close that gap before launch.


What persona-based testing actually is

It's not a new testing framework. It's not an AI that automatically generates test cases. It's closer to a think-aloud usability session — the kind a UX researcher would run with a real participant — except the participant is an AI loaded with a specific user identity and given access to your product cold.

The core idea:

  1. Define a persona — who this person is, what they want, how patient they are, how technical they are
  2. Give the agent that persona and a starting URL, nothing else
  3. The agent navigates your product as that person would — reading, clicking, filling forms
  4. Before every action, the agent states what it expects
  5. After every action, it records what actually happened
  6. Any mismatch between expectation and result gets flagged as an unclear behavior
  7. At the end, the agent steps out of role and gives you a score + a list of what would have changed their experience
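
Steps 4 through 6 are the heart of the loop, and they can be sketched as a record-and-compare structure. A minimal sketch in Python; `Session` and `Step` are illustrative names, and the agent/runtime wiring is left out:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    expectation: str   # the "I expect..." sentence stated before the action
    action: str        # what the agent did
    result: str        # the "The result was..." sentence recorded after
    mismatch: bool = False

@dataclass
class Session:
    persona: str
    steps: list = field(default_factory=list)
    flags: list = field(default_factory=list)

    def record(self, expectation, action, result, matched):
        # Every action passes through here, so gaps accumulate automatically.
        step = Step(expectation, action, result, mismatch=not matched)
        self.steps.append(step)
        if step.mismatch:
            self.flags.append(
                f"[UNCLEAR BEHAVIOR]: expected {expectation!r}, got {result!r}"
            )
        return step

session = Session(persona="impatient first-time visitor")
session.record("I expect the headline to say what the product does",
               "read landing page",
               "headline is a slogan, not a description",
               matched=False)
print(len(session.flags))  # → 1
```

The point of the structure is that a mismatch is data, not just a log line: at the end of the session, `session.flags` is the legibility audit.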

The output isn't a pass/fail report. It's a legibility audit from a specific type of user.


The personas that reveal the most

Not all personas are equally useful. The ones that find the most interesting issues are the ones furthest from your own mental model.

The power user of a similar tool — someone who has used a competitor and arrives with assumptions baked in. They'll try keyboard shortcuts that don't exist, expect features in locations where you didn't put them, and flag every terminology difference between your product and what they're used to.

The non-technical professional with a legitimate use case — a lawyer, an accountant, a project manager. They have real reasons to use your product but zero tolerance for developer vocabulary. They will tell you exactly which words stopped them and which interactions felt unsafe.

The impatient first-time visitor — five seconds to understand what this is. Ten seconds to decide whether to continue. Anything that requires reading a paragraph to understand is already a failure for this persona.

The compliance-minded user — this one surprised me the most when I ran it. The persona asks: can I export my data? Can I delete my account? Where is the privacy policy? What data do you store? These aren't power-user concerns. They're table stakes for anyone in a regulated environment — and they're the exact questions most developer-built products can't answer cleanly.


The expectation-before-action loop

This is the mechanism that makes the approach work.

Standard automated tests look like this:

```
do thing → assert result
```

Persona testing looks like this:

```
state expectation → do thing → compare result to expectation → flag gap if mismatch
```

The addition of the expectation step forces the agent to engage with the UI as a reader before engaging with it as an actor. That's what real users do — they read, form a mental model, then act based on that model. When the mental model is wrong, the UX failed.

The prompt instruction I've settled on:

```
Before every click, fill, navigation, or form submission, write one sentence
starting with "I expect...".

After every action, write one sentence starting with "The result was...".

If the result contradicts the expectation in any way, write:
[UNCLEAR BEHAVIOR]: describe what you expected, what happened, and why the gap matters
for someone with this persona.
```

An example of the output from a developer persona testing a product that lacks bulk actions:

```
I expect to be able to select multiple items and delete them in one action,
since I have accumulated a large number of records during testing.

[action: look for a select-all checkbox or bulk action menu]

The result was: no bulk selection exists. Each item has an individual delete button.

[UNCLEAR BEHAVIOR]: For a user managing high volumes — which is the core use case for
a developer in a testing workflow — deleting items one by one is not a viable
interaction. This will feel broken even though it technically works. A checkbox column
and a "Delete selected" action would resolve this entirely.
```

That's not something a functional test would ever flag. The deletion works. The problem is the interaction cost at volume.


Handling registration flows: the email problem

Most products have a sign-up step that requires a real working email address. You want to test the full flow — the confirmation email arrives, the link works, the user lands on the right page. You can't mock this for a persona session; the whole point is to test the real path.

The cleanest solution I've found is to give the agent access to a disposable email MCP server alongside Playwright MCP. The agent creates a fresh inbox as part of the session, uses that address to register, reads the confirmation email natively, and continues — all in the same tool-calling context, no tab-switching, no clipboard gymnastics.

I used this approach when running persona sessions on my own platform, MinuteMail — which, conveniently, is a disposable email service itself. The agent used MinuteMail to generate the throwaway address it needed to register for a MinuteMail account. It's a bit recursive, but it worked cleanly.

The MCP server for MinuteMail is open source at chrptvn/mcp-servers if you want to look at how it's structured. The pattern generalises to any service with an API — the agent tool call looks roughly like:

```
[TOOL: minutemail_create_mailbox]
params: { expiresIn: 15 }
→ { id: "mbx_abc123", address: "qk7r2@minutemail.cc" }

I expect this address to function as a valid email recipient for the sign-up form.

[TOOL: playwright_fill] → input[name="email"] → "qk7r2@minutemail.cc"
[TOOL: playwright_click] → button[type="submit"]

I expect a confirmation email to arrive within 60 seconds.

[TOOL: minutemail_list_messages] → polling mailboxId "mbx_abc123"
→ { items: [{ subject: "Confirm your account", ... }] }

The result was: email arrived in approximately 9 seconds. Subject matched expectation.

[TOOL: minutemail_get_message] → extract confirmation URL from body
[TOOL: playwright_navigate] → confirmation URL
```

The session reads like a coherent narrative. Browser state and inbox state interleave naturally because both are just tool calls to the agent.
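
Outside an agent context, the polling step in the middle of that trace reduces to a small loop. A minimal sketch, assuming a hypothetical `list_messages(mailbox_id)` client function; the names here are illustrative, not MinuteMail's actual API:

```python
import time

def wait_for_message(list_messages, mailbox_id, subject_contains,
                     timeout=60, interval=3):
    """Poll a disposable inbox until a matching message arrives or we give up.

    `list_messages` is whatever client function your inbox API provides;
    it should return a list of dicts with at least a "subject" key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for msg in list_messages(mailbox_id):
            if subject_contains.lower() in msg["subject"].lower():
                return msg
        time.sleep(interval)
    raise TimeoutError(
        f"No message matching {subject_contains!r} within {timeout}s"
    )

# Usage with a fake client standing in for the real MCP/HTTP call:
inbox = {"mbx_abc123": [{"subject": "Confirm your account"}]}
msg = wait_for_message(lambda mid: inbox[mid], "mbx_abc123", "confirm")
print(msg["subject"])  # → Confirm your account
```

Inside an agent session you never write this loop yourself; the agent issues the `minutemail_list_messages` tool call repeatedly on its own. The sketch is only to show how little machinery the step actually needs.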


Real examples: what the personas actually found

I ran three persona sessions on MinuteMail during early development. Here are two findings that led to real product changes.

The developer persona and bulk delete

The developer persona was evaluating MinuteMail as a testing tool for a CI pipeline. The use case: spin up many mailboxes during a test run, clean up afterward.

After creating a dozen test mailboxes, the persona went to clean up:

```
I expect to be able to select all mailboxes and delete them at once, since
managing a large number of test inboxes is the primary developer use case.

[action: look for a select-all or bulk delete control]

The result was: no bulk selection exists. Each mailbox has an individual delete button
in its row.

[UNCLEAR BEHAVIOR]: A developer managing test infrastructure will accumulate many
mailboxes. Deleting them individually is not a workable workflow. This makes the
cleanup step — which is table stakes for any testing tool — unnecessarily painful.
Bulk delete should be a first-class action.
```

I hadn't thought about this because I'd been testing with one or two mailboxes at a time. The persona was right: bulk delete is an obvious need once you look at the product from the developer's workflow. It got built.

The lawyer persona and a cluster of compliance gaps

The compliance-minded persona — a lawyer evaluating the platform for professional use — was the most thorough session I ran. It didn't find one problem. It found a list.

The session produced multiple [UNCLEAR BEHAVIOR] flags across the product, nearly all of them in the same category: the platform didn't give the user enough control over their own data and identity. Account deletion was the most concrete example:

```
I expect to be able to delete my account from the settings page, as this is
a standard requirement for any platform handling personal data.

[action: navigate to settings, look for account deletion]

The result was: no account deletion option exists anywhere in the UI.

[UNCLEAR BEHAVIOR]: The absence of account deletion is a significant concern for any
user in a professional or regulated context. It implies either that data is retained
indefinitely against the user's wishes, or that the platform has not considered its
obligations to users. This alone would prevent adoption in any environment where
data governance matters.
```

But it was one of several flags in the same vein. By the end of the session, the persona had built up a picture of a product that worked technically but hadn't been evaluated through a compliance or data-governance lens at all.

End-of-session UX score: 2/10. The note: "I cannot recommend this platform to clients."

Each issue in the list got addressed. When the same persona session was re-run after the fixes, it scored 9/10.

That delta — from 2 to 9 — came entirely from things that were invisible to functional tests. The tests were green the whole time. What changed was the product's legibility to a specific type of user with specific concerns.

This is also the most useful property of persona-based testing: the issues aren't random. They cluster by persona. The developer persona found workflow efficiency problems. The lawyer persona found data governance problems. Each session is a coherent audit from a specific angle, not a random list of bugs.
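
That clustering is easy to check mechanically: extract the flags from each saved transcript and count them per persona. A minimal sketch, assuming transcripts are saved as plain text in the format shown earlier:

```python
import re

# Matches the flag marker the persona prompt asks the agent to emit.
FLAG = re.compile(r"\[UNCLEAR BEHAVIOR\]:\s*(.+)")

def flags_by_persona(transcripts):
    """transcripts: {persona_name: transcript_text} -> {persona_name: [flag, ...]}"""
    return {persona: FLAG.findall(text) for persona, text in transcripts.items()}

transcripts = {
    "developer": "...\n[UNCLEAR BEHAVIOR]: no bulk delete\n...",
    "lawyer": "[UNCLEAR BEHAVIOR]: no account deletion\n"
              "[UNCLEAR BEHAVIOR]: no data export\n",
}
counts = {p: len(f) for p, f in flags_by_persona(transcripts).items()}
print(counts)  # → {'developer': 1, 'lawyer': 2}
```

Grouping by persona rather than by page is what turns a pile of findings into the per-angle audit described above.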


Setting up a session

You need:

  • A Copilot CLI session (or any agent runtime that supports MCP tool use)
  • Playwright MCP configured
  • A disposable email MCP if your product has a sign-up flow (MinuteMail's is at chrptvn/mcp-servers, or use any inbox API you prefer)
  • A persona prompt
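
The wiring itself is just agent configuration. A hypothetical sketch of the MCP server list; the `minutemail` command here is an assumption, and the exact file name and schema depend on your agent runtime (Playwright MCP is published as `@playwright/mcp`):

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "minutemail": {
      "command": "npx",
      "args": ["minutemail-mcp"]
    }
  }
}
```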

The persona prompt template I use:

```
You are [name], a [role/background].

You have heard about [product] and want to [goal].
You have no prior knowledge of how the platform works.

Your characteristics:
- Technical level: [high / medium / low]
- Patience: [high / medium / low]
- Main concern: [speed / reliability / compliance / cost / simplicity]

Rules for this session:
1. Before every click, fill, navigation, or form submission: write "I expect..."
2. After every action: write "The result was..."
3. If result contradicts expectation: write "[UNCLEAR BEHAVIOR]: " and explain why
   it matters for someone with your background.
4. Do not look for workarounds. If something isn't obvious, flag it and stop.

Starting URL: [your product URL]

At the end of the session, provide:
- A UX score from 1–10
- The single biggest friction point
- One change that would have most improved your confidence in the product
```

Keep the persona description short. The more you tell the agent about your product, the less useful the session is — you want cold navigation, not an informed walkthrough.
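
If you run several personas, it helps to keep them as data and render the template per session. A minimal sketch; the field names simply mirror the template above, and the personas are examples:

```python
TEMPLATE = """You are {name}, a {background}.

You have heard about {product} and want to {goal}.
You have no prior knowledge of how the platform works.

Your characteristics:
- Technical level: {technical}
- Patience: {patience}
- Main concern: {concern}
"""

PERSONAS = [
    {"name": "Dana", "background": "lawyer evaluating tools for client work",
     "goal": "check whether this platform is safe to recommend",
     "technical": "low", "patience": "medium", "concern": "compliance"},
    {"name": "Sam", "background": "developer setting up a CI pipeline",
     "goal": "create and clean up test inboxes programmatically",
     "technical": "high", "patience": "low", "concern": "speed"},
]

def render(persona, product="MinuteMail"):
    # Fill the shared template with one persona's fields.
    return TEMPLATE.format(product=product, **persona)

print(render(PERSONAS[0]).splitlines()[0])
# → You are Dana, a lawyer evaluating tools for client work.
```

Keeping personas as data also makes re-runs reproducible: the 2/10-to-9/10 comparison earlier only means something if both sessions used the identical prompt.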


What this catches that normal testing misses

To be direct about where this fits in a testing strategy:

| What | Caught by | Missed by |
|---|---|---|
| Feature doesn't work | E2E tests | — |
| Feature works but is confusing | Persona testing | E2E tests |
| Terminology is opaque to non-technical users | Persona testing | E2E tests |
| Missing feature obvious to target user | Persona testing | E2E tests |
| Compliance gap (no delete, no export) | Compliance persona | E2E tests |
| Interaction cost at volume (bulk actions) | Power-user persona | E2E tests |

Use both. E2E tests tell you the product works. Persona sessions tell you the product is usable. You need both signals before you ship to real users.


Limitations

A few things this approach won't give you:

It won't simulate real emotional friction. An AI persona can report "this is unclear" but can't feel the frustration of a real person who closes the tab and never comes back.

The prompt shapes the output. If your persona prompt doesn't include compliance as a concern, the agent won't surface compliance gaps. Think carefully about what each persona actually cares about.

It's not a substitute for talking to real users. Persona sessions are fast and cheap — run them before user interviews to filter out the obvious issues. That way your user research time goes toward the subtle stuff.


If you run this on your own product, the non-technical persona will find the most things. Every product I've seen tested this way has at least one piece of vocabulary that makes complete sense to the builder and zero sense to anyone outside their industry. Finding it before launch is worth the hour.


Summary

  • Persona-based AI testing finds legibility gaps that functional tests can't — because tests assert on behavior, not on communication clarity
  • The expectation-before-action loop is the key mechanism: it forces the agent to engage as a reader before acting, the way real users do
  • Different personas surface different categories of issues — developers find workflow friction, compliance-minded users find governance gaps
  • For sign-up flows, pair Playwright MCP with a disposable inbox MCP so the agent handles the full registration sequence without manual intervention
  • Use persona sessions as a fast first pass before user interviews — they filter the obvious, so your research time goes toward the subtle
  • The compliance persona is the most surprising: if your product isn't ready for a lawyer, it's probably not ready for anyone in a professional context

Tags: #testing #ai #webdev #ux

Created: 2026-03-01
