TL;DR: Give an AI agent a user persona and no map of your product. Make it state what it expects before every click. Compare expectation to result. Anything that doesn't match is a UX gap your functional tests will never find. At the end of the session, ask it for a score. The scores are humbling.
The blind spot you can't fix by writing more tests
Your E2E test suite passes. Every assertion is green. The sign-up flow works, the confirmation email arrives, the dashboard loads.
And then a real user lands on your product and gets confused at the third step.
Not because it's broken, but because it isn't legible.
This is the gap that functional testing can't close. A test asserts that clicking a button triggers an event. It doesn't assert that a new user understood what the button was for before clicking it. That gap lives entirely outside the test runner — and it only surfaces in production, when someone who doesn't think like you tries to use something you built while thinking like yourself.
Persona-based AI testing is an attempt to close that gap before launch.
What persona-based testing actually is
It's not a new testing framework. It's not an AI that automatically generates test cases. It's closer to a think-aloud usability session — the kind a UX researcher would run with a real participant — except the participant is an AI loaded with a specific user identity and given access to your product cold.
The core idea:
- Define a persona — who this person is, what they want, how patient they are, how technical they are
- Give the agent that persona and a starting URL, nothing else
- The agent navigates your product as that person would — reading, clicking, filling forms
- Before every action, the agent states what it expects
- After every action, it records what actually happened
- Any mismatch between expectation and result gets flagged as an unclear behavior
- At the end, the agent steps out of role and gives you a score + a list of what would have changed their experience
The output isn't a pass/fail report. It's a legibility audit from a specific type of user.
The personas that reveal the most
Not all personas are equally useful. The ones that find the most interesting issues are the ones furthest from your own mental model.
The power user of a similar tool — someone who has used a competitor and arrives with assumptions baked in. They'll try keyboard shortcuts that don't exist, expect features in locations where you didn't put them, and flag every terminology difference between your product and what they're used to.
The non-technical professional with a legitimate use case — a lawyer, an accountant, a project manager. They have real reasons to use your product but zero tolerance for developer vocabulary. They will tell you exactly which words stopped them and which interactions felt unsafe.
The impatient first-time visitor — five seconds to understand what this is. Ten seconds to decide whether to continue. Anything that requires reading a paragraph to understand is already a failure for this persona.
The compliance-minded user — this one surprised me the most when I ran it. The persona asks: can I export my data? Can I delete my account? Where is the privacy policy? What data do you store? These aren't power-user concerns. They're table stakes for anyone in a regulated environment — and they're the exact questions most developer-built products can't answer cleanly.
The expectation-before-action loop
This is the mechanism that makes the approach work.
Standard automated tests look like this:
do thing → assert result
Persona testing looks like this:
state expectation → do thing → compare result to expectation → flag gap if mismatch
The addition of the expectation step forces the agent to engage with the UI as a reader before engaging with it as an actor. That's what real users do — they read, form a mental model, then act based on that model. When the mental model is wrong, the UX failed.
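The loop is simple enough to sketch as a data structure. This is a minimal illustration, not part of any framework; all names here (`Step`, `unclear_behaviors`) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Step:
    expectation: str   # the "I expect..." sentence, written before acting
    action: str        # the click / fill / navigation performed
    result: str        # the "The result was..." sentence, recorded afterwards
    matched: bool      # did the result match the expectation?

def unclear_behaviors(steps: list[Step]) -> list[Step]:
    """Collect every step whose result contradicted its expectation.
    Each of these becomes an [UNCLEAR BEHAVIOR] flag in the session log."""
    return [s for s in steps if not s.matched]

session = [
    Step("I expect a bulk-delete control", "look for select-all",
         "only per-item delete buttons exist", matched=False),
    Step("I expect a confirmation email within 60s", "poll inbox",
         "email arrived in ~9 seconds", matched=True),
]
print(len(unclear_behaviors(session)))  # → 1
```

The point of the structure is that the expectation is committed to *before* the result exists, so the comparison can't be rationalised after the fact.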
The prompt instruction I've settled on:
Before every click, fill, navigation, or form submission, write one sentence
starting with "I expect...".
After every action, write one sentence starting with "The result was...".
If the result contradicts the expectation in any way, write:
[UNCLEAR BEHAVIOR]: describe what you expected, what happened, and why the gap matters
for someone with this persona.
An example output from a developer persona testing a product that lacks a bulk-action feature:
I expect to be able to select multiple items and delete them in one action,
since I have accumulated a large number of records during testing.
[action: look for a select-all checkbox or bulk action menu]
The result was: no bulk selection exists. Each item has an individual delete button.
[UNCLEAR BEHAVIOR]: For a user managing high volumes — which is the core use case for
a developer in a testing workflow — deleting items one by one is not a viable
interaction. This will feel broken even though it technically works. A checkbox column
and a "Delete selected" action would resolve this entirely.
That's not something a functional test would ever flag. The deletion works. The problem is the interaction cost at volume.
Handling registration flows: the email problem
Most products have a sign-up step that requires a real working email address. You want to test the full flow — the confirmation email arrives, the link works, the user lands on the right page. You can't mock this for a persona session; the whole point is to test the real path.
The cleanest solution I've found is to give the agent access to a disposable email MCP server alongside Playwright MCP. The agent creates a fresh inbox as part of the session, uses that address to register, reads the confirmation email natively, and continues — all in the same tool-calling context, no tab-switching, no clipboard gymnastics.
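Wiring both servers into the session is a client-configuration step. For MCP clients that use the common `mcpServers` JSON shape, the config might look roughly like this — the `minutemail` entry (package name and all) is an assumption for illustration; the real server lives in chrptvn/mcp-servers:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "minutemail": {
      "command": "npx",
      "args": ["-y", "minutemail-mcp"]
    }
  }
}
```

Once both are registered, the agent sees the browser tools and the inbox tools in the same tool list and can interleave them freely.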
I used this approach when running persona sessions on my own platform, MinuteMail — which, conveniently, is a disposable email service itself. The agent used MinuteMail to generate the throwaway address it needed to register for a MinuteMail account. It's a bit recursive, but it worked cleanly.
The MCP server for MinuteMail is open source at chrptvn/mcp-servers if you want to look at how it's structured. The pattern generalises to any service with an API — the agent tool call looks roughly like:
[TOOL: minutemail_create_mailbox]
params: { expiresIn: 15 }
→ { id: "mbx_abc123", address: "qk7r2@minutemail.cc" }
I expect this address to function as a valid email recipient for the sign-up form.
[TOOL: playwright_fill] → input[name="email"] → "qk7r2@minutemail.cc"
[TOOL: playwright_click] → button[type="submit"]
I expect a confirmation email to arrive within 60 seconds.
[TOOL: minutemail_list_messages] → polling mailboxId "mbx_abc123"
→ { items: [{ subject: "Confirm your account", ... }] }
The result was: email arrived in approximately 9 seconds. Subject matched expectation.
[TOOL: minutemail_get_message] → extract confirmation URL from body
[TOOL: playwright_navigate] → confirmation URL
The session reads like a coherent narrative. Browser state and inbox state interleave naturally because both are just tool calls to the agent.
Real examples: what the personas actually found
I ran three persona sessions on MinuteMail during early development. Here are two findings that led to real product changes.
The developer persona and bulk delete
The developer persona was evaluating MinuteMail as a testing tool for a CI pipeline. The use case: spin up many mailboxes during a test run, clean up afterward.
After creating a dozen test mailboxes, the persona went to clean up:
I expect to be able to select all mailboxes and delete them at once, since
managing a large number of test inboxes is the primary developer use case.
[action: look for a select-all or bulk delete control]
The result was: no bulk selection exists. Each mailbox has an individual delete button
in its row.
[UNCLEAR BEHAVIOR]: A developer managing test infrastructure will accumulate many
mailboxes. Deleting them individually is not a workable workflow. This makes the
cleanup step — which is table stakes for any testing tool — unnecessarily painful.
Bulk delete should be a first-class action.
I hadn't thought about this because I'd been testing with one or two mailboxes at a time. The persona was right: bulk delete is an obvious requirement once you look at the product from the developer's workflow. It got built.
The lawyer persona and a cluster of compliance gaps
The compliance-minded persona — a lawyer evaluating the platform for professional use — was the most thorough session I ran. It didn't find one problem. It found a list.
The session produced multiple [UNCLEAR BEHAVIOR] flags across the product, nearly all of them in the same category: the platform didn't give the user enough control over their own data and identity. Account deletion was the most concrete example:
I expect to be able to delete my account from the settings page, as this is
a standard requirement for any platform handling personal data.
[action: navigate to settings, look for account deletion]
The result was: no account deletion option exists anywhere in the UI.
[UNCLEAR BEHAVIOR]: The absence of account deletion is a significant concern for any
user in a professional or regulated context. It implies either that data is retained
indefinitely against the user's wishes, or that the platform has not considered its
obligations to users. This alone would prevent adoption in any environment where
data governance matters.
But it was one of several flags in the same vein. By the end of the session, the persona had built up a picture of a product that worked technically but hadn't been evaluated through a compliance or data-governance lens at all.
End-of-session UX score: 2/10. The note: "I cannot recommend this platform to clients."
Each issue in the list got addressed. When the same persona session was re-run after the fixes, it scored 9/10.
That delta — from 2 to 9 — came entirely from things that were invisible to functional tests. The tests were green the whole time. What changed was the product's legibility to a specific type of user with specific concerns.
This is also the most useful property of persona-based testing: the issues aren't random. They cluster by persona. The developer persona found workflow efficiency problems. The lawyer persona found data governance problems. Each session is a coherent audit from a specific angle, not a random list of bugs.
Setting up a session
You need:
- A Copilot CLI session (or any agent runtime that supports MCP tool use)
- Playwright MCP configured
- A disposable email MCP if your product has a sign-up flow (MinuteMail's is at chrptvn/mcp-servers, or use any inbox API you prefer)
- A persona prompt
The persona prompt template I use:
You are [name], a [role/background].
You have heard about [product] and want to [goal].
You have no prior knowledge of how the platform works.
Your characteristics:
- Technical level: [high / medium / low]
- Patience: [high / medium / low]
- Main concern: [speed / reliability / compliance / cost / simplicity]
Rules for this session:
1. Before every click, fill, navigation, or form submission: write "I expect..."
2. After every action: write "The result was..."
3. If result contradicts expectation: write "[UNCLEAR BEHAVIOR]: " and explain why
it matters for someone with your background.
4. Do not look for workarounds. If something isn't obvious, flag it and stop.
Starting URL: [your product URL]
At the end of the session, provide:
- A UX score from 1–10
- The single biggest friction point
- One change that would have most improved your confidence in the product
Keep the persona description short. The more you tell the agent about your product, the less useful the session is — you want cold navigation, not an informed walkthrough.
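If you run many sessions, it helps to fill the template programmatically so the persona stays short and nothing about the product leaks in. A minimal sketch — the function and its parameters are hypothetical, just a rendering of the template above:

```python
def persona_prompt(name: str, role: str, product: str, goal: str,
                   technical: str, patience: str, concern: str,
                   url: str) -> str:
    """Render the persona template. Deliberately injects nothing about
    the product beyond its name and the starting URL: cold navigation."""
    return (
        f"You are {name}, a {role}.\n"
        f"You have heard about {product} and want to {goal}.\n"
        "You have no prior knowledge of how the platform works.\n"
        f"Technical level: {technical}. Patience: {patience}. "
        f"Main concern: {concern}.\n"
        f"Starting URL: {url}\n"
    )

prompt = persona_prompt("Dana", "contracts lawyer", "MinuteMail",
                        "evaluate it for client use", "low", "medium",
                        "compliance", "https://minutemail.cc")
print("no prior knowledge" in prompt)  # → True
```

Swapping only the characteristics line between runs makes it easy to compare how different personas react to the identical starting point.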
What this catches that normal testing misses
To be direct about where this fits in a testing strategy:
| What | Caught by | Missed by |
|---|---|---|
| Feature doesn't work | E2E tests | — |
| Feature works but is confusing | Persona testing | E2E tests |
| Terminology is opaque to non-technical users | Persona testing | E2E tests |
| Missing feature obvious to target user | Persona testing | E2E tests |
| Compliance gap (no delete, no export) | Compliance persona | E2E tests |
| Interaction cost at volume (bulk actions) | Power-user persona | E2E tests |
Use both. E2E tests tell you the product works. Persona sessions tell you the product is usable. You need both signals before you ship to real users.
Limitations
A few things this approach won't give you:
It won't simulate real emotional friction. An AI persona can report "this is unclear" but can't feel the frustration of a real person who closes the tab and never comes back.
The prompt shapes the output. If your persona prompt doesn't include compliance as a concern, the agent won't surface compliance gaps. Think carefully about what each persona actually cares about.
It's not a substitute for talking to real users. Persona sessions are fast and cheap — run them before user interviews to filter out the obvious issues. That way your user research time goes toward the subtle stuff.
If you run this on your own product, the non-technical persona will find the most things. Every product I've seen tested this way has at least one piece of vocabulary that makes complete sense to the builder and zero sense to anyone outside their industry. Finding it before launch is worth the hour.
Summary
- Persona-based AI testing finds legibility gaps that functional tests can't — because tests assert on behavior, not on communication clarity
- The expectation-before-action loop is the key mechanism: it forces the agent to engage as a reader before acting, the way real users do
- Different personas surface different categories of issues — developers find workflow friction, compliance-minded users find governance gaps
- For sign-up flows, pair Playwright MCP with a disposable inbox MCP so the agent handles the full registration sequence without manual intervention
- Use persona sessions as a fast first pass before user interviews — they filter the obvious, so your research time goes toward the subtle
- The compliance persona is the most surprising: if your product isn't ready for a lawyer, it's probably not ready for anyone in a professional context
Tags: #testing #ai #webdev #ux
Created: 2026-03-01