Mobile tests are where the bugs actually live. A signup flow that works on an iPhone 15 falls apart on a lower-end Android because the keyboard pushes a button off-screen. A push notification mid-flow leaves the app in a state nothing else reproduces. Memory pressure on a four-year-old Android does things you can't make a simulator do.
I wrote simulator-only tests anyway, for years. Real-device runs took ten to thirty minutes per cycle, the device farm queue was unpredictable, and CI on physical hardware was expensive enough that someone always wanted to talk about cost. So tests got written for the simulator, the device-specific bugs found their way to production, and we caught them in Sentry instead of in CI.
The thing that changed wasn't a better device farm or a faster CI. It was Claude Code writing the tests, BrowserStack MCP driving the devices, and the whole loop closing inside a single session. Claude writes a test, BrowserStack runs it on an actual iPhone, Claude reads the failure (screenshot, view tree, error trace) and fixes the test. Then it runs again. The cycle is under a minute. I never opened the test file.
That's the post. The tools matter, but what changed test authoring was the loop closing fast enough that I could stay in it.
The shape that changed
The mental model is simple. Old loop: write a test, push, wait for CI, read logs, guess at the failure, fix, push again. Best case ten minutes per iteration, often half an hour. Most of the wait is travel time: uploading the build, queuing for a device, tearing down. Almost none of it is the test itself.
New loop: Claude writes a test, the BrowserStack MCP runs it on a real device that's already provisioned, Claude reads the structured failure (screenshot, accessibility tree, console output, error stack), edits the test, runs it again. End-to-end under a minute on the happy path.
The speed isn't the value. The value is that the loop stays tight enough to stay in. Ten-minute iterations mean you context-switch out and come back cold. Sub-minute iterations mean the same problem is still in your head when the next result lands. You think about the test, not about what you were doing while you waited for CI.
The MCP is mostly invisible in this. It's just "Claude can drive real devices the way it drives anything else."
What the loop actually looks like
The clearest example was when I asked Claude to build an onboarding flow test from scratch: fresh install through to the home screen, with sign-up, email verification, and a multi-step intro along the way.
I described it in plain English. Claude wrote the first version: a script with the steps, selectors pulled from the screenshot it took on app launch, and an assertion at each milestone. It started a session on an iPhone 15, ran the test, and the test failed at step two.
The failure was a selector ambiguity. There were two elements matching [name="email"]: the visible field and a hidden form used for autofill detection. Claude saw both in the accessibility tree, picked the wrong one, and the typed text vanished into the hidden one. The test waited for the "verify your email" screen and timed out.
Claude fixed the selector to getByRole('textbox', { name: 'Email address' }) and re-ran. Past step two. Failed at step four. The third onboarding screen has an animated transition, and the tap on the next button fired before the animation finished, so the tap landed on a different element underneath. Claude added a wait on the transition end. Past step four. Eventually green.
Then I asked it to run the same test on a lower-end Android. Same code, different device. Failed on step three. The email verification deep link opens differently on Android. The Gmail app intercepts before the browser does, and the test was driving Chrome. Claude rewrote the verification step to use an alternative verification strategy for tests instead of clicking the link in the email. Re-ran on both devices. Green on both.
This whole sequence took maybe fifteen minutes. I didn't open a test file. I read every diff in the chat, said "yes" or "wait, why" three or four times, and the test ended up in the repo. The thing that made it work wasn't any single capability. It was that I could stay in the loop the whole time.
Where my role moved
I wasn't writing test code. I was reading it.
The most useful thing I did was catch the moments where Claude "fixed" a test by making it less strict. The pattern I learned to watch for: a test starts failing intermittently, Claude adds a wait longer than the original timeout, the test goes green. The wait isn't fixing flakiness. The wait is hiding it.
The cleanest example was a test that asserted the home screen rendered within 200ms after login. It started failing on the lower-end Android. Claude bumped the wait to ten seconds, the test passed, and we moved on. I caught it in review: the home screen was actually taking five seconds to render on that device because of an image-decode regression that had landed two days earlier. The test was supposed to catch exactly that. The "fix" deleted the catch.
This is the part the "LLM writes the tests" pitch always undersells. Claude is fast enough at authoring that it'll write a hundred tests in the time it would take me to write ten. Most of those tests are fine. Plenty pass because they assert close to nothing. If you don't read them, you ship a green suite that catches nothing, which is worse than no suite at all.
The bottleneck moved from authoring to judgment. Authoring is the part Claude is fast at. Judgment (does this test catch the right thing, is this assertion meaningful, is this wait masking a regression) is the part that's still mine. The work didn't disappear. It changed shape.
The MCP setup is where the time went
The loop only works when the MCP layer is reliable. When it isn't, you spend the day debugging the harness instead of writing tests.
The first day was almost entirely setup. BrowserStack credentials, MCP server config, tool permissions in Claude Code, the allowed device pool, network rules for the test backend. Each piece is documented somewhere. None of them are documented in the same place. I had four tabs open the whole time.
The tool surface is the bigger thing to plan for. The BrowserStack MCP exposes a finite set of actions: start session, run command, take screenshot, fetch logs. Most of what you want maps onto that cleanly. A few things don't. Driving a native permission dialog on iOS isn't a clean MCP action; you end up wrapping platform-specific helpers or carving a backdoor into the app for tests. Anything biometric is off the table without that backdoor.
Session lifetime caught me twice. BrowserStack sessions time out after a fixed window, devices disconnect under load, and the queue backs up during US business hours. The loop has to survive a session dying mid-test. I added a thin wrapper that re-acquires the device when the MCP returns a session-ended error and retries from the last clean assertion. That wrapper isn't optional. Without it, every long test run had a 30% chance of dying for no reason related to the code under test.
Permissions in Claude Code itself were the smaller surprise. By default, Claude asks before each tool call. The fiftieth time it asks to call the same BrowserStack action is the moment you regret it. I set a permission rule allowing that specific tool without prompts, and the loop got noticeably tighter. Skip this and you spend the day clicking "yes."
None of that is the loop changing test authoring. It's the boring tax for getting there. But it's a real day or two of setup, and skipping it in this post would be misleading. The pitch is "Claude writes the tests"; the reality is "Claude writes the tests after you spend a day making the harness work."
What you actually get
Coverage we wouldn't have written by hand. The cost of writing a test dropped enough that the question stopped being "is this worth a test" and started being "is this worth reading." We added tests for flows we'd been ignoring for years (the edge paths, the rare-but-real failure modes) because the marginal cost was small enough not to argue about.
Real-device coverage in CI, instead of simulator-only with prayers. The lower-end Android catches the things the iPhone 15 doesn't. The 200ms-versus-five-second image regression I mentioned earlier would have shipped if the test had stayed on a simulator.
The hidden cost is the one I keep coming back to: you have to read what Claude writes. If you don't, you accumulate tests that pass without testing anything, and the suite stops being a signal. The discipline shifts from authoring to reviewing. Most teams assume whoever writes the test is the one reviewing it. That works when writing is the slow step. When writing is fast and reviewing is the slow step, the assumption breaks. I haven't seen many teams notice yet.
It's not autonomous. The day it is, the discipline that survives won't be writing tests. It'll be deciding which ones still mean something.
Top comments (0)