DEV Community

Cover image for We Spent 3 Weeks Debugging Appium Flakiness on BrowserStack. The Fix Wasn't the Selectors.
Sriharsha Donthireddy
Sriharsha Donthireddy

Posted on • Originally published at Medium

We Spent 3 Weeks Debugging Appium Flakiness on BrowserStack. The Fix Wasn't the Selectors.

We were running mobile automation on BrowserStack. About 25% of our login tests failed randomly. After dropping global implicit waits, isolating test data, and modifying a critical timeout capability, failures plummeted to 5%.

For three straight weeks, our mobile login suite was gaslighting us.

Out of 160 tests, roughly 40 would fail on every single CI run. “Element not clickable.” “Timeout waiting for element.” The errors looked like classic locator issues, so we did what any desperate team does: we audited our XPaths, polished our selectors, and cranked up the automatic retries to mask the bleeding.

The build would occasionally go green on a third retry, and we’d tell ourselves it was just “cloud environment latency.”

We were just covering up a slow leak.

The Setup
Our configuration was pretty standard for a modern mobile QA pipeline:

Framework: Appium 2.x with WebDriverIO and Jest
Infrastructure: Real devices on BrowserStack (primarily iPhone 14 and Android 13)
CI/CD: Triggered via GitHub Actions on every pull request
The failures always clustered around the login page. An iPhone 14 would time out waiting for the dashboard, or an Android device would sit on a blank screen after authentication.

The most infuriating part? The issue only appeared on BrowserStack real devices and never reproduced consistently on local simulators or emulators.

That made every debugging session feel like chasing a ghost.

What We Tried First (The Wrong Path)
We wasted the first two weeks treating symptoms instead of the disease.

First, we let global timeout creep win. We increased our global implicit waits from 10 seconds to 30 seconds. It didn’t fix the flakiness — it just made the eventual failures take three times longer to run.

Then came the arbitrary browser.pause() band-aids. We started scattering manual sleep timers before critical clicks. Hardcoded pauses are an automation sin — they assume mobile networks are predictable. They aren't.

When that failed, we set up blind retries. We configured GitHub Actions to re-run failed specs up to three times. This masked the problem enough for deployments to continue, but it completely destroyed our trust in the automation suite.

The Breakthrough
At this point, we stopped tweaking selectors and started looking at infrastructure, session behavior, and application state.

The breakthrough came when we compared BrowserStack session logs side-by-side against raw Appium server logs.

Several failures ended with a session termination before the next command was even issued. We repeatedly saw messages similar to:

[Appium] Closing session because no new commands were received within the timeout window
That was the first clue that the issue wasn’t our selectors at all. Something was happening before the next test command reached Appium.

The Real Culprits

  1. newCommandTimeout Gaps When running tests on remote cloud infrastructure, there can be brief and unpredictable communication gaps between the test runner and the remote Appium server during session setup or deep application transitions.

Appium’s newCommandTimeout determines how long the server waits before assuming the client has gone silent. We discovered that slow cloud-device initialization and occasional gaps between commands were sometimes exceeding our timeout window.

When that happened, Appium terminated the session — and the resulting errors looked almost identical to ordinary element failures.

What looked like a selector issue was often a dead session.

  1. Test Data Pollution We also discovered that our tests weren’t as isolated as we thought.

Some scenarios created users and modified application state without fully cleaning up afterward. If one test failed halfway through execution, another test could inherit that state. Sometimes the login screen was skipped entirely. Other times a session-expired modal appeared unexpectedly.

The failures looked random because the starting state wasn’t consistent.

  1. Implicit Waits Were Hiding Problems Our large global implicit waits were making failures harder to diagnose. Instead of failing at the exact point where synchronization broke down, the test would spend additional time waiting and eventually fail somewhere else.

The longer waits made debugging slower while providing almost no additional stability.

The Fix

  1. Updated the Appium Capabilities javascript
capabilities: [{
    platformName: 'iOS',
    'appium:deviceName': 'iPhone 14',
    'appium:platformVersion': '16',
    'appium:automationName': 'XCUITest',
    'appium:newCommandTimeout': 60,
    'appium:dontStopAppOnReset': true
Enter fullscreen mode Exit fullscreen mode

}]
This didn’t solve every issue by itself, but it eliminated a significant source of session instability.

  1. Replaced Implicit Waits with Explicit Conditions javascript
`// Before
await loginButton.click();
// After
await loginButton.waitForDisplayed();
await loginButton.waitForClickable();
await loginButton.click();
const dashboardHeader = await $('#dashboard-payload');
await dashboardHeader.waitForDisplayed({ timeout: 10000 });`
Failures became easier to understand because they occurred at the exact synchronization point that broke.
Enter fullscreen mode Exit fullscreen mode
  1. Enforced Test Data Isolation javascript
describe('Mobile Login Suite', () => {
    let uniqueUser;
    beforeEach(async () => {
        const timestamp = Date.now();
        uniqueUser = `test_user_${timestamp}@company.com`;
        await api.seedTestData(uniqueUser);
    });
    afterEach(async () => {
        await api.cleanupTestData(uniqueUser);
    });
});
Enter fullscreen mode Exit fullscreen mode

This removed cross-test contamination and made failures reproducible.

The Metrics
MetricBeforeAfterFlakiness / Failure Rate~25% (40 of 160 tests)~5%Average Build Duration18 minutes12 minutesTeam Response to Failures”Just rerun it”Investigate immediately

More importantly, developers started trusting the automation results again. That was the real win.

The Bottom Line
Flaky tests are a cultural tax on engineering teams.

The moment developers stop trusting the automation suite, they stop reading the logs, stop investigating failures, and start treating CI results as background noise.

If your mobile automation is failing randomly on cloud devices, don’t assume the problem is your XPath. Look at your session configuration. Look at your synchronization strategy. Look at your test data isolation.

The root cause might not be in the test itself. It might be in the infrastructure around it.

What’s the strangest root cause you’ve ever found behind a flaky automation test? Drop it in the comments — I’d love to hear the stories.

Top comments (0)