<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: daniel-octomind</title>
    <description>The latest articles on DEV Community by daniel-octomind (@daniel-octomind).</description>
    <link>https://dev.to/daniel-octomind</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1016598%2F578c29bb-32c5-4c77-b68a-52783af461d5.png</url>
      <title>DEV Community: daniel-octomind</title>
      <link>https://dev.to/daniel-octomind</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/daniel-octomind"/>
    <language>en</language>
    <item>
      <title>Organize automated tests without getting eaten by your devs</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 13 Oct 2025 13:10:14 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/organize-automated-tests-without-getting-eaten-by-your-devs-2e35</link>
      <guid>https://dev.to/daniel-octomind/organize-automated-tests-without-getting-eaten-by-your-devs-2e35</guid>
      <description>&lt;p&gt;There’s not much a developer hates more than a pipeline blocked by a flaky test. Well, maybe having to refactor someone’s legacy code, but pipeline delays are right up there.&lt;/p&gt;

&lt;p&gt;They’re on their own deadlines, and every minute counts. If the blocker is a real bug, no one argues - better to stop a bad merge than fire-drill it in production. But if the blocker turns out to be a flaky test? That’s when things get… heated. Faster than you can say “sorry,” your end-to-end tests get yanked from the pipeline and you’re back to hoping the right bugs get caught before prod. Not a great look.&lt;/p&gt;

&lt;p&gt;And yet, ignoring tests isn’t an option either. If you’ve ever had something critical fail in production because “oh yeah, that test was turned off,” you know the pain. Tests need to run, and they need to run as early in the pipeline as possible. Anything else is just a slow slide into “we’ll fix it later” - a.k.a. never.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do you keep your pipeline trustworthy without triggering a developer revolt?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s the strategy we use at Octomind that’s been keeping our devs (mostly) happy, our pipeline (mostly) green, and our releases (mostly) bug-free.&lt;/p&gt;

&lt;h2&gt;
  
  
  The e2e testing setup: An hourglass, not a pyramid
&lt;/h2&gt;

&lt;p&gt;Forget the textbook “testing pyramid.” Our shape looks more like an hourglass.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thousands of unit tests:&lt;/strong&gt; These are quick to write, quick to run, and cover the bulk of our low-level logic. Modern AI helps here - generating solid starting points for many cases - but we still make sure they test &lt;em&gt;useful&lt;/em&gt; things, not just pad our coverage stats. Unit tests are our wide base, and they don’t slow anyone down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A pinch of integration tests:&lt;/strong&gt; Just a handful. They’re trickier to write, more prone to timing issues, and not something we want to balloon in number. They cover only the most critical cross-component behaviors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A generous layer of e2e tests:&lt;/strong&gt; This is where things get interesting. We use our own tooling to create, run, and maintain them. Contributions come from all over - developers, QA, even business folks. The key here isn’t the quantity, it’s the &lt;em&gt;stability&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The OctoQA rotation
&lt;/h2&gt;

&lt;p&gt;We don’t have a dedicated QA engineer, but we consider software quality an essential part of every Octoneer’s work. To keep stability front and center, we introduced a rotating role: &lt;strong&gt;OctoQA&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each week, a different person wears the OctoQA hat. Their mission: monitor and manage our “non-pipeline” e2e tests.&lt;/p&gt;

&lt;p&gt;Why non-pipeline? Because fresh e2e tests are often flaky at first. Not because the code is bad, but because test setup is hard - isolation issues, data dependencies, timing quirks. Even seasoned test writers don’t always nail it on the first try.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test quarantine process
&lt;/h2&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A new e2e test is written → it does &lt;em&gt;not&lt;/em&gt; go straight into the blocking pipeline.&lt;/li&gt;
&lt;li&gt;Instead, it runs in staging and as part of nightly scheduled runs.&lt;/li&gt;
&lt;li&gt;Each morning, OctoQA reviews the results:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fails?&lt;/strong&gt; Sent back to the author for fixes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passes 10 consecutive runs?&lt;/strong&gt; Promoted to the next environment, eventually graduating to the pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This “quarantine first, promote later” approach means our blocking pipeline stays green for the right reasons - not because we stripped all the tests out of it, but because only proven and &lt;em&gt;stable tests&lt;/em&gt; make it in.&lt;/p&gt;
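&lt;p&gt;The promotion rule above boils down to a small piece of logic. Here’s a minimal sketch of it - the 10-consecutive-passes threshold is the one described above, but the function and names are illustrative, not Octomind’s actual code:&lt;/p&gt;

```javascript
// Decide what OctoQA should do with a quarantined e2e test, given its most
// recent run results (newest last). The 10-pass streak is from the article;
// everything else here is an illustrative sketch.
const PROMOTION_STREAK = 10;

function nextQuarantineAction(recentResults) {
  if (recentResults.length === 0) return 'keep-in-quarantine';
  // Any fresh failure goes straight back to the author.
  if (recentResults[recentResults.length - 1] === 'fail') return 'send-back-to-author';
  // Count the trailing streak of consecutive passes.
  let streak = 0;
  for (let i = recentResults.length - 1; i >= 0; i--) {
    if (recentResults[i] !== 'pass') break;
    streak++;
  }
  return streak >= PROMOTION_STREAK ? 'promote' : 'keep-in-quarantine';
}
```

&lt;p&gt;The same check effectively runs in reverse after promotion: a test that turns flaky in the pipeline is demoted back into this loop.&lt;/p&gt;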

&lt;h2&gt;
  
  
  What happens if a pipeline test fails without cause?
&lt;/h2&gt;

&lt;p&gt;It rarely happens, but even after promotion, a test can occasionally turn flaky. In that case, we don’t let it torture developers. It’s immediately pulled back into quarantine for investigation. Once fixed and stable again, it can rejoin the main pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;p&gt;After running this system for a while, a few truths have become obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trust is everything:&lt;/strong&gt; The fastest way to kill a pipeline’s credibility is to have it cry wolf all the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New tests need a proving ground:&lt;/strong&gt; “Write it → ship it” is fine for unit tests, but e2e tests need to earn their way into the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared responsibility works:&lt;/strong&gt; Rotating ownership means no one can shrug and say “not my problem.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It’s easier to promote a good test than to repair end-to-end testing’s broken reputation:&lt;/strong&gt; Once devs stop trusting your tests, getting them to take failures seriously again is a long road.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So yes, the pipeline still blocks when it has to. But it blocks for real bugs, not false alarms. That’s how you avoid both broken prod releases and angry dev mobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Rödler&lt;/strong&gt;&lt;br&gt;
Chief Product Officer and Co-founder&lt;/p&gt;

</description>
      <category>testing</category>
      <category>qa</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why devs need AI-powered e2e test automation</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Thu, 04 Sep 2025 09:50:15 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/why-devs-need-ai-powered-e2e-test-automation-2ded</link>
      <guid>https://dev.to/daniel-octomind/why-devs-need-ai-powered-e2e-test-automation-2ded</guid>
      <description>&lt;p&gt;QA-suggested changes to development workflows often get blowback from engineering teams. Introducing AI-powered end-to-end (E2E) testing is no different. For one, it’s tests - not all developers are thrilled about testing as a concept, especially when it comes to end-to-end tests. And then the AI part? Difficult to digest.&lt;/p&gt;

&lt;p&gt;Here are 6 reasons why we believe automated, AI-powered e2e tests belong in every development pipeline - for the benefit of both the code and the people who build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Tests that run automatically are helpful for EVERYONE (as long as they run in reasonable time)
&lt;/h2&gt;

&lt;p&gt;Manual testing is like manual linting or manual type-checking: it's prone to human error and, honestly, it just doesn't scale. Automatic execution of tests within your CI pipeline ensures reliability. Without automation, many devs think, "my change isn't significant, so I'll skip testing" - cue broken code and unhappy engineering. Automation means your tests always run. Consistency is king, and automatic testing is how you get there.&lt;/p&gt;

&lt;p&gt;Of course, execution speed matters: developers move fast and can’t wait hours for a test suite run to finish. Well-constructed test runners that parallelize without elaborate setup are your best friends.&lt;/p&gt;
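&lt;p&gt;In Playwright terms, that mostly comes down to a few configuration switches. A minimal &lt;code&gt;playwright.config.js&lt;/code&gt; sketch - the specific values here are illustrative assumptions, not recommendations:&lt;/p&gt;

```javascript
// playwright.config.js - run test files in parallel and keep CI feedback fast.
const config = {
  fullyParallel: true,                      // parallelize tests within each file, too
  workers: process.env.CI ? 4 : undefined,  // pin the worker count on CI runners
  retries: process.env.CI ? 1 : 0,          // one CI retry to absorb rare infrastructure blips
};

module.exports = config;
```

&lt;p&gt;Whatever runner you use, the point stands: parallel by default, with feedback in minutes rather than hours.&lt;/p&gt;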

&lt;h2&gt;
  
  
  2. The true cost of bugs discovered later
&lt;/h2&gt;

&lt;p&gt;It's no secret: bugs found later in the software development cycle are exponentially more expensive. Fixing a bug in staging can cost 5-6 times more than if caught in development, and in production it's at least 10x - possibly even catastrophic if you lose users or revenue. Automatic E2E tests find issues early, saving your team from costly context switching, extra ticket overhead, blocked releases, and late-night firefighting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n36r9n0bvmohw9slg3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n36r9n0bvmohw9slg3n.png" alt="relative cost to fix bugs depending on the time of detection - graph" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://deepsource.com/blog/exponential-cost-of-fixing-bugs" rel="noopener noreferrer"&gt;graph&lt;/a&gt; and &lt;a href="https://www.nist.gov/system/files/documents/director/planning/report02-3.pdf" rel="noopener noreferrer"&gt;NIST figures&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. “Blocking merges isn’t bad - it’s brilliant!” Daniel Draper, lead developer
&lt;/h2&gt;

&lt;p&gt;Sure, blocking merges with automated tests might initially frustrate the dev who just wants to "get their code out there." But trust us: everyone appreciates not being woken up at 2 am because of a production outage. Automated tests act as gatekeepers, preserving everyone's sleep, weekend plans, and sanity. It might seem restrictive at first, but engineering teams soon realize that good QA saves their future selves from unnecessary headaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Optimizing your tests isn’t optional - it’s essential
&lt;/h2&gt;

&lt;p&gt;Devs know it, QA engineers know it: tests are production code. Treating tests with the same care and rigor as your application code ensures maintainability and scalability. Optimizing tests for speed and parallel execution might feel like extra work initially, but a fast, highly parallel test suite means quicker feedback, higher developer productivity, and more confident deployments. And AI-powered testing platforms make this optimization significantly easier and more effective.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Testing isn’t just QA’s job - it’s everyone’s job
&lt;/h2&gt;

&lt;p&gt;Quality assurance isn't solely about catching bugs. It's about building a mindset where everyone contributes to quality. Tests shouldn't be something developers dread - they should be part of an engineering culture that embraces high-quality standards. Encouraging devs to write and manage tests at appropriate levels (think: testing pyramid) helps them produce better, more reliable code. Tests not only help identify issues but also foster a shared sense of responsibility for quality across the team.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Not leveraging AI in testing while using AI everywhere else is a missed opportunity
&lt;/h2&gt;

&lt;p&gt;Everyone who has ever used an LLM to generate a more complicated end-to-end test knows how inconsistent the outcomes are. But “using AI for testing” doesn’t just mean throwing some code at an agent and letting it generate a test. State-of-the-art testing platforms use AI in many different ways - taking advantage of its strengths while keeping critical functions deterministic.&lt;/p&gt;

&lt;p&gt;If you don’t use AI in your testing workflow, you risk settling for slower delivery cycles, higher costs, more bugs slipping into production, and potentially burnt-out dev teams dealing with avoidable emergencies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Rödler&lt;/strong&gt;&lt;br&gt;
Chief Product Officer and Co-founder&lt;/p&gt;

</description>
      <category>testing</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why we finally allowed arbitrary waits in our tests</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Tue, 05 Aug 2025 13:54:40 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/why-we-finally-allowed-arbitrary-waits-in-our-tests-2egf</link>
      <guid>https://dev.to/daniel-octomind/why-we-finally-allowed-arbitrary-waits-in-our-tests-2egf</guid>
      <description>&lt;p&gt;*reposting a fellow octoneer’s article&lt;/p&gt;

&lt;p&gt;For years we had a firm rule: no arbitrary sleeps in Octomind tests. Whenever someone asked for them, we pushed back. A hard-coded wait only papers over real bugs - and how do you even choose the “right” number? Too short and the test still flakes; too long and the whole suite drags while the bug stays hidden.&lt;/p&gt;

&lt;p&gt;We felt pretty proud of that stance… until we broke it. So what changed?&lt;/p&gt;

&lt;h2&gt;
  
  
  The users who changed our minds
&lt;/h2&gt;

&lt;p&gt;Two customers arrived with the same head-scratcher: bugs on the page under test that broke test automation, while rage-clicking users just brushed them off.&lt;/p&gt;

&lt;p&gt;When the first user landed on the page under test for the first time, Playwright did the obvious thing: waited for &lt;code&gt;DOMContentLoaded&lt;/code&gt; and dismissed the &lt;strong&gt;“Accept cookies”&lt;/strong&gt; button. Closing that overlay coincided with the end of the page’s first hydration cycle, so everything was ready and the click succeeded.&lt;/p&gt;

&lt;p&gt;The trouble appeared on the &lt;em&gt;next&lt;/em&gt; navigation. Because the cookie banner is a one-shot component, it never shows up again, so there was nothing left to dismiss while the framework was still hydrating the page. During that window, the DOM looked complete in the markup, but the JavaScript listeners that make it interactive weren’t attached yet. Any click events fired into limbo.&lt;/p&gt;

&lt;p&gt;Humans barely notice: they click a button once; if nothing happens, they click again - and by then hydration has finished and the second click works. Automation, however, isn’t so forgiving - &lt;strong&gt;Playwright clicks once and expects success&lt;/strong&gt;. From the test runner’s point of view the button is “unclickable,” and the entire suite becomes flaky.&lt;/p&gt;
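&lt;p&gt;If you control the test code, one pragmatic workaround is to make the automation behave like that impatient human. A hedged sketch - the helper name is ours, not a Playwright API - where the action should bundle the click together with a check that it actually took effect:&lt;/p&gt;

```javascript
// Try an async action (e.g. a click plus the check that it worked), and if the
// page swallows it mid-hydration, pause briefly and try again - the "click
// twice" behavior humans do for free.
async function retryAction(action, attempts = 2, delayMs = 500) {
  let lastError;
  for (let i = attempts; i > 0; i--) {
    try {
      return await action();
    } catch (err) {
      lastError = err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// In a Playwright test this might look like (selector illustrative):
//   await retryAction(async () => {
//     await page.click('text=Accept cookies');
//     await page.waitForSelector('.cookie-banner', { state: 'hidden', timeout: 2000 });
//   });
```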

&lt;p&gt;Digging in, we found the culprit: the site uses &lt;strong&gt;Nuxt&lt;/strong&gt; with the &lt;em&gt;nuxt-delay-hydration plug-in&lt;/em&gt;. To win Lighthouse points, the plug-in deliberately delays hydration, leaving a half-alive DOM that ignores clicks. Great for scores, terrible for test runners.&lt;/p&gt;

&lt;p&gt;In other words, the real bug isn’t the flaky test; it’s that the page lets the user interact &lt;em&gt;before&lt;/em&gt; it’s actually ready. The app should either finish hydration faster or block pointer events until it’s done. But when the dev team has “bigger fish to fry” and testers still need reliable automation, that’s where a well-placed, deterministic wait comes in.&lt;/p&gt;
&lt;h2&gt;
  
  
  Waiting - with intent
&lt;/h2&gt;

&lt;p&gt;Here’s the surprise twist: the plug-in also exposes a lifesaver - &lt;code&gt;window._$delayHydration&lt;/code&gt;, a promise that resolves when hydration finishes. Instead of guessing a timeout, we could:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.waitForFunction(() =&amp;gt; window._$delayHydration);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No arbitrary sleep, no hidden bugs - just a deterministic gate that says: “OK, the page is ready; click away.” We wrapped that in a &lt;code&gt;wait for&lt;/code&gt; step and shipped it.&lt;/p&gt;

&lt;p&gt;Now frameworks that surface a hydration promise get first-class support, our users get reliable tests, and we still sleep well at night.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Fine - let’s hide that bug for you”
&lt;/h2&gt;

&lt;p&gt;Our second case came from a team testing a classic &lt;strong&gt;magic-link login&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter email → press send code&lt;/li&gt;
&lt;li&gt;Open the “Here’s your one-time code” email&lt;/li&gt;
&lt;li&gt;Copy the code, paste it back in the page&lt;/li&gt;
&lt;li&gt;Celebrate as the user is logged in&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Humans breeze through this. Our automation… not so much. Roughly 70% of runs failed with &lt;code&gt;invalid one-time code&lt;/code&gt;. We replayed the run in our UI: right email, right timestamp, code typed perfectly. Nothing obvious.&lt;/p&gt;

&lt;p&gt;It was time to check the Playwright Trace Viewer for more details. There, hidden in the network tab, we saw it: the POST request sent &lt;code&gt;oneTimeCode: null&lt;/code&gt; even though the input showed the correct value. After a few experiments we found that if we waited ~3 seconds before filling the email field in step 1, the bug never appeared.&lt;/p&gt;

&lt;p&gt;Classic timing issue. The fix belongs in the app, but the responsible dev team couldn’t reproduce it outside automation and, frankly, had other stuff to do. Meanwhile, QA needed a working login test &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;So we asked ourselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is three seconds reliable? (Yes.)&lt;/li&gt;
&lt;li&gt;Does it unblock the customer? (Absolutely.)&lt;/li&gt;
&lt;li&gt;Does it risk future flakiness? (Maybe, but we’ll monitor.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: we added a &lt;code&gt;wait for fixed time&lt;/code&gt; option to the &lt;code&gt;wait for&lt;/code&gt; step that pauses exactly as long as the user specifies. It’s a band-aid, sure, but it keeps their CI green while the real bug sits in the backlog - and that, for now, is the difference between testing and not testing at all.&lt;/p&gt;
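&lt;p&gt;Under the hood, a fixed wait is nothing exotic - just a timed promise. In raw Playwright you’d reach for &lt;code&gt;page.waitForTimeout(3000)&lt;/code&gt;; as a standalone helper it’s a one-liner (the name is illustrative):&lt;/p&gt;

```javascript
// Resolve after a fixed number of milliseconds - the entire mechanism behind
// a "wait for fixed time" step.
function waitForFixedTime(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// In the magic-link test, the band-aid is simply (selector illustrative):
//   await waitForFixedTime(3000);          // let the page settle
//   await page.fill('#email', userEmail);  // then fill the email field
```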

&lt;h2&gt;
  
  
  What we learned
&lt;/h2&gt;

&lt;p&gt;Inside your own repo, you can stay pure: spot the timing issue, fix the code, push to prod, done. But once you’re shipping a testing platform for others, the equation changes. Testers aren’t always sitting next to the developers who own the bug - and even if they are, business priorities win over elegance. &lt;/p&gt;

&lt;p&gt;So we’ve learned to balance principle with pragmatism:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idealism:&lt;/strong&gt; Root‐cause every failure and fix it at the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality:&lt;/strong&gt; Octomind users sometimes lack code access, dev bandwidth, or both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle ground:&lt;/strong&gt; Offer a surgical wait that unblocks the workflow while the real fix makes its way through the backlog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not the romantic story we once told ourselves, but it keeps releases moving. And if that wait turns out to be unnecessary tomorrow - great, we delete it. Until then, it’s the tiny compromise that saves the day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kosta Welke&lt;/strong&gt;&lt;br&gt;
code monkey at Octomind&lt;/p&gt;

</description>
      <category>testing</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A programmer yelling at the clouds about vibe coding</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Wed, 25 Jun 2025 13:18:08 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/a-programmer-yelling-at-the-clouds-about-vibe-coding-2m3j</link>
      <guid>https://dev.to/daniel-octomind/a-programmer-yelling-at-the-clouds-about-vibe-coding-2m3j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Vibe coding, noun&lt;br&gt;
&lt;em&gt;Writing computer code in a somewhat careless fashion, with AI assistance&lt;/em&gt;&lt;br&gt;
~ &lt;a href="https://www.merriam-webster.com/slang/vibe-coding" rel="noopener noreferrer"&gt;Merriam Webster&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vibe coding is all the rage these days, but is that really the way we want to go as an industry? &lt;/p&gt;

&lt;p&gt;I’m neither the &lt;a href="https://www.justfuckingcode.com/" rel="noopener noreferrer"&gt;first&lt;/a&gt; nor the &lt;a href="https://antirez.com/news/153" rel="noopener noreferrer"&gt;second&lt;/a&gt; guy to buck the trend, but maybe the viewpoint of someone at a testing startup THAT USES AI is interesting to someone.&lt;/p&gt;

&lt;h2&gt;
  
  
  State-of-the-art LLMs
&lt;/h2&gt;

&lt;p&gt;We all have heard the marketing spins of AI writing &lt;a href="https://www.forbes.com/sites/jackkelly/2024/11/01/ai-code-and-the-future-of-software-engineers/" rel="noopener noreferrer"&gt;25% of code at Google&lt;/a&gt;, or replacing &lt;a href="https://x.com/bradmenezes/status/1927414638632735069" rel="noopener noreferrer"&gt;entire product departments&lt;/a&gt;, but in my personal experience these models are just &lt;strong&gt;not quite&lt;/strong&gt; ready to be let loose completely unchecked.&lt;/p&gt;

&lt;p&gt;I can (and regularly do) let AI write some tiny prototypes or auto-complete my mock data in tests. But I regularly run into roadblocks whenever I ask the AI to actually help me with a task I am struggling with myself. The auto-completion of boilerplate is nice, but it’s not what costs me most of my time in day-to-day development work.&lt;/p&gt;

&lt;p&gt;Just last week I was struggling with the old &lt;a href="https://gist.github.com/joepie91/bca2fda868c1e8b2c2caf76af7dfcad3" rel="noopener noreferrer"&gt;‘ESM hell’&lt;/a&gt; problem in an intricate “Playwright being executed in a Node subprocess” setup - no matter which LLM I prompted, the answers weren’t better than whatever I found on StackOverflow. And that is completely understandable: the LLM is only working through its training corpus…&lt;/p&gt;

&lt;p&gt;And of course I admit, this is a problem only a handful of people on the planet probably have (maybe even no one else - I suspect it ALSO had to do with our way of linking the pnpm node_modules), but that’s my point: the time spent on the long tail of edge cases is absolutely not comparable with the “easy” boilerplate cases. One costs me days, the other a few seconds. I would LOVE to use AI for the hard problems, but I just can’t currently.&lt;/p&gt;


&lt;h2&gt;
  
  
  Reviewing AI code isn’t fun
&lt;/h2&gt;

&lt;p&gt;Not only does AI fail to solve my hardest problems - letting models run loose on your codebase was also the cause of &lt;a href="https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my_new_hobby_watching_ai_slowly_drive_microsoft/" rel="noopener noreferrer"&gt;this hilarious reddit post&lt;/a&gt;, where people pointed out some of the funny, but also sad, PRs that GitHub Copilot opened on the .NET runtime repo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fogds7z9uy83igwsisx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9fogds7z9uy83igwsisx.png" alt="redditor comment on AI generated code" width="800" height="327"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://www.reddit.com/r/ExperiencedDevs/comments/1krttqo/my_new_hobby_watching_ai_slowly_drive_microsoft/mtgbl3c/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The comment encapsulates a lot of my feelings as well - if I need to &lt;em&gt;actually&lt;/em&gt; review the AI code, I have to completely understand the problem, and given the current state of the art, much more thoroughly than I would for my colleagues’ code.&lt;/p&gt;

&lt;p&gt;Check out this absolute insanity in a GitHub comment chain:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin6x739mc74d98xzlntu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin6x739mc74d98xzlntu.png" alt="a github comment chain with copilot hallucination" width="800" height="781"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://github.com/dotnet/runtime/pull/115733#issuecomment-2894165573" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing your vibe coded mess
&lt;/h2&gt;

&lt;p&gt;So how does it all relate to testing? Well, I think that testing an AI model’s output with another AI, without human interaction, is a recipe for disaster. I think there’s an argument to be made that QA should potentially be the &lt;strong&gt;last thing&lt;/strong&gt; we replace with a fully autonomous agent.&lt;/p&gt;

&lt;p&gt;If you think about it, the most important thing about an app is, of course, that it works “according to spec.” So if a fully AI-generated test is checking the AI-generated app, we end up in prime meme territory:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjy1g28spaluvwr0rg1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjy1g28spaluvwr0rg1c.png" alt="3 spiderman meme mocking ai code generation / design / QA&amp;lt;br&amp;gt;
" width="800" height="735"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There MUST be a human involved in checking the tests. It can of course be AI-assisted, but the final approval must lie with a human. I think this argument is fairly intuitive by itself, but if it isn’t, the science agrees - &lt;a href="https://en.wikipedia.org/wiki/Model_collapse" rel="noopener noreferrer"&gt;model collapse&lt;/a&gt; is a real and related problem: if an AI is only trained on, or receives feedback from, other AIs, the aggregated error gets worse over time.&lt;/p&gt;

&lt;p&gt;It’s kind of like how, if you ask AI to repeatedly recreate the same image, you get &lt;a href="https://www.reddit.com/r/ChatGPT/comments/1kawcng/i_went_with_recreate_the_image_as_closely_to/" rel="noopener noreferrer"&gt;hilarious results&lt;/a&gt;. I personally would NOT want my app to be tested only by another AI - but what about you?&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what now?
&lt;/h2&gt;

&lt;p&gt;In our team at Octomind, some people are more bullish on the whole AI hype train than me and some are less so. But overall we see AI as a very valuable tool - and only a tool. Something that CAN help you if used correctly and in the right place, and that can also cause issues when let run rampant.&lt;/p&gt;

&lt;p&gt;You can be both excited about a technology and careful about putting it to use in a way that brings actual value. This is not a case of cognitive dissonance but a corrective to both extremes of the AI debate.&lt;/p&gt;




&lt;p&gt;If you like my thought process, feel free to check out our app - it’s free, and you can see how we try to improve the e2e testing process with sprinkles of AI, good UI and visualizations, and smart tricks like automatically picking good, stable locators for you.&lt;/p&gt;

&lt;p&gt;And if you think my take isn’t nuanced enough, that I’m missing something, or you just want to chat about tests, AI, or general startup engineering, you can find me at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://x.com/Germandrummer92" rel="noopener noreferrer"&gt;@Germandrummer92&lt;/a&gt; on X&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/germandrummer92" rel="noopener noreferrer"&gt;@Germandrummer92&lt;/a&gt; on github&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/daniel"&gt;@daniel&lt;/a&gt;.draper on the &lt;a href="https://discord.gg/Qudf6vyA" rel="noopener noreferrer"&gt;octomind discord&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Daniel Draper&lt;/strong&gt;&lt;br&gt;
Lead Octoneer at Octomind&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>testing</category>
      <category>ai</category>
    </item>
    <item>
      <title>AI testing: IDEs vs. testing platforms</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 02 Jun 2025 12:25:46 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/ai-testing-ides-vs-testing-platforms-563p</link>
      <guid>https://dev.to/daniel-octomind/ai-testing-ides-vs-testing-platforms-563p</guid>
      <description>&lt;p&gt;When are you better off with an agentic IDE scripting e2e tests and when would you use an agentic testing platform?&lt;/p&gt;

&lt;p&gt;We took our &lt;a href="https://www.octomind.dev/" rel="noopener noreferrer"&gt;Octomind testing platform&lt;/a&gt; and ran some user experiments with devs writing &amp;amp; running end-to-end tests. We wanted to compare the two approaches to increasing productivity in testing.&lt;/p&gt;

&lt;p&gt;We used the &lt;a href="https://www.cursor.com/" rel="noopener noreferrer"&gt;Cursor IDE&lt;/a&gt; as a benchmark and had it generate Playwright tests, since Octomind uses Playwright test code under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started: Tool setup simplicity vs. flexibility
&lt;/h2&gt;

&lt;p&gt;It’s the classic dev-heavy vs. low-code dichotomy. &lt;strong&gt;Octomind&lt;/strong&gt; works with an effortless, one-click setup and off-the-shelf support for test setup (environments, variables, etc.). Simply provide your web app's URL, and the platform is ready to roll. It's ideal for teams who value speed and ease, letting you jump straight into testing without worrying about detailed configuration.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;strong&gt;Cursor&lt;/strong&gt; offers traditional software-project flexibility - great if you’re an experienced developer comfortable with version control, Node.js setup, and IDE configuration. This freedom is beneficial, but setup requires more upfront effort and knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Composing test cases: Intuition vs. precision
&lt;/h2&gt;

&lt;p&gt;Crafting test cases in &lt;strong&gt;Octomind&lt;/strong&gt; feels intuitive for users of different skill levels - describe your scenarios in natural language and let the AI agent handle the rest. The test recorder is similarly frictionless. You don't need in-depth coding knowledge; it’s decently accessible for teams beyond software developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt;, on the other hand, uses AI to assist you in generating Playwright tests directly within your IDE. While powerful, Cursor’s effectiveness heavily depends on your domain knowledge and familiarity with coding tests. This offers precision, but less experienced users may face a steep learning curve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging: Visual insight vs. code analysis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; provides an integrated recorder and visual step editor to fix tests. Debugging is a guided, mostly visual experience - you quickly see and correct exactly where tests deviate.&lt;/p&gt;

&lt;p&gt;Debugging in &lt;strong&gt;Cursor&lt;/strong&gt; requires a deeper dive. With no immediate visual aids, you must run tests, monitor execution, identify issues, and manually adjust code. It's precise, but depending on your coding skill it can be a slog until you pinpoint the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flexibility and freedom: Structured ease vs. infinite possibilities
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; offers a comprehensive yet structured environment built atop Playwright. While it covers most testing scenarios - including complex ones like OTP and 2FA flows - it inherently restricts you to its features and integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; allows for the opposite - unlimited freedom. You're free to leverage any Playwright capability, ideal if your team thrives on customization and has expert knowledge. However, such liberty demands significant testing experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test structure &amp;amp; management: Organized vs. manual management
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; provides descriptive prompts, structured steps, screenshots, and built-in management capabilities (folders, tags, AI-driven search). It’s an out-of-the-box, yet prescriptive, approach.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Cursor&lt;/strong&gt;, structure depends entirely on your team's organizational skill. While it can produce incredibly tailored code structures, poorly managed tests can quickly become overwhelming, especially as your suite scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution: Built-in vs. DIY
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Octomind&lt;/strong&gt; comes with a built-in test runner. Execution is seamless - everything from environments to CI/CD integrations and scheduling is automated. It also handles complex issues like Nuxt hydration, shared authentication, and geo-based proxies, which can quickly become painful when DIYed.&lt;/p&gt;

&lt;p&gt;To be fair to &lt;strong&gt;Cursor&lt;/strong&gt; - it is an IDE - not an end-to-end testing platform. It doesn’t inherently handle execution logistics - it’s not what it’s built for. The responsibility for configuring CI/CD, environment management, parallel test execution, and other advanced setups lies entirely with your team, demanding substantial testing expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintenance: Guided assistance vs. code debugging
&lt;/h2&gt;

&lt;p&gt;To keep the entire testing workload in check, &lt;strong&gt;Octomind&lt;/strong&gt; offers features for easy maintenance. It identifies root causes of failures, visually compares failed runs to successful ones, and offers auto-maintenance solutions. It can do this because it built and ran the tests.&lt;/p&gt;

&lt;p&gt;Maintenance with &lt;strong&gt;Cursor&lt;/strong&gt; means manually inspecting and debugging tests. Although the AI can assist, since it has access to your codebase, resolving complex issues still requires manual effort and advanced debugging skills.&lt;/p&gt;




&lt;h2&gt;
  
  
  Verdict: When to use which?
&lt;/h2&gt;

&lt;p&gt;This comparison might seem like a stretch - Octomind and Cursor are two very different AI-assisted tools. However, their application areas overlap - both are used in the app testing process. We discuss this often with our users, who are also fervent users of AI IDEs. That’s why I decided to write this summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Octomind / testing platform&lt;/strong&gt; if your goal is a complete, streamlined testing solution that significantly reduces effort and allows broader team participation. Its built-in intelligence and intuitive maintenance workflows make it perfect for teams prioritizing productivity and ease of use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Cursor / AI-powered IDE&lt;/strong&gt; if your priority is flexibility, customization, and absolute control over your testing processes. It’s perfect for seasoned developers comfortable navigating advanced test setups, deep debugging, and manual management. You would also need to invest a decent amount of time to set up the infrastructure for test execution and sort out advanced testing use cases, like 2FA or email flows, which come off the shelf in Octomind.&lt;/p&gt;

&lt;p&gt;In the end, your decision depends on your preference - each tool caters distinctly to different team profiles and testing philosophies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Roedler&lt;/strong&gt;&lt;br&gt;
Co-founder &amp;amp; CPO and the occasional guest contributor on the other Daniel's blog&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Test isolation is (not) fun</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Wed, 23 Apr 2025 10:43:40 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/test-isolation-is-not-fun-4hj7</link>
      <guid>https://dev.to/daniel-octomind/test-isolation-is-not-fun-4hj7</guid>
      <description>&lt;p&gt;Writing a single test is fun. Writing a test suite that scales is… sometimes less fun.&lt;/p&gt;

&lt;p&gt;The more tests you write, the less it becomes a matter of individual tests and the more a matter of designing the whole testing system. There are countless decisions to be made.&lt;/p&gt;

&lt;p&gt;The bad news is that even if you are a well-seasoned tester, practices from one company may not be directly applicable to another. Every application is different, and each therefore requires a different strategy. There really is no universal best practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu6v8mai2wcknfj0e7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuu6v8mai2wcknfj0e7q.png" alt="X comment on best practices" width="800" height="281"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://x.com/chantastic/status/1440770690559340548" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good news is that there are at least some areas whose challenges many have faced before - challenges we can learn from. In this blog post, I’d like to discuss one particular category of them: test isolation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Browser isolation
&lt;/h2&gt;

&lt;p&gt;There are many test automation tools that offer automatic script generation. Playwright has &lt;a href="https://playwright.dev/docs/codegen" rel="noopener noreferrer"&gt;codegen features&lt;/a&gt;, Cypress has its &lt;a href="https://docs.cypress.io/app/guides/cypress-studio" rel="noopener noreferrer"&gt;Studio&lt;/a&gt;, Selenium &lt;a href="https://www.selenium.dev/selenium-ide/" rel="noopener noreferrer"&gt;offers a full IDE&lt;/a&gt;. There are even whole products based on the idea of recording and replaying tests.&lt;/p&gt;

&lt;p&gt;While these tools are fun to use and can be great for learning the basics, they quickly fall short when it comes to repeatedly using the code they generate.&lt;/p&gt;

&lt;p&gt;Even just re-running a newly recorded test can be a problem, because the data created in the first recording may get in the way of the second run.&lt;/p&gt;

&lt;p&gt;But it’s not just about data. Modern websites use cookies, local storage, indexed databases and other forms of local in-browser storage that provide important context to the application’s frontend code or servers.&lt;/p&gt;

&lt;p&gt;The simplest example of this is personalized cookie settings. These settings are usually saved in (you guessed it) cookies, but once a user opens a new browser or enters incognito mode, they are gone.&lt;/p&gt;

&lt;p&gt;This same problem translates to test automation. Usually, testing frameworks clean up the browser state to avoid polluting the context of individual tests. But as the example above shows, there are many situations where this might actually become a problem.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;describe('Cookie Consent Tests', () =&amp;gt; {
    test('new user sees cookie banner', async () =&amp;gt; {
        await page.goto('https://yourpage.com');
        // A new user should see the cookie consent banner
        const banner = await page.locator('.cookie-banner');
        expect(await banner.isVisible()).toBe(true);
    });

    test('returning user does not see banner', async () =&amp;gt; {
        // First visit to set the cookie
        await page.goto('https://yourpage.com/')
        await page.click('.cookie-banner-accept');

        // Simulate returning to the site
        await page.reload();
        const banner = await page.locator('.cookie-banner');
        expect(await banner.isVisible()).toBe(false);
    });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice this means that there are two different perspectives, or user flows, that need to be tested in order to get good coverage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;brand new user experience&lt;/li&gt;
&lt;li&gt;the experience of returning users&lt;/li&gt;
&lt;/ol&gt;
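&lt;p&gt;One way to cover the returning-user perspective without replaying the whole accept flow each time is to pre-seed the consent cookie into a fresh browser context. A minimal sketch - the cookie name and value here are assumptions, so inspect your own app to find the real ones:&lt;/p&gt;

```javascript
// Build the cookies a "returning user" would already carry.
// NOTE: 'cookie_consent' / 'accepted' are assumed names - check your app's actual cookie.
function returningUserCookies(domain) {
    return [{
        name: 'cookie_consent',
        value: 'accepted',
        domain,
        path: '/',
        // expires is a Unix timestamp in seconds; set it far ahead to avoid expiry flakiness
        expires: Math.floor(Date.now() / 1000) + 60 * 60 * 24 * 365,
    }];
}

// Playwright usage (sketch):
// const context = await browser.newContext();
// await context.addCookies(returningUserCookies('yourpage.com'));
// const page = await context.newPage();
// await page.goto('https://yourpage.com'); // consent banner should stay hidden
```

&lt;p&gt;This keeps the "new user" flow and the "returning user" flow in fully isolated contexts instead of chaining them through shared browser state.&lt;/p&gt;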

&lt;p&gt;There’s a tendency to think in user scenarios when creating test automation, but even the same scenario can have different outcomes given a difference in context.&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;h2&gt;
  
  
  Test isolation
&lt;/h2&gt;

&lt;p&gt;When it comes to test isolation, I feel like it’s one of those principles that every test automation beginner learns about. A test should not interfere with any other tests. Simple as that.&lt;/p&gt;

&lt;p&gt;But then when it comes to reality, it’s &lt;em&gt;&lt;strong&gt;much&lt;/strong&gt;&lt;/em&gt;, much harder to stick to this. Let’s take a simple example where we want to test a to-do app. We want to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;create an item&lt;/li&gt;
&lt;li&gt;delete an item&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In principle, these should of course be two separate tests. But you can see how tempting it is to merge them together to speed up the execution. Because after all, if we really want to isolate the second test, we’ll need to create an item anyway.&lt;/p&gt;

&lt;p&gt;‍&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// tests depend on each other
describe('Todo List', () =&amp;gt; {
    test('create todo item', async () =&amp;gt; {
        await createTodo('Buy milk');
    });
    test('delete todo item', async () =&amp;gt; {
        await deleteTodo('Buy milk');
    });
});

// each test is independent
describe('Todo List', () =&amp;gt; {
    test('can create new todo item', async () =&amp;gt; {
        const todoText = 'Buy milk';
        await createTodo(todoText);
        const newItem = await page.locator('.todo-item', { hasText: todoText });
        expect(await newItem.isVisible()).toBe(true);
    });

    test('can delete existing todo item', async () =&amp;gt; {
        // Setup: Create item specifically for this test
        const todoText = 'Delete me';
        await createTodo(todoText);

        // Actual deletion test
        await deleteTodo(todoText);
        const deletedItem = await page.locator('.todo-item', { hasText: todoText });
        expect(await deletedItem.isVisible()).toBe(false);
    });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real life, we of course deal with much more complex scenarios, but the decisions to be made are similar in principle. When deciding whether to merge tests, create dependencies, or fully isolate them, I personally always side with isolation.&lt;/p&gt;

&lt;p&gt;This is mostly because I want to be able to run tests in parallel. While full test isolation might slightly increase the runtime of individual tests, parallelization will drastically decrease the runtime of the whole suite.&lt;/p&gt;

&lt;p&gt;It is good to consider parallelization from day one of creating test automation. It’s much more complicated to achieve it if tests are not properly isolated. But what is proper isolation?&lt;/p&gt;

&lt;p&gt;Consider testing an &lt;strong&gt;arbitrary SaaS platform&lt;/strong&gt; - the most basic entity that decides how the page looks is usually a single user account. So for fully parallel test execution, each parallel process must run as a different user. When running 10 parallel processes, we must create 10 testing accounts.&lt;/p&gt;
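&lt;p&gt;A common way to get one account per parallel process is to derive the account deterministically from the worker index. A hedged sketch - the naming scheme and the &lt;code&gt;backend.createUser&lt;/code&gt; helper are assumptions:&lt;/p&gt;

```javascript
// Derive a deterministic, unique test account per parallel worker.
// The naming scheme is an assumption - adapt it to your backend's constraints.
function accountForWorker(workerIndex, runId) {
    return {
        username: `e2e_w${workerIndex}_${runId}`,
        email: `e2e_w${workerIndex}_${runId}@example.com`,
    };
}

// Playwright worker-scoped fixture usage (sketch):
// const test = base.extend({
//     account: [async ({}, use, workerInfo) => {
//         const account = accountForWorker(workerInfo.workerIndex, process.env.RUN_ID || 'local');
//         // await backend.createUser(account); // hypothetical helper
//         await use(account);
//     }, { scope: 'worker' }],
// });
```

&lt;p&gt;In Playwright, a worker-scoped fixture like the commented one runs once per worker, so 10 workers yield 10 independent accounts.&lt;/p&gt;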

&lt;p&gt;In the case of an &lt;strong&gt;e-commerce application&lt;/strong&gt;, things might get a bit more complicated: even if we create separate customer accounts, we still need to deal with the items available in the store. If they run out, the test automation will unexpectedly fail. In this situation, the basic entity is the store itself. Before test execution, the data of the whole store must be ready to “fulfil the orders” of the whole test run. This can get pretty complicated, but it is a &lt;a href="https://www.octomind.dev/blog/testing-is-more-about-setup-than-scripts" rel="noopener noreferrer"&gt;huge part of creating good test automation&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Data isolation
&lt;/h2&gt;

&lt;p&gt;When dealing with test automation, data is being thrown around everywhere. That’s what &lt;em&gt;&lt;strong&gt;should&lt;/strong&gt;&lt;/em&gt; happen. When your test automation resembles the way your app is going to be used, it will create, modify and delete data in the process. The main question is how to design a test suite so that this data movement does not become a problem for either your system or your test stability. Here are some of the concerns, to name just a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creating user accounts with proper permissions&lt;/li&gt;
&lt;li&gt;creating user data&lt;/li&gt;
&lt;li&gt;preparing the environment for the system under test&lt;/li&gt;
&lt;li&gt;preparing assets, e.g. images or pdf documents to use during test execution&lt;/li&gt;
&lt;li&gt;cleaning up data and data teardown&lt;/li&gt;
&lt;li&gt;isolating data between individual tests&lt;/li&gt;
&lt;li&gt;isolating data between sequential test runs of the same test&lt;/li&gt;
&lt;li&gt;isolating data between parallel runs of the same test
‍&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data isolation problem ties all of the previous problems together. While it might be tempting to reuse existing data across tests, this approach can become a hellish nightmare once tests run in parallel. Wherever possible, I’d advise going for as much data isolation as you can.&lt;/p&gt;

&lt;p&gt;One of the common approaches is to create a &lt;strong&gt;data factory pattern&lt;/strong&gt; that generates isolated data for each test. For example, when testing a user profile feature, rather than sharing a single test account, your data factory might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const createTestUser = async (prefix: string = ''): Promise&amp;lt;User&amp;gt; =&amp;gt; {
    const user = {
        username: `test_user_${prefix}_${Date.now()}`,
        email: `test_${prefix}_${Date.now()}@example.com`,
        preferences: {
            theme: 'light', 
            notifications: 'enabled'
        }
    };
    await backend.createUser(user);
    return user;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures that each test has its own unique data set, eliminating potential conflicts between parallel test runs. However, when debugging, it’s beneficial to have a way to track the data that was created. The example above randomizes the username and email using &lt;code&gt;Date.now()&lt;/code&gt;, which can make them a bit hard to find while debugging. An effective way of solving this is to have the test run produce an output that stores these names.&lt;/p&gt;

&lt;p&gt;While creating isolated data is important, cleaning up that data is equally crucial. Test data can accumulate and cause various problems. If your tests run highly parallel and on every code change, databases quickly fill up, slowing down the system under test and potentially even leading to increasing costs.&lt;/p&gt;

&lt;p&gt;There are often discussions about whether data cleanup should happen before test execution starts or after it finishes. I’d argue for doing it before test execution, ideally completely isolated from the test script itself. That way you can still access the test environment when debugging after a specific run, and a potentially failed teardown from before won't affect your next run.&lt;/p&gt;

&lt;p&gt;All things considered, data isolation strategies should evolve with your test suite. It’s valuable to plan ahead, but it’s pretty much impossible to plan everything as the application under test evolves. Premature optimization can be a costly endeavor, with questionable results, so your test suite and isolation maturity should grow together with your product needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  How do we isolate our tests?
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://www.octomind.dev/blog/test-isolation-is-not-fun" rel="noopener noreferrer"&gt;Octomind&lt;/a&gt;, we ensure test isolation with a variation of the data factory pattern from before. The core idea is that every test run generates new, unique data - each entity (such as documents, users, or transactions) is created with distinct identifiers and names. This means that even if multiple tests rely on an operation like "create a new document," each test instance will produce an independent document when executed, preventing unintended dependencies or data collisions.&lt;/p&gt;

&lt;p&gt;This approach enables parallel execution since tests are not competing for the same resources, significantly reducing overall test runtime. Moreover, it makes running tests against multiple environments seamless, as each test case remains self-contained and does not introduce cross-environment conflicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Draper&lt;/strong&gt;&lt;br&gt;
Lead engineer at Octomind&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>testing</category>
    </item>
    <item>
      <title>AI doesn’t belong in test runtime</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 10 Mar 2025 10:30:49 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/ai-doesnt-belong-in-test-runtime-4haa</link>
      <guid>https://dev.to/daniel-octomind/ai-doesnt-belong-in-test-runtime-4haa</guid>
      <description>&lt;p&gt;Adopting generative AI in end-to-end testing improves test coverage and reduces time spent on testing. At last, automating all those manual test cases seems within reach, right? &lt;/p&gt;

&lt;p&gt;We have to talk about the stability and reliability of AI in this context, too. The concerns are real and I’d like to address a few here. &lt;/p&gt;

&lt;h2&gt;
  
  
  Testing tools don’t use AI in the same way
&lt;/h2&gt;

&lt;p&gt;AI testing tools use AI differently to write, execute and maintain automated tests. The LLMs under the hood are not deployed in the same place in the same way by every testing technology. The AI can be deployed as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;codegen:&lt;/strong&gt; You give the AI access to your code to generate test code in the desired automation framework. These are the Copilots and Cursors, or ChatGPT when you prompt the LLM to create test code for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agents:&lt;/strong&gt; AI is used to interact with an application like a human would. This usually happens in 2 ways: &lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI used in runtime:&lt;/strong&gt; LLM goes into the app and interacts with it every time a test or a test step is executed. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deterministic code used in runtime:&lt;/strong&gt; LLMs are used to create interaction representations that translate into deterministic code used during test execution.&lt;br&gt;
‍&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
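&lt;p&gt;The second approach can be pictured as a compile step: the LLM emits a structured interaction representation once, and a plain translator turns it into deterministic framework code, so no model is called at runtime. A toy sketch - the step schema below is my invention, not any tool's actual format:&lt;/p&gt;

```javascript
// Translate a structured step representation (produced once by an LLM)
// into deterministic Playwright code. No model involved at execution time.
function compileStep(step) {
    switch (step.action) {
        case 'goto':
            return `await page.goto(${JSON.stringify(step.url)});`;
        case 'click':
            return `await page.click(${JSON.stringify(step.selector)});`;
        case 'fill':
            return `await page.fill(${JSON.stringify(step.selector)}, ${JSON.stringify(step.value)});`;
        case 'expectVisible':
            return `await expect(page.locator(${JSON.stringify(step.selector)})).toBeVisible();`;
        default:
            throw new Error(`Unknown action: ${step.action}`);
    }
}

const steps = [
    { action: 'goto', url: 'https://yourpage.com/login' },
    { action: 'fill', selector: '#email', value: 'user@example.com' },
    { action: 'click', selector: 'button[type=submit]' },
    { action: 'expectVisible', selector: '.dashboard' },
];
const script = steps.map(compileStep).join('\n');
```

&lt;p&gt;The output is ordinary Playwright code, so every execution of it is deterministic and fast.&lt;/p&gt;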

&lt;h2&gt;
  
  
  What can go wrong?
&lt;/h2&gt;

&lt;p&gt;1.&lt;br&gt;
One significant problem is the &lt;strong&gt;brittleness of AI models&lt;/strong&gt;. Small changes to input data - be it a prompt or an update to the web application - can have a disproportionate impact, leading to false positives or negatives in test results. Without a thorough review of an AI-generated test case, the brittleness could result in undesirable outcomes. &lt;/p&gt;

&lt;p&gt;There are strategies to reinforce good outcomes and deploy check loops, but eventually you'll need a human in the loop. Yet requiring too much “human” in the equation eats up the benefit of AI - saving manual work. You know, the reason you used AI in the first place.&lt;/p&gt;

&lt;p&gt;2‍.&lt;br&gt;
Another issue is the &lt;strong&gt;interpretability of the output of AI models&lt;/strong&gt;. Understanding why a particular test failed and how to resolve the issue can be challenging, especially when the AI-generated code is complex or unfamiliar. This requires testers to deeply understand the AI’s output, which can be a tedious and frustrating task.&lt;/p&gt;

&lt;p&gt;Adding insult to injury, not all AI testing tools disclose the generated code, which makes it difficult to interpret anything.&lt;/p&gt;

&lt;p&gt;3‍. &lt;br&gt;
Embedding &lt;strong&gt;AI in test case execution&lt;/strong&gt; directly - for example, using it for smart assertions - can be both slow and costly, creating barriers to running test suites as frequently as desired. It introduces an extra layer of instability, further complicating the process. The more parts call the LLM during a test run, the more complicated it gets.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklp427jovazfw5roghjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklp427jovazfw5roghjt.png" alt="screenshot of a reddit comment" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good news is that all of these concerns can be mitigated by adopting the right architecture in the right place of the testing cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use AI for test creation
&lt;/h2&gt;

&lt;p&gt;AI is most valuable during the creation and maintenance phase of test cases. Let’s take a scripting example. You could begin with a prompt describing your desired test case and allow the AI to generate an initial version. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcntn12tw3k2h3g0kf3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcntn12tw3k2h3g0kf3e.png" alt="screenshot of a prompt and generated code in Cursor" width="800" height="375"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cursor screenshot - prompt + output&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re lucky, the AI may produce a valid and ready-to-use test case right away. How convenient!&lt;/p&gt;

&lt;p&gt;If the AI struggles to interpret your application, you - &lt;strong&gt;the domain expert&lt;/strong&gt; - can step in and guide it, ensuring that the resulting test case is accurate and robust. It’s good practice to keep AI fallibility top of mind when you’re assessing its output. It’s an even better practice for tool developers to build that reminder into the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6ce595lb17imnwnsvmv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6ce595lb17imnwnsvmv.png" alt="screenshot of a AI agent notification asking for help" width="611" height="511"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;AI agent asking for help&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Do not use AI in test runtime
&lt;/h2&gt;

&lt;p&gt;Ideally, AI should not be used during runtime. It’s slow. It’s brittle. It’s costly. A test case represents an expectation of how a system should work in a particular area. The agentic AI must fulfill this expectation - no workarounds. Only if the expectation is formulated precisely enough (as code or steps) can it be validated against.&lt;/p&gt;

&lt;p&gt;I suggest relying on established automation frameworks such as Playwright, Cypress, or Selenium for test execution. By using standard automation framework code, your test cases can remain deterministic, allowing you to execute them fast and reliably. Some providers even offer execution platforms to scale your standard framework test suites efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ajeyu860baqsr6gl1mu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ajeyu860baqsr6gl1mu.png" alt="conceptual diagram of testing phases  and the use AI " width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use AI for test maintenance
&lt;/h2&gt;

&lt;p&gt;The case for using AI in test auto-healing is quite strong. When given boundaries and the ‘good example’ of the original test case, the AI’s worst instincts can be reined in. The idea is that AI generation works best when the problem space is limited.&lt;/p&gt;

&lt;p&gt;A robust solution to auto-maintenance would address a huge pain point in end-to-end testing. Maintenance is more time-consuming (and frustrating) than scripting and running tests combined. Many tools are building AI-driven auto-maintenance features right now. If good enough, they could considerably simplify keeping your tests up to date and relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Roedler&lt;/strong&gt;&lt;br&gt;
Co-founder &amp;amp; CPO&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>ai</category>
    </item>
    <item>
      <title>Testing is more about setup than scripts</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Tue, 18 Feb 2025 14:22:33 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/testing-is-more-about-setup-than-scripts-27i2</link>
      <guid>https://dev.to/daniel-octomind/testing-is-more-about-setup-than-scripts-27i2</guid>
      <description>&lt;p&gt;Comparing testing frameworks is the type of news that gets a lot of eyeballs online. Playwright vs. Cypress vs. Selenium vs. Webdriver.io - everyone is interested in seeing which one is better, faster, more stable and easier to work with.&lt;/p&gt;

&lt;p&gt;Speed and performance are heavily scrutinised. It seems like everyone is migrating from Cypress to Playwright because it offers faster test execution. But is that really a goal worth pursuing?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaz7exbqfyycmlcuohxu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaz7exbqfyycmlcuohxu.jpg" alt="meme" width="500" height="559"&gt;&lt;/a&gt; &lt;br&gt;
It seems like a no-brainer to switch to the faster framework. If the test execution is faster, why wouldn’t you make the switch?&lt;/p&gt;

&lt;p&gt;The problem becomes more apparent once you understand the whole system of test automation. Test automation is never just about script execution. Writing, maintaining, setting up, retrying, and a number of other concerns make up the daily life of a test automation engineer. A good test automation flow is one where all of these parts of the process are fast, not just the test execution.&lt;/p&gt;

&lt;p&gt;If framework migration evangelists were honest, they would take setup time into consideration. I know that edges on a long-term experiment, so no blame here. Test scripting is not the biggest time investment. &lt;strong&gt;Setup is.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I’ll pretend maintenance isn’t a thing for now. That deserves its own article.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up test environments
&lt;/h2&gt;

&lt;p&gt;When it comes to testing, a big part of what makes a project go slow (or fast) is the system under test. Testing tools are getting faster these days, and it seems that we are no longer limited by their speed, but rather by the applications under test.&lt;/p&gt;

&lt;p&gt;Clicking, typing and interacting with the application under test is a task that most testing frameworks can handle just fine. Any seasoned test automation engineer will probably tell you that the real complexity is not in these interactions, but in everything around them - abstractions, system design, data seeding, environment concerns, and authentication setup, just to name a few. These are what make testing possible and make up the real work of test automation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp6md1n2mcxppbpx0dqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp6md1n2mcxppbpx0dqt.png" alt="quote from Linkedin Filip Hric" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what are the main concerns when it comes to setting up well-performing test automation? There are a couple of them. You can think of your test automation script as a real user. Simply starting a user journey might require answers to the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What does the user need in order to interact with the system?&lt;/li&gt;
&lt;li&gt;Are there any limitations on when or how the user can interact, such as authorization, authentication, payment or other constraints?&lt;/li&gt;
&lt;li&gt;What kind of data is assumed when interacting with the system? (e.g. compare e-commerce vs. internet banking apps)&lt;/li&gt;
&lt;li&gt;Are there any integrations to third party systems that the user needs to have first?&lt;/li&gt;
&lt;li&gt;Does a user interact with other users when using the system? &lt;/li&gt;
&lt;li&gt;How do you make sure you don’t pollute your production analytics with test data?&lt;/li&gt;
&lt;/ul&gt;
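&lt;p&gt;To make those questions concrete, here is a minimal sketch of what answering them can look like in code. All names (&lt;code&gt;TestUser&lt;/code&gt;, &lt;code&gt;seed_test_user&lt;/code&gt;) are hypothetical; the point is that each question on the list becomes a line of setup code that runs before the first click:&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class TestUser:
    """The state a user journey assumes before the first interaction."""
    email: str
    password: str
    orders: list = field(default_factory=list)


def seed_test_user(run_id: int) -> TestUser:
    # Unique identity per run, so parallel tests don't share state
    user = TestUser(
        email=f"e2e-user-{run_id}@test.example",
        password="not-a-real-secret",
    )
    # Pre-create the data the journey assumes exists (e.g. one past order),
    # and tag it so production analytics can filter out test traffic
    user.orders.append({"id": f"order-{run_id}-1", "source": "e2e-test"})
    return user
```

&lt;p&gt;In a real suite the seeding function would call your API or database; here it only illustrates that the design work lives in the setup, not in the clicking.&lt;/p&gt;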

&lt;p&gt;There are, of course, many questions like these that will shape how the test script looks, or what kind of setup needs to happen before it runs. To create a good testing system, these problems require well-thought-through solutions. The biggest challenge of testing is that no solution is 100% transferable to other projects.&lt;br&gt;
&lt;br&gt;
This brings me back to test framework comparisons. A very big part of test automation is actually not test automation, but preparation for it. A good test automation project is more than just a good script; it’s a good overall experience. After all, test automation serves the goal of shipping faster, with greater confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test automation in the AI era
&lt;/h2&gt;

&lt;p&gt;The AI wave has influenced test automation as well. Multiple companies now tackle test automation in new, innovative ways. Autonomous testing is on the rise, and bold claims are being made about testing done purely by AI.&lt;/p&gt;

&lt;p&gt;But many of these seem to take one part of test automation and execute it well, while forgetting others. Creating an automation script is a task that many autonomous testing companies jump on, with varying levels of success. Demos of these tools can be quite impressive, but much like the reviews I mentioned earlier, they cover only part of the ground.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb60hdcfhhtug5jvvhb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb60hdcfhhtug5jvvhb.jpg" alt="meme" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test automation was never just the problem of creating scripts; as we established earlier, it’s also a challenge of proper setup and proper context. Autonomous testing solutions need to look at the whole system of test automation problems and be able to go beyond simple scripting.&lt;/p&gt;

&lt;p&gt;Much like prompting an LLM through ChatGPT or Claude, autonomous testing services need a way to provide proper context for the environment under test. Not only the URL of the application under test, but also data seeds, environment variables and other settings are what tackle test automation as a whole, rather than just part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI testing tool supporting setup
&lt;/h2&gt;

&lt;p&gt;Setup is a critical challenge. At Octomind, we’ve built a set of features to help you set up testing and run it quickly. We will be expanding these features as we go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test portability
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://octomind.dev/docs/account-view/environments" rel="noopener noreferrer"&gt;Environments&lt;/a&gt; are essential for running the same test suite across different stages of deployment. Staging, canary, production - you can create as many environments you need. Login credentials can be adjusted per environment and you can define different authentication methods for each stage.&lt;/p&gt;

&lt;p&gt;You can easily define &lt;a href="https://octomind.dev/docs/advanced/variables" rel="noopener noreferrer"&gt;custom variables&lt;/a&gt; and incorporate them into your test cases. They are created in the default environment, ensuring they appear consistently across all other environments. They allow you to assign different values to variables depending on the environment, making it easier to maintain test cases across multiple setups.&lt;/p&gt;
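&lt;p&gt;The fallback behavior can be pictured as a simple lookup: a variable resolves to its environment-specific value if one is set, otherwise to the default environment's value. This is an illustrative sketch of the pattern, not Octomind's actual implementation:&lt;/p&gt;

```python
# Hypothetical variable store: the "default" environment defines every
# variable; other environments only override what actually differs.
ENVIRONMENTS = {
    "default": {"base_url": "http://localhost:3000", "user": "dev@example.com"},
    "staging": {"base_url": "https://staging.example.com"},
    "production": {
        "base_url": "https://app.example.com",
        "user": "prod-e2e@example.com",
    },
}


def resolve(env: str, name: str) -> str:
    # Fall back to the default environment when a variable is not overridden
    value = ENVIRONMENTS.get(env, {}).get(name)
    return value if value is not None else ENVIRONMENTS["default"][name]
```

&lt;p&gt;A test case references only the variable name; which value it gets is decided by the environment the run targets.&lt;/p&gt;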

&lt;h3&gt;
  
  
  Test repeatability
&lt;/h3&gt;

&lt;p&gt;To maintain a consistent test environment for each run, &lt;strong&gt;setup and teardown&lt;/strong&gt; strategies are essential. They improve test repeatability and optimize execution time. For example, if a specific element - such as a support ticket in a ticketing system - can be assumed to exist, you can immediately interact with it instead of creating it from scratch for every test.&lt;/p&gt;
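&lt;p&gt;A common way to express this is a setup/teardown pair wrapped around the test. The sketch below uses a plain Python context manager with an in-memory dict standing in for the ticketing system; in a real suite both ends would call your API or database:&lt;/p&gt;

```python
from contextlib import contextmanager

TICKETS = {}  # stands in for the ticketing system's data store


@contextmanager
def support_ticket(ticket_id: str):
    # Setup: guarantee the ticket the test interacts with exists
    TICKETS[ticket_id] = {"status": "open"}
    try:
        yield TICKETS[ticket_id]
    finally:
        # Teardown: remove it so the next run starts from a known state
        TICKETS.pop(ticket_id, None)
```

&lt;p&gt;Inside &lt;code&gt;with support_ticket("T-1") as ticket:&lt;/code&gt; the test can act on the ticket immediately instead of clicking through its creation every time.&lt;/p&gt;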

&lt;p&gt;We are exploring several options to facilitate setup and teardown at the moment. We will keep you posted once this is shipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Draper&lt;/strong&gt;&lt;br&gt;
Lead Engineer at &lt;a href="https://www.octomind.dev/" rel="noopener noreferrer"&gt;Octomind&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>Our problem with backlogs</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 04 Nov 2024 17:08:30 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/our-problem-with-backlogs-dk5</link>
      <guid>https://dev.to/daniel-octomind/our-problem-with-backlogs-dk5</guid>
      <description>&lt;p&gt;Over a year ago, we launched our dev tooling startup. One of the biggest pitfalls for new companies is focusing on the wrong priorities - over-engineering your scaling capabilities too early or neglecting customer value.&lt;/p&gt;

&lt;p&gt;From the outset, we decided to create a culture where everyone truly owns their work. This approach not only ensures we're focusing on the right things; it also makes the work more enjoyable. I have always enjoyed working most on the things I find important.&lt;/p&gt;

&lt;p&gt;One of our early discussions was about adopting a 'no backlog' approach. Now, with a team of 8 engineers and over a year of experience, we revisited this decision to see if it still made sense.&lt;/p&gt;

&lt;p&gt;It was a fun exercise. We got our backlog frustrations from previous workplaces out of our system, and discussed the advantages and disadvantages we identified over the past year. I’ve summarized it here.&lt;/p&gt;

&lt;p&gt;If you are considering deleting your whole backlog or moving to a less backlog-centric approach to software development, I hope to give you some good insights.&lt;/p&gt;

&lt;h2&gt;
  
  
  What exactly is a backlog?
&lt;/h2&gt;

&lt;p&gt;To start the discussion in the now larger team, I wanted to get a feeling for the backlog vibe. When I asked around, I got answers like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An ordered list of tasks that we need to complete&lt;/li&gt;
&lt;li&gt;A way to structure work&lt;/li&gt;
&lt;li&gt;A way to remember past discussions and conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dumping ground for unimportant tasks&lt;/li&gt;
&lt;li&gt;A place I can put stuff in. I know we’ll never work on that. It gives me a good feeling of putting it there.&lt;/li&gt;
&lt;li&gt;A to-do list that never gets done&lt;/li&gt;
&lt;li&gt;A good tool to shut down discussions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me personally, a backlog is a collection of tasks that aren’t important enough to do &lt;em&gt;right now&lt;/em&gt;. If they were, we wouldn't be stashing them away. After years in software development I’ve come to see backlogs as more of a hindrance than a help. &lt;/p&gt;

&lt;p&gt;When I refer to ‘backlog’ I do &lt;strong&gt;NOT&lt;/strong&gt; mean a list of tasks that you need to do in the next short iteration (1-2 weeks), but anything longer than that (as in ‘product backlog’). &lt;/p&gt;

&lt;p&gt;So, let’s dive into some of the problems I see with backlogs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. They are designed to be a dishonest conversation
&lt;/h3&gt;

&lt;p&gt;Backlogs often lead to some pretty dishonest conversations. Picture this: A stakeholder &lt;em&gt;urgently&lt;/em&gt; requests a new feature, as is often the case. The product owner, instead of outright saying, 'This isn’t a priority right now,' adds it to the backlog. The stakeholder walks away thinking their request will be addressed in x days, while the product owner knows it probably won’t be touched.&lt;/p&gt;

&lt;p&gt;It’s an easier conversation to have than just saying we will never get to it because it's not important. This is what's happening every time you add something to a really long product backlog. This cycle keeps everyone happy on the surface but doesn’t solve the real issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjibihvo3nhspt523yzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjibihvo3nhspt523yzb.png" alt="backlog meme" width="512" height="341"&gt;&lt;/a&gt;‍&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The never ending list
&lt;/h3&gt;

&lt;p&gt;Backlogs have a large gap between expectation and reality even for the most experienced of teams. We all would like to see a nice, steady burn-down of tasks. In reality, backlogs rarely shrink. If it’s still growing after the 12th iteration, that just breeds frustration for everyone involved.&lt;/p&gt;

&lt;p&gt;An empty backlog is a myth. We’ve never seen such a thing in any of the companies that anyone in our team has ever worked in. Unless we decided to delete the old and start with a new one out of sheer frustration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquuw4cqirsy9bfm9qmm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquuw4cqirsy9bfm9qmm8.png" alt="expectation vs reality graph when talking about backlogs" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://www.allankelly.net/static/presentations/Shrunk.pdf" rel="noopener noreferrer"&gt;https://www.allankelly.net/static/presentations/Shrunk.pdf&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. They are a &lt;strong&gt;COLOSSAL&lt;/strong&gt; waste of time
&lt;/h3&gt;

&lt;p&gt;Maintaining a backlog eats up a lot of time. Teams spend countless hours refining, grooming, and updating tasks, creating wireframes, and ensuring every detail is documented. Yet, by the time a task is actually worked on, the original requirements or context may have changed substantially, making all that prep work useless.&lt;/p&gt;

&lt;p&gt;Every moment spent on a task that isn’t immediately being picked up is time wasted. The gap between defining and implementing tasks often means the work becomes obsolete as requirements and reality change, as they always do in software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpohbfhpxu5qb8xvqinv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpohbfhpxu5qb8xvqinv.png" alt="backlog a waste of time" width="512" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html" rel="noopener noreferrer"&gt;https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. They shift bottlenecks (and blame) between different roles
&lt;/h3&gt;

&lt;p&gt;Imagine a &lt;strong&gt;customer success manager&lt;/strong&gt; who asks for a specific feature for a specific customer. The dev team has an open backlog policy, everyone is allowed to add a ticket. The engineering team has to take a look at the item, refine it, then potentially estimate the effort (#noEstimates, but we are trying to focus here). The refined task never makes it to an iteration. Different prios.&lt;/p&gt;

&lt;p&gt;The blame for not completing the task on time lies with the engineering team. If ONLY they had worked faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyrnkxyn5xjre63yt777.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyrnkxyn5xjre63yt777.png" alt="backlog bottlenecks without PO" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html" rel="noopener noreferrer"&gt;https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So you add a &lt;strong&gt;product owner&lt;/strong&gt; to the front of the process. We don’t believe the role must be tied to a specific person, but the concept is solid. The prioritization is now done with the team at a much higher level of abstraction BEFORE stuff ends up in the backlog.  &lt;/p&gt;

&lt;p&gt;Ideally, you’ve matched the input rate and the output rate. Now it’s much faster and cheaper to throw stuff away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cwcvzelj60f9mfoc1q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cwcvzelj60f9mfoc1q4.png" alt="backlog bottleneck with a PO" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;source: &lt;a href="https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html" rel="noopener noreferrer"&gt;https://lucasfcosta.com/2023/02/07/backlogs-are-useless.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part of our problem with &lt;strong&gt;dedicated product owners&lt;/strong&gt; is that they take away too much ownership from the team. They often just take blame from both stakeholders AND dev teams, and can quickly become a single point of failure for the product's success. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are they good at prioritizing and keeping stakeholders at bay? &lt;/li&gt;
&lt;li&gt;Do they have enough authority to make decisions? &lt;/li&gt;
&lt;li&gt;Are they blocking development by being too protective of the backlog?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Something that came up in the discussion summarized it perfectly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We used to have some fierce guardians of the backlog at my previous company. It was impossible to get anything in. Didn’t go well for either the product or the company.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Backlog bias when discussing importance
&lt;/h3&gt;

&lt;p&gt;Backlogs also introduce significant biases. Tasks added to the backlog represent decisions made without the necessary input from the current context. As time goes on and things change, these tasks might no longer be the best solution. Yet, teams continue with them simply because they’re in the backlog.&lt;/p&gt;

&lt;p&gt;I’ve seen this happen over and over: a task sits in the backlog for months, gets discussed in every iteration as critical, only to become irrelevant when it’s finally addressed.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what did we do instead?
&lt;/h2&gt;

&lt;p&gt;This quote from Allen Holub is about more than backlog size. It gives us a way forward without one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F635on58lncgngm9qygdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F635on58lncgngm9qygdn.png" alt="allen holub backlog twitter" width="800" height="255"&gt;&lt;/a&gt;&lt;br&gt;
source: &lt;a href="https://x.com/allenholub" rel="noopener noreferrer"&gt;https://x.com/allenholub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In product development, detailed long-term planning should be limited to high-level priorities — knowing what to build and why. The next steps become clear through building and immediate feedback. A backlog often gets too granular for planning beyond the next 1-2 weeks.&lt;/p&gt;

&lt;p&gt;When you ask developers about the most important task, they're usually already aligned. While there may be minor differences in preferences, identifying the top priority &lt;strong&gt;right now&lt;/strong&gt; isn't typically an issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Continuous Discovery
&lt;/h3&gt;

&lt;p&gt;We adopted ‘Continuous Discovery’ practices, focusing on ongoing conversations with customers and stakeholders to gather insights and feedback. It’s about always reassessing priorities based on the latest information, ensuring that we are working on what matters most.&lt;/p&gt;

&lt;p&gt;Rather than relying on a static backlog, our process is fluid, with priorities shifting as new information emerges. &lt;/p&gt;

&lt;p&gt;However, we're not starting from scratch each week; the Opportunity system within Continuous Discovery helps us stay on course. Opportunities highlight customer needs and pain points &lt;strong&gt;without jumping prematurely into solutions&lt;/strong&gt; at a low abstraction level until absolutely necessary. They ensure we tackle the most critical issues first.&lt;/p&gt;

&lt;p&gt;For a deeper dive, check out Teresa Torres' explanation in &lt;a href="https://www.producttalk.org/2023/12/opportunity-solution-trees/#whats-an-opportunity" rel="noopener noreferrer"&gt;Product Talk&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Aligning with goals
&lt;/h3&gt;

&lt;p&gt;A key strategy is to focus on &lt;strong&gt;overarching business goals&lt;/strong&gt; instead of specific tasks. This ensures our efforts align with the bigger picture and adapt to current priorities. This goal-oriented approach helps us identify critical problems without getting stuck on a fixed list. &lt;/p&gt;

&lt;p&gt;This is, of course, the bread and butter of engineering and product leadership in all companies. It’s also really hard. We often struggle with making our goals measurable for small increments for example. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. The case for not listing ideas
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“But what about all my good ideas that I will lose if I don’t write them into a backlog?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;First, I really think that if the idea is good enough &lt;strong&gt;it will&lt;/strong&gt; come up again, because you will spend more time thinking about the problem space. You might come up with an even better solution.&lt;/p&gt;

&lt;p&gt;You should always rethink your ideas. You should be able to kill your darlings, if they don’t serve the purpose anymore. &lt;/p&gt;

&lt;p&gt;More than a year into Octomind, we know which ideas were discussed and discarded, and why. We have the advantage of being a small team, so onboarding new colleagues to our mental model is still easy. I don’t think a backlog of outdated items helps the newbies anyway.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cheating with mini-backlogs
&lt;/h3&gt;

&lt;p&gt;We collect &lt;strong&gt;bug tickets&lt;/strong&gt;, sort critical ones in real-time and run a bug triage every week to address the rest. We have a zero-bug-policy, which means they &lt;strong&gt;are just as important&lt;/strong&gt; as any of the tasks that we planned for that iteration, and this ensures the bug backlog doesn’t grow indefinitely. Most of us get anxious if there’s more than one page of open GitHub issues. It seems to be working for us.&lt;/p&gt;

&lt;p&gt;Is it cheating? You decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  This might not work for you
&lt;/h2&gt;

&lt;p&gt;Our business model relies on other vendors' LLMs, a landscape that shifts constantly. We must adapt quickly to avoid investing in solutions that could be outdated with the next GPT release.   &lt;/p&gt;

&lt;p&gt;I’m very well aware that not all software development is the same. Some ecosystems require long-term planning, extended development cycles, and careful rollout, especially in industries with strict regulations or safety-critical applications (looking at you, CrowdStrike). Large engineering teams need robust processes so the house doesn’t fall apart.&lt;/p&gt;

&lt;p&gt;We might eventually outgrow this set-up and introduce more backlog-like processes. Until then, we'll do our best to keep the backlog monster locked away.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>developer</category>
    </item>
    <item>
      <title>Did you break your code or is the test flaky?</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 27 May 2024 12:04:31 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/did-you-break-your-code-or-is-the-test-flaky-27gf</link>
      <guid>https://dev.to/daniel-octomind/did-you-break-your-code-or-is-the-test-flaky-27gf</guid>
      <description>&lt;p&gt;Flaky end-to-end tests are frustrating for quality assurance (QA) and development teams, causing constant disruptions and eroding trust in test outcomes due to their unreliability.&lt;/p&gt;

&lt;p&gt;We'll go over all you need to know about flaky tests, how to spot a flaky test from a real problem, and how to handle, fix, and stop flaky tests from happening again. &lt;/p&gt;

&lt;h2&gt;
  
  
  Are flaky tests a real issue?
&lt;/h2&gt;

&lt;p&gt;While often ignored, flaky tests are problematic for QA and Development teams for several reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Lack of trust in test results&lt;/strong&gt;&lt;br&gt;
When tests are unreliable, developers and QA teams may doubt the validity of the results, not only for those specific tests but also for the entire suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Wasted time and resources&lt;/strong&gt;&lt;br&gt;
The time and resources wasted diagnosing flaky tests could've been spent adding value to the business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Obstructed CI/CD pipelines&lt;/strong&gt;&lt;br&gt;
Constant test failures that are unreliable often result in the need to run tests again to ensure success, causing avoidable delays for downstream CI/CD tasks like producing artifacts and initiating deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Masks real issues&lt;/strong&gt;&lt;br&gt;
Repeated flaky failures may lead QA and developers to ignore test failures, increasing the risk that genuine defects sneak through and are deployed to production.&lt;/p&gt;
&lt;h2&gt;
  
  
  What causes flaky tests?
&lt;/h2&gt;

&lt;p&gt;Flaky tests are usually the result of code that does not take enough care to determine whether the application is ready for the next action.&lt;/p&gt;

&lt;p&gt;Take this flaky Playwright test written in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#search-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Search Results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only is this bad because the test will fail if results take more than three seconds; it’s also wasting time if the results return in less than three seconds.&lt;/p&gt;

&lt;p&gt;This is a solved problem in Playwright using the &lt;code&gt;wait_for_selector&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#search-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#results:has-text(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search Results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Search Results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Selenium solves this using the &lt;code&gt;WebDriverWait&lt;/code&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;EC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;presence_of_element_located&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search-results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Search Results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flaky tests can also be caused by environmental factors such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unexpected application state from testing in parallel with the same user account&lt;/li&gt;
&lt;li&gt;Concurrency and race conditions from async operations&lt;/li&gt;
&lt;li&gt;Increased latency or unreliable service from external APIs&lt;/li&gt;
&lt;li&gt;Application configuration drift between environments&lt;/li&gt;
&lt;li&gt;Infrastructure inconsistencies between environments&lt;/li&gt;
&lt;li&gt;Data persistence layers not being appropriately reset between test runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Writing non-flaky tests requires a defensive approach that carefully considers the various environmental conditions under which a test might fail.&lt;/p&gt;
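&lt;p&gt;The common thread in both fixes above is polling: retry a check until it passes or a deadline expires, instead of sleeping for a fixed amount of time. Stripped of any framework, the pattern looks like this (a generic sketch, not part of Playwright or Selenium):&lt;/p&gt;

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or the timeout expires.

    A framework-agnostic version of what wait_for_selector and WebDriverWait
    do for you: the happy path returns as soon as the app is ready, and the
    sad path fails with a clear timeout instead of a flaky assertion.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(interval)
```

&lt;p&gt;Anywhere you are tempted to write &lt;code&gt;time.sleep(3)&lt;/code&gt;, a polling wait like this is the defensive alternative.&lt;/p&gt;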

&lt;h2&gt;
  
  
  What is the difference between brittle and flaky tests?
&lt;/h2&gt;

&lt;p&gt;A brittle test, while also highly prone to failure, differs from a flaky test as it consistently fails under specific conditions, e.g., if a button's position changes slightly in a screenshot diff test.&lt;/p&gt;

&lt;p&gt;Brittle tests can be problematic yet predictable, whereas flaky tests are unreliable as the conditions under which they might pass or fail are variable and indeterminate.&lt;/p&gt;

&lt;p&gt;Now that you understand the nature of flaky tests, let's examine a step-by-step process for diagnosing them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1. Gather data
&lt;/h3&gt;

&lt;p&gt;Before jumping to conclusions as to the cause of the test failure, ensure you have all the data you need, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video recordings and screenshots&lt;/li&gt;
&lt;li&gt;Test-runner logs, application and error logs, and other observability data&lt;/li&gt;
&lt;li&gt;The environment under test and the release/artifact version&lt;/li&gt;
&lt;li&gt;The test-run trigger, e.g. deployment, infrastructure change, code-merge, scheduled run, or manual&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should also be asking yourself questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has this test been identified as flaky in the past? If so, is the cause of the flakiness known? &lt;/li&gt;
&lt;li&gt;Has any downtime or instability been reported for external services?&lt;/li&gt;
&lt;li&gt;Have there been announcements from QA, DevOps, or Engineering about environment, tooling, application config, or infrastructure changes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having video recordings or frequently taken screenshots is crucial because it's the easiest-to-understand representation of the application state at the time of failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvtulww3nlhg2s9yqykl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvtulww3nlhg2s9yqykl.png" alt="step by step screenshots in Octomind test reports" width="512" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;step by step screenshots in Octomind test reports&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Compile this information in a shared document or wiki page that other team members can access, updating it as you continue your investigations. This makes creating an issue or bug report easy, as most of the needed information is already documented.&lt;/p&gt;

&lt;p&gt;Now that you’ve got the data you need, let’s begin our initial analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2. Analyze logs and diagnostic output
&lt;/h3&gt;

&lt;p&gt;Effectively utilizing log and reporting data from test runs is essential for determining the cause of a test failure quickly and correctly. Of course, this relies on having the data you need in the first place.&lt;/p&gt;

&lt;p&gt;For example, if you're using Playwright, save the tracing output as an artifact when a test fails. This way, you can use the &lt;a href="https://playwright.dev/docs/trace-viewer" rel="noopener noreferrer"&gt;Playwright Trace Viewer&lt;/a&gt; to debug your tests.&lt;/p&gt;
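&lt;p&gt;As a minimal sketch, a &lt;code&gt;playwright.config.ts&lt;/code&gt; that captures traces, screenshots, and videos for failed runs might look like this (the retry count and artifact policies are illustrative defaults - adjust them to your pipeline):&lt;/p&gt;

```typescript
// playwright.config.ts - illustrative sketch, not a one-size-fits-all recommendation
import { defineConfig } from "@playwright/test";

export default defineConfig({
  retries: 2, // re-run failed tests so flakiness shows up as "flaky", not "failed"
  use: {
    trace: "on-first-retry",       // record a trace when a failed test is retried
    screenshot: "only-on-failure", // keep screenshots for failing tests only
    video: "retain-on-failure",    // keep video recordings for failing tests only
  },
});
```

&lt;p&gt;In CI, upload the resulting &lt;code&gt;test-results&lt;/code&gt; directory as a build artifact so a trace can be opened later with &lt;code&gt;npx playwright show-trace&lt;/code&gt;.&lt;/p&gt;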

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8f97coop9ib0rf4hyju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8f97coop9ib0rf4hyju.png" alt="showing source code tab in trace viewer" width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;the source code tab in the Playwright Trace Viewer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To begin your analysis, first identify the errors to determine if the issue stems from one or a combination of the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test code failing unexpectedly (e.g. broken selector/locator)&lt;/li&gt;
&lt;li&gt;Test code failing as expected (e.g. failed assertion)&lt;/li&gt;
&lt;li&gt;Application error causing incorrect state or behavior (e.g. JavaScript exception, rejected promise, or unexpected backend API error)&lt;/li&gt;
&lt;/ul&gt;
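&lt;p&gt;As a rough illustration of this triage, a failure message can be bucketed by pattern. The patterns below are assumptions for the sketch, not a standard taxonomy:&lt;/p&gt;

```typescript
type FailureKind =
  | "test-code-broken"   // test code failing unexpectedly
  | "assertion-failed"   // test code failing as expected
  | "application-error"  // incorrect app state or behavior
  | "unknown";

// Illustrative triage heuristic: bucket a raw failure message into one of
// the three categories above. The regexes are assumed examples, not a spec.
function classifyFailure(message: string): FailureKind {
  if (/locator|selector|waiting for element|strict mode violation/i.test(message)) {
    return "test-code-broken"; // e.g. a broken selector/locator
  }
  if (/expect\(|assertion failed|toBe|toEqual/i.test(message)) {
    return "assertion-failed"; // e.g. a failed assertion
  }
  if (/uncaught exception|unhandled rejection|http 5\d\d|internal server error/i.test(message)) {
    return "application-error"; // e.g. JS exception or backend API error
  }
  return "unknown";
}
```

&lt;p&gt;A heuristic like this is only a starting point for labeling failures in bulk; the trace, logs, and screenshots remain the source of truth.&lt;/p&gt;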

&lt;p&gt;The best indication of a flaky test is when the application seems correct, yet a failure has occurred for no apparent reason. If this type of failure has been observed before, but the cause was never resolved, the test is probably flaky.&lt;/p&gt;

&lt;p&gt;Things become more complicated to diagnose when the application functions correctly, yet the state is clearly incorrect. You’ll then need to determine whether the test relies on database updates or responses from external services, to confirm whether infrastructure or data-persistence layers could be the root cause. Hopefully, your application-level logs, errors, or exceptions will provide some clues.&lt;/p&gt;

&lt;p&gt;Debugging test failures is easier when all possible data is available, which is why video recordings, screenshots, and tools such as Playwright’s Trace Viewer are so valuable. They help you observe the system at each stage of the test run, giving you valuable context as to the application's state leading up to the failure. So, if you’re finding it challenging to diagnose flaky tests, it could be because you don’t have access to the right data.&lt;/p&gt;

&lt;p&gt;If the test has been confirmed as flaky, document how you came to this conclusion, what the cause is, and, if known, how it could be fixed. Then, share your findings with your team.&lt;/p&gt;

&lt;p&gt;If you suspect the test is flaky, re-run it - with any luck, it will pass the second or third time around. But if it continues to fail, or you’re still unsure why it’s flaky, more investigation is required.&lt;/p&gt;
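&lt;p&gt;If you want to automate re-runs outside the test framework, a small retry helper is enough. This is a sketch - the name, signature, and backoff policy are our assumptions; Playwright users can simply set the built-in &lt;code&gt;retries&lt;/code&gt; option instead:&lt;/p&gt;

```typescript
// Minimal re-run helper (a sketch, not a framework API): retries an async
// check up to `attempts` times, backing off linearly between tries.
async function retry<T>(fn: () => Promise<T>, attempts = 3, delayMs = 250): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(); // success: stop retrying
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, delayMs * (i + 1)));
    }
  }
  throw lastError; // exhausted all attempts: surface the last failure
}
```

&lt;p&gt;Keep in mind that automated retries hide flakiness rather than fix it - they buy time while the root cause is investigated.&lt;/p&gt;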

&lt;h3&gt;
  
  
  Step 3. Review recent code changes
&lt;/h3&gt;

&lt;p&gt;If the test run was triggered by application or test code changes, review the commits for updates that may be causing the tests to fail - for example, a newly added test that executes before the now-failing test but doesn’t clean up its state.&lt;/p&gt;

&lt;p&gt;Also, check for changes to application dependencies and configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4. Verify the environment
&lt;/h3&gt;

&lt;p&gt;If your flaky tests still fail after multiple re-runs, it’s likely that application config, infrastructure changes, or third-party services are responsible. Test failures caused by environmental inconsistencies and application drift can be tricky to diagnose, so check with teammates to see if this kind of failure has been seen before under specific conditions, e.g. database server upgrade.&lt;/p&gt;

&lt;p&gt;Running tests locally to step-debug failures, or in an on-demand isolated environment, is the easiest way to pinpoint which part of the system may be causing the failure. We have open-sourced a tool to do exactly that - &lt;a href="https://www.octomind.dev/docs/debugtopus" rel="noopener noreferrer"&gt;Debugtopus&lt;/a&gt;. Check it out in our docs or go directly to the &lt;a href="https://github.com/OctoMind-dev/debugtopus" rel="noopener noreferrer"&gt;Debugtopus repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, check for updates to testing infrastructure, such as changes to system dependencies, testing framework version, and browser configurations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing flaky tests
&lt;/h2&gt;

&lt;p&gt;While this deserves a blog in its own right, a good start to preventing flaky tests is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure each test runs independently and does not depend implicitly on the state left by the previous tests.&lt;/li&gt;
&lt;li&gt;Use different user accounts if running tests in parallel.&lt;/li&gt;
&lt;li&gt;Avoid hardcoded timeouts by using waiting mechanisms that can assert the required state exists before proceeding.&lt;/li&gt;
&lt;li&gt;Ensure infrastructure and application configuration across local development, testing, and production remains consistent.&lt;/li&gt;
&lt;li&gt;Prioritize the fixing of newly identified flaky tests as fast as possible. &lt;/li&gt;
&lt;li&gt;Identify technical test code debt and pay it down regularly.&lt;/li&gt;
&lt;li&gt;Promote best practices for test code development.&lt;/li&gt;
&lt;li&gt;Add code checks and linters to catch common problems.&lt;/li&gt;
&lt;li&gt;Require code reviews and approvals before merging test code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The effort required to diagnose flaky tests properly and document their root cause is time well spent. I hope you’re now better equipped to diagnose flaky tests and have some new tricks for preventing them from happening in the future.&lt;/p&gt;

&lt;p&gt;We constantly fortify Octomind tests against flakiness. Today, we already deploy adaptive interaction timing to handle varying response times and follow best practices for test generation. We are also looking into using AI to fight flaky tests: AI-based analysis of unexpected circumstances could help handle temporary pop-ups, toasts, and similar elements that often break a test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximilian Link&lt;/strong&gt;&lt;br&gt;
Senior Engineer at Octomind&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>e2e</category>
      <category>frontend</category>
    </item>
    <item>
      <title>The Full-stack Testing Mindset</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Mon, 08 Apr 2024 13:07:43 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/the-full-stack-testing-mindset-2klo</link>
      <guid>https://dev.to/daniel-octomind/the-full-stack-testing-mindset-2klo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;“I love contributing to and maintaining our end-to-end test suite” - Nobody, Ever&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;End-to-end tests have a poor reputation for being brittle, slow, and frustrating for both QA and engineering teams to manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;If you’re pushed for time, here are the main takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adopt a Full-Stack Testing Mindset&lt;/strong&gt;&lt;br&gt;
Unless you’re an API-only company, backend systems don’t operate in isolation. Pay special attention to the integration points between different backend services, especially when async operations and task queues are involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Importance of Simulating Real-World Conditions&lt;/strong&gt;&lt;br&gt;
Include tests that simulate unpredictable real-world conditions, such as variable network latency, high traffic loads, and interactions with external services. These conditions can expose backend issues that are not apparent in a controlled test environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developers: Collaborate with and Educate QA Teams&lt;/strong&gt;&lt;br&gt;
Work closely with QA automation engineers to understand the limitations and deficiencies in the current end-to-end testing codebase and guide them in upskilling and improving their ability to write comprehensive and robust tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QA: Proactively Learn from Developers&lt;/strong&gt;&lt;br&gt;
QA should take every opportunity to learn how data flows from the front-end to the backend and other architectural and infrastructure concerns so you can start to troubleshoot and test with the mindset of a developer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consider Playwright for Boosting Speed and Productivity&lt;/strong&gt;&lt;br&gt;
If you haven’t already, look into switching to Playwright, as that will likely offer the biggest bang for your buck for simplifying the test code, shrinking test times in CI, and reducing the flakiness of tests overall.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;My goal with this article is to surface the often-overlooked role end-to-end testing can play in uncovering types of backend bugs that are difficult, if not impossible, to find with integration or unit tests alone - and why the solution lies in both software engineers and QA adopting a full-stack testing mindset.&lt;/p&gt;

&lt;p&gt;And don’t worry - this isn’t a thinly veiled attempt to position our product as the answer to all your end-to-end testing woes. I simply want to inspire you to write better code that provides a more robust customer experience by embracing the value E2E testing can offer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why developers should care about e2e tests
&lt;/h2&gt;

&lt;p&gt;Most developers consider their job done once unit (and, hopefully, integration) tests have been written. End-to-end tests are usually an afterthought for developers, if they’re thought of at all.&lt;/p&gt;

&lt;p&gt;Customer success is not whether your backend tests pass and the APIs under test fulfil their contract - it’s whether an entire customer flow works as expected in a production setting.&lt;/p&gt;

&lt;p&gt;There are many real-world examples illustrating the importance of end-to-end tests for identifying backend issues. &lt;strong&gt;Just recently, we would have shipped a broken settings feature without E2E tests.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The backend test suite passed, but the Auth API call was incorrectly restricted to admin users only, and because every developer has admin permissions, the issue was invisible to us.&lt;/p&gt;

&lt;p&gt;Luckily, we had an end-to-end test fail as it used a real customer account with standard permissions, revealing the underlying issue. Without this test, the user flow would have been broken on release for our users to discover.&lt;/p&gt;
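&lt;p&gt;The bug class is easy to reproduce in miniature. The sketch below is hypothetical (the names and the guard are ours, not our actual code) and shows why backend tests that only use admin fixtures stay green:&lt;/p&gt;

```typescript
type Role = "admin" | "member";
interface User { id: string; role: Role; }

// Hypothetical access check illustrating the bug class described above:
// the settings endpoint was accidentally restricted to admins only.
function canUpdateSettings(user: User, adminOnly: boolean): boolean {
  return adminOnly ? user.role === "admin" : true;
}

// Backend fixtures only used admin accounts, so the regression stayed invisible:
const adminUser: User = { id: "u1", role: "admin" };
// A standard-permission account, like the one our E2E test logged in with:
const memberUser: User = { id: "u2", role: "member" };
```

&lt;p&gt;An end-to-end test logging in with a standard-permission account exercises the member path and fails immediately, which is exactly what happened here.&lt;/p&gt;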

&lt;h2&gt;
  
  
  The organizational divide between engineering and QA
&lt;/h2&gt;

&lt;p&gt;This is typically how organizations think of end-to-end testing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrdynq1lx7c4kj8n6er5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrdynq1lx7c4kj8n6er5.png" alt="developers vs. QA - testing responsibilities within an organization" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dividing test design and responsibility solely according to an org chart makes logical sense but doesn’t work well in practice.&lt;/p&gt;

&lt;p&gt;While a poor-quality or brittle end-to-end test suite with constant failures frustrates everyone, it’s ultimately developers who pay the price as they will most likely be responsible for investigating each failure and determining the solution. It’s therefore in their best interests to ensure E2E tests provide maximum value with minimal false positives.&lt;/p&gt;

&lt;p&gt;While I’m not suggesting developers save the day and take ownership of end-to-end testing, they need to play a leadership role to drive improvements in test quality by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Understanding the &lt;strong&gt;limitations of unit and integration tests&lt;/strong&gt; for simulating real-world scenarios, mainly where external services are involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Working closely with QA automation engineers&lt;/strong&gt; to educate them about flows where async task processing or data synchronization operations may create a more nuanced happy path.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reviewing test cases to ensure best practices are being followed, ensuring &lt;strong&gt;test code is easily maintainable, readable and debuggable&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Encouraging and &lt;strong&gt;helping QA learn more about the system architecture&lt;/strong&gt;, infrastructure, and application code so they may learn how to think like a developer and write more sophisticated and comprehensive test cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is for QA and development teams to act as collaborative partners and take greater ownership of the quality and reliability of their end-to-end tests together. This helps foster a better team-oriented environment and puts the customer experience and code quality as the shared objective.&lt;/p&gt;

&lt;p&gt;Ultimately, the best developers are the ones that don’t just focus on addressing the strict requirements of their user story or task, but care deeply about the quality and delivered customer value and don’t let organizational boundaries limit the scope of their testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The e2e testing landscape is transforming
&lt;/h2&gt;

&lt;p&gt;There’s never been a better time to be optimistic about end-to-end testing improving, thanks to Microsoft’s open-source testing framework, Playwright, and automated testing fixes powered by AI.&lt;/p&gt;

&lt;p&gt;Since Playwright’s release in 2020, it’s racked up more than 60k stars on GitHub and over 4.5 million installs per week from npm — and with good reason, as it’s the most modern and forward-thinking test framework. &lt;a href="https://www.octomind.dev/blog/why-we-built-an-e2e-testing-tool-on-top-of-playwright" rel="noopener noreferrer"&gt;We’re betting our entire company on it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It’s blazingly fast, tests multiple browsers in parallel, runs headed or headless, has flexible layout-based selectors such as near, below, and above, flakiness-prevention features such as auto-wait, integrated VS Code debugging, and much more.&lt;/p&gt;

&lt;p&gt;This means writing high quality and reliable E2E tests doesn’t require years of experience and weird timing hacks. And the fact it’s used and owned by Microsoft means that you can be sure it’s here to stay and will continue to be invested in for the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI fills the gaps
&lt;/h2&gt;

&lt;p&gt;But Playwright alone doesn’t take all the pain away. You still have to identify, write, run, and maintain your Playwright E2E tests.&lt;/p&gt;

&lt;p&gt;At Octomind, we think that AI has the capacity to make that part much easier. Our LLM-based agent is trained to have a semantic understanding of code at the UI layer, enabling it to infer relationships between related UI components and make intelligent suggestions to fix failures caused by issues such as broken Playwright locators.&lt;/p&gt;

&lt;p&gt;We want to remove the needless interruptions caused by failing UI tests that aren’t genuine regressions by automatically fixing them, e.g. when link text is updated or a button position changes. It’s a task AI is well suited to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;End-to-end testing isn’t just for testing the front-end and catching simple UI regressions, and the &lt;a href="https://www.octomind.dev/blog/testing-pyramid-an-evolutionary-tale" rel="noopener noreferrer"&gt;testing pyramid is changing&lt;/a&gt; to reflect the value E2E tests can offer.&lt;/p&gt;

&lt;p&gt;End-to-end tests are unique in that they can simulate real-world conditions such as network latency, high traffic loads, and interactions with external services. These conditions can expose backend issues unlikely to surface in controlled environments such as those where integration and unit tests are run.&lt;/p&gt;

&lt;p&gt;By adopting a collaborative and comprehensive approach to end-to-end testing where developers support QA in developing a deeper understanding of the entire application stack, software teams can significantly improve the quality of their code and provide a more robust and reliable customer experience.&lt;/p&gt;

&lt;p&gt;With Playwright and AI revolutionizing end-to-end testing, there’s never been a better time to adopt a ‘Full-Stack Testing Mindset’ to get the most out of your E2E test suite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Draper&lt;/strong&gt;&lt;br&gt;
Lead Engineer at Octomind&lt;/p&gt;

</description>
      <category>testing</category>
      <category>qa</category>
      <category>frontend</category>
      <category>backend</category>
    </item>
    <item>
      <title>Keep your Copilot and your code quality</title>
      <dc:creator>daniel-octomind</dc:creator>
      <pubDate>Thu, 29 Feb 2024 14:33:51 +0000</pubDate>
      <link>https://dev.to/daniel-octomind/keep-your-copilot-and-your-code-quality-4oo6</link>
      <guid>https://dev.to/daniel-octomind/keep-your-copilot-and-your-code-quality-4oo6</guid>
      <description>&lt;h2&gt;
  
  
  AI generated issues with code quality
&lt;/h2&gt;

&lt;p&gt;Studies on GitHub Copilot's impact on coding trends reveal a paradox: while it boosts coding speed, it also introduces problems with code quality and maintainability. The latest study, &lt;a href="https://www.gitclear.com/coding_on_copilot_data_shows_ais_downward_pressure_on_code_quality" rel="noopener noreferrer"&gt;‘Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality’&lt;/a&gt; by GitClear, confirms the pattern. According to &lt;a href="https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/" rel="noopener noreferrer"&gt;GitHub’s own study&lt;/a&gt;, developers write code “55% faster” when using Copilot, but there appears to be a decline in the quality and maintainability of the code generated by AI.&lt;/p&gt;

&lt;p&gt;Analyzing over 153 million changed lines of code, the GitClear study predicts doubled code churn in 2024 compared to 2021 and a dip in effective code reuse. For more insights from the study, check out &lt;a href="https://visualstudiomagazine.com/articles/2024/01/25/copilot-research.aspx" rel="noopener noreferrer"&gt;this summary&lt;/a&gt; written by David Ramel or watch &lt;a href="https://youtu.be/7ktvyqvWkiU" rel="noopener noreferrer"&gt;Theo Browne’s reaction&lt;/a&gt; to it.&lt;/p&gt;

&lt;p&gt;These findings raise important questions about the long-term impact of &lt;strong&gt;AI tools in software development&lt;/strong&gt;. Poor code leads to app failures, revenue loss, and skyrocketing costs to fix the mess.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI testing and refactoring to course correct AI generated code
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we get quality back on track while keeping the productivity benefits of tools like Copilot? After all, degrading quality will undermine delivery speed and neutralize some of the gains, if not most.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI might be the answer here, too.&lt;/p&gt;

&lt;p&gt;First off, we need to push AI models to churn out code that meets quality standards. This relies heavily on Copilot-type tool providers stepping up their game. It’s reasonable to assume they’ll get better at solving it. However, it’s not a silver bullet.&lt;/p&gt;

&lt;p&gt;Second, new tools will make quality control, debugging, and refactoring as easy as AI code creation. Startups tackling non-trivial developer problems around code quality have mushroomed in the past two years - exponentially so since the launch of GPT-4. With decreasing code quality, they will become a necessity. But they take time to learn and use, which, again, offsets some of the time saved using AI-generated code.&lt;/p&gt;

&lt;p&gt;A third addition will be the generation of test code at all levels. Copilot itself already seems decent at suggesting helpful unit tests. But it won’t be enough.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewqyrv56bdhzp1bm8g4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewqyrv56bdhzp1bm8g4v.png" alt="reddit comment" width="738" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;a Redditor commenting on GitHub Copilot's code &amp;amp; test-writing ability&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The case for AI in UI testing
&lt;/h2&gt;

&lt;p&gt;Having AI take care of end-to-end tests is a more complex task to solve. It is also essential: it’s not enough to produce code fast - it needs to be deployed fast, too. The speed gains from code generation won’t materialize otherwise.&lt;/p&gt;

&lt;p&gt;Direct test code generation is still not there yet, as many evaluations seem to point out, like &lt;a href="https://www.youtube.com/live/8kxskV4bYDE?feature=shared&amp;amp;t=1033" rel="noopener noreferrer"&gt;this analysis&lt;/a&gt; of GPT-3.5 and GPT-4’s capability to handle e2e (Cypress &amp;amp; Playwright) tests.&lt;/p&gt;

&lt;p&gt;But what if we approached it differently and used AI for its &lt;strong&gt;commonsense knowledge&lt;/strong&gt;? Think of UI tests as functional e2e tests for the front-end: they test functionality based on user interactions. AI can look at an application and interact with it as prompted. This way, AI doesn’t mimic the coder but the user. LLMs combined with vision models show a promising route to handling the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI testing and trust
&lt;/h2&gt;

&lt;p&gt;The biggest challenge with any test suite, and especially with end-to-end tests, is trust. If they don’t provide enough coverage, they are not trustworthy. The same applies if they are flaky. Moreover, they need to run reasonably fast, because otherwise they are a pain to use. Adding AI to the equation introduces even more reliability concerns.&lt;/p&gt;

&lt;p&gt;The AI needs to provide high enough coverage by producing high-quality, low-flakiness test cases that run fast enough for a short feedback cycle. It has to be able to auto-fix broken tests to achieve meaningful coverage and keep it long term.&lt;/p&gt;

&lt;p&gt;To counteract the diminishing trust in AI-generated code, an efficient and trustworthy AI testing strategy can go a long way.&lt;/p&gt;

&lt;p&gt;That's what we are working on. &lt;/p&gt;

&lt;p&gt;We are out with &lt;a href="https://app.octomind.dev/setup/url?utm_source=blog&amp;amp;utm_medium=internal-link" rel="noopener noreferrer"&gt;Octomind’s beta version&lt;/a&gt;, but we need your feedback to nail an AI testing tool you will trust and love to use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daniel Rödler&lt;/strong&gt;&lt;br&gt;
Co-founder and CTO / CPO  at Octomind&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>codequality</category>
    </item>
  </channel>
</rss>
