Unleashing AI to Heal Flaky Tests in CodeceptJS with Playwright

Dealing with flaky tests is one of the biggest pain points for anyone doing browser and end-to-end test automation. You know the drill: tests that start failing for no apparent reason, and intermittent failures that sap confidence in the test suite and prevent stable deployments. Maintaining tests can take up to 50% of a QA automation engineer's time.

Usually, the cause of a failure is quite simple: an element's name has changed, or a new modal window has appeared. What if we could use AI to identify the root cause and try to fix the test on the fly, instead of distracting engineers from their actual work?

That's what I did.

Rather than just re-running flaky tests and hoping they'll pass, or manually debugging every failure, I tried several popular AI models with the new AI healing feature of CodeceptJS to fix failing tests.

The results were quite promising...


But you'll need to read on to learn which models I used and how to configure them.

Failing Test

I started with a failing test that opens the GitHub login page but uses incorrect locators.

Scenario('Github login', ({ I }) => {
  I.amOnPage('');
  I.click('Sign in');
  I.fillField('Username', 'davert');
  I.fillField('Password', '123345');
  I.click('Login');
  I.see('Incorrect username or password');
});

This test is executed by CodeceptJS using Playwright as an engine. As an alternative, you can use webdriverio as the engine of choice, but I prefer Playwright as it is faster.

This test fails because there is no "Login" button and no "Username" field on the page. So the test never reaches the assertion step and fails on an interaction step.

The failing test takes ~4 seconds to run. We need to keep that number in mind to see how using AI affects the speed of a test.

Fixing tests on the fly is possible with the heal plugin of CodeceptJS. Both the plugin and an AI provider must be configured in codecept.conf.js to make things work.


ChatGPT (OpenAI)

This was my first stop, as OpenAI's ChatGPT has been a game-changer in making AI accessible. My model of choice was gpt-3.5-turbo-0125. To start using it, I had to top up my credit balance in the OpenAI Console. ChatGPT itself is still free, but API access requires prepaid credits. However, it's relatively cheap: I spent only a few cents playing with it.

I updated my codecept.conf.js file to use it as the AI provider:

ai: {
  request: async (messages) => {
    // `openai` is an OpenAI SDK client created earlier in the config
    const completion = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo-0125',
      messages,
    });
    return completion?.choices[0]?.message?.content;
  },
},

The complete configuration is here

I ran the test a few times with the --ai flag, and it passed. Yay!

The suggested fixes were very sensible:

1. To fix Github login
  Replace the failed code: (suggested by ai)
- I.fillField("Username", "davert")
+ I.fillField("input#login_field", "davert")
at Test.<anonymous> (./tests/ai_test.js:12:5)

2. To fix Github login
  Replace the failed code: (suggested by ai)
- I.click("Login")
+ I.click("input[value='Sign in']")
at Test.<anonymous> (./tests/ai_test.js:14:5)

Both are correct. However, the test duration rose from 4s to 12s, which is not great, not terrible. With two healed steps, I assume each call to OpenAI adds ~4s (or more).
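The estimate comes from simple arithmetic, sketched below. It assumes one AI call per healed step, which is my simplification; the function name `perCallOverhead` is made up for this example.

```javascript
// Average time added by each AI call, given the run time without and with healing.
function perCallOverhead(baselineSec, healedRunSec, aiCalls) {
  if (aiCalls <= 0) throw new RangeError('aiCalls must be positive');
  return (healedRunSec - baselineSec) / aiCalls;
}

// Numbers from this run: 4s baseline, 12s with healing, 2 healed steps
const overhead = perCallOverhead(4, 12, 2);
console.log(`~${overhead}s per AI call`); // prints "~4s per AI call"
```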

Claude (Anthropic)

I then tried the impressive Claude model from Anthropic. Claude seemed to have stronger reasoning capabilities and was able to provide more nuanced suggestions. I liked Claude more than ChatGPT 3.5 for my daily programming tasks. But how good is Claude when it comes to fixing tests on the fly?

Claude provides $5 in free credits to start. Nice!

I updated my config to use the claude-2.1 model via the Anthropic SDK:

ai: {
    request: async(messages) => {
      // `anthropic` is an Anthropic SDK client created earlier in the config
      const resp = await anthropic.messages.create({
        model: 'claude-2.1',
        max_tokens: 1024,
        messages,
      });
      return resp.content.map((c) => c.text).join('\n\n');
    },
},

This model also provided valid results:


1. To fix Github login
  Replace the failed code: (suggested by ai)
- I.fillField("Username", "davert")
+ I.fillField('#login_field', 'davert');
at Test.<anonymous> (./tests/ai_test.js:12:5)

2. To fix Github login
  Replace the failed code: (suggested by ai)
at Test.<anonymous> (./tests/ai_test.js:14:5)

But it took almost 25 seconds! That's too long!
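To see how much of that time is spent waiting for the model rather than in the browser, the `request` function can be wrapped with a timer. This is only a sketch: `withTiming` is a hypothetical helper written for this post, not part of CodeceptJS.

```javascript
// Wraps an async `request(messages)` function and reports how long each call took.
function withTiming(request, report = console.log) {
  return async (messages) => {
    const startedAt = process.hrtime.bigint();
    try {
      return await request(messages);
    } finally {
      // hrtime is in nanoseconds; convert to milliseconds for readability
      const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
      report(`AI request took ${elapsedMs.toFixed(0)}ms`);
    }
  };
}

// In codecept.conf.js the provider call would be wrapped like this:
// ai: { request: withTiming(async (messages) => { /* provider call */ }) }
```

The timing is reported in a `finally` block, so slow requests that end in an error are measured as well.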

However, Claude 2.1 is a legacy model, and Anthropic recommends the Haiku model as a lightweight replacement for it.

So I updated my config:

    request: async(messages) => {
      const resp = await anthropic.messages.create({
        model: 'claude-3-haiku-20240307',
        max_tokens: 1024,
        messages,
      });
      return resp.content.map((c) => c.text).join('\n\n');
    },

However, it was not faster than Claude 2.1. It also took 25s.

AI assistant took 25s and used ~3K input tokens. Tokens limit: 1000K

Self-Healing Report:
2 steps were healed

Suggested changes:

1. To fix Github login
  Replace the failed code: (suggested by ai)
- I.fillField("Username", "davert")
+ I.fillField('#login_field', 'davert');
at Test.<anonymous> (./tests/ai_test.js:12:5)

2. To fix Github login
  Replace the failed code: (suggested by ai)
- I.click("Login")
+ I.click('Sign in');
at Test.<anonymous> (./tests/ai_test.js:14:5)

I think Claude's general-purpose models are not the best choice for healing tests.


Mixtral

Mixtral was something new to me. I had never tried it, mostly because it doesn't have a familiar chat interface like Claude or ChatGPT. But recently I discovered Groq, a service that provides the same experience as ChatGPT and Claude but for open-source models like Mixtral.

Btw, this model is open source, so you can run it on Cloudflare, Google, Amazon, or locally. However, I will use Groq due to its simplicity.

And at the moment it's free to use!

I updated my config to use Mixtral via Groq:

    request: async (messages) => {
      const chatCompletion = await groq.chat.completions.create({
          messages,
          model: "mixtral-8x7b-32768"  // llama2-70b-4096 || gemma-7b-it || llama3-70b-8192 || mixtral-8x7b-32768
      });
      return chatCompletion.choices[0]?.message?.content || "";
    },
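Since the Groq and OpenAI SDKs both expose a `chat.completions.create()` method, switching between the models in that comment can be reduced to a single argument. The sketch below uses a hypothetical `makeChatRequest` factory and a stand-in `fakeClient` so it runs without network access; in a real config you would pass the actual SDK client instead.

```javascript
// Builds a CodeceptJS-style `request` function for any client that exposes
// an OpenAI-compatible `chat.completions.create()` method.
function makeChatRequest(client, model) {
  return async (messages) => {
    const completion = await client.chat.completions.create({ model, messages });
    return completion.choices[0]?.message?.content || '';
  };
}

// Stand-in client for illustration only; it echoes the requested model name.
const fakeClient = {
  chat: {
    completions: {
      create: async ({ model }) => ({
        choices: [{ message: { content: `answer from ${model}` } }],
      }),
    },
  },
};

const request = makeChatRequest(fakeClient, 'mixtral-8x7b-32768');
request([{ role: 'user', content: 'fix this step' }]).then(console.log);
// prints "answer from mixtral-8x7b-32768"
```

With this shape, trying another model is a one-line change in the config rather than a copy of the whole block.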

Well, this was faaast!

The suggestions were good and, quite surprisingly, it took only 7s to fix 2 steps!


Llama (Meta)

Another open-source model available from Groq is Llama from Meta. Let's try llama2:

    request: async (messages) => {
      const chatCompletion = await groq.chat.completions.create({
          messages,
          model: "llama2-70b-4096"
      });
      return chatCompletion.choices[0]?.message?.content || "";
    },

Llama2 failed to provide a fix.

According to the logs, it suggested code fixes for Codeception, the PHP testing framework, which didn't work in JS:

Here's an example code snippet that demonstrates how to adjust the `click` method to fix the test:


Very bad. Maybe llama3 is better? I also tried the llama3-70b-8192 model.

It failed at first, but after a few tries it managed to fix one step.


Also, I don't like the suggested locator:

1. To fix Github login
  Replace the failed code: (suggested by ai)

Llama seems very fast (7s) but very unstable for tasks like this. Maybe another day?


Gemma (Google)

Groq also provides access to the Gemma model from Google. Why not add the gemma-7b-it model to the competition?

It worked, and worked fast!

But after a few more tries, I couldn't reproduce the result. Sometimes its response is not valid JS code, so it can't be acted upon:


The adjusted locator `'.auth-form-body form button[type="submit"]:contains("Sign in")'` addresses the issue by:

Unfortunately, CodeceptJS can't execute a prose explanation as code.
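The difference between a usable answer and a wordy one can be illustrated with a simple syntax check. This is a sketch of the idea and not how CodeceptJS itself processes responses; `looksLikeJs` is a hypothetical helper built on Node's built-in `vm` module.

```javascript
const vm = require('vm');

// Returns true when `code` parses as JavaScript. Nothing is executed:
// vm.Script only compiles the source and throws a SyntaxError on invalid input.
function looksLikeJs(code) {
  try {
    new vm.Script(code);
    return true;
  } catch (err) {
    if (err.name === 'SyntaxError') return false;
    throw err;
  }
}

console.log(looksLikeJs("I.click('Sign in');")); // prints true
console.log(looksLikeJs('The adjusted locator addresses the issue by:')); // prints false
```

A syntax check only filters out prose. It cannot tell whether a suggested locator actually matches an element, so the healed step still has to be retried in the browser.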


Conclusion

AI can improve the way we keep our end-to-end tests up to date. If you run a lot of browser tests on CI, you should try using an AI assistant to reduce the flakiness of your tests. Why not?

If it fixes a test, it takes much less time than you would spend reproducing the failure and re-running the test.

If it fails, you lose nothing. AI is quite cheap, especially compared to QA engineers' rates.

As of April 2024, the best model you can use to heal flaky tests is Mixtral (mixtral-8x7b-32768). True winner! It took about 1s per request. You can enable it today with Groq in your CodeceptJS setup and have your tests fixed on the fly!

GPT-3.5 Turbo (gpt-3.5-turbo-0125) also showed a good result. Last year the same task took ~20-30s per request. That's a serious improvement: the request time is now down to 4s.

P.S. Sure, general-purpose LLMs are not stable and can't guarantee the same results every time. But even if they can fix 10% of failures for you, that can make a huge impact!

Source code:
