This is a follow-up to an earlier article: Load balancing Cypress tests without Cypress Cloud.
Months ago, I built a simple load balancer for the Cypress testing framework. While others exist, like the really well-designed cypress-split and the existing solution in the paid Cypress Cloud, I set out to see how I could improve on them and combine all the necessary commands into one package. Originally, tests were balanced using a "round-robin" algorithm, but over time, I found the extremes between the highest and lowest process execution times to be inefficient. Thus, I set out to improve upon this with a new approach.
Everything below specifically describes version 0.2.9 of the “cypress-load-balancer” NPM package; please note it may change over time from how it is described here.
I feel confident that the new algorithm has reduced the timings significantly while ensuring all runners are evenly distributed, and the results at the end demonstrate as much. Here's how I used some system design techniques to design and implement this enhancement.
The issue with using a "round-robin" approach in this case
A "round-robin" algorithm is a common, simple approach mainly used for handling dynamic traffic. VMWare does a good explanation of it here:
In a nutshell, round robin network load balancing rotates connection requests among web servers in the order that requests are received. For a simplified example, assume that an enterprise has a cluster of three servers: Server A, Server B, and Server C.
- The first request is sent to Server A.
- The second request is sent to Server B.
- The third request is sent to Server C.
The load balancer continues passing requests to servers based on this order. This ensures that the server load is distributed evenly to handle high traffic.
There is a stark difference between an open, dynamic process and the cypress run command, however: the cypress run command is a static process with a defined end point, whereas balancing traffic usually means the line of communication is intentionally left open. For network requests, the optimization that matters is handling the present load of traffic; the total time the whole process has been running before it is finally shut down does not really matter. For running tests, however, we want tests to complete as fast as possible: while improving each individual test file is one option, a second option is to distribute the files across parallel processes to minimize the time it takes to fully complete testing all of them.
With my initial round-robin algorithm, I sorted the test files by their expected run time and then assigned them out to each runner, or worker process. Like the description above, if I had three runners and nine files, then the first runner gets the first file, the second runner gets the second file, the third gets the third, and so on, repeating until all files are assigned. However, since the files are sorted by runtime, the first runner will always take the longest to run and the last runner will be the shortest. The differences between these runtimes are very unbalanced: if we divide the nine files into three sets of three files each, then the first runner is always going to get the highest running time from each of those sets:
File timings (in milliseconds):
[900, 800, 700, 600, 500, 400, 300, 200, 100]

Runners after load balancing:
[ [900, 600, 300], [800, 500, 200], [700, 400, 100] ]

Total run times:
[ 1800, 1500, 1200 ]
Note that the difference between the longest runner and the shortest runner is 600 ms. This can definitely be improved.
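For illustration, here is a minimal sketch of that original round-robin assignment. It is not the package's actual code; the function name and the shape of the timing data are just assumptions for the example.

//Minimal illustrative sketch of the original round-robin assignment (hypothetical helper, not the package's code)
function balanceRoundRobin(fileTimings: Record<string, number>, runnerCount: number): string[][] {
  //Sort files from longest to shortest expected runtime
  const sorted = Object.keys(fileTimings).sort((a, b) => fileTimings[b] - fileTimings[a]);
  const runners: string[][] = Array.from({ length: runnerCount }, () => []);
  //Deal files out in turn: file 0 to runner 0, file 1 to runner 1, file 2 to runner 2, file 3 back to runner 0, and so on
  sorted.forEach((file, index) => runners[index % runnerCount].push(file));
  return runners;
}

With the nine timings above and three runners, this dealing pattern produces exactly the [900, 600, 300] / [800, 500, 200] / [700, 400, 100] grouping shown in the example.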
Defining a new solution
Now, after noticing the extremes above, I wanted to restart and determine what I actually needed out of load balancing. Originally, it was just to evenly distribute X files between Y runners. This does not really optimize for time, however. Instead, I came up with a new goal statement:
The load balancing algorithm should keep the total runtime of the Cypress test execution process as low as possible, while evenly balancing test files amongst all available runner processes allowed.
With that in mind, I now could design a real working solution. Before even defining the boundaries of the algorithm, let's take a look at what we know about the affected process.
Initial observations of the Cypress run command
I collected some thoughts around Cypress and the cypress run process. Here's the list I used to define the requirements later on:
- Cypress has both an `open` and a `run` process, and they are distinct from each other. The `open` process launches an interactive test runner UI and can remain available after a test completes. The `run` process executes a set of files provided to it and ends when it completes test execution and any post-test events.
- The `cypress run` command is static: it has a defined start point and end point, with required inputs and various outputs.
- The inputs to `cypress run` include a set of file names or file patterns that determine the test files to use; without a list of files (that it can find within the suite), it will exit after starting up without executing any tests.
- Excluding setup and teardown of the process, the total test set execution time is the accumulation of the runtimes of all test files being executed.
- Furthermore, if every test file could run in its own parallel process, then the minimum test set execution time of the entire process equals the runtime of the longest-running test file. To get the lowest possible test set execution time, we must wait for the longest-running file to complete.
- When parallelized runner processes are used that can each run multiple test files, then the total execution time of each process should be kept lower than the process containing the longest-running test file (a small sketch of this lower bound follows this list).
- For example, if a runner has only one file and it is the longest-running file, then the accumulated test timings of every other runner should be kept lower than that longest-running process, if possible.
- To know how long a file runs, it must be executed at least once and have its timing recorded in a persistent location, like a statistics file that keeps a history of test file timings.
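To make that minimum more concrete, here is a small illustrative helper (not part of the package) that computes the theoretical floor on total wall-clock time: no schedule can beat the single longest file, nor the total work divided evenly across runners.

//Illustrative only: a lower bound on the total wall-clock time of a balanced run
function lowerBoundRunTime(fileTimings: number[], runnerCount: number): number {
  //No runner can finish before the single longest file completes
  const longestFile = Math.max(...fileTimings);
  //And some runner must carry at least an even share of the total work
  const totalTime = fileTimings.reduce((a, b) => a + b, 0);
  return Math.max(longestFile, totalTime / runnerCount);
}
//For the nine example files: lowerBoundRunTime([900, 800, 700, 600, 500, 400, 300, 200, 100], 3) === 1500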
That's a lot of observations, but they helped immensely to shape a working solution.
Defining requirements of the new algorithm
Here's where it got fun: how do I communicate to myself what the algorithm must do? Well, based on the earlier observations, I can pick out a lot of things that are already well-defined.
Requirements for the load balancing function
- Load balancing only affects the execution of the test set and does not consider any setup or teardown events, since it only records timings for test files.
- The algorithm must know the following inputs:
  - the list of files to balance
  - a count of runners (worker processes)
  - a way to read a statistics file containing the timings of the files.
    - In this case, it records the durations, average, and median execution time of a file; my algorithm uses the median execution time when balancing.
    - If the timing of a file is not known, for instance because it is a new file, assume an initial timing of 0.
- A plugin must be registered in the Cypress `setupNodeEvents` hook to record the duration of each test file after it has been executed, update the file statistics, and save them to a persistent statistics file (a rough sketch of this recording step follows this list). In this case, it goes to a directory of `/.cypress_load_balancer` with a file named `spec-map.json`. This file is reused by the balancing function to know the timing of each file.
- Each runner process will only contain the names of the files it is executing.
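To illustrate that recording mechanism, here is a rough sketch built on Cypress's standard setupNodeEvents and after:spec events. It is not the package's actual plugin code; the stats path and the median helper are assumptions for the example.

//cypress.config.ts -- illustrative sketch only; the real plugin registration in cypress-load-balancer differs
import { defineConfig } from "cypress";
import fs from "node:fs";

const STATS_DIR = "./.cypress_load_balancer";
const STATS_PATH = `${STATS_DIR}/spec-map.json`;

//Hypothetical helper: median of a list of durations
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

export default defineConfig({
  e2e: {
    setupNodeEvents(on) {
      on("after:spec", (spec, results) => {
        //Load (or initialize) the statistics map, append this run's duration, and recompute the stats
        const map = fs.existsSync(STATS_PATH)
          ? JSON.parse(fs.readFileSync(STATS_PATH, "utf8"))
          : { e2e: {}, component: {} };
        const durations = [...(map.e2e[spec.relative]?.stats.durations ?? []), results?.stats?.duration ?? 0];
        map.e2e[spec.relative] = {
          stats: {
            durations,
            average: durations.reduce((a, b) => a + b, 0) / durations.length,
            median: median(durations)
          }
        };
        fs.mkdirSync(STATS_DIR, { recursive: true });
        fs.writeFileSync(STATS_PATH, JSON.stringify(map, null, 2));
      });
    }
  }
});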
Requirements of the algorithm
- It accepts a list of filenames, the runner count, and can access the statistics file, `spec-map.json`.
- To start, get the statistics of each file, and then sort the files by longest to shortest runtime as `sortedFiles`.
- Initialize `X` runners as empty arrays `[]`. For each runner, remove the longest-running file from the `sortedFiles` array and add it to that runner's array. For example, for 3 runners, runner 1 gets the longest-running file, runner 2 gets the second longest, and so on.
- Calculate the total timing of each runner, then save the total time of the largest runner to a variable named `highestTotalRunnerTime`; this variable may be updated later.
- Finally, sort the runners from shortest to longest runtime; this is intentionally the inverse of how the files are sorted.
- Next, it will iterate over every runner except for the largest runner. For each:
  - If the runner has a larger run time than `highestTotalRunnerTime`, skip adding any files to it in this iteration.
  - If not, then remove the next highest file from `sortedFiles` and add it to the runner.
  - Repeat until:
    - no more files remain, or
    - all runners have a total time equal to or greater than the original `highestTotalRunnerTime`.
- If there are more files to add and all runners now have a time equal to or higher than `highestTotalRunnerTime`, then this process repeats as follows:
  - If all runners have the exact same execution time, then iterate over each runner and add the file with the next highest time; this becomes the new basis of the minimum execution time.
  - Sort the runners from shortest to longest runtime again.
  - Get the largest runner and set its total execution time as the new value of `highestTotalRunnerTime`. (This is to prevent a deadlock state when more files remain but no runners are equal to `highestTotalRunnerTime`.)
- Continue until all files have been placed in a runner.
- Return the array of runners.
The "weighted-largest" algorithm in depth
Here is the algorithm function in TypeScript with some additional comments to explain how it works:
interface LoadBalancingMap {
e2e: {
[relativeFileName: string]: {
stats: {
durations: number[];
average: number;
median: number;
};
};
};
component: {
[relativeFileName: string]: {
stats: {
durations: number[];
average: number;
median: number;
};
};
};
}
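//Note: TestingType, FilePath, Runners, and the sum/filterOutEmpties helpers are assumed to be defined elsewhere in the package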
function balanceByWeightedLargestRunner(
loadBalancingMap: LoadBalancingMap,
testingType: TestingType,
runnerCount: number,
filePaths: FilePath[]
): Runners {
if (runnerCount === 1) return [filePaths];
//Helper methods
const getFile = (fp: FilePath) => loadBalancingMap[testingType][fp];
const getTotalTime = (fps: FilePath[]) => sum(fps.map((f) => getFile(f).stats.median));
const sortByLargestMedianTime = (fps: FilePath[]) =>
fps.sort((a, b) => getTotalTime([a]) - getTotalTime([b])).reverse();
//Sort **files** by highest to lowest "expected run time" (median runtime)
const sortedFilePaths = [...sortByLargestMedianTime(filePaths)];
const addHighestFileToRunner = (runner: FilePath[]) => {
const file = sortedFilePaths.shift();
if (file == null) {
return;
}
runner.push(file as string);
};
//Initialize each runner empty
let runners: Runners = Array.from({ length: runnerCount }, () => []) as Runners;
//This could be done more efficiently by using array indices alongside an array of every runners' total time,
// instead of resorting each iteration.
sortRunners: do {
if (sortedFilePaths.length === 0) break;
const areAllRunnersEqualInRunTime = runners.every((r) => getTotalTime(r) === getTotalTime(runners[0]));
if (areAllRunnersEqualInRunTime) {
//When all runners are equal in time, pop out the file with the next highest runtime for each runner
//This will prevent a deadlock state while also keeping files evenly spread amongst runners while still balanced
runners.map(addHighestFileToRunner);
}
//Sort **runners** by lowest to highest runtime
runners = runners.sort((a, b) => getTotalTime(a) - getTotalTime(b));
//Get the highest runner runtime of this iteration to compare against the other smaller runners
const highestRunTime = getTotalTime(runners[runners.length - 1]);
//"runners.length - 2" means it will not add files to the largest runner at "runners.length - 1", since we are using it as a basis for all other runners.
for (let i = 0; i <= runners.length - 2; i++) {
if (sortedFilePaths.length === 0) break sortRunners;
const currentRunner = runners[i];
const currentRunnerRunTime = getTotalTime(currentRunner);
if (currentRunnerRunTime >= highestRunTime) continue;
addHighestFileToRunner(currentRunner);
}
} while (sortedFilePaths.length > 0);
//Remove empty values just in case
return runners.map((r) => filterOutEmpties(r)) as Runners;
}
Keep in mind that this algorithm is not very optimized and could be made much better, but in this case, memory usage and time complexity are not a concern, even with large file sets.
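As a rough usage sketch (the spec file names and the synthetic map below are made up for illustration), the function can be exercised against the nine example timings like this:

//Hypothetical usage: build a synthetic map for the nine example files and balance them across 3 runners
const exampleTimings = [900, 800, 700, 600, 500, 400, 300, 200, 100];
const exampleMap: LoadBalancingMap = {
  e2e: Object.fromEntries(
    exampleTimings.map((median, i) => [
      `cypress/e2e/spec-${i + 1}.cy.ts`,
      { stats: { durations: [median], average: median, median } }
    ])
  ),
  component: {}
};
const runners = balanceByWeightedLargestRunner(exampleMap, "e2e", 3, Object.keys(exampleMap.e2e));
//Each inner array is one runner's set of spec files, grouped to keep total median runtimes as even as possible
console.log(runners);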
If we take the "weighted-largest" algorithm and compare it against the "round-robin" with the same 9 files and 3 runners, we get this output:
File timings:
[900, 800, 700, 600, 500, 400, 300, 200, 100]

Runners after load balancing:
[ [800, 500, 100], [700, 600, 200], [900, 400, 300] ]

Total run times:
[ 1400, 1500, 1600 ]
The highest time is only 200 ms less than the result of the “round-robin” algorithm, but the difference between the largest and smallest runner is now also only 200 ms. While this may seem like only a marginal improvement, the examples below demonstrate far more striking gains.
Examples: Comparisons of different file timing distributions
To get an accurate idea of how these handle different file timings, I created 8 different variations of file time distributions and compared the results of the "round-robin" and "weighted-largest" algorithms.
Example: Every file has the exact same timing
Here there were 9 files, each having a timing of 100. Both algorithms ended up with the same result, which should be expected.
Both had an equal distribution amongst all runners. This seems reasonable.
Example: Bell curve distribution
Values:
[
1,
10, 10,
20, 20, 20,
30, 30, 30, 30,
40, 40, 40, 40, 40,
50, 50, 50, 50, 50, 50,
60, 60, 60, 60, 60,
70, 70, 70, 70,
80, 80, 80,
90, 90,
100
]
Here, we can already start to see improvement. The "weighted-largest" algorithm appropriately balanced each runner, with a spread of 1, whereas the "round-robin" approach has a spread of 80. That's pretty good so far!
Example: Extreme high-end distribution
Values:
[
100, 100,
500, 500, 500, 500, 500, 500, 500, 500
]
I used only two runners for this trial. The results of both algorithms were equal in balancing abilities. This looks reasonable.
Example: Extreme low-end distribution, version 1: total sum of low values equals single high value
Values:
[
100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
1000
]
Wow! We can really see improvements with the "weighted-largest" algorithm. Both of its runners have equal time, whereas the "round-robin" approach is incredibly unbalanced, with a difference of 1000 in total time between its runners.
Example: Extreme low-end distribution, version 2: total sum of low values is greater than single high value
Values:
[
100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
1000
]
There is not much difference from the first low-end distribution trial. The "weighted-largest" algorithm produces better results.
Example: Extreme center distribution
Values:
[
10,
20,
30,
40,
50, 50, 50, 50, 50, 50,
60,
70,
80,
90,
100
]
What is most interesting here is that it is much easier to visualize how the "round-robin" algorithm handles each value, where each runner's runtime is shown in a descending order. Still, the "weighted-largest" algorithm produces very even results.
Example: Extreme "ends" distribution
Values:
[
10, 10, 10, 10,
20,
30,
40,
50,
60,
70,
80,
90,
100, 100, 100, 100
]
Again, the "round-robin" approach is easy to visualize in a descending order. The "weighted-largest" algorithm has much more even results.
Example: Uniform value distribution
Values:
[
100, 100, 100, 100, 100, 100,
200, 200, 200, 200, 200, 200
]
Like the very first trial with all timings equal to one another, both algorithms will balance exactly the same in this case.
Conclusion from the initial trials
These results definitely show the "weighted-largest" algorithm to be much better at balancing files appropriately. Not only is each runner's timing more evenly distributed, but the longest runtime is kept as low as possible as well. However, one thing to consider is that a lot of these values do not have much variance from one another; most are in steps of 10 or 100 away from each other. So, I decided to try this with some real-world results.
Examples: Real world trials
130 component tests
I'll skip showing the values since there are too many, but the longest-running file takes 3 minutes, and most take under 10 seconds. There is a greater variance of timings across all files in this example.
For this trial, there were 130 tests across 6 runners. There is a stark difference between both algorithms, with the "weighted-largest" being the more effective solution here. All of its jobs take under 4 minutes, as compared to the longest job of the other algorithm being over 9 minutes!
45 end-to-end tests
Values (in minutes):
[
0.01, 0.01, 0.01, 0.01, 0.01, 0.06,
0.14, 1.22, 1.31, 1.76, 1.81, 2.01,
2.22, 2.31, 2.32, 2.32, 2.42, 3.15,
3.19, 3.28, 3.32, 3.41, 3.68, 4.33,
4.86, 5.09, 5.1, 7.05, 8.37, 9.01,
9.99, 10.06, 11.95, 12.75, 13.18, 13.39,
15.69, 17.21, 19.19, 20.03, 20.62, 20.75,
22.45, 45.08, 46.21
]
The end-to-end tests have a lot more variance between file timings. The longest-running test has a median runtime of over 45 minutes!
In this trial, I used 16 runners, as this is more representative of an actual organization. The great thing is that the "weighted-largest" algorithm was able to keep that single file in its own runner, and all other runners were kept lower than it. I believe this to be a huge success.
Additional considerations and final conclusions
This exercise was really refreshing to me -- it was great to actually see improvement over a common solution by readdressing the actual goal. For static processes, it makes much more sense to balance all runners to stay below the limit of the largest runner, since ultimately the total execution time is dependent on the total runtime of any one of its parallel processes. The "weighted-largest" algorithm feels much more fitting as a solution here.
As a final note, I wanted to compare the results versus the "cypress-split" package as well, so I took the component tests and split them amongst 6 runners there as well. The results are nearly identical across both packages!
If you'd like to add this package to your Cypress suite, see it here!