The Problem
Here at CompanyCam, we rely on automated testing to ensure that our code is working as expected. Our backend is written in Ruby on Rails, and we leverage automated testing (via RSpec) to help ensure that new code doesn’t break existing functionality. We have thousands of these automated tests—over 3,500 at the time of writing—and we’re always writing new tests alongside our new features, to be sure those don’t break in the future, either.
Because these tests are so important to our development process, and because we have many developers working on our codebase simultaneously, we execute our full test suite dozens of times each day.
For new features and maintenance work, we use a typical pull-request workflow, running tests at each step along the way. A normal day for a developer might look like this:
1. Write code.
2. Test that code locally.
3. Push the new code to our central repository (GitHub).
4. Run tests to ensure that the new code is working properly.
5. If changes are needed, go back to step 1; otherwise proceed.
6. Merge in changes that other developers have written in the meantime.
7. Run tests again, ensuring everything still works.
8. If changes are needed, go back to step 1; otherwise proceed.
9. Merge the code into the main branch.
Because the “run tests” step happens so frequently, it’s important that it completes as quickly as possible.
This is where we had a problem.
It was taking 13 or more minutes to run our full test suite from start to finish. During that time there wasn’t much for a developer to do—if their tests failed, they were going to have to fix or redo some work. If they passed, they could carry on—but it wasn’t safe to carry on before knowing everything was working properly. With 13 minutes to kill, it was easy to get caught up in another conversation, go make another cup of coffee, play some Wordle, or get lost in any other distraction. Suddenly that 13 minutes was 20 minutes, or a half hour. Multiply this delay by all the times the tests had to be run each day (many dozens at least), and you can see that tests were a major slowdown for our team.
Setting a Goal
We knew we wanted the tests to run faster, but how much faster was enough? We set a goal to get the tests under 5 minutes. We had an intuitive sense this was possible, and 5 minutes seemed like a short enough time frame that it wouldn’t turn into a full-on distraction. I had an internal stretch goal of getting our tests down to 3 minutes, but at the time I had no solid reason to believe we could get there.
Step 1: Parallel Tests
Once we had a goal, we needed to figure out how to get there. The first thing we knew we had to do was get our tests to run in parallel. We figured that we could always throw more computing power at the tests, but unless we could run many tests at the same time, side by side, we’d never be able to make full use of any available computing power we might have.
Several members of our team worked together to get the parallel_tests gem working for our codebase. Most of our tests were fine running in parallel, but enough of them made assumptions about the order they would run in that we had to do some fixing and rewriting. We kept this work in a feature branch and kept chipping away at it until all the tests would pass.
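For anyone curious what that looks like in practice, here is a rough sketch of a single-machine test job built around the gem (shown in CircleCI-style YAML for consistency with the examples later in this post; the image tags, job name, and database setup are illustrative rather than our exact configuration):

```yaml
jobs:
  rspec:
    docker:
      - image: cimg/ruby:3.1        # primary container running the specs
      - image: cimg/postgres:14.2   # database service container
    steps:
      - checkout
      - run: bundle install
      - run: bundle exec rake parallel:setup    # create and prepare one test database per process
      - run: bundle exec parallel_rspec spec/   # fan the spec files out across the machine's CPU cores
```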
Once we had all the tests running in parallel, we pushed the branch up to our CI provider, and our tests were passing in about 8 minutes. That’s about a 40% time savings! A good start, but we knew we could do better.
Step 2: Evaluating Our CI Provider
The process of running tests automatically with each new bit of code is called Continuous Integration (CI). We use a CI service in the cloud to run these tests for us.
As we dug into the timeline reports for our running tests, we discovered that our tests were actually running pretty quickly—about 3 minutes—but the environment setup step was taking a full 5 minutes before the tests could even get started!
We spent some time reading through the documentation from our CI provider at the time, and made as many optimizations as we could. Still, we just couldn’t get the setup time down. It seemed that we were maxing out the capacity of their servers during our setup process.
We decided it was time to look around at other CI providers to see if we could find one that better fit our needs.
Step 3: Testing Other CI Providers
We set up trial accounts with a few providers and spent the time to get our testing environment working on each of them. During this process, we kept our existing CI provider integration working; it was imperative that we didn’t disrupt the regular workflow of our engineers while we endeavored to speed things up.
Right away we discovered that other CI providers had more capable platforms, and we were able to get our tests down into the 6-minute range—tantalizingly close to our 5-minute goal!
One provider stood out from the others: CircleCI. After testing we decided to put our optimization efforts into CircleCI and see what kind of results we could get.
We were not disappointed by this decision.
Step 4: Optimizing CircleCI
CircleCI offered more choices for machine size and parallelism than other providers we tested, and by using a few “large” instances we were able to see our tests run blazingly fast—under two minutes! However, our setup time—the time it takes before the tests can even begin, during which our test environment is initialized—was still in the 90-120 second range. After some investigation, we were pretty confident we could make that step much faster as well.
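Machine size in CircleCI is controlled by the resource_class key on a job; here is a minimal sketch of what choosing a larger class looks like (the job name and image tag are illustrative):

```yaml
jobs:
  rspec:
    docker:
      - image: cimg/ruby:3.1
    resource_class: large   # more vCPUs and RAM per container than the default "medium"
    steps:
      - checkout
      # ... the rest of the test steps ...
```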
Caching Assets
In order to make that setup time faster, the first thing we needed to do was cache our gems; gem install was taking 30-45 seconds, and our gems rarely change. By leveraging CircleCI’s asset caching feature, we were able to store a copy of the installed gems and restore it on every run instead of re-installing. Using this cached copy cut the assets step of setup down to a 2-second check, saving half a minute per run. (When the gems do change, the first run takes a little longer; subsequent runs use the newly cached gems and are fast again.)
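Under the hood this is CircleCI’s restore_cache / save_cache step pair, keyed on a checksum of Gemfile.lock so the cache invalidates itself whenever the gems change. A minimal sketch (the cache key naming is our own illustration, not our exact config):

```yaml
steps:
  - checkout
  - restore_cache:
      keys:
        - gems-v1-{{ checksum "Gemfile.lock" }}    # exact hit when the lockfile hasn't changed
  - run: bundle config set --local path 'vendor/bundle'
  - run: bundle check || bundle install --jobs 4   # a ~2 second no-op on a cache hit, a full install otherwise
  - save_cache:
      key: gems-v1-{{ checksum "Gemfile.lock" }}
      paths:
        - vendor/bundle
```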
The Big Win: Using CircleCI’s Parallelism Feature
CircleCI has a really cool feature for running tests in parallel. It’s kind of like the parallel_tests gem, but on steroids—it splits your individual test files across as many virtual test servers as you want (up to about 64).
Let’s say you have 16 test files that need to be run. With the parallelism value set to 2, two test servers will be spun up: the first will run files 1-8, the second files 9-16. If you set parallelism to 4, you’d have four servers running files 1-4, 5-8, 9-12, and 13-16, respectively.
But the magic doesn’t stop there! After every run, CircleCI stores the amount of time each test took to run, and the next time the tests are run, it divides them up by estimated duration!
Here’s an example. Let’s say you have four tests, and the time to run each of them varies:
Test 1: 30 seconds
Test 2: 27 seconds
Test 3: 17 seconds
Test 4: 9 seconds
If you split up the tests in order (1 & 2 on one server, 3 & 4 on the other), your total run time will be the time of the longest-running server. In this example, that would be 30 + 27, for a total of 57 seconds. The other server will have finished in 26 seconds. CircleCI notices how long each test takes, then on the next run groups them so that the totals are as close as possible. So the next run would be 1 & 4 and 2 & 3. Because the longest-running group now takes just 44 seconds, you’d cut nearly 25% (13 seconds) off your total run time!
Because we have many thousands of tests, we’re able to split them up very evenly.
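Wiring this up mostly comes down to the parallelism key plus the circleci CLI, which hands each node its own slice of the spec files; the per-file timing data that powers the split comes from the test results you store at the end of each run. Here is a rough sketch of a job like this (it assumes the rspec_junit_formatter gem and is not our exact configuration):

```yaml
jobs:
  rspec:
    parallelism: 32             # number of identical nodes the job fans out to
    docker:
      - image: cimg/ruby:3.1
    steps:
      - checkout
      # ... restore the gem cache, set up the database, etc. ...
      - run:
          name: Run this node's share of the spec files
          command: |
            # Glob every spec file, then ask CircleCI for this node's slice,
            # balanced using the timing data from previous runs.
            TESTFILES=$(circleci tests glob "spec/**/*_spec.rb" | circleci tests split --split-by=timings)
            bundle exec rspec \
              --format progress \
              --format RspecJunitFormatter --out tmp/test-results/rspec/results.xml \
              $TESTFILES
      - store_test_results:     # feeds per-file timings back to CircleCI for the next run's split
          path: tmp/test-results
```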
By leveraging CircleCI’s parallelism we were able to run our tests on as many as 32 instances at a time, cutting our test time down to less than 30 seconds! Still, because of setup, total run time was in the 2:30 range.
Aside: As another bonus, running the tests this way meant we didn’t have to use the parallel_tests gem in CI. We didn’t have any problems with the gem (it still makes sense to use it in local development, for example, where we aren’t using CircleCI’s magic), but not needing it in CI made our test stack simpler and easier to maintain.
The Last Step: Docker Image Hosting and Caching
Now that the testing process itself was running crazy fast, we needed to optimize the last few pieces of the setup process.
The first step was to move our Docker images to Amazon Elastic Container Registry (ECR). ECR infrastructure is close (in the network sense) to CircleCI, and downloads have been very fast for us.
The second step was to ensure that we were using the CircleCI Convenience Images whenever possible. These Docker images are heavily cached; in most cases they are instantly available to you.
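In config terms, both of these steps boil down to which images the docker executor pulls. A sketch of the shape (the account ID, region, and repository names below are made up):

```yaml
jobs:
  rspec:
    docker:
      # Primary container: a CircleCI convenience image, heavily cached on their fleet
      - image: cimg/ruby:3.1-node
      # Service container pulled from our own ECR registry
      - image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/example/postgres:14
        aws_auth:
          aws_access_key_id: $AWS_ACCESS_KEY_ID
          aws_secret_access_key: $AWS_SECRET_ACCESS_KEY
```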
The last step was a passive one: as more and more of our test runs happened on CircleCI’s infrastructure, we became more likely to get cache hits on our own Docker images, cutting out the time it takes to download them from ECR.
Altogether, these optimizations brought our total run time down into the low two minutes (2:15 average)! Our original goal had been under 5 minutes—we were under half that time!
Price/Performance Balance
While we were able to run our tests blazingly fast with a parallelism of 32, we were paying the setup cost 32 times over on every run instead of once. Over time, that gets expensive.
We ran a series of tests, changing the parallelism parameter in our CircleCI config, and testing the resulting total run time.
In the end, we settled on a parallelism of 12. That’s about 1/3 the total cost of 32, and our test suite completes in 2:45-3:00, which we find is an acceptable time range for our staff.
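The knob itself is a single key in the job definition, so revisiting it later is a one-line change (job name illustrative):

```yaml
jobs:
  rspec:
    parallelism: 12   # down from 32: roughly a third of the cost, ~2:45-3:00 total run time
```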
What’s Next
With testing, the work is never truly done. We still have some lingering flaky tests (tests that sometimes pass and sometimes don’t) that we’re working on cleaning out. We’re also in the middle of identifying individual slow-running tests and optimizing them to speed up the suite a bit more.
And then, of course, as we keep writing tests, we’re going to need to maintain our CI suite so that we don’t slow down over time. We regularly check our average test run times, and will adjust our parallelism factor as needed (and optimize individual slow tests) to keep things moving quickly.
Speeding up our test suite has been a huge win for our engineering team. It’s improved morale, removed a bunch of frustration, and opened up new avenues of possibility. Mostly, it’s saving us a ton of time every day. Time that we can spend Showing Up, Growing Up, and Doing Good.