At Place Exchange, we use Buildkite as our continuous integration and continuous deployment platform to deploy our Django application to AWS. Buildkite pipelines deploy the application to all of our environments, including Dev, QA and Production. Before we made any changes, the mean build and deployment time across these environments was about an hour. With a host of tactical fixes, we brought that down to about 15 minutes, a decrease of almost 75% in our build times. This is the story of how we got there.
Why did we start looking into this problem?
This all started when we decided to add a new QA environment, primarily for load testing, to the Buildkite pipeline. Adding the QA environment immediately increased the time it took to roll out features to Production, because the extra build steps consumed additional build time. Developers complained that the rollout process took too long and was delaying the deployment of hotfixes and new features. We wanted the QA deployment to remain part of the same pipeline, but we also wanted to minimize rollout time to Production.
Our initial observations
Sequential pipeline deployment
Before we parallelized the application tests and the Dev/QA deployments, the steps in our deploy pipeline ran sequentially.
As a result, the time required to run the application pipeline was
- Best case (all steps use Docker layer caching to build the Docker image): ~50 mins
- Worst case (no steps use Docker layer caching to build the Docker image): ~1 hr
Docker layers cause increased build times
When building an image, Docker steps through the instructions in your Dockerfile, executing each instruction in the order specified. As each instruction is examined, Docker looks for an existing image in its cache that it can reuse, rather than creating a new (duplicate) image. Docker's documentation on the build cache covers this in more detail.
We observed that when a build agent was used for the first time (right after it started up), it had to download images over the network, since no cached layers were present on the agent yet. This added time to the build steps. Furthermore, different combinations of first-time agents across multiple steps led to variations in overall build time.
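To make the caching behaviour described above concrete, here is a minimal sketch of how a layer cache behaves. It is a simplified model, not Docker's actual implementation: it assumes a layer's cache key depends only on its parent layer and the instruction text (real Docker also checksums copied files), and the Dockerfile instructions are hypothetical.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per instruction; each key chains in its parent's key."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


def count_cache_hits(instructions, cached_keys):
    """Count the leading layers that can be reused; stop at the first miss."""
    hits = 0
    for key in layer_keys(instructions):
        if key not in cached_keys:
            break
        hits += 1
    return hits


# Hypothetical Dockerfile instructions, for illustration only.
dockerfile = [
    "FROM python:3.8",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY . .",
]

warm_agent = set(layer_keys(dockerfile))  # agent that has built this image before
cold_agent = set()                        # freshly started agent with an empty cache

# Changing one instruction invalidates that layer and every layer after it.
changed = dockerfile[:1] + ["COPY requirements-dev.txt ."] + dockerfile[2:]

print(count_cache_hits(dockerfile, warm_agent))  # every layer reused
print(count_cache_hits(dockerfile, cold_agent))  # nothing reused
print(count_cache_hits(changed, warm_agent))     # only the base layer reused
```

A freshly started agent corresponds to the `cold_agent` case: every layer has to be pulled or rebuilt, which is why those builds were slower.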
Fixing the problem and how we made our builds faster
Running a build's jobs in parallel is a simple technique to decrease your build's total running time, and Buildkite provides many options to do so. The build and deploy process involves multiple steps. Some of them can run in parallel, but which ones depends on several criteria, such as whether a step needs the output of another. We used the options below to optimize total build time across our environments.
Parallel test execution
Backend and frontend tests can be run separately since they are independent of each other. However, the results of these two sets of tests are uploaded to Code Climate to generate a holistic testing report. Luckily, the Buildkite Agent has artifact support to store and retrieve files between different steps in a build pipeline. Using this feature to pass information out of the two parallel Buildkite steps, we imported the test results (stored as build step artifacts), combined them into one report and then uploaded it to Code Climate.
After this change, build time dropped from ~1 hour to ~35 mins, a reduction of about 42%.
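The combining step described above can be sketched roughly as follows. This is an illustration under assumptions, not our actual script: the report format (a mapping of file path to per-line hit counts) is hypothetical and is not Code Climate's payload format, and downloading the artifacts from the two parallel steps is not shown.

```python
def merge_coverage(*reports):
    """Merge per-file line-hit counts from several reports into one."""
    combined = {}
    for report in reports:
        for path, hits in report.items():
            merged = combined.setdefault(path, {})
            for line, count in hits.items():
                # Sum the hits when two reports cover the same line.
                merged[line] = merged.get(line, 0) + count
    return combined


# Hypothetical results, as if loaded from the artifacts of two parallel steps.
backend = {"app/views.py": {1: 1, 2: 0}}
frontend = {"static/app.js": {1: 3}, "app/views.py": {2: 1}}

combined = merge_coverage(backend, frontend)
print(combined)
```

The point of the design is that each test step only writes its own result file, and a single later step owns the merge and the upload.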
Parallel deployment to independent environments
Depending on your SDLC setup, environments serve different purposes with varying degrees of priority. We use the Dev environment in the classic sense, as a sandbox, and the QA environment purely for load testing. Because the two are independent, neither deployment needs to wait for the other.
In Buildkite, multiple agents can run independent steps in parallel. Using this approach, we updated our application's pipeline to deploy to the Dev and QA environments in parallel, which reduces build time significantly.
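A small model shows why running independent steps side by side helps. The step names and durations below are hypothetical, not measurements from our pipeline, and the model assumes there are enough agents to run every step of a stage at once.

```python
def sequential_minutes(durations):
    """Total wall-clock time when every step waits for the previous one."""
    return sum(durations.values())


def parallel_minutes(stages):
    """Total time when the steps in each stage run concurrently.

    Each stage takes as long as its slowest step; stages still run in order.
    """
    return sum(max(stage.values()) for stage in stages)


# Hypothetical step durations in minutes.
tests = {"backend tests": 12, "frontend tests": 8}
deploys = {"deploy dev": 10, "deploy qa": 10}

before = sequential_minutes({**tests, **deploys})  # 40
after = parallel_minutes([tests, deploys])         # 22
print(before, after)
```

In this model the saving comes entirely from the shorter step in each stage no longer adding to the total, so parallelizing pays off most when the parallel steps have similar durations.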
Ensure execution of only branch-specific steps in deployment pipeline
Steps such as running local tests and sandbox database migrations are meant to run on feature branches only, so they do not need to be part of the master pipeline.
Applying a branch filter to these steps, so that each branch executes only the steps it needs, helped reduce build and deployment time on master.
After this change, build time dropped from ~35 mins to ~16 mins, a reduction of about 54%.
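As a rough illustration of how a branch filter decides which steps run, the sketch below approximates a space-separated filter with `*` wildcards and `!` negation. It is an approximation for explanation only: Python's `fnmatch` also treats `?` and `[...]` as special characters, which may differ from Buildkite's matching, and the step names and filters are hypothetical.

```python
from fnmatch import fnmatchcase


def step_runs_on(branch, branch_filter):
    """Approximate a space-separated branch filter with '*' and '!' negation."""
    patterns = branch_filter.split()
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    if any(fnmatchcase(branch, p) for p in excludes):
        return False
    # With no positive patterns, any branch that is not excluded is allowed.
    return not includes or any(fnmatchcase(branch, p) for p in includes)


# Hypothetical step names and filters, for illustration only.
steps = {
    "run local tests": "!master",
    "sandbox db migrations": "!master",
    "deploy to production": "master",
}


def steps_for(branch):
    """List the steps whose filter allows the given branch."""
    return [name for name, flt in steps.items() if step_runs_on(branch, flt)]


print(steps_for("master"))
print(steps_for("feature/login"))
```

With filters like these, a master build skips the feature-branch-only steps entirely instead of spending time on them.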
Measuring success
After applying all the optimizations and changes mentioned above, the total time required to run the application's pipeline is
- Best case (all steps use Docker layer caching to build the Docker image): ~16 mins
- Worst case (no steps use Docker layer caching to build the Docker image): ~25 mins
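For readers who want to check the percentages, the arithmetic can be reproduced with a few lines. The figures are the approximate build times quoted in this post, so the results are approximate too.

```python
def percent_reduction(before_minutes, after_minutes):
    """Percentage by which the build time dropped."""
    return (before_minutes - after_minutes) / before_minutes * 100


# (label, before, after) in minutes, using the approximate figures quoted above.
changes = [
    ("1 hr -> 35 mins", 60, 35),
    ("35 mins -> 16 mins", 35, 16),
    ("1 hr -> 16 mins overall", 60, 16),
]
for label, before, after in changes:
    print(f"{label}: {percent_reduction(before, after):.0f}% reduction")
```

Note that the two intermediate reductions do not simply add up, because the second percentage is taken relative to the already shorter 35-minute build.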
Here are some more details, specific to the master and feature branch pipelines.
On the master branch
In the above graph, the X-axis denotes the time range of master branch builds and the Y-axis denotes the time taken by each build.
As you can see, prior to 13th March, builds used to take ~50 min to ~1 hr to complete. After 13th March, when we pushed the build optimization changes, build times dropped to the ~17 min to ~25 min range.
Highlighting some exceptions in the graph above -
- From 21st Feb to 25th Feb, we used a separate pipeline because we were in the process of migrating our application from Swarm to Kubernetes. During that process we had a lightweight pipeline for the Kubernetes deployment, which resulted in lower build times.
- Between 23rd March and 26th March, two builds took ~1 hr to complete. On investigation, we found that they had run without Docker layer caching, which is what took us down the path of looking into layer caching. We also found that the network was slow on the hosts that ran our Buildkite agents. Those hosts use the t3.medium instance type, which promises only "Up to 5gbps" network performance. Such performance degradation happens very infrequently.
On feature branches
In the above graph, the X-axis denotes the time range over which builds happened and the Y-axis denotes the time taken by each build. Before 11th March 2020, builds on non-master branches used to take ~15 min to ~20 min to complete.
The first build optimization change, running tests in parallel, was released on 6th March, and after 11th March 2020 almost all branches had been updated with that change. After 11th March 2020, build times dropped to the ~10 min to ~15 min range, an improvement of ~33%.
Highlighting some exceptions in the graph above -
- Some builds after the changes still took ~15 min to complete because they ran without the Docker layer cache.
- On 17th March 2020, one build took ~20 min to complete because that branch had not been updated with the build optimization changes.
Next steps
Using the approaches outlined above, we have optimized the build and deployment time of our application to a certain extent, but optimization is a never-ending process and we are still working on making our build times better. Some ideas we are considering include
- running the same build step in parallel over multiple agents
- reducing Docker image build time by inspecting the Dockerfile and potentially removing Ansible-related packages from it. Today, these packages are used to run infrastructure-related steps that are not strictly application-related.
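The first idea could look something like the sketch below, which splits a list of test modules across parallel jobs. This is only one possible approach, not something we have implemented: it assumes the step is configured to run as several parallel jobs so that Buildkite sets the `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` environment variables, and the test module names are hypothetical.

```python
import os


def shard(items, job_index, job_count):
    """Return the items this job should handle, using a round-robin split."""
    # Sorting first makes every job agree on the same ordering.
    return [item for i, item in enumerate(sorted(items)) if i % job_count == job_index]


# Hypothetical test modules, for illustration only.
test_modules = [
    "test_ads.py",
    "test_auth.py",
    "test_billing.py",
    "test_reports.py",
    "test_users.py",
]

# Default to a single job when run outside Buildkite.
job_index = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
job_count = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))

my_modules = shard(test_modules, job_index, job_count)
print(my_modules)
```

A round-robin split is the simplest choice; splitting by historical test duration would balance the jobs better but needs timing data from earlier builds.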
Thanks for reading!