Thai Pangsakulyanont for Taskworld

Posted on May 1, 2020

Save the precious build minutes! Reusing build outputs with Git Tree Hash 🌳

#github #circleci #performance #optimization

Abstract: Do you spend a lot of time (and perhaps money) waiting for builds on the master branch, even though its contents are identical to the commit before you clicked that Merge Pull Request button? You may save time by caching your build output, using a Git Tree Hash (not to be confused with Commit Hash) as a cache key.

This is the second installment of a series about build performance optimization.

For this one, I discovered the technique while optimizing Taskworld’s frontend build pipeline.

Context: The Netlify Build Workflow

At Taskworld, we’ve been using a build workflow that has now been popularized as “Netlify Build.” It is a Git workflow for developing, testing, and delivering JAMStack sites to production continuously:

While you work on feature branches, an automated system builds and deploys your code to a deployment preview environment, so people can review and test the app without having to do do a git checkout && yarn && yarn dev by themselves.
When your change is merged to master, an automated system builds and deploys your code to a production environment.

I think this is very similar to the GitHub flow… but with the GitHub flow, they deploy stuff to production (in order to verify the changes on production) before merging to master. Netlify’s default build workflow, in contrast, deploys to production after merging to master (it can be customized though).

😩 Why build twice?

If your master branch is protected, and you also require feature branches to be up-to-date with master before they can be merged, you’ll find that the commit generated when merging to master will have the exact same contents as the commit just before the merge.

But still, most CI pipelines, by default, will build your code from scratch in such situation.

Now this is fine for a small project. But as our app grows, what used to take seconds now takes minutes. We start to see some redundancy in building the exact same code over and over again.

🌳 Introducing the Git Tree Hash

When you run git rev-parse HEAD, you get the hash of the HEAD commit:

$ git rev-parse HEAD
10684e38090ed90d2d58d3ff3c81ace99ce658fe

When you run git rev-parse HEAD: (note the colon at the end), you get the hash of the contents that the HEAD commit is representing. This is a tree hash:

$ git rev-parse HEAD:
29aa6872b8bfa8a911995c6a6b206fdd158339e3

🤔 Now, what’s the difference?

A commit hash is calculated based on its contents and metadata. Metadata includes the commit message, committer and author’s information (like name, email, and date), as well as the parent commits’ hashes. That’s why each time you create a commit you always get a completely different hash.
A tree hash, on the other hand, captures only the state of the files within. It’s as if you take a snapshot of the repository at that commit and hash its contents. If the contents are the same, the tree hash will be the same, regardless of the commit history.

You can also get a tree hash of a subdirectory. As you might have guessed, the subdirectory’s tree hash will stay the same if you don’t modify the contents of that subdirectory:

$ git rev-parse HEAD:docs
1c771ff483992f38b268f08e9c015b613aa51e0a

If you are interested in learning more about this concept, I really recommend the article “The anatomy of a Git commit” by Christoph Burgdorf where Git blobs, trees, and commits are visualized using an easy-to-understand diagram.

💡 Reusing build outputs with Git Tree Hash

Now, how can we put this to use?

We can use Git Tree Hash as a cache key for our build outputs.

Instead of running your build script unconditionally, you can make your build process try to restore the build output from the cache first, and then run the build script only when the cache doesn't exist.

Here’s an example CircleCI configuration. The same concept can also be applied to other CI systems.

+      - run:
+          name: obtain tree hash
+          command: |
+            git rev-parse HEAD: | tee /tmp/tree.hash
+      - restore_cache:
+          keys:
+            - v1-tree-{{ checksum "/tmp/tree.hash" }}
       - run:
           name: build
           command: |
-            yarn build
+            test -f build/index.html || yarn build
+      - save_cache:
+          key: v1-tree-{{ checksum "/tmp/tree.hash" }}
+          paths:
+            - build

On CI systems that doesn’t provide a cache storage, you can also use a cloud storage service, like Amazon S3 or Google Cloud Storage.

⚠️ Some safety caveats:

Ensure that the build directory does not exist before restoring the build output from the cache. Build systems that gives you a pristine build environment (e.g. GitHub Actions and CircleCI) will not have this problem. However if you use build systems where the workspace directory can be reused (e.g. Jenkins), please be aware of this caveat.
Make sure to verify the validity of the restored cache. A corrupted cache can lead to a corrupted build. CircleCI already verifies the integrity of the restored cache, so with CircleCI we get this for free.
Make sure to never save the build output to the cache if the build failed. It may lead to a corrupted cache.

🤩 Use a more fine-grained tree hash

This requires you to list out everything that may potentially affect the build output, but doing this will increase the chance of a cache hit.

       - run:
           name: obtain tree hash
           command: |
-            git rev-parse HEAD: | tee /tmp/tree.hash
+            git rev-parse HEAD:src | tee -a /tmp/tree.hash
+            git rev-parse HEAD:public | tee -a /tmp/tree.hash
+            git rev-parse HEAD:.babelrc | tee -a /tmp/tree.hash
+            git rev-parse HEAD:postcss.config.js | tee -a /tmp/tree.hash
+            git rev-parse HEAD:tsconfig.json | tee -a /tmp/tree.hash
+            git rev-parse HEAD:yarn.lock | tee -a /tmp/tree.hash

⚠️ Some safety caveats:

If you forget to include any dependency here, there is a risk of a developer changing the configuration just to find out that the build output remains unchanged. Tracking this down can be a painful experience. So, when caching strategies like this is involved, try to make it clear to the developers what goes into determining whether to reuse the build output, and also provide instructions on how to force-invalidate the cache.

🏘 Use with subprojects

Sometimes a project may contain both the main application and a documentation site in the same repository. This can take time to build.

Since you can obtain a tree hash of a subdirectory, you can, for example, cache the documentation site, and only rebuild it when the documentation contents have changed.

⚠️ Some safety caveats:

You may use this technique with a monorepo project as well, but the contents of the same-repository dependencies must also go into the cache key of the dependent sub-project. I would recommend using tools designed for monorepos such as Nx or Bazel instead of a makeshift solution like this.

✅ Use with tests

If your tests are taking a long time to run, you can also cache the junit.xml file or something similar when tests are passing, so you don't have to re-run your tests if you already know that they are passing.

ℹ️ Including the commit hash in the build output

Including a commit hash in the built application can make it easier to track down bugs, e.g. in Sentry. Usually this is accomplished by providing a build-time environment variable. For example:

env REACT_APP_GIT_SHA=`git rev-parse --short HEAD` yarn build

This might not work well with build output caching. When a build output from a previous commit is reused, the embedded commit ID may not match the commit that is being deployed.

We can resolve that by not using the commit hash during build time, but inject the commit hash into the application package at deployment time (e.g. injecting a <script>APP_VERSION='…'</script> or a manifest JSON file instead), and have the built application read it at runtime instead.

Conclusions

By learning how Git works under the hood, and exploring how data is stored inside a Git repository, we can come up with ways to improve our build performance.

At Taskworld, we saw 3 minutes of savings in build time when we can reuse a build output from one of the previous commits.

As a result, the build pipeline becomes faster in merging situations. We also get feedback from the CI system especially faster when we make changes to parts of the repository other than the main app (such as end-to-end tests and build scripts).

Hope you’ll find this useful, and thanks for reading!

Top comments (1)

William Fortin • Apr 27 '21

I'm curious if you got it to work on GitHub Actions since the GitHub cache can only be shared between "parent" branches, so when I hit master I get a cash miss for that tree hash.
Did you end up building your own cache on S3?