Matt Biilmann

Open source parallel processing for Gatsby

To help the greater Gatsby ecosystem shorten the time it takes to go from commit to deploy, I've just submitted a freshly baked Pull Request as my first larger contribution to the Gatsby open source project.

The new gatsby-parallel-runner plugin builds on some of the existing work in the Gatsby project to allow both plugins and core parts of Gatsby to parallelize certain tasks by delegating the work to a large pool of serverless functions.

Gatsby Cloud pioneered this approach, and with this plugin I hope that we can open an even more generalized approach to parallelization. While it's in early development, my goal is that this can one day be made available on any CI/CD environment, and empower individual plugin developers in the ecosystem to build in support for parallelization when a task is well suited for it.

Gatsby and "External Jobs"

A bit under a month ago, Ward Peeters from Gatsby's team landed a Pull Request to "enable external jobs with ipc". The idea behind the pull request is that instead of running gatsby build directly, an orchestrating parent process can fork gatsby build and make sure an environment variable called ENABLE_GATSBY_EXTERNAL_JOBS is set. When that is done, certain jobs are sent over Node's IPC channel to the parent process as "external jobs", so that the parent orchestrator can parallelize their execution efficiently.
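As a rough sketch of the parent side of this arrangement, an orchestrator built on Node's child_process.fork could look something like the code below. The buildScript path, the "true" value for the flag, and the shape of the messages are all assumptions made for illustration; the post only says that the variable must be set and that jobs arrive over IPC.

```javascript
const { fork } = require("child_process");

// Build the child's environment: inherit everything, then opt in to external jobs.
// The value "true" is an assumption; the post only says the variable must be set.
function externalJobsEnv(baseEnv) {
  return { ...baseEnv, ENABLE_GATSBY_EXTERNAL_JOBS: "true" };
}

// Fork a build script and hand every IPC message it sends to onMessage.
// `buildScript` is a hypothetical path to a module that starts the Gatsby build.
function forkBuild(buildScript, args, onMessage) {
  const child = fork(buildScript, args, { env: externalJobsEnv(process.env) });
  // fork() opens an IPC channel, so the child can call process.send()
  // and the parent can answer with child.send().
  child.on("message", (message) => onMessage(message, child));
  return child;
}
```

A caller would pass a callback that inspects each message, runs the job elsewhere, and reports the outcome back with child.send(). What those messages contain is defined by Gatsby, not by this sketch.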

On its own, Node.js is single-threaded, so out of the box any CPU-intensive job in Gatsby will only ever take advantage of one CPU core. This IPC-based delegation opens up the possibility of using external worker processes for parallelization.

The only plugin that currently hooks into this is gatsby-plugin-sharp, which is used for image transformation. At Netlify we've often seen image transformations be a huge source of build slowdowns for Gatsby sites, since doing lots of image processing within a single-threaded build is really inefficient.

Until now, the open source Gatsby project has not offered an implementation of an orchestrator that can help you take advantage of external jobs, and that's where our new gatsby-parallel-runner helps.

Gatsby Parallel Runner

I've been having fun building out this new Gatsby plugin that acts as an alternative build command. Once installed in a project, you'll run gatsby-parallel-runner instead of gatsby build.

Out of the box it comes with a parallelized implementation of the Sharp image plugin based on Google Cloud Functions. It includes an easy script to get your own functions and queues set up in a Google Cloud project with just one command. It's built to be extensible so the community can easily add alternative implementations of the actual execution layer. Obvious candidates would be AWS Lambda functions or a Node.js cluster implementation. It should also be easy to use with new plugins that want to add external jobs outside of just image processing.
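To make the idea of a swappable execution layer concrete, here is a minimal sketch of a registry that maps job types to executors. Every name in it (ExecutorRegistry, register, dispatch, the "image" job type) is hypothetical and chosen for illustration only; it is not the plugin's actual API.

```javascript
// Hypothetical names throughout: this is not the plugin's actual API.
class ExecutorRegistry {
  constructor() {
    this.executors = new Map();
  }

  // Register an executor (any function that takes a job) for one job type.
  register(jobType, executor) {
    this.executors.set(jobType, executor);
    return this;
  }

  // Run the job on its registered executor, or report that none exists so
  // the caller can fall back to processing the job locally.
  dispatch(job) {
    const executor = this.executors.get(job.type);
    if (!executor) return { handled: false };
    return { handled: true, result: executor(job) };
  }
}
```

Under this design, a Google Cloud Functions backend, an AWS Lambda backend, or a local cluster would each be one more executor passed to register, and a real executor would most likely return a promise as its result.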

My hope is that this can help pave the way for more innovations in the ecosystem around build parallelization – and of course we're seeing a lot of opportunity in adding a more generalized form of this to Netlify's own build layer.

Our philosophy has always been to keep the build layer fundamentally open - our core build image has always been Open Source as is our new Build Plugin layer, and we've always believed that a healthy Open Source ecosystem in the build tool space is vital to the growth of the whole JAMstack category. So we're happy to contribute this project back to the Open Source community.

I ran a few benchmarks with the official Gatsby image benchmark repository on Netlify's build environment, both with and without the gatsby-parallel-runner, and was thrilled to see the gatsby-parallel-runner consistently outperform the normal gatsby build command:

Running 3 times from a clear cache with gatsby build:

Run 1:
11:10:27 PM: success Generating image thumbnails - 351.218s - 3234/3234 9.21/s

Run 2:
1:45:43 PM: success Generating image thumbnails - 384.171s - 3234/3234 8.42/s

Run 3:
11:18:22 PM: success Generating image thumbnails - 322.853s - 3234/3234 10.02/s

Avg time for image generation: 352.747s

Running 6 times from a clear cache with gatsby-parallel-runner:

Run 1:
10:51:31 PM: success Generating image thumbnails - 158.438s - 3234/3234 20.41/s

Run 2:
5:33:33 PM: success Generating image thumbnails - 68.016s - 3234/3234 47.55/s

Run 3:
3:03:48 AM: success Generating image thumbnails - 75.731s - 3234/3234 42.70/s

Run 4:
10:54:47 PM: success Generating image thumbnails - 64.478s - 3234/3234 50.16/s

Run 5:
10:58:31 PM: success Generating image thumbnails - 66.021s - 3234/3234 48.98/s

Run 6:
11:01:58 PM: success Generating image thumbnails - 71.416s - 3234/3234 45.28/s


Avg time for image generation: 84.017s

The first run after deploying the functions was a bit slower than the subsequent runs, as Google worked on scaling up the number of concurrent function executions, but even then it was still more than twice as fast in the worst case as the standard Gatsby build command.

And on average, even with the initial outlier included, the parallel runner gave roughly a 4.2x speedup over the single-threaded runtime.
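The speedup figures can be recomputed from the timings listed above. The short script below does just that; the variable names are mine, and the worst-case comparison deliberately pits the slowest parallel run against the fastest gatsby build run.

```javascript
// Image-generation times in seconds, copied from the runs listed above.
const gatsbyBuild = [351.218, 384.171, 322.853];
const parallelRunner = [158.438, 68.016, 75.731, 64.478, 66.021, 71.416];

const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Average speedup: mean gatsby build time over mean parallel runner time.
const averageSpeedup = mean(gatsbyBuild) / mean(parallelRunner);

// Most conservative comparison: fastest gatsby build run against the
// slowest parallel run (the first one after deploying the functions).
const conservativeSpeedup = Math.min(...gatsbyBuild) / Math.max(...parallelRunner);

console.log(averageSpeedup.toFixed(2)); // 4.20
console.log(conservativeSpeedup.toFixed(2)); // 2.04
```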

Out of curiosity, I repeated the same benchmark on Gatsby Cloud:

Run 1:
03:02:38 AM: success Generating image thumbnails - 98.472s - 3234/3234 32.84/s

Run 2:
07:34:53 AM: success Generating image thumbnails - 328.141s - 3234/3234 9.86/s

Run 3:
22:19:50 PM: success Generating image thumbnails - 85.101s - 3234/3234 38.00/s

Run 4:
22:34:22 PM: success Generating image thumbnails - 134.721s - 3234/3234 24.01/s

Run 5:
23:02:37 PM: success Generating image thumbnails - 82.822s - 3234/3234 39.05/s

Run 6:
23:07:31 PM: success Generating image thumbnails - 60.532s - 3234/3234 53.43/s


Avg time for image generation: 131.631s

These tests had a lot more variability in build times than my Netlify-based tests. While the average was more than twice as fast as the single-threaded build, the open source parallel runner performed significantly better in the tests I ran. So hopefully the Gatsby Cloud team can also benefit from looking into the source code behind this implementation.
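One way to put a number on that variability is the sample standard deviation of each set of runs. The sketch below computes it from the timings listed above; choosing the sample standard deviation as the measure is my assumption, not something the benchmark itself reports.

```javascript
// Image-generation times in seconds, copied from the runs listed above.
const parallelRunnerRuns = [158.438, 68.016, 75.731, 64.478, 66.021, 71.416];
const gatsbyCloudRuns = [98.472, 328.141, 85.101, 134.721, 82.822, 60.532];

const average = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Sample standard deviation (divides by n - 1 rather than n).
const sampleStdDev = (xs) => {
  const m = average(xs);
  const squaredError = xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
  return Math.sqrt(squaredError / (xs.length - 1));
};

console.log("parallel runner:", sampleStdDev(parallelRunnerRuns).toFixed(1), "s");
console.log("Gatsby Cloud:", sampleStdDev(gatsbyCloudRuns).toFixed(1), "s");
```

With only six runs each, and one slow outlier in each set, these figures are a rough indication of spread rather than a rigorous comparison.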

Setting Up

Install in your Gatsby project:

npm i gatsby-parallel-runner

To use with Google Cloud, set relevant env variables in your shell:

export GOOGLE_APPLICATION_CREDENTIALS=~/path/to/your/google-credentials.json

export TOPIC=parallel-runner-topic

Deploy the cloud function:

npx gatsby-parallel-runner deploy

Then run your Gatsby build with the parallel runner instead of the default gatsby build command.

npx gatsby-parallel-runner
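Since both commands depend on the environment variables set earlier, a small preflight check can catch a missing one before a build starts. This is an optional sketch rather than part of the plugin; the function names are hypothetical, and only the two variable names come from the steps above.

```javascript
// Variables named in the setup steps above.
const REQUIRED_ENV = ["GOOGLE_APPLICATION_CREDENTIALS", "TOPIC"];

// Return the names of required variables that are unset or empty.
function missingEnv(env) {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Return a human-readable problem description, or null when everything is set.
function describeMissingEnv(env) {
  const missing = missingEnv(env);
  if (missing.length === 0) return null;
  return `Missing environment variables: ${missing.join(", ")}`;
}
```

Calling describeMissingEnv(process.env) in a prebuild script and exiting with a non-zero code when it returns a message would stop the build early instead of failing partway through.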
