Richard Klose

Mirroring Releases from GitHub

This post was originally posted on my blog.

At any point in time, developers should be able to produce a reproducible, identical output from their source code. If your code relies on third-party packages, this can be tricky. One problem is that some packages download precompiled releases from the web, often from GitHub, to speed up and simplify package installation. However, if the repository (or any other download source) is removed, builds are no longer reproducible; they can't even be compiled anymore.

I'd like to share my approach to this issue, but first I'd like to explain the problem in a bit more detail. If you are not interested in the details, you can safely skip the next section. If you just came here for a working solution, check out my GitHub repository.

Contents:

  1. The problem in detail: How packages download binaries from GitHub
  2. Building the caching server: Download GitHub Releases to a local server
  3. Using the caching server: Setup your environment
  4. Optionally package everything into a docker container

In this post I will use electron and node-sass, both installable with npm, as examples. They are quite popular and are the packages that first caused this issue for me.

The problem in detail: How packages download binaries from GitHub

Compiling node-sass from source takes some time, in most cases even more than compiling or packaging the project that is using it. Because node-sass must be built individually for each platform and operating system, its install script is able to compile it from scratch, but by default it tries to download a precompiled binary instead. Let's take a deeper look at how node-sass is installed.

When running npm install node-sass, the install script from package.json is executed. (See https://docs.npmjs.com/misc/scripts for more info on npm scripts.)

{
  ...
  "scripts": {
    "install": "node scripts/install.js",
  },
  ...
}

calling the install script from package.json of node-sass

if (process.env.SKIP_SASS_BINARY_DOWNLOAD_FOR_CI) {
  console.log('Skipping downloading binaries on CI builds');
  return;
}
...
if (sass.hasBinary(binaryPath)) {
  console.log('node-sass build', 'Binary found at', binaryPath);
  return;
}
...
if (cachedBinary) {
  console.log('Cached binary found at', cachedBinary);
  ...
  return;
}

download(sass.getBinaryUrl(), binaryPath, function(err) { ... });
...

the first lines from scripts/install.js that will be executed

As you can see from the first lines of the install script, three checks are performed before the actual binary download starts.

  1. Should the download be skipped? If the environment variable SKIP_SASS_BINARY_DOWNLOAD_FOR_CI is set, the script exits immediately. This ensures that nothing is downloaded and the binary is compiled from scratch. (Keep in mind that this behavior might change in the future.)
  2. Is there already a binary at the compile destination? If the script finds one, it doesn't need to download or compile it again.
  3. Is there already a binary in one of the caches? I won't go into detail about these caches, but there might already be a binary that the script can take instead of downloading or compiling it again.

If all three checks fail, download(sass.getBinaryUrl(), ...) will be called. This is where things get interesting.

function getBinaryUrl() {
  var site = getArgument('--sass-binary-site') ||
             process.env.SASS_BINARY_SITE ||
             process.env.npm_config_sass_binary_site ||
             (pkg.nodeSassConfig && pkg.nodeSassConfig.binarySite) ||
             'https://github.com/sass/node-sass/releases/download';

  return [site, 'v' + pkg.version, getBinaryName()].join('/');
}

getBinaryUrl is the function that decides where the binaries will be downloaded from

getBinaryUrl() can take the download URL from several sources. They are checked in the following order:

  1. npm install --sass-binary-site=<url>
  2. The environment variable SASS_BINARY_SITE
  3. The variable sass_binary_site from .npmrc
  4. pkg.nodeSassConfig.binarySite, where pkg is the content of the package.json of node-sass
  5. The default URL: https://github.com/sass/node-sass/releases/download

As you can see, downloading from GitHub is the default way of getting the binary for node-sass; however, you can manually set another download source. This means we no longer have to depend on GitHub: we could point the download URL at any other server using one of those settings. But before we can do that, we need such a server, one that has all the files available at https://github.com/sass/node-sass/releases/download. Of course we could download them all manually, but it would be much nicer if the server kept itself up to date with all the files from GitHub automatically.
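
For example, here is a minimal sketch of what overriding the download source could look like, assuming a hypothetical mirror reachable at http://my-mirror.local (we will build exactly such a server below):

# 1. As a command line argument
npm install node-sass --sass-binary-site=http://my-mirror.local/sass/node-sass

# 2. As an environment variable
SASS_BINARY_SITE=http://my-mirror.local/sass/node-sass npm install node-sass

# 3. As a line in .npmrc
echo "sass_binary_site=http://my-mirror.local/sass/node-sass" >> .npmrc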

Building the caching server: Download GitHub Releases to a local server

To stop depending on GitHub during your own build steps, we can set up a local caching server. You will need the following tools:

  • An operating system that can run bash scripts. In this post, I'm using Debian, but any other Linux distribution should also work as well as macOS.
  • cron, for scheduling the mirror updates.
  • curl, for fetching the release metadata from the GitHub API.
  • jq, for parsing that metadata.
  • wget, for downloading the files. (We could also do this with curl, but wget can automatically create subfolders for us)
  • nginx, for delivering the downloaded files via http(s). (Of course, any other webserver will also work)

For the following steps, I assume that the commands are executed as root. If you can't or don't want to use root, use sudo where necessary.

Step 1: Install packages

Starting with a minimal Debian system, the required packages must be installed first:

apt-get install -y cron curl jq wget nginx

Step 2: Create a folder and the mirror script

First, we need a place where the downloaded artifacts will be stored. I will be using /mirror here. Create the directory with:

mkdir /mirror

The script will just be a simple bash script, so create a new file at /opt/mirror.sh and make it executable:

touch /opt/mirror.sh
chmod +x /opt/mirror.sh

Step 3: Setup the script

Before starting with the implementation of the actual logic, a bit of setup makes the script easier to use:

#!/usr/bin/env bash

REPO=${1}
MIRROR_DIR="/mirror"
SRC_URL="https://api.github.com/repos/${REPO}/releases"
DEST="${MIRROR_DIR}/${REPO}"

Some basic variables that make the script easier to maintain

  • With REPO=${1} the script can be called as /opt/mirror.sh <repo>, so we can use it with any GitHub repository we want (e.g. /opt/mirror.sh sass/node-sass or /opt/mirror.sh electron/electron).
  • If the mirror directory ever has to be changed, MIRROR_DIR="/mirror" lets us do this in one place, so we don't have to change every usage of the directory in the script.
  • The GitHub API has a releases endpoint for every repository, with the repository name in the URL, so we will just do a GET request against that endpoint in the next step.
  • Of course, we want to be able to use this script with multiple repositories, so every repository gets its own subdirectory in /mirror.

Step 4: Query the GitHub API and parse the data

To query the data from GitHub, we can simply use curl ${SRC_URL}. The GitHub API returns JSON data by default, which looks like this:

GET https://api.github.com/repos/sass/node-sass/releases

Response:
[
  {
    "url": "https://api.github.com/repos/sass/node-sass/releases/16997851",
    "html_url": "https://github.com/sass/node-sass/releases/tag/v4.12.0",
    "id": 16997851,
    "tag_name": "v4.12.0",
    "target_commitish": "master",
    "name": "v4.12.0",
    "draft": false,
    "author": { ... },
    "created_at": "2019-04-26T10:18:21Z",
    "assets": [
      {
        "url": "https://api.github.com/repos/sass/node-sass/releases/assets/12251871",
        "id": 12251871,
        "name": "win32-x64-11_binding.node",
        "browser_download_url": "https://github.com/sass/node-sass/releases/download/v4.12.0/win32-x64-11_binding.node",
        ...
      },
      {
        "url": "https://api.github.com/repos/sass/node-sass/releases/assets/12251872",
        "id": 12251872,
        "name": "win32-x64-11_binding.pdb",
        "browser_download_url": "https://github.com/sass/node-sass/releases/download/v4.12.0/win32-x64-11_binding.pdb"
      }
    ],
    ...
  },
  ...
]

Stripped down response from the GitHub API

As you can see, GitHub already gives us a lot of information about the releases of node-sass in this example. Basically it's an array of releases, where every release itself has an array of all the assets of that particular release and their browser_download_url. Those are exactly the files we want to download and keep in our release cache. Every release also has a tag_name, which in most cases is the version number of that release, coming from the git tag of the repository. We should also make sure that we only mirror actual releases and no drafts, so we have to look at the draft attribute. All the other information about the release date, the author, etc. might be interesting, but is not relevant for us now.

This means we could strip the JSON down like this:

[
  {
    "tag_name": "v4.12.0",
    "draft": false,
    "assets": [
      "https://github.com/sass/node-sass/releases/download/v4.12.0/win32-x64-11_binding.node",
      "https://github.com/sass/node-sass/releases/download/v4.12.0/win32-x64-11_binding.pdb"
    ]
  },
  ...
]

This can easily be done by passing the output to jq:

curl ${SRC_URL} | jq '[.[] | {tag_name: .tag_name, draft: .draft, assets: [.assets[].browser_download_url]}]'

Querying the data from GitHub and passing it to jq

Let me explain the jq filter in detail:

.[] | { ... }: This means that all objects from the array (.[]) are taken and passed (|) into a new template.

{tag_name: .tag_name, draft: .draft, assets: [.assets[].browser_download_url]}: The template has three attributes: tag_name, draft and assets. tag_name and draft are filled with the original values from the source data, while assets is defined as a new array containing all values of browser_download_url from every object in assets of the source data.
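
If you want to try the filter in isolation, you can feed it a small hand-written document instead of the live API response (the data below is made up for illustration):

echo '[{"tag_name":"v1.0.0","draft":false,"assets":[{"browser_download_url":"https://example.com/a.zip"}]}]' \
  | jq '[.[] | {tag_name: .tag_name, draft: .draft, assets: [.assets[].browser_download_url]}]'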

For better handling, I'll save that in a new variable.

RELEASES=$(curl ${SRC_URL} | jq '[.[] | {tag_name: .tag_name, draft: .draft, assets: [.assets[].browser_download_url]}]')

Step 5: Preparing the download for each release

Now that we have a JSON array containing only the data we need, we can iterate over the objects and prepare the download of the assets using a for loop:

for RELEASE in $(echo ${RELEASES} | jq -r '.[] | @base64'); do
  DRAFT=$(echo ${RELEASE} | base64 --decode | jq -r '.draft')
  NAME=$(echo ${RELEASE} | base64 --decode | jq -r '.tag_name')
  URLS=$(echo ${RELEASE} | base64 --decode | jq -r '.assets | .[]')
  ...
done

In this loop, we create three variables out of the JSON data:

  • DRAFT: contains the value of draft from the JSON data, which will be true or false.
  • NAME: contains the value of tag_name from the JSON data. We will use this for creating a subfolder for this particular release later. In most cases, this is the version number of the release.
  • URLS: contains all asset URLs as a space separated string. Wget can handle this easily.

Ruben Koster has written an excellent blog post explaining why we have to base64-encode and decode the JSON in bash loops.
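
In short, @base64 turns every array element into a single line without spaces, so the word splitting in the bash for loop cannot break an element apart; inside the loop the element is decoded back into JSON before jq extracts fields from it. A tiny standalone sketch with made-up data:

RELEASES='[{"tag_name":"v1.0.0","draft":false},{"tag_name":"v1.1.0","draft":true}]'

for RELEASE in $(echo ${RELEASES} | jq -r '.[] | @base64'); do
  # decode the single element back to JSON before extracting fields
  echo ${RELEASE} | base64 --decode | jq -r '.tag_name'
done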

Before we can start the download, we have to do two checks. First, we should make sure that we skip drafts:

if [["${DRAFT}" != "false"]]; then
  continue
fi

After that, we should check whether the assets have already been downloaded.

RELEASE_DEST=${DEST}/${NAME}
if [[ -d ${RELEASE_DEST} ]]; then
  continue
fi

Step 6: Download all files

Now we can easily download all assets from that particular release.

mkdir -p ${RELEASE_DEST}
wget -P ${RELEASE_DEST} --no-verbose ${URLS}

Before the download starts, we make sure that the destination directory exists. Also, wget is very chatty; by using --no-verbose we can suppress most of its output so the logs are not polluted with unnecessary download progress messages.

Script completed!

#!/usr/bin/env bash

REPO=${1}
MIRROR_DIR="/mirror"
SRC_URL="https://api.github.com/repos/${REPO}/releases"
DEST="${MIRROR_DIR}/${REPO}"

RELEASES=$(curl ${SRC_URL} | jq '[.[] | {tag_name: .tag_name, draft: .draft, assets: [.assets[].browser_download_url]}]')

for RELEASE in $(echo ${RELEASES} | jq -r '.[] | @base64'); do
  DRAFT=$(echo ${RELEASE} | base64 --decode | jq -r '.draft')
  NAME=$(echo ${RELEASE} | base64 --decode | jq -r '.tag_name')
  URLS=$(echo ${RELEASE} | base64 --decode | jq -r '.assets | .[]')

  if [["${DRAFT}" != "false"]]; then
    continue
  fi

  RELEASE_DEST=${DEST}/${NAME}
  if [[ -d ${RELEASE_DEST} ]]; then
    continue
  fi

  mkdir -p ${RELEASE_DEST}
  wget -P ${RELEASE_DEST} --no-verbose ${URLS}

done

The complete mirror script

Step 7: Automate the cache update

With cron, we can run the script regularly, e.g. daily, to update our cache. For every repository that should be mirrored, a new line must be added to the crontab (crontab -e).

0 4 * * * /opt/mirror.sh sass/node-sass
0 5 * * * /opt/mirror.sh electron/electron

Updating the node-sass and electron caches overnight (at 4:00 AM and 5:00 AM)

You should run the script at least once by hand to initialize the cache and to verify that it works.
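
For example, a first manual run for node-sass would look like this (the same call that cron makes later); afterwards every release tag should have its own subfolder below /mirror:

/opt/mirror.sh sass/node-sass
ls /mirror/sass/node-sass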

Step 8: A simple static webserver for http access to the cache

The last thing we need is a webserver that delivers the downloaded files via HTTP. (Hint: Although I'm not covering HTTPS setup in this post, you should consider an HTTPS configuration for your webserver.)

nginx gives us a simple, lightweight webserver. Place the following configuration at /etc/nginx/sites-enabled/mirror.conf:

server {
    listen 80 default_server;
    listen [::]:80 default_server;
    root /mirror;

    location / {
        autoindex on;
    }
}

After creating the file, reload nginx (e.g. with systemd on Debian: systemctl reload nginx).

To verify that the webserver works, open the server's IP address in your favorite browser.
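
Alternatively, a quick check with curl should return the nginx autoindex listing of a mirrored repository:

curl http://<ip-of-your-server>/sass/node-sass/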

Using the caching server: Setup your environment

Now you can use the mirror in your projects. As you might remember from the beginning of this post, there are several ways (for node-sass, at least) to define the download source. I personally prefer .npmrc, but the other options work as well.

If you do not have a .npmrc yet, create an empty file with that name. For node-sass and electron, add:

sass_binary_site=http://<ip-of-your-server>/sass/node-sass
electron_mirror=http://<ip-of-your-server>/electron/electron/v

Both packages now look for the desired version in folders at that location. We created these folders with our script earlier (the $RELEASE_DEST). Note that I've appended a v to the electron URL. This is because the git tags we used as subfolder names always start with v (e.g. v2.0.0), but the electron install script uses the version number without the leading v when building the download URL.
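
To double-check the path handling, you can request the files the install scripts will ask for directly from the mirror; version and file names below are illustrative, but both URLs resolve to the v<version> folders our script created:

# node-sass requests <sass_binary_site>/v<version>/<binary name>
curl -I http://<ip-of-your-server>/sass/node-sass/v4.12.0/linux-x64-64_binding.node

# electron requests <electron_mirror><version>/<asset name>
curl -I http://<ip-of-your-server>/electron/electron/v2.0.0/electron-v2.0.0-linux-x64.zip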

Optionally package everything into a docker container

I've published a fully working version of the script as a Docker container on GitHub and Docker Hub.

At Auerswald, we are using a Docker setup very similar to this one, with a few company-specific modifications. This way, we can easily maintain and update our mirrors.
