Anton Yakutovich

Posted on Aug 4, 2021

GitLab CI: Cache and Artifacts explained by example

#devops #node #cicd #gitlab

Hi, DEV Community! I've been working in the software testing field for more than eight years. Apart from web services testing, I maintain CI/CD Pipelines in our team's GitLab.

Let's discuss the difference between GitLab cache and artifacts. I'll show how to configure a Pipeline for a Node.js app in a pragmatic way to achieve good performance and resource utilization.

There are three things you can watch forever: fire burning, water falling, and the build passing after your next commit. Nobody wants to wait long for CI to complete, so it's worth applying every tweak that shortens the wait between the commit and the build status. Cache and artifacts to the rescue! They can drastically reduce the time it takes to run a Pipeline.

People often get confused when they have to choose between cache and artifacts. GitLab has good documentation, but the "Node.js app with cache" example and the Pipeline template for Node.js contradict each other.

Let's see what a Pipeline means in GitLab terms. A Pipeline is a set of stages, and each stage can have one or more jobs. Jobs run on a distributed farm of runners. When we start a Pipeline, a random runner with free resources executes the needed job. GitLab Runner is the agent that runs jobs. For simplicity, let's assume all runners use the Docker executor.

Each job starts with a clean slate and doesn't know the results of the previous one. If you don't use cache and artifacts, the runner will have to go to the internet or local registry and download the necessary packages when installing project dependencies.

What is cache?

It's a set of files that a job can download before running and upload after execution. By default, the cache is stored on the machine where GitLab Runner is installed. If a distributed cache is configured, object storage such as S3 is used instead.

Let's suppose you run a Pipeline for the first time with a local cache. The first job will not find a cache but will upload one to runner01 after execution. The second job runs on runner02; it won't find a cache there either and will work without it. Its result will be saved to runner02. Lint, the third job, runs on runner01, finds the cache there, and uses it (pull). After execution, it uploads the cache back (push).
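The scenario above can be sketched as a config. This is only an illustration under assumptions, not part of the GitLab template: the stage names and the "test" and "lint" npm scripts are hypothetical. No cache policy is set, so every job uses the default pull-push behaviour.

```yaml
image: node:16.3.0

stages:
  - build
  - test
  - lint

# Global cache: restored before each job's script, saved after it.
# Without a distributed cache it lives on the runner that ran the job.
cache:
  paths:
    - node_modules/

build:
  stage: build
  script:
    - npm install

test:
  stage: test
  script:
    - npm install
    - npm test        # assumes a "test" script in package.json

lint:
  stage: lint
  script:
    - npm install
    - npm run lint    # assumes a "lint" script in package.json
```

Whether a job actually finds the cache depends on which runner picks it up, which is exactly the runner01/runner02 situation described above.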

What are artifacts?

Artifacts are files stored on the GitLab server after a job is executed. Subsequent jobs will download the artifact before script execution.

Build job creates a DEF artifact and saves it on the server. The second job, Test, downloads the artifact from the server before running the commands. The third job, Lint, similarly downloads the artifact from the server.
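A minimal sketch of the same flow, assuming hypothetical stage names and a hypothetical dist/ directory (the scripts are plain shell commands so the example does not depend on any project files):

```yaml
stages:
  - build
  - test
  - lint

build:
  stage: build
  script:
    - mkdir -p dist
    - echo "built" > dist/output.txt
  artifacts:
    paths:
      - dist/               # uploaded to the GitLab server when the job succeeds

test:
  stage: test
  script:
    - cat dist/output.txt   # dist/ was downloaded before the script started

lint:
  stage: lint
  script:
    - test -f dist/output.txt
```

Jobs in later stages download the artifacts of earlier stages by default, so test and lint need no extra configuration here.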

To compare: the artifact is created in the first job and used in the following ones, while the cache is created within each job.

Consider the CI template example for Node.js recommended by GitLab:



image: node:latest # (1)

# This folder is cached between builds
cache:
  paths:
    - node_modules/ # (2)

test_async:
  script:
    - npm install # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  script:
    - npm install # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js

Line #1 specifies the Docker image that will be used in all jobs. The first problem is the latest tag, which ruins the reproducibility of builds because it always points to the latest release of Node.js. If the GitLab runner caches Docker images, the first run downloads the image and all subsequent runs use the locally available copy. So even if Node.js is upgraded from version XX to YY, our Pipeline will know nothing about it. Therefore, I suggest specifying the version of the image: not just the release branch (node:14), but the full version tag (node:14.17.0).

Line #2 is related to lines #3 and #4. The node_modules directory is specified for caching, and package installation (npm install) is performed in every job. The installation should be faster because packages are already available inside node_modules. Since no key is specified for the cache, the word default is used as the key. This means the cache is permanent and shared between all git branches.
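To make the implicit key visible, here is a sketch of what the template's cache section amounts to. The commented-out line shows one possible alternative, a per-branch key built from the predefined CI_COMMIT_REF_SLUG variable; it is shown only for contrast and is not what this article ends up using.

```yaml
cache:
  key: default                   # implicit value when no key is given
  # key: "$CI_COMMIT_REF_SLUG"   # alternative: a separate cache per branch
  paths:
    - node_modules/
```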

Let me remind you, the main goal is to keep the pipeline reproducible. The Pipeline launched today should work the same way in a year.

npm describes dependencies in two files: package.json and package-lock.json. If you rely on package.json alone, the build is not reproducible: when you run npm install, the package manager installs the latest release allowed by each non-strict version range. To pin the dependency tree, we use the package-lock.json file, where all package versions are strictly specified.

But there is another problem: npm install may rewrite package-lock.json, and this is not what we expect. Therefore, we use the special command npm ci, which:

removes the node_modules directory;
installs packages from package-lock.json.

What shall we do if node_modules is deleted every time? Instead of caching node_modules, we can cache npm's own download cache and point it to a directory inside the project with the environment variable npm_config_cache.

And the last thing: the config does not explicitly specify the stage where jobs are executed. By default, a job runs in the test stage, so both jobs will run in parallel. Perfect! Let's add job stages and fix all the issues we found.

What we got after the first iteration:



image: node:16.3.0 # (1)

stages:
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# This folder is cached between builds
cache:
  key:
    files:
      - package-lock.json # (6)
  paths:
    - .npm # (2)

test_async:
  stage: test
  script:
    - npm ci # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - npm ci # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js

We improved the Pipeline and made it reproducible. Two drawbacks are left. First, the cache is shared: every job pulls the cache and pushes a new version after it finishes. It's good practice to update the cache only once per Pipeline. Second, every job installs the package dependencies and wastes time.

To fix the first problem, we describe cache management explicitly. Let's add a "hidden" job that enables only the pull policy (download the cache without updating it):



# Define a hidden job to be used with extends
# Better than default to avoid activating cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull

To connect the cache, a job inherits the hidden job via the extends keyword.



...
extends: .dependencies_cache
...
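For a concrete picture, here is a sketch of a job that reuses the hidden job. The lint job and its "lint" npm script are hypothetical and only serve as an illustration; the fragment assumes the .dependencies_cache definition above is present in the same config.

```yaml
# Hypothetical job: inherits key, paths and the pull policy
# from .dependencies_cache, so it reads the cache but never updates it.
lint:
  stage: test
  extends: .dependencies_cache
  script:
    - npm ci
    - npm run lint   # assumes a "lint" script in package.json
```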

To fix the second issue, we use artifacts. Let's create a job that installs the package dependencies and passes node_modules on as an artifact. Subsequent jobs can then run the tests right away.



setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules

We install the npm dependencies and use the cache described in the hidden .dependencies_cache job. Then we override the policy with pull-push so that this job also updates the cache. A short lifetime (1 hour) saves artifact storage space: there is no need to keep the node_modules artifact on the GitLab server for long.

The full config after the changes:



image: node:16.3.0 # (1)

stages:
  - setup
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# Define a hidden job to be used with extends
# Better than default to avoid activating cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull

setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules

test_async:
  stage: test
  script:
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - node ./specs/start.js ./specs/db-postgres.spec.js
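The test jobs need no cache or artifacts section of their own: by default, a job downloads the artifacts of all jobs from earlier stages, so node_modules from setup is already in place. If the Pipeline grows and other jobs start producing artifacts, the download can be narrowed with the dependencies keyword. A sketch for one of the test jobs, as an optional variation rather than a required change:

```yaml
test_async:
  stage: test
  dependencies:
    - setup   # download only the artifacts of the setup job
  script:
    - node ./specs/start.js ./specs/async.spec.js
```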

We learned the difference between cache and artifacts. We built a reproducible Pipeline that works predictably and uses resources efficiently. This article shows some common mistakes and how to avoid them when you are setting up CI in GitLab.
I wish you green builds and fast pipelines. I would appreciate your feedback in the comments!

Top comments (15)

Michiel Hendriks • Aug 4 '21

Instead of caching node_modules, consider caching node's caching directory instead.
The difference is caching downloaded tar.gz files instead of thousands of small files. Despite gitlab's efforts, their caching mechanism sucks big time for a large amount of small files.

Artiom Neganov • Sep 12 '23

Sorry, what do you mean under "node's caching directory"? Which one is that?
And what tar.gz files do you mean?

Anders Ramsay • Dec 8 '21

This is pure CI gold. Thank you!

Rick Stoopman • Jan 15 '22

Why do you create a hidden job while you only extend it in 1 job? This could all be included in the setup job right? And right now the cache.policy is always overwritten to pull-push. Or am I missing something?

Weam Adel • Jul 1 '23

Thank you so much for your effort, but I still didn't get why we need to add artifacts. You described the problem artifacts solves like so:

Second, every job installs the package dependencies and wastes time.

Isn't this why we use cache at the first place? to not install packages again? We already had added the cache at this point, so why do we need to add artifacts, too?

Benoit COUETIL 💫 • Aug 7 '21

node_modules can be huge in real world, and then unsuitable for artifacts which are limited in size. Worth knowing, it is also uploaded to central Gitlab, which can be a bottleneck for a large Gitlab instance with lots of runners uploading to it.

Other than that thank you, I learned that npm ci is slow due to node_modules deletion 🙏

Anton Yakutovich • Aug 7 '21

If you compare the time on the clean system, I bet npm ci would be faster than npm install. Cause it just downloads full tree of dependencies from package-lock.json. npm install will check which deps can be updated and build new dependency tree.

Agata Zurek • Feb 8 '22

Yes, this! My project's node_modules is 2GB and is too big for artifacts. What is the recommended solution to deal with that? I've had to include npm ci on every step to get my pipeline to work at all.

Benoit COUETIL 💫 • Feb 8 '22 • Edited

You should use cache. This is why cache exists, and can be shared even across pipelines.

But cache has to be configured on your runners, or you will experience missing cache each time your jobs switch runners (which should not be a problem, npm will handle it)

Lumin • Jan 20 '23

Should we add dependencies in other jobs too?

madhead • Aug 5 '21

We need an article about GitHub Actions!

dcg90 • Oct 19 '21

Thanks! Just a doubt, don't you need to specify the cache location to the npm ci command? Something like npm ci --cache ${npm_config_cache} --prefer-offline ?

Anton Yakutovich • Oct 25 '21

The variables section has npm_config_cache which will be used by npm automatically.

Minaro • Mar 11 '22

Please let us know what's the (5)

Josh Martens • Jun 2 '23

It looks like that is just for describing the "lines of the code" that are being talked about instead of using actual "line numbers" (since they aren't visible)
