Hi, DEV Community! I've been working in the software testing field for more than eight years. Apart from web services testing, I maintain CI/CD Pipelines in our team's GitLab.
Let's discuss the difference between GitLab cache and artifacts. I'll show how to configure a Pipeline for a Node.js app in a pragmatic way to achieve good performance and resource utilization.
There are three things you can watch forever: fire burning, water falling, and the build passing after your next commit. Nobody wants to wait too long for CI to complete, so it's better to set up all the tweaks that shorten the gap between the commit and the build status. Cache and artifacts to the rescue! They can drastically reduce the time it takes to run a Pipeline.
People get confused when they have to choose between cache and artifacts. GitLab has solid documentation, but the Node.js caching example and the Pipeline template for Node.js contradict each other.
Let's see what a Pipeline means in GitLab terms. A Pipeline is a set of stages, and each stage can have one or more jobs. Jobs run on a distributed farm of runners. When we start a Pipeline, a random runner with free resources executes the needed job. The GitLab Runner is the agent that runs jobs. For simplicity, let's assume Docker is the executor for all runners.
Each job starts with a clean slate and knows nothing about the results of the previous one. If you don't use cache and artifacts, the runner has to go to the internet or a local registry and download the necessary packages every time it installs project dependencies.
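As a minimal sketch of that structure (the stage and job names here are invented for the illustration), a two-stage Pipeline is described like this in `.gitlab-ci.yml`:

```yaml
stages:
  - build
  - test

build_app:      # runs in the build stage
  stage: build
  script:
    - echo "building"

unit_tests:     # these two jobs run in parallel, each on its own runner
  stage: test
  script:
    - echo "running unit tests"

lint:
  stage: test
  script:
    - echo "linting"
```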
A cache is a set of files that a job can download before running and upload after execution. By default, the cache is stored on the same machine where the GitLab Runner is installed. If a distributed cache is configured, S3 works as the storage.
Suppose you run a Pipeline for the first time with a local cache. The first job, Build, will not find a cache but will upload one to runner01 after execution. The second job, Test, executes on runner02; it won't find a cache there either and will work without it, saving its result to runner02. The third job, Lint, will find the cache on runner01 and use it (pull). After execution, it will upload the updated cache back (push).
Artifacts are files stored on the GitLab server after a job is executed. Subsequent jobs will download the artifact before script execution.
The Build job creates an artifact and saves it on the server. The second job, Test, downloads the artifact from the server before running its commands. The third job, Lint, downloads the artifact the same way.
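The Build/Test/Lint flow above could be sketched like this (the job names and the `dist/` path are illustrative):

```yaml
stages:
  - build
  - test

build:
  stage: build
  script:
    - npm run build
  artifacts:
    paths:
      - dist/   # uploaded to the GitLab server after the job finishes

test:
  stage: test
  script:
    - ls dist/  # the artifact is downloaded automatically before the script runs

lint:
  stage: test
  script:
    - ls dist/
```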
To compare: an artifact is created in the first job and used in the following ones, while a cache is created within each job.
Consider the CI template example for Node.js recommended by GitLab:
```yaml
image: node:latest # (1)

# This folder is cached between builds
cache:
  paths:
    - node_modules/ # (2)

test_async:
  script:
    - npm install # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  script:
    - npm install # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js
```
Line #1 specifies the Docker image, which will be used in all jobs. The first problem is the `latest` tag. It ruins the reproducibility of builds: it always points to the latest release of Node.js. If the GitLab runner caches Docker images, the first run will download the image and all subsequent runs will use the locally available copy. So even if Node.js is upgraded from version XX to YY, our Pipeline will know nothing about it. Therefore, I suggest pinning the version of the image: not just the release branch (`node:14`), but the full version tag (e.g. `node:16.3.0`).
Line #2 is related to lines #3 and #4. The `node_modules` directory is cached, and the installation of packages (`npm install`) is performed in every job. The installation should be faster because the packages are already available inside `node_modules`. Since no key is specified for the cache, the word `default` is used as the key. That means the cache is permanent and shared between all git branches.
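If a permanent shared cache is not what you want, GitLab lets you scope the cache with a key; for example, the predefined `CI_COMMIT_REF_SLUG` variable gives one cache per branch:

```yaml
cache:
  key: $CI_COMMIT_REF_SLUG   # a separate cache for every git branch
  paths:
    - node_modules/
```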
Let me remind you: the main goal is to keep the Pipeline reproducible. A Pipeline launched today should work the same way in a year.
NPM records dependencies in two files: package.json and package-lock.json. If you rely on package.json alone, the build is not reproducible: when you run `npm install`, the package manager installs the latest minor release of every loosely pinned dependency. To pin the dependency tree, we use the package-lock.json file, where all package versions are specified exactly.
But there is another problem: `npm install` rewrites package-lock.json, and this is not what we expect. Therefore, we use the special command `npm ci`, which:

- removes the node_modules directory;
- installs packages exactly as pinned in package-lock.json.
What shall we do if `node_modules` is deleted every time? We can cache NPM's own package cache instead, pointing it inside the project with the `npm_config_cache` environment variable.
And the last thing: the config does not explicitly specify the stage in which the jobs are executed. By default, a job runs in the test stage, so both jobs here run in parallel. Perfect! Let's add stages to the jobs and fix all the issues we found.
What we got after the first iteration:
```yaml
image: node:16.3.0 # (1)

stages:
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# This folder is cached between builds
cache:
  key:
    files:
      - package-lock.json # (6)
  paths:
    - .npm # (2)

test_async:
  stage: test
  script:
    - npm ci # (3)
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - npm ci # (4)
    - node ./specs/start.js ./specs/db-postgres.spec.js
```
We improved the Pipeline and made it reproducible. Two drawbacks are left. First, the cache is shared: every job pulls the cache and pushes a new version after execution, while it's good practice to update the cache only once per Pipeline. Second, every job installs the package dependencies and wastes time doing so.
To fix the first problem, we describe the cache management explicitly. Let's add a "hidden" job and enable the pull-only policy (download the cache without updating it):
```yaml
# Define a hidden job to be used with extends
# Better than default to avoid activating the cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull
```
To connect the cache, a job needs to inherit the hidden job via `extends`:
```yaml
...
extends: .dependencies_cache
...
```
To fix the second issue, we use artifacts. Let's create a job that installs the package dependencies and passes them further as a `node_modules` artifact. Subsequent jobs can then run their tests right away.
```yaml
setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules
```
We install the npm dependencies and use the cache described in the hidden dependencies_cache job, overriding its policy to pull-push so that only this job updates the cache. A short lifetime (1 hour) saves artifact storage space: there is no need to keep the `node_modules` artifact on the GitLab server for long.
The full config after the changes:
```yaml
image: node:16.3.0 # (1)

stages:
  - setup
  - test

variables:
  npm_config_cache: "$CI_PROJECT_DIR/.npm" # (5)

# Define a hidden job to be used with extends
# Better than default to avoid activating the cache for all jobs
.dependencies_cache:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - .npm
    policy: pull

setup:
  stage: setup
  script:
    - npm ci
  extends: .dependencies_cache
  cache:
    policy: pull-push
  artifacts:
    expire_in: 1h
    paths:
      - node_modules

test_async:
  stage: test
  script:
    - node ./specs/start.js ./specs/async.spec.js

test_db:
  stage: test
  script:
    - node ./specs/start.js ./specs/db-postgres.spec.js
```
We learned the difference between cache and artifacts and built a reproducible Pipeline that works predictably and uses resources efficiently. This article showed some common mistakes and how to avoid them when setting up CI in GitLab.
I wish you green builds and fast pipelines. Would appreciate your feedback in the comments!