Benoit COUETIL 💫 for Zenika

Posted on Jun 25

🦊 GitLab CI Job Logs: The Art of Self-Documenting Pipelines

#gitlab #devops #cicd #tutorial

Initial thoughts
Prerequisite: clear stage and job names
1. Enable GitLab's built-in logging features
2. Choose the right CI log level — and stick to it
3. Bring color to your logs
4. Structure your output with collapsible sections
5. Synchronize and validate before moving on
6. Use warning jobs as early signals
7. Make warnings and errors actionable
8. Log the variable, path, and reason behind every decision
9. Know when NOT to log
Wrapping up
Further reading

The best CI/CD engineer is one who becomes unnecessary. If developers constantly ping us to understand why a pipeline failed, our pipelines are not doing their job. With the right logging strategy, every failure becomes a self-service debugging experience — and we get to work on actually interesting problems.

Initial thoughts

We have all been there: a developer opens a ticket saying "the pipeline is broken", and after 15 minutes of scrolling through a wall of undifferentiated text, we find a single failing test buried at line 847. Once pipelines have been stabilised, the culprit is the developer's code 99% of the time. The rest is transient failures or infrastructure problems.

The goal is simple: a developer should be able to diagnose and fix 90% of pipeline failures without ever pinging the CI/CD team. This is not about writing less code or cutting corners. It is about treating CI/CD logs as a user interface — one that developers interact with daily.

After years of maintaining pipelines across dozens of projects and as discussed in GitLab CI: 10+ Best Practices to Avoid Widespread Anti-Patterns, we have distilled a collection of practices that consistently make the difference between "I need help" and "I already fixed it".

Prerequisite: clear stage and job names

Self-documenting logs make even more sense on top of a pipeline that is already readable at a glance. Stage and job names must be clear, specific, and kept to a just-right count — enough to tell the story of what CI/CD is doing when we watch it run, not so many that the pipeline graph becomes noise.

On a typical three-tier app — frontend, BFF, and API — deployed to Kubernetes, an MR pipeline already reads like a table of contents:

Four stages, eleven jobs — a developer glancing at the pipeline UI knows exactly where they are: package, test, deploy to K8s, validate. The logging practices below assume this foundation is in place.

1. Enable GitLab's built-in logging features

Before writing a single echo, GitLab already offers several feature flags and variables that dramatically improve log readability. These quick wins cost nothing — just a few lines in our top-level variables: block:

variables:
  # prepend ISO 8601 timestamps to every line
  FF_TIMESTAMPS: "true"
  # auto-wrap each script command in a collapsible section
  FF_SCRIPT_SECTIONS: "true"
  # smoother real-time log streaming
  FF_USE_DYNAMIC_TRACE_FORCE_SEND_INTERVAL: "true"
  # convention honored by npm, jest, webpack, dotnet...
  FORCE_COLOR: 1

FF_TIMESTAMPS gives us millisecond-precision timing on every line — invaluable for spotting slow steps without manually adding timers. FF_SCRIPT_SECTIONS automatically wraps each script line in a collapsible section, giving structure for free. And FORCE_COLOR: 1 is a widely adopted convention that tells tools to output color even when they detect a non-TTY environment — without it, most CI output falls back to monochrome.

Timestamps, color, and structure — before we even start customizing. Welcome to the "free improvements" club.

2. Choose the right CI log level — and stick to it

The most common mistake is binary logging: either we print everything, or we print nothing. Both are equally useless.

A CI job should follow the same log level discipline as any application:

Level	When to use	Example
TRACE	Debug data, usually hidden	`Variable expansion: $DEPLOY_ENV → staging`
INFO	Key steps the developer needs to follow	`Deploying service user-api to staging...`
WARNING	Something unexpected but non-blocking	`Cache miss for node_modules, full install triggered`
ERROR	Something failed that needs attention	`Deployment failed: health check timeout after 120s`

In a real deploy job, the four levels read like a 10-second story (more on how to do this later):

[TRACE] Variable expansion: $DEPLOY_ENV → staging
[INFO] Deploying service user-api to staging...
[WARNING] Cache miss for node_modules, full install triggered
[ERROR] Deployment failed: health check timeout after 120s

The golden rule: a successful pipeline should produce a few lines of INFO, not 500. If we need more detail, we wrap verbose output in collapsible sections (see section 4).

The key is a centralized log function shared across all jobs. Write it once in a sourced helper script (source ci/utils.sh), and every job gets consistent formatting for free:

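# Print "[LEVEL] message" in the level's color; requires bash 4+ for associative arrays
log() {
  local level="${1:-INFO}"; local message="${2:-}"
  declare -A color_map   # declared inside the function, so local to it
  color_map["TRACE"]="\033[38;5;250m"     # gray
  color_map["INFO"]="\033[1;94m"          # bright blue
  color_map["WARNING"]="\033[1;38;5;214m" # bold orange
  color_map["ERROR"]="\033[48;5;1m"       # background red
  # unknown levels fall back to no color instead of failing under `set -u`
  printf "${color_map[$level]:-}[%s] %s\033[0m\n" "$level" "$message"
}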

No need to repeat color codes in every .gitlab-ci.yml. A simple source ci/utils.sh in before_script and the entire team speaks the same logging language.
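
The "usually hidden" part of TRACE can be sketched as a threshold check in front of the log call. This is only an illustration under stated assumptions: CI_LOG_LEVEL, level_rank, should_log and log_filtered are hypothetical names, not GitLab built-ins, and the one-line log stand-in is only defined when the real helper has not been sourced.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Hypothetical threshold variable; defaults to INFO so TRACE stays hidden.
CI_LOG_LEVEL="${CI_LOG_LEVEL:-INFO}"

# Map a level name to a numeric rank; unknown names rank as INFO.
level_rank() {
  case "$1" in
    TRACE) echo 0 ;;
    INFO) echo 1 ;;
    WARNING) echo 2 ;;
    ERROR) echo 3 ;;
    *) echo 1 ;;
  esac
}

# Succeed when a message of the given level reaches the threshold.
should_log() {
  [ "$(level_rank "$1")" -ge "$(level_rank "$CI_LOG_LEVEL")" ]
}

# Print the message through log only when its level passes the threshold.
log_filtered() {
  local level="${1:-INFO}" message="${2:-}"
  should_log "$level" || return 0
  log "$level" "$message"
}
```

Setting CI_LOG_LEVEL to TRACE on a manually triggered pipeline would then bring the debug lines back without touching any script.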

Too much logging is noise. Too little is silence. The sweet spot is a concise narrative that a developer can read in 10 seconds and understand exactly what happened — like a well-written commit message, but for runtime.

3. Bring color to your logs

Colors used by GitLab Runner — avoid these for custom messages, as they blend with the runner's own output:

Color	Shell code	Hex example	Description
Bold Green	`\033[1;32m`	`#5cf759`	Runner command echoes
Bold Cyan	`\033[1;36m`	`#00bdbd`	Runner generic messages
Bold White	`\033[1;37m`	`#ffffff`	Runner neutral outputs
Bold Red	`\033[1;31m`	`#ff6161`	Runner error messages

In a real job log, the runner's four colors look like this:

Using docker executor with image node:20-alpine
$ docker compose build --parallel
$ docker compose up -d
  #1 [web] building with "buildx"
  #2 [api] building with "buildx"
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

Suggested colors for custom log messages:

Color	Shell code	Hex example	Description
Gray	`\033[38;5;250m`	`#bcbcbc`	TRACE — verbose/debug output, present but discreet
Bold Bright Blue	`\033[1;94m`	`#5797ff`	INFO — normal flow narrative and informational steps
Bold Orange	`\033[1;38;5;214m`	`#ffaf00`	WARNING — alternative for bolder warnings
Background Red	`\033[48;5;1m`	`#ff6161`	ERROR — stands out from GitLab's own bold red errors
Bold Bright Magenta	`\033[1;95m`	`#8e44ad`	Situational
Bold Bright Yellow	`\033[1;93m`	`#f4d03f`	Situational

A monochrome wall of text is hostile to human eyes. Color creates a visual hierarchy that lets developers spot errors in seconds instead of reading every line.

GitLab job logs support ANSI escape codes natively. Combined with FORCE_COLOR: 1 from section 1, most tools will also output color automatically. For our custom messages, the centralized log function from section 2 already handles this.
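
Before settling on a palette, it can help to see it in the real log viewer. The sketch below prints one sample line per suggested level color; preview_palette is a hypothetical helper meant for a throwaway job, and the escape codes are the ones from the table above.

```shell
# Print one sample line per suggested color so the palette can be judged
# in the actual GitLab job log. preview_palette is a hypothetical helper.
preview_palette() {
  local reset='\033[0m'
  local entry name code
  for entry in \
    'TRACE|\033[38;5;250m' \
    'INFO|\033[1;94m' \
    'WARNING|\033[1;38;5;214m' \
    'ERROR|\033[48;5;1m'; do
    name="${entry%%|*}"   # text before the first "|"
    code="${entry#*|}"    # text after the first "|"
    # %b expands the backslash escapes stored in the arguments
    printf '%b[%s] sample message%b\n' "$code" "$name" "$reset"
  done
}
```

Calling preview_palette once in a scratch pipeline is enough to check contrast against the runner's own colors.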

Why background red instead of bold red for errors? Because when a job fails, GitLab's runner already prints its own error in plain bold red — and those messages are almost never helpful:

  #3 [api 3/4] RUN go build -o /app ./cmd/server
  #3 ERROR: process "go build" did not complete successfully
  > [api 3/4] RUN go build -o /app ./cmd/server:
  cmd/server/main.go:42:18: undefined: handlers.NewRouter
[10:15:22] Build failed for service 'api', please fix above error(s)
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

By using background red for our own errors, the developer can instantly distinguish our actionable messages from the runner's generic noise. The white-on-red line is our message — it says what failed and what to fix. Everything else is the runner repeating "something threw an exception somewhere". Thanks, runner. Very helpful.

Consistency matters more than aesthetics. Once developers learn that background red means "read this first" and the runner's plain red means "ignore unless desperate", they navigate logs instinctively.

4. Structure your output with collapsible sections

GitLab supports native collapsible sections in job logs. This is our most powerful weapon against log noise. A deployment job with 800 lines of raw output becomes this scannable overview:

> Installing dependencies
▼ Database migration
  Applying migration '20240115_AddUserPreferences'...
  Applying migration '20240122_IndexOptimization'...
  [...]
  Done. 250 migrations applied.
▼ Deploying to staging
  Syncing build artifacts to server...
  [...]
  Restarting application pool...
[INFO] Health check passed

The > indicates a collapsed section — the developer can click to expand it if needed. The ▼ sections are expanded by default, showing the core output immediately. An 800-line log that reads like a 10-line summary? That is not logging — that is user experience design.

The implementation uses two helper functions:

# Open a collapsible section; pass "collapsed" as third argument to fold it by default
start_section() {
  local id="$1"; local title="$2"; local collapsed="${3:-}"
  local opts=""
  [ "$collapsed" = "collapsed" ] && opts="[collapsed=true]"
  echo -e "\e[0Ksection_start:$(date +%s):${id}${opts}\r\e[0K\e[1;36m${title}\e[0m"
}

# Close the section that was opened with the same id
end_section() {
  local id="$1"
  echo -e "\e[0Ksection_end:$(date +%s):${id}\r\e[0K"
}

Not all output deserves the same treatment. We use three categories:

No section needed — simple one-liner messages (log INFO ...) do not need to be wrapped in a section. They are already short and scannable.

Collapsed by default — verbose but predictable output (dependency install, Docker pull, cache restore). The developer only opens these when something goes wrong:

script:
  - start_section "npm_install" "Installing dependencies" collapsed
  - npm ci
  - end_section "npm_install"

Collapsible, open by default — the core action of the job. This is what the developer came to see: migration output, test results, deployment logs. It is wrapped in a section for structure, but expanded so the output is immediately visible:

script:
  - start_section "db_migrate" "Database migration"
  - npx prisma migrate deploy
  - end_section "db_migrate"
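
Three lines per section add up quickly, and the closing call is skipped when the wrapped command fails and the job stops. One possible way around both is a wrapper that opens the section, runs the command, always closes the section, and hands back the command's exit status. This is a sketch, not part of the helpers above: run_in_section is a hypothetical name, and the two section functions are repeated in a simplified, uncolored form only to keep the example self-contained.

```shell
# Simplified section markers, in the format GitLab documents for custom sections.
start_section() {
  local id="$1" title="$2" opts=""
  if [ "${3:-}" = "collapsed" ]; then opts="[collapsed=true]"; fi
  printf '\033[0Ksection_start:%s:%s%s\r\033[0K%s\n' "$(date +%s)" "$id" "$opts" "$title"
}

end_section() {
  printf '\033[0Ksection_end:%s:%s\r\033[0K\n' "$(date +%s)" "$1"
}

# Hypothetical wrapper: run a command inside a section and always close it.
# Usage: run_in_section <id> <title> <collapsed|open> <command> [args...]
run_in_section() {
  local id="$1" title="$2" mode="$3" status=0
  shift 3
  start_section "$id" "$title" "$mode"
  "$@" || status=$?      # remember the failure instead of aborting here
  end_section "$id"
  return "$status"       # the job still fails if the command failed
}
```

In a script: block this would read as a single line, for example run_in_section npm_install "Installing dependencies" collapsed npm ci.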

5. Synchronize and validate before moving on

A deployment is not done when the artifact lands on the server. It is done when we have proof it works. One of the trickiest CI/CD debugging situations is when an asynchronous operation fails after the pipeline has moved on — the error appears in the wrong context, or worse, the pipeline reports success while the actual deployment is still rolling out.

The rule is straightforward: if a step produces a result we depend on later, we wait for confirmation before continuing. This means waiting for rollouts to complete, then actively validating the result.
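
The validation half of that rule can be sketched as a polling loop. Several things here are assumptions made for illustration: validate_deployment and fetch_version are hypothetical helpers, the health endpoint is assumed to answer with the bare version string, and the attempt count and delay are arbitrary defaults.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Hypothetical helper: assumes the endpoint returns the bare version string.
fetch_version() {
  curl -fsS --max-time 5 "$1"
}

# Poll until the endpoint reports the expected version, or give up.
# Usage: validate_deployment <url> <expected> [attempts] [delay_seconds]
validate_deployment() {
  local url="$1" expected="$2" attempts="${3:-30}" delay="${4:-10}"
  local i=1 got=""
  log INFO "Validating deployment..."
  while [ "$i" -le "$attempts" ]; do
    # a failed request counts as "no version yet", not as a fatal error
    got="$(fetch_version "$url" 2>/dev/null)" || got=""
    log TRACE "Attempt $i/$attempts: got version '$got', expected '$expected'"
    if [ "$got" = "$expected" ]; then
      log INFO "Deployment validated: version $expected is live"
      return 0
    fi
    # no need to wait after the last attempt
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  log ERROR "Deployment validation failed after $attempts attempts, expected version $expected, last received '$got', health endpoint $url"
  return 1
}
```

In a deploy script this would be called right after the rollout wait, for instance with the health URL of the target environment and the commit SHA the image was built from.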

On success, the developer sees a clear narrative:

$ ./ci/deploy.sh staging
  Pushing image user-api:e4f5g6h to registry...
  Done.
[INFO] Deploying user-api to staging...
$ kubectl rollout status deployment/user-api -n staging --timeout=300s
  Waiting for deployment "user-api" rollout to finish: 0 of 3 updated replicas are available...
  Waiting for deployment "user-api" rollout to finish: 2 of 3 updated replicas are available...
  deployment "user-api" successfully rolled out
[INFO] Validating deployment...
[TRACE] Attempt 1/30: got version 'a1b2c3d', expected 'e4f5g6h'
[TRACE] Attempt 2/30: got version 'e4f5g6h', expected 'e4f5g6h'
[INFO] Deployment validated: version e4f5g6h is live

On failure, the error is impossible to miss — whether the rollout itself fails:

$ ./ci/deploy.sh staging
  Pushing image user-api:e4f5g6h to registry...
  Done.
[INFO] Deploying user-api to staging...
$ kubectl rollout status deployment/user-api -n staging --timeout=300s
  Waiting for deployment "user-api" rollout to finish: 0 of 3 updated replicas are available...
  Waiting for deployment "user-api" rollout to finish: 1 of 3 updated replicas are available...
  error: deployment "user-api" exceeded its progress deadline
[ERROR] Rollout failed or timed out for user-api in staging — run: kubectl describe pod -l app=user-api -n staging
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

Or the validation catches the wrong version:

$ ./ci/deploy.sh staging
[INFO] Validating deployment...
[TRACE] Attempt 29/30: got version 'a1b2c3d', expected 'e4f5g6h'
[TRACE] Attempt 30/30: got version 'a1b2c3d', expected 'e4f5g6h'
[ERROR] Deployment validation failed after 5 minutes — expected version e4f5g6h, last received a1b2c3d, health endpoint https://staging.example.com/api/health
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

Good validation goes beyond "is it running?":

Version check: confirm the exact commit is deployed, not a cached old version.
Health endpoint: verify the service can reach its dependencies (database, cache, external APIs).
Smoke tests: run a minimal request that exercises the critical path.
Rollback trigger: if validation fails, automatically roll back and log the reason.

This applies to more than just Kubernetes deployments — database migrations, infrastructure provisioning, async API calls, DNS propagation. When a deployment validation fails, the developer knows immediately — not 30 minutes later when a user reports a bug. Schrödinger's deployment: simultaneously succeeded and failed until someone actually checks.

6. Use warning jobs as early signals

Not every issue deserves a pipeline failure. Some problems — like linter violations in legacy code, or a growing list of TODO comments — are real but not urgent. Blocking the pipeline on day one would just train developers to ignore the warnings (or worse, remove the job entirely).

Instead, we add the check as a non-blocking job using allow_failure: true. The job runs, reports its findings, and shows as an orange warning in the pipeline. It does not block the merge, but it stays visible — a gentle, persistent reminder that something needs attention:

eslint:
  stage: quality
  script:
    - npx eslint src/ --format stylish
  allow_failure: true

The orange icon acts as a soft nudge. Sooner or later, a developer picks up the task, fixes the violations, and the job turns green. At that point, we remove allow_failure: true and the check becomes mandatory — without any drama or big-bang migration.

This works for any gradual adoption pattern: stricter TypeScript settings, new security rules, accessibility checks, or dependency update policies. The pipeline documents the target standard before it enforces it.

Warning jobs also shine for detecting slow-moving disasters — things that are fine today but will explode next month. We can use allow_failure: exit_codes to turn specific exit codes into warnings:

for drive in C D; do
  usage=$(df -m /$drive | awk 'NR==2{printf "%.0f", $5}')
  if [ "$usage" -gt 90 ]; then
    log WARNING "Disk $drive: usage is ${usage}% (> 90%) — ask infra to clean up old releases"
    exit 99
  fi
  log TRACE "Disk $drive: usage is ${usage}%"
done

deploy:
  script: ...
  allow_failure:
    exit_codes: [99]  # disk space

The job shows as a warning (orange) in the pipeline instead of a hard failure. The developer sees the problem, but the pipeline is not blocked. This pattern works well for any ticking clock: disk space, certificate expiry, security scan thresholds, or GitLab Pages size limits approaching the quota.
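
The same ticking-clock check tends to repeat across jobs, so it can be factored into a small helper. The sketch below is one way to do it: check_threshold and WARN_EXIT_CODE are hypothetical names, the value is assumed to be an integer, and 99 simply mirrors the exit code used in the example above.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Exit code that the job maps to a warning via allow_failure:exit_codes.
WARN_EXIT_CODE=99

# Hypothetical helper: warn when an integer value exceeds its limit.
# Usage: check_threshold <name> <value> <limit> <remediation hint>
check_threshold() {
  local name="$1" value="$2" limit="$3" hint="$4"
  if [ "$value" -gt "$limit" ]; then
    log WARNING "$name is $value (> $limit), $hint"
    return "$WARN_EXIT_CODE"
  fi
  log TRACE "$name is $value (limit: $limit)"
  return 0
}
```

A job script would call it as check_threshold "Disk usage (%)" "$usage" 90 "ask infra to clean up old releases" || exit $?, so that only the warning exit code reaches the runner.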

7. Make warnings and errors actionable

The difference between a good log and a great log is the "what to do next" part. Every warning and error should include a remediation hint — a specific action the developer can take.

A bare error leaves the developer stranded:

[ERROR] Build artifacts not found at dist/api — sync cannot proceed
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

The same error with a remediation hint becomes self-service:

[ERROR] Build artifacts not found at dist/api and fallback dist/api-main
[WARNING] add 'force-build-api' label on your MR and launch a new pipeline, or run a manual pipeline on the target branch to create a fallback
Cleaning up project directory and file based variables
ERROR: Job failed: exit status 1

One line, all the context. Here are more patterns from real-world pipelines:

[ERROR] Pre/Production deployment requires Maintainer privileges (level >= 40), current user jdupont has level 30 — ask a project Owner to promote you in Settings > Members

[ERROR] https://recette3.example.com/health failed with HTTP 503 — localhost fallback returned "Cannot find module '@prisma/client'" — check application logs on app-rec-03

[ERROR] Current branch is behind main or has conflicts — E2E tests would be meaningless, please rebase or back-merge before retrying

[WARNING] E2E test results found in cache (18.3h old), skipping tests — to force a full run, delete the cache or add the 'force-e2e' label

[WARNING] Kafka consumer service configured as Manual startup (GroupId is empty) — set a GroupId in the environment config to enable automatic startup

The pattern is always the same: what happened + why + what to do about it, all on a single line. Long lines scroll horizontally in GitLab's log viewer — and a developer can copy-paste one line into Slack and the recipient has the full picture without follow-up questions.
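
One way to make the three-part pattern hard to forget is a helper that refuses to take fewer than three pieces. The sketch below is an illustration only: fail_with_hint is a hypothetical name, and the " - " separator is an arbitrary choice.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Hypothetical helper: build a single-line "what, why, what to do" error.
# Usage: fail_with_hint <what happened> <why> <what to do>
fail_with_hint() {
  local what="${1:?what happened is required}"
  local why="${2:?why is required}"
  local fix="${3:?what to do is required}"
  local message
  # flatten embedded newlines so the message stays copy-pastable as one line
  message="$(printf '%s - %s - %s' "$what" "$why" "$fix" | tr '\n' ' ')"
  log ERROR "$message"
  return 1
}
```

Called as fail_with_hint "..." "..." "..." || exit 1, it logs the message and lets the job fail with the context already attached.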

A developer reading this at 6 PM on a Friday should not need to ask anyone for help.

8. Log the variable, path, and reason behind every decision

Most CI logs describe what is happening. Great CI logs explain why a branch was taken — and with enough context to reconstruct the decision without opening the YAML: the variable or label that triggered it, the file or cache key that was found or missing, the path that was created or skipped. This is the single most effective way to reduce support tickets.

Caching is the #1 source of "why" questions. When a job is slow, the first thing a developer wonders is: "did it use the cache?" Without explicit logging, the only way to find out is to read the YAML and hope the runner logs are detailed enough. Instead, we log the decision:

[INFO] Cache found: node_modules checksum matches (a7f3b2c)
[INFO] Skipping npm ci

[WARNING] Cache miss: node_modules checksum changed (a7f3b2c → e91d4f8)
[INFO] Running full npm ci

[INFO] E2E test results found in cache (less than 24h old)
[INFO] Skipping E2E tests for this module

[WARNING] E2E test results expired (cache is 36h old, max: 24h)
[INFO] Running full E2E suite
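
A cache decision like the first two above could be produced by a helper along these lines. It is a sketch under assumptions: cache_is_fresh is a hypothetical name, the checksum is the first 7 characters of the lockfile's SHA-256, and the previous checksum is assumed to be stored in a small stamp file kept alongside the cached node_modules.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Hypothetical helper: compare the lockfile checksum with the stored one
# and log which branch is taken.
# Returns 0 when the install can be skipped, 1 when it must run.
# Usage: cache_is_fresh <lockfile> <stamp file>
cache_is_fresh() {
  local lockfile="$1" stamp="$2" current previous="none"
  current="$(sha256sum "$lockfile" | cut -c1-7)"
  if [ -f "$stamp" ]; then previous="$(cat "$stamp")"; fi
  if [ "$current" = "$previous" ]; then
    log INFO "Cache found: node_modules checksum matches ($current)"
    log INFO "Skipping npm ci"
    return 0
  fi
  log WARNING "Cache miss: node_modules checksum changed ($previous -> $current)"
  log INFO "Running full npm ci"
  return 1
}
```

The caller runs npm ci only when the function returns non-zero, then writes the new checksum to the stamp file so the next pipeline can compare against it.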

Depending on the conditions, the developer sees exactly why tests were skipped — or why they were not:

[WARNING] Skipping tests (SKIP_TESTS=true)

[INFO] Running full test suite (no skip conditions met)
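
The skip-or-run decision can live in one function that logs its reason on every path. The sketch below is illustrative: should_run_e2e is a hypothetical name, SKIP_TESTS is the variable from the example above, and cached results are assumed to be a single file whose modification time reflects the last run.

```shell
# Stand-in for the colorized log helper (only defined if it is missing).
command -v log >/dev/null 2>&1 || log() { printf '[%s] %s\n' "$1" "$2"; }

# Hypothetical helper: decide whether tests must run, and say why.
# Returns 0 when tests should run, 1 when they can be skipped.
# Usage: should_run_e2e <cached results file> [max age in hours]
should_run_e2e() {
  local results="$1" max_hours="${2:-24}"
  if [ "${SKIP_TESTS:-false}" = "true" ]; then
    log WARNING "Skipping tests (SKIP_TESTS=true)"
    return 1
  fi
  # find prints the path only if the file was modified within the window
  if [ -f "$results" ] && [ -n "$(find "$results" -mmin "-$((max_hours * 60))")" ]; then
    log INFO "E2E test results found in cache (less than ${max_hours}h old)"
    log INFO "Skipping E2E tests for this module"
    return 1
  fi
  log INFO "Running full test suite (no skip conditions met)"
  return 0
}
```

Because every return path logs first, the job log always states which condition won, whatever the outcome.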

When a pipeline behaves unexpectedly, the "why" logs are the first thing developers look for. Without them, they are reverse-engineering our CI logic from YAML — a slow and error-prone process that almost certainly ends in a support ticket.

9. Know when NOT to log

The temptation after reading this article is to wrap every command in log(). Resist it. Over-logging is almost as bad as under-logging — it buries the signal in noise.

Here is what an over-logged npm ci looks like:

$ cd front && npm ci --prefer-offline
[INFO] Starting npm ci...
[INFO] Running: npm ci --prefer-offline
[INFO] Working directory: /builds/project/front
[INFO] Node version: 20.11.0
[INFO] npm version: 10.2.4
npm warn deprecated svgo@1.3.2: ...
added 1,847 packages in 42s
[INFO] npm ci completed successfully

And here is the same step, done right:

$ cd front && npm ci --prefer-offline
npm warn deprecated svgo@1.3.2: ...
added 1,847 packages in 42s

No custom logging at all. npm ci already tells us everything we need to know: what it installed, how long it took, and whether it succeeded (via its exit code). Adding log INFO around it is pure noise.

The rule of thumb: log when the pipeline makes a decision, not when it executes a well-known command. Specifically, we add logging for:

Ambiguous operations — cache hit or miss? skip or run? which environment was selected?
Operations that can fail silently — a sync that might succeed partially, a health check with a timeout.
Decisions with side effects — "deploying to production" is worth logging; "running npm ci" is not.

Standard tools like npm, go build, docker, kubectl produce their own output. Our job is to add context around them, not on top of them. If we find ourselves logging "Starting npm ci" right before npm ci, we have achieved the logging equivalent of a meeting that could have been an email.

Wrapping up

Self-documenting pipelines are an investment that pays off immediately. From GitLab's built-in features, through structured log levels, meaningful colors, and collapsible sections, to synchronization barriers, deployment validation, context-rich errors, actionable remediation, decision-context logging, and knowing when to stay silent — we transform CI/CD logs from a debugging nightmare into a developer self-service tool.

The ultimate test: when a pipeline fails at 2 AM, can the on-call developer fix it without escalating to the CI/CD team? If yes, we have done our job. The real metric is not code coverage or deployment frequency — it is the number of CI/CD support tickets trending toward zero.