We encountered a strange issue where build agents were failing randomly while running Jenkins pipelines on Kubernetes.
Agent pods would start normally and builds would run successfully for some time. Then the pods would terminate unexpectedly, causing pipeline failures.
Initially, we investigated:
• Jenkins logs
• Pipeline configuration
• Docker build stages
• SonarQube scans
Everything looked normal.
The real cause became clear after inspecting pod events:
kubectl describe pod jenkins-agent-pod
The Events section showed:
Evicted
The node was low on resource: ephemeral-storage
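To confirm that an eviction is storage-related, a couple of read-only kubectl checks help (the node name below is a placeholder for one of your cluster's nodes):

```shell
# Look for a DiskPressure condition with Status=True on the node
kubectl describe node <node-name> | grep -A 10 "Conditions:"

# List recent eviction events in the current namespace
kubectl get events --field-selector reason=Evicted
```

If DiskPressure is True, the kubelet will keep evicting pods on that node until disk usage falls back below the eviction threshold.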
What Consumes Ephemeral Storage in CI Agents?
CI agent pods often consume more disk than expected due to:
• Docker image layers
• Dependency downloads
• Temporary build files
• Test artifacts
• Coverage reports
• SonarQube cache
• Package manager caches
Unlike CPU and memory, ephemeral storage is frequently ignored in resource configuration.
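When you suspect a specific agent pod, a quick way to see where the disk is going is to run du inside it. The paths below are typical for the Jenkins inbound agent image; adjust them for your own workspace layout:

```shell
# Show the largest directories in the agent workspace and /tmp,
# sorted by size (paths are an assumption about the agent image)
kubectl exec jenkins-agent-pod -- sh -c \
  'du -sh /home/jenkins/agent/* /tmp/* 2>/dev/null | sort -h | tail -20'
```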
Why Does This Cause Pipeline Failures?
When ephemeral storage usage exceeds node capacity:
• Kubernetes marks the node as under disk pressure
• Pods get evicted
• Jenkins agents disappear
• Pipelines fail unexpectedly
Since Jenkins does not clearly indicate storage-related failures, this often looks like a Jenkins or pipeline problem.
The Fix
We resolved the issue by explicitly defining ephemeral storage resources:
resources:
  requests:
    cpu: 500m
    memory: 1Gi
    ephemeral-storage: 4Gi
  limits:
    cpu: 2
    memory: 4Gi
    ephemeral-storage: 10Gi
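For context, this block sits under the agent container in the pod template. A minimal sketch of where it goes (the container name and image here are illustrative, not taken from our actual setup):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: jnlp
      image: jenkins/inbound-agent:latest
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
          ephemeral-storage: 4Gi
        limits:
          cpu: 2
          memory: 4Gi
          ephemeral-storage: 10Gi
```

With the request set, the scheduler only places the agent on nodes with enough free ephemeral storage; with the limit set, a runaway build gets its own pod evicted instead of pushing the whole node into disk pressure.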
Additional improvements included:
• Cleaning workspace after builds
• Splitting heavy pipelines into separate agents
• Increasing node storage capacity
• Reducing artifact retention
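The workspace cleanup step can be done declaratively in the Jenkinsfile using the cleanWs step from the Workspace Cleanup plugin. A sketch, assuming that plugin is installed and the build step is a placeholder:

```groovy
pipeline {
  agent { label 'kubernetes' }
  stages {
    stage('Build') {
      steps {
        sh './build.sh'   // placeholder for your actual build
      }
    }
  }
  post {
    always {
      cleanWs()   // frees the agent's ephemeral storage after every run
    }
  }
}
```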
Key Takeaway
If Jenkins agents fail randomly in Kubernetes, always check pod events:
kubectl describe pod jenkins-agent-pod
Ephemeral storage exhaustion is one of the most common but overlooked causes of CI/CD instability.