Tyson Lawrie

CICD Evicted: exceeded ephemeral storage

I am a Software Engineer enjoying the journey to a Cloud Native landscape. You can find my previous posts on Medium.


Pod ephemeral local storage usage exceeds the total limit of containers

Recently we hit a major issue with the builds for some of our AI-based components in our DevOps pipeline, due to either storage or memory limits. As with many modern DevOps tools, a task (essentially a Kubernetes Job) is spun up to run the commands required to build and package.

It can be complex to pinpoint exactly what the issue is, what reaching these limits means, and at what point in the task they are reached.

ephemeral-storage = emptyDir && medium = memory

In Kubernetes, ephemeral-storage can be mapped to an emptyDir: essentially a scratch space that lives for the duration of the pod and is removed when the pod terminates. Unfortunately, on our clusters each node only has a maximum of 100Gb of disk space, so it fills up very quickly.

OK, so let's use this cool medium feature and set the storage to use memory. Bonus: the task sped up, with a performance increase of about 20%. Unfortunately, we still reached a maximum for the task, this time against the memory resource limit.

Volumes:
  cicd-vol-data:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  Memory
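For context, here is a minimal sketch of how that kind of volume can be declared in a task's pod spec. The names, image, and size below are illustrative, not our actual manifest:

apiVersion: v1
kind: Pod
metadata:
  name: cicd-task
spec:
  containers:
    - name: build
      image: example/build-tool:latest   # placeholder image for this sketch
      volumeMounts:
        - name: cicd-vol-data
          mountPath: /workspace          # scratch space used by the build steps
  volumes:
    - name: cicd-vol-data
      emptyDir:
        medium: Memory                   # back the scratch dir with tmpfs (RAM)
        sizeLimit: 16Gi                  # tmpfs usage counts against the pod's memory limit

With medium: Memory the emptyDir becomes a tmpfs mount, so writes go to RAM instead of the node's disk; the trade-off is that everything written there counts against the container's memory limit.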

Visibility = New Relic Infrastructure

We really needed a realistic view of what all these CICD tasks were doing; we could no longer sail through with blind trust in tasks and pipelines. Time for a proper deep analysis.

(Dashboard charts: Memory and CPU usage per task)

With New Relic Infrastructure agents on our Kubernetes nodes, we put together a dashboard that showed us a breakdown of the memory, storage, and CPU being consumed by our tasks.

In this dashboard you can see that the spikes for the AI-related worker are outliers compared to all the other tasks. They reach the limits previously applied for disk and memory (16Gb ephemeral storage + 16Gb of memory used as disk) and then abruptly terminate. CPU is fine.
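Those limits sat on the task container along these lines (the 16Gb values are from the dashboard above; the exact field layout of our task spec is an assumption for this sketch):

resources:
  limits:
    ephemeral-storage: 16Gi   # node-disk scratch space
    memory: 16Gi              # RAM, which now also holds the tmpfs-backed emptyDir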

It seems we have just transferred the issue from storage to memory.

Unpacking the layers

In analyzing the memory usage, we found the task consumed a total of 21.6Gb of memory; the working scratch directory in memory was ~18Gb by completion. Well, that's HUGE!

Breaking down that ~18Gb, we have:

  • After Clone = 1.7Gb
  • After Image Build & Export = ~16Gb

The image build includes the build context, the tmpfs working space, and the output archive. Whenever there is a Dockerfile instruction (COPY . . or RUN xyz), the build context is copied into a new layer; every instruction in the Dockerfile forms a layer.

The key was to understand the various layers being built: we were copying the same file contents around in different layers, making multiples of that 1.7Gb.
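As a hypothetical illustration (not our actual Dockerfile), every COPY or RUN that touches the context writes another copy of that data into a new layer:

# Illustrative Dockerfile only; paths and base image are assumptions.
FROM python:3.11-slim

WORKDIR /app

# Layer: the full ~1.7Gb build context is written into the image.
COPY . .

# Layer: rearranging those files writes them again; the copies in the
# previous layer are still stored, so the data is now held twice.
RUN mkdir -p ./build && cp -r ./models ./build/

# Layer: installing dependencies adds yet more on top.
RUN pip install -r requirements.txt

During the build, the context, the intermediate layers, and the exported archive all live in the same scratch space, which is how 1.7Gb of source can balloon into ~16Gb of usage.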

The solution

To get this going for our team, we increased the memory limit to 24Gb, which was easier for us than increasing disk (plus we wanted the speed increase). Hooray, it built!
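In pod-spec terms, that was simply bumping the limit (the value is from above; the layout is assumed):

resources:
  limits:
    memory: 24Gi   # raised from 16Gi so the in-memory scratch directory fits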

Now we can focus on the Dockerfile and code structure to make it more efficient and bring this back down to within a normal range.


Thanks go to my whole team, in particular Glen Hickman and Ionut Palada, for their help in solving this issue.
