If you are building a production system, chances are that you will rely on Docker and Kubernetes for deployment.
Ever had a Docker image blow up in size? We faced one such situation recently, and in this write-up, I will cover how we brought the image back down to a consumable size.
The System
We at Saama Technologies are building systems that fast-track clinical trials. Our data ingestion process is a pipeline defined in Airflow and executed in parallel on Kubernetes.
Each step in the data pipeline will (roughly, as sketched below):
- Spin up a new Kubernetes pod
- Pull the Docker image
- Execute the task
- Kill the pod
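Airflow's Kubernetes integration drives this lifecycle for us; done by hand, it would look roughly like this (a sketch only; the pod name, image, and module are made up):

```
# Spin up a one-off pod; Kubernetes pulls the image if it is not cached locally
kubectl run ingest-task --image=registry.example.com/ingestion-pipeline:latest \
  --restart=Never -- python -m ingest.run_step

# Wait for the pod to start, then follow the task's output
kubectl wait --for=condition=Ready pod/ingest-task --timeout=600s
kubectl logs -f ingest-task

# Kill (delete) the pod once the task is done
kubectl delete pod ingest-task
```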
The Problem
One of the steps in the data ingestion process is an ML model making predictions over the ingested data. The model was initially consumed via a REST API; we decided to move it into a Python module to avoid unnecessary latency and timeouts.
On incorporating the model code, the Docker image size suddenly blew up to 5 GB. Since Airflow uses this image to kickstart a task, each task (pod) took 10 minutes to start. With ten processes running in parallel, that is 100 minutes of cumulative startup time lost just setting up pods.
Debugging
At this point, it was easy to assume that the ML models were heavy and therefore responsible for the size of the Docker image. Building on that assumption, we were thinking of moving the model files to S3.
But an important step in debugging is to validate your assumption before proposing any solution.
First, we need to find the size of the image.
```
docker images
```
Note: the outputs shown in this post are representative and not from the real system; for example, the image size here is just 184 MB, not several GB.
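For reference, a representative `docker images` output has this shape (illustrative values, per the note above):

```
REPOSITORY           TAG       IMAGE ID       CREATED       SIZE
ingestion-pipeline   latest    a1b2c3d4e5f6   2 hours ago   184MB
```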
Each Docker image is made up of layers. We need to find which of those layers is the heaviest.
```
docker history <image_id>
```
This command lists all the layers of a particular image along with each layer's size.
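The default output truncates the command that created each layer. If you want the full picture, the standard Docker CLI supports `--no-trunc` and a `--format` template on `docker history`:

```
# Show each layer's size next to the full command that created it
docker history --no-trunc --format "table {{.Size}}\t{{.CreatedBy}}" <image_id>
```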
Surprise! Surprise!
The initial assumption that the model files were to blame was wrong; the culprit was the step that installed the Python requirements.
At this point, it is clear that one of the Python packages is heavy and contributes almost 2 GB of the image size.
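The heavy layer comes from a step like the following (a hypothetical sketch of the relevant Dockerfile lines, not our actual file; pip's `--no-cache-dir` flag at least keeps the download cache out of the layer):

```
# Hypothetical Dockerfile excerpt: this RUN step produces the ~2 GB layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
```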
Next, we need to find the heavy Python package. With `pip list` and `pip show`, we can construct a command that lists the installed packages by size:

```
pip list --format freeze | awk -F = '{print $1}' \
  | xargs pip3 show \
  | grep -E 'Location:|Name:' \
  | cut -d ' ' -f 2 \
  | paste -d ' ' - - \
  | awk '{print $2 "/" tolower($1)}' \
  | xargs du -sh 2> /dev/null \
  | sort -h
```

In short: list the installed package names, resolve each package's Name and Location with `pip show`, join the two into a directory path, and run `du -sh` on each path, sorting the results by size (`sort -h` understands the human-readable units).
Eureka
We found that `torch`, a deep-learning library that was part of the research phase (and no longer used), accounted for almost a GB.
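The fix itself was mundane: drop the unused dependency and rebuild. A minimal sketch, assuming `torch` is a direct entry in requirements.txt and with a made-up image tag:

```
# Remove the unused dependency from the requirements file (GNU sed)
sed -i '/^torch/d' requirements.txt

# Rebuild the image without it
docker build -t ingestion-pipeline:latest .
```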
There were also some IPython notebooks and CSV files lying around. Using a `.dockerignore` file, we can keep such unused files out of the image.
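A minimal sketch of the kind of entries we added (representative patterns, not our exact file):

```
# .dockerignore: keep research artifacts out of the build context
*.ipynb
*.csv
.git
__pycache__/
```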
After all the above steps, we brought the image size down to 2.4 GB. The Airflow tasks are now happily pulling the image and kick-starting the ingestion process in a few seconds.