
Bhavani Ravi

Originally published at bhavaniravi.com

How to Reduce Docker Image Size for your Python Project?

If you are building a production system, the chances are that you will rely on Docker and Kubernetes for deployment.

Ever had a Docker image blow up in size? We faced one such situation recently, and in this write-up, I will cover how we brought the image back down to a consumable size.

The System

We at Saama Technologies are building systems that fast-track clinical trials. Our data ingestion process is a data pipeline defined in Airflow and executed in parallel using Kubernetes.

Each step in the data pipeline will:

  1. Spin up a new Kubernetes pod
  2. Pull the Docker image
  3. Execute the task
  4. Kill the pod

The problem

One of the steps in the data ingestion process is an ML model making predictions over the ingested data. The model was initially consumed via a REST API. To avoid unnecessary latency and timeouts, we decided to bring the model in as a Python module instead.

On incorporating the model code, the Docker image size suddenly blew up to 5 GB. Since Airflow uses this image to kickstart a task, each task (pod) took 10 minutes to start. With ten processes running in parallel, we would lose a cumulative 100 minutes just setting up pods.
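To get a feel for how much of that startup time goes into the image pull alone, timing a manual pull on a node that does not have the image cached gives a rough number. A minimal sketch; the registry, image name, and tag are placeholders:

time docker pull <registry>/<image>:<tag>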

Debugging

At this point, it was easy to assume that the ML models were heavy and were therefore contributing to the size of the Docker image. Building on that assumption, we were considering moving the model files to S3.

An important step in debugging is to validate our assumption before proposing any solution.
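One direct way to validate it is to measure the model files inside the image itself. A minimal sketch, assuming the model files live under /app/models (the image id and the path are placeholders):

# Check how much space the model files actually occupy inside the image
docker run --rm <image_id> du -sh /app/models

In our case, we instead started from the image and worked our way down, layer by layer.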

First, we need to find the size of the image.


docker images

The images below are representative command outputs, not from the actual problem. For example, the image size shown is just 184 MB, not several GBs.

list of docker images on the machine

Each Docker image is made up of layers. We need to find which of those layers is the heaviest.


docker history <image_id>

This command lists all the layers of a particular image and their sizes.

Image of all layers of cache in a docker image
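If the truncated output is hard to read, a formatted variant of the same command can make the heavy layers easier to spot. A sketch using docker history's --no-trunc and --format flags; the image id is a placeholder:

# Show each layer's size next to the full instruction that created it
docker history --no-trunc --format "{{.Size}}\t{{.CreatedBy}}" <image_id>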

Surprise! Surprise!

Our initial assumption was wrong; the heaviest layer was not the model files but the step that installed the requirements.

At this point, it was clear that one of the Python packages was heavy, contributing almost 2 GB to the image size.

Next, we need to find the heavy Python package. With pip list and pip show, we can construct a command to list the installed packages by size.


pip list --format freeze | awk -F= '{print $1}' | xargs pip3 show | grep -E 'Location:|Name:' | cut -d ' ' -f 2 | paste -d ' ' - - | awk '{print $2 "/" tolower($1)}' | xargs du -sh 2> /dev/null | sort

output of all python packages in decreasing order with respect to size
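A simpler, though rougher, alternative is to measure the site-packages directory directly with du. A sketch, assuming a standard Debian-based Python image; the exact path depends on your base image and Python version:

# List installed packages by on-disk size, largest last (path is an assumption)
du -sh /usr/local/lib/python3.*/site-packages/* | sort -h | tail -20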

Eureka

We found that torch, a deep-learning library that was part of the research phase (and no longer used), accounted for almost a GB.

There were also some IPython notebooks and CSV files lying around. Using a .dockerignore file, we kept these unused files out of the image.
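A minimal .dockerignore along these lines keeps such artifacts out of the build context; the entries here are examples and depend on your repository layout:

# Write a .dockerignore that excludes research artifacts from the build context
cat > .dockerignore <<'EOF'
*.ipynb
*.csv
notebooks/
__pycache__/
.git
EOF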

After all the above steps, we were able to bring the image size down to 2.4 GB. The Airflow tasks now happily pull the image and kick-start the ingestion process in a few seconds.
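To confirm the reduction after a rebuild, the same commands from the debugging section can be rerun; a quick check, with the image name and tag as placeholders:

# Rebuild the image and verify its new size
docker build -t <image>:<tag> .
docker images <image>:<tag>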
