Dockerizing a Simple Python Process
jess unrein Nov 16
This is part two in a series on taking a simple Python project from local script to production. In part one I talked about a gotcha I ran into when converting an old project from Python 2 to Python 3.
This part will go over how I put my Python process, its inputs, and its outputs into a Docker container and made an image publicly available on Dockerhub.
Requirements that I will not go over here (go to Docker.com and follow the instructions there):
- Download docker
- Create a docker id
- Log in with your docker id on Dockerhub
What is Docker?
Docker is a containerization platform. Containerization is a way to package units of code with their dependencies so that they have everything they need to run in isolation.
Using Docker can help fix the "it works on my machine" problem, and writing dockerized code is a great way to encourage thoughtful code practices. Docker containers should be simple, responsible for as little as possible, and dependent on as few externals as possible.
Docker image vs docker container
Throughout this post, and online, you'll see the terms image and container. An image is basically a snapshot of your dockerized code that is created when you use the docker build command (more on that below). Docker starts a container from an image when you use docker run on that image. So a container is a running instance of an image.
Anatomy of a Dockerfile
I decided to dockerize my csv writer from the previous post in this series so that I could move it between environments easily.
For this I needed a Dockerfile. A Dockerfile is a plain text file, conventionally named Dockerfile with no file extension, that holds the instructions Docker follows to build an image.
Here's what the Dockerfile for my Python code looks like:
FROM python:3.7
ARG export_file=goodreads.csv
COPY $export_file goodreads_export.csv
COPY converter.py /
CMD ["python", "./converter.py"]
The FROM keyword here indicates a dependency. Docker containers don't have languages automatically loaded. To access Python to run the code, we need to instruct the image to include Python. Here, FROM python:3.7 bases the image on the official Python 3.7 image.
A note on Docker registries:
The default Docker registry is Dockerhub. If a docker image is available on Dockerhub, you don't need to specify a url when pulling from or pushing to a docker repo. You just need the author's username and the repo name. For example, you can pull the docker image from this post with the command
docker pull thejessleigh/goodreads-libib-converter. If you're using a different registry you'll need to tell Docker where to go. For example, if you're using Quay you'd do
docker pull quay.io/example-username/test-docker-repo.
The python dependency in my Dockerfile doesn't have a username because it's an official repo hosted on Dockerhub.
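To make the naming rules above concrete, here's a rough Python sketch of how an image reference breaks down into registry, repository, and tag. It's a simplification of Docker's convention as I understand it, not Docker's actual code: the function name parse_image_ref is made up for this post, and it ignores digest references (the @sha256:... form).

```python
from typing import NamedTuple

DEFAULT_REGISTRY = "docker.io"  # Dockerhub
DEFAULT_TAG = "latest"


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split an image reference into registry, repository, and tag (simplified)."""
    # The first path component is a registry only if it looks like a host name.
    first, slash, rest = ref.partition("/")
    if slash and ("." in first or ":" in first or first == "localhost"):
        registry, remainder = first, rest
    else:
        registry, remainder = DEFAULT_REGISTRY, ref

    # A tag is whatever follows a colon in the last path component.
    name, colon, tag = remainder.rpartition(":")
    if not colon or "/" in tag:
        name, tag = remainder, DEFAULT_TAG

    # Official Dockerhub images have no username; they live under "library".
    if registry == DEFAULT_REGISTRY and "/" not in name:
        name = "library/" + name
    return ImageRef(registry, name, tag)


print(parse_image_ref("python:3.7"))
print(parse_image_ref("thejessleigh/goodreads-libib-converter"))
print(parse_image_ref("quay.io/example-username/test-docker-repo"))
```

The first call shows why the python dependency needs no username, and the last shows the registry host being picked up from the front of the reference.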
ARG declares an argument. It is the only instruction in a Dockerfile that can precede
FROM, although I prefer to have
FROM come first for the sake of consistency.
In the above example, I declare an argument named
export_file and give it a default. Unless told otherwise, the build expects a file called
goodreads.csv in the same directory as the Dockerfile. If I want to pass in something different, I instruct it to use a different filename with
--build-arg=export_file=my_goodreads_export.csv when building the image.
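Here's a toy Python model of how the ARG default and a --build-arg override feed into the COPY line. This is only a sketch for building intuition, not how Docker is implemented: expand_copy_sources is a made-up helper, and it only handles the simple $name form in COPY instructions.

```python
from string import Template

# The Dockerfile from this post.
DOCKERFILE = """\
FROM python:3.7
ARG export_file=goodreads.csv
COPY $export_file goodreads_export.csv
COPY converter.py /
CMD ["python", "./converter.py"]
"""


def expand_copy_sources(dockerfile: str, build_args: dict) -> list:
    """Return the COPY lines with ARG values substituted in (toy model)."""
    values = {}
    expanded = []
    for line in dockerfile.splitlines():
        instruction, _, rest = line.partition(" ")
        if instruction == "ARG":
            name, _, default = rest.partition("=")
            # A value passed with --build-arg wins over the declared default.
            values[name] = build_args.get(name, default)
        elif instruction == "COPY":
            # safe_substitute leaves any unknown $names untouched.
            expanded.append(Template(line).safe_substitute(values))
    return expanded


# No --build-arg: the default filename is used.
print(expand_copy_sources(DOCKERFILE, {}))
# With --build-arg=export_file=my_goodreads_export.csv
print(expand_copy_sources(DOCKERFILE, {"export_file": "my_goodreads_export.csv"}))
```

Either way, the destination name in the COPY line stays goodreads_export.csv; only the source file changes.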
COPY duplicates the contents of a file into the docker image. This is where I'm importing the input file and also the actual Python code that the Docker image executes.
COPY takes two arguments:
- the location of the file you're putting into the image
- the location of the file inside the docker image
So whatever file I include as the CSV to convert will be referred to as
goodreads_export.csv inside the Docker container. This is nifty, because it means that no matter what file I build the docker image with, the filename will always be consistent. I don't have to worry about making the Python code handle different filenames or paths. It can always look for goodreads_export.csv.
There are some subtle differences between COPY and
ADD that @ryanwhocodes has already written about, so I'll leave his post here.
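Because the input always arrives as goodreads_export.csv, the Python side can hardcode its filenames. Here's a minimal sketch of that shape. It is not the real converter.py: the actual column remapping from Goodreads to Libib is left out, so this placeholder copies rows through unchanged, and the sample data is invented purely for the demo.

```python
import csv
import io

INPUT_NAME = "goodreads_export.csv"  # fixed by the COPY line in the Dockerfile
OUTPUT_NAME = "libib_export.csv"     # the output file named in this post


def convert(source, destination) -> int:
    """Copy CSV rows from source to destination and return the row count.

    Placeholder logic: a real converter would remap columns here.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(destination, fieldnames=reader.fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


def main() -> None:
    # Inside the container the filenames never change, so no arguments are needed.
    with open(INPUT_NAME, newline="") as source, \
            open(OUTPUT_NAME, "w", newline="") as destination:
        print(f"Converted {convert(source, destination)} rows")


# In-memory demo with invented sample data, so this runs without any files.
sample = io.StringIO("Title,Author\nDune,Frank Herbert\n")
out = io.StringIO()
rows = convert(sample, out)
print(rows)
```

Keeping the file handling in main() and the row logic in convert() also makes the conversion easy to exercise without building an image.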
RUN issues an instruction that is executed and committed as part of the image. If I were dockerizing a Python project that needed to install external packages, I could use RUN to pip install those dependencies. However, converter.py is a very simple process that doesn't need external packages, so I don't need to run anything as part of my build process.
Only one CMD instruction per Dockerfile takes effect. If the Dockerfile contains multiple CMDs, only the last one will execute.
CMD is the command you intend the image to execute when you run an instance of it as a container. It is not executed as part of the build process for an image. CMD is different from RUN in this way.
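If the RUN versus CMD distinction feels slippery, this toy Python model might help. It sorts Dockerfile lines into build-time steps and the single run-time command, keeping only the last CMD. It's a sketch for intuition only: split_build_and_run is a made-up helper, the pip install line is a hypothetical dependency (my real Dockerfile has no RUN), and it only understands the JSON array form of CMD.

```python
import json

DOCKERFILE_LINES = [
    'FROM python:3.7',
    'RUN pip install requests',         # hypothetical: runs once, at build time
    'CMD ["python", "--version"]',      # ignored, because a later CMD exists
    'CMD ["python", "./converter.py"]',
]


def split_build_and_run(lines):
    """Return (build-time RUN steps, the one CMD that takes effect)."""
    build_steps = []
    command = None
    for line in lines:
        instruction, _, rest = line.partition(" ")
        if instruction == "RUN":
            build_steps.append(rest)
        elif instruction == "CMD":
            # Each CMD replaces the previous one; the array form is valid JSON.
            command = json.loads(rest)
    return build_steps, command


build_steps, command = split_build_and_run(DOCKERFILE_LINES)
print(build_steps)  # executed during docker build
print(command)      # executed during docker run
```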
Building a docker image
Now we have everything necessary to build a Docker image for our Python code from the Dockerfile.
As stated above, a Docker
image is an inert snapshot of an environment that is ready to execute a command or program, but has not yet executed that command.
To build using the above Dockerfile, we run
docker build --build-arg=export_file=goodreads_export.csv -t goodreads-libib-converter .
--build-arg tells Docker to build the image with a file called goodreads_export.csv, overriding the default expectation of goodreads.csv.
-t goodreads-libib-converter "tags" the image as goodreads-libib-converter. This is how you give your image a human readable name.
. tells Docker to use the current directory as the build context, which is also where it looks for the Dockerfile by default.
After I do this, I can see that the image was successfully created by checking my image list.
> docker image list
REPOSITORY                  TAG      IMAGE ID     CREATED          SIZE
goodreads-libib-converter   latest   1234567890   12 seconds ago   924MB
Running a Docker container
Now that I have an
image, I have a standalone environment capable of running my program, but it hasn't actually executed the core procedure specified with
CMD yet. Here's how I do that:
docker run goodreads-libib-converter
I see the print debugging statements I have in my
converter.py file execute, so I know how many CSV rows are being converted. When I ran the program locally, it created an output file called
libib_export.csv. However, when I check the contents of my directory now, it's not there. How is that useful!?
Accessing Files Written Out
I'm no longer running the Python code in the directory I was before. I'm running it inside the Docker container. Therefore, any files that are written out will also be stored inside the Docker container. The output file doesn't do me much good in there!
I'm running the Docker container locally, so all I have to do is find the container and copy the output file from its dockerized location to the place I actually want it.
docker cp container_id:/libib_export.csv ~/outputs/libib_export.csv
This extracts the resultant CSV output from
converter.py and puts it somewhere I can access it.
I can figure out the
container_id (or the human readable name) with
> docker ps -a
CONTAINER ID   IMAGE                    COMMAND                  CREATED          NAMES
e00000000000   goodreads-libib-export   "python ./converter.…"   24 seconds ago   naughty_mcclintock
Yes, naughty_mcclintock is actually the procedurally generated name for the container I've been working with locally.
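If you'd rather script the copy step than type it, here's a small Python sketch that assembles the same docker cp command as an argument list. The helper name docker_cp_args is made up for this post. One detail worth knowing: subprocess doesn't go through a shell by default, so the ~ in the destination has to be expanded in Python.

```python
import os


def docker_cp_args(container: str, path_in_container: str, destination: str) -> list:
    """Build the argument list for copying a file out of a container."""
    return [
        "docker",
        "cp",
        f"{container}:{path_in_container}",
        os.path.expanduser(destination),  # the shell won't expand "~" for us
    ]


args = docker_cp_args(
    "naughty_mcclintock",         # container name or ID from docker ps -a
    "/libib_export.csv",
    "~/outputs/libib_export.csv",
)
print(args)

# To actually run it (requires Docker and an existing container):
# import subprocess
# subprocess.run(args, check=True)
```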
Copying a file from a container to my desired location is fine for a local environment, but has limited uses if I ever want to take this project to production. There are other, better options for dealing with output files from Docker containers, but we'll get into that ✨ in another installment in this series ✨
Committing a docker image
After we've run the container to confirm that it works, we probably want to create a new image based on the changes it made when it executed. We're preparing the image that we want to push up into an external Docker registry, like Dockerhub.
When committing a Docker image, we need to specify the registry (if it's something other than dockerhub), the author name, the repository name, and the tag name.
docker commit -m "Working Python 3 image" naughty_mcclintock thejessleigh/goodreads-libib-converter:python3
docker commit was successful, so I see a sha256 hash output in my terminal. Creating a commit message is, of course, optional. But I like to do it to keep organized.
A note on Docker image tags:
When you pull a Docker image and you don't specify a tag it will use the default tag (usually
latest). Tags are the way you can keep track of changes in your project without overwriting previous versions. For example, if you (for some reason) are still using Python 2, you can access the Python 2 image by running
docker pull thejessleigh/goodreads-libib-converter:python2. Right now the
python3 and latest tags on my Docker repo are the same, but you can pull either one.
Pushing a docker image to Dockerhub
Now that I have an image I want to put out into the world, I can push it up to Dockerhub.
First, I need to log into Dockerhub and create a repository. Repositories require a name, and should have a short description which details the purpose of the project, and a long description that explains dependencies, requirements, build arguments, etc. You can also make a Docker repository private.
Once I've done that, I run
docker push, which sends the latest commit of the project and tag I've specified up to the external registry. If you didn't specify a tag, this push will override the
latest tag in your repository.
docker push thejessleigh/goodreads-libib-converter:python3
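One way to picture tags is as named pointers inside a repository: pushing to a new tag adds a pointer, and pushing to an existing tag moves it. Here's a toy Python model of that idea. It's only an illustration, not how a registry is implemented: the push function is made up, and the image IDs are placeholders.

```python
DEFAULT_TAG = "latest"


def push(registry: dict, repository: str, image_id: str, tag: str = DEFAULT_TAG) -> None:
    """Point the given tag at image_id, replacing whatever it pointed at before."""
    registry.setdefault(repository, {})[tag] = image_id


registry = {}
repo = "thejessleigh/goodreads-libib-converter"

push(registry, repo, "sha256:aaa", tag="python2")  # placeholder image IDs
push(registry, repo, "sha256:bbb", tag="python3")
push(registry, repo, "sha256:bbb")                 # no tag given, so it lands on latest
print(registry[repo])

# A later untagged push moves latest, but python2 and python3 stay put.
push(registry, repo, "sha256:ccc")
print(registry[repo])
```

This is why explicit tags like python2 and python3 are handy: they keep older versions reachable even after latest has moved on.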
If you go to my Dockerhub profile you can see the
goodreads-libib-converter project, and pull both the Python 2 and Python 3 incarnations.
Now that I have a working Docker image, I want to put it into production so that anyone can convert their Goodreads library CSV into a Libib library CSV. I'm going to go about this using AWS, which requires a bit of setup.
The next installment in this series will go over setting up an AWS IAM account, setting up
awscli and configuring your local profiles, and creating an S3 bucket that your IAM account can access.