Diego Crespo

Learning to use Docker

Docker was released in 2013 and has quickly become a mainstay in the programming community. At first I avoided it, considering it too complex, intimidating, and a waste of time for small projects. But Docker is none of those things. Sure, wrapping a complex project with a Docker bowtie can be challenging, but that's because non-trivial projects usually require non-trivial solutions. The benefits you get once this is accomplished, though, are immense. But before we get to that, let's start simple. What is Docker?

Docker is a technology that helps you package and run your software applications in a consistent and efficient way, regardless of the computing environment they're running on.

This package is typically called a Docker Image. The running instance of an Image is called a Docker Container. You can think of a Docker Image like a programming language and technology agnostic version of a C# or Java project, or a Python virtual environment. In those examples, the language settings and all of the packages that you install are only for that specific project/language you are working on.

If others want to collaborate on the project, they will need to have the same settings and packages as you. For example, if you have a Python virtual environment you can run pip freeze > requirements.txt to get the list of packages your project needs, then pass the file along to someone else, who can run pip install -r requirements.txt to download the same packages. But what if the person trying to install the dependencies for your project has a different version of Python than you? They may run into dependency issues due to functions being deprecated or behavior changing as a library matures. Docker avoids this problem.
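For illustration, a requirements.txt for a small OCR project might look something like this (the package names are real, but the pinned versions here are made up for the example):

pytesseract==0.3.10
pdf2image==1.17.0

Pinning versions helps, but it still says nothing about which Python interpreter or which system-level tools the project expects.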

But there are more than just dependency conflicts that can doom a project. Sometimes projects require more than what the standard library, or 3rd party package ecosystem can provide.

Let’s say you want to do OCR on some PDFs. You might decide to use pytesseract. The installation instructions say that you need Google Tesseract OCR, as pytesseract is a wrapper around it:

Install Google Tesseract OCR (additional info how to install the engine on Linux, Mac OSX and Windows). You must be able to invoke the tesseract command as tesseract. If this isn’t the case, for example because tesseract isn’t in your PATH, you will have to change the “tesseract_cmd” variable pytesseract.pytesseract.tesseract_cmd. Under Debian/Ubuntu you can use the package tesseract-ocr. For Mac OS users. please install homebrew package tesseract.
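For example, if the tesseract binary is installed somewhere outside your PATH, you can point pytesseract at it directly. A minimal sketch (the path below is a placeholder for wherever tesseract actually lives on your machine):

import pytesseract

# Hypothetical path; replace with the location of your tesseract binary
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"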

Unfortunately, Tesseract can't be pip installed, so getting it set up can be tricky. I can't count how many times I've joined a project and my initial experience looks like this:

  1. Try to install additional dependencies

  2. Run into an error preventing the install

  3. Message Teams Chat, “Hey, has anyone run into this error when installing X?”

  4. "Oh yea, the link they give in the documentation is to an old version of X, this Stack Overflow post has the correct link"

  5. *Continues trying to install and runs into problem Y*

  6. Ask in the chat again

  7. "Oh it’s because your using version ZZZ for this, you have to downgrade it because Y is expecting a version QQQ instead."

Eventually you get it all working and go on your merry way. Three months later, it's summer and you have new interns. One of them has been assigned to this project to help. All of a sudden you see a Teams message in chat from the new intern.

  1. “Hey has anyone run into this error when installing X?”

  2. You: *Thinks very hard*. "Oh I remember this error! Hold on I have a link to a Stack Overflow post that has the fix"

Repeat steps 4-7 as your intern struggles with the same issues you had 3 months ago.

If only there were some way to construct your software applications that would allow them to behave in a consistent and efficient way, regardless of the computing environment they're running on. Wait! What's that I hear you say? There is? Why yes! Docker, of course!

Docker In Practice

Let’s use our previous example: a simple Python app that performs OCR on a PDF and prints the text to the screen.

The following code takes a trimmed PDF of the Autobiography of Benjamin Franklin (pages 7-17) and prints its text to the console.

import pytesseract
from pdf2image import convert_from_path

pdf_file = "trimmed_autobiography.pdf"

# Convert PDF pages to images
pages = convert_from_path(pdf_file, 300)  # DPI set to 300

# OCR each page and extract text
text = ""
for page in pages:
    text += pytesseract.image_to_string(page)

print(text)
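For reference, running this script directly on your own machine, outside of Docker, would mean installing both the Python packages and the system tools yourself. On a Debian/Ubuntu machine that would look roughly like this:

sudo apt-get install tesseract-ocr poppler-utils
pip install pytesseract pdf2image
python3 main.py

(pdf2image needs poppler-utils to convert PDF pages into images, and pytesseract needs the tesseract-ocr engine, as we saw above.)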

Now let’s look at the Dockerfile we need to build this.

FROM python:3.12-slim-bookworm

# Set the working directory
WORKDIR /app

# Install Tesseract OCR, poppler-utils, and their dependencies
RUN apt-get update && \
    apt-get install -y tesseract-ocr poppler-utils && \
    apt-get clean

# Install python packages
RUN pip3 install --no-cache-dir pytesseract pdf2image

# Copy the main.py and PDF file into the Docker image
COPY main.py trimmed_autobiography.pdf /app/

# Command to run the Python script
CMD ["python3", "main.py"]

Here's a breakdown of each line of this Dockerfile:

  1. FROM python:3.12-slim-bookworm: This line specifies the base Image used to build the Docker Image. In this case, we are using the Image named python with the tag 3.12-slim-bookworm, which means it's Python 3.12 on top of Debian Bookworm's 'slim' variant, a smaller version of Debian without extra packages. We use the official Python Image instead of a plain Debian Image because regular Debian blocks global pip installs into the system Python (which is normally a good thing).

  2. WORKDIR /app: This line sets the working directory inside the Docker container to /app. It's common practice to put the code for a project in /app, but your working directory can be anywhere you want. The folder will be created if it doesn't already exist. Subsequent commands will be executed from inside this directory.

  3. RUN apt-get update &&:

    1. RUN is a Docker instruction that executes commands in the container during the build process.
    2. apt-get update uses Debian's package manager, APT (the Advanced Package Tool), to update the lists of packages that are available for installation.
    3. && allows multiple commands to be run in a single RUN instruction, and the \ lets the instruction continue onto the next line.
    4. apt-get install -y tesseract-ocr poppler-utils &&: This line installs the packages tesseract-ocr and poppler-utils along with their dependencies using the apt-get install command. The -y flag automatically confirms any prompts during installation.
    5. apt-get clean: This command removes any unnecessary files and packages that were downloaded during the installation process. It helps to keep the Docker image size smaller.
  4. RUN pip3 install --no-cache-dir pytesseract pdf2image: This command installs Python packages pytesseract and pdf2image using pip. The --no-cache-dir option disables caching of downloaded packages and metadata.

  5. COPY main.py trimmed_autobiography.pdf /app/: This line copies files main.py and trimmed_autobiography.pdf from the host machine into the /app directory within the Docker container.

  6. CMD ["python3", "main.py"]: This is the default command to run when the container starts. It specifies to run the main.py Python script using the python3 interpreter.

A Dockerfile is kind of like a script you would run to automate setting up a new laptop. It specifies all the tools and dependencies you need, then goes and fetches and installs them.

Building the Docker Image

Now that we have a Dockerfile we can use it to make a Docker Image. If you open your terminal and run docker build -t book-slim . with this Dockerfile in the current directory, Docker will build a Docker Image based on the instructions provided in the Dockerfile.
Once the Image is created, you can run it with the command docker run --name book-slim-container book-slim. This executes your Docker Image named book-slim and gives the running container the name book-slim-container. This is helpful because every running container needs a name; if you don’t provide one, then Docker will make one up.

Looking at my Docker Desktop app, I can see that when I executed my Docker Image the first couple of times without specifying a name, Docker gave the containers the names “trusting_nightingale” and “nostalgic_shaw”. But when I specified a --name, that was the name it gave my container instead.

Docker desktop showing running containers
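You can see the same information from the terminal: docker ps lists the currently running containers, and adding the -a flag also shows containers that have already exited, which is where our finished book-slim-container ends up.

docker ps -a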

Getting back to our running program, after waiting patiently you should start seeing text printed out to the screen:

My father had also, by the same wife, four
children born in America, and ten others by
a second wife, making in all seventeen. I
remember to have seen thirteen seated to-
gether at his table, who all arrived at years
of maturity, and were married. I wus the
last of the sons, and the youngest chiid, ex-
cepting two daughters. I was born at Bos-
ton, in New England. My mother, the sec-
ond wife, was Abiah Folger, daughter of
Peter Folger, one of the first colonists of
New England, of whom Cotton Mather makes .
honorable mention, in his Ecclesiastical His-

So, we built our Docker Image, which specified all the dependencies we need to run the program; then we ran our Image, and it executed main.py to print the PDF's text to the terminal, like we asked it to. Run the command in the terminal again, but be warned: it will fail.

docker run --name book-slim-container book-slim

This is the error message you will get:

docker: Error response from daemon: Conflict. The container name "/book-slim-container" is already in use by container "5bf83a1f62e4e6191fd502b889282334a1e666b42f473ed43e3d61fc87ac013c". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.

The reason for this is that Docker requires each container to have a unique name, and we already created one named book-slim-container.

To resolve this, we have a few options...

  1. Remove the existing container: You can remove the existing container with the conflicting name using the docker rm command.
docker rm book-slim-container
or
docker rm 5bf83a1f62e4e6191fd502b889282334a1e666b42f473ed43e3d61fc87ac013c
  2. Rename the existing container: If you want to keep the existing container but use a different name for the new one, you can rename the existing container using the docker rename command.

docker rename book-slim-container new-name

  3. Restart and re-execute the container: An existing container can be restarted and then re-executed.

docker restart book-slim-container && docker exec book-slim-container python3 main.py

  4. Avoid the issue altogether: Because an already-built Docker Image is easy to run as a container, it is common to pass the --rm flag so the container is deleted after it runs, letting subsequent runs reuse the same container name and Image.

docker run --rm --name book-slim-container book-slim

Exploring containers

Let’s peel back the veil a bit on the container itself. Run the following command in the terminal.

docker run --name book-slim-container -it book-slim /bin/bash

Here’s what these new flags and the extra argument do:

  • -i: Stands for "interactive mode". This means that the container's standard input will remain open even if not attached
  • -t: Stands for "tty" or "pseudo-terminal". It allocates a pseudo-terminal for the container
  • Together, -it ensures that the container session is interactive, allowing you to interact with the shell running inside the container
  • /bin/bash: Overrides the default command specified in the Dockerfile (python3 main.py) and starts a bash shell (/bin/bash) inside the container

If the interactive command ran successfully, you should be inside the container in the WORKDIR /app.

Running inside a container

If you type ls you can see the two files that were copied into the container when the Docker Image was built. You can run python --version to see that the Python version is 3.12, and you can run main.py manually. It all works like you would expect. When you are done exploring the container, type exit in the terminal.
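A quick session inside the container might look roughly like this (prompt and output abbreviated):

root@<container-id>:/app# ls
main.py  trimmed_autobiography.pdf
root@<container-id>:/app# python --version
Python 3.12.x
root@<container-id>:/app# python3 main.py
(prints the OCR'd text, just like before)
root@<container-id>:/app# exit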

So the advantage of this approach is that you do the hard work of figuring out all the dependencies for a project once and encode it into a Dockerfile. Then your coworkers, friends, or collaborators just need to have Docker installed, and they can run two simple commands to get up and running with your project:

docker build -t book-slim .
docker run --rm --name book-slim-container book-slim

Mounting volumes

Let’s make our Python program more useful. Instead of printing out the text from the PDF, let’s save it to a file. Here is the new main.py:

import pytesseract
from pdf2image import convert_from_path

pdf_file = "trimmed_autobiography.pdf"

# Convert PDF pages to images
pages = convert_from_path(pdf_file, 300)  # DPI set to 300

# OCR each page and extract text
text = ""
for page in pages:
    text += pytesseract.image_to_string(page)

output_file = "extracted_text.txt"

with open(output_file, "w", encoding="utf-8") as file:
    file.write(text)

print("Text extracted and written to:", output_file)

If we were to rebuild the Docker Image (docker build -t book-slim ., so the new main.py gets copied in) and run it again with

docker run --rm --name book-slim-container book-slim

the extracted_text.txt would not be written to the directory where you ran the command. It would be written inside the container, and then cleaned up once the container finishes running. To make the written file accessible outside of the container, you want to mount a directory using the -v flag.

Mounting a host directory into a Docker container allows you to share files and directories between the host machine (where Docker is running) and the container. When you mount a directory, you're essentially creating a link between a directory on your host system and a directory inside the container. This has three advantages:

  1. Access to Host Files: Any files or directories in the mounted directory on the host machine are accessible from within the container.

  2. Persistence of Data: Changes made to files or directories in the mounted directory from within the container persist on the host machine, and vice versa. This means that if a file is created, modified, or deleted in the mounted directory from either the host or the container, the change will be reflected in both places.

  3. Sharing Resources: Mounting a directory can be useful for sharing resources such as code, configuration files, or data between the host and the container. This is particularly handy during development or when you need to provide input data or retrieve output data from a container.

Here is the full command to replicate this behavior.

docker run --rm --name book-slim-container -v .:/app book-slim

To make sure the file is written locally, I pass .:/app to the -v flag. This mounts the current directory (.) on the host machine to the /app directory inside the container.
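One small caveat: depending on your Docker version, -v may insist on an absolute host path. If it complains about the relative path ., the more portable form is to spell the path out, for example:

docker run --rm --name book-slim-container -v "$(pwd)":/app book-slim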

After a few moments, extracted_text.txt appears, confirming that I successfully wrote the file to disk.

Showing the extracted text file in IDE

And that's the basics of Docker. There are still many other aspects of Docker worth getting into (Docker Compose, managing multiple Docker containers, exposing ports, etc.), but this should at least get you started. In the meantime, I hope this article made Docker more approachable, and I hope you use it in one of your next projects!

Call To Action 📣

Hi 👋 my name is Diego Crespo and I like to talk about technology, niche programming languages, and AI. I have a Twitter and a Mastodon, if you’d like to follow me on other social media platforms. If you liked the article, consider checking out my Substack. And if you haven’t, why not check out another article of mine listed below? Thank you for reading and giving me a little of your valuable time. A.M.D.G

Top comments (1)

Stewart Midwinter

Thank you for this excellent introduction to docker. I now have a much better idea of what it is and how to use it.