Chris Noring for Microsoft Azure

Posted on May 5, 2019 • Edited on May 12, 2019

Improve your Dockerfile, best practices

#docker #tutorial #showdev #beginners

Follow me on Twitter, happy to take your suggestions on topics or improvements /Chris

A Deeper look at the Dockerfile

Ok, so you know your way around Docker. You might have picked it up in my five-part Docker series or somewhere else. Either way, you are at the point where you want to go from understanding the basics to doing things better. That's what this article shows you: how to build on your existing knowledge of the fundamentals, Dockerfiles in particular.

Resources

Best practices on Dockerfiles. There is a long list of tips in here. Sooner or later you will want to have a look and improve your setup.
Push your Docker images to a container registry in the Cloud. Your Docker images need to be stored somewhere: on Docker Hub, in a private registry that only you and your colleagues can access, or why not in a private registry in the Cloud?

What we know about Dockerfile

We know that the Dockerfile is like a recipe file where we can specify things like the OS image to base it on, which libraries should be installed, environment variables, commands we want to run and much more. Everything is specified right there in the file, so it's super clear what you are getting. It's a great advancement from the days when things just worked on our machine, or when we spent hours or days installing things. It's progress.

Our Dockerfile sample

We've created a Dockerfile to give you an idea of what it can look like. Let's discuss the various parts of the file to better understand it. Here goes:

# Dockerfile
FROM node:latest

WORKDIR /app

COPY . .

RUN npm install

EXPOSE 3000

ENTRYPOINT ["node", "app.js"]

This is a pretty typical-looking file. We select an OS image, set a working directory, copy the files we need, install some libraries, open up a port and finally run the application. So what's wrong with that?

OS image size

At first glance everything looks the way we expect, but on closer inspection we can see that we are using node:latest as the image. Let's build this into a Docker image with the command:

docker build -t optimize/node .

Ok, let's now run docker images to see our image and get some more stats on it:

It weighs in at 899 MB.

Ok, we have nothing to compare with yet, so let's change the base image to one called node:alpine and rebuild our image:

77.7 MB, WOW!!! That's a huge difference, our Docker image is more than ten times smaller. Why is that?

This image is based on the Alpine Linux Project. In general, the Alpine Linux images are much smaller than those of normal distributions. It comes with some limitations, so have a read of the image's documentation, but in general it's a safe choice.
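
After the swap, the sample Dockerfile would look something like the sketch below. Only the FROM line differs from the earlier file. The remark about pinning a version tag such as node:20-alpine is an assumption added for illustration, not something measured in this article.

```dockerfile
# Dockerfile
# Same sample as before, only the base image is swapped for the Alpine variant.
# Assumption: you may prefer a pinned tag such as node:20-alpine so that
# rebuilds do not silently pick up a new Node.js version.
FROM node:alpine

WORKDIR /app

COPY . .

RUN npm install

EXPOSE 3000

ENTRYPOINT ["node", "app.js"]
```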

The cache

For every instruction you specify in the Dockerfile, Docker creates another image layer. Before creating one, however, Docker first checks the cache to see whether an existing layer can be reused.

When we come to instructions like ADD and COPY, we should know how they operate in the context of the cache. For both of these instructions, Docker calculates a checksum for each file and stores it in the cache. On a new build of the Docker image, each checksum is compared with the cached one. If it differs, because a file has changed, the cache is invalidated: Docker carries out the instruction and creates a new image layer. Every instruction after that point is then executed again as well, rather than being taken from the cache.

Order matters

The way Docker operates is to try to reuse as much as possible. The best thing we can do is to place the instructions, in the Dockerfile, from the least likely to change to the most likely to change.

What does that mean?

Let's look at the top of our Dockerfile:

FROM node:alpine

WORKDIR /app

Here we can see that the FROM command happens first, followed by WORKDIR. Neither of these commands is likely to change, so they are correctly placed at the top.

What is likely to change though?

Well, you are building an application, so the source files of your app change all the time, and every now and then you realize you need another library, which means running npm install again. It makes sense to place those instructions further down in the file.

What do we gain by doing this?

Speed. When we've ordered the commands as efficiently as possible, Docker can reuse more cached layers and our image builds faster. So in summary, ADD, COPY and RUN are commands that should happen later in the Dockerfile.
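
Applied to our sample, one common way to order things is sketched below. It assumes the project has a package.json (and possibly a package-lock.json) in its root, which the sample doesn't spell out. Copying those files and installing before the rest of the source means the npm install layer can come from the cache as long as the dependency list is unchanged.

```dockerfile
# Dockerfile
# Least likely to change: base image and working directory.
FROM node:alpine

WORKDIR /app

# Changes occasionally: the dependency manifests.
# Assumption: package.json (and optionally package-lock.json) sit in the project root.
COPY package*.json ./

# Re-run only when the manifests copied above have changed.
RUN npm install

# Changes most often: the application source.
COPY . .

EXPOSE 3000

ENTRYPOINT ["node", "app.js"]
```

With this order, editing app.js invalidates the cache from the second COPY onwards, so npm install is not repeated.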

Minimize the layers

Every command you enter creates a new image layer (in recent Docker versions, only RUN, COPY and ADD add layers that increase the image size). Keep the number of commands to a minimum and group them where you can. Instead of writing:

RUN command
RUN command2

Organize them like so:

RUN command && \
    command2
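
As a more concrete sketch, here is how an install step and its cleanup could be grouped. The cleanup command is an assumption for illustration and not part of our sample: npm cache clean --force empties npm's download cache. Because both commands run in a single RUN instruction, the cache files are gone before the layer is saved.

```dockerfile
# Dockerfile (fragment)
FROM node:alpine

WORKDIR /app

COPY package*.json ./

# Install and clean up in one instruction, so only one layer is created
# and the npm download cache is not baked into the image.
RUN npm install && \
    npm cache clean --force
```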

Include only what you need

When you build an app, it easily consists of a ton of files, but what you actually need in order to create your Docker image is a much smaller number of them. If you create a .dockerignore file, you can define patterns that ensure that when we include files, we only get the ones our container needs.
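
To make that concrete, here is a sketch of how a .dockerignore file and COPY work together. The patterns listed in the comments are hypothetical examples for a Node.js project, so adjust them to your own repo. The .dockerignore file itself is a plain text file with one pattern per line, placed in the root of the build context.

```dockerfile
# Dockerfile
# Assumption: a .dockerignore file next to this Dockerfile contains
# patterns such as the following (one per line):
#
#   node_modules
#   .git
#   *.log
#
FROM node:alpine

WORKDIR /app

# Copies everything from the build context EXCEPT what .dockerignore excludes,
# so a local node_modules folder or the .git history is left out of the image.
COPY . .

RUN npm install
```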

Define a start script

Whether you use the command CMD or ENTRYPOINT, you should NOT call the application directly, like so: node app.js. Instead, try to define a starter script and call that, like this: npm start.

Why you ask?

We want to stay flexible and make it unlikely that we have to change this instruction. We might end up changing how we start our app by gradually adding flags to it, like so: node app.js --env=dev --seed=true. You get the idea, it's potentially a moving target. By relying on npm start, a startup script, the details of how the app starts live in package.json and the Dockerfile can stay the same.
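
In the Dockerfile, that could look like the sketch below. It assumes your package.json defines a start script, for example "start": "node app.js" in its scripts section. That entry is an assumption and isn't shown in the sample above.

```dockerfile
# Dockerfile
FROM node:alpine

WORKDIR /app

COPY . .

RUN npm install

EXPOSE 3000

# Assumption: package.json contains a "start" entry under "scripts",
# e.g. "start": "node app.js". Flags added later go into that script,
# so this line does not need to change.
ENTRYPOINT ["npm", "start"]
```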

Use LABEL

Using the LABEL command is a great way to describe your Dockerfile better. You could use it to organize your images, to help with automation, and potentially for other use cases. You know best what information makes sense to put there, but it exists to help you bring order to all your images, so leverage it to your advantage. A label is a key-value pair, like so: LABEL [key]=[value]. Every LABEL command can have multiple labels. In fact, it's considered best practice to collect all your labels under one LABEL command. You can do so by separating each key-value pair with a space character, or like so:

LABEL key=value \
      key2=value2
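
For a more concrete picture, here is a sketch with made-up values. The keys and values below are hypothetical and chosen for illustration, so use whatever naming scheme fits your team. Note that values containing spaces need quotes.

```dockerfile
# Dockerfile (fragment)
FROM node:alpine

# Hypothetical labels, all collected under a single LABEL instruction.
LABEL maintainer="jane@example.com" \
      version="1.0.0" \
      description="Sample Node.js app used in this article"
```

You can read the labels back by running docker inspect on the built image.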

Rely on default ports with EXPOSE

EXPOSE is what you use to declare which ports the container listens on. To make sure we can actually talk to the container on that port, we use the -p flag together with docker run, like so: docker run -p [external]:[exposed docker port]. It's considered best practice to set the exposed port to the default port of whatever you are running, like port 80 for an Apache server and 27017 for a MongoDB database.
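
Here is a small sketch of the two halves working together. The host port 8080 and the image name my-web are arbitrary choices made for this example.

```dockerfile
# Dockerfile (fragment)
# httpd is the official Apache image on Docker Hub; Apache listens on port 80 by default.
FROM httpd:alpine

# Document the default port. EXPOSE on its own does not publish anything.
EXPOSE 80

# Assumed usage (image name and host port are arbitrary):
#   docker build -t my-web .
#   docker run -p 8080:80 my-web
# The server is then reachable on port 8080 of the host.
```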

Be explicit, use COPY over ADD

At first glance it looks like COPY and ADD do the same thing, but there is a difference. ADD is able to extract TAR files as well, which COPY can't do. So be explicit: use COPY when you mean to copy files, and only use ADD when you need one of its specific features, like the TAR extraction mentioned.
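
Side by side, the difference looks like this. The archive name vendor.tar.gz is hypothetical, and the sketch assumes such a file exists in the build context.

```dockerfile
# Dockerfile (fragment)
FROM node:alpine

WORKDIR /app

# Plain copy: package.json ends up in /app unchanged.
COPY package.json ./

# ADD with a local tar archive: the archive is unpacked into /app/vendor/
# instead of being copied as a single file.
# Assumption: vendor.tar.gz is a hypothetical archive in the build context.
ADD vendor.tar.gz /app/vendor/
```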

Summary

There are many more best practices to follow when it comes to Dockerfiles, but the biggest gain I've mentioned in this post comes from using the smallest image possible, like alpine. It can work wonders for your image size, especially if storage is something you pay for.

Have a read of the Dockerfile best practices docs for more great tips.

Top comments (8)

Harley • May 5 '19 • Edited

Minimize the layers

Cache misses are more likely to happen when grouping RUN commands. If it inflates the end image too much, multi-stage images can be used to have a more bare final image without all the build stages.

Edit: otherwise, I'd say these are all sensible guidelines.

derek • May 6 '19 • Edited

Specifically in the context of node images if you use distroless it will add a few ~5-10mbs in size compared to alpine but you get more security 🔒; Such as: no package manager tools let alone a shell, etc.

Tim • May 6 '19 • Edited

You should also remove package managers caches. Not sure about npm, but if you run yarn install && yarn cache clean you reduce that layer size by 50%.

Chris Noring • May 6 '19

Hi Tim. Appreciate you making the article better. I'll make sure to update it and give you credit :)

Ramon • May 19 '19

The main reason to use a .dockerignore file is because the docker command has to send your complete build-context to the docker daemon, which will then build the image.

If your context is the root of the repo for instance, it will send your whole project to the daemon before evaluating the COPY and ADD commands. This can become expensive when you have a lot of dependencies or store build artifacts in your project.

.dockerignore stops the docker CLI from sending specified files and directories to the daemon.