<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Francesco Tonini</title>
    <description>The latest articles on DEV Community by Francesco Tonini (@francescotonini).</description>
    <link>https://dev.to/francescotonini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F30039%2F9612ef44-819d-42f9-b994-be4b36b48b83.png</url>
      <title>DEV Community: Francesco Tonini</title>
      <link>https://dev.to/francescotonini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/francescotonini"/>
    <language>en</language>
    <item>
      <title>5 most common Dockerfile mistakes</title>
      <dc:creator>Francesco Tonini</dc:creator>
      <pubDate>Thu, 25 Feb 2021 08:41:44 +0000</pubDate>
      <link>https://dev.to/francescotonini/5-most-common-dockerfile-mistakes-1ja8</link>
      <guid>https://dev.to/francescotonini/5-most-common-dockerfile-mistakes-1ja8</guid>
      <description>&lt;p&gt;Docker is great. You cannot deny it. Popularity is still growing and the internet is full of examples for every possible programming language, framework, and environment. When it is time to deploy something the first thing I do is search on Google for an example of Dockerfile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwou4yhoas2r7yta58gzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwou4yhoas2r7yta58gzz.png" alt="That's about it"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is fine, right? Unfortunately, most of the examples available online are insecure by design. In my first post here I am going to explore some common pitfalls and possible solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running as root
&lt;/h2&gt;

&lt;p&gt;This is probably the most underrated issue. By default, containers run as root. Hypothetically, if an attacker gains control of the container, they can cause harm to the host.&lt;/p&gt;

&lt;p&gt;One easy and reliable fix is to create a dedicated user inside the container, set its home as the working directory, and switch to it with the &lt;code&gt;USER&lt;/code&gt; instruction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nginx:latest&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;useradd &lt;span class="nt"&gt;--create-home&lt;/span&gt; dockeruser
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /home/dockeruser&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; dockeruser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using &lt;em&gt;latest&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Many examples use the &lt;em&gt;latest&lt;/em&gt; tag for their base image. While that is fine for tutorials, production Dockerfiles should always pin a specific image tag, one that is not supposed to change and break your build.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;latest&lt;/em&gt; tag is updated every time a new version of the image is pushed, so your build can suddenly break.&lt;/p&gt;

&lt;p&gt;Suppose you are deploying a container with &lt;em&gt;python:latest&lt;/em&gt; as a base image. At that time &lt;em&gt;latest&lt;/em&gt; refers to Python 3.6. Weeks later you have to rebuild the image, but the build fails: some dependencies cannot be resolved. Why? You haven't touched them! By now &lt;em&gt;python:latest&lt;/em&gt; refers to Python 3.9, which, incidentally, does not support some of your dependencies.&lt;/p&gt;
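&lt;p&gt;The fix is a one-line change: pin the tag (the exact version shown here is illustrative).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# Pinned: this build keeps using Python 3.6 until you bump the tag yourself
FROM python:3.6-slim
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;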

&lt;h2&gt;
  
  
  Minimize the number of layers
&lt;/h2&gt;

&lt;p&gt;Docker creates a layer for each &lt;code&gt;RUN&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, and &lt;code&gt;ADD&lt;/code&gt; instruction. The more layers, the bigger the image and the slower the build.&lt;br&gt;
Whenever possible, combine multiple commands into a single instruction. Remember that you can use &lt;code&gt;\&lt;/code&gt; to split a command across multiple lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nginx:latest&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    git &lt;span class="se"&gt;\
&lt;/span&gt;    rsync &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Do not create &lt;strong&gt;huge&lt;/strong&gt; containers
&lt;/h2&gt;

&lt;p&gt;One container, one service. Docker containers are not virtual machines. If you have many services to deploy, just create many containers.&lt;/p&gt;
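&lt;p&gt;With Docker Compose, for example, each service naturally gets its own container; a minimal sketch (service names and image tags are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;version: "3"
services:
  web:            # one container for the web server
    image: nginx:1.19
    ports:
      - "80:80"
  api:            # one for the application
    image: node:14
  db:             # one for the database
    image: postgres:13
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;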

&lt;h2&gt;
  
  
  Use layer caching
&lt;/h2&gt;

&lt;p&gt;When building images, Docker looks for layers in the cache that it can reuse. This way, no duplicate layers are created and consecutive builds are faster.&lt;/p&gt;

&lt;p&gt;But there is a catch. If you copy the source code before installing the dependencies, every time you update the code Docker will invalidate that layer and every one after it. In other words, you are going to reinstall the dependencies on every build even though they are identical.&lt;br&gt;
Fortunately, there is a quick fix: make sure that layers that change rarely come before layers that change often.&lt;br&gt;
For instance, instead of copying the whole source code and then running &lt;code&gt;npm install&lt;/code&gt;, copy just &lt;code&gt;package.json&lt;/code&gt;, run &lt;code&gt;npm install&lt;/code&gt;, and then copy the rest of the code. This way a change to the source will not trigger &lt;code&gt;npm install&lt;/code&gt; again; the dependency layer will come straight from the cache.&lt;/p&gt;
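&lt;p&gt;A minimal Node.js sketch of this pattern (the image tag and file names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;FROM node:14
WORKDIR /app
# Dependencies change rarely: copy the manifests and install them first
COPY package.json package-lock.json ./
RUN npm install
# Source code changes often: copy it last so edits do not invalidate the npm layer
COPY . .
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;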

&lt;p&gt;That's about it! There are many more tips and tricks to make Dockerfiles faster, more maintainable, and more secure. These are the five most common mistakes, and everyone should fix them ASAP.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you like it, share and follow me for more! 😀&lt;/em&gt; &lt;/p&gt;

</description>
      <category>docker</category>
    </item>
    <item>
      <title>My solution to the Google HashCode 2020 online round</title>
      <dc:creator>Francesco Tonini</dc:creator>
      <pubDate>Fri, 21 Feb 2020 21:59:14 +0000</pubDate>
      <link>https://dev.to/francescotonini/my-solution-to-the-google-hashcode-2020-online-round-14md</link>
      <guid>https://dev.to/francescotonini/my-solution-to-the-google-hashcode-2020-online-round-14md</guid>
      <description>&lt;p&gt;Hi everyone! This post goes through the story behind the development of my solution for the Google HashCode 2020 online round. If you have never heard of Google HashCode, it is a team coding competition made by Google to solve engineering problems. There is no programming language constraints, just a problem to solve in a fixed amount of time. After the online round, the best teams will be invited by Google for the final round.&lt;/p&gt;

&lt;p&gt;Yesterday's problem was to plan which books to scan from a set of libraries. Each book has its own score and the goal was to maximize the total score of scanned books. The full problem statement will be available on the Google HashCode website shortly.&lt;/p&gt;

&lt;p&gt;Enough of that, let's get our hands dirty. This solution is available on &lt;a href="https://github.com/francescotonini/hashcode-books"&gt;GitHub&lt;/a&gt; and it is written in C# (don't be scared by that: if you know a bit of Java and lambdas you'll be fine).&lt;/p&gt;

&lt;h2&gt;
  
  
  A naive approach
&lt;/h2&gt;

&lt;p&gt;First of all, I had to find a metric to rank the libraries and pick the best ones. I decided to calculate the score of each library, sort the libraries in descending order, and output them in that order.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IOh9Eis6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/plaeo5lm7rjkrljsv1kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IOh9Eis6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/plaeo5lm7rjkrljsv1kx.png" alt="Oh boy, that's bad"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unsurprisingly, the total score was terrible. What I didn't take into account were two facts: first, two or more libraries may hold the same book, but it only needs to be scanned once; second, what should I do when two libraries have the same score? How should I prioritize one over the other?&lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer duplicates, more score?
&lt;/h2&gt;

&lt;p&gt;So I was back to the drawing board. This time, when two or more libraries have the same score, I pick the one with the shortest signup time, allowing more books to be scanned in parallel. Also, a library's score now counts only books that haven't been scanned yet. That should fix it!&lt;/p&gt;

&lt;h2&gt;
  
  
  So, job done?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5VzO5_FO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ou6vmoc7z6u7rvmht6u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5VzO5_FO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ou6vmoc7z6u7rvmht6u.jpg" alt="Well yes, but actually no"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While I did improve the score on one dataset, the others performed just like the first attempt. Also, dataset "D" was really slow, but we'll get back to that later on.&lt;/p&gt;

&lt;p&gt;So, as you may expect, I was back to the drawing board again; yet, something wasn't adding up. What if, instead of sorting by score and then by signup time, I sorted by signup time first and then by score?&lt;/p&gt;
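&lt;p&gt;The core of the new ordering can be sketched in C# like this (the collection and property names are illustrative, not the actual ones from the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Shorter signup first; ties broken by score, descending
var ordered = libraries
    .OrderBy(l =&gt; l.SignupTime)
    .ThenByDescending(l =&gt; l.Score)
    .ToList();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;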


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Oh well, that was the kind of improvement I was looking for.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AvHVSAPR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3vafiffhkjq8hh35qv4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AvHVSAPR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3vafiffhkjq8hh35qv4f.png" alt="That's better"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About dataset "D - tough choices"
&lt;/h2&gt;

&lt;p&gt;While I was happy with the overall result, I had to find a solution for dataset "D", which the implementation above handled painfully slowly. Looking at the data I realized that every book has a score of 65, meaning that I didn't need to sum individual book scores: a library's score is simply the number of books it can scan per day, multiplied by the number of days, multiplied by 65. This was fundamental to keep the execution time at a reasonable level (remember that we have a limited amount of time).&lt;/p&gt;
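&lt;p&gt;The shortcut boils down to a single expression (the names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Every book in dataset D is worth 65 points, so a library's
// score is just its daily throughput times the remaining days times 65
var score = library.BooksPerDay * remainingDays * 65;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;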


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This has been a hell of a ride, but it was worth it. It's not the perfect score, but I am more than happy about the result.&lt;/p&gt;

&lt;p&gt;If you would like to see my implementation, head over to &lt;a href="https://github.com/francescotonini/hashcode-books"&gt;GitHub&lt;/a&gt;. If you like this article, please consider sharing it with friends and colleagues. Also, if you have any suggestions, don't be shy. Ciao!&lt;/p&gt;

</description>
      <category>google</category>
      <category>hashcode</category>
      <category>csharp</category>
    </item>
  </channel>
</rss>
