Jibin Liu

Posted on Sep 15, 2018

How to persist data in a Docker container

#docker #container #datapersistence

TL;DR

Containers are supposed to be lightweight. Storing unnecessary data in them makes them heavy to create and run. Docker provides several ways to mount storage from the host machine into containers, and volumes are the most commonly used. They can persist application data and also share data between multiple containers. (Local volumes cannot be shared between Docker services, though. You will need shared storage instead.)

Background

I heard about Docker and containers a while ago, but I'm new to using them. I only recently started exploring them because they make it easier to build web services and deploy them on multiple operating systems. (They are fantastic tools!)

One of my web services has to create, update, or activate a separate Python virtual environment and then run a task in it. Different requests sometimes need different virtual environments. The requirements.txt file for each virtual environment is synced from time to time, and pip install is then called to update the environment. pip install can be slow, so it should be called as few times as possible. That means the web service needs to persist the virtual environments so that it doesn't have to repeat the create/update work when it restarts.
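One way to keep pip install calls to a minimum, sketched here as an assumption rather than what my service actually does, is to store a hash of requirements.txt next to each environment and reinstall only when the hash changes. The marker file name and the helper functions below are hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical marker file that remembers which requirements were installed.
MARKER_NAME = ".requirements.sha256"


def requirements_digest(requirements_text: str) -> str:
    """Return a stable fingerprint of the requirements.txt content."""
    return hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()


def needs_install(env_dir: Path, requirements_text: str) -> bool:
    """True if the environment was never installed or its requirements changed."""
    marker = env_dir / MARKER_NAME
    if not marker.is_file():
        return True
    return marker.read_text().strip() != requirements_digest(requirements_text)


def record_install(env_dir: Path, requirements_text: str) -> None:
    """Call this after a successful pip install to remember what was installed."""
    env_dir.mkdir(parents=True, exist_ok=True)
    (env_dir / MARKER_NAME).write_text(requirements_digest(requirements_text))
```

This only pays off if the environment directory itself survives a restart, which is exactly the persistence problem the rest of this post is about.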

This raises a problem: every time a new image is built for the web service, it doesn't contain the virtual environments stored in the old container, so the service always starts very cold. My first idea was to commit the changes from the old container into the new image. However, this greatly increases the size of the image and the container.

After a few hours of digging through the Docker documentation, I realized that I had been thinking of containers as fully self-contained, while they are more powerful when they work together with the host machine.

Solution

Docker provides three ways to mount data into a container: volumes, bind mounts, and tmpfs mounts [1].

Volumes are stored in a part of the host filesystem that Docker manages (/var/lib/docker/volumes/ on Linux) and should not be modified by other applications.
Bind mounts can be anywhere on the host and can be modified by other applications.
tmpfs mounts live in the host's memory and are never written to the host filesystem.

Generally speaking, volumes are the go-to solution for most data persistence problems in a container. A volume can be created either with the docker volume create command or when starting a container.
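To make the difference between the three mount types concrete, here is a minimal Python sketch that builds the value passed to docker run --mount for each of them. The helper names mount_spec and docker_run_command are made up for illustration, and the bind-mount host path in the usage note is only an example.

```python
def mount_spec(kind, target, source=None):
    """Build the comma-separated value for `docker run --mount`."""
    if kind not in ("volume", "bind", "tmpfs"):
        raise ValueError("unknown mount type: " + kind)
    if kind == "bind" and not source:
        raise ValueError("a bind mount needs a host path as its source")
    if kind == "tmpfs" and source:
        raise ValueError("a tmpfs mount lives in memory and has no source")
    parts = ["type=" + kind]
    if source:
        # Volume name for volumes, host path for bind mounts.
        parts.append("source=" + source)
    parts.append("target=" + target)
    return ",".join(parts)


def docker_run_command(image, spec):
    """Return the argument list for running `image` with one mount."""
    return ["docker", "run", "--mount", spec, image]
```

For example, mount_spec("volume", "/root/.virtualenv", "virtualenv") gives type=volume,source=virtualenv,target=/root/.virtualenv, while mount_spec("bind", "/root/.virtualenv", "/srv/envs") points the same container path at a host directory instead.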

Example: my solution

The docker documentation is here [2].

1. Create volume

First, let's create a volume named virtualenv to store the virtual environments.

➤ docker volume create virtualenv

We can inspect the volume with the following command:

➤ docker volume inspect virtualenv
[
    {
        "CreatedAt": "2018-09-15T05:29:36Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/virtualenv/_data",
        "Name": "virtualenv",
        "Options": {},
        "Scope": "local"
    }
]
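If a script needs to know where the volume lives on the host, the JSON printed by docker volume inspect can be parsed directly. The sketch below works on the sample output shown above, and the function name volume_mountpoints is just an illustration.

```python
import json

# Sample output of `docker volume inspect virtualenv`, copied from above.
SAMPLE_OUTPUT = """
[
    {
        "CreatedAt": "2018-09-15T05:29:36Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/virtualenv/_data",
        "Name": "virtualenv",
        "Options": {},
        "Scope": "local"
    }
]
"""


def volume_mountpoints(inspect_output):
    """Map each volume name to its mount point on the host."""
    return {v["Name"]: v["Mountpoint"] for v in json.loads(inspect_output)}


print(volume_mountpoints(SAMPLE_OUTPUT))
```

In a real script, the input could come from subprocess.run(["docker", "volume", "inspect", "virtualenv"], capture_output=True, text=True, check=True).stdout instead of the hard-coded sample.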

2. Create container

The structure of the example app looks like this:

Dockerfile
main.py: the entrypoint
create_env.sh (used to create another virtual environment)

main.py checks whether the virtual environment "my_env" exists and creates it if it doesn't. We're going to mount the volume created above as the ~/.virtualenv folder in the container (/root/.virtualenv, because the container runs as root).

I use the following Dockerfile to build a very simple Python image:

FROM python:3.7
WORKDIR /app
COPY . /app
RUN pip install virtualenv
CMD ["python", "./main.py"]

main.py looks like this:

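import os
import subprocess

# The volume is mounted at /root/.virtualenv, so this path survives restarts.
ENV_PATH = '/root/.virtualenv/my_env'

def main():
    if os.path.exists(ENV_PATH):
        print('my_env already exists')
    else:
        # check=True raises CalledProcessError if the script fails,
        # so we never report success for an environment that wasn't created.
        subprocess.run(['bash', 'create_env.sh'], check=True)
        print('my_env created')

if __name__ == '__main__':
    main()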

And the one-line create_env.sh:

cd ~/.virtualenv/ && virtualenv my_env
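The same check-then-create logic can also be written as a reusable function, which makes it easy to try out without Docker. This is a sketch, not the code used in the example app: ensure_env and its create parameter are hypothetical names, and the default creator assumes the virtualenv command is installed, as in the Dockerfile above.

```python
import os
import subprocess


def ensure_env(env_dir, create=None):
    """Create the environment only if it is missing.

    Returns True if the environment was created, False if it already existed.
    """
    if os.path.isdir(env_dir):
        return False
    if create is None:
        # Default: call the virtualenv CLI and fail loudly if it errors.
        def create(path):
            subprocess.run(["virtualenv", path], check=True)
    create(env_dir)
    return True
```

Passing a different create function (for example os.makedirs in a test) lets us exercise the logic against a temporary directory that stands in for the mounted volume.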

3. Start container with volume mounted

We first build the python image:

➤ docker build -t docker-data-persistence .

Then, to mount the volume, we use the --mount flag:

➤ docker run \
  --mount source=virtualenv,target=/root/.virtualenv \
  docker-data-persistence

Using base prefix '/usr/local'
New python executable in /root/.virtualenv/my_env/bin/python
Installing setuptools, pip, wheel...done.
my_env created

As we can see above, the first time we run the container it creates the virtual environment "my_env", because it doesn't exist in the volume yet. If we run it a second time, it reports that "my_env" already exists.

➤ docker run \
  --mount source=virtualenv,target=/root/.virtualenv \
  docker-data-persistence

my_env already exists

4. Inspect the volume

We can look at the files in the volume (in a hacky way [3]) to verify its contents:

➤ docker run -it \
  --mount source=virtualenv,target=/root/.virtualenv \
  docker-data-persistence \
  find /root/.virtualenv/my_env/bin

/root/.virtualenv/my_env/bin
/root/.virtualenv/my_env/bin/python3
/root/.virtualenv/my_env/bin/activate.csh
/root/.virtualenv/my_env/bin/easy_install-3.7
/root/.virtualenv/my_env/bin/python
/root/.virtualenv/my_env/bin/python-config
/root/.virtualenv/my_env/bin/easy_install
/root/.virtualenv/my_env/bin/python3.7
/root/.virtualenv/my_env/bin/activate
/root/.virtualenv/my_env/bin/pip
/root/.virtualenv/my_env/bin/activate.fish
/root/.virtualenv/my_env/bin/pip3
/root/.virtualenv/my_env/bin/wheel
/root/.virtualenv/my_env/bin/activate_this.py
/root/.virtualenv/my_env/bin/pip3.7

5. Delete the volume

To delete the volume, we can use docker volume rm <volume-name>. However, you can't delete a volume while a container uses it, even if that container has exited. Remove the container first with docker rm <container-id>, then delete the volume.

➤ docker volume rm virtualenv
Error response from daemon: remove virtualenv: volume is in use - [dc4425b806a67a9002d68703cdd9854feba44e43d591278b4eb2869f43c0da6d]
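The error message lists the IDs of the containers that still reference the volume. As a small sketch, they can be pulled out of the message with a regular expression. The function name containers_in_use is hypothetical, and the pattern assumes the message keeps the format shown above.

```python
import re

# The error message from above, used as sample input.
SAMPLE_ERROR = (
    "Error response from daemon: remove virtualenv: volume is in use - "
    "[dc4425b806a67a9002d68703cdd9854feba44e43d591278b4eb2869f43c0da6d]"
)


def containers_in_use(error_text):
    """Return the container IDs listed in a 'volume is in use' error."""
    match = re.search(r"volume is in use - \[([^\]]*)\]", error_text)
    if match is None:
        return []
    # Split on commas and whitespace in case several IDs are listed.
    return [cid for cid in re.split(r"[,\s]+", match.group(1)) if cid]


print(containers_in_use(SAMPLE_ERROR))
```

Each returned ID can then be passed to docker rm before retrying docker volume rm virtualenv. Alternatively, docker ps -a --filter volume=virtualenv lists the containers that use the volume.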

References

Top comments (5)

Praveen Bisht • Sep 20 '19

Hi, thanks for your article

Can we store data from multiple containers in one volume?

Container1
-- Folder
   -- File1.json
   -- File2.json

Container2
-- Folder
   -- File1.json
   -- File2.json

This is what we currently have. Essentially, the files get written in each container by the end users of the app, and there is one container with our backend that has to scan all the files and show them in a file manager as a nested structure. So we want to make these containers independent of user-generated data, and scanning a shared volume would be much easier for us.

Can we do it like this?

Shared Volume
-- Container1Data
   -- Folder
      -- File1.json
      -- File2.json
-- Container2Data
   -- Folder
      -- File1.json
      -- File2.json

kozelm007 • Jul 3 '19

Hi Liu. Thanks for your explanation, it helped me a lot as I'm pretty new to Docker. Just wondering, is the volume persistent when the computer is switched off and on again?