Imagine that we have the following Postgresql database dump:
--
-- PostgreSQL database dump
--
-- Dumped from database version 11.5
-- Dumped by pg_dump version 11.3
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
--
-- Name: my_db; Type: DATABASE; Schema: -; Owner: postgres
--
CREATE DATABASE my_db WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'en_US.utf8' LC_CTYPE = 'en_US.utf8';
ALTER DATABASE my_db OWNER TO postgres;
\connect my_db
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
SET default_tablespace = '';
SET default_with_oids = false;
--
-- Name: clients; Type: TABLE; Schema: public; Owner: postgres
--
CREATE TABLE public.clients (
id integer NOT NULL,
name character varying(150) NOT NULL
);
ALTER TABLE public.clients OWNER TO postgres;
--
-- Name: clients_id_seq; Type: SEQUENCE; Schema: public; Owner: postgres
--
CREATE SEQUENCE public.clients_id_seq
AS integer
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
ALTER TABLE public.clients_id_seq OWNER TO postgres;
--
-- Name: clients_id_seq; Type: SEQUENCE OWNED BY; Schema: public; Owner: postgres
--
ALTER SEQUENCE public.clients_id_seq OWNED BY public.clients.id;
--
-- Name: clients id; Type: DEFAULT; Schema: public; Owner: postgres
--
ALTER TABLE ONLY public.clients ALTER COLUMN id SET DEFAULT nextval('public.clients_id_seq'::regclass);
--
-- Data for Name: clients; Type: TABLE DATA; Schema: public; Owner: postgres
--
COPY public.clients (id, name) FROM stdin;
1 Client 1
2 Client 2
\.
--
-- Name: clients_id_seq; Type: SEQUENCE SET; Schema: public; Owner: postgres
--
SELECT pg_catalog.setval('public.clients_id_seq', 2, true);
--
-- Name: clients clients_pkey; Type: CONSTRAINT; Schema: public; Owner: postgres
--
ALTER TABLE ONLY public.clients
ADD CONSTRAINT clients_pkey PRIMARY KEY (id);
--
-- PostgreSQL database dump complete
--
It is a simple database with a Clients
table and 2 records.
If we want to start a Postgresql Docker container with this dump loaded to share with our team, we can add this SQL file into the /docker-entrypoint-initdb.d/ folder inside the container, like explained into the Postgresql Image docs from DockerHub.
Initialization scripts If you would like to do additional initialization in an image derived from this one, add one or more *.sql, *.sql.gz, or *.sh scripts under /docker-entrypoint-initdb.d (creating the directory if necessary). After the entrypoint calls initdb to create the default postgres user and database, it will run any *.sql files, run any executable *.sh scripts, and source any non-executable *.sh scripts found in that directory to do further initialization before starting the service.
The following Dockerfile uses postgres:11-alpine as base image and copies test_dump.sql file to the entrypoint folder.
FROM postgres:11-alpine
COPY test_dump.sql /docker-entrypoint-initdb.d/
If we build this image
$ docker image build . -t preloaded_db:latest
and start a container with the generated image
$ docker container run -d --rm -p 5432:5432 -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres --name test_preloaded_db preloaded_db:latest
we can see in our database that the database was created. (password is postgres
)
$ psql -h localhost -U postgres
postgres=# \c my_db
psql (11.3, server 11.5)
You are now connected to database “my_db” as user “postgres”.
my_db=# SELECT * FROM clients;
id | name
— — + — — — — —
1 | Client 1
2 | Client 2
(2 rows)
Awesome. Now we have a docker image that has our database loaded. But if we check the log of this container
$ docker container logs test_preloaded_db
we can see CREATE DATABASE and CREATE TABLE commands.
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/test_dump.sql
SET
SET
SET
SET
SET
set_config
------------
(1 row)
SET
SET
SET
SET
CREATE DATABASE
ALTER DATABASE
...
This tell us that the dump is being processed every time we create the container. If we destroy this container and create a new one, the dump will be processed again. This works fine but if we have a big database with a big dump file, the startup process of the container will be slow because it can take some time to process the whole dump. We can fix it by keeping the database preloaded in the image.
Before we moving on, let’s destroy the container we created
$ docker container rm -f test_preloaded_db
Preloading the database in the image
To preload the database in the image, we need to tell our Dockerfile to execute the same entrypoint
of the original PostgreSQL image so it can execute the dump in the build step. Let’s use Multi-Stage build to divide our build in two steps. The first one will execute the entrypoint
with the dump file and the second one will copy the data folder to the resulting image.
# dump build stage
FROM postgres:11-alpine as dumper
COPY test_dump.sql /docker-entrypoint-initdb.d/
RUN ["sed", "-i", "s/exec \"$@\"/echo \"skipping...\"/", "/usr/local/bin/docker-entrypoint.sh"]
ENV POSTGRES_USER=postgres
ENV POSTGRES_PASSWORD=postgres
ENV PGDATA=/data
RUN ["/usr/local/bin/docker-entrypoint.sh", "postgres"]
# final build stage
FROM postgres:11-alpine
COPY --from=dumper /data $PGDATA
In the first step, we have the following instructions:
-
FROM postgres:11-alpine as dumper We define the base image our step will use.
postgres
with the11-alpine
tag in this case. -
COPY test_dump.sql /docker-entrypoint-initdb.d/ Copy the
test_dump.sql
file to the/docker-entrypoint-initdb.d/
folder. -
RUN [“sed”, “-i”, “s/exec "$@"/echo "skipping…"/”, “/usr/local/bin/docker-entrypoint.sh”] We need to execute this
sed
command in order to remove theexec "$@"
content that exists in thedocker-entrypoint.sh
file so it will not start the PostgreSQL daemon (we don’t need it on this step). -
ENV POSTGRES_USER=postgres; ENV POSTGRES_PASSWORD=postgres; ENV PGDATA=/data Sets environment variables to define
user
andpassword
and tell PostgreSQL to use/data
as data folder, so we can copy it in the next step -
RUN [“/usr/local/bin/docker-entrypoint.sh”, “postgres”] Execute the entrypoint itself. It will execute the dump and load the data into
/data
folder. Since we executed thesed
command to remove the$@
content it will not run the PostgreSQL daemon
The second step contains only this instruction:
-
COPY — from=dumper /data $PGDATA This will copy all files from
/data
folder from thedumper
step into the $PGDATA from this current step, making our data preloaded when we start the container (without needing to run the dump every time we create a new container).
If we build this Dockerfile
$ docker image build . -t preloaded_db:new
We can see in the output the dump being processed and after everything is finished, the image is built.
and we can start the container with this new image
$ docker container run -d --rm -p 5432:5432 -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres --name test_preloaded_db preloaded_db:latest
and our database is loaded
$ psql -h localhost -U postgres
psql (11.3, server 11.5)
Type “help” for help.
postgres=# \c my_db
psql (11.3, server 11.5)
You are now connected to database “my_db” as user “postgres”.
my_db=# SELECT * FROM clients;
id | name
— — + — — — — —
1 | Client 1
2 | Client 2
(2 rows)
But if we check the logs now, the dump is not being processed every time we create the container
$ docker container logs test_preloaded_db
2019–09–16 01:42:22.458 UTC [1] LOG: listening on IPv4 address “0.0.0.0”, port 5432
2019–09–16 01:42:22.458 UTC [1] LOG: listening on IPv6 address “::”, port 5432
2019–09–16 01:42:22.460 UTC [1] LOG: listening on Unix socket “/var/run/postgresql/.s.PGSQL.5432”
2019–09–16 01:42:22.470 UTC [18] LOG: database system was shut down at 2019–09–16 01:41:02 UTC
2019–09–16 01:42:22.473 UTC [1] LOG: database system is ready to accept connections
We can see that only the PostgreSQL startup is being done. No dump is being executed because it was executed in the build
image step.
Creating a Makefile to make the process easier
I like to create a Makefile to make easier the process of making a database dump and creating an image. This Makefile will contain commands to create the dump the database, create an image and tag it by date allowing me to have daily dumps on my registry to download.
default: all
.PHONY: default all fetch_dump
date := `date '+%Y-%m-%d'`
TARGET_IMAGE ?= my_app
all: check_vars fetch_dump generate_image push_to_registry clean finished
check_vars:
@test -n "$(DB_ENDPOINT)" || (echo "You need to set DB_ENDPOINT environment variable" >&2 && exit 1)
@test -n "$(DB_NAME)" || (echo "You need to set DB_NAME environment variable" >&2 && exit 1)
@test -n "$(DESTINATION_REPOSITORY)" || (echo "You need to set DESTINATION_REPOSITORY environment variable" >&2 && exit 1)
fetch_dump: DB_USER ?= postgres
fetch_dump:
@echo ""
@echo "====== Fetching remote dump ======"
@PGPASSWORD="$(DB_PASSWORD)" pg_dump -h $(DB_ENDPOINT) -d $(DB_NAME) -U $(DB_USER) > dump.sql
generate_image:
generate_image:
@docker build . -t $(TARGET_IMAGE):latest -t $(DESTINATION_REPOSITORY)/$(TARGET_IMAGE):latest -t $(DESTINATION_REPOSITORY)/$(TARGET_IMAGE):$(date)
push_to_registry:
@echo ""
@echo "====== Pushing image to repository ======"
@docker push $(DESTINATION_REPOSITORY)/$(TARGET_IMAGE)
clean:
@echo ""
@echo "====== Cleaning used files ======"
@rm -f dump.sql
finished:
@echo ""
@echo "Finished with success. Pushed image to $(DESTINATION_REPOSITORY)/$(TARGET_IMAGE)"
And I can execute the following command to generate my image with a new dump
$ make DB_ENDPOINT=127.0.0.1 DB_USER=postgres DB_PASSWORD=postgres DB_NAME=my_db TARGET_IMAGE=myapp-data DESTINATION_REPOSITORY=gcr.io/my_project
This command usually is integrated in a Cron job in some server to be executed daily. With this I can have on my image registry dumps from each day.
Another interesting thing to do is to add some SQL script to obfuscate users data. This article can be helpful if you want to achive this
Thanks ☕️
Top comments (1)
Awesome. Keeping this in my bookmarks, because I'll need it for sure. And probably I will need to adapt it to other databases.