I'm always surprised by the number of people who never heard about CDC, seriously. Say, you need to capture every change in a specific table, like an Update, an Insert or a Delete, how do you do that? they say TRIGGERS!!!
Ok, let's be honest here, it is not completely wrong, it will do the job of " capturing a change in a table", but be aware that you're going to face some performance issues because using this method, because Triggers are Database Operations that will run before or after a Data Manipulation Language (DML) actions, here you can read more about Triggers.
The Solution - CDC
Alright, but we want to talk about Log Based CDC. Every DML action in a specific table will be saved in a Transactional log file, so we can take advantage of that. Here a very good article about Log Based CDC.
CDC and Postgres - Hands-on
Enough talking, I'm also going to show a quick Demo with Postgres. Get your Docker ready!!
Here is my docker-compose.yml file, that's all that you'll need for this tutorial:
version: "3" services: db: image: "postgres:11" container_name: "my_postgres" environment: - POSTGRES_USER=postgres - POSTGRES_PASSWORD=postgres - POSTGRES_DB=shop_db ports: - "5432:5432" command: - "postgres" - "-c" - "wal_level=logical" volumes: - my_dbdata:/var/lib/postgresql/data volumes: my_dbdata:
Copy this file and run docker-compose up -d
Then run docker ps, you should see something like that
0c28a37615f4 postgres:11 "docker-entrypoint.s…" 24 hours ago Up 24 hours 0.0.0.0:5432->5432/tcp my_postgres_
Great! our Container us up and running, now let's connect to our Postgres database running in our docker. Just run docker exec -it my_postgres psql -U postgres postgres
Now, run the following script
create table employees (id int primary key, name varchar, age int);
Make sure the wal_level is set to logical and the max_replication_slots is set to at least 1. To set these values, you will need to be a superuser. In our example, it's all good as wal_level was set in our docker-compose.yml, but just in case you're trying to do it in your own Postgres database.
If you want to check the parameters, just run:
show max_replication_slots; show wal_level;
Alright, We're nearly there. Now, lets create the Slot, running the following command:
SELECT * FROM pg_create_logical_replication_slot('slot_test', 'test_decoding');
Now, insert something in our employee table that we created before and run:
Insert into employees (id, name, age) values (1, 'Thiago', '99'); SELECT * FROM pg_logical_slot_peek_changes('slot', NULL, NULL); --and we should see something like that 0/16CEE88 | 582 | BEGIN 582 0/16CEE88 | 582 | table public.employees: INSERT: id[integer]:1 name[character varying]:'Thiago' age[integer]:99 0/16CEFA0 | 582 | COMMIT 582 --Now lets run it again SELECT * FROM pg_logical_slot_peek_changes('slot', NULL, NULL); --We should see the same result... duuhhh! 0/16CEE88 | 582 | BEGIN 582 0/16CEE88 | 582 | table public.employees: INSERT: id[integer]:1 name[character varying]:'Thiago' age[integer]:99 0/16CEFA0 | 582 | COMMIT 582
pg_logical_slot_peek_changes vs pg_logical_slot_get_changes
Not sure if you've noticed but we've run pg_logical_slot_get_changes twice, the reason for that is that you can retrieve the log data also using pg_logical_slot_get_changes. Lets take a look at the second one:
SELECT * FROM pg_logical_slot_get_changes('slot', NULL, NULL); -- all good, our information is still here 0/16CEE88 | 582 | BEGIN 582 0/16CEE88 | 582 | table public.employees: INSERT: id[integer]:1 name[character varying]:'Thiago' age[integer]:99 0/16CEFA0 | 582 | COMMIT 582 -- Alright, once again postgres=# SELECT * FROM pg_logical_slot_get_changes('slot', NULL, NULL); (0 rows) --Holy shhh#@&..
Yeah, as you can see, pg_logical_slot_get_changes consumes once the files and then it's gone. It's amazing if you want to keep retrieving the data from the last time since you've checked for changes. It is going to be very useful for the next few posts, not saying much now. 😃
Anyway, I hope it was helpful and thanks for your time! 👋
Top comments (4)
Great article. However, I personally dislike the term "CDC".
Context: I work for an ETL company, and to customers, "CDC" can mean anything from Postgres WAL tracking to replication that tracks changes based on a field value change. So it's always a first step to talk them down and ask what they actually mean by "CDC".
It's a marketing term, basically.
Additionally "CDC" is what MSSQL refers to as their form of what is generically a Transaction Log.
Hey, thanks you for your comment.
"It's a marketing term, basically." That's a very interesting point, thanks for sharing it.
Great post. Just a minor typo. You created a slot called: "slot_test" and use "slot" in the following methods.
Thanks Louis, I will fix that. :)