Bobur Umurzokov

Posted on Jun 22, 2024 • Originally published at learn.glassflow.dev

Data Synchronization in Microservices with PostgreSQL, Debezium, and NATS: A Practical Guide

#programming #tutorial #microservices #opensource

In modern software development, the microservices architecture has become popular for its scalability and flexibility. Despite its benefits, it brings significant challenges, particularly in data synchronization across various services. By leveraging PostgreSQL, Debezium, and NATS, we can establish an efficient and reliable method for synchronizing data across microservices.

This guide provides a step-by-step approach to building a data synchronization stack using popular technologies. You'll get an introduction to Change Data Capture (CDC), understand the challenges, and receive ready-to-use code snippets.

The Challenge of Data Synchronization in Microservices

Microservices are designed to be loosely coupled and independently deployable, with each having its own database. While this independence is advantageous, it presents a significant challenge in maintaining data consistency throughout the system. Traditional batch processing methods, such as ETL (extract, transform, load), can be cumbersome and often fail to provide real-time updates, which are essential for many modern applications.

Introduction to Change Data Capture (CDC)

Change Data Capture (CDC) is a process that tracks all data changes in a database and extracts them so they can be reflected in other systems, ensuring they have accurate and up-to-date copies.

For a more in-depth discussion on this topic, refer to our previous article, "Understanding Database Synchronization: An Overview of Change Data Capture."

The Pipeline Components

Although PostgreSQL is used as the source database in this example, the same principles can be applied to other databases like MySQL or MariaDB. Below are brief descriptions of the components used:

PostgreSQL: A Robust Database Solution

PostgreSQL is a powerful, open-source object-relational database system known for its advanced features and reliability. In a microservices architecture, each service uses its own PostgreSQL instance, ensuring data isolation and integrity.

Debezium: Change Data Capture

Debezium is an open-source platform for Change Data Capture (CDC). It monitors databases and captures row-level changes, emitting them as event streams. When integrated with PostgreSQL, Debezium captures every change made to the database in real time.

NATS: The Messaging Backbone

NATS is a central messaging system known for its lightweight design, high throughput, and low latency. It acts as the conduit for communicating data changes across different microservices.

Setting Up PostgreSQL

To enable CDC with PostgreSQL, several key concepts need to be understood:

Database Write-Ahead Log (WAL): WAL ensures data integrity by logging all changes before they are applied to the database files. In PostgreSQL, WAL records every change, maintaining the atomicity and durability of transactions.

Replication Slot: Replication slots are crucial for streaming replication. They ensure that the master server retains the necessary WAL logs for replicas, even if the replicas are temporarily disconnected. PostgreSQL supports two types of replication slots: physical and logical.

To configure PostgreSQL for CDC, the wal_level must be set to logical. Additionally, you may need to adjust the max_level_senders and max_replication_slots settings. Below is an example of a docker-compose file for setting up PostgreSQL:



version: '3.9'
services:
  postgres:
    image: postgres:latest
    command: "-c wal_level=logical -c max_wal_senders=5 -c max_replication_slots=5"
    environment:
      POSTGRES_DB: glassflowdb
      POSTGRES_USER: glassflowuser
      POSTGRES_PASSWORD: glassflow
    ports:
      - "5432:5432" 
    volumes:
      - ./data/postgres:/var/lib/postgresql/data

We can now start the database by running:



docker compose up

Let’s create a simple table that we will track later on. Here is an example of how to create a table in PostgreSQL:



$ psql -h 127.0.0.1 -U glassflowuser -d glassflowdb
Password for user glassflowuser:
psql (14.10, server 16.1 (Debian 16.1-1.pgdg120+1))
WARNING: psql major version 14, server major version 16.
         Some psql features might not work.
Type "help" for help.

glassflowdb=# CREATE TABLE accounts (
    user_id serial PRIMARY KEY,
    username VARCHAR ( 50 ) UNIQUE NOT NULL,
    password VARCHAR ( 50 ) NOT NULL,
    email VARCHAR ( 255 ) UNIQUE NOT NULL,
    created_on TIMESTAMP NOT NULL,
  last_login TIMESTAMP 
);

Setting up NATS

Update the docker compose yaml to include the NATs server configuration:



   nats:
    image: nats:latest
    ports:
      - "4222:4222"
    command:
      - "--debug"
      - "--http_port=8222"
      - "--js"

Setting Up Debezium

We are going to use a ready-to-use version of Debezium. Update the docker compose yaml to include the Debezium service configuration:



debezium:
    image: docker.io/debezium/server:latest
    volumes:
      - ./debezium/conf:/debezium/conf
    depends_on:
      - postgres
      - nats

To get it working, we need to define a configuration for Debezium. This configuration is specified in a file named application.properties.



debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=postgres
debezium.source.database.port=5432
debezium.source.database.user=glassflowuser
debezium.source.database.password=glassflow
debezium.source.database.dbname=glassflowdb
debezium.source.topic.prefix=glassflowtopic
debezium.source.plugin.name=pgoutput

debezium.sink.type=nats-jetstream
debezium.sink.nats-jetstream.url=nats://nats:4222
debezium.sink.nats-jetstream.create-stream=true
debezium.sink.nats-jetstream.subjects=postgres.*.*

It is important to note that the source connector for PostgreSQL is typically set up using Debezium's default decoderbufs plugin. In this article, however, we will use pgoutput instead, so we need to set debezium.source.plugin.name=pgoutput.

How Debezium Achieves CDC with PostgreSQL

Before proceeding, let's discuss how Debezium implements Change Data Capture (CDC) with PostgreSQL.

Debezium connects to PostgreSQL as a replication client, which involves setting up a Debezium connector for PostgreSQL. This requires PostgreSQL to be configured with wal_level set to logical.
When set up, Debezium creates a logical replication slot in PostgreSQL. This slot ensures that relevant WAL entries are retained until Debezium processes them, preventing data loss even if the Debezium connector goes offline temporarily.
Debezium reads changes from the WAL through the replication slot. It decodes these changes from their binary format into a structured format (e.g., JSON) that represents the SQL operations.
Each decoded change is then emitted as a separate event. These events contain all necessary information about the database changes, such as the type of operation (INSERT, UPDATE, DELETE), the affected table, and the old and new values of the modified rows.
Debezium acts as a NATS producer, publishing each change event to a NATS topic (usually one topic per table).
Consumers can subscribe to these NATS topics to receive real-time updates about database changes. This enables applications and microservices to react to data changes as they happen.

Testing Our Setup

If everything is set up correctly, any changes made to the PostgreSQL database, such as updates to the accounts table, will be sent to NATS.



$ psql -h 127.0.0.1 -U glassflowuser -d glassflowdb

glassflowdb=# INSERT INTO "public"."accounts" ("username", "password", "email", "created_on")
               VALUES ('user2', 'beseeingya', 'user2@email.com', NOW());
glassflowdb=# DELETE FROM accounts WHERE username = 'user3';

When we create a consumer, we can observe all the events sent by Debezium.



$ nats consumer add DebeziumStream viewer --ephemeral --pull --defaults > /dev/null

$ nats consumer next --raw --count 100 DebeziumStream viewer | jq -r '.payload'

{

  "before": null,

  "after": {

    "user_id": 4,

    "username": "user2",

    "password": "beseeingya",

    "email": "user2@email.com",

    "created_on": 1700505308855573,

    "last_login": null

  },

  "source": {

    "version": "2.2.0.Alpha3",

    "connector": "postgresql",

    "name": "glassflowtopic",

    "ts_ms": 1700505308860,

    "snapshot": "false",

    "db": "glassflowdb",

    "sequence": "[\"26589096\",\"26597648\"]",

    "schema": "public",

    "table": "accounts",

    "txId": 742,

    "lsn": 26597648,

    "xmin": null

  },

  "op": "c",

  "ts_ms": 1700505309220,

  "transaction": null

}

{

  "before": {

    "user_id": 3,

    "username": "",

    "password": "",

    "email": "",

    "created_on": 0,

    "last_login": null

  },

  "after": null,

  "source": {

    "version": "2.2.0.Alpha3",

    "connector": "postgresql",

    "name": "glassflowtopic",

    "ts_ms": 1700505331733,

    "snapshot": "false",

    "db": "glassflowdb",

    "sequence": "[\"26598656\",\"26598712\"]",

    "schema": "public",

    "table": "accounts",

    "txId": 743,

    "lsn": 26598712,

    "xmin": null

  },

  "op": "d",

  "ts_ms": 1700505331751,

  "transaction": null

}

Why It Matters

Integrating Debezium with PostgreSQL and NATS for Change Data Capture (CDC) is essential for building advanced, real-time data pipelines. Once set up, this integration offers numerous possibilities for data utilization and integration across various systems and applications. For instance, change events captured from the database can be streamed to a data lake, allowing organizations to aggregate large amounts of data in a centralized repository for complex analysis and machine learning purposes. These data streams can also be fed directly into analytics dashboards, providing real-time insights and decision-making capabilities. This is particularly useful for monitoring key metrics, detecting anomalies, or understanding user behavior in near real-time.

Additionally, the system can trigger automated workflows in response to specific data changes, such as sending notifications or updating other systems. The flexibility and scalability of this setup make it an ideal foundation for building comprehensive and responsive data-driven ecosystems, catering to a wide range of use cases from business intelligence to operational monitoring.

Conclusions

Synchronizing data across microservices can be challenging, but using PostgreSQL, Debezium, and NATS offers a robust solution. This setup ensures real-time data consistency across services while adhering to the principles of microservices architecture. By leveraging these technologies, we can build scalable, resilient, and efficient systems that meet the demands of modern application development.

Remember, this guide is a starting point. Depending on your specific requirements, further customization and configuration may be necessary.

Discover various use cases of real-time data pipelines with code samples.