
Franck Pachot for YugabyteDB


Multi-Region Async Table Broadcast with YugabyteDB xCluster 1-to-N replication

With PostgreSQL, you can add asynchronous read replicas to speed up reads from regions far from the primary (the one handling reads and writes) without affecting the primary's latency. This works when the reads can tolerate some staleness. If communication with the primary fails, you can choose to continue serving those reads, even if the staleness increases.

YugabyteDB distributes reads and writes across the cluster by sharding tables and indexes into tablets, which are Raft groups, based on their respective primary key and indexed columns. You may decide to place the Raft leaders in a specific region to reduce latency for the main users' region and keep the Raft followers in other regions. Reading from a single follower does not guarantee that you see the latest changes, because a quorum is needed to reach consensus on the current state. However, when a read-only transaction sets a read time slightly in the past, typically 15 or 30 seconds, a single follower knows whether it has received all writes as of that point in time.

This means we can read from a follower, which may be closer and therefore faster, when we accept a defined staleness. However, having followers in remote regions may impact write latency, as the leader waits for acknowledgment from a majority of followers. When the goal is only to allow reads with some staleness, asynchronous replication is sufficient and does not impact the primary. YugabyteDB offers three possibilities for read-only transactions that accept some staleness; a follower-read example follows this list.

  • Raft followers in the primary cluster of the YugabyteDB Universe: They participate in the quorum. A network partition may affect the latency for the write workload and could impact availability in the event of additional failures.
  • Read Replica in additional nodes of the same Universe: The YugabyteDB cluster can be extended with extra Raft peers in a remote region. These peers receive writes asynchronously and do not participate in the quorum. They serve as a complete data copy to reduce read latency from a remote region. However, they need to be connected to the cluster and cannot be used for disaster recovery.
  • xCluster replica in another YugabyteDB universe: On a per-table basis, the write-ahead log (WAL) can be fetched by another cluster to maintain an asynchronous copy. This has no impact on the primary cluster and remains available in case of a network partition. It can be used for disaster recovery or to maintain availability for read workloads as long as the increased staleness is acceptable.
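
To illustrate the first option, follower reads are enabled per session and apply only to read-only transactions. Here is a minimal sketch of the YSQL session settings (30000 ms is the default staleness, and the table name is illustrative):

-- enable reads from Raft followers for this session
set yb_read_from_followers = on;
-- accept results as of 30 seconds in the past
set yb_follower_read_staleness_ms = 30000;
-- follower reads apply only to read-only transactions
begin transaction read only;
select count(*) from messages;
commit;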

xCluster replication is ideal for replicating a single table across multiple regions without impacting the primary database's availability. It also improves read performance and availability.

Consider the following scenario: you have a basic table that stores messages, similar to a social media timeline. You require uninterrupted availability and fast reads for all users globally. Users accept that they may not see the most recent message immediately, because they perceive it more as a chat system, with send and receive, than as a traditional database.

Primary cluster

For this lab, I created a simple cluster using Docker.

# create a network
docker network create yb

# start the first node (RF=1) in lab.eu.zone1
docker run --name eu1 --network yb --hostname eu1 -d \
 yugabytedb/yugabyte \
 yugabyted start --cloud_location=lab.eu.zone1 \
 --fault_tolerance=zone --background=false
until docker exec eu1 postgres/bin/pg_isready -h eu1 ; do sleep 1 ; done | uniq

# start two more nodes (RF=3) in lab.eu.zone2 and lab.eu.zone3
for i in {2..3}
do
docker run --name eu$i --network yb --hostname eu$i -d \
 yugabytedb/yugabyte \
 yugabyted start --cloud_location=lab.eu.zone$i \
 --join eu1 --fault_tolerance=zone --background=false
until docker exec eu$i postgres/bin/pg_isready -h eu$i ; do sleep 1 ; done | uniq
done


I have designed a table to store messages along with their timestamps.

docker exec eu1 ysqlsh -h eu1 -c "
create table messages (
 primary key (tim, seq)
 , tim timestamptz default now()
 , seq bigint generated always as identity ( cache 1000 )
 , message text
)
"

Get information from the primary universe

To set up the replication, I will need the source cluster identifier, the table definition to create on the target, and the identifiers of the tables to replicate.

# UUID of the primary cluster
primary=$(
 docker exec eu1 yb-admin --master_addresses eu1:7100 get_universe_config | jq -r '.clusterUuid' | tee /dev/stderr
)

# DDL to create the table on the target
ddl=$(
 docker exec eu1 postgres/bin/ysql_dump -h eu1 -t messages | tee /dev/stderr
)

## Table IDs to replicate with xCluster
oids=$(
 docker exec eu1 yb-admin --master_addresses eu1:7100 \
  list_tables include_db_type include_table_id include_table_type |
 grep -E '^ysql.yugabyte.messages .* table$' |
 tee /dev/stderr |
 awk '{print $(NF-1)}' |
 tee /dev/stderr |
 paste -sd, ;
)


Asynchronous Replicas for xCluster

For each region (US, AU, JP), I need to start a new cluster with yugabyted start, get its cluster identifier with yb-admin get_universe_config, create the table with the DDL extracted from the primary, and then set up the replication using yb-admin setup_universe_replication.

I've configured these nodes to listen on all interfaces using the --advertise_address=0.0.0.0 option. This makes my tests easier: when I disconnect a container from the network, it keeps running because the yb-master and yb-tserver still receive their heartbeats through the localhost interface, while being isolated from the other containers.

for replica in us au jp
do
# start the cluster
docker run --name $replica --network yb --hostname $replica -d \
 yugabytedb/yugabyte \
 yugabyted start --advertise_address=0.0.0.0 \
 --cloud_location=lab.$replica.zone1 \
 --fault_tolerance=zone --background=false
tgt=$(
 docker exec $replica yb-admin --master_addresses localhost:7100 \
  get_universe_config | jq -r '.clusterUuid' | tee /dev/stderr
)
until docker exec $replica postgres/bin/pg_isready -h localhost ; do sleep 1 ; done | uniq
# create the table
echo "$ddl" | docker exec $replica ysqlsh -h localhost -e
# setup the replication
docker exec $replica yb-admin --master_addresses localhost:7100 \
 setup_universe_replication $primary eu1 "$oids" 
# show replication status
sleep 5
docker exec $replica yb-admin --master_addresses localhost:7100 \
 get_replication_status
done


You may decide to add more nodes to make the replicas resilient to one failure. The replication target is a regular cluster. You can also replicate to the same cluster if this makes sense for you.
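
For example, extending the "us" replica universe to three nodes is a regular cluster expansion. Here is a minimal sketch (node names and zones are illustrative, and it assumes the first node advertises an address that the joining nodes can reach rather than 0.0.0.0):

# add two more nodes to the "us" replica universe so that it tolerates one node failure (RF=3)
for i in {2..3}
do
docker run --name us$i --network yb --hostname us$i -d \
 yugabytedb/yugabyte \
 yugabyted start --join us \
 --cloud_location=lab.us.zone$i \
 --fault_tolerance=zone --background=false
done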

Testing the xCluster configuration

I can insert new messages on the primary:

# on the primary
docker exec eu1 ysqlsh -h eu1 -ec "insert into messages(message) values ('hello')"
docker exec eu2 ysqlsh -h eu2 -ec "select * from messages order by 1,2,3"



The messages are visible on each xCluster replica:

# on the replicas
docker exec us  ysqlsh -h localhost -ec "select * from messages order by 1,2,3"
docker exec au  ysqlsh -h localhost -ec "select * from messages order by 1,2,3"
docker exec jp ysqlsh -h localhost -ec "select * from messages order by 1,2,3"



If one region is isolated from the others, the primary cluster is still available for reads and writes, and replication continues to the reachable regions. From the isolated region, reads are still available, but stale.

docker network disconnect yb jp
docker exec eu1 ysqlsh -h eu1 -ec "insert into messages(message) values ('hello')"

docker exec au  ysqlsh -h localhost -ec "select * from messages order by 1,2,3"

docker exec jp ysqlsh -h localhost -ec "select * from messages order by 1,2,3"


When the network is back, the replication continues:

docker network connect yb jp
docker exec jp ysqlsh -h localhost -ec "select * from messages order by 1,2,3"

After reconnecting, the gap is resolved within a few seconds.
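
This can also be checked from the command line with the same yb-admin command used at setup time; a minimal sketch:

# verify that the replication stream on jp reports no errors after reconnecting
docker exec jp yb-admin --master_addresses localhost:7100 get_replication_status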

Comparing with Read Replicas

To show the difference, I have set up read replicas in all regions.

for replica in us au jp
do
# start the read replica
docker run --name rr-$replica --network yb --hostname rr-$replica -d \
 yugabytedb/yugabyte \
 yugabyted start --read_replica --join eu1.yb \
 --cloud_location=lab.$replica.zone2 \
 --fault_tolerance=zone --background=false
done

# configure for read replica
docker exec eu1 \
 yugabyted configure_read_replica new


As this is still the same YugabyteDB universe, replicating all tables, the read replicas show the same number of tablet peers, but no leaders.

The tablet details show an additional READ_REPLICA role in the Raft groups.
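
The same information can be read from the command line. Here is a minimal sketch using yb-admin to list the tablets of the table and then the peers of one of them (the tablet ID is a placeholder to copy from the output of the first command):

# list the tablets of the messages table with their leaders
docker exec eu1 yb-admin --master_addresses eu1:7100 list_tablets ysql.yugabyte messages
# show the tablet servers hosting one tablet; the read replica peers appear in addition to the voting peers
docker exec eu1 yb-admin --master_addresses eu1:7100 list_tablet_servers <tablet-id-from-above>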

This method is simpler. You can connect to the read replica nodes to perform both reads and writes. These operations are directed to the primary nodes, unless you mark the transaction as read-only and enable follower reads, in which case they can be served locally. However, even when reading locally, this approach only works while there is network connectivity to the primary nodes. In the event of a network partition, the primary nodes are not affected, which is the main advantage over a stretched cluster where followers participate in the quorum, but the read replicas become inaccessible.
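
Here is a minimal sketch of both behaviors, connecting to the rr-jp node (the staleness is the 30-second default):

# a write sent to the read replica node is routed to the primary nodes
docker exec rr-jp ysqlsh -h rr-jp -ec "insert into messages(message) values ('written via rr-jp')"
# a read-only transaction with follower reads enabled can be served locally, with bounded staleness
docker exec rr-jp ysqlsh -h rr-jp -ec "
 set yb_read_from_followers = on;
 begin transaction read only;
 select count(*) from messages;
 commit;
"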

It is also interesting to look at the timestamps of the write operations. I've run the following to dump the WAL that contained my 'hello' message:

# list the WAL files that contain the 'hello' messages and dump them,
# keeping one line of context around each WRITE_OP, prefixed with the byte offset
for wal in $(grep -rl hello /root/var/data/yb-data/tserver)
do
 /home/yugabyte/bin/log-dump $wal
done 2>/dev/null | grep -b1 WRITE_OP


In eu1, eu2, eu3, rr-jp, rr-au, and rr-us, I see the same timestamps because it is the same universe with its Hybrid Logical Clock:

3017-  hybrid_time: HT{ days: 19891 time: 08:00:26.018685 }
3072:  op_type: WRITE_OP
3092-  size: 202
--
3801-  hybrid_time: HT{ days: 19891 time: 08:01:38.036026 }
3856:  op_type: WRITE_OP
3876-  size: 202

However, xCluster is more like logical replication and the timestamp is the target cluster's HLC when the replication occurred:

In jp:

4593-  hybrid_time: HT{ days: 19891 time: 08:00:26.034767 }
4648:  op_type: WRITE_OP
4668-  size: 161
--
5244-  hybrid_time: HT{ days: 19891 time: 08:01:48.792729 }
5299:  op_type: WRITE_OP
5319-  size: 161

In au:

4593-  hybrid_time: HT{ days: 19891 time: 08:00:26.098103 }
4648:  op_type: WRITE_OP
4668-  size: 161
--
5244-  hybrid_time: HT{ days: 19891 time: 08:01:38.086656 }
5299:  op_type: WRITE_OP
5319-  size: 161

In us:

4593-  hybrid_time: HT{ days: 19891 time: 08:00:26.094410 }
4648:  op_type: WRITE_OP
4668-  size: 161
--
5244-  hybrid_time: HT{ days: 19891 time: 08:01:38.081754 }
5299:  op_type: WRITE_OP
5319-  size: 161

The timestamps are all different, and we can see that jp was disconnected for about 10 seconds.

How xCluster replication works

When aiming for higher availability by accepting some staleness, similar to eventual consistency, xCluster replication is the appropriate method. However, it's important to understand how it works. xCluster replication fetches the changes from the primary database and applies them to the local table in a different database. Therefore, only read-only operations should be performed on the replica. Modifying the table can cause the replica to diverge from the primary, unless two-way replication has been configured. For instance, rows deleted on the replica will not be synchronized again unless they are modified on the primary, which replicates them once more.
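
For completeness, two-way replication is simply a second one-way stream configured in the opposite direction, mirroring the commands used above. Here is a minimal sketch (the "us" universe UUID and table IDs must be gathered on that side first, exactly as was done on the primary; this shows the mechanism only, not a full active-active setup):

# gather the "us" universe UUID and the table IDs of messages on the "us" side
us_uuid=$(docker exec us yb-admin --master_addresses localhost:7100 \
 get_universe_config | jq -r '.clusterUuid')
us_oids=$(docker exec us yb-admin --master_addresses localhost:7100 \
 list_tables include_db_type include_table_id include_table_type |
 grep -E '^ysql.yugabyte.messages .* table$' | awk '{print $(NF-1)}' | paste -sd,)
# the primary universe now also consumes the changes made on "us"
docker exec eu1 yb-admin --master_addresses eu1:7100 \
 setup_universe_replication $us_uuid us "$us_oids"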

It's important to note that although it looks like traditional logical replication, xCluster bypasses the SQL layer on the replica. This means that triggers are not fired and constraints are not checked, as the data was already validated on the source. Additionally, the indexes are replicated rather than maintained by the replica. This is in contrast to PostgreSQL logical replication, which has scalability limitations and often requires dropping indexes and foreign keys to catch up after the initialization of large tables.

Thanks to its two-layer architecture, YugabyteDB combines the benefits of physical replication (bypassing the SQL layer on the destination) and logical replication (the ability to replicate a single table, as shown in this example), and provides multiple modes: synchronous, to the Raft quorum; or asynchronous, by pushing the changes to read replicas or pulling them from xCluster replicas.
