CodeWithDhanian

Posted on Apr 1

Database Replication (Master-Slave, Multi-Master) in System Design

#systemdesign

Database replication forms a cornerstone of robust system design, enabling databases to achieve higher scalability, availability, and data durability in large-scale applications. By duplicating data across multiple nodes, replication ensures that read and write operations can be distributed intelligently while protecting against single points of failure. This section explores the two dominant replication architectures—Master-Slave Replication and Multi-Master Replication—with complete technical depth, configuration examples, and operational realities.

Introduction to Database Replication

Database replication is the process of copying and maintaining database objects and data changes across multiple database instances. In a distributed system, a primary database records all modifications in a transaction log, which is then transmitted to one or more secondary instances. These secondary instances replay the changes, keeping their data sets synchronized.

The core objectives of replication are threefold. First, it delivers horizontal scalability by allowing read-heavy workloads to be offloaded to additional nodes. Second, it provides high availability through automatic failover mechanisms when a node becomes unreachable. Third, it improves disaster recovery by maintaining geographically dispersed copies that survive hardware failures, network partitions, or data center outages.

Replication operates at different levels: statement-based, row-based, or mixed. Statement-based replication logs the exact SQL statements executed on the source, while row-based replication logs the actual before-and-after row images, offering greater precision during schema changes or non-deterministic operations. Modern database engines default to row-based for its reliability.

Master-Slave Replication

Master-Slave Replication, also known as primary-replica replication, designates one node as the master responsible for all write operations while one or more slave nodes handle read operations and serve as hot standbys. This architecture is the most widely adopted pattern for read scaling because it maintains strong write consistency on a single master.

Architecture of Master-Slave Replication

The master node accepts all INSERT, UPDATE, and DELETE statements. Every committed transaction is written to the binary log (in MySQL) or WAL (Write-Ahead Log in PostgreSQL). A dedicated replication thread on each slave connects to the master, reads the log events, and applies them locally through an SQL thread or apply process.

Slaves are typically configured in read-only mode to prevent accidental writes. Applications route write queries exclusively to the master and read queries to any available slave. This separation is commonly implemented via connection pooling libraries or ORM routing logic.

Replication Modes: Synchronous and Asynchronous

Asynchronous replication is the default mode. The master commits a transaction, writes it to its binary log, and immediately returns success to the client. The replication to slaves happens in the background. This delivers the lowest write latency but introduces replication lag—a period during which slaves have not yet applied the latest changes. Lag can be monitored using metrics such as Seconds_Behind_Master in MySQL.

Synchronous replication requires the master to wait for acknowledgment from at least one slave before committing. This guarantees zero lag for acknowledged writes but increases write latency and reduces overall throughput. Semi-synchronous replication strikes a balance: the master waits for one slave to receive the event (but not necessarily apply it), providing improved durability with acceptable performance.

Setting Up Master-Slave Replication in MySQL

Below are complete, production-ready configuration files and commands for establishing Master-Slave Replication using MySQL 8.0+.

Master Server Configuration (/etc/mysql/my.cnf):

[mysqld]
server-id               = 1
log-bin                 = mysql-bin
binlog-format           = ROW
expire-logs-days        = 7
innodb_flush_log_at_trx_commit = 1
sync_binlog             = 1
gtid_mode               = ON
enforce_gtid_consistency = ON

Explanation of key parameters:

server-id = 1: Uniquely identifies the master in the replication topology.
log-bin = mysql-bin: Enables binary logging, the foundation of replication.
binlog-format = ROW: Ensures precise, row-level changes are logged.
gtid_mode = ON: Activates Global Transaction Identifiers, simplifying failover and topology management.

Slave Server Configuration (/etc/mysql/my.cnf):

[mysqld]
server-id               = 2
relay-log               = relay-log
read_only               = 1
gtid_mode               = ON
enforce_gtid_consistency = ON

Slave setup commands (executed after starting the slave instance):

-- On Master first (to obtain log position)
SHOW MASTER STATUS\G

-- On Slave
CHANGE MASTER TO
    MASTER_HOST = '192.168.1.100',
    MASTER_USER = 'repl_user',
    MASTER_PASSWORD = 'strong_password',
    MASTER_AUTO_POSITION = 1;

START SLAVE;

-- Verify replication status
SHOW SLAVE STATUS\G

The MASTER_AUTO_POSITION = 1 leverages GTID for automatic positioning, eliminating manual log file and position tracking. The replication user must have the REPLICATION SLAVE privilege granted on the master.

Multi-Master Replication

Multi-Master Replication allows multiple nodes to accept write operations simultaneously. Every master acts as both a write and read endpoint, with changes propagated bidirectionally. This architecture is essential for globally distributed applications requiring low-latency writes from multiple regions.

Architecture of Multi-Master Replication

In a multi-master setup, each node maintains its own binary log and replication threads that push changes to every other node. Circular replication or full-mesh replication topologies are common. Global Transaction Identifiers become critical to detect and prevent infinite loops of replicated events.

Because multiple masters can modify the same row concurrently, conflict resolution is mandatory. Without it, data divergence occurs.

Conflict Resolution Strategies

Last Writer Wins (LWW) uses timestamps or logical clocks to decide which change prevails. Version vectors or CRDTs (Conflict-free Replicated Data Types) provide more sophisticated merging. Custom business logic can also be implemented via triggers or application-level reconciliation jobs.

MySQL Group Replication offers built-in conflict detection using a certification-based mechanism that aborts conflicting transactions.

Setting Up Multi-Master Replication in MySQL (Circular Example)

For a two-node active-active circular multi-master setup:

Node 1 Configuration (/etc/mysql/my.cnf):

[mysqld]
server-id               = 1
log-bin                 = mysql-bin
binlog-format           = ROW
gtid_mode               = ON
enforce_gtid_consistency = ON
auto-increment-increment = 2
auto-increment-offset    = 1

Node 2 Configuration (/etc/mysql/my.cnf):

[mysqld]
server-id               = 2
log-bin                 = mysql-bin
binlog-format           = ROW
gtid_mode               = ON
enforce_gtid_consistency = ON
auto-increment-increment = 2
auto-increment-offset    = 2

Establish bidirectional replication:

On Node 1:

CHANGE MASTER TO
    MASTER_HOST = 'node2-ip',
    MASTER_USER = 'repl_user',
    MASTER_PASSWORD = 'strong_password',
    MASTER_AUTO_POSITION = 1;

START SLAVE;

On Node 2 (pointing back to Node 1):

CHANGE MASTER TO
    MASTER_HOST = 'node1-ip',
    MASTER_USER = 'repl_user',
    MASTER_PASSWORD = 'strong_password',
    MASTER_AUTO_POSITION = 1;

START SLAVE;

The auto-increment settings prevent primary key collisions across masters. In production, MySQL Group Replication or Galera Cluster is preferred over manual circular setups because they provide multi-primary mode with automatic quorum and flow control.

Operational Considerations in System Design

Master-Slave Replication excels in read-heavy applications such as content delivery platforms and analytics dashboards. It simplifies consistency guarantees but creates a write bottleneck on the single master.

Multi-Master Replication supports geo-distributed systems and high write throughput but demands careful conflict handling and monitoring of replication lag across all nodes. Application code must be designed to tolerate brief inconsistencies or implement compensating transactions.

Both strategies integrate with load balancers, connection routers, and orchestration tools like ProxySQL or PgBouncer to direct traffic intelligently. Monitoring of replication lag, binary log size, and thread status is non-negotiable for production stability.

A complete system design decision between Master-Slave and Multi-Master ultimately depends on the workload profile, latency requirements, and tolerance for eventual consistency.

System Design Handbook: For comprehensive coverage of all system design concepts including advanced replication patterns, load balancing, caching, and real-world case studies, purchase the System Design Handbook at https://codewithdhanian.gumroad.com/l/ntmcf. This resource equips engineers and architects with the practical knowledge needed to design scalable, production-grade systems.

DEV Community