Finny Collins
How to set up MariaDB Galera cluster for high availability

MariaDB Galera Cluster is a synchronous multi-master replication solution that lets every node in the cluster accept both reads and writes. If one node goes down, the others keep serving traffic without manual failover. This guide walks through the actual steps to get a three-node Galera cluster running, from package installation to production hardening.


What Galera cluster gives you

Galera uses a certification-based replication model. Instead of shipping binary logs asynchronously like standard replication, it certifies each transaction's writeset on every node at commit time. The result is often called "virtually synchronous" replication: a committed transaction is guaranteed to be applied on every node, so there is no replication lag to babysit the way there is with asynchronous replicas.

A few things that make Galera different from traditional replication setups:

  • Every node is a master. You can write to any node without proxy tricks or failover scripts.
  • Transactions are either committed on all nodes or rolled back everywhere. No split-brain scenarios under normal operation.
  • Adding or removing nodes is relatively straightforward. The cluster handles state transfer automatically when a new node joins.

That said, Galera is not a magic bullet. Write conflicts between nodes still cause rollbacks, and network latency between nodes directly affects write performance. It works best when your nodes are in the same datacenter or at least on a low-latency network.

Prerequisites

Before you start, make sure your environment meets these requirements. All three servers should be able to communicate with each other on the necessary ports.

  • Operating system: Ubuntu 22.04 / Debian 12 / RHEL 8+ (this guide uses Ubuntu 22.04)
  • MariaDB version: 10.6 or newer (10.11 recommended)
  • Servers: 3 nodes minimum, each with at least 2 GB RAM
  • Network: low-latency links between nodes; ports 3306, 4567, 4568 and 4444 open
  • Root access: sudo or root on all three nodes

For this guide, we will use three nodes:

  • Node 1: 192.168.1.101
  • Node 2: 192.168.1.102
  • Node 3: 192.168.1.103

Replace these IPs with your actual server addresses.

Step 1: Install MariaDB on all nodes

Run these commands on each of the three servers. First, add the MariaDB repository to get the latest stable version.

sudo apt update
sudo apt install -y software-properties-common
curl -LsS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash -s -- --mariadb-server-version=mariadb-10.11
sudo apt update
sudo apt install -y mariadb-server mariadb-client galera-4

After installation, run the security script on each node:

sudo mariadb-secure-installation

Set a root password, remove anonymous users, disable remote root login and drop the test database. Then stop MariaDB on all three nodes before configuring Galera:

sudo systemctl stop mariadb

Do not start the service again until the configuration is done. Galera needs specific settings before the first bootstrap.

Step 2: Configure the Galera cluster

On each node, create a Galera configuration file. The settings are mostly the same across nodes, except for wsrep_node_address and wsrep_node_name.

Node 1 configuration

Create or edit /etc/mysql/mariadb.conf.d/60-galera.cnf:

[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
innodb_buffer_pool_size=512M

# Galera settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="mariadb_galera_cluster"
wsrep_cluster_address="gcomm://192.168.1.101,192.168.1.102,192.168.1.103"
wsrep_node_address="192.168.1.101"
wsrep_node_name="node1"
wsrep_sst_method=mariabackup
wsrep_sst_auth="sst_user:YourSSTPassword"

Node 2 configuration

Same file, same content, but change the node-specific lines:

wsrep_node_address="192.168.1.102"
wsrep_node_name="node2"

Node 3 configuration

wsrep_node_address="192.168.1.103"
wsrep_node_name="node3"

Everything else stays identical. The wsrep_cluster_address lists all three nodes on every server so each node knows how to find the others.

Why these InnoDB settings matter

innodb_autoinc_lock_mode=2 is required for Galera. It enables interleaved auto-increment generation, which prevents deadlocks during parallel writes across nodes. binlog_format=ROW is also mandatory, because Galera replicates row-based events and cannot work with statement-based logging.
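As a quick sanity check after the cluster is up, you can confirm these two settings took effect on each node. The helper below is purely illustrative; the commented query shows where the live values would come from.

```shell
# Illustrative sanity check for the Galera-critical settings. On a live node
# the actual values would come from the server, e.g.:
#   sudo mariadb -Nse "SELECT @@innodb_autoinc_lock_mode, @@binlog_format"
check_setting() {
  # $1 = variable name, $2 = expected value, $3 = actual value
  if [ "$3" = "$2" ]; then
    echo "OK: $1=$3"
  else
    echo "FAIL: $1=$3 (expected $2)"
  fi
}

check_setting innodb_autoinc_lock_mode 2 2
check_setting binlog_format ROW ROW
```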

Step 3: Create the SST user

State Snapshot Transfer (SST) is how Galera synchronizes a new or rejoining node with the rest of the cluster. The mariabackup method is the least disruptive option since it does not lock the donor node.

Bootstrap the first node temporarily to create the SST user:

sudo galera_new_cluster

Then connect to MariaDB and create the user:

CREATE USER 'sst_user'@'localhost' IDENTIFIED BY 'YourSSTPassword';
GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sst_user'@'localhost';
FLUSH PRIVILEGES;

Keep this node running. We will start the other nodes next.

Step 4: Start the remaining nodes

With Node 1 running as the bootstrap node, start MariaDB on Node 2:

sudo systemctl start mariadb

Wait for it to finish the state transfer before starting Node 3:

sudo systemctl start mariadb

Each node will pull a full copy of the data from an existing cluster member during its first join. You can watch the progress in the MariaDB error log:

sudo tail -f /var/log/mysql/error.log

Look for the line WSREP: Member 2.0 (node3) synced with group to confirm the node has joined successfully.

Step 5: Verify the cluster

Connect to any node and check the cluster status:

SHOW STATUS LIKE 'wsrep_cluster_size';
SHOW STATUS LIKE 'wsrep_cluster_status';
SHOW STATUS LIKE 'wsrep_connected';
SHOW STATUS LIKE 'wsrep_ready';

You should see:

  • wsrep_cluster_size: 3
  • wsrep_cluster_status: Primary
  • wsrep_connected: ON
  • wsrep_ready: ON

If wsrep_cluster_size is less than 3, one of the nodes has not joined yet. Check the error logs on that node for connection issues or firewall problems.
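If you want to script that check, something like the sketch below can report each node's view of the cluster. The IPs match the example cluster above; the size value would normally be parsed from the SHOW STATUS query (shown in the comment) rather than passed in by hand.

```shell
# Hedged helper: report each node's view of the cluster size. On a live
# cluster the second argument would come from a query such as:
#   mariadb -h "$node" -Nse "SHOW STATUS LIKE 'wsrep_cluster_size'" | awk '{print $2}'
report_size() {
  # $1 = node address, $2 = reported cluster size
  if [ "$2" -eq 3 ]; then
    echo "$1: cluster size $2 (OK)"
  else
    echo "$1: cluster size $2 (node missing, check its error log)"
  fi
}

report_size 192.168.1.101 3
report_size 192.168.1.102 2
```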

To test replication, create a database on Node 1:

CREATE DATABASE galera_test;

Then connect to Node 2 or Node 3 and verify it exists:

SHOW DATABASES;

You should see galera_test on all three nodes.

Firewall configuration

Galera uses several ports for communication. If you are running a firewall (and you should be), open these ports between the cluster nodes:

sudo ufw allow from 192.168.1.0/24 to any port 3306
sudo ufw allow from 192.168.1.0/24 to any port 4567
sudo ufw allow from 192.168.1.0/24 to any port 4568
sudo ufw allow from 192.168.1.0/24 to any port 4444

Port 3306 is for MySQL client connections. Port 4567 handles Galera cluster communication and replication. Port 4568 is used for Incremental State Transfer (IST). Port 4444 is for State Snapshot Transfer (SST) when a node needs a full data copy.

Only open these ports between cluster nodes. There is no reason to expose Galera ports to the public internet.
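Before the first start, it can help to confirm the ports are actually reachable between nodes. The hypothetical helper below just enumerates the node/port pairs from this guide's example cluster; feeding each pair to `nc -z -w2` (assuming netcat is installed) gives a quick reachability probe.

```shell
# Enumerate every node/port pair Galera needs, so each one can be probed,
# e.g.: list_galera_endpoints | while IFS=: read -r h p; do nc -z -w2 "$h" "$p"; done
# The IPs are this guide's example addresses.
list_galera_endpoints() {
  for node in 192.168.1.101 192.168.1.102 192.168.1.103; do
    for port in 3306 4567 4568 4444; do
      echo "$node:$port"
    done
  done
}

list_galera_endpoints
```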

Production hardening tips

Running Galera in production requires a bit more attention than a dev setup. Here are the settings and practices that actually matter.

Tune gcache.size for IST

When a node briefly disconnects and reconnects, Galera can use IST instead of a full SST if the missing transactions are still in the gcache. Set this large enough to cover typical downtime:

wsrep_provider_options="gcache.size=1G"

For most workloads, 512M to 2G is reasonable. If your write volume is high, increase it. A larger gcache means nodes can rejoin without expensive full state transfers.
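A back-of-the-envelope way to size the gcache is to divide it by your replication write rate. The arithmetic sketch below uses made-up sample figures; on a live node you would take two readings of the wsrep_replicated_bytes status variable (see the comment) a fixed interval apart.

```shell
# Estimate how much downtime a 1G gcache covers. The byte counts are sample
# figures; on a live node, sample twice with something like:
#   mariadb -Nse "SHOW STATUS LIKE 'wsrep_replicated_bytes'" | awk '{print $2}'
bytes_t0=120000000                    # first sample (assumed)
bytes_t1=126000000                    # second sample, 60 s later (assumed)
interval=60
gcache_bytes=$((1024 * 1024 * 1024))  # gcache.size=1G

rate=$(( (bytes_t1 - bytes_t0) / interval ))  # bytes replicated per second
coverage=$(( gcache_bytes / rate ))           # seconds of downtime IST can absorb
echo "write rate: ${rate} B/s, 1G gcache covers ~${coverage}s of downtime"
```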

Use a dedicated network

If possible, put Galera replication traffic on a separate network interface. This isolates cluster communication from application traffic and prevents network contention under load.

Monitor wsrep_local_recv_queue_avg

This metric tells you if a node is falling behind in applying transactions. If this value keeps growing, the node cannot keep up with the write load. Either the hardware is underpowered or you have too many large transactions.

Avoid large transactions

Galera certifies entire transactions atomically. A transaction that modifies millions of rows will block certification on all nodes until it completes. Break large operations into smaller batches — a few thousand rows per transaction is a good target.
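As a sketch of the batching pattern, the loop below deletes old rows 5,000 at a time, letting each batch commit (and certify) as its own small transaction under autocommit. The table and column names (app.events, created_at) are made up for illustration.

```shell
# Hedged sketch: purge old rows in small batches instead of one huge
# transaction. app.events / created_at are hypothetical names.
purge_in_batches() {
  while :; do
    # Each DELETE is its own small transaction; ROW_COUNT reports rows removed.
    rows=$(mariadb -Nse "DELETE FROM app.events
                          WHERE created_at < NOW() - INTERVAL 90 DAY
                          LIMIT 5000; SELECT ROW_COUNT();")
    [ "$rows" -eq 0 ] && break
    sleep 0.2   # brief pause gives certification a breather between batches
  done
}
```

Call `purge_in_batches` from a maintenance window or cron job; the loop exits once a batch deletes zero rows.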

Set up proper monitoring

At minimum, monitor these variables on each node: wsrep_cluster_size, wsrep_local_state_comment (should be "Synced"), wsrep_flow_control_paused (should be close to 0) and wsrep_local_recv_queue_avg. If flow control kicks in frequently, your slowest node is bottlenecking the cluster.
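A minimal health check over those variables might look like the sketch below. The 0.1 flow-control threshold is an illustrative cutoff, not an official one, and in practice both inputs would come from SHOW STATUS queries (see the comment).

```shell
# Minimal health-check sketch. On a live node the inputs would come from:
#   mariadb -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}'
#   mariadb -Nse "SHOW STATUS LIKE 'wsrep_flow_control_paused'" | awk '{print $2}'
check_health() {
  state="$1" paused="$2"
  if [ "$state" != "Synced" ]; then
    echo "ALERT: node state is $state"; return 1
  fi
  # a paused fraction above 0.1 (illustrative cutoff) means frequent throttling
  if awk -v p="$paused" 'BEGIN { exit !(p > 0.1) }'; then
    echo "ALERT: flow control paused $paused"; return 1
  fi
  echo "healthy"
}

check_health Synced 0.02
```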

Handling node failures and recovery

One of the main reasons to run Galera is automatic failover. If a node crashes, the remaining nodes form a quorum and keep serving requests. No manual intervention needed.

Single node failure

Just restart the failed node. It will rejoin the cluster automatically, either via IST if the downtime was short enough, or via full SST if it was too long.

sudo systemctl start mariadb

Full cluster shutdown

If all nodes go down (a datacenter reboot, for instance), you need to bootstrap again. Find the node with the most recent data:

sudo cat /var/lib/mysql/grastate.dat

The node with safe_to_bootstrap: 1 or the highest seqno value should be your bootstrap node. If none is marked safe, edit grastate.dat on the most recent node and set safe_to_bootstrap: 1. Then bootstrap from that node:

sudo galera_new_cluster

Start the other nodes normally after that. Getting this wrong can lead to data loss, so always check grastate.dat before bootstrapping.
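To compare nodes quickly, you can extract the two interesting fields from grastate.dat with awk. The sketch below runs against a sample file with made-up uuid and seqno values; on a real node, point the awk command at /var/lib/mysql/grastate.dat instead.

```shell
# Pull seqno and safe_to_bootstrap out of grastate.dat for comparison across
# nodes. The sample file below uses made-up values; swap in
# /var/lib/mysql/grastate.dat on a real node.
cat > /tmp/grastate.sample <<'EOF'
# GALERA saved state
version: 2.1
uuid:    6e3b2f0a-0000-0000-0000-000000000000
seqno:   1423
safe_to_bootstrap: 0
EOF

awk '/^seqno:/ {print "seqno=" $2} /^safe_to_bootstrap:/ {print "safe=" $2}' /tmp/grastate.sample
```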

Split-brain prevention

Galera requires a majority of nodes to form a quorum. With three nodes, you need at least two alive. If a network partition isolates one node from the other two, the isolated node stops accepting writes. The two-node partition keeps working normally.

This is why three nodes is the minimum for a production Galera cluster. Two nodes cannot form a majority when one fails.

Backing up a Galera cluster

Galera replication is not a backup strategy. Replication protects against hardware failure, but it does not protect against accidental deletes, schema mistakes or application bugs. A DROP TABLE on one node replicates instantly to all others.

You still need proper backups. One approach is to use mariadb-dump against one of the nodes:

mariadb-dump --all-databases --single-transaction --routines --triggers > backup.sql

The --single-transaction flag is important here. It avoids locking tables and keeps the node available for reads during the backup. For larger databases, mariabackup gives you faster physical backups with less overhead.
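If you want to script the mariadb-dump approach, a minimal nightly job might look like the sketch below. The backup path and 7-day retention are assumptions, and DUMP_CMD is a small indirection so the dump command can be swapped out or stubbed for testing.

```shell
# Hedged example of a nightly backup job: compressed dump plus 7-day
# retention. The path is an assumption; schedule backup_cluster from cron.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/mariadb}"
DUMP_CMD="${DUMP_CMD:-mariadb-dump}"

backup_cluster() {
  mkdir -p "$BACKUP_DIR"
  stamp=$(date +%F)
  # --single-transaction keeps the node readable while the dump runs
  $DUMP_CMD --all-databases --single-transaction --routines --triggers \
    | gzip > "$BACKUP_DIR/galera-$stamp.sql.gz"
  # retention: drop archives older than 7 days
  find "$BACKUP_DIR" -name 'galera-*.sql.gz' -mtime +7 -delete
}
```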

For teams that want automated scheduled backups with compression, retention policies and offsite storage, Databasus is one option: a self-hosted open-source tool that works with MariaDB, MySQL, PostgreSQL and MongoDB, and handles backup scheduling, multiple storage destinations (S3, Google Drive, SFTP and others) and notifications out of the box — suitable both for individual developers and enterprise teams.

Common issues and troubleshooting

A few things that trip people up when first running Galera in production.

Node cannot join the cluster. Check that all Galera ports are open between nodes. Verify that wsrep_cluster_address is identical on all nodes and uses the correct IPs. Check the error log for SSL or authentication errors.

Slow writes after adding a node. During SST, the donor node may slow down. With mariabackup as the SST method this impact is minimal, but on busy clusters you might notice a temporary performance dip. Wait for the state transfer to complete.

"Deadlock found" errors on multi-node writes. This happens when two nodes try to modify the same row simultaneously. Galera detects the conflict during certification and rolls back one transaction. Your application should handle this by retrying the transaction. This is normal Galera behavior, not a bug.
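In shell terms, the retry logic looks like the sketch below. run_sql is a hypothetical stand-in for however your application issues the statement, and three attempts with a short backoff is an arbitrary choice, not a Galera requirement.

```shell
# Application-side retry sketch for Galera certification rollbacks.
run_sql() {
  # hypothetical stand-in for the real client call, e.g.:
  #   mariadb -h 192.168.1.101 -e "$1"
  true
}

run_with_retry() {
  attempts=0
  while [ "$attempts" -lt 3 ]; do
    run_sql "$1" && return 0
    attempts=$((attempts + 1))
    sleep 0.1   # brief backoff before retrying the rolled-back transaction
  done
  return 1
}

run_with_retry "UPDATE accounts SET balance = balance - 10 WHERE id = 1"
```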

Cluster won't start after full shutdown. You need to bootstrap from the most advanced node. Check grastate.dat on all nodes and bootstrap from the one with the highest sequence number.

Wrapping up

Setting up a MariaDB Galera cluster is not that complicated once you understand the moving parts. Install MariaDB with Galera support, configure the cluster address and node-specific settings, bootstrap the first node and join the rest. The key things to remember: use at least three nodes for quorum, keep network latency low between them, avoid massive transactions and monitor your cluster metrics. And do not forget to set up actual backups separately from replication.
