The split-brain scenario in a database environment is a situation whereby the communication link between two different sites is broken, as a result of this situation the production database and standby database become writeable at the same time.
The primary database in the production site remains active; while the secondary server (standby database) in disaster recovery, which was in read-only mode, becomes writeable and becomes active because it thinks the primary database is offline.
The main reason why both database servers become active in the two different sites, the quorum does not have visibility into the two different environments (production and disaster recovery) and assumes that the two environments are reachable and couldn't decide which of the database should vote.
In most cases, the application server will write into these two databases simultaneously (production and disaster recovery) because they are both active. if this issue was not identified on time, it will cause a serious problem, leading to database integrity issues. There will be inconsistency in viewing customer’s historical transactions because the transactions are in two different databases until the databases are reconciled.
Once a split-brain scenario has been identified, it is preferable to stop the SQLServer service in the standby database to prevent the database been active together with the primary database.
Nagios is a monitoring and alerting tool that can be used in monitoring database servers. There are Nagios plugins that can detect if there are issues with database replications, High Availability (HA) clustering, or any database-related issues.
Furthermore, immediately the communication link between the primary and disaster recovery site is restored and the primary database becomes writeable and the secondary database becomes read-only. The two databases will resume their initial mode before the split-brain scenario issues occurred. The primary database will also resume replication with the standby database but there will already be issues with (data and database schema) inconsistency and this can mislead the customers because only transactions that are routed to the primary database will be visible to users.
The next challenge will be to reconcile the data in the primary database with the secondary database. The SQL server service that was stopped in the secondary database will now be started. The following techniques was used in reconciling the data.
Remove the replication between the primary database and secondary database. This is very important because the primary database contains the most recent records.
Take a full backup of the primary and standby database in case we want to revisit the records.
Once the replication has been disabled or destroyed. The standby database now becomes writeable
Once the database has been properly synced and compared.
The next step will be to visualize the record from the application to confirm that the historical records are showing properly.
Managing crises during split-brain scenarios situations:
It is very important to configure file share witnesses. File Share Witness is a file share that is available to all nodes in a high availability (HA) cluster. The job of the Witness is to provide an additional quorum vote when necessary in order to ensure that a cluster continues to run in the event of a site outage.
The file-share witness should be hosted in any of the public cloud providers either AWS, Azure, or Google cloud which the quorum will use in deciding who should vote and becomes the primary database.
Configure a maintenance script in both primary and standby databases. This script will send a notification any time the communication link is broken. Once the notification is received, another script will stop the service of the standby database once it is active.
In conclusion, please do let me know if you find this article interesting and also share your experience with managing a split-brain scenario