DEV Community: Klim Markelov

What is ACID? Baby don't hurt me. No more.

Klim Markelov — Sun, 25 Apr 2021 15:31:02 +0000

Hello, ultra devs! 👋🏻⚡️
A lot of developers currently use relational databases such as MySQL, PostgreSQL, and so on, and they are probably familiar with transactions. But what is a "transaction", and what properties does it have?

Today I would like to talk about transaction properties that were combined into the cool and scary word ACID and reveal the truth about each letter in this word 👆🏻

What is ACID?

ACID is an acronym of the following words:

Atomicity;
Consistency;
Isolation;
Durability.

If we have a certain operation in a database that satisfies these properties, we can call this operation a "transaction".

Ok, that's nice, but these words are just words without an explanation. Let's dig deeper into each of them! 🚜

Atomicity

Atomicity is a nice property that guarantees that a transaction is a single unit that cannot be broken into smaller parts: either all of its statements take effect or none of them do. If something fails inside the transaction, every change it has already made is reverted.

Imagine we have the following query:

START TRANSACTION;
INSERT INTO posts VALUES ('title', 'body', 'draft');
SELECT @counter := COUNT(id) FROM posts WHERE status = 'draft';
UPDATE posts_statistics SET posts_amount = @counter WHERE status = 'draft';
COMMIT;

This transaction basically inserts the post and updates the posts_statistics.posts_amount field for draft posts.
If the UPDATE of posts_statistics fails, the whole transaction will be rolled back and the post won't be inserted into the posts table either. So, these two operations (actually three, SELECT is also an operation) are highly coupled and can be considered atomic.

One more nice property that goes hand in hand with atomicity (strictly speaking, it is provided by isolation) is that other clients won't see the changes in the tables until the transaction gets committed, unless they use the READ UNCOMMITTED isolation level. So, if we execute the transaction described above and, while it is still running, execute the query SELECT COUNT(id) FROM posts WHERE status = 'draft' from another client, we will see a different value than the one the @counter variable has.
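
The rollback behaviour described above can be tried out without a MySQL server. Here is a minimal sketch that uses Python's built-in sqlite3 module instead of MySQL; the table layout, the insert_draft helper, and the simulated failure are assumptions made for illustration, and the counter is simply set to the number of drafts counted.

```python
import sqlite3


def make_db():
    """Create an in-memory database with the two tables from the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT, status TEXT)"
    )
    conn.execute(
        "CREATE TABLE posts_statistics (status TEXT PRIMARY KEY, posts_amount INTEGER)"
    )
    conn.execute("INSERT INTO posts_statistics VALUES ('draft', 0)")
    conn.commit()
    return conn


def draft_count(conn):
    return conn.execute(
        "SELECT COUNT(id) FROM posts WHERE status = 'draft'"
    ).fetchone()[0]


def insert_draft(conn, fail_on_update=False):
    """Insert a draft and refresh the counter as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO posts (title, body, status) VALUES ('title', 'body', 'draft')"
            )
            counter = draft_count(conn)
            if fail_on_update:
                # Hypothetical failure that happens before the UPDATE succeeds.
                raise RuntimeError("simulated failure")
            conn.execute(
                "UPDATE posts_statistics SET posts_amount = ? WHERE status = 'draft'",
                (counter,),
            )
        return True
    except RuntimeError:
        return False
```

Calling insert_draft(conn, fail_on_update=True) returns False and leaves the posts table empty, because the INSERT is rolled back together with the failed step.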

Consistency

This is a very ambiguous and unclear letter. You might have heard about consistency in the CAP theorem (leave a comment if you wanna read about CAP theorem and types of consistencies there 😌) or in consistent hashing. The Consistency in ACID is more about data consistency on the application level. Basically, it means that before committing the transaction, the system needs to make sure that all invariants that were set up beforehand and held when the transaction started are still satisfied.

Imagine that you are buying something in the dev.to shop. You have reached the point where you need to pay for a cool T-shirt. You fill in all the credentials and click Pay. In order to satisfy the consistency property, the sum of your amount of money and dev.to's amount of money should be the same at the beginning of the transaction and at the moment of committing it.

This sum value can be considered an invariant for this transaction.
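
To make the invariant more tangible, here is a small sketch, again with Python's sqlite3 rather than MySQL. The accounts table, the owner names, and the pay helper are hypothetical; the only rule taken from the text is that the total amount of money must not change.

```python
import sqlite3


def make_accounts():
    """Hypothetical accounts table: you start with 100, dev.to with 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("you", 100), ("dev.to", 0)]
    )
    conn.commit()
    return conn


def total_money(conn):
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]


def pay(conn, buyer, seller, price):
    """Move money from buyer to seller; roll back if the invariant breaks."""
    before = total_money(conn)
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (price, buyer),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (price, seller),
            )
            # The invariant: the total amount of money must stay the same.
            if total_money(conn) != before:
                raise ValueError("invariant violated")
        return True
    except ValueError:
        return False
```

Paying a seller that does not exist would make the money disappear, so the check fails and the whole payment is rolled back.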

Isolation

This property is very useful for the database to prevent so-called race conditions. You may wonder what a "race condition" is.

Imagine you have multiple clients trying to access the same data and change it at the same time. Let's say they concurrently insert posts using the transaction described in the Atomicity chapter:

START TRANSACTION;
INSERT INTO posts VALUES ('title', 'body', 'draft');
SELECT @counter := COUNT(id) FROM posts WHERE status = 'draft';
UPDATE posts_statistics SET posts_amount = @counter WHERE status = 'draft';
COMMIT;

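Let's say initially we have 42 draft posts (just because 42). Since two clients do it at the same time, here is what might happen: each transaction inserts its own post and counts the drafts, but neither of them sees the other's uncommitted post, so both compute the same number and the later UPDATE simply overwrites the earlier one.

Basically, they both added their posts, so the total number of posts should be equal to 44, but the statistics ended up with 43.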
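
The unlucky interleaving can be replayed step by step in plain Python. This is only a model, not a real database: it assumes that each client counts the committed drafts plus its own new row and then writes that count into the statistics. The function names are made up for the example.

```python
def run_interleaved(initial_drafts=42):
    """Two clients run the transaction at the same time."""
    committed_posts = initial_drafts  # drafts visible to everyone
    posts_amount = initial_drafts     # the counter in posts_statistics

    # Both clients have inserted a post but not committed yet, so each one
    # sees the committed drafts plus only its own new row.
    counter_a = committed_posts + 1
    counter_b = committed_posts + 1

    posts_amount = counter_a  # client A updates the statistics...
    committed_posts += 1      # ...and commits
    posts_amount = counter_b  # client B overwrites it with the same value...
    committed_posts += 1      # ...and commits

    return committed_posts, posts_amount


def run_serial(initial_drafts=42):
    """The same two transactions, executed one after another."""
    committed_posts = initial_drafts
    posts_amount = initial_drafts
    for _client in ("A", "B"):
        committed_posts += 1            # insert the post
        posts_amount = committed_posts  # the count sees every earlier commit
    return committed_posts, posts_amount
```

run_interleaved() gives 44 posts but a counter of 43, while run_serial() gives 44 for both.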

So, in an ideal world, the isolation property ensures that concurrent transactions look like they get executed one after another, or serially. Therefore, it is also known as serializability. In practice, the implementation of this property is rarely serializable, and some databases use so-called weak isolation (please, leave a comment if you wanna read about it).

Durability

The last, but very important, property is Durability. This property means that once a transaction has been successfully committed, it will remain committed and won't be forgotten, even after a power failure or a database crash.

We can distinguish durability based on the database architecture:

Single-node database;
Replication database (you can read more about replication here).

For a single-node database, durability means that the committed data has been safely written to disk and that there is a recovery mechanism in case of disk corruption.

For a replicated database, durability means that the data has been written to a certain number of replicas, so if one node crashes, the data is still available on the others.
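
For the single-node case, the idea can be sketched with an append-only log file. This is a simplified illustration and not how MySQL actually stores its data: the file format, the commit and recover helpers, and the simulated crash are all assumptions.

```python
import json
import os
import tempfile


def commit(log_path, record):
    """Append a record and force it to disk before reporting success."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())  # ask the OS to really write to the disk


def recover(log_path):
    """Rebuild the committed records after a crash."""
    records = []
    if not os.path.exists(log_path):
        return records
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # a half-written record was never committed
    return records


def demo():
    with tempfile.TemporaryDirectory() as folder:
        log_path = os.path.join(folder, "commit.log")
        commit(log_path, {"post": 1})
        commit(log_path, {"post": 2})
        # Simulated crash in the middle of writing a third record.
        with open(log_path, "a", encoding="utf-8") as log:
            log.write('{"post": 3')
        return recover(log_path)
```

After the simulated crash, demo() still returns the two committed records and ignores the broken third one.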

Summary

Today we've learned about transaction properties that are aggregated into the ACID word.

ACID is an acronym of the following words:

Atomicity – ensures that all transactions are atomic and cannot be broken down into small parts;
Consistency – ensures that the transaction does not violate application invariants;
Isolation – ensures that race conditions do not happen;
Durability – ensures that once the transaction is committed, it is committed forever.

That's it! Thank you for your attention! I hope you liked this post 😌

Introduction to MySQL replication

Klim Markelov — Sun, 18 Apr 2021 19:12:19 +0000

Hello, ultra devs! 👋🏻⚡️
Today I would like to talk about Replication and how it works in MySQL.

Let's start from the beginning. If you know what replication is, feel free to skip the next chapter.

What is replication?

Basically, replication means storing a copy of the same data on multiple machines. "How is it different from backups?" you may wonder. Replication is a bit more than that. While a backup is just a snapshot of the data at a certain point in time, replication not only keeps a copy of the data up to date in near real time, ensuring availability, but also relieves the load on the database by spreading clients' requests over several machines, which increases throughput. Also, replication helps you to distribute your data storage across the globe, decreasing the response time for clients in different parts of the world.

In this article, we will be talking about leader-based replication, and to continue our journey into this world, it's required to introduce several terms:

Leader (master) – part of the replication system eligible to write and read from the database.
Follower (replica) – part of the replication system eligible only to read.

Basically, the leader is responsible for all inserts, updates, and deletes, and once these changes go through the leader, it transfers them to all its followers, which serve reads and never accept writes.

Here is a simple example of single-leader replication with two followers:

How does it work in MySQL?

Ok, now we know what replication is, but how does it actually work in MySQL? How does data get transferred from the leader to the followers, and how does MySQL keep them consistent?

Imagine that dev.to is powered by MySQL. You just wrote an article and clicked on the Publish button. Here is what happens:

Data comes to the leader and gets saved in the database;
The leader saves the data changes in a special file called the binary log;
The follower copies the changes from the leader's binary log (binlog) to its own file called the relay log;
The follower replays these changes from the relay log on its own data.
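
The four steps above can be modelled with two tiny classes. This is only a toy model of the data flow: the Leader and Follower classes and their methods are made up for illustration and have nothing to do with MySQL's real implementation.

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.binary_log = []

    def write(self, key, value):
        self.data[key] = value                # step 1: save the data
        self.binary_log.append((key, value))  # step 2: record the change


class Follower:
    def __init__(self):
        self.data = {}
        self.relay_log = []
        self.applied = 0  # how many relay log events were replayed so far

    def fetch(self, leader):
        # Step 3: copy the events we have not seen yet into the relay log.
        self.relay_log.extend(leader.binary_log[len(self.relay_log):])

    def replay(self):
        # Step 4: apply the new relay log events to our own data.
        for key, value in self.relay_log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.relay_log)
```

Between fetch and replay the follower already has the change in its relay log but not yet in its data, which is a very small picture of replication lag.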

As you can see, to synchronize the relay log with the leader's binary log, MySQL starts a worker thread on the follower called the I/O follower thread (the replication I/O thread in MySQL's own terminology). It's basically an ordinary client connection to the leader that starts reading its binary log.

Digging a bit into details, we can ask a quite reasonable question: in which format do binary log and relay log store the data?

Replication types

Currently, MySQL supports two main types of replication (plus a mixed mode that switches between them):

Statement-based replication;
Row-based replication.

Statement-based replication

So, as is clear from the name, statement-based replication records in the binlog the whole query that changed the state of the data. When a follower synchronizes its data with the leader, it copies the query and replays it, executing the query and applying the changes to its own data.

This kind of replication is very easy to implement and has multiple advantages:

It still works when the schema is different on the leader and the follower;
It's easy to audit and debug;
It does not require much disk space.

Of course, with great advantages come great disadvantages:

Non-deterministic functions. With non-deterministic functions, the leader and the follower can end up with different data. By non-deterministic functions I mean functions such as CURRENT_USER(), RAND(), IS_FREE_LOCK() and so on. Executing them first on the leader and then on the follower can lead to inconsistent data;
Performance penalty. Imagine if you execute the following query:

INSERT INTO posts_statistics (status, posts_amount)
  SELECT status, COUNT(id) FROM posts GROUP BY status;

without having an index on the status field, and after pressing 'Enter' you just went for a tea (a hypothetical situation, I know, you probably drink coffee). The query got executed on leader, consuming all available CPU, and then follower picked up the baton, copied the query to its relay log, and cheerfully ate all CPU as well;

Triggers and stored routines. Triggers and stored routines, just like non-deterministic functions, can cause a lot of problems by having different side effects on the leader and the follower.

So, statement-based replication has its advantages, but also big disadvantages. Therefore, not every database supports this type of replication. In the case of MySQL, though, it was the only type supported up to and including MySQL 5.0; row-based replication arrived in MySQL 5.1.

Row-based replication

Compared to statement-based replication, row-based replication stores the actual data changes in the binary log rather than the query. So, when a follower replicates the data, it doesn't execute the query, but applies the changes to each record that was changed on the leader.

Let's consider the advantages of this approach:

Less CPU intensive. If we execute the query described in the Statement-based replication chapter, the follower does not replay this query, but copies the resulting values and applies the change to its own data records. So, the expensive query gets executed only once, on the leader, and doesn't consume all available CPU on the follower;
Helps to find data inconsistency. Since row-based replication stores only the changes, when the follower replays them and tries to change a row that exists on the leader but doesn't exist on the follower, it throws an error. Meanwhile, statement-based replication proceeds with what it has and keeps the inconsistency hidden, making it harder to find the point of failure and fix it;
No non-deterministic behavior. Compared to statement-based replication, if you execute a query that uses non-deterministic functions, it ends up with the same result on both the leader and the follower.
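
The difference between the two approaches for non-deterministic functions can be shown with a toy model. Here Python's random module stands in for any non-deterministic function; the function names are hypothetical, and the model says nothing about how MySQL handles a particular function internally.

```python
import random


def non_deterministic_value(rng):
    """Stands in for any function whose result differs from server to server."""
    return rng.random()


def statement_based(leader_rng, follower_rng):
    # The follower re-executes the statement with its own state.
    leader_row = non_deterministic_value(leader_rng)
    follower_row = non_deterministic_value(follower_rng)
    return leader_row, follower_row


def row_based(leader_rng):
    # The follower copies the value that the leader has already computed.
    leader_row = non_deterministic_value(leader_rng)
    follower_row = leader_row
    return leader_row, follower_row
```

With two differently seeded generators, statement_based returns two different values, while row_based always returns the same value twice.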

Looks nice, that's what we were expecting from replication, right? But along with the advantages come disadvantages:

High disk space consumption. Yeah, we just talked about this replication type being less CPU intensive, but disk space is also very important. Imagine you have the following statement:

UPDATE posts SET status = 'draft' WHERE status = 'published'

If the posts table has about 1,000,000 published posts, this query becomes quite expensive, since it requires storing 1,000,000 changes in the binary and relay log files;

Does not allow different schemas. Sometimes it might be useful to have different schemas on the leader and the follower (I don't know about these cases, but they definitely exist). As described above, row-based replication throws an error in case of data inconsistency, including the one caused by different schemas;
Statement is not included in the binary log. This may not be a problem at all until you try to debug or audit what's going on and which query caused damage to your database. Row-based replication makes that hard to analyze.
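
The disk space difference is easy to see in a small model. The two functions below are hypothetical and only count log entries; real binary log events have a different format.

```python
def statement_log(statement):
    """Statement-based: one log entry, no matter how many rows change."""
    return [statement]


def row_log(rows, old_status, new_status):
    """Row-based: one log entry per changed row."""
    return [
        (row_id, new_status)
        for row_id, status in rows.items()
        if status == old_status
    ]


def make_posts(published, drafts):
    """Build a fake posts table as a dict of id -> status."""
    rows = {i: "published" for i in range(published)}
    rows.update({published + i: "draft" for i in range(drafts)})
    return rows
```

For a table with 1,000 published posts, statement_log produces a single entry and row_log produces 1,000 of them.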

Now we are fluent in replication process language. Everything is clear. Hold on, in the picture of the replication example, we can see one leader and two followers. Can we do things differently?🕵️‍♀️

MySQL supported topologies

Single-leader replication

This type of replication is the most common one. It is useful when you have a lot of reads but not that many writes. You can distribute users' reads among the followers, load-balancing them and therefore providing better response time. With this replication topology, you can easily add one more follower. Also, since it has only one leader, this topology avoids a lot of the problems that multi-leader topologies have (they will be described in Leader-leader replication).
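
Distributing reads among followers can be sketched as a tiny router. The Router class and the server names are made up, and the check for a read query is deliberately naive (it only looks at the first word), so treat it as an illustration of the idea rather than something production-ready.

```python
import itertools


class Router:
    """Send writes to the leader and spread reads over the followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        self._followers = itertools.cycle(followers)  # round-robin

    def route(self, query):
        # Naive assumption: everything that starts with SELECT is a read.
        is_read = query.lstrip().upper().startswith("SELECT")
        return next(self._followers) if is_read else self.leader
```

With one leader and two followers, consecutive SELECT queries go to the followers in turn, while every INSERT, UPDATE, or DELETE goes to the leader.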

Leader-leader replication

As it is visible from the picture, this topology involves two leaders.

This topology is useful when you have different data centers in different locations and you need to provide fast writes to both regions.

But with this advantage comes a great cost. Suppose we have a table posts_statistics, and you just realized that the real number of posts with Published status is three times higher than the value stored in the table, so you decide to fix the situation. You connect to MySQL and execute the following query:

UPDATE posts_statistics SET posts_amount=posts_amount * 3 WHERE status='published'

Meanwhile, somebody in a different part of the world just published their first post (just like me) and triggered the following query on the other leader:

UPDATE posts_statistics SET posts_amount=posts_amount + 1 WHERE status='published'

Suppose the original number of posts was 10,000. Due to replication lag, each leader applies its own update first and the replicated one second, so the databases end up with two different numbers: 10,000 × 3 + 1 = 30,001 and (10,000 + 1) × 3 = 30,003. And no errors were thrown.
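
The arithmetic behind these two numbers can be checked in a few lines of Python. The helper names are made up; the point is only that the same two updates applied in a different order give different results.

```python
def triple(amount):
    return amount * 3  # UPDATE ... SET posts_amount = posts_amount * 3


def increment(amount):
    return amount + 1  # UPDATE ... SET posts_amount = posts_amount + 1


def apply_in_order(initial, operations):
    """Apply the updates in the order a given leader received them."""
    amount = initial
    for operation in operations:
        amount = operation(amount)
    return amount


# The first leader runs its own query first, then the replicated one.
first_leader = apply_in_order(10_000, [triple, increment])
# The second leader sees the same two queries in the opposite order.
second_leader = apply_in_order(10_000, [increment, triple])
```

Here first_leader is 30001 and second_leader is 30003, and neither database has any reason to complain.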

This is a big disadvantage of this topology, and in practice, it brings more problems than advantages. But if you ended up with this topology, it's better to add a few more replicas to it 😁

Active-passive leader-leader replication

In the Active-passive leader-leader replication topology, one server takes the role of the leader and the other one takes the role of the follower. But in comparison to the ordinary leader-follower topology, it allows you to easily swap the leader responsibility from one server to the other.

It's useful in many cases. For example, if you need to execute an ALTER TABLE that locks the whole table for reads and writes, you can stop the replication process, execute ALTER TABLE on the passive server, swap the leader responsibility, restore the replication process, and execute ALTER TABLE on the remaining server. It can help you to keep your service alive while executing that expensive query.

Other topologies

There are many other topologies that are supported by MySQL:

Replication Ring topology
Tree or pyramid topology

And many others. You can choose the best topology that fits your purposes or create your own. This is the list 👆🏻 of the most common topologies used in MySQL.

Summary

Replication is a mechanism of having a consistent copy of the data storage. It provides:

Data distribution;
Load balancing;
Backups;
High availability and failover.

Leader-based replication consists of a leader and followers. Each of them has its own journal of changes: the binary log on the leader and the relay log on the follower.

There are two types of replication:

Statement-based replication. The changes are recorded as the queries themselves.
Row-based replication. The changes are recorded as direct data changes.

There are multiple topologies for replication:

Leader-follower topology;
Leader-leader topology;
Leader-leader active-passive topology;
Ring topology;
Tree or pyramid topology. And many specialized topologies together with custom ones.

That's it! Thank you for your attention! I hope you liked this post 😌