
Dewan Ahmed

Originally published at aiven.io

Scale up: a MySQL bug story, or why Aiven works

One of the hardest questions we answer for our large enterprise customers is why they should choose Aiven instead of managing their own database and streaming services. It can seem counterintuitive that paying extra for a managed service can save you money. However, when we factor in economies of scale - particularly access to specialized knowledge and tooling - the case for managed services becomes clear. That was certainly true for some of our MySQL customers earlier this year, whose investment in Aiven paid off in the form of a quietly managed bug fix.

What happened

In January 2023, our Site Reliability Engineering team began receiving reports that some MySQL databases were caught in restart loops. Aiven’s SRE team has playbooks for incidents like these - namely, we allow nodes to fail over gracefully to a standby - but for some reason, this process wasn’t working: SREs had to manually restart and restore nodes, and within days or weeks customers, sometimes the same ones, would find themselves facing another outage.

In addition, even when the SRE team did get those customers’ databases back online, it was often difficult to restore from backup: the data had become corrupted.

Aiven’s SRE team felt the issue was “too intermittent to track” at first. Final statistics showed that fewer than 1% of Aiven’s tens of thousands of customers were affected, and the SRE team was only seeing incidents matching the pattern, at most, a few times a week.

However, over the course of a few weeks, the SRE team received enough reports from enough customers to realize that a broader incident was at play, and they began investigating.

Doing the detective work: the power of economies of scale

Most of Aiven’s enterprise customers have their own Site Reliability Engineering teams. So why, you might ask, do they find value in a company like Aiven? The simple answer is economies of scale. It’s more efficient for Aiven to hire high-quality database specialists and excellent SREs than it is for our customers to do so, and the kind of time and tooling investment Aiven makes in the services we provide might not make sense for even a very large enterprise.

That’s what happened with this MySQL issue: upon further investigation by Aiven’s SRE and data services teams, we discovered that the core of the issue was a change of data types introduced in MySQL 8.0.30. Running SELECT on a problematic row caused the restart loop and also corrupted the data in that row, making it more likely that the error would occur in the future.

Aiven’s data services team discovered this by scanning thousands of deployments - the error occurred for only a tiny fraction of our customers.
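To give a sense of what that kind of fleet-wide check can look like, here is a minimal sketch - not Aiven’s actual tooling - that asks each MySQL service for its version and flags the ones still on the affected 8.0.30 release. The pymysql library and the ENDPOINTS inventory are assumptions for illustration:

```python
# Minimal sketch of a fleet-wide version scan - illustrative only, not Aiven's
# actual tooling. Assumes the pymysql client library and a hypothetical
# ENDPOINTS inventory of managed MySQL services.
import pymysql

ENDPOINTS = [
    {"host": "mysql-example-1.example.com", "port": 3306,
     "user": "scanner", "password": "change-me"},
    # ...one entry per managed MySQL service
]

AFFECTED_PREFIX = "8.0.30"

def scan(endpoints):
    """Return (host, version) pairs for services still on the affected release."""
    affected = []
    for ep in endpoints:
        conn = pymysql.connect(host=ep["host"], port=ep["port"],
                               user=ep["user"], password=ep["password"])
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT VERSION()")
                (version,) = cur.fetchone()
                if version.startswith(AFFECTED_PREFIX):
                    affected.append((ep["host"], version))
        finally:
            conn.close()
    return affected

if __name__ == "__main__":
    for host, version in scan(ENDPOINTS):
        print(f"{host} is running {version} and needs the backported fix")
```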

Squashing 10,000 bugs with tooling

Aiven compared the source code of MySQL 8.0.30 and MySQL 8.0.31, and identified the block of code that changed and caused the issue. This sounds trivial, but in practice one of our engineers scanned through hundreds of commits in MySQL to identify the change.
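As an illustration of how a search like that can be narrowed down, the sketch below lists the commits between two releases that touch the InnoDB storage engine. It assumes a local clone of the mysql/mysql-server repository and its usual mysql-8.0.30 and mysql-8.0.31 release tags:

```python
# Illustrative sketch: list commits between two MySQL release tags that touch
# the InnoDB storage engine. Assumes a local clone of mysql/mysql-server with
# its usual mysql-8.0.30 and mysql-8.0.31 release tags.
import subprocess

def commits_between(repo_path, old_tag, new_tag, subdir="storage/innobase"):
    """Return (sha, subject) pairs for commits in new_tag but not old_tag that touch subdir."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--oneline", f"{old_tag}..{new_tag}", "--", subdir],
        capture_output=True, text=True, check=True,
    )
    commits = []
    for line in result.stdout.splitlines():
        sha, _, subject = line.partition(" ")
        commits.append((sha, subject))
    return commits

if __name__ == "__main__":
    for sha, subject in commits_between("mysql-server", "mysql-8.0.30", "mysql-8.0.31"):
        print(sha, subject)
```

Even with the list narrowed to one storage engine, that still leaves a long tail of commits to read through by hand, which is the work our engineer did.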

It might seem like the obvious mitigation would be to upgrade MySQL to 8.0.31, but Aiven also provides robust backups for our customers, and the tooling Aiven uses to create and manage MySQL database backups lags behind MySQL version releases: simply put, we could not upgrade MySQL and keep our backup promises at the same time.

An individual business might choose to upgrade anyway, but because Aiven operates at scale for thousands of customers, we had another option available: we decided to backport the fix to MySQL 8.0.30, and support a custom version of MySQL until our backup tooling released a version compatible with a more recent MySQL version. This let us continue to run robust, compliant backup tooling and mitigate the issue for our customers.

In addition, Aiven’s team developed a small script that rebuilt all MySQL databases upon startup - this way, we could ensure that the fix was applied to all MySQL customers, rather than just those experiencing the issue.
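The sketch below is not Aiven’s actual script, but it illustrates the general idea of such a rebuild: iterate over every user table on an instance and force an in-place rebuild with ALTER TABLE ... FORCE, which rewrites each table and its indexes. The pymysql library and the connection details are assumptions for illustration:

```python
# Illustrative sketch - not Aiven's actual script. Rebuild every user table on
# an instance with ALTER TABLE ... FORCE, which rewrites the table and its
# indexes in place. Assumes pymysql and an account with ALTER privileges.
import pymysql

def rebuild_all_tables(conn):
    """Force a rebuild of every base table outside the MySQL system schemas."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_schema, table_name FROM information_schema.tables "
            "WHERE table_type = 'BASE TABLE' "
            "AND table_schema NOT IN ('mysql', 'sys', 'information_schema', 'performance_schema')"
        )
        for schema, table in cur.fetchall():
            # ALTER TABLE ... FORCE performs a "null" alter that rebuilds the
            # table and its indexes without changing the table definition.
            cur.execute(f"ALTER TABLE `{schema}`.`{table}` FORCE")

if __name__ == "__main__":
    conn = pymysql.connect(host="localhost", user="admin",
                           password="change-me", autocommit=True)
    try:
        rebuild_all_tables(conn)
    finally:
        conn.close()
```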

When rolling your own is risky

For a business managing its own database infrastructure, an error like this would be too obscure to find: across tens of thousands of Aiven for MySQL customers, Aiven observed the issue just over 100 times and received only a handful of reports. Of those, only a few customers experienced the issue more than once. Aiven runs more than 30,000 MySQL services, so this is a genuinely tiny proportion.

Without being dismissive of businesses that do manage their own infrastructure, the odds of experiencing this issue enough times that your database or site reliability engineering team could notice a pattern and gather enough data to mitigate it are exceedingly low. For Aiven, however, which invests heavily in monitoring MySQL at scale for thousands of customers, it was much easier to see the pattern and mitigate the issue. This is the power of managed services, even for large businesses: it doesn’t make sense for your business to invest in this kind of tooling, but for a small fee, you have access to a company like Aiven, for whom it does.

So what would happen to you instead?

Your business would experience a show-stopping IT incident. First, your database would be caught in a restart loop, like our customers were. When you eventually resolved that, there’s a high chance your database would be corrupted. If you were smart, you’d restore that database from a backup, but setting up automated backups takes time and energy that small development teams often don’t have. You would have, in many cases, a very long and problematic production outage on your hands, in addition to potential data loss. And then, when you least expect it, the error would happen again, because you wouldn’t have the information to solve the problem at its source.

At Aiven, the longest production outage our customers experienced with this was approximately 2 hours, and the longest database rebuild time our customers experienced was about 6 hours. Out of all the failures Aiven saw in production, we only needed to manually restart a service once – automation took care of the rest. Our customers lost minimal data due to the robust backup tooling that Aiven uses. From our perspective, having a trusted partner full of MySQL and SRE experts to get your databases back online is a far better proposition than having to do it yourself without anywhere near the same expertise.

Wrapping up

Aiven lets you manage your databases, streaming services and more directly in the Aiven Console or by using the Aiven API. For more information on Aiven for MySQL, check out the documentation or our tutorials on the Developer Center.
