Archiving large MySQL tables (part I - intro)

#mysql #database #rails

Background story: at NejŘemeslníci (a Czech web portal for craftsmen jobs), our data grows fast. Recently, our database reached 700GB of data, even though we used transparent compression for some of our largest tables. While this may not seem as that much of data, it gets in the way pretty badly when a need for ALTERing such tables emerges. A special software for online DDL migrations (gh-ost in our case) helps a lot but still requires a significant amount of free disk space to properly operate.

We noticed that many of our largest tables are actually used as ”append-only logs“, i.e. data is only appended to them and never UPDATEd again. Quite often, we SELECT from such tables but we almost never search for older data, we work with the recent stuff only. That's why we began to consider the option of archiving old data from those tables.

This article's aim is to provide a guide for archiving historical data from large tables using a MySQL feature called partitioning. Partitioning allows to split table's data into several physically separate files and to work with them independently thereafter. As a welcome bonus, the tables remaining in your live database will be much smaller and thus faster to work with and SELECT from.

This guide will describe two major steps in table data archiving:

initial manual splitting of the tables into partitions (see Part II)
adding shell scripts to fully automate the archival process in the future (see Part III).

But let us state a few prerequisites first…

Archiving vs. backing up

We need to be aware of the difference between backups and archives. By backing up, we mean periodic copying the data into a separate store with the option to fully and quickly restore them into the production database in case of some disaster.

On the contrary, by archiving we mean moving older parts of the data out of the database into a separate store without being able to fully (or conveniently) restore them back into the production database ever again. It's more like cleaning your house by putting the less frequently used stuff in boxes and moving them to the basement. They are still accessible, if needed (and if properly labeled :) ), but the access is more manual and time consuming and it's OK like that — the most important thing is that you've cleared your main living space!

Consider other options first

Before diving into this guide, give a few moments to consider other options that might be more relevant in your context:

If disk space is the main factor, consider using transparent compression in your tables. It is a one-time change and you can free a significant amount of space, especially if you repeatedly store similar long texts, such as emails, in your tables.
Consider a completely different storage for your data. Is the table a true log only? Perhaps it might reside in (compressed) files on your file system from the beginning and never touch the database? The files could be easily rotated using tools in your system and older data even removed if you actually don't ever need them again…

Conventions and limits of our archival process

Throughout this guide, we will use the following conventions:

The data will be archived on a temporal basis. While MySQL allows to partition your tables in many different ways, splitting the tables into time-based partitions seems to be the most convenient for us.
For this to work, you must have a column representing the time of the table records in all tables that you want to archive. We will partition each table using this temporal column. In a Rails project, for example, the best candidate for such will likely be the created_at column (but feel free to use any other, if needed).
Our archival process will handle partitioning a given table by years, months or weeks. This convention should be quite easily extendable to other time spans, such as days or even hours, if your data grows that fast.
For each to-be-archived table, we will be able to configure the period of archival as well as how much historical data we want to keep in the live database.
Older data will be exported out from the tables into gzipped files. You can do whatever you want with them, e.g. move them to a ”cold storage“ in your cloud.
The to-be-archived tables must obey the following rules:
- They must be InnoDB tables. Tables in other engines must be converted to InnoDB first.
- The column for the temporal partitioning must be part of the primary key (more on that later).
- Sometimes, thus, we will have to prepare the table by an initial migration before we will be able to partition it (more on that later, too).
- Be sure to read through the MySQL docs about partitioning.
We want the whole process to happen live, without any database downtime or substantial slowdown. The initial phase of manual tables partitioning will thoroughly use gh-ost, a very nice tool for online DDL migrations. Be sure to get very well acquainted with it or bad things can happen to your data! (More on this later, too.)
The initial manual phase of data partitioning requires a lot of free disk space, at least twice as much as the size of the largest to-be-archived table. However, after the initial run of the archival process is completed, the disk space will nicely free up again. Your life should be easier if you store your data in a cloud, as enlarging and shrinking virtual disks is much simpler than swapping hardware.
As our company is a Rails project, some parts of this guide will be slightly Rails-related. It should not be much of a problem to use it elsewhere, though.
This guide is written for the MySQL 8 server but a similar process will probably work in older MySQL versions or even in Maria DB or Percona server, though they may provide other relevant tools to achieve the same result more conveniently.

We guess this sums up the introductory stuff. In Part II of the series, we will finally roll up our sleeves and get to some db wizardry!

Top comments (4)

chuckvincent • Oct 20 '20

this is an excellent read. I will be attempting to use this process with master-slave replication. Do you know if the partitioning will be replicated or will i have to do it manually on the slave and then restart replication?

Matouš Borák NejŘemeslníci • Oct 23 '20

Hi chuckvincent, thanks! Luckily for us we don't yet have to use replication so I can't really serve here. Perhaps you've seen this page in the docs?

I'll be happy to hear about your findings and will gladly add some info / link to the articles once you find your way through if you agree!

Regardless of replication, I can tell you that the scripts tend to serve us really well so far, we run them weekly and didn't have to touch them yet, they seem to just work!

chuckvincent • Oct 23 '20

after spending several days with the gh-ost docs all my questions were answered. thus RTFM in full. If anyone else is trying this beware of extra bin logs created, I ran out of disk space and crashed a master mysql server because I did not account for extra bin log space, i had enough space to gh-ost my large table (27M rows @ 500GB), but the extra logs that were required to maintain the slave servers put be over the top.

Matouš Borák NejŘemeslníci • Oct 23 '20

Ah, yeah, I wrote about that too but it's very easy to miss a critical detail like this, among all the info. Hope your server recovered well!