
Adheeban Manoharan


🗃️ Lessons Learned from Migrating Huge BigQuery Datasets Across Regions

TL;DR We were tasked with migrating a large dataset from the US to the EU due to the client's EU-based operations and GDPR compliance requirements. In this post, I’ll walk you through the challenges I faced while transferring a BigQuery dataset across regions. Since Google's documentation on this subject is quite limited, I believe this guide will be invaluable for anyone attempting a similar migration.

Prelude

When tasked with migrating datasets across the globe (for example, from the US to the EU region), we had to address three key questions before starting the process:

What limitations does Google impose on data exports?

Understanding the constraints of Google's infrastructure was crucial to avoid unforeseen issues.

What volume of data are we dealing with?

The size and complexity of the dataset would significantly impact the migration strategy.

How can we ensure zero downtime during the migration?

Minimizing disruption to services while the data was being transferred was a top priority.

Mind you, our task was to recreate the same dataset, with the same name and the same set of tables, in a different region. Now, let's go over these questions one by one.


Google's Limitations

Google's BigQuery inherently has some limitations (at the time of writing this blog) due to its architecture. Let me list them down. Some of them might sound pretty obvious, but I'm still listing them so that we have the full context when we go through how we solved this migration problem.

  • You cannot create a second dataset with the same name within the same project.
  • You cannot create a multi-regional dataset in Asia; multi-regional locations are only available for the US and EU.
  • You cannot export the contents of a BigQuery table that resides in the US region to a GCS bucket that resides in the EU region (cross-regional export).
  • When you create a dataset, it usually takes about 20-120 seconds before it becomes visible to the APIs that consume it; until then you'd get 404s.
  • Google doesn't offer support for migrating these datasets across regions, so we had to do it ourselves.

Data Size

Our application stores audit data in BigQuery, serving as our cold storage. Given the nature of audit logs, the datasets have grown to a colossal size. For example, some tables, even in their compressed form (Parquet format with Snappy compression), contained up to 4 terabytes of data. Since this is critical audit data, ensuring zero data loss was absolutely essential during the migration.

Zero Downtime

As mentioned earlier, our audit system writes data continuously to these BigQuery datasets. This means that migrating a customer’s data without downtime or prior notification could result in data loss during the migration window. In our case, it was crucial that the migration proceeded seamlessly, with zero downtime, so as not to disrupt the customer's operations.


How it actually went down

After gathering all the information we needed, below is the approach we finalised. For this illustration, let's assume we are migrating a US multi-regional dataset to an EU multi-regional location.

BigQuery Migration

Yes, I hear you. I know how confusing this image is. I suck at these engineering illustrations. So let me explain.

1) Export Table Data as Parquet:

We initiated the migration by exporting the table data from the US dataset to a Google Cloud Storage (GCS) bucket in the US region. The data was saved in Parquet format, which is ideal for both storage and transfer due to its efficient compression.
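Here's a rough sketch of what that export looks like with the google-cloud-bigquery Python client. The project, dataset, and bucket names are placeholders, not our actual setup:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Export every table in the US dataset to a US bucket as Snappy-compressed Parquet
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET,
    compression=bigquery.Compression.SNAPPY,
)

for table in client.list_tables("my-project.audit_dataset"):
    # The wildcard URI lets BigQuery shard large tables into multiple files
    destination_uri = f"gs://us-export-bucket/{table.table_id}/*.parquet"
    extract_job = client.extract_table(
        table.reference, destination_uri, job_config=job_config
    )
    extract_job.result()  # wait for the export to finish
```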

2) Extract Table Schema and Partition Details:

Along with the data, we extracted the table schema, partition and clustering details, saving them as JSON files. This ensured that the tables would be accurately recreated during the migration.
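Roughly, capturing the schema and table layout per table looks like this. The names are placeholders again, and the layout JSON format is just a convention I'm using for the sketch:

```python
import json
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

for table_item in client.list_tables("my-project.audit_dataset"):
    table = client.get_table(table_item.reference)

    # Field definitions go to a JSON file that can be reused by load jobs later
    client.schema_to_json(table.schema, f"{table.table_id}_schema.json")

    # Partitioning and clustering settings are captured separately
    layout = {
        "time_partitioning": (
            table.time_partitioning.to_api_repr() if table.time_partitioning else None
        ),
        "clustering_fields": table.clustering_fields,
    }
    with open(f"{table.table_id}_layout.json", "w") as fh:
        json.dump(layout, fh)
```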

3) Transfer Data to the EU bucket:

Using the Google Cloud Storage Transfer Service, we moved the exported data from the US bucket to a bucket in the EU region. This step circumvented BigQuery’s restriction on direct cross-regional exports. We chose Storage Transfer because it was the most efficient option Google offers for moving data from one bucket to another.
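A minimal sketch of kicking off such a one-time transfer with the google-cloud-storage-transfer Python client, assuming placeholder project and bucket names:

```python
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

# One-off transfer job copying the exported Parquet files from the US bucket to the EU bucket
transfer_job = client.create_transfer_job(
    {
        "transfer_job": {
            "project_id": "my-project",  # hypothetical project ID
            "description": "BQ migration: US -> EU",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": "us-export-bucket"},
                "gcs_data_sink": {"bucket_name": "eu-import-bucket"},
            },
        }
    }
)

# Run the job immediately instead of waiting for a schedule
client.run_transfer_job({"job_name": transfer_job.name, "project_id": "my-project"})
```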

4) Create a Temporary Dataset:

A temporary dataset was created in the EU region to serve as an intermediary during the migration. This dataset mirrored the structure of the original tables from the US dataset. The application would write its data to this dataset during the migration window.
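Creating the dataset itself is a single call, but because of the availability lag mentioned in the limitations above, it's worth polling until the dataset is actually visible before doing anything with it. A sketch with placeholder names (the temporary tables can then be created from the schemas saved in step 2):

```python
import time

from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

temp_dataset = bigquery.Dataset("my-project.audit_dataset_tmp")  # hypothetical name
temp_dataset.location = "EU"
client.create_dataset(temp_dataset, exists_ok=True)

# Newly created datasets can take 20-120 seconds to show up for other APIs,
# so poll until the dataset is actually readable before moving on.
for _ in range(30):
    try:
        client.get_dataset("my-project.audit_dataset_tmp")
        break
    except NotFound:
        time.sleep(10)
```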

5) Enable the Migration Flag:

At this stage, the migration flag was enabled in the application, ensuring that all new data writes occurred in the temporary EU dataset. This step allowed us to maintain zero downtime during the migration, as the application seamlessly redirected its writes without any interruption to the customer’s operations.
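The flag itself lives inside our application, so this part is necessarily hand-wavy. Purely as an illustration of the idea, with every name here being hypothetical:

```python
from google.cloud import bigquery

# Hypothetical application-side routing -- in reality the flag lives in our config
# and the write path is more involved. This only illustrates the redirection idea.
MIGRATION_IN_PROGRESS = True  # the "migration flag" from step 5

client = bigquery.Client(project="my-project")  # hypothetical project ID


def target_dataset() -> str:
    # While the flag is on, writes land in the temporary EU dataset;
    # once it's turned off (step 8), they go to the recreated permanent dataset.
    return "audit_dataset_tmp" if MIGRATION_IN_PROGRESS else "audit_dataset"


def write_audit_rows(rows: list[dict]) -> None:
    table_id = f"my-project.{target_dataset()}.audit_events"  # hypothetical table
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```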

6) Delete the US Dataset:

Once we verified that the application was writing to the EU region without any issues, the old US dataset was deleted to comply with GDPR and avoid any confusion or conflict with the migrated data.
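The deletion itself is a single client call; a sketch with placeholder names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Drop the original US dataset and every table inside it
client.delete_dataset(
    "my-project.audit_dataset",
    delete_contents=True,  # required, since the dataset still holds tables
    not_found_ok=True,
)
```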

7) Create a Permanent EU Dataset:

A new permanent dataset was created in the EU region. This dataset would become the final destination for the migrated data, using the schemas saved in step 2 to ensure the original dataset was replicated exactly. After creating this dataset, the data in the EU bucket was loaded into the newly created EU dataset's tables.
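Creating the dataset works just like the temporary one in step 4; the per-table load looks roughly like this, reusing the schema and layout JSON from step 2. All names are placeholders:

```python
import json
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

table_id = "my-project.audit_dataset.audit_events"  # hypothetical table
schema = client.schema_from_json("audit_events_schema.json")

with open("audit_events_layout.json") as fh:
    layout = json.load(fh)

# Recreate the table with the original schema, partitioning and clustering,
# then load the Parquet files that were transferred into the EU bucket.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=schema,
    clustering_fields=layout["clustering_fields"],
)
if layout["time_partitioning"]:
    job_config.time_partitioning = bigquery.TimePartitioning.from_api_repr(
        layout["time_partitioning"]
    )

load_job = client.load_table_from_uri(
    "gs://eu-import-bucket/audit_events/*.parquet", table_id, job_config=job_config
)
load_job.result()  # wait for the load to finish
```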

8) Turn off the Migration Flag:

Since a new EU dataset with the same name now exists, the migration flag that was enabled in step 5 can be safely disabled, making the application redirect its writes to the new EU dataset as it rightfully should.

9) Export Data from the Temporary Dataset:

During the migration window, the interim application data would have been written to the temporary dataset, so this data also has to make its way into the permanent EU dataset. The data in the temporary dataset was therefore exported to the EU bucket, using the same export flow as in step 1.

10) Import Data into the Permanent EU Dataset:

The exported temporary data in the EU bucket was then imported into the EU dataset in the same flow of operations as step 7. This marked the end of the migration process for the customer’s data.

11) Cleanup:

After ensuring that the migration was successful, the contents of the two buckets (US and EU), along with the temporary dataset, were deleted. Until then, this data served as a fallback.
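A sketch of that cleanup, again with placeholder names. For terabytes of objects a bucket lifecycle rule or `gsutil -m rm -r` would be faster than deleting blobs one by one, but this shows the idea:

```python
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-project")    # hypothetical project ID
gcs = storage.Client(project="my-project")

# The temporary dataset has served its purpose
bq.delete_dataset(
    "my-project.audit_dataset_tmp", delete_contents=True, not_found_ok=True
)

# Empty and remove both staging buckets
for bucket_name in ("us-export-bucket", "eu-import-bucket"):
    for blob in gcs.list_blobs(bucket_name):
        blob.delete()
    gcs.bucket(bucket_name).delete()
```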


Conclusion

Thanks for sticking with me till the very end! I know this might not be the most efficient way out there, but it ensured zero data loss and zero downtime during this migration activity for us. There is very little documentation on this, and browsing endlessly for it is what made me write this blog. I'd love to hear your opinions on this.

Cheers, see you in the next one! 😉
