Introduction
In a previous project, we used Amazon Aurora as our business database and synchronized this data to BigQuery for use as an analytics platform.
We had an opportunity to improve the performance of this synchronization process, and I'd like to share what we did.
Target Audience
This article is intended for:
- Those who want to build an analytics platform with BigQuery
- Those who want to learn about BigQuery security measures such as access restrictions
- Those who want to know practical examples of data masking implementation
Background
In that project, we used Amazon Aurora as our application database. Since we stored sensitive information such as customer data, access to the database was restricted to specific operational terminals and limited operational members only.
Meanwhile, the development team had the following needs, so each day we built a database with the sensitive information masked and provided it to them:
- Connect with backend applications for development
- Query the database for troubleshooting and debugging purposes
- Use as a data source for the data analytics platform
In the existing method, we created the masking database by cloning the production database and directly updating records with UPDATE statements.

We had been operating this way for some time, but the performance of the masking process had been degrading.
One reason was table lock contention during data masking.
When records are updated directly with UPDATE statements, the tables involved have to be locked so that they stay consistent with related tables.
The masking process ran UPDATE statements against every record of almost every table in the database, so table locks occurred frequently and write waits piled up.
The masking database was also used as a data source for the data analytics platform in addition to development and testing.
The analytics platform delivered KPI dashboards to management by a fixed time, but the masking process had become a bottleneck, and we were concerned that dashboard creation would soon start missing that deadline on a regular basis.
Our challenge, therefore, was to rework the masking mechanism and shorten the time needed to produce the KPI dashboards.
Improved Architecture
When considering improvement plans, we explored improvements from the following perspectives:
- Offload the work to storage other than Aurora
- Perform the masking by some means other than updating records in place
For the first perspective, since the analytics platform was using BigQuery as the data warehouse, we thought we could achieve this by transferring Aurora data to BigQuery.
For the second perspective, instead of rewriting values in the database, we could run the masking logic that had previously lived in UPDATE statements inside SELECT statements at read time.
Specifically, we adopted the following configuration:

We created a new GCP project for the masking database and built its equivalent in that project's BigQuery with the following flow:
- Export Aurora records to S3
- Transfer data from S3 to BigQuery tables using BigQuery Data Transfer Service
- Query Views that implement the masking logic on top of the transferred data, and feed the results into the analytics platform tables
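As a rough sketch, the first two steps could be driven with the AWS CLI and the bq tool as below. This is only an illustration: the snapshot ARN, bucket, IAM role, KMS key, dataset, and table names are placeholders, and the assumption of a snapshot-based export is ours (in practice these steps can also be orchestrated, for example from Cloud Workflows).
# Step 1: export Aurora data to S3 in Parquet format (snapshot-based export; all ARNs are placeholders)
aws rds start-export-task \
  --export-task-identifier masking-export-example \
  --source-arn arn:aws:rds:ap-northeast-1:123456789012:cluster-snapshot:app-db-snapshot \
  --s3-bucket-name example-masking-export \
  --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export \
  --kms-key-id arn:aws:kms:ap-northeast-1:123456789012:key/example-key

# Step 2: create a BigQuery Data Transfer Service config that loads the exported
# files from S3 into a BigQuery table (dataset and table names are placeholders)
bq mk --transfer_config \
  --data_source=amazon_s3 \
  --display_name="masked-db-user" \
  --target_dataset=raw_db \
  --params='{
    "data_path": "s3://example-masking-export/app-db/User/*.parquet",
    "destination_table_name_template": "User",
    "file_format": "PARQUET",
    "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY"
  }'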
The implemented View looks like this:
-- Conventional masking with an UPDATE statement
UPDATE User
SET Name = "***"; -- overwrite the Name column, which holds sensitive information, with "***" across all rows

-- Improved masking with a SELECT statement inside the View
SELECT
  "***" AS Name -- the same replacement, applied when the data is read instead of written
FROM
  User
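Such a View can be created in BigQuery however you prefer; here is a minimal sketch with the bq tool, assuming the transferred table lives in a raw_db dataset and the masked Views are exposed from a masked_db dataset (these names and the project ID are placeholders):
# Create a masking View over the transferred table; developers and the analytics
# platform query the View, never the raw table
bq mk \
  --use_legacy_sql=false \
  --view 'SELECT
            "***" AS Name,       -- mask the sensitive column uniformly
            u.* EXCEPT (Name)    -- pass the remaining columns through unchanged
          FROM `my-project.raw_db.User` AS u' \
  masked_db.User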
These changes brought the following performance gains and cut the processing time from about 3 hours to about 1 hour:
- Improved overall query performance by offloading to BigQuery
- Eliminated the need to synchronize with other tables by not writing values directly
While performance improved, this configuration meant storing sensitive information in BigQuery.
In the next chapter, I'll introduce the security measures we implemented when storing sensitive information.
Implemented Security Measures
Applying Access Restrictions to BigQuery
To treat BigQuery equivalently to Aurora, we needed to apply the following access restrictions to datasets containing personal information:
- Only infrastructure members and system accounts necessary for the service can access
- Accessible only from specific operational terminals
- BigQuery data export is restricted to the analytics platform project
The first requirement could be addressed with IAM restrictions, but we needed to consider another method for the second requirement.
VPC Service Controls is a service for managing access to GCP resources. With it, we can place GCP resources inside a service perimeter and restrict access to their APIs by allowed IP addresses and IAM principals.
We used this service to allow operations on production resources only through ingress rules that permit the operational members' identities and the global IP address used by the operational terminals.
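As a rough sketch of what this could look like with gcloud (the access policy ID, project number, IP address, member, and resource names are all placeholders, and the exact rule layout in our environment differed in detail):
# Access level that matches operational members connecting from the operational terminals' global IP
cat > conditions.yaml <<'EOF'
- ipSubnetworks:
  - 203.0.113.10/32
  members:
  - user:ops-member@example.com
EOF

gcloud access-context-manager levels create ops_terminal \
  --title="ops terminal" \
  --basic-level-spec=conditions.yaml \
  --policy=POLICY_ID

# Service perimeter that protects BigQuery in the masking-database project and
# only admits requests that satisfy the access level above
gcloud access-context-manager perimeters create masking_db_perimeter \
  --title="masking db perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=bigquery.googleapis.com \
  --access-levels=ops_terminal \
  --policy=POLICY_ID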

We also deploy BigQuery resources with Terraform Cloud, so we needed to change its execution environment to fit these access restrictions.
Terraform Cloud lets you choose where commands run; by default they run inside Terraform Cloud's own environment, whose global IP changes dynamically.
To pin the global IP permitted by the ingress rules, we decided to use a Terraform Cloud Agent, which lets you run terraform commands in an environment you manage yourself.
We hosted the Agent on a GCE instance, running it as a container inside the instance.
The Agent itself and its detailed configuration are covered in the official documentation; by running the following shell script as a GCE startup script, using the Agent name and token obtained beforehand, the Agent starts whenever the instance boots:
#!/bin/bash
# Install the prerequisites for Docker's apt repository
sudo apt-get update && sudo apt-get install -y \
ca-certificates \
curl \
gnupg \
lsb-release
# Add Docker's official GPG key and apt repository
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
sudo apt-get update && sudo apt-get install -y \
containerd.io \
docker-ce \
docker-ce-cli \
docker-compose-plugin
# Start the Terraform Cloud Agent container.
# ${tfc_agent_token} and ${tfc_agent_name} are filled in when the startup script
# is rendered (e.g. by Terraform's templatefile); they are not shell variables.
TFC_AGENT_TOKEN=${tfc_agent_token} \
TFC_AGENT_NAME=${tfc_agent_name} \
docker run -d -e TFC_AGENT_TOKEN -e TFC_AGENT_NAME hashicorp/tfc-agent:latest
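For reference, the instance itself can be created however you like; one hedged example with gcloud, passing the rendered script above as the startup script (the instance name, zone, machine type, and file name are placeholders):
# Create the GCE instance that hosts the Terraform Cloud Agent
gcloud compute instances create tfc-agent \
  --zone=asia-northeast1-a \
  --machine-type=e2-small \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --metadata-from-file=startup-script=startup.sh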
For the third requirement, since the ETL within GCP was orchestrated with Cloud Workflows, we solved it with egress rules that permit the Cloud Workflows service account and the analytics platform project.
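A hedged sketch of such an egress rule with gcloud (the perimeter name, service account, and project number are placeholders):
# Allow only the Cloud Workflows service account to move BigQuery data out of
# the perimeter, and only into the analytics platform project
cat > egress.yaml <<'EOF'
- egressFrom:
    identities:
    - serviceAccount:workflows-etl@my-project.iam.gserviceaccount.com
  egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    resources:
    - projects/ANALYTICS_PROJECT_NUMBER
EOF

gcloud access-context-manager perimeters update masking_db_perimeter \
  --set-egress-policies=egress.yaml \
  --policy=POLICY_ID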
Using Service Accounts Without Issuing Keys
To access GCP resources from Terraform Cloud and AWS, we needed service accounts that could operate resources from outside GCP.
One way to use a service account from outside GCP, such as from another cloud, is to issue a service account key, but we decided not to for the following reasons:
- If the key leaks, anyone holding it can act with the service account's permissions
- Keys have no expiration date, so rotation has to be managed manually
Instead, we used Workload Identity to operate GCP resources from AWS and Terraform Cloud without issuing service account keys.
For example, when using Workload Identity with Terraform Cloud, the flow is as follows:

Following the official documentation, authentication is performed with the following steps:
- Terraform Cloud sends an authentication token to GCP when terraform commands are executed
- GCP verifies the token and, if it is valid, issues a temporary token and returns it to Terraform Cloud
- The temporary token is set in environment variables and the commands are executed
- The temporary token is discarded once the commands finish
The configuration method follows the GCP Configuration guide. The procedure roughly consists of:
- Create a Workload Identity Pool and Provider (a gcloud sketch follows this list)
  - The attribute mapping values can be confusing, but the sample code in the guide is helpful
- Create a service account and grant necessary permissions
- Set the following environment variables in Terraform Cloud:
  - TFC_GCP_PROJECT_NUMBER: the GCP project number
  - TFC_GCP_PROVIDER_AUTH: true
  - TFC_GCP_RUN_SERVICE_ACCOUNT_EMAIL: email of the service account to use
  - TFC_GCP_WORKLOAD_POOL_ID: Workload Identity Pool ID
  - TFC_GCP_WORKLOAD_PROVIDER_ID: Workload Identity Provider ID
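As a hedged sketch of the first step with gcloud (the pool, provider, organization, service account, and project identifiers are all placeholders; the claim names come from Terraform Cloud's workload identity tokens):
# Create the pool and an OIDC provider that trusts Terraform Cloud's token issuer
gcloud iam workload-identity-pools create tfc-pool \
  --location=global \
  --display-name="Terraform Cloud"

gcloud iam workload-identity-pools providers create-oidc tfc-provider \
  --location=global \
  --workload-identity-pool=tfc-pool \
  --issuer-uri="https://app.terraform.io" \
  --attribute-mapping="google.subject=assertion.sub,attribute.terraform_organization_name=assertion.terraform_organization_name" \
  --attribute-condition="assertion.terraform_organization_name=='my-org'"

# Let identities from the pool impersonate the service account used by Terraform
gcloud iam service-accounts add-iam-policy-binding tfc-deploy@my-project.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/tfc-pool/*"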
By using short-lived credentials instead of service account keys in this way, we avoided the risk of key leakage.
Conclusion
By migrating the masking database from Amazon Aurora to BigQuery, we were able to reduce processing time from about 3 hours to about 1 hour.
Additionally, by utilizing VPC Service Controls and Workload Identity, we were able to store sensitive information in BigQuery while ensuring security.
I hope this will be helpful for those who have similar challenges.