Introduction
In a previous project, we used Amazon Aurora as our business database and synchronized this data to BigQuery for use as an analytics platform.
We had an opportunity to improve the performance of this synchronization process, and I'd like to share what we did.
Target Audience
This article is intended for:
- Those who want to build an analytics platform with BigQuery
- Those who want to learn about BigQuery security measures such as access restrictions
- Those who want to know practical examples of data masking implementation
Background
In that project, we used Amazon Aurora as our application database. Since we stored sensitive information such as customer data, access to the database was restricted to specific operational terminals and limited operational members only.
Meanwhile, the development team had the following needs, so each day we built a database with the sensitive information masked and provided it to them:
- Connect with backend applications for development
- Query the database for troubleshooting and debugging purposes
- Use as a data source for the data analytics platform
In the existing method, we created the masking database by cloning the production database and directly updating records with UPDATE statements.

We had been operating this way for some time, but the performance of the masking process had been degrading.
One reason was table lock contention during data masking.
When records are updated directly with UPDATE statements, the tables involved have to be locked so that they stay consistent with related tables.
The masking process ran UPDATE statements against every record of almost every table in the database, so table locks occurred frequently and write waits piled up.
The masking database was also used as a data source for the data analytics platform in addition to development and testing.
The analytics platform delivered KPI dashboards to management by a fixed time, but the masking process had become a bottleneck, and we were concerned that dashboard creation would soon start missing that deadline on a regular basis.
Our challenge, therefore, was to rework the masking mechanism and shorten the time needed to produce the KPI dashboards.
Improved Architecture
When considering improvement plans, we explored improvements from the following perspectives:
- Offload the work to storage other than Aurora
- Perform the masking by some means other than updating records in place
For the first perspective, since the analytics platform was using BigQuery as the data warehouse, we thought we could achieve this by transferring Aurora data to BigQuery.
For the second perspective, instead of rewriting values in the database, we could run the masking logic that had previously lived in UPDATE statements inside SELECT statements at read time.
Specifically, we adopted the following configuration:

We created a new GCP project for the masking database and built its equivalent in that project's BigQuery with the following flow:
- Export Aurora records to S3
- Transfer data from S3 to BigQuery tables using BigQuery Data Transfer Service
- Query Views that implement the masking logic on top of the transferred data, and feed the results into the analytics platform tables
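As a rough sketch, the first two steps could be driven with the AWS CLI and the bq tool as below. This is only an illustration: the snapshot ARN, bucket, IAM role, KMS key, dataset, and table names are placeholders, and the assumption of a snapshot-based export is ours (in practice these steps can also be orchestrated, for example from Cloud Workflows).
# Step 1: export Aurora data to S3 in Parquet format (snapshot-based export; all ARNs are placeholders)
aws rds start-export-task \
  --export-task-identifier masking-export-example \
  --source-arn arn:aws:rds:ap-northeast-1:123456789012:cluster-snapshot:app-db-snapshot \
  --s3-bucket-name example-masking-export \
  --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export \
  --kms-key-id arn:aws:kms:ap-northeast-1:123456789012:key/example-key

# Step 2: create a BigQuery Data Transfer Service config that loads the exported
# files from S3 into a BigQuery table (dataset and table names are placeholders)
bq mk --transfer_config \
  --data_source=amazon_s3 \
  --display_name="masked-db-user" \
  --target_dataset=raw_db \
  --params='{
    "data_path": "s3://example-masking-export/app-db/User/*.parquet",
    "destination_table_name_template": "User",
    "file_format": "PARQUET",
    "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY"
  }'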
The implemented View looks like this:
-- Conventional masking with an UPDATE statement
UPDATE User
SET Name = "***"; -- overwrite the Name column, which holds sensitive information, with "***" across all rows

-- Improved masking with a SELECT statement inside the View
SELECT
  "***" AS Name -- the same replacement, applied when the data is read instead of written
FROM
  User
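Such a View can be created in BigQuery however you prefer; here is a minimal sketch with the bq tool, assuming the transferred table lives in a raw_db dataset and the masked Views are exposed from a masked_db dataset (these names and the project ID are placeholders):
# Create a masking View over the transferred table; developers and the analytics
# platform query the View, never the raw table
bq mk \
  --use_legacy_sql=false \
  --view 'SELECT
            "***" AS Name,       -- mask the sensitive column uniformly
            u.* EXCEPT (Name)    -- pass the remaining columns through unchanged
          FROM `my-project.raw_db.User` AS u' \
  masked_db.User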
These changes brought the following performance gains and cut the processing time from about 3 hours to about 1 hour:
- Improved overall query performance by offloading to BigQuery
- Eliminated the need to synchronize with other tables by not writing values directly
While performance improved, this configuration meant storing sensitive information in BigQuery.
In the next chapter, I'll introduce the security measures we implemented when storing sensitive information.
Implemented Security Measures
Applying Access Restrictions to BigQuery
To treat BigQuery equivalently to Aurora, we needed to apply the following access restrictions to datasets containing personal information:
- Only infrastructure members and system accounts necessary for the service can access
- Accessible only from specific operational terminals
- BigQuery data export is restricted to the analytics platform project
The first requirement could be addressed with IAM restrictions, but we needed to consider another method for the second requirement.
VPC Service Controls is a service for managing access to GCP resources. With it, we can place GCP resources inside a service perimeter and restrict access to their APIs by allowed IP addresses and IAM principals.
We used this service to allow operations on production resources only through ingress rules that permit the operational members' identities and the global IP address used by the operational terminals.
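As a rough sketch of what this could look like with gcloud (the access policy ID, project number, IP address, member, and resource names are all placeholders, and the exact rule layout in our environment differed in detail):
# Access level that matches operational members connecting from the operational terminals' global IP
cat > conditions.yaml <<'EOF'
- ipSubnetworks:
  - 203.0.113.10/32
  members:
  - user:ops-member@example.com
EOF

gcloud access-context-manager levels create ops_terminal \
  --title="ops terminal" \
  --basic-level-spec=conditions.yaml \
  --policy=POLICY_ID

# Service perimeter that protects BigQuery in the masking-database project and
# only admits requests that satisfy the access level above
gcloud access-context-manager perimeters create masking_db_perimeter \
  --title="masking db perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=bigquery.googleapis.com \
  --access-levels=ops_terminal \
  --policy=POLICY_ID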

We also deploy BigQuery resources with Terraform Cloud, so we needed to change its execution environment to fit these access restrictions.
Terraform Cloud lets you choose where commands run; by default they run inside Terraform Cloud's own environment, whose global IP changes dynamically.
To pin the global IP permitted by the ingress rules, we decided to use a Terraform Cloud Agent, which lets you run terraform commands in an environment you manage yourself.
We hosted the Agent on a GCE instance, running it as a container inside the instance.
The Agent itself and its detailed configuration are covered in the official documentation; by running the following shell script as a GCE startup script, using the Agent name and token obtained beforehand, the Agent starts whenever the instance boots:
#!/bin/bash
# Install the prerequisites for Docker's apt repository
sudo apt-get update && sudo apt-get install -y \
ca-certificates \
curl \
gnupg \
lsb-release
# Add Docker's official GPG key and apt repository
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
sudo apt-get update && sudo apt-get install -y \
containerd.io \
docker-ce \
docker-ce-cli \
docker-compose-plugin
# Start the Terraform Cloud Agent container.
# ${tfc_agent_token} and ${tfc_agent_name} are filled in when the startup script
# is rendered (e.g. by Terraform's templatefile); they are not shell variables.
TFC_AGENT_TOKEN=${tfc_agent_token} \
TFC_AGENT_NAME=${tfc_agent_name} \
docker run -d -e TFC_AGENT_TOKEN -e TFC_AGENT_NAME hashicorp/tfc-agent:latest
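For reference, the instance itself can be created however you like; one hedged example with gcloud, passing the rendered script above as the startup script (the instance name, zone, machine type, and file name are placeholders):
# Create the GCE instance that hosts the Terraform Cloud Agent
gcloud compute instances create tfc-agent \
  --zone=asia-northeast1-a \
  --machine-type=e2-small \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --metadata-from-file=startup-script=startup.sh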
For the third requirement, since the ETL within GCP was orchestrated with Cloud Workflows, we solved it with egress rules that permit the Cloud Workflows service account and the analytics platform project.
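A hedged sketch of such an egress rule with gcloud (the perimeter name, service account, and project number are placeholders):
# Allow only the Cloud Workflows service account to move BigQuery data out of
# the perimeter, and only into the analytics platform project
cat > egress.yaml <<'EOF'
- egressFrom:
    identities:
    - serviceAccount:workflows-etl@my-project.iam.gserviceaccount.com
  egressTo:
    operations:
    - serviceName: bigquery.googleapis.com
      methodSelectors:
      - method: "*"
    resources:
    - projects/ANALYTICS_PROJECT_NUMBER
EOF

gcloud access-context-manager perimeters update masking_db_perimeter \
  --set-egress-policies=egress.yaml \
  --policy=POLICY_ID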
Using Service Accounts Without Issuing Keys
To access GCP resources from Terraform Cloud and AWS, we needed service accounts that could operate resources from outside GCP.
One way to use a service account from outside GCP, such as from another cloud, is to issue a service account key, but we decided not to for the following reasons:
- If the key leaks, anyone holding it can act with the service account's permissions
- Keys have no expiration date, so rotation has to be managed manually
Instead, we used Workload Identity to operate GCP resources from AWS and Terraform Cloud without issuing service account keys.
For example, when using Workload Identity with Terraform Cloud, the flow is as follows:

Following the official documentation, authentication is performed with the following steps:
- Terraform Cloud sends an authentication token to GCP when terraform commands are executed
- GCP verifies the token and, if it is valid, issues a temporary token and returns it to Terraform Cloud
- The temporary token is set in environment variables and the commands are executed
- The temporary token is discarded once the commands finish
The configuration method follows the GCP Configuration guide. The procedure roughly consists of:
- Create a Workload Identity Pool and Provider (a gcloud sketch follows this list)
  - The attribute mapping values can be confusing, but the sample code in the guide is helpful
- Create a service account and grant necessary permissions
- Set the following environment variables in Terraform Cloud:
  - TFC_GCP_PROJECT_NUMBER: the GCP project number
  - TFC_GCP_PROVIDER_AUTH: true
  - TFC_GCP_RUN_SERVICE_ACCOUNT_EMAIL: email of the service account to use
  - TFC_GCP_WORKLOAD_POOL_ID: Workload Identity Pool ID
  - TFC_GCP_WORKLOAD_PROVIDER_ID: Workload Identity Provider ID
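As a hedged sketch of the first step with gcloud (the pool, provider, organization, service account, and project identifiers are all placeholders; the claim names come from Terraform Cloud's workload identity tokens):
# Create the pool and an OIDC provider that trusts Terraform Cloud's token issuer
gcloud iam workload-identity-pools create tfc-pool \
  --location=global \
  --display-name="Terraform Cloud"

gcloud iam workload-identity-pools providers create-oidc tfc-provider \
  --location=global \
  --workload-identity-pool=tfc-pool \
  --issuer-uri="https://app.terraform.io" \
  --attribute-mapping="google.subject=assertion.sub,attribute.terraform_organization_name=assertion.terraform_organization_name" \
  --attribute-condition="assertion.terraform_organization_name=='my-org'"

# Let identities from the pool impersonate the service account used by Terraform
gcloud iam service-accounts add-iam-policy-binding tfc-deploy@my-project.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/tfc-pool/*"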
By using short-lived credentials instead of service account keys in this way, we avoided the risk of key leakage.
Conclusion
By migrating the masking database from Amazon Aurora to BigQuery, we were able to reduce processing time from about 3 hours to about 1 hour.
Additionally, by utilizing VPC Service Controls and Workload Identity, we were able to store sensitive information in BigQuery while ensuring security.
I hope this will be helpful for those who have similar challenges.