What I want to do
- I want to use BigQuery to analyze data from Google Analytics and other services.
- To do this, I want to transfer data from Amazon Aurora to BigQuery.
- Transferring the data from Amazon Aurora to BigQuery about once a day is frequent enough.
- I don't want to add too much load to Aurora.
- I want to write as little code as possible.
Policy
To reduce the load on Aurora, I thought I could use the daily snapshots that Aurora takes automatically. Luckily, these snapshots can be exported to S3 (a feature that became available around January 2020). The export is written in PARQUET format, which is very efficient in terms of data volume.
https://docs.aws.amazon.com/ja_jp/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html
If you have a PARQUET file in S3, you can load it into BigQuery using the BigQuery Data Transfer Service, which can read PARQUET files from S3 and send them to BigQuery. It can also run on a schedule, as long as the interval is at least 24 hours (e.g. daily). The service itself is free of charge, although you pay for the data transfer.
https://cloud.google.com/bigquery-transfer/docs/s3-transfer-intro
Combining the two, the process looks like this:
- launch Lambda, etc. at any time on a daily basis
- export the daily snapshot that Amazon Aurora is taking automatically to S3
- load the file with GCP's BigQuery Data Transfer Service
I recommend trying this flow from the console before implementing it, so you can see how the data is exported to S3, what is required for the export, and how the Transfer Service works. Basically you just follow the documentation, but there were a few things I got stuck on.
Try it from the console
I poked around in the console, and it was easy to dump the Aurora snapshot to S3. It took a long time, though: about 30 minutes. I suspect converting the file format takes time, but I don't know the details.
When I tried to read one of the PARQUET files with the Transfer Service, I got an error like `PERMISSION_DENIED: [AccessDenied: Access Denied]` and could not transfer it.
- Having `S3ReadOnlyAccess` should have been enough (that is what the documentation says).
- AWS CloudTrail showed an error saying `kms:Decrypt` was missing.
- Adding `kms:Decrypt` to the permissions of the AWS user used by the BigQuery Data Transfer Service fixed it.
  - The target resource is the `KmsKeyId` specified in the `StartExportTask` described later (👇).
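For reference, a minimal policy statement for the Transfer Service's AWS user might look like this (the key ARN is a placeholder you need to replace with your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    }
  ]
}
```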
If you do this, the transfer succeeds. By the way, you need to create the dataset and table in BigQuery beforehand. The data always seems to be imported with `WRITE_APPEND`, so it is better to partition the table by import time. I started by creating an empty table with no schema.
When the data arrived in BigQuery, the schema types were reasonably consistent, and all columns were nullable. As far as I can tell, the S3 URI setting for a transfer allows wildcards, but everything ends up shoved into a single table, so I concluded that I need to set up one transfer per Aurora table.
Incidentally, when I tried using wildcards to shove multiple RDB tables into one BigQuery table, the file transfer from S3 to GCP worked, but the load did not (0 jobs succeeded).
Automate
Aurora Snapshot to S3
To automate exporting the Aurora snapshot to S3, we need to write some code. Implemented in TypeScript, it looks like this:
```typescript
import { DescribeDBClusterSnapshotsCommand, RDSClient, StartExportTaskCommand } from '@aws-sdk/client-rds'

export const requestToExportSnapshotToS3 = async () => {
  const client = new RDSClient({ region: "Aurora's region" })
  // Retrieve the most recent snapshot
  const describeCommand = new DescribeDBClusterSnapshotsCommand({
    DBClusterIdentifier: "Aurora's DB cluster ID",
  })
  const describeResult = await client.send(describeCommand)
  if (!describeResult.DBClusterSnapshots) return
  const latestSnapshot = describeResult.DBClusterSnapshots.sort(
    // Sort in descending order of creation time
    (a, b) => (b.SnapshotCreateTime?.getTime() ?? 0) - (a.SnapshotCreateTime?.getTime() ?? 0)
  )[0]
  const snapshotArn = latestSnapshot?.DBClusterSnapshotArn
  if (!snapshotArn) return
  // Request the export
  // More details in the 👇 documentation
  // https://docs.aws.amazon.com/AmazonRDS/latest/APIReference/API_StartExportTask.html
  // The task ID must contain a date, because only date/time variables can be used in the Transfer Service.
  const taskID = `export-id-YYYYMMDD`
  const exportCommand = new StartExportTaskCommand({
    ExportTaskIdentifier: taskID,
    // This role gets created when you run an S3 export from the console 👇, so it is easy to reuse it.
    IamRoleArn: 'Role for S3 export',
    // You can narrow the export down to a database, table, etc.
    ExportOnly: ['database_name'],
    S3BucketName: 'bucket to export to',
    // Any KMS key would do, but I chose the key used for DB encryption
    KmsKeyId: 'ID of the encryption key',
    SourceArn: snapshotArn,
  })
  return client.send(exportCommand)
}
```
You can run this on a schedule with CloudWatch Events and Lambda. The important thing is the set of permissions on the role that executes the process.
- Add the role that executes the export (the user/role that calls the SDK) to the key users of the KMS key used for DB encryption.
  - Otherwise you get `KMSKeyNotAccessibleFault`.
- Allow the following on the role that executes the export 👆:
  - `kms:DescribeKey` (with the above key as the target resource)
  - `kms:CreateGrant` (with the above key as the target resource)
  - `iam:PassRole` (required when running `StartExportTask`)
  - `rds:StartExportTask`
  - `rds:DescribeDBClusterSnapshots` (to get the ARN of the target snapshot)

With these granted, the export runs.
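As a rough sketch, an inline policy for that role could look like the following (the ARNs are placeholders, and you may want to scope the `rds:*` actions more tightly than `*`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["kms:DescribeKey", "kms:CreateGrant"],
      "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<account-id>:role/<s3-export-role>"
    },
    {
      "Effect": "Allow",
      "Action": ["rds:StartExportTask", "rds:DescribeDBClusterSnapshots"],
      "Resource": "*"
    }
  ]
}
```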
Configure BigQuery Data Transfer Service
The only thing you really need to think about is the S3 URI. For the implementation above 👆, it looks like this. The `run_time` variable is in UTC, so depending on when the transfer runs, you need to make it match the `YYYYMMDD` part of the taskID you set above 👆.
s3://your_bucket_name/export-id-{run_time|"%Y%m%d"}/database_name/schema_name.table_name/*.parquet
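To make the task ID line up with `{run_time|"%Y%m%d"}`, the date part has to be formatted as YYYYMMDD in UTC. A small helper for this (the function name is my own) could look like this:

```typescript
// Format a Date as YYYYMMDD in UTC, matching the Transfer Service's
// run_time placeholder ({run_time|"%Y%m%d"}), which is also in UTC.
const toYyyymmddUtc = (d: Date): string => {
  const y = d.getUTCFullYear()
  const m = String(d.getUTCMonth() + 1).padStart(2, '0')
  const day = String(d.getUTCDate()).padStart(2, '0')
  return `${y}${m}${day}`
}

// e.g. the export task ID for today's run:
const taskID = `export-id-${toYyyymmddUtc(new Date())}`
console.log(taskID)
```

The export code can then use this when building `ExportTaskIdentifier`, so that the S3 URI above resolves to the right prefix on each run.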
Now we can run it and the data will be in the BigQuery table we specified.
Conclusion
CloudTrail helped me a lot here; it is a great service for debugging permission issues.
While looking at the export from Aurora, I wondered whether the same could be done with DynamoDB. It would probably be more complicated than this case, because the exported data (👇) may require a format-conversion layer.
{"Item":{"id":{"S": "83369704-ccef-4163-8c20-24d240c10806"}, "name":{"S": "Leopoldine Blümel-Mude"}}
{"Item":{"id":{"S": "1f187e30-b279-4bfa-85dc-e347092f364f"}, "name":{"S": "Theo Trommler"}}
{"Item":{"id":{"S": "88942fa5-e8be-4b29-8869-6abdc662eaea"}, "name":{"S": "Herr Franz-Peter Thanel B.A."}}
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_ExportTableToPointInTime.html