Łukasz Budnik

Posted on May 4, 2021

Migrating to DynamoDB using Parquet and Glue

#database #aws #dynamodb #glue

I am preparing to migrate 1B records from AWS RDS PostgreSQL to AWS DynamoDB. Before moving 1B records I want to build some POCs using a smaller data set (101M records) to find the most optimal way of getting those records into AWS DynamoDB.

My first choice was obvious: AWS Data Migration Service. AWS DMS supports AWS RDS PostgreSQL as a source and AWS DynamoDB as a target. This can be setup very easily.

However, it took me a few iterations, trying out different instance types, different migration task parameters, to finally get to a stage where I was fully using 20000 WCU when writing data to DynamoDB.
I also noticed that AWS DMS was using PutItem operation rather than more efficient BatchWriteItem.

The fastest of my migrations tasks was:

I watched a few videos from the online AWS re:Invent about Glue and how super efficient it was in transforming and moving data around. I also knew that Glue added a support for writing to DynamoDB.

I wanted to try it out and see if it can be of any use for my migration project.

Export tables to AWS S3 in Parquet format

I used AWS Data Migration Service to export data to AWS S3 in Parquet format. Exporting table that has 101M records was quite fast and took 54m:

Glue & Glue Studio

I haven't used Glue before. In one of the AWS re:Invent videos I saw that AWS recently added Glue Studio where you can build jobs using visual editor and for more complex jobs you can add custom code.

I built a simple job in Glue Studio, but noticed that DynamoDB is not yet supported (in Studio). Doing a quick Google search I found a sample code right from the Glue documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb.

I pasted that code into my "Transformation - Custom code" block. I only had to change variables' names as Glue Studio generates code using camelCase and the documentation was using snake_case. Anyway it was just 1 operation: a call to glue_context.write_dynamic_frame_from_options() (I hope that in the near future it will be supported out of the box in Glue Studio).

And that was it. I ran the job and it took 3h 7m to complete:

In DynamoDB I saw all my 101M records:

Environment setup and summary

The environment looked like this:

VPC with VPC Endpoints to AWS S3 and AWS DynamoDB
AWS RDS PostgreSQL running in private subnet, db.m5.large
AWS DMS instance running in private subnet, dms.c5.2xlarge
DynamoDB table provisioned with 20000 WCU
AWS DMS migration task with ParallelLoadThreads set to 130 and ParallelLoadBufferSize set to 1000 was running at 74% CPU and was generating 20000 WCU
AWS Glue job had 10 workers, dynamodb.throughput.write.percent set to 1.0 was maxing 20000 WCU

The results were the following:

	DMS RDS to DynamoDB	DMS RDS to S3	Glue S3 to DynamoDB
RDS CPU	4%	11%	-
RDS Reads IOPS	100	350	-
RDS Connections	6	6	-
DMS CPU	74%	11%	-
DynamoDB WCU	20000	-	20000
Time	2h 51m	54m	3h 7m

Direct migration between AWS RDS to AWS DynamoDB using AWS DMS was 2h 51m.

Staged migration with RDS -> S3 Parquet -> DynamoDB was: 54m + 3h 7m = 5h 1m.

After all DMS was faster than the DMS + S3 + Glue trio.

As you can see in the table above, in both approaches the pressure on AWS RDS PostgreSQL instance was minimal.

There are still some optimisations I want to try, like speeding up the import to DynamoDB by setting WCU even higher. That will require some AWS DMS task parameters tuning and probably increasing the DMS instance size as well.

DMS

Pros of using DMS are:

it's fast, even with PutItem operations
one-stop shop for your migration tasks
can run additional validation logic for your migration

Cons of using DMS are:

need to rightsize the instance and tune task parameters to max DynamoDB writes

S3 Parquet and Glue

Pros of using S3 Parquet and Glue:

serverless solution - no need to manage the infrastructure
didn't have to do any job tuning, setting dynamodb.throughput.write.percent to 1.0 was all I had to do to max DynamoDB writes
you get some bonuses, think - data lake!

Cons of using S3 Parquet and Glue:

slower than native DMS migration

Bonus #1

Since I had the data already in AWS S3 in Parquet format I wanted to play around with AWS Athena. I used Athena before and already knew it was super fast, but I never had a chance to play around with 20 columns wide 101M records data set. The experience was mind blowing. Complex queries running in just a few seconds. The same queries running on the original AWS RDS PostgreSQL instance were taking 20 minutes to complete!

Bonus #2

Since I already had the data in AWS Athena database I wanted to go one step further and easily visualise the data I had using AWS QuickSight. It took me a few minutes to set up some Pie Charts and Heat maps and publish my dashboard. All by drag and dropping columns in the QuickSight UI.

Top comments (2)

thedevkim • May 12 '21

Hi. Can you post here your full job script ? just erase or put dummy data on the details. Thank you!

Łukasz Budnik • May 20 '21

Hi @kimsean ,

Sorry was on annual leave and wasn't checking my emails and notifications. No problem here is my code, I have replaced any account-specific details with XXX. I also removed the automatically generated comments to keep it short.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrameCollection
from awsglue.dynamicframe import DynamicFrame

# this is the only part that I had to write
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    glueContext.write_dynamic_frame_from_options(
        frame=dfc,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "XXX",
            "dynamodb.throughput.write.percent": "0.95"
        }
    )

# created automatically by AWS Glue Studio
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# DataSource - created automatically by Glue Studio
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", format = "parquet", connection_options = {"paths": ["s3://XXX/parquet/2021/04/29/public/test_02/"], "recurse":True}, transformation_ctx = "DataSource0")

# Mapping - created automatically by Glue Studio
# there are 20 columns in my data set, I removed 18 mappings to keep it short
# but this mapping was generated automatically by Glue Studio, didn't have to do anything here
Transform0 = ApplyMapping.apply(frame = DataSource0, mappings = [("id", "string", "id", "string"), ..., ("status", "string", "status", "string")], transformation_ctx = "Transform0")

# MyTransform - created by Glue Studio as a part of the custom transformation block
Transform1 = MyTransform(glueContext, DynamicFrameCollection({"Transform0": Transform0}, glueContext))

job.commit()