Dan Stanhope

Migrating DynamoDB data using Lambda + Streams

The scenario

You've got an existing DynamoDB table and you'd like to migrate the data to another table. Or, you've got some data that pre-dates whenever you enabled streams and lined up that Lambda event listener. What's the move?

First, what are Streams?

When records are added or updated in your DynamoDB table, change data is created and added to an event stream. That stream is super easy to monitor and consume with a Lambda function: as records change, the change data lands on the stream and your function can capture it in near real time. Sweet.
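
To give a feel for what that looks like, here's a minimal sketch of a stream-triggered Lambda handler. It assumes AWS SDK v3 and a stream view type that includes NewImage; the names are illustrative and it isn't code from the repo:

```javascript
// Minimal sketch of a Lambda handler attached to a DynamoDB stream.
// Each record carries the event name (INSERT, MODIFY, REMOVE) plus the
// old/new item images, depending on the table's stream view type.
const { unmarshall } = require('@aws-sdk/util-dynamodb');

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName === 'INSERT' || record.eventName === 'MODIFY') {
      // NewImage arrives in DynamoDB JSON; convert it to a plain object.
      const item = unmarshall(record.dynamodb.NewImage);
      console.log(record.eventName, item);
      // ...forward to Elasticsearch, SQS, another table, etc.
    }
  }
};
```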

One thing to note: event stream data is only stored for 24 hours, after which it's gone. Or is it? It is... but there's a way around that.

A common pattern that utilizes streams is to write to a table, process the change data with Lambda, and write it to another location (e.g. Elasticsearch, SQS). Maybe the data gets transformed a little bit along the way, too.

Let's say this is something you're doing: you've got a nice pipeline running that sends data from DynamoDB -> Lambda -> Elasticsearch, but there's some old data in the table that arrived before the stream was enabled. You can write a script that scans/queries the table and updates each entry with a flag (pre_existing_processed in our case, but change it to whatever you like). Updating an existing record creates new change data and writes it to the event stream. Pretty cool!

You could formulate a query that selects only the records you'd like to get onto the event stream (a date range, perhaps?) and update each one with a flag (something to indicate it's an old record).
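
For a single record, that update could look something like this. This is a sketch using AWS SDK v3; the table name, key shape, and flag name are placeholders rather than the repo's exact code:

```javascript
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, UpdateCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Touch one existing item so it lands on the stream again.
// Any update -- even just adding a flag -- emits a MODIFY event.
async function flagRecord(tableName, key) {
  await ddb.send(new UpdateCommand({
    TableName: tableName,
    Key: key, // e.g. { id: 'abc-123' } -- whatever your table's key is
    UpdateExpression: 'SET pre_existing_processed = :flag',
    ExpressionAttributeValues: { ':flag': true },
  }));
}
```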

The Code

I've created a small project that runs a paginated query (DynamoDB returns up to 1 MB of data per page) and performs bulk updates (AWS allows a maximum of 25 records per batch write).
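
Roughly, the shape of that loop is a Scan that follows LastEvaluatedKey for pagination and a batch write in chunks of 25. Here's a hedged sketch using AWS SDK v3; the repo's migrate.js may differ in detail, and a production version should also retry any UnprocessedItems returned by the batch write:

```javascript
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, ScanCommand, BatchWriteCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function migrate(tableName) {
  let lastKey;
  do {
    // Each Scan page returns at most 1 MB of data; LastEvaluatedKey marks where to resume.
    const page = await ddb.send(new ScanCommand({
      TableName: tableName,
      ExclusiveStartKey: lastKey,
    }));

    // Re-write each item with the flag added so it hits the stream as a MODIFY event.
    const requests = (page.Items || []).map((item) => ({
      PutRequest: { Item: { ...item, pre_existing_processed: true } },
    }));

    // BatchWrite accepts at most 25 items per request.
    for (let i = 0; i < requests.length; i += 25) {
      await ddb.send(new BatchWriteCommand({
        RequestItems: { [tableName]: requests.slice(i, i + 25) },
      }));
    }

    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
}
```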

Clone the GitHub repo here.

Make sure you update ./aws_keys.json with AWS credentials that have access to DynamoDB before starting.

It's important to note that you'll likely need to increase your table's read/write capacity -- which comes at a cost.

Start by adding the requisite packages:

yarn

Run the script, passing your table name and a batch size:

node migrate.js -t <YOUR_TABLE> -b <BATCH_SIZE>

There's also a batch limit parameter, in case you only want to run a set number of batches. Remember that, depending on how much data you've got, it could take a long time to run. I recommend testing with a small batch size first to make sure everything runs the way you expect.

This approach can be used to process millions of legacy/pre-existing records...but it'll take some time 😊

As always, be careful running this code and make sure you understand the cost implications, etc.

Hope this helps!
