DEV Community

loading...

How to import a big delimited datatable into AWS DynamoDB without opening your piggy bank?

djviolin profile image István Lantos ・1 min read

I'm working on a serverless stack.

I have a delimited CSV file with 5-6 million lines. I want to import this into AWS DynamoDB.

Is it possible somehow to import this into AWS DynamoDB, without burning my monthly salary down the toilet? How to solve this on the cheap?

This data dump will be constantly updated and I want to import the new lines into DB. I think the most efficient way would be to make a data dump from my DynamoDB, find a key, compare it with the new third-party data dump's key (this key needs to be extracted from a column by regex), and only import the new lines into my DB. Do you have some recommendation (big data framework or some AWS service) which can help me with this? Or I'm open if there's a shell or a Go script to do the same thing.

Thank You!

Discussion (5)

pic
Editor guide
Collapse
rehanvdm profile image
Rehan van der Merwe

Okay so 5 to 6 Million, isn't that much.

Solution 1:

  • Split your CSV into 6 smaller once so that they are 1 Mil each.
  • Create an Athena table. In your code you can then query all the data.
  • Create a Lambda, that gets triggered when the CSV file is PUT into the bucket. Use a SQS to notify the Lambda, set batch size to 1 and concurrency to 1. If your using NodeJS, search for fast-csv package, you can then stream the CSV to read it in chunks.
  • Query Athena to find all the data, this is more efficient than using Dynamo to do a full scan. Stream these rows and create a hash key of the fields that you use to determine if the row is unique or not. Store this array in memory.
  • Then read the new CSV data also as stream, compute the unique hash same as before. If it is found, no need to update, if not insert the record.
  • Use DynamoDB's BatchWrite API to write in batches of 25. Do this concurrently (in parallel) within the same function. You will easily be able to hit 2k to 4k inserts if the table is set to OnDemand Pricing with a cold table. Then as mentioned before your file must have 1 Mil records max so that the insert to DynamoDB takes 1,000,000/4,000 = 250 seconds (max run time is 900) so your safe.
  • Upload the 6 raw CSV files to S3.

Solution 2:
Same principal as before, but instead of using Athena you could do a full table scan on your DynamoDB table. Bit more expensive and won't be as fast to read all the data for the Deduping part as Athena.

Something in the lines of that.

Collapse
donhill profile image
Don Hill

Do you have a working example of this ? I am currently trying the same thing.

Collapse
rehanvdm profile image
Rehan van der Merwe • Edited

I will DM you, just follow me so that we are connected. I don't have an example ready, but I quickly found this stackoverflow.com/a/33755463 with a quick search. Just change the async.series to async.parallelLimit, and make it say 25 as well so then it will do 25*25 = 625 concurrent writes.

Collapse
rehanvdm profile image
Rehan van der Merwe

I wrote a post on how to do this for you/others -> rehanvdm.com/serverless/dynamodb-i...

Collapse
iamatypeofwalrus profile image
Joe Feeney

This is a fun problem. Can I ask some questions? How often would you need to dump data into DDB and what is the SLA between a new revision showing up and when you need to get the data into DDB?

If updated even a few times a day you could set the WCU to 1000 units for an hour (~3.6 million writes) for $0.65. The naive solution seems worth implementing.