I'm working on a serverless stack.
I have a delimited CSV file with 5-6 million lines. I want to import this into AWS DynamoDB.
Is it possible to do this without burning through my monthly salary? How can I solve this on the cheap?
This data dump will be constantly updated, and I want to import only the new lines into the DB. I think the most efficient way would be to make a data dump from my DynamoDB, find a key, compare it with the new third-party data dump's key (this key needs to be extracted from a column by regex), and only import the new lines into my DB. Do you have a recommendation (a big data framework or an AWS service) that could help me with this? I'm also open to a shell or Go script that does the same thing.
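To make the idea concrete, here is a rough Go sketch of the compare step I have in mind; the file name, delimiter, key column, regex, and `loadExistingKeys` are all placeholders I would adapt:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"regexp"
)

// loadExistingKeys is a stand-in for however the keys already in DynamoDB are
// obtained (Athena result, table scan, exported dump, ...).
func loadExistingKeys() map[string]struct{} {
	return map[string]struct{}{} // TODO: fill from the DynamoDB dump
}

func main() {
	existing := loadExistingKeys()
	keyRe := regexp.MustCompile(`\d+`) // placeholder pattern for the key inside the column

	f, err := os.Open("third-party-dump.csv") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	r.Comma = ';' // adjust to the dump's delimiter

	var newRows [][]string
	for {
		rec, err := r.Read()
		if err != nil {
			break // stops on io.EOF (or a parse error)
		}
		if len(rec) < 3 {
			continue
		}
		key := keyRe.FindString(rec[2]) // assume the key is embedded in column 3
		if key == "" {
			continue
		}
		if _, ok := existing[key]; !ok {
			newRows = append(newRows, rec)
		}
	}
	fmt.Printf("%d rows are new and need to be imported\n", len(newRows))
}
```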
Thank You!
Top comments (5)
Okay, so 5 to 6 million isn't that much.
Solution 1: Use Athena to read the keys you already have in DynamoDB, dedupe the new CSV against them, and batch-write only the new rows.
Solution 2:
Same principle as before, but instead of using Athena you do a full table scan on your DynamoDB table. It's a bit more expensive and won't read all the data for the deduping step as fast as Athena.
Something along those lines.
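A rough sketch of Solution 2 with the AWS SDK for Go v2 (the table name "my-table" and key attribute "pk" are placeholders): scan only the key attribute into a set, then compare the CSV against it. Note that even with a projection you still pay read capacity for the full item size.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	existing := map[string]struct{}{}
	p := dynamodb.NewScanPaginator(client, &dynamodb.ScanInput{
		TableName:            aws.String("my-table"),
		ProjectionExpression: aws.String("pk"), // only return the key attribute
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, item := range page.Items {
			if s, ok := item["pk"].(*types.AttributeValueMemberS); ok {
				existing[s.Value] = struct{}{}
			}
		}
	}
	fmt.Printf("loaded %d existing keys\n", len(existing))
	// ...compare the new CSV's keys against `existing` and write only the misses.
}
```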
Do you have a working example of this? I am currently trying the same thing.
I will DM you; just follow me so that we are connected. I don't have an example ready, but a quick search turned up stackoverflow.com/a/33755463. Just change async.series to async.parallelLimit with a limit of 25, so with 25 items per batch it will do 25*25 = 625 concurrent writes.
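The linked answer is Node.js; if you'd rather stay in Go, here is a rough equivalent of the same idea (the table name and item building are placeholders): 25 items per BatchWriteItem call, with roughly 25 batches in flight at a time.

```go
package main

import (
	"context"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

const tableName = "my-table" // placeholder

func writeBatch(ctx context.Context, client *dynamodb.Client, reqs []types.WriteRequest) {
	out, err := client.BatchWriteItem(ctx, &dynamodb.BatchWriteItemInput{
		RequestItems: map[string][]types.WriteRequest{tableName: reqs},
	})
	if err != nil {
		log.Printf("batch failed: %v", err)
		return
	}
	if len(out.UnprocessedItems) > 0 {
		// A real job should retry unprocessed (throttled) items with backoff.
		log.Printf("%d items were not processed", len(out.UnprocessedItems[tableName]))
	}
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	var items []map[string]types.AttributeValue // built from the new CSV rows

	sem := make(chan struct{}, 25) // at most 25 batches in flight
	var wg sync.WaitGroup
	for start := 0; start < len(items); start += 25 {
		end := start + 25
		if end > len(items) {
			end = len(items)
		}
		reqs := make([]types.WriteRequest, 0, end-start)
		for _, it := range items[start:end] {
			reqs = append(reqs, types.WriteRequest{PutRequest: &types.PutRequest{Item: it}})
		}
		wg.Add(1)
		sem <- struct{}{}
		go func(reqs []types.WriteRequest) {
			defer wg.Done()
			defer func() { <-sem }()
			writeBatch(ctx, client, reqs)
		}(reqs)
	}
	wg.Wait()
}
```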
I wrote a post on how to do this for you/others -> rehanvdm.com/serverless/dynamodb-i...
This is a fun problem. Can I ask some questions? How often would you need to dump data into DDB and what is the SLA between a new revision showing up and when you need to get the data into DDB?
If it's updated even a few times a day, you could set the WCU to 1,000 units for an hour (~3.6 million writes) for about $0.65. The naive solution seems worth implementing.
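To make the arithmetic concrete: 1,000 WCU sustained for an hour is 1,000 × 3,600 ≈ 3.6 million 1 KB writes, and at roughly $0.00065 per WCU-hour (us-east-1, provisioned mode) that comes to about $0.65. A sketch of bumping the table's write capacity with the Go SDK, assuming the table is already on provisioned billing; the table name and read capacity value are placeholders:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func setWriteCapacity(ctx context.Context, client *dynamodb.Client, wcu int64) error {
	_, err := client.UpdateTable(ctx, &dynamodb.UpdateTableInput{
		TableName: aws.String("my-table"),
		ProvisionedThroughput: &types.ProvisionedThroughput{
			ReadCapacityUnits:  aws.Int64(5), // keep reads wherever they already are
			WriteCapacityUnits: aws.Int64(wcu),
		},
	})
	return err
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	// 1,000 WCU * 3,600 s ≈ 3.6M writes per hour.
	if err := setWriteCapacity(ctx, client, 1000); err != nil {
		log.Fatal(err)
	}
	// ...run the import...
	// Scale back down once the import is done so you stop paying for the capacity.
	if err := setWriteCapacity(ctx, client, 1); err != nil {
		log.Fatal(err)
	}
}
```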