Serverless Backends with AWS Cloud: Data Pipeline and S3

#datapipeline #s3

This is part of a series on AWS serverless architecture. The original blog post series can be found both here and on my blog J-bytes.

Piping Hot Data

We're going to move data from DynamoDB into our local environment. Later, we'll be decrypting it to do cool things with it. This section is super short if you've been following along.

If you used an RDS instead of DynamoDB, you can just work with the RDS as-is (slightly more expensive, choke points, etc). If you used DynamoDB, however, we really can't use Lambdas to send out tweets to our users. A single invocation has a hard time limit (5 minutes when this was written; AWS has since raised it to 15), and Lambdas weren't really designed for scanning a database with 100K+ records and sending messages in one go. We could build tools to leave the data in the DB and just work with it there, but I think it's much simpler to download all the data, put it into a local MySQL instance, and then send out tweets or collect emails as you please.

Of course, you could always do this on an EC2 spot instance for a cheap, powerful way to mass-notify your followers as well.

Creating a Data Pipeline

This is the easiest part of the whole project. Go into S3 and create two buckets (or folders, choice is entirely yours):
<Project-name>-production-email
<Project-name>-production-twitter

You can also create two more if you need them for staging, but for the purpose of the demo I'm just going to use production from here on out. The steps are the same and very easy to copy either way.
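If you'd rather script the bucket creation than click through the console, here is a minimal sketch using boto3. The helper names (bucket_names, create_buckets), the default region, and the "my-project" example are my own assumptions, not part of the original setup. Note that S3 bucket names must be lowercase and globally unique, so the project name is lowercased before use.

```python
import re


def bucket_names(project, stage="production"):
    """Build the email and twitter bucket names for a project and stage."""
    names = [f"{project}-{stage}-{kind}".lower() for kind in ("email", "twitter")]
    for name in names:
        # Simplified check: 3-63 chars of lowercase letters, digits, and hyphens.
        # (S3 also allows dots, which this sketch deliberately rejects.)
        if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
            raise ValueError(f"not a valid S3 bucket name: {name!r}")
    return names


def create_buckets(names, region="us-east-1"):
    """Create each bucket; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported here so bucket_names() works without boto3

    s3 = boto3.client("s3", region_name=region)
    for name in names:
        if region == "us-east-1":
            # us-east-1 is the default and must not be given a LocationConstraint
            s3.create_bucket(Bucket=name)
        else:
            s3.create_bucket(
                Bucket=name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
```

Calling create_buckets(bucket_names("my-project")) would create both production buckets; pass stage="staging" to bucket_names for the staging pair.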

After creating your buckets, open Data Pipeline in the AWS Console. Note: If your databases don't actually have things in them, your pipeline won't have anything to transport. Make sure your production email and twitter databases actually have at least one item each before continuing. You can do this via Postman.

Select a different region if your current one doesn't support Data Pipeline
Click Get Started Now
For Name, choose something fiendishly complex like "Production-email-pipeline"
Under Source, choose Export DynamoDB table to S3
For Source DynamoDB table name choose the exact table name - production_emails
Your output folder should be the correct folder you just made in S3.
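DynamoDB read throughput ratio is something that depends on your project. It's the fraction of the table's provisioned read capacity the export is allowed to use, so 0.25 on a table with 100 read capacity units lets it use roughly 25. For my project, the pipeline was supposed to run after the campaign was finished and there were no more incoming requests, so 1.0 was fine. While the export is reading from your DB it eats into that read capacity, so be smart about how you set this if the table is still serving traffic.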
Region of the DynamoDB table - set to the correct region that your table is in.
Schedule -> Run: change this to "on pipeline activation"
Disable logging. Not important for us.
Default IAM Role.
Click Activate.
It will yell at you about validation warnings. Laugh, then click Activate again. There are no regrets here.
Wait like 10 minutes depending on factors you can't control. If your schedule state says "SCHEDULED On Demand", you probably have to keep waiting and manually refreshing. This page does not seem to auto-refresh.
In the S3 production folder/bucket you picked as the output, you should see a folder with today's date. Inside, you'll see three files. Ignore "_SUCCESS" and "manifest." We want the third one, the file named with a string of nonsense. That's the actual export. Download it.
Repeat all of the above for the twitter table.
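
Once the file is downloaded, you'll want it in a shape you can load into a local database. The sketch below assumes the export is newline-delimited JSON where each attribute is wrapped in a type tag, for example {"email":{"s":"..."}}. That is what I'd expect from this template, but open your file and check before relying on it. The function names and the "email" field are hypothetical, and only the string, number, list, and map tags are handled.

```python
import json
from decimal import Decimal


def untag(value):
    """Turn one type-tagged attribute, e.g. {"s": "abc"}, into a plain value."""
    tag, raw = next(iter(value.items()))
    tag = tag.lower()  # accept both "s" and "S" style tags
    if tag == "s":
        return raw
    if tag == "n":
        return Decimal(raw)  # numbers are exported as strings
    if tag == "l":
        return [untag(v) for v in raw]
    if tag == "m":
        return {k: untag(v) for k, v in raw.items()}
    return raw  # any other tag: pass the raw value through untouched


def parse_export(lines):
    """Parse an iterable of export lines into a list of flat dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        records.append({name: untag(value) for name, value in item.items()})
    return records
```

Since an open file is an iterable of lines, you can pass the downloaded file straight to parse_export. The values will still be encrypted at this point, so expect ciphertext strings rather than readable emails.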