The Twitter API provides access to a high volume of Tweets that developers and researchers use to study the public conversation as it happens in real time. To consume, process, and store this volume of data, developers build data ingestion and processing pipelines. Because running compute infrastructure, storage, and application services in the cloud has clear advantages, many developers build these pipelines on popular cloud services.
In this guide, we introduce Twitter data processing and storage on Amazon Web Services (AWS). We will first discuss some popular AWS services and how they can be used as components of a data processing pipeline. We will then look at some example reference architectures that developers can refer to when designing and building their data pipelines for processing Twitter data on AWS. Code samples (with instructions on AWS configurations) for processing and storing Twitter data are provided throughout, link to GitHub, and use the new endpoints that are part of the new Twitter API v2.
Disclaimer: The architectures and design patterns listed below are to be used as a reference when building your data pipeline and are not meant to be a one-size-fits-all solution. Based on the volume of the data you are handling, your use case and storage requirements, you will need to customize them accordingly. Refer to the pricing calculator to get an estimate of how much a product will cost you. Twitter does not manage or operate AWS - please refer to their documentation for limitations and terms regarding use of these products.
Relevant Services and Terminology
Below is a list of AWS services that are relevant to building a data pipeline for processing and storing data from the Twitter API.
| Name | Description | What can we use it for (context of the Twitter API) |
|---|---|---|
| EC2 | Provisioning servers in the cloud | Running scripts to get data from the Twitter API |
| Lambda | Functions in the cloud: you select the memory and programming language, then run your code | Processing data once streamed, getting historical data from the Twitter API |
| Kinesis | Process streaming data in real time | Processing streaming data coming from PowerTrack |
| S3 | File/object storage in the cloud | Store Tweet data |
| DynamoDB | NoSQL database / key-value based document store | Store Tweet data |
| Relational Database Service (RDS) | Relational databases in the cloud | Store Tweet data |
| Secrets Manager | Service to store passwords and keys | Store API keys or credentials for the Twitter API |
| Simple Queue Service (SQS) | Queuing service to decouple application processing | Queue records so that various Lambda functions can pick them up and process them in batches |
| CloudWatch | Logging and events in the cloud | CloudWatch Events to trigger Lambda functions periodically |
In this section, we will discuss reference architectures for processing and storing Twitter data on AWS. For the purposes of this guide, we break a data processing pipeline down into four components.
The first component deals with the underlying compute infrastructure that you will use for connecting to and reading data from the Twitter API. Generally, when you want to consume Twitter data from a streaming endpoint such as sampled stream, filtered stream, statuses/filter, or PowerTrack, you need to run your code on a machine or server that connects to the stream and starts consuming data. To do this, you can configure a server in the cloud using EC2 and run your code on it to connect to the streaming endpoint.
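As a rough illustration of the code you might run on the EC2 instance, here is a minimal sketch that connects to the v2 sampled stream endpoint using only the Python standard library. It assumes your bearer token is available in an environment variable named `BEARER_TOKEN` (a hypothetical name), and it leaves out reconnection and error handling, which a production consumer would need.

```python
import json
import os
import urllib.request
from typing import Iterable, Iterator

# Twitter API v2 sampled stream endpoint.
STREAM_URL = "https://api.twitter.com/2/tweets/sample/stream"


def parse_stream(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one dict per Tweet, skipping the blank keep-alive lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # keep-alive signal, not a Tweet
        yield json.loads(line)


def connect(bearer_token: str):
    """Open a long-lived HTTP connection to the stream."""
    request = urllib.request.Request(
        STREAM_URL, headers={"Authorization": f"Bearer {bearer_token}"}
    )
    return urllib.request.urlopen(request)


def run() -> None:
    """Print Tweet IDs until the connection drops.

    BEARER_TOKEN is an assumed environment variable name.
    """
    with connect(os.environ["BEARER_TOKEN"]) as response:
        for tweet in parse_stream(response):
            print(tweet["data"]["id"])
```

Calling `run()` on the server starts the consumer. Keeping `parse_stream` separate from the network code makes it easy to test the parsing logic without a live connection.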
The second component deals with processing Twitter data once you have connected to the streaming endpoints and started receiving data on your server. You can use a queuing solution like SQS or a streaming solution such as Kinesis to handle this data, and invoke Lambda functions to process, filter, and transform it before storing it.
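To make the processing step more concrete, here is a minimal sketch of a Lambda function that receives a batch of SQS messages, assuming each message body holds one Tweet as JSON in the v2 payload shape. The `transform` function and the fields it keeps are illustrative choices, not a prescribed schema.

```python
import json


def transform(tweet: dict) -> dict:
    """Keep only the fields that later stages need (an illustrative choice)."""
    data = tweet["data"]
    return {"tweet_id": data["id"], "text": data["text"].strip()}


def lambda_handler(event, context):
    """Process a batch of SQS messages, each holding one Tweet as JSON."""
    processed = []
    for record in event["Records"]:
        tweet = json.loads(record["body"])
        # Filter: skip payloads that do not carry a Tweet object.
        if "data" not in tweet:
            continue
        processed.append(transform(tweet))
    # Storage is left out here; a real function would write `processed`
    # to a data store before returning.
    return {"processed": len(processed), "items": processed}
```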
The third component deals with storing the Twitter data. AWS offers RDS for running a variety of relational databases in the cloud. It also has a popular object store, S3, as well as NoSQL databases like DynamoDB.
The final component deals with analyzing the data that we receive from the Twitter API after storing it. Services such as Comprehend or QuickSight make it easy to do so. (Note: We will not be covering this component in this guide, but we hope to discuss some of the services for this component in a future article.)
Let’s take a look at some of the reference architectures below:
Processing data from streaming endpoints using AWS Kinesis Data Firehose
In this architecture, you can use one (or more) EC2 instances as a server. On this server, you can run your code that calls any of the streaming endpoints such as sampled stream, filtered stream, or PowerTrack to connect to the stream and ingest data. You can then set up a Kinesis Data Firehose delivery stream in the AWS console to deliver the Tweets to Amazon S3. If the data first flows through a Kinesis data stream, Kinesis allows you to replay the data if needed and allows the data to be consumed by multiple services. This is different from using an SQS queue, because once a message from an SQS queue has been acknowledged, it cannot be replayed.
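On the ingestion side, a minimal sketch of sending Tweets to a delivery stream might look like the following. It assumes a Kinesis Data Firehose delivery stream already exists and that `client` is a boto3 Firehose client; the function names and the stream name in the usage note are hypothetical. Each record ends with a newline so that the objects written to S3 can be read line by line.

```python
import json
from typing import List

# put_record_batch accepts at most 500 records per call.
MAX_BATCH = 500


def to_records(tweets: List[dict]) -> List[dict]:
    """Serialize each Tweet as one newline-terminated JSON record."""
    return [{"Data": (json.dumps(t) + "\n").encode("utf-8")} for t in tweets]


def send_tweets(client, stream_name: str, tweets: List[dict]) -> int:
    """Send Tweets in chunks and return how many records Firehose rejected."""
    failed = 0
    for start in range(0, len(tweets), MAX_BATCH):
        chunk = tweets[start:start + MAX_BATCH]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=to_records(chunk)
        )
        failed += response["FailedPutCount"]
    return failed


def make_client():
    """Create the Firehose client (requires boto3 and AWS credentials)."""
    import boto3
    return boto3.client("firehose")
```

For example, `send_tweets(make_client(), "tweets-to-s3", buffered_tweets)` would push a list of buffered Tweets. A non-zero return value means some records were rejected and should be retried.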
When setting up the delivery stream, you have to specify which S3 bucket you want to write the Twitter data to. You also specify the ‘Buffer Size’ and ‘Buffer Interval’ for delivering Tweets to S3. When either buffer criterion is met, Kinesis Data Firehose will deliver your Tweets to the S3 bucket. AWS also allows you to transform source records before delivery using a Lambda function. Note: Kinesis Data Firehose also allows you to convert the data to Apache Parquet or ORC format for optimization.
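If you prefer to script this setup rather than use the console, the same settings can be expressed as arguments to boto3's `create_delivery_stream` call. The sketch below only builds the arguments; the stream name, ARNs, and buffer values are placeholders you would replace with your own, and GZIP compression is an optional choice made here for illustration.

```python
def delivery_stream_config(
    name: str,
    bucket_arn: str,
    role_arn: str,
    size_mb: int = 5,
    interval_seconds: int = 300,
) -> dict:
    """Build keyword arguments for the Firehose create_delivery_stream call."""
    return {
        "DeliveryStreamName": name,
        # DirectPut: producers write straight to the delivery stream.
        "DeliveryStreamType": "DirectPut",
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,      # IAM role Firehose assumes to write to S3
            "BucketARN": bucket_arn,  # destination bucket
            # Delivery happens when either threshold is reached first.
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_seconds,
            },
            "CompressionFormat": "GZIP",
        },
    }


def create_stream(client, config: dict) -> None:
    """Create the delivery stream; `client` is a boto3 Firehose client."""
    client.create_delivery_stream(**config)
```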
To process the data that the Kinesis delivery stream delivers to S3, you can configure a Lambda function that is triggered whenever data is added to the S3 bucket. This Lambda function holds your processing logic (such as sentiment analysis). Depending on your use case, you can then store the processed data in a NoSQL solution like DynamoDB or in a relational database using RDS.
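Here is a minimal sketch of such an S3-triggered Lambda function. It assumes the delivered objects are uncompressed, newline-delimited JSON (one Tweet per line); if you enable compression or format conversion on the delivery stream, the reading step would need to change. The `toy_sentiment` function is a deliberately naive placeholder for whatever analysis you actually run.

```python
import json
import urllib.parse

POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "bad", "awful"}


def toy_sentiment(text: str) -> str:
    """Naive word-count placeholder; swap in real sentiment analysis."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def s3_objects(event):
    """Yield (bucket, key) for each object in an S3 event notification."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def process(event, s3_client):
    """Read each new object and score every Tweet it contains."""
    results = []
    for bucket, key in s3_objects(event):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines():
            if not line.strip():
                continue
            data = json.loads(line)["data"]
            results.append(
                {"tweet_id": data["id"], "sentiment": toy_sentiment(data["text"])}
            )
    return results


def lambda_handler(event, context):
    """Lambda entry point; boto3 is available in the Lambda Python runtime."""
    import boto3
    return process(event, boto3.client("s3"))
```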
Check out our code sample (along with AWS configuration instructions) for connecting to the sampled stream endpoint and delivering Tweets to S3 using the Kinesis delivery stream.
Summary of steps for implementing this architecture on AWS:
Spin up an EC2 instance on which you will run your code to connect to the streaming endpoints.
Configure a Kinesis delivery stream:
- Specify the S3 bucket destination where you want Tweets delivered to
- Specify the buffer size and interval for delivering Tweets to the S3 bucket
Configure a Lambda function to be triggered whenever data is added to your S3 bucket. This is the Lambda function where you can do processing such as sentiment analysis.
If you wish to store the results from the step above in a datastore, configure that datastore first, before writing to it from your Lambda function.
Getting historical data periodically using Lambda and CloudWatch Events
If you wish to get historical data using endpoints such as recent search or search/tweets, you can have a script on an EC2 instance that periodically calls these endpoints, gets data from them, and stores it. This reference architecture presents an alternative, serverless approach. You can configure a Lambda function that runs the script to connect to the Twitter API and get historical data. You can set up a CloudWatch event to trigger the Lambda function periodically to get new/recent data. When setting up the CloudWatch event, you can specify how often you wish to invoke the Lambda function (for example, every 15 minutes). After getting this data back, you can transform and/or store it in your data store.
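A minimal sketch of such a Lambda function is shown below, using the v2 recent search endpoint and only the standard library. It assumes the bearer token and the search query are supplied through environment variables named `BEARER_TOKEN` and `SEARCH_QUERY` (hypothetical names). Persisting the newest Tweet ID between invocations, so that each run only fetches new Tweets, is left out and would need a data store of your choice.

```python
import json
import os
import urllib.parse
import urllib.request

# Twitter API v2 recent search endpoint.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_url(query: str, since_id=None, max_results: int = 100) -> str:
    """Build the request URL; since_id limits results to newer Tweets."""
    params = {"query": query, "max_results": max_results}
    if since_id:
        params["since_id"] = since_id
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch(url: str, bearer_token: str) -> dict:
    """Call the endpoint and return the decoded JSON payload."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {bearer_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def newest_id(payload: dict, previous=None):
    """Return the newest Tweet ID in the response, or the previous one if empty."""
    return payload.get("meta", {}).get("newest_id", previous)


def lambda_handler(event, context):
    """Entry point invoked by the scheduled CloudWatch event."""
    url = build_search_url(os.environ["SEARCH_QUERY"])
    payload = fetch(url, os.environ["BEARER_TOKEN"])
    tweets = payload.get("data", [])
    # Transform and store `tweets` here.
    return {"count": len(tweets), "newest_id": newest_id(payload)}
```

For the 15-minute example, the schedule expression on the CloudWatch Events rule would be `rate(15 minutes)`.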
Check out this blog post that shows how to trigger a Lambda function periodically and get Tweets using the recent search endpoint and then store processed Tweet IDs in DynamoDB.
Storing Twitter data on AWS
AWS provides various options for storing Twitter data. In this section, we will take a look at three of them: S3, DynamoDB, and RDS.
Storing data in S3
A common way to store Tweet data is to store it in batches in S3 buckets. S3 is a popular solution for building a data lake with structured or unstructured data. If you wish to first collect all the data and use it later for analysis in tools such as Python or R, it makes sense to compress your data before storing it in S3, which reduces the size of each file.
For example, if you want to store the Tweets processed on an hourly basis, you can have a folder name per day as:
Folder Name: YYYY-MM-DD
Within this folder you can store the zipped files per hour as:
File Name: identifier-YYYY-MM-DD-HH.json or identifier-YYYY-MM-DD-HH.json.gz
You may even store it in 15-minute windows for a given day, in which case your file name will be:
File Name: identifier-YYYY-MM-DD-HH-mm.json or identifier-YYYY-MM-DD-HH-mm.json.gz
Then, in your Python / R code you can read this data from S3 and do your analysis.
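The naming scheme above can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes Tweets are written as gzip-compressed, newline-delimited JSON; `client` in `upload` is expected to be a boto3 S3 client.

```python
import gzip
import json
from datetime import datetime
from typing import List


def object_key(identifier: str, when: datetime, quarter_hour: bool = False) -> str:
    """Build '<YYYY-MM-DD>/<identifier>-<timestamp>.json.gz' for a batch."""
    folder = when.strftime("%Y-%m-%d")
    stamp = when.strftime("%Y-%m-%d-%H")
    if quarter_hour:
        # Round down to the start of the 15-minute window.
        minute = when.minute - when.minute % 15
        stamp += f"-{minute:02d}"
    return f"{folder}/{identifier}-{stamp}.json.gz"


def compress(tweets: List[dict]) -> bytes:
    """Serialize Tweets as newline-delimited JSON and gzip the result."""
    text = "\n".join(json.dumps(t) for t in tweets)
    return gzip.compress(text.encode("utf-8"))


def decompress(blob: bytes) -> List[dict]:
    """Inverse of compress(), for reading a batch back during analysis."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


def upload(client, bucket: str, identifier: str, when: datetime, tweets: List[dict]) -> str:
    """Write one batch to S3 and return the key it was stored under."""
    key = object_key(identifier, when)
    client.put_object(Bucket=bucket, Key=key, Body=compress(tweets))
    return key
```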
Check out this code sample (along with AWS configuration instructions) that shows you how to store Tweets streamed from sampled stream endpoint in S3.
Storing data in DynamoDB
If you need to store one Tweet per row in a database, with tweet_id as the key and the JSON Tweet object as the value, then DynamoDB is a good fit. You can also store processed Tweets in this manner (after you have done some analysis per Tweet). DynamoDB allows you to read by key (tweet_id in this case, which returns one Tweet object) or scan the table (which returns every Tweet in the table).
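A minimal sketch of this key-value layout with boto3 might look like the following. It assumes a table whose partition key is a string attribute named `tweet_id`, and that `table` is a boto3 DynamoDB `Table` resource; the table name in the usage note is hypothetical. Floats are converted to `Decimal` because boto3's DynamoDB resource does not accept Python floats.

```python
import json
from decimal import Decimal


def to_item(tweet: dict) -> dict:
    """Build a DynamoDB item keyed by tweet_id with the Tweet object as value."""
    # Round-trip through JSON so any floats are parsed as Decimal.
    data = json.loads(json.dumps(tweet["data"]), parse_float=Decimal)
    return {"tweet_id": data["id"], "tweet": data}


def store_tweet(table, tweet: dict) -> None:
    """Write one Tweet as one row."""
    table.put_item(Item=to_item(tweet))


def load_tweet(table, tweet_id: str):
    """Read one Tweet by key; returns None when the key does not exist."""
    return table.get_item(Key={"tweet_id": tweet_id}).get("Item")


def get_table(name: str):
    """Create the Table resource (requires boto3 and AWS credentials)."""
    import boto3
    return boto3.resource("dynamodb").Table(name)
```

For example, `store_tweet(get_table("tweets"), tweet)` writes a single Tweet. Reading by key is cheap, while a full table scan reads every item, so reserve scans for occasional bulk exports.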
Check out this blog post that shows how to store processed Tweet IDs in DynamoDB.
Storing data in a relational database (using RDS)
If you do not wish to store the data in a NoSQL solution like DynamoDB and require a schema-on-write solution, then you can use Amazon Relational Database Service (RDS) to configure a relational database of your choice. RDS lets you set up MySQL, PostgreSQL, and Amazon Aurora databases (among others) in the AWS cloud.
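To illustrate what schema-on-write means for Tweet data, here is a small sketch that uses Python's built-in `sqlite3` module purely as a local stand-in for an RDS-hosted database. The table layout is an assumption made for illustration; against MySQL or PostgreSQL on RDS you would use the corresponding database driver, and column types and parameter placeholders may differ.

```python
import sqlite3

# Hypothetical schema: one row per Tweet, with required and optional columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   TEXT PRIMARY KEY,
    author_id  TEXT,
    created_at TEXT,
    text       TEXT NOT NULL
)
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the tweets table if it does not exist yet."""
    with conn:
        conn.execute(SCHEMA)


def insert_tweet(conn: sqlite3.Connection, tweet: dict) -> None:
    """Insert one Tweet; rows that violate the schema are rejected on write."""
    data = tweet["data"]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tweets (tweet_id, author_id, created_at, text) "
            "VALUES (?, ?, ?, ?)",
            (data["id"], data.get("author_id"), data.get("created_at"), data.get("text")),
        )
```

A Tweet without text raises an integrity error at insert time, which is the practical difference from writing arbitrary JSON into a document store.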
Managing keys and secrets
In order to use the Twitter API, you need to use your API keys for authentication. You should not store these keys in your code. If you use AWS, one option is to use Secrets Manager to store your keys. By using the AWS SDK, you can load those credentials in your code. Check out this implementation in Java of retrieving keys from AWS Secrets Manager.
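In Python, a minimal sketch of the same idea could look like this. It assumes the secret was stored as a JSON string of key/value pairs and that `client` is a boto3 Secrets Manager client; the secret name and the `bearer_token` field in the usage note are hypothetical.

```python
import json


def load_twitter_credentials(client, secret_id: str) -> dict:
    """Fetch a secret and decode it, assuming it holds JSON key/value pairs."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def make_client():
    """Create the Secrets Manager client (requires boto3 and AWS credentials)."""
    import boto3
    return boto3.client("secretsmanager")
```

For example, `load_twitter_credentials(make_client(), "twitter-api")["bearer_token"]` would return the token without it ever appearing in your source code.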
Feel free to reach out to me at @suhemparack with suggestions on what other AWS samples you would like to see for processing and storing Twitter data on AWS.