Eugenia

Posted on May 11, 2019 • Originally published at towardsdatascience.com on Dec 27, 2018

How to build a PostgreSQL database to store tweets

#postgres #twitter #datascience #database

Learning how to stream from Twitter API

Twitter is used every day by people to express their feelings or thoughts, especially about something that it is happening at the moment or has just occurred; by companies to promote products or services; by journalists to comment on events or write news, and the list can go on. There is no doubt that analyzing Twitter and tweets is a powerful tool that can give us a sense on the public opinion about a topic that we are interested in.

But how do we perform this type of analysis? Fortunately, Twitter provides us with an API (that stands for ‘Application Programming Interface’) which we can interact with creating an app and in that way, access and filter public tweets.

The first step is to register our app; for that, we need to go to Twitter developer website, log-in with our twitter account and ask Twitter for approval as developers (As from July 2018, Twitter changed their policies and anyone that want to access Twitter API and create an app need to apply for a developer account, provide detailed information on how they intend to use it and wait for the application to be approved). After we received the approval, we can go on and create a new app filling out the details: Name (unique name that no one else has used as their Twitter app), Description , and Website (It should be the app’s home page but we can also put our personal website or a GitHub repository URL.

Creating an app in Twitter Developer account

After that, we need to create our access tokens. The access tokens will allow our Twitter app to read Twitter information, such as tweets, mentions, friends, and more:

For the whole analysis, we are going to work with JupyterLab and Python. Because anyone having this information can use it to authorize the app to connect to Twitter, we are going to create a python file (.py) where we can store the Consumer Key, Consumer Secret, OAuth Access Token, OAuth Access Token Secret and then call it in our main Jupyter Notebook file.

Now that we have registered our app, got our tokens/keys and stored them in a separated file, the second step is to decide where we are going to store our tweets once we got them. Should we store them in a file, in a NoSQL-type database or in a relational database? In order to answer this question, we need to understand how the information that we get from twitter app is given to us and what we want to do with it.

Twitter APIs always return tweets encoded using JavaScript Object Notation (JSON), an unstructured and flexible type which is based on key-value pairs with attributes and associated values that describe objects. Each tweet contains an author, message, unique ID, a timestamp and creation date when it was posted, among others; each user has a name, id, and the number of followers. Because of this, we will immediately think of storing tweets in a built database management systems (DBMS), like MongoDB, an open source database conceived as a native JSON database. The advantages of this type of database are that they are designed to be agile and scalable while using dynamic schemas without defining the structure first.

On the other hand, relational databases are more related to standards compliance and extensibility and consequently, do not give us freedom over how to store the data. They use dynamic and static schemas which help us to link data when they are relational. On the contrary, because of the unstructured approach, we can not perform this in MongoDB. Some other advantages of using a relational database are that we need only to change data in one of the tables and then it will update itself ( data integrity ), and they ensure that no attributes are repeated ( data redundancy ). As we said before, relational databases are structured in columns and rows in a way that we can link information from different tables through the use of keys that uniquely identified any piece of data within the table and are used by other tables to point to them.

Even though MongoDB has been designed to be fast and has great performance with unstructured data, relational databases such as PostgreSQL has a great performance handling JSON. This fact together with the possibilities that structuring the data give us lead us to use a relational database and specifically, PostgreSQL 10 (mention usually just as Postgres) because it is a free, reliable and efficient SQL database that most importantly, has a very powerful and useful python API, called psycopg2. In order to administer our database in a more friendly way, we will also use pgadmin that will allow us to create a database with a user and password to protect our information. Again, we are going to store these credentials in another python file so we keep them secret once we push the main file to our git repository.

The last components that we need to bring into our analysis are Tweepy, a python library that will become the main player in our code and will help us to access the Twitter API, handle the authorization request, capture and stream tweets among others, and json which will manage the JSON files obtained from the API.

So, the first thing that we need to do in our code is to import all the libraries and files that we’ve created:

After that, we will define a function that will authorize our app to connect to Twitter API. OAuth is an open standard for access mainly used by internet users to grant applications or websites access to their information on other websites without providing them the passwords and instead, allowing approved access tokens to access protected resources hosted by the resource server. In the case of Twitter, we already requested the tokens and keys so, we will use Tweepy that makes OAuth authorization easy for us handling it with the tweepy.OAuthHandler class. So, first, we pass to this function the consumer key and the consumer secret and then, we set the access tokens to be our access token and our access token secret that we got from Twitter.

As we have discussed before, relational databases store information in structured tables, so the next step is to decide the schema of our database. There is a very important concept that we need to consider here: Normalization. Normalizing a database requires a process that structures a relational database according to certain normal forms which follow the goal of reducing data redundancy and improving data integrity. A normalized database is one such as the relationship among the tables matches the relationship that is really there among the data. In simple terms, the rules state that unique keys and events in the rows should say something about it, facts that do not relate to the key belongs into different tables, and the tables should not imply relationships that do not exist.

In our case, we are going to create two tables: the first one will contain information about the Twitter users: user id, that will be our PRIMARY KEY (a key that is unique for each record), and user name. On the other hand, we will create a second table that will store information about the tweets: creation date, text (the tweet itself), user id, that will be our FOREIGN KEY (a key that uniquely identifies a primary key of another table) relating this table with our user table, and the retweet count.

We will define a function that once called will connect to our database with the credentials (using the command pyscopg2.connect) and create a table containing the name of the term we want to search for. For this last step, Postgres give us the possibility of setting up a cursor that encapsulates the query and reads its results a few rows at a time instead of executing a whole query at once. So, we will take advantage of that, create a cursor (using database.cursor()) and then execute our query to create the user table and then the tweets-containing table. We need to consider some points here: it is important to use the IF NOT EXISTS command when we perform the query to CREATE TABLE, otherwise, Postgres can rise the error that the table is already created and will stop the code execution; we need to clarify which type of variable each column contains (VARCHAR, TIMESTAMP, etc.), which column is the primary and foreign key, and in this last case, which column REFERENCES to; after we have executed the queries is important to commit this (database.commit()), otherwise, no changes will be persisted, and close the connection to the cursor and the database.

Afterward, we need to define a function that will help us store the tweets. This function will follow the same logic that we use to create the table(connect to the database, create a cursor, execute the query, commit query, close connection), but instead, we will use the INSERT INTO command. When creating the user table, we declared that user id will be our primary key. So, when we store the tweets we need to be careful how we insert this in the table. If the same user has two tweets, the second time the function is executed, it will raise an error because it detects that particular user id to already be in the table, as primary keys have to be unique. So, we can use here the ON CONFLICT command to tell postgres that if the user id is already in the table, it doesn’t have to insert it again. On the contrary, the tweet will be inserted into the tweets table and will be referenced to that user id in the user table.

There are two ways to capture tweets with Tweepy. The first one is using the REST search API, tweepy.API, which searches against a sampling of recent public tweets published in the past 7 days. The second one is streaming real-time tweets by using the Twitter streaming api that differs from the REST api in the way that this one pulls data from Twitter while the streaming api pushes messages to a persistent session.

In order to stream tweets in Tweepy, an instance of the class tweepy.Stream establishes a streaming session and sends messages to an instance of StreamListener class. Inside of this class, there are several methods that handle tweets. Depending on what type of information we want to obtain, we need to override the different methods: if we want only the status, we will then overload on_status method. Because we want detailed information about the tweet (creation date, user id, user name, retweet count), we will overload the on_data method that is in charge of receiving all messages and calling functions according to the type of the message.

Consequently, we will create the class MyStreamListener which will inherit from tweepy.StreamListener class and we will override on_data method. We will obtain the json file containing the tweet (json.load(raw_data)) and parse it to store the values in different variables (as an example: user_id = data[‘user’][‘id_str’]) to then pass them to the function to store tweets that we have created before.

It is important to be careful about error or exceptions that could occur at this point. For this, we will surround the code by a try/except block so in case an exception happens, it will be printed and we can be aware of what is happening. Also, there is a limit number of attempts to connect to the streaming API and this will show an error 420. We can handle this error by overloading the on_error method and disconnect the API in case this error shows up.

So, what is left? We need to create an api (tweepy.API()) and after that, create our stream object (tweepy.stream()) passing the authorization and the listener that we have created. We will use the filter function to stream all tweets containing a word of interested (track = [‘word’]) and being written in English (languages = [‘en])

Now, it’s time to start streaming the tweets!! I’m particularly interested in knowing what people are feeling about the Avengers. So I will use “avengers” as my term of interest and start capturing real-time tweets to create a nice database that will help my later sentiment analysis that you can read here as well as to visualize Twitter interactions with Networkx that you can find here. What are you interested in?