Some time ago I read a tweet by Kenneth Reitz, a very well known Python developer I follow on Twitter, asking:
Starting from this, I decided to analyze some tweets from pretty popular Python devs in order to understand a priori how they use Twitter, what they tweet about and what I can gather using data from Twitter APIs only.
Obviously you can apply the same analysis to a different list of Twitter accounts.
For my analysis I set up a Python 3.6 virtual environment with the following main libraries:
- Tweepy for interaction with Twitter APIs
- Pandas for data analysis and visualization
- Beautiful Soup 4, NLTK and Gensim for processing text data
Some extra libraries will be introduced later on, along with the explanation of the steps I took.
In order to access the Twitter APIs I registered my app and then I provided the tweepy library with the consumer_key, access_token, access_token_secret and consumer_secret.
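A minimal sketch of this setup, assuming the four values are stored in environment variables. The variable names and both helper functions are hypothetical and do not come from the original notebook. `tweepy` is imported lazily, so the credential check works even where the library is not installed.

```python
import os

# The environment variable names are an assumption made for this sketch.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=os.environ):
    """Read the four OAuth values, failing early if one is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}


def build_api(credentials):
    """Create an authenticated tweepy client from the loaded credentials."""
    import tweepy  # imported lazily so load_credentials works without it

    auth = tweepy.OAuthHandler(
        credentials["consumer_key"], credentials["consumer_secret"]
    )
    auth.set_access_token(
        credentials["access_token"], credentials["access_token_secret"]
    )
    # wait_on_rate_limit makes tweepy sleep instead of failing on rate limits
    return tweepy.API(auth, wait_on_rate_limit=True)
```

Keeping the secrets out of the source code also makes it safer to share the notebook later.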
We're now ready to get some real data from Twitter!
First of all, I chose a list of 8 popular Python devs, starting from the top 360 most-downloaded packages on PyPI and selecting some libraries I know or use daily.
Here's my final list, including links to the Twitter account and the libraries (from the above-mentioned list) those guys are known for:
- @kennethreitz: requests
- @mitsuhiko: jinja2/flask
- @zzzeek: sqlalchemy
- @teoliphant: numpy/scipy/numba
- @benoitc: gunicorn
- @asksol: celery
- @wesmckinn: pandas
- @cournape: scikit-learn
I got all the data with two endpoints only:
- with a call to lookup users I could get all the information about the accounts (creation date, description, counts, location, etc.)
- with a call to user timeline I could get the tweets posted by a single user and all the information related to every single tweet. I configured the call to also get retweets and replies.
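The user timeline endpoint returns tweets one page at a time (up to 200 per call), so collecting a whole timeline means walking backwards with `max_id`. The sketch below shows only that paging logic. `fetch_page` is a hypothetical callable, for example a thin wrapper around tweepy's `user_timeline` that returns plain dicts, and is not the code from the original notebook.

```python
def collect_timeline(fetch_page, page_size=200):
    """Collect every tweet the timeline endpoint is willing to return.

    fetch_page(count, max_id) must return a list of tweet dicts, newest
    first, each with an integer "id"; max_id=None means "start from the top".
    """
    tweets = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(tweet["id"] for tweet in page) - 1
    return tweets
```

Passing the fetcher in as a parameter keeps the loop testable without network access.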
I saved the results from the two calls in two Pandas dataframes in order to ease the data processing, and then into CSV files to be used as the starting point for the next steps without calling the Twitter API again each time.
The users dataframe contained all the information I needed; I just created three more columns:
- a followers/following ratio, a sort of "popularity" indicator
- a tweets per day ratio, dividing the total number of tweets by the number of days since the creation of the account
- the coordinates starting from the location, if available, using Geopy. @benoitc doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)
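The first two derived columns are simple ratios, sketched here in plain Python rather than Pandas. The figures come from the @kennethreitz row, reading 18213 as followers, 480 as following and 58195 as total tweets. The collection date is an assumption. The Geopy lookup is left out.

```python
from datetime import date


def popularity(followers, following):
    """Followers/following ratio; None when the account follows nobody."""
    return followers / following if following else None


def tweets_per_day(total_tweets, created_on, as_of):
    """Average number of tweets per day between account creation and as_of."""
    days = (as_of - created_on).days
    return total_tweets / days if days > 0 else None


# The as_of date is an assumption about when the data was collected.
ratio = popularity(followers=18213, following=480)
rate = tweets_per_day(58195, date(2009, 6, 24), as_of=date(2018, 5, 10))
print(ratio, round(rate, 2))
```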
Here's the final users dataframe:
|kennethreitz||Kenneth Reitz||False||@DigitalOcean & @ThePSF. Creator of Requests: HTTP for Humans. I design art — with code, cameras, musical instruments, and the English language.||2009-06-24 23:28:06||Winchester, VA||[39.1852184, -78.1652404]||Eastern Time (US & Canada)||58195||26575||18213||480||822||37.94375||17.950339296730412|
|mitsuhiko||Armin Ronacher||True||Creator of Flask; Building stuff at @getsentry; prev @fireteamltd, @splashdamage — writing and talking about system architecture, API design and lots of Python||2008-02-01 23:12:59||Austria||[47.2000338, 13.199959]||Vienna||31774||2059||21801||593||941||36.76391231028668||8.470807784590775|
|zzzeek||mike bayer||False||2008-06-11 19:22:19||northeast||[41.7370229, -99.5873816]||Quito||14771||1106||3013||209||194||14.416267942583731||4.080386740331492|
|teoliphant||Travis Oliphant||False||Creator of SciPy, NumPy, and Numba; founder and Director of Anaconda, Inc. Founder of NumFOCUS. CEO of Quansight||2009-04-17 20:04:57||Austin, TX||[30.2711286, -97.7436995]||Central Time (US & Canada)||3875||1506||18052||483||746||37.374741200828154||1.1706948640483383|
|benoitc||benoît chesneau||False||web craftsman||2007-02-13 16:53:37||Paris||27172||548||1971||704||247||2.799715909090909||6.620857699805068|
|asksol||Ask Solem||False||sound, noise, stream processing, distributed systems, data, open source python, etc. Works at @robinhoodapp 🌳🌳🌴🌵||2007-11-11 18:56:33||San Francisco, CA||[45.4423543, -73.4373087]||Pacific Time (US & Canada)||3249||191||2509||513||126||4.89083820662768||0.8476389251239238|
|wesmckinn||Wes McKinney||True||Data science toolmaker at https://t.co/YVn0VFqgj0. Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis. Views my own||2010-02-18 21:01:15||New York, NY||[40.7306458, -73.9866136]||Eastern Time (US & Canada)||7749||3021||33130||784||1277||42.25765306122449||2.5804195804195804|
|cournape||David Cournapeau||False||Python and Stats geek. Former NumPy/Scipy core contributor. Lead the ML engineering team @cogentlabs Occasional musings on economics/music/Japanese culture||2010-06-14 03:17:21||Tokyo, Japan||[34.6968642, 139.4049033]||Amsterdam||15577||505||800||427||112||1.873536299765808||5.395566331832352|
The tweets dataframe on the contrary needed some extra preprocessing.
First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (more or less 3200 including retweets and replies). Therefore I decided to group the tweets by username and to get the oldest tweet date for each user:
|username||tweets||oldest tweet||newest tweet|
|asksol||2991||2009-07-19 19:24:33||2018-05-10 14:58:17|
|benoitc||3199||2017-02-21 14:55:37||2018-05-11 19:36:21|
|cournape||3179||2017-06-12 19:13:20||2018-05-11 16:55:39|
|kennethreitz||3200||2017-08-26 20:48:35||2018-05-11 21:07:46|
|mitsuhiko||3226||2017-06-14 13:23:57||2018-05-11 19:26:07|
|teoliphant||3201||2013-02-19 03:54:16||2018-05-11 16:48:39|
|wesmckinn||3205||2014-01-26 17:45:07||2018-05-03 14:51:29|
|zzzeek||3214||2015-05-05 13:35:38||2018-05-11 14:17:02|
Then I filtered out all the tweets older than the most recent of these oldest-tweet dates (2017-08-26 20:48:35).
@kennethreitz determines the cut-off date because he tweets a lot more than some of the other users, so his 3200 most recent tweets cover the shortest period; but this way we at least get the same timeframe for all the users and can compare tweets from the same period.
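The cut-off logic can be sketched in a few lines of plain Python: find the oldest tweet per user, take the most recent of those dates, and keep only tweets from that moment on. The dict keys `username` and `created_at` are assumptions about how the rows are stored.

```python
from datetime import datetime


def oldest_per_user(tweets):
    """Map each username to the date of its oldest available tweet."""
    oldest = {}
    for tweet in tweets:
        user, created = tweet["username"], tweet["created_at"]
        if user not in oldest or created < oldest[user]:
            oldest[user] = created
    return oldest


def common_timeframe(tweets):
    """Return the cut-off date and the tweets posted on or after it."""
    cutoff = max(oldest_per_user(tweets).values())
    kept = [tweet for tweet in tweets if tweet["created_at"] >= cutoff]
    return cutoff, kept
```

The comparison is inclusive, so the user who defines the cut-off keeps all of their tweets.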
After this filter I got 25418-14470=10948 tweets, split in this way:
|username||tweets||oldest tweet||newest tweet|
|asksol||52||2017-09-11 18:36:19||2018-05-10 14:58:17|
|benoitc||1849||2017-08-26 20:49:13||2018-05-11 19:36:21|
|cournape||1888||2017-08-27 01:02:11||2018-05-11 16:55:39|
|kennethreitz||3200||2017-08-26 20:48:35||2018-05-11 21:07:46|
|mitsuhiko||2328||2017-08-27 08:22:19||2018-05-11 19:26:07|
|teoliphant||443||2017-08-26 22:57:23||2018-05-11 16:48:39|
|wesmckinn||591||2017-08-28 18:07:08||2018-05-03 14:51:29|
|zzzeek||596||2017-08-26 22:14:11||2018-05-11 14:17:02|
Other preprocessing steps:
- I parsed the source information using Beautiful Soup because it contained HTML markup
- I removed smart quotes from text
- I converted the text to lower case
- I removed URLs and numbers
I filtered out all the tweets with empty text after these steps (i.e. containing only URLs, etc.) and I got 10948-125=10823 tweets.
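A rough sketch of these cleaning steps using only the standard library: `html.parser` stands in for Beautiful Soup here. Removing mentions, hashtags and punctuation is an assumption, inferred from the cleaned text shown in the sample rows rather than stated in the list above.

```python
import re
from html.parser import HTMLParser

SMART_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def source_label(source_html):
    """Turn the HTML-formatted 'source' field into a plain client name."""
    parser = _TextExtractor()
    parser.feed(source_html)
    parser.close()
    return "".join(parser.parts).strip()


def clean_text(text):
    """Normalize quotes, lowercase, then drop URLs, numbers and symbols."""
    for smart, plain in SMART_QUOTES.items():
        text = text.replace(smart, plain)
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags (assumed)
    text = re.sub(r"\d+", " ", text)           # numbers
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, emoji (assumed)
    return " ".join(text.split())
```

Tweets whose cleaned text is an empty string are the ones dropped by the final filter.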
I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.
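One possible way to derive the two columns from the raw tweet dicts returned by the API is sketched below. It relies on the `retweeted_status` and `in_reply_to_*` fields of the tweet payload. The exact rule used in the original notebook is not stated, so treat this as an assumption.

```python
def tweet_type(tweet):
    """Classify a raw tweet dict as 'retweet', 'reply' or 'standard'."""
    if "retweeted_status" in tweet:
        return "retweet"
    if (tweet.get("in_reply_to_status_id") is not None
            or tweet.get("in_reply_to_user_id") is not None):
        return "reply"
    return "standard"


def tweet_length(tweet):
    """Length of the original (uncleaned) tweet text, in characters."""
    return len(tweet["text"])
```

The retweet check comes first on purpose: a retweet of a reply should still count as a retweet.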
Here are some columns from the final tweets dataframe (first 5 rows):
|kennethreitz||2018-05-11 21:07:46||RT @IAmAru: Trio is @kennethreitz-approved. #PyCon2018||rt trio is approved||en||0.0||1||Tweetbot for iΟS||retweet||54|
|kennethreitz||2018-05-10 21:55:35||If you want to say hi, swing by the DigitalOcean booth during the opening reception! #PyCon2018||if you want to say hi swing by the digitalocean booth during the opening reception||en||20.0||2||Tweetbot for iΟS||standard||95|
|kennethreitz||2018-05-10 20:34:20||@dbinoj 24x 1x dynos right now||x x dynos right now||en||0.0||0||Tweetbot for iΟS||reply||30|
|kennethreitz||2018-05-10 20:11:37||Swing by the @IndyPy booth for your chance to win a signed copy of The Hitchhiker's Guide to Python! ✨🍰✨ https://t.co/CZhd2If5s0 https://t.co/3kUaqu5TMX||swing by the booth for your chance to win a signed copy of the hitchhiker s guide to python||en||25.0||3||IFTTT||standard||152|
|kennethreitz||2018-05-10 13:53:31||Let's do this https://t.co/6xLCE4WCqA https://t.co/ERiMmffe8L||let s do this||en||22.0||1||IFTTT||standard||61|
The users dataframe itself already shows some insights:
- There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn
- @wesmckinn, @kennethreitz, @teoliphant and @mitsuhiko are the most popular accounts in the list (according to my "popularity" indicator):
- Since the creation of his account, @kennethreitz has written at least twice as many tweets per day as any other dev in the list:
- Most of the accounts in the list live in the US; I used Folium to create a map showing the locations:
The tweets dataframe, instead, needs some manipulation before we can gather good insights from it.
We can also group by username and tweet type, and show a chart with the mean tweet length. @kennethreitz for example writes replies shorter than standard tweets, while @teoliphant writes tweets longer than the other guys (exceeding the 140 chars limit):
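The grouping behind such a chart can be sketched without Pandas as follows. The keys `username`, `tweet_type` and `length` are assumed column names. In Pandas the same result would come from a group-by on the first two columns followed by a mean of the third.

```python
from collections import defaultdict
from statistics import mean


def mean_length_by_type(tweets):
    """Average tweet length for every (username, tweet type) pair."""
    lengths = defaultdict(list)
    for tweet in tweets:
        key = (tweet["username"], tweet["tweet_type"])
        lengths[key].append(tweet["length"])
    return {key: mean(values) for key, values in lengths.items()}
```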
Ok, now let's filter out the retweets and let's focus on the machine-detected language used in standard tweets and replies. The five most common languages are: English, German, French, undefined and a rather weird "Tagalog" (ISO 639-1 code "tl", maybe an error in auto-detection?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets a lot in French:
So, let's just select tweets in English or undefined: all the next charts consider only tweets and replies in English (but obviously you can tune your analysis differently).
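A small sketch of this selection, assuming each row carries the machine-detected language in a `lang` key (Twitter reports undefined languages as `und`) and the tweet type in a `tweet_type` key. Both key names are assumptions.

```python
from collections import Counter

KEPT_LANGUAGES = {"en", "und"}  # English and undefined


def language_counts(tweets):
    """Count detected languages over standard tweets and replies only."""
    return Counter(
        tweet["lang"] for tweet in tweets if tweet["tweet_type"] != "retweet"
    )


def english_or_undefined(tweets):
    """Keep non-retweets whose detected language is English or undefined."""
    return [
        tweet
        for tweet in tweets
        if tweet["tweet_type"] != "retweet" and tweet["lang"] in KEPT_LANGUAGES
    ]
```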
Let's group by username and get statistics about the number of favorites/retweets per user:
|username||favorite_count count||favorite_count max||favorite_count mean||favorite_count std||retweet_count max||retweet_count mean||retweet_count std|
From this table we can see that:
- @kennethreitz has the most retweeted and favorited tweet in the dataframe. Here's the tweet:
- @wesmckinn has the second most retweeted and favorited tweet in the dataframe. Here's the tweet:
- @wesmckinn has the highest mean value for retweet count and favorite count
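The per-user statistics in such a table (count, max, mean, standard deviation) can be reproduced with the standard library, as in this sketch. The sample standard deviation is used, which is also what Pandas computes by default. The field names are assumptions.

```python
from statistics import mean, stdev


def engagement_stats(tweets, field):
    """Per-user count, max, mean and sample standard deviation of a field."""
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet["username"], []).append(tweet[field])
    return {
        user: {
            "count": len(values),
            "max": max(values),
            "mean": mean(values),
            # The sample standard deviation is undefined for a single value.
            "std": stdev(values) if len(values) > 1 else None,
        }
        for user, values in by_user.items()
    }
```

Calling it once with `favorite_count` and once with `retweet_count` gives the two halves of the table.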
Since @wesmckinn also has the highest followers count, how do these stats change if we normalize the dataframe using the followers count?
Obviously a tweet can be favorited/retweeted by non-followers too, but this normalization will probably produce fairer results, because the higher the followers count, the more the tweet will probably be viewed.
|username||favorite_count perc count||favorite_count perc max||favorite_count perc mean||favorite_count perc std||retweet_count perc max||retweet_count perc mean||retweet_count perc std|
After the normalization we can see that @cournape and @teoliphant are getting higher mean values, in terms of retweets and favorites.
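One plausible reading of this normalization, suggested by the "perc" column names, is to express each count as a percentage of the author's followers. The sketch below follows that reading. The exact formula used for the table is an assumption, as are the key names.

```python
def as_percentage_of_followers(count, followers):
    """Express a favorite/retweet count as a percentage of the followers."""
    return 100.0 * count / followers if followers else None


def normalize(tweets, followers_by_user):
    """Return copies of the tweets with two extra normalized columns."""
    normalized = []
    for tweet in tweets:
        followers = followers_by_user[tweet["username"]]
        row = dict(tweet)  # copy, so the input rows stay untouched
        row["favorite_count_perc"] = as_percentage_of_followers(
            tweet["favorite_count"], followers
        )
        row["retweet_count_perc"] = as_percentage_of_followers(
            tweet["retweet_count"], followers
        )
        normalized.append(row)
    return normalized
```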
We can also see how the monthly number of tweets changes over time, per user. From the following chart we can see for example that @kennethreitz tweeted a lot in September 2017 (more than 800 tweets):
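The monthly counts behind such a chart boil down to counting tweets per (user, year-month) pair. This is a sketch in plain Python, with `username` and `created_at` as assumed keys.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(tweets):
    """Number of tweets per (username, 'YYYY-MM') pair."""
    return Counter(
        (tweet["username"], tweet["created_at"].strftime("%Y-%m"))
        for tweet in tweets
    )
```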
Or we can even see which tools are used the most to tweet, per user:
I grouped a lot of less used tools under "Other" (Tweetbot for iΟS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).
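A simple way to do this grouping is to keep the most frequent tools and fold everything else into "Other", as sketched below. How the cut was actually chosen is not stated in the post, so the `top_n` rule is an assumption.

```python
from collections import Counter


def group_rare_sources(sources, top_n=5):
    """Keep the top_n most used tools and merge the rest under 'Other'."""
    counts = Counter(sources)
    common = {name for name, _ in counts.most_common(top_n)}
    grouped = Counter()
    for name, total in counts.items():
        grouped[name if name in common else "Other"] += total
    return grouped
```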
But what are the devs in the list talking about?
Let's start with a simple visualization, a word cloud.
After some basic preprocessing of the text from standard tweets only (tokenization, POS tagging, stopwords removal, bigrams, etc.), I grouped the tweets by username and got the most common words for each one:
|@asksol||6||[('python', 3), ('enjoy', 1), ('seeing', 1), ('process', 1), ('handle', 1)]|
|@benoitc||488||[('like', 40), ('erlang', 33), ('use', 31), ('code', 30), ('people', 30)]|
|@cournape||43||[('japan', 8), ('japanese', 6), ('#pyconjp', 6), ('shibuya', 4), ('python', 4)]|
|@kennethreitz||1109||[('pipenv', 157), ('python', 84), ('new', 77), ('requests', 64), ('released', 53)]|
|@mitsuhiko||399||[('rust', 53), ('like', 36), ('people', 27), ('new', 25), ('way', 20)]|
|@teoliphant||113||[('#pydata', 39), ('#python', 36), ('@anacondainc', 18), ('great', 18), ('new', 15)]|
|@wesmckinn||129||[('@apachearrow', 32), ('data', 21), ('pandas', 16), ('python', 12), ('new', 10)]|
|@zzzeek||170||[('like', 14), ('years', 11), ('python', 10), ('time', 10), ('use', 9)]|
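The counting step can be sketched with a `Counter`. For brevity this version splits on whitespace and uses a tiny hand-written stopword set, both of which are simplifications standing in for the NLTK-based preprocessing described above.

```python
from collections import Counter

# A deliberately tiny stopword set, used here only for illustration.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "by", "s"}


def top_words(texts, n=5):
    """Most common non-stopword tokens across a user's lowercased tweets."""
    counts = Counter(
        token
        for text in texts
        for token in text.split()
        if token not in STOPWORDS
    )
    return counts.most_common(n)
```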
Then I created a word cloud for each username using word_cloud. All the guys are talking about Python or their libraries (like pipenv, pandas, sqlalchemy, etc); we can also spot some other programming languages like erlang and rust.
The next step is to identify real topics, using an LdaModel from Gensim. I again used standard tweets only, from the two accounts with the highest number of tweets (@kennethreitz and @mitsuhiko), and I performed the same preprocessing used for the word cloud generation.
I ran the model varying two parameters:
- the number of topics (ranging between 2 and 14)
- the alpha value (with possible values 0.2, 0.3, 0.4)
Then I chose the best solution using the Gensim built-in CoherenceModel, with c_v as the metric: the optimal model is the one with 9 topics and alpha=0.2.
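The parameter search described above could look roughly like the following sketch. It assumes the bounds 2 and 14 are inclusive and that `texts` is a list of token lists. The function names and the fixed `random_state` are illustrative and do not come from the original notebook. Gensim is imported lazily inside the function that needs it.

```python
from itertools import product

TOPIC_COUNTS = range(2, 15)   # 2..14, assuming both bounds are inclusive
ALPHAS = (0.2, 0.3, 0.4)


def parameter_grid():
    """All (num_topics, alpha) combinations to evaluate."""
    return list(product(TOPIC_COUNTS, ALPHAS))


def best_lda(texts, random_state=0):
    """Train one LDA model per combination and keep the most coherent one.

    Returns a (coherence, num_topics, alpha, model) tuple.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    best = None
    for num_topics, alpha in parameter_grid():
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            alpha=[alpha] * num_topics,  # same prior for every topic
            random_state=random_state,
        )
        coherence = CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence="c_v"
        ).get_coherence()
        if best is None or coherence > best[0]:
            best = (coherence, num_topics, alpha, model)
    return best
```

With 13 topic counts and 3 alpha values this trains 39 models, so on a larger corpus it may be worth narrowing the grid first.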
Here are the topics:
|topic number||top words|
|0||(0, '0.125*"way" + 0.094*"favorite" + 0.076*"feature" + 0.067*"oh" + 0.063*"think"')|
|1||(1, '0.140*"pipenv" + 0.124*"released" + 0.098*"pipenv_released" + 0.082*"want" + 0.073*"code"')|
|2||(2, '0.271*"python" + 0.132*"today" + 0.093*"people" + 0.039*"month" + 0.036*"kenneth"')|
|3||(3, '0.183*"requests" + 0.134*"love" + 0.081*"work" + 0.071*"html" + 0.057*"github"')|
|4||(4, '0.164*"like" + 0.100*"rust" + 0.098*"time" + 0.058*"day" + 0.047*"things"')|
|5||(5, '0.297*"pipenv" + 0.062*"support" + 0.058*"includes" + 0.045*"right" + 0.044*"making"')|
|6||(6, '0.271*"new" + 0.076*"getting" + 0.075*"better" + 0.058*"use" + 0.049*"photos"')|
|7||(7, '0.161*"good" + 0.097*"going" + 0.092*"got" + 0.067*"happy" + 0.058*"current"')|
|8||(8, '0.114*"great" + 0.091*"ipad" + 0.076*"finally" + 0.066*"heroku" + 0.057*"working"')|
We can check the intertopic distance map and the most relevant terms for each topic using pyLDAvis: you can explore the interactive data in the Jupyter notebook in my GitHub account.
In this post I showed how to get data from Twitter APIs and how to perform some simple analysis in order to know in advance some features about an account (e.g. tweet style, statistics about tweets, topics).
Your mileage may vary depending on the initial account list and the configuration of the algorithms (especially in topics detection).
I uploaded a Jupyter notebook to my GitHub account, with some of the snippets I used to create this blog post.
- Improve preprocessing using lemmatization and stemming
- Try different algorithms for topics detection using Gensim (e.g. AuthorTopicModel or LDAMallet) or scikit-learn
- Add sentiment analysis