Eric Bonfadini

Posted on May 14, 2018

You can tell a man by his tweets

#dataviz #dataanalysis #python #api

Some time ago I read a tweet by Kenneth Reitz, a very well-known Python developer I follow on Twitter, asking:

Starting from this, I decided to analyze some tweets from pretty popular Python devs in order to understand a priori how they use Twitter, what they tweet about and what I can gather using data from the Twitter APIs only.
Obviously you can apply the same analysis to a different list of Twitter accounts.

Setting up the environment

For my analysis I set up a Python 3.6 virtual environment with the following main libraries:

Tweepy for interaction with Twitter APIs
Pandas for data analysis and visualization
Beautiful Soup 4, NLTK and Gensim for processing text data

Some extra libraries will be introduced later on, along with the explanation of the steps I took.

In order to access the Twitter APIs I registered my app and then I provided the tweepy library with the consumer_key, access_token, access_token_secret and consumer_secret.
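As a rough sketch of this setup (not the exact code from my notebook), the four credentials can be read from the environment and handed to Tweepy. The environment variable names and the two helper functions are assumptions made for illustration.

```python
import os

# Hypothetical environment variable names; use whatever your setup defines.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET",
                 "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


def load_credentials(env=os.environ):
    """Read the four Twitter credentials from a mapping (default: environment)."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


def build_api(credentials):
    """Create an authenticated Tweepy client for the Twitter REST API."""
    import tweepy  # imported lazily so load_credentials works without Tweepy

    auth = tweepy.OAuthHandler(credentials["CONSUMER_KEY"],
                               credentials["CONSUMER_SECRET"])
    auth.set_access_token(credentials["ACCESS_TOKEN"],
                          credentials["ACCESS_TOKEN_SECRET"])
    # wait_on_rate_limit makes Tweepy sleep instead of failing on rate limits
    return tweepy.API(auth, wait_on_rate_limit=True)
```

Calling build_api(load_credentials()) then gives an api object for the calls in the next sections; keeping the secrets out of the source code also makes the notebook safer to share.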

We're now ready to get some real data from Twitter!

Choosing a list of Twitter accounts

First of all, I chose a list of 8 popular Python devs, starting from the top 360 most-downloaded packages on PyPI and selecting some libraries I know or use daily.

Here's my final list, including links to the Twitter accounts and the libraries (from the above-mentioned list) for which those guys are known:

@kennethreitz: requests
@mitsuhiko: jinja2/flask
@zzzeek: sqlalchemy
@teoliphant: numpy/scipy/numba
@benoitc: gunicorn
@asksol: celery
@wesmckinn: pandas
@cournape: scikit-learn

Getting data from Twitter

I got all the data with two endpoints only:

with a call to lookup users I could get all the information about the accounts (creation date, description, counts, location, etc.)
with a call to user timeline I could get the tweets of a single user and all the information related to every single tweet. I configured the call to also include retweets and replies.

I saved the results from the two calls in two Pandas dataframes in order to ease the data processing, and then into CSV files to be used as a starting point for the next steps without calling the Twitter API again each time.
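Here is a minimal sketch of the timeline download, assuming an authenticated Tweepy api object. The function names and the set of columns kept are illustrative choices, and status._json is Tweepy's raw payload attribute rather than a documented public field.

```python
def fetch_timeline(api, screen_name):
    """Yield the raw JSON dict of every tweet Twitter returns for one user."""
    import tweepy  # lazy import: only needed when calling the real API

    cursor = tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,               # page size requested per call
        include_rts=True,        # keep retweets
        exclude_replies=False,   # keep replies
        tweet_mode="extended",   # ask for the untruncated full_text
    )
    for status in cursor.items():
        yield status._json


def status_to_row(status):
    """Flatten one tweet dict into a flat row for a dataframe or CSV file."""
    return {
        "username": status["user"]["screen_name"],
        "created_at": status["created_at"],
        # full_text is present in extended mode, text otherwise
        "full_text": status.get("full_text", status.get("text", "")),
        "lang": status.get("lang"),
        "favorite_count": status.get("favorite_count", 0),
        "retweet_count": status.get("retweet_count", 0),
        "source": status.get("source", ""),
    }
```

A list of such rows can be passed to pandas.DataFrame and written out with to_csv, so that later steps start from the file instead of the API.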

Preprocessing tweets

The users dataframe contained all the information I needed; I just created three more columns:

a followers/following ratio, a sort of "popularity" indicator
a tweets per day ratio, dividing the total number of tweets by the number of days since the creation of the account
the coordinates starting from the location, if available, using Geopy. @benoitc doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)
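The first two derived columns can be sketched as follows. The function name and the reference date passed as today are assumptions, and the Geopy lookup is left out because it needs network access.

```python
import pandas as pd


def add_user_ratios(users, today):
    """Return a copy of `users` with the two derived ratio columns."""
    out = users.copy()
    out["followers/following"] = out["followers_count"] / out["following_count"]
    # whole days elapsed between account creation and the reference date
    age_days = (pd.Timestamp(today) - pd.to_datetime(out["created_at"])).dt.days
    out["tweets_per_day"] = out["total_tweets"] / age_days
    return out


# One row taken from the users table, with an assumed reference date
users = pd.DataFrame({
    "screen_name": ["kennethreitz"],
    "created_at": ["2009-06-24 23:28:06"],
    "total_tweets": [58195],
    "followers_count": [18213],
    "following_count": [480],
})
result = add_user_ratios(users, today="2018-05-11 12:00:00")
print(result[["screen_name", "followers/following", "tweets_per_day"]])
```

Counting whole days keeps the ratio stable within a day; an account created on the reference date itself would need a guard against division by zero.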

Here's the final users dataframe:

screen_name	name	verified	description	created_at	location	location_coo	time_zone	total_tweets	favourites_count	followers_count	following_count	listed_count	followers/following	tweets_per_day
kennethreitz	Kenneth Reitz	False	@DigitalOcean & @ThePSF. Creator of Requests: HTTP for Humans. I design art — with code, cameras, musical instruments, and the English language.	2009-06-24 23:28:06	Winchester, VA	[39.1852184, -78.1652404]	Eastern Time (US & Canada)	58195	26575	18213	480	822	37.94375	17.950339296730412
mitsuhiko	Armin Ronacher	True	Creator of Flask; Building stuff at @getsentry; prev @fireteamltd, @splashdamage — writing and talking about system architecture, API design and lots of Python	2008-02-01 23:12:59	Austria	[47.2000338, 13.199959]	Vienna	31774	2059	21801	593	941	36.76391231028668	8.470807784590775
zzzeek	mike bayer	False		2008-06-11 19:22:19	northeast	[41.7370229, -99.5873816]	Quito	14771	1106	3013	209	194	14.416267942583731	4.080386740331492
teoliphant	Travis Oliphant	False	Creator of SciPy, NumPy, and Numba; founder and Director of Anaconda, Inc. Founder of NumFOCUS. CEO of Quansight	2009-04-17 20:04:57	Austin, TX	[30.2711286, -97.7436995]	Central Time (US & Canada)	3875	1506	18052	483	746	37.374741200828154	1.1706948640483383
benoitc	benoît chesneau	False	web craftsman	2007-02-13 16:53:37			Paris	27172	548	1971	704	247	2.799715909090909	6.620857699805068
asksol	Ask Solem	False	sound, noise, stream processing, distributed systems, data, open source python, etc. Works at @robinhoodapp 🌳🌳🌴🌵	2007-11-11 18:56:33	San Francisco, CA	[45.4423543, -73.4373087]	Pacific Time (US & Canada)	3249	191	2509	513	126	4.89083820662768	0.8476389251239238
wesmckinn	Wes McKinney	True	Data science toolmaker at https://t.co/YVn0VFqgj0. Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis. Views my own	2010-02-18 21:01:15	New York, NY	[40.7306458, -73.9866136]	Eastern Time (US & Canada)	7749	3021	33130	784	1277	42.25765306122449	2.5804195804195804
cournape	David Cournapeau	False	Python and Stats geek. Former NumPy/Scipy core contributor. Lead the ML engineering team @cogentlabs Occasional musings on economics/music/Japanese culture	2010-06-14 03:17:21	Tokyo, Japan	[34.6968642, 139.4049033]	Amsterdam	15577	505	800	427	112	1.873536299765808	5.395566331832352

The tweets dataframe on the contrary needed some extra preprocessing.

First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (more or less 3200 including retweets and replies). Therefore I decided to group the tweets by username and to get the oldest tweet date for each user:

username	count	min	max
asksol	2991	2009-07-19 19:24:33	2018-05-10 14:58:17
benoitc	3199	2017-02-21 14:55:37	2018-05-11 19:36:21
cournape	3179	2017-06-12 19:13:20	2018-05-11 16:55:39
kennethreitz	3200	2017-08-26 20:48:35	2018-05-11 21:07:46
mitsuhiko	3226	2017-06-14 13:23:57	2018-05-11 19:26:07
teoliphant	3201	2013-02-19 03:54:16	2018-05-11 16:48:39
wesmckinn	3205	2014-01-26 17:45:07	2018-05-03 14:51:29
zzzeek	3214	2015-05-05 13:35:38	2018-05-11 14:17:02

Then I filtered out all the tweets older than the latest of these per-user oldest dates (2017-08-26 20:48:35).
With these data, @kennethreitz determines the cut date because he tweets a lot more than the other users; this way we at least get the same timeframe for all the users and can compare tweets from the same period.
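The cut can be sketched in a few lines of Pandas, assuming a tweets dataframe with username and created_at columns; the function name is hypothetical.

```python
import pandas as pd


def trim_to_common_window(tweets):
    """Keep only the tweets posted on or after the common cut date."""
    tweets = tweets.assign(created_at=pd.to_datetime(tweets["created_at"]))
    # oldest available tweet for each user
    oldest_per_user = tweets.groupby("username")["created_at"].min()
    # the latest of those dates is the first moment covered for every user
    cut_date = oldest_per_user.max()
    return tweets[tweets["created_at"] >= cut_date], cut_date


toy = pd.DataFrame({
    "username": ["a", "a", "a", "b", "b"],
    "created_at": ["2017-01-01", "2017-09-01", "2018-01-01",
                   "2017-08-26", "2018-02-01"],
})
trimmed, cut_date = trim_to_common_window(toy)
print(cut_date, len(trimmed))
```

The comparison is inclusive, so the tweet that defines the cut date stays in the result.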

After this filter I got 25418-14470=10948 tweets, split in this way:

username	count	min	max
asksol	52	2017-09-11 18:36:19	2018-05-10 14:58:17
benoitc	1849	2017-08-26 20:49:13	2018-05-11 19:36:21
cournape	1888	2017-08-27 01:02:11	2018-05-11 16:55:39
kennethreitz	3200	2017-08-26 20:48:35	2018-05-11 21:07:46
mitsuhiko	2328	2017-08-27 08:22:19	2018-05-11 19:26:07
teoliphant	443	2017-08-26 22:57:23	2018-05-11 16:48:39
wesmckinn	591	2017-08-28 18:07:08	2018-05-03 14:51:29
zzzeek	596	2017-08-26 22:14:11	2018-05-11 14:17:02

Other preprocessing steps:

I parsed the source information using Beautiful Soup because it contained HTML entities
I removed smart quotes from text
I converted the text to lower case
I removed URLs and numbers

I filtered out all the tweets with empty text after these steps (i.e. containing only URLs, etc.) and I got 10948-125=10823 tweets.
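The text cleaning can be approximated with a few regular expressions. This is a sketch that reproduces the text_clean values shown in the sample rows further down, not necessarily the exact rules I used: the patterns, and the choice to also drop mentions, hashtags and punctuation, are assumptions.

```python
import re

# Map curly quotes to their plain ASCII counterparts
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                              "\u201c": '"', "\u201d": '"'})
URL_RE = re.compile(r"https?://\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
# Anything that is not a lowercase ASCII letter or whitespace; note that
# this also strips accented letters, which matters for non-English tweets
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lowercase a tweet and strip quotes, URLs, mentions, hashtags, digits."""
    text = text.translate(SMART_QUOTES).lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def drop_empty(texts):
    """Discard tweets whose cleaned text is empty (e.g. only URLs)."""
    return [cleaned for cleaned in map(clean_text, texts) if cleaned]
```

Removing URLs before the generic non-letter pass matters: otherwise fragments of each link would survive as meaningless tokens.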

I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.
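A possible way to derive the two columns from the raw tweet payload is sketched below. It assumes the dict layout returned by the Twitter API, where retweets carry a retweeted_status object and replies carry in_reply_to_status_id; the function names are hypothetical.

```python
def tweet_type(status):
    """Classify a raw tweet dict as 'retweet', 'reply' or 'standard'."""
    if "retweeted_status" in status:
        return "retweet"
    if status.get("in_reply_to_status_id") is not None:
        return "reply"
    return "standard"


def tweet_len(status):
    """Length in characters of the original (uncleaned) tweet text."""
    return len(status.get("full_text", ""))


reply = {"full_text": "@dbinoj 24x 1x dynos right now",
         "in_reply_to_status_id": 1}
print(tweet_type(reply), tweet_len(reply))
```

The retweet check comes first because a retweet of a reply would otherwise be counted as a reply.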

Here are some columns from the final tweets dataframe (first 5 rows):

username	created_at	full_text	text_clean	lang	favorite_count	retweet_count	source	tweet_type	tweet_len
kennethreitz	2018-05-11 21:07:46	RT @IAmAru: Trio is @kennethreitz-approved. #PyCon2018	rt trio is approved	en	0.0	1	Tweetbot for iΟS	retweet	54
kennethreitz	2018-05-10 21:55:35	If you want to say hi, swing by the DigitalOcean booth during the opening reception! #PyCon2018	if you want to say hi swing by the digitalocean booth during the opening reception	en	20.0	2	Tweetbot for iΟS	standard	95
kennethreitz	2018-05-10 20:34:20	@dbinoj 24x 1x dynos right now	x x dynos right now	en	0.0	0	Tweetbot for iΟS	reply	30
kennethreitz	2018-05-10 20:11:37	Swing by the @IndyPy booth for your chance to win a signed copy of The Hitchhiker's Guide to Python! ✨🍰✨ https://t.co/CZhd2If5s0 https://t.co/3kUaqu5TMX	swing by the booth for your chance to win a signed copy of the hitchhiker s guide to python	en	25.0	3	IFTTT	standard	152
kennethreitz	2018-05-10 13:53:31	Let's do this https://t.co/6xLCE4WCqA https://t.co/ERiMmffe8L	let s do this	en	22.0	1	IFTTT	standard	61

Exploratory Data Analysis

The users dataframe itself already shows some insights:

There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn
@wesmckinn, @kennethreitz, @teoliphant and @mitsuhiko are the most popular accounts in the list (according to my "popularity" indicator):
@kennethreitz has written, since the creation of his account, at least twice as many tweets per day as any other dev in the list:
Most of the accounts in the list live in the US; I used Folium to create a map showing the locations:

The tweets dataframe instead needs some manipulation before we can gather some good insights.

First of all let's check the tweet "style" of each account. From the following chart we can see for example that @cournape is retweeting a lot, while @mitsuhiko is replying a lot:

We can also group by username and tweet type, and show a chart with the mean tweet length. @kennethreitz for example writes replies shorter than his standard tweets, while @teoliphant writes tweets longer than the other guys (exceeding the 140-character limit):

Ok, now let's filter out the retweets and let's focus on the machine-detected language used in standard tweets and replies. The five most common languages are: English, German, French, undefined and a rather weird "tagalog" (ISO 639-1 code "tl", maybe an error in auto-detection?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets a lot in French:

So, let's just select tweets in English or undefined: all the next charts consider only tweets and replies in English or with an undefined language (but obviously you can tune your analysis differently).
Let's group by username and get statistics about the number of favorites/retweets per user:

username	favorite_count count	favorite_count max	favorite_count mean	favorite_count std	retweet_count max	retweet_count mean	retweet_count std
asksol	46	41.0	1.608695652173913	6.111840097055933	3.0	0.10869565217391304	0.48204475908203187
benoitc	1009	30.0	0.6531219028741329	1.8313280878865186	17.0	0.13676907829534193	0.7644934696088941
cournape	214	60.0	1.2757009345794392	4.449367547428712	25.0	0.205607476635514	1.7481758044670788
kennethreitz	2637	3932.0	10.062571103526736	82.09998594317476	2573.0	2.620781190747061	50.79602602503255
mitsuhiko	1547	752.0	9.657401422107304	41.06463543974671	220.0	1.8526179702650292	9.932970595417615
teoliphant	186	808.0	26.080645161290324	69.54002504187612	134.0	7.806451612903226	17.085639972995896
wesmckinn	433	2081.0	45.750577367205544	142.2699008271913	695.0	12.270207852193995	48.083342617014644
zzzeek	439	85.0	2.173120728929385	6.417876507767421	28.0	0.44874715261959	1.896040581838119

From this table we can see that:

@kennethreitz has the most retweeted and favorited tweet in the dataframe. Here's the tweet:
@wesmckinn has the second most retweeted and favorited tweet in the dataframe. Here's the tweet:
@wesmckinn has highest mean value for retweet count and favorite count

Since @wesmckinn also has the highest followers count, how do these stats change if we normalize the dataframe using the followers count?
Obviously a tweet can get favorited/retweeted even by non-followers, but this normalization will probably produce fairer results, because the higher the followers count, the more the tweet will probably be viewed.

username	favorite_count perc count	favorite_count perc max	favorite_count perc mean	favorite_count perc std	retweet_count perc max	retweet_count perc mean	retweet_count perc std
asksol	46	1.634117178158629	0.06411700486942658	0.243596655920922	0.11956954962136308	0.0043322300587450395	0.019212624913592345
benoitc	1009	1.5220700152207	0.0331365754882869	0.09291365235345102	0.8625063419583967	0.0069390704360904	0.038787086230791176
cournape	214	7.5	0.1594626168224299	0.556170943428589	3.125	0.02570093457943925	0.21852197555838485
kennethreitz	2637	21.588974908032725	0.055249388368344046	0.4507768404061633	14.127271728984791	0.014389618353632417	0.27889982992935036
mitsuhiko	1547	3.4493830558231275	0.04429797450624903	0.18836124691411713	1.009128021650383	0.008497857760034094	0.04556199530029643
teoliphant	186	4.475958342565921	0.14447510060541938	0.38522061290647097	0.7423000221582097	0.043244247800261586	0.09464679798911974
wesmckinn	433	6.281316027769393	0.138094106149126	0.42942922072801465	2.0977965590099608	0.037036546490172004	0.1451353535074393
zzzeek	439	2.8211085297046132	0.07212481675836008	0.21300619010180605	0.9293063391968137	0.01489369905806803	0.06292866185987782

After the normalization we can see that @cournape gets the highest mean value for favorites and @teoliphant the highest mean value for retweets.
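Both tables above can be produced with one groupby. The sketch below assumes a tweets dataframe with username, favorite_count and retweet_count columns, plus a Series of followers counts indexed by username; the function name is hypothetical. Normalized values are expressed as a percentage of the followers count.

```python
import pandas as pd


def engagement_stats(tweets, followers, normalize=False):
    """Per-user count/max/mean/std of favorites and retweets.

    followers: Series of followers counts indexed by username.
    With normalize=True each count becomes a percentage of the followers.
    """
    cols = ["favorite_count", "retweet_count"]
    data = tweets[["username"] + cols].copy()
    if normalize:
        per_tweet_followers = data["username"].map(followers)
        for col in cols:
            data[col] = data[col] / per_tweet_followers * 100
    return data.groupby("username")[cols].agg(["count", "max", "mean", "std"])


toy = pd.DataFrame({
    "username": ["kennethreitz", "kennethreitz"],
    "favorite_count": [3932, 0],
    "retweet_count": [2573, 1],
})
toy_followers = pd.Series({"kennethreitz": 18213})
print(engagement_stats(toy, toy_followers, normalize=True))
```

The result has two-level column labels such as ("favorite_count", "mean"), which is where the compound headers in the tables come from.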

We can also see how the monthly number of tweets changes over time, per user. From the following chart we can see for example that @kennethreitz tweeted a lot in September 2017 (more than 800 tweets):
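The data behind such a chart is a username-by-month table of counts. A minimal sketch, assuming the same username and created_at columns as before (the function name is hypothetical):

```python
import pandas as pd


def monthly_counts(tweets):
    """Return a username x month table with the number of tweets per cell."""
    created = pd.to_datetime(tweets["created_at"])
    month = created.dt.strftime("%Y-%m").rename("month")
    # months with no tweets for a user are filled with 0
    return tweets.groupby(["username", month]).size().unstack(fill_value=0)


toy = pd.DataFrame({
    "username": ["a", "a", "b"],
    "created_at": ["2017-09-01 10:00:00", "2017-09-15 11:00:00",
                   "2017-10-02 09:00:00"],
})
table = monthly_counts(toy)
print(table)
```

Transposing the table and calling its plot method gives one line per user over time.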

Or we can even see which tools are used the most to tweet, per user:
I grouped a lot of less used tools under "Other" (Tweetbot for iΟS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).

Finally, we can build a kind of punchcard chart for each user, showing an aggregation of tweets dates by day of the week and hours of the day:
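The aggregation behind a punchcard is just a count per (weekday, hour) cell. Here is a sketch using only the standard library; the function name and the timestamp format (the one shown in the tables of this post) are assumptions, and the hours are taken as stored, with no time zone conversion.

```python
from collections import Counter
from datetime import datetime


def punchcard(timestamps):
    """Count tweets per (weekday, hour) cell; Monday is weekday 0."""
    counts = Counter()
    for ts in timestamps:
        moment = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        counts[(moment.weekday(), moment.hour)] += 1
    return counts


cells = punchcard([
    "2018-05-11 21:07:46",  # a Friday
    "2018-05-10 21:55:35",  # a Thursday
    "2018-05-10 20:34:20",
])
print(cells)
```

Each cell can then be drawn as a circle whose size is proportional to its count.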

Topics

But what are the devs in the list talking about?

Let's start with a simple visualization, a word cloud.
After some basic preprocessing of the text from standard tweets only (tokenization, POS tagging, stopword removal, bigrams, etc.), I grouped the tweets by username and got the most common words for each one:

username	tweet count	most_common
@asksol	6	[('python', 3), ('enjoy', 1), ('seeing', 1), ('process', 1), ('handle', 1)]
@benoitc	488	[('like', 40), ('erlang', 33), ('use', 31), ('code', 30), ('people', 30)]
@cournape	43	[('japan', 8), ('japanese', 6), ('#pyconjp', 6), ('shibuya', 4), ('python', 4)]
@kennethreitz	1109	[('pipenv', 157), ('python', 84), ('new', 77), ('requests', 64), ('released', 53)]
@mitsuhiko	399	[('rust', 53), ('like', 36), ('people', 27), ('new', 25), ('way', 20)]
@teoliphant	113	[('#pydata', 39), ('#python', 36), ('@anacondainc', 18), ('great', 18), ('new', 15)]
@wesmckinn	129	[('@apachearrow', 32), ('data', 21), ('pandas', 16), ('python', 12), ('new', 10)]
@zzzeek	170	[('like', 14), ('years', 11), ('python', 10), ('time', 10), ('use', 9)]
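A stripped-down version of this counting step is sketched below. It only splits on whitespace and filters a tiny, made-up stopword list; the POS tagging and bigram detection mentioned above are left out, and the names are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; a real run would use a full list
# such as the English one shipped with NLTK.
STOPWORDS = {"a", "an", "and", "is", "of", "out", "the", "to"}


def most_common_words(texts, top_n=5):
    """Return the top_n (word, count) pairs over a user's cleaned tweets."""
    counter = Counter()
    for text in texts:
        tokens = text.lower().split()
        counter.update(tok for tok in tokens
                       if tok not in STOPWORDS and len(tok) > 1)
    return counter.most_common(top_n)


print(most_common_words(["pipenv released", "new pipenv is out",
                         "python and pipenv"]))
```

The same word frequencies are also what a word cloud needs as input.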

Then I created a word cloud for each username using word_cloud. All the guys are talking about Python or their libraries (like pipenv, pandas, sqlalchemy, etc); we can also spot some other programming languages like erlang and rust.

The next step is to identify real topics, using an LdaModel from Gensim. I still used standard tweets only, from the two accounts with the highest number of tweets (@kennethreitz and @mitsuhiko), and I performed the same preprocessing used for the word cloud generation.
I ran the model using two dynamic values:

the number of topics (ranging between 2 and 14)
the alpha value (with possible values 0.2, 0.3, 0.4)

Then I chose the best solution using the Gensim built-in Coherence Model, using c_v as a metric: the optimal model is the one with 9 topics and alpha=0.2.
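The search over the two values can be sketched as a small grid search. The helper names, the fixed random seed and the remaining LdaModel defaults are assumptions; texts is expected to be a list of token lists, one per tweet.

```python
from itertools import product

NUM_TOPICS = range(2, 15)   # 2 to 14 inclusive
ALPHAS = (0.2, 0.3, 0.4)


def coherence_for(texts, num_topics, alpha, seed=0):
    """Train one LDA model and return its c_v coherence score."""
    from gensim.corpora import Dictionary          # lazy imports: Gensim is
    from gensim.models import CoherenceModel, LdaModel  # only needed here

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, alpha=alpha, random_state=seed)
    scorer = CoherenceModel(model=lda, texts=texts,
                            dictionary=dictionary, coherence="c_v")
    return scorer.get_coherence()


def best_setting(texts, score=coherence_for):
    """Return the (num_topics, alpha) pair with the highest score."""
    grid = product(NUM_TOPICS, ALPHAS)
    return max(grid, key=lambda params: score(texts, *params))
```

The grid has 39 combinations, so 39 models are trained; passing a different score function makes it easy to try another coherence metric or to test the search without Gensim.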

Here are the topics:

topic number	top words
0	(0, '0.125*"way" + 0.094*"favorite" + 0.076*"feature" + 0.067*"oh" + 0.063*"think"')
1	(1, '0.140*"pipenv" + 0.124*"released" + 0.098*"pipenv_released" + 0.082*"want" + 0.073*"code"')
2	(2, '0.271*"python" + 0.132*"today" + 0.093*"people" + 0.039*"month" + 0.036*"kenneth"')
3	(3, '0.183*"requests" + 0.134*"love" + 0.081*"work" + 0.071*"html" + 0.057*"github"')
4	(4, '0.164*"like" + 0.100*"rust" + 0.098*"time" + 0.058*"day" + 0.047*"things"')
5	(5, '0.297*"pipenv" + 0.062*"support" + 0.058*"includes" + 0.045*"right" + 0.044*"making"')
6	(6, '0.271*"new" + 0.076*"getting" + 0.075*"better" + 0.058*"use" + 0.049*"photos"')
7	(7, '0.161*"good" + 0.097*"going" + 0.092*"got" + 0.067*"happy" + 0.058*"current"')
8	(8, '0.114*"great" + 0.091*"ipad" + 0.076*"finally" + 0.066*"heroku" + 0.057*"working"')

We can check the intertopic distance map and the most relevant terms for each topic using pyLDAvis: you can explore the interactive data in the Jupyter notebook in my GitHub account.

Conclusions and future steps

In this post I showed how to get data from Twitter APIs and how to perform some simple analysis in order to know in advance some features about an account (e.g. tweet style, statistics about tweets, topics).
Your mileage may vary depending on the initial account list and the configuration of the algorithms (especially in topics detection).

I uploaded a Jupyter notebook to my GitHub account, with some snippets I used in order to create this blog post.

Next steps:

Improve preprocessing using lemmatization and stemming
Try different algorithms for topics detection using Gensim (e.g. AuthorTopicModel or LDAMallet) or scikit-learn
Add sentiment analysis

Top comments (3)

Andrea La Scola • May 15 '18

Stalking level: over 9000!! Awesome Job! 🚀

Ryan Palo • May 15 '18

This is awesome! Great analysis and write up, thanks!

Erin Moore • May 15 '18

Does it only work on men?
