<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eric Bonfadini</title>
    <description>The latest articles on DEV Community by Eric Bonfadini (@ericbonfadini).</description>
    <link>https://dev.to/ericbonfadini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F54228%2Fcbd95690-2e94-4104-9134-a5235d2c045e.png</url>
      <title>DEV Community: Eric Bonfadini</title>
      <link>https://dev.to/ericbonfadini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ericbonfadini"/>
    <language>en</language>
    <item>
      <title>You can tell a man by his tweets</title>
      <dc:creator>Eric Bonfadini</dc:creator>
      <pubDate>Mon, 14 May 2018 22:51:55 +0000</pubDate>
      <link>https://dev.to/ericbonfadini/you-can-tell-a-man-by-his-tweets-2l9l</link>
      <guid>https://dev.to/ericbonfadini/you-can-tell-a-man-by-his-tweets-2l9l</guid>
      <description>&lt;p&gt;Some time ago I read a tweet of Kenneth Reitz, a very well known Python developer I follow on Twitter, asking:&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-952553176925958145-292" src="https://platform.twitter.com/embed/Tweet.html?id=952553176925958145"&gt;
&lt;/iframe&gt;




&lt;br&gt;
Starting from this, I decided to analyze tweets from some pretty popular Python devs, to understand how they use Twitter, what they tweet about, and what can be gathered using data from the Twitter APIs only.&lt;br&gt;
You can obviously apply the same analysis to a different list of Twitter accounts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the environment
&lt;/h1&gt;

&lt;p&gt;For my analysis I set up a Python 3.6 virtual environment with the following main libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/tweepy/tweepy" rel="noopener noreferrer"&gt;Tweepy&lt;/a&gt; for interaction with Twitter APIs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; for data analysis and visualization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/beautifulsoup4/" rel="noopener noreferrer"&gt;Beautiful Soup 4&lt;/a&gt;, &lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK&lt;/a&gt; and &lt;a href="https://radimrehurek.com/gensim/" rel="noopener noreferrer"&gt;Gensim&lt;/a&gt; for processing text data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some extra libraries will be introduced later on, along with explanations of the corresponding steps.&lt;/p&gt;

&lt;p&gt;In order to access the Twitter APIs I &lt;a href="https://apps.twitter.com/" rel="noopener noreferrer"&gt;registered my app&lt;/a&gt; and then provided tweepy with the consumer_key, consumer_secret, access_token and access_token_secret.&lt;/p&gt;
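&lt;p&gt;As a minimal sketch (tweepy's standard OAuth flow; the credential values are placeholders you get from the app registration page):&lt;/p&gt;

```python
def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated tweepy API client (sketch)."""
    import tweepy  # imported inside so the sketch stays importable without tweepy installed

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes tweepy sleep through rate limits instead of raising
    return tweepy.API(auth, wait_on_rate_limit=True)
```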

&lt;p&gt;We're now ready to get some real data from Twitter!&lt;/p&gt;

&lt;h1&gt;
  
  
  Choosing a list of Twitter accounts
&lt;/h1&gt;

&lt;p&gt;First of all, I chose a list of 8 popular Python devs, starting from &lt;a href="https://pythonwheels.com/" rel="noopener noreferrer"&gt;the top 360 most-downloaded packages on PyPI&lt;/a&gt; and selecting some libraries I know or use daily.&lt;/p&gt;

&lt;p&gt;Here's my final list, including links to each Twitter account and the libraries (from the above-mentioned list) for which these developers are known:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/kennethreitz" rel="noopener noreferrer"&gt;@kennethreitz&lt;/a&gt;: requests&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/mitsuhiko" rel="noopener noreferrer"&gt;@mitsuhiko&lt;/a&gt;: jinja2/flask&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/zzzeek" rel="noopener noreferrer"&gt;@zzzeek&lt;/a&gt;: sqlalchemy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/teoliphant" rel="noopener noreferrer"&gt;@teoliphant&lt;/a&gt;: numpy/scipy/numba&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/benoitc" rel="noopener noreferrer"&gt;@benoitc&lt;/a&gt;: gunicorn&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/asksol" rel="noopener noreferrer"&gt;@asksol&lt;/a&gt;: celery&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/wesmckinn" rel="noopener noreferrer"&gt;@wesmckinn&lt;/a&gt;: pandas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/cournape" rel="noopener noreferrer"&gt;@cournape&lt;/a&gt;: scikit-learn&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Getting data from Twitter
&lt;/h1&gt;

&lt;p&gt;I got all the data using just two endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;with a call to &lt;a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup" rel="noopener noreferrer"&gt;lookup users&lt;/a&gt; I could get all the information about the accounts (creation date, description, counts, location, etc.)&lt;/li&gt;
&lt;li&gt;with a call to &lt;a href="https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html" rel="noopener noreferrer"&gt;user timeline&lt;/a&gt; I could get the tweets of a single user and all the information related to each tweet. I configured the call to also include retweets and replies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I saved the results of the two calls in two Pandas dataframes to ease the data processing, and then to CSV files, so the next steps could start from disk without re-calling the Twitter APIs each time.&lt;/p&gt;
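&lt;p&gt;A sketch of this fetch-and-cache step (&lt;code&gt;api&lt;/code&gt; stands for an authenticated tweepy client; the column and file names are my own choices):&lt;/p&gt;

```python
import pandas as pd

def timeline_to_frame(api, screen_name):
    """Fetch a user's timeline via tweepy and flatten it into a DataFrame (sketch)."""
    import tweepy  # assumed available, see the environment section

    cursor = tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,               # maximum tweets per page
        tweet_mode="extended",   # full, untruncated text
        include_rts=True,        # keep retweets too
    )
    rows = []
    for status in cursor.items():
        rows.append({
            "username": screen_name,
            "created_at": status.created_at,
            "full_text": status.full_text,
            "favorite_count": status.favorite_count,
            "retweet_count": status.retweet_count,
            "source": status.source,
        })
    return pd.DataFrame(rows)

def cache_frame(df, path):
    """Persist a DataFrame to CSV and read it back, as the later steps do."""
    df.to_csv(path, index=False)
    return pd.read_csv(path, parse_dates=["created_at"])
```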

&lt;h1&gt;
  
  
  Preprocessing tweets
&lt;/h1&gt;

&lt;p&gt;The users dataframe contained all the information I needed; I just created three more columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a followers/following ratio, a sort of "popularity" indicator&lt;/li&gt;
&lt;li&gt;a tweets per day ratio, dividing the total number of tweets by the number of days since the creation of the account&lt;/li&gt;
&lt;li&gt;the coordinates starting from the location, if available, using &lt;a href="https://pypi.org/project/geopy/" rel="noopener noreferrer"&gt;Geopy&lt;/a&gt;. &lt;a class="mentioned-user" href="https://dev.to/benoitc"&gt;@benoitc&lt;/a&gt; doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)&lt;/li&gt;
&lt;/ul&gt;
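&lt;p&gt;The first two derived columns can be sketched like this (column names assumed to match the table below; the Geopy lookup is omitted since it needs a network call):&lt;/p&gt;

```python
import pandas as pd

def add_derived_columns(users, now=None):
    """Add the followers/following ratio and tweets-per-day columns (sketch)."""
    now = now if now is not None else pd.Timestamp.now()
    users = users.copy()
    # a sort of "popularity" indicator
    users["followers/following"] = users["followers_count"] / users["following_count"]
    # total tweets divided by the account age in days
    age_days = (now - users["created_at"]).dt.days
    users["tweets_per_day"] = users["total_tweets"] / age_days
    return users
```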

&lt;p&gt;Here's the final users dataframe:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;screen_name&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;verified&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;th&gt;created_at&lt;/th&gt;
&lt;th&gt;location&lt;/th&gt;
&lt;th&gt;location_coo&lt;/th&gt;
&lt;th&gt;time_zone&lt;/th&gt;
&lt;th&gt;total_tweets&lt;/th&gt;
&lt;th&gt;favourites_count&lt;/th&gt;
&lt;th&gt;followers_count&lt;/th&gt;
&lt;th&gt;following_count&lt;/th&gt;
&lt;th&gt;listed_count&lt;/th&gt;
&lt;th&gt;followers/following&lt;/th&gt;
&lt;th&gt;tweets_per_day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;Kenneth Reitz&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;@DigitalOcean &amp;amp; @ThePSF. Creator of Requests: HTTP for Humans. I design art — with code, cameras, musical instruments, and the English language.&lt;/td&gt;
&lt;td&gt;2009-06-24 23:28:06&lt;/td&gt;
&lt;td&gt;Winchester, VA&lt;/td&gt;
&lt;td&gt;[39.1852184, -78.1652404]&lt;/td&gt;
&lt;td&gt;Eastern Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;58195&lt;/td&gt;
&lt;td&gt;26575&lt;/td&gt;
&lt;td&gt;18213&lt;/td&gt;
&lt;td&gt;480&lt;/td&gt;
&lt;td&gt;822&lt;/td&gt;
&lt;td&gt;37.94375&lt;/td&gt;
&lt;td&gt;17.950339296730412&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;Armin Ronacher&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Creator of Flask; Building stuff at @getsentry; prev @fireteamltd, @splashdamage — writing and talking about system architecture, API design and lots of Python&lt;/td&gt;
&lt;td&gt;2008-02-01 23:12:59&lt;/td&gt;
&lt;td&gt;Austria&lt;/td&gt;
&lt;td&gt;[47.2000338, 13.199959]&lt;/td&gt;
&lt;td&gt;Vienna&lt;/td&gt;
&lt;td&gt;31774&lt;/td&gt;
&lt;td&gt;2059&lt;/td&gt;
&lt;td&gt;21801&lt;/td&gt;
&lt;td&gt;593&lt;/td&gt;
&lt;td&gt;941&lt;/td&gt;
&lt;td&gt;36.76391231028668&lt;/td&gt;
&lt;td&gt;8.470807784590775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;mike bayer&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2008-06-11 19:22:19&lt;/td&gt;
&lt;td&gt;northeast&lt;/td&gt;
&lt;td&gt;[41.7370229, -99.5873816]&lt;/td&gt;
&lt;td&gt;Quito&lt;/td&gt;
&lt;td&gt;14771&lt;/td&gt;
&lt;td&gt;1106&lt;/td&gt;
&lt;td&gt;3013&lt;/td&gt;
&lt;td&gt;209&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;td&gt;14.416267942583731&lt;/td&gt;
&lt;td&gt;4.080386740331492&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;Travis Oliphant&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;Creator of SciPy, NumPy, and Numba; founder and Director of Anaconda, Inc. Founder of NumFOCUS. CEO of Quansight&lt;/td&gt;
&lt;td&gt;2009-04-17 20:04:57&lt;/td&gt;
&lt;td&gt;Austin, TX&lt;/td&gt;
&lt;td&gt;[30.2711286, -97.7436995]&lt;/td&gt;
&lt;td&gt;Central Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;3875&lt;/td&gt;
&lt;td&gt;1506&lt;/td&gt;
&lt;td&gt;18052&lt;/td&gt;
&lt;td&gt;483&lt;/td&gt;
&lt;td&gt;746&lt;/td&gt;
&lt;td&gt;37.374741200828154&lt;/td&gt;
&lt;td&gt;1.1706948640483383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;benoît chesneau&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;web craftsman&lt;/td&gt;
&lt;td&gt;2007-02-13 16:53:37&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Paris&lt;/td&gt;
&lt;td&gt;27172&lt;/td&gt;
&lt;td&gt;548&lt;/td&gt;
&lt;td&gt;1971&lt;/td&gt;
&lt;td&gt;704&lt;/td&gt;
&lt;td&gt;247&lt;/td&gt;
&lt;td&gt;2.799715909090909&lt;/td&gt;
&lt;td&gt;6.620857699805068&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;Ask Solem&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;sound, noise, stream processing, distributed systems, data, open source python, etc. Works at @robinhoodapp 🌳🌳🌴🌵&lt;/td&gt;
&lt;td&gt;2007-11-11 18:56:33&lt;/td&gt;
&lt;td&gt;San Francisco, CA&lt;/td&gt;
&lt;td&gt;[45.4423543, -73.4373087]&lt;/td&gt;
&lt;td&gt;Pacific Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;3249&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;2509&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;td&gt;4.89083820662768&lt;/td&gt;
&lt;td&gt;0.8476389251239238&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;Wes McKinney&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Data science toolmaker at &lt;a href="https://t.co/YVn0VFqgj0" rel="noopener noreferrer"&gt;https://t.co/YVn0VFqgj0&lt;/a&gt;. Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis. Views my own&lt;/td&gt;
&lt;td&gt;2010-02-18 21:01:15&lt;/td&gt;
&lt;td&gt;New York, NY&lt;/td&gt;
&lt;td&gt;[40.7306458, -73.9866136]&lt;/td&gt;
&lt;td&gt;Eastern Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;7749&lt;/td&gt;
&lt;td&gt;3021&lt;/td&gt;
&lt;td&gt;33130&lt;/td&gt;
&lt;td&gt;784&lt;/td&gt;
&lt;td&gt;1277&lt;/td&gt;
&lt;td&gt;42.25765306122449&lt;/td&gt;
&lt;td&gt;2.5804195804195804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;David Cournapeau&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;Python and Stats geek. Former NumPy/Scipy core contributor. Lead the ML engineering team @cogentlabs  Occasional musings on economics/music/Japanese culture&lt;/td&gt;
&lt;td&gt;2010-06-14 03:17:21&lt;/td&gt;
&lt;td&gt;Tokyo, Japan&lt;/td&gt;
&lt;td&gt;[34.6968642, 139.4049033]&lt;/td&gt;
&lt;td&gt;Amsterdam&lt;/td&gt;
&lt;td&gt;15577&lt;/td&gt;
&lt;td&gt;505&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;427&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;1.873536299765808&lt;/td&gt;
&lt;td&gt;5.395566331832352&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tweets dataframe, on the contrary, needed some extra preprocessing.&lt;/p&gt;

&lt;p&gt;First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (roughly 3200, including retweets and replies). Therefore I decided to group the tweets by username and get the oldest tweet date for each user:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;2991&lt;/td&gt;
&lt;td&gt;2009-07-19 19:24:33&lt;/td&gt;
&lt;td&gt;2018-05-10 14:58:17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;3199&lt;/td&gt;
&lt;td&gt;2017-02-21 14:55:37&lt;/td&gt;
&lt;td&gt;2018-05-11 19:36:21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;3179&lt;/td&gt;
&lt;td&gt;2017-06-12 19:13:20&lt;/td&gt;
&lt;td&gt;2018-05-11 16:55:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;3200&lt;/td&gt;
&lt;td&gt;2017-08-26 20:48:35&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;3226&lt;/td&gt;
&lt;td&gt;2017-06-14 13:23:57&lt;/td&gt;
&lt;td&gt;2018-05-11 19:26:07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;3201&lt;/td&gt;
&lt;td&gt;2013-02-19 03:54:16&lt;/td&gt;
&lt;td&gt;2018-05-11 16:48:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;3205&lt;/td&gt;
&lt;td&gt;2014-01-26 17:45:07&lt;/td&gt;
&lt;td&gt;2018-05-03 14:51:29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;3214&lt;/td&gt;
&lt;td&gt;2015-05-05 13:35:38&lt;/td&gt;
&lt;td&gt;2018-05-11 14:17:02&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then I filtered out all the tweets older than the latest of these first dates (2017-08-26 20:48:35).&lt;br&gt;
Given these data, &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; drives the cut date because he tweets a lot more than some of the other users; but this way we at least get the same timeframe for all the users and can compare tweets from the same period.&lt;/p&gt;
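&lt;p&gt;The cut-date computation and the filter can be sketched in pandas as:&lt;/p&gt;

```python
import pandas as pd

def filter_to_common_window(tweets):
    """Keep only tweets newer than the latest 'oldest tweet' across all users (sketch)."""
    oldest_per_user = tweets.groupby("username")["created_at"].min()
    cut_date = oldest_per_user.max()  # 2017-08-26 20:48:35 in the data above
    return tweets[tweets["created_at"] >= cut_date]
```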

&lt;p&gt;After this filter, 25418-14470=10948 tweets remained, split as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;2017-09-11 18:36:19&lt;/td&gt;
&lt;td&gt;2018-05-10 14:58:17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1849&lt;/td&gt;
&lt;td&gt;2017-08-26 20:49:13&lt;/td&gt;
&lt;td&gt;2018-05-11 19:36:21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;1888&lt;/td&gt;
&lt;td&gt;2017-08-27 01:02:11&lt;/td&gt;
&lt;td&gt;2018-05-11 16:55:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;3200&lt;/td&gt;
&lt;td&gt;2017-08-26 20:48:35&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;2328&lt;/td&gt;
&lt;td&gt;2017-08-27 08:22:19&lt;/td&gt;
&lt;td&gt;2018-05-11 19:26:07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;443&lt;/td&gt;
&lt;td&gt;2017-08-26 22:57:23&lt;/td&gt;
&lt;td&gt;2018-05-11 16:48:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;591&lt;/td&gt;
&lt;td&gt;2017-08-28 18:07:08&lt;/td&gt;
&lt;td&gt;2018-05-03 14:51:29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;596&lt;/td&gt;
&lt;td&gt;2017-08-26 22:14:11&lt;/td&gt;
&lt;td&gt;2018-05-11 14:17:02&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Other preprocessing steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I parsed the source information using Beautiful Soup because it contained HTML entities&lt;/li&gt;
&lt;li&gt;I removed smart quotes from text&lt;/li&gt;
&lt;li&gt;I converted the text to lower case&lt;/li&gt;
&lt;li&gt;I removed URLs and numbers&lt;/li&gt;
&lt;/ul&gt;
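&lt;p&gt;A sketch of these text-cleaning steps (with the stdlib &lt;code&gt;html.unescape&lt;/code&gt; standing in for the Beautiful Soup entity parsing):&lt;/p&gt;

```python
import re
from html import unescape

# map the common "smart" quote characters to their plain equivalents
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def clean_text(text):
    """Apply the preprocessing steps listed above (sketch)."""
    text = unescape(text)                      # decode HTML entities
    text = text.translate(SMART_QUOTES)        # remove smart quotes
    text = text.lower()                        # lower case
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)           # remove numbers
    return re.sub(r"\s+", " ", text).strip()
```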

&lt;p&gt;I filtered out all the tweets whose text was empty after these steps (i.e. tweets containing only URLs, etc.), leaving 10948-125=10823 tweets.&lt;/p&gt;

&lt;p&gt;I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.&lt;/p&gt;
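&lt;p&gt;The "tweet type" column can be derived from each tweet's fields; the original code likely relies on the API's retweet/reply attributes, so what follows is only a simplified heuristic sketch (the length column is just &lt;code&gt;len(full_text)&lt;/code&gt;):&lt;/p&gt;

```python
def tweet_type(full_text, in_reply_to_user=None):
    """Classify a tweet as retweet, reply or standard (heuristic sketch)."""
    if full_text.startswith("RT @"):
        return "retweet"
    if in_reply_to_user is not None or full_text.startswith("@"):
        return "reply"
    return "standard"
```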

&lt;p&gt;Here are some columns from the final tweets dataframe (first 5 rows):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;created_at&lt;/th&gt;
&lt;th&gt;full_text&lt;/th&gt;
&lt;th&gt;text_clean&lt;/th&gt;
&lt;th&gt;lang&lt;/th&gt;
&lt;th&gt;favorite_count&lt;/th&gt;
&lt;th&gt;retweet_count&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;tweet_type&lt;/th&gt;
&lt;th&gt;tweet_len&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;td&gt;RT @IAmAru: Trio is @kennethreitz-approved. #PyCon2018&lt;/td&gt;
&lt;td&gt;rt trio is approved&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;retweet&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 21:55:35&lt;/td&gt;
&lt;td&gt;If you want to say hi, swing by the DigitalOcean booth during the opening reception! #PyCon2018&lt;/td&gt;
&lt;td&gt;if you want to say hi swing by the digitalocean booth during the opening reception&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;20.0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 20:34:20&lt;/td&gt;
&lt;td&gt;@dbinoj 24x 1x dynos right now&lt;/td&gt;
&lt;td&gt;x x dynos right now&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;reply&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 20:11:37&lt;/td&gt;
&lt;td&gt;Swing by the @IndyPy booth for your chance to win a signed copy of The Hitchhiker's Guide to Python! ✨🍰✨ &lt;a href="https://t.co/CZhd2If5s0" rel="noopener noreferrer"&gt;https://t.co/CZhd2If5s0&lt;/a&gt; &lt;a href="https://t.co/3kUaqu5TMX" rel="noopener noreferrer"&gt;https://t.co/3kUaqu5TMX&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;swing by the booth for your chance to win a signed copy of the hitchhiker s guide to python&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;IFTTT&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 13:53:31&lt;/td&gt;
&lt;td&gt;Let's do this &lt;a href="https://t.co/6xLCE4WCqA" rel="noopener noreferrer"&gt;https://t.co/6xLCE4WCqA&lt;/a&gt; &lt;a href="https://t.co/ERiMmffe8L" rel="noopener noreferrer"&gt;https://t.co/ERiMmffe8L&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;let s do this&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;IFTTT&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  Exploratory Data Analysis
&lt;/h1&gt;

&lt;p&gt;The users dataframe by itself already yields some insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn&lt;/li&gt;
&lt;li&gt;@wesmckinn, &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; and @mitsuhiko are the most popular accounts in the list (according to my "popularity" indicator): &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1pwqnv2m23tmnwxajg10.png" alt="popindicator"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; wrote since the creation of his account at least twice the number of tweets per day compared to the other devs in the list: &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwjxpj95ew9r6js55hz3b.png" alt="tweetsperday"&gt;
&lt;/li&gt;
&lt;li&gt;Most of the accounts in the list live in the US; I used &lt;a href="https://github.com/python-visualization/folium" rel="noopener noreferrer"&gt;Folium&lt;/a&gt; to create a map showing the locations: &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg8y6ghlaltkquc8hdk7v.png" alt="map"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tweets dataframe, instead, needs some manipulation before we can extract good insights.&lt;/p&gt;

&lt;p&gt;First of all let's check the tweet "style" of each account. From the following chart we can see for example that @cournape is retweeting a lot, while @mitsuhiko is replying a lot: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnv1czyjrk6q9ipoxdxfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnv1czyjrk6q9ipoxdxfj.png" alt="tweettype"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also group by username and tweet type, and chart the mean tweet length. &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;, for example, writes shorter replies than standard tweets, while &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; writes longer tweets than the others (exceeding the 140-character limit): &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4jm1m8gt31f8eg1o93mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4jm1m8gt31f8eg1o93mn.png" alt="tweetlen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, now let's filter out the retweets and focus on the machine-detected language of standard tweets and replies. The five most common languages are English, German, French, undefined and a rather odd "tagalog" (ISO 639-1 code "tl", perhaps an auto-detection error?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets in French: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx15a61kev9e9r9p1zn2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx15a61kev9e9r9p1zn2q.png" alt="lang"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, let's select only the tweets in English or with an undefined language: all the following charts consider English tweets and replies only (but you can obviously tune the analysis differently).&lt;br&gt;
Let's group by username and get statistics about the number of favorites/retweets per user:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;favorite_count count&lt;/th&gt;
&lt;th&gt;favorite_count max&lt;/th&gt;
&lt;th&gt;favorite_count mean&lt;/th&gt;
&lt;th&gt;favorite_count std&lt;/th&gt;
&lt;th&gt;retweet_count max&lt;/th&gt;
&lt;th&gt;retweet_count mean&lt;/th&gt;
&lt;th&gt;retweet_count std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;41.0&lt;/td&gt;
&lt;td&gt;1.608695652173913&lt;/td&gt;
&lt;td&gt;6.111840097055933&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;0.10869565217391304&lt;/td&gt;
&lt;td&gt;0.48204475908203187&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1009&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;0.6531219028741329&lt;/td&gt;
&lt;td&gt;1.8313280878865186&lt;/td&gt;
&lt;td&gt;17.0&lt;/td&gt;
&lt;td&gt;0.13676907829534193&lt;/td&gt;
&lt;td&gt;0.7644934696088941&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;214&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;1.2757009345794392&lt;/td&gt;
&lt;td&gt;4.449367547428712&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;0.205607476635514&lt;/td&gt;
&lt;td&gt;1.7481758044670788&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2637&lt;/td&gt;
&lt;td&gt;3932.0&lt;/td&gt;
&lt;td&gt;10.062571103526736&lt;/td&gt;
&lt;td&gt;82.09998594317476&lt;/td&gt;
&lt;td&gt;2573.0&lt;/td&gt;
&lt;td&gt;2.620781190747061&lt;/td&gt;
&lt;td&gt;50.79602602503255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;1547&lt;/td&gt;
&lt;td&gt;752.0&lt;/td&gt;
&lt;td&gt;9.657401422107304&lt;/td&gt;
&lt;td&gt;41.06463543974671&lt;/td&gt;
&lt;td&gt;220.0&lt;/td&gt;
&lt;td&gt;1.8526179702650292&lt;/td&gt;
&lt;td&gt;9.932970595417615&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;186&lt;/td&gt;
&lt;td&gt;808.0&lt;/td&gt;
&lt;td&gt;26.080645161290324&lt;/td&gt;
&lt;td&gt;69.54002504187612&lt;/td&gt;
&lt;td&gt;134.0&lt;/td&gt;
&lt;td&gt;7.806451612903226&lt;/td&gt;
&lt;td&gt;17.085639972995896&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;433&lt;/td&gt;
&lt;td&gt;2081.0&lt;/td&gt;
&lt;td&gt;45.750577367205544&lt;/td&gt;
&lt;td&gt;142.2699008271913&lt;/td&gt;
&lt;td&gt;695.0&lt;/td&gt;
&lt;td&gt;12.270207852193995&lt;/td&gt;
&lt;td&gt;48.083342617014644&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;439&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;2.173120728929385&lt;/td&gt;
&lt;td&gt;6.417876507767421&lt;/td&gt;
&lt;td&gt;28.0&lt;/td&gt;
&lt;td&gt;0.44874715261959&lt;/td&gt;
&lt;td&gt;1.896040581838119&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
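&lt;p&gt;The table above is a plain pandas groupby aggregation; a sketch (column names assumed):&lt;/p&gt;

```python
import pandas as pd

def engagement_stats(tweets):
    """Per-user count/max/mean/std of favorite and retweet counts (sketch)."""
    return tweets.groupby("username").agg(
        favorite_count_count=("favorite_count", "count"),
        favorite_count_max=("favorite_count", "max"),
        favorite_count_mean=("favorite_count", "mean"),
        favorite_count_std=("favorite_count", "std"),
        retweet_count_max=("retweet_count", "max"),
        retweet_count_mean=("retweet_count", "mean"),
        retweet_count_std=("retweet_count", "std"),
    )
```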

&lt;p&gt;From this table we can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; has the most retweeted and favorited tweet in the dataframe. Here's the tweet: &lt;iframe class="tweet-embed" id="tweet-981547972239417345-95" src="https://platform.twitter.com/embed/Tweet.html?id=981547972239417345"&gt;
&lt;/iframe&gt;




&lt;/li&gt;
&lt;li&gt;@wesmckinn has the second most retweeted and favorited tweet in the dataframe. Here's the tweet: &lt;iframe class="tweet-embed" id="tweet-986998077767716865-492" src="https://platform.twitter.com/embed/Tweet.html?id=986998077767716865"&gt;
&lt;/iframe&gt;




&lt;/li&gt;
&lt;li&gt;@wesmckinn has the highest mean values for both retweet count and favorite count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since @wesmckinn also has the highest followers count, how do these stats change if we normalize the dataframe by followers count?&lt;br&gt;
Obviously a tweet can be favorited/retweeted by non-followers too, but this normalization should produce fairer results, because the higher the followers count, the more a tweet is likely to be viewed.&lt;/p&gt;
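&lt;p&gt;A sketch of this normalization (column names assumed; counts are expressed as a percentage of the author's followers):&lt;/p&gt;

```python
import pandas as pd

def normalize_by_followers(tweets, users):
    """Express favorite/retweet counts as a percentage of the author's followers (sketch)."""
    followers = users.set_index("screen_name")["followers_count"]
    t = tweets.copy()
    t["favorite_count_perc"] = t["favorite_count"] / t["username"].map(followers) * 100
    t["retweet_count_perc"] = t["retweet_count"] / t["username"].map(followers) * 100
    return t
```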

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;favorite_count perc count&lt;/th&gt;
&lt;th&gt;favorite_count perc max&lt;/th&gt;
&lt;th&gt;favorite_count perc mean&lt;/th&gt;
&lt;th&gt;favorite_count perc std&lt;/th&gt;
&lt;th&gt;retweet_count perc max&lt;/th&gt;
&lt;th&gt;retweet_count perc mean&lt;/th&gt;
&lt;th&gt;retweet_count perc std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;1.634117178158629&lt;/td&gt;
&lt;td&gt;0.06411700486942658&lt;/td&gt;
&lt;td&gt;0.243596655920922&lt;/td&gt;
&lt;td&gt;0.11956954962136308&lt;/td&gt;
&lt;td&gt;0.0043322300587450395&lt;/td&gt;
&lt;td&gt;0.019212624913592345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1009&lt;/td&gt;
&lt;td&gt;1.5220700152207&lt;/td&gt;
&lt;td&gt;0.0331365754882869&lt;/td&gt;
&lt;td&gt;0.09291365235345102&lt;/td&gt;
&lt;td&gt;0.8625063419583967&lt;/td&gt;
&lt;td&gt;0.0069390704360904&lt;/td&gt;
&lt;td&gt;0.038787086230791176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;214&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;0.1594626168224299&lt;/td&gt;
&lt;td&gt;0.556170943428589&lt;/td&gt;
&lt;td&gt;3.125&lt;/td&gt;
&lt;td&gt;0.02570093457943925&lt;/td&gt;
&lt;td&gt;0.21852197555838485&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2637&lt;/td&gt;
&lt;td&gt;21.588974908032725&lt;/td&gt;
&lt;td&gt;0.055249388368344046&lt;/td&gt;
&lt;td&gt;0.4507768404061633&lt;/td&gt;
&lt;td&gt;14.127271728984791&lt;/td&gt;
&lt;td&gt;0.014389618353632417&lt;/td&gt;
&lt;td&gt;0.27889982992935036&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;1547&lt;/td&gt;
&lt;td&gt;3.4493830558231275&lt;/td&gt;
&lt;td&gt;0.04429797450624903&lt;/td&gt;
&lt;td&gt;0.18836124691411713&lt;/td&gt;
&lt;td&gt;1.009128021650383&lt;/td&gt;
&lt;td&gt;0.008497857760034094&lt;/td&gt;
&lt;td&gt;0.04556199530029643&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;186&lt;/td&gt;
&lt;td&gt;4.475958342565921&lt;/td&gt;
&lt;td&gt;0.14447510060541938&lt;/td&gt;
&lt;td&gt;0.38522061290647097&lt;/td&gt;
&lt;td&gt;0.7423000221582097&lt;/td&gt;
&lt;td&gt;0.043244247800261586&lt;/td&gt;
&lt;td&gt;0.09464679798911974&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;433&lt;/td&gt;
&lt;td&gt;6.281316027769393&lt;/td&gt;
&lt;td&gt;0.138094106149126&lt;/td&gt;
&lt;td&gt;0.42942922072801465&lt;/td&gt;
&lt;td&gt;2.0977965590099608&lt;/td&gt;
&lt;td&gt;0.037036546490172004&lt;/td&gt;
&lt;td&gt;0.1451353535074393&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;439&lt;/td&gt;
&lt;td&gt;2.8211085297046132&lt;/td&gt;
&lt;td&gt;0.07212481675836008&lt;/td&gt;
&lt;td&gt;0.21300619010180605&lt;/td&gt;
&lt;td&gt;0.9293063391968137&lt;/td&gt;
&lt;td&gt;0.01489369905806803&lt;/td&gt;
&lt;td&gt;0.06292866185987782&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After the normalization we can see that @cournape and &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; achieve the highest mean values in terms of retweets and favorites.&lt;/p&gt;

&lt;p&gt;We can also track how the monthly number of tweets changes over time, per user. The following chart shows, for example, that &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; tweeted a lot in September 2017 (more than 800 tweets): &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk9lx56uecyupdpwjcnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk9lx56uecyupdpwjcnr.png" alt="monthly"&gt;&lt;/a&gt;&lt;/p&gt;
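&lt;p&gt;The monthly aggregation behind this chart can be sketched with a Pandas groupby; the dataframe, column names and counts below are illustrative stand-ins, not the real data:&lt;/p&gt;

```python
import pandas as pd

# Toy tweet dataframe: one row per tweet (column names are assumptions).
tweets = pd.DataFrame({
    "username": ["kennethreitz", "kennethreitz", "mitsuhiko"],
    "created_at": pd.to_datetime(
        ["2017-09-01", "2017-09-15", "2017-09-03"]),
})

# Count tweets per user per calendar month.
monthly = (tweets
           .groupby(["username", tweets["created_at"].dt.to_period("M")])
           .size()
           .rename("n_tweets")
           .reset_index())
print(monthly)
```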

&lt;p&gt;Or we can even see which tools are used the most to tweet, per user: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffj4jd4618hbsithnj6j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffj4jd4618hbsithnj6j4.png" alt="sources"&gt;&lt;/a&gt;&lt;br&gt;
I grouped many of the less-used tools under "Other" (Tweetbot for iOS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).&lt;/p&gt;

&lt;p&gt;Finally, we can build a kind of punchcard chart for each user, showing an aggregation of tweets dates by day of the week and hours of the day: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftki4ig3kewzvtz21mev5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftki4ig3kewzvtz21mev5.png" alt="punch"&gt;&lt;/a&gt;&lt;/p&gt;
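&lt;p&gt;The punchcard aggregation boils down to a pivot over day of the week and hour of the day; here is a minimal sketch on toy timestamps (the real ones came from the Twitter APIs):&lt;/p&gt;

```python
import pandas as pd

# Toy timestamps for one user (2018-05-14 is a Monday).
df = pd.DataFrame({"created_at": pd.to_datetime(
    ["2018-05-14 09:30", "2018-05-14 09:45", "2018-05-15 22:10"])})

# Count tweets per (day of week, hour of day) cell, punchcard-style.
punch = (df.assign(weekday=df["created_at"].dt.day_name(),
                   hour=df["created_at"].dt.hour)
           .pivot_table(index="weekday", columns="hour",
                        aggfunc="size", fill_value=0))
print(punch)
```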

&lt;h1&gt;
  
  
  Topics
&lt;/h1&gt;

&lt;p&gt;But what are the devs in the list talking about?&lt;/p&gt;

&lt;p&gt;Let's start with a simple visualization: a word cloud.&lt;br&gt;
After some basic preprocessing of the text of standard tweets only (tokenization, POS tagging, stop-word removal, bigrams, etc.), I grouped the tweets by username and extracted the most common words for each one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;tweet count&lt;/th&gt;
&lt;th&gt;most_common&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;@asksol&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;[('python', 3), ('enjoy', 1), ('seeing', 1), ('process', 1), ('handle', 1)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/benoitc"&gt;@benoitc&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;488&lt;/td&gt;
&lt;td&gt;[('like', 40), ('erlang', 33), ('use', 31), ('code', 30), ('people', 30)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@cournape&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;[('japan', 8), ('japanese', 6), ('#pyconjp', 6), ('shibuya', 4), ('python', 4)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1109&lt;/td&gt;
&lt;td&gt;[('pipenv', 157), ('python', 84), ('new', 77), ('requests', 64), ('released', 53)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@mitsuhiko&lt;/td&gt;
&lt;td&gt;399&lt;/td&gt;
&lt;td&gt;[('rust', 53), ('like', 36), ('people', 27), ('new', 25), ('way', 20)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;113&lt;/td&gt;
&lt;td&gt;[('#pydata', 39), ('#python', 36), ('@anacondainc', 18), ('great', 18), ('new', 15)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@wesmckinn&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;td&gt;[('@apachearrow', 32), ('data', 21), ('pandas', 16), ('python', 12), ('new', 10)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@zzzeek&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;td&gt;[('like', 14), ('years', 11), ('python', 10), ('time', 10), ('use', 9)]&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
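&lt;p&gt;The frequency counts in the table can be reproduced with a plain collections.Counter; the tweet texts and the stop-word list below are toy stand-ins for the real preprocessing pipeline:&lt;/p&gt;

```python
from collections import Counter

# Toy corpus: a few invented tweet texts per user (illustrative only).
tweets_by_user = {
    "@asksol": ["enjoy python", "python process", "python handle"],
}

STOPWORDS = {"the", "a", "an", "and", "to", "of"}

most_common = {}
for user, texts in tweets_by_user.items():
    # Lowercase, split into tokens and drop stop words.
    tokens = [w for t in texts for w in t.lower().split()
              if w not in STOPWORDS]
    most_common[user] = Counter(tokens).most_common(5)

print(most_common["@asksol"])  # ('python', 3) comes first
```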

&lt;p&gt;Then I created a word cloud for each username using &lt;a href="https://github.com/amueller/word_cloud" rel="noopener noreferrer"&gt;word_cloud&lt;/a&gt;. All of them talk about Python or their own libraries (like pipenv, pandas, sqlalchemy, etc.); we can also spot some other programming languages, like Erlang and Rust.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc8xc4mjy1pvkajpgowl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc8xc4mjy1pvkajpgowl5.png" alt="cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to identify actual topics, using an LDA model from Gensim. I again used standard tweets only, from the two accounts with the highest number of tweets (&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; and @mitsuhiko), and performed the same preprocessing used for the word cloud generation.&lt;br&gt;
I ran the model over two varying parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number of topics (ranging between 2 and 14)&lt;/li&gt;
&lt;li&gt;the alpha value (with possible values 0.2, 0.3 and 0.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I chose the best solution using the Gensim built-in Coherence Model with c_v as the metric: the optimal model is the one with 9 topics and alpha=0.2. &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Funan0uxh9o8smessm88w.png" alt="coherence"&gt;&lt;/p&gt;

&lt;p&gt;Here are the topics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;topic number&lt;/th&gt;
&lt;th&gt;top words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;(0, '0.125*"way" + 0.094*"favorite" + 0.076*"feature" + 0.067*"oh" + 0.063*"think"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;(1, '0.140*"pipenv" + 0.124*"released" + 0.098*"pipenv_released" + 0.082*"want" + 0.073*"code"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;(2, '0.271*"python" + 0.132*"today" + 0.093*"people" + 0.039*"month" + 0.036*"kenneth"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;(3, '0.183*"requests" + 0.134*"love" + 0.081*"work" + 0.071*"html" + 0.057*"github"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;(4, '0.164*"like" + 0.100*"rust" + 0.098*"time" + 0.058*"day" + 0.047*"things"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;(5, '0.297*"pipenv" + 0.062*"support" + 0.058*"includes" + 0.045*"right" + 0.044*"making"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;(6, '0.271*"new" + 0.076*"getting" + 0.075*"better" + 0.058*"use" + 0.049*"photos"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;(7, '0.161*"good" + 0.097*"going" + 0.092*"got" + 0.067*"happy" + 0.058*"current"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;(8, '0.114*"great" + 0.091*"ipad" + 0.076*"finally" + 0.066*"heroku" + 0.057*"working"')&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can check the intertopic distance map and the most relevant terms for each topic using &lt;a href="https://github.com/bmabey/pyLDAvis" rel="noopener noreferrer"&gt;pyLDAvis&lt;/a&gt;: you can explore the interactive data in the Jupyter notebook in my GitHub account. &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fadzpp304ongpmq6e4gxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fadzpp304ongpmq6e4gxe.png" alt="ldavis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusions and future steps
&lt;/h1&gt;

&lt;p&gt;In this post I showed how to get data from the Twitter APIs and how to perform some simple analysis in order to learn, in advance, some characteristics of an account (e.g. tweet style, statistics about tweets, topics).&lt;br&gt;
Your mileage may vary depending on the initial account list and on the configuration of the algorithms (especially for topic detection).&lt;/p&gt;

&lt;p&gt;I uploaded a &lt;a href="https://github.com/eric-bonfadini/python-notebooks/blob/master/blog_posts/You_can_tell_a_man_by_his_tweets.ipynb" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; to my GitHub, with the snippets I used to create this blog post.&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve preprocessing using lemmatization and stemming&lt;/li&gt;
&lt;li&gt;Try different algorithms for topic detection using Gensim (e.g. AuthorTopicModel or LdaMallet) or scikit-learn&lt;/li&gt;
&lt;li&gt;Add sentiment analysis&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataviz</category>
      <category>dataanalysis</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Finding my new favorite song on Spotify</title>
      <dc:creator>Eric Bonfadini</dc:creator>
      <pubDate>Thu, 18 Jan 2018 22:14:11 +0000</pubDate>
      <link>https://dev.to/ericbonfadini/finding-my-new-favorite-song-on-spotify-4lgc</link>
      <guid>https://dev.to/ericbonfadini/finding-my-new-favorite-song-on-spotify-4lgc</guid>
      <description>&lt;p&gt;While I'm developing (or just while I'm commuting to work) I usually love to hear some rock music.&lt;/p&gt;

&lt;p&gt;I created some playlists on Spotify, but lately I've been sticking to the same one, containing my favorite "Indie Rock" songs.&lt;br&gt;
This playlist is made up of roughly 45 songs I discovered over the years in several ways.&lt;/p&gt;

&lt;p&gt;Since I was starting to get bored of always listening to the same songs, last weekend I decided to analyze my playlist using the Spotify APIs, in order to discover insights and hopefully find some new tunes to add.&lt;/p&gt;

&lt;p&gt;Here's what I did in roughly 300 lines of Python 3 code (boilerplate included).&lt;br&gt;
You can find a Jupyter notebook with the code I used on &lt;a href="https://github.com/eric-bonfadini/python-notebooks/blob/master/blog_posts/Finding_my_new_favorite_song_on_Spotify.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the environment
&lt;/h1&gt;

&lt;p&gt;For my analysis I set up a Python 3 virtual environment with the following libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; for data analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; for data visualization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plamere/spotipy" rel="noopener noreferrer"&gt;Spotipy&lt;/a&gt; for interaction with Spotify APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to access the Spotify APIs I &lt;a href="https://developer.spotify.com/my-applications/#!/applications" rel="noopener noreferrer"&gt;registered my app&lt;/a&gt; and then provided the spotipy library with the client_id, the client_secret and a redirect URL.&lt;/p&gt;

&lt;h1&gt;
  
  
  Analyzing my playlist tracks
&lt;/h1&gt;

&lt;p&gt;After a first API call to get my &lt;a href="https://developer.spotify.com/web-api/get-list-users-playlists/" rel="noopener noreferrer"&gt;playlist id&lt;/a&gt;, I got all the tracks of my playlist with another &lt;a href="https://developer.spotify.com/web-api/get-playlists-tracks/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; along with some basic information like: song id, song name, artist id, artist name, album name, song popularity.&lt;/p&gt;

&lt;p&gt;With another &lt;a href="https://developer.spotify.com/web-api/get-several-artists/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; I got some extra information about the artists in my playlist, like genres and artist_popularity.&lt;/p&gt;

&lt;p&gt;Finally with another &lt;a href="https://developer.spotify.com/web-api/get-several-audio-features/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; I got some insightful information about my tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;duration_ms&lt;/em&gt;: the duration of the track in milliseconds;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;acousticness&lt;/em&gt;: describes the acousticness of a song (1 =&amp;gt; high confidence the track is acoustic). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;danceability&lt;/em&gt;: describes the danceability of a song (1 =&amp;gt; high confidence the track is danceable). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;energy&lt;/em&gt;: it's a perceptual measure of intensity and activity (e.g. death metal has high energy while classical music has low energy). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;instrumentalness&lt;/em&gt;: predicts whether a track contains no vocals (1 =&amp;gt; high confidence the track has no vocals). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;liveness&lt;/em&gt;: detects the presence of an audience in the recording (1 =&amp;gt; high confidence the track is live). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;loudness&lt;/em&gt;: detects the overall loudness of a track in decibels. It ranges from -60dB to 0dB;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;valence&lt;/em&gt;: describes the musical positiveness conveyed by a track (1 =&amp;gt; more positive, 0 =&amp;gt; more negative). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;speechiness&lt;/em&gt;: detects the presence of spoken words in a track (1 =&amp;gt; speech, 0 =&amp;gt; non speech, just music). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;key&lt;/em&gt;: describes the &lt;a href="https://en.wikipedia.org/wiki/Pitch_class" rel="noopener noreferrer"&gt;pitch class notation&lt;/a&gt; of the song. It ranges from 0 to 11;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;mode&lt;/em&gt;: the modality of a track (0 =&amp;gt; minor, 1 =&amp;gt; major);&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;tempo&lt;/em&gt;: the overall estimated tempo of a track in beats per minute (BPM);&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;time_signature&lt;/em&gt;: an estimated overall time signature of a track (how many beats are in each bar or measure).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results of all these calls were put into Pandas dataframes, to simplify the data analysis, and then merged into a single dataframe using artist IDs and track IDs.&lt;br&gt;
Some values (like song/artist popularity and tempo) were normalized.&lt;/p&gt;
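&lt;p&gt;A minimal sketch of the merge-and-normalize step, with made-up frames standing in for the three API responses (the column names are my assumptions):&lt;/p&gt;

```python
import pandas as pd

# Made-up frames mirroring the three API calls.
tracks = pd.DataFrame({"track_id": ["t1", "t2"],
                       "artist_id": ["a1", "a1"],
                       "song_popularity": [30, 90]})
artists = pd.DataFrame({"artist_id": ["a1"], "artist_popularity": [80]})
features = pd.DataFrame({"track_id": ["t1", "t2"],
                         "tempo": [120.0, 150.0]})

# Merge on artist and track IDs into a single dataframe.
df = (tracks.merge(artists, on="artist_id")
            .merge(features, on="track_id"))

# Min-max normalization of selected columns to [0, 1].
for col in ("song_popularity", "tempo"):
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)
print(df)
```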

&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;After ensuring that the artists in my playlist all have "Indie Rock" among their genres, I saw (using shape, info and describe on the full dataframe) that my playlist consisted of 46 entries, all with non-null values, with these statistics:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygmh86o65684e8bsmlxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygmh86o65684e8bsmlxc.png" alt="describe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I created some charts (distplot, countplot, boxplot) using Seaborn:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7irgi3zat6a6ebvpcdfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7irgi3zat6a6ebvpcdfb.png" alt="mix"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2gu8hhqlhgj38taleoxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2gu8hhqlhgj38taleoxe.png" alt="boxplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All these graphs show that I like songs with low acousticness/instrumentalness/speechiness, high energy/loudness/tempo, high artist popularity and a duration of roughly 200 seconds.&lt;br&gt;
Valence and song popularity span a wide range, meaning that my playlist contains both well-known and obscure songs, and both positive and negative ones.&lt;/p&gt;

&lt;p&gt;But how does my playlist compare with the Indie Rock genre as a whole?&lt;/p&gt;

&lt;h1&gt;
  
  
  Comparing my playlist with a sample of the genre
&lt;/h1&gt;

&lt;p&gt;I made some calls to the &lt;a href="https://developer.spotify.com/web-api/search-item/" rel="noopener noreferrer"&gt;search API&lt;/a&gt; with 'genre:"Indie Rock"' as the query and 'type=track' to get a sample of the Indie Rock genre (5000 songs in total).&lt;br&gt;
This API also offers some nice qualifiers, like 'tag:hipster' (to get only albums with the lowest 10% popularity) or 'year:1980-2020' (to get only tracks released in a specific year range).&lt;/p&gt;
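&lt;p&gt;Since the search endpoint returns at most 50 items per call, collecting a 5000-song sample means paging with the offset parameter. A sketch (sp is an authenticated spotipy client, and the function name is mine):&lt;/p&gt;

```python
def fetch_sample(sp, query='genre:"Indie Rock"', total=5000, page=50):
    """Page through the Spotify search API, `page` tracks per call."""
    tracks = []
    for offset in range(0, total, page):
        res = sp.search(q=query, type="track", limit=page, offset=offset)
        tracks.extend(res["tracks"]["items"])
    return tracks
```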

&lt;p&gt;Then I repeated the same analysis on both my current playlist and the Indie Rock sample and I got the following charts:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fekoed4saw763kw2pfeog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fekoed4saw763kw2pfeog.png" alt="mix2"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F77s6a082mbbgx13kqvhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F77s6a082mbbgx13kqvhy.png" alt="key2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graphs show that my playlist differs from the 5000 Indie Rock songs because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I like shorter songs&lt;/li&gt;
&lt;li&gt;I like songs with higher energy/loudness/tempo&lt;/li&gt;
&lt;li&gt;I don't like songs with too negative a mood (valence &amp;gt; 0.3)&lt;/li&gt;
&lt;li&gt;I like songs mostly in key (0, 1, 6, 9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boxplot confirms the same insights and I agree with the outcome of this analysis.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F14b0bgzio79c37tcjvdt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F14b0bgzio79c37tcjvdt.png" alt="box2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating a new playlist with songs I potentially like
&lt;/h1&gt;

&lt;p&gt;Using these insights, I applied some filters to the 5000 songs dataframe in order to keep only tracks I potentially like; for each step I logged the dropped songs to double check the filter behavior.&lt;/p&gt;

&lt;p&gt;The first filter I applied was removing songs I already had in my original playlist, obviously.&lt;/p&gt;

&lt;p&gt;The other filters were: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acousticness &amp;lt; 0.1&lt;/li&gt;
&lt;li&gt;energy &amp;gt; 0.75&lt;/li&gt;
&lt;li&gt;loudness &amp;gt; -7dB&lt;/li&gt;
&lt;li&gt;valence between 0.3 and 0.9&lt;/li&gt;
&lt;li&gt;tempo &amp;gt; 120&lt;/li&gt;
&lt;li&gt;key in (0, 1, 6, 9)&lt;/li&gt;
&lt;li&gt;duration between the 10th and 90th percentiles of the original playlist's durations (178s and 280s)&lt;/li&gt;
&lt;/ul&gt;
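&lt;p&gt;Applied with Pandas, the filters above become successive boolean selections; the two rows here are invented, one built to pass and one to fail:&lt;/p&gt;

```python
import pandas as pd

# Two invented tracks standing in for the 5000-song dataframe.
df = pd.DataFrame({
    "acousticness": [0.05, 0.50],
    "energy": [0.90, 0.60],
    "loudness": [-5.0, -12.0],
    "valence": [0.60, 0.10],
    "tempo": [140.0, 90.0],
    "key": [0, 3],
    "duration_s": [200.0, 320.0],
})

# Each line mirrors one of the thresholds listed above.
liked = df[df["acousticness"].lt(0.1)]
liked = liked[liked["energy"].gt(0.75)]
liked = liked[liked["loudness"].gt(-7)]
liked = liked[liked["valence"].between(0.3, 0.9)]
liked = liked[liked["tempo"].gt(120)]
liked = liked[liked["key"].isin([0, 1, 6, 9])]
liked = liked[liked["duration_s"].between(178, 280)]
print(len(liked))  # only the first row survives
```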

&lt;p&gt;In the end I got a dataframe with 220 tracks, and I created a new playlist using an &lt;a href="https://developer.spotify.com/web-api/add-tracks-to-playlist/" rel="noopener noreferrer"&gt;API call&lt;/a&gt;.&lt;/p&gt;
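&lt;p&gt;The playlist-creation step can be sketched with spotipy's user_playlist_create and user_playlist_add_tracks; the add-tracks endpoint accepts at most 100 track IDs per request, hence the batching (function and variable names are mine):&lt;/p&gt;

```python
def fill_playlist(sp, user, name, track_ids):
    """Create a private playlist and add tracks in batches of 100."""
    playlist = sp.user_playlist_create(user, name, public=False)
    for i in range(0, len(track_ids), 100):
        sp.user_playlist_add_tracks(user, playlist["id"],
                                    track_ids[i:i + 100])
    return playlist["id"]
```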

&lt;h1&gt;
  
  
  Conclusions and future steps
&lt;/h1&gt;

&lt;p&gt;After a few days listening to the new playlist, I'm quite happy with the results and I'm already promoting some tracks to my original playlist.&lt;/p&gt;

&lt;p&gt;This "recommendation" method is really simple and probably works well only in this specific use case (i.e. it's a well definable subset of a specific genre).&lt;br&gt;
The standard recommendation method of Spotify is obviously much better because, apart from audio analysis, it uses also a mix of &lt;em&gt;Collaborative Filtering models&lt;/em&gt; (analyzing your behavior and others’ behavior) and &lt;em&gt;Natural Language Processing (NLP) models&lt;/em&gt; (analyzing text of the songs).&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the analysis again after a few months, in order to take into account new entries in my playlist and new songs in the 5000-song sample&lt;/li&gt;
&lt;li&gt;Enrich the information I already got with something new, using the Spotify APIs (e.g. the &lt;a href="https://developer.spotify.com/web-api/get-audio-analysis/" rel="noopener noreferrer"&gt;Audio Analysis endpoint&lt;/a&gt;) or other services. It would be nice, as an example, to detect musical instruments in a track (guitars anyone??) or the presence of features like distortion, riffs, etc.&lt;/li&gt;
&lt;li&gt;Use the preview sample from the API to tag manually what I like/dislike on a subset of the 5000 songs and then run some ML algorithms in order to classify the music I like&lt;/li&gt;
&lt;li&gt;Analyze my playlist more deeply using some ML algorithms (e.g. clustering my tracks)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataviz</category>
      <category>dataanalysis</category>
      <category>python</category>
      <category>api</category>
    </item>
  </channel>
</rss>
