<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eric Bonfadini</title>
    <description>The latest articles on DEV Community by Eric Bonfadini (@ericbonfadini).</description>
    <link>https://dev.to/ericbonfadini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F54228%2Fcbd95690-2e94-4104-9134-a5235d2c045e.png</url>
      <title>DEV Community: Eric Bonfadini</title>
      <link>https://dev.to/ericbonfadini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ericbonfadini"/>
    <language>en</language>
    <item>
      <title>You can tell a man by his tweets</title>
      <dc:creator>Eric Bonfadini</dc:creator>
      <pubDate>Mon, 14 May 2018 22:51:55 +0000</pubDate>
      <link>https://dev.to/ericbonfadini/you-can-tell-a-man-by-his-tweets-2l9l</link>
      <guid>https://dev.to/ericbonfadini/you-can-tell-a-man-by-his-tweets-2l9l</guid>
      <description>&lt;p&gt;Some time ago I read a tweet of Kenneth Reitz, a very well known Python developer I follow on Twitter, asking:&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-952553176925958145-292" src="https://platform.twitter.com/embed/Tweet.html?id=952553176925958145"&gt;
&lt;/iframe&gt;




&lt;br&gt;
Starting from this, I decided to analyze tweets from some pretty popular Python devs, to understand how they use Twitter, what they tweet about, and what can be gathered using data from the Twitter APIs only.&lt;br&gt;
You can obviously apply the same analysis to a different list of Twitter accounts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the environment
&lt;/h1&gt;

&lt;p&gt;For my analysis I set up a Python 3.6 virtual environment with the following main libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/tweepy/tweepy" rel="noopener noreferrer"&gt;Tweepy&lt;/a&gt; for interaction with Twitter APIs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; for data analysis and visualization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pypi.org/project/beautifulsoup4/" rel="noopener noreferrer"&gt;Beautiful Soup 4&lt;/a&gt;, &lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK&lt;/a&gt; and &lt;a href="https://radimrehurek.com/gensim/" rel="noopener noreferrer"&gt;Gensim&lt;/a&gt; for processing text data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some extra libraries will be introduced later on, along with explanations of the corresponding steps.&lt;/p&gt;

&lt;p&gt;In order to access the Twitter APIs I &lt;a href="https://apps.twitter.com/" rel="noopener noreferrer"&gt;registered my app&lt;/a&gt; and then provided tweepy with the consumer_key, consumer_secret, access_token and access_token_secret.&lt;/p&gt;
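&lt;p&gt;As a minimal sketch (tweepy's standard OAuth flow; the credential values are placeholders you get from the app registration page):&lt;/p&gt;

```python
def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated tweepy API client (sketch)."""
    import tweepy  # imported inside so the sketch stays importable without tweepy installed

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes tweepy sleep through rate limits instead of raising
    return tweepy.API(auth, wait_on_rate_limit=True)
```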

&lt;p&gt;We're now ready to get some real data from Twitter!&lt;/p&gt;

&lt;h1&gt;
  
  
  Choosing a list of Twitter accounts
&lt;/h1&gt;

&lt;p&gt;First of all, I chose a list of 8 popular Python devs, starting from &lt;a href="https://pythonwheels.com/" rel="noopener noreferrer"&gt;the top 360 most-downloaded packages on PyPI&lt;/a&gt; and selecting some libraries I know or use daily.&lt;/p&gt;

&lt;p&gt;Here's my final list, including links to each Twitter account and the libraries (from the above-mentioned list) for which these developers are known:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/kennethreitz" rel="noopener noreferrer"&gt;@kennethreitz&lt;/a&gt;: requests&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/mitsuhiko" rel="noopener noreferrer"&gt;@mitsuhiko&lt;/a&gt;: jinja2/flask&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/zzzeek" rel="noopener noreferrer"&gt;@zzzeek&lt;/a&gt;: sqlalchemy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/teoliphant" rel="noopener noreferrer"&gt;@teoliphant&lt;/a&gt;: numpy/scipy/numba&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/benoitc" rel="noopener noreferrer"&gt;@benoitc&lt;/a&gt;: gunicorn&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/asksol" rel="noopener noreferrer"&gt;@asksol&lt;/a&gt;: celery&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/wesmckinn" rel="noopener noreferrer"&gt;@wesmckinn&lt;/a&gt;: pandas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://twitter.com/cournape" rel="noopener noreferrer"&gt;@cournape&lt;/a&gt;: scikit-learn&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Getting data from Twitter
&lt;/h1&gt;

&lt;p&gt;I got all the data using just two endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;with a call to &lt;a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup" rel="noopener noreferrer"&gt;lookup users&lt;/a&gt; I could get all the information about the accounts (creation date, description, counts, location, etc.)&lt;/li&gt;
&lt;li&gt;with a call to &lt;a href="https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html" rel="noopener noreferrer"&gt;user timeline&lt;/a&gt; I could get the tweets of a single user and all the information related to each tweet. I configured the call to also include retweets and replies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I saved the results of the two calls in two Pandas dataframes to ease the data processing, and then to CSV files, so the next steps could start from disk without re-calling the Twitter APIs each time.&lt;/p&gt;
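&lt;p&gt;A sketch of this fetch-and-cache step (&lt;code&gt;api&lt;/code&gt; stands for an authenticated tweepy client; the column and file names are my own choices):&lt;/p&gt;

```python
import pandas as pd

def timeline_to_frame(api, screen_name):
    """Fetch a user's timeline via tweepy and flatten it into a DataFrame (sketch)."""
    import tweepy  # assumed available, see the environment section

    cursor = tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,               # maximum tweets per page
        tweet_mode="extended",   # full, untruncated text
        include_rts=True,        # keep retweets too
    )
    rows = []
    for status in cursor.items():
        rows.append({
            "username": screen_name,
            "created_at": status.created_at,
            "full_text": status.full_text,
            "favorite_count": status.favorite_count,
            "retweet_count": status.retweet_count,
            "source": status.source,
        })
    return pd.DataFrame(rows)

def cache_frame(df, path):
    """Persist a DataFrame to CSV and read it back, as the later steps do."""
    df.to_csv(path, index=False)
    return pd.read_csv(path, parse_dates=["created_at"])
```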

&lt;h1&gt;
  
  
  Preprocessing tweets
&lt;/h1&gt;

&lt;p&gt;The users dataframe contained all the information I needed; I just created three more columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a followers/following ratio, a sort of "popularity" indicator&lt;/li&gt;
&lt;li&gt;a tweets per day ratio, dividing the total number of tweets by the number of days since the creation of the account&lt;/li&gt;
&lt;li&gt;the coordinates starting from the location, if available, using &lt;a href="https://pypi.org/project/geopy/" rel="noopener noreferrer"&gt;Geopy&lt;/a&gt;. &lt;a class="mentioned-user" href="https://dev.to/benoitc"&gt;@benoitc&lt;/a&gt; doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)&lt;/li&gt;
&lt;/ul&gt;
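&lt;p&gt;The first two derived columns can be sketched like this (column names assumed to match the table below; the Geopy lookup is omitted since it needs a network call):&lt;/p&gt;

```python
import pandas as pd

def add_derived_columns(users, now=None):
    """Add the followers/following ratio and tweets-per-day columns (sketch)."""
    now = now if now is not None else pd.Timestamp.now()
    users = users.copy()
    # a sort of "popularity" indicator
    users["followers/following"] = users["followers_count"] / users["following_count"]
    # total tweets divided by the account age in days
    age_days = (now - users["created_at"]).dt.days
    users["tweets_per_day"] = users["total_tweets"] / age_days
    return users
```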

&lt;p&gt;Here's the final users dataframe:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;screen_name&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;verified&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;th&gt;created_at&lt;/th&gt;
&lt;th&gt;location&lt;/th&gt;
&lt;th&gt;location_coo&lt;/th&gt;
&lt;th&gt;time_zone&lt;/th&gt;
&lt;th&gt;total_tweets&lt;/th&gt;
&lt;th&gt;favourites_count&lt;/th&gt;
&lt;th&gt;followers_count&lt;/th&gt;
&lt;th&gt;following_count&lt;/th&gt;
&lt;th&gt;listed_count&lt;/th&gt;
&lt;th&gt;followers/following&lt;/th&gt;
&lt;th&gt;tweets_per_day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;Kenneth Reitz&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;@DigitalOcean &amp;amp; @ThePSF. Creator of Requests: HTTP for Humans. I design art — with code, cameras, musical instruments, and the English language.&lt;/td&gt;
&lt;td&gt;2009-06-24 23:28:06&lt;/td&gt;
&lt;td&gt;Winchester, VA&lt;/td&gt;
&lt;td&gt;[39.1852184, -78.1652404]&lt;/td&gt;
&lt;td&gt;Eastern Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;58195&lt;/td&gt;
&lt;td&gt;26575&lt;/td&gt;
&lt;td&gt;18213&lt;/td&gt;
&lt;td&gt;480&lt;/td&gt;
&lt;td&gt;822&lt;/td&gt;
&lt;td&gt;37.94375&lt;/td&gt;
&lt;td&gt;17.950339296730412&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;Armin Ronacher&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Creator of Flask; Building stuff at @getsentry; prev @fireteamltd, @splashdamage — writing and talking about system architecture, API design and lots of Python&lt;/td&gt;
&lt;td&gt;2008-02-01 23:12:59&lt;/td&gt;
&lt;td&gt;Austria&lt;/td&gt;
&lt;td&gt;[47.2000338, 13.199959]&lt;/td&gt;
&lt;td&gt;Vienna&lt;/td&gt;
&lt;td&gt;31774&lt;/td&gt;
&lt;td&gt;2059&lt;/td&gt;
&lt;td&gt;21801&lt;/td&gt;
&lt;td&gt;593&lt;/td&gt;
&lt;td&gt;941&lt;/td&gt;
&lt;td&gt;36.76391231028668&lt;/td&gt;
&lt;td&gt;8.470807784590775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;mike bayer&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2008-06-11 19:22:19&lt;/td&gt;
&lt;td&gt;northeast&lt;/td&gt;
&lt;td&gt;[41.7370229, -99.5873816]&lt;/td&gt;
&lt;td&gt;Quito&lt;/td&gt;
&lt;td&gt;14771&lt;/td&gt;
&lt;td&gt;1106&lt;/td&gt;
&lt;td&gt;3013&lt;/td&gt;
&lt;td&gt;209&lt;/td&gt;
&lt;td&gt;194&lt;/td&gt;
&lt;td&gt;14.416267942583731&lt;/td&gt;
&lt;td&gt;4.080386740331492&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;Travis Oliphant&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;Creator of SciPy, NumPy, and Numba; founder and Director of Anaconda, Inc. Founder of NumFOCUS. CEO of Quansight&lt;/td&gt;
&lt;td&gt;2009-04-17 20:04:57&lt;/td&gt;
&lt;td&gt;Austin, TX&lt;/td&gt;
&lt;td&gt;[30.2711286, -97.7436995]&lt;/td&gt;
&lt;td&gt;Central Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;3875&lt;/td&gt;
&lt;td&gt;1506&lt;/td&gt;
&lt;td&gt;18052&lt;/td&gt;
&lt;td&gt;483&lt;/td&gt;
&lt;td&gt;746&lt;/td&gt;
&lt;td&gt;37.374741200828154&lt;/td&gt;
&lt;td&gt;1.1706948640483383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;benoît chesneau&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;web craftsman&lt;/td&gt;
&lt;td&gt;2007-02-13 16:53:37&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Paris&lt;/td&gt;
&lt;td&gt;27172&lt;/td&gt;
&lt;td&gt;548&lt;/td&gt;
&lt;td&gt;1971&lt;/td&gt;
&lt;td&gt;704&lt;/td&gt;
&lt;td&gt;247&lt;/td&gt;
&lt;td&gt;2.799715909090909&lt;/td&gt;
&lt;td&gt;6.620857699805068&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;Ask Solem&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;sound, noise, stream processing, distributed systems, data, open source python, etc. Works at @robinhoodapp 🌳🌳🌴🌵&lt;/td&gt;
&lt;td&gt;2007-11-11 18:56:33&lt;/td&gt;
&lt;td&gt;San Francisco, CA&lt;/td&gt;
&lt;td&gt;[45.4423543, -73.4373087]&lt;/td&gt;
&lt;td&gt;Pacific Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;3249&lt;/td&gt;
&lt;td&gt;191&lt;/td&gt;
&lt;td&gt;2509&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;td&gt;4.89083820662768&lt;/td&gt;
&lt;td&gt;0.8476389251239238&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;Wes McKinney&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Data science toolmaker at &lt;a href="https://t.co/YVn0VFqgj0" rel="noopener noreferrer"&gt;https://t.co/YVn0VFqgj0&lt;/a&gt;. Creator of pandas, @IbisData. @ApacheArrow @ApacheParquet PMC. Wrote Python for Data Analysis. Views my own&lt;/td&gt;
&lt;td&gt;2010-02-18 21:01:15&lt;/td&gt;
&lt;td&gt;New York, NY&lt;/td&gt;
&lt;td&gt;[40.7306458, -73.9866136]&lt;/td&gt;
&lt;td&gt;Eastern Time (US &amp;amp; Canada)&lt;/td&gt;
&lt;td&gt;7749&lt;/td&gt;
&lt;td&gt;3021&lt;/td&gt;
&lt;td&gt;33130&lt;/td&gt;
&lt;td&gt;784&lt;/td&gt;
&lt;td&gt;1277&lt;/td&gt;
&lt;td&gt;42.25765306122449&lt;/td&gt;
&lt;td&gt;2.5804195804195804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;David Cournapeau&lt;/td&gt;
&lt;td&gt;False&lt;/td&gt;
&lt;td&gt;Python and Stats geek. Former NumPy/Scipy core contributor. Lead the ML engineering team @cogentlabs  Occasional musings on economics/music/Japanese culture&lt;/td&gt;
&lt;td&gt;2010-06-14 03:17:21&lt;/td&gt;
&lt;td&gt;Tokyo, Japan&lt;/td&gt;
&lt;td&gt;[34.6968642, 139.4049033]&lt;/td&gt;
&lt;td&gt;Amsterdam&lt;/td&gt;
&lt;td&gt;15577&lt;/td&gt;
&lt;td&gt;505&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;427&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;1.873536299765808&lt;/td&gt;
&lt;td&gt;5.395566331832352&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tweets dataframe, on the contrary, needed some extra preprocessing.&lt;/p&gt;

&lt;p&gt;First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (roughly 3200, including retweets and replies). Therefore I decided to group the tweets by username and get the oldest tweet date for each user:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;2991&lt;/td&gt;
&lt;td&gt;2009-07-19 19:24:33&lt;/td&gt;
&lt;td&gt;2018-05-10 14:58:17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;3199&lt;/td&gt;
&lt;td&gt;2017-02-21 14:55:37&lt;/td&gt;
&lt;td&gt;2018-05-11 19:36:21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;3179&lt;/td&gt;
&lt;td&gt;2017-06-12 19:13:20&lt;/td&gt;
&lt;td&gt;2018-05-11 16:55:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;3200&lt;/td&gt;
&lt;td&gt;2017-08-26 20:48:35&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;3226&lt;/td&gt;
&lt;td&gt;2017-06-14 13:23:57&lt;/td&gt;
&lt;td&gt;2018-05-11 19:26:07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;3201&lt;/td&gt;
&lt;td&gt;2013-02-19 03:54:16&lt;/td&gt;
&lt;td&gt;2018-05-11 16:48:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;3205&lt;/td&gt;
&lt;td&gt;2014-01-26 17:45:07&lt;/td&gt;
&lt;td&gt;2018-05-03 14:51:29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;3214&lt;/td&gt;
&lt;td&gt;2015-05-05 13:35:38&lt;/td&gt;
&lt;td&gt;2018-05-11 14:17:02&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then I filtered out all the tweets older than the latest of these first dates (2017-08-26 20:48:35).&lt;br&gt;
Given these data, &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; drives the cut date because he tweets a lot more than some of the other users; but this way we at least get the same timeframe for all the users and can compare tweets from the same period.&lt;/p&gt;
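&lt;p&gt;The cut-date computation and the filter can be sketched in pandas as:&lt;/p&gt;

```python
import pandas as pd

def filter_to_common_window(tweets):
    """Keep only tweets newer than the latest 'oldest tweet' across all users (sketch)."""
    oldest_per_user = tweets.groupby("username")["created_at"].min()
    cut_date = oldest_per_user.max()  # 2017-08-26 20:48:35 in the data above
    return tweets[tweets["created_at"] >= cut_date]
```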

&lt;p&gt;After this filter, 25418-14470=10948 tweets remained, split as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;th&gt;max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;2017-09-11 18:36:19&lt;/td&gt;
&lt;td&gt;2018-05-10 14:58:17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1849&lt;/td&gt;
&lt;td&gt;2017-08-26 20:49:13&lt;/td&gt;
&lt;td&gt;2018-05-11 19:36:21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;1888&lt;/td&gt;
&lt;td&gt;2017-08-27 01:02:11&lt;/td&gt;
&lt;td&gt;2018-05-11 16:55:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;3200&lt;/td&gt;
&lt;td&gt;2017-08-26 20:48:35&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;2328&lt;/td&gt;
&lt;td&gt;2017-08-27 08:22:19&lt;/td&gt;
&lt;td&gt;2018-05-11 19:26:07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;443&lt;/td&gt;
&lt;td&gt;2017-08-26 22:57:23&lt;/td&gt;
&lt;td&gt;2018-05-11 16:48:39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;591&lt;/td&gt;
&lt;td&gt;2017-08-28 18:07:08&lt;/td&gt;
&lt;td&gt;2018-05-03 14:51:29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;596&lt;/td&gt;
&lt;td&gt;2017-08-26 22:14:11&lt;/td&gt;
&lt;td&gt;2018-05-11 14:17:02&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Other preprocessing steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I parsed the source information using Beautiful Soup because it contained HTML entities&lt;/li&gt;
&lt;li&gt;I removed smart quotes from text&lt;/li&gt;
&lt;li&gt;I converted the text to lower case&lt;/li&gt;
&lt;li&gt;I removed URLs and numbers&lt;/li&gt;
&lt;/ul&gt;
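&lt;p&gt;A sketch of these text-cleaning steps (with the stdlib &lt;code&gt;html.unescape&lt;/code&gt; standing in for the Beautiful Soup entity parsing):&lt;/p&gt;

```python
import re
from html import unescape

# map the common "smart" quote characters to their plain equivalents
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def clean_text(text):
    """Apply the preprocessing steps listed above (sketch)."""
    text = unescape(text)                      # decode HTML entities
    text = text.translate(SMART_QUOTES)        # remove smart quotes
    text = text.lower()                        # lower case
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)           # remove numbers
    return re.sub(r"\s+", " ", text).strip()
```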

&lt;p&gt;I filtered out all the tweets whose text was empty after these steps (i.e. tweets containing only URLs, etc.), leaving 10948-125=10823 tweets.&lt;/p&gt;

&lt;p&gt;I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.&lt;/p&gt;
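&lt;p&gt;The "tweet type" column can be derived from each tweet's fields; the original code likely relies on the API's retweet/reply attributes, so what follows is only a simplified heuristic sketch (the length column is just &lt;code&gt;len(full_text)&lt;/code&gt;):&lt;/p&gt;

```python
def tweet_type(full_text, in_reply_to_user=None):
    """Classify a tweet as retweet, reply or standard (heuristic sketch)."""
    if full_text.startswith("RT @"):
        return "retweet"
    if in_reply_to_user is not None or full_text.startswith("@"):
        return "reply"
    return "standard"
```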

&lt;p&gt;Here are some columns from the final tweets dataframe (first 5 rows):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;created_at&lt;/th&gt;
&lt;th&gt;full_text&lt;/th&gt;
&lt;th&gt;text_clean&lt;/th&gt;
&lt;th&gt;lang&lt;/th&gt;
&lt;th&gt;favorite_count&lt;/th&gt;
&lt;th&gt;retweet_count&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;tweet_type&lt;/th&gt;
&lt;th&gt;tweet_len&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-11 21:07:46&lt;/td&gt;
&lt;td&gt;RT @IAmAru: Trio is @kennethreitz-approved. #PyCon2018&lt;/td&gt;
&lt;td&gt;rt trio is approved&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;retweet&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 21:55:35&lt;/td&gt;
&lt;td&gt;If you want to say hi, swing by the DigitalOcean booth during the opening reception! #PyCon2018&lt;/td&gt;
&lt;td&gt;if you want to say hi swing by the digitalocean booth during the opening reception&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;20.0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 20:34:20&lt;/td&gt;
&lt;td&gt;@dbinoj 24x 1x dynos right now&lt;/td&gt;
&lt;td&gt;x x dynos right now&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Tweetbot for iΟS&lt;/td&gt;
&lt;td&gt;reply&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 20:11:37&lt;/td&gt;
&lt;td&gt;Swing by the @IndyPy booth for your chance to win a signed copy of The Hitchhiker's Guide to Python! ✨🍰✨ &lt;a href="https://t.co/CZhd2If5s0" rel="noopener noreferrer"&gt;https://t.co/CZhd2If5s0&lt;/a&gt; &lt;a href="https://t.co/3kUaqu5TMX" rel="noopener noreferrer"&gt;https://t.co/3kUaqu5TMX&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;swing by the booth for your chance to win a signed copy of the hitchhiker s guide to python&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;IFTTT&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2018-05-10 13:53:31&lt;/td&gt;
&lt;td&gt;Let's do this &lt;a href="https://t.co/6xLCE4WCqA" rel="noopener noreferrer"&gt;https://t.co/6xLCE4WCqA&lt;/a&gt; &lt;a href="https://t.co/ERiMmffe8L" rel="noopener noreferrer"&gt;https://t.co/ERiMmffe8L&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;let s do this&lt;/td&gt;
&lt;td&gt;en&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;IFTTT&lt;/td&gt;
&lt;td&gt;standard&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  Exploratory Data Analysis
&lt;/h1&gt;

&lt;p&gt;The users dataframe by itself already yields some insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn&lt;/li&gt;
&lt;li&gt;@wesmckinn, &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;, &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; and @mitsuhiko are the most popular accounts in the list (according to my "popularity" indicator): &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1pwqnv2m23tmnwxajg10.png" alt="popindicator"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; wrote since the creation of his account at least twice the number of tweets per day compared to the other devs in the list: &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fwjxpj95ew9r6js55hz3b.png" alt="tweetsperday"&gt;
&lt;/li&gt;
&lt;li&gt;Most of the accounts in the list live in the US; I used &lt;a href="https://github.com/python-visualization/folium" rel="noopener noreferrer"&gt;Folium&lt;/a&gt; to create a map showing the locations: &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fg8y6ghlaltkquc8hdk7v.png" alt="map"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tweets dataframe, instead, needs some manipulation before we can extract good insights.&lt;/p&gt;

&lt;p&gt;First of all let's check the tweet "style" of each account. From the following chart we can see for example that @cournape is retweeting a lot, while @mitsuhiko is replying a lot: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnv1czyjrk6q9ipoxdxfj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnv1czyjrk6q9ipoxdxfj.png" alt="tweettype"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also group by username and tweet type, and chart the mean tweet length. &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;, for example, writes shorter replies than standard tweets, while &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; writes longer tweets than the others (exceeding the 140-character limit): &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4jm1m8gt31f8eg1o93mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4jm1m8gt31f8eg1o93mn.png" alt="tweetlen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, now let's filter out the retweets and focus on the machine-detected language of standard tweets and replies. The five most common languages are English, German, French, undefined and a rather odd "tagalog" (ISO 639-1 code "tl", perhaps an auto-detection error?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets in French: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx15a61kev9e9r9p1zn2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fx15a61kev9e9r9p1zn2q.png" alt="lang"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, let's select only the tweets in English or with an undefined language: all the following charts consider English tweets and replies only (but you can obviously tune the analysis differently).&lt;br&gt;
Let's group by username and get statistics about the number of favorites/retweets per user:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;favorite_count count&lt;/th&gt;
&lt;th&gt;favorite_count max&lt;/th&gt;
&lt;th&gt;favorite_count mean&lt;/th&gt;
&lt;th&gt;favorite_count std&lt;/th&gt;
&lt;th&gt;retweet_count max&lt;/th&gt;
&lt;th&gt;retweet_count mean&lt;/th&gt;
&lt;th&gt;retweet_count std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;41.0&lt;/td&gt;
&lt;td&gt;1.608695652173913&lt;/td&gt;
&lt;td&gt;6.111840097055933&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;td&gt;0.10869565217391304&lt;/td&gt;
&lt;td&gt;0.48204475908203187&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1009&lt;/td&gt;
&lt;td&gt;30.0&lt;/td&gt;
&lt;td&gt;0.6531219028741329&lt;/td&gt;
&lt;td&gt;1.8313280878865186&lt;/td&gt;
&lt;td&gt;17.0&lt;/td&gt;
&lt;td&gt;0.13676907829534193&lt;/td&gt;
&lt;td&gt;0.7644934696088941&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;214&lt;/td&gt;
&lt;td&gt;60.0&lt;/td&gt;
&lt;td&gt;1.2757009345794392&lt;/td&gt;
&lt;td&gt;4.449367547428712&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;0.205607476635514&lt;/td&gt;
&lt;td&gt;1.7481758044670788&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2637&lt;/td&gt;
&lt;td&gt;3932.0&lt;/td&gt;
&lt;td&gt;10.062571103526736&lt;/td&gt;
&lt;td&gt;82.09998594317476&lt;/td&gt;
&lt;td&gt;2573.0&lt;/td&gt;
&lt;td&gt;2.620781190747061&lt;/td&gt;
&lt;td&gt;50.79602602503255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;1547&lt;/td&gt;
&lt;td&gt;752.0&lt;/td&gt;
&lt;td&gt;9.657401422107304&lt;/td&gt;
&lt;td&gt;41.06463543974671&lt;/td&gt;
&lt;td&gt;220.0&lt;/td&gt;
&lt;td&gt;1.8526179702650292&lt;/td&gt;
&lt;td&gt;9.932970595417615&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;186&lt;/td&gt;
&lt;td&gt;808.0&lt;/td&gt;
&lt;td&gt;26.080645161290324&lt;/td&gt;
&lt;td&gt;69.54002504187612&lt;/td&gt;
&lt;td&gt;134.0&lt;/td&gt;
&lt;td&gt;7.806451612903226&lt;/td&gt;
&lt;td&gt;17.085639972995896&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;433&lt;/td&gt;
&lt;td&gt;2081.0&lt;/td&gt;
&lt;td&gt;45.750577367205544&lt;/td&gt;
&lt;td&gt;142.2699008271913&lt;/td&gt;
&lt;td&gt;695.0&lt;/td&gt;
&lt;td&gt;12.270207852193995&lt;/td&gt;
&lt;td&gt;48.083342617014644&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;439&lt;/td&gt;
&lt;td&gt;85.0&lt;/td&gt;
&lt;td&gt;2.173120728929385&lt;/td&gt;
&lt;td&gt;6.417876507767421&lt;/td&gt;
&lt;td&gt;28.0&lt;/td&gt;
&lt;td&gt;0.44874715261959&lt;/td&gt;
&lt;td&gt;1.896040581838119&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
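&lt;p&gt;The table above is a plain pandas groupby aggregation; a sketch (column names assumed):&lt;/p&gt;

```python
import pandas as pd

def engagement_stats(tweets):
    """Per-user count/max/mean/std of favorite and retweet counts (sketch)."""
    return tweets.groupby("username").agg(
        favorite_count_count=("favorite_count", "count"),
        favorite_count_max=("favorite_count", "max"),
        favorite_count_mean=("favorite_count", "mean"),
        favorite_count_std=("favorite_count", "std"),
        retweet_count_max=("retweet_count", "max"),
        retweet_count_mean=("retweet_count", "mean"),
        retweet_count_std=("retweet_count", "std"),
    )
```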

&lt;p&gt;From this table we can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; has the most retweeted and favorited tweet in the dataframe. Here's the tweet: &lt;iframe class="tweet-embed" id="tweet-981547972239417345-95" src="https://platform.twitter.com/embed/Tweet.html?id=981547972239417345"&gt;
&lt;/iframe&gt;




&lt;/li&gt;
&lt;li&gt;@wesmckinn has the second most retweeted and favorited tweet in the dataframe. Here's the tweet: &lt;iframe class="tweet-embed" id="tweet-986998077767716865-492" src="https://platform.twitter.com/embed/Tweet.html?id=986998077767716865"&gt;
&lt;/iframe&gt;




&lt;/li&gt;
&lt;li&gt;@wesmckinn has the highest mean values for both retweet count and favorite count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since @wesmckinn also has the highest followers count, how do these stats change if we normalize the dataframe by followers count?&lt;br&gt;
Obviously a tweet can be favorited/retweeted by non-followers too, but this normalization should produce fairer results, because the higher the followers count, the more a tweet is likely to be viewed.&lt;/p&gt;
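&lt;p&gt;A sketch of this normalization (column names assumed; counts are expressed as a percentage of the author's followers):&lt;/p&gt;

```python
import pandas as pd

def normalize_by_followers(tweets, users):
    """Express favorite/retweet counts as a percentage of the author's followers (sketch)."""
    followers = users.set_index("screen_name")["followers_count"]
    t = tweets.copy()
    t["favorite_count_perc"] = t["favorite_count"] / t["username"].map(followers) * 100
    t["retweet_count_perc"] = t["retweet_count"] / t["username"].map(followers) * 100
    return t
```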

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;favorite_count perc count&lt;/th&gt;
&lt;th&gt;favorite_count perc max&lt;/th&gt;
&lt;th&gt;favorite_count perc mean&lt;/th&gt;
&lt;th&gt;favorite_count perc std&lt;/th&gt;
&lt;th&gt;retweet_count perc max&lt;/th&gt;
&lt;th&gt;retweet_count perc mean&lt;/th&gt;
&lt;th&gt;retweet_count perc std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;asksol&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;1.634117178158629&lt;/td&gt;
&lt;td&gt;0.06411700486942658&lt;/td&gt;
&lt;td&gt;0.243596655920922&lt;/td&gt;
&lt;td&gt;0.11956954962136308&lt;/td&gt;
&lt;td&gt;0.0043322300587450395&lt;/td&gt;
&lt;td&gt;0.019212624913592345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benoitc&lt;/td&gt;
&lt;td&gt;1009&lt;/td&gt;
&lt;td&gt;1.5220700152207&lt;/td&gt;
&lt;td&gt;0.0331365754882869&lt;/td&gt;
&lt;td&gt;0.09291365235345102&lt;/td&gt;
&lt;td&gt;0.8625063419583967&lt;/td&gt;
&lt;td&gt;0.0069390704360904&lt;/td&gt;
&lt;td&gt;0.038787086230791176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cournape&lt;/td&gt;
&lt;td&gt;214&lt;/td&gt;
&lt;td&gt;7.5&lt;/td&gt;
&lt;td&gt;0.1594626168224299&lt;/td&gt;
&lt;td&gt;0.556170943428589&lt;/td&gt;
&lt;td&gt;3.125&lt;/td&gt;
&lt;td&gt;0.02570093457943925&lt;/td&gt;
&lt;td&gt;0.21852197555838485&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kennethreitz&lt;/td&gt;
&lt;td&gt;2637&lt;/td&gt;
&lt;td&gt;21.588974908032725&lt;/td&gt;
&lt;td&gt;0.055249388368344046&lt;/td&gt;
&lt;td&gt;0.4507768404061633&lt;/td&gt;
&lt;td&gt;14.127271728984791&lt;/td&gt;
&lt;td&gt;0.014389618353632417&lt;/td&gt;
&lt;td&gt;0.27889982992935036&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mitsuhiko&lt;/td&gt;
&lt;td&gt;1547&lt;/td&gt;
&lt;td&gt;3.4493830558231275&lt;/td&gt;
&lt;td&gt;0.04429797450624903&lt;/td&gt;
&lt;td&gt;0.18836124691411713&lt;/td&gt;
&lt;td&gt;1.009128021650383&lt;/td&gt;
&lt;td&gt;0.008497857760034094&lt;/td&gt;
&lt;td&gt;0.04556199530029643&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;teoliphant&lt;/td&gt;
&lt;td&gt;186&lt;/td&gt;
&lt;td&gt;4.475958342565921&lt;/td&gt;
&lt;td&gt;0.14447510060541938&lt;/td&gt;
&lt;td&gt;0.38522061290647097&lt;/td&gt;
&lt;td&gt;0.7423000221582097&lt;/td&gt;
&lt;td&gt;0.043244247800261586&lt;/td&gt;
&lt;td&gt;0.09464679798911974&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wesmckinn&lt;/td&gt;
&lt;td&gt;433&lt;/td&gt;
&lt;td&gt;6.281316027769393&lt;/td&gt;
&lt;td&gt;0.138094106149126&lt;/td&gt;
&lt;td&gt;0.42942922072801465&lt;/td&gt;
&lt;td&gt;2.0977965590099608&lt;/td&gt;
&lt;td&gt;0.037036546490172004&lt;/td&gt;
&lt;td&gt;0.1451353535074393&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zzzeek&lt;/td&gt;
&lt;td&gt;439&lt;/td&gt;
&lt;td&gt;2.8211085297046132&lt;/td&gt;
&lt;td&gt;0.07212481675836008&lt;/td&gt;
&lt;td&gt;0.21300619010180605&lt;/td&gt;
&lt;td&gt;0.9293063391968137&lt;/td&gt;
&lt;td&gt;0.01489369905806803&lt;/td&gt;
&lt;td&gt;0.06292866185987782&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After the normalization we can see that @cournape and &lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt; achieve the highest mean values in terms of retweets and favorites.&lt;/p&gt;

&lt;p&gt;We can also track how the monthly number of tweets changes over time, per user. The following chart shows, for example, that &lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; tweeted a lot in September 2017 (more than 800 tweets): &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk9lx56uecyupdpwjcnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fbk9lx56uecyupdpwjcnr.png" alt="monthly"&gt;&lt;/a&gt;&lt;/p&gt;
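&lt;p&gt;The monthly aggregation behind this chart can be sketched with a Pandas groupby; the dataframe, column names and counts below are illustrative stand-ins, not the real data:&lt;/p&gt;

```python
import pandas as pd

# Toy tweet dataframe: one row per tweet (column names are assumptions).
tweets = pd.DataFrame({
    "username": ["kennethreitz", "kennethreitz", "mitsuhiko"],
    "created_at": pd.to_datetime(
        ["2017-09-01", "2017-09-15", "2017-09-03"]),
})

# Count tweets per user per calendar month.
monthly = (tweets
           .groupby(["username", tweets["created_at"].dt.to_period("M")])
           .size()
           .rename("n_tweets")
           .reset_index())
print(monthly)
```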

&lt;p&gt;Or we can even see which tools are used the most to tweet, per user: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffj4jd4618hbsithnj6j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffj4jd4618hbsithnj6j4.png" alt="sources"&gt;&lt;/a&gt;&lt;br&gt;
I grouped many of the less-used tools under "Other" (Tweetbot for iOS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).&lt;/p&gt;

&lt;p&gt;Finally, we can build a kind of punchcard chart for each user, showing an aggregation of tweets dates by day of the week and hours of the day: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftki4ig3kewzvtz21mev5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftki4ig3kewzvtz21mev5.png" alt="punch"&gt;&lt;/a&gt;&lt;/p&gt;
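&lt;p&gt;The punchcard aggregation boils down to a pivot over day of the week and hour of the day; here is a minimal sketch on toy timestamps (the real ones came from the Twitter APIs):&lt;/p&gt;

```python
import pandas as pd

# Toy timestamps for one user (2018-05-14 is a Monday).
df = pd.DataFrame({"created_at": pd.to_datetime(
    ["2018-05-14 09:30", "2018-05-14 09:45", "2018-05-15 22:10"])})

# Count tweets per (day of week, hour of day) cell, punchcard-style.
punch = (df.assign(weekday=df["created_at"].dt.day_name(),
                   hour=df["created_at"].dt.hour)
           .pivot_table(index="weekday", columns="hour",
                        aggfunc="size", fill_value=0))
print(punch)
```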

&lt;h1&gt;
  
  
  Topics
&lt;/h1&gt;

&lt;p&gt;But what are the devs in the list talking about?&lt;/p&gt;

&lt;p&gt;Let's start with a simple visualization: a word cloud.&lt;br&gt;
After some basic preprocessing of the text of standard tweets only (tokenization, POS tagging, stop-word removal, bigrams, etc.), I grouped the tweets by username and extracted the most common words for each one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;username&lt;/th&gt;
&lt;th&gt;tweet count&lt;/th&gt;
&lt;th&gt;most_common&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;@asksol&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;[('python', 3), ('enjoy', 1), ('seeing', 1), ('process', 1), ('handle', 1)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/benoitc"&gt;@benoitc&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;488&lt;/td&gt;
&lt;td&gt;[('like', 40), ('erlang', 33), ('use', 31), ('code', 30), ('people', 30)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@cournape&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;[('japan', 8), ('japanese', 6), ('#pyconjp', 6), ('shibuya', 4), ('python', 4)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1109&lt;/td&gt;
&lt;td&gt;[('pipenv', 157), ('python', 84), ('new', 77), ('requests', 64), ('released', 53)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@mitsuhiko&lt;/td&gt;
&lt;td&gt;399&lt;/td&gt;
&lt;td&gt;[('rust', 53), ('like', 36), ('people', 27), ('new', 25), ('way', 20)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a class="mentioned-user" href="https://dev.to/teoliphant"&gt;@teoliphant&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;113&lt;/td&gt;
&lt;td&gt;[('#pydata', 39), ('#python', 36), ('@anacondainc', 18), ('great', 18), ('new', 15)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@wesmckinn&lt;/td&gt;
&lt;td&gt;129&lt;/td&gt;
&lt;td&gt;[('@apachearrow', 32), ('data', 21), ('pandas', 16), ('python', 12), ('new', 10)]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;@zzzeek&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;td&gt;[('like', 14), ('years', 11), ('python', 10), ('time', 10), ('use', 9)]&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
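&lt;p&gt;The frequency counts in the table can be reproduced with a plain collections.Counter; the tweet texts and the stop-word list below are toy stand-ins for the real preprocessing pipeline:&lt;/p&gt;

```python
from collections import Counter

# Toy corpus: a few invented tweet texts per user (illustrative only).
tweets_by_user = {
    "@asksol": ["enjoy python", "python process", "python handle"],
}

STOPWORDS = {"the", "a", "an", "and", "to", "of"}

most_common = {}
for user, texts in tweets_by_user.items():
    # Lowercase, split into tokens and drop stop words.
    tokens = [w for t in texts for w in t.lower().split()
              if w not in STOPWORDS]
    most_common[user] = Counter(tokens).most_common(5)

print(most_common["@asksol"])  # ('python', 3) comes first
```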

&lt;p&gt;Then I created a word cloud for each username using &lt;a href="https://github.com/amueller/word_cloud" rel="noopener noreferrer"&gt;word_cloud&lt;/a&gt;. All of them talk about Python or their own libraries (like pipenv, pandas, sqlalchemy, etc.); we can also spot some other programming languages, like Erlang and Rust.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc8xc4mjy1pvkajpgowl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc8xc4mjy1pvkajpgowl5.png" alt="cloud"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to identify actual topics, using an LDA model from Gensim. I again used standard tweets only, from the two accounts with the highest number of tweets (&lt;a class="mentioned-user" href="https://dev.to/kennethreitz"&gt;@kennethreitz&lt;/a&gt; and @mitsuhiko), and performed the same preprocessing used for the word cloud generation.&lt;br&gt;
I ran the model over two varying parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the number of topics (ranging between 2 and 14)&lt;/li&gt;
&lt;li&gt;the alpha value (with possible values 0.2, 0.3 and 0.4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then I chose the best solution using the Gensim built-in Coherence Model with c_v as the metric: the optimal model is the one with 9 topics and alpha=0.2. &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Funan0uxh9o8smessm88w.png" alt="coherence"&gt;&lt;/p&gt;

&lt;p&gt;Here are the topics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;topic number&lt;/th&gt;
&lt;th&gt;top words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;(0, '0.125*"way" + 0.094*"favorite" + 0.076*"feature" + 0.067*"oh" + 0.063*"think"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;(1, '0.140*"pipenv" + 0.124*"released" + 0.098*"pipenv_released" + 0.082*"want" + 0.073*"code"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;(2, '0.271*"python" + 0.132*"today" + 0.093*"people" + 0.039*"month" + 0.036*"kenneth"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;(3, '0.183*"requests" + 0.134*"love" + 0.081*"work" + 0.071*"html" + 0.057*"github"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;(4, '0.164*"like" + 0.100*"rust" + 0.098*"time" + 0.058*"day" + 0.047*"things"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;(5, '0.297*"pipenv" + 0.062*"support" + 0.058*"includes" + 0.045*"right" + 0.044*"making"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;(6, '0.271*"new" + 0.076*"getting" + 0.075*"better" + 0.058*"use" + 0.049*"photos"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;(7, '0.161*"good" + 0.097*"going" + 0.092*"got" + 0.067*"happy" + 0.058*"current"')&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;(8, '0.114*"great" + 0.091*"ipad" + 0.076*"finally" + 0.066*"heroku" + 0.057*"working"')&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can check the intertopic distance map and the most relevant terms for each topic using &lt;a href="https://github.com/bmabey/pyLDAvis" rel="noopener noreferrer"&gt;pyLDAvis&lt;/a&gt;: you can explore the interactive data in the Jupyter notebook in my GitHub account. &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fadzpp304ongpmq6e4gxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fadzpp304ongpmq6e4gxe.png" alt="ldavis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusions and future steps
&lt;/h1&gt;

&lt;p&gt;In this post I showed how to get data from the Twitter APIs and how to perform some simple analysis in order to learn, in advance, some characteristics of an account (e.g. tweet style, statistics about tweets, topics).&lt;br&gt;
Your mileage may vary depending on the initial account list and on the configuration of the algorithms (especially for topic detection).&lt;/p&gt;

&lt;p&gt;I uploaded a &lt;a href="https://github.com/eric-bonfadini/python-notebooks/blob/master/blog_posts/You_can_tell_a_man_by_his_tweets.ipynb" rel="noopener noreferrer"&gt;Jupyter notebook&lt;/a&gt; to my GitHub, with the snippets I used to create this blog post.&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve preprocessing using lemmatization and stemming&lt;/li&gt;
&lt;li&gt;Try different algorithms for topic detection using Gensim (e.g. AuthorTopicModel or LdaMallet) or scikit-learn&lt;/li&gt;
&lt;li&gt;Add sentiment analysis&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataviz</category>
      <category>dataanalysis</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Finding my new favorite song on Spotify</title>
      <dc:creator>Eric Bonfadini</dc:creator>
      <pubDate>Thu, 18 Jan 2018 22:14:11 +0000</pubDate>
      <link>https://dev.to/ericbonfadini/finding-my-new-favorite-song-on-spotify-4lgc</link>
      <guid>https://dev.to/ericbonfadini/finding-my-new-favorite-song-on-spotify-4lgc</guid>
      <description>&lt;p&gt;While I'm developing (or just while I'm commuting to work) I usually love to hear some rock music.&lt;/p&gt;

&lt;p&gt;I created some playlists on Spotify, but lately I've been sticking to the same one, containing my favorite "Indie Rock" songs.&lt;br&gt;
This playlist is made up of roughly 45 songs I discovered over the years in several ways.&lt;/p&gt;

&lt;p&gt;Since I was starting to get bored of always listening to the same songs, last weekend I decided to analyze my playlist using the Spotify APIs, in order to discover insights and hopefully find some new tunes to add.&lt;/p&gt;

&lt;p&gt;Here's what I did in roughly 300 lines of Python 3 code (boilerplate included).&lt;br&gt;
You can find a Jupyter notebook with the code I used on &lt;a href="https://github.com/eric-bonfadini/python-notebooks/blob/master/blog_posts/Finding_my_new_favorite_song_on_Spotify.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setting up the environment
&lt;/h1&gt;

&lt;p&gt;For my analysis I set up a Python 3 virtual environment with the following libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt; for data analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://seaborn.pydata.org/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt; for data visualization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/plamere/spotipy" rel="noopener noreferrer"&gt;Spotipy&lt;/a&gt; for interaction with Spotify APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to access the Spotify APIs I &lt;a href="https://developer.spotify.com/my-applications/#!/applications" rel="noopener noreferrer"&gt;registered my app&lt;/a&gt; and then provided the spotipy library with the client_id, the client_secret and a redirect URL.&lt;/p&gt;

&lt;h1&gt;
  
  
  Analyzing my playlist tracks
&lt;/h1&gt;

&lt;p&gt;After a first API call to get my &lt;a href="https://developer.spotify.com/web-api/get-list-users-playlists/" rel="noopener noreferrer"&gt;playlist id&lt;/a&gt;, I got all the tracks of my playlist with another &lt;a href="https://developer.spotify.com/web-api/get-playlists-tracks/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; along with some basic information like: song id, song name, artist id, artist name, album name, song popularity.&lt;/p&gt;

&lt;p&gt;With another &lt;a href="https://developer.spotify.com/web-api/get-several-artists/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; I got some extra information about the artists in my playlist, like genres and artist_popularity.&lt;/p&gt;

&lt;p&gt;Finally with another &lt;a href="https://developer.spotify.com/web-api/get-several-audio-features/" rel="noopener noreferrer"&gt;API call&lt;/a&gt; I got some insightful information about my tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;duration_ms&lt;/em&gt;: the duration of the track in milliseconds;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;acousticness&lt;/em&gt;: describes the acousticness of a song (1 =&amp;gt; high confidence the track is acoustic). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;danceability&lt;/em&gt;: describes the danceability of a song (1 =&amp;gt; high confidence the track is danceable). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;energy&lt;/em&gt;: it's a perceptual measure of intensity and activity (e.g. death metal has high energy while classical music has low energy). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;instrumentalness&lt;/em&gt;: predicts whether a track contains no vocals (1 =&amp;gt; high confidence the track has no vocals). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;liveness&lt;/em&gt;: detects the presence of an audience in the recording (1 =&amp;gt; high confidence the track is live). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;loudness&lt;/em&gt;: detects the overall loudness of a track in decibels. It ranges from -60dB to 0dB;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;valence&lt;/em&gt;: describes the musical positiveness conveyed by a track (1 =&amp;gt; more positive, 0 =&amp;gt; more negative). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;speechiness&lt;/em&gt;: detects the presence of spoken words in a track (1 =&amp;gt; speech, 0 =&amp;gt; non speech, just music). It ranges from 0 to 1;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;key&lt;/em&gt;: describes the &lt;a href="https://en.wikipedia.org/wiki/Pitch_class" rel="noopener noreferrer"&gt;pitch class notation&lt;/a&gt; of the song. It ranges from 0 to 11;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;mode&lt;/em&gt;: the modality of a track (0 =&amp;gt; minor, 1 =&amp;gt; major);&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;tempo&lt;/em&gt;: the overall estimated tempo of a track in beats per minute (BPM);&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;time_signature&lt;/em&gt;: an estimated overall time signature of a track (how many beats are in each bar or measure).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results of all these calls were put into Pandas dataframes, to simplify the data analysis, and then merged into a single dataframe using artist IDs and track IDs.&lt;br&gt;
Some values (like song/artist popularity and tempo) were normalized.&lt;/p&gt;
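&lt;p&gt;A minimal sketch of the merge-and-normalize step, with made-up frames standing in for the three API responses (the column names are my assumptions):&lt;/p&gt;

```python
import pandas as pd

# Made-up frames mirroring the three API calls.
tracks = pd.DataFrame({"track_id": ["t1", "t2"],
                       "artist_id": ["a1", "a1"],
                       "song_popularity": [30, 90]})
artists = pd.DataFrame({"artist_id": ["a1"], "artist_popularity": [80]})
features = pd.DataFrame({"track_id": ["t1", "t2"],
                         "tempo": [120.0, 150.0]})

# Merge on artist and track IDs into a single dataframe.
df = (tracks.merge(artists, on="artist_id")
            .merge(features, on="track_id"))

# Min-max normalization of selected columns to [0, 1].
for col in ("song_popularity", "tempo"):
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)
print(df)
```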

&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;After ensuring that the artists in my playlist all have "Indie Rock" among their genres, I saw (using shape, info and describe on the full dataframe) that my playlist consisted of 46 entries, all with non-null values, with these statistics:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygmh86o65684e8bsmlxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygmh86o65684e8bsmlxc.png" alt="describe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I created some charts (distplot, countplot, boxplot) using Seaborn:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7irgi3zat6a6ebvpcdfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F7irgi3zat6a6ebvpcdfb.png" alt="mix"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2gu8hhqlhgj38taleoxe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2gu8hhqlhgj38taleoxe.png" alt="boxplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All these graphs show that I like songs with low acousticness/instrumentalness/speechiness, high energy/loudness/tempo, high artist popularity and a duration of roughly 200 seconds.&lt;br&gt;
Valence and song popularity span a wide range, meaning that my playlist contains both well-known and obscure songs, and both positive and negative ones.&lt;/p&gt;

&lt;p&gt;But how does my playlist compare with the Indie Rock genre as a whole?&lt;/p&gt;

&lt;h1&gt;
  
  
  Comparing my playlist with a sample of the genre
&lt;/h1&gt;

&lt;p&gt;I made some calls to the &lt;a href="https://developer.spotify.com/web-api/search-item/" rel="noopener noreferrer"&gt;search API&lt;/a&gt; with 'genre:"Indie Rock"' as the query and 'type=track' to get a sample of the Indie Rock genre (5000 songs in total).&lt;br&gt;
This API also offers some nice qualifiers, like 'tag:hipster' (to get only albums with the lowest 10% popularity) or 'year:1980-2020' (to get only tracks released in a specific year range).&lt;/p&gt;
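&lt;p&gt;Since the search endpoint returns at most 50 items per call, collecting a 5000-song sample means paging with the offset parameter. A sketch (sp is an authenticated spotipy client, and the function name is mine):&lt;/p&gt;

```python
def fetch_sample(sp, query='genre:"Indie Rock"', total=5000, page=50):
    """Page through the Spotify search API, `page` tracks per call."""
    tracks = []
    for offset in range(0, total, page):
        res = sp.search(q=query, type="track", limit=page, offset=offset)
        tracks.extend(res["tracks"]["items"])
    return tracks
```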

&lt;p&gt;Then I repeated the same analysis on both my current playlist and the Indie Rock sample and I got the following charts:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fekoed4saw763kw2pfeog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fekoed4saw763kw2pfeog.png" alt="mix2"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F77s6a082mbbgx13kqvhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F77s6a082mbbgx13kqvhy.png" alt="key2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graphs show that my playlist differs from the 5000 Indie Rock songs because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I like shorter songs&lt;/li&gt;
&lt;li&gt;I like songs with higher energy/loudness/tempo&lt;/li&gt;
&lt;li&gt;I don't like songs with too negative a mood (valence &amp;gt; 0.3)&lt;/li&gt;
&lt;li&gt;I like songs mostly in key (0, 1, 6, 9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The boxplot confirms the same insights and I agree with the outcome of this analysis.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F14b0bgzio79c37tcjvdt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F14b0bgzio79c37tcjvdt.png" alt="box2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Creating a new playlist with songs I potentially like
&lt;/h1&gt;

&lt;p&gt;Using these insights, I applied some filters to the 5000 songs dataframe in order to keep only tracks I potentially like; for each step I logged the dropped songs to double check the filter behavior.&lt;/p&gt;

&lt;p&gt;The first filter I applied was removing songs I already had in my original playlist, obviously.&lt;/p&gt;

&lt;p&gt;The other filters were: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;acousticness &amp;lt; 0.1&lt;/li&gt;
&lt;li&gt;energy &amp;gt; 0.75&lt;/li&gt;
&lt;li&gt;loudness &amp;gt; -7dB&lt;/li&gt;
&lt;li&gt;valence between 0.3 and 0.9&lt;/li&gt;
&lt;li&gt;tempo &amp;gt; 120&lt;/li&gt;
&lt;li&gt;key in (0, 1, 6, 9)&lt;/li&gt;
&lt;li&gt;duration between the 10th and 90th percentiles of the original playlist's durations (178s and 280s)&lt;/li&gt;
&lt;/ul&gt;
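&lt;p&gt;Applied with Pandas, the filters above become successive boolean selections; the two rows here are invented, one built to pass and one to fail:&lt;/p&gt;

```python
import pandas as pd

# Two invented tracks standing in for the 5000-song dataframe.
df = pd.DataFrame({
    "acousticness": [0.05, 0.50],
    "energy": [0.90, 0.60],
    "loudness": [-5.0, -12.0],
    "valence": [0.60, 0.10],
    "tempo": [140.0, 90.0],
    "key": [0, 3],
    "duration_s": [200.0, 320.0],
})

# Each line mirrors one of the thresholds listed above.
liked = df[df["acousticness"].lt(0.1)]
liked = liked[liked["energy"].gt(0.75)]
liked = liked[liked["loudness"].gt(-7)]
liked = liked[liked["valence"].between(0.3, 0.9)]
liked = liked[liked["tempo"].gt(120)]
liked = liked[liked["key"].isin([0, 1, 6, 9])]
liked = liked[liked["duration_s"].between(178, 280)]
print(len(liked))  # only the first row survives
```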

&lt;p&gt;In the end I got a dataframe with 220 tracks, and I created a new playlist using an &lt;a href="https://developer.spotify.com/web-api/add-tracks-to-playlist/" rel="noopener noreferrer"&gt;API call&lt;/a&gt;.&lt;/p&gt;
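&lt;p&gt;The playlist-creation step can be sketched with spotipy's user_playlist_create and user_playlist_add_tracks; the add-tracks endpoint accepts at most 100 track IDs per request, hence the batching (function and variable names are mine):&lt;/p&gt;

```python
def fill_playlist(sp, user, name, track_ids):
    """Create a private playlist and add tracks in batches of 100."""
    playlist = sp.user_playlist_create(user, name, public=False)
    for i in range(0, len(track_ids), 100):
        sp.user_playlist_add_tracks(user, playlist["id"],
                                    track_ids[i:i + 100])
    return playlist["id"]
```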

&lt;h1&gt;
  
  
  Conclusions and future steps
&lt;/h1&gt;

&lt;p&gt;After a few days listening to the new playlist, I'm quite happy with the results and I'm already promoting some tracks to my original playlist.&lt;/p&gt;

&lt;p&gt;This "recommendation" method is really simple and probably works well only in this specific use case (i.e. it's a well definable subset of a specific genre).&lt;br&gt;
The standard recommendation method of Spotify is obviously much better because, apart from audio analysis, it uses also a mix of &lt;em&gt;Collaborative Filtering models&lt;/em&gt; (analyzing your behavior and others’ behavior) and &lt;em&gt;Natural Language Processing (NLP) models&lt;/em&gt; (analyzing text of the songs).&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the analysis again after a few months, in order to take into account new entries in my playlist and new songs in the 5000-song sample&lt;/li&gt;
&lt;li&gt;Enrich the information I already got with something new, using the Spotify APIs (e.g. the &lt;a href="https://developer.spotify.com/web-api/get-audio-analysis/" rel="noopener noreferrer"&gt;Audio Analysis endpoint&lt;/a&gt;) or other services. It would be nice, as an example, to detect musical instruments in a track (guitars anyone??) or the presence of features like distortion, riffs, etc.&lt;/li&gt;
&lt;li&gt;Use the preview sample from the API to tag manually what I like/dislike on a subset of the 5000 songs and then run some ML algorithms in order to classify the music I like&lt;/li&gt;
&lt;li&gt;Analyze my playlist more deeply using some ML algorithms (e.g. clustering my tracks)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataviz</category>
      <category>dataanalysis</category>
      <category>python</category>
      <category>api</category>
    </item>
  </channel>
</rss>
