loading...
Cover image for Plotting your Spotify data

Plotting your Spotify data

fferegrino profile image Antonio Feregrino ・6 min read

Inspired by Nerudista's post in the Tacos de Datos website (in spanish) I asked Spotify for my data and started making some plots with them. If you want to do the same, head over to this page; once you request it, it will take a couple of days to be available. In the meantime, you can use this Google Colab, I've created a subset of my data for you to play with.

Among all the information you will receive, there will be some files named following this pattern: StreamingHistoryXX.json, and these are the ones we'll use throughout this post.

The data

The files mentioned above files contain something like this:

[
  {
    "endTime" : "2019-02-04 17:14",
    "artistName" : "MGMT",
    "trackName" : "Time to Pretend",
    "msPlayed" : 261000
  },
  {
    "endTime" : "2019-02-04 17:18",
    "artistName" : "MGMT",
"..." 

Where the values are:

  • endTime: Day and time in which a song finished, in UTC format.
  • artistName: Name of the artist of the song.
  • trackName: Name of the song.
  • msPlayed: For how long (in milliseconds) a song was played.

To load these this data into a DataFrame we'll need this tiny function:

from glob import glob
import json
import pandas as pd

def read_history():
    history = []
    for file in sorted(glob("StreamingHistory*.json")):
        with open(file) as readable:
            history.extend(json.load(readable))
    history = pd.DataFrame(history)
    history["endTime"] = pd.to_datetime(history["endTime"])
    return history

streaming_history = read_history()
streaming_history.head(5)

Which should give us something like this once we execute it:

endTime artistName trackName msPlayed
0 2019-02-04 17:14:00 MGMT Time to Pretend 261000
1 2019-02-04 17:18:00 MGMT When You Die 263880
2 2019-02-04 17:23:00 Mr Dooves Super Smash Bros. Brawl Main Theme (From "Super Smash Bros. Brawl") [A Cappella] 201518

Histogram.

I've always been a fan of the way GitHub shows developer contributions, and the data we got from Spotify seems like the perfect candidate to create something similar; however, we need to perform some transformations first.

Since we are not interested in the time each song finished we'll get rid of the temporal part of endTime:

streaming_history["date"] = streaming_history["endTime"].dt.floor('d')

Then we'll get a count of songs per day using groupby:

by_date = streaming_history.groupby("date")[["trackName"]].count()
by_date = by_date.sort_index()

For out plot, we'll need to know which day of the week each date refers to, we can use the properties week and weekday:

by_date["weekday"] = by_date.index.weekday
by_date["week"] = by_date.index.week

By the end, our dataframe should look like this:

trackName weekday week
date
2019-02-04 5 0 6
2019-02-05 8 1 6
2019-02-07 60 3 6

We have almost everything we need, the next step is to get a contiuous sequence of numbers for each week. In the dataframe above the 6th week of 2016 must be the week 0, the 7th must be the week 1... I can't think of a better way to do it than a for loop:

week = 0
prev_week = by_date.iloc[0]["week"]
continuous_week = np.zeros(len(by_date)).astype(int)
sunday_dates = []
for i, (_, row) in enumerate(by_date.iterrows()):
    if row["week"] != prev_week:
        week += 1
        prev_week = row["week"]
    continuous_week[i] = week
by_date["continuous_week"] = continuous_week 
by_date.head()
trackName weekday week continuous_week
date
2019-02-04 5 0 6 0
2019-02-05 8 1 6 0
2019-02-07 60 3 6 0

Our next step is to generate a matrix with days ✕ weeks as dimensions, where each one of the entries will be the number of songs we listened to in that day and week:

songs = np.full((7, continuous_week.max()+1), np.nan)

for index, row in by_date.iterrows():
    songs[row["weekday"]][row["continuous_week"]] = row["trackName"]

Now we could just plot the matrix songs using seaborn:

fig = plt.figure(figsize=(20,5))
ax = plt.subplot()
mask = np.isnan(songs)
sns.heatmap(songs, ax = ax)

But the result is not that great:

Histograma horrible

If we want it to look better we still need some code, the first thing to do is to clean the axis labels:

min_date = streaming_history["endTime"].min()
first_monday = min_date - timedelta(min_date.weekday())
mons = [first_monday + timedelta(weeks=wk) for wk in range(continuous_week.max())]
x_labels = [calendar.month_abbr[mons[0].month]]
x_labels.extend([
    calendar.month_abbr[mons[i].month] if mons[i-1].month != mons[i].month else "" 
    for i in range(1, len(mons))])

y_labels = ["Mon", "", "Wed", "", "Fri", "", "Sun"]

The X axis labels are far more complicated than the Y ones; this is because unlike Y, the X axis is not fixed nor continuous, as such, they need to calculated based on the data. If you want a more detailed explanation, tell me in the comments or at me on twitter at @io_exception).

After doing this we'll perform some modifications with colours and the axis:

fig = plt.figure(figsize=(20,5))
ax = plt.subplot()

ax.set_title("My year on Spotify", fontsize=20,pad=40)
ax.xaxis.tick_top()
ax.tick_params(axis='both', which='both',length=0)
ax.set_facecolor("#ebedf0") 
fig.patch.set_facecolor('white')

And finally, we can use seaborn's heatmap again, this time with a few extr arguments that I will explain later:

sns.heatmap(songs, linewidths=2, linecolor='white', square=True,
            mask=np.isnan(songs), cmap="Greens",
            vmin=0, vmax=100, cbar=False, ax=ax)

ax.set_yticklabels(y_labels, rotation=0)
ax.set_xticklabels(x_labels, ha="left")

The arguments are as follows:

  • songs: our matrix with shape days ✕ weeks with the counts of songs per day,
  • linewidths: the size of the spacing between each patch,
  • linecolor: the colour of the spacing between each patch,
  • square: this tells the function that we want to keep the aspect ratio 1:1 for each patch,
  • mask: a very interesting argument, it will help us "mask" the patches for which there is no recorded value, this argument should be a boolean matrix of the same dimensions as the data being plotted, where each True means that that specific value must be masked,
  • cmap: the colormap to be used, luckily for us, the value "Greens" matches with the colour palette chosen by GitHub,
  • vmin: the value that should be considered as the minimum among our values,
  • vmax: the value that should be considered as the maximum among our values, I'd consider 100 to be the maximum, even though my record sits as 190 in a day!
  • cbar: a boolean value to indicate whether we want to show the colour bar that usually comes with the heatmap,
  • ax: the axes our plot should be plotted on.

And voilà, our plot is ready:

Histograma bonito

It is up to you to modify the plot, may be by adding information about the number of songs, or show the colorbar... a great idea would be to recreate this plot in a framework such as D3.js, but that may well belong to another post. Again, feel free to head over to this Colab and contact me via twitter @io_exception.

Posted on Jan 14 by:

fferegrino profile

Antonio Feregrino

@fferegrino

Data scientist with software development experience.

Discussion

markdown guide