I was excited to find a playlist dataset from Spotify because I have always been unhappy with the recommendations from music streaming platforms. I wanted to use this dataset as an opportunity to learn how to build a personalised recommendation system.
In this series, I'll walk you through the process of building such a system using Python for data processing and Node.js for recommendations retrieval. In this first article, we'll focus on the data processing flow. In the second article, we'll be building a real-time recommendation system.
If you're interested, join me to see the final result!
Conceptually, this is what I wanted to implement:
The Data
The dataset we'll be working with consists of 1000 JSON files, and contains around 1 million public playlists.
Here's an example of a single playlist from the JSON:
playlist content
{
"name": "Throwbacks",
"collaborative": "false",
"pid": 0,
"modified_at": 1493424000,
"num_tracks": 52,
"num_albums": 47,
"num_followers": 1,
"tracks": [
{
"pos": 0,
"artist_name": "Missy Elliott",
"track_uri": "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
"artist_uri": "spotify:artist:2wIVse2owClT7go1WT98tk",
"track_name": "Lose Control (feat. Ciara & Fat Man Scoop)",
"album_uri": "spotify:album:6vV5UrXcfyQD1wu4Qo2I9K",
"duration_ms": 226863,
"album_name": "The Cookbook"
},
// ...
{
"pos": 51,
"artist_name": "Boys Like Girls",
"track_uri": "spotify:track:6GIrIt2M39wEGwjCQjGChX",
"artist_uri": "spotify:artist:0vWCyXMrrvMlCcepuOJaGI",
"track_name": "The Great Escape",
"album_uri": "spotify:album:4WqgusSAgXkrjbXzqdBY68",
"duration_ms": 206520,
"album_name": "Boys Like Girls"
}
],
"num_edits": 6,
"duration_ms": 11532414,
"num_artists": 37
}
Each playlist consists of a name, a PID, and a list of tracks, along with some additional data.
When building this recommendation system, I made the assumption that "people are more likely to add similar songs in a playlist." While this may not hold true for every playlist, I believe that on a larger scale, it's a reasonable assumption. As a result, I'll be treating each playlist (identified by its PID) as a unit of taste preferences, and modeling item similarities based on them using various algorithms.
Talk is cheap. Show me the code ― Linus Torvalds
Preparing Data
Before diving into running different ML algorithms, we need to prepare the data for ingestion and do some cleaning.
The goal is to parse the URIs, select the necessary fields, unfold the nested playlist tracks structure, and keep only the top 10K artists. Also, converting the JSON files to CSV and parsing the fields will save time when experimenting.
While this section may be basic, I've included it for completeness. If you're familiar with these concepts, feel free to skip ahead to the next section.
Below are some common imports that I'll be using throughout the examples.
imports
import collections
import functools
import itertools
import json
import math
from collections import Counter, namedtuple
from pathlib import Path
import pandas as pd
import tqdm
from fastai.collab import *
from fastai.tabular.all import *
from sklearn.neighbors import KNeighborsClassifier
Reading JSON dataset
We'll need a few helper functions. The first one is simply reading a file and parsing the JSON contained within it:
def read_file(path):
with open(path, 'r') as f:
return json.load(f)
We'll also need a few field getters for the track entity. Notice that the URI parsing is just a simple substring operation:
artist_uri_prefix = len('spotify:artist:')
track_uri_prefix = len('spotify:track:')
def artist_name(track):
return track['artist_name']
def artist_uri(track):
return track['artist_uri'][artist_uri_prefix:]
def track_name(track):
return track['track_name']
def track_uri(track):
return track['track_uri'][track_uri_prefix:]
The next function reads the JSON files, parses the required fields, and returns an iterator of processed rows. This part might be confusing if you're not familiar with the concept of generators. Since we don't need to store the intermediate JSON files or the unfolded structure in memory, I'm using generators to lazily process them.
root_dir = '/kaggle/input/spotify-millions-playlist/spotify/data'
p = Path(root_dir)
paths = list(p.iterdir())
def read_files(num_files=5):
lookup_path = paths if num_files is None else paths[:num_files]
jsons = (read_file(path) for path in lookup_path)
return (
[
row['pid'],
artist_name(track),
artist_uri(track),
track_name(track),
track_uri(track)
]
for data in jsons
for row in data['playlists']
for track in row['tracks']
)
example of a row
next(read_files(1))
[35000,
'Marian Hill',
'1xHQO9GJIW9OXHxGBISYc5',
'Lovit',
'2Ag3LUfgpN6ymEMwDOqKdg']
Most popular artists
Despite having a million of playlists, these data may not be enough to produce accurate predictions for every track and artist in the dataset. As a result, I want to limit the task to providing recommendations for the N most popular artists.
To do this, I'll need to count the number of artist mentions in the playlists:
artist_uri_pos = 2
c = collections.Counter(r[artist_uri_pos] for r in read_files(None))
After executing this code, I have the following structure:
c.most_common(1)
[('3TVXtAsR1Inumwj472S9r4', 846937)]
You can visit https://open.spotify.com/artist/3TVXtAsR1Inumwj472S9r4 to see the most popular artist according to the data we have from Spotify.
I should have actually counted unique artist mentions in a playlist here, but in this case it's not a problem since I'm not planning to use all 10K.
Let's save the top artists in a CSV for later use:
df = pd.DataFrame(data=c.most_common(10000), columns=['artist_uri', 'count'])
df.to_csv('common_artist_uri.csv', index=False)
Transforming JSON to CSV
Now, we want to convert the initial JSON files to processed CSV. To do this, we'll start by reading the most popular artists created in the previous step:
artists = pd.read_csv('/kaggle/input/common-spotify-artists/common_artist_uri.csv')
artists = artists.set_index('artist_uri')
I'll also modify the previously introduced parsing function by filtering out artists that are not in the allowlist and playlists that have too few tracks. This will help ensure that the models will produce better predictions:
def read_files(artist_allowlist, num_files=5):
lookup_path = paths if num_files is None else paths[:num_files]
jsons = (read_file(path) for path in lookup_path)
return (
[
row['pid'],
artist_name(track),
artist_uri(track),
track_name(track),
track_uri(track)
]
for data in jsons
for row in data['playlists']
if row['num_tracks'] > 5 # added filter or playlist number of tracks
for track in row['tracks']
if artist_uri(track) in artist_allowlist # and filter on popular artists
)
Since we don't want to write everything in a single huge file, I'll split the rows into batches of 500K.
Here's the code for writing the CSV files:
gen = read_files(artists.index, None)
n = 500_000
i = 1
column_names = ['pid', 'artist_name', 'artist_uri', 'track_name', 'track_uri']
while (True):
batch_slice = itertools.islice(gen, n)
df = pd.DataFrame(batch_slice, columns=column_names)
if df.empty:
break
df.to_csv(f'playlist_{i}.csv', index=False)
i += 1
This leaves us with 5 GB of CSV files containing playlist-track data.
Similar items
In this section, we will compute the similarity between items in the dataset, such as artists or tracks. To do this, we will use three different algorithms: two variants of Collaborative Filtering and one based on Co-Occurrence.
Collaborative Filtering is a popular approach for recommendation systems, as it relies on the ratings or preferences of users to make predictions about what items a given user might like. In our case, we will use the presence of an artist in a playlist as a proxy for a "rating" of 1, indicating that the artist is relevant to the playlist.
The Co-Occurrence based algorithm, on the other hand, will measure the similarity between items based on how often they appear in the same playlist.
Preparing training Dataset
We'll start by creating a training dataset.
The first function reads every CSV file and combines them into a single pandas DataFrame. For artist similarity, we are only interested in the presence of artists in a playlist. Here's the code for reading the raw dataset:
def read_for_artists(allowlist):
p = Path('/kaggle/input/spotify-playlists-csv')
paths = list(p.iterdir())
df = pd.concat((pd.read_csv(f, usecols=['pid', 'artist_uri']) for f in paths), ignore_index=True)
df = df[df.artist_uri.isin(allowlist)]
return df
As previously mentioned, I'll further limit the number of artists. By playing with the data, I noticed that it's reasonable to keep around 1K of them, corresponding to 10K playlist mentions. Here's the code for preparing the artist allowlist:
N_10k = 10_000 # leaves ~1K artists
N_50k = 50_000 # leaves ~200 artists
N_88k = 88_000 # leaves ~100 artists
def get_common_artists_index(mentions):
common_artists = pd.read_csv('/kaggle/input/common-spotify-artists/common_artist_uri.csv')
df = common_artists[common_artists['count'] > mentions]
df = df.set_index('artist_uri')
return df.index
To remove noise, I suggest to remove playlists that have too few or too many artists. Having too many artists might mean that this is a playlist with random songs, while having too few may risk losing signals and overfitting the model.
def clean_by_pid(df, p_min, p_max):
pid_counts = df.groupby('pid')['pid'].count()
drop_df = ((pid_counts < p_min) | (pid_counts > p_max))
drop_index = drop_df[drop_df].index
df = df[~df.pid.isin(drop_index)]
return df
Having more than one track from the same artist in a playlist might signal something, but I didn't want to introduce the complexity of a value function, so let's leave only unique mentions of artists in the training set.
def preprocess_artists(df):
df.drop_duplicates(subset=['pid', 'artist_uri'], inplace=True)
return df
After performing all necessary manipulations on the dataset, we are left with relevant (pid, artist) pairs.
The first two algorithms we will consider are variants of Collaborative Filtering. Collaborative Filtering relies on the notion of ratings, so for every pair of pid and artist, we’ll assign a rating of 1 to indicate that the artist is relevant to the playlist.
Additionally, it will be useful to index the dataset by the (pid, artist) pair. This will allow us to efficiently union dataset when we add negative data.
def prepare_positive(df, item_name):
df['rating'] = 1
df = df.set_index(['pid', item_name])
return df
Now, let's move on to generating negative labels as promised. To do this, we will need to obtain the unique playlist IDs and unique artists in our dataset. We can then randomly match them a specified number of times to create our negative examples:
def generate_negative_pairs(N, unique_pids, unique_items):
items = np.random.choice(unique_items, size=N)
pids = np.random.choice(unique_pids, size=N)
df = pd.concat([pids, items], axis=1)
df['rating'] = 0
return df
def generate_negative_artists(N, pids, artists):
df = generate_negative_pairs(N, pids, artists)
df = df.set_index(['pid', 'artist_uri'])
return df
To create a dataset, we can generate two random arrays: one for artists and one for playlists, this process can duplicate items if necessary. Then we can join the arrays into a single dataset by adding them as columns. It is important to note that we will need to index this new dataset accordingly.
To finalize the creation of our positive and negative datasets, we join them together and reset the indexed columns:
def join_dfs(df1, df2):
res = pd.concat([df1, df2], sort=True, copy=False)
# make sure to free the occupied memory
df1.drop(df1.index, inplace=True)
df2.drop(df2.index, inplace=True)
return res
def reindex(df):
df.reset_index(inplace=True)
return df
Here is a summary of the steps we have discussed for creating a training dataset for collaborative filtering:
- Obtain the list of the most popular artists to use as an allowlist.
- Read in the raw data.
- Clean up the noisy playlists.
- Generate the negative dataset by randomly generating pairs of playlist IDs and artists that do not already exist in the dataset.
- Assign positive and negative labels to the data accordingly.
- Join datasets together into a single dataset.
Here is a function that combines these steps into a single flow for creating the training dataset for collaborative filtering:
def artist_training_set(n_common, pid_min_max, negative_ratio):
# read most popular artists
allowed_artists = get_common_artists_index(n_common)
# read and preprocess positive labels data
df = read_for_artists(allowed_artists)
df = clean_by_pid(df, pid_min_max[0], pid_min_max[1])
df = preprocess_artists(df)
# generate negative labels
df2 = generate_negative_artists(math.ceil(negative_ratio * len(df)), df.pid.drop_duplicates(), allowed_artists.to_series())
# prepare negative labels
df = prepare_positive(df, 'artist_uri')
# finilise resulting dataset
df = join_dfs(df, df2)
return reindex(df)
If we execute this function, the resulting data will look this way:
training set
training_df = artist_training_set(n_common=N_10k,
users_min_max=(5, 40),
negative_ratio=1,)
pid artist_uri rating
896625 7DMveApC7UnC2NPfPvlHSU 1
896625 23fqKkggKUBHNkbKtXEls4 1
896625 45eNHdiiabvmbp4erw26rg 1
896625 0L8ExT028jH3ddEcZwqJJ5 1
896625 2wY79sveU1sp5g7SokKOiI 1
... ... ... ...
957041 70BYFdaZbEKbeauJ670ysI 0
587936 4Z8W4fKeB5YxbusRsdQVPb 0
656290 0uq5PttqEjj3IH1bzwcrXF 0
757 2RhgnQNC74QoBlaUvT4MEe 0
677025 2NhdGz9EDv2FeUw6udu2g1 0
Training Models
Now that the training dataset is prepared, we can experiment with various item similarity algorithms. As I mentioned, we will begin with Collaborative Filtering, which has two variations: Matrix Factorization and Collaborative Filtering with Multi-Layer Perceptron (MLP). Both methods follow a similar process: we create a vector representation for each artist, treat this vector as a location in N-dimensional space, and find the K closest artists to a given artist. The main difference lies in how these vectors are computed.
I have tried to make this article as practical as possible. If you are interested in learning more about the underlying theory behind these algorithms, I recommend the following resources:
- The Netflix Tech Blog has an in-depth article on the collaborative filtering algorithms used by Netflix.
- This tutorial from the DataCamp blog provides a practical introduction to implementing collaborative filtering in Python:
- This course from Coursera, "Recommendation Systems and Deep Learning in Python," covers collaborative filtering in detail and includes hands-on exercises:
- Article with theory behind how Collaborative Filtering algorithms work.
We will be utilizing the fastai
library, so let's go ahead and create a Dataset Loader that retrieves data from the created DataFrame:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dls = CollabDataLoaders.from_df(df, item_name='artist_uri', rating_name='rating', username='pid', bs=4096, device=device)
Internally, the Dataset Loader will load the data into a device, encode URIs, shuffle and batch the data, and create a validation subset. Let's make sure it has got the correct data:
data batch
dls.show_batch()
pid artist_uri rating
902723 61lyPtntblHJvA7FMMhi7E 1
28782 7iZtZyCzp3LItcw1wtPI3D 1
41514 11Y54BxlxC3UIAUkU2eadQ 0
682607 4EVpmkEwrLYEg6jIsiPMIb 0
166133 3iri9nBFs9e4wN7PLIetAw 0
886234 7jefIIksOi1EazgRTfW2Pk 1
688805 1f5GqyOPo0CkotzzRwviBu 1
645015 1yAwtBaoHLEDWAnWR87hBT 1
14524 0n94vC3S9c3mb2HyNAOcjg 0
909486 7H55rcKCfwqkyDFH9wpKM6 1
Matrix Factorization Collaborative Filtering
I would like to begin by using the Matrix Factorization Collaborative Filtering item similarity method. Let's create and train the model:
learn = collab_learner(dls, n_factors=50, y_range=(0, 1))
epochs = 5
learning_rate = 5e-3
weight_decay = 0.1
learn.fit_one_cycle(epochs, learning_rate, wd=weight_decay)
Here are model metrics:
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.217431 | 0.217432 | 01:46 |
1 | 0.216207 | 0.216138 | 01:47 |
2 | 0.193768 | 0.194263 | 01:48 |
3 | 0.171074 | 0.176515 | 01:47 |
4 | 0.160071 | 0.174029 | 01:49 |
One of the most important parameters when creating the model is the n_factors
argument. This parameter specifies the size of the N-dimensional vector. A larger value allows the model to capture more concepts, but it will also require more data to be trained effectively.
After completing the model fitting process, we can check the embeddings of the items (artists):
learn.weight(['61lyPtntblHJvA7FMMhi7E'], is_item=True)
tensor([[ 0.1625, -0.3502, -0.5267, -0.4166, -0.5810, 0.8513, -0.4677, 0.8700,
-0.4015, 0.9194, 0.4940, 0.5556, -0.6228, -0.0939, -0.1931, 0.3866,
0.5970, -0.7663, 0.7816, 0.3170, -0.8104, 0.1340, -0.5477, 0.9246,
-0.2875, -0.0404, 0.0587, 0.8526, -0.3287, -0.6467, 0.2383, 0.0480,
0.3321, -0.5684, -0.6459, 0.6053, 0.2226, 0.2793, -0.3024, 0.0718,
-0.4428, -0.1309, -0.2367, -1.0821, 0.3814, -0.0182, -0.1722, 0.4477,
0.1380, 0.3781]])
These vectors may not be interpretable to humans, but we can use them to find similar items by calculating the cosine (or some another) distance between two vectors.
Now, we can construct a similarity matrix using another popular algorithm: a KNN classifier. To do this, we will use the sklearn
library.
The process is as follows: train the KNN model and then, for each artist in the dataset, retrieve the N most similar artists.
Here is how we train the model:
weights = learn.model.i_weight.weight.cpu().detach().numpy()
items = dls.classes['artist_uri'].items
knn = KNeighborsClassifier(n_neighbors=20, algorithm='kd_tree')
knn.fit(weights, items)
Here is how we can retrieve similar items:
item = learn.weight(['3TVXtAsR1Inumwj472S9r4'], is_item=True).cpu().numpy()
knn.kneighbors(item, return_distance=False)
array([[ 520, 1079, 973, 412, 91, 268, 1133, 952, 36, 50, 816,
1137, 886, 670, 775, 48, 644, 317, 346, 148]])
As you can see, the returned items are not URIs but indices, so let's convert them to URIs using the DatasetLoader:
As previously, you can follow https://open.spotify.com/artist/URI link to see artists, see similar items
[dls.classes['artist_uri'].items[i] for i in neighb[0,1:]]
['7CCjtD0hCK005Bvg2WG1a7',
'6d47Z08T4snK50HgTEHo5Z',
'2jku7tDXc6XoB6MO2hFuqg',
'0cGUm45nv7Z6M6qdXYQGTX',
'1tqhsYv8yBBdwANFNzHtcr',
'7kFfY4UjNdNyaeUgLIEbIF',
'6TIYQ3jFPwQSRmorSezPxX',
'0HjYETXAvcL6mzaKjAmH2K',
'0NWbwDZY1VkRqFafuQm6wk',
'5bgfj5zUoWpyeVatGDjn6H',
'7mDU6nMUJnOSY2Hkjz5oqM',
'67nwj3Y5sZQLl72VNUHEYE',
'4TsHKU8l8Wq7n7OPVikirn',
'5G9kmDLg3OeUyj8KVBLzbu',
'0NIIxcxNHmOoyBx03SfTCD',
'4LLpKhyESsyAXpc4laK94U',
'27mFvqQj8KpjmdKIqcw1mG',
'2HPaUgqeutzr3jx5a9WyDV',
'137W8MRPWKqSmrBGDBFSop']
i.e. https://open.spotify.com/artist/7CCjtD0hCK005Bvg2WG1a7
Multi-Layer Perceptron Collaborative Filtering
Now I would like to try a different algorithm for computing embeddings: the Multi-Layer Perceptron.
The data preparation step is the same, as well as instantiation of a DataLoader. The main difference lies in the model definition, with fastai
we can do this:
learn = collab_learner(dls, use_nn=True, layers=[50, 50], y_range=(0, 1))
In this case, we will let the fastai
library to decide on the size of the embeddings and only specify the NN layers.
Let's train the model using the same parameters:
learn.fit_one_cycle(5, 5e-3, wd=0.1)
Model metrics:
epoch | train_loss | valid_loss | time |
---|---|---|---|
0 | 0.167988 | 0.167022 | 07:35 |
1 | 0.142461 | 0.144335 | 07:45 |
2 | 0.127324 | 0.134424 | 07:27 |
3 | 0.099502 | 0.128019 | 07:14 |
4 | 0.056717 | 0.137510 | 07:13 |
Training and using the KNN model is slightly different due to internal structure difference in fastai
embeddings model:
weights = learn.model.embeds[1].weight.cpu().detach().numpy()
items = dls.classes['artist_uri'].items
knn = KNeighborsClassifier(n_neighbors=20, algorithm='kd_tree')
knn.fit(weights, items)
And this is how we use it:
item_idx = dls.classes['artist_uri'].o2i['3TVXtAsR1Inumwj472S9r4']
item = learn.model.embeds[1].weight[[item_idx]].cpu().detach().numpy()
neighb = knn.kneighbors(item, return_distance=False)
array([[ 520, 220, 780, 846, 90, 1081, 387, 1054, 57, 151, 754,
894, 1000, 104, 148, 1029, 1128, 199, 156, 300]])
converted artist URIs
['1Xyo4u8uXC1ZmMpatF05PJ',
'5K4W6rqBFWDnAN6FQUkS6x',
'5pKCCKE2ajJHZ9KAiaK11H',
'0c173mlxpT3dSFRgMO8XPh',
'7CajNmpbOovFoOoasH2HaY',
'2YZyLoL8N0Wb9xBt1NhZWg',
'73sIBHcqh3Z3NyqHKZ7FOL',
'0QHgL1lAIqAw0HtD7YldmP',
'13ubrt8QOOCPljQ2FL1Kca',
'55Aa2cqylxrFIXC767Z865',
'69GGBxA162lTqCwzJG5jLp',
'6l3HvQ5sa6mXTsMTB19rO5',
'0fA0VVWsXO9YnASrzqfmYu',
'137W8MRPWKqSmrBGDBFSop',
'6vWDO969PvNqNYHIOW5v0m',
'7iZtZyCzp3LItcw1wtPI3D',
'1RyvyyTE3xzB2ZywiAwp0i',
'163tK9Wjr9P9DmM0AVK7lm',
'246dkjvS1zLTtiykXe5h60']
Storing similarities
We are now able to query similar items for a given item, so the next step is to compute a similarity matrix and store it in a file which we will use in the second part of this tutorial.
I plan to use the following structure and write it in a plain text:
seed_item1 similar_11 similar_12 similar_13 ...
seed_item2 similar_21 similar_22 similar_23 ...
...
seed_itemN similar_N1 similar_N2 similar_N3 ...
You may find this structure too simple, but in the next part I'll explain why we need it this way.
The first step is to compute the similarity matrix. To do this, we will need a few helper functions, the most notable of them is get_similar
, which queries similar items for a given batch of items. The other functions should be self-explanatory:
def get_emb(learn, items):
if isinstance(learn.model, EmbeddingNN):
emb = learn.model.embeds[1].weight[items]
else:
emb = learn.model.i_weight.weight[items]
return emb.cpu().detach().numpy()
def get_similar(knn, learn, items):
emb = get_emb(learn, items)
return knn.kneighbors(emb, return_distance=False)
def to_uri(dls, i):
return dls.classes['artist_uri'].items[i]
def total_items(dls):
return len(dls.classes['artist_uri'].items)
We can now construct a generator that will compute the similarity matrix, but won't store it in memory. This generator operates in batches when querying the KNN model, but later flattens the batch and iterates over a single row. Querying the KNN model in batches, hopefully, increases the throughput as it can benefit from CPU caches and vector instructions.
def compute_similarity(learn, dls, knn, bs):
items = total_items(dls)
stream = (
(idx, similar_arr)
for start_idx in range(0, items, bs)
for (idx, similar_arr) in zip(
range(start_idx, start_idx+bs),
get_similar(knn, learn, slice(start_idx, start_idx+bs)),
)
)
return (
(to_uri(dls, idx), [to_uri(dls, i) for i in similar if i != idx])
for idx, similar in stream
)
The final piece of the puzzle is a loop that goes through the generator and writes the rows to a file:
def store_similarity_pairs(stream, name):
with open(name, 'w') as f:
for k, v in stream:
f.write(k)
f.write(' ')
f.write(' '.join(v))
f.write('\n')
With that, we have completed the Collaborative Filtering part. Let's move on to another similarity algorithm.
Co-ocurrence Counting
The next similarity algorithm is called Co-ocurrence Counting. Its idea is very simple: it counts the co-occurrences of items within a given window of events. Co-occurrences that happen often (more than the specified threshold) can be considered similar. In the case of Spotify dataset, instead of windows, we can count the co-occurrences of artists within a playlist.
For this algorithm, we are only interested in pairs of playlist ID and artist, and we don't need negative data or ratings.
One good aspect is that we can reuse most of the data processing logic from the previous example. This is how we read and process the data:
allowed_artists = get_common_artists_index(N_10k)
df = read_for_artists(allowed_artists)
df = clean_by_pid(df, 5, 30)
df = preprocess_artists(df)
pid artist_uri
896625 7DMveApC7UnC2NPfPvlHSU
896625 23fqKkggKUBHNkbKtXEls4
896625 45eNHdiiabvmbp4erw26rg
896625 0L8ExT028jH3ddEcZwqJJ5
896625 2wY79sveU1sp5g7SokKOiI
... ...
59361 2Q0MyH5YMI5HPQjFjlq5g3
59361 3XHO7cRUPCLOr6jwp8vsx5
59361 42vg2T0Xg9yPaAgogJzoQH
59361 0YhUSm86okLWldQVwJkLlP
59364 3IYUhFvPQItj6xySrBmZkd
The number of unique artist pairs can increase by several orders of magnitude from the initial dataset size. In such case, using a pair of strings (string, string) is very inefficient and will have a large memory footprint. However, we can encode (factorize) the artist IDs into unique integers and store pairs of integers (int, int) instead. This will reduce the memory footprint by a factor of 10.
df['artist_idx'], i2a = df.artist_uri.factorize()
pid artist_uri artist_idx
896625 7DMveApC7UnC2NPfPvlHSU 0
896625 23fqKkggKUBHNkbKtXEls4 1
896625 45eNHdiiabvmbp4erw26rg 2
896625 01aC2ikO4Xgb2LUpf9JfKp 3
896625 0L8ExT028jH3ddEcZwqJJ5 4
... ... ...
59361 42vg2T0Xg9yPaAgogJzoQH 1348
59361 0YhUSm86okLWldQVwJkLlP 1305
59361 42crL07E4WPfVovyUtMpvC 311
59364 2qFr8w5sWUITRlzZ9kZotF 1330
59364 3IYUhFvPQItj6xySrBmZkd 489
Now, we can generate all present combinations of artist pairs from every playlist:
def coco_pairs(df):
groupped = df.groupby('pid')['artist_idx'].apply(list)
return ((item1, item2) if item1 < item2 else (item2, item1)
for _, arr in groupped.items()
for item1, item2 in itertools.combinations(arr, r=2))
The logic is as follows: we group the dataset by playlist ID and aggregate the artists into a list. Then, for each list, generate combinations, ensuring that the pair is ordered so that (A, B) is the same as (B, A). This is because the order of the pair does not matter. I am using a generator because storing every pair before aggregating them would almost certainly not fit in memory. The output of this generator is a single artist co-occurrence pair from a playlist.
We can now use Python's Counter class to count the co-occurrences of pairs:
c = Counter(coco_pairs(df))
Let's check the most common co-occurrences:
c.most_common(3)
[((77, 434), 9175), ((30, 77), 8889), ((77, 97), 8474)]
And convert the encoded numerical IDs back to artist URIs using the previously saved index:
{i: i2a[i] for i in [77, 434, 30]}
{77: '3TVXtAsR1Inumwj472S9r4',
434: '5K4W6rqBFWDnAN6FQUkS6x',
30: '2YZyLoL8N0Wb9xBt1NhZWg'}
We will create a DataFrame with three columns: item1, item2, and count. Then, group the DataFrame by item1 and get the item2 values that are similar to it, so the item order now matters. Therefore, we will need to duplicate the counter and store both the (A, B) pair and the (B, A) pair:
o1 = ((i1, i2, c[i1, i2]) for (i1, i2) in c)
o2 = ((i2, i1, c[i1, i2]) for (i1, i2) in c)
coco = pd.DataFrame(itertools.chain(o1, o2), columns=['item1', 'item2', 'count'])
item1 item2 count
869 1623 8
535 869 8
536 869 10
224 869 131
869 1279 6
... ... ...
811 120 1
251 120 1
1119 120 1
1352 120 1
1582 1566 1
Let's apply a threshold and see what similarities the co-occurrence counting model yields:
similarities = res[res['count'] > 1000]
item1 item2 count
142 714 1085
236 280 1531
142 236 1046
280 471 1105
142 280 1349
... ... ...
450 252 1085
219 218 1379
450 198 1116
677 1 1162
677 259 1347
And convert encoded ints for the first result:
i2a[142], i2a[714]
('1dfeR4HaWDbWqFHLkxsg1d', '26bcq2nyj5GB7uRr558iQg')
This might be a final solution, but we're overlooking an important aspect when we filter pairs in this way: less popular artists will always have fewer co-occurrences, yet they may still be similar to each other. To address this, we can add a normalisation step.
First, let's count the total number of mentions of each artist:
item_counts = res.groupby('item1')['count'].sum()
item1
0 71840
1 193403
2 29229
3 106297
4 182767
...
1168 18829
1169 9948
1170 24442
1171 12039
1172 8835
Then, we can add a normalised count to each pair, which is computed as pair count diveded by a sum of total counts of individual items:
pair_count = res['count']
item1_total = res.item1.map(item_counts)
item2_total = res.item2.map(item_counts)
res['normalised_count'] = pair_count / (item1_total + item2_total)
This normalised count penalises popular artists and boosts less popular.
item1 item2 count normalised_count
320 714 10 0.000180
236 714 178 0.001518
280 714 202 0.001310
471 714 348 0.003135
714 1071 224 0.003058
... ... ... ...
1088 616 1 0.000015
616 492 1 0.000016
616 517 1 0.000021
710 660 1 0.000030
591 219 1 0.000012
If we apply the threshold to the normalised count, we can see how different pairs that were previously missing or present have been affected:
similarities = res[res.normalised_count > 0.004]
item1 item2 count normalised_count
472 714 312 0.004030
142 714 1085 0.005807
647 714 605 0.005391
236 280 1531 0.008224
236 471 915 0.006402
... ... ... ...
1172 1169 111 0.005910
1169 1166 89 0.005227
1135 1093 206 0.010754
1137 744 178 0.004037
1169 696 78 0.004469
For example, the last pair (1169, 696) which has only 78 co-occurrences would have certainly been filtered out previously. These artist URIs are shown below, you can judge for yourself how similar they are:
i2a[1169], i2a[696]
('4sTFGCigAQIUiEy8wSSQNF', '1YSA4byX5AL1zoTsSTlB03')
Storing similarities
Let's now create the similarity matrix and write it to a file in the same format as before:
similar_items = similarities.groupby('item1').item2.apply(list)
item1
0 [1, 408]
1 [200, 252, 259, 57, 202, 408, 896, 66, 467, 285, 334, 677, 0]
2 [333, 284, 331, 858, 330, 674]
3 [234, 276, 648, 767, 238, 631]
4 [126, 204, 228, 998, 506, 167, 131, 326, 1122, 103, 192, 288]
...
1169 [1172, 1167, 862, 863, 864, 1165, 779, 726, 731, 636, 1059, 1166, 696]
1170 [608, 145, 1038, 473, 989, 147, 387]
1171 [700, 746, 1074, 744, 602, 1154, 114, 745, 1130, 747, 611]
1172 [696, 1091, 1166, 744, 1165, 1169]
One downside of this algorithm is that it doesn't provide a guaranteed number of recommendations. For example, item {idx=0} has only 2 similar items, while item {idx=1169} has 13, etc.
And finally we can write them to the text file:
fname = 'coco.txt'
with open(fname, 'w') as f:
for k, v in similar.items():
f.write(i2a[k])
f.write(' ')
v = [i2a[i] for i in v[:20]]
f.write(' '.join(v))
f.write('\n')
Since this article is getting quite long, I'll leave it as an exercise for you to model the track similarities.
This concludes the section on finding similar items. We overviewed how to build similarity matrices using Collaborative Filtering and Co-occurence counting algorithms.
You can find an example matrix files in this GitHub Gist.
You can also try other algorithms that produce embeddings, such as Word2Vec, or even possibly content-based algorithms that take artist or track features into account.
Share your artist or track similarity results in the comments!
In the next chapter, I'll cover how we can use similarities to retrieve recommendations. If you don't want to miss it, be sure to hit the follow button.
Thank you for reading this far. I hope it was interesting!
Top comments (1)
Waiting for the next chapter, thank you