When I wanted to build an item-item recommendation system from implicit ratings (e.g. views and clicks, as opposed to explicit ratings such as a movie score out of 5), Matrix Factorization (MF) with Bayesian Personalized Ranking (BPR) is what Opus 4.5 recommended I start with. Since understanding these systems helps with identifying and debugging issues, I'll walk through the crucial information I learned over the span of a month (on-and-off) until I understood enough to fix problems and evaluate model performance. I've since switched to LightFM, which mitigates the cold-start problem and has better recommendation performance, but this knowledge should still be valuable whether you choose MF + BPR or LightFM.
Problem
I used this algorithm to solve the following problem: given an item i, recommend K items similar to it. The K items can be shown as recommendations alongside item i, but the task is strictly to find items related to i. Items can be things like Amazon products, Netflix movies, or Spotify songs.
The Matrix Factorization + BPR Algorithm
Matrix Factorization with BPR allows us to accomplish this in a series of steps:
- Create an interaction matrix containing (user, item) pairs. If a user interacted with an item, the interaction matrix at (user, item) is 1. Omit items users don't interact with from the interaction matrix; these items (negative items) are sampled during the training loop.
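Because the matrix only stores positive interactions, it's sparse and can be represented as a map from each user to the set of items they touched. A minimal sketch (the names `interactions` and `addInteraction` are mine, not from any library):

```javascript
// Store only the positive (user, item) pairs: a map from user ID to the
// set of item IDs that user interacted with.
const interactions = new Map();

function addInteraction(user, item) {
  if (!interactions.has(user)) interactions.set(user, new Set());
  interactions.get(user).add(item);
}

addInteraction("alice", "itemA");
addInteraction("alice", "itemB");
addInteraction("bob", "itemA");

// Negative items are simply items absent from a user's set.
const aliceLiked = interactions.get("alice");
console.log(aliceLiked.has("itemB")); // true  (positive item)
console.log(aliceLiked.has("itemC")); // false (candidate negative item)
```

Anything not in a user's set is a candidate negative item to sample later.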
- Use Xavier initialization to randomly initialize the user vectors and item vectors. There should be one user vector for every user assigned a row in the interaction matrix and one item vector for every item assigned a column.
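As a sketch of this step: Xavier (Glorot) uniform initialization draws entries from a range scaled by the layer's fan-in and fan-out. Treating each K-dimensional embedding as having fan-in = fan-out = K is one reasonable choice here; the exact scaling is my assumption, not something the algorithm mandates.

```javascript
// Xavier uniform: entries drawn from [-limit, limit],
// limit = sqrt(6 / (fanIn + fanOut)), with fanIn = fanOut = K here.
function xavierVector(K) {
  const limit = Math.sqrt(6 / (K + K));
  return Array.from({ length: K }, () => (Math.random() * 2 - 1) * limit);
}

function initFactors(numRows, K) {
  return Array.from({ length: numRows }, () => xavierVector(K));
}

const K = 16;
const userFactors = initFactors(100, K); // one vector per interaction-matrix row
const itemFactors = initFactors(50, K);  // one vector per interaction-matrix column
```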
- Run a training loop. For all users and items interacted with by the user (not recommended, please see Edit 2),
- Sample a (user, positive item i, negative item j) triplet. A positive item is an item the user interacted with and a negative item is an item the user did not interact with.
- Compute a term, x-hat_{ui}, as the dot product of user vector u with positive item vector i. Compute another term, x-hat_{uj}, as the dot product of user vector u with negative item vector j. Define x-hat_{uij} := x-hat_{ui} - x-hat_{uj}. The loss function is defined as -Math.log(sigmoid(x-hat_{uij}) + 1e-10). The 1e-10 is there to avoid computing log(0).
- Perform gradient descent. Store the current values u, pi, ni (the k-th entries of the user, positive item, and negative item vectors), then with a pre-defined learning rate lr and number of factors K, for every 0 <= k < K:
userFactor[k] += lr * (sigmoidNegDiff * (pi - ni) - regularization * u)
posFactor[k] += lr * (sigmoidNegDiff * u - regularization * pi)
negFactor[k] += lr * (-sigmoidNegDiff * u - regularization * ni)
Edit 1: sigmoidNegDiff in my code is sigmoid(-diff) and diff is x-hat_{uij}.
Edit 2: The BPR 2009 paper advises against sampling in this way. It instead suggests uniform random sampling across all (user, positive item, negative item) triplets.
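Putting the loss and the three update rules together, one SGD step over a sampled triplet might look like this. This is a sketch under my own naming (`lr`, `reg`, `bprStep`); the update lines match the rules above.

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const dot = (a, b) => a.reduce((s, v, k) => s + v * b[k], 0);

// One BPR-SGD step for a sampled (user, positive item, negative item) triplet.
function bprStep(userFactor, posFactor, negFactor, lr, reg) {
  const xuij = dot(userFactor, posFactor) - dot(userFactor, negFactor);
  const loss = -Math.log(sigmoid(xuij) + 1e-10);
  const sigmoidNegDiff = sigmoid(-xuij); // large when the ranking is wrong

  for (let k = 0; k < userFactor.length; k++) {
    // Store the pre-update entries so all three updates see the same state.
    const u = userFactor[k], pi = posFactor[k], ni = negFactor[k];
    userFactor[k] += lr * (sigmoidNegDiff * (pi - ni) - reg * u);
    posFactor[k]  += lr * (sigmoidNegDiff * u - reg * pi);
    negFactor[k]  += lr * (-sigmoidNegDiff * u - reg * ni);
  }
  return loss;
}
```

Running this step repeatedly on sampled triplets pushes x-hat_{ui} above x-hat_{uj} for the sampled pairs.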
We now have latent user vectors and latent item vectors, with every entry being a factor representing something about the user or item (these factors aren't manually labelled).
With cosine similarity (the dot product of two vectors normalized by the product of their magnitudes), it's possible to find the closest-matching item vectors to an item as well as the closest-matching user vectors to a user.
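A sketch of the related-items lookup, assuming `itemFactors` holds the trained item vectors (the `topKSimilar` helper is mine):

```javascript
const dot = (a, b) => a.reduce((s, v, k) => s + v * b[k], 0);
const norm = (a) => Math.sqrt(dot(a, a));
const cosine = (a, b) => dot(a, b) / (norm(a) * norm(b));

// Rank every other item by cosine similarity to the query item's vector.
function topKSimilar(queryIdx, itemFactors, K) {
  return itemFactors
    .map((vec, idx) => ({ idx, score: cosine(itemFactors[queryIdx], vec) }))
    .filter(({ idx }) => idx !== queryIdx) // exclude the query item itself
    .sort((a, b) => b.score - a.score)
    .slice(0, K);
}

const itemFactors = [
  [1, 0],     // item 0
  [0.9, 0.1], // item 1: nearly parallel to item 0
  [0, 1],     // item 2: orthogonal to item 0
];
console.log(topKSimilar(0, itemFactors, 2)); // item 1 ranks above item 2
```

For large catalogs you'd swap the linear scan for an approximate nearest-neighbor index, but the math is the same.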
Intuition
The dot product represents how much two vectors "agree" with each other: two vectors have a high dot product when their corresponding entries agree in sign and magnitude.
x_{uij} is a score representing whether user u prefers item i over item j: the preference holds if the score is positive and is reversed if it's negative. Since the sigmoid function squashes a number x into a value between 0 and 1, sigmoid(x_{uij}) can be read as the model's estimate of P(i > j | u). If x_{uij} is large and positive, sigmoidNegDiff = sigmoid(-diff) is near 0 and the update to posFactor is minimal. The result is that incorrect preference rankings produce larger model parameter updates.
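A quick numeric illustration of that multiplier (the score 4 is an arbitrary value I chose for the example):

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Confidently correct ranking (x_{uij} = 4): tiny update multiplier.
console.log(sigmoid(-4).toFixed(4)); // 0.0180

// Confidently wrong ranking (x_{uij} = -4): multiplier approaches 1.
console.log(sigmoid(4).toFixed(4)); // 0.9820
```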
Note that gradient descent moves userFactor • posFactor (x_{ui}) away from userFactor • negFactor (x_{uj}). It pushes x_{uij} positive; in other words, it makes the score x_{ui} greater than the score x_{uj}.
Common Issues and their Fixes
I ran into co-exposure bias (an item j gets recommended for item i just because they are visible on the same page) while implementing MF + BPR. An item from a trending page would only have other trending items as its recommendations. There are several mitigations for this:
- Inverse Propensity Weighting: multiply the loss function by 1/P(exposure).
- Remove non-organic interactions from the interaction matrix.
- Add a bias term that absorbs general tendencies for users to prefer specific items (for example, popular products on the Amazon website).
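As a concrete sketch of the first mitigation, inverse propensity weighting scales each triplet's loss by the reciprocal of its exposure probability. The popularity-based propensity here is a naive stand-in; real propensity models are more involved, and the clipping threshold is my assumption.

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Weight the BPR loss by 1 / P(exposure), clipping the propensity so
// rarely-exposed items don't produce enormous gradients.
function ipwLoss(xuij, exposureProb) {
  const weight = 1 / Math.max(exposureProb, 1e-3);
  return weight * -Math.log(sigmoid(xuij) + 1e-10);
}

// A trending item seen by 90% of users contributes far less per
// interaction than a niche item seen by 2% of users.
console.log(ipwLoss(0.5, 0.9));  // small contribution
console.log(ipwLoss(0.5, 0.02)); // ~45x larger contribution
```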
Performance and Tuning
Two metrics commonly used for evaluating MF + BPR model performance are Recall@K and HR@K (Hit Rate@K). Hit Rate@K, according to ChatGPT, is defined as "The fraction of users for whom at least one relevant item appears in the top-K recommendations." Recall@K is defined as "The fraction of a user's relevant items that appear in the top-K recommendations."
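Both definitions translate directly into code. A sketch, where `allRecs` maps each user to their ranked recommendation list and `allRelevant` maps each user to their held-out relevant items (both structures and names are mine):

```javascript
// Recall@K for one user: fraction of relevant items found in the top K.
function recallAtK(recs, relevant, K) {
  const topK = new Set(recs.slice(0, K));
  const hits = [...relevant].filter((item) => topK.has(item)).length;
  return hits / relevant.size;
}

// HR@K: fraction of users with at least one relevant item in the top K.
function hitRateAtK(allRecs, allRelevant, K) {
  const users = Object.keys(allRelevant);
  const withHit = users.filter((u) => recallAtK(allRecs[u], allRelevant[u], K) > 0);
  return withHit.length / users.length;
}

const allRecs = { alice: ["a", "b", "c"], bob: ["x", "y", "z"] };
const allRelevant = { alice: new Set(["b", "q"]), bob: new Set(["q"]) };
console.log(recallAtK(allRecs.alice, allRelevant.alice, 3)); // 0.5 (1 of 2 relevant items found)
console.log(hitRateAtK(allRecs, allRelevant, 3));            // 0.5 (alice hit, bob missed)
```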
One common way to test the model is Leave-One-Out. If a user interacted with items 1, 2, 3, and 4, it's possible to train the model on items 1-3 and check whether item 4 appears in that user's top-K recommendations; doing this across all users gives the aforementioned HR@K.
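The split itself can be sketched like this. Holding out each user's last interaction (e.g. the most recent, if interactions are ordered by timestamp) is an assumption on my part; the choice of which interaction to hold out varies.

```javascript
// Leave-one-out: hold out one interaction per user for evaluation,
// train on the rest.
function leaveOneOutSplit(userItems) {
  const train = {}, test = {};
  for (const [user, items] of Object.entries(userItems)) {
    train[user] = items.slice(0, -1);      // all but the last interaction
    test[user] = items[items.length - 1];  // the held-out item
  }
  return { train, test };
}

const { train, test } = leaveOneOutSplit({ alice: [1, 2, 3, 4] });
console.log(train.alice); // [ 1, 2, 3 ]
console.log(test.alice);  // 4
```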
I hope you enjoyed this blog article and/or found it helpful.