DEV Community

Saint

Build Your First Movie Recommendation Engine in Python

Ever wonder how Netflix or Spotify seems to know exactly what you want to watch or listen to next? It's not magic; it's a recommendation system at work. In this post, we'll pull back the curtain and build a simple movie recommender from scratch using Python.

We'll use a popular technique called Collaborative Filtering. The idea is simple: "Show me what people like me also like." Instead of analysing movie genres or actors, we'll just look at user ratings to find "taste twins" and recommend movies based on what they enjoyed.

Step 1: Get the Data

We'll use the classic MovieLens 100k dataset, which contains 100,000 ratings from 943 users on 1,682 movies. First, let's load the data into pandas DataFrames. We need two files: u.data for the ratings and u.item for the movie titles. The paths in the code assume the dataset has been extracted into an ml-100k/ folder in your working directory.

import pandas as pd

# Define column names for the data
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']

# Load the data into pandas DataFrames
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5), encoding='latin-1')

# Merge the two DataFrames into one, joining on their shared movie_id column
movie_data = pd.merge(movies, ratings, on='movie_id')

Step 2: Create the User-Item Matrix

To find users with similar tastes, we need to restructure our data. We'll create a user-item matrix, where each row represents a user, each column represents a movie, and the cells contain the ratings. Most of this matrix will be empty (NaN), because most users haven't rated most movies. This is known as a sparse matrix.

The pandas pivot_table function is perfect for this job.

# Create the user-item matrix
user_item_matrix = movie_data.pivot_table(index='user_id', columns='title', values='rating')

# Let's see what it looks like
print(user_item_matrix.head())
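To get a feel for how empty such a matrix is, you can count the filled cells. The sketch below uses a tiny made-up matrix (toy_matrix and its movie names are illustrative, not part of MovieLens); the same two lines of arithmetic should work on user_item_matrix.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_item_matrix: 3 users x 4 movies, NaN = not rated
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, np.nan, 4.0],
        "Movie B": [np.nan, 3.0, np.nan],
        "Movie C": [1.0, np.nan, np.nan],
        "Movie D": [np.nan, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

n_rated = int(toy_matrix.notna().sum().sum())  # cells that hold a rating
n_cells = toy_matrix.size                      # users * movies
sparsity = 1 - n_rated / n_cells               # fraction of empty cells

print(f"{n_rated} of {n_cells} cells filled, sparsity = {sparsity:.1%}")
```

With the figures quoted in Step 1, 100,000 ratings spread over 943 x 1,682 cells means roughly 94% of the full matrix would be empty.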

Step 3: Find Similar Users

Now for the core logic: measuring similarity. We'll use the Pearson correlation coefficient. This metric measures the linear relationship between two sets of data, with a score from -1 (opposite tastes) to +1 (identical tastes).

A significant advantage of Pearson correlation is that it automatically accounts for user rating bias. It understands that one user's "4 stars" might be another's "3 stars" by comparing how ratings deviate from each user's personal average.
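A quick way to see this is with three made-up users who rated the same five movies (the names and ratings below are illustrative only). The strict user gives exactly one star less than the generous user every time, and the contrarian has the reverse pattern.

```python
import pandas as pd

# Hypothetical ratings for the same five movies
generous = pd.Series([5, 4, 5, 3, 4], index=list("ABCDE"))
strict = generous - 1        # same pattern, one star lower throughout
contrarian = 6 - generous    # reversed taste on a 1-5 scale

# Series.corr uses Pearson correlation by default
same_taste = generous.corr(strict)
opposite_taste = generous.corr(contrarian)

print(same_taste, opposite_taste)
```

The first correlation comes out at (approximately) +1 even though the two users never give the same number of stars, and the second at (approximately) -1.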

The corrwith() method in pandas makes this calculation easy. We'll pick a target user and find others who have similar rating patterns.

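# Choose a target user (e.g., user_id 25)
target_user_ratings = user_item_matrix.loc[25].dropna()

# Find users similar to our target user.
# axis=1 correlates each row (user) with the target, matching on movie titles;
# the default axis=0 would compare movie columns against the target instead.
similar_users = user_item_matrix.corrwith(target_user_ratings, axis=1)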

# Create a DataFrame of the results and clean it up
similarity_df = pd.DataFrame(similar_users, columns=['similarity'])
similarity_df = similarity_df.dropna()

# Display the top 10 most similar users
print(similarity_df.sort_values(by='similarity', ascending=False).head(10))
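One thing to watch for: a correlation computed from only a couple of shared movies can look perfect by accident. The sketch below uses a made-up four-user matrix to compare users row by row (axis=1) and count how many movies each user shares with the target. The overlap column is my own addition for illustration, not something the steps above require.

```python
import numpy as np
import pandas as pd

# Toy stand-in for user_item_matrix (hypothetical ratings)
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 5.0],
        "Movie B": [3.0, 2.0, 5.0, np.nan],
        "Movie C": [4.0, 3.0, 2.0, 4.0],
        "Movie D": [1.0, np.nan, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

target = toy_matrix.loc[1].dropna()

# axis=1 correlates each row (user) with the target, matching on movie titles
similarity = toy_matrix.corrwith(target, axis=1)

# Number of movies each user has rated in common with the target
overlap = toy_matrix[target.index].notna().sum(axis=1)

print(pd.DataFrame({"similarity": similarity, "overlap": overlap}))
```

User 4 ends up with a similarity of about 1.0 from just two shared movies, so you may want to require a minimum overlap before trusting a neighbour.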

Step 4: Generate Recommendations

We have our "taste twins," so what's next?

  1. Form a "Neighbourhood": Select the top k most similar users (e.g., top 50).
  2. Find Candidate Movies: Gather all the movies rated by users in the neighbourhood, but exclude movies our target user has already seen.
  3. Score the Candidates: Calculate a predicted score for each candidate movie. We'll use a weighted average: a rating from a highly similar user carries more weight than a rating from a less similar one.
  4. Rank and Recommend: Sort the movies by their predicted score and return the top n recommendations.
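Here is the weighted average from step 3 worked through for a single candidate movie. The three (similarity, rating) pairs are made up for illustration.

```python
# Hypothetical neighbours who rated one candidate movie: (similarity, rating)
neighbour_ratings = [(0.9, 5), (0.5, 3), (0.2, 4)]

# Each rating is weighted by how similar that neighbour is to the target user
numerator = sum(sim * rating for sim, rating in neighbour_ratings)
denominator = sum(abs(sim) for sim, _ in neighbour_ratings)
predicted_score = numerator / denominator

# For comparison: the plain, unweighted mean of the same ratings
simple_mean = sum(rating for _, rating in neighbour_ratings) / len(neighbour_ratings)

print(round(predicted_score, 2), round(simple_mean, 2))
```

The predicted score is about 4.25, higher than the plain mean of 4.0, because the most similar neighbour gave the movie 5 stars.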

Putting It All Together

Let's wrap this logic into a single function.

def generate_recommendations(user_id, user_item_matrix, k=50, n=10):
    """Generates movie recommendations for a user."""

    # 1. Calculate user similarity
    target_user_ratings = user_item_matrix.loc[user_id].dropna()
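    # axis=1 compares users (rows) with the target, matching on movie titles
    similar_users = user_item_matrix.corrwith(target_user_ratings, axis=1)
    # Drop the target user from their own neighbour list (ignored if already removed as NaN)
    similarity_df = pd.DataFrame(similar_users, columns=['similarity']).dropna().drop(user_id, errors='ignore')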

    # 2. Find the neighborhood (top k similar users)
    neighborhood = similarity_df[similarity_df['similarity'] > 0].sort_values(by='similarity', ascending=False).head(k)

    # 3. Identify candidate movies
    watched_movies = user_item_matrix.loc[user_id].dropna().index

    candidate_movies = set()
    for user in neighborhood.index:
        neighbor_watched = user_item_matrix.loc[user].dropna().index
        candidate_movies.update(neighbor_watched)

    candidate_movies = list(candidate_movies.difference(watched_movies))

    # 4. Calculate recommendation scores
    recommendation_scores = {}
    for movie in candidate_movies:
        numerator = 0
        denominator = 0
        for user, data in neighborhood.iterrows():
            if not pd.isna(user_item_matrix.loc[user, movie]):
                rating = user_item_matrix.loc[user, movie]
                similarity = data['similarity']
                numerator += similarity * rating
                denominator += abs(similarity)

        if denominator > 0:
            recommendation_scores[movie] = numerator / denominator

    # 5. Rank and return top N recommendations
    recommendations_df = pd.DataFrame.from_dict(recommendation_scores, orient='index', columns=['predicted_score'])
    return recommendations_df.sort_values(by='predicted_score', ascending=False).head(n)

# Let's get recommendations for our target user!
recommendations = generate_recommendations(25, user_item_matrix)
print("Top 10 recommendations for user 25:")
print(recommendations)
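The nested loops in the scoring step are easy to read but can be slow on larger data. As a rough sketch, the same weighted average can be computed with pandas operations instead. The helper name score_candidates and the toy data are my own, so treat this as one possible variant rather than the only way to do it.

```python
import numpy as np
import pandas as pd

def score_candidates(user_item_matrix, neighborhood, watched_movies):
    """Hypothetical vectorized version of the scoring loop (step 4)."""
    # Neighbours' ratings only, without movies the target user already rated
    neighbor_ratings = user_item_matrix.loc[neighborhood.index].drop(columns=watched_movies)
    weights = neighborhood["similarity"]

    # Sum of similarity * rating per movie; missing ratings are skipped
    numerator = neighbor_ratings.mul(weights, axis=0).sum(axis=0)
    # Sum of |similarity| over the neighbours who actually rated each movie
    rated_mask = neighbor_ratings.notna().astype(float)
    denominator = rated_mask.mul(weights.abs(), axis=0).sum(axis=0)

    # Movies nobody in the neighbourhood rated give 0/0 = NaN and are dropped
    scores = (numerator / denominator).dropna()
    return scores.sort_values(ascending=False)

# Toy data: user 1 is the target; users 2 and 3 are the neighbours
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 5.0],
        "Movie B": [np.nan, 5.0, 3.0],
        "Movie C": [np.nan, np.nan, 4.0],
        "Movie D": [np.nan, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)
toy_neighborhood = pd.DataFrame(
    {"similarity": [0.9, 0.5]}, index=pd.Index([2, 3], name="user_id")
)
watched = toy_matrix.loc[1].dropna().index

scores = score_candidates(toy_matrix, toy_neighborhood, watched)
print(scores)
```

In this toy run, Movie B ranks above Movie C, and Movie D gets no score because no neighbour rated it.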

What's Next?

And there you have it, your very own recommendation engine! While this is a simple model, it serves as the foundation for many real-world systems. It has limitations, like the "cold start" problem (what do you recommend to a new user with no ratings?), but it's a fantastic starting point.
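One common workaround for a brand-new user is to fall back on movies that are popular with everyone. The sketch below is just one way to do that and is not part of the recommender above: the popular_movies helper, its min_ratings threshold, and the toy data are all assumptions for illustration. It expects a DataFrame with title and rating columns, like movie_data.

```python
import pandas as pd

def popular_movies(movie_data, min_ratings=2, n=3):
    """Hypothetical fallback: best-rated titles with at least min_ratings ratings."""
    stats = movie_data.groupby("title")["rating"].agg(["mean", "count"])
    # Ignore titles with too few ratings, where the mean is unreliable
    eligible = stats[stats["count"] >= min_ratings]
    return eligible.sort_values(by=["mean", "count"], ascending=False).head(n)

# Toy ratings (made up); Movie C has a perfect mean but only one rating
toy_data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie B", "Movie B", "Movie C"],
    "rating": [5, 4, 3, 4, 3, 5],
})

fallback = popular_movies(toy_data)
print(fallback)
```

The minimum-ratings filter keeps a single 5-star rating from pushing an obscure title to the top of the list.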

Try it out for yourself! Change the user_id, tweak the neighbourhood size (k), or apply this logic to a different dataset. Happy coding!
