Transitioning from a pre-medical background to Electrical Engineering at NUST taught me one thing: Math is the universal language of logic. Recently, I decided to dive deep into Machine Learning to build a Content-Based Movie Recommender System.
In this post, I'll walk you through how I used NLP and Cosine Similarity to suggest movies based on user preferences.
The Tech Stack
Language: Python
Libraries: Pandas, NumPy, Scikit-learn, NLTK
Dataset: TMDB 5000 Movies Dataset
The Workflow
Data Cleaning & Feature Selection
The first step was to merge the datasets and extract relevant features like genres, keywords, cast, and crew. I created a "tags" column that combines all these textual descriptions.
Text Preprocessing (Stemming)
To make sure "action" and "actions" are treated the same, I used NLTK's PorterStemmer.
Python
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# The stemmer is then applied to every entry in the tags column
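To make the tag-building and stemming steps above concrete, here is a minimal sketch. The toy DataFrame, the helper name stem_text, and the assumption that each feature has already been parsed into a list of strings are mine for illustration; the real notebook may structure this differently.

```python
import pandas as pd
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()


def stem_text(text: str) -> str:
    """Lowercase the text and stem each whitespace-separated token."""
    return " ".join(ps.stem(word) for word in text.lower().split())


# Toy stand-in for the merged dataset; these rows are illustrative only.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [["Action", "Adventure"], ["Romance"]],
    "keywords": [["space", "battles"], ["love", "letters"]],
    "cast": [["Actor One"], ["Actor Two"]],
    "crew": [["Director One"], ["Director Two"]],
})

feature_cols = ["genres", "keywords", "cast", "crew"]

# Flatten the per-feature lists into one space-separated string per movie.
movies["tags"] = movies[feature_cols].apply(
    lambda row: " ".join(word for col in feature_cols for word in row[col]),
    axis=1,
)

# Stem every tag string so that "action" and "actions" collapse to one token.
movies["tags"] = movies["tags"].apply(stem_text)

print(movies[["title", "tags"]])
```

Stemming before vectorization keeps the vocabulary smaller, since inflected forms of the same word share a single column instead of splitting their counts.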
Vectorization (Bag of Words)
I converted the text tags into 5,000-dimensional vectors using CountVectorizer, removing common English stop words.
The Mathematical Engine: Cosine Similarity
Instead of Euclidean distance, I used Cosine Similarity, which compares the angle between two movie vectors rather than the straight-line distance between them. The smaller the angle (a score closer to 1), the more similar the movies!
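Here is a rough sketch of how the vectorization and similarity steps fit together. The three toy movies and the recommend helper are made up for illustration; only CountVectorizer with 5,000 features, English stop words, and cosine similarity come from the steps described above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for the stemmed "tags" column (illustrative only).
titles = ["Space Fight", "Galaxy War", "Love Letter"]
tags = [
    "space battl alien action",
    "space war alien action",
    "romanc love letter drama",
]

# Bag of words: keep at most 5,000 terms and drop common English stop words.
cv = CountVectorizer(max_features=5000, stop_words="english")
vectors = cv.fit_transform(tags)  # sparse matrix, one row per movie

# Pairwise cosine similarity: entry [i, j] compares movie i with movie j.
similarity = cosine_similarity(vectors)


def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given one (hypothetical helper)."""
    idx = titles.index(title)
    ranked = np.argsort(similarity[idx])[::-1]  # highest score first
    return [titles[i] for i in ranked if i != idx][:top_n]


print(recommend("Space Fight", top_n=1))
```

Cosine similarity ignores vector length, so a movie with a long tag string is not penalized simply for containing more words than a movie with a short one.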
Key Challenges
The biggest hurdle was managing the large similarity matrix in a cloud environment. Dealing with memory limits and "truncated files" taught me a lot about efficient data handling and the importance of proper serialization using pickle.
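As a small sketch of the serialization step, assuming a NumPy similarity matrix: the random matrix, the file name similarity.pkl, and the float32 downcast are my own assumptions, not necessarily what the project does. The post does not say what caused the truncated files, so treat this as one defensive pattern rather than the actual fix.

```python
import os
import pickle
import tempfile

import numpy as np

# Small random matrix standing in for the real similarity matrix.
rng = np.random.default_rng(0)
similarity = rng.random((100, 100))

# Assumption: float32 precision is enough for ranking, and it halves memory use.
similarity32 = similarity.astype(np.float32)

# Hypothetical file name, written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "similarity.pkl")

# The "with" block closes the file on exit, so the data is flushed to disk.
with open(path, "wb") as f:
    pickle.dump(similarity32, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Sanity check: the round trip should reproduce the matrix exactly.
print(np.array_equal(loaded, similarity32), os.path.getsize(path))
```

Checking the reloaded matrix right after writing it is a cheap way to catch an incomplete file before it gets uploaded anywhere.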
Conclusion & Future Scope
This project was a fantastic way to apply linear algebra and NLP concepts. My next step is to deploy this as a full web app and integrate movie posters via API.
Check out the full source code on my GitHub: https://github.com/Urooj25/Movie-Recommender-System.git
Let's Connect!
I'm always open to feedback and collaboration. Drop a comment or connect with me on LinkedIn!