CineMatch: How I Built a Movie Recommendation System That Doesn't Need a Single User Rating
Table of Contents
- What CineMatch Actually Does
- The Real Challenge: Doing This Without Burning Memory
- How It's Put Together
- The Model and the Toolkit
- A Few Things Worth Noticing
- The Part That Surprised Me
- Try It and Tell Me What's Off
Most recommendation engines you've used (Netflix, YouTube, Spotify) learn from millions of people clicking, watching, and rating things. Mine doesn't have that luxury. It works on text alone, and that one constraint shaped almost every decision in this project.
View my work on
- Personal Portfolio
- GitHub Profile
- LinkedIn Network
What CineMatch Actually Does
CineMatch is a content-based movie recommendation system. You give it a movie you like, and it returns similar titles, not because other users rated them similarly, but because the movies themselves share genres, release era, and identifying metadata. Under the hood, every movie in the dataset gets converted into a "tag" signature, and the system finds the closest matches using a Bag-of-Words model and cosine similarity.
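To make the "tag signature" idea concrete, here is a minimal pure-Python sketch of Bag-of-Words plus cosine similarity. The tag strings are made up for illustration and are not rows from the real dataset, and the project itself uses scikit-learn rather than hand-rolled counters.

```python
import math
from collections import Counter


def bag_of_words(tags: str) -> Counter:
    """Count how often each lowercase token appears in a tag string."""
    return Counter(tags.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tag strings, for illustration only.
heat = bag_of_words("heat crime drama thriller 1995")
ronin = bag_of_words("ronin crime thriller action 1998")
shrek = bag_of_words("shrek animation comedy family 2001")

print(cosine_similarity(heat, ronin))  # two of five tokens shared
print(cosine_similarity(heat, shrek))  # no tokens shared
```

Two movies score higher the more tokens their tags have in common, which is all "similar" means in this setup.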
It runs on a dataset of roughly 16,250 IMDb movies spanning 1980 to 2026, served through a FastAPI backend with a lightweight frontend on top. You can try it live at cin-match-ai.vercel.app, or read through the full build in the GitHub repo.
Why does content-based matter? Because it sidesteps the "cold start" problem that trips up rating-based systems. A brand-new movie with zero reviews can still get recommended the day it's added, since the system never needed ratings in the first place.
The Real Challenge: Doing This Without Burning Memory
The obvious way to compute similarity between every pair of movies is to build one giant matrix comparing all of them to each other. With around 16,250 movies, that's a matrix of roughly 264 million cells sitting in memory at all times, which is expensive for not much benefit.
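A quick back-of-the-envelope check shows where that number comes from and what it would cost as a dense array. The byte sizes are the standard widths of 64-bit and 32-bit floats. Which dtype a precomputed matrix would actually use is an assumption here.

```python
n_movies = 16_250  # approximate dataset size quoted above

cells = n_movies * n_movies   # one similarity score per pair of movies
bytes_float64 = cells * 8     # 8 bytes per 64-bit float
bytes_float32 = cells * 4     # 4 bytes per 32-bit float

gib = 1024 ** 3
print(f"{cells:,} cells")                           # 264,062,500 cells
print(f"float64: {bytes_float64 / gib:.2f} GiB")    # float64: 1.97 GiB
print(f"float32: {bytes_float32 / gib:.2f} GiB")    # float32: 0.98 GiB
```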
The fix was to flip the order of operations. Instead of precomputing everything, the app keeps only the sparse vectorized version of the dataset in memory and computes cosine similarity for a single row (the movie you just searched for) at the moment you ask for it. That spends a little compute on each request in exchange for a much smaller memory footprint, and the trade holds up well at this dataset size: the live demo reports under 8 MB of in-memory overhead and recommendation requests answering in around 0.12 seconds.
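The on-demand lookup can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the project's actual code: the titles, tag strings, and the `recommend` helper are hypothetical, and only the `CountVectorizer` settings (5,000 features, English stop words) come from the write-up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tag strings standing in for the real dataset.
titles = ["Heat", "Ronin", "Shrek", "Collateral"]
tags = [
    "heat crime drama thriller 1995",
    "ronin crime thriller action 1998",
    "shrek animation comedy family 2001",
    "collateral crime drama thriller 2004",
]

vectorizer = CountVectorizer(max_features=5000, stop_words="english")
matrix = vectorizer.fit_transform(tags)  # SciPy sparse matrix, one row per movie


def recommend(title: str, top_n: int = 2) -> list[str]:
    """Score one movie against all others without building the full matrix."""
    idx = titles.index(title)
    # Result has shape (1, n_movies): a single row of scores, computed on demand.
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    scores[idx] = -1.0  # exclude the query movie itself
    best = np.argsort(scores)[::-1][:top_n]
    return [titles[i] for i in best]


print(recommend("Heat"))  # ['Collateral', 'Ronin']
```

Only the sparse `matrix` stays resident. Each request allocates one row of scores and discards it afterwards.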
The second challenge was less about code and more about being honest about limitations. Poster images come from an unofficial, community-run IMDb proxy rather than an official API, which is fine for a side project, but not something you'd want to depend on without a fallback in production. Rather than hide that, it's called out directly in the project's own documentation, alongside a few other rough edges like an in-memory poster cache that resets on every restart. Writing those down instead of glossing over them is, honestly, the more useful habit to build early.
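For reference, an in-memory poster cache with a fallback might look roughly like this. It is a synchronous sketch with an injected fetch function so it runs without a network, whereas the real app makes async calls with HTTPX. `PosterCache`, `PLACEHOLDER_URL`, `fake_fetch`, and the IDs and URLs are all hypothetical names invented for the example.

```python
from typing import Callable, Optional

PLACEHOLDER_URL = "/static/no-poster.png"  # hypothetical fallback asset


class PosterCache:
    """In-memory poster lookup; contents are lost when the process restarts."""

    def __init__(self, fetch: Callable[[str], Optional[str]]) -> None:
        self._fetch = fetch
        self._cache: dict[str, str] = {}

    def get(self, imdb_id: str) -> str:
        if imdb_id in self._cache:
            return self._cache[imdb_id]
        try:
            url = self._fetch(imdb_id)
        except Exception:
            url = None  # treat any upstream failure as "no poster"
        if url is None:
            return PLACEHOLDER_URL  # not cached, so a later request can retry
        self._cache[imdb_id] = url
        return url


calls: list[str] = []


def fake_fetch(imdb_id: str) -> Optional[str]:
    """Stand-in for the external poster API; fails for one made-up ID."""
    calls.append(imdb_id)
    if imdb_id == "tt0000002":
        raise TimeoutError("proxy unavailable")
    return f"https://example.invalid/{imdb_id}.jpg"


cache = PosterCache(fake_fetch)
cache.get("tt0000001")
cache.get("tt0000001")           # served from memory, no second fetch
print(calls.count("tt0000001"))  # 1
print(cache.get("tt0000002"))    # /static/no-poster.png
```

Because failures are not cached, a temporary outage of the proxy does not permanently blank out a poster.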
How It's Put Together
The workflow is straightforward once you see it laid out:
- A raw IMDb dataset gets cleaned in a Jupyter notebook: lowercasing text, filling missing values, and combining title, genre, year, and IMDb ID into a single "tags" field per movie.
- At server startup, FastAPI loads the cleaned data and fits a CountVectorizer (5,000 features, English stop words removed) to turn every movie's tags into a sparse numeric vector.
- When you search for a movie and ask for recommendations, the app computes cosine similarity between your chosen movie's vector and every other vector, sorts by score, and returns the closest matches with their metadata attached.
- A separate endpoint fetches movie posters on demand and caches them in memory so repeat lookups don't hit the external API again.
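The first two steps above can be sketched as follows. The column names, the sample rows (including the placeholder IMDb IDs), and the `build_tags` helper are assumptions made for illustration. The actual notebook may clean the data differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical raw rows; column names and IDs are placeholders.
raw = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", None],
    "title": ["Heat", "Shrek", "Untitled"],
    "genre": ["Crime, Drama", "Animation, Comedy", None],
    "year": [1995, 2001, 2020],
})


def build_tags(df: pd.DataFrame) -> pd.Series:
    """Combine title, genre, year, and IMDb ID into one lowercase tag string."""
    cols = ["title", "genre", "year", "imdb_id"]
    text = df[cols].fillna("").astype(str)           # fill missing values
    tags = text.apply(lambda row: " ".join(row), axis=1)
    tags = tags.str.lower().str.replace(",", " ", regex=False)
    return tags.str.split().str.join(" ")            # collapse extra whitespace


tags = build_tags(raw)
print(tags.iloc[0])  # heat crime drama 1995 tt0000001

# Fit once at startup; the sparse matrix is what stays in memory.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
matrix = vectorizer.fit_transform(tags)
print(matrix.shape[0])  # 3
```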
The Model and the Toolkit
The recommendation logic runs on scikit-learn's CountVectorizer for turning text into numbers, paired with cosine similarity to measure how close two movies are in that numeric space. Everything sits behind a FastAPI backend running on Uvicorn, with Pandas and NumPy handling the data wrangling, HTTPX managing async calls to the poster API, and Pydantic validating incoming requests. The frontend is plain HTML, CSS, and JavaScript with no framework overhead, since the goal was a fast, simple interface rather than a complex one.
A Few Things Worth Noticing
- The search bar ranks autocomplete results by popularity (vote count), so typing "dark knight" surfaces the film people actually mean, not just the first alphabetical match.
- Recommendations come back with real metadata attached (genre, release year, IMDb rating, vote count), so you're not just getting a title, you're getting enough context to decide if it's worth watching.
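The popularity-ranked autocomplete from the first point can be sketched in a few lines. The `autocomplete` helper, the substring matching rule, and the vote counts below are illustrative assumptions, not the project's real code or data.

```python
def autocomplete(query: str, movies: list[dict], limit: int = 5) -> list[str]:
    """Return titles containing the query, most-voted first."""
    q = query.strip().lower()
    if not q:
        return []
    hits = [m for m in movies if q in m["title"].lower()]
    hits.sort(key=lambda m: m["votes"], reverse=True)
    return [m["title"] for m in hits[:limit]]


# Made-up vote counts, chosen only to show the ordering.
movies = [
    {"title": "Batman: The Dark Knight Returns, Part 1", "votes": 100},
    {"title": "The Dark Knight", "votes": 300},
    {"title": "The Dark Knight Rises", "votes": 200},
]

print(autocomplete("dark knight", movies))
# ['The Dark Knight', 'The Dark Knight Rises', 'Batman: The Dark Knight Returns, Part 1']
```

An alphabetical sort would put the "Batman: ..." title first. Sorting by vote count surfaces the most widely known match instead.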
The Part That Surprised Me
Going in, I assumed a "good" recommender needed user behavior data to feel personal. What this project showed me is that text alone (genre, year, a handful of identifiers) captures a surprising amount of what makes two movies feel similar. It's not as nuanced as a system trained on millions of viewing patterns, but it's honest about what it's doing, and it works without needing a single person to have rated anything first.
Try It and Tell Me What's Off
If you want to see how it behaves, search a movie you actually like on the live demo and see if the matches make sense to you. The full code, notebooks, and a documented list of known limitations are in the GitHub repository. Feedback, issues, and pull requests are genuinely welcome, especially if you spot a case where the recommendations miss the mark.
Keywords: content-based recommendation system, movie recommendation system Python, machine learning project portfolio, FastAPI machine learning project, scikit-learn cosine similarity, NLP recommender system, build a recommendation engine, AI/ML engineering project, CountVectorizer movie recommender, data science project for resume, ravi kumar vishwakarma, ravi kumar, ravi vishwakarma, ravi recent project, ravi new project






