Introduction
Sentiment Analysis is more than just binary classification. While most tutorials focus on a simple Positive/Negative split, real-world data often sits in the "gray area." In this post, I’ll walk you through how I built a Sentiment Analysis pipeline using the IMDB 50K dataset and implemented a custom threshold logic to identify neutral sentiments.
The Stack 🛠️
Language: Python
NLP: NLTK (Lemmatization, Stopwords)
ML: Scikit-Learn (Linear SVC)
Vectorization: TF-IDF
The Preprocessing Pipeline
Text cleaning is the most crucial part of NLP. I used a combination of regex and NLTK to:
Strip HTML tags.
Remove non-alphabetic characters.
Lowercase and Lemmatize words to their root forms.
Python
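import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords') and nltk.download('wordnet') once
def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)  # strip HTML tags; the space keeps "great<br />film" from fusing
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()  # keep letters only, then lowercase
    words = text.split()
    cleaned_words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]  # drop stopwords, lemmatize the rest
    return " ".join(cleaned_words)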
Feature Engineering: Why TF-IDF?
I used TF-IDF vectorization with 5,000 features. TF-IDF weighs each word by how often it appears in a review relative to how common it is across the entire dataset, so common but uninformative words are down-weighted instead of dominating the feature vector.
The Model: Linear SVC
I chose Linear SVC (Support Vector Classification) because it is highly efficient for high-dimensional text data. It aims to find the optimal hyperplane that maximizes the margin between classes.
The Challenge: Handling "Neutral" Reviews ⚖️
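The IMDB dataset is binary (1 or -1), so there are no neutral examples to learn from. To get more than a hard label out of the model, I used the decision_function method.
Instead of a hard prediction, it returns a signed score that reflects which side of the decision boundary a review falls on and how far from it. I then thresholded that score: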
Score > 0.1: Positive
Score < -0.1: Negative
-0.1 to 0.1: Neutral ("Maybe Watch")
This approach lets the model flag "average" or "mixed" reviews as neutral without needing any neutral-labeled training data.
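The threshold rule above can be sketched as a small function. This is a minimal illustration rather than the code from the repo: the name `label_from_score` and the example scores are hypothetical, and in the real pipeline the scores would come from the trained model's `decision_function` output.

```python
def label_from_score(score: float, threshold: float = 0.1) -> str:
    """Map a signed decision score to a three-way sentiment label."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    # Anything in [-threshold, threshold] falls in the "Maybe Watch" band
    return "Neutral"


# Hypothetical scores standing in for model.decision_function(X)
example_scores = [1.3, 0.05, -0.1, -0.7]
labels = [label_from_score(s) for s in example_scores]
print(labels)  # ['Positive', 'Neutral', 'Neutral', 'Negative']
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band. Since these scores are margins rather than probabilities, 0.1 is best treated as a tunable heuristic.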
Overfitting & Regularization
To keep the model from memorizing noise, I relied on the L2 regularization built into Linear SVC. Comparing training and test accuracy, which settled at a solid 88%, suggested that the model generalizes well to unseen data.
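For reference, here is roughly what that regularization looks like as an objective. This sketch assumes scikit-learn's default LinearSVC settings (L2 penalty with squared hinge loss), which the post does not state explicitly. The symbols are mine: w is the weight vector over the TF-IDF features, b the intercept, x_i a review's feature vector, y_i its label (1 or -1), and C the inverse regularization strength.

```latex
\min_{w,\,b} \;\; \frac{1}{2}\,\lVert w \rVert_2^2
\;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)^2
```

The first term is the L2 penalty that keeps the weights small, and the second penalizes reviews that fall on the wrong side of the margin. A smaller C means stronger regularization, so C is the knob to turn if training accuracy runs well ahead of test accuracy. The score w^T x + b is also the quantity that decision_function returns.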
Conclusion
Building this was a great deep-dive into the nuances of NLP. The addition of a threshold-based neutral detection makes the recommendation system much more practical for real-world applications.
GitHub Repo: https://github.com/Urooj25/Movie-review-sentiment-analysis.git
Let's connect on LinkedIn! www.linkedin.com/in/urooj-fatima-b52495342