Urooj Fatima
Building a Robust Sentiment Classifier with Neutral Detection using Linear SVC

Introduction
Sentiment Analysis is more than just binary classification. While most tutorials focus on a simple Positive/Negative split, real-world data often sits in the "gray area." In this post, I'll walk you through how I built a Sentiment Analysis pipeline using the IMDB 50K dataset and implemented custom threshold logic to identify neutral sentiments.

The Stack 🛠️
Language: Python

NLP: NLTK (Lemmatization, Stopwords)

ML: Scikit-Learn (Linear SVC)

Vectorization: TF-IDF

  1. The Preprocessing Pipeline
    Text cleaning is one of the most crucial steps in any NLP pipeline. I used a combination of Regex and NLTK to:

Strip HTML tags.

Remove non-alphabetic characters.

Lowercase and Lemmatize words to their root forms.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)               # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()  # keep letters only, lowercase
    words = text.split()
    # Drop stopwords and reduce each word to its lemma
    cleaned_words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return " ".join(cleaned_words)
```

  2. Feature Engineering: Why TF-IDF?
    I used TF-IDF Vectorization with 5,000 features. This helps the model weigh words based on their importance across the entire dataset, effectively filtering out common but uninformative words.
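As a minimal sketch of this step, here is the vectorization with a toy three-review corpus standing in for the cleaned IMDB data (`max_features=5000` mirrors the real pipeline; the sample texts are mine):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned IMDB reviews
docs = [
    "movie great acting great story",
    "movie terrible plot boring waste",
    "great film wonderful cast",
]

# max_features caps the vocabulary at the 5,000 highest-frequency terms
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3 documents, vocabulary size) — a sparse TF-IDF matrix
```

Each row is one review; each column weight reflects how distinctive a term is for that review relative to the whole corpus.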

  3. The Model: Linear SVC
    I chose Linear SVC (Support Vector Classification) because it is highly efficient for high-dimensional text data. It aims to find the optimal hyperplane that maximizes the margin between classes.
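A hedged sketch of the training step, with a hypothetical four-review split in place of the real IMDB data (the texts and labels below are illustrative, not from the repo):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini train split; the real pipeline trains on IMDB 50K
train_texts = [
    "great movie loved every minute",
    "wonderful acting superb direction",
    "terrible boring waste of time",
    "awful plot hated the ending",
]
train_labels = [1, 1, -1, -1]

# TF-IDF features feed straight into the linear SVM
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
```

Chaining the vectorizer and classifier in a pipeline keeps the vocabulary fitted on training data only, which avoids leakage when you later score the test split.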

  4. The Challenge: Handling "Neutral" Reviews ⚖️
    The IMDB dataset only provides binary labels (encoded as 1 and -1). To make the model smarter, I used the decision_function method.
    Instead of a hard prediction, I calculated each review's signed distance from the decision boundary:

Score > 0.1: Positive

Score < -0.1: Negative

-0.1 to 0.1: Neutral ("Maybe Watch")

This approach allows the model to handle "average" or "mixed" reviews gracefully without being force-trained on neutral data.
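The thresholding above can be sketched as a small helper (the function name and the example scores are mine; the scores stand in for raw `model.decision_function(X)` output):

```python
def label_with_neutral(scores, threshold=0.1):
    """Map raw decision_function distances to three sentiment buckets."""
    labels = []
    for score in scores:
        if score > threshold:
            labels.append("Positive")
        elif score < -threshold:
            labels.append("Negative")
        else:
            labels.append("Neutral")  # the "Maybe Watch" zone near the boundary
    return labels

# Example raw scores, as decision_function might return them
print(label_with_neutral([1.73, -0.04, -0.92]))  # ['Positive', 'Neutral', 'Negative']
```

The threshold of 0.1 is a tunable knob: widening the band routes more borderline reviews to Neutral, narrowing it makes the classifier more decisive.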

  5. Overfitting & Regularization
    To prevent the model from memorizing noise, I relied on the built-in L2 regularization in Linear SVC. By comparing training vs. test accuracy (which settled at a solid 88%), I ensured the model generalizes well to unseen data.
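To illustrate the train-vs-test comparison, here is a sketch with synthetic features standing in for the TF-IDF matrix. In `LinearSVC`, `C` is the inverse regularization strength, so a smaller `C` means a stronger L2 penalty:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 5,000-dimensional TF-IDF features
X, y = make_classification(n_samples=500, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LinearSVC(C=1.0)  # default C; the L2 penalty is built in
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f}")  # a large gap signals overfitting
```

If training accuracy is far above test accuracy, lowering `C` (stronger regularization) is the first knob to try.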

Conclusion
Building this was a great deep-dive into the nuances of NLP. The addition of a threshold-based neutral detection makes the recommendation system much more practical for real-world applications.
GitHub Repo: https://github.com/Urooj25/Movie-review-sentiment-analysis.git
Let's connect on LinkedIn: www.linkedin.com/in/urooj-fatima-b52495342
