Preyum Kumar

Posted on Jun 1

Lexicon vs. Transformers: A Complete Guide to Sentiment Analysis with VADER and RoBERTa

#ai #nlp #roberta #sentimentanalysis

In modern natural language processing (NLP), understanding human emotion and sentiment from text data is a highly sought-after capability. Whether analyzing customer feedback, product reviews, or social media trends, choosing the right modeling paradigm is critical. This comprehensive guide details a complete sentiment analysis workflow comparing a lexicon-based bag-of-words approach (VADER) and a deep learning transformer-based approach (RoBERTa), concluding with an interactive Streamlit dashboard for live model testing.

Index

Introduction to Sentiment Analysis
Dataset Preparation and Exploratory Data Analysis
NLTK Text Preprocessing
VADER Lexicon-Based Sentiment Analysis
RoBERTa Transformer-Based Sentiment Analysis
Model Comparison and Edge Cases
Interactive Streamlit Application
Hugging Face Pipelines for Production
Project Links and Resources

Introduction to Sentiment Analysis

Sentiment Analysis (or opinion mining) is the computational study of people's opinions, sentiments, and emotions toward entities, individuals, issues, or events. In this project, we explore two distinct methodologies:

VADER (Valence Aware Dictionary and sEntiment Reasoner): A lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media and product reviews. It relies on a pre-defined dictionary of words mapped to emotional intensities (valence).
RoBERTa (Robustly Optimized BERT Pretraining Approach): An optimized variant of Google's BERT (Bidirectional Encoder Representations from Transformers). Through self-attention, RoBERTa captures bidirectional contextual dependencies between words, which generally makes it better than a lexicon at handling sarcasm, negation, and subtle linguistic nuance.

Feature / Metric	VADER (Lexicon-Based)	RoBERTa (Transformer-Based)
Underlying Approach	Lexicon & Rule-based (Bag-of-words)	Transformer-based (Self-Attention)
Contextual Awareness	Limited (Rule-based handling of negation and intensifiers)	High (Considers whole sentence)
Compute Requirements	Extremely Low (Runs instantly on CPU)	High (Requires GPU for optimal inference)
Handling of Sarcasm	Poor (Often misclassified by literal words)	Excellent (Captures context clues)
Output Representation	Compound (-1 to 1), Pos, Neu, Neg scores	Probabilities (0 to 1) for Neg, Neu, Pos

Dataset Preparation and Exploratory Data Analysis

To demonstrate these models, we utilize the Amazon Fine Food Reviews dataset (Reviews.csv). The dataset contains user reviews of fine foods on Amazon, including text reviews and associated 1 to 5 star ratings.

Initializing and Reducing Dataset

To keep computational overhead low during development, we ingest the first 500 records.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

# Load the dataset
df = pd.read_csv('data/Reviews.csv')
print(f"Original shape: {df.shape}")

# Keep only the first 500 rows for rapid prototyping
df = df.head(500)
print(f"Reduced shape: {df.shape}")
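
Loading the full file and then calling head(500) still parses every row. A minimal sketch of an alternative, assuming only the first rows are needed: the nrows parameter of pd.read_csv stops parsing early. The in-memory CSV below is a made-up stand-in for data/Reviews.csv.

```python
import io

import pandas as pd

# Stand-in for data/Reviews.csv: a tiny in-memory CSV with made-up rows.
csv_text = "Id,Score,Text\n" + "".join(
    f"{i},{(i % 5) + 1},review number {i}\n" for i in range(1, 11)
)

# nrows stops the parser after the requested number of data rows,
# so the remainder of the file is never loaded into memory.
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(sample_df.shape)
```

With the real dataset, the same idea would read pd.read_csv('data/Reviews.csv', nrows=500).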

Class Balance Visualization

Visualizing the distribution of star ratings helps identify class imbalance, which is critical for understanding model bias.

ax = df['Score'].value_counts().sort_index().plot(
    kind='bar', 
    title='Count of Reviews by Star Rating', 
    figsize=(10, 5)
)
ax.set_xlabel('Review Stars (Rating)')
ax.set_ylabel('Count')
plt.show()

The Amazon Fine Food Reviews dataset is highly skewed toward 5-star ratings (highly positive), which is a common characteristic of consumer feedback datasets.
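
To put a number on the skew rather than reading it off the bar chart, a short sketch using value_counts(normalize=True) can help. The ratings below are made up for illustration and stand in for df['Score'].

```python
import pandas as pd

# Made-up star ratings standing in for df['Score'].
scores = pd.Series([5, 5, 5, 5, 4, 4, 3, 2, 1, 5], name="Score")

# normalize=True returns fractions of the total instead of raw counts.
share = scores.value_counts(normalize=True).sort_index()
print(share)

# Share of the single most common rating: a quick imbalance indicator.
majority_share = share.max()
print(f"Majority class share: {majority_share:.0%}")
```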

NLTK Text Preprocessing

Before running lexicon engines, we explore traditional linguistic preprocessing steps utilizing the Natural Language Toolkit (NLTK). These steps help us understand tokenization, Part-of-Speech (POS) tagging, and Named Entity Recognition (NER).

import nltk

# Download required NLTK resources
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

# Select a sample review for preprocessing
example = df['Text'][50]
print(f"Sample Review:\n{example}\n")

# Step 1: Tokenization
tokens = nltk.word_tokenize(example)
print(f"Tokens:\n{tokens[:10]}...\n")

# Step 2: Part-of-Speech (POS) Tagging
tagged = nltk.pos_tag(tokens)
print(f"POS Tags:\n{tagged[:10]}...\n")

# Step 3: Named Entity Recognition (NER)
entities = nltk.chunk.ne_chunk(tagged)
print("Named Entities:")
entities.pprint(margin=40)

Understanding the Pipeline:

Tokenization: Breaks continuous text down into individual words or punctuation marks.
POS Tagging: Identifies the grammatical role of each token (e.g., NN for a singular noun, JJ for an adjective, VBZ for a third-person singular present-tense verb).
NER: Clusters tokens into structured entities such as PERSON, ORGANIZATION, or GPE (Geopolitical Entity).
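
To make the tokenization step concrete, here is a rough regex-based approximation. It is only an illustration: simple_tokenize is a hypothetical helper, and nltk.word_tokenize applies more elaborate rules (for example, around contractions).

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches runs of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Good, not great!"))
```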

VADER Lexicon-Based Sentiment Analysis

VADER uses a dictionary of lexical features mapped to intensity ratings. For example, "good" is positive and "great" is more positive. It also applies heuristics for capitalization ("GREAT" is stronger than "great") and exclamation marks ("great!" is stronger than "great").
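
The lookup-plus-heuristics idea can be sketched with a toy scorer. Everything here is an assumption made for illustration: the mini-lexicon, the boost constants, and the toy_valence function are hypothetical and do not reproduce VADER's actual values or rules.

```python
import re

# Hypothetical mini-lexicon; the valences are illustrative, not VADER's.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
CAPS_BOOST = 0.7         # assumed bonus for an ALL-CAPS word
EXCLAMATION_BOOST = 0.3  # assumed bonus per exclamation mark


def toy_valence(text: str) -> float:
    """Sum word valences, amplifying capitalized words and exclamations."""
    total = 0.0
    for word in re.findall(r"[A-Za-z]+", text):
        valence = TOY_LEXICON.get(word.lower(), 0.0)
        if valence and word.isupper() and len(word) > 1:
            # Push the score further from zero while keeping its sign.
            valence += CAPS_BOOST if valence > 0 else -CAPS_BOOST
        total += valence
    if total:
        # Exclamation marks strengthen whatever sentiment is already present.
        boost = EXCLAMATION_BOOST * text.count("!")
        total += boost if total > 0 else -boost
    return total


for sample in ["great", "GREAT", "great!", "awful!"]:
    print(sample, toy_valence(sample))
```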

Single Sentence Inference with VADER

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Test VADER polarity scores
print(sia.polarity_scores("This food is absolutely delicious!"))
# Output: {'neg': 0.0, 'neu': 0.412, 'pos': 0.588, 'compound': 0.6028}

Processing the Dataset with VADER

We run VADER on all 500 reviews, merging the compound, positive, neutral, and negative scores back into our original dataframe.

from tqdm.notebook import tqdm

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')
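
As a variation on the reshaping step above, a sketch using DataFrame.from_dict with orient="index" avoids the transpose, and an explicit on="Id" makes the join key visible. The dictionaries and names ending in _demo are made up for illustration.

```python
import pandas as pd

# Made-up scores keyed by review Id, shaped like the `res` dictionary.
res_demo = {
    1: {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.7},
    2: {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
}
reviews_demo = pd.DataFrame({"Id": [1, 2], "Score": [5, 1]})

# orient="index" turns each outer key into a row, so no transpose is needed.
scores_demo = pd.DataFrame.from_dict(res_demo, orient="index")
scores_demo = scores_demo.rename_axis("Id").reset_index()

# Naming the join key explicitly guards against accidental joins
# on any other column the two frames might share.
merged_demo = scores_demo.merge(reviews_demo, on="Id", how="left")
print(merged_demo)
```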

Visualizing VADER Assumptions

To verify if VADER aligns with customer ratings, we plot both the overall Compound Score (ranging from -1 to 1) and the individual sentiment categories (positive, neutral, and negative) against the Amazon Star Review score (1 to 5).

Compound Score by Amazon Star Review

The compound score is a single metric that sums the valence scores of each word in the text, normalized between -1 (most negative) and 1 (most positive).
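
The normalization mentioned above can be written out. To the best of our knowledge, the reference VADER implementation maps the rule-adjusted valence sum x to x / sqrt(x^2 + alpha) with alpha = 15; treat the constant as an assumption and check it against the library version you use. The sketch below only reproduces that formula, not the rest of VADER.

```python
import math


def normalize(raw_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+.1f} -> {normalize(raw):+.4f}")
```

Larger sums approach the bounds without reaching them, which is why long, strongly worded reviews saturate near -1 or 1.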

# Plot VADER compound score
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

Analysis: The mean compound score rises steadily as the star rating increases from 1 to 5, indicating a clear positive correlation with user ratings.
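
The trend can also be checked numerically instead of by eye. A minimal sketch, using made-up rows in place of the vaders dataframe: the per-star mean is what the bar plot draws, and the Pearson coefficient summarizes the overall relationship.

```python
import pandas as pd

# Made-up rows standing in for the `vaders` dataframe.
demo = pd.DataFrame({
    "Score":    [1, 1, 2, 3, 3, 4, 5, 5],
    "compound": [-0.6, -0.2, 0.0, 0.1, 0.3, 0.5, 0.8, 0.9],
})

# Mean compound score per star rating (the values the bar plot draws).
mean_by_star = demo.groupby("Score")["compound"].mean()
print(mean_by_star)

# Pearson correlation between star rating and compound score.
pearson_r = demo["Score"].corr(demo["compound"])
print(f"Pearson r = {pearson_r:.3f}")
```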

Individual Sentiment Categories

We break down the sentiment into positive, neutral, and negative scores across different ratings.

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive Score')
axs[1].set_title('Neutral Score')
axs[2].set_title('Negative Score')
plt.tight_layout()
plt.show()

Assumption Check: As expected, the positive score rises incrementally with higher star reviews, while the negative score trends down to near-zero for 5-star reviews.

RoBERTa Transformer-Based Sentiment Analysis

Unlike VADER's lookup dictionary, RoBERTa encodes each token in the context of the whole sequence, and its attention heads learn how words interact with one another. We utilize cardiffnlp/twitter-roberta-base-sentiment, a RoBERTa model trained on ~58 million tweets and fine-tuned specifically for sentiment detection.

Loading RoBERTa with HuggingFace

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import torch

# Define model path
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Single Sentence Inference with RoBERTa

To run inference, we encode our text, retrieve the raw prediction logits, and apply a Softmax activation to convert raw outputs to probability distributions.

# Tokenize and encode input
encoded_text = tokenizer(example, return_tensors='pt')

# Model forward pass
with torch.no_grad():
    output = model(**encoded_text)

# Convert raw logits to probabilities
scores = output.logits[0].numpy()
scores = softmax(scores)

# Format scores mapping
scores_dict = {
    'roberta_neg': scores[0],
    'roberta_neu': scores[1],
    'roberta_pos': scores[2]
}
print(scores_dict)
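
For readers who want to see what the softmax step does, here is a small NumPy sketch. stable_softmax is a hypothetical helper written for illustration (the code above uses scipy.special.softmax), and the logits are made up.

```python
import numpy as np


def stable_softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a 1-D array of logits into probabilities that sum to 1."""
    # Subtracting the maximum leaves the result mathematically unchanged
    # but prevents overflow in exp() for large logits.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


# Made-up logits in the order (negative, neutral, positive).
probs = stable_softmax(np.array([-1.2, 0.3, 2.5]))
print(probs, probs.sum())
```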

Iterating Over the Dataset with RoBERTa

We write a polarity-extracting wrapper that can run on either a standard CPU or a GPU (using CUDA). The device-agnostic approach keeps the same code usable in both environments. Note that the wrapper scores one review per call and truncates long reviews to 512 tokens.

CPU-Only Implementation

Ideal for local prototyping, low-power edge machines, or environments without a dedicated graphics card.

def roberta_polarity_cpu(text_sample):
    model.to('cpu')  # Force model to CPU
    encoded_text = tokenizer(
        text_sample, 
        return_tensors='pt', 
        truncation=True, 
        max_length=512
    )
    # Move inputs to CPU
    encoded_text = {k: v.to('cpu') for k, v in encoded_text.items()}
    with torch.no_grad():
        output = model(**encoded_text)

    scores = output.logits[0].cpu().numpy()
    scores = softmax(scores)
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }

GPU-Accelerated Implementation

Recommended for large-scale production runs, where parallelized matrix computations on CUDA cores typically make inference substantially faster than on a CPU.

def roberta_polarity_gpu(text_sample):
    # Dynamically select device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)  # Move model weights to GPU memory

    encoded_text = tokenizer(
        text_sample, 
        return_tensors='pt', 
        truncation=True, 
        max_length=512
    )
    # Move input tensors to the chosen device
    encoded_text = {k: v.to(device) for k, v in encoded_text.items()}
    with torch.no_grad():
        output = model(**encoded_text)

    # Must pull logits back to CPU memory before converting to numpy
    scores = output.logits[0].cpu().numpy()
    scores = softmax(scores)
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }

# Device-agnostic wrapper used for the dataset iteration
def roberta_polarity(text_sample):
    # Dynamically runs on GPU if CUDA is available, otherwise falls back to CPU
    return roberta_polarity_gpu(text_sample)
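
The functions above score one review per call. Where throughput matters, a batched variant is one option. The sketch below is an assumption-laden illustration, not part of the original project: chunked and roberta_polarity_batch are hypothetical names, and the tokenizer and model are passed in as arguments, assumed to be the objects loaded earlier.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def roberta_polarity_batch(texts, tokenizer, model, batch_size=16) -> List[Dict[str, float]]:
    """Hypothetical batched variant of roberta_polarity."""
    import torch  # imported lazily so chunked() stays usable without PyTorch

    device = next(model.parameters()).device  # run wherever the model already lives
    results = []
    for batch in chunked(texts, batch_size):
        # padding=True pads each batch to its longest sequence.
        encoded = tokenizer(
            batch, return_tensors="pt", padding=True,
            truncation=True, max_length=512,
        )
        encoded = {k: v.to(device) for k, v in encoded.items()}
        with torch.no_grad():
            logits = model(**encoded).logits
        probs = torch.softmax(logits, dim=1).cpu().numpy()
        for neg, neu, pos in probs:
            results.append({
                "roberta_neg": float(neg),
                "roberta_neu": float(neu),
                "roberta_pos": float(pos),
            })
    return results


print(list(chunked(["r1", "r2", "r3"], 2)))
```

A call such as roberta_polarity_batch(df['Text'].tolist(), tokenizer, model) would return one dictionary per review, in input order.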

Process both models simultaneously and store results

res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']

    # Calculate VADER
    vader_res = sia.polarity_scores(text)
    vader_rename = {f"vader_{k}": v for k, v in vader_res.items()}

    # Calculate RoBERTa
    roberta_res = roberta_polarity(text)

    # Combine dictionaries
    combined_res = {**vader_rename, **roberta_res}
    res[myid] = combined_res

# Merge results with metadata
result_df = pd.DataFrame(res).T
result_df = result_df.reset_index().rename(columns={'index': 'Id'})
result_df = result_df.merge(df, how='left')

Model Comparison and Edge Cases

Comparing lexicon-based scores to deep contextual embeddings reveals substantial differences in behavior.

Pairplot Analysis

Using Seaborn, we visualize the correlation and separation between VADER's scores and RoBERTa's scores across all ratings.

sns.pairplot(
    data=result_df, 
    vars=['vader_neg', 'vader_pos', 'roberta_neg', 'roberta_pos'], 
    hue='Score', 
    palette='tab10'
)
plt.show()

RoBERTa shows far cleaner separation between clusters: high-star reviews gather around low roberta_neg and high roberta_pos, while VADER's scores overlap more and look noisier.
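
Beyond the visual comparison, one way to quantify how often the two models disagree is to reduce each to a single label and compare. This is a sketch with made-up rows that reuse the column names of result_df; the ±0.05 compound thresholds are the conventional VADER cut-offs, and the argmax rule for RoBERTa is an assumption of this example.

```python
import numpy as np
import pandas as pd

# Made-up rows using the same column names as result_df.
demo = pd.DataFrame({
    "vader_compound": [0.80, 0.30, -0.50, 0.02],
    "roberta_neg":    [0.05, 0.90, 0.85, 0.10],
    "roberta_neu":    [0.10, 0.07, 0.10, 0.70],
    "roberta_pos":    [0.85, 0.03, 0.05, 0.20],
})

# VADER label from the +/-0.05 compound thresholds.
demo["vader_label"] = np.select(
    [demo["vader_compound"] >= 0.05, demo["vader_compound"] <= -0.05],
    ["positive", "negative"],
    default="neutral",
)

# RoBERTa label from the highest-probability class.
roberta_cols = ["roberta_neg", "roberta_neu", "roberta_pos"]
names = {"roberta_neg": "negative", "roberta_neu": "neutral", "roberta_pos": "positive"}
demo["roberta_label"] = demo[roberta_cols].idxmax(axis=1).map(names)

# Fraction of rows where the two labels differ.
disagreement = (demo["vader_label"] != demo["roberta_label"]).mean()
print(demo[["vader_label", "roberta_label"]])
print(f"Disagreement rate: {disagreement:.0%}")
```

Rows where the labels differ are natural candidates for the manual edge-case inspection that follows.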

Edge Case 1: Sarcastic 1-Star Reviews

Consider a 1-star review in which positive words set up a negative verdict:

"I was so excited to receive these, but they turned out to be completely stale and flavorless. A total waste of money."

VADER's Interpretation: Matches positive lexicon entries such as "excited" and may score the review as neutral or slightly positive.
RoBERTa's Interpretation: Detects the contrasting shift and correctly classifies it as strongly Negative (neg ~0.95).

# Find the 1-star review that RoBERTa scores as most positive (largest mismatch with the rating)
sarcastic_review = result_df.query('Score == 1').sort_values('roberta_pos', ascending=False)['Text'].values[0]
print(f"Top RoBERTa positive-rated 1-star review:\n{sarcastic_review}")

Edge Case 2: Hyperbolic 5-Star Reviews

Consider a 5-star review where the customer uses a phrase like "dangerously good":

"This chocolate is dangerously good. I cannot stop eating them. It is a serious problem."

VADER's Interpretation: Picks up heavily negative tokens like "dangerously" and "problem," flagging the sentiment as neutral or negative.
RoBERTa's Interpretation: Analyzes contextual flow, understands the hyperbolic nature of "dangerously good," and flags it as highly Positive (pos ~0.98).

# Find the 5-star review that RoBERTa scores as most negative (largest mismatch with the rating)
cynical_review = result_df.query('Score == 5').sort_values('roberta_neg', ascending=False)['Text'].values[0]
print(f"Top RoBERTa negative-rated 5-star review:\n{cynical_review}")

Interactive Streamlit Application

To deploy this analysis and make it accessible, we build a lightweight web dashboard using Streamlit. It loads both models, caches them so they are not reloaded on every rerun, and provides side-by-side comparative graphs.

# streamlit_app.py
import nltk
import pandas as pd
import streamlit as st
import torch  # needed for torch.no_grad() and torch.softmax() below
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import warnings
import logging

# Suppress Hugging Face warnings
logging.getLogger("transformers").setLevel(logging.ERROR)
warnings.filterwarnings('ignore')

# Guarantee VADER Lexicon Download
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon', quiet=True)

# Page Metadata Setup
st.set_page_config(page_title="Review Sentiment Analysis", layout='wide')
st.title("Review Sentiment Analysis App")
st.write("Analyze review sentiment dynamically using VADER and RoBERTa models.")

# Sidebar Configurations
st.sidebar.header("Settings")
models_to_use = st.sidebar.multiselect(
    "Select Models", 
    ["VADER", "RoBERTa"], 
    default=["VADER", "RoBERTa"]
)
show_graph = st.sidebar.checkbox("Show Sentiment Distribution Graphs", value=True)

# Cached Resource Loaders
@st.cache_resource
def load_vader():
    return SentimentIntensityAnalyzer()

@st.cache_resource
def load_roberta():
    MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)
    return tokenizer, model

# Load instances based on selection
sia_vader = load_vader() if "VADER" in models_to_use else None
roberta_data = load_roberta() if "RoBERTa" in models_to_use else None
roberta_tokenizer, sia_roberta = roberta_data if roberta_data else (None, None)

# User Interface Text Area
st.subheader("Enter Review Text")
text_input = st.text_area("Type your review text here:", height=120)

if st.button("Analyze Sentiment"):
    if not text_input.strip():
        st.warning("Please enter some review text first!")
    else:
        st.subheader("Sentiment Analysis Results")
        col1, col2 = st.columns(2)

        # 1. Run VADER Analysis
        if "VADER" in models_to_use and sia_vader:
            vader_scores = sia_vader.polarity_scores(text_input)
            col1.write("### VADER Sentiment Scores")
            col1.write(f"Positive Score: {vader_scores['pos']:.2f}")
            col1.write(f"Neutral Score: {vader_scores['neu']:.2f}")
            col1.write(f"Negative Score: {vader_scores['neg']:.2f}")

            # Define classification bounds
            compound = vader_scores['compound']
            if compound >= 0.05:
                vader_sentiment = "Positive 😊"
            elif compound <= -0.05:
                vader_sentiment = "Negative 😡"
            else:
                vader_sentiment = "Neutral 😐"
            col1.write(f"**Overall Classification:** {vader_sentiment}")

            if show_graph:
                vader_df = pd.DataFrame({
                    'Sentiment': ['Positive', 'Neutral', 'Negative'],
                    'Score': [vader_scores['pos'], vader_scores['neu'], vader_scores['neg']]
                })
                col1.bar_chart(vader_df.set_index('Sentiment'))

        # 2. Run RoBERTa Analysis
        if "RoBERTa" in models_to_use and sia_roberta and roberta_tokenizer:
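            # Truncate so inputs longer than the model's 512-token limit do not raise an error
            tokens = roberta_tokenizer(
                text_input, return_tensors='pt', truncation=True, max_length=512
            )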
            with torch.no_grad():
                output = sia_roberta(**tokens)

            # Apply softmax to model logits
            scores = torch.softmax(output.logits, dim=1).numpy()[0]
            roberta_sentiment = ["Negative 😡", "Neutral 😐", "Positive 😊"][scores.argmax()]

            col2.write("### RoBERTa Sentiment Scores")
            col2.write(f"Positive Score: {scores[2]:.2f}")
            col2.write(f"Neutral Score: {scores[1]:.2f}")
            col2.write(f"Negative Score: {scores[0]:.2f}")
            col2.write(f"**Overall Classification:** {roberta_sentiment}")

            if show_graph:
                roberta_df = pd.DataFrame({
                    'Sentiment': ['Positive', 'Neutral', 'Negative'],
                    'Score': [scores[2], scores[1], scores[0]]
                })
                col2.bar_chart(roberta_df.set_index('Sentiment'))

Hugging Face Pipelines for Production

If you need to deploy sentiment models quickly in microservices without manually handling tokenizers, logits, and tensors, Hugging Face provides the high-level pipeline API. Note that the default sentiment model shown here predicts only POSITIVE or NEGATIVE, with no neutral class.

from transformers import pipeline

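# Load the default sentiment-analysis pipeline (a DistilBERT model fine-tuned on SST-2).
# For production, pass model=... explicitly so the checkpoint stays fixed across library versions.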
senti_pipeline = pipeline("sentiment-analysis")

# Single-line inference
res_1 = senti_pipeline("This oatmeal is delicious and perfectly sweet!")
print(res_1)
# Output: [{'label': 'POSITIVE', 'score': 0.99986}]

res_2 = senti_pipeline("I paid $3.99 for this. What a complete rip-off.")
print(res_2)
# Output: [{'label': 'NEGATIVE', 'score': 0.99876}]
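
To keep the pipeline consistent with the rest of this guide, the RoBERTa checkpoint can be pinned explicitly. The sketch below assumes that this checkpoint reports generic labels of the form LABEL_0, LABEL_1, LABEL_2, which we map using the index order from earlier (0 = negative, 1 = neutral, 2 = positive); verify the label names against your installed version. relabel and build_roberta_pipeline are hypothetical helper names.

```python
from typing import Dict, List

# Assumed mapping, following the index order used earlier in this guide.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def relabel(predictions: List[Dict]) -> List[Dict]:
    """Replace generic pipeline labels with readable sentiment names."""
    return [
        {**pred, "label": LABEL_NAMES.get(pred["label"], pred["label"])}
        for pred in predictions
    ]


def build_roberta_pipeline():
    """Create a pipeline pinned to the checkpoint used in this guide."""
    # Imported lazily so relabel() can be used without transformers installed.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment",
    )


# Made-up prediction in the shape a text-classification pipeline returns.
print(relabel([{"label": "LABEL_2", "score": 0.97}]))
```

In use, the two helpers combine as relabel(build_roberta_pipeline()("This oatmeal is delicious!")). Labels that are not in the mapping pass through unchanged.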

Summary Recommendation

Use VADER for local pipelines, edge IoT devices, or highly structured datasets where latency and compute budgets are tight.
Use RoBERTa or similar Transformer architectures for customer-facing production systems where capturing exact context, emotional tone, and sarcasm is vital.

Project Links and Resources

GitHub Repository: PreyumKr/Sentiment_Analyser - Contains the fully functional source code, Jupyter notebooks detailing the step-by-step EDA and model evaluation, Streamlit configuration files, and localized environment setup instructions. Feel free to clone, star, or fork the repository to build your own comparative sentiment analysis pipelines.
Live Demonstration: Streamlit Web Application - Access the interactive, cloud-hosted Streamlit web application to test custom reviews in real-time. Toggle between VADER and RoBERTa models, adjust visualization preferences, and see how each architecture handles sarcasm, double negations, and complex structures dynamically.
Kaggle Dataset: Amazon Fine Food Reviews - The dataset used for this project comprises 568,454 fine food reviews from Amazon up to October 2012, featuring product IDs, helpfulness scores, ratings, and raw text reviews. We utilized a downsampled subset of the first 500 records for localized testing and rapid prototyping.

DEV Community