# How to build a sentiment analysis engine in Python

Davide Santangelo

# Intro

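This short tutorial shows how to build and train a classifier that distinguishes positive from negative reviews.

As an example dataset, we download Movie Reviews from Kaggle.

This dataset contains 1000 positive and 1000 negative processed reviews. The CSV stores one sentence per row (note the `sent_id` column in the preview further down), which is why the DataFrame has 64,720 rows rather than 2,000.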

# Scikit-learn

Scikit-learn is a free software machine learning library for the Python programming language.

It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

https://scikit-learn.org/stable/

# Classifier

We use `BernoulliNB`, the Naive Bayes classifier for multivariate Bernoulli models.

Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
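To make the "binary/boolean features" point concrete, here is a minimal sketch on a made-up count matrix (the numbers and labels are invented for illustration, not taken from the dataset). It relies on the default `binarize=0.0` of `BernoulliNB`, which turns every count greater than zero into "word present" before fitting and predicting.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy count matrix (made-up data): rows are documents, columns are word counts.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 0],
])
y = np.array(["pos", "pos", "neg", "neg"])

# Fit once on raw counts; BernoulliNB binarizes them internally (binarize=0.0).
bnb_counts = BernoulliNB().fit(X_counts, y)

# Fit again on an explicitly binarized copy of the same matrix.
X_binary = (X_counts > 0).astype(int)
bnb_binary = BernoulliNB().fit(X_binary, y)

# Both models see the same presence/absence features, so their probabilities agree.
same_probabilities = np.allclose(
    bnb_counts.predict_proba(X_counts),
    bnb_binary.predict_proba(X_binary),
)
print(same_probabilities)
```

In other words, feeding token counts to `BernoulliNB` is fine, but how often a word occurs in a document is ignored; only whether it occurs matters.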

# CountVectorizer

`CountVectorizer` converts a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
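As a small illustration of what "a matrix of token counts" looks like, the sketch below vectorizes two made-up sentences (they are not from the dataset). With the default settings, tokens shorter than two characters, such as "a", are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only.
docs = ["a great great film", "a dull film"]

vect = CountVectorizer()
X = vect.fit_transform(docs)        # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; column order is alphabetical.
vocab = sorted(vect.vocabulary_)
dense = X.toarray()                 # dense copy, only sensible for tiny examples

print(vocab)
print(dense)
```

The number of columns equals the size of the vocabulary learned from the documents, which matches the description above.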

# Packages

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
```

# Read CSV as DataFrame

```python
df = pd.read_csv('movie_review.csv')
```

# DataFrame preview

```
<bound method NDFrame.head of        fold_id cv_tag  html_id  sent_id                                               text  tag
0            0  cv000    29590        0  films adapted from comic books have had plenty...  pos
1            0  cv000    29590        1  for starters , it was created by alan moore ( ...  pos
2            0  cv000    29590        2  to say moore and campbell thoroughly researche...  pos
3            0  cv000    29590        3  the book ( or " graphic novel , " if you will ...  pos
4            0  cv000    29590        4  in other words , don't dismiss this film becau...  pos
...        ...    ...      ...      ...                                                ...  ...
64715        9  cv999    14636       20  that lack of inspiration can be traced back to...  neg
64716        9  cv999    14636       21  like too many of the skits on the current inca...  neg
64717        9  cv999    14636       22  after watching one of the " roxbury " skits on...  neg
64718        9  cv999    14636       23   bump unsuspecting women , and . . . that's all .  neg
64719        9  cv999    14636       24  after watching _a_night_at_the_roxbury_ , you'...  neg

[64720 rows x 6 columns]>
```

# Preparing Data

```python
X = df['text']  # raw sentence text (features)
y = df['tag']   # sentiment label: 'pos' or 'neg'
```

# Vectorize Data

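```python
# Count single words and two-word sequences (unigrams and bigrams)
vect = CountVectorizer(ngram_range=(1, 2))

# Learn the vocabulary and build the sparse document-term matrix
X = vect.fit_transform(X)
```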
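To show what `ngram_range=(1, 2)` adds, the following sketch compares the vocabulary learned with and without bigrams on two made-up phrases (not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up phrases, for illustration only.
docs = ["not good", "very good"]

# Unigrams only (the default) versus unigrams plus bigrams.
unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi_vect = CountVectorizer(ngram_range=(1, 2)).fit(docs)

unigram_vocab = sorted(unigram_vect.vocabulary_)
uni_bi_vocab = sorted(uni_bi_vect.vocabulary_)

print(unigram_vocab)
print(uni_bi_vocab)
```

With bigrams, "not good" and "very good" become separate features, so the model can tell them apart even though both contain "good". The trade-off is a much larger vocabulary.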

# Split data into random train and test subsets

```python
# By default 25% of the rows are held out for testing, and the split changes on every run
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
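If you want the split to be reproducible, one option is to pass `random_state`; `stratify` additionally keeps the class proportions equal in both subsets. The sketch below uses placeholder texts and labels; the values `random_state=42` and `test_size=0.25` are arbitrary choices for illustration, not settings from the original notebook.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 4 positive and 4 negative items.
texts = [f"review {i}" for i in range(8)]
labels = ["pos"] * 4 + ["neg"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.25,     # hold out 25% of the rows (this is also the default)
    random_state=42,    # fixed seed, so the same split is produced every run
    stratify=labels,    # keep the pos/neg ratio the same in train and test
)

print(len(X_tr), len(X_te), sorted(y_te))
```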

# Train Bayesian Model

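```python
# Naive Bayes with default settings (alpha=1.0 smoothing, counts binarized at 0)
model = BernoulliNB()

# Learn per-class word presence probabilities from the training rows
model.fit(X_train, y_train)
```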
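One thing to be aware of in the steps above: the vectorizer is fitted on the full dataset before the split, so the vocabulary is partly learned from rows that later end up in the test set. A possible alternative, sketched here on made-up sentences, is to split the raw text first and wrap both steps in a scikit-learn `Pipeline`, so that the vectorizer only ever sees training data. This is an illustrative variant, not the approach used in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up sentences and labels, for illustration only.
texts = [
    "a wonderful and moving film",
    "great acting and a clever story",
    "funny , warm and charming",
    "one of the best films of the year",
    "a boring and terrible mess",
    "awful plot and wooden acting",
    "dull , slow and lifeless",
    "one of the worst films of the year",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Split the raw text first, before any vocabulary is learned.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only,
# then reuses that vocabulary when transforming the test texts.
pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), BernoulliNB())
pipe.fit(X_tr, y_tr)

test_predictions = pipe.predict(X_te)
print(test_predictions)
```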

# Predict

```python
p_train = model.predict(X_train)  # predictions on rows the model was trained on
p_test = model.predict(X_test)    # predictions on held-out rows
```
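The same `predict` call also works for text the model has never seen, as long as the text is converted with the already fitted vectorizer (`transform`, not `fit_transform`). The sketch below trains on four made-up sentences so that it is self-contained; the sentences and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up training set, for illustration only.
train_texts = [
    "a wonderful , moving film",
    "great acting and a great story",
    "a boring , terrible mess",
    "awful plot and terrible acting",
]
train_tags = ["pos", "pos", "neg", "neg"]

vect = CountVectorizer(ngram_range=(1, 2))
X_train = vect.fit_transform(train_texts)
model = BernoulliNB().fit(X_train, train_tags)

# New text must be mapped onto the vocabulary learned during training.
new_texts = ["a wonderful story", "a terrible film"]
X_new = vect.transform(new_texts)

predictions = model.predict(X_new)          # one label per new text
probabilities = model.predict_proba(X_new)  # columns follow model.classes_

print(list(zip(new_texts, predictions)))
```

Words or bigrams that were not in the training vocabulary (here, "wonderful story") are silently ignored by `transform`.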

# Calculating the Accuracy

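`accuracy_score` computes the accuracy classification score: the fraction of predictions that match the true labels.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. Here each row has a single label (`pos` or `neg`), so the score is simply the share of rows classified correctly.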

```python
acc_train = accuracy_score(y_train, p_train)
acc_test = accuracy_score(y_test, p_test)
```
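For single-label data, the accuracy score is just the number of correct predictions divided by the number of samples. The short sketch below checks this on made-up labels (not results from the model above).

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions: 4 of the 5 predictions are correct.
y_true = ["pos", "neg", "pos", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# Manual computation: share of positions where prediction equals truth.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The library function returns the same number.
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```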

# Result

```python
print(f'Train ACC: {acc_train}, Test ACC: {acc_test}')

# Output:
# Train ACC: 0.9564276885043264, Test ACC: 0.6988875154511743
```

# Notebook

Notebook available on Kaggle: https://www.kaggle.com/davidesantangelo/movie-review-sentiment-analysis
