Amal Shaji

Posted on Aug 14, 2020 • Edited on Mar 29, 2021 • Originally published at amalshaji.wtf

Building the classifier - Part I (Live tweet sentiment analysis)

#python #machinelearning

In this series, I will show how to build a sentiment analysis app that can run the analysis on any hashtag. The series is divided into three parts:

Building the classifier
Building the backend
Building the frontend

In the final post, we'll bring everything together to make the app. We'll use tools like NLTK, Docker, Streamlit, and FastAPI, and a link to the code will be provided.

Final product

Building the classifier

We'll be using a pre-trained model that I trained using open-source code. This is not a SOTA (state-of-the-art) model, but it should be fine for our task.

Let's begin by making a project directory.

mkdir sentwitter && cd sentwitter
mkdir backend
mkdir frontend

Download the trained model into the backend/models directory.

mkdir -p backend/models
wget https://raw.githubusercontent.com/amalshaji/sentwitter/master/backend/models/sentiment_model.pickle

mv sentiment_model.pickle backend/models/

Install required libraries

python3 -m pip install nltk
python3 -m nltk.downloader punkt
python3 -m nltk.downloader wordnet
python3 -m nltk.downloader stopwords
python3 -m nltk.downloader averaged_perceptron_tagger

Let's write a function to pre-process the input (a tweet)

# backend/classify.py
import re
import string

from nltk import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize


def remove_noise(tweet_tokens, stop_words=()):
    """Strip links and mentions, lemmatize, and drop punctuation and stop words."""
    cleaned_tokens = []
    # create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # remove links and @mentions
        token = re.sub(
            r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
            r"(?:%[0-9a-fA-F][0-9a-fA-F]))+",
            "",
            token,
        )
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # map the POS tag to the part of speech the lemmatizer expects
        if tag.startswith("NN"):
            pos = "n"
        elif tag.startswith("VB"):
            pos = "v"
        else:
            pos = "a"

        token = lemmatizer.lemmatize(token, pos)

        # drop empty tokens, punctuation, and stop words
        if (
            len(token) > 0
            and token not in string.punctuation
            and token.lower() not in stop_words
        ):
            cleaned_tokens.append(token.lower())

    return cleaned_tokens
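
To see what the per-token steps do without loading any NLTK data, here is a minimal sketch that uses only the standard library. It reuses the two regular expressions from remove_noise and the same tag-to-part-of-speech mapping. The helper names strip_noise and wordnet_pos are made up for this illustration and are not part of the project code.

```python
import re

# Same patterns as in remove_noise, written as raw strings.
URL_PATTERN = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
    r"(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
MENTION_PATTERN = re.compile(r"(@[A-Za-z0-9_]+)")


def strip_noise(token):
    """Hypothetical helper: drop links and @mentions from one token."""
    token = URL_PATTERN.sub("", token)
    return MENTION_PATTERN.sub("", token)


def wordnet_pos(tag):
    """Hypothetical helper: map a POS tag to the letter given to the lemmatizer."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    return "a"  # everything else is treated as an adjective


for token in ["https://t.co/abc123", "@amalshaji", "awesome"]:
    print(repr(token), "->", repr(strip_noise(token)))

for tag in ["NNS", "VBD", "JJ", "RB"]:
    print(tag, "->", wordnet_pos(tag))
```

Note that any tag that is neither a noun nor a verb falls back to "a", so adverbs and other word classes are lemmatized as if they were adjectives.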

Write a helper function to load the model.

# backend/utils.py

import pickle


def load_model():
    # the path is relative to the working directory, so run this from backend/
    with open("./models/sentiment_model.pickle", "rb") as f:
        classifier = pickle.load(f)

    return classifier
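
The relative path in load_model only resolves when the script is started from the backend directory. One possible variation, sketched below, takes the path as a parameter and defaults to a location next to the module itself. The model_path parameter and the DEFAULT_MODEL_PATH constant are assumptions made for this example, not part of the original code. The demo pickles a plain dictionary as a stand-in so it runs without the real model file.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed layout: models/ sits next to this file, as in backend/models.
# Falls back to the working directory when __file__ is not defined.
_BASE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
DEFAULT_MODEL_PATH = _BASE_DIR / "models" / "sentiment_model.pickle"


def load_model(model_path=DEFAULT_MODEL_PATH):
    """Unpickle and return the object stored at model_path."""
    with open(model_path, "rb") as f:
        return pickle.load(f)


# Round-trip a stand-in object through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    stand_in_path = Path(tmp_dir) / "stand_in.pickle"
    with open(stand_in_path, "wb") as f:
        pickle.dump({"label": "Positive"}, f)
    loaded = load_model(stand_in_path)

print(loaded)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files from a source you trust.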

Test out the classifier

# backend/test.py

from utils import load_model
from classify import remove_noise
from nltk.tokenize import word_tokenize

model = load_model()

while True:
    _input = input("Enter a sample sentence: ")
    custom_tokens = remove_noise(word_tokenize(_input))
    result = model.classify(dict([token, True] for token in custom_tokens))
    print(f"{_input}: {result}")

Output

❯ python .\test.py
Enter a sample sentence: I am awesome
I am awesome: Positive
Enter a sample sentence: I hate you
I hate you: Negative
Enter a sample sentence: I have a gun
I have a gun: Positive
Enter a sample sentence: I like your hair
I like your hair: Negative

This isn't the best model 😂: the gun sentence is labeled Positive and the hair sentence Negative. Feel free to build your own model or try heavier ones based on transformers (Hugging Face).
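
Looking back at the test script, the classifier is handed a dictionary that maps each cleaned token to True. The sketch below isolates that step and wraps it in a small function so it can be tried without the pickle file. to_features, classify_tokens, and KeywordStub are hypothetical names for this illustration. The stub only mimics a classify(features) method with a single keyword rule and is not the trained model.

```python
def to_features(tokens):
    """Map each token to True, the same shape the test script passes to classify."""
    return {token: True for token in tokens}


class KeywordStub:
    """Hypothetical stand-in exposing classify(features); not the trained model."""

    def classify(self, features):
        # One hard-coded rule, just enough to exercise the call.
        return "Negative" if "hate" in features else "Positive"


def classify_tokens(model, tokens):
    """Build the feature dictionary and ask the model for a label."""
    return model.classify(to_features(tokens))


stub = KeywordStub()
print(to_features(["i", "hate", "you"]))
print(classify_tokens(stub, ["i", "hate", "you"]))
print(classify_tokens(stub, ["i", "be", "awesome"]))
```

Because the features form a dictionary, a token that appears several times in a tweet counts only once.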

In the next post, we'll be building the backend to serve predictions through an API.

References

The open-source code used to train the model came from a project I worked on a long time ago, and I can't find the source. If you do, or if you are the author, please comment with the link.
Article Cover by MorningBrew
Series Cover by Ravi Sharma
