Hi! I’m Przemysław from AI Development Company Profil Software. This is my second article about AI. If you want to find out how easily object classification can be fooled, have a look at my first article.
This time I will try to present some hands-on examples of how to deal with simple NLP tasks like sentiment analysis. The solutions used in this article could be easily reused in other classification tasks. We will traverse the rough trails of AI starting with data cleaning and preprocessing, then move on to model definition and visualising the results. The solution was prepared using Google Colab and will be shared at the end. So are you ready? ;)
Requirements
In Google Colab, no external packages need to be installed to run the solution; libraries like scikit-learn are already included. If you move to another environment, you can list the installed dependencies by running !pip freeze in a notebook cell (for example !pip freeze > requirements.txt to save them to a file).
Dataset
In this article I will be using the Sentiment140 Twitter dataset from Kaggle. The dataset consists of 1.6M tweets written in English and extracted using the Twitter API. They are grouped into 2 classes (named targets):
- negative (target 0 in csv)
- positive (target 4 in csv)
The dataset also contains other columns like the corresponding date or the user that posted the tweet. For the purpose of this article, we will be using the text and target info only. The distribution of the data over the 2 classes can be useful for further processing. The part of the code responsible for loading the dataset and counting the target distribution is presented below:
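import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='data/training.1600000.processed.noemoticon.csv',
    encoding='ISO-8859-1',
    usecols=[0, 5],  # column 0 holds the target, column 5 the tweet text
    names=['target', 'text'],
    engine='python',
    on_bad_lines='skip',  # replaces error_bad_lines=False, which was removed in pandas 2.0
)
# plot how many tweets fall into each class
df['target'].value_counts().plot(kind='bar')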
Sample rows of the raw dataset are displayed below:
At first glance we can see that the data is a bit dirty and can be cleaned to remove noise like links and mentions. I applied some simple preprocessing using pandas utilities, with the following steps:
- remove all urls, user mentions and hashtags
- keep only letters and digits
- remove extra spaces
- convert everything to lowercase
- rename target class 4 -> 1
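The steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the notebook: the regular expressions, the clean_tweet helper and the TARGET_MAP dictionary are my assumptions. Punctuation is dropped without inserting a space, which is consistent with features such as 'doesnt hurt' that show up later in the article.

```python
import re

# Assumed patterns; the original notebook may use different ones.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_HASHTAG_RE = re.compile(r'[@#]\w+')
NOT_LETTER_DIGIT_RE = re.compile(r'[^A-Za-z0-9\s]')
SPACES_RE = re.compile(r'\s+')

# Hypothetical mapping for the last step: target class 4 -> 1
TARGET_MAP = {0: 0, 4: 1}


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps from the list above to a single tweet."""
    text = URL_RE.sub(' ', text)               # remove urls first (they may contain '#')
    text = MENTION_HASHTAG_RE.sub(' ', text)   # remove user mentions and hashtags
    text = NOT_LETTER_DIGIT_RE.sub('', text)   # keep only letters, digits and whitespace
    text = SPACES_RE.sub(' ', text).strip()    # remove extra spaces
    return text.lower()                        # convert everything to lowercase


cleaned = clean_tweet("@user I don't LIKE this!! http://t.co/abc #sad   ")
print(cleaned)
```

With the DataFrame from the loading step, the same function could be applied with df['text'] = df['text'].map(clean_tweet), and the target renamed with df['target'] = df['target'].map(TARGET_MAP).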
Sample output of preprocessed data (can be compared with previous image):
Data split
To be able to train our AI model we need to first split the data into train and test sets. The code below shows how to do it:
from sklearn.model_selection import train_test_split

TEST_SIZE = .1
RANDOM_STATE = 123

def split_data(test_size: float = TEST_SIZE, random_state: int = RANDOM_STATE):
    X = df['text']
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=test_size,
                                                        random_state=random_state,
                                                        stratify=y)
    return X_train, X_test, y_train, y_test

# the resulting sets are used in the following sections
X_train, X_test, y_train, y_test = split_data()
I used the random_state option to make the split reproducible between experiments, while the stratify option keeps the distribution of classes similar in both sets.
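To see what these two options do, here is a small sketch on made-up toy data (the texts, labels and the split helper are illustrative assumptions, not part of the dataset). It splits an imbalanced set twice with the same seed and counts the classes in each part.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, made-up data: 80 samples of class 0 and 20 samples of class 1
texts = [f'tweet {i}' for i in range(100)]
labels = [0] * 80 + [1] * 20


def split(seed: int):
    """Stratified 80/20 split of the toy data with a fixed seed."""
    return train_test_split(texts, labels,
                            test_size=0.2,
                            random_state=seed,
                            stratify=labels)


X_tr, X_te, y_tr, y_te = split(seed=123)
X_tr2, X_te2, y_tr2, y_te2 = split(seed=123)

train_counts = Counter(y_tr)   # class counts in the train part
test_counts = Counter(y_te)    # class counts in the test part
same_split = X_te == X_te2     # same seed should give the same test samples

print(train_counts, test_counts, same_split)
```

Both parts keep the original 4:1 class ratio, and repeating the split with the same seed returns the same samples.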
Model
The first model that will be checked is the so-called bag-of-words model, a simplifying representation used in NLP. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Its visualisation is presented below:
I will implement it using CountVectorizer from sklearn which converts a collection of text documents to a matrix of token counts. This approach will then enable us to use preprocessed vectors as an input for the LogisticRegression model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

NGRAM_RANGE = 2
MAX_ITER = 1000

# count single words and two-word sequences
vectorizer = CountVectorizer(ngram_range=(1, NGRAM_RANGE))
# learn the vocabulary on the training set only, then reuse it for the test set
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

log_reg = LogisticRegression(max_iter=MAX_ITER, random_state=RANDOM_STATE)
log_reg = log_reg.fit(X_train, y_train)
As you can see, with just a few lines of code we can prepare a fully working model. In the above code, NGRAM_RANGE sets the maximum number of consecutive words that are treated as a single feature (here single words and two-word pairs). MAX_ITER caps the number of iterations the solver may run; for such big datasets the cap keeps the training time bounded, at the cost of possibly stopping before the solver has fully converged.
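To make the n-gram features more concrete, here is a minimal sketch of the counting idea in plain Python. It is not how CountVectorizer is implemented: bag_of_ngrams is a hypothetical helper, and unlike CountVectorizer's default tokenizer it splits on whitespace only and keeps one-character tokens.

```python
from collections import Counter


def bag_of_ngrams(text: str, max_n: int = 2) -> Counter:
    """Count every sequence of 1..max_n consecutive words in the text."""
    tokens = text.split()
    bag = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            bag[' '.join(tokens[start:start + n])] += 1
    return bag


bag = bag_of_ngrams('not bad not bad at all')
for ngram, count in sorted(bag.items()):
    print(f'{ngram!r}: {count}')
```

Word order is lost at the level of the whole sentence, but a pair such as 'not bad' survives as its own feature, which is why two-word features can carry a different sentiment than their single words.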
Results
And voilà, that was it, we did it. Now let’s see what results we receive.
For sklearn models, we can use visualisation functions that help with ad hoc prototyping.
- classification_report: takes the true labels and the predictions and prepares a printable report
- ConfusionMatrixDisplay.from_estimator: plots a heat-map of classification; for binary classification it will have a structure of 4 squares (it replaces plot_confusion_matrix, which was removed in scikit-learn 1.2)
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# classification_report expects the true labels first, then the predictions
print(classification_report(y_train, log_reg.predict(X_train)))
print(classification_report(y_test, log_reg.predict(X_test)))
ConfusionMatrixDisplay.from_estimator(log_reg, X_train, y_train)
ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test)
Outputs of both functions are presented below:
As we can see, the model reaches above 80% accuracy on unseen samples, which is a great result given how little code was needed to achieve it.
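The numbers in the report follow directly from the four squares of the binary confusion matrix. The sketch below shows the arithmetic on a handful of made-up labels (they are not results from the model above); confusion_cells is a hypothetical helper written for this illustration.

```python
def confusion_cells(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up toy labels, only to illustrate the arithmetic
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_cells(toy_true, toy_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are right
recall = tp / (tp + fn)                      # how many real positives were found

print(tn, fp, fn, tp, accuracy, precision, recall)
```

Precision and recall are not symmetric in the true and predicted labels, so the order in which they are passed to a reporting function matters.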
Feature importance
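The logistic regression model learns one coefficient per feature (in our case, single words and two-word pairs) during fitting. Features with the highest coefficients push a prediction towards the positive label, and features with the lowest coefficients push it towards the negative one, so sorting the coefficients shows which features are the most influential. Below is a short snippet: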
import numpy as np

top_n = 10
# get_feature_names() was removed in scikit-learn 1.2
features = vectorizer.get_feature_names_out()
# get features with the highest coefficients (pushing towards the positive class)
positive = np.argsort(log_reg.coef_[0])[::-1][:top_n]
# get features with the lowest coefficients (pushing towards the negative class)
negative = np.argsort(log_reg.coef_[0])[:top_n]
print(f'\n Positive features: \n {[features[x] for x in positive]}')
print(f'\n Negative features: \n {[features[x] for x in negative]}')
Positive features: ['not sad', 'no problem', 'doesnt hurt', 'not bad', 'no problems', 'no prob', 'not problem', 'never too', 'no probs', 'cant miss']
Negative features: ['clean me', 'not happy', 'sad', 'passed away', 'rip', 'not looking', 'funeral', 'headache', 'disappointing', 'upsetting']
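Many of the positive features are negated pairs such as 'not bad'. A rough sketch of why this works: logistic regression sums the coefficients of all features present in a text and passes the result through a sigmoid. The coefficient values below are hypothetical numbers chosen for illustration, not the ones learned by the model, and positive_probability is a helper I made up.

```python
import math

# Hypothetical coefficients, NOT taken from the trained model
coefs = {'not': -0.3, 'bad': -1.2, 'not bad': 2.0, 'sad': -2.0}
intercept = 0.0


def positive_probability(feature_counts: dict) -> float:
    """Sigmoid of the weighted sum of feature counts (unknown features count as 0)."""
    score = intercept + sum(coefs.get(feature, 0.0) * count
                            for feature, count in feature_counts.items())
    return 1.0 / (1.0 + math.exp(-score))


# 'not bad' contains the unigrams 'not' and 'bad' plus the bigram 'not bad'
p_not_bad = positive_probability({'not': 1, 'bad': 1, 'not bad': 1})
# 'bad' alone only triggers the negative unigram
p_bad = positive_probability({'bad': 1})

print(round(p_not_bad, 3), round(p_bad, 3))
```

In this toy setup the large positive weight of the pair outweighs the negative weights of its single words, so the text 'not bad' ends up on the positive side while 'bad' alone does not.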
Code
Here is the complete solution (https://github.com/profilsoftware/sentiment-detection-article):
Next steps
In the next article I will try to implement a model doing the same task but built with neural networks. I can’t wait for it, and I hope you can’t either!
Source:
https://colab.research.google.com/
https://www.kaggle.com/kazanova/sentiment140
Thanks to Katarzyna Latarska.