DEV Community: manifoldmindaway

Predicting Tweet Sentiments about Apple with the TfidfVectorizer

manifoldmindaway — Mon, 04 Jul 2022 19:46:32 +0000

Tweets come with several numeric signals for ranking them, such as likes, comments and retweets. However, training a model to read the text of a tweet and label its sentiment has many advantages. This project takes a dataset from Kaggle filled with tweets about Apple labeled as negative, neutral or positive. For simplicity, we will turn this into a binary classification problem by reducing the dataset to tweets labeled negative or positive. Negative tweets are also relabeled from '-1' to '0', while positive tweets remain labeled '1'.

The first step will be to import the dataset with the neutral tweets removed, replace '-1' sentiments with '0' and print the first five rows.

import pandas as pd
df = pd.read_csv('apple_twitter_sentiment.csv')
df['sentiment'] = df['sentiment'].replace({-1: 0})
df.head()

[output]

Next, we'll call df.info() to make sure each column has the same number of non-null entries, so no values are missing. This also confirms our datatypes.

df.info()

[output]

Now let's check the distribution of our target variable. For that I have written a function that returns a DataFrame with the value counts for the column as sums and percentages, and also plots the distribution.

# function to present and plot the distribution of values in a series
# the plot is drawn automatically; the summary df is returned for display
def summerize_value_counts(series):

    # extract name of series
    series_name = series.name

    # make dataframe to display value count sum and percentage for series 
    series_count = series.value_counts().rename('sum')
    series_perc = series.value_counts(normalize=True).round(2).rename('percentage')
    series_values_df = pd.concat([series_count, series_perc], axis=1)

    # plot series distribution 
    plot = series_values_df['sum'].plot(kind='bar', title=f'Distribution of {series_name.title()} Column', 
                                 xlabel=f'{series_name.title()}', ylabel='Count');

    # rename df index to series name
    series_values_df.index.name = series_name

    return series_values_df

Passing the sentiment column through the function shows that we have a large class imbalance between negative and positive sentiments. First we'll proceed as normal, but later we'll resample the data to address the imbalance and see if the model improves.

summerize_value_counts(df.sentiment)

[output]

To start, we'll have to separate our tweets and sentiments into different variables so we can further split them into a training and testing set.

text = df.text
sentiment = df.sentiment

sklearn's train_test_split function does this easily. Passing in our features and then our target gives us a training and testing set for each. It also lets us set a random state so that the rows in each set stay the same every time we rerun our models.

from sklearn.model_selection import train_test_split

# random state of 0 is established for the data
X_train, X_test, y_train, y_test = train_test_split(text, sentiment, random_state=0, test_size=0.25)

Now to handle our text data. We cannot pass strings into our models, so first we'll have to convert each tweet into a numerical representation. sklearn provides this ability with the TfidfVectorizer(). It converts our column of text into a matrix where each column is a unique word that appears in the dataset as a whole and every row is a tweet, with a floating point value in each column whose word appears in the tweet. With the TfidfVectorizer(), that value reflects not just how often the word appears in the tweet, but also how many tweets in the dataset contain it. Words that appear in nearly every tweet are down-weighted, so they do not drown out the rarer, more distinctive words that help the model discover underlying patterns in the data.
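To make the weighting concrete, here is a minimal pure-Python sketch of the calculation on a toy corpus. The three tweets are made up for illustration, and the formula (smoothed inverse document frequency followed by L2 normalization of each row) is my understanding of the vectorizer's default settings rather than a copy of its code.

```python
import math
from collections import Counter

# toy corpus (hypothetical tweets, not taken from the dataset)
docs = [
    "apple phone great",
    "apple phone slow",
    "apple battery terrible",
]

n_docs = len(docs)
tokenized = [doc.split() for doc in docs]

# document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tfidf_row(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


first = tfidf_row(tokenized[0])
for word, weight in sorted(first.items(), key=lambda item: item[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the first tweet, 'apple' (present in every tweet) gets the smallest weight and 'great' (present in only one) gets the largest, which is the down-weighting of ubiquitous words described above.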

To vectorize the data, all we have to do is import the vectorizer, fit and transform the training data with it and then transform the testing data.

from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()

X_train_tf = tf.fit_transform(X_train)
X_test_tf = tf.transform(X_test)

To view the matrix, all we have to do is convert it into a pandas DataFrame by passing in the matrix converted into an array with the .toarray() method and passing the output of the vectorizer's .get_feature_names_out() method to 'columns'. (Older scikit-learn releases called this method .get_feature_names(); it was removed in version 1.2.)

df_tf = pd.DataFrame(X_train_tf.toarray(), columns=tf.get_feature_names_out())
df_tf.head()

[output]

Now that our string data has been converted into numerical form, we can pass it into the model. Since this is a binary classifier, I chose LogisticRegression, but any binary classifier will work.

This is as easy as importing the model and fitting it on the vectorized training features and the sentiment labels in y_train.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train_tf, y_train)

[output]

LogisticRegression()

To score the model, we'll first use the model's .predict() method on the vectorized testing features to retrieve an array of predicted labels.

from sklearn.metrics import accuracy_score

y_pred_tf = clf.predict(X_test_tf)

print(f'accuracy: {accuracy_score(y_test, y_pred_tf)}')

[output]

accuracy: 0.8413461538461539

To dig deeper into our results, first we'll plot a confusion matrix showing the count of, from left to right and top to bottom, True Negatives, False Positives, False Negatives, and True Positives.

Next we'll plot the ROC curve for both the training and testing data. Here, the greater the area under the curve, the better the model is performing.

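# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_test_tf, y_test);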

[output]

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_estimator(clf, X_train_tf, y_train, name='Train', ax=ax)
RocCurveDisplay.from_estimator(clf, X_test_tf, y_test, name='Test', ax=ax);

[output]

The results of our accuracy score show us our model is performing well, and our ROC curve tells us it is performing even better, but looking at the bottom row of our confusion matrix tells a different story. Our model mislabeled 33 of our positive tweets as negative and only correctly labeled 4 positive tweets.
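One way to put a number on that bottom row is recall for the positive class, the share of truly positive tweets the model found. A small worked calculation, using only the two counts quoted above:

```python
# counts read from the bottom row of the confusion matrix described above
false_negatives = 33  # positive tweets predicted as negative
true_positives = 4    # positive tweets predicted as positive

# recall = TP / (TP + FN)
recall = true_positives / (true_positives + false_negatives)
print(f"recall for the positive class: {recall:.3f}")
```

This comes out to roughly 0.108, so the model catches only about one in nine positive tweets even though its overall accuracy looks healthy.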

This is because our target variable has the large class imbalance we discovered earlier. One way of dealing with this is SMOTE, or Synthetic Minority Over-sampling Technique, which adds synthetic examples of the under-represented class to the training data by interpolating between existing minority-class examples and their nearest minority-class neighbors. To see if this helps the performance of our model, we'll redo our logistic regression with resampled data.

This is easily done in Python with the imbalanced-learn library, which is built on top of sklearn. To implement it, we first create an instance of a SMOTE sampling object. For simplicity, we'll only pass a random state as an argument, using the same value we passed earlier. This means the default sampling strategy is used, which resamples all classes except the majority class. For us, that means over-sampling the tweets labeled positive.
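For intuition, the core step of SMOTE can be sketched in a few lines of plain Python. This is only an illustration of the interpolation idea, not imbalanced-learn's implementation: the two feature vectors and the helper name smote_like_sample are hypothetical, and the real algorithm picks the neighbor from among a point's nearest minority-class neighbors rather than taking it as an argument.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the line segment between two minority-class points."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(0)

# two hypothetical minority-class feature vectors (for example, tf-idf rows)
minority_a = [0.0, 0.5, 0.2]
minority_b = [0.4, 0.1, 0.2]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two originals, so the new example stays inside the region the minority class already occupies.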

Let's redo our vectorizing and modeling process with SMOTE to see if our model gets better at predicting true positives.

tf_sm = TfidfVectorizer()

X_train_tf_sm = tf_sm.fit_transform(X_train)
X_test_tf_sm = tf_sm.transform(X_test)

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)

# fit_sample was renamed fit_resample in newer imbalanced-learn releases
X_train_sm, y_train_sm = smote.fit_resample(X_train_tf_sm, y_train)

clf_sm = LogisticRegression()
clf_sm.fit(X_train_sm, y_train_sm)

y_pred_tf_sm = clf_sm.predict(X_test_tf_sm)
print(f'SMOTEd accuracy: {accuracy_score(y_test, y_pred_tf_sm)}')

[output]

SMOTEd accuracy: 0.8942307692307693

from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

ConfusionMatrixDisplay.from_estimator(clf_sm, X_test_tf_sm, y_test);

# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_estimator(clf_sm, X_train_sm, y_train_sm, name='Train', ax=ax)
RocCurveDisplay.from_estimator(clf_sm, X_test_tf_sm, y_test, name='Test', ax=ax);

[output]

As you can see, the ROC curve shows about the same performance. However, this new model trained on SMOTE-resampled data correctly predicted 23 more true positives than the original, with an accuracy score about five percentage points higher.

As a final model, I'll give a quick example of the powerful preprocessing tools nltk comes with. A more detailed walkthrough can be found in my previous post here.

When working with textual data, preprocessing that helps models deal with the lexical meaning of words can greatly improve predictions from longer and more complex text features. Two steps toward this are removing filler words that carry little meaning on their own and condensing closely related words to a common form. In nltk the filler words are called 'stop words', and removing them can reduce the noise in textual data so our models can focus on the important words. Condensing related words is called lemmatizing in nltk. It involves identifying the part of speech of each word so that words whose spellings differ only because they are inflected forms of the same word are combined, while words with genuinely different meanings stay separate.

Below I will import libraries to handle these tasks, preprocess the data we have been working with using two functions, then re-apply SMOTE and refit the model to see how these preprocessing techniques affect our model's performance.

import re

import nltk
from nltk import pos_tag
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

# This function gets the correct Part of Speech so the Lemmatizer can work
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def text_prep(text, sw):
    # sw: collection of stop words to drop (passed in so it is only built once)
    # keep runs of letters, optionally followed by ’ and more lowercase letters
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    text = regex_token.tokenize(text)
    text = [word for word in text if word not in sw]
    # tag each word with its part of speech, then map to WordNet tags
    text = pos_tag(text)
    text = [(word[0], get_wordnet_pos(word[1])) for word in text]
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word[0], word[1]) for word in text]
    return ' '.join(text)

tf_tok = TfidfVectorizer()

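sw = stopwords.words('english')
X_train_tokenized = [text_prep(text, sw) for text in X_train]

X_train_tf_tok = tf_tok.fit_transform(X_train_tokenized)
# note: the test tweets are vectorized without text_prep here; running them
# through text_prep as well would keep train and test preprocessing consistent
# (and may change the scores reported below)
X_test_tf_tok = tf_tok.transform(X_test)

smote2 = SMOTE(random_state=0)
# fit_sample was renamed fit_resample in newer imbalanced-learn releases
X_train_sm2, y_train_sm2 = smote2.fit_resample(X_train_tf_tok, y_train)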


clf_sm_tok = LogisticRegression()

clf_sm_tok.fit(X_train_sm2, y_train_sm2)
y_pred_tf_tok = clf_sm_tok.predict(X_test_tf_tok)

accuracy_score(y_test, y_pred_tf_tok)

[output]

0.9038461538461539

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(clf_sm_tok, X_test_tf_tok, y_test);

[output]

from sklearn.metrics import RocCurveDisplay

# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_estimator(clf_sm_tok, X_train_sm2, y_train_sm2, name='Train', ax=ax)
RocCurveDisplay.from_estimator(clf_sm_tok, X_test_tf_tok, y_test, name='Test', ax=ax);

[output]

As you can see, preprocessing the tweets raised accuracy to about 90%, roughly six percentage points above our original model and about one point above the SMOTE-only model.

IMDb sentiment modeling with the CountVectorizer

manifoldmindaway — Mon, 04 Jul 2022 19:35:09 +0000

Analyzing text data can seem much more difficult than analyzing numeric data. However, the sklearn library provides useful feature extraction tools that can transform any text data you may have into a numerical format. In addition to sklearn, the Natural Language Toolkit (NLTK) library provides many tools for further preprocessing longer textual data.

Below I will show how to use sklearn's CountVectorizer to convert reviews from strings into a matrix filled with numeric values for each review, plus some of NLTK's preprocessing tools, in order to predict the sentiment of IMDb reviews. The data set comes from Kaggle, where I have taken the training set and testing set but omitted the validation set for simplicity.

The first step will be to load the data sets into pandas DataFrames, look at the first rows, and use df.info() to gain insights into the size of each dataset, the data types used and whether there are any missing values.

import pandas as pd

df_train = pd.read_csv('imdb_review_sentiment_train.csv')
df_test = pd.read_csv('imdb_review_sentiment_test.csv')

df_train.head()

[output]

As you can see, the datasets contain two columns: one with the text of a review, and the other with the sentiment of the review under 'label'. Inspecting examples of each showed that a '0' indicates a negative review and a '1' indicates a positive one. There are also eight times as many reviews in our training set as in our testing set, which should help the performance of our model.

df_train.info()

[output]

df_test.info()

[output]

Our next step will be to check out the distribution of our target variable to see if there is any class imbalance among the sentiments. To do this I have written a function that takes in a pandas Series, draws a bar chart of the Series counts and returns a DataFrame presenting the Series' value_counts() output as sums and percentages.

# function to present and plot the distribution of values in a series
# the plot is drawn automatically; the summary df is returned for display
def summerize_value_counts(series):

    # extract name of series
    series_name = series.name

    # make dataframe to display value count sum and percentage for series 
    series_count = series.value_counts().rename('sum')
    series_perc = series.value_counts(normalize=True).round(2).rename('percentage')
    series_values_df = pd.concat([series_count, series_perc], axis=1)

    # plot series distribution 
    plot = series_values_df['sum'].plot(kind='bar', title=f'Distribution of {series_name.title()} Column', 
                                 xlabel=f'{series_name.title()}', ylabel='Count');

    # rename df index to series name
    series_values_df.index.name = series_name

    return series_values_df

Passing the label (sentiment) column of both the training and testing datasets through the function shows us that each is perfectly evenly distributed. We don't have to worry about class imbalance.

summerize_value_counts(df_train.label)

[output]

summerize_value_counts(df_test.label)

Next we will have to separate the data into our independent (reviews, or X variable) and dependent (sentiment, or y variable) variables for both the training and test sets. Additionally, let's convert all text values to lowercase to remove the effects of word case in our modeling.

# training data
X_train = df_train.text.str.lower()
y_train = df_train.label 

# testing data
X_test = df_test.text.str.lower()
y_test = df_test.label

However, our reviews are still strings, which the model cannot work with directly, so to turn each review into numeric values we'll use sklearn's CountVectorizer. This will turn our column of reviews into a matrix in which every word occurring in our dataset (the corpus, in NLP nomenclature) becomes a column and every review (a document, in NLP) becomes a row of integer counts. Intuitively, each cell holds the number of times that column's word appears in that row's review, so most cells are zero.

Doing this is easy and only takes a few lines of code. After importing the CountVectorizer, simply create an instance, fit and transform the training data (so we have columns of words and rows of reviews), and then transform the testing data.

from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer
cv = CountVectorizer()

# fit and transform training data, then transform testing data
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
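To see what the resulting matrix holds, here is a small pure-Python sketch of the same bag-of-words idea on two made-up reviews. It splits on whitespace for simplicity, which is an assumption of this sketch and not how CountVectorizer tokenizes.

```python
from collections import Counter

# two hypothetical reviews, already lowercased
reviews = [
    "the movie was bad bad bad",
    "the acting was great",
]

tokenized = [review.split() for review in reviews]

# the vocabulary is every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# each row holds the count of every vocabulary word in one review
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Note that 'bad' is stored as 3 in the first row, not 1: by default the cells are counts rather than presence flags.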

Trying to view X_train_cv shows that it is stored in Compressed Sparse Row (CSR) format, which cannot be inspected directly without some extra code.

X_train_cv

[output]

<40000x92908 sparse matrix of type '<class 'numpy.int64'>'
    with 5463770 stored elements in Compressed Sparse Row format>
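The numbers in that summary explain why a sparse format is used. A quick back-of-the-envelope calculation, using only the shape, dtype and stored-element count reported above:

```python
rows, cols = 40_000, 92_908   # shape reported above
stored = 5_463_770            # non-zero entries reported above

total_cells = rows * cols
density = stored / total_cells
dense_gb = total_cells * 8 / 1e9   # int64 uses 8 bytes per cell

print(f"cells: {total_cells:,}")
print(f"non-zero share: {density:.4%}")
print(f"dense int64 size: {dense_gb:.1f} GB")
```

Only about 0.15% of the cells are non-zero, and a fully dense int64 copy of the matrix would need roughly 30 GB, so converting the whole thing to a dense array is only practical on a machine with plenty of memory or on a subset of rows.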

We can view it by converting X_train_cv into an array and putting it into a DataFrame. We'll also apply the .get_feature_names_out() method of our CountVectorizer object to label the columns. (Older scikit-learn releases called this method .get_feature_names(); it was removed in version 1.2.)

df_cv = pd.DataFrame(X_train_cv.toarray(), columns=cv.get_feature_names_out())
df_cv.head()

[output]

This can be difficult to look at, so let's sum each word's column, sort the totals, take the top five words and plot them.

df_cv.sum().sort_values(ascending=False)[:5].plot(kind='barh', title='Top Five Words', 
                                                  xlabel='Word Count', ylabel='Words' );

With labeled data like this, it can also be useful to repeat these steps with the data split by sentiment to see which words appear most for each sentiment.

Now that we have converted our text data into numeric values, we are ready to model our data. I chose a random forest classifier, but there are many binary classifiers out there to choose from.

We'll create an instance of our model and then fit it with our new X_train_cv matrix and our y_train Series holding each row's sentiment.

from sklearn.ensemble import RandomForestClassifier

forest_cv = RandomForestClassifier()
forest_cv.fit(X_train_cv, y_train)

Using the model's predict method and passing in our X_test_cv matrix, we receive an array of predicted sentiment labels. Passing this through sklearn's accuracy_score() function gives us a score for our model.

from sklearn.metrics import accuracy_score 

y_pred_cv = forest_cv.predict(X_test_cv)
accuracy_score(y_test, y_pred_cv)

[output]

0.8504

To dig a little deeper into our model's performance, I also passed the model and testing data into sklearn's ConfusionMatrixDisplay.from_estimator() method (the replacement for plot_confusion_matrix(), which was removed in scikit-learn 1.2) to show the distribution of true/false positives and negatives. The bottom right corner (true 1, predicted 1) represents true positives while the top left corner (true 0, predicted 0) represents true negatives.

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(forest_cv, X_test_cv, y_test, normalize='true');

[output]

A further test of the model's performance is the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies. The more area under the curve, the better the model performed. The training data shows a perfect result, which mostly reflects the random forest having seen those reviews during fitting, while the testing data shows very good results.
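As a rough illustration of what the curve and its area measure, the sketch below builds the ROC points by hand for four made-up predictions. The helper names roc_points and auc are hypothetical, and the code assumes all scores are distinct; it is meant to show the idea, not to replace sklearn's metrics.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold one score at a time."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

Here one negative example is scored above one positive example, so the area is 0.75. A model that ranks every positive above every negative would reach 1.0.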

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_estimator(forest_cv, X_train_cv, y_train, name='Train', ax=ax)
RocCurveDisplay.from_estimator(forest_cv, X_test_cv, y_test, name='Test', ax=ax);

[output]

Although these reviews are not very long, they still contain more words than tweets or headlines. This makes it worthwhile to explore some of NLTK's preprocessing functions, which can reduce the effect of uninformative words in text data and help surface underlying patterns.

To show a way of doing this, I will perform some preprocessing step by step on one review and then on the entire set of reviews.

# select a review as an example 
example = X_train[3]
print(len(example.split()))
example

[output]

The first tool is the RegexpTokenizer(), which is initialized with a regular expression and tokenizes a string based on it. A tokenized string becomes a list of the substrings that matched. The regular expression below keeps only runs of lowercase or uppercase letters (optionally followed by a typographic apostrophe and more lowercase letters), which drops punctuation and digits.

import nltk
from nltk.tokenize import RegexpTokenizer

regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
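To get a feel for what this pattern keeps, the same regular expression can be tried with Python's built-in re module. As far as I know, findall is roughly what the tokenizer does with a pattern by default, but treat this as a sketch for exploring the pattern rather than a drop-in replacement; the sample sentences are made up.

```python
import re

# same pattern as above; note the apostrophe is the typographic ’ rather than '
pattern = re.compile(r"([a-zA-Z]+(?:’[a-z]+)?)")

sample = "even though i have great interest in biblical movies, i was bored."
tokens = pattern.findall(sample)
print(tokens)

# a typographic apostrophe keeps the contraction together...
curly = pattern.findall("don’t stop")
# ...while a straight apostrophe splits it into two tokens
straight = pattern.findall("don't stop")
print(curly, straight)
```

The second part matters in practice: text typed with straight apostrophes will have its contractions split, leaving fragments such as 't' in the token list.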

After passing our example through the tokenize method of the RegexpTokenizer object, we see our example review has been converted to a list of substrings with the punctuation cut out, which can be seen by the comma missing after 'movies' and the period missing after 'horror'.

example_reg = regex_token.tokenize(example)
print(len(example_reg))
print(example_reg[:10])
print(example_reg[-1:])

[output]

69
['even', 'though', 'i', 'have', 'great', 'interest', 'in', 'biblical', 'movies', 'i']
['horror']

This is a good start, but there are many words like 'I', 'you', and 'do' that will weigh on our model because they are used to construct sentences, yet do not reveal the contextual patterns we are looking for. In NLTK these are called 'stopwords', and removing them further reduces our text data to focus only on important words. This can be done by iterating over our example review and only keeping words that are not in NLTK's English stop words.

from nltk.corpus import stopwords

sw = stopwords.words('english')
print(f'amount of english stopwords: {len(sw)}')
sw[:10]

[output]

amount of english stopwords: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

# remove stopwords from example review
example_tokenized = [word for word in example_reg if word not in sw] 
print(len(example_tokenized))
print(example_tokenized[:10])

[output]

33
['even', 'though', 'great', 'interest', 'biblical', 'movies', 'bored', 'death', 'every', 'minute']

This has cut our example review from 69 words to 33. Now that our review has been tokenized, let's explore ways to further condense the words in our data. We have already taken out punctuation and stopwords, but there are also words that look different to the computer yet are effectively the same word, like 'movie' and 'movies'. As of now, the computer sees these two as completely different words, but it may help us find the underlying patterns for the sentiment of a text if we treat them as the same. To do this, we will first use NLTK's pos_tag() function, which takes our review, now a list of strings, and turns each word into a tuple with the word as the first element and its part of speech as the second element.

from nltk import pos_tag

# pass part of speech to each word in the tokenized review 
example_pos = pos_tag(example_tokenized)
example_pos[:10]

[output]

[('even', 'RB'),
 ('though', 'IN'),
 ('great', 'JJ'),
 ('interest', 'NN'),
 ('biblical', 'JJ'),
 ('movies', 'NNS'),
 ('bored', 'VBN'),
 ('death', 'NN'),
 ('every', 'DT'),
 ('minute', 'NN')]

These codes are hard to read, so let's make a set of the unique POS tags in our review and consult nltk.help.upenn_tagset() to get a description of each tag.

unique_pos = set([x[1] for x in example_pos])

# upenn_tagset() prints the description itself and returns None,
# so there is nothing useful to collect in a list
for pos in unique_pos:
    nltk.help.upenn_tagset(pos)

[output]

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...

The goal for these tuples is to find the lemma of each word, the dictionary form that carries its lexical meaning. This can be done with NLTK's WordNetLemmatizer. However, we'll first have to convert the tags pos_tag() gave us into tags the WordNetLemmatizer will recognize, using the constants in NLTK's wordnet module. Below I have written a function that does that by taking each word's pos_tag() tag and converting it into a part-of-speech tag the lemmatizer will understand.

from nltk.corpus import wordnet

# This function gets the correct Part of Speech so the Lemmatizer can work
def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Passing our list of tuples and putting each tag through our function returns a new list of tuples that is ready for our lemmatizer.

example_wordnet = [(word[0], get_wordnet_pos(word[1])) for word in example_pos]
example_wordnet[:10]

[output]

[('even', 'r'),
 ('though', 'n'),
 ('great', 'a'),
 ('interest', 'n'),
 ('biblical', 'a'),
 ('movies', 'n'),
 ('bored', 'v'),
 ('death', 'n'),
 ('every', 'n'),
 ('minute', 'n')]

Now, to convert words to their lemmas, we'll create an instance of the WordNetLemmatizer and pass each tuple through its lemmatize method.

from nltk.stem import WordNetLemmatizer

# initialize a lemmatizer
lemmatizer = WordNetLemmatizer()

Comparing our new list of strings with the previous one, we can see that 'movies' has been changed to 'movie' and 'bored' has been changed to 'bore'.

# lemmatize each tuple to return a list of strings, with some words reduced to their lemma
example_lemmatized = [lemmatizer.lemmatize(word[0], word[1]) for word in example_wordnet]
example_lemmatized[:10]

[output]

['even',
 'though',
 'great',
 'interest',
 'biblical',
 'movie',
 'bore',
 'death',
 'every',
 'minute']

Finally, we can convert our review from a list of strings back into one string by joining the lemmatized words with a single space between them.

example_rejoined = ' '.join(example_lemmatized)
example_rejoined

[output]

'even though great interest biblical movie bore death every minute movie everything bad movie long acting time joke script horrible get point mix story abraham noah together value time sanity stay away horror'

Comparing our final version to the original shows that our review has been shortened by 36 words, with punctuation removed and some words reduced to their lemma.

print(f'word count: {len(example.split())}')
print()
print(example)
print()
print(f'word count: {len(example_rejoined.split())}')
print()
print(example_rejoined)

[output]

word count: 69

even though i have great interest in biblical movies, i was bored to death every minute of the movie. everything is bad. the movie is too long, the acting is most of the time a joke and the script is horrible. i did not get the point in mixing the story about abraham and noah together. so if you value your time and sanity stay away from this horror.

word count: 33

even though great interest biblical movie bore death every minute movie everything bad movie long acting time joke script horrible get point mix story abraham noah together value time sanity stay away horror

Now let's put this whole process into a function, process our training data with that function and train a new random forest with the preprocessed data.

def text_prep(text, sw):
    # sw: collection of stop words to drop (passed in so it is only built once)
    # keep runs of letters, optionally followed by ’ and more lowercase letters
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    text = regex_token.tokenize(text)
    text = [word for word in text if word not in sw]
    # tag each word with its part of speech, then map to WordNet tags
    text = pos_tag(text)
    text = [(word[0], get_wordnet_pos(word[1])) for word in text]
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word[0], word[1]) for word in text]
    return ' '.join(text)

def tokenize_vector(vectorizer, X_train, X_test):
    # helper wrapping the steps below; relies on the stop word list sw defined earlier

    X_train_tokenized = [text_prep(text, sw) for text in X_train]
    X_train_token_vec = vectorizer.fit_transform(X_train_tokenized)
    X_test_token_vec = vectorizer.transform(X_test)

    return X_train_token_vec, X_test_token_vec

cv_tok = CountVectorizer()

X_train_tokenized = [text_prep(text, sw) for text in X_train]

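X_train_cv_tok = cv_tok.fit_transform(X_train_tokenized)
# note: the test reviews are vectorized without text_prep here; running them
# through text_prep as well would keep train and test preprocessing consistent
# (and may change the results)
X_test_cv_tok = cv_tok.transform(X_test)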

forest_cv_tok = RandomForestClassifier()

forest_cv_tok.fit(X_train_cv_tok, y_train)
y_pred_cv_tok = forest_cv_tok.predict(X_test_cv_tok)

accuracy_score(y_test, y_pred_cv_tok)

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(forest_cv_tok, X_test_cv_tok, y_test, normalize='true');

[output]

from sklearn.metrics import RocCurveDisplay

# plot an ROC curve
fig, ax = plt.subplots()
plt.title('ROC Curve')
RocCurveDisplay.from_estimator(forest_cv_tok, X_train_cv_tok, y_train, name='Train', ax=ax)
RocCurveDisplay.from_estimator(forest_cv_tok, X_test_cv_tok, y_test, name='Test', ax=ax);

[output]

As you can see in this example, overall model accuracy has not changed. However, in some trials not shown here it did improve, and with more data to process, these steps can help surface the patterns in textual data that let your models improve as they develop.

Plotting the Trees of NYC with Folium

manifoldmindaway — Mon, 04 Jul 2022 18:59:00 +0000

Folium is a great package for plotting data that contains latitude and longitude information. It does this by bringing the power of Leaflet, a JavaScript library for mobile-friendly interactive maps, to Python. For this project I took a dataset of every publicly owned tree in New York City, from a 2015 tree census, to show how to clean such a dataset and plot a marker on a folium map for each row, with a popup that provides a table of information about that row.

To begin, the first step is to import the data into a pandas DataFrame and inspect the first five rows.

# import pandas, load data
import pandas as pd
df_trees = pd.read_csv('new_york_tree_census_2015.csv.zip')

# print first 5 rows
df_trees.head()

[output]

Using df.info() to inspect the data shows that the dataset is over 600,000 rows long, with a mix of string, integer and floating point data types. It also shows that there are many columns, so our next step will be to filter out some of the rows and reduce the number of columns to only the ones we are interested in for our map's popups.

df_trees.info()

[output]

Using df.columns gives us an easier way to see the columns we have.

df_trees.columns

[output]

Index(['tree_id', 'block_id', 'created_at', 'tree_dbh', 'stump_diam',
       'curb_loc', 'status', 'health', 'spc_latin', 'spc_common', 'steward',
       'guards', 'sidewalk', 'user_type', 'problems', 'root_stone',
       'root_grate', 'root_other', 'trunk_wire', 'trnk_light', 'trnk_other',
       'brch_light', 'brch_shoe', 'brch_other', 'address', 'zipcode',
       'zip_city', 'cb_num', 'borocode', 'boroname', 'cncldist', 'st_assem',
       'st_senate', 'nta', 'nta_name', 'boro_ct', 'state', 'latitude',
       'longitude', 'x_sp', 'y_sp'],
      dtype='object')

Calling value_counts() on the 'status' column shows that there are over 30,000 trees that are either dead or only a stump. These may be worth mapping some other time, but for now we'll exclude them by making the DataFrame a slice of itself that only includes rows that list the tree as 'Alive'.

df_trees.status.value_counts()

[output]

Alive    652173
Stump     17654
Dead      13961
Name: status, dtype: int64

# slice DataFrame and inspect value counts
df_trees = df_trees[df_trees.status == 'Alive']
df_trees.status.value_counts()

[output]

Alive    652173
Name: status, dtype: int64

Now to reduce the columns in our DataFrame, we'll first make a list of all the columns we want to include and then slice the DataFrame with that list of columns. The columns kept include the latitude and longitude, measurements of the tree, the address, zip code and health of the tree.

After that we make a list of cleaner column names and pass it to df.columns for better readability.

columns_of_interest = ['tree_id', 'tree_dbh', 'stump_diam','curb_loc', 'health', 'spc_latin', 'spc_common', 
                       'steward', 'sidewalk', 'problems', 'address',
                       'zipcode', 'nta', 'latitude', 'longitude']
df_trees = df_trees[columns_of_interest]

new_column_names = ['tree_id', 'breast_height_diam', 'stump_diam','curb_loc', 'health', 'spc_latin', 'spc_common',
                    'num_stewards', 'sidewalk_damage', 'problems', 'address', 'zipcode', 'borough', 'latitude', 
                    'longitude']
df_trees.columns = new_column_names
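
Assigning a full list to df.columns depends on the list being in exactly the same order as the existing columns. As an alternative sketch (not what this post does), DataFrame.rename() with a mapping dictionary ties each old name to its new name explicitly. The one-row DataFrame here is a made-up stand-in for df_trees.

```python
import pandas as pd

# Hypothetical one-row stand-in for df_trees with three of its columns.
sample = pd.DataFrame({'tree_dbh': [10], 'steward': ['None'], 'nta': ['QN37']})

# Map each old name to its new name; columns not listed keep their names.
renamed = sample.rename(columns={
    'tree_dbh': 'breast_height_diam',
    'steward': 'num_stewards',
    'nta': 'borough',
})

print(list(renamed.columns))
```

With this approach, reordering or dropping a column earlier in the notebook cannot silently attach a new name to the wrong column.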

Printing out the first five rows confirms these changes took place. Now we can move on to cleaning the values inside the columns so that the popups on our map will present the information in a more readable way.

df_trees.head()

[output]

Looking at the 'curb_loc' column shows that the values are strings that capitalize the first letter of every word, yet do not put spaces between words. To solve this we can use Series.apply() to pass a lambda function that uses a regular expression to rewrite each value. The regular expression places a space before each capital letter. Since this creates a leading space, adding .lstrip() takes that leading space out, and chaining .capitalize() ensures only the first word of the string is capitalized.

Checking .value_counts() confirms the formatting has been fixed.

import re

df_trees.curb_loc = df_trees.curb_loc.apply(lambda x: re.sub('([A-Z])', r' \1', x).lstrip().capitalize())
df_trees.curb_loc.value_counts()

[output]

On curb             626298
Offset from curb     25875
Name: curb_loc, dtype: int64
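
To see what each step of that chain does, here is a small sketch that runs the same substitution on two standalone strings. The raw values 'OnCurb' and 'OffsetFromCurb' are assumed from the cleaned output above, and split_camel_case is a hypothetical helper name.

```python
import re

def split_camel_case(value):
    """Insert a space before each capital letter, then sentence-case the result."""
    spaced = re.sub(r'([A-Z])', r' \1', value)   # 'OnCurb' -> ' On Curb'
    return spaced.lstrip().capitalize()          # ' On Curb' -> 'On curb'

examples = ['OnCurb', 'OffsetFromCurb']
cleaned = [split_camel_case(value) for value in examples]
print(cleaned)
```

Note that .capitalize() also lowercases every character after the first, which is why 'Curb' becomes 'curb'.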

Next is the column that tells how many stewards take care of the tree. Since this uses 'or' to imply a range of stewards, we'll just replace the 'or' with a hyphen. We'll also change 'None' to '0'. This can be done by sending a mapping dictionary to .replace(), where each key is a value to change and each value is its substitute. inplace=True commits these changes and .value_counts() confirms they took place.

df_trees.num_stewards.replace({'None': '0', '1or2': '1-2' , '3or4': '3-4', '4orMore': '4+'}, inplace=True)
df_trees.num_stewards.value_counts()

[output]

0      487823
1-2    143557
3-4     19183
4+       1610
Name: num_stewards, dtype: int64
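
The same mapping can also be applied by assigning the result back instead of using inplace=True. This is only a sketch on a made-up Series; it is offered because newer pandas versions may not propagate an inplace change made on a column accessed as an attribute back to the parent DataFrame.

```python
import pandas as pd

# Hypothetical sample of raw steward values.
stewards = pd.Series(['None', '1or2', '3or4', '4orMore', '1or2'], name='num_stewards')

# Keys are the values to change, values are their substitutes.
mapping = {'None': '0', '1or2': '1-2', '3or4': '3-4', '4orMore': '4+'}

# replace() returns a new Series, so the result is assigned back explicitly.
stewards = stewards.replace(mapping)
print(stewards.value_counts())
```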

We'll use the same .replace() method for the 'sidewalk_damage' column. Since this lets us know if the sidewalk surrounding the tree is damaged or not, we'll change the values to a simple 'Yes' or 'No'.

df_trees.sidewalk_damage.replace({'NoDamage': 'No', 'Damage': 'Yes'}, inplace=True)
df_trees.sidewalk_damage.value_counts()

[output]

No     464978
Yes    187194
Name: sidewalk_damage, dtype: int64

Cleaning the 'problems' column will take a little more effort. Just like the 'curb_loc' column, the values are strings without spaces between words where every word is capitalized. However, this column is different because some values are only one word long. For this I wrote a function that first checks how many words are in the string. If it only contains one word, the string is returned as is. If it contains more than one word, the regular expression substitution from 'curb_loc' is applied and the leading space is stripped. The words are then joined with a comma followed by a space, and the word 'and' is added before the final word.

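def space_words(string):

    # each word starts with a capital letter, so count capitals to count words
    num_words = len(re.findall(r'[A-Z]', string))

    # single-word strings fall through and are returned unchanged
    if num_words > 1:
        # put a space before each capital letter and drop the leading space
        string = re.sub(r'([A-Z])', r' \1', string).lstrip()
        # separate the words with commas
        string = ', '.join(string.split())
        # insert 'and' before the final word
        string = string.split()
        string.insert(-1, 'and')
        string = ' '.join(string)

    return string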

df_trees.problems.fillna('None', inplace=True)
df_trees.problems = df_trees.problems.apply(space_words)
df_trees.problems.value_counts()

[output]

None                                                                                   426329
Stones                                                                                  95673
Branch, and Lights                                                                      29452
Stones, Branch, and Lights                                                              17808
Root, and Other                                                                         11418
                                                                                        ...  
Stones, Metal, Grates, Root, Other, Wires, Rope, Trunk, Other, Branch, and Lights           1
Stones, Root, Other, Branch, Lights, Sneakers, Branch, and Other                            1
Wires, Rope, Trunk, Other, Branch, Lights, and Sneakers                                     1
Stones, Root, Other, Wires, Rope, Trunk, Lights, Trunk, Other, Branch, and Lights           1
Stones, Wires, Rope, Trunk, Lights, Trunk, Other, Branch, Lights, Branch, and Other         1
Name: problems, Length: 232, dtype: int64
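
As a quick check of the logic, here is the same function run on three standalone strings. The raw inputs are assumptions inferred from the cleaned output above, and the function is repeated only so the sketch runs on its own.

```python
import re

def space_words(string):
    # each word starts with a capital letter, so count capitals to count words
    num_words = len(re.findall(r'[A-Z]', string))
    if num_words > 1:
        string = re.sub(r'([A-Z])', r' \1', string).lstrip()
        string = ', '.join(string.split())
        words = string.split()
        words.insert(-1, 'and')
        string = ' '.join(words)
    return string

# Hypothetical raw values: one, two and three words.
raw_values = ['Stones', 'BranchLights', 'StonesBranchLights']
cleaned = [space_words(value) for value in raw_values]
print(cleaned)
```

Because the comma is added before 'and' is inserted, a two-word value comes out as 'Branch, and Lights' rather than 'Branch and Lights', which matches the output shown above.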

The original borough column mixed boroughs and neighborhoods, so it was dropped. However, the 'nta' column that is now called 'borough' contains strings whose first two characters are the abbreviation for the borough. Reducing these values to just the abbreviation is as simple as passing a lambda function that slices the first two characters out of each value to the .apply() method. Note that the line below only previews the result; because it is not assigned back to df_trees.borough, the column keeps its full values (such as 'QN37') in the rest of this walkthrough.

df_trees.borough.apply(lambda x: x[:2]).value_counts()

[output]

QN    237947
BK    169771
SI    101443
BX     80348
MN     62664
Name: borough, dtype: int64
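
To actually keep the abbreviations, the result has to be stored. A minimal sketch on a made-up Series, using the vectorized .str accessor in place of .apply() with a lambda (the codes 'BK88' and 'MN12' are invented for illustration):

```python
import pandas as pd

# Hypothetical sample of values from the renamed 'borough' column.
nta_codes = pd.Series(['QN37', 'BK88', 'MN12'], name='borough')

# .str[:2] slices the first two characters of every string in the Series.
borough = nta_codes.str[:2]
print(borough.tolist())
```

On the real DataFrame, the equivalent would be to assign the sliced Series back to the 'borough' column.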

df_trees.info()

[output]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 652173 entries, 0 to 683787
Data columns (total 15 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   tree_id             652173 non-null  int64  
 1   breast_height_diam  652173 non-null  int64  
 2   stump_diam          652173 non-null  int64  
 3   curb_loc            652173 non-null  object 
 4   health              652172 non-null  object 
 5   spc_latin           652168 non-null  object 
 6   spc_common          652168 non-null  object 
 7   num_stewards        652173 non-null  object 
 8   sidewalk_damage     652172 non-null  object 
 9   problems            652173 non-null  object 
 10  address             652173 non-null  object 
 11  zipcode             652173 non-null  int64  
 12  borough             652173 non-null  object 
 13  latitude            652173 non-null  float64
 14  longitude           652173 non-null  float64
dtypes: float64(2), int64(4), object(9)
memory usage: 79.6+ MB

We are almost ready to start plotting our map, but first we need to work out how to prepare our tree information for our popups. I'll first show how this is done on an example tree and then how to do it in code to plot on our folium map.

Let's isolate the first tree and inspect it for the rows of interest for our popup.

test = df_trees.iloc[0]
test

[output]

tree_id                               606945
breast_height_diam                        10
stump_diam                                 0
curb_loc                             On curb
health                                  Good
spc_latin             Fraxinus pennsylvanica
spc_common                         green ash
num_stewards                               0
sidewalk_damage                           No
problems                              Stones
address                    76-046 164 STREET
zipcode                                11366
borough                                 QN37
latitude                             40.7243
longitude                           -73.8052
Name: 0, dtype: object

Next we'll make a dictionary for the tree, with every key being the name of the information in a more readable format and every value the value for that column. Then we can pass this dictionary's values, with .values(), into the pandas DataFrame constructor, and pass the dictionary's .keys() into the index= argument. Lastly, we can give the DataFrame's single column a new title by passing a string in a list to the columns= argument.

tree_details = {'breast height diameter': test.breast_height_diam, 'stump diameter': test.stump_diam, 
                'curb location': test.curb_loc, 'health': test.health, 'latin name': test.spc_latin, 
                'common name': test.spc_common, 'number of stewards': test.num_stewards, 
                'sidewalk damage': test.sidewalk_damage, 'problems': test.problems, 'address': test.address, 
                'zipcode': test.zipcode, 'borough': test.borough}

test_df = pd.DataFrame(tree_details.values(), index=tree_details.keys(), columns=['Information'])

test_df

Now that we know how to make this for every tree, it's time to make our plot. Since our dataset contains over 600,000 trees, we'll only select a sample to plot. For this I decided to look at the list of species and pick one, then set our new map DataFrame to a slice of our cleaned original DataFrame containing all trees of that kind.

The Douglas-fir has 85 listings, so let's choose that one.

df_trees.spc_common.value_counts(ascending=True)[:15]

[output]

Virginia pine       10
Scots pine          25
Osage-orange        29
pitch pine          33
black pine          37
American larch      46
European alder      47
smoketree           58
Shantung maple      59
boxelder            64
Himalayan cedar     72
Ohio buckeye        75
southern red oak    83
quaking aspen       83
Douglas-fir         85
Name: spc_common, dtype: int64

map_df = df_trees[df_trees.spc_common == 'Douglas-fir']

Now we can initialize our map. After importing folium, we can use its Map class with a location passed. This will be the center of our map upon loading. For this we'll pass in a list with the first element being the mean latitude of our map_df and the second the mean longitude. A starting zoom of 10 is added to make the map more appealing upon loading, as well.

import folium

tree_map = folium.Map(location=[map_df.latitude.mean(), map_df.longitude.mean()], zoom_start=10)

tree_map

Now that we have our map object that prints a map, it's time to add our popups. To do this, we set up a loop to go over our map_df with iterrows(). Then, for every row/tree in our map_df we construct the tree's information DataFrame like we did before. Now, if we put this DataFrame straight into our popup it wouldn't be appealing at all. In order to format it for a popup we'll first convert tree_details_df to HTML by calling its .to_html() method. After that, we can insert that HTML string into folium's IFrame() class, which uses HTML to create a figure. This figure then gets put into folium's Popup() class, which creates a popup instance to pass into Marker() along with the latitude and longitude for the tree.

for index, tree_info in map_df.iterrows():

    tree_details = {'breast height diameter': tree_info.breast_height_diam, 'stump diameter': tree_info.stump_diam, 
                    'curb location': tree_info.curb_loc, 'health': tree_info.health, 
                    'latin name': tree_info.spc_latin, 'common name': tree_info.spc_common, 
                    'number of stewards': tree_info.num_stewards, 'sidewalk damage': tree_info.sidewalk_damage, 
                    'problems': tree_info.problems, 'address': tree_info.address, 'zipcode': tree_info.zipcode, 
                    'borough': tree_info.borough}

    tree_details_df = pd.DataFrame(tree_details.values(), index=tree_details.keys(), columns=['Information'])

    html = tree_details_df.to_html()
    iframe = folium.IFrame(html=html, width=300, height=400)
    popup = folium.Popup(iframe)

    folium.Marker([tree_info['latitude'], tree_info['longitude']], popup=popup).add_to(tree_map)

tree_map
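
The popup-building step can also be pulled out into a small helper, which makes it easy to check the HTML before any marker is added. The sketch below uses only pandas and a made-up two-row stand-in for map_df; the names sample_df, popup_fields and build_popup_html are hypothetical, and it uses itertuples() rather than iterrows() as an alternative way to walk the rows.

```python
import pandas as pd

# Hypothetical two-row stand-in for map_df; the values are made up.
sample_df = pd.DataFrame({
    'health': ['Good', 'Fair'],
    'spc_common': ['Douglas-fir', 'Douglas-fir'],
    'latitude': [40.72, 40.65],
    'longitude': [-73.80, -73.95],
})

# Readable popup label -> column name (a subset of the labels used above).
popup_fields = {'health': 'health', 'common name': 'spc_common'}

def build_popup_html(row):
    """Return an HTML table for one tree, given a row from itertuples()."""
    details = {label: getattr(row, column) for label, column in popup_fields.items()}
    details_df = pd.DataFrame(list(details.values()), index=list(details.keys()),
                              columns=['Information'])
    return details_df.to_html()

# Collect (location, html) pairs; each pair holds what one marker would need.
popups = [
    ((row.latitude, row.longitude), build_popup_html(row))
    for row in sample_df.itertuples(index=False)
]
print(popups[0][1])
```

Each HTML string could then be wrapped in an IFrame and Popup exactly as in the loop above.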

The Pandas Series and Memory Consumption

manifoldmindaway — Mon, 04 Jul 2022 18:58:50 +0000

The pandas Series is a numpy array with the addition of an index for each entry. It is also the building block object for the pandas DataFrame, like below.

Since a Series is constructed from a numpy array, it gets the added benefit of faster computation on a Series, or between multiple ones, due to numpy's vectorized computation. With a base python list a loop must be used to perform computation on its elements, slowly checking the index and confirming the data type each time, while numpy handles this in pre-compiled C code for much faster computation.

Let's construct the above DataFrame by building a Series for each column and exploring how to change the data type of our values. Although these columns are trivially short, knowing how to maximize the efficiency of your code by choosing the right data type will become important when dealing with larger amounts of data and performing more complex computations.

We'll start by creating a Series for the year column. We can do this by initializing a variable named 'years' that contains a base python list filled with the past 8 years. Then we will pass it to the pandas Series constructor. We will also give the Series a name by passing a string to the name= keyword argument. This Series name will also provide pandas with the column name for these values when added to a DataFrame.

import pandas as pd
import numpy as np

years = [2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014]
year_series = pd.Series(data=years, name='year')
year_series

[output]

0    2021
1    2020
2    2019
3    2018
4    2017
5    2016
6    2015
7    2014
Name: year, dtype: int64

As you can see, pandas has automatically generated a numeric index for our Series and decided to use the 'int64' data type for our values based on the data we passed in. Let's use some attributes and methods to explore what this Series is made of.

The .values attribute gives us the list of years we gave to pandas initially, but we see here that it has been turned into a numpy array.

print(type(year_series.values))
year_series.values

[output]

<class 'numpy.ndarray'>
array([2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014])

Now let's check how much memory our Series is using with the .memory_usage() method, which outputs the amount of space being used in bytes. As a refresher, every byte is 8 bits. Since our Series happens to contain exactly 8 values, the total number of bytes used by the array will equal the number of bits used by each element.

Since our data type is 'int64', where the 64 is measured in bits, and we have 8 values, we would expect our Series to consume 64 bits x 8 values = 512 bits. Converting that to bytes, the .memory_usage() method should give us an output of 64 bytes.

print(year_series.memory_usage(), 'bytes')

[output]

192 bytes

Interesting, it did not. Let's dig around to find where the other 128 bytes are coming from. The .nbytes attribute will give us the memory consumption of a numpy array. Let's combine it with the .values and .index attributes on the Series object to check the consumption of both.

print(
f'''
values consumption: {year_series.values.nbytes} bytes
index consumption:  {year_series.index.nbytes} bytes
''')

[output]

values consumption: 64 bytes
index consumption:  128 bytes

The index is reported as using twice as much memory as the actual data! This looks even odder when we check the dtype of the index, which is also 'int64'.

year_series.index.dtype

[output]

dtype('int64')
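
A likely explanation for the extra bytes is that the automatically generated index is a RangeIndex, which keeps only a start, stop and step rather than one value per row, so its reported size is a fixed overhead. The sketch below checks this under that assumption by comparing a short and a long Series with an explicit integer index; the variable names are illustrative, and the exact byte counts may vary with the pandas and Python versions.

```python
import numpy as np
import pandas as pd

short = pd.Series(range(8))
long = pd.Series(range(1_000_000))

# The default index is a RangeIndex in both cases.
print(type(short.index).__name__)

# Its reported size does not grow with the number of rows.
print(short.index.nbytes, long.index.nbytes)

# An explicit int64 index stores 8 bytes for every row.
explicit = pd.Index(np.arange(1_000_000, dtype='int64'))
print(explicit.nbytes)
```

So for a Series of 8 rows the fixed overhead looks large, but for long Series the default index is by far the cheaper option.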

This will be the same index behavior when building a DataFrame with the pd.DataFrame() constructor. To reduce the amount of memory used by the index, let's try creating our own index with pd.Index(). Then we can recreate our Series with our own custom index by passing it to the index= keyword argument.

For even more memory efficiency we can alter the dtype used for the underlying data in our Series. Pandas decided to use 'int64', which can store values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. This is far more space than we need. Let's downgrade the data type of our values to 'uint16'. The 'u' is for unsigned, which works well for values that will never be negative, like a year, and the '16' is for 16 bits, which can store numbers ranging between 0 and 65,535. This is plenty for representing the years of the python and pandas library versions.

index = pd.Index(list(range(len(years))))
year_series2 = pd.Series(years, index=index, dtype='uint16', name='year')
year_series2

[output]

0    2021
1    2020
2    2019
3    2018
4    2017
5    2016
6    2015
7    2014
Name: year, dtype: uint16
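
The integer ranges quoted above can be checked with numpy's iinfo, and pandas can also pick the smallest fitting dtype for you. This is a sketch rather than part of the original walkthrough; it uses pd.to_numeric() with its downcast= argument on the same list of years.

```python
import numpy as np
import pandas as pd

# Inspect the range and per-value size of the two dtypes discussed above.
for dtype in (np.int64, np.uint16):
    info = np.iinfo(dtype)
    print(info.dtype, info.min, info.max, np.dtype(dtype).itemsize, 'bytes per value')

years = pd.Series([2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014], name='year')

# downcast='unsigned' asks pandas for the smallest unsigned dtype that fits the data.
smallest = pd.to_numeric(years, downcast='unsigned')
print(smallest.dtype)
```

Since the years do not fit in 8 bits (maximum 255), 'uint16' is the smallest unsigned type that can hold them.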

Our custom index using the listed range of the length of our data seems to give us the same index as before. Let's see if its memory consumption is the same.

print(
f'''
total consumption:  {year_series2.memory_usage()} bytes
values consumption: {year_series2.values.nbytes} bytes
index consumption:  {year_series2.index.nbytes} bytes
''')

[output]

total consumption:  80 bytes
values consumption: 16 bytes
index consumption:  64 bytes

Great, that cut the memory consumption of our index in half from 128 bytes to 64 bytes and of our values from 64 bytes to 16 bytes. Keep in mind that an explicit integer index stores 8 bytes per row, so this index saving only holds for very short Series like this one.

Now that we can create a Series and alter the data types used, let's continue building the DataFrame above with a Series for the python version column for each year in our first Series. We already have a custom index created, so we will just have to pass that into the new Series. As for the underlying data, since our version numbers are floating point values, the dtype for this Series will default to 'float64', just as our Series containing years defaulted to 'int64'. Let's go ahead and downgrade this to 'float32'.

python_versions = [3.9, 3.9, 3.8, 3.7, 3.6, 3.6, 3, 3]
python_series = pd.Series(python_versions, index=index, dtype='float32', name='python_version')
python_series

[output]

0    3.9
1    3.9
2    3.8
3    3.7
4    3.6
5    3.6
6    3.0
7    3.0
Name: python_version, dtype: float32
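
Downgrading floats trades precision for space, which is harmless for display here but worth knowing about. As a small sketch of that trade-off, the snippet below compares the value 3.9 stored as 'float32' with the same value as a regular python float (which is 64 bits).

```python
import numpy as np

value = 3.9
as_float32 = np.float32(value)

# Bytes per value for each dtype.
print(np.dtype('float32').itemsize, np.dtype('float64').itemsize)

# Approximate number of reliable decimal digits for each dtype.
print(np.finfo(np.float32).precision, np.finfo(np.float64).precision)

# The float32 copy is close to, but not exactly equal to, the original value.
difference = abs(float(as_float32) - value)
print(float(as_float32) == value, difference)
```

The Series still prints 3.9 because pandas rounds for display, but exact equality checks against the original values can fail after downcasting.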

Now let's check the memory consumption.

print(
f'''
total consumption:  {python_series.memory_usage()} bytes
values consumption: {python_series.values.nbytes} bytes
index consumption:  {python_series.index.nbytes} bytes
''')

[output]

total consumption:  96 bytes
values consumption: 32 bytes
index consumption:  64 bytes

Not bad. This column will take up more memory than the one with our years, but we were able to decrease the underlying data's memory consumption from 64 bytes to 32 bytes.

Now let's create our last Series, which contains the versions of pandas for the years in our first Series. Since each year has multiple pandas updates, these are stored as a string representing the range of version updates for that year.

pandas_versions = ['1.2->1.4', '1.0->1.1', '0.24->0.25', '0.23', '0.20->0.22', 
                    '0.18->0.19','0.16->0.17','0.13->0.15']
pandas_series = pd.Series(pandas_versions, index=index, name='pandas_version')
pandas_series

[output]

0      1.2->1.4
1      1.0->1.1
2    0.24->0.25
3          0.23
4    0.20->0.22
5    0.18->0.19
6    0.16->0.17
7    0.13->0.15
Name: pandas_version, dtype: object

We can see a string is represented in the Series as the numpy 'object' dtype.
Now for the memory consumption.

print(
f'''
total consumption:  {pandas_series.memory_usage()} bytes
values consumption: {pandas_series.values.nbytes} bytes
index consumption:  {pandas_series.index.nbytes} bytes
''')

[output]

total consumption:  128 bytes
values consumption: 64 bytes
index consumption:  64 bytes

We can see the 'object' dtype totals 64 bytes for the column, or 8 bytes (64 bits) per element. Unlike for numeric values, there is not a smaller 'object' dtype to downgrade to.
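
One caveat worth checking: for 'object' columns, the default .memory_usage() counts only the 8-byte references to the python strings, not the strings themselves. The sketch below, which assumes a 64-bit platform and forces dtype=object so the result does not depend on the pandas version's default string handling, compares the default figure with deep=True.

```python
import pandas as pd

pandas_versions = ['1.2->1.4', '1.0->1.1', '0.24->0.25', '0.23', '0.20->0.22',
                   '0.18->0.19', '0.16->0.17', '0.13->0.15']
versions = pd.Series(pandas_versions, dtype=object, name='pandas_version')

# Default: only the references stored in the array are counted.
shallow = versions.memory_usage(index=False)

# deep=True also counts the python string objects being referenced.
deep = versions.memory_usage(index=False, deep=True)

print(shallow, deep)
```

The deep figure is the better guide when estimating how much memory text columns really occupy.
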
Now let's finally put these in a DataFrame with the pd.concat() method.

df = pd.concat([year_series2, python_series, pandas_series], axis = 1)
df

[output]

Memory consumption of this DataFrame can be checked with the same .memory_usage() method that we used on each Series.

print(df.memory_usage())
print()
print(f'total memory consumption: {df.memory_usage().sum()} bytes')

[output]

Index             64
year              16
python_version    32
pandas_version    64
dtype: int64

total memory consumption: 176 bytes

For comparison, I created this same DataFrame by passing a dictionary with the values to the pd.DataFrame() constructor method and checked the memory consumption.

data = {'year': years, 'python_versions': python_versions, 'pandas_versions': pandas_versions}
df2 = pd.DataFrame(data=data)
df2

[output]

Memory consumption for this DataFrame.

print(df2.memory_usage())
print()
print(f'total memory consumption: {df2.memory_usage().sum()} bytes')

[output]

Index              128
year                64
python_versions     64
pandas_versions     64
dtype: int64

total memory consumption: 320 bytes

Our DataFrame made from customized Series added up to 176 bytes, while our DataFrame made with the pd.DataFrame() constructor and no customizing came in at 320 bytes. That's a 45% reduction in memory consumption!

As you can see, changing the data types used in your pandas DataFrames can drastically decrease the amount of memory used. For larger datasets this can have useful effects on both file size and computation time.
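
The comparison between the two totals can be worked out in either direction, and the two directions give different percentages. A short sketch of the arithmetic, using the byte totals reported above:

```python
customized_bytes = 176   # DataFrame built from the customized Series
default_bytes = 320      # DataFrame built with pd.DataFrame() and default dtypes

saved = default_bytes - customized_bytes

# Relative to the default DataFrame, the customized one is this much smaller.
reduction = saved / default_bytes

# Relative to the customized DataFrame, the default one is this much larger.
increase = saved / customized_bytes

print(f'saved {saved} bytes: {reduction:.0%} reduction, or {increase:.0%} increase going the other way')
```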