Sentiment in Text
- Tokenizer - used to vectorise a sentence of words into numbers. It strips out punctuation and converts everything to lowercase. The `num_words` parameter in the initializer specifies the maximum number of words to keep (based on frequency): only the `num_words - 1` most common words are used when generating sequences.
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)  # It only keeps the 100 most frequent words

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print it
word_index = tokenizer.word_index
print(word_index)
```
```python
# Define input sentences
sentences = [
'i love my dog',
'I, love my cat',
'You love my dog!'
]
# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 1)
# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)
# Get the indices and print it
word_index = tokenizer.word_index
print(word_index)
```
The important thing to note is that `num_words` does not affect how the `word_index` dictionary is generated. You can pass `1` instead of `100`, as in the cell above, and you will arrive at the same `word_index`.
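A minimal sketch of what `num_words` actually controls, reusing the sentences from the cell above: the `word_index` comes out identical either way, but `texts_to_sequences` only keeps tokens whose index is below `num_words`, i.e. the `num_words - 1` most frequent words.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Compare a generous limit with a tight one
for n in (100, 3):
    tokenizer = Tokenizer(num_words=n)
    tokenizer.fit_on_texts(sentences)
    print(f'num_words={n}')
    print('word_index:', tokenizer.word_index)                    # same dictionary in both cases
    print('sequences:', tokenizer.texts_to_sequences(sentences))  # only indices < num_words survive
    print()
```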
- Padding - before feeding the data into the model, we need to make sure the sequences are of uniform length. We use padding to do this, filling the sequences out with zeros. Arguments control whether the zeros are added at the front or the back.
- We use an `"<OOV>"` token for out-of-vocabulary words, i.e. words the tokenizer has not seen during fitting.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")

# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)  # override the max length you want for the sequence

# Print the result
print("\nPadded Sequences:")
print(padded)
```
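To actually see the `<OOV>` token in action, the short sketch below converts sentences the tokenizer was never fitted on (the `test_data` list is made up for illustration): every unknown word maps to the `<OOV>` index, which is 1.

```python
# Sentences containing words the tokenizer has not seen (illustrative only)
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Unknown words such as 'really', 'loves' and 'manatee' map to the "<OOV>" index (1)
test_sequences = tokenizer.texts_to_sequences(test_data)
print('Test Sequences =', test_sequences)

# Pad them the same way as before
print(pad_sequences(test_sequences, maxlen=10))
```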
- By default, padding is added at the front (left) of each sequence, and when `maxlen` is set the data is also truncated from the front. We can change both behaviours with the `padding` and `truncating` arguments.
```python
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
```
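A quick side-by-side of the two settings, reusing the `sequences` from the cell above, makes the difference visible (a sketch, not part of the original lesson code):

```python
# Default behaviour: zeros go at the front and long sequences lose their first tokens
print(pad_sequences(sequences, maxlen=5))

# 'post' behaviour: zeros go at the end and long sequences lose their last tokens
print(pad_sequences(sequences, padding='post', truncating='post', maxlen=5))
```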
- Using a sarcasm dataset to try this out.
```python
# Download the dataset
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)

# Initialize lists
sentences = []
labels = []
urls = []

# Append elements in the dictionaries into each list
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
```
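It can help to peek at what was loaded before tokenizing. A minimal inspection sketch, assuming only the field names used in the loop above:

```python
# Inspect the parsed dataset
print(f'number of records: {len(datastore)}')
print(f'first record: {datastore[0]}')

# The lists built above line up index-by-index
print(f'sample headline: {sentences[0]}')
print(f'sample label: {labels[0]}')
print(f'sample url: {urls[0]}')
```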
- Processing the dataset: tokenizing the headlines and then padding the sequences.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the Tokenizer class
tokenizer = Tokenizer(oov_token="<OOV>")

# Generate the word index dictionary
tokenizer.fit_on_texts(sentences)

# Print the length of the word index
word_index = tokenizer.word_index
print(f'number of words in word_index: {len(word_index)}')

# Print the word index
print(f'word_index: {word_index}')
print()

# Generate and pad the sequences
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

# Print a sample headline
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {padded[index]}')
print()

# Print dimensions of padded sequences
print(f'shape of padded sequences: {padded.shape}')
```
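As a sanity check, the sketch below inverts `word_index` and decodes a padded sequence back into words; the reverse lookup and `decode_sequence` helper are additions for inspection, not part of the lesson (index 0 is the padding value, so it is skipped).

```python
# Build a reverse lookup: index -> word (for inspection only)
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_sequence(sequence):
    # 0 is reserved for padding, so it is skipped while decoding
    return " ".join(reverse_word_index.get(token, '?') for token in sequence if token != 0)

index = 2
print(f'original headline: {sentences[index]}')
print(f'decoded sequence:  {decode_sequence(padded[index])}')
```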
- Preparing data for NLP algorithms by removing stopwords and converting everything to lowercase.
```python
def remove_stopwords(sentence):
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are",
                 "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but",
                 "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from",
                 "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
                 "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll",
                 "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me",
                 "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
                 "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's",
                 "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them",
                 "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're",
                 "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was",
                 "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where",
                 "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you",
                 "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]

    # Convert the sentence to lowercase
    sentence = sentence.lower()

    # Keep only the words that are not in the stopword list
    words = sentence.split()
    filtered = [word for word in words if word not in stopwords]
    sentence = " ".join(filtered)

    return sentence
```
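A quick usage check on a made-up sentence (the example string is an assumption, not from the dataset):

```python
# Stopwords and case are stripped; the remaining words are kept in order
print(remove_stopwords("I am about to go to the store and get any snack"))
# Expected output: go store get snack
```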
- Reading the data and extracting the labels and texts.
```python
import csv

def parse_data_from_file(filename):
    sentences = []
    labels = []

    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the first line (headers)
        for row in reader:
            label = row[0]
            text = " ".join(row[1:])
            text = remove_stopwords(text)
            labels.append(label)
            sentences.append(text)

    return sentences, labels
```
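A hedged usage sketch follows; the filename and the label-then-text column layout are assumptions about the CSV being read, matching what the function expects.

```python
# Hypothetical CSV with a header row and 'label,text' columns (filename is an assumption)
sentences, labels = parse_data_from_file("./data.csv")

print(f'number of sentences: {len(sentences)}')
print(f'number of labels: {len(labels)}')
print(f'first label: {labels[0]}')
print(f'first (stopword-free) sentence: {sentences[0]}')
```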
- Tokenizing the labels.
```python
def tokenize_labels(labels):
    # Instantiate the Tokenizer class
    # No need to pass additional arguments since you will be tokenizing the labels
    label_tokenizer = Tokenizer()

    # Fit the tokenizer to the labels
    label_tokenizer.fit_on_texts(labels)

    # Save the word index
    label_word_index = label_tokenizer.word_index

    # Save the sequences
    label_sequences = label_tokenizer.texts_to_sequences(labels)

    return label_sequences, label_word_index
```
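A usage sketch on a small, made-up list of labels (the label names are assumptions):

```python
# Illustrative labels, not from the dataset
sample_labels = ['sport', 'tech', 'sport', 'business']

label_sequences, label_word_index = tokenize_labels(sample_labels)
print(f'label word index: {label_word_index}')  # each distinct label gets an index starting at 1
print(f'label sequences: {label_sequences}')    # each label becomes a one-element sequence
```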