Sentiment in Text
- Tokenizer - used to vectorise a sentence of words into numbers. It strips out punctuation and converts everything to lowercase. The `num_words` parameter in the initializer specifies the maximum number of words to keep (based on frequency): only the `num_words - 1` most common words are used when generating sequences.
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100)  # It only keeps the 100 most frequent words

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print it
word_index = tokenizer.word_index
print(word_index)
```
```python
# Define input sentences
sentences = [
'i love my dog',
'I, love my cat',
'You love my dog!'
]
# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 1)
# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)
# Get the indices and print it
word_index = tokenizer.word_index
print(word_index)
```
The important thing to note is that `num_words` does not affect how the `word_index` dictionary is generated. You can pass `1` instead of `100`, as in the cell above, and you will arrive at the same `word_index`.
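A minimal sketch of what `num_words` actually controls, reusing the sentences from the cell above: the `word_index` comes out identical either way, but `texts_to_sequences` only keeps tokens whose index is below `num_words`, i.e. the `num_words - 1` most frequent words.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Compare a generous limit with a tight one
for n in (100, 3):
    tokenizer = Tokenizer(num_words=n)
    tokenizer.fit_on_texts(sentences)
    print(f'num_words={n}')
    print('word_index:', tokenizer.word_index)                    # same dictionary in both cases
    print('sequences:', tokenizer.texts_to_sequences(sentences))  # only indices < num_words survive
    print()
```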
- Padding - before feeding the data into the model, we need to make sure the sequences are of uniform length. We use padding to do this, filling the sequences out with zeros. Arguments control whether the zeros are added at the front or the back.
- We use an `"<OOV>"` token for out-of-vocabulary words, i.e. words the tokenizer has not seen during fitting.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Define your input texts
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")

# Tokenize the input sentences
tokenizer.fit_on_texts(sentences)

# Get the word index dictionary
word_index = tokenizer.word_index

# Generate list of token sequences
sequences = tokenizer.texts_to_sequences(sentences)

# Print the result
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)

# Pad the sequences to a uniform length
padded = pad_sequences(sequences, maxlen=5)  # override the max length you want for the sequence

# Print the result
print("\nPadded Sequences:")
print(padded)
```
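To actually see the `<OOV>` token in action, the short sketch below converts sentences the tokenizer was never fitted on (the `test_data` list is made up for illustration): every unknown word maps to the `<OOV>` index, which is 1.

```python
# Sentences containing words the tokenizer has not seen (illustrative only)
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Unknown words such as 'really', 'loves' and 'manatee' map to the "<OOV>" index (1)
test_sequences = tokenizer.texts_to_sequences(test_data)
print('Test Sequences =', test_sequences)

# Pad them the same way as before
print(pad_sequences(test_sequences, maxlen=10))
```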
- By default, padding is added at the front (left) of each sequence, and when `maxlen` is set the data is also truncated from the front. We can change both behaviours with the `padding` and `truncating` arguments.
```python
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
```
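A quick side-by-side of the two settings, reusing the `sequences` from the cell above, makes the difference visible (a sketch, not part of the original lesson code):

```python
# Default behaviour: zeros go at the front and long sequences lose their first tokens
print(pad_sequences(sequences, maxlen=5))

# 'post' behaviour: zeros go at the end and long sequences lose their last tokens
print(pad_sequences(sequences, padding='post', truncating='post', maxlen=5))
```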
- Using a sarcasm dataset to try this out.
```python
# Download the dataset
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)

# Initialize lists
sentences = []
labels = []
urls = []

# Append elements in the dictionaries into each list
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])
```
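It can help to peek at what was loaded before tokenizing. A minimal inspection sketch, assuming only the field names used in the loop above:

```python
# Inspect the parsed dataset
print(f'number of records: {len(datastore)}')
print(f'first record: {datastore[0]}')

# The lists built above line up index-by-index
print(f'sample headline: {sentences[0]}')
print(f'sample label: {labels[0]}')
print(f'sample url: {urls[0]}')
```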
- Processing the dataset: tokenizing the headlines and then padding the sequences.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the Tokenizer class
tokenizer = Tokenizer(oov_token="<OOV>")

# Generate the word index dictionary
tokenizer.fit_on_texts(sentences)

# Print the length of the word index
word_index = tokenizer.word_index
print(f'number of words in word_index: {len(word_index)}')

# Print the word index
print(f'word_index: {word_index}')
print()

# Generate and pad the sequences
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

# Print a sample headline
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {padded[index]}')
print()

# Print dimensions of padded sequences
print(f'shape of padded sequences: {padded.shape}')
```
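As a sanity check, the sketch below inverts `word_index` and decodes a padded sequence back into words; the reverse lookup and `decode_sequence` helper are additions for inspection, not part of the lesson (index 0 is the padding value, so it is skipped).

```python
# Build a reverse lookup: index -> word (for inspection only)
reverse_word_index = {index: word for word, index in word_index.items()}

def decode_sequence(sequence):
    # 0 is reserved for padding, so it is skipped while decoding
    return " ".join(reverse_word_index.get(token, '?') for token in sequence if token != 0)

index = 2
print(f'original headline: {sentences[index]}')
print(f'decoded sequence:  {decode_sequence(padded[index])}')
```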
- Preparing data for NLP algorithms by removing stopwords and converting everything to lowercase.
```python
def remove_stopwords(sentence):
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are",
                 "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but",
                 "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from",
                 "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here",
                 "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll",
                 "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me",
                 "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
                 "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's",
                 "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them",
                 "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're",
                 "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was",
                 "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where",
                 "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you",
                 "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]

    # Convert the sentence to lowercase
    sentence = sentence.lower()

    # Keep only the words that are not in the stopword list
    words = sentence.split()
    filtered = [word for word in words if word not in stopwords]
    sentence = " ".join(filtered)

    return sentence
```
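A quick usage check on a made-up sentence (the example string is an assumption, not from the dataset):

```python
# Stopwords and case are stripped; the remaining words are kept in order
print(remove_stopwords("I am about to go to the store and get any snack"))
# Expected output: go store get snack
```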
- Reading the data and extracting the labels and texts.
```python
import csv

def parse_data_from_file(filename):
    sentences = []
    labels = []

    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the first line (headers)
        for row in reader:
            label = row[0]
            text = " ".join(row[1:])
            text = remove_stopwords(text)
            labels.append(label)
            sentences.append(text)

    return sentences, labels
```
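A hedged usage sketch follows; the filename and the label-then-text column layout are assumptions about the CSV being read, matching what the function expects.

```python
# Hypothetical CSV with a header row and 'label,text' columns (filename is an assumption)
sentences, labels = parse_data_from_file("./data.csv")

print(f'number of sentences: {len(sentences)}')
print(f'number of labels: {len(labels)}')
print(f'first label: {labels[0]}')
print(f'first (stopword-free) sentence: {sentences[0]}')
```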
- Tokenizing the labels.
```python
def tokenize_labels(labels):
    # Instantiate the Tokenizer class
    # No need to pass additional arguments since you will be tokenizing the labels
    label_tokenizer = Tokenizer()

    # Fit the tokenizer to the labels
    label_tokenizer.fit_on_texts(labels)

    # Save the word index
    label_word_index = label_tokenizer.word_index

    # Save the sequences
    label_sequences = label_tokenizer.texts_to_sequences(labels)

    return label_sequences, label_word_index
```
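A usage sketch on a small, made-up list of labels (the label names are assumptions):

```python
# Illustrative labels, not from the dataset
sample_labels = ['sport', 'tech', 'sport', 'business']

label_sequences, label_word_index = tokenize_labels(sample_labels)
print(f'label word index: {label_word_index}')  # each distinct label gets an index starting at 1
print(f'label sequences: {label_sequences}')    # each label becomes a one-element sequence
```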