Why Encoding?
Machines don't understand text the way humans do. To process text, we need to convert it into numbers. One way to do this is by assigning a unique number to every word — this process is called tokenization.
In Natural Language Processing (NLP), we use TensorFlow's Keras API to easily convert text into sequences of numbers using a **Tokenizer**.
Why use tensorflow.keras?
tensorflow.keras is TensorFlow's built-in high-level API for building and training neural networks. It is a wrapper around the original Keras library, tightly integrated with TensorFlow's backend for performance and scalability.
For a smoother experience, especially when using TensorFlow, tools like Google Colab are recommended to avoid installation issues.
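If you do prefer to run things locally, TensorFlow (which bundles Keras) installs with pip:

```
pip install tensorflow
```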
1. Tokenization
```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
corp = ['This is a book that is about the history of the world in a very detailed and interesting way.',
        'Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed.']
tok.fit_on_texts(corp)
print(tok.word_index)
```
Explanation:
- tok = Tokenizer() initializes the tokenizer, which will break sentences into words.
- corp is our corpus: the paragraphs we want to break up.
- tok.fit_on_texts(corp) builds the vocabulary and assigns an index to each word. How does the indexing work here? The most frequent word gets the lowest index, starting from 1 (see the word_counts check after the output below).
- print(tok.word_index) prints the resulting word-to-index dictionary.
Output:
```
{'is': 1, 'a': 2, 'the': 3, 'of': 4, 'in': 5, 'corpora': 6, 'this': 7, 'book': 8, 'that': 9, 'about': 10, 'history': 11, 'world': 12, 'very': 13, 'detailed': 14, 'and': 15, 'interesting': 16, 'way': 17, 'some': 18, 'have': 19, 'further': 20, 'structured': 21, 'levels': 22, 'analysis': 23, 'applied': 24, 'particular': 25, 'smaller': 26, 'may': 27, 'be': 28, 'fully': 29, 'parsed': 30}
```
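To see the frequency-based ordering at work, inspect the tokenizer's word_counts attribute. Words that appear twice in this corpus ('is', 'a', 'the', 'of', 'in', 'corpora') received the lowest indices:

```python
# word_counts tracks how often each word appeared during fit_on_texts();
# higher counts translate to lower indices in word_index.
print(tok.word_counts['is'])       # 2 -> index 1
print(tok.word_counts['corpora'])  # 2 -> index 6
print(tok.word_counts['this'])     # 1 -> index 7
```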
2. Convert text to sequences
```python
print(tok.texts_to_sequences(corp))
```
Output:
```
[[7, 1, 2, 8, 9, 1, 10, 3, 11, 4, 3, 12, 5, 2, 13, 14, 15, 16, 17], [18, 6, 19, 20, 21, 22, 4, 23, 24, 5, 25, 26, 6, 27, 28, 29, 30]]
```
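The mapping also works in reverse via sequences_to_texts(), which makes a handy sanity check. Note that the Tokenizer lowercases text and strips punctuation by default, so the round trip is not character-exact:

```python
# Decode the sequences back into (lowercased, punctuation-free) text.
seqs = tok.texts_to_sequences(corp)
print(tok.sequences_to_texts(seqs))
# ['this is a book that is about the history of the world in a very detailed and interesting way', ...]
```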
Problem: What Happens with New Words?
When you train a tokenizer on some text, it only builds a vocabulary from those words.
If you later try to encode a sentence containing new or unknown words, texts_to_sequences() will silently ignore them.
Example Without OOV Handling
```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
corp = ['I need coffee,', 'we can make it from water']
tok.fit_on_texts(corp)
print(tok.word_index)
```
We tokenized the text into words, and each word has been assigned an index:
```
{'i': 1, 'need': 2, 'coffee': 3, 'we': 4, 'can': 5, 'make': 6, 'it': 7, 'from': 8, 'water': 9}
```
Now try a sentence with the new word "black":
```python
corp = ['I need coffee,', 'black coffee we can make it from water']
print(tok.texts_to_sequences(corp))
```
Output:
```
[[1, 2, 3], [3, 4, 5, 6, 7, 8, 9]]
```
We added the word "black" to the second sentence, but texts_to_sequences() skipped it silently: the encoded sentence starts directly at 3 ('coffee'), and "black" simply disappears.
Handle OOV (Out-of-Vocabulary) Words
An out-of-vocabulary (OOV) word is a word that was not present during training, i.e., when fit_on_texts() was called. The Tokenizer's oov_token parameter reserves a special index for all such words:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# '<OOV>' is a conventional placeholder; any token that won't collide
# with a real word works.
tok = Tokenizer(oov_token='<OOV>')
corp = ['I need coffee,', 'we can make it from water']
tok.fit_on_texts(corp)
print(tok.word_index)
```
Output:
```
{'<OOV>': 1, 'i': 2, 'need': 3, 'coffee': 4, 'we': 5, 'can': 6, 'make': 7, 'it': 8, 'from': 9, 'water': 10}
```
The OOV token is always assigned index 1, and every real word shifts up by one. Now, even though "black" wasn't in the training text, it is no longer dropped: it maps to the reserved OOV index, as the check below shows.
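Encoding the earlier sentence again confirms this (the sequence follows from the word_index above):

```python
# "black" was never seen by fit_on_texts(), so instead of being
# silently dropped it maps to the OOV index (1).
print(tok.texts_to_sequences(['black coffee we can make it from water']))
# [[1, 4, 5, 6, 7, 8, 9, 10]]
```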
Limiting the number of words
Let's say I don't want the whole vocabulary and want to limit the number of words. Why? Efficiency: if we keep every word in the corpus, processing takes more time and memory. To keep things manageable, we cap the vocabulary at the most frequent words using the num_words parameter:
```python
tok = Tokenizer(oov_token='<OOV>', num_words=4)
```
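One subtlety that often trips people up: num_words does not shrink word_index itself; it only limits which indices texts_to_sequences() keeps. With num_words=4, only indices 1 to 3 survive, and everything else falls back to the OOV token. A quick sketch:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(oov_token='<OOV>', num_words=4)
corp = ['I need coffee,', 'we can make it from water']
tok.fit_on_texts(corp)

# word_index still contains every word...
print(tok.word_index)
# {'<OOV>': 1, 'i': 2, 'need': 3, 'coffee': 4, 'we': 5, ...}

# ...but texts_to_sequences() only keeps indices below num_words;
# everything else falls back to the OOV index (1).
print(tok.texts_to_sequences(corp))
# [[2, 3, 1], [1, 1, 1, 1, 1, 1]]
```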