Corpus & Vocabulary - NLP

A corpus is a collection of text. It can range from a few paragraphs to an entire book.

So why are we talking about it?

In Natural Language Processing we often have to analyse multiple paragraphs or very long texts,
so in this post we are going to learn how to deal with large amounts of text.

For text analysis you need to remember these preprocessing steps for a corpus:

  1. Tokenisation
  2. Stop Word Removal
  3. Special Character Removal
  4. Converting to Lowercase

1. Tokenisation

from nltk.tokenize import word_tokenize

corpus="In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual."

print(word_tokenize(corpus))

['In', 'order', 'to', 'make', 'the', 'corpora', 'more', 'useful', 'for', 'doing', 'linguistic', 'research', ',', 'they', 'are', 'often', 'subjected', 'to', 'a', 'process', 'known', 'as', 'annotation', '.', 'An', 'example', 'of', 'annotating', 'a', 'corpus', 'is', 'part-of-speech', 'tagging', ',', 'or', 'POS-tagging', ',', 'in', 'which', 'information', 'about', 'each', 'word', "'s", 'part', 'of', 'speech', '(', 'verb', ',', 'noun', ',', 'adjective', ',', 'etc', '.', ')', 'is', 'added', 'to', 'the', 'corpus', 'in', 'the', 'form', 'of', 'tags', '.', 'Another', 'example', 'is', 'indicating', 'the', 'lemma', '(', 'base', ')', 'form', 'of', 'each', 'word', '.', 'When', 'the', 'language', 'of', 'the', 'corpus', 'is', 'not', 'a', 'working', 'language', 'of', 'the', 'researchers', 'who', 'use', 'it', ',', 'interlinear', 'glossing', 'is', 'used', 'to', 'make', 'the', 'annotation', 'bilingual', '.']
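If you run the snippet above on a fresh NLTK installation, word_tokenize (and the stop word list used in the next step) may raise a LookupError until the required resources have been downloaded once:

import nltk

nltk.download('punkt')       # tokenizer models used by word_tokenize
nltk.download('stopwords')   # stop word lists used in the next step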

2. Stop Word Removal and Converting to Lowercase:

from nltk.corpus import stopwords

for word in word_tokenize(corpus):
    # keep tokens that are not stop words and are at least 2 characters long
    # (the length check also drops single-character punctuation)
    if word.lower() not in stopwords.words('english') and len(word) >= 2:
        print(word)
Output:
order
make
corpora
useful
linguistic
research
often
subjected
process
known
annotation
example
annotating
corpus
part-of-speech
tagging
POS-tagging
information
word
's
part
speech
verb
noun
adjective
etc
added
corpus
form
tags
Another
example
indicating
lemma
base
form
word
language
corpus
working
language
researchers
use
interlinear
glossing
used
make
annotation
bilingual
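The preprocessing list above also mentions Special Character Removal. The len(word) >= 2 check already drops single-character punctuation, but tokens such as 's or part-of-speech still contain non-alphabetic characters. A minimal sketch of an explicit special-character filter, using str.isalpha() (my choice here, not something the code above uses), could look like this:

for word in word_tokenize(corpus):
    # keep only purely alphabetic tokens that are not stop words
    if word.isalpha() and word.lower() not in stopwords.words('english'):
        print(word.lower())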

3. Put the Words in a List:

words = []
for word in word_tokenize(corpus):
    # same filter as before, but store the lowercased tokens in a list
    if word.lower() not in stopwords.words('english') and len(word) >= 2:
        words.append(word.lower())

print(words)
Output:
['order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often', 'subjected', 'process', 'known', 'annotation', 'example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word', "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'corpus', 'form', 'tags', 'another', 'example', 'indicating', 'lemma', 'base', 'form', 'word', 'language', 'corpus', 'working', 'language', 'researchers', 'use', 'interlinear', 'glossing', 'used', 'make', 'annotation', 'bilingual']

4. Unique Words:

print(set(words))
Output:
{'order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often', 'subjected', 'process', 'known', 'annotation', 'example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word', "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'form', 'tags', 'another', 'indicating', 'lemma', 'base', 'language', 'working', 'researchers', 'use', 'interlinear', 'glossing', 'used', 'bilingual'}
(a set contains no duplicates; the printed order may vary from run to run)
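Because sets are unordered, it can be easier to read the vocabulary if you sort it first (a small convenience, not one of the original steps):

print(sorted(set(words)))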

Vocabulary, in the context of NLP, refers to the set of unique words in a corpus: the total number of unique words left after preprocessing.

To see the difference between the number of words after stop word removal and the number of unique words:

print(len(words))
print(len(set(words)))
Output:
49
41
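To wrap up, the whole flow can be collected into one small helper function. This is just a sketch that mirrors the steps above (the function name preprocess is my own choice, not from any library):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess(text):
    """Tokenise, lowercase and remove stop words, mirroring the steps above."""
    stop_words = set(stopwords.words('english'))
    return [word.lower()
            for word in word_tokenize(text)
            if word.lower() not in stop_words and len(word) >= 2]

words = preprocess(corpus)
vocabulary = set(words)              # the vocabulary is the set of unique words
print(len(words), len(vocabulary))   # 49 41 for the corpus used above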
