A corpus is a collection of text; it can range from a few paragraphs to an entire book. Why are we talking about it? In Natural Language Processing the input is often a long text or many paragraphs, so we need to learn how to deal with large amounts of text.
For text analysis, remember these preprocessing steps for a corpus:
- Tokenisation
- Stop Word Removal
- Special Character Removal
- Converting to Lower Case
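Before reaching for NLTK, the four steps above can be sketched in plain Python (a toy illustration only; `stop_words` here is a tiny hand-picked sample, not NLTK's real list, and the regex is a crude stand-in for a proper tokeniser):

```python
import re

text = "The quick brown FOX, the lazy dog!"
stop_words = {"the", "a", "an", "and", "of", "to", "in"}  # tiny sample list

tokens = re.findall(r"[A-Za-z]+", text)              # 1. tokenise, 3. drop special characters
tokens = [t.lower() for t in tokens]                 # 4. convert to lower case
tokens = [t for t in tokens if t not in stop_words]  # 2. remove stop words
print(tokens)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```

The NLTK versions below do the same thing but handle many more edge cases.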
1. Tokenisation:
```python
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

corpus = "In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual."

print(word_tokenize(corpus))
```
['In', 'order', 'to', 'make', 'the', 'corpora', 'more', 'useful', 'for', 'doing', 'linguistic', 'research', ',', 'they', 'are', 'often', 'subjected', 'to', 'a', 'process', 'known', 'as', 'annotation', '.', 'An', 'example', 'of', 'annotating', 'a', 'corpus', 'is', 'part-of-speech', 'tagging', ',', 'or', 'POS-tagging', ',', 'in', 'which', 'information', 'about', 'each', 'word', "'s", 'part', 'of', 'speech', '(', 'verb', ',', 'noun', ',', 'adjective', ',', 'etc', '.', ')', 'is', 'added', 'to', 'the', 'corpus', 'in', 'the', 'form', 'of', 'tags', '.', 'Another', 'example', 'is', 'indicating', 'the', 'lemma', '(', 'base', ')', 'form', 'of', 'each', 'word', '.', 'When', 'the', 'language', 'of', 'the', 'corpus', 'is', 'not', 'a', 'working', 'language', 'of', 'the', 'researchers', 'who', 'use', 'it', ',', 'interlinear', 'glossing', 'is', 'used', 'to', 'make', 'the', 'annotation', 'bilingual', '.']
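Notice how `word_tokenize` keeps hyphenated words such as 'part-of-speech' together but splits trailing punctuation into separate tokens. A rough approximation of that behaviour with a regular expression (NLTK's actual tokeniser handles far more cases, such as the clitic "'s"):

```python
import re

sentence = "An example of annotating a corpus is part-of-speech tagging."
# Words (allowing internal hyphens/apostrophes) or single punctuation marks.
tokens = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|[^\w\s]", sentence)
print(tokens)
# ['An', 'example', 'of', 'annotating', 'a', 'corpus', 'is',
#  'part-of-speech', 'tagging', '.']
```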
2. Stop Word Removal and Converting to Lower Case:
```python
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

for word in word_tokenize(corpus):
    if word.lower() not in stopwords.words('english') and len(word) >= 2:
        print(word)
```
Output: order make corpora useful linguistic research often subjected process known annotation example annotating corpus part-of-speech tagging POS-tagging information word 's part speech verb noun adjective etc added corpus form tags Another example indicating lemma base form word language corpus working language researchers use interlinear glossing used make annotation bilingual
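The `len(word) >= 2` check is what drops the single-character punctuation tokens (',', '.', '(', ')') that tokenisation produced. Here is the same filter with a small hand-picked stop list standing in for NLTK's (`stopwords.words('english')` is much larger, around 180 words):

```python
# Sample stop list for illustration only -- not NLTK's real list.
stop_words = {"in", "to", "the", "for", "they", "are", "a", "of", "is"}

tokens = ["In", "order", "to", "make", "the", "corpora", ",", "more"]
kept = [w for w in tokens if w.lower() not in stop_words and len(w) >= 2]
print(kept)  # ['order', 'make', 'corpora', 'more'] -- stop words and ',' removed
```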
3. Put the Words in a List:
```python
words = []
for word in word_tokenize(corpus):
    if word.lower() not in stopwords.words('english') and len(word) >= 2:
        words.append(word.lower())
print(words)
```
Output: ['order', 'make', 'corpora', 'useful', 'linguistic', 'research', 'often', 'subjected', 'process', 'known', 'annotation', 'example', 'annotating', 'corpus', 'part-of-speech', 'tagging', 'pos-tagging', 'information', 'word', "'s", 'part', 'speech', 'verb', 'noun', 'adjective', 'etc', 'added', 'corpus', 'form', 'tags', 'another', 'example', 'indicating', 'lemma', 'base', 'form', 'word', 'language', 'corpus', 'working', 'language', 'researchers', 'use', 'interlinear', 'glossing', 'used', 'make', 'annotation', 'bilingual']
4. Unique Words:
print(set(words))
Output: a set of the 41 unique words, with duplicates such as 'corpus', 'example', 'make', 'word', 'form', 'language', and 'annotation' appearing only once. (A set prints in curly braces and in arbitrary order, so the exact output varies between runs.)
Vocabulary, in the context of NLP, refers to the set of unique words in a corpus. After preprocessing, the vocabulary size is the total number of unique words left.
To see the difference between the total word count after stop word removal and the number of unique words:

```python
print(len(words))
print(len(set(words)))
```
Output:
49
41
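The same comparison on a toy list makes the difference obvious:

```python
# len(words) counts every token; len(set(words)) counts the vocabulary.
words = ["corpus", "example", "corpus", "word", "corpus", "example"]
print(len(words))       # 6 tokens in total
print(len(set(words)))  # 3 unique words: 'corpus', 'example', 'word'
```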