Before starting this, I am assuming that you know the basics of Python.
Source: Python Basics
NLTK (Natural Language Toolkit) is a library for Natural Language Processing (NLP). It is one of the largest libraries for performing NLP tasks.
It supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.
To install nltk, run the following command in your terminal:
pip install nltk
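Once nltk is installed, you can already try one of those features. Here is a quick stemming sketch (a minimal example; PorterStemmer ships with nltk and needs no extra data downloads):
from nltk.stem import PorterStemmer

# Reduce words to their root form ("stem")
stemmer = PorterStemmer()
print(stemmer.stem("hoping"))  # hope
print(stemmer.stem("guys"))    # guy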
In your code editor write:
import nltk
The code I'm going to write next has nothing to do with "import nltk" just yet; it is purely for explanation purposes.
In your code editor:
txt = "Hello Geek. We're hoping you guys are doing great."
string1 = txt.split('.')
string2 = txt.split(' ')
print(string1)
print(string2)
Output:
['Hello Geek', " We're hoping you guys are doing great", '']
['Hello', 'Geek.', "We're", 'hoping', 'you', 'guys', 'are', 'doing', 'great.']
Explanation:
- In the first string, the text is split by sentences—that is, the text is divided wherever there is a full stop.
- In the second string, the text is split by words—that is, each word is separated individually.
print(len(string1))
print(len(string2))
Output:
3
9
Explanation:
- The length of string1 should be 2, but it shows 3. Why? Because the text ends with a full stop, split('.') leaves an empty string '' as the last element of the list.
- Edge cases like this hinder NLP work; handling each one by hand makes the task lengthy and complex, as the fix below shows.
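One quick manual fix is to strip and drop the empty pieces ourselves (a small sketch reusing the same txt):
txt = "Hello Geek. We're hoping you guys are doing great."
# Strip whitespace and drop the empty string left by the trailing full stop
sentences = [s.strip() for s in txt.split('.') if s.strip()]
print(sentences)       # ['Hello Geek', "We're hoping you guys are doing great"]
print(len(sentences))  # 2
This is exactly the kind of boilerplate that a proper tokeniser saves us from.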
Tokenisation
Tokenisation basically means breaking a document or body of text into small units called tokens.
The nltk library has a tokenize package that does this task without the problem we ran into above. It provides the word_tokenize and sent_tokenize functions.
First, you need to download the tokenizer data. Type "python" in your terminal to open a Python interpreter, then run:
import nltk
nltk.download('punkt')
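If you'd rather not open an interactive interpreter every time, a common pattern is to download the data from inside your script only when it is missing (a small sketch; it needs an internet connection on the first run):
import nltk

# Download the punkt tokenizer data only if it isn't already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')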
from nltk.tokenize import word_tokenize, sent_tokenize

txt = "Hello Geek. We're hoping you guys are doing great."
print(word_tokenize(txt))
print(sent_tokenize(txt))
Output:
['Hello', 'Geek', '.', 'We', "'re", 'hoping', 'you', 'guys', 'are', 'doing', 'great', '.']
['Hello Geek.', "We're hoping you guys are doing great."]
Now you can see that even the full stop is counted as a separate token.
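It is also worth confirming that the sentence-count problem from the split() section is gone (a small check reusing the same txt):
from nltk.tokenize import sent_tokenize

txt = "Hello Geek. We're hoping you guys are doing great."
# sent_tokenize leaves no empty trailing piece, unlike txt.split('.')
print(len(sent_tokenize(txt)))  # 2
Back to the word tokens: to drop the full stop, you can simply filter it out: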
for word in word_tokenize(txt):
    if word != '.':
        print(word)
I'll leave it to you guys to figure out what the output is going to be.