When you want to learn a new language, a wordbook is indispensable.
In this article, I try to create my own wordbook from podcast transcripts.
Learning the words that engineers often use makes my study more efficient.
This is my first time using natural language processing, so please don't go too hard on me.
1. Finding articles or transcripts in your new language
I found the Changelog podcast transcripts:
https://github.com/thechangelog/transcripts/tree/master/gotime
I'll use these texts this time.
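If you clone that repository, each episode is a plain Markdown file, so loading one is simple. A minimal sketch, assuming the repo is cloned locally (the file name here is only a hypothetical example; adjust it to the actual names in the repo):

>>> with open("transcripts/gotime/go-time-1.md") as f:
...     text = f.read()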
2. Cleansing
Preparation
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "We've got a great show lined up today. This is our first episode, so we're gonna do some brief introductions"
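If this is your first time with NLTK, the tokenizer, tagger, and lemmatizer also need their model data downloaded once (these are the standard NLTK resource names):

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('wordnet')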
nltk.sent_tokenize
We can split the text into separate sentences with the sentence tokenizer.
>>> sent_tokenize(text)
["We've got a great show lined up today.", "This is our first episode, so we're gonna do some brief introductions"]
nltk.word_tokenize
Looping over the separated sentences, the word_tokenize function splits each sentence into tokenized words.
>>> [word_tokenize(sent) for sent in sent_tokenize(text)]
[['We', "'ve", 'got', 'a', 'great', 'show', 'lined', 'up', 'today', '.'], ['This', 'is', 'our', 'first', 'episode', ',', 'so', 'we', "'re", 'gon', 'na', 'do', 'some', 'brief', 'introductions']]
nltk.pos_tag
pos_tag attaches a part-of-speech tag to each token in the given list.
>>> [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
[[('We', 'PRP'), ("'ve", 'VBP'), ('got', 'VBD'), ('a', 'DT'), ('great', 'JJ'), ('show', 'NN'), ('lined', 'VBD'), ('up', 'RP'), ('today', 'NN'), ('.', '.')], [('This', 'DT'), ('is', 'VBZ'), ('our', 'PRP$'), ('first', 'JJ'), ('episode', 'NN'), (',', ','), ('so', 'IN'), ('we', 'PRP'), ("'re", 'VBP'), ('gon', 'VBG'), ('na', 'TO'), ('do', 'VB'), ('some', 'DT'), ('brief', 'NN'), ('introductions', 'NNS')]]
You can find a description of each tag that pos_tag returns here:
https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk
3. Lemmatizing the words
We have to convert the tagged tuples into WordNet's POS format before lemmatizing:
("We", "VBD") -> ("We", "v")
Then you can get lemmatized words. The WordNetLemmatizer is a bit messy to use: we have to pass the POS tag to the lemmatize function explicitly.
>>> import nltk
>>> lemmatizer = nltk.stem.WordNetLemmatizer()
>>> lemmatizer.lemmatize("better", pos=nltk.corpus.wordnet.ADJ)
'good'
FYI: a get_wordnet_pos helper function is useful for converting Penn Treebank tags into the format that lemmatize expects.
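It isn't part of NLTK itself; a minimal sketch of how it could look (the noun fallback for unknown tags is my own assumption):

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (as returned by pos_tag) to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # fallback for tags WordNet doesn't cover

Combined with pos_tag, it lets us lemmatize a whole sentence:

>>> tagged = pos_tag(word_tokenize("We've got a great show lined up today."))
>>> [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]

This should turn "got" into "get" and "lined" into "line".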
I also tried StanfordNLP. It looks better than WordNet.
https://github.com/stanfordnlp/stanfordnlp
>>> import stanfordnlp
>>> stanfordnlp.download('en')
>>> nlp = stanfordnlp.Pipeline()
>>> text = "We've got a great show lined up today. This is our first episode, so we're gonna do some brief introductions"
>>> doc = nlp(text)
>>> print([w.lemma for w in doc.sentences[1].words])
['this', 'be', 'we', 'first', 'episode', ',', 'so', 'we', 'be', 'go', 'to', 'do', 'some', 'brief', 'introduction']
I could get better lemmatized words than before
(e.g. WordNetLemmatizer: "gonna" => "gon", StanfordNLP: "gonna" => "go")
Now I can get tokenized and lemmatized words through these stages.
In the next stage, I'll try to aggregate the words and create a useful wordbook.