DEV Community

Discussion on: Anatomy of Language Models In NLP

amananandrai

As far as I know, lemmatization and tokenization are used in text classification and sentiment analysis. Neural language models use word embeddings, which capture relations between words and store them as vectors: they can learn that king and queen have the same relation as boy and girl, and which words are similar in meaning versus far apart in context. In statistical models we use tokenization to split text into tokens. Neural models have their own tokenizers; tokenization is done during the training phase, and at test time the next token is generated based on those tokens. Lemmatization would introduce some error here, since it trims words to their base form. E.g., the base form of is, are, and am is be, so a sentence like "I am Aman" would become "I be Aman", which is grammatically incorrect, and this error comes from lemmatization.
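To make the "I be Aman" point concrete, here is a minimal sketch using a small hypothetical lemma table (real libraries like NLTK or spaCy use much larger dictionaries and part-of-speech information) to show how mapping words to their base form can break grammar:

```python
# Toy lemma table (hypothetical, for illustration only): forms of "to be"
# all map to the base form "be".
LEMMAS = {"is": "be", "are": "be", "am": "be", "was": "be", "were": "be"}

def lemmatize(sentence):
    # Naive whitespace tokenization, then map each token to its base form
    # if it appears in the lemma table; otherwise keep it unchanged.
    tokens = sentence.split()
    return " ".join(LEMMAS.get(t.lower(), t) for t in tokens)

print(lemmatize("I am Aman"))  # -> "I be Aman" (grammatically incorrect)
```

This is why lemmatization helps for classification tasks, where word identity matters more than inflection, but hurts generation, where the surface form must stay grammatical.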