Federico Sartoris


NLP & Python: An introduction with code

Hi, this is my first post, and I want to share some examples that I recently uploaded to my GitHub showing how to start working with Natural Language Processing using Python.

There are some concepts that are relevant when you start working on NLP projects, or even when you just want to prototype an idea. Below I share some concepts and techniques that are part of almost every NLP project.

Tokenization

Tokenization is the process of breaking the original text up into component pieces called "tokens". Tokens are the basic building blocks of a document object: the information that helps us understand the meaning of the text is derived from the tokens and their relationships to one another.
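As a quick illustration, here is a minimal tokenization sketch using spaCy. The example sentence and the "en_core_web_sm" model are my own assumptions, not part of the repository.

```python
# A minimal tokenization sketch using spaCy.
# Assumes the small English model has been installed with:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla isn't looking into startups anymore.")

# Each token is a component piece of the original text
for token in doc:
    print(token.text)
```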

Stemming

Stemming is the process of finding variations of a keyword. For example, a search for "boat" might also return "boating" and "boats". In this case, "boat" would be the stem for "boat", "boater", "boating" and "boats".

A stemmer catalogues related words based on a set of rules: it keeps the beginning of the word and changes or strips the ending so that related words map to the same stem.

Martin Porter's algorithm is one of the most famous stemming implementations. It applies five phases of word reduction based on mapping rules, starting with the suffix (the end of the string) and replacing letters until related words converge to the same primitive form.
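A minimal stemming sketch, assuming NLTK is installed and using its implementation of Porter's algorithm (the word list is just an example):

```python
# A sketch of stemming with NLTK's implementation of Porter's algorithm
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["boat", "boater", "boating", "boats"]

for word in words:
    print(word, "->", stemmer.stem(word))
# e.g. "boats" and "boating" both reduce to the stem "boat"
```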

Lemmatization

On the other hand, lemmatization reduces words based on a morphological analysis. For example, the lemma of "meeting" might be "meet" or "meeting" depending on how the word is used in the sentence. In contrast to stemming, this method is more informative because it looks up the actual dictionary form of the word.
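A small lemmatization sketch with spaCy, assuming the "en_core_web_sm" model is available (the sentence is only an example):

```python
# A sketch of lemmatization with spaCy: the lemma depends on how the word
# is used in the sentence
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I am meeting him tomorrow at the meeting.")

for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")
# The verb "meeting" should lemmatize to "meet",
# while the noun "meeting" keeps the lemma "meeting"
```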

Stop Words

This is one of the most used techniques in all projects. In a few words, stop words are words that appear very frequently but are not nouns, verbs or modifiers (for example "the", "is", "and"). These words do not require tagging and are usually filtered out.
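A quick sketch of filtering stop words with spaCy's built-in stop word list (the sentence is my own example):

```python
# A sketch of stop word removal using spaCy's built-in stop word list
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is just a simple sentence about boats and the sea.")

# Keep only the tokens that are not flagged as stop words
content_tokens = [token.text for token in doc if not token.is_stop]
print(content_tokens)
```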

Part of speech "POS"

The same words in a different order can mean something completely different: the context defines the meaning of the words. Part-of-speech tagging assigns a grammatical category (noun, verb, adjective and so on) to each token based on that context.
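A short POS tagging sketch with spaCy that shows how context changes the tag of the same word (the sentences are just examples):

```python
# A sketch of part-of-speech tagging with spaCy: the same word gets a
# different tag depending on the context
import spacy

nlp = spacy.load("en_core_web_sm")

for text in ("I left the room", "Turn to the left"):
    doc = nlp(text)
    print([(token.text, token.pos_) for token in doc])
# "left" should be tagged as a verb in the first sentence
# and as a noun in the second one
```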

Named Entity Recognition "NER"

This method locates and classifies named entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, monetary values, quantities, percentages and so on. spaCy is one of the best libraries for this task.
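A minimal NER sketch with spaCy, assuming the "en_core_web_sm" model (the sentence and the expected labels are illustrative):

```python
# A sketch of named entity recognition with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output: Apple (ORG), U.K. (GPE), $1 billion (MONEY)
```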

Feature Extraction

Machine Learning models cannot process raw text directly. For this reason we have to pre-process the text based on the frequency of its words and convert this information into numeric values.

scikit-learn provides functions to vectorize the raw text into parts (tokens) and start the analysis.

  • Term frequency: Using the count vectorization function we can create a Document Term Matrix (DTM), a matrix with one column per unique word in the raw text and a count of its occurrences. After that we have, for each document, an array with the number of occurrences of every word.

  • Inverse document frequency: Building on the term frequency explained above, this assigns a weight to each word but follows the inverse logic: if a word is very frequent across documents, its weight will be lower than that of a word that appears only a few times. A log function is then applied to calculate the final weight for each word.

Both techniques are supported in scikit-learn for processing text.
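A small sketch of both techniques with scikit-learn (the tiny corpus is just an example):

```python
# A sketch of feature extraction with scikit-learn:
# CountVectorizer builds the Document Term Matrix (term frequency),
# TfidfVectorizer adds the inverse document frequency weighting
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the boat sails on the sea",
    "the crew loves the boat",
    "sailing the sea is fun",
]

# Term frequency: one row per document, one column per unique word
count_vec = CountVectorizer()
dtm = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())
print(dtm.toarray())

# TF-IDF: words that appear in every document get a lower weight
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(corpus)
print(weights.toarray().round(2))
```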

Sentiment Analysis using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model for sentiment analysis that is sensitive to both the polarity (positive or negative) and the intensity of emotion. The score is calculated by summing the intensity of each word in the text (positive, negative, strong).
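A minimal sketch using the VADER implementation shipped with NLTK, assuming the "vader_lexicon" resource has been downloaded (the sentence is only an example):

```python
# A sketch of sentiment scoring with VADER through NLTK.
# Assumes the lexicon was downloaded first with: nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("This library is absolutely GREAT!!!")
print(scores)
# The "compound" value aggregates the valence of each word; capitalization
# and punctuation increase the measured intensity
```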

Topic Modeling

Topic modeling analyzes and classifies large volumes of text by clustering documents into topics. The challenge here is discovering the labels or categories and grouping together the documents that share a similar topic. This is part of unsupervised learning.
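A small topic modeling sketch using Latent Dirichlet Allocation (LDA) from scikit-learn, one common unsupervised approach (the corpus and the number of topics are my own assumptions):

```python
# A sketch of topic modeling with Latent Dirichlet Allocation (LDA) from
# scikit-learn, an unsupervised model that clusters documents into topics
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the boat sails on the open sea",
    "the crew repaired the boat engine",
    "the election results surprised the voters",
    "the senate passed the new law",
]

# LDA works on raw term counts, so start from the Document Term Matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Print the most representative words for each discovered topic
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")
```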

Stop reading and start coding! Here is the link with the examples: https://github.com/fsartoris/nlp

Thanks.
