Bittsanalytics

Posted on Mar 25, 2022 • Edited on Aug 2, 2025

Natural language processing with Python

#nlp #machinelearning #ai #python

Natural language processing (NLP) is an area of computer science and artificial intelligence that is closely related to computational linguistics. In the past several years, NLP has seen tremendous advancements, driven in large part by an immense increase in digital text availability, increasing compute power and decreasing cost, and new algorithmic advances.

NLP use cases

Natural language processing has many real-world applications:

Machine translation is a major success story for artificial intelligence and is used in everything from digital assistants to video conferencing.
Topic modelling is a technique for discovering hidden topics in collections of documents. It is a way to mine large data sets for what people are interested in and how their views differ depending on their interests or concerns.
Text summarization condenses long texts by removing unnecessary details while retaining the key points of each paragraph or section, which makes the text quicker to read and easier to comprehend. There are many Python libraries available for text summarization.
Website and domain classification is important for several fields: cybersecurity (identifying problematic websites), marketing (finding appropriate websites for marketing campaigns), web content filtering (restricting access from internal networks to websites such as shopping stores, social media networks and gaming sites by blocking the IPs of their domains) and many others.
Sentiment analysis helps in understanding the thoughts and feelings behind customers' words. It can be used in many industries, for example in marketing or advertising, to better understand what people really think about products they have seen advertised.
NLP can be used to determine various text-based attributes of websites, such as the website content category or the topics found by topic modelling. From this data set one can then build consumer-targeted lead generation AI.

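Natural language processing is a difficult problem in the field of machine learning because it has to deal with both lexical ambiguity and syntactic ambiguity. For example, the word "bank" can mean a financial institution or the side of a river (lexical ambiguity), and the sentence "I saw the man with the telescope" can be parsed in two different ways (syntactic ambiguity).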

Text pre-processing

Text pre-processing is the process of preparing text for use by a machine learning algorithm. The steps depend on the task, but in general they include:

Lowercasing words (removing the distinction between upper and lower case)
Removing punctuation
Removing numbers
Removing stopwords (the, an, a, etc.)
Lemmatization (reducing words to their dictionary form, e.g. "running" to "run")
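
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the stopword set and the `LEMMAS` lookup table are tiny hypothetical stand-ins invented for this example, whereas a real project would typically take stopword lists and a lemmatizer from a library such as NLTK or spaCy.

```python
import re
import string

# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"cats": "cat", "running": "run", "ran": "run", "mice": "mouse"}


def preprocess(text):
    """Return a list of cleaned tokens for one document."""
    text = text.lower()  # normalize casing
    # Drop ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)  # drop digits
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [LEMMAS.get(t, t) for t in tokens]  # toy lemmatization


print(preprocess("The 3 cats are running in the garden!"))
# prints ['cat', 'run', 'garden']
```

Which of these steps to keep depends on the task: for sentiment analysis, for instance, punctuation or casing can carry useful signal.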

Feature engineering

Finally, we need to vectorize texts, i.e. convert them to numbers, because most machine learning algorithms cannot take text input in the form in which we read it. This step is part of what is known as feature engineering.

There are many methods available for vectorization of texts:

bag of words (BOW)
TF-IDF, or term frequency - inverse document frequency
word embeddings (e.g. Word2Vec)
sentence embeddings
vectorization with BERT

In the BOW approach, one first creates a vocabulary consisting of the unique words present in the documents of the corpus. Each document is then represented by a vector with one component per vocabulary word, holding the number of times that word occurs in the document.

If working with scikit-learn, CountVectorizer is useful for this purpose.
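
To make the idea concrete, here is a minimal pure-Python sketch of bag of words. It assumes simple whitespace tokenization and lowercasing, and the helper names `build_vocabulary` and `bow_vector` are made up for this example; CountVectorizer does a similar job with its own tokenization rules and sparse output.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of unique lowercased, whitespace-separated tokens."""
    return sorted({token for doc in docs for token in doc.lower().split()})


def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Note that the very common word "the" gets the largest count in the second document.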

Let us turn our attention to TF-IDF.

The BOW method calculates the weight of a word for a given document based only on the word's frequency. The problem is that common words like "the" end up with above-average weights.

But we want to award larger weights to words that are relevant to the specific corpus of documents. Let us say that we are dealing with the sports domain; then we want to give words like "football" a higher weight than very frequent words like "and" or "the".

This is achieved with the TF-IDF method, which combines two factors: the term frequency (TF), which measures how frequently the word occurs in the document, and the inverse document frequency (IDF), which downgrades the importance of words that occur in most documents, like "the" and "or".
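
As a rough sketch of how the two factors combine, the snippet below uses the plain textbook variant, tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf` and the sample sentences are assumptions for this example; library implementations often apply smoothing and normalization on top of this, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


docs = [
    "the football match was the best",
    "the weather was cold",
    "the team won the football cup",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Because "the" occurs in every document, its IDF is log(3 / 3) = 0 and its weight drops to zero, while "football" keeps a positive weight.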

Many other features can be included as part of feature engineering. For example, when building a model for text quality evaluation, it can be useful to include readability measures. Some of the more common ones are:

Flesch-Kincaid readability test,
SMOG index,
Gunning fog index,
Dale-Chall readability formula.

A Python library that implements the measures above is textstat: https://github.com/shivam5992/textstat.
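
To show what such a measure computes, here is a rough sketch of the Flesch-Kincaid grade level, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The syllable counter is a naive vowel-group heuristic assumed for this example, so the scores will not match those of a dedicated library exactly.

```python
import re


def count_syllables(word):
    """Naive heuristic: number of vowel groups, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level from word, sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


simple = "The cat sat on the mat. The dog ran."
dense = ("Computational linguistics investigates probabilistic "
         "representations of syntactic ambiguity.")

print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

The second text, with one long sentence and many multi-syllable words, gets a much higher grade level than the first.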

Word embeddings with word2vec

Word embeddings are another method of feature extraction that converts text to vectors of numbers.

Word2vec is essentially a shallow, two-layer neural network that takes a corpus of documents as input and computes a vector for each unique word in the corpus.

What is interesting about word2vec is that words that often occur in similar contexts in the corpus also end up close to each other in the word2vec vector space. The word2vec approach thus preserves semantic relationships.

A well-known example: if we calculate "Brother" - "Man" + "Woman" using the respective word2vec vectors, the resulting vector is closest to the vector representation of the word "Sister".
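
The arithmetic behind this example can be illustrated with a toy sketch. The two-dimensional vectors below are hand-made for the illustration and are not trained word2vec embeddings (real ones typically have hundreds of dimensions); the `analogy` helper is likewise a made-up name. Following common practice, the input words themselves are excluded from the candidates.

```python
import math

# Hand-made toy vectors (NOT trained embeddings): the first component loosely
# encodes gender, the second encodes "is a sibling".
vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "brother": [1.0, 1.0],
    "sister": [-1.0, 1.0],
    "table": [0.0, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy(["brother", "woman"], ["man"], vectors))  # prints sister
```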

Product categorization API

An interesting application of NLP frameworks is in the field of product categorization, which is generally the assignment of product names to one or more distinct classes.

There are many possible ways of assigning them, e.g. we can select just broad categories like "Apparel", or we can assign products to detailed categories like "Jeans".

These groups of categories are also known as taxonomies, and many different ones exist. In fact, big stores like Walmart or Rakuten have their own taxonomies that they developed for their assortment of products.

If one does not want to build a new taxonomy from scratch when developing a product categorization API, one can use the one from Facebook or the one from Google, and perhaps adapt it a bit to one's own needs.
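
To illustrate the difference between broad and detailed categories, here is a toy sketch with a hypothetical two-level taxonomy and simple keyword matching. The taxonomy, the keywords and the `categorize` function are all invented for this example; a real product categorization API would more likely rely on a trained text classifier and a full taxonomy.

```python
# Hypothetical two-level taxonomy: broad category -> detailed category -> keywords.
TAXONOMY = {
    "Apparel": {
        "Jeans": {"jeans", "denim"},
        "Shoes": {"sneakers", "boots"},
    },
    "Electronics": {
        "Phones": {"phone", "smartphone"},
        "Laptops": {"laptop", "notebook"},
    },
}


def categorize(product_name):
    """Return (broad, detailed) for the best keyword match, or None."""
    tokens = set(product_name.lower().split())
    best, best_hits = None, 0
    for broad, details in TAXONOMY.items():
        for detailed, keywords in details.items():
            hits = len(tokens & keywords)
            if hits > best_hits:
                best, best_hits = (broad, detailed), hits
    return best


print(categorize("Slim fit denim jeans"))  # ('Apparel', 'Jeans')
print(categorize("Garden hose"))           # None
```

Returning both levels lets the caller decide whether to use only the broad category or the detailed one.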

Redaction APIs are critical for NLP applications that process sensitive documents, as they help organizations meet privacy requirements (see more at the MIT Computer Science & Artificial Intelligence Lab).
For automated and scalable data masking, PII redaction is widely used to protect personally identifiable information in the legal, healthcare, and financial sectors.
Content moderation APIs rely on NLP to detect and filter harmful or inappropriate material in text and multimedia, a challenge explored by the University of Cambridge Computer Laboratory.
Solutions such as video moderation use advanced models to scan and flag problematic content in videos, supporting safer online environments.
Anonymization APIs help strip datasets of personal details before further analysis or sharing, a topic covered by the Stanford NLP Group.
Many organizations need to know how to anonymize transcripts or recordings before feeding them to AI, in order to ensure compliance and ethical AI use.
Recent research from the Allen Institute for AI demonstrates how modern NLP approaches can improve both redaction accuracy and data utility.
By leveraging these APIs, organizations can maintain robust privacy protections, support content safety, and enable responsible AI innovation.
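
As a very small illustration of the redaction idea, the sketch below masks e-mail addresses and phone-like numbers with regular expressions. The patterns and the `redact` function are assumptions made for this example and are deliberately simple; real PII redaction has to cover many more categories (names, addresses, identifiers) and usually combines rules with trained models.

```python
import re

# Deliberately simple patterns (assumptions for this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or call +1 555-123-4567."))
# prints Contact [EMAIL] or call [PHONE].
```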

Conclusion

In this article we provided a short introduction to natural language processing with Python. NLP has become an important part of the machine learning field.

In our next article, we will build an Aspect Based Sentiment Analysis model from scratch.

DEV Community