With the huge influx of unstructured text data from social media platforms, forums, and a wealth of documents, processing these sources to distill the information they contain is challenging because of their inherent complexity.
Natural Language Processing (NLP) helps greatly in processing, analyzing, and understanding these sources to extract information and meaningful insights. With recent advances in computing and easier access to computing resources, certain Deep Learning models have achieved state-of-the-art (SOTA) results on some of the most challenging NLP tasks.
The NLP series by the Women Who Code Data Science track gives learners a comprehensive learning path, starting from the basics of NLP and gradually introducing advanced concepts such as Deep Learning approaches to NLP tasks.
In this blog post, let us focus on answering the following questions.
- What is NLP?
- What are some interesting use cases of NLP?
- What are the challenges in processing natural language?
- What are the steps in a generic NLP task?
- What are common text pre-processing techniques?
What is NLP?
Natural language processing (NLP) can be considered to be a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language; in particular, how to program computers to process and analyze large amounts of natural language data.
With interesting applications such as text classification, sentiment analysis, machine translation, speech to text, and text to speech, NLP has evolved over the past few decades from rule-based approaches through statistical techniques to the AI-powered applications of the recent past.
Interesting use cases of NLP
Let’s take a look at some of the common use cases of NLP.
Machine Translation is the task of automatically converting one natural language into another while preserving the meaning of the input text and producing fluent text in the output language. However, this task of machine translation comes with inherent challenges.
Text Classification is the process of assigning tags or categories to text according to its content.
It’s a fundamental problem in NLP and can be done either manually (tedious, time-consuming, and susceptible to human error) or by leveraging ML techniques.
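To make the ML route a little more concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier written with only the Python standard library. The function names, the tiny training set, and the "spam"/"ham" labels are all made up for illustration; a real project would more likely use a library and far more data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, for illustration only.
training_data = [
    ("win a free prize now", "spam"),
    ("free cash offer click now", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]
model = train_naive_bayes(training_data)

print(classify("free prize offer", *model))
print(classify("lunch meeting tomorrow", *model))
```

With this toy data the first message is tagged as spam and the second as ham, simply because their words appear more often under those labels.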
Sentiment Analysis is the contextual mining of text to identify and extract subjective information from the source text, such as recognizing polarity (positive, negative, neutral) and identifying emotions.
A typical example is the e-commerce industry, where it is important to mine and analyze reviews to gain insights into customer satisfaction and experience and to identify potential areas for improvement.
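As a rough illustration of polarity detection on reviews, the sketch below scores a review by counting words from two small hand-made word lists. The lists and the `polarity` function are assumptions made for this example; this is a simple lexicon-based approach, and it ignores negation, sarcasm, and context.

```python
import re

# Hypothetical, hand-picked word lists; real sentiment lexicons are much larger.
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}


def polarity(review):
    """Label a review as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "Great phone, fast delivery",
    "Poor battery and a broken screen",
    "It arrived on Tuesday",
]:
    print(review, "->", polarity(review))
```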
Virtual assistants such as Siri, Alexa, and Cortana; Google Translate; and speech-to-text and text-to-speech converters are all cool NLP applications that we use in our everyday lives!
Challenges in understanding natural language
Natural language has such great diversity, and every language has its own rich grammar and uniqueness. The following are some of the inherent challenges that arise in NLP tasks.
Ambiguity is an intrinsic characteristic of human conversations and is particularly challenging in Natural Language Understanding scenarios, where the same word or sentence can have several interpretations and the AI system that we’ve programmed has to pick the intended one. In AI theory, the process of handling ambiguity is called disambiguation.
Synonymity stems from the fact that we can express the same idea with different terms (which also depend on the specific context). For example,
‘large’ have a similar meaning when referring to sizes, whereas
‘large’ doesn’t make sense when used as a qualifier to the word
Co-reference is the process of finding all expressions that refer to the same entity in a text.
Co-reference resolution is an important step for many higher-level NLP tasks that involve natural language understanding and is often instrumental in improving the performance of neural architectures like RNNs and LSTMs.
Knowledge about the structure and syntax of the language is often helpful. For a more detailed note on the different parsing techniques, please read through my original post: Natural Language Processing: Concepts and Workflow.
Generic NLP Workflow
The standard workflow for an NLP problem includes the following steps.
- The first step is usually text wrangling and pre-processing on the corpus of documents, followed by parsing and basic exploratory data analysis.
- As the next step, we look at representing text with word embeddings and subsequent feature engineering, followed by choosing the model depending on whether we’re looking at a supervised/unsupervised learning problem.
- As with any ML workflow, the final stage involves model evaluation and deployment.
Contraction Mapping / Expanding Contractions
Contractions are shortened versions of words or groups of words and are quite common in both spoken and written language. In English, they show up in everyday phrases, such as
I will to I’ll
I have to I’ve
do not to don’t
Mapping these contractions to their expanded form helps in text standardization.
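One simple way to do this mapping is a lookup table combined with a regular expression. The sketch below is only an illustration: `CONTRACTION_MAP` is a tiny hand-written table and `expand_contractions` is a hypothetical helper name. A real table would cover many more contractions and would have to decide how to treat ambiguous ones.

```python
import re

# Small illustrative table; extend it to suit the corpus.
CONTRACTION_MAP = {
    "i'll": "I will",
    "i've": "I have",
    "don't": "do not",
    "can't": "cannot",
}

# One pattern that matches any contraction in the table as a whole word.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    # Normalize curly apostrophes so that "I’ll" and "I'll" are treated alike.
    text = text.replace("’", "'")
    # Note: the original capitalization of the contraction is not preserved.
    return _CONTRACTION_RE.sub(
        lambda match: CONTRACTION_MAP[match.group(0).lower()], text
    )


print(expand_contractions("I'll call, but don't wait"))
```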
Tokenization is the process of separating a piece of text into smaller units called tokens.
Given a document, tokens can be sentences, words, subwords, or even characters depending on the application.
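The different token granularities can be sketched with a couple of regular expressions. These helpers are simplified stand-ins written for this post (the names are made up); library tokenizers handle abbreviations, URLs, and many other edge cases that these patterns do not.

```python
import re


def sentence_tokenize(text):
    """Split text into sentences after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(text):
    """Split text into words (keeping simple contractions) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


document = "NLP is fun. Tokenization comes first!"

print(sentence_tokenize(document))   # sentence-level tokens
print(word_tokenize("NLP is fun."))  # word-level tokens
print(list("fun"))                   # character-level tokens
```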
Special characters and symbols contribute to extra noise in unstructured text. Using regular expressions to remove them or using tokenizers, which do the pre-processing step of removing punctuation marks and other special characters, is recommended.
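A regular-expression version of this clean-up could look like the sketch below. The function name and the `keep_digits` flag are assumptions for illustration. Note that the pattern keeps only ASCII letters, so it would also strip accented characters and would need adjusting for non-English text.

```python
import re


def remove_special_characters(text, keep_digits=True):
    """Drop everything except ASCII letters, whitespace, and optionally digits."""
    pattern = r"[^a-zA-Z0-9\s]" if keep_digits else r"[^a-zA-Z\s]"
    cleaned = re.sub(pattern, "", text)
    # Collapse the extra whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", cleaned).strip()


print(remove_special_characters("Hello!!! NLP is #1 :)"))
print(remove_special_characters("Hello!!! NLP is #1 :)", keep_digits=False))
```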
Documents in a corpus are prone to spelling errors. To keep the text clean for subsequent processing, it is good practice to run a spell checker and fix the spelling errors before moving on to the next steps.
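As a toy illustration of the idea, the sketch below corrects words by fuzzy-matching them against a known vocabulary using `difflib` from the standard library. The vocabulary, the function names, and the 0.8 similarity cutoff are assumptions chosen for this example; dedicated spell checkers use much larger dictionaries and word frequencies.

```python
import difflib

# Hypothetical vocabulary; in practice this would be a full dictionary.
VOCABULARY = ["language", "processing", "natural", "text", "model"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_text("natural langauge procesing"))
```

Words with no sufficiently similar vocabulary entry are left unchanged, which avoids "correcting" names or domain terms into something unrelated.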
Stop words Removal
Stop words are words that are very common and often carry little significance. Hence, removing them is a common pre-processing step as well.
This can be done explicitly by retaining only those words in the document that are not in the list of stop words, or by specifying the stop word list as an argument in TfidfVectorizer methods when getting Bag-of-Words (BoW)/TF-IDF scores for the corpus of text documents.
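The explicit route can be sketched in a few lines of plain Python. The stop word set here is a short hand-written sample rather than a standard list, and the `Counter` at the end is only a minimal stand-in for Bag-of-Words counts.

```python
from collections import Counter

# Short illustrative list; standard stop word lists are considerably longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "the cat is in the hat and the cat is happy".split()
filtered = remove_stop_words(tokens)
bow = Counter(filtered)  # raw term counts for the remaining words

print(filtered)
print(bow)
```

When using scikit-learn instead, passing a list through the vectorizer's `stop_words` argument achieves the same filtering during vectorization.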
Both stemming and lemmatization are methods to reduce words to their base form.
While stemming follows certain rules to truncate words to their base form, often resulting in words that are not valid dictionary words, lemmatization always results in base forms that are valid dictionary words.
However, stemming is a lot faster than lemmatization. Hence, the choice between stemming and lemmatizing depends on whether the application needs quick pre-processing or more accurate base forms.
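The difference is easy to see with a deliberately naive sketch: a rule-based suffix stripper on one side and a dictionary lookup on the other. Both functions and the tiny lemma table are made up for illustration and are far cruder than real stemmers and lemmatizers, such as those shipped with NLTK.

```python
# Suffixes are checked in order, so longer ones come first.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMA_TABLE = {
    "studies": "study",
    "running": "run",
    "mice": "mouse",
}


def naive_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["studies", "running", "mice"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", naive_lemmatize(word))
```

The stems ("stud", "runn") are quick to compute but are not real words, while the lemmas ("study", "run", "mouse") are, at the cost of needing a dictionary.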
This is a very comprehensive introduction to NLP. For a longer read, including an example walkthrough of EDA and text pre-processing on the SMS spam classification dataset, refer to my post NLP: Concepts and Workflow😀
Cover Image: Photo by Kimberly Farmer on Unsplash