Pawan Jain

Posted on Jan 5, 2022 • Originally published at pyarmy.com

5 NLP Libraries Everyone Should Know

#nlp #python #programming

In this guide, we’ll be touring the essential stack of Python NLP libraries.

These packages handle a wide range of tasks such as part-of-speech (POS) tagging, dependency parsing, document classification, topic modelling, and much more.

The fundamental aim of NLP libraries is to simplify text preprocessing.

There are many tools and libraries created to solve NLP problems… but you’ll cover all the essential bases once you master a handful of them. That’s why I decided to feature the five Python NLP libraries I’ve found to be the most useful.

But before that, you should have some basic knowledge of the various components and topics of NLP.

For Foundation

There are some well-known, top-notch mainstay resources for building theoretical depth in Natural Language Processing:

Stanford Course — Natural Language Processing with Deep Learning
Deeplearning.ai Specialization — Natural Language Processing Specialization
Best book for FUNDAMENTALS (aka Bible of NLP) — Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics
Another Good reference book — Foundations of Statistical Natural Language Processing

1. spaCy Library

spaCy is a well-known and straightforward natural language processing library in Python. It is built for speed and efficiency and has an active open-source community.

Plus points:

Interfaces well with all major deep learning frameworks and offers useful pre-trained language models (these are downloaded separately from the library itself)
Comparatively fast because much of it is written in Cython

The best things you can do with spaCy

Part-of-speech (POS) Tagging: Assigning grammatical categories such as noun, verb, adjective, or adverb to words.
Entity Recognition: Labeling named entities found in the text with pre-defined categories, such as person, organization, or location.
Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Text Classification: Assigning categories or labels to a whole document, or parts of a document.
Sentence Boundary Detection (SBD): Finding and segmenting individual sentences.
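
As a small illustration of the last item, here is a minimal sketch of rule-based sentence boundary detection. It assumes spaCy v3 is installed and uses a blank English pipeline, so no model download is needed; the sample text is made up. POS tagging, entity recognition, and dependency parsing need a trained pipeline instead, which has to be downloaded separately.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation.
nlp.add_pipe("sentencizer")

text = "spaCy is written in Python and Cython. It offers trained pipelines for many languages."
doc = nlp(text)

# doc.sents yields one Span per detected sentence.
sentences = [sent.text for sent in doc.sents]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

The rule-based splitter is enough for clean text like this; a trained pipeline usually sets sentence boundaries from the dependency parse instead.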

Resources

Here is the link to their free official course: Advanced NLP with spaCy

More resources

A good blog post covering installation and other spaCy uses (Get started blog): Natural Language Processing With spaCy in Python
Intro to spaCy in Python (Videos) — Talks and tutorials in video format

2. NLTK Toolkit

NLTK is one of the most established libraries for building NLP models. It is easy to use and beginner-friendly, and it comes with a lot of pre-trained models and corpora, which help us analyze text very quickly.

Plus point: Built-in support for dozens of corpora and trained models

The best things you can do with NLTK

Recommendation: Content can be recommended based on how similar its text is to what a user already likes.
Sentiment Analysis: Measuring the inclination of people’s opinions (positive, negative, or neutral) from their text.
WordNet[1] Support: We can use synsets to look up words in WordNet and access synonyms, hypernyms, hyponyms, definitions, and more for many words.
Machine Translation: Translating source content into target languages.
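
The similarity-based recommendation idea can be sketched with tools that ship with NLTK and need no corpus download. This is only a toy, assuming NLTK is installed: the catalog, its titles, and the helper names token_set and recommend are all made up for illustration, and Jaccard distance over word sets is just one simple similarity measure.

```python
from nltk.metrics.distance import jaccard_distance
from nltk.tokenize import wordpunct_tokenize


def token_set(text):
    """Lowercased alphabetic tokens of a text, as a set."""
    return {token.lower() for token in wordpunct_tokenize(text) if token.isalpha()}


def recommend(query, catalog):
    """Return the catalog key whose text has the smallest Jaccard distance to the query."""
    query_tokens = token_set(query)
    return min(
        catalog,
        key=lambda title: jaccard_distance(query_tokens, token_set(catalog[title])),
    )


# Hypothetical mini catalog of article summaries.
catalog = {
    "intro_to_parsing": "Dependency parsing assigns syntactic relations between tokens.",
    "sentiment_basics": "Sentiment analysis measures the polarity of opinions in text.",
    "topic_models": "Topic modelling clusters documents by latent themes.",
}

best = recommend("How do I measure sentiment and opinions in reviews?", catalog)
print(best)
```

A real recommender would normally add stemming or lemmatization and stop-word removal before comparing texts.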

Resources

The best source for learning NLTK is their book: Analyzing Text with the Natural Language Toolkit
A good collection of related articles: NLTK (Natural Language Toolkit) Tutorial in Python
WordNet Docs — WordNet 3.0 Reference Manual

Unlike spaCy which focuses on providing software for production usage, NLTK is widely used for teaching and research — Wikipedia

3. Transformers

The transformers library is an open-source, community-driven library for training, using, and sharing models based on the Transformer architecture[2], such as BERT[3], RoBERTa[4], GPT-2[5], XLNet[6], etc.

The library lets you download pre-trained models for Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks.

Plus point: 32+ pre-trained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. Best for deep learning.

The best things you can do with Transformers

Summarization: Condensing a text or an article into a shorter text.
Translation: Translating a text from one language to another.
Text Generation: Creating a coherent portion of text that continues the given context.
Extractive Question Answering: Task of extracting an answer from a text given a question.

Resources

Official Docs — Huggingface Transformers
Building question-answering API with BERT, HuggingFace, and AWS Lambda — Serverless BERT with HuggingFace and AWS Lambda
Learn how to fine-tune BERT for sentiment analysis — Sentiment Analysis with BERT and Transformers

4. Gensim Package

Gensim is a Python library that specializes in identifying semantic similarity between documents through vector space modeling and topic modeling.

By the way, the name is short for “Generate Similar” (Gensim) :)

Plus point: High processing speed and the ability to handle large amounts of text.

The best things you can do with Gensim

Distributed computing: It can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers, which is how it handles massive amounts of data.
Document indexing: Associating information with a file or a specific tag so that it can be easily retrieved later.
Topic Modelling: Automatically clustering word groups and similar expressions that best characterize a set of documents.
Similarity retrieval: Organizing, storing, retrieving, and evaluating similar information from document repositories (here, textual information).

Resources

Official API Documentation — API Reference
Official Tutorials — Core tutorials
Using Gensim LDA for hierarchical document clustering — Document Clustering with Python
Beginner tutorial for Installing, handling, etc. — Python for NLP: Working with the Gensim Library

5. Stanza

Stanza[7] is a collection of accurate and efficient tools for many human languages in one place. From raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to a wide range of languages.

The toolkit is built on top of the PyTorch library with support for using GPU and pre-trained neural models.

In addition, Stanza includes a Python interface to the CoreNLP Java package and inherits additional functionality from there.

Plus point: It’s fast, accurate, and supports several major languages. Suitable for production use.

Resource: Here is the list of Python wrappers for CoreNLP

The best things you can do with Stanza

Morphological Feature Tagging: For each word in a sentence, Stanza evaluates its universal morphological features (e.g., singular/plural, 1st/2nd/3rd person, among others).
Multi-Word Token Expansion: Expanding multi-word tokens into their underlying syntactic words as the foundation for downstream processing.

The characteristics of these five libraries make them top choices for any project that relies on machine understanding of human language.

Thank you for reading. Stay tuned for more! Is there another foundational or essential library I missed? Let me know in the comments.

Footnotes

[1] Introduction to WordNet: An On-line Lexical Database — Miller & al., 1993
[2] Attention Is All You Need — Vaswani & al., 2017
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin & al., 2018
[4] RoBERTa: A Robustly Optimized BERT Pretraining Approach — Liu & al., 2019
[5] Language Models are Unsupervised Multitask Learners (GPT-2) — Radford & al., 2019
[6] XLNet: Generalized Autoregressive Pretraining for Language Understanding — Yang & al., 2019
[7] Stanza: A Python Natural Language Processing Toolkit for Many Human Languages — Qi & al., 2020
