DEV Community

Zander Bailey

What's in a Name: Named Entity Recognition

Text is complicated, especially for a computer. In Natural Language Processing, or NLP, there are many ways to analyze text and many aspects of the text to examine. There are different ways to extract features from text, and one of these is Named Entity Recognition, or NER. A Named Entity Recognizer is a type of model that is trained on a labeled dataset and then used to search through a body of text and find all the Named Entities. We can break this down into a couple of parts. First, what is a Named Entity?

Named Entity

A Named Entity is any word or phrase referring to a real-world object, usually a noun, with a proper name. This could be a person, a place, an organization, or other things as well. Depending on the rules used, Named Entity Recognition can be modified to include pronouns like her, him, or they, but normally it will only identify actual names, like Jack or Microsoft. For most uses of Named Entity Recognition, it is more important to identify names than pronouns.
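To make the definition concrete, here is a minimal sketch of what a recognizer's output looks like. It is not a trained model: it uses a hand-written lookup table (`KNOWN_ENTITIES`, a hypothetical name made up for this example), which is enough to show that names are labeled and located while pronouns like "her" and "he" are skipped.

```python
import re

# Hypothetical hand-written lookup table, used only for illustration.
# A real recognizer learns to spot names from labeled training data.
KNOWN_ENTITIES = {
    "Jack": "PERSON",
    "Microsoft": "ORG",
    "Seattle": "PLACE",
}

def find_entities(text):
    """Return (name, label, start, end) for every known name found in text."""
    found = []
    for match in re.finditer(r"\b\w+\b", text):
        word = match.group()
        if word in KNOWN_ENTITIES:
            found.append((word, KNOWN_ENTITIES[word], match.start(), match.end()))
    return found

sentence = "Jack told her that he works at Microsoft in Seattle."
entities = find_entities(sentence)
for name, label, start, end in entities:
    print(f"{name:<10} {label:<7} chars {start}-{end}")
```

A lookup table like this only finds names it already knows, which is the main reason trained models are used instead.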

Training Data

A Named Entity Recognizer is trained on a labeled language dataset so that it learns how to identify a name in a sentence. There are ways to compile a training set out of your own data, but because most recognizers need the same kind of labeled text, there are existing datasets that can be used to train a standard NER model. The details can change slightly depending on what package you are using to build an NER model, but unless you want to add custom rules, it is often more efficient to seek out existing NER models.
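The article does not say what this training data looks like, so the following is only a sketch of one common convention, the BIO scheme, where every token is tagged as the Beginning of an entity, Inside an entity, or Outside any entity. The function name `bio_tags` and the span format are assumptions made up for this example.

```python
def bio_tags(tokens, entity_spans):
    """Label each token B-<type>, I-<type>, or O.

    entity_spans holds (first_token_index, end_token_index_exclusive, type).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

tokens = ["Jack", "Smith", "works", "at", "Microsoft", "."]
spans = [(0, 2, "PER"), (4, 5, "ORG")]      # hand-labeled for illustration
tagged = list(zip(tokens, bio_tags(tokens, spans)))
for token, tag in tagged:
    print(f"{token:<10} {tag}")
```

Labeling text this way by hand is slow, which is part of why reusing an existing dataset or model is usually the more efficient route.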

Uses

As an aspect of Natural Language Processing, Named Entity Recognition is used for many things, including news article classification, search, content recommendation, and customer support.

In addition to its many uses on its own, NER can be even more powerful when used in combination with other Natural Language Processing techniques. For instance, there is another NLP process called Topic Modeling. Topic Modeling analyzes a piece of text to find its most prominent topics. This is done by training a model on a corpus of documents to determine a set number of topics, based on word usage. In some types of documents, names are featured quite frequently. If you train a topic model on a corpus of documents with a lot of names, it will tend to return topics based heavily on those names, and sometimes that’s less useful. One strategy would be to use Named Entity Recognition to identify all the names beforehand and remove them from the text, in order to have clearer topics. This is just one example of how Named Entity Recognition can be combined with other NLP models to produce more useful output.
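The name-removal strategy can be sketched in a few lines. This assumes the NER step has already reported each entity as a pair of character offsets; in spaCy, for example, those would come from `ent.start_char` and `ent.end_char` on each entity in `doc.ents`. Here the offsets are computed by hand to stand in for model output, and `remove_entities` is a hypothetical helper, not part of any library.

```python
def remove_entities(text, spans):
    """Delete the character ranges in spans from text and tidy the whitespace.

    spans is a list of (start, end) offsets, as an NER model would report them.
    """
    kept = []
    cursor = 0
    for start, end in sorted(spans):
        kept.append(text[cursor:start])     # keep the text before this entity
        cursor = max(cursor, end)           # skip past it, tolerating overlaps
    kept.append(text[cursor:])              # keep whatever follows the last entity
    return " ".join("".join(kept).split())  # collapse the leftover gaps

text = "Jack Smith praised the new budget, and Microsoft sponsored the debate."
names = ["Jack Smith", "Microsoft"]         # stand-in for recognizer output
spans = [(text.index(n), text.index(n) + len(n)) for n in names]
cleaned = remove_entities(text, spans)
print(cleaned)
```

The cleaned text would then be passed to the topic model in place of the original, so that words like "budget" and "debate" carry the topics rather than the names.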

Words and documents can contain amazing amounts of information and insights, and there are many ways we can use machine learning and Natural Language Processing to analyze them. Named Entity Recognition can be a powerful tool in the NLP toolbox, and there are many existing models and large-scale NER projects. Stanford NER and spaCy are two examples of existing tools that provide NER models already trained on large corpora of documents.
