vincent d warmerdam for Rasa

Posted on Mar 9, 2021 • Originally published at blog.rasa.com

Proper Name Detection

#machinelearning #inclusion #nlp

Detecting names in a user message is a common challenge when designing a virtual assistant. It's a task many Rasa users face, which is why you can find many questions on the topic in the Rasa forum. It’s also an issue that is more complicated than many people initially think.

The Problem

Suppose you want to detect names in French—would you consider this to be a hard problem?

You should remember, French is spoken in a lot of places. It’s an official language in over 30 countries, many of which also have Arabic as an official language.

That means that when you are detecting names in French, you might be trying to detect names with an Arabic origin in a French body of text. This gives the problem a whole new dimension.

Suppose that you were thinking about using a pre-trained language model for French. Would it be able to find all the names? After all, it’s plausible that a pre-trained French model can overfit on names from France. Such a model might be good at detecting names like "Jacques" and "Véronique.” But what about names like "Aaadil" or "Heer"? Can we still consider pre-trained French models outside of France?

This problem isn’t just limited to regions outside of France. Inside of France, you would expect names with an Arabic origin too. In fact, this issue should be seen as a worldwide phenomenon! Whether your users are ex-pats or have a non-traditional name, you should still be able to detect their name. Detecting a person's name is a hard problem because it’s not limited to any country or language.

In this blog post, we’re going to discuss three approaches to detecting names in utterances. None of these solutions will be perfect, but they should offer reasonable places to start when you’re building a virtual assistant.

Approach 1: Pre-Trained Language Models

Pre-trained models should not be considered to be 100% perfect. That said, they can still be very useful. We can expect them to find some of the names so we shouldn’t ignore pre-trained models altogether.

A popular component for this task is the SpacyEntityExtractor. It can be configured to detect many kinds of entities and it supports many languages. In particular, we can configure spaCy to detect “PERSON” entities in a Rasa pipeline. You can see an example configuration for English below.

pipeline:
- name: SpacyNLP
  model: "en_core_web_md"
- name: SpacyTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: SpacyFeaturizer
  pooling: mean
- name: SpacyEntityExtractor
  dimensions: ["PERSON"]
- name: DIETClassifier
  epochs: 100

With SpacyEntityExtractor configured, Rasa will detect any entities that spaCy detects. As said before, it won’t be perfect, but it is a start. A common issue is that a lot of names of people are commonly confused with the names of organizations. Another issue is that spaCy, at least partially, trains its models on Wikipedia, which does not represent text that your users might type. If you’d like to understand these issues in more detail you should watch our Algorithm Whiteboard video on the topic.

Approach 2: NameLists

If you're interested in detecting names of people, then you might wonder if we need machine learning. After all, there are many datasets available containing baby names from around the world, and you can try to apply basic string matching against these lists. To help, we've started collecting some of these lists over at the rasa-nlu-examples repository (contributions welcome!). These names can be combined with our RegexEntityExtractor to find names in utterances.

To get this to work, you’ll first need to add a lookup table to your NLU data. The lookup table is added to your project as a separate YAML file whose contents might look something like this:

nlu:
- lookup: PERSON
  examples: |
    - aafrae
    - aasmae
    - abad
    - ...
    - zoumourrouda
    - zounnoun

Once the lookup table has been defined, you can add the RegexEntityExtractor to the pipeline that references it. The pipeline below demonstrates an example that uses both spaCy and a name list.

pipeline:
- name: SpacyNLP
  model: "en_core_web_md"
- name: SpacyTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: SpacyFeaturizer
  pooling: mean
- name: SpacyEntityExtractor
  dimensions: ["PERSON"]
- name: RegexEntityExtractor
  case_sensitive: False
  use_lookup_tables: True
- name: DIETClassifier
  epochs: 100

Note that we’ve set our RegexEntityExtractor to be case insensitive. The benefit of this is that we’re robust against users who don’t apply capitalization properly, but again, it won’t be perfect. The word “Mark” could be the name of a person, but it can also refer to a verb. A spelling error in the name could be another reason our name-lookup-table might not be able to catch it.

Approach 3: UI

Instead of detecting the name of a user, we can also just ask the user directly. If you must accurately capture the full name, it would be better just to give the user a form to fill in.

You can read more about forms in our documentation. When you create a form, you can define rules that determine how the required information is retrieved from the conversation. For example, you can add a step in your conversation that confirms if the name is spelled correctly. It might be an extra action in the conversation, but it will be more accurate than any machine learning model out there.

If you're interested in an example of how to build this yourself, be sure to checkout our youtube tutorial.

The main downside of using a form this way is that the user may need to take more time to repeat utterances. It’s a valid trade-off though since you’ll have more control over the quality of your data.

Conclusion: It’s still an unsolved problem.

We don't think there is a one-size-fits-all solution when it comes to digital assistants, just like there is no one-size-fits-all approach to finding names in the text. There are so many different languages, norms, and users out there, a custom approach is often necessary to find the best solution.

This doesn't mean that your custom solution needs to be very complicated. If you're detecting names, a small but thoughtfully defined form might be the best solution, versus relying on machine learning. In this post, we’ve outlined a few of the most common approaches to extracting names from text, but remember, in many cases, it’s a simple solution that works best.

DEV Community