Unknownerror-404

Understanding the RegexEntityExtractor in RASA

Our previous post explored how the Entity Synonym Mapper helps normalize extracted entities into canonical values.

In this post, we’ll move one step deeper into how entities are detected in the first place, specifically using pattern-based extraction.
This is where the RegexEntityExtractor comes into play.

Contents of this blog

  • What is RegexEntityExtractor
  • YAML configuration
  • Internal working
  • When and why to use it

What is the RegexEntityExtractor?
The RegexEntityExtractor is a rule-based entity extractor that uses regular expressions to identify entities in user input.
Unlike ML-based extractors, it does not learn from data.
Instead, it works on a very simple principle:

If the text matches a predefined pattern, extract it as an entity.

This makes it:

  • Deterministic
  • Fast
  • Extremely precise (when patterns are well-defined)

Why do we need it?
Not all entities are ambiguous.
Some entities:

  1. Follow fixed formats
  2. Are numerical or structured
  3. Do not benefit from ML generalization

Examples:

  • Phone numbers
  • Email addresses
  • Order IDs
  • Dates
  • ZIP codes

Trying to train an ML model to extract these is often overkill.
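
Fixed-format entities like these can usually be described by a short pattern. The sketch below uses Python's standard `re` module with a few simplified, illustrative patterns. The pattern table and the `find_first` helper are assumptions made for this example, not something Rasa ships, and real-world formats (emails especially) are messier than these patterns suggest.

```python
import re

# Simplified, illustrative patterns (assumptions for this sketch,
# not production-grade validators).
EXAMPLE_PATTERNS = {
    "phone_number": r"\b\d{10}\b",            # exactly ten digits
    "zip_code": r"\b\d{5}(?:-\d{4})?\b",      # US-style ZIP, optional +4
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",     # very loose email shape
}


def find_first(entity_name, text):
    """Return the first substring matching the named pattern, or None."""
    match = re.search(EXAMPLE_PATTERNS[entity_name], text)
    return match.group(0) if match else None


print(find_first("email", "Write to abc@gmail.com today"))
print(find_first("zip_code", "Ship it to 90210 please"))
```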

YAML Configuration Example
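Regex patterns are defined directly in your NLU YAML file. Each entry under examples is a regular expression (not a sample value), and the name of the regex should match the name of the entity you want to extract.

version: "3.1"

nlu:
  - regex: phone_number
    examples: |
      - \d{10}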

Pipeline Configuration
To enable it, the extractor must be added to your pipeline:

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexEntityExtractor

Internal Working
At a low level, the RegexEntityExtractor works as follows:

  1. Takes the raw user message
  2. Iterates over each regex pattern defined in YAML
  3. Applies the pattern to the text
  4. If a match is found:
    • Extracts the matched substring
    • Assigns it as an entity
    • Stores start and end character indices
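
The steps above can be sketched in plain Python with the standard `re` module. This is a simplified illustration of the idea, not Rasa's actual source code: the `PATTERNS` table and the `extract_entities` function are hypothetical names, and the ten-digit pattern is assumed for the phone number.

```python
import re

# Hypothetical pattern table: entity name -> regex. In Rasa these would
# come from the `regex` entries in the NLU training data.
PATTERNS = {"phone_number": r"\d{10}"}


def extract_entities(text, patterns=PATTERNS):
    """Return one entity dict per regex match, with character offsets."""
    entities = []
    for entity_name, pattern in patterns.items():
        # Apply each pattern to the raw text and record every match.
        for match in re.finditer(pattern, text):
            entities.append(
                {
                    "entity": entity_name,
                    "value": match.group(0),   # matched substring
                    "start": match.start(),    # index of first character
                    "end": match.end(),        # index just past the last character
                }
            )
    return entities


print(extract_entities("Call me at 9876543210"))
```

Note that `end` is exclusive, so `text[start:end]` gives back the matched value.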

Consider the example:

"My phone number is 9876543210"

Then the entity extracted is:

{
  "entity": "phone_number",
  "value": "9876543210",
  "start": 19,
  "end": 29
}

Combining with Entity Synonym Mapper
A very common pattern is:

  1. RegexEntityExtractor extracts the entity
  2. Entity Synonym Mapper normalizes it

This combination gives:

  • Precision
  • Consistency
  • Clean downstream data
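
As a rough sketch of this two-step flow, the snippet below first extracts a value with a regex and then maps it to a canonical form. Everything here is an assumption for illustration: the `currency` entity, its pattern, the `SYNONYMS` table, and the `extract` and `normalize` helpers are made up, and lowercasing before the lookup is a choice of this sketch rather than a statement about how Rasa does it.

```python
import re

# Hypothetical regex for a made-up "currency" entity.
CURRENCY_PATTERN = re.compile(r"\b(?:usd|dollars?|bucks)\b", re.IGNORECASE)

# Hypothetical synonym table: lowercased surface form -> canonical value.
SYNONYMS = {"usd": "USD", "dollar": "USD", "dollars": "USD", "bucks": "USD"}


def extract(text):
    """Step 1: pattern-based extraction with character offsets."""
    return [
        {"entity": "currency", "value": m.group(0), "start": m.start(), "end": m.end()}
        for m in CURRENCY_PATTERN.finditer(text)
    ]


def normalize(entities):
    """Step 2: replace each value with its canonical form, if one is known."""
    for entity in entities:
        entity["value"] = SYNONYMS.get(entity["value"].lower(), entity["value"])
    return entities


entities = normalize(extract("I paid 20 bucks"))
print(entities)
```

The offsets still point at the original text ("bucks"), while the value carries the normalized form that downstream code sees.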

When should RegexEntityExtractor be used?

  • When the entity format is predictable
  • When precision matters more than recall
  • When you want to reduce ML complexity
  • When you want deterministic behavior

Next, we’ll explore the CRFEntityExtractor, where entities are learned statistically rather than matched explicitly.
