Our previous blog explored how the Entity Synonym Mapper helps normalize extracted entities into canonical values.
In this post, we'll go one step deeper and look at how entities are detected in the first place, specifically through pattern-based extraction.
This is where the RegexEntityExtractor comes into play.
Contents of this blog
- What is RegexEntityExtractor
- YAML configuration
- Internal working
- When and why to use it
What is the RegexEntityExtractor?
The RegexEntityExtractor is a rule-based entity extractor that uses regular expressions to identify entities in user input.
Unlike ML-based extractors, it does not learn from data.
Instead, it works on a very simple principle:
If the text matches a predefined pattern, extract it as an entity.
This makes it:
- Deterministic
- Fast
- Extremely precise (when patterns are well-defined)
Why do we need it?
Not all entities are ambiguous.
Some entities:
- Follow fixed formats
- Are numerical or structured
- Do not benefit from ML generalization
Examples:
- Phone numbers
- Email addresses
- Order IDs
- Dates
- ZIP codes
Training an ML model to extract these is often overkill.
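To make the point concrete, here is a minimal sketch in plain Python (not Rasa code) showing that a few of these formats can be captured with short patterns. The patterns, the `ORD-` order ID format, and the `find_all` helper are all assumptions made for illustration; real formats vary by country and business rules, and dates are left out because their formats vary too much for a one-line pattern.

```python
import re

# Illustrative patterns only; adjust them to the formats you actually expect.
PATTERNS = {
    "phone_number": re.compile(r"\b\d{10}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "order_id": re.compile(r"\bORD-\d{6}\b"),  # hypothetical order ID format
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_all(text):
    """Return {entity_name: [matched strings]} for every pattern that matches."""
    return {
        name: pattern.findall(text)
        for name, pattern in PATTERNS.items()
        if pattern.search(text)
    }


text = "Order ORD-123456 ships to 94107, contact abc@gmail.com or 9876543210"
print(find_all(text))
```

No training data is involved: each entity type is fully described by its pattern.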
YAML Configuration Example
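Regex patterns are defined directly in your NLU YAML file. The name of the regex should match the name of the entity you want to extract, and each example is a regular expression, not a sample value.
version: "3.1"
nlu:
- regex: phone_number
  examples: |
    - \d{10}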
Pipeline Configuration
To enable it, the extractor must be added to your pipeline:
pipeline:
- name: WhitespaceTokenizer
- name: RegexEntityExtractor
Internal Working
At a high level, the RegexEntityExtractor works as follows:
- Takes the raw user message
- Iterates over each regex pattern defined in YAML
- Applies the pattern to the text
- If a match is found:
  - Extracts the matched substring
  - Assigns it as an entity
  - Stores start and end character indices
Consider the example:
"My phone number is 9876543210"
Then the entity extracted is:
{
  "entity": "phone_number",
  "value": "9876543210",
  "start": 19,
  "end": 29
}
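The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Rasa's actual implementation: the `PATTERNS` table and the `extract_entities` function are hypothetical names, and the `\b` word boundaries are an added assumption to avoid matching inside longer digit runs.

```python
import re

# Hypothetical pattern table; in Rasa these would come from the
# `regex` entries in the NLU YAML file.
PATTERNS = {"phone_number": r"\b\d{10}\b"}


def extract_entities(text, patterns=PATTERNS):
    """Apply every pattern to the text and collect one dict per match."""
    entities = []
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            entities.append(
                {
                    "entity": name,
                    "value": match.group(0),  # the matched substring
                    "start": match.start(),   # index of the first character
                    "end": match.end(),       # index just past the last character
                }
            )
    return entities


print(extract_entities("My phone number is 9876543210"))
# [{'entity': 'phone_number', 'value': '9876543210', 'start': 19, 'end': 29}]
```

The `end` index is exclusive, which is why a 10-digit value starting at 19 ends at 29.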
Combining with Entity Synonym Mapper
A very common pattern is:
- RegexEntityExtractor extracts the entity
- Entity Synonym Mapper normalizes it
This combination gives:
- Precision
- Consistency
- Clean downstream data
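As a rough sketch of this two-step flow in plain Python (again, not Rasa's internals): a regex finds surface variants of a value, and a synonym table maps each variant to one canonical form. The `city` entity, the pattern, the `SYNONYMS` table, and the lowercase lookup are all assumptions chosen for the example.

```python
import re

# Hypothetical pattern matching several surface forms of the same city.
CITY_PATTERN = re.compile(r"\b(?:NYC|New York City|Big Apple)\b", re.IGNORECASE)

# Hypothetical synonym table: lowercase variant -> canonical value.
SYNONYMS = {
    "nyc": "New York",
    "new york city": "New York",
    "big apple": "New York",
}


def extract_and_normalize(text):
    """Step 1: extract by pattern. Step 2: replace the value with its canonical form."""
    entities = []
    for match in CITY_PATTERN.finditer(text):
        raw = match.group(0)
        entities.append(
            {
                "entity": "city",
                "value": SYNONYMS.get(raw.lower(), raw),  # fall back to the raw text
                "start": match.start(),
                "end": match.end(),
            }
        )
    return entities


print(extract_and_normalize("Book a flight to nyc"))
```

The start and end indices still point at the original text, while `value` carries the normalized form that downstream code sees.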
When should RegexEntityExtractor be used?
- When the entity format is predictable
- When precision matters more than recall
- When you want to reduce ML complexity
- When you want deterministic behavior
In the next post, we'll explore CRFEntityExtractor, where entities are learned statistically rather than matched explicitly.