In the previous blog, we explored RegexEntityExtractor, a rule-based approach where entities are extracted by explicitly matching patterns.
That works extremely well when entity formats are predictable.
But not all entities behave that way.
Some entities depend heavily on context, word boundaries, and surrounding tokens.
This is where statistical learning becomes necessary.
Enter the CRFEntityExtractor.
Contents of this blog
- What is CRFEntityExtractor
- Why do we need it
- How CRF works at a high level
- Training data format
- Pipeline configuration
- Internal working
- Strengths and limitations
- When and why to use it
What is the CRFEntityExtractor?
The CRFEntityExtractor is a machine-learning-based entity extractor that uses a Conditional Random Field (CRF) model.
Unlike regex-based extractors, it does not rely on fixed patterns.
Instead, it learns how entities appear in context from labeled training data.
In simple terms:
Given a sequence of tokens, the model learns which tokens belong to which entity types.
This allows it to extract entities even when:
- Formats vary
- Words are ambiguous
- Structure is loose
- Context determines meaning
Why do we need it?
Many real-world entities are not strictly structured.
Examples:
- Person names
- Locations
- Job titles
- Product names
- Custom domain-specific terms
Consider the word “Apple”:
- “Buy Apple stock” → organization
- “Eat an apple” → food
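A regex cannot tell these two uses apart, because the word itself is the same in both sentences.
A CRF can, because it looks at neighboring tokens ("Buy", "stock", "Eat", "an"), not just the token itself.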
How CRF works (high level)
CRF is a sequence labeling model.
Instead of classifying individual tokens independently, it predicts the most likely sequence of labels for an entire sentence.
Each token is assigned a label such as:
- B-entity (beginning)
- I-entity (inside)
- O (outside)
For example:
Book a flight from New York to Paris
Token labels might look like:
Book O
a O
flight O
from O
New B-location
York I-location
to O
Paris B-location
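The labelling above can be reproduced with a small sketch. It assumes entities are given as token spans; the function name `bio_tags` and the span format are illustrative and are not part of Rasa's API.

```python
def bio_tags(tokens, entities):
    """Assign B-/I-/O labels to a token list.

    `entities` maps (start_token, end_token_exclusive) -> entity type.
    This span format is a hypothetical one chosen for the example.
    """
    tags = ["O"] * len(tokens)
    for (start, end), label in entities.items():
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "Book a flight from New York to Paris".split()
# "New York" covers tokens 4-5, "Paris" is token 7.
entities = {(4, 6): "location", (7, 8): "location"}

for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(token, tag)
```

Running it prints the same token/label pairs as the listing above.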
The CRF learns which label sequences are valid and likely, not just which individual labels fit.
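To make "most likely sequence" concrete, here is a toy Viterbi decoder, the kind of search typically used to decode a linear-chain CRF. All scores are hand-picked assumptions for illustration; a trained model would learn them from data. The example compares it with labelling each token independently.

```python
LABELS = ["O", "B-location", "I-location"]

# Hypothetical scores. An entity cannot start with I-, so those moves are penalized.
START = {"O": 0.0, "B-location": 0.0, "I-location": -10.0}
TRANSITION = {
    ("O", "O"): 0.0, ("O", "B-location"): 0.0, ("O", "I-location"): -10.0,
    ("B-location", "O"): 0.0, ("B-location", "B-location"): -1.0,
    ("B-location", "I-location"): 1.0,
    ("I-location", "O"): 0.0, ("I-location", "B-location"): -1.0,
    ("I-location", "I-location"): 0.5,
}


def greedy(emissions):
    """Pick the best label for each token independently."""
    return [max(LABELS, key=lambda l: e[l]) for e in emissions]


def viterbi(emissions):
    """Pick the best-scoring label sequence as a whole."""
    # best[i][label] = (score of best path ending in `label` at i, previous label)
    best = [{l: (START[l] + emissions[0][l], None) for l in LABELS}]
    for e in emissions[1:]:
        row = {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p][0] + TRANSITION[(p, cur)])
            row[cur] = (best[-1][prev][0] + TRANSITION[(prev, cur)] + e[cur], prev)
        best.append(row)
    # Backtrack from the best final label.
    label = max(LABELS, key=lambda l: best[-1][l][0])
    path = [label]
    for row in reversed(best[1:]):
        label = row[label][1]
        path.append(label)
    return path[::-1]


# Per-token scores for "from New York" (hypothetical).
emissions = [
    {"O": 2.0, "B-location": 0.0, "I-location": 0.0},  # from
    {"O": 1.0, "B-location": 0.8, "I-location": 0.0},  # New
    {"O": 0.0, "B-location": 0.5, "I-location": 2.0},  # York
]

print(greedy(emissions))   # ['O', 'O', 'I-location']  (invalid: I- follows O)
print(viterbi(emissions))  # ['O', 'B-location', 'I-location']
```

Labelling tokens one by one yields an I- tag with no B- before it. Scoring the whole sequence lets the strong evidence on "York" pull "New" into the entity.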
Training data format
CRFEntityExtractor requires annotated training data in your NLU YAML file.
Example:
version: "3.1"
nlu:
- intent: book_flight
  examples: |
    - Book a flight from [New York](location) to [Paris](location)
    - Fly from [Berlin](location) to [London](location)
From this data, the model learns:
- Token patterns
- Contextual relationships
- Entity boundaries
- Transition probabilities between labels
More diverse examples generally lead to better generalization.
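As a rough illustration of what the annotations encode, the sketch below turns an annotated example into plain text plus character-level entity spans. It handles only the simple `[value](entity)` form used above, and `parse_example` is a hypothetical helper, not Rasa's own parser.

```python
import re

# Matches the simple annotation form: [value](entity)
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_example(annotated):
    """Return (plain_text, entities) for one annotated training example."""
    plain = ""
    entities = []
    last = 0
    for match in ANNOTATION.finditer(annotated):
        plain += annotated[last:match.start()]   # text before the annotation
        value, label = match.group(1), match.group(2)
        entities.append({
            "entity": label,
            "value": value,
            "start": len(plain),
            "end": len(plain) + len(value),
        })
        plain += value                           # keep the value, drop the markup
        last = match.end()
    plain += annotated[last:]
    return plain, entities


text, entities = parse_example(
    "Book a flight from [New York](location) to [Paris](location)"
)
print(text)  # Book a flight from New York to Paris
for entity in entities:
    print(entity)
```

The character spans are what later get aligned with tokens to produce the B-/I-/O labels the model is trained on.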
Pipeline configuration
To enable CRF-based extraction, add it to your pipeline:
pipeline:
- name: WhitespaceTokenizer
- name: LexicalSyntacticFeaturizer
- name: CRFEntityExtractor
Key supporting components:
- Tokenizer → splits text into tokens
- Featurizer → generates features such as:
  - Lowercase form
  - Word shape
  - Prefixes / suffixes
  - Token position
CRF does not work directly on raw text; it works on features.
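A minimal sketch of such per-token features is shown below. The feature names and the `token_features` helper are assumptions made for this example and do not mirror Rasa's internal feature names. Neighboring tokens are included because they are what lets the model separate the two uses of "Apple" from earlier.

```python
def word_shape(token):
    """Map uppercase -> X, lowercase -> x, digits -> d; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tokens, i):
    """Build a feature dict for the token at position i (names are illustrative)."""
    token = tokens[i]
    return {
        "low": token.lower(),
        "shape": word_shape(token),
        "prefix2": token[:2],
        "suffix3": token[-3:],
        "position": i,
        # Context features: the lowercase neighbors, with boundary markers.
        "prev_low": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_low": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


print(token_features(["Buy", "Apple", "stock"], 1))
print(token_features(["Eat", "an", "apple"], 2))
```

Both tokens share the same lowercase form, but their shape and neighbor features differ, which gives the model something to learn from.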
Internal working
At runtime, the CRFEntityExtractor operates roughly as follows:
- Tokenizes the user message
- Generates features for each token
- Applies the trained CRF model
- Predicts a label for every token
- Groups consecutive B- / I- labels into entities
- Outputs entities with:
  - Entity name
  - Extracted value
  - Start and end character indices
For the input:
> "I want to fly from San Francisco tomorrow"
The extractor may output:
{
  "entity": "location",
  "value": "San Francisco",
  "start": 19,
  "end": 32
}
The phrase is extracted not because it matches a pattern, but because the model learned that this sequence of tokens commonly forms a location.
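The grouping step can be sketched as follows. The labels are written by hand here and stand in for model predictions; the sentence, the whitespace tokenization, and the helper names are assumptions for illustration.

```python
def tokenize_with_offsets(text):
    """Whitespace tokenization that keeps each token's character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append((word, start, start + len(word)))
        pos = start + len(word)
    return tokens


def group_entities(text, labels):
    """Merge consecutive B-/I- labels into entities with character spans."""
    entities, current = [], None
    for (_, start, end), label in zip(tokenize_with_offsets(text), labels):
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"entity": label[2:], "start": start, "end": end}
        elif label.startswith("I-") and current and current["entity"] == label[2:]:
            current["end"] = end             # extend the open entity
        else:
            # "O", or an I- tag with no matching open entity: close any open one.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    for entity in entities:
        entity["value"] = text[entity["start"]:entity["end"]]
    return entities


text = "I want to fly from San Francisco tomorrow"
labels = ["O", "O", "O", "O", "O", "B-location", "I-location", "O"]  # hand-written
print(group_entities(text, labels))
```

For this sentence the sketch yields a single location entity with value "San Francisco", start 19, and end 32.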
When should CRFEntityExtractor be used?
CRFEntityExtractor is a good fit when:
- Entity boundaries depend on context
- Formats are inconsistent or unknown
- Natural language varies widely
- You want generalization rather than exact matching
It is often used alongside RegexEntityExtractor, not instead of it.
Each extractor solves a different problem class.
In the next blog, we’ll look at how DIETClassifier unifies intent classification and entity extraction, and why modern pipelines increasingly rely on it over standalone CRF models.