What happens after the text arrives from ASR.
🗣️ Say you tell a voice assistant:
"Book me a flight to Paris next Friday"
ASR does its job and converts that into text.
But at this point, the system still doesn't really understand anything.
It doesn't know:
🔹 what you're trying to do.
🔹 which parts of the sentence matter.
🔹 or what information is missing.
That's where NLU (Natural Language Understanding) comes in.

Here's what NLU figures out behind the scenes:
1️⃣ - **Intent Classification**
What are you trying to do?
→ You want to book a flight.
2️⃣ - **Entity Extraction** - the details (entities)
→ destination: Paris
→ date: next Friday
3️⃣ And finally - **Slot Filling** - what's missing
→ Where are you flying from?
So the system knows it needs to ask a follow-up (see the sketch below).
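Here is what that breakdown might look like in code. This is a minimal, rule-based sketch in Python: the intent label, the regex patterns, and the required-slot table are all illustrative stand-ins for whatever classifier or NER model a real pipeline would use.

```python
import re

# Which details each intent needs before the request is actionable (illustrative).
REQUIRED_SLOTS = {"book_flight": ("origin", "destination", "date")}

def classify_intent(text: str) -> str:
    # 1. Intent classification: a crude keyword rule, just to show the step.
    return "book_flight" if re.search(r"\b(book|flight)\b", text, re.I) else "unknown"

def extract_entities(text: str) -> dict:
    # 2. Entity extraction: toy patterns for origin, destination, and date.
    entities = {}
    if m := re.search(r"\bto ([A-Z][a-z]+)", text):
        entities["destination"] = m.group(1)
    if m := re.search(r"\bfrom ([A-Z][a-z]+)", text):
        entities["origin"] = m.group(1)
    if m := re.search(r"\b(today|tomorrow|next \w+)\b", text, re.I):
        entities["date"] = m.group(1)
    return entities

def understand(text: str) -> dict:
    intent = classify_intent(text)
    entities = extract_entities(text)
    # 3. Slot filling: which required details are still missing?
    missing = [s for s in REQUIRED_SLOTS.get(intent, ()) if s not in entities]
    return {"intent": intent, "entities": entities, "missing_slots": missing}

print(understand("Book me a flight to Paris next Friday"))
# {'intent': 'book_flight',
#  'entities': {'destination': 'Paris', 'date': 'next Friday'},
#  'missing_slots': ['origin']}
```

The non-empty `missing_slots` list is exactly what tells the dialogue layer to ask "Where are you flying from?" instead of guessing.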
That's the moment where the conversation starts to feel natural instead of scripted.
With large language models like GPT-4 or Claude, much of this NLU work can now happen in a single step, without training separate intent classifiers or entity models: the model reasons about intent, details, and gaps together.
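For comparison, here is a sketch of that single-step approach, assuming the OpenAI Python SDK and an API key in the environment. The model name, prompt wording, and output schema are illustrative choices, not a fixed standard.

```python
# Single-step NLU with an LLM: one call returns intent, entities, and missing slots.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are the NLU step of a voice assistant.
Return JSON with exactly these keys:
  "intent": what the user is trying to do,
  "entities": the details they provided,
  "missing_slots": required details they did NOT provide.
User said: "{utterance}" """

def understand(utterance: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any capable chat model works
        messages=[{"role": "user", "content": PROMPT.format(utterance=utterance)}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)

print(understand("Book me a flight to Paris next Friday"))
# Typically something like:
# {"intent": "book_flight",
#  "entities": {"destination": "Paris", "date": "next Friday"},
#  "missing_slots": ["origin"]}
```

Note that nothing here was trained on flight bookings specifically; the same prompt shape works for new intents by just describing them.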
That's a big reason modern Voice AI agents feel more flexible than the older "say it exactly this way" systems.