Why Does an Agent Need Intent Recognition?
The intuitive approach is to just hand user input directly to the LLM and let it figure out what to do. This works fine when your Agent has few tools and a single use case.
But when an Agent simultaneously has a search tool, a code tool, a calculator, and a knowledge base, a problem emerges: the same LLM, with different tool sets and system prompts, delivers dramatically different quality on the same task.
A dedicated QA Agent will use the knowledge base, cite sources, and explain concepts deeply. A general-purpose Agent facing the same question might fire off a quick web search and return a snippet. Intent routing gives each request to the Agent best equipped to handle it.
That's what the intent layer does: determine what the user actually wants before dispatching the request to a specialist Agent.
Keyword Matching vs. LLM Classification
Let's benchmark both approaches on real inputs before building anything.
The Problem with Keyword Matching
The classic approach in traditional NLP projects:
_KEYWORD_RULES = {
    "search": ["search", "latest", "news", "find"],
    "code": ["code", "write a", "function", "implement", "bug"],
    "calculate": ["calculate", "how much", "equals", "add", "multiply"],
    "qa": ["what is", "why", "how", "explain", "describe"],
}

def keyword_classify(text: str) -> str:
    scores = {k: 0 for k in _KEYWORD_RULES}
    for intent, keywords in _KEYWORD_RULES.items():
        for kw in keywords:
            if kw in text.lower():
                scores[intent] += 1
    best = max(scores, key=lambda k: scores[k])
    return best if scores[best] > 0 else "unknown"
Testing against real natural language inputs (in Chinese):
△ [Clear intent - search]
Input: "What's new in the latest version of LangChain?"
Keyword → unknown ← no keyword hit
LLM → search (80%)
△ [Clear intent - code]
Input: "Help me write a bubble sort algorithm"
Keyword → unknown ← "help me write" not in the list
LLM → code (90%)
[Clear intent - QA]
Input: "What's the principle behind Transformer models?"
Keyword → qa ← "what" hit
LLM → qa (90%)
[Colloquial calculation]
Input: "2 plus 3 times 4, what does that equal?"
Keyword → calculate ← "equals" hit
LLM → calculate (100%)
The fatal flaw of keyword matching: It only works when users happen to use the exact keywords in your rule table. "Has LangChain released a new version?" or "implement a sort for me" produce unknown. Maintaining this rule table is extremely costly — every new business scenario and every new way users phrase things requires manual updates.
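To make the failure modes concrete, here is a minimal sketch that runs the same rule table against a few English paraphrases. The sample sentences are illustrative assumptions, not inputs from the demo. Besides the misses, it shows a second weakness of substring matching: a keyword can fire inside an unrelated word.

```python
KEYWORD_RULES = {
    "search": ["search", "latest", "news", "find"],
    "code": ["code", "write a", "function", "implement", "bug"],
    "calculate": ["calculate", "how much", "equals", "add", "multiply"],
    "qa": ["what is", "why", "how", "explain", "describe"],
}

def keyword_classify(text: str) -> str:
    """Count substring hits per intent and return the intent with the most hits."""
    lowered = text.lower()
    scores = {
        intent: sum(kw in lowered for kw in keywords)
        for intent, keywords in KEYWORD_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# Illustrative inputs (assumptions, not taken from the demo run)
samples = [
    "Has LangChain released a new version?",       # search intent, no keyword hit
    "Can you put together a bubble sort for me?",  # code intent, no keyword hit
    "My address book is empty",                    # not math, but "add" matches inside "address"
]
for sample in samples:
    print(f"{sample!r} -> {keyword_classify(sample)}")
```

The first two inputs fall through to unknown, and the third is classified as calculate purely because of a substring collision. Word-boundary matching would fix the collision, but not the coverage problem.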
LLM Classifier
Replace keyword rules with an LLM-based classifier, manually parsing JSON output:
import json
import re

from langchain_core.messages import HumanMessage, SystemMessage

def llm_classify(text: str, history: list[str] | None = None) -> IntentResult:
    # The prompt body stays flush-left so no extra indentation leaks into the prompt text
    system_prompt = f"""You are an intent classifier. Classify user input into one of 5 intents:
search - search for info, latest news, current events
code - write/debug/optimize code, programming questions
calculate - math calculations, unit conversions
qa - knowledge Q&A, concept explanations, how things work
clarify - input unclear or ambiguous, cannot determine intent
Rules:
1. If conversation history is provided, instructions like "just fix it", "add to it",
"clean it up" should be interpreted based on the historical context
2. Return clarify when confidence is below 0.6; do not guess
Return only this JSON format, nothing else:
{{"intent": "<intent>", "confidence": <0-1 number>, "reasoning": "<one sentence>"}}"""
    response = llm.invoke([SystemMessage(system_prompt), HumanMessage(text)])
    raw = response.content if isinstance(response.content, str) else str(response.content)
    # Extract JSON from response (handles model adding extra text around JSON)
    try:
        match = re.search(r"\{[^{}]+\}", raw, re.DOTALL)
        data = json.loads(match.group() if match else raw)
        ...
    except Exception:
        # Fallback: look for intent keywords in raw text
        for candidate in ("search", "code", "calculate", "qa"):
            if candidate in raw.lower():
                return IntentResult(intent=candidate, confidence=0.5, reasoning="(JSON parse degraded)")
        return IntentResult(intent="clarify", confidence=0.3, reasoning="(could not parse LLM output)")
Why manual JSON parsing instead of with_structured_output? In testing, GLM-4-Flash's with_structured_output JSON schema mode sometimes returns only partial fields, causing a Pydantic ValidationError. Manual parsing with regex extraction and fallback logic is more robust, and this pattern is commonly used in production systems.
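The parsing step can be exercised without calling a model at all. The sketch below isolates it into a standalone function. The IntentResult dataclass is inferred from how the result is used in this article, and the validation inside the try block (intent whitelist, clamping confidence to the 0 to 1 range) is an assumption about what the elided part of the demo might do, not the demo's actual code.

```python
import json
import re
from dataclasses import dataclass

VALID_INTENTS = ("search", "code", "calculate", "qa", "clarify")

@dataclass
class IntentResult:  # assumed shape, inferred from usage in the article
    intent: str
    confidence: float
    reasoning: str

def parse_intent_response(raw: str) -> IntentResult:
    """Parse classifier output, degrading gracefully when the JSON is malformed."""
    try:
        # Grab the first flat {...} object, ignoring any chatter around it
        match = re.search(r"\{[^{}]+\}", raw, re.DOTALL)
        data = json.loads(match.group() if match else raw)
        intent = data["intent"]
        if intent not in VALID_INTENTS:
            raise ValueError(f"unexpected intent: {intent!r}")
        # Clamp confidence into [0, 1] in case the model returns e.g. 90
        confidence = min(1.0, max(0.0, float(data["confidence"])))
        return IntentResult(intent, confidence, str(data.get("reasoning", "")))
    except Exception:
        # Fallback 1: look for an intent name anywhere in the raw text
        lowered = raw.lower()
        for candidate in ("search", "code", "calculate", "qa"):
            if candidate in lowered:
                return IntentResult(candidate, 0.5, "(JSON parse degraded)")
        # Fallback 2: give up and ask the user
        return IntentResult("clarify", 0.3, "(could not parse LLM output)")

wrapped = 'Sure! {"intent": "code", "confidence": 0.9, "reasoning": "wants a function"} Hope that helps.'
print(parse_intent_response(wrapped))
print(parse_intent_response("I think this is a calculate request"))
print(parse_intent_response("???"))
```

Keeping the parser separate from the LLM call also makes it easy to unit-test against raw outputs collected from failed runs.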
LangGraph Intent Router: Dispatching to Specialist Agents
Once intent is identified, LangGraph's conditional edges are the natural mechanism for routing.
Graph Structure
START
│
▼
[classify node] ← LLM classification, outputs intent + confidence
│
├─ "search" ──→ [search_agent] web_search tool
├─ "code" ──→ [code_agent] run_code tool
├─ "calculate" ──→ [calculator_agent] calculator tool
├─ "qa" ──→ [qa_agent] knowledge_base tool
└─ "clarify" ──→ [clarify_agent] (no tools, generates a question)
│
END
Implementation:
from typing import TypedDict

from langgraph.graph import END, StateGraph

class RouterState(TypedDict):
    user_input: str
    conversation_history: list[str]
    intent: str
    confidence: float
    reasoning: str
    response: str

def classify_node(state: RouterState) -> dict:
    result = llm_classify(state["user_input"], state.get("conversation_history"))
    print(f" [classify] intent={result.intent} confidence={result.confidence:.0%}")
    return {"intent": result.intent, "confidence": result.confidence, "reasoning": result.reasoning}

def route_by_intent(state: RouterState) -> str:
    return state["intent"]  # used directly as the next node name

def build_intent_router():
    graph = StateGraph(RouterState)
    graph.add_node("classify", classify_node)
    # Each factory call also needs that agent's system prompt as the third
    # argument (system_text); the prompt strings are omitted in this excerpt.
    graph.add_node("search", _make_specialist_node("search_agent", [web_search]))
    graph.add_node("code", _make_specialist_node("code_agent", [run_code]))
    graph.add_node("calculate", _make_specialist_node("calculator_agent", [calculator]))
    graph.add_node("qa", _make_specialist_node("qa_agent", [knowledge_base]))
    graph.add_node("clarify", clarify_node)
    graph.set_entry_point("classify")
    graph.add_conditional_edges(
        "classify",
        route_by_intent,
        {"search": "search", "code": "code", "calculate": "calculate",
         "qa": "qa", "clarify": "clarify"},
    )
    for node in ["search", "code", "calculate", "qa", "clarify"]:
        graph.add_edge(node, END)
    return graph.compile()
Each specialist Agent is created with a factory function, bound to a specific tool set and system prompt:
def _make_specialist_node(node_name: str, tools_list: list, system_text: str):
    specialist = create_react_agent(model=specialist_llm, tools=tools_list)

    def _node(state: RouterState) -> dict:
        result = specialist.invoke({
            "messages": [
                ("system", system_text),
                ("user", state["user_input"]),
            ]
        })
        ...

    return _node
Routing Results
[Search] "What's new in the latest version of LangGraph?"
[classify] intent=search confidence=80%
[route] → search_agent
[answer] LangGraph 0.2 introduced a functional API for more flexible Agent orchestration.
[QA] "What is RAG and what problems does it solve for LLMs?"
[classify] intent=qa confidence=90%
[route] → qa_agent
[answer] RAG (Retrieval-Augmented Generation) = retrieve relevant docs from a knowledge base
+ inject into Prompt + LLM generates. Core value: reduce hallucinations, inject real-time knowledge.
[Ambiguous] "Just handle it"
[classify] intent=clarify confidence=50%
[route] → clarify_agent
[answer] Could you tell me what specific thing you'd like help with?
Confidence Thresholds and Clarification
Intent classification isn't binary: "just fix it" is genuinely ambiguous. When the classifier can't tell what the user wants, asking is a better strategy than guessing.
The confidence threshold controls routing behavior:
# When confidence < 0.6 → LLM should return clarify intent
# clarify_node: no tools, just generate a clarifying question
def clarify_node(state: RouterState) -> dict:
    resp = llm.invoke([
        SystemMessage(
            "The user's request is unclear. Ask one short, friendly question "
            "to find out exactly what they need. Don't guess — just ask."
        ),
        HumanMessage(state["user_input"]),
    ])
    return {"response": resp.content.strip()}
Real results from the demo:
[Low confidence - ambiguous] "Just fix it"
confidence: 50% intent: clarify
Clarification: What would you like me to fix?
[Low confidence - unclear reference] "Another one"
confidence: 50% intent: clarify
Clarification: What specific information or help are you looking for?
[Low confidence - incomplete] "How do I do that thing"
confidence: 50% intent: clarify
Clarification: Are you asking about specific steps? Could you describe what you need help with?
Versus high-confidence clear inputs:
[High confidence] "What's 2 to the power of 10?"
confidence: 100% intent: calculate → calculated immediately
[High confidence] "What's new in Python's latest version?"
confidence: 80% intent: search → searched immediately
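One caveat: the 0.6 threshold only exists as an instruction in the prompt, so nothing stops the model from returning a specialist intent with low confidence, or an intent name the graph doesn't know. A defensive variant of the routing function might enforce both in code. This is a sketch, not the demo's implementation; KNOWN_INTENTS and CONFIDENCE_THRESHOLD are names introduced here for illustration.

```python
# Names below are illustrative assumptions, not part of the demo code
KNOWN_INTENTS = {"search", "code", "calculate", "qa", "clarify"}
CONFIDENCE_THRESHOLD = 0.6  # same cutoff the classifier prompt asks for

def route_by_intent(state: dict) -> str:
    """Return the next node name, falling back to 'clarify' when unsure."""
    intent = state.get("intent", "clarify")
    confidence = state.get("confidence", 0.0)
    if intent not in KNOWN_INTENTS:
        # The model invented a label that the graph has no edge for
        return "clarify"
    if confidence < CONFIDENCE_THRESHOLD:
        # Enforce the threshold in code instead of trusting the prompt
        return "clarify"
    return intent

print(route_by_intent({"intent": "code", "confidence": 0.9}))       # code
print(route_by_intent({"intent": "calculate", "confidence": 0.5}))  # clarify
print(route_by_intent({"intent": "poetry", "confidence": 0.95}))    # clarify
```

Note the trade-off: with this variant, results from the degraded JSON fallback (confidence 0.5) would also be sent to the clarify node rather than to a specialist.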
Multi-Turn Intent Tracking: Context Makes Ambiguous Instructions Clear
This is the most valuable finding in the demo.
Scenario A: A code conversation has been going on, then an ambiguous instruction arrives:
# Existing conversation history:
User: "Write me a Python function to calculate the average of a list"
Agent: "def average(lst): return sum(lst) / len(lst) if lst else 0.0"
User: "What happens if there are non-numeric values in the list?"
Agent: "Use try/except TypeError, or filter non-numeric elements beforehand"
Classification of the phrase "just optimize it" (three characters in Chinese):
Input: "just optimize it"
✗ No history → clarify (50%) Input unclear, cannot determine intent.
✓ With history → code (80%) User requesting code optimization based on conversation context.
"Change comments to English":
Input: "change comments to English"
✗ No history → search (50%) (JSON parse failure, degraded)
✓ With history → code (100%) User requesting code modification, clearly a coding task.
Scenario B: Calculation continuation:
# History: User asked "what is 2 to the power of 10?" → Agent answered "1024"
Input: "now multiply that by 3"
✗ No history → calculate (90%) (keywords are clear enough)
✓ With history → calculate (100%) User continuing the previous calculation.
Implementation: inject the last 4 conversation turns into the classification prompt:
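# Guard against a missing or empty history so the prompt is unchanged in that case
history_section = ""
if history:
    history_section = "\n\nRecent conversation history:\n" + "\n".join(f" {h}" for h in history[-4:])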
system_prompt = f"""You are an intent classifier...
Rules:
1. If conversation history is present, instructions like "fix it", "add to it",
"another one" should be interpreted based on that context
...{history_section}"""
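The history injection is easy to test in isolation. Below is a small sketch that wraps the slicing in a helper; the function name build_history_section and the max_entries parameter are hypothetical, introduced here for illustration.

```python
from typing import List, Optional

def build_history_section(history: Optional[List[str]], max_entries: int = 4) -> str:
    """Format the most recent history entries for the classifier prompt.

    Returns an empty string when there is no history, so the prompt is
    identical to the no-history case.
    """
    if not history:
        return ""
    recent = history[-max_entries:]  # keep only the tail of the conversation
    lines = "\n".join(f"  {entry}" for entry in recent)
    return "\n\nRecent conversation history:\n" + lines

history = [
    "User: Write me a Python function to calculate the average of a list",
    "Agent: def average(lst): return sum(lst) / len(lst) if lst else 0.0",
    "User: What happens if there are non-numeric values in the list?",
    "Agent: Use try/except TypeError, or filter non-numeric elements beforehand",
    "User: just optimize it",
]
print(build_history_section(history))
```

One detail worth checking in your own setup: the slice counts list entries. If each entry is a single message rather than a full user/agent exchange, four entries cover only two round trips.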
Interesting Findings from Real Execution
Running all 5 demos surfaced several behaviors worth documenting:
Finding 1: JSON parse failure causes wrong routing
In Demo 5 Turn 2, "Help me write the simplest Hello World Agent using it" (an unambiguous code request) was routed to clarify_agent because GLM-4-Flash failed to output valid JSON for this input. The fallback logic couldn't find an intent keyword in the raw output either, so it returned clarify (30%).
Turn 2: "Help me write the simplest Hello World Agent using it"
[classify] intent=clarify confidence=30%
reasoning: (could not parse LLM output)
[route] → clarify_agent
[answer] Are you asking me to write a Hello World example in some language or framework?
Interestingly, the clarification question was logically correct — the model understood the request, it just failed to produce valid JSON.
Finding 2: Wrong routing, but specialist Agent salvaged it
"Write me a Python function to calculate the average of a list" was mis-classified as calculate (50%) due to JSON parse degradation and sent to calculator_agent. The calculator_agent only has a calculator tool — but it answered with the actual function anyway, using the LLM directly:
[route] → calculator_agent (should have been code_agent)
[answer] def calculate_average(numbers):
if not numbers:
return 0
return sum(numbers) / len(numbers)
Lesson: even with wrong routing, a specialist Agent with a good system prompt has some fault tolerance. But don't rely on this — fix reliability at the classification layer.
Finding 3: Model language drift
Some reasoning outputs were in English even when user input was in Chinese ("The user is asking for a mathematical calculation."). GLM-4-Flash sometimes thinks in English. This doesn't affect functionality, but if you're surfacing reasoning to users, you'll want to normalize the output language.
Finding 4: Conversation history is the biggest quality multiplier
The most impactful improvement across the entire demo was injecting conversation history into the classification prompt. "Just optimize it" — with zero history → clarify every time; with code conversation history → code (80%). In production systems, this means: conversation history doesn't just help the LLM answer better, it directly improves intent routing accuracy.
Production Architecture: Three-Layer Classification + OOD Rejection + Data Loop
The demo above uses a large model directly for intent classification — that's fine for learning and prototyping, but it's not how production systems work.
Real-world industrial intent recognition uses a three-layer funnel:
User Input
│
▼
┌─────────────────────────────┐
│ Layer 1: Rule Routing │ < 1ms
│ Keywords + Regex + FSM │ ← handles ~5% of explicit commands
└─────────────┬───────────────┘
│ no match
▼
┌─────────────────────────────┐
│ Layer 2: Fine-tuned SLM │ 10~50ms
│ 5B/7B model (Qwen/GLM/ │ ← handles ~90% of routine intents
│ Llama fine-tuned) │
└─────────────┬───────────────┘
│ low confidence / long-tail
▼
┌─────────────────────────────┐
│ Layer 3: Large Model │ 100~500ms
│ GPT-4o / Claude / Qwen-72B │ ← handles ~5% of complex edge cases
└─────────────┬───────────────┘
│
▼
OOD Rejection Layer
(filter out-of-scope requests)
Layer 1: Rule Routing (<1ms)
The first layer handles only semantically certain, fixed-expression commands — no semantic understanding, just string matching or FSM transitions:
import re

RULE_ROUTES = {
    r"^transfer to human$|^speak to agent$|^I want to complain$": "human_handoff",
    r"^open .+ app$|^launch .+": "app_launch",
    r"^(quit|back|cancel|never mind)$": "cancel",
    r"^(hi|hello|hey|are you there)$": "greeting",
}

def rule_route(text: str) -> str | None:
    # Patterns must match the whole (stripped) input, case-insensitively
    for pattern, intent in RULE_ROUTES.items():
        if re.fullmatch(pattern, text.strip(), re.IGNORECASE):
            return intent
    return None  # no match — pass to next layer
Advantages: zero LLM calls, <1ms latency, fully predictable, logic is auditable.
Use cases: "transfer to human" in customer service, shortcut commands in smart assistants, toggle operations.
Layer 2: Fine-tuned Small Language Model (10~50ms)
Requests that Layer 1 does not match go to a fine-tuned small language model (SLM), typically under 10B parameters:
- Model options: Qwen2.5-7B-Instruct, GLM-4-9B, Llama-3.1-8B fine-tuned
- Training data: Annotated production logs + data augmentation; a few thousand to tens of thousands of samples to reach production-grade accuracy
- Deployment cost: A single A10 (24GB VRAM) can serve a 7B model, reaching 100+ QPS in batched inference
- Accuracy: Covers 90%+ of routine intents at 92–97% accuracy
# Pseudocode: call a locally deployed fine-tuned SLM
def slm_classify(text: str, history: list[str]) -> IntentResult:
    response = slm_client.chat(
        messages=build_classify_prompt(text, history),
        temperature=0.1,  # low temperature for classification
        max_tokens=64,    # only need intent + confidence
    )
    return parse_intent(response)
Key metric: average latency 20–50ms, 80%+ cost reduction vs. a large model.
Layer 3: Large Model Fallback (100~500ms)
When SLM confidence is low, inputs are out-of-distribution, or the request involves complex multi-intent scenarios, escalate to a large model:
- Trigger condition: SLM confidence < 0.6, or input outside training distribution
- Traffic share: typically only 5–10% of requests escalate
- Cost control: because escalation rate is low, total cost remains manageable
Three-layer comparison:
| Layer | Latency | Cost | Traffic Share | Use Case |
|---|---|---|---|---|
| Rule Routing | <1ms | Minimal | ~5% | Fixed commands, quick actions |
| Fine-tuned SLM | 10~50ms | Low | ~90% | Routine intent classification |
| Large Model | 100~500ms | High | ~5% | Long-tail, complex, edge cases |
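The escalation logic between the three layers can be sketched as a single function that takes each layer as a callable. This is a minimal illustration under stated assumptions: the IntentResult shape is inferred from earlier sections, the classify_cascade name and its signature are hypothetical, and the toy layers at the bottom are stand-ins rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class IntentResult:  # assumed shape, inferred from usage in the article
    intent: str
    confidence: float
    reasoning: str = ""

def classify_cascade(
    text: str,
    rule_route: Callable[[str], Optional[str]],
    slm_classify: Callable[[str], IntentResult],
    llm_classify: Callable[[str], IntentResult],
    slm_threshold: float = 0.6,
) -> Tuple[str, IntentResult]:
    """Return (layer_name, result) from the first layer that is confident enough."""
    # Layer 1: fixed rules, no model call
    rule_intent = rule_route(text)
    if rule_intent is not None:
        return "rule", IntentResult(rule_intent, 1.0, "matched a fixed rule")
    # Layer 2: fine-tuned small model handles routine traffic
    slm_result = slm_classify(text)
    if slm_result.confidence >= slm_threshold:
        return "slm", slm_result
    # Layer 3: escalate low-confidence inputs to the large model
    return "llm", llm_classify(text)

# Toy stand-ins for the three layers (hypothetical, for illustration only)
def toy_rules(text: str) -> Optional[str]:
    return "cancel" if text.strip().lower() == "cancel" else None

def toy_slm(text: str) -> IntentResult:
    if "sort" in text:
        return IntentResult("code", 0.92)
    return IntentResult("qa", 0.4)

def toy_llm(text: str) -> IntentResult:
    return IntentResult("search", 0.8, "escalated")

for query in ["cancel", "write a bubble sort", "what about that thing from earlier"]:
    layer, result = classify_cascade(query, toy_rules, toy_slm, toy_llm)
    print(f"{query!r}: layer={layer} intent={result.intent}")
```

Returning the layer name alongside the result makes it straightforward to log the per-layer traffic share and check it against the rough split in the table.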
OOD Rejection: Filtering Out-of-Scope Requests
OOD (Out-of-Distribution) rejection is often overlooked but critically important — it identifies and refuses requests that fall outside the system's service scope.
Typical scenario: a user says "write me a poem" to a shopping assistant. It's valid natural language, but not in scope. Without OOD rejection, this gets classified with low confidence and routed incorrectly.
# OOD rejection approaches

# Approach A: embedding similarity threshold
def ood_reject_by_embedding(text: str, threshold: float = 0.5) -> bool:
    emb = embed_model.encode(text)
    # compare against the intent sample corpus
    max_sim = max(cosine_sim(emb, sample) for sample in intent_samples)
    return max_sim < threshold  # True = OOD, should reject

# Approach B: confidence fallback (reject if all layers return low confidence)
def should_reject(intent: str, confidence: float) -> bool:
    return intent == "clarify" and confidence < 0.4
Rejection responses should be friendly and guiding, not error messages:
"Sorry, I can currently only help with [XX-related] questions.
Your request is outside my service scope.
You can try: [relevant suggestion] or [transfer to human]"
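Returning to Approach A: the embedding check depends on an embedding model, but the decision logic itself is just a maximum over cosine similarities. The sketch below shows that logic with hand-written toy vectors in place of real embeddings. The vectors and the is_ood name are assumptions for illustration, and the 0.5 threshold is only the default used above; a real threshold would need calibrating against your embedding model and sample corpus.

```python
import math
from typing import List, Sequence

def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_ood(query_vec: Sequence[float],
           intent_sample_vecs: List[Sequence[float]],
           threshold: float = 0.5) -> bool:
    """True when the query is not similar enough to any in-scope sample."""
    max_sim = max((cosine_sim(query_vec, s) for s in intent_sample_vecs), default=0.0)
    return max_sim < threshold

# Toy 3-dimensional "embeddings" (made up for illustration)
in_scope_samples = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(is_ood([0.8, 0.2, 0.0], in_scope_samples))  # close to the samples
print(is_ood([0.0, 0.0, 1.0], in_scope_samples))  # orthogonal to every sample
```

The default=0.0 argument means an empty sample corpus rejects everything instead of raising an error, which is the safer failure mode for a filter.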
Data-Eval Loop: Continuous Improvement
Deploying an intent recognition system is just the beginning. Continuous improvement depends on a data flywheel:
Live Traffic
│
▼
Log Collection → Bad Case Mining → Annotation
│ │
│ ▼
User Behavior Signals Training Data Update
(clicks, retries, escalations) │
│ ▼
└──────────────────→ SLM Incremental Fine-tuning
│
▼
Golden Set Validation
│
▼
Canary → Full Rollout
Three core practices:
① Daily bad case fixes: Mine routing errors from production logs (complaints, retries, escalations to human), annotate them, add to training set, iterate the next day.
② Golden sets for high-frequency intents: For the Top 20 most frequent intents, maintain a fixed golden test set (50–100 examples per intent). Every model update must pass the golden set before deployment — this prevents long-tail fixes from breaking high-traffic scenarios.
③ Capture user behavior signals: Instead of relying purely on manual annotation, infer intent quality from behavior:
- User replies "you misunderstood me" → routing error signal
- User transfers to human immediately after bot response → low quality signal
- User clicks on a recommended result → correct routing signal
- User ends session abruptly → inconclusive, requires context
This loop continuously improves accuracy without proportionally increasing annotation cost.
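The golden-set gate from practice ② can be as simple as computing per-intent accuracy and refusing to deploy if any intent drops below a bar. The following is a sketch under stated assumptions: the function names are hypothetical, the 0.95 bar is an arbitrary example value, and the toy classifiers stand in for a real model.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

GoldenSet = List[Tuple[str, str]]  # (input text, expected intent)

def golden_set_accuracy(classify: Callable[[str], str], golden: GoldenSet) -> Dict[str, float]:
    """Accuracy per expected intent over the golden set."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for text, expected in golden:
        totals[expected] += 1
        if classify(text) == expected:
            hits[expected] += 1
    return {intent: hits[intent] / totals[intent] for intent in totals}

def passes_golden_set(classify: Callable[[str], str], golden: GoldenSet,
                      min_accuracy: float = 0.95) -> bool:
    """Gate a deployment: every intent must meet the bar, not just the average."""
    per_intent = golden_set_accuracy(classify, golden)
    return bool(per_intent) and all(acc >= min_accuracy for acc in per_intent.values())

# Tiny illustrative golden set and two toy classifiers (assumptions)
golden = [("2 + 2", "calculate"), ("3 * 3", "calculate"), ("what is RAG", "qa")]

def good_model(text: str) -> str:
    return "calculate" if any(ch.isdigit() for ch in text) else "qa"

def regressed_model(text: str) -> str:
    return "qa"  # a bad update that routes everything to qa

print(passes_golden_set(good_model, golden))
print(golden_set_accuracy(regressed_model, golden))
```

Checking each intent separately matters here: a regression on one high-traffic intent can hide behind a healthy overall average.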
Intent Routing Design Checklist
What to consider when building a reliable intent routing layer:
Classifier Design
- [ ] No more than 10 intent types (too many reduces accuracy)
- [ ] Each intent description is mutually exclusive with clear boundaries
- [ ] clarify included as the safety fallback for low confidence
- [ ] JSON output required, with manual parsing + fallback logic
Routing Graph Design
- [ ] Each specialist Agent has only the minimum necessary tool set
- [ ] System prompt clearly defines each Agent's specialty and scope
- [ ] clarify node only generates questions, no tools, no guessing
Multi-Turn Dialogue
- [ ] Last 3-5 conversation turns injected into classification prompt
- [ ] Truncate history when too long (4 turns is a good limit)
- [ ] Distinguish "continuing the same topic" from "switching topics"
Stability
- [ ] JSON parse failure has fallback (keyword fallback → clarify fallback)
- [ ] Monitor wrong-routing rate (intent == clarify when intent was actually clear)
- [ ] Under high load, the classify node is a bottleneck — consider caching classification results for similar inputs
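For the last item, one simple form of caching is to normalize the input so near-duplicates share a cache key. The sketch below is one possible approach, not something the demo implements; normalize and CachedClassifier are hypothetical names. It is only safe for inputs whose intent does not depend on conversation history. Context-dependent inputs like "just optimize it" would need the history in the key, or should bypass the cache entirely.

```python
import re
from typing import Callable, Dict

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so near-duplicates share a key."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

class CachedClassifier:
    """Wrap a classify function with a normalized-text cache (history-free inputs only)."""

    def __init__(self, classify: Callable[[str], str]):
        self._classify = classify
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the underlying classifier was called

    def __call__(self, text: str) -> str:
        key = normalize(text)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._classify(text)
        return self._cache[key]

# Toy classifier standing in for the LLM call (assumption for illustration)
cached = CachedClassifier(lambda text: "calculate")
cached("What's 2 to the power of 10?")
cached("whats 2 to the   power of 10")
print(cached.misses)  # both inputs normalize to the same key
```

An unbounded dict is fine for a sketch; a production cache would also need a size limit and an expiry policy.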
Summary
Key takeaways:
- Keyword matching is fragile in production: Users are unpredictable, maintaining rule tables is expensive, natural language coverage is low
- LLM classifiers need robust JSON parsing: Models output inconsistently, so manual parsing with fallbacks beats with_structured_output
- Confidence threshold is your safety exit: Clarifying when uncertain beats wrong routing every time
- Conversation history amplifies intent accuracy: "Just optimize it" with no history → clarify; with code history → code (80%)
- Specialist Agents deliver higher quality: Each Agent focused on one thing: targeted tools, targeted prompts, better results
Next: Memory Management — the four types of Agent memory (sensory/working/episodic/semantic), and how to use LangGraph's checkpointer and store to make an Agent genuinely remember what users have said across conversations.
References
- LangGraph StateGraph Documentation
- LangGraph Conditional Edges
- Full demo code for this series: agent-04-intent-routing/intent_routing_demo.py