DEV Community

WonderLab

Agent Series (5): Intent Recognition and Routing — Making Agents Actually Understand Users

Why Does an Agent Need Intent Recognition?

The intuitive approach is to just hand user input directly to the LLM and let it figure out what to do. This works fine when your Agent has few tools and a single use case.

But when an Agent simultaneously has a search tool, a code tool, a calculator, and a knowledge base, a problem emerges: the same LLM, with different tool sets and system prompts, delivers dramatically different quality on the same task.

A dedicated QA Agent will use the knowledge base, cite sources, and explain concepts deeply. A general-purpose Agent facing the same question might fire off a quick web search and return a snippet. Intent routing gives each request to the Agent best equipped to handle it.

That's what the intent layer does: determine what the user actually wants before dispatching the request to a specialist Agent.


Keyword Matching vs. LLM Classification

Let's benchmark both approaches on real inputs before building anything.

The Problem with Keyword Matching

The classic approach in traditional NLP projects:

_KEYWORD_RULES = {
    "search":    ["search", "latest", "news", "find"],
    "code":      ["code", "write a", "function", "implement", "bug"],
    "calculate": ["calculate", "how much", "equals", "add", "multiply"],
    "qa":        ["what is", "why", "how", "explain", "describe"],
}

def keyword_classify(text: str) -> str:
    scores = {k: 0 for k in _KEYWORD_RULES}
    for intent, keywords in _KEYWORD_RULES.items():
        for kw in keywords:
            if kw in text.lower():
                scores[intent] += 1
    best = max(scores, key=lambda k: scores[k])
    return best if scores[best] > 0 else "unknown"

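Testing against real natural-language inputs (the demo inputs were in Chinese and are translated here, so individual keyword hits and misses may not line up exactly with the English rule table above):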

△ [Clear intent - search]
  Input:    "What's new in the latest version of LangChain?"
  Keyword → unknown         ← no keyword hit
  LLM    → search (80%)

△ [Clear intent - code]
  Input:    "Help me write a bubble sort algorithm"
  Keyword → unknown         ← "help me write" not in the list
  LLM    → code (90%)

  [Clear intent - QA]
  Input:    "What's the principle behind Transformer models?"
  Keyword → qa              ← "what" hit
  LLM    → qa (90%)

  [Colloquial calculation]
  Input:    "2 plus 3 times 4, what does that equal?"
  Keyword → calculate       ← "equals" hit
  LLM    → calculate (100%)

The fatal flaw of keyword matching: It only works when users happen to use the exact keywords in your rule table. "Has LangChain released a new version?" or "implement a sort for me" produce unknown. Maintaining this rule table is extremely costly — every new business scenario and every new way users phrase things requires manual updates.
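
To make the failure modes concrete, here is a small sketch that reruns the rule table above on two English inputs. Both inputs are illustrative assumptions, not cases from the original demo. The second one shows a further weakness: the check is a plain substring test, so "how" also matches inside "show".

```python
# Same rule table as above, repeated so the snippet runs on its own.
_KEYWORD_RULES = {
    "search":    ["search", "latest", "news", "find"],
    "code":      ["code", "write a", "function", "implement", "bug"],
    "calculate": ["calculate", "how much", "equals", "add", "multiply"],
    "qa":        ["what is", "why", "how", "explain", "describe"],
}

def keyword_classify(text: str) -> str:
    lowered = text.lower()
    # Count substring hits per intent (same logic as the original loop).
    scores = {
        intent: sum(kw in lowered for kw in keywords)
        for intent, keywords in _KEYWORD_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# A paraphrase of a search request that uses none of the listed keywords.
paraphrase = keyword_classify("Has LangChain released a new version?")

# A search-like request that is pulled into "qa" because "show" contains "how".
false_hit = keyword_classify("Show me today's headlines")

print(paraphrase)  # unknown
print(false_hit)   # qa
```

Word-boundary matching would remove the second problem, but not the first: paraphrases still need new rules.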

LLM Classifier

Replace keyword rules with an LLM-based classifier, manually parsing JSON output:

def llm_classify(text: str, history: list[str] | None = None) -> IntentResult:
    system_prompt = f"""You are an intent classifier. Classify user input into one of 5 intents:
  search    — search for info, latest news, current events
  code      — write/debug/optimize code, programming questions
  calculate — math calculations, unit conversions
  qa        — knowledge Q&A, concept explanations, how things work
  clarify   — input unclear or ambiguous, cannot determine intent

Rules:
1. If conversation history is provided, instructions like "just fix it", "add to it", 
   "clean it up" should be interpreted based on the historical context
2. Return clarify when confidence is below 0.6 — do not guess

Return only this JSON format, nothing else:
{{"intent": "<intent>", "confidence": <0-1 number>, "reasoning": "<one sentence>"}}"""

    response = llm.invoke([SystemMessage(system_prompt), HumanMessage(text)])
    raw = response.content if isinstance(response.content, str) else str(response.content)

    # Extract JSON from response (handles model adding extra text around JSON)
    try:
        match = re.search(r"\{[^{}]+\}", raw, re.DOTALL)
        data = json.loads(match.group() if match else raw)
        ...
    except Exception:
        # Fallback: look for intent keywords in raw text
        for candidate in ("search", "code", "calculate", "qa"):
            if candidate in raw.lower():
                return IntentResult(intent=candidate, confidence=0.5, reasoning="(JSON parse degraded)")
        return IntentResult(intent="clarify", confidence=0.3, reasoning="(could not parse LLM output)")

Why manual JSON parsing instead of with_structured_output?

In testing, GLM-4-Flash's with_structured_output JSON schema mode sometimes returns only partial fields, causing a Pydantic ValidationError. Manual parsing with regex extraction and fallback logic is more robust — and this pattern is commonly used in production systems.
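
The listing above leaves out IntentResult and the body of the try block. The sketch below is one way those pieces might look, not the article's actual code: the dataclass fields, the parse_intent name, the VALID_INTENTS check, and the confidence clamping are all assumptions. It takes the raw model text as input, so it can be tested without calling an LLM.

```python
import json
import re
from dataclasses import dataclass

VALID_INTENTS = ("search", "code", "calculate", "qa", "clarify")

@dataclass
class IntentResult:
    intent: str
    confidence: float
    reasoning: str = ""

def parse_intent(raw: str) -> IntentResult:
    """Turn raw model text into an IntentResult, degrading step by step."""
    # Grab the first flat {...} object, in case the model wrapped it in prose.
    match = re.search(r"\{[^{}]+\}", raw, re.DOTALL)
    try:
        data = json.loads(match.group() if match else raw)
        intent = str(data["intent"]).strip().lower()
        if intent not in VALID_INTENTS:
            raise ValueError(f"unknown intent: {intent!r}")
        # Clamp confidence into [0, 1] in case the model returns e.g. 90 or -1.
        confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
        return IntentResult(intent, confidence, str(data.get("reasoning", "")))
    except (ValueError, KeyError, TypeError):
        # Fallback 1: look for an intent name anywhere in the raw text.
        lowered = raw.lower()
        for candidate in ("search", "code", "calculate", "qa"):
            if candidate in lowered:
                return IntentResult(candidate, 0.5, "(JSON parse degraded)")
        # Fallback 2: nothing usable, so ask the user instead of guessing.
        return IntentResult("clarify", 0.3, "(could not parse LLM output)")

wrapped = parse_intent('Sure: {"intent": "code", "confidence": 0.9, "reasoning": "wants a function"}')
print(wrapped.intent, wrapped.confidence)  # code 0.9
```

Keep in mind that the keyword fallback is itself a substring match, so it inherits the weaknesses of keyword routing. It is a safety net, not a second classifier.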


LangGraph Intent Router: Dispatching to Specialist Agents

Once intent is identified, LangGraph's conditional edges are the natural mechanism for routing.

Graph Structure

START
  │
  ▼
[classify node]     ← LLM classification, outputs intent + confidence
  │
  ├─ "search"    ──→ [search_agent]      web_search tool
  ├─ "code"      ──→ [code_agent]        run_code tool
  ├─ "calculate" ──→ [calculator_agent]  calculator tool
  ├─ "qa"        ──→ [qa_agent]          knowledge_base tool
  └─ "clarify"   ──→ [clarify_agent]     (no tools, generates a question)
                           │
                          END

Implementation:

class RouterState(TypedDict):
    user_input:           str
    conversation_history: list[str]
    intent:               str
    confidence:           float
    reasoning:            str
    response:             str

def classify_node(state: RouterState) -> dict:
    result = llm_classify(state["user_input"], state.get("conversation_history"))
    print(f"  [classify]  intent={result.intent}  confidence={result.confidence:.0%}")
    return {"intent": result.intent, "confidence": result.confidence, "reasoning": result.reasoning}

def route_by_intent(state: RouterState) -> str:
    return state["intent"]  # used directly as the next node name

def build_intent_router():
    graph = StateGraph(RouterState)

    graph.add_node("classify",  classify_node)
    graph.add_node("search",    _make_specialist_node("search_agent",    [web_search]))
    graph.add_node("code",      _make_specialist_node("code_agent",      [run_code]))
    graph.add_node("calculate", _make_specialist_node("calculator_agent",[calculator]))
    graph.add_node("qa",        _make_specialist_node("qa_agent",        [knowledge_base]))
    graph.add_node("clarify",   clarify_node)

    graph.set_entry_point("classify")
    graph.add_conditional_edges(
        "classify",
        route_by_intent,
        {"search": "search", "code": "code", "calculate": "calculate",
         "qa": "qa", "clarify": "clarify"},
    )
    for node in ["search", "code", "calculate", "qa", "clarify"]:
        graph.add_edge(node, END)

    return graph.compile()

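Each specialist Agent is created by a factory function that binds it to a specific tool set and system prompt. The factory takes system_text as a third argument, so the two-argument calls in build_intent_router above would also need to pass each Agent's prompt: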

def _make_specialist_node(node_name: str, tools_list: list, system_text: str):
    specialist = create_react_agent(model=specialist_llm, tools=tools_list)

    def _node(state: RouterState) -> dict:
        result = specialist.invoke({
            "messages": [
                ("system", system_text),
                ("user",   state["user_input"]),
            ]
        })
        ...
    return _node

Routing Results

[Search]  "What's new in the latest version of LangGraph?"
  [classify]  intent=search  confidence=80%
  [route]  → search_agent
  [answer] LangGraph 0.2 introduced a functional API for more flexible Agent orchestration.

[QA]  "What is RAG and what problems does it solve for LLMs?"
  [classify]  intent=qa  confidence=90%
  [route]  → qa_agent
  [answer] RAG (Retrieval-Augmented Generation) = retrieve relevant docs from a knowledge base
           + inject into Prompt + LLM generates. Core value: reduce hallucinations, inject real-time knowledge.

[Ambiguous]  "Just handle it"
  [classify]  intent=clarify  confidence=50%
  [route]  → clarify_agent
  [answer] Could you tell me what specific thing you'd like help with?

Confidence Thresholds and Clarification

Intent classification isn't binary: "just fix it" is genuinely ambiguous. When the intent is unclear, asking is a better strategy than guessing.

The confidence threshold controls routing behavior:

# When confidence < 0.6 → LLM should return clarify intent
# clarify_node: no tools, just generate a clarifying question

def clarify_node(state: RouterState) -> dict:
    resp = llm.invoke([
        SystemMessage(
            "The user's request is unclear. Ask one short, friendly question "
            "to find out exactly what they need. Don't guess — just ask."
        ),
        HumanMessage(state["user_input"]),
    ])
    return {"response": resp.content.strip()}

Real results from the demo:

[Low confidence - ambiguous]  "Just fix it"
  confidence: 50%  intent: clarify
  Clarification: What would you like me to fix?

[Low confidence - unclear reference]  "Another one"
  confidence: 50%  intent: clarify
  Clarification: What specific information or help are you looking for?

[Low confidence - incomplete]  "How do I do that thing"
  confidence: 50%  intent: clarify
  Clarification: Are you asking about specific steps? Could you describe what you need help with?

Versus high-confidence clear inputs:

[High confidence]  "What's 2 to the power of 10?"
  confidence: 100%  intent: calculate  → calculated immediately

[High confidence]  "What's new in Python's latest version?"
  confidence: 80%   intent: search     → searched immediately
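
The prompt asks the model to return clarify below 0.6, but that only works if the model follows the instruction. A minimal sketch of enforcing the same threshold in code is shown below. The function name, the plain-dict state, and the idea of swapping it in for route_by_intent are assumptions for illustration. It also sends unknown intent labels to clarify, so an unexpected label never reaches the graph.

```python
KNOWN_ROUTES = {"search", "code", "calculate", "qa", "clarify"}

def route_with_threshold(state: dict, min_confidence: float = 0.6) -> str:
    """Return the next node name, falling back to 'clarify' when unsure."""
    intent = state.get("intent", "")
    confidence = float(state.get("confidence", 0.0))
    # Unknown label or low confidence: ask the user instead of guessing.
    if intent not in KNOWN_ROUTES or confidence < min_confidence:
        return "clarify"
    return intent

print(route_with_threshold({"intent": "calculate", "confidence": 1.0}))  # calculate
print(route_with_threshold({"intent": "search", "confidence": 0.5}))     # clarify
```

One trade-off to be aware of: results produced by the keyword fallback carry a confidence of 0.5, so with this router they would trigger a clarifying question rather than being dispatched to a specialist.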

Multi-Turn Intent Tracking: Context Makes Ambiguous Instructions Clear

This is the most valuable finding in the demo.

Scenario A: A code conversation has been going on, then an ambiguous instruction arrives:

# Existing conversation history:
User:  "Write me a Python function to calculate the average of a list"
Agent: "def average(lst): return sum(lst) / len(lst) if lst else 0.0"
User:  "What happens if there are non-numeric values in the list?"
Agent: "Use try/except TypeError, or filter non-numeric elements beforehand"

Classification of the phrase "just optimize it" (three characters in Chinese):

Input: "just optimize it"
  ✗ No history → clarify (50%)   Input unclear, cannot determine intent.
  ✓ With history → code   (80%)  User requesting code optimization based on conversation context.

"Change comments to English":

Input: "change comments to English"
  ✗ No history → search (50%)   (JSON parse failure, degraded)
  ✓ With history → code (100%)  User requesting code modification, clearly a coding task.

Scenario B: Calculation continuation:

# History: User asked "what is 2 to the power of 10?" → Agent answered "1024"

Input: "now multiply that by 3"
  ✗ No history → calculate (90%)   (keywords are clear enough)
  ✓ With history → calculate (100%) User continuing the previous calculation.

Implementation: inject the last 4 conversation turns into the classification prompt:

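history_section = ""
if history:  # history is None (or empty) on the first turn, so guard before slicing
    history_section = "\n\nRecent conversation history:\n" + "\n".join(f"  {h}" for h in history[-4:])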

system_prompt = f"""You are an intent classifier...
Rules:
1. If conversation history is present, instructions like "fix it", "add to it", 
   "another one" should be interpreted based on that context
...{history_section}"""
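
If the history injection is needed in more than one place, it can be pulled into a small helper. The sketch below assumes history is a list of already formatted strings such as "User: ..." and "Agent: ..."; the helper name and the max_turns parameter are hypothetical.

```python
from typing import List, Optional

def build_history_section(history: Optional[List[str]], max_turns: int = 4) -> str:
    """Format the most recent history entries for the classification prompt."""
    if not history or max_turns <= 0:
        return ""  # first turn, or history disabled: add nothing to the prompt
    recent = history[-max_turns:]
    lines = "\n".join(f"  {entry}" for entry in recent)
    return "\n\nRecent conversation history:\n" + lines

history = [
    "User: Write me a Python function to calculate the average of a list",
    "Agent: def average(lst): return sum(lst) / len(lst) if lst else 0.0",
]
print(build_history_section(history))
```

Returning an empty string for missing history keeps the prompt template unchanged on the first turn.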

Interesting Findings from Real Execution

Running all 5 demos surfaced several behaviors worth documenting:

Finding 1: JSON parse failure causes wrong routing

In Demo 5 Turn 2, "Help me write the simplest Hello World Agent using it", an unambiguous code request, was routed to clarify_agent because GLM-4-Flash failed to output valid JSON for this input. The fallback found none of the intent names (search, code, calculate, qa) in the raw output either, so it returned clarify (30%).

Turn 2: "Help me write the simplest Hello World Agent using it"
  [classify]  intent=clarify  confidence=30%
              reasoning: (could not parse LLM output)
  [route]  → clarify_agent
  [answer] Are you asking me to write a Hello World example in some language or framework?

Interestingly, the clarification question was logically correct — the model understood the request, it just failed to produce valid JSON.

Finding 2: Wrong routing, but specialist Agent salvaged it

"Write me a Python function to calculate the average of a list" was mis-classified as calculate (50%) due to JSON parse degradation and sent to calculator_agent. The calculator_agent only has a calculator tool — but it answered with the actual function anyway, using the LLM directly:

[route]  → calculator_agent (should have been code_agent)
[answer] def calculate_average(numbers):
             if not numbers:
                 return 0
             return sum(numbers) / len(numbers)

Lesson: even with wrong routing, a specialist Agent with a good system prompt has some fault tolerance. But don't rely on this — fix reliability at the classification layer.

Finding 3: Model language drift

Some reasoning outputs were in English even when the user input was in Chinese (for example, "The user is asking for a mathematical calculation."). GLM-4-Flash sometimes writes its reasoning in English regardless of the input language. This doesn't affect functionality, but if you're surfacing reasoning to users, you'll want to normalize the output language.

Finding 4: Conversation history is the biggest quality multiplier

The most impactful improvement across the entire demo was injecting conversation history into the classification prompt. "Just optimize it" — with zero history → clarify every time; with code conversation history → code (80%). In production systems, this means: conversation history doesn't just help the LLM answer better, it directly improves intent routing accuracy.


Production Architecture: Three-Layer Classification + OOD Rejection + Data Loop

The demo above calls a large model directly for every classification. That's fine for learning and prototyping, but high-traffic production systems rarely work this way because of latency and cost.

Real-world industrial intent recognition uses a three-layer funnel:

User Input
    │
    ▼
┌─────────────────────────────┐
│  Layer 1: Rule Routing      │  < 1ms
│  Keywords + Regex + FSM     │  ← handles ~5% of explicit commands
└─────────────┬───────────────┘
              │ no match
              ▼
┌─────────────────────────────┐
│  Layer 2: Fine-tuned SLM    │  10~50ms
│  5B/7B model (Qwen/GLM/     │  ← handles ~90% of routine intents
│  Llama fine-tuned)          │
└─────────────┬───────────────┘
              │ low confidence / long-tail
              ▼
┌─────────────────────────────┐
│  Layer 3: Large Model       │  100~500ms
│  GPT-4o / Claude / Qwen-72B │  ← handles ~5% of complex edge cases
└─────────────┬───────────────┘
              │
              ▼
        OOD Rejection Layer
   (filter out-of-scope requests)

Layer 1: Rule Routing (<1ms)

The first layer handles only semantically certain, fixed-expression commands — no semantic understanding, just string matching or FSM transitions:

RULE_ROUTES = {
    r"^transfer to human$|^speak to agent$|^I want to complain$": "human_handoff",
    r"^open .+ app$|^launch .+": "app_launch",
    r"^(quit|back|cancel|never mind)$": "cancel",
    r"^(hi|hello|hey|are you there)$": "greeting",
}

def rule_route(text: str) -> str | None:
    for pattern, intent in RULE_ROUTES.items():
        if re.fullmatch(pattern, text.strip(), re.IGNORECASE):
            return intent
    return None  # no match — pass to next layer

Advantages: zero LLM calls, <1ms latency, fully predictable, logic is auditable.

Use cases: "transfer to human" in customer service, shortcut commands in smart assistants, toggle operations.

Layer 2: Fine-tuned Small Language Model (10~50ms)

Requests that Layer 1 does not match go to a fine-tuned small language model (SLM), typically in the 5B to 9B parameter range:

  • Model options: Qwen2.5-7B-Instruct, GLM-4-9B, Llama-3.1-8B fine-tuned
  • Training data: Annotated production logs + data augmentation; a few thousand to tens of thousands of samples to reach production-grade accuracy
  • Deployment cost: A single A10 (24GB VRAM) can serve a 7B model, reaching 100+ QPS in batched inference
  • Accuracy: Covers 90%+ of routine intents at 92–97% accuracy

# Pseudocode: call a locally deployed fine-tuned SLM
def slm_classify(text: str, history: list[str]) -> IntentResult:
    response = slm_client.chat(
        messages=build_classify_prompt(text, history),
        temperature=0.1,  # low temperature for classification
        max_tokens=64,    # only need intent + confidence
    )
    return parse_intent(response)

Key metric: average latency 20–50ms, 80%+ cost reduction vs. a large model.

Layer 3: Large Model Fallback (100~500ms)

When SLM confidence is low, inputs are out-of-distribution, or the request involves complex multi-intent scenarios, escalate to a large model:

  • Trigger condition: SLM confidence < 0.6, or input outside training distribution
  • Traffic share: typically only 5–10% of requests escalate
  • Cost control: because escalation rate is low, total cost remains manageable

Three-layer comparison:

Layer            Latency      Cost      Traffic Share   Use Case
Rule Routing     <1ms         Minimal   ~5%             Fixed commands, quick actions
Fine-tuned SLM   10~50ms      Low       ~90%            Routine intent classification
Large Model      100~500ms    High      ~5%             Long-tail, complex, edge cases
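
Putting the three layers together, the control flow can be sketched as below. This is a minimal sketch under stated assumptions: the SLM and the large model are passed in as plain callables returning (intent, confidence), the stub classifiers stand in for real model calls, and the classify_funnel name and the trimmed rule table are hypothetical. The 0.6 escalation threshold is the one quoted for Layer 3 above.

```python
import re
from typing import Callable, Optional, Tuple

# A classifier takes the user text and returns (intent, confidence).
Classifier = Callable[[str], Tuple[str, float]]

# Trimmed-down copy of the Layer 1 rule table.
RULE_ROUTES = {
    r"quit|back|cancel|never mind": "cancel",
    r"hi|hello|hey|are you there": "greeting",
}

def rule_route(text: str) -> Optional[str]:
    for pattern, intent in RULE_ROUTES.items():
        if re.fullmatch(pattern, text.strip(), re.IGNORECASE):
            return intent
    return None  # no match, pass to the next layer

def classify_funnel(
    text: str, slm: Classifier, llm: Classifier, escalate_below: float = 0.6
) -> Tuple[str, float, str]:
    """Return (intent, confidence, layer) from the cheapest layer that is sure."""
    intent = rule_route(text)
    if intent is not None:
        return intent, 1.0, "rules"       # Layer 1: exact command
    intent, confidence = slm(text)
    if confidence >= escalate_below:
        return intent, confidence, "slm"  # Layer 2: routine intent
    intent, confidence = llm(text)
    return intent, confidence, "llm"      # Layer 3: escalated

# Stub classifiers standing in for real model calls.
def stub_slm(text: str) -> Tuple[str, float]:
    return ("search", 0.9) if "latest" in text.lower() else ("clarify", 0.3)

def stub_llm(text: str) -> Tuple[str, float]:
    return ("code", 0.8)

print(classify_funnel("cancel", stub_slm, stub_llm))
print(classify_funnel("What's the latest LangGraph release?", stub_slm, stub_llm))
print(classify_funnel("Refactor this and also benchmark it", stub_slm, stub_llm))
```

Returning the layer name alongside the intent makes it easy to log how much traffic each layer actually handles and compare it with the table above.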

OOD Rejection: Filtering Out-of-Scope Requests

OOD (Out-of-Distribution) rejection is often overlooked but critically important — it identifies and refuses requests that fall outside the system's service scope.

Typical scenario: a user says "write me a poem" to a shopping assistant. It's valid natural language, but not in scope. Without OOD rejection, this gets classified with low confidence and routed incorrectly.

# OOD rejection approaches

# Approach A: embedding similarity threshold
def ood_reject_by_embedding(text: str, threshold: float = 0.5) -> bool:
    emb = embed_model.encode(text)
    # compare against the intent sample corpus
    max_sim = max(cosine_sim(emb, sample) for sample in intent_samples)
    return max_sim < threshold  # True = OOD, should reject

# Approach B: confidence fallback (reject if all layers return low confidence)
def should_reject(intent: str, confidence: float) -> bool:
    return intent == "clarify" and confidence < 0.4
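
Approach A relies on embed_model, cosine_sim, and intent_samples, none of which are defined in the snippet. The sketch below fills them in with a toy stand-in so the idea can be run end to end: the "embedding" is just a word-count vector, and the sample corpus is invented for a hypothetical shopping assistant. A real system would use a sentence-embedding model instead, and the 0.5 threshold would need tuning on real data.

```python
import math
from collections import Counter
from typing import Iterable

def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_sim(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_ood(text: str, samples: Iterable[str], threshold: float = 0.5) -> bool:
    """True when the text is not close to any in-scope sample."""
    emb = toy_embed(text)
    max_sim = max((cosine_sim(emb, toy_embed(s)) for s in samples), default=0.0)
    return max_sim < threshold

# Invented in-scope samples for a shopping assistant.
intent_samples = ["track my order", "refund my order", "where is my package"]

print(is_ood("write me a poem", intent_samples))   # True: no overlap with any sample
print(is_ood("track my package", intent_samples))  # False: close to "track my order"
```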

Rejection responses should be friendly and guiding, not error messages:

"Sorry, I can currently only help with [XX-related] questions.
Your request is outside my service scope.
You can try: [relevant suggestion] or [transfer to human]"

Data-Eval Loop: Continuous Improvement

Deploying an intent recognition system is just the beginning. Continuous improvement depends on a data flywheel:

Live Traffic
    │
    ▼
Log Collection → Bad Case Mining → Annotation
    │                                   │
    │                                   ▼
User Behavior Signals            Training Data Update
(clicks, retries, escalations)         │
    │                                   ▼
    └──────────────────→  SLM Incremental Fine-tuning
                                        │
                                        ▼
                               Golden Set Validation
                                        │
                                        ▼
                             Canary → Full Rollout

Three core practices:

① Daily bad case fixes: Mine routing errors from production logs (complaints, retries, escalations to human), annotate them, add to training set, iterate the next day.

② Golden sets for high-frequency intents: For the Top 20 most frequent intents, maintain a fixed golden test set (50–100 examples per intent). Every model update must pass the golden set before deployment — this prevents long-tail fixes from breaking high-traffic scenarios.

③ Capture user behavior signals: Instead of relying purely on manual annotation, infer intent quality from behavior:

  • User replies "you misunderstood me" → routing error signal
  • User transfers to human immediately after bot response → low quality signal
  • User clicks on a recommended result → correct routing signal
  • User ends session abruptly → inconclusive, requires context

This loop continuously improves accuracy without proportionally increasing annotation cost.
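
The golden-set gate from practice ② can be expressed as a small check that runs before every deployment. The sketch below is an assumption-laden illustration: the function names, the (text, expected_intent) pair format, and the 0.95 per-intent bar are all hypothetical, and the lambda classifiers stand in for a real model.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def golden_set_accuracy(
    classify: Callable[[str], str], golden: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Per-intent accuracy of `classify` on (text, expected_intent) pairs."""
    total: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for text, expected in golden:
        total[expected] += 1
        if classify(text) == expected:
            correct[expected] += 1
    return {intent: correct[intent] / total[intent] for intent in total}

def passes_golden_set(
    classify: Callable[[str], str],
    golden: List[Tuple[str, str]],
    min_accuracy: float = 0.95,
) -> bool:
    """Gate a release: every intent in the golden set must meet the bar."""
    per_intent = golden_set_accuracy(classify, golden)
    return bool(per_intent) and all(acc >= min_accuracy for acc in per_intent.values())

golden = [("what is RAG", "qa"), ("explain transformers", "qa"), ("2 plus 2", "calculate")]

# A candidate that routes digits to "calculate", and a regressed one that never does.
candidate = lambda text: "calculate" if any(ch.isdigit() for ch in text) else "qa"
regressed = lambda text: "qa"

print(passes_golden_set(candidate, golden))  # True
print(passes_golden_set(regressed, golden))  # False
```

Checking accuracy per intent rather than overall is deliberate: an overall average can hide a regression on a single high-traffic intent.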


Intent Routing Design Checklist

What to consider when building a reliable intent routing layer:

Classifier Design

  • [ ] No more than 10 intent types (too many reduces accuracy)
  • [ ] Each intent description is mutually exclusive with clear boundaries
  • [ ] clarify included as the safety fallback for low confidence
  • [ ] JSON output required, with manual parsing + fallback logic

Routing Graph Design

  • [ ] Each specialist Agent has only the minimum necessary tool set
  • [ ] System prompt clearly defines each Agent's specialty and scope
  • [ ] clarify node only generates questions, no tools, no guessing

Multi-Turn Dialogue

  • [ ] Last 3-5 conversation turns injected into classification prompt
  • [ ] Truncate history when too long (4 turns is a good limit)
  • [ ] Distinguish "continuing the same topic" from "switching topics"

Stability

  • [ ] JSON parse failure has fallback (keyword fallback → clarify fallback)
  • [ ] Monitor wrong-routing rate (intent == clarify when intent was actually clear)
  • [ ] Under high load, the classify node is a bottleneck — consider caching classification results for similar inputs
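
For the caching item, a minimal sketch is shown below. It only catches inputs that are identical after lowercasing and whitespace normalization, which is a much weaker notion of "similar" than embedding-based matching. The wrapper name and cache size are assumptions, and it should only be used for history-free classification, since the same text can mean different things in different conversations.

```python
from functools import lru_cache
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(text.lower().split())

def make_cached_classifier(classify: Callable[[str], str], maxsize: int = 1024):
    """Wrap a text -> intent classifier with an in-memory LRU cache."""
    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> str:
        return classify(normalized)

    def cached_classify(text: str) -> str:
        return _cached(normalize(text))

    # Expose cache statistics for monitoring hit rate.
    cached_classify.cache_info = _cached.cache_info
    return cached_classify

calls = []

def slow_classify(text: str) -> str:
    calls.append(text)  # stands in for an expensive model call
    return "qa"

cached = make_cached_classifier(slow_classify)
cached("What is RAG?")
cached("  what is   RAG? ")
print(len(calls))  # 1
```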

Summary

Key takeaways:

  1. Keyword matching is fragile in production: Users are unpredictable, maintaining rule tables is expensive, natural language coverage is low
  2. LLM classifiers need robust JSON parsing: Models output inconsistently; in this demo, manual parsing with fallbacks was more robust than GLM-4-Flash's with_structured_output
  3. Confidence threshold is your safety exit: Clarifying when uncertain beats wrong routing every time
  4. Conversation history amplifies intent accuracy: "Just optimize it" with no history → clarify; with code history → code (80%)
  5. Specialist Agents deliver higher quality: Each Agent focused on one thing — targeted tools, targeted prompts, better results

Next: Memory Management — the four types of Agent memory (sensory/working/episodic/semantic), and how to use LangGraph's checkpointer and store to make an Agent genuinely remember what users have said across conversations.

