DEV Community

RV Anusha


How Hindsight Exposed Our Keyword-Matching Chatbot Limits

We shipped a chatbot that answered "what is the law of conservation of energy?" with a detailed explanation of Newton's Laws of Motion (F=ma, inertia, action-reaction), and neither the student nor the system had any idea the answer was completely wrong.

The chatbot looked right. The response was formatted, confident, and arrived after a thoughtful pause. The student read it, possibly took notes, and moved on. Without Hindsight tracing inputs against outputs over time, that wrong answer would have been invisible to us forever: just one of thousands of interactions where the system appeared to work.

What the AI Tutor Is
The AI Tutor in src/pages/AITutor.tsx is a full chat interface: sidebar with session history, message bubbles, a gradient AI avatar, a three-dot typing indicator, quick-action buttons after each response ("Explain simpler", "Give examples", "Generate quiz"). It looks and feels like a modern LLM chat UI. Students typing into it would have no reason to think the responses were not generated.

The interface is built around a Message[] array and a ChatSession[] history. The layout is split-panel history on the left, active chat on the right. The input handles Enter key submission. Scroll-to-bottom is wired to a useEffect on message updates. Every detail of the interaction pattern is right.
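The shapes behind that UI are roughly the following. These are reconstructed from the description above; the actual field names in AITutor.tsx may differ.

```ts
// Assumed shapes for the chat state (hypothetical reconstruction,
// not copied from AITutor.tsx).
type Message = {
  id: string;
  role: "user" | "assistant";
  content: string;
  timestamp: number;
};

type ChatSession = {
  id: string;
  title: string;       // shown in the sidebar history
  messages: Message[];
};
```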

The Routing Function
Behind all of that UI is a single function:

```ts
const getAIResponse = (userInput: string): string => {
  const lower = userInput.toLowerCase();
  if (lower.includes("simpler") || lower.includes("simple") || lower.includes("easier"))
    return smartResponses.explain_simpler;
  if (lower.includes("example") || lower.includes("show me"))
    return smartResponses.examples;
  if (lower.includes("photosynthesis") || lower.includes("plant"))
    return smartResponses.photosynthesis;
  if (lower.includes("newton") || lower.includes("motion") ||
      lower.includes("force") || lower.includes("law"))
    return smartResponses.newton;
  if (lower.includes("recursion") || lower.includes("recursive") ||
      lower.includes("function call"))
    return smartResponses.recursion;
  return smartResponses.default;
};
```

Five conditions. One default. The function runs synchronously in under a millisecond, then waits in a setTimeout for 1–2.2 seconds before the response appears.
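The artificial delay looks roughly like this. The wrapper name is mine; only the timing (1000 + Math.random() * 1200 milliseconds) comes from the component.

```ts
// Sketch of the fake-latency pattern: the answer is computed
// instantly, then surfaced after an artificial 1–2.2 s pause.
function withFakeThinking(answer: string): Promise<string> {
  const delayMs = 1000 + Math.random() * 1200;
  return new Promise((resolve) =>
    setTimeout(() => resolve(answer), delayMs)
  );
}
```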

The responses themselves are well-written. The photosynthesis response explains the light-dependent reactions and the Calvin cycle. The Newton response includes a formatted table of F=ma variables and the action-reaction principle. The recursion response has a factorial code example with a base case callout. If the routing were accurate, these would be genuinely useful explanations.
The routing is not accurate.

The Failure Modes Hindsight Would Surface

Keyword matching on lower.includes("law") is the most obvious problem, but it is not the only one. Here is a catalog of failure modes that are directly visible in the routing code:

"law" matches everything.

  • "what is the law of large numbers?" → Newton's Laws (F=ma)
  • "explain Kirchhoff's voltage law" → Newton's Laws (F=ma)
  • "what is Ohm's law?" → Newton's Laws (F=ma)
  • "explain Boyle's law" → Newton's Laws (F=ma)
  • "is there a law against plagiarism?" → Newton's Laws (F=ma)

Every question containing "law" routes to the same pre-written Newton response. The student asking about Kirchhoff's voltage law gets a response that mentions "a hockey puck slides on ice until friction slows it down." The system presents this with full confidence and a gradient Sparkles icon.

"force" is overloaded.

  • "explain the force of gravity" → Newton's Laws ✓
  • "what forces act on a charged particle?" → Newton's Laws (wrong — should be electromagnetism)
  • "brute force vs dynamic programming" → Newton's Laws (wrong — unrelated domain)

"plant" catches unintended queries.

  • "explain how a plant cell differs from an animal cell" → Photosynthesis
  • "what is a plant-based diet?" → Photosynthesis (wrong)
  • "explain the power plant in a car engine" → Photosynthesis (wrong)

The default response hides all failures.

```ts
default: `Great question! Let me walk you through this step by step.

## Key Concepts

**The Foundation:** Every complex topic builds on simpler ideas...`,
```

Any question that doesn't match a keyword ("explain DNA replication", "what is osmosis?", "how does TCP/IP work?") gets a response that starts with "Great question!" and proceeds to say nothing specific about anything. The student has no way to know the system doesn't know their topic. The response is confidently generic.
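All of these misroutes are reproducible directly against the routing logic. Here is a condensed copy with the response bodies replaced by their keys, just enough to demonstrate the failures cataloged above:

```ts
// Condensed copy of the routing conditions; returns the response
// key instead of the full pre-written text.
const routeKey = (userInput: string): string => {
  const lower = userInput.toLowerCase();
  if (lower.includes("simpler") || lower.includes("simple") || lower.includes("easier"))
    return "explain_simpler";
  if (lower.includes("example") || lower.includes("show me"))
    return "examples";
  if (lower.includes("photosynthesis") || lower.includes("plant"))
    return "photosynthesis";
  if (lower.includes("newton") || lower.includes("motion") ||
      lower.includes("force") || lower.includes("law"))
    return "newton";
  if (lower.includes("recursion") || lower.includes("recursive") ||
      lower.includes("function call"))
    return "recursion";
  return "default";
};
```

Running the queries from the catalog through this function shows "what is Ohm's law?" and "brute force vs dynamic programming" both landing on the Newton template, and "explain DNA replication" falling through to the generic default.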

Why This Is an Observability Problem

The reason these failure modes survived is that we had no way to observe them at scale. Testing the chatbot manually, a developer would type "explain photosynthesis" and get the photosynthesis response, type "explain recursion" and get the recursion response, and conclude the system works. The failure modes only appear when you look at the distribution of actual student queries and trace which response each one received.

This is the problem that Hindsight is designed for. The Hindsight agent memory model is built on the premise that you need to observe behavior across the full input distribution, not just the happy path. In practice, this means:

  • Log every input query and the response key it routed to
  • Flag inputs that hit the default fallback — these are queries the system cannot answer
  • Flag inputs that matched a keyword but where the matched topic and the query topic are semantically distant
  • Aggregate: what percentage of queries are getting meaningfully answered versus being routed to the wrong template or falling through to the generic default?

Without that observability layer, we are blind to the shape of our own failures. We know the system works for photosynthesis, Newton, and recursion because those are the cases we built. We have no idea how it performs on the actual long tail of student questions because we never looked.
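The first two bullets of that observability layer can be sketched in a few lines. The `RouteTrace` type and `traceRoute` wrapper are hypothetical names of mine, not part of Hindsight's actual API; this only illustrates the shape of the data you would want to collect.

```ts
// Minimal tracing sketch: record every query, the response key it
// routed to, and whether it fell through to the generic default.
type RouteTrace = {
  query: string;
  responseKey: string;
  isFallback: boolean;
  timestamp: number;
};

const traces: RouteTrace[] = [];

// Wraps any router so each query/route pair is logged before the
// response is returned.
function traceRoute(
  route: (input: string) => string,
  keyFor: (input: string) => string
) {
  return (input: string): string => {
    const key = keyFor(input);
    traces.push({
      query: input,
      responseKey: key,
      isFallback: key === "default",
      timestamp: Date.now(),
    });
    return route(input);
  };
}

// Aggregate view: what fraction of traffic hit the fallback?
function fallbackRate(): number {
  if (traces.length === 0) return 0;
  return traces.filter((t) => t.isFallback).length / traces.length;
}
```

Even this crude version would have surfaced the problem: one look at `fallbackRate()` and the distribution of `responseKey` values over real student traffic tells you how much of the system's apparent competence is an illusion.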

The Architecture That Would Actually Work

The right replacement for getAIResponse is not a more sophisticated keyword matcher. It is a real LLM call with the student's question as input and a system prompt that defines the tutor's scope and style:

```ts
const getAIResponse = async (userInput: string): Promise<string> => {
  const response = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // In production, route this through a backend proxy so the
      // API key never ships to the browser.
      "x-api-key": ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1000,
      system: `You are a study tutor helping a student understand academic concepts.
               Explain clearly, use concrete examples, and adapt to the student's level.
               If you don't know something, say so directly rather than giving a generic answer.`,
      messages: [{ role: "user", content: userInput }],
    }),
  });
  const data = await response.json();
  return data.content[0].text;
};
```

This handles arbitrary student questions correctly, admits uncertainty when appropriate, and does not silently route "Kirchhoff's voltage law" to a hockey puck analogy. The fake setTimeout delay becomes unnecessary because there is real latency from an actual API call.
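One more piece matters: when the API call fails, the honest move is to say so rather than fall back to a confident template. A hypothetical wrapper (the name `safeTutorResponse` is mine) might look like:

```ts
// Hypothetical error wrapper: surface failures honestly instead of
// falling back to a confident generic template.
async function safeTutorResponse(
  ask: (question: string) => Promise<string>,
  question: string
): Promise<string> {
  try {
    return await ask(question);
  } catch {
    return (
      "I wasn't able to generate an answer for that question right now. " +
      "Please try rephrasing it or asking again in a moment."
    );
  }
}
```

This is the inverse of the keyword matcher's behavior: failure becomes visible to the student instead of being dressed up as an answer.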

The chat history sidebar currently populated with hardcoded defaultHistory sessions becomes meaningful: actual conversation history stored in context, summarized by the LLM for the sidebar title, persisted across sessions via something like Hindsight's memory layer so the tutor can reference what a student has studied before.

Lessons

Keyword matching does not degrade gracefully. A regex that misses a query returns nothing obvious. A keyword matcher that mismatches a query returns a confident wrong answer. The failure mode is worse than no answer at all, because it looks like a correct answer.

The default fallback hides your coverage gaps. Our default response, "Great question! Let me walk you through this step by step", masked the fact that the system had no answer for the majority of possible student questions. A fallback that admits it doesn't know is more honest and more useful.

Manual testing only covers the cases you thought of. The chatbot passed every test we ran because we only tested the six topics we built responses for. Hindsight style distribution tracing would have shown us immediately that most real queries were hitting the default or the wrong template.

Fake thinking is worse than slow thinking. The `Math.random() * 1200` delay does not make wrong answers less wrong. It just makes them arrive with more apparent confidence. Real latency from a real model is a better signal to the user than artificial delay from a lookup table.
