Under the Hood of Conversational AI Search: A Deep Dive into the NLWeb Prototype

You've seen it everywhere: the little chat box that promises to find exactly what you need. From e-commerce sites to documentation portals, conversational AI is changing how we interact with data. But have you ever stopped to wonder what’s actually happening when you type "Find vegetarian recipes for Diwali" and get a perfect list of results back?

It's not magic; it's a sophisticated pipeline of LLMs, vector databases, and smart engineering. Today, we're going to pull back the curtain on exactly how a system like this works. We'll be dissecting the NLWeb search prototype, an open-source project from Microsoft Research. It's a fantastic real-world example of how to combine technologies like OpenAI's LLMs, the Qdrant vector database, and Schema.org for structured data to build a powerful, context-aware search experience.

This deep dive is based on an excellent technical breakdown originally published on iunera.com's blog. We're going to expand on it, add some developer-focused context, and explore what it takes to turn these concepts into reality.

So, grab your favorite beverage, and let's trace the life of a query! 🚀

Setting the Scene: Our Example Query

To understand the flow, we need a concrete example. Imagine we're building a search for a recipe website. The user has already asked a couple of questions, and now they're refining their search. Here’s what the JSON payload sent to our NLWeb backend looks like:

{
  "query": "Find vegetarian recipes for Diwali",
  "site": "example.com",
  "prev": [
    "What are some Indian festival recipes?",
    "Any vegetarian options?"
  ],
  "mode": "list",
  "streaming": true
}

Let's quickly break this down:

  • query: The user's latest message.
  • site: The target website we're searching on.
  • prev: The secret sauce for conversational context! This is the history of the conversation.
  • mode: How we want the results. list gives us structured data, while summarize would trigger an LLM summarization step.
  • streaming: A boolean to stream the response back, great for UX.

The goal is to take this conversational input and turn it into a precise, structured Recipe JSON object. Let's see how NLWeb does it.
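If you want to poke at an endpoint like this yourself, here's a minimal client-side sketch using Python's requests library. The /ask path comes straight from the walkthrough below; the host and port are placeholders, not anything NLWeb prescribes:

import requests

payload = {
    "query": "Find vegetarian recipes for Diwali",
    "site": "example.com",
    "prev": [
        "What are some Indian festival recipes?",
        "Any vegetarian options?",
    ],
    "mode": "list",
    "streaming": True,
}

# Placeholder host/port; point this at wherever the NLWeb server is running.
response = requests.post("http://localhost:8000/ask", json=payload, stream=True)

# With streaming enabled, results arrive incrementally rather than in one body.
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))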

The Core Pipeline: A Simplified 8-Step Journey

First, we'll look at the standard, sequential flow of a query. Think of this as the foundational logic that makes everything work.

Step 1: Query Received & Context Loaded

The journey begins with an HTTP POST request to the /ask endpoint, handled by a script called ask.py. The first thing the server does is parse the incoming JSON. It also loads a configuration file, site_type.xml, which defines the context for example.com. In this file, we'd have specified that example.com is a recipe_website and that its content maps to the Schema.org Recipe type. This is a crucial first step: it tells the system what kind of information to expect and how to structure the output.
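The write-up doesn't reproduce the actual contents of site_type.xml, so the layout below is hypothetical; the point is simply that the handler resolves the incoming site value to a Schema.org type before doing anything else:

import xml.etree.ElementTree as ET

# Hypothetical site_type.xml layout, for illustration only:
# <sites>
#   <site domain="example.com" type="recipe_website" schema="Recipe"/>
# </sites>

def load_site_context(config_path: str, site: str) -> dict:
    """Map a site domain to its declared content type and Schema.org type."""
    root = ET.parse(config_path).getroot()
    for entry in root.findall("site"):
        if entry.get("domain") == site:
            return {"type": entry.get("type"), "schema": entry.get("schema")}
    raise ValueError(f"No configuration found for site: {site}")

# load_site_context("site_type.xml", "example.com")
# -> {"type": "recipe_website", "schema": "Recipe"}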

Step 2: The Relevancy Check

Before we spend compute cycles on a complex search, we need to ask a simple question: is the user's query even relevant to a recipe website? There's no point in trying to find recipes for "latest JavaScript frameworks."

To answer this, analyze_query.py makes a call to an OpenAI LLM. It essentially asks the LLM, "Does the query 'Find vegetarian recipes for Diwali' relate to the Recipe schema?" The LLM returns a simple JSON object, like {"is_relevant": true}. If it were false, the process would stop here and return an error. This is a smart, efficient gatekeeping step.
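The exact prompt lives in analyze_query.py and isn't quoted in the source, but conceptually the gatekeeping call boils down to something like this sketch (the prompt wording and model name are my assumptions):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_relevant(query: str, schema_type: str) -> bool:
    """Ask the LLM whether the query plausibly targets the given Schema.org type."""
    prompt = (
        f"Does the query '{query}' relate to content of Schema.org type "
        f"'{schema_type}'? Answer with JSON: {{\"is_relevant\": true or false}}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the prototype's choice may differ
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)["is_relevant"]

# is_relevant("Find vegetarian recipes for Diwali", "Recipe")  -> True
# is_relevant("latest JavaScript frameworks", "Recipe")        -> False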

Step 3: Remembering the Past (Memory Detection)

Great conversational AI feels like it has a memory. NLWeb implements this with memory.py. This component analyzes the query for instructions that should be remembered across sessions. For example, if a user said, "From now on, only show me gluten-free options," memory.py would use an LLM to extract this constraint and store it. In our current query, this step might not find new long-term memories, but it would load any existing ones that might be relevant.
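memory.py's prompt isn't shown either, but the core idea can be sketched in a few lines (again, the wording and model are assumptions):

import json
from openai import OpenAI

client = OpenAI()

def extract_memories(query: str) -> list[str]:
    """Return any long-lived constraints the user wants remembered, if present."""
    prompt = (
        "If the following message contains an instruction that should persist "
        "across future searches (for example, 'From now on, only show me "
        "gluten-free options'), return it as JSON: {\"memories\": [\"...\"]}. "
        f"Otherwise return {{\"memories\": []}}.\n\nMessage: {query}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)["memories"]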

Step 4: Making the Query Whole (Decontextualization)

This is where the magic of handling conversation history (prev) happens. The query "Find vegetarian recipes for Diwali" is pretty clear, but the previous one, "Any vegetarian options?", is meaningless on its own.

The prompt_runner.py script takes the current query and the prev array and sends them to an LLM. The prompt instructs the model to rewrite the query into a single standalone (decontextualized) query that incorporates all of the previous context.

  • Input Query: "Find vegetarian recipes for Diwali"
  • Context: ["What are some Indian festival recipes?", "Any vegetarian options?"]
  • LLM-Generated Decontextualized Query: "Find vegetarian Indian festival recipes for Diwali"

Now we have a query that can be executed without needing any of the previous chat history. This makes the downstream search components much simpler and stateless.
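Here's a hedged sketch of what such a rewriting call could look like; the real prompt in prompt_runner.py will differ in its details:

from openai import OpenAI

client = OpenAI()

def decontextualize(query: str, prev: list[str]) -> str:
    """Rewrite the latest query so it stands alone without the chat history."""
    history = "\n".join(f"- {p}" for p in prev)
    prompt = (
        "Rewrite the latest query as a single standalone search query that "
        "incorporates any constraints implied by the earlier messages. "
        "Return only the rewritten query.\n\n"
        f"Earlier messages:\n{history}\n\nLatest query: {query}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()

# decontextualize(
#     "Find vegetarian recipes for Diwali",
#     ["What are some Indian festival recipes?", "Any vegetarian options?"],
# )  ->  "Find vegetarian Indian festival recipes for Diwali" (roughly)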

Step 5: Turning Words into Numbers (Vectorization)

Computers don't understand words; they understand numbers. To find recipes that are semantically similar to our query, we need to convert our decontextualized query text into a numerical representation called an embedding vector.

ask.py sends the query "Find vegetarian Indian festival recipes for Diwali" to an embedding model like OpenAI's text-embedding-ada-002. The model returns a high-dimensional vector (an array of numbers) that captures the meaning of the text.
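With the OpenAI Python SDK, that call is nearly a one-liner; here's a minimal sketch:

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    """Turn a query string into an embedding vector for semantic search."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding  # 1536 dimensions for this model

query_vector = embed("Find vegetarian Indian festival recipes for Diwali")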

Step 6: The Vector Search

With our query vector in hand, it's time to search! ask.py queries our vector database, Qdrant. It passes the query vector and instructs Qdrant to find the most similar document vectors in its index for example.com.

This is also where memory items can be used as filters. For example, if the user had previously specified a dietary restriction, that could be passed as a metadata filter to Qdrant, ensuring we only search within relevant recipes. Qdrant returns a list of matching Recipe documents, for instance, a document for "Vegetarian Diwali Samosas."

This is a core component of the Retrieval-Augmented Generation (RAG) pattern, where we retrieve relevant information before generating a final answer.
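A search like that with the qdrant-client library might look like the sketch below; the collection name and payload field names are illustrative assumptions, not NLWeb's actual schema:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant = QdrantClient(host="localhost", port=6333)  # placeholder connection details

# Optional metadata filter built from remembered constraints, e.g. a dietary restriction.
diet_filter = Filter(
    must=[FieldCondition(key="suitableForDiet", match=MatchValue(value="VegetarianDiet"))]
)

hits = qdrant.search(
    collection_name="example_com_recipes",  # hypothetical collection name
    query_vector=query_vector,              # from the embedding step above
    query_filter=diet_filter,
    limit=10,
)

for hit in hits:
    print(hit.score, hit.payload.get("name"))  # e.g. "Vegetarian Diwali Samosas"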

Step 7: Optional Post-Processing

Remember the mode parameter? Since our request specified mode: "list", this step is skipped. However, if we had set mode: "summarize", ask.py would make another LLM call to take the retrieved recipes and generate a natural language summary for the user.
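If summarize mode were active, that extra call could look roughly like this (prompt and model are assumptions):

from openai import OpenAI

client = OpenAI()

def summarize_results(query: str, recipes: list[dict]) -> str:
    """Condense the retrieved recipes into a short natural-language answer."""
    listing = "\n".join(f"- {r.get('name')}" for r in recipes)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"The user asked: {query}\nSummarize these results:\n{listing}",
        }],
    )
    return completion.choices[0].message.content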

Step 8: Return the Response

Finally, ask.py formats the results into the structured Schema.org Recipe JSON format. Because we set streaming: true, the results are streamed back to the client as they become available, providing a responsive user experience. The client might receive something like:

{
  "@type": "Recipe",
  "name": "Vegetarian Diwali Samosas",
  "suitableForDiet": "VegetarianDiet",
  ...
}
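One straightforward way to stream objects like that is to emit one JSON document per line as each result becomes ready. The sketch below assumes a generic async send callable supplied by the web framework, since the actual wiring inside ask.py isn't shown here:

import json

async def stream_results(hits, send):
    """Send each Schema.org Recipe object to the client as soon as it is ready.

    `send` stands in for whatever write/flush callable the web framework provides.
    """
    for hit in hits:
        recipe = {
            "@type": "Recipe",
            "name": hit.payload.get("name"),
            "suitableForDiet": hit.payload.get("suitableForDiet"),
        }
        # NDJSON-style framing; the prototype's actual wire format may differ.
        await send(json.dumps(recipe) + "\n")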

And that's it! From a multi-turn conversation to a structured JSON object, ready to be displayed beautifully on the front end.

Leveling Up: The Extended Flow with a "Fast Track"

The simplified flow is great, but in the real world, latency matters. Waiting 1-2 seconds for a response can feel slow. The extended pipeline in NLWeb introduces clever optimizations, primarily through parallel processing, to cut that time in half.

Here’s how it works: Steps 2 (Relevance), 3 (Memory), and 4 (Decontextualization) can all be run in parallel! They are independent operations, so there's no need to wait for one to finish before starting the next.

This parallel flow reduces the total latency from around 1.2–2 seconds down to a much snappier 0.5–0.7 seconds.

But there's another cool trick: the Fast Track.

The system makes a bet. Alongside the other parallel checks, it runs one more: analyze_query.py asks an LLM whether the query is "simple," expecting a small JSON answer like {"is_simple": true}. A "simple" query is one that is already decontextualized and doesn't need the prev history to be understood.

  • If the query is deemed simple, the system can immediately vectorize the original query and start the vector search.
  • Meanwhile, the full decontextualization process (Step 4) is still running in another thread.

Once decontextualization finishes, the system compares its output to the original query.

  • If they are the same (the LLM's bet was right!), the fast track results are used, and the user gets a response even quicker.
  • If they are different (the bet was wrong), the fast track thread is simply terminated, and the system continues with the correctly decontextualized query.

This is a brilliant and pragmatic engineering pattern. It optimizes for the common case (simple, direct queries) while maintaining correctness for the complex conversational cases.
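Here's a hedged asyncio sketch of the pattern. The helper coroutines (check_relevance, detect_memories, decontextualize, vector_search) stand in for async versions of the steps sketched earlier; they're assumptions, not NLWeb's actual API:

import asyncio

async def handle_query(query: str, prev: list[str]) -> list:
    """Run the independent checks in parallel and race a speculative fast-track search."""
    relevance_task = asyncio.create_task(check_relevance(query))        # Step 2
    memory_task = asyncio.create_task(detect_memories(query))           # Step 3
    decontext_task = asyncio.create_task(decontextualize(query, prev))  # Step 4

    # The bet: assume the query is already standalone and start searching immediately.
    fast_track_task = asyncio.create_task(vector_search(query))

    if not await relevance_task:
        fast_track_task.cancel()
        raise ValueError("Query is not relevant to this site's schema")

    rewritten = await decontext_task
    if rewritten == query:
        results = await fast_track_task   # the bet paid off, reuse the speculative search
    else:
        fast_track_task.cancel()          # the bet was wrong, discard it
        results = await vector_search(rewritten)

    await memory_task  # remembered constraints could also feed the search filter
    return results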

Why This Matters for You, the Developer

Breaking down NLWeb is more than just a fun academic exercise. It reveals several key principles for building modern AI applications:

  1. Modularity is King: Each step in the pipeline is a distinct component. You can swap out OpenAI for another LLM, Qdrant for another vector DB, or even change the entire domain from recipes to e-commerce just by updating the site_type.xml and the underlying data schema.

  2. Context is Everything: The decontextualization step is a powerful pattern for building stateful-feeling applications on top of stateless components. It isolates the complexity of conversation history into a single, predictable step.

  3. Structure is Your Friend: By leveraging Schema.org, NLWeb ensures a predictable, machine-readable output. This is far more reliable than just asking an LLM to "format the output nicely" and then trying to parse the resulting text. Using structured data like this is a known way to improve the performance of vector search and RAG systems.

  4. Performance is a Feature: The extended pipeline's use of parallel processing and the "fast track" heuristic shows a commitment to user experience. In AI, where LLM calls can be slow, these optimizations are critical.

From Prototype to Production Enterprise AI

The NLWeb prototype provides a fantastic blueprint. However, taking these concepts and deploying them in a large-scale enterprise environment introduces new challenges around data ingestion, scalability, and integration with existing systems.

For instance, how do you handle real-time data streams for your search index? This is where technologies like Apache Druid, a real-time analytics database, come into play. Building a conversational AI layer on top of a powerful database like Druid requires specialized expertise. If you're tackling these kinds of complex problems, exploring professional services like Apache Druid AI Consulting for Europe can provide the necessary guidance to architect a robust solution.

Furthermore, building the conversational interface itself is a significant undertaking. The ask.py endpoint is just the beginning. A full-fledged system, like an Enterprise MCP (Model Context Protocol) Server, involves managing user sessions, security, observability, and seamless integration with various data backends. You can see an example of this in the Apache Druid MCP Server, which applies these principles to time-series data.

Conclusion

The NLWeb search prototype beautifully demystifies the process behind modern, AI-powered conversational search. By intelligently combining LLMs for understanding, vector databases for semantic retrieval, and structured data for reliable output, it creates a powerful and efficient system. The journey of our simple query—from conversational context to a precise JSON object—showcases a modular, performant, and adaptable architecture.

You can explore the code yourself: fork the NLWeb GitHub repository and start building your own custom search solutions.

What would you build with a framework like this? Let me know in the comments below!
