Introduction
Chatbots have evolved significantly since their inception in the 1960s with simple programs like ELIZA, which could mimic human conversation through predefined scripts. Initially limited to basic interactions, typically in customer service roles, chatbots have been transformed by advances in artificial intelligence into essential tools for businesses, enhancing user engagement and providing 24/7 customer support. More recently, the shift from rule-based systems to large language models (LLMs) has enabled chatbots to respond with far greater comprehension and context awareness.
AI-powered assistants are now revolutionizing the developer tools landscape, offering more efficient ways to interact with documentation. When they are at their best, these intelligent search assistants provide accurate, contextual, and efficient navigation through complex information, significantly boosting productivity and saving valuable time.
Our decision to build an AI-powered documentation assistant was driven by the desire to provide rapid and customized responses to engineers developing with ApostropheCMS. While we remain committed to providing guidance and fostering community in Discord, support via this channel is limited by personnel availability. Implementing an AI-driven chatbot enables developers to receive instant, customized answers anytime, even outside of regular support hours, and expands accessibility by providing support in multiple languages. Our assistant aims to transform the documentation experience by moving beyond simple scripts, offering a seamless and intelligent solution tailored to developers' needs.
Technical Design Highlights
Building an AI-powered chatbot is more than just connecting a user’s query to an LLM. While commercial options often bundle everything—natural language processing, conversational memory, knowledge retrieval—into a neat package, they tend to be a black box, leaving us with little control over the finer details. When we set out to create our documentation chatbot, we knew we wanted more than just an out-of-the-box solution. We leveraged the power of an LLM, but also took steps to refine the process, enhancing accuracy and overall user experience by making thoughtful design choices along the way. In this section, we will highlight some of those key design decisions.
Retrieval-Augmented Generation (RAG) with Vector Database
When designing our documentation chatbot, we knew from using commercial offerings that there were two significant issues: hallucination, where the AI generates confident but incorrect or irrelevant information, and answers that referenced outdated material about older versions of Apostrophe. Hallucination occurs because large language models (LLMs) are designed to generate the most likely response based on their training data, even when that data contains nothing relevant or up-to-date.
We toyed with “prompt engineering”, essentially adding extra instructions to each query to guide the AI’s response and improve the accuracy of answers. For example, for each query we added language like, “Using only information about the latest version of Apostrophe, answer the following query: `{{ user_query }}.`” But as many of you have probably found in your own experiments, this approach on its own has its limits. The AI would still go off track, lacking the real-time knowledge to separate fact from fiction.
So rather than relying solely on prompt engineering, we chose a Retrieval-Augmented Generation (RAG) approach for our chatbot. Without diving too deep into the technicalities (there is a great article here for a more comprehensive explanation), RAG is fundamentally about enhancing an LLM's responses by providing it with additional, real-time info. At its simplest, RAG involves three main steps:
- Retrieval: Using the user's query to find relevant information from a knowledge base.
- Augmentation: Adding the retrieved information to the context that is sent to the LLM along with the query.
- Generation: Having the LLM generate a response based on both the original query and the augmented context.
This approach lets us feed the LLM current knowledge that wasn't part of its original training, leading to more accurate and up-to-date answers. The retrieval part can be done in various ways, from simple keyword matching to fancier semantic search techniques. For us, it was a game-changer in helping to reduce hallucination and outdated info problems.
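To make those three steps concrete, here's a minimal, purely illustrative sketch in Python. The function and variable names (answer_with_rag, vector_store, llm) are placeholders rather than our actual implementation, which we describe in the following sections.

```python
# A conceptual sketch of the three RAG steps; names are illustrative placeholders.
def answer_with_rag(query: str, vector_store, llm) -> str:
    # 1. Retrieval: find the document chunks most similar to the query.
    relevant_chunks = vector_store.similarity_search(query, k=4)

    # 2. Augmentation: fold the retrieved text into the prompt as context.
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generation: the LLM answers from the query plus the added context.
    return llm.invoke(prompt).content
```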
Vector Embedding
We settled on using a vector database mechanism for our RAG implementation. This approach allows us to efficiently store and retrieve relevant information based on the semantic meaning of queries. Here's how we built our vector database:
First, we gathered all of our documentation and extension README files into a collection of about 150 documents. We then split these documents into smaller chunks of 1000 characters each, with an overlap of 200 characters between chunks. This chunking process helps maintain context while allowing for more precise retrieval of relevant information.
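As a rough sketch, here's how that loading and chunking step might look with LangChain's RecursiveCharacterTextSplitter; the directory path, glob pattern, and loader choice are assumptions for illustration rather than our exact setup.

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the documentation and README files (the path and glob are illustrative).
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Split into 1000-character chunks with a 200-character overlap so text that
# spans a chunk boundary keeps enough surrounding context.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
```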
Next, we created embeddings for each of these chunks. Embeddings are numerical representations that capture the semantic meaning of the text, allowing for similarity comparisons. We used the OpenAI text-embedding-3-small model to convert each text chunk into a high-dimensional vector. This is a decision we may rethink moving forward, based on factors such as whether more context is worth the cost. We'll touch on this briefly in the last section of this article.
Finally, we stored these vectors in our chosen database: the activeloop DeepLake database. This database is open source, something near and dear to our own open-source hearts. We will cover some additional details in a later section, but it is specifically designed to handle vector data and perform efficient similarity searches, which is crucial for quick and accurate retrieval during the RAG process.
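Continuing the sketch above, embedding the chunks and storing them in DeepLake takes only a few lines with LangChain's wrappers; the dataset path here is a placeholder (an in-memory dataset is also an option, as discussed later).

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import DeepLake

# Embed each chunk with text-embedding-3-small and persist the vectors
# in a Deep Lake dataset (the path is a placeholder).
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = DeepLake.from_documents(chunks, embedding=embeddings, dataset_path="./deeplake_docs")
```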
One of the key advantages of this approach is its flexibility and cost-effectiveness. Updating our RAG database is a straightforward process that costs only about five cents per update. This allows us to continuously expand and refine our knowledge base as our documentation evolves, ensuring that our chatbot always has access to the most up-to-date information.
Compared to alternatives like fine-tuning an entire LLM, which can be time-consuming and expensive, especially with frequently changing content, our vector database approach to RAG is a more accurate and cost-effective way to keep the chatbot's knowledge current as our documentation changes.
Document Retrieval with Nearest Neighbor
When a user submits a query to our docbot, we employ a process known as nearest neighbor search to find the most relevant information:
- Vector Conversion: The query is first converted into a vector, representing its semantic meaning in a multi-dimensional space. This is done with the same embedding model as was used to create the database.
- Nearest Neighbor Search: We use a nearest neighbor algorithm to find the most similar document chunks in our knowledge base. This technique efficiently identifies the 'closest' vectors to our query vector in the high-dimensional space.
- Initial Retrieval: Our system retrieves the top 75 nearest neighbors (fetch_k = 75). In essence, we're finding the 75 document chunks that are most semantically similar to the query.
- Refinement: From these 75 chunks, we further refine our selection to the 12 most relevant (k = 12). This two-step process helps balance between broad coverage and focused relevance.
- Prompt Creation: The selected chunks, along with the original query, are formatted into a prompt for the LLM.
Nearest neighbor search is a fundamental concept in information retrieval and machine learning. It's particularly useful in our context because it allows us to quickly find the most similar documents without having to compare the query to every single document in our database. We use cosine similarity as our distance metric to determine how 'close' or similar two vectors are.
This approach ensures that the model's answers are grounded in the most relevant and up-to-date information available in our documentation. Our initial parameter choices, fetching 75 document chunks and narrowing them to 12, can likely be further optimized to balance response accuracy against processing speed.
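This fetch-then-refine pattern maps onto the fetch_k and k options of LangChain's maximal marginal relevance (MMR) retrieval. Assuming that mechanism, and reusing the db vector store from the earlier sketch, the configuration looks roughly like this:

```python
# Retrieve a broad pool of 75 candidates, then narrow to the 12 chunks
# that are actually passed along to the LLM.
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"fetch_k": 75, "k": 12},
)

# Example usage with a hypothetical developer question.
relevant_chunks = retriever.invoke("How do I register a custom widget?")
```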
LangChain Framework
Any commercial or open-source LLM model is going to have some type of API that allows you to interact with the model, but using these APIs directly can lead to model-specific implementations and require significant custom development for advanced features. To enhance flexibility and streamline development, we chose to use the LangChain framework. LangChain offers a layer of abstraction that allows us to create model-agnostic chains of processes, easily switch between different LLMs, and leverage pre-built components for common tasks like memory management and prompt engineering.
As the name suggests, the LangChain framework allows for queries and retrieved documents to be passed down a chain of processes. Here's how we've implemented it:
- Query Reformulation: We first combine the user's query with the current user’s chat history from that same session to create a new, stand-alone query. This process, using LangChain's ConversationBufferMemory, improves clarity by providing context from previous interactions. For example, if a user asks, "How do I modify it?" the reformulated query might be, "How do I modify the page template in ApostropheCMS?" based on the context of the previous conversation.
- Document Retrieval and Prompt Engineering: The reformulated query is used to retrieve relevant documents from our RAG database. We then apply prompt engineering using LangChain's PromptTemplate before querying the LLM.
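As a sketch of this chain, LangChain's ConversationalRetrievalChain bundles the reformulation, retrieval, and generation steps; the prompt wording and model name below are illustrative assumptions rather than our production values, and the retriever comes from the earlier sketch.

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Prompt used for the final answer; {context} receives the retrieved chunks.
answer_prompt = PromptTemplate.from_template(
    "Using only the context below about the latest version of Apostrophe, "
    "answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
)

# Chat history from the current session feeds the query-reformulation step.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=retriever,
    memory=memory,
    combine_docs_chain_kwargs={"prompt": answer_prompt},
)

# A follow-up like "How do I modify it?" is first rewritten into a
# stand-alone query using the stored history, then answered.
result = chain.invoke({"question": "How do I modify it?"})
```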
Other Useful Features:
Model Flexibility: LangChain's model-agnostic design allows for easy A/B testing: you can switch between different LLMs, or even fine-tuned versions of the same model, to compare performance and outcomes. We've been comparing OpenAI's GPT-4 and Anthropic's Claude Sonnet. This flexibility allows for continuous optimization of the chatbot's responses and lets you choose the best model for your specific needs without significant re-engineering.
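For example, swapping the model behind the chain is essentially a one-line change. The model identifiers below are placeholders for whichever versions are being compared, and the chain class and retriever are reused from the sketch above.

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Two candidate models behind the same chain; only the constructor changes.
candidates = [
    ChatOpenAI(model="gpt-4"),
    ChatAnthropic(model="claude-3-5-sonnet-20240620"),
]

for llm in candidates:
    ab_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever)
    answer = ab_chain.invoke(
        {"question": "How do I add a widget to a page template?", "chat_history": []}
    )
    print(type(llm).__name__, answer["answer"])
```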
Performance Monitoring: LangChain integrates with LangSmith, offering advanced tools for debugging and monitoring. Using LangSmith, we identified that the amount of chat history we were passing with our queries was likely creating excessive, unnecessary token usage. This is an area we are actively investigating to see whether we can reduce costs without impacting response quality.
This integrated suite of tools makes LangChain a powerful choice for building and optimizing AI-powered chatbots. It allows us to continually refine our implementation, ensuring we deliver the best possible user experience while managing resources efficiently.
DeepLake Database
While we've already discussed the basics of our vector database implementation, it's worth diving deeper into why we chose activeloop DeepLake and how it enhances our chatbot's performance.
Memory-Resident Capability: DeepLake offers the ability to create a memory-resident database. This feature significantly reduces latency by keeping the data in RAM, close to where it's processed. For our current dataset of about 150 documents, this in-memory approach provides very rapid retrieval times.
Versioning System: DeepLake's built-in versioning system is a powerful tool for managing our knowledge base. It allows us to track changes over time, making it easy to update our documentation or roll back to previous versions if needed. This is particularly valuable as our product evolves and documentation is frequently updated.
Efficient Querying and Compression: The database supports efficient data querying, allowing us to quickly retrieve relevant information. Additionally, its data compression features help optimize memory usage, which is crucial for maintaining high performance as our dataset grows.
Scalability: While we currently use DeepLake in-memory due to our relatively small dataset, we have a clear path for scaling. If we expand our dataset to include a large codebase, for example, we can easily transition to DeepLake's cloud offering. However, this would come at the cost of increased document-retrieval latency and additional hosting fees.
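In practice, that scaling path amounts to changing the dataset_path handed to the vector store: a local or in-memory dataset now, an Activeloop-hosted one later. The paths below are placeholders under that assumption, and the embeddings object comes from the earlier sketch.

```python
from langchain_community.vectorstores import DeepLake

# The same vector store class covers each deployment option; only the path changes.
db_local = DeepLake(dataset_path="./deeplake_docs", embedding=embeddings)           # on disk
db_memory = DeepLake(dataset_path="mem://apostrophe-docs", embedding=embeddings)    # RAM-resident
db_cloud = DeepLake(dataset_path="hub://my-org/apostrophe-docs", embedding=embeddings)  # Activeloop cloud
```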
Future Optimizations: As our dataset grows and we potentially move to cloud storage, we're already considering optimizations. For instance, we may need to cache the results of the first document retrieval in our chain of processes to mitigate latency issues. Our current setup, with document retrieval occurring twice in our process flow, may need adjustment to maintain optimal performance.
By leveraging DeepLake's features, we've created a flexible, efficient foundation for our RAG system. This allows us to focus on improving the chatbot's responses and user experience, knowing that our data retrieval system can scale and adapt as our needs evolve.
Reflection from Early Adoption
Our experience with the chatbot in production is still in the early stages, but we’ve already seen some promising results. For instance, there have been strong examples where the chatbot has effectively guided users through building features like custom widgets from scratch. This includes the main index.js file, Nunjucks template, and player code. As user adoption increases, we hope these positive outcomes will continue. While we've only touched on the key design highlights, there are several other noteworthy takeaways from our early experiences.
One observation is that users often ask a series of highly similar questions sequentially. This behavior can increase hallucinations, as the LLM may attempt to modify its responses each time. Fortunately, since we control the chain of processes, we were able to introduce a step that evaluates each new query against previous ones in the chat history to mitigate this issue. If a new query's similarity to a previous one is 0.85 or higher, it is rejected as a repeat of the same question. Although this change has occasionally led to user frustration when users perceive their queries as different, it has generally reduced hallucinations.
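A minimal sketch of that duplicate-query check, assuming we compare the embeddings of the two queries directly with cosine similarity; our production gate sits inside the LangChain process chain, but the idea is the same.

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def is_repeat_query(new_query: str, previous_query: str, threshold: float = 0.85) -> bool:
    """Return True when the new query is semantically a repeat of the previous one."""
    a = np.array(embeddings.embed_query(new_query))
    b = np.array(embeddings.embed_query(previous_query))
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold
```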
Similarly, we noticed that users sometimes ask questions using terminology that is absent from our documentation but appears in our codebase. Because the LLM is primarily limited to our documentation, it tends to hallucinate answers for these unfamiliar terms. By adjusting the process chain to examine the vector scores of the retrieved documents and requiring that the returned documents have significant semantic similarity (0.85 or higher) with the query, we've also been able to reduce this type of hallucination.
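A sketch of that relevance guard, assuming the vector store's similarity_search_with_score returns cosine-similarity scores where higher means closer; if nothing clears the bar, the chain can decline to answer rather than guess. The query string is a made-up example and db is the vector store from the earlier sketch.

```python
# Keep only chunks that are close enough to the query to ground an answer.
query = "How do I configure widget players?"  # example incoming user query
docs_and_scores = db.similarity_search_with_score(query, k=12)
grounded_docs = [doc for doc, score in docs_and_scores if score >= 0.85]

if not grounded_docs:
    # Nothing in the documentation is a confident match; avoid hallucinating.
    answer = (
        "I couldn't find documentation covering this; "
        "please ask in our Discord for help."
    )
```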
Another unexpected benefit of monitoring user interactions is that it has highlighted areas for improving our documentation. When the LLM struggles to find information, it's often a sign that the content is either not in the right context or not prominently featured. In one case, this led us to discover an incorrectly documented feature, allowing us to rectify the error and enhance our overall documentation quality.
These early reflections have not only helped us refine our chatbot but have also provided valuable insights into our documentation strategy and user behavior. As we continue to learn and adapt, we're excited about the potential for further improvements and the positive impact on our users' experience.
Looking Ahead
When originally designing the chatbot, we opted to build it in Python, despite being a heavily JavaScript-oriented shop. This decision was driven by the availability of more mature analytic tools for objectively testing chatbot hallucination and accuracy in Python. So far, we've been evaluating answers qualitatively, but we plan to incorporate a tool like Giskard to bring a more quantitative approach to our evaluations. This step is crucial and one that, anecdotally, is often overlooked in many production chatbots.
Another goal on the horizon is optimizing how we manage the chat history passed with each query to reduce token usage. However, with the decreasing cost of most LLM models and increasing token limits, this concern has become less pressing. These advancements allow us to pass more context for less money without worrying about answers being truncated by token constraints.
We’re also exploring the idea of supplementing our RAG database with several codebases, including the core Apostrophe repository and potentially some of our starter kits. This would complicate the embedding process, however, and might challenge the LLM’s primary goal of guiding users to the most relevant documentation for further reading. Along with this, we have also considered switching to the text-embedding-3-large model, which would provide more context for our documentation and might also be better suited to large amounts of code. We’re carefully weighing the best way forward, aiming to balance the richness of our database with the clarity and utility of the chatbot’s responses.
As we continue refining our chatbot, our primary focus remains on providing developers with precise, actionable information. Every enhancement is a step toward that goal, and we're excited to see how these improvements will further elevate the user experience. One key item under discussion is how best to collect user feedback, so that when an answer contains hallucinated content and doesn't offer an effective solution for the developer, we can step in and remedy it.
We invite you to experience our AI-powered documentation assistant firsthand and explore how it can enhance your development workflow. Try it out here, and let us know your thoughts! Join our Discord community to share feedback, ask questions, and collaborate with other developers. Together, we can continue to refine and improve this tool to better serve the ApostropheCMS community.