Matthieu Lienart

A Serverless Chatbot with LangChain & AWS Bedrock

LangChain is an open-source framework for building applications powered by large language models (LLMs), while AWS Bedrock is a fully managed service that provides access to foundation models from leading AI companies. I know these days it's all about agentic AI, but even if you're trying to develop a simple serverless non-agentic chatbot using LangChain and AWS Bedrock, you still need to combine many advanced capabilities. This article walks you through the challenges of integrating these powerful tools to create a sophisticated chatbot with features such as conversation history management, retrieval-augmented generation (RAG), multilingual support, and more.

What is the problem?

When developing a serverless non-agentic chatbot using LangChain and AWS Bedrock, you'll likely want to incorporate several key features to make it truly useful and robust.

  1. The ability to maintain the current conversation (here I limit the scope to the current conversation, not storing past interactions).
  2. Provide the model with your own specific context using your knowledge base and use retrieval-augmented generation (RAG) to generate answers relating to your context (here I limit myself to crawled web pages).
  3. The ability to answer in the language of the user.
  4. Guardrails to make sure the answers are compliant with your chatbot objectives, prevent prompt attacks, etc.
  5. The ability to generate outputs directly in a JSON structured format for the frontend.

At least I wanted all those things.

While numerous code samples and tutorials exist that demonstrate one or two of these capabilities, I found none that comprehensively cover all five. Moreover, many of the more complete examples rely on outdated versions of LangChain with deprecated APIs.

This article aims to fill that gap, even though it might itself become quickly obsolete given the pace of innovation in the field.

The solution

Here's an overview of the solution I developed to address these challenges:

  • For conversation history management, I use DynamoDB and LangChain to store the conversation history, but I implement a custom solution instead of using the common RunnableWithMessageHistory.
  • For multilingual support, I use AWS Comprehend to detect the question's language and generate appropriate language instructions for the model's response. Language detection could also be done using an LLM, but I suspect (although I haven't tested it) that the response time and cost would be higher.
  • I utilize Bedrock's built-in capabilities for the RAG knowledge base (employing the built-in web crawler to index the content, not shown here), for guardrails and for generating structured JSON outputs.

Figure 1: High-level architecture of the serverless chatbot

For the full code, you can refer to the Jupyter notebook in this GitHub repository. While the notebook demonstrates the components locally, the principles apply directly to a Lambda Function deployment.

Details

1. Manage the Conversation History

Why not Manage Conversation History with RunnableWithMessageHistory?

LangChain provides a RunnableWithMessageHistory class for managing conversation history, which you will see in many code samples. Although very convenient, this approach has two significant drawbacks for my use case:

  • Performance Limitations: RunnableWithMessageHistory retrieves the conversation history before running the chain. But in my implementation, I need to perform three independent tasks: a) RAG context retrieval, b) language detection and instruction generation, c) conversation history retrieval. By parallelizing these tasks, I can reduce the latency of the chain's initial steps.
  • Incompatibility with Structured Output: The default implementation doesn't work well with structured output. While this can be bypassed by always returning the raw output together with the structured output and storing the raw output in the message history, it introduces a new problem. It would include unnecessary information like RAG references in the conversation history which would consume LLM prompt tokens when using that history in later prompts. So, I need to customize what is stored in the database.
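
For context, this is roughly what the common pattern looks like (a sketch assuming a DynamoDB-backed history and an existing prompt | llm chain named chain; it is not the code used in this article):

from langchain_community.chat_message_histories import DynamoDBChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Wrap an existing chain so that the conversation history is loaded before
# and saved after every invocation.
chain_with_history = RunnableWithMessageHistory(
    chain,  # assumption: a prompt | llm chain defined elsewhere
    lambda session_id: DynamoDBChatMessageHistory(
        table_name=CONVERSATION_HISTORY_TABLE_NAME,
        session_id=session_id,
    ),
    input_messages_key="question",
    history_messages_key="history",
)

# The DynamoDB read happens sequentially, before RAG retrieval or language
# detection can even start.
chain_with_history.invoke(
    {"question": "C'est quoi LangChain?"},
    config={"configurable": {"session_id": "some-session-id"}},
)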

Parallelization of Conversation History Retrieval

The implemented solution uses DynamoDBChatMessageHistory to define where the conversation history is stored.

history = DynamoDBChatMessageHistory( 
  table_name=CONVERSATION_HISTORY_TABLE_NAME, 
  session_id=session_id, 
  key=this_session_key, 
) 

In the initial step of the LangChain chain, I use RunnableParallel to read the past messages from the current session using a RunnableLambda, in parallel with the other steps.

RunnableParallel({
    "references": ...,                # RAG retrieval (see the full chain below)
    "language_instructions": ...,     # language detection (see the full chain below)
    "history": RunnableLambda(lambda x: history.messages),
    "question": ...,
})

To illustrate the resulting performance improvement, let's compare the telemetry traces of both approaches.

Image 1: Telemetry trace using RunnableWithMessageHistory

Image 2: Telemetry trace with my solution

The first trace shows the sequential nature of RunnableWithMessageHistory, where the conversation history retrieval from DynamoDB happens before the other tasks. In contrast, the second trace demonstrates how my custom implementation allows RAG retrieval, language detection, and conversation history retrieval to run concurrently, improving overall performance. The latency of the initial steps before calling the LLM drops from 1.2 s to 1.0 s by parallelizing the retrieval of the conversation history from DynamoDB.

Using a Callback Handler for Storing Messages

To store the new user question and LLM answer, I use a LangChain callback on_llm_end. This allows me to:

  • Extract only the relevant parts of the model's response for storage
  • Separate the answer from the references
  • Minimize token usage in future prompts

from typing import Any

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.outputs import LLMResult


class StoreMessagesCallbackHandler(BaseCallbackHandler):
    def __init__(self, history: BaseChatMessageHistory, session_id: str, question: str):
        self.history = history
        self.session_id = session_id
        self.question = question

    def on_llm_end(self, response: LLMResult, **kwargs) -> Any:
        logger.info("Storing question and LLM answer back into DynamoDB")
        generations = response.generations
        if generations and len(generations) > 0 and generations[0] and len(generations[0]) > 0:
            response_message = generations[0][0].message
            ai_message_kwargs = response_message.model_dump()
            if isinstance(response_message.content, list) and response_message.content:
                # With structured output, the answer sits in the tool call input of the first content block
                tool_input = response_message.content[0].get("input")
                ai_message_kwargs["content"] = tool_input.get("answer")
                ai_message_kwargs["references"] = tool_input.get("references")
            self.history.add_messages([
                HumanMessage(content=self.question),
                AIMessage(**ai_message_kwargs)
            ])
        else:
            logger.warning("No generations returned by LLM; no AI message to store.")
            self.history.add_message(HumanMessage(content=self.question))


It's crucial to keep the number of tokens as low as possible for two reasons:

  • Model prompts have a limit on the number of input tokens,
  • We are charged per token used.

In my use-case, the model response consists of two parts: the answer and a list of references (including URLs and excerpts). For future interactions, only the text answer is truly necessary for the model to follow the conversation. By storing only the answer and not the references, I can significantly reduce token usage in subsequent prompts.

2. RAG Retrieval

Document retrieval using RAG is done in a classic manner using AmazonKnowledgeBasesRetriever.

kb_retriever = AmazonKnowledgeBasesRetriever(
    client=bedrock_agent_client,
    knowledge_base_id=BEDROCK_KNOWLEDGE_BASE_ID,
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

The kb_retriever is then used in the initial step of the chain to retrieve content based on the question; a custom function then formats the results before injecting them into the model prompt.

itemgetter("question") | kb_retriever | format_references
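
The format_references function is not part of the LangChain API; a minimal sketch of what it could look like is shown below. The metadata keys are an assumption based on what AmazonKnowledgeBasesRetriever returns for a web-crawled data source; the output matches the JSON structure injected into the prompt later on.

import json
from langchain_core.documents import Document

def format_references(docs: list[Document]) -> str:
    """Format the retrieved documents as a JSON list of {url, excerpt} objects."""
    references = []
    for doc in docs:
        # Assumption: web-crawled sources expose their URL under metadata["location"]["webLocation"]["url"]
        location = doc.metadata.get("location", {})
        url = location.get("webLocation", {}).get("url", "")
        references.append({"url": url, "excerpt": doc.page_content})
    return json.dumps(references, indent=2)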

3. Generate Language Instructions

I created a custom function that uses AWS Comprehend to detect the user's language and generate language instructions for the model, which are injected into the model prompt.

def generate_language_instructions(question: str) -> str:
    try:
        response = comprehend_client.detect_dominant_language(Text=question)
        logger.info(f"Comprehend language detection response: {response}")
        if languages := response.get("Languages"):
            # Sort languages by score and return the one with the highest score
            languages.sort(key=lambda x: x["Score"], reverse=True)
            dominant_language = languages[0]["LanguageCode"]
            logger.info(f"Detected language: {dominant_language}")
            return f"Answer the question in the provided RFC 5646 language code: '{dominant_language}'."
        logger.warning("No language detected, defaulting to basic instructions.")
        return "Answer in the same language as the question."
    except Exception as e:
        logger.error(f"Error detecting language: {e}")
        logger.warning("Defaulting to basic language instructions.")
    return "Answer in the same language as the question."

In the initial step of the LangChain chain, I call the above function using a RunnableLambda, passing it the user question.

itemgetter("question") | RunnableLambda(generate_language_instructions)

A logical improvement is to store the language code in the conversation history so that we don’t redetect it at every new user message. But this is not implemented yet.
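
Purely as an illustration (this is not implemented in the repository), caching the detected language in a hypothetical session-metadata DynamoDB table, reusing the comprehend_client from the snippet above, could look like this:

import boto3

# Hypothetical table and key names used only for this sketch
SESSION_METADATA_TABLE_NAME = "chatbot-session-metadata"
session_metadata_table = boto3.resource("dynamodb").Table(SESSION_METADATA_TABLE_NAME)

def get_or_detect_language(session_id: str, question: str) -> str:
    """Return the cached language code for this session, detecting it only once."""
    item = session_metadata_table.get_item(Key={"SessionId": session_id}).get("Item")
    if item and item.get("LanguageCode"):
        return item["LanguageCode"]
    response = comprehend_client.detect_dominant_language(Text=question)
    languages = sorted(response.get("Languages", []), key=lambda x: x["Score"], reverse=True)
    language_code = languages[0]["LanguageCode"] if languages else "en"
    session_metadata_table.put_item(Item={"SessionId": session_id, "LanguageCode": language_code})
    return language_code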

4. Using Guardrails

Using Amazon Bedrock Guardrails when calling a model is very simple: you just pass the guardrail configuration to ChatBedrockConverse.

llm = ChatBedrockConverse(
    client=bedrock_client,
    model=BEDROCK_MODEL,
    verbose=True,
    max_tokens=2048,
    temperature=0.0,
    top_p=1,
    stop_sequences=["\n\nHuman"],
    guardrail_config={
        "guardrailIdentifier": BEDROCK_GUARDRAIL_ID,
        "guardrailVersion": BEDROCK_GUARDRAIL_VERSION
    }
)

5. Structured Output

To request the model to generate the answer following a specific structure, I just specify that structure using Pydantic.

class ChatBotResponseReference(BaseModel):
    """A web reference used to answer the question"""
    url: str = Field(description="The URL of the reference")
    excerpt: str = Field(description="The extract from the reference")

class ChatBotResponse(BaseModel):
    """The response from the chatbot."""
    answer: str = Field(description="The answer to the question")
    references: list[ChatBotResponseReference] = Field(description="A list of references relating to the question")

And then update the model definition as follows.

structured_llm = llm.with_structured_output(
    ChatBotResponse, 
    include_raw=True,
)

Here, I force the model to provide both the raw answer and the structured answer. I do this because there is no guarantee that the model will follow the format instructions, so I want the raw answer as a fallback source if needed.
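
With include_raw=True, the output of with_structured_output is a dictionary with the keys raw, parsed and parsing_error, so the fallback can be handled along these lines (a sketch, not the exact code from the repository):

result = structured_llm.invoke("What is LangChain?")

# "parsed" holds the ChatBotResponse object, or None if parsing failed
parsed = result["parsed"]
if parsed is not None:
    answer = parsed.answer
    references = parsed.references
else:
    # Fall back to the raw model message if the structured parsing failed
    answer = result["raw"].content
    references = []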

Warning: with this approach, there is still a risk that the model hallucinates and, instead of reusing the references retrieved from the knowledge base, generates non-existent ones in the answer. If you face this issue and need to return the exact outputs of the RAG knowledge-base retrieval step to the user, the solution is to not ask the model to generate references in its structured output. Instead, create an initial_step like the one below, invoke it first to produce the prompt inputs, then pass those inputs when invoking the prompt|llm chain, and finally combine the content of the LLM answer with the references gathered in the initial step.
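
Concretely, that alternative could be wired like this (a sketch using the initial_step, prompt and llm objects defined in the next section, with the references coming only from the retrieval step):

# 1) Run the initial step alone to get the prompt inputs (including the retrieved references)
prompt_inputs = initial_step.invoke({"question": question})

# 2) Invoke the prompt | llm chain with those inputs, without asking for references in the output
answer_message = (prompt | llm).invoke(prompt_inputs)

# 3) Combine the model's answer with the references gathered during retrieval
final_response = {
    "answer": answer_message.content,
    "references": prompt_inputs["references"],  # the formatted references from the retrieval step
}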

The LangChain Chain

Now all the pieces are there to create the prompt and the chain.

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", 
            """You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
            If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

            Here is a list of web pages references to be used as context to answer the question.
            Copy-paste them together with your answer in the output:
            {references}

            {language_instructions}\n"""
            ), 
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
initial_step = RunnableParallel({
    "references": itemgetter("question") | kb_retriever | format_references,
    "language_instructions": itemgetter("question") | RunnableLambda(generate_language_instructions),
    "history": RunnableLambda(lambda x: history.messages),
    "question": itemgetter("question"),
})

full_chain = (
    initial_step
    | prompt
    | structured_llm
)

The model can then be called as follows:

question = "C'est quoi LangChain?"
chain_callbacks = [
    StoreMessagesCallbackHandler(history, session_id, question),
    CloudWatchLoggingHandler(session_id)
]
response = full_chain.invoke({"question":question }, {"callbacks": chain_callbacks})

The Results

Now that we've walked through the implementation details, let's examine the outputs of the LangChain and AWS Bedrock-powered chatbot. We'll look at three key aspects of the results:

  1. The multi-language conversation: We'll see how the chatbot handles a multi-turn conversation in French, demonstrating its language detection and response capabilities.
  2. The generated prompts: We'll examine the prompts created by my LangChain setup, showcasing how the conversation history and context are incorporated.
  3. The Message History DynamoDB Table: We'll verify how the conversation is stored in the DynamoDB table, ensuring persistence across interactions.

The Conversation

Asking a question in French, "C'est quoi LangChain?" ("What is LangChain?"), results in a prompt like the following:

"""System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context references to answer the question. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy-paste them together with your answer in the output:
[
    {
        "url": "https://python.langchain.com/docs/introduction/",
        "excerpt": "LangChain is a framework for developing applications powered by large language models (LLMs)."
    }
]

Answer the question in the provided RFC 5646 language code: 'fr'.

Human: C'est quoi LangChain?"""

This results in a structured answer in the user's language, including the RAG references.

{
  "answer": "LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.",
  "references": [
    {
      "url": "https://python.langchain.com/docs/introduction/",
      "excerpt": "LangChain is a framework for developing applications powered by large language models (LLMs)."
    }
  ]
}

A follow-up question, "Cela fonctionne-t'il avec AWS Bedrock?" ("Does it work with AWS Bedrock?"), produces an answer showing that the model used the context of the conversation (we were talking about LangChain) to answer the new question.

{
  "answer": "LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.",
  "references": [
    {
      "url": " https://python.langchain.com/docs/integrations/chat/bedrock/",
      "excerpt": " Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs)"
    }
  ]
}

The prompt generated looks something like the shortened sample below, showing the conversation history being used.

"""System: You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy/paste them together with your answer in the output:
[…]

Answer the question in the provided RFC 5646 language code: 'fr'.

Human: C'est quoi LangChain?
AI: LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.
Human: Cela fonctionne-t'il avec AWS Bedrock?"""

The Message History DynamoDB Table

If we list the messages stored in DynamoDB for this conversation, we see the following, showing that the message content does not include the references (even though they are stored in the table).

[HumanMessage(content="C'est quoi LangChain?", ...),
 AIMessage(content="LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.", ...),
 HumanMessage(content="Cela fonctionne-t'il avec AWS Bedrock?", ...),
 AIMessage(content="LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.", ...)]

Lessons Learned

LangChain is a powerful framework that offers numerous abstractions for rapidly developing applications that interact with LLMs. However, given the rapid pace of innovation in this field, it's crucial to approach development thoughtfully:

  1. Before diving into coding based on web examples (including this article), invest time in learning LangChain fundamentals. Always verify that you're using the latest version of the framework to avoid deprecated features.
  2. While LangChain's modular nature makes it a flexible and powerful tool to work with LLMs, integrating these modules effectively for your specific use case can be complex.
  3. Be prepared to adapt and innovate, as off-the-shelf solutions may not fully address your unique requirements.
  4. Using structured output does not guarantee that the model will respect your desired answer structure, so you need a fallback mechanism. Also, when asking for references in your structured answer, be aware that the model might hallucinate and generate fake references that are not in your knowledge base.

By keeping these lessons in mind, you'll be better equipped to leverage LangChain's capabilities while navigating its challenges.

Now that we have a feature-rich chatbot, I will present in the next article how we can log LangChain's details in AWS CloudWatch Logs.
