Rishab Dugar

Managing Chat History in AWS Bedrock Models: A Deep Dive into Llama 3 and Anthropic Claude

When developing conversational AI systems, handling multi-turn conversations effectively is crucial for maintaining coherent dialogue and providing contextually relevant responses.
Amazon Bedrock is a fully managed service that makes foundation models accessible through an API. Building chatbots that handle multi-turn conversations requires maintaining and reusing context from previous interactions, which keeps responses relevant and coherent.
Llama 3, developed by Meta, offers robust capabilities for managing such interactions.
This section covers how to structure prompts for Llama 3 and provides a Python example for invoking the model via AWS Bedrock.

Understanding Prompt Tokens in Llama 3


Llama 3 utilizes specific tokens to manage conversation flow, ensuring clarity and context retention across multiple turns. Here’s an overview of key tokens:

  • <|start_header_id|> and <|end_header_id|>: These tokens define the role of each message segment within the conversation (e.g., system, user, assistant). Encapsulating messages with these tags helps the model understand who is speaking and adjust its responses accordingly.

  • <|eot_id|>: The "End of Turn" token signifies that the model has completed its response for the current turn. This is crucial in multi-turn conversations to delineate where one turn ends and another begins.

  • <|eom_id|>: "End of Message" indicates a potential continuation point within a conversation where a tool call might be needed. This token is particularly useful when integrating external tools or APIs that require back-and-forth interaction within a single turn.

These tokens play pivotal roles in structuring inputs and outputs for Llama 3, enabling it to handle complex conversational scenarios effectively.

Example: Structuring Prompts for Multi-Turn Conversation

Consider a scenario where you are creating an AI assistant capable of conducting an interactive session about travel recommendations:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an AI trained to provide travel advice.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What are some top destinations in Europe?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Top destinations include Paris, Rome, and Barcelona.<|eot_id|>

In this example:

  • Each participant's role is clearly marked using <|start_header_id|> and <|end_header_id|>.
  • The <|eot_id|> token after each message ensures that each turn is distinctly recognized by the model.

Python Example: Invoking Llama 3 via AWS Bedrock

Below, we demonstrate invoking Llama 3 through the AWS Bedrock API in Python:

import json
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create a Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Set the model ID, e.g., Llama 3 8b Instruct.
model_id = "meta.llama3-8b-instruct-v1:0"

class AWSLambdaLLAMA3:
    def __init__(self):
        self.temperature = 0.5
        self.maxTokens = 512
        self.topP = 0.9

    def construct_prompt(self, question, chat_history):
        """Create prompt for LLAMA3
        Args:
            question: str; query from user
            chat_history: list; list of formatted conversation history
        Returns:
            Prompt text to be used in the model
        """
        header = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYour system prompt here<|eot_id|>\n"
        chat_history_formatted = self.format_chat_history(chat_history)
        msg = f"<|start_header_id|>user<|end_header_id|>\n{question}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n"
        final_prompt = header + chat_history_formatted + msg
        logger.info(f"Final Prompt: {final_prompt}")
        return final_prompt

    def format_chat_history(self, chat_history):
        """Format chat history for LLAMA3
        Args:
            chat_history: list; list of conversation history
        Returns:
            Formatted chat history string
        """
        formatted_history = ""
        for entry in chat_history:
            user_input = entry['user']
            assistant_response = entry['assistant']
            formatted_history += f"<|start_header_id|>user<|end_header_id|>\n{user_input}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n{assistant_response}<|eot_id|>\n"
        return formatted_history

    def get_response(self, question, chat_history):
        """Generate response using LLAMA3 model via AWS Bedrock
        Args:
            question: str; question from the user
            chat_history: list; list of formatted conversation history
        Returns:
            The generated AI response text
        """
        prompt = self.construct_prompt(question, chat_history)
        native_request = {
            "prompt": prompt,
            "max_gen_len": self.maxTokens,
            "temperature": self.temperature,
            "top_p": self.topP
        }
        request = json.dumps(native_request)
        logger.info(f"Request Body: {request}")

        try:
            response = client.invoke_model(modelId=model_id, body=request)
            response_body = json.loads(response['body'].read())
            logger.info(f"Response Body: {response_body}")
            return response_body['generation'].replace('"', '')
        except Exception as e:  # catches ClientError and any other failure
            logger.error(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
            return None

# Example usage
lambda_llama3 = AWSLambdaLLAMA3()
chat_history = [
    {"user": "Hello, how are you?", "assistant": "I'm good, thank you! How can I assist you today?"},
    {"user": "Can you tell me a joke?", "assistant": "Sure! Why don't scientists trust atoms? Because they make up everything!"}
]
response = lambda_llama3.get_response("What's the weather like today?", chat_history)
print(response)

Final Prompt:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Your system prompt here<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Hello, how are you?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
I'm good, thank you! How can I assist you today?<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Can you tell me a joke?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Sure! Why don't scientists trust atoms? Because they make up everything!<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What's the weather like today?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

Challenges with Llama 3

While this format is straightforward, managing longer conversations can become complex due to token limits. As conversations grow longer, earlier parts may need pruning or summarizing to fit within constraints.
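
One way to watch this limit approaching is to log the token counts Bedrock returns: for Meta Llama models, the response body includes prompt_token_count, generation_token_count, and stop_reason fields. Here is a minimal sketch of a helper you could call on the raw invoke_model response before parsing it further:

import json
import logging

logger = logging.getLogger()

def log_token_usage(response):
    """Log token usage from a Bedrock invoke_model response for a Meta Llama model.

    A steadily rising prompt_token_count is the signal to start pruning
    or summarizing older turns before you hit the context limit.
    """
    response_body = json.loads(response["body"].read())
    logger.info(
        "prompt tokens: %s, generated tokens: %s, stop reason: %s",
        response_body.get("prompt_token_count"),
        response_body.get("generation_token_count"),
        response_body.get("stop_reason"),
    )
    return response_body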

Best Practices:

  1. Limit the Number of Turns: Pass only recent interactions; older context often becomes irrelevant.
  2. Use a Sliding Window: Retain only the last few turns that fit within the token limit (a minimal sketch follows this list).
  3. Optimize System Prompts: Keep system prompts concise so they set clear context without crowding out conversation history.
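
As a minimal sketch of points 1 and 2, the window itself can be a one-line slice over the same chat_history list consumed by format_chat_history above; max_turns is an arbitrary illustrative cutoff, since a production version would count tokens rather than turns:

def sliding_window(chat_history, max_turns=5):
    """Keep only the most recent turns of the conversation.

    chat_history is the same list of {"user": ..., "assistant": ...} dicts
    used by format_chat_history above; max_turns is an illustrative cutoff.
    """
    return chat_history[-max_turns:]

# Usage: trim the history before building the prompt.
# trimmed = sliding_window(chat_history, max_turns=3)
# response = lambda_llama3.get_response(question, trimmed)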

Mastering Multi-Turn Conversations with Anthropic Claude on AWS Bedrock


Anthropic Claude is one of the advanced language models available on Bedrock, designed to understand and generate human-like text. By integrating Claude with Bedrock, developers can harness powerful NLP capabilities without managing the underlying infrastructure. Let's explore how to pass chat history to Anthropic Claude so our applications can engage in meaningful, context-aware dialogues.

APIs for Claude on Amazon Bedrock

Anthropic Claude on Amazon Bedrock offers two distinct APIs tailored to different versions of the model:

Text Completion API (Claude v1 and v2.x)

The Text Completion API is used by earlier versions of Claude (v1 and v2.x). It allows developers to generate text completions based on a given prompt. While effective for single-turn interactions, managing multi-turn conversations requires additional handling of context.

To pass chat history for Claude's text completion API (versions 1 and 2.x), you need to structure the prompt manually by appending both user inputs and AI responses in a dialogue-like format. Each turn in the conversation is represented as a string, where "Human" represents the user input and "Assistant" represents Claude's response. For example:

# Chat history formatted for the text completion API.
# Claude expects alternating "\n\nHuman:" / "\n\nAssistant:" turn markers.
chat_history = [
    "Human: What is the capital of France?",
    "Assistant: The capital of France is Paris.",
    "Human: Tell me more about Paris."
]
prompt = "\n\n" + "\n\n".join(chat_history) + "\n\nAssistant:"

# Now use the prompt to generate the next response

This format allows Claude to "remember" previous interactions by feeding the chat history into the prompt, maintaining the context of the conversation straightforwardly. The prompt is then passed to the text completion API, which generates the next assistant response.
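
For completeness, here is a minimal sketch of sending that prompt to a Claude v2 text-completion model on Bedrock. The model ID and parameter values are illustrative; prompt, max_tokens_to_sample, and stop_sequences are the fields the text completion API expects:

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "prompt": prompt,                  # the "\n\nHuman: ... \n\nAssistant:" string built above
    "max_tokens_to_sample": 256,       # required by the text completion API
    "temperature": 0.5,
    "stop_sequences": ["\n\nHuman:"],  # stop before the model writes the next user turn
}

response = client.invoke_model(
    modelId="anthropic.claude-v2",     # any Claude v1/v2.x text-completion model ID
    body=json.dumps(body),
)
completion = json.loads(response["body"].read())["completion"]
print(completion)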

Messages API (Claude v3)

Introduced with Claude version 3, the Messages API natively supports passing chat history, allowing the model to maintain context across multiple interactions. This simplifies the implementation of context-aware dialogues, making it ideal for modern conversational applications.

Implementing Multi-Turn Conversations

To create a multi-turn conversation with Anthropic Claude on AWS Bedrock, follow these key steps:

Formatting Chat History

First, format the existing chat history into a structure that Claude can understand by mapping user inputs and AI responses appropriately.

def multi_turn_bedrock_request(chat_history):
    """
    Convert chat history to the required format for Bedrock Anthropic Claude.

    Parameters:
        chat_history (list): A list of chat history where 'Human' corresponds 
                             to 'user' and 'AI' to 'assistant'.

    Returns:
        messages: The formatted request body for Bedrock Anthropic Claude.
    """

    messages = []

    # Convert chat history into expected format
    for entry in chat_history:
        if "Human" in entry:
            messages.append({
                "role": "user",
                "content": [{'type': 'text', 'text': entry["Human"]}]
            })
        elif "AI" in entry:
            messages.append({
                "role": "assistant",
                "content": [{'type': 'text', 'text': entry["AI"]}]
            })

    print("Multi turn chat history formatted:", messages)
    return messages


chat_history = [ 
    {"Human": "If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?"},
    {"AI": "If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour."},
    {"Human": "What if I encounter traffic that delays me by 30 minutes?"},
    {"AI": "If you're delayed by 30 minutes due to traffic, you'll reach Berlin by 8:00 PM."},
    {"Human": "Interesting! Now, let's switch topics. What’s the largest planet in the Solar System?"},
    {"AI": "The largest planet in the Solar System is Jupiter."},
    {"Human": "How many moons does it have?"}, 
    {"AI": "Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto."},
    {"Human": "If I were traveling at the speed of light, how long would it take to reach there from Earth?"}, 
    {"AI": "At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach."}
]

formatted_messages = multi_turn_bedrock_request(chat_history)

Sample Output (formatted_messages)
This structured output is sent to the Anthropic Claude API for processing, maintaining the context of the conversation.

[
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "What if I encounter traffic that delays me by 30 minutes?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "If you're delayed by 30 minutes due to traffic, you'll reach Berlin by 8:00 PM."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "Interesting! Now, let's switch topics. What’s the largest planet in the Solar System?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "The largest planet in the Solar System is Jupiter."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "How many moons does it have?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto."}]
    },
    {
        "role": "user",
        "content": [{'type': 'text', 'text': "If I were traveling at the speed of light, how long would it take to reach there from Earth?"}]
    },
    {
        "role": "assistant",
        "content": [{'type': 'text', 'text': "At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach."}]
    }
]


Constructing the Request Payload

Once formatted, combine the chat history with a new user prompt to form a complete request payload. This payload includes metadata like version information and settings.

# Reuse chat_history and formatted_messages from the previous step.

prompt = "What if I reduce my speed by 50%?"  # follow-up question that builds on the chat history

body = {
  "anthropic_version": "bedrock-2023-05-31",
  "system": "You are an AI assistant that remembers past conversations.",
  "messages": formatted_messages + [
      {
          "role": "user",
          "content": [{"type": "text", "text": prompt}]
      }
  ],
  "max_tokens": 256,
  "temperature": 0.01
}

Invoking Anthropic Claude

Invoke Anthropic Claude using AWS Bedrock's runtime client by sending the constructed payload.

import boto3
import json
from botocore.exceptions import ClientError

def invoke_anthropic_claude(body):
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    model_id = "<model-id>"  # Replace with your specific model ID

    try:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(body),
            accept='application/json',
            contentType='application/json'
        )
    except ClientError as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

    model_response = json.loads(response["body"].read())
    response_text = model_response["content"][0]["text"]
    print("Response text:", response_text)

invoke_anthropic_claude(body)

Explanation

Let's break down the provided code to understand its functionality and how it facilitates multi-turn conversations with Anthropic Claude.

  1. Chat History Formatting:

    • The multi_turn_bedrock_request function takes a list of chat history entries.
    • Each entry is a dictionary with either a "Human" key (representing the user) or an "AI" key (representing Claude's response).
    • The function maps "Human" to the role "user" and "AI" to the role "assistant," formatting the content accordingly.
  2. Request Payload Construction:

    • The formatted chat history is combined with a new user prompt.
    • Additional metadata such as the Anthropic version, system prompt, maximum tokens, and temperature settings are included.
    • This structured payload ensures that Claude understands the context and generates appropriate responses.
  3. AWS Bedrock Invocation:

    • A Bedrock Runtime client is created using Boto3, AWS's SDK for Python.
    • The request payload is sent to the specified model ID.
    • The response is decoded, and the generated text is extracted and printed.

Complete Code

import boto3
import json
from botocore.exceptions import ClientError

def multi_turn_bedrock_request(chat_history):
    """
    Convert chat history to the required format for Bedrock Anthropic Claude.

    Parameters:
    chat_history (list): A list of chat history where 'Human' corresponds to 'user' and 'AI' to 'assistant'.

    Returns:
    messages: The formatted request body for Bedrock Anthropic Claude.
    """
    messages = []

    # Convert chat history into the expected format
    for entry in chat_history:
        if "Human" in entry:
            messages.append({
                "role": "user",
                "content": [{'type': 'text', 'text': entry["Human"]}]
            })
        elif "AI" in entry:
            messages.append({
                "role": "assistant",
                "content": [{'type': 'text', 'text': entry["AI"]}]
            })

    print("Multi turn chat history formatted:", messages)
    return messages

# Example chat history
chat_history = [ 
    {"Human": "If I start the trip at 8:00 AM and stop for lunch at noon, what time will I reach Berlin?"},
    {"AI": "If you stop for lunch at noon for 1 hour and then continue driving, you would reach Berlin around 7:30 PM, assuming you maintain a speed of 100 kilometers per hour."},
    {"Human": "What if I encounter traffic that delays me by 30 minutes?"},
    {"AI": "If you're delayed by 30 minutes due to traffic, you'll reach Berlin by 8:00 PM."},
    {"Human": "Interesting! Now, let\'s switch topics. What\'s the largest planet in the Solar System?"},
    {"AI": "The largest planet in the Solar System is Jupiter."},
    {"Human": "How many moons does it have?"}, 
    {"AI": "Jupiter has 95 known moons, with the four largest being the Galilean moons: Io, Europa, Ganymede, and Callisto."},
    {"Human": "If I were traveling at the speed of light, how long would it take to reach there from Earth?"}, 
    {"AI": "At the speed of light, it would take approximately 43.3 minutes to reach Jupiter from Earth when they are at their closest approach."}
]

# Convert chat history to the required format
formatted_messages = multi_turn_bedrock_request(chat_history)

# Define the prompt for the model
prompt = "What if I reduce my speed by 50%?"  # followup question related to chat history

# Define the request payload
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "system": "This is your system prompt",
    "messages": formatted_messages + [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}]
        }
    ],
    "max_tokens": 256,
    "temperature": 0.01
}

# Complete sample code to invoke Anthropic Claude using Python on AWS
def invoke_anthropic_claude():
    # Create a Bedrock Runtime client in the AWS Region of your choice.
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Set the model ID, e.g., Claude 3 Haiku.
    model_id = "anthropic.claude-3-haiku-20240307-v1:0"

    # Print the request body before sending
    print("Request body:", json.dumps(body, indent=2))

    try:
        # Invoke the model with the request.
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(body),
            accept='application/json',
            contentType='application/json'
        )
    except Exception as e:  # catches ClientError and any other failure
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        exit(1)

    # Decode the response body.
    model_response = json.loads(response["body"].read())

    # Extract and print the response text.
    response_text = model_response["content"][0]["text"]
    print("Response text:", response_text)

# Invoke the function
invoke_anthropic_claude()

Expected AI Response

When the user adds a follow-up prompt, such as asking what happens if the travel speed is cut in half, Claude uses the established context to generate a relevant and coherent answer.

{
  "content": [{
      "type": "text",
      "text": "If you reduce your speed to 50% of the speed of light, it would take about 86.6 minutes to travel from Earth to Jupiter at their closest distance."
  }]
}

This response demonstrates Claude's ability to remember the context of the conversation and build on it: halving the speed doubles the 43.3-minute travel time from the earlier turn.

Error Handling and Monitoring


Regardless of which model you use, it’s important to handle errors gracefully. Both Llama 3 and Claude can encounter issues related to token limits or malformed inputs, which can lead to incomplete or incorrect responses. Implement robust error-handling mechanisms in your code to catch these issues early and retry requests when necessary.
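
As a minimal sketch of such a retry loop (the set of retryable error codes shown here is illustrative, not exhaustive):

import json
import time
from botocore.exceptions import ClientError

def invoke_with_retry(client, model_id, body, max_attempts=3):
    """Invoke a Bedrock model, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.invoke_model(modelId=model_id, body=json.dumps(body))
        except ClientError as e:
            code = e.response["Error"]["Code"]
            # Retry throttling/timeout errors; surface everything else immediately.
            if code in ("ThrottlingException", "ModelTimeoutException") and attempt < max_attempts:
                time.sleep(2 ** attempt)  # simple exponential backoff
            else:
                raise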

Additionally, monitor the performance of the model over time. If you notice that responses are becoming less coherent as the conversation grows, you may need to adjust how much history is being passed or experiment with different system prompts.

Conclusion

Effectively managing chat history in AWS Bedrock models, such as Llama 3 and Anthropic Claude, is pivotal for developing sophisticated, context-aware AI systems. Llama 3 employs a straightforward concatenated prompt structure, whereas Claude’s message array format offers enhanced flexibility for handling multi-turn conversations. Understanding these distinctions and adhering to best practices—such as limiting conversation turns, utilizing clear system prompts, and monitoring token usage—enables the creation of more intelligent and efficient conversational applications.

For developers working on customer support chatbots, travel assistants, or any AI-driven applications, mastering chat history management is essential for optimizing both user experience and AI performance. Leveraging the Messages API in Claude v3 facilitates the maintenance of coherent dialogues that remember and build upon previous interactions, thereby ensuring seamless and engaging user interactions.

The provided sample code and enhanced conversation examples serve as a robust foundation for implementing these capabilities. As conversational AI continues to evolve, mastering these techniques will be crucial for developing applications that not only respond accurately but also understand and retain the nuances of human interactions.
