# Complete Guide to RAG Evaluations in Amazon Bedrock

Introduction

In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By grounding LLMs with external knowledge bases, RAG systems can generate more accurate, relevant, and up-to-date responses, mitigating issues like hallucination and outdated information. Amazon Bedrock provides a robust platform for building and deploying RAG applications, offering a suite of foundation models and tools to streamline development.

However, the true power of a RAG system lies not just in its construction, but in its continuous evaluation and refinement. Ensuring that your RAG application consistently delivers high-quality responses requires a systematic approach to assessment. This comprehensive guide will walk you through the process of setting up and conducting RAG evaluations within Amazon Bedrock, focusing on automatic assessment of your knowledge base performance. We will cover everything from initial prerequisites and environment setup to creating evaluation jobs, monitoring key metrics, and interpreting results, empowering you to build and maintain highly effective RAG solutions.

Prerequisites & Environment Setup

Before diving into the RAG evaluation process, it's essential to ensure your environment is correctly configured and you have the necessary prerequisites in place. This section outlines the foundational requirements for a smooth evaluation experience.

Essential Prerequisites

To begin, you will need:

  • An AWS Account: Access to an active Amazon Web Services (AWS) account is fundamental for utilizing Amazon Bedrock and its associated services like S3 and IAM.
  • Basic Knowledge of AWS S3 and IAM Roles: Familiarity with Amazon S3 for data storage and AWS Identity and Access Management (IAM) for managing permissions is crucial. You will be interacting with S3 buckets for storing evaluation datasets and configuring IAM roles for service access.

Environment Configuration

Careful environment setup ensures compatibility and optimal performance for your RAG evaluations:

  • AWS Region Selection: It is recommended to use either the US East (N. Virginia) or US West (Oregon) AWS regions. These regions typically offer the broadest support for the latest Amazon Bedrock features and foundation models. Always verify that your chosen services and models are available in your selected region.
  • Model Selection: For the purpose of this guide, we will primarily use the Amazon Nova Micro v1.0 model. This model is a good starting point for evaluations due to its balance of performance and cost-effectiveness. However, it is imperative to:
    • Verify Regional Support: Confirm that the Amazon Nova Micro v1.0 model (or any other model you choose) is supported in your selected AWS region.
    • Check Pricing: Always review the pricing for your chosen model, as costs can vary based on model type, usage, and region. Understanding the cost implications upfront will help manage your AWS expenditure effectively.

By ensuring these prerequisites are met and your environment is properly configured, you lay the groundwork for a successful and insightful RAG evaluation journey within Amazon Bedrock.
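If you want to confirm model availability from code before proceeding, here is a minimal sketch using boto3's `list_foundation_models` call. The region and the Nova Micro model ID are assumptions; adjust them for your own account.

```python
# A minimal sketch: list Amazon foundation models available in the chosen
# region and confirm the Nova Micro model ID appears. Region and model ID
# are assumptions; adjust to match your account.
import boto3

REGION = "us-east-1"                          # or "us-west-2"
TARGET_MODEL_ID = "amazon.nova-micro-v1:0"    # assumed ID for Amazon Nova Micro

bedrock = boto3.client("bedrock", region_name=REGION)

response = bedrock.list_foundation_models(byProvider="Amazon")
model_ids = [m["modelId"] for m in response["modelSummaries"]]

if TARGET_MODEL_ID in model_ids:
    print(f"{TARGET_MODEL_ID} is available in {REGION}")
else:
    print(f"{TARGET_MODEL_ID} not found in {REGION}; available Amazon models:")
    for model_id in model_ids:
        print(f"  {model_id}")
```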

Step-by-Step Guide to RAG Evaluation

This section provides a detailed, step-by-step walkthrough of how to set up and execute RAG evaluations in Amazon Bedrock. Each step is designed to guide you through the process, from creating your knowledge base to analyzing the evaluation results.

Step 1: Create a Knowledge Base

The foundation of any RAG application is its knowledge base. This knowledge base serves as the external data source that the LLM will retrieve information from. Follow these instructions to set up your knowledge base in Amazon Bedrock:

  1. Navigate to Amazon Bedrock: In the AWS Management Console, search for and select "Amazon Bedrock."
  2. Access Knowledge Bases: From the Bedrock console, go to the left-hand navigation pane and select Knowledge Bases. Then, click on the Create knowledge base button.
  3. Provide Knowledge Base Details:
    • Knowledge base name: Enter a descriptive name for your knowledge base (e.g., myFirstBedrockKB).
    • Description: (Optional) Provide a brief description of your knowledge base.
    • IAM service role: Choose to Create and use a new service role. Make a note of the role name, as it will be useful for future reference and permissions management.
  4. Configure Data Source:
    • Data source: Select S3 as your data source type.
    • Data source location: Specify "This AWS account".
    • S3 URI: Provide the S3 URI for your S3 bucket where your data is stored (e.g., s3://mykbbucket). This bucket should contain the documents that your RAG application will use.
    • Chunking and parsing configurations: For initial setup, you can keep the default settings. These configurations determine how your documents are split and processed for retrieval.
  5. Select Embeddings Model: Choose "Titan Text Embeddings v2" as your embeddings model. This model will convert your documents into vector embeddings, enabling semantic search and retrieval.
  6. Set up Vector Database: For the vector database, select "Quick create a new vector store". You can choose between "Amazon OpenSearch Serverless" and "Amazon S3 Vector Store (In Preview)". The vector database stores the embeddings and facilitates efficient similarity searches.
  7. Create Knowledge Base: Click Next and then Create knowledge base.

Note: The creation of the knowledge base and the associated vector database can take some time. Please be patient during this provisioning process.
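If you prefer to script this wait instead of refreshing the console, here is a minimal sketch that polls the knowledge base status with the boto3 `bedrock-agent` client. The knowledge base ID is a placeholder you copy from the console.

```python
# A minimal sketch: poll the knowledge base status while provisioning completes.
# The knowledge base ID is a placeholder.
import time
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

KB_ID = "YOUR_KB_ID"  # placeholder: copy from the Bedrock console

while True:
    kb = bedrock_agent.get_knowledge_base(knowledgeBaseId=KB_ID)["knowledgeBase"]
    status = kb["status"]
    print(f"Knowledge base status: {status}")
    if status in ("ACTIVE", "FAILED"):
        break
    time.sleep(30)  # provisioning the vector store can take several minutes
```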

Step 2: Sync Data Source

Once your knowledge base is created, you need to synchronize it with your data source to ensure that the latest information is available for retrieval. This step ensures that any updates or new documents in your S3 bucket are ingested and indexed by the knowledge base.

  1. Go to your Knowledge Base: Navigate back to the Amazon Bedrock console and select your newly created Knowledge Base.
  2. Navigate to Data Source Tab: Within your Knowledge Base details, click on the Data source tab.
  3. Select Your Data Source: From the list of data sources, select the one you configured in the previous step.
  4. Initiate Sync: Click on the "Sync" button.

Note: Similar to the creation process, syncing the data source with your Knowledge Base can take some time, especially for large datasets. Monitor the status in the console until the synchronization is complete.
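To script the sync rather than clicking "Sync" in the console, the sketch below starts an ingestion job and polls it until it finishes. The knowledge base and data source IDs are placeholders.

```python
# A minimal sketch: trigger a data source sync (ingestion job) and wait for it
# to finish. Knowledge base and data source IDs are placeholders.
import time
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

KB_ID = "YOUR_KB_ID"
DS_ID = "YOUR_DATA_SOURCE_ID"

job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=KB_ID,
    dataSourceId=DS_ID,
)["ingestionJob"]

while job["status"] not in ("COMPLETE", "FAILED"):
    time.sleep(30)  # large datasets can take a while to index
    job = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=KB_ID,
        dataSourceId=DS_ID,
        ingestionJobId=job["ingestionJobId"],
    )["ingestionJob"]
    print(f"Sync status: {job['status']}")

print(f"Ingestion statistics: {job.get('statistics', {})}")
```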

Step 3: Test Your Knowledge Base

Before proceeding with formal evaluations, it's a good practice to manually test your knowledge base to get a preliminary understanding of its retrieval capabilities. This helps in identifying any immediate issues with data ingestion or relevance.

  1. Navigate to Test Knowledge Base: In the Amazon Bedrock console, go to your Knowledge Base and select the Test Knowledge Base tab.
  2. Select Model: Choose "Amazon Nova Micro" as the model for testing.
  3. Enter Questions: In the chat interface, enter specific questions related to the data you ingested into your knowledge base. For example, if your data contains information about product service intervals, you might ask: "What is the recommended service interval for your product?"
  4. Review Responses: Carefully review the responses provided by the knowledge base. Verify that they are accurate, relevant, and directly supported by your source data. Pay attention to whether the responses correctly retrieve information from your documents.
  5. Iterative Testing: Try different types of questions, including those that require precise factual recall and those that involve more general understanding. This iterative testing helps you gauge the breadth and depth of your knowledge base's retrieval capabilities.

Tip: To ensure the Knowledge Base is retrieving relevant information correctly, ask specific questions that can be directly answered by the content within your data sources.
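If you would rather run this spot check from code, the sketch below calls the Bedrock Agent Runtime `retrieve` API directly and prints the top retrieved chunks with their relevance scores. The knowledge base ID and test question are placeholders.

```python
# A minimal sketch: retrieve the top chunks for a test question and print
# their scores and text snippets. The knowledge base ID is a placeholder.
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

KB_ID = "YOUR_KB_ID"
question = "What is the recommended service interval for your product?"

response = runtime.retrieve(
    knowledgeBaseId=KB_ID,
    retrievalQuery={"text": question},
)

for i, result in enumerate(response["retrievalResults"], start=1):
    snippet = result["content"]["text"][:120]
    score = result.get("score")
    print(f"{i}. score={score} text={snippet!r}")
```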

Step 4: Creating Evaluation Examples

To automatically evaluate your RAG system, you need a dataset of evaluation examples. These examples consist of prompts and their corresponding reference responses, which the evaluation job will use to assess the quality of your knowledge base's retrieval. This process involves creating a batchinput.jsonl file.

  1. Copy an Example for a Single Record: Begin by visiting the official AWS documentation for retrieval-evaluation prompt examples [1]. This documentation defines the JSON structure for evaluation inputs. Copy an input record example and remove any extraneous whitespace so that each record remains valid JSONL. A typical example looks like this:

    {"conversationTurns":[{"prompt":{"content":[{"text":"What is the recommended service interval?"}]},"referenceResponses":[{"content":[{"text":"The recommended service interval is two years."}]}]}]}
    
  2. Create More Examples: Manually creating a large number of diverse evaluation examples can be time-consuming. To expedite this process, you can leverage tools like Amazon Q Developer (Free version) to generate additional samples. Focus on creating a variety of prompts that cover different aspects of your knowledge base content and expected user queries.

  3. Save All Records to batchinput.jsonl: Consolidate all your generated evaluation examples into a single file named batchinput.jsonl. Each line in this file must be a valid JSON object, representing one evaluation example. Ensure the file adheres strictly to the JSONL (JSON Lines) format, where each line is a self-contained JSON object, without commas between objects or an enclosing array.

    Note: It is crucial that your batchinput.jsonl file is correctly formatted. You can use online JSON formatters and validators like jsonformatter.org or jsonlint.com to verify its integrity before proceeding.
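If you would rather generate the file programmatically than by hand, the following minimal sketch writes records in exactly this one-object-per-line format. The prompts and reference responses shown are illustrative placeholders.

```python
# A minimal sketch that writes evaluation examples in JSONL format: one JSON
# object per line, no commas between lines, no enclosing array.
import json

examples = [
    ("What is the recommended service interval?",
     "The recommended service interval is two years."),
    # add more (prompt, reference response) pairs here
]

with open("batchinput.jsonl", "w", encoding="utf-8") as f:
    for prompt, reference in examples:
        record = {
            "conversationTurns": [
                {
                    "prompt": {"content": [{"text": prompt}]},
                    "referenceResponses": [{"content": [{"text": reference}]}],
                }
            ]
        }
        f.write(json.dumps(record) + "\n")  # one compact JSON object per line
```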

Step 5: Upload the File to S3

With your batchinput.jsonl file prepared, the next step is to upload it to an Amazon S3 bucket. This S3 location will serve as the input for your RAG evaluation job in Amazon Bedrock.

  1. Prepare Your batchinput.jsonl File: Ensure your file contains all the evaluation examples and is correctly formatted as JSONL, as detailed in the previous step.
  2. Navigate to the AWS S3 Console: In the AWS Management Console, search for and select "S3."
  3. Select Your S3 Bucket: Locate and select the S3 bucket you intend to use for storing your evaluation input (e.g., mybatchinferenceinput). If you don't have a dedicated bucket, you may need to create one.
  4. Initiate Upload: Click on the "Upload" button.
  5. Select Your File: Drag and drop your batchinput.jsonl file into the upload area, or use the "Add files" button to browse and select it from your local machine.
  6. Review and Confirm: Review the upload settings. For evaluation input files, default settings are usually sufficient, but ensure public access is not inadvertently granted if your data is sensitive.
  7. Complete Upload: Click "Upload" to finalize the process.

Important: Double-check that your batchinput.jsonl file is in the correct JSONL format with no extra spaces or malformed JSON objects. Incorrect formatting can lead to errors during the evaluation job processing.
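The same upload can be scripted with boto3, as in the sketch below. The bucket name reuses the example from this guide and must already exist in your account.

```python
# A minimal sketch: upload the JSONL evaluation input to S3 with boto3.
import boto3

s3 = boto3.client("s3")

BUCKET = "mybatchinferenceinput"  # example bucket name from this guide
s3.upload_file(
    Filename="batchinput.jsonl",
    Bucket=BUCKET,
    Key="batchinput.jsonl",
)
print(f"Uploaded to s3://{BUCKET}/batchinput.jsonl")
```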

Step 6: Create an Evaluation Job

Now that your evaluation examples are ready and uploaded to S3, you can create an evaluation job in Amazon Bedrock to automatically assess your knowledge base.

  1. Navigate to Amazon Bedrock Evaluations: In the AWS Management Console, go to Amazon Bedrock. In the left-hand navigation pane, select Inference and Assessment, then Evaluations, and finally RAG.
  2. Create New Evaluation: Click on the "Create" button to start configuring a new evaluation job.
  3. Provide Evaluation Details:
    • Evaluation name: Enter a unique and descriptive name for your evaluation job.
    • Description: (Optional) Provide a brief description of the evaluation.
  4. Select Evaluator Model: Choose "Amazon Nova Micro v1.0" as the evaluator model. This model will be used to automatically score the responses generated by your knowledge base against the reference responses you provided.
  5. Specify Source: Select "Bedrock Knowledge Base" as the source for the evaluation. Then, choose your specific Knowledge Base (e.g., myFirstBedrockKB) from the dropdown list.
  6. Define Evaluation Type and Metrics:
    • Evaluation type: Select "Retrieval only". This focuses the evaluation on the quality of the information retrieved by your knowledge base.
    • Metrics: Under the Metrics section, select "Context relevance" and "Context coverage". These are crucial metrics for assessing how well the retrieved context aligns with the prompt and how comprehensively it covers the necessary information.
  7. Configure Input and Output Locations:
    • Input: Specify the S3 location of your batchinput.jsonl file (e.g., s3://mybatchinferenceinput/batchinput.jsonl).
    • Output: Choose an S3 output bucket and prefix where the evaluation results will be stored (e.g., s3://mymodelevaloutput/output).
  8. Set up Service Role: Create and use a new service role for this evaluation job. This role grants Bedrock the necessary permissions to access your S3 buckets and run the evaluation. Remember to note down the role's name for future reference.
  9. Initiate Evaluation: Review all your settings and click "Create evaluation job".

Once the evaluation job is created, Amazon Bedrock will begin processing your evaluation examples, using the chosen evaluator model to score the retrieval performance of your knowledge base. You can monitor the job's status in the Bedrock console.
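If you want to check the job from code rather than the console, the sketch below uses the boto3 `bedrock` client's `get_evaluation_job` call. The job ARN is a placeholder, and the exact response fields may vary by SDK version, so treat this as a starting point rather than a definitive implementation.

```python
# A hedged sketch: look up an evaluation job's status by its ARN.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

JOB_ARN = "YOUR_EVALUATION_JOB_ARN"  # shown in the console after creation

job = bedrock.get_evaluation_job(jobIdentifier=JOB_ARN)
print(f"Name:   {job['jobName']}")
print(f"Status: {job['status']}")  # e.g. InProgress, Completed, Failed
```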

Monitoring and Detailed Analysis

After setting up and running your RAG evaluation job, monitoring its performance and conducting detailed analysis of the results are crucial steps. This allows you to gain insights into the efficiency and effectiveness of your knowledge base. The following steps, illustrated in the provided flowchart, guide you through this process.

Step 1: Prerequisites for Monitoring

Before you can effectively monitor and analyze your RAG evaluations, ensure you have the following foundational elements in place:

  • Amazon Bedrock and Bedrock Knowledge Base: Your RAG application, including the Amazon Bedrock service and your configured Knowledge Base, must be operational.
  • Prompt Dataset in S3: The batchinput.jsonl file containing your evaluation prompts and reference responses should be stored in an accessible S3 bucket, as this is the input for your evaluation jobs.

Step 2: Enable Logging

To capture the necessary metrics and logs for monitoring and detailed analysis, you must enable logging for your Bedrock evaluations. This ensures that invocation details and other critical information are recorded.

  1. Navigate to Bedrock Evaluations: Go to the Amazon Bedrock console, then Inference and Assessment, and select Evaluations.
  2. Enable Model Invocation Logging: Within the evaluation settings, ensure that Model Invocation Logging is enabled. This setting directs Bedrock to send invocation data to a logging service.
  3. Choose S3/CloudWatch Logs: Configure where these logs should be stored. You can choose to send them to Amazon S3 for long-term storage and batch analysis, or to Amazon CloudWatch Logs for real-time monitoring and querying.
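As a hedged alternative to the console toggle, the sketch below calls `put_model_invocation_logging_configuration` via boto3. The log group, role ARN, and bucket names are placeholders, and the exact `loggingConfig` structure should be confirmed against the current API reference before use.

```python
# A hedged sketch: enable model invocation logging to CloudWatch Logs and S3.
# All names below are placeholders; verify the loggingConfig shape against the
# current Bedrock API reference.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/aws/bedrock/modelinvocations",  # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
        },
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",  # placeholder
            "keyPrefix": "invocation-logs/",
        },
        "textDataDeliveryEnabled": True,
    }
)
```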

Step 3: Create Evaluation (Recap)

As previously detailed, the creation of the evaluation job is where you define what to evaluate and how. This step is a prerequisite for the monitoring phase.

  1. Go to Bedrock Evaluations: Access the Evaluations section in Amazon Bedrock.
  2. Create Knowledge Base Evaluation Job: Initiate the creation of a new evaluation job, specifying it as a Knowledge Base evaluation.
  3. Configure Job Settings: Define the evaluation name, description, and select the evaluator model (e.g., Amazon Nova Micro v1.0).
  4. Specify Prompt Dataset & S3 Output: Point to your batchinput.jsonl file in S3 as the input and define an S3 bucket for storing the evaluation output.
  5. Click Create Evaluation Job: Launch the evaluation process.

Step 4: Monitor with CloudWatch

Amazon CloudWatch provides powerful tools for monitoring your Bedrock evaluations in real-time. You can use CloudWatch dashboards to visualize key performance indicators.

  1. Open CloudWatch Console: In the AWS Management Console, search for and select "CloudWatch."
  2. Go to Automatic Dashboards: In the CloudWatch console, navigate to the Dashboards section and look for automatically generated dashboards.
  3. Select Bedrock Dashboard: Choose the dashboard specifically created for Amazon Bedrock. This dashboard typically provides pre-configured widgets for common Bedrock metrics.
  4. View InvocationLatency Metrics: Within the Bedrock dashboard, focus on metrics such as InvocationLatency. This metric indicates the total response time of your knowledge base, which is critical for user experience.
  5. Filter by Model ID: To narrow down your analysis, you can filter the metrics by Model ID. This allows you to observe the performance of specific models used in your RAG evaluations.
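To pull the same numbers programmatically, the sketch below queries the AWS/Bedrock CloudWatch namespace for average InvocationLatency, filtered by ModelId. The model ID and region are assumptions.

```python
# A minimal sketch: average InvocationLatency (in milliseconds) for one model
# over the last 24 hours, one datapoint per hour.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-micro-v1:0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "ms")
```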

Step 5: Detailed Analysis with CloudWatch Logs Insights

For a deeper dive into individual evaluation runs and to troubleshoot specific issues, CloudWatch Logs Insights offers a powerful query language to analyze your raw logs.

  1. Go to CloudWatch Logs Insights: In the CloudWatch console, navigate to Logs and then select Logs Insights.
  2. Query for Individual Invocation Metrics: Use the Logs Insights query editor to write custom queries that extract specific information from your Bedrock invocation logs. You can query for details related to individual prompts, responses, and the metrics computed by the evaluator model; a sample query sketch follows this list.
  3. Analyze Raw Logs from S3: If you configured your logs to be stored in S3, you can also directly access and analyze these raw log files using tools like Amazon Athena or other data processing services for more complex, large-scale analysis.
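As referenced above, here is a hedged sketch of running such a query with boto3. The log group name and the field names in the query (modelId, input.inputTokenCount, output.outputTokenCount) are assumptions based on the Bedrock invocation log schema; verify them against your own log events.

```python
# A hedged sketch: run a Logs Insights query over Bedrock invocation logs and
# print the most recent results. Log group and field names are assumptions.
import time
from datetime import datetime, timedelta, timezone
import boto3

logs = boto3.client("logs", region_name="us-east-1")

QUERY = """
fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount
| sort @timestamp desc
| limit 20
"""

end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

query_id = logs.start_query(
    logGroupName="/aws/bedrock/modelinvocations",  # placeholder log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=QUERY,
)["queryId"]

while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```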

By following these monitoring and analysis steps, you can continuously track the performance of your RAG system, identify areas for improvement, and ensure your knowledge base is delivering optimal results.

Key Performance Metrics

Understanding the performance of your RAG system involves analyzing several key metrics that provide insights into its efficiency and effectiveness. These metrics are crucial for identifying bottlenecks, optimizing costs, and ensuring a high-quality user experience. The primary metrics to focus on include:

  • InvocationLatency: This metric represents the total response time of your RAG system. It measures the duration from when a request is made to the knowledge base until a response is fully generated. Lower invocation latency indicates a more responsive system, which is vital for interactive applications. High latency can point to issues with network connectivity, model inference speed, or knowledge base retrieval efficiency.

  • InputTokenCount: This metric tracks the number of tokens in the input provided to the LLM. In a RAG context, this typically includes the user's query and the retrieved context from the knowledge base. Monitoring input token count helps in understanding the complexity of the prompts being processed and has direct implications for cost, as most LLM providers charge based on token usage.

  • OutputTokenCount: This metric measures the number of tokens in the output generated by the LLM. It reflects the length and verbosity of the responses. Similar to input tokens, output token count is a significant factor in determining the operational cost of your RAG application. Optimizing the conciseness and relevance of responses can help manage this cost.

  • Invocations: This metric quantifies the number of successful requests made to the InvokeModel and InvokeModelWithResponseStream API operations. It provides a direct measure of the usage volume of your RAG system. Tracking invocations helps in capacity planning, understanding demand patterns, and correlating usage with overall system performance and cost.

By regularly monitoring and analyzing these key performance metrics, you can gain a comprehensive understanding of your RAG system's behavior, identify areas for optimization, and make data-driven decisions to improve its efficiency and user satisfaction.

Cost Considerations

When deploying and evaluating RAG systems on Amazon Bedrock, understanding the cost implications of different models is paramount. Pricing for LLMs is typically based on token usage, with separate rates for input and output tokens. The choice of model can significantly impact your operational expenses. Below is a table summarizing the pricing for various models available in Amazon Bedrock, based on the provided data:

| Model Provider | Model Name | Input Price (per 1K tokens) | Output Price (per 1K tokens) | Region |
| --- | --- | --- | --- | --- |
| Amazon | Nova Micro | $0.000035 | $0.000140 | us-east-1 |
| Amazon | Nova Lite | $0.000060 | $0.000240 | us-east-1 |
| Amazon | Nova Pro | $0.000800 | $0.003200 | us-east-1 |
| Anthropic | Claude 4.0 Sonnet | $0.003000 | $0.015000 | us-east-1 |
| Meta | Llama 3 70B | $0.000720 | $0.000720 | us-east-1 |

As you can observe from the table, there is a considerable variation in pricing across different models. For instance, Amazon Nova Micro offers a very cost-effective option for both input and output tokens, making it suitable for initial evaluations and applications where cost efficiency is a primary concern. In contrast, models like Anthropic Claude 4.0 Sonnet, while potentially offering advanced capabilities, come with a significantly higher price point.

When selecting a model for your RAG application and its evaluations, it is crucial to balance performance requirements with budgetary constraints. Consider the following:

  • Evaluation Frequency: Frequent evaluations will incur costs based on the number of tokens processed. Opting for more cost-effective models for evaluation jobs can help manage expenses.
  • Production Workloads: For production deployments, assess the expected volume of input and output tokens to project monthly costs. A small difference in per-token pricing can accumulate into substantial costs at scale.
  • Model Performance vs. Cost: While cheaper models might seem attractive, ensure they meet your performance benchmarks for accuracy, relevance, and latency. Sometimes, investing in a slightly more expensive model that delivers superior results can lead to better overall ROI.

By carefully analyzing these cost factors alongside performance metrics, you can make informed decisions about model selection and optimize the financial efficiency of your RAG solutions on Amazon Bedrock.
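To make the scale effect concrete, the short calculation below projects monthly cost for a hypothetical workload using the per-1K-token prices from the table above. The request volume and token counts are illustrative assumptions, not measurements.

```python
# A worked example: projected monthly cost for a hypothetical workload of
# 100,000 requests, averaging 2,000 input tokens (query + retrieved context)
# and 300 output tokens per request, at the us-east-1 prices listed above.
requests_per_month = 100_000
input_tokens_per_request = 2_000
output_tokens_per_request = 300

prices = {  # (input $/1K tokens, output $/1K tokens)
    "Nova Micro": (0.000035, 0.000140),
    "Nova Pro": (0.000800, 0.003200),
    "Claude 4.0 Sonnet": (0.003000, 0.015000),
}

for model, (in_price, out_price) in prices.items():
    cost = requests_per_month * (
        input_tokens_per_request / 1000 * in_price
        + output_tokens_per_request / 1000 * out_price
    )
    print(f"{model}: ~${cost:,.2f} per month")
```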

Evaluation Results Interpretation

Interpreting the results of your RAG evaluations is key to understanding the strengths and weaknesses of different models and optimizing your knowledge base. The evaluation results spreadsheet (see the end of this post) offers a comprehensive comparison across several models, highlighting various performance and quality metrics. Let's break down how to interpret such a detailed evaluation.

Overview of Models and Performance Metrics

The evaluation typically compares several models, such as Nova Micro, Nova Lite, Nova Pro, Claude 4.0 Sonnet, and Llama 3 70B. For each model, several performance metrics are usually captured:

  • Input: This likely refers to the total number of input tokens processed during the evaluation run. A higher number indicates more extensive testing or longer prompts/contexts.
  • Throughput: This metric measures the processing speed, often expressed as tokens per second or invocations per second. Higher throughput indicates a more efficient model capable of handling a larger volume of requests in a given time frame.
  • Cost: This is a critical metric, often broken down into cost per second, cost per invocation, and cost per 1K tokens. As discussed in the previous section, these figures directly reflect the financial implications of using each model for your RAG system. Lower costs are generally desirable, provided the quality remains acceptable.

Quality Metrics

The core of RAG evaluation lies in assessing the quality of the generated responses. The spreadsheet categorizes quality into several dimensions, each contributing to a holistic view of model performance:

  • Correctness: Measures whether the generated response is factually accurate and free from errors. This is paramount for RAG systems, as their purpose is to provide grounded information.
  • Completeness: Assesses if the response addresses all aspects of the user's query and provides sufficient information. An incomplete response, even if correct, may not be helpful.
  • Helpfulness: Evaluates how useful and actionable the response is to the user. A helpful response goes beyond mere correctness to provide practical value.
  • Coherence: Determines if the response is logically structured, easy to understand, and flows naturally. A coherent response enhances user experience.
  • Harmfulness: Identifies if the response contains any toxic, biased, or otherwise inappropriate content. This is a crucial safety metric for all LLM applications.
  • Groundedness: This is particularly important for RAG systems. It verifies that all information presented in the response can be directly traced back to the provided source documents (the knowledge base). A high groundedness score indicates that the LLM is effectively utilizing the retrieved context and not hallucinating information.

Each of these quality metrics is typically scored, often on a scale (e.g., 0 to 1, or 0 to 5), with higher scores indicating better performance in that specific dimension.

Weighted Composite Score and Final Ranking

To provide an overall assessment, a Weighted Composite Score is often calculated. This score combines the individual quality metrics, allowing you to assign different weights based on the importance of each metric to your specific application. For example, if correctness and groundedness are more critical for your use case, they would receive higher weights. The formula for this composite score is usually defined within the evaluation setup.

Finally, a Final Ranking Calculation provides an ordered list of models based on their overall performance, considering both quantitative metrics (like latency and cost) and qualitative metrics (like correctness and groundedness). This ranking helps in making informed decisions about which model is best suited for your RAG application, balancing performance, quality, and cost.
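To illustrate how such a composite score and ranking might be computed, here is a small sketch. The weights and per-model scores are hypothetical placeholders, not outputs of an actual evaluation run.

```python
# An illustrative sketch of a weighted composite score and ranking.
# Weights and per-model quality scores are hypothetical placeholders.
weights = {
    "correctness": 0.30,
    "completeness": 0.15,
    "helpfulness": 0.15,
    "coherence": 0.10,
    "groundedness": 0.30,
}

scores = {  # hypothetical 0-1 scores per quality dimension
    "Nova Micro": {"correctness": 0.82, "completeness": 0.78, "helpfulness": 0.80,
                   "coherence": 0.88, "groundedness": 0.85},
    "Nova Pro":   {"correctness": 0.90, "completeness": 0.86, "helpfulness": 0.88,
                   "coherence": 0.91, "groundedness": 0.89},
}

composite = {
    model: sum(weights[metric] * s[metric] for metric in weights)
    for model, s in scores.items()
}

for rank, (model, score) in enumerate(
    sorted(composite.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank}. {model}: {score:.3f}")
```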

By meticulously analyzing these metrics, you can identify which models excel in certain areas, pinpoint specific weaknesses, and iteratively refine your knowledge base, prompt engineering, or even the underlying LLM choice to achieve optimal RAG performance.

Important Notes & Reminders

As you embark on your RAG evaluation journey in Amazon Bedrock, keep the following important notes and reminders in mind to ensure efficient resource management, security, and best practices.

Resource Cleanup

AWS services, especially those involving machine learning models and data storage, can incur significant costs if left running unnecessarily. It is highly recommended that you diligently delete or release all resources after completing your lab work or evaluations. This includes:

  • Knowledge Base: The Amazon Bedrock Knowledge Base itself.
  • Vector Database: The underlying vector store, whether it's Amazon OpenSearch Serverless or Amazon S3 Vector Store.
  • S3 Buckets: Any S3 buckets you created or used for storing source data, evaluation inputs, or outputs.
  • IAM Roles: The IAM roles created for the Knowledge Base and evaluation jobs.

Failing to clean up resources can lead to unexpected charges on your AWS bill. Always verify that all associated resources have been terminated or deleted.
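If you prefer to script part of the cleanup, the hedged sketch below deletes the data source, the knowledge base, and an example S3 bucket with boto3. All IDs and bucket names are placeholders, and the vector store collection and IAM roles still need to be removed separately from their own consoles or APIs.

```python
# A hedged cleanup sketch for resources created in this guide. IDs and bucket
# names are placeholders; the vector store and IAM roles are not covered here.
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")
s3 = boto3.resource("s3")

KB_ID = "YOUR_KB_ID"
DS_ID = "YOUR_DATA_SOURCE_ID"

# Delete the data source, then the knowledge base itself.
bedrock_agent.delete_data_source(knowledgeBaseId=KB_ID, dataSourceId=DS_ID)
bedrock_agent.delete_knowledge_base(knowledgeBaseId=KB_ID)

# Empty and delete the evaluation input bucket (irreversible).
bucket = s3.Bucket("mybatchinferenceinput")
bucket.objects.all().delete()
bucket.delete()
```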

Cross-Origin Resource Sharing (CORS)

When developing web applications that interact with Amazon Bedrock, you might encounter issues related to Cross-Origin Resource Sharing (CORS). CORS is a security feature implemented by web browsers that restricts web pages from making requests to a different domain than the one that served the web page. If your frontend application is hosted on a different domain than your Bedrock API endpoints, you will need to configure CORS policies.

For detailed information on how to configure CORS with Amazon Bedrock, please refer to the official AWS documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/cors.html

JSON Formatters

Throughout the process of creating evaluation examples, you will be working with JSONL files. Ensuring that your JSON objects are correctly formatted is crucial for the evaluation jobs to run successfully. Malformed JSON can lead to errors and failed evaluations.

Several online tools can help you validate and format your JSON content. Some popular options include:

  • jsonformatter.org
  • jsonlint.com

These tools can help you quickly identify syntax errors, pretty-print your JSON for readability, and ensure compliance with the JSON standard.
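If you prefer a local check over an online validator, the following minimal sketch verifies that every line of batchinput.jsonl parses as a standalone JSON object.

```python
# A minimal local JSONL check: every non-empty line must parse on its own.
import json

with open("batchinput.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"Line {line_number} is not valid JSON: {err}")
            break
    else:
        print("All lines parsed successfully.")
```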

By adhering to these important notes and reminders, you can maintain a secure, cost-effective, and efficient environment for your RAG evaluations in Amazon Bedrock.

Conclusion

Evaluating Retrieval Augmented Generation (RAG) systems is not merely a best practice; it is a critical component for ensuring the reliability, accuracy, and cost-effectiveness of your AI applications. This guide has provided a comprehensive walkthrough of how to leverage Amazon Bedrock's evaluation capabilities to automatically assess your knowledge base performance. From setting up your environment and creating evaluation examples to monitoring key metrics and interpreting detailed results, you now have the knowledge to systematically enhance your RAG solutions.

By diligently following these steps, you can:

  • Improve Response Quality: Continuously refine your knowledge base and model choices to deliver more accurate, complete, and helpful responses.
  • Optimize Costs: Make informed decisions about model selection based on performance and pricing, ensuring your RAG system operates efficiently within budget.
  • Enhance User Experience: Reduce latency and improve the relevance of information, leading to a more satisfying experience for end-users.
  • Maintain System Health: Proactively identify and address issues through continuous monitoring and detailed analysis of performance metrics.

We encourage you to implement these practices within your own AWS environment. The journey of building robust AI applications is iterative, and effective evaluation is the compass that guides you toward excellence. Start evaluating your RAG systems today to unlock their full potential and deliver truly intelligent solutions.

Understanding Retrieval Augmented Generation (RAG)

To better understand the evaluation process, it's helpful to visualize the core components of a RAG system. The diagram below illustrates the typical flow:

RAG Concept Diagram

In this flow:

  • User Query: The user initiates a request or question.
  • Retrieval: The RAG system queries a Knowledge Base (an external data source) to retrieve relevant information based on the user's query.
  • Generation: The retrieved information is then passed to a Large Language Model (LLM), which uses this context to generate a comprehensive and grounded response.
  • Response: The final generated response is presented to the user.

This process ensures that the LLM's output is informed by up-to-date and specific data, making evaluations of both the retrieval and generation components critical.
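For readers who want to see this end-to-end flow in code, the sketch below uses the Bedrock Agent Runtime `retrieve_and_generate` API: the user query is sent, relevant chunks are retrieved from the knowledge base, and the chosen model generates a grounded response. The knowledge base ID and model ARN are placeholders.

```python
# A minimal sketch of the retrieval-then-generation flow against a Bedrock
# knowledge base. The knowledge base ID and model ARN are placeholders.
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "What is the recommended service interval?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-micro-v1:0",
        },
    },
)

print(response["output"]["text"])           # the grounded response
for citation in response.get("citations", []):
    print(citation)                          # retrieved passages backing it
```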

RAG Evaluation Workflow Overview

The entire process of setting up, running, and analyzing RAG evaluations in Amazon Bedrock can be visualized as a clear workflow. The following flowchart provides a high-level overview of the steps involved, from initial prerequisites to detailed analysis:

RAG Evaluation Flowchart

This visual guide helps in understanding the sequence of operations and the interdependencies between different stages of the evaluation process.

Visualizing Key Performance Metrics

To further clarify the key performance metrics discussed, the following diagram illustrates their relationships and what they represent:

Key Performance Metrics Diagram

These metrics provide a quantitative foundation for assessing the efficiency and responsiveness of your RAG system.

Visualizing Model Pricing

To provide a clear overview of the cost differences, the following image illustrates the pricing structure for various models:

Amazon Bedrock Model Pricing

This visual representation emphasizes the importance of cost-conscious model selection.

Detailed Evaluation Results

The following image provides a detailed breakdown of evaluation results across different models, showcasing various performance and quality metrics:

RAG Evaluation Results Spreadsheet

This spreadsheet is instrumental in conducting a thorough comparative analysis of model performance.
