Eliana Lam for AWS Community On Air

Rethinking GenAI Agent: RAG & MCP

Speakers: Carson Chan & Angelo Mao @ AWS Community Day Hong Kong 2025

Summary by Amazon Nova

https://www.youtube.com/watch?v=EzXVpznMRC8



RAG and MCP in Agentic Architectures:

Focus of the Session:

  • Discussion of Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), two foundational techniques in the agentic world.

  • Emphasis on what matters in agentic architectures and how to boost agent performance.

Key Points:

  • Agent Orchestration: Agents are orchestrated by models, making the model crucial.

  • Model Evolution: The evolution of logical thinking and capabilities in models is primarily driven by model providers unless custom training is undertaken.

  • External System Connectivity: Agents' ability to connect with external systems and perform accurately is vital.

Retrieval-Augmented Generation (RAG):

  • Definition: RAG allows Large Language Models (LLMs) to access custom knowledge bases without retraining.

  • Benefits: Reduces the computational power and cost associated with model fine-tuning or retraining.

  • Use Case: Ideal for complex workloads where prompt engineering with few-shot or one-shot learning may be insufficient.

Goals:

  • Enhance the real-world performance of agents.

  • Improve how RAG performs within agentic architectures.

Architecture Overview:

  • The session will delve into the architecture of agentic systems, focusing on how RAG and MCP contribute to overall performance.

Detailed Breakdown of the Retrieval-Augmented Generation (RAG) Architecture:

Steps to Implement RAG:

  1. Document Preparation:

Start with your own documents (PDFs, media, videos).

  2. Chunking:

Segment each document into chunks (meaningful segments of text).

  3. Embedding:

Use an embedding model to convert chunks into numeric representations (vectors).

  4. Vector Storage:

Store the vectors in a database (relational, or a native vector store like Neo4j or Pinecone).

Query Process:

  • When a user asks a question, the question is converted into a vector.

  • The vector database performs a similarity search to find the most relevant chunks.

  • The retrieved chunks are returned to the Large Language Model (LLM), which generates a customized response (a minimal end-to-end sketch follows below).
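To make the flow concrete, here is a minimal, self-contained sketch of the whole loop in Python. It is not the speakers' implementation: the `embed` function is a toy stand-in for a real embedding model, and the chunks, top-k value, and prompt template are illustrative assumptions.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash words into a fixed-size vector.
    In practice, call a real embedding model (e.g., via Amazon Bedrock)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Steps 1-2: prepare and chunk documents (here: already-chunked strings)
chunks = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday to Friday, 9am-6pm HKT.",
    "Premium plans include priority support.",
]

# Steps 3-4: embed each chunk and store the vectors (an in-memory "vector DB")
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query: embed the question, run a similarity search, build the prompt
question = "When can I get a refund?"
q_vec = embed(question)
ranked = sorted(index, key=lambda pair: float(q_vec @ pair[1]), reverse=True)
context = "\n".join(chunk for chunk, _ in ranked[:2])  # top-k retrieval

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would be sent to the LLM for the final answer
```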

Key Elements for Boosting RA Performance:

Chunking Strategy:

  • Importance of chunking:

  • Early-stage LLMs had limited context windows.

  • Chunking prevents the "lost in the middle" phenomenon, where information buried in the middle of a long context is missed.

  • Key considerations:

  • Chunk size: Largely a judgment call; there is no single correct value.

  • Chunking logic: How to segment text meaningfully.

Embedding Model Choice:

  • Selecting the right embedding model is crucial for accurate vector representations.

Vector Database Selection:

  • Different vector databases use different algorithms and libraries.

  • Choosing the right database impacts performance and accuracy.

Why Chunking is Important:

  • Context Window Limitation: Early LLMs couldn't handle large inputs; chunking allows for manageable segments.

  • Avoiding "Lost in the Middle": Ensures important information buried mid-context isn't missed during processing.

  • Chunk Size and Logic: Subjective choices that impact the effectiveness of chunking.



Further Details on Chunking Strategies and Impact:

Impact of Chunk Size:

  • Chunk size affects response time and performance metrics like faithfulness and relevancy.

  • LlamaIndex's published experiments illustrate the relationship between chunk size and these metrics.

Chunking Strategies on AWS:

Fixed Size Chunking:

  • Regular, syntactic division of text into fixed-size chunks (sketched in code below).

Semantic Chunking:

  • Uses Natural Language Processing (NLP) to understand and chunk text in a meaningful way.

Hierarchical Chunking:

  • Involves parent-child relationships, often used for documents with hierarchical structures.

Key Considerations for Chunking:

  • Meaningful Chunks: Ensure that chunks are semantically meaningful, not just syntactically divided.

  • Understanding Text: The goal is to understand the text to determine if a chunk is meaningful.

  • No-Chunk Strategy: If the dataset already fits the model's context, or the model is already well trained on it, chunking may be unnecessary.
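As a concrete reference, here is a minimal sketch of the two simplest strategies above: fixed-size chunking with overlap, and a naive paragraph-based stand-in for semantic chunking. The chunk size and overlap values are illustrative, not recommendations from the talk.

```python
def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Fixed-size chunking: slide a window of chunk_size characters, overlapping
    by `overlap` so content cut at a boundary also appears in the next chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def paragraph_chunks(text: str) -> list[str]:
    """Naive 'semantic' chunking: split on blank lines so each chunk is a
    self-contained paragraph. Real semantic chunking (as offered on AWS) uses
    NLP or embeddings to find meaningful topic boundaries."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

sample = "First topic paragraph.\n\nSecond topic paragraph.\n\nThird paragraph."
print(len(fixed_size_chunks(sample * 20)), len(paragraph_chunks(sample)))
```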

Summary:

  • Chunking is not just about size but about understanding and segmenting text meaningfully.

  • AWS offers various chunking strategies, including fixed size, semantic, and hierarchical chunking.

  • The choice of chunking strategy depends on the nature of the text and the desired performance outcomes.



Detailed Explanation of Embedding Models and Similarity Metrics:

Embedding Models:

  • Embedding models convert text into numerical representations (vectors).

  • Example: three sentences (A: "I like running", B: "My favorite sport is running", C: "Today's weather is good") are each embedded into a vector such as [0.8, 0.6, 0.11].

  • The meaning of these vectors is determined by the embedding model.

Similarity Metrics:

  • Euclidean Distance: Measures the straight-line distance between two vectors. Smaller distances indicate higher similarity.

  • Cosine Similarity: Measures the angle between two vectors. Values closer to 1 indicate higher similarity.

Example:

  • In the ABC example, A and B come out as more similar under both Euclidean distance and cosine similarity (illustrated in the sketch below).
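Both metrics are easy to compute directly. In this sketch the three vectors are invented for illustration; only their relative positions matter.

```python
import numpy as np

# Toy vectors standing in for the embedded sentences A, B, C
A = np.array([0.8, 0.6, 0.11])   # "I like running"
B = np.array([0.7, 0.65, 0.15])  # "My favorite sport is running"
C = np.array([0.1, 0.2, 0.9])    # "Today's weather is good"

def euclidean(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.linalg.norm(u - v))  # smaller = more similar

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))  # near 1 = similar

for name, v in [("B", B), ("C", C)]:
    print(f"A vs {name}: euclidean={euclidean(A, v):.3f}, cosine={cosine(A, v):.3f}")
# A vs B scores as far more similar than A vs C on both metrics
```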

Selecting the Right Embedding Model:

  • [ 1 ] Use Case and Language:

  • Determine the specific use case and language requirements (e.g., multilingual support).

  • [ 2 ] Token Size or Window Size:

  • Ensure the model can handle the size of your text materials.

  • [ 3 ] Dimensionality:

  • Higher dimensionality can capture more semantic meanings but may require more memory.

  • Trade-off between capturing information and memory cost.

AWS Embedding Models:

  • Various embedding models are available on AWS, including the recently launched Nova embedding model.

  • Testing results and understanding each model's capabilities are crucial for selection (a sample invocation follows below).
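As an example of putting these criteria into practice, here is a hedged sketch of invoking an embedding model on Amazon Bedrock with boto3. It assumes the Titan Text Embeddings V2 model ID and request schema; the Nova embedding model mentioned in the talk would use its own model ID and schema, so check the current Bedrock documentation.

```python
import json
import boto3

# Bedrock runtime client (assumes AWS credentials and a Bedrock-enabled region)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Titan Text Embeddings V2 lets you pick the output dimensionality --
# the memory-versus-fidelity trade-off discussed above.
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",  # assumed model ID; verify in your region
    body=json.dumps({
        "inputText": "I like running",
        "dimensions": 512,   # 256 / 512 / 1024 are supported by this model
        "normalize": True,   # unit-length vectors simplify cosine similarity
    }),
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 512
```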

Summary:

  • Embedding models convert text into numerical vectors, enabling similarity searches.

  • Similarity is measured using Euclidean distance and cosine similarity.

  • Selecting the right embedding model involves considering use case, token size, and dimensionality.



Summary of Vector Databases and Searching Algorithms:

Vector Databases:

  • After converting text into numerical representations (vectors), these need to be stored in a vector database.

  • AWS offers various database options, alongside open-source and serverless solutions such as Milvus or Pinecone.

  • OpenSearch is highlighted as an example for understanding searching methods.

Searching Methods:

  • [ 1 ] Hierarchical Navigable Small World (HNSW):

  • Builds a layered graph over the vectors so a query can navigate from coarse to fine layers, giving faster searches.

  • Drawback: Higher memory consumption.

  • [ 2 ] Inverted File Index (IVF):

  • Clusters vectors into buckets so a query searches only the nearest buckets rather than the whole index.

  • Advantage: Lower memory usage compared to HNSW.

Choosing the Right Searching Method:

  • Consider query latency, query quality, memory usage, and indexing.

  • OpenSearch recommends considering these factors when choosing a method.

  • Libraries like FAISS implement both HNSW and IVF and can combine techniques for optimal performance (see the index-creation sketch below).
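To show where these choices surface in practice, here is a sketch of creating an OpenSearch k-NN index with an HNSW method via the opensearch-py client. The endpoint, dimension, and HNSW parameters (m, ef_construction) are illustrative placeholders to tune against your own latency, quality, and memory targets.

```python
from opensearchpy import OpenSearch

# Placeholder endpoint with no auth -- substitute your own cluster details
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# k-NN index backed by an HNSW graph: faster queries, more memory.
# The faiss engine also offers an "ivf" method, which trades query speed
# for a smaller memory footprint (and requires a training step).
client.indices.create(
    index="docs",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 512,  # must match your embedding model
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",  # Euclidean distance
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                },
                "text": {"type": "text"},
            }
        },
    },
)
```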

Trade-offs in Vector DB Selection:

  • No straightforward answer; performance and tolerance levels are user-specific.

  • Balance between quality, cost, and latency.

  • More memory usage leads to higher costs but more accurate results.

Conclusion:

  • The choice of vector DB and searching algorithm depends on specific use cases and requirements.

  • Decision diagrams and tree diagrams can help in selecting the right options.

Transition to MCP:

  • The session now moves on to the Model Context Protocol (MCP) and shares hands-on experiences with it.


Introduction to Multi-Chain Processing (MCP)

What is MCP?

  • MCP stands for Model Context Protocol, an open standard for connecting AI applications with external systems, databases, or knowledge bases.

  • Aims to facilitate more effective responses from the system.

MCP Architecture Overview:

  • MCP Host: Central component with multiple MCP clients.

  • MCP Clients: Interact with the MCP server.

  • MCP Server: Represents different systems like file systems, databases, or other applications.

Core Features of MCP Client:

  • Sampling: Lets the MCP server request LLM completions through the client, keeping the client in control of model access.

  • Roots: Lets the client tell the server which directories or data locations (roots) it may operate on.

  • Elicitation: Lets the server request additional information from the user through the client.

Reference Architecture for Deploying MCP Server on AWS:

  • CloudFront and AWS WAF: CloudFront enhances performance at the edge, while AWS WAF protects against malicious attacks.

  • Amazon Cognito: Leveraged for authentication to ensure valid requests from MCP clients.

Summary:

  • MCP is a protocol for enhancing AI application interactions with external systems.

  • The architecture includes an MCP host, clients, and server.

  • Key features of MCP clients include sampling, roots, and elicitation.

  • AWS services like CloudFront, AWS WAF, and Amazon Cognito are recommended for deploying MCP servers securely and efficiently.



Continued Explanation of MCP Deployment on AWS

Backend Services for MCP Server:

  • AWS Fargate and AWS Lambda:

    • Fargate: Hosts MCP servers that need to run continuously.
    • Lambda: Hosts event-triggered code that runs only for the duration of a request.

Additional AWS Services:

  • Amazon CloudWatch: Monitors MCP server performance.

  • Parameter Store: Stores environment variables and configuration for MCP servers (a read sketch follows after this list).

  • Amazon ECR: Hosts images for building MCP servers.

  • ACM (AWS Certificate Manager): Ensures data in transit is encrypted with certificates.
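For illustration, this is how an MCP server process might read its configuration from Parameter Store at startup with boto3; the parameter name is a hypothetical naming convention, not one from the talk.

```python
import boto3

ssm = boto3.client("ssm")

# Hypothetical parameter holding an environment value for the MCP server
param = ssm.get_parameter(
    Name="/mcp-server/prod/API_BASE_URL",  # assumed name; adjust to your own
    WithDecryption=True,                   # needed for SecureString parameters
)
api_base_url = param["Parameter"]["Value"]
print(api_base_url)
```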

Demo of MCP Server Deployment on AWS:

  • CDK (Cloud Development Kit): Used to deploy the MCP server architecture.

  • [ 1 ] Steps:

  • Bootstrap the CDK toolkit.

  • Use the AWS-provided CDK app to deploy the backend architecture.

  • [ 2 ] Deployed Stack Includes:

  • VPC

  • Security groups

  • CloudFront

  • Cognito user pool for authentication

  • MCP server

Deployment Process:

  • The deployment process includes building Docker images from source code, pushing them to ECR, and deploying onto services like ECS Fargate and Lambda.

  • The process is time-consuming due to the complexity of building the images and deploying the server (a minimal CDK sketch follows below).
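The demo's exact stack isn't reproduced here, but a minimal CDK sketch (in Python) of the Fargate half looks roughly like this. The construct names, sizes, and container path are illustrative; the real stack also wires in CloudFront, a Cognito user pool, and the other services above.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs, aws_ecs_patterns as ecs_patterns
from constructs import Construct

class McpServerStack(Stack):
    """Minimal sketch: a VPC plus a Fargate service running an MCP server image."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "McpVpc", max_azs=2)

        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "McpService",
            vpc=vpc,
            cpu=256,
            memory_limit_mib=512,
            desired_count=1,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                # Image built from your MCP server source and pushed to ECR
                image=ecs.ContainerImage.from_asset("./mcp-server"),
                container_port=8080,
            ),
        )

app = App()
McpServerStack(app, "McpServerStack")
app.synth()
```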

Summary:

  • The backend of the MCP server leverages serverless platforms like Fargate and Lambda.

  • Additional AWS services like CloudWatch, Parameter Store, ECR, and ACM are used for monitoring, storing variables, hosting images, and ensuring encryption.

  • A demo using CDK shows the deployment of the MCP server on AWS, highlighting the complexity and time required for the process.



Final Remarks on MCP Client Interaction and Best Practices

Interaction with MCP Server:

  • After deploying the backend, the MCP client can be used to interact with the MCP server.

  • Demonstration of interacting with an MCP server connected to an external website (API) for weather alerts in the United States.

  • Example: Inputting latitude and longitude to fetch weather alerts, with the response returned from api.weather.gov (a condensed server sketch follows below).
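A condensed sketch of such a weather-alerts MCP server, based on the open-source MCP Python SDK, might look like the following. The tool name, User-Agent string, and response formatting are assumptions rather than the exact demo code.

```python
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")

@mcp.tool()
async def get_alerts(latitude: float, longitude: float) -> str:
    """Fetch active US weather alerts for a point from api.weather.gov."""
    url = f"https://api.weather.gov/alerts/active?point={latitude},{longitude}"
    async with httpx.AsyncClient() as client:
        resp = await client.get(url, headers={"User-Agent": "mcp-weather-demo"})
        resp.raise_for_status()
        features = resp.json().get("features", [])
    if not features:
        return "No active alerts for this location."
    return "\n".join(f["properties"]["headline"] for f in features)

if __name__ == "__main__":
    mcp.run(transport="stdio")  # an MCP client connects over stdio
```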

Best Practice Tip:

  • Avoid Co-locating the MCP Client with the Callback Host:

  • It is recommended not to run the MCP client on the same host that serves callbacks for the MCP server.

  • Separating them helps avoid callback timeout issues.

Summary:

  • The MCP client can effectively interact with the MCP server to fetch data from external systems.

  • A best practice is to separate the MCP client from the callback host to prevent timeout problems.

  • The demonstration showed how to use the MCP server to predict weather alerts by interacting with an external API.



Team:
AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
