<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suman Debnath</title>
    <description>The latest articles on DEV Community by Suman Debnath (@debnsuma).</description>
    <link>https://dev.to/debnsuma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F281052%2F696d68b0-587b-4736-ba2a-9264af5d0536.jpeg</url>
      <title>DEV Community: Suman Debnath</title>
      <link>https://dev.to/debnsuma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/debnsuma"/>
    <language>en</language>
    <item>
      <title>Creating Software Teams with AI Agents and Bedrock</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Mon, 28 Apr 2025 21:24:29 +0000</pubDate>
      <link>https://dev.to/aws/code-without-coding-creating-software-teams-with-ai-agents-and-bedrock-539i</link>
      <guid>https://dev.to/aws/code-without-coding-creating-software-teams-with-ai-agents-and-bedrock-539i</guid>
      <description>&lt;p&gt;Imagine having a complete engineering team at your fingertips - an engineering lead, a backend developer, a frontend engineer, and a test engineer - all working in perfect harmony to build your application. Now imagine this entire team is powered by AI. &lt;/p&gt;

&lt;p&gt;In this blog post, I'll walk you through building an AI-powered engineering team using CrewAI and Amazon Bedrock to develop a stock management application. Let's turn this sci-fi concept into reality!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CrewAI?
&lt;/h2&gt;

&lt;p&gt;CrewAI is a game-changing open-source framework that lets you create a team of specialized AI agents working together on complex projects. Unlike traditional single-agent approaches, CrewAI mimics how human teams operate - with different specialists handling specific aspects of a project.&lt;/p&gt;

&lt;p&gt;Think of it as assembling your dream engineering team, where each AI agent brings unique skills to the table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The visionary Engineering Lead&lt;/strong&gt; creates detailed designs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The meticulous Backend Engineer&lt;/strong&gt; writes clean, efficient code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The creative Frontend Engineer&lt;/strong&gt; crafts intuitive user interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The thorough Test Engineer&lt;/strong&gt; ensures everything works perfectly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Our Mission: Build a Stock Management App
&lt;/h2&gt;

&lt;p&gt;Our AI team will build a comprehensive stock management application that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages user accounts and funds&lt;/li&gt;
&lt;li&gt;Handles buying and selling of shares&lt;/li&gt;
&lt;li&gt;Tracks portfolio value and calculates profits/losses&lt;/li&gt;
&lt;li&gt;Enforces sensible constraints (no overdrafts or overselling)&lt;/li&gt;
&lt;/ul&gt;
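&lt;p&gt;To make these constraints concrete, here's a minimal hand-written sketch of the kind of logic we're asking for (the class shape and the fixed prices are purely illustrative - the real module will be designed and written by the AI team):&lt;/p&gt;

```python
# Hypothetical sketch of the required constraints -- the actual module
# is generated by the AI crew, not written by hand.

def get_share_price(symbol: str) -> float:
    """Test implementation with fixed prices, as the requirements suggest."""
    return {"AAPL": 150.0, "TSLA": 200.0, "GOOGL": 120.0}[symbol]

class Account:
    def __init__(self) -> None:
        self.balance = 0.0
        self.holdings: dict[str, int] = {}

    def deposit(self, amount: float) -> None:
        self.balance += amount

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:  # no overdrafts
            raise ValueError("insufficient funds")
        self.balance -= amount

    def buy(self, symbol: str, quantity: int) -> None:
        cost = get_share_price(symbol) * quantity
        if cost > self.balance:  # can't buy more than you can afford
            raise ValueError("insufficient funds")
        self.balance -= cost
        self.holdings[symbol] = self.holdings.get(symbol, 0) + quantity

    def sell(self, symbol: str, quantity: int) -> None:
        if quantity > self.holdings.get(symbol, 0):  # no overselling
            raise ValueError("not enough shares")
        self.balance += get_share_price(symbol) * quantity
        self.holdings[symbol] -= quantity
```

&lt;p&gt;Part of the fun is seeing how closely the AI team's version of this logic matches (or improves on) what a human would write.&lt;/p&gt;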

&lt;p&gt;Let's dive into how we'll make this happen!&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Setup: The Skeleton of Our AI Team
&lt;/h2&gt;

&lt;p&gt;Before we write a single line of code, let's understand the architecture of our project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;engineering_team/
├── config/
│   ├── agents.yaml  (Defines our AI team members)
│   └── tasks.yaml   (Specifies what each team member will do)
├── output/          (Where our finished application will go)
├── src/
│   ├── crew.py      (Orchestrates team collaboration)
│   └── main.py      (Starting point for our project)
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each file plays a crucial role in bringing our AI team to life.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: Setup in 3 Easy Steps
&lt;/h2&gt;

&lt;p&gt;Let's roll up our sleeves and get started:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Install the essentials
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install uv (a faster Python package installer)&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh

&lt;span class="c"&gt;# Verify installation&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;uv &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;# Install CrewAI&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;crewai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Create your AI crew
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;crewai create my_engg_team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When prompted, select Amazon Bedrock as your provider and choose Claude 3.5 Sonnet as your model. You'll need to provide your AWS credentials, and make sure model access for Claude is enabled in the Amazon Bedrock console for your region.&lt;/p&gt;
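&lt;p&gt;CrewAI picks up the standard AWS environment variables, so one way to supply credentials is to export them in your shell (the values below are placeholders - substitute your own; a &lt;code&gt;.env&lt;/code&gt; file in the project root works too):&lt;/p&gt;

```shell
# Placeholder values -- replace with your own credentials and region
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_DEFAULT_REGION="us-east-1"
```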

&lt;h3&gt;
  
  
  3. Configure your team
&lt;/h3&gt;

&lt;p&gt;Now comes the fun part - defining our AI team members and their responsibilities!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dream Team: Configuring Our AI Engineers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Engineering Lead (agents.yaml)
&lt;/h3&gt;

&lt;p&gt;Our engineering lead takes high-level requirements and turns them into detailed designs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;engineering_lead&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Engineering Lead for the engineering team, directing the work of the engineer&lt;/span&gt;
  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Take the high level requirements described here and prepare a detailed design for the backend developer;&lt;/span&gt;
    &lt;span class="s"&gt;everything should be in 1 python module; describe the function and method signatures in the module.&lt;/span&gt;
    &lt;span class="s"&gt;The python module must be completely self-contained, and ready so that it can be tested or have a simple UI built for it.&lt;/span&gt;
    &lt;span class="s"&gt;Here are the requirements: {requirements}&lt;/span&gt;
    &lt;span class="s"&gt;The module should be named {module_name} and the class should be named {class_name}&lt;/span&gt;
  &lt;span class="na"&gt;backstory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;You're a seasoned engineering lead with a knack for writing clear and concise designs.&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/span&gt;

&lt;span class="na"&gt;backend_engineer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Python Engineer who can write code to achieve the design described by the engineering lead&lt;/span&gt;
  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Write a python module that implements the design described by the engineering lead, in order to achieve the requirements.&lt;/span&gt;
    &lt;span class="s"&gt;The python module must be completely self-contained, and ready so that it can be tested or have a simple UI built for it.&lt;/span&gt;
    &lt;span class="s"&gt;Here are the requirements: {requirements}&lt;/span&gt;
    &lt;span class="s"&gt;The module should be named {module_name} and the class should be named {class_name}&lt;/span&gt;
  &lt;span class="na"&gt;backstory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;You're a seasoned python engineer with a knack for writing clean, efficient code.&lt;/span&gt;
    &lt;span class="s"&gt;You follow the design instructions carefully.&lt;/span&gt;
    &lt;span class="s"&gt;You produce 1 python module named {module_name} that implements the design and achieves the requirements.&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/span&gt;

&lt;span class="na"&gt;frontend_engineer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;A Gradio expert to who can write a simple frontend to demonstrate a backend&lt;/span&gt;
  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Write a gradio UI that demonstrates the given backend, all in one file to be in the same directory as the backend module {module_name}.&lt;/span&gt;
    &lt;span class="s"&gt;Here are the requirements: {requirements}.&lt;/span&gt;
  &lt;span class="na"&gt;backstory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;You're a seasoned python engineer highly skilled at writing simple Gradio UIs for a backend class.&lt;/span&gt;
    &lt;span class="s"&gt;You produce a simple gradio UI that demonstrates the given backend class; you write the gradio UI in a module app.py that is in the same directory as the backend module {module_name}.&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/span&gt;

&lt;span class="na"&gt;test_engineer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;An engineer with python coding skills who can write unit tests for the given backend module {module_name}&lt;/span&gt;
  &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;Write unit tests for the given backend module {module_name} and create a test_{module_name} in the same directory as the backend module.&lt;/span&gt;
  &lt;span class="na"&gt;backstory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;You're a seasoned QA engineer and software developer who writes great unit tests for python code.&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent configuration includes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role:&lt;/strong&gt; Describes the agent's position in the team&lt;br&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; Specifies what the agent needs to accomplish&lt;br&gt;
&lt;strong&gt;Backstory:&lt;/strong&gt; Provides context that shapes the agent's approach&lt;br&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; Specifies which language model powers the agent (in this case, Claude 3.5 Sonnet via Amazon Bedrock)&lt;/p&gt;

&lt;p&gt;Our team includes four agents, each serving a distinct role:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Engineering Lead:&lt;/strong&gt; Creates detailed design specifications&lt;br&gt;
&lt;strong&gt;Backend Engineer:&lt;/strong&gt; Implements the core functionality in Python&lt;br&gt;
&lt;strong&gt;Frontend Engineer:&lt;/strong&gt; Develops a Gradio-based UI&lt;br&gt;
&lt;strong&gt;Test Engineer:&lt;/strong&gt; Writes unit tests for quality assurance&lt;/p&gt;
&lt;h3&gt;
  
  
  Assigning Tasks (tasks.yaml)
&lt;/h3&gt;

&lt;p&gt;Each team member needs clear instructions on what to do:&lt;/p&gt;
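&lt;p&gt;The full &lt;code&gt;tasks.yaml&lt;/code&gt; is omitted here for brevity, but every entry follows the same shape. As a sketch (the keys are CrewAI's standard task configuration fields; the wording and output path are illustrative), the design task might look like:&lt;/p&gt;

```yaml
design_task:
  description: >
    Take the high level requirements described here and prepare a detailed
    design for the backend developer.
    Here are the requirements: {requirements}
  expected_output: >
    A detailed design for the backend developer, describing the function and
    method signatures of the module {module_name}.
  agent: engineering_lead
  output_file: output/design.md
```

&lt;p&gt;The &lt;code&gt;code_task&lt;/code&gt;, &lt;code&gt;frontend_task&lt;/code&gt;, and &lt;code&gt;test_task&lt;/code&gt; entries follow the same pattern, each assigned to the matching agent.&lt;/p&gt;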
&lt;h3&gt;
  
  
  Crew Definition (&lt;code&gt;crew.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;crew.py&lt;/code&gt; file organizes the agents and tasks into a cohesive team:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrewBase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;

&lt;span class="c1"&gt;# If you want to run a snippet of code before or after the crew starts,
# you can use the @before_kickoff and @after_kickoff decorators
# https://docs.crewai.com/concepts/crews#example-crew-class-with-decorators
&lt;/span&gt;
&lt;span class="nd"&gt;@CrewBase&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyEnggTeam&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;EngineeringTeam3 crew&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;agents_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;config/agents.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;tasks_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;config/tasks.yaml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="nd"&gt;@agent&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;engineering_lead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;engineering_lead&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@agent&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backend_engineer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;backend_engineer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;allow_code_execution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Uses Docker for safety
&lt;/span&gt;            &lt;span class="n"&gt;max_execution_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;max_retry_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; 
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@agent&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;frontend_engineer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;frontend_engineer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@agent&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_engineer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_engineer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;allow_code_execution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;safe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Uses Docker for safety
&lt;/span&gt;            &lt;span class="n"&gt;max_execution_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;max_retry_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; 
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;design_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;design_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;code_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;frontend_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;frontend_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;   

    &lt;span class="nd"&gt;@crew&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Creates the research crew&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Starting Gun (main.py)
&lt;/h3&gt;

&lt;p&gt;Finally, we kick off the development process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;my_engg_team.crew&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MyEnggTeam&lt;/span&gt;

&lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filterwarnings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;SyntaxWarning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pysbd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Disable only CrewAI telemetry
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREWAI_DISABLE_TELEMETRY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Disable all OpenTelemetry (including CrewAI)
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OTEL_SDK_DISABLED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Create output directory if it doesn't exist
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
A simple account management system for a trading simulation platform.
The system should allow users to create an account, deposit funds, and withdraw funds.
The system should allow users to record that they have bought or sold shares, providing a quantity.
The system should calculate the total value of the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s portfolio, and the profit or loss from the initial deposit.
The system should be able to report the holdings of the user at any point in time.
The system should be able to report the profit or loss of the user at any point in time.
The system should be able to list the transactions that the user has made over time.
The system should prevent the user from withdrawing funds that would leave them with a negative balance, or
from buying more shares than they can afford, or selling shares that they don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have.
The system has access to a function get_share_price(symbol) which returns the current price of a share, and includes a test implementation that returns fixed prices for AAPL, TSLA, GOOGL.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;module_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accounts.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;class_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Account&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Run the engineering crew.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;requirements&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;module_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;module_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;class_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;class_name&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Create and run the crew
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MyEnggTeam&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Showtime! Watching AI Engineers in Action
&lt;/h2&gt;

&lt;p&gt;With everything set up, it's time to see our AI team in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;crewai run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now sit back and watch as your AI engineering team springs to life! You'll see a cascade of activity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Engineering Lead&lt;/strong&gt; analyzes the requirements and crafts a detailed design&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Backend Engineer&lt;/strong&gt; takes that design and implements it in clean Python code&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Frontend Engineer&lt;/strong&gt; creates a user-friendly Gradio interface&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Test Engineer&lt;/strong&gt; ensures everything works as expected with comprehensive tests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7pvwmgbwwwpc6yhn5cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7pvwmgbwwwpc6yhn5cc.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux9avmwb90p7mz5okwsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fux9avmwb90p7mz5okwsj.png" alt=" " width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Grand Finale: Running Our Application
&lt;/h2&gt;

&lt;p&gt;Once the AI team finishes their work, we can actually run the application they built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;output
&lt;span class="nv"&gt;$ &lt;/span&gt;uv run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And voilà! A fully functional stock management application appears:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqyvklyduyee7i2pr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqyvklyduyee7i2pr6.png" alt=" " width="800" height="805"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can now deposit funds, buy and sell stocks, track your portfolio value, and more - all created by an AI engineering team!&lt;/p&gt;

&lt;h2&gt;
  
  
  What Just Happened? Behind the Scenes
&lt;/h2&gt;

&lt;p&gt;Let's take a moment to appreciate what just happened:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design Phase&lt;/strong&gt;: Our AI Engineering Lead analyzed requirements and created a detailed blueprint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation Phase&lt;/strong&gt;: The Backend Engineer wrote clean, functional Python code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Development&lt;/strong&gt;: The Frontend Engineer built an intuitive Gradio interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Assurance&lt;/strong&gt;: The Test Engineer ensured everything worked correctly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this happened with minimal human intervention - just a few configuration files and a press of a button!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Revolutionary
&lt;/h2&gt;

&lt;p&gt;This approach to software development is groundbreaking for several reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Development that would take days or weeks happens in minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialization&lt;/strong&gt;: Each agent brings focused expertise to their specific role&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration&lt;/strong&gt;: Information flows naturally between team members&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt;: The test-driven approach ensures robust, working software&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt;: Complex applications can be built even if you're not a coding expert&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Future of Development?
&lt;/h2&gt;

&lt;p&gt;Will AI teams replace human developers? Not likely. But they will transform how we work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rapid Prototyping&lt;/strong&gt;: Test ideas and build MVPs in record time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill Augmentation&lt;/strong&gt;: Fill gaps in your human team's expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;24/7 Development&lt;/strong&gt;: Your AI team never sleeps (or needs coffee breaks!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning Tool&lt;/strong&gt;: Watch experts at work and learn best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;The possibilities are endless! You could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand the team with more specialized roles (DevOps engineer, security specialist)&lt;/li&gt;
&lt;li&gt;Tackle more complex applications&lt;/li&gt;
&lt;li&gt;Customize the agents with domain-specific knowledge&lt;/li&gt;
&lt;li&gt;Integrate with your existing development workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building an AI engineering team with CrewAI and Amazon Bedrock demonstrates how far AI has come. We've moved from AI assistants that can answer questions to AI teams that can build entire applications from scratch.&lt;/p&gt;

&lt;p&gt;This approach doesn't replace human creativity and innovation - it amplifies it. By handling routine implementation details, AI engineering teams free human developers to focus on the bigger picture: solving real-world problems and creating value.&lt;/p&gt;

&lt;p&gt;So, are you ready to assemble your AI engineering dream team? The tools are here, and the possibilities are limitless!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>crewai</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Standardizing AI Tooling with Model Context Protocol (MCP)</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Wed, 09 Apr 2025 13:15:20 +0000</pubDate>
      <link>https://dev.to/aws/standardizing-ai-tooling-with-model-context-protocol-mcp-nmj</link>
      <guid>https://dev.to/aws/standardizing-ai-tooling-with-model-context-protocol-mcp-nmj</guid>
      <description>&lt;h1&gt;
  
  
  What is Model Context Protocol (MCP)?
&lt;/h1&gt;

&lt;p&gt;MCP is like a USB-C port for your AI applications.&lt;/p&gt;

&lt;p&gt;Just as USB-C offers a standardized way to connect devices to various accessories, MCP standardizes how your AI apps connect to different data sources and tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5xfqwv7v42tu5bdrane.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5xfqwv7v42tu5bdrane.png" alt=" " width="659" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  MCP Architecture
&lt;/h1&gt;

&lt;p&gt;At its core, MCP follows a client-server architecture where a host application can connect to multiple servers.&lt;/p&gt;

&lt;p&gt;It has three key components: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Host

2. Client

3. Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lr9tmc9rnl5f96coirf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lr9tmc9rnl5f96coirf.png" alt=" " width="800" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MCP Host&lt;/strong&gt; is any AI app (Claude Desktop, Cursor) that provides an environment for AI interactions, accesses tools and data, manages the AI model, and runs the MCP Client.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MCP Client&lt;/strong&gt; lives within the host and facilitates communication with MCP servers. It's responsible for discovering server capabilities and transmitting messages between the host and servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn17ox4of42vzqtgba1qy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn17ox4of42vzqtgba1qy.png" alt=" " width="730" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MCP Server&lt;/strong&gt; exposes specific capabilities and provides access to data, such as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F177wuhfengk0fzdmlrm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F177wuhfengk0fzdmlrm4.png" alt=" " width="766" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How MCP Works?
&lt;/h1&gt;

&lt;p&gt;Understanding client-server communication is essential for building your own MCP client-server.&lt;/p&gt;

&lt;p&gt;Let's see how MCP communication flows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Discovery Phase:&lt;/strong&gt; The client queries the server to learn its capabilities&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Capability Exchange:&lt;/strong&gt; The server shares what it can do&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Acknowledgment:&lt;/strong&gt; The client confirms successful connection&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Ongoing Communication:&lt;/strong&gt; Messages flow between client and server&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb1l5oe1rbjw9saidpxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjb1l5oe1rbjw9saidpxy.png" alt=" " width="726" height="666"&gt;&lt;/a&gt;&lt;/p&gt;
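&lt;p&gt;The four phases above can be modeled with a toy client and server in plain Python. The class and method names here are hypothetical stand-ins, not an actual MCP SDK:&lt;/p&gt;

```python
# Toy sketch of the four MCP communication phases (hypothetical names,
# not a real MCP SDK).

class LightServer:
    def capabilities(self):
        # Capability exchange: the server describes what it can do
        return {"set_light": ["room", "state"]}

    def handle(self, command, args):
        # Ongoing communication: the server services a request
        return f"{command} called with {args}"

class Client:
    def __init__(self, server):
        self.server = server
        self.tools = None

    def connect(self):
        # Discovery phase: the client asks the server what it supports
        self.tools = self.server.capabilities()
        # Acknowledgment: connection confirmed once capabilities are known
        return self.tools is not None

    def call(self, command, **args):
        # Ongoing communication: messages flow between client and server
        assert command in self.tools
        return self.server.handle(command, args)

client = Client(LightServer())
assert client.connect()
print(client.call("set_light", room="kitchen", state="on"))
```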

&lt;h1&gt;
  
  
  Real-World Example: Smart Home Integration
&lt;/h1&gt;

&lt;p&gt;Let's illustrate MCP with a practical example comparing traditional API integration with MCP for a smart home system:&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional API Approach:
&lt;/h2&gt;

&lt;p&gt;You build a voice assistant that integrates with your smart lighting system. &lt;br&gt;
Initially, your lighting API accepts two parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- room (which room to control)
- state (on/off)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You hardcode your assistant to send requests with these exact parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rvnzhbuile3cay7ame9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rvnzhbuile3cay7ame9.png" alt=" " width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Later, the lighting system company updates their API to require a third parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- brightness (light intensity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Suddenly, your voice assistant stops working correctly because your code doesn't include this new required parameter. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweojy2wn4j17j7px8lss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweojy2wn4j17j7px8lss.png" alt=" " width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Approach:
&lt;/h2&gt;

&lt;p&gt;Your voice assistant (MCP Host) connects to your smart lighting system (MCP Server).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- The assistant asks: "What commands do you support?"

- The lighting system responds: "I support controlling lights with room and state parameters."

- The assistant works with these parameters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0jiwuxe1l72upwqlugd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0jiwuxe1l72upwqlugd.png" alt=" " width="798" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the lighting company updates their system:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- The assistant asks again: "What commands do you support?"

- The lighting system now responds: "I support controlling lights with room, state, and brightness parameters."

- The assistant automatically adapts to include this new parameter in its requests.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No code changes needed! Your assistant continues to work flawlessly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7osqgjj5nkqric4oiv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7osqgjj5nkqric4oiv7.png" alt=" " width="743" height="316"&gt;&lt;/a&gt;&lt;/p&gt;
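&lt;p&gt;The adaptation step can be sketched like this: the client builds each request from whatever parameter list the server currently advertises, so a newly required parameter is picked up automatically. The function and defaults below are hypothetical, for illustration only:&lt;/p&gt;

```python
# Sketch: build requests from the server's advertised parameter list,
# so a newly required parameter is handled without client code changes
# (hypothetical names, not a real MCP SDK).

DEFAULTS = {"room": "living_room", "state": "on", "brightness": 100}

def build_request(advertised_params, **user_args):
    # Use the caller's values where given; fall back to a sensible
    # default for any newly advertised parameter the caller predates.
    return {p: user_args.get(p, DEFAULTS[p]) for p in advertised_params}

# Before the update, the server advertises two parameters...
v1 = build_request(["room", "state"], room="kitchen", state="off")

# ...after the update it advertises three; the same call still works.
v2 = build_request(["room", "state", "brightness"], room="kitchen", state="off")

print(v1)
print(v2)
```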

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;MCP is more than just a technical spec: it's a bridge that enables seamless, future-proof connections between AI applications and evolving tools or data sources. By abstracting away brittle integrations and fostering adaptability, MCP is paving the way for more robust, modular, and intelligent AI systems.&lt;/p&gt;

&lt;p&gt;Just like USB-C simplified hardware connections, MCP simplifies AI interoperability.&lt;/p&gt;

&lt;p&gt;To learn more, you can check these resources: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-aws-mcp-servers-for-code-assistants-part-1/?trk=7dd6ab1d-561b-49c6-bae9-b5978e1e2073&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Introducing AWS MCP Servers for code assistants&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/awslabs/mcp/?trk=7dd6ab1d-561b-49c6-bae9-b5978e1e2073&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;AWS MCP Servers&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>agents</category>
      <category>genai</category>
    </item>
    <item>
      <title>Integrating Vision-Language Models into Agentic RAG Systems with ColPali</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Mon, 31 Mar 2025 21:33:26 +0000</pubDate>
      <link>https://dev.to/aws/beyond-text-building-intelligent-document-agents-with-vision-language-models-and-colpali-and-oc</link>
      <guid>https://dev.to/aws/beyond-text-building-intelligent-document-agents-with-vision-language-models-and-colpali-and-oc</guid>
      <description>&lt;p&gt;In this tutorial, we will walk through a relatively new technique to build a RAG based pipeline using vision based model, which is based on a paper called &lt;a href="https://arxiv.org/abs/2407.01449" rel="noopener noreferrer"&gt;ColPali&lt;/a&gt; (published in June 2024). In the rapidly evolving world of AI, we're constantly seeking more natural ways for machines to understand and process information. Traditional RAG based systems have been transformative, but they often struggle with multimodal content (i.e. documents that contains a mix of text, images, tables, and more). &lt;/p&gt;

&lt;p&gt;Before we dive into this new vision-based retrieval technique, it's worth briefly revisiting the challenges faced by traditional RAG systems when dealing with multimodal data. This context will help us better appreciate the value that a vision-based retrieval model like ColPali offers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge with Traditional RAG
&lt;/h2&gt;

&lt;p&gt;Imagine you're trying to find information in a textbook that contains diagrams, charts, equations, and text. The traditional RAG approach would typically require:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extracting text from the document&lt;/li&gt;
&lt;li&gt;Processing images separately&lt;/li&gt;
&lt;li&gt;Processing tables separately&lt;/li&gt;
&lt;li&gt;Trying to understand each modality in isolation&lt;/li&gt;
&lt;li&gt;Somehow stitching all this information together&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiphvgo17azw8qkeah27t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiphvgo17azw8qkeah27t.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This fragmented approach &lt;strong&gt;loses vital context&lt;/strong&gt;. A diagram often explains concepts that would take paragraphs of text, and the layout of information itself can convey meaning. When working with educational content, research papers, or technical documentation, this limitation becomes particularly problematic.&lt;/p&gt;

&lt;p&gt;Traditional RAG systems struggle with this. Extracting text from visuals and then feeding it into an LLM &lt;strong&gt;strips away structural nuances&lt;/strong&gt;. But what if we could process raw visual documents directly and retrieve information based on &lt;strong&gt;visual relevance&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;That’s exactly what &lt;strong&gt;ColPali&lt;/strong&gt; enables.&lt;/p&gt;

&lt;h2&gt;
  
  
  ColPali: Efficient Document Retrieval with Vision Language Models
&lt;/h2&gt;

&lt;p&gt;Vision-driven RAG systems tackle this challenge differently. Instead of breaking a document into separate components, they process pages &lt;strong&gt;as they appear visually,&lt;/strong&gt; just like humans do when reading. This approach &lt;strong&gt;preserves spatial relationships and visual context&lt;/strong&gt;, both of which are often essential for deep understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5psrcmfopx3x6t8vqepa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5psrcmfopx3x6t8vqepa.png" alt=" " width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ColPali uses vision-language models (VLMs) to enhance document processing, bypassing traditional text extraction steps and directly analyzing documents as they are. &lt;/p&gt;

&lt;p&gt;In this tutorial, we'll explore how to build such a system using:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ColPali&lt;/strong&gt; - A multimodal document retrieval model that processes documents visually [for retrieval]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Nova&lt;/strong&gt; - A powerful vision language model for analyzing retrieved content [for generation]&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt; - An agent framework for orchestrating complex AI workflows [for building the Agentic RAG based system]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's break down how this system works.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ColPali Enhances Document Understanding
&lt;/h2&gt;

&lt;p&gt;At its core, &lt;strong&gt;ColPali&lt;/strong&gt; transforms each page of a document into an &lt;strong&gt;embedding&lt;/strong&gt;, similar to how &lt;a href="https://openai.com/index/clip/" rel="noopener noreferrer"&gt;CLIP&lt;/a&gt; does for images, but optimized for documents. It views each page holistically through a vision-language model. Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Patch Creation&lt;/strong&gt;: Documents are divided into manageable image patches, simplifying complex page layouts into smaller, processable units.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2r0uxtk5b45mt8sdzag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2r0uxtk5b45mt8sdzag.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Generating Brain Food&lt;/strong&gt;: Each patch is converted into embeddings, rich numerical representations that capture both visual and contextual data. These embeddings serve as the foundation for understanding and retrieving relevant content.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzykdc8mls8gr8n5ii3jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzykdc8mls8gr8n5ii3jb.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Detour of Vision Language Model (VLM)
&lt;/h2&gt;

&lt;p&gt;To fully grasp how &lt;strong&gt;ColPali&lt;/strong&gt; handles embedding generation, it’s essential to understand &lt;strong&gt;Vision-Language Models (VLMs)&lt;/strong&gt;, models that excel at integrating &lt;strong&gt;visual data&lt;/strong&gt; with &lt;strong&gt;textual annotations&lt;/strong&gt;. For a deep dive into VLMs, refer to this &lt;a href="https://github.com/debnsuma/fcc-ai-engineering-aws/blob/main/02-multimodal-llm/00_Introduction_MultimodalLLM.ipynb" rel="noopener noreferrer"&gt;tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, a VLM consists of the following key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image Encoder&lt;/strong&gt;: Breaks down images into patches and encodes each one into embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text Encoder&lt;/strong&gt;: Simultaneously encodes any accompanying text into its own set of embeddings, preserving language-specific context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqegr5kjfdywla10hxjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqegr5kjfdywla10hxjb.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Image Encoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Image Encoder component of a VLM breaks down images into smaller patches and processes each patch individually to generate embeddings. These embeddings represent the visual content in a format that the model can interpret and reason over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Patch Processing&lt;/strong&gt;: Images are divided into patches, which are then individually fed into the encoder. This modular approach allows the model to focus on detailed aspects of each image segment, facilitating a deeper understanding of the overall visual content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mw2poul2wh2bciivciu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mw2poul2wh2bciivciu.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adapter Layer Transformation&lt;/strong&gt;: After encoding, the output from the image encoder passes through an adapter layer. This layer converts the visual embeddings into a numerical format optimized for further processing within the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92bqo5ool47xt4bp4hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92bqo5ool47xt4bp4hy.png" alt=" " width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Text Encoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Parallel to the image encoding, the Text Encoder processes textual data. It converts text into a set of embeddings that encapsulate the semantic and syntactic nuances of the language.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text Processing&lt;/strong&gt;: Text is input into the encoder, which then produces embeddings. These embeddings capture the textual context and are crucial for the model to understand and generate language-based responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti4g03eculib7tfli4jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti4g03eculib7tfli4jn.png" alt=" " width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integration and Output Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The final stage in the VLM involves integrating the outputs from both the image and text encoders. This integration occurs within an LLM, where both sets of embeddings interact through the Transformer's attention mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual Interaction&lt;/strong&gt;: The image and text token embeddings are combined and processed through the Transformer model. This interaction allows the model to contextualize the information from both modalities, enhancing its ability to generate accurate and relevant responses based on both text and visual inputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ztd1s5c2yy8r3epgszl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ztd1s5c2yy8r3epgszl.png" alt=" " width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;
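&lt;p&gt;The contextual interaction step can be sketched numerically: image-patch tokens and text tokens are concatenated into one sequence and mixed by scaled dot-product self-attention. The shapes and random values below are purely illustrative, not taken from any real VLM:&lt;/p&gt;

```python
import numpy as np

# Sketch of the integration step: image-patch tokens and text tokens are
# concatenated and mixed by one round of scaled dot-product self-attention.
# Shapes and values are illustrative only.

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension
image_tokens = rng.normal(size=(4, d))   # 4 patch embeddings from the image encoder
text_tokens = rng.normal(size=(3, d))    # 3 token embeddings from the text encoder

tokens = np.concatenate([image_tokens, text_tokens])   # (7, d) joint sequence

scores = (tokens @ tokens.T) / np.sqrt(d)     # pairwise token similarities
scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens

mixed = weights @ tokens   # each output token attends to both modalities
print(mixed.shape)         # (7, 8)
```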

&lt;p&gt;This comprehensive approach enables VLMs to perform complex tasks that require an understanding of both visual elements and textual information, making them ideal for tasks like multimodal RAG where nuanced document understanding is critical.&lt;/p&gt;

&lt;p&gt;So, now that we've learned a bit about vision-language models, let's go back to ColPali and see how it processes the data and generates the embeddings using the vision-based model. &lt;/p&gt;

&lt;h2&gt;
  
  
  ColPali Embeddings Process
&lt;/h2&gt;

&lt;p&gt;Remember how we started by dividing a document into patches? &lt;strong&gt;ColPali&lt;/strong&gt; treats each page of a document as an image and divides it into patches, typically something like &lt;strong&gt;32×32&lt;/strong&gt;, resulting in &lt;strong&gt;1024 patches per page&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like how your eyes scan a document, &lt;strong&gt;section by section&lt;/strong&gt;. These patches are processed individually to capture &lt;strong&gt;local visual details&lt;/strong&gt;, while also maintaining their &lt;strong&gt;spatial relationship&lt;/strong&gt; to the entire page.&lt;/p&gt;
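&lt;p&gt;To make the patching idea concrete, here is a rough, plain-Python sketch of splitting a page into a &lt;strong&gt;32×32&lt;/strong&gt; grid. This is illustration only; ColPali's vision encoder performs this tokenization internally, and the image dimensions below are made up.&lt;/p&gt;

```python
# Illustrative only: split a page of the given pixel size into a
# 32 x 32 grid of patch bounding boxes (left, top, right, bottom).
# ColPali's vision encoder does the real patching internally.

def split_into_patches(width, height, grid=32):
    """Yield one bounding box per patch, row by row."""
    pw, ph = width // grid, height // grid
    for row in range(grid):
        for col in range(grid):
            yield (col * pw, row * ph, (col + 1) * pw, (row + 1) * ph)

# A hypothetical 1024 x 1024 page image
patches = list(split_into_patches(width=1024, height=1024))
print(len(patches))  # 1024 patches, i.e. 32 x 32
```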

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1f2p0xgsm02k27ntk90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1f2p0xgsm02k27ntk90.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each of these patches is converted into a &lt;strong&gt;rich embedding&lt;/strong&gt; - a numerical representation that captures both visual and textual information. This is achieved through a vision encoder and a language model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;vision encoder&lt;/strong&gt; breaks down image patches into initial embeddings&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;transformer-based&lt;/strong&gt; LLM refines these embeddings to capture semantic information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcabv9km1jb9uk8flu5zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcabv9km1jb9uk8flu5zr.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Time: Similarity Scoring
&lt;/h2&gt;

&lt;p&gt;When a user query comes in, ColPali calculates similarity scores between the query and document patches of each page through a scoring matrix. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: &lt;strong&gt;Generating and Projecting Tokens&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1) Token Generation:&lt;/strong&gt; Initially, tokens and their embeddings are generated for the query. This involves transforming the text of the query into a format that the system can process and match against document embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Projection:&lt;/strong&gt;  These tokens are then passed through the same transformer model used during the embedding process. This step involves projecting the tokens into the same embedding space as the document patches, ensuring that the subsequent comparisons are meaningful and accurate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquzghsvd4jcd0jnoqe4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquzghsvd4jcd0jnoqe4i.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: &lt;strong&gt;Computing the ColBERT Scoring Matrix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At this point, we have two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query embeddings&lt;/li&gt;
&lt;li&gt;Embeddings of all pages (at patch level granularity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next critical step involves computing the ColBERT scoring matrix. Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Embedding Matchup:&lt;/strong&gt; The scoring matrix is essentially a grid where each row corresponds to a query token and each column to a document patch. The entries in the matrix represent the similarity scores, typically calculated as the dot product between the query token embeddings and the document patch embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Score Maximization:&lt;/strong&gt; For each query token, the system identifies the maximum similarity score across all document patches. This step is crucial because it ensures that the most relevant patches are considered for generating the response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Summation for Final Score:&lt;/strong&gt; The maximum scores for each query token are then summed up to produce a final score for each document page. This cumulative score represents the overall relevance of the page to the query.&lt;/p&gt;
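&lt;p&gt;The three steps above can be sketched in a few lines of plain Python. The tiny 2-dimensional embeddings below are made-up numbers purely for illustration; real ColPali embeddings are 128-dimensional.&lt;/p&gt;

```python
# Toy sketch of ColBERT-style late-interaction (MaxSim) scoring.
# Embeddings here are small hand-made vectors for illustration only.

def dot(u, v):
    """Dot-product similarity between two embeddings (step 1)."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embeddings, patch_embeddings):
    """For each query token, take the max similarity over all patches
    of a page (step 2), then sum those maxima (step 3)."""
    return sum(
        max(dot(q, p) for p in patch_embeddings)
        for q in query_embeddings
    )

# One query with two token embeddings
query = [[1.0, 0.0], [0.0, 1.0]]

# Two pages, each represented by a few patch embeddings
page_a = [[0.9, 0.1], [0.2, 0.8]]   # matches both query tokens well
page_b = [[0.1, 0.1], [0.2, 0.3]]   # weak match

scores = {name: maxsim_score(query, patches)
          for name, patches in [("page_a", page_a), ("page_b", page_b)]}
# page_a ends up with the higher cumulative score, so it ranks first
```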

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozjoy631vb9oktsz2nn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozjoy631vb9oktsz2nn9.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: &lt;strong&gt;Selecting Top-K Pages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Based on the scores computed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Ranking and Retrieval:&lt;/strong&gt; The pages are ranked according to their scores, and the top-scoring pages are selected. This selection of &lt;code&gt;top-K&lt;/code&gt; pages is crucial as it filters out the pages most likely to contain the information sought by the query.&lt;/p&gt;
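&lt;p&gt;The ranking step boils down to a simple sort. Here is a minimal sketch with hypothetical page scores:&lt;/p&gt;

```python
# Rank pages by their cumulative relevance scores and keep the top-K.
# The scores below are made-up numbers for illustration.

page_scores = {"page_1": 14.2, "page_2": 9.7, "page_3": 17.8, "page_4": 5.1}

def top_k_pages(scores, k):
    """Return the k highest-scoring page ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

top_pages = top_k_pages(page_scores, k=2)
# These top pages (as images), together with the query, would then be
# passed to a multimodal LLM for response generation.
```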

&lt;p&gt;&lt;strong&gt;2) Response Generation:&lt;/strong&gt; These top pages are then fed, along with the query, into a multimodal language model like Amazon Nova. The model uses both the textual and the visual cues from these pages to generate detailed and contextually accurate responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakal51opjbk3hj8usib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzakal51opjbk3hj8usib.png" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about ColPali, you can refer to the &lt;a href="https://github.com/illuin-tech/colpali" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;. I would also recommend the nine-part blog series on RAG at &lt;a href="https://www.dailydoseofds.com/" rel="noopener noreferrer"&gt;DailyDoseofDS&lt;/a&gt; by Avi Chawla and Akshay Pachaar.&lt;/p&gt;

&lt;p&gt;Okay, enough theory. Let's see it in action :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an Agentic RAG System with CrewAI
&lt;/h2&gt;

&lt;p&gt;Having a powerful retrieval model is just &lt;strong&gt;one piece of the puzzle&lt;/strong&gt;. To create a truly intelligent system, we need to orchestrate the workflow, and this is where CrewAI comes in.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;CrewAI&lt;/strong&gt;, we’ll define &lt;strong&gt;specialized AI agents&lt;/strong&gt; that work collaboratively to execute complex tasks. Each agent is responsible for a specific role, and together, they handle the &lt;strong&gt;retrieval&lt;/strong&gt;, &lt;strong&gt;reasoning&lt;/strong&gt;, and &lt;strong&gt;response generation,&lt;/strong&gt; all under the hood.&lt;/p&gt;

&lt;p&gt;Here's how our architecture will look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7nn0vl9h4vp1v83dnjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7nn0vl9h4vp1v83dnjk.png" alt=" " width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;To follow along with this tutorial, I recommend cloning &lt;a href="https://github.com/debnsuma/fcc-ai-engineering-aws/blob/main/06-agents-with-rag/02-multimodal-retrival-with-colpali-retreve-gen-agents-crewAI.ipynb/?trk=7dd6ab1d-561b-49c6-bae9-b5978e1e2073&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;this repo&lt;/a&gt; from GitHub and follow along:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;colpali-engine torch boto3 tqdm pymupdf numpy matplotlib einops seaborn &lt;span class="nt"&gt;-q&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;boto3&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.34.162 &lt;span class="nv"&gt;botocore&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.34.162 &lt;span class="nv"&gt;crewai&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.70.1 &lt;span class="nv"&gt;crewai_tools&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;0.12.1 &lt;span class="nv"&gt;PyPDF2&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;3.0.1 &lt;span class="nt"&gt;-q&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s import a few of the libraries we’ll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;colpali_engine.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColPali&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ColPaliProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pdf2image&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;convert_from_path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qdrant_client.http&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Download the dataset
&lt;/h2&gt;

&lt;p&gt;First, let’s create a directory in your current working directory to store the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pdf_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this demo, we'll be using the &lt;strong&gt;Class X Science book from NCERT&lt;/strong&gt;, which is publicly available on &lt;a href="https://ncert.nic.in/textbook.php?jesc1=0-13" rel="noopener noreferrer"&gt;their official website&lt;/a&gt;. Once you download the PDF, save it in the &lt;code&gt;pdf_data&lt;/code&gt; folder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load the ColPali Multimodal Document Retrieval Model
&lt;/h2&gt;

&lt;p&gt;We will now load the ColPali model from &lt;a href="https://huggingface.co/vidore/colpali-v1.3" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;. If you don’t have a Hugging Face account yet, this is the time to sign up and create a HUGGING_FACE_TOKEN. You’ll need this token to authenticate and access gated models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Loading the token
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HUGGING_FACE_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_HF_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; 

&lt;span class="c1"&gt;# Login using token from environment variable
&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HUGGING_FACE_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're running this code on a machine with a GPU or MPS (for Mac), it's highly recommended to use it for more efficient processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt; 

&lt;span class="c1"&gt;# Check if CUDA/MPS/CPU is available
&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s load the &lt;strong&gt;ColPali&lt;/strong&gt; model and its processor from Hugging Face.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vidore/colpali-v1.3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;colpali_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColPali&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;pretrained_model_name_or_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;cache_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./model_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;colpali_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColPaliProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;pretrained_model_name_or_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;cache_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./model_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting up the vector database
&lt;/h2&gt;

&lt;p&gt;Before we generate embeddings using the &lt;strong&gt;ColPali&lt;/strong&gt; model, we need a place to &lt;strong&gt;store and query&lt;/strong&gt; them. That’s where a &lt;strong&gt;vector database&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;For this tutorial, we’re using &lt;strong&gt;Qdrant&lt;/strong&gt;, an open-source, high-performance vector database designed specifically for handling high-dimensional vector data.&lt;/p&gt;

&lt;p&gt;One of the great things about Qdrant is that it’s easy to self-host using Docker, making local development and experimentation super convenient.&lt;/p&gt;

&lt;p&gt;Like other vector stores, Qdrant supports &lt;strong&gt;fast similarity search&lt;/strong&gt; across millions (or even billions) of vectors. But it also offers &lt;strong&gt;rich filtering capabilities&lt;/strong&gt;, which makes it a solid choice for powering RAG pipelines where precision and flexibility matter.&lt;/p&gt;

&lt;p&gt;Run the following in your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 6333:6333 &lt;span class="nt"&gt;-p&lt;/span&gt; 6334:6334 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/qdrant_storage:/qdrant/storage:z &lt;span class="se"&gt;\&lt;/span&gt;
    qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Qdrant is up and running, you can access the &lt;a href="http://localhost:6333/dashboard#/welcome" rel="noopener noreferrer"&gt;Qdrant Dashboard locally&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3etzffyra8mr3bj49gh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3etzffyra8mr3bj49gh9.png" alt=" " width="726" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the &lt;strong&gt;Qdrant vector database&lt;/strong&gt; running locally, let’s create a &lt;strong&gt;client object&lt;/strong&gt; and define a &lt;strong&gt;collection&lt;/strong&gt; called &lt;code&gt;class_XII_science_book&lt;/code&gt; to store our embeddings.&lt;/p&gt;

&lt;p&gt;We’ll use the official Qdrant Python client to interact with the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Creating a qdrant client object 
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QdrantClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6333&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Create a collection
&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class_XII_science_book&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;VECTOR_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_disk_payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VECTOR_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;on_disk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;multivector_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MultiVectorConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;comparator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MultiVectorComparator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX_SIM&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check the newly created collection in the &lt;a href="http://localhost:6333/dashboard#/collections" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt; as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb93frkb3411j236eilh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb93frkb3411j236eilh7.png" alt=" " width="616" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Store embeddings in vector database
&lt;/h2&gt;

&lt;p&gt;Now that our &lt;strong&gt;Qdrant vector database&lt;/strong&gt; is set up and our collection is ready, it's time to &lt;strong&gt;generate embeddings&lt;/strong&gt; using the &lt;strong&gt;ColPali&lt;/strong&gt; model and &lt;strong&gt;store them&lt;/strong&gt; in the collection.&lt;/p&gt;

&lt;p&gt;Let’s walk through the steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TOKENIZERS_PARALLELISM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 

&lt;span class="c1"&gt;# Step 1: Convert PDFs into a dictionary of PIL images 
# which will be used to create embeddings
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_pdfs_to_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poppler_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/homebrew/bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert PDFs into a dictionary of PIL images.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;pdf_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;all_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pdf_file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_files&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pdf_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convert_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poppler_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;poppler_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;all_images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_images&lt;/span&gt;



&lt;span class="c1"&gt;# Step 2: Create embeddings for the images
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indexing Progress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pbar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract images
&lt;/span&gt;        &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Process and encode images
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;batch_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;colpali_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;image_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;batch_images&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Prepare points for Qdrant
&lt;/span&gt;        &lt;span class="n"&gt;points&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_embeddings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PointStruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Use the batch index as the ID
&lt;/span&gt;                    &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;# Convert to list
&lt;/span&gt;                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;  
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Upload points to Qdrant
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error during upsert: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="c1"&gt;# Update the progress bar
&lt;/span&gt;        &lt;span class="n"&gt;pbar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Step 3: Generate embeddings and store in Qdrant
&lt;/span&gt;&lt;span class="n"&gt;PDF_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pdf_dir&lt;/span&gt;  &lt;span class="c1"&gt;# Change this to your actual folder path
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convert_pdfs_to_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PDF_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generating embeddings and storing in Qdrant...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Indexing complete!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once indexing completes, we can verify the stored embeddings in the Qdrant dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx2z1f5n4yecgjvp3m9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx2z1f5n4yecgjvp3m9f.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the embeddings stored in Qdrant, we can now &lt;strong&gt;send a query&lt;/strong&gt; and retrieve the &lt;strong&gt;most similar pages&lt;/strong&gt; using vector similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: Our query
&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the effects of oxidation reactions in everyday life ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Generate embeddings for the query
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;text_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;colpali_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_queries&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;text_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;token_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Query the vector database
&lt;/span&gt;&lt;span class="n"&gt;query_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;search_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                   &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QuantizationSearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                   &lt;span class="n"&gt;ignore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;rescore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="n"&gt;oversampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;
                                   &lt;span class="p"&gt;)&lt;/span&gt;
                               &lt;span class="p"&gt;)&lt;/span&gt;
                           &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time taken = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
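&lt;p&gt;Notice that the query embedding is a &lt;em&gt;list&lt;/em&gt; of token-level vectors rather than a single vector. ColPali-style multivector search scores each page with late interaction (MaxSim): every query token is matched against its best-scoring page-patch embedding, and those maxima are summed. Here is an illustrative pure-Python sketch of the idea, with toy numbers rather than actual model outputs:&lt;/p&gt;

```python
# Late-interaction (MaxSim) scoring, the idea behind ColPali-style
# multivector retrieval: each query token vector is matched against the
# best-scoring document patch vector, and the maxima are summed.
def maxsim_score(query_vecs, doc_vecs):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: 2 query token vectors vs. 3 document patch vectors
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[0.5, 0.0], [0.0, 1.0], [0.25, 0.25]]
print(maxsim_score(q, d))  # 0.5 + 1.0 = 1.5
```

&lt;p&gt;This is also why a single page can match a query on several distinct regions at once, which suits visually dense textbook pages.&lt;/p&gt;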



&lt;p&gt;This will return the &lt;strong&gt;Top-K most similar pages&lt;/strong&gt; &lt;code&gt;(limit=5)&lt;/code&gt; based on the vector similarity between your query and the stored page embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output of print(query_result.points)
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ScoredPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;21.797455&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="nc"&gt;ScoredPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;19.110117&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="nc"&gt;ScoredPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;19.051605&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="nc"&gt;ScoredPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;18.964575&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="nc"&gt;ScoredPoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;16.669119&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pdf archive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
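&lt;p&gt;Each &lt;code&gt;ScoredPoint&lt;/code&gt; carries only the ID and payload, not the page image itself. Because each point’s ID was assigned as its position in the in-memory &lt;code&gt;dataset&lt;/code&gt; list during indexing, we can map results straight back to the page images that the answering step will consume. A minimal sketch (the helper name &lt;code&gt;points_to_pages&lt;/code&gt; is mine, and it assumes &lt;code&gt;dataset&lt;/code&gt; is still in memory):&lt;/p&gt;

```python
# Map retrieved points back to their page images (sketch).
# Assumes `dataset` is the in-memory list built during indexing,
# where each point's ID equals its position in that list.
def points_to_pages(points, dataset):
    pages = []
    for point in points:
        pages.append({
            "doc_id": point.payload["doc_id"],
            "page_num": point.payload["page_num"],
            "score": point.score,
            "image": dataset[point.id]["image"],  # PIL image of the page
        })
    return pages
```

&lt;p&gt;In a persistent setup you would instead store an image path or S3 key in the payload, so pages can be reloaded without keeping the whole dataset in memory.&lt;/p&gt;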



&lt;h2&gt;
  
  
  Building an Agentic RAG System with CrewAI
&lt;/h2&gt;

&lt;p&gt;Now, let’s level up. Retrieval alone isn’t enough; to build a system that can reason and respond intelligently, we need to &lt;strong&gt;orchestrate the entire workflow, from query to retrieval to generation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://docs.crewai.com/introduction" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt; steps in.&lt;/p&gt;

&lt;p&gt;In this final section, we’ll build an &lt;strong&gt;agentic RAG pipeline&lt;/strong&gt; using CrewAI, where each part of the process is handled by a dedicated agent, and the agents work together to generate accurate, context-rich answers automatically.&lt;/p&gt;

&lt;p&gt;We’ll define two agents: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Knowledge Retriever Agent:&lt;/strong&gt; Uses ColPali + Qdrant to fetch the most relevant pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Multimodal Knowledge Expert Agent:&lt;/strong&gt; Uses Amazon Nova to analyze the images and generate an answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s start by creating these agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the Knowledge Retriever Agent 
&lt;/span&gt;&lt;span class="n"&gt;retrieval_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Knowledge Retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve the most relevant textbook pages from the knowledge base based on the student’s question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An intelligent academic assistant trained with NCERT Class X Science. It specializes in pinpointing the most relevant content based on the student’s question and the subject it pertains to.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieve_from_qdrant&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_delegation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the Multimodal Knowledge Expert Agent
&lt;/span&gt;&lt;span class="n"&gt;answering_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multimodal Knowledge Expert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accurately interpret the provided images and extract relevant information to answer the question: {query_text}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An advanced AI specialized in multimodal reasoning, capable of analyzing both text and images to provide the most precise and insightful answers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;multimodal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_delegation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if you take a closer look at the &lt;code&gt;retrieval_agent&lt;/code&gt; definition, you’ll notice we’ve included a tool called &lt;code&gt;retrieve_from_qdrant&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;We haven’t defined it yet, but this tool is the most important component of the agent definition: it is what enables the agent to &lt;strong&gt;perform semantic search under the hood&lt;/strong&gt;. Using this tool, the agent can interact with the Qdrant vector database, run a vector similarity search, and retrieve the most relevant document pages needed to answer the user’s query.&lt;/p&gt;

&lt;p&gt;Let’s go ahead and define that next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize CrewAI LLM (Amazon Nova Pro) 
&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-pro-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a Retrieval Tool 
&lt;/span&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_from_qdrant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve the most relevant documents from Qdrant vector database
    based on the given text query.

    Args:
        query (str): The user query to search in the knowledge base.

    Returns:
        list: List of paths to the matched images.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colpali_processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;colpali_model&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieving documents for query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;text_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;colpali_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_queries&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;text_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;colpali_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;token_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_embedding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform search in Qdrant
&lt;/span&gt;    &lt;span class="n"&gt;query_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;search_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;QuantizationSearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;ignore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;rescore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;oversampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Time: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;matched_images_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a folder to save matched images
&lt;/span&gt;    &lt;span class="n"&gt;MATCHED_IMAGES_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;matched_images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Delete all files and the directory itself if it exists
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MATCHED_IMAGES_DIR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rmtree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MATCHED_IMAGES_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MATCHED_IMAGES_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;page_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;image_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;matched_images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_doc_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_page_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PNG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;matched_images_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;  

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;All matched images are saved in the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;matched_images&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; folder.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matched_images_path&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can define the tasks for each of these agents and create the Crew.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Define the task for the Knowledge Retriever Agent
&lt;/span&gt;&lt;span class="n"&gt;retrieval_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve the most relevant images from the knowledge base based on the given query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieval_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A list of image file paths related to the query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the task for the the Multimodal Knowledge Expert Agent
&lt;/span&gt;&lt;span class="n"&gt;answering_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using the retrieved images at {{matched_images_path}}, generate a precise answer to the query: {{query_text}}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answering_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Assign answering agent
&lt;/span&gt;    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A clear and well-structured explanation based on the extracted information from the images. No need to include reference to the images in the answer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Assemble the Crew 
&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieval_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answering_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieval_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answering_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Query Time
&lt;/h2&gt;

&lt;p&gt;Let's see how this system works with a real example. Imagine a student asks about &lt;em&gt;“the proper way to heat a boiling tube containing ferrous sulphate crystals and how to safely smell the odor”&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#  Run the Query 
&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the correct way of heating the boiling tube containing crystals of ferrous sulphate and of smelling the odour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final output? A detailed explanation pulled from textbook pages that includes both textual and visual cues. The system identifies diagrams, interprets scientific procedures, and provides safe lab instructions, just like a teacher would :) &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach represents an exciting step toward more human-like document understanding. By combining vision-language models with specialized agents, we can create systems that process information more holistically - considering layout, visual elements, and text as an integrated whole.&lt;/p&gt;

&lt;p&gt;As vision language models continue to improve, and frameworks like CrewAI become more sophisticated, we can expect even more powerful multimodal RAG systems that further close the gap between how humans and machines process information.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;If you’re exploring multimodal retrieval and generation, I’d highly recommend checking out my free course on &lt;strong&gt;Multimodal RAG and Embeddings&lt;/strong&gt; in collaboration with Analytics Vidhya. It covers foundational concepts like embeddings, Byte-Pair Encoding, and vision-language reasoning using Amazon Nova and Bedrock. &lt;/p&gt;

&lt;p&gt;📖 Course Topics:&lt;/p&gt;

&lt;p&gt;✅ Embeddings in NLP &amp;amp; LLMs (Amazon Titan Text Embeddings)&lt;br&gt;
✅ Byte-Pair Encoding (BPE)&lt;br&gt;
✅ Multimodal LLMs &amp;amp; Contrastive Learning (CLIP, BLIP-2, Amazon Nova)&lt;br&gt;
✅ Multimodal RAG &amp;amp; Knowledge Bases with Amazon Bedrock&lt;br&gt;
✅ End-to-end AI application development with Agents and Knowledge Base&lt;/p&gt;

&lt;p&gt;Get access here: &lt;a href="https://courses.analyticsvidhya.com/courses/mastering-multimodal-rag-and-embeddings-with-amazon-nova-and-bedrock/?trk=7dd6ab1d-561b-49c6-bae9-b5978e1e2073&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Mastering Multimodal RAG Course&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Are you working on any &lt;strong&gt;multimodal use cases&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;I’d love to hear about them; connect with me on &lt;a href="https://www.linkedin.com/in/suman-d/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>bedrock</category>
      <category>colpali</category>
    </item>
    <item>
      <title>Machine Learning in SQL Style (Part-2)</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Thu, 04 Mar 2021 14:50:38 +0000</pubDate>
      <link>https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl</link>
      <guid>https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0m60hx4s457zot08f9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0m60hx4s457zot08f9b.png" alt="imgh1" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuing from where we left off in &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc"&gt;Part-1&lt;/a&gt; of this tutorial series, where we briefly introduced Amazon Redshift and dove deep into Amazon Redshift ML, we learnt how a database engineer/administrator can use Redshift ML to &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; a machine learning model using familiar SQL commands. &lt;/p&gt;

&lt;p&gt;Now, we are going to look at some of the advanced functionality of Amazon Redshift ML that a Data Analyst or an expert Data Scientist can take advantage of. It offers more flexibility to specify details such as which &lt;code&gt;algorithm&lt;/code&gt; to use (for example, XGBoost), along with the &lt;code&gt;hyperparameters&lt;/code&gt;, &lt;code&gt;preprocessors&lt;/code&gt; and so on. &lt;/p&gt;

&lt;h1&gt;
  
  
  Exercise 2 (Data Analyst's perspective)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;In this problem, we are going to use the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults" rel="noopener noreferrer"&gt;Steel Plates Faults Data Set&lt;/a&gt; from &lt;a href="https://archive.ics.uci.edu/ml/index.php" rel="noopener noreferrer"&gt;UCI Machine Learning Repository&lt;/a&gt;. You can download the dataset from this &lt;a href="https://github.com/debnsuma/redshiftml-demo/tree/master/dataset" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This dataset is related to the quality of steel plates: there are 27 &lt;code&gt;independent variables&lt;/code&gt; (input features) describing various attributes of a steel plate, and one &lt;code&gt;dependent variable&lt;/code&gt; (class label) which can take one of 7 values. So the problem at hand is a multi-class classification problem, where we need to predict which of the 7 fault types a steel plate has. &lt;/p&gt;

&lt;p&gt;So, the &lt;code&gt;objective&lt;/code&gt; is to &lt;em&gt;predict&lt;/em&gt; which &lt;code&gt;fault&lt;/code&gt; the steel plate has (&lt;em&gt;Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps or Other_Faults&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;As we saw in &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc"&gt;Part-1&lt;/a&gt;, since our dataset is located in Amazon S3, first we need to load the data into a table. We can open DataGrip (or whichever SQL connector you are using) and create the &lt;code&gt;schema&lt;/code&gt; and the &lt;code&gt;table&lt;/code&gt;. Once that is done, we can use the &lt;code&gt;COPY&lt;/code&gt; command to load the training data from Amazon S3 (&lt;em&gt;steel_fault_train.csv&lt;/em&gt;) into the &lt;code&gt;steel_plates_fault&lt;/code&gt; table on the &lt;code&gt;Redshift&lt;/code&gt; cluster.&lt;/p&gt;

&lt;p&gt;As always, we need to make sure that the column names of the table match the feature names in the &lt;code&gt;CSV&lt;/code&gt; training dataset file. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzc75juse71cpqm7400h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzc75juse71cpqm7400h.png" alt="img1" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we can load the test dataset (&lt;em&gt;steel_fault_test.csv&lt;/em&gt;) into a separate table, &lt;code&gt;steel_plates_fault_inference&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y3hlwt2iusemsvmosyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1y3hlwt2iusemsvmosyj.png" alt="img2" width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training (Model Creation)
&lt;/h3&gt;

&lt;p&gt;Now, as a data analyst, you may want to explicitly specify a few of the parameters, like the &lt;code&gt;PROBLEM_TYPE&lt;/code&gt; and the &lt;code&gt;OBJECTIVE&lt;/code&gt; function. When you provide this information while creating the model, Amazon SageMaker Autopilot uses the &lt;code&gt;PROBLEM_TYPE&lt;/code&gt; and &lt;code&gt;OBJECTIVE&lt;/code&gt; you specified, instead of trying everything.  &lt;/p&gt;

&lt;p&gt;For this problem, we are going to set the &lt;code&gt;PROBLEM_TYPE&lt;/code&gt; to &lt;code&gt;multiclass_classification&lt;/code&gt; and the &lt;code&gt;OBJECTIVE&lt;/code&gt; to &lt;code&gt;accuracy&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The other &lt;code&gt;PROBLEM_TYPE&lt;/code&gt; values we can specify are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;REGRESSION &lt;/li&gt;
&lt;li&gt;BINARY_CLASSIFICATION &lt;/li&gt;
&lt;li&gt;MULTICLASS_CLASSIFICATION&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly, the &lt;code&gt;OBJECTIVE&lt;/code&gt; function can be one of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MSE &lt;/li&gt;
&lt;li&gt;Accuracy &lt;/li&gt;
&lt;li&gt;F1&lt;/li&gt;
&lt;li&gt;F1Macro&lt;/li&gt;
&lt;li&gt;AUC &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44bdrvro44ujjvit294d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44bdrvro44ujjvit294d.png" alt="img3" width="800" height="844"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we learnt in &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc"&gt;Part-1&lt;/a&gt; of the tutorial, the &lt;code&gt;CREATE MODEL&lt;/code&gt; command operates in an &lt;code&gt;asynchronous&lt;/code&gt; mode and returns a response as soon as the training data has been exported to Amazon S3. Since the remaining steps of model training and compilation can take much longer, they continue to run in the background.&lt;/p&gt;

&lt;p&gt;But we can always check the status of the training by querying &lt;code&gt;STV_ML_MODEL_INFO&lt;/code&gt;, and wait till the &lt;code&gt;model_state&lt;/code&gt; becomes &lt;code&gt;Model is Ready&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqaj5ueus84qv4u4bj9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqaj5ueus84qv4u4bj9f.png" alt="img4" width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's look at the details of the model, and see if it has used the same &lt;code&gt;PROBLEM_TYPE&lt;/code&gt; and &lt;code&gt;OBJECTIVE&lt;/code&gt; function which we specified while executing the &lt;code&gt;CREATE MODEL&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0oihm67ms8dz06r5yx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0oihm67ms8dz06r5yx3.png" alt="img5" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy of the Model and Prediction/Inference
&lt;/h3&gt;

&lt;p&gt;Lastly, let's check the accuracy of our model using the test data we have in the &lt;code&gt;steel_plates_fault_inference&lt;/code&gt; table.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y5zpferntbzyu6nlf0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0y5zpferntbzyu6nlf0t.png" alt="img6" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the accuracy is around &lt;code&gt;77%&lt;/code&gt;, which is not all that great, but this is because we used a very small dataset to train the model, &lt;code&gt;func_model_steel_fault&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;And finally, let's do some predictions using the same model function. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7r65kirxmyzn20a0hny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7r65kirxmyzn20a0hny.png" alt="img7" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let's take another example and this time from a Data Scientist's perspective, wherein we will make use of some more advanced options while executing the &lt;code&gt;CREATE MODEL&lt;/code&gt; command.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exercise 3 (Data Scientist's perspective)
&lt;/h1&gt;

&lt;p&gt;The last two problems we worked on were &lt;code&gt;classification&lt;/code&gt; problems (binary and multi-class); this time we will work on a &lt;code&gt;regression&lt;/code&gt; problem and use some more advanced parameters while training the model (like specifying the training algorithm, hyperparameters, etc.). &lt;/p&gt;
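&lt;p&gt;To give a sense of what those options look like, a fully user-guided &lt;code&gt;CREATE MODEL&lt;/code&gt; statement with &lt;code&gt;AUTO OFF&lt;/code&gt; can be sketched roughly as below. Every name, the IAM role, the bucket and the hyperparameter value are placeholders, purely for illustration:&lt;/p&gt;

```sql
-- Full control: we pick the algorithm (XGBoost), the objective,
-- the preprocessors and the hyperparameters ourselves
CREATE MODEL model_abalone_age
FROM abalone_xgb_train
TARGET age
FUNCTION func_model_abalone_age
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'reg:squarederror'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100')
SETTINGS (S3_BUCKET 'your-bucket');
```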

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;In this problem, we are going to use the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Abalone" rel="noopener noreferrer"&gt;Abalone Data Set&lt;/a&gt; from &lt;a href="https://archive.ics.uci.edu/ml/index.php" rel="noopener noreferrer"&gt;UCI Machine Learning Repository&lt;/a&gt;. You can download the dataset from this &lt;a href="https://github.com/debnsuma/redshiftml-demo/tree/master/dataset" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this problem we need to predict the age of an abalone from its physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope -- a tedious and time-consuming task. &lt;/p&gt;

&lt;p&gt;The dataset has a total of 7 &lt;code&gt;input features&lt;/code&gt; and 1 &lt;code&gt;target&lt;/code&gt;, which is nothing but the &lt;code&gt;age&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So, first let's create the &lt;code&gt;schema&lt;/code&gt; and the &lt;code&gt;table&lt;/code&gt;. Once that is done, we can use the &lt;code&gt;COPY&lt;/code&gt; command to load the training data from Amazon S3 (&lt;em&gt;xgboost_abalone_train.csv&lt;/em&gt;) into the table &lt;code&gt;abalone_xgb_train&lt;/code&gt; in the &lt;code&gt;Redshift&lt;/code&gt; cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvgh321lf6kqq12xn6sq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvgh321lf6kqq12xn6sq.png" alt="img8" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we can load the test dataset (&lt;em&gt;xgboost_abalone_test.csv&lt;/em&gt;) from Amazon S3 into a separate table, &lt;code&gt;abalone_xgb_test&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9pozfykyrbpkcdp1zhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9pozfykyrbpkcdp1zhn.png" alt="img9" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training (Model Creation)
&lt;/h3&gt;

&lt;p&gt;As a data scientist, you may like to have more control over training the model, e.g., you may decide to provide more granular options, like &lt;code&gt;MODEL_TYPE&lt;/code&gt;, &lt;code&gt;OBJECTIVE&lt;/code&gt;, &lt;code&gt;PREPROCESSORS&lt;/code&gt; and &lt;code&gt;HYPERPARAMETERS&lt;/code&gt;, while running the &lt;code&gt;CREATE MODEL&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;As an advanced user, you may already know the model type that you want and the hyperparameters to use when training it. You can run the &lt;code&gt;CREATE MODEL&lt;/code&gt; command with &lt;code&gt;AUTO OFF&lt;/code&gt; to turn off the automatic discovery of preprocessors and hyperparameters. &lt;/p&gt;

&lt;p&gt;For this problem we are going to specify &lt;code&gt;MODEL_TYPE&lt;/code&gt; as &lt;code&gt;xgboost&lt;/code&gt; (&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html" rel="noopener noreferrer"&gt;eXtreme Gradient Boosting&lt;/a&gt;), which we can use for both regression and classification problems. XGBoost is currently the &lt;em&gt;only&lt;/em&gt; &lt;code&gt;MODEL_TYPE&lt;/code&gt; supported when &lt;code&gt;AUTO&lt;/code&gt; is set to &lt;code&gt;OFF&lt;/code&gt;. We are also going to set the &lt;code&gt;OBJECTIVE&lt;/code&gt; function to &lt;code&gt;reg:squarederror&lt;/code&gt;. You can specify hyperparameters as well. For more details, you may like to check the &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html" rel="noopener noreferrer"&gt;Amazon Redshift ML Developer Guide&lt;/a&gt; (&lt;em&gt;CREATE MODEL section&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flla3ep6bp0xxm1j88w4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flla3ep6bp0xxm1j88w4a.png" alt="img10" width="800" height="725"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's look at the details about the model, as we did before:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht0jluur0097289f5xv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fht0jluur0097289f5xv5.png" alt="img11" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy of the Model and Prediction/Inference
&lt;/h3&gt;

&lt;p&gt;Now, let's check the accuracy of our model, using the test data we have in the &lt;code&gt;abalone_xgb_test&lt;/code&gt; table.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11hq6vpd5lhcy61odja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc11hq6vpd5lhcy61odja.png" alt="img12" width="755" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, let's try to do some prediction using this same model function, &lt;code&gt;func_model_abalone_xgboost_regression&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fknr2d1r915z6cbxqgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fknr2d1r915z6cbxqgq.png" alt="img13" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What next...
&lt;/h1&gt;

&lt;p&gt;So, in this tutorial we learnt about Amazon Redshift ML from an advanced user's perspective (such as a Data Analyst or Data Scientist), and learnt how we can &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; an ML model using familiar SQL queries. If you are interested, you can go further and explore the training jobs it internally initiates in the &lt;code&gt;Amazon SageMaker&lt;/code&gt; console. Feel free to give it a try and share your feedback. &lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; : &lt;a href="https://github.com/debnsuma/redshiftml-demo.git" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blog&lt;/strong&gt; : &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc"&gt;Amazon Redshift ML - Machine Learning in SQL Style (Part-1)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/redshift/features/redshift-ml/" rel="noopener noreferrer"&gt;Using machine learning in Amazon Redshift&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Book&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Learn-Amazon-SageMaker-developers-scientists-ebook/dp/B08FMWJXGN" rel="noopener noreferrer"&gt;Learn SageMaker&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Videos&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=8zHh2DoQZBs" rel="noopener noreferrer"&gt;AWS on Air 2020: AWS What’s Next ft. Amazon Redshift Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3BN1w8JUtD4" rel="noopener noreferrer"&gt;AWS re:Invent 2020: Introducing Amazon Redshift Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Machine Learning in SQL Style (Part-1)</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Thu, 04 Mar 2021 14:49:18 +0000</pubDate>
      <link>https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc</link>
      <guid>https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-1-22nc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcbf6jc6dafw3e47935c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcbf6jc6dafw3e47935c.png" alt="imgh1" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Machine learning (ML) is everywhere; look around and you will see that one application or another is either built using ML or powered by ML. And with the advent of technology, especially the cloud, ML is becoming more and more accessible to developers every passing day, irrespective of their background. We at Amazon Web Services (AWS) are committed to putting machine learning in the hands of every developer, data scientist and expert practitioner. Now, what if you could &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; a machine learning model using simple SQL commands? &lt;/p&gt;

&lt;p&gt;During &lt;a href="https://reinvent.awsevents.com/" rel="noopener noreferrer"&gt;re:Invent 2020&lt;/a&gt;, we announced &lt;code&gt;Amazon Redshift ML&lt;/code&gt;, which makes it easy for SQL users to &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt;, and &lt;code&gt;deploy&lt;/code&gt; ML models using familiar SQL commands. &lt;a href="https://aws.amazon.com/redshift/features/redshift-ml/" rel="noopener noreferrer"&gt;Amazon Redshift ML&lt;/a&gt; allows you to use your data in &lt;a href="https://aws.amazon.com/redshift" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; with &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;Amazon SageMaker&lt;/a&gt; (&lt;em&gt;a fully managed ML service&lt;/em&gt;), without requiring you to become an expert in ML.&lt;/p&gt;

&lt;p&gt;Now, before we dive deep into what it is, how it works, etc. here are the things we will try to cover in this first part of the tutorial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Amazon Redshift &lt;/li&gt;
&lt;li&gt;Introduction to Redshift ML &lt;/li&gt;
&lt;li&gt;How to get started and the prerequisites
&lt;/li&gt;
&lt;li&gt;I am a Database Administrator - What's in it for me?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in the &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl"&gt;Part-2&lt;/a&gt;, we will take that learning beyond and cover the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I am a Data Analyst - What about me? &lt;/li&gt;
&lt;li&gt;I am a Data Scientist - How can I make use of this? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, we will solve different problems that will help us understand Amazon Redshift ML from the perspectives of a database administrator, a data analyst and an advanced machine learning expert. &lt;/p&gt;

&lt;p&gt;Before we get started, let's set the stage by reviewing what Amazon Redshift is.&lt;/p&gt;

&lt;h1&gt;
  
  
  Amazon Redshift
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;Amazon Redshift&lt;/code&gt; is a fully managed, petabyte-scale data warehousing service on AWS. It is a low-cost, highly scalable service that allows you to get started on your data warehouse use cases at minimal cost and scale as the demand for your data grows. It uses a variety of innovations to obtain very high query performance on datasets ranging in size from a hundred gigabytes to a petabyte or more. It uses massively parallel processing (MPP), columnar storage and data compression encoding schemes to reduce the amount of I/O needed to perform queries, which allows it to distribute SQL operations and take advantage of all available resources underneath. &lt;/p&gt;

&lt;p&gt;Let's quickly go over a few core components of an Amazon Redshift cluster: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2paxdnsuf7nxu5ferc1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2paxdnsuf7nxu5ferc1p.png" alt="img1" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Client Application&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amazon Redshift integrates with various data loading and &lt;code&gt;ETL&lt;/code&gt; (&lt;em&gt;extract, transform, and load&lt;/em&gt;) tools as well as business intelligence (BI) reporting, data mining, and analytics tools. As Amazon Redshift is based on industry-standard &lt;em&gt;PostgreSQL&lt;/em&gt;, most commonly used SQL client applications should work. We are going to use &lt;a href="https://www.jetbrains.com/datagrip/" rel="noopener noreferrer"&gt;Jetbrains DataGrip&lt;/a&gt; to connect to our Redshift cluster (&lt;em&gt;via a JDBC connection&lt;/em&gt;) when we jump into the hands-on section. Having said that, you may like to use any other SQL client tool, like &lt;em&gt;SQL Workbench/J, the psql tool, etc.&lt;/em&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Cluster&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The core infrastructure component of an Amazon Redshift data warehouse is a &lt;code&gt;cluster&lt;/code&gt;. A cluster is composed of one or more compute nodes. As shown in the image above, Redshift has two major node types: the &lt;code&gt;leader node&lt;/code&gt; and the &lt;code&gt;compute node&lt;/code&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Leader Node&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If we create a cluster with two or more compute nodes, an additional &lt;code&gt;leader node&lt;/code&gt; coordinates the &lt;code&gt;compute nodes&lt;/code&gt; and handles external communication. We don't have to define a leader node; one is automatically provisioned with every Redshift cluster. Once the cluster is created, the client application interacts directly only with the leader node. In other words, the &lt;code&gt;leader node&lt;/code&gt; behaves as the gateway (the SQL endpoint) of your cluster for all clients. A few of the major tasks of the leader node are to store the metadata, to coordinate with all the &lt;code&gt;compute nodes&lt;/code&gt; for parallel SQL processing, and to generate the most optimized and efficient query plan.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Compute Nodes&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;code&gt;compute nodes&lt;/code&gt; are the main workhorses of the Redshift cluster, and they sit behind the &lt;code&gt;leader node&lt;/code&gt;. The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute node(s). After that, the compute node(s) execute the respective compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU, memory, and attached storage, which are determined by the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html" rel="noopener noreferrer"&gt;node type&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;node type&lt;/code&gt; determines the CPU, RAM, storage capacity, and storage drive type for each node. Amazon Redshift offers different node types to accommodate different types of workloads, so you can select whichever suits you best, but it is recommended to use &lt;code&gt;ra3&lt;/code&gt;. The new &lt;code&gt;ra3&lt;/code&gt; nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs.&lt;/p&gt;

&lt;p&gt;Ok, now that we understand a bit about the Redshift cluster, let's go back to the main topic, Redshift ML :)&lt;br&gt;
And don't worry if things still feel dry; as soon as we jump into the demo and create a cluster from scratch, things will fall into place. &lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction to Redshift ML
&lt;/h1&gt;

&lt;p&gt;We have been integrating ML functionality with many other services for a long time; for example, at re:Invent 2019 we announced &lt;a href="https://aws.amazon.com/rds/aurora/machine-learning/" rel="noopener noreferrer"&gt;Amazon Aurora Machine Learning&lt;/a&gt;, which enables you to add ML-based predictions to your applications via the familiar SQL programming language. Integration with ML is very important in the world we live in today. It helps any developer build, train, and deploy ML models efficiently and at scale.&lt;/p&gt;

&lt;p&gt;Following that tradition, during &lt;a href="https://reinvent.awsevents.com/" rel="noopener noreferrer"&gt;re:Invent 2020&lt;/a&gt; we announced a new capability called Redshift ML, which enables any SQL user to &lt;code&gt;train&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; ML models using familiar SQL commands, without knowing much about machine learning. Having said that, if you are an intermediate machine learning practitioner or an expert Data Scientist, you still get the flexibility to define specific algorithms such as XGBoost and specify hyperparameters and preprocessors. &lt;/p&gt;

&lt;p&gt;The way it works is pretty simple: you provide the data that you want to train the model on, along with the metadata associated with the data inputs, to Amazon Redshift, and Amazon Redshift ML creates a model that captures patterns in the input data. Once the model is trained, you can use it to generate predictions for new input data without incurring additional costs.&lt;/p&gt;

&lt;p&gt;As of now, Amazon Redshift supports &lt;code&gt;supervised learning&lt;/code&gt;, which includes the following problem types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;regression&lt;/code&gt;: problem of predicting continuous values, such as the total spending of customers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;binary classification&lt;/code&gt;: problem of predicting one of two outcomes, such as predicting whether a customer churns or not&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;multi-class classification&lt;/code&gt;: problem of predicting one of many outcomes, such as predicting the item a customer might be interested in&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt;&lt;/em&gt; is the machine learning task of &lt;em&gt;learning a function&lt;/em&gt; that maps an &lt;em&gt;input&lt;/em&gt; to an &lt;em&gt;output&lt;/em&gt; based on example &lt;em&gt;input-output&lt;/em&gt; pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The inputs used for the ML model are often referred to as &lt;code&gt;features&lt;/code&gt; or in ML terms, called &lt;code&gt;independent variables&lt;/code&gt;, and the outcomes or results are called &lt;code&gt;labels&lt;/code&gt; or &lt;code&gt;dependent variables&lt;/code&gt;. Your training dataset is a table or a query whose attributes or columns comprise features, and targets are extracted from your data warehouse. The following diagram illustrates this architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd344hueykvolea2bdpak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd344hueykvolea2bdpak.png" alt="img2" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we understand that data analysts and database developers are very familiar with SQL, as they use it day in, day out. But to build, train and deploy any ML model in Amazon SageMaker, one needs to learn a programming language (like Python) and study different types of machine learning algorithms, building an understanding of which algorithm to use for a particular problem. Or else you may rely on some ML expert to do the job on your behalf. &lt;/p&gt;

&lt;p&gt;Not just that: even if someone helped you build, train and deploy your ML model, when you actually need to use the model to make predictions on your new data, you need to repeatedly move the data back and forth between Amazon Redshift and Amazon SageMaker through a series of manual and complicated steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Export training data to Amazon Simple Storage Service (Amazon S3).&lt;/li&gt;
&lt;li&gt;Train the model in Amazon SageMaker.&lt;/li&gt;
&lt;li&gt;Export prediction input data to Amazon S3.&lt;/li&gt;
&lt;li&gt;Run predictions in Amazon SageMaker.&lt;/li&gt;
&lt;li&gt;Import predicted columns back into the database. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbq2y3ew4gfb9i45aehi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbq2y3ew4gfb9i45aehi.png" alt="img3" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All this is daunting, isn't it? &lt;/p&gt;

&lt;p&gt;But now, with Amazon Redshift ML, we don't have to do any of this; you can train a model with one single SQL &lt;code&gt;CREATE MODEL&lt;/code&gt; command. So, you don't need expertise in machine learning tools, languages, algorithms, and APIs. &lt;/p&gt;

&lt;p&gt;Once you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. &lt;/p&gt;

&lt;p&gt;And once the &lt;code&gt;model&lt;/code&gt; is trained, Amazon Redshift ML makes &lt;code&gt;model&lt;/code&gt; available as a SQL function in your Amazon Redshift data warehouse. &lt;/p&gt;
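
&lt;p&gt;To make this concrete, a minimal &lt;code&gt;CREATE MODEL&lt;/code&gt; in its fully automatic form looks roughly like this (a sketch only; the table, columns, function name, IAM role, and bucket are placeholders):&lt;/p&gt;

```sql
CREATE MODEL customer_churn_model
FROM (SELECT age, monthly_spend, tenure_months, churned   -- training query
      FROM customer_activity)
TARGET churned                       -- the label column to predict
FUNCTION predict_customer_churn      -- SQL function exposed after training
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (s3_bucket 'your-bucket');  -- staging bucket for SageMaker Autopilot

-- Once trained, inference is an ordinary SQL call:
-- SELECT predict_customer_churn(age, monthly_spend, tenure_months) FROM new_customers;
```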

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuvbsno2gr0wyjbko6c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuvbsno2gr0wyjbko6c1.png" alt="img4" width="774" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, let's see all in action now...&lt;/p&gt;

&lt;h1&gt;
  
  
  Create Redshift Cluster
&lt;/h1&gt;

&lt;p&gt;Let's create a Redshift cluster now. First, we need to log in to our AWS Console, search for Redshift and click on &lt;code&gt;Create cluster&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmvu3n7jlt5wzg7qyfa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmvu3n7jlt5wzg7qyfa7.png" alt="img5" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, in the &lt;strong&gt;Cluster configuration&lt;/strong&gt; section, we need to provide a cluster identifier, say &lt;code&gt;redshift-cluster-1&lt;/code&gt;, and select the appropriate &lt;code&gt;node type&lt;/code&gt; and the number of nodes we would like to have in the cluster. As mentioned before, we recommend choosing the &lt;code&gt;RA3&lt;/code&gt; node types, like &lt;em&gt;ra3.xlplus, ra3.4xlarge&lt;/em&gt; and &lt;em&gt;ra3.16xlarge&lt;/em&gt;, which offer best-in-class performance with scalable managed storage. For our demo we will select the &lt;code&gt;ra3.4xlarge&lt;/code&gt; node type and create the cluster with 2 such nodes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw39q0vj186a1qvjlr6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw39q0vj186a1qvjlr6l.png" alt="img6" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, under &lt;strong&gt;Database configuration&lt;/strong&gt;, we need to provide our database name, port number (where the database will accept inbound connections), master username and password. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fee0ec9xy0uon68ltdo64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fee0ec9xy0uon68ltdo64.png" alt="img7" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we need to expand the &lt;strong&gt;Cluster permissions&lt;/strong&gt; section and attach an &lt;code&gt;IAM&lt;/code&gt; role. Since our cluster will use Amazon S3 and Amazon SageMaker later on, we need to provide adequate permissions so that our Redshift cluster can access data saved in Amazon S3, and Redshift ML can access Amazon SageMaker to build and train the model. We have already created an IAM role named &lt;code&gt;RedshiftMLRole&lt;/code&gt;. We can just select the right IAM role from the dropdown and click on &lt;code&gt;Associate IAM role&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fay5ya2qzm3vdthryymdr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fay5ya2qzm3vdthryymdr.png" alt="img8" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to create an IAM role with a more restrictive policy, you can use a policy like the following, modifying it to meet your needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptrforck25cn9zji41gr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptrforck25cn9zji41gr.png" alt="img9" width="800" height="1250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, if you would like to connect to this cluster from instances/devices outside the VPC via the cluster endpoint, you would need to enable the &lt;code&gt;Publicly accessible&lt;/code&gt; option as below. However, enabling &lt;code&gt;Publicly accessible&lt;/code&gt; is not recommended; in our demo we are going to use an &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/private-redshift-cluster-local-machine/" rel="noopener noreferrer"&gt;Amazon EC2 instance to connect to the cluster via SSH tunneling&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws4k0k1bzzs2g6uh9tn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffws4k0k1bzzs2g6uh9tn.png" alt="img11" width="800" height="742"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just review all the configurations and click on &lt;code&gt;Create cluster&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgcjrxgccko0zy74gd8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbgcjrxgccko0zy74gd8v.png" alt="img12" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Connecting to Redshift Cluster
&lt;/h1&gt;

&lt;p&gt;Next, we can use any tool of our choice to connect to our cluster and we are going to use &lt;a href="https://www.jetbrains.com/datagrip/" rel="noopener noreferrer"&gt;Jetbrains DataGrip&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Now, if you have created the cluster with &lt;code&gt;Publicly accessible&lt;/code&gt; enabled, then you can connect to the cluster directly, but since we created the cluster without public access, we are going to use an &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/private-redshift-cluster-local-machine/" rel="noopener noreferrer"&gt;Amazon EC2 instance to connect to the cluster via SSH tunneling&lt;/a&gt;, as mentioned above. For that, we have already created an Amazon EC2 instance in the same region where we created our Redshift cluster, and we are going to use that instance to access the cluster via SSH tunneling.  &lt;/p&gt;
&lt;/blockquote&gt;
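
&lt;p&gt;The tunnel itself is a single &lt;code&gt;ssh&lt;/code&gt; command run from your local machine (a sketch; the key file, EC2 host, and cluster endpoint below are placeholders):&lt;/p&gt;

```
# Forward local port 5439 through the EC2 instance to the Redshift endpoint.
ssh -i my-key.pem \
    -L 5439:redshift-cluster-1.xxxxxx.us-east-1.redshift.amazonaws.com:5439 \
    ec2-user@ec2-host.example.com -N

# The cluster is then reachable locally at jdbc:redshift://localhost:5439/<dbname>
```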

&lt;p&gt;But before we connect, we first need to know the &lt;code&gt;JDBC URL&lt;/code&gt; endpoint of our cluster. For that, we can click on our cluster and copy the &lt;code&gt;JDBC URL&lt;/code&gt; to our clipboard. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl7jajefnwr5de5uetgc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwl7jajefnwr5de5uetgc.png" alt="img13" width="716" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8fw5cuza8uyhx07l7b7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8fw5cuza8uyhx07l7b7.png" alt="img14" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we can open DataGrip (or any tool of your choice) and connect to the cluster using the &lt;code&gt;JDBC&lt;/code&gt; URL, &lt;code&gt;user name&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; which we used while creating the cluster, and test the connection. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75yivh961nrpnr1zajcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75yivh961nrpnr1zajcb.png" alt="img15" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then go to the &lt;code&gt;SSH/SSL&lt;/code&gt; option to add the tunnel; this is where we need to enter the details of the Amazon EC2 instance we created earlier. Once that is done, we can click on &lt;code&gt;Test Connection&lt;/code&gt; to verify that everything is working. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5wbnl07v579vjteoi2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5wbnl07v579vjteoi2a.png" alt="img15" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, we are now all set to see Redshift ML in action :) &lt;/p&gt;

&lt;h1&gt;
  
  
  Dataset
&lt;/h1&gt;

&lt;p&gt;We are going to walk through 3 demos next, showing different aspects and functionalities of Redshift ML, which will hopefully help you understand different use cases and learn how to make use of Redshift ML irrespective of your background. Whether you are a &lt;code&gt;Database Engineer/Administrator&lt;/code&gt;, a &lt;code&gt;Data Analyst&lt;/code&gt; or an advanced &lt;code&gt;Machine Learning&lt;/code&gt; practitioner, we will cover a demo from the perspective of each of these personas.&lt;/p&gt;

&lt;p&gt;First we need to make sure we upload the dataset to S3 (we have uploaded all the datasets to our Amazon S3 bucket, &lt;code&gt;redshift-downloads-2021&lt;/code&gt;). All the datasets can be found in this &lt;a href="https://github.com/debnsuma/redshiftml-demo.git" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exercise 1 (Database Engineer's perspective)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;In this problem, we are going to use the &lt;a href="https://archive.ics.uci.edu/ml/datasets/bank+marketing" rel="noopener noreferrer"&gt;Bank Marketing Data Set&lt;/a&gt; from the &lt;a href="https://archive.ics.uci.edu/ml/index.php" rel="noopener noreferrer"&gt;UCI Machine Learning Repository&lt;/a&gt;. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;objective&lt;/code&gt; is to &lt;em&gt;predict&lt;/em&gt; whether the client will subscribe (yes/no) to a bank term deposit (variable y). &lt;/p&gt;

&lt;p&gt;The dataset consists of a total of 20 &lt;code&gt;features/input variables&lt;/code&gt; and one &lt;code&gt;class label/output variable&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Since our dataset is located in Amazon S3, we first need to load the data into a table. We can open DataGrip (or whichever SQL client you are using) and create the &lt;code&gt;schema&lt;/code&gt; and the &lt;code&gt;table&lt;/code&gt;. Once that is done, we can use the &lt;code&gt;COPY&lt;/code&gt; command to load the training data from Amazon S3 (&lt;em&gt;bank-additional-full.csv&lt;/em&gt;) into the &lt;code&gt;client_details&lt;/code&gt; table on the &lt;code&gt;Redshift&lt;/code&gt; cluster.&lt;/p&gt;

&lt;p&gt;We need to make sure that the column names of the table match the feature set in the &lt;code&gt;CSV&lt;/code&gt; training dataset file. &lt;/p&gt;
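&lt;p&gt;As a rough sketch (the schema name, the abbreviated column list and the IAM role ARN below are placeholders; the real table needs all 20 feature columns matching the CSV header), the load step looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE SCHEMA demo_ml;

CREATE TABLE demo_ml.client_details (
    age       INTEGER,
    job       VARCHAR,
    marital   VARCHAR,
    -- ... the remaining feature columns, matching the CSV header ...
    y         VARCHAR  -- the class label
);

-- Load the training data from S3
COPY demo_ml.client_details
FROM 's3://redshift-downloads-2021/bank-additional-full.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
CSV IGNOREHEADER 1;
&lt;/code&gt;&lt;/pre&gt;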

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfc6y561117y1d97yl5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfc6y561117y1d97yl5v.png" alt="img16" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we can load the test dataset (&lt;em&gt;bank-additional-inference.csv&lt;/em&gt;) into a separate table, &lt;code&gt;client_details_inference&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbvu2yz93wstfh3i55oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbvu2yz93wstfh3i55oz.png" alt="img17" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Users and Groups
&lt;/h3&gt;

&lt;p&gt;Before we try to create the model, we need to make sure that the user has the right permissions. Just like how Amazon Redshift manages other database objects, such as tables, views, or functions, Amazon Redshift binds &lt;code&gt;model creation&lt;/code&gt; and use to access control mechanisms. There are separate privileges for creating a model and for running prediction functions.&lt;/p&gt;

&lt;p&gt;Here we are going to create 2 user groups, &lt;code&gt;dbdev_group&lt;/code&gt; (users who will use the model for prediction) and &lt;code&gt;datascience_group&lt;/code&gt; (users who will create the model), and within these groups we will have one user each, &lt;code&gt;dbdev_user&lt;/code&gt; and &lt;code&gt;datascience_user&lt;/code&gt; respectively. &lt;/p&gt;
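&lt;p&gt;A minimal sketch of that setup (the passwords are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE GROUP dbdev_group;
CREATE GROUP datascience_group;

CREATE USER dbdev_user PASSWORD 'ChangeMe123' IN GROUP dbdev_group;
CREATE USER datascience_user PASSWORD 'ChangeMe123' IN GROUP datascience_group;
&lt;/code&gt;&lt;/pre&gt;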

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu7qmekjysoe4k5h6h99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu7qmekjysoe4k5h6h99.png" alt="img18" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we can grant the appropriate access/permissions to the respective groups and authorize the user &lt;code&gt;datascience_user&lt;/code&gt; as the current user.&lt;/p&gt;
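&lt;p&gt;As a sketch, the grants could look something like the following (the schema name is a placeholder, and your exact set of grants may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Let the data science group create models and read the training table
GRANT CREATE MODEL TO GROUP datascience_group;
GRANT USAGE ON SCHEMA demo_ml TO GROUP datascience_group;
GRANT SELECT ON demo_ml.client_details TO GROUP datascience_group;

-- Run the following statements as the data science user
SET SESSION AUTHORIZATION datascience_user;
&lt;/code&gt;&lt;/pre&gt;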

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6v9822iphtby94mfrjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6v9822iphtby94mfrjh.png" alt="img19" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training (Model Creation)
&lt;/h3&gt;

&lt;p&gt;Finally, we are all set to create the model using a simple &lt;code&gt;CREATE MODEL&lt;/code&gt; command, which will export the training data, train a model, import the model, and prepare an Amazon Redshift prediction function under the hood.&lt;/p&gt;
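&lt;p&gt;In its simplest form the command looks like this (a sketch; the schema name and IAM role ARN are placeholders, the prediction function name mirrors the one referenced later in this post, and the S3 bucket is where intermediate training artifacts go):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MODEL demo_ml.model_bank_marketing
FROM demo_ml.client_details          -- training data: features plus the label
TARGET y                             -- the column to learn to predict
FUNCTION func_model_bank_marketing2  -- SQL function generated for inference
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'redshift-downloads-2021');
&lt;/code&gt;&lt;/pre&gt;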

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlkphhnzfgkd8cws3wjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlkphhnzfgkd8cws3wjv.png" alt="img20" width="800" height="980"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things to note here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;SELECT&lt;/code&gt; query above selects the training data, i.e. the input feature columns together with the label column &lt;code&gt;y&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;TARGET&lt;/code&gt; clause specifies which column should be used as the &lt;code&gt;class label&lt;/code&gt; that &lt;code&gt;CREATE MODEL&lt;/code&gt; learns to predict, i.e. the &lt;code&gt;y&lt;/code&gt; column. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes, Amazon Redshift will use Amazon SageMaker Autopilot for &lt;code&gt;training&lt;/code&gt;. At this point, Amazon Redshift will immediately start using Amazon SageMaker to train and tune the best model for this binary classification problem (as the output or class label can be either &lt;code&gt;yes&lt;/code&gt; or &lt;code&gt;no&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CREATE MODEL&lt;/code&gt; command operates in an &lt;code&gt;asynchronous&lt;/code&gt; mode and returns as soon as the training data has been exported to Amazon S3. The remaining steps of model training and compilation can take longer, so they continue to run in the background.&lt;/p&gt;

&lt;p&gt;But we can always check the status of the training using the &lt;code&gt;STV_ML_MODEL_INFO&lt;/code&gt; system table. &lt;/p&gt;
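&lt;p&gt;For example, a quick status check can be as simple as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT * FROM stv_ml_model_info;
&lt;/code&gt;&lt;/pre&gt;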

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdulejy3oa92p622nh8fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdulejy3oa92p622nh8fo.png" alt="img21" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the training is done, we can use the &lt;code&gt;SHOW MODEL ALL&lt;/code&gt; command to see all the models which we have access to:&lt;/p&gt;
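&lt;p&gt;For instance (the model name below is a hypothetical placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- List all models we have access to
SHOW MODEL ALL;

-- Show the details (status, metrics, model type, ...) of one model
SHOW MODEL demo_ml.model_bank_marketing;
&lt;/code&gt;&lt;/pre&gt;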

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz17dkpixhvpqer49du7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz17dkpixhvpqer49du7j.png" alt="img22" width="748" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see some more details about the model, e.g. model performance (like accuracy, F1 score, MSE, etc., depending on the problem type), model type, problem type, and so on. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd4zekwwu1rn7bmt55jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd4zekwwu1rn7bmt55jd.png" alt="img23" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy of the Model and Prediction/Inference
&lt;/h3&gt;

&lt;p&gt;Now that we have the new SQL function, &lt;code&gt;func_model_bank_marketing2&lt;/code&gt;, we can use it for prediction. But before we do so, let's first grant the appropriate access/permission to the &lt;code&gt;dbdev_group&lt;/code&gt; so that the &lt;code&gt;dbdev_user&lt;/code&gt; can use the function for prediction. Once that is done, we can change the authorization to &lt;code&gt;dbdev_user&lt;/code&gt;, as we expect the prediction operation to be executed by the database engineers or data analysts, and not necessarily only by the data scientists in the organization. &lt;/p&gt;
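&lt;p&gt;A sketch of those two steps (the model name here is a hypothetical placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Allow the dev group to run the prediction function
GRANT EXECUTE ON MODEL demo_ml.model_bank_marketing TO GROUP dbdev_group;

-- Switch to the database developer user for the prediction queries
SET SESSION AUTHORIZATION dbdev_user;
&lt;/code&gt;&lt;/pre&gt;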

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc4s8lpxcl9n0z7e71pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc4s8lpxcl9n0z7e71pk.png" alt="img24" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, let's see what the accuracy of our model is, using the test data which we have in the &lt;code&gt;client_details_inference&lt;/code&gt; table.  &lt;/p&gt;
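&lt;p&gt;One way to sketch such an accuracy check is to compare the predictions against the actual labels (the argument list passed to the function is abbreviated here; it must include all 20 feature columns in the order the model expects, and the schema name is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;WITH infer_data AS (
    SELECT y AS actual,
           func_model_bank_marketing2(age, job, marital /* , ...remaining features */) AS predicted
    FROM demo_ml.client_details_inference
)
SELECT SUM(CASE WHEN actual = predicted THEN 1 ELSE 0 END)::FLOAT
       / COUNT(*) AS accuracy
FROM infer_data;
&lt;/code&gt;&lt;/pre&gt;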

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffukkxkm4pv1n11m4avyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffukkxkm4pv1n11m4avyl.png" alt="img25" width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, the accuracy is around &lt;code&gt;94%&lt;/code&gt;, which is not bad at all considering the small dataset we used for this problem. More importantly, we can see how easily we can use simple SQL queries to &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; our ML models using Redshift ML. &lt;/p&gt;

&lt;p&gt;And finally, let's try to do some predictions using this same model function. &lt;/p&gt;
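&lt;p&gt;For example (again with an abbreviated feature list and a placeholder schema name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT func_model_bank_marketing2(age, job, marital /* , ...remaining features */) AS will_subscribe
FROM demo_ml.client_details_inference
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;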

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j0ml2sgfp24v6kcftg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3j0ml2sgfp24v6kcftg.png" alt="img26" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl"&gt;Part 2&lt;/a&gt; of this tutorial series we will cover a few more advanced functionalities, from the viewpoint of a Data Analyst or an expert Data Scientist, wherein you can define many advanced options, like the &lt;code&gt;model type&lt;/code&gt;, &lt;code&gt;hyperparameters&lt;/code&gt;, &lt;code&gt;objective function&lt;/code&gt;, &lt;code&gt;pre-processors&lt;/code&gt;, and so on.&lt;/p&gt;

&lt;p&gt;But before we move on to &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl"&gt;Part 2&lt;/a&gt;, let's spend some time understanding the &lt;code&gt;cost&lt;/code&gt; considerations of using Redshift ML and how you can control them. &lt;/p&gt;

&lt;h3&gt;
  
  
  Cost and Redshift ML
&lt;/h3&gt;

&lt;p&gt;As Amazon Redshift ML uses the existing cluster resources for prediction, there are no additional Amazon Redshift charges. That is, there is no additional Amazon Redshift charge for creating or using a model,&lt;br&gt;
and as prediction happens locally in your Amazon Redshift cluster, you don't have to pay extra. &lt;/p&gt;

&lt;p&gt;But, as we learned, Amazon Redshift ML uses Amazon SageMaker for training our model, and that does have an additional associated cost. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CREATE MODEL&lt;/code&gt; statement uses Amazon SageMaker, as we have seen before, and that incurs an additional cost. The cost increases with the &lt;code&gt;number of cells&lt;/code&gt; in your training data. The number of cells is the &lt;code&gt;number of records&lt;/code&gt; (in the training query or table) &lt;em&gt;times&lt;/em&gt; the &lt;code&gt;number of columns&lt;/code&gt;. For example, when the &lt;code&gt;SELECT&lt;/code&gt; query of the &lt;code&gt;CREATE MODEL&lt;/code&gt; statement produces 100,000 records and 50 columns, the &lt;code&gt;number of cells&lt;/code&gt; it creates is 5,000,000.&lt;/p&gt;

&lt;p&gt;One way to control the cost is by using the two options &lt;code&gt;MAX_CELLS&lt;/code&gt; and &lt;code&gt;MAX_RUNTIME&lt;/code&gt; in the &lt;code&gt;CREATE MODEL&lt;/code&gt; statement. &lt;code&gt;MAX_RUNTIME&lt;/code&gt; specifies the maximum amount of time the training can take in SageMaker when the &lt;code&gt;AUTO ON/OFF&lt;/code&gt; option is used. Training jobs can complete sooner than &lt;code&gt;MAX_RUNTIME&lt;/code&gt;, depending on the size of the dataset, but there is additional work which Amazon Redshift performs after the model is trained, like compiling and installing the model in your cluster, so the &lt;code&gt;CREATE MODEL&lt;/code&gt; command can take a little longer than &lt;code&gt;MAX_RUNTIME&lt;/code&gt; to complete. This option can be used to limit the cost, as it controls the time Amazon SageMaker is allowed to spend training your model. &lt;/p&gt;

&lt;p&gt;Under the hood, when you run &lt;code&gt;CREATE MODEL&lt;/code&gt; with &lt;code&gt;AUTO ON&lt;/code&gt;, Amazon Redshift ML uses &lt;code&gt;SageMaker Autopilot&lt;/code&gt;, which automatically explores different models (or candidates) to find the best one. &lt;code&gt;MAX_RUNTIME&lt;/code&gt; limits the amount of time and computation spent; if &lt;code&gt;MAX_RUNTIME&lt;/code&gt; is set too low, there might not be enough time to explore even a single candidate. In that case you would get an error saying "&lt;em&gt;Autopilot candidate has no models&lt;/em&gt;", and you would need to re-run &lt;code&gt;CREATE MODEL&lt;/code&gt; with a larger &lt;code&gt;MAX_RUNTIME&lt;/code&gt; value. &lt;/p&gt;

&lt;p&gt;Another way to control the cost of training (&lt;em&gt;which may not always be a good idea, as it can affect model accuracy&lt;/em&gt;) is by specifying a smaller &lt;code&gt;MAX_CELLS&lt;/code&gt; value when you run the &lt;code&gt;CREATE MODEL&lt;/code&gt; command. &lt;code&gt;MAX_CELLS&lt;/code&gt; limits the number of cells, and thus the number of training examples used to train your model. &lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;MAX_CELLS&lt;/code&gt; is set to 1 million cells. Reducing MAX_CELLS reduces the number of rows from the result of the SELECT query in &lt;code&gt;CREATE MODEL&lt;/code&gt; that Amazon Redshift exports and sends to SageMaker to train a model. Reducing MAX_CELLS thus reduces the size of the dataset used to train models both with AUTO ON and AUTO OFF. This approach helps reduce the costs and time to train models.&lt;/p&gt;
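&lt;p&gt;Putting the two options together, a cost-capped training run might look like this (a sketch with placeholder names, ARN and limits):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE MODEL demo_ml.model_bank_marketing_budget
FROM demo_ml.client_details
TARGET y
FUNCTION func_model_bank_marketing_budget
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET   'redshift-downloads-2021',
    MAX_RUNTIME 1800,     -- cap SageMaker training time at 30 minutes
    MAX_CELLS   500000    -- cap rows x columns exported for training
);
&lt;/code&gt;&lt;/pre&gt;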

&lt;p&gt;In summary, by increasing &lt;code&gt;MAX_RUNTIME&lt;/code&gt; and &lt;code&gt;MAX_CELLS&lt;/code&gt; we can often improve the model quality, as it allows Amazon SageMaker to explore more candidates and gives it more training data to train better models. &lt;/p&gt;

&lt;h1&gt;
  
  
  What next...
&lt;/h1&gt;

&lt;p&gt;So, in this tutorial we learned a bit about what Amazon Redshift ML is and how you can create, train and deploy an ML model using familiar SQL queries, from a Database Engineer's/Administrator's perspective. In the next &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl"&gt;part&lt;/a&gt; we will explore a little more of how you can make use of Amazon Redshift ML if you are an advanced data analyst or a data scientist, and we shall explore some of the advanced options it has to offer. &lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; : &lt;a href="https://github.com/debnsuma/redshiftml-demo.git" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blog&lt;/strong&gt; : &lt;a href="https://dev.to/aws/amazon-redshift-ml-machine-learning-in-sql-style-part-2-gl"&gt;Amazon Redshift ML - Machine Learning in SQL Style (Part-2)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/redshift/features/redshift-ml/" rel="noopener noreferrer"&gt;Using machine learning in Amazon Redshift&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Book&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.amazon.com/Learn-Amazon-SageMaker-developers-scientists-ebook/dp/B08FMWJXGN" rel="noopener noreferrer"&gt;Learn SageMaker&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Videos&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=8zHh2DoQZBs" rel="noopener noreferrer"&gt;AWS on Air 2020: AWS What’s Next ft. Amazon Redshift Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=3BN1w8JUtD4" rel="noopener noreferrer"&gt;AWS re:Invent 2020: Introducing Amazon Redshift Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>An Introduction to Decision Tree and Ensemble Methods – Part 1</title>
      <dc:creator>Suman Debnath</dc:creator>
      <pubDate>Thu, 09 Apr 2020 13:01:38 +0000</pubDate>
      <link>https://dev.to/aws/an-introduction-to-decision-tree-and-ensemble-methods-part-1-24p0</link>
      <guid>https://dev.to/aws/an-introduction-to-decision-tree-and-ensemble-methods-part-1-24p0</guid>
      <description>&lt;p&gt;In this day and age, there is a lot of buzz around Machine Learning(ML) and Artificial Intelligence(AI). And why not, after all, we all are consumers of ML directly or indirectly; irrespective of our professions. AI/ML is a fascinating field, generates a whole lot of excitement around it, and rightly so. In this tutorial series, we will try to explore and demystify the complicated world of &lt;code&gt;{math, equations, and theory}&lt;/code&gt; that functions in tandem to bring out the "magic" which we experience on many application(s)/software(s). The idea is to discuss AI/ML algorithms in detail, implement it from scratch(wherever applicable), and enable the readers to answer &lt;em&gt;&lt;code&gt;"How?"&lt;/code&gt;&lt;/em&gt; rather than &lt;em&gt;&lt;code&gt;"What?"&lt;/code&gt;&lt;/em&gt;. And while we discuss all of these, we shall demonstrate everything on AWS platform using different services as applicable. Having said that, I am a learner as most of you; trying to get a better understanding of this field with each passing day and I just wish to share my learning as I go along with a broader group . Your inputs/feedback are always welcome, anything that would help me improve this process. &lt;/p&gt;

&lt;p&gt;But before we go ahead, I would also like to thank my teacher, Srikanth Varma Sir, for guiding me and hundreds of thousands of students like me across the globe. Without him, I wouldn't have dreamt of getting into this wonderful world of Machine Learning. This write-up is entirely based on the class on Decision Trees I attended as part of his course at &lt;a href="https://www.appliedaicourse.com/" rel="noopener noreferrer"&gt;AppliedAI&lt;/a&gt;.  &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;An Introduction to Decision Tree&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;In this tutorial, we will explore one of the most widely used and fundamental Machine Learning models, the &lt;em&gt;Decision Tree (DT)&lt;/em&gt;. A &lt;em&gt;DT&lt;/em&gt; is a very powerful model which can help us classify labelled data and make predictions. It also gives us lots of information about the data and, most importantly, it is remarkably easy to &lt;em&gt;interpret&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;If you are a software engineer, you probably know &lt;code&gt;“if-else”&lt;/code&gt; conditions, and we all love them, because they are very simple to understand, imagine and code. A decision tree can be thought of as nothing but a &lt;code&gt;“nested if-else classifier”&lt;/code&gt;. &lt;/p&gt;
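&lt;p&gt;To make the analogy concrete, here is a minimal hand-written "nested if-else classifier" in Python, using the two fruit features we will meet shortly (the rules here are made up for illustration, not learned from data):&lt;/p&gt;

```python
def classify_fruit(colour, size):
    """A 'decision tree' written by hand as nested if-else rules.

    colour: "green" or "yellow"; size: "small" or "big".
    The rules are illustrative, not learned from any data.
    """
    if colour == "green":
        return "apple"       # in our toy data, green fruits are apples
    else:                    # yellow fruits
        if size == "small":
            return "lemon"   # small yellow fruits are lemons
        else:
            return "apple"   # big yellow fruits are yellow apples

print(classify_fruit("green", "small"))
print(classify_fruit("yellow", "small"))
```

&lt;p&gt;A real decision tree learner simply automates the choice of which question to ask at each node, which is exactly what we study next.&lt;/p&gt;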

&lt;p&gt;A decision tree is one of the classifiers we have in the world of machine learning which closely resembles human reasoning. It is also one of the models that falls under the category of supervised machine learning. In &lt;code&gt;supervised&lt;/code&gt; machine learning, we are always given the &lt;code&gt;input data&lt;/code&gt; (also referred to as features or independent features) and &lt;code&gt;class labels&lt;/code&gt; (also referred to as the target feature or dependent feature). &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Supervised learning&lt;/strong&gt;&lt;/em&gt; is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s try to understand this with a simple example. Suppose we have a data set of 50 fruits, of which some are lemons and some are apples, and we are given fruit colour and fruit size as the input features (so these are our 2 independent features). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6zz6dhm76mvm3x5lpqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6zz6dhm76mvm3x5lpqd.png" alt="img1" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The decision tree for this problem, might look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qtrxcmf83xwged1w1vz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qtrxcmf83xwged1w1vz.png" alt="img2" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the question is, given a dataset, how can we build a tree like this? To understand this, we need to look into the "math" behind it, which we will see in the next section. But before that, let's learn some key terminology we must be aware of to work with Decision Trees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NODE&lt;/code&gt;: Each time we ask a question or make a decision stump, we represent the same as a NODE. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROOT&lt;/code&gt;: The topmost node of the tree which we start questioning with. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INTERIOR&lt;/code&gt;: Any node, except the ROOT node, where we &lt;em&gt;again&lt;/em&gt; ask a question. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LEAF&lt;/code&gt;: When we reach a point where we don’t ask a question but instead make a decision, we call it a LEAF node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here is the general DT structure example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx75abm9tdwdjce8cpbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhx75abm9tdwdjce8cpbz.png" alt="img3" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Decision trees&lt;/strong&gt;&lt;/em&gt; are &lt;code&gt;supervised&lt;/code&gt; learning algorithms, which means we need to have a labelled dataset, and they can be used for both classification and regression tasks, i.e. for categorical or continuous targets. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Training flow of a Decision Tree:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prepare the labelled data set, with &lt;code&gt;independent features&lt;/code&gt; {1, 2, 3, …, n} and a &lt;code&gt;dependent feature&lt;/code&gt; (target or class label). &lt;/li&gt;
&lt;li&gt;Try to pick the best feature as the &lt;code&gt;root&lt;/code&gt; node and thereafter, use intelligent strategies to split the nodes into multiple branches.
&lt;/li&gt;
&lt;li&gt;Grow the tree until we hit a &lt;code&gt;stopping&lt;/code&gt; criterion, i.e. the leaf node, which holds the actual prediction returned when we make any query or ask for a prediction. &lt;/li&gt;
&lt;li&gt;Pass the prediction query through the tree until we arrive at some &lt;code&gt;leaf node&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Once we get the leaf node, we have the prediction!! :)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v3y2srpa6z7knsduizv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4v3y2srpa6z7knsduizv.png" alt="img4" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Math behind Decision Tree&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Now that we know what a Decision Tree looks like, the next step to think about (&lt;em&gt;and it’s the most important task to accomplish&lt;/em&gt;) is: how can we go from training data to a decision tree? To understand this, we need to go over a bunch of very important concepts. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Entropy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Entropy is a very interesting concept from the field of &lt;a href="https://en.wikipedia.org/wiki/Information_theory" rel="noopener noreferrer"&gt;Information Theory&lt;/a&gt;. It is the notion of the &lt;em&gt;impurity&lt;/em&gt; of the data. Now, what is this new term, &lt;code&gt;impurity&lt;/code&gt; of the data?&lt;/p&gt;

&lt;p&gt;Let’s take an example, we have this node:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7wf2zzhkxdvfg4uf36z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7wf2zzhkxdvfg4uf36z.png" alt="img5" width="546" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can say that this node is &lt;code&gt;pure&lt;/code&gt; because there is only one class; there is no ambiguity in the class label (it’s all APPLE). Now, consider this node:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxhesramxacqm63b9wmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxhesramxacqm63b9wmp.png" alt="img6" width="552" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can say that it's a little &lt;code&gt;less pure&lt;/code&gt; than the previous node, as a small amount of ambiguity exists in the class label (a few LEMONs are present along with the APPLEs). And finally, let’s look at this node:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi1lpylp4i4o5bsnj0oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi1lpylp4i4o5bsnj0oq.png" alt="img7" width="558" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is very much an &lt;code&gt;impure&lt;/code&gt; node, as we have mixed classes (red APPLE, green APPLE, and yellow LEMON). &lt;/p&gt;

&lt;p&gt;Now, let’s go back to the definition of &lt;code&gt;Entropy&lt;/code&gt;: it’s the notion of the impurity of a node. In other words, the more the impurity, the higher the entropy, and vice versa. &lt;/p&gt;

&lt;p&gt;Let’s say we have a Random Variable, X, which can take the values x1, x2, x3, … xn.&lt;br&gt;
Then, mathematically, Entropy (also known as Shannon's entropy) is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8wpssd6xk2z95zwcfwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8wpssd6xk2z95zwcfwq.png" alt="img8" width="478" height="78"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;where, &lt;br&gt;
k = index, ranging from 1 through n&lt;br&gt;
H(x) = entropy of X&lt;br&gt;
P(k) = probability of the random variable X when (X = k)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, let’s take an example to understand it a little better. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdfs92mejrjcpmlwymoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdfs92mejrjcpmlwymoe.png" alt="img9" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this dataset (D), “Play Ball” is the &lt;code&gt;target&lt;/code&gt; class and our random variable, X. It can take only two values, &lt;em&gt;“Yes”&lt;/em&gt; or &lt;em&gt;“No”&lt;/em&gt;. So, &lt;/p&gt;

&lt;p&gt;&lt;code&gt;P(k=Yes)&lt;/code&gt; =&amp;gt; 9/14 ≈ 0.64&lt;br&gt;
&lt;code&gt;P(k=No)&lt;/code&gt;  =&amp;gt; 5/14 ≈ 0.36&lt;/p&gt;

&lt;p&gt;Therefore, the Entropy of “Play Ball” on the dataset(D), would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbqnqwg5o3gmlp8a6yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbqnqwg5o3gmlp8a6yy.png" alt="img11" width="800" height="49"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we think intuitively, what this essentially means is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;higher&lt;/code&gt; the Entropy, the &lt;code&gt;more impure&lt;/code&gt; the dataset. &lt;/li&gt;
&lt;li&gt;The &lt;code&gt;lower&lt;/code&gt; the Entropy, the &lt;code&gt;more pure&lt;/code&gt; the dataset. &lt;/li&gt;
&lt;/ul&gt;
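&lt;p&gt;The entropy calculation above can be verified with a few lines of Python (a minimal sketch; the fractions 9/14 and 5/14 come from the “Play Ball” dataset shown earlier):&lt;/p&gt;

```python
import math

def entropy(probabilities):
    """Shannon entropy: H(X) = -sum(P(k) * log2(P(k)))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# "Play Ball" is Yes 9 times and No 5 times out of 14 rows
h = entropy([9 / 14, 5 / 14])
print(round(h, 2))  # 0.94
```

&lt;p&gt;A perfectly pure node, e.g. &lt;code&gt;entropy([1.0])&lt;/code&gt;, gives 0, matching the intuition above.&lt;/p&gt;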

&lt;h2&gt;
  
  
  &lt;strong&gt;Information Gain(IG)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we have an idea of what &lt;code&gt;Entropy&lt;/code&gt; is, the next important concept to look into is &lt;code&gt;Information Gain (IG)&lt;/code&gt;. Let’s continue with the same example as above, where the Entropy of the target label, H(“Play Ball”), is 0.94. Now, if we split the dataset on the “Outlook” feature, our dataset would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrmv0q3q4bajs2kph3nd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjrmv0q3q4bajs2kph3nd.png" alt="img12" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we get 3 small sub-datasets (D1, D2, and D3), based on the different values of the feature “Outlook” (i.e. Rainy, Overcast, and Sunny). If we now compute the &lt;code&gt;Entropy&lt;/code&gt; of each of these sub-datasets on the same target class, “Play Ball”, we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz0ze0onap69h3wwi5go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz0ze0onap69h3wwi5go.png" alt="img13" width="800" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the &lt;code&gt;Weighted Entropy&lt;/code&gt; after breaking the dataset into D1, D2, and D3 would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2esrr0abz5bmhj51o5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2esrr0abz5bmhj51o5g.png" alt="img14" width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, the Information Gain of the dataset D, when we break it based on the feature Outlook, would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Information Gain(Outlook)&lt;/strong&gt;= &lt;em&gt;Entropy(D) − Weighted Entropy after breaking the dataset&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2sdlghl4dq82qkz0njk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2sdlghl4dq82qkz0njk.png" alt="img15" width="800" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec8kkpn8r1c0gbhp2hsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fec8kkpn8r1c0gbhp2hsz.png" alt="img17" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;
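&lt;p&gt;The same arithmetic can be sketched in Python. The per-branch class counts below are an assumption based on the standard play-tennis dataset used in this example (Sunny: 2 Yes / 3 No, Overcast: 4 Yes / 0 No, Rainy: 3 Yes / 2 No):&lt;/p&gt;

```python
import math

def entropy(counts):
    """Entropy of a node given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# class counts [Yes, No] per Outlook value (assumed play-tennis counts)
branches = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}

h_parent = entropy([9, 5])                  # entropy of the full dataset D
n = sum(sum(c) for c in branches.values())  # 14 rows in total
weighted = sum(sum(c) / n * entropy(c) for c in branches.values())

ig_outlook = h_parent - weighted
print(round(ig_outlook, 3))  # 0.247
```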

&lt;p&gt;Similarly, we can find the IG based on the other features as well (&lt;code&gt;Temperature&lt;/code&gt;, &lt;code&gt;Humidity&lt;/code&gt;, and &lt;code&gt;Windy&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei7c2hc0jij4n6dz6xuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fei7c2hc0jij4n6dz6xuc.png" alt="img16" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing the information gain of all 4 features, it’s very clear that the information gain of the feature “Outlook” is the largest, which means this feature gives us the maximum amount of information about the target class (“Play Ball”). &lt;/p&gt;

&lt;p&gt;Hence, the Decision Tree uses this feature as the &lt;em&gt;ROOT&lt;/em&gt; node of the tree. Once the data is split, we further examine each of the small subtrees, perform the same activity, and pick the next feature with the highest IG, splitting the dataset further until we get to the leaf nodes. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How to build a Tree&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;With an understanding of &lt;code&gt;Entropy&lt;/code&gt; and &lt;code&gt;IG&lt;/code&gt;, we can build a tree. Here are the algorithmic steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First the entropy of the total dataset is calculated for the target label/class.&lt;/li&gt;
&lt;li&gt;The dataset is then split on different features. &lt;/li&gt;
&lt;li&gt;The entropy of each branch is calculated, then added proportionally to get the total weighted entropy of the split.&lt;/li&gt;
&lt;li&gt;The resulting entropy is subtracted from the entropy before the split.&lt;/li&gt;
&lt;li&gt;The result is the Information Gain.&lt;/li&gt;
&lt;li&gt;The feature that yields the largest IG is chosen for the decision node.&lt;/li&gt;
&lt;li&gt;Repeat steps #2 to #6, for each subset of the data (i.e. for each internal node), until:

&lt;ul&gt;
&lt;li&gt;All the dependent features are exhausted. &lt;/li&gt;
&lt;li&gt;The stopping criteria are met.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
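&lt;p&gt;The steps above can be sketched as a tiny recursive ID3-style implementation. This is purely illustrative (the play-tennis table is reduced to the &lt;code&gt;Outlook&lt;/code&gt; and &lt;code&gt;Windy&lt;/code&gt; columns for brevity, with Strong/Weak mapped to True/False):&lt;/p&gt;

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """IG = entropy before the split minus weighted entropy after it."""
    base = entropy(labels)
    weighted = 0.0
    for value in set(r[feature] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == value]
        weighted += len(subset) / len(labels) * entropy(subset)
    return base - weighted

def id3(rows, labels, features):
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not features:                   # features exhausted -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

data = [  # (Outlook, Windy) -> Play Ball, from the dataset shown earlier
    ("Sunny", False, "No"), ("Sunny", True, "No"), ("Overcast", False, "Yes"),
    ("Rainy", False, "Yes"), ("Rainy", False, "Yes"), ("Rainy", True, "No"),
    ("Overcast", True, "Yes"), ("Sunny", False, "No"), ("Sunny", False, "Yes"),
    ("Rainy", False, "Yes"), ("Sunny", True, "Yes"), ("Overcast", True, "Yes"),
    ("Overcast", False, "Yes"), ("Rainy", True, "No"),
]
rows = [{"Outlook": o, "Windy": w} for o, w, _ in data]
labels = [lab for _, _, lab in data]
tree = id3(rows, labels, ["Outlook", "Windy"])
print(list(tree)[0])  # Outlook has the highest IG, so it becomes the root
```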

&lt;p&gt;A few of the commonly used stopping criteria are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum number of levels from the root node (in other words, the depth of the tree)&lt;/li&gt;
&lt;li&gt;Minimum number of observations in the parent/child node (e.g. 10% of the training data)&lt;/li&gt;
&lt;li&gt;Minimum reduction of the impurity index &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Algorithm behind decision tree&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;So far, we have discussed &lt;code&gt;Entropy&lt;/code&gt; as one way to measure the impurity of a node, but other techniques are available to decide a split, like &lt;code&gt;Gini Impurity&lt;/code&gt;, &lt;code&gt;Chi-Square&lt;/code&gt;, &lt;code&gt;Variance&lt;/code&gt;, etc. There are different algorithms to implement a Decision Tree model, and each uses a different technique to measure the impurity of a node, and hence to decide the split. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ID3(Iterative Dichotomiser 3)&lt;/code&gt; algorithm - &lt;em&gt;uses&lt;/em&gt; Entropy&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CART&lt;/code&gt; algorithm - &lt;em&gt;uses&lt;/em&gt; Gini Impurity Index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C4.5&lt;/code&gt; - &lt;em&gt;uses&lt;/em&gt; Gain Ratio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thankfully, we do not have to do all this ourselves (calculating Entropy and IG, implementing ID3, etc.); there are lots of libraries/packages available in Python which we can use to solve a problem with a decision tree. &lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Problem&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Here is the &lt;a href="https://archive.ics.uci.edu/ml/machine-learning-databases/00422/" rel="noopener noreferrer"&gt;dataset&lt;/a&gt;. It contains wifi signal strengths observed from 7 wifi devices on a smartphone, collected in an indoor space (4 different rooms). The task is to predict the location (which room) from the wifi signal strengths. For more details, check &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wireless+Indoor+Localization" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Amazon SageMaker Notebook&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we get into the code, we will spin up an &lt;code&gt;Amazon SageMaker Notebook&lt;/code&gt;: a fully managed ML compute instance running the Jupyter Notebook App. It handles creating the instance and related resources for us. We are going to use an Amazon SageMaker Notebook rather than a local Jupyter Notebook on our laptop (and we will see why later).&lt;/p&gt;

&lt;p&gt;We will use Jupyter notebooks within our notebook instance to prepare and process data, write code to train models, deploy models to Amazon SageMaker hosting, and test or validate our models. For this tutorial, since we have a small dataset, we will not be deploying the model; in upcoming tutorials we will solve various complex problems where we will deploy the model as well. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Login to &lt;code&gt;AWS Console&lt;/code&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Search for &lt;code&gt;SageMaker&lt;/code&gt; in the Find Services search bar &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1600l332o7d32zic7q6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1600l332o7d32zic7q6p.png" alt="img2a" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on the side menu, &lt;code&gt;Notebook&lt;/code&gt; -&amp;gt; &lt;code&gt;Notebook Instances&lt;/code&gt; and then click on &lt;code&gt;Create notebook Instance&lt;/code&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xgze7weru8cq5vykk54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xgze7weru8cq5vykk54.png" alt="img3a" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Specify the instance type, and click &lt;code&gt;Create notebook instance&lt;/code&gt;. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgb90bvdg97woqxxzkkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgb90bvdg97woqxxzkkv.png" alt="img4a" width="760" height="800"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wait for the instance until the status changes from &lt;code&gt;Pending&lt;/code&gt; to &lt;code&gt;InService&lt;/code&gt; &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplgambcszgc8e79feoww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplgambcszgc8e79feoww.png" alt="img5a" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once it's &lt;code&gt;InService&lt;/code&gt;, click on &lt;code&gt;Open Jupyter&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrqhfe249asx0e9ulfzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkrqhfe249asx0e9ulfzc.png" alt="img6a" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, now our &lt;code&gt;Jupyter Notebook&lt;/code&gt; is up and running. One of the best parts of a SageMaker Notebook is that it is completely managed and all the different frameworks come out of the box. For example, if we click on &lt;code&gt;New&lt;/code&gt; to create a new notebook, we will see a list of different kernels available; we don’t have to worry about installation, maintaining updates, etc. &lt;/p&gt;

&lt;p&gt;OK, we are all set now. Let's go back to the problem statement we have in hand and start coding. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjha6xphx2zzq7i0dajjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjha6xphx2zzq7i0dajjx.png" alt="img7a" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Loading the necessary modules&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's load the modules we will need to work on this problem; we will be using the &lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;scikit-learn&lt;/a&gt; machine learning library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe523r6v0jyrfgtase1c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe523r6v0jyrfgtase1c8.png" alt="code1" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;
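&lt;p&gt;The imports in the screenshot above typically look something like this (an assumption based on the libraries used in the rest of the tutorial):&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```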

&lt;h3&gt;
  
  
  &lt;strong&gt;Decision Tree Classifier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's create a small function which will return a decision tree classifier&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F830ndrvesf0hoezv8pur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F830ndrvesf0hoezv8pur.png" alt="code2" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the function takes 2 arguments, &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;X_train&lt;/code&gt;: Input or Independent Features and &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;y_train&lt;/code&gt;: Class or Target Label &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we use the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="noopener noreferrer"&gt;DecisionTreeClassifier&lt;/a&gt; from the scikit-learn library. It takes many arguments (commonly known as hyperparameters), and here we are using one of them, &lt;code&gt;criterion&lt;/code&gt;: the function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. &lt;/p&gt;

&lt;p&gt;And this function returns &lt;code&gt;clf_tree&lt;/code&gt;: the decision tree classifier which we will use for inference (prediction) later.&lt;/p&gt;
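&lt;p&gt;Putting that together, the function in the screenshot likely resembles this sketch (the name &lt;code&gt;train_model&lt;/code&gt; is a placeholder; the actual name in the screenshot may differ):&lt;/p&gt;

```python
from sklearn.tree import DecisionTreeClassifier

def train_model(X_train, y_train):
    # "entropy" tells the classifier to use information gain for its splits
    clf_tree = DecisionTreeClassifier(criterion="entropy")
    clf_tree.fit(X_train, y_train)
    return clf_tree
```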

&lt;h3&gt;
  
  
  &lt;strong&gt;Load the dataset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's load the dataset (&lt;code&gt;wifi_localization.txt&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppv7509lsdjuloumlkdz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppv7509lsdjuloumlkdz.png" alt="code3" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, after loading the data into &lt;code&gt;df&lt;/code&gt;, a Pandas DataFrame, we insert the column names and separate the Input Features (the &lt;code&gt;X&lt;/code&gt; DataFrame) and the Target Label (the &lt;code&gt;Y&lt;/code&gt; DataFrame).&lt;/p&gt;
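&lt;p&gt;As a sketch of that step (the file is tab-separated with seven signal columns followed by the room label; the column names below are my own, not necessarily those in the screenshot):&lt;/p&gt;

```python
import pandas as pd

def load_wifi_data(path="wifi_localization.txt"):
    # seven signal-strength readings per row, tab-separated; last column is the room
    col_names = [f"wifi_{i}" for i in range(1, 8)] + ["room"]
    df = pd.read_csv(path, sep="\t", header=None, names=col_names)
    X = df.drop("room", axis=1)  # input / independent features
    y = df["room"]               # target label
    return X, y
```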

&lt;h3&gt;
  
  
  &lt;strong&gt;Splitting the dataset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, we will split the whole dataset into training and testing sets (we will use 20% of the total data points for testing):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhh7o258rom5mhzx2cov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhh7o258rom5mhzx2cov.png" alt="code4" width="800" height="74"&gt;&lt;/a&gt;&lt;/p&gt;
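&lt;p&gt;The split in the screenshot is scikit-learn's &lt;code&gt;train_test_split&lt;/code&gt;; a sketch with stand-in arrays (the &lt;code&gt;random_state&lt;/code&gt; is my own addition for reproducibility):&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in arrays; with the wifi data these would be the X and y DataFrames
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```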

&lt;p&gt;Now we can see the first 5 data points:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiktizy9wz57ot8j2x3ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiktizy9wz57ot8j2x3ie.png" alt="img18" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Decision Tree Classifier&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2v5qhe63l3381ux0cdn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2v5qhe63l3381ux0cdn.png" alt="code5" width="800" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aq44ouur3gug6sq0va5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8aq44ouur3gug6sq0va5.png" alt="img19" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that this classifier can be tuned with many parameters (often called hyper-parameters). For more details on what each hyper-parameter means, we can refer to the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We have used one of the hyper-parameters here, &lt;code&gt;criterion&lt;/code&gt;, set to "entropy", which means the classifier will use entropy to calculate the IG, which is ultimately used to split the data in the background. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Predicting the result on the test data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, we can use our Decision Tree Classifier object &lt;code&gt;clf_tree&lt;/code&gt; to make some predictions on our testing data (&lt;code&gt;X_test&lt;/code&gt;): &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv8wqjkrpbel4zfr28sh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiv8wqjkrpbel4zfr28sh.png" alt="code6" width="800" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluating the Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have the predictions (&lt;code&gt;y_pred&lt;/code&gt;), we can validate them against the actual labels (&lt;code&gt;y_test&lt;/code&gt;) of the test data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0syst1b8kkf6mwcm57i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0syst1b8kkf6mwcm57i.png" alt="code7" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7yx8lokgohufutkgeb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7yx8lokgohufutkgeb7.png" alt="img20" width="713" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did a lot in these last few lines of code, and we saw some new terms: &lt;code&gt;accuracy&lt;/code&gt;, &lt;code&gt;precision&lt;/code&gt;, &lt;code&gt;recall&lt;/code&gt;, &lt;code&gt;f1-score&lt;/code&gt;, and &lt;code&gt;confusion matrix&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These are all Model Performance Metrics. We will have a tutorial dedicated to the different Model Performance Metrics, as there are many beyond the ones mentioned here, but for now we can think of them as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;accuracy&lt;/code&gt;: It simply measures how accurately the model predicted. For example, if we have 10 test data points and only 8 of them are predicted correctly by the classifier, the accuracy is 80%. Accuracy always lies between 0 and 1.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;precision&lt;/code&gt;: The precision is the ratio &lt;code&gt;tp / (tp + fp)&lt;/code&gt; where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recall&lt;/code&gt;: The recall is the ratio &lt;code&gt;tp / (tp + fn)&lt;/code&gt; where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;f1-score&lt;/code&gt;: The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. Mathematically it is defined by: &lt;code&gt;2 * (precision * recall) / (precision + recall)&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;
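&lt;p&gt;All of these can be computed with scikit-learn's &lt;code&gt;metrics&lt;/code&gt; module. A self-contained sketch with toy labels (in the notebook, the real &lt;code&gt;y_test&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt; come from the classifier above):&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# toy labels standing in for y_test / y_pred
y_test = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

acc = accuracy_score(y_test, y_pred)    # 5 of 6 correct
cm = confusion_matrix(y_test, y_pred)   # rows = actual class, columns = predicted
print(round(acc, 3))                    # 0.833
print(cm)
print(classification_report(y_test, y_pred))
```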

&lt;h3&gt;
  
  
  &lt;strong&gt;Visualize A Decision Tree&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, we will visualize what our Decision Tree looks like, which we can do using the &lt;a href="https://pypi.org/project/graphviz/" rel="noopener noreferrer"&gt;&lt;code&gt;graphviz&lt;/code&gt;&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eydeyfoj6q5eyv22ulz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eydeyfoj6q5eyv22ulz.png" alt="code8" width="800" height="217"&gt;&lt;/a&gt;&lt;/p&gt;
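&lt;p&gt;The export step is based on scikit-learn's &lt;code&gt;export_graphviz&lt;/code&gt;. A sketch using a small stand-in model (with the wifi data you would pass &lt;code&gt;clf_tree&lt;/code&gt; and the wifi column names instead):&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# small stand-in model so the sketch is self-contained
iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(iris.data, iris.target)

# export the fitted tree as a DOT-format string
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True)

# rendering to a file needs the graphviz package:
#   import graphviz
#   graphviz.Source(dot_data).render("DT_Graph")
print(dot_data.splitlines()[0])
```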

&lt;p&gt;This is how the tree from the above classifier finally looks. You can also download it from &lt;a href="https://myblog-imgs.s3.amazonaws.com/01_DT/img/DT_Graph.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d2u6is7778j8uhf0g94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d2u6is7778j8uhf0g94.png" alt="img21" width="779" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we will explore some other algorithms and solve some more problems. You can get the code/Jupyter notebook from the &lt;a href="https://github.com/debnsuma/AI-ML-Algo2020/tree/master/01.Decision_Tree" rel="noopener noreferrer"&gt;git repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Reference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You may like to visit the courses, books, and sources below, which were referred to for this tutorial:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Course&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.appliedaicourse.com/" rel="noopener noreferrer"&gt;AppliedAI&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Books:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://profsite.um.ac.ir/~monsefi/machine-learning/pdf/Machine-Learning-Tom-Mitchell.pdf" rel="noopener noreferrer"&gt;Machine Learning by Tom Mitchell&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.in/Python-Machine-Learning-scikit-learn-TensorFlow-ebook/dp/B0742K7HYF" rel="noopener noreferrer"&gt;Python Machine Learning by Sebastian Raschka, Vahid Mirjalili&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Blogs:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://python-course.eu/" rel="noopener noreferrer"&gt;Machine Learning with Python &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/" rel="noopener noreferrer"&gt;Towards Data Science&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.analyticsvidhya.com/blog/?utm_source=home_blog_navbar" rel="noopener noreferrer"&gt;Analytics Vidya&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Amazon SageMaker:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf" rel="noopener noreferrer"&gt;Amazon SageMaker Developers Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html" rel="noopener noreferrer"&gt;Amazon SageMaker Notebook Instances&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
