<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamad Albaker Kawtharani</title>
    <description>The latest articles on DEV Community by Mohamad Albaker Kawtharani (@makawtharani).</description>
    <link>https://dev.to/makawtharani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F916325%2F5a0b7d9d-f331-4067-b58a-ffba933d3980.png</url>
      <title>DEV Community: Mohamad Albaker Kawtharani</title>
      <link>https://dev.to/makawtharani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/makawtharani"/>
    <language>en</language>
    <item>
      <title>Approaching AI Agents 2026</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Thu, 15 Jan 2026 06:37:36 +0000</pubDate>
      <link>https://dev.to/makawtharani/approaching-ai-agents-2026-427o</link>
      <guid>https://dev.to/makawtharani/approaching-ai-agents-2026-427o</guid>
      <description>&lt;h2&gt;
  
  
  How to approach AI agents in 2026 (with one end-to-end example)
&lt;/h2&gt;

&lt;p&gt;The best way to build agents is to treat them like a &lt;strong&gt;virtual employee&lt;/strong&gt;: define the job, set boundaries, give safe access, run probation, measure outcomes, then promote slowly. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The focus is on the outcome as the agentic solution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttnkbvrbso56swknj2md.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttnkbvrbso56swknj2md.png" alt="AI Agents Approach Diagram" width="800" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll walk through an example: a &lt;strong&gt;Sourcing Agent&lt;/strong&gt; inside a procurement department.&lt;/p&gt;




&lt;h2&gt;
  
  
  The continuous example: the &lt;strong&gt;Sourcing Agent&lt;/strong&gt; (Procurement)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does (v1):&lt;/strong&gt; turns a sourcing request into a ready-to-run &lt;strong&gt;Sourcing Pack&lt;/strong&gt; (RFQ draft + supplier shortlist + evaluation matrix + risk checklist).&lt;br&gt;
&lt;strong&gt;What it doesn’t do (yet):&lt;/strong&gt; award suppliers, create POs, commit spend, or change supplier master data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1 — Pick one closed-loop, low-risk use case
&lt;/h2&gt;

&lt;p&gt;Don’t start with “run procurement end-to-end.” Start with one loop that has clear inputs/outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Example (closed-loop v1): “Sourcing Pack Builder”&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; intake form (category, specs, location, budget range, timeline, required certifications)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Sourcing Pack (RFQ draft, shortlist, evaluation matrix, risk/compliance checklist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Done criteria:&lt;/strong&gt; procurement manager can launch sourcing with minimal edits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Role Card (make it real)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; Sourcing Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job:&lt;/strong&gt; prepare Sourcing Packs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KPIs:&lt;/strong&gt; time-to-first-pack, pack completeness score, approval rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation:&lt;/strong&gt; unclear specs / new supplier / policy exception → human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Must NOT do:&lt;/strong&gt; award suppliers, approve spend, or create POs&lt;/li&gt;
&lt;/ul&gt;
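
&lt;p&gt;To make the role card enforceable rather than aspirational, it can be encoded as configuration that the orchestration layer checks on every action. Here is a minimal sketch; the field names and the &lt;code&gt;can_execute&lt;/code&gt; helper are illustrative, not from any specific framework.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative role card: deny by default, allow only what the job needs
ROLE_CARD = {
    "name": "Sourcing Agent",
    "job": "prepare Sourcing Packs",
    "kpis": ["time_to_first_pack", "pack_completeness_score", "approval_rate"],
    "allowed_actions": ["draft_rfq", "build_shortlist", "build_evaluation_matrix"],
    "forbidden_actions": ["award_supplier", "approve_spend", "create_po"],
    "escalate_on": ["unclear_specs", "new_supplier", "policy_exception"],
}

def can_execute(action: str) -&amp;gt; bool:
    """Allow an action only if it is explicitly permitted and never forbidden."""
    if action in ROLE_CARD["forbidden_actions"]:
        return False
    return action in ROLE_CARD["allowed_actions"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;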




&lt;h2&gt;
  
  
  Step 2 — Validate data &amp;amp; integration readiness (GO / NO-GO)
&lt;/h2&gt;

&lt;p&gt;Before the agent “starts work,” set up the desk and systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Example: what the Sourcing Agent needs access to (read-only first)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supplier master + supplier performance notes&lt;/li&gt;
&lt;li&gt;Category spend history (e.g. last 6–12 months)&lt;/li&gt;
&lt;li&gt;Approved supplier lists + policy thresholds&lt;/li&gt;
&lt;li&gt;RFQ templates + standard clauses&lt;/li&gt;
&lt;li&gt;Risk/compliance checklist rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GO / NO-GO checks (simple)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it reliably fetch the right suppliers for this category + region?&lt;/li&gt;
&lt;li&gt;Are templates and policies accessible and up to date?&lt;/li&gt;
&lt;li&gt;Can you run a full test from intake → draft pack in a sandbox?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any check fails: stop and fix prerequisites.&lt;/p&gt;
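
&lt;p&gt;These checks are easy to turn into a pre-flight script that blocks the pilot until every prerequisite passes. A minimal sketch, assuming you wire in your own data and sandbox probes (the check names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def go_no_go(checks: dict) -&amp;gt; bool:
    """Return GO only if every readiness check passes."""
    failed = [name for name, passed in checks.items() if not passed]
    for name in failed:
        print(f"NO-GO: fix prerequisite first: {name}")
    return not failed

# Results of your own probes go here
go_no_go({
    "supplier_fetch_by_category_and_region": True,
    "templates_and_policies_up_to_date": True,
    "sandbox_run_intake_to_draft_pack": False,
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;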




&lt;h2&gt;
  
  
  Step 3 — Run a supervised pilot (read-only / human approval)
&lt;/h2&gt;

&lt;p&gt;Now you let the Sourcing Agent operate—like a new hire on probation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Example: pilot workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Buyer submits intake&lt;/li&gt;
&lt;li&gt;Sourcing Agent generates a Sourcing Pack&lt;/li&gt;
&lt;li&gt;Manager approves / edits&lt;/li&gt;
&lt;li&gt;Feedback is captured (what was wrong, what was missing)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pilot metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed:&lt;/strong&gt; time from intake → first pack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality:&lt;/strong&gt; % packs approved with minor edits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; # policy issues caught before approval&lt;/li&gt;
&lt;/ul&gt;
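
&lt;p&gt;All three metrics fall out of the approval log. A sketch of the aggregation, with illustrative field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from statistics import mean

def pilot_report(packs: list) -&amp;gt; dict:
    """Aggregate speed, quality, and safety from pilot pack records."""
    return {
        # Speed: hours from intake to first generated pack
        "avg_hours_to_first_pack": mean(p["hours_to_first_pack"] for p in packs),
        # Quality: % of packs approved with only minor edits
        "pct_approved_minor_edits": 100 * sum(p["minor_edits_only"] for p in packs) / len(packs),
        # Safety: policy issues caught before approval
        "policy_issues_caught": sum(p["policy_issues_caught"] for p in packs),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;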




&lt;h2&gt;
  
  
  Step 4 — Define the outcome as the agentic solution (measure outputs, not clicks)
&lt;/h2&gt;

&lt;p&gt;This is the key mindset shift: your “agentic solution” is not the model, tools, or prompts—it’s the &lt;strong&gt;completed outcome&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Example: outcome units for the Sourcing Agent&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1 Sourcing Pack Delivered&lt;/strong&gt; (RFQ + shortlist + matrix + risk checklist)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1 Supplier Shortlist Approved&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 RFQ Ready to Launch&lt;/strong&gt; (still requiring approval)&lt;/li&gt;
&lt;/ul&gt;
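
&lt;p&gt;In practice this means logging one record per completed outcome unit and reporting on those records, not on model calls or prompts. A tiny sketch with illustrative fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

# One record per completed outcome unit, not per model call
outcomes = [
    {"unit": "sourcing_pack_delivered", "week": "2026-W03"},
    {"unit": "sourcing_pack_delivered", "week": "2026-W03"},
    {"unit": "rfq_ready_to_launch", "week": "2026-W04"},
]

# Forecastable throughput: outcome units per week ("packs per week")
print(Counter((o["week"], o["unit"]) for o in outcomes))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;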

&lt;p&gt;&lt;strong&gt;Why this matters&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It keeps incentives aligned (“deliver approved packs”)&lt;/li&gt;
&lt;li&gt;It makes value obvious (time saved, fewer cycles, higher compliance)&lt;/li&gt;
&lt;li&gt;It makes scaling easier (you can forecast “packs per week”)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 5 — Harden AgentOps + governance before write access (promotion)
&lt;/h2&gt;

&lt;p&gt;Only after the pilot is stable do you “promote” the Sourcing Agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ Example: promotion path&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read-only&lt;/strong&gt;: drafts only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval mode&lt;/strong&gt;: pre-fill SRM/RFQ event &lt;em&gt;after approval&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited write&lt;/strong&gt;: allowed for low-risk categories under a threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded scope&lt;/strong&gt;: more categories, more autonomy, still governed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Controls to add before write access&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit log (inputs used, sources, decisions, approvals)&lt;/li&gt;
&lt;li&gt;Guardrails (policy checks, supplier eligibility checks, threshold checks)&lt;/li&gt;
&lt;li&gt;Observability (quality, cost, failure modes, overrides)&lt;/li&gt;
&lt;li&gt;Kill switch + rollback process&lt;/li&gt;
&lt;/ul&gt;
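
&lt;p&gt;The same deny-by-default idea extends to write access: every write passes through a guardrail gate that enforces category and threshold policy and logs the decision for the audit trail. A minimal sketch; the category list and threshold are made-up policy values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

logging.basicConfig(level=logging.INFO)

LOW_RISK_CATEGORIES = {"office_supplies", "packaging"}  # example policy values
WRITE_THRESHOLD_USD = 10_000                            # example policy value

def guarded_write(category: str, amount_usd: float, supplier_eligible: bool) -&amp;gt; bool:
    """Allow a write only for low-risk categories, under threshold, to eligible suppliers."""
    allowed = (
        category in LOW_RISK_CATEGORIES
        and amount_usd &amp;lt; WRITE_THRESHOLD_USD
        and supplier_eligible
    )
    # Audit log: record every decision, allowed or denied
    logging.info("write_request category=%s amount=%.2f allowed=%s",
                 category, amount_usd, allowed)
    return allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;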




&lt;h2&gt;
  
  
  Wrap-up: the 2026 playbook in one sentence
&lt;/h2&gt;

&lt;p&gt;Build agents like virtual employees: &lt;strong&gt;role → desk access → probation → outcome-based performance → promotion with controls.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Elevating LLM Evaluation with DeepEval: Now with Native Amazon Bedrock Support</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Thu, 24 Apr 2025 09:51:15 +0000</pubDate>
      <link>https://dev.to/makawtharani/elevating-llm-evaluation-with-deepeval-with-native-aws-bedrock-support-155h</link>
      <guid>https://dev.to/makawtharani/elevating-llm-evaluation-with-deepeval-with-native-aws-bedrock-support-155h</guid>
      <description>&lt;h2&gt;
  
  
  Elevating LLM Evaluation with DeepEval: Now with Native Amazon Bedrock Support
&lt;/h2&gt;

&lt;p&gt;As Large Language Models (LLMs) move from research labs to production environments, robust evaluation becomes critical. Whether you're building Retrieval-Augmented Generation (RAG) systems, deploying agentic workflows, or integrating LLMs into enterprise products, you need confidence in your model outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DeepEval?
&lt;/h2&gt;

&lt;p&gt;DeepEval is an open-source framework designed specifically for comprehensive LLM evaluation across diverse use cases. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customizable metrics&lt;/strong&gt; tailored to specific evaluation needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation pipelines&lt;/strong&gt; for systematic testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware validation&lt;/strong&gt; for retrieval-based, conversational, and workflow applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes DeepEval ideal for developers and data scientists who need to validate model quality beyond basic accuracy metrics.&lt;/p&gt;
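
&lt;p&gt;For instance, custom criteria can be expressed with DeepEval's &lt;code&gt;GEval&lt;/code&gt; metric. A minimal sketch; check the DeepEval docs for the exact parameters of your installed version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom, criteria-driven metric
# (pass model=BedrockModel(...) to judge with Bedrock, as shown later in this post)
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually consistent with the input?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(input="What is the capital of France?", actual_output="Paris.")
correctness.measure(test_case)
print(correctness.score, correctness.reason)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;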

&lt;h2&gt;
  
  
  🔥 What's New: Amazon Bedrock Native Support
&lt;/h2&gt;

&lt;p&gt;We're excited to announce that DeepEval now fully supports Amazon Bedrock models, including Claude, Titan, and the complete Bedrock model lineup. This integration enables enterprise teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maintain data sovereignty&lt;/strong&gt; by keeping sensitive evaluation data within your AWS environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage existing AWS infrastructure&lt;/strong&gt; with seamless workflow integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale evaluations&lt;/strong&gt; using Bedrock's managed infrastructure and compliance features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out Pull Request #1426 for implementation details on this important integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  🌍 Open to the Community
&lt;/h2&gt;

&lt;p&gt;LLM evaluation remains an evolving challenge that benefits from diverse perspectives. We're actively seeking contributions in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model support expansion&lt;/strong&gt; beyond current providers and architectures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced evaluation metrics&lt;/strong&gt; for factual consistency, coherence, and bias detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New use case coverage&lt;/strong&gt; including code generation, summarization, and multilingual tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Join data scientists, ML engineers, and researchers in making LLM evaluation more robust and accessible.&lt;/p&gt;

&lt;p&gt;Get involved: &lt;a href="https://github.com/zeroandoneme/deepeval/" rel="noopener noreferrer"&gt;https://github.com/zeroandoneme/deepeval/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Getting Started: Using DeepEval with Amazon Bedrock
&lt;/h2&gt;

&lt;p&gt;Here's how to evaluate your Bedrock models using DeepEval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install DeepEval
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;deepeval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Set up AWS credentials
Make sure your AWS credentials are configured (via environment variables, ~/.aws/credentials, or an IAM role). This gives DeepEval access to Amazon Bedrock.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_access_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_secret_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_DEFAULT_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Run an evaluation
First, generate the model output:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.models.llms.bedrock_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockModel&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the Bedrock model (e.g., Claude)
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-3-7-sonnet-20250219-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define your input prompt
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following text: Anthropic Claude 3.7 Sonnet is the first Claude model to offer step-by-step reasoning, which Anthropic has termed &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;extended&lt;/span&gt; &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. With Claude 3.7 Sonnet, use of step-by-step reasoning is optional. You can choose between standard thinking and extended thinking for advanced reasoning. Along with extended thinking, Claude 3.7 Sonnet allows up to 128K output tokens per request (up to 64K output tokens is considered generally available, but outputs between 64K and 128K are in beta). Additionally, Anthropic has enhanced its computer use beta with support for new actions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Run the model
&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then evaluate the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HallucinationMetric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;

&lt;span class="n"&gt;retrieval_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anthropic Claude 3.7 Sonnet is the first Claude model to introduce optional step-by-step reasoning, called &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="n"&gt;extended&lt;/span&gt; &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; which users can toggle alongside standard thinking. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It supports up to 128K output tokens per request (with 64K–128K currently in beta) and features an enhanced computer use beta with support for new automated actions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What new reasoning feature does Claude 3.7 Sonnet introduce?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieval_context&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HallucinationMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# To run metric as a standalone
&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;measure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;As AI adoption accelerates, evaluation becomes the critical safety net ensuring reliable production systems. DeepEval bridges the gap between cutting-edge research and practical deployment, enabling organizations to ship AI solutions with confidence.&lt;/p&gt;

&lt;p&gt;The addition of Amazon Bedrock support particularly benefits enterprise users who require secure, compliant LLM evaluation within their existing cloud infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join Us in Building Better AI
&lt;/h2&gt;

&lt;p&gt;Whether you're evaluating fine-tuned models, creating RAG applications, or deploying conversational agents, DeepEval provides the framework to measure what matters.&lt;/p&gt;

&lt;p&gt;We invite you to explore DeepEval, integrate it into your evaluation workflow, and contribute to its development.&lt;/p&gt;

&lt;p&gt;Together, we can build more reliable, transparent AI.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>aws</category>
      <category>bedrock</category>
      <category>llm</category>
    </item>
    <item>
      <title>A Magic Line That Cuts Your LLM Latency by &gt;40% on Amazon Bedrock</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Tue, 14 Jan 2025 08:07:18 +0000</pubDate>
      <link>https://dev.to/makawtharani/a-magic-line-that-cuts-your-llm-latency-by-40-on-amazon-bedrock-4gnm</link>
      <guid>https://dev.to/makawtharani/a-magic-line-that-cuts-your-llm-latency-by-40-on-amazon-bedrock-4gnm</guid>
      <description>&lt;h1&gt;
  
  
  Cutting LLM Latency by &amp;gt;40% on Amazon Bedrock with One Magic Line
&lt;/h1&gt;

&lt;p&gt;If you’ve worked with large language models (LLMs), you know that latency can make or break the user experience. For real-time applications, every millisecond matters. Enter Amazon Bedrock’s latency-optimized inference—a game-changing feature that can cut latency significantly with just one line of configuration.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explore how to use this feature, measure its impact, and understand why it’s a must-have for high-performance AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Magic Line
&lt;/h2&gt;

&lt;p&gt;To enable latency-optimized inference, all you need to do is include the following in your request payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"performanceConfig"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"latency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"optimized"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setting tells Amazon Bedrock to use its optimized infrastructure, reducing response times without compromising the accuracy of your model.&lt;/p&gt;
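
&lt;p&gt;With boto3's Converse API, this setting is passed as the &lt;code&gt;performanceConfig&lt;/code&gt; request parameter. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    messages=[{"role": "user", "content": [{"text": "Say hello in one line."}]}],
    # The magic line: request the latency-optimized serving path
    performanceConfig={"latency": "optimized"},
)
print(response["output"]["message"]["content"][0]["text"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;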

&lt;h2&gt;
  
  
  A Real-Life Test with Claude 3.5 Haiku
&lt;/h2&gt;

&lt;p&gt;We conducted a test using Anthropic’s Claude 3.5 Haiku model. The prompt was simple:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Describe the purpose of a 'hello world' program in one line."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We measured the latency for both standard and optimized configurations and recorded the results.&lt;/p&gt;

&lt;p&gt;
  Here’s the Python code used to measure latency:
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;measure_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;optimized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;performanceConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;

    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;
    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-3-5-haiku-20241022-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe the purpose of a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; program in one line.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;standard_latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standard_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measure_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;optimized_latency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimized_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;measure_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;improvement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;standard_latency&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;optimized_latency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;standard_latency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;standard_latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Optimized Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;optimized_latency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latency Improvement: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;improvement&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Here’s what we observed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Latency (Seconds)&lt;/th&gt;
&lt;th&gt;Response&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;2.14&lt;/td&gt;
&lt;td&gt;"A 'hello world' program demonstrates the basic syntax of a programming language by displaying the text 'Hello, World!'."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimized&lt;/td&gt;
&lt;td&gt;1.27&lt;/td&gt;
&lt;td&gt;"A 'hello world' program demonstrates the basic syntax of a programming language by printing the text 'Hello, World!'."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Latency Improvement: 40.41%&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Significant Speed Boost:&lt;/strong&gt; With a simple configuration change, we achieved a 40% reduction in latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similar Output:&lt;/strong&gt; Both configurations returned equivalent, high-quality responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great for Real-Time Use Cases:&lt;/strong&gt; This feature is perfect for chatbots or any latency-sensitive application.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock leverages optimized infrastructure to deliver faster results. However, there are a few things to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token Limits:&lt;/strong&gt; For certain models, such as Meta's Llama 3.1 405B, latency-optimized inference supports requests with a combined input and output token count of up to 11,000 tokens. Requests exceeding this limit will default to standard mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slight Cost Increase:&lt;/strong&gt; Latency-optimized requests may incur slightly higher costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;In today’s fast-paced world, users expect instant results. Whether you’re building an AI-powered customer support system or a real-time analytics dashboard, reducing latency can dramatically improve user experience and system efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock’s latency-optimized inference is a simple yet powerful tool that can supercharge your AI applications. With just one magic line, you can deliver faster, more efficient services. Try it out, measure the difference, and see the results for yourself! 🚀&lt;/p&gt;

</description>
      <category>llm</category>
      <category>genai</category>
      <category>aws</category>
    </item>
    <item>
      <title>Amazon Nova Pro-v1.0 vs OpenAI GPT-4o: A Cost Comparison Through an Example</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Wed, 04 Dec 2024 09:12:25 +0000</pubDate>
      <link>https://dev.to/makawtharani/amazon-nova-pro-v10-vs-openai-gpt-4o-a-cost-comparison-through-an-example-1hbo</link>
      <guid>https://dev.to/makawtharani/amazon-nova-pro-v10-vs-openai-gpt-4o-a-cost-comparison-through-an-example-1hbo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;AI models have revolutionized industries, enabling businesses to automate tasks, generate insights, and optimize workflows. However, as usage grows, so does the need for cost-efficient solutions. Choosing the right AI model isn’t just about performance—it’s also about affordability.&lt;/p&gt;

&lt;p&gt;In this blog, we compare Amazon Nova Pro-v1.0 and OpenAI GPT-4o, two powerful models with different pricing structures, by evaluating their cost-effectiveness in a real-world summarization task.&lt;/p&gt;

&lt;p&gt;Amazon Nova Pro-v1.0 is a cutting-edge AI model capable of processing and generating text, images, and videos, offering versatile applications across various industries. Recently unveiled at Amazon's re:Invent conference, it provides a cost-efficient solution for businesses seeking advanced GenAI capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Models
&lt;/h2&gt;

&lt;p&gt;Before comparing, it’s crucial to understand how these models are priced. Both charge based on the number of tokens used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input Tokens: The text provided to the model for processing.&lt;/li&gt;
&lt;li&gt;Output Tokens: The text generated by the model in response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the on-demand pricing for each model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input Tokens ($/1M)&lt;/th&gt;
&lt;th&gt;Output Tokens ($/1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;$2.5&lt;/td&gt;
&lt;td&gt;$10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Pro-v1.0&lt;/td&gt;
&lt;td&gt;$0.8&lt;/td&gt;
&lt;td&gt;$3.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These costs may seem negligible for small tasks, but they quickly add up when scaling operations. Thus, understanding token usage is key to managing costs effectively.&lt;/p&gt;
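
&lt;p&gt;Concretely, the per-request cost is a weighted sum of input and output tokens. A small sketch of the calculation used throughout this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Price per 1M tokens: (input, output)
PRICES = {
    "gpt-4o": (2.5, 10.0),
    "nova-pro-v1.0": (0.8, 3.2),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -&amp;gt; float:
    """Cost in USD for a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(request_cost("gpt-4o", 489, 130))         # 0.0025225
print(request_cost("nova-pro-v1.0", 494, 152))  # 0.0008816
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;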

&lt;h2&gt;
  
  
  Summarization Task
&lt;/h2&gt;

&lt;p&gt;To compare costs, we conducted a summarization task where both models were given the same input text—a narrative about an old clockmaker and his apprentice. The goal was to generate a concise summary of the story while analyzing the token usage and associated costs.&lt;/p&gt;

&lt;p&gt;
  Input Text (AI Generated)
  &lt;br&gt;
The Clockmaker’s Apprentice&lt;br&gt;
(In the quaint town of Willowshade, nestled between rolling hills and meandering streams, there lived an old clockmaker named Elias. His shop, filled with the soft ticking of countless clocks, was a sanctuary of time. Each clock was a masterpiece, intricately designed with brass gears, golden hands, and polished wood. But Elias was growing old, and his hands, once steady as the earth’s rotation, had begun to tremble.&lt;br&gt;
One frosty morning, a young boy named Finn appeared at the shop’s door. His clothes were tattered, his cheeks flushed with cold, but his eyes sparkled with curiosity.&lt;br&gt;
“Sir, I want to learn to fix time,” Finn declared, clutching a broken pocket watch. Elias chuckled at the boy’s earnestness.&lt;br&gt;
“You can’t fix time, lad,” Elias said. “But you can mend the tools that keep it.”&lt;br&gt;
Finn became Elias’s apprentice. Day by day, he learned the art of clockmaking. He polished gears, measured springs, and practiced the delicate dance of assembling a clock. His favorite was a grand pendulum clock that stood in the corner of the shop, its deep chime echoing through the room every hour.&lt;br&gt;
One evening, as the sun dipped below the horizon, Elias took Finn to the attic. There, covered in a layer of dust, lay a peculiar clock. Its face was blank, and its hands moved erratically, spinning forward, then backward.&lt;br&gt;
“This,” Elias whispered, “is the Clock of Lost Moments. It doesn’t measure hours or minutes. It measures moments forgotten, regrets, or times we wish we could relive.”&lt;br&gt;
Finn stared in awe. “Can it bring back moments?”&lt;br&gt;
Elias shook his head. “No. It’s a reminder, boy. Time moves only forward. We can’t reclaim it, but we can choose to make the most of the moments we’re given.”&lt;br&gt;
Over the years, Finn became a master clockmaker. When Elias passed, Finn inherited the shop. He restored the Clock of Lost Moments and placed it in the shop window, where its mysterious movements fascinated passersby. People began to visit, not just to buy clocks, but to share their stories of moments they cherished or wished to forget.&lt;br&gt;
And so, the little clock shop became more than a place to mend timepieces—it became a place to mend hearts.)&lt;br&gt;


&lt;/p&gt;

&lt;p&gt;Both models produced summaries of similar quality, but their token usage and costs varied significantly. Let’s examine this in detail.&lt;/p&gt;

&lt;p&gt;
  OpenAI - GPT-4o Response
  &lt;br&gt;
Response:&lt;br&gt;&lt;br&gt;
Summary:&lt;br&gt;&lt;br&gt;
In the charming town of Willowshade, an aging clockmaker named Elias, renowned for his intricate clocks, takes in a curious boy named Finn as his apprentice. Finn learns the delicate art of clockmaking and discovers a mysterious “Clock of Lost Moments” in the attic, which measures forgotten or regretful moments but cannot bring them back. Elias teaches Finn the value of cherishing present moments. After Elias’s passing, Finn inherits the shop, restores the peculiar clock, and places it in the window. The shop evolves into a haven where people come not only to repair clocks but also to share and mend their own cherished or regretful memories.&lt;br&gt;


&lt;/p&gt;

&lt;p&gt;
  Amazon - Nova Pro v1.0 Response
  &lt;br&gt;
Response:&lt;br&gt;&lt;br&gt;
In the town of Willowshade, an old clockmaker named Elias runs a shop filled with intricate clocks. As he ages and his hands tremble, a young boy named Finn arrives, eager to learn the craft. Elias becomes his mentor, teaching him the art of clockmaking. Among the clocks, there is a peculiar one called the Clock of Lost Moments, which measures forgotten moments and regrets rather than time. Elias explains that while time can't be reclaimed, one can make the most of the present. Finn eventually becomes a master clockmaker, inherits the shop after Elias's passing, and restores the Clock of Lost Moments, turning the shop into a place where people share their cherished and regretted moments, thus mending hearts as well as clocks.&lt;br&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Calculation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OpenAI GPT-4o
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Input Tokens: 489
&lt;/li&gt;
&lt;li&gt;Output Tokens: 130
&lt;/li&gt;
&lt;li&gt;Pricing Calculation:
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost = (489 × 2.5)/1M + (130 × 10.0)/1M = 0.0012225 + 0.0013 = 0.0025225 USD&lt;/p&gt;





&lt;h3&gt;
  
  
  Amazon Nova Pro-v1.0
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Input Tokens: 494
&lt;/li&gt;
&lt;li&gt;Output Tokens: 152
&lt;/li&gt;
&lt;li&gt;Pricing Calculation:
&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Cost=(494×0.81M)+(152×3.21M)
\text{Cost} = \left( \frac{494 \times 0.8}{1M} \right) + \left( \frac{152 \times 3.2}{1M} \right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Cost&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord mathnormal"&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;494&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.8&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mord mathnormal"&gt;M&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;152&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;3.2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Cost=0.0006848 USD
\text{Cost} = 0.0006848 \, \text{USD}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Cost&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.0006848&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;USD&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;





&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Amazon Nova Pro-v1.0’s cost is noticeably lower, roughly a third of GPT-4o’s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison and Insights
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input Tokens&lt;/th&gt;
&lt;th&gt;Output Tokens&lt;/th&gt;
&lt;th&gt;Total Cost (USD)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;489&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;$0.0025225&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Pro-v1.0&lt;/td&gt;
&lt;td&gt;494&lt;/td&gt;
&lt;td&gt;152&lt;/td&gt;
&lt;td&gt;$0.0008816&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cost Efficiency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Amazon Nova Pro-v1.0 is approximately 65% cheaper than OpenAI GPT-4o for this summarization task.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance Comparison:&lt;br&gt;
Both models generated high-quality summaries. Amazon Nova Pro-v1.0 provided a concise and accurate response, making it an excellent choice for cost-sensitive scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Token Optimization Strategies
&lt;/h2&gt;

&lt;p&gt;Even with cost-efficient models like Amazon Nova Pro-v1.0, there are ways to further optimize token usage and reduce costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Limit Token Length:&lt;br&gt;
Adjust input and output token limits in API settings to avoid excessive usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pre-process Input Texts:&lt;br&gt;
Simplify input texts by removing redundant or unnecessary information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimize Task Complexity:&lt;br&gt;
For simpler tasks, choose models optimized for cost efficiency over high computational power.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By implementing these strategies, businesses can maximize their return on investment when leveraging AI solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this comparative analysis, Amazon Nova Pro-v1.0 emerged as a significantly more cost-efficient model for summarization tasks. While both models excel in quality, the ability to deliver comparable results at a lower cost makes Amazon Nova Pro-v1.0 an attractive option for budget-conscious users.&lt;/p&gt;

&lt;p&gt;When choosing an AI model, it’s essential to evaluate not only performance but also long-term scalability and cost efficiency. Amazon Nova Pro-v1.0 proves that affordability doesn’t have to come at the expense of quality.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>genai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Deploying whisperX on AWS SageMaker as Asynchronous Endpoint</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Sun, 31 Mar 2024 12:57:45 +0000</pubDate>
      <link>https://dev.to/makawtharani/deploying-whisperx-on-aws-sagemaker-as-asynchronous-endpoint-17g6</link>
      <guid>https://dev.to/makawtharani/deploying-whisperx-on-aws-sagemaker-as-asynchronous-endpoint-17g6</guid>
      <description>&lt;p&gt;Whisper is an automatic speech recognition system developed by OpenAI, designed to transcribe and translate audio into text across multiple languages, focusing on general-purpose transcription tasks. &lt;br&gt;
WhisperX, on the other hand, extends Whisper's capabilities with enhancements like faster processing times and additional features such as voice activity detection (VAD), making it more suitable for specific applications that require these advanced functionalities.&lt;/p&gt;

&lt;p&gt;First, we install the Hugging Face Hub library, enabling interaction with the Hugging Face Model Hub from our environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We import the necessary modules and define two functions: &lt;code&gt;download_hf_model&lt;/code&gt; downloads a specified model from Hugging Face to a local directory, and &lt;code&gt;fetch_models&lt;/code&gt; downloads the WhisperX and Voice Activity Detection (VAD) models, storing them locally for use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_hf_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;model_subdir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;huggingface_hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;snapshot_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_subdir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_subdir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_dir_use_symlinks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model_subdir&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches models required for WhisperX transcription without diarization.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;WHISPERX_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guillaumekln/faster-whisper-large-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;VAD_MODEL_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Fetch WhisperX model
&lt;/span&gt;    &lt;span class="nf"&gt;download_hf_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WHISPERX_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fetch VAD Segmentation model
&lt;/span&gt;    &lt;span class="n"&gt;vad_model_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisperx/vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vad_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vad_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlretrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VAD_MODEL_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vad_model_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/pytorch_model.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we call the &lt;code&gt;fetch_models&lt;/code&gt; function to download the WhisperX and VAD models into our local directory (&lt;code&gt;./models-v1&lt;/code&gt;), using the specified Hugging Face token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;fetch_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# enter your hugging face token
&lt;/span&gt;    &lt;span class="n"&gt;local_model_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we set up the files our model needs. We create &lt;code&gt;inference.py&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt; inside the &lt;code&gt;code/&lt;/code&gt; subdirectory of our model directory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;inference.py&lt;/code&gt; contains the inference script for loading the WhisperX model, processing input audio files from S3, performing transcription, and formatting the output.&lt;br&gt;
&lt;code&gt;requirements.txt&lt;/code&gt; specifies the Python packages needed for the inference environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Directory and file paths
&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./models-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;inference_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code/inference.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;requirements_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code/requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the directory structure
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_file_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inference.py content
&lt;/span&gt;&lt;span class="n"&gt;inference_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;# inference.py
import io
import json
import logging
import os
import tempfile
import time
import boto3
import torch
import whisperx

DEVICE = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; if torch.cuda.is_available() else &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
s3 = boto3.client(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)

def model_fn(model_dir, context=None):
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Load and return the WhisperX model necessary for audio transcription.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Entering model_fn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    logging.info(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading WhisperX model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    model = whisperx.load_model(whisper_arch=f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{model_dir}/guillaumekln/faster-whisper-large-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                                device=DEVICE,
                                language=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                                compute_type=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
                                vad_options={&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_fp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{model_dir}/whisperx/vad/pytorch_model.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;})
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded WhisperX model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exiting model_fn with model loaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    return {
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: model
    }

def input_fn(request_body, request_content_type):
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Process and load audio from S3, given the request body containing S3 bucket and key.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Entering input_fn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    if request_content_type != &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:
        raise ValueError(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid content type. Must be application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    request = json.loads(request_body)
    s3_bucket = request[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]
    s3_key = request[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]

    # Download the file from S3
    temp_file = tempfile.NamedTemporaryFile(delete=False)
    s3.download_file(Bucket=s3_bucket, Key=s3_key, Filename=temp_file.name)
    print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded audio from S3: {s3_bucket}/{s3_key}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exiting input_fn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    return temp_file.name

def predict_fn(input_data, model, context=None):
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Perform transcription on the provided audio file and delete the file afterwards.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Entering predict_fn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    start_time = time.time()

    whisperx_model = model[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]

    logging.info(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    audio = whisperx.load_audio(input_data)

    logging.info(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transcribing audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    transcription_result = whisperx_model.transcribe(audio, batch_size=16)

    try:
        os.remove(input_data)  # input_data contains the path to the temp file
        print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temporary file {input_data} deleted.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    except OSError as e:
        print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: {input_data} : {e.strerror}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transcription took {int(elapsed_time)} seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

    print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exiting predict_fn, processing took {int(elapsed_time)} seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    return transcription_result

def output_fn(prediction, accept, context=None):
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Prepare the prediction result for the response.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Entering output_fn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    if accept != &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:
        raise ValueError(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept header must be application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    response_body = json.dumps(prediction)
    print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exiting output_fn with response prepared&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
    return response_body, accept

&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

&lt;span class="c1"&gt;# Write the inference.py file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Requirements.txt content
&lt;/span&gt;&lt;span class="n"&gt;requirements_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;speechbrain==0.5.16
faster-whisper==0.7.1
git+https://github.com/m-bain/whisperx.git@1b092de19a1878a8f138f665b1467ca21b076e7e
ffmpeg-python
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

&lt;span class="c1"&gt;# Write the requirements.txt file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we compress our model directory into a gzip-compressed tar archive. The archive is created in the current working directory and encapsulates the prepared model and code necessary for deploying WhisperX on SageMaker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
&lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_archive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./modelv1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gztar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./models-v1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we upload the compressed model archive to an S3 bucket using the SageMaker session, saving the S3 path of the uploaded model for later use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_execution_role&lt;/span&gt;

&lt;span class="n"&gt;sagemaker_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt; &lt;span class="c1"&gt;# Enter your s3 bucket name
&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whisperx/code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the model to S3
&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelv1.tar.gz&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model uploaded to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are now ready to create the model and deploy it on SageMaker as an asynchronous endpoint, using a container image URI and the model data uploaded to S3. &lt;br&gt;
For different deployment regions and requirements, find suitable container images in the &lt;a href="https://github.com/aws/deep-learning-containers/blob/master/available_images.md" rel="noopener noreferrer"&gt;AWS Deep Learning Containers&lt;/a&gt; list, or resolve one programmatically as sketched after the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.async_inference&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncInferenceConfig&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize a sagemaker session
&lt;/span&gt;&lt;span class="n"&gt;sagemaker_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a SageMaker model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Specify the output location
&lt;/span&gt;&lt;span class="n"&gt;async_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncInferenceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3://{your_s3_bucket_name}/whisperx/output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# add your s3 bucket name to have the output there
&lt;/span&gt;
&lt;span class="c1"&gt;# Deploy the model to an asynchronous endpoint
&lt;/span&gt;&lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;initial_instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ml.g4dn.xlarge&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# you can change
&lt;/span&gt;    &lt;span class="n"&gt;async_inference_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;async_config&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
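
&lt;p&gt;If you prefer not to copy a URI by hand, the SageMaker SDK can also resolve a container image for you. Below is a minimal sketch; the framework version and Python version are examples, so confirm the combination exists in the Deep Learning Containers list linked above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sagemaker
from sagemaker import image_uris

# Resolve a PyTorch GPU inference container for the session's region.
# version / py_version below are examples -- verify against the DLC list.
image_uri = image_uris.retrieve(
    framework="pytorch",
    region=sagemaker.Session().boto_region_name,
    version="2.0.0",
    py_version="py310",
    instance_type="ml.g4dn.xlarge",
    image_scope="inference",
)
print(image_uri)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;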



&lt;p&gt;Now that the deployment is done, we can invoke the endpoint. &lt;br&gt;
The function &lt;code&gt;invoke_async_model&lt;/code&gt; handles asynchronous inference: it takes an S3 bucket and key, saves a JSON payload containing these details to S3, and invokes the SageMaker asynchronous endpoint with the location of that payload. &lt;br&gt;
This lets clients submit inference jobs without waiting for them to complete, which is ideal for large-scale or batch processing tasks. &lt;br&gt;
Upon successful invocation, it prints a confirmation and the unique InferenceId provided by SageMaker, which can be used to track the status and result of the inference job (see the polling sketch after the code).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_async_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s3_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Saves a JSON payload to S3 and invokes a SageMaker asynchronous endpoint with the payload.

    Parameters:
    - s3_bucket: The S3 bucket name.
    - s3_key: The S3 key for the input file.
    Returns:
    The response from the endpoint invocation, including the InferenceId.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sagemaker_runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create the payload
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s3_key&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Define the S3 key for the input JSON
&lt;/span&gt;    &lt;span class="n"&gt;s3_key_for_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/asynch_input_file.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the payload to S3 as a JSON file
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_key_for_input&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payload saved to S3.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error saving JSON to S3: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# The S3 location of the input data for the inference request
&lt;/span&gt;    &lt;span class="n"&gt;input_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_key_for_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;endpoint_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt; &lt;span class="c1"&gt;## add you endpoint name
&lt;/span&gt;    &lt;span class="c1"&gt;# Invoke the SageMaker asynchronous endpoint
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_endpoint_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;EndpointName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;InputLocation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ContentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Endpoint invoked. InferenceId: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;InferenceId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error invoking endpoint: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# add the s3 bucket 
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# the audio / video file prefix
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Invoke the function with parameters from request_body
&lt;/span&gt;&lt;span class="nf"&gt;invoke_async_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;s3_bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;s3_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
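
&lt;p&gt;Note that the asynchronous response does not contain the transcription itself; SageMaker writes the result to the configured output path. A minimal polling sketch, assuming the OutputLocation field returned by the invocation, could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import boto3
from urllib.parse import urlparse

def wait_for_output(output_location, poll_seconds=15):
    """Poll the async endpoint's S3 output location until the result appears."""
    s3 = boto3.client("s3")
    parsed = urlparse(output_location)  # e.g. s3://bucket/whisperx/output/ID.out
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    while True:
        try:
            obj = s3.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read().decode("utf-8")
        except s3.exceptions.NoSuchKey:
            time.sleep(poll_seconds)  # result not ready yet; keep waiting

# Example usage with the response returned by invoke_async_model:
# transcription_json = wait_for_output(response["OutputLocation"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;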



&lt;p&gt;I hope this was helpful. &lt;br&gt;
Please let me know if you have any questions!&lt;/p&gt;

</description>
      <category>whisper</category>
      <category>huggingface</category>
      <category>sagemaker</category>
      <category>aws</category>
    </item>
    <item>
      <title>Deploying OpenAI's Whisper Large V3 Model on SageMaker Using Hugging Face Libraries</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Tue, 23 Jan 2024 12:02:56 +0000</pubDate>
      <link>https://dev.to/makawtharani/deploying-openais-whisper-large-v3-model-on-sagemaker-using-hugging-face-libraries-hlh</link>
      <guid>https://dev.to/makawtharani/deploying-openais-whisper-large-v3-model-on-sagemaker-using-hugging-face-libraries-hlh</guid>
      <description>&lt;p&gt;In a recent project, I was utilizing OpenAI's Whisper model for transcription. The sprint goal was to deploy it on SageMaker, leveraging the smoothness of Hugging Face libraries. However, I encountered a block: a ModelError that puzzled me for a couple of hours.&lt;br&gt;
The error in more details:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{&lt;br&gt;
  "code": 400,&lt;br&gt;
  "type": "InternalServerException",&lt;br&gt;
  "message": "Wrong index found for \u003c|0.02|\u003e: should be None but found 50366."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After some research, I discovered a solution discussed in &lt;a href="https://huggingface.co/openai/whisper-large-v3/discussions/58" rel="noopener noreferrer"&gt;Issue #58&lt;/a&gt; on the Hugging Face forum, within the OpenAI Whisper Large V3 repository. The solution indicates that the issue is caused by version differences in the transformers library, and to resolve it we need to enforce a more recent version. It's important to note that the required versions are not yet shipped with the Hugging Face deployment containers (as of writing), which is why we pin them explicitly in requirements.txt.&lt;/p&gt;

&lt;p&gt;In this blog post, I will present a straightforward method to implement this solution, whether you are utilizing a SageMaker domain or a SageMaker notebook for deploying Whisper Large models.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Setting Up Directory and Files
&lt;/h3&gt;

&lt;p&gt;In this phase, we create the necessary directory structure and files for our Whisper model deployment. &lt;br&gt;
This includes creating the whisper-model directory, the inference.py script, and the requirements.txt file. &lt;br&gt;
The script inference.py sets up the model and processor configurations for the Whisper model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Directory and file paths
&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./whisper-model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;inference_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code/inference.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;requirements_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code/requirements.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the directory structure
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dirname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_file_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inference.py content
&lt;/span&gt;&lt;span class="n"&gt;inference_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Model and task specifications
model_id = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/whisper-large-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
task = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;automatic-speech-recognition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

# Device configuration
device = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; if torch.cuda.is_available() else &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

def model_fn(model_dir):
    try:
        print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading model: {model_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
        # Load the model
        model = AutoModelForSpeechSeq2Seq.from_pretrained(
            model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
        )
        model.to(device)
        print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model loaded on device: {device}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

        # Load the processor
        processor = AutoProcessor.from_pretrained(model_id)
        print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processor loaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

        # Create and return a pipeline for ASR
        asr_pipeline = pipeline(
            task,
            model=model,
            tokenizer=processor.tokenizer,
            feature_extractor=processor.feature_extractor,
            return_timestamps=True,
            torch_dtype=torch_dtype,
            device=device,
        )
        print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pipeline created&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)

        return asr_pipeline
    except Exception as e:
        print(f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: {e}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
        raise
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

&lt;span class="c1"&gt;# Write the inference.py file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inference_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Requirements.txt content
&lt;/span&gt;&lt;span class="n"&gt;requirements_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
transformers==4.38.0
accelerate==0.26.1
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;

&lt;span class="c1"&gt;# Write the requirements.txt file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Archiving the Directory
&lt;/h3&gt;

&lt;p&gt;In this phase, we archive the entire whisper-model directory into a compressed file using the make_archive function from shutil. &lt;br&gt;
This compressed file is prepared for deployment to SageMaker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;shutil&lt;/span&gt;
&lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;make_archive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./whisper-model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gztar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./whisper-model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Uploading the Model to S3
&lt;/h3&gt;

&lt;p&gt;This phase involves uploading the Whisper model, which is now in a compressed format, to an Amazon S3 bucket. &lt;br&gt;
We utilize SageMaker's capabilities to interact with S3 for efficient storage and retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# Get the SageMaker session and default S3 bucket
&lt;/span&gt;&lt;span class="n"&gt;sagemaker_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_bucket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Change if you want to store in a different bucket
&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whisper/code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the model to S3
&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whisper-model.tar.gz&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model uploaded to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Deploying the Model on SageMaker
&lt;/h3&gt;

&lt;p&gt;Here, we deploy the Whisper model on SageMaker using the Hugging Face Model Class. &lt;br&gt;
We specify the transformers version, PyTorch version, Python version, instance type, and other parameters to ensure smooth deployment as an inference endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.huggingface&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HuggingFaceModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;

&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_execution_role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# create Hugging Face Model Class
&lt;/span&gt;&lt;span class="n"&gt;huggingface_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;transformers_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4.26.0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pytorch_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.13.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;py_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;py39&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# deploy model to SageMaker Inference
&lt;/span&gt;&lt;span class="n"&gt;predictor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;huggingface_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;initial_instance_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instance_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml.g5.2xlarge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
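
&lt;p&gt;A quick aside before moving on: deployment can take several minutes, and if your notebook kernel restarts you don't need to redeploy. A minimal sketch for reattaching to an existing endpoint (the endpoint name below is a placeholder; use the value of predictor.endpoint_name from the deployment above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sagemaker.huggingface.model import HuggingFacePredictor

# Reattach to an already-deployed endpoint instead of calling deploy() again.
# "whisper-endpoint-name" is a placeholder for your actual endpoint name.
predictor = HuggingFacePredictor(endpoint_name="whisper-endpoint-name")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;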



&lt;h3&gt;
  
  
  5. Making Predictions with the Deployed Model
&lt;/h3&gt;

&lt;p&gt;In this final phase, we configure the deployed model to handle audio input data. &lt;br&gt;
We specify the data serializer for audio and demonstrate how to use the deployed model for making predictions, such as transcribing speech from audio files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sagemaker.serializers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataSerializer&lt;/span&gt;

&lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serializer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataSerializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;audio/x-audio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Make sure the input file "sample1.flac" exists
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;predictor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
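
&lt;p&gt;When you are done experimenting, it is worth tearing the endpoint down so you are not billed for idle GPU compute. A minimal cleanup sketch using the same predictor object:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Delete the endpoint (and its endpoint configuration) to stop incurring charges.
predictor.delete_endpoint()

# Optionally remove the model registered in SageMaker as well.
predictor.delete_model()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;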



&lt;h2&gt;
  
  
  Hope it was helpful!!
&lt;/h2&gt;

</description>
      <category>whisper</category>
      <category>huggingface</category>
      <category>sagemaker</category>
      <category>aws</category>
    </item>
    <item>
      <title>Embedding AWS Bedrock Into Your Workloads</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Tue, 03 Oct 2023 07:29:46 +0000</pubDate>
      <link>https://dev.to/makawtharani/embedding-aws-bedrock-into-your-workloads-5069</link>
      <guid>https://dev.to/makawtharani/embedding-aws-bedrock-into-your-workloads-5069</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Generative AI is on the rise, and with services like Amazon Bedrock, it's easier than ever to integrate powerful foundation models (FMs) into your workloads. If you're wondering how to take full advantage of this technology, this blog post is for you. I delve into the world of Amazon Bedrock and guide you through embedding it seamlessly into running applications. &lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Amazon Bedrock?
&lt;/h3&gt;

&lt;p&gt;Amazon Bedrock is Amazon's fully managed service designed for building and scaling generative AI applications. In simple terms, it's a platform that provides access to a range of top-performing FMs from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon itself. With Bedrock, developers can easily integrate and experiment with various FMs, customize them with their own data, and even extend their capabilities without writing a single line of code.&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of Using Amazon Bedrock
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Diverse Range of Foundation Models: Experiment and deploy with a plethora of FMs from industry leaders, all accessible via a single API.&lt;/li&gt;
&lt;li&gt;Minimal Code Changes: Due to its consistent API structure, transitioning between different models or updating to newer versions requires minimal code alterations.&lt;/li&gt;
&lt;li&gt;Code-free Customization: Enhance FMs with your own data through a visual interface. Link with datasets on Amazon S3 and tweak hyperparameters to achieve optimal performance.&lt;/li&gt;
&lt;li&gt;Fully Managed Agents: Beyond just generative capabilities, Bedrock offers agents that can execute intricate business tasks, such as managing inventory or processing insurance claims. These agents dynamically interact with company systems and APIs, taking generative AI applications to a new frontier.&lt;/li&gt;
&lt;li&gt;Knowledge Bases Enhancement: Securely connect FMs to your data sources within Bedrock. This feature augments the model's capabilities, making it more attuned to your domain or organization's specifics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Embedding Amazon Bedrock in Your Workloads
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start with the Playground: Before diving deep, experiment with different FMs in Bedrock's playground. This sandbox environment allows for quick testing and understanding of a model's capabilities.&lt;/li&gt;
&lt;li&gt;Integrate Using Bedrock's API: Regardless of the FM you choose, Bedrock offers a unified API. This ensures easy integration with your applications and consistency in invoking different models.&lt;/li&gt;
&lt;li&gt;Customize for Your Needs: Once integrated, use Bedrock's visual interface to enhance the model's performance. Link it with datasets stored in Amazon S3 and adjust hyperparameters to get the desired results.&lt;/li&gt;
&lt;li&gt;Incorporate Managed Agents: Want to automate complex business tasks? Bedrock's agents can be dynamically called to interact with your systems. Whether it's orchestrating ad campaigns or managing inventories, these agents can significantly optimize processes.&lt;/li&gt;
&lt;li&gt;Enrich with Knowledge Bases: Extend the model's capabilities by connecting it to your data sources within Bedrock. This allows the FM to be more knowledgeable and attuned to your organization's specific nuances. 
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Integration: Embedding Amazon Bedrock Models into Your Applications using Python
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Prerequisites
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Python environment with Python 3.8 or later installed&lt;/li&gt;
&lt;li&gt;AWS Account &lt;/li&gt;
&lt;li&gt;boto3 installed
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Amazon Bedrock Access&lt;/li&gt;
&lt;/ul&gt;
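
&lt;p&gt;Before invoking any model, you can sanity-check that your account actually has Bedrock access by listing the available foundation models. A minimal sketch, assuming your AWS credentials and region are already configured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# List the foundation models visible to your account in us-east-1.
bedrock = boto3.client("bedrock", region_name="us-east-1")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["modelId"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;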

&lt;h4&gt;
  
  
  Invoking AI21's J2-Ultra-V1 Model
&lt;/h4&gt;

&lt;p&gt;Here's a Python script that uses the Boto3 library to invoke AI21's J2-Ultra-V1 model via Amazon Bedrock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Constants
&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;REGION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai21.j2-ultra-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CONTENT_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ACCEPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_bedrock_client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create and return a Bedrock client.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION_NAME&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_request_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate and return the body for the request.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stopSequences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;countPenalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;presencePenalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequencyPenalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scale&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Invoke the model and return the response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_request_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modelId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contentType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CONTENT_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ACCEPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prompt_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_bedrock_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Invoking Anthropic's Claude-V2 Model with Streamed Response
&lt;/h4&gt;

&lt;p&gt;To get a streamed response from Anthropic's Claude-V2 model, use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Constants
&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;REGION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CONTENT_TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ACCEPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ANTHROPIC_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_bedrock_client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create and return a Bedrock client.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION_NAME&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_request_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate and return the body for the request.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Human: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Assistant:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens_to_sample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_sequences&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Human:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ANTHROPIC_VERSION&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Invoke the model and return the response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_request_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modelId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contentType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CONTENT_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ACCEPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model_with_response_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_and_print_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract response and print it.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write an article about the fictional planet Foobar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_bedrock_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;extract_and_print_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  Why Go for Amazon Bedrock?
&lt;/h3&gt;

&lt;p&gt;In the era of AI and machine learning, having a robust platform is essential for harnessing the full potential of generative models. Amazon Bedrock stands out due to its comprehensive capabilities, serverless nature, and seamless integration options. With Bedrock, not only do you get access to industry-leading FMs but also the tools to customize and extend their capabilities. And the best part? All of this without the hassle of managing infrastructure or delving deep into code. &lt;br&gt;&lt;br&gt;
In conclusion, Amazon Bedrock is an indispensable tool for businesses and developers looking to leap into the future of generative AI. By embedding it into your workloads, you open doors to unprecedented automation, efficiency, and innovation. Dive into the world of Amazon Bedrock today and reshape how you approach AI in your applications.&lt;/p&gt;


</description>
      <category>genai</category>
      <category>aws</category>
      <category>cloudcomputing</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Transformative Power of Generative AI Chatbots</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Tue, 03 Oct 2023 07:00:41 +0000</pubDate>
      <link>https://dev.to/makawtharani/the-transformative-power-of-generative-ai-chatbots-jgj</link>
      <guid>https://dev.to/makawtharani/the-transformative-power-of-generative-ai-chatbots-jgj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The realm of business is constantly evolving, with Artificial Intelligence (AI) playing a key role in this transformation. Among the various AI-powered tools, "Generative AI Chatbots" have emerged as a game-changer [6]. &lt;br&gt;&lt;br&gt;
According to recent reports from McKinsey on generative artificial intelligence (GenAI), it's evident that despite the technology being in its early stages, its usage is already extensive. In fact, leading companies are spearheading this advancement and are significantly ahead in their use of GenAI [1].&lt;br&gt;&lt;br&gt;
This blog provides an accessible guide to these innovative chatbots, explaining their unique qualities, their rising significance for businesses of all sizes, and showcasing our solution built on AWS. We'll also explore what the future holds for this technology within the business landscape.&lt;br&gt;&lt;br&gt;
So, whether you're a business owner looking to stay ahead, or simply a tech enthusiast intrigued by the latest developments, this blog is for you. &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Generative AI Chatbots
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are Generative AI Chatbots?
&lt;/h3&gt;

&lt;p&gt;Think of Generative AI Chatbots as advanced virtual assistants, like ChatGPT from OpenAI [5]. They're not typical chatbots that provide predefined responses. Instead, these chatbots are capable of crafting their own answers to your queries in real time, just like a human would. This means they can engage in conversations that feel a lot more natural and less robotic. &lt;/p&gt;

&lt;h3&gt;
  
  
  How do Generative AI Chatbots work?
&lt;/h3&gt;

&lt;p&gt;The magic behind Generative AI Chatbots lies in their ability to learn and adapt. They are built using sophisticated AI techniques. Picture it this way: just as a human learns from experience, these chatbots learn from vast amounts of data. They are fed with different conversations and situations, which they use to formulate responses. The more data they consume, the better they become at communicating.&lt;/p&gt;

&lt;h3&gt;
  
  
  How are Generative AI Chatbots different from traditional rule-based chatbots?
&lt;/h3&gt;

&lt;p&gt;Traditional rule-based chatbots are like vending machines. You press a specific button (ask a particular question), and you get a specific product (a predefined answer). There's no room for creativity. On the other hand, Generative AI Chatbots are more like skilled chefs. Give them a set of ingredients (a question or a problem), and they can whip up a unique dish (a solution or a response) based on what they've learned from previous cooking experiences. This ability makes them more versatile, responsive, and engaging compared to their rule-based counterparts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Importance of Generative AI Chatbots for Businesses
&lt;/h2&gt;

&lt;p&gt;PwC's study reveals that Generative AI can drastically boost business operations, notably by improving customer engagement, automating high-volume tasks, and simplifying the understanding of unstructured data. [2] &lt;br&gt;&lt;br&gt;
Generative AI chatbots represent a powerful tool for businesses [3, 4]. They offer personalized, efficient customer service and can provide valuable insights while improving operational efficiency. As a business owner, implementing this technology could help take your operations to the next level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer Service
&lt;/h3&gt;

&lt;p&gt;One of the most significant benefits of generative AI chatbots is their ability to revolutionize customer service. They can interact with customers 24/7, resolving queries and addressing concerns swiftly. &lt;br&gt;&lt;br&gt;
For instance, if you own a retail store, a generative AI chatbot can assist online shoppers, answer their queries about products, guide them through purchases, and even help with returns or exchanges. The best part? They can handle multiple customers at once, something a human representative might struggle with during peak hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational Efficiency
&lt;/h3&gt;

&lt;p&gt;Generative AI chatbots can also streamline internal operations. They can automate routine tasks, helping to save valuable human resources for more complex tasks. &lt;br&gt;&lt;br&gt;
Imagine you run a recruitment agency. The chatbot can handle initial candidate screening, suggest interview questions, and even answer frequent candidate queries. This way, your HR team can focus more on assessing candidate suitability and less on administrative tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalable Business Operations
&lt;/h3&gt;

&lt;p&gt;Unlike human employees, generative AI chatbots can easily scale to handle an increase in workload during high demand periods. They can manage multiple interactions simultaneously without any drop in service quality. &lt;br&gt;&lt;br&gt;
For example, if you run an airline business, during holiday seasons, the demand for ticket bookings, changes, and cancellations increases. Here, chatbots can handle these requests, allowing your staff to focus on urgent and more complex issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data-Driven Insights
&lt;/h3&gt;

&lt;p&gt;Generative AI chatbots can also gather valuable insights from their interactions with customers. They can analyze patterns, preferences, and behaviors, enabling businesses to develop more personalized marketing strategies and make informed decisions.&lt;br&gt;
In particular, as a restaurant owner, based on customers' frequently asked questions and ordering patterns collected by the chatbot, you might decide to launch a new menu item or start a special promotion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation of Generative AI Chatbots on AWS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Objective
&lt;/h4&gt;

&lt;p&gt;Improve student engagement, comprehension, and access to course material through a generative AI chatbot.&lt;/p&gt;

&lt;h4&gt;
  
  
  Overview
&lt;/h4&gt;

&lt;p&gt;A university develops an AI-powered chatbot called "CourseBot." Professors can upload course content to an admin portal, and students can interact with CourseBot to get answers, explanations, and clarifications related to the course material.&lt;/p&gt;

&lt;h4&gt;
  
  
  Process
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Admin Portal for Professors:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Upload Functionality: Professors can upload textbooks, research papers, lecture notes, slides, and other educational materials.&lt;/li&gt;
&lt;li&gt;Ingesting Phase: The AI chatbot, after consuming the content, undergoes a brief fine-tuning phase, ensuring it understands the specifics of the uploaded materials.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Student Interaction:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Queries: Students can ask the chatbot questions like, "Can you explain the Pythagorean theorem?" or "What did Professor Smith mention about quantum mechanics in last week’s lecture?"&lt;/li&gt;
&lt;li&gt;Study Assistance: The chatbot can provide summaries, highlight key concepts, and guide students to relevant sections in the course materials.&lt;/li&gt;
&lt;li&gt;Engagement Tools: The chatbot can conduct quizzes, flashcard reviews, and other interactive tools based on the course content to help students reinforce their understanding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Benefits
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Increased Engagement: By offering immediate answers, students are more likely to engage with the course material outside of classroom hours.&lt;/li&gt;
&lt;li&gt;Personalized Learning: Students can learn at their own pace, revisit topics they find challenging, and get targeted assistance.&lt;/li&gt;
&lt;li&gt;Accessibility: The chatbot can assist students who may be hesitant to approach professors with "simple" questions, ensuring no student is left behind.&lt;/li&gt;
&lt;li&gt;Cost Efficiency: While the initial setup may require resources, over time, the chatbot can reduce the need for additional tutors or assistants, leading to long-term savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Business Results
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Improved Student Retention: With better access to course material and understanding, students are less likely to drop out or fail.&lt;/li&gt;
&lt;li&gt;Higher Satisfaction Scores: Both student and faculty satisfaction rates can increase due to the streamlined approach to learning and teaching.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Screenshots
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol13vebxf8yq4pqs0hco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol13vebxf8yq4pqs0hco.png" width="800" height="333"&gt;&lt;/a&gt;&lt;br&gt;Admin Portal – Professor Adding the Documents
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q79hxo1dhzif9eskb5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4q79hxo1dhzif9eskb5f.png" width="800" height="335"&gt;&lt;/a&gt;&lt;br&gt;Admin Portal – The Solution Ingesting the Documents
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqab1hvbkpyon1c1507si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqab1hvbkpyon1c1507si.png" width="800" height="420"&gt;&lt;/a&gt;&lt;br&gt;Student Portal – Student Raising a Question
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Summary
&lt;/h3&gt;

&lt;p&gt;Our innovative open-source solution is designed to transform business communication and customer interaction. This solution can be easily deployed on Amazon Web Services (AWS) using the AWS Cloud Development Kit (CDK), reducing the hassle of setup and manual deployment. The estimated time for deployment is 30-45 minutes. &lt;br&gt;&lt;br&gt;
Our generative AI chatbot solution comprises two integral processes: the Ingest Process and the Answer Process.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingest Process: The first process handles the ingestion of the relevant content. This means the AI chatbot is tuned on your specific business data, which could include data from your website, customer service logs, product catalogs, and any other text-based information you want the chatbot to learn from. &lt;/li&gt;
&lt;li&gt;Answer Process: Once ingestion is complete, the second process steps in. This is the Answer Process, where the chatbot responds to questions based on the ingested content. Using advanced Natural Language Processing (NLP) algorithms and the latest Large Language Models (LLMs), the chatbot deciphers the user's query, matches it against the ingested data, and generates an appropriate and helpful response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our solution's beauty lies in its simplicity of deployment and the ease with which it can be integrated into existing workflows, making it a valuable addition to any business striving to optimize its operations and improve customer interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqa1bi4vaguyr242wvl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqa1bi4vaguyr242wvl2.png" width="800" height="536"&gt;&lt;/a&gt;&lt;br&gt;Solution Architecture
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Workflow
&lt;/h3&gt;

&lt;p&gt;The workflow outlines the AWS solution for deploying a chatbot with two distinct routes: data ingestion and data querying. The process follows a systematic flow as described below: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The data ingestion and querying endpoints are deployed on Amazon SageMaker, from a specified container registry. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ingest Route: &lt;br&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user initiates the data ingestion process by making a request to the Ingest Lambda function through its designated function URL. &lt;/li&gt;
&lt;li&gt;The Ingest Lambda function, upon receiving the user's request, invokes the SageMaker endpoint, targeting the '/ingest' route. &lt;/li&gt;
&lt;li&gt;The SageMaker model retrieves the required data, already processed and stored in Amazon S3, and proceeds to create embeddings. &lt;/li&gt;
&lt;li&gt;The generated embeddings are stored in an OpenSearch vector database for efficient access and retrieval. &lt;/li&gt;
&lt;li&gt;Upon completion, the SageMaker model sends a response back to the Lambda function. &lt;/li&gt;
&lt;li&gt;The user is promptly notified of the ingestion process status through the response received from the Lambda function. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Query Route: &lt;br&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To perform a data query, the user sends a request to the Query Lambda function using the designated function URL. &lt;/li&gt;
&lt;li&gt;The Query Lambda function, upon receiving the user's query request, invokes the SageMaker endpoint, targeting the '/query' route. &lt;/li&gt;
&lt;li&gt;The SageMaker model processes the user's query, converting it into embeddings, and initiates a similarity search using OpenSearch. The search retrieves data related to the query. &lt;/li&gt;
&lt;li&gt;The retrieved data, along with the original query, is used by the model to generate a prompt. The prompt is then passed to the Language Model (LLM) to generate a response. &lt;/li&gt;
&lt;li&gt;The Lambda function sends back the generated response to the user, completing the query process.&lt;/li&gt;
&lt;/ol&gt;
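
&lt;p&gt;To make the query route concrete, here is a minimal client-side sketch. The function URL and the request body are placeholders; the exact payload shape depends on how the Query Lambda handler parses the event:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Placeholder: use the function URL created for the Query Lambda function.
QUERY_FUNCTION_URL = "https://your-query-function-url.lambda-url.us-east-1.on.aws/"

# Assumed payload shape: a JSON body carrying the user's question.
payload = {"question": "What did Professor Smith mention about quantum mechanics?"}

response = requests.post(QUERY_FUNCTION_URL, json=payload)
print(response.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;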

&lt;h3&gt;
  
  
  Architecture Components
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;AWS Lambda: This component takes care of all the computing needs for our solution. Lambda runs the code without requiring you to provision or manage servers, providing a straightforward way to run the chatbot operations, scale automatically, and handle high availability.&lt;/li&gt;
&lt;li&gt;AWS OpenSearch Service: Acting as a vector database, OpenSearch Service saves the embeddings of our solution. It is instrumental in the process of querying and retrieving information, ensuring rapid and precise retrieval of data, which enhances the efficiency and accuracy of the chatbot's responses.&lt;/li&gt;
&lt;li&gt;AWS SageMaker: This is where our machine learning models are deployed. SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly, acting as the brain behind the chatbot solution.&lt;/li&gt;
&lt;li&gt;AWS S3 (Simple Storage Service): This is used to store the documents. S3 provides scalable object storage for data backup, archival, and analytics - a safe and secure environment for all the data our solution uses and generates.&lt;/li&gt;
&lt;li&gt;AWS ECR (Elastic Container Registry): Our solution's container images are hosted here. ECR makes it easy for developers to store, manage, and deploy Docker container images, allowing the seamless and reliable deployment of our solution.&lt;/li&gt;
&lt;li&gt;AWS IAM (Identity and Access Management): It's used to control the access permissions. IAM ensures that the right entities have the correct access to resources, maintaining the security and integrity of our solution while adhering to the principle of least privilege.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Future of Generative AI Chatbots in Business and Conclusion
&lt;/h2&gt;

&lt;p&gt;The future of generative AI is shaping up to redefine the business landscape. As highlighted in Accenture's Technology Vision 2023, titled "When Atoms Meet Bits: The Foundations of Our New Reality," there is a profound convergence taking place between the physical and digital realms. This merger is driven by mega technology trends and recently Gen AI. &lt;br&gt; &lt;br&gt;
Generative AI, especially generative AI chatbot platforms, stands out as a transformative force. Its ability to augment human capabilities is gaining recognition worldwide. According to Accenture's findings, a staggering 40% of all work hours are projected to be supported or enhanced by language-based AI in the foreseeable future. This sentiment is echoed by business leaders; a whopping 98% believe that AI foundational models will be pivotal in shaping their organization's strategies within the next three to five years [7]. &lt;br&gt; &lt;br&gt;
However, the promise of generative AI doesn't come without its set of challenges. As Paul Daugherty, Group Chief Executive of Accenture Technology, rightly points out, diving into its full potential demands substantial investments, specifically in data, people, and the customization of foundation models tailored to fit unique organizational needs. Businesses looking to harness this promising technology need to start now, preparing for a future where the lines between our digital and physical worlds blur, offering unprecedented opportunities for innovation and growth.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pwc.com/us/en/services/generative-ai.html" rel="noopener noreferrer"&gt;https://www.pwc.com/us/en/services/generative-ai.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#key-insights" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#key-insights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cbinsights.com/research/generative-ai-chatbots-banking-market/" rel="noopener noreferrer"&gt;https://www.cbinsights.com/research/generative-ai-chatbots-banking-market/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://openai.com/chatgpt" rel="noopener noreferrer"&gt;https://openai.com/chatgpt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/generative-ai-is-here-how-tools-like-chatgpt-could-change-your-business" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/quantumblack/our-insights/generative-ai-is-here-how-tools-like-chatgpt-could-change-your-business&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://newsroom.accenture.com/news/accenture-technology-vision-2023-generative-ai-to-usher-in-a-bold-new-future-for-business-merging-physical-and-digital-worlds.html" rel="noopener noreferrer"&gt;https://newsroom.accenture.com/news/accenture-technology-vision-2023-generative-ai-to-usher-in-a-bold-new-future-for-business-merging-physical-and-digital-worlds.html&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>ChatGPT-4.0 vs Falcon-40B Life Hacks</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Fri, 23 Jun 2023 07:34:56 +0000</pubDate>
      <link>https://dev.to/makawtharani/chatgpt-40-vs-falcon-40b-life-hacks-3nap</link>
      <guid>https://dev.to/makawtharani/chatgpt-40-vs-falcon-40b-life-hacks-3nap</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This documentation outlines the solution architecture for two conversational bots opening and closing a conversation with each other on the topic of life hacks. It utilizes AWS resources including SageMaker, Lambda functions, and DynamoDB, together with two LLMs: GPT-3.5 Turbo and Falcon-40B. The models are invoked by Lambda functions that process the input and store the outputs in a DynamoDB table. &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosdekzq1i8dp2hghonl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosdekzq1i8dp2hghonl4.png" alt="Solution Architecture" width="557" height="338"&gt;&lt;/a&gt; &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Lambda Function 1:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Calls OpenAI’s GPT-3.5 Turbo. &lt;/li&gt;
&lt;li&gt;Generates questions surrounding life hacks. &lt;/li&gt;
&lt;li&gt;Takes the reply from the Falcon bot via DynamoDB and generates a new question based on the response. &lt;/li&gt;
&lt;li&gt;The new question generated by the GPT model is stored back in DynamoDB. 
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Lambda Function 2:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Calls the SageMaker endpoint where Falcon-40B is deployed. &lt;/li&gt;
&lt;li&gt;Takes the question generated by GPT-3.5 as input from DynamoDB and passes it to the endpoint. &lt;/li&gt;
&lt;li&gt;The response from Falcon is stored back in DynamoDB. 
&lt;/li&gt;
&lt;/ul&gt;
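
&lt;p&gt;For illustration, Lambda Function 2 might call the Falcon-40B endpoint roughly as follows. The endpoint name and payload shape are assumptions, not the exact code of this solution:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import boto3

# Invoke the SageMaker endpoint hosting Falcon-40B.
# "falcon-40b-endpoint" is a placeholder endpoint name.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="falcon-40b-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Question from the GPT bot goes here"}),
)
print(json.loads(response["Body"].read()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;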

&lt;h4&gt;
  
  
  DynamoDB:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Serves as storage for the conversation history between the two chatbots. &lt;/li&gt;
&lt;li&gt;Allows both Lambda functions to easily store and retrieve data. &lt;/li&gt;
&lt;li&gt;Each record contains ResponseID, BotType, Prompt, and Timestamp. &lt;/li&gt;
&lt;li&gt;Prompt is the generated response from the bots, and BotType specifies which bot generated it. 
&lt;/li&gt;
&lt;/ul&gt;
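
&lt;p&gt;For illustration, a single record might look like this (the values are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Example record shape (values are illustrative)
example_item = {
    "ResponseID": "b7f1c2d4-0000-0000-0000-000000000000",  # unique id of the turn
    "BotType": "GPT",                                      # "GPT" or "Falcon"
    "Prompt": "What is your favorite kitchen life hack?",
    "Timestamp": 1687505696,                               # Unix epoch seconds
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;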

&lt;h4&gt;
  
  
  GPT-3.5 Turbo:
&lt;/h4&gt;

&lt;p&gt;GPT-3 is a large language model developed by OpenAI that is capable of understanding and generating natural language. GPT-3.5 is an improved version, a more powerful and capable model than GPT-3. &lt;/p&gt;

&lt;h4&gt;
  
  
  Falcon 40B:
&lt;/h4&gt;

&lt;p&gt;Falcon-40B is an open-source LLM with 40 billion parameters. It is deployed on SageMaker for easy access through an endpoint. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The solution architecture described in this documentation enables the seamless integration of Falcon-40B and GPT-3.5 with AWS Lambda functions and DynamoDB to create a conversation, demonstrating the power of LLMs and their capacity for human-like interaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat Snippets
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbcycvpge59mp6kgs0ow.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbcycvpge59mp6kgs0ow.PNG" alt="Chat Snippet 1" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hxuot5052xobwiejvff.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hxuot5052xobwiejvff.PNG" alt="Chat Snippet 2" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xisj5otu3vksyyggven.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xisj5otu3vksyyggven.PNG" alt="Chat Snippet 3" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>falcon</category>
      <category>ai</category>
      <category>life</category>
    </item>
    <item>
      <title>OpenAI Whisper Deployment on AWS as Asynchronous Endpoint</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Wed, 31 May 2023 14:41:44 +0000</pubDate>
      <link>https://dev.to/makawtharani/openai-whisper-deployment-on-aws-as-asynchronous-endpoint-1a7k</link>
      <guid>https://dev.to/makawtharani/openai-whisper-deployment-on-aws-as-asynchronous-endpoint-1a7k</guid>
      <description>&lt;p&gt;This project facilitates the deployment of the OpenAI Whisper ASR model on SageMaker, and handling of inference results using SageMaker, S3, SNS, DynamoDB, and Lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  Description
&lt;/h2&gt;

&lt;p&gt;The CDK script sets up infrastructure in AWS to automatically deploy a Whisper ASR model behind an Amazon SageMaker asynchronous endpoint. It also sets up resources to store, manage, and process model inference results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;S3 Bucket: An S3 bucket is created to store the model and inference results. The bucket is automatically deleted when the stack is deleted.&lt;/li&gt;
&lt;li&gt;IAM Role: An IAM Role is set up with appropriate permissions for SageMaker to access the S3 bucket, CloudWatch Logs, and to pull images from the ECR repository.&lt;/li&gt;
&lt;li&gt;SageMaker Model: The script defines a SageMaker model, including a container definition property and sets up an endpoint configuration for asynchronous inference.&lt;/li&gt;
&lt;li&gt;SNS Topics: Two SNS topics are created for success and error notifications.&lt;/li&gt;
&lt;li&gt;DynamoDB Table: A DynamoDB table is created to store job results.&lt;/li&gt;
&lt;li&gt;Lambda Function: A Lambda function is created to process job results and store them in the DynamoDB table. The function uses the SNS topics as event sources (a condensed CDK sketch of these resources follows this list).&lt;/li&gt;
&lt;/ol&gt;
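
&lt;p&gt;A condensed CDK sketch of these resources in Python (names and properties are illustrative, and the SageMaker model, endpoint configuration, and IAM role are omitted for brevity; see the repo for the real script):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Condensed CDK sketch of the stack's resources (illustrative, not the repo's exact code)
from aws_cdk import (
    RemovalPolicy, Stack,
    aws_dynamodb as dynamodb,
    aws_lambda as _lambda,
    aws_lambda_event_sources as event_sources,
    aws_s3 as s3,
    aws_sns as sns,
)
from constructs import Construct


class WhisperStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # S3 bucket for the model and inference results; deleted with the stack.
        bucket = s3.Bucket(self, "WhisperBucket",
                           removal_policy=RemovalPolicy.DESTROY,
                           auto_delete_objects=True)

        # SNS topics for success and error notifications.
        success_topic = sns.Topic(self, "SuccessTopic")
        error_topic = sns.Topic(self, "ErrorTopic")

        # DynamoDB table for job results.
        results_table = dynamodb.Table(self, "ResultsTable",
            partition_key=dynamodb.Attribute(name="JobId",
                                             type=dynamodb.AttributeType.STRING))

        # Lambda function that processes job results, triggered by both topics.
        handler = _lambda.Function(self, "ResultsHandler",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"))
        handler.add_event_source(event_sources.SnsEventSource(success_topic))
        handler.add_event_source(event_sources.SnsEventSource(error_topic))
        results_table.grant_write_data(handler)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;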

&lt;h2&gt;
  
  
  Output
&lt;/h2&gt;

&lt;p&gt;At the end, the script provides output which includes the names of the created S3 bucket and the deployed SageMaker endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functionality
&lt;/h2&gt;

&lt;p&gt;When a job result is produced by the SageMaker model, it is sent to the appropriate SNS topic depending on whether the job was successful or not. &lt;br&gt;
The SNS topic triggers the Lambda function, which processes the result and stores it in the DynamoDB table.&lt;br&gt;
The following diagram shows the architecture and workflow of the solution.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgbjmp2a4gje72tud3ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgbjmp2a4gje72tud3ot.png" alt="Architecture" width="672" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Asynchronous inference
&lt;/h2&gt;

&lt;p&gt;Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. &lt;br&gt;
This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. &lt;br&gt;
Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Creating an asynchronous inference endpoint is similar to creating real-time inference endpoints. You can use your existing SageMaker models and only need to specify the AsyncInferenceConfig object while creating your endpoint configuration with the EndpointConfig field in the CreateEndpointConfig API. The following diagram shows the architecture and workflow of Asynchronous Inference.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezbc1dwy4e08bx0wgzuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezbc1dwy4e08bx0wgzuj.png" alt="Architecture" width="800" height="424"&gt;&lt;/a&gt;&lt;br&gt;
To invoke the endpoint, you need to place the request payload in Amazon S3 and provide a pointer to this payload as a part of the InvokeEndpointAsync request. &lt;br&gt;
Upon invocation, SageMaker queues the request for processing and returns an identifier and output location as a response. &lt;br&gt;
Upon processing, SageMaker places the result in the Amazon S3 location. We receive success or error notifications with Amazon SNS.&lt;/p&gt;
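
&lt;p&gt;Sketched with boto3, that extra step looks roughly like this (the model name, instance type, bucket, and topic ARNs are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: creating an endpoint config for asynchronous inference (names are assumptions)
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="whisper-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "whisper-model",        # an existing SageMaker model
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Where SageMaker places results, and where notifications go.
            "S3OutputPath": "s3://my-whisper-bucket/async-output/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:success-topic",
                "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:error-topic",
            },
        },
    },
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;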

&lt;h2&gt;
  
  
  Additional Material
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The ECR Image for the Whisper Model
&lt;/h3&gt;

&lt;p&gt;It is a Flask-based Python server that uses the OpenAI Whisper ASR system to transcribe audio files. The server listens for HTTP POST requests containing information about an audio file, downloads the file from an Amazon S3 bucket, transcribes it, and then uploads the transcription back to the S3 bucket.&lt;br&gt;
For more info, please check app&amp;gt;asr_server.py and app&amp;gt;Dockerfile.&lt;/p&gt;
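
&lt;p&gt;In outline, the server does something like the following (a sketch assuming the open-source whisper package; routes, payload fields, and the model size are assumptions, and the real code lives in app&amp;gt;asr_server.py):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the transcription server (illustrative; see app/asr_server.py for the real code)
import boto3
import whisper
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
model = whisper.load_model("base")  # model size is an assumption


@app.route("/invocations", methods=["POST"])
def invocations():
    body = request.get_json()
    bucket, key = body["bucket"], body["key"]

    # Download the audio file from S3, transcribe it, and upload the text back.
    local_path = "/tmp/" + key.split("/")[-1]
    s3.download_file(bucket, key, local_path)
    text = model.transcribe(local_path)["text"]
    s3.put_object(Bucket=bucket, Key=key + ".txt", Body=text.encode("utf-8"))
    return jsonify({"transcript": text})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;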

&lt;h3&gt;
  
  
  Script to invoke the endpoint
&lt;/h3&gt;

&lt;p&gt;The script invokes an Amazon SageMaker endpoint for asynchronous inference. The inference is performed on an input audio file that resides in an Amazon S3 bucket, and the transcription is returned.&lt;br&gt;
Before running the script, you need the AWS CLI set up on your local machine or server and your AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN) properly configured.&lt;br&gt;
For more info, please check whisper-invoke&amp;gt;whisper.py.&lt;/p&gt;
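
&lt;p&gt;A minimal invocation sketch (the endpoint name and S3 locations are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: invoking the asynchronous endpoint (names are assumptions)
import boto3

runtime = boto3.client("sagemaker-runtime")

# The request payload (a pointer to the audio file) must already be in S3.
response = runtime.invoke_endpoint_async(
    EndpointName="whisper-async-endpoint",
    InputLocation="s3://my-whisper-bucket/input/request.json",
    ContentType="application/json",
)

# SageMaker returns an inference id and the S3 location where the result will land.
print(response["InferenceId"], response["OutputLocation"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;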

&lt;p&gt;&lt;a href="https://github.com/makawtharani/openai-whisper-deployment.git" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>openai</category>
      <category>whisper</category>
      <category>ai</category>
    </item>
    <item>
      <title>Speech-to-Image</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Sat, 19 Nov 2022 10:15:49 +0000</pubDate>
      <link>https://dev.to/makawtharani/speech-to-image-1fea</link>
      <guid>https://dev.to/makawtharani/speech-to-image-1fea</guid>
      <description>&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;In this blog, we demonstrate how multiple deep-learning models can complement one another to form a speech-to-image solution. We built on the Stable Diffusion model and utilized Amazon Transcribe and Amazon Translate to shape our solution. Once an mp3 recording is uploaded to S3, a Lambda function calls Amazon Transcribe to transcribe the recording (speech-to-text), then calls Amazon Translate to translate the transcribed text (text-to-text), and finally invokes the Hugging Face model deployed on SageMaker to generate an image from the translated text (text-to-image), which is also uploaded to a specific sub-bucket. &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwic27n7cterg0yhuub9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwic27n7cterg0yhuub9o.png" alt="First Image" width="199" height="192"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5jiqra9ls1rfvapvbp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5jiqra9ls1rfvapvbp4.png" alt="Second Image" width="190" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum5z1pvlsmb5z5pjxbdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum5z1pvlsmb5z5pjxbdq.png" alt="Image description" width="190" height="192"&gt;&lt;/a&gt; &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1zzx2vkj7pxa5oqabtf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1zzx2vkj7pxa5oqabtf.png" alt="Image description" width="185" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Text-to-image in artificial intelligence is the generation of images from a given prompt. It requires a high level of both natural language processing and image generation. Deciphering text to generate images is a major field in AI and still a work in progress, as collecting and processing real images is expensive. &lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8l8lyzpffm8xe6zxge7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8l8lyzpffm8xe6zxge7.png" alt="Image description" width="411" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
Once an mp3 recording is uploaded to S3, a Lambda function calls Amazon Transcribe to transcribe the recording (speech-to-text), then calls Amazon Translate to translate the transcribed text (text-to-text), and finally invokes the Hugging Face model deployed on SageMaker to generate an image from the translated text (text-to-image), which is also uploaded to a specific sub-bucket.&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Stable Diffusion
&lt;/h2&gt;

&lt;p&gt;The algorithm used is Stable Diffusion from Hugging Face, a data science platform that provides tools enabling users to build, train, and deploy ML models based on open-source code and technologies. Stable Diffusion is a latent diffusion model, a variety of generative neural network developed by Stability AI. It was trained on pairs of images and captions taken from LAION-5B (a dataset of 5.85 billion CLIP-filtered image-text pairs). As a latent text-to-image diffusion model, it is capable of generating photo-realistic images from any text input, giving people the freedom to produce striking imagery within seconds.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Lambda Function
&lt;/h2&gt;

&lt;p&gt;The Lambda function is developed to call the deep-learning models and chain their responses. Three main models are executed in this Lambda; each model’s invocation code is shown in detail after this section.&lt;br&gt;&lt;br&gt;
The Lambda function code is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transcribe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transcribe_mp3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;translate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;translate_text&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hf_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;generate_images&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;upload2s3&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;upload_file&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;#parse out the bucket &amp;amp; file name from the event handler
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Records&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;file_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;object_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://s3.amazonaws.com/{0}/{1}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;transcribed_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transcribe_mp3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;translated_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;translate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;generated_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generated_images&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;img_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;translated_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.jpeg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="n"&gt;img_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;img_name&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# enter the s3 bucket name
&lt;/span&gt;                &lt;span class="n"&gt;object_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;img_name&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda handled Successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Transcribe Model
&lt;/h2&gt;

&lt;p&gt;For this purpose, Amazon Transcribe was used. It is an automatic speech recognition service that uses deep learning models to convert audio to text. &lt;br&gt;&lt;br&gt;
The code includes boto3 library which allows us to access AWS services, it is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;


&lt;span class="n"&gt;transcribe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcribe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_mp3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_transcription_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TranscriptionJobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;LanguageCode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ar-AE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MediaFormat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mp3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Media&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MediaFileUri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;object_url&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_transcription_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TranscriptionJobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;audio-rawV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJob&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJobStatus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;In Progress&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;load_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJob&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TranscriptFileUri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;load_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_json&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcripts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Translate Model
&lt;/h2&gt;

&lt;p&gt;In order to make the application more accessible to users around the world, Amazon Translate was used. It’s a neural machine translation service that uses deep-learning models to deliver fast, high-quality language translation. With this integration, the user can enter the prompt in any language, and Amazon Translate will transform it into English, as the algorithm requires. The implementation uses the Boto3 library, which connects the code to the Amazon Translate service: the input text is translated into English and saved as the prompt variable. &lt;br&gt;The code used is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;translate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;translate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;translate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SourceLanguageCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;TargetLanguageCode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TranslatedText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Hugging Face Model
&lt;/h2&gt;

&lt;p&gt;To deploy the Hugging Face model, you can simply follow the steps provided by Phil Schmid in his blog post “&lt;a href="https://www.philschmid.de/sagemaker-stable-diffusion" rel="noopener noreferrer"&gt;Stable Diffusion on Amazon SageMaker&lt;/a&gt;”. Once the model is deployed, note the SageMaker endpoint name, which we will invoke from our Lambda code. &lt;br&gt;&lt;br&gt;
The code to invoke the SageMaker Endpoint is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;


&lt;span class="n"&gt;ENDPOINT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt; &lt;span class="c1"&gt;# enter your ENDPOINT_NAME 
&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sagemaker-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# helper decoder
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decode_base64_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;base64_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_images_per_prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_images_per_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_images_per_prompt&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EndpointName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ENDPOINT_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                       &lt;span class="n"&gt;ContentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                       &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response_decoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;decoded_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;decode_base64_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_decoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decoded_images&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;







&lt;h1&gt;
  
  
  Hope you enjoyed!
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/MohAlbakerKaw/Speech-to-Image" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
    </item>
    <item>
      <title>Fraud Detection</title>
      <dc:creator>Mohamad Albaker Kawtharani</dc:creator>
      <pubDate>Sat, 27 Aug 2022 10:10:02 +0000</pubDate>
      <link>https://dev.to/makawtharani/fraud-detection-2h73</link>
      <guid>https://dev.to/makawtharani/fraud-detection-2h73</guid>
      <description>&lt;p&gt;Recent surveys show a notable increase in online fraud in the Middle East. Visa, Dubai Police, and Dubai Economy (DED) revealed cardinal findings in 2021, shedding light on a significant percentage of UAE consumers who experienced online fraud. Besides, the big four consultancies shared several surveys with a remarkable rise in fraud and financial crime in the Middle East. At the end of PwC’s recent Global Economic Crime and Fraud Survey, it states a gap between the good intentions of Middle East organizations to prevent fraud and their ability to improve their performance in this area. A couple of weeks ago, Dr. Scott Nowson -AI lead &lt;a class="mentioned-user" href="https://dev.to/pwc"&gt;@pwc&lt;/a&gt; ME- dived into harnessing the top notch technology to reduce the false positives for anti-money laundering, payment fraud, or financial crime. &lt;/p&gt;

&lt;p&gt;At Zero&amp;amp;One, we have raised the flag to combat online fraud. The ML team is sharing an end-to-end technical demo with different approaches to identifying cases that represent financial and regulatory risks, showing the power of machine learning models on AWS. &lt;/p&gt;

&lt;p&gt;The dataset used to demonstrate the fraud detection solution is the one collected and analyzed during a research collaboration between Worldline and the Machine Learning Group (&lt;a href="http://mlg.ulb.ac.be" rel="noopener noreferrer"&gt;http://mlg.ulb.ac.be&lt;/a&gt;) of ULB (Université Libre de Bruxelles) on big-data mining and fraud detection. &lt;/p&gt;

&lt;p&gt;Github: &lt;a href="https://github.com/MohAlbakerKaw/Fraud-Detector.git" rel="noopener noreferrer"&gt;https://github.com/MohAlbakerKaw/Fraud-Detector.git&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Technology advances and new advantages come to light, but not without problems: new, undiscovered issues arrive with everything new. One of the issues that has grown with the advancement of technology is fraud. Fraud has existed since the beginning of humankind, but once online transactions and payments became commonplace, it proved a gift for hackers and fraudsters; the main types of fraud experienced by consumers are phishing, credit card fraud, and receiving counterfeit goods. &lt;/p&gt;

&lt;p&gt;The survey conducted by Deloitte in 2021, entitled the Middle East Fraud Survey, found that 48% of respondents witnessed an increase in fraudulent incidents compared to earlier years, with cybercrime and technology fraud the leading cause of fraud in the MENA region over the last two years, at 24%. &lt;/p&gt;

&lt;p&gt;According to PwC’s Middle East Economic Crime and Fraud Survey, traditional fraud types continue to feature prominently in the region compared with the global survey average. Procurement fraud, which may include the practice of favoring known associates with vendor and supplier contracts, remains a significant and growing problem. In 2018, 22% of Middle East respondents said their organization had suffered procurement fraud. By 2020, the proportion had risen to 42%, more than double the global survey average of 19%. Customer fraud is also a growing problem for Middle East organizations, with 47% of respondents reporting an incident during the past two years, up from 36% in 2018. &lt;/p&gt;

&lt;p&gt;In addition, a 2020 UAE cybercrime survey by KPMG revealed that 73% of respondents expect their business to invest in changes to its cybercrime prevention initiatives. Compared to the rest of the world, the Middle East is expected to show a particularly strong increase in commitment to combating fraud. &lt;/p&gt;

&lt;p&gt;The increase in technology-enabled fraud has resulted in the development of countermeasures to reduce its impact and losses. These countermeasures include wider implementation of anti-fraud policies and increased spending on combating fraud and economic crime. Organizations are learning more and are ready to dedicate resources to fighting cybercrime. One of the most advanced tools used to win this fight is machine learning. It helps recognize and analyze patterns, which in turn helps in understanding and preventing threats with the same or similar patterns. In addition, machine learning helps cybersecurity teams be more proactive in preventing threats and responding to active attacks in real time. &lt;/p&gt;

&lt;p&gt;Moreover, after fraud became a grave issue, major companies started creating services devoted to countering the problem and reducing its impact. Amazon was one of those companies: Amazon Web Services provides various machine learning services that aid in building efficient applications. The most notable of these services for fraud detection are Amazon SageMaker and Amazon Fraud Detector. &lt;/p&gt;

&lt;p&gt;Amazon SageMaker is a Platform as a Service (PaaS) that is used to build, train, and deploy machine learning models allowing users to focus on the development without having to worry about the infrastructure. It is the perfect service for organizations that prefer building their own models. It also provides built-in algorithms and pre-trained models through the AWS Marketplace to ease and speed up building fraud detection models. One of the major advantages it provides is the ability to scale up quickly and train models faster. &lt;/p&gt;

&lt;p&gt;Fraud Detector is a fully managed machine learning service that enables customers to identify potentially fraudulent activities and catch more online fraud, faster and in real time. The service draws on more than 20 years of experience with patterns from attempts to defraud Amazon.com, evaluating fraud data to generate model scores and model performance data. A decision logic can be configured to interpret the score and assign outcomes for each fraud evaluation. Amazon Fraud Detector is made especially for organizations with no machine learning experience, as it can be set up and added to a solution application in a short amount of time. It has proven a great addition for organizations that have made use of it: Omnyex reduced fraudulent transactions by 6%, and Icony decreased the time spent dealing with fake accounts by 77%. &lt;/p&gt;

&lt;p&gt;Our analysis has been put together on Amazon Fraud Detector and the usage of the SageMaker platform to create machine learning models for fraud detection. Different approaches were used to demonstrate AWS for the fraud detection use case. Besides the sample provided by AWS that demonstrates how to operationalize Amazon Fraud Detector, we are sharing three approaches that enable the deployment of fraud detection machine learning models on SageMaker. &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
