<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Milad Rezaeighale</title>
    <description>The latest articles on DEV Community by Milad Rezaeighale (@miladrezaei).</description>
    <link>https://dev.to/miladrezaei</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2220284%2Feda20ba1-dadc-4f2e-979c-957dc2818bb1.png</url>
      <title>DEV Community: Milad Rezaeighale</title>
      <link>https://dev.to/miladrezaei</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/miladrezaei"/>
    <language>en</language>
    <item>
      <title>MCPfying Tools Securely at Scale with Bedrock AgentCore Gateway</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Tue, 03 Feb 2026 09:44:54 +0000</pubDate>
      <link>https://dev.to/aws-builders/mcpfying-tools-securely-at-scale-with-bedrock-agentcore-gateway-e3d</link>
      <guid>https://dev.to/aws-builders/mcpfying-tools-securely-at-scale-with-bedrock-agentcore-gateway-e3d</guid>
      <description>&lt;p&gt;As organizations move from single-agent experiments to production-grade agentic systems, the bottleneck is rarely the model. It’s the tool layer: how teams expose capabilities, how agents discover the right tools, how invocation is standardized across heterogeneous backends, and how governance is enforced consistently as usage scales.&lt;/p&gt;

&lt;p&gt;In this article, I describe an enterprise pattern for “MCP-fying” internal tools using &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Gateway&lt;/a&gt;—treating it as a centralized MCP front door for tool discovery and invocation. The goal is not to wrap one function, but to establish a repeatable approach that reduces duplicated integrations, supports multi-team ownership, and creates a foundation for secure, scalable tool access across the organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Gateway as an enterprise tool layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Gateway&lt;/a&gt; is a managed &lt;strong&gt;tool front door&lt;/strong&gt; that exposes organizational capabilities as discoverable, invokable tools through an &lt;strong&gt;MCP-compatible&lt;/strong&gt; interface. Instead of every agent framework integrating separately with every backend service, you register backends behind the gateway as targets, define tool schemas (contracts) once, and let clients interact through one consistent surface.&lt;/p&gt;

&lt;p&gt;Clients typically use three MCP-style operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool discovery&lt;/strong&gt; (what tools exist).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool search/filtering&lt;/strong&gt; (find the right tool at scale).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool invocation&lt;/strong&gt; (run a tool with inputs).&lt;/li&gt;
&lt;/ul&gt;
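
&lt;p&gt;On the wire these are JSON-RPC calls. As a rough sketch (the method names &lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt; follow the MCP specification; the tool name and arguments here are hypothetical), an MCP client builds messages like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

def mcp_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 message of the kind MCP clients send."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        msg["params"] = params
    return msg

# Tool discovery: list everything the gateway exposes
list_tools = mcp_request("tools/list")

# Tool invocation: run one tool with inputs (hypothetical name/arguments)
call_tool = mcp_request(
    "tools/call",
    params={"name": "get_weather", "arguments": {"city": "Berlin"}},
    request_id=2,
)

print(json.dumps(list_tools))
print(json.dumps(call_tool))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice you rarely build these messages by hand; the MCP client library used later in this article (via Strands) does it for you. Search/filtered discovery is a Gateway capability layered on top of plain discovery.&lt;/p&gt;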

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8mgr2wyp86dtplyjabd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8mgr2wyp86dtplyjabd.png" alt="AgentCore Gateway as the MCP tool front door with identity, IAM execution, and observability" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key capabilities of AgentCore Gateway
&lt;/h2&gt;

&lt;p&gt;AgentCore Gateway provides a set of capabilities designed to standardize and simplify tool integration across teams and agent frameworks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified MCP endpoint&lt;/strong&gt; – A stable entry point that exposes tools through a consistent contract for discovery, search, and invocation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol translation &amp;amp; request routing&lt;/strong&gt; – Converts MCP tool calls into the appropriate backend action and routes requests to the correct target/tool implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition (many tools, one front door)&lt;/strong&gt; – Aggregates tools from multiple backends so agents integrate once with the gateway instead of many services directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Targets&lt;/strong&gt; for enterprise backends – Connect common enterprise surfaces as tool targets, such as:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAPI-defined APIs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smithy models&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;In this article we focus on MCPfying tools with AWS Lambda.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed operations&lt;/strong&gt; – Centralizes telemetry and operational visibility (for example via Amazon CloudWatch) for troubleshooting and governance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable discovery&lt;/strong&gt; – Supports narrowing the toolset at runtime (semantic search / filtered discovery) to reduce tool overload and improve tool selection in large catalogs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OK, enough theory. Let’s build!&lt;/p&gt;

&lt;h2&gt;
  
  
  What we’ll build
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Configure identity (inbound) and Create the Gateway&lt;/li&gt;
&lt;li&gt;Add a Lambda target + IAM execution role (outbound)&lt;/li&gt;
&lt;li&gt;Connect a Strands agent to the Gateway (MCP client)&lt;/li&gt;
&lt;li&gt;Invoke tools through the agent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full notebook (end-to-end) is available in &lt;a href="https://github.com/miladrezaei-ai/amazon-bedrock-agentcore-samples/tree/main/AgentCore-gateway/mcpfying-lambda-into-mcp-tools" rel="noopener noreferrer"&gt;my repo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Configure identity (inbound) and Create the Gateway
&lt;/h2&gt;

&lt;p&gt;For this implementation I use &lt;a href="https://github.com/aws/bedrock-agentcore-starter-toolkit" rel="noopener noreferrer"&gt;bedrock_agentcore_starter_toolkit&lt;/a&gt;, AWS’s starter toolkit for Amazon Bedrock AgentCore. Besides deploying Python agents to AgentCore Runtime with “zero infrastructure” to manage, it provides helper operations for creating Gateways, authorizers, and targets, so you can go from local code to a working setup quickly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging
from bedrock_agentcore_starter_toolkit.operations.gateway.client import GatewayClient

client = GatewayClient(region_name=os.environ["AWS_DEFAULT_REGION"])

cognito_authorizer = client.create_oauth_authorizer_with_cognito("agentcore-gateway-test")

# Create Gateway (MCP) and capture identifiers
gateway = client.create_mcp_gateway(authorizer_config=cognito_authorizer["authorizer_config"])
gateway_id = gateway["gatewayId"]
gateway_url = gateway["gatewayUrl"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll need &lt;strong&gt;gateway_id&lt;/strong&gt; and &lt;strong&gt;gateway_url&lt;/strong&gt; later.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Add a Lambda target + IAM execution role (outbound)
&lt;/h2&gt;

&lt;p&gt;When you use &lt;strong&gt;bedrock_agentcore_starter_toolkit&lt;/strong&gt; with &lt;strong&gt;create_mcp_gateway_target&lt;/strong&gt; and no explicit payload, the toolkit automatically provisions an AWS Lambda function as a target that exposes two example tools: get_weather and get_time. You can see that Lambda function in your console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lambda_target_1 = client.create_mcp_gateway_target(
    gateway=gateway,
    target_type="lambda"  # helper creates/uses a default lambda + tool schema
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this tutorial, we’ll go a step further by creating a custom Lambda-backed target with a simple tool that returns a random number (Option B in the notebook). To do that, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an AWS Lambda function and copy its function ARN.&lt;/li&gt;
&lt;li&gt;Create a Gateway target for that Lambda using create_mcp_gateway_target.&lt;/li&gt;
&lt;li&gt;Define the tool schema (contract) so the Gateway knows the tool name, inputs, and what output to expect.&lt;/li&gt;
&lt;/ol&gt;
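
&lt;p&gt;The Lambda side of step 1 can stay tiny. A minimal sketch of the handler backing get_random_number (the exact event and context shape the Gateway passes should be verified against the AgentCore documentation; this only assumes the tool takes no inputs and returns an integer, matching the tool schema defined next):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random

def lambda_handler(event, context):
    """Handler for the custom Gateway target (sketch).

    The tool declared in the target schema takes no inputs and
    returns an integer, so the handler just returns a random number.
    """
    return random.randint(1, 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;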

&lt;p&gt;&lt;strong&gt;Explicit target payload (more realistic)&lt;/strong&gt;&lt;br&gt;
This is the part that matters most for enterprise usage: you’re defining the tool contract (schema) explicitly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;lambda_target_configuration = {
    "lambdaArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:agentCoreGatewayCustomLambda",
    "toolSchema": {
        "inlinePayload": [
            {
                "name": "get_random_number",
                "description": "Return a random number",
                "inputSchema": {"type": "object", "properties": {}, "required": []},
                "outputSchema": {"type": "integer"},
            }
        ]
    },
}

lambda_target_2 = client.create_mcp_gateway_target(
    gateway=gateway,
    target_type="lambda",
    target_payload=lambda_target_configuration,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Connect a Strands agent to the Gateway (MCP client)
&lt;/h2&gt;

&lt;p&gt;Cool! Now let’s create an agent with &lt;a href="https://strandsagents.com/latest/" rel="noopener noreferrer"&gt;Strands&lt;/a&gt; and connect it to the AgentCore Gateway as an MCP client. Two important details from the working flow: refresh the access token before connecting (Cognito tokens expire), and keep the MCP client running while the agent is operating (start it once, don’t recreate it per call).&lt;br&gt;
After connecting, your first sanity check is to list the tools exposed by the Gateway. If ListTools doesn’t return what you expect, the issue is usually the Authorization header, the /mcp suffix, or the Gateway target/tool configuration, not the agent itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from strands import Agent
from strands.models import BedrockModel
from strands.tools.mcp.mcp_client import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# refresh token
access_token = client.get_access_token_for_cognito(cognito_authorizer["client_info"])

# ensure /mcp suffix
mcp_url = gateway_url if gateway_url.endswith("/mcp") else f"{gateway_url}/mcp"

mcp_client = MCPClient(
    lambda: streamablehttp_client(
        url=mcp_url,
        headers={"Authorization": f"Bearer {access_token}"},
    )
)

mcp_client.start()
tools = mcp_client.list_tools_sync()

# Bedrock model for the agent
model = BedrockModel(model_id="eu.amazon.nova-pro-v1:0")  # choose your model
agent = Agent(model=model, tools=tools)

print("Loaded tools:", agent.tool_names)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see tools like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;TargetName&amp;gt;___get_weather&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;TargetName&amp;gt;___get_time&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;TargetName&amp;gt;___get_random_number&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
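
&lt;p&gt;The Gateway builds these names by joining the target name and the tool name with a triple underscore. A small helper (hypothetical, but matching the naming shown above) makes them easy to split when you want to route or log per target:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def split_gateway_tool_name(full_name):
    """Split a Gateway tool name of the form 'TargetName___tool_name'."""
    target, sep, tool = full_name.partition("___")
    if not sep:
        # No target prefix present; treat the whole string as the tool name
        return None, full_name
    return target, tool

print(split_gateway_tool_name("TestGatewayTargetc7c8080f___get_time"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;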

&lt;h2&gt;
  
  
  4. Invoke tools through the agent
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = agent("Get the time for ECT")
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should be able to see this in the output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt; The user has requested the time for ECT, which stands for Eastern Caribbean Time. I need to use the &lt;code&gt;TestGatewayTargetc7c8080f___get_time&lt;/code&gt; tool to get the current time for this timezone.  Tool #1: TestGatewayTargetc7c8080f___get_time&lt;br&gt;
The current time in Eastern Caribbean Time (ECT) is 2:30 PM.Response:  The current time in Eastern Caribbean Time (ECT) is 2:30 PM.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This validates the full chain:&lt;br&gt;
&lt;strong&gt;Agent&lt;/strong&gt; → &lt;strong&gt;MCP client&lt;/strong&gt; → &lt;strong&gt;AgentCore Gateway&lt;/strong&gt; → &lt;strong&gt;Lambda target&lt;/strong&gt; → &lt;strong&gt;tool response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup (important)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here we delete the Gateway and its targets, and remove the Cognito user pool/domain created for the tutorial to avoid leaving unused resources behind (and any potential costs).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ognito_idp = boto3.client("cognito-idp")

# Stop the MCP client if it's running
try:
    if streamable_http_mcp_client.is_running():
        streamable_http_mcp_client.stop()
        print("✓ MCP client stopped")
except:
    pass

# Deletinbg User Pool
try:
    cognito_idp = boto3.client("cognito-idp")
    cognito_idp.delete_user_pool_domain(
        UserPoolId=COGNITO_USER_POOL_ID,
        Domain=DOMAIN_PREFIX 
    )
    print(f"✓ Deleted Cognito domain: {DOMAIN_PREFIX}")

    cognito_idp.delete_user_pool(UserPoolId=COGNITO_USER_POOL_ID)
    print(f"✅ Deleted Cognito user pool: {COGNITO_USER_POOL_ID}")

except Exception as e:
    msg = str(e).lower()
    if "notfound" in msg or "not found" in msg:
        print("ℹ️ Cognito user pool already deleted (nothing to clean up).")
    else:
        print(f"❌ Failed to delete Cognito user pool: {e}")
        raise

# Deleting the gateway and its targets
try:
    client.cleanup_gateway(gatewayID)
    print("✅ Cleanup complete! (gateway + targets deleted)")
except Exception as e:
    msg = str(e).lower()
    if "notfound" in msg or "not found" in msg:
        print("ℹ️ Gateway already deleted (nothing to clean up).")
    else:
        print(f"❌ Cleanup failed: {e}")
        raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Practical notes from this implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool schema is the real product surface.&lt;/strong&gt; Treat it like an API contract (names, descriptions, input schema quality).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target naming matters.&lt;/strong&gt; The final tool name includes the target name prefix; keep it stable and readable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep the MCP client session alive&lt;/strong&gt; while the agent is running, otherwise tool calls will fail.&lt;/li&gt;
&lt;li&gt;You get enterprise-friendly operational behavior because the tool access surface is centralized (and can be governed consistently later).&lt;/li&gt;
&lt;/ul&gt;
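
&lt;p&gt;Since the tool schema is a contract, it pays to enforce it mechanically at call time. A minimal, dependency-free sketch that checks the required properties of an inputSchema before invoking a tool (a production setup would use a full JSON Schema validator instead; the schema here is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def check_required_inputs(input_schema, arguments):
    """Return the list of required properties missing from a tool call."""
    required = input_schema.get("required", [])
    return [name for name in required if name not in arguments]

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

print(check_required_inputs(schema, {}))                  # missing 'city'
print(check_required_inputs(schema, {"city": "Berlin"}))  # nothing missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;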

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This article shows how &lt;a href="https://aws.amazon.com/blogs/machine-learning/introducing-amazon-bedrock-agentcore-gateway-transforming-enterprise-ai-agent-tool-development/" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Gateway&lt;/a&gt; acts as an enterprise “tool front door”: agents don’t integrate with every backend directly—instead they connect once over &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; to discover, search, and invoke tools through a single stable endpoint.&lt;/p&gt;

&lt;p&gt;The article then walks through a practical build: creating a gateway, wiring identity (inbound auth via &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/identity-overview.html" rel="noopener noreferrer"&gt;AgentCore Identity&lt;/a&gt; and an IdP like &lt;a href="https://aws.amazon.com/pm/cognito/" rel="noopener noreferrer"&gt;Amazon Cognito&lt;/a&gt;), connecting an AWS Lambda target with an IAM execution role (outbound auth), defining tool schemas (contracts), and validating everything end-to-end by connecting a &lt;a href="https://strandsagents.com/latest/" rel="noopener noreferrer"&gt;Strands Agents&lt;/a&gt; MCP client to list tools and run invocations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/miladrezaei-ai/amazon-bedrock-agentcore-samples/tree/935d75fcd57b173f942ae6c1b6a677560ff8279c" rel="noopener noreferrer"&gt;My GitHub (full notebook + code)&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>aws</category>
      <category>mcp</category>
      <category>agentcore</category>
    </item>
    <item>
      <title>Amazon Bedrock AgentCore Setup Confusion: Which IAM Role Do I Need?</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Wed, 07 Jan 2026 13:14:32 +0000</pubDate>
      <link>https://dev.to/aws-builders/amazon-bedrock-agentcore-setup-confusion-which-iam-role-do-i-need-1pk1</link>
      <guid>https://dev.to/aws-builders/amazon-bedrock-agentcore-setup-confusion-which-iam-role-do-i-need-1pk1</guid>
      <description>&lt;p&gt;If you’re trying to deploy an agent into Amazon Bedrock AgentCore Runtime and you see a CLI flag like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore configure --entrypoint my_agent.py -er &amp;lt;YOUR_IAM_ROLE_ARN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…it’s easy to get stuck.&lt;/p&gt;

&lt;p&gt;Because the value you pass to &lt;strong&gt;-er&lt;/strong&gt; is not your IAM user, and it’s not your SSO role. It’s a separate Execution Role that AgentCore Runtime assumes to run your agent.&lt;/p&gt;

&lt;p&gt;Even after publishing my earlier article on &lt;a href="https://dev.to/aws-builders/from-demos-to-business-value-taking-agents-to-production-with-amazon-bedrock-agentcore-2pdj"&gt;building an agent with AgentCore&lt;/a&gt;, I noticed there’s still a common point of confusion for many people. So I decided to write this article and explain what role you need to create!&lt;/p&gt;

&lt;p&gt;Once you create that role correctly, deployment becomes straightforward.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This guide is based on the official AWS documentation for AgentCore Runtime permissions: &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-permissions.html" rel="noopener noreferrer"&gt;IAM Permissions for AgentCore Runtime&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What you actually need (2 identities)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1) Your “caller identity”&lt;/strong&gt;&lt;br&gt;
This is the identity you use to run the CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM User, or&lt;/li&gt;
&lt;li&gt;SSO Role (IAM Identity Center)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This identity needs permission to deploy and configure resources, and often &lt;code&gt;iam:PassRole&lt;/code&gt; so it can hand the execution role to AgentCore.&lt;/p&gt;
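
&lt;p&gt;As a sketch, the PassRole statement on the caller identity typically looks like the following (the account ID and role-name pattern are placeholders; the &lt;code&gt;iam:PassedToService&lt;/code&gt; condition scopes it to the AgentCore service principal, which is worth verifying against the current AWS docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPassExecutionRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/AgentCoreRuntimeExecutionRole-*",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "bedrock-agentcore.amazonaws.com"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;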

&lt;p&gt;&lt;strong&gt;2) The “AgentCore Runtime execution role” (the important one)&lt;/strong&gt;&lt;br&gt;
This is the role AgentCore uses at runtime to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull images from ECR (if applicable),&lt;/li&gt;
&lt;li&gt;write logs to CloudWatch,&lt;/li&gt;
&lt;li&gt;send traces to X-Ray,&lt;/li&gt;
&lt;li&gt;publish metrics,&lt;/li&gt;
&lt;li&gt;call Bedrock models,&lt;/li&gt;
&lt;li&gt;get workload access tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the ARN you pass via &lt;strong&gt;-er&lt;/strong&gt;.&lt;/p&gt;
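
&lt;p&gt;A quick sanity check before deploying can save a confusing error: the value must look like an IAM &lt;em&gt;role&lt;/em&gt; ARN, not a user or assumed-role ARN. A rough, hypothetical validator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

# Rough shape of an IAM role ARN (sketch; partition fixed to "aws")
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")

def looks_like_role_arn(value):
    """Quick sanity check for the value passed to -er."""
    return bool(ROLE_ARN_RE.match(value))

print(looks_like_role_arn("arn:aws:iam::123456789012:role/AgentCoreRuntimeExecutionRole"))  # True
print(looks_like_role_arn("arn:aws:sts::123456789012:assumed-role/MySSORole/session"))      # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;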
&lt;h2&gt;
  
  
  Step-by-step: Create the AgentCore Runtime Execution Role in AWS Console
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Create the Role&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to AWS Console → IAM&lt;/li&gt;
&lt;li&gt;Click Roles → Create role&lt;/li&gt;
&lt;li&gt;Choose Custom trust policy&lt;/li&gt;
&lt;li&gt;Paste this trust policy (replacing 123456789012 with your account ID and us-east-1 with your region):
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version":"2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeRolePolicy",
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock-agentcore.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "123456789012"
        },
        "ArnLike": {
          "aws:SourceArn": "arn:aws:bedrock-agentcore:us-east-1:123456789012:*"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
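
&lt;p&gt;If you prefer scripting over the console, the same trust policy can be generated and passed to boto3 (a sketch; the &lt;code&gt;create_role&lt;/code&gt; call is left commented out because it needs live AWS credentials):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

def build_trust_policy(account_id, region):
    """Render the AgentCore Runtime trust policy for one account/region."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AssumeRolePolicy",
                "Effect": "Allow",
                "Principal": {"Service": "bedrock-agentcore.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {
                        "aws:SourceArn": f"arn:aws:bedrock-agentcore:{region}:{account_id}:*"
                    },
                },
            }
        ],
    }

policy = build_trust_policy("123456789012", "us-east-1")
# boto3.client("iam").create_role(
#     RoleName="AgentCoreRuntimeExecutionRole",
#     AssumeRolePolicyDocument=json.dumps(policy),
# )
print(json.dumps(policy, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;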


&lt;ol&gt;
&lt;li&gt;Name the role something clear, for example &lt;strong&gt;AgentCoreRuntimeExecutionRole-&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Create the role.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Attach the correct permissions policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where most people get confused.&lt;/p&gt;

&lt;p&gt;You want the policy titled “AgentCore Runtime execution role” (NOT the “direct deploy execution role”, and NOT the “starter toolkit” caller policy).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the role you just created&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Permissions tab&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add permissions&lt;/strong&gt; → &lt;strong&gt;Create inline policy&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Paste the following policy JSON:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ECRImageAccess",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": [
                "arn:aws:ecr:us-east-1:123456789012:repository/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:CreateLogGroup"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:123456789012:log-group:/aws/bedrock-agentcore/runtimes/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:123456789012:log-group:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:123456789012:log-group:/aws/bedrock-agentcore/runtimes/*:log-stream:*"
            ]
        },
        {
            "Sid": "ECRTokenAccess",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "xray:PutTraceSegments",
                "xray:PutTelemetryRecords",
                "xray:GetSamplingRules",
                "xray:GetSamplingTargets"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": "cloudwatch:PutMetricData",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "bedrock-agentcore"
                }
            }
        },
        {
            "Sid": "GetAgentAccessToken",
            "Effect": "Allow",
            "Action": [
                "bedrock-agentcore:GetWorkloadAccessToken",
                "bedrock-agentcore:GetWorkloadAccessTokenForJWT",
                "bedrock-agentcore:GetWorkloadAccessTokenForUserId"
            ],
            "Resource": [
                "arn:aws:bedrock-agentcore:us-east-1:123456789012:workload-identity-directory/default",
                "arn:aws:bedrock-agentcore:us-east-1:123456789012:workload-identity-directory/default/workload-identity/agentName-*"
            ]
        },
        {
            "Sid": "BedrockModelInvocation",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:*::foundation-model/*",
                "arn:aws:bedrock:us-east-1:123456789012:*"
            ]
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Replace 123456789012 and us-east-1 with your account ID and region.&lt;/li&gt;
&lt;li&gt;Click Next, give the policy a name, and save it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 3 — Copy the Role ARN (this is what -er needs)&lt;/strong&gt;&lt;br&gt;
In IAM → Roles → open your role → copy ARN.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy using the role ARN in your CLI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore configure --entrypoint my_agent.py -er YOUR-ROLE_ARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that &lt;strong&gt;my_agent.py&lt;/strong&gt; must be replaced with your own entry file, the one where you define your AgentCore setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
The key unlock is understanding:&lt;/p&gt;

&lt;p&gt;✅ -er expects the AgentCore Runtime execution role ARN&lt;br&gt;
❌ It is NOT your user/SSO identity ARN&lt;/p&gt;

&lt;p&gt;Once that role exists (trust + runtime policy), deployment works.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>bedrockagentcore</category>
      <category>iamrole</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>From Demos to Business Value: Taking Agents to Production with Amazon Bedrock AgentCore</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Mon, 11 Aug 2025 11:05:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/from-demos-to-business-value-taking-agents-to-production-with-amazon-bedrock-agentcore-2pdj</link>
      <guid>https://dev.to/aws-builders/from-demos-to-business-value-taking-agents-to-production-with-amazon-bedrock-agentcore-2pdj</guid>
      <description>&lt;p&gt;We’re living in the era of agents and agentic workflows. Frameworks like &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, &lt;a href="https://www.llamaindex.ai/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt;, &lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;, and others make it easier than ever to design complex single- or multi-agent systems that can plan, reason, and act. It’s exciting to see these frameworks powering demos that wow technical teams and spark imagination.&lt;/p&gt;

&lt;p&gt;But here’s the catch: no matter how clever the prompt chaining is, or how impressive the reasoning looks on screen, it doesn’t create real business value until it’s deployed into production and embedded into the company’s workflows. For executives, a polished demo is nice — but a production-ready agent that’s delivering measurable outcomes is what really matters.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt; comes in. It enables you to deploy and operate highly effective agents securely, at scale, using any framework or model — including open-source options like LangChain or LlamaIndex. With AgentCore, you can accelerate AI agents into production with the scale, reliability, and security essential for real-world use. It offers tools to enhance agent capabilities, purpose-built infrastructure to scale securely, and controls to ensure trustworthiness. Best of all, its services are composable and framework-agnostic, so you don’t have to choose between open-source flexibility and enterprise-grade robustness.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Theory to Practice
&lt;/h2&gt;

&lt;p&gt;We’ve talked about why production deployment matters and how Amazon Bedrock AgentCore is designed to make it easier, faster, and more secure. Now, without any further explanation, let’s get straight to the point. In this article, we’ll keep things simple by using the &lt;strong&gt;AgentCore Starter Toolkit&lt;/strong&gt;, which is ideal for quick prototyping and testing. In the following steps, I’ll walk you through how to use it to deploy your own agent into production with AgentCore.&lt;/p&gt;

&lt;p&gt;Before starting, ensure your AWS CLI is configured and authenticated. You can either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;AWS SSO&lt;/strong&gt; via &lt;code&gt;aws configure sso&lt;/code&gt;, or&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;access keys&lt;/strong&gt; via &lt;code&gt;aws configure&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This configuration must be done in the same environment where you will run your Python script so that boto3 can authenticate and invoke your Bedrock AgentCore runtime successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 – Configuration
&lt;/h2&gt;

&lt;p&gt;First, install the Bedrock AgentCore Starter Toolkit. This toolkit gives you a ready-made environment to quickly prototype and test agents before taking them to production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install bedrock-agentcore-starter-toolkit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, you’ll have access to CLI commands and project templates that speed up setup so you can focus on building and deploying your agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 – Create Your Project Folder
&lt;/h2&gt;

&lt;p&gt;Next, set up a simple project structure for your agent. This will keep your code, dependencies, and package definition organized for deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Folder Structure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your_project_directory/
├── my_agent.py     
├── requirements.txt     
└── __init__.py          

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;File Contents&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;my_agent.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from strands import Agent
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands.models import BedrockModel
import json

model_id = "eu.anthropic.claude-3-7-sonnet-20250219-v1:0"
model = BedrockModel(
    model_id=model_id,
)

agent = Agent(
    model=model
)

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    """
    Invoke the agent with a payload
    """
    user_input = payload.get("prompt")
    print("User input:", user_input)
    response = agent(user_input)
    return response.message['content'][0]['text']

if __name__ == "__main__":
    app.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;requirements.txt&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;strands-agents
bedrock-agentcore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This minimal setup defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;my_agent.py — where your agent’s logic lives and integrates with AgentCore.&lt;/li&gt;
&lt;li&gt;requirements.txt — listing dependencies so they can be installed in the runtime environment.&lt;/li&gt;
&lt;li&gt;__init__.py — ensures the folder is treated as a Python package.&lt;/li&gt;
&lt;/ul&gt;
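If you prefer to script this setup, a small stdlib-only helper can create the layout above (file names and requirements contents are taken from this guide; the helper itself is just a convenience, not part of the toolkit):

```python
from pathlib import Path

def scaffold_agent_project(root: str) -> Path:
    """Create the minimal AgentCore project layout shown above."""
    project = Path(root)
    project.mkdir(parents=True, exist_ok=True)
    # Entrypoint and package marker start empty; my_agent.py gets your agent code.
    (project / "my_agent.py").touch()
    (project / "__init__.py").touch()
    # Dependencies are installed into the runtime container at deploy time.
    (project / "requirements.txt").write_text("strands-agents\nbedrock-agentcore\n")
    return project

if __name__ == "__main__":
    created = scaffold_agent_project("my_agent_project")
    print(sorted(p.name for p in created.iterdir()))
```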

&lt;h2&gt;
  
  
  Step 3 – Configure Your Agent
&lt;/h2&gt;

&lt;p&gt;Before deploying, you need to tell the Starter Toolkit which IAM role your agent should use when running in production. This role must have the necessary AgentCore Runtime permissions (see &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-permissions.html" rel="noopener noreferrer"&gt;Permissions for AgentCore Runtime&lt;/a&gt;).&lt;br&gt;
Run the &lt;strong&gt;agentcore configure&lt;/strong&gt; command shown in the next step, replacing &lt;strong&gt;YOUR_IAM_ROLE_ARN&lt;/strong&gt; with the ARN of your IAM role.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4 – Deploy Your Agent
&lt;/h2&gt;

&lt;p&gt;Now that your agent is configured, it’s time to deploy it into production using AgentCore.&lt;/p&gt;

&lt;p&gt;Run the following command, replacing &lt;strong&gt;YOUR_IAM_ROLE_ARN&lt;/strong&gt; with your IAM role ARN:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore configure --entrypoint my_agent.py -er &amp;lt;YOUR_IAM_ROLE_ARN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a &lt;strong&gt;Dockerfile&lt;/strong&gt; and a &lt;strong&gt;.dockerignore&lt;/strong&gt; file for containerizing your agent&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;.bedrock_agentcore.yaml&lt;/strong&gt; configuration file with your agent’s runtime settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While configuring your agent, you’ll be prompted to provide the URI of the Amazon ECR repository where the Docker image will be uploaded. You can either create this repository yourself in the AWS Console and enter its URI, or simply press Enter to have AgentCore create one for you automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmqoducth60d0wtloww1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmqoducth60d0wtloww1.png" alt="ECR repository" width="800" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will also be prompted to confirm your dependencies; press Enter to let AgentCore use requirements.txt. For authorization, accept the default (no) to keep IAM-based authorization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra9lafpsu9soxv9o3ku3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra9lafpsu9soxv9o3ku3.png" alt="agentCore configuration" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After completing the prompts, you’ll see a &lt;strong&gt;configuration summary&lt;/strong&gt; showing your agent name, AWS region, account ID, execution role, ECR repository, and authorization method. The configuration is then saved locally in a &lt;strong&gt;.bedrock_agentcore.yaml&lt;/strong&gt; file for use during deployment.&lt;/p&gt;
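For orientation, the generated file looks roughly like the sketch below. The exact keys can vary by toolkit version, and every value here (agent name, account ID, region, ECR URI) is an illustrative placeholder, not output from a real deployment:

```yaml
# Illustrative sketch of .bedrock_agentcore.yaml -- actual keys may differ by version
default_agent: my_agent
agents:
  my_agent:
    name: my_agent
    entrypoint: my_agent.py
    aws:
      region: eu-central-1
      execution_role: arn:aws:iam::123456789012:role/AgentCoreRuntimeRole
      ecr_repository: 123456789012.dkr.ecr.eu-central-1.amazonaws.com/my-agent
```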

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0p91pdvirgctki9tyk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0p91pdvirgctki9tyk1.png" alt="agentCore configuration" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you’re ready to launch your agent in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 – Launch Your Agent
&lt;/h2&gt;

&lt;p&gt;With your configuration complete, you can now deploy your agent to AWS with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agentcore launch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; a Docker image containing your agent code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; the image to Amazon ECR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Bedrock AgentCore runtime&lt;/strong&gt; in your AWS account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy&lt;/strong&gt; your agent to the cloud so it’s ready for production use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnrt4pr9667kk02casbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnrt4pr9667kk02casbe.png" alt="agentCore-deployed" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once complete, you’ll have a production-ready agent running on Amazon Bedrock AgentCore, fully integrated with your chosen framework and secured by AWS IAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6 – Invoke the Agent
&lt;/h2&gt;

&lt;p&gt;To test our deployed agent, we’ll create a new file named &lt;strong&gt;test.py&lt;/strong&gt; in the same folder as our project and run the invocation from there.&lt;br&gt;
This script sends a natural-language prompt to the agent and processes the streamed response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import json

# Initialize the Bedrock AgentCore client in the same region as your agent
agentcore_client = boto3.client('bedrock-agentcore', region_name='eu-central-1')

# Your Agent Runtime ARN (from the deployment step)
# You can find this in the Bedrock console under your agent’s runtime details,
# or in the deployment confirmation message.
AGENT_RUNTIME = "YOUR_AGENT_RUNTIME_ARN"

# Prompt to send to the agent
PROMPT = "Please explain how I can become a professional football player."

# Invoke the agent
boto3_response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=AGENT_RUNTIME,
    qualifier="DEFAULT",
    payload=json.dumps({"prompt": PROMPT})
)

# The response is streamed in chunks; read them all into memory
response_body = boto3_response['response']
all_chunks = [chunk for chunk in response_body]

# Combine chunks into one string
complete_response = b''.join(all_chunks).decode('utf-8')

# Attempt to parse JSON output
try:
    response_json = json.loads(complete_response)
    print(response_json)
except json.JSONDecodeError:
    print("Raw response:")
    print(complete_response)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;boto3.client('bedrock-agentcore')&lt;/strong&gt; – Creates a client to communicate with the AgentCore Runtime service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;invoke_agent_runtime()&lt;/strong&gt; – Sends the prompt to the agent and streams back the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StreamingBody reading&lt;/strong&gt; – The output is returned in small chunks, which we merge before decoding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON parsing&lt;/strong&gt; – If the response is in JSON format, we parse it; otherwise, we display the raw text.&lt;/li&gt;
&lt;/ul&gt;
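The chunk-assembly and JSON-fallback logic does not depend on boto3, so you can sanity-check it in isolation. A small sketch (the helper name assemble_response is ours, not part of any SDK) that mimics a StreamingBody with a plain list of byte chunks:

```python
import json

def assemble_response(chunks):
    """Join streamed byte chunks, then try JSON before falling back to raw text."""
    complete = b"".join(chunks).decode("utf-8")
    try:
        return json.loads(complete)
    except json.JSONDecodeError:
        return complete  # not JSON: hand back the raw text unchanged

# A JSON payload split across chunks parses back into a Python object:
print(assemble_response([b'{"answer": ', b'"Practice daily."}']))  # {'answer': 'Practice daily.'}
# Plain text is returned as-is:
print(assemble_response([b"plain ", b"text"]))  # plain text
```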

&lt;p&gt;Save the file as test.py in your project folder, then run it from your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python test.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the agent’s JSON response (or raw output) printed in the terminal.&lt;/p&gt;

&lt;p&gt;This approach ensures you receive the complete, assembled agent output, whether it’s plain text or structured JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock AgentCore bridges the gap between impressive agent demos and real-world business impact. By following the steps in this guide, you can go from idea to production-ready agent quickly, while leveraging AWS’s scalability, reliability, and security. The sooner your agent moves into production, the sooner it can start delivering measurable value to your business.&lt;/p&gt;

&lt;p&gt;Whether you’re experimenting with a single-agent workflow or orchestrating multi-agent systems, AgentCore gives you the tools to operationalize your ideas with confidence. Now it’s your turn—deploy your agent, test it, and see how it performs in the real world.&lt;/p&gt;

</description>
      <category>agentcore</category>
      <category>amazonbedrock</category>
      <category>amazonwebservices</category>
      <category>deployagents</category>
    </item>
    <item>
      <title>Understanding Amazon Bedrock Pricing: From On-Demand to Fine-Tuning</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Mon, 05 May 2025 10:05:48 +0000</pubDate>
      <link>https://dev.to/aws-builders/understanding-amazon-bedrock-pricing-from-on-demand-to-fine-tuning-316d</link>
      <guid>https://dev.to/aws-builders/understanding-amazon-bedrock-pricing-from-on-demand-to-fine-tuning-316d</guid>
      <description>&lt;p&gt;As generative AI continues to revolutionize industries, Amazon Bedrock emerges as a pivotal platform, providing seamless access to a plethora of foundation models (FMs) from leading AI providers such as Anthropic, Meta, Mistral AI, and Amazon itself. Its serverless architecture and unified API simplify the deployment of AI applications. However, understanding its pricing nuances is crucial for optimizing both performance and cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Inference
&lt;/h2&gt;

&lt;p&gt;When utilizing foundation models (FMs) in Amazon Bedrock for inference, there are two primary approaches: &lt;strong&gt;On-Demand&lt;/strong&gt; and &lt;strong&gt;Provisioned Throughput&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  On-Demand
&lt;/h3&gt;

&lt;p&gt;In the On-Demand model, Amazon Bedrock operates on a pay-as-you-go basis, making it ideal for scenarios where usage patterns are unpredictable. For instance, if you're launching a new LLM application without a clear forecast of user engagement, this model offers flexibility without long-term commitments. Each foundation model (FM) available through Bedrock has its own pricing structure based on token usage. When the model is invoked, Bedrock calculates the number of input and output tokens processed and multiplies these by the respective per-token rates defined for that model. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt caching
&lt;/h3&gt;

&lt;p&gt;In addition to standard token pricing, Amazon Bedrock also offers a &lt;strong&gt;prompt caching&lt;/strong&gt; feature. This allows repeated prompts within a short window to be served from cache, reducing both latency and cost—especially useful when parts of your input remain the same across multiple requests.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the current pricing for Amazon Nova Micro on the Bedrock pricing page. (Note: Pricing is subject to change, so it’s always a good idea to refer to the &lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;official AWS Bedrock pricing page&lt;/a&gt; for the latest rates.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhdhzl6he6epft8oa8w0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhdhzl6he6epft8oa8w0.png" alt="Amazon Nova pricing" width="800" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;Amazon Nova Micro&lt;/strong&gt;—a lightweight text generation model—charges &lt;strong&gt;$0.000035 per 1,000 input tokens and $0.00014 per 1,000 output tokens&lt;/strong&gt; when used in &lt;strong&gt;on-demand mode&lt;/strong&gt;. If a portion of your prompt is cached, the cached input tokens are charged at a reduced rate of &lt;strong&gt;$0.00000875 per 1,000&lt;/strong&gt;, offering substantial savings for repeated instructions or context. When running &lt;strong&gt;batch inference&lt;/strong&gt;, input and output costs drop even further to &lt;strong&gt;$0.0000175&lt;/strong&gt; and &lt;strong&gt;$0.00007 per 1,000 tokens&lt;/strong&gt;, respectively—making it a cost-efficient choice for large-scale jobs. While these prices seem small, they can quickly add up when you’re processing thousands of requests per day.&lt;/p&gt;
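Using the rates quoted above, a quick back-of-the-envelope script makes the trade-offs concrete (the numbers are hard-coded from this example; always check the official pricing page before relying on them):

```python
# Nova Micro example rates in USD per 1,000 tokens, copied from the text above.
RATES = {
    "on_demand": {"input": 0.000035, "output": 0.00014},
    "batch":     {"input": 0.0000175, "output": 0.00007},
}
CACHED_INPUT = 0.00000875  # cached input tokens, per 1,000

def monthly_cost(mode, input_tokens, output_tokens):
    """Token cost in USD for one pricing mode."""
    r = RATES[mode]
    return (input_tokens / 1000) * r["input"] + (output_tokens / 1000) * r["output"]

# 10M input + 2M output tokens per month:
print(round(monthly_cost("on_demand", 10_000_000, 2_000_000), 4))  # 0.63
print(round(monthly_cost("batch", 10_000_000, 2_000_000), 4))      # 0.315

# Caching 8M of the 10M input tokens cuts the input cost further:
saving = (RATES["on_demand"]["input"] - CACHED_INPUT) * 8_000_000 / 1000
print(round(saving, 4))  # 0.21
```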

&lt;p&gt;In addition to text-based models, Amazon Bedrock includes support for image and video generation, with pricing based on output type and quality. For example, generating images through Amazon Nova Canvas or Stability AI models ranges from a few cents depending on resolution and quality level—higher resolutions or premium outputs cost more. &lt;/p&gt;

&lt;h3&gt;
  
  
  Batch processing - potential to reduce inference costs
&lt;/h3&gt;

&lt;p&gt;If you plan to handle a high volume of prompts or images in one scheduled run, batch inference can reduce the cost per token or per image. Say you have 1,000 customer support transcripts to summarize. Instead of sending each document individually, which is both time-consuming and more expensive, you can use batch inference to process them all at once, with each document treated as a separate prompt within a single batch job. The main advantage is the reduced per-token cost compared to on-demand inference, which makes batch processing ideal for scheduled or background jobs that don’t require real-time output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provisioned Throughput
&lt;/h2&gt;

&lt;p&gt;For applications that require consistent, high-performance inference—especially in production environments—&lt;strong&gt;Provisioned Throughput&lt;/strong&gt; is a valuable option. Unlike the on-demand model where you pay per token, Provisioned Throughput reserves dedicated capacity for your chosen foundation model, ensuring low latency and predictable response times. You are billed for the reserved model units on an hourly basis, regardless of how much you use them, which makes this approach ideal for steady, high-volume workloads. Bedrock also offers discounts based on commitment: the longer the reservation term (e.g., 1-month or 6-month plans), the lower the hourly rate.&lt;/p&gt;
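To decide between the two modes, a break-even estimate helps. Both rates below are purely illustrative placeholders (not AWS prices); plug in the real hourly rate and blended token rate for your model and region:

```python
# Break-even sketch: at what monthly token volume does reserved capacity beat
# on-demand? Both rates below are illustrative placeholders, not AWS prices.
HOURLY_PROVISIONED = 20.0     # hypothetical $/hour for one provisioned model unit
HOURS_PER_MONTH = 730
ON_DEMAND_PER_1K = 0.000175   # hypothetical blended $/1,000 tokens

provisioned_monthly = HOURLY_PROVISIONED * HOURS_PER_MONTH
break_even_tokens = provisioned_monthly / ON_DEMAND_PER_1K * 1000

print(f"Provisioned capacity: ${provisioned_monthly:,.0f}/month")
print(f"Break-even volume: {break_even_tokens:,.0f} tokens/month")
```

Below the break-even volume, on-demand is cheaper; above it, the flat hourly reservation wins, and commitment discounts move the threshold lower still.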

&lt;h2&gt;
  
  
  Which Pricing Model Should You Choose?
&lt;/h2&gt;

&lt;p&gt;If you're just starting out or expect fluctuating usage, &lt;strong&gt;On-Demand&lt;/strong&gt; gives you the flexibility to pay only for what you use—perfect for development, experimentation, or unpredictable traffic. If you’re processing large volumes of requests in scheduled jobs, &lt;strong&gt;Batch Inference&lt;/strong&gt; offers the same flexibility with better cost-efficiency. For steady, production-level workloads that demand consistent performance and low latency, &lt;strong&gt;Provisioned Throughput&lt;/strong&gt; is the most reliable choice, especially when combined with long-term commitments for additional savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning: Customizing Models
&lt;/h2&gt;

&lt;p&gt;When you want to fine-tune a model in Amazon Bedrock, the cost structure differs from standard inference and comes with a few additional components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Cost:&lt;/strong&gt; For text models, you’re charged per 1,000 tokens processed during training. For image or multimodal models, pricing is typically based on the number of images used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage Fee:&lt;/strong&gt; After fine-tuning, the custom model is stored in your account, and a monthly storage fee applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inference Cost:&lt;/strong&gt; You can’t run fine-tuned models in on-demand mode. Instead, you must use &lt;strong&gt;Provisioned Throughput&lt;/strong&gt;, which is billed hourly—even if the model isn’t actively being used.&lt;/p&gt;

&lt;p&gt;For example, let’s consider fine-tuning Amazon Nova Micro using a small dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua9181lbciy76nytv3dn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua9181lbciy76nytv3dn.png" alt="Amazon Nova model fine-tuning pricing" width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pricing for model customization (fine-tuning)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let’s say you’re fine-tuning a model with 100,000 tokens (about 75,000 words, or 150+ pages of content). That’s still on the small side for deep fine-tuning, but it’s a realistic starting point.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Training Cost (One-time)&lt;/strong&gt; You’re charged based on the number of tokens processed during training. → Example: 100,000 tokens × $0.001 per 1,000 tokens = &lt;strong&gt;$0.10 (one-time)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Storage (Monthly)&lt;/strong&gt; Once the model is fine-tuned, storing it incurs a fixed monthly cost. → Example: &lt;strong&gt;$1.95 per month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provisioned Throughput for Inference (Hourly)&lt;/strong&gt; Fine-tuned models must use provisioned throughput—you pay even if no requests are made. → Example: &lt;strong&gt;$108.15 per hour&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
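Putting the three example figures together shows where the money actually goes. A sketch using only the example rates quoted above:

```python
# First-month cost for the fine-tuning example above, using the example rates
# quoted in the text (training per-token rate, storage fee, provisioned hourly rate).
training_tokens = 100_000
training_rate_per_1k = 0.001   # $ per 1,000 training tokens
storage_per_month = 1.95       # $ per month for the stored custom model
provisioned_per_hour = 108.15  # $ per hour while capacity is reserved

training_cost = (training_tokens / 1000) * training_rate_per_1k  # one-time
one_day_inference = provisioned_per_hour * 24                    # a single day of capacity

print(f"Training (one-time): ${training_cost:.2f}")     # $0.10
print(f"Storage (monthly):   ${storage_per_month:.2f}")
print(f"One day provisioned: ${one_day_inference:.2f}") # $2595.60
```

Even one day of provisioned capacity dwarfs the training and storage fees, which is why the hourly throughput commitment deserves the most scrutiny when budgeting for a fine-tuned model.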

&lt;h2&gt;
  
  
  Special Case Pricing: What You Should Know
&lt;/h2&gt;

&lt;p&gt;When exploring Amazon Bedrock's pricing structure, it's essential to be aware of certain exceptional costs that can significantly impact your overall expenditure. Beyond the standard charges for on-demand usage and provisioned throughput, there are additional fees associated with model customization. For instance, fine-tuning a model on your proprietary data incurs costs based on the number of tokens processed during training. Moreover, once a model is fine-tuned, storing it attracts a monthly storage fee. These costs are separate from the inference charges and can accumulate over time, especially if multiple custom models are maintained.&lt;/p&gt;

&lt;p&gt;Another area to consider is the inference of fine-tuned models. Unlike base models that can be used on-demand, fine-tuned models require provisioned throughput, meaning you need to reserve dedicated capacity, which is billed hourly regardless of usage. This can lead to higher costs, particularly if the reserved capacity isn't fully utilized. Additionally, importing models trained outside of Bedrock may involve compatibility evaluations and associated fees. It's crucial to factor in these exceptional costs when planning your AI infrastructure to avoid unexpected charges.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock offers a flexible and modular pricing structure that adapts to various use cases—from quick experiments to production-grade AI applications. Whether you're using foundation models as-is or customizing them through fine-tuning, understanding the cost breakdown is crucial to optimizing both performance and spend. With the right usage pattern, you can scale your AI applications efficiently without surprises in your billing dashboard.&lt;/p&gt;

</description>
      <category>amazonbedrock</category>
      <category>inferencepricing</category>
      <category>foundationmodels</category>
      <category>bedrockexplained</category>
    </item>
    <item>
      <title>Unifying or Separating Endpoints in Generative AI Applications on AWS</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Wed, 27 Nov 2024 21:19:31 +0000</pubDate>
      <link>https://dev.to/aws-builders/unifying-or-separating-endpoints-in-generative-ai-applications-on-aws-g2g</link>
      <guid>https://dev.to/aws-builders/unifying-or-separating-endpoints-in-generative-ai-applications-on-aws-g2g</guid>
      <description>&lt;p&gt;When building generative AI applications on AWS, one critical decision is how to manage multiple components. For example, you might have a &lt;strong&gt;retrieval-augmented generation (RAG)&lt;/strong&gt; pipeline for context and a &lt;strong&gt;fine-tuned model&lt;/strong&gt; for specific tasks. Should these components share a single endpoint, or should you give each one its own? Both approaches have their pros and cons, and the right choice depends on your use case.&lt;/p&gt;

&lt;p&gt;In this article, I’ll break down the &lt;strong&gt;unified endpoint&lt;/strong&gt; vs. &lt;strong&gt;separated endpoint&lt;/strong&gt; designs, so you can make an informed decision for your architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unified Endpoint Approach
&lt;/h2&gt;

&lt;p&gt;With a unified endpoint, you deploy a single API Gateway and route requests to the appropriate model based on paths, methods, or query parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7eun0o97kp9itcwloz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy7eun0o97kp9itcwloz.png" alt="The Unified Endpoint Approach" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a single API Gateway, like &lt;a href="https://api.example.com" rel="noopener noreferrer"&gt;https://api.example.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Backend logic (usually a Lambda function) handles routing. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POST /rag routes traffic to the RAG pipeline.&lt;/li&gt;
&lt;li&gt;POST /fine-tuned invokes the fine-tuned model.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
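As a sketch of that routing layer, a minimal Lambda-style handler for the two paths above could look like this. The event shape follows the standard API Gateway proxy format, and invoke_rag / invoke_fine_tuned are placeholders for your real pipeline calls:

```python
import json

def invoke_rag(body):         # placeholder for the real RAG pipeline call
    return {"pipeline": "rag", "echo": body}

def invoke_fine_tuned(body):  # placeholder for the real fine-tuned model call
    return {"pipeline": "fine-tuned", "echo": body}

ROUTES = {
    ("POST", "/rag"): invoke_rag,
    ("POST", "/fine-tuned"): invoke_fine_tuned,
}

def handler(event, context=None):
    """Route an API Gateway proxy event to the matching backend."""
    key = (event.get("httpMethod"), event.get("path"))
    target = ROUTES.get(key)
    if target is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown route"})}
    result = target(json.loads(event.get("body") or "{}"))
    return {"statusCode": 200, "body": json.dumps(result)}

# POST /rag is dispatched to the RAG placeholder:
print(handler({"httpMethod": "POST", "path": "/rag", "body": '{"q": "hi"}'}))
```

Adding a new model is then just one more entry in the route table, which is the flexibility advantage described below.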

&lt;h2&gt;
  
  
  Why Choose Unified?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cost-Effective:&lt;/strong&gt; Operating one gateway is cheaper than managing multiple.&lt;br&gt;
&lt;strong&gt;Simplified Integration:&lt;/strong&gt; Clients use one URL for all requests, reducing complexity.&lt;br&gt;
&lt;strong&gt;Flexible:&lt;/strong&gt; Adding new routes for additional models or services is straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Potential Drawbacks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Routing Overhead:&lt;/strong&gt; You need backend logic to manage and direct requests.&lt;br&gt;
&lt;strong&gt;Shared Bottlenecks:&lt;/strong&gt; High traffic to one pipeline might impact the other unless autoscaling is configured carefully.&lt;br&gt;
Unified endpoints are great for early-stage projects or MVPs where simplicity and cost savings matter most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Separated Endpoint Approach
&lt;/h2&gt;

&lt;p&gt;In a separated design, each model gets its own API Gateway. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://rag.example.com" rel="noopener noreferrer"&gt;https://rag.example.com&lt;/a&gt; for the RAG pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fine-tuned.example.com" rel="noopener noreferrer"&gt;https://fine-tuned.example.com&lt;/a&gt; for the fine-tuned model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qmehr3hh94m8m2j3yrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qmehr3hh94m8m2j3yrr.png" alt="The Separated Endpoint Approach" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Choose Separated?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Each gateway can scale independently, ensuring reliable performance.&lt;br&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Issues in one model don’t affect the other.&lt;br&gt;
&lt;strong&gt;No Routing Logic:&lt;/strong&gt; Each gateway directly connects to its respective model, simplifying backend code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Higher Costs:&lt;/strong&gt; Operating multiple gateways adds to your AWS bill.&lt;br&gt;
&lt;strong&gt;More Complex Integration:&lt;/strong&gt; Clients need to manage multiple URLs, which can complicate development.&lt;/p&gt;

&lt;p&gt;Separated endpoints are ideal for production systems with high traffic or strict performance requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which Approach Is Right for You?
&lt;/h2&gt;

&lt;p&gt;It depends on your application’s stage and requirements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Unified Endpoints If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re in the early stages or building an MVP.&lt;/li&gt;
&lt;li&gt;Traffic for both models is predictable and not too high.&lt;/li&gt;
&lt;li&gt;Cost savings and simplicity are top priorities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Separated Endpoints If:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your application handles high traffic or requires independent scaling.&lt;/li&gt;
&lt;li&gt;Reliability and modularity are critical.&lt;/li&gt;
&lt;li&gt;You’re running a production-grade system with strict SLAs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A Hybrid Approach?
&lt;/h2&gt;

&lt;p&gt;In many cases, starting with a unified endpoint and transitioning to separated endpoints as your app scales can be the best option. This approach lets you balance simplicity and cost in the beginning with scalability and performance later on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Architecting generative AI applications on AWS involves trade-offs, and there’s no one-size-fits-all solution. Unified endpoints keep things simple and cost-effective for small or early-stage projects, while separated endpoints shine in production systems with demanding workloads.&lt;/p&gt;

&lt;p&gt;If you’re just starting out, consider trying a unified endpoint and evolving your architecture as needed. AWS services like API Gateway and Lambda give you the flexibility to adapt and scale your design over time.&lt;/p&gt;

&lt;p&gt;What’s your preference—unified or separated endpoints? Let’s discuss in the comments below!&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>fmops</category>
      <category>machinelearning</category>
      <category>amazonwebservices</category>
    </item>
    <item>
      <title>Fine-Tuning and Deploying Custom AI Models on Amazon Bedrock: A Practical Guide</title>
      <dc:creator>Milad Rezaeighale</dc:creator>
      <pubDate>Mon, 25 Nov 2024 12:32:14 +0000</pubDate>
      <link>https://dev.to/aws-builders/fine-tuning-and-deploying-custom-ai-models-on-amazon-bedrock-a-practical-guide-39m6</link>
      <guid>https://dev.to/aws-builders/fine-tuning-and-deploying-custom-ai-models-on-amazon-bedrock-a-practical-guide-39m6</guid>
      <description>&lt;p&gt;In the rapidly evolving field of Generative AI, the ability to fine-tune and deploy custom models is a crucial skill that enables businesses to tailor solutions to their unique needs. &lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock&lt;/a&gt;, a powerful service within the Amazon Web Services (AWS) ecosystem, simplifies this process by offering a robust platform for building, fine-tuning, and deploying large language models (LLMs). Whether you’re looking to enhance a model's performance for a specific task or deploy it at scale, Amazon Bedrock provides the tools and infrastructure to do so efficiently.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock provides a seamless environment for fine-tuning and deploying these models, simplifying what can often be a complex process. If you're new to the concept of fine-tuning or want to delve deeper into its mechanics, I highly recommend &lt;a href="https://towardsdatascience.com/stepping-out-of-the-comfort-zone-through-domain-adaptation-a-deep-dive-into-dynamic-prompting-4860c6d16224" rel="noopener noreferrer"&gt;A Deep Dive into Fine-Tuning&lt;/a&gt; which offers an excellent explanation.&lt;/p&gt;

&lt;p&gt;In this article, I will guide you through the process of fine-tuning a language model using Amazon Bedrock. We'll focus on the most critical sections of the code, providing a clear understanding of the key components and steps involved in the fine-tuning process. The goal is to highlight the essential elements so you can grasp how the general workflow is implemented, without diving into every line of code.&lt;/p&gt;

&lt;p&gt;For those who want to dive directly into the code or explore it further, the complete implementation is available in &lt;a href="https://github.com/miladrezaei-ai/bedrock-custom-model-finetuning" rel="noopener noreferrer"&gt;my GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case: Summarizing Doctor-Patient Dialogues
&lt;/h2&gt;

&lt;p&gt;For this example, we'll focus on a dataset containing doctor-patient dialogues sourced from the &lt;a href="https://github.com/microsoft/clinical_visit_note_summarization_corpus" rel="noopener noreferrer"&gt;ACI-Bench dataset&lt;/a&gt;. Our task is to train the model to summarize these dialogues into structured clinical notes. The foundation model selected for this fine-tuning is &lt;strong&gt;Cohere's command-light-text-v14&lt;/strong&gt;, which excels at generating concise and coherent text summaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;: In this walkthrough, we will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up the necessary AWS resources.&lt;/li&gt;
&lt;li&gt;Prepare and upload the fine-tuning dataset to S3.&lt;/li&gt;
&lt;li&gt;Create and submit a fine-tuning job.&lt;/li&gt;
&lt;li&gt;Purchase provisioned throughput.&lt;/li&gt;
&lt;li&gt;Test our fine-tuned model.&lt;/li&gt;
&lt;li&gt;Clean up.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Set up the necessary AWS resources
&lt;/h2&gt;

&lt;p&gt;Before we begin, we need to ensure we have the necessary AWS SDK installed and configured. We'll use &lt;a href="https://aws.amazon.com/sdk-for-python/" rel="noopener noreferrer"&gt;boto3&lt;/a&gt;, the AWS SDK for Python, to interact with various AWS services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import json
import os
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Prepare and upload the fine-tuning dataset to S3
&lt;/h2&gt;

&lt;p&gt;In this step, we prepare the dataset by formatting it into the &lt;a href="https://jsonlines.org/" rel="noopener noreferrer"&gt;JSON Lines (JSONL)&lt;/a&gt; structure required for fine-tuning on Amazon Bedrock. Each line in the JSONL file must include a &lt;strong&gt;Prompt&lt;/strong&gt; and a &lt;strong&gt;Completion&lt;/strong&gt; field.&lt;br&gt;
&lt;/p&gt;
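&lt;p&gt;The conversion loop below iterates over a pandas DataFrame named &lt;code&gt;train_dataset&lt;/code&gt;. To make the formatting concrete, here is a self-contained illustration of the same prompt/completion shaping applied to a single invented stand-in row (the real DataFrame comes from the ACI-Bench files):&lt;/p&gt;

```python
import json
import pandas as pd

# Invented single-row stand-in for the ACI-Bench training split;
# the real dataset has 'dialogue' and 'note' columns as used below
train_dataset = pd.DataFrame({
    "dialogue": ["[doctor] How are you?\n[patient] Dizzy when standing up."],
    "note": ["Patient reports orthostatic dizziness."],
})

# Same shaping logic as the conversion loop, applied to one row
row = train_dataset.iloc[0]
entry = {
    "completion": row["note"],
    "prompt": f"Summarize the following conversation.\n\n{row['dialogue']}",
}
print(json.dumps(entry))
```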

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define output path for JSONL
output_file_name = 'clinical_notes_fine_tune.jsonl'
output_file_path = os.path.join('dataset', output_file_name)
output_dir = os.path.dirname(output_file_path)

# Prepare and save the dataset in the fine-tuning JSONL format
with open(output_file_path, 'w') as outfile:
    for _, row in train_dataset.iterrows():
        formatted_entry = {
            "completion": row['note'],  # Replace 'note' with the correct column name
            "prompt": f"Summarize the following conversation.\n\n{row['dialogue']}"  # Replace 'dialogue' as needed
        }
        json.dump(formatted_entry, outfile)
        outfile.write('\n')
    print(f"Dataset has been reformatted and saved to {output_file_path}.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each record in the resulting JSONL file has the following shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "completion": "&amp;lt;Summarized clinical note&amp;gt;",
    "prompt": "Summarize the following conversation:\n\n&amp;lt;Doctor-patient dialogue&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
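&lt;p&gt;Before uploading, it is worth sanity-checking that every line of the file parses as JSON and carries the two fields Bedrock expects. This small helper is not part of the original code, just a suggested validation step:&lt;/p&gt;

```python
import json

def validate_jsonl(path):
    """Ensure every line is valid JSON with 'prompt' and 'completion' fields."""
    with open(path) as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)  # raises on malformed JSON
            missing = {"prompt", "completion"} - record.keys()
            if missing:
                raise ValueError(f"line {i} is missing fields: {missing}")
    return True
```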



&lt;p&gt;To make the dataset accessible for fine-tuning, it needs to be uploaded to an Amazon S3 bucket. The code ensures that the S3 bucket exists, creating it if necessary. Once the bucket is verified, the fine-tuning dataset, saved in JSON Lines format, is uploaded to the specified bucket. This step is essential, as Amazon Bedrock accesses the dataset from S3 during the fine-tuning process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define the file path and S3 details
bucket_name = 'bedrock-finetuning-bucket25112024'
s3_key = output_file_name  # Use the JSONL file name from the previous step as the object key

# Specify the region
region = 'us-east-1'  # Change this if needed

# Initialize S3 client with the specified region
s3_client = boto3.client('s3', region_name=region)

# Check if the bucket exists
try:
    existing_buckets = s3_client.list_buckets()
    bucket_exists = any(bucket['Name'] == bucket_name for bucket in existing_buckets['Buckets'])

    if not bucket_exists:
        # Create the bucket based on the region
        try:
            if region == 'us-east-1':
                # For us-east-1, do not specify LocationConstraint
                s3_client.create_bucket(Bucket=bucket_name)
                print(f"Bucket {bucket_name} created successfully in us-east-1.")
            else:
                # For other regions, specify the LocationConstraint
                s3_client.create_bucket(
                    Bucket=bucket_name,
                    CreateBucketConfiguration={'LocationConstraint': region}
                )
                print(f"Bucket {bucket_name} created successfully in {region}.")
        except Exception as e:
            print(f"Error creating bucket: {e}")
            raise e
    else:
        print(f"Bucket {bucket_name} already exists.")

    # Upload the file to S3
    try:
        s3_client.upload_file(output_file_path, bucket_name, s3_key)
        print(f"File uploaded to s3://{bucket_name}/{s3_key}")
    except Exception as e:
        print(f"Error uploading to S3: {e}")

except Exception as e:
    print(f"Error: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Create and submit a fine-tuning job
&lt;/h2&gt;

&lt;p&gt;With the dataset uploaded to Amazon S3 and the necessary resources in place, the next step is to create and submit the fine-tuning job. This involves specifying the pre-trained foundation model, the job details, and the fine-tuning parameters.&lt;/p&gt;

&lt;p&gt;In this example, we fine-tune the &lt;strong&gt;Cohere command-light-text-v14&lt;/strong&gt; model to summarize medical conversations. Below is the configuration used to submit the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define the job parameters
base_model_id = "cohere.command-light-text-v14:7:4k"
job_name = "cohere-Summarizer-medical-finetuning-job-v1"
model_name = "cohere-Summarizer-medical-Tuned-v1"

# Submit the fine-tuning job
bedrock.create_model_customization_job(
    customizationType="FINE_TUNING",
    jobName=job_name,
    customModelName=model_name,
    roleArn=role_arn,
    baseModelIdentifier=base_model_id,
    hyperParameters={
        "epochCount": "3",  # Number of passes over the dataset
        "batchSize": "16",  # Number of samples per training step
        "learningRate": "0.00005",  # Learning rate for weight updates
    },
    trainingDataConfig={"s3Uri": f"s3://{bucket_name}/{s3_key}"},
    outputDataConfig={"s3Uri": f"s3://{bucket_name}/finetuned/"}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Parameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Model:&lt;/strong&gt; The pre-trained model (cohere.command-light-text-v14) serves as the foundation for customization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job Name and Model Name:&lt;/strong&gt; These identifiers help track the fine-tuning job and the resulting fine-tuned model for future deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hyperparameters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;epochCount:&lt;/strong&gt; Specifies the number of training cycles. For demonstration, three epochs are used, but more epochs may yield better results for larger datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;batchSize:&lt;/strong&gt; Determines how many samples are processed in each training step. A value of 16 balances memory usage and training efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;learningRate:&lt;/strong&gt; Sets the pace at which the model learns. Lower values ensure stable training but may require more time to converge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Training and Output Configuration:&lt;/strong&gt; The trainingDataConfig points to the S3 location of the dataset, and the outputDataConfig specifies where the fine-tuned model artifacts will be stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Considerations:&lt;/strong&gt;&lt;br&gt;
The parameters, especially the hyperparameters, can be adjusted to optimize the fine-tuning process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smaller datasets&lt;/strong&gt; may benefit from lower batchSize values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex tasks&lt;/strong&gt; may require more epochs to achieve convergence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning rates&lt;/strong&gt; should be fine-tuned to balance training stability and speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step officially kicks off the fine-tuning process, allowing Amazon Bedrock to handle the heavy lifting of training your model with the provided data and configuration.&lt;/p&gt;

&lt;p&gt;The status of the fine-tuning job can also be checked programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;status = bedrock.get_model_customization_job(jobIdentifier="cohere-Summarizer-medical-finetuning-job-v1")["status"]
print(f"Job status: {status}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
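&lt;p&gt;Rather than re-running that check by hand, a simple polling loop (a sketch; per the Bedrock API, customization jobs terminate in Completed, Failed, or Stopped) can wait for a terminal state:&lt;/p&gt;

```python
import time

def wait_for_job(bedrock_client, job_name, interval=60):
    """Poll a model customization job until it reaches a terminal state."""
    while True:
        status = bedrock_client.get_model_customization_job(
            jobIdentifier=job_name
        )["status"]
        print(f"Job status: {status}")
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(interval)  # avoid hammering the API while the job runs
```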



&lt;p&gt;The status of the fine-tuning job can also be monitored in the Bedrock console: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xnx5kkuac1f80rdjk6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xnx5kkuac1f80rdjk6g.png" alt="Training job in custom model - Amazon Bedrock" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Purchase provisioned throughput
&lt;/h2&gt;

&lt;p&gt;To use the model for inference, you need to purchase "Provisioned Throughput." In the Amazon Bedrock sidebar of the AWS console, go to "Custom Models," choose the "Models" tab, select the model you have trained, and then click "Purchase Provisioned Throughput."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9ogalt3wi2vwotec329.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9ogalt3wi2vwotec329.png" alt="Purchase provisioned throughput" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give the provisioned throughput a name, select a commitment term (you can choose "No Commitment" for testing), and then click "Purchase Provisioned Throughput." You will be able to see the estimated price as well. Once this is set up, you'll be able to use the model for inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jhrcj1yxqyhrmqdfbz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jhrcj1yxqyhrmqdfbz3.png" alt="Commitment in Amazon Bedrock" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To access your deployed model's endpoint, you'll need its ARN. Go to the "Provisioned Throughput" section under Inference in the sidebar. Select the name of your fine-tuned model, and on the new page, copy the ARN for use in the next step. Keep in mind that provisioning throughput may take a few minutes to complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ush6m03e1x9hpdf4et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ush6m03e1x9hpdf4et.png" alt="Custom model's ARN" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Test our fine-tuned model
&lt;/h2&gt;

&lt;p&gt;In the next step, we will make a request to the model for inference. Be sure to replace YOUR_MODEL_ARN with the ARN you copied earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize Bedrock runtime client
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=region)  # same region as the earlier steps

# Define a prompt for model inference, prefixed with the same instruction used during training
prompt = """Summarize the following conversation.

[doctor] Good morning, Mr. Smith. How have you been feeling since your last visit?  
[patient] Good morning, doctor. I've been okay overall, but I’ve been struggling with persistent fatigue and some dizziness.  
[doctor] I see. Is the dizziness occurring frequently or only under specific circumstances?  
[patient] It’s mostly when I stand up quickly or after I've been walking for a while.  
[doctor] Have you noticed any changes in your heart rate or shortness of breath during these episodes?  
[patient] No shortness of breath, but I do feel my heart racing sometimes.  

[doctor] How about your medications? Are you taking them as prescribed?  
[patient] Yes, but I missed a few doses of my beta-blocker last week due to travel.  
[doctor] That could explain some of the symptoms. I’ll need to check your blood pressure and do an EKG to assess your heart rhythm.  
[patient] Okay, doctor.  

[doctor] How has your diet been? Are you still following the low-sodium plan we discussed?  
[patient] I’ve been trying, but I’ve slipped up a bit during holidays with family meals.  
[doctor] I understand. We’ll reinforce that, as it’s critical for managing your hypertension.  
[patient] Yes, I’ll make sure to get back on track.  

[doctor] Let’s discuss the results from your last bloodwork. Your cholesterol levels were slightly elevated, and your hemoglobin A1c suggests borderline diabetes.  
[patient] I see. What does that mean for me?  
[doctor] It means we need to focus on dietary changes and consider starting a low-dose statin. I’ll also refer you to a nutritionist for better meal planning.  
[patient] That makes sense. Thank you, doctor.  

[doctor] Lastly, you mentioned experiencing more frequent leg swelling recently. Is that still a concern?  
[patient] Yes, especially after long days at work.  
[doctor] That could be a sign of fluid retention. I’ll adjust your diuretic dose and monitor your progress over the next two weeks.  
[patient] Thank you, doctor.  

[doctor] All right, let’s get those tests done and review everything at our next appointment. Do you have any other concerns?  
[patient] No, I think that’s all for now.  
[doctor] Great. See you in two weeks. 
"""

# Define the inference request body
body = {
    "prompt": prompt,
    "temperature": 0.5,
    "p": 0.9,
    "max_tokens": 80,
}

# Specify the ARN of the custom model
custom_model_arn = "YOUR_MODEL_ARN" #Put your model ARN here

# Invoke the custom model for inference
try:
    response = bedrock_runtime.invoke_model(
        modelId=custom_model_arn,
        body=json.dumps(body)
    )

    # Read and parse the response
    response_body = response['body'].read().decode('utf-8')
    result = json.loads(response_body)

    # Extract the summary from the response
    summary_text = result['generations'][0]['text']
    print("Extracted Summary:", summary_text)
except Exception as e:
    print(f"Error invoking model: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested the model with the dialogue shown in the prompt above, which is designed to reflect a real-world doctor-patient interaction emphasizing symptoms, medication adherence, and a follow-up plan, to evaluate its ability to generate concise and meaningful summaries for medical dialogues.&lt;/p&gt;

&lt;p&gt;You can also test the inference directly from the &lt;strong&gt;Playground&lt;/strong&gt; in the Amazon Bedrock console. To do this, navigate to &lt;strong&gt;Chat/Text&lt;/strong&gt; under the Playground section, select your fine-tuned model, and enter your desired prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ovdwrek28ebjfsbq0at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ovdwrek28ebjfsbq0at.png" alt="Playground in Amazon Bedrock" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input to the model:&lt;/strong&gt; the same doctor-patient dialogue used in the inference request above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model's Response:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rel7hu2u06mwafvhw0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rel7hu2u06mwafvhw0e.png" alt="Amazon Bedrock playground" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Clean up
&lt;/h2&gt;

&lt;p&gt;To avoid incurring additional costs, please ensure that you &lt;strong&gt;remove any provisioned throughput&lt;/strong&gt;. You can remove provisioned throughput by navigating to the Provisioned Throughput section from the sidebar in the Amazon Bedrock console. Select the active provisioned throughput and delete it.&lt;/p&gt;
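&lt;p&gt;The same cleanup can be done from code. A sketch using the Bedrock control-plane calls &lt;code&gt;delete_provisioned_model_throughput&lt;/code&gt; and &lt;code&gt;delete_custom_model&lt;/code&gt; (the ARN is the one copied in Step 4; the helper name is my own):&lt;/p&gt;

```python
def cleanup(bedrock_client, provisioned_model_arn, custom_model_name):
    """Delete the hourly-billed provisioned throughput first, then the custom model."""
    bedrock_client.delete_provisioned_model_throughput(
        provisionedModelId=provisioned_model_arn
    )
    bedrock_client.delete_custom_model(modelIdentifier=custom_model_name)
```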

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Fine-tuning and deploying custom AI models on Amazon Bedrock unlocks the potential to create tailored solutions for specific use cases, such as summarizing medical dialogues. This guide has walked you through every step of the process, from preparing your dataset and configuring fine-tuning parameters to testing your model and deploying it for real-world inference. By leveraging the robust infrastructure and tools provided by Amazon Bedrock, you can streamline the fine-tuning process and focus on delivering impactful AI-driven solutions.&lt;/p&gt;

&lt;p&gt;The steps outlined in this article illustrate how even a relatively small, structured dataset can yield meaningful results with careful preparation and parameter tuning. Whether you're exploring summarization, classification, or other NLP tasks, Amazon Bedrock makes advanced model customization accessible and efficient.&lt;/p&gt;

&lt;p&gt;As you begin your fine-tuning journey, remember to experiment with hyperparameters and test your model rigorously to ensure optimal performance. Lastly, always clean up unused resources to avoid unnecessary costs. For further exploration, check out the complete implementation in &lt;a href="https://github.com/miladrezaei-ai/bedrock-custom-model-finetuning" rel="noopener noreferrer"&gt;my GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With Amazon Bedrock, the possibilities for building intelligent, custom AI models are endless—empowering businesses to innovate and thrive in the evolving AI landscape.&lt;/p&gt;

</description>
      <category>bedrock</category>
      <category>foundationmodel</category>
      <category>llm</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
