Justin Coker
Image Summarization using AWS Bedrock

In previous posts, I've explored various applications of Amazon Rekognition for analyzing images and videos. Today, I wanted to take it a step further by integrating Rekognition's powerful computer vision capabilities with the advanced summarization features of Amazon Bedrock's large language models. Let's get started!

Use Cases

  • Search: By generating captions that describe the visual details and semantic information of images, image summarization allows for a more nuanced and accurate search experience. It bridges the gap between visual data and language, enabling users to find images based on textual descriptions that reflect the content of the images.
  • Accessibility: Image summarization can enhance accessibility by providing concise textual descriptions of visual content, which is crucial for individuals with visual impairments. It allows them to access information that would otherwise be inaccessible, fostering inclusivity and equal access to digital content.
  • Tagging: This solution could also generate tags automatically from image content, supporting metadata storage and refinement.

Services

The services we'll be using are pretty well known, but, in case they're new to you, here's a brief overview of each:

  • Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

  • Amazon Rekognition offers pre-trained and customizable computer vision (CV) capabilities to extract information and insights from your images and videos.

  • AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

Architecture

(Architecture diagram: an upload to the input S3 bucket fires an EventBridge rule, which starts a Step Functions state machine that calls Rekognition, our lambdas, and Bedrock before writing the final summary to the output bucket.)

Prerequisites

This application will be deployed using Pulumi and JavaScript/TypeScript, so you should have some familiarity with both in order to follow what's being deployed. You'll also need to make sure the following dependencies are installed:

  • Pulumi CLI
  • Node.js and npm
  • An AWS account with credentials Pulumi can use (see Authentication below)

Getting Started

After installing the prerequisites, we can start building our app. I'm going to outline each step, but if you prefer to simply see the finished product, feel free to skip ahead and go directly to the GitHub repo.

Creating the Project

First, let's log in to a Pulumi backend. By default, pulumi new attempts to use Pulumi Cloud for its backend, but for this simple example we'll use our local filesystem.

pulumi login --local

Next, we'll need to create a directory for our project and bootstrap Pulumi.

mkdir image-summarization && cd image-summarization
pulumi new aws-typescript

You will be prompted to enter the project name and project description:

This command will walk you through creating a new Pulumi project.

Enter a value or leave blank to accept the (default), and press <ENTER>.
Press ^C at any time to quit.

project name: (image-summarization)
project description: (Image summary generation using AWS)
Created project 'image-summarization'

Next, you will be asked for a stack name. Hit ENTER to accept the default value of dev.

Please enter your desired stack name.
stack name: (dev)
Created stack 'dev'

Finally, you will be prompted for the region. For this example, I'll be using us-east-1.

aws:region: The AWS region to deploy into: (us-east-1)
Saved config

Authentication

Before we can deploy any services to AWS, we have to set the credentials for Pulumi to use. I won't cover it in depth here, but you can reference the Pulumi docs, which outline your authentication options.
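For completeness, one common option (just an example, not the only way) is pointing Pulumi at an existing AWS CLI profile via an environment variable:

export AWS_PROFILE=your-profile-name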

AWS Components

We could use some level of abstraction to make things more manageable, but for this example I'm simply going to put all components in the index.ts file at our project root.
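The snippets that follow assume the imports below at the top of index.ts. The first two come with the aws-typescript template; @pulumi/archive is an extra package (npm install @pulumi/archive) that we'll use later to zip up the lambda code.

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as archive from "@pulumi/archive";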

Buckets

Like a lot of applications built on AWS, the first thing we need is some S3 buckets.

const inputBucket = new aws.s3.Bucket("input-bucket", {
  forceDestroy: true,
});

const outputBucket = new aws.s3.Bucket("output-bucket", {
  forceDestroy: true,
});

Now, let's go ahead and enable EventBridge notifications on the Input Bucket we just defined.

const inputBucketNotification = new aws.s3.BucketNotification(
  "input-bucket-notification",
  { eventbridge: true, bucket: inputBucket.id }
);

Lambdas

Detecting image labels using Rekognition produces a very verbose output—most of which is inconsequential to our LLM—so we'll create one lambda to filter the labels and another one simply to tidy up our final output.

Lambdas need an execution role, so let's go ahead and create that first. Our lambdas won't be calling any services, so they don't really need any permissions. Creating a role and a trust policy will suffice.

const lambdaTrustPolicy = aws.iam.getPolicyDocument({
  statements: [
    {
      effect: "Allow",
      principals: [
        {
          type: "Service",
          identifiers: ["lambda.amazonaws.com"],
        },
      ],
      actions: ["sts:AssumeRole"],
    },
  ],
});

const lambdaRole = new aws.iam.Role("ImageSummarizationLambdaRole", {
  name: "ImageSummarizationLambdaRole",
  assumeRolePolicy: lambdaTrustPolicy.then((policy) => policy.json),
});
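One optional tweak of my own, not required for the workflow itself: attaching the AWS managed basic execution policy lets the lambdas write to CloudWatch Logs, which makes debugging much easier. If you want that, extend the role definition above like so:

const lambdaRole = new aws.iam.Role("ImageSummarizationLambdaRole", {
  name: "ImageSummarizationLambdaRole",
  assumeRolePolicy: lambdaTrustPolicy.then((policy) => policy.json),
  // Optional: lets the lambdas write logs to CloudWatch
  managedPolicyArns: [aws.iam.ManagedPolicy.AWSLambdaBasicExecutionRole],
});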
Filter Labels

Next, let's create the lambda we'll use to filter the Rekognition labels. We'll put it in a file called filterLabels.mjs under the lambdas/src directory in our project root. This function will filter out any labels below our set confidence level, count each object type, and format them into a comma-separated string for our LLM to consume.

export const handler = async (event) => {
  // The confidence threshold is configurable via the lambda's environment
  const confidenceLevel = parseInt(process.env.CONFIDENCE_LEVEL, 10) || 90;

  const labels = event.Rekognition.Labels;

  // Drop low-confidence labels, prefix a label with its instance count when
  // Rekognition located distinct instances, and flatten to a single string
  const filteredLabels = labels
    .filter((label) => label.Confidence > confidenceLevel)
    .map((label) =>
      label.Instances.length > 0
        ? `${label.Instances.length} ${label.Name}`
        : label.Name
    )
    .join(", ");

  const response = {
    labels: filteredLabels,
  };
  return response;
};

We'll be deploying this as a zip package, so we'll go ahead and tell Pulumi to archive the file for us.

const filterLabelsArchive = archive.getFile({
  type: "zip",
  sourceFile: "lambdas/src/filterLabels.mjs",
  outputPath: "lambdas/dist/filterLabels.zip",
});

As you can see from the block above, Pulumi will place the output zip file in the lambdas/dist directory. Now we'll tell Pulumi to create the lambda using the zip file it just created.

const filterLabelsFunction = new aws.lambda.Function(
  "ImageSummarizationFilterLabels",
  {
    code: new pulumi.asset.FileArchive("lambdas/dist/filterLabels.zip"),
    name: "ImageSummarizationFilterLabels",
    role: lambdaRole.arn,
    sourceCodeHash: filterLabelsArchive.then(
      (archive) => archive.outputBase64sha256
    ),
    runtime: aws.lambda.Runtime.NodeJS20dX,
    handler: "filterLabels.handler",
    environment: {
      variables: {
        CONFIDENCE_LEVEL: "90",
      },
    },
  }
);
Build Output

Now, we're going to create a simple function to build the output we want from the results. The process is the same as for the filter labels function we just created, so I'll only include the snippets.

export const handler = async (event) => {
  // Shape the final document: where the image came from, the raw
  // Rekognition labels, and the summary text generated by the model
  const response = {
    source: {
      bucket: event.detail.bucket.name,
      file: event.detail.object.key,
    },
    labels: event.Rekognition.Labels,
    summary: event.Bedrock.Body.results[0].outputText,
  };

  return response;
};
And the corresponding Pulumi resources in index.ts:
const buildOutputArchive = archive.getFile({
  type: "zip",
  sourceFile: "lambdas/src/buildOutput.mjs",
  outputPath: "lambdas/dist/buildOutput.zip",
});

const buildOutputFunction = new aws.lambda.Function(
  "ImageSummarizationBuildOutput",
  {
    code: new pulumi.asset.FileArchive("lambdas/dist/buildOutput.zip"),
    name: "ImageSummarizationBuildOutput",
    role: lambdaRole.arn,
    sourceCodeHash: buildOutputArchive.then(
      (archive) => archive.outputBase64sha256
    ),
    runtime: aws.lambda.Runtime.NodeJS20dX,
    handler: "buildOutput.handler",
  }
);

Step Function

Just like our lambdas, our step function needs an execution role; unlike the lambdas, it will actually need real permissions, so we'll create those using references to the components we've already defined.

const stateMachineTrustPolicy = aws.iam.getPolicyDocument({
  statements: [
    {
      effect: "Allow",
      principals: [
        {
          type: "Service",
          identifiers: ["states.amazonaws.com"],
        },
      ],
      actions: ["sts:AssumeRole"],
    },
  ],
});

const stateMachinePolicy = new aws.iam.Policy("ImageSummarizationSfn-Policy", {
  name: "ImageSummarizationSfn-Policy",
  path: "/",
  description: "Permission policy for Image Summarization state machine",
  policy: pulumi.jsonStringify({
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["lambda:InvokeFunction"],
        Resource: [
          filterLabelsFunction.arn,
          buildOutputFunction.arn,
        ],
      },
      {
        Action: ["s3:GetObject", "s3:DeleteObject", "s3:PutObject"],
        Effect: "Allow",
        Resource: [
          pulumi.interpolate`${inputBucket.arn}/*`,
          pulumi.interpolate`${outputBucket.arn}/*`,
        ],
      },
      {
        Action: "rekognition:DetectLabels",
        Effect: "Allow",
        Resource: "*",
      },
      {
        Action: ["bedrock:InvokeModel"],
        Effect: "Allow",
        Resource: "*",
      },
    ],
  }),
});

const stateMachineRole = new aws.iam.Role("ImageSummarizationSfn-Role", {
  name: "ImageSummarizationSfn-Role",
  assumeRolePolicy: stateMachineTrustPolicy.then((policy) => policy.json),
  managedPolicyArns: [stateMachinePolicy.arn],
});

Now that we have a role to use, we can create our state machine. We're going to define five steps in our state machine: Detect Labels, Filter Labels, Bedrock InvokeModel, Build Output, and Save Output. Step function definitions are pretty verbose, so I've stripped out everything but the most critical parameters.

const stateMachine = new aws.sfn.StateMachine(
  "ImageSummarizationStateMachine",
  {
    name: "ImageSummarizationStateMachine",
    roleArn: stateMachineRole.arn,
    definition: pulumi.jsonStringify({
      StartAt: "Detect Labels",
      States: {
        "Detect Labels": {
          Type: "Task",
          Parameters: {
            Image: {
              S3Object: {
                "Bucket.$": "$.detail.bucket.name",
                "Name.$": "$.detail.object.key",
              },
            },
          },
          Resource: "arn:aws:states:::aws-sdk:rekognition:detectLabels",
          Next: "Filter Labels",
          ResultPath: "$.Rekognition",
          ResultSelector: {
            "Labels.$": "$.Labels",
          },
        },
        "Filter Labels": {
          Type: "Task",
          Resource: "arn:aws:states:::lambda:invoke",
          Parameters: {
            "Payload.$": "$",
            FunctionName: filterLabelsFunction.arn,
          },
          ResultPath: "$.Lambda",
          ResultSelector: {
            "FilteredLabels.$": "$.Payload.labels",
          },
          Next: "Bedrock InvokeModel",
        },
        "Bedrock InvokeModel": {
          Type: "Task",
          Resource: "arn:aws:states:::bedrock:invokeModel",
          Parameters: {
            ModelId:
              "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-premier-v1:0",
            Body: {
              "inputText.$":
                "States.Format('Human: Here is a comma seperated list of labels/objects seen in an image\n<labels>{}</labels>\n\n" +
                "Please provide a human readible and understandable summary based on these labels\n\nAssistant:', $.Lambda.FilteredLabels)",
              textGenerationConfig: {
                temperature: 0.7,
                topP: 0.9,
                maxTokenCount: 512,
              },
            },
          },
          ResultPath: "$.Bedrock",
          Next: "Build Output",
        },
        "Build Output": {
          Type: "Task",
          Resource: "arn:aws:states:::lambda:invoke",
          OutputPath: "$.Payload",
          Parameters: {
            "Payload.$": "$",
            FunctionName: buildOutputFunction.arn,
          },
          Next: "Save Output",
        },
        "Save Output": {
          Type: "Task",
          End: true,
          Parameters: {
            "Body.$": "$",
            Bucket: outputBucket.id,
            "Key.$": "States.Format('{}.json', $.source.file)",
          },
          Resource: "arn:aws:states:::aws-sdk:s3:putObject",
        },
      },
    }),
  }
);
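Among the things stripped from the definition above is error handling. As an illustrative sketch of my own (not part of the original repo, and the values are arbitrary), a production version might add a Retry policy to the lambda-invoke states; for example, Filter Labels could become:

        "Filter Labels": {
          Type: "Task",
          Resource: "arn:aws:states:::lambda:invoke",
          Parameters: {
            "Payload.$": "$",
            FunctionName: filterLabelsFunction.arn,
          },
          // Retry transient lambda failures with exponential backoff
          Retry: [
            {
              ErrorEquals: [
                "Lambda.ServiceException",
                "Lambda.TooManyRequestsException",
              ],
              IntervalSeconds: 2,
              MaxAttempts: 3,
              BackoffRate: 2,
            },
          ],
          ResultPath: "$.Lambda",
          ResultSelector: {
            "FilteredLabels.$": "$.Payload.labels",
          },
          Next: "Bedrock InvokeModel",
        },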

Events

At this point we have buckets, lambdas, and a fully working step function capable of detecting labels and summarizing the results. The one thing missing is the EventBridge rule that starts the state machine whenever an object is uploaded to the Input Bucket.

First, let's create the rule for objects created in the Input Bucket.

const inputRule = new aws.cloudwatch.EventRule("input-bucket-rule", {
  name: "input-bucket-rule",
  eventPattern: pulumi.jsonStringify({
    source: ["aws.s3"],
    "detail-type": ["Object Created"],
    detail: {
      bucket: {
        name: [inputBucket.id],
      },
    },
  }),
  forceDestroy: true,
});

Now, we'll need a role capable of starting our state machine.

const inputRuleTrustPolicy = aws.iam.getPolicyDocument({
  statements: [
    {
      effect: "Allow",
      principals: [
        {
          type: "Service",
          identifiers: ["events.amazonaws.com"],
        },
      ],
      actions: ["sts:AssumeRole"],
    },
  ],
});

const inputRulePolicy = new aws.iam.Policy("ImageSummarizationRule-Policy", {
  name: "ImageSummarizationRule-Policy",
  policy: pulumi.jsonStringify({
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["states:StartExecution"],
        Resource: [stateMachine.arn],
      },
    ],
  }),
});

const inputRuleRole = new aws.iam.Role("ImageSummarizationRule-Role", {
  name: "ImageSummarizationRule-Role",
  assumeRolePolicy: inputRuleTrustPolicy.then((policy) => policy.json),
  managedPolicyArns: [inputRulePolicy.arn],
});

Finally, we'll tie the rule, role, and state machine together by defining an event target.

const inputRuleTarget = new aws.cloudwatch.EventTarget("input-rule-target", {
  targetId: "input-rule-target",
  rule: inputRule.name,
  arn: stateMachine.arn,
  roleArn: inputRuleRole.arn,
});
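Before deploying, it's handy to surface the generated bucket names as stack outputs; these two exports are my own addition rather than part of the original walkthrough.

export const inputBucketName = inputBucket.id;
export const outputBucketName = outputBucket.id;

With everything defined, deploy the stack and confirm the preview:

pulumi up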

Time to play with our new application!

Wrapping Up

There are a couple of copyright-free images in the assets folder of the repo I provided, but you're free to upload any images you like and test the results. For this test, I'm going to upload skateboard.jpeg from the repo and see what I get.
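Assuming the inputBucketName stack output added earlier and an AWS CLI configured with the same credentials, one way to kick off the pipeline is:

aws s3 cp assets/skateboard.jpeg s3://$(pulumi stack output inputBucketName)/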


Let's see what the output looks like and compare it to the image. Here is what's contained in the summary key of the output:

The image shows a city with a road and street in the neighborhood. There are 13 cars and 21 wheels. There is 2 buildings and 2 persons in the metropolis. The architecture is urban.
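For reference, the full JSON written to the output bucket follows the shape our Build Output lambda returns. This is an illustrative sketch (the generated bucket name and label details will differ), not the exact file:

{
  "source": { "bucket": "input-bucket-1a2b3c4", "file": "skateboard.jpeg" },
  "labels": [{ "Name": "Car", "Confidence": 99.1, "Instances": [] }],
  "summary": "The image shows a city with a road and street in the neighborhood. ..."
}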

Not perfect, but really not too bad. The application clearly does what we expected, so what's next?

  • Tinker with the confidence level: I have the confidence level set to 90, and changing this value can drastically alter the labels passed to our model.
  • Try different models: For this example, I used the Titan Text Premier model, but Bedrock has many models to choose from that may produce better results; see the sketch below for what a swap can involve.
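Swapping models usually means changing more than the ModelId, because each model family has its own request and response schema. As a hedged sketch (assuming you've enabled access to Claude 3 Haiku in the Bedrock console), the Bedrock InvokeModel state might become something like the following, with the Build Output lambda updated to read event.Bedrock.Body.content[0].text instead of event.Bedrock.Body.results[0].outputText:

        "Bedrock InvokeModel": {
          Type: "Task",
          Resource: "arn:aws:states:::bedrock:invokeModel",
          Parameters: {
            ModelId: "anthropic.claude-3-haiku-20240307-v1:0",
            Body: {
              // Claude models on Bedrock use the Messages API schema
              anthropic_version: "bedrock-2023-05-31",
              max_tokens: 512,
              messages: [
                {
                  role: "user",
                  // Keys ending in .$ are evaluated by Step Functions
                  "content.$":
                    "States.Format('Here is a comma separated list of labels/objects seen in an image: {}. Please provide a human readable summary based on these labels.', $.Lambda.FilteredLabels)",
                },
              ],
            },
          },
          ResultPath: "$.Bedrock",
          Next: "Build Output",
        },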

And with that, we're done. Feel free to leave any comments or corrections. I sincerely hope you enjoyed this post, and thank you for reading!
