DEV Community

Zied Ben Tahar for AWS Community Builders

Posted on • Edited on • Originally published at levelup.gitconnected.com

AI powered video summarizer with Amazon Bedrock and Anthropic’s Claude

Photo by [Andy Benham](https://unsplash.com/@benham3160?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

At times, I find myself wanting to quickly get a summary of a video or capture the key points of a tech talk. Thanks to the capabilities of generative AI, achieving this is entirely possible with minimal effort.

In this article, I’ll walk you through the process of creating a service that summarizes YouTube videos based on their transcripts and generates audio from these summaries.

AI powered youtube video summarizer

We’ll leverage Anthropic’s Claude 2.1 foundation model through Amazon Bedrock for summary generation, and Amazon Polly to synthesize speech from these summaries.

Solution overview

I will use a Step Functions state machine to orchestrate the steps involved in summary and audio generation:

AI powered youtube video summarizer architecture

🔍 Let’s break this down:

  • The Get Video Transcript function retrieves the transcript from a specified YouTube video URL. Upon successful retrieval, the transcript is stored in an S3 bucket, ready for processing in the next step.

  • The Generate Model Parameters function retrieves the transcript from the bucket and generates the prompt and inference parameters specific to Anthropic’s Claude v2 model. These parameters are then stored in the bucket for use by the Bedrock API in the subsequent step.

  • Invoking the Bedrock API is achieved through the Step Functions AWS SDK integration, which runs the model inference with the inputs stored in the bucket. This step generates a structured JSON containing the summary.

  • Generate audio from summary relies on Amazon Polly to perform speech synthesis from the summary produced in the previous step. This step returns the final output containing the video summary in text format, as well as a presigned URL for the generated audio file.

  • The bucket serves as state storage shared across all the steps of the state machine. We don’t know the size of the video transcript upfront, and for lengthy videos it might exceed the Step Functions payload size limit of 256 KB.
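To make the payload concern concrete, here is a minimal sketch of the kind of size check that motivates passing data through the bucket. The helper name and the constant are illustrative assumptions and not part of the original code; the limit is taken here as 256 KiB.

```typescript
// Step Functions payload limit mentioned above, assumed here to mean 256 KiB.
const MAX_PAYLOAD_BYTES = 256 * 1024;

// Hypothetical helper: returns true when the text, encoded as UTF-8,
// is larger than what a single state payload can carry.
function exceedsPayloadLimit(text: string): boolean {
  const sizeInBytes = new TextEncoder().encode(text).length;
  return sizeInBytes > MAX_PAYLOAD_BYTES;
}

const shortTranscript = "Hello and welcome to this talk.";
const longTranscript = "word ".repeat(100000); // 500,000 bytes of ASCII text

console.log(exceedsPayloadLimit(shortTranscript)); // false
console.log(exceedsPayloadLimit(longTranscript)); // true
```

Storing every intermediate result in S3 sidesteps this check entirely, which is the approach taken in this article.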

On using Anthropic’s Claude 2.1

At the time of writing, the Claude 2.1 model supports a context window of 200K tokens, roughly 150K words. It also provides good accuracy over long documents, making it well suited to summarizing lengthy video transcripts.
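As a rough illustration of that ratio, the sketch below estimates whether a transcript is likely to fit in the context window. It assumes the figures above (200K tokens for about 150K words, so about 0.75 words per token) and a naive whitespace word count. The function names and the reservedForOutput parameter are hypothetical, and real token counts depend on the tokenizer.

```typescript
const CONTEXT_WINDOW_TOKENS = 200000;
// Ratio derived from the figures above: 150K words for 200K tokens.
const WORDS_PER_TOKEN = 150000 / 200000;

// Hypothetical heuristic: counts whitespace-separated words and converts
// them to an approximate token count.
function estimateTokenCount(text: string): number {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0);
  return Math.ceil(words.length / WORDS_PER_TOKEN);
}

// Hypothetical check: keeps some room for the instructions and the answer.
function likelyFitsInContext(text: string, reservedForOutput = 1000): boolean {
  return estimateTokenCount(text) + reservedForOutput <= CONTEXT_WINDOW_TOKENS;
}

console.log(estimateTokenCount("one two three")); // 4
console.log(likelyFitsInContext("word ".repeat(1000))); // true
```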

TL;DR

You will find the complete source code here 👇
GitHub - ziedbentahar/yt-video-summarizer-with-bedrock

I will use Node.js, TypeScript, and CDK for IaC.

Solution details

1- Enabling Anthropic’s Claude v2 in your account

Amazon Bedrock offers a range of foundation models, including Amazon Titan, Anthropic’s Claude, Meta’s Llama 2, etc., which are accessible through Bedrock APIs. By default, these foundation models are not enabled; they must be enabled through the console before use.

We’ll request access to Anthropic’s Claude models. But first we’ll need to submit use case details:

Request Anthropic’s Claude access

2- Getting transcripts from YouTube videos

I will rely on this lib for the video transcript extraction (it feels like a cheat code 😉). This library uses an unofficial YouTube API and does not require a headless Chrome solution. For now, it yields good results on several YouTube videos, but I might explore a more robust solution in the future:

    import { storeTranscript } from "adapters/transcript-repository";
    import { YoutubeTranscript } from "youtube-transcript";

    export const handler = async (event: {
      youtubeVideoUrl: string;
      requestId: string;
    }) => {
      const { youtubeVideoUrl, requestId } = event;
      const transcript = await YoutubeTranscript.fetchTranscript(youtubeVideoUrl);
      const sentences = Array.from(getSentencesFromYoutubeTranscript(transcript));
      await storeTranscript(requestId, sentences.join("\n"));
    };

    // Groups transcript segments into sentences: segments are accumulated
    // until one of them ends with a period.
    function* getSentencesFromYoutubeTranscript(transcript: { text: string }[]) {
      let currentSentence: string[] = [];
      for (const { text } of transcript) {
        currentSentence.push(text);
        if (text.endsWith(".")) {
          yield currentSentence.join(" ").replaceAll("\n", " ");
          currentSentence = [];
        }
      }
      // Emit any trailing segments that did not end with a period.
      if (currentSentence.length > 0) {
        yield currentSentence.join(" ").replaceAll("\n", " ");
      }
    }

The extracted transcript is then stored in the S3 bucket using ${requestId}/transcript as a key.

You can find the code for this lambda function here

3- Finding the adequate prompt and generating model inference parameters

At the time of writing, Bedrock only supports Claude’s Text Completions API. Prompts must be wrapped in \n\nHuman: and \n\nAssistant: markers so that Claude understands the conversation context.

Here is the prompt; I find that it produces good results for our use case:

    You are a video transcript summarizer.
    Summarize this transcript in a third person point of view in 10 sentences.
    Identify the speakers and the main topics of the transcript and add them in the output as well.
    Do not add or invent speaker names if you are not able to identify them.
    Please output the summary in JSON format conforming to this JSON schema:
    {
      "type": "object",
      "properties": {
        "speakers": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "topics": {
          "type": "string"
        },
        "summary": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    }

    <transcript>{{transcript}}</transcript>

🤖 Helping Claude produce good results:

  • To clearly mark the transcript to summarize, we use XML tags. Claude will specifically focus on the content enclosed by these XML tags. I will substitute the {{transcript}} string with the actual video transcript.

  • To assist Claude in generating a reliable JSON output format, I include in the prompt the JSON schema that needs to be adhered to.

  • Finally, I also need to tell Claude to return only a concise JSON response, without a preamble or postscript around the JSON payload:

\n\nHuman:{{prompt}}\n\nAssistant:{

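Note that the full prompt ends with a trailing {. This pre-fills the beginning of Claude’s answer, so the completion continues directly with the JSON body.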
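The assembly described above can be sketched as follows. This is a minimal illustration rather than the repository's actual code: buildPrompt and TRANSCRIPT_PLACEHOLDER are hypothetical names, and the short template used in the example stands in for the full prompt shown earlier.

```typescript
// Placeholder used in the prompt template shown above.
const TRANSCRIPT_PLACEHOLDER = "{{transcript}}";

// Hypothetical helper: fills in the transcript and wraps the result in the
// Human/Assistant markers expected by the Text Completions API.
function buildPrompt(promptTemplate: string, transcript: string): string {
  // A replacer function keeps "$&"-style sequences in the transcript from
  // being interpreted as special replacement patterns.
  const instructions = promptTemplate.replace(
    TRANSCRIPT_PLACEHOLDER,
    () => transcript
  );
  // The trailing "{" pre-fills the start of Claude's JSON answer.
  return `\n\nHuman:${instructions}\n\nAssistant:{`;
}

const examplePrompt = buildPrompt(
  "Summarize this transcript.\n<transcript>{{transcript}}</transcript>",
  "Hello everyone. Today we talk about serverless."
);

console.log(examplePrompt.startsWith("\n\nHuman:")); // true
console.log(examplePrompt.endsWith("\n\nAssistant:{")); // true
```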

As mentioned in the section above, we store this generated prompt along with the model parameters in the bucket so that they can be used as input to the Bedrock API:

      const modelParameters = {
        prompt,
        max_tokens_to_sample: MAX_TOKENS_TO_SAMPLE,
        top_k: 250,
        top_p: 1,
        temperature: 0.2,
        stop_sequences: ["Human:"],
        anthropic_version: "bedrock-2023-05-31",
      };

You can follow this link for the full code of the generate-model-parameters lambda function.
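The Bedrock state in the next section reads the location of these parameters from $.Payload.modelParameters, so the function's result presumably carries an S3 URI under that field. Here is a minimal sketch of such a result; the interface name, the helper name, and the "model-parameters" key suffix are assumptions made for illustration.

```typescript
// Hypothetical shape of the Lambda result. The state machine reads the
// "modelParameters" field through $.Payload.modelParameters.
interface GenerateModelParametersResult {
  modelParameters: string; // S3 URI of the stored parameters file
}

// Hypothetical helper: builds the result from the bucket name and request id.
// The "model-parameters" key suffix is an assumption, not the repository's key.
function toModelParametersResult(
  bucketName: string,
  requestId: string
): GenerateModelParametersResult {
  return {
    modelParameters: `s3://${bucketName}/${requestId}/model-parameters`,
  };
}

console.log(toModelParametersResult("my-bucket", "req-1"));
```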

4- Invoking Claude Model

In this step, we’ll avoid writing a custom Lambda function to invoke the Bedrock API. Instead, we’ll use the Step Functions direct SDK integration. This state loads from the bucket the model inference parameters that were generated in the previous step:

    new CustomState(this, "bedrock-invoke-model", {
      stateJson: {
        Type: "Task",
        Resource: "arn:aws:states:::bedrock:invokeModel",
        Parameters: {
          ModelId: "anthropic.claude-v2:1",
          // The inference parameters are read from the S3 object written by the previous step.
          Input: {
            "S3Uri.$": `$.Payload.modelParameters`,
          },
          ContentType: "application/json",
        },
        ResultSelector: {
          "id.$": "$$.Execution.Name",
          "summaryTaskResult.$":
            "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
        },
      },
    });

☝️ Note: Because the prompt already ends with an opening {, Claude’s completion continues from there and outputs the rest of the requested JSON, so the completion in the API response is missing the leading {.

We use intrinsic functions in the state’s ResultSelector to add the missing opening curly brace and to format the state output as a well-formed JSON payload:

    ResultSelector: {
      "id.$": "$$.Execution.Name",
      "summaryTaskResult.$":
        "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
    }

I have to admit it is not ideal, but it helps us get by without writing a custom Lambda function.
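For comparison, if this step were handled in code instead of intrinsic functions, the same repair could be sketched as below. This is only an illustration under assumptions: parseSummaryCompletion and VideoSummary are hypothetical names, and the shape check is stricter than the JSON schema in the prompt, which does not mark any field as required.

```typescript
// Shape requested from Claude by the JSON schema in the prompt.
interface VideoSummary {
  speakers: string[];
  topics: string;
  summary: string[];
}

const isStringArray = (value: unknown): value is string[] =>
  Array.isArray(value) && value.every((item) => typeof item === "string");

// Hypothetical helper: restores the "{" that was pre-filled in the prompt,
// parses the completion, and checks it against the expected shape.
function parseSummaryCompletion(completion: string): VideoSummary {
  const parsed: unknown = JSON.parse(`{${completion}`);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Completion is not a JSON object");
  }
  const { speakers, topics, summary } = parsed as Record<string, unknown>;
  if (
    !isStringArray(speakers) ||
    typeof topics !== "string" ||
    !isStringArray(summary)
  ) {
    throw new Error("Completion does not match the summary schema");
  }
  return { speakers, topics, summary };
}

// Example completion, as it would arrive without its leading "{".
const exampleCompletion =
  '"speakers": ["Jane"], "topics": "Serverless", "summary": ["First point."]}';
console.log(parseSummaryCompletion(exampleCompletion).topics); // Serverless
```

A function like this would also be a natural place to handle completions that contain extra text after the JSON payload, which JSON.parse rejects.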

5- Generating audio from video summary

This step is heavily inspired by this previous blog post. Amazon Polly generates the audio from the video summary:

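    import {
      getPubliclyAvailableUrl,
      storeAudio,
    } from "adapters/audio-summary-repository";
    import { synthesize } from "adapters/speech-synthesis";

    // Assumed shape of the input, matching the ResultSelector of the previous state.
    type SummaryTaskOutput = {
      id: string;
      summaryTaskResult: { speakers: string[]; topics: string; summary: string[] };
    };

    export const handler = async (event: SummaryTaskOutput) => {
      // Synthesize the summary, store the audio, then return the summary with a link to it.
      const audio = await synthesize(event.summaryTaskResult);
      await storeAudio(event.id, audio);
      return {
        videoSummary: {
          ...event.summaryTaskResult,
          audioUrl: await getPubliclyAvailableUrl(event.id),
        },
      };
    };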

Here are the details of the synthesize function:

    import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";
    // chunkString is a helper that is not shown in this article.

    const polly = new PollyClient({});

    export const synthesize = async (data: { topics: string; summary: string[] }) => {
      const audioBuffers: Buffer[] = [];
      for (const sentence of data.summary) {
        // Add a pause after each sentence of the summary.
        const sentenceWithBreak = `${sentence} <break strength="x-strong" />`;
        const paragraphBuffers = await Promise.all(
          chunkString(sentenceWithBreak, 1500).map((chunk) =>
            polly
              .send(
                new SynthesizeSpeechCommand({
                  OutputFormat: "mp3",
                  TextType: "ssml",
                  Text: `<speak>${chunk}</speak>`,
                  Engine: "neural",
                  VoiceId: "Joanna",
                  LanguageCode: "en-US",
                })
              )
              .then((response) => response.AudioStream!.transformToByteArray())
              .then((byteArray) => Buffer.from(byteArray))
          )
        );
        audioBuffers.push(...paragraphBuffers);
      }
      // Concatenate the MP3 segments in order.
      return Buffer.concat(audioBuffers);
    };

Once the audio is generated, we store it in the S3 bucket and generate a presigned URL so that it can be downloaded afterwards.
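The synthesize function calls a chunkString helper that is not shown in this article. One possible implementation is sketched below, assuming its job is to split text into chunks of at most maxLength characters; the real helper in the repository may behave differently. Note that this sketch does not treat SSML tags specially, so a tag that straddles a chunk boundary would be split.

```typescript
// Hypothetical implementation: splits text into chunks of at most maxLength
// characters, breaking on whitespace where possible. Runs of whitespace are
// collapsed into single spaces.
function chunkString(text: string, maxLength: number): string[] {
  if (!Number.isInteger(maxLength) || maxLength <= 0) {
    throw new Error("maxLength must be a positive integer");
  }
  const chunks: string[] = [];
  let current = "";
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  for (let word of words) {
    // A single word longer than maxLength has to be cut mid-word.
    while (word.length > maxLength) {
      if (current.length > 0) {
        chunks.push(current);
        current = "";
      }
      chunks.push(word.slice(0, maxLength));
      word = word.slice(maxLength);
    }
    if (word.length === 0) {
      continue;
    }
    const candidate = current.length > 0 ? `${current} ${word}` : word;
    if (candidate.length <= maxLength) {
      current = candidate;
    } else {
      // The word does not fit: close the current chunk and start a new one.
      chunks.push(current);
      current = word;
    }
  }
  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}

console.log(chunkString("one two three", 7)); // [ 'one two', 'three' ]
```

With the 1500-character limit used in synthesize, a typical summary sentence fits in a single chunk, so the splitting only matters for unusually long sentences.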

☝️ On language detection: In this example, I am not performing language detection; I assume by default that the video is in English. My previous article shows how to handle language detection for speech synthesis. Alternatively, we can also leverage Claude’s capabilities to detect the language of the transcript.

6- Defining the state machine

Alright, let’s put it all together and let’s take a look at the CDK definition of the state machine:

    const failState = new Fail(this, "fail");
    const successState = new Succeed(this, "success");

    const chainDefinition = new LambdaInvoke(this, "get-video-transcript", {
      lambdaFunction: getVideoTranscriptLambda,
      payload: TaskInput.fromObject({
        "requestId.$": "$$.Execution.Name",
        "youtubeVideoUrl.$": "$.youtubeVideoUrl",
      }),
    })
      .addCatch(failState)
      .next(
        new LambdaInvoke(this, "generate-model-parameters", {
          lambdaFunction: generateModelParameters,
          payload: TaskInput.fromObject({
            "requestId.$": "$$.Execution.Name",
          }),
        }).addCatch(failState)
      )
      .next(
        new CustomState(this, "bedrock-invoke-model", {
          stateJson: {
            Type: "Task",
            Resource: "arn:aws:states:::bedrock:invokeModel",
            Parameters: {
              ModelId: "anthropic.claude-v2:1",
              Input: {
                "S3Uri.$": `$.Payload.modelParameters`,
              },
              ContentType: "application/json",
            },
            // "id" matches the field read by the generate-audio-from-summary handler.
            ResultSelector: {
              "id.$": "$$.Execution.Name",
              "summaryTaskResult.$":
                "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
            },
          },
        })
          .addCatch(failState)
          .next(
            new LambdaInvoke(this, "generate-audio-from-summary", {
              lambdaFunction: generateAudioFromSummary,
            }).addCatch(failState)
          )
          .next(successState)
      );

    const stateMachine = new StateMachine(this, "StateMachine", {
      definitionBody: DefinitionBody.fromChainable(chainDefinition),
      stateMachineType: StateMachineType.EXPRESS,
      logs: {
        destination: new LogGroup(this, "ExpressLogs", {
          retention: RetentionDays.ONE_DAY,
          removalPolicy: cdk.RemovalPolicy.DESTROY,
        }),
        level: LogLevel.ALL,
        includeExecutionData: true,
      },
    });

To be able to invoke the Bedrock API, we need to add this policy to the workflow’s role (and it’s important to remember to grant the state machine read and write permissions on the S3 bucket):

    stateMachine.addToRolePolicy(
      new PolicyStatement({
        actions: ["bedrock:InvokeModel"],
        resources: [
          `arn:aws:bedrock:${Stack.of(this).region}::foundation-model/anthropic.claude-v2:1`,
        ],
      })
    );

    stateMachine.addToRolePolicy(
      new PolicyStatement({
        actions: ["s3:GetObject", "s3:PutObject"],
        resources: [`${bucket.bucketArn}/*`],
      })
    );

Wrapping up

I find creating generative AI based applications to be a fun exercise. I am always impressed by how quickly we can develop such applications by combining serverless and Gen AI.

Certainly, there is room for improvement to make this solution production-grade. This workflow can be integrated into a larger process, allowing the video summary to be sent asynchronously to a client, and let’s not forget robust error handling.

Follow this link to get the source code for this article.

Thanks for reading, and I hope you enjoyed it!

Further readings

Put words in Claude's mouth
Anthropic Claude models
What is Amazon Bedrock?
