🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - [NEW LAUNCH] Amazon Nova 2 Omni: A new frontier in multimodal AI (AIM3324)
In this video, AWS introduces Amazon Nova 2 Omni, a unified multimodal AI model that can understand text, images, video, and audio while also generating high-quality images. The session demonstrates Omni's superior performance in document understanding, OCR, audio transcription with three-speaker diarization, and cross-modal reasoning. Key capabilities include hybrid reasoning control, support for 200+ languages, and one million token context windows. Benchmarks show competitive performance against Gemini 2.5 Flash and GPT-4 models. Dentsu Digital's Chief AI Officer Satoru Yamamoto presents real-world applications including their Mugen AI solution, demonstrating video creative prediction with 0.88 correlation accuracy, automated storyboard generation with accurate Japanese character rendering, and AI-powered workflow automation using Nova Act. Notably, Dentsu developed seven applications in just seven days, reducing development time from three months to one day per solution, showcasing the model's practical efficiency for enterprise deployment.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Amazon Nova 2 Omni and Session Overview
Hello, everyone. Good afternoon and thank you for coming to our session on Amazon Nova 2 Omni, which is a new frontier in multimodal AI. I'll be joined by my two co-presenters: Ashwin Swaminathan, who is the Director of Applied Science working on Amazon Nova models, and Satoru Yamamoto, who is the Chief AI Officer at Dentsu Digital.
For the agenda for today, we're going to look at the Amazon Nova family of models overall. Then we're going to touch upon what was launched by Matt Garman yesterday in his keynote, which is about the Amazon Nova 2 family. Then we're going to deep dive into Amazon Nova, especially the Omni model, which is optimized for multimodal workflows in terms of multimodal understanding and multimodal generation.
We're going to also touch upon some examples and demos that will give you an idea of what you can create for your business. We'll also touch upon the performance of these models as compared to the best models out there. Then we will hear from Satoru Yamamoto on how using Nova Omni, Dentsu Digital is transforming their business and helping their customers.
The Amazon Nova Family: From Foundation Models to Nova 2 Launch
We launched the Amazon Nova family of foundation models last year at re:Invent, in this very venue. We launched a suite of models. Our primary Nova models were launched in two categories. One we call Nova understanding models, which take in text, images, and videos and can derive metadata from them, summarize that content, do Q&A, and generate text. That understanding family of models is available in Micro, Lite, and Pro, and since then we have launched Premier, which is the largest model with the best performance in that category.
We also launched an image generation model called Nova Canvas and a video generation model called Nova Reel. Since then, we have launched a speech-to-speech model for real-time conversational AI applications such as customer support calls, called Amazon Nova Sonic. Very recently, last month, we launched the industry's first natively multimodal embedding model, which is great for semantic search and agentic RAG applications, especially if you have a lot of unstructured data across documents, images, videos, audio, and text.
We are pioneers in that space in terms of providing the best performance at much lower latency and cost, and being the first model in the industry to do everything together. Nova is already used by tens of thousands of enterprise and startup customers, and some of those names are listed here. Now, to touch upon what was launched yesterday, we are introducing the Amazon Nova 2 family of four models. Starting with Nova 2 Lite, which is our fast, cost-effective reasoning model for everyday workloads. It is our first reasoning model, and it is a hybrid reasoning model, meaning that developers control whether they want reasoning to be enabled and at what level for a particular task.
So if your task involves document understanding with some reasoning about whether the extracted values are correct, you might want to use some level of thinking with it. For everyday tasks, you can simply disable that reasoning, and the model will provide results as is, using fewer tokens and producing faster results. We have also launched in preview our Nova 2 Pro, which, similar to the Nova 1 family, is a higher tier that provides higher performance at different latency and cost characteristics, and it is our most intelligent reasoning model for highly complex tasks. Think of coding tasks, complex agents, or multi-agent scenarios where you would want to use Nova 2 Pro.
Then the third model that we also launched in preview is our Nova 2 Omni, that we're going to talk about today, and it's our unified model for multimodal reasoning and for image generation. Now, we believe that models in the future will be fully multimodal.
And that's how humans interact as well—with spoken words, visuals, and written text. All providers will be moving towards that. This is our first model that can understand any modality, including audio and speech, which is the first one in any of the models available on Bedrock, while also generating high quality images in the same model. It's also the industry's first reasoning model that can reason across any modality and generate images in a single model.
We also have a second generation of our Amazon Nova Sonic, which is the Nova 2 Sonic. It improves the performance compared to the first generation, supports more languages, and the conversations feel more natural compared to the first generation. We're going to deep dive more on the Omni model, but first, here's a quick overview of the capabilities of all four models.
Nova 2 Omni Capabilities and Performance Benchmarks
All four models support a large context window of up to one million input tokens and support more than 200 languages for input text. Pro and Omni also understand audio and speech in up to 10 languages. Sonic now supports more voices, and for multiple languages you can choose between different male and female voices for your workflows.
Omni can take any modality as input to generate text and images, which feels closer to the way humans interact. Because the model understands all the content in the same space before generating text or images, it provides higher-quality outputs than a combination of multiple bespoke, task-specific models. It also means you don't have to worry about building complex pipelines with multiple models and systems in place. You can unify that with a single model, reducing build and maintenance costs and getting to market faster.
These are the main properties of Amazon Nova 2 Omni. It is a hybrid reasoning model where developers can control the level of reasoning, or whether they want reasoning enabled at all. The model is great at instruction following, tool calling, and even simple text-based NLP tasks such as sentiment analysis and classification.
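To make that per-request control concrete, here is a minimal sketch of how an application might toggle reasoning through the Bedrock Converse API with boto3. The model ID and the reasoning field passed via additionalModelRequestFields are illustrative assumptions rather than documented parameter names, so check the Nova 2 documentation for the exact configuration.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str, enable_reasoning: bool = False) -> str:
    # ASSUMPTION: the field name and model ID below are placeholders for illustration.
    extra = (
        {"reasoning_config": {"type": "enabled", "budget_tokens": 2048}}
        if enable_reasoning
        else {}
    )
    response = client.converse(
        modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        additionalModelRequestFields=extra,
    )
    return response["output"]["message"]["content"][0]["text"]

# Everyday task: reasoning off for lower latency and fewer tokens.
print(ask("Classify the sentiment of: 'The checkout flow was painless.'"))

# Verification-style task: allow a reasoning budget.
print(ask("Do these invoice line items add up to the stated total? ...",
          enable_reasoning=True))
```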
The model is state of the art in multimodal perception and highly optimized for multimodal tasks: document understanding, image understanding, video understanding, and audio understanding, as well as the new capability of cross-modal reasoning. This model really excels compared to the best models out there in the market right now, and we're going to show some of those numbers, which we have already published in our technical report as well.
Nova 2 Pro and Omni are the first models on Bedrock that support audio understanding, which means you can do speech transcription and multi-speaker diarization to understand which speaker is speaking at what point. It supports multiple languages and different types of tasks that we're going to dive into later. This model also supports much higher quality image generation and natural language-based image editing. Backed by a text LLM, it can render much higher quality text within images, especially longer text, which has always been a challenge. We faced that challenge even with Canvas. With this model, we have tried to close the remaining gaps from the previous generation of image generation models, so it should be very compelling for image generation.
Now, looking at the performance, this chart is based on Artificial Analysis, a third party that tracks how top-of-the-line models perform on their overall Artificial Analysis index, which measures performance across 10+ benchmarks. The score of 58 is a consolidated number across different benchmarks that are meaningful in categories such as instruction following, tool calling, and agentic coding. As you can see, in its tier, which competes with Gemini 2.5 Flash and GPT-4 Mini-class models, this model is very competitive and stands high on the leaderboard. The models on the left are the higher tier, such as Gemini 2.5 Pro, which is similar to the Nova 2 Pro category, so we're not comparing there.
These are some of the numbers, which are already public, covering language understanding, knowledge, reasoning, instruction following, and tool calling. These are critical for assessing how good a model is and, in the case of tool calling, for agentic applications. As you can see, Nova 2 Omni is very competitive on all of these numbers as well. Now let's look at some of the use cases, especially the multimodal use cases where this model improves significantly upon the Nova 1 family of foundation models while remaining competitive with other models out there.
Document and Audio Understanding: Real-World Applications
The main use case we have heard from our customers is document understanding. Pretty much every company has documents, and some of them are very complex in terms of layout: handwritten documents, or a mix of handwritten and typed text in different languages. Document understanding is a complex and challenging problem. This model provides much higher accuracy than the previous generation of our models, and we will show some measurements on public benchmarks.
Some of the key things we have tried to improve are OCR (optical character recognition) and key information extraction. If you have PDF reports or receipts, the model can accurately extract all that information. In some cases, it can also verify it with built-in tools and expose any inconsistencies in the content. This is an example of OCR where the image on the left is a combination of elements that represent a very non-traditional type of document. There is some text and some images in different areas, and this is one of the basic use cases every company has: how accurately can you extract the information and perform OCR to produce text. You can also tell the model to produce output in a structured format like JSON or XML, which you can then use further down the line for calling different tools. As you can see on the right-hand side, it extracted all the information in a structured way.
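As a rough illustration of that pattern, the sketch below sends a scanned page to the model through the Bedrock Converse API and asks for JSON output. The model ID is a placeholder, and the JSON fields are simply what the prompt requests, not a fixed output schema.

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Read the scanned document page as raw image bytes.
with open("receipt.png", "rb") as f:
    image_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract the vendor, date, line items, and total from this "
                     "receipt. Respond with JSON only, no extra commentary."},
        ],
    }],
)

# Parse the structured output so downstream tools can consume it.
fields = json.loads(response["output"]["message"]["content"][0]["text"])
print(fields)
```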
This is again a pretty complex layout of a document on the left-hand side. The model is able to accurately extract the information out of this. As I mentioned before, you can prompt the model to look for any inconsistencies in specific areas and do calculations. The model, using the built-in tools that it has such as the code interpreter which can do simple math calculations, can tell you whether the information presented in the document is correct and where the inconsistencies are.
Now let's look at audio understanding, which is a new feature of this model. We've had Amazon Transcribe as one of the solutions for automatic speech recognition (ASR) use cases. Using this model, you can transcribe speech, summarize what was said in the audio file, perform question-and-answer tasks, and call tools with it. As I mentioned before, it supports up to three-speaker diarization, which means it can identify which speaker is speaking in each part of the audio.
It is a very performant model. On the MMAU leaderboard, the Massive Multi-Task Audio Understanding and Reasoning leaderboard, we rank number two overall and are well ahead of the models provided by Google and OpenAI in this space. The leaderboard measures accuracy over speech, soundscapes (non-speech audio), and music, so the model can understand different types of audio inputs.
Let's see how it works. In this case, we have a short snippet from our Q3 earnings call that was done by our CEO Andy Jassy. We tested it out to see how the model performs. We provide this input and test the different types of tasks that we ask the model to perform. This is an internal tool where we upload the file and then start very simply with just a plain transcription. In this case, we look for whether the model missed any words or misquoted any words that were not spoken in the input audio. We did the analysis and testing ourselves, and the results were pretty accurate.
Then you can ask follow-up questions, like what the key takeaways are, and it provides a bulleted list of the main items in the audio file. You can follow up again, for example asking it to outline the key accomplishments Andy mentioned. So you can analyze any type of audio file and perform these kinds of tasks at much higher accuracy; this is essentially summarization as well. We're going to share the results of this later when we talk about the summary of multimodal perception.
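Here is a hedged sketch of what that kind of audio workflow could look like from code. It assumes audio can be passed as a content block alongside text in the Converse API, analogous to the existing image and video blocks; the exact block name, supported formats, and model ID may differ from what ships.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("earnings_call_snippet.mp3", "rb") as f:
    audio_bytes = f.read()

# ASSUMPTION: an "audio" content block analogous to the image/video blocks.
response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": [
            {"audio": {"format": "mp3", "source": {"bytes": audio_bytes}}},
            {"text": "Transcribe this audio with speaker labels (up to three "
                     "speakers), then list the key takeaways as bullet points."},
        ],
    }],
)

print(response["output"]["message"]["content"][0]["text"])
```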
Image and Video Understanding: Perception, Detection, and Temporal Reasoning
With that, I would like to hand it over to my co-presenter Ashwin to talk about image and video understanding. Like Rohit mentioned, I'll talk a little bit about some of the work that we did as part of the Nova 2 Omni model, focused on image and video understanding use cases, and also extend to how Omni can solve some of the image generation aspects of things. If you look at the broad categories of image and video understanding use cases, we can broadly look at it in three specific areas. One is perception and object detection. You have a scene, you want to understand what objects are there in the scene, and you want to create bounding boxes. The second aspect is visual question-answering. You want to ask questions about the image and figure out what is happening in it. The third part is temporal understanding. When you have videos, going from images to videos, you want to capture the temporal aspect of the scene and be able to identify and ask questions about a particular video, like when did a particular event happen in the video or when did something show up in the video, which you would generally have for any kind of video understanding tasks.
Let's walk through some examples. This is a complex scene with a lot of components in it. If I ask a simple question like detect the number of plants, cushions, tables, and TVs in the scene, as a human looking at the image, certain things pop out immediately: there is a huge potted plant on the right, right next to the right sofa, and if you look around, you can find multiple small plants throughout different areas of the scene. Let's look at what the model produced. The model not only detected the big potted plant at the back of the scene, but also all the small plants scattered around.
So the plants that are in the cupboard, the plants that you see in front of the center table, the one next to the TV—it also detected all the individual cushions, extracted bounding boxes for these cushions, as well as the TV which was provided in the prompt. The second thing to note is the accuracy of the bounding boxes in terms of the locations and how tight the bounding boxes are. These are things that we optimized as we built the Nova 2 Omni.
We want to make sure that we are producing very high quality outputs and at the same time producing very accurate bounding boxes that can help you unlock different kinds of applications, whether it be automation or real-time decision making, robotics, or different kinds of use cases. So we wanted to optimize the model to make sure it works really well for these use cases.
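Below is a small sketch of how you might request detections like these programmatically. The bounding-box schema is whatever the prompt asks for (here, pixel-coordinate corners), the model ID is a placeholder, and in practice you would validate the JSON before using it.

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("living_room.jpg", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Detect every plant, cushion, table, and TV in this image. "
    "Return only a JSON list of objects with fields 'label' and 'bbox', "
    "where bbox is [x1, y1, x2, y2] in pixel coordinates."
)

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": prompt},
        ],
    }],
)

detections = json.loads(response["output"]["message"]["content"][0]["text"])
for det in detections:
    print(det["label"], det["bbox"])
```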
Let's look at perception and visual question-answering. Looking at this scene, you want to be able to ask a bunch of questions about it. How many teams are on the screen? What is this event? What are they playing? Has the runner transferred the baton? If you pass this kind of image to the Nova 2 Omni model, it determines that this is a relay race with three teams: Great Britain, Switzerland, and Belgium. It can also detect the location of the batons, identify who has transferred the baton and who has not, and determine who has come first in the race.
This entire understanding of the scene becomes extremely powerful for supporting a wide range of applications. This is something we've been hearing from customers like you: how we can support these kinds of complex understanding use cases, which reduce the amount of manual effort and increase the overall power of AI for your workloads.
A third example is temporal and multimodal attention. If you want to reason across frames, you want to support long event detection and identify particular events in the scene. Let's look at an example. The prompt here is to locate all the segments in the video that match a particular description. The description is a man standing on a boat, and the model is asked to produce the results as structured output. As we walk through this, I'm going to pause so that you see the exact timestamps.
As you walk through this, you see different kinds of landscapes and scenes. Around the 12 to 14 second mark, you see the man standing next to the boat. As the video rolls on, the man is walking and no longer standing next to the boat. You still see the person, but since he is not standing next to the boat, that segment is not counted in the model outputs. The model can understand and reason about exactly the kind of prompt the customer is asking for and provide outputs that solve complex tasks.
In this case, a man standing next to a boat is not just about a person walking or a person standing; it is also about the relationship between objects and how those objects come together to answer the question the customer is asking. These are the kinds of complex use cases the Nova 2 Omni model can support.
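A minimal sketch of that kind of temporal grounding request is shown below, assuming the video is passed as a Converse content block and that the segment schema is dictated by the prompt; the model ID is again a placeholder.

```python
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("clip.mp4", "rb") as f:
    video_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": [
            {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
            {"text": 'Locate every segment where a man is standing on a boat. '
                     'Return only JSON: [{"start_s": <float>, "end_s": <float>}].'},
        ],
    }],
)

segments = json.loads(response["output"]["message"]["content"][0]["text"])
for seg in segments:
    print(f"{seg['start_s']:.1f}s - {seg['end_s']:.1f}s")
```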
In terms of performance, we have evaluated the model against a wide range of benchmarks. We have published a very detailed technical report on the Amazon Science website, and I'd encourage all of you to look at all the benchmarks in it. Here, I'm just pointing out a few of the benchmarks we looked at. On the video understanding side, we have benchmarked extensively across a wide range of use cases, including the Video-MME benchmark.
The Video-MME benchmark is interesting in that it covers general video understanding as well as cross-modal understanding, so you can see how the model performs when you input video along with speech and what improvement that combination brings. We see that with the combination of video, speech, and cross-modal understanding, our performance is much better than other models in the same category. On document understanding, we have OCRBench, an OCR-based benchmark. Again, the Nova 2 Omni model exceeds the state of the art and provides very good results across a wide range of use cases on this benchmark.
For speech understanding, on popular benchmarks like MMAU, we are number two on the leaderboard.
Across the board, you see very good performance in a wide range of tasks. As we built the Omni model, we learned from all the feedback you provided from our Nova 1 models. When you were trying video understanding use cases, you provided a lot of feedback about cases where the models worked and cases where they did not work. In collaboration with CMU, we created a new benchmark called Maverick, which has also been published. We evaluated our model based on real customer use cases and all the feedback you provided to us over the last year. In this Maverick benchmark, we are among the top performers, and it helps us evaluate how video along with speech or video along with audio can together improve and provide very high quality results for real-world customer use cases. I would encourage all of you to check it out as part of the work we have done on benchmarking.
Image Generation and Advanced Editing Operations with Nova 2 Omni
Now let's talk about image generation. One unique aspect of Omni is that it not only understands all the different kinds of content, like text, images, video, audio, and speech, but it can also generate images. This can help you power a lot of complex use cases where you probably used multiple models in the past. The Nova Omni model can generate high-quality, realistic images, with strong rendering of people and visual text and good spatial understanding. Because we have a strong understanding backbone that understands the content in images, we can now use that backbone to generate better quality images and perform better quality editing tasks.
To start with, these are some examples of images we created with the Nova 2 Omni model. This is the text-to-image part: given a text prompt, it generates different kinds of images, and you can see some of them on this slide. As Rohit mentioned earlier, we also optimized heavily for visual text rendering. We want to make sure the model does well at producing the right text within images, which is a really hard problem for most image generation models out there today.
In terms of evaluating our models, we use winning rate. The way we calculated winning rate is based on all the feedback we have gotten from you as customers over the last year on our Nova Canvas model. We curated a dataset of prompts and did blind A/B testing, comparing our model against competing models. The annotators did not know which image corresponded to Nova Omni and which to the competitor; they just looked at each pair of images and provided a win, tie, or loss rating. We aggregated all the results across all the annotations. The exact process and procedure for human evaluation is outlined in our technical report. We then calculated the winning rate, which is essentially the win rate plus half the tie rate. This gives you a quantification where anything above 50 means that our model is equivalent to or better than the competing model.
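To make the arithmetic explicit, here is the winning-rate calculation written out; the annotation counts are invented purely to show how the number is derived.

```python
def winning_rate(wins: int, ties: int, losses: int) -> float:
    """Winning rate as described: wins plus half the ties, over all comparisons."""
    total = wins + ties + losses
    return 100.0 * (wins + 0.5 * ties) / total

# Hypothetical annotation counts for one prompt set (not from the talk):
print(winning_rate(wins=430, ties=280, losses=290))  # 57.0 -> above the 50 parity line
```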
Looking at the figure, Nova Omni performs significantly better than Nova Canvas, representing a step-function improvement over the last-generation model we launched last year. It also significantly outperforms models like FLUX.1 Kontext Pro and FLUX.1 Kontext Max, and it is comparable to the GPT Image 1 and Gemini models. These are more detailed results broken down into specific categories. The specific areas we focused on improving are rendering of people and rendering of visual text and scenes; these are areas we improved as we went from Nova Canvas to Nova Omni for our image generation tasks.
Here is one complex example. The prompt is an image of Paris with a crowd of pedestrians all looking at the camera. You are trying to generate not only a group of people, which is really hard for image generation models because you are generating a lot of small faces throughout the entire scene, but the prompt also says the people should be looking at the camera. Here, the Nova Omni model is able to understand the user intent and produce images that represent exactly what the user is asking for in the prompt. If you look at the scene with a lot of people, you will see that most of them, around 90 percent, are actually looking at the camera, which most competing models cannot do today.
The advantage of having a joint understanding and generation model is that the understanding components of the model can learn from the generation side, and the generation components can learn from the understanding side. This helps us in tasks like editing. The Nova 2 Omni model can support nine different kinds of editing operations including adding new objects, altering objects, extracting information about a specific object, replacing, removing, doing background changes, style transfer, and more.
I'll go through some examples so you get a sense of the kinds of operations we support. All of this is done with just a text prompt: you provide the input image along with the text prompt, and the model understands the user intent and helps you with the different kinds of editing tasks. Here's an example of an add operation where you want to add a wooden bench facing the fence on the right side of the image. The model generated the bench and also understood the style and layout of the scene, so the bench matches the style of the image and flows well with the scene.
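Here is a hedged sketch of what a text-prompted edit like the bench example could look like in code. It assumes the edited image comes back as an image block in the Converse response; the actual request and response shapes for image output may differ, and the model ID is a placeholder.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("backyard.jpg", "rb") as f:
    image_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # placeholder model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": "Add a wooden bench on the right side of the image, "
                     "matching the existing style and lighting of the scene."},
        ],
    }],
)

# ASSUMPTION: generated/edited images are returned as image blocks.
for block in response["output"]["message"]["content"]:
    if "image" in block:
        with open("backyard_edited.jpg", "wb") as out:
            out.write(block["image"]["source"]["bytes"])
```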
Here's another example where you want to add a door number to the image on the left. The model understands the intent and places the numbers in the right location on the door. Here's an example of alter, where the customer asks to change the color of the bicycle to red. The model understands the location of the bicycle, segments out those areas, and converts only those locations to red. Here's another example of changing the frosting to a different color; again, the model is able to understand your intent and make the change.
This is an example of altering the scene, changing a scene from a snowy background to a more sandy kind of background. Here's an example of extracting information. Think of Amazon.com where you have images of people wearing dresses. You want to be able to extract the dress and show it and represent this information very accurately so that the user can make purchase decisions. That's an area where the model does well as well. Here's an example of replacing objects. The interesting part is that the user's prompt is specifically focused on keeping the hairstyle untouched. As you're changing the hat, you also want to make sure that you're not changing the hairstyle so it doesn't affect the overall representation of the scene. The model is able to understand that intent and generate images where it doesn't touch the hair but only touches the hat.
Here's an example of remove, where we're removing a polar bear from the scene. Here's another example of removing a cell phone from the user's hand. This is also an example of a complex editing operation with multiple instructions: not only removing the cell phone from the user's hand but also changing the color of the sweater from pink to blue. The model understands both aspects and generates an image that adheres to the prompt. Here are examples of background change, style transfer, and motion change. This is more about how to represent the scene, for example how the same person looks with a different expression, and creating those expressions so they look realistic without changing the person's appearance, attributes, or clothing. That becomes an important aspect as well.
These are hybrid changes where you're not only removing the hat but also changing the color of the dress the person is wearing. These are the kinds of complex scenes we've been evaluating and testing as we built the Nova 2 Omni model. Having the understanding capability also helps us improve the overall quality of the generation side. We evaluated our model against the image edit benchmark, and this graph shows the performance of the model. As I described earlier, on image editing tasks we do better than the FLUX models and are on par with the GPT and Gemini models as well.
Practical Use Cases and Customer Adoption
Let's walk through an example here. These are different kinds of editing operations you can do with the Nova models. Here we want to transform the space into a bright Scandinavian style with white colors, adding muted, minimalistic art.
So you can use it for rearranging and creating your own environments. If you're a property manager, you can extend it to similar scenarios and use cases for your workloads. And if you want to create ads, you have the object, you can create captions, you can place the object in different kinds of environments, and create different kinds of ad marketing assets for your workloads as well.
We're seeing a lot of customers use the models right now as part of our beta program. Monks is one such company. One of the advantages of the Omni model is that given the power and capabilities of understanding and generation, you don't need to have multiple models to solve the same task. The model itself can understand and represent the task and also help generate content, which becomes more powerful.
Companies are seeing that they no longer need a very complicated workflow; they can simplify a lot of their workflows with the power of the Omni model. With that, I'll hand over to Satoru Yamamoto, who will talk about how Dentsu is using some of the models for their use cases and workloads.
Dentsu Digital's Creative Solutions: Video Prediction and Generation with Nova Omni
Thank you so much for the introduction and thank you for the great opportunity. Let me introduce myself. I'm Satoru Yamamoto, Chief AI Officer at Dentsu Digital, and I will also be the Deputy Chief AI Officer of Dentsu Japan. I want to say thank you. I really appreciate all the support from AWS. I have had a lot of opportunities to present at many AWS events like re:Invent and Summit, and also Cannes Lions. I'm quite a big fan of AWS. Today, I really want to tell you how AWS AI is powerful, especially Amazon Nova Omni.
But before I get to the main topics, let me briefly introduce our company, Dentsu. Dentsu is Asia's largest advertising agency, and we operate globally in more than 100 countries with over 7,000 employees. I also have to say thank you, because in yesterday's keynote session the CEO mentioned Dentsu as a marketing giant. Well, we are not quite as tall as a giant; we are just number one in Asia. Thank you so much for that.
By the way, we have a solution named Mugen AI, and this is a kind of digital marketing solution. Here we're utilizing AWS AI services a lot. Today I want to introduce how we're utilizing AWS AI, especially Amazon Nova Omni, in this solution.
Let's go to our main topic and agenda. We have three topics: creative, planning operation, and next generation experience. This agenda follows the evolution of generative AI like multimodal AI, agent AI, and physical AI. Let's go to the first topic: creative.
We have a solution named Mugen AI, and it has three functions: creative generation, performance prediction, and improvement suggestion. Thankfully, it is already implemented in more than 200 companies, and we have achieved an average improvement of more than 150 percent. What I want to emphasize here is that we are using Amazon Nova to make this solution conversational, so that our employees use it as much as possible. But we still have a challenge: video creative. For example, we couldn't get accurate video performance prediction, but we can change that situation with Amazon Nova Omni.
Let me explain. The traditional approach is to break the original video into simpler, fragmented information like keyframes. We also need to convert keyframes to textual summaries and extract information from the sound. Everything is separated, which is really different from how humans watch a video.
As a result, accuracy is not very high. However, Amazon Nova 2 Omni is a genuine multimodal AI that can understand video itself just like we do. As a result, we can achieve quite high accuracy in prediction. Additionally, in traditional approaches, we need an enormous amount of data. However, because Nova 2 Omni has already learned vast knowledge in the background, we don't need such a large amount of data. We just need to prepare a small set of high-quality data.
Let me show you the demo here. What we do is upload the video like this, then select the advertisement campaign and prepare the training data for the prediction model. That model tells us what kind of video creative has higher performance. Finally, we get the predicted value for the uploaded video. What I want to emphasize here is that the correlation between the predicted value and the actual delivered performance is quite high, around 0.88, so we can conduct the prediction very accurately.
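As a quick illustration of the metric being reported, the snippet below computes a Pearson correlation between predicted and delivered scores; the numbers are invented for illustration and are not Dentsu's data.

```python
import numpy as np

# Hypothetical predicted vs. delivered performance scores for a set of
# video creatives (illustrative numbers only).
predicted = np.array([0.62, 0.71, 0.45, 0.83, 0.58, 0.90, 0.38, 0.66])
actual    = np.array([0.60, 0.75, 0.41, 0.80, 0.63, 0.88, 0.35, 0.70])

# Pearson correlation between predicted and delivered performance.
r = np.corrcoef(predicted, actual)[0, 1]
print(f"correlation: {r:.2f}")
```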
I forgot to mention that this example is from ANA, which is Japan's largest airline company. I'm using that airplane example, so thank you very much for providing it. Not only can we use Nova 2 Omni for prediction, but we can also use it for generation. Because Nova 2 Omni is a multimodal AI, we can prepare a model that understands the tendency of which videos have high performance, and then use this model to generate a storyboard. Since it's multimodal, it can also draw pictures. Let's check the demo.
We simply ask the same AI to make a storyboard. First, we get a textual storyboard. We can convert this storyboard to a visual storyboard, and based on that storyboard, we can quite easily generate a moving video like this. Here I want to emphasize two things about the image generation capability of Nova 2 Omni. One is that Omni follows instructions really well, so it's quite easy to edit images with natural language. The second is the quality itself. It's competitive with leading models, and I was really surprised that we can generate Japanese characters accurately and beautifully. That's almost a first for us. It's really high quality.
Not only can we generate storyboards and videos from scratch, but we can also reuse existing videos because Nova 2 Omni understands which parts are important. Let's check the demo again. Instead of me, please listen to this. Today we demonstrate how AI accelerates content creation. First, we upload a raw video file. Nova 2 Omni instantly analyzes the footage, generating detailed metadata of every scene. Next, it acts as a creative director, drafting a viral strategy with specific hooks and scripts. Finally, leveraging the AI's deep understanding of the interview and visuals, we efficiently craft the final video edit. From raw footage to viral ready, streamlined by intelligent analysis.
Planning, Operations, and Next-Generation Experiences: Dentsu's Multi-Agent AI Integration
Like this, using Nova 2 Omni, we can quite easily handle complex data like video. Let's go to the next topic: planning and operation. We have a solution named the Mugen AI Agent Canvas. It's a kind of citizen development solution for non-engineers to build agents quite easily. We've already succeeded in developing solutions for marketers like strategic planning, journey planning, and media planning. But we can do more when we utilize Nova 2 Omni for agents. I think this part is interesting.
What we did is try to make Omni mimic human behavior. The video on the left side shows a human actually operating the Amazon Ads UI.
We input this video to Nova Omni, and Nova Omni understood the workflow and wrote a prompt for Nova Act. Nova Act is an AI agent that can handle browser operations. As you can see, we can fully automate this process without any human involvement.
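For context, here is a minimal sketch of handing such a generated instruction to the Nova Act SDK research preview; the starting URL and the instruction text are hypothetical, and the SDK surface may change.

```python
# Minimal sketch using the Nova Act SDK research preview (pip install nova-act).
# The URL and instruction below are hypothetical examples.
from nova_act import NovaAct

# In the workflow above, this instruction would come from Nova 2 Omni after
# it watched the screen recording of the manual Amazon Ads operation.
generated_instruction = (
    "Open the campaign dashboard, filter campaigns by 'Sparkling Water', "
    "pause any campaign whose CTR is below 0.5%, and export the report."
)

with NovaAct(starting_page="https://advertising.example.com/dashboard") as agent:
    agent.act(generated_instruction)  # Nova Act drives the browser step by step
```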
The next two slides are not directly related to Omni, but they are a nice example, so let me introduce them. So far, I have introduced AI solutions for creative prediction and generation, as well as planning and operation. What we're going to do next is combine all of these solutions into one AI agent, utilizing the multi-agent collaboration functionality of Amazon Bedrock Agents. Let's check the demo. The demo is written in Japanese, but I'll explain in English. Let's make a new product: sparkling water. How about "shining sparkling water"? Then we talk to an AI persona and ask what it thinks. The persona says the taste itself is not so important; if the visual is nice, it's okay.
Then we summarize the interview, and based on it we consider the product name, key messages, and key visuals. We summarize these as the value proposition, and based on that proposition we prepare the customer journey; for the acquisition part, we generate digital banners. What I want to say is that we have already applied Amazon Nova widely across our solutions, and we can go further with Amazon Nova Omni, combining everything together with Bedrock Agents.
So far, what I've explained is an advanced way of utilizing AI for marketing agents. But let's talk about the final topic: the next generation experience. It is more conversational, interactive, real, and physical. The first example uses chat. With this solution, we can quite easily prepare a client chatbot utilizing client-specific data. Of course, it's based on AWS and utilizes Amazon Nova. This one is an example of golf course reservation. But the problem is that it's just text-to-text conversation, which is quite ordinary.
But utilizing Nova Omni, we can provide a much richer experience like this. He's my friend, and he's talking to the AI by voice. The AI says, "I've analyzed your video. Tap the mic to talk." The AI speaks in English while he responds in Japanese, and even so the conversation doesn't break down, which is very interesting. Here we can also upload a video: the AI understands the video and gives us advice on how to improve the golf swing, like this. It's a new generation of experience connected to the real world.
And there are more examples. This one is an in-store experience. The video on the left side is from the customer's point of view in the store. The customer is looking at perfumes, waits for a moment, and then talks to the shop clerk. What we're doing is extracting the persona of this customer. For example, from this video we might guess that this customer is a woman, perhaps a fashionable, style-conscious office worker. We can extract that persona.
We can also use this persona in the digital world, which I think is an interesting example. What we did is use this persona with Amazon Nova: based on the in-store behavior, we can predict how this customer will behave in the online shop, like this. I think this kind of technology, which connects the physical store and the online store, will be essential as VR and AR glasses become widespread.
We are almost at the end of the presentation. Today, I have presented several applications based on Amazon Nova 2 Omni. However, I know this is not a marketing session but a technology session, so I want to share some shocking news with you: I was informed only last Tuesday that I would be presenting today, so we had just seven days to prepare.
Despite this tight timeline, I introduced seven applications. That means one application per day, which is quite rapid development. The lady sitting here, Suzuki-san, developed this solution entirely by herself. Compared to before, the development conditions have changed significantly. Previously, it would take at least three months to develop one solution, but now it takes just one day.
Now we can focus on our own business rather than worrying about development. I believe this is the right way of working together with AI. This is my final comment. Thanks to AWS AI, we are quite free from concerns about development. As a marketer, we can now focus on inspiring people's hearts, creating new values, and changing the world. Thank you very much.
; This article is entirely auto-generated using Amazon Bedrock.