Kazuya

AWS re:Invent 2025 - [NEW LAUNCH] Amazon Nova 2 Omni: A new frontier in multimodal AI (AIM3324)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - [NEW LAUNCH] Amazon Nova 2 Omni: A new frontier in multimodal AI (AIM3324)

In this video, AWS introduces Amazon Nova 2 Omni, a unified multimodal AI model that can understand text, images, video, and audio while generating both text and images. The presenters demonstrate its superior performance in document understanding with complex OCR, audio transcription with multi-speaker diarization, video temporal reasoning, and high-quality image generation with accurate text rendering. Nova 2 Omni ranks second on the MMAU leaderboard and outperforms competitors in winning rate comparisons. Dentsu Digital's Chief AI Officer Yamamoto showcases real-world applications including video creative performance prediction achieving 0.88 correlation, automated storyboard generation with Japanese character rendering, and AI agent workflows using Nova Act for browser automation. He emphasizes the dramatic reduction in development time from three months to one day per application, enabled by Nova 2 Omni's true multimodal capabilities that eliminate the need for complex multi-model pipelines.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to Amazon Nova 2 Omni and the Nova Family of Models

Hello, everyone. Good afternoon, and thanks for coming to our session on Amazon Nova 2 Omni, which is a new frontier in multimodal AI. I'll be joined by my two co-presenters, Ashwin Swaminathan, who's the Director of Applied Science working on Amazon Nova models, as well as Yamamoto-san, who's the Chief AI Officer at Dentsu Digital.

Thumbnail 40

For the agenda today, we're going to look at the Amazon Nova family of models overall. Then we're going to touch upon what was launched by Matt Garman yesterday in his keynote, which is about the Amazon Nova 2 family. We're going to deep dive into Amazon Nova, especially the Omni model, which is optimized for multimodal workflows in terms of multimodal understanding and multimodal generation. We're also going to touch upon some examples and demos that will give you some idea of what you can create for your business, as well as touch upon the performance of these models as compared to the best models out there. And then we will hear from Yamamoto-san on how, using Nova Omni, Dentsu Digital is transforming their business and helping their customers.

Thumbnail 90

We launched the Amazon Nova family of foundation models last year at re:Invent, where I was here in this venue actually. We launched a suite of models. Our primary Nova models were launched in two categories. One we call Nova understanding models, which take in text, images, and videos, and we're able to derive metadata out of it, summarize that content, and do question and answer to generate text. So that's the understanding family of models available in Micro, Lite, and Pro. Since then, we have launched Premier, which is the largest model with the best performance in that category. We also launched an image generation model called Nova Canvas and a video generation model called Nova Reel. Since then, we have launched a speech-to-speech model, which is for real-time conversational AI applications such as customer support calls, and that is called Amazon Nova Sonic. Very recently, last month, we launched the industry's first natively multimodal embedding model, which is great for semantic search and agentic RAG applications, especially if you have a lot of unstructured data around documents, images, videos, audio, and text. We are the pioneers in that space in terms of providing the best performance at much lower latency and cost characteristics and being the first model in the industry to do everything together.

Thumbnail 190

Thumbnail 200

Thumbnail 210

Unveiling the Amazon Nova 2 Family: Four New Models with Advanced Capabilities

Nova is already used by tens of thousands of enterprise and startup customers, and some of those names are listed here. Now, to touch upon what was launched yesterday: we are introducing the four models in the Amazon Nova 2 family, starting with our Nova 2 Lite, which is our fast, cost-effective reasoning model for everyday workloads. It's our first reasoning model, and it's a hybrid reasoning model, meaning that developers control whether they want reasoning to be enabled and also what level of reasoning they want for a particular task. So if your task involves document understanding with some reasoning about whether the extracted values are correct, then you might want to use some level of thinking or reasoning with it. For everyday tasks, you can just disable that reasoning, and the model will provide you results as is, using fewer tokens and producing faster results.
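As a rough illustration of how that developer control might look in practice, here is a minimal sketch using the Bedrock Converse API in Python. The Nova 2 Lite model ID and the reasoning configuration field are assumptions for illustration only; check the Nova 2 documentation for the exact identifiers and request fields.

```python
# Minimal sketch of toggling reasoning via the Bedrock Converse API.
# The model ID and the "reasoningConfig" field below are assumptions for
# illustration; consult the Nova 2 documentation for the exact names.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-2-lite-v1:0",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Do the extracted totals on this invoice add up? Explain briefly."}],
    }],
    additionalModelRequestFields={
        # hypothetical field names: enable reasoning and cap its token budget
        "reasoningConfig": {"type": "enabled", "budgetTokens": 2000}
    },
)

print(response["output"]["message"]["content"][0]["text"])
```

For everyday tasks you would simply omit the reasoning configuration, which is the "disable it and get faster, cheaper results" path described above.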

Thumbnail 260

Then we have also launched in preview our Nova 2 Pro, which is similar to the Nova 1 family. It's a higher tier that provides higher performance at different latency and cost characteristics, and it's our most intelligent reasoning model for highly complex tasks. So think of coding tasks, complex agents, or multi-agent scenarios. You would want to use Nova 2 Pro there.

Thumbnail 290

Then the third model that we also launched in preview is our Nova 2 Omni that we're going to talk about today, and it's our unified model for multimodal reasoning and for image generation.

We believe that models in the future will be fully multimodal, and that's how humans interact as well, right? Through spoken words, visual information, as well as written text. All the providers will be moving in that direction. This is our first model that can understand any modality, including audio and speech, making it the first of the models available on Bedrock to do so, while also generating high quality images in the same model. It's also the industry's first reasoning model that can reason across any modality and generate images in a single model.

Thumbnail 350

Thumbnail 370

We have a second generation of our Amazon Nova Sonic, which is the Nova 2 Sonic. It improves the performance as compared to the first generation, supports more languages, and the conversations feel more natural as compared to the first generation. In this presentation, we're going to deep dive more on the Omni model. This is just a quick overview of what are the capabilities of all of these four models.

Thumbnail 380

All these four models support a large context window of up to 1 million input tokens. They support more than 200 languages for the input text. Pro and Omni also understand audio, so they can understand up to 10 languages for audio or speech as well. The other thing about Sonic is it supports more voices now naturally, so for multiple languages you can choose between different male or female voices for your workflows.

Thumbnail 420

Thumbnail 430

Amazon Nova 2 Omni: A Unified Multimodal Model with Hybrid Reasoning

Omni, as I said before, just to recap, can take any modality as input to generate text and images. It feels more natural, closer to the way humans interact, as I mentioned. Because the model maps all the content into the same space before generating text or images, it provides higher quality outputs than a combination of multiple bespoke, task-specific models. It also means that you don't have to worry about creating complex pipelines that have multiple models and multiple systems in place; you can unify that with a single model, reducing build costs and maintenance costs and enabling faster time to market.

Thumbnail 470

These are the main properties of what Amazon Nova 2 Omni provides. As I mentioned before, this is a hybrid reasoning model where the developers can control the level of reasoning or whether they want reasoning even enabled or not. This model is great at instruction following, tool calling, and even for simple text-based natural language tasks such as NLP tasks, sentiment analysis, classification, and those kinds of things. It is state of the art in multimodal perception, so this model is highly optimized for any multimodal tasks, which means input document understanding, input image understanding, video understanding, and audio understanding.

Cross-modal reasoning is a new capability where this model really excels as compared to the best models out there in the market right now, and we're going to show some of those numbers, which we have already published in our technical report as well. As for audio understanding, as I mentioned before, Nova 2 Pro and Omni are the first models on Bedrock that support audio understanding, which means that you can do speech transcription. You can do multi-speaker diarization, which is understanding which speaker is speaking at what point. It supports multiple languages and supports different types of tasks that we're going to dive deeper into later.

This model supports much higher quality image generation and natural language-based image editing. Now backed by a text LLM, it can render much higher quality text within the images, especially longer text, which has always been a challenge, one we faced even with Canvas. With this model we have tried to fix all the remaining gaps that we had in the previous generation of image generation models, and it should be very compelling for image generation.

Thumbnail 610

Thumbnail 680

Now to look at the performance, this chart is based on Artificial Analysis, so it's not done by us but by a third party. It shows their overall Artificial Analysis Index, which measures how top of the line models perform across 10+ benchmarks. So 58 is a consolidated number across a multitude of different benchmarks that are meaningful in different categories such as instruction following, tool calling, agentic coding, and so on. As you can see, in its tier, which competes with Gemini 2.5 Flash and GPT-5 Mini type models, this model is very competitive and stands very high on the leaderboard. The ones on the left are the higher tier, such as Gemini 2.5 Pro, which is similar to the Nova 2 Pro category, so that's why we're not comparing there. These are some of the numbers just to give an idea of language understanding: knowledge, reasoning, instruction following, and tool calling, which are critical for testing how good a model is, especially tool calling for agentic applications. These are the things that matter, and as you can see here, Nova 2 Omni is very competitive on all of these numbers as well.

Thumbnail 710

Document and Audio Understanding: Transforming Complex Data into Actionable Insights

Now, let's look at some of the use cases, especially the multimodal use cases where this model improves a lot upon the Nova 1 family of foundation models while also being pretty competitive compared to the other models out there. The main use case that we have heard from our customers is document understanding. Pretty much every company has documents, and some of those documents are very complex in terms of layout. There are handwritten documents and mixes of handwritten plus typed text in different languages, so document understanding is a very complex and challenging problem. What this model does is provide much higher accuracy compared to the previous generation of our models, and we're going to show some numbers measured on public benchmarks to show where we stand as well.

Thumbnail 800

Some of the key things that we have tried to fix with this are the OCR, the character recognition, as well as the key information extraction. So if you have some PDF reports or some receipts, how accurately can the model extract all that information? In some cases, it can also do verification with the built-in tools and expose any inconsistencies in the content that is listed there. This is an example of OCR where the image on the left is about as non-traditional a type of document as you can think of. There is some text, there are some images in different areas, and this is one of the basic use cases that every company has: how accurately can you extract the information here, do the OCR, and produce the text. You can also tell the model to produce output in a structured manner like JSON, XML, and so on, which you can then use further down the line for calling different tools. As you can see on the right-hand side, it extracted all the information in a structured way. It's not displayed as JSON here, but that's what the model can accurately do.
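To make the structured-extraction flow concrete, here is a minimal sketch using the Bedrock Converse API with a document content block. The Nova 2 Omni model ID is a placeholder assumption, and the fields in the returned JSON depend entirely on what the prompt asks for.

```python
# A minimal sketch of structured extraction from a document with the Bedrock
# Converse API. The Nova 2 Omni model ID is a placeholder assumption.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("receipt.pdf", "rb") as f:
    pdf_bytes = f.read()

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"document": {"format": "pdf", "name": "receipt", "source": {"bytes": pdf_bytes}}},
            {"text": "Extract the vendor, date, line items, and total. Respond with JSON only."},
        ],
    }],
)

# Parse the text response; downstream tools can then consume the structured fields.
text = response["output"]["message"]["content"][0]["text"]
print(json.loads(text))
```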

Thumbnail 860

Thumbnail 900

This is again a pretty complex layout of a document on the left-hand side, and the model is able to pretty accurately extract the information out of this. As I was mentioning before, you can prompt the model to look for any inconsistencies in specific areas and do the calculation. So the model using the built-in tools that it has, say the code interpreter, which can do simple math calculations and so on, can tell you whether the information presented in the document is correct or not and where are the inconsistencies.

Thumbnail 910

Now let's look at the audio understanding, which is a new feature of this model. We've had AWS Transcribe as one of the solutions for ASR or speech recognition types of use cases. Using this model, you can transcribe speech, summarize what was said in the audio file, do question and answer, and combine all of that with tool calling. As I mentioned before, it supports up to three-speaker diarization, which means it can segregate which speaker is speaking in which part of the audio. It is a very performant model.

On the MMAU leaderboard, which is the Massive Multitask Audio Understanding and Reasoning leaderboard, we are overall number two and well ahead of the models provided by Google and OpenAI in this space. It measures accuracy over speech, soundscapes (which are non-speech audio), and music. So the model can understand different types of audio inputs.

Thumbnail 990

Thumbnail 1000

Thumbnail 1020

Let's see how it works. In this case, we have a short snippet from our Q3 earnings call that was done by our CEO Andy Jassy. We just tested it out to see how the model performs when we provide this input and the different types of tasks that we ask of the model. In this case, this is just an internal tool where we upload the file and start very simply with a plain transcription. There, we look for: did the model miss any words? Did the model output any words which were not spoken in the input audio? It's hard to read here, but we did the analysis and the testing ourselves, and the results were pretty accurate.

Thumbnail 1040

Thumbnail 1060

Then you can ask follow-up questions like what are the key takeaways, so it can provide a bulleted list of main items that were part of this audio file as well. And then you can ask again follow-up questions like outline the key accomplishments that were mentioned by Andy. So you can basically analyze any type of audio file to do these kinds of tasks at a much higher accuracy. And this is just a summarization as well. So we're going to share the results of this later when we talk about the summary of the multimodal perception.

Image and Video Understanding: Perception, Detection, and Temporal Reasoning

With that, I would like to hand it over to my co-presenter Ashwin to talk about the image and video understanding. Thanks, Rohit. Like Rohit mentioned, I'll talk a little bit about some of the work that we did as part of the Nova 2 Omni model, focused on image and video understanding use cases, and also extend into how Omni can handle the image generation side of things.

Thumbnail 1120

So if you look at the broad categories of image and video understanding use cases, we can broadly look at it in three specific areas. One is perception and object detection. So you have a scene, you want to understand what objects are there in the scene, and you want to create bounding boxes. The second aspect is question and answer. You want to ask questions to the image and figure out what is happening in the image. And the third part of it is temporal understanding. When you have videos going from images to videos, you want to capture the temporal aspect of the scene and be able to identify and ask questions about a particular video, like when did a particular event happen in the video, or when did something show up in the video, things like that, which you would generally have for any kind of video understanding tasks.

Thumbnail 1170

So let's walk through some examples. This is an example of a scene. It's a complex scene with a lot of components in it. If I ask a simple question like detect the number of plants, cushions, tables, and TVs in the scene, as a human looking through the image, there are things that pop up immediately. There is this huge potted plant on the right, right next to the right sofa. And then if you look around, you can find multiple small plants all through the scene.

Thumbnail 1200

Let's look at what the model produced. The model not only detected the big potted plant at the back of the scene, but also all the small plants throughout the space.

This includes the plants in the cupboard, the plants on the center table, and the one next to the TV. It also detected all the individual cushions and extracted bounding boxes for these cushions, as well as for the TV that was provided in the prompt.

The second thing to note is the accuracy of the bounding boxes in terms of their locations and how tightly they fit around the objects. These are aspects that we optimized as we built Amazon Nova 2 Omni. We wanted to ensure that we are producing very high-quality outputs while simultaneously producing very accurate bounding boxes that can help you unlock different kinds of applications, whether it be automation, real-time decision making, robotics, or various other use cases. We wanted to optimize the model to make sure it works really well for these scenarios.
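Here is a sketch of how you might request bounding boxes and overlay them on the image, assuming the boxes come back in the JSON schema you prompt for. The model ID is a placeholder, and the pixel-coordinate convention is an assumption rather than a documented contract.

```python
# A sketch of prompting for bounding boxes and drawing them with Pillow.
# Model ID and the pixel-coordinate JSON schema are illustrative assumptions.
import json
import boto3
from PIL import Image, ImageDraw

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("living_room.png", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Detect every plant, cushion, table, and TV. Respond with JSON only: "
    '[{"label": str, "box": [x1, y1, x2, y2]}] using pixel coordinates.'
)

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": prompt},
        ],
    }],
)

detections = json.loads(response["output"]["message"]["content"][0]["text"])

# Draw each detected box and label onto a copy of the input image.
img = Image.open("living_room.png")
draw = ImageDraw.Draw(img)
for det in detections:
    draw.rectangle(det["box"], outline="red", width=3)
    draw.text((det["box"][0], det["box"][1]), det["label"], fill="red")
img.save("detections.png")
```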

Thumbnail 1260

Let's look at perception and question answering. If you look at the scene, you can ask a bunch of questions about it. How many teams are on the screen? What is this event? What are they playing here? Has the runner transferred the baton? When you pass this kind of image to the Nova Omni model, it determines that this is a relay race with three teams: Great Britain, Switzerland, and Belgium. It can also detect the location of the batons and identify who has transferred the baton and who has not, as well as who has come first in the race.

Thumbnail 1320

This entire understanding of the scene becomes extremely powerful for supporting a wide range of different kinds of applications. This is something that we've been hearing from customers like yourself about how we can support these kinds of complex understanding use cases, which will enable automation, reduce manual effort, and increase the overall power of AI for your workloads.

Thumbnail 1340

A third example is on temporal and multimodal attention, where you want to reason across frames and support long event detection to identify particular events in the scene. Let's look through an example. The task here is to locate all the segments in the video that match a particular prompt. The prompt is "the man standing on a boat," and we're asking the model to produce the results as structured output. As we walk through this, I'm just going to pause so that you see the exact timestamps.

Thumbnail 1360

Thumbnail 1370

Thumbnail 1380

Thumbnail 1390

As you walk through this, you see different kinds of landscapes and different kinds of scenes, and around the 12 to 14 second mark, you see the man standing on the boat. As you roll the video, you see the man is now walking but is no longer standing next to the boat, so it's not counted as part of the model outputs even though you still see the person. The model can understand and reason about exactly the kind of prompts the customer is asking for and provide outputs in a way that can solve complex tasks.

Thumbnail 1400

Thumbnail 1410

In this case, "a man standing next to a boat" is not just about a person walking or a person standing, but also about the relationship between objects and how these objects come together toward answering the question that the customer is asking. These are the kinds of complex use cases that the Amazon Nova 2 Omni model can support.
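Here is a hedged sketch of the same temporal-grounding idea through the Bedrock Converse API, using a video content block. The model ID is a placeholder, and the segment schema is simply whatever the prompt asks for; large videos would typically be referenced from S3 rather than sent inline.

```python
# A sketch of temporal grounding: locate segments matching a description and
# return them as structured output. Model ID is a placeholder assumption.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("boat_clip.mp4", "rb") as f:
    video_bytes = f.read()  # for large files, pass an S3 location instead of bytes

response = client.converse(
    modelId="us.amazon.nova-2-omni-v1:0",  # hypothetical model ID
    messages=[{
        "role": "user",
        "content": [
            {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
            {"text": 'Locate all segments showing "the man standing on a boat". '
                     'Return JSON only: [{"start_sec": float, "end_sec": float}].'},
        ],
    }],
)

segments = json.loads(response["output"]["message"]["content"][0]["text"])
for seg in segments:
    print(f"match from {seg['start_sec']}s to {seg['end_sec']}s")
```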

Thumbnail 1420

In terms of performance, as Rohit mentioned, we have evaluated the model against a wide range of benchmarks. We have published a very detailed technical report on the Amazon Science website, and I'd encourage all of you to go in and look at all the benchmarks in that technical report. Here, I'm just pointing out a few of the benchmarks that we looked at.

On the video understanding side, we have benchmarked extensively across a wide range of use cases, including the Video MME benchmark. The Video MME benchmark is interesting in the sense that it has general video understanding use cases, but it also has cross-modal understanding. You can look at how the video performs when you input video along with speech and what the performance improvement is because of that. We see that with the combination of video and speech and cross-modal understanding, our performance is much better than other models that are out in the similar category.

On document understanding, we have OCR Bench, which is an OCR-based benchmark. Again, the Nova Omni model exceeds state of the art and provides very good results in the wide range of use cases for this particular benchmark. For speech understanding, as Rohit mentioned, on popular benchmarks like MMAU, we are number two on the leaderboard.

So across the board, you see very good performance in a wide range of tasks. One other thing that we did as we built the Omni model is we learned from all the feedback that you had from our Nova 1 models. As you were trying video understanding use cases, you provided a lot of feedback in terms of what are the cases where the models worked and the cases where the models did not work.

Thumbnail 1570

In collaboration with CMU, we created a new benchmark called Mavericks, and this benchmark has also been published. We also evaluated our model based on real customer use cases, based on all the feedback that you have provided to us in the last year. In this Mavericks benchmark, again, we are among the top in terms of performance, and it also helps us evaluate how video along with speech or video along with audio can together improve and provide very high quality results for real-world customer use cases. That's something that I would also encourage all of you to check out as part of the work that we've done on benchmarking.

Thumbnail 1620

High-Quality Image Generation and Advanced Editing Capabilities with Nova 2 Omni

So now, let's talk a little bit about image generation. One unique aspect of Omni is that it not only understands all the different kinds of content like text, images, video, audio, and speech, but it can also generate images. This can help you power a lot of complex use cases where you probably used multiple models in the past. The Nova Omni model can generate high quality, realistic images, with strong rendering of people, accurate visual text, and good spatial understanding. Because we have a good understanding backbone which understands the content and images, we can now use that backbone to also generate better quality images and do better quality editing tasks, and we'll see that in some of these examples.

To start with, these are some examples of images that we created with the Nova 2 Omni model. This is the text to image part of it. So given a text prompt, it generates different kinds of images, and you see some of these images in this slide here. As Rohit was mentioning earlier, we also optimized a lot on visual text rendering. We want to make sure that the models do well in producing right text outputs, which is a really hard problem for most of the image generation models that are out there today.

Thumbnail 1660

In terms of how we evaluate our models, we look at winning rate. The way we calculated winning rate is, again, based on all the feedback that we've gotten from you as customers over the last year on our Nova Canvas model, we curated a dataset of prompts, and we did blind AB testing comparing our model against other competition models that are out there. This is a blind AB testing. The annotators did not know which image corresponded to Nova Omni and which image corresponded to the competitor, and they just looked at each of the images and provided a win, tie, or a loss rate. Then we aggregated all the results across all the annotations. The exact process and the procedure for human evaluation is outlined in our technical report as well.

Then we calculated the winning rate. The winning rate is essentially the win rate plus half of the tie rate. So we are looking at how well our model performs, and it gives you a quantification where anything above 50 means that our model is equivalent to or better than the other competition models. Looking at the figure, as you can see, Nova Omni performs significantly better than Nova Canvas. So it's a step function improvement in performance compared to our last generation model that we launched last year. It also significantly outperforms models like Flux Kontext Max or Flux Kontext Pro, and it's comparable to GPT-image 1 and Gemini class models.
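The winning-rate arithmetic described above is simple enough to state directly; this tiny helper mirrors the "wins plus half of the ties" definition, with made-up counts in the example.

```python
# Winning rate as described above: wins plus half of the ties, as a
# percentage of all comparisons. Anything above 50 means the model is at
# least on par with the competitor in the blind A/B test.
def winning_rate(wins: int, ties: int, losses: int) -> float:
    total = wins + ties + losses
    return 100.0 * (wins + 0.5 * ties) / total

# Example with made-up counts: 120 wins, 40 ties, 40 losses -> 70.0
print(winning_rate(120, 40, 40))
```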

Thumbnail 1740

Thumbnail 1760

These are more detailed results looking into specific categories. Specific areas that we focused on are improving rendering of people and rendering of visual text and scenes. So this is an area that we also improved as we went from Nova Canvas to Nova Omni as part of our image generation tasks. Here's one complex example. The prompt here is an image of Paris with a crowd of pedestrians all looking at the camera. Here you're trying to generate not only a group of people, which makes it really hard for image generation models because now you're generating a lot of small faces all through the entire scene, but further the prompt also says that they want the people to be looking at the camera.

Thumbnail 1800

So here again, the Nova Omni model is able to understand the user intent and produce images which are representative of exactly what the user is asking for in the prompt. If you look at the scene with a lot of people, you'll see most of them, around 90% of them, are actually looking at the camera, which most of the other competition models also cannot do today.

As I was mentioning, the advantage of having a joint understanding and generation model is that the understanding components of the model can learn from the generation side, and the generation components of the model can learn from the understanding side. This helps us in tasks like editing. The Nova 2 Omni model can support nine different kinds of editing operations, including adding new objects, altering objects, extracting information about a specific object, replacing, removing, doing background changes, style transfer, and so on. Now I'll go through some examples so that you get a sense of the kind of operations we support.

Thumbnail 1840

All of this is just with a text prompt. You can simply provide the input image along with the text prompt, and the model understands the user intent and can help you with the different kinds of editing tasks. Here's an example of an add operation where you want to add a wooden bench facing on the right side of the image. The model generated the bench, but it also understood the style and the layout of the scene. As it's creating the bench, it created the bench with the exact style of the image so that it's representative and flows well with the scene and the scene layout.

Thumbnail 1880

Thumbnail 1900

Here's another example where you want to add a door number to the image on the left. The model understands the intent and also creates the image with the numbers in the right location of the door. Here's an example of alter. In this case, the prompt asks to change the color of the bicycle to red. The model understands the location of the bicycle, is able to segment out those areas, and then correspondingly converts those locations only to red color.

Thumbnail 1920

Thumbnail 1930

Here's another example of changing the frosting color to a different color. Again, the model is able to understand your intent and make the changes. This is an example of altering the scenes, changing a scene from a snowy background to a more sandy kind of background. For extracting information, think of Amazon.com where you have images of people wearing dresses. You want to be able to extract the dress and show the dress and represent this information very accurately so that the user can make purchase decisions.

Thumbnail 1950

That's an area that the model does well as well. For replacing objects, the interesting part here is that the user's prompt is specifically focused on keeping the hairstyle untouched. As you're changing the hat, you also want to make sure that you're not changing the hairstyle so it doesn't affect the overall representation of the scene. The model is able to understand that intent and also generate images where it doesn't touch the hair but only touches the hat, so it generates those images and just updates the hat.

Thumbnail 1990

Here's an example of remove where we're removing a polar bear from the scene and removing the cell phone from the user's hand. This is also an example of complex editing operations where it has multiple prompts, not only removing the cell phone from the user's hand but also changing the color of the sweater from pink to blue. It understands both aspects and generates an image that adheres to the prompt that's been described.

Thumbnail 2020

Here are examples for background change, style transfer, and motion change. This is more about how to represent the scene, such as how the same person looks with a different kind of expression, and creating those expressions so that they look realistic while not changing the appearance of the person or changing the attributes of the person or the dress of the person. That becomes an important aspect as well.

Thumbnail 2040

These are hybrid changes where you're not only removing the hat but also changing the color of the dress that the person is wearing. These are the kinds of complex scenes that we've been evaluating and testing as we were building the Nova 2 Omni model. Having the capability of understanding tasks helps us also improve the overall quality of the generation aspect of things.

Thumbnail 2070

We evaluated our model against the image edit benchmark, and this graph shows the performance of the model. As I was describing earlier, on image editing tasks, we do better than Flux models and are on par with the GPT 5, GPT-4.1, and Gemini models as well. Let's walk through an example here. We have different kinds of editing operations that you can do with the Nova models.

Thumbnail 2100

We want to transform the bright Scandinavian style to white colors, adding a muted, minimalistic art style.

Thumbnail 2110

Thumbnail 2120

Thumbnail 2130

Thumbnail 2140

Thumbnail 2150

Thumbnail 2160

So you can use it for rearranging and creating your own environments. If you're a property manager, you can extend it to similar kinds of scenarios and similar kinds of use cases for your workloads. And if you want to create ads, you have the object, you can create captions, you can place the object in different kinds of environments, and create different kinds of ad marketing assets for your workloads as well.

So we're seeing a lot of customers use the models right now as part of our beta program. Monks is one such company. One of the advantages of the Omni model is that, given the power and the capabilities of understanding and generation, you don't need multiple models to solve the same task. The model itself can understand and represent the task and also help generate content, which becomes more powerful. So companies are finding that they no longer need a very complicated workflow; they can simplify a lot of their workflows through the power of the Omni model.

Thumbnail 2220

Dentsu Digital's AI Journey: Leveraging AWS for Marketing Innovation

With that, I'll hand over to Yamamoto-san, who will also talk a little about how Dentsu is using some of the models for their use cases and workloads. Hey, thank you. Hello, hello, thank you so much. Thank you for the introduction and thank you for the great opportunity. Let me introduce myself. I'm Satoru Yamamoto, Chief AI Officer at Dentsu Digital, and I will also be Deputy Chief AI Officer of Dentsu Japan.

Thumbnail 2230

Thumbnail 2260

I want to say thank you. I really appreciate all the support from AWS. I have had a lot of opportunities to present at events related to AWS like re:Invent, Summit, and also Cannes Lions, so I'm quite a big fan of AWS. Today, I really want to tell you how powerful AWS AI is, especially Amazon Nova Omni. But before I go to the main topics, let me briefly introduce our company, Dentsu.

Dentsu is Asia's largest advertising agency, and we operate our business globally in more than 100 countries with over 70,000 employees. Oh, I have to say one thing. I need to say thank you, because it was good news for me: in yesterday's keynote session, CEO Matt Garman mentioned Dentsu as a marketing giant. No, no, no, no, we are not as tall as a giant; we are just number one in Asia. Thank you so much, Matt Garman.

Thumbnail 2300

Thumbnail 2320

By the way, we have a solution named Mugen AI, which is a kind of digital marketing solution, and in it we're utilizing AWS AI services a lot. So today I want to introduce how we're utilizing AWS AI, especially Nova Omni, in this solution. OK, let's go to our main agenda. We have three topics: creative, planning and operation, and next generation experience. This agenda follows the evolution of generative AI: multimodal AI, agentic AI, and physical AI. Let's go to the first topic, creation.

Thumbnail 2340

Mugen AI Solutions: From Creative Prediction to Multi-Agent Collaboration

For creation, we have a solution named Me AI A, and it has three functions: creative generation, performance prediction, and improvement suggestion. Thankfully, it's already implemented in more than 200 companies, and we succeeded in achieving an average improvement of more than 150%. What I want to emphasize here is that we are utilizing Amazon Nova to make this solution conversational, so that our employees use it as much as possible. But we still have a challenge: video creative.

For example, we couldn't get accurate video performance prediction, but we can change that situation with Amazon Nova Omni. OK, let me explain. In the traditional approach, we break the original video into simpler, fragmented information like keyframes, convert the keyframes into textual summaries, and extract the information from the sound separately, like this. Everything is separated, which is really different from how humans watch a video.

As a result, accuracy is not so high. On the other hand, Amazon Nova 2 Omni is a genuinely multimodal AI, so it can understand the video itself, just as we do. As a result, we can get quite high prediction accuracy. Also, the traditional approach needs an enormous amount of data, but because Nova Omni has already learned vast knowledge in the background, we don't need that much data. We just prepare a small set of high quality data.

Thumbnail 2450

Thumbnail 2460

Thumbnail 2470

Right, let me show you the demo here. What we do is just upload the video like this, then select the advertisement campaign and prepare the training data for the prediction model. That model will tell us what kind of video creative has higher performance, like this. And then finally, we can get the predicted value of the uploaded video. What I want to emphasize here is that the correlation between the predicted value and the actual value from ad delivery is quite high, around 0.88, so we can conduct the prediction really accurately. I forgot to say, by the way, this example is from ANA, Japan's largest airline. I came here on one of their airplanes, so thank you so much for providing the example.
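For reference, a correlation figure like the 0.88 quoted above can be computed as a Pearson correlation between predicted and actual values; the numbers below are invented purely to show the calculation.

```python
# How a correlation like the 0.88 mentioned above is computed between
# predicted creative scores and actual delivery results. The values here
# are made up for illustration only.
import numpy as np

predicted = np.array([0.62, 0.71, 0.55, 0.80, 0.47, 0.90])
actual    = np.array([0.58, 0.75, 0.50, 0.83, 0.52, 0.88])

r = np.corrcoef(predicted, actual)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```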

Thumbnail 2510

Thumbnail 2550

By the way, we're not just utilizing Nova Omni for prediction; we can also use it for generation, because Nova Omni is truly multimodal. So we prepare a brain which understands the tendency of which videos have high performance, and then use this brain to generate a storyboard. Because it's multimodal, it can also draw pictures. So let's check the demo, like this. We just ask the same AI: okay, now let's make a storyboard. First, we get a textual storyboard. We can convert this storyboard into a visual storyboard, and based on that storyboard, we can quite easily generate the moving video, like this.

Thumbnail 2560

And here I want to emphasize two things related to Omni's image generation capability. One is that Omni understands instructions really well, so it's quite easy to handle and edit images with natural language. The second is the quality itself; it's competitive with the leading models. I was also really surprised that we can generate Japanese characters accurately and beautifully. It's almost the first time for us; it's really high quality.

Thumbnail 2600

Thumbnail 2610

Thumbnail 2620

Thumbnail 2630

Okay, not just generating the storyboard and video from scratch; we can also reuse existing video, because Omni understands which parts are important. So let's check the demo again. I'm a bit tired, so instead of me, please listen to this: "Today we demonstrate how AI accelerates content creation. First, we upload a raw video file. Nova Omni instantly analyzes the footage, generating detailed metadata of every scene. Next, it acts as a creative director, drafting a viral strategy with specific hooks and scripts. Finally, leveraging the AI's deep understanding of the interview and visuals, we efficiently craft the final video edit. From raw footage to viral ready, streamlined by intelligent analysis."

Thumbnail 2650

Thumbnail 2660

So like this, using Nova Omni, we can quite easily handle complex data like video. Okay, let's go to the next topic: planning and operation. We have a solution named Mugen AI Agent Canvas. It's a kind of citizen development solution for non-engineers to build up their agents quite easily. Marketers have already succeeded in developing solutions such as strategic planning, journey planning, and media planning agents, but we can go further when we utilize Nova Omni for agents. I think this part is interesting.

Okay, what we did is try to make Omni mimic a human. The video on the left side shows a human actually operating the Amazon Ads automation UI.

Thumbnail 2710

Thumbnail 2720

This is Amazon Ads operation automation using Xnurta, an automated UI. We input this video to Nova Omni, and Nova Omni understood the workflow and wrote the prompt for Nova Act. Nova Act is the AI which can handle browser operations. So as you can see, we can fully automate this process without any kind of human touch.
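For context on the Nova Act side, the research-preview SDK drives a browser from natural-language steps roughly like the sketch below. The workflow prompt here is invented for illustration; in the demo described above it would be authored by Nova 2 Omni from the recorded operation video.

```python
# A rough sketch of driving a browser workflow with the Nova Act SDK
# (research preview). The prompt text below is illustrative only; in the
# demo it would be generated by Nova 2 Omni from the screen recording.
from nova_act import NovaAct

generated_prompt = (
    "Open the campaign dashboard, filter campaigns by 'Sparkling Water', "
    "pause any campaign whose ACOS is above 40%, and export the report."
)

# NovaAct launches and controls a browser session; act() executes the steps.
with NovaAct(starting_page="https://advertising.amazon.com/") as nova:
    nova.act(generated_prompt)
```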

Thumbnail 2730

These next two pages are not directly related to Omni, but they're quite a nice example, so let me introduce them. So far, I have introduced AI solutions such as creative prediction and generation, and planning and operation. What we're going to do next is combine all of these solutions together into one AI agent, utilizing the Bedrock Agents multi-agent collaboration functionality. Okay, let's check the demo.

Thumbnail 2760

Thumbnail 2770

Thumbnail 2780

What we're going to do is, sorry, it is written in Japanese, but I'll speak in English. Okay, let's make a new product, a sparkling water. How about a shining sparkling water? Then let's talk to the AI persona and ask the persona what it thinks about that. The answer is that if the visual is nice, it's okay, even if the taste itself is not so important. Okay, so let's summarize the interview, and based on the interview, we can consider the name, the key messages, and also the key visuals.

Thumbnail 2790

Thumbnail 2800

Thumbnail 2810

Thumbnail 2820

And then I'll summarize this as the value proposition. Based on that proposition, we can prepare the customer journey, and for the acquisition part, we need digital banner generation and so on. What I want to say is that we have already applied Amazon Nova AI across our solutions, and we can go further with Amazon Nova Omni, combining everything together with Bedrock Agents. So far, what I have explained is also a kind of advanced way of utilizing AI as a marketing agent.

Thumbnail 2830

Thumbnail 2850

Thumbnail 2860

Next Generation Experiences: Bridging Physical and Digital Worlds with Nova Omni

Now for the final topic, let's talk about the next generation experience. It is more conversational, interactive, real, and physical. The first example uses chat. With this solution, we can quite easily prepare a client chatbot utilizing client-specific data. Of course, it's based on AWS and utilizes the Amazon Nova series. This one is an example of golf course reservation. But the problem is that it's just text-to-text conversation, which is quite ordinary. Utilizing Nova Omni, we can provide a much richer experience, like this.

Thumbnail 2890

Thumbnail 2900

He's my friend, and he's talking to the AI with voice ("I've analyzed your video. Tap the mic to talk."). First, the AI is speaking in English, and then he inputs information in Japanese, which is very interesting: the AI speaks only in English while he speaks in Japanese, but the conversation is still not broken. Here we can also upload a video. The AI understands our video and can give us advice on how to improve the golf swing, like this. It's a kind of new generation experience connected to the real world.

Thumbnail 2920

And there's another example. This one is a kind of in-store experience. The video on the left side is from the customer's point of view in the store. This customer is looking at the perfumes, waits for a moment, and then talks to the shop clerk, like this. What we're doing is extracting the persona of this customer. For example, from this video we can guess that this customer might be a woman, perhaps a glamorous and style-conscious office worker. Like this, we can extract the persona.

Thumbnail 2980

And we can also bring this persona back to the digital world, which I think is an interesting example. What we did is feed this persona to Amazon Nova Act, so based on the in-store behavior, we can guess how this customer will behave in the e-commerce shop, like this. I think this kind of technology, which connects the real store and the e-commerce store, is essential for the era when VR and AR glasses become widespread.

Thumbnail 3010

Thumbnail 3030

Okay, so this is almost the end of the presentation. Today, I have presented several applications based on Amazon Nova 2 Omni, but I know this is not a marketing session; this is a technology session. So let me share some very shocking news. I was only informed on the last Tuesday of last month that I would have a presentation today, so we had just seven days to prepare. But I introduced seven applications, so that's one application per day. It's quite rapid development. Actually, the lady sitting here, Suzuki-san, developed these solutions all by herself.

So compared to before, the development situation has changed dramatically. Previously, it took at least three months for one solution, but now it's just one day. Now we can really focus on our own business, and I think this is the right way of working together with AI. This is my final comment: thanks to AWS AI, we are free from concerns about development, and as marketers, we can focus on inspiring people's hearts, creating new value, and changing the world. Thank you very much.


This article is entirely auto-generated using Amazon Bedrock.
