🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - From prompt to production: On-brand marketing images with Amazon Nova (AIM373)
In this video, Henry Wang and Parmida from AWS Generative AI Innovation Center present an automated marketing image generation pipeline addressing brand safety challenges in AI-generated content. They demonstrate a multi-stage solution using Amazon Nova Pro and Nova Canvas that includes prompt decomposition, asset retrieval via vector search, automated caption generation following specific topology rules, composition planning with grid search for optimal product placement, and intelligent quality control using vision language models. The pipeline achieved 70% human preference over out-of-the-box model outputs, with improvements measured through MS-SSIM and DINOv2 metrics for object similarity. Key innovations include using prompt masks for outpainting to preserve product integrity while changing backgrounds, and implementing binary scoring against brand guidelines to filter thousands of candidates down to three top-ranked images for final human review.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Challenge of Brand-Safe Image Generation in Marketing
All right. Good morning, everyone. Feel free to put on your headphones right now. Give me a thumbs up if you can hear. Fantastic, fantastic. Well, thanks so much, everybody, for coming in. This is your fourth day, right? Thursday, the fourth day of averaging 25,000 steps. Yeah. Anybody want to shout out your personal record at re:Invent this week? 30,000? 30,000. Wow, that's impressive. Any higher bidder? No? Okay, cool.
Well, we're in the home run, so thanks so much for coming to this session with me and Parmida. We hope this session is really fun for you guys. So if you haven't woken up yet, wake up a little bit, because we're going to be talking about image generation. My name is Henry Wang, and I'm a Senior Applied Scientist with AWS Generative AI Innovation Center, on the stage with my colleague Parmida. Yeah, and my name is Parmida, Senior Data Scientist at AWS Generative AI Innovation Center. If you'd like to be like me and Henry doing everything everywhere at once, you're at the right place. We're going to teach you how to design a pipeline that can do stuff like that for marketing image generation. Do you want to kick it off?
All right, thanks, Parmida. So a quick show of the agenda today. We're going to talk a little bit about our observations on generative AI applications in marketing, and then we'll give an overview of the architecture, followed by a deep dive into each component of the pipeline and the architecture. We'll be wrapping up with some of the action items for you guys to take if you're interested in this topic.
Now, very first, very quick, anybody has used generative AI to generate some images? Expecting a lot of hands. Anybody still using your generative AI avatar for your social media profile? Not so many. Okay, cool. And who works in the marketing space? Curious. All right, a few hands. Thank you very much. Thank you.
So what we found working in marketing is that, before generative AI image generation came along, it was a fairly manual process, right? There are a lot of great tools on the market, but they still require a lot of finessing and fine-tuning to make sure the colors match the brand, et cetera. And everything changed when generative AI image generation came along. There are so many promising signals. You can literally create these visually stunning images in a matter of minutes or seconds, right? And they look beautiful. Purse, sneakers, sofa. And we were really excited about the promise this holds. But then one question we had was: is the existing technology really brand safe yet?
This was a journey we took on with one of the customers, and we're like, okay, let's try this thought experiment for a little bit. Say, imagine you have a sofa like that as your marketing product you want to be building an image for, and you just prompt a model, right, an image generation model here. And then, okay, you get an output like that to your right. Everything looks great at first glance, right? You have this beautiful floor-to-ceiling windows, you have this nice lighting in the background. But if you notice, the texture and the color of the sofa changes, right? And there's actually not an easy way for you to modify it.
Now for companies, this can be an issue, right? First of all, it changes how your product actually looks and feels, so you want to correct it, and you still need to go back to some of the manual tools that you rely on. And secondly, it could actually take more time for you to correct those images. And if you want to do this through prompting the image generator, it's still really hard because, due to the non-deterministic nature of these image generators, you can't get a similar image or a similar feel unless you do hundreds or thousands of iterations, right?
What's worse is that sometimes these can slip through. For example, say you're thinking about a boot shoot on a street at sunset, right? You can see some of these images here, and these are real images coming out of image generation models. They just don't fit normal expectations of the physical world. You simply don't see them in real life. So that got us thinking: how do we pick the right image out of the many images coming out of image generators? For instance, say we want to generate this boot picture for you. How do we find this one, which in our opinion is the best, in a scalable and automatic fashion? So that's what we want to talk to you about today.
Solution Overview: Prompt Decomposition and Asset Retrieval
Let's quickly go through the solution pipeline, and then we're going to dive deep into each component. Say a user passes a query, "a shoe on an urban street at sunset," a very simple query, right? What this pipeline does is that it first decomposes the prompt, and we extract different entities from this query, such as product, background, and themes. Afterwards, we use a vector search to find relevant assets if your company has any. Say you have millions of products or tens of thousands of backgrounds, then you're able to immediately retrieve the right product and the right background or the right theme from your media library.
Then the pipeline will automatically generate an optimized prompt based on your initial prompt, leveraging some of the best practices we have incorporated into the model. And then there's this stage which is called composition of images. So we're thinking about putting the product at different places, different orientations, rotations, and so on, on the canvas and give these as seeds to an image generator. Now you have thousands of candidates right now, right? Different variations and so on. You want to see, okay, which one should I choose?
So we're building this multi-step quality control pipeline as well using vision language models. You can inspect the images, provide feedback, and also we incorporate these traditional or classic image quality understanding algorithms so that you can pick the highest ranked, aesthetic image that looks most stunning, and then this becomes one of the top ranked images as the output of this pipeline. Everything looked good so far? Great.
Now let's talk about the solution breakdown a little bit. The first thing is the prompt decomposition and asset retrieval. So we have this simple prompt, right? "Rest the shoe on the ground on an urban street in a sunny day." You have a couple of things that are very important for image generators. You have product, you have the background, and then you have the theme. And this is a simple example, but in reality, you can have multiple of these, right? You can have multiple products you want to put them together, you have different backgrounds and so on, and theme. But usually it takes a lot of time for pretty expert level prompters to say, okay, how can I prompt the image, how can I prompt the model to generate the best image I want. Most people will just tend to say, let's get this done, right?
Now, with this prompt decomposition, we extract these entities using just a lightweight large language model. You can prompt the model and say, hey, you are an assistant that processes this, and these are the entities I'm interested in extracting. Modern large language models, even the lightweight ones, are powerful enough to do this job very robustly and reliably for you.
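The decomposition step described above can be sketched as a prompt builder plus a parser for the model's reply. This is a minimal illustration, not the presenters' implementation: the prompt wording, the JSON reply format, and the function names are assumptions, and the reply string is written by hand rather than produced by a model.

```python
import json

# The three entity types named in the talk.
ENTITY_KEYS = ("product", "background", "theme")


def build_decomposition_prompt(user_query: str) -> str:
    """Build an instruction asking a lightweight LLM for the three entity types."""
    return (
        "You are an assistant that processes marketing image requests.\n"
        "Extract the following entities from the query and reply with JSON only, "
        'using the keys "product", "background", and "theme". '
        "Use null for any entity that is not mentioned.\n"
        f"Query: {user_query}"
    )


def parse_entities(llm_reply: str) -> dict:
    """Parse the model's JSON reply, keeping only the expected keys."""
    data = json.loads(llm_reply)
    return {key: data.get(key) for key in ENTITY_KEYS}


# Hypothetical reply, written by hand for illustration (not real model output).
reply = '{"product": "shoe", "background": "urban street", "theme": "sunset"}'
entities = parse_entities(reply)
print(entities)
```

Asking for JSON with fixed keys keeps the downstream retrieval step simple, since each entity can be embedded and searched independently.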
The next thing after you extract these entities is that you want to search your media library, right? As we mentioned earlier, you can do this with vector search. You embed your sentences, your theme, or the entities extracted from the query into a vector embedding space, and then you match those against your product image embeddings and try to find the most similar ones. For example, you get five candidates that you can choose from, and the same goes for the backgrounds. Now we have all the things we need. We have the raw assets we need.
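The retrieval step above boils down to a nearest-neighbor lookup over embeddings. The sketch below uses plain cosine similarity over tiny hand-made vectors. The asset names and vector values are invented for illustration, and a real system would use an embedding model and a vector store rather than a dictionary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, library, k=5):
    """Return the ids of the k assets whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), asset_id)
              for asset_id, vec in library.items()]
    scored.sort(reverse=True)
    return [asset_id for _, asset_id in scored[:k]]


# Toy embeddings (made-up values) standing in for product image embeddings.
library = {
    "sneaker_red": [0.9, 0.1, 0.0],
    "sofa_grey": [0.0, 0.2, 0.9],
    "boot_brown": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedded entity "shoe"
candidates = top_k(query, library, k=2)
print(candidates)
```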
Caption Generation: Automating Prompts with Nova Pro
Now, I'd like to welcome Parmida back onto the stage to talk about the next component. Awesome, thank you Henry. Cool. So once we have our assets retrieved and our background retrieved, what's next? I want to tell you that if you have worked with image generation models, 90% of the work is getting that caption right. I don't know if you've tried them, but just changing the words, a little bit of the placement of the words changes the output a lot.
So figuring out how to put everything together is crucial. Otherwise, you end up with an image that is only almost what you want. That's why we have a specific module for caption generation.
Now, before I explain how this module works, let's take a look at caption topology. Usually in the caption, of course, you have the subject, the point of reference, and the environment—all the elements in the environment that you want to have in your image. But then there are optional elements that, if you add them to your caption, will enrich it and help you get that really amazing, natural-looking image. These include things like the position and pose of the subject, lighting description, camera position, and the style.
Now with that knowledge, how do we automate caption generation? Well, we use Nova Pro. Nova Pro can be prompted to generate captions, but if I want to prompt Nova Pro to generate captions, I have to think about how I write captions and instruct Nova Pro to follow the same steps. The three main rules that I follow are, first of all, to generate caption-like prompts, not instruct-like prompts. I don't know if you've tried caption generation before—probably some of you have tested having an LLM generate some captions for you. Sometimes what you see is "generate an image that does blah blah blah." That's not the type of caption you want. You want the caption to be something like "an image of a handbag on a dining table in a dining area." That is the caption we're looking for, so that's the first instruction.
Second, image generation models are different from LLMs. Usually they are trained with shorter prompts, and it's very important to not have a super lengthy prompt because the model cannot really process that. So instructing Nova Pro to generate captions that are not longer than 1,024 characters is very important. And third, we just talked about caption topology, so why not make Nova Pro understand how to generate my caption? I want the camera position to be there, the subject to be there, the description and the lighting. When you put all that into Nova Pro, you get captions like this. I'm not going to read it—it's a little bit long—but you can see the frame position is there, the subject is there, and then the environmental descriptions and lighting, and then you can get an image that looks like the one below.
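The three caption rules above (caption-like wording, a 1,024-character limit, and the caption topology) can be expressed as a small assembler and checker. This is only a sketch under stated assumptions: the element order, the list of instruct-style opening words, and the function names are illustrative choices, not something specified in the talk.

```python
MAX_CAPTION_CHARS = 1024  # length limit mentioned in the talk
# Assumed list of words that signal an instruct-like prompt.
INSTRUCT_OPENERS = ("generate", "create", "make", "draw", "produce")


def build_caption(subject, environment, position=None, lighting=None,
                  camera=None, style=None):
    """Join required and optional topology elements into one caption."""
    parts = [camera, subject, position, environment, lighting, style]
    return ", ".join(p.strip() for p in parts if p)


def check_caption(caption):
    """Return a list of rule violations (empty when the caption is acceptable)."""
    problems = []
    if len(caption) > MAX_CAPTION_CHARS:
        problems.append("too long")
    first_word = caption.strip().split(" ", 1)[0].lower()
    if first_word in INSTRUCT_OPENERS:
        problems.append("instruct-like")
    return problems


caption = build_caption(
    subject="a leather handbag",
    environment="on a dining table in a dining area",
    lighting="soft morning light",
    camera="eye-level shot",
)
print(caption)
print(check_caption(caption))
```

A checker like this could run on the caption model's output before any image is generated, so that a malformed caption is regenerated cheaply.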
Outpainting and Composition Planning: Positioning Products with Nova Canvas
Now, I have my caption generation module complete. What's next? Before I go to the next module, I want to take a step back and talk about different generative tasks that you can do with image generation. This is probably very preliminary for some of you who already know them, but let's do a fast review. First, text-to-image. This is the basic thing. Everybody knows text-to-image generation. You provide a prompt, you get an image. But for marketing image generation, is that what I'm looking for? Usually in marketing, you have your product with specific details, and it's very important down to the texture level—even very detailed—to have the same exact product in the image. There is no way for me to describe all of that in text for the model. Even if I do my best, it's not going to be the same exact product, and some brands have very strict guidelines about that. So this is not going to be the task that I'm going to use.
Next, inpainting. What does inpainting do? Well, you have a reference image, you have a caption, and the model can insert features into your image. This way of generation helps to keep the environment intact. The environment is not touched, but then you add elements. So how does that look for my task? Well, I want the reverse of that, because I want the product to be kept intact, but I want to put it in different places and I want to create different marketing campaigns depending on the mood and the time of the year.
So that comes down to outpainting. As you can see here, the Amazon van is kept intact, but the surrounding has changed. How does outpainting work? Well, you have a reference image and you have a prompt, similar to inpainting. Usually both inpainting and outpainting work with a mask image. That mask indicates to the image generation model where the generation will happen and where the generation will not happen, and that gives you control over where the generation will happen. Now what makes Nova Canvas very interesting is that I don't need to have an image mask. I can use a prompt mask, and that makes it super simple because I don't need to think about what other module I have to add to my pipeline to figure out where the position of my object is so that I can mask it, and so on. The main thing is I just provide a prompt mask, and that would be my object, and that object will be kept intact.
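An outpainting request with a prompt mask might be assembled as below. The field names (`taskType`, `outPaintingParams`, `maskPrompt`, `imageGenerationConfig`) are assumed from the Amazon Bedrock request format for Nova Canvas and should be verified against the current documentation. The function name and sample values are hypothetical, and the sketch only builds the request body without calling any service.

```python
import json


def build_outpainting_request(image_b64, caption, mask_prompt, seed=0):
    """Assemble a request body for outpainting around a masked object.

    Field names are assumed from the Bedrock Nova Canvas request format;
    check them against the current API reference before use.
    """
    return {
        "taskType": "OUTPAINTING",
        "outPaintingParams": {
            "image": image_b64,         # base64-encoded reference image
            "text": caption,            # caption describing the new scene
            "maskPrompt": mask_prompt,  # text naming the object to keep intact
        },
        "imageGenerationConfig": {
            "numberOfImages": 1,
            "seed": seed,
        },
    }


# Placeholder image string; a real call would pass actual base64 image data.
request = build_outpainting_request(
    image_b64="AAAA",
    caption="a shoe resting on an urban street at sunset",
    mask_prompt="shoe",
)
body = json.dumps(request)
print(body)
```

The serialized body would then be sent to the model through the Bedrock runtime client. Varying `seed` is one way to get different candidates from the same composition.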
Now, with that, let's go to the next module, composition planning. So up to now we have retrieved our initial reference image. If there is a background aligned with our query, we have retrieved it. If there was no background aligned with our query, well, the retrieval doesn't return anything. That's fine, but we still have the caption that gives us all the features that we need for the background.
Now another important part is the three elements that I need to think about to plan the final image. Composition planning is about the relative size of the object in the scene, the position, and the rotation. With these three elements together, if they are picked properly, you can have very natural looking images. You can even make the image generative model generate shadows for you. But if the three combinations are not right, then you have synthetic looking images. So that's why it's very important to think about that.
Now the relative size is a subjective task. It's not very easy, and that's why we are using Nova Pro, a multimodal model, for that. We are using the multimodal capability of Nova Pro. We give it the image caption and, if you have one, the background reference. If you don't, we still feed Nova Pro an empty canvas. Why? Because the dimensions of the canvas are going to be important as a point of reference for Nova Pro to understand how to scale our object. And then we basically ask Nova Pro, hey, you have this caption, you have this environment, you have this object, this is my empty canvas or reference background. How should I scale my object? This is the way that we get the scale.
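Once a scale comes back from the model, applying it is simple arithmetic. The sketch below assumes the scale is expressed as the fraction of the canvas height the object should occupy. The talk does not say how the scale is represented, so that convention and the function name are assumptions.

```python
def scaled_size(object_size, canvas_size, height_fraction):
    """Resize an object so it occupies a given fraction of the canvas height.

    object_size and canvas_size are (width, height) in pixels. The aspect
    ratio of the object is preserved.
    """
    if not 0 < height_fraction <= 1:
        raise ValueError("height_fraction must be in (0, 1]")
    obj_w, obj_h = object_size
    _, canvas_h = canvas_size
    target_h = round(canvas_h * height_fraction)
    target_w = round(obj_w * target_h / obj_h)
    return target_w, target_h


# A 400x200 product image placed on a 1024x1024 canvas at 25% of its height.
size = scaled_size((400, 200), (1024, 1024), 0.25)
print(size)
```

Note that the sketch does not check whether the resulting width still fits on the canvas, which a production version would need to handle.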
Now we scale our object, but then we have to figure out the position and rotation. I don't know if any of you are expert content creators or if you've seen how content creators work, but most of the time based on their experience they can figure out that, oh, maybe the image that I got is a little bit off looking. Maybe I'll just put the object a little bit to the right. Maybe I tilt it a little bit, and with these tiny adjustments they just get a wow image. And that's why we thought, hey, this can be automated through a simple grid search.
Here we have divided our canvas into three different sections. At every moment we place the product in each of the sections, and then we provide three different rotations, the original orientation of the object, 15 degrees to the right, 15 degrees to the left. Now what's the size of the grid? Well, that depends on the task that you have in hand. This is a trade-off between performance and cost, and you can figure that out based on the task in hand. In our solution we went with nine grids based on the cost that we had in mind, and it ended up working very well for our solution and the images that we had.
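The grid search above can be sketched as enumerating every combination of position and rotation. This version assumes the three sections are vertical strips of equal width and that the product is centered vertically in each, which gives three positions times three rotations, or nine candidates. The talk does not spell out the grid layout, so the layout and the names here are assumptions.

```python
from itertools import product


def placement_candidates(canvas_width, canvas_height, sections=3,
                         rotations=(0, 15, -15)):
    """Enumerate product placements: one per (section center, rotation) pair."""
    section_width = canvas_width / sections
    centers = [
        (round(section_width * (i + 0.5)), canvas_height // 2)
        for i in range(sections)
    ]
    return [
        {"center": center, "rotation_deg": rotation}
        for center, rotation in product(centers, rotations)
    ]


candidates = placement_candidates(1200, 800)
for candidate in candidates:
    print(candidate)
```

Each candidate would be rendered as a reference image and handed to the image generator as a seed composition. Increasing `sections` or adding rotations grows the search and the generation cost, which is the trade-off described above.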
Intelligent Filtering and Quality Control: Multi-Step Image Evaluation
Cool. So now I have my reference images and the caption. So I'm going to try to do outpainting using the mask prompt with Nova Canvas, and these are the images that I get. What now? Well, I can send all this to my experts and ask them to, hey, take a look which one looks good for our marketing campaign, but is that the best use of time of that expert? Probably not. I can do some filtering. Some of this can just be out of the pool that I send to my experts.
Also think about it. I'm thinking about scale and automation. So I'm thinking that this pipeline can run hundreds of marketing campaigns at the same time. So it doesn't make any sense for each of those marketing campaigns to have an expert person sit and look at all these images.
And that's why we have the next component, which is basically our intelligent filtering component. For this component, the first part is going to be again our champion, Nova Pro. Because it's a multimodal model, we again use the multimodal capability of Nova Pro. We feed it the image that is generated, the caption, and the product.
Now we have certain guidelines that we want Nova Pro to evaluate the image based on. There are some baseline guidelines, and these are the failure modes that you can see in your image generator. Each image generator can have their own failure modes, and that's something that you can figure out when you work with that image generator. Based on that, you can add these guidelines for Nova Pro to figure out, oh, this is a failure mode, I'm not going to pass this image. But then also marketing images and brands have their own guidelines. Maybe you have a specific color palette that you're looking for. Those can be additional guidelines that you can add to Nova Pro to check for every image and right off the bat get rid of any image that doesn't fit any of these guidelines.
We ask Nova Pro to provide us a binary score. Now how do I work with these scores? That's up to you. I can be very strict and say any image that doesn't pass any of the guidelines is off. But then if I didn't have that many images to start with, maybe I can be a little bit more flexible. Some of these guidelines, for example caption alignment, is something that I can be flexible with. Why? Because an LLM, when it tries to figure out whether the image is aligned with the caption, it goes for every single detail. But me as a person, if I have an autumn scene, if leaves are not generated on the ground, maybe I'm okay with that. Maybe that's fine.
So there are some guidelines that I can be flexible with, and if I'm starting with a limited number of images and I want to make sure to have some images come out of this quality check, then I can say, hey, if none of my images pass all the guidelines, then I'm going to relax those guidelines that can be flexible. You can also go for a weighted approach. That's another way. For this task, we actually went very, very strict because Nova Canvas can generate really good images following this pipeline, and the majority of the time we have images that pass all the quality checks, so we went strict.
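The strict-then-flexible policy described above can be sketched as a filter over per-guideline binary scores. The guideline names and scores below are invented for illustration, and the fallback rule (relax only the guidelines marked flexible when nothing passes strictly) is one reading of the talk, not a confirmed implementation.

```python
def passes(scores, flexible=frozenset()):
    """True when every guideline that is not marked flexible has passed."""
    return all(ok for name, ok in scores.items() if name not in flexible)


def filter_images(results, flexible_guidelines=("caption_alignment",)):
    """Keep images passing all guidelines; if none do, relax the flexible ones."""
    strict = [img for img, scores in results.items() if passes(scores)]
    if strict:
        return strict
    flexible = frozenset(flexible_guidelines)
    return [img for img, scores in results.items() if passes(scores, flexible)]


# Made-up binary scores, standing in for what the vision language model returns.
results = {
    "img_1": {"no_duplicates": False, "has_shadow": True, "caption_alignment": True},
    "img_2": {"no_duplicates": True, "has_shadow": True, "caption_alignment": True},
    "img_3": {"no_duplicates": True, "has_shadow": True, "caption_alignment": False},
}
kept = filter_images(results)
print(kept)
```

A weighted approach, also mentioned above, would replace `passes` with a weighted sum of the scores compared against a threshold.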
And having these images, we can get rid of the top left image because there are duplications. We can get rid of the second image and the first in the second row because a leg is generated and that was not part of the caption. And we can get rid of, interestingly, the third image on row one because no shadows are generated and Nova Pro was able to capture that. Now I'm left with five images. Again, is this a good time to send it to a human for review? That again depends on the scale of the work you're running, the labeling budget, the time value of your experts.
In our scenario, our labeling budget was three images. We didn't want to waste any human time. We wanted three images maximum. They look and they pick one. So we're like, okay, we're going to add another step, which is final ranking, and that can be super simple. There are already really good models out there to do aesthetic scoring for you. There is Aesthetic Model, there is ImageReward, and there is HPSv2, which you can use out of the box to generate scores for your images.
These three models are basically different ways of measuring the aesthetic quality of your images in a score format. You can also construct your own task-specific metric. Now ImageReward and HPSv2 are lightweight models that are trained on human feedback, so they're more aligned with what we perceive as aesthetic quality. Aesthetic Model, if you haven't seen it, Google search, it comes up. It's a very lightweight model that is trained on a dataset in a supervised manner, so it's less accurate compared to ImageReward and HPSv2. But for our use case, that was actually a great pick because we wanted to save costs. It runs on CPU and it could do the job for us.
So we picked aesthetic score. When we calculated the scores for this problem that we're reviewing together, these are the scores, so I can easily get the top three, and that would be the top three images that I send to my human evaluator to pick one for the marketing campaign that I'm running.
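The final ranking step above reduces to a top-k selection over the aesthetic scores. The scores below are made-up numbers for illustration only. In the pipeline they would come from whichever aesthetic scoring model is chosen.

```python
def top_ranked(scores, k=3):
    """Return the k image ids with the highest aesthetic scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical aesthetic scores for the five images that passed filtering.
aesthetic_scores = {
    "img_a": 5.1,
    "img_b": 6.4,
    "img_c": 4.8,
    "img_d": 6.0,
    "img_e": 5.7,
}
shortlist = top_ranked(aesthetic_scores, k=3)
print(shortlist)
```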
So now we have been efficiently putting a pipeline together to reduce the amount of human effort, push the human review to the last step, and even minimize that effort. So now I can actually scale. I can run tons of marketing image generation, and I just have some expert at the end of the line to pick that one best image that goes in my catalog.
Now I need to talk about performance. I just showed an example to you, but what's the performance really? Did we measure that? When we measure the performance, we try to do quantitative metrics analysis and also qualitative. Quantitative metrics, unfortunately, the problem with them is that at every time they only measure one factor.
We chose two metrics. Both of them actually measure the same factor, and that's object similarity. MS-SSIM is a traditional edge-based comparison. It basically gets the edge image of the original image and the final image and compares them. DINOv2 is an embedding-based way of comparing similarity. If you know about embeddings, they are basically vector representations of the images. It's a little bit more accurate, but we tried both.
The whole goal of these metrics was to make sure that my object stays the same, so we took the original object and the object in the generated image and compared them. In each of these performance evaluations, we ran Nova Canvas out of the box and then ran Nova Canvas inside our pipeline, and you can see how much the metrics improved.
But as I said, that's just one factor. There are so many factors you want to check, such as aesthetic quality and the natural feel of the images, and that's why we also ran a human preference study. Our pipeline was preferred 70% of the time over the model running out of the box. And here are some other examples of the generated images.
Future Improvements and Closing Remarks
Now this is great and rosy and everything, but is that all? Of course not, as you already have heard in the keynote, Nova 2 is out. There are new models every day, and I think the best way to use generative AI is to be adaptive and change your pipeline all the time and make sure that you're aligned with the latest models.
So in the pipeline that I just discussed, out of the box there are several things that I can think of that we haven't covered and we can improve. Think about how we reiterate. Think about a scenario that none of my images passes the quality check. Then what? Do I just rerun the pipeline without any feedback? That doesn't make any sense.
What if my quality check LLM is an agent and it provides feedback, specific feedback to how to change the reference image or how to change the caption? Well, that's more constructive. Second, think about my human experts sitting checking those three images, and it's like, I don't like any of these. I actually want this other thing to be in the scene. How do I incorporate that human feedback again?
Another agent can take that feedback and figure out which modules need to implement it. Is it the reference image that is going to be changed? Is it the background reference? Is it the caption? All of them? So that's another part. And third, let's not take fine-tuning off the table, because fine-tuning can be valuable when it's needed.
If the guidelines for brands are very specific, maybe my model cannot figure out every guideline out of the box, and maybe I need a little bit of fine-tuning to get that part right, specifically because brands have very strict guidelines. And I'm pretty sure you can think of many other ways to improve this pipeline or even make your own pipeline. So with that, I really encourage you to think about which parts you're going to use for your own problem. Maybe you just use a couple of modules.
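The idea of a quality check that returns feedback, rather than a bare pass or fail, could look like the loop below. This is a speculative sketch of a proposed future improvement, not something the presenters built: the function names, the rule of appending feedback to the caption, and the stand-in generator and checker are all assumptions, and no model is called.

```python
def run_with_feedback(generate, quality_check, caption, max_rounds=3):
    """Regenerate until some image passes, folding feedback into the caption."""
    for attempt in range(1, max_rounds + 1):
        images = generate(caption)
        passed, feedback = quality_check(images)
        if passed:
            return passed, caption, attempt
        if feedback:
            caption = f"{caption}, {feedback}"
    return [], caption, max_rounds


# Stand-in components for illustration only.
def fake_generate(caption):
    # Pretend the "image" is just the caption that produced it.
    return [caption]


def fake_quality_check(images):
    ok = [img for img in images if "soft shadow" in img]
    feedback = "" if ok else "soft shadow under the product"
    return ok, feedback


passed, final_caption, attempts = run_with_feedback(
    fake_generate, fake_quality_check, "a boot on an urban street at sunset"
)
print(attempts, final_caption)
```

Capping the number of rounds keeps cost bounded when no caption change can satisfy the guidelines.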
Maybe you add more, be creative with it. And Henry, would you like to tell our audience how they can learn more?
Absolutely, thank you, Parmida. Yeah, and as Parmida mentioned, this is really our sharing with you on what we figured out. This pipeline works as food for thought during your day-to-day workflows. How can you build on top of that pipeline or maybe swap out certain components of the pipeline that fits your purpose better? Now, if you are interested, we actually have tutorials online that you can scan this barcode and see all those great learnings and experiences people are sharing about using Amazon Nova. And as Parmida mentioned, new models come out every day, so we encourage you to keep an eye out for it.
As we are all in this AI-native stage, the best thing we can do is challenge ourselves, forget about what we learned a year ago because something new will be coming out again, and be willing to embrace that kind of change. And if you're interested in a related session, feel free to attend this one. This is AIM389, happening from 4 to 5 p.m. today, on seamless and scalable creative and product visual generation.
All right. With that being said, this is our LinkedIn barcode if you wanted to connect and hopefully this is a sweet and short session. We have some time for Q&A. And with that, yeah, we're going to open the floor for questions.
This article is entirely auto-generated using Amazon Bedrock.






































