🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerating AI innovation with NVIDIA GPUs on AWS (AIM251)
In this video, AWS and Adobe discuss GPU infrastructure evolution for generative AI. AWS presents 15 years of NVIDIA partnership, introducing P6e-GB300 Ultra servers with 72 GPUs, 20TB GPU memory, and liquid cooling technology. They explain compute requirements for reasoning models, disaggregated inference (prefill/decode phases), and agentic AI workloads. Adobe's Ersin Yumer demonstrates Firefly Foundry, which trains custom video models on customer IP using AWS infrastructure with EKS and auto-scaling. Examples include the Regenerates franchise and Cosmos Laundromat, showing how foundation models trained on specific IP enable studios to generate production-quality content while maintaining commercial safety.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: 15 Years of AWS-NVIDIA Partnership and the Evolution of GPU Computing
Hi everyone and welcome to AIM 251. Today we'll be talking about the GPU platforms we're building at AWS and how these are powering recent breakthroughs in generative and agentic AI. My name is Dvij Bajpai and I'm a principal product manager at AWS. I'm joined today by my colleague Sreekar Reddy, senior product manager at AWS, and we're really excited to be co-presenting with Ersin Yumer, senior director of engineering at Adobe.
Our agenda today is in three parts. First, we'll talk about customer use cases in generative AI, diving deep into the compute requirements for emerging use cases like reasoning, disaggregated inference, and agentic AI. Next, we'll talk about the major investments we're making in our AI infrastructure to deliver the next level of GPU performance for these use cases. Finally, we'll hand over to Ersin to learn about how Adobe is leveraging GPU infrastructure at AWS to develop new capabilities in generative AI for creators all around the world.
But before we jump in, we'd like to commemorate the fact that this year, 2025, marks the 15th year in our partnership with NVIDIA. Our partnership started way back in 2010 with the launch of our CG1 instances featuring NVIDIA Tesla GPUs. You can see the diagram from our launch blog back in 2010, showing the main architecture differences between CPUs and GPUs. CG1 was mainly focused on use cases in graphics and high performance computing.
But in the early 2010s, there were also major breakthroughs showing the benefits of GPU-accelerated compute for machine learning. In 2012, a paper called AlexNet showed that neural networks trained on NVIDIA GPUs could significantly outperform the state of the art in machine learning use cases like image recognition. We launched our first P series and G series instances in the early 2010s to support these emerging use cases in machine learning.
The next turning point in our portfolio came with the launch of P3 instances, which featured NVLink, a high bandwidth, low latency interconnect connecting multiple GPUs within an instance. This meant that customers could train bigger models and more complex use cases. Around this time, the transformer model architecture was developed and quickly became the industry standard. Researchers discovered what are called the scaling laws, which showed that model performance in terms of model accuracy improved consistently and predictably the more data and the more compute that you dedicated towards training.
This was a major inflection point in the industry because it showed that you could get consistent returns to your model performance by scaling your compute. We quickly saw that customer workloads scaled from tens or hundreds of GPUs out to thousands or even tens of thousands of GPUs for training. In 2020, we launched our EC2 Ultra Clusters to support this massive scale-out on P4 and P5 instances.
But of course the industry was just getting started. In the past couple of years we've seen the emergence of trillion-parameter models, context lengths out to a million tokens or more in production, and compute-intensive use cases like reasoning. To support these use cases, we launched our first EC2 Ultra Servers based on NVIDIA GPUs earlier this year.
The Paradigm Shift: From Gen AI 1.0 to Compute-Intensive Inference Workloads
To provide a bit more context on this present moment, let's dive deeper into the specifics of how customer workloads have evolved over the past couple of years. Looking at the industry two or three years ago, what we'll call Gen AI 1.0, our top customers spent the vast majority of their compute on large-scale pre-training. Researchers used parallelism techniques like data parallel and model parallel to scale workloads out to thousands of instances, leveraging the scaling laws to improve model performance.
The next step in the process is fine-tuning and then model distillation, where the size of the model is reduced before it's deployed to production. Smaller models are easier to deploy, and they cost less in production because they require less compute for inference. In inference, customers focused on single-node or even single-GPU inference to optimize the economics of inference at scale. Simplistically, customers spent the majority of their compute in training and fine-tuning to optimize the compute required for inference.
Of course, this general focus continues in the industry today as well. But we're starting to see this paradigm shift in certain important ways. One thing we've observed over the past year is that the compute requirements for inference have increased significantly. This is partly driven by the fact that demand for inference is increasing, but it's also driven by the fact that inference use cases themselves are becoming more compute-intensive.
There are three use cases in particular that are driving the need for increased compute in inference. The first is reasoning, where a model takes a problem and breaks it down into intermediate steps before formulating its final response. The model therefore needs to generate a lot more intermediate tokens to get to its final answer, which requires more compute. The second is multimodality, where a model doesn't just respond with text but can generate a broader set of outputs like images, video, or audio. Ersin will talk a lot more about recent developments at Adobe in this area. And the third use case is agentic AI, where a user no longer simply interacts with the model through prompt and response, but the model can actually take action on the user's behalf, calling APIs or tools, or even interacting with other models in multi-agent configurations to solve complex problems in production.
So all of this is driving the need for more compute in production. But we're also seeing customers dedicate a lot more compute towards reinforcement learning, or RL. Customers scale RL pipelines to ensure that models continue to stay aligned with user intent as they operate in a broader set of environments and generate a broader set of outputs. So Sreekar, let's dive deeper into the compute requirements for each of these use cases.
Reasoning Models and Disaggregated Inference: New Compute Requirements for Multimodal AI
Yeah, so the first major trend that we want to focus on today is the growth in reasoning use cases. Several of our customers are increasingly implementing chain-of-thought reasoning during inference where a model breaks down a complex problem into simpler steps in order to improve the accuracy of the model across a range of benchmarks. Reasoning models generate a set of intermediate responses that are iteratively fed back to the model for it to generate a final output response to the user. As such, the number of intermediate tokens generated during reasoning is significantly higher than that of a traditional LLM.
In the last year or so, we have also seen chain-of-thought reasoning evolving into tree-of-thought reasoning, which more closely resembles how humans think. In tree-of-thought reasoning, a model explores multiple different paths in a structured manner in order to figure out a solution to the problem. Reasoning models also introduce an interesting trade-off between training-time compute and test-time compute. As we're all aware, the accuracy of a model can be significantly improved by training it extensively. But the same can also be achieved by using significant compute during inference, enabling the model to take time to think, validate its response, and refine its answer.
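To make the test-time-compute side of that trade-off concrete, here is a small illustrative sketch of one common pattern: sampling several reasoning paths and taking a majority vote (self-consistency). The generate_reasoning_path stub is hypothetical and stands in for any reasoning-model call; none of this comes from the presenters' stack.

```python
import random
from collections import Counter

def generate_reasoning_path(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a reasoning-model call.

    In practice this would invoke an LLM that emits intermediate
    'thinking' tokens before a final answer; here we just simulate
    noisy answers to show the aggregation step.
    """
    return random.choice(["42", "42", "41"])  # placeholder answers

def self_consistent_answer(prompt: str, num_paths: int = 8) -> str:
    """Spend more test-time compute by sampling several reasoning
    paths and returning the majority-vote answer."""
    answers = [generate_reasoning_path(prompt) for _ in range(num_paths)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Doubling num_paths roughly doubles inference compute, which is
    # exactly the test-time scaling trade-off described above.
    print(self_consistent_answer("What is 6 * 7?", num_paths=8))
```

The accuracy gain comes entirely from spending more inference-time compute on the same model, which is why these workloads push GPU demand in production rather than in training.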
An interesting trend that we've seen from reasoning is that customers' compute requirements and GPU memory bandwidth requirements are continuing to grow to optimize for these reasoning use cases. The second major trend that we want to focus on today is the growth in multimodal LLMs. A couple of years ago, customers used dedicated models for vision, text, audio, and video. However, in the last couple of years, we have increasingly seen customers train and deploy multimodal models that are capable of processing input and generating output across a variety of modalities.
Depending on your use case, you can have a range of input and output sequence lengths. For example, if you are trying to summarize a 200-page document into one paragraph, you're likely going to have a very large input sequence length. On the other hand, if you are trying to generate a video from a simple text prompt, you're likely going to have a very large output sequence length. Scaling input sequence length, where the input tokens are processed all at once, is usually very compute intensive, whereas scaling output sequence length, where tokens are generated one at a time, is usually more GPU memory bandwidth intensive.
In order to optimize performance across a range of input and output sequence lengths, we're starting to see that customers are increasingly disaggregating the two major phases of inference by running them on dedicated GPUs. In the first phase, called prefill, which is more compute intensive, the focus is on processing tokens all at once and then generating the first output token along with the initial KV cache. And in the second phase called decode, which is more GPU memory bandwidth intensive, the focus is on generating the subsequent tokens using the initial KV cache generated during the prefill phase.
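A minimal PyTorch toy of those two phases, assuming a single attention layer: prefill processes the whole prompt in one pass and produces the KV cache, and decode then generates tokens one at a time against that cache. In a disaggregated deployment, the cache built by the prefill workers is what gets shipped to the decode workers; this sketch only shows the data dependency, not AWS's or any particular serving framework's implementation.

```python
import torch

torch.manual_seed(0)
d_model = 64
w_q = torch.randn(d_model, d_model) / d_model**0.5
w_k = torch.randn(d_model, d_model) / d_model**0.5
w_v = torch.randn(d_model, d_model) / d_model**0.5

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / d_model**0.5
    return torch.softmax(scores, dim=-1) @ v

def prefill(prompt_embeddings):
    """Compute-bound phase: process all prompt tokens in parallel,
    return the last hidden state plus the KV cache."""
    k = prompt_embeddings @ w_k
    v = prompt_embeddings @ w_v
    q = prompt_embeddings @ w_q
    out = attention(q, k, v)
    kv_cache = (k, v)   # in a disaggregated setup, this is what moves to decode nodes
    return out[-1:], kv_cache

def decode_step(token_embedding, kv_cache):
    """Memory-bandwidth-bound phase: one new token attends over the
    growing cache, and the cache is extended by one entry."""
    k_cache, v_cache = kv_cache
    k_cache = torch.cat([k_cache, token_embedding @ w_k], dim=0)
    v_cache = torch.cat([v_cache, token_embedding @ w_v], dim=0)
    out = attention(token_embedding @ w_q, k_cache, v_cache)
    return out, (k_cache, v_cache)

prompt = torch.randn(128, d_model)   # 128 prompt tokens, already embedded
state, cache = prefill(prompt)
for _ in range(8):                   # toy loop: reuse the output as the next "embedding"
    state, cache = decode_step(state, cache)
print(cache[0].shape)                # K cache grew from 128 to 136 entries
```

The prefill pass is one big matrix multiply over all prompt tokens, which is why it is compute-bound, while each decode step reads the entire cache to produce a single token, which is why it is memory-bandwidth-bound.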
Agentic AI and Reinforcement Learning: Heterogeneous Compute for Dynamic Environments
A key bottleneck that typically surfaces during disaggregated inference is the transfer of the KV cache between the prefill and the decode nodes. As such, a key compute requirement that is arising from these use cases is the need to have the best inter-GPU connectivity to optimize performance for multimodal use cases. The next trend that we want to focus on is Agentic AI. Several of our customers are increasingly focusing on Agentic AI as the next frontier of AI across enterprise and consumer.
In agentic AI, a model not only interacts with the user, but also interacts with the environment, by querying a database or calling APIs and tools, and with other models before performing an action on behalf of the user. Some of the actions performed by the agent are CPU intensive, such as querying a database or data preprocessing, whereas others are GPU intensive, such as generating the output tokens. As such, a key compute requirement arising from these agentic AI workloads is the need for heterogeneous compute, with CPU and GPU cores co-located, to optimize performance for these use cases.
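A stripped-down sketch of that agent loop (the model call below is a hypothetical stub, not any real API): token generation is the GPU-heavy step, while tool execution such as a database lookup runs on CPUs, which is why co-locating the two kinds of compute helps.

```python
import json

def call_model(messages):
    """Hypothetical stand-in for a GPU-bound LLM call that may return
    either a final answer or a structured tool request."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "lookup_order", "args": {"order_id": "A123"}}
    return {"answer": "Your order A123 ships tomorrow."}

def lookup_order(order_id):
    """CPU-bound tool: would hit a database or an internal API."""
    return {"order_id": order_id, "status": "ships tomorrow"}

TOOLS = {"lookup_order": lookup_order}

messages = [{"role": "user", "content": "When does order A123 ship?"}]
while True:
    reply = call_model(messages)                      # GPU-intensive step
    if "answer" in reply:
        print(reply["answer"])
        break
    result = TOOLS[reply["tool"]](**reply["args"])    # CPU-intensive step
    messages.append({"role": "tool", "content": json.dumps(result)})
```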
As customers want agents to take actions on behalf of users, they also want models to better align with user intent, and this is where reinforcement learning comes into the picture. Reinforcement learning includes inference in the loop and works by having a model learn by trial and error. In reinforcement learning, a model observes the environment, takes an action, and gets feedback in the form of either a reward or a penalty. The model then uses this feedback to update its weights and policy so that it can make better decisions in the future, with the goal of maximizing cumulative reward over time.
For example, let's say you're trying to train a robot to navigate through a maze. When the robot starts off, it makes random movements. It may hit a wall and get penalized, and it then learns that in that state it is not supposed to take that action. When it makes progress towards the goal, it gets rewarded. Over time, the robot figures out which actions in which states lead to positive outcomes and which lead to negative outcomes. Through many trials and errors, the robot optimizes its strategy so that it can navigate the maze in the least amount of time with the fewest errors.
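The maze example maps directly onto tabular Q-learning; here is a compact, generic sketch of it (a textbook formulation, not the RLHF-style pipelines used to align LLMs).

```python
import random

# 4x4 grid maze: 0 = open cell, 1 = wall; start at (0, 0), goal at (3, 3).
grid = [[0, 0, 1, 0],
        [1, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
q = {}                                            # Q[(state, action)] -> value
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, a):
    r, c = state[0] + actions[a][0], state[1] + actions[a][1]
    if not (0 <= r < 4 and 0 <= c < 4) or grid[r][c] == 1:
        return state, -1.0, False                 # hit a wall: penalty, stay put
    if (r, c) == (3, 3):
        return (r, c), 10.0, True                 # reached the goal: reward
    return (r, c), -0.1, False                    # small cost per move

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        if random.random() < epsilon:
            a = random.randrange(4)               # explore
        else:
            a = max(range(4), key=lambda x: q.get((state, x), 0.0))  # exploit
        nxt, reward, done = step(state, a)
        best_next = max(q.get((nxt, x), 0.0) for x in range(4))
        q[(state, a)] = (1 - alpha) * q.get((state, a), 0.0) \
            + alpha * (reward + gamma * best_next)
        state = nxt

# Greedy action at the start state after training.
print(max(range(4), key=lambda a: q.get(((0, 0), a), 0.0)))
```

In LLM-scale RL the "environment step" is itself model inference, which is why RL pipelines pull inference into the training loop and benefit from the same GPU clusters.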
RL algorithms also work efficiently in complex environments that have multiple rules and dependencies. RL algorithms can also quickly adapt to ever-changing environments by constantly updating their strategies to optimize results. An interesting trend that we've noticed from the recent past is that customers are increasingly using the same cluster of nodes for pre-training, for inferencing, and for reinforcement learning workloads by dynamically scaling up and down as needed.
Liquid Cooling Innovation: Enabling Higher GPU Density and Performance
Now that we've talked about some of the trends in the GenAI industry and the compute requirements these are driving, let's talk about some of the investments that EC2 has been making in the past year or so to meet these evolving customer requirements. Emerging use cases like reasoning, disaggregated inference, and agentic AI require more compute and memory in production, but they also require better interconnectivity between GPUs, both scale up and scale out. Let's talk about some of the major investments we're making in our AI infrastructure to deliver the next level of GPU performance for these use cases.
A major area of focus for us over the past year has been liquid cooling. As GPUs have gotten more powerful in recent years, they have also gotten a lot more power hungry, generating a lot more heat and requiring more efficient methods of cooling. Liquid conducts heat far more efficiently than air, meaning it can more effectively carry heat away from GPUs. We launched our first at-scale liquid cooling platform this year with our P6e Ultra servers.
So let's dive a bit deeper into how we actually implemented this technology at Amazon. It all starts at the chip level, and as you can see from the picture, we have a GPU cold plate that fits right on top of the GPU, carrying cold liquid in through the central tube, dissipating heat away from the GPU, and carrying the warm liquid out through the tube on the side. We co-designed a custom cold plate with NVIDIA with a focus on serviceability and reliability in production. These liquid cooling plates have quick disconnects so that our data center technicians can quickly replace compute trays that need maintenance while maximizing uptime and availability for customers. We also carefully selected every material in the liquid cooling loop to reduce the probability of corrosion causing liquid cooling leaks in production.
The liquid then flows out to what's called a coolant distribution unit, or CDU, which is constantly measuring metrics like temperature and pressure, as well as other metrics like turbidity and conductance. This means that we can continuously measure the health of the system in production. The liquid then goes to a heat exchanger to get cooled down. At Amazon, we actually invented a new solution called the IRHX heat exchanger, where we can co-deploy our cooling racks alongside our compute racks in our data centers. This means that we can leverage existing data center capacity much more dynamically and flexibly to scale our GPU compute in response to customer demand.
What are the main benefits to customers from liquid cooling? The first main benefit is higher compute density. With liquid cooling, we can fit many more GPUs much closer together, meaning that we can pack more GPUs in standalone compute products like EC2 Ultra servers. We'll talk more about these in the next slide. The second benefit for customers is that our focus on serviceability and reliability means that customers can enjoy higher uptime and availability through our design. Finally, solutions like IRHX mean that we can scale quickly in response to customer demand without having to constrain ourselves only to those options that can support facilities-level liquid cooling. Liquid cooling is a foundational technology that provides better GPU performance for customers.
EC2 Ultra Servers: A New Tenancy Model for Multi-Trillion Parameter Models
Let's now talk about how customers actually access this compute through EC2 products like Ultra servers. An EC2 Ultra server is a collection of EC2 instances that share a high bandwidth, low latency GPU interconnect, meaning that a mesh of GPUs across instances can operate as a single powerful standalone unit of compute. This means that we can deliver an order of magnitude higher compute to customers. Our P6e Ultra servers provide more than 20 times the compute under NVLink compared to our prior generation P5en instances.
Ultra servers represent a new tenancy model and a new customer experience in EC2, and we launched a dedicated set of APIs to help customers manage every aspect of this customer experience, from capacity provisioning to monitoring and observability to other aspects like topology and auto scaling. We also implemented a custom Nitro-based design for managing the NVSwitch trays in the Ultra servers so that we can securely and dynamically partition the NVLink domain, providing greater choice for customers across Ultra server sizes. Finally, we placed an emphasis on ensuring that every aspect of this Ultra server customer experience integrated seamlessly across the GenAI stack at Amazon, from EC2 features like topology and auto scaling, to orchestration services like Elastic Kubernetes Service and AWS ParallelCluster, all the way to fully managed services like SageMaker Training and SageMaker HyperPod. Our primary focus was to ensure that this powerful compute product was as easy for customers to use right out of the box.
What are the main benefits for customers? The first is that customers can train and deploy bigger models at higher context using Ultra servers. Our P6e Ultra servers have over 10 terabytes of GPU memory under NVLink, meaning there's plenty of GPU headroom for customers to scale into the multi-trillion parameter range. Customers can also use the higher compute within the Ultra server to optimize performance for use cases like reasoning, as we spoke about earlier. Finally, customers can leverage the higher aggregate memory bandwidth under NVLink to optimize performance for decode-heavy applications like video generation.
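As a rough back-of-the-envelope check on that headroom (all numbers below are illustrative assumptions, not official sizing guidance), the sketch estimates the inference memory for a two-trillion-parameter model with 8-bit weights and a long-context KV cache.

```python
# Rough inference memory estimate; every number here is an illustrative assumption.
params = 2e12                  # a 2-trillion-parameter model
bytes_per_weight = 1           # FP8 weights
weight_gb = params * bytes_per_weight / 1e9

# KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens * batch
layers, kv_heads, head_dim = 128, 16, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1   # FP8 cache
context_tokens, batch = 1_000_000, 4
kv_cache_gb = kv_bytes_per_token * context_tokens * batch / 1e9

total_tb = (weight_gb + kv_cache_gb) / 1000
print(f"weights ~{weight_gb/1000:.1f} TB, KV cache ~{kv_cache_gb/1000:.1f} TB, "
      f"total ~{total_tb:.1f} TB")
# With roughly 10-20 TB of HBM in one NVLink domain, a model of this size plus a
# long-context cache can fit without spilling across NVLink domains.
```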
EC2 Ultra Clusters: Petabyte-Scale Networks with Resilience at Scale
Ultra servers provide up to 72 GPUs connected over NVLink, but as we know, training workloads can scale to thousands of GPUs, and that's where our scale-out network, or EC2 Ultra clusters, comes in. An EC2 Ultra cluster is a petabit-scale network that connects tens of thousands of GPUs in a data center. Ultra clusters are non-blocking, meaning that each GPU can drive its maximum bandwidth without causing contention within the network. We completely redesigned our Ultra clusters this year with 51.2-terabit switches and 400-gig links, up from 12.8-terabit switches. What this means is that we can connect more GPUs with fewer networking components, meaning bigger GPU clusters at lower latency.
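To put those switch numbers in perspective, here is a small worked calculation using generic two-tier Clos math, assuming 51.2 Tbps switch ASICs and 400 Gbps links on both generations; this illustrates why higher-radix switches mean fewer components and tiers, and is not a description of AWS's actual fabric design.

```python
# Generic Clos math, assuming 400 Gbps links on both switch generations.
link_gbps = 400

for switch_tbps in (12.8, 51.2):
    radix = int(switch_tbps * 1000 / link_gbps)      # ports per switch
    # A two-tier non-blocking Clos built from such switches can attach
    # roughly radix^2 / 2 endpoints at full bisection bandwidth.
    two_tier_endpoints = radix * radix // 2
    print(f"{switch_tbps} Tbps switch: {radix} ports, "
          f"~{two_tier_endpoints} endpoints in two tiers")
# 12.8 Tbps -> 32 ports, ~512 endpoints; 51.2 Tbps -> 128 ports, ~8192 endpoints.
```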
We also implemented a rail-aligned topology with our Ultra clusters, which provides optimal performance for collective communications that are used in machine learning. One of our main design principles for Ultra clusters is resilience at scale, and so we implemented design features like ToR overprovisioning and backplane redundancy, as you can see in the diagram there, to ensure that customers continue to see consistent performance, even if there are events like links or switches down in the fabric. At the instance level, customers provision instance networking using Elastic Fabric Adapter, or EFA. We launched our fourth generation EFA this year with our P6 instances, offering higher throughput for customers at 20 percent lower latency.
EFA leverages our Scalable Reliable Datagram, or SRD, protocol, which is purpose-built to optimize scale-out in a cloud-scale environment. SRD uses techniques like intelligent multipathing and out-of-order packet delivery so that we can deliver a networking protocol that dynamically adjusts to events like congestion or links going down in the network.
So what does all of this mean for customers? Firstly, with our new Ultra clusters, customers get bigger GPU clusters at lower latency from the new networking switches. The new hardware also means that we can provide higher bandwidth at lower cost. The rail-aligned topology offers optimal performance for common collectives in machine learning workloads.
Finally, design features like ToR overprovisioning, backplane redundancy, and SRD mean that we can deliver an end-to-end solution that is resilient to congestion and hardware failures, providing consistent performance for customers as they scale out to tens of thousands of GPUs. These are the major investments we've made in our scale-out network, in our EC2 control plane, and in foundational technologies like liquid cooling.
NVIDIA Blackwell Ultra on AWS: P6e-GB300 Ultra Servers and P6-B300 Instances
Let's now learn more about the NVIDIA platforms that we announced at re:Invent this year that leverage these capabilities to provide the highest GPU performance for customers. We made a couple of really exciting announcements in the last month or two, including one exciting announcement just yesterday with the P6e-GB300 Ultra servers.
The P6e-GB300 Ultra servers are powered by NVIDIA Blackwell Ultra GPUs and offer 72 GPUs within one NVLink domain. They offer the highest GPU compute and the highest GPU memory within EC2 for customers to train multi-trillion parameter models. They are ideal for use cases like reasoning in production, as well as for training and inferencing multimodal large language models.
The P6e-GB300 Ultra servers offer 20 terabytes of GPU memory within one NVLink domain. Just to put that in context, this is 30 times the GPU memory offered on P5 instances, which only launched 2 years ago. We're also starting to see customers use FP4 for inference, and the P6e-GB300 Ultra servers offer 50% higher FP4 FLOPs compared to the P6e-GB200 Ultra servers that we launched just a few months ago.
I also want to highlight a few design aspects of the P6e-GB300 Ultra servers. The Ultra servers feature the NVIDIA superchip architecture, in which the GPU and the CPU are co-located within one compute domain. For customers running heterogeneous compute workloads, this provides the ability to take advantage of CPU and GPU cores in parallel. The various compute trays within the GB300 Ultra servers are connected in a tightly coupled mesh using NVIDIA NVLink.
We leverage our Nitro-powered head node to manage the NVLink switches, which means the NVLink network is completely managed within the trusted Nitro domain. We also developed a set of dedicated APIs for the Ultra servers for customers to provision capacity using capacity blocks, as well as to monitor the health of their nodes and to monitor the health of their NVLink fabric, as well as for customers to determine the instance topology within the Ultra servers.
We remain extremely focused on making sure that the new APIs and the experiences that we developed for Ultra servers are seamlessly integrated across the generative AI stack. Just last month, we also announced the Amazon EC2 P6-B300 instances. These instances are also powered by the latest NVIDIA Blackwell Ultra GPUs and offer 8 GPUs within one NVLink domain. These instances offer the highest performance within an NVL8 configuration at EC2 and are ideal for training and inferencing mid and large-scale generative AI workloads.
There are three important customer benefits of the P6-B300 instances that I'd like to highlight. These instances offer 50% more GPU memory compared to the P6-B200 instances that we launched back in May, giving customers the ability to deploy bigger models within a single GPU or within a single NVLink domain. This reduces communication overhead and the need for model sharding.
The P6-B300 instances also offer twice the networking bandwidth compared to the P6-B200 instances, supporting up to 6400 gigabits per second of EFA throughput, making them ideal for large-scale distributed training workloads. Similar to the P6e-GB300 Ultra servers, the P6-B300 instances also offer up to 50% more FP4 FLOPs compared to the P6-B200 instances.
These instances also offer a dedicated Nitro card for north-south traffic, supporting up to 300 gigabits per second of ENA throughput to access remote storage services like S3 Express One Zone. They also offer twice the CPU-to-GPU bandwidth compared to the P6-B200 instances, further improving latency for large-scale training and inference workloads.
At EC2, we prioritize giving customers the broadest set of NVIDIA GPU instances, but we do acknowledge that it can sometimes get quite overwhelming to look at a bunch of specs and figure out what the right instance is for your workload. One way that we think about segmenting our portfolio at EC2 is that on the left side, we have the G family instances, which mainly comprise the G6 and G6e instances. These instances are ideal for small and mid-scale training workloads, agentic AI workloads, and graphics-intensive and spatial computing workloads.
The G family instances are also ideal for models that can fit within a single node or even a single GPU, so customers don't really need high inter-GPU connectivity or high connectivity between the GPU and other components. This lets us simplify the design and remove the PCIe switching layer, giving customers a cost-optimized option and enabling them to strike the right balance between performance and cost.
In the middle, we have the NVL8-based P family instances. These instances are ideal for training and inferencing mid and large-scale generative AI workloads. They offer up to 8 GPUs within one NVLink domain and come with the most powerful NVIDIA GPUs. They also have the NVLink interconnect at the bottom, which provides high interconnectivity between any two GPUs in a single server.
With P6-B300 instances, for example, any two GPUs can talk to each other at 1.8 terabytes per second. So if you have expensive communications in your algorithm, these instances become incredibly useful. We also have the switching layer in the middle. In our design, the GPUs and the EFA devices share the same switches, which means the GPUs can talk to the EFA devices and over the network without having to go through a CPU, which further improves the overall network performance.
Overall, if your workload needs the latest and greatest GPUs, needs the highest inter-GPU connectivity, and you want the ability to scale out, the NVL8-based P family instances become an ideal option. On the far right, we have the P family EC2 Ultra servers. In the Ultra servers, we have several compute trays that are interconnected in a tightly coupled mesh through NVIDIA NVLink switches. These Ultra servers provide the highest performance in generative AI and are ideal for training and deploying multi-trillion parameter models.
To summarize, we looked at some of the recent trends in the generative AI industry and the compute requirements that these are driving. We also talked about some of the investments that we've been making at EC2 to help meet customers' evolving requirements. I'll now hand it off to Ersin to talk about how Adobe developed Firefly on AWS.
Adobe's Firefly Base Model Training: Commercially Safe Video Generation at Scale
The first part was quite technical, and these guys have been preparing for a while, so I'm going to take it down a notch from here, just to set your expectations. The way compute evolves is already fascinating, and we have been hearing that throughout this conference here at re:Invent. I'm going to talk about the application layer, and particularly one thing that we care about at Adobe, which is video models. It's not only about the AI content that you see on the web today, but more so about how we actually productize it and make it useful for our customers, especially the high-end media and entertainment companies and studios.
We launched this product, Firefly Foundry, very recently, about six months ago. The way Firefly Foundry works is we take our base models, which are commercially safe, and we run them through further training on the customer's IP. I'm going to go into all of that and talk about it a bit.
Coming back to the talk from the last half hour, this is about how the infrastructure impacts what we do and what we can do. We're going to start from the training of our base models, just to level set with everybody. The way we do base model training is what we call commercially safe, and that means we not only have the training rights to the data that we train on, but we also moderate and eliminate any trademarks and brands from the content that we use. Of course, we curate the data based on safety, fairness, and robustness, which then results in commercially safe models. When you think about that type of data scrutiny, you end up with a model that's not necessarily extremely powerful, but you know that it's commercially safe. When our customers want to generate something in their own IP space, they have to customize, so we will touch on that in a bit.
Let's walk through how we train and what type of architecture we use for our base models. When we do curriculum training, we generally walk through three steps in our pre-training pipeline. The first step is image-based model training, where compute efficiency is high and we have better utilization of available data. We have a lot of commercially safe stock data on the imaging side, and it basically stabilizes the model, especially the diffusion and DiT layers, allowing the model to learn spatial patterns early on. Then we move to dynamics. We start adding low-resolution video to the training, which helps the model capture large-scale temporal dynamics. Not only that, but it's especially efficient on the compute side. You've just seen the GB300-based platforms, and you don't have to go to that architecture for this step, because you're using a lot less compute and you're very efficient.
But then as soon as you start adding medium-resolution videos, on the model side you're starting to learn facial expressions and fluid dynamics. On the infrastructure layer, when you look at how you need to do parallelization, HSDP and FSDP come into play, and really wrangling all of that is quite time consuming and resource consuming. I'm really looking forward to getting onto the larger setups of the future. Then finally, when we add high-resolution video, this is where we do post-training, and that's where we work on texture quality, prompt alignment, and reinforcement learning. We also do specialized forks of the post-training, because from the same base models that might have been trained up to mid-training, we fork off and create image-to-video applications, video extension applications, and text-to-video applications as well.
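For reference, here is a minimal sketch of the kind of FSDP sharding with mixed precision that the medium- and high-resolution stages call for, using plain PyTorch FSDP with BF16. This is a generic illustration, not Adobe's training code, and it assumes the process group is launched with torchrun.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in for a DiT-style video model; the real model is far larger.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

bf16 = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# FULL_SHARD is classic FSDP; HYBRID_SHARD (HSDP) shards within a node and
# replicates across nodes, which matters once you reach thousands of GPUs.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=bf16,
    device_id=torch.cuda.current_device(),
)

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()      # dummy loss just to exercise one step
loss.backward()
optim.step()
dist.destroy_process_group()
```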
We do all of this training with image, audio, and other media modalities primarily on our AWS infrastructure, so I'll walk you through what we use. We start from bare-bones EC2 at the very lowest level, but we add the managed Kubernetes layer, which is EKS; our training instances run entirely on EKS on AWS. On top of that tooling, we build our own training job management system, which allows us to do three things very efficiently. One is what we call dev box sessions, which are developer sessions that our developers or researchers use on the same machines right before kicking off a batch training run. We can also run batch offline jobs that run when the compute is idle, and of course the priority jobs are the larger training jobs.
To double-click on that a little bit, we are in a unique state where we have maybe thousands of folks who touch a cluster of hundreds of thousands of GPUs. These are researchers, engineers, service engineers, and product teams. What happens is you really need an efficient job management system. You need projects and quotas, and you need to manage how you create priority jobs so that the top-priority work gets done.
The multi-tenant training scheduler is something we built in-house, and we can talk to you about that later if you're interested. Basically, what that allows us to do is manage quotas and create priority jobs so that very important large-scale training jobs don't die. We have a heterogeneous set of compute, and the jobs are distributed into that compute based on those priorities and quotas.
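The scheduler itself is in-house and not public, but the core idea, per-project quotas plus admission by priority, can be sketched in a few lines; every name and policy below is made up for illustration.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    sort_key: tuple = field(init=False)
    priority: int          # lower number = more important (0 = large training run)
    project: str
    gpus: int
    name: str

    def __post_init__(self):
        self.sort_key = (self.priority, -self.gpus)

quotas = {"video-pretrain": 512, "research-devbox": 64, "batch-eval": 128}
used = {p: 0 for p in quotas}
free_gpus = 640

def schedule(jobs):
    """Admit jobs in priority order while respecting per-project quotas."""
    global free_gpus
    heap = list(jobs)
    heapq.heapify(heap)
    while heap:
        job = heapq.heappop(heap)
        within_quota = used[job.project] + job.gpus <= quotas[job.project]
        if within_quota and job.gpus <= free_gpus:
            used[job.project] += job.gpus
            free_gpus -= job.gpus
            print(f"RUN  {job.name:20s} {job.gpus:4d} GPUs ({job.project})")
        else:
            print(f"WAIT {job.name:20s} {job.gpus:4d} GPUs ({job.project})")

schedule([
    Job(priority=0, project="video-pretrain", gpus=512, name="hi-res-stage"),
    Job(priority=2, project="research-devbox", gpus=8, name="devbox-alice"),
    Job(priority=1, project="batch-eval", gpus=256, name="nightly-evals"),
])
```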
On the scaling side, I really wanted to bring something into perspective. If you look at the video models that we have today, the Firefly video models, when you go to the highest resolution of the video, you're talking about maybe 10% of the iterations, if that. That's really the last part of the post-training, but from a FLOPs perspective, it's actually 90% or more of the training. That's where GPU failures, hardware failures, node failures, and networking failures all come into play.
You really need to consider auto-recovery at the software level, because you're essentially building a rocket from hundreds of nodes of commodity hardware. Failures are inevitable, and they happen quite often. So how do you keep that rocket flying while hardware is failing every now and then? We use auto-recovery, which is also something we built in-house. We do pre-flight checks on the hardware, and then while a job is running on that hardware, we consistently drop nodes and bring nodes back into the job if and when necessary. When hardware fails, we do post-flight checks to keep the cluster in a healthy state.
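A simplified sketch of that pre-flight, in-flight, and post-flight pattern; the health checks here are random placeholders, whereas the real system would hook into hardware telemetry, checkpointing, and the cluster scheduler.

```python
import random

random.seed(1)

def preflight_check(node):
    """Placeholder: would run GPU, ECC, and NCCL loopback tests."""
    return random.random() > 0.05

def run_training_step(active_nodes):
    """Placeholder training step: occasionally reports a failed node."""
    return random.choice(active_nodes) if random.random() < 0.01 else None

def postflight_check(node):
    """Placeholder: re-test a dropped node; True means it can rejoin as a spare."""
    return random.random() > 0.3

nodes = [f"node-{i}" for i in range(16)]
active = [n for n in nodes if preflight_check(n)]   # pre-flight before the job
spares = [n for n in nodes if n not in active]

step = 0
while step < 500:
    failed = run_training_step(active)
    if failed:
        active.remove(failed)                        # drop the bad node
        if spares:
            active.append(spares.pop())              # backfill from the spare pool
        if postflight_check(failed):
            spares.append(failed)                    # repaired node becomes a spare
        print(f"step {step}: swapped out {failed}, {len(active)} nodes active")
        continue                                     # resume the step from the last checkpoint
    step += 1
```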
On the parallelism side, when you go up to thousands of GPUs and more, you start thinking about HSDP, which is something you wouldn't necessarily think about at earlier stages with a smaller set of GPUs, where FSDP works a lot better and is generally what we do. On the data type side, FP8 has always been the goal, and this is where you see generative media forking from where the state of the art in LLMs is. Large-scale LLMs can go even lower, down toward the bit level, but for us on the media side, the bit representation is really important for high resolution.
When you think about movies that you see on TV or in cinemas, what you're really seeing is over 4K with 14 to 16 bits of depth. That depth resolution needs to be represented, and today we use mixed precision with FP8 and BF16 to get there. When you're training video models, there are four interesting bugs that you see most of the time, which I'm going to show you. There's almost always this confetti bug that happens with fast motion, whether that's rain or the opening of wings.
Dancing is really tough, and when you look at people in front of the camera for a while, they should be blinking, but initially they don't blink. Those are the bugs that you need to catch, and you can see how hard it is to catch them from a metrics or automation perspective. It's really not at a quantifiable level; it's at a quality level. Expressions also generally end up exaggerated at these early stages of training, so that's what we fix on the post-training side with really high-quality data that focuses on these issues.
Optimizing Inference Pipelines: Adobe's Architecture for Billion-Scale Generations
On inference, I'm going to talk about the anatomy of a general inference pipeline. What you really have is not just the model that represents the particular video generative model, but you add a lot of additional models and business logic around it. Text and images go in if this is an image-to-video model, and the video comes out. But along the way, we actually split not only the business logic itself but also the model itself to be able to optimize how we use the machines in the cluster in the cloud.
What I mean by that is, if you think about the DiT blocks, which are generally the heaviest part of a video generative model, and then the decoder and what you do from a business logic perspective on the safety side, every single box that you see on the screen requires a very different amount of compute. The DiT blocks are the ones you would put on, say, P6s or even more powerful instances, while for some of the encoders and decoders you can get away with even A10G GPUs, which are the G5 instances. So the way we structure our in-house inference pipeline is that every module we break out has its own set of queues and machines dedicated to it in the cluster.
Those machines are picking up jobs only for that particular box. So the DiT block would go into, say, the P6 queues, and the P6s would be crunching on the DiT block. Then we do auto-balancing across the board as well, so an entire inference pipeline could be split at any given time across 20 different machines. On the DiT block specifically, I already mentioned that this is the most compute-intensive part. When you look at the unoptimized model coming straight from PyTorch training, you're looking at one step taking maybe 3 or more seconds, and 40 steps would take maybe 2 minutes or more with a medium-sized model.
If you don't optimize this, you're burning a lot of compute, and the customer experience is not great. So we do heavy optimization, including distillation, TensorRT runtime optimizations, and in-house custom kernels, and then we switch to mixed precision as well at this point to push the gains even further. From an architecture perspective, the way we have built our inference stack is similar to what we do on the training side. We build on top of EKS, but we have what's called Ethos, which builds on top of that and brings compliance across the customer-facing side of things, along with a lot of additional safety and security.
On top of that EKS and Ethos distribution, we build our ML compute frameworks, and then finally we run ML inference on top of those frameworks. Depending on the model type and how it's optimized, the particular framework will be different. I already mentioned that the compute you need for particular parts of the pipeline, even within a single model, is different, and that's how you get to really high utilization of your compute. Communication ends up being a bottleneck if you don't design it well, so what we do is create a global queue for each of those broken-out pieces of the model.
The machines pull jobs from the global queue when they are idle and ready for the next job. We have seen that work a lot better than a push-based method where a global master pushes work to each node it thinks should take the job. The particular broken-out network that we distribute on the inference clusters is only one of the networks we serve at any given time; we have about 100 of these services running at any given time, with many different models and many different applications. So it very quickly becomes difficult for the SRE team to go in and optimize the number of machines that should be assigned to each service and to each of the worker groups.
That's why we built an auto scaler that handles the entire distribution of resources across the services, and then down to the worker groups themselves, automatically. The only job left for the SREs ends up being pulling in more nodes from the cloud if needed.
If you're tapping into too much on-demand capacity, we reduce that by bringing in more reserved instances.
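The pull model described above can be sketched with standard-library queues; in production these would be distributed queues, and the stage names and costs below are illustrative, but the shape is the same: each worker group owns one broken-out stage and pulls work only when it is idle.

```python
import queue
import threading
import time

# One queue per broken-out stage of the pipeline; a real system would use a
# distributed queue service rather than in-process queues.
stage_queues = {"encode": queue.Queue(), "dit": queue.Queue(), "decode": queue.Queue()}
next_stage = {"encode": "dit", "dit": "decode", "decode": None}
results = queue.Queue()

def worker(stage: str, cost_s: float):
    """A worker pulls jobs for its own stage only when it is free,
    then forwards the job to the next stage's queue."""
    while True:
        job = stage_queues[stage].get()
        if job is None:                       # shutdown signal (unused in this demo)
            break
        time.sleep(cost_s)                    # stand-in for GPU work
        nxt = next_stage[stage]
        (stage_queues[nxt] if nxt else results).put(job)
        stage_queues[stage].task_done()

# Heavier stages (the DiT blocks) get more workers / bigger instances.
for stage, (n_workers, cost) in {"encode": (1, 0.01), "dit": (4, 0.05), "decode": (1, 0.01)}.items():
    for _ in range(n_workers):
        threading.Thread(target=worker, args=(stage, cost), daemon=True).start()

for i in range(8):
    stage_queues["encode"].put(f"request-{i}")

for _ in range(8):
    print("finished", results.get())
```

Because idle workers pull rather than being pushed to, a slow or failed machine never becomes a hidden bottleneck; work simply flows to whichever worker in the group is free.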
That's the base model training and the base model services. Remember that at Adobe we serve more than a billion generations every month. The scale of that, and the reason we optimize so heavily, especially on the inference side, thinking bottom-up from the hardware, is that every single optimization we can squeeze in matters a lot given the volume of generations that we serve.
Firefly Foundry: Foundation Model Customization for Studio-Quality IP Content
Now, on the Firefly Foundry side, I'm going to talk about what Foundry is and show where it sits in the Adobe stack. What you see at the bottom is our foundational AI models. The ones in dashed lines are the ones we have brought in from third parties, but focus on the ones that are Adobe-specific, like the video model I showed you; they serve a specific need as base models and are commercially safe.
However, if you want to bring in your own IP and stay commercially safe while using that IP, you previously didn't have anything other than what we call custom models. That was essentially a way to bring in maybe twenty to thirty images and self-serve fine-tune with LoRA on top of our base model, so you could represent a small new character or maybe a background. What Firefly Foundry does is take that customization to the level of full foundation model customization.
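For contrast with that full foundation-model customization, the earlier custom-model style of LoRA fine-tuning can be sketched with the Hugging Face peft library on a small open text model; this is a generic illustration of how LoRA trains only small adapter matrices, not Adobe's Firefly custom-model pipeline, and the model name and target modules are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Small open model used purely for illustration.
base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                        # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["c_attn"],  # attention projection in GPT-2
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Only the small adapter matrices are trained, which is why a couple of dozen
# samples are enough to teach a new character or style, whereas full
# foundation-model customization retrains far more of the network.
```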
We walk back along the training timeline of the model, start from mid-training or maybe post-training, and then add a lot of IP, which could come from a studio. It could be petabytes of data, so it's really rich, high-resolution, high-quality data that the studio owns the IP for and is able to use. We are then able to create video models for them that meet the high quality bar they have for production, making the model production-ready not only from a quality perspective but from a legal perspective as well. It combines the commercially safe data that Adobe has with the IP that you, the studio customer, have.
I'm going to show you one example here. What you see in this video is IP that starts from hand-drawn figures, which are then carried through into 3D representations. Further on, a full set of motions, motion characters, voices, and sounds are generated for all of those characters. The video quickly summarizes a typical IP creation flow: character design, character development, and then generating content for the environment and for the episodes of this IP.
When you think about bringing all of that IP, across many episodes of this Regenerates franchise, into the model, along with the artists' drawings and the artists' textual descriptions of what the environment should feel like and the look, feel, and motion of the characters, it's really rich data. It's not just data scraped off the video portions of it. I'm going to show you what the results are when you take this data and mold it into our models alongside the commercially safe data. The first few seconds of this video are the Regenerates franchise fully generated from a Firefly Foundry model, and then it walks through turning that IP into different styles as well, which is the power that comes from the base model. Taking selfies in front of the hideout, what could possibly go wrong? Who is this guy? We got to get out of here, Fate. Quick, take another selfie. I'm not enjoying this.
This is the original style of the franchise. You can see the power of training a foundation model on your IP this way because it completely enables the creative team to create shoulder content, revitalize their content, and post it onto social channels. It even goes to the level of user-generated content, or UGC, to bring to users as interactive content.
We'll go into one more example. This is Cosmos Laundromat, which you can actually find online. It's an open-source Blender project that was done 10 years ago, an animated short on YouTube, I believe around 12 minutes long. The team took that data and cut it into clips. We created about 2 hours of training data from the 12-minute original by clipping and cutting the video in different ways. Then we trained the Foundry model with that, and the results you're seeing are typical generations from that Foundry model.
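The expansion from a 12-minute short to roughly two hours of training clips can come from overlapping windows at several clip lengths; here is a small sketch of that kind of clipping arithmetic, with window lengths and strides that are illustrative rather than the actual recipe.

```python
source_seconds = 12 * 60            # the 12-minute original short

# Overlapping windows at a few clip lengths; a stride shorter than the window
# means neighbouring clips share frames, multiplying the total footage.
clip_plans = [
    (4.0, 1.0),                      # 4 s clips every 1 s
    (8.0, 2.0),                      # 8 s clips every 2 s
    (16.0, 8.0),                     # 16 s clips every 8 s
]

total = 0.0
for window, stride in clip_plans:
    n_clips = int((source_seconds - window) // stride) + 1
    total += n_clips * window
    print(f"{window:4.0f}s window / {stride:4.0f}s stride -> {n_clips:4d} clips")

print(f"total training footage: ~{total / 3600:.1f} hours from a "
      f"{source_seconds / 60:.0f}-minute source")
```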
I'll show you one with a particular prompt. Victor is the red-headed man and Franck is the black sheep, and they are in this mystical world of floating islands. What you're seeing here, if you really read the prompt and look at that motion, you're controlling the camera and the actual characters in the franchise. It's extremely powerful to be able to create clips like this. These video models are good enough from a base model perspective, but even if you look at the strongest open-source models, you wouldn't be able to create these results exactly for these characters. That's because when you train on an average basis and not necessarily the specific franchise, you're not thinking about Victor and Franck. You're thinking about the entire world of knowledge, which is good for the base model but not necessarily for high-quality IP content generation.
Here's another example where Victor and Franck are prompted by just their names. It's much less mental load for the creative working with the system as well. I'm going to stop there. Thank you for showing up to the last session of the last day. We can take either questions or meet with you offline after the talk. Please do fill out the survey that should pop up on your mobile phones to give us feedback. Thanks everyone. Maybe we can take questions.
; This article is entirely auto-generated using Amazon Bedrock.