Kazuya

Posted on Dec 6, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Tapping into the Power of Agentic AI: Driving Mission Success with NVIDIA & AWS

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Tapping into the Power of Agentic AI: Driving Mission Success with NVIDIA & AWS

In this video, NVIDIA and AWS partnership for generative AI agentic design is explored, covering the evolution from LLMs to RAG, fine-tuning, and agents. The session explains NVIDIA's terminology including NEMO (neural modules), NIMs (NVIDIA Inference Microservices), and Blueprints for Kubernetes deployment. Key challenges in scaling agents to production are addressed, emphasizing exponential token growth, security, and data governance. The NVIDIA Nemo agent toolkit offers modularized solutions with framework-agnostic support, delivering 57% fewer lines of code, 16x faster data processing with Nemo Curator, and 2x faster response times. Demonstrations include parameter efficient fine-tuning (LoRA) and deployment options across AWS services like EKS and SageMaker, with resources available at build.nvidia.com and the AWS Marketplace.

; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The Evolution of Generative AI: From LLMs to Agentic Design with NVIDIA and AWS

Hello, hello. Thanks everybody for attending. I hope you can hear me just fine and you're ready to learn about agent design for mission success with NVIDIA and AWS. It's a strong partnership that we've been working on for a long time, and it keeps increasing.

If you think about NVIDIA and AWS, we go hand in hand. Today, specifically, we're talking about generative AI agentic design. It's very important, and it seems like it's been a very long time, but as we see the dates, the adoption started in January 2023 with generative AI. People became comfortable with LLMs and comfortable with using human language to interact with very highly intelligent scientific models.

As we go forward, scientists and human users started interacting with RAG and being able to increase the effectiveness of these general purpose models, which is very important. You want to create something that is for your mission success, so that's where RAG came in. Parameter efficient fine tuning came in next, and then came agents. It's just the natural progression of what we're doing.

Agents are really important because they put together many hyper-focused, intelligently designed subsets of models to be orchestrated by another model. As you go forward, it's very important to think about agents as a huge ecosystem rather than just one single monolithic being.

Before I jump right into what that is, I want to throw some terms out there so you're not scratching your head as we go on. This is NVIDIA's terminology for how we put together our generative AI processes, AIML pipelines, and ultimately, AI agentic design. NEMO stands for neural modules. Over a decade ago, we identified the modularization of the machine learning pipeline process, which is very important because it's not a single scientific process. Each thing along the way needs to be optimized, or else you reduce in performance.

NEMO Agent Toolkit specifically uses NEMO modules for agentic design. NVIDIA Inference Microservices, or NIMs, are containers. Most importantly, they're optimized containers of frontier models or Hugging Face models with NVIDIA driver packages, with RAPIDS for Spark processing, large data jobs, and so on. It depends on what it is. Each one is specific, but it allows you to run inference very quickly by changing just one line of code.

Blueprints take a bunch of containers, and you need Kubernetes to implement them. So blueprints are our term for Helm charts and implementation of multiple containers, such as NIMs. This gives you the effectiveness design of that machine learning pipeline, so you can just go ahead and lift and shift and put that into your process. The ultimate goal is to use the word scale a lot, accelerated computing a lot, and we want to get you into production. It's very easy to start with something that's really good and end in a fizzle because as you start to scale out, a lot of things get amplified and can be run incorrectly.

Understanding Agent Complexity and the Production Scaling Chasm

So what is an agent? What does it look like? How does it work? We hear a lot about it. Here we have an avatar in the middle, which is kind of our orchestrator agent, just telling what to do. Multiple elements are involved in agentic design. You have tool use, computer use. Memory is very important. When you use an agent, that memory gets maxed out rapidly. Other agents start to talk with them and interact with them. It's an ecosystem again. And that human is the input prompt. Then we create tokens, then we do more processing.

A lot of people may not be aware, maybe you are a developer conference, but agents actually have an exponential increase of tokens. That reasoning is just creating more and more tokens. As a human, you don't see it unless you ask the model to show you its thinking process. So that leads to the complexity of the whole situation.

The agents you're going into production with have a lot of legacy code, a lot of existing pipelines, and heterogeneous data coming from all types of sources. To get that to perform reliably is quite difficult, and you want to think ahead. It's kind of like defensive programming, defensive data science. You're expecting the pitfalls before you get there because what happens is you enter a chasm.

The more you scale, the more that chasm grows. It deepens and expands, and so many pitfalls happen. The worst thing is you have an amazing POC. You get great stakeholder buy-in, you get investments, you get people supporting you. And then as you go forward, you start to see performance suffer because the governance of the data gets difficult, the profiling gets difficult. Security is very, very difficult for agents because humans need to be authenticated, and so do computers need to be authenticated rapidly at scale.

All of these considerations need to be thought about and addressed as you're going into production level. So let's avoid the chasm and show you how. This is a simplified chatbot. It has quite a few arrows, and it takes considerable learning to understand this process. However, understanding the machine learning pipeline process is quite important, and knowing where everything goes is essential. If you want to scale this out from your proof of concept, every single white box probably represents ten more things added to it, with more and more processes stacked on top.

You start to hit a lot of issues with that because you're multiplying everything by ten. All your problems are amplified and everybody can see them. The last thing you want is for users to use your product and have it immediately fail. We want to provide the best developer experience. NVIDIA creates things for people who code and for companies who code. We want to get our services, software frameworks, and SDKs into the hands of those developers so that you can implement them into your production pipelines.

NVIDIA Nemo: Modular Tools and Blueprints for Accelerated AI Development

The best thing about NVIDIA Nemo blueprints and similar tools is that they're mostly free. You can go on GitHub and use them. You can go with a developer license, which is just a login to our NGC container registry, and you can pull any of these and use them. In the diagram, we have NVIDIA Nemo, which represents each different process. Coming from the bottom up, people start with GPUs and accelerated computing, which is very important. But in order to get the best performance out of what you've purchased or what you're using on AWS, you should use some accelerated libraries for that.

We put these out for free so that you don't have to learn a whole new set of skills just to make sure everything's running correctly. Understanding and reasoning, AI safety, and all of these issues are addressed with the line going across with the NVIDIA Nemo agent toolkit. This is a collection of itemized modules to accelerate the pipelines for agents. At the top are different types of blueprints with Kubernetes Helm chart design. You can work bottom up and put all these together, and we have all the machine learning pipelines figured out for you. It's modularized, so you can plug and play. You can use the NVIDIA Nemo packages and then work with whatever other frameworks you happen to be using. We're framework agnostic.

I like to say science a lot when I'm talking about machine learning and AI because I think it's very important to remember that even though we're interacting with human language, we're still interacting with a computer. That computer needs to be taught to do things, and it's on the human to be creative, to fix it, and to identify where to accelerate. The computer doesn't know where it's messing up; it'll just tell you a random error most of the time, if you even get an error. So we start with that, with the data stacked right there, and just start to rotate around this flywheel.

Everything here is very important as part of the scientific process for agentic AI design. You want to curate your data and customize it based on your proprietary holdings. You want to evaluate it and make sure it's good. It's probably a good idea to implement guard rails so you don't want it to go off the rails. You don't want it to hallucinate, you don't want it to experience data drift, and so on. These are big problems. Lastly, you figure that out, create the NIM, and create the container. Now you have a snapshot in time of reliable processing that you can use and plug and play. It's very easy to work with containers and update them.

A data flywheel is self-sustaining. Next, I want to call out a few things that were included in the agentic design. I apologize if it's difficult to read, but the framework is agnostic. If you're using LangChain, that's great. If you're using something else, that's great too. We're here to support you. Semantic Kernel is cool too. YAMLs can be difficult if you've never interacted with them, but once you get comfortable, they're quite easy to change. You can also interact with a YAML via Python. There are lots of ways to work with this.

Safety and security are really important. Evaluation is important as well. The agentic ecosystem connectors, like MCP and custom plug-ins, are things we want to help with. We want to alleviate the stress of you having to write those, so we give them to you so you can start to connect all of these tools together. This leads to the best accelerated production from your pipelines, happier stakeholders, and stakeholder buy-in. It's kind of a flywheel as well as a go-to-market strategy.

Here are some metrics for you. As a coder, I like to think about fifty-seven percent fewer lines of code. That's weeks of development time and expertise that you've now reduced so you can actually get to implementing your services and your production. Higher throughput is very important. That's your tokenization. When you're accelerating the amount of tokens, that means it's faster, it's less latency, and you can possibly use less compute as well, depending on how you work with it. And then there are faster response times.

This is specifically for the healthcare virtual assistant, which is a blueprint for healthcare. If I'm a patient, I really enjoy almost 2x speed up of the response time. A couple of other metrics to call out:

Nemo Curator previously relied on CPU-based data munging and data processing, which is great, but it can only go so fast. If you start to process that data on GPUs with accelerated computing, you're going to see a 16x speed up in this example. That's quite fast and it allows you to get to building your models quicker. Nothing's worse than cleaning up data and sitting there watching it process. I like to get to where we're implementing the data, not just watching it and getting it ready. Data pipelines, deduplication, and getting large data sets to be processed all at once is quite important instead of waiting and chunking. We want to get to the cool stuff and then develop anywhere with the Python API, which is a very good language and Python's almost everywhere. API calls are ubiquitous, so we keep it as simple as possible.

Nemo customization is quite important because you want to customize your models on your own data. There's no bigger pitfall than trying to use a generic model on something hyper-specific. You're not going to get performance on that, and you're not going to see the results you want as a human being. It's going to give you generic ideas, and agents are designed to reduce that generic and make everything hyper-specific.

The Nemo customization architecture is another way to look at that. I showed you some metrics, but if we look at the customization architecture, there's parameter efficient fine tuning or low rank adaptation. If you've heard of those before, that's awesome because that's what ties into agents. You're creating subsets of a generic model that's hyper-specified on your proprietary data. It's the best way to do it. If you look down here where it says training type, you just type LoRA and now you can fine tune. The rest is done for you.

Another way to look at this is it's kind of linear, but then you get to the end and there's a bunch of flywheel rotating and thinking through it. This is all an iterative process. Gone are the days of something that you just hit run or send off and it's good to go, and you sit back and watch it happen like a DBA. Now even DBAs have to be interactive with everything. It's keeping us on our toes, but it's really fun because the amount of really awesome software and products being produced is astounding. It's very useful. Just look around the expo and you can find a lot of this implementation.

With Nemo, it's pretty interesting. People may not know, but NVIDIA software such as frameworks and SDKs already exist in a lot of services you use, such as much of the AWS stack. If you open search, for instance, we just got implemented with that. So the list goes on of what you can use us for. Just keep that flywheel in mind. Ultimately, we've worked through the Nemo processing to create optimized agents with the agent toolkit.

Now everything's involved in this agent toolkit. It's modularized, you can use it off the shelf, and it's ready to go. You won't have to worry about where to go to fix a problem or how to write a new process. Well, it's done for you. You just implement the process, which is way more fun. Creating a good developer experience is key to creating good AI agents.

Deployment Options and Getting Started with NVIDIA Services on AWS

So where can we find all this and how can we use it? Any framework you want to work with, I have many listed here. If you work with any of these frameworks, that's great. If there's some you don't, reach out to us. One thing about NVIDIA is we want to find out your hardest problems. Whatever you're struggling with, if you can find an NVIDIA representative, we always answer the phone. We'll work with you to fix that problem. Where can you deploy a lot of what we've talked about?

Well, pretty much everywhere in the AWS compute stack. I recommend EKS, especially as you go to scale. EKS really does make it the easy button for deploying such as your blueprint design, and agents really are blueprints at Kubernetes. It's a lot of work happening behind the scenes to implement an agent. You want to get the best performance out of that. You also want to get telemetry. You want to see how things are running and the profiling. AWS makes that easy for you, and our partnership with AWS helps alleviate a lot of the issues. I just want to also call out SageMaker. It's a great place to develop and design what you're working on.

We are on the marketplace, and I heard earlier people talking about top secret implementations. We actually are implemented in ICMP and have an offering there in the Marketplace, as well as commercial options. You can find us wherever you may be via a private offer, or if you just want to use the services, you can do that and use runtime.

I do want to say that we have a big booth over there. If you haven't seen it, it's booth 10:22, and we've covered quite a few of the topics I just went over. I think reconstructing large scale environments is something really cool. The best way to interact with generative AI and other models is to see it in action. When it's text-based, it's hard to really wrap your head around it. If you can see what it's like with computer vision, and now we do VLMs and things like that, it's a good way to demonstrate what you're getting. We also have edge to cloud vision and AI agents, which is quite important.

What's really neat is that the people staffing our booth are the actual engineers who worked on these solutions. Feel free to deep dive and ask some really hard questions, or if you're interested, we'll talk to you for as long as you'd like. I want to leave this up for a minute if anyone's interested in where you can interact with our services, software, and SDKs. On the far left, we have the marketplace where you can see what we have listed there. We have AWS and NVIDIA tech blogs on how to implement Nemo fine-tuning, for instance, on SageMaker. We have the whole walkthrough design for you, showing how to take that and make your AI agent and run that in EKS. All you need to bring is your data, or if you just want to run through the tutorials, that's fine too. That's always a good starting point.

Lastly, I highly recommend everyone check out build.video.com. You can lose hours playing with this. It's an AI generative AI playground of all the NEMs and blueprints we keep creating. We have hundreds on there now, and we keep creating more. It allows you to not only see what's listed and read about it, but you also get the technical documents for how to work with what's there. You can play with it, and with just one click, you can create an API key, copy that code out, and put it into production. It's quite simple. We took all the thinking out of it, so you just work with it.

Since I have a couple of minutes, I'll expand a bit on what API keys are and how we're interacting with the NIEM, the NVIDIA Inference Microservice containers. When you're calling a generative AI model that way, we use the OpenAI protocol. It's quite simple because you just change your direction, like what model you want to use. That's one line of code you just change. The second line is your API key, which is a copy and paste. After that, you're good to go. If you wanted to do fine-tuning or training type LoRA, that's three lines of code. You don't have to spend weeks figuring out how to do this and doing it from scratch.

I think we still love to read scientific papers and then write the code ourselves, but why start from scratch when you can use an existing framework? With that, I'll say thanks for having me. I appreciate your time, and I hope you have a good expo.

; This article is entirely auto-generated using Amazon Bedrock.