🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerate gen AI and ML workloads with AWS storage (STG201)
In this video, Monica Vyavahare and Jordan Dolman from AWS demonstrate how customers build and scale AI use cases using AWS storage services. They cover a progression from prompt engineering to RAG (retrieval augmented generation) with vector search, introducing Amazon S3 Vectors, which offers up to 90% cost reduction and supports up to 20 trillion vectors per bucket. The session explains metadata filtering for optimized searches, MCP (Model Context Protocol) for agent-tool integration, and advanced techniques like supervised fine tuning, distillation, and continued pre-training. They showcase real examples including a biotech firm searching 30 million scientific papers and Meta's 140 terabit per second training infrastructure, while detailing storage solutions like Amazon FSx for Lustre, S3 Express One Zone, and integration tools like Mountpoint for Amazon S3.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Making AI Work with Your Data
Hey everyone, thanks for joining. Good crowd. My name is Monica Vyavahare, and I'm a senior product manager with Amazon S3, and I'm joined by Jordan Dolman, principal product manager with AWS storage. Today we're going to cover how customers are using AWS storage as they build and scale new AI use cases. It's no secret that everyone wants to build with generative AI, but the real challenge is how to make it work for your data, for your business at a cost that makes sense. Today we're going to show you how customers are achieving this all built with AWS storage.
The key to making AI work for you is your data. Generic AI gives you generic answers, but your data—like your customer feedback, your usage patterns—that's what makes AI valuable for your business. What we're seeing sets customers apart today is the ability to access relevant data to fuel multi-step AI agentic workflows. When your AI agents can get access to the right data at the right time, they can better understand your business and your customers, and that can dramatically help you improve your productivity.
So let's start with the core challenge. I have a given task that I want to improve productivity for, and I want to use an existing large language model or an LLM. I want to use a foundational model out of the box, but I need it to respond based on my data. But here's the problem: LLMs are trained on static data. They don't know your latest information or your business context or your use cases, so they give generic answers and sometimes hallucinate.
So we're going to discuss several approaches for building AI workflows for your business using existing foundational models. This is where most people start today because it's pretty easy to go from an idea and the data you have to higher productivity. As we cover different approaches today, they're going to increase in cost, time, and complexity, but they're also going to dramatically improve output quality, so you can pick which approach on this scale is right for your business.
Prompt Engineering: The Simplest Approach to Improving AI Productivity
Let's start with the simplest approach: prompt engineering. If you have a single task in mind and data, odds are you're starting with prompt engineering. You give the LLM examples, context and constraints to guide the response. I can give you a personal example. At Amazon, before we build or launch anything, we start with the PR FAQ document. This consists of a press release as well as a series of internal and external FAQs to anticipate what customers are going to ask and what hard questions they're going to challenge us with after we launch. This helps us work backwards from the problem to know exactly how to define the right product shape and build the right thing.
To help me make this PR FAQ document, I have a saved prompt in a markdown file that I use with a CLI. In it I've included examples of good PR FAQs that have been approved in the past, including some big launches like you probably heard about yesterday. I've also included context like notes from my meetings with customers so I can hone in on what the real problem statement is, as well as constraints like legal guidelines from our legal department on messaging do's and don'ts. The result has really helped my personal productivity. A process that used to take weeks to get to a good first draft now gives me something in minutes, and within a few days I have a reviewable draft that I can share with stakeholders and my team.
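To make the structure concrete, here's a minimal sketch of how such a saved prompt might be assembled from examples, context, and constraints. The file names and section headings are hypothetical, not taken from the talk.

```python
# Hypothetical sketch of assembling a PR FAQ prompt from saved examples,
# meeting notes, and legal constraints. File names and structure are
# illustrative placeholders, not from the session.
from pathlib import Path

def build_prfaq_prompt(idea: str) -> str:
    examples = Path("prompts/prfaq_examples.md").read_text()         # approved past PR FAQs
    context = Path("notes/customer_meetings.md").read_text()         # customer meeting notes
    constraints = Path("guidelines/legal_messaging.md").read_text()  # messaging do's and don'ts
    return (
        "You are helping draft an Amazon-style PR FAQ.\n\n"
        f"## Examples of approved PR FAQs\n{examples}\n\n"
        f"## Customer context\n{context}\n\n"
        f"## Constraints\n{constraints}\n\n"
        f"## Task\nDraft a PR FAQ for: {idea}\n"
    )
```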
So this has really helped my productivity, but how do I extend this capability to my team so that everyone can benefit from this? Does it scale if I need to include hundreds of documents or multiple different data sources? I can't manually add in all of that context because, as you must have experienced, the context window overflows and once that happens, quality tends to degrade. What we really need is to give the model access to massive amounts of data, but give it the ability to find and use only the most relevant bits of information so the context window doesn't get overwhelmed.
RAG: Scaling AI with Retrieval Augmented Generation and Vector Search
RAG, or retrieval augmented generation, is a scalable solution to this problem. You retrieve the relevant information, augment your original prompt with that relevant context, and use it to generate a better response. Remember, we're trying to give these foundational models access to your evolving data without manually having to load everything into the prompt.
To bolster your prompt, we're going to use RAG, which uses semantic search to find and return relevant data from any size data lake. Here's how it works. We start by converting your data into a vector. A vector is basically a numeric representation of your data that captures its meaning. Once all of your existing data is converted into vectors, when you make a prompt to your AI model, that query is also converted into a vector.
We use spatial similarity to search that query vector against all of your other vectors to find similar content. We then use those close matches to augment your original prompt, and that augmented prompt is fed into the model to get a better response. This focused, relevant context, rather than the generic knowledge the model was trained on, helps you get more accurate answers and makes the model less likely to hallucinate.
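Here's a minimal sketch of that retrieve-and-augment step in plain Python with placeholder embeddings. It only illustrates the idea of cosine-similarity search and prompt augmentation, not any particular AWS service.

```python
# Minimal sketch of the semantic-search step in RAG: compare the query
# vector against document vectors by cosine similarity, then prepend the
# matching text to the prompt. Embeddings here are placeholders.
import numpy as np

def top_k_matches(query_vec, doc_vecs, docs, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity against every document vector
    best = np.argsort(scores)[::-1][:k]     # indexes of the k closest documents
    return [docs[i] for i in best]

def augment_prompt(question, matches):
    context = "\n\n".join(matches)
    return f"Use the following context to answer.\n\n{context}\n\nQuestion: {question}"
```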
Before you can start using your data for semantic search through RAG, you need your data in some type of format to access and vectorize it in the first place. Most of our customers use S3 to store their data. That's because S3 is low cost and scalable to any workload size. There are two main approaches to get your data into S3. One is batch ingestion, and the second is real-time ingestion.
Batch is a useful technique when your data doesn't change frequently, such as documentation, historical records, and product catalogs that aren't evolving all the time. Another good reason to use batch is if you need to preprocess your data, like chunking or generating embeddings, which isn't something you can do scalably in real time. The second technique is real-time ingestion, which is a good tool when your data is changing frequently, such as live social media feeds or live transcripts from customer support calls. You don't want a dated pulse on what your customers are saying, so you can't rely on batch runs from a week or a month ago that would have stale data.
You can use Amazon Kinesis or SQS as a simple way to ingest real-time streaming data. It collects, processes, and loads the data into S3. There are also several ways to do batch ingestion. For RAG workflows, usually batch is what customers choose because it's simpler and more cost effective.
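As a hedged sketch of the real-time path, here's what pushing a support-call transcript snippet into a Kinesis data stream might look like with boto3. The stream name is hypothetical, and the downstream delivery into S3 would be configured separately.

```python
# Hedged sketch of real-time ingestion: pushing a transcript snippet into a
# Kinesis data stream that a delivery pipeline lands in S3. The stream name
# is a placeholder; delivery to S3 is configured outside this snippet.
import json
import boto3

kinesis = boto3.client("kinesis")

def ingest_transcript(call_id: str, text: str) -> None:
    kinesis.put_record(
        StreamName="support-call-transcripts",   # hypothetical stream name
        Data=json.dumps({"call_id": call_id, "text": text}).encode("utf-8"),
        PartitionKey=call_id,
    )
```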
Now we have our data in S3 and we want to vectorize it so we can use it for semantic search with RAG. What are vectors? Vectors are basically the mechanism that makes RAG more powerful through semantic search. You can vectorize any type of data. For this example, let's use text documents. You have your documents and they are converted into chunks, which is basically a finite set of characters of text in this case. Those chunks then go through an embedding model to generate vectors, and some embedding models also attach metadata to help refine your search to the vector.
These embeddings and metadata are stored in a vector database. Remember, these vectors now capture your evolving data so that we can use it later to augment your query. There are several choices you need to make in this pipeline. One, you need to choose your data source, which could be an S3 bucket or a prefix if you want more granularity. You also need to choose your chunking strategy. Imagine you're a film studio and instead of text documents like we're doing here, your data is movies. You may choose to chunk by act or by scene or some more granular time frame. Then you need to choose your embeddings model and finally your vector store.
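Here's a hedged sketch of the chunk-and-embed step, assuming the Amazon Titan text embeddings model on Bedrock. The model ID and response field are assumptions to verify against whatever embedding model you actually choose.

```python
# Sketch of the vectorization pipeline: naive fixed-size chunking plus an
# embeddings call. Assumes the Titan text embeddings model on Bedrock
# ("amazon.titan-embed-text-v2:0"); check the model ID and request/response
# shape for the embedding model you actually use.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def chunk(text: str, size: int = 1000) -> list[str]:
    # naive fixed-size chunking; real pipelines often chunk by section, act, or scene
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",          # assumed model ID
        body=json.dumps({"inputText": chunk_text}),
    )
    return json.loads(resp["body"].read())["embedding"]  # assumed response field
```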
You can manage this pipeline on your own, but we also have Knowledge Bases for Amazon Bedrock, which can manage it for you. It gives you a simple way to configure the entire pipeline, and the nice thing is that when you upload new documents, they're automatically ingested as vectors and land in your vector database.
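For example, once a knowledge base is set up, a single call can retrieve relevant chunks and generate an augmented answer. This is a hedged sketch; the knowledge base ID and model ARN are placeholders.

```python
# Hedged sketch of querying a Bedrock knowledge base so retrieval and
# prompt augmentation are handled for you. IDs and ARNs are placeholders.
import boto3

agent_rt = boto3.client("bedrock-agent-runtime")

resp = agent_rt.retrieve_and_generate(
    input={"text": "What did customers say about upload performance?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",   # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/your-model-id",  # placeholder
        },
    },
)
print(resp["output"]["text"])
```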
Amazon S3 Vectors: Transforming the Economics of Vector Storage
Vectors are essential for RAG, but managing large volumes of vectors can be challenging and expensive. Over the past few years we've heard three main problems from customers. First is cost. Many traditional vector databases bundle storage, memory, and queries together as a single unit, making it cost prohibitive to deploy large vector sets.
We've also heard that scalability is a challenge, so it's difficult to scale from small proof of concepts to those large production data sets. And then finally granularity. We've heard that many customers need millions of separate indexes for multi-tenant applications, and once you start to get to that scale, costs tend to spiral out of control. The real theme that we've heard across all of these was cost. Customers needed a more effective way to store and manage vectors. That's why this week we launched general availability for Amazon S3 Vectors.
We're really proud of this launch. It's the first cloud object store with native support for storing and querying vectors, and it has completely transformed the economics of AI. S3 Vectors offers up to 90% lower costs for uploading, storing, and querying vectors. We offer 100-millisecond warm query latency, because we're able to cache frequently made queries, and subsecond cold query latency. You can store up to 2 billion vectors per index and up to 10,000 indexes per bucket. That's over 20 trillion vectors in a vector bucket, and you can have 10,000 of those. We also offer fast ingestion so you can get to work quickly. You only pay for what you use, and the best thing is this is built on top of S3, so customers get access to the attributes that they know and love about S3, like availability, durability, security, and compliance.
S3 Vectors pricing is fundamentally different from that of traditional vector databases. Most of them bundle compute, memory, and storage, and you pay for it all together. You provision capacity upfront and then pay around the clock. S3 Vectors changes that completely with three pay-per-use pricing components that actually align with your usage. The first is ingestion: you pay for the vectors you're putting into the vector store. The second is storage: you can leverage S3's industry-leading economics to store vectors at a fraction of the cost. And finally queries: you pay for the queries you're actually making. There's no capacity planning necessary and no infrastructure overhead, and this is really useful when your application has varying workloads.
You can imagine during the workday maybe your team is making lots of queries, but at night there's hardly any out of business hours. This way you're only paying for what you're using and when you're using it.
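As a hedged sketch of what that usage might look like, here's a put-and-query flow assuming the boto3 s3vectors client and its put_vectors and query_vectors operations. Operation names, parameters, and the example dimensions should be verified against the current S3 Vectors documentation before use.

```python
# Hedged sketch of writing and querying vectors with S3 Vectors. The client
# name, operation names, and parameter shapes are assumptions to verify
# against the current SDK; bucket and index names are placeholders.
import boto3

s3vectors = boto3.client("s3vectors")

# Placeholder embeddings; in practice these come from your embedding model.
embedding = [0.01] * 1024
query_embedding = [0.02] * 1024

# Ingest: store an embedding with metadata attached for later filtering.
s3vectors.put_vectors(
    vectorBucketName="research-vectors",   # placeholder vector bucket
    indexName="papers",                    # placeholder index
    vectors=[{
        "key": "paper-00042",
        "data": {"float32": embedding},
        "metadata": {"topic": "gene-synthesis", "year": 2024},
    }],
)

# Query: find the nearest vectors to the query embedding.
result = s3vectors.query_vectors(
    vectorBucketName="research-vectors",
    indexName="papers",
    queryVector={"float32": query_embedding},
    topK=5,
    returnMetadata=True,
)
```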
Here's a customer example. We launched the public preview of S3 Vectors this summer, and this customer has been with us since then. We've gotten to really work with them and see how they've scaled. This is a biotech firm, and they've been using S3 Vectors for semantic search on scientific literature. Their team consists of PhD scientists and entrepreneurs, and their goal is to discover the next breakthrough in drug development. To do so, they need to know and absorb the large corpus of scientific literature and knowledge that already exists, and this isn't just reading a few papers every morning with breakfast. The scope of this problem is 30 million scientific papers.
So before they integrated with S3 Vectors, this research phase would take weeks, and still you can imagine it's not possible to absorb all of this knowledge. Since integrating with S3 Vectors, this is what their pipeline looks like. They've ingested the entire corpus of scientific literature that they had access to. Those were then generated into millions of vector embeddings which now have landed in S3 Vectors. Now when they're exploring a hypothesis, it's as simple as performing semantic search on these vectors to understand what is nearby and what is relevant. This has dramatically reduced their research timeline.
Metadata Filtering: Making RAG Smarter and More Targeted
OK, so let's recap what we've done so far. We have all of our vectorized data in S3, and we're performing RAG workflows on it to get more relevant context. But here's the thing: when you perform a RAG workflow on this data, it's searching against all of your data. We can make RAG even smarter with metadata filtering. This metadata filtering helps to narrow the search space so your queries are faster and more accurate and more targeted to what you're actually trying to achieve. So for example, instead of those 30 million documents, this firm could choose to select only documents about gene synthesis.
You can use AWS Glue to catalog structured data, and you can add custom metadata for unstructured data. You can use metadata filtering directly inline in your RAG workflow. It's like adding WHERE clauses to your search to make it faster and more effective. As you continue to process your data, more metadata gets generated. This rich metadata becomes the nervous system of your entire AI operation. Metadata gives you context, like knowing the data you're looking at is Q4 sales data. It also gives you lineage, like where the data actually came from and how it's been transformed, which is really useful for root causing issues. Finally, it can give you classification, like auto-tagging sensitive data such as personally identifiable information so that it's managed correctly.
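To show what that WHERE-clause-style filter can look like in practice, here's a hedged sketch of a Bedrock knowledge base retrieval with a metadata filter. The filter operator syntax and the IDs are assumptions to confirm against the API documentation.

```python
# Hedged sketch of metadata filtering during retrieval: the filter narrows
# the search to gene-synthesis documents instead of the whole corpus.
# The filter operator shape is an assumption; IDs are placeholders.
import boto3

agent_rt = boto3.client("bedrock-agent-runtime")

resp = agent_rt.retrieve(
    knowledgeBaseId="KB1234567890",               # placeholder knowledge base ID
    retrievalQuery={"text": "CRISPR delivery mechanisms"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": {"equals": {"key": "topic", "value": "gene-synthesis"}},  # assumed operator syntax
        }
    },
)
```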
Metadata helps AI understand not only what your data is, but what it actually means for your business. So far we've built a complete RAG system. We've ingested data into S3. We've vectorized it for semantic search. We've empowered it with metadata filtering to make it more effective, and we've introduced cost-effective vector storage. Many customers choose to stop here because this is already a very powerful AI engine and it delivers results at scale.
AI Agents and MCP: Enabling Complex Multi-Step Workflows
But what if you want more? What if instead of a single question like "what are similar scientific documents to this problem I'm exploring," you want to break down a complex task, give it power with tools and access to other data, and then have it reason through a complex problem? Earlier I introduced prompt engineering with an example of how I've used it to improve my productivity with PR FAQ writing. We are now entering roadmap season for next year, and it's been great to have many conversations with customers this week at re:Invent, as well as over the last year, to help us define what to build next.
Can I make a roadmap agent using all of this information and several different tools? What if I want customer feedback from my notes this week as well as product reviews we have online? I want to check our S3 tables for customer usage data to find top customers, and I want to combine those to find the top pain points that customers are discussing. Then I want to write a PR FAQ for each of those pain points as a pitch for a new feature we should launch. That requires the LLM to reason through a complex task, use multiple data sources and tools, and perform those actions. RAG isn't going to cut it anymore. Now I need agents.
With agents, you give your LLM a set of tools up front. Things like Bedrock Knowledge Bases for access to documents, S3 tables for access to usage statistics, and APIs to pull customer feedback from social media. Here's the big shift. Before this with prompt engineering, we had to guide the model step by step. Now agents figure out the steps, figure out what data is needed, and determine what are the right tools to get it. If we look at this diagram, we were already prompting the model with goals, instructions, and context. Tools are the missing piece here. This is all fed to the agent who's using this foundational model to perform actions.
Managing all of these tool integrations can start to get very complex. This is where MCP comes in. MCP, or Model Context Protocol, is becoming the standard for how tools connect with agents. It's really similar to how HTTP standardized how applications can talk to backends. There are two parts to this: MCP clients and MCP servers. MCP clients define the how—how do you query a database, how do you search through a knowledge base. MCP servers actually execute on that. They check those constraints and those rules and then actually execute that query and tailor the results based on how you've configured it.
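For a sense of what the server side of this looks like, here's a minimal sketch using the FastMCP helper from the Python MCP SDK. The API shape is assumed from that SDK, and the tool itself is a hypothetical review lookup.

```python
# Minimal sketch of an MCP server exposing one tool, using the Python MCP
# SDK's FastMCP helper (assumed API; check the SDK docs). The tool body is
# a stub standing in for a real knowledge-base or review-store call.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("roadmap-tools")

@mcp.tool()
def search_product_reviews(query: str, max_results: int = 5) -> list[str]:
    """Return product review snippets relevant to the query."""
    # In a real server this would query your review store or knowledge base.
    return [f"(stub) review matching: {query}"][:max_results]

if __name__ == "__main__":
    mcp.run()   # the agent's MCP client connects to this server and calls the tool
```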
AWS is building several MCP servers for our services, as well as for external services, to help your agents find the right data through the right tools. For example, we have an MCP server for Bedrock Knowledge Bases. Now you can query knowledge bases with natural language, no API calls necessary. With this, you can filter to target specific sections of the knowledge base, configure the result size, and re-rank outputs to improve relevance to what you're searching for.
You can also do this conversationally. So now my roadmap agent can ask, "What are the key limitations that customers are reporting in our product reviews?" and it's all configurable. In addition to RAG for semantic search, we also need traditional ways to filter and find data. That's where metadata comes in. You can generate and search metadata with S3 Metadata, and this works for both structured and unstructured data.
Let me walk you through how this works. Let's say you have lots of types of data like PDFs, CSVs, audio files, video files—you name it—and they're all landing in your S3 bucket. First, you want to configure your source bucket and configure S3 Metadata on your source bucket. Second, you want to create an S3 Table bucket. This is where a queryable metadata table will live. You can do both of these through a single API call or just a few clicks in the S3 console.
Finally, S3 will generate a metadata table in that table bucket, and it automatically updates every few minutes as your data is evolving. It auto-updates for new objects, it auto-tracks system metadata, and you can configure your own custom metadata. RAG workflows can leverage this data for metadata filtering for more optimized and targeted searches, and you can also query it on your own for your own use cases.
You can use Athena, QuickSight, Spark, or any SQL-based process to gain valuable insights from your metadata. This is very useful for RAG and for analytics, but it's also useful for agents. Now you have this rich cataloged metadata and your agents can also access this. This year we also launched MCP Server for Amazon S3 Tables, so now you can use natural language to interact with S3 Tables and S3 Metadata. No SQL required anymore.
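As a hedged example of querying that metadata yourself, here's what an Athena query kicked off with boto3 might look like. The database, table, and column names are illustrative, so check them against the schema that S3 Metadata actually generates for your bucket.

```python
# Hedged sketch of querying the auto-generated metadata table with Athena.
# Database, table, column names, and output location are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT key, size, last_modified_date
FROM my_metadata_db.source_bucket_metadata      -- placeholder database/table
WHERE content_type = 'application/pdf'
ORDER BY last_modified_date DESC
LIMIT 20
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_metadata_db"},               # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
```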
The Modern AI Stack: From Prompts to Agents
Coming back to my roadmap agent, it can check the table of customer usage and see who are the top ten customers for a given problem I'm solving and what their monthly usage looks like. This is all available in the AWS MCP open source repository and it's really easy to set up. Let's bring this all together. We started with prompt engineering, giving your LLM examples and constraints and context to get better responses.
Then we added RAG, giving your LLM access to more relevant data to fill that context and augment your original prompt. We vectorized our data to support this. We discussed metadata filtering to target the RAG search, and we also talked about S3 Vectors as a cost-effective way to manage your vector storage. Then we leveled up to agents. We gave our LLM the ability to connect with multiple tools to perform complex tasks.
And finally, we talked about MCP, the standard that makes it easy for your agents to connect to tools. This here is the modern AI stack, and it's where many customers stop because it's really powerful and it's scalable. Some customers want to go further. To discuss more advanced workloads, I'm going to invite Jordan to the stage.
Fine-Tuning Models: Supervised Learning, Distillation, and Alignment
Thank you. All right, so Monica talked a lot about how you can use prompt engineering to provide better structure and RAG to provide more data to improve the outputs of your model. This works in many cases, however, a lot of that data and structure we just talked about lives in the context window of your application. That's kind of like the short-term memory of the model, and because that's fixed, sometimes you have to go further and actually embed some of that content in the model itself rather than using something off the shelf.
So let's think about how models work today and how they're built. Every model that we have access to that's already pre-existing has been trained on a large dataset, and the knowledge that was used for training is now deeply embedded within the weights of the model itself. That data is accessible when you introduce a prompt to the model and get a response, but if the knowledge you're looking for isn't available, if the structure doesn't match your application, then we actually need to try to tweak those weights, update the model itself to get the right kind of output.
Of course, that takes data, and depending on how much data you have, there are different techniques you'll be able to use to train and improve your model's outputs. If you have a small amount of data, rather than trying to update the weights across the whole model, you really want to focus on just updating portions of the model itself. To get the most value from your data, you want to provide even more guidance by labeling the data so it's very clear you have an input and a desired output that you're looking for.
You're trying to guide the model to produce certain outputs based on that small labeled dataset, and you're going to focus it on a small portion or subset of the model. If you have more data, then you can still use that labeling to provide guidance and expand to actually update more of the model weights. That's kind of like rather than teaching the model a pointed piece of information, it's like helping it think in your domain that you're working with, which may not already be embedded in the model itself.
If you have a huge volume of data, you can actually skip that labeling step and go straight to something more akin to continued pre-training. It's kind of like picking up where the original model developers left off and then embedding that new content in those model weights. Now the services we have available to help with this effort are Amazon Bedrock and Amazon SageMaker AI. If you find yourself mostly on the inference side, the application builder side of the equation, Bedrock is really where you'll probably want to operate.
If you think of yourself more as a model builder or model developer, then SageMaker AI is the place to go. In both of these cases you have capabilities to do what we're going to talk about now, which is updating the model itself, but each of these services has really been refined or tuned for specific types of user experiences. So let's go into some of the different techniques we have to actually update the model itself in this case with that labeled data. We'll start with three techniques: supervised fine tuning, distillation, and alignment.
I know that there are a lot of these words that get thrown around in the ML space. So one of the goals I have today is to walk through and make sure that everyone has a clear understanding of how each of these differs and what it means for data and what it means for storage. So let's talk about supervised fine tuning. Again, I mentioned before that labeled data is really helpful when you're trying to tell the model here's an input, here's the output I'd like to see, so this is kind of in that same vein.
What is the input going to be? What's the user saying? What other context do we have available as part of this exchange and then what's the single best response or output? A simple model may be used for something like translation or summarization, which is text to text, text input, text output. You might have labeled data that looks like this: just a very simple prompt and the desired output you'd like to see. This is a non-conversational model.
If you have some more conversational models, which is more of a chatbot type of experience, you'll notice that the labeled data is actually going to need to have tags for the user and the assistant, kind of like who's actually introducing the prompt and who's responding. Some of that labeled data can also have multiple turns, going back and forth between the agent and the user as part of that labeled dataset. Now I should say the reason why I'm putting all this up here on the slide is because when you're thinking about the data that you're generating and collecting across your organizations, thinking ahead to how this data might be used can be really helpful.
This is especially true if you aren't expecting to be generating hundreds of thousands of data points to use for future training and you're going to be relying on labeled datasets. Thinking in advance about how you might want to label data that's coming out of maybe a call center log or any other kind of interaction can be really helpful. Here's another example: a different type of model, this is image to text, but again we have an image reference and then an explanation or caption that basically says when you see this kind of image I'd like you to respond with a cartoon of an orange cat with white spots.
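To make these formats concrete, here's a hedged sketch of writing both a non-conversational record and a multi-turn conversational record to a JSON Lines file. Field names vary by model, so treat the shapes as illustrative rather than a fixed schema.

```python
# Hedged sketch of preparing labeled data for supervised fine tuning as
# JSON Lines. Exact field names depend on the model you're customizing;
# these records only illustrate the two shapes described above.
import json

records = [
    # Non-conversational (text in, text out), e.g. summarization
    {"prompt": "Summarize: The quarterly report shows ...",
     "completion": "Revenue grew quarter over quarter, driven by ..."},
    # Conversational (chatbot-style), with user/assistant roles and multiple turns
    {"messages": [
        {"role": "user", "content": "How do I reset my device?"},
        {"role": "assistant", "content": "Hold the power button for ten seconds ..."},
        {"role": "user", "content": "And if that doesn't work?"},
        {"role": "assistant", "content": "Contact support with your serial number ..."},
    ]},
]

with open("sft_training.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```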
This supervised fine tuning example that I've shared today is available in Amazon Bedrock. So again, even though it's intended in general for app builders, you still have this kind of capability available to you when you're working with Bedrock to customize those models. The next capability where you might be tweaking the underlying model itself is distillation.
This is something that might be useful if you're working with a large model and you really like the outputs that you're getting. But the large model isn't quite fast enough for your user behavior or your desired application. Maybe it costs more, but again, you like the output, and when you try to generate the same kind of outputs from a smaller, more nimble, lower cost model, you're not getting the results you like. Distillation is kind of like the cousin of supervised fine tuning.
So again we have the prompts coming in, but in this case you don't actually provide the output responses. The outputs get generated by the larger teacher model and then get fused back with the input and then used as basically labeled data to train the small student model. So again this gets a bit messier here. I'm trying not to overwhelm with a lot of text, but the key here is you see in this third row, there's this role of a user. And then there's some text, some context and a question, but there's no assistant response and that's because again the assistant response is going to come from the larger teacher model. So this is pretty useful if you don't actually have all the answers but you know that a large model would produce content that you like. You can use something like distillation.
And then the third technique with labeled data that can be really helpful is alignment. This is less about training for knowledge and more about training for tone or maybe compliance, adding guardrails into the response. Sometimes this can be something that's really helpful for the branding of an organization. Similar to fine tuning, we want to provide what's the prompt, what's the context, but in this case we want to provide preferred and non-preferred responses.
So this is where I think the example is actually quite helpful. This labeled data that you're going to be providing also includes a score or guidance of different responses and which one it prefers. Here you're really teaching the model not just what output you want to see, but how it compares to another output so that it can really start to understand and shape towards the preferred outputs over time. This is something that if you were interested in doing this kind of direct preference optimization, this kind of alignment technique, you would need to be shifting over to SageMaker AI. This one isn't available in Bedrock today.
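Here's a hedged sketch of what one preference record might look like. The chosen/rejected field names are illustrative and should be matched to whatever format the SageMaker AI tooling you use expects.

```python
# Hedged sketch of a preference (alignment / DPO-style) record: the same
# prompt with a preferred and a non-preferred response. Field names are
# illustrative, not a fixed schema.
import json

preference_record = {
    "prompt": "A customer asks for a refund outside the return window. Respond.",
    "chosen": "I'm sorry for the trouble. While this is outside our return window, "
              "here are the options we can offer ...",
    "rejected": "Refunds are not possible. Please read the policy.",
}

with open("alignment_prefs.jsonl", "w") as f:
    f.write(json.dumps(preference_record) + "\n")
```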
So we have three different techniques using labeled data. It's also worth noting that the examples I shared today are real examples you could use that kind of structure and syntax for supervised fine tuning, distillation, and alignment, but they don't work with every model. So you'll want to look at the specific inputs, specific label data requirements for the models that you're planning on working with in advance so that you're actually structuring your data properly or you can obviously modify it after the fact.
Continued Pre-Training: Storage and Performance Requirements for Model Training
OK, so we've gone from the left here through the labeled data, and now we're all the way on the right side: continued pre-training and training a new model from scratch. We'll come back to one of those slides at the end as well. So here again, we have a lot of data, we're not happy yet with the output of the model, and we want to take it forward. The best place to go here would be continued pre-training.
This is where you don't have to have that labeled data set, but you can deeply embed new knowledge, new tone, new relationships between your data into the model itself. But the fact is there are also cases where the model that you're working with was actually trained on data that is just structurally so different from the model you're trying to create, maybe it's a different language. Or maybe it's not a large language model at all. It's something like a weather forecasting model or a foundation model for drug discovery. There you're really going to be starting from scratch with building a model and that's fine as long as you have all that data and that's what we're going to be talking about here.
One of the big differences with continued pre-training or training a model from scratch, compared with the previous phases, is that this is a much more resource-intensive effort, and the integration between your data, compute, and storage becomes really important. The reason is that the vast amount of data you need to train a model from scratch or do continued pre-training has to be loaded from storage efficiently into your GPU or accelerator instances. Periodically, for a number of different reasons, you'll also want to write intermediate checkpoints back to storage. Sometimes you'll write checkpoints to ensure there's a safe point to restore to in case of infrastructure issues, and in other cases it might be a point for you to evaluate the model and potentially go back to if you want to try training with different datasets over time.
In all of those cases, there's a lot of data that you need to get into the GPUs, and there's checkpoint data that you want to get off the GPUs very quickly. To give you a sense of the rates that these GPU and accelerator instances can process data, if you're working with text-based models, typically we see around 128 megabytes per second for every GPU. If you imagine distributed training on a number of different GPUs, that number can get quite high—tens of gigabytes per second. If you're working with richer media, multimodal models, or video, the throughput required to drive data into these accelerator and GPU instances can be quite material.
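A quick back-of-the-envelope calculation shows how that per-GPU number turns into tens of gigabytes per second. The instance and cluster sizes below are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope throughput sizing from the numbers in the talk:
# roughly 128 MB/s per GPU for text models, multiplied across a cluster.
gpus_per_instance = 8            # assumption for illustration
instances = 32                   # assumption for illustration
per_gpu_mb_s = 128               # figure quoted in the session for text models

aggregate_gb_s = gpus_per_instance * instances * per_gpu_mb_s / 1000
print(f"~{aggregate_gb_s:.1f} GB/s of sustained read throughput needed")  # ~32.8 GB/s
```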
The other thing that can impact performance is the size of the IO, the size of the data that you're trying to retrieve at a given point in time for a given request. If you think about reading data from storage, the time it takes to get that data back is a combination of overhead for the request and then moving that data itself, the payload. Getting the request overhead involves authenticating that you have access to the data, the network latency to actually retrieve the data between the GPU or accelerator instance and the storage, and the metadata lookup to figure out where the data lives within the storage system.
If you're working with small files or small IO, small objects, then the amount of time you spend on that overhead actually dominates the read. This is where having low-latency storage, storage that is able to respond and do that authentication, shorten that network path, and do that metadata lookup really quickly can have quite an impact on job completion time. Training these models, doing continued pre-training or training a model from scratch can be quite expensive, so making sure these instances are kept busy and that the training is efficient is helpful on the infrastructure side and also helpful to get responses to the people who are developing these models in the first place and iterate quickly, because this is really a collaborative and iterative workload.
Amazon FSx for Research and Development: File Systems for AI Workloads
Within the storage portfolio that we have at AWS, on the file side we have Amazon EFS, our elastic file system. We have Amazon FSx, which offers commercial and open source file systems that we manage on behalf of our customers in the cloud. This is kind of akin to RDS. We have a variety of different file system offerings within the FSx family. We have object storage with Amazon S3, and then we have block storage with EBS. For model training in particular, the shared storage offerings that our customers use fall within the FSx and Amazon S3 families.
To go one step further, how customers think about which storage services to use typically depends on the use case. If you're working on the research and development side, file systems are the preferred approach for shared storage. If you're looking for something more optimized for production use cases with more fixed data pipelines, that's where Amazon S3 and object storage becomes more optimized. Let's dive into the file side. My goal here is to make sure that you understand all these different terms that are typically thrown around with AI and machine learning, and that you know which services to use for each given use case.
In the case of research and development, we have different file systems that have been built and architected for specific use cases. Scale-up file systems, file systems that rely on a single server that can be larger or smaller, are one part of the portfolio and one type of architecture. Then we have these scale-out file systems that rely on multiple servers stitched together to deliver higher levels of performance.
Amazon FSx for OpenZFS is an open source file system that we fully manage for our customers. It's a scale-up file system that uses NFS for communication. This is the most standard way for file systems to communicate with EC2 instances or any other client instances. It's ideal for ultra low latency workloads like home directories, storing your development environments, or Git repositories.
This is where your developers are going to want to keep their data, so if you have a Kubernetes-based developer environment stack where researchers have their instances spun up and down, but you want to give them shared storage to work with in between, something like an OpenZFS shared file system is incredibly valuable. On the flip side, stitching a bunch of servers together is how you get to something like Amazon FSx for Lustre. Again, this is a file system we manage on behalf of our customers so they don't have to think about what Lustre is or how it works.
For our customers, this just becomes a mount point that you mount on your instances, and you do file APIs. You open, read, write, and close files, which is very simple in terms of the interface. However, the power you get is the ability to drive very high levels of throughput and I/Os or transactions to your storage. This is the solution that we recommend for customers to store their training data and to receive and restore their checkpoint data because it offers really scalable performance. One example of the type of thing we've done with these offerings is allow our customers to use enhanced networking capabilities like Elastic Fabric Adapter or GPUDirect Storage with our FSx file systems.
This networking stack allows customers to read data directly from our FSx for Lustre file systems into the CPU memory or the GPU memory on those GPU and accelerator instances. This is optimized for performance, which is why we recommend these file systems for these particular workloads. You don't have to take these steps to use these networking stacks, but they're available to you. Once you set them up, you don't have to do anything special to achieve the high levels of throughput that they enable. There is one challenge that has historically been associated with file systems. Many file systems are based on SSDs, solid state disks, because a lot of file-based applications expect very low latency responses and high transaction rates.
One of the challenges with SSD-based file systems is they can be expensive. If you have large amounts of data and some of it's hot and some of it's cold, storing all of it on an SSD-based file system isn't ideal. If your data volume is increasing and decreasing quickly, having the right amount of storage on SSD disks to support that data can also be challenging. For those of you who are taking on that challenge and trying to move data between hotter and colder storage offerings, that's just more operational overhead. One of the things we've done over the last year with our FSx offering is we've added FSx Intelligent-Tiering. This is very similar to the concepts that we see in S3 Intelligent-Tiering.
We have virtually unlimited elastic storage on our file systems. We have that half a penny price point that you might be familiar with from something like Glacier Instant Retrieval. We're automatically moving data between hotter tiers, including an SSD-based tier, all the way down to that colder archival instant access tier. For quite a while, customers who were interested in working with file systems but had a lot of data would split their data between Amazon S3 and their file system really just for cost reasons, even though everything was going to be accessed through a file interface. This really allows our customers to just work with one single file system and be able to take advantage of all the different tiers and get the performance of SSDs for their hottest data. They get the cost of a frequent access tier for their hotter data that's being used over the last few days, and then as data gets colder and colder, we archive it down and you get that automatic cost savings down to that 0.5 penny price point.
It's super easy and simple to work with if you have that mix of hot and cold data, and it's available on FSx for Lustre and FSx for OpenZFS file systems. If you are a customer who has data stored in S3 in an S3 data lake, we also have other capabilities to integrate FSx with your S3 data lakes. One option is to connect your file system to your S3 bucket and load the metadata, which are pointers to your data, onto the file system. They just show up as files, so you open a directory and you can see your S3 data on your instance just as a regular directory.
When you access that data, it gets pulled and lazy loaded behind the scenes onto your file system and communicated or shared back to your EC2 instance. If you add new data to S3, it can automatically show up on your file system, and if you write new data to your file system, it can be pushed automatically back to S3. This is a super powerful architecture. If you watched Matt Garman's keynote, you may have seen this architecture a couple of times. Customers like this a lot because the storage and IT administrators can think about working with their data in an S3 data lake, and the researchers can have the intuitive collaborative interface that file systems provide. One of the examples that was in the keynote was Adobe. They work with this specific architecture to store a lot of their data on S3 and access it through a file system. They use this for a lot of their research to figure out what models are actually going to be moved into production, what models are helpful and add value, and eventually they use S3 to do some of their production training.
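Here's a hedged sketch of setting up that link with a data repository association on FSx for Lustre. The IDs, paths, and event lists are placeholders to check against the FSx API documentation.

```python
# Hedged sketch of linking an FSx for Lustre file system to an S3 bucket so
# objects appear as files and changes flow both ways. Assumes the FSx
# CreateDataRepositoryAssociation API; IDs and paths are placeholders.
import boto3

fsx = boto3.client("fsx")

fsx.create_data_repository_association(
    FileSystemId="fs-0123456789abcdef0",              # placeholder file system ID
    FileSystemPath="/datasets",                       # where the S3 data appears on the file system
    DataRepositoryPath="s3://my-training-data-lake",  # placeholder bucket
    S3={
        "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},  # new S3 objects show up as files
        "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},  # new files are pushed back to S3
    },
)
```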
If you do choose to store your data in a file system, it's also worth noting that we've added some new capabilities this year, in fact this week, to make that data accessible to a whole range of Amazon analytics services. In the last example I mentioned, you might have a data lake with S3 data, and you can access it through your FSx for Lustre file system. In this case, if you have your data in an FSx file system but you want to access it with Amazon S3, we now have the concept of an access point that you can attach to your FSx file systems. You can access data that might be stored in your OpenZFS file system or now in a NetApp ONTAP file system. This is super helpful if you are a customer who has migrated data from on-premises with a NetApp file system into the cloud and you want to make that data accessible to Amazon Bedrock for some of that fine tuning I mentioned before or for that distillation use case. All of that data is now accessible to these AWS services as well using these S3 access points.
Amazon S3 Express One Zone and ML Optimization: Production-Scale Training Infrastructure
Now we have talked about using Amazon FSx to accelerate and deliver the performance we need for these research and development applications. Let's talk a little bit about what we've done on S3 to optimize for these ML workloads as well. Amazon S3 has a lot of storage classes. The ones on the left are the ones that are going to be used for hotter data. Both Amazon S3 Express One Zone, which is all the way on the left, and Amazon S3 Standard provide very high levels of throughput for a bunch of different applications. Amazon S3 Express One Zone differs from Amazon S3 Standard in that it provides much lower latencies. But really, if you're working on continued pre-training or training a model from scratch, you should expect all of your data to be in one of those two classes, because it's going to be accessed frequently and you're not going to want the request cost profile of the tiers further to the right.
One of the ways Amazon S3 Express One Zone has been optimized for these use cases is that it offers a lower latency profile by not making an authorization request on every single GET or PUT request. It uses session-based authorization, so you do one authorization and then you can read and write against your object storage within that session to get that lower latency. The other thing is in the name: it's Amazon S3 Express One Zone.
We've deployed this S3 storage class within the availability zone, so it's co-located with your GPU or accelerator instances, shortening those network paths as well. These two capabilities—session-based authorization and the co-location of clusters with your GPU or accelerator instances—allow us to deliver single-digit millisecond latencies that can be really helpful when you're working with small objects. The other thing that goes hand in hand with small objects is having very high levels of transactions per second out of the box, with hundreds of thousands of transactions per second available with the storage class.
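Reading from an Express One Zone directory bucket looks like a normal S3 GET in application code. Here's a hedged sketch where the bucket name is a placeholder; recent SDKs handle the session-based authorization (the CreateSession API) behind the scenes.

```python
# Hedged sketch of reading a small object from an S3 Express One Zone
# directory bucket. The bucket and key names are placeholders; recent SDKs
# perform the session-based authorization for directory buckets for you.
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(
    Bucket="training-shards--usw2-az1--x-s3",   # placeholder directory bucket name
    Key="shard-000042.bin",                     # placeholder key
)
payload = obj["Body"].read()
```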
One interesting example of a customer that has done a lot of this kind of training is Meta. They came to us asking for a very high throughput object storage class that they could use with very high levels of transactions per second and very low latencies—these single-digit millisecond latencies. We worked with them to scale some of our S3 Express clusters in order to meet a 140 terabit per second throughput level, which is 17 terabytes per second. We are in clear supercomputing storage range here available in the cloud to train these models. I don't expect everyone to be doing anything near this kind of scale, but it does speak to the capabilities we have, both for small scale and the ability to scale to meet the needs of even the most demanding workloads.
In addition to moving some of the infrastructure physically closer to our compute in our data centers, we've also tried to bring the APIs a little closer to the workloads you might be running with S3 for machine learning. Two examples of that are Mountpoint for Amazon S3 and the Amazon S3 Connector for PyTorch. Because a lot of machine learning workloads and libraries rely on and expect file interfaces to access data, we added Mountpoint for Amazon S3, which is a connector that's fully optimized for Amazon S3. There are a lot of connectors out there that work with object storage and provide a file interface, but we wanted one that was fully optimized to take advantage of how S3 works under the hood.
This is super helpful for read-only machine learning workloads. If you're going to train a model and you're going to read through a lot of data on S3, it's a very simple plug-in to just say let me use Mountpoint for Amazon S3 to access that data read-only and to translate those file open and read APIs into an S3 get effectively. On the other side, there are also some workloads where we know that just swapping the interface isn't sufficient. When that happens, we want to go further up in the stack, and that's where something like the connector for PyTorch comes in, where we actually swapped out the whole file interface and developed something that was really optimized for Amazon S3 and allows us to do more streaming and prefetching to get data from S3 into those EC2 nodes for that training effort.
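As a hedged sketch of the PyTorch side, here's what streaming objects with the s3torchconnector package might look like. The class name, the from_prefix constructor, and the bucket prefix are assumptions to check against the connector's documentation.

```python
# Hedged sketch of streaming training data straight from S3 with the
# Amazon S3 Connector for PyTorch (s3torchconnector package). The API shape
# is assumed from the project's documentation; the prefix is a placeholder.
from s3torchconnector import S3IterableDataset

dataset = S3IterableDataset.from_prefix(
    "s3://my-training-data-lake/tokenized/",   # placeholder bucket and prefix
    region="us-west-2",
)

for obj in dataset:
    # each item exposes the object key and a readable body
    data = obj.read()
    # ... decode and feed into your training loop or DataLoader here
    break
```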
We've now finished the whole spectrum here of different ways for you to work with your data to improve your models. Both for inference, if you're not really going to change the model itself and you're just trying to get a better output at inference time, that's your prompts, your knowledge bases, and your metadata. Or if you need to change the underlying model itself, that's where getting that labeled data or unlabeled data and doing some of the techniques here—fine tuning, distillation, alignment, or continued pre-training, or training your own model from scratch—that's where that comes in more for the model builder persona.
If you're interested in learning more, here are a couple of sessions over the next 24 hours that might be interesting. My co-speaker Monica and I will be off on the side if you do have any questions. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.