🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerate data discovery with object metadata in Amazon S3 (STG357)
In this video, Roohi Sood and Claire Edgcumbe from AWS introduce S3 Metadata, a solution for data discovery challenges in organizations managing massive datasets. They explain how S3 Metadata automatically extracts and maintains metadata from S3 objects in Apache Iceberg format, creating journal tables (audit logs) and live inventory tables (current snapshots). The presentation includes live demos showing how to enable metadata configuration, query tables using SageMaker Unified Studio and Athena, and integrate with S3 Batch Operations for storage management. They demonstrate querying with natural language using Kiro CLI and MCP for S3 Tables, and share real customer success stories including a medical imaging company that streamlined their processing pipeline and a digital content company managing petabyte-scale migrations. The service is available in 28 AWS regions.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Data Discovery Challenge: Finding Needles in 500 Trillion Haystacks
All right. A very warm welcome. Thank you for being here. I want to start our discussion today by sharing some scenarios that are becoming increasingly common. Let's say it's 8:00 a.m. or 8:30 a.m. on a beautiful Wednesday morning, and Alice, your lead data scientist, is looking for labeled and processed images amongst 23 million objects in her bucket. Meanwhile, your security team is trying to get a list of all the objects that have sensitive information because an audit request came in. They're listing all of the objects and trying to figure out which objects have sensitive content one by one. Does that sound familiar? This is not a data problem. This is a data discovery problem.
Hi, I'm Roohi Sood, and I'm a Senior Product Manager Technical at Amazon Web Services. I'm joined today by Claire Edgcumbe, our Software Development Manager who leads the brilliant engineering team that has brought the vision of accelerated object discovery to reality. Together, Claire and I have just one goal for today: to walk you through a roadmap of how you can overcome these data discovery challenges in your organizations forever. Your data scientists will get what they need within minutes, and your security teams will answer questions with simple queries.
Here's our journey for today. We'll start by diving deep into the data discovery challenges that are costing organizations millions in lost opportunities. Then Claire and I will show you exactly how S3 Metadata overcomes these challenges. We'll walk you through different use cases with demos—not theory, but real examples that you can replicate in your own environments. Everything that we show you today is available right now. You can literally enable this before you leave Las Vegas. Finally, we'll wrap up with three key takeaways that you can take back to your organizations.
So let's start by understanding these data discovery challenges. We call this a challenge, but really these are opportunities, one of the biggest opportunities, and in a moment you'll understand why. It's no surprise to anyone in this room that data is growing faster than ever. It's growing at a rate that was unimaginable five years ago. Every app you use, every car you drive, sensors in the buildings, every security camera—they're all dumping data into storage. To put that into perspective, Amazon S3 now stores over 500 trillion objects. That's half a quadrillion objects.
Enter the Gen AI revolution, which has fundamentally changed the value equation for data. Suddenly all of that unstructured data that's sitting in your S3 buckets—all of the videos, the images, the logs—that's not just storage anymore. That's potential training data worth millions. But here's the challenge: whether you're training models, fine-tuning them, or building RAG use cases using knowledge bases, you need to be able to identify, categorize, and access your data sets quickly. If your teams are spending hours and days scouring the data for the right videos or identifying documents which have the right content or don't have sensitive content, you're already behind.
This is why we call this a data discovery opportunity, because whoever solves this data discovery challenge for their organizations first is going to have an insurmountable AI advantage. This brings us to our fundamental question for today: How do I find or access actionable data sets at scale? The answer, as many of you already know, is metadata. It is the DNA of your data. It tells you what you have, where it is stored, and sometimes what it potentially contains. But this is where most organizations get stuck.
Traditional metadata solutions typically live outside of your storage solutions, so they are already complex, and then they have sync issues. Second, they're incredibly difficult to build, operate, and maintain at scale. We have seen organizations spending weeks, sometimes months, just to get the fundamental metadata solutions right. Finally, and most importantly, your metadata is only useful if it is current. In fact, stale metadata is worse than no metadata because it's going to potentially lead you down the wrong path.
Introducing S3 Metadata: Automatic, Comprehensive, and Always Current
This is why we asked ourselves this question: What if metadata worked just like S3? Simple, reliable, and scalable. This is why we created S3 Metadata. But before I tell you what S3 Metadata does, I'm actually going to tell you what it doesn't do. It doesn't require you to change how you work with S3, and it doesn't require you to do any complex setups. S3 Metadata is metadata done right—automatic.
Simply put, S3 Metadata provides automatic metadata extraction from your objects in S3 that you can query with simple SQL statements. Every time you add an object or delete an object, we automatically update your metadata. It's comprehensive, and always current. Let's dig into why this is exciting for customers today.
First, it captures both system and custom metadata. System metadata includes object size, checksum types, and encryption types. We're going to dive into both of these a little bit more in our presentation today. Second, it's built on the Apache Iceberg format and stored in S3 Table buckets, which means it's using proven open source standards that you can use with any compatible query engine, either now or in the future. Third, it's completely automatic. The moment you upload, update, or delete an object, your metadata tables update as well. There is no manual syncing required.
So how does it really work? It's actually very simple. All you have to do is add a simple configuration on your general purpose bucket that tells us you want metadata for this particular bucket. Once we see this configuration, we will set up a managed AWS S3 Table bucket. Within this table bucket, we start populating your metadata tables. These are called journal tables and live inventory tables. Now you might be wondering what these tables exactly are and what they contain. I'm going to invite Claire on stage to help us understand the difference between these metadata tables and their different use cases.
Understanding Journal and Live Inventory Tables: Architecture and Query Options
Thanks, Roohi. As Roohi mentioned, when you enable S3 Metadata for your bucket, we create two tables. The first is the journal table. You can think of the journal table as an audit log for your bucket. Every put, every delete, every modification is captured as its own row. This isn't just metadata; it's a time machine for your source bucket. You can see exactly what happened, when it happened, and who made it happen. Because the journal table refreshes within minutes, you're always working with current information. This year, to make cost management simple, we introduced automatic expiration of old records.
The second table is the live inventory table. The live inventory table gives you a detailed view of what's in your bucket. You can think of it as a snapshot of your entire data estate. There's one row per object version. It refreshes every hour and answers questions about the state of your data. This is perfect for analytics and reporting. You no longer have to issue expensive list requests and paginate through them, and you no longer have to build those complicated custom solutions that Roohi was mentioning earlier. It's all here and ready for you to query.
Let's take a look at this in action. You have a set of users who are all operating on a shared dataset. They are adding, updating, and deleting objects. Now what we can see is that because S3 Metadata is enabled, as they're making these actions, new rows appear in the journal table capturing those actions and the associated metadata. For these changes to be reflected in your inventory table, we run a job roughly once an hour that's going to read the new rows recorded in your journal table and apply them to your inventory table, generating a new snapshot for your bucket. So here we see the metadata flowing from the journal table into your inventory table, creating that snapshot for you to query.
Let's take a look inside. The schema is incredibly rich. In total, there are 21 types of metadata recorded in the tables. The journal table contains request metadata, such as record type, request record timestamp, requester, and source IP. System and custom metadata are captured in both tables. System metadata includes things like bucket, key, object size, and object storage class, but also information about encryption status, whether the object was uploaded using multipart upload, and even the encryption algorithm.
Custom metadata is metadata defined by you in the form of user metadata or object tags. So where do these tables live and how do you query them? There are three things you need to know about your metadata tables. The first is that they are in Apache Iceberg format. The second is that they are S3 tables which live in S3 table buckets, and the third is that they are managed by AWS. Let's break down what each of these means. Apache Iceberg is an open table format that supports SQL-like queries over Parquet data. It's very popular; in fact, it's one of the fastest growing types of data in S3. One reason customers like it is that, because Iceberg itself is just data at rest, it allows them to choose which query engine they want to use when interacting with their data. You may have different users using different query engines to access the same tables at the same time, depending on their individual needs.
Apache Iceberg is also designed to support analytics at scale. It supports queries on tables with petabytes of data and billions of files, and it supports functionality such as time travel and schema evolution. S3 Tables is our managed solution for Apache Iceberg. There are a few reasons why S3 Tables is the right place for your metadata. The first is performance: S3 Tables offers ten times the transactions per second compared to Apache Iceberg tables stored in general purpose buckets. The other reason is S3 Tables' deep integration with AWS analytics services. In particular, I want to highlight the seamless integration with AWS Lake Formation, giving you fine-grained access control down to the row and column level. This allows you to share metadata with several different teams while making sure that sensitive information remains protected.
The third thing I mentioned was that the tables are managed by AWS. What this means is that only S3 can write to your metadata tables, but you still retain full control over who can read what data. The reason we limit write access to S3 is so that you know what is in your tables is an accurate representation of what's in your bucket. The second thing that managed by AWS means is that we configure and control the maintenance operations on your tables: we set the target compaction size, configure how frequently compaction should run, and set up snapshot management and unreferenced file cleanup. We do that so you don't have to, and also so we can optimize these configurations based on the size and structure of your metadata tables.
Just yesterday we announced three new types of managed tables. You can now enable integrations to create managed S3 tables for Amazon CloudWatch Logs, SageMaker Unified Studio asset metadata, and S3 Storage Lens data. This is incredibly powerful. We now have metadata from several different AWS services all in one place, and since they are all stored in the same format, you can query them together. So how do you query your managed tables as well as your S3 Metadata tables? There are several different options. The first is using AWS analytics services. S3 integrates deeply with these services, so you can use Athena, Redshift, and SageMaker Unified Studio pretty easily right out of the box. Second, since S3 Tables supports the Apache Iceberg REST Catalog endpoint, you can take advantage of the open ecosystem of engines that support Iceberg, whether that's DuckDB, Apache Spark, Flink, or Trino, to name a few. Finally, this year we launched MCP for S3 Tables, making it possible for you to chat with your data using natural language.
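Before moving on, here is a minimal sketch of what a query against a journal table looks like from Athena, assuming the naming conventions described in this talk: the managed table bucket surfaces through the s3tablescatalog catalog as aws-s3, the namespace is b_ followed by the source bucket name, and the journal table is named journal. Treat the catalog, namespace, table, and column names as assumptions to verify against your own account and the S3 Metadata documentation.

```sql
-- Sketch: the 20 most recent changes recorded for a bucket named "starwatcher".
-- Catalog, namespace, table, and column names follow the conventions described
-- in this talk and should be verified in your own account.
SELECT bucket, key, record_type, record_timestamp, requester
FROM "s3tablescatalog/aws-s3"."b_starwatcher"."journal"
ORDER BY record_timestamp DESC
LIMIT 20;
```

The later examples in this article reuse the same assumed names and omit the catalog qualifier for brevity, assuming the catalog has already been selected as the query context.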
What else can you do with your metadata? On the screen, we have a few examples of how you can leverage Amazon QuickSight to visualize your metadata. This is an easy and effective way to track and summarize different usage patterns in your bucket.
Request and System Metadata Use Cases: From Audit Trails to Encryption Compliance
Enough theory. Let's talk about some use cases. To start, let's take a second to step back to that slide showing what's in your metadata tables. If you recall, there are three types of metadata: request metadata, custom metadata, and system metadata. We're going to take a look at each of these to see how you can use them to understand your data and take action on your data.
Request metadata lives in the journal table, and it's what you use to inspect what's changing on your bucket. Here we have another set of users who are operating on a shared dataset. Now, as the owner of the dataset, you want to understand what's changed within the past day. You can do that by writing a SQL query that's going to group the results based on source IP and requester. Source IP is going to tell you the source IP of the request, and requester is going to tell you the AWS account. For requests coming from S3 Lifecycle or other AWS services, you're going to see the AWS service's service principal instead of the account ID.
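A rough sketch of the query just described, grouping the last day of activity by requester and source IP (column names are assumed from the journal table schema discussed above):

```sql
-- Sketch: summarize who changed the bucket in the past day, and from where.
SELECT requester, source_ip_address, COUNT(*) AS request_count
FROM "b_starwatcher"."journal"
WHERE record_timestamp > current_timestamp - INTERVAL '1' DAY
GROUP BY requester, source_ip_address
ORDER BY request_count DESC;
```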
Now, say you want to zero in on a particular operation. Maybe you actually want to understand who is deleting data. You can do that in a couple of different ways depending on whether or not you have versioning enabled on your bucket. For an unversioned bucket, you're going to add a filter on record type. For a versioned bucket, you're going to add an additional filter on delete markers. Delete markers allow you to distinguish between permanent deletes and the delete markers that are added to versioned buckets when a delete request is made.
So now you understand who has been deleting data. You probably want to know what data was actually deleted, and you can do that by updating the select statement to include bucket, key, and version ID. Now, while you can't recover from permanent deletes on unversioned buckets, if you have versioning enabled, you can actually use the output of this query to roll back those deletes by removing the delete markers. In fact, last week we published a tool that allows you to do this at scale using S3 Metadata. This tool doesn't just roll back deletes; it can revert all the changes made, as long as you have versioning enabled on your bucket. It does this by querying S3 Metadata to understand what versions existed at a particular time, and then it uses S3 Batch Operations to revert your bucket back to the state it was in by removing delete markers and copying objects in place. I highly recommend you all take a look at the tool and maybe even try it out.
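A hedged sketch of the delete-investigation queries described here, under the same assumed schema; the record_type values and the exact delete-marker semantics should be confirmed against the S3 Metadata documentation:

```sql
-- Unversioned bucket: every DELETE record is a permanent delete.
SELECT bucket, key, requester, record_timestamp
FROM "b_starwatcher"."journal"
WHERE record_type = 'DELETE';

-- Versioned bucket: add a delete-marker filter to separate recoverable deletes
-- (delete markers) from permanent version deletes, and select version_id so
-- the output can drive a rollback.
SELECT bucket, key, version_id, requester, record_timestamp
FROM "b_starwatcher"."journal"
WHERE record_type = 'DELETE'
  AND is_delete_marker = true;
```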
So that was request metadata. Now I'm going to move on to system metadata. System metadata lives in both your journal and your inventory table, and it exposes the information you need to understand your data landscape. For example, up on the screen, we have a part of an inventory table for a legacy bucket. What we know about this bucket is that it contains millions of objects that have been uploaded by different departments over several years. Now last year your compliance team updated the policy requiring all objects be encrypted using SSE-KMS. Does this sound familiar to anyone? Well, now a year later you've updated all of your applications to use SSE-KMS, but you're not sure about those objects that were created before the policy was in place. How are you going to find the remaining objects that are still using plaintext?
As you can see from the screen, it can be like looking for a needle in a haystack. Enter S3 metadata. With S3 metadata, this challenge is reduced to a single, pretty simple SQL query. And what's even more exciting is once again you don't have to stop here. You can take action to encrypt these objects by passing your output into a batch operation, running copy in place on these objects, and specifying your desired encryption type as part of the copy request.
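A sketch of that compliance query, assuming the inventory table exposes an encryption_status column and that SSE-KMS objects report the value 'SSE-KMS' (the exact value strings and the bucket namespace used here are assumptions):

```sql
-- Find objects that are not encrypted with SSE-KMS; the result set can then be
-- fed into an S3 Batch Operations copy-in-place job to re-encrypt them.
SELECT bucket, key, size, storage_class, encryption_status
FROM "b_legacy-bucket"."inventory"
WHERE encryption_status IS DISTINCT FROM 'SSE-KMS';
```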
There's one more use case that I want to talk about for system metadata that actually combines system metadata and request metadata. So in this example, you're looking at who is uploading how much data to which storage classes. And you can do that with the query on the screen. Now what's compelling about this use case is that traditionally your access logs and your storage data are probably stored in two very different places with different access controls and probably in a different format, and it would take actually a significant amount of developer effort to go and get access to these and combine the data into a single place that you can now look at to analyze this data together. With S3 metadata, it's all there in one place ready to query.
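A sketch of that combined query against the journal table, again using assumed column names:

```sql
-- Who uploaded how much data to which storage class over the past week.
SELECT requester, storage_class,
       COUNT(*)  AS objects_uploaded,
       SUM(size) AS bytes_uploaded
FROM "b_starwatcher"."journal"
WHERE record_type = 'CREATE'
  AND record_timestamp > current_timestamp - INTERVAL '7' DAY
GROUP BY requester, storage_class
ORDER BY bytes_uploaded DESC;
```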
Custom Metadata: Adding Business Context with User Metadata, Object Tags, and Self-Managed Tables
The third type of metadata is custom metadata, which I think is one of the most exciting parts. It's what allows you to enhance your metadata with your own business context. There are a few ways to do this. The first is to use user metadata. User metadata is visible in both the journal and the inventory table. To attach user metadata to an object, you provide it as key-value pairs as part of the request to create an object. Because user metadata is specified as part of the PUT request, it is immutable. This makes it ideal for storing information that shouldn't change throughout the course of the object's lifecycle. For example, the provenance or source of an object should not change, and therefore we see customers storing this type of information as user metadata.
The second option is to use object tags. You can attach object tags to your object using object tagging APIs, and once they are attached to the object, they automatically become visible in your journal and inventory table. One important thing to understand about user metadata and object tags is that once they are attached to the object, they live with the object. If the object is copied or replicated, this custom metadata will go with it. Similarly, if your object is deleted, we will clean up that metadata for you.
Finally, there is a third option: storing your custom metadata in your own self-managed S3 table within an S3 table bucket. We often see customers taking this approach when their metadata exceeds the size limits for tags and user-defined metadata, for example to store thumbnails or summarizations of text documents or video files. Self-managed S3 tables are an incredibly flexible option, and since they are compatible with the same query engines and tools, you can easily combine them with your managed metadata. A tradeoff with self-managed tables is that updates from your buckets are not automatically propagated into them, so you may have to join your self-managed table with your managed S3 Metadata tables to understand whether your metadata is still current.
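As an illustration of that join, here is a sketch using a hypothetical self-managed table named media_annotations (its namespace and columns are invented for this example); it flags rows whose source object has changed since the custom metadata was written:

```sql
-- Hypothetical self-managed table joined against the managed journal table
-- to find custom annotations that may be stale.
SELECT a.bucket, a.key, a.annotated_at, j.record_timestamp AS last_change
FROM "analytics"."media_annotations" AS a
JOIN "b_starwatcher"."journal" AS j
  ON a.bucket = j.bucket AND a.key = j.key
WHERE j.record_timestamp > a.annotated_at;
```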
Let's take a look at a couple of use cases for custom metadata. Something we hear from customers all the time is that with the explosion of synthetic data, they need a way to separate AI-generated versus non-AI-generated metadata. They also want to track the lineage of how a piece of AI data was created. As part of the Nova launch at re:Invent last year, Bedrock began annotating videos and images that are uploaded to S3 using user metadata. The annotations indicate that the object came from Bedrock and the model that was used to create them. Now if you have S3 metadata enabled, the problem of separating AI versus non-AI generated information is reduced to a single SQL query.
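A sketch of that query; note that the user-metadata key shown here ('generated-by') is a placeholder rather than the documented key that Bedrock attaches, so check the actual annotation key before relying on it:

```sql
-- Separate AI-generated objects by filtering on user metadata.
-- 'generated-by' is a placeholder key name, not the documented one.
SELECT bucket, key, user_metadata
FROM "b_starwatcher"."inventory"
WHERE element_at(user_metadata, 'generated-by') IS NOT NULL;
```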
In the last example, we talked about making sense of data generated by AI. But another source of growing data in S3 is from the exponential increase of sensors and monitors that we see in our daily lives. Some examples would be security cameras, vehicle sensors, sensors on planes and ships, and also information from scientific studies such as topography data, geospatial or lunar imagery, or even DNA sequences from genomic studies. What's common in all of these use cases is that the data itself lacks the contextual information that's required to analyze it, particularly when it's first uploaded to the cloud.
For example, if you think of sensor data that you're uploading, you need to record additional information such as the time the recording might have been made, where the sensor existed, or other sensor configurations. Fortunately, you can add all of this contextual metadata to your object and to your data through the use of object tags. Once the object tag is attached to your data, it flows through into your metadata tables, so you can find your data by querying on that same contextual information you attach to your object.
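For example, with hypothetical tag keys such as sensor-id and recorded-date attached at upload time, a sketch of the lookup might be:

```sql
-- Find all uploads from one sensor in a given time window, using the
-- contextual object tags attached when the data was uploaded.
SELECT bucket, key, object_tags
FROM "b_starwatcher"."inventory"
WHERE element_at(object_tags, 'sensor-id') = 'cam-0042'
  AND element_at(object_tags, 'recorded-date') >= '2025-11-01';
```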
Live Demo: Setting Up S3 Metadata, Querying Tables, and Automating Storage Actions
All right, I think it's time to see some of this in action. We're very excited about the demo. Everything that we're going to show you is available today, and you can literally try it before you leave. There are three parts to the demo.
We're going to start by learning how to set up metadata configuration on a bucket. Then we'll talk about how you can query those metadata tables with different analytics engines, and finally we'll see how you can take storage management actions based on the outputs of your metadata queries.
So let's get rolling. Let's work on the first part, which is setting metadata on your general purpose buckets. This is the easiest part. You can really set this up in under a minute on your buckets. All you have to do is figure out the general purpose bucket where you want the metadata.
In this case, we're going to be working on the Starwatcher bucket. The two tables that we talked about are the journal and live inventory tables. You enable them, you can choose the encryption types, and record expiration allows you to expire records after a certain duration. So, for example, if you want records older than 365 days to be expired, you can do that. The live inventory table is enabled, you can again choose the encryption type for the tables, and then you create your metadata configuration.
A few things that I want to mention here is that once you create your configuration, the first thing that happens is your table status goes into a creating stage. Within a few seconds, the journal table will become active. Journal tables are forward looking. They capture your puts and deletes, so it's active as soon as you set up the configuration. The live inventory table first goes into a backfilling stage because the live inventory table has everything about your buckets. All of the existing objects are captured. So it's first going to go into the backfill stage. This is where we get the metadata of all of your objects and get your tables ready and prepped up.
Another thing I want to call out is that you can see the table bucket here is named aws-s3. This is the table bucket where all of your metadata tables, all of your journal and live inventory tables for the region and for the account, will be hosted. So within this single table bucket you can see all of your inventory and journal tables. The namespace might also look familiar: it corresponds very closely with the name of the bucket on which we set up the metadata configuration, and it typically has a prefix of b_.
That was all about setting up the metadata configuration. Next we're going to go to the second part of the demo, which is how we query these metadata tables. Now that we have learned to set up these tables, let's explore how we can read and query them. So here both my journal and live inventory tables are active. We're going to start by querying the journal table today. I'm going to use both SageMaker Unified Studio and Athena to query these tables.
We'll start by querying the journal table, and I'm using SageMaker Unified Studio. The integration between SageMaker Unified Studio and S3 Table buckets is really simple; it's a one-click integration. The first thing I'd like to show you here is the entire topology that we just talked about. We're going to go look up our bucket. As you can see, the aws-s3 table bucket shows up, along with your namespace, which is b_ followed by the name of your bucket, and the two tables, the journal and live inventory table.
The next thing that I would like to show you is the schema that we just walked through when we talked about journal and live inventory tables. Because this is a journal table, you're going to see the three request-specific fields: the requester, the request ID, and the source IP address. The scenario that I'm going to test on the journal table today is: show me everything that was deleted in the past 7 days in a specific prefix. You can apply a lot of additional filters here, but I'd like to figure out whether someone deleted objects in my specific prefix.
Here we go. Within a couple of seconds, I have a list of everything that was deleted and the requester who deleted these objects in my prefix. This is why journal tables are very powerful for investigative queries, for audit analysis, for any time you're trying to track what happened in my bucket and who did it.
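A sketch of the query run in this part of the demo, with a placeholder prefix:

```sql
-- Everything deleted under a given prefix in the past 7 days, and by whom.
SELECT key, version_id, requester, record_timestamp
FROM "b_starwatcher"."journal"
WHERE record_type = 'DELETE'
  AND record_timestamp > current_timestamp - INTERVAL '7' DAY
  AND key LIKE 'images/processed/%'
ORDER BY record_timestamp DESC;
```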
The next part of this demo is querying the live inventory tables. We're using Athena here to query the live inventory tables. Live inventory tables are useful for storage landscape analysis when you want to look across your entire bucket and see what's going on. Here I'd like to see everything that has specific kinds of tags and is in a particular storage class. You can add additional types of filters on encryption and checksum types. I'm trying to get a list of all objects which meet the object tags of a specific type of weather and a training category.
Again, within a few seconds, I have a list of everything without me having to list the objects or do the get object tagging. I'm able to get a list of everything that met the specific criteria for the tags.
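A sketch of the inventory query used here, with assumed tag keys and storage class values:

```sql
-- Objects tagged as rainy-weather training data, still in Glacier classes.
SELECT bucket, key, storage_class, object_tags
FROM "b_starwatcher"."inventory"
WHERE element_at(object_tags, 'weather') = 'rainy'
  AND element_at(object_tags, 'category') = 'training'
  AND storage_class IN ('GLACIER', 'DEEP_ARCHIVE');
```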
Another interesting thing I want to point out here is that all of these objects are in Glacier storage classes, but the metadata is queryable. This is a powerful aspect of S3 metadata. You can get metadata across all of your objects across all of your storage classes, and it is instantly queryable. We are not restoring anything at this point to get to this metadata.
So we've completed two parts of our demo at this point. We learned how to set up the configuration and explored how to query both the journal and the live inventory tables. The third part of our demo today is taking storage management actions based on the outputs of our metadata tables. To do this, I'm going to take another scenario. You remember Alice, our data scientist who was trying to look up data for training her models. She's actually training a parking assist system specifically for rainy weather. So what I want to do here is take all of my raw data stored in Glacier and restore all of that data so it's ready for her to use.
Let's go over this quickly. You've seen this query before. We just ran it on our inventory table and got the list of everything that met these specific tag criteria in Glacier storage class. I've limited it down to just the bucket and the key because that is what I need to pass to a batch operations job. The easiest way for me to pass the output of this query is to go to the bucket where this output is stored. I'm going to the actual S3 bucket where my query results are stored, and that is what I'm going to pass to the batch operations job.
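Narrowed to the two columns a batch operations CSV manifest expects, the sketch from the previous query becomes:

```sql
-- Only bucket and key are kept, matching the CSV manifest format that
-- S3 Batch Operations reads.
SELECT bucket, key
FROM "b_starwatcher"."inventory"
WHERE element_at(object_tags, 'weather') = 'rainy'
  AND element_at(object_tags, 'category') = 'training'
  AND storage_class IN ('GLACIER', 'DEEP_ARCHIVE');
```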
My results are stored in this training query prefix. I'm going to go pick up the result. I have to do one small thing before I can use this result, which is convert it into a format that a batch operations job can read. I can do it right here: I'm just going to separate the two values, bucket and key, and store them as a manifest file. So that's what we're doing here. We're taking the output from our metadata tables, converting it into a comma-separated list, and storing it in this manifest file.
I'm going to go fetch the URI for that manifest file, and this is basically the most important step. You've got your data, you've got it prepped, and it's ready for your batch processing. I'm going to take the URI of this manifest file and pass it to the batch operations job. That's essentially the biggest step here. Next, we need to do a couple more configurations. Most importantly, you need to choose the operation that you want the batch operations job to perform. We're restoring objects from Glacier and getting them ready, so I chose restore. I want them to be available for five days, and we're going to let this batch operations job run as soon as it's prepared, so I'm selecting run when it's ready.
I typically like to have my completion reports in the same place as my manifest file, so I'm quickly going to go and fetch the URI of where my completion reports sit. So here's the URI for my completion reports. I'm going to pass this to the batch operations job, and that's basically it. You choose the role that you want batch operations to use when it's running your restore operation, and you have set up a batch operations job using the output of your queries. A couple of things to check as you're setting up the job: once the job is set up, you can see the whole manifest and the number of objects that it is going to restore.
So let's go look at the batch operations job that we just set up. It's showing that there are 88 objects in my manifest list, the operation is restore, and they're all going to be restored for five days. So this is the power of being able to use S3 Metadata to run batch processing jobs. You're no longer spending hours trying to prep your data. You're just taking the output of a query, processing it, getting it ready for a batch operations job, and passing it along.
Finally, for the last part of our demo today, I'm going to talk about querying with natural language. As Claire mentioned, you can query your metadata tables in three ways; here we're going to try querying them with natural language using the Kiro CLI and MCP for S3 Tables. Let me start by initializing Kiro. I already have MCP for S3 Tables set up with Kiro, so you can see that as soon as I initialize it, it starts loading the MCP server for S3 Tables. This is what actually enables us to query metadata tables using natural language. Now, instead of manually writing SQL queries, I'm simply passing it a prompt, telling it to analyze my storage and look at all of the object counts, the storage class distribution, and the prefixes.
Now watch as Kiro interprets my request and automatically executes the analysis. Behind the scenes, it's connecting to the S3 Metadata tables. It's going to ask me for permission to run a few tools, specifically the query database tool; that's the tool it's going to use to query my tables. And now it's ready to run the SQL queries. It's connecting to the S3 tables at this point and optimizing the SQL queries that it's running.
I also prompted Kiro to tell me why it's running each query, so you can see it first started by looking at the count, the size, and the distribution. Once it has that information, it moves on to prefixes and finding the largest prefixes. This is all happening without me running a single SQL query or telling it what SQL I want; all I did was tell it exactly what I wanted to look at. So it's completing the prefix analysis, analyzing the storage distribution per prefix, and finally I'm expecting it to provide all of this in a summary, which it is doing at this point: a clean executive summary with actionable outputs.
You can see the total storage, the total number of objects, the storage class distribution, and the prefix distribution. It's also making recommendations on some of the actions that I can take. So this is the power of combining your S3 Metadata tables with Kiro and MCP for S3 Tables. You're able to talk to your metadata tables, your business teams can talk to your metadata tables, and you do not have to be a SQL expert. That's metadata in action today.
Real-World Impact and Key Takeaways: Transforming Data Discovery Across Industries
The next question that you're probably asking is where can I use it and which regions are we available in? We're available in 28 regions today, 22 of which were launched in the past week. We are continuing to expand our regional footprint, and we're very excited about that. We've covered a lot today. We've talked about the challenges and the opportunities with data discovery. We've talked about how S3 metadata solves these challenges. We've seen a demo in action. Now I want to talk about the real impact. I want to share with you two real-world customer scenarios where customers deployed metadata and saw immediate results.
The first one that I want to share is the story of a medical imaging customer that generates 3D models from CT scans, processing thousands of objects and files every hour. Previously, this customer used event notifications and processed each file with Lambda. This whole workflow was very fragile; it would time out, and there was a lot of complexity involved. When we introduced S3 Metadata tables, this customer completely refreshed their job processing pipeline with S3 Metadata. They now use the S3 Metadata journal tables: they query them every 15 minutes and feed the output into an open source orchestration platform, which helps them batch process their new objects. They're now processing thousands of files every hour, which has dramatically reduced their overall processing time. This is a really impactful way to meaningfully speed up your workflows.
The second story that I want to share is of a digital content company that has millions of articles from thousands of publishers. Their challenge is that they're moving everything from on premises to the cloud, and they need to keep track of these millions of files across different systems: old servers, databases, and S3. Without S3 Metadata, this would have been a manual nightmare. However, with the S3 Metadata inventory tables, they're using simple SQL queries to find their data based on database IDs and migration tags. Now they're able to answer questions like which files were successfully updated, which files are duplicates, and which ones they can just delete, all with simple SQL queries. They're moving petabytes of data confidently with full visibility into every file, and there's no manual tracking or guesswork involved.
Finally, to conclude our session for today with these impactful stories, I also want to leave you with three key takeaways on how S3 metadata will help your organizations with data discovery. First, it always provides you with current metadata, so you can stop searching and just start querying your tables for what you need. Your data scientists, your security teams, and your engineers can basically look up information within minutes.
Second, with these Iceberg-compatible tables, the journal and the live inventory tables, you can now build smart workflows and convert storage insights into actions, including batch workflows at scale. And finally, you can future-proof your data lake: S3 Metadata provides the foundation for AI agents that can interact intelligently with your storage. Here are some resources that will help you get started on your metadata journey. We're very grateful to have you all here today. Thank you for spending the time with us, and good luck with the rest of your re:Invent sessions. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.