🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Maximize the value of cold data with Amazon S3 Glacier storage classes (STG208)
In this video, Gayla Beasley and Nitish Pandey from the S3 Glacier team explain how to maximize cold data value using Amazon S3 Glacier storage classes. They cover the three Glacier tiers (Instant Retrieval, Flexible Retrieval, and Deep Archive), S3 Lifecycle policies for automatic data transitions, and S3 Intelligent-Tiering for unpredictable access patterns. The session introduces two new capabilities: the compute checksum operation in S3 Batch Operations for verifying archived data integrity without downloading objects, and S3 Metadata with the Live Inventory Table for querying object metadata using SQL or natural language. A demo walks through validating checksums for NASA images tagged under Project Odin using these new features.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Growing Value of Cold Data Storage
Hello. Good afternoon and welcome to STG208: Maximize the Value of Cold Data with Amazon S3 Glacier Storage Classes. My name is Gayla Beasley. I'm very excited to be here and I'm really excited to see you all in the afternoon after lunch. Give yourself a pat on the back for making it to your session. Just by a show of hands, how many folks are already using Glacier? We got a few of you out there. Good to know. We built this session for people who are new to Glacier, for some folks who maybe occasionally use Glacier and are familiar with it, and we even have some content for those archiving storage professionals out there, so we have a good session for everyone.
You can think of this session in three parts. In the first part, for those new to Glacier, I'm going to briefly cover why your cold data is important. I'm going to go over the S3 Glacier storage classes. I'm going to talk to you about getting your data into Glacier and getting your data out of Glacier. Then I'm going to hand it over to Nitish Pandey, who is going to cover some new S3 storage archive-related features, and then we'll have a quick demo for the professionals. Alright, let's get started.
S3 currently holds hundreds of trillions of objects. One estimate out there says that 70 to 80 percent of all data stored everywhere is cold. What we mean by cold is data that's rarely accessed. It's sometimes stored for months, years, and even decades. That amount of data is hard to ignore because it's growing every single day. But what's interesting is that the value of that data is increasing. Across industries, we're seeing something very remarkable happen.
Cold data isn't just stored and forgotten anymore. It's now become a catalyst for innovation. It's emerging as a huge differentiating factor for a lot of businesses, and our customers are unlocking all different types of ways to use what was once considered dormant data and turning it into actionable intelligence. Think bigger than storage. Let's say you have a whole bunch of historical handwritten records and you want to use them to train a machine learning model to recognize the handwriting in those archives. You might want to derive new insights from archived data through analytics. Maybe you have a bunch of financial or stock data that you want to use to build a new application. The opportunities are endless.
Understanding S3 Glacier Storage Classes and Their Use Cases
Our customers are every day identifying new opportunities for this archived content. It's about turning every byte of cold data into an opportunity, and that's why Nitish and I are excited to go over S3 Glacier with you all. We're both on the service team and this is hugely important to us. Let's dive a little bit deeper into what the Glacier storage classes are. Now, if you've been around S3, you've probably seen a slide like this before, but for the new folks, let's talk about these storage classes. Imagine a continuum where you're balancing two key factors: your access speed and your cost efficiency. On your left, we have our frequently accessed data, so it's ready for you in your applications in milliseconds, but it's at a premium. As we move right, we're entering into progressively cooler territory. This is where your access becomes less frequent, but your storage costs are going to drop significantly.
Let's break this down a little bit more. With S3 Standard, you're getting that millisecond access for your active data. Moving through the infrequent access storage classes like S3 Standard-IA, you're already going to start seeing some cost savings. Then as your data continues to cool, you can move it through Glacier, and this is where the real magic for cold data happens. You have Glacier Instant Retrieval, which is for that data that you need in milliseconds that's still archived. For data that you need within minutes, you have Glacier Flexible Retrieval.
And then you have Glacier Deep Archive, which is our lowest cost storage in the cloud. Here's the brilliant part: as your data naturally cools and as access patterns decrease, you can progressively move through these tiers and continuously optimize your storage costs without sacrificing the ability to retrieve when needed.
You may ask which storage class you should choose. Think of this as finding the right combination of cost and retrieval speed. For Glacier Instant Retrieval, this is your go-to for data that's cold but you need very quickly. Some examples could be medical records that are rarely accessed, but when you need them, you need them right away. Or if any of you are in the broadcast media space, this could be archival clips that have a tight deadline that you have to pull in for a project. This tier is also popular with those who have compliance requirements that require instant access.
Next, you have Glacier Flexible Retrieval, and with Glacier Flexible Retrieval, you get something really cool: free bulk retrievals. This is for large-scale data analytics where your timing is flexible, backup archives where you have some time and you can plan your retrievals, and also historical records that may need occasional bulk processing. Our last tier, Glacier Deep Archive, is our lowest-cost storage option. This is for your long-term retention at the lowest possible cost. It's storage for data you hope you never need, but you must legally keep. Typically, you're measuring archive retention in years and decades.
Managing Data Lifecycle: Getting Data Into and Out of Glacier
How do you get data into these storage classes? For this, we have S3 Lifecycle policies. Lifecycle policies allow you to transition your data into infrequent access or archive storage classes over time as your data cools off, and it can automatically delete that data at the end of its life. This allows you to automatically control costs based on known access patterns.
For example, let's say you have an object that you created on day zero that's accessed very frequently in the first ninety days. After that ninety days, it's rarely accessed. You can create a lifecycle policy that automatically transitions your object to Glacier Instant Retrieval after that ninety days. If your application then needs to access that object, it's available in milliseconds and there's no problem. It works great. But let's say after one hundred eighty days access is extremely rare and we want to further optimize our storage costs. Our lifecycle policies can handle that too by transitioning the object to Glacier Deep Archive.
Then after that, let's say you have a compliance requirement. Maybe you're in the financial industry and you have a requirement of keeping data around for seven to ten years. You can build a lifecycle policy that will automatically delete that data at the end of that time. We provide you with a number of filters that you can apply to your lifecycle policy which determine which objects it will affect. You can apply them to an entire bucket if you need to, but you can also apply them to specific prefixes or objects that match certain object tags. You can filter by object size, which is really great for archiving because you typically want to avoid sending a bunch of small objects into archive. You can also filter on object versions. If you want to avoid keeping a bunch of unneeded noncurrent versions, you can create a policy that retains one or two and deletes everything else.
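To make this concrete, here is a minimal sketch of that kind of lifecycle configuration with boto3. The bucket name, prefix, size threshold, and day counts are hypothetical placeholders, not values from the session; adjust them to your own access patterns and retention requirements.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; tune the day counts to your access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-project-data",
                "Status": "Enabled",
                # Only apply to objects under this prefix that are at least 128 KB,
                # so small objects stay out of the archive tiers.
                "Filter": {
                    "And": {
                        "Prefix": "projects/",
                        "ObjectSizeGreaterThan": 131072,
                    }
                },
                "Transitions": [
                    # Rarely accessed after 90 days, but still needed in milliseconds.
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    # Almost never accessed after 180 days.
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Delete at the end of a hypothetical 7-year retention period.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```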
But let's say you don't know your access patterns or they're very unpredictable. We have a storage class for that as well, and that's S3 Intelligent-Tiering. Sometimes access patterns truly are just unpredictable or unknown.
S3 Intelligent-Tiering is a storage class designed for customers who want to optimize their storage costs automatically when data access patterns change, and it does that without any performance impact, operational overhead, retrieval fees, or lifecycle transition fees. Intelligent-Tiering delivers automatic savings by moving your data between tiers. There are five tiers: three of them are synchronous and applied automatically as soon as you start using the storage class, and you have to opt into the two asynchronous archive tiers. After you opt in, Intelligent-Tiering will choose the correct tier for you.
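As a rough sketch of that opt-in step with boto3 (the bucket name and configuration ID below are hypothetical), the two asynchronous archive tiers are enabled per bucket like this:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and configuration ID. The 90- and 180-day values are the
# minimums allowed for the two opt-in asynchronous archive tiers.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-archive-bucket",
    Id="opt-in-archive-tiers",
    IntelligentTieringConfiguration={
        "Id": "opt-in-archive-tiers",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```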
So far we've talked about why your cold data is important, how to choose your storage class, and how to get your data into the right storage tier. You've optimized all your storage costs. So what happens when your data is suddenly hot again? Let's talk about restoration options.
Restoration Strategies and Batch Operations for Large-Scale Retrievals
The first question you might ask is why are our customers restoring data? Great question. Let's look at some new and exciting emerging patterns. You're sitting on a gold mine of archived media files. We're seeing our customers breathe new life into their historical content, transforming decades of old footage into fresh, compelling content for today's audiences. When it comes to backup and compliance, it's not just about checking regulatory boxes anymore. Organizations are finding strategic value in their historical records and using them to track long-term patterns and inform their future decisions.
But here's where it's getting really exciting. Cold data is rocket fuel for machine learning models. Think about it: you have decades of historical data training models to spot patterns we never could have seen before. Your archive data isn't just history; it's your competitive advantage. For example, let's say you are a company working on self-driving car technology. Perhaps you need to train your model and you want to restore all your content related to cars taking left turns. You can do that with Glacier. We're seeing a brilliant trend where companies are generating rich metadata from their archive content, making vast data lakes searchable and actionable in ways that they weren't before.
So now let's dive deeper into how customers are accessing their stored data in Glacier. First, with Glacier Instant Retrieval, it's pretty straightforward. It's the same GET request that you would use for S3 Standard, S3 Intelligent-Tiering, and the other synchronous storage classes. You get millisecond access, and the trade-off is higher retrieval and API charges. Customers often mention that with Glacier Instant Retrieval, it's amazing for them because now they're able to save on their storage costs without having to make any changes to their applications.
However, if you have retrievals that can withstand some wait time, from minutes to hours to days, we can pull your data out of Glacier Flexible Retrieval, but that's a bit different. You can lower your cost of storage while also reducing the cost to retrieve even large amounts of data with Glacier Flexible Retrieval and Glacier Deep Archive. These storage classes require a restore request before the data can be accessed, and once restored, you can issue a GET request on the same object key to retrieve the object.
So for retrievals from these Glacier S3 storage classes—Glacier Flexible Retrieval and Glacier Deep Archive—you generally have three steps involved. First, you have to initiate a request. Second, you have to check that the restore has completed. And finally, you access your data. I'm going to take you through each step.
When dealing with millions and even billions of objects, one factor to account for is the time it will take to submit all the requests to Glacier. Glacier supports a request rate of 1000 transactions per second. This TPS limit automatically applies to all standard and bulk retrieval requests from S3 Glacier Flexible Retrieval and Glacier Deep Archive. Let me walk through an example. At 1000 TPS, you can submit 10 million restore requests in under 3 hours. Using standard restore from Glacier Flexible Retrieval, for example, you can complete all restores in about 6 to 8 hours: roughly 3 hours to submit the restore requests, plus the 3 to 5 hours each restore takes to complete.
To ensure you get the highest restore performance, you can also rely on Batch Operations, which I'll discuss in a moment. This will help you maximize your TPS and get automatic retries. If you remember our second step, that was monitoring the request status for Glacier restores. For these restores, S3 will create an event for restore initiation and completion. You can publish these events to Amazon EventBridge and configure that to fan out to an SNS topic or an SQS queue. You can even have it trigger a Lambda function if you need it.
You can also configure restore completion events within EventBridge to send events to an SNS topic, which your application can then subscribe to. This will allow your application to automatically proceed to the next step, such as a get or a copy, as soon as the object is restored. Now we get to our final step. Accessing your restored data. Once an object is restored from Glacier, your application can access it like you would any other object. It's now in S3 Standard. You can perform a get or access it any other way that you would access an object in a synchronous tier.
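For reference, here is a minimal boto3 sketch of those three steps. The bucket, key, retrieval tier, and retention days are hypothetical, and the polling loop is only for illustration; a production setup would typically react to the EventBridge restore-completion events described above instead.

```python
import time
import boto3

s3 = boto3.client("s3")
bucket, key = "example-archive-bucket", "projects/footage-0001.mov"  # hypothetical

# Step 1: initiate the restore (Standard tier, keep the temporary copy for 7 days).
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# Step 2: check restore status. HeadObject returns a Restore header containing
# 'ongoing-request="false"' once the temporary copy is available.
while True:
    restore_status = s3.head_object(Bucket=bucket, Key=key).get("Restore", "")
    if 'ongoing-request="false"' in restore_status:
        break
    time.sleep(300)  # polling shown only for illustration; prefer event notifications

# Step 3: access the restored data with a normal GET.
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```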
However, here's a tip to remember: the restored data is a temporary copy of the object. To move it out of Glacier, you can either do a copy and place it over the same key or you can copy it to a new bucket. If you're copying over the same key, just keep an eye on versioning in case you don't want to keep an additional Glacier version. Now I did briefly mention Batch Operations. Glacier works like a freight train: if we know the full batch of requests up front, we can do some optimizations on our side to make things run smoother. If you're submitting fewer than 1000 transactions per second, the requests will take longer to submit. I've seen examples of customers requesting around 25 TPS, essentially increasing their total restore time by 40 times, which is suboptimal.
But the good news is that you don't have to optimize your software or multi-threading. You can use Batch Operations, and that can dramatically improve your restore experience. Instead of fine-tuning your software, you can use Batch Operations to automatically maximize your restore requests per second. It can also do automatic retries to handle any error failures, and with the completion report, it will signal any issues with your job. You create a manifest, a list of keys you want to restore, and submit it to Batch Operations. Each manifest will be associated with a single job, and additional jobs from there will then split up your TPS.
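To give a sense of what that looks like in practice, here is a rough sketch of creating a Batch Operations restore job with boto3. The account ID, role ARN, manifest location, ETag, and report bucket are all hypothetical placeholders.

```python
import boto3

s3control = boto3.client("s3control")

# All identifiers below are hypothetical placeholders.
s3control.create_job(
    AccountId="111122223333",
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-restore-role",
    Operation={
        "S3InitiateRestoreObject": {
            "ExpirationInDays": 7,         # how long to keep the temporary copy
            "GlacierJobTier": "STANDARD",  # or "BULK" for free bulk retrievals
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::example-manifests/restore-manifest.csv",
            "ETag": "example-etag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::example-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "restore-job-reports",
        "ReportScope": "AllTasks",
    },
)
```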
We've covered the importance of your cold data and how it can drive innovation for your applications. I talked to you about your storage classes and how to pick the best one. I talked to you about getting your data into Glacier, and I talked to you about getting your data out of Glacier. Now Nitish is going to come up and talk to you about some new archive-related features. Thank you. Whoa, I didn't fall.
New Compute Checksum Operation for Data Integrity Verification
Thank you, Gayla. Now you have seen the S3 storage offerings and the pathways for getting data in and out of S3 Glacier. Let me build on that foundation and introduce two new capabilities that we have recently added, each specifically designed for archive workloads. We are actually going to talk about three capabilities. One will be coming in Matt Garman's keynote tomorrow, so stay tuned. The first thing that we are going to talk about is checking the integrity of your archived data at rest. The second thing that we are going to talk about is using S3 Metadata for quick discovery of your archive content. For each of these capabilities, I'll address three questions: What problem are we solving here? How does it work? And what does it mean for you? Let's start with the first feature, the new compute checksum operation in Amazon S3.
Why this capability is so vital is because customers across domains such as media and entertainment, life sciences, criminal justice, and preservation institutions perform periodic data integrity checks to make sure that the data is intact. This could be the master copy of an iconic movie or it could be a historical artifact like the Constitution of the United States, or it could be something required by the compliance team, or proof that evidence has not been tampered with. In all these cases, verifying archived data is an industry standard. Our customers asked us to provide tools in S3 to help them do this, and that is exactly what we have built.
Before I dive deeper into this capability, one thing that I want to highlight is that everything that we are talking about regarding checksum validation here is completely optional. S3 performs billions of checksum operations every second to make sure that every byte in transit and at rest is intact. But we also know that many of our customers are already performing their own checks, and we knew we could provide a better way of doing them. That was the intention behind this feature. But first, let's cover the basics. What is a checksum that we are computing here? A checksum is a digital fingerprint of an object. In this example, we have three objects. We are using the checksum algorithm CRC32, and we have a unique alphanumeric value for each object. Even a single bit flip will result in a different checksum value.
Customers store the original checksum value on their media asset management systems or in a checksum repository as a source of truth. Later, whether it is six months, two days, a few seconds after the upload, or after ten years, they come back and calculate a fresh checksum of the object that is in S3 and compare that against the checksum that is stored in their systems. It is a way for them to prove that the data remains intact. S3 already provides a range of capabilities during upload. When you upload an object to S3, you can specify the checksum algorithm you want to use, and you can provide a pre-calculated checksum value, which is optional. You have support for six checksum algorithms: CRC32, CRC32C, CRC64, MD5, SHA1, and SHA256. For every upload, S3 calculates CRC64 on the client side and on the server side to provide you end-to-end data integrity. Only on a match does the request succeed. This works well for data in transit.
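For reference, a minimal sketch of an upload that supplies a pre-calculated SHA-256 checksum with boto3 (the file name, bucket, and key are hypothetical; S3 recomputes the checksum server-side and rejects the request on a mismatch):

```python
import base64
import hashlib
import boto3

s3 = boto3.client("s3")

# Compute the SHA-256 digest locally and keep it as your source of truth.
data = open("footage-0001.mov", "rb").read()  # hypothetical file
checksum_b64 = base64.b64encode(hashlib.sha256(data).digest()).decode()

# S3 verifies the supplied checksum on the server side during the upload.
s3.put_object(
    Bucket="example-archive-bucket",
    Key="projects/footage-0001.mov",
    Body=data,
    ChecksumSHA256=checksum_b64,
)
```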
Now let's see how customers perform data verification for already stored data in S3. Until now, verifying a checksum for data at rest required two steps. First, you download the object. This has a compute cost, a bandwidth or data transfer fee, and it takes time. Then you spin up an EC2 instance or use your own infrastructure to calculate the checksum locally. That means more compute costs, more time, and added complexity. For large archives, this process can be cost prohibitive or time consuming. We needed to eliminate the step of downloading the object and then calculating the checksum. We wanted to come up with an innovative way to do an in-place read of the data and calculate a fresh checksum.
That is what we have done with the compute checksum operation. This is a new capability in S3 Batch Operations, and it provides you a new way to verify the content of your dataset stored in Glacier or any storage class.
You can efficiently verify billions of objects and automatically generate a data integrity report to prove that your data remains intact over time. This capability works with any object in S3 regardless of the storage class or object size. Whether you're verifying your data for compliance reasons, for digital preservation, or accuracy checks before feeding the data into a model, you can reduce the time, cost, and effort associated with that process.
Because it is built into S3 Batch Operations, you get automatic retries on failures and a detailed completion report at the end, which you can use as your data integrity report. Creating a compute checksum job is simple and has three components. First, you provide an object list, also known as a manifest in S3 Batch Operations. You can create a curated list and submit it as a CSV file, or you can use S3 Batch Operations automatic manifest generation service. You can also use an inventory report and feed that in as the manifest. Batch Operations supports that as well.
Second, you choose the checksum algorithm. We already discussed the algorithms supported in S3, so the story is consistent. Everything that is supported on upload is supported with the compute checksum operation. The algorithm you pick depends on your business or compliance use case. For example, if you want something that is secure and more compliant with regulatory needs, you can use a secure hash algorithm like SHA-1 or SHA-256. If you need performance and don't care much about compliance requirements, you can use CRC64NVME, CRC32, or CRC32C.
The third component is the checksum type. You can choose a full object or a composite checksum. If you're in a media supply chain or dealing with third-party providers, where you hand off content and need a full object checksum so that everyone is speaking the same language and the chain of custody is maintained, use the full object checksum type. If you don't have such needs and it's all internal, with your team knowing what you're discussing, or if you're dealing with large objects and want to perform parallel checksum operations, you can use the composite checksum type instead.
Of course, you'll need to provide the IAM permissions to S3 Batch Operations so that it can read the bytes and write the completion report. These are the three inputs you need along with the permissions, and S3 Batch Operations handles the rest. It will read the object, compute the checksum, and provide you a nice integrity report at the end. The completion report will have fields like bucket, key, version ID, error code, and result message.
Once the job is complete, you can use this to validate against the checksums stored in your media asset management system or checksum repository. You can also use a JSON parser to extract the checksum values from the result message field and turn the report into a table for validation. Additionally, you can use a Lambda function to automate this job every six months or every year if that is your need.
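As a rough sketch of that parsing step with Python (the report location, the CSV column positions, the result-message field name, and the checksum lookup table below are all assumptions; check them against the report your own job produces):

```python
import csv
import io
import json
import boto3

s3 = boto3.client("s3")

# Stand-in for your media asset management system or checksum repository.
expected_checksums = {"projects/footage-0001.mov": "ab12cd34..."}  # hypothetical

# Hypothetical location of the compute checksum job's completion report.
report = s3.get_object(
    Bucket="example-reports",
    Key="checksum-job-reports/job-1234/results/report-0.csv",
)
rows = csv.reader(io.StringIO(report["Body"].read().decode()))

for row in rows:
    key, result_message = row[1], row[-1]  # bucket, key, ..., result message
    if not result_message.strip().startswith("{"):
        continue  # failed task or empty message; inspect the error code columns
    result = json.loads(result_message)
    actual = result.get("checksum")  # assumed field name in the result JSON
    expected = expected_checksums.get(key)
    status = "OK" if expected and actual == expected else "CHECK"
    print(f"{status} {key}")
```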
The best part is that you do not incur any fee for restore or retrieval. Your data in Intelligent-Tiering won't warm up, and if it's in Glacier, you don't have to restore those objects from Glacier. You pay a single fee of $0.004 per GB or $4 per TB to process the data, and that is consistent across all storage classes. Now let's move on to the next capability that would be helpful, which is S3 Metadata.
S3 Metadata: Querying Archive Content with SQL and Natural Language
This capability was launched during re:Invent last year, and we have made some improvements to S3 Metadata. S3 Metadata automatically extracts metadata from your objects and makes it available to you to generate valuable insights using simple SQL queries or natural language. We believe that this will fundamentally change how customers manage and extract value from their cold or archived data. Let me explain the problem we are solving here. We hear this from our customers very often.
We have a bucket with millions of objects or files that have been archived. We have tagged those objects, so in this case we have tagged all the objects either as Project Odin, Loki, or Thor, and then a few of them are untagged. Some are in Glacier and a few are in S3 Standard. The customer's ops team comes and asks, "We want to move all the data for Project Odin that is in Glacier to Standard. How much data are we talking about?"
Today, there are multiple options to answer that question. You can paginate API requests like ListObjectsV2 and GetObjectTagging, scan the storage, and calculate the totals across storage classes. Or you can use the S3 Inventory report, which is delivered on a daily or weekly schedule, and then use that to answer the question. But at scale, both options either introduce delays or are not easy to perform.
We saw an opportunity to make it faster. What if we could answer this question in seconds by writing a single query? That is exactly what we tried to do with S3 Metadata. It is a fully managed service that automatically captures all your object metadata, tags, object storage class, object size, everything that you get from the head object and stores that into a queryable Apache Iceberg table. No APIs to call, no inventory reports to wait for, just instant SQL queries.
We launched S3 Metadata at re:Invent last year, starting with the Journal Table. The Journal Table captures every change to your bucket in near real time. Puts, deletes, metadata updates, tag changes, all of them are captured in the Journal Table. It is your complete change log of what is happening in your bucket right now. In July this year, we added a new capability, the Live Inventory Table. It shows the current state of every object in your bucket. It backfills all your existing data when you enable it, and it refreshes every hour.
Both tables are read-only and fully managed by AWS, and you can think of them as your authoritative system of record for everything in your bucket. Now, here is where it gets really powerful. You can query these metadata tables in two ways. First, through standard SQL queries using Athena, Redshift, SageMaker Unified Studio, or any analytics tool that supports Iceberg tables. And second, through natural language using Amazon Q, Kiro, or any agent you are using that supports MCP servers.
Here are a few examples that we tried. First, to learn about storage usage, we ask how much data is in Glacier for each project. You can simply write a SQL query and it will give you a snapshot of that. Second, you can identify all the objects that are untagged so that you can classify them properly. And lastly, you can query the Journal Table for auditing, to track what data was deleted, who deleted it, and when it was deleted.
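For illustration, the first example might be run against the live inventory table through Athena roughly as follows. The database name, table name, column names, and tag key are assumptions that depend on how your metadata tables are configured, so treat this as a sketch rather than a ready-made query.

```python
import boto3

athena = boto3.client("athena")

# How much data sits in each Glacier storage class per project tag?
# Table and column names below are assumptions; check your metadata table schema.
query = """
SELECT object_tags['project'] AS project,
       storage_class,
       SUM(size) / POW(1024, 3) AS total_gib
FROM "s3_metadata_db"."example_bucket_inventory"
WHERE storage_class IN ('GLACIER', 'GLACIER_IR', 'DEEP_ARCHIVE')
GROUP BY 1, 2
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s3_metadata_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```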
Just recently, we added MCP server support for S3 Tables, which means you can connect your AI assistants, like custom agents and Kiro, directly to your metadata tables and interact with them in plain English. That is really powerful because it democratizes access and insight generation from your archive data. You do not have to be a data engineer or data scientist. Anyone on your team, whether from finance, ops, or compliance, can just write their questions in plain English, and the agent, through the MCP server, will convert those questions into queries, extract the insights from the archive data, and provide them in a form that is easy to consume.
Live Demo: Validating Data Integrity for a Space Tech Startup Project
Now, let us move on to the next section, which is a demo. For a moment, let us assume that we are a space tech startup. We are building an autonomous space vehicle, and we are using images and videos available from NASA to build a space simulation model so that we can train how the vehicle navigates.
These images are valuable as training data, but we need to verify the integrity of these objects before feeding them into the expensive GPU compute resources we'll use for training. To do that, we are going to focus on three things today. We will use the inventory table to find all the objects that are related to Project Odin, which is our code name for our autonomous space vehicle project. Then we will calculate the checksums of these objects, and then we will use the completion report to compare them with the original checksums that are stored as tags.
We will go to the bucket reinvent-stg208-demo, where we have hundreds of images and videos that we have downloaded from the NASA website. They are stored across different storage classes. Some of them are in Glacier Instant Retrieval, and some of them are in Standard. We will open one of them and look at the object properties. When we scroll down, we can see the tags that are attached to it.
The tags include the SHA256 checksum algorithm that was used when this object was created, the checksum timestamp, the project name of Odin, and the baseline checksum hex code, which is the original checksum. We have added these tags to most of the objects, but we intentionally left a few of them untagged. Our plan is to use the live inventory table to query all the objects that are associated with Project Odin and create a list that we can feed into Batch Operations.
We will use SageMaker Unified Studio to query the live inventory table. We need to make sure that we shape the query in a way that can be used with Batch Operations: we need the bucket and key columns, and then we need to add a WHERE clause. We are adding a WHERE clause where the object tags have the project name Odin. Then we will run this query.
We have 25 files that are associated with Project Odin. Now let us download the CSV and see the values before feeding them into S3 Batch Operations. We need to remove the first row so that it matches the format that is supported in the batch operation intake process, and then save this file. We will save it as Odin manifest V2. Now let us go back to our general purpose bucket, the S3 bucket that we had. We have created one more bucket there for S3 Batch Operations jobs.
We have two buckets here, one for storing the manifest and another one for providing us the destination location to get the results from batch operations. We will upload the Odin manifest V2 to this folder. Then we go back and look at everything to make sure we are good here.
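A small Python sketch of that manifest preparation step (the file names and bucket are hypothetical placeholders): strip the header row from the query export so it matches the bucket,key CSV format Batch Operations expects, then upload it.

```python
import boto3

# Drop the header row from the query export so it matches the
# bucket,key CSV format that S3 Batch Operations expects.
with open("odin_query_export.csv") as src:          # hypothetical export file
    lines = src.read().splitlines()[1:]
with open("odin-manifest-v2.csv", "w") as dst:
    dst.write("\n".join(lines) + "\n")

# Upload the manifest to the bucket that holds Batch Operations inputs.
boto3.client("s3").upload_file(
    "odin-manifest-v2.csv",
    "example-batch-manifests",                      # hypothetical bucket
    "batch-manifests/odin-manifest-v2.csv",
)
```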
This is our result location where we'll be storing the result from the compute checksum job operation. Now we'll create a new job and use the existing manifest that we just created, which is stored in our batch manifest folder, V2. We'll select the CSV file, then select the compute checksum operation, which is the new one. We'll set the checksum type as full object and SHA256. This is important—we need to acknowledge that this report can be accessed by the bucket holder because the completion report will have checksum values, which are plain text data. Then we'll set the destination location and add the permissions for the IAM role. We've created an S3 batch checksum role for this, and we'll submit the job.
We'll run this job and wait for it to complete, which can take a few seconds. Meanwhile, it's good to know that we've added the automatic manifest generation capability, which was previously available through the Batch Operations API, to the console as well, so you can use that for generating a manifest if you're trying this yourself. The job should be done any moment now. The job is complete. Let's go back to the folder that we created for batch results and navigate to the results folder. Let's open it to see how the output looks in the completion report.
This is how you get the result message. This column is the result message, and it has the checksum algorithm SHA256, the checksum type full object, and the checksum value in both base64 and hex encoding. After this, we'll use Kiro to compare this with the values that are stored with the objects as tags. We're using the Kiro CLI and ask it to validate the Project Odin checksums using the output of the S3 Batch Operations job, providing the S3 URI for the CSV file that we just got from Batch Operations.
It has created a Python script, and all 25 objects that were related to Project Odin have been validated to match the checksums that are stored as tags. With that, we conclude the demo. I want to summarize our session with a few key takeaways. First, with the advancement of AI, archived data can be a differentiating factor. S3 provides you multiple storage options curated for your specific business needs in terms of access pattern, storage time, and cost. Finally, we are adding new capabilities to S3, such as the compute checksum operation and S3 Metadata, that can help you easily manage archived data while extracting more value from it. With that, we would like to conclude this session. Thank you so much.
This article is entirely auto-generated using Amazon Bedrock.