Kazuya

AWS re:Invent 2025 - Inside S3: Lessons from exabyte-scale data lake modernization (STG351)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Inside S3: Lessons from exabyte-scale data lake modernization (STG351)

In this video, Carl Summers and Ran Pergamin from the S3 team share their journey modernizing an exabyte-scale data lake for S3's internal operational logs. They explain how one hour of S3 logs would stack taller than Mount Everest if printed, and describe the challenge of making this data useful rather than just storing it. The presentation covers their evolution from grep-based searches and JavaScript queries to a modern architecture using Apache Iceberg with the Parquet columnar format, implementing pushdown optimizations, intelligent partitioning, and sorting strategies. They detail their three-layer schema design (identity, metrics, context), the custom transcoder they built that converts an hour's worth of text logs in about three minutes, and their gradual migration strategy that maintains backward compatibility. The result: thousands of engineering hours returned, queries running on five-minute-old data, and product managers accessing historical data orders of magnitude faster without engineering support.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The Challenge of Making Exabytes of S3 Log Data Useful

Hello. My name is Carl Summers, and I'm a principal engineer with the S3 team. I'm joined today by Ran Pergamin, who is a senior specialist solutions architect. You are all in STG 351, which is inside S3: Lessons from our exabyte-scale data lake modernization efforts.

Thumbnail 30

Thumbnail 40

Let me start by painting a picture for you. If you printed out just one hour of Amazon S3's internal log data, the stack of paper would be taller than Mount Everest. Of course, it's growing year over year. Remember, that's just one hour.

Thumbnail 50

Thumbnail 60

But here's the thing: these are not just logs. Every single one of these entries represents a customer interaction with Amazon S3. It's a family photo being uploaded, a medical record being stored, a machine learning model being trained, or climate research data being shared. You and millions of customers like you put your trust in Amazon S3. So these aren't just logs. They are stories, they are businesses, and they are our lives.

Buried within those exabytes are the answers to some pretty critical questions. Why did a specific request fail? What's causing this particular latency spike? Which features are customers actually using? How do we prepare for the unexpected? Now our challenge here is not generating or storing this amount of data. After all, we have been doing that for nearly 20 years. Our challenge is making that data useful.

Thumbnail 120

Thumbnail 140

I think it's really exemplified by three different questions. We asked ourselves: are our engineers spending more time finding data than actually analyzing it? We have these brilliant engineers that are spending hours just getting to the data they need, not using that data to solve problems. How much faster could we be resolving issues with better insights into our logs? In operational and business decision making, seconds and minutes count. Hours or days spent retrieving the data you need is time lost doing or building the right thing.

Thumbnail 160

What business questions remain unanswered or, even worse, unasked because the data is simply too difficult to access? How many opportunities are we missing out on? How many insights are buried in that Mount Everest of data simply out of reach? Now it's these three questions that have guided the direction of my work for the last two years.

Thumbnail 200

Thumbnail 210

Learning What 'Unlikely' Really Means at S3 Scale

To really understand why it was necessary, I'd like to tell you a personal story from my time here at S3. It's a story about what unlikely really means. It starts back in 2013 when I was a new S3 engineer who thought I understood what it meant. I started in S3 on the team that owned the front-end components—think load balancers, front-end fleets, and much of the business logic for S3. About a month after I started, I began working on my very first feature.

Thumbnail 250

As part of that feature, I needed to retrieve some configuration information that had been applied to the customer's bucket. That meant I needed to make a network call to the service that owned the bucket configuration data. Now that client declared that it might throw a handful of pretty standard, straightforward exceptions, things like IO errors, buckets not found, and so on. But it also declared this kind of really odd-looking one that I didn't understand at first.

After reading through the client code and talking to some of the people that owned the service I was calling, I figured it was just never going to happen. So I wrapped it and threw an exception that resulted in the customer getting a 500 error, their SDK retrying the request, and likely succeeding. I finished up some other work that I felt was probably more important and sent the team a pull request. Obviously, I'm telling this story for a reason, so you can imagine I got some feedback on that code review.

We talked about it in standup the next day, and this far more senior, tenured engineer asked me, "Carl, do you think there's enough context here to be able to figure out what happened?" To which I replied, "I don't think it matters. This particular exception seems really unlikely."

Thumbnail 320

Thumbnail 340

Thankfully, they were not satisfied with that answer and pressed me a bit, asking what unlikely meant to me. To which I replied, there's effectively no chance of this happening. The chances of this happening are like one in a billion. Now that engineer put their head down, looking at the ground, doing a little bit of mental math. They came back and said, so this is happening once every three minutes. This was a mind-blowing moment for me. It taught me my first lesson of working at S3 scale: exceedingly unlikely things happen exceedingly often. In fact, the bulk of the work that we do at S3 is thinking about and planning for when things don't go the way they're supposed to.

Thumbnail 390

Now it turns out that at S3's request rate back then, their mental math was a little bit off, but their point was very valid. This is going to happen far more frequently than I think. Do I have enough context to understand what went wrong? I think it's pretty clear that I didn't. So I did what any one of you probably does all the time: I added logs and I added metrics. By the time I was done, my code looked a little bit more like this. For every line of logic, there's one, maybe even two others capturing context, illuminating what's happening inside my system. I'm capturing whether I was able to fetch the configuration, whether we hit the cache and if we did, how stale the entry was. If we didn't, how many attempts did it take to actually fetch. Finally, when that unlikely thing happens, I'm capturing what the remote service was able to tell me.

Thumbnail 440

So now I'm prepared to understand what happened when this unlikely thing occurred, right? Maybe. Let's have a look at where these log entries end up. This is a somewhat representative log entry at S3. It's in multi-line format, primarily a set of key-value pairs. Some of those pairs are nested to allow us to define counters and timers, and others are used to capture arbitrary context, things like what API was called, through what interface, by whom, and how we responded.
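Purely as an illustration of the shape being described (these field names are invented for this article, not S3's actual log schema), one of those multi-line key-value entries might look roughly like this:

```
Operation=GetObject
Interface=REST
HTTPStatus=200
RequestId=EXAMPLE0123456789
Counters=(CacheHit=1, FetchAttempts=0)
Timers=(TimeToFirstByteMs=12, TotalTimeMs=48)
RemoteServiceError=none
```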

Thumbnail 490

Thumbnail 510

I want to take a moment to clarify. Today we're discussing S3's internal operational logs. They are what we use to run, monitor, and operate the service itself, and they're fundamentally different from server access logs or CloudTrail logs, which are designed specifically for you to audit and analyze your own bucket activity. I also want to take a moment to pause and talk about a tenet that the principal engineering community at AWS holds dear: we respect what came before. I think you and I can probably agree that if we were to sit down today and design a structured log format, it wouldn't look like that one. But that one works. It's simple, it's readable, it's eminently extensible, and anyone with about a month or two of programming experience can write a parser for it. But most importantly, it works, and it has been working for the better part of twenty years.

Thumbnail 530

Thumbnail 550

That's a theme we've taken to heart through the journey. Yes, we can build it better, stronger, and faster, for some definition of those words. But we're building it to replace something that has worked, and we can't lose sight of why it's worked and the value that it's delivered. Finally, I want to point out that there's a lot more than just my context in this log, right? Every feature, configuration, logical branch, cache hit and miss is captured here. By the time we're done, we're looking at about five kilobytes on average per request and tens of thousands of unique elements.

Thumbnail 580

From Terabytes to Petabytes: The Evolution of S3's Log Ingestion System

My log entries are in my service's logs, but where do they go from here? Well, frankly, log ingestion, like anything in S3 when viewed in isolation, is actually very simple. Services typically write their logs to disk, and on some cadence, usually an hour, sometimes more frequently, those logs are compressed and uploaded to S3. The existence of that log file is registered in a centralized service that is indexing the time range and the source service for that log.

Thumbnail 610

Thumbnail 620

So far, I've just talked about the one microservice that I was working on. S3 is composed of multiple cooperating microservices operated by hundreds of engineers. Many of these are logging at very high volumes, and I hope you can begin to imagine the scale of data that we're generating. I've done my upfront work, captured and stored the context necessary to help me understand what went wrong when that unlikely thing occurred. I'm finally prepared to figure out what happened.

Thumbnail 650

Let me do a little bit of arithmetic first. When I wrote this code in 2013, S3 received around a million requests per second. My service, as I said, is logging on average about 5 kilobytes per log entry, and there are around 3600 seconds in an hour. If we do that math (roughly 1,000,000 requests per second × 5 KB × 3,600 seconds, or about 18 TB), it means that at that time my service was logging, before compression, terabytes of logs per hour. Today that same service is logging nearly petabytes. Logs are stored for variable time periods—some on the order of days, some for months, and some can even be forever. The total log volume stored by S3 is somewhere around a whole lot, exabytes in fact.

Thumbnail 730

So can I figure out what went wrong? All I have to do is download multiple terabytes of logs to my Mac and grep through them. I didn't have a multi-terabyte drive back in 2013, and I certainly don't have a petabyte drive today. Grepping through my logs isn't going to cut it. About 11 years ago we did what any self-respecting engineering team does—we had the intern write a tool. That tool was written with many of the same constraints we have today. It was built to handle S3 scale and was built with S3's foundational nature in mind. That meant it took as few dependencies as possible and still worked.

Thumbnail 750

Thumbnail 760

Thumbnail 780

It was also built with a singular use case in mind: find, in the billions of uninteresting log entries, the one that matters to me, the S3 engineer, owner, and operator. Its interface to do that was very simple. You could give it a string like a request ID and do a contains-style search. If you needed to look for a counter above or below a certain value, you could pass it a regex. And if you really needed it, you could pass it some JavaScript, and it would execute this on every log entry in your requested time frame. Pass it the right input, and in a short bit you end up with exactly the log entries you need.

Thumbnail 810

It turned out that this last capability—this ability to pass an arbitrary bit of logic representing a complex question—was this tool's key feature. We absolutely abused it. Do you need to find the logs for requests that hit the primary caches but still took more than 50 milliseconds to return the first byte? Write some JavaScript. You're a product manager and you want to know which customers are using a very specific combination of features? You write some JavaScript. Actually, I'm kidding—what product manager writes JavaScript? They get an engineer to write some JavaScript. You want to analyze the trend of a specific feature's usage by customer segment over the course of the last three months? You write some JavaScript. Well, actually it turned out that wasn't as easy. At least you're going to have to write some JavaScript, but then you're going to have to find some way to post-process the terabytes or even sometimes petabytes of results from your query.

Thumbnail 860

Thumbnail 890

Defining Outcomes and Meeting Users Where They Are

So it turns out that this tool was very good at the thing we built it to do, but not so good at the thing we needed to do. Now metrics and traces are powerful observability tools, and we use them across S3. But nothing comes close to the level of understanding that your raw logs can provide. In many cases, they're the only path we have to ask arbitrary questions about the current and historical behavior of our customers and our systems. So what are the outcomes that we're trying to achieve with this tool? We will always need to be able to service that foundational use case. S3's log volume has increased dramatically, but the importance of a single request has not changed. We're always going to need to find that needle in the mountain of logs. But we also want a tool that can be used easily by customer support engineers when helping our external customers.

Thumbnail 910

S3, as I mentioned, is a conglomerate of cooperating microservices. We want to be able to join across those services' logs and understand complicated inter-service interactions. We don't want to stop at our logs. We want to be able to join arbitrary data sets with them, things like hardware information, so we can do things like track performance or failure rates of hard drives and chips.

Thumbnail 940

Thumbnail 970

We also wanted that tool to be accessible and usable by product managers and other business owners so they can use it to inform their own future proposals, roadmaps, and planning needs. This returns the engineering time back to building and operating those services. So to recap, what are some of the challenges that we faced and that I reckon you may face in a similar endeavor? I've talked at length about the scale of S3 systems, but scale is relative. This problem was hard for us 12 years ago when we were at the terabyte scale. It's hard for us today. Whatever scale means for you, that problem is growing. It's going to get harder.

Thumbnail 990

Thumbnail 1000

Thumbnail 1010

None of my systems are legacies. They are heritage systems. They've been delivering value, doing the things the way they do them for years on years, and we want to interrupt that as little as possible. S3 is a foundational service. It needs its operational tooling and observability tooling to work even when everything else isn't. Finally, to break at least one rule and throw in a new item during the recap slide, my engineers have been writing JavaScript for the past 11 years to answer their queries. There are literally hundreds of wiki pages and scribbled notes with the right spell to make the tool do the right thing.

Thumbnail 1050

Thumbnail 1060

What that means is that what works with today's system has to work with tomorrow's. Does that mean I want them to continue to be able to write JavaScript? Really, no, it does not. But it does mean that whatever they're using the tool for today has to be straightforward and simple to do tomorrow. Which brings me to the second tenet that we held during this project: meet our users where they are. We had to understand deeply the types of questions our engineers were asking the current system. We had roughly 11 years of JavaScript queries; we were able to recover 7 of those years, so we analyzed those 7 years of queries to make sure that the majority of them were easy, or better yet easier, to do.

Thumbnail 1080

Thumbnail 1110

It's more than just logs. S3 has dozens of internal bespoke formats for organizing our internal data sets, most of which were built with an equally bespoke software system in mind, definitely not analytic queries. We have to have something that can bridge the gap between those analytic style queries and our custom formats that our teams have developed to serve their customers' needs. While I'm pretty sure that SQL is the right interface for such a system, it's actually not a great deal easier to work with than JavaScript, and the last thing that you want to be doing at 2 a.m. is writing it. We want a system that lets our engineers write these queries in the middle of their afternoon and register them, making them easy to discover so that they can be found and executed quickly when necessary.
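As a hedged sketch of what a "registered" query could look like in practice (the table, view, and column names here are invented for illustration, not S3's internal tooling), one simple approach is a reviewed, named view that sits alongside the data set and can be discovered and run without writing fresh SQL under pressure:

```sql
-- Hypothetical example: a pre-reviewed query registered alongside the logs table,
-- e.g. "cache hits that still took more than 50 ms to first byte".
CREATE OR REPLACE VIEW ops.slow_cache_hits AS
SELECT request_id,
       event_ts,
       bucket_name,
       metrics.time_to_first_byte_ms AS ttfb_ms
FROM   ops.request_logs
WHERE  metrics.cache_hit = 1
  AND  metrics.time_to_first_byte_ms > 50;

-- At 2 a.m., an on-call engineer only narrows the registered view to a time window.
SELECT *
FROM   ops.slow_cache_hits
WHERE  event_ts >= TIMESTAMP '2025-11-30 14:00:00'
  AND  event_ts <  TIMESTAMP '2025-11-30 15:00:00';
```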

Thumbnail 1170

Shifting from Reactive Troubleshooting to Proactive Intelligence

I've spent the past few minutes giving you context on the problem that we had in front of us and what we were looking for out of our solution. At this point, I'm going to invite Ran on stage to talk about how we achieve that. Thank you, Carl. Thank you so much. It's great to share the stage with Carl today. As Carl noted, we've been struggling with queries and thinking about queries for so long, and we've made a shift in how we look at this: rather than focusing on asking the right questions, we focus on getting answers as quickly as possible. We want to shift from something that feels and looks more like reactive troubleshooting into proactive intelligence.

The question is how do we get users answers to the questions as quickly as possible. As Carl noted, we sat with our engineers and asked them about the questions they ask, the data they need, and also the questions they never thought of asking, the questions they never thought they could get the answers to.

Thumbnail 1220

When we looked at this process, the starting point for this modernization effort was breaking it down into a workflow, and we discovered a couple of distinct phases that we could optimize.

Thumbnail 1230

Thumbnail 1240

First comes an exploration and learning phase. I have a question in mind. What happened during a specific request, or how much does a certain type of operation take? Who is using a combination of features? This phase is about discovering what data exists and figuring out how to access it, crafting the right query in order to unlock my answer.

Thumbnail 1260

Thumbnail 1270

Thumbnail 1290

The next phase is around actually collecting the data. It is at this phase where we are either waiting for something to happen as data is being collected, or we actually need to do something in order to collect the data. Finally, there is a post-processing phase where we actually get the results, we aggregate them, and we put them in some human-readable format so we can actually use them. Each of these phases represents an opportunity for optimization.

Thumbnail 1310

Thumbnail 1320

The first part of getting our users answers as quickly as possible—we call them internal customers—is making data discoverable. This sounds simple, but at exabyte scale it is actually quite challenging. Starting from scratch, how would I answer a question? Historically, this has been solved through some kind of institutional knowledge, like asking the engineer at the desk next to you. That does not scale well.

Thumbnail 1330

Thumbnail 1350

Second, you might look at some wiki page that was a little bit wrong when it was written, or completely outdated the next day. This does not scale, and it certainly does not help when you are dealing with a critical issue. So we needed something better, and we focused on making data discoverable through a proper catalog. But here is a key insight: every entry in the catalog has to be useful.

Thumbnail 1360

Thumbnail 1390

While we want everything in the catalog, we need to make sure there is the perfect amount of friction in the process of getting the data into the catalog. We want datasets to be interesting enough and useful enough so people will be willing to do some non-trivial work to get them in and maintain their quality. We want to ensure that it is as friction-free as possible, and that is a delicate balance in order to keep them up to date. The last part is that we do not want to become centralized owners of the data. This does not scale as well. Instead, we position ourselves as brokers and providers of tooling, and we want to engender a culture where people are motivated to be the owners of their data and their quality.

Thumbnail 1420

Thumbnail 1440

Thumbnail 1450

Having data discoverable is only half of this phase. Once you find the data, you need to be able to access it. When we were interviewing users of the query system, we discovered a repeatable pattern around getting data and downloading it. Let me walk you through an example: PUT lifecycle requests by bucket. Historically, the process would have looked something like this: retrieve all the lifecycle policy requests for all customers, which would generate an enormous amount of logs. Download the first hour, then use regex to extract the data you need, count and sort to get the results you need per bucket, save it to a temporary file, download the next hour, and so forth until you are done. This is really not very effective, and reducing this work has been a primary focus for us to improve time to answer.

Thumbnail 1490

We asked ourselves how we can make engineers crystallize their questions: what they are looking for, and what is interesting in the results. What if the query instead looked like this? SQL. Probably all of you would agree that it is a simple way to specify what you are looking for, what information matters, and how to aggregate the results.
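A hedged sketch of that idea, using invented table and column names rather than S3's real internal schema:

```sql
-- Count PUT lifecycle configuration requests per bucket for one day,
-- instead of downloading logs hour by hour and post-processing them.
SELECT bucket_name,
       COUNT(*) AS lifecycle_put_requests
FROM   ops.request_logs
WHERE  operation = 'PutBucketLifecycleConfiguration'
  AND  event_ts >= TIMESTAMP '2025-11-30 00:00:00'
  AND  event_ts <  TIMESTAMP '2025-12-01 00:00:00'
GROUP  BY bucket_name
ORDER  BY lifecycle_put_requests DESC;
```

The query itself states what to look for, what matters in the result, and how to aggregate it, which is exactly what the pushdown optimizations described below exploit.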

Thumbnail 1530

Transforming Data Layout with Apache Iceberg

SQL is powerful, but writing a SQL query midday under stress or in the middle of the night can be challenging. As Carl mentioned, we catalog the SQL queries and ensure they're alongside the data sets and maintained using code change techniques so that engineers can get used to them and start using them. We wanted to do more optimization. While we are scanning for data, remember Carl needs to scan petabytes of data just for that single API call, and finding a 5 kilobyte element buried under petabytes of data is very challenging. So we wanted to optimize our query engine to scan less data and return only the relevant results at petabyte scale without restructuring our entire data estate of logs right now. We don't want to boil the ocean just yet.

Thumbnail 1580

Thumbnail 1600

So we have SQL that helps us craft the answer to what we are looking for and what information matters. We implemented a pushdown optimization technique. Some of you might be familiar with that. We have a SQL query, so we can inspect the query and realize which columns are interesting for us, such as the lifecycle policy in our case, which values are interesting in the result, which is the bucket, and how to aggregate the result. We implemented that in our search, where we use two kinds of pushdown: predicate and projection.

Thumbnail 1620

Thumbnail 1630

Thumbnail 1640

Thumbnail 1660

The predicate pushdown is for filtering at scan time. As we scan through the logs in a linear fashion, we look for a certain type of operation. If the operation doesn't match, we don't materialize the whole record because we don't need it. If the operation does match, we use the projection pushdown, which essentially tells us which piece of information we actually need from the record, and we only fetch that piece of information. This way we actually get results faster. Keep in mind we're scanning less data, so this optimization dramatically improves performance by reducing both the number of rows we process and the amount of data we extract from each row. It's delivering faster queries with less resource consumption.
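Sticking with the same hypothetical query, here is roughly how the two pushdowns divide the work (illustrative only; the real scanner operates on S3's internal log format):

```sql
SELECT bucket_name,                                   -- projection pushdown: only the
       COUNT(*) AS lifecycle_put_requests             -- columns the query needs are
FROM   ops.request_logs                               -- extracted from matching records
WHERE  operation = 'PutBucketLifecycleConfiguration'  -- predicate pushdown: non-matching
GROUP  BY bucket_name;                                -- records are never materialized
```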

We've started where most people and organizations start, which is optimizing the query engine itself. We have made data more discoverable and more accessible. We implemented the SQL queries and the pushdown optimization, and these techniques gave us significant improvements and valuable results. But we quickly hit the ceiling because you can only optimize so much when the underlying data structure has inherent limitations. Text logs have fundamental constraints that no amount of query cleverness can overcome.

Thumbnail 1730

Instead of working around the limitations of our data format, we decided to transform the data layout itself. It may sound trivial at this point in our conversation, but we've been working and delivering value for a long time with a lot of data, so transforming that data is a major undertaking. By transforming the data itself and changing the physical structure, we want queries to become natural and not heroic. This transformation requires us to choose the right data platform, something that supports columnar storage, efficient partitioning, schema evolution, transactional consistency, and does all that at exabyte scale. We decided that this platform would be Apache Iceberg.

Thumbnail 1770

Why Iceberg? Well, because we know it works at exabyte scale. It has features that give us transactional updates for data consistency, something we didn't have. We are able to time travel, and we have schema and partition evolution, which we'll talk about. As our data volume grows, we need a modern foundation that can handle that kind of growth. Remember we are migrating a 15-year-old system to one that's going to be built for likely the next 15 to 20 years. We believe Apache Iceberg is the new foundation, and this is where the old world will meet the new.

In the past, you used to dump Parquet files in a bucket and call it a table, but then there was no schema evolution without major pain, inconsistent views during updates, and definitely no transactionality. Iceberg solves that by adding a metadata layer on top of the data files that gives us database-like guarantees.
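As a small hedged example of those guarantees, using Spark SQL with the Iceberg extensions (exact syntax varies a bit by query engine, and the table name is the same invented one used above):

```sql
-- Schema evolution: a metadata-only change; existing data files are not rewritten.
ALTER TABLE ops.request_logs
  ADD COLUMNS (deployment_id string);

-- Time travel: read the table as of an earlier snapshot
-- (some engines spell this FOR TIMESTAMP AS OF instead).
SELECT COUNT(*) AS requests_then
FROM   ops.request_logs TIMESTAMP AS OF '2025-11-30 12:00:00';
```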

Thumbnail 1840

Thumbnail 1850

So how does it actually work? Let me give you a little bit of Iceberg 101 in terms of the Iceberg table anatomy. At the bottom of an Iceberg table, there are data files. This can be Parquet, ORC, or Avro—different types of file formats that are supported. These files actually contain the data of the table. But the cleverness in Iceberg comes in the next layer, which is the metadata layer.

Thumbnail 1870

Anytime I ingest data into my table, I create multiple files—Parquet files, Avro files, ORC files. I need something that will aggregate those files and say, "This belongs to a certain commit to my database," and that's a manifest file. A manifest file is a pointer to data files, but it's more than just a list of files. It also has statistics and column information, so it knows more and has more intelligence about what's in the data files.

But then I make more than one ingestion into my table, right? Any time I make an ingestion, I may have more manifest files, so I need something to basically create a table view, and that's the manifest list. The manifest list is an aggregator of manifest files, basically creating a hierarchy going up, and now I have an entry point into my table view. But the real cleverness actually comes from the metadata file, which has a lot of information on the underlying table like the schema and information about versions. This is how we manage snapshots and can point back in time because a snapshot is essentially a pointer to a manifest list, which is an entry point to a certain point-in-time view of the table.

Thumbnail 1960

On top of that, you will obviously always have a catalog, which basically maps table names to metadata files, so we can find an entry into the table. That's Iceberg 101. Now keep in mind that Iceberg is not a server you run. It's a specification and collection of libraries that basically defines the layout structure of data files, metadata files, and manifest files, giving you database-like semantics with schema management, consistency, and transactional data lake storage.
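One concrete way to see that metadata layer is through the metadata tables that engines such as Spark expose for Iceberg tables (again, the table name is illustrative):

```sql
-- Each snapshot is a committed table state pointing at a manifest list.
SELECT committed_at, snapshot_id, operation
FROM   ops.request_logs.snapshots;

-- The manifests behind the current snapshot, with partition summaries
-- that enable pruning before any data file is opened.
SELECT path, added_data_files_count, partition_summaries
FROM   ops.request_logs.manifests;
```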

Thumbnail 2020

Optimizing Performance Through Partitioning, Sorting, and Schema Design

However, as Carl said, we have tens of thousands of microservices, and we can't have those start writing Parquet files into an Iceberg table just like that. It's just not practical. We're going to be streaming logs, and we want column-based statistics; that's where we're aiming. So we need to carefully think about file formats. There are different file formats that are supported in Iceberg, and I'm only mentioning two here: Avro and Parquet. Avro is a row-based structure and is very good for log collection, which sounds like a good fit for us.

But we are actually interested in Parquet-style columnar format because we're interested more in the analytics layer on the user side. I would just like to point out that from query inspections, we usually are interested in something like 10 to maybe 20 fields or columns out of thousands of potential columns in the query. So while Avro seems like the right thing for ingestion, Parquet wins for analytics. The key here would be intelligent partitioning to make it work and also intelligent ingestion that will transform the data efficiently.

Thumbnail 2090

Let's start with partitioning. Partitioning is foundational, especially with the volume of logs in S3. Why? Because it's the difference between scanning petabytes of data versus terabytes or maybe even gigabytes of data. Get it wrong and no query cleverness will save you.

Iceberg lets you solve that by allowing you to evolve partition strategies without actually rewriting data, which is critical when no single approach optimizes all queries. Now, the magic happens through what we call two-level pruning. Manifest file pruning eliminates entire data sets, so I know which files I don't need to touch. Then column statistics, those little data points hiding in the manifest files, help me skip irrelevant files within partitions. This cuts the data scanned by orders of magnitude.
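A hedged sketch of what that looks like in Spark SQL with Iceberg; the table layout and partition choices here are illustrative, not S3's actual ones:

```sql
-- Start from a common pattern: partition the logs by event time.
CREATE TABLE ops.request_logs (
    request_id  string,
    event_ts    timestamp,
    operation   string,
    bucket_name string,
    account_id  string
)
USING iceberg
PARTITIONED BY (hours(event_ts));

-- Later, evolve the spec without rewriting existing data:
-- new writes are also hash-bucketed by account; old files keep their old layout.
ALTER TABLE ops.request_logs
  ADD PARTITION FIELD bucket(32, account_id);
```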

Thumbnail 2180

But there's no perfect partitioning. For example, think about S3: time-based versus bucket- or account-based partitioning pull in different directions in terms of partitioning strategy, so we need to start with some common pattern and evolve from there. The key thing here is that partitioning only gets you to the right files. What happens inside those files is equally important, and that's where sorting helps us and makes a difference, especially in the context of logs when we're looking for specific pieces of data. Let me show you that by using a concrete example.

Thumbnail 2190

Thumbnail 2220

Let's say I have a table and I'm looking for request ID 7. Now in the unsorted case, if you look at my table, every file in my current structure holds the range from 0 to 15. So as I'm looking for request ID 7, I need to scan every single file. Iceberg can't skip a single one of them. But in the sorted case, when every file has a distinct range, when Iceberg looks at the table, it realizes it only needs to touch one file out of all those files because that request ID is within the range of that specific file.

This difference isn't trivial. Partitioning reduces the amount of data you scan, and if you are sorting properly, that's another potential order of magnitude of less data to scan. This directly translates to faster queries and lower processing costs as well. Of course, the magic only works if you sort on the right columns. We didn't guess. We actually had conversations and analyzed thousands of queries to find the sort keys we think are right as we build these tables and this process.
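In Iceberg this can be expressed as a table-level write order, so writers lay data out sorted within each partition and file-level min/max statistics become useful for skipping. A hedged Spark SQL sketch, continuing the illustrative table:

```sql
-- Ask writers to sort within partitions by the key queries filter on most,
-- so each file covers a distinct request_id range and most files can be skipped.
ALTER TABLE ops.request_logs
  WRITE ORDERED BY request_id;
```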

Thumbnail 2300

Sorting and partitioning define how we physically store the data on disk. But we still need to make decisions around the logical schema and how the data is structured and accessed. Let me show you how we thought about the mental model and the schema. Let's define the problem with our log system, and if you're dealing with logs, you may find this very similar. Logs are messy, they're variable, they're ever-changing, and we need to balance structure and flexibility.

Thumbnail 2330

Queries that are already cataloged will continue working, but at the same time we need that flexibility to prepare for the unknown. Flexibility can be for known things, but also for unknowns. That's the challenge we are facing, and we defined certain goals that we want to maintain. One is minutes instead of hours to get answers to queries. That's crucial for us. Second is easy to understand and maintain. The schema has to be easy. Complexity kills adoption. If it's hard to understand, it won't work. So the schema has to be easy to understand, and it needs to be cost effective in terms of storage and processing. Obviously at exabyte scale, storage doesn't come free. Every design choice has a real cost implication that we need to take into account.

Thumbnail 2380

Thumbnail 2390

Thumbnail 2400

Thumbnail 2410

After analyzing our query patterns, we came up with a three-layer schema for our logs, and you may find it useful as well. The first layer is what we call the identity layer, the who, what, where, and when. These are core identifiers that appear in nearly every query that we make. The second layer is measurements and counters. These are numerical data about the request behavior and performance. The third level is the context level, that's everything else, debug information, service context, anything that doesn't fit in metrics or identity. Now, I want to give you more details on those layers so you have a deeper understanding of our S3 schema mental model.

The identity layer is flat and very simple. It holds the most common filter fields: request IDs, timestamps, and the service and user accounts involved. These appear in nearly every WHERE or JOIN clause of the queries that we make, and they're also natural partition candidates and sort keys. That's the identity layer.

The second layer is the metrics layer, and it's actually a nested structure. It organizes thousands of metrics and performance data, volume statistics, and resource utilization. They are all logically grouped and stored in Parquet column format. We only read the specific fields we're interested in, leaving the rest behind on disk. The third layer is the context layer, which holds everything else that we want to have on the table but doesn't break existing queries around the first two layers. It helps us evolve and add more context into the logs as we need and as we progress.
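Fleshing out the earlier illustrative table, the three layers might map onto a schema like this (again a sketch with invented names, not S3's real schema):

```sql
CREATE TABLE ops.request_logs (
    -- Identity layer: flat fields that dominate WHERE/JOIN clauses,
    -- and the natural partition and sort candidates.
    request_id  string,
    event_ts    timestamp,
    operation   string,
    bucket_name string,
    account_id  string,
    service     string,

    -- Metrics layer: nested numeric measurements, read column by column
    -- so untouched fields stay on disk.
    metrics     struct<
        time_to_first_byte_ms: bigint,
        total_time_ms:         bigint,
        bytes_sent:            bigint,
        cache_hit:             int,
        fetch_attempts:        int
    >,

    -- Context layer: everything else; can grow without breaking existing queries.
    context     map<string, string>
)
USING iceberg
PARTITIONED BY (hours(event_ts));
```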

Thumbnail 2500

The result is structure where we need it and flexibility where we need to adapt. This is what it might look like, and if you know SQL, this is a fairly simple query. I'm going to get some data for a certain day, group it by operation, aggregate the requests, and order them by total requests. At exabyte scale, getting here is anything but trivial. We're querying billions of log entries, accessing the identity and metrics layers, and doing aggregations, getting results in minutes versus hours.
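Against the illustrative schema above, the query being described is roughly:

```sql
-- One day of logs, grouped by operation, ordered by request volume.
SELECT operation,
       COUNT(*)                           AS total_requests,
       AVG(metrics.time_to_first_byte_ms) AS avg_ttfb_ms
FROM   ops.request_logs
WHERE  event_ts >= TIMESTAMP '2025-11-30 00:00:00'
  AND  event_ts <  TIMESTAMP '2025-12-01 00:00:00'
GROUP  BY operation
ORDER  BY total_requests DESC;
```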

Thumbnail 2560

Building the Bridge: Migration Strategy from Text Logs to Parquet

With this architecture, the Iceberg layout, the manifest files, the partitioning, and the sorting basically make SQL queries skip irrelevant data automatically. The data structure itself takes care of the optimization, and we can get answers to queries in minutes instead of hours. We now have a data construct that makes data discoverable and accessible, and queries are no longer heroic. All this data layout is optimized and tuned for S3 scale. But how do we transform existing text logs into structured Parquet files? Do we change everything that we have, or do we just ingest new data into Parquet files?

Thumbnail 2610

This isn't just a technical question. It's a strategic one, because how do we introduce it at S3 scale without creating any disruption? As we said, we can't have servers write Parquet files directly into Iceberg. It's just not practical with the number of microservices we have. We need to meet the systems and the users where they are. As I said about the tenets earlier, as we move forward, the goal is to meet them where they are. We need something in the middle that's going to bridge the legacy text logging and the Parquet. It will parse, it will structure the data, it will compress the data, and create those really nice structured Parquet files that we have all come to love and appreciate.

Thumbnail 2630

Keep in mind that S3 is so foundational and operates at such a unique scale. We built our own transcoder because it was more efficient for us than adopting existing batching tools or techniques. The transcoder works with our partition logic during the conversion itself. It's not just changing formats. Our compression team did amazing work to optimize this transcoder. We're able to compress one hour's worth of logs in just three minutes, and we constantly work to improve that. But technology itself here is not enough.

Thumbnail 2690

Now that we have data layout and a tool that can convert text logs into formats, we need a migration strategy. We need to decide how we're going to introduce this change into the S3 logging system. We've built a tool that can convert to Parquet and push it down to Iceberg. We have this system that's been running for over fifteen years, always on. We want to introduce this change, and this is where engineering discipline meets operational reality.

Thumbnail 2710

The first principle is to meet the log agents where they are. We provide them a legacy-compatible interface. From the application perspective, nothing changes. The same format, same API. But underneath, the transcoder does its magic, taking those text logs and converting them to Parquet files and placing them into Iceberg. This is a critical part. We never compromise on what's already working, and we always maintain a rollback plan.

Thumbnail 2750

As we gradually introduce change, we deliberately send the data both to the new transcoder and to the legacy system. Yes, there's temporary duplication, and that's a trade-off we consciously took so that we can always revert to a known-working state. From our perspective, this isn't wasteful. This is insurance. Our rollout strategy is methodical: server by server, microservice by microservice, availability zone by availability zone, region by region. Why? Because we want to monitor the behavior closely and query performance at each stage. We want to make sure the data is consistent; storage costs, system load—everything matters. We want to catch those one-in-a-billion edge cases early on, the one-in-a-billion that Carl was dealing with years ago.

Thumbnail 2830

Here's the beauty of this approach. Once we've completed the implementation and verified that it's actually working, we can remove the duplication because we no longer need it, while maintaining the safety and reliability that S3 demands. We're getting close to the end of the presentation, and we now want to actually have the log agents themselves write directly. Well, not really directly, but we want to change the log agents themselves at some point. Here's a key architectural decision that we need to make again: we're not going to have them write directly to Parquet or Iceberg. This is simply not practical. We need to have a way to use some collectors or aggregators.

Thumbnail 2860

The log agents will essentially send the files to a certain collection point or collector, and we're actually thinking that these would be Avro files because those are very well tuned for logs. But then those aggregators or collectors will do a different kind of transcoding, converting them into large, well-structured, beautiful Parquet files that are then pushed into the Iceberg table. They are ready for analytic queries, and we can actually use them. We've talked about the process of making data discoverable and accessible through proper cataloging and SQL. We made it more accessible with our existing legacy heritage system by doing the pushdown optimization that helped us. But eventually, we realized we needed to transform the data layout itself and transition it to Apache Iceberg. At the end of it comes the question: what have we learned through the process?
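A hedged sketch of that collector step, using Spark SQL purely as an illustration (S3's collectors are internal; the path, package requirement, and names here are assumptions):

```sql
-- Stage the row-oriented Avro batches shipped by the log agents
-- (requires the spark-avro package; the path is hypothetical).
CREATE TEMPORARY VIEW incoming_log_batches
USING avro
OPTIONS (path 's3://example-staging/logs/2025/11/30/14/');

-- Rewrite them as large Parquet files inside the Iceberg table; the table's
-- partition spec and configured write order are applied on the way in.
INSERT INTO ops.request_logs
SELECT request_id, event_ts, operation, bucket_name, account_id, service,
       metrics, context
FROM   incoming_log_batches;
```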

Key Lessons from S3's Data Lake Modernization Journey

At this point, I would like to invite Carl for the takeaways and the lessons learned. Thank you so much. We have talked through an awful lot of technical detail from discovering what data we have to the post-processing of our custom transcoder. We covered Iceberg tables, Parquet optimizations, sorting strategies, and even parallel migration. Frankly, I could easily talk for another hour about some of the technical outcomes that we achieved. I would love to tell you about the number of queries we've run or the bytes that it's processed, but frankly, you don't care. Those are just inputs into the actual outcomes that we were trying to achieve.

Thumbnail 2980

Thumbnail 2990

Thumbnail 3000

Through this process, we have been able to return thousands of engineering hours back to building and operating our services. Additionally, our engineers can now run those arbitrary queries over very fresh data, oftentimes five minutes old or less. And finally, our product managers and our applied scientists can access vast amounts of historical data orders of magnitude faster than before.

Thumbnail 3020

This enables them to answer questions about placement policy updates and changes or feature usage without an engineer in the loop. When giving a talk like this, it is actually very hard to not make it look like a linear path where we started with everything being terrible, we made a bunch of great decisions, and then ended up with everything being golden and beautiful. In truth, that journey is much more like a random walk at times than a linear path. However, I do think that there were a handful of things that kept us moving in the right direction that are worth highlighting.

Thumbnail 3060

For S3, our logs represented one of many internal data sets for which retrieving data was often just too hard to be worth it. This was despite recognizing the immense value that they contain. I can easily say that for the overwhelming majority of you, I believe that CloudWatch Logs is the right place to put your logs. Nearly every feature that we have been building in our system exists there out of the box.

However, I also bet that if you look around your organizations, you have one of these systems where there is something of value inside of it and it is just too hard to get to. Maybe that is because it is stuck behind a custom format or behind a custom interface. That is where you are going to find the biggest opportunity to apply all of these principles. Once you have identified one of these high value but high barrier data sets or systems, you should work backwards from what your users are doing, but importantly, what they wish they could do or what they never even thought to ask for.

Thumbnail 3110

For our users, most of their time was spent identifying what data existed, figuring out how to query it, and finally surmounting those scale issues associated with accessing it. Ran talked us through building out our catalog and, importantly, the mental model for applying a schema to semi-structured or unstructured data. We also discussed some of the technical improvements that you can make in your connector code to improve performance while your data remains in its natural form. I say natural form intentionally because at many of the stages in this journey, we had not required any changes from the systems generating the data at all.

Thumbnail 3160

They continued to work exactly as they had. However, because of those connectors and because of meeting the systems where they were, we were able to prove out the value of their data sets to the teams themselves and to other consumers in the company. Once that happened, it meant that the owners of those systems became intrinsically motivated to make changes to them. In our case, we changed how they emitted their logs so that we could land the data in its optimized physical form.

Thumbnail 3230

We walked through that optimized form, in particular our usage of Iceberg as a table format, and called out the partitioning and sorting as key elements of that performance strategy. I just want to say thank you very much for attending our session today. If you found portions of this valuable and you would like to dive deeper into Iceberg and some of the ways that we advise or work with customers to build scalable data lakes in S3, I recommend you take a photo and scan some of the QR codes. These are recommended sessions that are happening throughout the week to give you additional information on how to build scalable data lakes on Apache Iceberg on S3. Thank you again and have a great week.


; This article is entirely auto-generated using Amazon Bedrock.
