🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Indeed's migration to Amazon S3 Tables (STG210)
In this video, Brett Seib introduces Amazon S3 Tables, a purpose-built service for storing tabular data that simplifies security, improves performance through automated compaction, and reduces Iceberg maintenance complexity. Venkatesh Mandalapa from Indeed then shares how they're migrating their 87-petabyte data lake (68PB Hive, 19PB Iceberg) with 15,000+ tables to S3 Tables. Indeed expects 10% annual AWS cost savings, reduced onboarding from one day to minutes, and four developer months returned to product work. They're using a phased dual-write pipeline approach to incrementally migrate 50 petabytes, starting with 2.5PB canary tables. Key learnings include tight Lake Formation integration, query performance considerations, and adapting to different logging mechanisms with CloudWatch and CloudTrail.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introducing Amazon S3 Tables: Simplifying Security, Performance, and Maintenance for Iceberg at Scale
Hello everyone. I'm Brett Seib, and I'll be talking to you about Amazon S3 Tables, giving you an introduction to what S3 Tables is at a high level, the value it brings customers, and then I'll turn it over to Venkatesh to talk about how Indeed is leveraging S3 Tables.
Amazon S3 Tables was launched a little over a year ago at the last re:Invent. We set out to solve three primary problems for our customers: simplifying security, improving performance, and reducing the complexity of maintaining Iceberg at scale. I'll talk in depth about how S3 Tables does that for customers in the next few slides.
S3 Tables was introduced as a first-class AWS resource and is purpose-built for storing tabular data at scale. So how do we improve security with respect to tabular data in a data lake? If you think about objects being written to a data lake, you would have data and metadata files being written to an S3 bucket with various objects coming in. You might have multiple tables in that S3 bucket as well. Trying to write security policies against individual objects doesn't scale well, and customers needed a better solution.
With S3 Tables, customers can write security policies against the table itself, greatly simplifying the policies that customers had to write. The next capability is compaction. When we think about Iceberg data in a data lake, individual objects are being written to that data lake over time, and a customer may need to query against those objects. That query may need to read multiple objects at one point in time, which results in multiple API calls to the service and consequently increases latency.
In the past, customers had to write custom Spark jobs or run custom compute resources to compact those objects. Compaction simplifies the data lake by reducing the number of objects you have to query to return a result, thus improving performance. S3 Tables does that automatically for the customer, so you don't have to create or spin up custom compute resources to do this work. The service manages that undifferentiated heavy lifting on your behalf.
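To make the contrast concrete, here is a minimal sketch of the kind of custom compaction job a team previously had to schedule itself, using Apache Iceberg's documented rewrite_data_files Spark procedure. The catalog name, table name, and target file size are placeholders, not anything specific to S3 Tables or Indeed.

```python
# Minimal sketch of a self-managed compaction job using Iceberg's
# rewrite_data_files procedure. Catalog ("lake"), table, and target file
# size are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("manual-iceberg-compaction")
    # Assumes an Iceberg catalog named "lake" is already configured via
    # spark.sql.catalog.lake.* settings.
    .getOrCreate()
)

# Rewrite many small data files into fewer, larger ones (512 MB target).
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```

With S3 Tables, this job simply goes away; the service runs compaction in the background.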
In a similar fashion, you have table maintenance. As objects are updated or deleted in your data lake, unreferenced files accumulate. Snapshots do powerful things like enabling time travel and rollback, but those unreferenced files and old snapshots waste storage space, and wasted storage increases your cost. So once again, S3 Tables manages this cleanup and maintenance on your behalf, so you no longer have to run those operations.
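As a hedged sketch of what the service-managed side looks like, the snippet below reads a table's maintenance configuration through the boto3 s3tables client to confirm that compaction and snapshot management are enabled. The table bucket ARN, namespace, and table name are placeholders, and the exact response shape may differ from what the comments suggest.

```python
# Hedged sketch: inspect the maintenance settings that S3 Tables manages
# automatically for a table. All identifiers below are placeholders.
import boto3

s3tables = boto3.client("s3tables")

config = s3tables.get_table_maintenance_configuration(
    tableBucketARN="arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",
    namespace="analytics",
    name="events",
)

# Expect entries such as icebergCompaction and icebergSnapshotManagement,
# each with a status and service-managed settings.
for maintenance_type, details in config.get("configuration", {}).items():
    print(maintenance_type, details)
```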
That's S3 Tables at a high level. I wanted to give a quick primer on the service and the value it adds to customers, and now I'll hand it over to Venkatesh, who will talk in more detail about how Indeed is leveraging S3 Tables to help people get jobs.
Indeed's Data Lake: Managing 87 Petabytes and the Journey Toward Iceberg
Good evening, everyone. I'm Venkatesh Mandalapa, the tech lead for the Data Lake team at Indeed. A quick primer on who we are: we are the number one job site in the world. We have 635 million job seeker profiles and 33.3 million employers in over 60 countries, and we have 11,000 employees. Our mission is to help people get jobs, and the engine that drives this mission is data. At a global scale, we generate a lot of data that we need to process and store, and my team, the Data Lake team at Indeed, owns the central data store that stores and processes all this information.
I'm going to give a simplified view of the Indeed Data Lake. We have AWS Glue Data Catalog, which is our central catalog for the entire Indeed organization. This sits in one AWS account and is exposed to all Indeed customers, and the data itself is stored in S3 standard storage. It's a mix of Hive data in ORC format and Iceberg data in Parquet format: about 68 petabytes in Hive and about 19 petabytes in Iceberg at the moment. We have a lot of engines that access the data, Athena, Snowflake, Trino, Spark, and a few more, but these are the primary engines that our consumers internally and externally use to query information from the data lake.
We have some statistics here: 87 petabytes total for the data lake at Indeed, ingesting 550 terabytes per day into the data lake, 15,000 plus tables in different schemas, and 6 ingestion patterns. This is how we get data into the data lake. We provide multiple mechanisms for our customers internally to register the data, load the data, and make it available for all our consumers. It's not just that we have a lot of data, but it's also being used by a lot of people inside Indeed and outside, and we get about 170,000 plus queries a day.
We have a mix of analytical, ML, and operational data, and all kinds of modes of data: streaming, batch, ETL, and more. We had this legacy architecture with legacy formats, and we wanted to simplify, modernize, and make it faster and more efficient, both to ingest the data and to query the data. So we started with a goal: we wanted everything in Iceberg. Iceberg is the future and holds a lot of promise, with very interesting features including schema evolution and maintenance capabilities.
Why Indeed Chose S3 Tables: Solving Maintenance Challenges and Achieving 10% Cost Savings
We made it a goal to move our entire data lake to Iceberg. With Iceberg, we initially decided to use standard S3 storage with general purpose buckets. But then Amazon released the new S3 Tables, and we looked at it and thought it was very interesting. The product matured over 6 months, and we did a cost-benefit analysis and found a compelling case for Indeed to switch 100 percent of the data lake to Amazon S3 Tables.
We also found that the 6 ingestion patterns we use to get data into the data lake can be reduced to 1, simplifying my team's code base and making the data lake much more streamlined and maintainable. So what were the challenges we faced with Iceberg in general-purpose buckets?
We had to run our own maintenance operations for the Iceberg tables in the data lake, and that was costing us a lot of dev hours: 2,000 plus dev hours per year managing the Iceberg tables, compaction, deleting orphan files, and all the other maintenance operations that come with Iceberg. We rolled our own maintenance jobs running on Kubernetes to do all this, and we found it increasingly challenging to keep them running.
We also have the 6 ingestion patterns that I talked about. On average, it takes about 1 day per customer team to actually get data into the data lake with standard Iceberg in general-purpose buckets. S3 rate limiting was also a problem. Some tables are queried heavily while others are touched only once a day or once a quarter, and these uneven access patterns triggered S3 rate limits on the general purpose buckets, which caused severe incidents where data was not available.
We also do per-object tagging for access control: we have about 600 million objects in the data lake, and each object carries tags that are evaluated against resource and identity policies to enforce access control. Managing those object tags, changing them, or even assessing the security posture they implied was a nightmare for us. Amazon S3 Tables provides solutions for all of these problems that we were facing.
It simplified data governance and the access-control problem we had. With 600 million objects, reclassifying data meant spending weeks updating object tags. Say a table that is accessible to all of Indeed today should become accessible only to Glassdoor, which is a sister company.
That would have taken us weeks before. With S3 Tables, access is a resource policy on the table itself, so we can simply run one operation against that table and update the policy.
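As a hedged illustration of that single operation, the sketch below replaces the resource policy on one table via the boto3 s3tables client so that only a specific role can read it. The ARNs, account IDs, principal, and the exact s3tables:* actions are illustrative assumptions, not Indeed's real policy.

```python
# Hedged sketch: reclassify access to one table with a single resource-policy
# update instead of retagging millions of objects. All ARNs, account IDs, and
# actions below are illustrative placeholders.
import json

import boto3

s3tables = boto3.client("s3tables")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SisterCompanyReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:role/glassdoor-analytics"},
            "Action": ["s3tables:GetTableData", "s3tables:GetTableMetadataLocation"],
            "Resource": "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket/table/*",
        }
    ],
}

s3tables.put_table_policy(
    tableBucketARN="arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket",
    namespace="analytics",
    name="job_postings",
    resourcePolicy=json.dumps(policy),
)
```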
We evaluated overall storage, maintenance, compaction, and ingestion. When we add everything up, we are seeing about 10% in annual AWS cost savings compared to Iceberg tables in general purpose buckets. This has been a huge incentive for us to push everything into Amazon S3 Tables. I recently heard that the cost of maintenance operations has been dramatically reduced by the Amazon S3 Tables team, which makes the whole move even more attractive for us.
We talked about how the onboarding experience used to take a day; with S3 Tables we can do it in minutes. Our producing teams can write to S3 table buckets with an API call and start ingesting data into the data lake within minutes. This is huge: it lets our teams be fast, nimble, and cost efficient.
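A minimal sketch of that minutes-long onboarding path, assuming the boto3 s3tables client: create a table bucket, a namespace, and an Iceberg table, then point the ingestion job at it. The bucket, namespace, and table names are placeholders.

```python
# Hedged sketch of onboarding a new dataset in a few API calls.
# Names below are placeholders.
import boto3

s3tables = boto3.client("s3tables")

bucket = s3tables.create_table_bucket(name="team-ingest-bucket")
bucket_arn = bucket["arn"]

s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["raw"])

s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="raw",
    name="clickstream_events",
    format="ICEBERG",
)
```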
By reducing maintenance operations and letting Amazon handle them, we also found that we can return four developer months to the team. We can spend that time on product development and building things for our customers, which is very exciting for us.
Migration Strategy and Lessons Learned: Indeed's Phased Approach to Moving 50 Petabytes
We have 87 petabytes, and a big bang approach is not going to work. What we did is look at the server access logs and the query logs, identify the different workloads, and categorize them into cohorts, batches, and phases. We are doing this incremental migration with a dual-write pipeline, which is a standard approach people use during migrations.
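The categorization step can be approximated with something like the sketch below, which assumes the access and query logs have already been parsed into a pandas DataFrame with one row per query; the column names and cohort thresholds are hypothetical.

```python
# Illustrative sketch: bucket tables into migration cohorts by dominant
# engine and query volume. Input file and column names are hypothetical.
import pandas as pd

logs = pd.read_parquet("parsed_query_logs.parquet")  # columns: table, engine, bytes_scanned

per_table = (
    logs.groupby(["table", "engine"])
    .agg(queries=("table", "size"), bytes_scanned=("bytes_scanned", "sum"))
    .reset_index()
)

# Cohorts combine the engine with a coarse query-volume tier so batches can
# be sized in terabyte or petabyte increments.
volume_tier = pd.cut(
    per_table["queries"],
    bins=[0, 100, 10_000, float("inf")],
    labels=["cold", "warm", "hot"],
)
per_table["cohort"] = per_table["engine"] + "-" + volume_tier.astype(str)

print(per_table.sort_values("bytes_scanned", ascending=False).head())
```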
You still have your standard pipeline that writes to the existing production objects, and then you set up a separate pipeline, which is what we are trying to do with the incremental migration. We are going to develop some tools and APIs to automate the ingestion of data directly into Amazon S3 Tables, and this is going to help us remove one step during the migration process.
As a result of all the assessments we did during migration planning, the plan looks like this. We have the entire data lake, and we are going to split the migration into multiple phases. The first phase will move all the Hive data to Amazon S3 Tables; after that, we will move the Iceberg data in S3 standard storage to Amazon S3 Tables and update the data producers.
Within each phase, we will break the work into cohorts based on different workloads like Spark, Trino, and Athena. We can target our migration along these cohorts and further break them down into batches at terabyte or petabyte scale, so that we understand the challenges and problems we will face as we work through this incremental approach.
I am going to talk about the dual pipeline strategy. This is our current pipeline, very simplified. We have producers that are producing data, we have an ingestion pipeline with multiple patterns, and we store this in the data lake in standard S3 storage and the AWS default Glue catalog.
With a dual write, we split the pipeline into two paths, one writing to standard S3 storage and one writing to Amazon S3 Tables. Amazon S3 Tables has its own catalog, so we end up with two catalogs containing the same objects. Then we do the cutover, which is simply linking the S3 Tables catalog to the default Glue catalog.
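A hedged sketch of the dual-write step, assuming two Iceberg catalogs are configured in Spark, one over the existing standard-storage tables and one over S3 Tables; the catalog names, staging path, and table names are placeholders.

```python
# Hedged sketch: write each ingested batch to both backends until cutover,
# so either side can serve consumers and the migration stays revertible.
# Catalog names ("legacy", "s3tables") and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-write-pipeline").getOrCreate()

batch = spark.read.parquet("s3://staging/ingest/batch_date=2025-12-01/")

batch.writeTo("legacy.analytics.events").append()
batch.writeTo("s3tables.analytics.events").append()
```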
All the objects in the S3 Tables catalog will show up in the default Glue catalog, and consumers accessing the default catalog will actually hit S3 Tables in the backend. Then we do a final cleanup where we remove the old S3 standard data and the pipeline that writes to S3 standard storage. This is pretty typical of most migrations.
The interesting thing here is that the catalogs are linked between the two systems. The consumers always go to the default catalog, so they are not impacted and will not notice any difference in their queries or applications. It all happens seamlessly because everyone keeps hitting the AWS default catalog.
We're only changing the backend storage systems here. We have learned several things so far from migrating some tables to Amazon S3 Tables. We found that Amazon S3 Tables is very tightly integrated with Lake Formation, which was unexpected for us. We don't use Lake Formation at Indeed; we manage permissions quite differently. Figuring out which Lake Formation identity permissions and resource policies we had to update became quite challenging. So just a heads up: Amazon S3 Tables is very tightly coupled to Lake Formation, even if you're not using it today. We made it work, but we struggled quite a bit and needed a lot of help from AWS.
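For reference, the Lake Formation side of that work boils down to grants like the hedged sketch below, which gives a role read access to a table surfaced through the Glue Data Catalog. The role ARN, catalog ID, database, and table names are placeholders, and the catalog layout for S3 Tables in your account may differ.

```python
# Hedged sketch: grant a principal read access via Lake Formation.
# All identifiers are placeholders; S3 Tables catalogs may appear under a
# federated catalog ID in your account.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analytics-readers"},
    Resource={
        "Table": {
            "CatalogId": "111122223333",
            "DatabaseName": "analytics",
            "Name": "events",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```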
Query performance is going to be quite different, because we have a lot of data in Hive and ORC format and Iceberg has a different query planning mechanism. Make sure that all of your workloads run against Amazon S3 Tables and perform comparably. During the migration we're also changing the partition strategy quite a bit, and we want to make sure that queries written against the old Hive partitioning still work against Iceberg, and run faster.
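Because Iceberg supports partition evolution, a partition-strategy change can be expressed as DDL rather than a full rewrite; the hedged sketch below shows the general shape, with placeholder catalog, table, and column names.

```python
# Hedged sketch: evolve an Iceberg table from an old string-date partition
# column to a transform-based spec. Names are placeholders; existing data
# files keep their original spec and are not rewritten.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-evolution").getOrCreate()

spark.sql("ALTER TABLE s3tables.analytics.events DROP PARTITION FIELD date_str")
spark.sql("ALTER TABLE s3tables.analytics.events ADD PARTITION FIELD days(event_ts)")
```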
Amazon S3 Tables has some limits: 10 table buckets per Region per account and 10,000 tables per bucket, for a total of 100,000 tables. We only have 15,000 tables so far, but with a linear projection we could see that growing to 20,000, 25,000, and beyond in the coming years. We want to use as few buckets as possible, simply for ease of maintenance. Backup and restore is a bit of a gap right now: we have to roll our own backup and restore strategy for Amazon S3 Tables, relying on the unreferenced-file features and the policies around them.
Concurrent write limits are more of an Iceberg problem than an S3 Tables problem: with too many concurrent writers on an Iceberg table, commits start to break, so we have to manage how many writers each table has, which is something you may have to face as well. We also have issues with server access logs. In the past, S3 standard buckets came with a great server access logging feature where every operation is logged, showing AWS roles, Regions, and so on. S3 Tables has a different logging mechanism: it uses CloudWatch and CloudTrail. So if you're relying on S3 server access logs, you might need to update your pipelines to pull from CloudWatch and CloudTrail and export the data into the format you want. It's a small change in how access logging works.
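As a rough starting point for that pipeline change, the sketch below pulls recent S3 Tables API activity from CloudTrail with boto3. Note the caveats: lookup_events returns management events only, data-plane access requires a trail or CloudTrail Lake, and the "s3tables.amazonaws.com" event source string is an assumption to verify against your own logs.

```python
# Hedged sketch: list recent S3 Tables management events from CloudTrail.
# The event source value is an assumption; verify it in your own account.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "s3tables.amazonaws.com"}
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
)

for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))
```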
We use Snowflake quite a bit at Indeed, and we want to expose every asset in our data lake in Snowflake. Amazon S3 Tables provides two Iceberg REST catalog endpoints: the Glue Iceberg REST catalog and the S3 Tables Iceberg REST catalog. You can register one or both in Snowflake and expose everything we have in Amazon S3 Tables in Snowflake as well, which is huge for us. Our customers really like Snowflake.
This is the concluding slide showing how the migration is progressing, and we are pretty happy with where we are. We have about 50 petabytes scheduled to migrate to Amazon S3 Tables, and we're running canary tables right now, about 2.5 petabytes, during this Thanksgiving and December 1st week. We're always prepared to revert the dual pipeline: if there are problems at any point, we are ready to move the table access, the table objects, everything back to the old pipeline, so no consumers, queries, or producers are impacted by what we're doing on the platform team.
The dual pipeline is obviously going to create higher costs, and that's okay for us, because what matters more is that our consumers are not impacted. Not just internal consumers: people globally use this data, including policymakers and media, so we want to make sure no one is affected by what we're doing here. Finally, we are on track for a unified, modern, and cost-efficient data lake, and we are very happy that Amazon S3 Tables meets all our needs. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.