🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS
In this video, AWS experts and Medidata's Principal Data Architect discuss best practices for building Apache Iceberg-based lakehouse architectures on AWS. The session covers how Iceberg solves data lake challenges through ACID transactions, time travel, schema evolution, and efficient row-level updates. Key AWS services are explored, including AWS Glue Data Catalog as an Iceberg REST catalog, S3 Table Buckets for automated storage optimization, and Lake Formation for governance. Three core ingestion patterns are detailed: batch ETL, Change Data Capture, and high-concurrency streaming. Medidata shares their transformation journey, achieving reduced latency from days to minutes, eliminating data silos, and enabling real-time analytics. The session concludes with consumption patterns using Amazon Athena, Redshift, and EMR, emphasizing query federation, catalog federation, and materialized views for optimized performance across multi-cloud environments.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Best Practices for Apache Iceberg-Based Lakehouse Architectures on AWS
Good afternoon everyone. Welcome to ANT343, Best Practices for Building Apache Iceberg-Based Lakehouse Architectures on AWS. I'm Purvaja Narayanaswamy, Senior Engineering Manager for AWS Glue Data Catalog and AWS Lake Formation. I'm joined by Mike Araujo from Medidata, Principal Data Architect, and Srikanth Sopirala, Principal Solutions Architect from AWS Data and AI Space.
Today we have a packed agenda. We're going to start with the data lake crisis, what brought us here, the journey of Apache Iceberg in the lakehouse, talk about AWS's stack on how it's coming together as an integrated framework to work on Iceberg-powered lakehouses, and I'm going to focus on getting data in, the production-ready architecture patterns that you can build on Apache Iceberg. Mike will cover the voice of the customer. You will hear Medidata's transformation journey on Iceberg, and Srikanth will focus on how open standards truly enable interoperability and focus on getting data out, the multi-compute integrations and optimizations in Iceberg-powered lakehouses. The key is you will walk away with some key takeaways that you can implement in production today.
The Data Lake Crisis and How Apache Iceberg Solved It
Data lake crisis. Data lakes were good on paper, right? You have flexible storage, disaggregated compute and storage, but in production it was chaos. Data corruption was common. There was no way for you to go back to a point-in-time snapshot in case a mishap happens. There was no way to roll back. Your queries were very slow. You're probably doing a terabyte scan to answer a megabyte question. And forget about schema evolution. You're now talking about rewriting petabytes of data, and it was such a hassle.
That's really where the Apache Iceberg table format shone. You have this smart metadata layer: table-level snapshots with detailed manifests, schema definitions, and partition specifications, all kept separate from your data files. Each table-level snapshot points to a manifest list, which tracks which manifests changed, and the manifest files themselves contain all the file-level statistics: row counts, min-max values, null counts, et cetera. This layout gives you blazing fast query performance.
And you get five core benefits right off the bat. First and foremost, you get ACID guarantees. Multiple writers can simultaneously update the table leveraging optimistic concurrency control without the hassle of any distributed locks and such, so you get that, and in production you don't have to worry about data corruption. Then the second benefit you get is because every update creates this nice stable table-level snapshot. These snapshots are isolated and immutable, so it acts as your checkpoint so you can go back to a point-in-time snapshot and do a rollback and do time travel style queries, which is perfect for your debuggability, for your checkpointing use cases and such.
And the third benefit you get is schema evolution is very elegant. Iceberg tracks columns by IDs, so whether you want to add, update, or rename a column, it's just a metadata update, unlike traditional systems where you have to rewrite and repartition and such. The fourth benefit you get is query performance. You don't have to do expensive scans. Now query engines are probably going to have to do a few kilobytes of your manifest scanning and know exactly which data files to read because of this metadata layout that I spoke about.
And Iceberg also supports elegant row-level updates. Unlike traditional table layouts where you would have to rewrite the entire partition, Iceberg gives you ways to do it more efficiently. In Iceberg V2 there is support for equality deletes, which track delete files by row key values, and positional deletes, which track them by row position, and engines apply these delete files along with your data files.
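As a rough illustration of the time travel, schema evolution, and rollback capabilities described above, here is a minimal PySpark sketch. It assumes a Spark session already wired to an Iceberg catalog named `glue` with the Iceberg SQL extensions enabled (a later sketch shows one way to configure that); the table, column names, and snapshot ID are hypothetical.

```python
# Minimal sketch of Iceberg schema evolution, time travel, and rollback in Spark SQL.
# Assumes an Iceberg catalog named "glue" is already configured; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-basics").getOrCreate()

# Schema evolution is a metadata-only change: no data rewrite.
spark.sql("ALTER TABLE glue.sales.orders ADD COLUMN discount_pct DOUBLE")
spark.sql("ALTER TABLE glue.sales.orders RENAME COLUMN discount_pct TO discount")

# Time travel: query the table as of an earlier timestamp or snapshot.
spark.sql("SELECT * FROM glue.sales.orders TIMESTAMP AS OF '2025-11-01 00:00:00'").show()
spark.sql("SELECT * FROM glue.sales.orders VERSION AS OF 4348509106093309000").show()

# Roll back to a known-good snapshot after a bad write (Iceberg Spark procedure).
spark.sql("CALL glue.system.rollback_to_snapshot('sales.orders', 4348509106093309000)")
```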
Apache Iceberg V3: Closing the Gap with Variant Types, Deletion Vectors, and Row Lineage
Now with Iceberg V3, the gap is closed even further. First and foremost, the variant type. JSON is everywhere, whether it's your API responses, IoT payloads, or event types. The good thing now is engines natively support the variant type; otherwise you would have to do some kind of ugly flattening where most of your column values are probably null, or transform it into a string and give up on query performance. The good thing here is that under the hood you still get the optimized columnar benefit for your semi-structured data, and you still get that elegant query to render it as JSON.
I spoke about the row-level updates where delete files are tracked as equality or positional deletes.
Equality or positional deletes can get a little more expensive if you have a table with a million rows and it's a CDC-style, delete-heavy scenario. You're now looking at an expensive operation in terms of merging those delete files with data files. That's where deletion vectors are a game changer, because they just maintain a bitmap to track whether a row is deleted or not. So you're literally talking about loading this bitmap into memory and operating at memory speeds when you apply it against the data files.
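A hedged sketch of the table properties involved: these settings opt a hypothetical table into merge-on-read row-level operations and bump the table spec to V3, where deletion vectors become available. Property names follow Iceberg's table-property conventions, but engine support for V3 features is still rolling out, so verify against your engine version before relying on this.

```python
# Hedged sketch: merge-on-read row-level operations plus a V3 format upgrade
# on a hypothetical table. Verify V3/deletion-vector support for your engine.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

spark.sql("""
  ALTER TABLE glue.sales.orders SET TBLPROPERTIES (
    'format-version' = '3',
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
  )
""")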
V3 also supports row lineage. Typically lineage is tracked at a table or a partition level. Now with Iceberg V3, you get the provenance record for every row. It tracks your ingestion timestamp, source identifier, your CDC database transaction identifier, so you have that complete auditability information. So in case bad data got ingested, you have the exact lineage to track where it came from.
The fourth benefit with V3 is support for default values. This is a huge operational pain alleviator. Think of a scenario where you have your gold aggregator table and you want to add a column value. Now you're looking at either historical backfill of all your old records, or your results are inconsistent where you probably throw up null for your old records and the new values for the new ones. Here with V3, you have a way to specify a default value at the schema level so engines can apply those for your old records and your results are elegant and consistent.
AWS's Integrated Stack for Iceberg-Powered Lakehouses: From Ingestion to Agentic Capabilities
Great. So now we understand why Iceberg is taking the front seat in building these modern lakehouses. Let's look at AWS's stack briefly. At the bottom layer, you have all these diverse data sources, whether they're data warehouses, on-premises databases, streaming sources, or external Iceberg-compatible sources. You're bringing in all that through our portfolio of ingestion services. It could be batching services like Glue or EMR, or it could be your streaming services like Kafka, Kinesis, or Firehose, and you're bringing it all as an Iceberg table format to your storage tier.
It could be your flexible general-purpose S3 buckets, or it could be S3 Tables, which is the fully managed Iceberg capability, or it could be your warehouse storage, Redshift managed storage, and you're still bringing it all in as a consistent table format, Iceberg. The moment all of this lands in the storage tier, the catalog automatically discovers all of the technical assets coming in. The Glue Data Catalog acts as your technical metastore, with Lake Formation providing consistent enterprise-scale governance and SageMaker Catalog handling your business metadata, lineage, and auditability.
In the processing layer, you have your choice of AWS first-party compute or your choice of compute which are Iceberg-compatible, so all that support natively Iceberg table format. At the application stack, you have Amazon Bedrock for generative AI and ML applications, or QuickSight for your BI use cases, or SageMaker Unified Studio for your analytic use cases. So if you look at the stack overall, it's an integrated ecosystem with zero ETL eliminating the complexities associated with data movement. Query federation lets you tackle and join diverse data sources, and catalog federation opens it up where you can connect to all other external Iceberg-compatible catalogs and truly build an open ecosystem. So this is an ecosystem that's all built for your flexibility and it's not fragmented.
I want to zoom in a bit on Glue Data Catalog in terms of how it takes the central seat for the Iceberg-powered lakehouses. In terms of what it is, it's a Hive metastore. It's also an Iceberg REST catalog with all the V3 support, all the goodness that I earlier spoke about. The catalog also supports multi-catalog federation to other external Iceberg-compatible catalogs. So now this catalog stack is providing the standard interface that other computes can hook up to, or it could be again choice of first-party or third-party computes through standard interface as an Iceberg REST catalog.
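To make the "standard interface" point concrete, here is a hedged sketch of pointing a Spark session at the Glue Data Catalog through its Iceberg REST interface. The endpoint, warehouse, and SigV4 properties follow the documented pattern, but the account ID and region are placeholders; check the Glue Iceberg REST documentation for your setup (S3 Tables catalogs use a warehouse of the form `<account-id>:s3tablescatalog/<bucket-name>`).

```python
# Hedged sketch: Spark connecting to AWS Glue Data Catalog as an Iceberg REST catalog.
# Endpoint/warehouse values are placeholders; verify against the Glue docs for your region.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-iceberg-rest")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.type", "rest")
    .config("spark.sql.catalog.glue.uri", "https://glue.us-east-1.amazonaws.com/iceberg")
    .config("spark.sql.catalog.glue.warehouse", "123456789012")  # AWS account ID (placeholder)
    .config("spark.sql.catalog.glue.rest.sigv4-enabled", "true")
    .config("spark.sql.catalog.glue.rest.signing-name", "glue")
    .config("spark.sql.catalog.glue.rest.signing-region", "us-east-1")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN glue").show()
```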
Lake Formation provides you this enterprise-scale governance, whether you're talking about your fine-grained access control needs right from your catalog, database, table, column, row, or cell-level security, or rich policy-based access controls such as role-based, tag-based, and attribute-based access controls. Lake Formation also supports an architecture called data mesh. It's basically a way to democratize your data where you
have your different lines of producers that want to bring in the metadata to a central catalog for consistent governance. From that central catalog, you can then do secure data sharing to different lines of businesses, other organizations, or units. It also supports what we call credential vending, which ensures that only trusted compute upon authorization gets scoped down access to your underlying data sources.
In terms of an Iceberg-style layout, as I was explaining how data and metadata sit closer together, the catalog is truly a fast lookup layer. You want to enable millisecond lookups and make sure that computes get direct S3 access so they can leverage the manifest caching, parallel manifest processing, and you get that full throughput. That's all what the data catalog enables.
We're also seeing a trend where there is an uptick in the agentic lakehouse, because the AWS Glue Data Catalog exposes all the APIs through programmatic interfaces. Now intelligent agents can come in, discover the tables, schema, and data quality rules. They can even detect schema drift and notify the data owners, or they could self-heal pipelines or optimize your partitioning strategy to get better query performance. Overall, I would say we are seeing a trend where the catalog is not just a control plane for compute and humans but also for all these intelligent agents.
S3 Table Buckets: Fully Managed Storage Optimization for Apache Iceberg
Speaking of Iceberg, we have to zoom into S3 Table Buckets. This is a stack where you get fully managed storage optimization for Iceberg. Iceberg is great, but it needs maintenance. Small files proliferate, especially on high-volume, CDC-style workloads, and they keep accumulating. Because every update creates these table-level snapshots, at some point your snapshots are going to blow up your storage costs, so you want to keep them lean. And if there are unreferenced files that aren't referenced from any live snapshot, they keep lingering in your system and eventually affect your query performance.
That's where S3 Table Buckets come into play and do all of this behind the scenes, so it removes that operational overhead for you; otherwise you would have to run separate cron jobs on schedules and constantly monitor all of this yourself. The first benefit you get is compaction. We support different styles of compaction. The first one is bin packing by default, making sure all your small files are optimally packed to a target file size. We also support sort or Z-order style compaction depending on your query styles, whether it's on a primary column or multi-dimensional clustering queries.
S3 Table Buckets support what we call policy-based snapshot retention. Depending on your time travel needs and how many snapshots you want to keep in the system, you can define a policy, and S3 Table Buckets take care of automatically pruning them behind the scenes. Likewise, unreferenced files and orphan files not accessible from any live snapshot are constantly identified and cleaned up. Overall, S3 Table Buckets provide a lot of configuration and control plane APIs as well as monitoring APIs. You have nice AWS CloudTrail audit logging, and it publishes metrics like snapshot sizes, compaction size, snapshot count, and referenced file count. Again, you can think of an agentic lakehouse where agents start monitoring these metrics and proactively optimize the storage tier for your lakehouse.
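As a hedged sketch of getting started, the boto3 `s3tables` client can create a table bucket and a namespace; compaction, snapshot retention, and unreferenced-file cleanup then run as managed maintenance. Exact API and parameter shapes should be verified against the current boto3 documentation, and the bucket and namespace names here are hypothetical.

```python
# Hedged sketch: create an S3 table bucket and namespace with boto3.
# Verify API/parameter names against current boto3 docs; names are hypothetical.
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-1")

bucket = s3tables.create_table_bucket(name="analytics-gold")
bucket_arn = bucket["arn"]

s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# Maintenance defaults (compaction, snapshot retention, unreferenced-file removal)
# apply automatically; the maintenance-configuration APIs let you tune them if needed.
```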
Medallion Architecture: Organizing Data from Bronze to Gold with Iceberg
Great. So we saw how the AWS stack is coming together for Iceberg-powered lakehouses, with the focus on AWS Glue Data Catalog and S3 Table Buckets. Before I go into the core ingestion architecture patterns, I want to touch briefly on the medallion architecture. This is a way to organize your data. You have your bronze layer, which is your raw data ingestion, append-only style. You typically leverage services like crawlers for schema discovery, or streaming services like Amazon Kinesis or Apache Kafka. Data is coming in, and typically the storage choice for your raw data is flexible general-purpose S3 buckets with Parquet files and the Iceberg format on top.
The benefit that you get from Iceberg in the bronze layer is you get this nice schema evolution, as I earlier mentioned, without needing to do a full data rewrite. With Iceberg V3, we support semi-structured data. You get variant support, and you get this nice time travel capability without having to break your data pipelines.
And ACID guarantees. Multiple writers can come in, ingest all the raw data, and update the same table without stepping on each other. A pro tip: you typically want to keep the raw data indefinitely, so leverage S3 lifecycle policies and move it to Glacier after 90 days or so.
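Here is a minimal sketch of such a lifecycle rule, assuming a hypothetical bronze bucket and prefix. One caveat worth noting: only transition objects your query engines no longer need to read directly, since Glacier-tier objects are not queryable without a restore.

```python
# Minimal sketch: transition bronze-layer objects to Glacier after 90 days.
# Bucket name and prefix are hypothetical; adjust storage class and timing as needed.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bronze-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-to-glacier-90d",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```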
The silver layer is where you cleanse the data. It's your enriched, validated data where you apply all your classification rules, transformations, and such. Typically you leverage services like EMR Spark or AWS Glue ETL. The storage choice could be your general-purpose S3 buckets or S3 Tables depending on your scale and needs. The pro benefit you get from Iceberg in the silver layer is schema enforcement; the schema acts as a contract. And you get this nice partition evolution. Say you started off with a daily partition strategy and, depending on scale and usage, you now want to move to an hourly one. The neat thing with Iceberg is that the manifests track partitions by spec IDs, so all the new data you ingest starts following the new specification while your old data can continue following the old spec.
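A hedged sketch of that daily-to-hourly partition evolution, using Iceberg's Spark SQL extensions on a hypothetical silver table; old data keeps the old spec, and only newly written data uses the new one.

```python
# Hedged sketch: evolve a hypothetical silver table from daily to hourly partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

# Metadata-only change: new writes land in hourly partitions, old files keep the daily spec.
spark.sql("ALTER TABLE glue.silver.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE glue.silver.events ADD PARTITION FIELD hours(event_ts)")
```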
And Iceberg also supports nice incremental updates, so you want to apply all these enrichment rules and transformations only on the incremental changes that are coming in. Leverage Iceberg merge operations to apply those incremental changes, and you get that through snapshot checkpointing: every update gives you this nice isolation of what your last checkpoint was, so you know the exact delta you're working with.
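A minimal sketch of applying such an incremental batch with Iceberg's MERGE INTO, assuming hypothetical bronze and silver tables with matching schemas; the staged delta table name is also hypothetical.

```python
# Minimal sketch: merge an incremental batch of changes into a silver Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

# A hypothetical staged delta read from the bronze layer since the last checkpoint.
incremental_df = spark.table("glue.bronze.events_changes")
incremental_df.createOrReplaceTempView("staged_changes")

spark.sql("""
  MERGE INTO glue.silver.events AS t
  USING staged_changes AS s
    ON t.event_id = s.event_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```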
And gold is your analytics-ready, optimized consumption layer. Typically you leverage services like SageMaker, Athena, or Redshift for your query and analytics use cases, and the storage choice is definitely S3 Table Buckets for the gold layer. The benefit of Iceberg in gold is that you get materialized aggregations that are typically stored as Iceberg tables, and you can leverage hidden partitioning in Iceberg. That means when you're running a query, you don't actually need to know the partitioning strategy based off of the column values; Iceberg internally knows how to get to the exact data files and applies the right partitioning strategy. And definitely, to reap query performance benefits, leverage sort order or Z-order depending on your query patterns, so you get that query boost.
A pro tip again: make sure you leverage a snapshot retention policy so you're constantly optimizing the metadata layout for better query performance. And a point to note: S3 Tables offer 3x better query performance and 10x higher throughput just by the sheer reduction of metadata overhead and optimization of the file layout.
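A hedged sketch of the sort-order and Z-order knobs mentioned above, on a hypothetical gold table. The DDL and the `rewrite_data_files` procedure follow the Iceberg Spark extensions; with S3 Table Buckets, compaction itself is managed, so the explicit rewrite call mainly applies to self-managed tables.

```python
# Hedged sketch: declare a write sort order, or run a Z-order rewrite for
# multi-dimensional query patterns. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

# Option 1: new data files land pre-sorted on the declared order.
spark.sql("ALTER TABLE glue.gold.daily_sales WRITE ORDERED BY region, sale_date")

# Option 2: a Z-order compaction pass (mostly relevant for self-managed tables).
spark.sql("""
  CALL glue.system.rewrite_data_files(
    table => 'gold.daily_sales',
    strategy => 'sort',
    sort_order => 'zorder(region, sale_date)'
  )
""")
```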
Core Ingestion Patterns: Batch ETL, Change Data Capture, and High-Concurrency Streaming
All right, now let's look at core ingestion patterns. First and foremost, batch ETL. This is your workhorse-style workload where you're moving voluminous amounts of data on a schedule. Typical use cases are lake migrations, warehouse loads, historical backfills, or regulatory reporting. So you have your sources like RDS or on-prem databases, and you leverage, say, serverless Spark to ETL the data, write it as Iceberg tables on S3 Tables, catalog it in the AWS Glue Data Catalog, and it's ready for your analytics consumption.
In this pipeline, the pro role for Iceberg is that you get this nice schema evolution. Adding, renaming, or dropping columns is very elegant without needing rewrites. You get this nice partition evolution without incurring huge scans and rewrites, and it provides you time travel capabilities and ACID guarantees, so any time a mishap happens you don't have to disrupt the rest of your ETL pipelines; you can go back to a consistent point-in-time snapshot. And you get row-level lineage with Iceberg metadata, which provides better auditability and debuggability.
Pro tip: definitely leverage S3 Table Buckets for optimized, faster queries, and tune the file size if you're leveraging, say, the bin pack algorithm, to understand what your pattern is and how to optimize your query performance. Understand your data: if it's time series, partition by date and time; if it's region-based or geo-style data, you want to partition it differently. And enable sort order or Z-order for better query performance.
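A minimal sketch of the batch ETL pattern described above: read from a relational source over JDBC, transform, and append to an Iceberg table registered in the Glue catalog. Connection details and table names are placeholders, and the target table is assumed to already exist.

```python
# Minimal sketch of the batch ETL pattern; connection details are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/appdb")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "...")  # pull from Secrets Manager in practice
    .load()
)

cleaned = (
    orders.filter(F.col("status").isNotNull())
    .withColumn("ingest_date", F.current_date())
)

# Append to an existing Iceberg table cataloged in Glue.
cleaned.writeTo("glue.silver.orders").append()
```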
The second popular pattern is Change Data Capture. This is your real-time style ingestion: real-time replication from operational databases to analytics.
You have source databases like Aurora, MySQL, and RDS, and you capture changes from the binlogs or write-ahead logs. You typically leverage AWS Database Migration Service, and all these captured changes flow through Kinesis or Kafka for stream buffering and replay. Then you leverage Flink or Spark for the transformations, merge them into Iceberg format on S3 Tables, and it's ready for your analytics consumption.
The primary role for Iceberg here is that you get efficient upserts and merge operations, because you get all these WAL captures, the delta changes that are coming in. Without doing a full rewrite of the entire partition, you want to know exactly which rows to update, and Iceberg does that very efficiently. You know the exact rows to update, so it rewrites only those files because of this nice optimized layout and updates the manifests atomically.
It supports two strategies: copy-on-write and merge-on-read. Typically for CDC-style workloads, we've seen customers adopt merge-on-read because it's more write-heavy. Definitely leverage deletion vectors, which are much more optimized for performance, so you can bring the entire bitmap in memory and operate at memory speeds when you want to apply those deletes. You get this nice snapshot isolation, so even when CDC streams are coming in and ingesting, you get consistent query performance and you also get consistent reads on your analytics side. Definitely leverage auto-compaction, and concurrent writers can update the same table without stepping on each other.
A pro tip would be to definitely bring in the files in Parquet format. That's generally better for your compression techniques. You want to monitor the lag through your entire pipeline, whether it's your DMS or Kinesis, to see where your latency is going up so you can optimize your CDC pipeline. Definitely leverage deletion vectors if your workload is write-heavy with more deletes.
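A hedged sketch of the CDC merge described above: deduplicate to the latest change per key, then apply inserts, updates, and deletes in one atomic commit. The change-record columns (`op`, `ts`, `order_id`) are hypothetical and depend on how your DMS or Debezium stream is shaped.

```python
# Hedged sketch of a CDC merge; change-record column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog "glue" already configured

# Keep only the latest change per key from the staged CDC table.
spark.sql("""
  CREATE OR REPLACE TEMPORARY VIEW latest_changes AS
  SELECT order_id, status, amount, ts, op FROM (
    SELECT c.*, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ts DESC) AS rn
    FROM glue.bronze.orders_cdc c
  ) WHERE rn = 1
""")

# Apply inserts, updates, and deletes in a single atomic commit.
spark.sql("""
  MERGE INTO glue.silver.orders AS t
  USING latest_changes AS s
    ON t.order_id = s.order_id
  WHEN MATCHED AND s.op = 'D' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount, t.updated_at = s.ts
  WHEN NOT MATCHED AND s.op <> 'D' THEN
    INSERT (order_id, status, amount, updated_at) VALUES (s.order_id, s.status, s.amount, s.ts)
""")
```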
The third popular pattern is high concurrency streaming ingestion. You have clickstream analytics, financial transactions, and gaming telemetry. We are talking about all these IoT apps coming in through, say, Kinesis streams with high throughput and multiple streams. Flink is doing some real-time aggregates and Spark is doing some complex transformations, and they're all writing to the same table. You write it on S3 tables and then it's ready for consumption.
Iceberg is perfect for this because you have multiple writers and you get consistency leveraging optimistic concurrency. You get nice read consistency on the consumption side, and small file handling is perfect with S3 Tables. You get exactly-once semantics because snapshots provide checkpointing, and combined with your Kafka offsets, that gives you exactly-once delivery. Iceberg is perfect for all these incremental processing scenarios, so you read and transform only the delta from the last checkpoint.
A pro tip: the perfect choice for these workloads is S3 Tables, so you get all the automatic storage optimizations. Especially in these multi-writer scenarios, you want to monitor for commit conflicts and tune the concurrency accordingly. Typically we recommend customers do some namespace isolation so writers don't step into the same partition; you get this nice segregation of high-frequency writers. And again, understand your pattern and tune the microbatch size to balance latency versus file count. You get all the goodness right off the bat leveraging the AWS components.
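As a rough illustration of the streaming side, here is a hedged sketch of a Spark Structured Streaming job appending microbatches from a Kafka/MSK topic into an Iceberg table, with the trigger interval used to balance latency against snapshot and file counts. Brokers, topic, checkpoint location, and schema are placeholders, and the Kafka connector is assumed to be on the classpath.

```python
# Hedged sketch: Structured Streaming from Kafka/MSK into an Iceberg table.
# Brokers, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg catalog + Kafka connector available

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.my-msk-cluster:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")  # tune microbatch interval vs. file/snapshot count
    .option("checkpointLocation", "s3://my-checkpoint-bucket/clickstream/")
    .toTable("glue.bronze.clickstream")
)
query.awaitTermination()
```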
Overall, we saw batch ETL, CDC, and high-concurrency streaming, where with the Iceberg benefits plus S3 Tables, the Glue Data Catalog, and the analytics services, you get all the core benefits. Over to you, Mike. Let's hear the transformation journey of Medidata.
Medidata's Journey: From Legacy Architecture to Modern Data Solutions
Okay, thank you, Purvaja, for that amazing introduction to some of the popular Iceberg architectures we're seeing emerge today, as well as the amazing AWS services that are helping make those a reality. I'm Mike Araujo, an engineer here at Medidata, and before we start looking at our customer journey and our use cases, here's a little note about who we are and what we do.
Medidata is a life sciences tech company that has been innovating and providing digital solutions to life sciences, pharmaceutical, and healthcare companies across our 25-plus years of experience. For our principal use case today, we're going to look at two of our newer offerings: Data Connect and Clinical Data Studio.
The purpose of these new offerings was to bring the data under a single roof for our customers to interact with.
We wanted to bring a single source of integration regardless of where that data came from. We wanted to support and enrich that data with semantics via our internal Medidata Data Catalog, as well as accelerate the data review and insights process through AI and machine learning.
As you can imagine, over our 25-plus years, we've seen millions of patients come under our roof, which has resulted in the consolidation of billions of data points from these various data sources. One of the things we were really hoping to do is reduce the friction that customers experience today generating ad hoc analytics data sets and data assets for the aforementioned insights through AI and machine learning.
This is a look at the previous architecture that we used to accomplish this type of thing, and this probably looks very familiar to a lot of people in the audience. This pattern works very well and has been working very well for a long time until the recent data explosion that we've been seeing emerge today. We have our bronze layer consumers and collectors on the left side, including our flagship Rave platform, which is our raw data collection system, and our legacy ingestor solution, which was used for file-based integrations.
What you notice here is a lot of those ETL blocks, so there were all these different batch jobs going on that had to land in some different staging layers for processing before finally making it to some downstream resting point where your consumers would then pick it up and read from it. But the reality is you probably didn't have just one of these pipelines hanging around. You probably had many of them because your consumers probably had various different requirements for how they wanted to read that data. You might have had streaming consumers, application consumers, APIs, BI solutions. The range is very wide, and so you probably were replicating pipelines like this all over your organization to try to satisfy all those.
Obviously, you probably ran into all the same issues that we started to run into with solutions like this. As the data began to scale, we started to see the latency deteriorate and started to become measured in days instead of hours or minutes. The inconsistency with the timing for each of the individual jobs meant that it would be difficult to align all of the data as it moved through the system, especially since we had collectors collecting data at all points at all times.
With that many batch jobs running and that many places to store the data, you increase your opportunity for error, places where things can go wrong, inconsistent data throughout the pipeline solution as the batch jobs fall out of consistency. If you needed to scale this, you needed to accommodate billions and billions of data points, you were probably looking at an overhaul of your entire system, possibly a data migration. The net result is numerous copies of data all over the place for different use cases at all times. To make matters worse, if we wanted to throw observability on this, we were talking about replicating that observability across the entire data estate.
Medidata's Iceberg Implementation: Architecture, Results, and Transformative Benefits
This is a look at how we solve this challenge today with Iceberg and AWS. We still have our Rave solution collecting data on the left side. We've replaced the ingestor solution with our new solution Data Connect. What's happening here is we got rid of all those ETL jobs that were running on some cadence in the middle, and we got rid of all of the intermediary storage layers and staging layers that we had needed before.
Because what's happening here is Iceberg, as you can see from the far right, plugs right into anything you want from a consuming perspective. That meant that what we could do was if we could fill up this Iceberg sink in the Glue Catalog, we're going to get immediate interoperability with a number of different streaming solutions, batch solutions, warehousing solutions, other open source projects like Arrow, and even MCP tools that we can then feed into our AI solutions.
Right before that is the streaming piece that I was alluding to before, and I just want to highlight that for a quick second. As some of you are probably aware, Iceberg, if you write a lot of snapshots from a lot of writers, you start running into all sorts of problems with file sizes, small files, lots of snapshots. By putting Flink inside EKS and using a Kafka buffer in the middle, we could just write to that Iceberg table once every 20 minutes and keep our snapshot size down while aggregating data from thousands of Flink pipelines.
So let's take a look at what that architecture delivered for us in terms of results for both us and our customers.
First, the data availability. Naturally, or perhaps unsurprisingly, when you move from a batch pipeline to a streaming pipeline, you see a significant reduction in latency, and that's exactly what we saw. That reduction in latency consequently meant that our customers were seeing consistent views of the data. The Glue compaction was ensuring that the data could be retrieved in a timely manner. It was abstracting away all of those frustrating complications with the small file problems, the orphan file deletions, and all of that. This is one of the things that we worked very closely with the Amazon team on, who was incredibly receptive and helped iterate over that to deliver the solution you see today. Finally, the snapshots were giving our customers and ourselves point-in-time access to the data.
The integrity was significantly improved. Now, instead of having data all over the system in Kafka topics, staging tables, Snowflake tables, Databricks tables, and whatever tables, we had a single copy of the data, one copy, one Iceberg table. That table was able to satisfy both the streaming and batch use cases. We were able to unify that estate there as well as serve application interfaces, APIs, and BI tools. As a result, we had a significant reduction, unsurprisingly, in data loss. We just had to focus on this one copy of the data, and we knew that everything reading from it was reading from the same thing.
The scalability. Now that we were working with data stored on S3, we no longer had to store data throughout its lifetime inside Kafka. That allowed us to let Kafka do what it does best, which was consolidate across all the writers and transport data downstream. This actually saved us a whole lot of money, unsurprisingly, because all of the lifetime data was sitting on S3, which we could storage tier, instead of in Kafka where you'd have to scale up those servers to store all of that. Of course, the Kafka and Flink solutions that are running on Amazon MSK and Amazon EKS come with auto-scaling support out of the box. Unsurprisingly, our observability was significantly enhanced and simplified. We were getting all of the amazing metadata coming out of these pipelines and the metrics, the Prometheus metrics, and all that, and we just had to worry about looking at that within a single plane.
The security. Now we have a single data access layer with the Glue Catalog, so nothing can get in the door to see our data that doesn't go through Glue and the corresponding IAM roles. Additionally, the data now never leaves our VPC, so the data is just inside S3 in our account, and whatever we decide to plug in as a reader can plug in as a reader, but we're not storing physically any copies anywhere else.
Then the integration. As I alluded to on the diagram side, the integrations that were available to us significantly expanded. Warehousing solutions like Snowflake and Databricks plug right in. Iceberg REST API delivered via the Glue Catalog has made it possible for a number of advanced Iceberg concepts to be put in place. Finally, a bunch of open-source technologies, we're talking streaming writers, readers, SQL solutions, are out-of-the-box enabled, and more and more coming every single day. This is just a snapshot of all those benefits that we've been talking about: the availability, the integrity, the scalability, the security, and finally the integrations.
What's Next for Medidata: AI Agents, Customer Data Sharing, and Iceberg V3 Features
What's next for Medidata with Iceberg? We're looking to equip our coming suite of AI agents with the real-time data that's flowing through these Iceberg tables, so we want to give our agents the insight into data, watch our customers upload new pieces of data, and then start asking questions about that data minutes later. Our current plan, something we've been working on, is to basically make that available through MCP tools that have access to this lake. We want to enable and expand upon our customer data sharing solution. We've recently launched a solution that gives customers direct access to read the data that lives in these Iceberg tables, but we want to expand that to allow customers to do write-backs, direct write-backs into the lakehouse solution as well. We have our Medidata curated SDKs, including our recently released RStudio SDK, which actually allows customers to write R code on their local laptops reading data that lives inside this lake, and that's all made possible by the open-source Apache Arrow project where we've set up a Flight server that gives the connection and the bridge between RStudio and our lakehouse. Finally, direct integrations.
We want to be able to open the box so that if customers are using Iceberg today as part of their solution, we can plug right into that and integrate it with the rest of the experiences that they get to enjoy in Data Connect and Clinical Data Studio. And so finally, some excitement on our side for what's coming in V3.
First, the variant support for metadata. We saw the Iceberg table that was being written to from those thousands of jobs; well, that's because those thousands of jobs are actually writing multiple tables. It's multiple tables within a single table, and so we've made extensive use of the map column. The map column worked great for that use case, but it does lack in some areas, including indexing and query performance for some of the tables within that table at scale. So we're looking forward to replacing those map columns with variants, where you can actually get way better performance against a subset of a table inside another table.
Next, the default values. We want to enable column defaults so that we can save ourselves days and days of backfilling. Right now, if we have to make a schema change or evolve or move some sort of process there, it actually takes days and days of backfilling to get the table back up to current. The column defaults promise to save us all of that. We don't have to worry about any of those migratory patterns that can get a little tricky and a little complicated. This is going to be a great feature for that.
And then finally, the row level lineage. Our customers are constantly obsessed with seeing how the data changes and evolves over time. And so while we've put in features to help give them a clear picture of that, a lot of it is not out of the box solutions and tooling. It's navigating snapshots, it's looking at timestamps, it's doing date range comparison and all that stuff. But in V3, Iceberg is coming out with row level lineage, and so what we want to be able to do is plug this row level lineage right into our CDC pipelines and then expose that to our customers directly so that they can see the change of the data and we don't have to spend hours and hours writing custom code and custom SQL to surface that to them. And so now I'm going to hand it off to Srikanth who's going to show you some of the amazing things that Amazon is doing to help make this possible for their customers today.
Open Standards and Interoperability: Query Federation, Catalog Federation, and Materialized Views
Thank you, Mike. So I think it's great to see customers like Medidata use Iceberg and provide the solutions to their customers. But I think what I want to do is I want to kind of touch on the consumption patterns in terms of Iceberg with the AWS compute engines. So first of all, before I do that, I just want to double click on the open spec. Why is it important? If you think about it fundamentally, open spec changes the game of how you consume your data with standardized table semantics and also interoperability. I think that's very important for us, and also catalog choices.
Purvaja talked about catalog choices. Iceberg's openness in terms of catalog choices, with AWS Glue, Hive Metastore, or any other custom catalogs you can bring in, plus automatic schema evolution, ACID transactions, and accessible metadata, all of these features are really important with this open spec. The other thing to think about is multi-cloud and multi-source integration. Imagine you're doing operational logs in S3: you have a bunch of operational logs in S3, you have app exports, and you're trying to bring all of that data, plus third-party marketing data, into one place, into Iceberg tables. It can be governed centrally in one place. The open spec basically provides you that.
And finally, open discovery features. These are important because of enterprise-wide visibility, because of compliance if you have sensitive data sets, and because comprehensive lineage and standard definitions can also be offered with this open spec. That's why I wanted to touch on the open spec; it's important context for the rest of this discussion. With that, let's get into query federation. This is an important topic, right? We have seen how the open table spec unlocks interoperability. So what is query federation, and why is it transformative?
If you think about it, with centralized governance and metadata management, by plugging in every source you get a unified catalog where everything is findable, and you can audit and secure all of this data right out of the gate. I think that is important for you. Data residency and compliance are also front and center with open query federation.
Query federation stands out as a particularly important capability. With query federation, the data remains in place as you federate across multiple data sources, which is a really important property when you come to Apache Iceberg. Agile analytics becomes really important here as well. Think of analysts who are trying to get to data across multiple sources without any data movement; query federation becomes a very important aspect of their analysis. Lastly, cross-source reporting is critical: you have all of these data sets across multiple environments, multi-cloud or multi-vendor, and you bring them into one Iceberg format with an API where you can access all of that data. Query federation provides you that capability.
For that, we've just announced recently, if you have seen this, the catalog federation of external Apache Iceberg catalogs. So what does it do? With catalog federation with Iceberg, your teams can query the Iceberg tables sitting in S3 directly, which is very important. Using whatever analytics compute engine that you want, you can access that data, and it keeps the metadata synced between catalogs—not just the AWS Glue catalog, but also other external catalogs or remote catalogs. So when you hit a table, you're basically looking at the data that's coming from the other catalogs and they are all in sync, which is important. This is an important feature for that. Lastly, it gives you consistent centralized security across all of these things, with the option—of course this is optional—where you can apply the broad policies or fine-grained column-level access through Lake Formation depending on the use case. So the federated catalog basically becomes an important part because it's integrating multiple catalogs into Iceberg formats.
Now materialized views are fundamentally important when it comes to performance, because if you think about it, the materialized views supercharge your analytics because they provide pre-computation. For example, the way to think about this is basically a few factors here. One is implementation strategy. Think about what you want to identify in terms of best practice. You want to identify what are the high-frequency queries that you are using or your users are using constantly, and then set up a refresh schedule where you can have the views accurately and timely refreshed. So that's the implementation strategy. Then of course, performance design becomes another important factor here. What does performance design in this case mean? Pre-aggregate metrics and also optimize those joins to deliver the right lightweight, fast results for your customers.
Then storage optimization becomes very important because I know Purvaja talked about storage optimization. Storage optimization is formatting your tables in Parquet, choosing the right compression, partitioning very smartly, so that every query scales up without blowing out your complete storage costs. So that's an important aspect of it. Then refresh management becomes an important point here too, because it keeps the insights refreshed. So you want to figure out your refresh schedule—what the schedule looks like. So that's an important aspect of that. Finally, metrics and ROI: you have to be able to measure this performance. So you want to look at the major cost reduction, you want to look at rapid query optimization, and also from seconds to milliseconds—you want to measure those measurable improvements. So those are important things there. So the idea of the materialized view is it gives you this pre-computed aggregated way of accessing your Iceberg data so that you can get better performance.
On this note, we just announced the materialized views in AWS Glue. So think of materialized views in Apache Iceberg. What are they? These are materialized views—think of this as a managed table that stores the output of a SQL query in Iceberg format, and that's what these materialized views do. Then rather than constantly rerunning that query from scratch, as the tables change, the view is incrementally refreshed constantly, so you don't have to do anything. So it eliminates the whole pipeline building where you have to constantly refresh that. So that's what AWS Glue materialized views for Apache Iceberg do.
What are the benefits? You might be thinking, okay, how does it help me? The performance benefits are huge with these materialized views with Apache Iceberg, because the results are pre-computed and they're also stored in Iceberg tables. They are super fast. The Spark jobs that are reading from these views are significantly accelerated.
The Spark jobs that are reading from these views can figure out whether the materialized views are the right approach for them, or whether they should go back to the tables themselves. You have that intelligence built into those things. If you're building pipelines, especially complex pipelines, these materialized views can help you simplify these pipelines with the help of those capabilities. This is a cool feature and one of my favorites here.
I know Purvaja talked about it, but I just wanted to flash the screen out to give you an idea. Apache Iceberg is becoming the go-to table format across the board in analytics, and every customer we talk to is embracing this approach. We have built multiple data integrations across the entire ecosystem. On the producer side, we have all of these tools like Data Lake Foundation and EMR that we talked about, Athena, and integration across our analytics stack so you can produce data into Apache Iceberg tables. That's huge. Plus, with the catalog federation I talked about, you can now consume and get that catalog federation from remote catalogs, so you have that capability there.
Optimizing Iceberg Data Consumption with Amazon Athena, Redshift, and EMR
With the open spec and the APIs we talked about, you can have Spark or Trino or Presto consume all of the data across these different tools. This becomes an integral part as you're thinking about your architectures. Think of these as components you'll be considering. Now that we have talked about the services that can consume data, let's double click on a few of the services like Amazon Athena, and then we'll talk about a couple of other services here.
How do you optimize Iceberg data consumption using Amazon Athena? One of the biggest wins here is Athena's ability to leverage Iceberg's manifest files and partition statistics for aggressive metadata pruning, which is very important. With the right partitioning and the right filter predicates, Athena skips irrelevant Parquet files, which helps your performance. Data management goes beyond just querying: using statements like CREATE TABLE AS SELECT, you can build new Iceberg tables and write data from inside Athena with SQL, with the snapshot isolation that Purvaja talked about, plus branch and tag support in Athena on these Iceberg tables.
The last one is serverless scaling. Serverless scaling is pure simplicity because there is no cluster management; you don't have to worry about managing any cluster to access your data in Iceberg tables. Athena provides that serverless experience at five dollars per terabyte scanned, which is pretty cost effective. In terms of best practices with Athena, think about partition predicates and filter conditions while you're designing every Athena query, and leverage the manifest statistics and compaction for fast and efficient writes. This way, Athena and Iceberg work together and provide the best performance you can get.
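A hedged sketch of the CTAS capability just mentioned, submitted through boto3: it creates a hypothetical Iceberg table from a query result. The workgroup, database, bucket, and table names are placeholders.

```python
# Hedged sketch: Athena CTAS creating an Iceberg table, submitted via boto3.
# Workgroup, database, bucket, and table names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE analytics.daily_sales
WITH (
  table_type = 'ICEBERG',
  location = 's3://my-gold-bucket/daily_sales/',
  is_external = false
)
AS
SELECT region, date_trunc('day', sale_ts) AS sale_date, sum(amount) AS revenue
FROM analytics.orders
GROUP BY region, date_trunc('day', sale_ts)
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)
print(response["QueryExecutionId"])
```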
The bottom line is Athena plus Iceberg is a true pay-as-you-go analytic solution for you because there's no cluster management, it's cost effective, and you can access those Iceberg tables with all the features that I talked about. Let's talk about Redshift. Redshift is a very popular data warehouse out there. A lot of my customers use it. When you're ready to take your lakehouse analytics to the next level at enterprise scale, people generally look at Redshift. Amazon Redshift and Iceberg really work very well together in terms of performance and scale.
On the consumption side, Redshift excels at concurrency and efficiency. You can easily handle fifty plus concurrent queries at the same time, and the system's auto scaling and workload management really help with these Iceberg tables. For modern lakehouse integration, Redshift has built-in support for Apache Iceberg tables. It's a powerful combination where you are able to get direct read and write to Amazon S3 with snapshot isolation and full compatibility with both business intelligence and analytical workloads. Everything is tuned for cost control because Redshift uses statistics for query planning and optimization so that your queries can perform very well. They provide you that query rewrite capability so you're able to do that quickly.
In terms of best practices, revolve around RA3 nodes for optimal storage and compute separation, and leverage concurrency scaling, an important aspect of Amazon Redshift, along with cross-cluster data sharing, which is also key for enterprise needs. Performance monitoring and performance metrics using various other tools are essential components. Query latency and concurrency, along with cache hits for smooth, predictable operations, are all capabilities that Redshift provides.
Those are some of the best practice considerations for Redshift. The other compute engine we see a lot is Amazon EMR, because Apache Spark in general makes scalable Apache Iceberg analytics both cost-efficient and enterprise-ready in terms of the lakehouse architecture. In terms of infrastructure, you need to think about flexibility. Clusters can range from single-digit nodes to handling 100 to 1,000 terabytes of data at the same time. You can deploy multiple clusters, and using Graviton lets you spin up those clusters efficiently.
Performance hinges on runtime optimization, because specifically with Apache Iceberg, Spark gets adaptive query execution, smart join planning, and dynamic partitioning, all available as part of that Spark implementation. Even with your largest data jobs, you are able to use millions of partitions across petabytes, so you can consume that data and run it effectively. Storage integration is key because it ties all of this together. You get the S3 committers, which provide full ACID transaction support, and also schema evolution, making it very easy to manage reliable versioned datasets.
Amazon EMR with Spark plus Apache Iceberg is built for all of the analytics we talked about: pipelines, streaming workloads, and ad hoc queries. No matter what the workload type is, a few core best practices unlock scale. Number one is right-sizing your partitions, between 128 megabytes and 1 gigabyte, which is not new but a standard practice. Also, tune your executor memory for maximum performance and enable the adaptive query features for shuffle optimization.
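A minimal sketch of those tuning knobs for Spark on EMR: adaptive query execution, executor sizing, and an Iceberg target file size in the 128 MB to 1 GB range. The values shown are illustrative starting points under those assumptions, not recommendations for every workload, and the table name is hypothetical.

```python
# Minimal sketch: AQE, executor sizing, and an Iceberg target file size (512 MB shown).
# Values are illustrative starting points; table name is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("emr-iceberg-tuning")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)

# Target data-file size is an Iceberg table property.
spark.sql("""
  ALTER TABLE glue.gold.daily_sales SET TBLPROPERTIES (
    'write.target-file-size-bytes' = '536870912'
  )
""")
```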
The bottom line is Amazon EMR with Spark and Apache Iceberg lets you build fast, flexible, and cost-optimized lakehouse analytics. Whether it's massive data, machine learning, ad hoc analysis, or even streaming, everything runs smoothly with this kind of feature set. That's Amazon EMR with Spark. Let's get to key takeaways.
Key Takeaways and Action Plan: Building Production-Ready Apache Iceberg Lakehouses
In closing, here are some key takeaways that really set up Apache Iceberg for modern lakehouse architectures on AWS. The foundation is built with enterprise data features: think ACID transactions, think time travel that Purvaja talked about, think schema evolution. All of these things come standard, and that's important. The other thing is that you're building for the future from day one; when you architect a solution with Apache Iceberg, you're basically building a future-proof architecture.
By leveraging S3 Table Buckets, which we talked about, every query gets a speed boost in terms of high-performance, high-throughput access. Use that capability, and with AWS taking care of the maintenance and managing those Apache Iceberg tables, you get significant operational benefits. For high-performance analytics, you've got the built-in engine-specific optimizations, whether you're planning to use Amazon Athena, Amazon Redshift, or Apache Spark. Iceberg materialized views that we talked about, query federation, which is a fundamental thing that you want to start thinking about, and also catalog federation in that respect, these are for fast cross-source analytics that provide you better insights that you can quickly access.
In production, the key thing to remember is that Apache Iceberg isn't just about batch jobs, because it powers real-time workloads like the use cases Purvaja discussed: Change Data Capture and streaming workloads. The idea here is to get you subsecond latency with wide concurrent-writer support and make your platform more responsive, and that's an important thing.
In terms of best practices, when you think about it, start new projects with S3 Table Buckets if you can, because that's an optimized, managed Apache Iceberg storage layer that is available for you. Think about file sizes for storage efficiency. Use smart partitioning for agile queries while enabling catalog optimization for quick searching, and configure snapshot retention in terms of balancing reliability and cost.
Here's the action plan for you. Use S3 Table Buckets and right-size files for all your Apache Iceberg projects if you can. Set up snapshots and partitioning to future-proof your performance. Lean into materialized views. I talked about AWS Glue materialized views. Lean into the materialized views because the pre-aggregated, pre-computed results provide valuable benefits. They reduce your data pipeline complexity so that you can quickly get those results, particularly AWS Glue Iceberg views that we talked about.
Make catalog management a priority. Think about catalog federation using other vendors, using other catalogs. Try to use catalog federation so you can consume your data and get the best compute, irrespective of which platform you're on. In terms of Apache Iceberg, the way you deliver scale, the way you deliver flexible analytics, is using Apache Iceberg with these compute engines, making your workflows more productive. As a data engineer, data architect, or whoever you are, it makes it easy for you to consume that data.
That's my last slide. Thank you for coming to the session and enjoy the re:Invent party. Thanks a lot.
This article is entirely auto-generated using Amazon Bedrock.