🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Building multi-Region data lakes with Replication for Amazon S3 Tables (STG358)
In this video, AWS announces replication support for Amazon S3 Tables, enabling fully managed multi-region Apache Iceberg data lakes. Aritra Gupta and Nikos demonstrate how customers previously spent 6-12 months building custom replication solutions using S3 replication, ETL jobs, or dual writes. The new feature creates read-only replicas that maintain snapshot ordering, automatically rewrite file paths, and support independent retention policies. Live demos show configuring replication via console, monitoring with APIs, and centralizing data from three continents for analytics. Soumil from Zeta Global shares how they manage 10 petabytes across 10,000 Iceberg tables, ingesting 6TB daily, and plan to use regional replicas to reduce latency for global customers.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: The Challenge of Multi-Region Data Lakes with Apache Iceberg
Good afternoon, everyone, and welcome to our session. I hope everyone's having an amazing re:Invent so far. This afternoon, I'm Aritra Gupta, a product manager on the Amazon S3 team. With me, I have Nikos from the S3 Engineering team, and I have Soumil as well from Zeta Global. You'll hear from both of them in a bit.
Just to get started, if you're in this room, you're probably already using Apache Iceberg or are thinking of using Iceberg in production for your data lakes. Maybe just a quick show of hands if you're familiar with or you're already using Iceberg in some form or shape. Wow, that's quite a lot of us. Maybe to take this a step further, now that you have your data lake set up or you're thinking about it, you're probably also thinking about taking it to more than one region or replicating it, whether to share with your teams who are distributed globally, for compliance, or for just a backup use case.
For those who've tried this, I think we all know that replicating Iceberg tables is a super hard problem because it's not just about replicating objects from one point to the other. It's also about taking care of Iceberg's metadata guarantees, making sure that we understand the spec really well, and then having your replicas in a read-only state that's easily consumable by your downstream applications. Typical customers who've tried this told us that it took them about six to twelve months to get a working application running, and then it takes them additional time to maintain this infrastructure as the team scales up further. That's what we set out to solve today: how do we give you a fully managed Iceberg replication solution that takes care of all of these problems.
With that, let's get started. In the next fifty minutes, here's what we are going to cover. First of all, we are going to see why you even need a multi-region data lake in the first place. Then we'll talk a little bit about S3 Tables. We'll introduce the new offering, which is replication support for S3 Tables, look at some of the use cases and workloads where you can apply them, and then we'll have Nikos walk us through the feature with a few demos. Finally, we'll hear from Soumil, who's been using S3 Tables in production throughout this year at Zeta Global and how he plans to use the replication support as they grow into multiple regions.
With that, let's get started. First up, let's talk about why we even need multi-region data lakes in the first place. There are three key drivers that we've heard from multiple customers. The first one is performance. Imagine that you have data scientists based out of Singapore, London, and New York, while you have your data sets in US East One. Your data scientists in Singapore will definitely say that their queries are much slower, basically because the data is traveling all the way over the Pacific. On top of it, each query made by them will also result in additional inter-region data transfer costs. In this scenario, it makes a lot of sense to have your data sets closer to where your users are.
Next is compliance. For those of us who are in regulated industries such as healthcare or financial services, it's almost always a requirement to have another copy of your data isolated from a primary copy. This could be for resilience, for disaster recovery, or any other mandates out there. In fact, some regulations require another copy of your data that is totally isolated from the primary, kept at least five hundred miles away, and encrypted differently. That's another common use case that comes up.
Finally, data protection itself. I'm sure all of us are thinking, well, Iceberg gives you time travel capabilities, and it's easy to roll back to a previous version, and that's all true. However, we've also seen customers run into those days where someone accidentally just drops a table or drops an entire database worth of tables, and there's no good way for you to get back to that previous state. Additionally, you might also want to protect yourself against ransomware attacks or any other kind of malicious activities. In all of these scenarios, you do want to look at having another copy of your data or replicate your data lake into another region.
Understanding Iceberg Architecture and Current Replication Approaches
Now before we dive further, I want to briefly talk about how Iceberg works. It looks like all of us are pretty familiar, so I won't go too much into the depth. At the very bottom, you have your Parquet files, which contain the actual rows, the actual tabular data. On top of it, you would typically have your metadata files, which basically make up the Iceberg spec. These could be manifest lists, manifest files, and metadata JSONs, and on top of this you would have the query engines and the catalogs. You can use any query engine of your choice, and that's why we all love Iceberg, right? You can use Spark, you can use Trino, you can use Athena or Redshift.
And because Iceberg is made up of these multiple layers, that's where replication becomes a little more challenging. To illustrate this, I want to walk us through three ways in which customers replicate Iceberg tables today.
Now, the first way that customers would do it is simply use S3 replication. What happens in this scenario is you take your source buckets, source files, and you turn on S3 replication or you physically replicate copies into the replica tables. There are three big challenges that come up when you're using this approach. The first one is, because S3 replication is asynchronous today, you need to figure out a way to understand whether snapshot one has arrived before you go and commit snapshot two to your destination. Similarly, there might be a scenario where your metadata files have landed before your data files land, and in that case your queries at the destination may not return usable results.
The second problem, which is kind of peculiar to Iceberg, is that all the metadata files that refer to the data files underneath have absolute file paths embedded within them. For example, a path typically starts with s3:// followed by your bucket and then the path to the exact key itself. What that really means is that you now need custom logic to transform them to point to your destination buckets. And finally, all of this has been happening at the storage layer itself. That means you need to coordinate with your catalogs and you need to coordinate with your higher level applications to make sure that they are aware of each of these commits that you're replicating on the other end. Only once all of that happens do you have a queryable replica on the other side.
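To make the path problem concrete, here is a minimal sketch of the kind of custom rewrite logic the speaker is alluding to. The bucket names and prefixes are placeholders, and only the JSON metadata file is handled; manifest lists and manifest files are Avro, so a real pipeline would need an Avro-aware rewrite for those as well.

```python
import json

import boto3  # assumes AWS credentials with read/write access to both buckets

# Hypothetical source and destination warehouse locations, for illustration only.
SRC_PREFIX = "s3://source-warehouse-bucket/transactions/"
DST_PREFIX = "s3://replica-warehouse-bucket/transactions/"

s3 = boto3.client("s3")


def rewrite_metadata_json(src_bucket: str, key: str, dst_bucket: str) -> None:
    """Copy one vN.metadata.json file, rewriting absolute paths to point at the replica.

    Only the JSON metadata file is handled here; manifest lists and manifest files
    are Avro, so a production job would rewrite those with an Avro library instead.
    """
    body = s3.get_object(Bucket=src_bucket, Key=key)["Body"].read().decode("utf-8")
    metadata = json.loads(body)

    # Replace every embedded absolute path in the document.
    rewritten = json.loads(json.dumps(metadata).replace(SRC_PREFIX, DST_PREFIX))

    s3.put_object(Bucket=dst_bucket, Key=key, Body=json.dumps(rewritten).encode("utf-8"))
```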
Now, some customers take an alternate approach, where they have a source table and then run custom ETL jobs or Spark jobs so that every logical snapshot written on the source gets replicated on the destination. With this approach, there are three key considerations that customers need to think of. The first one is that your jobs need to be aware of and track the state of the snapshots or commits that have already been replicated from the source. For example, if you run this hourly and you're caught up to snapshot 101, you want to make sure that the next time you run, you are looking at all the snapshots on top of it.
Second, you need to think through how you handle error scenarios, for example when some objects were successfully replicated while others are still in flight. And finally, these jobs need to be really smart about the Iceberg spec: if you have a schema evolution on the source, or you applied some sort of hidden partitioning to the source tables, you want to make sure that all of those updates are also getting accurately replicated on the other end.
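As a rough illustration of what such a job has to track, here is a sketch that reads Iceberg's snapshots metadata table through Spark and remembers how far the previous run got. The catalog name, table name, and checkpoint location are assumptions for illustration, and the actual copy-and-commit step is left as a placeholder.

```python
import json
from pathlib import Path

from pyspark.sql import SparkSession

# Hypothetical catalog, table, and checkpoint location, for illustration only.
TABLE = "glue_catalog.sales.transactions"
CHECKPOINT = Path("/tmp/last_replicated_snapshot.json")

spark = SparkSession.builder.appName("iceberg-incremental-replication").getOrCreate()

# Remember how far the previous run got.
last_ts = "1970-01-01 00:00:00"
if CHECKPOINT.exists():
    last_ts = json.loads(CHECKPOINT.read_text())["committed_at"]

# Iceberg exposes commit history through the <table>.snapshots metadata table.
pending = spark.sql(
    f"SELECT committed_at, snapshot_id, parent_id, operation "
    f"FROM {TABLE}.snapshots "
    f"WHERE committed_at > TIMESTAMP '{last_ts}' "
    f"ORDER BY committed_at"
).collect()

for snap in pending:
    # Placeholder: copy this snapshot's data and metadata to the destination table,
    # committing in the same order the snapshots appeared on the source.
    print(f"replicating snapshot {snap.snapshot_id} committed at {snap.committed_at}")
    last_ts = str(snap.committed_at)

CHECKPOINT.write_text(json.dumps({"committed_at": last_ts}))
```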
And the final one is where customers have a single application doing dual writes on both these tables. Now, the obvious design challenge here is that you're putting your replication infrastructure on the critical path of your applications. And this has its own implications to think through, right? Imagine that you need to get an acknowledgement from your writer. Do you do that when your source is committed, or do you do that when both your source and replicas are committed successfully? You also need to think through latency, you need to think through performance. And finally, a lot of Iceberg writers may be used to writing to only one region and one catalog at a time. So you need to make sure that all your applications are wired to write to both these regions at the same time.
Now, all of this is not just theoretical. In fact, we had a customer who successfully built this, where they were replicating petabyte-sized data lakes from region A to region B. And you can see how complicated this can get, right? For example, they used S3 replication to move the objects from region A to region B. Then on top of it you have a DynamoDB global table that's tracking the state of these replicas as you move them forward. And there's a bunch of Lambda functions, CloudTrail, S3 Inventory, and other stuff that goes on under the hood to make this successful. So, while they were able to get this working, we hear from multiple customers that they want a simple and easy way to replicate tables from region A to region B.
Announcing Replication Support for Amazon S3 Tables
So with that, I just wanted to make this my mic drop moment where we want to talk about this announcement that was made yesterday, which is replication support for Amazon S3 Tables. Last year, we announced S3 Tables as the easiest way for you to get started with Apache Iceberg in the cloud. And this year, we kind of take it a step further, where now you can have multiple copies of your data lake within a few clicks.
This is not just simple object replication with Iceberg bolted on top of it. As we go through my talk, the demos, and how all of this works, you'll see that we built this from the ground up, where we've really taken into account how Iceberg works. Working backwards from your requirements, we built this replication feature in the first place.
Let's talk about this new concept of a read-only replica. Every time you set up replication on either a bucket where we want to replicate all your tables, or you selectively choose tables that you need to replicate, the service creates something called a read-only replica. It has four key characteristics that I wanted to highlight.
First, these replicas have the same namespace and table name as the source. What that really means is you can easily point your applications and queries from the source bucket to the replica bucket, and it's super simple for you to reuse all your queries. Second, we take care of replicating everything the moment you tell us that you need to replicate a particular table. What I mean by that is we backfill all the current snapshots that are in your Iceberg table, and we also make sure that we take care of any ongoing updates that are happening on your table once you turn on replication.
Third, we understand Iceberg data and metadata natively, and we understand each part of the Iceberg spec. What that means is when you evolve a schema or when you have any other changes to your metadata as well, all of that gets propagated in the right way to your replicas. Finally, all of these read replicas are always query ready. What that means is you don't need to worry about how to point your applications or whether you need to have different kinds of RTO or RPO requirements. As soon as you point these applications to the replicas themselves, they are query ready.
Key Features: Simplified Operations, Scale, and Purpose-Built for Apache Iceberg
Now let's talk a little bit more about what this feature really gets you. With native replication support out of the box for S3 Tables, you can simplify your operations, you get the scale and flexibility of Amazon S3, and as I mentioned, this is purpose-built for Apache Iceberg. Now let's look at each of these in depth.
The first one is around just simplifying your operations and how you can get rid of all the heavy lifting that you need to do to make your data lakes multi-region. As I mentioned, within a few clicks and within a few configuration files, you can replicate all your tables within a bucket. Alternatively, you can say you have some production tables or some critical tables, and you just need to replicate those, and we'll take care of that as well.
Second, you get out-of-the-box auditing and monitoring. For any tables that you replicate, any time replication copies an object from a source table to a replica table, you get a CloudTrail log. Additionally, we also provide a status API, which essentially tells you how far along the replica is from the source and what changes are in flight. It also gives you the latest metadata file on the replica that you can cross-validate with the source itself.
Finally, all of this comes with out-of-the-box integration with native AWS analytics services. It's still the one-click integration with the Glue Data Catalog and Lake Formation and the Unified Studio on top of it. That makes it super simple for you to just spin up your data lakes in another region, have applications on top of it, and get them running.
Next, I want to highlight a very interesting stat, which is that today, Amazon S3 replicates more than 150 petabytes of data every week across regions. The reason why it's interesting is that the whole S3 Tables replication support is built on top of this. That means we are ready for your data lakes of any size and kind. You can bring in petabyte-sized data lakes, you can have tens of thousands of tables, and we will replicate them out of the box.
Now, the more interesting part of this is not just the scale, but also the flexibility that you get out of it. I wanted to highlight three ways in which you can make your replicas really customized and useful for your individual needs. First one is storage class. You can have your primary tables be in standard storage class if that's what's needed, but in case you want your replicas to have more cost benefits, you can have them in Intelligent-Tiering storage class. In fact, we just announced Intelligent-Tiering for tables as of yesterday itself.
Second, you can have separate and independent retention policies for your replica tables.
For example, you could have your source table be retained for seven days with a number of snapshots, while your compliance backups could be retained for ninety days, and your archival replicas could be retained for up to seven years. Snapshot expiration works independently for each of these replicas as you configure them. Finally, you can choose your own encryption key that's separate from the source table so that you have an improved security posture. All of these features help you create a replica that's tailored to your specific needs.
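One way to express such an independent retention policy is through the per-table maintenance configuration that S3 Tables already exposes, applied on the replica side. The sketch below assumes that pattern holds for replicas as described in the talk; the bucket ARN, namespace, table name, and retention values are placeholders, and the exact parameter shapes should be verified against the current SDK documentation.

```python
import boto3  # assumes AWS credentials for the replica's account and region

s3tables = boto3.client("s3tables", region_name="us-west-2")

# Placeholder ARN, namespace, and table name; substitute the replica created by replication.
REPLICA_BUCKET_ARN = "arn:aws:s3tables:us-west-2:111122223333:bucket/compliance-replica"

# Keep snapshots on this replica for roughly 90 days, independent of the source's policy.
s3tables.put_table_maintenance_configuration(
    tableBucketARN=REPLICA_BUCKET_ARN,
    namespace="sales",
    name="transactions",
    type="icebergSnapshotManagement",
    value={
        "status": "enabled",
        "settings": {
            "icebergSnapshotManagement": {
                "minSnapshotsToKeep": 1,
                "maxSnapshotAgeHours": 24 * 90,
            }
        },
    },
)
```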
Now let's talk a little bit more about how this is purpose-built for Apache Iceberg. The first thing, as I mentioned, is that this is not asynchronous replication. What we really do here is understand the sequence in which your commits or your tables were updated in the source, and then we apply the same sequence as we replicate to the destination. What that means is snapshot five will always appear after snapshots one through four are replicated. Second, we make sure that when we are replicating the metadata files, all the parts that refer to the data files themselves are transformed to point to the replica bucket. This is a huge simplifier because we know today customers write a lot of custom code to make those transformations as they move the metadata files.
Finally, a very important part of this is the way we want you to think about replication support. This is not just a blind copy of your metadata files. In fact, for each table update that you make, we understand what that update really means and then merge it safely on your replica. That is how we're able to keep longer-lived replicas, which can be retained for seven years or longer, while your primary is retained for just, let's say, seven days. This is because every time you write to your source table, we understand what that delta really means in Iceberg parlance. Lastly, we replicate both Iceberg V2 tables, which are most commonly used right now, and V3 tables, support for which was just launched by S3 Tables a couple of weeks back. We support both of these versions out of the box.
Use Cases: Global Performance, Data Protection, and Compliance
With that, I want to briefly touch upon some of the use cases and where you would use these capabilities. Let's look at a very common scenario. Let's say you are a financial services company, and your data lake is in US West One, and this is where all of your transaction data lands. Now, let's say a few years down the line you've started to scale globally. So now you have teams in London, Mumbai, and Cape Town, and all of them require access to the same data. They want to run training models and they want to do various kinds of analysis. What you can do here is really fan out your applications where you can have all of your data collected in one region, and you can replicate them to one or more geographies of your choice.
Now, the inverse of this is also an interesting use case that we discovered along the way. Imagine you are a supply chain company and you have different components being manufactured in different areas of the world, and now you want a centralized platform because Seattle is where all of your data scientists live. You want all the data to be aggregated towards them, so what you can do is generate the data closer to where your supply chain factories are located and replicate all of it to Seattle itself. This is about how you can get better performance, lower latency, and better costs for your overall multi-region workloads.
Let me spend a few minutes talking about data protection. Let's say we have a primary and you have a replica. What you do is commit a new update to the primary. As expected, we do replicate it to the other end as well. Now let's say the next command that came in was to expire your snapshots S0 and S1. In this case, the replication service does not take action on that. That is because we know that while you expired your snapshots on the source, the replica will have its own independent snapshot expiration policy.
Now how is this helpful? Imagine you run into a scenario where you now need your old data back and you had accidentally run that expire snapshot command. In that case, what you can do is easily have all of your data being read through the replica, and that will help you get back to the state that you were in before. Now moving on to the next use case, as I mentioned previously, you can think of these as multiple replicas serving different use cases.
For example, while your primary table keeps a short retention and needs to be performant, you can have your operational replica retained slightly longer, and you can have your archival tables retained for as long as you need them to be. Finally, for all of us who need to meet those compliance requirements, you can imagine these tables being isolated in two different regions. We also do cross-account replication, so these could be two different accounts altogether and encrypted completely differently.
Demo Part 1: Configuring Table Replication and Monitoring Ongoing Updates
So with that, I'll pass it on to Nikos to show us some demos on how this really works. Thank you Aritra. I'm Nikos, and today we're going to see together something really interesting: how to replicate S3 tables across regions and across accounts. So we're going to look into three specific demos. The first one is how to configure table replication and how to backfill your existing data from the source to the replica. The second one is about replicating ongoing updates, new records that land in your source table. We're going to look here also into how to monitor replication with the API that Aritra just mentioned. The third one is the most exciting in my opinion, which is a real world scenario about centralizing data for analytics, and we're going to look into some interesting queries.
After the demos, we're going to take a look into some interesting challenges that we faced while designing this new S3 Tables feature. So let's dive into the first demo. Our case study here is that Aritra and I are building a new startup for e-commerce, so we have a mobile app which is storing transactions into our conveniently named transactions history table. This history table is in the source table bucket because that will be the source of our replication. Now our use case is that Aritra wants this data to be closer to his AWS region where he's operating to generate some analytics.
So let's take a look at what data exactly we have in this transactions table. We select the count in Amazon Athena. And we have exactly 100 transactions in the table. Let's see how many Iceberg snapshots we have in the table. Ten. So it's very convenient: we have 100 rows, 100 records across 10 snapshots.
So now, we're going to replicate that, so let's take a look at how to replicate it. We go back to the source table, which is transactions, click on it and go to the new Management tab. Here we can define a table replication configuration to replicate exactly that one table. We'll click on Create table replication configuration, and here we'll specify the parameters. The first one is what is the destination table bucket. We want to choose Aritra's table bucket, which is destination, conveniently. We'll submit this choice and then we'll choose an IAM role that replication will use to replicate our data. This role contains permissions for the source and the destination table bucket. I have already created the role, so I'm going to select it here, and then I'm going to submit the replication configuration. So basically two parameters to set it up: the destination and the role.
So now we have already created it and we see an updated Management tab, which contains the parameters that we have just set. And if we press the refresh button, it will show us the pending status. Pending status means that there is an ongoing replication happening right now. What exactly does that entail? The first one is that the replica table will be created in the destination table bucket we specified. The second one is that as soon as that replica is created, the data that exists in the source table will be replicated to the destination.
So then you will ask what exact data is replicated. It is the current active commit, which contains all the active snapshots. So we maintain query and time travel capabilities from the source to the replica. Now it's been a few minutes, so let's go and press the refresh button again. And we'll see that the pending status will have changed to completed. As soon as we click on the completed status, we will see a pop-up which shows a timestamp. That timestamp represents exactly the moment when the replica got in sync with the source, when the replication was completed and all our data is present in the replica.
So now all the schema, all Iceberg files, all data is in the replica. Let's go and see it. So I showed you before Amazon Athena, and I really like it personally because it has the same name as my hometown. However, Aritra really likes terminals. He's an engineer at heart, I think, even as a product manager. So let's go and connect in DuckDB to his destination table bucket.
We will attach the one called destination. Let's see how many snapshots we have transferred and replicated to this table bucket. Specifically, the transactions table has 10, exactly as many as in the source. Let's see how many records we have. You remember we had 100 in the source. Perfect, we have a perfect match. We have 100 records across 10 snapshots.
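For reference, the DuckDB side of this demo boils down to a few statements. Below is a sketch via DuckDB's Python API; it assumes the iceberg extension's S3 Tables catalog support, and the bucket ARN and namespace are placeholders rather than the demo's actual names.

```python
import duckdb

con = duckdb.connect()

# The iceberg extension provides the S3 Tables catalog integration; credentials
# come from the default AWS credential chain.
con.execute("INSTALL aws; LOAD aws;")
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain);")

# Placeholder ARN for the replica (destination) table bucket.
con.execute("""
    ATTACH 'arn:aws:s3tables:us-west-2:111122223333:bucket/destination'
        AS destination (TYPE iceberg, ENDPOINT_TYPE s3_tables);
""")

# The replica keeps the source's namespace and table name, so queries carry over as-is
# (the namespace "sales" here is a placeholder).
print(con.execute("SELECT count(*) FROM destination.sales.transactions").fetchone())
```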
So we have the replica fully in sync with the source, and we can continue running our queries against the replica now. Now let's take a deeper look into these snapshots. On the top we have the four last snapshots of the source table, and on the bottom of the replica. Take a look at the first three columns: sequence number, snapshot ID, and timestamp. Those are exactly the same between the source and the replica because replication maintains exactly the same values. But now take a look into the manifest list location. You will see that in the source we have a specific S3 storage location, while in the replica we have a different one. That is because the storage locations are related to the context, which is the source or the replica.
The takeaway here, the super interesting takeaway, is that replication will do this for you. It will rewrite all file paths inside Iceberg files, so in manifest lists, manifest metadata files, and so on, automatically for you, and you need to take no action at all. So we see that all the manifest lists here are updated already by replication. Now let's go and see some other case. Our application is being used by our customers, and now they are actually creating more updates. Let's see how to replicate those updates.
We'll simulate exactly that scenario. We'll insert one record in Amazon Athena in the source table. So let's go and do it. We'll insert a guide to using S3 Tables replication. Why not? As you will see, this will be automatically replicated. We have already set up replication. It's active, which means that any new record, any new snapshot, will be automatically replicated. Let's see how many transactions we have after inserting this one. One more, 101. And how many snapshots? 11. We have one new snapshot, one new transaction, one new record in the table transactions.
So now let's check the status of this replication. We have taken no action. We're just taking a look at the status with the new API, so get table replication status. We're going to come back to this API, so for now we just want to see the status, which is completed. Perfect. This status is exactly the same that we can see also from the Management tab of the AWS console. You will see here that when we refresh, we see again completed. But if we take a deeper look, we'll see a different timestamp. That is the timestamp when the new snapshot was replicated to the destination.
As a matter of fact, the status would flip to pending and then again completed every time you have new data in the source. So then you can query your replication progress in real time. Now on to the destination. We had 10 snapshots if you remember from before. Let's see that we got the new one. And there it is, it's now 11. So we have now exactly the same data in the source and destination. We can continue again running our analytics. Aritra is happy, I think. So let's move on to another topic.
Demo Part 2: Centralizing Multi-Region Data for Analytics
We discussed this get table replication status API, so let's take a deeper look here. We said that this is equivalent to using the console, but it gives us some more data, and here's why that is important. We ask for the table replication status at the level of the source table, so you see that the parameter and the result here include the source table ARN. For every table, we can have more than one destination. So here we get a list of all the destinations with data for each of the destinations.
It can be that the status that we see here is completed for one destination but in progress or pending for another one. So we'll get the full information there. We also see the information about the configuration that we set, what is the table that we're replicating to and what is the table bucket, and also what is the last update, which if you remember, that was the timestamp that we saw on the console. Here we see additionally what is exactly the metadata location of the last metadata file.
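As a hedged sketch only: polling this status from code might look roughly like the following, assuming the SDK exposes an operation along the lines of what the talk calls get table replication status. The operation name, parameter, and response fields below simply mirror the description above and are not taken from released documentation, so verify them against the actual SDK before use.

```python
import boto3

# NOTE: the operation and field names below are assumptions that mirror how the
# API is described in the talk; check the released SDK documentation for the
# exact names and shapes.
s3tables = boto3.client("s3tables", region_name="us-east-1")

SOURCE_TABLE_ARN = (
    "arn:aws:s3tables:us-east-1:111122223333:bucket/source/table/example-table-id"
)  # placeholder ARN

response = s3tables.get_table_replication_status(tableARN=SOURCE_TABLE_ARN)

# One status entry per configured destination: a single table can replicate to
# several destinations, and each can be in a different state.
for destination in response.get("destinations", []):
    print(
        destination.get("destinationTableBucketARN"),
        destination.get("status"),            # e.g. PENDING or COMPLETED
        destination.get("lastUpdate"),        # the timestamp shown in the console pop-up
        destination.get("metadataLocation"),  # latest metadata file on the replica
    )
```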
Okay, so let's go ahead and see the most interesting use case, in my opinion. Fast forward a few months, our startup is expanding. So now we have customers across three continents: Europe, Asia, and Australia. The interesting part of this is that we store the data in three different AWS regions.
Aritra wants to perform some analytics and run queries on a combined dataset. So how are we going to combine it? We're suggesting here to combine it by replicating all the data into the region closest to Aritra. We're going to replicate it to one region in North America, for example, US East, and let's see exactly how we're going to do that.
So this is the data that we're working with. We have three table buckets: ecommerce Asia Pacific, ecommerce Australia, and ecommerce Europe. Now let's take a look at exactly what we have in these table buckets. We're going to open the first one and see what tables we have. So we have orders and order items in a parent-child relationship, a very typical setup.
We don't want to set up replication for each one of them again because we don't want to run this configuration twice. So what we do this time is go to the management tab of the table bucket. This will configure a replication for all the tables in this table bucket, so let's go ahead and do it. As you can see, the configuration is exactly the same. You choose the destination table bucket, which in this case is in another region, and it's called ecommerce analytics.
We submit that and then of course choose the replication role that we have already created. Again, we will create a replication configuration. And now we're back to the management screen where we see the basic information for that. So now I've done exactly the same process off the screen for the other remaining two table buckets, so we're not going to spend time on that, but we're going to spend time to see what exactly we have on the replica.
So now we have replicated all the data from three continents into one specific table bucket. Let's go back to Amazon Athena and see what data we have there, but first you may ask me, how do we differentiate between orders and order items of each of the three sources? Don't they conflict when they get merged into the same table bucket? The answer is no, and why is that? Because we use S3 Tables namespaces to differentiate the source from which the data is coming, and we're going to see this exactly in action now.
So let's take a look into this database dropdown. It has three values which are the namespaces that we defined. We can use exactly those namespaces as identifiers in the query in order to define what data we want to actually pull and generate analytics from. So let's take a look into some interesting queries that we can run here. The first one is showing us our top product categories, where do we perform the best and where should we focus next. And here we see per continent, per region, the categories, revenues, orders, and so on. It's just a typical report that we can run.
The second one is about regions and how exactly they perform. So how did our sales go in Europe, in Asia, Australia, and so on. And here we see sales, we see customers, average orders, order values, and we can actually aggregate all sorts of data. This is just a sample. The last one is about payment methods. So let's say, for example, which payment methods do best, where exactly our customers are leaning towards, and where should we invest next, what exact collaborations we should do.
So here we see bank transfers, mobile payments, amounts per payment method, but basically the point is these are some very basic queries that you can run on top of the tables. And when it comes to analytics, the sky is the limit. So basically you can run whatever you can imagine, but what you need to remember here is that this is powered by S3 Tables replication in order to centralize these datasets and execute those analytics in one specific location.
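Because each region's data lands under its own namespace in the central bucket, a combined report is essentially a UNION across namespaces. Here is a sketch of one such query through DuckDB's Python API, attached as in the earlier snippet; the namespaces, table, and column names are placeholders rather than the demo's actual schema.

```python
import duckdb

con = duckdb.connect()
for ext in ("aws", "httpfs", "iceberg"):
    con.execute(f"INSTALL {ext}; LOAD {ext};")
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain);")

# Placeholder ARN for the central analytics table bucket.
con.execute("""
    ATTACH 'arn:aws:s3tables:us-east-1:111122223333:bucket/ecommerce-analytics'
        AS analytics (TYPE iceberg, ENDPOINT_TYPE s3_tables);
""")

# Placeholder namespaces and columns: each replicated region lives in its own
# namespace, so a per-region report is a UNION ALL tagged with a region label.
rows = con.execute("""
    SELECT region, payment_method, SUM(amount) AS total_amount, COUNT(*) AS orders
    FROM (
        SELECT 'europe'    AS region, payment_method, amount FROM analytics.ecommerce_europe.orders
        UNION ALL
        SELECT 'asia'      AS region, payment_method, amount FROM analytics.ecommerce_apac.orders
        UNION ALL
        SELECT 'australia' AS region, payment_method, amount FROM analytics.ecommerce_australia.orders
    ) AS regional_orders
    GROUP BY region, payment_method
    ORDER BY region, total_amount DESC
""").fetchall()

for row in rows:
    print(row)
```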
Technical Deep Dive: Solving Snapshot Ordering and Concurrency Challenges
So now let's move on to some interesting challenges that we faced. This is slightly more technical but not too much. We'll try to keep it a bit high level for everyone. So the first one is about snapshot ordering. You may know that Iceberg snapshots are not independent. So snapshot two has to come after snapshot one because its parent is snapshot one, and snapshot three has to come after snapshot two. In traditional data terms, imagine that snapshot one is creating a record, snapshot two is updating that record, and snapshot three is deleting that record. So you can never replicate, for example, snapshot three, the deletion, before the update or the creation. That doesn't make sense.
If you do that, you will actually corrupt the replica. You will not be able to execute queries or even time travel. So the question is, how do we solve that in S3 Tables replication to replicate everything in order? So we base everything around sequence numbers.
Every snapshot has its own unique sequence number, and sequence numbers are monotonically increasing, so we validate them. We never commit a snapshot before its turn, before it is in order to be committed. We validate and enforce the parent-child relationships for snapshots, and we make sure commits are applied in chronological order. And last, the bonus one: we coordinate concurrent writes. This is a slightly more interesting one because it is the topic of the next slide.
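Before moving on to concurrency, here is a simplified illustration of that ordering rule (a toy model, not the actual service implementation): snapshots may arrive out of order, but nothing is committed to the replica until its sequence number is next in line and its parent is already present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    sequence_number: int


class ReplicaCommitter:
    """Toy model of in-order commits: buffers out-of-order snapshots until their turn."""

    def __init__(self) -> None:
        self.committed: dict[int, Snapshot] = {}   # snapshot_id -> snapshot
        self.pending: dict[int, Snapshot] = {}     # sequence_number -> snapshot
        self.next_sequence = 1

    def offer(self, snap: Snapshot) -> None:
        self.pending[snap.sequence_number] = snap
        # Drain everything that is now in order.
        while self.next_sequence in self.pending:
            candidate = self.pending.pop(self.next_sequence)
            if candidate.parent_id is not None and candidate.parent_id not in self.committed:
                raise RuntimeError("parent snapshot missing; committing would corrupt the replica")
            self.committed[candidate.snapshot_id] = candidate
            self.next_sequence += 1


# Snapshots can arrive out of order; commits still happen 1, 2, 3.
committer = ReplicaCommitter()
committer.offer(Snapshot(snapshot_id=302, parent_id=301, sequence_number=2))
committer.offer(Snapshot(snapshot_id=301, parent_id=None, sequence_number=1))
committer.offer(Snapshot(snapshot_id=303, parent_id=302, sequence_number=3))
print(sorted(s.sequence_number for s in committer.committed.values()))  # [1, 2, 3]
```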
So let's look into concurrency. What we do, like Aritra mentioned, is that we don't just blindly copy the metadata from the source to the destination. We cannot do that because metadata has fields and information that are context specific, either replica specific or source specific. Let's take it with an example. You want to replicate a source table to a destination, to a replica. Replication is not the only actor acting on these tables. You may also have your retention policies executed by S3 Tables maintenance.
And let's take an even more concrete example to look into this. So you have a source table. When you create snapshot one, it will be replicated to snapshot one in the destination. That is almost a direct copy if you don't think about paths. Then you may have snapshot two, also replicated. But then, if you expire snapshot one, then source and replica have diverged. So you have to be very careful and very context specific and aware when you make any sort of operations for replication.
So when you then replicate snapshot three, the replication process has to assess the state of the source, assess the state of the destination, and make sure that it does not remove snapshot one just because the incoming metadata does not have it. That is why we actually merge: we assess the state of both and make very informed decisions.
So now what exactly do we do? Let's take a look. We classify the metadata fields, all the information, into three categories. The first one is historical data. Historical data is, first, preserved in the replica and, second, specific to the context, meaning that the snapshots, snapshot log, metadata log, and schemas that we retrieve from the source have to be evaluated against those at the destination before we actually commit to the destination. The second category is the fields that are synchronized from the source, which are basically transferred directly from the source, and those pertain to the active state of your table: for example, what is the current schema, what is the current snapshot, and so on.
The last category is the fields that are modified at the replica level because they only relate to the replica. Those are the file paths or storage locations that we have seen in many instances so far, all of which we update so that they are valid for each of the destinations. If you have a source and five destinations, for example, each one will have its own file paths rewritten. The final such field, of course, is the catalog, which in the destination may be a different one.
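As a toy model of this three-way classification (not the service's real metadata handling), a merge might look like the following: historical fields are unioned with the replica's existing state, active-state fields are taken from the source, and replica-local fields such as storage locations are rewritten.

```python
import json

# Simplified illustration of the merge described above; the real Iceberg metadata
# model has many more fields, and the service's actual logic is not shown here.


def _key(entry: dict) -> str:
    # Toy de-duplication key; real entries would be keyed by their proper identifiers.
    return json.dumps(entry, sort_keys=True)


def merge_metadata(source: dict, replica: dict, src_prefix: str, dst_prefix: str) -> dict:
    merged = dict(replica)

    # 1) Historical, context-specific fields: union with what the replica already has,
    #    so a snapshot expired on the source is not dropped from a longer-lived replica.
    for field in ("snapshots", "snapshot-log", "metadata-log", "schemas"):
        seen = {_key(e): e for e in replica.get(field, [])}
        for entry in source.get(field, []):
            seen.setdefault(_key(entry), entry)
        merged[field] = list(seen.values())

    # 2) Active-state fields: synchronized directly from the source.
    for field in ("current-snapshot-id", "current-schema-id"):
        if field in source:
            merged[field] = source[field]

    # 3) Replica-local fields: rewritten so they point at the replica's own storage.
    merged["location"] = source.get("location", "").replace(src_prefix, dst_prefix)
    return merged
```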
Customer Spotlight: Zeta Global's AI-Powered Marketing Platform on S3 Tables
So that concludes our interesting challenges section and next I'll hand it off to Soumil to take a look into how exactly they're using S3 Tables. Thank you so much, Nikos. Hi, everyone. Let me introduce myself to the audience. I'm Soumil Nitin Shah, a lead software engineer at Zeta Global.
Let me tell you more about our company. Zeta Global operates an AI-powered marketing platform that helps large enterprises acquire, retain, and grow customers. The company's mission is to make sophisticated marketing simpler and more effective by leveraging artificial intelligence and large proprietary consumer data sets to generate a high return on investment for its clients.
Now let me talk about the AI-powered marketing platform. The Zeta Marketing Platform makes marketing simpler and more effective by bringing everything into one place, from knowing who your customers are to understanding their behavior, to taking action. It's built on three core pillars. The first pillar is identity. We operate at a scale of 245 million US consumer profiles, which lets us deliver highly precise identity resolution.
Next is the intelligence. We ingest over one trillion consumer signals on behavior, location, purchases, websites, social activity, and much more. This guides our AI to find the best marketing strategies for you. And last, the activation. We enable a true one-to-one engagement across all channels, helping marketers measure, optimize and improve results.
Zeta is the only platform that unifies everything into one single place. Zeta Global today has about 10 petabytes of data across the data lakes and data warehouse. We ingest close to about 6 terabytes of data every single day, and this data is stored into 10,000 Iceberg tables supporting multi-tenant workloads. Our dataset is growing by 40% year over year, increasing the pressure on our current pipelines.
Traditional pipelines could not keep up with the combination of high ingestion volume and the need to deliver insights in under 10 minutes, which is why Zeta Global has adopted Amazon S3 Tables. This gives us predictable performance on extremely large Iceberg workloads, improves data freshness by up to 80%, and enables faster and more reliable analytics without rearchitecting the entire platform. Now, let's see how Zeta Global manages multi-tenant data ingestion into S3 Tables.
Every month we handle close to about 185 terabytes of data. That translates to about 250 gigabytes every single hour, all of this into 10,000 Iceberg tables in under 10 minutes. So let's take a look at the solution. We have a people service that manages user profiles and uses Aerospike as an OLTP database. This service processes 50,000 writes per second, and all of this generates a large amount of CDC events.
These events are captured in a Kafka topic, and then a Kafka consumer reads the messages and dumps them into an S3 bucket partitioned by tenant ID. Next, a Step Functions workflow runs every single minute to check if there are any new data files. If there are new data files, the system acquires a lock to ensure other systems cannot process the same data simultaneously. The workflow then reads the data in a distributed fashion and submits jobs to EMR as shown in the diagram.
Let's take a deeper look at the orchestration and how we orchestrate these jobs. The first step is the lock acquisition. Before initiating any ingestion, the system will first acquire a lock. Second, we partition the data. After acquiring the lock, the system partitions the tenant data into different buckets, as you can see. Next, we submit the Spark job to the EMR cluster, and we use an async callback pattern.
Once the job is completed, it responds back to Step Functions (a generic sketch of this callback follows below). Any failures captured by Step Functions will trigger up to two retries, and then it will clean up any orphan files that are left. Here is how we use the S3 Tables replication feature. We operate across the US, Europe, and Asia, but our primary Iceberg tables were centralized in one region.
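Coming back to the orchestration for a moment: the async callback pattern mentioned above is commonly implemented with Step Functions task tokens, where the workflow pauses at a .waitForTaskToken task, hands the token to the Spark job, and the job reports success or failure back. The sketch below shows a generic job-side callback; Zeta's exact implementation is not shown in the talk, and the way the token reaches the job is an assumption.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")


def report_ingestion_result(task_token: str, succeeded: bool, detail: dict) -> None:
    """Called at the end of the Spark job; resumes the paused Step Functions state.

    The state machine would invoke the job through a .waitForTaskToken integration
    and hand the token to the job (for example via a job argument); that wiring is
    assumed here, not taken from the talk.
    """
    if succeeded:
        sfn.send_task_success(taskToken=task_token, output=json.dumps(detail))
    else:
        # Step Functions can then retry the ingestion (the talk mentions two retries)
        # and run the orphan-file cleanup step.
        sfn.send_task_failure(
            taskToken=task_token,
            error="IngestionFailed",
            cause=json.dumps(detail),
        )
```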
Conclusion and Next Steps for Getting Started
With S3 Tables replication, we now maintain a region-local replica of our core dataset. A region-local replica allows our customers to be served from the nearest region, reducing the network latency. Now, handing it over to Aritra. Hey, thanks, Soumil. All right, so just wanted to quickly recap all that we covered today. I think we still have some time for Q&A after this as well if you're interested to learn more, but the key takeaway is here, right?
Essentially, as I mentioned, last year we offered S3 Tables, which is a very simple and fully managed way for you to get started with Iceberg, and we expanded that with replication support where you can do same-region, cross-region, same-account, and cross-account replication. You can create copies of your data, replicas of your data, for various purposes. It's very simple to set up, as Nikos showed, and it's easy to audit and monitor, and there are some best practices around how you would want to think about these replicas: how you want to encrypt them, what storage class to use, and how long you want to retain them.
And finally, I really encourage all of you to take a look at our console, take a look at all of the various Iceberg applications out there, be it DuckDB, be it Spark, Redshift, and just give this a spin. Additionally, I just wanted to highlight a few more sessions that we have as we wind up re:Invent. So tomorrow one of my teammates Adi is going to present how S3 Tables work, some of the best practices, so that can be an interesting session as well.
And then we have a chalk talk where some of our other engineers and team are going to dive deeper into the overall S3 Tables plus SageMaker architecture. So with that, I really thank you for joining us today, and I hope this was helpful and yeah, excited to see all of you try out the feature. Thank you. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.