🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - How Samsung optimized 1.2 PB on Amazon DynamoDB with zero downtime (DAT323)
In this video, Manseok Kim from Samsung Cloud shares how his team migrated a 1.2 petabyte DynamoDB table while reducing storage costs by 50% and achieving ROI within 3 months. Serving 1 billion monthly active devices with 50 billion daily requests, they discovered 90% of their data were tombstones, with 60% exceeding the 6-month retention period. By analyzing propagation data showing 99.9% of devices synced within one week and leveraging the 99-tab limit policy, they reduced the retention window from 6 months to 1 week. Using a per-user migration strategy with Migration Status Table, ElastiCache, flow regulator, and LSI-based filtering, they completed the migration in one week with zero customer complaints, cutting the table from 1.2 petabytes to 100 terabytes. The success stemmed from data-driven decisions, deep domain understanding, and unwavering user-centric principles.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Samsung Cloud's 1.2 Petabyte Challenge: Identifying the Problem and Evaluating Solutions
Hello everyone. It's great to be here at AWS re:Invent. I'm Manseok Kim, a Principal Software Engineer from Samsung Cloud. Today, I'm going to share a great story about how my team solved a massive technical challenge on our platform.
First, let me give a quick introduction to Samsung Cloud. Our platform is key for billions of Galaxy users around the world. We provide core cloud services like Sync, Backup, and Restore for popular Samsung apps such as Samsung Notes, Samsung Health, and Samsung Internet. Our mission is simple. We make sure all your data is consistent and available across all your devices for a smooth experience.
Samsung Cloud operates at an enormous scale. We handle over 1 billion monthly active devices. This drives 50 billion requests every single day. Facing this massive traffic, we chose Amazon DynamoDB as our primary database.
We use Amazon DynamoDB to store our core synchronization data for applications like Samsung Internet. Let's zoom in on a specific data set: open tab data. Tab information is constantly generated and updated as users open, close, or move between tabs. As our service grew, the continuous creation and update of tab data caused our data volume to grow rapidly.
The table size kept getting bigger, and by early 2025 it had reached 1.2 petabytes. This massive table was consuming hundreds of thousands of dollars in AWS storage costs every month. Our mission was twofold: first, cut our storage cost by 50%; second, recover our entire optimization investment within 3 months. This was our critical ROI target. Achieving this with a 1.2 petabyte table was a huge challenge.
With trillions of records, just analyzing the data was a monumental task. We couldn't analyze all of it, so we decided to take a sample of records to understand their characteristics. The result was surprising: 90% of the data in our table consisted of tab deletion records, or tombstones. When the user closes a tab, the system doesn't immediately delete that data. Instead, it changes the record's status to deleted and keeps it in the table for 6 months. This mechanism is important so that inactive devices can receive the full sync history when they reconnect.
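To make the tombstone pattern concrete, here is a minimal sketch of what such records could look like. The field names are assumptions for illustration, not Samsung's actual schema.

```python
# Hypothetical item shapes illustrating the tombstone pattern described above.
# Field names (user_id, tab_id, status, timestamp) are assumptions, not the real schema.

active_tab = {
    "user_id": "user-123",          # partition key
    "tab_id": "tab-42",             # sort key
    "status": "ACTIVE",
    "url": "https://example.com",
    "timestamp": 1735689600000,     # last modification time (ms)
}

# Closing the tab does not remove the item; it becomes a tombstone
# that other devices read to learn the tab was deleted.
tombstone = {
    "user_id": "user-123",
    "tab_id": "tab-42",
    "status": "DELETED",
    "timestamp": 1736294400000,     # deletion time; retained for the window
}
```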
The tombstone mechanism itself wasn't the problem. So we dug deeper and found that 60% of our tombstones were over 6 months old. Focusing on fast development and stability in our early days meant long-term tasks like lifecycle management were pushed back. This was our technical debt. Based on this analysis, we decided to optimize the table by removing all the data over 6 months old.
We considered two main strategies. Strategy A was to clean up the existing table by finding and deleting the old data directly. Strategy B was to create a new table and migrate only the data we wanted to keep. The direct scan-and-delete approach required scanning trillions of records, consuming massive RCU and WCU. This process was projected to take over 6 months, and the total expense would equal several months of storage fees. This massive spend made it impossible to hit our 3-month target, so we immediately ruled out Strategy A.
By migrating only the necessary data, we would be processing a much smaller data set, reducing the data volume to 40% of the original. Migrating 40% of the total data volume was an improvement, but the RCU and WCU still amounted to hundreds of thousands of dollars. We knew we had to do better, so we took a step back from the technical details and refocused on the core purpose of our service.
Rethinking Retention Windows and Building a Zero-Downtime Migration Architecture
As I mentioned earlier, our system stores the deletion records for a 6-month period after they are transmitted. But we started to question this fundamental assumption. We asked ourselves, is this 6-month retention window truly necessary? A deletion record's purpose is fulfilled once the signal is propagated to all of the user's devices. To find a reasonable standard, we measured the time it takes for deletion records to be fully propagated to all of the users' devices. The results of the investigation were unexpected. Fewer than 0.1% of devices needed more than one week for tab deletion history to propagate. The remaining 99.9% completed propagation within one week, and 99% of devices received deletion history in 3 days or less.
Based on the data we just reviewed, we were faced with a pivotal decision: should we shorten our retention window from 6 months to 1 week? The upside was substantial. By moving to one week, the amount of data to be migrated would shrink by 96%, from 24 weeks of history to just 1 week. This was a major reduction in storage and processing expenses. The trade-off, however, involved a specific technical limitation: the roughly 0.1% of devices would lose the ability to perform a delta-only sync. Those users would be forced to fetch the full data set rather than just incremental changes.
The problem caused by long-inactive devices creates a system-level risk. At Samsung Cloud's scale, even 0.1% is millions of devices. If those devices all had to perform a full sync at the same time, a massive traffic surge could overload the system and trigger throttling or downtime. When we hit this technical roadblock, we knew the solution wasn't just technical, so we had to dig into our core service domain. By collaborating with the Samsung Internet team, we learned a critical operational behavior: the 99-tab limit. This policy means no device can ever have more than 99 open tabs. This was our breakthrough. A traffic storm is only dangerous when the load is unlimited and unexpected. The 99-tab limit quantified the risk: the maximum data for any full sync was kept at a small and predictable size, which neutralized our biggest fear.
With the data propagation insights and the 99-tab policy in hand, our path was now crystal clear. We officially decided to change our retention window to one week. This decision meant that the total data volume to be migrated dropped from our previous plan of 40% down to just 10% of the original: a massive reduction from trillions of records to hundreds of billions. With our final data strategy set, our mission was clear: find the 10% of data that met our criteria in the 1.2 petabyte table, migrate it to a new table, and finish quickly to maximize cost effectiveness. But for the Samsung Cloud team, there's something even more important than the mission itself: our users.
So before building anything, we established two principles that would guide every single decision in the migration process. The first principle was simple: user experience first. Our project could not cause any negative experience for the user. The second principle was controlled execution. Every step had to be predictable, observable, and fully controllable. These were not just slogans; they were the absolute standard for every piece of our migration architecture.
Our goal was not system-level zero downtime, but zero downtime as experienced by the user. To achieve this, we adopted a per-user migration strategy. Users who have not been migrated continue to use the old table. Once a user's data migration is completed, all of their requests are redirected to the new table. While a user's data is being transferred, the user is placed in a locked state, and any request that arrives during the lock is temporarily rejected until the migration finishes and the user is switched over. If a user receives this temporary error, the client automatically retries within seconds, ensuring a truly seamless experience.
When the device sends a request, the sync service needs to know which table to use. To solve this routing problem, we introduced the control tower, the Migration Status Table, or MST. The Migration Status Table records each user's migration status. For every sync request, the sync service queries the MST first, determines the user's migration status, and accesses the right table.
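A minimal sketch of this routing and locking logic is shown below, assuming a Migration Status Table keyed by user ID with a status attribute. Table names, field names, and status values here are hypothetical, not the production implementation.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
mst = dynamodb.Table("MigrationStatusTable")   # hypothetical table names
old_table = dynamodb.Table("TabSyncOld")
new_table = dynamodb.Table("TabSyncNew")

class MigrationInProgress(Exception):
    """Temporary rejection; the client retries within seconds."""

def route_table(user_id: str):
    """Pick the table that should serve this user's sync request."""
    item = mst.get_item(Key={"user_id": user_id}).get("Item")
    status = item["status"] if item else "NOT_MIGRATED"
    if status == "MIGRATED":
        return new_table               # user fully switched over
    if status == "LOCKED":             # this user's data is being copied right now
        raise MigrationInProgress(user_id)
    return old_table                   # not yet migrated: keep using the old table
```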
Having seen this architecture, you've rightly identified the key issue. Our flow regulator, which we will discuss next, protects the new table by ramping up the write load steadily and smoothly, but nothing protects the Migration Status Table. It was our single point of failure. To solve this, we applied the AWS constant work principle. We placed an Amazon ElastiCache instance in front of the MST and made the sync service call the cache first. The MST is also pre-warmed, so the underlying DynamoDB table is ready to serve traffic instantly, and we tuned its minimum provisioned capacity to guarantee the table can always absorb the full request volume, even during a cache outage.
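As a rough illustration of the cache-first lookup under the constant work principle, here is a sketch assuming a Redis-compatible ElastiCache endpoint and the same hypothetical MST as in the previous sketch. On a cache miss or a cache outage, the pre-warmed, provisioned DynamoDB table serves the read directly.

```python
import boto3
import redis   # ElastiCache (Redis-compatible) client; endpoint below is an assumption

cache = redis.Redis(host="mst-cache.example.internal", port=6379)
mst = boto3.resource("dynamodb").Table("MigrationStatusTable")

def get_migration_status(user_id: str) -> str:
    # 1. Try the cache first.
    try:
        cached = cache.get(f"mst:{user_id}")
        if cached is not None:
            return cached.decode()
    except redis.RedisError:
        pass  # cache outage: fall through to DynamoDB (constant work principle)

    # 2. The MST is pre-warmed and its provisioned minimum is tuned so it can
    #    serve the full request volume even when every lookup misses the cache.
    item = mst.get_item(Key={"user_id": user_id}).get("Item")
    status = item["status"] if item else "NOT_MIGRATED"

    try:
        cache.set(f"mst:{user_id}", status, ex=300)  # short TTL, purely illustrative
    except redis.RedisError:
        pass
    return status
```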
The goal of any migration is speed, but focusing only on speed hurts stability and control. This is why we adopted the controlled execution principle. We didn't open the floodgates; we managed the flow. We limited the maximum number of users migrating concurrently, much like controlling a dam to prevent a flood downstream. This approach gave us four key advantages: it protects the old table from I/O problems, it ensures the new table scales predictably, it keeps our infrastructure stable, and it secures observability by keeping our data flow thin and constant.
To implement this controlled speed, we created a flow regulator using a combination of a job feeder and a working queue. The job feeder supplies the user IDs to the working queue. The migration worker pulls the jobs from the working queue and executes the data migration. The queue strictly limits the maximum number of users being migrated. If the queue fills, a backpressure mechanism automatically pauses the job feeder. This self-regulating system makes sure our throughput stays exactly at the planned level.
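A simplified, single-process sketch of the job feeder and working queue pattern is shown below. A bounded in-memory queue stands in for what would presumably be a distributed queue in production, and the numbers are illustrative, but the backpressure idea is the same: when the queue is full, the feeder blocks and the flow pauses automatically.

```python
import queue
import threading

MAX_IN_FLIGHT_USERS = 10_000             # illustrative cap, not the real figure
work_queue = queue.Queue(maxsize=MAX_IN_FLIGHT_USERS)

def job_feeder(user_ids):
    """Feed user IDs into the bounded queue. put() blocks while the queue is
    full; that blocking is the backpressure that pauses the feeder."""
    for user_id in user_ids:
        work_queue.put(user_id)

def migration_worker():
    """Pull one user at a time and migrate that user's data."""
    while True:
        user_id = work_queue.get()
        try:
            migrate_user(user_id)         # lock the user, copy data, flip MST status
        finally:
            work_queue.task_done()

def migrate_user(user_id):
    ...                                   # placeholder for the per-user copy

# Workers consume at a controlled rate; the feeder can never race ahead of them.
for _ in range(64):                       # worker count is also illustrative
    threading.Thread(target=migration_worker, daemon=True).start()
```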
Our last technical hurdle was how to find the 10% of necessary data within the 1.2 petabyte table. If we had to read every item just to decide if we needed it, we'd incur massive RCU costs and immediately stretch our project timeline. The breakthrough didn't come from new technology, but by looking at the fundamental mechanism of our existing sync service, timestamp-based synchronization. Our sync mechanism works by tracking the timestamp, the exact time of each modification. This allows the device to only download the changes they missed. Our table contains a Local Secondary Index keyed by timestamp to support this. Crucially, the LSI has status as a projection attribute. This is the key.
This allowed us to filter out the 90% of old data by reading the lightweight index without touching the main table. So instead of a heavy table scan, we ran just a filtered query, treating our migration worker as a new device requesting its first sync.
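For reference, a table shaped like the one described above, with a timestamp-keyed LSI that projects the status attribute, could be declared roughly as follows. All names here are assumptions for illustration; only the shape of the key schema and projection matters.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical schema: partition key user_id, sort key tab_id, and an LSI
# re-sorted by timestamp that projects only the status attribute.
dynamodb.create_table(
    TableName="TabSync",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "tab_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "tab_id", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "timestamp-index",
            "KeySchema": [
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "timestamp", "KeyType": "RANGE"},
            ],
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["status"],
            },
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
```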
Here's how we did data extraction in two steps. First, we used the LSI to find the data to be migrated: a single LSI query returned our two target groups, the active tabs and the tombstone records from the last week. Second, we used the tab IDs from the first step to pull the actual data. Instead of fetching each item one by one, we used the DynamoDB BatchGetItem API. The LSI query was cost-effective and fast, avoiding full table scans, and the batch API gathered all the data efficiently. This combination dramatically lowered our Read Capacity Units.
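Putting the two steps together, here is a sketch using the hypothetical schema above. The LSI query returns only the keys plus the projected status, the selection of active tabs and recent tombstones is done here on those small index entries as one simple way to realize the filter the talk describes, and BatchGetItem then fetches the full records in chunks of up to 100 keys.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
old_table = dynamodb.Table("TabSync")        # hypothetical name from the sketch above

ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000

def extract_user_items(user_id: str):
    cutoff = int(time.time() * 1000) - ONE_WEEK_MS

    # Step 1: lightweight LSI query. The index holds only keys plus the projected
    # status attribute, and the 99-tab limit keeps the per-user result small.
    resp = old_table.query(
        IndexName="timestamp-index",
        KeyConditionExpression=Key("user_id").eq(user_id),
    )
    # Keep active tabs and tombstones from the last week; skip older tombstones.
    targets = [
        item for item in resp["Items"]
        if item["status"] == "ACTIVE" or item["timestamp"] >= cutoff
    ]
    keys = [{"user_id": i["user_id"], "tab_id": i["tab_id"]} for i in targets]

    # Step 2: fetch the full items with BatchGetItem, up to 100 keys per call.
    items = []
    for start in range(0, len(keys), 100):
        chunk = keys[start:start + 100]
        result = dynamodb.batch_get_item(RequestItems={"TabSync": {"Keys": chunk}})
        items.extend(result["Responses"]["TabSync"])
        # Production code would also retry any result.get("UnprocessedKeys").
    return items
```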
Migration Success: Achieving Technical, Financial, and Customer Excellence
The migration was more than just moving the data; it was the ultimate test of our philosophy, and we are proud to say our principles worked. We will break down our success into three key metrics: technical, financial, and most importantly, customer. Our entire migration was completed in just one week. Thanks to our controlled execution principle, the number of concurrently migrating users stayed steady at a few tens of thousands. The WCU on the old table declined steadily, reflecting the gradual decrease in traffic, while the new table's RCU and WCU increased smoothly and steadily with no noticeable spikes. This confirms that our controlled execution approach successfully managed the load.
On the financial side, the impact was instant and huge. We reduced the table size from 1.2 petabytes to just 100 terabytes, drastically cutting our storage bill. Thanks to our strategy, the migration cost was minimal. The total expense was less than one month of savings. We recovered our full investment before the next AWS bill arrived. We achieved 150% of the original goal.
But beyond the technical and financial metrics, there's one that matters most to us: the customer metric. During the entire migration, we received zero customer inquiries or bug reports related to migration. This is the number our team is most proud of. This is the ultimate proof that our user experience principle was not a slogan.
We are proud of these results, and I want to share three keys that made that possible. The first lesson is the power of data-driven decision making. It was data sampling that first revealed our problem: 60% of our table was old tombstones. And it was data analysis that gave us confidence for our solution. We analyzed the logs and found that 99.9% of signals propagated within one week, and we made our most critical decision to change the retention history window from six months to one week. Data removed the guesswork. We applied this data-first approach at every critical juncture.
However, our team also learned that data alone is not enough. Data gives you the numbers, but only the domain analysis gives you the meaning. Without this, you can look at the same data and still make the wrong choice.
So our second lesson is deep domain understanding. Our first insight came from asking a fundamental question about the true purpose of tombstone records. This was the key that allowed us to challenge our six-month rule in the first place. Our second insight wasn't just discovering a domain policy called the 99-tab limit; it was the realization of what that policy meant and how it naturally kept the data constrained. This instantly neutralized our biggest technical risk. True breakthroughs don't come from optimizing the how. They come from questioning the why.
The final and most important lesson is user-centric mindset. This is the principle that guided every single decision we made. Technical and financial success means nothing if you compromise the user experience. This mindset is why we chose the safest path instead of the easy one. In doing so, our team learned what the true measure of success is. It's not the petabytes we migrated or the dollars we saved. It's the trust we kept with our users.
A 1.2 petabyte migration with zero customer complaints. Thank you.
This article is entirely auto-generated using Amazon Bedrock.