🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerate & automate secure data transfers at scale with AWS DataSync (STG340)
In this video, Tugba Goksel and Jeff Bartley from AWS, along with Aditya Dhoot from PathAI, discuss AWS DataSync for large-scale data migrations. The session covers DataSync's capabilities for moving petabytes of data securely across on-premises, multicloud, and AWS environments. PathAI shares their success story of migrating whole slide pathology images using DataSync agents to enable AI-powered diagnostics. Jeff provides a technical deep dive demonstrating how to achieve 20 Gbps throughput by deploying multiple agents in parallel, transferring 4.2 terabytes in under 40 minutes. Key topics include enhanced mode for unlimited file transfers, agent deployment strategies, private endpoints via Direct Connect, and optimization patterns for maximizing network bandwidth during migrations.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
The Growing Challenge of Enterprise Data Migration at Scale
Welcome everyone. Thank you for joining us today. Whether you're managing your IT infrastructure or steering your organization's IT strategy, there's one challenge that keeps us all united: the complexity of large-scale migrations. You are here today because you recognize the significant challenges large-scale migrations present in your cloud journey. My name is Tugba Goksel. I'm a go-to-market specialist at AWS, and today I'm joined by Jeff Bartley, principal product manager for DataSync, and we're excited to have Aditya Dhoot, VP of Engineering at PathAI, join us today. Aditya has a very exciting story to share with us.
From an agenda perspective, I will introduce you to DataSync, go over our use cases, and then cover our recent launches. Then I'll hand it over to Aditya, who's going to talk about the customer success story and the PathAI use case. Jeff is going to provide a deep dive on DataSync and wrap up with some resources.
We're living in an era where enterprises are creating exabytes of data every single day, and it's not slowing down. In fact, data shows that it is growing at a steady pace year over year with no sign of plateauing. An average enterprise works with hundreds of applications, sometimes more than 500 systems, creating petabytes of data across multicloud environments. It's not just about data; it's about maintaining data governance and data quality across these systems that were not built to work seamlessly together.
Security and reliability have absolutely become critical for every data transfer and every backup operation. You have to maintain the highest security standards, ensuring zero data loss. You're moving data which represents the lifeblood of your business operations, so any compromise in security and reliability may have serious consequences. Enterprise data does not live in a neat, tidy data center anymore. It's distributed across regions, some living in edge locations scattered between multiple on-premises data centers spanning multicloud environments, so it becomes very complicated to work with this data that's all over the place.
It requires sophisticated orchestration and management strategies to maintain security and performance standards. As organizations go through the different stages of their migration and modernization journey, they face the complex challenge of working with massive amounts of data. Some organizations will do it using do-it-yourself tools, perhaps combined with some custom solutions or open source tools. While it sounds plausible at first, it can become very complicated fast when you're working with petabytes of data and billions of files.
Data verification is an area that often gets overlooked. I've had customers who had to restart their entire transfer process because they had not put proper data verification processes in place. Errors are going to happen. The question is how do you recover from errors gracefully and keep on schedule? How do you transfer your data in the most secure and efficient way? Most importantly, how do you assure the performance that you need?
Introducing AWS DataSync: A Fully Managed Solution for Data Transfer
AWS DataSync was introduced to overcome these challenges associated with large data transfers. It's our online data transfer service that moves file and object data between on-premises environments, other clouds, and AWS. It's fast and easy to use, with built-in features such as advanced filtering, flexible scheduling, precise bandwidth control, and comprehensive reporting for your data transfers at scale. It's secure and reliable: we encrypt your data at rest and in flight.
In fact, it's built on a custom network protocol that maximizes your available bandwidth through parallel transfer operations, and it will recover from many common network failures. You can scale it to any data size. And lastly, it's fully managed, meaning we take care of all the heavy lifting for you so you can focus on your data migration strategy. Perhaps one of the biggest differentiators for DataSync is its deep integration with the AWS ecosystem.
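As a rough illustration of the controls mentioned above (not shown in the session), here is a minimal sketch of DataSync task options covering bandwidth throttling, verification, logging, and metadata preservation; the specific values are illustrative assumptions, not recommendations.

```python
# Sketch only: illustrative DataSync task options.
task_options = {
    "BytesPerSecond": 125_000_000,           # throttle to roughly 1 Gbps; -1 means use all available bandwidth
    "VerifyMode": "ONLY_FILES_TRANSFERRED",  # checksum-verify the data that was copied
    "LogLevel": "TRANSFER",                  # log each transferred file or object to CloudWatch
    "PosixPermissions": "PRESERVE",          # keep POSIX permissions on files and folders
    "Mtime": "PRESERVE",                     # preserve modification times
}
# These options are passed as Options= to create_task, or as OverrideOptions=
# to start_task_execution for a one-off run.
```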
When we go out and talk to customers, we see them using DataSync for one of these four use cases. Our primary use case is migrations, so customers use DataSync to quickly and easily bring in their file and object data to AWS. They could be moving from on-premises environments or from other clouds or moving data between various AWS storage services.
Some customers use DataSync to replicate their data, making a second copy primarily for disaster recovery purposes. Archive data is cold, infrequently accessed data that takes up a lot of space on your on-premises storage. We see customers using DataSync to bring in that data, freeing up space in their on-premises storage environment and moving it into Amazon S3 or S3 Glacier storage classes, where they also take advantage of the cost and durability of those services.
And lastly, we see customers using DataSync to accelerate their business workflows. Customers in the life sciences industry work with on-premises equipment like genome sequencers that output a lot of data, and it's really critical for these businesses to bring in this data into the cloud for processing. We're seeing more and more customers using DataSync for these types of recurring data transfers.
There are three data movement scenarios DataSync supports. First, you can use DataSync to move your data from your on-premises storage environment into AWS. Some common use cases we see here are customers exiting their data centers or retiring their storage systems. You can connect any storage system that speaks the NFS or SMB protocols, object storage, or Hadoop (HDFS). You can move your data into Amazon S3, any of our FSx file systems, or Amazon EFS.
DataSync has the ability to move your data and your metadata. It's really critical to preserve the metadata information such as timestamps, permissions, or Windows attributes. You can also use DataSync to move from and to other clouds into AWS. We support Google Cloud Storage, Azure Blob, and others on this list. If your data is in another cloud and that cloud offers S3-compatible object storage, it's very likely that DataSync can work with it.
It's in both directions, so you can move to and from other clouds to AWS. At all times we use secure protocols when communicating with the other clouds, so you can be assured that your data is encrypted. We support S3, EFS file systems, or FSx as the destination. Customers who have a multicloud strategy or customers who are trying to consolidate their data in one single cloud use DataSync for their use cases.
Thirdly, if your business requires you to move data between any combination of these AWS storage services, you can use DataSync for your use case. That could be moving data across accounts or across regions. You could be moving between two Amazon S3 buckets in different regions or moving from S3 to EFS, so any combination of these storage services. The data is transferred over the AWS backbone. There's no infrastructure to manage. It's fully service managed end to end.
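To make the service-to-service case concrete, here is a minimal boto3 sketch of an S3-to-S3 copy with no agent involved; the bucket names, account ID, and IAM role ARN are placeholder assumptions.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

# Both locations are S3 buckets, so no agent is involved; the service moves
# the data over the AWS backbone. Bucket names and role ARN are placeholders.
src = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-source-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-destination-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="s3-to-s3-copy",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```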
AI is an emerging use case for DataSync, so customers use DataSync today to bring petabytes of data into Amazon FSx for Lustre or Amazon S3, both of which provide high throughput and high scalability. Customers bring their data into AWS to build data lakes, which are then used to train their models. They can use services such as Amazon Bedrock in their data processing pipelines. Another use case we see is customers using DataSync to move their training data around. I had one customer who used DataSync to move data between two Amazon S3 buckets for GPU optimization to support their expanding models.
We'll also hear from Aditya about how PathAI successfully used DataSync to bring their image datasets into S3 to support their diagnostic workflow. Let's go over some of our recent launches. DataSync enhanced mode is a new capability that we introduced last year, and it enables our customers to move a virtually unlimited number of files for their S3 transfers as well as their cross-cloud transfers. You also get enhanced metrics and reporting for your data transfers. Enhanced mode also has the ability to transfer very large files with increased transfer speeds.
The way enhanced mode works is it breaks down those large files into pieces and then transfers them in parallel, which leads to higher performance and increased transfer speeds. Our customers, especially in the media and entertainment industry who work with very large files like terabytes of data or hundreds of gigabytes in size, are finding increased transfer speeds with enhanced mode. Enhanced mode also simplifies your cross-cloud transfers. There's no infrastructure to manage. You no longer need to deploy an agent in the other cloud, and it provides a very easy setup. All you need to do is create an object storage location and point it at the endpoints in the other cloud and get started.
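As a hedged sketch of that cross-cloud setup (not code shown in the session), the example below points an object storage location at Google Cloud Storage's S3-compatible endpoint and creates an enhanced mode task. The bucket names, HMAC keys, and role ARN are placeholders, and the assumption that an agentless object storage location and the TaskMode parameter are available in your SDK version should be checked against the current DataSync API reference.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

# Point an object storage location at another cloud's S3-compatible endpoint
# (Google Cloud Storage's XML API is used here as an example). Keys are placeholders.
src = datasync.create_location_object_storage(
    ServerHostname="storage.googleapis.com",
    ServerProtocol="HTTPS",
    ServerPort=443,
    BucketName="example-gcs-bucket",
    AccessKey="GOOG1EXAMPLEACCESSKEY",
    SecretKey="example-secret",
    # Assumption: with basic mode an AgentArns list is required here; enhanced
    # mode cross-cloud transfers are designed to run without deploying an agent.
)
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-destination-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="gcs-to-s3-enhanced",
    TaskMode="ENHANCED",  # enhanced mode: virtually unlimited files, parallel large-object transfer
)
```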
In summary, DataSync enhanced mode provides increased scalability and performance for your data transfers. With that, I'll hand it over to Aditya.
PathAI's Journey: Accelerating Digital Pathology with DataSync
Hello everybody, I'm excited to talk about how PathAI is accelerating digital pathology using AWS. At PathAI, our mission is to improve patient outcomes using AI-powered pathology, helping labs and clinicians get access to more modern tools such as AI for diagnosis and case review. Behind this innovation is a large data infrastructure problem: how do you move petabytes of tissue imaging data from on-premises into the cloud securely, reliably, and at scale?
In this session, I'll talk about how at PathAI we've developed a data pipeline using AWS DataSync to solve this particular problem. But before I get started, let me tell you a little bit about how pathology works. It all starts when a patient needs a biopsy. After the biopsy is taken and processed at a lab, the tissue is mounted on a glass slide, and pathologists then look at that glass slide under the microscope to make a diagnosis. This process has not changed in over 100 years. It's been practically the same: it's very manual, it's very physical, and it relies on the pathologist's ability to look at these glass slides under the microscope to make an assessment.
As case volumes go up across the world, these manual steps have become a huge bottleneck. Let's dive a little deeper into it. Pathology is at the center of almost all cancer diagnosis, yet it's the last imaging modality to go digital. A lot of these labs are extremely manual. They're still using glass slides, they're still using microscopes, and all the cases that they review are manual. As case volumes go up, there simply aren't enough pathologists in the world to keep pace. This is where digitization can really help. Once you digitize these glass slides,
you're able to push them into the cloud and then run AI on these digitized glass slides so that you can get a more consistent diagnosis. You're able to do remote review with pathologists and also enable a more data-driven case workflow. But this innovation also means that all of these glass slides now need to be digitized, and that creates a lot of data that needs to be managed, maintained, and stored. So at PathAI we're tackling this challenge head on.
AI Site, the platform we've developed, is used by labs and biopharma partners to manage these whole slide images, share them, and then run AI on top of them. To give you a glimpse, every digitized glass slide can be over 1 gigabyte of data, so these images are quite massive. AI Site is built on AWS. It uses fast, secure storage and architecture, along with automation tools like DataSync, to push these images into the cloud. This is what has allowed us to move labs from a very manual workflow into the cloud and digitize them so that they can get access to more modern tools.
So what exactly is the bottleneck within these labs? These labs generate terabytes, even petabytes, of imaging data on a fairly regular basis once slides are scanned. These whole slide images are stored on local storage systems behind strict firewalls, and these labs were never built to push this data seamlessly into the cloud. Oftentimes the IT teams are overburdened. The lab setups are unique: they use different network configurations and different scanners, and they simply need a solution that works fairly seamlessly.
This particular problem compounds even more because health systems oftentimes are dealing with multiple labs that operate in multiple geographies across state lines, across countries. Every single one of these labs has a different network configuration and different setup altogether, but they need a standardized solution such that all of these whole slide images that they're generating can still end up within a centralized cloud environment where pathologists can then log in, access these cases, and do their review. So the challenge here was how do you build a seamless data pipeline that allows all of these labs to come online and go digital.
So what we've utilized to enable this architecture is AWS DataSync. At each lab, the DataSync agent runs on a local hypervisor, talks to the local storage systems, and pushes the data into the cloud into the lab's S3 bucket. The agent runs in the background, pushing these whole slide images into the cloud every few minutes to be picked up by AI Site, so that pathologists can look at these cases and then run AI tools on them.
Let's look at the architecture we've deployed across many labs in a little more detail. On the left-hand side, you've got the on-premises lab environment. Every lab has a unique set of whole slide image scanners, which store these whole slide images in a local storage environment. A DataSync agent then pushes these whole slide images into the cloud, specifically into the lab's AWS environment and their S3 bucket. From there, the whole slide images are pushed into PathAI's AWS environment so that AI Site can utilize and process them.
But there is another piece of metadata that's also important and that we need in order to make the case ready to be reviewed by the pathologist: the patient and case metadata. We use our middleware, AI Site Link, which communicates with a lab's information system to retrieve this patient metadata using HL7 messages. Together with the digitized whole slide image and this metadata, the AI Site backend processes the images and makes the case visible in AI Site for the pathologist to view and to run a variety of different AI algorithms to get a better assessment of the diagnosis.
This architecture has allowed us to bring many digital labs online on our AI Site solution. What's the operational impact of all of this? The solution I described has allowed us to bring labs online in the US, in Europe, and also in South America. Petabytes of whole slide images can be moved into the cloud on a regular basis and used for AI. It has also streamlined a lot of IT operations: lab IT teams don't want to deal with complex solutions or manual uploads, and they can deploy DataSync agents to move this data into the cloud fairly seamlessly.
Altogether, this is what has allowed us to bring digital pathology into the modern era, where labs are now going digital, moving from physical glass slides to whole slide image-based pathology. Thank you, and I'll pass it over to Jeff.
Deep Dive: Understanding the DataSync Agent and Deployment Options
What's great is being able to see how the products and services that we build impact customers and their lives, and I love seeing stories like that. So thank you very much for sharing that. My name is Jeff Bartley. I'm a product manager on the DataSync team, and I'm going to do a deep dive into DataSync. We're going to frame it in a real-life scenario, specifically a migration use case, but a lot of what I'm going to talk about is applicable to the other use cases that Tugba mentioned, whether it's archive, replication, or ongoing data movement. Pretty much the same principles apply to what I'm going to cover here.
For our deep dive, I'm going to walk through an example of how to use DataSync in this configuration: migrating data from an on-premises NFS server to an S3 bucket located in the US West (Oregon) Region. In between, we've got a Direct Connect link. Direct Connect provides a private network between your on-premises environment and your VPC running in AWS. In our case, we've got two 10 gigabit per second links bonded together to give us a total of 20 gigabits per second of network bandwidth, and that's going to come into play later. You might be thinking that 20 gigabits per second into AWS would be great. I actually work with customers who have hundreds of gigabits per second of bandwidth, so it is possible, and DataSync works in those kinds of environments, but we'll talk about how that works with DataSync in a second. Our goal is to get that data migrated, and I'll use this setup to frame our deep dive discussion.
Specifically, we're going to talk about three general areas, which cover a lot of the questions I encounter when working with customers who are trying to utilize DataSync. Often the questions come in these three areas. First, we'll talk about the DataSync agent: what it is, how to deploy it, and some of the decisions and considerations you have to think about as you're working with the DataSync agent. Then we'll walk through running a test. Often I see customers who are in a rush to get their data moved, particularly if it's a migration where they need to exit a data center or get that data moved quickly. They'll just run right into trying to do their migration, trip over themselves, and realize that they missed some steps. Running a test is always a really good best practice whenever you're utilizing DataSync.
Based upon those test results, I'll then show you how you can think about optimizing DataSync performance. Let's start with the agent. Aditya mentioned the agent, and Tugba mentioned it as well. The DataSync agent is a virtual machine that you deploy outside of AWS. It's used to access storage that's outside of the AWS environment. This could be storage systems located on premises or storage systems in other clouds, but it's storage that we, the DataSync service, can't access directly. As a virtual machine, it deploys on various hypervisors: we support VMware, KVM, and Hyper-V, and we recently added support for Nutanix, so if you have that, we can work with it as well. You can also deploy it as an EC2 instance. The agent brings a number of advantages to these environments, which I'll talk about in a little bit, but one of the big ones is that it compresses data in flight, which can often help increase your speeds or optimize your network utilization.
Configuring Network Connectivity: Public vs. Private Endpoints
Let's dive a little bit deeper into this. One of the common first questions that I get from customers when they're looking to utilize DataSync is whether to run the agent as an EC2 instance in AWS or as a virtual machine either on premises or in another cloud. There are various things that you need to consider in these scenarios. If you go with running the agent as an EC2 instance, some of the advantages are the simplicity of installing it. I work with a lot of customers who operate environments where there's a lot of overhead for getting an actual virtual machine deployed. There might be another team who controls deploying virtual machines, or it could be that it's just difficult to get VMs deployed on premises. It would be so much easier if you could deploy as an EC2 instance, and you certainly can do that.
What you need to consider, though, is that the network protocol that the agent uses to talk to your storage might be sensitive to latency. For example, in our case we're going to be working with an NFS server, so the agent needs to communicate with that NFS server over the network. That means the path from the EC2 instance to the on-premises storage needs to be generally low latency. Protocols like NFS or SMB are sensitive to latency, and typically you want single digit millisecond latency, which gives you the best performance. It will work if you're in the double digit milliseconds, but if you start going above that, you're really going to start hitting issues. That's one of the challenges with deploying the agent as an EC2 instance.
On the other hand, if you can deploy it in your on-premises environment, you get a couple of advantages. The biggest one is that DataSync is now able to use our custom protocol to communicate data over the network. Now what's happening is that the agent has that short network trip to get to the NFS server. It's low latency, typically very fast, and then we use our custom protocol to move the data over the network. We're able to better optimize data movement over the network. Like I said, we use compression. All data is encrypted in flight as it's moved over the wire, and we're also using parallel streams to maximize bandwidth. We're also very resilient to things like packet drops or network retransmits, so we can handle all of those situations and provide for a much faster experience moving data over the network.
If you can, and you're working with data in your on-premises environment, we recommend running the agent on premises for that reason. Now, when you set up and deploy an agent, there's a key decision you need to make: how am I going to connect my agent to the DataSync service running in the cloud? We offer two types of endpoints. The first are public endpoints, which let you connect the agent over the Internet, and the second are private endpoints, which use something like Direct Connect or a VPN to provide private connectivity from the agent into the cloud.
Now, when you're working with private endpoints, one of the things to understand is the communication path that DataSync uses to move data. We have two kinds of traffic. The first is control traffic, which goes from the agent through a VPC endpoint that you create in your VPC and subnet. That covers things like instructions from the DataSync service on when to run tasks and move data, as well as uploading logs. But the majority of the data is moved over the data path, which takes a separate network route. It goes through ENIs, or network interfaces, that the DataSync service creates in your subnet to connect the agent directly to our systems running in the back end, giving a direct, highly optimized path for data movement.
The other thing this does is avoid routing data through the VPC endpoint, which would add an additional per-gigabyte charge for data movement. So in this case, we're bypassing that while still achieving the high levels of throughput and performance that our customers expect from DataSync. This is important to understand, especially if you're setting up firewalls or planning how traffic will flow through your network.
Another area where customers can sometimes run into challenges is that we create multiple ENIs per DataSync task. You want to make sure your subnet has enough IP address space to support the tasks you plan to run. This is something important to keep in mind.
In our case, we have our Direct Connect links, so we're going to use a private endpoint. I'm going to deploy my VPC endpoint in my subnet and then deploy my DataSync agent in my on-premises environment. I'll configure and activate it with the DataSync service running in my Region and in my account. This associates the agent with the DataSync service, and from that point on, the agent can only be used with that account and with DataSync in the particular Region you activated it in.
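A minimal sketch of that private-endpoint setup with boto3 might look like the following; the VPC, subnet, security group, account, and activation key values are placeholders, and obtaining the activation key still happens through the agent VM's local console or web UI.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
datasync = boto3.client("datasync", region_name="us-west-2")

# 1. Create an interface VPC endpoint for DataSync in the subnet reachable
#    over the Direct Connect link. All IDs below are placeholders.
endpoint = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.datasync",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
)["VpcEndpoint"]

# 2. Activate the on-premises agent against that private endpoint. The
#    activation key comes from the agent VM itself.
agent = datasync.create_agent(
    ActivationKey="EXAMP-LEKEY-12345-67890-ABCDE",
    AgentName="onprem-agent-1",
    VpcEndpointId=endpoint["VpcEndpointId"],
    SubnetArns=["arn:aws:ec2:us-west-2:111122223333:subnet/subnet-0123456789abcdef0"],
    SecurityGroupArns=["arn:aws:ec2:us-west-2:111122223333:security-group/sg-0123456789abcdef0"],
)
print(agent["AgentArn"])
```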
Running Your First Test: Validating Performance and Migration Patterns
The next step that I would want to do is run a test of DataSync and get a sense of what the performance is like. I want to confirm that I can connect to my storage correctly and that connectivity over my network, including firewall settings, is working. Typically, I'm going to do this with a small data set. I don't want to transfer everything. I just want to get a sense of how quickly I'm going to be able to transfer data.
When customers first come to me asking about DataSync, I often ask them to tell me about their data set. Are these large files, are they small files, is it a mix? What does the folder layout look like? There's obviously a wide variety of data, and understanding it can help you optimize your use of DataSync. In our example, we're going to use a data set that looks like this: data is split across multiple years, older data is read-only and not being modified, and it's a mix of large and small files.
When you're using DataSync, we have built-in filtering with different ways for you to specify what data you actually want to move. You can use include filters to specify the data that you want to copy. You can use exclude filters to leave things out, such as temp files. You can also use a manifest, where you specify a list of files to copy and we copy only those specific files. Manifests can be useful when you have a well-known set of files that you need to move on a regular basis and you want to avoid the overhead of DataSync scanning to figure out what's actually changed in your data set.
In our case, I'm going to start my test with a simple, small set of data. I'm going to copy the January folder from 2025 and put that in as an include filter so that DataSync focuses only on that one folder and doesn't copy anything else. My next step is to create a DataSync task. A task consists of a source location, which tells DataSync how to connect to my NFS server using the agent I deployed, and a destination, which is my S3 bucket. I configure my task options with an include filter to specify what I want to transfer.
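In code, that test task might look roughly like the boto3 sketch below; the hostnames, ARNs, and paths are placeholders, and the include filter could equally be supplied when starting the execution.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

# Source: the on-premises NFS export, reached through the activated agent.
src = datasync.create_location_nfs(
    ServerHostname="nfs.example.internal",
    Subdirectory="/export/projects",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-west-2:111122223333:agent/agent-0123456789abcdef0"]},
)

# Destination: the S3 bucket in us-west-2.
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-migration-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

# Task limited to the January 2025 folder via an include filter.
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="nfs-test-january-2025",
    Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": "/2025/January"}],
)
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
```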
When I go ahead and run this, here's an example of the DataSync task running in the console. It's shown as a time-lapse, but you can see how it progresses as it transfers and then verifies the data. It transferred about 550 gigabytes of data in about 17 minutes, achieving roughly 550 megabytes per second of throughput.
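If you prefer to pull those numbers programmatically rather than from the console, a sketch like the following reads them from describe_task_execution; the execution ARN is a placeholder for the one returned by start_task_execution.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

resp = datasync.describe_task_execution(
    TaskExecutionArn="arn:aws:datasync:us-west-2:111122223333:task/task-0123/execution/exec-0456"
)

if resp["Status"] == "SUCCESS":
    duration_s = resp["Result"]["TransferDuration"] / 1000  # reported in milliseconds
    mb_per_s = resp["BytesTransferred"] / duration_s / 1_000_000
    print(f"Transferred {resp['FilesTransferred']} files at ~{mb_per_s:.0f} MB/s")
```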
In my case, where I'm looking to get 20 gigabits per second of performance, which is about 2,400 megabytes per second, I'm well short of that. So I would go back and start thinking about where the bottleneck is. For our example, I talked to the networking team, and they told me there's a bottleneck between your agent and the router that leads out to the Direct Connect: a single agent can only achieve up to 5 gigabits per second of performance.
In this case, if I want to take advantage of that full 20 gigabits per second of network bandwidth, I have to approach my problem a little bit differently. Let's talk about how to optimize performance using DataSync. We're going to focus on migration patterns, talking about first-time versus incremental transfers, and then how you can scale out tasks and agents to achieve higher performance levels.
Most migrations, which is what we're trying to achieve here in our use case, typically follow a pattern where you start with an initial transfer of your data. Most of your data is copied to your destination, and then you run incremental transfers over time to capture the differences and changes that occurred between that first copy and your cutover. The cutover is when you actually move your application from your original dataset to the new dataset at your destination, which in this case would be S3. Knowing your cutover time is critical when it comes to a migration.
DataSync has several capabilities that help customers, particularly with migrations. One key capability is that it copies file data and metadata. If you're migrating file systems, this is critical for preserving the permissions on your files and folders. It has filters that enable you to copy only the data that's necessary, and it can scale to maximize your bandwidth, which we'll discuss shortly.
For incremental transfers, DataSync has built-in scheduling so you can run it on a schedule. In a migration scenario, you might set it up to run every day, picking up the changes automatically. As it runs, you get an idea of what your cutover time will be. DataSync also has detailed logs and audit reports that enable you to verify that the data you expect to migrate is being moved correctly.
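A scheduled incremental run can be set up by updating the existing task; this sketch assumes a nightly 02:00 UTC schedule and a placeholder task ARN.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

# Re-run the existing task every night at 02:00 UTC. Only changed files are
# copied because TransferMode is CHANGED (also the default behavior).
datasync.update_task(
    TaskArn="arn:aws:datasync:us-west-2:111122223333:task/task-0123456789abcdef0",
    Schedule={"ScheduleExpression": "cron(0 2 * * ? *)"},
    Options={"TransferMode": "CHANGED", "LogLevel": "TRANSFER"},
)
```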
Scaling Out for Maximum Performance: Achieving 20 Gbps with Parallel Tasks
With a single agent, we achieved about 550 megabytes per second. To maximize the 20 gigabits per second of network bandwidth available, we want to scale that up. This is a common pattern we see with DataSync customers who have significant network bandwidth but find that a single task cannot achieve the performance level they need. What they do is partition their dataset by folder and run multiple tasks in parallel, each using a separate agent.
If you want to better understand this pattern and how to apply it, you can read a great blog written by one of our solutions architects using the QR code provided. We're going to use this pattern and partition our dataset by years, so we'll copy from 2022, 2023, 2024, and 2025 in parallel to maximize available bandwidth. We're not going to copy all of it initially. We're going to start with a test to verify that we can scale our performance.
The first thing we'll do is replace that single agent with four DataSync agents on premises. We'll activate them the same way we did previously, and then I'll create four separate tasks with my source and destination. This time, I'm going to set up an include filter for each task pointing to the February folder in each of those years. Each task copies a dataset similar in size to our first test, so altogether we're moving four times as much data.
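Scripting that scale-out might look like the sketch below: one NFS location per agent, one task per year, each started with its own include filter. All ARNs and hostnames are placeholders.

```python
import boto3

datasync = boto3.client("datasync", region_name="us-west-2")

# One agent (and therefore one NFS location) per year; agent ARNs are placeholders.
agents = {
    "2022": "arn:aws:datasync:us-west-2:111122223333:agent/agent-aaaa",
    "2023": "arn:aws:datasync:us-west-2:111122223333:agent/agent-bbbb",
    "2024": "arn:aws:datasync:us-west-2:111122223333:agent/agent-cccc",
    "2025": "arn:aws:datasync:us-west-2:111122223333:agent/agent-dddd",
}
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-migration-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Role"},
)

for year, agent_arn in agents.items():
    src = datasync.create_location_nfs(
        ServerHostname="nfs.example.internal",
        Subdirectory="/export/projects",
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    task = datasync.create_task(
        SourceLocationArn=src["LocationArn"],
        DestinationLocationArn=dst["LocationArn"],
        Name=f"parallel-test-{year}-february",
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": f"/{year}/February"}],
    )
    datasync.start_task_execution(TaskArn=task["TaskArn"])  # all four run in parallel
```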
In the console, you can see the four tasks running in parallel. They all start at the same time and transfer data in parallel, each achieving roughly 520 megabytes per second. The good news is that our on-premises storage can scale to that level of performance. You can see they all flatline at around 550 megabytes per second, showing they're hitting the per-agent network limit we pointed to previously.
In aggregate, they're achieving well over 2 gigabytes per second. We moved about 4.2 terabytes of data in under 40 minutes, which is a very good transfer rate and demonstrates the ability to move a lot of data quickly. With this, I've shown that I can scale DataSync out massively. In fact, I've had customers use this same pattern to move petabytes of data a day with DataSync, running dozens of tasks in parallel.
We've talked about optimizing performance, so let's wrap up. DataSync provides the ability to move data quickly and reliably. You can use it with a variety of storage systems on premises, other clouds, and AWS storage. Enhanced mode increases scalability and enables you to move virtually unlimited numbers of files at very high levels of performance. We've shown how you can use DataSync to scale out your data transfers with multiple tasks, maximize your bandwidth, and achieve high levels of performance.
If you want to learn more, head to our website where we have blogs, demos, use cases, and more information about DataSync. We also have a chalk talk coming up tomorrow if you want to go deeper, specifically into moving data between other clouds. I definitely recommend checking that out. With that, I want to say thank you and appreciate your time for being here.
; This article is entirely auto-generated using Amazon Bedrock.