Fellow AWS Hero Matt Bonig recently asked a very interesting question in a Twitter poll: when you copy a file from one S3 bucket to another with the AWS CLI, where does the data actually travel?
Three Possible Answers
It's an interesting question because, depending on your perspective and experience, each of the three possible poll answers makes sense.
If you don't know about the S3 bucket-to-bucket copy feature (which, while introduced in 2008, isn't crystal clear in the docs), passing data through the system that called the command makes sense. That's how most file transfers work.
If you've been working with Amazon S3 regularly, you've no doubt seen the ultrafast transfer speeds even over bad connections. The only way that makes sense is if the data is only moving within the S3 infrastructure.
There are always exceptions to everything, which is why "It depends" holds up as a valid answer. Anyone who's been working with tech for any length of time intuitively understands this...after many, many frustrating stories and experiences.
What Actually Happens
Let's break down this command and figure out what's going on.
aws s3 cp s3://SOURCE_BUCKET/KEY s3://DESTINATION_BUCKET/
`aws` calls the AWS CLI program.

`s3` filters the commands to the S3 service. a/k/a, "We're using S3!"

`cp` is the action to call within the specified AWS service.

Now, `cp` is a little different from most of the AWS CLI commands. Most commands are direct parallels to the AWS API for the service in question. In this case, there's a lot of syntactic sugar applied with the goal of making `aws s3 cp` work as similarly as possible to Linux's `cp` command.
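To see that parallel, compare the shape of the two commands (the file, bucket, and key names here are just placeholders):

```
# Linux cp: copy a file into a directory, keeping its filename
cp report.csv /backups/

# aws s3 cp: the same shape, with s3:// URIs standing in for paths
aws s3 cp s3://SOURCE_BUCKET/report.csv s3://DESTINATION_BUCKET/
```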
The rest of the command provides a mix of options to indicate the source and destination of the file copy. For our example:
`s3://SOURCE_BUCKET/KEY` is the source file. A key is the S3 term for what we commonly think of as the directory structure plus the filename. In this case, the file is in an existing S3 bucket.

`s3://DESTINATION_BUCKET/` is the destination. Here, we've indicated another S3 bucket, and because the path ends in `/`, we are telling `cp` that we want the same filename (or key) in the destination bucket.
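To illustrate that trailing-slash behavior (with hypothetical bucket and key names), both of these forms work:

```
# Trailing slash: the copy lands at s3://DESTINATION_BUCKET/data.csv
aws s3 cp s3://SOURCE_BUCKET/reports/data.csv s3://DESTINATION_BUCKET/

# Explicit key: the copy lands exactly where you name it
aws s3 cp s3://SOURCE_BUCKET/reports/data.csv s3://DESTINATION_BUCKET/archive/data-backup.csv
```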
API Calls
Behind the scenes, the S3 API action that `cp` calls depends on what you've asked it to do. Here are the most common possibilities:
- If you're copying a file from the local system into S3, it calls the `PutObject` action or, if it's a really large file, the `CreateMultipartUpload` action
- If you're copying a file from one bucket to another bucket, it calls the `CopyObject` action
- If you're copying a file from a bucket to the local system, it calls the `GetObject` action
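Those same actions are exposed directly through the lower-level `aws s3api` commands if you ever want to skip the `aws s3 cp` sugar. A rough sketch of the equivalents, with placeholder bucket, key, and file names:

```
# PutObject: push a local file into a bucket
aws s3api put-object --bucket SOURCE_BUCKET --key KEY --body LOCAL_FILENAME

# CopyObject: copy between buckets; the data never touches your machine
aws s3api copy-object --copy-source SOURCE_BUCKET/KEY \
    --bucket DESTINATION_BUCKET --key KEY

# GetObject: pull an object down to a local file
aws s3api get-object --bucket DESTINATION_BUCKET --key KEY LOCAL_FILENAME
```

One thing `aws s3 cp` handles for you here is automatically switching to multipart uploads for large files; with `s3api` you'd have to orchestrate `create-multipart-upload` and friends yourself.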
Our Transfer
In the case of our command:
aws s3 cp s3://SOURCE_BUCKET/KEY s3://DESTINATION_BUCKET/
The CLI translates that to the `CopyObject` action, which means the data never leaves AWS. The contents of our file (or key) are copied via the S3 backend from the source bucket to the destination bucket.
We can verify that by looking at the outbound traffic on my local system. Here's a screenshot of the original file upload.
I've run the command:
aws s3 cp LOCAL_FILENAME s3://BUCKETA/
This results in an outbound transfer running at ~17 MB/s to AWS.
You can see that not only in what the AWS CLI reports but also in my outbound firewall. The firewall shows ~14 MB/s, but the difference is just a matter of how each tool samples and updates.
The reported speed also lines up with how long the command ran for the file size: just over a minute for a 1.1 GB file (1,100 MB ÷ 17 MB/s ≈ 65 seconds).
The key point here is that both tools are reporting very similar numbers.
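If you want to reproduce this check without a GUI firewall, a per-process bandwidth monitor works just as well. A minimal sketch, assuming a Linux machine with nethogs installed:

```
# Terminal 1: watch per-process network usage
sudo nethogs

# Terminal 2: run the upload and compare the rates
aws s3 cp LOCAL_FILENAME s3://BUCKETA/
```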
After that upload completes, to copy the file from one bucket to another, I run the command:
aws s3 cp s3://BUCKETA/KEY s3://BUCKETB/
Here are the transfer results from my local system for this command:
The AWS CLI is reporting ~180 MB/s, while my local network traffic to AWS is only ~30 KB/s. The file transfer takes about 10 seconds to complete.
This proves that the `CopyObject` API is being used to copy the file through the S3 backend.
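You can also confirm which API action the CLI invokes by using its global `--debug` flag, which logs each request it makes. The exact log format varies between CLI versions, so treat this grep as a sketch:

```
# Dump the CLI's debug log and look for the S3 operation being called
aws s3 cp s3://BUCKETA/KEY s3://BUCKETB/ --debug 2>&1 | grep -i copyobject
```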
What's Next?
The bucket-to-bucket copy feature is a massive time and bandwidth saver when you're working with files in AWS. This works not only between buckets in the same account and region but also across different accounts and regions (when the appropriate permissions are in place).
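For a cross-region copy, the CLI just needs to know where each bucket lives. A sketch, with hypothetical regions and the same placeholder names:

```
# Copy between buckets in different regions; S3 still moves the data internally
aws s3 cp s3://SOURCE_BUCKET/KEY s3://DESTINATION_BUCKET/ \
    --source-region us-east-1 --region eu-west-1
```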
The results of this little experiment also highlight a key rule of working with data in Amazon S3:
Keep data inside of Amazon S3 for as long as possible
You don't want to have to wait for data to be downloaded out of AWS, nor do you want to have to pay for that transfer.
The AWS CLI itself is a really interesting open source project that has a lot of very cool code behind the scenes to make it work. You can check that out on GitHub.