
Ryan Nazareth

Data Transfer from S3 to Cloud Storage using GCP Storage Transfer Service

Storage Transfer Service automates the transfer of data to, from, and between object and file storage systems, including Google Cloud Storage, Amazon S3, Azure Storage, on-premises data, and more. It can be used to transfer large amounts of data quickly and reliably, without the need to write any code. Depending on your source type, you can easily create and run Google-managed transfers, or configure self-hosted transfers that give you full control over network routing and bandwidth usage. Note that Storage Transfer Service only transfers data into Google Cloud; it does not support the reverse direction, e.g. from Cloud Storage to S3.

In this blog, we will demonstrate how to create a one-off storage transfer job to transfer data from an S3 bucket to GCP Cloud Storage. In addition, we will show how to set up an event-driven transfer job that continuously listens for event notifications emitted when objects are added or modified in the source S3 bucket.

Prerequisites

Before you begin, make sure you have the following prerequisites:

  1. A GCP account with the necessary permissions to create and manage storage buckets and transfer jobs.
  2. An AWS account with the necessary permissions to create and manage S3 buckets.
  3. The AWS CLI installed and configured on your local machine.
  4. The gcloud CLI installed and configured on your local machine.
  5. The necessary IAM roles and permissions set up in both AWS and GCP.

Create a source S3 bucket demo-s3-transfer and a destination Cloud Storage bucket demo-storage-transfer. In the source S3 bucket, we will upload some parquet files under the prefix 2024/12. We will be transferring the parquet files under this prefix into the demo-storage-transfer bucket.
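
If you prefer to script this setup, below is a minimal sketch using boto3 and the google-cloud-storage client, assuming both AWS and GCP credentials are already configured locally. The bucket names match the ones above; the region and the local parquet file names are illustrative.

# Sketch: create the source/destination buckets and upload sample parquet
# files under the 2024/12 prefix. Region and file names are illustrative.
import boto3
from google.cloud import storage

S3_BUCKET = "demo-s3-transfer"
GCS_BUCKET = "demo-storage-transfer"

s3 = boto3.client("s3", region_name="eu-west-2")
s3.create_bucket(
    Bucket=S3_BUCKET,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},  # omit for us-east-1
)
for local_file in ["sample_1.parquet", "sample_2.parquet"]:  # hypothetical local files
    s3.upload_file(local_file, S3_BUCKET, f"2024/12/{local_file}")

gcs = storage.Client()
gcs.create_bucket(GCS_BUCKET, location="europe-west2")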

Storage Transfer REST API

Storage Transfer Service uses a Google-managed service account to move your data. This service account is created automatically the first time you create a transfer job, call googleServiceAccounts.get, or visit the job creation page in the Google Cloud console. The service account's format is typically project-PROJECT_NUMBER@storage-transfer-service.iam.gserviceaccount.com.

  • We can use the googleServiceAccounts.get method to retrieve the managed Google service account that is used by Storage Transfer Service to access buckets in the project where transfers run or in other projects. Each Google service account is associated with one Google Cloud project.

  • Navigate to the googleServiceAccounts.get reference page here.
    On the right, you will see a window where you can enter the project ID under the request parameters. Executing the request will return the subjectId in the response, along with the storage transfer service account email. Keep a note of the subject ID and the Google-managed service account email, as we will need them in later sections.

Image description

Alternatively, we can do the same via the CLI, using a curl command and passing a bearer token in the Authorization header.

curl -X GET -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "x-goog-user-project: <project-id>" https://storagetransfer.googleapis.com/v1/googleServiceAccounts/<project-id>

The x-goog-user-project header is required to set the quota project for the request (see the troubleshooting guide). If it is excluded, you may get the following error: The storagetransfer.googleapis.com API requires a quota project, which is not set by default
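
The same lookup can be done with the Python client library (google-cloud-storage-transfer) instead of curl. A minimal sketch, assuming application default credentials are set up:

# Sketch: retrieve the Google-managed Storage Transfer Service account for a project.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()
account = client.get_google_service_account({"project_id": "<project-id>"})
print(account.account_email)  # the Google-managed service account email
print(account.subject_id)     # the subjectId used in the AWS trust policy below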

AWS IAM role permissions

  • In the AWS console, navigate to IAM and create a new role.

  • Select Custom trust policy and paste the following trust policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "accounts.google.com"

            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "accounts.google.com:sub": <subject-id>
                }
            }
        }
    ]
}
  • Replace <subject-id> with the subjectId of the Google-managed service account that you retrieved in the previous section using the googleServiceAccounts.get reference page. It should look like the screenshot below.

Image description

  • Paste the following JSON policy to grant the role permission to list the bucket and get objects from the S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [ "*"]
        }
    ]
}

Image description

  • Once the role is created, note down the ARN value, which will be passed to Storage Transfer Service when initiating the transfer programmatically in Python.
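
The role can also be created programmatically rather than through the console. Below is a minimal sketch using boto3; the role and policy names are illustrative, and the trust and permissions policies are the same as above.

# Sketch: create the IAM role that Storage Transfer Service will assume.
import json
import boto3

SUBJECT_ID = "<subject-id>"  # subjectId of the Google-managed service account

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "accounts.google.com"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {"accounts.google.com:sub": SUBJECT_ID}},
        }
    ],
}

permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["*"],
        }
    ],
}

iam = boto3.client("iam")
role = iam.create_role(
    RoleName="storage-transfer-demo-role",  # illustrative name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="storage-transfer-demo-role",
    PolicyName="storage-transfer-s3-read",  # illustrative name
    PolicyDocument=json.dumps(permissions_policy),
)
print(role["Role"]["Arn"])  # pass this ARN to Storage Transfer Service later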

Transfer permissions in GCP

The GCP service account used to create the transfer job will need to be granted the Storage Transfer User role (roles/storagetransfer.user) and roles/iam.roleViewer. In addition, we need to give the Google-managed service account retrieved in the previous section access to the resources needed to complete transfers.

  • Navigate to the Cloud Storage Bucket demo-storage-transfer. In the permissions tab, click grant access.

Image description

  • In the new window, enter the Google-managed transfer service account email as the principal and assign it the Storage Admin role.

Image description
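
The same grant can be applied from code with the google-cloud-storage client. A minimal sketch, assuming the destination bucket from earlier and the Google-managed service account email returned by googleServiceAccounts.get:

# Sketch: grant the managed transfer service account Storage Admin on the sink bucket.
from google.cloud import storage

TRANSFER_SA = "project-<PROJECT_NUMBER>@storage-transfer-service.iam.gserviceaccount.com"

client = storage.Client()
bucket = client.bucket("demo-storage-transfer")
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.admin", "members": {f"serviceAccount:{TRANSFER_SA}"}}
)
bucket.set_iam_policy(policy)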

Create one-off batch Storage Transfer Job

We can interact with Storage Transfer Service programmatically with Python.

  • Copy this folder, which contains the requirements.txt and the script for initiating the storage transfer job, checking its status and verifying completion.
  • In a command line terminal window, run pip install -r requirements.txt to install the google-cloud-storage-transfer and google-cloud-storage libraries.
  • If you use a service account JSON key, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the key file. Otherwise, use one of the other GCP authentication options.
  • Now, run the following command to execute the transfer job script in a terminal of your choosing. This will transfer the data from the 2024/12 prefix in the S3 bucket to the GCP bucket under a Data prefix. We pass in the ARN of the role we created earlier, which will be assumed during the transfer to generate temporary credentials with the required permissions. A minimal sketch of what such a script might contain is shown after the screenshot below.
python python/storage_transfer.py --gcp_project_id <your-gcp-project-id> --gcp_bucket <your-gcp-bucket> --s3_bucket <your-s3-bucket> --s3_prefix <s3-prefix> --gcp_prefix <gcp-prefix> --role_arn <aws-role-arn>
  • You should see the logs as in the screenshot below. Wait for the job to show as completed.

Image description
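
For reference, here is a minimal sketch of what such a transfer script might contain, using the google-cloud-storage-transfer client. The job description is illustrative; the argument values correspond to the command line flags above.

# Sketch: create and run a one-off S3 -> GCS transfer job that assumes the
# AWS role created earlier (no AWS access keys are passed).
from datetime import datetime, timezone
from google.cloud import storage_transfer


def create_one_off_transfer(project_id, s3_bucket, s3_prefix, gcs_bucket, gcs_prefix, role_arn):
    client = storage_transfer.StorageTransferServiceClient()
    today = datetime.now(timezone.utc)
    one_time_schedule = {"day": today.day, "month": today.month, "year": today.year}

    transfer_job = {
        "description": "One-off S3 to GCS transfer",  # illustrative
        "project_id": project_id,
        "status": storage_transfer.TransferJob.Status.ENABLED,
        # Same start and end date makes this a one-off job
        "schedule": {
            "schedule_start_date": one_time_schedule,
            "schedule_end_date": one_time_schedule,
        },
        "transfer_spec": {
            "aws_s3_data_source": {
                "bucket_name": s3_bucket,
                "path": s3_prefix,     # e.g. "2024/12/"
                "role_arn": role_arn,  # assumed via federated identity
            },
            "gcs_data_sink": {
                "bucket_name": gcs_bucket,
                "path": gcs_prefix,    # e.g. "Data/"
            },
        },
    }

    job = client.create_transfer_job({"transfer_job": transfer_job})
    client.run_transfer_job({"job_name": job.name, "project_id": project_id})
    print(f"Created and started transfer job: {job.name}")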

Navigate to the Cloud Storage bucket and you should see the data in the bucket under the Data prefix.

Image description

You can monitor and check your transfer jobs from the Google Cloud Console UI. Open the Google Cloud Console and navigate to "Transfer Service". The jobs executed will be listed.

Image description

In the Monitoring tab, we can see plots of performance metrics (bytes transferred, objects processed, transfer rate etc.).

Image description

In the Operations and Configuration tabs, we can get more details about the transfer, e.g. run history, data transferred, and the other configuration options we set for the transfer job.

Image description
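
Transfer jobs can also be inspected programmatically, which is useful for verifying completion from a script. A minimal sketch, assuming the job name returned when the job was created:

# Sketch: look up a transfer job and print its status and latest operation.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()
job = client.get_transfer_job({"job_name": "<transfer-job-name>", "project_id": "<project-id>"})
print(job.status)                 # e.g. Status.ENABLED
print(job.latest_operation_name)  # transferOperations/... for the most recent run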

Create event driven transfer job

Event-driven transfers listen to Amazon S3 Event Notifications sent to Amazon SQS to know when objects in the source bucket have been modified or added.

Create an SQS queue in AWS

  • In AWS management console, go to the SQS service, click on "Create queue" and provide a name for the queue.
  • In the Access policy section, select Advanced. A JSON object is displayed. Paste the policy below, replacing the values for <SQS-RESOURCE-ARN>, <AWS-ACCOUNT-ID> and <S3_BUCKET_ARN>. This will only permit the SQS:SendMessage action on the SQS queue from the S3 bucket in the AWS account.
{
  "Version": "2012-10-17",
  "Id": "example-ID",
  "Statement": [
    {
      "Sid": "example-statement-ID",
      "Effect": "Allow",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Action": "SQS:SendMessage",
      "Resource": <SQS-RESOURCE-ARN>,
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": <AWS-ACCOUNT-ID>
        },
        "ArnLike": {
          "aws:SourceArn": <S3_BUCKET_ARN>
        }
      }
    }
  ]
}

Image description
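
If you prefer to script the queue creation, below is a minimal sketch with boto3. The queue name is illustrative, and the account ID placeholder needs to be filled in.

# Sketch: create the SQS queue with the access policy above attached.
import json
import boto3

AWS_ACCOUNT_ID = "<AWS-ACCOUNT-ID>"
S3_BUCKET_ARN = "arn:aws:s3:::demo-s3-transfer"
QUEUE_NAME = "storage-transfer-events"  # illustrative

sqs = boto3.client("sqs")
queue_arn = f"arn:aws:sqs:{sqs.meta.region_name}:{AWS_ACCOUNT_ID}:{QUEUE_NAME}"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "SQS:SendMessage",
            "Resource": queue_arn,
            "Condition": {
                "StringEquals": {"aws:SourceAccount": AWS_ACCOUNT_ID},
                "ArnLike": {"aws:SourceArn": S3_BUCKET_ARN},
            },
        }
    ],
}

queue = sqs.create_queue(QueueName=QUEUE_NAME, Attributes={"Policy": json.dumps(policy)})
print(queue["QueueUrl"])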

Now we need to enable event notifications on the S3 bucket, setting the SQS queue as the destination.

  • Navigate to the S3 bucket and select the Properties tab. In the Event notifications section, click Create event notification.
  • Specify a name for this event. In the Event types section, select "All object create events", as in the screenshot below.

Image description

  • For the Destination, select SQS queue and choose the queue you created previously.

Image description
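
The same notification configuration can also be applied with boto3, assuming the queue ARN from the previous step:

# Sketch: send "object created" notifications from the source bucket to the SQS queue.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="demo-s3-transfer",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "<SQS-RESOURCE-ARN>",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)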

Create an event driven Storage transfer job

We will now use the GCP cloud console to create an event-driven transfer job. Navigate to the GCP Transfer Service page and click Create transfer job.

  • Select Amazon S3 as the source type, and Cloud Storage as the destination.
  • For the Scheduling mode select Event-driven and click Next.

Image description

  • Enter the S3 bucket name. We will use the same bucket we used previously for the one-off transfer but you can use a different one if you wish.
  • Enter the Amazon SQS queue ARN that you created earlier, as in the screenshot below.

Image description

  • Select the destination Cloud Storage bucket path (which can optionally include a prefix) as in the screenshot below.
  • Leave the rest of the options as defaults and click create.

Image description

  • The transfer job starts running and an event listener waits for notifications on the SQS queue.

Image description
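
If you prefer to create the event-driven job programmatically, the Python client accepts an event_stream pointing at the SQS queue ARN. A minimal sketch, with the job description being illustrative:

# Sketch: create an event-driven S3 -> GCS transfer job that listens to the SQS queue.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()
transfer_job = {
    "description": "Event-driven S3 to GCS transfer",  # illustrative
    "project_id": "<project-id>",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "transfer_spec": {
        "aws_s3_data_source": {"bucket_name": "demo-s3-transfer", "role_arn": "<aws-role-arn>"},
        "gcs_data_sink": {"bucket_name": "demo-storage-transfer"},
    },
    # The SQS queue the job listens to for S3 event notifications
    "event_stream": {"name": "<SQS-RESOURCE-ARN>"},
}
job = client.create_transfer_job({"transfer_job": transfer_job})
print(f"Created event-driven transfer job: {job.name}")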

We can test this by putting some data into the S3 bucket source location and observing the objects being replicated from the AWS S3 bucket to the GCS bucket. You can also view monitoring details in the SQS queue.

Image description

Conclusion

GCP's Storage Transfer Service is a powerful tool for transferring data from S3 to GCS. It offers a cost-effective, scalable, and secure solution for data migration, with flexible scheduling and data filtering options. In this practical blog, we walked you through the steps required to set up GCP's Storage Transfer Service for transferring data from S3 to GCS. By following these steps, you can easily migrate your data from S3 to GCS with minimal effort and maximum efficiency.
