DEV Community

Jacob
Jacob

Posted on

Bridging Clouds: A Comprehensive Guide to Connecting AWS Glue Tables with Snowflake

In today's data-driven world, organizations often find themselves working with multiple cloud platforms to leverage the unique strengths of each. A common scenario is using AWS for data storage and processing, while utilizing Snowflake for advanced analytics and data warehousing. But how do you efficiently bridge these two powerful platforms? In this guide, we'll walk you through the process of connecting AWS Glue tables to Snowflake, enabling seamless data access and analytics across your cloud environments.

Prerequisites

Before we dive in, make sure you have:

  • An AWS account with appropriate permissions
  • A Snowflake account with admin access
  • Basic knowledge of AWS IAM, S3, and Glue
  • Familiarity with Snowflake

The Connection Process: A Step-by-Step Guide

1. Creating an IAM Role for S3 Access

Our journey begins in AWS, where we'll create an IAM role that grants Snowflake access to your S3 bucket and AWS Key Management Service (KMS).

  1. Navigate to the IAM console in AWS
  2. Create a new role
  3. Attach the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:region:account-id:key/key-id"
        }
    ]
}
Enter fullscreen mode Exit fullscreen mode
  1. Add this assume role policy to the trust relationships:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::000000000000:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Remember to replace placeholder values with your specific details. We'll update the 000000000000 later with Snowflake's account ID.

2. Setting Up a Storage Integration in Snowflake

Now, let's hop over to Snowflake to create a storage integration using our newly minted IAM role.

Execute this SQL in Snowflake:

CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = S3
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::account-id:role/role-name'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');

DESC INTEGRATION s3_int;
Enter fullscreen mode Exit fullscreen mode

After creation, run DESC INTEGRATION s3_int; again and note down the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID. You'll need these for the next step.

3. Enhancing Security with External ID

Back in AWS, let's add an extra layer of security to our IAM role using the external ID we just obtained.

Update the trust relationship of your S3 access role with:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account-id:user/username"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "0000_external_id_from_snowflake"
        }
      }
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Use the values from the DESC INTEGRATION command to replace account-id, username, and 0000_external_id_from_snowflake.

4. Creating a Glue Catalog Connection in Snowflake

Now, let's set up a connection to the AWS Glue Data Catalog in Snowflake.

Run these SQL commands:

CREATE DATABASE glue_db;

CREATE SCHEMA glue_db.glue_schema;

CREATE EXTERNAL CATALOG glue_catalog
  WITH LOCATION='aws_glue_catalog'
  CATALOG_ROLE='arn:aws:iam::account-id:role/glue-access-role'
  CATALOG_EXTERNAL_ID='your_external_id'
  REGION='your-aws-region';

DESC CATALOG glue_catalog;
Enter fullscreen mode Exit fullscreen mode

Note down the CATALOG_AWS_IAM_USER_ARN and CATALOG_EXTERNAL_ID from the DESC CATALOG command output.

5. Creating an IAM Role for Glue Access

Back in AWS, we need to create another IAM role specifically for Glue access.

  1. Create a new IAM role
  2. Attach this policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "arn:aws:glue:region:account-id:catalog",
                "arn:aws:glue:region:account-id:database/*",
                "arn:aws:glue:region:account-id:table/*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:region:account-id:key/key-id"
        }
    ]
}
Enter fullscreen mode Exit fullscreen mode
  1. Add this assume role policy to the trust relationships:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "CATALOG_AWS_IAM_USER_ARN"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "CATALOG_EXTERNAL_ID"
        }
      }
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Use the CATALOG_AWS_IAM_USER_ARN and CATALOG_EXTERNAL_ID you noted earlier.

6. Creating an Iceberg Table in Snowflake

Finally, let's create an Iceberg table in Snowflake that uses our new storage integration and Glue catalog connection.

Execute this SQL:

CREATE OR REPLACE TABLE glue_db.glue_schema.my_iceberg_table
  EXTERNAL_CATALOG = glue_catalog
  EXTERNAL_DATABASE = 'your_glue_database'
  EXTERNAL_TABLE = 'your_glue_table'
  EXTERNAL_LOCATION = 's3://your-bucket-name/path/to/iceberg/data/'
  STORAGE_INTEGRATION = s3_int;
Enter fullscreen mode Exit fullscreen mode

7. Auto refresh

This week snowflake release the feature to auto fresh.

https://docs.snowflake.com/user-guide/tables-iceberg-auto-refresh

CREATE CATALOG INTEGRATION auto_refresh_catalog_integration
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'my_catalog_namespace'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789123:role/my-catalog-role'
  GLUE_CATALOG_ID = '123456789123'
  ENABLED = TRUE
  REFRESH_INTERVAL_SECONDS = 60;
Enter fullscreen mode Exit fullscreen mode

or change later with ALTER CATALOG INTEGRATION auto_refresh_catalog_integration SET REFRESH_INTERVAL_SECONDS = 120;

Conclusion: Unleashing the Power of Cross-Cloud Analytics

Congratulations! You've successfully built a bridge between AWS Glue and Snowflake. This powerful connection allows you to:

  1. Query AWS Glue data directly from Snowflake
  2. Combine AWS data with other sources in Snowflake for comprehensive analytics
  3. Leverage Snowflake's powerful querying capabilities on your AWS data
  4. Maintain data in different cloud environments while centralizing analytics

Remember to monitor your usage and costs, and always follow best practices for data security and compliance. This setup opens up new possibilities for data analysis and processing, allowing you to get the best of both AWS and Snowflake worlds.

Happy analyzing!

Top comments (0)