Using Snowflake data hosted in GCP with AWS Glue

This post covers a use case: accessing data held in a Snowflake database hosted on GCP from an AWS Glue ETL job.

What is Snowflake?

Snowflake is a cloud-based data warehousing platform that provides a fully managed and scalable solution for storing and analyzing large volumes of data. It is not a traditional relational database but rather a data warehouse as a service. Snowflake is designed to work with cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

What is AWS Glue?

AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS). It is designed to make it easy for users to prepare and load their data for analysis. AWS Glue simplifies the process of building and managing ETL workflows by providing a serverless environment for running ETL jobs.

Snowflake Prerequisites

As I wasn't already a Snowflake user, I signed up for a free trial to work on this use case. The process was simple, and I didn't need to provide any form of payment to try it out. Well done, Snowflake!

Test Data

I generated some test data to load into Snowflake using Mockaroo.


Snowflake Setup

Within Snowflake I created a database, schema and a warehouse.

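These can also be created with SQL rather than the UI. A minimal sketch, using the database, schema and warehouse names (all matt) that appear in the grant statements later in this post:

create warehouse if not exists matt with warehouse_size = 'XSMALL';
create database if not exists matt;
create schema if not exists matt.matt;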

With these in place, I loaded the data from my Mockaroo-generated JSON file into a new table.

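I did this through the Snowflake web UI, but roughly equivalent SQL looks something like the sketch below. The column list is hypothetical (only car_make is confirmed later in the post), and the put command has to be run from SnowSQL or a driver rather than a worksheet:

create or replace table matt.matt.car_sales (
    id integer,          -- hypothetical column
    car_make varchar,    -- referenced later in the Glue SQL transform
    car_model varchar,   -- hypothetical column
    car_year integer     -- hypothetical column
);

create or replace stage matt.matt.car_sales_stage;
-- put file:///path/to/your-mockaroo-file.json @matt.matt.car_sales_stage;  (run from SnowSQL)

copy into matt.matt.car_sales
from @matt.matt.car_sales_stage
file_format = (type = 'json' strip_outer_array = true)
match_by_column_name = case_insensitive;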

The data can now be queried.

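For example, a quick sanity check from a worksheet:

select car_make, count(*) as sales
from matt.matt.car_sales
group by car_make
order by sales desc;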

Next, I created the user and role that Glue will use to connect to Snowflake.

Create a role called car_sales.

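In SQL terms this is simply:

create role if not exists car_sales;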

Execute the SQL below to assign privileges to the new role.

grant select on table matt.matt.car_sales to role car_sales;
grant usage on database matt to role car_sales;
grant usage on schema matt to role car_sales;
grant usage on warehouse matt to role car_sales;

Create a user for Glue to connect as, ensuring you set the default warehouse and assign the car_sales role.

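A rough SQL equivalent of the UI steps is below. The user name GLUE matches the user that shows up later in Snowflake's query history; the password placeholder is whatever you will store in Secrets Manager as sfPassword:

create user if not exists glue
    password = '<strong-password>'      -- store this in Secrets Manager as sfPassword
    default_warehouse = matt
    default_role = car_sales;

grant role car_sales to user glue;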

AWS Glue Setup

Create an IAM role for the Glue job to execute with. Note that Glue will need access to Secrets Manager in order to read the Snowflake credentials.


In Secrets Manager, create a secret.

Select Other type of secret. For the keys use sfUser, sfPassword and sfWarehouse.


Now, in Glue, create a Data connection to Snowflake.


You can find your Snowflake account URL in Snowflake by selecting Admin > Accounts, then choosing Manage URLs from the "..." menu.

For the AWS Secret, select the one you created earlier.

Give your connection a sensible name. I used snowflake_glue_connection, which you will see referenced in the generated script later.


Extract Snowflake Data with Glue ETL

Create a new Visual ETL job in Glue.


From Sources, select Snowflake.


On the data source properties, choose the Snowflake connection created earlier and set the database to matt, the schema to matt and the table to car_sales (these values appear in the generated script later).


Test that the connection is working by selecting the Snowflake node in the visual editor. This will open the Data preview window. Select the IAM role created earlier, which starts a data preview and displays a sample of the data from Snowflake. This takes a few minutes to run.


Within the query activity in Snowflake you will see that the query has been executed by the GLUE user.

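You can also confirm this from a worksheet by querying the account usage views (note that ACCOUNT_USAGE lags real time by a while):

select start_time, user_name, warehouse_name, query_text
from snowflake.account_usage.query_history
where user_name = 'GLUE'
order by start_time desc
limit 20;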

You can continue to build out your ETL job within Glue. For example, if I only want cars made by Porsche, I could add a SQL Query transform step.


Selecting the Script tab provides the Glue ETL code for the job.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame


def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Snowflake
Snowflake_node1706530098901 = glueContext.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "dbtable": "car_sales",
        "connectionName": "snowflake_glue_connection",
        "sfDatabase": "matt",
        "sfSchema": "matt",
    },
    transformation_ctx="Snowflake_node1706530098901",
)

# Script generated for node SQL Query
SqlQuery0 = """
select * from myDataSource
where car_make = 'Porsche'
"""
SQLQuery_node1706531170324 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": Snowflake_node1706530098901},
    transformation_ctx="SQLQuery_node1706531170324",
)

job.commit()

Conclusion

The cloud provider used for hosting Snowflake makes little difference to how easily Snowflake integrates as a data source in Glue ETL jobs.

The low-code/no-code interface for building ETL jobs in Glue makes it simple to gather data from many sources once the initial IAM roles, secrets and connections are in place.
