Raj Das

Data Ingestion in Snowflake with Google Cloud Storage - Part I

In this article, we will walk through the process of ingesting data into Snowflake from Google Cloud Storage (GCS). Along the way, we will cover the required concepts and tasks in a step-by-step manner.

You can refer to the official Snowflake documentation on configuring an integration for Google Cloud Storage.

Assumptions

  • You have an active Google Cloud account with permissions to create IAM roles, GCS buckets, and Pub/Sub topics.
  • You have an active Snowflake account with permissions to create Database, Schema, Stage, and Integration objects, and to run queries.

Data Ingestion Flow

Data Source: Open Data

Create a bucket in GCS (GCP)
We create a bucket in GCS named snowflake_gcs-stage-bucket, with a folder raw_csv inside it, to stage the files that we will later load into Snowflake.
GCS Bucket and Folder
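If you prefer the command line, the same setup can be done with the gcloud CLI. A minimal sketch, assuming a recent gcloud with the gcloud storage command group, a location of your choice, and the sample CSV used later in this article available locally:

# Create the staging bucket (pick a location that suits you)
gcloud storage buckets create gs://snowflake_gcs-stage-bucket --location=us-central1

# GCS folders are just object-name prefixes; uploading under raw_csv/ creates the "folder"
gcloud storage cp beer-manufacturers-microbrewers-brands.csv gs://snowflake_gcs-stage-bucket/raw_csv/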

For the steps described here, the ACCOUNTADMIN role and the COMPUTE_WH warehouse were used in the Snowflake account.
Snowflake Account and Warehouse

Create a Database (Snowflake)
Here we create a database to hold the manufacturing details of various breweries.

CREATE DATABASE manuf_db;

USE manuf_db;

Create Storage Integration Object (Snowflake)
A storage integration is a Snowflake object. It stores a generated identity and access management (IAM) entity for the external cloud storage location, in our case Google Cloud Storage (GCS). The IAM entity is then granted permissions to access the GCS location within the GCP project.

In the SQL below, we create the storage integration object, which will have access to two locations, namely snowflake_gcs-stage-bucket/ and snowflake_gcs-stage-bucket/raw_csv/. This object will enable us to create a stage for our files which are located externally, i.e. outside of Snowflake.

CREATE STORAGE INTEGRATION snowflake_gcp_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://snowflake_gcs-stage-bucket/', 'gcs://snowflake_gcs-stage-bucket/raw_csv/');

In case you get the error below, make sure the gcs:// URL prefix precedes the GCS bucket/folder location.
Integration Error

DESC STORAGE INTEGRATION snowflake_gcp_int;

Storage Integration Object

Make a note of the STORAGE_GCP_SERVICE_ACCOUNT value in the output above. This is a service account created by Snowflake which needs to be granted appropriate permissions in Google Cloud so that it can access GCS.
You may notice that the nomenclature of the service account looks like something from an AWS account rather than GCP. This is because my Snowflake account is hosted on AWS.
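If you want to pull out just that value, one option is to filter the DESC output with Snowflake's RESULT_SCAN. A small sketch (run the SELECT immediately after the DESC, since RESULT_SCAN reads the previous query's result):

DESC STORAGE INTEGRATION snowflake_gcp_int;

-- The DESC output has "property" and "property_value" columns
SELECT "property_value" AS storage_gcp_service_account
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "property" = 'STORAGE_GCP_SERVICE_ACCOUNT';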

The STORAGE_GCP_SERVICE_ACCOUNT needs to be granted the following permissions:

  • storage.buckets.get
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list

Create an IAM Role (GCP)
Create a role in Google Cloud and add the required permissions. This step can be performed using the GCP Web UI or the gcloud CLI. The command shown below creates a new role snowflake_gcp_role and adds the required permissions to it. Replace GCP-PROJECT-ID with your own GCP project ID in the command below.

gcloud iam roles create snowflake_gcp_role --project=GCP-PROJECT-ID \
  --description="This role will be assigned to Snowflake Account to enable GCS Access" \
  --permissions=storage.buckets.get,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.list

GCP Role Creation
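Optionally, you can verify the role and its permissions from the CLI:

gcloud iam roles describe snowflake_gcp_role --project=GCP-PROJECT-ID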

Attach Principal and assign Role (GCP)
In this step we will add the STORAGE_GCP_SERVICE_ACCOUNT as a principal on the GCS bucket and assign the role to it. In the GCP Web UI, go to Cloud Storage and open the GCS bucket we created for staging our files. Click on the Permissions tab.
Add Permissions to GCS

Now click on Grant Access
Grant Access

Under Add Principals, enter the STORAGE_GCP_SERVICE_ACCOUNT name, and under Role, select the newly created IAM role in GCP.
Add Principal and Role
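The same grant can also be done from the CLI. A sketch, assuming a recent gcloud, with SERVICE-ACCOUNT-EMAIL standing in for the STORAGE_GCP_SERVICE_ACCOUNT value noted earlier:

gcloud storage buckets add-iam-policy-binding gs://snowflake_gcs-stage-bucket \
  --member="serviceAccount:SERVICE-ACCOUNT-EMAIL" \
  --role="projects/GCP-PROJECT-ID/roles/snowflake_gcp_role"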

Create an external stage (Snowflake)
Here we create an external stage in Snowflake which references data files stored in a location within GCS. We provide the name of the storage integration object created earlier, which has the required access permissions to the external GCS location.

CREATE SCHEMA stages;

CREATE STAGE manuf_db.stages.snowflake_gcp_stage
  URL = ('gcs://snowflake_gcs-stage-bucket/raw_csv/')
  STORAGE_INTEGRATION = snowflake_gcp_int;

SHOW STAGES IN manuf_db.stages;

List files from the external stage (Snowflake)
The LIST command may return empty results if no files have been loaded into the designated GCS bucket/folder.
List Output

LIST @manuf_db.stages.snowflake_gcp_stage;

The LIST command may throw an error like the one below if the STORAGE_GCP_SERVICE_ACCOUNT was not granted appropriate permissions on the GCS bucket/folder.
List Error

Once you load files into the GCS bucket/folder, you will see output similar to the below.
GCS Bucket Folder

List Output

You can query the file in the external stage using the SQL below.

SELECT $1, $2, $3 
FROM @manuf_db.stages.snowflake_gcp_stage/beer-manufacturers-microbrewers-brands.csv;

Stage File Output

Notice that the header row shows up as part of the content from the staged file. We need to skip the header when loading the data. We can create and re-use a File Format object for this.

Create a file format (Snowflake)
A file format is a named object in Snowflake which holds information about a particular file format, such as CSV or JSON, along with certain attributes associated with that format. This object comes in handy for describing the files in the staged location while loading them into Snowflake tables.

CREATE SCHEMA file_formats;

CREATE OR REPLACE FILE FORMAT manuf_db.file_formats.file_format_csv
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"';

SHOW FILE FORMATS IN manuf_db.file_formats;
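To confirm the header is now skipped, we can re-run the earlier staged-file query and pass the file format inline (Snowflake lets you supply a FILE_FORMAT option when querying a staged file):

SELECT $1, $2, $3
FROM @manuf_db.stages.snowflake_gcp_stage/beer-manufacturers-microbrewers-brands.csv
(FILE_FORMAT => 'manuf_db.file_formats.file_format_csv');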

Create Table (Snowflake)
A table is created in manuf_db. The public schema is the default schema in Snowflake.

CREATE TABLE manuf_db.public.manuf_tbl(
    manuf_type  varchar(20),
    manuf_name  varchar(100),
    manuf_brand varchar(100)
);

SELECT * FROM manuf_db.public.manuf_tbl;

Loading data into Table (Snowflake)
The COPY INTO command loads the data from the staged file beer-manufacturers-microbrewers-brands.csv into the existing table manuf_tbl, referencing the stage snowflake_gcp_stage created earlier. Multiple runs of the same COPY INTO will not duplicate the data unless FORCE = TRUE is specified as an optional clause.

COPY INTO manuf_db.public.manuf_tbl
FROM @manuf_db.stages.snowflake_gcp_stage
FILE_FORMAT = manuf_db.file_formats.file_format_csv
FILES = ('beer-manufacturers-microbrewers-brands.csv');

SELECT COUNT(*) FROM manuf_db.public.manuf_tbl;

Table Output
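The de-duplication comes from Snowflake's load history for the table. If you deliberately want to reload the same file, add the FORCE clause; a sketch (note this would double the row count):

COPY INTO manuf_db.public.manuf_tbl
FROM @manuf_db.stages.snowflake_gcp_stage
FILE_FORMAT = manuf_db.file_formats.file_format_csv
FILES = ('beer-manufacturers-microbrewers-brands.csv')
FORCE = TRUE;  -- reloads the file even though it is already in the load history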

These are all the Snowflake objects that we created in our account.
Snowflake Objects

Clean Up (Snowflake)
Clean up by deleting the data from the table and also removing the staged file from GCS.

DELETE FROM manuf_db.public.manuf_tbl;

REMOVE @manuf_db.stages.snowflake_gcp_stage PATTERN='.*.csv';

Delete Stage File
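If you would rather drop every object created in this walkthrough (skip this if you plan to follow Part II, which builds on these objects):

DROP TABLE manuf_db.public.manuf_tbl;
DROP FILE FORMAT manuf_db.file_formats.file_format_csv;
DROP STAGE manuf_db.stages.snowflake_gcp_stage;
DROP DATABASE manuf_db;
DROP STORAGE INTEGRATION snowflake_gcp_int;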

In Part II of this series, we will use Snowpipe to auto-load data into Snowflake.

Kindly let me know if this article was helpful. Your feedback is highly appreciated.
