Mastering Data Integration: Syncing Diverse Sources to Snowflake Raw Tables with OpenFlow
Photo by Solen Feyissa on Unsplash
tl;dr
Up until now, you would have had to purchase a separate service to ingest and integrate data into Snowflake, Fivetran, Airbyte, and many other services. The ability to consolidate information from disparate sources into a central repository for analysis is a key requirement for data. Snowflake just released a new service, Snowflake Openflow , a powerful data integration service based on Apache NiFi, that enhances this capability by allowing you to build and manage data pipelines from a multitude of sources directly into Snowflake.
This article will walk you through the conceptual implementation of OpenFlow, which synchronizes data from multiple sources hosted within Snowflake. This tool is designed for teams that require GUI-based solutions as part of their toolkit.
Understanding Snowflake Openflow: The Engine for Your Data Pipelines
Before we discuss specific implementations, it’s crucial to understand OpenFlow. Snowflake Openflow is an integration service built upon the robust foundations of Apache NiFi. It operates as a managed service within your own cloud environment (initially on AWS), offering you control and security of your data pathways.
Think of OpenFlow as an orchestration layer where you can:
- Design data flows using a visual canvas.
- Utilize a rich library of pre-built connectors and processors for various data systems and transformations.
- Manage these flows within Deployments (your OpenFlow environment in your cloud) and Runtimes (isolated instances where your data flows execute).
The goal here is to land data into a “raw table” in Snowflake. This raw table will act as the initial staging area for your source data, preserving its original structure (or as close as possible) before any further transformations or modeling within Snowflake. This approach is fundamental for data lineage, auditing, and re-processing capabilities.
Use cases
Openflow is a fantastic choice if you're looking for a UI-based tool to easily gather data from any source and effortlessly send it to any destination. You can use any Snowflake Openflow or Apache NiFi processors and controllers. Plus, you'll enjoy the added benefits of Snowflake’s built-in data security and governance, making the experience even better. Examples of Openflow use cases include:
- Extract information from unstructured sources like Google Drive and Box, preparing it for chat with your AI assistants via Snowflake Cortex, or utilize the data for your custom processing needs.
- Transfer change data capture (CDC) from database tables into Snowflake to enable comprehensive and centralized reporting.
- Capture real-time events from streaming services, such as Apache Kafka, importing them into Snowflake for near real-time analytics.
- Import data from SaaS platforms, like LinkedIn Ads, to Snowflake for detailed reporting, analytics, and insights.
Setting the Stage: Prerequisites
These are the prerequisite steps to set up the Snowflake OpenFlow service.
- If you’ve never used image repositories in your Snowflake account, you need to set the following to make sure the deployment service can pull the Openflow images from Snowflake.
USE ROLE accountadmin;
CREATE DATABASE IF NOT EXISTS OPENFLOW;
USE OPENFLOW;
CREATE SCHEMA IF NOT EXISTS OPENFLOW;
USE SCHEMA OPENFLOW;
CREATE IMAGE REPOSITORY IF NOT EXISTS OPENFLOW;
GRANT USAGE ON DATABASE OPENFLOW TO ROLE PUBLIC;
GRANT USAGE ON SCHEMA OPENFLOW TO ROLE PUBLIC;
GRANT READ ON IMAGE REPOSITORY OPENFLOW.OPENFLOW.OPENFLOW TO ROLE PUBLIC;
GRANT READ ON IMAGE REPOSITORY OPENFLOW.OPENFLOW.OPENFLOW TO ROLE OPENFLOWADMIN;
GRANT WRITE ON IMAGE REPOSITORY OPENFLOW.OPENFLOW.OPENFLOW TO ROLE OPENFLOWADMIN;
- Setting up Openflow requires defining new privileges at the Snowflake Account level. The new privileges are assigned to the ACCOUNTADMIN role as part of the default set of privileges. ACCOUNTADMIN will automatically have the following two privileges and will be able to grant them to a role of their choosing for the Openflow admin role, denoted as OPENFLOW_ADMIN in the following code:
USE ROLE ACCOUNTADMIN;
GRANT CREATE OPENFLOW DATA PLANE INTEGRATION ON ACCOUNT TO ROLE OPENFLOWADMIN;
GRANT CREATE OPENFLOW RUNTIME INTEGRATION ON ACCOUNT TO ROLE OPENFLOWADMIN;
- Grant User to OPENFLOWADMIN role
USE ROLE ACCOUNTADMIN;
ALTER USER <yourusername> set default_secondary_roles = ('ALL');
GRANT ROLE OPENFLOWADMIN TO USER <yourusername>;
- Accept the Openflow terms of service
Snowflake OpenFlow Configuration
To begin using OpenFlow, you’ll typically perform these preliminary steps:
- Create a Deployment: The Snowflake UI facilitates this process by launching a CloudFormation stack within your AWS account. You have the option to use a Snowflake-managed VPC or to use your own VPC. In this article, I will use my own VPC; however, a follow-up article featuring the first option may be published later.
FYI: Users cannot sign in to Openflow if their default role is ACCOUNTADMIN, ORGADMIN, GLOBALORGADMIN, or SECURITYADMIN. You must change the default role for your user to a role other than ACCOUNTADMIN, ORGADMIN, GLOBALORGADMIN, or SECURITYADMIN to log in to Openflow.
- Make your role OPENFLOW_ADMIN.
- Go to Data » Openflow.
- Choose Launch Openflow.
- In the Openflow UI, select Create a deployment. The Deployments tab opens.
- Choose Create a deployment. The deployment creation wizard appears.
- In the Prerequisites step, verify that you fulfill all the requirements. Click Next.
- During the Deployment Location step, choose Amazon Web Services as your deployment site, since it is currently the only option available. Enter a name for your deployment and then select Next.
- During the Configuration step, you have the option to choose from one of the following configurations:
- Fully managed VPC : Select this option if you prefer Snowflake to manage your VPC.
- Use Your Existing VPC : Select this option if you wish to utilize a pre-existing VPC. I opt to bring my own VPC for the deployment.
- Select Create Deployment , which will then provide you with a dashboard.
When the state moves to Active, you can continue creating Runtimes.
- AWS deployment using the AWS CloudFormation script provided during the deployment creation.
Download the template, and go to AWS Cloudformation with the assumption that you have a VPC with at least 2 private and 2 public subnets with NAT gateways already deployed.
Using that Cloudformation template, all you need to do is give it a name and select the VPC and two private subnets.
It will launch an EC2 server that after 45 mins to 1 hour will register a Deployment as active.
As an early product in this step, I got stuck and could not continue until the support team helped me with some errors. Here are some diagnosis steps, in case you need them or just want to know what is happening on the EC2 server.
1. Login to the Openflow Agent EC2 instance
EC2 Instance Connect - Connection Type: "Connect using EC2 Instance Connect Endpoint" - Connect
2. These scripts are available in the home directory /home/ec2-user
a. journalctl -xe -f -n 100 -u docker
b. journalctl -u openflow-sync-images -f -n 500
This helped find a problem where I had not given read access to the image repository.
also there is a diagnostic.sh in the server you can run.
In the end, the issue was that the Network Load Balancer for Kubernetes required all public subnets to have “kubernetes.io/role/elb” with a value of 1. This resolved the problem. Additionally, I had to grant READ access to the PUBLIC role for the image repository.
- Establish Runtimes: In your deployment, you create runtimes, which are the engines that execute your data integration flows. You can use different runtimes to isolate projects, teams, or development stages. This step took about two minutes by clicking the button Create Runtime.
12. Access the OpenFlow Canvas: This serves as your design surface where you will construct or configure the data pipelines.
In the Runtimes tab, click on your runtime, in my case archetype-sandbox, go to the end with three dots, and click View canvas.
It will open a new URL, and you will have to log in again with your Snowflake user.
At this point, you are ready to build your data workflows in Snowflake Openflow (aka Apache NiFi). Let me take you through a simple workflow. The UI looks like this:
My First Data Workflow Setup
The first thing to note is that I won't be teaching you an Openflow workflow here; instead, I'll help you set up a basic one to show you how it is done.
There is no real-life case here for now, but I may augment the article later with a few more use cases. I create a first group that generates fake Avro data and then feeds it to Snowflake as a variant and logs failures into logs.
You can see four types of Controllers, GenerateRecord, QueryRecord, LogMessage, and PutDatabaseRecord, that were used.
I will not try to teach you the UI at this point, as it is in essence Apache Nifi, you should check out the many videos on your YouTube, but let me add here some tips and tricks to get the first workflow done for Snowflake Openflow.
Conclusion
Snowflake OpenFlow, leveraging its NiFi foundation, provides a flexible and powerful solution for managing ingestion pipelines. While specifics about connector availability and configuration may vary, the core principles of defining sources, processing data in real-time, and loading it into Snowflake remain consistent.
I am Augusto Rosa, a Snowflake Data Superhero, Snowflake SME, and Snowflake Toronto User-Group Organizer. I am also the Head of Data, Cloud, & Security Architecture at Archetype Consulting. You can follow me on LinkedIn.
Subscribe to my Medium blog https://blog.augustorosa.com and Archetype Consulting blogs https://blog.archetypeconsulting.com/ for the most interesting Data Engineering and Snowflake news.
References
- …
Top comments (0)