Giuliano Ribeiro

Posted on Dec 5, 2019

Cloud Data Fusion, a game-changer for GCP

#googlecloud #gcp #datafusion #bigdata

Continuing the Big Data topic, I want to share with you this post about Google Cloud Data Fusion.

The foundation

Cloud Data Fusion is based on Cask™ Data Application Platform (CDAP).
CDAP was created by a company named Cask, which Google acquired last year. CDAP was then incorporated into Google Cloud under the name Google Cloud Data Fusion.

Data Fusion

Data Fusion is a fully managed CDAP on steroids 🧬.

From the Google Cloud page:

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.

One important detail about Data Fusion is that it relies on Cloud Dataproc (Spark) and handles the cluster lifecycle (creation and deletion) for you.
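To make the "create and delete" idea concrete, here is a minimal sketch of the ephemeral-cluster pattern in plain Python. It is only an illustration: `FakeCluster`, `ephemeral_cluster` and `run_pipeline` are hypothetical names I made up, and nothing here calls the real Dataproc or Data Fusion APIs.

```python
from contextlib import contextmanager


class FakeCluster:
    """Hypothetical stand-in for a managed Spark cluster (not a real client)."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def submit(self, job):
        if not self.running:
            raise RuntimeError("cluster is not running")
        return f"{job} finished on {self.name}"

    def delete(self):
        self.running = False


@contextmanager
def ephemeral_cluster(name):
    """Create a cluster for a single run and always delete it afterwards."""
    cluster = FakeCluster(name)  # "create" step
    try:
        yield cluster
    finally:
        cluster.delete()  # "delete" step, executed even if the job fails


def run_pipeline(job):
    """Run one job on a short-lived cluster and report the cluster state."""
    with ephemeral_cluster(f"ephemeral-{job}") as cluster:
        result = cluster.submit(job)
    return result, cluster.running
```

The point of the pattern is that the cluster only exists while a job is running, so you never have to remember to tear it down yourself.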

The game-changer in Data Fusion is its graphical interface, which gives users an easy way to build anything from a simple transformation pipeline to a complex one. Best of all: without writing a line of code.

This post details the first two options: Wrangler and Integrate.
At the end of this post, you can watch a video tutorial where I walk through Data Fusion step by step.

Wrangler

Wrangler is the central place to prepare your raw data. You can upload files or connect to a variety of external sources like databases, Kafka, S3, Cloud Storage, BigQuery and Spanner.

Choose your source file or something else

Right after you choose your source, you are redirected to the parsing mode.

As you can see at the top, there's a tab called Insights. There you can see some useful graphs about your data:

Insights tab — Insights about your data!

Studio

In the Studio, you have all the tools you need to create your data pipeline: the source, the transformation, and the sink, each with a wide range of choices. Pick your tool!

This is the main page of the Studio. On the left you have Source, Transform, Analytics, Sink, "Conditions and Actions" and "Error Handlers And Alerts". The gray area is where you design your pipeline.
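If it helps to think about it in code, a pipeline designed in the Studio is essentially a graph of stages joined by connections. The sketch below is my own simplified model, not Data Fusion's actual pipeline format: the stage names, the `stages` and `connections` structures and the `execution_order` helper are all hypothetical. It uses Python's standard `graphlib` module (Python 3.9+) to work out an order in which the stages can run.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stage names and types, for illustration only.
stages = {
    "read_csv": "source",
    "clean_columns": "transform",
    "write_table": "sink",
}

# Each connection is an arrow drawn on the canvas: (from_stage, to_stage).
connections = [
    ("read_csv", "clean_columns"),
    ("clean_columns", "write_table"),
]


def execution_order(stages, connections):
    """Return stage names in an order that respects every connection."""
    graph = {name: set() for name in stages}
    for upstream, downstream in connections:
        graph[downstream].add(upstream)  # downstream depends on upstream
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages, connections)
```

With the three stages above, the source runs first, then the transform, then the sink, which is the same left-to-right flow you draw on the canvas.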

A simple but complete pipeline:

All available tools in the Studio:

You can also install more sources and sinks, in the Hub:

Pipeline

After designing your pipeline in the Studio, you need to deploy it. Just click the "Deploy" button and you will see your pipeline on the next page. On this page, you can run the job, watch the logs, configure a schedule, and see a summary of executions.

At the top in the center, click on "Summary" to check some facts about your jobs.

Step-by-step

This step-by-step was recorded and edited by me (sorry for any issues).
You'll see the complete design and execution of a pipeline. The steps of this pipeline are:

Ingest a CSV
Parse the CSV
Prepare some columns of the CSV
Use the Join Transformation tool
Connect to a PostgreSQL(CloudSQL) instance
Get information about the States of Brazil
Join it with the corresponding column of the CSV
Output the result to a new table in BigQuery
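For readers who prefer code, the steps above can be roughly sketched in plain Python. This is not what Data Fusion generates, just the same logic in miniature. The sample CSV, the column names and the `STATES` lookup are made up for the example; in the video the lookup comes from PostgreSQL (Cloud SQL) and the output goes to BigQuery.

```python
import csv
import io

# Made-up sample data standing in for the ingested CSV file.
RAW_CSV = "city,state_code,sales\nSalvador, ba ,10\nParaty,RJ,7\n"

# Made-up lookup standing in for the table of Brazilian states in PostgreSQL.
STATES = {"BA": "Bahia", "RJ": "Rio de Janeiro"}


def parse_csv(text):
    """Parse the CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(row):
    """Clean up the columns: trim spaces, normalize the code, cast the number."""
    return {
        "city": row["city"].strip(),
        "state_code": row["state_code"].strip().upper(),
        "sales": int(row["sales"]),
    }


def join_states(rows, states):
    """Inner join: keep rows whose state code exists in the lookup."""
    return [
        dict(row, state_name=states[row["state_code"]])
        for row in rows
        if row["state_code"] in states
    ]


def run(text, states):
    """Run the whole flow and return the rows that would be written out."""
    rows = [prepare(r) for r in parse_csv(text)]
    return join_states(rows, states)


result = run(RAW_CSV, STATES)
```

Each function maps to one box you would drop on the Studio canvas: parse, prepare, join. The final list is what the sink would write as a new table.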

Conclusion

As you can see, Data Fusion is an amazing tool for data pipelines. It is powerful because it can handle as much data as you have by taking advantage of Google Cloud features, and it makes designing the workflow impressively easy. Another great feature is the ability to create your own connector: since Data Fusion is based on CDAP, you can develop a connector and deploy it to the Hub.
Data Fusion has been GA since last week and already has a lot of big customers using it. This is just the beginning for this incredible tool.

As usual, please share this post and consider giving me feedback!

Thank you so much!

Top comments (7)

JS Gourdet • Mar 10 '20

It's a great tool but very expensive if we want to create a few pipelines once and let them run daily.
Is there any way to reduce the cost? As I understand it, the Data Fusion instance must run 24/7 to be able to execute the scheduled pipelines on a daily basis.

Giuliano Ribeiro • Mar 10 '20

Hi there!
Thanks for reading my post :)

To answer your question: yes, it is expensive. The product is aimed at big/giant companies.

But here is a tip: go to the GCP Marketplace and install CDAP with the "Click to Deploy" option.

The open-source, packaged version available there can do almost everything you have in Data Fusion. The best part: you only pay for the server running CDAP and for your Dataproc cluster.

Thank you!

JS Gourdet • Mar 13 '20

Hey Giuliano,
Thanks for this insightful article.
As I was saying, the price unfortunately prevents small and medium companies from using it just for daily runs, so they would rather use a third-party solution like Segment and others (which of course have fewer features). It's a pity that GCP doesn't offer a special package for this audience and use case.
Using CDAP from the Marketplace is indeed a possibility, but it's not serverless.
I was wondering whether a trick could work: save and export the pipeline, switch off the instance, and then each day create an instance, import the saved pipeline, execute it, and shut the instance down afterwards?
So far, I couldn't find a way to do it, unfortunately.

Keep me informed if by any chance you do.

Beliche • Apr 2 '20

Hi!
I'm currently searching for a serverless solution for ETL transformation and I was thinking of GCP Data Flow, but the pricing is restrictive for us.
Our basic requirement is to read a JSON file from an API that returns 4000 objects, transform those objects, and call an API at the destination to import the data.

It's not possible to switch off the Data Flow instance as you asked, right?

Regards

JS Gourdet • Apr 2 '20

Hi,
DataFlow is really not the tool for such a load; it is meant for much higher volumes.
Google Cloud Functions could probably be a cheap option, depending on your data transformation.

PS: My question was about Google Cloud Data Fusion, which is anyway not appropriate for your use case.

Beliche • Apr 2 '20

Hi!
Thanks for the reply. GCP Data Fusion is definitely not the right fit for my data integration requirements.
I meant to say Data Fusion instead of Data Flow, sorry about that. I'm reviewing so many tools that I mixed up the names.

Regards

Nick Guebhard • Jul 25 '20

Hi Giuliano,

Thanks for the informative blog post and the tip about the GCP Marketplace. I've managed to create the server for CDAP, but do you have any info about how to provision the Dataproc cluster to include the server running CDAP? It seems that without running the plugin on a Dataproc cluster, the process of authenticating access to BigQuery and other Google Cloud sources is more complicated.

Thanks!