DEV Community: Suraj Singh Thakur

Getting Started with Matillion ETL Tool

Suraj Singh Thakur — Wed, 12 Oct 2022 13:56:58 +0000

Getting Started with Matillion ETL Tool - Perficient Blog

In this article, I will show you how you can set up your own Matillion instance, but before we get started let’s talk about the Matillion ETL tool.

What is Matillion, and why should you care?

Matillion is an ETL tool that is available on the Cloud Marketplace. It is completely cloud-based, billed at an hourly rate, and comes with the most significant advantages when coupled with Snowflake, Redshift, and Delta Lake. To get that significant advantage, I will be showing you the coupling of Matillion with Snowflake in this article.

We have been using the Matillion ETL tool on the data warehouse, such as Snowflake on the ETL project for our clients. Recently I worked on a project in which I was working on Matillion for data transformation and orchestration from source to target.

Matillion has two main flavors of jobs:

Orchestration: Orchestration is primarily concerned with DDL statements (especially Creating, Dropping, and Altering resources), loading data from external sources.
Transformation: Transformation is used for transforming data that already exists within tables. This includes filtering data, changing data types, and removing rows.
It has various types of functional components, that are used as a toolkit for your ETL journey. These components range from DDL SQL commands to Python scripts, and some of these components are designed to perform some of the most complexes of tasks. Also, it gives us the flexibility of creating variables that we can use in any query editor component. Matillion provides variables in two flavors – The Environmental variable and the Job variable.

I will be showing you the example of a Transformation Job and Orchestration Job created in my recent project.

This is the Transformation job in which I have used SQL Query Component (Blue) for writing SQL script. The outcome is passed on to the next component that is Table Update Component (Red), which updated the outcome of the SQL component into the target table that is defined in Table Update Component.

This is the Orchestration job in which we are using several components, but the main task of this job is to execute the transformation job. As you can see, we have a Start Component that starts the job run, then we have a transformation job that is performing the transformation (shown above), then we have an End Success Component that ends the job run. If there is a failure in the Transformation Job component, then the OR Component will wait for any of its input to be completed before continuing the job. So when the failure occurs the OR component adds the SNS Message Component to the job that sends the notification of failure over mail using SNS service of AWS and the End Failure Component ends the job run in the failed state.

Now that you have a basic understanding of the Matillion tool, you are probably excited to build your job for data transformation, so let me give you the complete procedure of how you can create your own Matillion instance and start getting your hands dirty.

Let’s Begin!!! The creation of your Matillion ETL Journey.
First, create a Matillion Free Trail account on Matillion Hub here. (Note: Matillion account required organization domain ID. It will not allow a free domain ID like gmail.com or yahoo.com).

Matillion will send you a confirmation email on your organization id after successful account creation. You need to make sure your organization is added properly to your account as shown below in the image.

If not added you can add your organization with the “Add new organization” button as shown in the image above.

Once you have added your organization and click into your organization panel, you will need to select a service as shown below.

Select the “Add new Matillion ETL instance” option for creating a new instance/VM to work on.

On the next page, you will need to select the appropriate cloud provider as per your requirement option (AWS and Azure). Please note: GCP is currently not supported, as shown below.

For now, let’s go with the AWS option as a cloud provider, next you will need to choose your cloud data platform.

In this instance we are going to select Snowflake as the data platform. Then, you will need to choose the launching method of your Matillion as per your AWS service. You will receive two options, as shown below.

Amazon Web Services - Avoid Contact Center Outages: Plan Your Upgrade to Amazon Connect
Avoid Contact Center Outages: Plan Your Upgrade to Amazon Connect
Learn the six most common pitfalls when upgrading your contact center, and how Amazon Connect can help you avoid them.

Get the Guide

Option 1 – CloudFormation Template There are several prerequisites you will need to define in the next couple pages.

First, let’s select the region in which you want to host an instance.

Next, select VPC for your instance to launch AWS resources in a virtual network that you define.

The next page will allow you to select CloudFormation Template and choose the template accordingly as per your requirement.

You will then be redirected to the AWS console page where you will need to log in with your AWS credentials.

Once logged into AWS, you will be asked to create a stack for your template, please provide details as required to create a stack. After which it will create a stack and will launch your Matillion instance with the selected region and VPC. The stack status should be create_complete status as shown below.

Option 2

If you select the AMI option to launch your instance, it will redirect you to the AWS console page where you will need to log in with your AWS credentials.

After logging into your AWS account, you will get a list of AMIs for the Matillion ETL tool, you just need to select one AMI for the launching instance.

After successfully creating an instance (whether by AMI or CloudFormation Template), you are done with the Mtillion ETL tool creation and now will need to access that tool over the internet as a SaaS platform.

Accessing Matillion ETL Tool
For accessing the Matillion tool you need to make sure that your created instance is up and running continuously. You will require an instance Public IP to load the Matillion ETL Tool login page on your browser.

Use the copy to clipboard option and paste the public IP on any browser. You will get a Matillion Tool login page asking for a Username and Password. The username depends on the OS on which the AMI is built.

As in my scenario, my AMI is Linux based so my Username will be ec2-user. Password will be the instance id as shown in the above image.

Please make sure you use the copy-to-clipboard option to avoid failed login attempts. Below image showing the login page of Matillion ETL Tool.

Once logged in into the Matillion ETL tool console you will get a pop-up window, use the Create Project Button to create a new project. You will get windows with 4 stages required for the project.

Fill in the required details as shown below.

The next stage is AWS connection. As shown below, the environment name and AWS credentials are required to proceed. In my case I have created an IAM user in the AWS account (i.e., mat-user so I have selected that user as AWS credentials). For adding an IAM user to Matillion use the Manage button. You will need the access key ID and secret access key of that IAM user.

The next stage is setting up the Snowflake Connection, which provides details for the options, as shown below. For these, you will require a Snowflake account. If you don’t have an account, please create a 30-day free trial account here.

The last stage is setting up Snowflake Defaults Environment, which means the snowflake default data warehouse setting. (See below for a reference). If Snowflake is connected successfully you will get your default warehouse options in the drop-down list automatically.

Click on Finish and your project will be created. Now, you can work on Matillion ETL for data transformation using Orchestration Jobs and Transform Jobs. Use components for transformation and data flow as shown in the image below.

Some basic panel is provided on the console page of million.

At the bottom right you get task status (success or failed) for failed you get the descriptive error also for troubleshooting.
At the center, towards the bottom, you will get properties of all activities you are using in the Grid panel for data flow. Select that activity and the properties tab will show you all properties of that activity.

Here you have successfully created a Matillion ETL Tool instance using AWS and Snowflake. You have now been able to successfully create a Matillion ETL Tool instance using AWS and Snowflake. To learn more about our AWS cloud capabilities and practices and how Perficient can help you further your enterprise’s digital transformation visit us here!

Keep Learning!!!!!

Transform Your Data with Azure Data Factory

Suraj Singh Thakur — Wed, 12 Oct 2022 13:55:16 +0000

Transform Your Data with Azure Data Factory - Perficient Blogs

This blog will help you to understand the basic functionality of Azure Data Factory (ADF) and how powerful a tool it is when working with big data. Explore the basic architecture on ADF and get to know the components and services involved.

A Quick Intro to Azure Data Factory & Its Key Features
ADF is a cloud-based integration service for orchestrating and automating data movement and data transformation with 90 maintenance free connectors built-in at no added cost. Easily construct ETL and ELT processes in a visual environment or write your own code.

Cut down operating expenses and capital expenditures and save precious time with these key features:

No code or maintenance required to build hybrid ETL and ELT pipeline within the Data Factory visual environment.
Cost efficient and fully managed serverless cloud data integration tool that scales on demand.
SSIS integration runtime to easily rehost on-premises SSIS packages in the cloud using familiar SSIS tools.
Azure security measures to connect to on-premises, cloud-based and software-as-a-service (SaaS) applications with peace of mind.
Remembering ADF Past (v1) to Understand ADF Present (v2)
Azure Data Factory v1 went into public preview on 28th Oct 2014 and then released for general available on 6th Aug 2015. Back then, it was a limited tool for processing data, but it couldn’t hold up in front of SQL Server Integration Services (SSIS) features. In early days of Azure Data Factory, you needed to develop solutions in visual studio, and even though there was some improvement made for diagram view, there was lot of JSON editing that needed to be done.

At MS Ignite 2017, Microsoft introduce the new updated version of ADF. It was launched as v2 due to it many new features and capabilities making it was almost an entirely new product. You could now Lift and Shift your on-premises SSIS solution to Azure very easily. ADF v2 went into public preview on 25th Sept 2017.

The most important updates were things like branching and looping and even running pipeline on scheduled time clocks or in regular intervals. Azure Data Factory v2 even become more popular when the new Visual Studio went into public preview on 16th Jan 2018. Some more feature went into public preview on 27th Jun 2018 such as DRAG and DROP feature.

What is ETL?
As said above, ADF is an ETL and ELT tool for Data solution. ETL stands for Extract, Transform and Load. ETL provides the method of moving the data from various sources into a data warehouse. The image below illustrates the 3 stages of data flow in ADF i.e., E – Extract, T – Transform and L – Load.

diagram of ETL process

The ETL process collects unstructured data from On-Premises SQL Database. Data is extracted into a cloud service storage platform for data transformation as per requirements. ADF has the functions and activities which transform the unstructured data into structured data for BI tool and Analytics. With the help of ADF, the transformed data is then loaded into data warehouse or any cloud storage.

ETL is a process that uses cloud storage services for staging environments such as Blob, Data Lake, S3 storage and so on. Staging environments are like workspaces for actual data with its complete properties to test before the actual data is stored.

Microsoft Innovation
Unleash the Potential of Power Platform With a Center of Excellence
Business innovation often comes from within. Discover how to empower innovation from non-traditional developers with the Microsoft Power Platform.

Learn More

In this case raw data is staged from On-Premise’s server and stored into cloud storage. The transformation job is done on RAW data stored in source destination i.e., cloud storage. After that it is stored into data warehouses such as Snowflake or Databricks.

Now let’s see how all these stages are connected to each other and what functions are used to transform the data in ADF.

Connection and workflow of ADF
The image below is an example of an ADF copy workflow and the components and services used to orchestrate the task. Let’s go through the flow and learn each part involved in an ADF transformation job.

Azure Data Factory architecture

This example activity uses two Datasets as input source and output destination (sink) and these Datasets are connected using Linked services. Linked services are used to connect the actual storage location of data on both the source end and destination end. Integration Runtime binds this complete workflow and to automate or perform this task with single click, you will create a Pipeline.

Pipeline
As shown in the above image, pipeline is a logical grouping of activities. Activities deploy together to execute a specific task. Pipeline has lots of activities such as Get Metadata, For Each, Copy Activity, etc., to perform data transformation tasks. In this scenario, “copy activity” is used to copy data from the data source store to destination store.

Integration Runtime
Commonly known as IR, Integration Runtime is the heart of Azure Data Factory. IR provides the computer resources for data transfer activity and for dispatch of data transfer activities in ADF. There are 3 types of Integration Runtime – Azure IR, Self-hosted, Azure-SSIS.

Blob Storage
This cloud storage provided by Microsoft stores massive amount of unstructured data, such as text or binary data. As you can see, the Customer.csv file with unstructured data of customer details is stored in blob storage.

Linked Service
Linked services connect data source and destination. In this example, the linked service connects to our data source i.e., blob storage. Data source can be an Azure blob storage, Azure SQL Database, or on-premises SQL Server.

Dataset
Dataset is used to connect to the data source and destination via linked services. Datasets are created based upon the type of data source and destination you want to connect.

As you can see above, on the source side a blob storage type of dataset connects via linked service to the actual location of data in blob storage. On the destination side, an SQL server type of dataset connects via linked service, pointing to the exact table location in database to storage data.

Copy Activity
The activity in this pipeline is used to copy data between different data stores, and in this case transform the data format as required in the destination store. Datasets are provided to this activity as input and output source.

So, with this you can understand ADF is the umbrella for all the components – such as Linked Service, Datasets, Integration Runtime – fused together in Pipeline Activity to produce transformed data for analytical tools such as Power BI and HDInsight for planning business outcomes.