Indrasen

Azure Data Factory

What is Azure Data Factory?
Azure Data Factory (ADF) is an ETL (Extract-Transform-Load) tool for integrating data of various sizes and formats from many different sources. In short, it is a serverless, fully managed data integration service for ingesting, preparing, and transforming all of your data at scale. Azure Data Factory pipelines are commonly used to move data from on-premises systems to the cloud on a recurring schedule.
Azure Data Factory helps you automate and manage data workflows that run between on-premises and cloud-based data sources and destinations, and it manages these data-driven workflow pipelines for you. It stands out from other ETL tools because it is easy to use, cost-effective, and offers a powerful, intelligent, code-free service.
As data grows day by day around the world, many businesses and companies are shifting to cloud-based technology to make their business scalable. This increase in cloud adoption creates the need for reliable ETL tools in the cloud to make that integration possible.

How does Azure Data Factory Work?
With its graphical interface, ADF makes it easy to create complex ETL (Extract, Transform, Load) processes that bring together data from various sources and formats. Below are some of the key points about Azure Data Factory (a minimal setup sketch with the Azure Python SDK follows the list):
•Data Ingestion: Azure Data Factory can connect to a wide range of data sources, such as on-premises databases and cloud-based storage services.
•Data Transformation: By mapping data flows and applying various transformation activities, ADF can clean, aggregate, and transform data to meet business requirements, using Azure services such as Azure Databricks or Azure HDInsight.
•Scheduling and Monitoring: It provides strong scheduling capabilities to automate workflows, and monitoring tools to track the progress and health of data pipelines.
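
To make the later sections concrete, the sketches in this post use the Azure SDK for Python (the azure-mgmt-datafactory and azure-identity packages). This is a minimal, hedged setup sketch: the subscription ID, resource group, and factory name are placeholders, and your authentication setup may differ.

```python
# Minimal setup sketch: create a Data Factory client and a factory instance.
# Assumes `pip install azure-identity azure-mgmt-datafactory`; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
rg_name = "adf-demo-rg"                 # placeholder resource group (assumed to exist)
df_name = "adf-demo-factory"            # placeholder data factory name

# DefaultAzureCredential picks up a service principal, managed identity, or `az login`.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory; later sketches reuse adf_client, rg_name, df_name.
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(df.name, df.provisioning_state)
```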

Azure Data Factory(ADF) Architecture

(Figure: Azure Data Factory architecture)

Simple/high-level architecture of Azure Data Factory:

(Figure: high-level ADF architecture diagram)

For a detailed overview of the complete Data Factory architecture, see: https://learn.microsoft.com/en-us/azure/data-factory/media/introduction/data-factory-visual-guide.png

Note: in that diagram, the grey background marks the different scenarios, definitions, and concepts.

Connect and Collect

Businesses have data in various forms and places (on-premises, cloud, SaaS, databases). ADF makes integration easy by connecting multiple sources and aggregating data to be processed.

Without ADF, businesses would have to build and maintain costly, complicated custom pipelines. ADF does this programmatically with the Copy activity, copying data into Azure Data Lake Storage or Blob Storage so it can be processed with Azure Databricks or HDInsight (a minimal Copy activity sketch follows).
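
As a hedged illustration of the Copy activity described above, the sketch below defines a copy step with the Python SDK and publishes it as a one-activity pipeline. The dataset names are placeholders for datasets already defined in the factory, and adf_client, rg_name, and df_name come from the earlier setup sketch.

```python
# Sketch: a Copy activity from one Blob Storage dataset to another.
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink, PipelineResource)

copy_activity = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),   # read from Blob Storage
    sink=BlobSink())       # write to Blob Storage / Data Lake

# Wrap the activity in a pipeline and publish it to the factory.
adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyPipeline", PipelineResource(activities=[copy_activity]))
```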

Transform & Enrich with Azure Data Factory
Once data is in the cloud, ADF Mapping Data Flows helps process and transform it using Spark, without needing Spark expertise.

For custom transformations, ADF supports external compute services like HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning. 🚀
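
For example, dispatching a transformation to HDInsight might look like the hedged sketch below, which runs a Hive script stored in Blob Storage. The linked service names and the script path are placeholders, and the HDInsightHiveActivity model is assumed from the Python SDK.

```python
# Sketch: run a Hive script on an HDInsight cluster as a pipeline activity.
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity, LinkedServiceReference, PipelineResource)

hive_activity = HDInsightHiveActivity(
    name="PartitionWithHive",
    linked_service_name=LinkedServiceReference(        # the HDInsight compute linked service
        type="LinkedServiceReference", reference_name="HDInsightLinkedService"),
    script_linked_service=LinkedServiceReference(      # storage that holds the .hql script
        type="LinkedServiceReference", reference_name="AzureStorageLinkedService"),
    script_path="adfscripts/partitionweblogs.hql")     # placeholder script location

adf_client.pipelines.create_or_update(
    rg_name, df_name, "TransformPipeline", PipelineResource(activities=[hive_activity]))
```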

CI/CD and publish
Data Factory supports end-to-end CI/CD of data pipelines with Azure DevOps and GitHub, allowing incremental development and deployment of ETL processes prior to final publishing. Once they are refined, load the data into Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, or any analytics engine supported by business intelligence tools.

Monitor
Track your data integration pipelines to make sure they are delivering business value. Azure Data Factory has built-in support for monitoring via Azure Monitor, APIs, PowerShell, Azure Monitor logs, and health panels in the Azure portal (a short monitoring sketch with the Python SDK follows).
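
The sketch below shows the programmatic side of that list: it starts a pipeline run, polls its status, and lists the activity runs inside it. It assumes the Python SDK and the adf_client, rg_name, and df_name objects from the setup sketch; "CopyPipeline" is a placeholder pipeline name.

```python
# Sketch: monitor a pipeline run and its activity runs.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import RunFilterParameters

# Kick off a run and check its status.
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print("Pipeline run status:", pipeline_run.status)

# Query the activity runs for that pipeline run within a recent time window.
filter_params = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1))
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    rg_name, df_name, run.run_id, filter_params)
for act in activity_runs.value:
    print(act.activity_name, act.status)
```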

Overview of ADF Components:

(Figure: overview of ADF components)

Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of the following key components:
• Pipelines
• Activities
• Datasets
• Linked services
• Data Flows
• Integration Runtimes
These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.

Pipeline:
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
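
As a hedged sketch of chaining, the pipeline below runs a copy activity first and a Hive activity only after the copy succeeds. All activity, dataset, linked service, and script names are placeholders, and the models come from the Python SDK.

```python
# Sketch: two activities chained so the second depends on the first succeeding.
from azure.mgmt.datafactory.models import (
    CopyActivity, HDInsightHiveActivity, ActivityDependency, DatasetReference,
    LinkedServiceReference, BlobSource, BlobSink, PipelineResource)

ingest = CopyActivity(
    name="IngestFromBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawLogsDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagedLogsDataset")],
    source=BlobSource(), sink=BlobSink())

partition = HDInsightHiveActivity(
    name="PartitionWithHive",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="HDInsightLinkedService"),
    script_linked_service=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureStorageLinkedService"),
    script_path="adfscripts/partitionweblogs.hql",
    # Chain: run the Hive step only after the copy step succeeds.
    depends_on=[ActivityDependency(activity="IngestFromBlob",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg_name, df_name, "IngestAndPartition",
    PipelineResource(activities=[ingest, partition]))
```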

Mapping data flows:
Create and manage graphs of data transformation logic that you can use to transform data of any size. You can build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory executes your logic on a Spark cluster that spins up and spins down when you need it, so you never have to manage or maintain clusters.

Activity:
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Datasets:
Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
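
The hedged sketch below defines a dataset that points at a blob folder and file. It assumes a linked service named "AzureStorageLinkedService" already exists in the factory; the container, folder, and file names are placeholders.

```python
# Sketch: a dataset referencing data in Blob Storage through a linked service.
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, DatasetResource, LinkedServiceReference)

blob_ds = AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureStorageLinkedService"),
    folder_path="adfcontainer/input",   # placeholder container/folder
    file_name="data.csv")               # placeholder file

adf_client.datasets.create_or_update(
    rg_name, df_name, "InputBlobDataset", DatasetResource(properties=blob_ds))
```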

Linked services:
Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked service specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset specifies the blob container and the folder that contains the data.
Linked services are used for two purposes in Data Factory:

  1. To represent a data store that includes, but isn't limited to, a SQL Server database, Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.
  2. To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see the transform data article.
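
The sketch below creates the Azure Storage linked service used by the dataset sketch above: the linked service holds the connection string, while the dataset points at the container and file. The connection string is a placeholder, and the model names are assumed from the Python SDK.

```python
# Sketch: a linked service holding the connection information for Azure Storage.
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, SecureString)

storage_ls = AzureStorageLinkedService(
    connection_string=SecureString(                     # stored as a secure string
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"))

adf_client.linked_services.create_or_update(
    rg_name, df_name, "AzureStorageLinkedService",
    LinkedServiceResource(properties=storage_ls))
```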

Integration Runtime:
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a compute service. An integration runtime provides the bridge between the activity and linked Services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs on or gets dispatched from. This way, the activity can be performed in the region closest possible to the target data store or compute service in the most performant way while meeting security and compliance needs.
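
A hedged sketch of registering a self-hosted integration runtime, which bridges the factory and data inside a private network, is shown below. The model names are assumed from the Python SDK, and the runtime still has to be installed and registered on an on-premises machine before it can dispatch activities.

```python
# Sketch: create a self-hosted integration runtime entry in the factory.
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime)

ir = adf_client.integration_runtimes.create_or_update(
    rg_name, df_name, "OnPremIR",
    IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
        description="Runtime for reaching an on-premises SQL Server")))
print(ir.name)
```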

Triggers:
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.
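
For example, a schedule trigger that kicks off a pipeline once a day could look like the hedged sketch below. The start time, time zone, and pipeline name are placeholders, and newer SDK versions expose the start call as a long-running begin_start operation.

```python
# Sketch: a schedule trigger that runs "CopyPipeline" daily.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
    PipelineReference, TriggerResource)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15), time_zone="UTC")

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="CopyPipeline"),
        parameters={})])

adf_client.triggers.create_or_update(
    rg_name, df_name, "DailyTrigger", TriggerResource(properties=trigger))
adf_client.triggers.begin_start(rg_name, df_name, "DailyTrigger").result()  # activate it
```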

Pipeline runs:
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the trigger definition.

Parameters:
Parameters are key-value pairs of read-only configuration.  Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.
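
The hedged sketch below defines a string parameter on a pipeline and passes an argument for it when the run is created, which is the same mechanism a trigger definition would use. Parameter and folder names are placeholders; inside the pipeline, activities would consume the value through an expression such as @pipeline().parameters.outputFolder.

```python
# Sketch: a pipeline parameter and the argument passed for it at run time.
from azure.mgmt.datafactory.models import (
    ParameterSpecification, PipelineResource, CopyActivity, DatasetReference,
    BlobSource, BlobSink)

copy_activity = CopyActivity(
    name="CopyWithParam",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(), sink=BlobSink())

# Define a read-only string parameter on the pipeline itself.
pipeline = PipelineResource(
    activities=[copy_activity],
    parameters={"outputFolder": ParameterSpecification(type="String")})
adf_client.pipelines.create_or_update(rg_name, df_name, "ParamPipeline", pipeline)

# Pass the argument for the parameter when the run is created (manually or via a trigger).
adf_client.pipelines.create_run(
    rg_name, df_name, "ParamPipeline", parameters={"outputFolder": "curated/2024"})
```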

Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
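
A hedged sketch of one control-flow construct, the ForEach container, is shown below: it loops over an array parameter and runs a copy step per item. Dataset names and the parameter are placeholders, and the models are assumed from the Python SDK.

```python
# Sketch: a ForEach activity looping over a pipeline parameter.
from azure.mgmt.datafactory.models import (
    ForEachActivity, Expression, CopyActivity, DatasetReference, BlobSource, BlobSink,
    ParameterSpecification, PipelineResource)

copy_item = CopyActivity(
    name="CopyOneItem",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(), sink=BlobSink())

loop = ForEachActivity(
    name="ForEachFolder",
    items=Expression(value="@pipeline().parameters.folderList"),  # iterate over the argument
    activities=[copy_item])

pipeline = PipelineResource(
    activities=[loop],
    parameters={"folderList": ParameterSpecification(type="Array")})
adf_client.pipelines.create_or_update(rg_name, df_name, "ForEachPipeline", pipeline)
```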

Variables:
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with parameters to enable passing values between pipelines, data flows, and other activities.
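
As a hedged sketch, the pipeline below declares a string variable and updates it with a Set Variable activity at run time. The variable name and value are placeholders, and the models are assumed from the Python SDK.

```python
# Sketch: declare a pipeline variable and set it with a Set Variable activity.
from azure.mgmt.datafactory.models import (
    VariableSpecification, SetVariableActivity, PipelineResource)

set_status = SetVariableActivity(
    name="MarkAsStarted",
    variable_name="runStatus",
    value="started")   # could also be an expression, e.g. @pipeline().parameters.status

pipeline = PipelineResource(
    activities=[set_status],
    variables={"runStatus": VariableSpecification(type="String", default_value="pending")})
adf_client.pipelines.create_or_update(rg_name, df_name, "VariableDemoPipeline", pipeline)
```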

For more detail, see: https://learn.microsoft.com/en-us/azure/data-factory/introduction#connect-and-collect
