DEV Community

Ayush Kumar
Ayush Kumar

Posted on

Ways to load data in DW from External Data Source

Introduction

After performing transformation on data, then it needs to save it in Data Warehouse for further analysis. This process comes under ETL (Extract, Transform, Load). It involves extracting data from various sources, transforming it as per business need, and then loading it into destination.

There are many ways to load data in DW: -

  • POLYBASE
  • COPY INTO command
  • ADF Copy activity
  • Spark Pool in Synapse
  • Databricks

In this blog I will explain these processes and talk about some limitations of some activity and its workaround.

NOTE: Whenever we use other tools for inserting data in Synapse like ADF, Data Flow, Spark Pool and Databricks, it highly recommended to enable staging or under hood it can be implemented automatically. It allows for efficient data transfer and can handle large datasets more effectively. Basically, this approach minimizes the load on both the source and the destination during the transfer process. Internally the data is first loaded in Staging layer(temporary storage layer), then from there, data is loaded in DW. After loading the data in DW, the temporary data in staging layer is deleted.


POLYBASE

PolyBase is a technology that uses metadata to allows SQL Server and dedicated SQL pools to query and import data from external data sources using T-SQL.

Step 1:

Create an object of Database Scoped Credential. This indicates how it going to connect to external data source(Azure Data Lake Gen 2). In the Database Scoped Credentials, I am using Managed Identity (means Synapse uses its own credentials to access Data Lake).

You also have to give Storage Blob Data Contributor role to the Managed Identity of Synapse Workspace.

CREATE DATABASE SCOPED CREDENTIAL adlsgen2synp
WITH IDENTITY = 'MANAGED IDENTITY'
GO
Enter fullscreen mode Exit fullscreen mode

Step 2:

Create an External Data Source. It's a definition of External Data Sorce Location.

CREATE EXTERNAL DATA SOURCE taxi_raw_data
WITH (
    LOCATION = 'abfss://raw@synapselearningadls.dfs.core.windows.net/',
    CREDENTIAL = adlsgen2synp,
    TYPE = HADOOP
);
GO
Enter fullscreen mode Exit fullscreen mode

Step 3:

Create an External File Format. It defines format of the file like CSV.

CREATE EXTERNAL FILE FORMAT csv_file_format
    WITH(
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2,
            USE_TYPE_DEFAULT = FALSE
        )
    );
GO
Enter fullscreen mode Exit fullscreen mode

Step 4:

Create External Table so that it able to query the external data directly from synapse without importing it.

CREATE EXTERNAL TABLE temp_table.ext_taxi_zone
    (   LocationID SMALLINT,
        Borough VARCHAR(15),
        Zone VARCHAR(50),
        service_zone VARCHAR(15))
    WITH(
        LOCATION = 'taxi_zone.csv',
        DATA_SOURCE = taxi_raw_data,
        FILE_FORMAT = csv_file_format
    );
GO
Enter fullscreen mode Exit fullscreen mode

Step 5:

Use CTAS statement to import the data in Synapse SQL DW.

CREATE TABLE temp_table.taxi_zone
WITH (
    DISTRIBUTION = ROUND_ROBIN
) AS 
SELECT * FROM temp_table.ext_taxi_zone;
Enter fullscreen mode Exit fullscreen mode

uisng polybase

NOTE:

  • Polybase does not support DELTA file format.
  • External Table can't be modified. For altering columns, you must have to drop the table and then recreate again.

COPY INTO command

It is a newer way of loading the data in DW. It doesn't need any extra objects. It copies the data directly to the table in DW. It is faster and simpler compared to POLYBASE.

COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)
Enter fullscreen mode Exit fullscreen mode

Using Copy

NOTE: It supports only these CSV, PARQUET, ORC file types.


ADF Copy Activity

Step 1:

Giving some permissions to ADF Managed Identity to access Synapse Database.

Here EXTERNAL PROVIDER refers to the Microsoft Entra.

CREATE USER [adf-rnd-learning] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember db_owner, [adf-rnd-learning];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [adf-rnd-learning];
Enter fullscreen mode Exit fullscreen mode

Step 2:

  • Create linked service for ADLS Gen 2 (External data source) and Azure Synapse Analytics.
  • Create Dataset for source external file in ADLS and sink Table in synapse DB.
  • Create a pipeline and use COPY Activity to copy data from ADLS to Synapse database.
  • Set the source and sink dataset in COPY activity and enable the Staging. Using copy

Here in staging, the data stored temporary and after loading in Synapse DB it deleted.

Staging data

NOTE: ADF COPY activity doesn't support DELTA. But a workaround is you can use Data Flow in ADF. While adding the source in Data Flow, choose source type as Inline.

Delta Support


Spark Pool in Synapse

In Spark Pool we can use Synapse DW Objects with the help of spark connector. And the best part is, Synapse automatically manages all authentication for us. So, we don't have to focus on managing credentials.

sqlanalytics contains a spark connector that is used to connect to dedicated SQL Pool from Spark Pool.

Spark Pool Connector

Here, under the hood staging data is loaded automatically in some default location.

Spark Pool Staging


Databricks

In Databricks, for accessing Synapse DW Objects first we have to manage credentials for both Synapse and ADLS.

Step 1:

Give required permissions to Databricks to access ADLS Gen 2.

Create a service principal in Microsoft Entra Id and add Client Secret to this service principal.

Create Service Principal

Assign the Storage Blob Data Contributor role of ADLS to this service principal.

Role assign to ADLS

Then add the client_id, tenant_id and client secrets of service principal to Azure Key Vault.

Key Vault

Then create Scope in Databricks and add the required details of your key vault. Databricks keeps the Secret Scope in a hidden user interface. To reveal that interface, type this “#secrets/createScope” in the end of URL in the Databricks homepage.

Databricks Scope

Step 2:

Give permission to Service Principal to access Synapse Database. Execute this below query in Synapse Database.

CREATE USER [databricks_learning_appreg] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember db_owner, [databricks_learning_appreg];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [databricks_learning_appreg];
Enter fullscreen mode Exit fullscreen mode

Step 3:

Fetch the credentials from Databricks Secrets Scope.

client_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-id')
tenant_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-tenant-id')
client_secret = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-secret')
Enter fullscreen mode Exit fullscreen mode

Step 4:

Set the configurations in Databricks, for ADLS.

spark.conf.set("fs.azure.account.auth.type.synapselearningadls.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.synapselearningadls.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.synapselearningadls.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.synapselearningadls.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.synapselearningadls.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
Enter fullscreen mode Exit fullscreen mode

Step 5:

Set the configurations in Databricks, for Synapse SQL DW.

spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", client_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", client_secret)
Enter fullscreen mode Exit fullscreen mode

Step 6:

Now load the data from ADLS into data frame and perform some transformations.

read df from ADLS

Step 7:

Now write the transformed data to Synapse SQL DW.

df.write.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.mode('append') \
.save()
Enter fullscreen mode Exit fullscreen mode

Step 8:

Now check the table, transformed data is inserted into Synapse SQL DW.

df2 = spark.read.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.load()
Enter fullscreen mode Exit fullscreen mode

DW table

NOTE:
Here you notice that in pyspark code, I am defining staging location for both reading and writing.

Staging from Databricks

Top comments (0)