DEV Community

Cover image for Where to Write Python in Azure - Building the Python ETL Pipeline
Jin
Jin

Posted on • Originally published at blog.luca-liu.com

Where to Write Python in Azure - Building the Python ETL Pipeline

Many data analysts know how to read and process Excel files using Python and Pandas locally. But what happens when you move to the Azure cloud?

When building a recent ETL pipeline, the target database was Azure SQL Database. Suddenly, running Python on my local machine was no longer an option because local scripts couldn't easily or securely connect to the cloud database via ODBC. I needed a place to write and execute Python directly in Azure, read Excel files, and schedule daily tasks.

Here is the architecture I built, the services I tested, and the exact costs of my final solution.

Step 1: Getting Business Data into the Cloud

The data sources for this pipeline were monthly Excel files and mapping tables that business users manually updated.

To bridge the gap between business operations and the cloud, I used Power Automate. I set up a flow that automatically syncs the users' OneDrive folders to an Azure Storage Account every day. This allows business users to update mapping tables in a familiar environment (OneDrive), while seamlessly feeding the latest data into the data engineering pipeline.

Step 2: The Quest for the Right Compute

Once the data was in the Azure Storage Account, I needed a compute service to process it and write the results to Azure SQL Database. I tried four different Azure services before finding the right fit.

1. Azure Synapse Analytics Synapse is powerful, but it is expensive. According to Microsoft’s documentation, Synapse uses a Massively Parallel Processing (MPP) architecture. For medium-sized Excel data, this is massive overkill. Paying for distributed computation when you don't need it simply isn't cost-effective.

2. Azure Machine Learning (Virtual Machine) Next, I tried creating a VM in Azure ML. The developer experience was fantastic. By connecting via VS Code, I could easily read data from the Storage Account and write it to the SQL Database. However, it had one fatal flaw: scheduling. Setting up a simple daily automated run for a notebook in Azure ML is unnecessarily complicated.

3. Azure Functions Azure Functions are incredibly cheap. But as the data processing logic grew, I hit its limitations. Functions are great for lightweight, event-driven tasks, but they are not designed for managing complex ETL dependencies and heavy data transformations.

The Final Solution: Azure Databricks Serverless

Ultimately, I moved to Azure Databricks. Initially, I used a standard hybrid workspace, but the idle costs of keeping VMs running (or waiting for them to spin up) were too high.

Then, I switched to Databricks Serverless (hosted in the Germany West Central region). This solved everything. I had an excellent environment to write Python, seamless connections to Azure Storage and SQL Database, and built-in, reliable scheduling.

Transparency: What Does It Actually Cost?

One of the biggest concerns with Databricks is the cost. For this production pipeline, my Databricks service costs exactly €52 per month.

Here is the breakdown of my real Azure bill:

  • Premium Interactive Serverless Compute DBU: €42.24
  • Premium Automated Serverless Compute DBU: €8.27
  • Premium Databricks Storage Unit DSU: €0.11

The largest chunk (€42.24) comes from Interactive Compute—this is the cost generated when I am actively writing, testing, and debugging code.

The actual production run—the Automated Compute—only costs €8.27 per month. The pipeline is scheduled using a standard CRON expression (0 0 5 ? * MON-FRI) to run every weekday at 5:00 AM. Because it is Serverless, I only pay for the exact seconds the compute is running to process the data, with zero idle costs on weekends.

The Takeaway

When building a data pipeline in Azure, finding the right place to write Python isn't just about code execution. It is a balancing act between developer experience (like VS Code integration), operational ease (simple scheduling), and cost control. For medium data workloads, Databricks Serverless currently hits that sweet spot perfectly.


Explore more

Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.

Connect with me on LinkedIn

Connect with me on X

Top comments (0)