DEV Community: Jin

DP-750: Azure Databricks Clusters Explained, with Real Exam Questions

Jin — Fri, 24 Jul 2026 05:00:03 +0000

When preparing for DP-750: Microsoft Certified: Azure Databricks Data Engineer Associate, one of the first topics you should understand is Azure Databricks compute, often informally called a cluster.

Many exam questions are not really testing whether you remember a button name. They are testing whether you can choose the right compute type for a workload:

interactive development
production ingestion
Lakeflow Spark Declarative Pipelines
batch ETL jobs
cost optimization
machine learning workloads
troubleshooting out-of-memory failures

In the DP-750 question bank used for this article, several questions focus directly on clusters, autoscaling, job compute, serverless compute, Photon, auto termination, and Spark UI troubleshooting.

1. What is a cluster in Azure Databricks?

In Azure Databricks, a cluster is a compute resource used to run notebooks, jobs, Spark workloads, machine learning code, and data engineering pipelines.

More generally, Azure Databricks documentation now often uses the word compute instead of only “cluster”. Microsoft’s documentation describes configuration settings for both all-purpose compute and job compute, and explains that many users create compute through policies that control which settings are available.

A Databricks compute resource usually includes:

a driver node, which coordinates the Spark application
one or more worker nodes, which execute distributed tasks
a Databricks Runtime, which provides Spark, libraries, connectors, and Databricks optimizations
optional features such as autoscaling, auto termination, Photon, and compute policies

For DP-750, the most important point is not the internal architecture. The most important point is choosing the right compute for the right workload.

2. Main compute types you must know for DP-750

All-purpose cluster

An all-purpose cluster is designed for interactive work.

Typical use cases:

notebook development
data exploration
debugging
ad hoc analysis
collaborative development

In exam questions, if you see words such as interactive development, users working in notebooks, or development cluster, the answer often points to an all-purpose cluster.

However, all-purpose clusters are usually not the best choice for production pipelines because they can be shared by many users, stay idle, and allow development activity to affect production workloads.

Job cluster

A job cluster is created for running a job and is usually terminated after the job finishes.

Typical use cases:

production ETL
scheduled ingestion
automated workflows
isolated pipeline execution
repeatable job runs

In DP-750 questions, when the requirement says:

production workloads must run as scheduled, non-interactive pipelines

or:

prevent development activity from affecting production pipelines

you should think about job compute, serverless compute, or Lakeflow pipeline compute, not a shared all-purpose development cluster.
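To make the idea of "scheduled, non-interactive, isolated" concrete, here is a minimal sketch of a job definition that brings its own job cluster instead of pointing at a shared all-purpose cluster. The field names follow the shape of the Databricks Jobs API as I understand it; the job name, notebook path, runtime version, and node type are placeholder assumptions, not values from the exam or from a real workspace.

```python
def build_scheduled_job(job_name, notebook_path, cron, timezone="UTC"):
    """Build a dict shaped like a Jobs API job definition with its own job cluster."""
    return {
        "name": job_name,
        # Quartz cron schedule: the job runs without anyone opening a notebook.
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" means compute is created for this run only,
                # so development activity on other clusters cannot interfere.
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                    "node_type_id": "Standard_D4ds_v5",   # placeholder VM size
                    "autoscale": {"min_workers": 2, "max_workers": 8},
                },
            }
        ],
    }


job = build_scheduled_job(
    "telemetry-ingest",                 # hypothetical job name
    "/Workspace/prod/ingest_telemetry", # hypothetical notebook path
    "0 0 5 ? * MON-FRI",
)
print(sorted(job["tasks"][0].keys()))
```

The design point is the absence of an existing cluster reference: the task describes the compute it needs, and that compute exists only for the run.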

Serverless compute

Serverless compute means Databricks manages the infrastructure for you. You do not manually provision the underlying compute resources. Microsoft describes serverless compute as an Azure Databricks-managed service for notebooks, workflows, and Lakeflow Spark Declarative Pipelines; Databricks automatically allocates and manages the required compute resources, which reduces idle time and management effort.

For Lakeflow Jobs, serverless compute lets you run jobs without configuring and deploying infrastructure. Databricks manages, optimizes, and scales the compute resources, and autoscaling and Photon are automatically enabled for the compute resources that run the job.

For Lakeflow Spark Declarative Pipelines, Databricks recommends serverless compute for new pipelines. Serverless pipelines use enhanced autoscaling and can scale both horizontally and vertically based on workload demand.

This is very important for DP-750:

If the question asks for the lowest operational effort for a new Lakeflow Spark Declarative Pipeline and serverless compute is available as an option, serverless compute is usually the best answer.

3. Autoscaling vs auto termination

These two features are often tested together, but they solve different problems.

Autoscaling

Autoscaling automatically adds or removes worker nodes based on workload demand.

Use autoscaling when:

workload size changes
ingestion volume spikes
users run variable workloads
you need to scale up and scale down automatically

In DP-750 language:

“Automatically add and remove worker nodes” = Autoscaling

Auto termination

Auto termination shuts down compute after it has been idle for a configured period.

Use auto termination when:

users forget to stop clusters
all-purpose clusters remain idle
you want to reduce unnecessary compute cost
active workloads must not be affected

In DP-750 language:

“Automatically shut down when idle” = Auto termination

Microsoft’s compute best-practice documentation also recommends enabling auto termination to ensure compute is terminated after inactivity, and considering autoscaling based on the analyst’s workload.
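The difference between the two settings is easier to remember when you see them side by side in one cluster definition. The sketch below builds a dict shaped like a Clusters API request body; the runtime version and node type are placeholder assumptions, and the helper name build_cluster_spec is mine, not part of any Databricks SDK.

```python
def build_cluster_spec(name, min_workers, max_workers, idle_minutes, use_photon=False):
    """Return a dict shaped like a Clusters API create request body."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime
        "node_type_id": "Standard_D4ds_v5",   # placeholder VM size
        # Autoscaling: how many workers the cluster may use while it is busy.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto termination: how long the cluster may sit idle before it stops.
        "autotermination_minutes": idle_minutes,
    }
    if use_photon:
        spec["runtime_engine"] = "PHOTON"
    return spec


cluster1 = build_cluster_spec("Cluster1", min_workers=2, max_workers=6, idle_minutes=30)
print(cluster1["autoscale"], cluster1["autotermination_minutes"])
```

Autoscaling answers "how big while working", auto termination answers "how long while idle". Changing one does not change the other.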

4. Photon acceleration

Photon is Databricks’ native vectorized query engine. It accelerates SQL workloads, DataFrame API calls, ETL pipelines, and stateless streaming workloads. It is compatible with Apache Spark APIs, so existing Spark code can often run without code changes.

In DP-750, Photon is usually associated with:

performance improvement
ETL acceleration
SQL workloads
DataFrame workloads
cost reduction per workload when faster execution reduces total compute usage

However, Photon is not a magic answer for every cost problem.

If a cluster is overprovisioned and CPU utilization is very low, enabling Photon is not necessarily the best answer. In that case, the better solution may be to reduce the number of workers or right-size the cluster.

5. Cost optimization logic for cluster questions

DP-750 cluster questions often test cost optimization. Here are the most useful decision rules.

Case 1: The workload is variable or bursty

Use autoscaling or serverless compute.

Example signals:

telemetry spikes
unpredictable volume
workload demand changes
new records arrive frequently
pipeline must scale automatically

Case 2: The cluster is idle for long periods

Use auto termination.

Example signals:

users finish work but clusters keep running
all-purpose clusters remain idle
reduce cost without affecting active workloads

Case 3: The workload is predictable and overprovisioned

Reduce the number of workers.

Example signals:

CPU utilization remains below 20%
workload does not spike
current node type already meets requirements
need to reduce cost without increasing duration

In this situation, disabling autoscaling and reducing workers can be better than simply enabling Photon or changing the auto-termination timeout.

Case 4: Production and development are sharing compute

Separate them.

Example signals:

production and development workloads run on the same all-purpose clusters
development activity affects production pipelines
production ingestion should be scheduled and non-interactive

The answer will usually involve job compute, serverless compute, or a production-specific pipeline configuration.
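The four cases can be written down as a small lookup, which is a handy way to drill them before the exam. This is only a study aid under my own naming: the function and its flags are hypothetical, and real questions can combine signals in ways a simple rule list does not capture.

```python
def recommend_compute_change(*, bursty=False, idle_for_long_periods=False,
                             predictable_and_overprovisioned=False,
                             prod_shares_dev_compute=False):
    """Map the signals from Cases 1 to 4 to the usual exam answer."""
    recommendations = []
    if bursty:  # Case 1
        recommendations.append("enable autoscaling or use serverless compute")
    if idle_for_long_periods:  # Case 2
        recommendations.append("configure auto termination")
    if predictable_and_overprovisioned:  # Case 3
        recommendations.append("disable autoscaling and reduce the number of workers")
    if prod_shares_dev_compute:  # Case 4
        recommendations.append("move production to job or serverless compute")
    return recommendations


# Signals from a question about idle all-purpose clusters:
print(recommend_compute_change(idle_for_long_periods=True))
# Signals from a question about telemetry spikes on shared clusters:
print(recommend_compute_change(bursty=True, prod_shares_dev_compute=True))
```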

6. Cluster libraries and Unity Catalog

Sometimes the exam asks how to install libraries on a cluster while using Unity Catalog for access control.

A cluster-scoped library can be used by notebooks and jobs running on that cluster, and Microsoft’s documentation explains that libraries can be installed on a specific cluster through the Azure Databricks workspace UI, REST API, CLI, Terraform, or policies.

For DP-750, the key point is:

If the solution must use Unity Catalog for access control, avoid unmanaged or ad hoc installation patterns. Prefer workspace-managed or governed library installation approaches.

7. Machine learning cluster traps

A common exam trap is confusing Databricks Runtime for Machine Learning with GPU capability.

Databricks Runtime ML includes common machine learning and deep learning libraries, but GPU acceleration still requires GPU-enabled compute. Microsoft’s GPU documentation says that to create GPU compute, the worker type must be a GPU instance type.

So if a cluster is:

single-node
general-purpose VM
Databricks Runtime ML
Python supported

then it can run Python ML code using preinstalled libraries, but it cannot automatically train GPU-based deep learning models unless the node type is GPU-enabled. A single-node cluster also cannot distribute workloads across multiple worker nodes.
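The trap is that three independent properties get mixed up: the runtime, the node type, and the number of workers. The sketch below keeps them separate. ClusterProfile and capabilities are hypothetical names for illustration, and the rules are a simplification of the points above rather than an official capability matrix.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    num_workers: int      # 0 means a single-node cluster (driver only)
    gpu_node_type: bool   # True only if the VM type has GPUs
    ml_runtime: bool      # True if Databricks Runtime ML is selected


def capabilities(cluster):
    """Answer the three yes/no questions independently of each other."""
    return {
        # GPU training depends on the node type, not on the runtime.
        "gpu_training": cluster.gpu_node_type,
        # Distribution across nodes needs at least one worker besides the driver.
        "distributed_across_nodes": cluster.num_workers > 0,
        # Preinstalled ML libraries come from the runtime.
        "preinstalled_ml_libraries": cluster.ml_runtime,
    }


single_node_ml = ClusterProfile(num_workers=0, gpu_node_type=False, ml_runtime=True)
print(capabilities(single_node_ml))
```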

8. Troubleshooting cluster failures with Spark UI

When a job fails because of out-of-memory errors, you should not only look at the notebook output. You need to understand what happened during execution.

The Spark UI is the key tool for analyzing runtime behavior. Microsoft’s Spark memory troubleshooting documentation explains that memory errors can be generic and may come from several causes, such as shuffle partitions, large broadcasts, UDFs, skew, and streaming state.

For DP-750:

OOM root-cause analysis = Spark UI

Memory/runtime behavior = executors, tasks, stages, shuffle, spill, skew

If the question asks specifically for execution behavior and root cause, Spark UI is usually stronger than cluster event logs or notebook output.
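One pattern worth recognizing in the Spark UI is skew: a single task that reads far more data than its siblings in the same stage. As a rough illustration of what you are looking for, the sketch below compares the largest task to the median task. The numbers are made up for the example, and skew_ratio is a hypothetical helper, not something the Spark UI or Databricks exposes.

```python
from statistics import median


def skew_ratio(task_values):
    """Return max / median for per-task metrics such as shuffle read size."""
    if not task_values:
        raise ValueError("need at least one task value")
    mid = median(task_values)
    if mid == 0:
        return float("inf") if max(task_values) > 0 else 1.0
    return max(task_values) / mid


# Invented per-task shuffle read sizes in MB for one stage.
shuffle_read_mb = [110, 120, 115, 118, 2400]
ratio = skew_ratio(shuffle_read_mb)
print(f"largest task is {ratio:.1f}x the median")
```

A ratio close to 1 suggests evenly sized tasks. A large ratio suggests one executor is handling a disproportionate partition, which is one of the common reasons an executor runs out of memory while the others are fine.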

Real Exam Questions

Question 1

Useful case information

Contoso has a single Azure Databricks workspace named Workspace1 in the West US Azure region. Workspace1 is enabled for Unity Catalog.

Workspace1 contains all-purpose clusters for both development and production workloads.

The company’s existing analytics environment has several compute issues:

Production and development workloads run on the same all-purpose clusters.
Production and development workloads do NOT support autoscaling or workload isolation.

Contoso identifies the following environment and compute requirements:

Ensure that production ingestion workloads run on compute clusters that can scale automatically during telemetry spikes.
Prevent development activity from affecting production pipelines.
Production ingestion workloads must run as scheduled, non-interactive pipelines rather than on shared interactive development clusters.

You need to configure compute for the ingestion of telemetry data. The solution must meet the data ingestion and processing requirements.

What should you do?

A. Move the ingestion pipelines to shared compute.

B. Enable Photon acceleration for a job compute cluster. ✅ Correct Answer

C. Increase an all-purpose cluster to a larger fixed node type.

D. Disable autoscaling for a job compute cluster.

Question 2

You have an Azure Databricks workspace.

You are creating a Lakeflow Spark Declarative Pipelines (SDP) pipeline that scales automatically.

You need to configure compute for the pipeline. The solution must minimize operational costs and effort.

What should you use?

A. the existing SQL warehouse

B. an all-purpose cluster that uses autoscaling

C. a job cluster that uses autoscaling ✅ Correct Answer

D. a single-node, all-purpose cluster

Question 3

You have an Azure Databricks workspace that contains an all-purpose compute cluster named Cluster1. Cluster1 is used for interactive development.

You need to configure Cluster1 to meet the following requirements:

Automatically add and remove worker nodes based on workload demand.
Automatically shut down when the cluster has been idle for a specific period.

What should you configure for each requirement?

Requirement	Answer
Automatically add and remove worker nodes	Autoscaling ✅ Correct Answer
Automatically shut down	Auto termination ✅ Correct Answer

Question 5

You have an Azure Databricks workspace named Workspace1.

You create a compute cluster named Cluster1 that will be used to ingest data.

You need to install the required libraries on Cluster1. The solution must use Unity Catalog for access control.

What should you do?

A. Install the libraries by using pip3.

B. Create a custom dependency management script and run the script from a Databricks notebook.

C. Upload the libraries to Workspace1 and install the libraries on Cluster1. ✅ Correct Answer

D. Install the libraries on Cluster1 and manually restart the cluster.

Question 7

You have an Azure Databricks workspace that contains an all-purpose cluster named Cluster1.

You need to configure Cluster1 to meet the following requirements:

Scale up automatically when workloads increase.
Scale down automatically when workloads decrease.
Minimize costs.

Which two actions should you perform? Each correct answer presents part of the solution.

A. Disable Photon acceleration.

B. Enable autoscaling for Cluster1. ✅ Correct Answer

C. Apply a compute policy that enables users to manage the cluster settings.

D. Specify a fixed number of workers.

E. Configure Cluster1 to terminate after 30 minutes of inactivity. ✅ Correct Answer

Question 10

You have an Azure Databricks workspace that contains a cluster named Cluster1.

Performance monitoring shows that Cluster1 is consistently overprovisioned for its batch workload:

CPU utilization remains below 20 percent, including peak processing periods.
The workload is highly predictable and does not spike.
The current node type already meets the workload requirements.

You need to reduce compute costs without increasing job duration.

What should you do?

A. Enable Photon acceleration.

B. Configure Cluster1 to use a larger node type.

C. Decrease the autotermination timeout of Cluster1.

D. Disable autoscaling and reduce the number of worker nodes. ✅ Correct Answer

Question 11

You have an Azure Databricks workspace.

You are creating a Lakeflow Spark Declarative Pipelines (SDP) pipeline that scales automatically.

You need to configure compute for the pipeline. The solution must minimize operational costs and administrative effort.

What should you use?

A. serverless compute ✅ Correct Answer

B. a single-node, all-purpose cluster

C. an all-purpose cluster that uses autoscaling

D. an existing SQL warehouse

E. a job cluster that uses autoscaling

Question 13

You have an Azure Databricks workspace that contains a cluster named Cluster1.

Cluster1 has the following characteristics:

Configured as a single node cluster
Uses a general purpose virtual machine node type

The cluster runtime environment has the following configurations:

Uses Databricks Runtime for Machine Learning
Includes common machine learning libraries
Supports Python workloads

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

Statement	Answer
Cluster1 can be used to train deep learning models that require GPU acceleration.	No ✅ Correct Answer
Cluster1 can distribute machine learning workloads across multiple nodes.	No ✅ Correct Answer
Cluster1 can run Python workloads that rely on preinstalled machine learning libraries.	Yes ✅ Correct Answer

Question 61

You have an Azure Databricks workspace that contains an all-purpose cluster named Cluster1.

You discover that out-of-memory (OOM) errors intermittently cause jobs running on Cluster1 to fail.

You need to identify the root cause of the failures by analyzing the runtime execution behavior.

What should you do?

Area	Answer
Diagnostic tool to use	The Apache Spark UI ✅ Correct Answer
Execution level to analyze	Executors ✅ Correct Answer

Question 70

You have an Azure Databricks workspace that contains multiple all-purpose clusters.

You discover that some clusters remain idle for long periods after users finish their work.

You need to reduce compute costs without affecting active workloads.

What should you do?

A. Enable autoscaling.

B. Convert the clusters into job clusters.

C. Use spot instances.

D. Configure automatic termination. ✅ Correct Answer

Explore more

JinFollow

Hello there! 👋 I'm Jin, a Business Intelligence Developer with a passion for all things data. Proficient in Python, SQL, Power BI, Tableau

Thank you for taking the time to explore data-related insights with me. I appreciate your engagement.

Connect with me on LinkedIn

Connect with me on X

Was My Claude Account Banned Because I Speak Chinese?

Jin — Thu, 23 Jul 2026 18:59:58 +0000

Recently, Anthropic has heavily tightened its restrictions on users linked to unsupported regions, particularly China. While I live and work in Germany, I found myself caught in this wave of account bans.

I want to share what happened, look at how these bans work technically, and explore why legitimate Chinese speakers outside China might be experiencing algorithmic "friendly fire."

My Experience: Three Months, Three Bans

My setup is completely legitimate: I live in Germany, registered with a German phone number, and paid with a German credit card. Germany is fully supported by Anthropic.

Despite this, I have opened three separate Claude accounts, upgraded to Claude Pro three times, and every single time, my account was banned near the end of the billing month. (Fortunately, Anthropic automatically issued full refunds each time).

Because my network and billing footprints are strictly German, I was left with one obvious question: Was I banned because I chat with Claude in Chinese?

The Technical Reality: Multi-Signal Risk Scoring

Anthropic doesn't publish its enforcement logic, but public statements, news reports, and community reverse-engineering reveal that the system doesn't just look at your IP address. It uses a combination of signals to calculate a risk score:

Location & Network: IP geolocation, VPN/proxy detection, and cloud-hosted IP ranges.
Payment Data: The card issuing country (via Bank Identification Numbers) and billing address.
Device Metadata: Local system timezone, browser language, and OS locale settings.
Corporate Ownership: In late 2025, Anthropic expanded its rules to block access by overseas subsidiaries of companies with more than 50% ownership from unsupported jurisdictions.
Client Environment: In early 2026, controversies surrounding Claude Code revealed that Anthropic experimented with checking local environments for proxy configurations and timezone mismatches (like Asia/Shanghai).

Is Language a Trigger?

In a standard risk-scoring model, no single factor causes a ban. Speaking Chinese alone is a poor indicator of location—millions of people in supported countries (like Singapore, Malaysia, Germany, and the US) speak Chinese daily.

However, if an automated system combines multiple "weak" signals—such as system language settings, browser locale, heavy Chinese prompting, and perhaps minor network flags like using a corporate VPN or iCloud Private Relay—the risk score might cross the threshold for an automatic ban.
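To show what I mean by weak signals adding up, here is a toy model. Every signal name, weight, and the threshold are invented for illustration; Anthropic has not published its logic, and nothing here describes how the real system works.

```python
# Hypothetical weights: no single signal reaches the threshold on its own.
WEIGHTS = {
    "chinese_prompts": 3,
    "browser_locale_zh": 3,
    "system_language_zh": 3,
    "vpn_or_private_relay": 4,
}
THRESHOLD = 10  # hypothetical cut-off for an automatic action


def risk_score(signals):
    """Sum the weights of the signals that are present."""
    return sum(WEIGHTS[name] for name in signals)


language_only = risk_score(["chinese_prompts"])
all_weak_signals = risk_score(list(WEIGHTS))

print(language_only, language_only >= THRESHOLD)
print(all_weak_signals, all_weak_signals >= THRESHOLD)
```

In this toy version, language alone stays far below the cut-off, while the combination of all four signals crosses it. That is the mechanism I suspect, not a claim about the actual weights.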

When this happens, the platform simply disables the account and issues a refund with zero transparent explanation.

The Takeaway

My current guess is that Chinese-language usage isn't the sole reason for the ban, but it likely acts as a contributing feature in a rigid, automated trust and safety model. This creates a unique frustration for Chinese speakers living abroad: our language and browsing habits can accidentally mimic the exact patterns the algorithms are trying to block.

Selling Power BI Templates for Passive Income

Jin — Wed, 22 Jul 2026 04:59:58 +0000

Like many people, I used to think passive income meant getting enough blog traffic to earn decent ad revenue. But relying on ads is difficult, slow, and requires massive audience volume.

Recently, I realized there is a much more practical approach: shifting the focus from generating clicks to selling actual digital products or services. As data professionals, we already have highly monetizable skills. One of the best ways to leverage this is by selling ready-to-use Power BI templates.

Why Power BI Templates?

Many small businesses and independent professionals need data visualization, but they do not have the budget to hire a full-time Data Analyst. They just want a dashboard that works.

If you can build a clean, plug-and-play Power BI template for common use cases—like sales tracking, inventory management, or personal finance—you are solving a direct problem. The best part of a digital product is the scale: you build the template once, and you can sell it infinitely.

Where to Sell Your Templates

You do not need to build a complex e-commerce website from scratch. There are several platforms designed specifically for hosting and selling digital files. Before researching this side hustle, I had never heard of most of these places, but they are exactly what creators use.

Etsy

Most people think Etsy is only for handmade crafts. In reality, it has a massive market for digital downloads. Business owners actively search Etsy for Excel trackers, resume formats, and Power BI dashboards. It gives you access to a huge built-in audience.

Gumroad

Gumroad is a platform built specifically for digital creators. It is incredibly easy to use. You simply upload your Power BI file, set a price, and Gumroad gives you a clean checkout link you can share on your blog or social media.

Ko-fi

Ko-fi is well-known as a platform where followers can "buy you a coffee" to support your work. However, it also includes a great digital storefront feature. You can list your templates for sale directly on your profile, often with lower fees than other platforms.

Payhip

Payhip is another straightforward platform focused entirely on selling digital downloads and memberships. It handles the checkout process and the secure file delivery to the customer automatically.

The Takeaway

If you want to build a passive income stream, stop chasing pennies from website ads. Take the technical skills you already use at work, package them into a valuable template, and list it on a digital storefront. It is a much faster and more controllable path to monetizing your expertise.

Why I Left China as a Data Analyst

Jin — Sun, 05 Jul 2026 05:00:01 +0000

In 2021, I graduated with my Master’s degree in Industrial Engineering in Germany and decided to move back to China. During the final year of my degree, I taught myself Python and SQL on DataCamp. I used those skills to pass a data case study and landed my first job at a small SaaS startup in Shanghai. A year later, I moved to an American company, RRD, also in Shanghai.

I worked there from 2022 to 2024. During those two years, I noticed a few undeniable trends in the data and tech industry. Eventually, these trends made me realize I needed to leave. Here is why.

1. Two Separate Software Ecosystems

At my job, I used Microsoft Teams and Power BI. However, many of my friends in domestic companies used local Chinese office suites and BI tools.

China has built its own independent software ecosystem. It works perfectly fine for those used to it, but it is completely separate from the global market. Because my skills and habits were rooted in global tools like Power BI, my employment options in China were almost entirely limited to foreign companies. That instantly shrank my job market.

2. The Great Tech Decoupling

Between 2022 and 2024, the decoupling of global and domestic tech became obvious. Salesforce shut down its direct China operations, and Tableau made similar moves. Many companies were forced to adopt domestic ERP software.

For a Data Analyst, the ERP system is your foundation. Domestic ERPs and SAP run on completely different logics. The same divide is happening with cloud infrastructure—global players like AWS, Azure, and GCP versus domestic Chinese clouds.

I realized I was standing at a crossroads. I had to choose a path: adapt entirely to the Chinese software ecosystem, or stick with the international one. Trying to jump back and forth between the two just means a massive loss of time and high learning costs.

3. Budget Constraints Over Value Creation

Profit margins for many companies in China are tight. Even in multinational companies, the high-profit departments usually stay abroad, leaving the Chinese branches with strict cost constraints.

For example, we did not have the budget to give everyone a Power BI Pro license. Because of this, a significant part of my job turned into finding cheap workarounds. I had to figure out how to set up local servers for Power BI or build wrappers for Tableau just to save money. Instead of spending my time analyzing data and creating real business value, I was wasting energy trying to bypass budget rules using cheap alternatives.

4. The AI Barrier

When the AI boom started, the tools were immediately inaccessible in China. Using them requires extra effort: setting up VPNs, buying virtual foreign phone numbers, and navigating blocks.

On top of that, a $20 monthly subscription for AI tools is expensive relative to local salaries. AI is developing at lightning speed. I didn't want my first step with every new technology to be researching how to secretly bypass regulations just to use it.

The Decision to Leave

Ultimately, I decided to leave China. The choice was half for my career and half for my family.

Today, I am back in Germany, working as a Data Analyst. Looking back, I am happy with my decision. I can focus my time on creating real value, and most importantly, I am staying seamlessly connected to the global tech frontier.

Where to Write Python in Azure - Building the Python ETL Pipeline

Jin — Sun, 05 Jul 2026 05:00:00 +0000

Many data analysts know how to read and process Excel files using Python and Pandas locally. But what happens when you move to the Azure cloud?

When building a recent ETL pipeline, the target database was Azure SQL Database. Suddenly, running Python on my local machine was no longer an option because local scripts couldn't easily or securely connect to the cloud database via ODBC. I needed a place to write and execute Python directly in Azure, read Excel files, and schedule daily tasks.

Here is the architecture I built, the services I tested, and the exact costs of my final solution.

Step 1: Getting Business Data into the Cloud

The data sources for this pipeline were monthly Excel files and mapping tables that business users manually updated.

To bridge the gap between business operations and the cloud, I used Power Automate. I set up a flow that automatically syncs the users' OneDrive folders to an Azure Storage Account every day. This allows business users to update mapping tables in a familiar environment (OneDrive), while seamlessly feeding the latest data into the data engineering pipeline.

Step 2: The Quest for the Right Compute

Once the data was in the Azure Storage Account, I needed a compute service to process it and write the results to Azure SQL Database. I tried four different Azure services before finding the right fit.

1. Azure Synapse Analytics

Synapse is powerful, but it is expensive. According to Microsoft’s documentation, Synapse uses a Massively Parallel Processing (MPP) architecture. For medium-sized Excel data, this is massive overkill. Paying for distributed computation when you don't need it simply isn't cost-effective.

2. Azure Machine Learning (Virtual Machine)

Next, I tried creating a VM in Azure ML. The developer experience was fantastic. By connecting via VS Code, I could easily read data from the Storage Account and write it to the SQL Database. However, it had one fatal flaw: scheduling. Setting up a simple daily automated run for a notebook in Azure ML is unnecessarily complicated.

3. Azure Functions

Azure Functions are incredibly cheap. But as the data processing logic grew, I hit its limitations. Functions are great for lightweight, event-driven tasks, but they are not designed for managing complex ETL dependencies and heavy data transformations.

The Final Solution: Azure Databricks Serverless

Ultimately, I moved to Azure Databricks. Initially, I used a standard hybrid workspace, but the idle costs of keeping VMs running (or waiting for them to spin up) were too high.

Then, I switched to Databricks Serverless (hosted in the Germany West Central region). This solved everything. I had an excellent environment to write Python, seamless connections to Azure Storage and SQL Database, and built-in, reliable scheduling.

Transparency: What Does It Actually Cost?

One of the biggest concerns with Databricks is the cost. For this production pipeline, my Databricks service costs about €52 per month.

Here is the breakdown of the main line items on my real Azure bill, which add up to €50.62:

Premium Interactive Serverless Compute DBU: €42.24
Premium Automated Serverless Compute DBU: €8.27
Premium Databricks Storage Unit DSU: €0.11

The largest chunk (€42.24) comes from Interactive Compute—this is the cost generated when I am actively writing, testing, and debugging code.
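A quick calculation makes the split between development and production more visible. The sketch below only uses the three line items listed above and works out each item's share of their combined total; it uses Decimal so the cents add up exactly.

```python
from decimal import Decimal

# The three line items from the bill above, in euros.
bill = {
    "Interactive Serverless Compute DBU": Decimal("42.24"),
    "Automated Serverless Compute DBU": Decimal("8.27"),
    "Databricks Storage Unit DSU": Decimal("0.11"),
}

total = sum(bill.values(), Decimal("0"))
shares = {item: float(cost / total) for item, cost in bill.items()}

for item, cost in bill.items():
    print(f"{item}: EUR {cost} ({shares[item]:.1%} of listed items)")
print(f"Total of listed items: EUR {total}")
```

Interactive development accounts for more than four fifths of the listed spend, which is why trimming notebook sessions matters more for this bill than tuning the scheduled run.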

The actual production run—the Automated Compute—only costs €8.27 per month. The pipeline is scheduled using a standard CRON expression (0 0 5 ? * MON-FRI) to run every weekday at 5:00 AM. Because it is Serverless, I only pay for the exact seconds the compute is running to process the data, with zero idle costs on weekends.
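In that Quartz expression the fields are seconds, minutes, hours, day of month, month, and day of week, so 0 0 5 ? * MON-FRI means 05:00:00 on Monday through Friday. To get a feel for how few runs that is, the sketch below counts the weekdays in a given month. The helper name is mine, and it ignores anything the real scheduler might skip or retry.

```python
import calendar
from datetime import date


def weekday_runs_in_month(year, month):
    """Count Monday-to-Friday dates, i.e. runs of a MON-FRI daily schedule."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() < 5  # Monday is 0, Friday is 4
    )


print(weekday_runs_in_month(2026, 7))
```

For July 2026 this gives 23 scheduled runs, so the automated compute cost is spread over a couple of dozen short executions rather than an always-on machine.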

The Takeaway

When building a data pipeline in Azure, finding the right place to write Python isn't just about code execution. It is a balancing act between developer experience (like VS Code integration), operational ease (simple scheduling), and cost control. For medium data workloads, Databricks Serverless currently hits that sweet spot perfectly.

Stop Using Spark for Your Small Data - Why Azure Functions is the Right Tool for the Job

Jin — Wed, 06 May 2026 09:22:57 +0000

As a data analyst, my job is to get data from A to B, cleaned and ready for use. A common workflow for my team involves users uploading Excel files to a OneDrive folder. A Power Automate flow then syncs these files daily to a container in our Azure Storage Account.

From there, my responsibility begins:

Read the new Excel file from Blob Storage using Python.
Process the data (clean, transform, apply business logic).
Write the final data to an Azure SQL Database.

I needed this to run on two triggers: a time schedule (e.g., every morning at 7 AM) and an event-driven trigger (i.e., as soon as a new file lands in the container).

My first thought was to use the "big data" tools I'd heard of: Azure Databricks or Azure Synapse Analytics.

The "Big Tool" Trap

On the surface, Databricks and Synapse are perfect.

They let me write Python in a Notebook, which I'm very comfortable with.
They have easy-to-use trigger and monitoring tools.

I set up a proof-of-concept, and it worked. But I quickly realized a problem. My Excel files are 10MB, not 10TB.

Using a full Spark cluster (which is what both Databricks and Synapse Notebooks run on) was like using a sledgehammer to crack a nut. I was paying for a powerful, multi-node cluster (which took 5-10 minutes to "cold start") just to run a Python script that finished in 30 seconds. The cost was going to be far too high for such a simple task.

The "Right Tool": Azure Functions

After some research, I found the perfect tool for small-to-medium data tasks: Azure Functions.
Azure Functions, when used on a "Consumption Plan," is a true "serverless" service. This means:

It's cheap: You get a generous free grant every month, and after that, you pay only for the seconds your code is actually running. For my task, the cost is practically $0.
It's fast: It starts in seconds (or less), not minutes.
It's perfect for triggers: It has built-in triggers for exactly my needs (Timer and Blob Storage).

The (Small) Learning Curve

The one trade-off is that it's slightly more complex than a notebook. You can't just write and run your code in a web browser. The modern, recommended workflow is to use Visual Studio Code (VS Code) to develop your code locally and then "deploy" (push) it to the cloud.

This "local development" workflow is a best practice. It means you have a copy of your code, can use source control (like Git), and can test everything on your machine before it goes live.

More Than Just Timers

My needs were simple, but Azure Functions has triggers for almost anything. The most popular ones include:

Timer Trigger: Runs on a schedule (e.g., 0 0 7 * * 1 for 7 AM every Monday; Azure Functions uses six-field NCRONTAB expressions that start with a seconds field).
Blob Trigger: Runs when a new file is uploaded to a storage container.
HTTP Trigger: Runs when it receives a web request (creating a simple API).
Queue Trigger: Runs when a new message is added to a storage queue.

You can see the full list on the official Microsoft Azure Functions Triggers and Bindings documentation.
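
Azure Functions timer schedules are written as six-field NCRONTAB expressions ({second} {minute} {hour} {day} {month} {day-of-week}), one field more than classic cron. As a small illustration, the helper below converts a five-field cron string by prepending a seconds field of 0. The function name to_ncrontab is hypothetical and not part of any Azure SDK; treat it as a sketch for checking schedules by hand.

```python
def to_ncrontab(cron_expression: str) -> str:
    """Convert a five-field cron expression to six-field NCRONTAB.

    Hypothetical helper for illustration only. Six-field input is
    returned unchanged; anything else raises ValueError.
    """
    fields = cron_expression.split()
    if len(fields) == 6:
        return " ".join(fields)
    if len(fields) != 5:
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    # Fire at second 0 of the matching minute.
    return " ".join(["0", *fields])


print(to_ncrontab("0 7 * * 1"))  # 0 0 7 * * 1
```

The resulting string, 0 0 7 * * 1, reads as second 0, minute 0, hour 7, any day, any month, Monday.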

Conclusion

Databricks and Synapse are amazing, powerful tools, but they are not the answer for everything. For our team's daily Excel processing, using them was costing us time and money.

By investing a little time to learn the VS Code + Azure Functions workflow, we built a solution that is faster, more efficient, and costs a fraction of the price. Don't pay for a Spark cluster when all you need is a 30-second Python script.

Data Analyst: Does Your Work Actually Matter?

Jin — Wed, 06 May 2026 09:22:37 +0000

Introduction

I recently saw a question on Reddit that stopped me in my tracks: "Do you feel your work in data analysis is valuable to the organization you work for?"

It is the question that haunts every data analyst.

We spend hours cleaning data and building complex dashboards. We send them out into the void. And then... silence. We wonder: Is anyone actually reading this? Does this dashboard change anything?

If you are just answering ad-hoc requests, the answer is often "no."

The Trap of "Saving Time"

Many analysts get stuck in the "automation trap." A colleague from another department asks you to automate their manual workflow. You do it. They are happy because they save two hours a week.

You feel useful. But does the company see the value?

Often, they don't. From a management perspective, that colleague’s salary is already paid. Unless that saved time is directly used to generate new revenue, your automation didn't change the company's bottom line. You just made someone's life easier.

That is nice, but it isn't necessarily valuable in a way leaders notice.

The Shift: Stop Doing Projects, Start Building Products

If you want your work to matter, you need to stop acting like an IT support desk and start acting like a Product Owner.

What is the difference?

A Data Project has a start and an end date. It is usually a one-time request. The goal is "delivery." Once you hand over the dashboard or report, you are done. It quickly becomes outdated.
A Data Product is a living tool. It doesn't just report the past; it helps shape future decisions. It evolves. Its goal is not "delivery," but measurable "business impact" (like saving money or reducing risk).

Real-World Example: The SpendCube

Let’s look at a real example from my work with a purchasing department.

The "Project" Approach:
The department asks for a report on last month's spending. I pull the data, send an Excel file, and close the ticket.
Result: They look at what happened. Nothing changes. The value is low.

The "Product" Approach (The SpendCube Dashboard):
I build a live dashboard that doesn't just show what was spent, but actively highlights where we are overspending against budget in real time. It identifies specific suppliers where we could negotiate better contracts tomorrow.
Result: The dashboard isn't just a report; it is a tool they use to actively save the company money. It contributes directly to the P&L (Profit and Loss).

How to Make Your Work Valuable

If you are tired of wondering if your work matters, change your approach.

Don't just accept tasks. When someone asks for a dashboard, ask them: "What decision will you make with this data?" If they can't answer, the dashboard probably isn't necessary.

Move away from automating tasks and start building data products that solve real business problems. When your work directly helps the company save money or make money, you never have to ask if you are valuable. You already know the answer.

How to Fix "command 'claude-vscode.editor.openLast' not found" in VS Code

Jin — Wed, 06 May 2026 08:06:22 +0000

The Problem

When trying to use the Claude Code extension in VS Code, you might run into the following error, which prevents the extension from opening (seen in version 2.1.129):

command 'claude-vscode.editor.openLast' not found

The Solution

The fix is simple: you need to downgrade the extension to a specific stable version (2.1.128).

Here are the exact steps:

Uninstall your current Claude VS Code extension.
Click the Gear (Settings) icon on the Claude extension page in VS Code.
Select "Install Another Version..." from the dropdown menu.
Choose version 2.1.128 from the list.
Reload VS Code.

That's it! The error should be gone and Claude will work properly again.

How to Store JSON and XML in SQL Databases

Jin — Fri, 13 Mar 2026 15:37:17 +0000

Introduction

In the era of big data and diverse data formats, the ability to store and query semi-structured data like JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) in SQL databases has become increasingly important. This article explores how to effectively store and manage JSON and XML data in SQL databases, along with the pros and cons of each approach.

Understanding JSON and XML

JSON

JSON is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is often used in web applications for data exchange between clients and servers.

XML

XML is a markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. It is widely used for data representation and exchange, especially in web services.

Storing JSON in SQL Databases

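Many modern SQL databases support JSON. PostgreSQL and MySQL provide a native JSON data type, while SQL Server has traditionally stored JSON in NVARCHAR columns and worked with it through functions such as JSON_VALUE and OPENJSON.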

How to Store JSON

Using JSON Data Type: Some databases allow you to define a column with a JSON data type.

   CREATE TABLE Products (
       ProductID int PRIMARY KEY,
       ProductData json
   );

Inserting JSON Data:

   INSERT INTO Products (ProductID, ProductData) VALUES (1, '{"name": "Laptop", "price": 999.99}');

Querying JSON Data

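You can use built-in operators and functions to query JSON data. The ->> operator below is PostgreSQL syntax; MySQL uses ProductData->>'$.name', and SQL Server uses JSON_VALUE(ProductData, '$.name').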

SELECT ProductData->>'name' AS ProductName FROM Products WHERE ProductID = 1;
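
To try the same store-and-query pattern without a database server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is swapped in purely for convenience: it is not one of the databases discussed above, it stores the JSON in a TEXT column, and json_extract plays the role of ->> or JSON_VALUE. The sketch assumes your SQLite build includes the JSON functions, which is the case for most current Python distributions.

```python
import sqlite3

# In-memory SQLite database; assumes the SQLite build includes the JSON
# functions (true for most current Python distributions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductData TEXT)"
)
conn.execute(
    "INSERT INTO Products (ProductID, ProductData) VALUES (?, ?)",
    (1, '{"name": "Laptop", "price": 999.99}'),
)

# json_extract stands in for ->> / JSON_VALUE in this sketch.
row = conn.execute(
    "SELECT json_extract(ProductData, '$.name') FROM Products WHERE ProductID = ?",
    (1,),
).fetchone()
product_name = row[0]
print(product_name)  # Laptop
conn.close()
```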

Storing XML in SQL Databases

Several SQL databases, including SQL Server and PostgreSQL, also provide an XML data type, allowing you to store and query XML documents.

How to Store XML

Using XML Data Type: Define a column with an XML data type.

   CREATE TABLE Orders (
       OrderID int PRIMARY KEY,
       OrderDetails xml
   );

Inserting XML Data:

   INSERT INTO Orders (OrderID, OrderDetails) VALUES (1, '<order><item>Book</item><quantity>2</quantity></order>');

Querying XML Data

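You can use XPath and XQuery to extract data from XML columns. The .value() method below is SQL Server syntax; PostgreSQL offers an xpath() function for the same purpose.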

SELECT OrderDetails.value('(/order/item)[1]', 'varchar(100)') AS ItemName FROM Orders WHERE OrderID = 1;
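
To see what that XPath expression selects, the short sketch below parses the same order document with Python's standard xml.etree.ElementTree module. It runs entirely on the client side and is only meant to illustrate the path lookup, not to replace querying inside the database.

```python
import xml.etree.ElementTree as ET

order_details = "<order><item>Book</item><quantity>2</quantity></order>"
root = ET.fromstring(order_details)  # root is the <order> element

# Equivalent of the XPath (/order/item)[1]: the first <item> child of <order>.
item_name = root.findtext("item")
quantity = int(root.findtext("quantity"))
print(item_name, quantity)  # Book 2
```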

Pros and Cons of Storing JSON and XML

Pros

Flexibility: Both JSON and XML allow for flexible data structures, making it easy to store complex data.
Interoperability: They are widely used formats, making it easier to integrate with other systems and APIs.
Schema-less: You can store data without a predefined schema, which is useful for evolving data models.

Cons

Performance: Querying semi-structured data can be slower than querying structured data, especially for large datasets.
Complexity: Managing and querying JSON and XML data can add complexity to your database operations.
Storage Overhead: JSON and XML formats can consume more storage space compared to traditional relational data.

Conclusion

Storing JSON and XML in SQL databases provides a powerful way to handle semi-structured data. By leveraging the native support for these formats in modern SQL databases, you can efficiently store, query, and manage complex data structures. Understanding the advantages and limitations of each format will help you make informed decisions about how to best utilize them in your applications.

Fixing Azure SQL Connection Errors in Azure Scheduled Python Job

Jin — Fri, 27 Feb 2026 13:37:00 +0000

As a Data Analyst, I recently faced a frustrating issue while automating a daily data processing task in Azure.

The goal was simple: run a scheduled job every morning to process data and sync it to an Azure SQL Database. When I ran the code manually, it worked perfectly. But when the scheduled job (via Azure Functions or Synapse) triggered at 6:00 AM, it crashed immediately.

Here is the solution to fixing the "Database not available" error without increasing your Azure bill.

The Problem

The job failed consistently with Error 40613:

(pyodbc.Error) ('HY000', "[HY000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]Database 'xxxxxxx' on server 'xxxxxxxxxxxxxxxxxx' is not currently available. Please retry the connection later. If the problem persists, contact customer support, and provide them the session tracing ID of '{...}'. (40613) (SQLDriverConnect)") (Background on this error at: https://sqlalche.me/e/20/dbapi)

Why this happens

I am using the Azure SQL Database Serverless tier. To save costs, this tier features Auto-pause. If no one uses the database for a set period (e.g., 1 hour), Azure puts it to sleep.

When my scheduled job runs in the morning, the database is cold. It takes approximately 60 to 90 seconds for Azure to spin the compute back up. With default settings, the connection attempt times out and fails before the database is ready.

The Expensive Fix (Don't do this)

My first instinct was to disable Auto-pause.

Go to Azure Portal > SQL Database.
Click Compute + storage.
Uncheck Enable auto-pause

The result: The error stopped, but my costs tripled. I was paying for compute 24/7 for a job that only runs for 10 minutes a day. This is not efficient.

The Smart Fix: Intelligent Retry Logic

Instead of keeping the server running all night, we should write code that is patient enough to wait for the server to wake up.

I wrote a custom wrapper for the SQLAlchemy engine that handles the specific behavior of Azure Serverless cold starts.

The Code

Here is the robust connection function. It attempts to connect, and if it detects the database is sleeping, it waits and retries until the server is back online.

import time
from sqlalchemy import create_engine, text
from sqlalchemy.exc import DBAPIError

def connect_sql_engine(max_retries=10, delay_seconds=30):
    """
    Attempts to connect to the database.
    If the database is in serverless pause state, it retries until it wakes up.

    max_retries: Default 10. Covers ~5 minutes of startup time.
    delay_seconds: Default 30s. Wait time between attempts.
    """

    # Replace with your credentials or use Environment Variables (Recommended)
    server = 'your-server.database.windows.net'
    database = 'your-database'
    username = 'your-username'
    password = 'your-password'

    # LoginTimeout=30 gives the driver time to negotiate the handshake
    connection_string = (
        f'mssql+pyodbc://{username}:{password}@{server}/{database}'
        f'?driver=ODBC+Driver+18+for+SQL+Server&LoginTimeout=30'
    )

    # Create the engine with connection pooling enabled
    engine = create_engine(
        connection_string,
        fast_executemany=True, # Optimized for bulk inserts
        pool_pre_ping=True,    # Checks connection health before usage
        pool_recycle=1800      # Recycle pooled connections after 30 minutes
    )

    print(f"Attempting to connect to {database}...")

    for attempt in range(1, max_retries + 1):
        try:
            # Try to execute a simple query to wake the DB
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))

            print(">>> Success: Database is connected and awake!")
            return engine

        # DBAPIError is the base class of OperationalError and InterfaceError.
        # The error shown above is reported as (pyodbc.Error), which SQLAlchemy
        # raises as DBAPIError itself, so we catch the base class.
        except DBAPIError as e:
            print(f"Attempt {attempt}/{max_retries} failed. Database might be auto-paused.")
            print(f"Error details: {e}")
            if attempt < max_retries:
                # No point sleeping after the final attempt
                print(f"Waiting {delay_seconds} seconds for wake-up...")
                time.sleep(delay_seconds)

    # If we reach here, the database is genuinely down or credentials are wrong
    raise Exception(">>> Failed to wake up the database after multiple attempts.")

How it works

The Loop: It tries to run SELECT 1. This is a lightweight query that forces Azure to trigger the resume process.
The Trap: If the connection attempt raises a database error (such as error 40613 while the database is still resuming), it pauses the script for delay_seconds (30 seconds by default) using time.sleep().
The Success: Once Azure allocates the compute (usually after attempt 2 or 3), the connection succeeds, and the function returns the active engine object for your pipeline to use.
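
The same wait-and-retry idea can be sketched without any database dependency, which makes the loop easy to test. In the sketch below, retry, looks_like_paused_database, and fake_connect are hypothetical names introduced for illustration; matching on the text "40613" is an assumption about how the driver reports the error, and the sleep function is injectable so the example runs without real delays.

```python
import time


def retry(operation, is_retryable, max_retries=10, delay_seconds=30, sleep=time.sleep):
    """Call operation() until it succeeds or max_retries attempts are used.

    is_retryable(exc) decides whether an exception is worth waiting on.
    sleep is injectable so the loop can be exercised without real delays.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up immediately on non-transient errors or on the last attempt.
            if not is_retryable(exc) or attempt == max_retries:
                raise
            sleep(delay_seconds)


def looks_like_paused_database(exc):
    # Assumption: the driver message contains the Azure SQL error number 40613.
    return "40613" in str(exc)


# Simulated database that is "paused" for the first two attempts.
calls = {"count": 0}


def fake_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Database is not currently available. (40613)")
    return "connected"


waits = []
result = retry(fake_connect, looks_like_paused_database, sleep=waits.append)
print(result, calls["count"], waits)  # connected 3 [30, 30]
```

Separating "is this error worth retrying?" from the loop means a wrong password fails on the first attempt instead of after several minutes of waiting.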

Summary

Don't change your infrastructure to fit your code; change your code to fit the infrastructure. By handling the "cold start" in Python, you keep the cost benefits of Serverless architecture while maintaining the reliability of a Production environment.

Happy coding!

How to Install Python Package in Azure Synapse for Apache Spark pools

Jin — Tue, 06 Jan 2026 21:58:00 +0000

Efficiently Installing Python Packages in Azure Synapse Analytics

When working in Azure Synapse notebooks, you can use the %pip command (e.g., %pip install pandas) in a code cell to install packages. However, this method is temporary. The package is only installed for the current notebook session and must be re-installed every time the session starts.

This repetition can lead to significant delays in notebook execution and is inefficient for frequently run jobs.

A more permanent and efficient solution is to install packages directly onto the Apache Spark pool. This approach ensures the libraries are pre-installed and automatically available in every session attached to that pool.

How to Install Packages at the Spark Pool Level

This method involves uploading a requirements.txt file that specifies the packages and versions you need.

Go to your Azure Synapse workspace in the Azure portal.
Navigate to the "Manage" section on the left-hand side.
Select "Apache Spark pools" under the "Analytics pools" section.
Choose the Spark pool where you want to install the package.
Hover your mouse over the three dots on the right side of the Spark pool and click "Packages".
Upload a requirements.txt file that lists the packages you want to install.
Click Apply to save the changes.

The Spark pool will update and automatically install the specified packages. This may take a few minutes. Once complete, all notebooks attached to this pool will have access to these libraries by default.

How to Generate a requirements.txt File

The requirements.txt file is a simple text file that lists the packages to be installed. You can easily generate this file from your local Python environment.

Open your terminal or command prompt and run the following command:

pip freeze > requirements.txt

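This command captures every package installed in your current environment, with exact versions, and saves the list into a file named requirements.txt. Uploading pinned versions keeps your Synapse environment consistent with your local one. Because pip freeze lists everything installed locally, trim the file down to the packages your notebooks actually need before uploading; fewer pins mean fewer chances of a conflict with the libraries already present on the pool.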
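
If you would rather pin only a handful of packages than everything pip freeze reports, a small script can build the lines for you. The sketch below uses the standard importlib.metadata module; the function name pinned_requirements is hypothetical, and the package names and versions in the example are stand-ins for illustration, not a recommendation.

```python
from importlib import metadata


def pinned_requirements(package_names, get_version=metadata.version):
    """Return 'name==version' lines for the named packages only.

    get_version defaults to importlib.metadata.version and is injectable
    for testing. Packages that are not installed are skipped.
    """
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={get_version(name)}")
        except metadata.PackageNotFoundError:
            continue
    return lines


# Example with a stand-in version lookup (illustrative names and versions).
fake_versions = {"pandas": "2.2.2", "openpyxl": "3.1.5"}


def fake_get_version(name):
    if name not in fake_versions:
        raise metadata.PackageNotFoundError(name)
    return fake_versions[name]


example = pinned_requirements(["pandas", "openpyxl", "not-installed"], fake_get_version)
print("\n".join(example))
```

In real use you would call pinned_requirements(["pandas", "openpyxl"]) without the second argument and write the returned lines to requirements.txt.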

How to Calculate a Dynamic Truncated Mean in Power BI Using DAX

Jin — Tue, 06 Jan 2026 21:57:00 +0000

Why You Need a Truncated Mean

In data analysis, the standard AVERAGE function is a workhorse, but it has a significant weakness: it is highly susceptible to distortion from outliers. A single extreme value, whether high or low, can skew the entire result, misrepresenting the data's true central tendency.

This is where the truncated mean becomes essential. It provides a more robust measure of average by excluding a specified percentage of the smallest and largest values from the calculation.

Excel has a built-in TRIMMEAN function, but DAX has no direct equivalent, so the calculation has to be built by hand. That includes reports that use a Live Connection to an Analysis Services (SSAS) model. This article provides a manual DAX pattern that replicates this functionality and remains fully dynamic, responding to all slicers and filters in your report.

The DAX Solution for a Dynamic Truncated Mean

This measure calculates a 20% truncated mean by removing the bottom 10% and top 10% of values before averaging the remaining 80%.

You can paste this code directly into the "New Measure" formula bar.

Trimmed Mean (20%) =
VAR LowerPercentile = 0.10 // Defines the bottom 10% to trim
VAR UpperPercentile = 0.90 // Defines the top 10% to trim (1.0 - 0.10)

// Rows with a non-blank value, evaluated in the current filter context
VAR NonBlankRows =
    FILTER(
        'FactTable',
        NOT( ISBLANK( 'FactTable'[MeasureColumn] ) )
    )

// 1. Find the value at the 10th percentile
VAR MinThreshold =
    PERCENTILEX.INC(
        NonBlankRows,
        'FactTable'[MeasureColumn],
        LowerPercentile
    )

// 2. Find the value at the 90th percentile
VAR MaxThreshold =
    PERCENTILEX.INC(
        NonBlankRows,
        'FactTable'[MeasureColumn],
        UpperPercentile
    )

// 3. Calculate the average, including only values between the thresholds
RETURN
    AVERAGEX(
        FILTER(
            NonBlankRows,
            'FactTable'[MeasureColumn] >= MinThreshold &&
            'FactTable'[MeasureColumn] <= MaxThreshold
        ),
        'FactTable'[MeasureColumn]
    )

Deconstructing the DAX Logic

This formula works in three distinct steps, all of which execute within the current filter context (e.g., whatever slicers the user has selected).

1. Define Key Variables

LowerPercentile / UpperPercentile: We define the boundaries. 0.10 and 0.90 mean we are trimming the bottom 10% and top 10%. To trim 5% from each end (a 10% total trim), you would use 0.05 and 0.95.
NonBlankRows: A table variable that holds the rows of 'FactTable' where [MeasureColumn] is not blank. You must change 'FactTable'[MeasureColumn] to match your data model.
Why the column is not stored in a variable: A DAX variable is evaluated once, at the point where it is defined. It cannot stand in for a column reference that needs to be evaluated row by row, so the column is written out in full wherever it is used.

2. Find the Percentile Thresholds

MinThreshold & MaxThreshold: These variables store the actual values that correspond to our percentile boundaries.
PERCENTILEX.INC: We use this "iterator" function because it accepts a table expression, which lets us pass in the filtered NonBlankRows table.
FILTER(..., NOT(ISBLANK(...))): This is a crucial step. We calculate the percentiles only for rows where our target column is not blank. This prevents BLANK() values from skewing the percentile calculation.
The result is that MinThreshold holds the value of the 10th percentile (e.g., 4.5) and MaxThreshold holds the value of the 90th percentile (e.g., 88.2) for the currently visible data.

3. Calculate the Final Average

No CALCULATE wrapper is needed: A measure is always evaluated in the filter context of the visual and slicers that call it, so every variable above already respects the user's selections. That is what makes the measure dynamic.
AVERAGEX(FILTER(...)): The core of the calculation. We use AVERAGEX to iterate over a table.
FILTER(...): We filter NonBlankRows a final time. This filter is the "trim." It keeps only the rows where the value in [MeasureColumn] is:
- Greater than or equal to our MinThreshold
- AND
- Less than or equal to our MaxThreshold
AVERAGEX(..., 'FactTable'[MeasureColumn]): AVERAGEX then calculates the simple average of the column for only the rows that passed the filter.
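
To sanity-check the numbers outside Power BI, the same three steps can be reproduced in a few lines of Python. This is a sketch under one assumption: that PERCENTILEX.INC interpolates linearly between neighboring values in the same way as Excel's PERCENTILE.INC. The function names percentile_inc and truncated_mean are hypothetical, and the sample data is made up to show the effect of a single outlier.

```python
import math


def percentile_inc(sorted_values, p):
    """Inclusive percentile with linear interpolation (position = p * (n - 1))."""
    position = p * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def truncated_mean(values, lower_p=0.10, upper_p=0.90):
    """Average of the non-missing values lying between the two percentile thresholds."""
    data = sorted(v for v in values if v is not None)  # drop blanks first
    min_threshold = percentile_inc(data, lower_p)
    max_threshold = percentile_inc(data, upper_p)
    kept = [v for v in data if min_threshold <= v <= max_threshold]
    return sum(kept) / len(kept)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000, None]
plain_average = sum(v for v in sample if v is not None) / 10
print(plain_average)           # 104.5
print(truncated_mean(sample))  # 5.5
```

With the outlier of 1000 included, the plain average is 104.5. The thresholds land between 1 and 2 at the low end and between 9 and 1000 at the high end, so the values 2 through 9 survive the trim and the truncated mean is 5.5.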

Conclusion

By implementing this DAX pattern, you create a robust, dynamic, and outlier-resistant KPI. This measure provides a more accurate picture of your data's central tendency and will correctly re-calculate on the fly as users interact with your Power BI report.
