Okay, another PR on Dagster, and this time I tackled a deceptively complex issue in the dagster-dbt integration.
At first glance, the Pull Request might look small. However, getting to that one line required diving deep into Windows filesystem internals and race conditions.
Here is the story of how I diagnosed and fixed a nondeterministic crash that was haunting Windows users.
"It works, until it doesn't"
The issue appeared simple. When a user reloads their dbt project definitions in Dagster on Windows, the process crashes with a FileExistsError: [WinError 183].
The traceback pointed to this logic in dbt_project_manager.py:
# The original code
shutil.rmtree(local_dir, ignore_errors=True)
local_dir.mkdir()
On paper, this logic seems flawless: delete the directory, then create it. So why was Python complaining that the directory already existed right after it had been deleted?
The Root Cause - The Windows Race Condition
This is where the complexity lies. Unlike Linux or macOS, the Windows filesystem behaves differently with respect to file locking and deletion latency.
When shutil.rmtree() is called, it requests the OS to delete the directory. However, on Windows, if a file inside that directory is briefly locked, the deletion doesn't happen instantaneously.
- Python executes rmtree. Windows starts deleting, but often lags slightly due to a lock.
- Because ignore_errors=True was set, rmtree returns silently without finishing the job.
- Python immediately executes mkdir().
- CRASH! The directory is essentially a "zombie": it is flagged for deletion but still technically exists.
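The failure mode can be shown in miniature without Windows at all. Here is a minimal sketch (using a throwaway temp directory as a stand-in for the project dir): when the directory still exists, a bare mkdir() raises exactly the error Windows users were seeing.

```python
import tempfile
from pathlib import Path

# A lingering directory, as if rmtree never finished its work.
zombie = Path(tempfile.mkdtemp()) / "state"
zombie.mkdir()

try:
    zombie.mkdir()  # the strict creation step from the original code
except FileExistsError as exc:
    print(type(exc).__name__)  # → FileExistsError
```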
This is a classic race condition. The code assumed instant deletion, but Windows proved otherwise.
Aligning Intent with Reality
I didn't want to simply suppress the error. I needed the creation step to be robust against this OS quirk.
The original author used ignore_errors=True for deletion, implying a design philosophy of "availability over atomicity": if cleanup fails, the program should try to continue rather than crash. The mkdir() step, however, was strict, breaking that philosophy.
This is my suggestion:
local_dir.mkdir(parents=True, exist_ok=True)
By adding exist_ok=True, I ensured that even if the "zombie" directory lingers due to OS latency, the program proceeds gracefully. The subsequent sync() operation then handles the data consistency by overwriting files, ensuring that no stale data causes issues.
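Putting the two lines together, the fixed sequence looks like this (a minimal runnable sketch; the temp directory stands in for the real dbt project directory):

```python
import shutil
import tempfile
from pathlib import Path

local_dir = Path(tempfile.mkdtemp()) / "state"  # stand-in for the project dir
local_dir.mkdir()

# Cleanup may silently fail to finish (the Windows "zombie" case)...
shutil.rmtree(local_dir, ignore_errors=True)
# ...but creation now tolerates a leftover directory instead of crashing.
local_dir.mkdir(parents=True, exist_ok=True)
```

Whether rmtree fully succeeded or left the directory behind, the program ends up with a usable directory either way.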
The Dilemma - Logic over Observation
This presented a significant engineering dilemma for me. As a developer, I crave the validation of seeing a test fail before I fix it. I wanted to witness the crash with my own eyes to confirm the bug.
However, my local test environment worked too well. Because I was testing with a relatively small dataset, the rmtree operation on my Windows machine finished instantaneously, beating the race condition every time. No matter how many times I reloaded, the crash wouldn't trigger.
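In hindsight, one way to force the race deterministically would have been to stub out the deletion so the directory lingers. This is a hypothetical test sketch, not code from the PR: patching shutil.rmtree to a no-op stands in for Windows' delayed deletion.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock

local_dir = Path(tempfile.mkdtemp()) / "state"
local_dir.mkdir()

# Patch rmtree to do nothing: the deletion "starts" but never completes,
# just like the lagging Windows delete.
with mock.patch.object(shutil, "rmtree"):
    shutil.rmtree(local_dir, ignore_errors=True)

    try:
        local_dir.mkdir()  # the original, strict creation step
    except FileExistsError:
        print("original code crashes")

    local_dir.mkdir(parents=True, exist_ok=True)  # the fix
    print("fixed code survives")
```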
I decided to trust my static analysis of the code over my local observation. I looked closely at the existing code's intention:
shutil.rmtree(local_dir, ignore_errors=True)
The original code explicitly ignores errors during deletion. This showed that the original authors anticipated that deletion might fail or be incomplete; they designed the system to tolerate a messy cleanup.
However, the very next line, mkdir(), was strict and intolerant of pre-existing directories. This was a logical contradiction in the code's design philosophy.
I concluded that my fix, adding exist_ok=True, was necessary to align the creation logic with the deletion logic. Trusting this architectural logic, I submitted the Pull Request.