Guy Gontar
When os.walk() Freezes: Diagnosing and Fixing Silent Network Hangs on NetApp Shared Storage with Python

The Problem
Our team manages a NetApp shared storage cluster holding operational data for approximately 1,500 entities. Each entity has its own directory subtree, split by record type and date, resulting in a tree of roughly 4,500 leaf directories per date. Each leaf directory held tens of small .txt files — each containing a JSON array of records dumped by a backend service — scattered across the tree with no consolidation.
The file count had grown to a point where NetApp itself started reporting problems. Inodes were being consumed, quotas were being hit, and our sysadmin was fielding out-of-disk-space errors even after physically expanding the storage. The culprit wasn't raw capacity — it was file count. NetApp, like most enterprise storage systems, tracks files through a fixed inode table. Running out of inodes while having free disk space is a well-known but frequently misunderstood failure mode.
The solution was straightforward in concept: consolidate the small files per leaf directory into single JSONL files — one JSON object per line — reducing tens of files per directory down to one. A Python script using os.walk() to traverse the tree, parse each JSON array, merge the objects, sort by timestamp, and write a consolidated output file. Standard stuff.
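A minimal sketch of that per-directory merge step, under stated assumptions: the records are JSON objects, the sort key is a field named `timestamp`, and the output filename is illustrative — none of these names come from the original script.

```python
import json
from pathlib import Path

def consolidate_leaf(leaf: Path, out_name: str = "consolidated.jsonl") -> Path:
    """Merge every .txt file (each holding a JSON array of records) in a
    leaf directory into one JSONL file, one object per line."""
    records = []
    for txt in sorted(leaf.glob("*.txt")):
        records.extend(json.loads(txt.read_text(encoding="utf-8")))
    # "timestamp" is an assumed field name; adjust to the real schema.
    records.sort(key=lambda r: r.get("timestamp", ""))
    out = leaf / out_name
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return out
```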
The problem emerged at scale.

Diagnosing the Real Cause
The first version of the consolidation script ran fine on individual leaf directories and small subtrees. Pointed at the full tree — all 1,500 entities across three dates — it appeared to start normally, printed its initial configuration, and then went silent. No output. No tqdm progress bar. No exception. No traceback. The process was running according to Task Manager but nothing was happening.
The natural first instinct was a code bug. I attached a debugger and set breakpoints at the point where matched directories were being appended to the results list. The breakpoints were not being hit — but the process wasn't stopping either. It was alive, consuming a small amount of CPU, and doing apparently nothing.
This is the behavior of a syscall-level hang.
os.walk() works by calling os.scandir() on each directory it visits. os.scandir() is a thin wrapper over the operating system's directory enumeration syscall — on Windows against a UNC path, this translates to an SMB QUERY_DIRECTORY request sent to the remote share. When the share responds normally, the call returns in milliseconds. When the share is under stress, the response can be delayed by seconds, tens of seconds, or indefinitely — and Python has no timeout mechanism for this. The thread simply blocks, waiting for a network response that may never arrive, with no way to surface this condition as an exception.
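There is no timeout parameter on os.scandir() itself, but the blocking call can be fenced by running it in a sacrificial thread and joining with a deadline — a sketch of the idea, not production code; note the stuck worker thread itself stays blocked until the share finally replies.

```python
import os
import threading

def list_dir_with_timeout(path: str, timeout: float = 30.0) -> list[str]:
    """Run os.scandir in a daemon thread so a hung SMB request surfaces
    to the caller as a TimeoutError instead of blocking it forever."""
    result: list[str] = []
    error: list[BaseException] = []

    def worker() -> None:
        try:
            result.extend(e.name for e in os.scandir(path))
        except BaseException as exc:
            error.append(exc)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)          # wait up to `timeout` seconds for the listing
    if t.is_alive():         # worker is still blocked in the syscall
        raise TimeoutError(f"listing {path!r} exceeded {timeout}s")
    if error:
        raise error[0]
    return result
```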
The share was under stress because it was approaching capacity. NetApp's performance characteristics above approximately 85% utilization are non-linear — latency increases sharply as the storage system works harder to manage fragmentation, snapshot reserves, and metadata operations simultaneously. Our consolidation task was running against a share that was already struggling, and os.walk()'s sequential single-threaded enumeration was exposing that struggle in the most opaque way possible: total silence.
The debugger not hitting breakpoints was actually the key diagnostic signal. The code wasn't executing incorrectly — it wasn't executing at all. It was stuck below Python's layer, in a blocking network call that the runtime had no visibility into and no ability to interrupt.
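One way to get visibility into this kind of hang without a debugger is the stdlib faulthandler module, which can periodically dump every thread's stack — including frames stuck inside a blocking syscall. A small wrapper, as a sketch:

```python
import faulthandler
import sys

def start_hang_watchdog(interval: float = 60.0, file=sys.stderr) -> None:
    """Ask CPython to dump every thread's stack to `file` every `interval`
    seconds until cancelled. If the process is hung inside a syscall, the
    dump shows the exact Python frame (e.g. os.scandir inside os.walk)
    that issued the blocked call."""
    faulthandler.dump_traceback_later(interval, repeat=True, file=file)

def stop_hang_watchdog() -> None:
    """Cancel the periodic dumps once the risky section completes."""
    faulthandler.cancel_dump_traceback_later()
```

Enable the watchdog before the traversal and cancel it afterwards; a process that has gone silent will then print its own stack trace every interval, pointing at the frame that blocked.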

"Just Parallelize It" — But How, Exactly?
Every developer has heard this advice at some point, usually delivered with confidence and very little elaboration. "Parallelize it. It will be faster." The gap between that sentence and actually knowing which concurrency model to apply — and more importantly, why — is where most of the real engineering work lives.
Python offers three primary concurrency models, and choosing the wrong one for a given problem can produce no improvement at all, or actively make things worse.
multiprocessing spawns independent OS processes, each with its own Python interpreter and memory space. Because they don't share the GIL, they achieve true parallelism and are the correct tool when the bottleneck is CPU — computation, data transformation, numerical work. The cost is overhead: spawning processes is expensive, and passing data between them requires serialization. For our problem, multiprocessing would have been the instinctive-sounding choice — it's the model most associated with "serious" parallelism — but it would have been wrong.
asyncio uses a single-threaded event loop to interleave I/O operations cooperatively. It's elegant in theory for high-concurrency network workloads but carries a critical constraint: file I/O in Python is not truly asynchronous without third-party libraries like aiofiles. The standard open(), read(), and os.scandir() calls are all blocking and will stall the event loop just as thoroughly as they stall a regular thread. Rewriting the entire script around asyncio would have added significant complexity for no real benefit in this context.
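Even under asyncio, the standard escape hatch for blocking file I/O is asyncio.to_thread, which pushes the call onto a thread pool — at which point you are effectively back to threading anyway. A minimal sketch:

```python
import asyncio
import os

async def list_dirs(paths: list[str]) -> list[list[str]]:
    """Awaiting os.scandir directly would stall the event loop;
    asyncio.to_thread runs each blocking listing in the default
    thread pool while the loop keeps scheduling other work."""
    async def one(path: str) -> list[str]:
        return await asyncio.to_thread(lambda: [e.name for e in os.scandir(path)])
    return await asyncio.gather(*(one(p) for p in paths))
```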
threading with ThreadPoolExecutor is the correct model here, and the reason comes down to one specific behavior: Python's GIL is released during I/O syscalls. When a thread is blocked waiting for a network response — precisely what was happening with os.scandir() on a stressed share — other threads are free to run. The GIL, which normally serializes Python execution, steps aside exactly when we need it to. Multiple threads can be simultaneously blocked on different network requests, and whichever one gets a response first continues executing while the others remain blocked. This is genuine concurrency for I/O-bound workloads, achieved with minimal overhead and no data serialization.
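The effect is easy to demonstrate. time.sleep releases the GIL the same way a blocked syscall does, so it makes a convenient stand-in for a slow network response:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocked_call(seconds: float) -> float:
    """Stand-in for a blocking network syscall: sleep releases the GIL
    just as a blocked os.scandir does, letting other threads run."""
    time.sleep(seconds)
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(blocked_call, [0.5] * 4))
elapsed = time.perf_counter() - start
# Four 0.5 s "blocked calls" overlap: wall time is roughly 0.5 s, not 2 s.
```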
The distinction matters because it's not about which tool sounds more powerful. It's about matching the tool to the bottleneck.

The Solution
The fix was to split the directory scan at the top level of the tree — one subtree per entity — and assign each subtree to a thread in a ThreadPoolExecutor pool. Rather than one thread walking 1,500 entity directories sequentially, four threads each walked roughly 375, concurrently.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from tqdm import tqdm


def scan_subtree(subdir: Path, dates: set[str]) -> list[Path]:
    found = []
    # followlinks=False guards against symlink cycles on the share
    for dirpath, dirnames, _ in os.walk(subdir, topdown=True, followlinks=False):
        # In-place slice assignment: os.walk sees the sorted list
        dirnames[:] = sorted(dirnames)
        p = Path(dirpath)
        if not dirnames and p.name in dates:  # leaf directory named for a target date
            found.append(p)
    return found


def collect_targets(root_path: Path, dates: set[str], workers: int) -> list[Path]:
    top_level = sorted(p for p in root_path.iterdir() if p.is_dir())
    targets = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(scan_subtree, d, dates): d for d in top_level}
        with tqdm(as_completed(futures), total=len(top_level),
                  unit="subtree", dynamic_ncols=True) as pbar:
            for future in pbar:
                pbar.set_description(f"Scanning {futures[future].name}")
                try:
                    results = future.result(timeout=120)
                    targets.extend(results)
                except Exception as e:
                    tqdm.write(f"  ERROR scanning {futures[future]}: {e}")
    return sorted(targets)
```

Two details in this implementation are worth noting explicitly.
The followlinks=False parameter on os.walk() is not optional on a network share. If any directory in the tree contains a symlink pointing to a parent directory — a configuration that is entirely possible on NetApp — the default behavior of following symlinks would produce an infinite loop with no error. This is a silent correctness bug that only surfaces at the worst possible moment.
The dirnames[:] = sorted(dirnames) in-place assignment is equally important and easy to get wrong. os.walk() uses the dirnames list internally to determine which subdirectories to visit next. Modifying it in place with slice assignment changes the list that os.walk() holds — giving consistent traversal order and the ability to prune subtrees if needed. Reassigning the variable with dirnames = sorted(dirnames) creates a new list that os.walk() never sees, silently doing nothing.
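The same in-place mechanism is what enables pruning: assigning a filtered list back through dirnames[:] tells os.walk() never to descend into the removed subtrees. A small illustration — the skip-name set here is an example, not something from the original script:

```python
import os

def walk_pruned(root: str, skip: set[str]) -> list[str]:
    """Walk `root`, skipping any directory whose name is in `skip`.
    The slice assignment mutates the list os.walk holds internally,
    so pruned directories are never visited at all."""
    visited = []
    for dirpath, dirnames, _ in os.walk(root, topdown=True):
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        visited.append(dirpath)
    return visited
```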

The more important insight from this solution is that the speed improvement was almost a secondary benefit. The primary benefit was resilience. In the single-threaded version, one slow or hung os.scandir() call stalled the entire program. In the concurrent version, a thread blocked on a slow subtree simply waits while the other threads continue making progress. The tqdm progress bar — which had been completely absent in the single-threaded freeze — now advanced continuously, confirming that the system was alive and working even when individual threads were stalled. In a long-running batch operation against an unreliable network resource, that visibility is not cosmetic. It is operational.

The script successfully scanned and consolidated leaf directories across all 1,500 entities in a fraction of the time the single-threaded version had spent frozen on its first subtree.
