Introduction
This article shows you how to create a directory walker that extracts files from multiple directories and subdirectories. The walker initially uses a combination of recursion and a for loop and then uses a generator/iterator to save time and memory. I assume you are familiar with basic Python programming, functions, generators, and iterators. Let's get started.
Directory Structure
A directory is typically represented as a tree data structure where the topmost directory sits as the root node. The root node can contain zero, one or more subdirectories (child nodes), and files (leaf nodes). The subdirectories can also contain other directories and files.
The diagram below shows a Desktop (root) directory containing three subdirectories - Work, Music, and Videos.
To extract the contents of all directories, you will need to visit each directory using a loop or recursion. These two methods have their advantages and disadvantages. However, this article will not discuss them. You can check out this post on StackOverflow for more information.
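As a rough illustration of the loop-based option, here is a minimal sketch that walks a tree with an explicit stack instead of recursion. The names walk_iteratively and demo_iterative_walk, and the throwaway directory layout, are assumptions made for this example only; they are not part of the walker built later in the article.

```python
import tempfile
from pathlib import Path


def walk_iteratively(root_dir: Path) -> list[Path]:
    """Return every file under root_dir using an explicit stack, not recursion."""
    found: list[Path] = []
    pending = [root_dir]  # directories still waiting to be visited
    while pending:
        current = pending.pop()
        for entry in current.iterdir():
            if entry.is_dir():
                pending.append(entry)  # visit this subdirectory later
            else:
                found.append(entry)
    return found


def demo_iterative_walk() -> list[str]:
    """Build a small temporary tree (made-up names) and list its files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Music" / "Rock").mkdir(parents=True)
        (root / "Work").mkdir()
        (root / "Music" / "song1.mp3").touch()
        (root / "Music" / "Rock" / "song2.mp3").touch()
        (root / "Work" / "notes.txt").touch()
        return sorted(p.name for p in walk_iteratively(root))


print(demo_iterative_walk())  # ['notes.txt', 'song1.mp3', 'song2.mp3']
```

The stack plays the role that the call stack plays in the recursive version, so the depth of the tree is no longer limited by Python's recursion limit.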
The Path Class from the pathlib Module
The pathlib module provides utilities for working with the filesystems of different operating systems. The Path class, an alternative to os.path, provides a high-level, friendly way to access file paths while avoiding much of the ceremony associated with the os.path module. For example:
from pathlib import Path
current_dir = Path('.')
print(current_dir) # .
Accessing and navigating subdirectories is straightforward. For example:
# The sample outputs below assume current_dir was created as Path('Desktop')
# List all subdirectories
subdirs = [x for x in current_dir.iterdir() if x.is_dir()]
print(subdirs) # [WindowsPath('Desktop/Music'), ...]
# List all songs within Desktop
all_songs = list(current_dir.glob('**/*.mp3'))
print(all_songs) # [WindowsPath('Desktop/Music/Hip Pop/song3.mp3'), ...]
# Join path segments with the / operator
work_dir = current_dir / "Work"
print(work_dir) # Desktop\Work (Windows) or Desktop/Work (Linux)
comedy_dir = current_dir / "Videos" / "Comedy"
print(comedy_dir) # Also OS specific
# Print a tuple containing parts of the directory
print(comedy_dir.parts) # ('Desktop', 'Videos', 'Comedy')
If you simply want to extract a specific file type, you can use the .glob() method mentioned above. However, if you want more control over the process, then consider using a loop or a generator function.
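For the simple case, a short sketch using Path.rglob(), which behaves like .glob() with '**/' prepended to the pattern, might look like the following. The helper names and the temporary directory layout are assumptions made up for this illustration.

```python
import tempfile
from pathlib import Path


def names_with_extension(root_dir: Path, extension: str) -> list[str]:
    """Return the sorted names of all files under root_dir with the given extension."""
    # root_dir.rglob(pattern) is equivalent to root_dir.glob('**/' + pattern)
    return sorted(p.name for p in root_dir.rglob(f"*{extension}") if p.is_file())


def demo_rglob() -> list[str]:
    """Create a throwaway tree (made-up names) and search it for mp3 files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Music" / "Rock").mkdir(parents=True)
        (root / "Music" / "a.mp3").touch()
        (root / "Music" / "Rock" / "b.mp3").touch()
        (root / "Music" / "cover.jpg").touch()
        return names_with_extension(root, ".mp3")


print(demo_rglob())  # ['a.mp3', 'b.mp3']
```

A pattern like this cannot skip whole directories or apply word-based rules while walking, which is where a hand-written loop or generator becomes useful.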
You can also check if the path is a directory using the .is_dir() method or if a path exists using the .exists() method. For example:
print(current_dir.is_dir()) # True
dont_know = current_dir / "Created"
print(dont_know.exists()) # False
Walking Directories
Using A Loop and Recursion
I assume you've created the example directory structure in the diagram above. Let's look at some sample code.
from pathlib import Path
from typing import Callable

def song_strategy(filename: str) -> bool:
    """Check if the file is an mp3 that does not contain certain words."""
    # ignore file if any of these words are in the file name
    ignore_words = ['slow', 'jazz', 'old']
    if filename.endswith('.mp3'):
        name = filename.lower()
        return not any(kw in name for kw in ignore_words)
    return False

def collect_files(root_dir: Path, file_strategy: Callable[[str], bool]) -> list[Path]:
    """Collect files by walking multiple directories."""
    files: list[Path] = []
    # directories to ignore
    ignore = {'work', 'videos', 'reggae'}

    def inner(root_dir: Path) -> None:
        for x in root_dir.iterdir():
            if x.is_dir() and (x.name.lower() not in ignore):
                # recursion
                inner(x)
            elif x.is_file() and file_strategy(x.name):
                # use the strategy (ignored directories are skipped entirely)
                files.append(x)

    inner(root_dir)
    return files

# The current path is "Desktop"
print(collect_files(Path('.'), song_strategy))
- collect_files() takes a root_dir path and a file_strategy() function that filters files.
- A strategy, song_strategy(), is an example of a file_strategy() function that selects only mp3 files. You can easily add others!
- collect_files() uses an inner() function to recurse through each directory and collects the selected files in a list, files.
- Note that recursion must have a termination condition to prevent stack overflow. In this example, the if x.is_dir() ... check ensures that recursion ends once all directories have been traversed.
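To illustrate how another strategy could be added, here is a hedged sketch of a small factory that builds strategy functions from a set of extensions and words to ignore. The names make_extension_strategy and video_strategy, and the chosen extensions and words, are hypothetical and only serve this example; any function that takes a file name and returns a bool would work the same way.

```python
from typing import Callable


def make_extension_strategy(
    extensions: tuple[str, ...], ignore_words: tuple[str, ...] = ()
) -> Callable[[str], bool]:
    """Build a strategy that accepts files by extension and rejects unwanted words.

    Extensions and ignore words are expected in lowercase.
    """

    def strategy(filename: str) -> bool:
        name = filename.lower()
        # str.endswith accepts a tuple of suffixes
        if not name.endswith(extensions):
            return False
        return not any(word in name for word in ignore_words)

    return strategy


# A hypothetical strategy for video files that skips trailers
video_strategy = make_extension_strategy((".mp4", ".mkv"), ignore_words=("trailer",))

print(video_strategy("Holiday.MP4"))        # True
print(video_strategy("movie_trailer.mp4"))  # False
print(video_strategy("song3.mp3"))          # False
```

Because the walker only depends on the Callable[[str], bool] signature, a strategy built this way can be passed in place of song_strategy without changing the walker itself.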
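While collect_files() works, it must finish walking every directory and hold every matching file in memory before it returns anything. Its time and space cost therefore grows with the number of directories to traverse, which becomes noticeable when dealing with thousands of directories.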
Using A Generator Function
A solution to the above problem is to replace collect_files() with an alternative function that uses a generator. The function is defined below.
from collections.abc import Iterator

def collect_files_generator(root_dir: Path, file_strategy: Callable[[str], bool]) -> Iterator[Path]:
    """Collect files by walking multiple directories using a generator"""
    # directories to ignore
    ignore = {'work', 'videos', 'reggae'}

    def inner(root_dir: Path) -> Iterator[Path]:
        for x in root_dir.iterdir():
            if x.is_dir() and (x.name.lower() not in ignore):
                # yield from a generator
                yield from inner(x)
            elif x.is_file() and file_strategy(x.name):
                yield x

    yield from inner(root_dir)
- collect_files_generator() is a generator that uses a sub-generator, inner().
- inner() is a sub-generator that recursively uses itself.
- collect_files_generator() is better than collect_files() because it produces values one by one (or lazily) instead of storing them in files (memory) before returning them.
While I assume that you already know how a generator works, this post is a great recipe on how yield from works!
Since a generator function returns an iterator, you can control how you retrieve each element from it. For example:
# loop through the iterator
for file in collect_files_generator(Path('.'), song_strategy):
    print(file)
# Or
# Automatically extract the iterator content as a list
print(list(collect_files_generator(Path('.'), song_strategy)))
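Laziness also means you can stop early. The sketch below uses a simplified stand-in walker, iter_files (a hypothetical name, without the strategy or ignore list), together with itertools.islice and next() to take only what is needed. The temporary directory and file names are made up for the example.

```python
import tempfile
from collections.abc import Iterator
from itertools import islice
from pathlib import Path


def iter_files(root_dir: Path) -> Iterator[Path]:
    """Lazily yield every file under root_dir (a simplified stand-in walker)."""
    for entry in root_dir.iterdir():
        if entry.is_dir():
            yield from iter_files(entry)
        else:
            yield entry


def demo_lazy_consumption() -> tuple[int, bool, int]:
    """Show three ways of consuming the iterator on a throwaway tree."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "Music").mkdir()
        for i in range(5):
            (root / "Music" / f"song{i}.mp3").touch()

        # Take only the first two files; the walk stops after that
        first_two = list(islice(iter_files(root), 2))
        # Take a single file, or None if the tree has no files
        first_or_none = next(iter_files(root), None)
        # Count files without building a list
        total = sum(1 for _ in iter_files(root))
        return len(first_two), first_or_none is not None, total


print(demo_lazy_consumption())  # (2, True, 5)
```

With the list-returning version, all five files would be collected before the caller could look at the first one.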
Summary
In this article, you saw how
- directories are traversed using loops and recursions,
- the space and time efficiency can be improved using a generator function rather than a normal function,
- directories are represented and how to use the Path class from the pathlib module,
- to use a strategy function to filter files.
Thanks for reading.