Fast folder iteration in python
Post created at 2022-01-19 07:53
I have a problem to deal here.
I need to iterate over a big tree of folders and files and run process over each file.
In python, we have some options to deal to folders and files.
glob
iglob
os.walk
os.scandir
pathlib.Path
Running this benchmark, we can read some implementation details:
import glob
folder = 'some_folder/another_folder/**/*'
for file in glob . glob ( folder , recursive = True ):
print ( file )
Enter fullscreen mode
Exit fullscreen mode
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file2
Enter fullscreen mode
Exit fullscreen mode
Pros
Easy to use, less code to write
Easy to apply filters using masks (fnmatch based)
Returns data as a list
Cons
Time to scan is the worst (more than 11x the quickest method)
Data will be available only after all files/folders been scanned
import glob
folder = 'some_folder/another_folder/**/*'
for file in glob . iglob ( folder , recursive = True ):
print ( file )
Enter fullscreen mode
Exit fullscreen mode
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file1
some_folder/another_folder/f1/file2
Enter fullscreen mode
Exit fullscreen mode
Pros
Easy to use, like glob
Same filtering
Returns data as iterator, so you can have your data without need to wait all iteration ending.
Cons
Time to scan is a little better than glob (10x the quickest method)
Time to first file could be mutch better
import os
folder = 'some_folder/another_folder/**/*'
for root , _ , files in os . walk ( folder ):
for file in files :
print ( os . path . join ( root , file ))
Enter fullscreen mode
Exit fullscreen mode
Pros
Good performance. Second quickest method (less than 2x slower than the quickest)
Time for the first file almost imediattely (better result of all)
Explicit code in loops gives more visibility and control if you need validations or another nasty processes
Cons
More code needed, with nested loops.
If you need to nest os.walk in another os.walk loop, some strange things can ocurr. But probably your code needs some refactoring.
import os
from typing import Generator
folder = 'some_folder/another_folder/**/*'
def get_files ( folder : str ) -> Generator :
with os . scandir ( folder ) as scan :
for item in scan :
if item . is_file ():
yield item . path
else :
for subitem in get_files ( item . path ):
yield subitem
Enter fullscreen mode
Exit fullscreen mode
Pros
Best performance of all
Context based, assure resources are released after processing
Less verbose than os.walk
Easy to implement your custom business rules
Cons
A little more complex. No big deal.
import pathlib
folder = 'some_folder/another_folder'
for path in pathlib . Path ( folder ). rglob ( '*' ):
if path . is_file ():
yield str ( path )
Enter fullscreen mode
Exit fullscreen mode
Pros
Easy to implement your custom business rules
Returns a iterator
You can have your files soon and don't need to wait for all scan (7x more time than os.scandir)
Cons
Average time to scan (almost 8 times greater than os.scandir)
Memory consumption is 4 times more than the other methods
Some data from tests
System Information
System
Release
Version
Machine
Processor
Linux
5.11.0-46-generic
#51~20.04.1-Ubuntu SMP Fri Jan 7 06:51:40 UTC 2022
x86_64
x86_64
CPU Info
Physical cores
Total cores
Max frequency
Min frequency
Current frequency
4
8
3400
400
1.892
CPU Usage Per Core
0
1
2
3
4
5
6
7
Total
15.6
16.5
10.4
14.1
15.2
14.4
15.2
12.4
14
Memory Information
Total
Available
Used
Percentage
15.51GB
4.77GB
9.56GB
69.2%
SWAP
Total
Free
Used
Percentage
15.26GB
3.28GB
11.98GB
78.5%
Creating sample files
+ Creating folder ./test_files
+ Creating files LEVELS=6 FOLDER_COUNT=6 FILE_COUNT_BY_FOLDER=20
+ Created 933120 files in 0:02:08.928680
* Running iterators: GlobFolderIterator IGlobFolderIterator OSWalkFolderIterator ScanDirIterator PathLibFolderIterator
Enter fullscreen mode
Exit fullscreen mode
MEMORY USAGE
Iterator
RSS
VMS
DATA
IGlobFolderIterator
111280128
96468992
96468992
OSWalkFolderIterator
111280128
96468992
96468992
ScanDirIterator
111280128
96468992
96468992
GlobFolderIterator
117309440
102498304
120254464
PathLibFolderIterator
469721088
455081984
455081984
ELAPSED TIME
Iterator
Elapsed time
X
ScanDirIterator
0:00:01.317526
1
OSWalkFolderIterator
0:00:02.496639
1.9
PathLibFolderIterator
0:00:10.464794
7.9
IGlobFolderIterator
0:00:14.358165
10.9
GlobFolderIterator
0:00:15.308936
11.6
TIME FOR FIRST FILE
Iterator
Elapsed time
X
ScanDirIterator
0:00:00.000098
1
OSWalkFolderIterator
0:00:00.000247
2.5
PathLibFolderIterator
0:00:00.000690
7
IGlobFolderIterator
0:00:00.001135
11.6
GlobFolderIterator
0:00:07.831005
79908.2
You can check the source code for this here .
Image from this wikipedia article .
Top comments (0)