I've recently gone to make some edits in a program, and immediately realised that my variable and function naming was poor and should be rectified before I make the next edits.
The general situation is in a part of the application Foldatry, in particular the part that does the hash-based matching of files by their content. In the process of that, it visits every file and calculates a hash by reading the entire file. This is a fairly standard way of finding files that have the exact same content, regardless of the filenames and timestamps.
What I would like to do, is to amend the program to let me have an option to do a faster process in which it first matches file by their filenames, sizes and timestamps. But first, some housekeeping seemed worthwhile.
Prefixes
As I'll be using them in this example, while not vital to make sense of the following, it may be useful to know my personal prefix codes are:
-
i
= internal - where that seems useful to distinguish from one of similar name with a different prefix -
v
= var parameter - aka passed by reference (the "var" here comes from Pascal) -
p
= parameter - by passed by value -
r
= result - a value that will be quoted in a Python "return" statement -
g
= global - which in the context of Python means global to a module - I try to avoid needing this.
The Outer Call
Here is the outer layer of the call that does this.
def filedupetry_command( p_tree_paths, p_multi_log):
# set up structures for Phase 1
d_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
mtc_dct_hash_lst_tplfilinf = {} # dictionary of hashes
mtc_dct_file_selves = {} # dictionary of files with info about them
# ---------------------
# Phase 1 - tree traversal
# Traverse the tree calculating and collecting hashes
tpi = 0
tpn = len(p_tree_paths)
for i_tree_path in p_tree_paths:
tpi = tpi + 1
traverse_tree_add_hash( d_dct_size_lst_pfn, mtc_dct_file_selves, i_tree_path, p_multi_log)
# ---------------------
# Phase 2 - Hashes for files the same size
build_hash_dct_from_Size_dct( mtc_dct_hash_lst_tplfilinf, d_dct_size_lst_pfn, p_multi_log)
# ---------------------
# Phase 3 - Analysis - make a size based dictionary of the hashes
dct_size_lst_hash_multi_files = make_dct_size_lst_hashed_with_multi_files( mtc_dct_hash_lst_tplfilinf )
# ---------------------
return mtc_dct_hash_lst_tplfilinf, mtc_dct_file_selves, dct_size_lst_hash_multi_files
It was passed in a list of folders to process - i.e. a list of strings giving the paths. You can ignore all the p_multi_log
references as that is for my logging system.
So actually the first thing that's poor about this code already: is that the name of a function it calls is quite misleading.
i.e. traverse_tree_add_hash
because while that name was valid the first time I wrote it, I later changed the approach.
Actually now, that function builds just a dictionary of files keyed by their sizes - for the idea that only those of the same size should then be hashed for contents. However, this improvement is reperesented in the name of the local parameter being passed to it d_dct_size_lst_pfn
being a dictionary keyed by size and having a list of path-filenames.
So, a better name would now be something like traverse_tree_make_sizes_dct
While we're here, let's note that the running of that function is just a first phase, with a second phase being to use that size-keyed dictionary to then build yet another resource keyed by the content hashes - by calling build_hash_dct_from_Size_dct
Even while we're looking at this outer part of code, there are some other names that I might now want to change.
Let's look at these three :
d_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
mtc_dct_hash_lst_tplfilinf = {} # dictionary of hashes
mtc_dct_file_selves = {} # dictionary of files with info about them
At the time I probably named d_dct_size_lst_pfn
with a d
prefix to indicate it was "data" being assembled. Now, I would rename this to have a prefix of i
to indcate that it is internal to this function, hence to become i_dct_size_lst_pfn
At the time I probably named mtc_dct_hash_lst_tplfilinf
with a prefix of mtc
to distinguish which layer of function it was in, with the "mtc" being a reference to that function context. That's not relevant here, and is only a hangover from when this function was a copy-and-adapt construction from earlier code. So this can be renamed to r_dct_hash_lst_tplfilinf
Similarly, mtc_dct_file_selves
can be renamed to r_dct_file_selves
to clarify that this will be built and then returned.
Ditto for
-
dct_size_lst_hash_multi_files
which can becomer_dct_size_lst_hash_multi_files
Hence, after applying those renamings the code looks like:
def filedupetry_command( p_tree_paths, p_multi_log):
# set up structures for Phase 1
i_dct_size_lst_pfn = {} # dictionary of file sizes each with a list of pathfiles
r_dct_hash_lst_tplfilinf = {} # dictionary of hashes
r_dct_file_selves = {} # dictionary of files with info about them
# ---------------------
# Phase 1 - tree traversal
# Traverse the tree calculating and collecting hashes
tpi = 0
tpn = len(p_tree_paths)
for i_tree_path in p_tree_paths:
tpi = tpi + 1
traverse_tree_make_sizes_dct( i_dct_size_lst_pfn, r_dct_file_selves, i_tree_path, p_multi_log)
# ---------------------
# Phase 2 - Hashes for files the same size
build_hash_dct_from_Size_dct( r_dct_hash_lst_tplfilinf, i_dct_size_lst_pfn, p_multi_log)
# ---------------------
# Phase 3 - Analysis - make a size based dictionary of the hashes
r_dct_size_lst_hash_multi_files = make_dct_size_lst_hashed_with_multi_files( r_dct_hash_lst_tplfilinf )
# ---------------------
return r_dct_hash_lst_tplfilinf, r_dct_file_selves, r_dct_size_lst_hash_multi_files
Note that here we have glibly renamed a function call to be traverse_tree_make_sizes_dct
but we will of course need to rename that where it is defined.
The Inner Call
Now let's look at the inner function, and what renaming could or should be done in there.
Actually, it is really two layers because there is a recursive function, with a non-recursive outer call.
Here is the outer call:
# recursively traverse a folder structure, deriving and storing hashes
# this is the outer non-recursive part
# currently the only difference is the use of a depth counter in the recursive call
# and that's only there to help with debugging
def traverse_tree_add_hash( d_dct_size_lst_pfn, folderselfs_d, p_tree_path, p_multi_log):
i_lst_ncalls = [0]
traverse_tree_recurse( d_dct_size_lst_pfn, folderselfs_d, p_tree_path, 0, i_lst_ncalls, p_multi_log)
in which I've left comments in, to show that these also need changing - as the function does not in fact do any "deriving and storing hashes"
So, as we've already covered, we want to rename this to traverse_tree_make_sizes_dct
# recursively traverse a folder structure, compiling a dictionary of sizes
# this is the outer non-recursive part
# currently the only difference is the use of a depth counter in the recursive call
# and that's only there to help with debugging
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
i_lst_ncalls = [0]
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
So what I've done there is to also do some internal variable renaming:
-
d_dct_size_lst_pfn
becomingv_dct_size_lst_pfn
-
folderselfs_d
becomingv_dct_folderselfs
Now let's look at the inner, recursive, function.
# recursively traverse a folder structure, deriving and storing hashes
def traverse_tree_recurse( d_dct_size_lst_pfn, d_pfn_stock, p_tree_path, p_tree_depth, p_lst_ncalls, p_multi_log):
# d_hash_lst_tplfilinf is a dictionary being built, with hashes for keys, and lists of tuples for values
# d_pfn_stock is a dictionary being built, with folderpaths for keys, and tuples for values
# p_tree_path is the folder location now being accounted
p_lst_ncalls[0] += 1
fldr_n = p_lst_ncalls[0]
entries = explrtry.get_lst_sorted_os_scandir( p_tree_path )
filcount = 0
dircount = 0
for entry in entries:
if entry.is_file(follow_symlinks=False):
# assemble specifics and derive hash
pfn = im_osp.join(p_tree_path, entry.name)
size = entry.stat().st_size
# make as a tuple
n_t = tpl_FileInfo( pfn, entry.name, size, "")
# now store it
store_dct_size_lst_add_PathFile( d_dct_size_lst_pfn, n_t )
store_dct_pfn_tpl_FileInfo( d_pfn_stock, n_t )
filcount = filcount + 1
elif entry.is_dir(follow_symlinks=False):
# recursive call returning the hash of the subfolder
new_subpath = im_os.path.join(p_tree_path, entry.name)
traverse_tree_recurse( d_dct_size_lst_pfn, d_pfn_stock, new_subpath, p_tree_depth + 1, p_lst_ncalls, p_multi_log)
# count subfolders and prepare for recursion further down
dircount = dircount + 1
return
Here is the plan of which things I want to rename:
-
d_dct_size_lst_pfn
becomesv_dct_size_lst_pfn
-
d_pfn_stock
becomesv_dct_pfn_stock
-
p_lst_ncalls
becomesv_lst_ncalls
- we'll say more about this later
Here is the code after the renaming has been done,
# recursively traverse a folder structure, deriving and storing hashes
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
# d_hash_lst_tplfilinf is a dictionary being built, with hashes for keys, and lists of tuples for values
# v_dct_pfn_stock is a dictionary being built, with folderpaths for keys, and tuples for values
# p_tree_path is the folder location now being accounted
v_lst_ncalls[0] += 1
fldr_n = v_lst_ncalls[0]
entries = explrtry.get_lst_sorted_os_scandir( p_tree_path )
filcount = 0
dircount = 0
for entry in entries:
if entry.is_file(follow_symlinks=False):
# assemble specifics and derive hash
pfn = im_osp.join(p_tree_path, entry.name)
size = entry.stat().st_size
# make as a tuple
n_t = tpl_FileInfo( pfn, entry.name, size, "")
# now store it
store_dct_size_lst_add_PathFile( v_dct_size_lst_pfn, n_t )
store_dct_pfn_tpl_FileInfo( v_dct_pfn_stock, n_t )
filcount = filcount + 1
elif entry.is_dir(follow_symlinks=False):
# recursive call returning the hash of the subfolder
new_subpath = im_os.path.join(p_tree_path, entry.name)
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, new_subpath, p_tree_depth + 1, v_lst_ncalls, p_multi_log)
# count subfolders and prepare for recursion further down
dircount = dircount + 1
return
Issues that Renaming Brings To Light
Now one of the benefits of doing this kind of renaming - other than making the code a little clearer for the next time it needs to be read and comprehended - is that the process brings to light various other things that should be re-done.
In this case, we bump into some things I did very early in the process of adapting from my generic programming experience over to how to do things in Python.
- that the recursive calls should be sub-scoped to the outer call - i.e. the functions should be nested
- that this allows a replacement of an ugly use of a list as a single value store
First let's look at the nesting. Here is a pseudo-brief of the functions as they are.
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
return
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
return
That happens to be how things get done when coding in a language that supports recursion but not nesting. An example of this is VBA.
As Python does support nesting (as indeed does Pascal which is my mental internal coding language) that means we can restructure like this:
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
return
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
return
As presaged above, this allows a change to how one of the tracking variables is handled.
Now to be clear: as the code here was from my very early tackling of Python, when my inclination was to have a passed-by-reference scalar variable (i.e. just a number value) that would travel through the recursion to act as a progress counter. As Python doesn't support that idea - a scalar parameter when written to, creates a new local mutable instance - then I "got around" that by passing a list of just one value.
This makes for an ugly (in several senses) arrangement as see in this abstraction:
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, v_lst_ncalls, p_multi_log):
v_lst_ncalls[0] += 1
fldr_n = v_lst_ncalls[0]
print( "Folders traversed: " + str( fldr_n) )
return
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
i_lst_ncalls = [0]
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, i_lst_ncalls, p_multi_log)
return
Where the list-of-one-value has to be created in the outer layer, passed to the inner layer and carefully handled there - incremented and then passed to itself in recursion.
But with the nested version, we can dispense with the list-of-one-value and also not need to pass it as a parameter.
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
i_ncalls = 0
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, p_multi_log):
nonlocal i_ncalls
i_ncalls += 1
print( "Folders traversed: " + str( i_ncalls) )
return
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, p_multi_log)
return
Do note that this requires one of Python's rare syntactic-only elements nonlocal
to be deployed, to ensure that the required scoping of the variable i_ncalls
is implemented across the function call depths. I simply didn't know about this when I originally wrote the code.
- It's also a curious thing in Python syntax because it is only a guide instruction to the Python interpreter - no program action occurs at that point in the code. Also - in case it's not obvious - it requires the referred variable to already be known to the parser. Hence the following, which perhaps should be identical, will cause an error.
def traverse_tree_make_sizes_dct( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, p_multi_log):
def traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_pfn_stock, p_tree_path, p_tree_depth, p_multi_log):
nonlocal i_ncalls
i_ncalls += 1
print( "Folders traversed: " + str( i_ncalls) )
return
i_ncalls = 0
traverse_tree_recurse( v_dct_size_lst_pfn, v_dct_folderselfs, p_tree_path, 0, p_multi_log)
return
End Note
This was written as notes for making a video. If that happens then a link will be edited into here.
Top comments (0)