geraldew

Posted on Mar 19, 2023

Coding Diary 2023-03-19 an undesired feature in filecmp.py

#python

I'm writing this just as an exercise in open sharing. Perhaps it can show that there's value in copying a stock Python module and playing around with it.

The Problem

My program Foldatry is intended to run unattended for long periods of time as it pores over gigabytes and terabytes of files and folders. Here the focus is on a section that seeks to prove that two structures of files are identical - a useful thing to do after a big copy operation.

The function for doing this was meant to do a two part run:

first a "light touch" in which only the metadata was compared;
then, if that found everything looked the same, a second "heavy touch" run would compare all the file contents as well.

But what was happening was that the first round was taking much longer than expected. Indeed it seemed to be taking as long as I would expect the heavy touch to take.

So what was going on?

As it happened, I've long had good enough logging built into the program to show me where the problem was. It seemed to be inside my use of the dircmp class in the stock Python module filecmp.py.

Here is the documentation link for that module: https://docs.python.org/3.11/library/filecmp.html and from it, here is the brief for that class/object:

The dircmp class

class filecmp.dircmp(a, b, ignore=None, hide=None)

Construct a new directory comparison object, to compare the directories a and b. ignore is a list of names to ignore, and defaults to filecmp.DEFAULT_IGNORES. hide is a list of names to hide, and defaults to [os.curdir, os.pardir].
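
For readers who haven't used the class, here is a minimal sketch of a dircmp call on two throwaway folders. The folder and file names are invented for the illustration; only the dircmp attributes (same_files, diff_files, left_only) come from the stock module.

```python
import filecmp
import tempfile
from pathlib import Path

# Build two small throwaway folders (names are made up for this sketch).
root = Path(tempfile.mkdtemp())
left, right = root / "left", root / "right"
left.mkdir()
right.mkdir()

(left / "same.txt").write_text("identical content")
(right / "same.txt").write_text("identical content")
(left / "changed.txt").write_text("short")
(right / "changed.txt").write_text("a good deal longer")
(left / "extra.txt").write_text("only on the left")

comparison = filecmp.dircmp(str(left), str(right))

print(comparison.same_files)   # files dircmp considers equal
print(comparison.diff_files)   # files dircmp considers different
print(comparison.left_only)    # names present only in the left folder
```

Note that nothing in this call says how deeply the files should be compared. That decision is made inside the module, which is what the rest of this post is about.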

As it happens, I had looked at this module a while back (and that's another story) so I did have some idea of how it works. Enough to know that:

it is written purely in Python - although itself using some other stock Python modules
therefore, I could copy the module code into my own program and modify it as a way to see what was happening.

So that's what I did.

Being a little careful, I fetched a copy from the version of Python that I'm using (on Xubuntu 20.04) - from https://github.com/python/cpython/blob/3.7/Lib/filecmp.py and put it, slightly renamed, into my source code folder. As I'd already been using this module via an alias on import, I only had to change the reference to the now-local renamed copy.

After a quick check that this had taken effect, I started inserting print statements to see which internal functions were being called.
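
As a rough sketch of that kind of check, the snippet below copies the stock module to a temporary folder under a different name, loads it under an alias, and confirms which file the alias points at. The name filecmp_local and the alias fcmp are assumptions for the illustration, not the names used in Foldatry.

```python
import filecmp
import importlib.util
import shutil
import tempfile
from pathlib import Path

# Copy the stock module's source to a new, slightly renamed file.
workdir = Path(tempfile.mkdtemp())
local_path = workdir / "filecmp_local.py"   # hypothetical name
shutil.copy(filecmp.__file__, local_path)

# Load the copy as its own module and bind it to an alias.
spec = importlib.util.spec_from_file_location("filecmp_local", local_path)
fcmp = importlib.util.module_from_spec(spec)
spec.loader.exec_module(fcmp)

# The quick check: the alias now refers to the local copy, not the stock one.
print(fcmp.__file__)
print(fcmp.dircmp is not filecmp.dircmp)
```

In a real project, where the renamed copy sits in the source folder, a plain `import filecmp_local as fcmp` would do the same job. The importlib calls are only here so that the sketch runs from any location.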

The Issue Identified

Note: to keep this from being tedious I'll skip some of the concepts of this module and jump to the particular.

I soon determined that the flaw lay in a section where the metadata of files was being compared.

    s1 = _sig( os.stat( f1 ) )
    s2 = _sig( os.stat( f2 ) )
    if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG:
        return False
    if shallow and s1 == s2:
        return True
    if s1[ 1 ] != s2[ 1 ]:
        return False

In the context of use by the dircmp class, the parameter shallow will be True, and this is where things were failing. Because s1 == s2 was False, the function did not return early and instead went on to compare the files byte-by-byte - which is why the run was heavy instead of light.
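
The behaviour can be reproduced with the stock filecmp.cmp function. This is a small sketch using made-up file names and explicitly set timestamps, not the files from the actual run.

```python
import filecmp
import os
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
file_a = workdir / "a.txt"
file_b = workdir / "b.txt"   # same content as a, different mtime
file_c = workdir / "c.txt"   # different content, same size and mtime as a

file_a.write_bytes(b"hello world")
file_b.write_bytes(b"hello world")
file_c.write_bytes(b"hello_world")

# Set (atime, mtime) explicitly, in whole seconds.
os.utime(file_a, (1673954681, 1673954681))
os.utime(file_b, (1673954691, 1673954691))
os.utime(file_c, (1673954681, 1673954681))

# Signatures differ (mtime), so the shallow shortcut is skipped and the
# contents are read; they match, hence True.
a_vs_b = filecmp.cmp(file_a, file_b, shallow=True)

# Signatures are equal, so the shallow shortcut returns True without
# reading the (different) contents.
a_vs_c_shallow = filecmp.cmp(file_a, file_c, shallow=True)

# With shallow=False the contents are always read.
a_vs_c_deep = filecmp.cmp(file_a, file_c, shallow=False)

print(a_vs_b, a_vs_c_shallow, a_vs_c_deep)  # True True False
```

The first result is the case described here: the answer is still correct, but it was reached by reading both files in full.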

OK, so that raised the question of why that was failing.

Values from the `_sig` call

If you look at the code section above, you'll see that the s1 == s2 comparison is on values that had been fetched by calling a function named _sig.

Here is the code for that function.

def _sig(st):
    return (stat.S_IFMT(st.st_mode),
            st.st_size,
            st.st_mtime)

There's nothing obviously wrong with that. The detail lies in what is being used from the stock module stat, but happily I didn't need to go and inspect that module. Instead I just added some print statements to inspect the values that were going into the return clause. Here they are:

For file A:

32768
11
1673954681.5983226

For file B:

32768
11
1673954681.0

Clearly the difference is in the third value, which was being fed from st.st_mtime and while you can look that up, I can tell you it's just a float number representing the modification timestamp of the files. Being a float, the integer portion is the POSIX time in seconds. The fractional portion of the number is therefore the fractions of a second for the timestamp. As you can see, for one of the files the timestamp was stored with a fractional component and the other without one.
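
To make the comparison concrete, here are the two printed signatures rebuilt as tuples. The variable names are invented for the sketch; the numbers are the ones shown above.

```python
import stat

# The two signatures printed above: (file type, size, mtime).
sig_a = (32768, 11, 1673954681.5983226)
sig_b = (32768, 11, 1673954681.0)

# 32768 is stat.S_IFREG, i.e. both are regular files.
print(sig_a[0] == stat.S_IFREG)        # True

# Tuples compare element by element, so one differing float is enough.
print(sig_a == sig_b)                  # False

# The gap between the two timestamps is well under one second.
print(round(sig_a[2] - sig_b[2], 3))   # 0.598
```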

Whether timestamps are a good enough thing to compare, when deciding whether file contents need comparing at all, is clearly a good question. In this context, though, it was being so picky about the fractional parts that was causing my problem.

The Fix

So, in the short term, a simple fix is for me to wrap the collected times in int() so that only their integer components are compared.

Thus, the very minor code change is:

def _sig(st):
    return (stat.S_IFMT(st.st_mode),
            st.st_size,
            int( st.st_mtime ) )
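
Here is a small standalone sketch of the effect of that change. The helper sig_whole_seconds is a hypothetical stand-in with the same shape as the modified _sig, and the two temporary files are given the timestamps seen earlier by way of os.utime.

```python
import os
import stat
import tempfile
from pathlib import Path


def sig_whole_seconds(st):
    """Same shape as the modified _sig: type, size, mtime truncated to seconds."""
    return (stat.S_IFMT(st.st_mode), st.st_size, int(st.st_mtime))


workdir = Path(tempfile.mkdtemp())
file_a = workdir / "a.txt"
file_b = workdir / "b.txt"
file_a.write_bytes(b"hello world")
file_b.write_bytes(b"hello world")

# Reproduce the timestamps: one with a fractional part, one without.
# ns=(atime_ns, mtime_ns) sets the times in integer nanoseconds.
os.utime(file_a, ns=(1673954681_598322600, 1673954681_598322600))
os.utime(file_b, ns=(1673954681_000000000, 1673954681_000000000))

sig_a = sig_whole_seconds(os.stat(file_a))
sig_b = sig_whole_seconds(os.stat(file_b))
print(sig_a == sig_b)  # True
```

One caveat with truncation: it only hides differences that fall within the same whole second. Two timestamps such as ...681.9 and ...682.0 would still compare as different.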

Of course, having a local copy of a stock Python module is not a good situation, but there is a wider arc to that story.

Addendum 1

As it happens, I had already realised some time ago that I should write my own replacement for the filecmp.py module - not so much because I think it's bad, but because my needs are both more specific than it is intended to handle and call for something more flexible.

Indeed the very issue that prompted this post is an example - that there could be various strategies about matching files - their names and/or their metadata and/or their content - that I'd like handled in more controllable ways.

At the moment, some parts of those ideas are implemented as things to check and/or do after using the dircmp class. While I did start writing the replacement module - it's in the code base but is unused - I don't intend tackling it until I have settled all the different things I want it to cover. These are described in part of the Foldatry documentation so I won't detail them here.

However, for the context of this article, I will say that being able to usefully compare files that have been copied from one file system to another is a definite need that I have - and this is likely to throw up differences of how the timestamps are stored and handled on them.

There are some useful comments about this at "Get file timestamps (creation, modification, access date and time) in Python".
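
File systems store modification times at different resolutions (FAT keeps them to 2 seconds, while NTFS and ext4 keep sub-second precision), so one possible strategy is to compare timestamps within a tolerance rather than exactly. The helper below is purely hypothetical: it is not part of filecmp or Foldatry, and the 2-second default is an assumption chosen to cover FAT.

```python
def mtimes_match(mtime_a, mtime_b, tolerance_seconds=2.0):
    """Treat two modification times as equal if they differ by at most
    tolerance_seconds. Hypothetical helper, not part of filecmp or Foldatry."""
    return abs(mtime_a - mtime_b) <= tolerance_seconds


# The two timestamps from this post match within the default tolerance...
print(mtimes_match(1673954681.5983226, 1673954681.0))        # True
# ...but not when an exact match is demanded...
print(mtimes_match(1673954681.5983226, 1673954681.0, 0.0))   # False
# ...and a nine second gap is still treated as a real difference.
print(mtimes_match(1673954681.0, 1673954690.0))              # False
```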

Addendum 2

On how the stock module "works".

When I needed to look at this a while back it took me some reading and thinking to see why it was written the way it is. As an overview, it tries to use a "phased" approach to doing the comparisons - thus allowing a fairly simple external call to lead to quite varied amounts of operation depending on what is encountered in the folders and files.

When I realised that, I could see the sense of how it was written. It happens that for my needs in Foldatry, I - the application programmer - want to be more in control than that - so that told me I will need to write something more suited.
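
The phased behaviour can be seen from outside the module. The sketch below relies on a CPython implementation detail, namely that dircmp stores each result as an instance attribute the first time it is asked for, so treat it as an illustration rather than documented behaviour. The folder and file names are made up.

```python
import filecmp
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
left, right = root / "left", root / "right"
left.mkdir()
right.mkdir()
(left / "note.txt").write_text("same")
(right / "note.txt").write_text("same")

comparison = filecmp.dircmp(str(left), str(right))

# Nothing has been listed or compared yet: no result is stored on the object.
before = "same_files" in vars(comparison)

# Asking for same_files triggers the listing, the name matching and the
# file comparison, in that order.
same = comparison.same_files
after = "same_files" in vars(comparison)

print(before, same, after)  # False ['note.txt'] True
```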

I do happen to think that the module is written a bit more cryptically than it needs to be - but that's not an uncommon opinion for me to have of the Python that I see written.

Addendum 3

There's an unimportant story of why it took me so long to notice this undesired feature - and do note that I'm not calling this a "bug".

When I started Foldatry, the confirmation of perfect copies was the first part that I wrote - before any GUI and even before the primary reason I wrote anything (which was the matchsubtry feature).

That means that my testing was being done on large amounts of files from very early on. However, it was only quite a while later that I hit on the idea of doing a "light touch" run before embarking on the full file contents comparison. When I tested this, I probably didn't run it on very large file sizes and so saw no significant delays. And when I have run it - for real usage - on very large structures, knowing such a run would take hours, I'd run it overnight - so again, not noticing that the light touch run wasn't actually fast. I would have seen confirmation of the two runs in the logs but not inspected the times closely.

Of course, for the light touch to take a long time was dependent on there being minor differences in the timestamps - and in retrospect I'm guessing that may depend on which copying tools I had used and/or the sequences of copies and moves between ext4 and NTFS file systems (as most of my USB drives use the latter).

Oh well. I got there eventually. And I will at some point write the replacement module.
