Yup, I was wondering about rearranging it so that it's driven by a single Python process which calls diff in a subprocess, rather than a bash script which calls Python twice.
In my tests, calling bexplode.py on either of the individual files takes about 5 seconds, so I think the 10 seconds to get the diff is dominated by doing this twice. (I'm redirecting the output to /dev/null for timing, so it's not limited by the terminal.)
But maybe it's possible to just make bexplode faster. E.g. I spotted that from Python 3.8, bytes.hex() takes a separator parameter, so sep=' ' can replace a re.sub() call. That's about a 30% speedup in my tests.
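To illustrate the swap, here's a minimal sketch. I'm guessing at what the regex in bexplode.py looks like, so `hex_with_regex` and its pattern are assumptions for comparison only; `bytes.hex()` with a separator is the real Python 3.8+ call.

```python
import re


def hex_with_regex(data: bytes) -> str:
    # Assumed shape of the older approach: add a space after every
    # pair of hex digits, then trim the trailing space.
    return re.sub(r"(..)", r"\1 ", data.hex()).rstrip()


def hex_with_sep(data: bytes) -> str:
    # Python 3.8+: bytes.hex() accepts a separator, so no regex pass is needed.
    return data.hex(" ")


sample = bytes(range(256))
# Both approaches should produce identical text for the same input.
assert hex_with_regex(sample) == hex_with_sep(sample)
print(hex_with_sep(sample[:8]))
```

The speedup will depend on the file and the machine, so it's worth timing both on your own data.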
So the current bash script does indeed call the Python script twice; each call's output is written to a named pipe (typically held in RAM) rather than to disk.
I suspect the overhead of calling the two python processes from a bash script would be significantly outweighed by writing to two on-disk temp files, which diff then needs to read... I was partway through trying to implement the suggestion when I realised this... 🤦♂️
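For what it's worth, a single Python driver wouldn't have to use temp files. Here's a rough sketch that hands diff two anonymous pipes instead, assuming a Unix-like system with `/dev/fd` and a `diff` binary on the PATH. `explode` is a hypothetical stand-in for whatever bexplode.py really prints, and `diff_exploded` is a name I made up.

```python
import os
import subprocess
import threading


def explode(data: bytes, width: int = 16) -> str:
    # Hypothetical stand-in for bexplode.py: one line of spaced hex per `width` bytes.
    return "".join(
        data[i:i + width].hex(" ") + "\n" for i in range(0, len(data), width)
    )


def _feed(fd: int, text: str) -> None:
    # Write the text into the pipe, then close it so diff sees end-of-file.
    try:
        with os.fdopen(fd, "w") as pipe:
            pipe.write(text)
    except BrokenPipeError:
        pass  # diff exited before reading everything


def diff_exploded(a: bytes, b: bytes) -> str:
    read_a, write_a = os.pipe()
    read_b, write_b = os.pipe()
    # diff opens the read ends through /dev/fd, much like bash's <(...) does.
    proc = subprocess.Popen(
        ["diff", f"/dev/fd/{read_a}", f"/dev/fd/{read_b}"],
        pass_fds=(read_a, read_b),
        stdout=subprocess.PIPE,
        text=True,
    )
    os.close(read_a)
    os.close(read_b)
    # One writer thread per pipe, so neither write can block the other.
    writers = [
        threading.Thread(target=_feed, args=(write_a, explode(a))),
        threading.Thread(target=_feed, args=(write_b, explode(b))),
    ]
    for thread in writers:
        thread.start()
    output, _ = proc.communicate()
    for thread in writers:
        thread.join()
    return output


print(diff_exploded(b"\x00\x01\x02", b"\x00\xff\x02"), end="")
```

Note that the hex conversion here runs in threads within one process, so this only removes the temp files and the second interpreter start-up; it doesn't by itself make the conversion any faster.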
I did have a look at difflib to see if there's a way of passing the chunks to it gradually, but no, it expects all of the data to be loaded up front as a list of strings. On reflection, I do believe it is part of the diffing algorithm to look at "future" data further down so as to compute the context. In the worst case, the context will be searched for across the whole file...
That is generally a sensible design, because diff is normally used on source files of only a few kilobytes...
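Here's a small sketch of what I mean about difflib wanting everything up front. The two lists are made-up sample lines, not real bexplode.py output; the point is only that both inputs are complete lists of strings before the diff starts.

```python
import difflib

# Made-up sample data: each file is already fully loaded as a list of lines.
old_lines = ["00 01 02\n", "03 04 05\n", "06 07 08\n"]
new_lines = ["00 01 02\n", "03 ff 05\n", "06 07 08\n"]

# The result is produced lazily, but both inputs must be whole sequences.
diff_lines = list(
    difflib.unified_diff(old_lines, new_lines, fromfile="FILE1", tofile="FILE2", n=1)
)
print("".join(diff_lines), end="")
```

So even though the output comes back as a generator, there's no obvious way to feed the inputs in chunk by chunk.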
--
Try running
    # Print the start date
    date
    # Run a no-op that takes the file-descriptors for FILE1 and FILE2 simultaneously
    # The completion times for each will write to stderr only after completion
    : <(bexplode.py FILE1 >/dev/null; echo "File 1: $(date)" >&2) <(bexplode.py FILE2 >/dev/null; echo "File 2: $(date)" >&2)
    echo "Please wait - processing in background ..."
It should show that the total time is only the time taken for the larger file, since the two conversions run at the same time.
The main culprit is likely diff, loading both fully, and then running the diffing algorithm...