Juha-Matti Santala

Posted on Jun 23, 2019 • Edited on Jun 25, 2019

I love writing scripts to solve small problems

#python #scripting #bash

One of the reasons I like programming so much is that it allows me to automate small and annoying things that would otherwise require a bunch of manual work.

Yesterday, I downloaded a set of files that came in the following directory structure inside a zip file:

- Main Folder
  - Theme A (1)
    - FileA.pdf
    - FileA.txt
    - FileA.jpg
  - Theme B (2)
    - FileB.pdf
    - FileB.txt
    - FileB.jpg
  - Theme C (3)
    ...

For my use case, though, I was only interested in the PDF files and wanted to record the order of those files, which was written in parentheses in the folder name. I could have manually moved them all to a new folder in Finder, but since there were a few dozen of them, I opened my editor and started writing Python.

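import os
import re
import shutil

# Matches a number in parentheses, e.g. the "1" in "Theme A (1)"
NUMBER_PATTERN = re.compile(r'\((\d+)\)')

for directory, _, files in os.walk('.'):
    # Skip the Output folder so already copied files are not processed again
    if directory == './Output':
        continue
    for filename in files:
        if not filename.endswith('.pdf'):
            continue

        # Folder number as a string, e.g. '1' (IndexError if the folder has none)
        episode_number = NUMBER_PATTERN.findall(directory)[0]
        path = os.path.join(directory, filename)

        # Zero-pad the number to two digits: '1' -> '01'
        new_filename = f'{episode_number:0>2} - {filename}'
        new_path = os.path.join('Output', new_filename)

        print(f'Copying {path} to {new_path}')
        shutil.copyfile(path, new_path)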

It's a single-run script that relies on a very specific naming and file structure as well as the existence of an Output folder. So if something's out of order, it just breaks.

That means it's not very maintainable and it probably isn't the best or most Pythonic code I could write. But since it's a script meant to run once in this very particular situation, I can recover from error situations manually.
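
To make that dependency on naming concrete, here is a minimal sketch of what the pattern and the format spec do with a single folder name. The `build_new_filename` helper is hypothetical (it is not part of the script above); it only isolates the renaming step so it can be tried on its own.

```python
import re

NUMBER_PATTERN = re.compile(r'\((\d+)\)')


def build_new_filename(directory, filename):
    """Hypothetical helper: prefix filename with the zero-padded number
    found in parentheses in the directory name."""
    # findall returns the captured digits as strings, e.g. ['1']
    episode_number = NUMBER_PATTERN.findall(directory)[0]
    # '0>2' right-aligns the string and pads it with '0' to width 2
    return f'{episode_number:0>2} - {filename}'


print(build_new_filename('./Main Folder/Theme A (1)', 'FileA.pdf'))
# prints: 01 - FileA.pdf

# A folder without "(number)" gives an empty list, so [0] raises IndexError
try:
    build_new_filename('./Main Folder/Extras', 'Notes.pdf')
except IndexError:
    print('No number in folder name')
```

The second call shows the kind of breakage meant above: a folder that doesn't follow the "Name (number)" convention stops the script with an exception.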

And that's the beauty of it. It doesn't have to be good code, it just has to work once. It saves me lots of annoying manual copying and renaming files.

As opposed to the quest of writing good, maintainable and error-resistant code at work for production, these scripts allow me to get small wins by just scraping some code together.

Edit: I want to highlight this beautiful bash script that @teroyks created in the comments:

find . -iname "*.pdf" | while read F; do FILE=$(basename "$F"); NR=$(printf "%02d" "$(echo "$F" | sed "s/.*(\(.*\)).*/\1/")"); cp -v "$F" "./Output/$NR - $FILE"; done

Edit 2: @teroyks also provided an example in fish shell:

for f in (find . -iname "*.pdf")
    set file (basename $f)
    set number (string match -r "\((.*)\)" $f)[2]
    set number (printf "%02d" $number)
    cp -v $f "./Output/$number - $file"
end

Latest comments (25)

Tero Y • Jun 25 '19 • Edited

Just for fun, here is the way I would most likely currently do this – using the fish shell instead of bash. Bash is, of course, more standard and concise, but the fish version is much more readable – fish has IMO much of the same elegance and delight as Python. Fewer special characters ($), more intelligent variable handling (less worrying about remembering to quote everything), etc.

Less street cred for producing esoteric incantations though. :-)

for f in (find . -iname "*.pdf")
    set file (basename $f)
    set number (string match -r "\((.*)\)" $f)[2]
    set number (printf "%02d" $number)
    cp -v $f "./Output/$number - $file"
end

(The string matching and printf could be combined into one line, but since fish has multi-line command editing as a standard feature, doing the two things separately makes the code more readable.)

Juha-Matti Santala • Jun 25 '19

This looks great!

Much easier to read and understand than the bash pipe.

James McPherson • Jun 24 '19

This is the sort of relatively simple use-case that I'd just use a shell kinda-one-liner for. Admittedly, I've done thousands of these over the years (ripping CDs to flac and renaming etc) but shell is what comes to me first for this.

Something along the lines of

$ LIST=`find * -name \*pdf`;
$ for f in $LIST; do \
    N=`echo $f|sed -e"s, ,,g;s,(,_,g;s,),,g"`; \
    mv "$f" "$N"; done

You'd need to match that up with your preferred directory hierarchy and nomenclature; my bias is against spaces and suchlike in filenames so I remove and replace with - and _ depending on what I wish to make clear.

The great thing, though, is that the problem space here is simple enough that there are several different ways to solve it, each of which allows us to use our favourite language. But - please! - don't try doing regex in C!

Juha-Matti Santala • Jun 24 '19

Yeah, shell scripts are also awesome. I find that once things get a little bit more complex (like requiring looping, conditionals and regex in the same script), my skills with bash scripting are not sufficient and I tend to jump into Python because it's fastest for me.

Rémi Lavedrine • Jun 24 '19

I love writing small scripts (shell or Python) to automate anything that involves repetitive tasks.
For instance, I did it to experiment with Git rebase very recently (I even wrote an article about that).
When I am talking to non-tech people, I am just stunned at how much less work they could do by knowing these things and automating this kind of work.

Juha-Matti Santala • Jun 24 '19

So happy to hear other people enjoy it too!

I think the first step to getting people started is to make them aware of what's possible. That will get their imagination rolling, and they will start to correctly identify problems they encounter and then learn the required skills to solve them.

Adrien • Jun 24 '19

I see I've become a command line snob because I was expecting two bash commands separated by a pipe.

Edvin Dunaway • Jun 24 '19

agree, I would use bash for this

Juha-Matti Santala • Jun 24 '19

@eddinn, would you like to help me learn how to solve this with bash?

Edvin Dunaway • Jun 24 '19

hey, I just saw your reply, and also see that @teroyks beat me to it.
He did it well! :)

Juha-Matti Santala • Jun 24 '19

My bash skills are good up to a point, but any time I need regex stuff or conditionals, I find myself grabbing Python because I'm more comfortable building things with it.

Would love to see and learn how this could be done in bash!

Tero Y • Jun 24 '19 • Edited

Perhaps not the most elegant way (and uses a few more than two commands), but this is how I would have done it:

find . -iname "*.pdf" | while read F; do FILE=$(basename "$F"); NR=$(printf "%02d" "$(echo "$F" | sed "s/.*(\(.*\)).*/\1/")"); cp -v "$F" "./Output/$NR - $FILE"; done

So, basically:

- Find all the PDF files under the current directory (case-insensitive name search) and output a list of their full (relative) paths
- Loop through the list one by one
  - Save the file name (without path) into the var FILE
  - Extract the episode number with sed and a regex
  - Pad the number with leading zeros with printf and save into NR
  - Copy the file into Output with a new name (prepended with the episode number)

Could use internal bash functions instead of some of the external utilities, but internal shell logic is (at least for me) harder to remember than simple utility commands.

The biggest gotcha in handling file names is to remember to surround the values with quote marks when outputting them – otherwise, bash will split file names with spaces into several values, and everything will break.
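
As a rough illustration of that gotcha in the post's own language, here is a minimal Python sketch using the standard `shlex` module. `shlex` only approximates shell word splitting (it does no globbing or variable expansion), and the example path is an assumption based on the folder layout in the post, so treat this as a demonstration of the idea rather than of what bash does internally.

```python
import shlex

# Assumed example path, following the folder layout from the post
path = './Main Folder/Theme A (1)/FileA.pdf'

# Unquoted: splitting on whitespace breaks the path into several words
unquoted = shlex.split(f'cp {path} Output')
print(unquoted)

# Quoted: shlex.quote wraps the path so it survives as a single word
quoted = shlex.split(f'cp {shlex.quote(path)} Output')
print(quoted)
```

In the first case the copy command would see four separate path fragments instead of one source file; in the second, the path arrives as a single argument.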

Juha-Matti Santala • Jun 24 '19

This is so cool, thanks Tero for taking the time to educate!

Individually, all parts are familiar to me but I probably couldn't have constructed such a beautiful pipe.

Juha-Matti Santala • Jun 24 '19

I added this as an edit into the original post so people interested in bash scripting can also find it more easily.

still-dreaming-1 • Jun 24 '19 • Edited

And yet the reason this is possible is that someone did take the time to write nice reusable code. It's just short-term thinking vs long-term. If you stick to one language as much as possible and keep building up a more and more reusable set of base libraries, you can become more and more productive over time, except that you are wasting all of it writing reusable code. Wait... Does that mean it's both a win-win and lose-lose scenario at the same time? Anyway, I agree it would not have made sense to turn it into a reusable script, but realizing which parts were harder to get right or took too much time, and finding possible places to embed that experience into your existing reusable libraries, can be a good thing. Of course, the time and energy to do that are not always there in the moment.

Juha-Matti Santala • Jun 24 '19

That is true. It's great to have a community that builds so that we can build on top of each other's work.

If there's anything novel in my scripts, I try to package it and make it reusable (at least on the second or third time I'm using it). But quite often it's just like this code above. Trying to make it general would probably make it even harder to read and extend because it would require many levels of abstraction.

There's one thing (a parser for a certain type of data) that I've built more times than I care to admit, and I'm planning on building it into a library in either Python or JavaScript (or both) because I'm starting to have quite a nice grasp of what it takes and how it's supposed to work.

still-dreaming-1 • Jun 24 '19 • Edited

Yeah the code in that script does look simple enough that the reusable part seems to have been made for you already. There comes a point at which a thing (library, user interface, car) is good enough that trying to make it better is mostly just rearranging or abstracting things, which only confuses people.

One thing I disagree with that the development community has been promoting lately is this idea of waiting until some code has been duplicated 3 times before you remove the duplication. If you can tell the code you wrote/will write is a good candidate for making reusable and that you will eventually be able to reuse that at some point, it is better to just write/rewrite it in a reusable way from the beginning. Particularly I disagree with the part where people say you won't know what a good abstraction looks like or how it really will be used until you have seen it repeat a few times. Instead I feel the fastest way to a good abstraction is to make it one right away and then look for good opportunities to start reusing it. This will help you to feel the pain of using it as early as possible, giving you feedback more quickly that you can incorporate. I feel the moment of greatest motivation to write it in a reusable way is when you first recognize it is possible, and if you wait until later you are more likely to put it off longer than intended. I also disagree with the reason where people say/imply the first attempt will necessarily be so bad/off as to be a waste of time and effort because I think this is a skill we can and should strive to learn. What I mean is, if you had the ability to write good, reusable code on the first attempt, would that not be the best way and a worthwhile skill to have? To say it is impossible and then to not try is a self fulfilling belief, the only way one can learn this skill is to keep doing it, mess up, and learn and improve.

At some point I'm going to write an article about type oriented programming, which is a way of thinking and coding that very naturally leads to creating reusable code that is so reusable you will actually yearn to reuse it everywhere. It inevitably leads to the discovery of missing core types, in whatever language you are using, that the language or standard library really should have included in itself long ago or from the beginning.

Juha-Matti Santala • Jun 24 '19

Thanks for a great and thoughtful reply. I'll share my thoughts on some of the points you raised.

One thing I disagree with that the development community has been promoting lately is this idea of waiting until some code has been duplicated 3 times before you remove the duplication.

My opinion on this depends a bit on what kind of duplication we are talking about. If it's within a codebase, I totally agree: duplication leads to many issues in maintainability and will eventually cause issues when someone doesn't realize they need to change things in multiple places.

If it's not exactly duplication but the code varies a bit, I'm 50/50. Sometimes it makes sense to parameterize the code but sometimes it creates a situation where the new code is actually harder to read and understand compared to nearly duplicating the code in a couple of places.

When it's about reinventing the wheel in the context of this blog post, I find it useful to reinvent things a couple of times every now and then. It might be that the code is never needed again, so spending a lot of time upfront turning it into a reusable library can be wasted time.

Particularly I disagree with the part where people say you won't know what a good abstraction looks like or how it really will be used until you have seen it repeat a few times. Instead I feel the fastest way to a good abstraction is to make it one right away and then look for good opportunities to start reusing it.

Building something and then starting to use it to learn and improve until you reach a good abstraction is exactly what the statement you seem to disagree with is about. Based on a single use case, we rarely know the best abstractions or APIs for the generic use case, and thus I think it's a good approach to see what abstractions arise from the usage rather than trying to always define them upfront.

I also disagree with the reason where people say/imply the first attempt will necessarily be so bad/off as to be a waste of time and effort because I think this is a skill we can and should strive to learn. What I mean is, if you had the ability to write good, reusable code on the first attempt, would that not be the best way and a worthwhile skill to have? To say it is impossible and then to not try is a self fulfilling belief, the only way one can learn this skill is to keep doing it, mess up, and learn and improve.

I totally agree with you that this is something we should all strive for. However, though I believe you can become better at it, I don't see this as purely a technical skill. It's rather about realizing our limitations as human beings in predicting the future. I believe these ideas in the community stem from the agile movement that promotes the idea that you should not build for the uncertain future.

All in all I think it's about finding the right balance. Sure, we'd be well off if we could spend lots of time building beautiful, generic and reusable code, but it's always a tradeoff of losing progress in other places. As we improve as developers, we become better at writing good code, but predicting the future is still very difficult.

For example, in my small script, one thing that could make it more reusable would be the ability to configure in which format and where the episode numbers are. If it only works with the Theme A (1) format, it won't be very generically usable.

However, trying to plan for that before knowing what use cases there are will be very challenging. Maybe it's in the format of 1 - Theme A, maybe it's Theme A - 1, maybe it's 00001 Theme A and so on. Especially if the names can contain numbers, we cannot just rely on finding any number with regex.

However, if I find myself doing this same thing with very similar directory structures in the future and see it becoming a pattern in use cases, I can definitely parameterize it further to make it more usable. Until that happens, I won't be able to know which direction is the right one.
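
To sketch what such parameterization might look like, here is one possible shape, assuming the caller knows in advance which naming style a set of folders uses. The `PATTERNS` table, the style names and the `extract_number` function are all hypothetical and not part of the original script; the patterns only cover the formats mentioned above.

```python
import re

# Hypothetical pattern table: each regex captures the number in a group
# named "number". None of these names exist in the original script.
PATTERNS = {
    # Theme A (1)
    'parentheses': re.compile(r'\((?P<number>\d+)\)\s*$'),
    # 1 - Theme A, 00001 Theme A
    'prefix': re.compile(r'^(?P<number>\d+)\b'),
    # Theme A - 1
    'suffix': re.compile(r'-\s*(?P<number>\d+)\s*$'),
}


def extract_number(folder_name, style):
    """Return the number in folder_name as an int, or None if the
    folder name does not match the given style."""
    match = PATTERNS[style].search(folder_name)
    return int(match.group('number')) if match else None


print(extract_number('Theme A (1)', 'parentheses'))
print(extract_number('1 - Theme A', 'prefix'))
print(extract_number('00001 Theme A', 'prefix'))
print(extract_number('Theme A - 1', 'suffix'))
# A name that itself contains a number: only the parenthesized one counts
print(extract_number('Top 10 Hits (3)', 'parentheses'))
```

Even this small table already bakes in guesses about which formats matter, which is the difficulty described above: without real use cases, it's hard to know whether these are the right three styles.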

At some point I'm going to write an article about type oriented programming, which is a way of thinking and coding that very naturally leads to creating reusable code that is so reusable you will actually yearn to reuse it everywhere.

Looking forward to reading it, sounds really interesting!

still-dreaming-1 • Jun 24 '19 • Edited

Thanks for the response. I do agree that ultimately balance is needed and things like this will always be a judgement call. Just one more point I would like to make. One reason why I often like to start writing some reusable code right away, instead of waiting for the different use cases to build up, is that I like the way this provides a placeholder for you to evolve the abstraction. So even if it is true that we can't really predict the future and know the different ways something will be used in advance, it is easier to respond to that knowledge as it builds up by already having a place to capture it. It's basically an application of the principle of environmental design, as explained in the book "Willpower Doesn't Work - Discover the Hidden Keys to Success", by Benjamin Hardy. If you make something hard, we tend to put it off, but if you set up the right environment for an activity to be easy, we are much more likely to do it. Environmental design is also about motivating yourself through investment. If you know you have already invested some time and energy in making an abstraction, you don't want that investment to go to waste, so you are more likely to improve and evolve it as needed. But the opposite is also true: if you already invested in making the code work without relying on abstractions, you won't want that investment to go to waste by having to alter it to use an abstraction when it already works and maybe is not too bad as it is. Anyway, I'm speaking in ideal terms and am in no way implying that I or anyone else should operate on this level at all times. I just feel these are areas where the development community in general is failing to recognize that we should improve, not because we don't want to improve, but because we don't yet see ways to make this kind of improvement practical. And that is all well and good, recognize your limitations and keep working in a practical and professional way.
But at the same time it is good to keep an open mind and occasionally experiment with ways to improve these things that seem like there may not be a good way to improve them.

still-dreaming-1 • Jun 26 '19 • Edited

Forgot to provide another little teaser about a future article. Type oriented programming (TOP) is the core technical practice involved in a software engineering philosophy I call "feedback looped emergence" (FLE). It is used to coax code and software solutions to appear and grow into their truest, purest form, as an emergent property of working on and in a feedback looped system.

ebaad • Jun 24 '19

I love scripting for exactly the same reason. I do not consider myself a professional programmer, but scripting small things keeps me sharp and gives me some gratification.
Thanks for a good article that resonated with a lot of folks.

Juha-Matti Santala • Jun 24 '19

That's how my career got started: I just wrote small Perl scripts, mostly to build tools for sports stats aggregation and manipulation, then started building web interfaces around them, and a few years later found myself working as a software developer.

And even if you never want to become a full professional software developer, the ability to solve your own small problems with computers is a huge skill!

Leonardo Furtado • Jun 23 '19

I think that sometimes this is the real purpose of writing a script: to solve a problem fast.
If you refactor simple code like this, you will probably waste time... And what was the initial purpose? That's right, to avoid wasting time. Nice example, thank you for sharing with us.

Juha-Matti Santala • Jun 23 '19

That is such a great point! And it's premature optimization to start refactoring this in case it could become useful as a general script in the future.
