Running Linux commands in parallel

Sérgio Luiz Araújo Silva

Intro

In my daily routine I like to watch TV series to improve my English. Sometimes I need to read the transcriptions to figure out what people are saying in them, so I am used to downloading whole series of transcriptions.

Many sites have those transcriptions in HTML format. To convert a web page to plain text I usually do:

lynx --dump URL > result.txt

Some sites inspect the "User-Agent" header to detect whether you are using some forbidden download tool. So I have the following alias:

alias lynx='lynx -display_charset=utf-8 -useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1"'

Any site will then assume that my lynx is actually a regular browser (Chrome, according to that user-agent string).
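A quick way to check which user-agent a site actually sees is to dump a service that echoes the request headers back; httpbin.org is just one example of such a service:

lynx --dump https://httpbin.org/user-agent

With the alias active, the output should show the Chrome user-agent string instead of Lynx's default.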

Filtering the content

Let's say I want to figure out what episodes have this string:

A little more, a little more.

For those cases I use ag, The Silver Searcher, a faster alternative to grep. The command ends up being:

ag 'A little more, a little more.' -i .

The result:

friends-transcript-s10e16.txt
328:A little more, a little more.

The first line gives me the file name; the second one gives the line number and the matching content.
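If you do not have ag installed, plain grep can give the same information, just more slowly:

grep -rni 'A little more, a little more.' .

Here -r searches recursively, -n prints line numbers and -i makes the match case-insensitive.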

Downloading all the transcriptions

The problem is exactly the transcriptions: how can I download all of them at once?

I have found a site where I can download all the transcriptions using the above lynx command, but how do I pass it all the different URL variations of season and episode?

GNU Parallel comes in handy

First I set a "main url" variable:

url='https://www.springfieldspringfield.co.uk/view_episode_scripts.php\?tv-show\=friends\&episode\='

How can I pass GNU Parallel a range of 10 seasons with 24 episodes each? Simple: using shell brace expansion like this:

echo friends-s{01..10}e{01..24}.txt

The last season has only 18 episodes, and @samvittighed helped me with this (see comments section):

s{01..09}e{01..24} s10e{01..18}
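You can sanity-check the expansion by counting the generated episode names; 9 seasons of 24 episodes plus 18 episodes of season 10 should print 234:

echo s{01..09}e{01..24} s10e{01..18} | wc -w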

And what about the download command? How would it look?

parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}

Remember, the $url variable was defined above. The backslashes in its value matter: GNU Parallel runs each generated command through a shell, and they keep that shell from interpreting ?, = and & specially.
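Before firing off 234 downloads it is worth previewing what GNU Parallel will actually run; the --dry-run option prints each generated command without executing it:

parallel --dry-run -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}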

Would it be possible to do the same using a pure shell script?

Yes, but it is hugely slower than GNU Parallel. It would look like:

mainurl='https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=friends&episode='

clear
echo "--------------------------------------------------------"
echo "Downloading transcriptions for every episode of Friends"
echo "--------------------------------------------------------"

for i in {01..10}; do
    season=s${i}
    echo

    for j in {01..24}; do
        episode=e${j}

        echo "Downloading season $i episode $j"
        lynx --dump "${mainurl}${season}${episode}" > "friends-${season}${episode}.txt"

        # season 10 only has 18 episodes, so stop there
        if [ "$i" == 10 ] && [ "$j" == 18 ]; then
            exit
        fi
    done
done

The speed of GNU Parallel is amazing: it downloads 20 files at a time instead of one by one.
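For completeness, plain bash can approximate this with background jobs. A minimal sketch, assuming the $mainurl variable from the script above and bash 4.3 or newer (for wait -n):

for ep in s{01..09}e{01..24} s10e{01..18}; do
    # throttle: keep at most 20 downloads in flight
    while [ "$(jobs -rp | wc -l)" -ge 20 ]; do wait -n; done
    lynx --dump "${mainurl}${ep}" > "friends-${ep}.txt" &
done
wait    # let the last batch finish

This recovers most of the speed, but GNU Parallel still gives you retries (--retries), job logging (--joblog) and tidier output handling for free.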

If you have any tip to improve this article, feel free to share it.

Cleaning all the files with vim

After downloading all files I wanted to delete the first 18 lines of each of them and also the last paragraph. So, I did:

vim *.txt
:set hidden
:silent argdo normal! gg18dd
:silent argdo normal! G{kdG
:silent argdo update
:qall

The first "argdo" runs a normal command, jumping to the first line of each file with "gg The second "argdo" command jumps to the end of the file "G", goes back one paragraph with "{", jumps up one line with "k" and delete to the end with "dG". The ":silent argdo update" command writes all of the files and the ":qall" command exists from all the files.

Editing the files with sed

Getting rid of undesired lines with sed is pretty simple:

sed -i '1,18d' *.txt
sed -i '/References/,$d' *.txt

The first command deletes the first 18 lines of every file. The second one deletes everything from the first line matching "References" to the end of the file.
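Both expressions can also be handed to a single sed invocation with -e, so each file is rewritten only once:

sed -i -e '1,18d' -e '/References/,$d' *.txt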

Discussion

Sam Vittighed

parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}

Sérgio Luiz Araújo Silva (Author)

Thanks a lot! I have been thinking about how much we learn as soon as we start sharing knowledge, exactly because other kind people are always willing to help. Actually, I am thinking of writing an article about this; it would be a collaborative article to reinforce that idea. What do you think?