DEV Community: brthanmathwoag

Helping pandoc generate a correct table of contents from HTML input

brthanmathwoag — Thu, 07 Sep 2017 00:00:00 +0000

TL;DR:

pandoc expects chapter headers to be placed directly inside the body node. No <div> wrappers allowed.
pandoc takes the book title from the last <title> tag it sees (e.g. the one in the last file on the command line).

It's been a long time since I last had to convert an HTML ebook to EPUB. Last time I did, I couldn't make calibre put the chapters in the correct order ¹ and got so angry that I tried to hand-craft the file with bash and a bunch of regexen. It was certainly an interesting experiment and I've learned much about EPUB internals in the process. I've also learned that while it's OK to parse a limited, known set of HTML with regex, it's much more convenient to use an actual HTML parser.

Since then, I fell in love with pandoc and have been using it extensively for various projects ². So when I recently wanted to read the F# Programming Wikibook on my Kindle, I knew I would use pandoc for conversion.

My enthusiasm somewhat dropped when I examined the resulting file and found out that the generated table of contents consisted of a single entry, named after the last chapter of the book. And this was not just a problem of broken navigation.

An EPUB file is essentially a zip archive with chapters stored in separate HTML files. Thanks to that, ebook readers can open them one by one, which means quicker load times and a lower memory footprint. Because pandoc didn't know how to split the book into chapters, it put them all into a single file, so my reader had to slurp and format the entire text before displaying anything, grinding it to a halt for over a minute each time the book was opened.
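Since an EPUB is a zip archive, its layout can be inspected with a few lines of Python. The sketch below builds a tiny, deliberately incomplete archive in memory (the file names are made up for illustration, and a real EPUB carries more metadata files) and lists its entries; pointing list_epub_entries at the_book.epub should work the same way.

```python
import io
import zipfile


def list_epub_entries(epub_file):
    """Return the names of all files stored in an EPUB (a zip archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


def build_demo_epub():
    """Build a tiny, incomplete EPUB-like archive in memory (hypothetical names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("ch001.xhtml", "<html><body><h1>Chapter 1</h1></body></html>")
        archive.writestr("ch002.xhtml", "<html><body><h1>Chapter 2</h1></body></html>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # One entry per chapter is what lets a reader load chapters one at a time.
    for name in list_epub_entries(build_demo_epub()):
        print(name)
```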

For HTML input, pandoc is supposed to generate the TOC automatically from the <h1>, <h2>, ... <h6> markup. After some experimentation, it turned out that pandoc expects chapter headers to be placed directly inside the body node. While this makes sense for documents written for the sole purpose of being packaged as EPUB, this is rarely the case with HTML pages on the Internet, where you will often find the actual content wrapped in several layers of divs (or tables, if you are unfortunate enough to roam such dangerous, godforsaken places).
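To see whether a given page is affected, it helps to measure how deeply its headers are nested. Here is a rough sketch that uses only the standard library's html.parser; the class and function names are my own, and it assumes reasonably well-formed markup (an unclosed <p> would inflate the count). A depth of 0 means the header is a direct child of body.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HeadingDepthParser(HTMLParser):
    """Record how many elements sit between <body> and each heading."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.found = []   # (heading tag, number of wrappers below body)

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS and "body" in self.stack:
            depth = len(self.stack) - self.stack.index("body") - 1
            self.found.append((tag, depth))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def heading_depths(html):
    """Return a list of (heading tag, nesting depth below body)."""
    parser = HeadingDepthParser()
    parser.feed(html)
    return parser.found


if __name__ == "__main__":
    wrapped = "<html><body><div><div><h1>Chapter 1</h1></div></div></body></html>"
    print(heading_depths(wrapped))
```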

Here's a test case. Say, we have a book titled The Book, which consists of four chapters:

seq 1 4 | while read idx; do
    cat > "ch$idx.html" <<EOF
<html>
    <head>
        <title>Chapter $idx - The Book</title>
    </head>
    <body>
        <div>
            <div>
                <div id="content">
                    <h1>Chapter $idx</h1>
                    <p>Lorem ipsum, dolor sit amet.</p>
                </div>
            </div>
        </div>
    </body>
</html>
EOF
done

When we feed them to pandoc, we get a broken TOC with a title page and a single chapter, spanning all the input files:

pandoc -o the_book.epub ch*.html

To fix the table of contents, we have to help pandoc a little and move the <h1>s up the tree until they are children of body. Here's how we can do this with Python and Beautiful Soup:

fix.py:

import bs4

filenames = [
    'ch1.html', 'ch2.html', 'ch3.html', 'ch4.html'
]

for filename in filenames:
    with open(filename, 'r') as f:
        soup = bs4.BeautifulSoup(f, 'lxml')

    current = soup.find(id='content')
    while current.name != 'body':
        parent = current.parent
        current.unwrap()
        current = parent

    out_filename = filename.replace('.', '-flat.')
    with open(out_filename, 'w') as f:
        f.write(soup.prettify())

For each file specified, the script builds the DOM and finds the node with the actual content, in this case the one whose id is content (if your div doesn't have an id assigned but it has a specific class, you can get it with soup.find(class_='...') instead). The call to the unwrap method replaces the node with its children, and we move up the tree to the parent of the deleted node. This is repeated until the body node is reached. Finally, the DOM is saved to a file with -flat appended to its name.

python fix.py

pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html

That's better. The chapters were detected and split correctly, but the top two entries, which are supposed to be the title of the book, are incorrectly captioned Chapter 4 - The Book. You might have noticed that this is the text in the <title> of the last file on the command line.

pandoc takes the book title from the contents of the <title> node. When it is invoked with multiple input files and there is more than one <title> tag, pandoc uses the last one seen. But for ebooks spanning several HTML documents, the <title>s usually denote the chapter names, and shouldn't have any impact on the title of the book.

To fix that, we have to dive into the HTML once again ³, make sure there is only one <title> tag in our input files, and that it is set to the desired book title:

actual_title = 'The Book'

title_node = soup.find('title')
if filename == filenames[0]:
    title_node.string = actual_title
else:
    title_node.extract()

Finally, the table of contents looks as expected:

pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html

¹ This was supposedly controlled by the breadth-first order toggle in Preferences → Plugins → HTML to ZIP plugin but turning it on seemed to have no effect at all.

² My thesis being the obvious one, but also static website generation - both this blog and ninjastyles.tznvy.eu run on pandoc and some Python magic.

³ This actually sounds like a good use case for a regex. Or $EDITOR, if there are only a few files. But let's do this in Python, just to be consistent.

This post was originally published on blog.tznvy.eu

Dealing with dependence on generated source files in a Makefile

brthanmathwoag — Thu, 27 Jul 2017 00:00:00 +0000

I think I finally got it right today.

I have a static website, with pages generated from markdown documents. Some of them are written by hand, but most are generated with a python script from a flat database in a json file. The build process is automated with a Makefile.

I wanted to trigger the generator only when necessary, that is, when either the json or the script was changed. There are too many markdown docs, however, to hardcode their names in the Makefile. Normally, this is solved by enumerating the relevant file names in a source directory with a wildcard filemask and prepending the path to the target directory:

ALL_MDS_IN_OBJ = $(wildcard $(OBJ)/*.md)
ALL_HTML_IN_BIN = $(addprefix $(BIN)/, $(notdir $(ALL_MDS_IN_OBJ:.md=.html)))

default: $(ALL_HTML_IN_BIN)

$(BIN)/%.html: \
    $(OBJ)/%.md \
    | $(BIN)

    ... convert mds to html with pandoc ...

$(OBJ)/%.md: \
    $(SRC)/source_data.json \
    generate-mds-from-json.py \
    | $(OBJ)

    ./generate-mds-from-json.py

The problem is, when make is run for the first time after cloning the repo or doing make clean, there are no markdown files in $(OBJ) yet. The wildcard doesn't match anything, so make happily announces that nothing needs to be done and exits.

To cope with that, I was running the generator manually before make each time I changed the json file or the script, which, of course, in the long run turned out to be tedious and error-prone.

Then I saw this question on Stack Overflow and was finally enlightened.

As ChrisW explained, the first generator run could be forced by introducing a dependency on a sentinel file kept separately from the build artifacts, whose sole purpose is to carry the timestamp of the last generator run. The file would be touched if the generation was successful, and deleted on make clean.

MDS_SENTINEL = .mds_sentinel

default: \
    $(MDS_SENTINEL) \
    $(ALL_HTML_IN_BIN) \
    ... other deps ...

$(MDS_SENTINEL): \
    $(SRC)/source_data.json \
    generate-mds-from-json.py

    ./generate-mds-from-json.py \
        && touch $(MDS_SENTINEL)

clean:
    rm -r $(MDS_SENTINEL) $(OBJ)/* $(BIN)/*
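The timestamp check that make performs on the sentinel can be pictured in a few lines of Python. This is only a sketch of the general rule (rebuild when the target is missing or older than any prerequisite), not make's actual implementation; the helper names and the fixed timestamps are made up for the demo.

```python
import os
import tempfile


def needs_regeneration(sentinel, prerequisites):
    """Rebuild if the sentinel is missing or older than any prerequisite."""
    if not os.path.exists(sentinel):
        return True
    sentinel_mtime = os.path.getmtime(sentinel)
    return any(os.path.getmtime(p) > sentinel_mtime for p in prerequisites)


def touch(path, mtime):
    """Create the file if needed and set an explicit modification time."""
    with open(path, "a"):
        pass
    os.utime(path, (mtime, mtime))


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        data = os.path.join(workdir, "source_data.json")
        sentinel = os.path.join(workdir, ".mds_sentinel")
        touch(data, 1000)
        first = needs_regeneration(sentinel, [data])    # no sentinel yet
        touch(sentinel, 2000)                           # generator succeeded
        second = needs_regeneration(sentinel, [data])   # sentinel is newer
        touch(data, 3000)                               # json edited afterwards
        third = needs_regeneration(sentinel, [data])
        return first, second, third


if __name__ == "__main__":
    print(demo())
```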

But this was still not sufficient. As mentioned above, the wildcard function would still be evaluated before running the generator, resulting in an empty file list and an early exit. It would be necessary to invoke make twice: the first time to generate the intermediate files, the second time to process them further. And compared to ./generate-mds-from-json.py && make, having to do make && make was not that huge a win.

Luckily, this problem was solved in the same thread by Paul Roub, who suggested running make recursively from the recipe. This inner make would have wildcards expanded after all files are generated and process the files correctly.
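The ordering issue is easier to see outside of make. In the Python sketch below, list_markdown plays the role of $(wildcard $(OBJ)/*.md) and fake_generator is a hypothetical stand-in for generate-mds-from-json.py; the first listing corresponds to the outer make reading the Makefile, the second to the inner make started from the recipe.

```python
import glob
import os
import tempfile


def list_markdown(obj_dir):
    """Rough stand-in for the wildcard call: a snapshot of what exists right now."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(obj_dir, "*.md")))


def fake_generator(obj_dir):
    """Hypothetical stand-in for generate-mds-from-json.py."""
    for name in ("a.md", "b.md"):
        with open(os.path.join(obj_dir, name), "w") as f:
            f.write("# " + name + "\n")


def demo():
    with tempfile.TemporaryDirectory() as obj_dir:
        outer = list_markdown(obj_dir)   # outer make: nothing generated yet
        fake_generator(obj_dir)          # the recipe runs the generator
        inner = list_markdown(obj_dir)   # inner make: the files are there now
        return outer, inner


if __name__ == "__main__":
    print(demo())
```

The snapshot taken before the generator ran never updates on its own, which is why a second evaluation is needed.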

So the final solution looked something like:

MDS_SENTINEL = .mds_sentinel

ALL_MDS_IN_OBJ = $(wildcard $(OBJ)/*.md)
ALL_HTML_IN_BIN = $(addprefix $(BIN)/, $(notdir $(ALL_MDS_IN_OBJ:.md=.html)))

default: \
    $(MDS_SENTINEL) \
    $(ALL_HTML_IN_BIN) \
    ... other deps ...

$(MDS_SENTINEL): \
    $(SRC)/source_data.json \
    generate-mds-from-json.py

    ./generate-mds-from-json.py \
        && touch $(MDS_SENTINEL) \
        && $(MAKE) mds

mds: $(ALL_HTML_IN_BIN)

clean:
    rm -r $(MDS_SENTINEL) $(OBJ)/* $(BIN)/*

This post was originally published on blog.tznvy.eu

Hands-on Powershell: Pruning an exported VCF contact list

brthanmathwoag — Fri, 09 Jun 2017 00:00:00 +0000

Recently, when hiking in the mountains, I got caught in a raincloud and got completely soaked. When I eventually arrived at the alpine hut, I found out the power switch in my phone no longer worked. It would power on just fine, but once the screen was turned off, it was impossible to turn on again. Luckily, this was just enough to back up all data while I was looking for a new phone.

When the new phone arrived in the mail and I was about to import the contact list, I remembered I'd always wanted to clean it up a bit. See, back when I set up my previous phone, I logged in to my Google account, which caused the whole Gmail address book to download. Thanks, Google, I guess, but I don't even do email on my phone. Some emails were appended to existing contacts, but quite a few did not match, resulting in duplicated entries for some people, and junk entries for individuals and companies that I messaged only once 10 years ago or so.

So I thought, this would be a good opportunity to get rid of them.

I looked at the exported contact list and it luckily turned out to be just a flat textfile with concatenated vCards. This is what it looked like:

BEGIN:VCARD
VERSION:2.1
N:;Lastname;Firstname;;
FN:Nickname
TEL;CELL;PREF:123456789
END:VCARD
BEGIN:VCARD
VERSION:2.1
N:Surname;Firstname;;;
FN:Firstname Surname
EMAIL;PREF:email@test.com
PHOTO;ENCODING=BASE64;JPEG:/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAUDBAoJCAsJCQk
 ...
 LS5apuBYSw02xdyPEsRPJxMByx33iEoQZdohkFlugEIHljtYiFfNHtnp/9k=

END:VCARD
BEGIN:VCARD
VERSION:2.1
N:;Nickname1;;;
FN:Nickname1
TEL;CELL;PREF:456789123
END:VCARD
BEGIN:VCARD
VERSION:2.1
N:;Nickname2;;;
FN:Nickname2
TEL;CELL;PREF:789123456
END:VCARD

etc...

So what I needed to do was to split the text into separate vCard records, then filter out those which don't contain a line beginning with TEL;.

In Powershell, this could be achieved with:

$in = 'unfiltered.vcf'
$out = 'filtered.vcf'

Get-Content $in `
    | Out-String `
    | Select-String -pattern '(?s)BEGIN:VCARD.*?END:VCARD' -AllMatches `
    | % { $_.Matches} `
    | % { $_.Value } `
    | ? { $_.Contains("`nTEL;") } `
    | Out-File -Encoding ascii $out

(Pardon the broken highlighting. There is currently no Powershell highlighter in pandoc; the upcoming release will have loadable language definitions, though, so I will be able to use m0t0k1's kate-powershell then.)

Now, there are some interesting parts here. Because Select-String cannot match across multiple pipeline elements, we have to pipe the Get-Content output through Out-String so that the whole file is processed as one string rather than line by line. (?s) enables single-line mode, in which . also matches newlines, just like the /s modifier in Perl regexes. Also, because we are processing one huge string, -AllMatches has to be set so that Powershell doesn't stop after finding the first match, similar to /g in Perl. Finally, we need to set the encoding on Out-File explicitly. Otherwise, the file would be saved as UTF-16LE with a BOM, which, at least on my phone, caused a generic Could not import contacts error.
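For comparison, the same regex carries over to Python almost unchanged. This is just a sketch of the idea, with a shortened sample made up from the records above: the inline (?s) flag lets . match newlines, the non-greedy .*? stops at the nearest END:VCARD, and re.findall returns every match, much like -AllMatches.

```python
import re

# (?s): "." also matches newlines; ".*?": stop at the nearest END:VCARD.
VCARD_PATTERN = re.compile(r"(?s)BEGIN:VCARD.*?END:VCARD")


def cards_with_phone(text):
    """Return the vCard records that contain a line starting with 'TEL;'."""
    cards = VCARD_PATTERN.findall(text)
    return [card for card in cards if "\nTEL;" in card]


# Shortened sample, made up for the demo.
SAMPLE = "\n".join([
    "BEGIN:VCARD", "VERSION:2.1", "N:;Nickname1;;;", "FN:Nickname1",
    "TEL;CELL;PREF:456789123", "END:VCARD",
    "BEGIN:VCARD", "VERSION:2.1", "N:Surname;Firstname;;;",
    "FN:Firstname Surname", "EMAIL;PREF:email@test.com", "END:VCARD",
])

if __name__ == "__main__":
    for card in cards_with_phone(SAMPLE):
        print(card)
```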

Now, the above code ran in under a second on my machine for a 200-some KB file, but if your contact list contains many base64-encoded photos, slurping the whole file at once and applying a regex to it might not be the best idea. Instead, we could read the file line by line and buffer the lines up manually:

$in = 'unfiltered.vcf'
$out = 'filtered.vcf'

$curr = @()
$has_tel = $false
Get-Content $in | % {
    $curr += $_
    if ($_.StartsWith("TEL;")) {
        $has_tel = $true
    }
    if ($_ -eq 'END:VCARD') {
        if ($has_tel) {
            $curr
        }
        $curr = @()
        $has_tel = $false
    }
} | Out-File -Encoding ascii $out

Less clever, longer, but probably scaling better.
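The buffering approach is not specific to Powershell. Here is a minimal Python sketch of the same loop, written as a generator so that only one record is held in memory at a time; the function name is my own, and it assumes every record ends with a line that is exactly END:VCARD.

```python
def filter_cards(lines):
    """Yield the lines of every vCard that has at least one 'TEL;' line."""
    current = []
    has_tel = False
    for line in lines:
        line = line.rstrip("\r\n")
        current.append(line)
        if line.startswith("TEL;"):
            has_tel = True
        if line == "END:VCARD":
            # The record is complete: emit it only if it had a phone number.
            if has_tel:
                yield from current
            current = []
            has_tel = False
```

It can be fed an open file object directly, for example filter_cards(open('unfiltered.vcf')), with the resulting lines written to filtered.vcf.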

This post was originally published on blog.tznvy.eu