Brian Via

Posted on Oct 13, 2022

Organizing EBook Files with Python 🐍

#python #linux #ebooks

The full Python file is available on GitHub: https://github.com/BrianVia/booksort/blob/main/book-sort.py

The Problem

Over the years, I’ve accumulated many hundreds of ebooks. Some I bought as digital copies from places like Gumroad; others were free online, like Software Engineering at Google.

However, as I got busier, keeping a clean file structure that let me find what I was looking for became harder and harder. The result was a hodgepodge of files without naming conventions and directories upon directories labeled “unorganized”. I tried to sift through hundreds of files by hand, renaming each one to the format I wanted and placing it in a single directory per book. It just became too much.

What I Wanted the End to Look Like

Books Directory/
  title-author/
    title-author.epub
  title2-author2/
    title2-author2.pdf

Since I wanted to self-host a Calibre instance with all of my files, ingesting them via one directory per book seemed like the best system. It was also necessary for reading the files locally from my desktops via my NAS.

Why Python? 🐍

First and foremost, I decided to go with Python as the language of choice for this project, for a couple of reasons.

While the project does involve renaming files, which could easily be done in Bash, the logic was going to get a little complicated, with data passed back and forth between functions, and Bash would get tough to read IMO.

Python also has a rich ecosystem of ebook parsing libraries, and it easily handles things like renaming files, reading file extensions, and reading environment variables on Linux machines, which is what my NAS box runs. My first language is TypeScript/JavaScript, so I could’ve used something like Bash + Google’s zx, but this felt like a good chance to get some experience with Python, which I’d never really used. Luckily, VS Code’s IntelliSense (with some Python plugins) and Python’s relatively simple syntax made it quite easy to get from A → B and put all the pieces together.

The Individual Pieces of The Book Sorting Program

This is how I broke down the individual parts of this sorting program:

Gather all files from my unorganized directory.
Parse metadata from any ebooks in my library:
1. EPUB files
2. PDF files
Organize books into their new location (<title-author>/<title-author>.<ext>).
(Optional) Read the library paths for input, output, and issue files.

Gathering All Unorganized Files

Grabbing all my files and putting them into one flat directory wasn’t too bad. The script finds that directory through my BOOKSORT_INPUT_PATH variable. Currently it’s read from the environment, but it could be refactored to take CLI args or to use hard-coded defaults.

import os

# Returns all .pdf and .epub files under a directory (searched recursively)
def getAllFiles(path: str):
    files = []
    for r, d, f in os.walk(path): # r - root, d - dirs, f - files
        for file in f:
            if file.endswith((".pdf", ".epub")):
                files.append(os.path.join(r, file))
    print(files)
    return files

This chunk is hopefully straightforward. Given a directory, walk it (including any subdirectories), and for each file path that ends with .pdf or .epub, add it to a list. This gives us the files to look up metadata for, and then eventually sort.

The list will look like this:

['/full/path/to/book/book.epub','/full/path/to/book2/book2.pdf',...]

The function returns this list, so we can iterate over the book files to sort.
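
For comparison, here is a minimal sketch of the same gathering step written with pathlib. The name find_ebooks and the case-insensitive extension check are assumptions for illustration, not part of the original script, which only matches lowercase .pdf and .epub.

```python
from pathlib import Path

# Hypothetical helper, not from the original script.
EBOOK_EXTENSIONS = {".pdf", ".epub"}

def find_ebooks(root):
    """Return the sorted paths (as strings) of every ebook file under root, searched recursively."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in EBOOK_EXTENSIONS
    )
```

Lowercasing the suffix means a file named book.PDF is picked up too, which the endswith check above would skip.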

Parsing Metadata

Parsing the metadata was relatively straightforward: find EPUB and PDF parsing libraries, wire them in, and grab the correct fields.

For epub files we’re using epub_meta. Make sure you install with pip install epub_meta or pip3 install epub_meta

For pdf files we’re using pdfx. This also needs an install with pip install pdfx or pip3 install pdfx

For all the files in our array, we’re going to pass them to their respective parsing functions like so:

for file in files:
    TitleAndAuthorString = ""
    if file.endswith(".epub"):
        TitleAndAuthorString = getEpubTitleAndAuthorPath(file)
    elif file.endswith(".pdf"):
        TitleAndAuthorString = getPdfTitleAndAuthorPath(file)
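
If more formats get added later, the if chain can be swapped for a lookup table keyed by extension. This is only a sketch: pick_parser, PARSERS, and the two fake parsers are hypothetical names that stand in for the real parsing functions so the example runs on its own.

```python
import os

def pick_parser(filepath, parsers):
    """Return the parser registered for filepath's extension, or None if there isn't one."""
    extension = os.path.splitext(filepath)[1].lower()
    return parsers.get(extension)

# Stand-in parsers so the sketch is runnable; the real script would
# register getEpubTitleAndAuthorPath and getPdfTitleAndAuthorPath here.
def fake_epub_parser(filepath):
    return "Epub Title - Epub Author"

def fake_pdf_parser(filepath):
    return "Pdf Title - Pdf Author"

PARSERS = {".epub": fake_epub_parser, ".pdf": fake_pdf_parser}

for file in ["/books/a.epub", "/books/b.PDF", "/books/c.txt"]:
    parser = pick_parser(file, PARSERS)
    result = parser(file) if parser else None
    print(file, "->", result)
```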

EPUB Files

import epub_meta

# Returns the title and author of an epub file in the format "Title - Author"
def getEpubTitleAndAuthorPath(filepath: str):
    try:
        print("INFO: Getting metadata for: " + filepath)
        data = epub_meta.get_epub_metadata(filepath)
        title = data['title'] or "Unknown"
        authors = ", ".join(data['authors']) or "Unknown"
        print("INFO: Got metadata for " + filepath + ": " + title + " - " + authors)
        return title + " - " + authors
    except epub_meta.EPubException as e:
        # The file could not be parsed as an EPUB
        print(e)
        return None

epub_meta lets us grab the metadata with this line:

data = epub_meta.get_epub_metadata(filepath)

and then specific fields like this:

title = data['title'] or "Unknown"
authors = ", ".join(data['authors']) or "Unknown"

For the authors, we join with a comma in case there is more than one author.

Both of these fall back to "Unknown" when the field comes back empty. If the file can’t be parsed at all, the except branch returns None instead.
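
To see the fallback in isolation, here is a small sketch that uses plain dictionaries in place of the epub_meta result. That a missing title shows up as None and missing authors as an empty list is an assumption made for this example; title_and_authors is a hypothetical helper name.

```python
def title_and_authors(data):
    # `or` falls through to "Unknown" when the left side is empty or None
    title = data["title"] or "Unknown"
    authors = ", ".join(data["authors"]) or "Unknown"
    return title + " - " + authors

print(title_and_authors({"title": "Dune", "authors": ["Frank Herbert"]}))
# Dune - Frank Herbert
print(title_and_authors({"title": "Good Omens", "authors": ["Terry Pratchett", "Neil Gaiman"]}))
# Good Omens - Terry Pratchett, Neil Gaiman
print(title_and_authors({"title": None, "authors": []}))
# Unknown - Unknown
```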

PDF Files

import os
import pdfx

# Returns the title and author of a PDF file in the format "Title - Author"
def getPdfTitleAndAuthorPath(filepath: str):
    issuesPath = os.environ["BOOKSORT_ISSUES_PATH"]
    try:
        print("INFO: Getting metadata for: " + filepath)
        pdf = pdfx.PDFx(filepath)
        metadata = pdf.get_metadata()
        title = metadata.get("Title") or "Unknown"
        authors = metadata.get("Author") or "Unknown"
        print("INFO: Got metadata for " + filepath + ": " + title + " - " + authors)
        return title + " - " + authors
    # All three failure modes are handled the same way:
    # log the error, move the file to the issues folder, and return None
    except (pdfx.exceptions.PDFInvalidError,
            pdfx.exceptions.PDFExtractionError,
            pdfx.exceptions.FileNotFoundError) as e:
        print(e)
        print("ERROR: Moving " + getFileName(filepath) + " to issues folder")
        os.rename(filepath, issuesPath + "/" + getFileName(filepath))
        return None

PDFX lets us read metadata in a similar fashion.

After creating the PDF object and parsing its metadata with these two lines:

pdf = pdfx.PDFx(filepath)
metadata = pdf.get_metadata()

we can read from the metadata with the .get(<fieldName>) method:

title = metadata.get("Title") or "Unknown"
authors = metadata.get("Author") or "Unknown"
# The Author field is already comma-delimited by PDFX, so no need to join here.
...
return title + " - " + authors

We’ll also create a function to return the file extension for proper renaming later.

# Returns the file extension of a file
def getFileExtension(file):
    return os.path.splitext(file)[1]
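
The snippets above also call getFileName, which isn’t shown in this post. A reasonable guess is that it wraps os.path.basename; treat that implementation as an assumption rather than the script’s actual code. The sketch also shows what os.path.splitext returns in a few edge cases.

```python
import os

# Assumed implementation: getFileName is used above but not shown in this post.
def getFileName(file):
    return os.path.basename(file)

# Returns the file extension of a file, including the leading dot
def getFileExtension(file):
    return os.path.splitext(file)[1]

print(getFileName("/full/path/to/book/book.epub"))       # book.epub
print(getFileExtension("/full/path/to/book/book.epub"))  # .epub
print(getFileExtension("/full/path/to/archive.tar.gz"))  # .gz (only the last suffix)
print(repr(getFileExtension("/full/path/to/README")))    # '' (no extension)
```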

Organizing the Files to their Final Locations

Lastly, we do some os.makedirs and os.rename magic to move things around, creating the needed directories if they don’t already exist.

# (this runs inside the `for file in files:` loop)
extension = getFileExtension(file) # grab this so we can rename easily.

# My desired file output path is <BooksDir>/<Title> - <Author>/<Title> - <Author>.{pdf,epub,etc}
if TitleAndAuthorString and "Unknown" not in TitleAndAuthorString:
    if not os.path.exists(outputPath + "/" + TitleAndAuthorString):
        os.makedirs(outputPath + "/" + TitleAndAuthorString)
    print("SUCCESS: Moving " + TitleAndAuthorString)
    os.rename(file, outputPath + "/" + TitleAndAuthorString + "/" + TitleAndAuthorString + extension)
else:
    # There was an issue parsing the file, so move it to the `issues` folder to be looked at manually later
    print("WARN: Moving " + getFileName(file) + " to issues folder")
    os.rename(file, issuesPath + "/" + getFileName(file))
    continue

os.makedirs(...) creates the directory, along with any missing parent directories.

os.rename(...) takes the path of the existing file, and then the final (absolute) path for the file. In this case that’s outputPath + "/" + TitleAndAuthorString + "/" + TitleAndAuthorString + extension.
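
The same move can be sketched with os.path.join instead of string concatenation, and with exist_ok=True so the separate os.path.exists check isn’t needed. The function name move_into_library is hypothetical, and the demo runs in a temporary directory so it doesn’t touch any real books.

```python
import os
import tempfile

def move_into_library(file, output_path, title_and_author):
    """Move file to <output_path>/<name>/<name><ext> and return the new path."""
    extension = os.path.splitext(file)[1]
    book_dir = os.path.join(output_path, title_and_author)
    os.makedirs(book_dir, exist_ok=True)  # no error if the directory already exists
    destination = os.path.join(book_dir, title_and_author + extension)
    os.rename(file, destination)
    return destination

# Try it out in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "random-name.epub")
    open(source, "w").close()
    new_path = move_into_library(source, os.path.join(tmp, "outputs"), "Dune - Frank Herbert")
    print(os.path.relpath(new_path, tmp))
```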

Putting it All Together

def main():
    # os.environ.get returns None (instead of raising KeyError) when a variable is unset,
    # so the hard-coded defaults below actually get used.
    inputPath = os.environ.get("BOOKSORT_INPUT_PATH") or "/Users/bvia/Development/Personal/booksort/issues"
    outputPath = os.environ.get("BOOKSORT_OUTPUT_PATH") or "/Users/bvia/Development/Personal/booksort/outputs"
    issuesPath = os.environ.get("BOOKSORT_ISSUES_PATH") or "/Users/bvia/Development/Personal/booksort/issues"
    sort_books(inputPath, outputPath, issuesPath)

Give it a whirl with a call to main() at the end of the file and you’re off.
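
If you’d rather pass the paths on the command line, a rough sketch with argparse could look like this. The flag names --input, --output, and --issues and the parse_paths helper are made up for this example; the environment variables remain the fallback.

```python
import argparse
import os

def parse_paths(argv=None, env=None):
    """Resolve the three paths from CLI flags, falling back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Sort ebooks into one directory per book")
    parser.add_argument("--input", default=env.get("BOOKSORT_INPUT_PATH"))
    parser.add_argument("--output", default=env.get("BOOKSORT_OUTPUT_PATH"))
    parser.add_argument("--issues", default=env.get("BOOKSORT_ISSUES_PATH"))
    args = parser.parse_args(argv)
    # Exit with a usage message if a path is set neither as a flag nor in the environment
    missing = [name for name in ("input", "output", "issues") if not getattr(args, name)]
    if missing:
        parser.error("missing path(s): " + ", ".join(missing))
    return args.input, args.output, args.issues

# Explicit argv and env here so the example does not depend on your shell
print(parse_paths(
    ["--input", "/books/in"],
    env={"BOOKSORT_OUTPUT_PATH": "/books/out", "BOOKSORT_ISSUES_PATH": "/books/issues"},
))
# ('/books/in', '/books/out', '/books/issues')
```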

This script isn’t perfect. Sometimes the rename fails to write to the specified path for reasons I can’t figure out. But it has saved me many hours of manual organization, and isn’t that one of our favorite parts of programming, after all?
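
Two common causes of os.rename failures are worth ruling out, though neither is confirmed as the culprit here: os.rename can fail when the source and destination are on different filesystems (easy to hit with a NAS), and a title containing a "/" turns into an unexpected subdirectory in the destination path. The sketch below, with the hypothetical helpers safe_name and safe_move, handles both by sanitizing the name and using shutil.move, which falls back to copy-then-delete across filesystems.

```python
import os
import re
import shutil

def safe_name(name):
    """Replace path separators and other characters that are awkward in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name)
    return cleaned.strip() or "Unknown"

def safe_move(source, destination_dir, new_name):
    """Move source into destination_dir under a sanitized name; works across filesystems."""
    os.makedirs(destination_dir, exist_ok=True)
    destination = os.path.join(destination_dir, safe_name(new_name))
    return shutil.move(source, destination)

print(safe_name("TCP/IP Illustrated - W. Richard Stevens"))
# TCP_IP Illustrated - W. Richard Stevens
```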

Thanks!

Thanks for reading. Hope this maybe made Python more accessible if you haven’t used it before, or you learned a new use case for it. Once again the full script file is here: https://github.com/BrianVia/booksort/blob/main/book-sort.py

I'm Brian. A fullstack software engineer at Clearcover.

If you want to check out my self hosted blog for more it’s here.

You can follow me on Twitter and GitHub as well!