loading...
Feldroy

Adding Metadata to PDFs

danielfeldroy profile image Daniel Feldroy Updated on ・3 min read

Using Code to Write Books (2 Part Series)

1) Adding Metadata to PDFs 2) Autodocumenting Makefiles

For both Django Crash Course and the forthcoming Two Scoops of Django 3.x, we're using a new process to render the PDFs. Unfortunately, until just a few days ago that process didn't include the cover. Instead, covers were inserted manually using Adobe Acrobat.

While that manual process worked, it came with predictable consequences.

Merging the PDFs

This part was easy and found in any number of blog articles and Stack Overflow answers.

  • Step 1: Install pypdf2
  • Step 2: Write a script as seen below
from PyPDF2 import PdfFileMerger

now = datetime.now()

pdfs = [
  'images/Django_Crash_Course_5.5x8in.pdf',
  '_output/dcc.pdf',
]

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("releases/beta-20200226.pdf")
merger.close()    

It was at this point that we discovered that our new file, releases/beta-20200226.pdf, was missing most of the metadata. Oh no!

Adding the Metadata

According to the PyPDF2 docs, adding metadata is very straight-forward. Just pass a dict into the addMetadata() function. I inserted this code right before the call to merger.write():

merger.addMetadata({
    "Title": "Django Crash Course",  
    "Authors": 'Daniel Roy Greenfeld, Audrey Roy Greenfeld',
    "Description": "Covers Python 3.8 and Django 3.x",
    "ContentCreator": "Two Scoops Press",
    "CreateDate": "2020-02-26",
    "ModifyDate": "2020-02-26",
})

The PDF built! Yeah! Time to open it up and see the results!

Alas, no metadata showed up.

Then I spent a long time with trial-and-error trying to get the metadata to show up properly. While there are lots of Python-related articles on extracting metadata using PyPDF2, I struggled to find anything that explained how to add metadata.

Doing My Homework

After a bunch of research (googling, stack overlow-ing, and visiting forums) I found a wonderful book on O'Reilly called PDF Explained by John Whitington. Much credit to John Whitington, he's a good writer and very knowledgable on the topic of PDF.

For my purposes, the two critical sections were found in Chapter 4 of PDF Explained:

Based off what I read, I established the following rules:

  • Every metadata field name had to be prefixed with /
  • Stick to the metadata names found in chapter 4
  • Follow the date format supplied in chapter 4

Writing the Code!

Now armed with my rules I returned to the code. This is what I came up with:

from datetime import datetime
from PyPDF2 import PdfFileMerger

pdfs = [
  'images/Django_Crash_Course_5.5x8in.pdf',
  '_output/dcc.pdf',
]

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

# Make PDF datestamp
now = datetime.now()
pdf_datestamp = now.strftime("D:%Y%m%d%H%M%S-8'00'")

# https://www.oreilly.com/library/view/pdf-explained/9781449321581/ch04.html#didentries
# Fields are **precisely** named
merger.addMetadata({
    "/Author": 'Daniel Roy Greenfeld, Audrey Roy Greenfeld',
    "/Title": "Django Crash Course",
    "/Subject": "Covers Python 3.8 and Django 3.x",
    "/Creator": "Two Scoops Press",
    "/CreationDate": pdf_datestamp,
    "/ModDate": pdf_datestamp,
})

# Write the release
version = f"beta-{now.strftime('%Y%m%d')}"
merger.write(f"releases/{version}.pdf")
merger.close()

Conclusion

The lesson I learned writing this little utility is that as useful as Google and Stack Overflow might be, sometimes you need to explore reference manuals. Which, if you ask me, is a lot of fun. :-)

Speaking of reference manuals, while I referenced the online version of PDF Explained to get my work done, I've ordered a kindle version of the book. It's the least I can do.

Using Code to Write Books (2 Part Series)

1) Adding Metadata to PDFs 2) Autodocumenting Makefiles

Posted on Feb 28 by:

danielfeldroy profile

Daniel Feldroy

@danielfeldroy

Engineer and writer. Co-author of Django Crash Course & Two Scoops of Django, Husband of @audreyfeldroy 💘, father of Uma 🍼

Feldroy

The little creative company behind Two Scoops Press, Impossible Hero Books, Fuzzy Rainbow, and an upcoming SaaS product.

Discussion

markdown guide
 

Great exploration of the process you went thru.
Thanks