Reading and Writing PDFs in Python

#python

This is a quick write-up on how to use PyPDF2 to read and write PDF files. Besides pages, the library can also read and write some other parts of a PDF file, such as bookmarks.

The basic process

If you just want to combine pages from a bunch of PDF files into one bigger PDF file, the process goes roughly like this:

from PyPDF2 import PdfFileWriter, PdfFileReader

# ...

# open() replaces the Python 2-only file() builtin. The input files
# stay open until write() has finished.
with open(filename1, "rb") as stream1, open(filename2, "rb") as stream2:
    file1 = PdfFileReader(stream1)
    file2 = PdfFileReader(stream2)

    output = PdfFileWriter()

    page1 = file1.getPage(specificPageIndex)
    page2 = file2.getPage(specificPageIndex)

    output.addPage(page1)
    output.addPage(page2)

    with open("document-output.pdf", "wb") as outputStream:
        output.write(outputStream)

Here I will explain each line of the process above.

First, we import PdfFileWriter and PdfFileReader from PyPDF2. These are the classes that write and read PDF files, respectively. There is also another class, PdfFileMerger, that is in charge of combining entire files into one PDF file. The main difference between the two is that PdfFileMerger works with whole documents and appends an entire PDF at a time, while PdfFileWriter lets you pick specific pages from a PDF and write just those to the output file.
Then we create a PdfFileReader object for each input file, passing it a file opened in read-binary mode.
We also construct a PdfFileWriter. It doesn't need any arguments passed to it.
On a PdfFileReader object, getPage() returns the page at the given zero-based index. That index is not always the number printed at the bottom of the page: some PDFs have unlabeled pages or pages labeled with roman numerals, so the page labeled 1 may come several pages after the first page.
The addPage() method of a PdfFileWriter is called when we want to add a page. It takes a PDF page object as its argument.
Finally we call write() on the PdfFileWriter, supplying an open file handle (opened in write-binary mode) to write the PDF file into.
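
The same steps extend from two single pages to whole documents. Below is a minimal sketch, assuming the reader and writer expose the legacy PyPDF2 methods used in this post (getNumPages(), getPage(), addPage()). The helper name append_all_pages is made up for illustration and is not part of PyPDF2.

```python
def append_all_pages(reader, writer):
    """Copy every page of `reader` into `writer`, preserving order.

    `reader` is expected to behave like a PdfFileReader and `writer`
    like a PdfFileWriter. Returns the number of pages copied.
    """
    count = reader.getNumPages()
    for index in range(count):  # page indices are zero-based
        writer.addPage(reader.getPage(index))
    return count
```

Calling append_all_pages(file1, output) and then append_all_pages(file2, output) before output.write(outputStream) should concatenate the two documents. As far as I can tell, the reader loads page data lazily, so the input files should stay open until write() has finished.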

Additional fields

A PDF file contains more than just pages. Some PDFs, books for example, carry page labels that differ from a page's position in the file. In books, some pages are not supposed to be labeled with a number at all. Other pages, like the introduction section, are labeled with roman numerals instead of ordinary numbers. These page labels are stored in the PDF and can be read through PdfFileReader.
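
To make the gap between a page's index and its printed label concrete, here is a small sketch. It does not read labels from a PDF. It only assumes a hypothetical book layout where the first few pages are numbered i, ii, iii, ... and the rest are numbered from 1. The names to_roman, label_for_index and front_matter_pages are my own.

```python
def to_roman(number):
    """Return `number` (1..3999) as a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def label_for_index(index, front_matter_pages):
    """Map a zero-based page index to its printed label.

    Assumes the first `front_matter_pages` pages are labeled i, ii, ...
    and every later page is labeled 1, 2, ...
    """
    if index < front_matter_pages:
        return to_roman(index + 1)
    return str(index - front_matter_pages + 1)
```

With four front matter pages, label_for_index(3, 4) gives "iv" and label_for_index(4, 4) gives "1", so the page labeled 1 is the one you would fetch with getPage(4).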

Bookmarks (the document outline, essentially the table of contents) can also be read from PDFs using PdfFileReader.

p = PdfFileReader(open(filename, "rb"))

# Gets the document outline (bookmarks). Returns a nested
# list of Destination objects.
dest = p.getOutlines()

# ... Sample Destination object
>>> dest_obj.title
"Appendix"
>>> dest_obj.page
692

# Number of pages in the PDF file
num = p.getNumPages()

# The following demonstrates how to get the document information
# of the PDF file
>>> from PyPDF2 import PdfFileReader
>>> inputPdf = PdfFileReader(open("test.pdf", "rb"))
>>> docInfo = inputPdf.getDocumentInfo()
>>> docInfo.author
Anonymous
>>> docInfo.creator
Hewlett Packard MFP
>>> docInfo.producer
Acrobat Distiller 10.0.0 (Windows)
>>> docInfo.title
A Test
>>> docInfo.subject
testing
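
Since the outline comes back as a nested list, a small recursive walk is handy for printing it as an indented table of contents. This is a sketch that assumes a nested list holds the children of the entry just before it. The name flatten_outline is hypothetical, and the sample data uses SimpleNamespace stand-ins rather than real Destination objects.

```python
from types import SimpleNamespace


def flatten_outline(outline, depth=0):
    """Yield (depth, title) pairs from a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the previous entry.
            yield from flatten_outline(item, depth + 1)
        else:
            yield depth, item.title


# Stand-in data shaped like a getOutlines() result; real entries would
# be Destination objects rather than SimpleNamespace instances.
sample_outline = [
    SimpleNamespace(title="Chapter 1"),
    [SimpleNamespace(title="Section 1.1"), SimpleNamespace(title="Section 1.2")],
    SimpleNamespace(title="Appendix"),
]

for depth, title in flatten_outline(sample_outline):
    print("  " * depth + title)
```

With a real file you would pass p.getOutlines() in place of sample_outline.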

Sometimes the PDF is encrypted, which will cause most of the functions above that read any property or metadata in it to fail. You can decrypt the PDF by calling the decrypt("password") method on the PdfFileReader.
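
A guard like the following can be run before touching pages or metadata. It is a minimal sketch, assuming the legacy PyPDF2 reader interface: an isEncrypted property and a decrypt() method that returns 0 when the password is wrong. The helper name ensure_decrypted and the empty default password are my own choices.

```python
def ensure_decrypted(reader, password=""):
    """Return True if `reader` can be read, decrypting it when needed."""
    if not reader.isEncrypted:
        return True
    # decrypt() returns 0 when the password matches neither the user
    # nor the owner password, and a non-zero value on success.
    return reader.decrypt(password) != 0
```

If ensure_decrypted(p, "password") returns False, the password was rejected and calls such as getNumPages() or getDocumentInfo() will most likely still fail.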

And we're done

That was easy, wasn't it? Now you can manipulate PDF files by adding and removing pages you want to change, and reading the metadata of PDF files.

If you see any errors here, please let me know so I can fix them.
