This is going to be a quick write-up on how to use PyPDF2 to read and write PDF files. It not only has the capability to read pages, it can also read and write some other parts of PDF files such as bookmarks.
If you just want to combine pages from a bunch of PDF files to make one bigger PDF file, then the way to do it goes somewhat like this:
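    from PyPDF2 import PdfFileWriter, PdfFileReader

    # ...

    # Open both inputs in read-binary mode. The built-in open() is used
    # here because file() only exists in Python 2.
    file1 = PdfFileReader(open(filename1, "rb"))
    file2 = PdfFileReader(open(filename2, "rb"))

    output = PdfFileWriter()

    # Pick one page from each input by its zero-based index.
    page1 = file1.getPage(specificPageIndex)
    page2 = file2.getPage(specificPageIndex)

    output.addPage(page1)
    output.addPage(page2)

    # The input files must stay open until the output has been written.
    outputStream = open("document-output.pdf", "wb")
    output.write(outputStream)
    outputStream.close()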
Here I will explain each line of the process above.
- First, we import PdfFileWriter and PdfFileReader from PyPDF2; these are the classes that write and read PDF files, respectively. There is also another class, PdfFileMerger, that is in charge of combining entire files into one PDF file. The main difference between PdfFileWriter and PdfFileMerger is that PdfFileMerger works with whole documents, appending each input file to the output, while PdfFileWriter lets you add specific pages from a PDF and flush them to the file.
- Then we create a PdfFileReader object out of a file opened in read-binary mode.
- We also construct a PdfFileWriter, which doesn't need any arguments passed to it.
- In PdfFileReader objects we can use getPage() to access the PDF page at the specified page index, counting from 0. This index is not always the label printed at the bottom of the page: some PDFs have unlabeled pages or pages labeled with roman numerals, so the printed labels only start at 1 several pages after the first page.
- The addPage() method of a PdfFileWriter is called when we want to add a page; it takes a PDF page object as its argument.
- Finally, we call write() on the PdfFileWriter, supplying an open file handle to write the PDF file into.
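For the whole-file case that PdfFileMerger covers, a minimal sketch could look like the following. It assumes the same PyPDF2 release line as the rest of this write-up, where the class is still called PdfFileMerger. The function name merge_pdfs and its optional merger parameter are hypothetical, added only so the helper can be tried out with a stand-in object.

```python
def merge_pdfs(input_paths, output_path, merger=None):
    """Append every file in input_paths, in order, and write one PDF."""
    if merger is None:
        # Imported lazily so the helper can be exercised without PyPDF2
        # by passing a stand-in merger (hypothetical convenience).
        from PyPDF2 import PdfFileMerger
        merger = PdfFileMerger()
    try:
        for path in input_paths:
            # append() adds all pages of the given file to the end
            # of the output document.
            merger.append(path)
        merger.write(output_path)
    finally:
        # close() releases the input files the merger opened.
        merger.close()
    return output_path
```

Calling merge_pdfs(["a.pdf", "b.pdf"], "combined.pdf") would then produce one file with the pages of a.pdf followed by those of b.pdf (the file names here are placeholders).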
A PDF file contains more than just pages. Some PDFs, books for example, have a page label on most pages but none on others, because some pages are not supposed to be labeled with a number. Other pages, like the introduction section, are labeled with roman numerals instead of ordinary numbers. These page labels can be read from PdfFileReader.
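To make the gap between a page index and a printed label concrete, here is a small sketch in plain Python. It does not read anything from a PDF; it assumes a hypothetical book whose first few pages are front matter labeled i, ii, iii, ... and whose body restarts at 1. Both function names are made up for this illustration.

```python
def to_roman(number):
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def printed_label(page_index, front_matter_pages):
    """Label printed on the page at a zero-based index (assumed layout)."""
    if page_index < front_matter_pages:
        # Front matter is numbered i, ii, iii, ...
        return to_roman(page_index + 1)
    # The body restarts at 1 right after the front matter.
    return str(page_index - front_matter_pages + 1)
```

With 12 front-matter pages, the page you would fetch with getPage(12) is the one labeled "1" in this layout, not the thirteenth label.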
Bookmarks (the document outline, essentially the table of contents) can also be read from PDFs using PdfFileReader.
    p = PdfFileReader(open(filename, "rb"))

    # Gets the document outline (bookmarks). Returns a nested
    # list of Destination objects.
    dest = p.getOutlines()

    # ... Sample Destination object
    >>> dest_obj.title
    "Appendix"
    >>> dest_obj.page
    692

    # Number of pages in the PDF file
    num = p.getNumPages()

    # The following demonstrates how to get the document information
    # of the PDF file
    >>> from PyPDF2 import PdfFileReader
    >>> inputPdf = PdfFileReader(open("test.pdf", "rb"))
    >>> docInfo = inputPdf.getDocumentInfo()
    >>> docInfo.author
    Anonymous
    >>> docInfo.creator
    Hewlett Packard MFP
    >>> docInfo.producer
    Acrobat Distiller 10.0.0 (Windows)
    >>> docInfo.title
    A Test
    >>> docInfo.subject
    testing
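Because getOutlines() returns a nested list, printing a table of contents takes a small recursive walk. The sketch below assumes that a nested list holds the children of the entry before it and that every entry exposes a title attribute, as in the sample Destination object above. The name walk_outline is made up for this example.

```python
def walk_outline(outline, depth=0):
    """Yield (depth, title) pairs from a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the previous entry,
            # so descend one level.
            yield from walk_outline(item, depth + 1)
        else:
            yield depth, item.title
```

Something like `for depth, title in walk_outline(p.getOutlines()): print("  " * depth + title)` would then print the bookmarks indented by level.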
Sometimes the PDF is encrypted, which will cause most of the functions above that read any property or metadata from it to fail. You can decrypt the PDF by calling the decrypt("password") method of PdfFileReader.
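A small guard around decrypt() might look like this. It relies on the reader's isEncrypted attribute and on decrypt() returning 0 when the password does not match, which is how the PdfFileReader class used in this write-up behaves. The helper name unlock and the choice to raise ValueError are assumptions made for this sketch.

```python
def unlock(reader, password):
    """Decrypt reader in place if it is encrypted, then return it."""
    if reader.isEncrypted:
        # decrypt() returns 0 when the password matches neither the
        # user password nor the owner password.
        if reader.decrypt(password) == 0:
            raise ValueError("could not decrypt PDF: wrong password")
    return reader
```

You would call it right after constructing the reader, for example `p = unlock(PdfFileReader(open(filename, "rb")), "password")`, and only then ask for pages or metadata.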
That was easy, wasn't it? Now you can manipulate PDF files by adding and removing pages you want to change, and reading the metadata of PDF files.
If you see any errors here, please let me know so I can fix them.