Honeybadger Staff for Honeybadger

Posted on Nov 18, 2022 • Originally published at honeybadger.io

Working with PDFs in Python

This article was originally written by Giridhar Talla on the Honeybadger Developer Blog.

Working with files is a routine part of programming, and Python can handle most file formats through its standard library and third-party packages. This article explains how to work with PDF files in Python. Python 3 has many libraries that can help you read and create PDF files, and this post provides a quick overview of the packages you'll need.

What is a PDF?
Setup
Working with PDF files
- Creating a PDF
- Extracting text from a PDF
- Converting .txt files to a PDF
- Concatenating and merging PDF files
- Encrypting and decrypting PDFs

What Is a PDF?

Working with PDF files is not the same as working with other file formats. The Portable Document Format (PDF) is a binary file format designed to display a document the same way regardless of the software, hardware, or operating system used to view it. It was initially created by Adobe and is now an open standard managed by the International Organization for Standardization (ISO). A PDF file is more than just a collection of text; it is a collection of data in binary format. The data can be of any type, including text, images, tables, and rich media, such as audio and video. However, it is not designed to be edited in place the way a text file is. It is a popular format for storing documents since it is easy to share or print. Refer to the Wikipedia article on the PDF format for more information.
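One visible sign of the binary format is that a PDF file identifies itself in its first bytes with a header of the form %PDF-1.x. Here is a minimal sketch, using only the standard library, that checks for that header. The helper name pdf_version is my own, and it assumes the header sits at the very start of the data, which is the common case.

```python
from typing import Optional


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    prefix = b"%PDF-"
    if not data.startswith(prefix):
        return None
    # The version runs from the end of the prefix to the end of the first line.
    first_line = data.splitlines()[0]
    return first_line[len(prefix):].decode("ascii", errors="replace").strip()


# Hand-written byte strings stand in for real file contents.
print(pdf_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))  # 1.7
print(pdf_version(b"Just some plain text"))           # None
```

To try it on a real file, open the file in "rb" mode and pass the first few bytes you read to the function.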

Setup

I assume Python is already installed on your machine. If not, go to the official website and download it. You’ll need two libraries to work with PDF files. The first is PyPDF2, a Python library for reading and modifying PDF files. The second is FPDF for creating PDF files. PyPDF2 is an excellent package for working with existing PDF files, but you can't create new PDF files with it. You'll use FPDF to create new PDF files.

Note: If you're using Python 2, you can use pyPdf (the predecessor of PyPDF2) instead. I'll use PyPDF2 with Python 3 in this article, although you can use either PyPDF2 or PyPDF4. Both do the same thing and are compatible with Python 3, so you only need to swap out the import statements.

Let's get started on this installation. Install PyPDF2 and FPDF using pip or conda (if you're using Anaconda).

pip install pypdf2 fpdf

You may use the following command to check the installation.

$ pip show pypdf2 fpdf
Name: PyPDF2
Version: 1.26.0
Summary: PDF toolkit
Home-page: http://mstamy2.github.com/PyPDF2
Author: Mathieu Fenniak
Author-email: biziqe@mathieu.fenniak.net
License: UNKNOWN
Location: c:\users\giri\python3.9\lib\site-packages
Requires:
Required-by:
---
Name: fpdf
Version: 1.7.2
Summary: Simple PDF generation for Python
Home-page: http://code.google.com/p/pyfpdf
Author: Olivier PLATHEY ported by Max
Author-email: maxpat78@yahoo.it
License: LGPLv3+
Location: c:\users\giri\python3.9\lib\site-packages
Requires:
Required-by:
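If you'd rather check the installation from inside Python, here is a small sketch that uses the standard library's importlib.metadata module (available from Python 3.8). The helper name installed_version is my own, and the package names are the ones shown in the pip output above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("PyPDF2", "fpdf"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

The script prints either a version number or "not installed" for each package, so it is safe to run before you install anything.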

Quick note: You can find the entire directory of the code and working examples here.

Working with PDF Files

Now that you have PyPDF2 and FPDF installed, let's get started. First, let's look at extracting information about a PDF file. You can use the PdfFileReader class of PyPDF2, which allows you to read the content of a PDF file. The getDocumentInfo method of PdfFileReader returns the metadata of the PDF file as a dictionary-like object, and the getNumPages method returns the total number of pages. You can use this information to automate tasks on your existing PDF files, such as sorting them by number of pages or by author.

pdf_info.py

## Import
from PyPDF2 import PdfFileReader

## Setup
## Replace 'pdf_path' with the path to your PDF file
with open('pdf_path', "rb") as f:
    pdf = PdfFileReader(f)
    info = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

## Extracting information
pdf_info = f"""
    Information about {info.title}:
    Author: {info.author}
    Creator: {info.creator}
    Producer: {info.producer}
    Subject: {info.subject}
    Title: {info.title}
    Number of pages: {number_of_pages}
  """

print(pdf_info)

You can see the output as shown below:

Information about Test PDF:
Author: Giridhar
Creator: Honeybadger
Producer: PyFPDF 1.7.2 http://pyfpdf.googlecode.com/
Subject: Test PDF created using PyPDF2
Title: Test PDF
Number of pages: 1

You can refer to the documentation for all the different methods and parameters of PdfFileReader.
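To give an idea of the kind of automation mentioned earlier, here is a minimal sketch that sorts PDF metadata by page count and then by author. It assumes you have already collected the values from getDocumentInfo and getNumPages into plain dictionaries; the file names and values below are made up for illustration.

```python
from operator import itemgetter

# Hypothetical metadata, as it might look after reading three PDF files.
records = [
    {"file": "report.pdf", "author": "Giridhar", "pages": 12},
    {"file": "invoice.pdf", "author": "Alice", "pages": 1},
    {"file": "notes.pdf", "author": "Alice", "pages": 12},
]

# Sort by page count first, then by author name to break ties.
by_pages_then_author = sorted(records, key=itemgetter("pages", "author"))

for record in by_pages_then_author:
    print(f'{record["pages"]:>3}  {record["author"]:<10} {record["file"]}')
```

Keeping the metadata in plain dictionaries means the sorting step does not depend on any PDF library, so you can swap the reader without touching it.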

Creating a PDF

Now, let's create a new PDF file. To create a brand-new PDF file, you can use PdfFileWriter from PyPDF2. However, it does not have any methods to add text and create PDF content flexibly. Instead, you can use the FPDF library. Import the package and create a new PDF file object using FPDF() by defining the orientation, unit, and format. You can add a new blank page using the add_page method.

create_pdf.py

## Import
from fpdf import FPDF

## Create a new PDF file
## Orientation: P = Portrait, L = Landscape
## Unit = mm, cm, in
## Format = 'A3', 'A4' (default), 'A5', 'Letter', 'Legal', custom size with (width, height)
pdf = FPDF(orientation="P", unit="mm", format="A4")

## Add a page
pdf.add_page()
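The unit you choose decides how every size you pass later is interpreted. PDF itself measures lengths in points (1/72 of an inch), so here is a small sketch of the arithmetic behind the conversion. The to_points helper is my own illustration, not part of FPDF.

```python
# Points per unit: a point is 1/72 of an inch, and an inch is 25.4 mm.
POINTS_PER_UNIT = {"pt": 1.0, "mm": 72 / 25.4, "cm": 72 / 2.54, "in": 72.0}


def to_points(value: float, unit: str) -> float:
    """Convert a length in the given unit to PDF points."""
    return value * POINTS_PER_UNIT[unit]


# An A4 page measures 210 mm x 297 mm.
width_pt = to_points(210, "mm")
height_pt = to_points(297, "mm")
print(f"A4 in points: {width_pt:.2f} x {height_pt:.2f}")
```

With unit="mm", a width of 200 therefore means 200 millimeters, which is almost the full width of an A4 page.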

Note: You can also include some meta-information if you want. The FPDF class provides the required methods.

You can also specify the font, font size, and style using the set_font method and color of the text using the set_text_color method.

You can add text to the PDF file using the cell(w, h, txt) method. You can specify whether to move the cursor to the following line using ln and text-alignment using align.

create_pdf.py


...

## Specify Font
## Font Family: Arial, Courier, Helvetica, Times, Symbol
## Font Style: B = Bold, I = Italic, U = Underline, combinations (e.g., BI, BU)
pdf.set_font("Arial", size=18)
pdf.set_text_color(0, 0, 255)

## Add text
## Cell(w, h, txt, border, ln, align)
## w = width, h = height
## txt = your text
## ln = (0 or False; 1 or True - move cursor to next line)
## border = (0 or False; 1 or True - draw border around the cell)
pdf.cell(200, 10, txt="Hello World!", ln=1, align="C")

pdf.set_font("Arial", size=12)
pdf.set_text_color(0, 0, 0)
pdf.cell(200, 10, txt="This pdf is created using FPDF in Python.", ln=1, align="C")

You can also add an image to your PDF file using the image method. Finally, to write the PDF file to disk, use the output method. Given a relative file name, it saves the new PDF file in the current working directory.

create_pdf.py

...

## Add image
## name = Path or URL of the image
## x = x-coordinate, y = y-coordinate (default = None)
## w = width, h = height (If not specified or equal to zero, they are automatically calculated.)
## type = Image format. JPG, JPEG, PNG and GIF (If not specified, the type is inferred from the file extension.).
pdf.image(name="boy_night.jpg", h=107, type="JPG")

## Output the PDF
pdf.output("test_pdf.pdf")
print("pdf has been created successfully....")
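When you pass only a height, as in the image call above, the width is calculated for you. Here is a rough sketch of what that calculation looks like, assuming the aspect ratio of the image is preserved. The scaled_width helper and the pixel dimensions are hypothetical and are not taken from FPDF's source.

```python
def scaled_width(target_height: float, pixel_width: int, pixel_height: int) -> float:
    """Width that preserves the image's aspect ratio at the given height."""
    return target_height * pixel_width / pixel_height


# Hypothetical 1600 x 1070 pixel image placed with h=107.
print(scaled_width(107, 1600, 1070))  # 160.0
```

A result like this is worth checking against the page width, since an image wider than the page would be cut off.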

Run the above program, and if you see the success message, your PDF is created. Check out the whole program for creating a PDF file below.

create_pdf.py

## Import
from fpdf import FPDF

pdf = FPDF(orientation="P", unit="mm", format="A4")

## Adding meta data to the PDF file
pdf.set_title("Test PDF")
pdf.set_author("Giridhar")
pdf.set_creator("Honeybadger")
pdf.set_subject("Test PDF created using PyPDF2")
pdf.set_keywords("PDF, Python, Tutorial")

pdf.add_page()

## Add text
pdf.set_font("Arial", size=18)
pdf.set_text_color(0, 0, 255)
pdf.cell(200, 10, txt="Hello World!", ln=1, align="C")

pdf.set_font("Arial", size=12)
pdf.set_text_color(0, 0, 0)
pdf.cell(200, 10, txt="This pdf is created using FPDF in Python.", ln=1, align="C")

## Add image
pdf.image(name="boy_night.jpg", h=107, type="JPG")

## Save the PDF file
pdf.output("test_pdf.pdf")
print("pdf has been created successfully....")

Extracting Text from a PDF

Now that you have created a PDF file, let's look at extracting the text using Python. PyPDF2 reads a page in a PDF as an object called PageObject. You can use several methods of the PageObject class to interact with the pages in a PDF file. The getPage(pageNumber) method of the PdfFileReader class returns a PageObject instance of that page. To extract the text from that specific page, you can use the extractText() method of the PageObject class. You are free to do anything you want with the text.

extract_single_page.py

## Import
from PyPDF2 import PdfFileReader

## Create the PdfFileReader instance
pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

## Get the page object and extract the text
page_object = pdf.getPage(0) # page number starts from 0 (0-index)
text = page_object.extractText()

## Print the text
print(text)
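Extracted text often comes with uneven spacing, so one thing you might want to do with it is tidy it up. Here is a minimal sketch using only the standard library. The tidy_text helper is my own, and a hand-written string stands in for the value returned by extractText().

```python
import re


def tidy_text(raw: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# A hand-written string stands in for the result of extractText().
sample = "  Hello   World!  \n\n This pdf is created\tusing FPDF in Python. \n"
cleaned = tidy_text(sample)
print(cleaned)
print("Word count:", len(cleaned.split()))
```

The same function can be applied to each page's text before you print it or write it to a file.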

Again, the getPage method returns a single page. The PdfFileReader class has a .pages attribute that returns the list of all the pages in a PDF file as PageObjects. You can loop through the pages and extract the text on each page.

extract_text.py

## Import
from PyPDF2 import PdfFileReader

## Create the PdfFileReader instance
pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

## Looping through the page objects array
for page in pdf.pages:
    text = page.extractText()
    print(text)

Now, you can create a .txt file from the contents of the PDF. Follow the comments in the code snippet if you get lost.

extract_text.py

## Import
from PyPDF2 import PdfFileReader

## Declare the PdfFileReader instance
pdf = PdfFileReader(open("<path_to_pdf>", "rb"))

## Create a new text file and open it in write mode
with open("<path_to_text_file>", "w") as f:
    ## Loop through the PDF pages
    for page in pdf.pages:
        text = page.extractText()
        ## Write to the text file
        f.write(text)

You can also create a new PDF file by extracting a specific page or a range of pages from a PDF. Using the PdfFileWriter class in PyPDF2 allows you to create a new PDF file and add these pages.

The PdfFileWriter class creates a new PDF file, and you can add a page to the new PDF file using the addPage() method. It requires an existing pageObject as an input to add to the new PDF file.

extract_text_to_pdf.py

## Import
from PyPDF2 import PdfFileReader, PdfFileWriter

## Declare the PdfFileReader instance and Create a new PDF file using PdfFileWriter
old_pdf = PdfFileReader(open("<path_to_pdf>", "rb"))
new_pdf = PdfFileWriter()

## Loop through the pages and add them to the new PDF file
for page in old_pdf.pages[1:4]: # [1:4] selects index positions 1 to 3 (the 2nd to 4th pages)
    new_pdf.addPage(page)

## Save the new PDF file
with open("<path_to_new_pdf>", "wb") as f:
    new_pdf.write(f)

The above code generates a new PDF file containing the pages at index positions 1 to 3 of the original PDF (that is, its second through fourth pages).
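Because page indexes start at 0 and the stop value of a slice is excluded, it is easy to be off by one when picking pages. Here is a small sketch that converts human-style page numbers into a slice. The page_slice helper is my own, and a plain list stands in for the page sequence.

```python
def page_slice(first_page: int, last_page: int) -> slice:
    """Convert a 1-based, inclusive page range into a 0-based slice."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return slice(first_page - 1, last_page)


# A plain list stands in for the sequence of pages.
pages = ["page 1", "page 2", "page 3", "page 4", "page 5"]

print(pages[1:4])               # ['page 2', 'page 3', 'page 4']
print(pages[page_slice(2, 4)])  # ['page 2', 'page 3', 'page 4']
```

Both lines select the same pages; the helper only makes the intent ("pages 2 to 4") easier to read.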

Converting .txt Files to a PDF

As noted earlier, a PDF file is not meant to be edited in place. Instead, you can convert it to a .txt or other type of file, modify the contents, and then convert the result into a new PDF file. Let's look at converting a .txt file into a PDF file.

To generate a PDF file from text, you should use the FPDF library. You must loop over the lines in the text file and add each line to a blank PDF, just as you created a .txt file from a PDF.

convert_txt_to_pdf.py

## Import
from fpdf import FPDF

## Create a new PDF
pdf = FPDF(orientation="P", unit="mm", format="A4")

pdf.add_page()
pdf.set_font("Arial", size=12)

## Open the .txt file in read mode
with open("<path-to-text-file>", "r") as text:
    ## Loop through the lines in the text file and add them to the PDF
    for line in text:
        ## Strip the trailing newline before adding the line to the cell
        pdf.cell(0, 5, txt=line.rstrip("\n"), ln=1)

## Save the pdf file
pdf.output("<new-path-to-pdf>")
print("PDF created!")

The code snippet generates a new PDF file from an existing text (.txt) file.
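A single cell does not wrap its text, so long lines in the text file can run past the edge of the page. One option is FPDF's multi_cell method; another is to split long lines yourself before adding them, as in this sketch. The wrap_for_cells helper is my own, and the character limit is only a guess that depends on your font and page size.

```python
import textwrap


def wrap_for_cells(lines, max_chars=90):
    """Split long lines into chunks of at most max_chars characters."""
    wrapped = []
    for line in lines:
        line = line.rstrip("\n")
        # textwrap.wrap returns [] for an empty line, so keep blank lines explicitly.
        wrapped.extend(textwrap.wrap(line, width=max_chars) or [""])
    return wrapped


# Hand-written lines stand in for the contents of a text file.
sample = ["short line\n", "\n", "word " * 30 + "\n"]
for chunk in wrap_for_cells(sample, max_chars=40):
    print(repr(chunk))
```

Each chunk returned by the helper can then be passed to its own cell call with ln=1.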

Next we’ll see how to work with the existing PDF files. PyPDF2 can combine, encrypt, and decrypt PDF files.

Concatenating and Merging PDF Files

In this section, you'll learn how to merge or concatenate PDF files. You can use the PdfFileMerger class to combine the PDF files. It enables us to combine PDF files in two different ways. The first way is to use the append method. It concatenates (adds) a new PDF to the end of the previous one. The second way is to use the merge method, which allows you to define the page range to merge.

To combine PDF files, create a new PDF merger object and then add the two PDF files using the append method. Finally, use the write() method to create the new PDF file. It writes the combined document to the file object you pass in, which saves it to disk.

append_pdf.py

## Import
from PyPDF2 import PdfFileMerger

## Create a PDF merger object
pdf_merger = PdfFileMerger()

## Append the PDFs to the merger
pdf_merger.append("pdf_1.pdf")
pdf_merger.append("pdf_2.pdf")

## Write to file
with open("append_pdf.pdf", "wb") as f:
    pdf_merger.write(f)

You can also append more than two PDF files to the same PDF merger object (append_multiple_pdf.py). It creates a new PDF file containing the pages of all the PDF files in the order in which they were appended.
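When you have many files to combine, the order in which you append them matters. Here is a minimal sketch, using only the standard library, that collects the PDF files in a folder in a predictable order. The collect_pdfs helper is my own, and the example runs against a temporary folder with empty placeholder files.

```python
from pathlib import Path
import tempfile


def collect_pdfs(directory):
    """Return the .pdf files in a directory, sorted by file name."""
    return sorted(
        (p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )


# Demonstrate with a temporary directory holding empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pdf_2.pdf", "notes.txt", "pdf_1.pdf"):
        (Path(tmp) / name).touch()
    ordered_names = [p.name for p in collect_pdfs(tmp)]

print(ordered_names)  # ['pdf_1.pdf', 'pdf_2.pdf']
```

You could then loop over the returned paths and pass each one, converted with str(), to the merger's append method.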

The merge method works like the append method, except that it inserts the second PDF at a position you choose instead of adding it to the end. The merge method takes two required arguments: the first is the page index position, and the second is the path to the second file. The page index position is the zero-based index in the first PDF file at which the new pages are inserted.

merge_pdf.py

## Import
from PyPDF2 import PdfFileMerger

## Create a PDF merger object
pdf_merger = PdfFileMerger()

## Append the PDFs to the merger
pdf_merger.append("pdf_1.pdf")

## Merge the second PDF file using index position and path
pdf_merger.merge(1, "pdf_2.pdf")

## Write to file
with open("merged_pdf.pdf", "wb") as f:
    pdf_merger.write(f)

This generates a new PDF file in which the pages of the second PDF are inserted after the first page of the first PDF (i.e., index position = 1). You can also choose specific pages from the second PDF file to merge by passing the index range of pages to combine through the pages argument.

merge_pdf_range.py

## Import
from PyPDF2 import PdfFileMerger

## Create a PDF merger object
pdf_merger = PdfFileMerger()

## Append the PDFs to the merger
pdf_merger.append("pdf_1.pdf")

## Merge a page range of the second PDF file at the given index position
pdf_merger.merge(1, "pdf_2.pdf", pages=(1, 3))  # pages = (start, stop), stop excluded

## Write to file
with open("append_1_to_3_pdf.pdf", "wb") as f:
    pdf_merger.write(f)

This creates a new PDF file in which pages 2 and 3 of the second PDF file (index positions 1 and 2; the stop index 3 is excluded) are inserted into the first PDF file after its first page (i.e., index position = 1). This lets you merge only the pages you choose.
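If the index arithmetic feels abstract, it can help to model the pages as plain lists. The sketch below imitates the insertion behavior described above; merge_at is a hypothetical helper for illustration only and does not call PyPDF2.

```python
def merge_at(base, position, other, pages=None):
    """Insert `other` (or a (start, stop) range of it) into `base` at `position`."""
    if pages is not None:
        start, stop = pages
        other = other[start:stop]
    # Concatenation inserts the new pages without overwriting existing ones.
    return base[:position] + list(other) + base[position:]


first = ["A1", "A2", "A3"]
second = ["B1", "B2", "B3", "B4"]

print(merge_at(first, 1, second))
# ['A1', 'B1', 'B2', 'B3', 'B4', 'A2', 'A3']
print(merge_at(first, 1, second, pages=(1, 3)))
# ['A1', 'B2', 'B3', 'A2', 'A3']
```

The second call shows why a range of (1, 3) yields two pages: the start index is included and the stop index is not.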

Encrypting and Decrypting PDFs

Encryption is the process of protecting data with a mathematical algorithm so that a password (similar to a 'key') is needed to decode the original data. You can read more about encryption in this article.

Encrypting a PDF file lets you limit access to it: only you or your client can open the PDF using the password provided. You can encrypt a PDF file using the encrypt(user_pwd, owner_pwd) method of the PdfFileWriter class, and you can decrypt it using the decrypt(password) method of the PdfFileReader class to access its contents.

The encrypt method accepts the following arguments:

The user_pwd = user password is used to open the PDF file.
The owner_pwd = owner password is used to restrict the PDF file's edit and view access (admin privileges). By default, the owner password is the same as the user password.
The use_128bit = True is used to specify whether to use 128-bit encryption. It defaults to True; when set to False, 40-bit encryption is used instead.

Note: At this stage, PyPDF2 allows you to encrypt a PDF file but does not allow you to specify any permissions on the document. You can accomplish this with another library, such as pdfrw.

The code sample below demonstrates how to encrypt a PDF file with PyPDF2.

encrypt_pdf.py

## Import
from PyPDF2 import PdfFileWriter, PdfFileReader

pdf_reader = PdfFileReader("<path_to_pdf_file>")
pdf_writer = PdfFileWriter()

for page in pdf_reader.pages:
    pdf_writer.addPage(page)

## Encrypt the PDF file
pdf_writer.encrypt(user_pwd="<user-password>", owner_pwd="<owner-password>", use_128bit=True)

with open("encrypted_pdf.pdf", "wb") as f:
    pdf_writer.write(f)

The code produces a new PDF file, encrypting it with the password. Whenever you try to open the PDF, you must enter the user password to view the contents.

If you try to access the PDF using PyPDF2, it displays the following error:

Traceback (most recent call last):
  File "read_pdf.py", line 6, in <module>
    pdf_reader.getPage(0)
raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted

To read the PDF with PyPDF2, you must first supply a valid password. You can use the decrypt(password) method to decrypt the PDF file with either the user password or the owner password.

The decrypt method returns an integer representing the success of the decryption:

0 denotes that the password is incorrect.
1 indicates that the user password is a match.
2 indicates that the owner's password was matched.
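Since the return value is just a number, you may want to turn it into a readable message before acting on it. Here is a small sketch based on the three values listed above. The names DECRYPT_RESULTS and describe_decrypt_result are my own and are not part of PyPDF2.

```python
DECRYPT_RESULTS = {
    0: "password is incorrect",
    1: "user password matched",
    2: "owner password matched",
}


def describe_decrypt_result(code: int) -> str:
    """Translate the integer returned by decrypt() into a readable message."""
    return DECRYPT_RESULTS.get(code, f"unexpected result code: {code}")


for code in (0, 1, 2):
    print(code, "->", describe_decrypt_result(code))
```

In a script, you would pass the value returned by decrypt straight into the helper and stop early when the result is 0.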

decrypt_pdf.py

## Import
from PyPDF2 import PdfFileReader

## Get the encrypted file
pdf_reader = PdfFileReader("encrypted_pdf.pdf")

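## Decrypt the file using the password chosen during encryption
## ("SuperSecret" stands in for that password)
if pdf_reader.decrypt("SuperSecret") == 0:
    raise ValueError("Incorrect password")
print(pdf_reader.getPage(0).extractText())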

Now, you can work with the PDF file as you did throughout this article.

Conclusion

As this post has shown with PyPDF2 and FPDF, you can use a PDF toolkit to work with PDF files in Python. You can generate, read, edit, combine, encrypt, and decrypt PDF files. You may also convert a PDF file to another format and vice versa. For your next projects, you could create an online PDF file converter or an application to create PDF files online. You could also make an application to automate the process of creating invoices. You are not, however, limited to the libraries mentioned in this article. Django and Flask both have their own packages for working with PDF files. I hope that this post provides you with a foundation for working with PDF files in Python.