Filling in PDF forms can be a pain. It’s even more of a pain if you have to do this repeatedly, with almost exactly the same input data each time. Surely you can automate this, right? Yup, you can. Just use the Python pypdf module.
For the first few months of 2024, I had to fill in a PDF form for a German government agency each month. After the second time filling out the form with almost completely the same data as the previous month, I thought “Surely I can automate this”. The short answer is: yes, I could! The long answer is … more involved.
Did I save any time? I don’t think so. Yet more important than any time saved, I think, was what I learned along the way. That and the reduced stress and annoyance of having to repeat myself when entering much the same information each month. Having everything scripted meant that I was more certain I was entering data correctly. Also, I was less likely to be penalised for submitting incorrect information.
This post shows how to use the pypdf module by first considering the simple case of entering text into text fields. Then we’ll look at the more complex situation of checking checkboxes. Lastly, we’ll avoid some duplicate manual data entry by using implicit information in dates.
Note that pypdf is capable of much more than only filling out forms. Check out the module’s extensive documentation for more information.
Before we begin, let’s set up a Python virtual environment for us to play in.
Quick setup
One of the first things we need to do is set up a wee playground for our form-filling project. Let’s create a directory called form_filler, set up a Python virtual environment in it, and install the pypdf module:
# create the playground directory
$ mkdir form_filler
$ cd form_filler
# initialise the virtual environment
$ python3 -m venv venv
# activate the virtual environment
$ source venv/bin/activate
# install the Python PDF manipulation module
$ pip install pypdf
Now that we’re all set up, we can start playing with an example PDF form.
Focus on the text form fields
Let’s look at the case of reading an example PDF form and filling in some of its text fields. This discussion follows the pypdf documentation for filling out forms, so if you want a different perspective or need more info, have a look there.
The example form we’re going to be playing with has many text fields. The ones we’re going to focus on in this section are the customer number (Kundennummer) and full name (Name, Vorname) fields at the beginning of the document
as well as the place (Ort) and date (Datum) fields at the end of the document.
Note that for these fields, the only thing that changed here from month to month for me was the date, and that’s a chore we can automate away.
Reading text fields from the original form
Before we can start filling in the form and its fields, we need to read it. We do this with the PdfReader class:
from pypdf import PdfReader
# read the PDF document
reader = PdfReader("form.pdf")
where I’ve–very unimaginatively–called the input document form.pdf.
We can determine the available text fields in the document by asking the reader object for them via the get_form_text_fields() method:
from pypdf import PdfReader
# read the PDF document
reader = PdfReader("form.pdf")
# extract its text fields
fields = reader.get_form_text_fields()
If we now look at the fields object, we’ll see that it’s a dict with the field names as the dict’s keys, and all the values are set to None. The field values are None because all fields in the form are currently empty.
Here’s an excerpt:
{
'Kundennummer[0]': None,
'Name_Vorname[0]': None,
'[0]': None,
'Geburtsdatum[0]': None,
'Postleitzahl_Wohnort[0]': None,
<snip>
'Ort-unten[0]': None,
'Datum-unten[0]': None
}
We see the first two fields that we’re interested in at the very beginning: the customer number field (Kundennummer[0]) followed by the full name field (Name_Vorname[0]). Then come several other text fields contained within the document. At the end of the output, we see the last two fields of interest: the place field (Ort-unten[0]) and the date field (Datum-unten[0]).
You’ll notice that each field name ends with the text [0]. This looks like the first value of an array, so it seems possible to have more complex data structures supporting text fields in PDF forms. This extra ability hasn’t been used here; all text fields only have the [0] suffix.
I find the naming of the keys interesting as it gives a glimpse behind the scenes of how the document was created. Most fields are fairly uninteresting because they match the text next to the fields seen when viewed in a PDF reader. However, the third field is odd in that it doesn’t have a name, although it corresponds to the street address information in the rendered PDF. Maybe someone forgot to set a value when creating the form? Dunno. At the very least, the [0] string is available as a field key. Since this is the only field in the document missing a human-readable field name, it’s still possible to uniquely identify the field and thus fill it in.
The last two fields in the example output also show how whoever created the form was thinking. It seems that they wanted to keep the place name and date information at the bottom of the document separate from any other, similar fields. I’m guessing this is why they called the last two fields Ort-unten[0] (“placename-bottom”) and Datum-unten[0] (“date-bottom”).
Filling in text fields
Filling in the form fields is a simple matter of setting appropriate values to the relevant keys. For our example we can do this:
form_data = {
'Kundennummer[0]': 'ABCD1234',
'Name_Vorname[0]': 'Wurst, Hans',
'Ort-unten[0]': 'Dingenskirchen',
'Datum-unten[0]': '30.09.2024',
}
and then write the PDF out to a new filename, which we’ll do in a minute.
But first, why hard-code the date? We can determine that automatically by asking Python for today’s date (whatever that currently happens to be). After all, today’s date is usually what one will want to use when filling in the form by hand. Let’s do that now.
from datetime import date
from pypdf import PdfReader
# read the PDF document
reader = PdfReader("form.pdf")
# extract its text fields
text_fields = reader.get_form_text_fields()
# get today's date as a string
today_dmy = date.today().strftime("%d.%m.%Y")
# set the form fields
form_data = {
'Kundennummer[0]': 'ABCD1234',
'Name_Vorname[0]': 'Wurst, Hans',
'Ort-unten[0]': 'Dingenskirchen',
'Datum-unten[0]': f'{today_dmy}',
}
Note that Germany is one of those sensible countries that uses the day-month-year format. The only odd thing that an Anglophone might stumble over is that the day, month and year values are separated by a full stop rather than a slash. Of course, one should probably use the ISO 8601 format, but these conventions are notoriously difficult to change.
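To make the difference between the two formats concrete, here’s a small sketch that uses a fixed date instead of today’s date, so that the output is reproducible. The variable names are mine and purely illustrative.

```python
from datetime import date

# a fixed example date so the result doesn't depend on when this runs
example_day = date(2024, 9, 30)

# German day-month-year format, as the form expects
german_style = example_day.strftime("%d.%m.%Y")

# ISO 8601 format, for comparison
iso_style = example_day.isoformat()

print(german_style)  # 30.09.2024
print(iso_style)     # 2024-09-30
```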
Writing filled-in form data to file
Setting values in the form_data dict is all well and good, but this doesn’t save the information to file. How do we do that? For this, we need to use the PdfWriter class. Here’s the full code listing at this stage:
from datetime import date
from pypdf import PdfReader, PdfWriter
# read the PDF document
reader = PdfReader("form.pdf")
# extract its text fields
text_fields = reader.get_form_text_fields()
# get today's date as a string
today_dmy = date.today().strftime("%d.%m.%Y")
# set the form fields
form_data = {
'Kundennummer[0]': 'ABCD1234',
'Name_Vorname[0]': 'Wurst, Hans',
'Ort-unten[0]': 'Dingenskirchen',
'Datum-unten[0]': f'{today_dmy}',
}
# pass document read from reader object to writer object
writer = PdfWriter()
writer.append(reader)
# update the form field values on all pages
for page in writer.pages:
    writer.update_page_form_field_values(
        page,
        form_data,
        auto_regenerate=False,
    )
# write the new PDF to file
date_ym = date.today().strftime("%Y-%m")
output_filename = f"filled-in-form-{date_ym}.pdf"
with open(output_filename, "wb") as fp:
    writer.write(fp)
Here I’ve tried to be a little more imaginative by giving the output file a more descriptive name. Also, I’ve added the year and month to the filename; remember, I only need to prepare this form once a month, hence the day isn’t necessary.
You might be wondering why I don’t just set values in the text_fields dict directly and use that as input to update_page_form_field_values(). It turns out that if any of the values passed to update_page_form_field_values() are still None (as they are when read by get_form_text_fields()), then the call will throw an AttributeError:
Traceback (most recent call last):
File "/home/cochrane/Projekte/PrivatProjekte/form_filler/text_field_filler.py", line 29, in <module>
writer.update_page_form_field_values(
File "/home/cochrane/Projekte/PrivatProjekte/form_filler/venv/lib/python3.9/site-packages/pypdf/_writer.py", line 1024, in update_page_form_field_values
writer_parent_annot[NameObject(FA.V)] = TextStringObject(value)
File "/home/cochrane/Projekte/PrivatProjekte/form_filler/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 522, in __new__
if value.startswith(("\xfe\xff", "\xff\xfe")):
AttributeError: 'NoneType' object has no attribute 'startswith'
Thus it’s a better idea to create a new dict with the appropriate keys and pass this to update_page_form_field_values(). This way only those fields you want to fill in are set to a value; all other fields are left empty, which is probably the behaviour you want.
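If you’d nevertheless prefer to start from the dict that get_form_text_fields() returns, one possible approach is to drop the None entries before handing the data over. The following is only a sketch: a plain dict with made-up values stands in for the real return value, and drop_empty_fields() is a hypothetical helper name of my own.

```python
# stand-in for the dict returned by reader.get_form_text_fields();
# the values here are made up for illustration
text_fields = {
    'Kundennummer[0]': 'ABCD1234',
    'Name_Vorname[0]': None,
    'Ort-unten[0]': 'Dingenskirchen',
    'Datum-unten[0]': None,
}


def drop_empty_fields(fields):
    """Return a copy of fields without the entries whose value is None."""
    return {name: value for name, value in fields.items() if value is not None}


form_data = drop_empty_fields(text_fields)
print(form_data)
# {'Kundennummer[0]': 'ABCD1234', 'Ort-unten[0]': 'Dingenskirchen'}
```

The filtered dict then only contains string values, which is what caused no trouble in the listing above.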
Checking the output
Opening the newly created PDF file, we find that the customer number and full name fields have been set:
as well as the place name and date fields at the bottom:
Nice! That wasn’t too hard, was it?
That’s it for the simple case of setting text-based fields in PDF forms. Hopefully, you can see how to extend this to fill out all remaining text fields in the form.
Things aren’t always as simple as this though. One further issue I had was to tick (erm, check?) various checkboxes, which required digging into a lot more detail.
Ticking all the right boxes
To set checkboxes correctly, we have to do much more work.
Note that what follows is what I managed to work out by reading the PDF reference, digging into the data structure that the get_fields() method returns, and reading various posts on StackOverflow. There might be a much easier way to do this!
Looking into this issue gave me the opportunity to learn a bit more about PDF document internals, which was interesting. I’ve played a bit with PostScript in the past and, since Adobe created both, they have many elements in common.
To get going, let’s read in the PDF document, extract all its fields and pass the document to the writer object. This we’ll then use to extract the information we’re interested in.
One could equivalently extract field information from the reader object. But since we want to manipulate information on the writer object anyway, it seemed easier to use it for both getting and setting.
Here’s an outline of the code to get us started:
from datetime import date
import re
from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject
# read the PDF document
reader = PdfReader("form.pdf")
# extract all its fields
fields = reader.get_fields()
# pass document read from reader object to writer object
writer = PdfWriter()
writer.append(reader)
To tick boxes, first, we have to dig in the annotations
Unfortunately, we can’t check the checkbox from the get_fields() output directly1 because the checkbox objects are buried in a page annotation (/Annot).2 We have to extract the annotation from the page object and manipulate that. We can get to a page object by specifying an element of the pages array on the writer object, e.g. writer.pages[0], to access the first page in the document.
We can get all button-like annotations on a page like so:
# find checkboxes on first page
checkboxes = {
annot['/T']: annot for annot in writer.pages[0]['/Annots']
if annot['/FT'] == '/Btn'
}
Here I know that there aren’t any radio buttons or anything similar on the page. Thus all annotations that have the field type (/FT)3 of “button” (/Btn)4 will be the checkboxes that I’m interested in.
I’ve used a dictionary comprehension here so that I can refer to each button-like annotation by name. This is easier than using a list comprehension where one has to remember which element refers to which button. Also, I’m setting the key in the checkboxes dict to the value of the element’s partial field name.5 This turns out to be a nice string that we can use later when referencing document elements we want to edit.
For instance, the list of keys in the checkboxes dict is:
>>> print(list(checkboxes.keys()))
['Kontrollkaestchen1[0]', 'Ja-Nein-2[0]', 'Ja-Nein-2[1]', 'Ja-Nein-3-1[0]', 'Ja-Nein-3-1[1]', 'Ja-Nein-3-1a[0]', 'Ja-Nein-3-1a[1]']
Hence it’s simple to refer to a given checkbox by name, e.g.:
>>> checkboxes['Ja-Nein-2[0]']
IndirectObject(46, 0, 139727073238848)
Although, that does now require more work to get any useful information.
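The comprehension above assumes that every annotation on the page has both a /T and an /FT entry. That was true for my form, but might not hold for others. As a hedged sketch, here’s a more defensive variant. Plain dicts stand in for the annotation objects so that the example runs without a PDF; find_checkboxes() is a hypothetical helper name, and I’m assuming that pypdf’s annotation objects offer get_object() and the same dictionary-style access.

```python
def find_checkboxes(annotations):
    """Map partial field names to button-like annotations.

    Annotations lacking a name (/T) or a field type (/FT) are skipped
    instead of raising a KeyError.
    """
    checkboxes = {}
    for annot in annotations:
        # resolve indirect references if the object supports it
        obj = annot.get_object() if hasattr(annot, "get_object") else annot
        if obj.get('/FT') == '/Btn' and '/T' in obj:
            checkboxes[obj['/T']] = annot
    return checkboxes


# made-up stand-ins for the entries of writer.pages[0]['/Annots']
sample_annotations = [
    {'/T': 'Kundennummer[0]', '/FT': '/Tx'},
    {'/T': 'Kontrollkaestchen1[0]', '/FT': '/Btn', '/AS': '/Off'},
    {'/Subtype': '/Link'},  # no /T or /FT at all
    {'/T': 'Ja-Nein-2[0]', '/FT': '/Btn', '/AS': '/Off'},
]

print(list(find_checkboxes(sample_annotations)))
# ['Kontrollkaestchen1[0]', 'Ja-Nein-2[0]']
```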
By looking at the attributes on the writer.pages[0] object, I found the /Annots key, which is a list of annotations to the current page. It turned out that the checkbox /Btn information was buried in there, hence the need to dig around in the annotations. The /Annots key then points to a list of IndirectObject instances. These objects contain the details of each of the annotations on the page. The checkbox extraction code above looks at each annotation to see if its field type is /Btn and, if so, keeps it; all other annotations are dropped.
There are lots of other goodies in there as well. One can get a glimpse of the possibilities by looking at the keys of the PageObject instance:
>>> writer.pages[0].keys()
dict_keys(['/Contents', '/CropBox', '/MediaBox', '/Resources', '/Rotate', '/Type', '/Parent', '/Annots'])
For those so inclined, there’ll be more fun things to discover and play with in there!
What’s the state, Kenneth?
Simply setting the item’s value to /1 (or /Yes) as described in the discussion “How to set a checkbox to true?” on the pypdf project site didn’t check the checkbox in my case. This is probably because each item is an IndirectObject. Thus, in my case, it was necessary to dig a bit deeper. But first, we need to find out what states our checkboxes can accept.
We can find out the current state of a given checkbox by looking up its /AS (appearance state) property.6 E.g. for Kontrollkaestchen1 (control box 1) we have:
checkboxes['Kontrollkaestchen1[0]']['/AS'] # => '/Off'
This is the thing we need to turn on. But we can’t go setting this to /1 or /Yes willy-nilly; at least for the form I’m filling out here. It’s necessary to know which states the form expects to get the checkboxes set correctly.
To know which state to use, we need to use the get_fields() method and look at the value of the /_States_ key.7 Remember that simply setting the item’s value to /1 or /Yes didn’t work.
Oddly enough, although the state information is available on the object in the fields dict, it’s not on the object we extracted from the annotations on the page. So, to get the state information, we need to search for the correct key in the fields dict. This key is more specific than the name we stored as the checkboxes dict key:
# get the key for checkbox "control box 1"
control_box_key = next(
(
key for key in fields.keys()
if re.search('Kontrollkaestchen1', key)
), None
)
# => 'Arbeitsbescheinigung[0].Seite1[0].Allgemeine_Angaben_1[0].Kontrollkaestchen1[0]'
Here I’ve used a generator expression to filter for the desired key and then used the built-in next() function to return the first (and, here, only) match. Alternatively, one might want to use a list comprehension and return the first element (since we know that there’s only one element to return):
control_box_key = [
key for key in fields.keys()
if re.search('Kontrollkaestchen1', key)
][0]
# => 'Arbeitsbescheinigung[0].Seite1[0].Allgemeine_Angaben_1[0].Kontrollkaestchen1[0]'
From what I understand, the generator expression is more “Pythonic” than, say, a list comprehension in this case. The things ya learn.
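One practical difference between the two variants: next() with a default value quietly returns None when nothing matches, whereas indexing an empty list with [0] raises an IndexError. Note also that re.search() matches anywhere in the key, so it would match a (hypothetical) Kontrollkaestchen10 field as well. Here’s a sketch that compares the last part of the key exactly instead. The list of keys is a made-up stand-in for fields.keys() and find_field_key() is a hypothetical helper name.

```python
# a made-up list standing in for fields.keys()
field_keys = [
    'Arbeitsbescheinigung[0].Seite1[0].Allgemeine_Angaben_1[0].Kundennummer[0]',
    'Arbeitsbescheinigung[0].Seite1[0].Allgemeine_Angaben_1[0].Kontrollkaestchen1[0]',
]


def find_field_key(keys, partial_name):
    """Return the first key ending in the given partial field name,
    or None if there is no such key."""
    return next(
        (
            key for key in keys
            if key == partial_name or key.endswith('.' + partial_name)
        ),
        None,
    )


print(find_field_key(field_keys, 'Kontrollkaestchen1[0]'))
print(find_field_key(field_keys, 'Gibt-es-nicht[0]'))  # None
```

Checking the result for None before using it as a dict key avoids a confusing KeyError further down the line.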
Anyway, we now have the key to use in the fields dict. Extracting the state information, we get:
# get the possible states for this checkbox
control_box_states = fields[control_box_key]['/_States_']
# => ['/1', '/Off']
And now we know that “on” is /1, which will then check this particular checkbox.
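Rather than reading the “on” state off by eye, one could also pick it out of the list programmatically. This is a minimal sketch, assuming that the states list always contains /Off plus exactly one other entry, as it does for this form; checked_state() is a hypothetical helper name.

```python
def checked_state(states):
    """Return the single state that isn't '/Off'.

    Raises ValueError if the list doesn't contain exactly one such state.
    """
    on_states = [state for state in states if state != '/Off']
    if len(on_states) != 1:
        raise ValueError(f"expected exactly one 'on' state, got {on_states}")
    return on_states[0]


print(checked_state(['/1', '/Off']))  # /1
print(checked_state(['/2', '/Off']))  # /2
```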
It’s still not that simple
Unfortunately, we can’t set the checkbox object’s /AS value directly because it’s an IndirectObject and they don’t support item assignment:
checkboxes['Kontrollkaestchen1[0]']['/AS'] = '/1'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'IndirectObject' object does not support item assignment
The trick is to get the object behind the IndirectObject and manipulate that. We do this by getting the object with the appropriately named get_object() method:
# get checkbox "control box 1"
control_box = checkboxes['Kontrollkaestchen1[0]'].get_object()
Although this allows value lookup via a dictionary-like interface (the object itself is a DictionaryObject):
control_box['/AS']
# => '/Off'
we can’t assign plain strings such as /1 directly; both the key and the value have to be of type PdfObject:
>>> control_box['/AS'] = '/1'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 404, in __setitem__
raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject
To set values, we must wrap the key and the value in a NameObject (which is a subtype of a PdfObject):
# check it
control_box[NameObject('/AS')] = NameObject('/1')
Finally! We managed to set the control box value! Yay! 🎉
Because we manipulated objects directly on the writer object’s page, we can write the data to file without needing to use update_page_form_field_values():
# write the new PDF to file
date_ym = date.today().strftime("%Y-%m")
output_filename = f"filled-in-form-checkboxes-{date_ym}.pdf"
with open(output_filename, "wb") as fp:
    writer.write(fp)
The first checkbox (“control box 1”, a.k.a. Kontrollkaestchen1[0]) looks like this in its default state:
After updating the checkbox’s state in the writer object and writing to file, we see this output:
Note that if we update text fields as well as checkboxes, we still need to use update_page_form_field_values() before writing the data to file.
Crikey, that was hard work! But we got there in the end!
This is what the code looks like now, which sets the “customer number” field and checks “control box 1”:
from datetime import date
import re
from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject
# read the PDF document
reader = PdfReader("form.pdf")
# extract all its fields
fields = reader.get_fields()
# pass document read from reader object to writer object
writer = PdfWriter()
writer.append(reader)
# find checkboxes on first page
checkboxes = {
annot['/T']: annot for annot in writer.pages[0]['/Annots']
if annot['/FT'] == '/Btn'
}
# get the key for checkbox "control box 1"
control_box_key = next(
(
key for key in fields.keys()
if re.search('Kontrollkaestchen1', key)
), None
)
# get the possible states for this checkbox
control_box_states = fields[control_box_key]['/_States_']
# get checkbox "control box 1"
control_box = checkboxes['Kontrollkaestchen1[0]'].get_object()
# check it
control_box[NameObject('/AS')] = NameObject('/1')
# set the form fields
form_data = {
'Kundennummer[0]': 'ABCD1234',
}
# update the form field values on all pages
for page in writer.pages:
    writer.update_page_form_field_values(
        page,
        form_data,
        auto_regenerate=False,
    )
# write the new PDF to file
date_ym = date.today().strftime("%Y-%m")
output_filename = f"filled-in-form-checkboxes-{date_ym}.pdf"
with open(output_filename, "wb") as fp:
    writer.write(fp)
But wait! There’s more!
Now, you might be thinking “That’s cool, I only need to set all the remaining checkboxes I want ticked to ‘/1’ and I’m done”. Sorry, not so fast. Many of the other checkboxes in the document need to use /2 to check the box correctly as opposed to /1. That’s why we extracted the /_States_ information mentioned earlier.8
Let’s now see how to handle this twist to the story.
Most of the remaining checkboxes on the form are yes/no questions like this:
where there’s a checkbox for both the “yes” and “no” answers. Just in case you’re wondering, yes, it is possible to check both boxes so that both “yes” and “no” are ticked. I know that this is a logical paradox. And trust me, I’m not making this stuff up.
Given that this is a Boolean question, it does seem a bit odd. Since both boxes can be checked or unchecked, there are therefore four possible states this one question can have:
- none checked,
- only “yes” checked,
- only “no” checked,
- and both checked.
One can only speculate on why this is so. I presume that this form also needs to be filled in with pen and paper, and is probably based upon an original form which only existed on paper. This would explain the need for a separate checkbox for “yes” and for “no”. Also, I’m guessing that it’s not obvious which of the two options should be the default. Thus we have the initial situation that both “yes” and “no” are unchecked as a kind of compromise. My gut feeling is that this form, instead of being a PDF, really should be a web form. Having the form completely online would solve logical issues such as this. Perhaps it’s only used in a low percentage of cases and hence I managed to stumble my way into a special case. No idea. One tries to make the most of what one is given.
Because there are two checkboxes for a single yes/no question, two PDF checkbox fields match the question. Searching for fields matching a given question thus returns a list. This means we can’t use the generator-and-next() trick anymore to filter the fields; we have to fall back to a list comprehension.
The name of the question we’ll focus on to illustrate this point is Ja-Nein-2 (yes-no-2) and is the first kind of yes/no question on the form. I think the ‘2’ comes from the fact that this appears in the second section of the form (there’s no Ja-Nein-1 yes/no question). It looks that way, but I’m not 100% sure.9
Searching for the key Ja-Nein-2 in the list of all field keys, we get:
# get the keys for the "yes-no-2" question
yes_no_2_keys = [
key for key in fields.keys()
if re.search('Ja-Nein-2', key)
]
# => ['Arbeitsbescheinigung[0].Seite1[0].Angaben_Arbeitszeit_2[0].Ja-Nein-2[0]',
# 'Arbeitsbescheinigung[0].Seite1[0].Angaben_Arbeitszeit_2[0].Ja-Nein-2[1]']
where the first element Ja-Nein-2[0] refers to the “yes” option and Ja-Nein-2[1] refers to the “no” option.
The possible states of these elements are:
# find all states for the "yes-no-2" question
yes_no_2_states = [fields[key]['/_States_'] for key in yes_no_2_keys]
# => [['/1', '/Off'], ['/2', '/Off']]
In other words, the expected checked state of the “yes” checkbox is the value /1 and the expected checked state of the “no” checkbox is the value /2. In either case, if we want to set the checkbox to the unchecked state, we set the checkbox to /Off.
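Put differently, a yes/no answer maps onto a pair of appearance states, one per checkbox. The following sketch spells that mapping out, assuming each states list contains /Off plus exactly one “on” state, as is the case here. The function name yes_no_states() is hypothetical and the code deliberately doesn’t touch the PDF at all; it only works out which values one would want to set.

```python
def yes_no_states(answer, yes_states, no_states):
    """Return the pair (/AS for the "yes" box, /AS for the "no" box).

    answer is True for "yes" and False for "no"; exactly one box ends
    up checked and the other one is switched off.
    """
    yes_on = next(state for state in yes_states if state != '/Off')
    no_on = next(state for state in no_states if state != '/Off')
    if answer:
        return yes_on, '/Off'
    return '/Off', no_on


# states as reported for the "yes-no-2" question
yes_states, no_states = ['/1', '/Off'], ['/2', '/Off']

print(yes_no_states(True, yes_states, no_states))   # ('/1', '/Off')
print(yes_no_states(False, yes_states, no_states))  # ('/Off', '/2')
```

Deriving both values from a single Boolean makes it impossible to tick “yes” and “no” at the same time by accident.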
Because I’m one of those kinds of people who like to prod things to see how they behave, I decided to try setting the “no” box to /1 to see what would happen. It turns out that if one sets a checkbox to /1 when it expects a different state for “on” (as is the case for the “no” checkbox above), then it looks like the PDF viewer uses a default representation for the checkbox “on” state. This default check mark looks different to that which we got when setting the single checkbox earlier.
For instance, if we set both the “yes” and “no” checkboxes to /1 we get:10
Note how the “no” (“Nein”) box contains a different symbol to the “yes” (“Ja”) box. If one clicks the “no” checkbox with a mouse from within a PDF reader, then each box uses the style as shown in the “yes” (“Ja”) box. Hence it seems that /2 is the correct checked state for the “no” box.
Note that I also tried setting the state of the “no” box to values like /0 and /3 to see what happened. In each case, the default check mark shown in the image above appeared. My guess about what’s happening here is that if the PDF viewer sees a value that is neither /Off nor the expected checked state, it falls back to the default check mark.
Ok, so to check the boxes using the expected checked states,11 we can use code like this:
# get the checkbox "yes-no-2"
yes_no_2_yes = checkboxes['Ja-Nein-2[0]'].get_object()
yes_no_2_no = checkboxes['Ja-Nein-2[1]'].get_object()
# check them
yes_no_2_yes[NameObject('/AS')] = NameObject('/1')
yes_no_2_no[NameObject('/AS')] = NameObject('/2')
Extending the code with these changes, running it and saving the PDF output to file, we get:12
This is the kind of check mark that I think the government agency expects to see and why I use it here.13 It’s better not to confuse people with a check mark that they’re not expecting. Also, after having uploaded these forms a few times now, some of the information in the form is automatically extracted from the uploaded PDF and it’s best not to confuse whatever software is doing this extraction and extra processing.
These checkbox state options are likely to be different in other PDF forms. Hence when working with a different PDF form, it’s a good idea to dig into the document and its checkboxes to find out what the expected checked states are. That way you’re sure to use the correct states.
The complete code, including checking a yes-no question, now looks like this:
from datetime import date
import re
from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject
# read the PDF document
reader = PdfReader("form.pdf")
# extract all its fields
fields = reader.get_fields()
# pass document read from reader object to writer object
writer = PdfWriter()
writer.append(reader)
# find checkboxes on first page
checkboxes = {
annot['/T']: annot for annot in writer.pages[0]['/Annots']
if annot['/FT'] == '/Btn'
}
# get the key for checkbox "control box 1"
control_box_key = next(
(
key for key in fields.keys()
if re.search('Kontrollkaestchen1', key)
), None
)
# get the possible states for this checkbox
control_box_states = fields[control_box_key]['/_States_']
# get checkbox "control box 1"
control_box = checkboxes['Kontrollkaestchen1[0]'].get_object()
# check it
control_box[NameObject('/AS')] = NameObject('/1')
# get the keys for the "yes-no-2" question
yes_no_2_keys = [
key for key in fields.keys()
if re.search('Ja-Nein-2', key)
]
# find all states for the "yes-no-2" question
yes_no_2_states = [fields[key]['/_States_'] for key in yes_no_2_keys]
# get the checkboxes for "yes-no-2"
yes_no_2_yes = checkboxes['Ja-Nein-2[0]'].get_object()
yes_no_2_no = checkboxes['Ja-Nein-2[1]'].get_object()
# check the box to answer "yes"
yes_no_2_yes[NameObject('/AS')] = NameObject('/1')
yes_no_2_no[NameObject('/AS')] = NameObject('/Off')
# alternatively, check the box to answer "no"
# yes_no_2_yes[NameObject('/AS')] = NameObject('/Off')
# yes_no_2_no[NameObject('/AS')] = NameObject('/2')
# set the form fields
form_data = {
'Kundennummer[0]': 'ABCD1234',
}
# update the form field values on all pages
for page in writer.pages:
    writer.update_page_form_field_values(
        page,
        form_data,
        auto_regenerate=False,
    )
# write the new PDF to file
date_ym = date.today().strftime("%Y-%m")
output_filename = f"filled-in-form-checkboxes-{date_ym}.pdf"
with open(output_filename, "wb") as fp:
    writer.write(fp)
Checkboxed summary
So–for this particular form–we have the rules:
- Single checkbox questions need to use /1 to check the box.
- Double checkbox “yes-no” questions need to use /1 to check the “yes” box and /2 to check the “no” box.
Automating away a bit of German bureaucracy
Another part of the form requires the user to enter a date range within a Monday-to-Sunday week. The form also requires the user to enter the “calendar week” number associated with that date range. The thing is, the “calendar week” can be calculated from one of the dates in the date range, hence this is duplicate information. Fortunately, we can automate away this duplication.
In my experience, it’s common in the German public service, in government agencies, and in large organisations to refer to weeks of a year by using numerical “calendar weeks”. The idea is that one refers to a given week in the year by its number (i.e. from 1 to 52, or occasionally 53) rather than mentioning the date on which a given week starts. I’ve worked in a public service job in Germany before and once I got used to it, it was quite handy. For instance, instead of referring to a week as “the one starting on Monday the 30th of September”, you only have to say “KW40” (where KW stands for Kalenderwoche (“calendar week” in German)). Maybe that’s this German efficiency thing people keep talking about?
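To illustrate the relationship between a calendar week and its dates, here’s a short sketch going in the opposite direction: from a week number to the Monday and Sunday of that week. It assumes that the German calendar week follows the ISO 8601 week numbering, and it needs Python 3.8 or later for date.fromisocalendar(). The helper name week_bounds() is hypothetical.

```python
from datetime import date


def week_bounds(year, calendar_week):
    """Return the Monday and Sunday of the given ISO calendar week."""
    monday = date.fromisocalendar(year, calendar_week, 1)
    sunday = date.fromisocalendar(year, calendar_week, 7)
    return monday, sunday


monday, sunday = week_bounds(2024, 40)
print(monday.strftime("%d.%m.%Y"))  # 30.09.2024
print(sunday.strftime("%d.%m.%Y"))  # 06.10.2024
```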
Let’s extend the script we’ve developed to accept command line arguments to set such a date range. Then we can automatically calculate (and set) the date range’s respective calendar week. Note that in the real form, it’s necessary to set up to 5 date ranges in a table to specify information spread over an entire month. We’ll only consider a single date range in the example here.
First, import the argparse module from the standard library:
import argparse
and then add --date-range-start and --date-range-end arguments:
parser = argparse.ArgumentParser()
parser.add_argument(
"--date-range-start",
type=str,
required=True,
help="Start of date range as a string in the format dd.mm.yyyy"
)
parser.add_argument(
"--date-range-end",
type=str,
required=True,
help="End of date range as a string in the format dd.mm.yyyy"
)
args = parser.parse_args()
Finally, extract the start and end date strings and combine them into a date range in the format expected by the PDF form:
# get start and end dates from the args
date_range_start = args.date_range_start
date_range_end = args.date_range_end
# format the date range string as expected in form
date_range = f"{date_range_start} - {date_range_end}"
I know that there’s much more I could do here in terms of error checking and input data validation. I’ll leave that as an exercise for the reader. 😉
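For anyone who’d like a head start on that exercise, here’s one way it might look. This is a sketch only and not what my script does: it uses a custom argparse type function (the hypothetical dmy_date()) built on datetime.strptime() from the standard library, and it then checks that the range is in the right order and stays within one calendar week. The argument list is hard-coded here so that the example runs on its own.

```python
import argparse
from datetime import datetime


def dmy_date(text):
    """argparse type function: accept only dates written as dd.mm.yyyy."""
    try:
        return datetime.strptime(text, "%d.%m.%Y").date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"{text!r} is not a valid date in the format dd.mm.yyyy"
        )


parser = argparse.ArgumentParser()
parser.add_argument("--date-range-start", type=dmy_date, required=True)
parser.add_argument("--date-range-end", type=dmy_date, required=True)

# parse a fixed argument list here instead of sys.argv, for illustration
args = parser.parse_args(
    ["--date-range-start=15.10.2024", "--date-range-end=18.10.2024"]
)

if args.date_range_end < args.date_range_start:
    parser.error("the date range ends before it starts")
# compare (ISO year, ISO week) of both dates
if args.date_range_start.isocalendar()[:2] != args.date_range_end.isocalendar()[:2]:
    parser.error("the date range spans more than one calendar week")

# format the date range string as expected in the form
date_range = f"{args.date_range_start:%d.%m.%Y} - {args.date_range_end:%d.%m.%Y}"
print(date_range)  # 15.10.2024 - 18.10.2024
```

With the type function in place, argparse itself reports badly formatted dates and exits with a usage message, so the rest of the script only ever sees real date objects.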
We use the date_range variable to set the Z1S1[0] text field (yes, it really is called that), which is the first date range field in the table one has to complete. To set this value, we extend the form_data dict that we used when filling out the text fields at the beginning of this article, like so:
# get today's date as a string
today_dmy = date.today().strftime("%d.%m.%Y")
# set the form fields
form_data = {
'Kundennummer[0]': 'ABCD1234',
'Name_Vorname[0]': 'Wurst, Hans',
'Ort-unten[0]': 'Dingenskirchen',
'Datum-unten[0]': f'{today_dmy}',
'Z1S1[0]': date_range,
}
Running the script as follows (which I’ve called date_range_filler.py to match the topic of this section):
$ python date_range_filler.py --date-range-start=15.10.2024 --date-range-end=18.10.2024
sets the first date range field as we’d expect:
To set the calendar week value in the form, we parse one of the dates into a datetime object. Then we use the isocalendar() method from the Python datetime library to work out which calendar week we have.
A convenient way to parse a date string into a datetime object is via the dateutil library. Since it’s a third-party library, we need to install it:
$ pip install python-dateutil
Now we can import the parse function from dateutil.parser:
from dateutil.parser import parse
and use this to parse the start date of the date range:
# use dayfirst=True to stop interpretation as weird American dates
start_date = parse(date_range_start, dayfirst=True)
Note that we’re careful to use the dayfirst=True option here to avoid the parser assuming that the date is in MM.DD.YYYY format. In Germany, one uses the “little-endian” date format, i.e. day, month, year, which is the most popular format worldwide.
Now we can work out what the calendar week is by using the isocalendar() method on our date object:
# calculate the calendar week associated with the date range
calendar_week = start_date.isocalendar().week
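To see what isocalendar() gives us, here's a small standalone sketch using only the standard library. The iso_week helper is a hypothetical name for illustration. It uses tuple indexing rather than the .week attribute, because the named attributes only exist from Python 3.9 onwards.

```python
from datetime import date


def iso_week(day):
    """Return the ISO 8601 calendar week number of a date."""
    # isocalendar() returns (ISO year, ISO week, ISO weekday);
    # index 1 is the week number and also works before Python 3.9
    return day.isocalendar()[1]


# the start date used in this article falls in calendar week 42
print(iso_week(date(2024, 10, 15)))  # 42

# dates around New Year can belong to a week of the neighbouring year
print(date(2024, 12, 30).isocalendar()[0], iso_week(date(2024, 12, 30)))  # 2025 1
```

The second example is worth keeping in mind for date ranges at the turn of the year: the calendar week can belong to a different year than the date itself.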
To specify this information in the PDF form, we set it as the value of the Z1S2[0] field,[14] being careful to convert it to a string since text fields can’t accept numerical values (updating the form fields will fail otherwise). In other words, the form_data dict now looks like this:
# set the form fields
form_data = {
    'Kundennummer[0]': 'ABCD1234',
    'Name_Vorname[0]': 'Wurst, Hans',
    'Ort-unten[0]': 'Dingenskirchen',
    'Datum-unten[0]': f'{today_dmy}',
    'Z1S1[0]': date_range,
    'Z1S2[0]': f'{calendar_week}',
}
Running this code like so:
$ python date_range_filler.py --date-range-start=15.10.2024 --date-range-end=18.10.2024
will generate this output in the PDF form:
which is the output we expect, given the input date data.
Note that if you set the Z1S2[0] field to a numeric value (instead of converting it to a string), you will see this error when updating the form field values:
Traceback (most recent call last):
  File "/home/cochrane/Projekte/PrivatProjekte/form_filler/date_range_filler.py", line 105, in <module>
    writer.update_page_form_field_values(
  File "/home/cochrane/Projekte/PrivatProjekte/form_filler/venv/lib/python3.9/site-packages/pypdf/_writer.py", line 1024, in update_page_form_field_values
    writer_parent_annot[NameObject(FA.V)] = TextStringObject(value)
  File "/home/cochrane/Projekte/PrivatProjekte/form_filler/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 522, in __new__
    if value.startswith(("\xfe\xff", "\xff\xfe")):
AttributeError: 'int' object has no attribute 'startswith'
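One way to guard against this error is to convert every value to a string in one place before handing the dict to pypdf. The following is only a sketch: the stringify_form_data helper is a hypothetical name and not something pypdf provides.

```python
def stringify_form_data(form_data):
    """Return a copy of form_data in which every value is a string."""
    # hypothetical helper: text fields only accept strings, so
    # convert ints, floats, etc. up front instead of at each use
    return {name: str(value) for name, value in form_data.items()}


form_data = stringify_form_data({
    'Z1S1[0]': '15.10.2024 - 18.10.2024',
    'Z1S2[0]': 42,  # an int that would otherwise trigger the error
})
print(form_data['Z1S2[0]'])  # 42, now as a string
```

With this in place, forgetting an f-string around a numeric value no longer breaks the script.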
The entire code to set text fields, the two kinds of checkboxes, and one date range (with its associated calendar week) looks like this:
from datetime import date
import re
import argparse
from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject
from dateutil.parser import parse
parser = argparse.ArgumentParser()
parser.add_argument(
    "--date-range-start",
    type=str,
    required=True,
    help="Start of date range as a string in the format dd.mm.yyyy"
)
parser.add_argument(
    "--date-range-end",
    type=str,
    required=True,
    help="End of date range as a string in the format dd.mm.yyyy"
)
args = parser.parse_args()
# read the PDF document
reader = PdfReader("form.pdf")
# extract all its fields
fields = reader.get_fields()
# pass document read from reader object to writer object
writer = PdfWriter()
writer.append(reader)
# find checkboxes on first page
checkboxes = {
    annot['/T']: annot for annot in writer.pages[0]['/Annots']
    if annot['/FT'] == '/Btn'
}
# get the key for checkbox "control box 1"
control_box_key = next(
    (
        key for key in fields.keys()
        if re.search('Kontrollkaestchen1', key)
    ), None
)
# get the possible states for this checkbox
control_box_states = fields[control_box_key]['/_States_']
# get checkbox "control box 1"
control_box = checkboxes['Kontrollkaestchen1[0]'].get_object()
# check it
control_box[NameObject('/AS')] = NameObject('/1')
# get the keys for the "yes-no-2" question
yes_no_2_keys = [
    key for key in fields.keys()
    if re.search('Ja-Nein-2', key)
]
# find all states for the "yes-no-2" question
yes_no_2_states = [fields[key]['/_States_'] for key in yes_no_2_keys]
# get the checkboxes for "yes-no-2"
yes_no_2_yes = checkboxes['Ja-Nein-2[0]'].get_object()
yes_no_2_no = checkboxes['Ja-Nein-2[1]'].get_object()
# check the box to answer "yes"
yes_no_2_yes[NameObject('/AS')] = NameObject('/1')
# alternatively, check the box to answer "no" (don't tick both!)
# yes_no_2_no[NameObject('/AS')] = NameObject('/2')
# get start and end dates from the args
date_range_start = args.date_range_start
date_range_end = args.date_range_end
# format the date range string as expected in form
date_range = f"{date_range_start} - {date_range_end}"
# use dayfirst=True to stop interpretation as weird American dates
start_date = parse(date_range_start, dayfirst=True)
# calculate the calendar week associated with the date range
calendar_week = start_date.isocalendar().week
# get today's date as a string
today_dmy = date.today().strftime("%d.%m.%Y")
# set the form fields
form_data = {
    'Kundennummer[0]': 'ABCD1234',
    'Name_Vorname[0]': 'Wurst, Hans',
    'Ort-unten[0]': 'Dingenskirchen',
    'Datum-unten[0]': f'{today_dmy}',
    'Z1S1[0]': date_range,
    'Z1S2[0]': f'{calendar_week}',
}
# update the form field values on all pages
for page in writer.pages:
    writer.update_page_form_field_values(
        page,
        form_data,
        auto_regenerate=False,
    )
# write the new PDF to file
date_ym = date.today().strftime("%Y-%m")
output_filename = f"filled-in-form-checkboxes-{date_ym}.pdf"
with open(output_filename, "wb") as fp:
    writer.write(fp)
Putting it all together
Now that we’ve got all the pieces in place, it’s “just a simple matter of programming” to extend everything to handle the entire form, enter all the data, generate the final document, and upload it. Easy peasy, lemon squeezy![15]
In all honesty, extending the concepts presented here to a complete form is a fair bit of work. Even so, I hope that this article has given you an idea of how you can programmatically set various elements of a PDF form via the pypdf module. Even if it doesn’t necessarily save time, hopefully, this knowledge saves someone from having to enter repetitive data into forms when filling them in regularly.
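To give a flavour of what such an extension might look like, here's a rough sketch that builds the form_data entries for several rows of the date range table. It rests on a guess: that row N of the table uses the field names ZNS1[0] and ZNS2[0], following the pattern of Z1S1[0] and Z1S2[0]. Check the real field names in your form before relying on this. The date_range_rows helper is a hypothetical name, and the sketch uses strptime from the standard library in place of dateutil to stay self-contained.

```python
from datetime import datetime


def date_range_rows(date_ranges):
    """Build form_data entries for a list of (start, end) strings."""
    form_data = {}
    for row, (start_text, end_text) in enumerate(date_ranges, start=1):
        # the explicit format string means day-first by construction
        start = datetime.strptime(start_text, "%d.%m.%Y")
        # ASSUMPTION: field names follow the pattern Z<row>S<column>[0]
        form_data[f"Z{row}S1[0]"] = f"{start_text} - {end_text}"
        # text fields need strings, so convert the week number
        form_data[f"Z{row}S2[0]"] = str(start.isocalendar()[1])
    return form_data


rows = date_range_rows([
    ("15.10.2024", "18.10.2024"),
    ("21.10.2024", "25.10.2024"),
])
for name, value in rows.items():
    print(name, "=", value)
```

The resulting dict could then be merged into the existing form_data dict, e.g. via form_data.update(rows), before updating the form field values.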
Addendum: Upload problems with generated PDF
There was one weird thing I noticed when using the pypdf output in real life. PDF readers could read the generated PDF without a problem, but the government agency website wouldn’t accept it. There was some kind of processing done on the data in the file, and that didn’t work for some reason. Unfortunately, it’s not clear why as there weren’t any error messages: it simply “didn’t work”.
My fix for the problem was to print the document to file as PDF from within the Evince document reader. The "printed" PDF contained (as far as I could tell) all the same information, only now the agency's upload and processing software was able to extract the information it needed. The file size was also reduced. It's not clear what the issue was, though, and it'd be interesting to find out.
1. The get_fields() output is a very large and detailed dictionary structure, hence I’m not going to show its output here. If you’re interested yourself, fire up the Python debugger, pdb, and have a look at the contents of the fields dict.
2. See section 12.5, page 381ff in the PDF reference.
3. See section 12.7.3.1, page 432 as well as section 12.7.4, page 439 of the PDF specification.
4. See section 12.7.4.2, page 439 as well as section 12.7.4.2.3, page 440 of the PDF specification.
5. See section 12.7.3.1, page 432 and more specifically section 12.7.3.2, page 434 of the PDF specification.
6. See e.g. section 12.5.2, page 383 and section 12.7.4.2.3, page 440 of the PDF specification document.
7. The /_States_ key seems to be something internal to pypdf as it isn’t mentioned in the PDF specification.
8. Note that other forms will likely behave differently.
9. At least I thought that was the case. The next yes/no question is labelled Ja-Nein-3-1, seemingly because it appears in section 3, subsection 1. The next yes/no question is also in this subsection and is called Ja-Nein-3-1a, so that sort of makes sense (I’d have thought that using a and b would make more sense, but ya get that). The following yes/no question is in section 3, subsection 2, so it’s naturally called … Ja-Nein-4. Ok, so my hope of there being some consistent logical pattern here was unfounded. Oh well.
10. Obviously, don’t tick both checkboxes when submitting the form to a government agency.
11. As far as this particular form is concerned; the behaviour is likely to be different in other PDF forms. YMMV.
12. Again, don’t tick both checkboxes when submitting the form to a government agency.
13. Because it’s the check mark used when clicking on the checkbox with a mouse in a PDF reader.
14. I don’t know why these field names are so complex. I think the Z refers to the fact that we’re setting a date range (“date range” translates to “Zeitraum” in German). The number after the Z seems to refer to the row in the table of date ranges. The S probably indicates that a column of the table is to be filled in (“column” translates to “Spalte” in German). The number after the S seems to refer to the column. Even so, I’m still guessing.
15. Yes, I’m being ironic.