acroforge: turn a flat PDF into a real fillable AcroForm, with a deterministic core and zero copyleft

#python #opensource #pdf #showdev

You generated a "fillable" PDF. It looks perfect in Chrome. Then a user opens it in Firefox and the checkboxes are blank, the text fields show nothing until clicked, and one field renders its value an inch to the left. You did not change anything. Two PDF viewers just disagreed about what your form means.

That is the first trap with PDF forms: a field is not "filled" until it renders the same way across viewers. The PDF spec lets a viewer either trust the appearance stream you baked in or regenerate its own, and the two paths drift apart constantly. If you do not control the appearance stream, you are at the mercy of whichever engine opens the file.

The second trap is licensing. Reach for the well-known Python PDF tools and you hit a wall fast: the strongest ones are AGPL (viral copyleft you cannot ship inside a closed product without consequences), or they are paid, or they are a cloud API you have to send documents to. For a lot of teams, "send our forms to someone else's server" is a non-starter, and "relicense our whole app" is worse.

I wanted a small library that does one job: take a flat, non-interactive PDF and turn it into a real, fillable AcroForm, deterministically, locally, with a permissive license. That is acroforge.

pip install acroforge

Apache-2.0, Python 3.11+, no network calls, no AI, no cloud.

The 5-line core

Three functions. They all take bytes and return bytes, so they compose into any pipeline.

import acroforge as af
from acroforge import FieldSpec, FieldType

fields = [FieldSpec(type=FieldType.TEXT, page=0, rect=(200, 700, 450, 730), name="full_name")]
fillable = af.build(flat_pdf_bytes, fields)          # inject real AcroForm fields
filled   = af.fill(fillable, {"full_name": "Jane Doe"})  # set values by name
final    = af.flatten(filled)                         # bake appearances, lock the form

build injects standards-compliant interactive fields at the coordinates you specify.
fill sets field values by name.
flatten bakes the field appearances into the page content, removes interactivity, and locks the result. The flattened PDF is the one you hand to anyone, because there is no interactive layer left to render differently.
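
Because every stage is bytes in, bytes out, chaining them is plain function composition. Here is a minimal sketch of that idea; `pipeline` and the `fake_*` stages are hypothetical stand-ins written for this post, not part of acroforge, so the block runs without a PDF on hand:

```python
from functools import reduce
from typing import Callable

Stage = Callable[[bytes], bytes]


def pipeline(*stages: Stage) -> Stage:
    """Chain bytes -> bytes stages, left to right, into a single callable."""
    def run(data: bytes) -> bytes:
        return reduce(lambda acc, stage: stage(acc), stages, data)
    return run


# Stand-in stages so the sketch runs without a PDF or acroforge installed.
# In real use they would wrap the calls shown above, for example
# `lambda pdf: af.build(pdf, fields)` and `lambda pdf: af.fill(pdf, values)`.
def fake_build(pdf: bytes) -> bytes:
    return pdf + b"+fields"


def fake_fill(pdf: bytes) -> bytes:
    return pdf + b"+values"


def fake_flatten(pdf: bytes) -> bytes:
    return pdf + b"+flattened"


process = pipeline(fake_build, fake_fill, fake_flatten)
result = process(b"flat")
print(result)
```

The same `pipeline` helper would accept any other bytes-to-bytes step, such as a watermarking or compression pass, without the stages knowing about each other.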

Describing fields

A FieldSpec is a plain pydantic model:

class FieldSpec(BaseModel):
    type: FieldType
    page: int                                  # 0-indexed
    rect: tuple[float, float, float, float]    # (x0, y0, x1, y1) in PDF points
    name: str                                  # AcroForm field name
    options: list[str] | None = None           # radio group members
    maxlen: int | None = None                  # TEXT cap / COMB cell count
    export_value: str | None = None            # checkbox/radio on-value
    confidence: float = 1.0                    # 1.0 = you authored it
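
If your coordinates come from a tool that measures from the top-left corner in pixels (most image viewers and annotation tools do), they need flipping and scaling before they go into rect. This is a minimal sketch, assuming rect uses the native PDF coordinate system (origin at the bottom-left, 72 points per inch). The examples in this post are consistent with that, but verify it against your own file. `top_left_px_to_rect` and its parameters are hypothetical, not part of acroforge:

```python
def top_left_px_to_rect(
    x: float,
    y: float,
    width: float,
    height: float,
    page_height_pt: float,
    dpi: float = 72.0,
) -> tuple[float, float, float, float]:
    """Convert a top-left-origin pixel box to (x0, y0, x1, y1) in PDF points.

    Assumes the target rect has its origin at the bottom-left of the page.
    """
    scale = 72.0 / dpi                              # pixels -> points
    x0 = x * scale
    x1 = (x + width) * scale
    y1 = page_height_pt - y * scale                 # top edge of the box
    y0 = page_height_pt - (y + height) * scale      # bottom edge of the box
    return (x0, y0, x1, y1)


# US Letter is 612 x 792 points. A box 62 px below the top edge, measured on a
# 72 dpi render, lands on the same rect used for full_name in this post.
rect = top_left_px_to_rect(x=200, y=62, width=250, height=30, page_height_pt=792)
print(rect)
```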

The field types cover the forms people actually fill:

TEXT: single-line text, with an optional maxlen to cap length.
COMB: the segmented-box style, where maxlen is the number of cells. An SSN field is maxlen=9.
CHECKBOX: takes an export_value for the on-state (default "Yes").
RADIO: one FieldSpec per button, sharing a name, each with its own export_value.
SIGNATURE: a placeholder signature widget.

Here is text plus a checkbox plus a radio group:

fields = [
    FieldSpec(type=FieldType.TEXT, page=0, rect=(200, 700, 450, 730), name="full_name"),
    FieldSpec(type=FieldType.COMB, page=0, rect=(200, 660, 360, 690), name="ssn", maxlen=9),
    FieldSpec(type=FieldType.CHECKBOX, page=0, rect=(200, 620, 220, 640), name="agree", export_value="Yes"),
    FieldSpec(type=FieldType.RADIO, page=0, rect=(200, 580, 220, 600), name="plan", export_value="basic"),
    FieldSpec(type=FieldType.RADIO, page=0, rect=(260, 580, 280, 600), name="plan", export_value="pro"),
]
fillable = af.build(flat_pdf_bytes, fields)
values = {"full_name": "Jane Doe", "ssn": "123456789", "agree": True, "plan": "pro"}
final = af.flatten(af.fill(fillable, values))
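
Since radio buttons share a name and comb fields have a fixed cell count, it can be worth checking the values dict against the specs before filling. The sketch below is one way to do that and makes no claim about what acroforge itself validates. `Spec` is a hypothetical stand-in carrying only the FieldSpec attributes used here, and the rules (value length capped by maxlen, radio value must match one button's export_value) are my assumptions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spec:
    """Stand-in for the FieldSpec attributes used here (not the real class)."""
    type: str                           # "TEXT", "COMB", "CHECKBOX", "RADIO"
    name: str
    maxlen: Optional[int] = None
    export_value: Optional[str] = None


def preflight(specs: list[Spec], values: dict[str, object]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    by_name: dict[str, list[Spec]] = {}
    for spec in specs:
        by_name.setdefault(spec.name, []).append(spec)

    problems: list[str] = []
    for name, value in values.items():
        group = by_name.get(name)
        if group is None:
            problems.append(f"{name}: no such field")
            continue
        first = group[0]
        if first.type in ("TEXT", "COMB") and first.maxlen is not None:
            if len(str(value)) > first.maxlen:
                problems.append(
                    f"{name}: {len(str(value))} chars exceeds maxlen={first.maxlen}"
                )
        elif first.type == "RADIO":
            # A radio group is every spec sharing this name.
            allowed = {s.export_value for s in group if s.export_value is not None}
            if value not in allowed:
                problems.append(f"{name}: {value!r} is not one of {sorted(allowed)}")
    return problems


specs = [
    Spec("TEXT", "full_name"),
    Spec("COMB", "ssn", maxlen=9),
    Spec("RADIO", "plan", export_value="basic"),
    Spec("RADIO", "plan", export_value="pro"),
]
good = preflight(specs, {"full_name": "Jane Doe", "ssn": "123456789", "plan": "pro"})
bad = preflight(specs, {"ssn": "1234567890", "plan": "gold", "nickname": "JD"})
print(good, bad, sep="\n")
```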

The honest two-layer design

acroforge is deliberately split into two layers, and the split is the whole point.

Layer one is the deterministic engine: build, fill, flatten. You tell it exactly where fields go, and it puts them there, fills them, and flattens them reliably. This works on any PDF, vector or scanned, because it does not need to understand the document. It only needs coordinates. This is the part you should depend on.

Layer two is a best-effort detector, and it is labeled best-effort everywhere on purpose. af.detect(pdf) reads the PDF's vector geometry and nearby text labels and guesses where fields belong, returning a FormManifest where every field carries confidence < 1.0 to flag it as a guess. af.make_fillable(pdf) runs detect then build in one step.

manifest = af.detect(pdf)            # guesses, each confidence < 1.0
for f in manifest.fields:
    print(f.type, f.name, f.rect, f.confidence)

fillable = af.make_fillable(pdf)     # detect() then build(), one call

The detector handles underline-style forms (write-on rules become text fields), bordered table and grid forms (cells become text fields, label-aware), vector checkbox squares, and font-glyph checkboxes like the box and check characters. It is vector-only: scanned or image-only PDFs are refused with ScannedPDFError, because there is no OCR and I would rather refuse than pretend.

I make no accuracy promises about detection. It will miss fields and invent spurious ones, and quality varies wildly by document. The intended workflow is: run detect, review the draft manifest, correct it, and hand the corrected specs to the engine. The fuzzy layer bootstraps you; the deterministic layer is what ships. Keeping that boundary sharp is what stops the guessing from contaminating the guaranteed path.
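
The review step can be as simple as sorting the draft by confidence and promoting each field once a human has looked at it. This is a minimal sketch of that workflow. `Draft` is a hypothetical stand-in for a detected field, and the 0.8 threshold is an arbitrary value chosen for illustration, not something acroforge defines:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Draft:
    """Stand-in for a detected field (not the real FieldSpec)."""
    name: str
    rect: tuple[float, float, float, float]
    confidence: float


def triage(fields: list[Draft], threshold: float = 0.8) -> tuple[list[Draft], list[Draft]]:
    """Split a draft into (likely fine, look at these first)."""
    likely = [f for f in fields if f.confidence >= threshold]
    doubtful = [f for f in fields if f.confidence < threshold]
    return likely, doubtful


def approve(field: Draft, **corrections) -> Draft:
    """Apply a reviewer's corrections and mark the field as authored."""
    return replace(field, confidence=1.0, **corrections)


drafts = [
    Draft("full_name", (200, 700, 450, 730), 0.92),
    Draft("field_7", (40, 100, 300, 112), 0.41),
]
likely, doubtful = triage(drafts)

# Everything still passes through a reviewer; triage only orders the work.
reviewed = [approve(f) for f in likely]
reviewed += [approve(doubtful[0], name="date_signed")]
print(reviewed)
```

With a pydantic v2 model, `model_copy(update={...})` would play the role that `dataclasses.replace` plays here.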

Cross-viewer correctness is the contract

Back to the first trap. The way acroforge avoids the Chrome-vs-Firefox split is by treating cross-viewer rendering as the actual test contract, not an afterthought.

Every field type has golden-image render tests in two engines: PDFium, which is what Chrome uses, and PDF.js, which is what Firefox uses. A change that makes a field render differently in either engine fails CI. Adobe Reader is a manual spot-check on top of that. The rule I hold the library to is simple: a field does not "work" until it renders correctly across viewers, so I test before claiming it.
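
For readers who have not written a golden-image test, the comparison step usually boils down to "how many pixels moved, and by how much". The sketch below shows that idea on raw 8-bit grayscale buffers. It is not acroforge's test code: the function names and both tolerances are hypothetical, and rendering the page to pixels is left out entirely:

```python
def differing_fraction(golden: bytes, actual: bytes, per_pixel_tolerance: int = 8) -> float:
    """Fraction of pixels whose gray level differs by more than the tolerance."""
    if len(golden) != len(actual):
        raise ValueError("renders differ in size")
    if not golden:
        return 0.0
    bad = sum(1 for g, a in zip(golden, actual) if abs(g - a) > per_pixel_tolerance)
    return bad / len(golden)


def matches_golden(golden: bytes, actual: bytes, max_fraction: float = 0.001) -> bool:
    """True when the render is within tolerance of the stored golden image."""
    return differing_fraction(golden, actual) <= max_fraction


# A 100 x 100 all-white "golden" render.
golden = bytes([255] * 10_000)

# Anti-aliasing noise: one pixel slightly darker, still within tolerance.
noisy = bytearray(golden)
noisy[0] = 250

# A real regression: a 100-pixel run went black (say, a missing checkmark).
broken = bytearray(golden)
broken[500:600] = bytes(100)

print(matches_golden(golden, bytes(noisy)), matches_golden(golden, bytes(broken)))
```

The small per-pixel tolerance absorbs font-hinting and anti-aliasing differences between runs, while the fraction cap still catches a field that shifted or vanished.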

It has been exercised on real documents, including IRS forms W-9 and 1040 and a 43-page credentialing packet, which is where comb fields, duplicate field names, and multi-page layouts stop being theoretical.

The zero-copyleft story

The runtime dependency tree is strictly permissive, by design and by enforcement:

Package	License	Role
reportlab	BSD	field widget rendering
pypdf	BSD-3-Clause	PDF read / merge / flatten
pdfplumber	MIT	geometry utilities
PyPDFForm	MIT	fill helpers
pydantic	MIT	model validation

No GPL, AGPL, LGPL, or SSPL anywhere in the runtime tree, not even MPL. This is not a promise I am asking you to take on faith. CI enforces it on every push with pip-licenses --fail-on='GPL;AGPL;LGPL;SSPL', so a copyleft dependency cannot sneak in through a transitive bump without turning the build red. You can drop acroforge into a commercial product without a licensing conversation.
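
If you want a similar gate in your own project without adding a tool, the standard library can get you a rough version. This is a cruder check than pip-licenses and only a sketch: `is_copyleft` and `flagged_distributions` are hypothetical names, the pattern is my own guess at reasonable markers, and license metadata is free-form enough that both false positives and misses are possible:

```python
import re
from importlib import metadata

# Word boundaries matter: a naive "MPL" substring test would flag
# "Simplified BSD".
_COPYLEFT = re.compile(
    r"\b[AL]?GPL|\bSSPL|\bMPL"
    r"|GNU (?:Affero |Lesser |Library )?General Public"
    r"|Server Side Public|Mozilla Public",
    re.IGNORECASE,
)


def is_copyleft(license_text: str) -> bool:
    """True if the text names a GPL-family, SSPL, or MPL license."""
    return bool(_COPYLEFT.search(license_text))


def flagged_distributions() -> list[tuple[str, str]]:
    """Scan installed distributions and return (name, matching license text)."""
    flagged = []
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:  # distribution with missing metadata
            continue
        candidates = [meta.get("License-Expression") or "", meta.get("License") or ""]
        candidates += [
            c for c in (meta.get_all("Classifier") or []) if c.startswith("License ::")
        ]
        for text in candidates:
            if is_copyleft(text):
                flagged.append((meta.get("Name", "?"), text.splitlines()[0]))
                break
    return sorted(flagged)


if __name__ == "__main__":
    for name, license_text in flagged_distributions():
        print(f"{name}: {license_text}")
```

Run it inside the virtual environment you ship from, since it inspects whatever is installed there, including dev-only tools.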

read_fields: the inverse of build

read_fields(pdf) reads the AcroForm fields already in a fillable PDF back into FieldSpecs (real registered fields, so confidence = 1.0). For the field types build can create, it is the inverse of build, which means the two round-trip:

from pathlib import Path

specs = af.read_fields(Path("fillable.pdf").read_bytes())   # -> list[FieldSpec]

# copy one form's field layout onto another PDF
copied = af.build(other_pdf, af.read_fields(template_pdf))

You get one FieldSpec per widget with coordinates, type, name, and checkbox/radio on-states recovered. Dropdowns come back as text; pushbuttons are skipped. It is handy for inspecting an existing form, diffing layouts, or lifting a known-good field arrangement onto a new document.
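
Diffing two layouts mostly means matching widgets up and comparing rectangles. The sketch below shows one way to do it. `Widget` is a hypothetical stand-in with the same attribute names as FieldSpec, and keying on (name, export_value) is my own choice so that radio buttons sharing a name stay distinct:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Widget:
    """Stand-in for the FieldSpec attributes used here (not the real class)."""
    name: str
    page: int
    rect: tuple[float, float, float, float]
    export_value: Optional[str] = None


def _key(w: Widget) -> tuple[str, str]:
    # Radio buttons share a name, so the on-value is part of the identity.
    # Note: other fields that reuse a name would collapse into one entry.
    return (w.name, w.export_value or "")


def diff_layouts(old: list[Widget], new: list[Widget], tolerance: float = 0.5):
    """Return (added, removed, moved) keys; tolerance is in PDF points."""
    old_by_key = {_key(w): w for w in old}
    new_by_key = {_key(w): w for w in new}
    added = sorted(new_by_key.keys() - old_by_key.keys())
    removed = sorted(old_by_key.keys() - new_by_key.keys())
    moved = sorted(
        k
        for k in old_by_key.keys() & new_by_key.keys()
        if old_by_key[k].page != new_by_key[k].page
        or any(
            abs(a - b) > tolerance
            for a, b in zip(old_by_key[k].rect, new_by_key[k].rect)
        )
    )
    return added, removed, moved


old = [
    Widget("full_name", 0, (200, 700, 450, 730)),
    Widget("plan", 0, (200, 580, 220, 600), "basic"),
]
new = [
    Widget("full_name", 0, (200, 702, 450, 732)),      # nudged up 2 points
    Widget("plan", 0, (200, 580, 220, 600), "basic"),
    Widget("plan", 0, (260, 580, 280, 600), "pro"),    # new radio button
]
added, removed, moved = diff_layouts(old, new)
print(added, removed, moved)
```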

Try it and tell me where it breaks

acroforge 0.2.0 is live on PyPI and the source is on GitHub. It is small, focused, and meant to stay that way.

Repo: https://github.com/san64777/acroforge
PyPI: https://pypi.org/project/acroforge/

If you run it against a form that renders wrong in some viewer, or a layout the detector mangles, that is exactly the report I want. Open an issue with the PDF (or a reproducer) and the viewer. And if it saves you from an AGPL dependency or a cloud round-trip, a star helps other people find it.