PDF Generation, Extraction and Modification with PDF::Make

I’ve always been fascinated by PDFs. On the surface they look simple, just a document you can open anywhere, but underneath they’re a full layout engine, object graph, drawing model, and archival format all at once. I enjoy that mix of precision and complexity, and it is exactly what led me to build PDF::Make. I wanted a fully featured toolkit that could both generate PDFs and let me inspect and edit them programmatically.

At the low level, PDF::Make exposes the raw building blocks of the format: PDF objects, pages, the drawing canvas, a parser/reader, and import/merge primitives. This is the layer you reach for when you need fine-grained control or want to work with the structure of a document directly.

For everyday document creation, PDF::Make::Builder sits on top of that foundation and provides a higher-level API. It handles the boilerplate of page setup, fonts, text flow, and layout so you can produce a polished PDF in just a few lines of Perl.

The same toolkit is also designed for post-processing. You can open an existing PDF, extract structured text along with its coordinates, and then draw annotations or overlays back onto the page, making it straightforward to build review, QA, or markup workflows on top of documents you didn’t originally generate.

This post shows a practical two-step flow:

1. Create a PDF
2. Re-open it, extract text coordinates, and draw border highlights around matched words

1) Create a PDF with `PDF::Make::Builder`

Script:

#!/usr/bin/perl
use strict;
use warnings;

use PDF::Make::Builder;

my $pdf = PDF::Make::Builder->new(
    file_name => 'source_demo.pdf',
    configure => {
        text => {
            font => { family => 'Helvetica', size => 12, colour => '#222222' },
        },
    },
);

$pdf->add_page(page_size => 'Letter')
    ->add_h1(text => 'PDF::Make blog demo')
    ->add_text(text => 'PDF::Make builds and edits PDF files directly from Perl.')
    ->add_text(text => 'In the next step we extract text coordinates and highlight matches.')
    ->add_text(text => 'Target terms: PDF::Make, extract_structured, highlight.')
    ->add_text(text => 'This line repeats PDF::Make so multiple boxes are drawn around matches.')
    ->save;

print "source_demo.pdf\n";

That gives us a baseline document to post-process.

2) Extract text coordinates, then overlay border highlights

Now we:

1. open the source PDF as an editable builder document,
2. run `extract_structured` page by page,
3. find words that match a regex,
4. draw a stroked rectangle on each match.

#!/usr/bin/perl
use strict;
use warnings;
use PDF::Make::Builder;

my $in  = $ARGV[0] // 'source_demo.pdf';
my $out = $ARGV[1] // 'source_demo_highlighted.pdf';
my $re  = $ARGV[2] // 'PDF::Make';

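# Open the source PDF as an editable builder document that saves to $out.
# The variable is named $doc because $b is reserved for sort blocks in Perl.
my $doc = PDF::Make::Builder->open_existing($in, file_name => $out);
my $pad = 1.5;    # extra space around each word box
my $page_count = $doc->page_count;
for my $idx (0 .. $page_count - 1) {
    # Extract structured text for this page (called with the 0-based index).
    my $res = $doc->extract_structured($in, page => $idx, invisible => 1);
    my $blocks = $res->data || [];

    # Open the same page for drawing (open_page is given $idx + 1).
    $doc->open_page($idx + 1);
    my $canvas = $doc->page->canvas;

    # The extracted data nests as blocks -> lines -> words.
    for my $block (@$blocks) {
        my $lines = $block->{lines} || [];
        for my $line (@$lines) {
            my $words = $line->{words} || [];
            for my $w (@$words) {
                my $text = $w->{text} // '';
                next unless $text =~ /$re/i;

                # Grow the word's bounding box by $pad on every side.
                my ($x0, $y0, $x1, $y1) = @{$w}{qw/x0 y0 x1 y1/};
                my $rx  = $x0 - $pad;
                my $ry  = $y0 - $pad;
                my $rw  = ($x1 - $x0) + (2 * $pad);
                my $rh  = ($y1 - $y0) + (2 * $pad);
                # border highlight (red stroke, no fill)
                $canvas->q->w(0.8)->RG(1, 0, 0)->re($rx, $ry, $rw, $rh)->S->Q;
            }
        }
    }
}

$doc->save;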
print "Created $out\n";
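
The geometry in the inner loop can also be pulled out into a small pure-Perl function, which makes it easy to check without opening a PDF. The following is a minimal sketch and not part of PDF::Make: `match_rects` is a hypothetical helper name, the sample word data is invented, and the only thing assumed about the extraction result is the blocks, lines, words nesting with `text`, `x0`, `y0`, `x1`, `y1` keys that the script above reads. It also wraps the search term in `\Q...\E` so that it is matched literally rather than as a regex, which is a choice made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Make): walk blocks -> lines -> words
# and return one padded [x, y, width, height] rectangle per matching word.
sub match_rects {
    my ($blocks, $pattern, $pad) = @_;
    my @rects;
    for my $block (@{ $blocks || [] }) {
        for my $line (@{ $block->{lines} || [] }) {
            for my $word (@{ $line->{words} || [] }) {
                my $text = $word->{text} // '';
                next unless $text =~ $pattern;

                my ($x0, $y0, $x1, $y1) = @{$word}{qw/x0 y0 x1 y1/};
                # Grow the word's bounding box by $pad on every side.
                push @rects, [
                    $x0 - $pad,
                    $y0 - $pad,
                    ($x1 - $x0) + 2 * $pad,
                    ($y1 - $y0) + 2 * $pad,
                ];
            }
        }
    }
    return \@rects;
}

# Invented sample data in the shape the script reads; coordinates are made up.
my $blocks = [
    {
        lines => [
            {
                words => [
                    { text => 'PDF::Make', x0 => 72,  y0 => 700, x1 => 130, y1 => 712 },
                    { text => 'builds',    x0 => 134, y0 => 700, x1 => 168, y1 => 712 },
                ],
            },
        ],
    },
];

# \Q...\E quotes regex metacharacters so the term is matched literally.
my $term  = 'PDF::Make';
my $rects = match_rects($blocks, qr/\Q$term\E/i, 1.5);

# Print each rectangle as "x y width height".
printf "%s %s %s %s\n", @$_ for @$rects;
```

Each returned rectangle could then be handed to the same canvas chain used in the script, for example `->re(@$rect)`, in place of the four separate variables.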

I recommend reading the full documentation on CPAN to get the most out of the toolkit. The PDF::Make and PDF::Make::Builder pages cover the complete API, configuration options, and additional examples that go well beyond what this post touches on. Whether you’re generating documents from scratch, post-processing existing ones, or building review workflows on top of them, the CPAN docs and examples are the best place to explore what’s possible next. If you have additional questions, please do not hesitate to ask below.