Chinmay Sawant

Posted on Jun 8

What Building a Go PDF Engine Teaches You About Real Engineering

#ai #programming #go #opensource

Building a PDF engine from scratch in Go forces you to confront memory management, binary serialization, concurrency safety, interface design, and performance profiling all at once, in a domain where correctness is non-negotiable.

This article covers lessons learned building GoPdfSuit, a production PDF engine written in Go that generates 1.5 million financial PDFs in roughly 15 minutes on a single node, achieves PDF/A-4 and PDF/UA-2 compliance, and exposes itself as a REST API, a Go library, and Python CGO bindings simultaneously.

Note: While I have six years of overall experience including two years working with Go, I rarely encountered these types of systems-level challenges in day-to-day feature work. This project was an intentional deep dive into performance optimization. I used AI tools (Copilot, Cursor, OpenCode) to assist with development.

1. Optimization

Micro-Optimizations That Actually Matter

The kinds of optimizations rarely needed in CRUD work became essential here: direct byte slicing instead of fmt.Sprintf, bit-shift approximations for division, and pre-sized scratch buffers.

Techniques used across the codebase:

appendTextForPDF: Zero-alloc text encoding directly into []byte, eliminating string intermediates on every Tj PDF operator (internal/pdf/utils.go, used across draw.go).
Byte-scratch buffers: Hot paths use fixed-size [24]byte and [12]byte stack scratch buffers for numeric formatting. appendFmtNum avoids strconv.AppendFloat, documented as ~10% CPU savings.
RuneSet bitmap: Replaced map[rune]bool for character tracking with a dense 64 KiB bitmap (font/runeset.go), cutting map inserts on the font subsetting hot path.
Fast alpha blending: Replaced integer division (/ 255) per pixel component with ((r*a + white) * 0x8081) >> 23 (internal/pdf/image.go).
256-byte lookup tables for hex encoding instead of fmt.Sprintf.
Batched writes: Reduced ~25K separate Write calls for a 5K-cell table down to ~5K by batching PDF commands per cell (internal/pdf/draw.go).
Pre-grow all buffers: Page content streams pre-grow to 64 KiB, compress buffers to 64 KiB, assembly buffer to 64 KiB, avoiding incremental append growth during hot generation.
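
The alpha-blending trick in the list above can be checked in isolation. The sketch below is a minimal illustration, not the code from internal/pdf/image.go: the names div255 and blendOverWhite are hypothetical, and it assumes that "white" in the formula stands for 255*(255-a), the contribution of a white background.

```go
package main

import "fmt"

// div255 computes x/255 for 0 <= x <= 65535 with a multiply and a shift.
// 0x8081/2^23 is slightly above 1/255, and the accumulated error stays too
// small to change the floor for any 16-bit input.
func div255(x uint32) uint32 {
	return (x * 0x8081) >> 23
}

// blendOverWhite composites one 8-bit colour component c with alpha a onto a
// white background: (c*a + 255*(255-a)) / 255.
// Assumption: this is what "white" means in the article's formula.
func blendOverWhite(c, a uint8) uint8 {
	white := 255 * (255 - uint32(a))
	return uint8(div255(uint32(c)*uint32(a) + white))
}

func main() {
	// Compare against plain division over every value a blend can produce.
	for x := uint32(0); x <= 255*255; x++ {
		if div255(x) != x/255 {
			fmt.Println("mismatch at", x)
			return
		}
	}
	fmt.Println("div255 matches x/255 for all inputs up to 255*255")
	fmt.Println(blendOverWhite(0, 255), blendOverWhite(0, 0), blendOverWhite(200, 128))
}
```

Over this input range the shift form is exact rather than approximate, so it can replace the division without changing pixel values.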

The Four-Pass Performance Program

The optimization journey was structured across 4 passes with 41 total tasks:

Pass	Focus	Tasks	Key Outcomes
Pass 1	Low-hanging fruit	10	Buffer pooling, zero-alloc text encoding, batched writes, RuneSet bitmap, image cache singleflight
Pass 2	Architecture	12	Direct-write APIs, parallel decode/compress, incremental MD5, sparse CIDToGIDMap
Pass 3	Advanced memory	5	Allocation-free `WrapTextInto`, typed `[]StructKid` replacing `[]interface{}`, redact parser unification
Pass 4	Load-test hotspots	14	PDF/UA gating, final PDF slice pool, `StructElem` pool, template pool, parallel zlib, p99 fixes

Results:

2061 ops/s peak (1705 ops/s 10-run average) on financial workload (48 workers, PDF/A + tagged + signatures)
Serial 2000-row PDF/A: ~31–36 ms/op, ~163K allocs/op (~46% fewer than pre-optimization)
HTTP load test: ~5.7× throughput, ~18× faster p99, −88% heap in-use (442 MB → 55 MB)
memclr CPU under load: 49.7% → 27.0% (−46% relative)

PDF File Size Optimization

FlateDecode compression: All content streams, font streams, ICC profiles, and metadata are compressed using zlib. A central sync.Pool of zlib writers avoids repeated ~256 KB compression table allocation (font/compression.go).
Compression level: zlib.BestSpeed balances size vs. throughput. Buffers pre-grow to 64 KiB.
Font subsetting: A complete TrueType/OpenType subsetting engine (internal/pdf/font/subset.go, 876 lines) extracts only the glyphs actually used. It handles composite glyph dependencies recursively, remaps glyph IDs, rebuilds all required TTF tables (head, hhea, maxp, glyf, loca, hmtx, cmap, post, name), and generates sparse CIDToGIDMap entries.
Image deduplication: FNV-1a hash-based imageCache with sync.RWMutex and singleflight to deduplicate concurrent decodes of the same image.

Concurrency & Memory Pools

Seven active sync.Pool instances across the codebase:

Pool	Size	Purpose
`pdfBufferPool`	64 KB pre-grow	PDF assembly `*bytes.Buffer`
`finalPDFSlicePool`	256 KB cap	Scratch `[]byte` for final PDF assembly
`scratchBufPool`	128 B	Small `strconv` scratch buffers
`rgbDataPool`	1 MB	RGB image conversion buffers
`structElemPool`		PDF/UA structure tree `*StructElem` nodes
`templatePDFPool`		HTTP handler `*models.PDFTemplate` instances
`ZlibWriterPool` / `CompressBufPool`	64 KB	Zlib compression writers and buffers
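
As a rough illustration of how a pool like pdfBufferPool could be structured, the sketch below pre-grows each buffer to 64 KB. The names bufPool, getBuf, putBuf, and render are hypothetical, and the 1 MB retention cap is an assumption added here to keep one unusually large document from pinning memory in the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

const (
	preGrow   = 64 << 10 // initial capacity, matching the 64 KB figure in the table
	maxRetain = 1 << 20  // assumption: buffers above this size are not pooled
)

var bufPool = sync.Pool{
	New: func() any {
		b := new(bytes.Buffer)
		b.Grow(preGrow)
		return b
	},
}

func getBuf() *bytes.Buffer { return bufPool.Get().(*bytes.Buffer) }

func putBuf(b *bytes.Buffer) {
	if b.Cap() > maxRetain {
		return // let the GC reclaim oversized buffers
	}
	b.Reset()
	bufPool.Put(b)
}

// render assembles output in a pooled buffer and returns a copy, because the
// buffer's memory goes back to the pool when the function returns.
func render(rows int) []byte {
	b := getBuf()
	defer putBuf(b)
	for i := 0; i < rows; i++ {
		b.WriteString("row\n")
	}
	return append([]byte(nil), b.Bytes()...)
}

func main() {
	fmt.Printf("%q\n", render(3))
}
```

Returning b.Bytes() directly would hand the caller memory that a later request may overwrite, so the copy (or a separate pooled slice, as finalPDFSlicePool suggests) is part of the pattern.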

2. PDF: Not Just a Normal File

A PDF is closer to a programmatic document description language than a simple file format. While HTML describes structure and relies on browser engines for layout, a PDF must define every glyph position, color space, font embedding, and encryption detail explicitly.

PDF 2.0 Structure (ISO 32000-2)

The internal structure of a PDF 2.0 document as generated by GoPdfSuit:

Header: %PDF-2.0 magic bytes with optional binary comment marker
Body: Sequence of indirect objects (numbered 1..N):
- Catalog: References pages tree, outlines, structure tree, metadata, output intents, viewer preferences
- Pages Tree: Hierarchical page nodes with /Kids, /Count, /MediaBox
- Page Objects: /Contents stream (PDF operators), /Resources (fonts, XObjects), /Annots (links, signatures)
- Font Objects: Type0 fonts with CIDFontType2 + Identity-H encoding, ToUnicode CMaps, font descriptor, font file streams
- XObject Images: DCTDecode/FlateDecode streams with color space, dimensions
- Structure Tree (PDF/UA): /StructTreeRoot → struct elements with types (Document, Table, TR, TD, P, H1-H6...), MCID references
- Outline Tree: Bookmark hierarchy with /Title, /Dest, /Count
- Metadata: XMP stream with Dublin Core, PDF/A, PDF/UA extension schemas
- Output Intents: ICC profile stream with /DestOutputProfile
- Signature: PKCS#7 CMS signature with /ByteRange, /Contents
Cross-Reference Table: Maps object numbers to byte offsets, with /W, /Index, and compressed object streams (ObjStm)
Trailer: /Size, /Root, /Info, /ID array, /Prev for incremental updates, startxref pointer

Content streams emit standard PDF operators: BT/ET (text blocks), Tj/TJ (text), Tm (matrix), Tf (font), BDC/EMC (marked content for structure tree), q/Q (graphics state), re/f (rectangles), Do (XObjects), cm (coordinate transforms).
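
To make the operator list concrete, here is a small sketch that emits one text block with append-style formatting. The helper names appendPDFString and appendText are hypothetical. It writes a single-byte literal string, which is a simplification: the Type0/Identity-H fonts described above would need glyph IDs instead.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendPDFString appends s as a PDF literal string, escaping the three
// bytes that are special inside parentheses.
func appendPDFString(dst []byte, s string) []byte {
	dst = append(dst, '(')
	for i := 0; i < len(s); i++ {
		switch c := s[i]; c {
		case '(', ')', '\\':
			dst = append(dst, '\\', c)
		default:
			dst = append(dst, c)
		}
	}
	return append(dst, ')')
}

// appendText appends one text block: select the font (Tf), set the text
// matrix to a plain translation (Tm), and show the string (Tj).
func appendText(dst []byte, font string, size, x, y int, s string) []byte {
	dst = append(dst, "BT\n/"...)
	dst = append(dst, font...)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, int64(size), 10)
	dst = append(dst, " Tf\n1 0 0 1 "...)
	dst = strconv.AppendInt(dst, int64(x), 10)
	dst = append(dst, ' ')
	dst = strconv.AppendInt(dst, int64(y), 10)
	dst = append(dst, " Tm\n"...)
	dst = appendPDFString(dst, s)
	return append(dst, " Tj\nET\n"...)
}

func main() {
	buf := make([]byte, 0, 256) // pre-sized so the appends below do not reallocate
	buf = appendText(buf, "F1", 12, 72, 720, "Total (USD)")
	fmt.Print(string(buf))
}
```

Because every helper takes and returns the destination slice, a whole cell or row can be built in one buffer and written once, which is the batching idea from the optimization section.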

Image Encoding

Image handling in internal/pdf/image.go (609 lines):

Supported formats: PNG (image/png), JPEG (passthrough), SVG (parsed via internal/pdf/svg/svg.go)
Color space: Converted to DeviceRGB, with embedded sRGB ICC v2.1 profile (buildSRGBICCProfile in pdfa.go) using hand-corrected TRC curves to prevent washed-out output in Adobe Acrobat
Compression: DCTDecode for JPEG, FlateDecode for RGB rasters
Deduplication: FNV-1a hash-based cache avoids re-decoding and re-compressing duplicate images
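
A hash-keyed cache of this kind can be sketched with the standard library alone. The type and function names below are illustrative, not the project's. Instead of singleflight, which lives in golang.org/x/sync, this version decodes while holding the write lock, so concurrent misses are serialized. That is simpler but coarser than what the article describes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// imageCache maps a content hash to an already-processed image.
type imageCache struct {
	mu      sync.RWMutex
	entries map[uint64][]byte
	decodes int // counts real decode calls, for the demo only
}

func newImageCache() *imageCache {
	return &imageCache{entries: make(map[uint64][]byte)}
}

// hashBytes returns the 64-bit FNV-1a hash of b.
func hashBytes(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b) // hash.Hash.Write never returns an error
	return h.Sum64()
}

// get returns the processed form of raw, decoding it only on a cache miss.
func (c *imageCache) get(raw []byte, decode func([]byte) []byte) []byte {
	key := hashBytes(raw)
	c.mu.RLock()
	out, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return out
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if out, ok := c.entries[key]; ok { // re-check: another goroutine may have filled it
		return out
	}
	c.decodes++
	out = decode(raw)
	c.entries[key] = out
	return out
}

func main() {
	cache := newImageCache()
	// Stand-in for PNG decoding and RGB conversion.
	decode := func(raw []byte) []byte { return append([]byte("rgb:"), raw...) }
	logo := []byte("logo-bytes")
	for i := 0; i < 3; i++ {
		cache.get(logo, decode)
	}
	fmt.Println("decodes:", cache.decodes)
}
```

A 64-bit hash is not collision-proof, so a production cache may also want to compare lengths or bytes before trusting a hit.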

Layout

Layout uses a top-down Y model internally while emitting standard PDF bottom-left user space:

PageManager.CurrentYPos = height - topMargin (internal/pdf/pagemanager.go)
Table rendering (draw.go, ~1800+ lines) handles column widths, text wrapping, row heights, cell borders, superscripts/subscripts, checkboxes, and auto-column-width detection
Text wrapping uses WrapTextInto with running lineWidth tracking and reusable [][]byte line buffers
Line width calculations use real TTF hmtx/glyf metrics for custom fonts and hard-coded Standard 14 width tables for WinAnsi fonts
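
The top-down cursor over bottom-left coordinates can be illustrated with a few lines of Go. This is a sketch under assumed names (cursor, advance); it only mirrors the CurrentYPos = height - topMargin starting point and a simple page-break rule, not the real PageManager.

```go
package main

import "fmt"

// cursor tracks the current position in PDF user space (origin bottom-left)
// while callers think in terms of moving down the page.
type cursor struct {
	pageHeight   float64
	topMargin    float64
	bottomMargin float64
	y            float64 // current Y in PDF coordinates
	page         int
}

func newCursor(height, top, bottom float64) *cursor {
	return &cursor{
		pageHeight:   height,
		topMargin:    top,
		bottomMargin: bottom,
		y:            height - top, // start just below the top margin
		page:         1,
	}
}

// advance reserves h points below the current position and returns the PDF Y
// coordinate of the reserved block's bottom edge. It starts a new page when
// the block would cross the bottom margin.
func (c *cursor) advance(h float64) float64 {
	if c.y-h < c.bottomMargin {
		c.page++
		c.y = c.pageHeight - c.topMargin
	}
	c.y -= h
	return c.y
}

func main() {
	c := newCursor(842, 72, 72) // approximately A4 height in points, 1-inch margins
	for i := 0; i < 40; i++ {
		y := c.advance(20)
		if i%13 == 0 {
			fmt.Printf("row %d: page %d, y=%.0f\n", i, c.page, y)
		}
	}
}
```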

Metadata & Compliance Headers

XMP metadata (internal/pdf/metadata.go, internal/pdf/pdfa.go): Generates PDF/A-4 + PDF/UA-2 compliant XMP with Dublin Core, XMP Media Management, and extension schemas
Catalog entries: /MarkInfo << /Marked true >>, /Lang (en-US), /ViewerPreferences, /StructTreeRoot, /OutputIntents
Document ID: Two-part /ID array with random byte-generated file IDs
Trailer: PDF 2.0 trailers include /ID, /Info, /Size, /Root

3. Compliance

Compliance seemed intimidating until the implementation was underway. Understanding the specifications deeply and building the infrastructure piece by piece made it progressively easier.

PDF/A-4 (ISO 19005-4:2020)

All fonts must be embedded: Custom fonts are fully embedded and subsetted. Standard fonts (Helvetica, Courier, Times) are substituted with Liberation font equivalents when PDF/A mode is enabled (internal/pdf/font/pdfa.go). Liberation fonts are downloaded on demand, with double-checked caching to avoid races.
ICC color profiles required: A valid sRGB ICC v2.1 profile is constructed from scratch (buildSRGBICCProfile in pdfa.go, 507 lines). TRC curves are manually corrected.
No encryption in PDF/A: Combining pdfaCompliant with Security.Enabled is rejected.
No external references: All resources are embedded.

PDF/UA-2 (ISO 14289-2:2024)

Structure tree (internal/pdf/structure.go, 436 lines): Implements 25+ standard structure types, including Document, Part, Sect, Div, H1-H6, P, L, LI, Lbl, LBody, Table, TR, TH, TD, Figure, Caption, Form, Link, and Reference.
Marked content: Every content element is wrapped in BDC/EMC operators with MCID references. The StructureManager tracks MCID allocation and manages the ParentTree.
Tagged PDF gating: When TaggedPDF config is off, a no-op StructureManager avoids allocation overhead entirely (Pass 4 optimization).

Validation

The verapdf/ directory contains the veraPDF validation tool for automated compliance checking. Community members recommended it as the gold standard for PDF/A and PDF/UA validation.

4. pprof & Performance Profiling

Profiling Infrastructure

Server-side endpoints (internal/handlers/handlers.go), all gated to localhost only:

/debug/pprof/profile: 30-second CPU profile
/debug/pprof/heap: Heap profile
/debug/pprof/goroutine: Goroutine dump
/debug/pprof/allocs: Allocation profile
/debug/pprof/trace: Execution trace
Plus: cmdline, symbol, block, mutex, threadcreate

Opt-in heap dump on exit via ENABLE_PROFILING=1 writes to /tmp/mem.prof.
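
One way to gate pprof endpoints to localhost is a small wrapper around the standard handlers. The sketch below uses plain net/http rather than Gin, and the names localhostOnly, newDebugMux, and status are assumptions made for illustration, so it shows the idea rather than the code in internal/handlers/handlers.go.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// localhostOnly rejects requests whose peer address is not a loopback IP.
func localhostOnly(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		ip := net.ParseIP(host)
		if err != nil || ip == nil || !ip.IsLoopback() {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newDebugMux registers the pprof handlers behind the loopback check.
// pprof.Index also serves named profiles such as heap, goroutine, and allocs.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/debug/pprof/", localhostOnly(http.HandlerFunc(pprof.Index)))
	mux.Handle("/debug/pprof/profile", localhostOnly(http.HandlerFunc(pprof.Profile)))
	mux.Handle("/debug/pprof/trace", localhostOnly(http.HandlerFunc(pprof.Trace)))
	return mux
}

// status issues a GET against h with a chosen peer address.
func status(h http.Handler, path, remoteAddr string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	req.RemoteAddr = remoteAddr
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println(status(mux, "/debug/pprof/heap", "127.0.0.1:5000"))
	fmt.Println(status(mux, "/debug/pprof/heap", "203.0.113.7:5000"))
}
```

Behind a reverse proxy the peer address is the proxy's, so a check like this is only meaningful when the server is reached directly.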

CPU Hotspot Analysis

Hotspot	Initial	After Optimization	Fix
`drawTable` (cumulative)	~37%	~17.73%	Hoisted scratch buffers, batched writes
`memclrNoHeapPointers` (flat)	49.7% (under load)	27.0% (under load)	Buffer pre-grow, pooling
`compress/flate`	~20%	~5-8%	Zlib writer pool
PNG decoding	Hot for dupes	Eliminated	FNV-1a hash cache + singleflight
`BeginMarkedContentBuf`	~6.8%	Reduced	Tagged PDF gating when not needed

Heap Hotspot Analysis (5000-iteration benchmark)

bytes.growSlice: 443.40 MB (59% of total), addressed by pre-growing buffers
compress/flate.NewWriter: 88.34 MB cumulative, addressed by zlib pooling
HTTP load: heap in-use reduced from 442 MB → 55 MB (−88%)

5. Architecture

Design Patterns in Practice

Facade Pattern: pkg/gopdflib/ provides a clean public API surface delegating to internal packages. All public types are type aliases (type PDFTemplate = models.PDFTemplate).

Builder Pattern: OutlineBuilder (internal/pdf/outline.go, 505 lines) constructs the PDF outline tree with a fluent API.

Strategy / Adapter Patterns via Interfaces:

ObjectEncryptor allows switching between AES-128, AES-256, RC4, and no-op encryption
SignaturePageContext decouples the signature subsystem from PageManager
OCRProvider allows plugging in different OCR backends for redaction

Object Pool Pattern: Seven sync.Pool instances reduce GC pressure on hot paths.

Registry / Singleton Pattern: CustomFontRegistry with CloneForGeneration() gives each PDF generation a shallow clone with isolated usage tracking and noLock: true to avoid mutex overhead on single-threaded generation paths.

Component Pattern: PDF structure is built from typed element components (Table, Spacer, Image, Footer, Title, Bookmark) assembled via ordered Element slices.
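
The per-generation registry clone described above could be sketched as follows. fontRegistry and its methods are hypothetical stand-ins for CustomFontRegistry; in particular, copying the font map entries during the clone is an assumption made here to keep the example race-free, and the real shallow clone may share more.

```go
package main

import (
	"fmt"
	"sync"
)

// fontRegistry is a hypothetical stand-in for CustomFontRegistry.
type fontRegistry struct {
	mu     sync.Mutex
	noLock bool                        // true on clones owned by a single goroutine
	fonts  map[string][]byte           // font data; the byte slices are shared, never copied
	used   map[string]map[rune]struct{} // per-registry usage tracking
}

func newFontRegistry() *fontRegistry {
	return &fontRegistry{
		fonts: make(map[string][]byte),
		used:  make(map[string]map[rune]struct{}),
	}
}

func (r *fontRegistry) register(name string, data []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.fonts[name] = data
}

// cloneForGeneration copies the font map entries (not the font bytes) and
// starts with empty usage tracking, so clones never touch shared mutable state.
func (r *fontRegistry) cloneForGeneration() *fontRegistry {
	r.mu.Lock()
	defer r.mu.Unlock()
	fonts := make(map[string][]byte, len(r.fonts))
	for name, data := range r.fonts {
		fonts[name] = data
	}
	return &fontRegistry{noLock: true, fonts: fonts, used: make(map[string]map[rune]struct{})}
}

// markUsed records that ch was drawn with the named font.
func (r *fontRegistry) markUsed(name string, ch rune) {
	if !r.noLock {
		r.mu.Lock()
		defer r.mu.Unlock()
	}
	set := r.used[name]
	if set == nil {
		set = make(map[rune]struct{})
		r.used[name] = set
	}
	set[ch] = struct{}{}
}

func (r *fontRegistry) usedCount(name string) int { return len(r.used[name]) }

func main() {
	global := newFontRegistry()
	global.register("Body", []byte("font-bytes"))
	counts := make([]int, 2)
	var wg sync.WaitGroup
	for i, text := range []string{"hello", "world!"} {
		wg.Add(1)
		go func(i int, text string) {
			defer wg.Done()
			reg := global.cloneForGeneration() // one clone per document
			for _, ch := range text {
				reg.markUsed("Body", ch)
			}
			counts[i] = reg.usedCount("Body")
		}(i, text)
	}
	wg.Wait()
	fmt.Println(counts, global.usedCount("Body"))
}
```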

Decoupled Architecture

Data flows one way: Template → Parser → PageManager → ContentStreams → Assembly → Final PDF
The font registry is cloned per generation, which eliminates mutex contention and makes concurrent generation safe
Parallelism is gated behind a semaphore middleware sized to runtime.NumCPU(), which prevents goroutine thrashing
Per-page zlib compression is parallel (errgroup), but assembly, encryption, and xref writing stay serial for deterministic object numbering
context.Context is not used in the PDF pipeline, which keeps the hot path lightweight
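
A semaphore gate of the kind described can be built from a buffered channel. The sketch below uses net/http instead of Gin, and limit and peakConcurrency are illustrative names. It measures the peak number of requests inside the handler to show that the gate holds.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
	"sync"
	"time"
)

// limit lets at most n requests into next at once; the rest wait their turn.
func limit(n int, next http.Handler) http.Handler {
	sem := make(chan struct{}, n)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sem <- struct{}{}        // acquire a slot
		defer func() { <-sem }() // release it when the handler returns
		next.ServeHTTP(w, r)
	})
}

// peakConcurrency fires the given number of concurrent requests at a gated
// handler and reports the highest number seen inside it at the same time.
func peakConcurrency(n, requests int) int {
	var mu sync.Mutex
	cur, peak := 0, 0
	h := limit(n, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		cur++
		if cur > peak {
			peak = cur
		}
		mu.Unlock()
		time.Sleep(5 * time.Millisecond) // stand-in for PDF generation
		mu.Lock()
		cur--
		mu.Unlock()
	}))
	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			req := httptest.NewRequest(http.MethodGet, "/", nil)
			h.ServeHTTP(httptest.NewRecorder(), req)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	n := runtime.NumCPU()
	fmt.Println("limit:", n, "peak within limit:", peakConcurrency(n, 4*n) <= n)
}
```

Waiting requests block inside the middleware, so this form applies backpressure rather than rejecting load; returning 503 when the channel is full would be the alternative design.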

6. CGO Python Bindings

The entire Go PDF engine is exported as a Python package via a CGO shared library (bindings/python/cgo/exports.go, 437 lines):

Compiles to a C shared library (.so/.dylib) using go build -buildmode=c-shared
Exports 13 C-callable functions: GeneratePDF, MergePDFs, SplitPDF, FillPDFWithXFDF, ConvertHTMLToPDF, ConvertHTMLToImage, GetAvailableFonts, GetPageInfo, ExtractTextPositions, FindTextOccurrences, ApplyRedactions, ApplyRedactionsAdvanced, ParsePageSpec
Memory management via FreeBytesResult and FreeBytesArrayResult for caller-side cleanup
Python package: bindings/python/pypdfsuit/ with setup.py/pyproject.toml for PyPI distribution

The same high-performance Go engine powers both the Go and Python ecosystems. Only the initial function call crosses the CGO boundary; all PDF generation happens natively in Go memory.

7. Lessons from Open-Source Maintenance

Running GoPdfSuit as a public open-source project involved:

User feedback loop: Real users tested the library in production and provided actionable feedback on XFDF form filling, redaction behavior, and PDF/A validation.
Issue-driven development: Features like PDF splitting, secure redaction with OCR, HTML-to-PDF conversion, and Python CGO bindings were driven by user requests.
Tooling suggestions: Community members recommended veraPDF as the gold standard validator.
Documentation shaped by users: The React playground, API docs, and benchmarks were built around what users actually needed to understand.
Clean repository structure: Library (pkg/gopdflib/), engine (internal/pdf/), web app, benchmarks, guides, and bindings are cleanly separated.

8. Benchmark Performance

GoPdfSuit was benchmarked against a financial PDF generation workload inspired by the publicly documented infrastructure of Zerodha (India's largest retail brokerage), which uses Typst/LaTeX CLI tools on a 40-node Nomad cluster to generate ~1.5 million digitally signed PDF contract notes daily.

Note: This comparison should be interpreted carefully. Zerodha runs a full production pipeline with signing, distribution, and fault tolerance on a distributed cluster. The GoPdfSuit benchmark measures a local library's raw generation throughput. The numbers highlight the performance ceiling achievable when generation is the only concern, not an apples-to-apples production comparison.

Metric	GoPdfSuit (1 node, 24 vCPUs)
Throughput (peak)	2,061 ops/s
Throughput (10-run avg)	1,705 ops/s
Serial 2000-row PDF/A	31–36 ms/op
Heap in-use under load	55 MB
Time for 1.5M PDFs	~15 minutes (single node)
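
The last row follows from the throughput rows by simple arithmetic. The snippet below, with the hypothetical helper minutesFor, assumes the measured rate is sustained for the whole run.

```go
package main

import "fmt"

// minutesFor returns how long total documents take at a sustained rate.
func minutesFor(total, opsPerSec float64) float64 {
	return total / opsPerSec / 60
}

func main() {
	fmt.Printf("at 10-run average: %.1f min\n", minutesFor(1.5e6, 1705))
	fmt.Printf("at peak:           %.1f min\n", minutesFor(1.5e6, 2061))
}
```

At the 10-run average this comes to a little under 15 minutes, and at the peak rate to about 12 minutes.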

Why It's Fast

Native binary generation: Generates PDF binary structure directly in RAM, no external process spawning.
Zero I/O overhead: No temporary files, no disk I/O; streams bytes directly in memory.
Goroutine concurrency: Thousands of lightweight goroutines saturate all cores without OS thread overhead.
Asset reuse: Font subsets and image assets are processed once and reused across millions of documents.

9. Technology Stack

Go Backend (Gin)

Entry point: cmd/gopdfsuit/main.go
Framework: Gin (release mode) with custom lightweight panic recovery
Concurrency control: Semaphore middleware sized to runtime.NumCPU()
Routes: Serves the Vite-built React SPA plus 14 API endpoints under /api/v1 for PDF generation, merging, splitting, XFDF filling, HTML-to-PDF, redaction, font management, and OCR
Middleware: CORS, Google OAuth (Cloud Run only), semaphore-based concurrency gating

React Frontend (Vite SPA)

12 pages: Home, Editor, Viewer, Merge, Split, Filler, HtmlToPdf, HtmlToImage, Comparison, Documentation, Redaction, Screenshots
React Router v6 with HashRouter for GitHub Pages compatibility
MUI (Material UI) components with a custom theme
AuthGuard component for OAuth-gated routes
Vite builds output to docs/, served by Go backend as static assets

10. GCloud Deployments

GCP Deployment & Architecture

Hands-on GCP learning: End-to-end experience deploying a Go and React application on Google Cloud Platform.
Strategic architecture decisions: Analyzed and compared Google App Engine vs. Cloud Run for optimal deployment strategy based on cost, scalability, and performance.
Resource optimization: Dual-deployment approach: App Engine F1 instance class for standard hosting alongside tailored Cloud Run instances (512 MiB memory ceiling), optimized via K_SERVICE environment detection.

Project Configuration

App Engine Standard setup: Configured app.yaml to manage runtime (go124), strict autoscaling limits, custom entry points, and environment variables (Google OAuth, Cloud Run URLs, Vite configurations).
React frontend integration: Unified the React SPA (built via Vite) with the Go server binary, configuring Gin's StaticFS middleware to serve static assets alongside custom SPA fallback routing.
Security & CORS: Google OAuth middleware gates specific routes; precise CORS permissions enable communication between the GitHub Pages frontend and the Cloud Run API.

Docker

Multi-stage Docker build with separate builder and runtime stages
Cloud Run optimized variant (Dockerfile_cloudrun)

Key Takeaways

Profile everything, don't guess. pprof showed memclr was 49.7% of CPU under load, something no amount of code review would have caught.
Respect the allocator. Pre-growing buffers, pooling, and avoiding string intermediates are the difference between 442 MB and 55 MB heap.
Interface design pays off. The ObjectEncryptor and SignaturePageContext interfaces made security features composable without entangling the hot path.
Compliance is incremental. PDF/A-4 and PDF/UA-2 felt overwhelming until broken into concrete checklist items: fonts, ICC profiles, XMP metadata, structure tree, marked content.
Benchmark honestly. Distinguish peak from average, report worker counts, and be transparent about what's being compared.
Use AI as a tool, not a crutch. AI pair programming accelerated development significantly, but every optimization was validated against measured benchmarks.

What started as a simple XFDF parser side project evolved into a fully compliant PDF engine supporting PDF/A-4 and PDF/UA-2: roughly three months of work, accelerated by AI pair programming. By moving away from licensed enterprise solutions, this native Go engine represents meaningful infrastructure cost reduction for teams replacing commercial PDF tooling.

Repository: github.com/chinmay-sawant/gopdfsuit

DEV Community