Building a PDF engine from scratch in Go forces you to confront memory management, binary serialization, concurrency safety, interface design, and performance profiling all at once in a domain where correctness is non-negotiable.
This article covers lessons learned building GoPdfSuit, a production PDF engine written in Go that generates 1.5 million financial PDFs in roughly 15 minutes on a single node, achieves PDF/A-4 and PDF/UA-2 compliance, and exposes itself as a REST API, a Go library, and Python CGO bindings simultaneously.
Note: While I have six years of overall experience including two years working with Go, I rarely encountered these types of systems-level challenges in day-to-day feature work. This project was an intentional deep dive into performance optimization. I used AI tools (Copilot, Cursor, OpenCode) to assist with development.
1. Optimization
Micro-Optimizations That Actually Matter
The kinds of optimizations rarely needed in CRUD work became essential here: direct byte slicing instead of fmt.Sprintf, bit-shift approximations for division, and pre-sized scratch buffers.
Techniques used across the codebase:
- `appendTextForPDF`: Zero-alloc text encoding directly into `[]byte`, eliminating `string` intermediates on every `Tj` PDF operator (`internal/pdf/utils.go`, used across `draw.go`).
- Byte-scratch buffers: Hot paths use stack-fixed `[24]byte` and `[12]byte` scratch buffers for numeric formatting. `appendFmtNum` avoids `strconv.AppendFloat`, documented as ~10% CPU savings.
- RuneSet bitmap: Replaced `map[rune]bool` for character tracking with a dense 64 KiB bitmap (`font/runeset.go`), cutting map inserts on the font subsetting hot path.
- Fast alpha blending: Replaced integer division (`/ 255`) per pixel component with `((r*a + white) * 0x8081) >> 23` (`internal/pdf/image.go`).
- 256-byte lookup tables for hex encoding instead of `fmt.Sprintf`.
- Batched writes: Reduced ~25K separate `Write` calls for a 5K-cell table down to ~5K by batching PDF commands per cell (`internal/pdf/draw.go`).
- Pre-grow all buffers: Page content streams pre-grow to 64 KiB, compress buffers to 64 KiB, assembly buffer to 64 KiB, avoiding incremental `append` growth during hot generation.
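Two of these tricks are easy to check in isolation. The sketch below is not the project's code: `div255`, `blendOverWhite`, `appendHex`, and the two-table hex layout are hypothetical names and choices, and the meaning of the `white` term (255 times the inverse alpha) is an assumption. It shows that the multiply-and-shift form matches real division for every 16-bit input, which covers any sum of 8-bit products used in blending.

```go
package main

import "fmt"

// div255 computes x/255 for 0 <= x <= 65535 with a multiply and a shift.
// 0x8081 is ceil(2^23 / 255); the uint32 product cannot overflow in that range.
func div255(x uint32) uint32 {
	return (x * 0x8081) >> 23
}

// blendOverWhite composites one 8-bit colour component over white.
// Assumption: "white" in the article's expression is 255 * (255 - alpha).
func blendOverWhite(c, a uint8) uint8 {
	white := 255 * uint32(255-a)
	return uint8(div255(uint32(c)*uint32(a) + white))
}

// hexHi and hexLo are 256-entry tables holding the high and low hex digit
// of every byte value (a hypothetical layout for the lookup tables).
var hexHi, hexLo [256]byte

func init() {
	const digits = "0123456789ABCDEF"
	for i := range hexHi {
		hexHi[i] = digits[i>>4]
		hexLo[i] = digits[i&0x0F]
	}
}

// appendHex appends the hex form of src to dst without going through fmt.
func appendHex(dst, src []byte) []byte {
	for _, b := range src {
		dst = append(dst, hexHi[b], hexLo[b])
	}
	return dst
}

func main() {
	// Exhaustively confirm the shortcut against real division.
	for x := uint32(0); x <= 65535; x++ {
		if div255(x) != x/255 {
			panic(fmt.Sprintf("mismatch at %d", x))
		}
	}
	fmt.Println("div255 matches x/255 for all 16-bit inputs")

	var scratch [8]byte // stack-sized scratch, reused via scratch[:0]
	fmt.Println(string(appendHex(scratch[:0], []byte{0xDE, 0xAD})))
	fmt.Println(blendOverWhite(0, 128))
}
```

The shortcut is only exact up to 65535, so it suits 8-bit by 8-bit blending but not wider intermediates.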
The Four-Pass Performance Program
The optimization journey was structured across 4 passes with 41 total tasks:
| Pass | Focus | Tasks | Key Outcomes |
|---|---|---|---|
| Pass 1 | Low-hanging fruit | 10 | Buffer pooling, zero-alloc text encoding, batched writes, RuneSet bitmap, image cache singleflight |
| Pass 2 | Architecture | 12 | Direct-write APIs, parallel decode/compress, incremental MD5, sparse CIDToGIDMap |
| Pass 3 | Advanced memory | 5 | Allocation-free WrapTextInto, typed []StructKid replacing []interface{}, redact parser unification |
| Pass 4 | Load-test hotspots | 14 | PDF/UA gating, final PDF slice pool, StructElem pool, template pool, parallel zlib, p99 fixes |
Results:
- 2061 ops/s peak (1705 ops/s 10-run average) on financial workload (48 workers, PDF/A + tagged + signatures)
- Serial 2000-row PDF/A: ~31–36 ms/op, ~163K allocs/op (~46% fewer than pre-optimization)
- HTTP load test: ~5.7× throughput, ~18× faster p99, −88% heap in-use (442 MB → 55 MB)
- `memclr` CPU under load: 49.7% → 27.0% (−46% relative)
PDF File Size Optimization
- FlateDecode compression: All content streams, font streams, ICC profiles, and metadata are compressed using zlib. A central `sync.Pool` of zlib writers avoids repeated ~256 KB compression table allocation (`font/compression.go`).
- Compression level: `zlib.BestSpeed` balances size vs. throughput. Buffers pre-grow to 64 KiB.
- Font subsetting: A complete TrueType/OpenType subsetting engine (`internal/pdf/font/subset.go`, 876 lines) extracts only the glyphs actually used. It handles composite glyph dependencies recursively, remaps glyph IDs, rebuilds all required TTF tables (`head`, `hhea`, `maxp`, `glyf`, `loca`, `hmtx`, `cmap`, `post`, `name`), and generates sparse CIDToGIDMap entries.
- Image deduplication: FNV-1a hash-based `imageCache` with `sync.RWMutex` and `singleflight` to deduplicate concurrent decodes of the same image.
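The pooled zlib writer idea can be sketched with the standard library alone. This is a minimal illustration, not the code in `font/compression.go`; `zlibWriterPool`, `compressStream`, and `decompress` are hypothetical names. The key call is `(*zlib.Writer).Reset`, which rebinds an existing writer to a new destination so its internal tables are reused.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"sync"
)

// zlibWriterPool hands out reusable writers at BestSpeed.
var zlibWriterPool = sync.Pool{
	New: func() any {
		// NewWriterLevel only fails for an invalid level constant.
		w, _ := zlib.NewWriterLevel(io.Discard, zlib.BestSpeed)
		return w
	},
}

// compressStream deflates data using a pooled writer.
func compressStream(data []byte) ([]byte, error) {
	var out bytes.Buffer
	out.Grow(64 << 10) // pre-grow, mirroring the 64 KiB figure above

	w := zlibWriterPool.Get().(*zlib.Writer)
	defer zlibWriterPool.Put(w)
	w.Reset(&out) // rebind the pooled writer to this output

	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil { // flushes the final block and checksum
		return nil, err
	}
	return out.Bytes(), nil
}

// decompress is only here to verify the round trip.
func decompress(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	content := bytes.Repeat([]byte("BT /F1 12 Tf (row) Tj ET\n"), 200)
	packed, err := compressStream(content)
	if err != nil {
		panic(err)
	}
	back, err := decompress(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(content), "->", len(packed), "round trip ok:", bytes.Equal(content, back))
}
```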
Concurrency & Memory Pools
Seven active `sync.Pool` instances across the codebase:
| Pool | Size | Purpose |
|---|---|---|
| `pdfBufferPool` | 64 KB pre-grow | PDF assembly `*bytes.Buffer` |
| `finalPDFSlicePool` | 256 KB cap | Scratch `[]byte` for final PDF assembly |
| `scratchBufPool` | 128 B | Small `strconv` scratch buffers |
| `rgbDataPool` | 1 MB | RGB image conversion buffers |
| `structElemPool` | | PDF/UA structure tree `*StructElem` nodes |
| `templatePDFPool` | | HTTP handler `*models.PDFTemplate` instances |
| `ZlibWriterPool` / `CompressBufPool` | 64 KB | Zlib compression writers and buffers |
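A pre-grown buffer pool along the lines of `pdfBufferPool` might look like the sketch below. The helper names (`getBuffer`, `putBuffer`, `buildHeader`) and the 1 MiB retention cap are assumptions for illustration; the article only states the 64 KB pre-grow.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

const (
	preGrow      = 64 << 10 // initial capacity, per the table above
	maxPooledCap = 1 << 20  // hypothetical cap so one huge PDF does not pin memory
)

var pdfBufferPool = sync.Pool{
	New: func() any {
		b := new(bytes.Buffer)
		b.Grow(preGrow)
		return b
	},
}

// getBuffer returns an empty buffer with at least preGrow capacity.
func getBuffer() *bytes.Buffer {
	b := pdfBufferPool.Get().(*bytes.Buffer)
	b.Reset()
	return b
}

// putBuffer returns b to the pool unless it has grown too large.
func putBuffer(b *bytes.Buffer) {
	if b.Cap() > maxPooledCap {
		return // let the GC reclaim oversized buffers
	}
	pdfBufferPool.Put(b)
}

// buildHeader writes into a pooled buffer and copies the result out,
// because the buffer's memory is reused after putBuffer.
func buildHeader() []byte {
	b := getBuffer()
	defer putBuffer(b)
	b.WriteString("%PDF-2.0\n")
	return append([]byte(nil), b.Bytes()...)
}

func main() {
	fmt.Printf("%q\n", buildHeader())
}
```

The copy on the way out matters: returning `b.Bytes()` directly would hand the caller memory that the next pool user overwrites.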
2. PDF: Not Just a Normal File
A PDF is closer to a programmatic document description language than a simple file format. While HTML describes structure and relies on browser engines for layout, a PDF must define every glyph position, color space, font embedding, and encryption detail explicitly.
PDF 2.0 Structure (ISO 32000-2)
The internal structure of a PDF 2.0 document as generated by GoPdfSuit:
- Header: `%PDF-2.0` magic bytes with optional binary comment marker
- Body: Sequence of indirect objects (numbered 1..N):
  - Catalog: References pages tree, outlines, structure tree, metadata, output intents, viewer preferences
  - Pages Tree: Hierarchical page nodes with `/Kids`, `/Count`, `/MediaBox`
  - Page Objects: `/Contents` stream (PDF operators), `/Resources` (fonts, XObjects), `/Annots` (links, signatures)
  - Font Objects: Type0 fonts with CIDFontType2 + Identity-H encoding, ToUnicode CMaps, font descriptor, font file streams
  - XObject Images: DCTDecode/FlateDecode streams with color space, dimensions
  - Structure Tree (PDF/UA): `/StructTreeRoot` → struct elements with types (Document, Table, TR, TD, P, H1-H6...), MCID references
  - Outline Tree: Bookmark hierarchy with `/Title`, `/Dest`, `/Count`
  - Metadata: XMP stream with Dublin Core, PDF/A, PDF/UA extension schemas
  - Output Intents: ICC profile stream with `/DestOutputProfile`
  - Signature: PKCS#7 CMS signature with `/ByteRange`, `/Contents`
- Cross-Reference Table: Maps object numbers to byte offsets, with `/W`, `/Index`, and compressed object streams (`ObjStm`)
- Trailer: `/Size`, `/Root`, `/Info`, `/ID` array, `/Prev` for incremental updates, `startxref` pointer
Content streams emit standard PDF operators: BT/ET (text blocks), Tj/TJ (text), Tm (matrix), Tf (font), BDC/EMC (marked content for structure tree), q/Q (graphics state), re/f (rectangles), Do (XObjects), cm (coordinate transforms).
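To make the operator list concrete, here is a small sketch that appends one text block to a content stream as raw bytes. `appendPDFString` and `textBlock` are hypothetical helpers, not GoPdfSuit functions, and the example uses a literal `( ... )` string, which fits simple single-byte fonts; the Type0/Identity-H fonts described above would carry glyph IDs instead.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendPDFString appends s as a PDF literal string, escaping the three
// characters that are special inside ( ... ): backslash and both parentheses.
func appendPDFString(dst []byte, s string) []byte {
	dst = append(dst, '(')
	for i := 0; i < len(s); i++ {
		switch c := s[i]; c {
		case '\\', '(', ')':
			dst = append(dst, '\\', c)
		default:
			dst = append(dst, c)
		}
	}
	return append(dst, ')')
}

// textBlock appends: save state, begin text, select font, set the text
// matrix to a pure translation, show the string, end text, restore state.
func textBlock(dst []byte, font string, size, x, y float64, text string) []byte {
	dst = append(dst, "q\nBT\n/"...)
	dst = append(dst, font...)
	dst = append(dst, ' ')
	dst = strconv.AppendFloat(dst, size, 'f', -1, 64)
	dst = append(dst, " Tf\n1 0 0 1 "...)
	dst = strconv.AppendFloat(dst, x, 'f', -1, 64)
	dst = append(dst, ' ')
	dst = strconv.AppendFloat(dst, y, 'f', -1, 64)
	dst = append(dst, " Tm\n"...)
	dst = appendPDFString(dst, text)
	dst = append(dst, " Tj\nET\nQ\n"...)
	return dst
}

func main() {
	stream := textBlock(nil, "F1", 12, 72, 720, "Total (USD)")
	fmt.Print(string(stream))
}
```

Everything is appended to one `[]byte`, which is the same shape as the batched-write approach described in the optimization section.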
Image Encoding
Image handling in internal/pdf/image.go (609 lines):
- Supported formats: PNG (`image/png`), JPEG (passthrough), SVG (parsed via `internal/pdf/svg/svg.go`)
- Color space: Converted to DeviceRGB, with embedded sRGB ICC v2.1 profile (`buildSRGBICCProfile` in `pdfa.go`) using hand-corrected TRC curves to prevent washed-out output in Adobe Acrobat
- Compression: DCTDecode for JPEG, FlateDecode for RGB rasters
- Deduplication: FNV-1a hash-based cache avoids re-decoding and re-compressing duplicate images
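The deduplication cache can be approximated with `hash/fnv` and a read-write mutex. This sketch is an assumption-heavy simplification: the type and method names are invented, the `singleflight` layer mentioned earlier is left out to stay within the standard library, and the 64-bit hash is trusted as a key without comparing the underlying bytes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// imageCache maps an FNV-1a hash of the raw image bytes to decoded output.
type imageCache struct {
	mu      sync.RWMutex
	entries map[uint64][]byte
	decodes int // how many times the expensive decode actually ran
}

func newImageCache() *imageCache {
	return &imageCache{entries: make(map[uint64][]byte)}
}

func hashBytes(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b) // hash.Hash writes never return an error
	return h.Sum64()
}

// get returns the cached result for raw, calling decode at most once per key.
func (c *imageCache) get(raw []byte, decode func([]byte) []byte) []byte {
	key := hashBytes(raw)

	c.mu.RLock()
	v, ok := c.entries[key]
	c.mu.RUnlock()
	if ok {
		return v
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[key]; ok { // another goroutine may have filled it
		return v
	}
	v = decode(raw)
	c.decodes++
	c.entries[key] = v
	return v
}

func main() {
	cache := newImageCache()
	logo := []byte("pretend these are PNG bytes")
	fakeDecode := func(b []byte) []byte { return append([]byte("rgb:"), b...) }

	for i := 0; i < 1000; i++ {
		cache.get(logo, fakeDecode)
	}
	fmt.Println("decodes:", cache.decodes)
}
```

Holding the write lock during `decode` serialises misses for unrelated images too; a per-key mechanism such as `singleflight` avoids that cost.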
Layout
Layout uses a top-down Y model internally while emitting standard PDF bottom-left user space:
- `PageManager.CurrentYPos = height - topMargin` (`internal/pdf/pagemanager.go`)
- Table rendering (`draw.go`, 1,800+ lines) handles column widths, text wrapping, row heights, cell borders, superscripts/subscripts, checkboxes, and auto-column-width detection
- Text wrapping uses `WrapTextInto` with running `lineWidth` tracking and reusable `[][]byte` line buffers
- Line width calculations use real TTF `hmtx`/`glyf` metrics for custom fonts and hard-coded Standard 14 width tables for WinAnsi fonts
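The top-down cursor can be illustrated with a few lines. `pageCursor` and `place` are hypothetical stand-ins for whatever `PageManager` does internally; the only detail taken from the article is that the cursor starts at `height - topMargin` in PDF user space, where the origin is the bottom-left corner.

```go
package main

import "fmt"

// pageCursor tracks the next free Y position in PDF user space.
type pageCursor struct {
	height, topMargin, bottomMargin float64
	y                               float64
	page                            int
}

func newPageCursor(height, top, bottom float64) *pageCursor {
	return &pageCursor{
		height: height, topMargin: top, bottomMargin: bottom,
		y: height - top, page: 1,
	}
}

// place reserves a block of height h, breaking to a new page when the block
// would cross the bottom margin. It returns the page number and the Y of the
// block's bottom edge.
func (c *pageCursor) place(h float64) (page int, y float64) {
	if c.y-h < c.bottomMargin {
		c.page++
		c.y = c.height - c.topMargin
	}
	c.y -= h
	return c.page, c.y
}

func main() {
	c := newPageCursor(842, 72, 72) // A4 height in points, 1-inch margins
	for row := 1; row <= 36; row++ {
		page, y := c.place(20)
		if row == 1 || row >= 34 {
			fmt.Printf("row %d -> page %d, y=%.0f\n", row, page, y)
		}
	}
}
```

Layout code moves downward by subtracting from Y, so no coordinate flip is needed when the operators are emitted.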
Metadata & Compliance Headers
- XMP metadata (`internal/pdf/metadata.go`, `internal/pdf/pdfa.go`): Generates PDF/A-4 + PDF/UA-2 compliant XMP with Dublin Core, XMP Media Management, and extension schemas
- Catalog entries: `/MarkInfo << /Marked true >>`, `/Lang (en-US)`, `/ViewerPreferences`, `/StructTreeRoot`, `/OutputIntents`
- Document ID: Two-part `/ID` array with random byte-generated file IDs
- Trailer: PDF 2.0 trailers include `/ID`, `/Info`, `/Size`, `/Root`
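A two-part `/ID` entry built from random bytes might be produced as follows. `newFileID` and `formatIDEntry` are hypothetical names, and the 16-byte length is an assumption. Reusing one value for both halves of a freshly created file is a common convention, not something the article states.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newFileID returns 16 random bytes to use as a file identifier.
func newFileID() ([]byte, error) {
	id := make([]byte, 16)
	if _, err := rand.Read(id); err != nil {
		return nil, err
	}
	return id, nil
}

// formatIDEntry renders the trailer entry as two hex strings.
func formatIDEntry(first, second []byte) string {
	return "/ID [<" + hex.EncodeToString(first) + "> <" + hex.EncodeToString(second) + ">]"
}

func main() {
	id, err := newFileID()
	if err != nil {
		panic(err)
	}
	// For a newly created file both halves are typically the same value;
	// the second half changes on later incremental updates.
	fmt.Println(formatIDEntry(id, id))
}
```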
3. Compliance
Compliance seemed intimidating until the implementation was underway. Understanding the specifications deeply and building the infrastructure piece by piece made it progressively easier.
PDF/A-4 (ISO 19005-4:2020)
- All fonts must be embedded: Custom fonts are fully embedded and subsetted. Standard fonts (Helvetica, Courier, Times) are substituted with Liberation font equivalents when PDF/A mode is enabled (`internal/pdf/font/pdfa.go`). Liberation fonts are downloaded on demand with double-check caching to avoid races.
- ICC color profiles required: A valid sRGB ICC v2.1 profile is constructed from scratch (`buildSRGBICCProfile` in `pdfa.go`, 507 lines). TRC curves are manually corrected.
- No encryption in PDF/A (`pdfaCompliant` + `Security.Enabled` = rejection)
- No external references: All resources are embedded.
PDF/UA-2 (ISO 14289-2:2024)
- Structure tree (`internal/pdf/structure.go`, 436 lines): Implements 25+ standard structure types: Document, Part, Sect, Div, H1-H6, P, L, LI, Lbl, LBody, Table, TR, TH, TD, Figure, Caption, Form, Link, Reference.
- Marked content: Every content element is wrapped in `BDC`/`EMC` operators with MCID references. The `StructureManager` tracks MCID allocation and manages the ParentTree.
- Tagged PDF gating: When `TaggedPDF` config is off, a no-op `StructureManager` avoids allocation overhead entirely (Pass 4 optimization).
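The gating idea reduces to choosing between two implementations of one small interface. The sketch below uses invented names (`tagger`, `structTagger`, `noopTagger`) and ignores the ParentTree bookkeeping that the real `StructureManager` performs. It only shows how MCIDs are handed out and how the untagged path emits nothing.

```go
package main

import (
	"fmt"
	"strconv"
)

// tagger wraps content in marked-content operators (or does nothing).
type tagger interface {
	begin(dst []byte, tag string) []byte
	end(dst []byte) []byte
}

// structTagger allocates sequential MCIDs and emits BDC/EMC pairs.
type structTagger struct{ nextMCID int }

func (t *structTagger) begin(dst []byte, tag string) []byte {
	dst = append(dst, '/')
	dst = append(dst, tag...)
	dst = append(dst, " <</MCID "...)
	dst = strconv.AppendInt(dst, int64(t.nextMCID), 10)
	dst = append(dst, ">> BDC\n"...)
	t.nextMCID++
	return dst
}

func (t *structTagger) end(dst []byte) []byte { return append(dst, "EMC\n"...) }

// noopTagger is used when tagging is disabled: no bytes, no allocations.
type noopTagger struct{}

func (noopTagger) begin(dst []byte, _ string) []byte { return dst }
func (noopTagger) end(dst []byte) []byte             { return dst }

func newTagger(tagged bool) tagger {
	if tagged {
		return &structTagger{}
	}
	return noopTagger{}
}

// paragraph emits body wrapped as a P element when tagging is on.
func paragraph(dst []byte, t tagger, body string) []byte {
	dst = t.begin(dst, "P")
	dst = append(dst, body...)
	return t.end(dst)
}

func main() {
	body := "BT /F1 12 Tf (Hello) Tj ET\n"
	fmt.Print(string(paragraph(nil, newTagger(true), body)))
	fmt.Println("---")
	fmt.Print(string(paragraph(nil, newTagger(false), body)))
}
```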
Validation
The `verapdf/` directory contains the veraPDF validation tool for automated compliance checking. Community members recommended it as the gold standard for PDF/A and PDF/UA validation.
4. pprof & Performance Profiling
Profiling Infrastructure
Server-side endpoints (internal/handlers/handlers.go), all gated to localhost only:
- `/debug/pprof/profile`: 30-second CPU profile
- `/debug/pprof/heap`: Heap profile
- `/debug/pprof/goroutine`: Goroutine dump
- `/debug/pprof/allocs`: Allocation profile
- `/debug/pprof/trace`: Execution trace
- Plus: cmdline, symbol, block, mutex, threadcreate

Opt-in heap dump on exit via `ENABLE_PROFILING=1` writes to `/tmp/mem.prof`.
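One way to gate the pprof endpoints to localhost is a small wrapper around the standard `net/http/pprof` handlers. The project serves these through Gin; the sketch below swaps in plain `net/http` to stay dependency-free, and `isLoopback`, `localOnly`, and `newDebugMux` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// isLoopback reports whether a "host:port" remote address is a loopback IP.
func isLoopback(remoteAddr string) bool {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// localOnly rejects any request that does not originate from loopback.
func localOnly(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isLoopback(r.RemoteAddr) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newDebugMux registers the profiling handlers behind the loopback check.
// pprof.Index also serves the named profiles (heap, goroutine, allocs, ...).
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/debug/pprof/", localOnly(http.HandlerFunc(pprof.Index)))
	mux.Handle("/debug/pprof/cmdline", localOnly(http.HandlerFunc(pprof.Cmdline)))
	mux.Handle("/debug/pprof/profile", localOnly(http.HandlerFunc(pprof.Profile)))
	mux.Handle("/debug/pprof/symbol", localOnly(http.HandlerFunc(pprof.Symbol)))
	mux.Handle("/debug/pprof/trace", localOnly(http.HandlerFunc(pprof.Trace)))
	return mux
}

func main() {
	mux := newDebugMux()
	for _, addr := range []string{"127.0.0.1:40000", "203.0.113.7:40000"} {
		req := httptest.NewRequest(http.MethodGet, "/debug/pprof/cmdline", nil)
		req.RemoteAddr = addr
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, req)
		fmt.Println(addr, "->", rec.Code)
	}
}
```

Behind a reverse proxy `RemoteAddr` is the proxy's address, so a check like this is only meaningful when the process is reached directly.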
CPU Hotspot Analysis
| Hotspot | Initial | After Optimization | Fix |
|---|---|---|---|
| `drawTable` (cumulative) | ~37% | ~17.73% | Hoisted scratch buffers, batched writes |
| `memclrNoHeapPointers` (flat) | 49.7% (under load) | 27.0% (under load) | Buffer pre-grow, pooling |
| `compress/flate` | ~20% | ~5–8% | Zlib writer pool |
| PNG decoding | Hot for duplicates | Eliminated | FNV-1a hash cache + singleflight |
| `BeginMarkedContentBuf` | ~6.8% | Reduced | Tagged PDF gating when not needed |
Heap Hotspot Analysis (5000-iteration benchmark)
- `bytes.growSlice`: 443.40 MB (59% of total), addressed by pre-growing buffers
- `compress/flate.NewWriter`: 88.34 MB cumulative, addressed by zlib pooling
- HTTP load: heap in-use reduced from 442 MB → 55 MB (−88%)
5. Architecture
Design Patterns in Practice
Facade Pattern: `pkg/gopdflib/` provides a clean public API surface delegating to internal packages. All public types are type aliases (`type PDFTemplate = models.PDFTemplate`).
Builder Pattern: `OutlineBuilder` (`internal/pdf/outline.go`, 505 lines) constructs the PDF outline tree with a fluent API.
Strategy / Adapter Patterns via Interfaces:
- `ObjectEncryptor` allows switching between AES-128, AES-256, RC4, and no-op encryption
- `SignaturePageContext` decouples the signature subsystem from `PageManager`
- `OCRProvider` allows plugging in different OCR backends for redaction
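The strategy shape behind `ObjectEncryptor` can be sketched as below. The method signature, the type names, and the per-object key construction are all assumptions made for illustration. RC4 appears only because it is the shortest cipher to demonstrate with the standard library, and the toy key mixing is not the key derivation the PDF specification defines.

```go
package main

import (
	"crypto/rc4"
	"fmt"
)

// objectEncryptor is a hypothetical stand-in for the ObjectEncryptor interface.
type objectEncryptor interface {
	encrypt(objNum int, data []byte) ([]byte, error)
}

// noopEncryptor is the strategy used when no security is configured.
type noopEncryptor struct{}

func (noopEncryptor) encrypt(_ int, data []byte) ([]byte, error) { return data, nil }

// rc4Encryptor mixes the object number into the key so that every object
// gets its own keystream (a toy scheme, not the PDF key derivation).
type rc4Encryptor struct{ key []byte }

func (e rc4Encryptor) encrypt(objNum int, data []byte) ([]byte, error) {
	key := append(append([]byte(nil), e.key...),
		byte(objNum), byte(objNum>>8), byte(objNum>>16))
	c, err := rc4.NewCipher(key)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(data))
	c.XORKeyStream(out, data)
	return out, nil
}

// writeObject is the only place the writer touches encryption; it does not
// care which strategy it was given.
func writeObject(enc objectEncryptor, objNum int, payload []byte) ([]byte, error) {
	return enc.encrypt(objNum, payload)
}

func main() {
	payload := []byte("BT (confidential) Tj ET")
	for _, enc := range []objectEncryptor{noopEncryptor{}, rc4Encryptor{key: []byte("demo-key")}} {
		out, err := writeObject(enc, 12, payload)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%T changed bytes: %v\n", enc, string(out) != string(payload))
	}
}
```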
Object Pool Pattern: Seven `sync.Pool` instances reduce GC pressure on hot paths.
Registry / Singleton Pattern: `CustomFontRegistry` with `CloneForGeneration()` gives each PDF generation a shallow clone with isolated usage tracking and `noLock: true` to avoid mutex overhead on single-threaded generation paths.
Component Pattern: PDF structure is built from typed element components (`Table`, `Spacer`, `Image`, `Footer`, `Title`, `Bookmark`) assembled via ordered `Element` slices.
Decoupled Architecture
- Data flows one way: Template → Parser → PageManager → ContentStreams → Assembly → Final PDF
- Font registry cloned per generation: eliminates mutex contention and makes concurrent generation safe
- Parallelism gated behind `runtime.NumCPU()` semaphore middleware: prevents goroutine thrashing
- Per-page zlib compression is parallel (`errgroup`), but assembly, encryption, and xref writing stay serial for deterministic object numbering
- `context.Context` not used in the PDF pipeline: keeps the hot path lightweight
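The split between parallel compression and serial assembly can be shown with a short sketch. It substitutes `sync.WaitGroup` plus a channel semaphore for `errgroup` so that it runs with the standard library only. `compressPage`, `compressPages`, and `assemble` are hypothetical names, and the object layout is simplified.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"runtime"
	"sync"
)

// compressPage deflates one page's content stream.
func compressPage(page []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := zlib.NewWriterLevel(&buf, zlib.BestSpeed)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(page); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// compressPages runs compressPage concurrently, bounded by NumCPU. Each
// goroutine writes only to its own index, so results keep page order.
func compressPages(pages [][]byte) ([][]byte, error) {
	out := make([][]byte, len(pages))
	errs := make([]error, len(pages))
	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup
	for i, p := range pages {
		wg.Add(1)
		sem <- struct{}{} // blocks once NumCPU workers are busy
		go func(i int, p []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			out[i], errs[i] = compressPage(p)
		}(i, p)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// assemble stays serial so object numbers follow page order deterministically.
func assemble(streams [][]byte) []byte {
	var doc bytes.Buffer
	for i, s := range streams {
		fmt.Fprintf(&doc, "%d 0 obj\n<</Length %d /Filter /FlateDecode>>\nstream\n", i+1, len(s))
		doc.Write(s)
		doc.WriteString("\nendstream\nendobj\n")
	}
	return doc.Bytes()
}

func main() {
	pages := [][]byte{[]byte("page one"), []byte("page two"), []byte("page three")}
	streams, err := compressPages(pages)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(assemble(streams)), "bytes assembled from", len(streams), "streams")
}
```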
6. CGO Python Bindings
The entire Go PDF engine is exported as a Python package via CGO shared library (bindings/python/cgo/exports.go, 437 lines):
- Compiles to a C shared library (`.so`/`.dylib`) using `go build -buildmode=c-shared`
- Exports 14 C-callable functions, including `GeneratePDF`, `MergePDFs`, `SplitPDF`, `FillPDFWithXFDF`, `ConvertHTMLToPDF`, `ConvertHTMLToImage`, `GetAvailableFonts`, `GetPageInfo`, `ExtractTextPositions`, `FindTextOccurrences`, `ApplyRedactions`, `ApplyRedactionsAdvanced`, `ParsePageSpec`
- Memory management via `FreeBytesResult` and `FreeBytesArrayResult` for caller-side cleanup
- Python package: `bindings/python/pypdfsuit/` with `setup.py`/`pyproject.toml` for PyPI distribution

The same high-performance Go engine powers both the Go and Python ecosystems. Only the initial function call crosses the CGO boundary; all PDF generation happens natively in Go memory.
7. Lessons from Open-Source Maintenance
Running GoPdfSuit as a public open-source project involved:
- User feedback loop: Real users tested the library in production and provided actionable feedback on XFDF form filling, redaction behavior, and PDF/A validation.
- Issue-driven development: Features like PDF splitting, secure redaction with OCR, HTML-to-PDF conversion, and Python CGO bindings were driven by user requests.
- Tooling suggestions: Community members recommended veraPDF as the gold standard validator.
- Documentation shaped by users: The React playground, API docs, and benchmarks were built around what users actually needed to understand.
- Clean repository structure: Library (`pkg/gopdflib/`), engine (`internal/pdf/`), web app, benchmarks, guides, and bindings are cleanly separated.
8. Benchmark Performance
GoPdfSuit was benchmarked against a financial PDF generation workload inspired by the publicly documented infrastructure of Zerodha (India's largest retail brokerage), which uses Typst/LaTeX CLI tools on a 40-node Nomad cluster to generate ~1.5 million digitally signed PDF contract notes daily.
Note: This comparison should be interpreted carefully. Zerodha runs a full production pipeline with signing, distribution, and fault tolerance on a distributed cluster. The GoPdfSuit benchmark measures a local library's raw generation throughput. The numbers highlight the performance ceiling achievable when generation is the only concern, not an apples-to-apples production comparison.
| Metric | GoPdfSuit (1 node, 24 vCPUs) |
|---|---|
| Throughput (peak) | 2,061 ops/s |
| Throughput (10-run avg) | 1,705 ops/s |
| Serial 2000-row PDF/A | 31–36 ms/op |
| Heap in-use under load | 55 MB |
| Time for 1.5M PDFs | ~15 minutes (single node) |
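The time in the last row follows directly from the throughput rows. As a quick check (a sketch, with `batchDuration` as a hypothetical helper), 1.5 million documents at the 10-run average of 1,705 ops/s take a little under 15 minutes, and about 12 minutes at the 2,061 ops/s peak:

```go
package main

import (
	"fmt"
	"time"
)

// batchDuration estimates wall-clock time for docs documents at a steady
// throughput of opsPerSec, ignoring warm-up and I/O outside generation.
func batchDuration(docs int, opsPerSec float64) time.Duration {
	return time.Duration(float64(docs) / opsPerSec * float64(time.Second))
}

func main() {
	const docs = 1_500_000
	fmt.Println("at 10-run average:", batchDuration(docs, 1705).Round(time.Second))
	fmt.Println("at peak:          ", batchDuration(docs, 2061).Round(time.Second))
}
```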
Why It's Fast
- Native binary generation: Generates PDF binary structure directly in RAM, with no external process spawning.
- Zero I/O overhead: No temporary files and no disk I/O; bytes are streamed directly in memory.
- Goroutine concurrency: Thousands of lightweight goroutines saturate all cores without OS thread overhead.
- Asset reuse: Font subsets and image assets are processed once and reused across millions of documents.
9. Technology Stack
Go Backend (Gin)
- Entry point: `cmd/gopdfsuit/main.go`
- Framework: Gin (release mode) with custom lightweight panic recovery
- Concurrency control: Semaphore middleware sized to `runtime.NumCPU()`
- Routes: Serves the Vite-built React SPA plus 14 API endpoints under `/api/v1` for PDF generation, merging, splitting, XFDF filling, HTML-to-PDF, redaction, font management, and OCR
- Middleware: CORS, Google OAuth (Cloud Run only), semaphore-based concurrency gating
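A semaphore middleware of this kind is a buffered channel used as a counting lock. The sketch below uses plain `net/http` instead of Gin, queues excess requests and does not reject them (an assumption about the intended behaviour), and measures the peak concurrency it allows. `limitConcurrency`, `peakInFlight`, and the request path are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime"
	"sync"
	"time"
)

// limitConcurrency lets at most n requests run next at the same time.
// Waiting requests give up if their context is cancelled first.
func limitConcurrency(n int, next http.Handler) http.Handler {
	sem := make(chan struct{}, n)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		case <-r.Context().Done():
			http.Error(w, "cancelled while waiting", http.StatusServiceUnavailable)
		}
	})
}

// peakInFlight fires `requests` concurrent requests through the limiter and
// returns the highest number of handlers observed running at once.
func peakInFlight(limit, requests int) int {
	var mu sync.Mutex
	cur, peak := 0, 0
	h := limitConcurrency(limit, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		mu.Lock()
		cur++
		if cur > peak {
			peak = cur
		}
		mu.Unlock()
		time.Sleep(5 * time.Millisecond) // stand-in for PDF generation
		mu.Lock()
		cur--
		mu.Unlock()
	}))

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			req := httptest.NewRequest(http.MethodPost, "/api/v1/generate", nil)
			h.ServeHTTP(httptest.NewRecorder(), req)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	limit := runtime.NumCPU()
	fmt.Printf("limit=%d peak=%d\n", limit, peakInFlight(limit, limit*4))
}
```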
React Frontend (Vite SPA)
- 12 pages: Home, Editor, Viewer, Merge, Split, Filler, HtmlToPdf, HtmlToImage, Comparison, Documentation, Redaction, Screenshots
- React Router v6 with `HashRouter` for GitHub Pages compatibility
- MUI (Material UI) components with a custom theme
- `AuthGuard` component for OAuth-gated routes
- Vite builds output to `docs/`, served by Go backend as static assets
10. GCloud Deployments
GCP Deployment & Architecture
- Hands-on GCP learning: End-to-end experience deploying a Go and React application on Google Cloud Platform.
- Strategic architecture decisions: Analyzed and compared Google App Engine vs. Cloud Run for optimal deployment strategy based on cost, scalability, and performance.
- Resource optimization: Dual-deployment approach with an App Engine F1 instance class for standard hosting alongside tailored Cloud Run instances (512 MiB memory ceiling), optimized via `K_SERVICE` environment detection.
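Cloud Run sets the `K_SERVICE` environment variable in every container it starts, so detecting the platform is a single lookup. The sketch below shows one way to branch on it; `runtimeProfile`, `detectProfile`, and the fields they carry are hypothetical, with the OAuth flag reflecting the "Cloud Run only" note in the technology stack section.

```go
package main

import (
	"fmt"
	"os"
)

// runtimeProfile holds the settings that differ between deployments.
type runtimeProfile struct {
	name        string
	enableOAuth bool
}

// detectProfile takes the lookup function as a parameter so it can be tested
// without touching the real process environment.
func detectProfile(getenv func(string) string) runtimeProfile {
	if getenv("K_SERVICE") != "" {
		return runtimeProfile{name: "cloud-run", enableOAuth: true}
	}
	return runtimeProfile{name: "default", enableOAuth: false}
}

func main() {
	p := detectProfile(os.Getenv)
	fmt.Printf("profile=%s oauth=%v\n", p.name, p.enableOAuth)
}
```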
Project Configuration
-
App Engine Standard setup Configured
app.yamlto manage runtime (go124), strict autoscaling limits, custom entry points, and environment variables (Google OAuth, Cloud Run URLs, Vite configurations). -
React frontend integration Unified the React SPA (built via Vite) with the Go server binary, configuring Gin's
StaticFSmiddleware to serve static assets alongside custom SPA fallback routing. - Security & CORS Google OAuth middleware gates specific routes; precise CORS permissions enable communication between the GitHub Pages frontend and the Cloud Run API.
Docker
- Multi-stage Docker build with separate builder and runtime stages
- Cloud Run optimized variant (`Dockerfile_cloudrun`)
Key Takeaways
- Profile everything, don't guess. pprof showed `memclr` was 49.7% of CPU under load, something no amount of code review would have caught.
- Respect the allocator. Pre-growing buffers, pooling, and avoiding `string` intermediates are the difference between 442 MB and 55 MB heap.
- Interface design pays off. The `ObjectEncryptor` and `SignaturePageContext` interfaces made security features composable without entangling the hot path.
- Compliance is incremental. PDF/A-4 and PDF/UA-2 felt overwhelming until broken into concrete checklist items: fonts, ICC profiles, XMP metadata, structure tree, marked content.
- Benchmark honestly. Distinguish peak from average, report worker counts, and be transparent about what's being compared.
- Use AI as a tool, not a crutch. AI pair programming accelerated development significantly, but every optimization was validated against measured benchmarks.
What started as a simple XFDF parser side project evolved into a fully compliant PDF engine supporting PDF/A-4 and PDF/UA-2, built in roughly three months with the help of AI pair programming. For teams replacing licensed commercial PDF tooling, a native Go engine like this can mean a meaningful reduction in infrastructure cost.
Repository: github.com/chinmay-sawant/gopdfsuit