The Problem with PDF Libraries
Every Go developer who has worked with PDFs knows the pain:
- UniPDF: Powerful, but starts at $299/month
- pdfcpu: Great for manipulation, but no table extraction
- gofpdf: Creation only, abandoned since 2019
I needed to extract tables from bank statements: 740 transactions spread across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open-source project.
So I built GxPDF.
What is GxPDF?
GxPDF is a pure Go PDF library that handles both reading and creation. No CGO. No external dependencies. MIT licensed.
# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
# Or use as library
go get github.com/coregx/gxpdf@v0.1.0
The Key Innovation: 4-Pass Hybrid Detection
Table extraction is hard. PDFs have no concept of a "table"; they contain only text elements positioned at coordinates. Most algorithms fail on:
- Multi-line cells (descriptions that wrap)
- Missing borders (modern designs)
- Merged cells
- Distinguishing headers from data rows
GxPDF uses a 4-Pass Hybrid Detection algorithm:
Pass 1: Gap Detection (adaptive threshold)
Pass 2: Overlap Detection (Tabula-inspired)
Pass 3: Alignment Detection (geometric clustering)
Pass 4: Multi-line Cell Merger (amount-based discrimination)
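To make Pass 1 more concrete, here is a minimal sketch of gap detection with an adaptive threshold. It is not GxPDF's actual implementation: the `span` type, the `columnBreaks` function, and the rule "a gap is a column break if it is wider than `factor` times the median gap on the line" are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical horizontal extent of one text fragment, in PDF points.
type span struct{ X0, X1 float64 }

// columnBreaks returns the x-positions (gap midpoints) where the horizontal
// gap between neighbouring fragments is wider than factor × the median gap.
// The threshold adapts to the line: tight word spacing lowers it, loose
// spacing raises it.
func columnBreaks(spans []span, factor float64) []float64 {
	if len(spans) < 2 {
		return nil
	}
	// Work on a sorted copy so the caller's slice is left untouched.
	s := append([]span(nil), spans...)
	sort.Slice(s, func(i, j int) bool { return s[i].X0 < s[j].X0 })

	type gap struct{ width, mid float64 }
	var gaps []gap
	right := s[0].X1 // right-most edge seen so far
	for _, sp := range s[1:] {
		if w := sp.X0 - right; w > 0 {
			gaps = append(gaps, gap{w, (right + sp.X0) / 2})
		}
		if sp.X1 > right {
			right = sp.X1
		}
	}
	if len(gaps) == 0 {
		return nil
	}

	// Lower median of the gap widths.
	widths := make([]float64, len(gaps))
	for i, g := range gaps {
		widths[i] = g.width
	}
	sort.Float64s(widths)
	threshold := factor * widths[(len(widths)-1)/2]

	var breaks []float64
	for _, g := range gaps {
		if g.width > threshold {
			breaks = append(breaks, g.mid)
		}
	}
	return breaks
}

func main() {
	// One statement line: date, four description words, amount.
	line := []span{{50, 100}, {140, 165}, {168, 210}, {213, 223}, {226, 255}, {320, 370}}
	fmt.Println(columnBreaks(line, 3)) // two breaks: x=120 and x=287.5
}
```

A per-line median breaks down when every gap on the line is a column gap (for example, a row of three short cells), so a real detector would more plausibly derive the threshold from the whole page or from font metrics.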
The key insight for Pass 4: transaction rows have monetary amounts, continuation rows don't.
// Worked on every tested bank without configuration
isTransactionRow := hasAmount(row) // Has amount = new transaction
isContinuation := !hasAmount(row) // No amount = continuation of previous
This discriminator held across the PDF generators, layouts, and bank formats in our test set.
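The amount-based rule can be sketched end to end as follows. This is an illustration under stated assumptions, not the code in GxPDF: rows are modelled as `[]string`, the `amountRe` pattern is a guess at common amount formats, and `hasAmount` and `mergeContinuations` are hypothetical implementations.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountRe is an illustrative pattern for amounts such as "1,234.56",
// "-500.00" or "1 234,56". Real statements may need a broader pattern.
var amountRe = regexp.MustCompile(`^[+-]?\d{1,3}(?:[ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row looks like a monetary amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountRe.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds rows without an amount into the previous row,
// joining cell text column by column. The first row is always kept as-is,
// so a header row survives even though it has no amount.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		prev := out[len(out)-1]
		for i, cell := range row {
			cell = strings.TrimSpace(cell)
			if cell == "" || i >= len(prev) {
				continue
			}
			prev[i] = strings.TrimSpace(prev[i] + " " + cell)
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"Date", "Description", "Amount"},
		{"01.02.2024", "Card payment ACME", "-1,234.56"},
		{"", "Store #12, Moscow", ""}, // wrapped description
		{"02.02.2024", "Salary", "50,000.00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Printf("%q\n", r)
	}
}
```

Note that the pattern deliberately rejects dates like `12.01.2024`, which would otherwise be the most likely false positive in a statement row.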
Results: 100% Accuracy
Tested on real bank statements:
| Bank | Transactions | Accuracy |
|---|---|---|
| Sberbank | 242 | 100% |
| Alfa-Bank | 281 | 100% |
| VTB | 217 | 100% |
| Total | 740 | 100% |
Every transaction extracted correctly. Every multi-line description preserved.
Code Examples
Extract Tables from PDF
package main

import (
	"fmt"
	"log"

	"github.com/coregx/gxpdf"
)

func main() {
	// Open PDF
	doc, err := gxpdf.Open("bank_statement.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	// Extract all tables
	tables := doc.ExtractTables()
	for _, t := range tables {
		fmt.Printf("Table: %d rows x %d cols\n",
			t.RowCount(), t.ColumnCount())

		// Access rows
		for _, row := range t.Rows() {
			fmt.Println(row)
		}
	}
}
Export to CSV/JSON
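// Export to CSV (names avoid shadowing the csv and json packages)
csvText, err := table.ToCSV()
if err != nil {
	log.Fatal(err)
}
fmt.Println(csvText)
// Export to JSON
jsonText, err := table.ToJSON()
if err != nil {
	log.Fatal(err)
}
fmt.Println(jsonText)
// Write to file
file, err := os.Create("output.csv")
if err != nil {
	log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)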
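Once a table is exported, the CSV can be consumed with the standard library alone. The sketch below sums an amount column using `encoding/csv`. The column layout (date, description, amount), the header row, and the comma thousands separator are assumptions about the export, not guarantees made by GxPDF, and `totalCents` is a hypothetical helper.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"math"
	"strconv"
	"strings"
)

// totalCents sums one column of a CSV export, skipping the header row.
// Amounts are converted to integer cents so float error does not accumulate.
func totalCents(csvText string, col int) (int64, error) {
	records, err := csv.NewReader(strings.NewReader(csvText)).ReadAll()
	if err != nil {
		return 0, err
	}
	var total int64
	for i, rec := range records {
		if i == 0 {
			continue // header row
		}
		if col >= len(rec) {
			return 0, fmt.Errorf("row %d: missing column %d", i, col)
		}
		// Drop thousands separators such as the comma in "1,234.56".
		clean := strings.TrimSpace(strings.ReplaceAll(rec[col], ",", ""))
		v, err := strconv.ParseFloat(clean, 64)
		if err != nil {
			return 0, fmt.Errorf("row %d: %w", i, err)
		}
		total += int64(math.Round(v * 100))
	}
	return total, nil
}

func main() {
	export := "Date,Description,Amount\n" +
		"01.02.2024,Card payment,\"-1,234.56\"\n" +
		"02.02.2024,Salary,\"50,000.00\"\n"
	cents, err := totalCents(export, 2)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("net: %d.%02d\n", cents/100, cents%100)
}
```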
Create PDFs
package main

import (
	"log"

	"github.com/coregx/gxpdf/creator"
)

func main() {
	c := creator.New()
	c.SetTitle("Invoice")
	c.SetAuthor("GxPDF")

	page, err := c.NewPage()
	if err != nil {
		log.Fatal(err)
	}

	// Add text with Standard 14 fonts
	page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
	page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

	// Draw graphics
	opts := &creator.RectOptions{
		StrokeColor: &creator.Black,
		FillColor:   &creator.LightGray,
		StrokeWidth: 1.0,
	}
	page.DrawRect(100, 600, 400, 50, opts)

	// Save
	if err := c.WriteToFile("invoice.pdf"); err != nil {
		log.Fatal(err)
	}
}
CLI Tool
GxPDF includes a CLI for quick operations:
# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json
# Get PDF info
gxpdf info document.pdf
# Extract text
gxpdf text document.pdf
# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf
# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf
Feature Matrix
| Feature | Status |
|---|---|
| Table Extraction | Supported (740/740 transactions on tested bank statements) |
| Text Extraction | Supported |
| Image Extraction | Supported |
| PDF Creation | Supported |
| Standard 14 Fonts | All 14 |
| Embedded Fonts | TTF/OTF |
| Graphics | Lines, Rectangles, Circles, Bezier |
| Encryption | RC4 + AES-128/256 |
| Export | CSV, JSON, Excel |
Architecture
internal/
├── document/ # Document model
├── encoding/ # FlateDecode, DCTDecode
├── extractor/ # Text, image, graphics
├── fonts/ # Standard 14 + embedding
├── models/ # Data structures
├── parser/ # PDF parsing
├── reader/ # PDF reader
├── security/ # RC4/AES encryption
├── tabledetect/ # 4-Pass Hybrid algorithm
└── writer/ # PDF generation
Clean separation. No CGO. Pure Go from top to bottom.
Performance
Table extraction on a 15-page bank statement:
- Time: ~200ms
- Memory: ~15MB peak
- Allocations: Optimized with sync.Pool
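For readers unfamiliar with the pattern, this is roughly what buffer reuse with `sync.Pool` looks like. It is a generic sketch, not code from GxPDF: `bufPool` and `joinCells` are hypothetical names, and the real library may pool different objects.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a new
// bytes.Buffer for every row they format.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// joinCells builds one pipe-separated line using a pooled buffer.
func joinCells(cells []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // clear contents before returning the buffer to the pool
		bufPool.Put(buf)
	}()
	for i, c := range cells {
		if i > 0 {
			buf.WriteByte('|')
		}
		buf.WriteString(c)
	}
	// String copies the bytes, so the result stays valid after Reset.
	return buf.String()
}

func main() {
	fmt.Println(joinCells([]string{"01.02.2024", "Salary", "50,000.00"}))
}
```

The pool only pays off when the pooled object is expensive enough to allocate and is requested frequently; objects in the pool may also be dropped by the garbage collector at any time, so it must not be used as a cache.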
PDF creation benchmarks:
BenchmarkNewPage-8 50000 28.4 µs/op
BenchmarkAddText-8 100000 11.2 µs/op
BenchmarkWriteToFile-8 5000 312.5 µs/op
What's Next
The v0.1.0 release covers the core functionality. Planned for future releases:
- Form Filling: Fill existing PDF forms
- Digital Signatures: Sign PDFs cryptographically
- SVG Import: Vector graphics
- PDF Rendering: Convert pages to images
We Need Your PDFs
This is v0.1.0, our first public release. We've tested on bank statements, invoices, and reports, but PDFs are endlessly diverse.
We need testers with real documents:
- Corporate reports with complex tables
- Invoices from different countries and formats
- Scanned documents with OCR layers
- Multi-language PDFs (CJK, Arabic, Hebrew)
- Legacy PDFs from old generators
- Edge cases that break other libraries
If GxPDF fails on your document, that's valuable data. Open an issue, attach the PDF (or a sanitized version), and we'll fix it.
Our goal is enterprise-grade quality. "Good enough for hobby projects" is not the target; we want GxPDF to handle production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.
Some rough edges remain, but the architecture is solid, the core algorithms work, and we're committed to making this the PDF library Go deserves.
Try It
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version
Repository: github.com/coregx/gxpdf
Documentation and examples are in the repo. Issues and PRs are welcome.
GxPDF is MIT licensed. Built for the Go community who needed a real PDF library without commercial restrictions.