Andrey Kolkov

Posted on Jan 7

GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go

#go #pdf #opensource #database

The Problem with PDF Libraries

Every Go developer who has worked with PDFs knows the pain:

UniPDF: Powerful, but starts at $299/month
pdfcpu: Great for manipulation, but no table extraction
gofpdf: Creation only, abandoned since 2019

I needed to extract tables from bank statements. 740 transactions across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open-source project.

So I built GxPDF.

What is GxPDF?

GxPDF is a pure Go PDF library that handles both reading and creation. No CGO. No external dependencies. MIT licensed.

# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0

# Or use as library
go get github.com/coregx/gxpdf@v0.1.0

The Key Innovation: 4-Pass Hybrid Detection

Table extraction is hard. A PDF has no "table" object: it stores text fragments at page coordinates, and rows and columns have to be inferred from those positions. Most algorithms fail on:

Multi-line cells (descriptions that wrap)
Missing borders (modern designs)
Merged cells
Headers vs data discrimination

GxPDF uses a 4-Pass Hybrid Detection algorithm:

Pass 1: Gap Detection (adaptive threshold)
Pass 2: Overlap Detection (Tabula-inspired)
Pass 3: Alignment Detection (geometric clustering)
Pass 4: Multi-line Cell Merger (amount-based discrimination)
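
To make Pass 1 more concrete, here is a minimal sketch of gap detection with an adaptive threshold. It is an illustration only, not GxPDF's internal code: the span type, the columnGaps function, and the choice of "factor times the median gap" as the threshold are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical horizontal extent of one text fragment on a line.
type span struct{ x0, x1 float64 }

// columnGaps returns the midpoints of horizontal gaps that are wide enough
// to separate columns. The threshold adapts to the input: it is factor
// times the median gap width, so ordinary word spacing is ignored.
func columnGaps(spans []span, factor float64) []float64 {
	if len(spans) < 2 {
		return nil
	}
	s := append([]span(nil), spans...) // copy so the caller's slice is not reordered
	sort.Slice(s, func(i, j int) bool { return s[i].x0 < s[j].x0 })

	type gap struct{ mid, width float64 }
	var gaps []gap
	right := s[0].x1 // rightmost edge covered so far
	for _, sp := range s[1:] {
		if sp.x0 > right {
			gaps = append(gaps, gap{mid: (right + sp.x0) / 2, width: sp.x0 - right})
		}
		if sp.x1 > right {
			right = sp.x1
		}
	}
	if len(gaps) == 0 {
		return nil
	}

	widths := make([]float64, len(gaps))
	for i, g := range gaps {
		widths[i] = g.width
	}
	sort.Float64s(widths)
	threshold := factor * widths[len(widths)/2]

	var out []float64
	for _, g := range gaps {
		if g.width >= threshold {
			out = append(out, g.mid)
		}
	}
	return out
}

func main() {
	// One statement line: a date, a four-word description, and an amount.
	line := []span{{50, 80}, {120, 145}, {148, 190}, {193, 203}, {206, 235}, {300, 340}}
	fmt.Println(columnGaps(line, 3)) // prints [100 267.5]
}
```

The 3-unit word gaps fall below the threshold, so only the two wide gaps are reported as column boundaries. A real detector would aggregate gaps over many lines before deciding.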

The key insight for Pass 4: transaction rows have monetary amounts, continuation rows don't.

// Worked on every bank statement we tested, without per-bank configuration
isTransactionRow := hasAmount(row)  // Has amount = new transaction
isContinuation := !hasAmount(row)   // No amount = continuation of previous

In our tests, this single discriminator worked across different PDF generators, layouts, and bank formats.
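
Here is a small, self-contained sketch of how amount-based merging could look. It is not the code in GxPDF's tabledetect package: the amountRe pattern, the hasAmount body, and mergeContinuations are assumptions written for illustration, and the regular expression only covers amounts with two decimal places such as "-1 250,00" or "+300.50".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountRe is an assumed pattern for monetary amounts: optional sign,
// digits with optional space or comma thousands separators, and exactly
// two decimals after a dot or comma.
var amountRe = regexp.MustCompile(`^[+-]?\d{1,3}(?:[ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row looks like an amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountRe.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds every row without an amount into the previous
// row, appending its non-empty cells column by column.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		last := out[len(out)-1]
		for i, cell := range row {
			cell = strings.TrimSpace(cell)
			if cell == "" || i >= len(last) {
				continue
			}
			last[i] = strings.TrimSpace(last[i] + " " + cell)
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"2025-01-07", "Card payment", "-1 250,00"},
		{"", "SUPERMARKET 24", ""}, // wrapped description, no amount
		{"2025-01-08", "Salary", "+45 000,00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Println(strings.Join(r, " | "))
	}
}
```

A pattern this loose can misread short dates such as "01.02" as amounts, so a production version would need stricter column-aware checks.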

Results: 100% Accuracy

Tested on real bank statements:

Bank	Transactions	Accuracy
Sberbank	242	100%
Alfa-Bank	281	100%
VTB	217	100%
Total	740	100%

Every transaction extracted correctly. Every multi-line description preserved.

Code Examples

Extract Tables from PDF

package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()

    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}

Export to CSV/JSON

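// Export to CSV (table is one of the tables from the previous example)
csvText, err := table.ToCSV()
if err != nil {
    log.Fatal(err)
}
fmt.Println(csvText)

// Export to JSON
jsonText, err := table.ToJSON()
if err != nil {
    log.Fatal(err)
}
fmt.Println(jsonText)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)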

Create PDFs

package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, err := c.NewPage()
    if err != nil {
        log.Fatal(err)
    }

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}

CLI Tool

GxPDF includes a CLI for quick operations:

# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json

# Get PDF info
gxpdf info document.pdf

# Extract text
gxpdf text document.pdf

# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf

# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf

Feature Matrix

Feature	Status
Table Extraction	100% on the 740-transaction test set
Text Extraction	Supported
Image Extraction	Supported
PDF Creation	Supported
Standard 14 Fonts	All 14
Embedded Fonts	TTF/OTF
Graphics	Lines, Rectangles, Circles, Bezier
Encryption	RC4 + AES-128/256
Export	CSV, JSON, Excel

Architecture

internal/
├── document/       # Document model
├── encoding/       # FlateDecode, DCTDecode
├── extractor/      # Text, image, graphics
├── fonts/          # Standard 14 + embedding
├── models/         # Data structures
├── parser/         # PDF parsing
├── reader/         # PDF reader
├── security/       # RC4/AES encryption
├── tabledetect/    # 4-Pass Hybrid algorithm
└── writer/         # PDF generation

Clean separation. No CGO. Pure Go from top to bottom.

Performance

Table extraction on a 15-page bank statement:

Time: ~200ms
Memory: ~15MB peak
Allocations: Optimized with sync.Pool
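
For readers who have not used sync.Pool, this is the general pattern: reuse scratch buffers across calls instead of allocating a new one each time. The sketch below is generic Go, not an excerpt from GxPDF; bufPool and joinCells are hypothetical names chosen for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths avoid a fresh
// allocation per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// joinCells renders one row as a tab-separated line using a pooled buffer.
func joinCells(cells []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold old data
	defer bufPool.Put(buf) // return it for the next caller

	for i, c := range cells {
		if i > 0 {
			buf.WriteByte('\t')
		}
		buf.WriteString(c)
	}
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	fmt.Println(joinCells([]string{"2025-01-07", "Card payment", "-1 250,00"}))
}
```

The pool pays off when the same kind of temporary object is created many times per document, for example once per text fragment or per row.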

PDF creation benchmarks:

BenchmarkNewPage-8           50000    28.4 µs/op
BenchmarkAddText-8          100000    11.2 µs/op
BenchmarkWriteToFile-8        5000   312.5 µs/op

What's Next

The v0.1.0 release covers the core functionality. Planned for future releases:

Form Filling: Fill existing PDF forms
Digital Signatures: Sign PDFs cryptographically
SVG Import: Vector graphics
PDF Rendering: Convert pages to images

We Need Your PDFs

This is v0.1.0 — our first public release. We've tested on bank statements, invoices, and reports. But PDFs are infinitely diverse.

We need testers with real documents:

Corporate reports with complex tables
Invoices from different countries and formats
Scanned documents with OCR layers
Multi-language PDFs (CJK, Arabic, Hebrew)
Legacy PDFs from old generators
Edge cases that break other libraries

If GxPDF fails on your document, that's valuable data. Open an issue, attach the PDF (or a sanitized version), and we'll fix it.

Our goal is enterprise-grade quality. Not "good enough for hobby projects" — we want GxPDF to handle production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.

This is v0.1.0. Some rough edges exist. But the architecture is solid, the core algorithms work, and we're committed to making this the PDF library Go deserves.

Try It

go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version

Repository: github.com/coregx/gxpdf

Documentation and examples in the repo. Issues and PRs welcome.

GxPDF is MIT licensed. Built for Go developers who need a real PDF library without commercial restrictions.
