Andrey Kolkov

GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go

The Problem with PDF Libraries

Every Go developer who has worked with PDFs knows the pain:

  • UniPDF: Powerful, but starts at $299/month
  • pdfcpu: Great for manipulation, but no table extraction
  • gofpdf: Creation only, abandoned since 2019

I needed to extract tables from bank statements. 740 transactions across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open-source project.

So I built GxPDF.

What is GxPDF?

GxPDF is a pure Go PDF library that handles both reading and creation. No CGO. No external dependencies. MIT licensed.

# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0

# Or use as library
go get github.com/coregx/gxpdf@v0.1.0

The Key Innovation: 4-Pass Hybrid Detection

Table extraction is hard. A PDF has no notion of a "table": it only stores text elements positioned at page coordinates. Most algorithms fail on:

  • Multi-line cells (descriptions that wrap)
  • Missing borders (modern designs)
  • Merged cells
  • Headers vs data discrimination

GxPDF uses a 4-Pass Hybrid Detection algorithm:

Pass 1: Gap Detection (adaptive threshold)
Pass 2: Overlap Detection (Tabula-inspired)
Pass 3: Alignment Detection (geometric clustering)
Pass 4: Multi-line Cell Merger (amount-based discrimination)
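To make "adaptive threshold" in Pass 1 more concrete, here is a minimal sketch of one way it could work. This is not GxPDF's internal code: the `Span` type, the `columnBreaks` function, and the rule "a gap is a column break if it exceeds a factor times the median gap" are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical horizontal extent of one text element on a row.
type Span struct {
	X0, X1 float64 // left and right edges, in PDF points
}

// columnBreaks returns the midpoints of gaps that are wider than
// factor * median gap. Using the median lets the threshold adapt to the
// word spacing of the row instead of relying on a fixed constant.
func columnBreaks(spans []Span, factor float64) []float64 {
	if len(spans) < 2 {
		return nil
	}
	sorted := append([]Span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].X0 < sorted[j].X0 })

	// gaps[k] is the horizontal gap between sorted[k] and sorted[k+1].
	gaps := make([]float64, 0, len(sorted)-1)
	for i := 1; i < len(sorted); i++ {
		g := sorted[i].X0 - sorted[i-1].X1
		if g < 0 {
			g = 0 // overlapping elements have no gap
		}
		gaps = append(gaps, g)
	}

	ordered := append([]float64(nil), gaps...)
	sort.Float64s(ordered)
	threshold := factor * ordered[len(ordered)/2]

	var breaks []float64
	for k, g := range gaps {
		if g > threshold {
			breaks = append(breaks, sorted[k].X1+g/2)
		}
	}
	return breaks
}

func main() {
	// One row: a date, four description words, and an amount.
	row := []Span{{50, 100}, {150, 185}, {188, 200}, {203, 245}, {248, 285}, {400, 450}}
	fmt.Println(columnBreaks(row, 3)) // [125 342.5]
}
```

The 3-point gaps between description words set the median, so only the 50-point and 115-point gaps are reported as column boundaries. A fixed threshold would need retuning for every font size.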

The key insight for Pass 4: transaction rows have monetary amounts, continuation rows don't.

// Works on ALL banks without configuration
isTransactionRow := hasAmount(row)  // Has amount = new transaction
isContinuation := !hasAmount(row)   // No amount = continuation of previous

The same rule held for all three banks I tested, despite their different PDF generators and layouts.
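Here is a self-contained sketch of the amount-based merge idea. It is an illustration under assumptions, not the code inside GxPDF: the regular expression, the `[]string` row representation, and `mergeContinuations` are hypothetical, and `hasAmount` is my guess at what such a helper could look like.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountRe is an assumed pattern for amounts such as "1,234.56", "-500.00"
// or "1 234,56". Real statements may need a broader pattern.
var amountRe = regexp.MustCompile(`^[+-]?\d{1,3}([ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row looks like a monetary amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountRe.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds rows without an amount into the previous row,
// joining non-empty cells column by column. A leading row without an amount
// (for example a header) is kept as its own row.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		prev := out[len(out)-1]
		for i, cell := range row {
			if i < len(prev) && strings.TrimSpace(cell) != "" {
				prev[i] = strings.TrimSpace(prev[i] + " " + cell)
			}
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"12.01.2024", "Payment for consulting", "-1,500.00"},
		{"", "services, invoice 42", ""},
		{"13.01.2024", "Salary", "3,000.00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Printf("%q\n", r)
	}
}
```

Note that the date `12.01.2024` does not match the pattern, because an amount must end in exactly two decimal digits. That is what keeps a date-only cell from being mistaken for a transaction.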

Results: 100% Accuracy

Tested on real bank statements:

Bank        Transactions   Accuracy
Sberbank    242            100%
Alfa-Bank   281            100%
VTB         217            100%
Total       740            100%

Every transaction extracted correctly. Every multi-line description preserved.

Code Examples

Extract Tables from PDF

package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()

    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}

Export to CSV/JSON

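// Export to CSV (csvText avoids shadowing the encoding/csv package name)
csvText, err := table.ToCSV()
if err != nil {
    log.Fatal(err)
}
fmt.Println(csvText)

// Export to JSON
jsonText, err := table.ToJSON()
if err != nil {
    log.Fatal(err)
}
fmt.Println(jsonText)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)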
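If you need a different delimiter (semicolons are common where the comma is the decimal separator), you can also write the rows yourself with the standard library. The sketch below assumes you already have the table contents as a `[][]string`; how you get that from a GxPDF table is not shown here, and `writeRows` and `toDelimited` are hypothetical helper names.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// writeRows writes rows as delimiter-separated values using encoding/csv.
// WriteAll flushes the writer and returns any write or flush error.
func writeRows(w io.Writer, rows [][]string, delim rune) error {
	cw := csv.NewWriter(w)
	cw.Comma = delim
	return cw.WriteAll(rows)
}

// toDelimited is a convenience wrapper that returns the output as a string.
func toDelimited(rows [][]string, delim rune) (string, error) {
	var sb strings.Builder
	if err := writeRows(&sb, rows, delim); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	rows := [][]string{
		{"Date", "Description", "Amount"},
		{"12.01.2024", "Payment; invoice 42", "-1500,00"},
	}
	out, err := toDelimited(rows, ';')
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

`encoding/csv` quotes the description automatically because it contains the delimiter, while the amount with a decimal comma is left unquoted.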

Create PDFs

package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, err := c.NewPage()
    if err != nil {
        log.Fatal(err)
    }

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}

CLI Tool

GxPDF includes a CLI for quick operations:

# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json

# Get PDF info
gxpdf info document.pdf

# Extract text
gxpdf text document.pdf

# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf

# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf

Feature Matrix

Feature             Status
Table Extraction    100% accuracy
Text Extraction     Supported
Image Extraction    Supported
PDF Creation        Supported
Standard 14 Fonts   All 14
Embedded Fonts      TTF/OTF
Graphics            Lines, Rectangles, Circles, Bezier
Encryption          RC4 + AES-128/256
Export              CSV, JSON, Excel

Architecture

internal/
├── document/       # Document model
├── encoding/       # FlateDecode, DCTDecode
├── extractor/      # Text, image, graphics
├── fonts/          # Standard 14 + embedding
├── models/         # Data structures
├── parser/         # PDF parsing
├── reader/         # PDF reader
├── security/       # RC4/AES encryption
├── tabledetect/    # 4-Pass Hybrid algorithm
└── writer/         # PDF generation

Clean separation. No CGO. Pure Go from top to bottom.

Performance

Table extraction on a 15-page bank statement:

  • Time: ~200ms
  • Memory: ~15MB peak
  • Allocations: Optimized with sync.Pool
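For readers who haven't used `sync.Pool`, this is the general pattern: reuse scratch buffers across calls instead of allocating a new one each time. The example is a generic sketch, not taken from GxPDF's source; `bufPool` and `joinCells` are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so a hot path does not allocate
// a fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// joinCells concatenates cells with a separator using a pooled buffer.
func joinCells(cells []string, sep string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	for i, c := range cells {
		if i > 0 {
			buf.WriteString(sep)
		}
		buf.WriteString(c)
	}
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	fmt.Println(joinCells([]string{"12.01.2024", "Salary", "3,000.00"}, " | "))
}
```

The pool pays off when the same kind of buffer is needed many times per page; for one-off allocations it adds complexity without a measurable gain.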

PDF creation benchmarks:

BenchmarkNewPage-8           50000    28.4 µs/op
BenchmarkAddText-8          100000    11.2 µs/op
BenchmarkWriteToFile-8        5000   312.5 µs/op

What's Next

The v0.1.0 release covers the core functionality. Planned for future releases:

  • Form Filling: Fill existing PDF forms
  • Digital Signatures: Sign PDFs cryptographically
  • SVG Import: Vector graphics
  • PDF Rendering: Convert pages to images

We Need Your PDFs

This is v0.1.0 — our first public release. We've tested on bank statements, invoices, and reports. But PDFs are infinitely diverse.

We need testers with real documents:

  • Corporate reports with complex tables
  • Invoices from different countries and formats
  • Scanned documents with OCR layers
  • Multi-language PDFs (CJK, Arabic, Hebrew)
  • Legacy PDFs from old generators
  • Edge cases that break other libraries

If GxPDF fails on your document, that's valuable data. Open an issue, attach the PDF (or a sanitized version), and we'll fix it.

Our goal is enterprise-grade quality. Not "good enough for hobby projects" — we want GxPDF to handle production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.

This is v0.1.0. Some rough edges exist. But the architecture is solid, the core algorithms work, and we're committed to making this the PDF library Go deserves.

Try It

go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version

Repository: github.com/coregx/gxpdf

Documentation and examples in the repo. Issues and PRs welcome.


GxPDF is MIT licensed. Built for the Go community who needed a real PDF library without commercial restrictions.
