DEV Community

Tsvetan Gerginov

I've built an open-source PDF-to-Excel converter

GitHub Repository: https://github.com/TsvetanG2/PDF-To-Excel-Converter

Hi community,

I've built an open-source PDF-to-Excel converter, and let me tell you why!

We've all been there: someone sends you a 40-page PDF report and asks for "the numbers in a spreadsheet by Friday." You can copy-paste cell by cell, pay for a SaaS converter that uploads your (possibly confidential) data to who-knows-where, or... build your own tool.

What it does

The core idea is simple: upload a PDF, pick an extraction mode, download an .xlsx. The interesting part is in the modes, because "convert a PDF" means different things depending on the document:

1. All Text + Tables — extracts everything (paragraphs, headings, tables) and consolidates it into a single worksheet. Useful when you need the full content of a document in a structured, searchable format.

2. Tables Only — ignores the prose and hunts down tabular data specifically. Each detected table lands in its own sheet in the workbook. This is the mode you want for financial reports, invoices, or anything where the tables are the data.

That second mode is where most converters fall short — they either flatten tables into mush or miss them entirely. Splitting each table into a separate sheet keeps the structure intact and makes downstream work (pivot tables, formulas, imports) actually possible.
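To make the "one table per sheet" idea concrete, here is a minimal sketch of how it could be done. It assumes openpyxl as the Excel writer and tables shaped as lists of rows; the function names and the "Table N" naming scheme are made up for illustration and are not taken from the repo.

```python
import re

# Excel rejects these characters in sheet names and caps names at 31 characters.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def sheet_title(index, prefix="Table"):
    """Build a valid worksheet title such as 'Table 1' (hypothetical naming)."""
    cleaned = _INVALID_SHEET_CHARS.sub("_", prefix).strip() or "Table"
    suffix = f" {index}"
    # Trim the prefix so that prefix + suffix never exceeds 31 characters.
    return cleaned[: 31 - len(suffix)] + suffix


def write_tables(tables, out_path):
    """Write each table (a list of rows) to its own worksheet."""
    from openpyxl import Workbook  # third-party, imported lazily (assumption)

    workbook = Workbook()
    workbook.remove(workbook.active)  # drop the default empty sheet
    for index, rows in enumerate(tables, start=1):
        sheet = workbook.create_sheet(title=sheet_title(index))
        for row in rows:
            # Extracted tables often use None for empty cells.
            sheet.append(["" if cell is None else cell for cell in row])
    if not tables:
        # A workbook needs at least one sheet before it can be saved.
        workbook.create_sheet(title="No tables found")
    workbook.save(out_path)
```

Calling `write_tables([[["Item", "Price"], ["Widget", 9.5]]], "out.xlsx")` would produce a workbook with a single sheet named "Table 1". Keeping the title logic in its own small function makes the Excel naming rules easy to check without touching any files.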

The stack

Nothing exotic, and that's deliberate:

  • Python + Flask for the web app — file upload, mode selection, conversion, download. One form, one job.
  • pdfplumber for text and layout-aware extraction
  • tabula-py for table detection and extraction
  • A separate desktop version in the repo for people who don't want to run a server at all
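The "one form, one job" flow could look roughly like the sketch below. This is not the code from the repo: the `/convert` route, the `pdf` and `mode` form field names, the mode identifiers, and the `convert` callable are all assumptions chosen for illustration.

```python
import io
from pathlib import Path

# Hypothetical mode identifiers; the real app may name them differently.
MODES = {"all", "tables"}


def validate_upload(filename, mode):
    """Return an error message for a bad request, or None if it looks fine."""
    if not filename or Path(filename).suffix.lower() != ".pdf":
        return "Please upload a .pdf file."
    if mode not in MODES:
        return "Unknown extraction mode."
    return None


def create_app(convert):
    """Build the Flask app; `convert(pdf_bytes, mode)` must return .xlsx bytes."""
    from flask import Flask, abort, request, send_file  # third-party

    app = Flask(__name__)

    @app.post("/convert")
    def convert_route():
        upload = request.files.get("pdf")
        mode = request.form.get("mode", "all")
        error = validate_upload(upload.filename if upload else "", mode)
        if error:
            abort(400, description=error)
        xlsx_bytes = convert(upload.read(), mode)
        # Stream the result back from memory, so nothing is written to disk.
        return send_file(
            io.BytesIO(xlsx_bytes),
            as_attachment=True,
            download_name=Path(upload.filename).stem + ".xlsx",
            mimetype="application/vnd.openxmlformats-officedocument"
            ".spreadsheetml.sheet",
        )

    return app
```

Passing the converter in as a function keeps the web layer thin, which also makes it easier to share the same conversion code with a desktop front end.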

Why two extraction libraries? Because PDFs are chaos. A PDF is fundamentally a visual format — it knows where to draw characters, not what a "table" is. pdfplumber is excellent at layout-aware text extraction, while tabula's table detection handles structured grids better. Using each for what it does best gives much more reliable output than forcing one library to do everything.
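As a rough sketch of that division of labour (not the repo's actual code), the dispatch between the two libraries might look like this. The function names and the `"all"` / `"tables"` mode strings are assumptions; `pdfplumber.open`, `page.extract_text()` and `tabula.read_pdf` are the real library calls. Note that tabula-py needs a Java runtime installed.

```python
def extract_text_lines(pdf_path):
    """Collect the text of every page, line by line, using pdfplumber."""
    import pdfplumber  # third-party

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # guard against pages with no text
            lines.extend(text.splitlines())
    return lines


def extract_tables(pdf_path):
    """Detect tables with tabula-py; returns a list of pandas DataFrames."""
    import tabula  # third-party (tabula-py), requires Java

    frames = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    return [frame for frame in frames if not frame.empty]


def extract(pdf_path, mode,
            text_reader=extract_text_lines, table_reader=extract_tables):
    """Run only the extractors that the chosen mode needs."""
    if mode not in ("all", "tables"):
        raise ValueError(f"unknown mode: {mode!r}")
    tables = table_reader(pdf_path)
    # "Tables Only" skips the text pass entirely.
    text = text_reader(pdf_path) if mode == "all" else []
    return {"text": text, "tables": tables}
```

The readers are passed in as parameters so that either library can be swapped out, or replaced by a stub, without changing the dispatch logic.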

Why local-first matters

Most "free PDF converter" sites are upload services. That's fine for a recipe PDF — less fine for contracts, bank statements, or client data. This tool processes everything locally:

git clone https://github.com/TsvetanG2/PDF-To-Excel-Converter.git
cd PDF-To-Excel-Converter
pip install -r requirements.txt
python pdftoexcel.py

Then open http://localhost:5000, upload, convert, done. Your files never leave your machine.

Honest limitations

I'm not going to pretend this beats commercial tools on every PDF. Scanned documents (images of text) need OCR, which isn't in scope here — this works on PDFs with an actual text layer. And table detection on documents with creative, merged-cell layouts is a hard problem for every tool in this space, including this one. For typical reports, exports, and structured documents, it does the job well.
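If you want to know up front whether a PDF has a text layer at all, a quick check along these lines can help. It is only a heuristic sketch: the function names, the 20-character threshold, and the five-page sample are arbitrary assumptions, and the tool itself may not do anything like this.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF contains real text, given each page's extracted text."""
    # Pages without extractable text may come back as None or an empty string.
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def pdf_has_text_layer(pdf_path, max_pages=5):
    """Sample the first few pages with pdfplumber and apply the heuristic."""
    import pdfplumber  # third-party

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages[:max_pages]]
    return has_text_layer(texts)
```

A scanned document will typically yield little or no text here, which is a signal that it needs OCR before any converter of this kind can do something useful with it.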
