
Tejas


Parsing PDFs, Spreadsheets & Images in OpenCode Without a Single Binary

I got tired of watching my local agents stall on npm install loops just to read a document. So I built an opencode plugin that parses files without needing any external binaries.

What it does

You drop it in, and suddenly your coding agent can read PDFs, Word docs, Excel spreadsheets, PowerPoint slides, images (via OCR), EPUBs, Jupyter notebooks, ZIP archives, and plain text files. Ask it to parse @report.pdf and it returns structured markdown with metadata, tables, and content.

Why not just use existing tools?

Most parser setups in the AI tooling space either:

  • Shell out to system binaries (pdftotext, pandoc, etc.) that break on different OS setups
  • Require heavy npm installs every session
  • Only handle one format

I wanted something that works right after running a single command: opencode plugin opencode-parser -g. No apt-get, no brew install, no host dependencies.

How it works

  • File type detection via magic bytes (not just extensions), with extension fallback
  • 15+ format handlers, all returning the same structured output
  • Content truncation with configurable limits so the LLM doesn't choke on a 500-page document
  • Optional save-to-markdown for when you need the full extraction rather than the truncated one
  • OCR is opt-in (default off, since tesseract.js downloads language data on first use)
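
To make the first bullet concrete, here is a minimal sketch of magic-byte detection with an extension fallback. It is not the plugin's actual code: detectFileType, SIGNATURES, ZIP_CONTAINERS, and EXTENSION_FALLBACK are hypothetical names, and the signature table covers only a handful of formats. The one subtlety it shows is that DOCX, XLSX, PPTX, and EPUB are all ZIP containers, so the magic bytes alone cannot tell them apart and the extension is used to disambiguate.

```typescript
// Illustrative sketch only: all names here are hypothetical, not the plugin's API.

interface Signature {
  type: string;
  bytes: number[]; // leading bytes that identify the format
}

const SIGNATURES: Signature[] = [
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47] }, // "\x89PNG"
  { type: "jpeg", bytes: [0xff, 0xd8, 0xff] },
];

// Office documents and EPUBs share the ZIP signature, so the extension
// decides which container format we are looking at.
const ZIP_CONTAINERS = new Map<string, string>([
  ["docx", "docx"],
  ["xlsx", "xlsx"],
  ["pptx", "pptx"],
  ["epub", "epub"],
]);

// Formats with no reliable magic bytes fall back to the extension.
const EXTENSION_FALLBACK = new Map<string, string>([
  ["txt", "text"],
  ["md", "text"],
  ["csv", "text"],
  ["ipynb", "notebook"],
]);

function extensionOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
}

// `head` is the first few bytes of the file; `filename` is only used for
// ZIP disambiguation and as a fallback when no signature matches.
function detectFileType(head: Uint8Array, filename: string): string {
  const ext = extensionOf(filename);
  for (const sig of SIGNATURES) {
    const matches =
      head.length >= sig.bytes.length &&
      sig.bytes.every((b, i) => head[i] === b);
    if (!matches) continue;
    if (sig.type === "zip") {
      return ZIP_CONTAINERS.get(ext) ?? "zip";
    }
    return sig.type;
  }
  return EXTENSION_FALLBACK.get(ext) ?? "unknown";
}
```

With this shape, a PDF that was renamed to report.bin is still detected as a PDF, while a notebook (which is just JSON) is recognized by its .ipynb extension.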

Stack

TypeScript, running inside opencode's plugin system. It uses pdf-parse, mammoth, xlsx, tesseract.js, cheerio, and jszip under the hood. All of these are in-process JS libraries, so no binaries are involved.
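
Because every handler built on these libraries returns the same structured output, the shared shape and the truncation step can be sketched in a few lines. This is an assumption about how such a layer could look, not the plugin's real types: ParseResult, truncateContent, buildResult, and the 20000-character default are all hypothetical. The sketch cuts at the last paragraph break before the limit so the LLM does not receive half a sentence.

```typescript
// Illustrative sketch only: these names and the default limit are hypothetical.

interface ParseResult {
  format: string;                   // e.g. "pdf", "xlsx"
  metadata: Record<string, string>; // e.g. page count, title
  content: string;                  // extracted markdown, possibly truncated
  truncated: boolean;
}

// Cap the markdown at maxChars, preferring to cut at a paragraph break.
function truncateContent(
  markdown: string,
  maxChars: number
): { content: string; truncated: boolean } {
  if (markdown.length <= maxChars) {
    return { content: markdown, truncated: false };
  }
  // Last blank line at or before the limit; fall back to a hard cut.
  const paragraphBreak = markdown.lastIndexOf("\n\n", maxChars);
  const end = paragraphBreak > 0 ? paragraphBreak : maxChars;
  const omitted = markdown.length - end;
  return {
    content: `${markdown.slice(0, end)}\n\n[truncated: ${omitted} characters omitted]`,
    truncated: true,
  };
}

// Every format handler would funnel its output through this one function,
// so callers always see the same shape regardless of the source format.
function buildResult(
  format: string,
  metadata: Record<string, string>,
  markdown: string,
  maxChars: number = 20000
): ParseResult {
  return { format, metadata, ...truncateContent(markdown, maxChars) };
}
```

A handler for a new format then only has to produce markdown and metadata; the limit and the truncation notice are applied in one place.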

What I'd do differently

The image OCR path is the weakest link. tesseract.js works, but it's slow on large images, and the initial language data download is clunky. If I rewrote it today I'd probably reach for a smaller WASM OCR engine, but for now it's opt-in so it doesn't get in the way.

Repo: https://github.com/TejasS1233/opencode-parser

MIT, contributions welcome.
