Senslyze

Posted on Mar 10

DocOCR: Stop Manually Extracting Data From Documents — Let AI Do It

#webdev #javascript #ai #opensource

The Copy-Paste Tax Every Dev Team Pays

You’ve probably seen this problem up close, even if you haven’t been the one suffering through it.

A business user opens a PDF. They copy a number. They paste it into a spreadsheet. They open the next PDF. Somewhere around the fifteenth invoice, they stop thinking entirely and just move their hands.

When that workflow breaks — wrong cell, misread value, missed field — it lands back on your desk.

Most developers solve this once, badly, with a brittle regex (regular expression) or a quick Python script that works fine until the vendor changes their invoice template. Then they solve it again.

DocOCR is a practical attempt to solve it properly: an open-source web app that converts documents into structured JSON using Google Gemini’s vision model, lets users design extraction schemas visually, and spits out a usable API endpoint without anyone writing a parser.

What’s Actually Inside

There are three things DocOCR does, and it’s worth being specific about each one because the combination is what makes it useful.

1. AI Extraction via Gemini Models

Upload a PDF, image, or text file. Select the document type — invoice, receipt, contract, ID card, form. Gemini handles the rest: layout detection, table parsing, field mapping, all in one pass.

What you get back is structured JSON with per-field confidence scores. Not a wall of raw text. Not a flat string dump. Actual named fields with values, organized the way a developer would want to receive them.

The confidence scores matter more than they might seem at first. They surface the fields worth double-checking before you commit to a schema, which makes the whole system more trustworthy in production.
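As a rough illustration of how those scores might be used, here is a minimal sketch that flags fields for review. The `ExtractedField` shape, the 0 to 1 confidence range, and the 0.8 threshold are all assumptions made for the example, not DocOCR's documented response format.

```typescript
// Hypothetical shape of one extracted field; DocOCR's real response may differ.
interface ExtractedField {
  name: string;
  value: string | number | null;
  confidence: number; // assumed to be in the range 0..1
}

// Return the names of fields whose confidence falls below the threshold.
function fieldsToReview(fields: ExtractedField[], threshold = 0.8): string[] {
  return fields
    .filter((field) => field.confidence < threshold)
    .map((field) => field.name);
}

const sample: ExtractedField[] = [
  { name: "invoiceNumber", value: "INV-1042", confidence: 0.98 },
  { name: "total", value: 1280.5, confidence: 0.64 },
  { name: "dueDate", value: null, confidence: 0.41 },
];

console.log(fieldsToReview(sample)); // [ 'total', 'dueDate' ]
```

A reviewer would only need to look at `total` and `dueDate` here instead of re-checking the whole document.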

2. No-Code Schema Designer

This is the part that makes non-developers genuinely useful in the loop.

After extraction, there’s a visual schema designer where users can rename fields, change types, mark required fields, add custom fields that don’t exist in the document but belong in the downstream system, and search across all detected fields. The output is a versioned JSON schema — a contract between the document and everything that consumes its data.
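To make the "contract" idea concrete, here is a minimal sketch of what a versioned schema and a required-field check could look like. The `DocumentSchema` and `SchemaField` types and the `missingRequired` helper are hypothetical names for illustration; the format DocOCR actually stores may be different.

```typescript
type FieldType = "string" | "number" | "date" | "boolean";

// Hypothetical schema types, not DocOCR's actual storage format.
interface SchemaField {
  name: string;
  type: FieldType;
  required: boolean;
  custom?: boolean; // true for fields added by hand rather than detected
}

interface DocumentSchema {
  name: string;
  version: number;
  fields: SchemaField[];
}

// List required fields that are absent or empty in an extraction result.
function missingRequired(
  schema: DocumentSchema,
  data: Record<string, unknown>
): string[] {
  return schema.fields
    .filter((field) => field.required)
    .filter((field) => {
      const value = data[field.name];
      return value === undefined || value === null || value === "";
    })
    .map((field) => field.name);
}

const invoiceSchema: DocumentSchema = {
  name: "InvoiceSchema",
  version: 2,
  fields: [
    { name: "vendorName", type: "string", required: true },
    { name: "total", type: "number", required: true },
    { name: "costCenter", type: "string", required: false, custom: true },
  ],
};

console.log(missingRequired(invoiceSchema, { vendorName: "Acme" })); // [ 'total' ]
```

Anything consuming the data can run a check like this before trusting a result, which is what makes the schema useful as a contract.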

A finance analyst can own their invoice schema without touching code. That’s a meaningful shift.

3. Auto-Generated API Endpoint

Once the schema is locked in, DocOCR generates a fully described API endpoint — base URL, auth headers, request body, and runnable examples in cURL, JavaScript, and Python.

```bash
curl -X POST "https://api.dococr.com/v1/extract/invoice" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema=InvoiceSchema"
```

No spec to write. No translation from schema to API contract. It’s generated from what the user already designed.
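For reference, the same request could be sketched in TypeScript roughly as follows. This is a hand-written approximation based on the cURL call, not the snippet DocOCR generates, and `buildExtractRequest` is a hypothetical helper name. It assumes a runtime with global `FormData`, `Blob`, and `fetch` (a browser or Node.js 18+).

```typescript
interface ExtractRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: FormData };
}

// Build the multipart request; sending it is left to the caller.
function buildExtractRequest(
  apiKey: string,
  file: Blob,
  fileName: string,
  schemaName: string
): ExtractRequest {
  const body = new FormData();
  body.append("file", file, fileName);
  body.append("schema", schemaName);

  return {
    url: "https://api.dococr.com/v1/extract/invoice",
    init: {
      method: "POST",
      // No Content-Type header: fetch sets the multipart boundary itself.
      headers: { Authorization: `Bearer ${apiKey}` },
      body,
    },
  };
}

// Usage (not executed here):
// const { url, init } = buildExtractRequest(key, pdfBlob, "invoice.pdf", "InvoiceSchema");
// const result = await (await fetch(url, init)).json();
```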

The JSON Parsing Problem (And How It’s Handled)

If you’ve built anything on top of LLMs for structured output, you’ve hit this: you ask for clean JSON and get markdown fences, an explanation paragraph, or JSON that’s almost valid but has one trailing comma.

Prompting helps. It’s not sufficient.

DocOCR uses a two-layer approach. The prompt sets strict expectations:
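A minimal sketch of what such a prompt could look like is below. The wording and the `buildExtractionPrompt` helper are illustrative assumptions; the real prompts live in the DocOCR codebase and may read quite differently.

```typescript
// Hypothetical prompt builder; the wording is illustrative, not DocOCR's actual prompt.
function buildExtractionPrompt(documentType: string): string {
  return [
    `You are extracting data from a ${documentType}.`,
    "Return a single valid JSON object and nothing else.",
    "Do not wrap the output in markdown code fences.",
    "Do not add explanations before or after the JSON.",
    "Do not use trailing commas.",
    'For each field, include "value" and a "confidence" between 0 and 1.',
  ].join("\n");
}

console.log(buildExtractionPrompt("invoice"));
```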

But the server doesn’t trust the model to comply perfectly. Every response goes through a cleanup pass:
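One way such a cleanup pass might work, sketched under the assumption that the model was asked for a single JSON object. `cleanModelJson` is a hypothetical name, and the trailing-comma step is deliberately naive: it could also alter a string value that happens to contain a comma followed by a closing bracket.

```typescript
// Three backticks, built at runtime so this listing stays easy to embed.
const FENCE = "`".repeat(3);

function cleanModelJson(raw: string): unknown {
  // 1. Drop markdown fence markers, with or without a "json" language tag.
  let text = raw.replace(new RegExp(FENCE + "(?:json)?", "gi"), "").trim();

  // 2. Keep only the span from the first "{" to the last "}".
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new SyntaxError("No JSON object found in model output");
  }
  text = text.slice(start, end + 1);

  // 3. Remove trailing commas before a closing brace or bracket (naive).
  text = text.replace(/,\s*([}\]])/g, "$1");

  // 4. Let the standard parser have the final say; it throws if still invalid.
  return JSON.parse(text);
}

const messy =
  "Here is the result:\n" +
  FENCE + "json\n" +
  '{"total": 12.5, "items": [1, 2,],}\n' +
  FENCE + "\nHope this helps!";

console.log(cleanModelJson(messy)); // { total: 12.5, items: [ 1, 2 ] }
```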

If that still fails, a fallback key-value extractor runs and returns both the partial result and the raw error. You get visibility into what went wrong rather than a silent failure.
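The fallback idea can be sketched like this. It is an assumption-laden approximation: `parseOrFallback` and `extractKeyValues` are hypothetical names, and the regular expression only recovers flat string, number, boolean, and null pairs, not nested structures.

```typescript
interface ParseOutcome {
  ok: boolean;
  data: Record<string, unknown>;
  error?: string; // parser message, present only when ok is false
}

// Salvage flat "key": value pairs from text that is not valid JSON.
function extractKeyValues(raw: string): Record<string, string> {
  const pairs: Record<string, string> = {};
  const pattern = /"([^"]+)"\s*:\s*(?:"([^"]*)"|(-?\d+(?:\.\d+)?|true|false|null))/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(raw)) !== null) {
    // Group 2 holds quoted values, group 3 holds bare literals.
    pairs[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return pairs;
}

// Try strict parsing first; on failure return partial data plus the error.
function parseOrFallback(raw: string): ParseOutcome {
  try {
    return { ok: true, data: JSON.parse(raw) as Record<string, unknown> };
  } catch (err) {
    return {
      ok: false,
      data: extractKeyValues(raw),
      error: err instanceof Error ? err.message : String(err),
    };
  }
}

// A response cut off mid-array still yields the fields that arrived intact.
const truncated = '{"vendor": "Acme", "total": 42.5, "items": [';
console.log(parseOrFallback(truncated).data); // { vendor: 'Acme', total: '42.5' }
```

Note that salvaged values come back as strings, so a caller would still need to coerce types against the schema.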

This is the difference between a production tool and a demo: resilience isn’t optional.

Running It Locally

The stack is Next.js 14, TypeScript, Drizzle ORM with SQLite, Tailwind, and the Gemini API. After cloning the repo and adding a Gemini API key, getting it running takes four commands.

The auth mode is configurable — secure by default (requires session or bearer API key), or flipped to anonymous for local demos. For anything internet-facing, leave the default on.
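The "secure by default" behavior can be sketched as follows. The mode names, the `IncomingRequest` shape, and both function names are assumptions for illustration; DocOCR's actual configuration keys and checks are not shown in this post.

```typescript
type AuthMode = "secure" | "anonymous";

// Hypothetical, simplified view of an incoming request.
interface IncomingRequest {
  hasSession: boolean;
  bearerToken?: string;
}

// Anything other than an explicit "anonymous" falls back to secure,
// so a missing or misspelled setting never opens the API by accident.
function resolveAuthMode(configured: string | undefined): AuthMode {
  return configured === "anonymous" ? "anonymous" : "secure";
}

function isAllowed(
  mode: AuthMode,
  req: IncomingRequest,
  validKeys: Set<string>
): boolean {
  if (mode === "anonymous") return true;
  if (req.hasSession) return true;
  return req.bearerToken !== undefined && validKeys.has(req.bearerToken);
}

const keys = new Set(["demo-key"]);
console.log(isAllowed(resolveAuthMode(undefined), { hasSession: false }, keys)); // false
```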

Where It’s Actually Useful

Invoicing — line items, totals, vendor names directly into ERP or accounting systems without manual entry.

Expense receipts — merchant, total, payment method captured automatically and ready for categorization.

Onboarding ID checks — name, DOB, ID number extracted from uploaded documents and pushed downstream to auto-fill forms.

Contract review — parties, effective dates, expiry dates, and monetary values surfaced without reading every page.

The tables vs. key-value distinction is worth knowing: tabular documents (invoices, receipts) work better as arrays of objects — easier to turn into CSV rows or database records. Flat documents (IDs, forms) work fine as key-value maps. The schema designer lets you pick per field.
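To show why arrays of objects suit tabular documents, here is a small sketch that turns extracted line items into CSV. The `LineItem` type, the field names, and the `toCsv` helper are made up for the example; it assumes every row has the same keys as the first one.

```typescript
type LineItem = Record<string, string | number>;

// Convert an array of flat objects into CSV text, quoting where needed.
function toCsv(rows: LineItem[]): string {
  if (rows.length === 0) return "";
  const headers = Object.keys(rows[0]);

  const escapeCell = (value: string | number): string => {
    const text = String(value);
    // Quote cells containing commas, quotes, or newlines; double inner quotes.
    return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };

  const lines = rows.map((row) =>
    headers.map((header) => escapeCell(row[header] ?? "")).join(",")
  );
  return [headers.join(","), ...lines].join("\n");
}

const lineItems: LineItem[] = [
  { description: "Widget, large", quantity: 2, unitPrice: 9.5 },
  { description: "Shipping", quantity: 1, unitPrice: 4 },
];

console.log(toCsv(lineItems));
// description,quantity,unitPrice
// "Widget, large",2,9.5
// Shipping,1,4
```

Each object maps to exactly one row, which is the property that makes the array form easy to load into a spreadsheet or database table.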

Try It or Contribute

DocOCR is MIT licensed. The prompts are in the codebase, the schema logic is editable, and the API is generated from schemas you control.

If you want to try it: clone the repo, add a Gemini API key, run four commands, and you’re in.

If you want to contribute: good starting points are document-type templates for new verticals, improved prompts for specific document types, and test coverage for the JSON parsing fallback.

About Senslyze

DocOCR is built and maintained by Senslyze, an IT services company focused on consulting-first delivery. We design, build, and scale software products across areas that include:

Product & platform engineering — Web, mobile, APIs, and backend systems built for production scale
Applied AI — Agents, RAG, OCR, analytics, and automation workflows (DocOCR lives here)

We start with the problem before writing a line of code. If that sounds like the kind of team you want to work with, reach out at senslyze.com.

DEV Community