doceval — eval harness for LLM document extraction pipelines

Dave — Tue, 16 Jun 2026 12:29:37 +0000

I kept seeing the same gap: people ship LLM-based document extractors (invoices, receipts, forms) with no systematic way to measure how accurate they are. So I built doceval. Point it at your extractor function and a labeled dataset, and you get back field-level accuracy, a failure taxonomy (missed_field / hallucination / wrong_format / wrong_value), and optional per-document cost tracking.
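
To make the failure taxonomy concrete, here is a minimal sketch of how the four categories could be assigned per field. It is not doceval's actual code: the names `classify_field` and `evaluate`, the use of `None` for an absent field, and the normalization rule are all assumptions made for illustration.

```python
from collections import Counter


def _normalize(value):
    # Assumed normalization: ignore case, spaces, and thousands separators.
    return str(value).strip().lower().replace(",", "").replace(" ", "")


def classify_field(expected, predicted):
    """Return 'correct' or one of the four failure categories for one field."""
    if expected is None and predicted is None:
        return "correct"
    if predicted is None:
        return "missed_field"   # label has a value, extractor returned nothing
    if expected is None:
        return "hallucination"  # extractor produced a value the label lacks
    if expected == predicted:
        return "correct"
    if _normalize(expected) == _normalize(predicted):
        return "wrong_format"   # same content, different surface form
    return "wrong_value"


def evaluate(labels, predictions):
    """Compare paired label/prediction dicts; return (accuracy, category counts)."""
    counts = Counter()
    for expected_doc, predicted_doc in zip(labels, predictions):
        for field in set(expected_doc) | set(predicted_doc):
            outcome = classify_field(expected_doc.get(field), predicted_doc.get(field))
            counts[outcome] += 1
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else 0.0
    return accuracy, counts


# Hypothetical one-document dataset.
labels = [{"invoice_number": "INV-001", "total": "1234.50", "due_date": "2026-07-01"}]
predictions = [{"invoice_number": "inv-001", "total": "1234.50", "po_number": "77"}]
accuracy, counts = evaluate(labels, predictions)
```

In this example one of four fields is correct, so `accuracy` is 0.25, and each of wrong_format, missed_field, and hallucination is counted once. The normalization here is deliberately crude: a date written in two different layouts would still land in wrong_value, so a real harness would likely need per-field-type comparison rules.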

Works with any extractor (Claude, GPT, regex, rules) and any document schema. One JSON label file per document, one Python function, one CLI command.
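
As a rough picture of the two inputs, the sketch below pairs a label for one document with a toy regex extractor. The label layout (a flat JSON object of field names to values), the function name `extract_invoice`, and its text-in, dict-out signature are assumptions for illustration, not doceval's documented interface.

```python
import json
import re

# Hypothetical contents of one label file, one JSON object per document.
LABEL_JSON = '{"invoice_number": "INV-001", "total": "1234.50"}'


def extract_invoice(text: str) -> dict:
    """Toy regex extractor: document text in, dict of field values out."""
    number = re.search(r"Invoice\s*#?\s*([A-Z]+-\d+)", text)
    total = re.search(r"Total:?\s*\$?([\d,]+\.\d{2})", text)
    return {
        # None marks a field the extractor could not find.
        "invoice_number": number.group(1) if number else None,
        # Drop thousands separators so the value matches the label format.
        "total": total.group(1).replace(",", "") if total else None,
    }


expected = json.loads(LABEL_JSON)
predicted = extract_invoice("Invoice # INV-001\nTotal: $1,234.50")
```

An LLM-backed extractor would fill the same role: any callable that maps a document to a dict with the label's field names can be compared field by field against `expected`.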

Includes a working 20-document invoice example with a Claude Haiku extractor so you can run it immediately.

GitHub:

DEV Community: Dave
