P VIKRAM KISHORE

Posted on Jun 28

Building docapi: A Reliable Document Extraction Platform for AI Agents

#ai #api #llm #showdev

GitHub: https://github.com/Waterbottles792/docapi

Large Language Models have made document understanding incredibly accessible.

Give an LLM an invoice, receipt, résumé, or contract, and it can usually tell you what's inside.

The problem begins when you need reliability.

Production systems don't need "usually."

They need predictable outputs, validation, and error handling.

That observation led me to build docapi.

The Problem

Most document extraction pipelines follow a simple pattern:

Document
    ↓
LLM
    ↓
JSON

This works well until the model:

Returns invalid JSON
Hallucinates values
Misinterprets dates
Omits required fields
Produces inconsistent output formats

For AI agents, those failures become difficult to recover from.
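The first failure in that list, invalid JSON, is the easiest one to detect in code. The snippet below is a minimal sketch rather than docapi's actual implementation: the function name `parse_model_output` and the shape of the error dictionary are assumptions made for illustration.

```python
import json

def parse_model_output(raw: str) -> dict:
    """Parse raw model output; report failures as data instead of raising."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # The model returned something that is not JSON at all.
        return {"ok": False, "error": "invalid_json", "detail": exc.msg}
    if not isinstance(data, dict):
        # Valid JSON, but not the object an extraction result should be.
        return {"ok": False, "error": "not_an_object", "detail": type(data).__name__}
    return {"ok": True, "data": data}

# A common failure: the model wraps its answer in conversational prose.
bad = parse_model_output('Sure! Here is the JSON: {"total": 42}')
good = parse_model_output('{"total": 42}')
print(bad["error"])   # invalid_json
print(good["data"])   # {'total': 42}
```

An agent that receives `{"ok": False, ...}` can retry or escalate, which is much harder when the failure surfaces as an unhandled exception or a silently wrong value.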

I wanted to build something that treated reliability as the primary goal.

The Idea

Instead of prompting an LLM and hoping for the best, docapi works like this:

Document
      │
      ▼
Text Extraction
      │
      ▼
LLM Understanding
      │
      ▼
Schema Validation
      │
      ▼
Grounding Verification
      │
      ▼
Deterministic Normalization
      │
      ▼
Confidence Scoring
      │
      ▼
Schema-Validated JSON

If the system cannot confidently produce valid output, it returns a structured error instead of silently returning incorrect data.
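To make the "structured error instead of incorrect data" idea concrete, here is a small sketch of two of the stages above (model call and schema validation) wired together. Everything named here is hypothetical: `ExtractionError`, `INVOICE_SCHEMA`, `validate`, `extract`, and the `call_model` callable standing in for the LLM are illustrative assumptions, not docapi's real API.

```python
import json
from typing import Callable

class ExtractionError(Exception):
    """Raised by a pipeline stage that cannot produce trustworthy output."""
    def __init__(self, stage: str, message: str):
        super().__init__(message)
        self.stage = stage
        self.message = message

# Hypothetical schema: field name -> accepted Python type(s).
INVOICE_SCHEMA = {"invoice_number": str, "total": (int, float)}

def validate(data: object, schema: dict) -> dict:
    """Check that every schema field is present with an accepted type."""
    if not isinstance(data, dict):
        raise ExtractionError("schema_validation", "expected a JSON object")
    for name, expected in schema.items():
        if name not in data:
            raise ExtractionError("schema_validation", f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ExtractionError("schema_validation", f"wrong type for: {name}")
    return data

def extract(text: str, call_model: Callable[[str], str], schema: dict) -> dict:
    """Run the stages in order; any failure becomes a structured error."""
    try:
        try:
            data = json.loads(call_model(text))
        except json.JSONDecodeError:
            raise ExtractionError("llm_understanding", "model returned invalid JSON") from None
        return {"ok": True, "data": validate(data, schema)}
    except ExtractionError as err:
        return {"ok": False, "error": {"stage": err.stage, "message": err.message}}

# Stub model that omits a required field.
result = extract("...", lambda _: '{"invoice_number": "A-1"}', INVOICE_SCHEMA)
print(result)
```

Recording the failing stage in the error gives the caller enough context to decide whether a retry is worthwhile.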

Features

The current version includes:

REST API
MCP server for AI agents
Local inference with Ollama
Cloud inference with Claude
Schema validation
Grounding checks to reduce hallucinations
Deterministic date normalization
Long-document chunking
Confidence scoring
Automated evaluation harness
More than 80 automated tests
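Of the features above, grounding checks are probably the least self-explanatory. One simple way to approximate the idea, shown here as a sketch under my own assumptions and not necessarily how docapi does it, is to require that each extracted value literally appears in the source text. The helper names `_norm` and `ungrounded_fields` are hypothetical.

```python
import re

def _norm(s: str) -> str:
    """Collapse whitespace and lowercase so trivial differences don't matter."""
    return re.sub(r"\s+", " ", s).strip().lower()

def ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return the names of fields whose values do not appear in the source."""
    haystack = _norm(source_text)
    return [
        name
        for name, value in extracted.items()
        if _norm(str(value)) not in haystack
    ]

source = "Invoice No: INV-1042\nBill to: Acme   Corp\nTotal due: 1,250.00"
extracted = {
    "invoice_number": "INV-1042",
    "customer": "Acme Corp",
    "po_number": "PO-7781",  # not present in the document: likely hallucinated
}
print(ungrounded_fields(extracted, source))  # ['po_number']
```

A substring check like this only works on values as they appear in the document, so it would need to run before normalization (or compare against the raw span) for fields such as dates and amounts.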

Why Deterministic Code Matters

One example I encountered was date parsing.

A language model occasionally interpreted:

26-05-2025

as the year 2605.

That's not an AI problem.

It's a software engineering problem.

Instead of trying to improve the prompt, docapi normalizes dates deterministically after extraction.

The same philosophy applies throughout the project.

Whenever a problem can be solved reliably with code, it shouldn't be delegated to the model.
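As an illustration of the date case, the sketch below normalizes a date string to ISO 8601 with Python's standard `datetime` module. The list of accepted formats, and the choice to read ambiguous dates as day-first, are assumptions for this example; docapi's actual rules may differ.

```python
from datetime import datetime

# Accepted input layouts, tried in order. Day-first is an assumption here.
DATE_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")

def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"unrecognized date format: {raw!r}")

print(normalize_date("26-05-2025"))  # 2025-05-26
```

Because `strptime` also rejects impossible dates such as 31-02-2025, a bad value fails loudly instead of flowing downstream.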

Building for AI Agents

Another goal was making the system easy for agents to use.

In addition to the REST API, docapi exposes an MCP server, so AI assistants can call document extraction as a tool without additional integration code.

The extraction pipeline remains identical regardless of whether the caller is a Python application, an HTTP client, or an AI agent.
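One way to get that property is to keep a single core function and make every entry point a thin adapter around it. The sketch below uses only the standard library and stubs out the pipeline itself; `extract_document`, `handle_http`, and `handle_tool_call` are hypothetical names, and no real web framework or MCP SDK is involved.

```python
import json

def extract_document(text: str, schema_name: str) -> dict:
    """Single entry point to the pipeline (stubbed for this sketch)."""
    return {"ok": True, "schema": schema_name, "data": {"length": len(text)}}

def handle_http(body: bytes) -> bytes:
    """HTTP-style adapter: JSON request bytes in, JSON response bytes out."""
    request = json.loads(body)
    result = extract_document(request["text"], request["schema"])
    return json.dumps(result).encode("utf-8")

def handle_tool_call(arguments: dict) -> dict:
    """Agent-tool-style adapter: a dict of arguments in, a dict out."""
    return extract_document(arguments["text"], arguments["schema"])

# Both adapters return the same result as a direct Python call.
direct = extract_document("abc", "invoice")
via_http = json.loads(handle_http(b'{"text": "abc", "schema": "invoice"}'))
via_tool = handle_tool_call({"text": "abc", "schema": "invoice"})
print(direct == via_http == via_tool)  # True
```

Keeping the adapters free of extraction logic means a fix or a new validation step applies to every caller at once.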

What I Learned

Building this project changed the way I think about AI engineering.

The model is only one part of the system.

The surrounding engineering matters just as much:

Validation
Error handling
Evaluation
Grounding
Deterministic processing
Observability
Testing

Those pieces are what make AI systems reliable enough for production.

What's Next?

I'm continuing to expand docapi with:

OCR support for scanned documents
Additional model providers
Larger evaluation datasets
A managed hosted version

The goal remains the same:

Build AI systems that are not only intelligent but also predictable, measurable, and reliable.

If you've built similar AI infrastructure or have ideas for improving document extraction reliability, I'd be interested to hear your thoughts.

DEV Community