When we started building DocImprint, we kept running into a problem that nobody was talking about.
AI agents were getting good at reading documents — PDFs, web pages, scanned invoices. But there was no way to prove what they had read. No receipt. No audit trail. Just the agent's word for it.
In legal, compliance, and financial workflows, that's a dealbreaker.
The core problem with RAG today
Most RAG pipelines work like this: chunk a document, embed it, store it, retrieve it at query time. The LLM generates an answer with citations. Looks great in a demo.
But those citations are soft references. There's nothing stopping the source document from being modified after ingestion. Nothing proving the chunk the agent retrieved matches what was in the original file. And no offline way to verify any of it.
For developers building internal tools, that's fine. For teams building AI that touches contracts, compliance reports, or regulated data — it isn't.
What we built
DocImprint is a document extraction API that produces evidence bundles alongside every extraction. Each bundle contains:
- The extracted content (structured JSON or markdown)
- A SHA-256 hash of the source document
- A secp256k1 signature over the extraction and the document hash
- A Merkle proof tying the content to the source
The signature is verifiable offline. Anyone with the bundle and our public key can independently confirm that the extracted content came from that exact document, unchanged, at the time of extraction.
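As a rough illustration of the simplest check a verifier might run, the sketch below recomputes a file's SHA-256 digest and compares it with the bundle's document hash. It assumes the hash is stored as the string `sha256:` followed by the hex digest. The function names are hypothetical and not part of any DocImprint SDK.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_document_hash(path: str, document_hash: str) -> bool:
    """Check a local file against a bundle's document hash.

    Assumption: the field looks like "sha256:<hex digest>".
    """
    algorithm, _, expected_hex = document_hash.partition(":")
    if algorithm != "sha256" or not expected_hex:
        return False
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A mismatch here means the file on disk is not the document that was extracted, whatever the signature says.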
A basic API call looks like this:
    curl -X POST https://api.docimprint.com/v1/extract \
      -H "Authorization: Bearer YOUR_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/contract.pdf",
        "formats": ["markdown", "evidence"]
      }'
The response includes both the extraction and the full evidence bundle:
    {
      "content": {
        "markdown": "# Service Agreement\n\nThis agreement is entered into..."
      },
      "evidence": {
        "document_hash": "sha256:a3f1c2d...",
        "signature": "0x4a8b2e...",
        "merkle_root": "0xf9c3a1...",
        "verified_at": "2026-06-21T14:32:00Z"
      }
    }
Store the bundle alongside your RAG chunks and you have an auditable chain of custody from source document to LLM context.
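To make the chunk-to-source link concrete, here is a minimal sketch of how a Merkle inclusion proof is typically checked. The proof format is not shown in this post, so everything below is an assumption: leaves are SHA-256 hashes of chunk bytes, a parent is the hash of its left child concatenated with its right child, an odd node is paired with itself, and a proof is a list of `(sibling_hash, sibling_is_left)` pairs. DocImprint's actual tree construction may differ.

```python
import hashlib
from typing import List, Tuple

# (sibling hash, True if the sibling sits to the left of the current node)
Proof = List[Tuple[bytes, bool]]


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(chunks: List[bytes]) -> List[List[bytes]]:
    """Return every tree level, leaves first and the root level last."""
    if not chunks:
        raise ValueError("need at least one chunk")
    levels = [[h(c) for c in chunks]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # pair an odd trailing node with itself
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels: List[List[bytes]], index: int) -> Proof:
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof: Proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_chunk(chunk: bytes, proof: Proof, root: bytes) -> bool:
    """Recompute the root from one chunk and its proof, then compare."""
    node = h(chunk)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

Under these assumptions, a retrieved chunk can be checked against the stored root without the rest of the document: only the chunk and its handful of sibling hashes are needed.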
MCP integration
For agent builders using the Model Context Protocol, DocImprint exposes 14 tools at api.docimprint.com/mcp — including extract_document, verify_bundle, and get_evidence.
Your agent can extract a document and verify its own prior extractions in the same session. No extra infrastructure. The MCP server handles auth, extraction, and proof generation in a single tool call.
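For readers new to MCP, tool invocations travel as JSON-RPC 2.0 `tools/call` requests. The sketch below only builds the request payloads and sends nothing. The tool names come from the list above, but the argument names (`url`, `formats`, `bundle`) are guesses modeled on the REST example and may not match the real tool schemas. In practice an MCP client library would construct and send these messages for you.

```python
import itertools
import json
from typing import Any, Dict

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical arguments: the real tool input schemas are not shown in this post.
extract_request = tool_call(
    "extract_document",
    {"url": "https://example.com/contract.pdf", "formats": ["markdown", "evidence"]},
)
verify_request = tool_call("verify_bundle", {"bundle": {"document_hash": "sha256:..."}})

print(json.dumps(extract_request, indent=2))
```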
Offline verification
The verification step is intentionally offline.
You don't need to call our API to prove an extraction is valid — the cryptographic proof is self-contained in the bundle. This matters for compliance scenarios where you can't depend on a third-party service being available during an audit, or where the verifying party has no API access.
A verifier just needs the bundle and our public key. That's it.
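To show why no network call is involved, here is a minimal pure-Python sketch of signature verification on secp256k1. It assumes plain ECDSA over the SHA-256 hash of some agreed payload bytes, with the signature given as the integers `r` and `s`. The post does not specify the signature scheme, payload encoding, or key format, so treat all of those as placeholders. Production code should use an audited cryptography library; the arithmetic is spelled out here only to make the point that verification is math on local data.

```python
import hashlib
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]  # None stands for the point at infinity

# secp256k1 domain parameters: the curve is y^2 = x^3 + 7 over the field of size P.
P = 2**256 - 2**32 - 977
N = int("FFFFFFFF" "FFFFFFFF" "FFFFFFFF" "FFFFFFFE"
        "BAAEDCE6" "AF48A03B" "BFD25E8C" "D0364141", 16)
G = (int("79BE667E" "F9DCBBAC" "55A06295" "CE870B07"
         "029BFCDB" "2DCE28D9" "59F2815B" "16F81798", 16),
     int("483ADA77" "26A3C465" "5DA4FBFC" "0E1108A8"
         "FD17B448" "A6855419" "9C47D08F" "FB10D4B8", 16))


def on_curve(pt: Point) -> bool:
    if pt is None:
        return False
    x, y = pt
    return (y * y - x * x * x - 7) % P == 0


def add(a: Point, b: Point) -> Point:
    """Add two curve points using the affine group law."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a point plus its negation
    if a == b:
        m = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P  # tangent slope
    else:
        m = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (m * m - x1 - x2) % P
    return (x3, (m * (x1 - x3) - y1) % P)


def mul(k: int, pt: Point) -> Point:
    """Scalar multiplication by double-and-add."""
    result: Point = None
    while k:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result


def verify(public_key: Point, payload: bytes, r: int, s: int) -> bool:
    """ECDSA verification: needs only the key, the payload, and (r, s)."""
    if not on_curve(public_key) or not (0 < r < N and 0 < s < N):
        return False
    z = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    w = pow(s, -1, N)
    pt = add(mul(z * w % N, G), mul(r * w % N, public_key))
    return pt is not None and pt[0] % N == r
```

Every input to `verify` comes from the bundle or from a public key distributed ahead of time, which is what lets an auditor run the check on an air-gapped machine.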
Where this fits in your stack
| Scenario | What DocImprint adds |
|---|---|
| RAG over legal documents | Prove each chunk came from the original filing |
| AI-generated compliance reports | Audit trail linking every claim to a source document |
| Agent pipelines reading web content | Tamper-evident snapshot at time of capture |
| MCP-native agents | Native tool calls for extract + verify in one session |
Try it
We're in early access at docimprint.com/try. Free tier covers 100 extractions/month.
If you're building RAG pipelines that touch regulated content, or agentic workflows where document provenance matters — we'd genuinely love your feedback. Drop a comment or reach out directly.