Bruno Fortunato
Most document AI questions aren't retrieval problems

There are two kinds of questions you can ask a pile of documents.

Most AI document tools only answer one of them.

If you ask a pile of PDFs "find the contract from Acme", retrieval works. Vector search will pull the right chunks, an LLM will summarize them, and you get a useful answer.

If you ask the same pile "how much did Acme charge us last year", retrieval falls apart. The LLM gets a few semantically-similar chunks, none of which contain the full picture, and it confidently makes up a number. The answer is plausible and wrong.

This is not a model problem. It is a shape problem. Search returns chunks. Counting requires structure.

I built Sifter to solve the second question. It is open source, MIT licensed, self-hostable with a single docker compose up. This article is about why the schema-first approach beats RAG for a large class of real-world document workloads, and how it works under the hood.

Sifter demo

The class of problems RAG silently fails on

RAG is great for "find me something" tasks. It is not great for any of the following:

  • "What was our total spend with each supplier last quarter?"
  • "Which contracts auto-renew in the next 90 days?"
  • "How many invoices are still unpaid?"
  • "Show me every receipt over €1000 I expensed last year."

Every one of these is an aggregation. Every one of them needs a deterministic answer, not a similarity-ranked summary. Every one of them is a SQL query hiding inside PDFs.
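To make the shape difference concrete, here is what "total spend per supplier" looks like once the documents exist as structured records instead of chunks. This is a plain-Python sketch with made-up records standing in for extracted data, not Sifter's internals:

```python
from collections import defaultdict

# Hypothetical records, shaped the way schema-first extraction would produce them.
invoices = [
    {"supplier": "Initech", "quarter": "2024-Q4", "total": 1500.00},
    {"supplier": "Hooli",   "quarter": "2024-Q4", "total": 980.00},
    {"supplier": "Initech", "quarter": "2024-Q4", "total": 2750.00},
]

# Once the data has a shape, the answer is a deterministic fold,
# not a similarity-ranked guess.
spend = defaultdict(float)
for inv in invoices:
    spend[inv["supplier"]] += inv["total"]

print(dict(spend))  # {'Initech': 4250.0, 'Hooli': 980.0}
```

No retrieval step can produce that answer unless the retrieved chunks happen to contain every relevant invoice, which for any realistic collection they will not.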

The dirty secret of RAG demos is that they are almost always retrieval demos. The moment you switch to aggregation, the LLM starts hallucinating numbers because the chunks it is shown do not contain the data needed to compute the answer. You can paper over it with longer context windows and clever prompting, but you cannot fix the fundamental mismatch.

Schema-first extraction: a different shape

Sifter takes a different path. Instead of chunking and indexing, it does this:

  1. You describe in your own words what you want extracted from the collection.
  2. Sifter builds a JSON schema that captures those fields.
  3. Every document gets processed once by an LLM under that schema, producing a typed record.
  4. Records get stored in MongoDB.
  5. You query in natural language. Sifter translates the question into an exact MongoDB aggregation pipeline and runs it.
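As a sketch of what step 2 might produce for an invoice collection, here is a hypothetical JSON Schema. The field names and types are my guess at the shape, not Sifter's actual output:

```python
# A hypothetical JSON Schema for an invoice sift -- illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "client": {"type": "string"},
        "date":   {"type": "string", "format": "date"},
        "total":  {"type": "number"},
    },
    "required": ["client", "date", "total"],
}

# Step 3 turns every document into a typed record under that schema.
record = {"client": "Acme Corp", "date": "2024-11-03", "total": 5200.00}
assert all(field in record for field in invoice_schema["required"])
```

The point of the schema is that every downstream step can trust the record's types: a `total` is always a number, never a string that might contain a currency symbol.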

The end result is the same kind of answer a careful analyst would produce by hand: reproducible, auditable, free of similarity scores.

The documents in the collection do not need to share a format. They can be photos snapped on a phone, scans, PDFs, emails, screenshots, in different languages and layouts. What matters is that they are semantically aggregable: invoices from many vendors in many shapes, contracts written by different lawyers, expense receipts that look nothing alike. Sifter extracts the same kind of information out of each one regardless of how it is presented.

Three lines, end to end

The Python SDK is intentionally minimal. Here is the entire flow:

from sifter import Sifter

s = Sifter(api_key="sk-...")

# Describe the fields in plain language; Sifter infers the schema from this.
sift = s.create_sift(
    "Invoices",
    "Extract the client, the date, and the total amount from each invoice."
)
sift.upload("./invoices/")
sift.wait()  # block until every document has been processed

sift.query("How much did we spend per client last quarter?")
# [{"client": "Acme Corp", "total": 12340.00},
#  {"client": "Globex",   "total":  8721.50}, ...]

That last line returns rows from a MongoDB aggregation pipeline that an LLM generated for you, on a schema that an LLM inferred for you, against records that an LLM extracted for you. None of those LLMs were asked to invent numbers. Each of them was asked to do the one thing language models are reliably good at: turn natural language into a structured representation.
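To make that last translation step concrete, the quarterly-spend question might compile to a pipeline along these lines. The stage operators are standard MongoDB; the exact pipeline Sifter generates is an assumption on my part:

```python
# A plausible generated pipeline for "How much did we spend per client
# last quarter?" -- illustrative, not Sifter's actual output.
pipeline = [
    {"$match": {"date": {"$gte": "2024-10-01", "$lt": "2025-01-01"}}},
    {"$group": {"_id": "$client", "total": {"$sum": "$total"}}},
    {"$sort": {"total": -1}},
]

# With pymongo this would run as db.records.aggregate(pipeline);
# the same records always produce the same rows, with no sampling involved.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)  # ['$match', '$group', '$sort']
```

Because the output of the LLM is a pipeline rather than an answer, you can log it, inspect it, and rerun it, which is what makes the results auditable.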

Architecture

The core stack is small enough to fit in your head:

  • FastAPI server, with a job queue backed by MongoDB itself (no Celery, no Redis).
  • MongoDB as the only stateful component. Records, documents, schemas, queue, all in one place.
  • React frontend served by the same backend in production.
  • LiteLLM as the LLM provider abstraction, so you can swap Anthropic, OpenAI, Gemini, Vertex AI, or Ollama without touching code.
  • MCP server that exposes 15 tools so Claude, ChatGPT, Gemini, or any MCP client can talk to your documents directly. Stdio for local clients, HTTP for remote.

There is no vector database. There is no embedding step. The whole stack runs on a single VM with a Mongo container next to it.
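Using MongoDB itself as the job queue usually comes down to one trick: a worker claims a pending job with a single atomic flip, so two workers can never grab the same document. Here is an in-memory sketch of that pattern; I am assuming the details of Sifter's queue, but with pymongo the claim would be one `find_one_and_update` call:

```python
import threading

jobs = [
    {"id": 1, "status": "pending"},
    {"id": 2, "status": "pending"},
]
lock = threading.Lock()  # stands in for MongoDB's per-document atomicity

def claim_job():
    # One atomic pending -> running flip, like find_one_and_update with
    # filter {"status": "pending"} and update {"$set": {"status": "running"}}.
    with lock:
        for job in jobs:
            if job["status"] == "pending":
                job["status"] = "running"
                return job
        return None

first = claim_job()
second = claim_job()
assert first["id"] != second["id"]  # two workers never claim the same job
```

The upside of this design is operational: one database to back up, one container to run, and the queue state lives next to the records it produces.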

Self-hosting is the path

Sifter is MIT licensed with no feature gating. The web UI ships everything: drag-and-drop upload, document and record browser, chat-style natural-language query, auto-generated dashboards from your extracted data.

git clone https://github.com/sifter-ai/sifter
cd sifter/code
cp server/.env.example server/.env.local   # set SIFTER_DEFAULT_API_KEY
docker compose up -d

UI on localhost:3000, API on localhost:8000. Point it at any LiteLLM-compatible model. If you want to keep everything inside your network, run an Ollama instance on the same host and point Sifter at it. There is no telemetry, no phone-home, no required external dependency beyond the LLM provider you choose.
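For a fully local setup, the env file might look something like this. Everything except SIFTER_DEFAULT_API_KEY is illustrative: I am guessing at Sifter's config keys, though the `ollama/<model>` prefix is LiteLLM's real naming convention for routing to a local Ollama instance:

```shell
# Hypothetical server/.env.local for an air-gapped deployment.
# Variable names other than SIFTER_DEFAULT_API_KEY are my assumptions,
# not Sifter's documented config keys.
SIFTER_DEFAULT_API_KEY=sk-local-dev
LLM_MODEL=ollama/llama3.1               # LiteLLM's ollama/<model> convention
OLLAMA_API_BASE=http://localhost:11434  # default Ollama port on the same host
```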

When this approach is wrong

I am not arguing RAG is dead. There is a class of problems where RAG is exactly the right tool: open-ended Q&A over heterogeneous, document-of-record style content where the answer is a synthesis of several passages. Wikis, knowledge bases, technical manuals, legal opinion repositories. Use RAG there.

The argument is narrower: when the documents form a meaningful collection and the questions you actually want to ask are aggregations, schema-first extraction is the right shape, and it composes far better with the rest of your stack (databases, BI tools, Excel) than a vector store ever will.

What I would love feedback on

If you have ever watched an AI document tool answer a finance question with five plausible wrong amounts, this is for you. The repo is at github.com/sifter-ai/sifter, with a 30-second demo in the README.

I am especially curious whether the schema-first approach maps to use cases I have not thought about yet. The ones I have seen so far: invoices, contracts, receipts, resumes, expense reports, lab results, shipping documents, ID cards, lease agreements. If your collection looks different, the SDK is small enough to try in five minutes.
