Bruno Fortunato

Posted on May 7

Documents are records waiting to exist

#llm #rag #ai #data

Humans are remarkably good at seeing structure.

Show someone a folder containing:

receipts
inspection reports
contracts
photos of vehicles
resumes

…and within seconds they understand the shape of the data.

A receipt has:

a merchant
a total
a date

A vehicle photo has:

a brand
a model
a color

An inspection report has:

findings
categories
pass/fail states

The structure is obvious.

The problem is that most software systems cannot see it.

The retrieval trap

Most modern AI tooling approaches files through retrieval.

Chunk documents.
Embed chunks.
Search by similarity.
Feed chunks into an LLM.

This works surprisingly well for retrieval questions:

“find the contract mentioning GDPR”
“show me the invoice from March”
“summarize this document”

But many real-world questions are not retrieval questions.

They are aggregation questions.

Examples:

Which vehicles appear most frequently across this photo collection?
How many reports failed safety checks?
Which suppliers increased prices over time?
Which contracts expire within 90 days?
What is the average spend per month across these receipts?

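Retrieval systems are optimized to return the handful of chunks most relevant to a query. They are not designed to visit every file, so any count, average, or trend computed from their output covers only part of the collection.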

Aggregation requires something else entirely:
structured records.
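
To make the gap concrete, here is a minimal sketch using made-up data. The `Report` type, the ten sample reports, and the `retrieve_top_k` stand-in are all illustrative assumptions, not part of any real retrieval library. The point is only that a top-k result can never count more than k items, while a pass over records counts all of them.

```python
from dataclasses import dataclass


@dataclass
class Report:
    report_id: str
    passed: bool


# Hypothetical collection: 10 inspection reports, 6 of which failed.
reports = [Report(f"r{i}", passed=(i % 5 in (0, 1))) for i in range(10)]


def retrieve_top_k(items, k):
    """Toy stand-in for similarity search: returns at most k matching items."""
    failed = [r for r in items if not r.passed]
    return failed[:k]


def count_failed(items):
    """Exact aggregation: looks at every record in the collection."""
    return sum(1 for r in items if not r.passed)


seen_by_retrieval = len(retrieve_top_k(reports, k=3))  # capped at k
exact = count_failed(reports)                          # covers everything

print(seen_by_retrieval, exact)  # 3 6
```

Raising k narrows the gap but does not close it: the retriever still has no way to know how many matching files it missed.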

The structure already exists

The important realization is this:

The structure already exists inside the files.

Humans can see it instantly.

LLMs are now good enough to extract it reliably.

That changes the architecture completely.

Instead of:

files → chunks → embeddings → retrieval

…the pipeline becomes:

files → structured records → query engine

The difference is profound.

Once files become records:

filtering becomes deterministic
aggregation becomes exact
dashboards become trivial
APIs become possible
natural language becomes a query layer over real data
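
As a rough sketch of what "exact" and "deterministic" mean here, the snippet below loads a few invented records into an in-memory SQLite database and answers two of the aggregation questions from earlier. The table names, columns, sample rows, and the fixed reference date are assumptions chosen for illustration; any query engine over typed records would serve the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical records, as if already extracted from files.
conn.execute("CREATE TABLE receipts (merchant TEXT, total REAL, date TEXT)")
conn.executemany(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    [
        ("Acme Hardware", 40.0, "2024-01-05"),
        ("Acme Hardware", 60.0, "2024-01-20"),
        ("Blue Cafe", 15.0, "2024-02-03"),
    ],
)
conn.execute("CREATE TABLE contracts (name TEXT, expires TEXT)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [("Lease A", "2024-02-15"), ("Supplier B", "2024-09-01")],
)

# "What is the average spend per month across these receipts?"
monthly_totals = conn.execute(
    "SELECT substr(date, 1, 7) AS month, SUM(total) "
    "FROM receipts GROUP BY substr(date, 1, 7) ORDER BY month"
).fetchall()
average_per_month = sum(t for _, t in monthly_totals) / len(monthly_totals)

# "Which contracts expire within 90 days?" (relative to a fixed date
# so the example gives the same answer every time it runs)
today = "2024-01-01"
expiring = conn.execute(
    "SELECT name FROM contracts "
    "WHERE expires BETWEEN ? AND date(?, '+90 days') ORDER BY name",
    (today, today),
).fetchall()

print(monthly_totals)     # [('2024-01', 100.0), ('2024-02', 15.0)]
print(average_per_month)  # 57.5
print(expiring)           # [('Lease A',)]
```

Running the same query twice gives the same answer, which is what makes dashboards and APIs practical on top of it.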

The approach behind Sifter

This idea led me to build Sifter.

The workflow is intentionally simple:

Upload a collection of files
Describe what matters in natural language
Sifter infers a schema
Files are processed into typed records
Query the resulting dataset in natural language

The files can be:

PDFs
images
photos
scanned documents
multilingual content

The key idea is that the system is not retrieving chunks.
It is querying records.
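
The "typed records" step can be pictured with a small sketch. This is not Sifter's API or implementation. `ReceiptRecord`, `to_record`, and the `raw` dictionary are hypothetical names standing in for an inferred schema and for the loosely typed fields an extraction step might return. The idea shown is only that values are coerced into real types before anything queries them.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiptRecord:
    """Hypothetical schema for a receipt: merchant, total, date."""
    merchant: str
    total: float
    date: dt.date


def to_record(raw: dict) -> ReceiptRecord:
    """Coerce extracted string fields into typed values.

    Raises KeyError for a missing field and ValueError for a value
    that cannot be parsed, so bad extractions fail loudly.
    """
    return ReceiptRecord(
        merchant=str(raw["merchant"]).strip(),
        total=float(raw["total"]),
        date=dt.date.fromisoformat(raw["date"]),
    )


# Assumed shape of an extraction result: everything arrives as text.
raw = {"merchant": " Acme Hardware ", "total": "40.00", "date": "2024-01-05"}
record = to_record(raw)

print(record.merchant)    # Acme Hardware
print(record.total)       # 40.0
print(record.date.month)  # 1
```

Because `total` is a number and `date` is a real date, sums, comparisons, and date ranges behave as expected instead of operating on strings.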

Why this matters

Most organizations already contain enormous amounts of latent structured data.

The problem is not the absence of data.
The problem is that the structure is trapped inside files.

A folder is often just a database waiting to exist.

Links

OSS repo:
https://github.com/sifter-ai/sifter

Cloud:
https://sifter.run

DEV Community