Humans are remarkably good at seeing structure.
Show someone a folder containing:
- receipts
- inspection reports
- contracts
- photos of vehicles
- resumes
…and within seconds they understand the shape of the data.
A receipt has:
- a merchant
- a total
- a date
A vehicle photo has:
- a brand
- a model
- a color
An inspection report has:
- findings
- categories
- pass/fail states
The structure is obvious.
The problem is that most software systems cannot see it.
The retrieval trap
Most modern AI tooling approaches files through retrieval.
Chunk documents.
Embed chunks.
Search by similarity.
Feed chunks into an LLM.
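To make those four steps concrete, here is a minimal sketch of the chunk, embed, and search stages. It assumes a toy bag-of-words "embedding" in place of a learned embedding model, and the function names (`chunk`, `embed`, `top_k`) and sample documents are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical sample documents.
documents = [
    "This contract covers GDPR compliance for customer data",
    "Invoice from March for office supplies and printer paper",
    "Inspection report: brakes failed, lights passed",
]
chunks = [c for doc in documents for c in chunk(doc)]

# Only the best-matching chunks would be passed on to the LLM.
results = top_k("contract mentioning GDPR", chunks, k=1)
print(results)
```

Note that `top_k` returns a ranked subset by design. Everything outside the top `k` never reaches the LLM.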
This works surprisingly well for retrieval questions:
- “find the contract mentioning GDPR”
- “show me the invoice from March”
- “summarize this document”
But many real-world questions are not retrieval questions.
They are aggregation questions.
Examples:
- Which vehicles appear most frequently across this photo collection?
- How many reports failed safety checks?
- Which suppliers increased prices over time?
- Which contracts expire within 90 days?
- What is the average spend per month across these receipts?
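Retrieval systems are fundamentally optimized to return the few chunks most relevant to a query.
Aggregation has to account for every item in the collection, each counted exactly once. That requires something else entirely: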
structured records.
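As a rough illustration, the "average spend per month" question becomes a few lines of ordinary code once receipts exist as records. The `Receipt` type mirrors the merchant, total, and date fields described earlier. The sample values and the `spend_per_month` helper are made up for this sketch.

```python
import datetime as dt
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Receipt:
    merchant: str
    total: Decimal
    date: dt.date


# Hypothetical records, as they might look after extraction.
receipts = [
    Receipt("Acme Hardware", Decimal("40.00"), dt.date(2024, 1, 5)),
    Receipt("Acme Hardware", Decimal("60.00"), dt.date(2024, 1, 20)),
    Receipt("Corner Cafe", Decimal("20.00"), dt.date(2024, 2, 3)),
]


def spend_per_month(records: list[Receipt]) -> dict[tuple[int, int], Decimal]:
    """Sum receipt totals per (year, month), visiting every record once."""
    totals: defaultdict[tuple[int, int], Decimal] = defaultdict(Decimal)
    for r in records:
        totals[(r.date.year, r.date.month)] += r.total
    return dict(totals)


monthly = spend_per_month(receipts)
average_monthly_spend = sum(monthly.values(), Decimal("0")) / len(monthly)
print(monthly, average_monthly_spend)
```

Every receipt contributes to the result, which is exactly what a top-k similarity search cannot guarantee.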
The structure already exists
The important realization is this:
The structure already exists inside the files.
Humans can see it instantly.
LLMs are now good enough to extract it reliably.
That changes the architecture completely.
Instead of:
files → chunks → embeddings → retrieval
…the pipeline becomes:
files → structured records → query engine
The difference is profound.
Once files become records:
- filtering becomes deterministic
- aggregation becomes exact
- dashboards become trivial
- APIs become possible
- natural language becomes a query layer over real data
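A small sketch of what "deterministic" and "exact" mean here, using Python's built-in `sqlite3` as a stand-in query engine. The table name, columns, and rows are assumptions for illustration, not Sifter's actual storage format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inspection_reports (file_name TEXT, category TEXT, passed INTEGER)"
)
# Hypothetical records extracted from four inspection reports.
conn.executemany(
    "INSERT INTO inspection_reports VALUES (?, ?, ?)",
    [
        ("report_001.pdf", "safety", 0),
        ("report_002.pdf", "safety", 1),
        ("report_003.pdf", "emissions", 1),
        ("report_004.pdf", "safety", 0),
    ],
)

# "How many reports failed safety checks?" is a filter plus a count.
failed_safety = conn.execute(
    "SELECT COUNT(*) FROM inspection_reports WHERE category = ? AND passed = 0",
    ("safety",),
).fetchone()[0]

# The same table supports grouping without touching the source files again.
failures_by_category = dict(
    conn.execute(
        "SELECT category, COUNT(*) FROM inspection_reports "
        "WHERE passed = 0 GROUP BY category"
    ).fetchall()
)
print(failed_safety, failures_by_category)
```

The same query over the same records always gives the same answer, and a natural-language layer only needs to translate the question into a query like these.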
The approach behind Sifter
This idea led me to build Sifter.
The workflow is intentionally simple:
- Upload a collection of files
- Describe what matters in natural language
- Sifter infers a schema
- Files are processed into typed records
- Query the resulting dataset in natural language
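The "processed into typed records" step can be pictured roughly as follows. This is not Sifter's API or implementation: the schema format, the `to_record` helper, and the `model_output` string are all hypothetical, standing in for whatever an extraction model returns for a single file.

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical schema: field name -> function that coerces the raw value.
RECEIPT_SCHEMA = {
    "merchant": str,
    "total": Decimal,
    "date": date.fromisoformat,
}


def to_record(raw_json: str, schema: dict) -> dict:
    """Parse extractor output and coerce each schema field to its type."""
    raw = json.loads(raw_json)
    missing = [name for name in schema if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: coerce(raw[name]) for name, coerce in schema.items()}


# Stand-in for the JSON an extraction model might return for one receipt.
model_output = '{"merchant": "Corner Cafe", "total": "20.00", "date": "2024-02-03"}'
record = to_record(model_output, RECEIPT_SCHEMA)
print(record)
```

Coercing values at ingestion time means a malformed date or a missing total fails loudly for one file instead of silently skewing every later query.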
The files can be:
- PDFs
- images
- photos
- scanned documents
- documents in multiple languages
The key idea is that the system is not retrieving chunks.
It is querying records.
Why this matters
Most organizations already contain enormous amounts of latent structured data.
The problem is not the absence of data.
The problem is that the structure is trapped inside files.
A folder is often just a database waiting to exist.
Links
OSS repo:
https://github.com/sifter-ai/sifter
Cloud:
https://sifter.run
