
CY Ong


Mixed document packs need triage before they need smarter extraction

Most document pipelines are easier to build when you assume each upload is one self-contained document with one obvious role.

That assumption breaks quickly in production.

Real workflows often receive mixed packs: an invoice plus a receipt, a KYC form plus an ID, a claim form plus supporting pages, or a trade packet with primary and secondary documents mixed together. If all of that goes into one extraction path unchanged, downstream interpretation becomes much harder than it needs to be.

What broke

In practice, the failures did not look dramatic. They looked operational.

  • Supporting pages were interpreted like primary pages.
  • Partial packets were handled like complete submissions.
  • Similar-looking fields competed across pages that served different roles.
  • Reviewers spent time figuring out page purpose before they could judge extraction quality.
  • Schema logic got more complicated because the intake stage had already thrown away too much context.

This is why a lot of “extraction issues” are really intake issues: the pages were grouped, ordered, or labeled poorly before extraction ever ran.

A practical approach

If I were designing this from scratch, I would add a triage layer before deep extraction.

That layer would do a few simple things well:

  • Classify document and page type early.
  • Preserve packet structure so pages remain grouped.
  • Mark the likely anchor page for the workflow.
  • Separate supporting pages from primary pages.
  • Route mixed or unclear packets for light review before full schema mapping.
  • Carry page role into downstream extraction so interpretation stays grounded.

This does not need to be perfect to be useful. Even a modest triage step can make later extraction and review noticeably easier to reason about.
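As a rough illustration of what that triage layer might look like, here is a minimal Python sketch. The names (`Page`, `Packet`, `classify_role`, `triage`) and the keyword-based heuristic are my own placeholders, not a real classifier; the point is the shape of the stage: label each page's role, pick an anchor page when there is exactly one obvious candidate, and flag everything else for light review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class PageRole(Enum):
    PRIMARY = "primary"
    SUPPORTING = "supporting"
    UNKNOWN = "unknown"

@dataclass
class Page:
    page_id: str
    text: str
    role: PageRole = PageRole.UNKNOWN

@dataclass
class Packet:
    packet_id: str
    pages: List[Page]
    anchor_page_id: Optional[str] = None
    needs_review: bool = False

def classify_role(page: Page) -> PageRole:
    # Placeholder heuristic: a real system would use a trained
    # page classifier here, not a keyword check.
    if "invoice" in page.text.lower():
        return PageRole.PRIMARY
    if page.text.strip():
        return PageRole.SUPPORTING
    return PageRole.UNKNOWN

def triage(packet: Packet) -> Packet:
    # Label every page, but never flatten the packet: pages stay grouped.
    for page in packet.pages:
        page.role = classify_role(page)
    primaries = [p for p in packet.pages if p.role is PageRole.PRIMARY]
    if len(primaries) == 1:
        packet.anchor_page_id = primaries[0].page_id
    else:
        # Zero or multiple candidate anchors -> route for light review
        # before full schema mapping.
        packet.needs_review = True
    return packet
```

Note that the packet object carries role labels and the anchor page downstream, so later extraction stages never have to re-derive page purpose from raw text.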

Why this helps

There are three concrete benefits.

1) Extraction becomes more explainable

If the system knows which page anchors the case, field mapping becomes easier to interpret later.

2) Reviewer effort drops

A reviewer who can immediately see page role and packet structure spends less time reconstructing the case manually.

3) Schema logic becomes less brittle

Instead of one giant extraction path that tries to account for every possible page, you can keep interpretation scoped to more realistic document roles.

Tradeoffs

There are tradeoffs, of course.

  • You now have one more stage in the pipeline.
  • Triage mistakes can still happen.
  • You need to retain packet-level context rather than flatten everything into one request.

But in most mixed-pack workflows, those tradeoffs are cheaper than the long-term cost of forcing every page through the same logic.

Implementation notes

A lightweight implementation can start with:

  • packet-level grouping
  • page-type classification
  • role labeling
  • review routing for unclear packs

Only after that would I invest in more complex extraction behavior.
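The first and last items on that list can be sketched in a few lines. The `packet_id` and `label_confidence` keys and the 0.7 threshold are assumptions for illustration; the idea is just that grouping and routing are cheap to implement relative to the clarity they buy:

```python
from collections import defaultdict

def group_into_packets(uploads):
    """Group incoming pages by their packet id so packet structure
    survives intake instead of being flattened into loose pages."""
    packets = defaultdict(list)
    for page in uploads:
        packets[page["packet_id"]].append(page)
    return dict(packets)

def route_packet(pages, min_confidence=0.7):
    """Send any packet with a low-confidence page label to light
    review; everything else proceeds to full extraction."""
    if any(p["label_confidence"] < min_confidence for p in pages):
        return "review"
    return "extract"
```

Even this much gives reviewers packets instead of loose pages, and keeps the ambiguous cases out of the automated path.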

A common mistake is to push complexity into the extractor first. That often makes the output look smarter while leaving the workflow harder to trust.

How I’d evaluate this

  • Can the system preserve packet structure?
  • Does it distinguish primary from supporting pages?
  • Can reviewers see page role quickly?
  • Does triage reduce ambiguous field mapping?
  • Is the downstream schema easier to reason about after the change?

A lot of document systems become more reliable not because the extraction layer became more powerful, but because the intake path became more disciplined.
