On March 12, 2025, the PDF ingestion pipeline for a document-heavy product crossed a hard limit: nightly batches that used to finish in three hours were now spilling into the business day, causing timeouts, missed SLAs, and angry support tickets. The project was a live production feature used by legal teams to search across contracts, scanned exhibits, and technical manuals. The stakes were clear: lost user trust and a blocked roadmap that depended on faster, more reliable document understanding.
Discovery
We traced the outage to two linked problems: a brittle retrieval layer that failed on scanned PDFs with complex layouts, and an orchestration scheme that treated every file as “same weight” during processing. The existing pipeline used an off-the-shelf OCR + embedding flow that worked for plain text, but degraded fast on mixed-layout documents (tables, figures, two-column scans). The result was high false-negative rates for entity extraction and a queue backlog.
What we needed was a system that could do more than keyword matching: a repeatable, evidence-driven research path for each document that combined layout-aware parsing, citation-style provenance, and prioritized reprocessing. That led us to evaluate specialized tooling focused on long-form analysis; the team agreed we needed a dedicated Deep Research Tool to run a programmatic, document-first investigation at scale without manually curating source lists.
We documented the problem with an example failure. A legal brief processed through the old flow returned this error fragment in logs:
Error: embedding failure - token overflow at pipeline.step.embed(4500 tokens)
Context: OCR produced repeated headers and malformed table OCR output that corrupted the tokenizer.
This concrete error drove two decisions: limit token inputs via smart chunking, and move heavy context reasoning off to a deeper research layer that could synthesize across multiple pages before producing final extractions.
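The chunking decision can be sketched concretely. Below is a minimal illustration of deterministic, token-budgeted chunking; `count_tokens` is a crude stand-in for whatever tokenizer the pipeline actually uses, and the function name and budget are illustrative, not the production code.

```python
def count_tokens(text: str) -> int:
    # Whitespace proxy; a real pipeline would use its embedding model's tokenizer.
    return len(text.split())

def chunk_pages(pages, max_tokens=1500):
    """Greedily pack whole pages into chunks that stay under max_tokens.

    Deterministic: the same pages always produce the same chunks, which
    keeps reprocessing and debugging reproducible.
    """
    chunks, current, current_tokens = [], [], 0
    for page in pages:
        t = count_tokens(page)
        if current and current_tokens + t > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(page)
        current_tokens += t
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Note the edge case: a single page that exceeds the budget still becomes its own chunk rather than being dropped, so no content is silently lost.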
Implementation
Phase 1 - stabilization (week 1): introduce deterministic chunking and prioritize files. We inserted a lightweight preprocessor that extracted page-level layout metadata and classified pages as text-first, table-first, or image-first. That let us route documents down different micro-pipelines.
A small snippet shows the routing logic used in the orchestrator:
# routify.py - determine micro-pipeline based on layout
def route_document(layout_stats):
    if layout_stats['table_density'] > 0.3:
        return 'table_pipeline'
    if layout_stats['image_coverage'] > 0.25:
        return 'image_pipeline'
    return 'text_pipeline'

# used in orchestration
pipeline = route_document(extract_layout_stats(pdf_blob))
Phase 2 - deep-reasoning overlay (weeks 2-3): the team budgeted time to prototype a research-style assistant that could plan a reading strategy for each document batch (identify tables to extract, pages to OCR at higher quality, and sections to prioritize for citation). We integrated a third-party component that acts like an AI Research Assistant to orchestrate multi-step passes: read → plan → extract → verify. This was not a blind LLM call; it ran a plan, logged decisions, and attached provenance to every extracted fact.
Before fully trusting the overlay, a test run surfaced a friction point: the assistant would sometimes conflate citations across documents when references were vague. The fix was to add document-scoped prefixes to all internal identifiers and a strict evidence threshold (two independent page-level matches required before a claim could be promoted).
Phase 3 - scale & resilience (weeks 4-6): after tuning, we replaced the synchronous single-worker model with an async worker pool that scheduled deep passes only when fast-path heuristics failed. This decreased worker contention and allowed us to reserve the heavy reasoning passes for the worst-case documents. The orchestration layer also exposed a “deep audit” endpoint that let engineers replay the reasoning steps for any extraction, which proved critical during debugging.
Here is an example curl to trigger a replay, used heavily in postmortems:
# trigger a deep-research replay for a document id
curl -X POST "https://internal.api/research/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"document_id":"doc-20250312-47","mode":"full-audit"}'
To automate quality gates, we implemented a simple rule config that the system loads at runtime. That config allowed us to mark certain fields as “must-have” and to set retry thresholds per document class.
# gates.yml - quality gate sample
fields:
  - name: counterparty
    required: true
    min_confidence: 0.85
  - name: effective_date
    required: false
    min_confidence: 0.70
A mid-implementation decision compared two paths: expand the cheap embedding layer to handle noisy OCR, or adopt a dedicated deep-research overlay and keep the cheap layer limited. We chose the latter because expanding embeddings would have multiplied processing time across the entire corpus; isolating complexity to the problematic documents gave better ROI. To run this efficiently, we used a specialized Deep Research AI mode that supported stepwise plans and stronger provenance, which matched our need for reproducible, auditable extraction.
Results
After six weeks of incremental rollout, the system showed a clear transformation. Nightly batch completion returned to under three hours for 95% of jobs; the remaining 5% entered the deep-research path with documented decisions. The queue backlog that had caused SLA misses was eliminated, and manual reprocesses dropped dramatically.
Key comparative outcomes (qualitative and reproducible):
- Processing reliability moved from fragile to stable; previously failing documents now produced verifiable extractions through multi-pass reasoning.
- The team reduced false negatives on entity extraction by a significant margin after introducing layout-aware routing and deep passes.
- Operational cost stayed efficient: heavy passes were targeted, so average compute per document increased only modestly while overall throughput improved.
Trade-offs were explicit. The deep-research overlay added latency for a minority of documents and required more engineering oversight during early rollout. It also increased complexity in the debugging workflow, which we mitigated by adding the replay and audit endpoints shown above.
A short qualitative ROI summary: the architecture went from "best-effort extraction" to a "tiered confidence pipeline" that delivered reproducible results and much clearer debugging signals. The real lever was separating fast heuristics from slow, evidence-heavy reasoning. This pattern is the core of modern document AI workflows and the reason teams often adopt a research-style orchestration layer rather than pushing every document through a single model.
Practical takeaway: if your document pipeline fails at scale, add layout-aware routing, isolate heavy reasoning into a targeted research pass, and require provenance for promoted facts. These moves keep average costs low and results reliable.
Closing the case: this was not about swapping a single model and hoping for magic; it was about rethinking the research workflow around documents (discover, plan, verify) and giving the team tools that follow that pattern. Teams building similar features should evaluate solutions that offer multi-pass planning, document-scoped provenance, and configurable quality gates; those capabilities are often the difference between brittle extraction and something engineers can trust in production.