2025-11-20 - In the middle of a heavy billing cycle, our invoice ingestion service (v2.4.1) began misclassifying line-item totals and dropping tables from multi-page PDFs. A pipeline that had been stable for eight months started returning incomplete extractions under load: customer disputes spiked, reconciliation teams escalated cases, and our SRE pager fired twice in a single day. The stakes were clear - lost revenue recognition and growing manual work for a team already at capacity. The problem lived squarely where document understanding meets research-grade verification: this was not a tuning task. It required a change in the research and verification layer of our architecture, the part of the stack you reach when quick web-search answers no longer suffice.
Discovery
We call this class of problems "document ambiguity under scale": OCR artifacts, inconsistent table layouts, and sparse context mean the model has to reason over dozens of sources and reconfirm facts against the document set itself. The category context here spans AI research assistance, AI search, and deep search - different tools play different roles.
The initial symptoms:
- Increased false negatives for table detection across three PDF templates.
- Higher latency in end-to-end processing as retries piled up.
- Escalations from operations when confidence dropped below policy thresholds.
What failed first was our assumption that a conversational AI search layer plus a single LLM prompt would be enough. It wasn't. We had optimized for speed (AI Search) and lost depth (Deep Search). The discovery phase was simple and painful: reproduce at scale, capture logs, and summarize failure modes.
Evidence we captured:
- Error snippet from the extraction worker showing dropped spans and a stack trace on post-processing:
RuntimeError: InconsistentSpanError: expected 24 spans, got 18
at extract_tables (/srv/ingest/processor.py:218)
at lambda_handler (/srv/ingest/processor.py:54)
Caused by: OCRConfidenceError: avg_confidence < 0.6
- Side-by-side outputs from two documents with the same template showing different table boundary detection.
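
The `OCRConfidenceError` in the trace above came from a guard on page-averaged OCR confidence. A minimal sketch of that kind of guard, assuming one confidence score per extracted span (names and the 0.6 floor mirror the log, but the code itself is illustrative, not our production worker):

```python
# Illustrative confidence guard matching the OCRConfidenceError in the
# trace above; the function and class names are assumptions.

AVG_CONFIDENCE_FLOOR = 0.6

class OCRConfidenceError(RuntimeError):
    """Raised when the page-averaged OCR confidence is below the floor."""

def check_ocr_confidence(span_confidences, floor=AVG_CONFIDENCE_FLOOR):
    """Return the average span confidence, raising if it is below the floor."""
    if not span_confidences:
        raise OCRConfidenceError("no spans extracted")
    avg = sum(span_confidences) / len(span_confidences)
    if avg < floor:
        raise OCRConfidenceError(f"avg_confidence {avg:.2f} < {floor}")
    return avg
```

A guard like this fails loudly instead of letting half-read tables flow downstream, which is exactly the failure mode the stack trace captured.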
This convinced stakeholders to treat the problem as a research problem: we needed systems that can plan research steps, read multiple documents, and synthesize structured outputs with citations - the space where a Deep Research Tool and an AI Research Assistant provide value beyond raw conversational queries.
Implementation
We split the intervention into three phases: quick containment, parallel evaluation, and staged migration.
Phase 1 - containment (48 hours):
- Added stricter confidence cutoffs to prevent bad data from reaching downstream billing.
- Routed low-confidence items into a manual queue with lightweight human-in-the-loop checks.
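
The containment routing above can be sketched in a few lines. This is a simplified model assuming a single confidence score per extraction; the cutoff value and field names are illustrative, not our policy config:

```python
# Sketch of the Phase 1 containment routing: high-confidence items go
# downstream, everything else lands in the human-in-the-loop queue.
# The 0.85 cutoff is an illustrative assumption.

CONFIDENCE_CUTOFF = 0.85

def route_extraction(item):
    """Return the destination queue for one extracted item."""
    if item["confidence"] >= CONFIDENCE_CUTOFF:
        return "billing"
    return "manual_review"
```

The point of keeping this step dumb was speed: it shipped in hours and stopped bad data from reaching billing while the real fix was evaluated.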
Phase 2 - parallel evaluation (2 weeks):
- Ran two flows side-by-side: the original pipeline and a deep-research augmented pipeline that performed multi-document planning and consensus checks before returning a final extraction.
- The deep pipeline rested on three tactical pillars (the labels are how we talked about them internally): a deep-research tool for planning and retrieval, an AI research assistant for paper-like evidence extraction inside PDFs, and a deep-research synthesis layer for consensus scoring.
Why this path was chosen:
- Alternatives considered: larger single-model LLM (fast but brittle), more hand-coded heuristics (slow to scale), and third-party SaaS OCR tuning (expensive and limited). The deep-research approach balanced maintainability and explainability.
- Trade-offs: higher per-job latency (acceptable under batch windows), more infra complexity, but much higher precision and verifiability.
Concrete integration snippet - the orchestration that wrapped the research plan (this is the call we tested in staging, lightly cleaned up; `document_id` comes from the ingest event):
# orchestration.py (staging)
from retriever import PDFRetriever
from planner import ResearchPlanner
from synthesizer import ConsensusSynthesizer

retriever = PDFRetriever(bucket='ingest-bucket', templates=['invoice-v2'])
planner = ResearchPlanner(strategy='multi-doc')
synth = ConsensusSynthesizer(threshold=0.85)

def process(document_id):
    # Fetch related documents, plan sub-questions, gather evidence,
    # then synthesize a consensus extraction with citations.
    docs = retriever.fetch(document_id)
    plan = planner.create_plan(docs)
    evidence = planner.run_steps(plan)
    return synth.synthesize(evidence)
Context: this replaced a single-step prompt that passed raw text to the LLM. The planner explicitly broke out sub-questions (table boundaries, currency detection, redactions) and ran verification passes.
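
To make the sub-question decomposition concrete, here is a minimal sketch of the kind of plan the planner produced. The real `ResearchPlanner` is internal; these structures and the `verify` flag are assumptions for illustration:

```python
# Illustrative plan structure: each sub-question from the text above
# (table boundaries, currency detection, redactions) becomes a step
# with its own verification pass. Names are assumptions.

from dataclasses import dataclass

@dataclass
class PlanStep:
    question: str        # the sub-question to answer for this document
    verify: bool = True  # run a second pass to reconfirm the answer

def create_plan(document_id):
    """Build the fixed sub-question plan used for invoice extraction."""
    return [
        PlanStep("detect table boundaries"),
        PlanStep("identify currency and locale"),
        PlanStep("locate redacted or low-confidence spans"),
    ]
```

Splitting the problem this way is what made results auditable: each step produces its own evidence rather than one opaque answer.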
Phase 3 - migration and observability (4 weeks):
- Gradual rollout with canary at 10% → 50% → 100% of traffic.
- Added explainability payloads to every result so downstream teams could see source spans and confidence scores.
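
A sketch of the explainability payload attached to each result, assuming per-span evidence with a page number, source text, and confidence score (the field names are illustrative, not our wire format):

```python
# Illustrative explainability payload: bundle per-span evidence with the
# result so downstream teams can trace every decision. Field names are
# assumptions.

def build_explainability_payload(document_id, spans):
    """Attach source spans and confidence scores to one extraction result."""
    return {
        "document_id": document_id,
        "evidence": [
            {"page": s["page"], "text": s["text"], "confidence": s["confidence"]}
            for s in spans
        ],
        # Surface the weakest link so reviewers can triage at a glance.
        "min_confidence": min(s["confidence"] for s in spans),
    }
```

Exposing the minimum confidence alongside the evidence list let ops sort review queues by "weakest span first" instead of re-reading whole documents.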
During the parallel evaluation we integrated a dedicated deep-research workflow provided by an external platform, which handled the long-form synthesis and planning.
The platform's deep-research workflow for long-form document synthesis demonstrated reliable multi-source cross-checking and kept a persistent trace of evidence per decision, which made it possible to both explain decisions to ops and to audit results for compliance.
(Operational command used to kick off a batch run during tests:)
# Run batch evaluation
./run_eval.sh --pipeline deep-research --dataset invoices-2025-11 --workers 8
The first runs revealed a friction point: the synthesizer returned shorter rationales for certain low-confidence tables, causing false negatives in automated reconciliation. We addressed this by raising the consensus threshold and adding a "fallback-annotate" step that flagged ambiguous spans for a micro-review UI.
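
The fallback-annotate step can be sketched as follows. This is a simplified model under stated assumptions: a single consensus score per table, a raised threshold of 0.9, and illustrative field names (none of these values are our production config):

```python
# Sketch of the "fallback-annotate" step: instead of silently rejecting
# a low-consensus table, flag its weak spans for the micro-review UI.
# Threshold and field names are illustrative assumptions.

CONSENSUS_THRESHOLD = 0.9  # raised from the original 0.85

def finalize(table):
    """Accept confident tables; annotate ambiguous ones for micro-review."""
    if table["consensus"] >= CONSENSUS_THRESHOLD:
        return {"status": "accepted", "table": table}
    return {
        "status": "needs_review",
        "table": table,
        "flagged_spans": [
            s for s in table["spans"] if s["confidence"] < CONSENSUS_THRESHOLD
        ],
    }
```

The key design choice was that ambiguity now produces an annotation, never a silent drop, which is what eliminated the false negatives in reconciliation.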
Results
After the migration:
Extraction accuracy moved from an inconsistent mid-70s percent to a stable high-90s percent across templates in production verification runs.
Human escalations dropped sharply as the verification layer rejected fewer valid results and routed fewer false positives to review queues.
Before/after comparison highlights:
- Before: average end-to-end latency 4.6s under load; after: average 6.1s (higher, but within SLA windows).
- Before: manual review rate at peak 18%; after: manual review rate at peak 4%.
- Before: mis-extraction disputes opened per day ~23; after: disputes per day <6 and resolvable without full manual intervention.
Why this mattered: the build-vs-buy trade-off favored an integrated deep-research workflow for this use case because it provided planning, multi-document reading, and evidence-level outputs out of the box - features that would have cost months to hard-build and would still lag in maintenance.
Key lessons:
- When a problem requires synthesis across documents, treat it as research, not search. AI Search is excellent for fast facts; Deep Search and AI Research Assistant capabilities are required for reproducible, auditable decisions over documents.
- Instrument early. Adding explainability and citation tracing saved weeks in post-mortem time.
- Plan for latency trade-offs. If your process requires deep verification, accept modest latency increases to gain reliability.
If you face the same plateau - repeated document ambiguity, manual queues growing out of control, and a need for verifiable outputs - aim for a workflow that combines automated planning, document-level retrieval, and consensus synthesis. Tools that expose a "deep-research" mode and evidence-first outputs are the ones that let you move from firefighting to sustainable reliability.
What changed operationally was tangible: the architecture shifted from brittle single-pass inference to a layered research and synthesis pipeline that produced auditable outputs. That shift turned a fragile extraction system into one that is stable, scalable, and much easier to defend in audits and postmortems.
Would this work for every team? No. If you need millisecond responses for low-stakes UI copy, a full deep-research layer is overkill. But when you need precision, traceability, and multi-document reasoning - the kind of problems that used to take an analyst days - adopting a deep-research workflow (and the supporting tools around it) is the practical way forward.
For teams ready to move, the next step is a focused two-week evaluation: reproduce the failure cases, run a parallel deep-research pipeline, and measure the human-review delta. That small investment separates guesswork from a repeatable, production-ready decision.