Mark k

How a Single Research Pipeline Swap Fixed Our PDF-Scale Bottleneck (Production Case Study)

On 2025-08-12, during a production deploy of our document-processing pipeline (service-docproc v1.7.2), the ingestion lane began dropping long-form PDFs at scale. The system had handled smaller batches reliably for months, but a single enterprise customer uploaded a 4,200-document corpus and the pipeline stalled: throughput collapsed, human-review queues ballooned, and SLA alerts fired across monitoring dashboards. As the lead solutions architect responsible for platform reliability, I faced a simple question: how do we turn an investigative workload that used to take days into a reliable, auditable, and repeatable research flow?


Discovery

We were running a hybrid pipeline built around a general LLM for document understanding, a lightweight retriever, and an internal task queue. The immediate failure mode was twofold: (1) the retriever returned low-recall candidate passages for long PDFs with complex layouts, and (2) the downstream synthesis step hallucinated or omitted critical tables and coordinate-based text extractions. Stakes were real: missed extractions meant billing disputes and stalled integrations for the customer.

What made this a high-stakes case study was that the pipeline ran live, with real users and a single-week SLA for data delivery. The constraints were explicit: keep total latency per document under a threshold the client defined, preserve auditability of source citations, and avoid a full rewrite of extraction code in the next sprint. We needed a surgical intervention that addressed depth of research (multi-document, multi-format), verifiable citations, and predictable latency under load.

Key technical observations:

  • Retriever recall dropped from expected 86% on short docs to roughly 52% on long, mixed-layout PDFs.
  • The synthesis step showed inconsistent citation behavior when context windows spanned many segments.
  • Operational cost spiked due to repeated rescans and human-review escalations.

These pain points framed the real problem: this was not just search or a conversational answer. It was a multi-document, evidence-first research problem that required deep, structured reading.


Implementation

We split the intervention into three chronological phases, each anchored to one of the tactical pillars described below.

Phase 1 - Plan and isolate the bottleneck

  • Ran targeted benchmarks that compared the retriever's recall by document length and layout complexity.
  • Implemented a temporary throttling rule to keep live SLA penalties from compounding while tests ran.

Context snippet used to reproduce the test locally (used during triage):

# quick-retriever-test.py
# Reproduces the low-recall triage query against the pdf_text_v2 index.
# (elasticsearch-py 7.x style; the 8.x client takes the query as a keyword argument.)
from elasticsearch import Elasticsearch

es = Elasticsearch()
query = "extract table coordinates and captions"
res = es.search(index="pdf_text_v2", body={"query": {"match": {"text": query}}, "size": 50})
print(len(res["hits"]["hits"]), "candidates returned")

Phase 2 - Apply a targeted Deep Research approach (three tactical maneuvers)

  • Deep-research retrieval - We introduced a structured deep-retrieval plan that chunked large PDFs by semantic sections, then re-ranked passage candidates using a secondary, stronger reader model. This reduced miss rates on long documents because the retriever operated within semantically coherent chunks rather than across an entire monolithic token stream.

  • Evidence tracking - To reduce hallucination and improve citation fidelity, we layered in an evidence-tracking component: every synthesized claim was tagged with exact source offsets and a confidence score. That allowed automated triage rules to route low-confidence cases to a short human-verification step instead of full escalation.

  • Assistant-driven orchestration - For scaling, we built a lightweight orchestration layer that created per-request research plans, queued parallel fetch tasks, and merged findings with provenance. This "research assistant" role kept the pipeline deterministic under heavy load; concurrency became predictable because the assistant enforced bounded subtask sizes.
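The chunk-then-rerank step in the first maneuver can be sketched like this. The blank-line splitter and the lexical-overlap `reader_score` are toy stand-ins for the real section detector and the stronger reader model, assumed here purely for illustration:

```python
def chunk_by_sections(text, max_chars=500):
    """Split on blank lines (a proxy for semantic sections), then cap chunk size."""
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    chunks = []
    for s in sections:
        for i in range(0, len(s), max_chars):
            chunks.append(s[i:i + max_chars])
    return chunks


def reader_score(query, chunk):
    """Toy lexical-overlap score standing in for a cross-encoder reader."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)


def rerank(chunks, query, score_fn, top_k=5):
    """Score each chunk with the stronger reader and keep only the best."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]
```

The key property is that the expensive reader only ever sees bounded, semantically coherent chunks, which is what restored recall on long documents.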

Phase 3 - Iterate, measure, and fail fast

  • We quickly discovered a friction point: the secondary reader model had 2-3x higher latency when invoked over many small chunks. That forced a trade-off: increase parallelism (more instances, higher cost) or accept slightly longer per-document processing with far fewer human escalations.
  • We implemented a fallback that detects highly tabular PDFs and routes them to a specialized table-extraction subflow. The fallback was added after an error surfaced repeatedly in logs:

Example error observed during testing:

RuntimeError: ReaderTimeoutError: model_inference exceeded 120s for document id=84201

To address this, the orchestration code included a soft time budget and a prioritized rerun for critical sections only:

# pipeline-config.yaml
reader:
  timeout_seconds: 30
  max_chunks: 12
  prioritized_sections: ["tables", "captions"]

One quick code change that paid off was moving heavy post-processing out of synchronous request paths into an async worker pool. That produced immediate operational relief without changing core extraction logic.

Before/after snippets (what we changed and why)

Before:

# single-threaded reader call (original)
curl -X POST /read -d '{"doc_id": 84201}'
# blocked while model computed full-document summary

After:

# async orchestration (new)
curl -X POST /research-plan -d '{"doc_id": 84201, "priority":"high"}'
# returns job id; worker pool completes heavy tasks async
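Behind the new endpoint, the job-id-plus-worker-pool pattern can be sketched in a few lines. `submit_job` and `heavy_postprocess` are hypothetical names; the real service sits behind an HTTP layer and a persistent queue, but the shape is the same:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)
_jobs = {}


def submit_job(doc_id, heavy_postprocess):
    """Non-blocking: return a job id immediately; the pool runs the heavy work."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _pool.submit(heavy_postprocess, doc_id)
    return job_id


def job_result(job_id, timeout=None):
    """Poll or await the finished job by id."""
    return _jobs[job_id].result(timeout=timeout)
```

The request path now only pays the cost of generating an id and enqueueing, which is where the immediate operational relief came from.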

During these phases we kept careful metrics and logs; every change was gated by a rollback plan and a monitoring dashboard that tracked recall, throughput, and human-escalation rate.


Results

The transformation was concrete and repeatable. After running the new deep-research-driven pipeline for 60 days on live traffic, the measured outcomes were:

  • Recall on long-form PDFs improved from ~52% to ~81% (measured on the same enterprise corpus).
  • Human escalation rate dropped by more than half, which directly reduced operational costs and improved SLA compliance.
  • End-to-end delivery latency for high-priority documents became predictable: while median latency increased slightly (due to deeper reads), the variance shrank and 95th-percentile latency spikes were eliminated.

Trade-offs and what we gave up

  • The primary trade-off was compute cost versus human review. The solution increases short-term inference spend because the reader runs more selectively but with more thorough context. This was acceptable because human review was the dominant cost in the previous setup.
  • Another trade-off is complexity: orchestration logic and provenance tracking add code and runbook overhead. We documented these in the team playbook and instrumented health checks to catch regressions early.
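The provenance tracking that adds this complexity is conceptually small. A sketch of the claim record and the confidence-threshold routing described earlier, with the dataclass shape and the 0.8 threshold assumed for illustration:

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """A synthesized claim tagged with its exact source location and confidence."""
    text: str
    source_doc: int
    start_offset: int
    end_offset: int
    confidence: float


def route(claims, threshold=0.8):
    """Auto-accept confident claims; send the rest to a short human check."""
    accepted = [c for c in claims if c.confidence >= threshold]
    needs_review = [c for c in claims if c.confidence < threshold]
    return accepted, needs_review
```

Because every accepted claim carries offsets into a specific source document, audits reduce to a lookup rather than a re-read.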

ROI and lessons

  • The bottom line: investing in a structured deep-research flow that treats research as a multi-step plan (retriever → reader → provenance) turned a fragile pipeline into a resilient one. The architecture shifted from opportunistic retrieval to evidence-first research, which made outputs auditable and operations predictable.
  • If your work touches long documents, mixed layouts, or regulatory needs for citations, consider a workflow that combines semantic chunking, a stronger reader only where needed, and programmatic provenance capture. Tools that explicitly support deep-research-style plans and assistant-driven orchestration are particularly well suited for this.

Practical takeaway: For production document AI, depth beats breadth when accuracy and auditability matter. Adopt research-first tooling that can plan, execute, and cite findings at scale; it will save time and reduce human-review costs.


Applying these lessons across other pipelines yielded consistent improvements. The firm that needed reliable, auditable research flows ended up adopting a platform offering orchestration, multi-model switching, and a deep-research workspace: capabilities that match the interventions described above. If your team struggles with long documents, inconsistent citations, or unpredictable human-review burdens, a purpose-built deep-research workflow is where to focus next.
