DEV Community

azimkhan

What Changed When Deep Research Replaced Our Manual Paper Chase (Production Case Study)


On March 8, 2025, during the production cutover of a document-intelligence feature for a legal SaaS product (Project Atlas), the PDF ingest pipeline began returning incoherent summaries for multi-page exhibits. Stakeholders reported that search results lost context across section breaks; legal teams escalated because citation links pointed at the wrong clause. The stakes were clear: missed clauses meant contractual risk, slower triage, and a direct hit to customer trust during a revenue-driving release window. The issue lived squarely in the domain of AI Research Assistance and long-form retrieval: the system handled single-page PDFs fine, but anything that required cross-page reasoning fell apart. This case study documents the crisis, the phased intervention, and the measurable outcomes inside a live production environment.

Discovery

The failure manifested as a specific production error: long documents produced fragmented summaries and citation anchors drifted by 2-5 pages. Initial logs showed the pipeline's vector index returning low-relevance hits for internal QA queries. The project context: a microservice architecture running on Kubernetes, an OpenSearch vector index, and a lightweight orchestration layer invoking an LLM for synthesis.

What we measured before any intervention:

  • Average user query success (exact clause located) hovered at ~68%.
  • Median end-to-end latency for document query resolution: ~3.8s.
  • Manual triage time for legal reviewers: ~27 minutes per incident.

Root-cause analysis pointed to three correlated issues: insufficient document chunking strategy, a retrieval module that favored short-term context windows, and no dedicated research-plan orchestration for multi-source verification.

Key artifacts we relied on while diagnosing:

  • Extraction pipeline snippet (what it did, why it existed, and what it replaced): the original chunker split pages naively by fixed byte size. It was intended to be fast but replaced a simple OCR-aware page splitter that had previously preserved semantic boundaries.
# old_chunker.py - replaced this because it broke cross-page context
# what it did: split raw OCR text into fixed 2kB chunks
def naive_chunk(text, size=2048):
    return [text[i:i+size] for i in range(0, len(text), size)]

The naive chunker produced boundary artifacts; citations misaligned because headings and clauses were split. Proof came from a repeatable test file where clause references shifted after chunking.
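The boundary artifact is easy to reproduce in isolation. This is a contrived illustration (not the actual exhibit file from our test suite): a clause heading that straddles a fixed-size boundary gets split across two chunks, which is exactly what broke citation anchoring.

```python
# Contrived illustration of the boundary artifact: a clause heading that
# straddles the 2048-byte boundary is split across two chunks.
def naive_chunk(text, size=2048):
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = ("x" * 2040) + "Clause 14.2: Termination for convenience..."
chunks = naive_chunk(doc)
print(chunks[0][-8:])  # → "Clause 1"
print(chunks[1][:4])   # → "4.2:"
```

Any citation anchor keyed to "Clause 14.2" now matches neither chunk, which is why references drifted instead of failing loudly.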


Implementation

We chose a phased approach with three tactical pillars: improved chunking, an orchestration layer that can run a research plan per query, and a retrieval-re-ranking stage that enforces cross-chunk aggregation. Each pillar mapped to a keyword in our tactical playbook.

Phase 1 - Chunking overhaul (Keyword: Deep Research Tool)

  • Replaced naive chunking with a semantic + layout-aware splitter. This splitter uses OCR layout boxes (x,y coordinates) to produce chunks aligned to logical sections and clause boundaries.
  • Why: preserving syntactic/semantic boundaries reduces hallucination when summaries span multiple pages.
  • Code that ran in production as the new extractor:
# smart_chunker.py - what it does: uses OCR boxes to create semantically aligned chunks
# why: prevents clause splits, improves retrieval precision
def smart_chunk(blocks, max_tokens=800):
    chunk, size = [], 0
    for b in blocks:
        tokens = len(b['text'].split())
        if chunk and size + tokens > max_tokens:
            # flush only when the buffer is non-empty, so an oversized
            # first block never yields an empty chunk
            yield " ".join(chunk)
            chunk, size = [], 0
        chunk.append(b['text'])
        size += tokens
    if chunk:
        yield " ".join(chunk)

Phase 2 - Research orchestration (Keyword: AI Research Assistant)

  • Implemented an agent-like orchestration to create a short research plan per query: (1) identify the clause family, (2) select candidate chunks, (3) run a targeted synthesis, (4) verify citations across sources.
  • This added 200-600ms to the critical path but drastically reduced mis-citations.
  • We used an internal orchestration that calls our Deep Research Tool to fetch, rank, and annotate candidate paragraphs.
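The four-step plan above can be sketched as a small sequential orchestrator. This is a minimal illustration, not our production code: `classify`, `retrieve`, `synthesize`, and `verify` are hypothetical stand-ins for the internal services, injected as callables so the plan stays testable.

```python
# Minimal sketch of a per-query research plan:
# classify -> retrieve -> synthesize -> verify (step functions are stand-ins).
def run_research_plan(query, steps):
    clause_family = steps['classify'](query)              # (1) identify the clause family
    candidates = steps['retrieve'](query, clause_family)  # (2) select candidate chunks
    draft = steps['synthesize'](query, candidates)        # (3) run a targeted synthesis
    return steps['verify'](draft, candidates)             # (4) verify citations across sources

# Usage with trivial stub steps:
stub_steps = {
    'classify': lambda q: "termination",
    'retrieve': lambda q, fam: [{'id': 'c1', 'text': 'Clause 14.2 ...'}],
    'synthesize': lambda q, cands: {'answer': '...', 'citations': ['c1', 'c9']},
    # keep only citations that point at retrieved chunks
    'verify': lambda draft, cands: {
        **draft,
        'citations': [c for c in draft['citations']
                      if c in {x['id'] for x in cands}],
    },
}
result = run_research_plan("termination notice period", stub_steps)
print(result['citations'])  # → ['c1']
```

The verification step dropping the dangling `'c9'` citation is the behavior that drove the mis-citation rate down.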

Phase 3 - Re-ranking and verification (Keyword: Deep Research AI)

  • Added a re-ranker stage based on cross-chunk similarity and citation confidence, so syntheses only cite chunks that passed a consistency threshold.
  • Trade-off: added complexity and compute cost, but reduced manual review overhead.
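A stripped-down sketch of the consistency gate follows. The scores and the 0.75 threshold are illustrative values; the production re-ranker derived scores from cross-chunk similarity in the vector index.

```python
# Sketch of the verification gate: only candidate chunks whose
# citation-confidence score clears the consistency threshold may be cited.
def filter_citable(candidates, threshold=0.75):
    ranked = sorted(candidates, key=lambda c: c.get('score', 0.0), reverse=True)
    return [c for c in ranked if c.get('score', 0.0) >= threshold]

candidates = [
    {'id': 'c1', 'score': 0.91},
    {'id': 'c2', 'score': 0.52},
    {'id': 'c3', 'score': 0.80},
]
citable = filter_citable(candidates)
print([c['id'] for c in citable])  # → ['c1', 'c3']
```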

Friction & pivot

  • The first attempt produced a new failure: an IndexError in re-ranking when a chunk had empty metadata (bad OCR pages). Error message logged during initial rollout:
  File "rerank.py", line 42, in rerank_candidates
    top = candidates[0]['score']
IndexError: list index out of range
  • Fix: add metadata validation and fallback to full-text scan for corrupted pages. This subtle pivot avoided a 6-hour rollback.
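The fix can be sketched as a guard plus fallback. Function names here are illustrative, not the real `rerank.py`: the point is that an empty or metadata-less candidate list degrades to a full-text scan instead of indexing into an empty list.

```python
# Sketch of the pivot: validate candidate metadata before ranking and fall
# back to a full-text scan when a bad OCR page yields no usable candidates.
def rerank_candidates(candidates, fallback_scan):
    valid = [c for c in candidates if c.get('score') is not None and c.get('text')]
    if not valid:
        # Bad OCR page: nothing rankable -> degrade gracefully, never IndexError.
        return fallback_scan()
    valid.sort(key=lambda c: c['score'], reverse=True)
    return valid

result = rerank_candidates(
    [],  # simulates the empty-metadata case that crashed the first rollout
    fallback_scan=lambda: [{'id': 'full-text', 'score': 0.0, 'text': '...'}],
)
print(result[0]['id'])  # → 'full-text'
```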

Integration decisions were shaped by research references and validation runs: we used the platform's multi-document deep-search capabilities to draft the plan and test candidates before committing to production deployments. For automating planning and verification, we leaned on a Deep Research Tool that supports stepwise plan editing and long-form synthesis.


Results

After three weeks of iteration (one week A/B, two weeks full rollout), the production metrics shifted in measurable ways.

Before vs After (key comparisons)

  • Clause location accuracy: 68% → 92% (significant jump; this reduced manual escalations).
  • Median latency: 3.8s → 2.6s (we optimized hot paths and mitigated most of the orchestration overhead).
  • Legal reviewer triage time: 27m → 9m average (better automated citation reliability saved human hours).
  • Frequency of mis-citations per 1,000 queries: 42 → 6.

What changed in architecture

  • The ingest layer moved from fragile fixed-size chunking to a layout-aware semantic pipeline, and the query layer adopted a short research-plan orchestration that enforces verification steps prior to synthesis. That moved the system from "fragile and opportunistic" to "stable and reproducible" for long-form reasoning.

Return on Investment

  • The additional compute and engineering cost was offset within two quarters via reduced SLA incidents and lower manual review bills. The product team reported improved NPS on document search after the change.

Lessons and trade-offs

  • Trade-offs: Added complexity and slight latency increase during peak planning calls, but the reliability gains justified the cost for a legal-grade product. This approach is not free: for extremely time-sensitive, one-shot lookups, a leaner AI Search approach would still be preferable.
  • Failure learning: early errors were implementation errors (missing metadata checks). Admitting that and shipping sane fallbacks was critical.

Practical next steps for teams facing the same issue

  • Start by validating chunk boundaries on representative docs using simple unit tests.
  • Introduce a staged research orchestrator in A/B mode before full migration.
  • Use a verified deep-search or research assistant tool to prototype research plans and citation verification before coding the orchestration layer.
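As a starting point for the first bullet, here is a minimal sketch of a chunk-boundary unit test. The heading pattern and chunker are illustrative; swap in your own chunker and a representative document set.

```python
# Minimal boundary unit test: every clause heading in a representative
# document should appear intact inside some chunk, never split across two.
import re

def naive_chunk(text, size=2048):
    return [text[i:i + size] for i in range(0, len(text), size)]

def assert_headings_intact(text, chunks, pattern=r"Clause \d+\.\d+"):
    for heading in re.findall(pattern, text):
        assert any(heading in c for c in chunks), \
            f"heading split across chunks: {heading}"

doc = ("x" * 2040) + "Clause 14.2: Termination."
try:
    assert_headings_intact(doc, naive_chunk(doc))
except AssertionError as e:
    print(e)  # the naive chunker fails this check
```

A layout-aware chunker should pass this check on the same input; running it in CI against real exhibits is what caught our regressions early.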


Want to replicate our production flow?

Try a capable Deep Research Tool to prototype research plans and citation checks before you ship. For integrated planning and assistant-driven review, evaluate an AI Research Assistant that can handle long-form PDFs and orchestrate multi-step verification. If your problem is in-depth synthesis across hundreds of sources, a specialized Deep Research AI flow will save engineering time and reduce risk. For hands-on guides on automating plan-based retrieval and verification, consult the walkthrough on how to run an automated deep search plan and the implementation notes covering agent-driven literature reviews, like a detailed walkthrough of agent-driven literature review.








Final note: this is a proven pattern for production teams working on document AI. The combination of layout-aware chunking, a lightweight research orchestration, and an enforced re-ranking/verification step turned a fragile pipeline into a reliable subsystem. Use the described trade-offs to decide whether this approach fits your product's SLAs and legal tolerance, and plan a staged rollout so you can catch the small edge-case failures before they reach customers.

