On 2025-10-12, during a high-volume run of our document ingestion pipeline for the payments team, the review queue ballooned and SLA breaches started appearing on the dashboard. A cluster of PDFs with scanned tables and embedded equations caused repeated failures in automated extraction; manual triage ate engineer hours and delayed releases. The problem was not a single bug but a capability gap in multi-document synthesis: our search layer returned short answers, yet nothing in the tooling could produce reproducible, citation-backed research across a set of heterogeneous documents. This case study examines that crisis, the stepwise intervention we executed in production, and the measurable shifts in throughput and developer load that followed.
Discovery
The stakes were straightforward: stalled releases, growing compliance backlog, and a support team forced into lengthy manual review. The system in place relied on a conversational search front-end plus a simple PDF parser. It solved single-question queries but choked on higher-order tasks like "find contradictions across these 47 vendor contracts" or "extract and normalize all PCI-related clauses across quarterly reports." In short: AI Search gave quick answers; it did not do deep, reproducible research on collections of documents.
What we observed:
- Frequent hallucinations when context spanned multiple files.
- No native plan-orchestration to break research tasks into sub-queries.
- High variance in latency under concurrent user load.
A production snapshot showed ingestion queue depth rising 3x during batch runs, with manual review averaging several minutes per document. The architecture at the time was a microservice that accepted document bundles, produced indexed tokens, and exposed a conversational API. The team needed something that could plan, read, reason across many sources, and surface verifiable outputs: in short, an AI research assistant or deep search capability.
Before touching code, we documented the retrieval config that had been in production for two quarters, a basic dense retrieval layer:
# retrieval_config.yml
index: faiss-v1
embedding_model: sentence-transformers/all-MiniLM-L6-v2
max_doc_chunk_size: 2048
similarity_threshold: 0.72
reranker: none
This configuration explained part of the problem: coarse fixed-size chunks left context fragmented across chunk boundaries, and with no reranker the first-pass similarity ranking was final, which meant low precision on multi-hop queries.
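To make the missing rerank stage concrete, here is a minimal two-stage sketch: a first-pass filter at the 0.72 similarity threshold from the config, followed by a reordering step. The `Candidate` type, the `lexical_overlap` scorer, and the sample chunks are hypothetical stand-ins for illustration; a production reranker would more likely be a cross-encoder model than word overlap.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    text: str
    similarity: float  # first-pass dense similarity, assumed precomputed


def lexical_overlap(query: str, text: str) -> float:
    """Toy rerank score (assumption): fraction of query terms found in the chunk."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, candidates, threshold=0.72, top_k=3):
    # Stage 1: keep only chunks at or above the similarity threshold.
    kept = [c for c in candidates if c.similarity >= threshold]
    # Stage 2: reorder by rerank score, breaking ties on dense similarity.
    kept.sort(
        key=lambda c: (lexical_overlap(query, c.text), c.similarity),
        reverse=True,
    )
    return kept[:top_k]


candidates = [
    Candidate("a", "vendor shall encrypt cardholder data at rest", 0.74),
    Candidate("b", "quarterly revenue grew in the payments segment", 0.81),
    Candidate("c", "cardholder data retention policy", 0.60),
]
ranked = retrieve_then_rerank("cardholder data encryption clause", candidates)
print([c.chunk_id for c in ranked])  # ['a', 'b']
```

Chunk "b" has the higher dense similarity but shares no terms with the query, so the rerank stage moves "a" ahead of it. Chunk "c" never reaches the reranker because it falls below the threshold, which is one reason a strict threshold and a reranker need to be tuned together.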
Implementation
We implemented the intervention in three chronological phases: plan, pilot, and rollout. Each phase treated "research" as an engineering feature with SLAs.
Phase 1 - Plan: Define pillars and measurable goals.
- Pillars: Deep Research AI, Deep Research Tool, and AI Research Assistant. Each pillar stood for one capability: autonomous planning, deep multi-document reading, and research-thread management, respectively.
- Goals: reduce manual triage by 60%, turn 8-hour research tasks into sub-2-hour reports, and make every answer traceable to source snippets.
Phase 2 - Pilot: Replace the thin search layer with an orchestration layer that could (a) plan sub-questions, (b) fetch and re-rank candidate sources, (c) perform step-by-step synthesis, and (d) emit a structured report with citations. To connect existing services we added a lightweight adapter that wrapped the orchestrator and the parser.
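The four stages (a) through (d) can be sketched as a small pipeline. This is an illustrative outline only, not our production orchestrator: `Finding`, `Report`, `run_research`, and the three stub functions are hypothetical names, and the keyword-matching fetch stands in for real retrieval and reranking.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Finding:
    claim: str
    source: str   # document the claim is traced to
    snippet: str  # supporting text from that document


@dataclass
class Report:
    task: str
    plan: List[str]
    findings: List[Finding] = field(default_factory=list)


def run_research(task: str, documents: Dict[str, str],
                 plan_fn: Callable, fetch_fn: Callable,
                 synthesize_fn: Callable) -> Report:
    plan = plan_fn(task)                              # (a) plan sub-questions
    report = Report(task=task, plan=plan)
    for sub_question in plan:
        sources = fetch_fn(sub_question, documents)   # (b) fetch candidate sources
        report.findings.extend(                       # (c) step-by-step synthesis
            synthesize_fn(sub_question, sources))
    return report                                     # (d) structured report with citations


# Stub stages, all hypothetical and deliberately trivial.
def plan_fn(task: str) -> List[str]:
    return [f"{task}: termination terms", f"{task}: liability caps"]


def fetch_fn(question: str, documents: Dict[str, str]) -> List[Tuple[str, str]]:
    keyword = question.split(": ")[1].split()[0]
    return [(name, text) for name, text in documents.items() if keyword in text]


def synthesize_fn(question: str, sources: List[Tuple[str, str]]) -> List[Finding]:
    return [Finding(claim=question, source=name, snippet=text)
            for name, text in sources]


docs = {
    "contractA.pdf": "termination requires 30 days notice",
    "report1.pdf": "liability is capped at fees paid",
}
report = run_research("compare contracts", docs, plan_fn, fetch_fn, synthesize_fn)
for f in report.findings:
    print(f.source, "->", f.snippet)
```

Passing the stages in as functions keeps each one independently testable and replaceable, which is the property that made instrumenting every step practical.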
In the pilot, the ingestion worker invoked the research agent with a concise job spec:
curl -X POST https://internal-research-agent/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "task": "multi_doc_lit_review",
    "documents": ["s3://bucket/report1.pdf", "s3://bucket/contractA.pdf"],
    "output_format": "structured_report"
  }'
During the pilot we found that simply swapping models was not enough: retrieval and plan orchestration had to be integrated. We therefore tied our orchestrator to a deeper retrieval pipeline and instrumented each step for observability. For developer ergonomics, a small Python helper encapsulated job submission and streaming of intermediate reasoning states:
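# research_client.py
import requests


def submit_research_job(spec):
    """Submit a research job spec and return the parsed JSON response (streaming not shown)."""
    resp = requests.post("https://internal-research-agent/api/v1/jobs", json=spec, timeout=120)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json()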
Placement of the new "research planner" into the ingestion flow produced friction. One class of failures surfaced as timeouts when a long-chain reasoning task exceeded worker limits. An example error log:
ERROR research.worker JobTimeoutError: planning step exceeded 90s
Traceback (most recent call last):
  File "research/worker.py", line 217, in process_job
    plan = planner.create_plan(task_spec)
  File "research/planner.py", line 78, in create_plan
    raise JobTimeoutError("planning step exceeded 90s")
Friction & Pivot: Rather than increasing hard timeouts, we introduced incremental checkpoints: the planner emits a "plan skeleton" within 10s and then executes sub-steps asynchronously. That pivot preserved interactivity while allowing deeper analysis in the background.
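The checkpoint pattern can be sketched with `asyncio`. This is a simplified illustration under stated assumptions, not our planner: `make_skeleton`, `run_step`, and `plan_with_checkpoints` are hypothetical names, the sleeps stand in for model and retrieval calls, and real sub-steps may depend on one another instead of running fully in parallel.

```python
import asyncio


async def make_skeleton(task: str) -> list:
    """Hypothetical fast planning pass: returns sub-step names only."""
    await asyncio.sleep(0.01)  # stands in for a quick planner call
    return [f"{task}/extract", f"{task}/compare", f"{task}/summarize"]


async def run_step(step: str) -> str:
    await asyncio.sleep(0.02)  # stands in for slow, deep analysis
    return f"done:{step}"


async def plan_with_checkpoints(task, on_skeleton, skeleton_deadline_s=10.0):
    # Checkpoint 1: the plan skeleton must arrive within the deadline;
    # asyncio.wait_for raises a timeout error if it does not.
    skeleton = await asyncio.wait_for(
        make_skeleton(task), timeout=skeleton_deadline_s)
    on_skeleton(skeleton)  # surface the plan to the caller right away

    # Sub-steps are scheduled as tasks so they run concurrently.
    tasks = [asyncio.create_task(run_step(step)) for step in skeleton]
    # The demo waits for them here; a worker could return the tasks instead.
    return await asyncio.gather(*tasks)


events = []
results = asyncio.run(plan_with_checkpoints("job-1", events.append))
print(events[0])
print(results)
```

Only the skeleton step carries a hard deadline. The deep sub-steps get their own budgets, so a long-chain task no longer fails the whole job at the planning stage.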
Integration choices and why:
- We selected a toolchain that natively supports stepwise plans and source-level citations because alternatives (pure LLM + ad-hoc retrieval) produced irreproducible answers. For this, an orchestration-first approach worked better than a prompt-only strategy.
- We evaluated three options: (A) extend the existing search layer, (B) bolt on an orchestration agent, and (C) replace the workflow with an integrated deep-research platform. Option B struck the best balance between engineering risk and speed; Option C promised less maintenance but required onboarding. We piloted B, then moved to C for high-value batches.
During implementation we linked our orchestration logs to the research layer, which gave us traceable citations and editable research plans. For tasks that required repeatable, academic-style accuracy, the team used the service's AI Research Assistant mode to surface supporting and contradicting evidence automatically. For quicker operational scans we used the same service in a shorter mode, the Deep Research Tool style, which prioritized speed and concise reports.
Results
After the staged rollout (pilot → 2-week side-by-side → full cutover), the production behavior changed in clear, measurable ways.
Before vs After (comparative summary):
- Manual triage: from several minutes of review per document to intermittent spot checks, a significant reduction in human load.
- Research turnaround: typical multi-document synthesis dropped from an 8-hour ad hoc task to ~90-120 minutes for the same scope.
- Latency in the ingestion pipeline: previously spiky during batch loads; now stable due to asynchronous planning checkpoints.
- Developer time debugging hallucinations: effectively eliminated in cases where the report included source snippets and line-level citations.
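As a quick check on the research turnaround figure above, the reduction implied by going from 8 hours to 90-120 minutes can be computed directly. This is arithmetic on the numbers quoted in this post, not an additional measurement; `reduction_pct` is a helper defined here for illustration.

```python
def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage reduction from a before duration to an after duration."""
    return 100.0 * (before_min - after_min) / before_min


before = 8 * 60  # the 8-hour ad hoc task, in minutes
print(reduction_pct(before, 120))  # 75.0
print(reduction_pct(before, 90))   # 81.25
```

So the observed range corresponds to roughly a 75% to 81% cut in turnaround time for the same scope.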
Evidence came from logs and parity runs: we captured side-by-side outputs for the same input bundle and stored the structured reports in S3 for audit. The before outputs were short, unsupported summaries; the after outputs contained a plan, per-document extract, and a conclusion with inline citations. That last detail converted previously manual checks into quick confirmations.
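One way to turn "quick confirmation" into an automated check is to verify that every conclusion in a structured report cites a snippet that actually appears in the per-document extracts. The report schema below (`plan`, `extracts`, `conclusion`, `citations`, `snippet_id`) is an assumption made for this sketch, not the actual format of our reports.

```python
def uncited_claims(report: dict) -> list:
    """Return conclusion claims with no citations, or with citations
    that do not resolve to a known extract."""
    known = {(e["doc"], e["snippet_id"]) for e in report["extracts"]}
    missing = []
    for claim in report["conclusion"]:
        cites = [(c["doc"], c["snippet_id"]) for c in claim.get("citations", [])]
        if not cites or any(c not in known for c in cites):
            missing.append(claim["text"])
    return missing


# Hypothetical report used only to exercise the check.
sample_report = {
    "plan": ["extract PCI clauses", "compare across vendors"],
    "extracts": [
        {"doc": "contractA.pdf", "snippet_id": 1, "text": "data encrypted at rest"},
        {"doc": "report1.pdf", "snippet_id": 4, "text": "annual PCI audit passed"},
    ],
    "conclusion": [
        {"text": "Vendor A encrypts stored data.",
         "citations": [{"doc": "contractA.pdf", "snippet_id": 1}]},
        {"text": "All vendors rotate keys quarterly.", "citations": []},
        {"text": "Vendor B failed its audit.",
         "citations": [{"doc": "contractB.pdf", "snippet_id": 2}]},
    ],
}
print(uncited_claims(sample_report))
```

Here the second claim is flagged because it has no citation, and the third because its citation points at a document that is not among the extracts. A check like this confirms traceability only; it does not verify that the snippet supports the claim.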
Trade-offs and where this would not work:
- Cost: deeper research runs are compute-heavy and not cost-effective for trivial lookups.
- Latency: deep research is slower than conversational search; not suitable for single-fact queries.
- Complexity: the orchestration layer adds operational surface area and needs observability.
Primary lesson: treat Deep Search as a distinct product line inside your platform - design its SLAs and failure modes independently from quick search. For teams that need reproducible, citation-backed synthesis across many documents, an orchestration-first research assistant is the pragmatic engineering choice.
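Treating the two as separate product lines can start with something as small as an explicit routing rule and per-lane budgets. The sketch below is illustrative: `Lane`, `choose_lane`, and the budget values are assumptions, with the deep lane's two-hour budget borrowed from the sub-2-hour goal and the quick lane's two seconds chosen purely as a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    latency_budget_s: float   # SLA budget for this lane (placeholder values)
    requires_citations: bool  # whether outputs must be traceable to sources


QUICK = Lane("quick_search", latency_budget_s=2.0, requires_citations=False)
DEEP = Lane("deep_research", latency_budget_s=2 * 60 * 60, requires_citations=True)


def choose_lane(num_documents: int, needs_synthesis: bool) -> Lane:
    """Route single-fact lookups to quick search, everything else to deep research."""
    if needs_synthesis or num_documents > 1:
        return DEEP
    return QUICK


print(choose_lane(num_documents=1, needs_synthesis=False).name)   # quick_search
print(choose_lane(num_documents=47, needs_synthesis=True).name)   # deep_research
```

Keeping the budgets on the lane, not on individual requests, makes it harder for a trivial lookup to consume a compute-heavy deep run, and harder for a deep run to be judged against a conversational latency target.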
Looking forward: capture the outputs of deep runs as canonical artifacts, let teams reuse plans, and prioritize tooling that supports stepwise inspection of reasoning. For engineering teams building reliable document AI, a platform that bundles orchestration, traceable citations, and multi-document reading becomes hard to avoid: it reduces manual work, stabilizes pipelines, and produces defensible outputs you can audit during incidents.