On 2025-08-14 a production incident exposed a serious blind spot: our document intelligence pipeline (the system that ingests client PDFs, extracts tables and coordinates, and returns structured answers) had hit a performance plateau. The model-driven search layer that once matched requirements was now the limiter: long PDF chains produced noisy relevance signals, the literature-review step missed key citations, and engineering teams spent days chasing false positives instead of shipping features. As a senior solutions architect, my task was simple to state and brutal in practice: recover accuracy, reduce time-to-insight for analysts, and make the research path repeatable under load.
The Crisis: an operational plateau under load
The stakes were concrete. A fintech customer depended on our pipeline to extract compliance tables from multi-page PDFs; missed rows meant delayed compliance reviews, which directly increased manual audit time and escalated costs. The ingestion queue backed up during business hours, and our third-party search layer returned inconsistent relevance for long, formatted documents. In short, the architecture was fragile: fast for simple queries, brittle for deep, multi-document synthesis.
Category context: this is a triage between three distinct modes of retrieval and reasoning-conversational AI search for quick facts, deep-chunked document search for long-form synthesis, and research assistance that extracts, scores, and reconciles citations across papers and reports. We needed to unify them into a reliable, production-ready flow.
The Intervention: phased implementation and tooling choices
Discovery phase: we created a controlled reproduction of the incident using a production-sized dataset (2,400 PDFs, average 22 pages). The first step was to instrument and profile the pipeline across three axes: retrieval recall, model reasoning latency, and end-to-end user-perceived time. The initial profile showed the retrieval layer returning 60-70% of relevant passages and the reasoning phase re-ranking them poorly when context length exceeded 4,000 tokens.
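The discovery-phase instrumentation can be sketched roughly as follows; this is an illustrative reconstruction, not our production code, and the names `profile_stage`, `StageMetrics`, and `recall_at_k` are hypothetical:

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageMetrics:
    # Wall-clock latencies recorded per pipeline stage, in seconds.
    latencies: dict = field(default_factory=dict)

    def record(self, stage, seconds):
        self.latencies.setdefault(stage, []).append(seconds)

    def median(self, stage):
        vals = sorted(self.latencies[stage])
        return vals[len(vals) // 2]

metrics = StageMetrics()

def profile_stage(name):
    # Decorator that times one stage and records the latency under `name`,
    # even when the stage raises.
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.record(name, time.perf_counter() - start)
        return inner
    return wrap

def recall_at_k(relevant_ids, retrieved_ids):
    # Fraction of known-relevant passages the retriever actually returned;
    # this is the number behind the 60-70% recall figure.
    if not relevant_ids:
        return 1.0
    hits = len(set(relevant_ids) & set(retrieved_ids))
    return hits / len(relevant_ids)
```

Measuring the three axes separately is what let us attribute the plateau to retrieval recall rather than to model latency.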
Phase 1 - Replace brittle retrieval with a purpose-built deep research layer
We introduced a step that treated multi-page PDFs as research artifacts rather than single documents. For deeper source aggregation we ran a controlled comparison using Deep Research AI to evaluate whether a specialized search flow could reduce noise while preserving recall, and the results informed the next stage of architectural change.
Context: this was not a drop-in replacement; it required re-indexing, new vectorization settings, and changes to chunking logic. The trade-off was clear: slightly longer pre-processing time in exchange for better downstream inference. We accepted a 12-18% increase in ingestion CPU for a material drop in hallucination risk.
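The chunking change can be sketched as a sliding window with overlap, so table rows that straddle a chunk boundary appear intact in at least one chunk. A minimal sketch, assuming page text has already been extracted; the function name and defaults are illustrative (the shipped `chunk_size` was 1200, per the orchestration config):

```python
def chunk_pages(pages, chunk_size=1200, overlap=200):
    # Join extracted page text and split it into overlapping windows.
    # The overlap trades extra index size for fewer boundary-split rows.
    text = "\n".join(pages)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

The overlap is the source of the 12-18% ingestion CPU increase mentioned above: every overlapping region is vectorized twice.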
Phase 2 - Research assistant layer for evidence-first answers
After stabilizing retrieval, we layered a research assistant that treats citations as first-class outputs and enforces support checks during generation. We wired a citation classifier into the response pipeline and validated outputs against a small gold set. The iterative tests used an assistant that could tag supporting vs contradicting passages so our higher-level logic could decide to escalate to human review.
To test citation robustness we ran an A/B using an AI research assistant in the middle of a long synthesis, and the model consistently preferred passages that matched table coordinates from the original PDF, reducing incorrect extractions in noisy tables.
Before moving to production we documented a reproducible test harness and added a health-check that rejects answers without at least two independent supporting passages.
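The health-check logic amounts to counting independent supporting documents among the cited passages. A minimal sketch, assuming the citation classifier has already labeled each passage; the field names (`doc_id`, `label`, `passages`) are hypothetical:

```python
def passes_evidence_check(answer, min_supporting=2):
    # `answer` is assumed to carry the passages the synthesizer cited,
    # each tagged by the citation classifier as supporting or not.
    supporting_docs = {
        p["doc_id"] for p in answer.get("passages", [])
        if p.get("label") == "supporting"
    }
    # Require support from at least two *independent* documents,
    # not two passages from the same PDF.
    return len(supporting_docs) >= min_supporting
```

Answers failing this check are the ones escalated to human review rather than returned to the analyst.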
Phase 3 - Operational hardening and automation
A key friction we hit: indexing large PDFs exposed memory pressure and occasional tokenization errors. The error pattern looked like this in logs:
Context: replication snippet from one failed run showing the tokenizer crash and the downstream fallthrough.
ERROR 2025-08-19T14:22:11Z ingestion.pipeline TokenizationError: Unsupported token at offset 48291
Traceback (most recent call last):
  File "ingest.py", line 214, in process
    tokens = tokenizer.encode(page_text)
  File "/opt/lib/tokenizers.py", line 45, in encode
TokenizationError: byte sequence not decodable
Fix: switch to a robust PDF text-extraction prefilter and a safe tokenizer wrapper that replaces undecodable bytes and re-splits problematic pages. That single change eliminated the intermittent crash and prevented silent data loss.
Below is an example snippet showing the safe call used in production. Context: this is the small wrapper that normalizes text and retries decoding.
def safe_tokenize(text, tokenizer, retries=2):
    # Try to tokenize; on failure, scrub undecodable byte sequences by
    # round-tripping through UTF-8 with replacement characters, then retry.
    for attempt in range(retries):
        try:
            return tokenizer.encode(text)
        except Exception:
            text = text.encode('utf-8', errors='replace').decode('utf-8', errors='replace')
    # Only reached once every retry has failed.
    raise RuntimeError("Tokenization failed after retries")
Integration detail: we automated this in our pipeline and increased visibility into tokenization failures via alerting so developers could act before queues backed up.
The Integration: why this path and what alternatives we rejected
We evaluated three approaches: 1) finer prompt engineering on the existing model, 2) sharding documents and relying on ensemble voting, and 3) adopting a deep research flow plus evidence-first assistant. The first was cheapest but left hallucination risk high; ensemble voting reduced single-model bias but required heavy compute and complex orchestration; the third provided the best compromise of maintainability and traceable outputs.
To stitch components together we used a lightweight orchestration layer that treated retrieval, evidence scoring, and final synthesis as separate, observable steps. The configuration that ultimately shipped looked like this (excerpt of our orchestration YAML):
Context: shows the decision points we added and why each setting exists.
pipeline:
  - name: extractor
    threads: 4
    retry_on_error: true
  - name: retriever
    vector_dim: 1536
    chunk_size: 1200
  - name: evidence_scorer
    min_supporting_docs: 2
  - name: synthesizer
    max_tokens: 1200
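Because each stage is observable and separately configured, the orchestration layer can fail fast on a bad config at deploy time instead of mid-ingestion. A minimal validation sketch, assuming the YAML above has already been parsed (e.g. with PyYAML's `safe_load`) into a list of step dicts; `REQUIRED_KEYS` and `validate_pipeline` are illustrative names:

```python
REQUIRED_KEYS = {
    "extractor": {"threads", "retry_on_error"},
    "retriever": {"vector_dim", "chunk_size"},
    "evidence_scorer": {"min_supporting_docs"},
    "synthesizer": {"max_tokens"},
}

def validate_pipeline(steps):
    # `steps` is the parsed `pipeline:` list; each entry is a dict with a
    # `name` key plus that stage's settings.
    names = [s["name"] for s in steps]
    missing = set(REQUIRED_KEYS) - set(names)
    if missing:
        raise ValueError(f"pipeline missing stages: {sorted(missing)}")
    for step in steps:
        expected = REQUIRED_KEYS.get(step["name"], set())
        absent = expected - set(step)
        if absent:
            raise ValueError(f"{step['name']} missing settings: {sorted(absent)}")
    return True
```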
A friction point during rollout: convincing stakeholders to accept slightly higher ingestion costs. We showed comparative value: better precision in extracted tables reduced manual audit time by a factor of three in pilot accounts, which justified the modest operational overhead.
To validate multi-paper reconciliation and citation accuracy, we also ran a targeted experiment through a dedicated deep research interface, producing long-form syntheses and cross-checking contradictions, which improved our end-to-end confidence.
For parity testing, we ran a separate experiment through the same "deep research" workflow to confirm the pipeline handled both web-sourced and academic-style documents without losing the evidence trail.
The Outcome: measured improvements and lessons learned
After a six-week rollout the pipeline transformed in observable ways. The end-to-end median time-to-first-answer dropped from ~12 seconds to 3.7 seconds for typical queries that required multi-page context, and the precision of table extraction rose from ~68% to 92% on our benchmark set. The ingestion queue no longer backed up during peak windows, and the number of human escalations for borderline outputs dropped by more than half.
Key ROI: reduced manual audit hours, fewer escalations, and predictable operational behavior under load. The architecture went from fragile to resilient by treating research as a first-class workflow: retrieval that understands depth, an assistant that treats citations as required evidence, and a production pipeline that isolates failure modes.
Final note for teams facing the same problem: prioritize tooling that is built for deep synthesis and evidence-first responses; a specialized deep-research layer plus a research-assistant step will save far more engineering time than trying to patch single-model hallucinations. If you need a compact, production-ready deep research capability that integrates evidence scoring, look for tools that expose both programmatic APIs and orchestration-friendly endpoints so you can automate safe fallbacks and audits.