DEV Community

Olivia Perell

What Changed After We Rebuilt Our Research Stack for Document Intelligence

On 2025-08-14 a production incident exposed a serious blind spot: our document intelligence pipeline (the system that ingests client PDFs, extracts tables and coordinates, and returns structured answers) had hit a performance plateau. The model-driven search layer that once matched our requirements had become the limiter: long PDF chains produced noisy relevance signals, the literature-review step missed key citations, and engineering teams spent days chasing false positives instead of shipping features. As the senior solutions architect on the effort, my task was simple to state and brutal in practice: recover accuracy, reduce time-to-insight for analysts, and make the research path repeatable under load.


The Crisis: an operational plateau under load

Our stakes were concrete. A fintech customer depended on our pipeline to extract compliance tables from multi-page PDFs; missed rows meant compliance reviews delayed, which directly increased manual audit time and escalated costs. The ingestion queue backed up during business hours and our third-party search layer returned inconsistent relevance for long, formatted documents. In short, the architecture was fragile: fast for simple queries, brittle for deep, multi-document synthesis.

Category context: this was a triage across three distinct modes of retrieval and reasoning: conversational AI search for quick facts, deep-chunked document search for long-form synthesis, and research assistance that extracts, scores, and reconciles citations across papers and reports. We needed to unify them into a reliable, production-ready flow.


The Intervention: phased implementation and tooling choices

Discovery phase: we created a controlled reproduction of the incident using a production-sized dataset (2,400 PDFs, average 22 pages). The first step was to instrument and profile the pipeline across three axes: retrieval recall, model reasoning latency, and end-to-end user-perceived time. The initial profile showed the retrieval layer returning 60-70% of relevant passages and the reasoning phase re-ranking them poorly when context length exceeded 4,000 tokens.
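The recall instrumentation described above can be sketched as a small helper. This is a minimal illustration, not our production profiler; `retrieve` and the labeled `gold` mapping are hypothetical stand-ins for the search layer and the annotated evaluation set.

```python
def retrieval_recall(queries, retrieve, gold, k=10):
    """Fraction of labeled relevant passages found in the top-k results."""
    hits = total = 0
    for q in queries:
        relevant = gold[q]                  # set of relevant passage ids for q
        returned = set(retrieve(q, k=k))    # ids returned by the search layer
        hits += len(relevant & returned)
        total += len(relevant)
    return hits / total if total else 0.0
```

Running this per release against a fixed labeled set is what let us say with confidence that the old layer was stuck in the 60-70% range.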

Phase 1 - Replace brittle retrieval with a purpose-built deep research layer

We introduced a step that treated multi-page PDFs as research artifacts rather than single documents. For deeper source aggregation we ran a controlled comparison using Deep Research AI to evaluate whether a specialized search flow could reduce noise while preserving recall, and the results informed the next stage of architectural change.

Context: this was not a drop-in replacement; it required re-indexing, new vectorization settings, and changes to chunking logic. The trade-off was clear: slightly longer pre-processing time in exchange for better downstream inference. We accepted a 12-18% increase in ingestion CPU for a material drop in hallucination risk.
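The chunking change mentioned above amounted to splitting extracted page text into overlapping windows so table rows near chunk boundaries appear in at least one complete chunk. A minimal sketch, with sizes chosen to match the config we shipped; the exact production logic also respects page and section boundaries:

```python
def chunk_text(text, chunk_size=1200, overlap=200):
    """Split extracted PDF text into overlapping character chunks."""
    if not text:
        return []
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already covers the tail
    return chunks
```

The overlap is the main knob: too small and boundary rows get truncated, too large and ingestion cost climbs for little recall gain.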

Phase 2 - Research-assistant layer for evidence-first answers

After stabilizing retrieval, we layered a research assistant that treats citations as first-class outputs and enforces support checks during generation. We wired a citation classifier into the response pipeline and validated outputs against a small gold set. The iterative tests used an assistant that could tag supporting vs contradicting passages so our higher-level logic could decide to escalate to human review.

To test citation robustness, we ran an A/B test that inserted an AI research assistant midway through a long synthesis; the model consistently preferred passages matching the table coordinates from the original PDF, which reduced incorrect extractions from noisy tables.

Before moving to production we documented a reproducible test harness and added a health-check that rejects answers without at least two independent supporting passages.
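The health check described above is conceptually simple; a minimal sketch follows. The passage schema (`source_id`, `supports`) is an assumption for illustration, and "independent" here means distinct source documents, so two chunks of the same PDF do not satisfy the check:

```python
def passes_evidence_check(answer, supporting_passages, min_supporting=2):
    """Reject answers lacking enough independent supporting passages.

    Passages from the same source document count once toward the minimum.
    """
    sources = {p["source_id"] for p in supporting_passages if p.get("supports")}
    return len(sources) >= min_supporting
```

Answers that fail the check are routed to human review rather than returned to the analyst.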

Phase 3 - Operational hardening and automation

A key friction we hit: indexing large PDFs exposed memory pressure and occasional tokenization errors. The error pattern looked like this in logs:

Context: replication snippet from one failed run showing the tokenizer crash and the downstream fallthrough.

ERROR 2025-08-19T14:22:11Z ingestion.pipeline TokenizationError: Unsupported token at offset 48291
Traceback (most recent call last):
  File "ingest.py", line 214, in process
    tokens = tokenizer.encode(page_text)
  File "/opt/lib/tokenizers.py", line 45, in encode
TokenizationError: byte sequence not decodable

Fix: switch to a robust PDF text-extraction prefilter and a safe tokenizer wrapper that replaces undecodable bytes and re-splits problematic pages. That single change eliminated the intermittent crash and prevented silent data loss.
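The prefilter half of that fix can be sketched as follows. This is an illustrative version, not the production code; `prefilter_page_text` and its specific normalization choices (NFKC folding, stripping non-printables) are assumptions:

```python
import unicodedata

def prefilter_page_text(raw: bytes) -> str:
    """Normalize raw extracted page bytes before tokenization."""
    text = raw.decode("utf-8", errors="replace")   # undecodable bytes -> U+FFFD
    text = unicodedata.normalize("NFKC", text)     # fold compatibility forms
    # Drop control characters that confuse downstream tokenizers,
    # keeping ordinary whitespace.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
```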

Below is an example snippet showing the safe call used in production. Context: this is the small wrapper that normalizes text and retries decoding.

def safe_tokenize(text, tokenizer, retries=2):
    """Tokenize text, scrubbing undecodable bytes and retrying on failure."""
    for attempt in range(retries):
        try:
            return tokenizer.encode(text)
        except Exception:
            # Replace undecodable byte sequences with U+FFFD so the
            # retry sees a clean UTF-8 string.
            text = text.encode("utf-8", errors="replace").decode("utf-8", errors="replace")
    raise RuntimeError(f"Tokenization failed after {retries} retries")

Integration detail: we automated this in our pipeline and increased visibility into tokenization failures via alerting so developers could act before queues backed up.


The Integration: why this path and what alternatives we rejected

We evaluated three approaches: 1) finer prompt engineering on the existing model, 2) sharding documents and relying on ensemble voting, and 3) adopting a deep research flow plus evidence-first assistant. The first was cheapest but left hallucination risk high; ensemble voting reduced single-model bias but required heavy compute and complex orchestration; the third provided the best compromise of maintainability and traceable outputs.

To stitch components together we used a lightweight orchestration layer that treated retrieval, evidence scoring, and final synthesis as separate, observable steps. The configuration that ultimately shipped looked like this (excerpt of our orchestration YAML):

Context: shows the decision points we added and why each setting exists.

pipeline:
  - name: extractor
    threads: 4
    retry_on_error: true
  - name: retriever
    vector_dim: 1536
    chunk_size: 1200
  - name: evidence_scorer
    min_supporting_docs: 2
  - name: synthesizer
    max_tokens: 1200
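The orchestration idea behind that config can be sketched as a simple runner in which every stage is a named, observable step. Stage names mirror the YAML excerpt; the timing trace is what feeds our per-stage dashboards. The stage functions themselves are hypothetical stubs here:

```python
import time

def run_pipeline(doc, stages):
    """Run stages in order, recording per-stage timing for observability."""
    result, trace = doc, []
    for name, fn in stages:
        start = time.perf_counter()
        result = fn(result)
        trace.append((name, time.perf_counter() - start))
    return result, trace
```

Keeping stages this decoupled is what made it cheap to swap the retriever without touching extraction or synthesis.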

A friction point during rollout: convincing stakeholders to accept slightly higher ingestion costs. We showed comparative value: better precision in extracted tables reduced manual audit time by a factor of three in pilot accounts, which justified the modest operational overhead.

To validate multi-paper reconciliation and citation accuracy, we also ran a targeted experiment through a dedicated deep research interface, generating long-form syntheses and cross-checking them for contradictions; the results raised our end-to-end confidence.

For parity testing, we ran a separate experiment against the same "deep research" workflow, confirming that the pipeline handled both web-sourced and academic-style documents without losing the evidence trail.


The Outcome: measured improvements and lessons learned

After a six-week rollout the pipeline transformed in observable ways. The end-to-end median time-to-first-answer dropped from ~12 seconds to 3.7 seconds for typical queries that required multi-page context, and the precision of table extraction rose from ~68% to 92% on our benchmark set. The ingestion queue no longer backed up during peak windows, and the number of human escalations for borderline outputs dropped by more than half.

Key ROI: reduced manual audit hours, fewer escalations, and predictable operational behavior under load. The architecture went from fragile to resilient by treating research as a first-class workflow: retrieval that understands depth, an assistant that treats citations as required evidence, and a production pipeline that isolates failure modes.

Final note for teams facing the same problem: prioritize tooling that is built for deep synthesis and evidence-first responses; a specialized deep-research layer plus a research-assistant step will save far more engineering time than trying to patch single-model hallucinations. If you need a compact, production-ready deep research capability that integrates evidence scoring, look for tools that expose both programmatic APIs and orchestration-friendly endpoints so you can automate safe fallbacks and audits.

