<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gunjan Tailor</title>
    <description>The latest articles on DEV Community by Gunjan Tailor (@gunjantailor).</description>
    <link>https://dev.to/gunjantailor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3938215%2F3cbbb4e8-fd61-4eac-aeda-e6d393ac966c.png</url>
      <title>DEV Community: Gunjan Tailor</title>
      <link>https://dev.to/gunjantailor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gunjantailor"/>
    <language>en</language>
    <item>
      <title>RAG answered 70% of my questions with zero LLM tokens — here's the ingestion trick that made it possible</title>
      <dc:creator>Gunjan Tailor</dc:creator>
      <pubDate>Mon, 18 May 2026 13:35:15 +0000</pubDate>
      <link>https://dev.to/gunjantailor/i-built-a-pdf-parser-that-actually-preserves-table-structure-for-rag-heres-why-it-matters-19fo</link>
      <guid>https://dev.to/gunjantailor/i-built-a-pdf-parser-that-actually-preserves-table-structure-for-rag-heres-why-it-matters-19fo</guid>
      <description>&lt;p&gt;70% of queries. Zero LLM tokens. $0.00.&lt;br&gt;
That's what happens when you fix ingestion instead of obsessing over retrieval.&lt;br&gt;
I ran 25 questions against a 500-page nutrition textbook and got 24/25 correct (96%). Most tutorials stop there. What they don't show: 17 of those 25 questions (68%) never touched an LLM at all. They were answered by BM25 + cosine similarity in under 20ms.&lt;br&gt;
Here's why that's possible, and what I built to make it work.&lt;/p&gt;

&lt;p&gt;The ingestion problem nobody admits&lt;br&gt;
Every RAG tutorial shows the same pipeline:&lt;br&gt;
PDF → extract text → split every 512 tokens → embed → store → query&lt;br&gt;
It works fine for blog posts. It falls apart completely for anything structured.&lt;br&gt;
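Take a financial report with this revenue table:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Region&lt;/th&gt;&lt;th&gt;Q2 Revenue&lt;/th&gt;&lt;th&gt;Q3 Revenue&lt;/th&gt;&lt;th&gt;Change&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Europe&lt;/td&gt;&lt;td&gt;38.1%&lt;/td&gt;&lt;td&gt;45.2%&lt;/td&gt;&lt;td&gt;+7.1pp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Asia&lt;/td&gt;&lt;td&gt;29.3%&lt;/td&gt;&lt;td&gt;41.7%&lt;/td&gt;&lt;td&gt;+12.4pp&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Americas&lt;/td&gt;&lt;td&gt;n/a&lt;/td&gt;&lt;td&gt;52.1%&lt;/td&gt;&lt;td&gt;n/a&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;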
After blind chunking at 512 tokens, your LLM receives:&lt;br&gt;
"45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3  Asia   29.3%"&lt;br&gt;
Numbers with no column headers. No caption. No context.&lt;br&gt;
Ask "which region grew the most?" and you get an approximate guess — not an answer. The LLM isn't hallucinating because it's dumb. It's working with garbage input.&lt;br&gt;
The same silent failure happens with:&lt;/p&gt;

&lt;p&gt;Legal contracts — clause split mid-sentence, both halves meaningless alone&lt;br&gt;
API docs — code example separated from its description&lt;br&gt;
Research papers — figure caption disconnected from its analysis&lt;/p&gt;

&lt;p&gt;This is not a retrieval problem. It's an ingestion problem. And almost no one fixes it at the source.&lt;/p&gt;
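
&lt;p&gt;To make the failure concrete, here is a minimal sketch of the revenue table above going through a fixed-size splitter versus being kept as header-keyed rows. The function names, the chunk size of 6, and the use of "n/a" for the missing Americas change are assumptions for illustration only; none of this is DocNest code.&lt;/p&gt;

```python
# Hypothetical illustration: these names are not part of DocNest.
HEADERS = ["Region", "Q2 Revenue", "Q3 Revenue", "Change"]
ROWS = [
    ["Europe", "38.1%", "45.2%", "+7.1pp"],
    ["Asia", "29.3%", "41.7%", "+12.4pp"],
    ["Americas", "n/a", "52.1%", "n/a"],  # "n/a" stands in for the missing value
]


def blind_chunks(tokens, size):
    """Split a flat token list every `size` tokens, ignoring structure."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def row_records(headers, rows):
    """Keep each row as a header-keyed record so every value carries its column name."""
    return [dict(zip(headers, row)) for row in rows]


def largest_change(records):
    """Return the record with the biggest change, skipping rows with no value."""
    known = [r for r in records if r["Change"] != "n/a"]
    return max(known, key=lambda r: float(r["Change"].rstrip("p")))


# Flatten the table the way naive text extraction would.
flat_tokens = HEADERS + [cell for row in ROWS for cell in row]
chunks = blind_chunks(flat_tokens, 6)
records = row_records(HEADERS, ROWS)

print(chunks[1])                          # numbers and names, but no column headers
print(largest_change(records)["Region"])  # answerable because headers were kept
```

&lt;p&gt;The second chunk holds six values and none of the four headers, which is the situation the LLM is left to guess from. The row records answer the same question with a plain lookup.&lt;/p&gt;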

&lt;p&gt;What I built: DocNest&lt;br&gt;
I spent the last few months building DocNest — a document normalization engine that reads structure before touching content.&lt;br&gt;
Instead of chunks, every heading becomes a navigable §section with its own ID. Every table is preserved as structured JSON. Every section gets a one-sentence LLM summary and a BM25 keyword index — computed once at ingest, never again.&lt;br&gt;
The output is a .udf file (Unified Document Format), a self-contained, portable knowledge base. Share it by email, copy it to S3, open it in the VSCode extension.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from docnest.parsers.pymupdf_pdf import PyMuPDFParser
from docnest.normalizer import SectionNormaliser
from docnest.writer import UDFWriter
from docnest.reader import UDFIndex

# Parse → normalise → save
# No API key needed for this step
raw = PyMuPDFParser().parse("report.pdf")
doc = SectionNormaliser().normalise(raw)
UDFWriter().write(doc, "report.udf")

# Query
idx = UDFIndex.load("report.udf")
result = idx.query(
    "Which region had the highest Q3 growth?",
    llm_provider="groq",
    llm_model="llama-3.3-70b-versatile",
    llm_api_key="gsk_...",  # free tier at console.groq.com
)

print(result.answer)      # "Asia grew the most, up +12.4pp"
print(result.layer_used)  # 1: answered from index, 0 LLM tokens
print(result.tokens_used) # 0&lt;/code&gt;&lt;/pre&gt;
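
&lt;p&gt;The per-section BM25 keyword index is what lets a query land on the right §section without an LLM call. The sketch below is a generic Okapi BM25 scorer over section texts, not DocNest's implementation; the function name, the whitespace tokenizer, and the k1 and b defaults are assumptions.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query, sections, k1=1.5, b=0.75):
    """Score each section against the query with Okapi BM25 (whitespace tokens)."""
    docs = [text.lower().split() for text in sections]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many sections contain the term.
            df = sum(1 for d in docs if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = counts[term]
            # Length normalisation keeps long sections from winning on size alone.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical section texts standing in for an ingested document.
sections = [
    "revenue by region europe asia americas",
    "legal terms and liability clause",
    "api usage and code example",
]
scores = bm25_scores("liability clause", sections)
best = scores.index(max(scores))
print(best)
```

&lt;p&gt;Because the term statistics depend only on the document, they can be computed once at ingest and reused for every query.&lt;/p&gt;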

&lt;p&gt;The five-layer query engine&lt;br&gt;
This is the part that makes the zero-token results possible.&lt;br&gt;
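Instead of sending the full document to an LLM, queries escalate through 5 layers, stopping as soon as one can answer confidently:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;Mechanism&lt;/th&gt;&lt;th&gt;Tokens&lt;/th&gt;&lt;th&gt;Typical latency&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;Pre-computed summary + key numbers&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;BM25 + cosine → navigate to exact §section&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;&amp;lt; 20ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;Section-scoped LLM call&lt;/td&gt;&lt;td&gt;~300&lt;/td&gt;&lt;td&gt;1–3s&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;Multi-section synthesis&lt;/td&gt;&lt;td&gt;~900&lt;/td&gt;&lt;td&gt;2–5s&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;Full document fallback&lt;/td&gt;&lt;td&gt;~4000+&lt;/td&gt;&lt;td&gt;5–15s&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;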
In practice on real documents: Layers 0 and 1 handle ~70% of questions — the factual ones, the number lookups, the "what does section 3 say about X" type queries. You only pay for LLM compute when the question genuinely requires reasoning.&lt;/p&gt;
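
&lt;p&gt;The escalation itself can be sketched as a loop over layers that stops at the first confident answer. Everything here is illustrative: the LayerResult and QueryResult classes, the escalate function, the 0.8 confidence threshold, and the three toy layers are assumptions, not DocNest's internals.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LayerResult:
    answer: str
    confidence: float
    tokens_used: int


@dataclass
class QueryResult:
    answer: str
    layer_used: int
    tokens_used: int


# A layer takes the question and returns a result, or None if it cannot answer.
Layer = Callable[[str], Optional[LayerResult]]


def escalate(question: str, layers: List[Layer], threshold: float = 0.8) -> QueryResult:
    """Try layers in order; return the first answer at or above `threshold`."""
    spent = 0
    for number, layer in enumerate(layers):
        result = layer(question)
        if result is None:
            continue
        spent += result.tokens_used
        if result.confidence >= threshold:
            return QueryResult(result.answer, number, spent)
    raise LookupError("no layer produced a confident answer")


def summary_layer(question):
    # Layer 0 stand-in: only knows a pre-computed fact.
    facts = {"how many regions are listed?": "Three"}
    hit = facts.get(question.lower())
    return LayerResult(hit, 1.0, 0) if hit else None


def index_layer(question):
    # Layer 1 stand-in: a keyword match against the index, no LLM involved.
    if "q3 growth" in question.lower():
        return LayerResult("Asia grew the most, up +12.4pp", 0.9, 0)
    return None


def llm_layer(question):
    # Layer 2 stand-in: pretends to be a section-scoped LLM call (~300 tokens).
    return LayerResult("(answer from a section-scoped LLM call)", 0.85, 300)


layers = [summary_layer, index_layer, llm_layer]
lookup = escalate("Which region had the highest Q3 growth?", layers)
reasoning = escalate("Why did Asia grow faster than Europe?", layers)
print(lookup.layer_used, lookup.tokens_used)
print(reasoning.layer_used, reasoning.tokens_used)
```

&lt;p&gt;The lookup question stops at layer 1 with no tokens spent, while the reasoning question falls through to the layer that costs tokens.&lt;/p&gt;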

&lt;p&gt;Handling large PDFs without running out of RAM&lt;br&gt;
By default, Docling (the ML-quality PDF parser) loads the full document into RAM, so a 600-page PDF can exhaust the memory on most machines.&lt;br&gt;
DocNest solves this with automatic page chunking:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from docnest.parsers.pdf import DoclingPDFParser

# Auto-detects large PDFs and chunks automatically
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune explicitly for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # high throughput&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;PyMuPDF splits the PDF into N-page temp files. Docling processes each at full ML quality. Sections are merged. The output is identical to processing everything at once, and peak RAM stays constant regardless of document size.&lt;/p&gt;
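
&lt;p&gt;The splitting step can be approximated as follows. This is a sketch under assumptions, not DocNest's code: page_ranges and split_pdf are hypothetical names, and the temp-file naming is invented. The PyMuPDF calls used (fitz.open, page_count, insert_pdf, save) are the library's own, and the import is deferred so the range arithmetic runs without PyMuPDF installed.&lt;/p&gt;

```python
import os
import tempfile


def page_ranges(total_pages, chunk_pages):
    """Return inclusive, zero-based (first, last) page ranges covering the document."""
    if not chunk_pages >= 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (start, min(start + chunk_pages, total_pages) - 1)
        for start in range(0, total_pages, chunk_pages)
    ]


def split_pdf(path, chunk_pages):
    """Write each page range to its own temp PDF and return the file paths."""
    import fitz  # PyMuPDF; imported lazily so page_ranges works without it

    out_dir = tempfile.mkdtemp(prefix="pdf_chunks_")
    paths = []
    with fitz.open(path) as src:
        for first, last in page_ranges(src.page_count, chunk_pages):
            part = fitz.open()  # new, empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            out_path = os.path.join(out_dir, f"pages_{first:04d}_{last:04d}.pdf")
            part.save(out_path)
            part.close()
            paths.append(out_path)
    return paths


print(page_ranges(25, 10))
print(len(page_ranges(600, 50)))
```

&lt;p&gt;Each temp file can then be handed to the parser on its own, so only one chunk's pages need to be in memory at a time.&lt;/p&gt;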

&lt;p&gt;Real accuracy numbers&lt;br&gt;
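I tested against a 500-page open-source nutrition textbook, 25 questions, using PyMuPDF + Groq free tier:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Question type&lt;/th&gt;&lt;th&gt;Score&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Basic facts (calories, macros)&lt;/td&gt;&lt;td&gt;5/5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Detailed nutrition (fiber, glycemic index)&lt;/td&gt;&lt;td&gt;5/5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Micronutrients (vitamins, minerals)&lt;/td&gt;&lt;td&gt;4/5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Hard synthesis (BMR, omega-3, antioxidants)&lt;/td&gt;&lt;td&gt;5/5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Edge cases (hallucination traps, tables, out-of-scope)&lt;/td&gt;&lt;td&gt;5/5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Total&lt;/td&gt;&lt;td&gt;24/25 (96%)&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;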
The one failure: a table-only page where PyMuPDF extracted no text content. Fix: use DoclingPDFParser for documents where tables are the primary information carrier.&lt;/p&gt;

&lt;p&gt;Try it&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install docnest-ai&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Supported formats: PDF (Docling ML + PyMuPDF), DOCX, XLSX, HTML, Markdown&lt;br&gt;
LLM providers: Groq (free tier works), OpenAI, Ollama (fully local), Anthropic, Google, Mistral, Cohere, Bedrock, Together&lt;br&gt;
Vector backends: numpy (zero deps), FAISS, ChromaDB&lt;br&gt;
CLI:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key risks mentioned?"
docnest view report.udf   # opens structured HTML viewer in browser&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;
→ GitHub: &lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;https://github.com/tailorgunjan93/docnest&lt;/a&gt;&lt;br&gt;
→ PyPI: &lt;a href="https://pypi.org/project/docnest-ai" rel="noopener noreferrer"&gt;https://pypi.org/project/docnest-ai&lt;/a&gt;&lt;br&gt;
→ Format spec: &lt;a href="https://github.com/tailorgunjan93/udf-spec" rel="noopener noreferrer"&gt;https://github.com/tailorgunjan93/udf-spec&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's broken, what's coming&lt;br&gt;
Current version is 0.4.0a2 — alpha, but works on real documents.&lt;br&gt;
Open for contributions:&lt;/p&gt;

&lt;p&gt;PPTX parser (PowerPoint slides → §sections)&lt;br&gt;
Qdrant / Weaviate vector backends&lt;br&gt;
SharePoint + Confluence connectors&lt;br&gt;
EPUB parser for ebook indexing&lt;/p&gt;

&lt;p&gt;If you've hit the table-structure problem in your own RAG pipeline — where the LLM gets numbers without context — I'd genuinely like to hear what document type caused it. Drop it in the comments.&lt;/p&gt;

&lt;p&gt;Built in the open. Issues and PRs welcome.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
