<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ambuj Tripathi</title>
    <description>The latest articles on DEV Community by Ambuj Tripathi (@ambuj_tripathi).</description>
    <link>https://dev.to/ambuj_tripathi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4012395%2Fcc638ce4-1156-4fb3-ba3b-db3d1f0701f9.jpeg</url>
      <title>DEV Community: Ambuj Tripathi</title>
      <link>https://dev.to/ambuj_tripathi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ambuj_tripathi"/>
    <language>en</language>
    <item>
      <title>PyPDFLoader, LlamaParse, Custom Regex — I Tried Everything on Indian Government PDFs. Here's What Actually Worked.</title>
      <dc:creator>Ambuj Tripathi</dc:creator>
      <pubDate>Thu, 02 Jul 2026 15:09:42 +0000</pubDate>
      <link>https://dev.to/ambuj_tripathi/pypdfloader-llamaparse-custom-regex-i-tried-everything-on-indian-government-pdfs-heres-what-58ej</link>
      <guid>https://dev.to/ambuj_tripathi/pypdfloader-llamaparse-custom-regex-i-tried-everything-on-indian-government-pdfs-heres-what-58ej</guid>
      <description>&lt;p&gt;Six months ago I asked the same questions you're asking. "How do I handle merged cells?" "Why does my table extraction break?" "Which parser should I use?"&lt;/p&gt;

&lt;p&gt;I tried &lt;strong&gt;every popular approach&lt;/strong&gt; — PyPDFLoader, Unstructured, LlamaParse, custom regex — on some of the most painful PDFs you can imagine: Indian Government Budget documents, Finance Bills, and the &lt;strong&gt;Constitution of India&lt;/strong&gt; (400+ pages of dense legal text with footnotes on every page).&lt;/p&gt;

&lt;p&gt;This article is an honest post-mortem of what went wrong, why, and the &lt;strong&gt;only architecture that actually survived production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🤯 &lt;strong&gt;The Document From Hell&lt;/strong&gt;&lt;br&gt;
Most RAG tutorials use clean, simple PDFs. The Constitution of India is not that.&lt;/p&gt;

&lt;p&gt;Here's what you're dealing with on every single page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;19. Protection of certain rights regarding freedom of speech, etc.—
(1) All citizens shall have the right—
    (a) to freedom of speech and expression;
    (b) to assemble peaceably and without arms;
...
______________________________________________
1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3
2. Ins. by the Constitution (Forty-fourth Amendment) Act, 1978.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every page has three zones:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Article content&lt;/strong&gt; (what users actually want)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A separator line&lt;/strong&gt; (&lt;code&gt;______&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Footnotes&lt;/strong&gt; (amendment citations that ALSO begin with numbers like &lt;code&gt;1.&lt;/code&gt;, &lt;code&gt;19.&lt;/code&gt;, &lt;code&gt;34.&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;



&lt;p&gt;Those footnotes start with the same numbers as real Articles. Embedding models encode them with equal weight. This is where hallucinations are born.&lt;/p&gt;
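&lt;p&gt;The collision is easy to reproduce without any embeddings. A minimal sketch, using the sample lines from the page excerpt above and the same Article-start pattern that the chunker later in this post relies on; the variable names are illustrative only:&lt;/p&gt;

```python
import re

# Same shape as the Article-start pattern used by the chunker later in the post.
ARTICLE_START = re.compile(r"^\d{1,3}[A-Z]*\.\s+[A-Z]")

lines = [
    "19. Protection of certain rights regarding freedom of speech, etc.—",
    "1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3",
    "2. Ins. by the Constitution (Forty-fourth Amendment) Act, 1978.",
]

# Footnote lines are indistinguishable from Article headings at the lexical level.
looks_like_article = [bool(ARTICLE_START.match(line)) for line in lines]
print(looks_like_article)  # [True, True, True]
```

&lt;p&gt;Because the footnotes pass the same test as a real heading, they have to be removed by position (below the separator), not by pattern.&lt;/p&gt;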

&lt;p&gt;❌ &lt;strong&gt;Attempt 1:&lt;/strong&gt; &lt;strong&gt;LlamaParse (Agentic Tier) — The Expensive Failure&lt;/strong&gt;&lt;br&gt;
My initial setup: LlamaParse at Agentic tier (10 credits/page) + LangChain's MarkdownHeaderTextSplitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I expected:&lt;/strong&gt; Clean, hierarchically separated chunks per Article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I got:&lt;/strong&gt; 624 giant chunks from a 402-page document.&lt;/p&gt;

&lt;p&gt;LlamaParse is excellent for tables, invoices, and structured forms. But for dense continuous legal text with hundreds of numbered items, &lt;strong&gt;it merged multiple pages into single Markdown blocks.&lt;/strong&gt; Article 19 wasn't a standalone chunk — it was buried inside a 5,000-character blob alongside Articles 17, 18, 20, and a dozen footnotes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hallucination Test:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "What is Article 19?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Vector similarity ranked a footnote (&lt;code&gt;19. Ins. by Constitution (Forty-fourth Amendment)...&lt;/code&gt;) higher than the actual Article 19 text. The LLM received garbage context and returned garbage output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost damage:&lt;/strong&gt; 402 pages × 10 credits = 4,020 credits per sync. Multiple debugging iterations = 30K+ credits burned.&lt;/p&gt;
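&lt;p&gt;The arithmetic is worth spelling out, because it shows how few full re-parses it takes to pass the 30K mark. A small sketch using only the page count and per-page rate quoted above; the function name is made up for illustration:&lt;/p&gt;

```python
PAGES = 402
CREDITS_PER_PAGE = 10  # Agentic tier rate quoted above


def credits_for_syncs(full_syncs, pages=PAGES, rate=CREDITS_PER_PAGE):
    """Credits consumed by re-parsing the whole document full_syncs times."""
    return full_syncs * pages * rate


print(credits_for_syncs(1))  # 4020
print(credits_for_syncs(8))  # 32160
```

&lt;p&gt;Eight complete debugging runs are enough to exceed 30,000 credits, which is why skipping unchanged files matters so much.&lt;/p&gt;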

&lt;p&gt;🛡️ &lt;strong&gt;The Idempotency Layer: Never Waste an API Call Twice&lt;/strong&gt;&lt;br&gt;
Before fixing retrieval, I built a safety net. After burning 30K+ credits on debugging, I swore: never again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHA-256 File Hashing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="c1"&gt;# sync.py — Hash every PDF before processing
&lt;/span&gt;&lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;registry_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_registry_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;registry_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# File unchanged — skip entirely. Zero API calls.
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# File changed — delete old vectors, re-process
&lt;/span&gt;    &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;reprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;supabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every PDF is hashed with &lt;strong&gt;SHA-256&lt;/strong&gt; before processing. Hash stored in Supabase. On re-sync, if hash matches → entire file skipped. Zero parsing, zero embedding, zero Pinecone calls.&lt;/p&gt;
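&lt;p&gt;The same skip-if-unchanged idea can be sketched without any external services. In the version below a plain dict stands in for the Supabase registry and a callback stands in for the parse, embed, and upsert work; &lt;code&gt;file_sha256&lt;/code&gt;, &lt;code&gt;sync_file&lt;/code&gt;, and both stand-ins are assumptions for illustration, not the project's actual code. The file is read in blocks so a large PDF never has to sit in memory all at once.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def file_sha256(path, block_size=1024 * 1024):
    """Hash a file in fixed-size blocks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def sync_file(path, registry, reprocess):
    """Reprocess the file only when its content hash changed.

    registry: dict mapping file name to last synced hash (registry stand-in).
    reprocess: callable that does the expensive work for one file.
    Returns True when the file was reprocessed, False when it was skipped.
    """
    name = Path(path).name
    current = file_sha256(path)
    if registry.get(name) == current:
        return False  # unchanged: no parsing, no embedding, no upserts
    reprocess(path)
    registry[name] = current  # record the hash only after the work succeeded
    return True
```

&lt;p&gt;Recording the hash after the expensive step means a crash mid-sync leaves the old hash in place, so the next run retries the file instead of silently skipping it.&lt;/p&gt;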

&lt;p&gt;&lt;strong&gt;Deterministic Chunk IDs&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="c1"&gt;# chunker.py: same input = same IDs, always
&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page_number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;parent_index&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;child_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;parent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;child_index&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No random UUIDs. Chunk IDs derived from file name + page + position. Re-syncing same file = identical IDs. Pinecone upsert overwrites instead of duplicating.&lt;/p&gt;
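&lt;p&gt;Wrapped in a function, the ID scheme is easy to check for determinism. This is a sketch built from the two lines above; the function name &lt;code&gt;make_chunk_ids&lt;/code&gt; and the example arguments are hypothetical. MD5 is used here only to get a fixed-length identifier, not for security.&lt;/p&gt;

```python
import hashlib


def make_chunk_ids(source_file, page_number, parent_index, child_index):
    """Derive parent and child IDs purely from file name and position."""
    parent_id = f"{source_file}_{page_number}_{parent_index}"
    child_id = hashlib.md5(f"{parent_id}_{child_index}".encode()).hexdigest()
    return parent_id, child_id


first = make_chunk_ids("constitution of india.pdf", 12, 0, 3)
second = make_chunk_ids("constitution of india.pdf", 12, 0, 3)
print(first == second)  # True
```

&lt;p&gt;One consequence of position-based IDs: if a changed file produces fewer chunks than before, the leftover IDs are never overwritten, which is why the old vectors are deleted before re-processing.&lt;/p&gt;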

&lt;blockquote&gt;
&lt;p&gt;This is the difference between a script that works once and a system you can safely run in production every day.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;✅ &lt;strong&gt;Attempt 2:&lt;/strong&gt; The Deterministic Pipeline (What Actually Worked)&lt;br&gt;
I asked a fundamental question: "For this specific document, do I actually need an LLM to parse it?"&lt;/p&gt;

&lt;p&gt;No. The Constitution has a completely predictable structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Articles always start with &lt;code&gt;\n[number]. [Title]—&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Footnotes always come after a line of underscores&lt;/li&gt;
&lt;li&gt;Page headers always say "THE CONSTITUTION OF INDIA"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is regex territory, not LLM territory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; &lt;strong&gt;Aggressive Footnote Removal&lt;/strong&gt; &lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
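&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# parser.py: doc is assumed to be an already opened PyMuPDF document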
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove page headers
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;THE CONSTITUTION OF\s*INDIA\n\(Part.*?\)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Split at footnote separator — discard everything below
&lt;/span&gt;    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_{10,}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;clean_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only main text survives
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Zero footnotes in the vector index.&lt;/p&gt;
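&lt;p&gt;The cleaning step can be exercised on a plain string, with no PDF involved. The sketch below reuses the two patterns from the snippet above; &lt;code&gt;strip_page&lt;/code&gt; is a hypothetical helper name, and the header line in the sample is made up to fit the header pattern (the Article and footnote lines come from the page excerpt earlier in this post).&lt;/p&gt;

```python
import re

HEADER_RE = re.compile(r"THE CONSTITUTION OF\s*INDIA\n\(Part.*?\)")
SEPARATOR_RE = re.compile(r"_{10,}")


def strip_page(text):
    """Drop the running header and everything below the footnote separator."""
    text = HEADER_RE.sub("", text)
    # Split once at the first separator; a page without one is kept whole.
    main_text = SEPARATOR_RE.split(text, maxsplit=1)[0]
    return main_text.strip()


sample = (
    "THE CONSTITUTION OF INDIA\n(Part III)\n"  # illustrative header
    "19. Protection of certain rights regarding freedom of speech, etc.—\n"
    "(1) All citizens shall have the right—\n"
    "______________________________________________\n"
    "1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3\n"
)

cleaned = strip_page(sample)
print("Subs. by" in cleaned)  # False
```

&lt;p&gt;The trade-off is that any run of ten or more underscores inside the main text would also truncate the page, so this only suits documents where that run appears nowhere except as the footnote rule.&lt;/p&gt;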

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; &lt;strong&gt;Article-Boundary Chunking&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# chunker.py — Split at Article boundaries, not character counts
&lt;/span&gt;&lt;span class="n"&gt;raw_splits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_splits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Each split = exactly one Article
&lt;/span&gt;    &lt;span class="n"&gt;article_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;^(\d{1,3}[A-Z]*)\.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;article_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;article_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;article_match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="c1"&gt;# e.g., "19", "21A", "370"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 624 messy blobs → 3,248 precise chunks, each scoped to a single Article.&lt;/p&gt;
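&lt;p&gt;Here is a self-contained sketch of the boundary split, using the same two patterns as the snippet above. The helper name &lt;code&gt;split_articles&lt;/code&gt; and the shortened sample text are assumptions for illustration; the input is expected to be page text with footnotes already removed.&lt;/p&gt;

```python
import re

# A newline followed by "number[letters]. Capital" marks the start of an Article.
ARTICLE_START = re.compile(r"\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])")
ARTICLE_NUM = re.compile(r"^(\d{1,3}[A-Z]*)\.")


def split_articles(page_text):
    """Split cleaned page text into one dict per Article-like block."""
    chunks = []
    for piece in ARTICLE_START.split(page_text):
        piece = piece.strip()
        if not piece:
            continue
        match = ARTICLE_NUM.match(piece)
        chunks.append({
            "article_number": match.group(1) if match else None,
            "text": piece,
        })
    return chunks


sample = (
    "20. Protection in respect of conviction for offences.\n"
    "(1) No person shall be convicted of any offence\n"
    "21. Protection of life and personal liberty.\n"
    "No person shall be deprived of his life or personal liberty\n"
    "21A. Right to education.\n"
    "The State shall provide free and compulsory education"
)

numbers = [chunk["article_number"] for chunk in split_articles(sample)]
print(numbers)  # ['20', '21', '21A']
```

&lt;p&gt;Clause lines such as &lt;code&gt;(1)&lt;/code&gt; start with a bracket, so the lookahead ignores them and they stay attached to their Article.&lt;/p&gt;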

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; &lt;strong&gt;Metadata Injection into Pinecone&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunk_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;constitution of india.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parent_child&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_omitted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;is_omitted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article_num&lt;/span&gt;  &lt;span class="c1"&gt;# Hard-tagged at ingestion
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every chunk carries its Article identity in Pinecone. Not inferred. Not guessed. Deterministically tagged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; &lt;strong&gt;Smart LangGraph Routing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# graph.py: LangGraph Retriever Node
&lt;/span&gt;&lt;span class="n"&gt;target_article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_article&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;target_article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Bypass vector similarity — database-level equality filter
&lt;/span&gt;    &lt;span class="n"&gt;pinecone_filter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$and&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$eq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;target_article&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the equivalent of &lt;code&gt;WHERE article_number = '19'&lt;/code&gt; in SQL. Similarity ranking only runs over chunks that pass the filter, so the vector index &lt;strong&gt;cannot&lt;/strong&gt; return chunks tagged with any other Article number.&lt;/p&gt;
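&lt;p&gt;The filter itself is just a nested dict, so it can be built and tested in isolation. A minimal sketch, assuming the classifier may hand back &lt;code&gt;None&lt;/code&gt;, an empty string, or the literal strings "null" and "none" when no Article is mentioned; &lt;code&gt;build_filter&lt;/code&gt; is a hypothetical helper, and the actual Pinecone query call that would receive this dict is left out.&lt;/p&gt;

```python
def build_filter(source_file, article_number=None):
    """Build a metadata filter; add the Article clause only when one was extracted."""
    clauses = [{"source_file": {"$eq": source_file}}]
    cleaned = (article_number or "").strip()
    if cleaned.lower() not in ("", "null", "none"):
        clauses.append({"article_number": {"$eq": cleaned}})
    return {"$and": clauses}


print(build_filter("constitution of india.pdf", "19"))
```

&lt;p&gt;When the Article clause is absent, only the source-file restriction remains and retrieval behaves like an ordinary similarity search over that document.&lt;/p&gt;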

&lt;p&gt;🎯 &lt;strong&gt;Validation:&lt;/strong&gt; &lt;strong&gt;The Hallucination Test Suite&lt;/strong&gt;&lt;br&gt;
Results independently scored by a third-party LLM evaluator:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; What is Article 20?&lt;br&gt;
&lt;strong&gt;Key Behavior:&lt;/strong&gt; Returned all 3 safeguards (Ex Post Facto, Double Jeopardy, Self-Incrimination) precisely&lt;br&gt;
&lt;strong&gt;Score:&lt;/strong&gt; 9/10&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzdbg3tchlm6kvul06it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzdbg3tchlm6kvul06it.png" alt=" " width="799" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; What is Article 34?&lt;br&gt;
&lt;strong&gt;Key Behavior:&lt;/strong&gt; Correctly retrieved martial law provisions with no Schedule noise&lt;br&gt;
&lt;strong&gt;Score:&lt;/strong&gt; 9/10&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fskqh0obj5p80njndxyci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fskqh0obj5p80njndxyci.png" alt=" " width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frhjv01piiisq5xk6wqhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frhjv01piiisq5xk6wqhc.png" alt=" " width="799" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; Article 31C + Kesavananda Bharati?&lt;br&gt;
&lt;strong&gt;Key Behavior:&lt;/strong&gt; Retrieved 31C accurately; correctly refused to hallucinate case law&lt;br&gt;
&lt;strong&gt;Score:&lt;/strong&gt; 92/100&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49ricznfvkeliw3nqiy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49ricznfvkeliw3nqiy1.png" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzo9q10h8klrvi9ac70s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzo9q10h8klrvi9ac70s.png" alt=" " width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; Basic Structure Doctrine?&lt;br&gt;
&lt;strong&gt;Key Behavior:&lt;/strong&gt; Identified as judicial principle; stated it appears in no constitutional article&lt;br&gt;
&lt;strong&gt;Score:&lt;/strong&gt; Pass&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4dtsllqf27t2h05p55v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4dtsllqf27t2h05p55v2.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; Article 31B + Ninth Schedule?&lt;br&gt;
&lt;strong&gt;Key Behavior:&lt;/strong&gt; Correctly framed the Basic Structure vs Ninth Schedule tension&lt;br&gt;
&lt;strong&gt;Score:&lt;/strong&gt; 8.8/10&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Filp38i3rb90vz2glp9i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Filp38i3rb90vz2glp9i7.png" alt=" " width="799" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqj18e7ddn54cskra35c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqj18e7ddn54cskra35c1.png" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most significant result is from Query 3 (Article 31C + Kesavananda Bharati). The system responded:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The provided documents do not contain specific details regarding the Kesavananda Bharati case."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;That's not a failure. That's correct, production-grade RAG behavior. A null response is a success. A hallucinated response is a disaster.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
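&lt;p&gt;A boundary check like this can be automated in a rough way. The sketch below is a deliberately simple phrase-matching heuristic, not the LLM evaluator used for the scores above; the marker phrases and the function name &lt;code&gt;is_refusal&lt;/code&gt; are assumptions chosen to match the response quoted above.&lt;/p&gt;

```python
# Phrases that signal "the corpus does not cover this" (illustrative list).
REFUSAL_MARKERS = (
    "do not contain",
    "does not contain",
    "not mentioned in the provided",
)


def is_refusal(answer):
    """Return True when the answer admits the documents lack the information."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


quoted = (
    "The provided documents do not contain specific details "
    "regarding the Kesavananda Bharati case."
)
print(is_refusal(quoted))  # True
```

&lt;p&gt;For out-of-corpus questions the expected outcome is a refusal, so a regression suite would assert &lt;code&gt;is_refusal&lt;/code&gt; for those queries and the opposite for questions the documents do answer.&lt;/p&gt;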

&lt;p&gt;🏗️ &lt;strong&gt;The Full Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "What is Article 19?"
         ↓
   [LLM Classifier Node]
   → Extracts: article_number = "19"
         ↓
   [Retriever Node]
   → pinecone_filter = {
       "$and": [
         {"source_file": {"$eq": "constitution of india.pdf"}},
         {"article_number": {"$eq": "19"}}
       ]
     }
         ↓
   [Pinecone — Database lookup, NOT vector similarity]
         ↓
   [LLM Generator — clean, precise context]
         ↓
   Accurate response. Hallucination-resistant.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ &lt;strong&gt;Known Limitations (Being Honest)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Seventh Schedule Overlap&lt;/strong&gt; The Schedule uses numbered entries (&lt;code&gt;19. Price control&lt;/code&gt;, &lt;code&gt;34. Betting and gambling&lt;/code&gt;). The regex tags these as &lt;code&gt;article_number: "19"&lt;/code&gt;. Current impact: Low. &lt;em&gt;The LLM differentiates them in generation.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;General Conceptual Queries&lt;/strong&gt; &lt;em&gt;"What are all Fundamental Rights?"&lt;/em&gt; doesn't trigger the metadata filter. It falls back to semantic search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Cross-Article Relationships&lt;/strong&gt; The system doesn't model that Article 32 enforces Article 19. Each Article is indexed independently.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
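&lt;p&gt;The fallback for conceptual queries can be pictured as a two-way route. In the sketch below a regex stands in for the LLM classifier node purely so the example runs on its own; &lt;code&gt;route_query&lt;/code&gt; and the mode labels are hypothetical names, not part of the actual graph.&lt;/p&gt;

```python
import re

# Stand-in for the LLM classifier: find an explicit "Article N" reference.
ARTICLE_REF = re.compile(r"\barticle\s+(\d{1,3}[a-z]*)\b", re.IGNORECASE)


def route_query(query):
    """Pick metadata filtering when an Article is named, else semantic search."""
    match = ARTICLE_REF.search(query)
    if match:
        return {"mode": "metadata_filter", "article_number": match.group(1).upper()}
    return {"mode": "semantic_search", "article_number": None}


print(route_query("What is Article 19?")["mode"])               # metadata_filter
print(route_query("What are all Fundamental Rights?")["mode"])  # semantic_search
```

&lt;p&gt;Queries on the second path get none of the hard guarantees of the filter, which is why they belong in the hallucination test suite as well.&lt;/p&gt;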

&lt;p&gt;🔧 &lt;strong&gt;Tech Stack&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Parser:&lt;/strong&gt; PyMuPDF (free, local)&lt;br&gt;
&lt;strong&gt;Chunker:&lt;/strong&gt; Custom regex-based hierarchical chunker&lt;br&gt;
&lt;strong&gt;Embeddings:&lt;/strong&gt; Jina AI v3 (MRL: 1024→256 dims, 75% storage savings)&lt;br&gt;
&lt;strong&gt;Vector DB:&lt;/strong&gt; Pinecone Serverless (with metadata filtering)&lt;br&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; LangGraph (8-node agentic pipeline)&lt;br&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; Google Gemini&lt;br&gt;
&lt;strong&gt;Registry:&lt;/strong&gt; Supabase (file hashing + sync tracking)&lt;br&gt;
&lt;strong&gt;Monitoring:&lt;/strong&gt; Langfuse (LLM observability)&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Three Takeaways&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Assess document structure before choosing a parser.&lt;/strong&gt; LlamaParse is excellent for semi-structured documents. For continuous legal text with predictable patterns, a custom regex parser gives you more control at zero cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design for metadata from day one.&lt;/strong&gt; Vector similarity is a fallback, not a first choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the hallucination boundary, not just the happy path.&lt;/strong&gt; Asking your RAG system about things that aren't in the documents is as important as asking about things that are.&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;Community Response&lt;/strong&gt;&lt;br&gt;
This approach got significant traction in the AI community:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reddit (r/LangChain):&lt;/strong&gt; 50,000+ views, 500+ shares across two posts&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; 64 stars, 22 forks&lt;br&gt;
&lt;strong&gt;HuggingFace:&lt;/strong&gt; 3 published fine-tuned models (1B, 3B, 8B) with 5,500+ downloads&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;Links&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;GitHub (Full Source Code):&lt;/strong&gt; github.com/Ambuj123-lab/agentic-rag-financial-parser&lt;br&gt;
&lt;strong&gt;Live Demo:&lt;/strong&gt; ambuj-portfolio-v2.netlify.app&lt;br&gt;
&lt;strong&gt;LinkedIn:&lt;/strong&gt; linkedin.com/in/ambuj-tripathi-042b4a118&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Has anyone else dealt with footnote-heavy PDFs or failed LlamaParse attempts? How did you handle them? Drop your approach in the comments. I'd love to compare notes.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you found this useful, drop a ❤️ and follow for more production RAG content!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>langchain</category>
      <category>python</category>
    </item>
  </channel>
</rss>
