Ambuj Tripathi

Posted on Jul 2

PyPDFLoader, LlamaParse, Custom Regex — I Tried Everything on Indian Government PDFs. Here's What Actually Worked.

#ai #langchain #rag #python

Six months ago I asked the same questions you're asking. "How do I handle merged cells?" "Why does my table extraction break?" "Which parser should I use?"

I tried every popular approach — PyPDFLoader, Unstructured, LlamaParse, custom regex — on some of the most painful PDFs you can imagine: Indian Government Budget documents, Finance Bills, and the Constitution of India (400+ pages of dense legal text with footnotes on every page).

This article is an honest post-mortem of what went wrong, why, and the only architecture that actually survived production.

🤯 The Document From Hell
Most RAG tutorials use clean, simple PDFs. The Constitution of India is not that.

Here's what you're dealing with on every single page:

19. Protection of certain rights regarding freedom of speech, etc.—
(1) All citizens shall have the right—
    (a) to freedom of speech and expression;
    (b) to assemble peaceably and without arms;
...
______________________________________________
1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3
2. Ins. by the Constitution (Forty-fourth Amendment) Act, 1978.

Every page has three zones:

Article content (what users actually want)
A separator line (______)
Footnotes (amendment citations that ALSO begin with numbers like 1., 19., 34.)

Those footnotes start with the same numbers as real Articles. Embedding models encode them with equal weight. This is where hallucinations are born.
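To make the collision concrete, here is a minimal sketch run against the sample page above. The regex and variable names are illustrative and not taken from the project's code; the point is only that a naive "line starts with a number and a period" rule cannot tell Article headings from footnotes.

```python
import re

# Sample page text, copied from the excerpt above.
page = """19. Protection of certain rights regarding freedom of speech, etc.—
(1) All citizens shall have the right—
    (a) to freedom of speech and expression;
    (b) to assemble peaceably and without arms;
______________________________________________
1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3
2. Ins. by the Constitution (Forty-fourth Amendment) Act, 1978.
"""

# Naive rule: any line that begins with "<number>. " looks like an Article.
numbered_lines = re.findall(r"^(\d{1,3})\.\s", page, flags=re.MULTILINE)

print(numbered_lines)  # ['19', '1', '2']
```

Only the first match is a real Article. The other two are footnotes, and they would be embedded right alongside it.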

❌ Attempt 1: LlamaParse (Agentic Tier) — The Expensive Failure
My initial setup: LlamaParse at Agentic tier (10 credits/page) + LangChain's MarkdownHeaderTextSplitter.

What I expected: Clean, hierarchically separated chunks per Article.

What I got: 624 giant chunks from a 402-page document.

LlamaParse is excellent for tables, invoices, and structured forms. But for dense continuous legal text with hundreds of numbered items, it merged multiple pages into single Markdown blocks. Article 19 wasn't a standalone chunk — it was buried inside a 5,000-character blob alongside Articles 17, 18, 20, and a dozen footnotes.

The Hallucination Test:

Query: "What is Article 19?"

Vector similarity matched a footnote (19. Ins. by Constitution (Forty-fourth Amendment)...) higher than the actual Article 19 text. The LLM received garbage context and returned garbage output.

Cost damage: 402 pages × 10 credits = 4,020 credits per sync. Multiple debugging iterations = 30K+ credits burned.

🛡️ The Idempotency Layer: Never Waste an API Call Twice
Before fixing retrieval, I built a safety net. After burning 30K+ credits on debugging, I swore: never again.

SHA-256 File Hashing

python

# sync.py: hash every PDF before processing
import hashlib

with open(file_path, "rb") as f:
    current_hash = hashlib.sha256(f.read()).hexdigest()

# supabase.* and pinecone.* here appear to be project-level wrappers, not SDK methods
registry_entry = supabase.get_registry_entry(filename) or {}
if current_hash != registry_entry.get("file_hash"):
    # File is new or changed: delete old vectors, then re-process
    pinecone.delete_vectors(filter={"source_file": filename})
    reprocess(file_path)
    supabase.update_hash(filename, current_hash)
# Otherwise the file is unchanged: skip entirely, zero API calls

Every PDF is hashed with SHA-256 before processing, and the hash is stored in Supabase. On re-sync, if the hash matches, the entire file is skipped. Zero parsing, zero embedding, zero Pinecone calls.
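The skip logic can be tried without any external service. The sketch below is an assumption-laden stand-in: a plain dict replaces the Supabase registry, and the helper names (`sha256_of_file`, `is_unchanged`, `mark_processed`) are hypothetical. It reads the file in blocks so a large PDF is never loaded into memory at once, and it records the hash only after processing would have succeeded.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in fixed-size blocks instead of reading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_unchanged(path, registry):
    """True when the stored hash for this filename equals the current hash."""
    return registry.get(os.path.basename(path)) == sha256_of_file(path)


def mark_processed(path, registry):
    """Record the hash; call this only after processing has succeeded."""
    registry[os.path.basename(path)] = sha256_of_file(path)


# Demo: a dict stands in for the registry table (hypothetical substitute).
registry = {}
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "sample.pdf")
    with open(pdf_path, "wb") as fh:
        fh.write(b"version 1")
    first_check = is_unchanged(pdf_path, registry)   # new file
    mark_processed(pdf_path, registry)
    second_check = is_unchanged(pdf_path, registry)  # same bytes
    with open(pdf_path, "wb") as fh:
        fh.write(b"version 2")
    third_check = is_unchanged(pdf_path, registry)   # content changed

print(first_check, second_check, third_check)  # False True False
```

Updating the registry after processing, not before, means a crash mid-run leaves the old hash in place and the file is retried on the next sync.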

Deterministic Chunk IDs
python

# chunker.py: same input = same IDs, always
import hashlib

parent_id = f"{source_file}_{page_number}_{parent_index}"
child_id = hashlib.md5(f"{parent_id}_{child_index}".encode()).hexdigest()

No random UUIDs. Chunk IDs derived from file name + page + position. Re-syncing same file = identical IDs. Pinecone upsert overwrites instead of duplicating.
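Here is a small sketch of why position-derived IDs make re-syncs safe. A dict plays the role of the vector index (writing to an existing key overwrites it, much like an upsert), and the page number and chunk texts are made up for illustration. MD5 is used only as a compact fingerprint here, not for security.

```python
import hashlib


def make_child_id(source_file, page_number, parent_index, child_index):
    """Derive a stable ID from the chunk's position, mirroring the scheme above."""
    parent_id = f"{source_file}_{page_number}_{parent_index}"
    return hashlib.md5(f"{parent_id}_{child_index}".encode()).hexdigest()


# Hypothetical stand-in for the vector index.
index = {}


def sync(chunks):
    """Upsert every chunk under its deterministic ID."""
    for source_file, page, parent_idx, child_idx, text in chunks:
        index[make_child_id(source_file, page, parent_idx, child_idx)] = text


chunks = [
    ("constitution of india.pdf", 12, 0, 0, "19. Protection of certain rights..."),
    ("constitution of india.pdf", 12, 0, 1, "(1) All citizens shall have the right..."),
]

sync(chunks)
size_after_first_sync = len(index)
sync(chunks)  # re-sync of the same file
size_after_second_sync = len(index)

print(size_after_first_sync, size_after_second_sync)  # 2 2
```

With random UUIDs the second sync would have doubled the index to four entries.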

This is the difference between a script that works once and a system you can safely run in production every day.

✅ Attempt 2: The Deterministic Pipeline (What Actually Worked)
I asked a fundamental question: "For this specific document, do I actually need an LLM to parse it?"

No. The Constitution has a completely predictable structure:

Articles always start with \n[number]. [Title]—
Footnotes always come after a line of underscores
Page headers always say "THE CONSTITUTION OF INDIA"

This is regex territory, not LLM territory.

Step 1: Aggressive Footnote Removal
python

# parser.py
import re
import fitz  # PyMuPDF

doc = fitz.open(file_path)
clean_pages = []
for page_num in range(doc.page_count):
    text = doc[page_num].get_text("text")

    # Remove page headers
    text = re.sub(r'THE CONSTITUTION OF\s*INDIA\n\(Part.*?\)', '', text)

    # Split at the footnote separator and discard everything below it
    parts = re.split(r'_{10,}', text)
    clean_pages.append(parts[0])  # Only main text survives

Result: Zero footnotes in the vector index.
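The same two regexes can be exercised on a string without opening a PDF. In this sketch the `strip_page` helper and the sample page are illustrative; the header line is a made-up example that fits the header pattern, not a quote from the document.

```python
import re

HEADER_RE = re.compile(r"THE CONSTITUTION OF\s*INDIA\n\(Part.*?\)")
SEPARATOR_RE = re.compile(r"_{10,}")


def strip_page(text):
    """Drop the running header, then keep only the text above the footnote rule."""
    text = HEADER_RE.sub("", text)
    # If a page has no separator, split() returns the whole text unchanged.
    main_text = SEPARATOR_RE.split(text, maxsplit=1)[0]
    return main_text.strip()


sample_page = (
    "THE CONSTITUTION OF INDIA\n"
    "(Part III)\n"
    "19. Protection of certain rights regarding freedom of speech, etc.—\n"
    "(1) All citizens shall have the right—\n"
    "______________________________________________\n"
    "1. Subs. by the Constitution (First Amendment) Act, 1951, s. 3\n"
)

cleaned = strip_page(sample_page)
print(cleaned)
```

One thing worth checking on real pages is whether any body text contains ten or more consecutive underscores (for example, a blank form field), since everything after the first such run is dropped.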

Step 2: Article-Boundary Chunking

python

# chunker.py: split at Article boundaries, not character counts
import re

raw_splits = re.split(r'\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])', page_text)
for split in raw_splits:
    # Each split starts at an Article heading (the first may be carry-over text)
    article_match = re.match(r'^(\d{1,3}[A-Z]*)\.', split)
    article_num = article_match.group(1) if article_match else None
    # e.g., "19", "21A", "370"

Result: 624 messy blobs → 3,248 precise chunks, each tied to a single Article.
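A runnable sketch of the boundary split, using the same two patterns on a short sample. The `split_articles` helper is hypothetical, and the first line of the sample is invented carry-over text to show what happens when a page begins mid-Article.

```python
import re

ARTICLE_BOUNDARY = re.compile(r"\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])")
ARTICLE_NUMBER = re.compile(r"^(\d{1,3}[A-Z]*)\.")


def split_articles(page_text):
    """Return (article_number, text) pairs; the number is None for carry-over text."""
    chunks = []
    for piece in ARTICLE_BOUNDARY.split(page_text):
        piece = piece.strip()
        if not piece:
            continue
        match = ARTICLE_NUMBER.match(piece)
        chunks.append((match.group(1) if match else None, piece))
    return chunks


sample = (
    "(2) carry-over text from an Article that began on the previous page.\n"
    "20. Protection in respect of conviction for offences.\n"
    "(1) No person shall be convicted of any offence except...\n"
    "21. Protection of life and personal liberty.\n"
    "21A. Right to education.\n"
)

for number, text in split_articles(sample):
    print(number, "|", text.splitlines()[0])
```

Clause markers such as "(1)" never trigger a split because the lookahead requires a bare number followed by a period. Chunks that come back with `None` need a decision of their own, such as attaching them to the last Article seen on the previous page.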

Step 3: Metadata Injection into Pinecone
python

chunk_metadata = {
    "source_file": "constitution of india.pdf",
    "chunk_type": "parent_child",
    "is_omitted": is_omitted,
    "article_number": article_num  # Hard-tagged at ingestion
}

Every chunk carries its Article identity in Pinecone. Not inferred. Not guessed. Deterministically tagged.

Step 4: Smart LangGraph Routing

python

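# graph.py: LangGraph Retriever Node
# Base filter (shape taken from the architecture diagram; assumed to be built earlier in the node)
pinecone_filter = {"$and": [{"source_file": {"$eq": "constitution of india.pdf"}}]}
target_article = intent.get("article_number")
if target_article and str(target_article).lower() not in ("null", "none", ""):
    # Bypass vector similarity with a database-level equality filter
    pinecone_filter["$and"].append({
        "article_number": {"$eq": str(target_article)}
    })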

This is the equivalent of WHERE article_number = '19' in SQL. The vector index cannot return chunks from any other Article.
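To see the filter behave like a WHERE clause, the sketch below builds the same `$and`/`$eq` structure and applies it to a couple of in-memory records. `build_filter` and `matches` are hypothetical helpers. `matches` is a toy evaluator written only for this illustration and covers nothing beyond `$and` with `$eq`.

```python
def build_filter(source_file, intent):
    """Build a metadata filter; add the Article clause only when one was extracted."""
    clauses = [{"source_file": {"$eq": source_file}}]
    target = intent.get("article_number")
    if target is not None and str(target).strip().lower() not in ("null", "none", ""):
        clauses.append({"article_number": {"$eq": str(target).strip()}})
    return {"$and": clauses}


def matches(metadata, flt):
    """Toy in-memory evaluation of an $and list of $eq clauses."""
    return all(
        metadata.get(field) == cond["$eq"]
        for clause in flt["$and"]
        for field, cond in clause.items()
    )


chunks = [
    {"source_file": "constitution of india.pdf", "article_number": "19"},
    {"source_file": "constitution of india.pdf", "article_number": "20"},
]

flt = build_filter("constitution of india.pdf", {"article_number": "19"})
hits = [c for c in chunks if matches(c, flt)]
print(hits)

# When the classifier returns the string "null", only the file clause remains.
no_article = build_filter("constitution of india.pdf", {"article_number": "null"})
print(no_article)
```

Coercing the extracted value with `str()` guards against a classifier that returns the number 19 instead of the string "19", which would otherwise fail the equality match against string-typed metadata.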

🎯 Validation: The Hallucination Test Suite
Results independently scored by a third-party LLM evaluator:

| # | Query | Key Behavior | Score |
|---|-------|--------------|-------|
| 1 | What is Article 20? | Returned all 3 safeguards (Ex Post Facto, Double Jeopardy, Self-Incrimination) precisely | 9/10 |
| 2 | What is Article 34? | Correctly retrieved martial law provisions with no Schedule noise | 9/10 |
| 3 | Article 31C + Kesavananda Bharati? | Retrieved 31C accurately; correctly refused to hallucinate case law | 92/100 |
| 4 | Basic Structure Doctrine? | Identified as judicial principle; stated it appears in no constitutional article | Pass |
| 5 | Article 31B + Ninth Schedule? | Correctly framed the Basic Structure vs Ninth Schedule tension | 8.8/10 |

The most significant result is from Query 3. The system responded:

"The provided documents do not contain specific details regarding the Kesavananda Bharati case."

That's not a failure. That's correct, production-grade RAG behavior. A null response is a success. A hallucinated response is a disaster.

🏗️ The Full Architecture

Query: "What is Article 19?"
         ↓
   [LLM Classifier Node]
   → Extracts: article_number = "19"
         ↓
   [Retriever Node]
   → pinecone_filter = {
       "$and": [
         {"source_file": {"$eq": "constitution of india.pdf"}},
         {"article_number": {"$eq": "19"}}
       ]
     }
         ↓
   [Pinecone — Database lookup, NOT vector similarity]
         ↓
   [LLM Generator — clean, precise context]
         ↓
   Accurate response. Hallucination-resistant.

⚠️ Known Limitations (Being Honest)

The Seventh Schedule Overlap: The Schedule uses numbered entries (19. Price control, 34. Betting and gambling). The regex tags these as article_number: "19". Current impact: Low, because the LLM differentiates them in generation.
General Conceptual Queries: "What are all Fundamental Rights?" doesn't trigger the metadata filter, so it falls back to semantic search.
No Cross-Article Relationships: The system doesn't model that Article 32 enforces Article 19. Each Article is indexed independently.

🔧 Tech Stack
Parser: PyMuPDF (free, local)
Chunker: Custom regex-based hierarchical chunker
Embeddings: Jina AI v3 (MRL: 1024→256 dims, 75% storage savings)
Vector DB: Pinecone Serverless (with metadata filtering)
Orchestration: LangGraph (8-node agentic pipeline)
LLM: Google Gemini
Registry: Supabase (file hashing + sync tracking)
Monitoring: Langfuse (LLM observability)
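
The 75% figure in the Embeddings line follows directly from the dimension counts. As a rough sketch of how Matryoshka-style (MRL) truncation is commonly applied, the snippet below keeps the leading 256 components and re-normalizes. The vector holds placeholder values, not a real embedding, and whether this project truncates client-side or requests 256 dimensions from the API is not stated in the post.

```python
import math


def truncate_embedding(vector, dims=256):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


# Placeholder values standing in for a 1024-dim embedding.
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full)

savings = 1 - len(small) / len(full)
print(len(small), f"{savings:.0%}")  # 256 75%
```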
💡 Three Takeaways
Assess document structure before choosing a parser. LlamaParse is excellent for semi-structured documents. For continuous legal text with predictable patterns, a custom regex parser gives you more control at zero cost.

Design for metadata from day one. Vector similarity is a fallback, not a first choice.

Test the hallucination boundary, not just the happy path. Asking your RAG system about things that aren't in the documents is as important as asking about things that are.
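
One way to turn the third takeaway into a repeatable check is sketched below. Everything in it is an assumption made for illustration: `fake_rag` is a stub standing in for the real pipeline so the snippet runs on its own, and the refusal markers are simple substrings modeled on the refusal quoted earlier.

```python
REFUSAL_MARKERS = ("do not contain", "not found in the provided documents")


def is_refusal(response):
    """Crude check: does the answer admit the context lacks the information?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def fake_rag(query):
    """Stub for the real pipeline (hypothetical)."""
    if "kesavananda" in query.lower():
        return ("The provided documents do not contain specific details "
                "regarding the Kesavananda Bharati case.")
    return "Article 20 provides three safeguards ..."


# (query, should_refuse): mix out-of-scope questions with in-scope ones.
boundary_cases = [
    ("What did the court hold in Kesavananda Bharati?", True),
    ("What is Article 20?", False),
]

results = [is_refusal(fake_rag(q)) == should_refuse for q, should_refuse in boundary_cases]
print(results)  # [True, True]
```

Including in-scope queries in the same suite matters: a system that refuses everything would pass a refusal-only test.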

📊 Community Response
This approach got significant traction in the AI community:

Reddit (r/LangChain): 50,000+ views, 500+ shares across two posts
GitHub: 64 stars, 22 forks
HuggingFace: 3 published fine-tuned models (1B, 3B, 8B) with 5,500+ downloads
🔗 Links
GitHub (Full Source Code): github.com/Ambuj123-lab/agentic-rag-financial-parser
Live Demo: ambuj-portfolio-v2.netlify.app
LinkedIn: linkedin.com/in/ambuj-tripathi-042b4a118
Has anyone else dealt with footnote-heavy PDFs or failed LlamaParse attempts? How did you handle them? Drop your approach in the comments. I'd love to compare notes.

If you found this useful, drop a ❤️ and follow for more production RAG content!
