<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mukesh Z</title>
    <description>The latest articles on DEV Community by Mukesh Z (@techtrojan).</description>
    <link>https://dev.to/techtrojan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3799044%2F25af88b2-e9e3-400d-851c-e687157a7ed6.png</url>
      <title>DEV Community: Mukesh Z</title>
      <link>https://dev.to/techtrojan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/techtrojan"/>
    <language>en</language>
    <item>
      <title>I Built a Baseline RAG System — Then Measured Where It Actually Breaks</title>
      <dc:creator>Mukesh Z</dc:creator>
      <pubDate>Sun, 01 Mar 2026 03:29:42 +0000</pubDate>
      <link>https://dev.to/techtrojan/i-built-a-baseline-rag-system-then-measured-where-it-actually-breaks-3724</link>
      <guid>https://dev.to/techtrojan/i-built-a-baseline-rag-system-then-measured-where-it-actually-breaks-3724</guid>
      <description>&lt;p&gt;Most RAG demos stop at:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Look, it answers correctly.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wanted to go further.&lt;/p&gt;

&lt;p&gt;Instead of building a flashy Retrieval-Augmented Generation system, I built a &lt;strong&gt;baseline RAG architecture&lt;/strong&gt; and focused heavily on evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context adherence&lt;/li&gt;
&lt;li&gt;Context precision&lt;/li&gt;
&lt;li&gt;Answer relevance&lt;/li&gt;
&lt;li&gt;Groundedness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post walks through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architecture&lt;/li&gt;
&lt;li&gt;The dataset&lt;/li&gt;
&lt;li&gt;The evaluation framework&lt;/li&gt;
&lt;li&gt;The real failure modes&lt;/li&gt;
&lt;li&gt;And what I’d fix next&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;🧠 &lt;strong&gt;The Goal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Build and evaluate a structured RAG system that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extracts and chunks PDFs&lt;/li&gt;
&lt;li&gt;Creates a vector retrieval layer&lt;/li&gt;
&lt;li&gt;Generates grounded answers&lt;/li&gt;
&lt;li&gt;Evaluates answers using LLM-as-Judge&lt;/li&gt;
&lt;li&gt;Produces measurable metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This was not about "chatbot performance".&lt;/p&gt;

&lt;p&gt;It was about &lt;strong&gt;architectural clarity + measurable quality&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;🏗 &lt;strong&gt;Architecture Overview&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDFs → Chunking → Embeddings → FAISS → Retrieval
Retrieval → Context + Question → gpt-4o-mini → Answer
Answer → LLM-as-Judge → Evaluation Metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stack used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LangChain&lt;/li&gt;
&lt;li&gt;FAISS (locally persisted)&lt;/li&gt;
&lt;li&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/li&gt;
&lt;li&gt;gpt-4o-mini&lt;/li&gt;
&lt;li&gt;Windows local environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simple. Reproducible. Baseline-first.&lt;/p&gt;




&lt;p&gt;📂 &lt;strong&gt;Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I intentionally used complex, table-heavy documents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA 10-K&lt;/td&gt;
&lt;td&gt;Financial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft 10-K&lt;/td&gt;
&lt;td&gt;Financial + Business&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Well-Architected Framework&lt;/td&gt;
&lt;td&gt;Cloud Architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total PDFs: 3&lt;br&gt;
Chunk size: 1000&lt;br&gt;
Overlap: 200&lt;/p&gt;



&lt;p&gt;🪓 &lt;strong&gt;Chunking Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recursive character splitting&lt;/li&gt;
&lt;li&gt;Chunk size: 1000&lt;/li&gt;
&lt;li&gt;Overlap: 200&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 1000?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To reduce embedding cost and maintain context continuity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Precision dropped.&lt;/p&gt;

&lt;p&gt;Financial documents contain large multi-column tables. Large chunks diluted retrieval precision.&lt;/p&gt;

&lt;p&gt;Lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bigger chunks ≠ better RAG.&lt;/p&gt;
&lt;/blockquote&gt;
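
&lt;p&gt;The splitting above can be sketched with a simplified fixed-size splitter. This is only a stand-in for LangChain's RecursiveCharacterTextSplitter, which additionally recurses over separators; the sketch shows just the size and overlap mechanics:&lt;/p&gt;

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Greedy fixed-size splitter with overlap. Each chunk starts
    (chunk_size - overlap) characters after the previous one, so
    neighbouring chunks share an overlap-sized boundary region."""
    step = chunk_size - overlap
    starts = range(0, max(len(text) - overlap, 1), step)
    return [text[i:i + chunk_size] for i in starts]

chunks = split_with_overlap("x" * 2500)
# yields 3 chunks: [0:1000], [800:1800], [1600:2500]
```

&lt;p&gt;With chunk size 1000 and overlap 200, each new chunk starts 800 characters after the previous one, which is exactly why one retrieved chunk can drag in 800+ characters of unrelated table rows.&lt;/p&gt;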



&lt;p&gt;🔎 &lt;strong&gt;Retrieval Layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sentence-transformers/all-MiniLM-L6-v2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chosen because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast&lt;/li&gt;
&lt;li&gt;Strong semantic baseline&lt;/li&gt;
&lt;li&gt;Lightweight for local experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vector store&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAISS (local persistent index)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
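
&lt;p&gt;Under the hood, retrieval is a nearest-neighbour search over embedding vectors. FAISS replaces the brute-force scan below with an optimized (optionally approximate) index, but the idea is the same; a minimal sketch, not the project's actual code:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most similar to the query (linear scan).
    FAISS does this same lookup, just with an optimized index."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]
```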






&lt;p&gt;✨ &lt;strong&gt;Answer Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpt-4o-mini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strictly answer from context&lt;/li&gt;
&lt;li&gt;Avoid hallucination&lt;/li&gt;
&lt;li&gt;Say “I don’t know” if answer absent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This conservative approach reduced hallucination — but introduced new behavior (we’ll get to that).&lt;/p&gt;
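
&lt;p&gt;A template in that spirit might look like this; the wording is illustrative, not the exact prompt used in the project:&lt;/p&gt;

```python
GROUNDED_PROMPT = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

def build_prompt(context, question):
    # The model sees nothing outside the retrieved context, which
    # suppresses hallucination at the cost of occasional refusals.
    return GROUNDED_PROMPT.format(context=context, question=question)
```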




&lt;p&gt;📊 &lt;strong&gt;Evaluation Framework (LLM-as-Judge)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I evaluated 20 questions across documents.&lt;/p&gt;

&lt;p&gt;Each answer was scored on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Context Adherence&lt;/li&gt;
&lt;li&gt;Context Precision&lt;/li&gt;
&lt;li&gt;Answer Relevance&lt;/li&gt;
&lt;li&gt;Groundedness&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This separation is critical.&lt;/p&gt;

&lt;p&gt;Most RAG systems fail because teams don’t know where the failure happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Retrieval?&lt;/li&gt;
&lt;li&gt;Generation?&lt;/li&gt;
&lt;li&gt;Alignment?&lt;/li&gt;
&lt;li&gt;Table parsing?&lt;/li&gt;
&lt;/ol&gt;
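
&lt;p&gt;Aggregating per-question judge scores into report-level metrics can be as small as this sketch (the field names are hypothetical, not the project's actual schema):&lt;/p&gt;

```python
from statistics import mean

def summarize(judged):
    """Average per-question LLM-judge scores into the four reported
    metrics. `judged` is a list of per-question score dicts."""
    metrics = ("adherence", "precision", "relevance", "groundedness")
    return {m: round(mean(q[m] for q in judged), 2) for m in metrics}
```

&lt;p&gt;Keeping the four scores separate is what lets you tell a retrieval failure apart from a generation failure later.&lt;/p&gt;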




&lt;p&gt;📈 &lt;a href="https://github.com/TechTrojan/AdvanceRAG/blob/main/Baseline_Chunking/result.txt" rel="noopener noreferrer"&gt;Results Summary&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From 20 evaluated questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context Adherence: ~76%&lt;/li&gt;
&lt;li&gt;Context Precision: ~0.48 average&lt;/li&gt;
&lt;li&gt;Answer Relevance: ~0.74&lt;/li&gt;
&lt;li&gt;Groundedness: High (except temporal mismatch cases)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall maturity:&lt;br&gt;
&lt;strong&gt;7.5 / 10 Baseline RAG&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;🔎 &lt;strong&gt;What Actually Broke?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where things get interesting.&lt;/p&gt;

&lt;p&gt;1️⃣ &lt;strong&gt;Temporal Misalignment (High Risk)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;The system extracted an operating income value from the wrong fiscal year column.&lt;/p&gt;

&lt;p&gt;The answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Looked correct&lt;/li&gt;
&lt;li&gt;Existed in context&lt;/li&gt;
&lt;li&gt;Was grounded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But belonged to the wrong year.&lt;/p&gt;

&lt;p&gt;This is dangerous.&lt;/p&gt;

&lt;p&gt;Financial tables with multiple years introduce alignment risk that naive RAG systems fail to detect.&lt;/p&gt;
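
&lt;p&gt;A cheap guardrail for this failure mode is a post-hoc check that the years mentioned in the question overlap the years in the supporting snippet. A heuristic sketch, not a complete solution: it catches forms like "fiscal 2023" but would need extra patterns for forms like "FY23":&lt;/p&gt;

```python
import re

YEAR = re.compile(r"(?:19|20)\d{2}")

def years_in(text):
    return set(YEAR.findall(text))

def temporal_mismatch(question, snippet):
    """Flag an answer whose supporting snippet only mentions years
    different from the one(s) the question asks about."""
    q, s = years_in(question), years_in(snippet)
    return bool(q) and bool(s) and not q.intersection(s)
```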




&lt;p&gt;2️⃣ &lt;strong&gt;“I Don’t Know” Even When Context Exists&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several cases where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context contained the answer&lt;/li&gt;
&lt;li&gt;Model still said: “I don’t know”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Likely causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chunk too large&lt;/li&gt;
&lt;li&gt;Table parsing ambiguity&lt;/li&gt;
&lt;li&gt;Conservative prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not hallucination.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;extraction hesitation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;3️⃣ &lt;strong&gt;Low Context Precision&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many correct answers had low precision scores because:&lt;/p&gt;

&lt;p&gt;Chunk size = 1000&lt;br&gt;
Financial tables = noisy&lt;/p&gt;

&lt;p&gt;The answer was present, but buried inside large irrelevant context.&lt;/p&gt;
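
&lt;p&gt;Context precision here is read as the share of retrieved chunks that are actually relevant to the question. An unweighted sketch of the metric (frameworks such as Ragas additionally weight by rank position):&lt;/p&gt;

```python
def context_precision(retrieved_ids, relevant_ids):
    """Unweighted precision: fraction of retrieved chunks judged relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)
```

&lt;p&gt;With 1000-character chunks, even a "hit" chunk is mostly noise, so scores around 0.48 are exactly what this setup predicts.&lt;/p&gt;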




&lt;p&gt;🧠 &lt;strong&gt;Key Insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most RAG failures are not hallucinations.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval precision failures&lt;/li&gt;
&lt;li&gt;Column alignment failures&lt;/li&gt;
&lt;li&gt;Temporal reasoning failures&lt;/li&gt;
&lt;li&gt;Overly conservative generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Evaluation-first design makes these visible.&lt;/p&gt;

&lt;p&gt;Without metrics, you’d never see this.&lt;/p&gt;




&lt;p&gt;🚀 &lt;strong&gt;What I Would Improve&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reduce chunk size to 600–800&lt;/li&gt;
&lt;li&gt;Increase overlap to maintain continuity&lt;/li&gt;
&lt;li&gt;Add year-alignment guardrail in prompt&lt;/li&gt;
&lt;li&gt;Add table-aware extraction logic&lt;/li&gt;
&lt;li&gt;Add reranker (hybrid retrieval or cross-encoder)&lt;/li&gt;
&lt;/ol&gt;
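
&lt;p&gt;Improvement 5 can start smaller than a cross-encoder: blend the dense similarity score with simple lexical overlap and rerank by the blended score. The 0.7 weight below is an illustrative assumption, not a tuned value:&lt;/p&gt;

```python
def hybrid_score(dense_score, query, chunk_text, alpha=0.7):
    """Blend dense similarity with keyword overlap, then rerank by
    this score before passing chunks to the generator."""
    terms = set(query.lower().split())
    text = chunk_text.lower()
    overlap = sum(1 for t in terms if t in text) / max(len(terms), 1)
    return alpha * dense_score + (1 - alpha) * overlap
```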

&lt;p&gt;Baseline RAG works.&lt;/p&gt;

&lt;p&gt;Architected RAG works better.&lt;/p&gt;




&lt;p&gt;🏁 &lt;strong&gt;Why This Project Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There’s a difference between:&lt;/p&gt;

&lt;p&gt;“RAG that answers”&lt;br&gt;
and&lt;br&gt;
“RAG that can be trusted”&lt;/p&gt;

&lt;p&gt;This experiment focused on trust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measuring grounding&lt;/li&gt;
&lt;li&gt;Detecting temporal misalignment&lt;/li&gt;
&lt;li&gt;Identifying precision loss&lt;/li&gt;
&lt;li&gt;Structuring evaluation signals&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;📌 &lt;strong&gt;Final Rating&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Rating&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generation&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grounding&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;⭐⭐⭐☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal Robustness&lt;/td&gt;
&lt;td&gt;⭐⭐☆☆☆&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Baseline: Strong&lt;br&gt;
Production-ready: Not yet&lt;/p&gt;




&lt;p&gt;If you're building RAG systems, I strongly recommend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate retrieval metrics from generation metrics&lt;/li&gt;
&lt;li&gt;Always test on table-heavy documents&lt;/li&gt;
&lt;li&gt;Measure groundedness independently&lt;/li&gt;
&lt;li&gt;Add temporal alignment checks&lt;/li&gt;
&lt;/ul&gt;
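
&lt;p&gt;Measuring groundedness independently can begin with a naive lexical check before an LLM judge is involved: the share of answer sentences whose words all appear in the retrieved context. Purely illustrative; unlike a judge, it cannot handle paraphrase:&lt;/p&gt;

```python
import re

def naive_groundedness(answer, context):
    """Share of answer sentences fully covered, word-for-word,
    by the retrieved context."""
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0

    def covered(sent):
        words = re.findall(r"[a-z0-9]+", sent.lower())
        return all(w in ctx_words for w in words)

    return sum(1 for s in sentences if covered(s)) / len(sentences)
```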

&lt;p&gt;RAG is easy to build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliable RAG is engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/TechTrojan/AdvanceRAG/tree/main/Baseline_Chunking" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;That’s the difference between demo-level AI and production-level AI.&lt;/p&gt;


</description>
      <category>rag</category>
      <category>evaluate</category>
      <category>langchain</category>
      <category>vectordatabase</category>
    </item>
  </channel>
</rss>
