<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parth Roy</title>
    <description>The latest articles on DEV Community by Parth Roy (@parth_roy_a1ec4703407d025).</description>
    <link>https://dev.to/parth_roy_a1ec4703407d025</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1698730%2F46ad2fe8-2710-4a48-806e-1617ce3d7a46.jpg</url>
      <title>DEV Community: Parth Roy</title>
      <link>https://dev.to/parth_roy_a1ec4703407d025</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parth_roy_a1ec4703407d025"/>
    <language>en</language>
    <item>
      <title>RAGEval: Scenario-specific RAG evaluation dataset generation framework</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Wed, 11 Sep 2024 03:55:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/rageval-scenario-specific-rag-evaluation-dataset-generation-framework-4elb</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/rageval-scenario-specific-rag-evaluation-dataset-generation-framework-4elb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Evaluating Retrieval-Augmented Generation (RAG) systems in specialized domains like finance, healthcare, and legal presents unique challenges that existing benchmarks, focused on general question-answering, fail to address. In this blog, we will explore the &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework," which introduces RAGEval. RAGEval offers a solution that automatically generates domain-specific evaluation datasets, reducing manual effort and privacy concerns. By focusing on creating scenario-specific datasets, RAGEval provides a more accurate and reliable assessment of RAG systems in complex, data-sensitive fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of existing benchmarks for RAG
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Focus on general domains:&lt;/strong&gt; Existing RAG benchmarks primarily evaluate factual correctness in general question-answering tasks, which may not accurately reflect the performance of RAG systems in specialized or vertical domains like finance, healthcare, and legal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual data curation:&lt;/strong&gt; One limitation is that evaluating or benchmarking RAG systems requires manually curating a dataset with input queries and expected outputs (golden answers). This necessity arises because domain-specific benchmarks are not publicly available due to concerns over safety and data privacy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data leakage:&lt;/strong&gt; Challenges in evaluating RAG systems include data leakage from traditional benchmarks such as &lt;a href="https://hotpotqa.github.io/" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/mandarjoshi/trivia_qa" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS MARCO&lt;/a&gt;, &lt;a href="https://ai.google.com/research/NaturalQuestions/" rel="noopener noreferrer"&gt;Natural Questions&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/2011.01060" rel="noopener noreferrer"&gt;2WikiMultiHopQA&lt;/a&gt;, and &lt;a href="https://ai.meta.com/tools/kilt/" rel="noopener noreferrer"&gt;KILT&lt;/a&gt;. Data leakage occurs when information from the answers inadvertently appears in the training data, allowing systems to achieve inflated performance metrics by memorizing rather than genuinely understanding and retrieving information.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAGEval?
&lt;/h2&gt;

&lt;p&gt;RAGEval is a framework that automatically creates evaluation datasets for assessing RAG systems. It generates a schema from seed documents, applies it to create diverse documents, and constructs question-answering pairs based on these documents and configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zc6t9k9h8pxugvqkk5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zc6t9k9h8pxugvqkk5p.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges addressed by RAGEval
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;RAGEval automates dataset creation by summarizing schemas and generating varied documents, reducing manual effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By using data derived from seed documents and specific schemas, RAGEval ensures consistent evaluation and minimizes biases and privacy concerns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Additionally, RAGEval tackles the issue of general domain focus by creating specialized datasets for vertical domains like finance, healthcare, and legal, which are often overlooked in existing benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡&lt;strong&gt;Are RAGEval and synthetic data generation the same?&lt;/strong&gt; RAGEval and synthetic data generation both create datasets for models, but with different goals. RAGEval generates evaluation datasets by deriving schemas from real documents and creating question-answering pairs to assess RAG systems. In contrast, synthetic data generation produces artificial, varied data to support model training and testing across various applications. RAGEval focuses on structured evaluation, while synthetic data generation emphasizes diverse, fictional data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Components of RAGEval
&lt;/h2&gt;

&lt;p&gt;Building a closed-domain RAG evaluation dataset presents two major challenges: the high cost of collecting and annotating sensitive vertical-domain documents, and the complexity of evaluating the detailed, comprehensive answers typical of vertical domains. To tackle these issues, RAGEval uses a “schema-configuration-document-QRA-keypoint” pipeline. This approach emphasizes factual information and improves the accuracy and reliability of answer evaluation. The following sub-sections detail each component of this pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Schema summary
&lt;/h3&gt;

&lt;p&gt;In domain-specific scenarios, texts follow a common knowledge framework, represented by schema &lt;strong&gt;S&lt;/strong&gt;, which captures the essential factual information such as organization, type, events, date, and place. This schema is derived by using LLMs to analyze a small set of seed texts, even if they differ in style and content. For example, financial reports can cover various industries. This method ensures the schema's validity and comprehensiveness, improving text generation control and producing coherent, domain-relevant content.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From seed financial reports, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Report A: "Tech Innovations Inc. in San Francisco released its annual report on March 15, 2023."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report B: "Green Energy Ltd. in Austin published its quarterly report on June 30, 2022."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema captures essential elements like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Company name&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report type (annual, quarterly)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Location&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using this schema, new content can be generated, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  "Global Tech Solutions in New York announced its quarterly earnings on July 20, 2024."&lt;/li&gt;
&lt;/ul&gt;
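&lt;p&gt;As a minimal sketch, the schema-to-document idea above can be expressed in Python. The field names, company details, and the &lt;code&gt;render&lt;/code&gt; helper are invented for illustration; the actual framework fills configurations with LLM prompts rather than a fixed template.&lt;/p&gt;

```python
# Hypothetical schema distilled from the seed reports: each field is a
# slot of factual information the generated documents must fill.
SCHEMA = ["company", "location", "report_type", "event", "date"]

# A configuration is one concrete assignment of values to the schema.
config = {
    "company": "Global Tech Solutions",
    "location": "New York",
    "report_type": "quarterly",
    "event": "announced its quarterly earnings",
    "date": "July 20, 2024",
}

def render(config):
    # Render one document sentence in the style of the seed reports.
    return "{company} in {location} {event} on {date}.".format(**config)

assert all(field in config for field in SCHEMA)
print(render(config))
```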

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ncyae3fbr6g56imlro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ncyae3fbr6g56imlro.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Document generation
&lt;/h3&gt;

&lt;p&gt;To create effective evaluation datasets, RAGEval first generates configurations derived from the schema, ensuring consistency and coherence in the virtual texts. It uses a hybrid approach: rule-based methods for accurate, structured data such as dates and categories, and LLMs for more complex, nuanced content. For instance, in financial reports, the configurations span 20 business domains, covering sectors such as “agriculture” and “aviation.” The configurations are integrated into structured documents, such as medical records or legal texts, following domain-specific guidelines. For financial documents, content is divided into sections (e.g., “Financial Report,” “Corporate Governance”) to ensure coherent and relevant outputs.&lt;/p&gt;
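&lt;p&gt;The hybrid rule-based/LLM split might look like the following sketch. The sector list, field names, and the &lt;code&gt;fake_llm&lt;/code&gt; stand-in are all assumptions for illustration; a real pipeline would call an actual model for the nuanced fields.&lt;/p&gt;

```python
import random

SECTORS = ["agriculture", "aviation"]  # two of the paper's 20 business domains

def fake_llm(prompt):
    # Stand-in for a real LLM call that would generate nuanced content.
    return "The company expanded its fleet and reported steady growth."

def generate_config(seed=0):
    rng = random.Random(seed)
    config = {
        # Rule-based fields: structured values such as categories and dates.
        "sector": rng.choice(SECTORS),
        "report_date": "2024-0{}-15".format(rng.randint(1, 9)),
    }
    # LLM-generated field: complex narrative content.
    config["business_summary"] = fake_llm(
        "One-sentence business summary for a {} company.".format(config["sector"]))
    return config

cfg = generate_config()
print(sorted(cfg))
```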

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ezqnjslw5k8jlyaq0f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ezqnjslw5k8jlyaq0f4.png" alt="Image description" width="800" height="543"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: QRA generation
&lt;/h3&gt;

&lt;p&gt;QRA Generation involves creating Question-Reference-Answer (QRA) triples from given documents (&lt;strong&gt;D&lt;/strong&gt;) and configurations (&lt;strong&gt;C&lt;/strong&gt;). This process is designed to establish a comprehensive evaluation framework for testing information retrieval and reasoning capabilities. It includes four key steps: first, formulating questions (&lt;strong&gt;Q&lt;/strong&gt;) based on the documents; second, extracting relevant information fragments (&lt;strong&gt;R&lt;/strong&gt;) from the documents to support the answers; third, generating initial answers (&lt;strong&gt;A&lt;/strong&gt;) to the questions using the extracted references; and fourth, optimizing the answers and references to ensure accuracy and alignment, addressing any discrepancies or irrelevant content.&lt;/p&gt;
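&lt;p&gt;The four steps can be sketched as a small pipeline. The &lt;code&gt;fake_llm&lt;/code&gt; function, the naive sentence split, and the example document are assumptions for illustration; the paper drives each step with GPT-4o prompts.&lt;/p&gt;

```python
def fake_llm(prompt):
    # Stand-in for the GPT-4o calls used in the paper; returns canned text.
    return "Tech Innovations Inc. released its annual report on March 15, 2023"

def generate_qra(document, config):
    # Step 1: formulate a question (Q) from the document and configuration.
    question = fake_llm("Write a {} question about: {}".format(
        config["question_type"], document))
    # Step 2: extract supporting reference fragments (R) from the document
    # (naive sentence split here; the real prompt selects relevant spans).
    references = [s.strip() for s in document.split(".") if s.strip()]
    # Step 3: draft an initial answer (A) grounded only in the references.
    answer = fake_llm("Answer using only: " + " | ".join(references))
    # Step 4: optimize A and R so every claim in A is backed by R.
    return {"question": question, "references": references, "answer": answer}

doc = "Tech Innovations Inc. in San Francisco released its annual report on March 15, 2023."
qra = generate_qra(doc, {"question_type": "factual"})
print(sorted(qra))
```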

&lt;p&gt;&lt;strong&gt;Utilizing configurations for question and initial answer generation&lt;/strong&gt; involves using specific configurations (&lt;strong&gt;C&lt;/strong&gt;) to guide the creation of questions and initial answers. These configurations are embedded in prompts to ensure that generated questions are precise and relevant, and that the answers are accurate. The configurations help generate a diverse set of question types, including factual, multi-hop reasoning, summarization, etc. The GPT-4o model produces targeted and accurate questions (&lt;strong&gt;Q&lt;/strong&gt;) and answers (&lt;strong&gt;A&lt;/strong&gt;) by including detailed instructions and examples for each question type. The approach aims to evaluate different facets of language understanding and information processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8ni23lo5nen6lavm43f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8ni23lo5nen6lavm43f.png" alt="Image description" width="800" height="576"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting references:&lt;/strong&gt; given the constructed questions (&lt;strong&gt;Q&lt;/strong&gt;) and initial answers (&lt;strong&gt;A&lt;/strong&gt;), the process extracts pertinent information fragments (references, &lt;strong&gt;R&lt;/strong&gt;) from the articles using a tailored extraction prompt. This prompt emphasizes grounding answers in the source material to ensure their reliability and traceability. Applying specific constraints and rules during the extraction phase ensures that the references are directly relevant to and supportive of the answers, resulting in more precise and comprehensive QRA triples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizing answers and references&lt;/strong&gt; involves refining answer &lt;strong&gt;A&lt;/strong&gt; to ensure accuracy and alignment with the provided references &lt;strong&gt;R&lt;/strong&gt;. If &lt;strong&gt;R&lt;/strong&gt; contains information not present in &lt;strong&gt;A&lt;/strong&gt;, the answers are supplemented accordingly. Conversely, if &lt;strong&gt;A&lt;/strong&gt; includes content not found in &lt;strong&gt;R&lt;/strong&gt;, the article is checked for overlooked references. If additional references are found, they are added to &lt;strong&gt;R&lt;/strong&gt; while keeping &lt;strong&gt;A&lt;/strong&gt; unchanged. If no corresponding references are found, irrelevant content is removed from &lt;strong&gt;A&lt;/strong&gt;. This approach helps address hallucinations in the answer-generation process, ensuring that the final answers are accurate and well-supported by &lt;strong&gt;R&lt;/strong&gt;.&lt;/p&gt;
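&lt;p&gt;Treating answers, references, and the article as sets of facts, the optimization rules above can be sketched as follows. The fact strings and the exact-match comparison are simplifications; the paper performs this reconciliation with LLM prompts over free text.&lt;/p&gt;

```python
def optimize(answer_facts, reference_facts, article_facts):
    # Supplement: facts present in R but missing from A are added to A.
    answer = list(answer_facts)
    refs = list(reference_facts)
    for fact in refs:
        if fact not in answer:
            answer.append(fact)
    # For facts in A but not in R, re-check the article for an overlooked
    # reference; if one is found, add it to R, otherwise drop the fact
    # from A as unsupported content (a likely hallucination).
    kept = []
    for fact in answer:
        if fact in refs:
            kept.append(fact)
        elif fact in article_facts:
            refs.append(fact)
            kept.append(fact)
    return kept, refs

A = ["revenue grew 10%", "the CEO resigned"]   # second claim is unsupported
R = ["revenue grew 10%", "HQ is in Austin"]    # second fact missing from A
article = ["revenue grew 10%", "HQ is in Austin"]
new_A, new_R = optimize(A, R, article)
print(new_A)
```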

&lt;p&gt;&lt;strong&gt;Generating key points&lt;/strong&gt; focuses on identifying critical information in answers rather than just correctness or keyword matching. Key points are extracted from standard answers (&lt;strong&gt;A&lt;/strong&gt;) for each question (&lt;strong&gt;Q&lt;/strong&gt;) using a predefined prompt with the GPT-4o model. This prompt, supporting both Chinese and English, uses in-context learning and examples to guide key point extraction across various domains and question types, including unanswerable ones. Typically, 3-5 key points are distilled from responses, capturing essential facts, relevant inferences, and conclusions. This method ensures that the evaluation is based on relevant and precise information, enhancing the reliability of subsequent metrics.&lt;/p&gt;
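&lt;p&gt;A toy version of key-point-based scoring is sketched below. The sentence split and substring matching are stand-ins: the paper extracts key points with a GPT-4o prompt and judges coverage with an LLM, not string comparison.&lt;/p&gt;

```python
def extract_key_points(standard_answer):
    # Stand-in for the GPT-4o key-point prompt, which distills 3-5 key
    # facts, inferences, and conclusions; here we just split sentences.
    points = [s.strip() for s in standard_answer.split(".") if s.strip()]
    return points[:5]

def key_point_coverage(model_answer, key_points):
    # Fraction of key points found in the model answer. Real evaluation
    # uses an LLM judge rather than naive substring matching.
    hit = sum(1 for p in key_points if p.lower() in model_answer.lower())
    return hit / max(len(key_points), 1)

gold = "Revenue grew 10 percent. The growth was driven by cloud services."
points = extract_key_points(gold)
score = key_point_coverage("Last year revenue grew 10 percent.", points)
print(score)  # 0.5: one of the two key points is covered
```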

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsy1obnxceg7nmvgd326.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsy1obnxceg7nmvgd326.png" alt="Image description" width="800" height="173"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality assessment of RAGEval
&lt;/h2&gt;

&lt;p&gt;In this section, the authors of the &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;paper&lt;/a&gt; "RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework" introduce the human verification process used to assess the quality of the generated dataset and the evaluation within the RAGEval framework. This assessment is divided into three main tasks: evaluating the quality of the QRA triples, evaluating the quality of the generated documents, and validating the automated evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QRA quality assessment&lt;/strong&gt; involves having annotators evaluate the correctness of the Question-Reference-Answer (QRA) triples generated under various configurations. Annotators score each QRA on a scale from 5 (completely correct and fluent) down to 0 (irrelevant or completely incorrect).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3pwdk3zauzbhyy0uncf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3pwdk3zauzbhyy0uncf.png" alt="Image description" width="578" height="280"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For annotation, 10 samples per question type were randomly selected for each language and domain, totaling 420 samples. Annotators were provided with the document, question, question type, generated response, and references.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizvl3x5gbqg02gv8li04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizvl3x5gbqg02gv8li04.png" alt="Image description" width="456" height="166"&gt;&lt;/a&gt;&lt;br&gt;
CN- Chinese, EN- English&lt;/p&gt;

&lt;p&gt;Results show that QRA quality scores are consistently high across domains, with only slight variations between languages. The combined proportion of scores 4 and 5 is approximately 95% or higher across all domains, indicating that the approach upholds a high standard of accuracy and fluency in the QRAs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated document quality assessment&lt;/strong&gt; involves comparing documents generated using RAGEval with those produced by baseline methods, including zero-shot and one-shot prompting. For each domain (finance, legal, and medical), 19 or 20 documents were randomly selected and grouped with two baseline documents for comparison. Annotators ranked the documents based on clarity, safety, richness, and conformity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vk87zrmqsyjxvg18g7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vk87zrmqsyjxvg18g7b.png" alt="Image description" width="566" height="232"&gt;&lt;/a&gt;&lt;br&gt;
Document quality comparison criteria. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results indicate that RAGEval consistently outperforms both baseline methods, particularly excelling in safety, clarity, and richness. For Chinese and English datasets, RAGEval ranked highest in over 85% of cases for richness, clarity, and safety, demonstrating its effectiveness in generating high-quality documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fcdihw83xamni6xmns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fcdihw83xamni6xmns.png" alt="Image description" width="800" height="604"&gt;&lt;/a&gt;&lt;br&gt;
Document generation comparison by domain. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation of automated evaluation&lt;/strong&gt; involves comparing LLM-reported metrics for completeness, hallucination, and irrelevance with human assessments. Using the same 420 examples from the QRA quality assessment, human annotators evaluated answers from Baichuan-2-7B-chat, and these results were compared with the LLM metrics. Figure 6 shows that machine and human evaluations align closely, with absolute differences under 0.015, validating the reliability and consistency of the automated evaluation metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uujz6wjp2p46agnph5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uujz6wjp2p46agnph5x.png" alt="Image description" width="754" height="564"&gt;&lt;/a&gt;&lt;br&gt;
Automated metric validation results. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, RAGEval represents a significant advancement in evaluating Retrieval-Augmented Generation (RAG) systems by automating the creation of scenario-specific datasets that emphasize factual accuracy and domain relevance. This framework addresses the limitations of existing benchmarks, particularly in sectors requiring detailed and accurate information such as finance, healthcare, and legal fields. The human evaluation results demonstrate the robustness and effectiveness of RAGEval in generating content that is accurate, safe, and rich. Furthermore, the alignment between automated metrics and human judgment validates the reliability of the evaluation approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/uX3Dp" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/uX3Dp" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>openai</category>
    </item>
    <item>
      <title>Graph RAG</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Tue, 10 Sep 2024 04:00:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/graph-rag-5p7</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/graph-rag-5p7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This blog explores &lt;a href="https://arxiv.org/pdf/2404.16130" rel="noopener noreferrer"&gt;Microsoft's Graph-based Retrieval-Augmented Generation (Graph RAG)&lt;/a&gt; approach. While traditional RAG excels at retrieving specific information, it struggles with global queries, like identifying key themes in a dataset, which require query-focused summarization (QFS). Graph RAG combines the strengths of RAG and QFS by using entity knowledge graphs and community summaries to handle both broad questions and large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the challenges with RAG?
&lt;/h2&gt;

&lt;p&gt;The primary challenges with RAG are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global question handling&lt;/strong&gt;: RAG struggles with global questions that require understanding the entire text corpus, such as "What are the main themes in the dataset?". These questions require query-focused summarization (QFS), which differs from RAG's typical focus on retrieving and generating content from specific, localized text regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QFS&lt;/strong&gt;: Traditional QFS methods, which summarize content based on specific queries, do not scale well to the large volumes of text typically indexed by RAG systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context window limitations&lt;/strong&gt;: While modern Large Language Models (LLMs) like GPT, Llama, and Gemini can perform in-context learning to summarize content, they are limited by the size of their context windows. When dealing with large text corpora, crucial information can be "lost in the middle" of these longer contexts, making it challenging to provide comprehensive summaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inadequacy of direct retrieval&lt;/strong&gt;: For QFS tasks, directly retrieving text chunks in a naive RAG system is often inadequate. RAG's standard approach is not well-suited for summarizing entire datasets or handling global questions, which requires a more sophisticated indexing and retrieval mechanism tailored to global summarization needs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Graph RAG?
&lt;/h2&gt;

&lt;p&gt;Graph RAG is an advanced question-answering method that combines RAG with a graph-based text index. It builds an entity knowledge graph from source documents and generates summaries for related entities. For a given question, partial responses from these summaries are combined into a comprehensive final answer. This approach scales well with large datasets and provides more comprehensive answers than traditional RAG methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph RAG approach &amp;amp; pipeline
&lt;/h3&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2404.16130" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2404.16130&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Graph RAG pipeline uses an LLM-derived graph index to organize source document text into nodes (like entities), edges (relationships), and covariates (claims). These elements are detected, extracted, and summarized using LLM prompts customized for the dataset. The graph index is partitioned into groups of related elements through community detection. Summaries for each group are generated in parallel both during indexing and when a query is made. To answer a query, a final query-focused summarization is performed over all relevant community summaries, producing a comprehensive "global answer."&lt;/p&gt;
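&lt;p&gt;The query-time map-reduce over community summaries can be sketched as below. The &lt;code&gt;fake_llm&lt;/code&gt; function and the example summaries are stand-ins for illustration; Graph RAG issues real LLM calls for each partial answer and for the final combination.&lt;/p&gt;

```python
def fake_llm(prompt):
    # Stand-in for the LLM calls in the Graph RAG pipeline.
    return "summary: " + prompt[:40]

def graph_rag_answer(community_summaries, query):
    # Map: each relevant community summary yields a partial answer.
    partials = [fake_llm("Answer {} using: {}".format(query, s))
                for s in community_summaries]
    # Reduce: combine the partial answers into one global answer.
    return fake_llm("Combine into a global answer: " + " | ".join(partials))

summaries = ["Community about renewable energy firms",
             "Community about chip makers"]
answer = graph_rag_answer(summaries, "What are the main themes?")
print(answer)
```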

&lt;p&gt;&lt;strong&gt;Step 1: source documents → text chunks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this stage, texts from source documents are split into chunks for processing. Each chunk is then passed to LLM prompts to extract elements for a graph index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4teahm6fehmf8mi9jva3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4teahm6fehmf8mi9jva3.png" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;br&gt;
While longer chunks reduce the number of LLM calls needed to process the entire document because more text is handled during each call, they can impair the LLM’s ability to recall details due to an extended context window. For example, in the HotPotQA dataset, a 600-token chunk extracted nearly twice as many entity references as a 2400-token chunk. Thus, balancing recall and precision is crucial.&lt;/p&gt;
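&lt;p&gt;A minimal chunking helper makes the trade-off concrete: fewer, longer chunks mean fewer LLM calls, while shorter chunks preserve recall. The token list here is a stand-in for a real tokenizer's output.&lt;/p&gt;

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    # Split a token sequence into fixed-size chunks. Smaller chunks cost
    # more LLM calls but recall more entity references (as in the
    # 600- vs 2400-token HotPotQA comparison above).
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

tokens = list(range(2400))               # stand-in for a tokenized document
print(len(chunk_tokens(tokens, 600)))    # 4 chunks of 600 tokens
print(len(chunk_tokens(tokens, 2400)))   # 1 chunk, i.e. a single LLM call
```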

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lkrcsebmh1s806y7vmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lkrcsebmh1s806y7vmn.png" alt="Image description" width="800" height="272"&gt;&lt;/a&gt;&lt;br&gt;
The number of entity references detected in the HotPotQA dataset varies with chunk size and the number of gleanings when employing a generic entity extraction prompt with GPT-4-turbo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: text chunks → element instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, graph nodes and edges are identified and extracted from each chunk of source text using a multipart LLM prompt. The prompt first extracts all entities, including their name, type, and description, and then identifies relationships between these entities, detailing the source, target, and nature of each relationship. Both entities and relationships are output as a list of delimited tuples.&lt;/p&gt;

&lt;p&gt;The prompt can be tailored to the document corpus by including domain-specific few-shot examples, improving extraction accuracy for specialized fields like science or medicine. Additionally, a secondary prompt extracts covariates, such as claims related to the entities, including details like subject, object, type, description, source text span, and dates.&lt;/p&gt;
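&lt;p&gt;Parsing the delimited tuples back into graph elements might look like the sketch below. The "|"-delimited layout, field order, and example records are assumptions for illustration; the real extraction prompt defines its own delimiters and record format.&lt;/p&gt;

```python
def parse_elements(llm_output):
    # Parse delimited tuples emitted by the extraction prompt into
    # entity and relationship records.
    entities, relationships = [], []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts[0] == "entity":
            entities.append({"name": parts[1], "type": parts[2],
                             "description": parts[3]})
        elif parts[0] == "relationship":
            relationships.append({"source": parts[1], "target": parts[2],
                                  "description": parts[3]})
    return entities, relationships

raw = """entity|NeoChem|ORGANIZATION|A chemicals company based in Berlin
entity|Berlin|GEO|Capital of Germany
relationship|NeoChem|Berlin|NeoChem is headquartered in Berlin"""
ents, rels = parse_elements(raw)
print(len(ents), len(rels))  # 2 1
```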

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F400i1k2rlmyykkudw36o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F400i1k2rlmyykkudw36o.png" alt="Image description" width="794" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: element instances → element summaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, an LLM is used to perform abstractive summarization, creating meaningful summaries of entities, relationships, and claims from source texts. The LLM extracts these elements and summarizes them into single blocks of descriptive text for each graph element: entity nodes, relationship edges, and claim covariates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r38vxqeu7287710qps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r38vxqeu7287710qps.png" alt="Image description" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, a challenge here is that the LLM might extract and describe the same entity in different formats, potentially leading to duplicate nodes in the entity graph. To address this, the process includes a subsequent step where related groups of entities (or "communities") are detected and summarized together. This helps to consolidate variations and ensures that the entity graph remains consistent, as the LLM can recognize and connect different names or descriptions of the same entity. Overall, the approach leverages LLM capabilities to handle variations and produce a comprehensive, coherent graph structure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Imagine the LLM is tasked with extracting information about a well-known historical figure, "Albert Einstein," from various texts. The LLM might encounter different ways of referring to Einstein, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;"Albert Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Dr. Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"The physicist Albert Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the extraction process, the LLM may generate separate entity nodes for each of these variations, resulting in duplicate entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Node 1: Albert Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 2: Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 3: Dr. Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 4: The physicist Albert Einstein&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These variations could lead to multiple nodes for the same person in the entity graph, causing redundancy and inconsistencies.&lt;/p&gt;

&lt;p&gt;In the subsequent step, the LLM summarizes and consolidates these related variations into a single node by detecting and linking all references to the same entity. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Node: Albert Einstein (consolidates all variations: Einstein, Dr. Einstein, The physicist Albert Einstein)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By summarizing all related variations into a single descriptive block, the process ensures that the entity graph is accurate and does not contain duplicate nodes for the same individual.&lt;/p&gt;
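&lt;p&gt;The consolidation illustrated above can be sketched in a few lines of Python. This is a toy illustration with a hand-written alias map; in practice the LLM itself recognizes and merges variant mentions during summarization, so the &lt;code&gt;aliases&lt;/code&gt; dictionary and helper names here are purely hypothetical.&lt;/p&gt;

```python
def canonicalize(mention, aliases):
    """Map a raw mention to its canonical entity name, if known."""
    return aliases.get(mention.lower().strip(), mention)

def consolidate(mentions, aliases):
    """Group raw mentions under one canonical node each."""
    nodes = {}
    for m in mentions:
        nodes.setdefault(canonicalize(m, aliases), []).append(m)
    return nodes

aliases = {
    "albert einstein": "Albert Einstein",
    "einstein": "Albert Einstein",
    "dr. einstein": "Albert Einstein",
    "the physicist albert einstein": "Albert Einstein",
}
mentions = ["Albert Einstein", "Einstein", "Dr. Einstein",
            "The physicist Albert Einstein"]
nodes = consolidate(mentions, aliases)
print(nodes)  # a single node keyed "Albert Einstein" with four variations
```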

&lt;p&gt;&lt;strong&gt;Step 4: element summaries → graph communities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, the index is modeled as a weighted undirected graph where entity nodes are linked by relationship edges, with edge weights reflecting the frequency of relationships. The graph is then partitioned into communities using the &lt;a href="https://arxiv.org/pdf/1810.08473" rel="noopener noreferrer"&gt;Leiden algorithm&lt;/a&gt;, which efficiently identifies hierarchical community structures. This hierarchical division of the graph enables targeted and comprehensive global summarization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknftoil5g39dwxa8r8rg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknftoil5g39dwxa8r8rg.png" alt="Image description" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡The Leiden algorithm is a community detection method in network analysis that improves upon the Louvain algorithm by addressing its shortcomings in optimizing modularity, which measures the quality of community structures. It operates through a multi-level approach, incrementally refining clusters from fine to coarse, and guarantees that all communities are well-connected, unlike Louvain's potential for disconnected clusters. The algorithm iteratively partitions and refines clusters until no further improvement is possible, ensuring higher modularity. Despite its enhanced accuracy, the Leiden algorithm remains computationally efficient and can be applied to both weighted and unweighted networks.&lt;/p&gt;
&lt;/blockquote&gt;
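&lt;p&gt;To make the community step concrete, here is a deliberately crude stand-in that only splits the graph into connected components. The Leiden algorithm (available via packages such as &lt;code&gt;leidenalg&lt;/code&gt;) goes much further, hierarchically optimizing modularity within each component; this sketch only illustrates the data flow from a weighted edge list to node groups.&lt;/p&gt;

```python
from collections import defaultdict

def communities(edges):
    """Group the nodes of a weighted, undirected edge list into connected
    components -- the coarsest possible notion of a 'community'."""
    adj = defaultdict(set)
    for u, v, _weight in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, out = set(), []
    for start in sorted(adj):  # deterministic iteration order
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        out.append(comp)
    return out

edges = [("Einstein", "Relativity", 3), ("Einstein", "Princeton", 1),
         ("Curie", "Radium", 2)]
comps = communities(edges)
print(comps)  # two separate components
```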

&lt;p&gt;&lt;strong&gt;Step 5: graph communities → community summaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, report-like summaries are created for each community detected in the graph. These summaries help understand the global structure and details of the dataset and are useful for answering global queries. Here's how it's done:&lt;/p&gt;

&lt;p&gt;Leaf-level communities: Summarize the details of the smallest communities first, prioritizing descriptions of nodes, edges, and related information. These summaries are added to the LLM context window until the token limit is reached.&lt;/p&gt;

&lt;p&gt;Higher-level communities: If there’s room in the context window, summarize all elements of these larger communities. If not, prioritize summaries of sub-communities over detailed element descriptions, fitting them into the context window by substituting longer descriptions with shorter summaries.&lt;/p&gt;

&lt;p&gt;This approach ensures that both detailed and high-level summaries are generated efficiently, fitting within the LLM's token limits while maintaining comprehensive coverage of the dataset.&lt;/p&gt;
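&lt;p&gt;The prioritization described above amounts to a budget-packing loop. The sketch below uses a crude whitespace token count and hypothetical helper names, not the paper's implementation: for each item, take the detailed description while it fits, otherwise fall back to the shorter summary.&lt;/p&gt;

```python
def rough_tokens(text):
    """Crude whitespace token count; real systems use a model tokenizer."""
    return len(text.split())

def pack_context(items, budget):
    """items: (detailed, short) description pairs in priority order.
    Prefer the detailed version while it fits, else the short one."""
    picked, used = [], 0
    for detailed, short in items:
        for candidate in (detailed, short):
            cost = rough_tokens(candidate)
            if used + cost <= budget:
                picked.append(candidate)
                used += cost
                break
    return picked

items = [("a b c d", "a b"), ("e f g h", "e f"), ("i j k l", "i j")]
print(pack_context(items, budget=10))  # ['a b c d', 'e f g h', 'i j']
```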

&lt;p&gt;&lt;strong&gt;Step 6: community summaries → community answers → global answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Given a user query, the community summaries created in the previous step are used to produce a final answer through a multi-stage process. The hierarchical structure of the communities allows the system to select the most appropriate level of detail for answering different types of questions. The process for generating a global answer at a specific community level is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prepare community summaries: The community summaries are shuffled and split into chunks of a predefined token size. This approach helps ensure that relevant information is spread across multiple chunks, reducing the risk of losing important details in a single context window.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate intermediate answers: For each chunk, the LLM generates intermediate answers in parallel. Along with the answer, the LLM assigns a helpfulness score between 0-100, indicating how well the answer addresses the user’s query. Any answers scoring 0 are discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compile the global answer: The remaining intermediate answers are sorted in descending order based on their helpfulness scores. These answers are then added to a new context window, one by one, until the token limit is reached. This final compilation is used to generate the global answer provided to the user.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
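&lt;p&gt;The three stages can be sketched end-to-end with a stubbed-out LLM call. &lt;code&gt;fake_llm_answer&lt;/code&gt; and its scoring rule are placeholders; in the paper the LLM itself returns both the intermediate answer and the 0-100 helpfulness score.&lt;/p&gt;

```python
def fake_llm_answer(chunk, query):
    """Stand-in for an LLM call: returns (answer, helpfulness score 0-100)."""
    score = 100 if query.lower() in chunk.lower() else 0
    return (f"Based on: {chunk}", score)

def global_answer(chunks, query, budget):
    # Map: generate intermediate answers (parallel in practice).
    scored = [fake_llm_answer(c, query) for c in chunks]
    # Filter out zero-scoring answers, sort best-first.
    scored = sorted((s for s in scored if s[1] > 0),
                    key=lambda s: s[1], reverse=True)
    # Reduce: fill a fresh context window up to the token budget.
    context, used = [], 0
    for answer, _score in scored:
        cost = len(answer.split())
        if used + cost > budget:
            break
        context.append(answer)
        used += cost
    return context

ctx = global_answer(["Einstein developed relativity",
                     "Weather today is sunny"], "relativity", budget=20)
print(ctx)  # only the relevant chunk's answer survives
```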

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bo4jp033yb207i6tr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bo4jp033yb207i6tr7.png" alt="Image description" width="800" height="243"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Performance evaluation
&lt;/h2&gt;

&lt;p&gt;The authors evaluated the performance of both Graph-based Retrieval-Augmented Generation (Graph RAG) and standard RAG approaches using two datasets, each containing approximately one million tokens, equivalent to around 10 novels of text. The evaluation was conducted across four different metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datasets
&lt;/h3&gt;

&lt;p&gt;The datasets were chosen to reflect the types of text corpora users might typically encounter in real-world applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Podcast transcripts&lt;/strong&gt;: This dataset includes transcripts from the "&lt;a href="https://www.microsoft.com/en-us/behind-the-tech" rel="noopener noreferrer"&gt;Behind the Tech&lt;/a&gt;" podcast, where Kevin Scott, Microsoft's CTO, converses with other technology leaders. The dataset comprises 1,669 text chunks, each containing 600 tokens, with a 100-token overlap between chunks, resulting in approximately 1 million tokens overall.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;News articles&lt;/strong&gt;: The second dataset is a benchmark collection of news articles published between September 2013 and December 2023. It covers a range of categories, including entertainment, business, sports, technology, health, and science. This dataset consists of 3,197 text chunks, each containing 600 tokens with a 100-token overlap, totaling around 1.7 million tokens.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conditions
&lt;/h3&gt;

&lt;p&gt;The authors compared six different evaluation conditions to assess the performance of the Graph-based RAG system. These conditions included four levels of graph communities (C0, C1, C2, C3), a text summarization method (TS), and a naive "semantic search" RAG approach (SS). Each condition represented a distinct method for creating the context window used to answer queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C0&lt;/strong&gt;: Utilized summaries from root-level communities, which were the fewest in number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C1&lt;/strong&gt;: Employed high-level summaries derived from sub-communities of C0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C2&lt;/strong&gt;: Applied intermediate-level summaries from sub-communities of C1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C3&lt;/strong&gt;: Low-level summaries from sub-communities of C2 were used, which were the most numerous.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TS&lt;/strong&gt;: Implemented a map-reduce summarization method directly on the source texts, shuffling and chunking them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SS&lt;/strong&gt;: Employed a naive RAG approach where text chunks were retrieved and added to the context window until the token limit was reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All conditions used the same context window size and prompts, with differences only in how the context was constructed. The graph index supporting the C0-C3 conditions was generated using prompts designed for entity and relationship extraction, with modifications made to align with the specific domain of the data. The indexing process used a 600-token context window, with varying numbers of "gleanings" (passes over the text) depending on the dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Map-reduce summarization condenses large texts by splitting them into smaller chunks, summarizing each chunk independently ("map" phase), and then combining these summaries into a cohesive final summary ("reduce" phase). This technique efficiently handles large datasets, ensuring the final summary captures the key points from the entire text.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;The evaluation employs an LLM to conduct head-to-head comparisons of generated answers based on specific metrics. Given the multi-stage nature of the Graph RAG mechanism, the multiple conditions the authors wanted to compare, and the lack of gold-standard answers for activity-based sensemaking questions, the authors adopted a head-to-head comparison approach using an LLM evaluator. This approach assesses system performance through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensiveness:&lt;/strong&gt; This metric measures the extent to which the answer provides detailed coverage of all aspects of the question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diversity:&lt;/strong&gt; This evaluates the richness and variety of perspectives and insights presented in the answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empowerment:&lt;/strong&gt; This assesses how effectively the answer aids the reader in understanding the topic and making informed judgments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Directness:&lt;/strong&gt; This measures the specificity and clarity with which the answer addresses the question.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb8hhq2vpbopuu37zrc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb8hhq2vpbopuu37zrc6.png" alt="Image description" width="800" height="1096"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;p&gt;Example question for the News article dataset, with generated answers from Graph RAG (C2) and Naive RAG, as well as LLM-generated assessments&lt;/p&gt;

&lt;p&gt;Since directness often conflicts with comprehensiveness and diversity, it is unlikely for any method to excel across all metrics. The LLM evaluator compares pairs of answers based on these metrics, determining a winner or a tie if differences are negligible. Each comparison is repeated five times to account for the stochastic nature of LLMs, with mean scores used for final assessments.&lt;/p&gt;
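&lt;p&gt;The repeat-and-average protocol is straightforward to sketch. The &lt;code&gt;judge&lt;/code&gt; function below is a hypothetical stand-in for the LLM evaluator; the point is only that each pairwise comparison is run several times and the mean decides the winner.&lt;/p&gt;

```python
import random
from statistics import mean

def judge(answer_a, answer_b, metric, rng):
    """Stand-in for the LLM judge: 1 if answer A wins on the metric, else 0."""
    return rng.choice([1, 1, 1, 0])  # pretend A usually wins

def head_to_head(a, b, metric, runs=5, seed=0):
    rng = random.Random(seed)  # seeded so this sketch is reproducible
    return mean(judge(a, b, metric, rng) for _ in range(runs))

win_rate = head_to_head("graph rag answer", "naive rag answer",
                        "comprehensiveness")
print(win_rate)  # mean of five judgments; above 0.5 means A wins on average
```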

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4oimvfy8lr99gyjjprg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4oimvfy8lr99gyjjprg.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
Head-to-head win rate percentages of (row condition) over (column condition) across two datasets, four metrics, and 125 questions per comparison (each repeated five times and averaged). The overall winner per dataset and metric is shown in bold.&lt;/p&gt;

&lt;p&gt;The indexing process created graphs with 8,564 nodes and 20,691 edges for the Podcast dataset and 15,754 nodes and 19,520 edges for the News dataset. Graph RAG approaches consistently outperformed the naive RAG (SS) method in both comprehensiveness and diversity across datasets, with win rates of 72-83% for Podcasts and 72-80% for News in comprehensiveness, and 75-82% and 62-71% for diversity, respectively. Community summaries provided modest improvements in comprehensiveness and diversity compared to source texts. Root-level Graph RAG offers a highly efficient method for the kind of iterative question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity (62% win rate) over naive RAG. Empowerment results were mixed, and naive RAG outperformed in directness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Graph RAG advances traditional RAG by effectively addressing complex global queries that require comprehensive summarization of large datasets. By combining knowledge graph generation with query-focused summarization, Graph RAG provides detailed and nuanced answers, outperforming naive RAG in both comprehensiveness and diversity.&lt;/p&gt;

&lt;p&gt;This global approach integrates RAG, QFS, and entity-based graph indexing to support sensemaking across entire text corpora. Initial evaluations show significant improvements over naive RAG and competitive performance against other global methods like map-reduce summarization. Graph RAG improves upon naive RAG in scenarios requiring complex reasoning, high factual accuracy, and deep data understanding, such as financial analysis, legal review, and life sciences. However, naive RAG may be more efficient for straightforward queries or when speed is key. In short, Graph RAG is best for complex tasks, while naive RAG suffices for simpler ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/AWU4G" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/AWU4G" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Understanding RAG (Part 5): Recommendations and wrap-up</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Mon, 09 Sep 2024 10:34:11 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-5-recommendations-and-wrap-up-418o</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-5-recommendations-and-wrap-up-418o</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In the five-part series "&lt;a href="https://blog.getmaxim.ai/understanding-rag-part-1-rag-overview-2/" rel="noopener noreferrer"&gt;Understanding RAG&lt;/a&gt;," we began by explaining the foundational Retrieval-Augmented Generation (RAG) framework and progressively explored advanced techniques to refine each component. In Part 1, we provided an overview of the RAG framework. Subsequent parts offered an in-depth analysis of the three main components: indexing, retrieval, and generation. In this final blog, we will discuss the findings from the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," in which the authors undertook a comprehensive study to identify and evaluate the most effective techniques for optimizing RAG systems. To provide a clear and comprehensive overview of the advanced RAG framework and its components, we include an illustrative image below. This visual representation encapsulates the various components and techniques designed to enhance the performance of the RAG system. We will discuss each component and the associated techniques that contribute to optimizing the system's overall performance.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzmkx6uejbyohydv3wpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzmkx6uejbyohydv3wpx.png" alt="Image description" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing component
&lt;/h3&gt;

&lt;p&gt;The indexing component consists of chunking documents and embedding these chunks to store the embeddings in a vector database. In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-2-rag-retrieval/" rel="noopener noreferrer"&gt;earlier parts&lt;/a&gt; of the "Understanding RAG" series, we explored advanced chunking techniques, including sliding window chunking and small-to-big chunking, and highlighted the importance of selecting the optimal chunk size. Larger chunks offer more context but can increase processing time, while smaller chunks enhance recall but may provide insufficient context.&lt;/p&gt;

&lt;p&gt;Equally important is the choice of an effective embedding model. Embeddings are crucial as they deliver compact, semantically meaningful representations of words and entities. The quality of these embeddings has a significant impact on the performance of retrieval and generation processes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendation for document chunking
&lt;/h4&gt;

&lt;p&gt;In &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," the authors evaluated various chunking techniques using the &lt;a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha" rel="noopener noreferrer"&gt;zephyr-7b-alpha&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models" rel="noopener noreferrer"&gt;gpt-3.5-turbo&lt;/a&gt; models for generation and evaluation. The chunk overlap was set to 20 tokens, and the first 60 pages of the document &lt;a href="https://s27.q4cdn.com/263799617/files/doc_financials/2021/AR/Lyft-Annual-Report-2021.pdf" rel="noopener noreferrer"&gt;lyft_2021&lt;/a&gt; were used as the corpus. Additionally, the authors prompted LLMs to generate approximately 170 queries based on the chosen corpus to serve as input queries.&lt;/p&gt;

&lt;p&gt;The study suggests that chunk sizes of &lt;strong&gt;256&lt;/strong&gt; and &lt;strong&gt;512&lt;/strong&gt; tokens offer the best balance between providing sufficient context and maintaining high faithfulness and relevancy. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymbgid14a1r7m0hbimf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymbgid14a1r7m0hbimf.png" alt="Image description" width="638" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimal chunk sizes:&lt;/strong&gt; 256 and 512 tokens provide the best balance of high faithfulness and relevancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Larger chunks (2048 tokens):&lt;/strong&gt; Offer more context but at the cost of slightly lower faithfulness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller chunks (128 tokens):&lt;/strong&gt; Improve retrieval recall but may lack sufficient context, leading to slightly lower faithfulness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Additionally, it highlights that using advanced chunking techniques, such as &lt;strong&gt;sliding window chunking&lt;/strong&gt;, further optimizes these benefits.&lt;/p&gt;
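&lt;p&gt;A minimal sliding-window chunker looks like the sketch below. It assumes whitespace tokenization as a stand-in for a real model tokenizer; the 512-token size and 20-token overlap mirror the study's setup.&lt;/p&gt;

```python
def sliding_window_chunks(text, size=512, overlap=20):
    """Split whitespace tokens into overlapping chunks of `size` tokens."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1500))
chunks = sliding_window_chunks(doc, size=512, overlap=20)
print(len(chunks))  # 4 chunks; adjacent chunks share 20 tokens
```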

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0omqcp8x1ul0e13hhxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0omqcp8x1ul0e13hhxx.png" alt="Image description" width="654" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt; is the extent to which the generated output accurately reflects the information from the original document. &lt;strong&gt;Relevancy&lt;/strong&gt; is the degree to which the generated content is pertinent and useful in relation to the query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Recommendation for embedding model
&lt;/h4&gt;

&lt;p&gt;Choosing the right embedding model is equally important for effective semantic matching of queries and chunk blocks. To select the appropriate open-source embedding model, the authors conducted another experiment using the evaluation module of &lt;a href="https://github.com/FlagOpen/FlagEmbedding" rel="noopener noreferrer"&gt;FlagEmbedding&lt;/a&gt;, which uses the dataset &lt;a href="https://huggingface.co/datasets/namespace-Pt/msmarco" rel="noopener noreferrer"&gt;namespace-Pt/msmarco&lt;/a&gt; for queries and the dataset &lt;a href="https://huggingface.co/datasets/namespace-Pt/msmarco-corpus" rel="noopener noreferrer"&gt;namespace-Pt/msmarco-corpus&lt;/a&gt; for the corpus; metrics like RR and MRR were used for evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw14x4865hxmq3maibwdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw14x4865hxmq3maibwdw.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;RR&lt;/strong&gt; (Reciprocal Rank) is the rank of the first relevant result in a single query. It is the inverse of the rank of this relevant result. &lt;strong&gt;MRR&lt;/strong&gt; (Mean Reciprocal Rank) is the average of the reciprocal ranks of the first relevant item across a set of queries. It measures how well the system ranks the first relevant result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqnzvugy110zzuv65jne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqnzvugy110zzuv65jne.png" alt="Image description" width="542" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q is the total number of queries. RR_i​ is the reciprocal rank of the first relevant result for the i-th query.&lt;/p&gt;
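&lt;p&gt;Putting the formula into code, here is a minimal MRR implementation. Rank counting starts at 1, and a query with no relevant result retrieved contributes a reciprocal rank of 0.&lt;/p&gt;

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """runs: list of (ranked result ids, set of relevant ids), one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant at rank 2 -> RR = 0.5
    (["d2", "d4", "d9"], {"d2"}),  # first relevant at rank 1 -> RR = 1.0
]
print(mrr(runs))  # (0.5 + 1.0) / 2 = 0.75
```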

&lt;p&gt;In the study, the authors selected &lt;strong&gt;&lt;a href="https://huggingface.co/BAAI/llm-embedder" rel="noopener noreferrer"&gt;LLM-Embedder&lt;/a&gt;&lt;/strong&gt; as the embedding model due to its ability to deliver results comparable to the &lt;a href="https://huggingface.co/BAAI/bge-large-en" rel="noopener noreferrer"&gt;BAAI/bge-large-en &lt;/a&gt; model, while being three times smaller in size. This choice strikes a balance between performance and model size efficiency, making it a practical option. Following the discussion on embedding models, we will now turn our attention to the retrieval component. This segment will provide a brief overview of the retrieval process and present the recommendations made by the authors in &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation."&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval component
&lt;/h3&gt;

&lt;p&gt;The retrieval component of RAG can be further divided into two key stages. The first stage involves retrieving document chunks relevant to the query from the vector database. The second stage, reranking, focuses on further evaluating these retrieved documents to rank them based on their relevance, ultimately selecting only the most pertinent documents.&lt;/p&gt;

&lt;p&gt;In the earlier parts of this blog series, we explored various retrieval methods, including &lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;Query2doc&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE &lt;/a&gt;(Hypothetical Document Embeddings), and &lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;ToC&lt;/a&gt; (Tree of Clarifications). Additionally, in Part Four, we delved into reranking, discussing the different models that can be employed for this purpose.&lt;/p&gt;

&lt;p&gt;In this section, we will present the findings from the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," which investigates the most effective retrieval methods and reranking models. We will also provide a brief overview of the experimental setup used in the study.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendations for retrieval methods
&lt;/h4&gt;

&lt;p&gt;The performance of various retrieval methods was evaluated on the &lt;a href="https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html" rel="noopener noreferrer"&gt;TREC DL 2019 and 2020&lt;/a&gt; passage ranking datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfvj7ltm9j7iywae3uf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfvj7ltm9j7iywae3uf9.png" alt="Image description" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results for different retrieval methods on TREC DL19/20. The best result for each method is made bold and the second is underlined.&lt;/p&gt;

&lt;p&gt;The results indicate that the combination of &lt;strong&gt;Hybrid Search with HyDE&lt;/strong&gt; and the LLM-Embedder achieved the highest scores. This approach combines sparse retrieval (BM25) and dense retrieval (original embedding), delivering notable performance with relatively low latency while maintaining efficiency. &lt;/p&gt;
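&lt;p&gt;A hedged sketch of the hybrid fusion idea: combine each document's sparse (BM25-style) score and dense (embedding-similarity) score with a weight alpha, then rank by the fused score. The score values and the alpha below are purely illustrative; real implementations typically normalize both score distributions before mixing.&lt;/p&gt;

```python
def hybrid_rank(sparse, dense, alpha=0.3, k=3):
    """Rank documents by a weighted sum of sparse and dense scores.
    sparse/dense: dicts mapping doc id -> score (illustrative scales)."""
    docs = set(sparse) | set(dense)
    fused = {d: alpha * sparse.get(d, 0.0) + dense.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:k]

sparse = {"doc1": 12.0, "doc2": 3.0}   # BM25-style scores
dense = {"doc2": 0.91, "doc3": 0.88}   # cosine similarities
print(hybrid_rank(sparse, dense, alpha=0.05, k=2))  # ['doc2', 'doc3']
```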

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;mAP&lt;/strong&gt; (Mean Average Precision) assesses the overall precision of retrieved documents, focusing on the system's ability to rank relevant documents higher across all queries. &lt;strong&gt;nDCG@10&lt;/strong&gt; (Normalized Discounted Cumulative Gain at rank 10) evaluates the quality of the top 10 results, emphasizing the importance of placing relevant documents at higher ranks. &lt;strong&gt;R@50&lt;/strong&gt; (Recall at 50) measures the proportion of relevant documents retrieved within the top 50 results, indicating the system's effectiveness in retrieving relevant information.&lt;/p&gt;
&lt;/blockquote&gt;
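&lt;p&gt;The nDCG@10 metric from the box above can be computed directly from its definition. This sketch assumes graded relevance labels listed in retrieved-rank order, with the ideal ordering obtained by sorting the same labels.&lt;/p&gt;

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over graded relevances in rank order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the top-k results over the DCG of the ideal ordering."""
    top = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=10))  # slightly below 1: one swap from ideal
```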

&lt;h4&gt;
  
  
  Recommendations for reranking models
&lt;/h4&gt;

&lt;p&gt;Similar experiments were conducted on the &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS MARCO&lt;/a&gt; Passage ranking dataset to evaluate several bi-encoder and cross-encoder reranking models, such as &lt;a href="https://arxiv.org/pdf/2003.06713" rel="noopener noreferrer"&gt;monoT5&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1910.14424" rel="noopener noreferrer"&gt;monoBERT&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1910.14424" rel="noopener noreferrer"&gt;RankLLaMA&lt;/a&gt;, and &lt;a href="https://arxiv.org/pdf/2108.08513" rel="noopener noreferrer"&gt;TILDEv2&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhic3m6f8dwgezmfbq1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhic3m6f8dwgezmfbq1r.png" alt="Image description" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of different reranking methods on the dev set of the MS MARCO Passage ranking dataset. For each query, the top-1000 candidate passages retrieved by BM25 are reranked. Latency is measured in seconds per query.&lt;/p&gt;

&lt;p&gt;The findings in &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation" recommend &lt;strong&gt;monoT5&lt;/strong&gt; as a well-rounded method that balances performance and efficiency. For those seeking the best possible performance, &lt;strong&gt;RankLLaMA&lt;/strong&gt; is the preferred choice, whereas &lt;strong&gt;TILDEv2&lt;/strong&gt; is recommended for its speed when working with a fixed collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generation component
&lt;/h3&gt;

&lt;p&gt;After the indexing and retrieval components comes the generation component of RAG systems. Here, the documents retrieved from the reranking stage are further summarized and ordered (based on their relevancy scores to the query) before being provided as input to the fine-tuned LLM for final answer generation.&lt;/p&gt;

&lt;p&gt;Techniques like document repacking, document summarization, and generator model fine-tuning are used to enhance the generation component. In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-4-2/" rel="noopener noreferrer"&gt;Part Four&lt;/a&gt; of the "Understanding RAG" blog series, we have discussed all the methods in detail. In this section, we will discuss the findings from &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," where the authors tried to evaluate each process and find the most efficient approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendation for document repacking
&lt;/h4&gt;

&lt;p&gt;Document repacking, discussed in detail in &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-4-2/" rel="noopener noreferrer"&gt;Part Four&lt;/a&gt; of the "Understanding RAG" series, is a vital technique in the RAG workflow that enhances response generation. After reranking the top K documents by relevancy score, repacking optimizes the order in which they are presented to the LLM, helping it produce more accurate and relevant responses. The three primary repacking methods are the forward method, which arranges documents in descending order of relevancy; the reverse method, which arranges them in ascending order; and the sides method, which places the most relevant documents at both the beginning and end of the sequence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi7h1cjsys1065ati0ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi7h1cjsys1065ati0ym.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;br&gt;
In the research paper "Searching for Best Practices in Retrieval-Augmented Generation," various repacking techniques were evaluated across several datasets, including Commonsense Reasoning, fact-checking, Open-Domain QA, MultiHop QA, and Medical QA. The study identified the "&lt;strong&gt;reverse&lt;/strong&gt;" method as the most effective repacking approach based on metrics such as accuracy (Acc), exact match (EM), and RAG scores across all tasks. The evaluation also included an assessment of average latency, measured in seconds per query, to determine efficiency. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Exact match (&lt;strong&gt;EM&lt;/strong&gt;) metric measures the percentage of generated output that exactly matches the ground truth answers. EM = (Number of exact matches / Total number of examples) x 100. Accuracy (&lt;strong&gt;Acc&lt;/strong&gt;) measures the proportion of correct outputs out of the total number of outputs. Accuracy = (Number of correct outputs / Total number of outputs) x 100.&lt;/p&gt;
&lt;/blockquote&gt;
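&lt;p&gt;As a concrete illustration, here is a minimal Python sketch of these two metrics as defined above. The containment predicate passed to &lt;code&gt;accuracy&lt;/code&gt; is a hypothetical stand-in for whatever task-specific correctness check an evaluation actually uses.&lt;/p&gt;

```python
def exact_match(predictions, references):
    """EM: percentage of generated outputs that exactly match the ground truth."""
    matches = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return 100.0 * matches / len(predictions)

def accuracy(predictions, references, is_correct):
    """Acc: percentage of outputs judged correct by a task-specific predicate."""
    correct = sum(1 for p, r in zip(predictions, references) if is_correct(p, r))
    return 100.0 * correct / len(predictions)

preds = ["Paris", "42", "a blue whale"]
refs = ["Paris", "41", "blue whale"]
print(exact_match(preds, refs))  # 33.33... (only "Paris" matches exactly)
# Illustrative correctness check: the reference is contained in the prediction.
print(accuracy(preds, refs, lambda p, r: r in p))  # 66.66...
```

&lt;p&gt;Note how a looser correctness predicate gives a higher score than strict exact match on the same outputs, which is why papers report both.&lt;/p&gt;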

&lt;h4&gt;
  
  
  Recommendation for document summarization
&lt;/h4&gt;

&lt;p&gt;Summarization in RAG aims to improve response relevance and efficiency by condensing retrieved documents and reducing redundancy before inputting them into the LLM. In Part Four of the "Understanding RAG" series, we covered various summarization techniques, including &lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;RECOMP&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/2310.06839" rel="noopener noreferrer"&gt;LongLLMLingua&lt;/a&gt;. This section will focus on identifying the most efficient technique based on &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation." &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugpexp6amhbhqcfbowgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugpexp6amhbhqcfbowgm.png" alt="Image description" width="800" height="190"&gt;&lt;/a&gt;&lt;br&gt;
Results of the search for optimal RAG practices. The “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;p&gt;The paper "Searching for Best Practices in Retrieval-Augmented Generation" evaluated various summarization methods across three benchmark datasets: &lt;a href="https://ai.google.com/research/NaturalQuestions/download" rel="noopener noreferrer"&gt;NQ&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1705.03551" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, and &lt;a href="https://arxiv.org/abs/1809.09600" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;. &lt;strong&gt;RECOMP&lt;/strong&gt; emerged as the superior technique, demonstrating exceptional performance across metrics such as accuracy (Acc) and exact match (EM).&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation for Generator Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;The paper "Searching for Best Practices in Retrieval-Augmented Generation" investigates how fine-tuning affects the generator, particularly with relevant versus irrelevant contexts. The study employed different context compositions for training, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pairs of query-relevant documents (Dg)&lt;/li&gt;
&lt;li&gt;Combinations of relevant and randomly sampled documents (Dgr)&lt;/li&gt;
&lt;li&gt;Combinations of only randomly sampled documents (Dr)&lt;/li&gt;
&lt;li&gt;Contexts with two copies of a relevant document (Dgg)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this study, &lt;a href="https://huggingface.co/meta-llama/Llama-2-7b" rel="noopener noreferrer"&gt;Llama-2-7b&lt;/a&gt; was selected as the base model. The base LM generator without fine-tuning is referred to as M_b, and the model fine-tuned with different contexts is referred to as M_g, M_r, M_gr, and M_gg. The models were fine-tuned using several QA and reading comprehension datasets, including ASQA, HotpotQA, NarrativeQA, NQ, SQuAD, TriviaQA, and TruthfulQA.&lt;/p&gt;

&lt;p&gt;Following the training process, all trained models were evaluated on validation sets with D_g, D_r, D_gr, and D_∅, where D_∅ indicates inference without retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0efrx83rwyl9nkxpyx37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0efrx83rwyl9nkxpyx37.png" alt="Image description" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
Results of generator fine-tuning&lt;/p&gt;

&lt;p&gt;The results demonstrate that models trained with a mix of relevant and random documents (&lt;strong&gt;M_gr&lt;/strong&gt;) perform best when provided with either gold or mixed contexts. This finding suggests that incorporating both relevant and random contexts during training can enhance the generator's robustness to irrelevant information while ensuring effective utilization of relevant contexts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;coverage score&lt;/strong&gt; used as an evaluation metric measures how much of the content of the generated response overlaps with the ground-truth answer. It evaluates the proportion of ground-truth content that is covered by the generated output.&lt;/p&gt;
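&lt;p&gt;The paper does not publish its exact scoring code, but a simple token-level coverage metric along these lines can be sketched as follows; treat it as an illustrative approximation, not the authors' implementation.&lt;/p&gt;

```python
def coverage(generated, ground_truth):
    """Fraction of ground-truth tokens that appear in the generated answer."""
    gen_tokens = set(generated.lower().split())
    gt_tokens = ground_truth.lower().split()
    if not gt_tokens:
        return 0.0
    covered = sum(1 for t in gt_tokens if t in gen_tokens)
    return covered / len(gt_tokens)

print(coverage("the blue whale is the largest animal", "largest animal"))  # 1.0
print(coverage("the blue whale", "largest animal"))  # 0.0
```

&lt;p&gt;Because QA answers are typically short, even this coarse token-overlap view captures most of what the metric rewards.&lt;/p&gt;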




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In the final installment of the "Understanding RAG" series, we reviewed key findings from "Searching for Best Practices in Retrieval-Augmented Generation." While the study suggests effective strategies like chunk sizes of 256 and 512 tokens, LLM-Embedder for embedding, and specific reranking and generation methods, it's clear that the best approach depends on your specific use case. The right combination of techniques—retrieval methods, embedding strategies, or summarization techniques—should be tailored to meet your application's unique needs and goals, and you should run robust evaluations to figure out what works best for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/tkBnX" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/tkBnX" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Understanding RAG (Part 4): Optimizing the generation component</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Sat, 17 Aug 2024 10:33:36 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-4-optimizing-the-generation-component-4jen</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-4-optimizing-the-generation-component-4jen</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-1-rag-overview-2/" rel="noopener noreferrer"&gt;part one&lt;/a&gt; of the "Understanding RAG" series, we outlined the three main components of RAG: indexing, retrieval, and generation. In the previous parts, we discussed techniques for improving indexing and retrieval. This blog will focus on methods to enhance the generation component of Retrieval-Augmented Generation (RAG). We will explore techniques like document repacking, summarization, and fine-tuning LLM models for generation, all of which are crucial for refining the information that feeds into the generation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Document Repacking
&lt;/h2&gt;

&lt;p&gt;Document repacking is a technique used in the RAG workflow to enhance response generation performance. After the reranking stage, where the top K documents are selected based on their relevancy scores, document repacking optimizes the order in which these documents are presented to the LLM model. This rearrangement ensures that the LLM can generate more precise and relevant responses by focusing on the most pertinent information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8dk48ivlqeb1vhyajs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8dk48ivlqeb1vhyajs.png" alt="Image description" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Different Repacking Methods
&lt;/h3&gt;

&lt;p&gt;There are three primary repacking methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Forward Method&lt;/strong&gt;: In this approach, documents are repacked in descending order based on their relevancy scores. This means that the most relevant documents, as determined in the reranking phase, are placed at the beginning of the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reverse Method&lt;/strong&gt;: Conversely, the reverse method arranges documents in ascending order of their relevancy scores. Here, the least relevant documents are placed first, and the most relevant ones are at the end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sides Method&lt;/strong&gt;: The sides method strategically places the most relevant documents at both the head and tail of the input sequence, ensuring that critical information is prominently positioned for the LLM.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
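&lt;p&gt;The three methods above can be sketched in a few lines of Python. The interleaving used here for the "sides" method (alternating ranked documents between the head and the tail) is one possible realization of "most relevant at both ends," not necessarily the exact implementation used in the paper.&lt;/p&gt;

```python
def repack(docs_with_scores, method="reverse"):
    """Reorder (doc, relevancy_score) pairs before prompting the LLM."""
    # Rank by relevancy score, most relevant first.
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    if method == "forward":   # most relevant first
        return ranked
    if method == "reverse":   # most relevant last
        return ranked[::-1]
    if method == "sides":     # most relevant at both ends, least in the middle
        head, tail = [], []
        for i, doc in enumerate(ranked):
            (head if i % 2 == 0 else tail).append(doc)
        return head + tail[::-1]
    raise ValueError(method)

docs = [("d1", 0.9), ("d2", 0.7), ("d3", 0.5), ("d4", 0.3)]
print([d for d, _ in repack(docs, "reverse")])  # ['d4', 'd3', 'd2', 'd1']
print([d for d, _ in repack(docs, "sides")])    # ['d1', 'd3', 'd4', 'd2']
```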

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm7k5frx711x7aln00si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm7k5frx711x7aln00si.png" alt="Image description" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Repacking Method
&lt;/h3&gt;

&lt;p&gt;In the research paper &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors assessed the performance of the various repacking techniques to determine the most effective approach, and the "reverse" method emerged as the best performer. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccqrggwfs267sd9yx6ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccqrggwfs267sd9yx6ew.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of the search for optimal RAG practices show that the “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Document Summarization
&lt;/h2&gt;

&lt;p&gt;Summarization in RAG involves condensing retrieved documents to enhance the relevance and efficiency of response generation by removing redundancy and reducing prompt length before sending them to the LLM model for generation. Summarization tasks can be extractive or abstractive. Extractive methods segment documents into sentences, then score and rank them based on importance. Abstractive compressors synthesize information from multiple documents to rephrase and generate a cohesive summary. These tasks can be either query-based, focusing on information relevant to a specific query, or non-query-based, focusing on general information compression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Purpose of Summarization in RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduce Redundancy&lt;/strong&gt;: Retrieved documents can contain repetitive or irrelevant details. Summarization helps filter out this unnecessary information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Efficiency&lt;/strong&gt;: Long documents or prompts can slow down the language model (LLM) during inference. Summarization helps create more concise inputs, speeding up the response generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhgw7rykgu52a51xmy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhgw7rykgu52a51xmy5.png" alt="Image description" width="511" height="350"&gt;&lt;/a&gt;&lt;br&gt;
Performance vs. Document Number&lt;/p&gt;

&lt;p&gt;In the above figure, it can be clearly seen that the performance of LLMs in downstream tasks decreases as the number of documents in the prompt increases. Therefore, summarization of the retrieved documents is necessary to maintain performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summarization Techniques
&lt;/h3&gt;

&lt;p&gt;There are multiple summarization techniques that can be employed in RAG frameworks. Here in this blog, we will discuss two such techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;&lt;strong&gt;RECOMP&lt;/strong&gt;&lt;/a&gt;: It is a technique that compresses retrieved documents using one of two methods, extractive or abstractive compression, depending on the use case. Extractive is used to maintain exact text, while abstractive synthesizes summaries. The decision is based on task needs, document length, computational resources, and performance. The extractive compressor selects key sentences by using a dual encoder model to rank sentences based on their similarity to the query. The abstractive compressor, on the other hand, generates new summaries with an encoder-decoder model trained on large datasets containing document-summary pairs. These compressed summaries are then added to the original input, providing the language model with a focused and concise context. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h9tg49indo5k7hp7x38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h9tg49indo5k7hp7x38.png" alt="Image description" width="800" height="231"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2310.06839" rel="noopener noreferrer"&gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;: This technique uses a combination of question-aware coarse-grained and fine-grained compression methods to increase key information density and improve the model's performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In question-aware coarse-grained compression, perplexity is measured for each retrieved document with respect to the given query. Then, all documents are ranked based on their perplexity scores, from lowest to highest. Documents with lower perplexity are considered more relevant to the question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bn8fm8kqikmkhs734n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bn8fm8kqikmkhs734n4.png" alt="Image description" width="758" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perplexity Formula&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the above formula, k is the index of the document or sequence being considered, N_k is the number of words or tokens in the k-th document, and p(x_{doc_k, i}) is the probability the model assigns to the i-th word or token of the k-th document.&lt;/p&gt;
&lt;/blockquote&gt;
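&lt;p&gt;To make the ranking step concrete, here is a toy sketch that scores documents by perplexity under an add-one-smoothed unigram model built from query-related text. LongLLMLingua uses a real language model conditioned on the question, so this is only an illustration of the mechanics, with all names and data invented for the example.&lt;/p&gt;

```python
import math
from collections import Counter

def unigram_model(corpus_text):
    """Toy stand-in for an LM: add-one-smoothed unigram probabilities."""
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab)
    return prob

def perplexity(text, prob):
    """PPL = exp( -(1/N) * sum_i log p(x_i) ), as in the formula above."""
    toks = text.lower().split()
    nll = -sum(math.log(prob(t)) for t in toks) / len(toks)
    return math.exp(nll)

docs = [
    "paris is the capital of france",
    "the stock market fell sharply today",
]
# Model biased toward the query's vocabulary; lower PPL = more relevant.
prob = unigram_model("what is the capital of france paris france capital")
ranked = sorted(docs, key=lambda d: perplexity(d, prob))
print(ranked[0])  # paris is the capital of france
```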

&lt;p&gt;Question-aware fine-grained compression is a method to further refine the documents or passages that were retained after the coarse-grained compression by focusing on the most relevant parts of those documents. For each word or token within the retained documents, the importance is computed with respect to the given query. Then, the tokens are ranked based on their importance scores, from most important to least important.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb92oz8lafl58nij1jg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb92oz8lafl58nij1jg6.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;br&gt;
Importance Formula&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Summarization Technique
&lt;/h3&gt;

&lt;p&gt;In the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors evaluated various summarization methods on three benchmark datasets: &lt;a href="https://ai.google.com/research/NaturalQuestions/download" rel="noopener noreferrer"&gt;NQ&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1705.03551" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, and &lt;a href="https://arxiv.org/abs/1809.09600" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;. RECOMP was recommended for its outstanding performance. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmt9tz7ohjrueew1w59r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmt9tz7ohjrueew1w59r.png" alt="Image description" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of the search for optimal RAG practices show that the “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Generator Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Fine-tuning is the process of taking a pre-trained model and making further adjustments to its parameters on a smaller, task-specific dataset to improve its performance on that specific task. Here, in this case, the specific task would be answer generation when given a query paired with context. In the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors focused on fine-tuning the generator. Their goal was to investigate the impact of fine-tuning, particularly how relevant or irrelevant contexts influence the generator’s performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Dive into the Fine-Tuning Process
&lt;/h3&gt;

&lt;p&gt;For the fine-tuning process, the query input to the RAG system is denoted as x and the retrieved contexts as D. The fine-tuning loss for the generator is defined as the negative log-likelihood of the ground-truth output y, which measures how well the predicted probability distribution matches the actual distribution of the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;L (loss function) = −log P(y | x, D), where P(y | x, D) is the probability of the ground-truth output y given the query x and contexts D.&lt;/p&gt;
&lt;/blockquote&gt;
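&lt;p&gt;A minimal numeric illustration of this loss: the more probability the generator assigns to the ground-truth answer, the smaller the loss.&lt;/p&gt;

```python
import math

def nll_loss(prob_of_ground_truth):
    """L = -log P(y | x, D): small when the model is confident in the answer."""
    return -math.log(prob_of_ground_truth)

print(round(nll_loss(0.9), 4))  # 0.1054 — confident prediction, small loss
print(round(nll_loss(0.1), 4))  # 2.3026 — unconfident prediction, large loss
```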

&lt;p&gt;To explore the impact of fine-tuning, especially with relevant and irrelevant contexts, the authors defined d_gold as a context relevant to the query and d_random as a randomly retrieved context. They then trained the model using different compositions of D as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;D_g&lt;/strong&gt;: The augmented context consisted of query-relevant documents, D_g = {d_gold}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_r&lt;/strong&gt;: The context contained one randomly sampled document, D_r = {d_random}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_gr&lt;/strong&gt;: The augmented context comprised a relevant document and a randomly selected one, D_gr = {d_gold, d_random}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_gg&lt;/strong&gt;: The augmented context consisted of two copies of a query-relevant document, D_gg = {d_gold, d_gold}.&lt;/li&gt;
&lt;/ul&gt;
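&lt;p&gt;The four compositions can be sketched as a small helper; the function name and shape are illustrative, not taken from the paper.&lt;/p&gt;

```python
import random

def build_context(d_gold, corpus, composition, rng=random):
    """Assemble the augmented context D for one training example.

    d_gold: document relevant to the query; corpus: pool to sample d_random from.
    """
    d_random = rng.choice(corpus)
    if composition == "Dg":    # relevant document only
        return [d_gold]
    if composition == "Dr":    # random document only
        return [d_random]
    if composition == "Dgr":   # relevant + random
        return [d_gold, d_random]
    if composition == "Dgg":   # relevant document duplicated
        return [d_gold, d_gold]
    raise ValueError(composition)

print(build_context("gold doc", ["noise doc"], "Dgr"))
# ['gold doc', 'noise doc']
```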

&lt;p&gt;In this study, Llama-2-7b was selected as the base model. The base LM generator without fine-tuning is referred to as M_b, and the models fine-tuned with the different contexts are referred to as M_g, M_r, M_gr, and M_gg. The models were fine-tuned using various QA and reading comprehension datasets. Ground-truth coverage was employed as the evaluation metric due to the typically short nature of QA task answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Context Method for Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;Following the training process, all trained models were evaluated on validation sets with D_g, D_r, D_gr, and D_∅, where D_∅ indicates inference without retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlywc8amhy34d4b8xxyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlywc8amhy34d4b8xxyr.png" alt="Image description" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
Results of Generator Fine-Tuning&lt;/p&gt;

&lt;p&gt;The results demonstrate that models trained with a mix of relevant and random documents (M_gr) perform best when provided with either gold or mixed contexts. This finding suggests that incorporating both relevant and random contexts during training can enhance the generator's robustness to irrelevant information while ensuring effective utilization of relevant contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog, we explored methods to enhance the generation component of RAG systems. Document repacking optimizes document order before generation, with the "reverse" method performing best in the study's evaluation. Summarization techniques like RECOMP and LongLLMLingua reduce redundancy and enhance efficiency, with RECOMP showing superior performance. Finally, fine-tuning the generator on a mix of relevant and irrelevant contexts improves its robustness and task-specific performance.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>Understanding RAG (Part 3): Re-Ranker is all you need.</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Sat, 17 Aug 2024 10:07:11 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-3-re-ranker-is-all-you-need-35fg</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-3-re-ranker-is-all-you-need-35fg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Creating a robust Retrieval Augmented Generation (RAG) application presents numerous challenges. As the complexity of the documents increases, we often encounter a significant decrease in the accuracy of the generated answers. This issue can stem from various factors, such as chunk length, metadata quality, clarity of the document, or the nature of the questions asked. By leveraging extended context lengths and refining our Retrieval-Augmented Generation (RAG) strategies, we can significantly enhance the relevance and accuracy of our responses. One effective strategy is the implementation of a re-ranker to ensure more precise and informative outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Re-ranker?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewzcduwdypj01v65xpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewzcduwdypj01v65xpr.png" alt="Image description" width="800" height="287"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://cohere.com/blog/rerank" rel="noopener noreferrer"&gt;Cohere Blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the context of Retrieval-Augmented Generation (RAG), a re-ranker is a model used to refine and improve the initial set of retrieved documents or search results before they are passed to a language model for generating responses. The process begins with an initial retrieval phase where a set of candidate documents is fetched based on the search query using keyword-based search, vector-based search, or a hybrid approach. After this initial retrieval, a re-ranker model re-evaluates and reorders the candidate documents by computing a relevance score for each document-query pair. This re-ranking step prioritizes the most relevant documents according to the context and nuances of the query. The top-ranked documents from this re-ranking process are then selected and passed to the language model, which uses them as context to generate a more accurate and informative response.&lt;/p&gt;
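&lt;p&gt;A minimal end-to-end sketch of this retrieve-then-rerank flow is shown below. Token overlap stands in here for both the first-stage retriever and the cross-encoder reranker; a production system would use a vector index for the first stage and a trained pair scorer (e.g., a cross-encoder) for the second, so this is purely illustrative.&lt;/p&gt;

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def first_stage_retrieve(query, corpus, k=3):
    """Cheap candidate retrieval: rank documents by raw token overlap."""
    q = tokens(query)
    return sorted(corpus, key=lambda d: len(q.intersection(tokens(d))), reverse=True)[:k]

def rerank(query, candidates, score_fn):
    """Second stage: score each (query, doc) pair and reorder by relevance."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)

def overlap_ratio(query, doc):
    """Toy pair scorer standing in for a trained cross-encoder."""
    q = tokens(query)
    return len(q.intersection(tokens(doc))) / max(len(q), 1)

corpus = [
    "the eiffel tower is in paris",
    "paris is the capital of france",
    "bordeaux is known for wine",
]
candidates = first_stage_retrieve("what is the capital of france", corpus, k=2)
top = rerank("what is the capital of france", candidates, overlap_ratio)[0]
print(top)  # paris is the capital of france
```

&lt;p&gt;The key design point is that the expensive pair-wise scorer only ever sees the handful of candidates the cheap first stage returns.&lt;/p&gt;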

&lt;h2&gt;
  
  
  Why Re-Rankers?
&lt;/h2&gt;

&lt;p&gt;Hallucinations, as well as inaccurate and insufficient outputs, often occur when unrelated retrieved documents are included in the generation context. This is where re-rankers can be invaluable: they rearrange the retrieved documents to prioritize the most relevant ones. Two key problems with the plain retrieval-and-generation setup are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Information loss in vector embeddings&lt;/strong&gt;: In RAG, we use vector search to find relevant documents by converting them into numerical vectors. Typically, the text is compressed into vectors of 768 or 1536 dimensions. This process can miss some relevant information due to compression. For example, in "I like going to the beach" vs. "I don't like going to the beach," the presence of "don't" completely changes the meaning, but embeddings might place these sentences close together due to their similar structure and content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Window Limit&lt;/strong&gt;: To capture more relevant documents, we can increase the number of documents returned (top_k) to the LLM (Large Language Model) for generating responses. However, LLMs have a limit on the amount of text they can process at once, known as the context window. Also, stuffing too much text into the context window can reduce the LLM’s performance (needle in a haystack problem) in understanding and recalling relevant information.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblea17qzq4pk4cdgqkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblea17qzq4pk4cdgqkd.png" alt="Image description" width="410" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, "In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss," the authors found that the mistral-medium model's performance scales only for some tasks and quickly degrades for the majority of others as context grows. Each row shows the accuracy (%) of solving the corresponding &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;BABILong task (’qa1’-’qa10’)&lt;/a&gt;, and each column corresponds to the task size submitted to mistral-medium with a 32K context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we Incorporate Re-Rankers in RAG?
&lt;/h2&gt;

&lt;p&gt;To incorporate re-rankers in a Retrieval-Augmented Generation (RAG) system and enhance the accuracy and relevance of the generated responses, we employ a two-stage retrieval system. This system consists of:&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Stage
&lt;/h3&gt;

&lt;p&gt;In the retrieval stage, the goal is to quickly and efficiently retrieve a set of potentially relevant documents from a large text corpus. This is achieved using a vector database, which allows us to perform a similarity search.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector representation&lt;/strong&gt;: First, we convert the text documents into vector representations. This involves transforming the text into high-dimensional vectors that capture the semantic meaning of the text by using a bi-encoder. Typically, models like BERT or other pre-trained language models are used for this purpose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storing vectors&lt;/strong&gt;: These vector representations are then stored in a vector database. The vector database enables efficient storage and retrieval of these vectors, making it possible to perform fast similarity searches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query vector&lt;/strong&gt;: When a query is received, it is also converted into a vector representation using the same model. This query vector captures the semantic meaning of the query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Similarity search&lt;/strong&gt;: The query vector is then compared against the document vectors in the vector database using a similarity metric, such as cosine similarity. This step identifies the top k documents whose vectors are most similar to the query vector.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
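&lt;p&gt;The retrieval stage above can be sketched end to end with a toy bag-of-words "bi-encoder" in place of a real model like BERT and a plain list in place of a vector database (all names and the tiny corpus below are illustrative assumptions):&lt;/p&gt;

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "bi-encoder"; a real system would use a trained model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values())) or 1.0
    return dot / (norm(a) * norm(b))

corpus = [
    "climate change affects coral reef ecosystems",
    "the stock market closed higher today",
    "rising sea temperatures cause coral bleaching",
]
# Document vectors are precomputed once and stored (the "vector database").
doc_vectors = [embed(d) for d in corpus]

def retrieve(query, top_k=2):
    # Encode the query with the same model, then rank by cosine similarity.
    q = embed(query)
    scored = sorted(enumerate(doc_vectors), key=lambda p: cosine(q, p[1]), reverse=True)
    return [corpus[i] for i, _ in scored[:top_k]]

print(retrieve("impact of climate change on coral reefs"))
```

&lt;p&gt;The two climate-related documents are returned first; the unrelated finance document is filtered out by the similarity search.&lt;/p&gt;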

&lt;h3&gt;
  
  
  Reranking Stage
&lt;/h3&gt;

&lt;p&gt;The retrieval stage often returns documents that are relevant but not necessarily the most relevant. This is where the reranking stage comes into play. The reranking model reorders the initially retrieved documents based on their relevance to the query. This will minimize the "needle in a haystack" problem and address the context limit issue, as some models have constraints on their context window, thereby improving the overall quality of the results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initial set of documents&lt;/strong&gt;: The documents retrieved in the first stage are used as the input for the reranking stage. This set typically contains more documents than will ultimately be used, ensuring that we have a broad base from which to select the most relevant ones.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reranking model&lt;/strong&gt;: There are multiple reranking models that can be used to refine search results by re-evaluating and reordering initially retrieved documents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross Encoder&lt;/strong&gt;: It takes pairs of the query and each retrieved document and computes a relevance score. Unlike the bi-encoder used in the retrieval stage, which independently encodes the query and documents, the cross-encoder considers the interaction between the query and document during scoring.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdvpznqeuemcj4hkmfyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdvpznqeuemcj4hkmfyu.png" alt="Image description" width="800" height="455"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;em&gt;Source: &lt;a href="https://www.sbert.net/" rel="noopener noreferrer"&gt;SBERT&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A cross-encoder concatenates the query and documents into a single input sequence and passes this combined sequence through the encoder model to generate a joint representation. While bi-encoders are efficient and scalable for large-scale retrieval tasks due to their ability to precompute embeddings, cross-encoders provide more accurate and contextually rich relevance scoring by processing query-document pairs together. This makes cross-encoders particularly advantageous for tasks where precision and nuanced understanding of the query-document relationship are crucial.&lt;/p&gt;
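&lt;p&gt;The reranking step itself is just a reordering by a joint relevance score. The sketch below stubs out the cross-encoder with a hypothetical keyword-overlap scorer (&lt;code&gt;cross_encoder_score&lt;/code&gt; is an assumption for illustration; a real system would call a trained cross-encoder model on each query-document pair):&lt;/p&gt;

```python
def cross_encoder_score(query, document):
    # Hypothetical stub: count query terms appearing in the document, with a
    # penalty for negation. A real cross-encoder learns far richer
    # query-document interactions from the concatenated pair.
    overlap = sum(1 for w in query.lower().split() if w in document.lower())
    return overlap - 3 * document.lower().count("don't")

def rerank(query, retrieved_docs, top_k=3):
    # Reorder the first-stage candidates by joint query-document relevance.
    return sorted(retrieved_docs,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]

docs = [
    "I don't like going to the beach",
    "Guide to the best beaches to visit",
    "I like going to the beach every summer",
]
print(rerank("like going to the beach", docs, top_k=2))
```

&lt;p&gt;Note that scoring runs once per query-document pair, which is why reranking is applied only to the small candidate set from the retrieval stage rather than the whole corpus.&lt;/p&gt;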

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Vector Rerankers&lt;/strong&gt;: Models like &lt;a href="https://arxiv.org/pdf/2112.01488" rel="noopener noreferrer"&gt;ColBERT&lt;/a&gt; (Contextualized Late Interaction over BERT) strike a balance between bi-encoder and cross-encoder approaches. ColBERT maintains the efficiency of bi-encoders by precomputing document representations while enhancing the interaction between query and document tokens during similarity computation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqoxjrda63p5eav4x5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqoxjrda63p5eav4x5z.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Scenario
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Impact of climate change on coral reefs"&lt;/p&gt;

&lt;p&gt;For the document pre-computation, consider the following documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Document A&lt;/strong&gt;: "Climate change affects ocean temperatures, which in turn impacts coral reef ecosystems."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document B&lt;/strong&gt;: "The Great Barrier Reef is facing severe bleaching events due to rising sea temperatures."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each document is precomputed into token-level embeddings without considering any specific query.&lt;/p&gt;

&lt;p&gt;When the query "Impact of climate change on coral reefs" is received, it is encoded into token-level embeddings at inference.&lt;/p&gt;

&lt;p&gt;For token-level similarity computation, we first calculate the cosine similarity. For each token in the query, such as "Impact", we calculate its similarity with every token in Document A and Document B. This process is repeated for each token in the query ("of", "climate", "change", "on", "coral", "reefs").&lt;/p&gt;

&lt;p&gt;Next, we apply the maximum similarity (maxSim) method. For each query token, we keep only its highest similarity score within each document. For instance, within Document A the highest score for the query token "coral" likely comes from the document token "coral" itself.&lt;/p&gt;

&lt;p&gt;Finally, for each document we sum its maxSim scores to obtain a final similarity score. Suppose Document A receives a final score of 0.85 and Document B a score of 0.75; Document A would then be ranked higher, as its final similarity score is greater.&lt;/p&gt;
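&lt;p&gt;The maxSim computation above can be written out directly. The 2-dimensional "token embeddings" below are hand-made toy values chosen purely for illustration; ColBERT's real token embeddings are high-dimensional and learned:&lt;/p&gt;

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def maxsim_score(query_tokens, doc_tokens):
    # ColBERT-style late interaction: for each query token embedding, take its
    # best cosine similarity over all document token embeddings, then sum.
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hand-made 2-d "token embeddings", purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "coral", "change"
doc_a = [[0.9, 0.1], [0.2, 0.9]]   # near matches for both query tokens
doc_b = [[0.7, 0.7], [0.5, 0.0]]   # weaker, more diffuse matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # Document A ranks higher
```

&lt;p&gt;Because the document token embeddings are precomputed, only the cheap max-and-sum interaction happens at query time, which is what lets ColBERT approach cross-encoder quality at near bi-encoder cost.&lt;/p&gt;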

&lt;h3&gt;
  
  
  Relevance Scoring
&lt;/h3&gt;

&lt;p&gt;For each query-document pair, the reranking model outputs a similarity score that indicates how relevant the document is to the query. This score is computed using the full transformer model, allowing for a more nuanced understanding of the document's relevance in the context of the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reordering Documents
&lt;/h3&gt;

&lt;p&gt;The documents are then reordered based on their relevance scores. The most relevant documents are moved to the top of the list, ensuring that the most pertinent information is prioritized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Rerankers offer a promising solution to the limitations of basic RAG pipelines. By reordering retrieved documents based on their relevance to the query, we can improve the accuracy and relevance of generated responses. Additionally, injecting summarization into the context further enhances the LLM's ability to provide accurate answers. While implementing rerankers may introduce additional computational overhead, the benefits in terms of improved accuracy and relevance make it a worthwhile investment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/SqtqS" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/SqtqS" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;. Book a &lt;a href="https://tini.fyi/ZpBMr" rel="noopener noreferrer"&gt;demo&lt;/a&gt; today!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Understanding RAG (Part 2) : RAG Retrieval</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Wed, 07 Aug 2024 09:23:18 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-2-rag-retrieval-4m4j</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-2-rag-retrieval-4m4j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In part one of the "Understanding RAG" series, we covered the basics and advanced concepts of Retrieval-Augmented Generation (RAG). This part delves deeper into the retrieval component and explores enhancement strategies.&lt;/p&gt;

&lt;p&gt;In this blog, we will explore techniques to improve retrieval and make pre-retrieval processes, such as document chunking, more efficient. Effective &lt;strong&gt;retrieval&lt;/strong&gt; requires accurate, clear, and detailed queries. Even with embeddings, semantic differences between queries and documents can persist. Several methods enhance query information to improve retrieval. For instance, &lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;Query2Doc&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE&lt;/a&gt; generate pseudo-documents from original queries, while &lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;TOC&lt;/a&gt; decomposes queries into subqueries, aggregating the results. &lt;br&gt;
&lt;strong&gt;Document chunking&lt;/strong&gt; significantly influences retrieval performance. Common strategies involve dividing documents into chunks, but finding the optimal chunk length is challenging. Small chunks may fragment sentences, while large chunks can include irrelevant context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is document chunking?
&lt;/h2&gt;

&lt;p&gt;Chunking is the process of dividing a document into smaller, manageable segments or chunks. The choice of chunking techniques is crucial for the effectiveness of this process. In this section, we will explore various chunking techniques mentioned in the research paper and look at their findings to determine the most effective method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz83sxh1zxhksulx5h419.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz83sxh1zxhksulx5h419.png" alt="Image description" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Levels of chunking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token-level chunking&lt;/strong&gt;: Token-level chunking involves splitting the text at the token level. This method is simple to implement but has the drawback of potentially splitting sentences inappropriately, which can affect retrieval quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic-level chunking&lt;/strong&gt;: Semantic-level chunking involves using LLMs to determine logical breakpoints based on context. This method preserves the context and meaning of the text but is time-consuming to implement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentence-level chunking&lt;/strong&gt;: Sentence-level chunking involves splitting the text at sentence boundaries. This method balances simplicity with the preservation of text semantics, though it may be less precise than semantic-level chunking.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
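&lt;p&gt;Token-level and sentence-level chunking can be sketched in a few lines each (the chunk sizes and sample document below are toy values for illustration; semantic-level chunking is omitted since it requires an LLM call):&lt;/p&gt;

```python
import re

def token_chunks(text, chunk_size=8, overlap=2):
    # Token-level chunking: fixed-size windows of whitespace tokens, with
    # overlap so that sentences split at a boundary are partly repeated.
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

def sentence_chunks(text, max_sentences=2):
    # Sentence-level chunking: split on sentence boundaries, then group.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

doc = "RAG retrieves documents. It then generates answers. Chunking affects recall."
print(sentence_chunks(doc))
```

&lt;p&gt;Note how the sentence-level version never cuts a sentence in half, while the token-level version might, which is exactly the trade-off described above.&lt;/p&gt;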

&lt;h3&gt;
  
  
  Chunk size and its impact
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Larger chunks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provide more context, enhancing comprehension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Increase processing time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Smaller chunks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Improve retrieval recall and reduce processing time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: May lack sufficient context.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding the optimal chunk size involves balancing metrics such as faithfulness and relevancy. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Faithfulness measures whether the response is hallucinated or matches the retrieved texts, while relevancy measures whether the retrieved texts and responses match the queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994n2m8m6m87dbo057zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994n2m8m6m87dbo057zh.png" alt="Image description" width="638" height="360"&gt;&lt;/a&gt;&lt;br&gt;
Comparison of different chunk sizes in the research paper: &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an experiment detailed in the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, the &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;text-embedding-ada-002&lt;/a&gt; model was used for embedding. The &lt;a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha" rel="noopener noreferrer"&gt;zephyr-7b-alpha&lt;/a&gt; model served as the generation model, while GPT-3.5-turbo was utilized for evaluation. A chunk overlap of 20 tokens was maintained throughout the process. The corpus for this experiment comprised the first sixty pages of the document &lt;a href="https://s27.q4cdn.com/263799617/files/doc_financials/2021/AR/Lyft-Annual-Report-2021.pdf" rel="noopener noreferrer"&gt;lyft_2021&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Optimal chunk sizes&lt;/strong&gt;: 256 and 512 tokens provide the best balance of high faithfulness and relevancy.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Larger chunks (2048 tokens)&lt;/strong&gt;: Offer more context but at the cost of slightly lower faithfulness.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Smaller chunks (128 tokens)&lt;/strong&gt;: Improve retrieval recall but may lack sufficient context, leading to slightly lower faithfulness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Advanced Chunking Techniques
&lt;/h3&gt;

&lt;p&gt;Advanced chunking techniques, such as small-to-big and sliding windows, improve retrieval quality by organizing chunk relationships and maintaining context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding window chunking&lt;/strong&gt;: Sliding window chunking segments text into overlapping chunks, combining fixed-size and sentence-based chunking advantages. Each chunk overlaps with the previous one, preserving context and ensuring continuity and meaning in the text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small-to-big chunking&lt;/strong&gt;: Small-to-big chunking involves using smaller, targeted text chunks for embedding and retrieval to enhance accuracy. After retrieval, the larger text chunk containing the smaller chunk is provided to the large language model for synthesis. This approach combines precise retrieval with comprehensive contextual information.&lt;/p&gt;
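&lt;p&gt;A minimal sketch of the small-to-big idea: small chunks are what gets matched, but each small chunk points back to the larger parent chunk that is actually handed to the LLM. The chunk sizes, overlap-free splitting, and term-overlap scoring here are toy assumptions, not the paper's exact setup:&lt;/p&gt;

```python
def small_to_big_index(document, big_size=12, small_size=4):
    # Build (small_chunk, parent_chunk) pairs: retrieval uses the small chunk,
    # generation receives the parent chunk.
    tokens = document.split()
    index = []
    for b in range(0, len(tokens), big_size):
        parent_tokens = tokens[b:b + big_size]
        parent = " ".join(parent_tokens)
        for s in range(0, len(parent_tokens), small_size):
            index.append((" ".join(parent_tokens[s:s + small_size]), parent))
    return index

def retrieve_parent(query_terms, index):
    # Score small chunks by simple term overlap, but return the parent chunk.
    best = max(index, key=lambda pair: sum(t in pair[0] for t in query_terms))
    return best[1]

doc = ("Coral reefs are sensitive ecosystems. Rising sea temperatures cause "
       "bleaching events. Ocean acidification also harms reef growth over time.")
index = small_to_big_index(doc)
print(retrieve_parent(["bleaching", "temperatures"], index))
```

&lt;p&gt;The match happens on a focused four-token span, yet the LLM would receive the full surrounding parent chunk, combining precise retrieval with broader context.&lt;/p&gt;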

&lt;p&gt;In an experiment detailed in the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, the effectiveness of advanced chunking techniques is demonstrated using the &lt;a href="https://arxiv.org/pdf/2310.07554" rel="noopener noreferrer"&gt;LLM-Embedder&lt;/a&gt; model for embedding. The study utilizes a smaller chunk size of 175 tokens, a larger chunk size of 512 tokens, and a chunk overlap of 20 tokens. Techniques such as small-to-big and sliding window are employed to enhance retrieval quality by preserving context and ensuring the retrieval of relevant information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2erzqgty83jrtr190jt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2erzqgty83jrtr190jt.png" alt="Image description" width="654" height="288"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Comparison of different chunking techniques in the research paper: &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Methods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Query Rewriting
&lt;/h3&gt;

&lt;p&gt;Refines user queries to better match relevant documents by prompting an LLM to rewrite queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;&lt;strong&gt;Query2Doc&lt;/strong&gt;&lt;/a&gt;: Given a query q, the method generates a pseudo-document d′ through few-shot prompting. The original query q is then concatenated with the pseudo-document d′ to form an enhanced query q+, enhancing the query's context and improving retrieval accuracy. The enhanced query q+ is a straightforward concatenation of q and d′, separated by a special token [SEP].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83pf2j4n0eev7o860n5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83pf2j4n0eev7o860n5y.png" alt="Image description" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The “DPR + query2doc” variant consistently outperforms the baseline DPR model by approximately 1% in MRR on the &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS-MARCO&lt;/a&gt; dev set, regardless of the amount of labeled data used for fine-tuning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This indicates that the improvement provided by the query2doc method is not dependent on the quantity of labeled data available for training.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpwt2i0w32padmq27xif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpwt2i0w32padmq27xif.png" alt="Image description" width="782" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MRR on MS-MARCO dev set w.r.t the percentage of labeled data used for fine-tuning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dense Passage Retrieval (DPR) uses neural networks to embed queries and documents into dense vectors for comparison, while the Mean Reciprocal Rank (MRR) evaluates retrieval performance by averaging the reciprocal ranks of the first relevant document across multiple queries.&lt;br&gt;
&lt;strong&gt;Reciprocal Ranks (RR)&lt;/strong&gt;: 1 / (rank of the first relevant document)&lt;/p&gt;
&lt;/blockquote&gt;
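&lt;p&gt;The MRR definition above translates directly into code (the document IDs and relevance sets below are made up for illustration):&lt;/p&gt;

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results, relevance):
    # results: one ranked list of doc IDs per query; relevance: one set of
    # relevant doc IDs per query.
    rrs = [reciprocal_rank(r, rel) for r, rel in zip(results, relevance)]
    return sum(rrs) / len(rrs)

results = [["d3", "d1", "d7"],   # first relevant doc at rank 2 -> RR = 0.5
           ["d2", "d5", "d9"]]   # first relevant doc at rank 1 -> RR = 1.0
relevance = [{"d1"}, {"d2", "d9"}]
print(mean_reciprocal_rank(results, relevance))  # 0.75
```

&lt;p&gt;An approximately 1% absolute MRR gain, as reported for "DPR + query2doc", means the first relevant document appears slightly earlier in the ranking on average.&lt;/p&gt;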

&lt;h3&gt;
  
  
  2. Query Decomposition
&lt;/h3&gt;

&lt;p&gt;Involves breaking down the original query into sub-questions. This method retrieves documents based on these sub-questions, potentially enhancing retrieval accuracy by addressing different aspects of the original query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;TOC&lt;/a&gt; (TREE OF CLARIFICATIONS), which addresses addresses ambiguous questions in open-domain QA by generating disambiguated questions (DQs) through few-shot prompting. It retrieves relevant passages to identify various interpretations, like different types of medals or Olympics. TOC prunes redundant DQs and creates a comprehensive long-form answer covering all interpretations, ensuring thoroughness and depth without needing user clarification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffznt4l3xme6wo5z80yt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffznt4l3xme6wo5z80yt1.png" alt="Image description" width="800" height="933"&gt;&lt;/a&gt;&lt;br&gt;
Overview of TREE OF CLARIFICATIONS. (1) relevant passages for the ambiguous question (AQ) are retrieved. (2) leveraging the passages, disambiguated questions (DQs) for the AQ are recursively generated via few-shot prompting and pruned as necessary. (3) a long-form answer addressing all DQs is generated.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Pseudo-Documents Generation
&lt;/h3&gt;

&lt;p&gt;Techniques like &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE&lt;/a&gt; (Hypothetical Document Embeddings) improve retrieval by generating hypothetical or pseudo documents with a generative model and encoding them into semantic embeddings using a contrastive encoder. This approach enhances relevance-based retrieval without relying on exact text matches, utilizing unsupervised learning principles effectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng0xa5zvq6j2wmglhrpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng0xa5zvq6j2wmglhrpl.png" alt="Image description" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An illustration of the HyDE model from &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's dive deep into HyDE's concept by examining each step through an example scenario. At a high level, it performs two tasks: a generative task and a document-document similarity task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example query:&lt;/strong&gt; What are the causes of climate change?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Generative model&lt;/strong&gt;&lt;br&gt;
Purpose: Generate a hypothetical document that answers the query.&lt;br&gt;
Generated Document: "Climate change is caused by human activities like burning fossil fuels, deforestation, and industrial processes. It releases greenhouse gases such as carbon dioxide and methane, trapping heat and causing global warming. Natural factors like volcanic eruptions and solar radiation also contribute."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It captures key causes of climate change but may contain errors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Contrastive encoder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Encode the generated document using semantic embedding.&lt;/p&gt;

&lt;p&gt;Encoded Vector: A numerical representation in a high-dimensional space that emphasizes semantic content (causal factors of climate change) while filtering out unnecessary details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Retrieval using document-document similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Compare the encoded vector of the hypothetical document with embeddings of real documents in a corpus.&lt;/p&gt;

&lt;p&gt;Process: Compute similarity (e.g., cosine similarity) between the hypothetical document's vector and corpus document vectors.&lt;/p&gt;

&lt;p&gt;Retrieval: Retrieve real documents from the corpus that are most semantically similar to the hypothetical document, focusing on relevance rather than exact wording.&lt;/p&gt;
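&lt;p&gt;The three HyDE steps above can be strung together in a short sketch. The generator is a hypothetical stub in place of the real instruction-following LLM, and a toy bag-of-words counter stands in for the contrastive encoder; the two-document corpus is made up for illustration:&lt;/p&gt;

```python
from collections import Counter
from math import sqrt

def generate_hypothetical_doc(query):
    # Step 1 (stub): in HyDE this is an LLM writing a hypothetical answer.
    return ("Climate change is driven by burning fossil fuels deforestation "
            "and industrial emissions of greenhouse gases")

def encode(text):
    # Step 2 (toy): bag-of-words stand-in for the contrastive encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

corpus = [
    "fossil fuels and deforestation release greenhouse gases",
    "coral reefs host a quarter of marine species",
]

def hyde_retrieve(query):
    # Step 3: document-document similarity. Compare the hypothetical
    # document, not the raw query, against the corpus embeddings.
    hypo_vec = encode(generate_hypothetical_doc(query))
    return max(corpus, key=lambda d: cosine(hypo_vec, encode(d)))

print(hyde_retrieve("What are the causes of climate change?"))
```

&lt;p&gt;Even though the raw query shares few words with the corpus, the hypothetical answer does, so the relevant document surfaces via document-document similarity rather than exact query matching.&lt;/p&gt;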

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Advanced chunking techniques like the sliding window significantly improve retrieval quality by maintaining context and ensuring the extraction of relevant information. Among various retrieval methods evaluated, &lt;a href="https://arxiv.org/pdf/2204.10558" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt; with HyDE stands out as the best, combining the speed of sparse retrieval with the accuracy of dense retrieval to achieve superior performance with acceptable latency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the next part of this series, we will focus on the re-ranking component of RAG to further improve content generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/?utm_source=content&amp;amp;utm_medium=social&amp;amp;utm_campaign=devto&amp;amp;utm_content=RAG_part5" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://www.getmaxim.ai/?utm_source=content&amp;amp;utm_medium=social&amp;amp;utm_campaign=devto&amp;amp;utm_content=RAG_part5" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>chatgpt</category>
      <category>llama</category>
    </item>
    <item>
      <title>Understanding RAG (Part 1): RAG overview</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Fri, 02 Aug 2024 06:05:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-1-rag-overview-p44</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-1-rag-overview-p44</guid>
      <description>&lt;h1&gt;
  
  
  What is RAG?
&lt;/h1&gt;

&lt;p&gt;Retrieval-augmented generation (RAG) is a process designed to enhance the output of a large language model (LLM) by incorporating information from an external, authoritative knowledge base. This approach ensures that the responses generated by the LLM are not solely dependent on the model's training data. Large Language Models are trained on extensive datasets and utilize billions of parameters to perform tasks such as answering questions, translating languages, and completing sentences. By using RAG, these models can tap into specific domains or an organization's internal knowledge base without needing to be retrained. This method is cost-effective and helps maintain the relevance, accuracy, and utility of the LLM's output in various contexts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why is Retrieval-Augmented Generation important?
&lt;/h1&gt;

&lt;p&gt;Large language models (LLMs) are a critical component of artificial intelligence (AI) technologies, especially for intelligent chatbots and other natural language processing (NLP) applications. The primary goal is to create chatbots that can accurately answer user questions by referencing reliable knowledge sources. However, there are inherent challenges with LLM technology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: LLMs can present incorrect information if they do not have the right answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-date responses&lt;/strong&gt;: The static nature of LLM training data means the model might provide outdated or overly generic information instead of specific, current responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-authoritative sources&lt;/strong&gt;: Responses may be generated from unreliable sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminology confusion&lt;/strong&gt;: Different sources might use the same terms to refer to different things, leading to inaccurate responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG addresses these challenges by directing the LLM to retrieve relevant information from authoritative, pre-determined knowledge sources. This approach provides several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control over output&lt;/strong&gt;: Organizations can better control the text generated by the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate responses&lt;/strong&gt;: Users receive more accurate and relevant information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: Users gain insights into the sources used by the LLM to generate responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m9x9lsj2aeouwsfqzbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m9x9lsj2aeouwsfqzbi.png" alt="Diagram illustrating the components of a RAG system" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How Does Retrieval-Augmented Generation (RAG) Work?
&lt;/h1&gt;

&lt;p&gt;A basic Retrieval-Augmented Generation (RAG) framework comprises three main components: indexing, retrieval, and generation. This framework operates by first indexing data into vector representations. Upon receiving a user query, it retrieves the most relevant chunks of information based on their similarity to the query. Finally, these retrieved chunks are used to generate a well-informed response. This process ensures that the model's output is accurate, relevant, and contextually appropriate.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Indexing
&lt;/h2&gt;

&lt;p&gt;Indexing is the initial phase where raw data is prepared and stored for retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data cleaning and extraction&lt;/strong&gt;: Raw data from various formats such as PDF, HTML, Word, and Markdown is cleaned and extracted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversion to plain text&lt;/strong&gt;: This data is converted into a uniform plain text format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text segmentation&lt;/strong&gt;: The text is segmented into smaller, digestible chunks to accommodate the context limitations of language models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector encoding&lt;/strong&gt;: These chunks are encoded into vector representations using an embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage in vector database&lt;/strong&gt;: The encoded vectors are stored in a vector database, which is crucial for enabling efficient similarity searches in the retrieval phase.&lt;/li&gt;
&lt;/ul&gt;
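&lt;p&gt;The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: &lt;code&gt;embed&lt;/code&gt; is a toy hash-based stand-in for a real embedding model, and a plain list stands in for the vector database.&lt;/p&gt;

```python
# Minimal indexing sketch: chunk plain text and store vectors in memory.
import hashlib

def embed(text, dim=8):
    # Toy deterministic "embedding" for illustration; a real system would
    # call an embedding model here.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(text, size=100):
    # Fixed-size character chunks; real systems split on sentence or
    # token boundaries to respect model context limits.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents):
    index = []  # stands in for the vector database
    for doc in documents:
        for piece in chunk(doc):
            index.append({"text": piece, "vector": embed(piece)})
    return index

index = build_index(["RAG retrieves relevant chunks before generation."])
```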

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwscbp9tx2z32r76lzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwscbp9tx2z32r76lzc.png" alt="Indexing component of RAG" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Retrieval
&lt;/h2&gt;

&lt;p&gt;Retrieval is the phase where relevant data is fetched based on a user query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query encoding&lt;/strong&gt;: Upon receiving a user query, the RAG system uses the same encoding model from the indexing phase to convert the query into a vector representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity calculation&lt;/strong&gt;: The system computes similarity scores between the query vector and the vectors of the indexed chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk retrieval&lt;/strong&gt;: The system prioritizes and retrieves the top K chunks that have the highest similarity scores to the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded context creation&lt;/strong&gt;: These retrieved chunks are used to expand the context of the prompt that will be given to the language model.&lt;/li&gt;
&lt;/ul&gt;
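&lt;p&gt;A minimal sketch of the retrieval phase: the query is encoded with the same embedding function used at indexing time, similarity scores are computed against every stored chunk, and the top K chunks are returned. The tiny hand-written vectors below are illustrative only.&lt;/p&gt;

```python
# Minimal retrieval sketch: rank indexed chunks by cosine similarity
# to the query vector and return the top-k texts.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vector, index, k=2):
    scored = [(cosine(query_vector, e["vector"]), e["text"]) for e in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]

# Index entries have the same shape produced at indexing time.
index = [
    {"text": "RAG indexing", "vector": [1.0, 0.0]},
    {"text": "RAG retrieval", "vector": [0.9, 0.1]},
    {"text": "unrelated", "vector": [0.0, 1.0]},
]
top = retrieve([1.0, 0.0], index, k=2)
# The two RAG chunks outrank the unrelated one.
```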

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08j3a60wyi6og3usjprw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08j3a60wyi6og3usjprw.png" alt="Retrieval component of RAG" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Generation
&lt;/h2&gt;

&lt;p&gt;Generation is the final phase where the response is created based on the retrieved information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt synthesis&lt;/strong&gt;: The user query and the selected documents (retrieved chunks) are synthesized into a coherent prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response formulation&lt;/strong&gt;: A large language model is tasked with formulating a response to the synthesized prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-specific criteria&lt;/strong&gt;: The model may draw upon its inherent parametric knowledge or limit its response to the information within the provided documents, depending on the task-specific criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn dialogue&lt;/strong&gt;: For ongoing dialogues, the existing conversational history can be integrated into the prompt, enabling the model to engage in effective multi-turn interactions.&lt;/li&gt;
&lt;/ul&gt;
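&lt;p&gt;Prompt synthesis can be sketched as simple string assembly: the retrieved chunks, any conversational history, and the user query are combined into one prompt. The actual LLM call is omitted; any chat-completion API would consume the resulting &lt;code&gt;prompt&lt;/code&gt;. The template wording is illustrative.&lt;/p&gt;

```python
# Minimal prompt-synthesis sketch: combine retrieved chunks, optional
# multi-turn history, and the user query into a single prompt.
def build_prompt(query, chunks, history=None):
    context = "\n".join(f"- {c}" for c in chunks)
    turns = "\n".join(history or [])
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        + (f"Conversation so far:\n{turns}\n" if turns else "")
        + f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("What is RAG?", ["RAG augments LLMs with retrieval."])
```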

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8um8lo14uxj3p27csdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8um8lo14uxj3p27csdc.png" alt="Generation component of RAG" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Drawbacks of Basic Retrieval-Augmented Generation (RAG)
&lt;/h1&gt;

&lt;p&gt;Basic RAG faces challenges in retrieval precision and recall, and in generating accurate, relevant responses. It struggles to integrate retrieved information coherently, often producing disjointed, redundant, or repetitive outputs. Judging the significance of different passages and keeping responses consistent adds further complexity. A single retrieval pass often fails to gather enough context, and over-reliance on the augmented data can yield outputs that merely echo the retrieved content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Challenges
&lt;/h2&gt;

&lt;p&gt;The retrieval phase in Basic RAG often struggles with the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision and Recall&lt;/strong&gt;: It frequently selects chunks of information that are misaligned with the query or irrelevant, and it can miss crucial information needed for a comprehensive response.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;: The proportion of retrieved documents that are relevant to the query, out of all documents retrieved. It measures the exactness of the retrieval process.&lt;br&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: The proportion of relevant documents that were retrieved, out of all relevant documents in the database. It measures the completeness of the retrieval process.&lt;/p&gt;
&lt;/blockquote&gt;
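&lt;p&gt;Both metrics reduce to simple set arithmetic over one query's results. The document IDs below are illustrative, and the ground-truth relevant set is assumed to be known (as it would be in an evaluation dataset).&lt;/p&gt;

```python
# Precision and recall for a single query, given the retrieved set and the
# ground-truth relevant set.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved docs are relevant; 2 of the 3 relevant docs were found.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d5"])
```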

&lt;h2&gt;
  
  
  Generation Difficulties
&lt;/h2&gt;

&lt;p&gt;In the generation phase, Basic RAG can encounter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination&lt;/strong&gt;: The model might produce content not supported by the retrieved context, creating fabricated or inaccurate information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevance, Toxicity, and Bias&lt;/strong&gt;: The outputs can sometimes be off-topic, offensive, or biased, negatively impacting the quality and reliability of the responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Augmentation Hurdles
&lt;/h2&gt;

&lt;p&gt;When integrating retrieved information into responses, Basic RAG faces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Disjointed or incoherent outputs&lt;/strong&gt;: Combining the retrieved data with the task at hand can result in responses that lack coherence or flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundancy&lt;/strong&gt;: Similar information retrieved from multiple sources can lead to repetitive content in the responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Significance and relevance&lt;/strong&gt;: Determining the importance and relevance of different passages is challenging, as is maintaining stylistic and tonal consistency in the final output.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Complexity of Information Acquisition
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single retrieval limitation&lt;/strong&gt;: A single retrieval based on the original query often fails to provide enough context, necessitating more complex retrieval strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-reliance on augmented information&lt;/strong&gt;: Generation models might depend too heavily on the retrieved content, leading to responses that merely echo this information without offering additional insight or synthesis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xkybvegmz248qoyks58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xkybvegmz248qoyks58.png" alt="Basic RAG" width="616" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Advanced Retrieval-Augmented Generation (RAG)
&lt;/h1&gt;

&lt;p&gt;Advanced RAG builds on the basic RAG framework by addressing its limitations and enhancing retrieval quality through pre-retrieval and post-retrieval strategies. Here’s a detailed explanation:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pre-retrieval Process
&lt;/h2&gt;

&lt;p&gt;The pre-retrieval process aims to optimize both the indexing structure and the original query to ensure high-quality content retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimizing Indexing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhancing data granularity&lt;/strong&gt;: Breaking down data into smaller, more precise chunks to improve indexing accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing index structures&lt;/strong&gt;: Improving the structure of indexes to facilitate efficient and accurate retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adding metadata&lt;/strong&gt;: Incorporating additional information like timestamps, authorship, and categorization to enhance context and relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment optimization&lt;/strong&gt;: Aligning data chunks to maintain context and continuity across segments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed retrieval&lt;/strong&gt;: Combining various retrieval techniques to improve overall search results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query rewriting&lt;/strong&gt;: Rephrasing the user's original question to improve clarity and accuracy in retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query transformation&lt;/strong&gt;: Altering the structure of the query to better match the indexed data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query expansion&lt;/strong&gt;: Adding related terms or synonyms to the query to capture a broader range of relevant results.&lt;/li&gt;
&lt;/ul&gt;
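&lt;p&gt;As one concrete example, query expansion can be sketched with a hand-written synonym map. Real systems would derive expansions from a thesaurus, embedding neighbors, or an LLM; the map entries here are purely illustrative.&lt;/p&gt;

```python
# Minimal query-expansion sketch: append synonyms for known terms so the
# expanded query matches a broader range of relevant chunks.
SYNONYMS = {  # illustrative entries only
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

q = expand_query("buy car")
```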

&lt;h2&gt;
  
  
  2. Post-retrieval Process
&lt;/h2&gt;

&lt;p&gt;After retrieving relevant context, the post-retrieval process focuses on effectively integrating it with the query to generate accurate and focused responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-ranking Chunks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Re-ranking&lt;/strong&gt;: Prioritizing the retrieved information by relocating the most relevant content to the edges of the prompt, since language models attend less reliably to content in the middle of long contexts. This method is implemented in frameworks like LlamaIndex, LangChain, and Haystack.&lt;/li&gt;
&lt;/ul&gt;
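&lt;p&gt;A minimal sketch of edge-placing re-ranking: chunks are sorted by relevance score, then alternately assigned to the front and back of the prompt so the highest-scored content lands at the edges rather than the middle. The scores and single-letter chunks are illustrative.&lt;/p&gt;

```python
# Re-ranking sketch: place the most relevant chunks at the start and end of
# the context, mitigating the "lost in the middle" effect in long prompts.
def rerank_to_edges(scored_chunks):
    # scored_chunks: list of (score, text); higher score = more relevant.
    ordered = sorted(scored_chunks, key=lambda s: s[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ordered):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

layout = rerank_to_edges([(0.9, "A"), (0.7, "B"), (0.5, "C"), (0.3, "D")])
# The two top-scored chunks ("A" and "B") sit at the edges.
```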

&lt;h3&gt;
  
  
  Context Compressing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigating information overload&lt;/strong&gt;: Directly feeding all relevant documents into LLMs can lead to information overload, where key details are diluted by irrelevant content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selecting essential information&lt;/strong&gt;: Focusing on the most crucial parts of the retrieved content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emphasizing critical sections&lt;/strong&gt;: Highlighting the most important sections of the context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortening the context&lt;/strong&gt;: Reducing the amount of information to be processed by the LLM to maintain focus on key details.&lt;/li&gt;
&lt;/ul&gt;
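&lt;p&gt;Context compression can be sketched with a crude word-overlap filter that drops sentences sharing no terms with the query. Production compressors use learned models; this overlap heuristic is only a stand-in to show the shape of the operation.&lt;/p&gt;

```python
# Context-compression sketch: keep only sentences that overlap with the
# query, shortening the context before it reaches the LLM.
def compress(context, query, min_overlap=1):
    query_terms = set(query.lower().split())
    kept = []
    for sentence in context.split(". "):
        overlap = len(query_terms & set(sentence.lower().split()))
        if overlap >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)

ctx = "RAG retrieves chunks. The weather is sunny. Retrieval uses chunks"
short = compress(ctx, "retrieves chunks")
# The off-topic weather sentence is dropped.
```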

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9ktrbyzj4mnw7qk61ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9ktrbyzj4mnw7qk61ew.png" alt="Advanced RAG" width="800" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the subsequent parts of this blog, we will delve into detailed discussions on the advancements in pre-retrieval and post-retrieval techniques.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Retrieval-augmented generation (RAG) significantly improves the accuracy and relevance of responses from large language models (LLMs) by incorporating external, authoritative knowledge sources. It addresses common LLM challenges like misinformation and outdated data, enhancing control, transparency, and reliability in content generation. While basic RAG faces retrieval precision and response coherence issues, advanced strategies refine indexing, optimize queries, and integrate information more effectively. Ongoing advancements in RAG are crucial for enhancing AI-generated responses, building trust, and improving natural language processing interactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmaxim.ai" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://getmaxim.ai" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; before adding advanced RAG features.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
