<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parth Roy</title>
    <description>The latest articles on DEV Community by Parth Roy (@parth_roy_a1ec4703407d025).</description>
    <link>https://dev.to/parth_roy_a1ec4703407d025</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1698730%2F46ad2fe8-2710-4a48-806e-1617ce3d7a46.jpg</url>
      <title>DEV Community: Parth Roy</title>
      <link>https://dev.to/parth_roy_a1ec4703407d025</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parth_roy_a1ec4703407d025"/>
    <language>en</language>
    <item>
      <title>RAGEval: Scenario-specific RAG evaluation dataset generation framework</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Wed, 11 Sep 2024 03:55:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/rageval-scenario-specific-rag-evaluation-dataset-generation-framework-4elb</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/rageval-scenario-specific-rag-evaluation-dataset-generation-framework-4elb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Evaluating Retrieval-Augmented Generation (RAG) systems in specialized domains like finance, healthcare, and legal presents unique challenges that existing benchmarks, focused on general question-answering, fail to address. In this blog, we will explore the &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework," which introduces RAGEval. RAGEval offers a solution that automatically generates domain-specific evaluation datasets, reducing manual effort and privacy concerns. By focusing on creating scenario-specific datasets, RAGEval provides a more accurate and reliable assessment of RAG systems in complex, data-sensitive fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of existing benchmarks for RAG
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Focus on general domains:&lt;/strong&gt; Existing RAG benchmarks primarily evaluate factual correctness in general question-answering tasks, which may not accurately reflect the performance of RAG systems in specialized or vertical domains like finance, healthcare, and legal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual data curation:&lt;/strong&gt; One limitation is that evaluating or benchmarking RAG systems requires manually curating a dataset with input queries and expected outputs (golden answers). This necessity arises because domain-specific benchmarks are not publicly available due to concerns over safety and data privacy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data leakage:&lt;/strong&gt; Challenges in evaluating RAG systems include data leakage from traditional benchmarks such as &lt;a href="https://hotpotqa.github.io/" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;, &lt;a href="https://huggingface.co/datasets/mandarjoshi/trivia_qa" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS MARCO&lt;/a&gt;, &lt;a href="https://ai.google.com/research/NaturalQuestions/" rel="noopener noreferrer"&gt;Natural Questions&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/2011.01060" rel="noopener noreferrer"&gt;2WikiMultiHopQA&lt;/a&gt;, and &lt;a href="https://ai.meta.com/tools/kilt/" rel="noopener noreferrer"&gt;KILT&lt;/a&gt;. Data leakage occurs when information from the answers inadvertently appears in the training data, allowing systems to achieve inflated performance metrics by memorizing rather than genuinely understanding and retrieving information.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAGEval?
&lt;/h2&gt;

&lt;p&gt;RAGEval is a framework that automatically creates evaluation datasets for assessing RAG systems. It generates a schema from seed documents, applies it to create diverse documents, and constructs question-answering pairs based on these documents and configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zc6t9k9h8pxugvqkk5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zc6t9k9h8pxugvqkk5p.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges addressed by RAGEval
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;RAGEval automates dataset creation by summarizing schemas and generating varied documents, reducing manual effort.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By using data derived from seed documents and specific schemas, RAGEval ensures consistent evaluation and minimizes biases and privacy concerns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Additionally, RAGEval tackles the issue of general domain focus by creating specialized datasets for vertical domains like finance, healthcare, and legal, which are often overlooked in existing benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡&lt;strong&gt;Are RAGEval and synthetic data generation the same?&lt;/strong&gt; RAGEval and synthetic data generation both create datasets for models, but with different goals. RAGEval generates evaluation datasets by deriving schemas from real documents and creating question-answering pairs to assess RAG systems. In contrast, synthetic data generation produces artificial, varied data to support model training and testing across various applications. RAGEval focuses on structured evaluation, while synthetic data generation emphasizes diverse, fictional data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Components of RAGEval
&lt;/h2&gt;

&lt;p&gt;Building a closed-domain RAG evaluation dataset presents two major challenges: the high cost of collecting and annotating sensitive vertical-domain documents, and the complexity of evaluating the detailed, comprehensive answers typical of vertical domains. To tackle these issues, RAGEval uses a “schema-configuration-document-QRA-keypoint” pipeline. This approach emphasizes factual information and improves the accuracy and reliability of answer evaluation. The following sub-sections detail each component of this pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Schema summary
&lt;/h3&gt;

&lt;p&gt;In domain-specific scenarios, texts follow a common knowledge framework, represented by schema &lt;strong&gt;S&lt;/strong&gt;, which captures the essential factual information such as organization, type, events, date, and place. This schema is derived by using LLMs to analyze a small set of seed texts, even if they differ in style and content. For example, financial reports can cover various industries. This method ensures the schema's validity and comprehensiveness, improving text generation control and producing coherent, domain-relevant content.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From seed financial reports, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Report A: "Tech Innovations Inc. in San Francisco released its annual report on March 15, 2023."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report B: "Green Energy Ltd. in Austin published its quarterly report on June 30, 2022."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema captures essential elements like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Company name&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report type (annual, quarterly)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Key events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Location&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using this schema, new content can be generated, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  "Global Tech Solutions in New York announced its quarterly earnings on July 20, 2024."&lt;/li&gt;
&lt;/ul&gt;
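&lt;p&gt;As a minimal sketch, the schema-to-document idea above can be expressed in Python. The field names, company details, and the &lt;code&gt;render&lt;/code&gt; helper are invented for illustration; the actual framework fills configurations with LLM prompts rather than a fixed template.&lt;/p&gt;

```python
# Hypothetical schema distilled from the seed reports: each field is a
# slot of factual information the generated documents must fill.
SCHEMA = ["company", "location", "report_type", "event", "date"]

# A configuration is one concrete assignment of values to the schema.
config = {
    "company": "Global Tech Solutions",
    "location": "New York",
    "report_type": "quarterly",
    "event": "announced its quarterly earnings",
    "date": "July 20, 2024",
}

def render(config):
    # Render one document sentence in the style of the seed reports.
    return "{company} in {location} {event} on {date}.".format(**config)

assert all(field in config for field in SCHEMA)
print(render(config))
```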

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ncyae3fbr6g56imlro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47ncyae3fbr6g56imlro.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Document generation
&lt;/h3&gt;

&lt;p&gt;To create effective evaluation datasets, RAGEval first generates configurations derived from the schema, ensuring consistency and coherence in the virtual texts. It uses a hybrid approach: rule-based methods for accurate, structured data such as dates and categories, and LLMs for more complex, nuanced content. For instance, in financial reports, the configurations span 20 business domains, covering sectors such as “agriculture” and “aviation.” The configurations are integrated into structured documents, such as medical records or legal texts, following domain-specific guidelines. For financial documents, content is divided into sections (e.g., “Financial Report,” “Corporate Governance”) to ensure coherent and relevant outputs.&lt;/p&gt;
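&lt;p&gt;The hybrid rule-based/LLM split might look like the following sketch. The sector list, field names, and the &lt;code&gt;fake_llm&lt;/code&gt; stand-in are all assumptions for illustration; a real pipeline would call an actual model for the nuanced fields.&lt;/p&gt;

```python
import random

SECTORS = ["agriculture", "aviation"]  # two of the paper's 20 business domains

def fake_llm(prompt):
    # Stand-in for a real LLM call that would generate nuanced content.
    return "The company expanded its fleet and reported steady growth."

def generate_config(seed=0):
    rng = random.Random(seed)
    config = {
        # Rule-based fields: structured values such as categories and dates.
        "sector": rng.choice(SECTORS),
        "report_date": "2024-0{}-15".format(rng.randint(1, 9)),
    }
    # LLM-generated field: complex narrative content.
    config["business_summary"] = fake_llm(
        "One-sentence business summary for a {} company.".format(config["sector"]))
    return config

cfg = generate_config()
print(sorted(cfg))
```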

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ezqnjslw5k8jlyaq0f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ezqnjslw5k8jlyaq0f4.png" alt="Image description" width="800" height="543"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: QRA generation
&lt;/h3&gt;

&lt;p&gt;QRA Generation involves creating Question-Reference-Answer (QRA) triples from given documents (&lt;strong&gt;D&lt;/strong&gt;) and configurations (&lt;strong&gt;C&lt;/strong&gt;). This process is designed to establish a comprehensive evaluation framework for testing information retrieval and reasoning capabilities. It includes four key steps: first, formulating questions (&lt;strong&gt;Q&lt;/strong&gt;) based on the documents; second, extracting relevant information fragments (&lt;strong&gt;R&lt;/strong&gt;) from the documents to support the answers; third, generating initial answers (&lt;strong&gt;A&lt;/strong&gt;) to the questions using the extracted references; and fourth, optimizing the answers and references to ensure accuracy and alignment, addressing any discrepancies or irrelevant content.&lt;/p&gt;
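&lt;p&gt;The four steps can be sketched as a small pipeline. The &lt;code&gt;fake_llm&lt;/code&gt; function, the naive sentence split, and the example document are assumptions for illustration; the paper drives each step with GPT-4o prompts.&lt;/p&gt;

```python
def fake_llm(prompt):
    # Stand-in for the GPT-4o calls used in the paper; returns canned text.
    return "Tech Innovations Inc. released its annual report on March 15, 2023"

def generate_qra(document, config):
    # Step 1: formulate a question (Q) from the document and configuration.
    question = fake_llm("Write a {} question about: {}".format(
        config["question_type"], document))
    # Step 2: extract supporting reference fragments (R) from the document
    # (naive sentence split here; the real prompt selects relevant spans).
    references = [s.strip() for s in document.split(".") if s.strip()]
    # Step 3: draft an initial answer (A) grounded only in the references.
    answer = fake_llm("Answer using only: " + " | ".join(references))
    # Step 4: optimize A and R so every claim in A is backed by R.
    return {"question": question, "references": references, "answer": answer}

doc = "Tech Innovations Inc. in San Francisco released its annual report on March 15, 2023."
qra = generate_qra(doc, {"question_type": "factual"})
print(sorted(qra))
```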

&lt;p&gt;&lt;strong&gt;Utilizing configurations for question and initial answer generation&lt;/strong&gt; involves using specific configurations (&lt;strong&gt;C&lt;/strong&gt;) to guide the creation of questions and initial answers. These configurations are embedded in prompts to ensure that generated questions are precise and relevant, and that the answers are accurate. The configurations help generate a diverse set of question types, including factual, multi-hop reasoning, summarization, etc. The GPT-4o model produces targeted and accurate questions (&lt;strong&gt;Q&lt;/strong&gt;) and answers (&lt;strong&gt;A&lt;/strong&gt;) by including detailed instructions and examples for each question type. The approach aims to evaluate different facets of language understanding and information processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8ni23lo5nen6lavm43f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8ni23lo5nen6lavm43f.png" alt="Image description" width="800" height="576"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting references:&lt;/strong&gt; given the constructed questions (&lt;strong&gt;Q&lt;/strong&gt;) and initial answers (&lt;strong&gt;A&lt;/strong&gt;), the process extracts pertinent information fragments (references, &lt;strong&gt;R&lt;/strong&gt;) from the articles using a tailored extraction prompt. This prompt emphasizes grounding answers in the source material to ensure their reliability and traceability. Applying specific constraints and rules during the extraction phase ensures that the references are directly relevant to and supportive of the answers, resulting in more precise and comprehensive QRA triples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimizing answers and references&lt;/strong&gt; involves refining answer &lt;strong&gt;A&lt;/strong&gt; to ensure accuracy and alignment with the provided references &lt;strong&gt;R&lt;/strong&gt;. If &lt;strong&gt;R&lt;/strong&gt; contains information not present in &lt;strong&gt;A&lt;/strong&gt;, the answers are supplemented accordingly. Conversely, if &lt;strong&gt;A&lt;/strong&gt; includes content not found in &lt;strong&gt;R&lt;/strong&gt;, the article is checked for overlooked references. If additional references are found, they are added to &lt;strong&gt;R&lt;/strong&gt; while keeping &lt;strong&gt;A&lt;/strong&gt; unchanged. If no corresponding references are found, irrelevant content is removed from &lt;strong&gt;A&lt;/strong&gt;. This approach helps address hallucinations in the answer-generation process, ensuring that the final answers are accurate and well-supported by &lt;strong&gt;R&lt;/strong&gt;.&lt;/p&gt;
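&lt;p&gt;Treating answers, references, and the article as sets of facts, the optimization rules above can be sketched as follows. The fact strings and the exact-match comparison are simplifications; the paper performs this reconciliation with LLM prompts over free text.&lt;/p&gt;

```python
def optimize(answer_facts, reference_facts, article_facts):
    # Supplement: facts present in R but missing from A are added to A.
    answer = list(answer_facts)
    refs = list(reference_facts)
    for fact in refs:
        if fact not in answer:
            answer.append(fact)
    # For facts in A but not in R, re-check the article for an overlooked
    # reference; if one is found, add it to R, otherwise drop the fact
    # from A as unsupported content (a likely hallucination).
    kept = []
    for fact in answer:
        if fact in refs:
            kept.append(fact)
        elif fact in article_facts:
            refs.append(fact)
            kept.append(fact)
    return kept, refs

A = ["revenue grew 10%", "the CEO resigned"]   # second claim is unsupported
R = ["revenue grew 10%", "HQ is in Austin"]    # second fact missing from A
article = ["revenue grew 10%", "HQ is in Austin"]
new_A, new_R = optimize(A, R, article)
print(new_A)
```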

&lt;p&gt;&lt;strong&gt;Generating key points&lt;/strong&gt; focuses on identifying critical information in answers rather than just correctness or keyword matching. Key points are extracted from standard answers (&lt;strong&gt;A&lt;/strong&gt;) for each question (&lt;strong&gt;Q&lt;/strong&gt;) using a predefined prompt with the GPT-4o model. This prompt, supporting both Chinese and English, uses in-context learning and examples to guide key point extraction across various domains and question types, including unanswerable ones. Typically, 3-5 key points are distilled from responses, capturing essential facts, relevant inferences, and conclusions. This method ensures that the evaluation is based on relevant and precise information, enhancing the reliability of subsequent metrics.&lt;/p&gt;
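&lt;p&gt;A toy version of key-point-based scoring is sketched below. The sentence split and substring matching are stand-ins: the paper extracts key points with a GPT-4o prompt and judges coverage with an LLM, not string comparison.&lt;/p&gt;

```python
def extract_key_points(standard_answer):
    # Stand-in for the GPT-4o key-point prompt, which distills 3-5 key
    # facts, inferences, and conclusions; here we just split sentences.
    points = [s.strip() for s in standard_answer.split(".") if s.strip()]
    return points[:5]

def key_point_coverage(model_answer, key_points):
    # Fraction of key points found in the model answer. Real evaluation
    # uses an LLM judge rather than naive substring matching.
    hit = sum(1 for p in key_points if p.lower() in model_answer.lower())
    return hit / max(len(key_points), 1)

gold = "Revenue grew 10 percent. The growth was driven by cloud services."
points = extract_key_points(gold)
score = key_point_coverage("Last year revenue grew 10 percent.", points)
print(score)  # 0.5: one of the two key points is covered
```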

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsy1obnxceg7nmvgd326.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsy1obnxceg7nmvgd326.png" alt="Image description" width="800" height="173"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality assessment of RAGEval
&lt;/h2&gt;

&lt;p&gt;In this section, the authors of the &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;paper&lt;/a&gt; "RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework" introduce the human verification process used to assess the quality of the generated dataset and the evaluation within the RAGEval framework. This assessment is divided into three main tasks: evaluating the quality of the QRA triples, evaluating the quality of the generated documents, and validating the automated evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QRA quality assessment&lt;/strong&gt; involves having annotators evaluate the correctness of the Question-Reference-Answer (QRA) triples generated under various configurations. Annotators score each QRA on a scale from 5 (completely correct and fluent) down to 0 (irrelevant or completely incorrect).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3pwdk3zauzbhyy0uncf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3pwdk3zauzbhyy0uncf.png" alt="Image description" width="578" height="280"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For annotation, 10 samples per question type were randomly selected for each language and domain, totaling 420 samples. Annotators were provided with the document, question, question type, generated response, and references.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizvl3x5gbqg02gv8li04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizvl3x5gbqg02gv8li04.png" alt="Image description" width="456" height="166"&gt;&lt;/a&gt;&lt;br&gt;
CN- Chinese, EN- English&lt;/p&gt;

&lt;p&gt;Results show that QRA quality scores are consistently high across domains, with only slight variations between languages. The combined proportion of scores 4 and 5 is approximately 95% or higher across all domains, indicating that the approach upholds a high standard of accuracy and fluency in the QRAs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generated document quality assessment&lt;/strong&gt; involves comparing documents generated using RAGEval with those produced by baseline methods, including zero-shot and one-shot prompting. For each domain (finance, legal, and medical), 19 or 20 documents were randomly selected and grouped with two baseline documents for comparison. Annotators ranked the documents based on clarity, safety, richness, and conformity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vk87zrmqsyjxvg18g7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vk87zrmqsyjxvg18g7b.png" alt="Image description" width="566" height="232"&gt;&lt;/a&gt;&lt;br&gt;
Document quality comparison criteria. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results indicate that RAGEval consistently outperforms both baseline methods, particularly excelling in safety, clarity, and richness. For Chinese and English datasets, RAGEval ranked highest in over 85% of cases for richness, clarity, and safety, demonstrating its effectiveness in generating high-quality documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fcdihw83xamni6xmns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15fcdihw83xamni6xmns.png" alt="Image description" width="800" height="604"&gt;&lt;/a&gt;&lt;br&gt;
Document generation comparison by domain. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation of automated evaluation&lt;/strong&gt; involves comparing LLM-reported metrics for completeness, hallucination, and irrelevance with human assessments. Using the same 420 examples from the QRA quality assessment, human annotators evaluated answers from Baichuan-2-7B-chat, and these results were compared with the LLM metrics. Figure 6 shows that machine and human evaluations align closely, with absolute differences under 0.015, validating the reliability and consistency of the automated evaluation metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uujz6wjp2p46agnph5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uujz6wjp2p46agnph5x.png" alt="Image description" width="754" height="564"&gt;&lt;/a&gt;&lt;br&gt;
Automated metric validation results. Source: &lt;a href="https://arxiv.org/pdf/2408.01262" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2408.01262&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, RAGEval represents a significant advancement in evaluating Retrieval-Augmented Generation (RAG) systems by automating the creation of scenario-specific datasets that emphasize factual accuracy and domain relevance. This framework addresses the limitations of existing benchmarks, particularly in sectors requiring detailed and accurate information such as finance, healthcare, and legal fields. The human evaluation results demonstrate the robustness and effectiveness of RAGEval in generating content that is accurate, safe, and rich. Furthermore, the alignment between automated metrics and human judgment validates the reliability of the evaluation approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/uX3Dp" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/uX3Dp" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>openai</category>
    </item>
    <item>
      <title>Graph RAG</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Tue, 10 Sep 2024 04:00:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/graph-rag-5p7</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/graph-rag-5p7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This blog explores &lt;a href="https://arxiv.org/pdf/2404.16130" rel="noopener noreferrer"&gt;Microsoft's Graph-based Retrieval-Augmented Generation (Graph RAG)&lt;/a&gt; approach. While traditional RAG excels at retrieving specific information, it struggles with global queries, like identifying key themes in a dataset, which require query-focused summarization (QFS). Graph RAG combines the strengths of RAG and QFS by using entity knowledge graphs and community summaries to handle both broad questions and large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the challenges with RAG?
&lt;/h2&gt;

&lt;p&gt;The primary challenges with RAG are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global question handling&lt;/strong&gt;: RAG struggles with global questions that require understanding the entire text corpus, such as "What are the main themes in the dataset?". These questions require query-focused summarization (QFS), which differs from RAG's typical focus on retrieving and generating content from specific, localized text regions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;QFS&lt;/strong&gt;: Traditional QFS methods, which summarize content based on specific queries, do not scale well to the large volumes of text typically indexed by RAG systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context window limitations&lt;/strong&gt;: While modern Large Language Models (LLMs) like GPT, Llama, and Gemini can perform in-context learning to summarize content, they are limited by the size of their context windows. When dealing with large text corpora, crucial information can be "lost in the middle" of these longer contexts, making it challenging to provide comprehensive summaries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inadequacy of direct retrieval&lt;/strong&gt;: For QFS tasks, directly retrieving text chunks in a naive RAG system is often inadequate. RAG's standard approach is not well-suited for summarizing entire datasets or handling global questions, which requires a more sophisticated indexing and retrieval mechanism tailored to global summarization needs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Graph RAG?
&lt;/h2&gt;

&lt;p&gt;Graph RAG is an advanced question-answering method that combines RAG with a graph-based text index. It builds an entity knowledge graph from source documents and generates summaries for related entities. For a given question, partial responses from these summaries are combined into a comprehensive final answer. This approach scales well with large datasets and provides more comprehensive answers than traditional RAG methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph RAG approach &amp;amp; pipeline
&lt;/h3&gt;

&lt;p&gt;Source: &lt;a href="https://arxiv.org/pdf/2404.16130" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2404.16130&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Graph RAG pipeline uses an LLM-derived graph index to organize source document text into nodes (like entities), edges (relationships), and covariates (claims). These elements are detected, extracted, and summarized using LLM prompts customized for the dataset. The graph index is partitioned into groups of related elements through community detection. Summaries for each group are generated in parallel both during indexing and when a query is made. To answer a query, a final query-focused summarization is performed over all relevant community summaries, producing a comprehensive "global answer."&lt;/p&gt;
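&lt;p&gt;The query-time map-reduce over community summaries can be sketched as below. The &lt;code&gt;fake_llm&lt;/code&gt; function and the example summaries are stand-ins for illustration; Graph RAG issues real LLM calls for each partial answer and for the final combination.&lt;/p&gt;

```python
def fake_llm(prompt):
    # Stand-in for the LLM calls in the Graph RAG pipeline.
    return "summary: " + prompt[:40]

def graph_rag_answer(community_summaries, query):
    # Map: each relevant community summary yields a partial answer.
    partials = [fake_llm("Answer {} using: {}".format(query, s))
                for s in community_summaries]
    # Reduce: combine the partial answers into one global answer.
    return fake_llm("Combine into a global answer: " + " | ".join(partials))

summaries = ["Community about renewable energy firms",
             "Community about chip makers"]
answer = graph_rag_answer(summaries, "What are the main themes?")
print(answer)
```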

&lt;p&gt;&lt;strong&gt;Step 1: source documents → text chunks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this stage, texts from source documents are split into chunks for processing. Each chunk is then passed to LLM prompts to extract elements for a graph index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4teahm6fehmf8mi9jva3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4teahm6fehmf8mi9jva3.png" alt="Image description" width="800" height="401"&gt;&lt;/a&gt;&lt;br&gt;
While longer chunks reduce the number of LLM calls needed to process the entire document because more text is handled during each call, they can impair the LLM’s ability to recall details due to an extended context window. For example, in the HotPotQA dataset, a 600-token chunk extracted nearly twice as many entity references as a 2400-token chunk. Thus, balancing recall and precision is crucial.&lt;/p&gt;
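&lt;p&gt;A minimal chunking helper makes the trade-off concrete: fewer, longer chunks mean fewer LLM calls, while shorter chunks preserve recall. The token list here is a stand-in for a real tokenizer's output.&lt;/p&gt;

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    # Split a token sequence into fixed-size chunks. Smaller chunks cost
    # more LLM calls but recall more entity references (as in the
    # 600- vs 2400-token HotPotQA comparison above).
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

tokens = list(range(2400))               # stand-in for a tokenized document
print(len(chunk_tokens(tokens, 600)))    # 4 chunks of 600 tokens
print(len(chunk_tokens(tokens, 2400)))   # 1 chunk, i.e. a single LLM call
```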

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lkrcsebmh1s806y7vmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lkrcsebmh1s806y7vmn.png" alt="Image description" width="800" height="272"&gt;&lt;/a&gt;&lt;br&gt;
The number of entity references detected in the HotPotQA dataset varies with chunk size and the number of gleanings when employing a generic entity extraction prompt with GPT-4-turbo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: text chunks → element instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, graph nodes and edges are identified and extracted from each chunk of source text using a multipart LLM prompt. The prompt first extracts all entities, including their name, type, and description, and then identifies relationships between these entities, detailing the source, target, and nature of each relationship. Both entities and relationships are output as a list of delimited tuples.&lt;/p&gt;

&lt;p&gt;The prompt can be tailored to the document corpus by including domain-specific few-shot examples, improving extraction accuracy for specialized fields like science or medicine. Additionally, a secondary prompt extracts covariates, such as claims related to the entities, including details like subject, object, type, description, source text span, and dates.&lt;/p&gt;
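&lt;p&gt;Parsing the delimited tuples back into graph elements might look like the sketch below. The "|"-delimited layout, field order, and example records are assumptions for illustration; the real extraction prompt defines its own delimiters and record format.&lt;/p&gt;

```python
def parse_elements(llm_output):
    # Parse delimited tuples emitted by the extraction prompt into
    # entity and relationship records.
    entities, relationships = [], []
    for line in llm_output.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if parts[0] == "entity":
            entities.append({"name": parts[1], "type": parts[2],
                             "description": parts[3]})
        elif parts[0] == "relationship":
            relationships.append({"source": parts[1], "target": parts[2],
                                  "description": parts[3]})
    return entities, relationships

raw = """entity|NeoChem|ORGANIZATION|A chemicals company based in Berlin
entity|Berlin|GEO|Capital of Germany
relationship|NeoChem|Berlin|NeoChem is headquartered in Berlin"""
ents, rels = parse_elements(raw)
print(len(ents), len(rels))  # 2 1
```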

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F400i1k2rlmyykkudw36o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F400i1k2rlmyykkudw36o.png" alt="Image description" width="794" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: element instances → element summaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, an LLM is used to perform abstractive summarization, creating meaningful summaries of entities, relationships, and claims from source texts. The LLM extracts these elements and summarizes them into single blocks of descriptive text for each graph element: entity nodes, relationship edges, and claim covariates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r38vxqeu7287710qps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6r38vxqeu7287710qps.png" alt="Image description" width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, a challenge here is that the LLM might extract and describe the same entity in different formats, potentially leading to duplicate nodes in the entity graph. To address this, the process includes a subsequent step where related groups of entities (or "communities") are detected and summarized together. This helps to consolidate variations and ensures that the entity graph remains consistent, as the LLM can recognize and connect different names or descriptions of the same entity. Overall, the approach leverages LLM capabilities to handle variations and produce a comprehensive, coherent graph structure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Imagine the LLM is tasked with extracting information about a well-known historical figure, "Albert Einstein," from various texts. The LLM might encounter different ways of referring to Einstein, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;"Albert Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Dr. Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"The physicist Albert Einstein"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During the extraction process, the LLM may generate separate entity nodes for each of these variations, resulting in duplicate entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Node 1: Albert Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 2: Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 3: Dr. Einstein&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node 4: The physicist Albert Einstein&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These variations could lead to multiple nodes for the same person in the entity graph, causing redundancy and inconsistencies.&lt;/p&gt;

&lt;p&gt;In the subsequent step, the LLM summarizes and consolidates these related variations into a single node by detecting and linking all references to the same entity. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Node: Albert Einstein (consolidates all variations: Einstein, Dr. Einstein, The physicist Albert Einstein)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By summarizing all related variations into a single descriptive block, the process ensures that the entity graph is accurate and does not contain duplicate nodes for the same individual.&lt;/p&gt;
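&lt;p&gt;The consolidation illustrated above can be sketched in a few lines of Python. This is a toy illustration with a hand-written alias map; in practice the LLM itself recognizes and merges variant mentions during summarization, so the &lt;code&gt;aliases&lt;/code&gt; dictionary and helper names here are purely hypothetical.&lt;/p&gt;

```python
def canonicalize(mention, aliases):
    """Map a raw mention to its canonical entity name, if known."""
    return aliases.get(mention.lower().strip(), mention)

def consolidate(mentions, aliases):
    """Group raw mentions under one canonical node each."""
    nodes = {}
    for m in mentions:
        nodes.setdefault(canonicalize(m, aliases), []).append(m)
    return nodes

aliases = {
    "albert einstein": "Albert Einstein",
    "einstein": "Albert Einstein",
    "dr. einstein": "Albert Einstein",
    "the physicist albert einstein": "Albert Einstein",
}
mentions = ["Albert Einstein", "Einstein", "Dr. Einstein",
            "The physicist Albert Einstein"]
nodes = consolidate(mentions, aliases)
print(nodes)  # a single node keyed "Albert Einstein" with four variations
```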

&lt;p&gt;&lt;strong&gt;Step 4: element summaries → graph communities&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, the index is modeled as a weighted undirected graph where entity nodes are linked by relationship edges, with edge weights reflecting the frequency of relationships. The graph is then partitioned into communities using the &lt;a href="https://arxiv.org/pdf/1810.08473" rel="noopener noreferrer"&gt;Leiden algorithm&lt;/a&gt;, which efficiently identifies hierarchical community structures. This hierarchical division of the graph enables targeted and comprehensive global summarization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknftoil5g39dwxa8r8rg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknftoil5g39dwxa8r8rg.png" alt="Image description" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡The Leiden algorithm is a community detection method in network analysis that improves upon the Louvain algorithm by addressing its shortcomings in optimizing modularity, which measures the quality of community structures. It operates through a multi-level approach, incrementally refining clusters from fine to coarse, and guarantees that all communities are well-connected, unlike Louvain's potential for disconnected clusters. The algorithm iteratively partitions and refines clusters until no further improvement is possible, ensuring higher modularity. Despite its enhanced accuracy, the Leiden algorithm remains computationally efficient and can be applied to both weighted and unweighted networks.&lt;/p&gt;
&lt;/blockquote&gt;
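&lt;p&gt;To make the community step concrete, here is a deliberately crude stand-in that only splits the graph into connected components. The Leiden algorithm (available via packages such as &lt;code&gt;leidenalg&lt;/code&gt;) goes much further, hierarchically optimizing modularity within each component; this sketch only illustrates the data flow from a weighted edge list to node groups.&lt;/p&gt;

```python
from collections import defaultdict

def communities(edges):
    """Group the nodes of a weighted, undirected edge list into connected
    components -- the coarsest possible notion of a 'community'."""
    adj = defaultdict(set)
    for u, v, _weight in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, out = set(), []
    for start in sorted(adj):  # deterministic iteration order
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        out.append(comp)
    return out

edges = [("Einstein", "Relativity", 3), ("Einstein", "Princeton", 1),
         ("Curie", "Radium", 2)]
comps = communities(edges)
print(comps)  # two separate components
```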

&lt;p&gt;&lt;strong&gt;Step 5: graph communities → community summaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, report-like summaries are created for each community detected in the graph. These summaries help understand the global structure and details of the dataset and are useful for answering global queries. Here's how it's done:&lt;/p&gt;

&lt;p&gt;Leaf-level communities: Summarize the details of the smallest communities first, prioritizing descriptions of nodes, edges, and related information. These summaries are added to the LLM context window until the token limit is reached.&lt;/p&gt;

&lt;p&gt;Higher-level communities: If there’s room in the context window, summarize all elements of these larger communities. If not, prioritize summaries of sub-communities over detailed element descriptions, fitting them into the context window by substituting longer descriptions with shorter summaries.&lt;/p&gt;

&lt;p&gt;This approach ensures that both detailed and high-level summaries are generated efficiently, fitting within the LLM's token limits while maintaining comprehensive coverage of the dataset.&lt;/p&gt;
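&lt;p&gt;The prioritization described above amounts to a budget-packing loop. The sketch below uses a crude whitespace token count and hypothetical helper names, not the paper's implementation: for each item, take the detailed description while it fits, otherwise fall back to the shorter summary.&lt;/p&gt;

```python
def rough_tokens(text):
    """Crude whitespace token count; real systems use a model tokenizer."""
    return len(text.split())

def pack_context(items, budget):
    """items: (detailed, short) description pairs in priority order.
    Prefer the detailed version while it fits, else the short one."""
    picked, used = [], 0
    for detailed, short in items:
        for candidate in (detailed, short):
            cost = rough_tokens(candidate)
            if used + cost <= budget:
                picked.append(candidate)
                used += cost
                break
    return picked

items = [("a b c d", "a b"), ("e f g h", "e f"), ("i j k l", "i j")]
print(pack_context(items, budget=10))  # ['a b c d', 'e f g h', 'i j']
```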

&lt;p&gt;&lt;strong&gt;Step 6: community summaries → community answers → global answer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Given a user query, the community summaries created in the previous step are used to produce a final answer through a multi-stage process. The hierarchical structure of the communities allows the system to select the most appropriate level of detail for answering different types of questions. The process for generating a global answer at a specific community level is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prepare community summaries: The community summaries are shuffled and split into chunks of a predefined token size. This approach helps ensure that relevant information is spread across multiple chunks, reducing the risk of losing important details in a single context window.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate intermediate answers: For each chunk, the LLM generates intermediate answers in parallel. Along with the answer, the LLM assigns a helpfulness score between 0-100, indicating how well the answer addresses the user’s query. Any answers scoring 0 are discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compile the global answer: The remaining intermediate answers are sorted in descending order based on their helpfulness scores. These answers are then added to a new context window, one by one, until the token limit is reached. This final compilation is used to generate the global answer provided to the user.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
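&lt;p&gt;The three stages can be sketched end-to-end with a stubbed-out LLM call. &lt;code&gt;fake_llm_answer&lt;/code&gt; and its scoring rule are placeholders; in the paper the LLM itself returns both the intermediate answer and the 0-100 helpfulness score.&lt;/p&gt;

```python
def fake_llm_answer(chunk, query):
    """Stand-in for an LLM call: returns (answer, helpfulness score 0-100)."""
    score = 100 if query.lower() in chunk.lower() else 0
    return (f"Based on: {chunk}", score)

def global_answer(chunks, query, budget):
    # Map: generate intermediate answers (parallel in practice).
    scored = [fake_llm_answer(c, query) for c in chunks]
    # Filter out zero-scoring answers, sort best-first.
    scored = sorted((s for s in scored if s[1] > 0),
                    key=lambda s: s[1], reverse=True)
    # Reduce: fill a fresh context window up to the token budget.
    context, used = [], 0
    for answer, _score in scored:
        cost = len(answer.split())
        if used + cost > budget:
            break
        context.append(answer)
        used += cost
    return context

ctx = global_answer(["Einstein developed relativity",
                     "Weather today is sunny"], "relativity", budget=20)
print(ctx)  # only the relevant chunk's answer survives
```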

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bo4jp033yb207i6tr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9bo4jp033yb207i6tr7.png" alt="Image description" width="800" height="243"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Performance evaluation
&lt;/h2&gt;

&lt;p&gt;The authors evaluated the performance of both Graph-based Retrieval-Augmented Generation (Graph RAG) and standard RAG approaches using two datasets, each containing approximately one million tokens, equivalent to around 10 novels of text. The evaluation was conducted across four different metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datasets
&lt;/h3&gt;

&lt;p&gt;The datasets were chosen to reflect the types of text corpora users might typically encounter in real-world applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Podcast transcripts&lt;/strong&gt;: This dataset includes transcripts from the "&lt;a href="https://www.microsoft.com/en-us/behind-the-tech" rel="noopener noreferrer"&gt;Behind the Tech&lt;/a&gt;" podcast, where Kevin Scott, Microsoft's CTO, converses with other technology leaders. The dataset comprises 1,669 text chunks, each containing 600 tokens, with a 100-token overlap between chunks, resulting in approximately 1 million tokens overall.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;News articles&lt;/strong&gt;: The second dataset is a benchmark collection of news articles published between September 2013 and December 2023. It covers a range of categories, including entertainment, business, sports, technology, health, and science. This dataset consists of 3,197 text chunks, each containing 600 tokens with a 100-token overlap, totaling around 1.7 million tokens.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conditions
&lt;/h3&gt;

&lt;p&gt;The authors compared six different evaluation conditions to assess the performance of the Graph-based RAG system. These conditions included four levels of graph communities (C0, C1, C2, C3), a text summarization method (TS), and a naive "semantic search" RAG approach (SS). Each condition represented a distinct method for creating the context window used to answer queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C0&lt;/strong&gt;: Utilized summaries from root-level communities, which were the fewest in number.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C1&lt;/strong&gt;: Employed high-level summaries derived from sub-communities of C0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C2&lt;/strong&gt;: Applied intermediate-level summaries from sub-communities of C1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;C3&lt;/strong&gt;: Low-level summaries from sub-communities of C2 were used, which were the most numerous.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TS&lt;/strong&gt;: Implemented a map-reduce summarization method directly on the source texts, shuffling and chunking them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SS&lt;/strong&gt;: Employed a naive RAG approach where text chunks were retrieved and added to the context window until the token limit was reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All conditions used the same context window size and prompts, with differences only in how the context was constructed. The graph index supporting the C0-C3 conditions was generated using prompts designed for entity and relationship extraction, with modifications made to align with the specific domain of the data. The indexing process used a 600-token context window, with varying numbers of "gleanings" (passes over the text) depending on the dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡Map-reduce summarization condenses large texts by splitting them into smaller chunks, summarizing each chunk independently ("map" phase), and then combining these summaries into a cohesive final summary ("reduce" phase). This technique efficiently handles large datasets, ensuring the final summary captures the key points from the entire text.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;The evaluation employs an LLM to conduct head-to-head comparisons of generated answers based on specific metrics. Given the multi-stage nature of the Graph RAG mechanism, the multiple conditions the authors wanted to compare, and the lack of gold-standard answers for activity-based sensemaking questions, the authors adopted a head-to-head comparison approach using an LLM evaluator. This approach assesses system performance through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensiveness:&lt;/strong&gt; This metric measures the extent to which the answer provides detailed coverage of all aspects of the question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diversity:&lt;/strong&gt; This evaluates the richness and variety of perspectives and insights presented in the answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Empowerment:&lt;/strong&gt; This assesses how effectively the answer aids the reader in understanding the topic and making informed judgments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Directness:&lt;/strong&gt; This measures the specificity and clarity with which the answer addresses the question.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb8hhq2vpbopuu37zrc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb8hhq2vpbopuu37zrc6.png" alt="Image description" width="800" height="1096"&gt;&lt;/a&gt;    &lt;/p&gt;

&lt;p&gt;Example question for the News article dataset, with generated answers from Graph RAG (C2) and Naive RAG, as well as LLM-generated assessments&lt;/p&gt;

&lt;p&gt;Since directness often conflicts with comprehensiveness and diversity, it is unlikely for any method to excel across all metrics. The LLM evaluator compares pairs of answers based on these metrics, determining a winner or a tie if differences are negligible. Each comparison is repeated five times to account for the stochastic nature of LLMs, with mean scores used for final assessments.&lt;/p&gt;
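&lt;p&gt;The repeat-and-average protocol is straightforward to sketch. The &lt;code&gt;judge&lt;/code&gt; function below is a hypothetical stand-in for the LLM evaluator; the point is only that each pairwise comparison is run several times and the mean decides the winner.&lt;/p&gt;

```python
import random
from statistics import mean

def judge(answer_a, answer_b, metric, rng):
    """Stand-in for the LLM judge: 1 if answer A wins on the metric, else 0."""
    return rng.choice([1, 1, 1, 0])  # pretend A usually wins

def head_to_head(a, b, metric, runs=5, seed=0):
    rng = random.Random(seed)  # seeded so this sketch is reproducible
    return mean(judge(a, b, metric, rng) for _ in range(runs))

win_rate = head_to_head("graph rag answer", "naive rag answer",
                        "comprehensiveness")
print(win_rate)  # mean of five judgments; above 0.5 means A wins on average
```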

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4oimvfy8lr99gyjjprg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4oimvfy8lr99gyjjprg.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;br&gt;
Head-to-head win rate percentages of (row condition) over (column condition) across two datasets, four metrics, and 125 questions per comparison (each repeated five times and averaged). The overall winner per dataset and metric is shown in bold.&lt;/p&gt;

&lt;p&gt;The indexing process created graphs with 8,564 nodes and 20,691 edges for the Podcast dataset and 15,754 nodes and 19,520 edges for the News dataset. Graph RAG approaches consistently outperformed the naive RAG (SS) method in both comprehensiveness and diversity across datasets, with win rates of 72-83% for Podcasts and 72-80% for News in comprehensiveness, and 75-82% and 62-71% for diversity, respectively. Community summaries provided modest improvements in comprehensiveness and diversity compared to source texts. Root-level Graph RAG offers a highly efficient method for the kind of iterative question answering that characterizes sensemaking activity, while retaining advantages in comprehensiveness (72% win rate) and diversity (62% win rate) over naive RAG. Empowerment results were mixed, and naive RAG outperformed in directness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Graph RAG advances traditional RAG by effectively addressing complex global queries that require comprehensive summarization of large datasets. By combining knowledge graph generation with query-focused summarization, Graph RAG provides detailed and nuanced answers, outperforming naive RAG in both comprehensiveness and diversity.&lt;/p&gt;

&lt;p&gt;This global approach integrates RAG, QFS, and entity-based graph indexing to support sensemaking across entire text corpora. Initial evaluations show significant improvements over naive RAG and competitive performance against other global methods like map-reduce summarization. Graph RAG improves upon naive RAG in scenarios requiring complex reasoning, high factual accuracy, and deep data understanding, such as financial analysis, legal review, and life sciences. However, naive RAG may be more efficient for straightforward queries or when speed is key. In short, Graph RAG is best for complex tasks, while naive RAG suffices for simpler ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/AWU4G" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/AWU4G" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Understanding RAG (Part 5): Recommendations and wrap-up</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Mon, 09 Sep 2024 10:34:11 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-5-recommendations-and-wrap-up-418o</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-5-recommendations-and-wrap-up-418o</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In the five-part series "&lt;a href="https://blog.getmaxim.ai/understanding-rag-part-1-rag-overview-2/" rel="noopener noreferrer"&gt;Understanding RAG&lt;/a&gt;," we began by explaining the foundational Retrieval-Augmented Generation (RAG) framework and progressively explored advanced techniques to refine each component. In Part 1, we provided an overview of the RAG framework. Subsequent parts offered an in-depth analysis of the three main components: indexing, retrieval, and generation. In this final blog, we will discuss the findings from the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," in which the authors undertook a comprehensive study to identify and evaluate the most effective techniques for optimizing RAG systems. To provide a clear and comprehensive overview of the advanced RAG framework and its components, we include an illustrative image below. This visual representation encapsulates the various components and techniques designed to enhance the performance of the RAG system. We will discuss each component and the associated techniques that contribute to optimizing the system's overall performance.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzmkx6uejbyohydv3wpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzmkx6uejbyohydv3wpx.png" alt="Image description" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing component
&lt;/h3&gt;

&lt;p&gt;The indexing component consists of chunking documents and embedding these chunks to store the embeddings in a vector database. In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-2-rag-retrieval/" rel="noopener noreferrer"&gt;earlier parts&lt;/a&gt; of the "Understanding RAG" series, we explored advanced chunking techniques, including sliding window chunking and small-to-big chunking, and highlighted the importance of selecting the optimal chunk size. Larger chunks offer more context but can increase processing time, while smaller chunks enhance recall but may provide insufficient context.&lt;/p&gt;

&lt;p&gt;Equally important is the choice of an effective embedding model. Embeddings are crucial as they deliver compact, semantically meaningful representations of words and entities. The quality of these embeddings has a significant impact on the performance of retrieval and generation processes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendation for document chunking
&lt;/h4&gt;

&lt;p&gt;In &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," the authors evaluated various chunking techniques using the &lt;a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha" rel="noopener noreferrer"&gt;zephyr-7b-alpha&lt;/a&gt; and &lt;a href="https://platform.openai.com/docs/models" rel="noopener noreferrer"&gt;gpt-3.5-turbo&lt;/a&gt; models for generation and evaluation. The chunk overlap was set to 20 tokens, and the first 60 pages of the document &lt;a href="https://s27.q4cdn.com/263799617/files/doc_financials/2021/AR/Lyft-Annual-Report-2021.pdf" rel="noopener noreferrer"&gt;lyft_2021&lt;/a&gt; were used as the corpus. Additionally, the authors prompted LLMs to generate approximately 170 queries based on the chosen corpus to serve as input queries.&lt;/p&gt;

&lt;p&gt;The study suggests that chunk sizes of &lt;strong&gt;256&lt;/strong&gt; and &lt;strong&gt;512&lt;/strong&gt; tokens offer the best balance between providing sufficient context and maintaining high faithfulness and relevancy. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymbgid14a1r7m0hbimf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feymbgid14a1r7m0hbimf.png" alt="Image description" width="638" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimal chunk sizes:&lt;/strong&gt; 256 and 512 tokens provide the best balance of high faithfulness and relevancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Larger chunks (2048 tokens):&lt;/strong&gt; Offer more context but at the cost of slightly lower faithfulness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller chunks (128 tokens):&lt;/strong&gt; Improve retrieval recall but may lack sufficient context, leading to slightly lower faithfulness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Additionally, it highlights that using advanced chunking techniques, such as &lt;strong&gt;sliding window chunking&lt;/strong&gt;, further optimizes these benefits.&lt;/p&gt;
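&lt;p&gt;A minimal sliding-window chunker looks like the sketch below. It assumes whitespace tokenization as a stand-in for a real model tokenizer; the 512-token size and 20-token overlap mirror the study's setup.&lt;/p&gt;

```python
def sliding_window_chunks(text, size=512, overlap=20):
    """Split whitespace tokens into overlapping chunks of `size` tokens."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1500))
chunks = sliding_window_chunks(doc, size=512, overlap=20)
print(len(chunks))  # 4 chunks; adjacent chunks share 20 tokens
```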

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0omqcp8x1ul0e13hhxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0omqcp8x1ul0e13hhxx.png" alt="Image description" width="654" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt; is the extent to which the generated output accurately reflects the information from the original document. &lt;strong&gt;Relevancy&lt;/strong&gt; is the degree to which the generated content is pertinent and useful in relation to the query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Recommendation for embedding model
&lt;/h4&gt;

&lt;p&gt;Choosing the right embedding model is equally important for effective semantic matching of queries and chunk blocks. To select the appropriate open-source embedding model, the authors conducted another experiment using the evaluation module of &lt;a href="https://github.com/FlagOpen/FlagEmbedding" rel="noopener noreferrer"&gt;FlagEmbedding&lt;/a&gt;, which uses the dataset &lt;a href="https://huggingface.co/datasets/namespace-Pt/msmarco" rel="noopener noreferrer"&gt;namespace-Pt/msmarco&lt;/a&gt; for queries and the dataset &lt;a href="https://huggingface.co/datasets/namespace-Pt/msmarco-corpus" rel="noopener noreferrer"&gt;namespace-Pt/msmarco-corpus&lt;/a&gt; for the corpus; metrics like RR and MRR were used for evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw14x4865hxmq3maibwdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw14x4865hxmq3maibwdw.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;RR&lt;/strong&gt; (Reciprocal Rank) is the rank of the first relevant result in a single query. It is the inverse of the rank of this relevant result. &lt;strong&gt;MRR&lt;/strong&gt; (Mean Reciprocal Rank) is the average of the reciprocal ranks of the first relevant item across a set of queries. It measures how well the system ranks the first relevant result.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqnzvugy110zzuv65jne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqnzvugy110zzuv65jne.png" alt="Image description" width="542" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q is the total number of queries. RR_i​ is the reciprocal rank of the first relevant result for the i-th query.&lt;/p&gt;
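&lt;p&gt;Putting the formula into code, here is a minimal MRR implementation. Rank counting starts at 1, and a query with no relevant result retrieved contributes a reciprocal rank of 0.&lt;/p&gt;

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """runs: list of (ranked result ids, set of relevant ids), one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant at rank 2 -> RR = 0.5
    (["d2", "d4", "d9"], {"d2"}),  # first relevant at rank 1 -> RR = 1.0
]
print(mrr(runs))  # (0.5 + 1.0) / 2 = 0.75
```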

&lt;p&gt;In the study, the authors selected &lt;strong&gt;&lt;a href="https://huggingface.co/BAAI/llm-embedder" rel="noopener noreferrer"&gt;LLM-Embedder&lt;/a&gt;&lt;/strong&gt; as the embedding model due to its ability to deliver results comparable to the &lt;a href="https://huggingface.co/BAAI/bge-large-en" rel="noopener noreferrer"&gt;BAAI/bge-large-en &lt;/a&gt; model, while being three times smaller in size. This choice strikes a balance between performance and model size efficiency, making it a practical option. Following the discussion on embedding models, we will now turn our attention to the retrieval component. This segment will provide a brief overview of the retrieval process and present the recommendations made by the authors in &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation."&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval component
&lt;/h3&gt;

&lt;p&gt;The retrieval component of RAG can be further divided into two key stages. The first stage involves retrieving document chunks relevant to the query from the vector database. The second stage, reranking, focuses on further evaluating these retrieved documents to rank them based on their relevance, ultimately selecting only the most pertinent documents.&lt;/p&gt;

&lt;p&gt;In the earlier parts of this blog series, we explored various retrieval methods, including &lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;Query2doc&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE &lt;/a&gt;(Hypothetical Document Embeddings), and &lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;ToC&lt;/a&gt; (Tree of Clarifications). Additionally, in Part Four, we delved into reranking, discussing the different models that can be employed for this purpose.&lt;/p&gt;

&lt;p&gt;In this section, we will present the findings from the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," which investigates the most effective retrieval methods and reranking models. We will also provide a brief overview of the experimental setup used in the study.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendations for retrieval methods
&lt;/h4&gt;

&lt;p&gt;The performance of various retrieval methods was evaluated on the &lt;a href="https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020.html" rel="noopener noreferrer"&gt;TREC DL 2019 and 2020&lt;/a&gt; passage ranking datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfvj7ltm9j7iywae3uf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnfvj7ltm9j7iywae3uf9.png" alt="Image description" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results for different retrieval methods on TREC DL19/20. The best result for each method is made bold and the second is underlined.&lt;/p&gt;

&lt;p&gt;The results indicate that the combination of &lt;strong&gt;Hybrid Search with HyDE&lt;/strong&gt; and the LLM-Embedder achieved the highest scores. This approach combines sparse retrieval (BM25) and dense retrieval (original embedding), delivering notable performance with relatively low latency while maintaining efficiency. &lt;/p&gt;
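&lt;p&gt;A hedged sketch of the hybrid fusion idea: combine each document's sparse (BM25-style) score and dense (embedding-similarity) score with a weight alpha, then rank by the fused score. The score values and the alpha below are purely illustrative; real implementations typically normalize both score distributions before mixing.&lt;/p&gt;

```python
def hybrid_rank(sparse, dense, alpha=0.3, k=3):
    """Rank documents by a weighted sum of sparse and dense scores.
    sparse/dense: dicts mapping doc id -> score (illustrative scales)."""
    docs = set(sparse) | set(dense)
    fused = {d: alpha * sparse.get(d, 0.0) + dense.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:k]

sparse = {"doc1": 12.0, "doc2": 3.0}   # BM25-style scores
dense = {"doc2": 0.91, "doc3": 0.88}   # cosine similarities
print(hybrid_rank(sparse, dense, alpha=0.05, k=2))  # ['doc2', 'doc3']
```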

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;mAP&lt;/strong&gt; (Mean Average Precision) assesses the overall precision of retrieved documents, focusing on the system's ability to rank relevant documents higher across all queries. &lt;strong&gt;nDCG@10&lt;/strong&gt; (Normalized Discounted Cumulative Gain at rank 10) evaluates the quality of the top 10 results, emphasizing the importance of placing relevant documents at higher ranks. &lt;strong&gt;R@50&lt;/strong&gt; (Recall at 50) measures the proportion of relevant documents retrieved within the top 50 results, indicating the system's effectiveness in retrieving relevant information.&lt;/p&gt;
&lt;/blockquote&gt;
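&lt;p&gt;The nDCG@10 metric from the box above can be computed directly from its definition. This sketch assumes graded relevance labels listed in retrieved-rank order, with the ideal ordering obtained by sorting the same labels.&lt;/p&gt;

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over graded relevances in rank order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the top-k results over the DCG of the ideal ordering."""
    top = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 1], k=10))  # slightly below 1: one swap from ideal
```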

&lt;h4&gt;
  
  
  Recommendations for reranking models
&lt;/h4&gt;

&lt;p&gt;Similar experiments were conducted on the &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS MARCO&lt;/a&gt; Passage ranking dataset to evaluate several bi-encoder and cross-encoder reranking models, such as &lt;a href="https://arxiv.org/pdf/2003.06713" rel="noopener noreferrer"&gt;monoT5&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1910.14424" rel="noopener noreferrer"&gt;monoBERT&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1910.14424" rel="noopener noreferrer"&gt;RankLLaMA&lt;/a&gt;, and &lt;a href="https://arxiv.org/pdf/2108.08513" rel="noopener noreferrer"&gt;TILDEv2&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhic3m6f8dwgezmfbq1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhic3m6f8dwgezmfbq1r.png" alt="Image description" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of different reranking methods on the dev set of the MS MARCO Passage ranking dataset. For each query, the top-1000 candidate passages retrieved by BM25 are reranked. Latency is measured in seconds per query.&lt;/p&gt;

&lt;p&gt;The findings in &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation" recommend &lt;strong&gt;monoT5&lt;/strong&gt; as a well-rounded method that balances performance and efficiency. For those seeking the best possible performance, &lt;strong&gt;RankLLaMA&lt;/strong&gt; is the preferred choice, whereas &lt;strong&gt;TILDEv2&lt;/strong&gt; is recommended for its speed when working with a fixed collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generation component
&lt;/h3&gt;

&lt;p&gt;After the indexing and retrieval components comes the generation component of RAG systems. Here, the documents retrieved from the reranking stage are further summarized and ordered (based on their relevancy scores to the query) before being provided as input to the fine-tuned LLM for final answer generation.&lt;/p&gt;

&lt;p&gt;Techniques like document repacking, document summarization, and generator model fine-tuning are used to enhance the generation component. In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-4-2/" rel="noopener noreferrer"&gt;Part Four&lt;/a&gt; of the "Understanding RAG" blog series, we have discussed all the methods in detail. In this section, we will discuss the findings from &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation," where the authors tried to evaluate each process and find the most efficient approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  Recommendation for document repacking
&lt;/h4&gt;

&lt;p&gt;Document repacking, discussed in detail in &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-4-2/" rel="noopener noreferrer"&gt;Part Four&lt;/a&gt; of the "Understanding RAG" series, is a vital technique in the RAG workflow that enhances response generation. After reranking the top K documents by relevancy score, repacking optimizes the order in which they are presented to the LLM, helping it produce more accurate and relevant responses. The three primary repacking methods are the forward method, which arranges documents in descending order of relevancy; the reverse method, which arranges them in ascending order; and the sides method, which places the most relevant documents at both the beginning and end of the sequence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi7h1cjsys1065ati0ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsi7h1cjsys1065ati0ym.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;br&gt;
In the research paper "Searching for Best Practices in Retrieval-Augmented Generation," various repacking techniques were evaluated across several datasets, including Commonsense Reasoning, fact-checking, Open-Domain QA, MultiHop QA, and Medical QA. The study identified the "&lt;strong&gt;reverse&lt;/strong&gt;" method as the most effective repacking approach based on metrics such as accuracy (Acc), exact match (EM), and RAG scores across all tasks. The evaluation also included an assessment of average latency, measured in seconds per query, to determine efficiency. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Exact match (&lt;strong&gt;EM&lt;/strong&gt;) metric measures the percentage of generated output that exactly matches the ground truth answers. EM = (Number of exact matches / Total number of examples) x 100. Accuracy (&lt;strong&gt;Acc&lt;/strong&gt;) measures the proportion of correct outputs out of the total number of outputs. Accuracy = (Number of correct outputs / Total number of outputs) x 100.&lt;/p&gt;
&lt;/blockquote&gt;
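&lt;p&gt;As a concrete illustration, here is a minimal Python sketch of these two metrics as defined above. The containment predicate passed to &lt;code&gt;accuracy&lt;/code&gt; is a hypothetical stand-in for whatever task-specific correctness check an evaluation actually uses.&lt;/p&gt;

```python
def exact_match(predictions, references):
    """EM: percentage of generated outputs that exactly match the ground truth."""
    matches = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return 100.0 * matches / len(predictions)

def accuracy(predictions, references, is_correct):
    """Acc: percentage of outputs judged correct by a task-specific predicate."""
    correct = sum(1 for p, r in zip(predictions, references) if is_correct(p, r))
    return 100.0 * correct / len(predictions)

preds = ["Paris", "42", "a blue whale"]
refs = ["Paris", "41", "blue whale"]
print(exact_match(preds, refs))  # 33.33... (only "Paris" matches exactly)
# Illustrative correctness check: the reference is contained in the prediction.
print(accuracy(preds, refs, lambda p, r: r in p))  # 66.66...
```

&lt;p&gt;Note how a looser correctness predicate gives a higher score than strict exact match on the same outputs, which is why papers report both.&lt;/p&gt;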

&lt;h4&gt;
  
  
  Recommendation for document summarization
&lt;/h4&gt;

&lt;p&gt;Summarization in RAG aims to improve response relevance and efficiency by condensing retrieved documents and reducing redundancy before inputting them into the LLM. In Part Four of the "Understanding RAG" series, we covered various summarization techniques, including &lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;RECOMP&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/2310.06839" rel="noopener noreferrer"&gt;LongLLMLingua&lt;/a&gt;. This section will focus on identifying the most efficient technique based on &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;the paper&lt;/a&gt; "Searching for Best Practices in Retrieval-Augmented Generation." &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugpexp6amhbhqcfbowgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugpexp6amhbhqcfbowgm.png" alt="Image description" width="800" height="190"&gt;&lt;/a&gt;&lt;br&gt;
Results of the search for optimal RAG practices. The “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;p&gt;The paper "Searching for Best Practices in Retrieval-Augmented Generation" evaluated various summarization methods across three benchmark datasets: &lt;a href="https://ai.google.com/research/NaturalQuestions/download" rel="noopener noreferrer"&gt;NQ&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1705.03551" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, and &lt;a href="https://arxiv.org/abs/1809.09600" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;. &lt;strong&gt;RECOMP&lt;/strong&gt; emerged as the superior technique, demonstrating exceptional performance across metrics such as accuracy (Acc) and exact match (EM).&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation for Generator Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;The paper "Searching for Best Practices in Retrieval-Augmented Generation" investigates how fine-tuning affects the generator, particularly with relevant versus irrelevant contexts. The study employed different context compositions for training, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pairs of query-relevant documents (Dg)&lt;/li&gt;
&lt;li&gt;Combinations of relevant and randomly sampled documents (Dgr)&lt;/li&gt;
&lt;li&gt;Combinations of only randomly sampled documents (Dr)&lt;/li&gt;
&lt;li&gt;Contexts with two copies of a relevant document (Dgg)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this study, &lt;a href="https://huggingface.co/meta-llama/Llama-2-7b" rel="noopener noreferrer"&gt;Llama-2-7b&lt;/a&gt; was selected as the base model. The base LM generator without fine-tuning is referred to as M_b, and the model fine-tuned with different contexts is referred to as M_g, M_r, M_gr, and M_gg. The models were fine-tuned using several QA and reading comprehension datasets, including ASQA, HotpotQA, NarrativeQA, NQ, SQuAD, TriviaQA, and TruthfulQA.&lt;/p&gt;

&lt;p&gt;Following the training process, all trained models were evaluated on validation sets with D_g, D_r, D_gr, and D_∅, where D_∅ indicates inference without retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0efrx83rwyl9nkxpyx37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0efrx83rwyl9nkxpyx37.png" alt="Image description" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
Results of generator fine-tuning&lt;/p&gt;

&lt;p&gt;The results demonstrate that models trained with a mix of relevant and random documents (&lt;strong&gt;M_gr&lt;/strong&gt;) perform best when provided with either gold or mixed contexts. This finding suggests that incorporating both relevant and random contexts during training can enhance the generator's robustness to irrelevant information while ensuring effective utilization of relevant contexts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;coverage score&lt;/strong&gt; used as an evaluation metric measures how much of the content of the generated response overlaps with the ground-truth answer. It evaluates the proportion of ground-truth content that is covered by the generated output.&lt;/p&gt;
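&lt;p&gt;The paper does not publish its exact scoring code, but a simple token-level coverage metric along these lines can be sketched as follows; treat it as an illustrative approximation, not the authors' implementation.&lt;/p&gt;

```python
def coverage(generated, ground_truth):
    """Fraction of ground-truth tokens that appear in the generated answer."""
    gen_tokens = set(generated.lower().split())
    gt_tokens = ground_truth.lower().split()
    if not gt_tokens:
        return 0.0
    covered = sum(1 for t in gt_tokens if t in gen_tokens)
    return covered / len(gt_tokens)

print(coverage("the blue whale is the largest animal", "largest animal"))  # 1.0
print(coverage("the blue whale", "largest animal"))  # 0.0
```

&lt;p&gt;Because QA answers are typically short, even this coarse token-overlap view captures most of what the metric rewards.&lt;/p&gt;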




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In the final installment of the "Understanding RAG" series, we reviewed key findings from "Searching for Best Practices in Retrieval-Augmented Generation." While the study suggests effective strategies like chunk sizes of 256 and 512 tokens, LLM-Embedder for embedding, and specific reranking and generation methods, it's clear that the best approach depends on your specific use case. The right combination of techniques—retrieval methods, embedding strategies, or summarization techniques—should be tailored to meet your application's unique needs and goals, and you should run robust evaluations to figure out what works best for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/tkBnX" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/tkBnX" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Understanding RAG (Part 4): Optimizing the generation component</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Sat, 17 Aug 2024 10:33:36 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-4-optimizing-the-generation-component-4jen</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-4-optimizing-the-generation-component-4jen</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://blog.getmaxim.ai/understanding-rag-part-1-rag-overview-2/" rel="noopener noreferrer"&gt;part one&lt;/a&gt; of the "Understanding RAG" series, we outlined the three main components of RAG: indexing, retrieval, and generation. In the previous parts, we discussed techniques for improving indexing and retrieval. This blog will focus on methods to enhance the generation component of Retrieval-Augmented Generation (RAG). We will explore techniques like document repacking, summarization, and fine-tuning LLM models for generation, all of which are crucial for refining the information that feeds into the generation process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Document Repacking
&lt;/h2&gt;

&lt;p&gt;Document repacking is a technique used in the RAG workflow to enhance response generation performance. After the reranking stage, where the top K documents are selected based on their relevancy scores, document repacking optimizes the order in which these documents are presented to the LLM model. This rearrangement ensures that the LLM can generate more precise and relevant responses by focusing on the most pertinent information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8dk48ivlqeb1vhyajs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8dk48ivlqeb1vhyajs.png" alt="Image description" width="800" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Different Repacking Methods
&lt;/h3&gt;

&lt;p&gt;There are three primary repacking methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Forward Method&lt;/strong&gt;: In this approach, documents are repacked in descending order based on their relevancy scores. This means that the most relevant documents, as determined in the reranking phase, are placed at the beginning of the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reverse Method&lt;/strong&gt;: Conversely, the reverse method arranges documents in ascending order of their relevancy scores. Here, the least relevant documents are placed first, and the most relevant ones are at the end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sides Method&lt;/strong&gt;: The sides method strategically places the most relevant documents at both the head and tail of the input sequence, ensuring that critical information is prominently positioned for the LLM.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
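&lt;p&gt;The three methods above can be sketched in a few lines of Python. The interleaving used here for the "sides" method (alternating ranked documents between the head and the tail) is one possible realization of "most relevant at both ends," not necessarily the exact implementation used in the paper.&lt;/p&gt;

```python
def repack(docs_with_scores, method="reverse"):
    """Reorder (doc, relevancy_score) pairs before prompting the LLM."""
    # Rank by relevancy score, most relevant first.
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    if method == "forward":   # most relevant first
        return ranked
    if method == "reverse":   # most relevant last
        return ranked[::-1]
    if method == "sides":     # most relevant at both ends, least in the middle
        head, tail = [], []
        for i, doc in enumerate(ranked):
            (head if i % 2 == 0 else tail).append(doc)
        return head + tail[::-1]
    raise ValueError(method)

docs = [("d1", 0.9), ("d2", 0.7), ("d3", 0.5), ("d4", 0.3)]
print([d for d, _ in repack(docs, "reverse")])  # ['d4', 'd3', 'd2', 'd1']
print([d for d, _ in repack(docs, "sides")])    # ['d1', 'd3', 'd4', 'd2']
```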

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm7k5frx711x7aln00si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm7k5frx711x7aln00si.png" alt="Image description" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Repacking Method
&lt;/h3&gt;

&lt;p&gt;In the research paper &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors assessed the performance of the various repacking techniques to determine the most effective approach, and the "reverse" method emerged as the best performer. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccqrggwfs267sd9yx6ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccqrggwfs267sd9yx6ew.png" alt="Image description" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of the search for optimal RAG practices show that the “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Document Summarization
&lt;/h2&gt;

&lt;p&gt;Summarization in RAG involves condensing retrieved documents to enhance the relevance and efficiency of response generation by removing redundancy and reducing prompt length before sending them to the LLM model for generation. Summarization tasks can be extractive or abstractive. Extractive methods segment documents into sentences, then score and rank them based on importance. Abstractive compressors synthesize information from multiple documents to rephrase and generate a cohesive summary. These tasks can be either query-based, focusing on information relevant to a specific query, or non-query-based, focusing on general information compression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Purpose of Summarization in RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduce Redundancy&lt;/strong&gt;: Retrieved documents can contain repetitive or irrelevant details. Summarization helps filter out this unnecessary information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Efficiency&lt;/strong&gt;: Long documents or prompts can slow down the language model (LLM) during inference. Summarization helps create more concise inputs, speeding up the response generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhgw7rykgu52a51xmy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwhgw7rykgu52a51xmy5.png" alt="Image description" width="511" height="350"&gt;&lt;/a&gt;&lt;br&gt;
Performance vs. Document Number&lt;/p&gt;

&lt;p&gt;In the above figure, it can be clearly seen that the performance of LLMs in downstream tasks decreases as the number of documents in the prompt increases. Therefore, summarization of the retrieved documents is necessary to maintain performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summarization Techniques
&lt;/h3&gt;

&lt;p&gt;There are multiple summarization techniques that can be employed in RAG frameworks. Here in this blog, we will discuss two such techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;&lt;strong&gt;RECOMP&lt;/strong&gt;&lt;/a&gt;: It is a technique that compresses retrieved documents using one of two methods, extractive or abstractive compression, depending on the use case. Extractive is used to maintain exact text, while abstractive synthesizes summaries. The decision is based on task needs, document length, computational resources, and performance. The extractive compressor selects key sentences by using a dual encoder model to rank sentences based on their similarity to the query. The abstractive compressor, on the other hand, generates new summaries with an encoder-decoder model trained on large datasets containing document-summary pairs. These compressed summaries are then added to the original input, providing the language model with a focused and concise context. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h9tg49indo5k7hp7x38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h9tg49indo5k7hp7x38.png" alt="Image description" width="800" height="231"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2310.04408" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2310.06839" rel="noopener noreferrer"&gt;&lt;strong&gt;LongLLMLingua&lt;/strong&gt;&lt;/a&gt;: This technique uses a combination of question-aware coarse-grained and fine-grained compression methods to increase key information density and improve the model's performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In question-aware coarse-grained compression, perplexity is measured for each retrieved document with respect to the given query. Then, all documents are ranked based on their perplexity scores, from lowest to highest. Documents with lower perplexity are considered more relevant to the question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bn8fm8kqikmkhs734n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bn8fm8kqikmkhs734n4.png" alt="Image description" width="758" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perplexity Formula&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the above formula, k is the index of the document or sequence being considered, N_k is the number of words or tokens in the k-th document, and p(x_{doc_k, i}) is the probability the model assigns to the i-th word or token of the k-th document.&lt;/p&gt;
&lt;/blockquote&gt;
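&lt;p&gt;To make the ranking step concrete, here is a toy sketch that scores documents by perplexity under an add-one-smoothed unigram model built from query-related text. LongLLMLingua uses a real language model conditioned on the question, so this is only an illustration of the mechanics, with all names and data invented for the example.&lt;/p&gt;

```python
import math
from collections import Counter

def unigram_model(corpus_text):
    """Toy stand-in for an LM: add-one-smoothed unigram probabilities."""
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab)
    return prob

def perplexity(text, prob):
    """PPL = exp( -(1/N) * sum_i log p(x_i) ), as in the formula above."""
    toks = text.lower().split()
    nll = -sum(math.log(prob(t)) for t in toks) / len(toks)
    return math.exp(nll)

docs = [
    "paris is the capital of france",
    "the stock market fell sharply today",
]
# Model biased toward the query's vocabulary; lower PPL = more relevant.
prob = unigram_model("what is the capital of france paris france capital")
ranked = sorted(docs, key=lambda d: perplexity(d, prob))
print(ranked[0])  # paris is the capital of france
```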

&lt;p&gt;Question-aware fine-grained compression is a method to further refine the documents or passages that were retained after the coarse-grained compression by focusing on the most relevant parts of those documents. For each word or token within the retained documents, the importance is computed with respect to the given query. Then, the tokens are ranked based on their importance scores, from most important to least important.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb92oz8lafl58nij1jg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb92oz8lafl58nij1jg6.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;br&gt;
Importance Formula&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Summarization Technique
&lt;/h3&gt;

&lt;p&gt;In the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors evaluated various summarization methods on three benchmark datasets: &lt;a href="https://ai.google.com/research/NaturalQuestions/download" rel="noopener noreferrer"&gt;NQ&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/1705.03551" rel="noopener noreferrer"&gt;TriviaQA&lt;/a&gt;, and &lt;a href="https://arxiv.org/abs/1809.09600" rel="noopener noreferrer"&gt;HotpotQA&lt;/a&gt;. RECOMP was recommended for its outstanding performance. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmt9tz7ohjrueew1w59r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmt9tz7ohjrueew1w59r.png" alt="Image description" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results of the search for optimal RAG practices show that the “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Generator Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Fine-tuning is the process of taking a pre-trained model and making further adjustments to its parameters on a smaller, task-specific dataset to improve its performance on that specific task. Here, in this case, the specific task would be answer generation when given a query paired with context. In the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt; &lt;em&gt;"Searching for Best Practices in Retrieval-Augmented Generation,"&lt;/em&gt; the authors focused on fine-tuning the generator. Their goal was to investigate the impact of fine-tuning, particularly how relevant or irrelevant contexts influence the generator’s performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Dive into the Fine-Tuning Process
&lt;/h3&gt;

&lt;p&gt;For the fine-tuning process, the query input to the RAG system is denoted as x and the retrieved contexts as D. The fine-tuning loss for the generator is defined as the negative log-likelihood of the ground-truth output y, which measures how well the predicted probability distribution matches the actual distribution of the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;L (loss function) = −log P(y | x, D), where P(y | x, D) is the probability of the ground-truth output y given the query x and contexts D.&lt;/p&gt;
&lt;/blockquote&gt;
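&lt;p&gt;A minimal numeric illustration of this loss: the more probability the generator assigns to the ground-truth answer, the smaller the loss.&lt;/p&gt;

```python
import math

def nll_loss(prob_of_ground_truth):
    """L = -log P(y | x, D): small when the model is confident in the answer."""
    return -math.log(prob_of_ground_truth)

print(round(nll_loss(0.9), 4))  # 0.1054 — confident prediction, small loss
print(round(nll_loss(0.1), 4))  # 2.3026 — unconfident prediction, large loss
```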

&lt;p&gt;To explore the impact of fine-tuning, especially with relevant and irrelevant contexts, the authors defined d_gold as a context relevant to the query and d_random as a randomly retrieved context. They then trained the model using different compositions of D as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;D_g&lt;/strong&gt;: The augmented context consisted of query-relevant documents, D_g = {d_gold}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_r&lt;/strong&gt;: The context contained one randomly sampled document, D_r = {d_random}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_gr&lt;/strong&gt;: The augmented context comprised a relevant document and a randomly selected one, D_gr = {d_gold, d_random}.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D_gg&lt;/strong&gt;: The augmented context consisted of two copies of a query-relevant document, D_gg = {d_gold, d_gold}.&lt;/li&gt;
&lt;/ul&gt;
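&lt;p&gt;The four compositions can be sketched as a small helper; the function name and shape are illustrative, not taken from the paper.&lt;/p&gt;

```python
import random

def build_context(d_gold, corpus, composition, rng=random):
    """Assemble the augmented context D for one training example.

    d_gold: document relevant to the query; corpus: pool to sample d_random from.
    """
    d_random = rng.choice(corpus)
    if composition == "Dg":    # relevant document only
        return [d_gold]
    if composition == "Dr":    # random document only
        return [d_random]
    if composition == "Dgr":   # relevant + random
        return [d_gold, d_random]
    if composition == "Dgg":   # relevant document duplicated
        return [d_gold, d_gold]
    raise ValueError(composition)

print(build_context("gold doc", ["noise doc"], "Dgr"))
# ['gold doc', 'noise doc']
```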

&lt;p&gt;In this study, Llama-2-7b was selected as the base model. The base LM generator without fine-tuning is referred to as M_b, and the models fine-tuned with the different contexts are referred to as M_g, M_r, M_gr, and M_gg. The models were fine-tuned using various QA and reading comprehension datasets. Ground-truth coverage was employed as the evaluation metric due to the typically short nature of QA task answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting the Best Context Method for Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;Following the training process, all trained models were evaluated on validation sets with D_g, D_r, D_gr, and D_∅, where D_∅ indicates inference without retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlywc8amhy34d4b8xxyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlywc8amhy34d4b8xxyr.png" alt="Image description" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;
Results of Generator Fine-Tuning&lt;/p&gt;

&lt;p&gt;The results demonstrate that models trained with a mix of relevant and random documents (M_gr) perform best when provided with either gold or mixed contexts. This finding suggests that incorporating both relevant and random contexts during training can enhance the generator's robustness to irrelevant information while ensuring effective utilization of relevant contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog, we explored methods to enhance the generation component of RAG systems. Document repacking optimizes document order before generation, with the "reverse" method performing best in the study's evaluation. Summarization techniques like RECOMP and LongLLMLingua reduce redundancy and enhance efficiency, with RECOMP showing superior performance. Finally, fine-tuning the generator on a mix of relevant and irrelevant contexts improves its robustness and task-specific performance.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>Understanding RAG (Part 3): Re-Ranker is all you need.</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Sat, 17 Aug 2024 10:07:11 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-3-re-ranker-is-all-you-need-35fg</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-3-re-ranker-is-all-you-need-35fg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Creating a robust Retrieval Augmented Generation (RAG) application presents numerous challenges. As the complexity of the documents increases, we often encounter a significant decrease in the accuracy of the generated answers. This issue can stem from various factors, such as chunk length, metadata quality, clarity of the document, or the nature of the questions asked. By leveraging extended context lengths and refining our Retrieval-Augmented Generation (RAG) strategies, we can significantly enhance the relevance and accuracy of our responses. One effective strategy is the implementation of a re-ranker to ensure more precise and informative outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Re-ranker?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewzcduwdypj01v65xpr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ewzcduwdypj01v65xpr.png" alt="Image description" width="800" height="287"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://cohere.com/blog/rerank" rel="noopener noreferrer"&gt;Cohere Blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the context of Retrieval-Augmented Generation (RAG), a re-ranker is a model used to refine and improve the initial set of retrieved documents or search results before they are passed to a language model for generating responses. The process begins with an initial retrieval phase where a set of candidate documents is fetched based on the search query using keyword-based search, vector-based search, or a hybrid approach. After this initial retrieval, a re-ranker model re-evaluates and reorders the candidate documents by computing a relevance score for each document-query pair. This re-ranking step prioritizes the most relevant documents according to the context and nuances of the query. The top-ranked documents from this re-ranking process are then selected and passed to the language model, which uses them as context to generate a more accurate and informative response.&lt;/p&gt;
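&lt;p&gt;A minimal end-to-end sketch of this retrieve-then-rerank flow is shown below. Token overlap stands in here for both the first-stage retriever and the cross-encoder reranker; a production system would use a vector index for the first stage and a trained pair scorer (e.g., a cross-encoder) for the second, so this is purely illustrative.&lt;/p&gt;

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def first_stage_retrieve(query, corpus, k=3):
    """Cheap candidate retrieval: rank documents by raw token overlap."""
    q = tokens(query)
    return sorted(corpus, key=lambda d: len(q.intersection(tokens(d))), reverse=True)[:k]

def rerank(query, candidates, score_fn):
    """Second stage: score each (query, doc) pair and reorder by relevance."""
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)

def overlap_ratio(query, doc):
    """Toy pair scorer standing in for a trained cross-encoder."""
    q = tokens(query)
    return len(q.intersection(tokens(doc))) / max(len(q), 1)

corpus = [
    "the eiffel tower is in paris",
    "paris is the capital of france",
    "bordeaux is known for wine",
]
candidates = first_stage_retrieve("what is the capital of france", corpus, k=2)
top = rerank("what is the capital of france", candidates, overlap_ratio)[0]
print(top)  # paris is the capital of france
```

&lt;p&gt;The key design point is that the expensive pair-wise scorer only ever sees the handful of candidates the cheap first stage returns.&lt;/p&gt;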

&lt;h2&gt;
  
  
  Why Re-Rankers?
&lt;/h2&gt;

&lt;p&gt;Hallucinations, as well as inaccurate and insufficient outputs, often occur when unrelated retrieved documents are included in the generation context. This is where re-rankers can be invaluable: they rearrange the retrieved documents to prioritize the most relevant ones. Two key problems with the plain retrieval-and-generation setup are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Information loss in vector embeddings&lt;/strong&gt;: In RAG, we use vector search to find relevant documents by converting them into numerical vectors. Typically, the text is compressed into vectors of 768 or 1536 dimensions. This process can miss some relevant information due to compression. For example, in "I like going to the beach" vs. "I don't like going to the beach," the presence of "don't" completely changes the meaning, but embeddings might place these sentences close together due to their similar structure and content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Window Limit&lt;/strong&gt;: To capture more relevant documents, we can increase the number of documents returned (top_k) to the LLM (Large Language Model) for generating responses. However, LLMs have a limit on the amount of text they can process at once, known as the context window. Also, stuffing too much text into the context window can reduce the LLM’s performance (needle in a haystack problem) in understanding and recalling relevant information.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblea17qzq4pk4cdgqkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flblea17qzq4pk4cdgqkd.png" alt="Image description" width="410" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, "In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss," the authors found that the mistral-medium model's performance scales only for some tasks and quickly degrades for the majority of others as context grows. Each row shows the accuracy (%) of solving the corresponding &lt;a href="https://arxiv.org/pdf/2402.10790v2" rel="noopener noreferrer"&gt;BABILong task (’qa1’-’qa10’)&lt;/a&gt;, and each column corresponds to the task size submitted to mistral-medium with a 32K context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do we Incorporate Re-Rankers in RAG?
&lt;/h2&gt;

&lt;p&gt;To incorporate re-rankers in a Retrieval-Augmented Generation (RAG) system and enhance the accuracy and relevance of the generated responses, we employ a two-stage retrieval system. This system consists of:&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval Stage
&lt;/h3&gt;

&lt;p&gt;In the retrieval stage, the goal is to quickly and efficiently retrieve a set of potentially relevant documents from a large text corpus. This is achieved using a vector database, which allows us to perform a similarity search.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vector representation&lt;/strong&gt;: First, we convert the text documents into vector representations. This involves transforming the text into high-dimensional vectors that capture the semantic meaning of the text by using a bi-encoder. Typically, models like BERT or other pre-trained language models are used for this purpose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storing vectors&lt;/strong&gt;: These vector representations are then stored in a vector database. The vector database enables efficient storage and retrieval of these vectors, making it possible to perform fast similarity searches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query vector&lt;/strong&gt;: When a query is received, it is also converted into a vector representation using the same model. This query vector captures the semantic meaning of the query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Similarity search&lt;/strong&gt;: The query vector is then compared against the document vectors in the vector database using a similarity metric, such as cosine similarity. This step identifies the top k documents whose vectors are most similar to the query vector.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
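&lt;p&gt;The retrieval stage above can be sketched end to end with a toy bag-of-words "bi-encoder" in place of a real model like BERT and a plain list in place of a vector database (all names and the tiny corpus below are illustrative assumptions):&lt;/p&gt;

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "bi-encoder"; a real system would use a trained model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values())) or 1.0
    return dot / (norm(a) * norm(b))

corpus = [
    "climate change affects coral reef ecosystems",
    "the stock market closed higher today",
    "rising sea temperatures cause coral bleaching",
]
# Document vectors are precomputed once and stored (the "vector database").
doc_vectors = [embed(d) for d in corpus]

def retrieve(query, top_k=2):
    # Encode the query with the same model, then rank by cosine similarity.
    q = embed(query)
    scored = sorted(enumerate(doc_vectors), key=lambda p: cosine(q, p[1]), reverse=True)
    return [corpus[i] for i, _ in scored[:top_k]]

print(retrieve("impact of climate change on coral reefs"))
```

&lt;p&gt;The two climate-related documents are returned first; the unrelated finance document is filtered out by the similarity search.&lt;/p&gt;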

&lt;h3&gt;
  
  
  Reranking Stage
&lt;/h3&gt;

&lt;p&gt;The retrieval stage often returns documents that are relevant but not necessarily the most relevant. This is where the reranking stage comes into play. The reranking model reorders the initially retrieved documents based on their relevance to the query. This will minimize the "needle in a haystack" problem and address the context limit issue, as some models have constraints on their context window, thereby improving the overall quality of the results.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initial set of documents&lt;/strong&gt;: The documents retrieved in the first stage are used as the input for the reranking stage. This set typically contains more documents than will ultimately be used, ensuring that we have a broad base from which to select the most relevant ones.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reranking model&lt;/strong&gt;: There are multiple reranking models that can be used to refine search results by re-evaluating and reordering initially retrieved documents.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross Encoder&lt;/strong&gt;: It takes pairs of the query and each retrieved document and computes a relevance score. Unlike the bi-encoder used in the retrieval stage, which independently encodes the query and documents, the cross-encoder considers the interaction between the query and document during scoring.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdvpznqeuemcj4hkmfyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdvpznqeuemcj4hkmfyu.png" alt="Image description" width="800" height="455"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;em&gt;Source: &lt;a href="https://www.sbert.net/" rel="noopener noreferrer"&gt;SBERT&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A cross-encoder concatenates the query and documents into a single input sequence and passes this combined sequence through the encoder model to generate a joint representation. While bi-encoders are efficient and scalable for large-scale retrieval tasks due to their ability to precompute embeddings, cross-encoders provide more accurate and contextually rich relevance scoring by processing query-document pairs together. This makes cross-encoders particularly advantageous for tasks where precision and nuanced understanding of the query-document relationship are crucial.&lt;/p&gt;
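&lt;p&gt;The reranking step itself is just a reordering by a joint relevance score. The sketch below stubs out the cross-encoder with a hypothetical keyword-overlap scorer (&lt;code&gt;cross_encoder_score&lt;/code&gt; is an assumption for illustration; a real system would call a trained cross-encoder model on each query-document pair):&lt;/p&gt;

```python
def cross_encoder_score(query, document):
    # Hypothetical stub: count query terms appearing in the document, with a
    # penalty for negation. A real cross-encoder learns far richer
    # query-document interactions from the concatenated pair.
    overlap = sum(1 for w in query.lower().split() if w in document.lower())
    return overlap - 3 * document.lower().count("don't")

def rerank(query, retrieved_docs, top_k=3):
    # Reorder the first-stage candidates by joint query-document relevance.
    return sorted(retrieved_docs,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]

docs = [
    "I don't like going to the beach",
    "Guide to the best beaches to visit",
    "I like going to the beach every summer",
]
print(rerank("like going to the beach", docs, top_k=2))
```

&lt;p&gt;Note that scoring runs once per query-document pair, which is why reranking is applied only to the small candidate set from the retrieval stage rather than the whole corpus.&lt;/p&gt;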

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Vector Rerankers&lt;/strong&gt;: Models like &lt;a href="https://arxiv.org/pdf/2112.01488" rel="noopener noreferrer"&gt;ColBERT&lt;/a&gt; (Contextualized Late Interaction over BERT) strike a balance between bi-encoder and cross-encoder approaches. ColBERT maintains the efficiency of bi-encoders by precomputing document representations while enhancing the interaction between query and document tokens during similarity computation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqoxjrda63p5eav4x5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqoxjrda63p5eav4x5z.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Scenario
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "Impact of climate change on coral reefs"&lt;/p&gt;

&lt;p&gt;For the document pre-computation, consider the following documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Document A&lt;/strong&gt;: "Climate change affects ocean temperatures, which in turn impacts coral reef ecosystems."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document B&lt;/strong&gt;: "The Great Barrier Reef is facing severe bleaching events due to rising sea temperatures."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each document is precomputed into token-level embeddings without considering any specific query.&lt;/p&gt;

&lt;p&gt;When the query "Impact of climate change on coral reefs" is received, it is encoded into token-level embeddings at inference.&lt;/p&gt;

&lt;p&gt;For token-level similarity computation, we first calculate the cosine similarity. For each token in the query, such as "Impact", we calculate its similarity with every token in Document A and Document B. This process is repeated for each token in the query ("of", "climate", "change", "on", "coral", "reefs").&lt;/p&gt;

&lt;p&gt;Next, we apply the maximum similarity (maxSim) method. For each query token, we keep only its highest similarity score within each document. For instance, within Document A the highest score for the query token "coral" likely comes from the document token "coral" itself.&lt;/p&gt;

&lt;p&gt;Finally, for each document we sum its maxSim scores to obtain a final similarity score. Suppose Document A receives a final score of 0.85 and Document B a score of 0.75; Document A would then be ranked higher, as its final similarity score is greater.&lt;/p&gt;
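&lt;p&gt;The maxSim computation above can be written out directly. The 2-dimensional "token embeddings" below are hand-made toy values chosen purely for illustration; ColBERT's real token embeddings are high-dimensional and learned:&lt;/p&gt;

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def maxsim_score(query_tokens, doc_tokens):
    # ColBERT-style late interaction: for each query token embedding, take its
    # best cosine similarity over all document token embeddings, then sum.
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hand-made 2-d "token embeddings", purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "coral", "change"
doc_a = [[0.9, 0.1], [0.2, 0.9]]   # near matches for both query tokens
doc_b = [[0.7, 0.7], [0.5, 0.0]]   # weaker, more diffuse matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a > score_b)  # Document A ranks higher
```

&lt;p&gt;Because the document token embeddings are precomputed, only the cheap max-and-sum interaction happens at query time, which is what lets ColBERT approach cross-encoder quality at near bi-encoder cost.&lt;/p&gt;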

&lt;h3&gt;
  
  
  Relevance Scoring
&lt;/h3&gt;

&lt;p&gt;For each query-document pair, the reranking model outputs a similarity score that indicates how relevant the document is to the query. This score is computed using the full transformer model, allowing for a more nuanced understanding of the document's relevance in the context of the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reordering Documents
&lt;/h3&gt;

&lt;p&gt;The documents are then reordered based on their relevance scores. The most relevant documents are moved to the top of the list, ensuring that the most pertinent information is prioritized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Rerankers offer a promising solution to the limitations of basic RAG pipelines. By reordering retrieved documents based on their relevance to the query, we can improve the accuracy and relevance of generated responses. Additionally, injecting summarization into the context further enhances the LLM's ability to provide accurate answers. While implementing rerankers may introduce additional computational overhead, the benefits in terms of improved accuracy and relevance make it a worthwhile investment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tini.fyi/SqtqS" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://tini.fyi/SqtqS" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;. Book a &lt;a href="https://tini.fyi/ZpBMr" rel="noopener noreferrer"&gt;demo&lt;/a&gt; today!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Understanding RAG (Part 2) : RAG Retrieval</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Wed, 07 Aug 2024 09:23:18 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-2-rag-retrieval-4m4j</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-2-rag-retrieval-4m4j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In part one of the "Understanding RAG" series, we covered the basics and advanced concepts of Retrieval-Augmented Generation (RAG). This part delves deeper into the retrieval component and explores enhancement strategies.&lt;/p&gt;

&lt;p&gt;In this blog, we will explore techniques to improve retrieval and make pre-retrieval processes, such as document chunking, more efficient. Effective &lt;strong&gt;retrieval&lt;/strong&gt; requires accurate, clear, and detailed queries. Even with embeddings, semantic differences between queries and documents can persist. Several methods enhance query information to improve retrieval. For instance, &lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;Query2Doc&lt;/a&gt; and &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE&lt;/a&gt; generate pseudo-documents from original queries, while &lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;TOC&lt;/a&gt; decomposes queries into subqueries, aggregating the results. &lt;br&gt;
&lt;strong&gt;Document chunking&lt;/strong&gt; significantly influences retrieval performance. Common strategies involve dividing documents into chunks, but finding the optimal chunk length is challenging. Small chunks may fragment sentences, while large chunks can include irrelevant context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is document chunking?
&lt;/h2&gt;

&lt;p&gt;Chunking is the process of dividing a document into smaller, manageable segments or chunks. The choice of chunking techniques is crucial for the effectiveness of this process. In this section, we will explore various chunking techniques mentioned in the research paper and look at their findings to determine the most effective method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz83sxh1zxhksulx5h419.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz83sxh1zxhksulx5h419.png" alt="Image description" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Levels of chunking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token-level chunking&lt;/strong&gt;: Token-level chunking involves splitting the text at the token level. This method is simple to implement but has the drawback of potentially splitting sentences inappropriately, which can affect retrieval quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic-level chunking&lt;/strong&gt;: Semantic-level chunking involves using LLMs to determine logical breakpoints based on context. This method preserves the context and meaning of the text but is time-consuming to implement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentence-level chunking&lt;/strong&gt;: Sentence-level chunking involves splitting the text at sentence boundaries. This method balances simplicity with the preservation of text semantics, though it may be less precise than semantic-level chunking.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
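&lt;p&gt;Token-level and sentence-level chunking can be sketched in a few lines each (the chunk sizes and sample document below are toy values for illustration; semantic-level chunking is omitted since it requires an LLM call):&lt;/p&gt;

```python
import re

def token_chunks(text, chunk_size=8, overlap=2):
    # Token-level chunking: fixed-size windows of whitespace tokens, with
    # overlap so that sentences split at a boundary are partly repeated.
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

def sentence_chunks(text, max_sentences=2):
    # Sentence-level chunking: split on sentence boundaries, then group.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

doc = "RAG retrieves documents. It then generates answers. Chunking affects recall."
print(sentence_chunks(doc))
```

&lt;p&gt;Note how the sentence-level version never cuts a sentence in half, while the token-level version might, which is exactly the trade-off described above.&lt;/p&gt;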

&lt;h3&gt;
  
  
  Chunk size and its impact
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Larger chunks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provide more context, enhancing comprehension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Increase processing time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Smaller chunks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Improve retrieval recall and reduce processing time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: May lack sufficient context.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding the optimal chunk size involves balancing metrics such as faithfulness and relevancy. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Faithfulness measures whether the response is hallucinated or matches the retrieved texts, while relevancy measures whether the retrieved texts and responses match the queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994n2m8m6m87dbo057zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994n2m8m6m87dbo057zh.png" alt="Image description" width="638" height="360"&gt;&lt;/a&gt;&lt;br&gt;
Comparison of different chunk sizes in the research paper: &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an experiment detailed in the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, the &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;text-embedding-ada-002&lt;/a&gt; model was used for embedding. The &lt;a href="https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha" rel="noopener noreferrer"&gt;zephyr-7b-alpha&lt;/a&gt; model served as the generation model, while GPT-3.5-turbo was utilized for evaluation. A chunk overlap of 20 tokens was maintained throughout the process. The corpus for this experiment comprised the first sixty pages of the document &lt;a href="https://s27.q4cdn.com/263799617/files/doc_financials/2021/AR/Lyft-Annual-Report-2021.pdf" rel="noopener noreferrer"&gt;lyft_2021&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Optimal chunk sizes&lt;/strong&gt;: 256 and 512 tokens provide the best balance of high faithfulness and relevancy.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Larger chunks (2048 tokens)&lt;/strong&gt;: Offer more context but at the cost of slightly lower faithfulness.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Smaller chunks (128 tokens)&lt;/strong&gt;: Improve retrieval recall but may lack sufficient context, leading to slightly lower faithfulness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Advanced Chunking Techniques
&lt;/h3&gt;

&lt;p&gt;Advanced chunking techniques, such as small-to-big and sliding windows, improve retrieval quality by organizing chunk relationships and maintaining context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sliding window chunking&lt;/strong&gt;: Sliding window chunking segments text into overlapping chunks, combining fixed-size and sentence-based chunking advantages. Each chunk overlaps with the previous one, preserving context and ensuring continuity and meaning in the text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small-to-big chunking&lt;/strong&gt;: Small-to-big chunking involves using smaller, targeted text chunks for embedding and retrieval to enhance accuracy. After retrieval, the larger text chunk containing the smaller chunk is provided to the large language model for synthesis. This approach combines precise retrieval with comprehensive contextual information.&lt;/p&gt;
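&lt;p&gt;A minimal sketch of the small-to-big idea: small chunks are what gets matched, but each small chunk points back to the larger parent chunk that is actually handed to the LLM. The chunk sizes, overlap-free splitting, and term-overlap scoring here are toy assumptions, not the paper's exact setup:&lt;/p&gt;

```python
def small_to_big_index(document, big_size=12, small_size=4):
    # Build (small_chunk, parent_chunk) pairs: retrieval uses the small chunk,
    # generation receives the parent chunk.
    tokens = document.split()
    index = []
    for b in range(0, len(tokens), big_size):
        parent_tokens = tokens[b:b + big_size]
        parent = " ".join(parent_tokens)
        for s in range(0, len(parent_tokens), small_size):
            index.append((" ".join(parent_tokens[s:s + small_size]), parent))
    return index

def retrieve_parent(query_terms, index):
    # Score small chunks by simple term overlap, but return the parent chunk.
    best = max(index, key=lambda pair: sum(t in pair[0] for t in query_terms))
    return best[1]

doc = ("Coral reefs are sensitive ecosystems. Rising sea temperatures cause "
       "bleaching events. Ocean acidification also harms reef growth over time.")
index = small_to_big_index(doc)
print(retrieve_parent(["bleaching", "temperatures"], index))
```

&lt;p&gt;The match happens on a focused four-token span, yet the LLM would receive the full surrounding parent chunk, combining precise retrieval with broader context.&lt;/p&gt;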

&lt;p&gt;In an experiment detailed in the &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;, the effectiveness of advanced chunking techniques is demonstrated using the &lt;a href="https://arxiv.org/pdf/2310.07554" rel="noopener noreferrer"&gt;LLM-Embedder&lt;/a&gt; model for embedding. The study utilizes a smaller chunk size of 175 tokens, a larger chunk size of 512 tokens, and a chunk overlap of 20 tokens. Techniques such as small-to-big and sliding window are employed to enhance retrieval quality by preserving context and ensuring the retrieval of relevant information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2erzqgty83jrtr190jt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2erzqgty83jrtr190jt.png" alt="Image description" width="654" height="288"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Comparison of different chunking techniques in the research paper: &lt;a href="https://arxiv.org/pdf/2407.01219" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Methods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Query Rewriting
&lt;/h3&gt;

&lt;p&gt;Refines user queries to better match relevant documents by prompting an LLM to rewrite queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2303.07678" rel="noopener noreferrer"&gt;&lt;strong&gt;Query2Doc&lt;/strong&gt;&lt;/a&gt;: Given a query q, the method generates a pseudo-document d′ through few-shot prompting. The original query q is then concatenated with the pseudo-document d′ to form an enhanced query q+, enhancing the query's context and improving retrieval accuracy. The enhanced query q+ is a straightforward concatenation of q and d′, separated by a special token [SEP].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83pf2j4n0eev7o860n5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83pf2j4n0eev7o860n5y.png" alt="Image description" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The “DPR + query2doc” variant consistently outperforms the baseline DPR model by approximately 1% in MRR on the &lt;a href="https://microsoft.github.io/msmarco/" rel="noopener noreferrer"&gt;MS-MARCO&lt;/a&gt; dev set, regardless of the amount of labeled data used for fine-tuning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This indicates that the improvement provided by the query2doc method is not dependent on the quantity of labeled data available for training.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpwt2i0w32padmq27xif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpwt2i0w32padmq27xif.png" alt="Image description" width="782" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MRR on MS-MARCO dev set w.r.t the percentage of labeled data used for fine-tuning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dense Passage Retrieval (DPR) uses neural networks to embed queries and documents into dense vectors for comparison, while the Mean Reciprocal Rank (MRR) evaluates retrieval performance by averaging the reciprocal ranks of the first relevant document across multiple queries.&lt;br&gt;
&lt;strong&gt;Reciprocal Ranks (RR)&lt;/strong&gt;: 1 / (rank of the first relevant document)&lt;/p&gt;
&lt;/blockquote&gt;
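&lt;p&gt;The MRR definition above translates directly into code (the document IDs and relevance sets below are made up for illustration):&lt;/p&gt;

```python
def reciprocal_rank(ranked_doc_ids, relevant_ids):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results, relevance):
    # results: one ranked list of doc IDs per query; relevance: one set of
    # relevant doc IDs per query.
    rrs = [reciprocal_rank(r, rel) for r, rel in zip(results, relevance)]
    return sum(rrs) / len(rrs)

results = [["d3", "d1", "d7"],   # first relevant doc at rank 2 -> RR = 0.5
           ["d2", "d5", "d9"]]   # first relevant doc at rank 1 -> RR = 1.0
relevance = [{"d1"}, {"d2", "d9"}]
print(mean_reciprocal_rank(results, relevance))  # 0.75
```

&lt;p&gt;An approximately 1% absolute MRR gain, as reported for "DPR + query2doc", means the first relevant document appears slightly earlier in the ranking on average.&lt;/p&gt;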

&lt;h3&gt;
  
  
  2. Query Decomposition
&lt;/h3&gt;

&lt;p&gt;Involves breaking down the original query into sub-questions. This method retrieves documents based on these sub-questions, potentially enhancing retrieval accuracy by addressing different aspects of the original query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2310.14696" rel="noopener noreferrer"&gt;TOC&lt;/a&gt; (TREE OF CLARIFICATIONS), which addresses addresses ambiguous questions in open-domain QA by generating disambiguated questions (DQs) through few-shot prompting. It retrieves relevant passages to identify various interpretations, like different types of medals or Olympics. TOC prunes redundant DQs and creates a comprehensive long-form answer covering all interpretations, ensuring thoroughness and depth without needing user clarification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffznt4l3xme6wo5z80yt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffznt4l3xme6wo5z80yt1.png" alt="Image description" width="800" height="933"&gt;&lt;/a&gt;&lt;br&gt;
Overview of TREE OF CLARIFICATIONS. (1) relevant passages for the ambiguous question (AQ) are retrieved. (2) leveraging the passages, disambiguated questions (DQs) for the AQ are recursively generated via few-shot prompting and pruned as necessary. (3) a long-form answer addressing all DQs is generated.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Pseudo-Documents Generation
&lt;/h3&gt;

&lt;p&gt;Techniques like &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;HyDE&lt;/a&gt; (Hypothetical Document Embeddings) improve retrieval by generating hypothetical or pseudo documents with a generative model and encoding them into semantic embeddings using a contrastive encoder. This approach enhances relevance-based retrieval without relying on exact text matches, utilizing unsupervised learning principles effectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng0xa5zvq6j2wmglhrpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng0xa5zvq6j2wmglhrpl.png" alt="Image description" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An illustration of the HyDE model from &lt;a href="https://arxiv.org/pdf/2212.10496" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's dive deep into HyDE's concept by examining each step through an example scenario. At a high level, it performs two tasks: a generative task and a document-document similarity task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example query:&lt;/strong&gt; What are the causes of climate change?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Generative model&lt;/strong&gt;&lt;br&gt;
Purpose: Generate a hypothetical document that answers the query.&lt;br&gt;
Generated Document: "Climate change is caused by human activities like burning fossil fuels, deforestation, and industrial processes. It releases greenhouse gases such as carbon dioxide and methane, trapping heat and causing global warming. Natural factors like volcanic eruptions and solar radiation also contribute."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It captures key causes of climate change but may contain errors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Contrastive encoder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Encode the generated document using semantic embedding.&lt;/p&gt;

&lt;p&gt;Encoded Vector: A numerical representation in a high-dimensional space that emphasizes semantic content (causal factors of climate change) while filtering out unnecessary details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Retrieval using document-document similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Purpose: Compare the encoded vector of the hypothetical document with embeddings of real documents in a corpus.&lt;/p&gt;

&lt;p&gt;Process: Compute similarity (e.g., cosine similarity) between the hypothetical document's vector and corpus document vectors.&lt;/p&gt;

&lt;p&gt;Retrieval: Retrieve real documents from the corpus that are most semantically similar to the hypothetical document, focusing on relevance rather than exact wording.&lt;/p&gt;
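&lt;p&gt;The three HyDE steps above can be strung together in a short sketch. The generator is a hypothetical stub in place of the real instruction-following LLM, and a toy bag-of-words counter stands in for the contrastive encoder; the two-document corpus is made up for illustration:&lt;/p&gt;

```python
from collections import Counter
from math import sqrt

def generate_hypothetical_doc(query):
    # Step 1 (stub): in HyDE this is an LLM writing a hypothetical answer.
    return ("Climate change is driven by burning fossil fuels deforestation "
            "and industrial emissions of greenhouse gases")

def encode(text):
    # Step 2 (toy): bag-of-words stand-in for the contrastive encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

corpus = [
    "fossil fuels and deforestation release greenhouse gases",
    "coral reefs host a quarter of marine species",
]

def hyde_retrieve(query):
    # Step 3: document-document similarity. Compare the hypothetical
    # document, not the raw query, against the corpus embeddings.
    hypo_vec = encode(generate_hypothetical_doc(query))
    return max(corpus, key=lambda d: cosine(hypo_vec, encode(d)))

print(hyde_retrieve("What are the causes of climate change?"))
```

&lt;p&gt;Even though the raw query shares few words with the corpus, the hypothetical answer does, so the relevant document surfaces via document-document similarity rather than exact query matching.&lt;/p&gt;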

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Advanced chunking techniques like the sliding window significantly improve retrieval quality by maintaining context and ensuring the extraction of relevant information. Among various retrieval methods evaluated, &lt;a href="https://arxiv.org/pdf/2204.10558" rel="noopener noreferrer"&gt;Hybrid Search&lt;/a&gt; with HyDE stands out as the best, combining the speed of sparse retrieval with the accuracy of dense retrieval to achieve superior performance with acceptable latency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the next part of this series, we will focus on the re-ranking component of RAG to further improve content generation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/?utm_source=content&amp;amp;utm_medium=social&amp;amp;utm_campaign=devto&amp;amp;utm_content=RAG_part5" rel="noopener noreferrer"&gt;Maxim AI&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://www.getmaxim.ai/?utm_source=content&amp;amp;utm_medium=social&amp;amp;utm_campaign=devto&amp;amp;utm_content=RAG_part5" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>chatgpt</category>
      <category>llama</category>
    </item>
    <item>
      <title>Understanding RAG (Part 1): RAG overview</title>
      <dc:creator>Parth Roy</dc:creator>
      <pubDate>Fri, 02 Aug 2024 06:05:00 +0000</pubDate>
      <link>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-1-rag-overview-p44</link>
      <guid>https://dev.to/parth_roy_a1ec4703407d025/understanding-rag-part-1-rag-overview-p44</guid>
      <description>&lt;h1&gt;
  
  
  What is RAG?
&lt;/h1&gt;

&lt;p&gt;Retrieval-augmented generation (RAG) is a process designed to enhance the output of a large language model (LLM) by incorporating information from an external, authoritative knowledge base. This approach ensures that the responses generated by the LLM are not solely dependent on the model's training data. Large Language Models are trained on extensive datasets and utilize billions of parameters to perform tasks such as answering questions, translating languages, and completing sentences. By using RAG, these models can tap into specific domains or an organization's internal knowledge base without needing to be retrained. This method is cost-effective and helps maintain the relevance, accuracy, and utility of the LLM's output in various contexts.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why is Retrieval-Augmented Generation important?
&lt;/h1&gt;

&lt;p&gt;Large language models (LLMs) are a critical component of artificial intelligence (AI) technologies, especially for intelligent chatbots and other natural language processing (NLP) applications. The primary goal is to create chatbots that can accurately answer user questions by referencing reliable knowledge sources. However, there are inherent challenges with LLM technology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt;: LLMs can present incorrect information if they do not have the right answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-date responses&lt;/strong&gt;: The static nature of LLM training data means the model might provide outdated or overly generic information instead of specific, current responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-authoritative sources&lt;/strong&gt;: Responses may be generated from unreliable sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminology confusion&lt;/strong&gt;: Different sources might use the same terms to refer to different things, leading to inaccurate responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG addresses these challenges by directing the LLM to retrieve relevant information from authoritative, pre-determined knowledge sources. This approach provides several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control over output&lt;/strong&gt;: Organizations can better control the text generated by the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate responses&lt;/strong&gt;: Users receive more accurate and relevant information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: Users gain insights into the sources used by the LLM to generate responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m9x9lsj2aeouwsfqzbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2m9x9lsj2aeouwsfqzbi.png" alt="Diagram illustrating the components of a RAG system" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How Does Retrieval-Augmented Generation (RAG) Work?
&lt;/h1&gt;

&lt;p&gt;A basic Retrieval-Augmented Generation (RAG) framework comprises three main components: indexing, retrieval, and generation. This framework operates by first indexing data into vector representations. Upon receiving a user query, it retrieves the most relevant chunks of information based on their similarity to the query. Finally, these retrieved chunks are used to generate a well-informed response. This process ensures that the model's output is accurate, relevant, and contextually appropriate.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Indexing
&lt;/h2&gt;

&lt;p&gt;Indexing is the initial phase where raw data is prepared and stored for retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data cleaning and extraction&lt;/strong&gt;: Raw data from various formats such as PDF, HTML, Word, and Markdown is cleaned and extracted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversion to plain text&lt;/strong&gt;: This data is converted into a uniform plain text format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text segmentation&lt;/strong&gt;: The text is segmented into smaller, digestible chunks to accommodate the context limitations of language models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector encoding&lt;/strong&gt;: These chunks are encoded into vector representations using an embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage in vector database&lt;/strong&gt;: The encoded vectors are stored in a vector database, which is crucial for enabling efficient similarity searches in the retrieval phase.&lt;/li&gt;
&lt;/ul&gt;
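&lt;p&gt;The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: &lt;code&gt;embed&lt;/code&gt; is a toy hash-based stand-in for a real embedding model, and a plain list stands in for the vector database.&lt;/p&gt;

```python
# Minimal indexing sketch: chunk plain text and store vectors in memory.
import hashlib

def embed(text, dim=8):
    # Toy deterministic "embedding" for illustration; a real system would
    # call an embedding model here.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def chunk(text, size=100):
    # Fixed-size character chunks; real systems split on sentence or
    # token boundaries to respect model context limits.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents):
    index = []  # stands in for the vector database
    for doc in documents:
        for piece in chunk(doc):
            index.append({"text": piece, "vector": embed(piece)})
    return index

index = build_index(["RAG retrieves relevant chunks before generation."])
```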

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwscbp9tx2z32r76lzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwscbp9tx2z32r76lzc.png" alt="Indexing component of RAG" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Retrieval
&lt;/h2&gt;

&lt;p&gt;Retrieval is the phase where relevant data is fetched based on a user query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query encoding&lt;/strong&gt;: Upon receiving a user query, the RAG system uses the same encoding model from the indexing phase to convert the query into a vector representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity calculation&lt;/strong&gt;: The system computes similarity scores between the query vector and the vectors of the indexed chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk retrieval&lt;/strong&gt;: The system prioritizes and retrieves the top K chunks that have the highest similarity scores to the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded context creation&lt;/strong&gt;: These retrieved chunks are used to expand the context of the prompt that will be given to the language model.&lt;/li&gt;
&lt;/ul&gt;
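&lt;p&gt;A minimal sketch of the retrieval phase: the query is encoded with the same embedding function used at indexing time, similarity scores are computed against every stored chunk, and the top K chunks are returned. The tiny hand-written vectors below are illustrative only.&lt;/p&gt;

```python
# Minimal retrieval sketch: rank indexed chunks by cosine similarity
# to the query vector and return the top-k texts.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vector, index, k=2):
    scored = [(cosine(query_vector, e["vector"]), e["text"]) for e in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:k]]

# Index entries have the same shape produced at indexing time.
index = [
    {"text": "RAG indexing", "vector": [1.0, 0.0]},
    {"text": "RAG retrieval", "vector": [0.9, 0.1]},
    {"text": "unrelated", "vector": [0.0, 1.0]},
]
top = retrieve([1.0, 0.0], index, k=2)
# The two RAG chunks outrank the unrelated one.
```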

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08j3a60wyi6og3usjprw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08j3a60wyi6og3usjprw.png" alt="Retrieval component of RAG" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Generation
&lt;/h2&gt;

&lt;p&gt;Generation is the final phase where the response is created based on the retrieved information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt synthesis&lt;/strong&gt;: The user query and the selected documents (retrieved chunks) are synthesized into a coherent prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response formulation&lt;/strong&gt;: A large language model is tasked with formulating a response to the synthesized prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-specific criteria&lt;/strong&gt;: The model may draw upon its inherent parametric knowledge or limit its response to the information within the provided documents, depending on the task-specific criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn dialogue&lt;/strong&gt;: For ongoing dialogues, the existing conversational history can be integrated into the prompt, enabling the model to engage in effective multi-turn interactions.&lt;/li&gt;
&lt;/ul&gt;
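&lt;p&gt;Prompt synthesis can be sketched as simple string assembly: the retrieved chunks, any conversational history, and the user query are combined into one prompt. The actual LLM call is omitted; any chat-completion API would consume the resulting &lt;code&gt;prompt&lt;/code&gt;. The template wording is illustrative.&lt;/p&gt;

```python
# Minimal prompt-synthesis sketch: combine retrieved chunks, optional
# multi-turn history, and the user query into a single prompt.
def build_prompt(query, chunks, history=None):
    context = "\n".join(f"- {c}" for c in chunks)
    turns = "\n".join(history or [])
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        + (f"Conversation so far:\n{turns}\n" if turns else "")
        + f"Question: {query}\nAnswer:"
    )

prompt = build_prompt("What is RAG?", ["RAG augments LLMs with retrieval."])
```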

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8um8lo14uxj3p27csdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8um8lo14uxj3p27csdc.png" alt="Generation component of RAG" width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Drawbacks of Basic Retrieval-Augmented Generation (RAG)
&lt;/h1&gt;

&lt;p&gt;Basic RAG faces challenges in retrieval precision and recall, and in generating accurate, relevant responses. It struggles to integrate retrieved information coherently, often producing disjointed, redundant, or repetitive outputs. Judging the significance of different passages and keeping responses consistent adds further complexity. A single retrieval pass often fails to gather enough context, and over-reliance on the augmented data can yield outputs that merely echo the retrieved content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval Challenges
&lt;/h2&gt;

&lt;p&gt;The retrieval phase in Basic RAG often struggles with the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision and Recall&lt;/strong&gt;: It frequently selects chunks of information that are misaligned with the query or irrelevant, and it can miss crucial information needed for a comprehensive response.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;: The proportion of retrieved documents that are relevant to the query, out of all documents retrieved. It measures the exactness of the retrieval process.&lt;br&gt;
&lt;strong&gt;Recall&lt;/strong&gt;: The proportion of relevant documents that were retrieved, out of all relevant documents in the database. It measures the completeness of the retrieval process.&lt;/p&gt;
&lt;/blockquote&gt;
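&lt;p&gt;Both metrics reduce to simple set arithmetic over one query's results. The document IDs below are illustrative, and the ground-truth relevant set is assumed to be known (as it would be in an evaluation dataset).&lt;/p&gt;

```python
# Precision and recall for a single query, given the retrieved set and the
# ground-truth relevant set.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved docs are relevant; 2 of the 3 relevant docs were found.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d5"])
```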

&lt;h2&gt;
  
  
  Generation Difficulties
&lt;/h2&gt;

&lt;p&gt;In the generation phase, Basic RAG can encounter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination&lt;/strong&gt;: The model might produce content not supported by the retrieved context, creating fabricated or inaccurate information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrelevance, Toxicity, and Bias&lt;/strong&gt;: The outputs can sometimes be off-topic, offensive, or biased, negatively impacting the quality and reliability of the responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Augmentation Hurdles
&lt;/h2&gt;

&lt;p&gt;When integrating retrieved information into responses, Basic RAG faces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Disjointed or incoherent outputs&lt;/strong&gt;: Combining the retrieved data with the task at hand can result in responses that lack coherence or flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundancy&lt;/strong&gt;: Similar information retrieved from multiple sources can lead to repetitive content in the responses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Significance and relevance&lt;/strong&gt;: Determining the importance and relevance of different passages is challenging, as is maintaining stylistic and tonal consistency in the final output.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Complexity of Information Acquisition
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single retrieval limitation&lt;/strong&gt;: A single retrieval based on the original query often fails to provide enough context, necessitating more complex retrieval strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-reliance on augmented information&lt;/strong&gt;: Generation models might depend too heavily on the retrieved content, leading to responses that merely echo this information without offering additional insight or synthesis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xkybvegmz248qoyks58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xkybvegmz248qoyks58.png" alt="Basic RAG" width="616" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Advanced Retrieval-Augmented Generation (RAG)
&lt;/h1&gt;

&lt;p&gt;Advanced RAG builds on the basic RAG framework by addressing its limitations and enhancing retrieval quality through pre-retrieval and post-retrieval strategies. Here’s a detailed explanation:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pre-retrieval Process
&lt;/h2&gt;

&lt;p&gt;The pre-retrieval process aims to optimize both the indexing structure and the original query to ensure high-quality content retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimizing Indexing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhancing data granularity&lt;/strong&gt;: Breaking down data into smaller, more precise chunks to improve indexing accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimizing index structures&lt;/strong&gt;: Improving the structure of indexes to facilitate efficient and accurate retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adding metadata&lt;/strong&gt;: Incorporating additional information like timestamps, authorship, and categorization to enhance context and relevance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alignment optimization&lt;/strong&gt;: Aligning data chunks to maintain context and continuity across segments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed retrieval&lt;/strong&gt;: Combining various retrieval techniques to improve overall search results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Query Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query rewriting&lt;/strong&gt;: Rephrasing the user's original question to improve clarity and accuracy in retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query transformation&lt;/strong&gt;: Altering the structure of the query to better match the indexed data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query expansion&lt;/strong&gt;: Adding related terms or synonyms to the query to capture a broader range of relevant results.&lt;/li&gt;
&lt;/ul&gt;
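&lt;p&gt;As one concrete example, query expansion can be sketched with a hand-written synonym map. Real systems would derive expansions from a thesaurus, embedding neighbors, or an LLM; the map entries here are purely illustrative.&lt;/p&gt;

```python
# Minimal query-expansion sketch: append synonyms for known terms so the
# expanded query matches a broader range of relevant chunks.
SYNONYMS = {  # illustrative entries only
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

q = expand_query("buy car")
```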

&lt;h2&gt;
  
  
  2. Post-retrieval Process
&lt;/h2&gt;

&lt;p&gt;After retrieving relevant context, the post-retrieval process focuses on effectively integrating it with the query to generate accurate and focused responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Re-ranking Chunks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Re-ranking&lt;/strong&gt;: Prioritizing the retrieved information by relocating the most relevant content to the edges of the prompt, since language models attend less reliably to content in the middle of long contexts. This method is implemented in frameworks like LlamaIndex, LangChain, and Haystack.&lt;/li&gt;
&lt;/ul&gt;
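&lt;p&gt;A minimal sketch of edge-placing re-ranking: chunks are sorted by relevance score, then alternately assigned to the front and back of the prompt so the highest-scored content lands at the edges rather than the middle. The scores and single-letter chunks are illustrative.&lt;/p&gt;

```python
# Re-ranking sketch: place the most relevant chunks at the start and end of
# the context, mitigating the "lost in the middle" effect in long prompts.
def rerank_to_edges(scored_chunks):
    # scored_chunks: list of (score, text); higher score = more relevant.
    ordered = sorted(scored_chunks, key=lambda s: s[0], reverse=True)
    front, back = [], []
    for i, (_, text) in enumerate(ordered):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

layout = rerank_to_edges([(0.9, "A"), (0.7, "B"), (0.5, "C"), (0.3, "D")])
# The two top-scored chunks ("A" and "B") sit at the edges.
```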

&lt;h3&gt;
  
  
  Context Compressing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigating information overload&lt;/strong&gt;: Directly feeding all relevant documents into LLMs can lead to information overload, where key details are diluted by irrelevant content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selecting essential information&lt;/strong&gt;: Focusing on the most crucial parts of the retrieved content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emphasizing critical sections&lt;/strong&gt;: Highlighting the most important sections of the context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortening the context&lt;/strong&gt;: Reducing the amount of information to be processed by the LLM to maintain focus on key details.&lt;/li&gt;
&lt;/ul&gt;
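&lt;p&gt;Context compression can be sketched with a crude word-overlap filter that drops sentences sharing no terms with the query. Production compressors use learned models; this overlap heuristic is only a stand-in to show the shape of the operation.&lt;/p&gt;

```python
# Context-compression sketch: keep only sentences that overlap with the
# query, shortening the context before it reaches the LLM.
def compress(context, query, min_overlap=1):
    query_terms = set(query.lower().split())
    kept = []
    for sentence in context.split(". "):
        overlap = len(query_terms & set(sentence.lower().split()))
        if overlap >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)

ctx = "RAG retrieves chunks. The weather is sunny. Retrieval uses chunks"
short = compress(ctx, "retrieves chunks")
# The off-topic weather sentence is dropped.
```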

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9ktrbyzj4mnw7qk61ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9ktrbyzj4mnw7qk61ew.png" alt="Advanced RAG" width="800" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the subsequent parts of this blog, we will delve into detailed discussions on the advancements in pre-retrieval and post-retrieval techniques.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Retrieval-augmented generation (RAG) significantly improves the accuracy and relevance of responses from large language models (LLMs) by incorporating external, authoritative knowledge sources. It addresses common LLM challenges like misinformation and outdated data, enhancing control, transparency, and reliability in content generation. While basic RAG faces retrieval precision and response coherence issues, advanced strategies refine indexing, optimize queries, and integrate information more effectively. Ongoing advancements in RAG are crucial for enhancing AI-generated responses, building trust, and improving natural language processing interactions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmaxim.ai" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; is an evaluation platform for testing and evaluating LLM applications. Test your RAG performance with &lt;a href="https://getmaxim.ai" rel="noopener noreferrer"&gt;Maxim&lt;/a&gt; before adding advanced RAG features.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rag</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
