Rumor has it that GitHub Copilot has thousands of lines of defensive code parsing its LLM responses to catch undesired model behaviors. Bryan Bischof (Head of AI at Hex) recently mentioned that this kind of defensive coding can be expressed as evaluation metrics. This exemplifies a core part of building widely used, production-grade LLM applications: quality control and evaluation. When building an LLM application, one should add an evaluation metric for every observed failure case (e.g., Copilot not closing code brackets) to ensure it doesn't recur on new user inputs.
When evaluating LLM apps, one must distinguish between end-to-end and step/component-wise evaluation.
- End-to-end evaluation gives a sense of overall quality, which is valuable for comparing different approaches.
- Step/component-wise evaluation helps identify and mitigate failure modes that have cascading effects on the overall quality of the LLM app (e.g., ensuring that the correct context was retrieved in RAG).
Eval metrics are a highly sought-after topic in the LLM community, and getting started with them is hard. The following is an overview of evaluation metrics for different scenarios, applicable to both end-to-end and component-wise evaluation. These insights were collected from the research literature and from discussions with other LLM app builders. Code examples in Python are also provided.
General Purpose Evaluation Metrics
These evaluation metrics can be applied to any LLM call and are a good starting point for determining output quality.
Rating LLM Calls on a Scale from 1-10
The Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena paper introduces a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1 to 10. They find that GPT-4's ratings agree with a human rater as much as one human annotator agrees with another (>80%). Further, they observe that agreement with human annotators increases as the rating becomes more clear-cut. Additionally, they investigated how much evaluating LLMs overestimate their own responses and found that GPT-4 and Claude-1 were the only models that didn't overestimate themselves.
Code: here.
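For illustration, here is a minimal sketch of such a zero-shot judge using the OpenAI Python SDK; the prompt wording, model choice, and regex parsing are illustrative assumptions, not the paper's exact setup.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the quality of the AI assistant's response to the "
    "user question below on a scale from 1 to 10, considering helpfulness, relevance, "
    "accuracy, and level of detail. Answer with the rating only, e.g. 'Rating: 7'.\n\n"
    "Question: {question}\n\nResponse: {response}"
)


def rate_response(question: str, response: str, model: str = "gpt-4") -> int:
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    # Extract the first integer from the judge's reply; -1 signals an unparseable rating
    match = re.search(r"\d+", completion.choices[0].message.content)
    return int(match.group()) if match else -1
```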
Relevance of Generated Response to Query
Another general-purpose way to evaluate any LLM call is to measure how relevant the generated response is to the given query. But instead of using an LLM to rate that relevancy on a scale, the RAGAS: Automated Evaluation of Retrieval Augmented Generation paper suggests using an LLM to generate multiple questions that fit the generated answer and measuring the cosine similarity of the generated questions with the original one.
Code: here.
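A rough sketch of this idea, assuming the OpenAI SDK for both chat completions and embeddings; the prompt, models, and number of generated questions are placeholders.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def _embed(texts: list[str]) -> list[np.ndarray]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [np.array(d.embedding) for d in resp.data]


def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer_relevance(query: str, answer: str, n_questions: int = 3) -> float:
    """Average cosine similarity between the original query and questions generated from the answer."""
    prompt = (
        f"Generate {n_questions} questions to which the following answer would be a good "
        f"response. Return one question per line.\n\nAnswer: {answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}],
    )
    questions = [q.strip() for q in completion.choices[0].message.content.splitlines() if q.strip()]
    if not questions:
        return 0.0
    query_vec, *question_vecs = _embed([query] + questions)
    return float(np.mean([_cosine(query_vec, q) for q in question_vecs]))
```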
Assessing Uncertainty of LLM Predictions (w/o perplexity)
Given that many API-based LLMs, such as GPT-4, don't give access to the log probabilities of the generated tokens, assessing the certainty of LLM predictions via perplexity isn't possible. The SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models paper suggests measuring the average factuality of every sentence in a generated response. They generate additional responses from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations. The intuition behind this is that if the LLM knows a fact, it's more likely to sample it. The authors find that this works well for detecting non-factual and factual sentences and for ranking passages in terms of factuality. They also note that the correlation with human judgment doesn't increase after 4-6 additional generations when using `gpt-3.5-turbo` to evaluate biography generations.
Code: here.
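A simplified sketch of this sampling-based check; the naive sentence splitting, support prompt, and sample count are assumptions rather than the paper's exact procedure.

```python
from openai import OpenAI

client = OpenAI()


def _ask(content: str, model: str = "gpt-3.5-turbo", temperature: float = 0.0) -> str:
    resp = client.chat.completions.create(
        model=model, temperature=temperature, messages=[{"role": "user", "content": content}]
    )
    return resp.choices[0].message.content


def selfcheck_score(prompt: str, answer: str, n_samples: int = 4) -> float:
    """Fraction of sentences in `answer` supported by additional high-temperature samples.

    Values close to 1 suggest the answer is factual; low values hint at hallucination.
    """
    samples = [_ask(prompt, temperature=1.0) for _ in range(n_samples)]
    sentences = [s.strip() for s in answer.split(".") if s.strip()]  # naive sentence split
    total_support = 0.0
    for sentence in sentences:
        votes = 0
        for sample in samples:
            verdict = _ask(
                f"Context: {sample}\n\nSentence: {sentence}\n\n"
                "Is the sentence supported by the context? Answer Yes or No."
            )
            votes += verdict.strip().lower().startswith("yes")
        total_support += votes / n_samples
    return total_support / max(len(sentences), 1)
```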
Cross-Examination for Hallucination Detection
The LM vs LM: Detecting Factual Errors via Cross Examination paper proposes using another LLM to assess an LLM response's factuality. To do this, the examining LLM generates follow-up questions to the original response until it can confidently determine the factuality of the response. This method outperforms prompting techniques such as asking the original model, "Are you sure?" or instructing the model to say, "I don't know," if it is uncertain.
Code: here.
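A simplified sketch of the cross-examination loop; for brevity, one model plays both examiner and witness here, and the prompts, round count, and verdict parsing are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()


def _ask(content: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": content}])
    return resp.choices[0].message.content


def cross_examine(question: str, answer: str, n_rounds: int = 3) -> bool:
    """Returns True if the examiner judges the answer to be factually correct."""
    transcript = f"Question: {question}\nClaimed answer: {answer}\n"
    for _ in range(n_rounds):
        followup = _ask(
            "You are cross-examining a witness model about the claim below. Ask one concise "
            "follow-up question that helps verify its factuality.\n\n" + transcript
        )
        reply = _ask(f"{transcript}\nFollow-up question: {followup}\nAnswer the follow-up question.")
        transcript += f"Examiner: {followup}\nWitness: {reply}\n"
    verdict = _ask(
        "Based on the cross-examination transcript below, is the claimed answer factually correct "
        "and consistent? Answer 'correct' or 'incorrect'.\n\n" + transcript
    )
    return verdict.strip().lower().startswith("correct")
```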
RAG Specific Evaluation Metrics
In its simplest form, a RAG application consists of a retrieval and a generation step. The retrieval step fetches context for a given query. The generation step answers the initial query after being supplied with the fetched context.
The following is a collection of evaluation metrics for the retrieval and generation steps of a RAG application.
Relevance of Context to Query
For RAG to work well, the retrieved context should consist only of information relevant to the given query, so the model doesn't need to "filter out" irrelevant information. The RAGAS paper suggests first using an LLM to extract every sentence from the retrieved context that is relevant to the query, and then calculating the ratio of relevant sentences to the total number of sentences in the retrieved context.
Code: here.
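A minimal sketch of this ratio, with naive sentence splitting and an assumed extraction prompt.

```python
from openai import OpenAI

client = OpenAI()


def context_relevance(query: str, context: str, model: str = "gpt-3.5-turbo") -> float:
    """Ratio of context sentences an LLM marks as relevant to the query."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]  # naive sentence split
    prompt = (
        f"Question: {query}\n\nContext:\n{context}\n\n"
        "Extract only the sentences from the context that are relevant to answering the question. "
        "Return one sentence per line; return nothing if no sentence is relevant."
    )
    resp = client.chat.completions.create(
        model=model, temperature=0.0, messages=[{"role": "user", "content": prompt}]
    )
    relevant = [s for s in resp.choices[0].message.content.splitlines() if s.strip()]
    return len(relevant) / max(len(sentences), 1)
```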
Context Ranked by Relevancy to Query
Another way to assess the quality of the retrieved context is to measure how well the retrieved contexts are ranked by relevancy to the given query. This is supported by the intuition from the Lost in the Middle paper, which finds that performance degrades when the relevant information sits in the middle of the context window and is greatest when it appears at the beginning.
The RAGAS paper also suggests using an LLM to check if every extracted context is relevant. Then, they measure how well the contexts are ranked by calculating the mean average precision. Note that this approach considers any two relevant contexts equally important/relevant to the query.
Code: here.
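A sketch of this metric for a single query; the relevance prompt is an assumption, and for a dataset you would average the result over all queries to get the mean average precision.

```python
from openai import OpenAI

client = OpenAI()


def _is_relevant(query: str, context: str, model: str = "gpt-3.5-turbo") -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Question: {query}\n\nContext: {context}\n\n"
                       "Is this context relevant to answering the question? Answer Yes or No.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def average_precision(query: str, ranked_contexts: list[str]) -> float:
    """Average precision of the retrieved ranking; 1.0 means all relevant contexts come first."""
    labels = [_is_relevant(query, c) for c in ranked_contexts]
    hits, precisions = 0, []
    for position, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / max(hits, 1)
```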
Instead of estimating the relevancy of every retrieved context individually and deriving a ranking from that, one can also use an LLM to rerank the full list of contexts and use that reranking to evaluate how well the retriever ordered the contexts by relevancy to the given query. The Zero-Shot Listwise Document Reranking with a Large Language Model paper finds that listwise reranking with an LLM outperforms pointwise reranking. The authors use progressive listwise reordering when the retrieved contexts don't fit into the LLM's context window.
Aman Sanger (Co-Founder at Cursor) mentioned (tweet) that they leveraged this listwise reranking with a variant of the Trueskill rating system to efficiently create a large dataset of queries with 100 well-ranked retrieved code blocks per query. He underlined the paper's claim by mentioning that using GPT-4 to estimate the rank of every code block individually performed worse.
Code: here.
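A rough sketch that compares the retriever's ordering against an LLM's listwise reranking using Kendall's tau; the prompt, output parsing, and use of scipy are illustrative choices, not the paper's setup.

```python
import re

from openai import OpenAI
from scipy.stats import kendalltau

client = OpenAI()


def rerank_agreement(query: str, ranked_contexts: list[str], model: str = "gpt-4") -> float:
    """Kendall's tau between the retriever's ranking and an LLM's listwise reranking (1.0 = identical order)."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(ranked_contexts, start=1))
    prompt = (
        f"Question: {query}\n\nPassages:\n{numbered}\n\n"
        "Rerank the passages from most to least relevant to the question. "
        "Answer with the passage numbers only, e.g. 2 > 1 > 3."
    )
    resp = client.chat.completions.create(
        model=model, temperature=0.0, messages=[{"role": "user", "content": prompt}]
    )
    llm_order = [int(n) for n in re.findall(r"\d+", resp.choices[0].message.content)]
    # Map each passage to its position in the LLM's ordering; unseen passages go to the end
    llm_rank = {passage_id: pos for pos, passage_id in enumerate(llm_order, start=1)}
    retriever_ranks = list(range(1, len(ranked_contexts) + 1))
    llm_ranks = [llm_rank.get(i, len(ranked_contexts)) for i in retriever_ranks]
    tau, _ = kendalltau(retriever_ranks, llm_ranks)
    return float(tau)
```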
Faithfulness of Generated Answer to Context
Once the relevance of the retrieved context is ensured, one should assess how much the LLM reuses the provided context to generate the answer, i.e., how faithful is the generated answer to the retrieved context?
One way to do this is to use an LLM to flag any information in the generated answer that cannot be deduced from the given context. This is the approach taken by the authors of Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering. They find that GPT-4 is the best model for this analysis as measured by correlation with human judgment.
Code: here.
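A minimal sketch of such an LLM-based check; the prompt and the 'NONE' convention are assumptions.

```python
from openai import OpenAI

client = OpenAI()


def unsupported_claims(context: str, answer: str, model: str = "gpt-4") -> list[str]:
    """Returns the pieces of information in the answer that the judge cannot deduce from the context."""
    prompt = (
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "List every piece of information in the answer that cannot be deduced from the context, "
        "one per line. If everything is supported, reply with 'NONE'."
    )
    resp = client.chat.completions.create(
        model=model, temperature=0.0, messages=[{"role": "user", "content": prompt}]
    )
    content = resp.choices[0].message.content.strip()
    if content.upper() == "NONE":
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]
```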
A classical yet predictive way to assess the faithfulness of a generated answer to a given context is to measure how many tokens in the generated answer are also present in the retrieved context. This method only slightly lags behind GPT-4 and outperforms GPT-3.5-turbo (see Table 4 from the above paper).
Code: here.
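A sketch of this token-overlap measure with simple whitespace tokenization; the linked implementation may tokenize differently.

```python
def token_overlap_precision(context: str, answer: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    context_tokens = set(context.lower().split())
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    return sum(token in context_tokens for token in answer_tokens) / len(answer_tokens)
```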
The RAGAS paper puts its own spin on measuring the faithfulness of the generated answer with an LLM: it measures how many factual statements from the generated answer can be inferred from the given context. They suggest creating a list of all statements in the generated answer and assessing whether the given context supports each one.
Code: here.
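A simplified two-step sketch of this statement-level check; the prompts and model are illustrative.

```python
from openai import OpenAI

client = OpenAI()


def faithfulness(question: str, answer: str, context: str, model: str = "gpt-3.5-turbo") -> float:
    """Share of statements in the answer that the judge says are supported by the context."""
    extraction = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nAnswer: {answer}\n\n"
                       "Break the answer down into standalone factual statements, one per line.",
        }],
    )
    statements = [s.strip() for s in extraction.choices[0].message.content.splitlines() if s.strip()]
    supported = 0
    for statement in statements:
        verdict = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nStatement: {statement}\n\n"
                           "Can the statement be inferred from the context? Answer Yes or No.",
            }],
        )
        supported += verdict.choices[0].message.content.strip().lower().startswith("yes")
    return supported / max(len(statements), 1)
```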
AI Assistant/Chatbot-Specific Evaluation Metrics
Typically, a user interacts with a chatbot or AI assistant to achieve specific goals. This motivates measuring the quality of a chatbot by counting how many messages a user has to send before reaching their goal. One can further break this down by successful and unsuccessful goals to analyze user & LLM behavior.
Concretely:
- Delineate the conversation into segments by splitting it by the goals the user wants to achieve.
- Assess if every goal has been reached.
- Calculate the average number of messages sent per segment.
Code: here.
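A minimal sketch of steps 2 and 3, assuming the conversation has already been split into goal segments (e.g., by a separate LLM call); the roles, prompts, and model are assumptions.

```python
from statistics import mean

from openai import OpenAI

client = OpenAI()


def goal_reached(segment: list[dict], model: str = "gpt-3.5-turbo") -> bool:
    """Ask an LLM judge whether the user's goal was achieved in this conversation segment."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in segment)
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": f"Conversation segment:\n{transcript}\n\n"
                       "Did the user achieve the goal they pursued in this segment? Answer Yes or No.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def messages_per_goal(segments: list[list[dict]]) -> dict:
    """Average number of user messages per segment, split by whether the goal was reached."""
    successful, failed = [], []
    for segment in segments:
        n_user_messages = sum(m["role"] == "user" for m in segment)
        (successful if goal_reached(segment) else failed).append(n_user_messages)
    return {
        "avg_messages_successful": mean(successful) if successful else None,
        "avg_messages_unsuccessful": mean(failed) if failed else None,
    }
```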
Evaluation Metrics for Summarization Tasks
Text summaries can be assessed based on different dimensions, such as factuality and conciseness.
Evaluating Factual Consistency of Summaries w.r.t. Original Text
The ChatGPT as a Factual Inconsistency Evaluator for Text Summarization paper used `gpt-3.5-turbo-0301` to assess the factuality of a summary by measuring how consistent the summary is with the original text, posed as a binary classification and a grading task. They find that `gpt-3.5-turbo-0301` outperforms baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries. They also found that using `gpt-3.5-turbo-0301` leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10.
Code: binary classification and 1-10 grading.
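A minimal sketch of the binary-classification variant; the prompt wording is an assumption, not the paper's exact template.

```python
from openai import OpenAI

client = OpenAI()


def summary_is_consistent(document: str, summary: str, model: str = "gpt-3.5-turbo") -> bool:
    """Binary judgment: is the summary factually consistent with the source document?"""
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": "Decide if the summary is factually consistent with the source document.\n\n"
                       f"Source document:\n{document}\n\nSummary:\n{summary}\n\nAnswer Yes or No.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```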
Likert Scale for Grading Summaries
Among other methods, the Human-like Summarization Evaluation with ChatGPT paper used `gpt-3.5-turbo-0301` to evaluate summaries on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence. They find that this method outperforms other methods in most cases in terms of correlation with human expert annotation. Noteworthy is that BARTScore was very competitive with `gpt-3.5-turbo-0301`.
Code: Likert scale grading.
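A minimal sketch of Likert-scale grading along the four dimensions, one call per dimension; the prompt and model are illustrative.

```python
import re

from openai import OpenAI

client = OpenAI()

DIMENSIONS = ["relevance", "consistency", "fluency", "coherence"]


def likert_grades(document: str, summary: str, model: str = "gpt-3.5-turbo") -> dict:
    """Grade the summary on a 1-5 Likert scale along four quality dimensions."""
    grades = {}
    for dimension in DIMENSIONS:
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[{
                "role": "user",
                "content": f"Source document:\n{document}\n\nSummary:\n{summary}\n\n"
                           f"Rate the {dimension} of the summary on a scale from 1 (worst) to 5 (best). "
                           "Answer with the number only.",
            }],
        )
        match = re.search(r"\d", resp.choices[0].message.content)
        grades[dimension] = int(match.group()) if match else None
    return grades
```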
How To Use Above Evaluation Metrics
You can use these evaluation metrics on your own or through Parea by following our onboarding wizard. Alternatively, you can get started by:
- Deploying any of the above evaluation functions.
- Adding a small code snippet.
- Viewing logs in the dashboard.