<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joschka Braun</title>
    <description>The latest articles on DEV Community by Joschka Braun (@joschkabraun).</description>
    <link>https://dev.to/joschkabraun</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1238163%2Fda9ae5b2-9970-427e-a895-30c97ab3449b.jpeg</url>
      <title>DEV Community: Joschka Braun</title>
      <link>https://dev.to/joschkabraun</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joschkabraun"/>
    <language>en</language>
    <item>
      <title>Hill climbing generative AI problems: When ground truth values are expensive to obtain &amp; launching fast is important</title>
      <dc:creator>Joschka Braun</dc:creator>
      <pubDate>Sun, 25 Aug 2024 18:43:59 +0000</pubDate>
      <link>https://dev.to/parea-ai/hill-climbing-generative-ai-problems-when-ground-truth-values-are-expensive-to-obtain-launching-fast-is-important-1d78</link>
      <guid>https://dev.to/parea-ai/hill-climbing-generative-ai-problems-when-ground-truth-values-are-expensive-to-obtain-launching-fast-is-important-1d78</guid>
      <description>&lt;p&gt;For many generative AI applications, it is expensive to create ground truth answers for a set of inputs (e.g. summarization tasks). This makes experimentation slow, as you can't even run an LLM eval that assesses whether the output matches the ground truth. In this guide, you will learn how to experiment quickly with your LLM app while you are still figuring out your data.&lt;/p&gt;

&lt;p&gt;In such scenarios, you want to split your experimentation process into two hill climbing phases with different goals. The term &lt;a href="https://en.wikipedia.org/wiki/Hill_climbing" rel="noopener noreferrer"&gt;hill climbing&lt;/a&gt; is inspired by the numerical optimization algorithm of the same name which starts with an initial solution and iteratively improves upon it. Concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
Hill climb your data: Iterate on your application to understand your data &amp;amp; &lt;strong&gt;find ground truth values/targets&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
Hill climb your app: Iterate on your application to &lt;strong&gt;find a compound system fitting all targets&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While your ultimate goal is to increase the "accuracy" of your LLM app (a lagging indicator), you will get there by maximizing learnings, i.e., running as many experiments as possible (a leading indicator). Read more about &lt;a href="https://jxnl.co/writing/2024/08/19/rag-flywheel/#2-focus-on-leading-metrics" rel="noopener noreferrer"&gt;focusing on leading metrics&lt;/a&gt; by &lt;a href="https://jxnl.co" rel="noopener noreferrer"&gt;Jason Liu&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: Hill climb your data
&lt;/h2&gt;

&lt;p&gt;Your goal in this phase is to find the best ground truth values/targets for your data. You do that by iterating on your LLM app and judging whether the new outputs are better, i.e. you &lt;strong&gt;continuously label your data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Take summarization as an example. To get &lt;em&gt;some&lt;/em&gt; ground truth values, you can run a simple version of your LLM app on your unlabeled dataset to generate initial summaries. After manually reviewing the outputs, you will find some failure modes of the summaries (e.g. they don't mention numbers). Then you tweak your LLM system to incorporate this feedback and generate a new round of summaries.&lt;/p&gt;

&lt;p&gt;Now you are in hill-climbing mode. For every sample, compare the newly generated summary with the ground truth summary (the previous one) and update the ground truth summary whenever the new one is better. During that pairwise comparison, you will gain insights into the failure modes of your LLM app. You then update your LLM app to address these failure modes, generate new summaries, and continue hill-climbing your data. You can stop this phase once your summaries stop improving. Summarized in a diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvghpzo3wsma6sly6po50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvghpzo3wsma6sly6po50.png" alt="Hill climbing generative AI problems" width="800" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you keep track of the best version of your LLM app? While this process does not entail a direct comparison between different iterations of the LLM app, you can still get a sense of it. Use the pairwise comparisons between the new and ground truth summaries to score every item in your experiments with +1, 0 or -1, depending on whether the new summary is better than, comparable to, or worse than the ground truth one. With that information, you can approximately assess which experiment is closest to the ground truth summaries.&lt;/p&gt;
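&lt;p&gt;The aggregation can be sketched in a few lines (a minimal sketch; the pairwise labels come from your manual review):&lt;/p&gt;

```python
# Sketch: score each sample of an experiment against the current ground truth
# with +1 (better), 0 (comparable) or -1 (worse), then rank experiments by
# their mean score.

def experiment_score(pairwise_labels):
    """pairwise_labels: list of +1 / 0 / -1 judgments, one per sample."""
    return sum(pairwise_labels) / len(pairwise_labels)

def closest_experiment(experiments):
    """experiments: mapping of experiment name to its list of pairwise labels."""
    return max(experiments, key=lambda name: experiment_score(experiments[name]))
```

For example, `closest_experiment({"v1": [-1, 0, 1], "v2": [1, 1, 0]})` picks `"v2"`, the experiment whose outputs most often beat the ground truth.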

&lt;p&gt;This process is akin to how the training data for Llama 2 were created. Instead of writing responses for supervised fine-tuning data ($25 per unit), pairwise comparisons ($3.5 per unit) were used. Watch Thomas Scialom, one of the authors, talk about it &lt;a href="https://www.youtube.com/watch?v=CzR3OrOkM9w" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Hill climb your app
&lt;/h2&gt;

&lt;p&gt;In this phase, you focus on creating a &lt;a href="https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/" rel="noopener noreferrer"&gt;compound AI system&lt;/a&gt; which fits all targets/ground truth values at the same time. For that you need to be able to measure how closely your outputs are to the ground truth values. While you can assess their closeness by manually comparing outputs with targets, LLM-based evals come in handy to speed up your iteration cycle.&lt;/p&gt;

&lt;p&gt;You will need to iterate on your LLM evals to ensure they are aligned with human judgement. As you manually review your experiment results, measure the alignment of your LLM eval with your annotations, then tweak the eval to mimic them. Once there is good alignment (as measured by Cohen's kappa for categorical annotations or Spearman correlation for continuous judgements), you can rely more on the LLM evals and less on manual review. This unlocks a faster feedback loop. The effect is even more pronounced when domain experts such as lawyers or doctors manually review responses. Before any major release, you should still have a human-in-the-loop process to verify quality and to assess the correctness of your LLM evals.&lt;/p&gt;
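&lt;p&gt;For categorical annotations, Cohen's kappa can be computed directly from the two label lists. A minimal sketch (no external dependencies; commonly, values above roughly 0.6 are read as substantial agreement, which is a rule of thumb, not a claim from this article):&lt;/p&gt;

```python
# Sketch: measure alignment between LLM-eval labels and human labels with
# Cohen's kappa for categorical annotations.
from collections import Counter

def cohens_kappa(human_labels, llm_labels):
    n = len(human_labels)
    # observed agreement: fraction of samples where both annotators agree
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    # expected agreement under chance, from each annotator's label frequencies
    human_freq = Counter(human_labels)
    llm_freq = Counter(llm_labels)
    expected = sum(human_freq[c] * llm_freq[c] for c in set(human_labels)) / (n * n)
    return (observed - expected) / (1 - expected)
```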

&lt;p&gt;Note that you may find better ground truth values during manual review in this phase. Hence, dataset versioning becomes important to understand whether any drift in evaluation scores is due to moving targets.&lt;/p&gt;
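&lt;p&gt;One lightweight way to version a dataset is to hash its contents and store that version id next to each experiment score. A minimal sketch (the record layout is illustrative, not a Parea API):&lt;/p&gt;

```python
# Sketch: version the dataset whenever a target changes, so score drift
# across experiments can be attributed to moving targets.
import hashlib
import json

def dataset_version(dataset):
    """Deterministic version id for a list of {'input': ..., 'target': ...} dicts."""
    canonical = json.dumps(dataset, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def record_experiment(results_log, experiment_name, score, dataset):
    # store the dataset version next to the score, so a later score change
    # can be checked against a target change
    results_log.append(
        {
            "experiment": experiment_name,
            "score": score,
            "dataset_version": dataset_version(dataset),
        }
    )
```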

&lt;h2&gt;
  
  
  Continuous improvement
&lt;/h2&gt;

&lt;p&gt;Once you have data with good ground truth values/targets and an application which is close to those targets, you are ready to launch the app with your beta users. During that process, you will encounter failure cases which you haven't seen before. You will want to use those samples to improve your application.&lt;/p&gt;

&lt;p&gt;For the new samples, you go through Phase 1 followed by Phase 2. Whereas for the previous samples in your dataset, you continue with Phase 2 as you tweak your application to fit the new data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Parea help?
&lt;/h2&gt;

&lt;p&gt;You can use Parea to &lt;a href="http://docs.parea.ai/welcome/getting-started-evaluation" rel="noopener noreferrer"&gt;run experiments&lt;/a&gt;, track ground truth values in &lt;a href="http://docs.parea.ai/platform/test_hub/overview" rel="noopener noreferrer"&gt;datasets&lt;/a&gt;, review &amp;amp; &lt;a href="http://docs.parea.ai/manual-review/logs-view#commenting-on-logs" rel="noopener noreferrer"&gt;comment on logs&lt;/a&gt;, and compare experiment results with ground truth values in &lt;a href="http://docs.parea.ai/manual-review/queue#viewing-the-queue" rel="noopener noreferrer"&gt;a queue&lt;/a&gt; during Phase 1. For Phase 2, Parea helps by &lt;a href="http://docs.parea.ai/evaluation/overview#investigating-relationship-between-statistics" rel="noopener noreferrer"&gt;tracking the alignment&lt;/a&gt; of your LLM evals with manual review and &lt;a href="http://docs.parea.ai/blog/self-improving-domain-specific-alligned-llm-evals-with-human-judgement" rel="noopener noreferrer"&gt;bootstrapping LLM evals from manual review data&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When ground truth values are expensive to create (e.g. for summarization tasks), you can &lt;strong&gt;use pairwise comparisons of your LLM outputs to iteratively label&lt;/strong&gt; your data as you experiment with your LLM app. Then, you want to &lt;strong&gt;build a compound system fitting all ground truth values&lt;/strong&gt;. In that latter process, &lt;strong&gt;aligned LLM-based evals are crucial&lt;/strong&gt; to speed up your iteration cycle.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>rag</category>
      <category>programming</category>
    </item>
    <item>
      <title>Tactics for multi-step LLM app experimentation</title>
      <dc:creator>Joschka Braun</dc:creator>
      <pubDate>Thu, 25 Jul 2024 16:34:38 +0000</pubDate>
      <link>https://dev.to/parea-ai/tactics-for-multi-step-llm-app-experimentation-5flm</link>
      <guid>https://dev.to/parea-ai/tactics-for-multi-step-llm-app-experimentation-5flm</guid>
      <description>&lt;p&gt;In this article, we will discuss tactics specific to testing &amp;amp; improving multi-step AI apps. We will introduce each tactic, demonstrate the idea on a sample RAG app, and see how &lt;a href="https://www.parea.ai" rel="noopener noreferrer"&gt;Parea&lt;/a&gt; simplifies its application. The aim of this blog is to give guidance on how to improve multi-component AI apps whether or not you use Parea.&lt;/p&gt;

&lt;p&gt;Note: a version with TypeScript code is available &lt;a href="https://docs.parea.ai/blog/llm-app-multi-step-experimentation-tactics" rel="noopener noreferrer"&gt;here&lt;/a&gt;; I left it out because markdown lacks the code groups / accordions that simplify navigating the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample app: finance chatbot
&lt;/h2&gt;

&lt;p&gt;A simple chatbot over the AirBnB 10k 2023 dataset will serve as our sample application. We will assume that the user only writes keywords to ask questions about AirBnB's 2023 10k filing.&lt;br&gt;
Given the user's keywords, we first expand the query, then use the expanded query to retrieve relevant contexts, which are used to generate the answer. Check out the pseudocode below illustrating the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def query_expansion(keyword_query: str) -&amp;gt; str:
    # LLM call to expand query
    pass

def context_retrieval(query: str) -&amp;gt; list[str]:
    # fetch top 10 indexed contexts
    pass

def answer_generation(query: str, contexts: list[str]) -&amp;gt; str:
    # LLM call to generate answer given queries &amp;amp; contexts
    pass

def chatbot(keyword_query: str) -&amp;gt; str:
    expanded_query = query_expansion(keyword_query)
    contexts = context_retrieval(expanded_query)
    return answer_generation(expanded_query, contexts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tactic 1: QA of every sub-step
&lt;/h2&gt;

&lt;p&gt;If every step of our AI application is 90% accurate, a 10-step application succeeds end to end only about 35% of the time (0.9^10), i.e. it has a ~65% error rate (&lt;strong&gt;cascading effects of failed sub-steps&lt;/strong&gt;). Hence, quality assessment (QA) of every possible sub-step is crucial. It goes without saying that testing every sub-step simplifies identifying where to improve our application.&lt;/p&gt;
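&lt;p&gt;The arithmetic behind that claim, assuming independent failures per step:&lt;/p&gt;

```python
# With 90% per-step accuracy, a 10-step pipeline is right end to end only
# when every step succeeds, assuming independent failures.
per_step_accuracy = 0.9
steps = 10
end_to_end_accuracy = per_step_accuracy ** steps  # about 0.349
error_rate = 1 - end_to_end_accuracy              # about 0.651
```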

&lt;p&gt;How exactly to evaluate a given sub-step is domain specific. Yet, you might want to check out these lists of &lt;a href="https://docs.parea.ai/blog/eval-metrics-for-llm-apps-in-prod" rel="noopener noreferrer"&gt;reference-free&lt;/a&gt; and &lt;a href="https://docs.parea.ai/blog/llm-eval-metrics-for-labeled-data" rel="noopener noreferrer"&gt;reference-based&lt;/a&gt; eval metrics for inspiration. Reference-free means that you don't know the correct answer, while reference-based means that you have some ground truth data to check the output against. Typically, evaluation becomes a lot easier when you have ground truth data to verify the output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Applied to sample app
&lt;/h3&gt;

&lt;p&gt;Evaluating every sub-step of our sample app means that we need to evaluate the query expansion, context retrieval, and answer generation step. In tactic 2, we will look at the actual evaluation functions of these components.&lt;/p&gt;

&lt;h3&gt;
  
  
  With Parea
&lt;/h3&gt;

&lt;p&gt;Parea helps in two ways with this step.&lt;br&gt;
It simplifies instrumenting &amp;amp; testing a step as well as creating reports on how the components perform. We will use the &lt;a href="https://docs.parea.ai/observability/logging_and_tracing#usage-2" rel="noopener noreferrer"&gt;&lt;code&gt;trace&lt;/code&gt; decorator&lt;/a&gt; for instrumentation and evaluation of any step. This decorator logs any inputs, output, latency, etc., creates traces (hierarchical logs), executes any specified evaluation functions to score the output and saves their scores. To report the quality of an app, we will &lt;a href="https://docs.parea.ai/welcome/getting-started-evaluation" rel="noopener noreferrer"&gt;run experiments&lt;/a&gt;. Experiments measure the performance of our app on a dataset and enable identifying regressions across experiments. Below you can see how to use Parea to instrument &amp;amp; evaluate every component.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pip install -U parea-ai
from parea import Parea, trace

# instantiate Parea client
p = Parea(api_key="PAREA_API_KEY")

# observing &amp;amp; testing query expansion; query_expansion_accuracy defined in tactic 2
@trace(eval_funcs=[query_expansion_accuracy])
def query_expansion(keyword_query: str) -&amp;gt; str:
    ...

# observing &amp;amp; testing context fetching; correct_context defined in tactic 2
@trace(eval_funcs=[correct_context])
def context_retrieval(query: str) -&amp;gt; list[str]:
    ...

# observing &amp;amp; testing answer generation; answer_accuracy defined in tactic 2
@trace(eval_funcs=[answer_accuracy])
def answer_generation(query: str, contexts: list[str]) -&amp;gt; str:
    ...

# decorate with trace to group all traces for sub-steps under a root trace
@trace
def chatbot(keyword_query: str) -&amp;gt; str:
    ...

# test data are a list of dictionaries
test_data = ...

# evaluate chatbot on dataset
p.experiment(
    name='AirBnB 10k',
    data=test_data,
    func=chatbot,
).run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tactic 2: Reference-based evaluation
&lt;/h2&gt;

&lt;p&gt;As mentioned above, reference-based evaluation is a lot easier &amp;amp; more grounded than reference-free evaluation. This also applies to testing sub-steps. Production logs are very useful test data: collect &amp;amp; store them together with any (corrected) sub-step outputs. If you do not have ground truth/target values, especially for sub-steps, you should consider &lt;strong&gt;synthetic data generation incl. ground truths for every step&lt;/strong&gt;. Synthetic data also come in handy when you can't leverage production logs as your test data. To create synthetic data for sub-steps, you need to incorporate the relationship between components into the data generation. See below for what this can look like.&lt;/p&gt;
&lt;h3&gt;
  
  
  Applied to sample app
&lt;/h3&gt;

&lt;p&gt;We will start with generating some synthetic data for our app. For that we will use &lt;a href="https://twitter.com/virattt" rel="noopener noreferrer"&gt;Virat&lt;/a&gt;’s processed AirBnB 2023 10k filings dataset and generate synthetic data for the sub-step (expanding the keyword into a query). As this dataset contains triplets of question, context and answer, we will do the inverse of the sub-step: generate a keyword query from the provided question. To do that, we will use &lt;a href="https://python.useinstructor.com" rel="noopener noreferrer"&gt;Instructor&lt;/a&gt; with the OpenAI API to generate the keyword query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pip install -U instructor openai
import os
import json
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

# Download the AirBnB 10k dataset
path_qca = "airbnb-2023-10k-qca.json"
if not os.path.exists(path_qca):
    # download via urllib so this runs outside a notebook (the original used !wget)
    import urllib.request
    urllib.request.urlretrieve("https://virattt.github.io/datasets/abnb-2023-10k.json", path_qca)
with open(path_qca, "r") as f:
    question_context_answers = json.load(f)

# Define the response model to create the keyword query
class KeywordQuery(BaseModel):
    keyword_query: str = Field(..., description="few keywords that represent the question")

# Patch the OpenAI client
client = instructor.from_openai(OpenAI())

test_data = []
for qca in question_context_answers:
    # generate the keyword query
    keyword_query: KeywordQuery = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=KeywordQuery,
        messages=[{"role": "user", "content": "Create a keyword query for the following question: " + qca["question"]}],
    )
    test_data.append(
        {
            'keyword_query': keyword_query.keyword_query,
            'target': json.dumps(
                {
                    'expanded_query': qca['question'],
                    'context': qca['context'],
                    'answer': qca['answer']
                }
            )
        }
    )

# Save the test data
with open("test_data.json", "w") as f:
    json.dump(test_data, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these data, we can now evaluate our sub-steps as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query expansion: &lt;a href="https://en.wikipedia.org/wiki/Levenshtein_distance" rel="noopener noreferrer"&gt;Levenshtein distance&lt;/a&gt; between the original question from the dataset and the generated query&lt;/li&gt;
&lt;li&gt;context retrieval: hit rate at 10, i.e., if the correct context was retrieved in the top 10 results&lt;/li&gt;
&lt;li&gt;answer generation: Levenshtein distance between the answer from the dataset and the generated answer&lt;/li&gt;
&lt;/ul&gt;
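&lt;p&gt;Independent of any tooling, these three metrics can be sketched as plain functions (a minimal sketch; the actual implementations linked above may differ in details such as normalization):&lt;/p&gt;

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[len(b)]

def similarity(output, target):
    # normalize edit distance to a 0-1 score, 1 meaning identical
    longest = max(len(output), len(target), 1)
    return 1 - levenshtein(output, target) / longest

def hit_rate_at_k(correct_context, retrieved_contexts, k=10):
    # was the correct context retrieved among the top k results?
    return correct_context in retrieved_contexts[:k]
```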

&lt;h3&gt;
  
  
  With Parea
&lt;/h3&gt;

&lt;p&gt;Using the synthetic data, we can formulate our evals using Parea as shown below. Note, an eval function in Parea receives a &lt;code&gt;Log&lt;/code&gt; object and returns a score. We will use the &lt;code&gt;Log&lt;/code&gt; object to access the &lt;code&gt;output&lt;/code&gt; of that step and the &lt;code&gt;target&lt;/code&gt; from our dataset. The &lt;code&gt;target&lt;/code&gt; is a stringified dictionary containing the correctly expanded query, context, and answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from parea.schemas import Log
from parea.evals.general.levenshtein import levenshtein_distance

# testing query expansion
def query_expansion_accuracy(log: Log) -&amp;gt; float:
    target = json.loads(log.target)['expanded_query']  # log.target is of type string
    return levenshtein_distance(log.output, target)

# testing context fetching
def correct_context(log: Log) -&amp;gt; bool:
    correct_context = json.loads(log.target)['context']
    retrieved_contexts = json.loads(log.output)  # log.output is of type string
    return correct_context in retrieved_contexts

# testing answer generation
def answer_accuracy(log: Log) -&amp;gt; float:
    target = json.loads(log.target)['answer']
    return levenshtein_distance(log.output, target)

# loading generated test data
with open('test_data.json') as fp:
    test_data = json.load(fp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tactic 3: Cache LLM calls
&lt;/h2&gt;

&lt;p&gt;Once you can assess the quality of the individual components, you can iterate on them with confidence. To do that, you will want to &lt;strong&gt;cache LLM calls&lt;/strong&gt; to speed up iteration &amp;amp; avoid unnecessary cost, as other sub-steps might not have changed. Caching also makes your app deterministic, which simplifies testing. Below is an implementation of a general cache:&lt;/p&gt;

&lt;p&gt;For Python, you can see a slightly modified version of the file caching Sweep AI uses (&lt;a href="https://github.com/sweepai/sweep/blob/74bbd414a2b5a41a55fa77fcc0d9603ae82f58bd/sweepai/logn/cache.py#L52" rel="noopener noreferrer"&gt;original code&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib
import os
import pickle

MAX_DEPTH = 6


def recursive_hash(value, depth=0, ignore_params=[]):
    """Hash primitives recursively with maximum depth."""
    if depth &amp;gt; MAX_DEPTH:
        return hashlib.md5("max_depth_reached".encode()).hexdigest()

    if isinstance(value, (int, float, str, bool, bytes)):
        return hashlib.md5(str(value).encode()).hexdigest()
    elif isinstance(value, (list, tuple)):
        return hashlib.md5(
            "".join(
                [recursive_hash(item, depth + 1, ignore_params) for item in value]
            ).encode()
        ).hexdigest()
    elif isinstance(value, dict):
        return hashlib.md5(
            "".join(
                [
                    recursive_hash(key, depth + 1, ignore_params)
                    + recursive_hash(val, depth + 1, ignore_params)
                    for key, val in value.items()
                    if key not in ignore_params
                ]
            ).encode()
        ).hexdigest()
    elif hasattr(value, "__dict__") and value.__class__.__name__ not in ignore_params:
        return recursive_hash(value.__dict__, depth + 1, ignore_params)
    else:
        return hashlib.md5("unknown".encode()).hexdigest()


def file_cache(ignore_params=[]):
    """Decorator to cache function output based on its inputs, ignoring specified parameters."""

    def decorator(func):
        def wrapper(*args, **kwargs):
            cache_dir = "/tmp/file_cache"
            os.makedirs(cache_dir, exist_ok=True)

            # Convert args to a dictionary based on the function's signature
            args_names = func.__code__.co_varnames[: func.__code__.co_argcount]
            args_dict = dict(zip(args_names, args))

            # Remove ignored params
            kwargs_clone = kwargs.copy()
            for param in ignore_params:
                args_dict.pop(param, None)
                kwargs_clone.pop(param, None)

            # Create hash based on function name and input arguments
            arg_hash = recursive_hash(
                args_dict, ignore_params=ignore_params
            ) + recursive_hash(kwargs_clone, ignore_params=ignore_params)
            cache_file = os.path.join(
                cache_dir, f"{func.__module__}_{func.__name__}_{arg_hash}.pickle"
            )

            # If cache exists, load and return it
            if os.path.exists(cache_file):
                print("Used cache for function: " + func.__name__)
                with open(cache_file, "rb") as f:
                    return pickle.load(f)

            # Otherwise, call the function and save its result to the cache
            result = func(*args, **kwargs)
            with open(cache_file, "wb") as f:
                pickle.dump(result, f)

            return result

        return wrapper

    return decorator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Applied to sample app
&lt;/h3&gt;

&lt;p&gt;To do this, you might want to introduce an abstraction over your LLM calls to apply the cache decorator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@file_cache
def call_llm(model: str, messages: list[dict[str, str]], **kwargs) -&amp;gt; str:
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
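&lt;p&gt;To see the cache at work, here is a self-contained demo. It uses a simplified stand-in for the &lt;code&gt;file_cache&lt;/code&gt; decorator above (no parameters, fresh temp directory per run) so the snippet runs on its own, and a counter in place of a real LLM call:&lt;/p&gt;

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="file_cache_demo")  # fresh dir per run

def file_cache(func):  # simplified stand-in for the full decorator above
    def wrapper(*args, **kwargs):
        # hash function name plus arguments to locate the cache file
        key = hashlib.md5(repr((args, sorted(kwargs.items()))).encode()).hexdigest()
        cache_file = os.path.join(CACHE_DIR, f"{func.__name__}_{key}.pickle")
        if os.path.exists(cache_file):
            with open(cache_file, "rb") as f:
                return pickle.load(f)  # cache hit: skip the expensive call
        result = func(*args, **kwargs)
        with open(cache_file, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

calls = {"n": 0}

@file_cache
def call_llm(model, prompt):
    calls["n"] += 1  # counts real invocations; stands in for an API call
    return f"response to {prompt!r} from {model}"

call_llm("gpt-4o", "hello")
call_llm("gpt-4o", "hello")  # second call is served from the pickle cache
```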



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Test every sub-step to &lt;strong&gt;minimize the cascading effect&lt;/strong&gt; of their failure. Use the full trace from production logs or generate synthetic data (incl. for the sub-steps) for &lt;strong&gt;reference-based evaluation of individual components&lt;/strong&gt;. Finally, &lt;strong&gt;cache LLM calls&lt;/strong&gt; to speed up &amp;amp; save cost when iterating on independent sub-steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Parea help?
&lt;/h3&gt;

&lt;p&gt;Using the &lt;a href="https://docs.parea.ai/observability/logging_and_tracing#usage-2" rel="noopener noreferrer"&gt;&lt;code&gt;trace&lt;/code&gt; decorator&lt;/a&gt;, you can create nested tracing of steps and apply functions to score their outputs. After instrumenting your application, you can track the quality of your AI app and identify regressions across runs using &lt;a href="https://docs.parea.ai/welcome/getting-started-evaluation" rel="noopener noreferrer"&gt;experiments&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LLM Evaluation Metrics for Labeled Data</title>
      <dc:creator>Joschka Braun</dc:creator>
      <pubDate>Wed, 21 Feb 2024 17:29:30 +0000</pubDate>
      <link>https://dev.to/parea-ai/llm-evaluation-metrics-for-labeled-data-egf</link>
      <guid>https://dev.to/parea-ai/llm-evaluation-metrics-for-labeled-data-egf</guid>
      <description>&lt;p&gt;The following is an overview of general purpose evaluation metrics based on foundational models and fine-tuned LLMs as well as RAG specific evaluation metrics. The evaluation metrics rely on ground truth annotations/reference answers to assess the correctness of the model response. They were collected from research literature and discussions with other LLM app builders. Implementation in Python or links to the models are provided where available.&lt;/p&gt;

&lt;h2&gt;
  
  
  General Purpose Evaluation Metrics using Foundational Models
&lt;/h2&gt;

&lt;p&gt;A classical yet predictive way to assess how much the model response agrees with the reference answer is to measure the overlap between the two. This is suggested in &lt;a href="https://arxiv.org/abs/2307.16877" rel="noopener noreferrer"&gt;Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering&lt;/a&gt;. Concretely, the authors suggest to measure the proportion of tokens of the reference answer which are also part of the model response, i.e., measure the recall. They find that this metric only slightly lags behind using GPT-3.5-turbo (see table 2 from the paper) to compare output &amp;amp; reference answer.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/answer_matches_target_recall.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;
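&lt;p&gt;A minimal sketch of that recall metric, using simple whitespace tokenization (the linked implementation may differ in details such as normalization and tokenization):&lt;/p&gt;

```python
# Proportion of reference-answer tokens that also appear in the model response.
def answer_recall(response, reference):
    reference_tokens = reference.lower().split()
    response_tokens = set(response.lower().split())
    hits = sum(tok in response_tokens for tok in reference_tokens)
    return hits / max(len(reference_tokens), 1)
```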

&lt;p&gt;The authors compared more methods by their correlation with human judgment and found that the most predictive metric for the correctness of the model response is to use another LLM for grading it, in this case, GPT-4. In particular, they instruct the LLM to compare the generated response with the ground truth answer and output "no" if there is any information missing from the ground truth answer.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/answer_matches_target_llm_grader.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The authors of &lt;a href="https://arxiv.org/abs/2305.13711" rel="noopener noreferrer"&gt;LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models&lt;/a&gt; take this method further by prompting an LLM to generate a JSON schema whose fields are scores that assess the model response on different dimensions using a reference answer. While this method was developed for chatbots, it exemplifies using JSON generation as a way to assess the correctness of the model response on various criteria. They compared using scales of 0-5 and 0-100, finding that the 0-5 scale only slightly outperforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuned LLMs as General Purpose Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;An emerging body of work proposes fine-tuning LLMs to yield evaluations assessing the correctness of a model response given a reference answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus
&lt;/h3&gt;

&lt;p&gt;The authors of &lt;a href="https://arxiv.org/abs/2310.08491" rel="noopener noreferrer"&gt;Prometheus: Inducing fine-grained evaluation capability in language models&lt;/a&gt; fine-tune LLaMa-2-Chat (7B &amp;amp; 13B) to output feedback and a score from 1-5 for a given a response, the instructions which yielded the response, a reference answer to compare against, and a score rubric. The model is highly aligned with GPT-4 evaluation and is comparable to it in terms of performance (as measured by human annotators) while being drastically cheaper. They train the model on GPT-4 generated data, which contained fine-grained scoring rubrics (a total of 1k rubrics) and reference answers to a given instruction. The methods were benchmarked on MT Bench, Vicuna Bench, Feedback Bench &amp;amp; Flask Eval.&lt;/p&gt;

&lt;p&gt;Model: &lt;a href="https://github.com/kaistAI/Prometheus" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CritiqueLLM
&lt;/h3&gt;

&lt;p&gt;The authors of &lt;a href="https://arxiv.org/abs/2311.18702" rel="noopener noreferrer"&gt;CRITIQUELLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation&lt;/a&gt; fine-tune two versions of ChatGLM-2 (6B, 12B &amp;amp; 66B) to output a score (1-10) and a critique. One fine-tuned version receives as input the user query and the model response, such that it can be used as a reference-free evaluation metric. The other version receives as input the user query, the model response, and the reference answer, such that it can be used as a reference-based evaluation metric.&lt;/p&gt;

&lt;p&gt;While their method performs worse than GPT-4, it is interesting as it converts a reference-based evaluation metric into a reference-free one. They achieve this by training the reference-based model on GPT-4 outputs and the reference-free model on GPT-4 outputs that respond to prompts to revise the previous evaluation to not use the reference answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  InstructScore
&lt;/h3&gt;

&lt;p&gt;The authors of &lt;a href="https://arxiv.org/abs/2305.14282" rel="noopener noreferrer"&gt;INSTRUCTSCORE: Explainable Text Generation Evaluation with Fine-grained Feedback&lt;/a&gt; extend the idea of fine-tuning an LLM to generate feedback &amp;amp; scores given a user query, the model response, and the reference answer. Instead of only giving feedback &amp;amp; scores, they fine-tuned the model to generate a report that contains a list of error types, locations, severity labels, and explanations. Their Llama-7B-based model is close in performance to supervised methods and outperforms GPT-4 based methods.&lt;/p&gt;

&lt;p&gt;Model: &lt;a href="https://huggingface.co/xu1998hz/InstructScore" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Specific Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;In its simplest form, a RAG application consists of a retrieval and a generation step. The retrieval step fetches the context given a query. The generation step answers the initial query after being supplied with the fetched context. The following is a collection of evaluation metrics to evaluate the retrieval and generation steps in a RAG application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Percent Target Supported by Context
&lt;/h3&gt;

&lt;p&gt;This metric calculates the percentage of sentences in the target/ground truth supported by the retrieved context. It does that by instructing an LLM to analyze each sentence in the reference answer and output "yes" if the sentence is supported by the retrieved context and "no" otherwise. This is useful to understand how well the retrieval step is working, and it provides an upper bound on the performance of the entire RAG system, as the generation step can only be as good as the retrieved context.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/percent_target_supported_by_context.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
&lt;/h3&gt;

&lt;p&gt;The authors of &lt;a href="https://arxiv.org/abs/2311.09476" rel="noopener noreferrer"&gt;ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems&lt;/a&gt; improve upon the &lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;RAGAS&lt;/a&gt; paper using 150 labeled data to fine-tune LLM judges to evaluate context relevancy, answer faithfulness &amp;amp; answer relevance. Note, &lt;a href="https://dev.to/blog/eval-metrics-for-llm-apps-in-prod#relevance-of-context-to-query"&gt;context relevancy&lt;/a&gt; measures how relevant the retrieved context is to the query, &lt;a href="https://dev.to/blog/eval-metrics-for-llm-apps-in-prod#faithfulness-of-generated-answer-to-context"&gt;answer faithfulness&lt;/a&gt; measures how much the generated answer is based on the retrieved context, and &lt;a href="https://dev.to/blog/eval-metrics-for-llm-apps-in-prod#relevance-of-generated-response-to-query"&gt;answer relevance&lt;/a&gt; measures how well the generated answer matches the query.&lt;/p&gt;

&lt;p&gt;Concretely, given a corpus of documents &amp;amp; few-shot examples of in-domain passages mapped to in-domain queries &amp;amp; answers, they generate synthetic (query, passage, answer) triplets. Then, they use these triplets to train LLM judges for context relevancy, answer faithfulness &amp;amp; answer relevance with a binary classification loss, utilizing the labeled data as a validation set. In the last step, they use the labeled data to learn a rectifier function that constructs confidence intervals for the model's predictions (leveraging &lt;a href="https://arxiv.org/abs/2301.09633" rel="noopener noreferrer"&gt;prediction-powered inference&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;When benchmarking their method on ranking different RAG systems, they find that it outperforms RAGAS and a GPT-3.5-turbo-16k baseline, as measured by the correlation between the true ranking and the ranking induced by each method's scores.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/stanford-futuredata/ARES" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You can get started with these evaluation metrics on &lt;a href="https://docs.parea.ai/welcome/getting-started-evaluation" rel="noopener noreferrer"&gt;Parea&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>How to measure the performance of LLM applications without ground truth data.</title>
      <dc:creator>Joschka Braun</dc:creator>
      <pubDate>Thu, 21 Dec 2023 08:13:06 +0000</pubDate>
      <link>https://dev.to/parea-ai/how-to-measure-the-performance-of-llm-applications-without-ground-truth-data-75o</link>
      <guid>https://dev.to/parea-ai/how-to-measure-the-performance-of-llm-applications-without-ground-truth-data-75o</guid>
      <description>&lt;p&gt;&lt;a href="https://www.latent.space/p/bryan-bischof" rel="noopener noreferrer"&gt;Rumor&lt;/a&gt; is that GitHub Copilot has thousands of lines of defensive code parsing its LLM responses to catch undesired model behaviors. Bryan Bischof (Head of AI at Hex) recently mentioned that this defensive coding can be created with evaluation metrics. This exemplifies a core part of building widely used production-grade LLM applications: quality control and evaluation. When building an LLM application, one should add an evaluation metric for any failure case to ensure this doesn't happen again under new user inputs (e.g., Copilot not closing code brackets).&lt;/p&gt;

&lt;p&gt;When evaluating LLM apps, one must distinguish between end-to-end and step/component-wise evaluation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The end-to-end evaluation gives a sense of the overall quality, which is valuable to compare different approaches.&lt;/li&gt;
&lt;li&gt;The step/component-wise evaluation helps identify &amp;amp; mitigate failure modes that have cascading effects impacting the overall quality of the LLM app (e.g., ensuring that the correct context was retrieved in RAG).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eval metrics are a highly sought-after topic in the LLM community, yet getting started with them is hard. Below is an overview of evaluation metrics for different scenarios, applicable to both end-to-end and component-wise evaluation. The insights were collected from the research literature and from discussions with other LLM app builders. Code examples are provided in Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  General Purpose Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;These evaluation metrics can be applied to any LLM call and are a good starting point for determining output quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rating LLMs Calls on a Scale from 1-10
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2306.05685" rel="noopener noreferrer"&gt;Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena&lt;/a&gt; paper introduces a general-purpose zero-shot prompt to rate responses from an LLM to a given question on a scale from 1-10. They find that GPT-4's ratings agree as much with a human rater as a human annotator agrees with another one (&amp;gt;80%). Further, they observe that the agreement with a human annotator increases as the response rating gets clearer. Additionally, they investigated how much the evaluating LLM overestimated its responses and found that GPT-4 and Claude-1 were the only models that didn't overestimate themselves.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/llm_grader.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevance of Generated Response to Query
&lt;/h3&gt;

&lt;p&gt;Another general-purpose way to evaluate any LLM call is to measure how relevant the generated response is to the given query. But instead of using an LLM to rate the relevancy on a scale, the &lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;RAGAS: Automated Evaluation of Retrieval Augmented Generation&lt;/a&gt; paper suggests using an LLM to generate multiple questions that fit the generated answer and measuring the cosine similarity of the generated questions with the original one.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/answer_relevancy.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assessing Uncertainty of LLM Predictions (w/o perplexity)
&lt;/h3&gt;

&lt;p&gt;Given that many API-based LLMs, such as GPT-4, don't give access to the log probabilities of the generated tokens, assessing the certainty of LLM predictions via perplexity isn't possible. The &lt;a href="https://arxiv.org/abs/2303.08896" rel="noopener noreferrer"&gt;SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models&lt;/a&gt; paper suggests measuring the average factuality of every sentence in a generated response. They generate additional responses from the LLM at a high temperature and check how much every sentence in the original answer is supported by the other generations. The intuition behind this is that if the LLM knows a fact, it's more likely to sample it. The authors find that this works well in detecting non-factual and factual sentences and ranking passages in terms of factuality. The authors noted that correlation with human judgment doesn't increase after 4-6 additional generations when using &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; to evaluate biography generations.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/self_check.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Examination for Hallucination Detection
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2305.13281" rel="noopener noreferrer"&gt;LM vs LM: Detecting Factual Errors via Cross Examination&lt;/a&gt; paper proposes using another LLM to assess an LLM response's factuality. To do this, the examining LLM generates follow-up questions to the original response until it can confidently determine the factuality of the response. This method outperforms prompting techniques such as asking the original model, "Are you sure?" or instructing the model to say, "I don't know," if it is uncertain.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/general/lm_vs_lm.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG Specific Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;In its simplest form, a RAG application consists of retrieval and generation steps. The retrieval step fetches the context for a given query. The generation step answers the initial query after being supplied with the fetched context.&lt;/p&gt;

&lt;p&gt;The following is a collection of evaluation metrics to evaluate the retrieval and generation steps in a RAG application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevance of Context to Query
&lt;/h3&gt;

&lt;p&gt;For RAG to work well, the retrieved context should consist only of information relevant to the given query, so the model doesn't need to "filter out" irrelevant information. The RAGAS paper suggests first using an LLM to extract every sentence from the retrieved context that is relevant to the query, then calculating the ratio of relevant sentences to the total number of sentences in the retrieved context.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/context_query_relevancy.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Ranked by Relevancy to Query
&lt;/h3&gt;

&lt;p&gt;Another way to assess the quality of the retrieved context is to measure how well the retrieved contexts are ranked by relevancy to a given query. This is supported by the intuition from the &lt;a href="https://arxiv.org/abs/2307.03172" rel="noopener noreferrer"&gt;Lost in the Middle paper&lt;/a&gt;, which finds that performance degrades if the relevant information sits in the middle of the context window, and is greatest when the relevant information appears at the beginning.&lt;/p&gt;

&lt;p&gt;The RAGAS paper also suggests using an LLM to check whether each retrieved context is relevant. They then measure how well the contexts are ranked by calculating the mean average precision. Note that this approach considers any two relevant contexts equally important/relevant to the query.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/context_ranking_pointwise.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Instead of estimating the relevancy of every rank individually and measuring the ranking quality based on that, one can also use an LLM to rerank the list of contexts and use that reranking to evaluate how well the contexts are ordered by relevancy to the given query. The &lt;a href="https://arxiv.org/abs/2305.02156" rel="noopener noreferrer"&gt;Zero-Shot Listwise Document Reranking with a Large Language Model&lt;/a&gt; paper finds that listwise reranking outperforms pointwise reranking with an LLM. The authors use progressive listwise reordering when the retrieved contexts don't fit into the context window of the LLM.&lt;/p&gt;

&lt;p&gt;Aman Sanger (Co-Founder at &lt;a href="https://cursor.sh" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;) mentioned (&lt;a href="https://twitter.com/amanrsanger/status/1732145826963828997" rel="noopener noreferrer"&gt;tweet&lt;/a&gt;) that they leveraged this listwise reranking with a variant of the Trueskill rating system to efficiently create a large dataset of queries with 100 well-ranked retrieved code blocks per query. He underlined the paper's claim by mentioning that using GPT-4 to estimate the rank of every code block individually performed worse.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/context_ranking_listwise.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faithfulness of Generated Answer to Context
&lt;/h3&gt;

&lt;p&gt;Once the relevance of the retrieved context is ensured, one should assess how much the LLM reuses the provided context to generate the answer, i.e., how faithful is the generated answer to the retrieved context?&lt;/p&gt;

&lt;p&gt;One way to do this is to use an LLM to flag any information in the generated answer that cannot be deduced from the given context. This is the approach taken by the authors of &lt;a href="https://arxiv.org/abs/2307.16877" rel="noopener noreferrer"&gt;Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering&lt;/a&gt;. They find that GPT-4 is the best model for this analysis as measured by correlation with human judgment.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/answer_context_faithfulness_binary.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A classical yet predictive way to assess the faithfulness of a generated answer to a given context is to measure how many tokens in the generated answer are also present in the retrieved context. This method only slightly lags behind GPT-4 and outperforms GPT-3.5-turbo (see Table 4 from the above paper).&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/answer_context_faithfulness_precision.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The RAGAS paper puts its own spin on measuring the faithfulness of the generated answer via an LLM: it measures how many factual statements from the generated answer can be inferred from the given context. They suggest creating a list of all statements in the generated answer and assessing whether the given context supports each statement.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/rag/answer_context_faithfulness_statement_level.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Assistant/Chatbot-Specific Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;Typically, a user interacts with a chatbot or AI assistant to achieve specific goals. This motivates measuring the quality of a chatbot by counting how many messages a user has to send before reaching their goal. One can further break this down by successful and unsuccessful goals to analyze user &amp;amp; LLM behavior.&lt;/p&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delineate the conversation into segments by splitting it at the goals the user wants to achieve.&lt;/li&gt;
&lt;li&gt;Assess if every goal has been reached.&lt;/li&gt;
&lt;li&gt;Calculate the average number of messages sent per segment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/chat/goal_success_ratio.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Metrics for Summarization Tasks
&lt;/h2&gt;

&lt;p&gt;Text summaries can be assessed based on different dimensions, such as factuality and conciseness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating Factual Consistency of Summaries w.r.t. Original Text
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2303.15621" rel="noopener noreferrer"&gt;ChatGPT as a Factual Inconsistency Evaluator for Text Summarization&lt;/a&gt; paper used &lt;code&gt;gpt-3.5-turbo-0301&lt;/code&gt; to assess the factuality of a summary by measuring how consistent the summary is with the original text, posed as a binary classification and a grading task. They find that &lt;code&gt;gpt-3.5-turbo-0301&lt;/code&gt; outperforms baseline methods such as SummaC and QuestEval when identifying factually inconsistent summaries. They also found that using &lt;code&gt;gpt-3.5-turbo-0301&lt;/code&gt; leads to a higher correlation with human expert judgment when grading the factuality of summaries on a scale from 1 to 10.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/summary/factual_inconsistency_binary.py" rel="noopener noreferrer"&gt;binary classification&lt;/a&gt; and &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/summary/factual_inconsistency_scale.py" rel="noopener noreferrer"&gt;1-10 grading&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Likert Scale for Grading Summaries
&lt;/h3&gt;

&lt;p&gt;Among other methods, the &lt;a href="https://arxiv.org/abs/2304.02554" rel="noopener noreferrer"&gt;Human-like Summarization Evaluation with ChatGPT&lt;/a&gt; paper used &lt;code&gt;gpt-3.5-0301&lt;/code&gt; to evaluate summaries on a Likert scale from 1-5 along the dimensions of relevance, consistency, fluency, and coherence. They find that this method outperforms other methods in most cases in terms of correlation with human expert annotation. Noteworthy is that &lt;a href="https://arxiv.org/abs/2106.11520" rel="noopener noreferrer"&gt;BARTScore&lt;/a&gt; was very competitive with &lt;code&gt;gpt-3.5-0301&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/parea-ai/parea-sdk-py/blob/main/parea/evals/summary/likert_scale.py" rel="noopener noreferrer"&gt;Likert scale grading&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Use Above Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;You can use these evaluation metrics on your own or through Parea by following our &lt;a href="https://app.parea.ai/onboarding/testing" rel="noopener noreferrer"&gt;onboarding wizard&lt;/a&gt;. Alternatively, you can get started by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://docs.parea.ai/testing/evaluation-functions#evaluation-functions-in-python" rel="noopener noreferrer"&gt;Deploy any of the above evaluation function&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.parea.ai/monitoring/meaure-performance#using-deployed-evaluation-functions" rel="noopener noreferrer"&gt;Add a small code snippet&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;View logs in the &lt;a href="https://app.parea.ai/logs" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
