
Jeffrey Ip for Confident AI

Originally published at confident-ai.com

The one thing everyone's doing wrong with ChatGPT... 🤫🤔

TL;DR

Most developers don't evaluate their GPT outputs when building applications, even if that means introducing unnoticed breaking changes, because evaluation is very, very hard. In this article, you're going to learn how to evaluate ChatGPT (LLM) outputs the right way.

🔥 On the agenda:

  • what LLMs are and why they're difficult to evaluate
  • different ways to evaluate LLM outputs
  • how to evaluate in Python

Enjoy! 🤗


DeepEval - open-source evaluation framework for LLM applications

DeepEval is a framework that helps engineers evaluate the performance of their LLM applications by providing default metrics to measure hallucination, relevancy, and much more.

We are just starting out, and we really want to help more developers build safer AI apps. Would you mind giving it a star to help spread the word, please? 🥺❤️🥺

🌟 DeepEval on GitHub



What are LLMs and what makes them so hard to evaluate?

To understand why LLMs are difficult to evaluate and why they're often referred to as a "black box", let's unpack what LLMs are and how they work.

ChatGPT is an example of a large language model (LLM) and was trained on huge amounts of data: roughly 300 billion words from articles, tweets, r/tifu, Stack Overflow, how-to guides, and other pieces of data that were scraped off the internet 🤯

Anyway, the "GPT" in ChatGPT stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture which is particularly good at predicting the next few tokens (for ChatGPT a token is roughly 4 characters of English text on average, but it can be as short as one character or as long as a word depending on the specific encoding strategy).

So in fact, LLMs don't really "know" anything, but instead "understand" linguistic patterns due to the way in which they were trained, which often makes them pretty good at figuring out the right thing to say. Pretty manipulative, huh?

All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.
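To make the "probabilistic" part concrete, here's a minimal sketch of sampling the next token from a probability distribution. The vocabulary and the scores are made up for illustration (they don't come from a real model), and the function names are my own:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng, temperature=1.0):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

def greedy_next_token(vocab, logits):
    """Always pick the single most likely token (deterministic)."""
    return vocab[logits.index(max(logits))]

# Hypothetical candidates for the token after "We offer a 30-day full ..."
vocab = ["refund", "return", "guarantee", "trial"]
logits = [2.0, 1.5, 1.0, 0.2]  # made-up scores, not from a real model

rng = random.Random()  # unseeded, so the samples differ between runs
samples = [sample_next_token(vocab, logits, rng) for _ in range(10)]
print(samples)                           # a different mix on each run
print(greedy_next_token(vocab, logits))  # always "refund"
```

Run it a few times and the sampled list changes while the greedy pick stays the same, which is the whole evaluation problem in miniature: several different outputs can all be acceptable.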


Why do we need to evaluate LLM applications?

When I say LLM applications, here are some examples of what I'm referring to:

  • Chatbots: For customer support, virtual assistants, or general conversational agents.
  • Code Assistance: Suggesting code completions, fixing code errors, or helping with debugging.
  • Legal Document Analysis: Helping legal professionals quickly understand the essence of long contracts or legal texts.
  • Personalized Email Drafting: Helping users draft emails based on context, recipient, and desired tone.

LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully 😥).

This matters because it is now not only OpenAI's job to ensure ChatGPT is performing as expected ⚖️ but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.

Evaluations (I'll just call them evals from here on) help you measure how well your application is handling the task at hand. Without evals, you risk introducing unnoticed breaking changes, or you have to manually inspect all possible LLM outputs each time you iterate on your application 👀 which to me sounds like a terrible idea 💀


How to evaluate LLM outputs

There are two ways everyone should know about when it comes to evals - with and without ChatGPT.

Evals without ChatGPT

A nice way to evaluate LLM outputs without using ChatGPT is to use other machine learning models from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few), despite non-deterministic outputs.

For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.

You might also wonder how these models can possibly "know" whether a piece of text is factually correct 🤔 It turns out you can provide context to these models for them to take at face value 🥳 In fact, we call these contexts ground truths or references. A collection of these references is often referred to as an evaluation dataset.

But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another type of ML model), and all you need to do is supply the input and output for the model to determine how relevant they are to each other.

Off the top of my head, here's a list of reference-less metrics:

  • relevancy
  • bias
  • toxicity
  • helpfulness
  • harmlessness

And here is a list of reference based metrics:

  • factual correctness
  • conceptual similarity

Note that reference-based metrics don't require you to provide the initial input, as they only judge the output based on the provided context.
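Here's a rough sketch of what that difference looks like in code. To keep it runnable without downloading any models, I'm using plain word overlap as a stand-in scorer, which is an assumption made purely for illustration: a real setup would plug in an NLI or cross-encoder model instead. What matters is the function signatures, i.e. which pieces of data each kind of metric needs:

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real NLP model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Jaccard overlap between two texts, in the range 0 to 1."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def relevancy(query, output):
    """Reference-less: needs only the input and the output."""
    return overlap(query, output)

def factual_correctness(output, context):
    """Reference-based: judges the output against the context, no input needed."""
    return overlap(output, context)

context = "All customers are eligible for a 30 day full refund at no extra costs."
good = "We offer a 30-day full refund at no extra costs."
bad = "Refunds are never possible."

# The output that agrees with the context scores higher than the one that doesn't.
print(factual_correctness(good, context) > factual_correctness(bad, context))  # True
```

Word overlap is far too naive for real evals (it can't tell "refunds are possible" from "refunds are not possible"), which is exactly why the model-based metrics above exist.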

Using ChatGPT for Evals


There's an emerging trend of using state-of-the-art LLMs (aka ChatGPT) to evaluate themselves or even other LLMs.

G-Eval is a recently developed framework that uses LLMs for evals.

I'll attach an image from the research paper that introduced G-Eval below, but in a nutshell G-Eval is a two-part process - the first part generates evaluation steps, and the second uses the generated evaluation steps to output a final score.

Let's run through a concrete example. Firstly, to generate evaluation steps:

  1. introduce an evaluation task to ChatGPT (e.g. rate this summary from 1 - 5 based on relevancy)
  2. introduce the evaluation criteria (e.g. relevancy will be based on the collective quality of all sentences)
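As a small illustration, the two steps above boil down to building a single prompt. The wording of the template below is my own assumption and not the exact prompt used in the G-Eval paper, and build_steps_prompt is a hypothetical helper:

```python
def build_steps_prompt(task, criteria):
    """Combine the task and the criteria into a request for evaluation steps."""
    return (
        f"Task: {task}\n"
        f"Criteria: {criteria}\n"
        "List the evaluation steps you would follow to carry out this task."
    )

prompt = build_steps_prompt(
    task="Rate this summary from 1 - 5 based on relevancy.",
    criteria="Relevancy will be based on the collective quality of all sentences.",
)
print(prompt)
```

You would send this prompt to the LLM once and reuse the evaluation steps it returns for every output you score afterwards.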

Once the evaluation steps have been generated:

  1. concatenate the input, evaluation steps, context, and the actual output
  2. ask it to generate a score between 1 - 5, where 5 is better than 1
  3. (Optional) take the probabilities the LLM assigns to each possible score token, normalize them, and use the probability-weighted sum of the scores as the final result

Step 3 is actually pretty complicated 🙃 because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.
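The arithmetic itself is simple once you have the probabilities. Here's a minimal sketch, assuming you've somehow obtained log-probabilities for the candidate score tokens; the numbers below are made up, and weighted_score is a hypothetical helper rather than part of any library:

```python
import math

def weighted_score(token_logprobs):
    """Probability-weighted score from log-probabilities of score tokens.

    token_logprobs maps a token such as "4" to its log-probability.
    """
    # Keep only tokens that are valid scores between 1 and 5.
    probs = {
        int(tok): math.exp(lp)
        for tok, lp in token_logprobs.items()
        if tok.strip() in {"1", "2", "3", "4", "5"}
    }
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no valid score tokens found")
    # Normalize so the probabilities sum to 1, then take the weighted sum.
    return sum(score * p / total for score, p in probs.items())

# Made-up log-probabilities for the score token (not real model output).
example = {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
print(round(weighted_score(example), 2))  # 4.2
```

Instead of a flat "4", you get 4.2, which reflects that the model was also fairly tempted to answer "5". That's the extra granularity the paper is after.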

Here's a diagram taken from the paper that can help you visualize what we've learnt:

[Image: G-Eval framework diagram from the paper]

Using GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy 😳 but evaluations using LLMs can often be very expensive.

So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective traditional methods where suitable.


Evaluating LLM outputs in Python

By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research what's the best way to compute each individual metric, train your own model for it, and code up an evaluation framework... 😰

Luckily, there are a few open source packages such as ragas and DeepEval that provide an evaluation framework so you don't have to write your own 😌

As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications using DeepEval 😊 (but seriously, we have an amazing Pytest-like developer experience, are easy to set up, and offer a free platform for you to visualize your evaluation results)

Let's wrap things up with some coding.


Setting up your test environment

To implement our much anticipated evals, create a project folder and initialize a Python virtual environment by running the commands below in your terminal:

mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate

Your terminal prompt should now start with something like this:

(venv)

Installing dependencies

Run the following command:

pip install deepeval

Setting your OpenAI API Key

Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Eval later (which basically means using LLMs for evaluation). In your terminal, paste this in, replacing the placeholder with your own API key (get yours here if you don't already have one):

export OPENAI_API_KEY="your-api-key-here"
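If you want to double check that the key is visible to Python before running any tests, a quick sanity check could look like this (openai_key_is_set is a hypothetical helper of mine, not something DeepEval provides):

```python
import os

def openai_key_is_set(env=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty."""
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; tests that call OpenAI will likely fail.")
```

Keep in mind that export only applies to the current terminal session, so you'll need to set the key again if you open a new one.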

Writing your first test file

Let's create a file called test_evals.py (note that test files must start with "test"):

touch test_evals.py

Paste in the following code:

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric
from deepeval.metrics.conceptual_similarity import ConceptualSimilarityMetric
from deepeval.metrics.llm_eval import LLMEvalMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test
import openai

def test_factual_correctness():
    # Reference-based: the output is judged against the provided context.
    query = "What if these shoes don't fit?"
    context = "All customers are eligible for a 30 day full refund at no extra costs."
    output = "We offer a 30-day full refund at no extra costs."
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output, context=context)
    assert_test(test_case, [factual_consistency_metric])

def test_relevancy():
    # Reference-less: only the input and the output are needed.
    query = "What does your company do?"
    output = "Our company specializes in cloud computing"
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output)
    assert_test(test_case, [relevancy_metric])

def test_conceptual_similarity():
    # Reference-based: the output is compared to the expected output.
    query = "What did the cat do?"
    output = "The cat climbed up the tree"
    expected_output = "The cat ran up the tree."
    conceptual_similarity_metric = ConceptualSimilarityMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output, expected_output=expected_output)
    assert_test(test_case, [conceptual_similarity_metric])

def test_humor():
    # LLM-based: ChatGPT scores the output against a custom criteria.
    def make_chat_completion_request(prompt):
        # openai.ChatCompletion is the interface of the openai SDK before version 1.0.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content
    query = "Write me something funny related to programming"
    output = "Why did the programmer quit his job? Because he didn't get arrays!"
    llm_metric = LLMEvalMetric(
        criteria="How funny it is",
        completion_function=make_chat_completion_request
    )
    test_case = LLMTestCase(query=query, output=output)
    assert_test(test_case, [llm_metric])

Now run the test file:

deepeval test run test_evals.py

For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 to 1. For example, FactualConsistencyMetric(minimum_score=0.5) means we want to evaluate how factually correct an output is, and minimum_score=0.5 means the test will only pass if the score is higher than the 0.5 threshold.

Let's go over the test cases one by one:

  1. test_factual_correctness tests how factually correct your LLM output is relative to the provided context.
  2. test_relevancy tests how relevant the output is relative to the given input.
  3. test_conceptual_similarity tests how conceptually similar the LLM output is relative to the expected output.
  4. test_humor tests how funny your LLM output is. This test case is the only test case that uses ChatGPT for evaluation.

Notice how there are up to four parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while others are mandatory.
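To keep track of which parameters matter for which metric, here's a small lookup sketch. The mapping is inferred from the test cases in this article and is only an illustration; it is not DeepEval's own validation logic, and the metric names used as keys are just labels I picked:

```python
# Which fields each metric relies on, inferred from the tests in this article.
# This is an illustration, not DeepEval's own validation logic.
REQUIRED_FIELDS = {
    "factual_consistency": {"output", "context"},
    "answer_relevancy": {"query", "output"},
    "conceptual_similarity": {"output", "expected_output"},
}

def missing_fields(metric, test_case):
    """Return the sorted list of required fields the test case lacks."""
    provided = {name for name, value in test_case.items() if value is not None}
    return sorted(REQUIRED_FIELDS[metric] - provided)

case = {
    "query": "What does your company do?",
    "output": "Our company specializes in cloud computing",
    "context": None,
    "expected_output": None,
}

for metric in REQUIRED_FIELDS:
    print(metric, missing_fields(metric, case))
```

For this test case only the relevancy metric has everything it needs, which matches test_relevancy above.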

Lastly, what if you want to test more than one metric on the same input? Here's how you can combine metrics on a single test case:

def test_everything():
    # One test case, several metrics: all of them must pass.
    query = "What did the cat do?"
    output = "The cat climbed up the tree"
    expected_output = "The cat ran up the tree."
    context = "The cat ran up the tree."
    conceptual_similarity_metric = ConceptualSimilarityMetric(minimum_score=0.5)
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=query, output=output, context=context, expected_output=expected_output)
    assert_test(test_case, [conceptual_similarity_metric, relevancy_metric, factual_consistency_metric])

Not so hard after all, huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗

PS. Here's a bonus feature DeepEval offers: a free web platform for you to view data on all your test runs.

Try running the following command:

deepeval login

Follow the instructions (login, get your API key, paste it in the CLI), and run this again:

deepeval test run test_evals.py

Let me know what happens!

Conclusion

In this article, you've learnt:

  • how ChatGPT works
  • examples of LLM applications
  • why it's hard to evaluate LLM outputs
  • how to evaluate LLM outputs in python

With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇

If you enjoyed this article, don't forget to give us a star on GitHub! The source code for this tutorial is available here:
https://github.com/confident-ai/blog-examples/tree/main/evals-example

Thank you for reading, and till next time 🫑

Top comments (11)

Kuong Ao Ieong

Interesting article, will definitely discuss this with my team. Please continue posting, loving the read!

Nathan Tarbert

Interesting article, a good read!

Jeffrey Ip

Thanks!

George Johnson

If nothing else I certainly learned what GPT means and the basics of how LLMs look smart if you don't know what they're doing as they work. Genuinely interesting stuff.

Jeffrey Ip

Glad you found it helpful :)

Nevo David

I have definitely made a lot of wrong stuff!

Jeffrey Ip

Glad you liked it :)

Jimmy Wong

Excellent introduction to LLMs for anyone who is new to the field.

Ali Zhussupov

That's very useful! Thanks a lot!

Jeffrey Ip

Anytime!

Yuval

Very interesting article