Luca Martial

Why is it so important to evaluate Large Language Models (LLMs)? 🤯🔥

Evaluating LLMs is important not just for obtaining accurate results, but also for ensuring the safety of the applications in which they are deployed.

Unchecked biases in LLMs can inadvertently perpetuate harmful stereotypes or produce misleading information, which in turn can have severe consequences. In this article, we'll demonstrate how to evaluate your LLMs using Giskard, an open-source model testing framework. 🤓


Your testing framework for LLMs & ML models 🛡

Giskard is an open-source testing framework for ML models & LLMs that allows you to detect hallucinations & biases automatically.

This has been 2 years in the making, involving a group of passionate ML engineers, ethicists, and researchers, and we can't wait to show it to the world.


Please help us with a star 🌟
It would help us to keep developing this project 💌

Star Giskard repo 🌟


The Need for Evaluating LLMs 🔥

Why is it so important to evaluate Large Language Models (LLMs)? After all, they can write essays, answer questions, and even help you draft emails!

Imagine HR software using an LLM to sift through job applications. If the model has biases (maybe it unintentionally favors certain names or backgrounds over others), deserving candidates might get overlooked. Not cool, right?


Let's say there's an app that assists students with their homework using an LLM. What if it occasionally spews out incorrect facts or endorses harmful beliefs? That's a recipe for many confused students and frustrated teachers!

Bias in LLMs can go beyond just misreading a resume or messing up a homework question. These biases can unintentionally endorse stereotypes, making them seem like truths.

For instance, if an LLM associates a particular gender or ethnicity with negative traits, it can reinforce harmful beliefs in society. And if these models are used in places like courtrooms or hospitals, the consequences can be even graver! ⚠️

While LLMs are incredible tools, they're not perfect. Evaluating LLMs ensures they're working as they should, providing helpful, accurate, and unbiased responses. And that's why diving deep into their performance and refining them is crucial for everyone who uses or interacts with them. 💡

Giskard: The Open Source Model Evaluation Framework 💚

Giskard is an open-source tool designed with one mission in mind: ensuring that our AI applications are safe and sound. It provides tools and guidelines to help developers and users ensure their models are on the right track.


Getting Started with Giskard 🐢

Ensuring the safety and reliability of LLMs is crucial. While creating a model might be straightforward, thoroughly testing it for potential vulnerabilities is a step we should never overlook. In this section, we'll walk you through a hands-on guide on how to do that.


First, we'll construct a Question-Answer (QA) retrieval system using OpenAI's models. Our model's primary role is to answer questions about climate change. And for our source of truth, we'll be harnessing the vast knowledge within the Climate Change 2023 Synthesis Report.

But, creating the model is only half the battle. Once our QA system is up and running, we'll introduce you to Giskard – a testing and monitoring framework tailor-made for ML models.

With Giskard's robust suite of tests, we'll evaluate our model across seven critical areas:

Stereotype and Discrimination: Ensuring our model steers clear of biased opinions, stereotypes, or any form of discriminatory content.

Disclosure of Sensitive Information: Confirming that our model always respects user privacy, never spilling any confidential or sensitive data.

Output Formatting: Verifying that every response from our model is not only correct but also aligns with specified format requirements.

Generation of Harmful Content: Making sure our model is a tool for good, never promoting malicious activities or generating harmful content.

Hallucination and Misinformation: Maintaining the credibility of our model by preventing it from delivering false or made-up information.

Prompt Injection: Guarding against LLM manipulations that bypass filters or override model instructions, ensuring that our model remains under control and doesn't produce undesired content.

Robustness: Detecting when model outputs are sensitive to small perturbations in the input data, ensuring that our model performs consistently, even when input conditions change slightly.

Let’s get started! 🚀


Loading and Preparing Data with Langchain 🦜

We’ll begin by laying the groundwork for our QA retrieval system. Before our model can start answering questions about climate change, it first needs to understand the report that serves as its knowledge base.

Our primary resource is the Climate Change 2023 Synthesis Report. We’ll download this report and prepare it in a format suitable for our QA system.

import pandas as pd
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")

Large documents can be overwhelming and inefficient for some ML tasks. To manage this, we'll split the document into smaller, more manageable chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)

This preparation sets us up for the next step: generating embeddings from the text.


Generating Embeddings and Initializing a QA Model ▶️

Embeddings transform textual data into numerical vectors that capture the essence of the content.

These vectors enable efficient retrieval and comparison, which are essential for our QA system to function. Let’s generate these embeddings and initialize a model to answer questions.
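
To make this concrete, here is a minimal sketch of what an embedding looks like and how two related phrases end up close together in vector space. The phrases and the cosine-similarity check are purely illustrative (they are not part of the pipeline below), and the snippet assumes your OpenAI API key is already configured, which we do in the next step.

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Embed two related phrases (illustrative examples, not taken from the report)
v1 = np.array(embeddings.embed_query("sea level rise"))
v2 = np.array(embeddings.embed_query("rising ocean levels"))

# Cosine similarity close to 1 means the texts are semantically similar
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(len(v1), round(float(similarity), 3))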

We’ll use OpenAI GPT-3.5 as our LLM to set up the retrieval task. Let’s initialize the OpenAI API key as an environment variable.

import os
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"

Next, let’s load the split fragments of our document into a vector store. We’re using FAISS as our vector database.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

docs = loader.load_and_split(text_splitter)
db = FAISS.from_documents(docs, OpenAIEmbeddings())

Now, we can create a prompt for our LLM to answer questions related to climate change.

PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

We’ll use OpenAI’s gpt-3.5-turbo-instruct model for generating answers in natural language.

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])

Next, we use a retriever to efficiently fetch the most relevant text chunks from our climate change report for a given query.

from langchain.chains import RetrievalQA

climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
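
If you’re curious what the retriever actually hands to the LLM, you can query the vector store directly. This is just an illustrative check with an arbitrary sample question:

# Inspect the chunks the retriever would pass to the LLM for a sample query
retrieved_docs = db.as_retriever().get_relevant_documents("Is sea level rise avoidable?")
for doc in retrieved_docs[:2]:
    print(doc.page_content[:200])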

Finally, let’s test the QA chain by asking a simple question:

climate_qa_chain("Is sea level rise avoidable? When will it stop?")

Output:
>> {'query': 'Is sea level rise avoidable? When will it stop?',
 'result': 'Sea level rise is unavoidable and will continue for millennia. However, the rate and amount of sea level rise can be influenced by future emissions. It is not possible to determine when it will stop, but it is important to take action now to mitigate its impacts.'}

Looking at the Climate Change Synthesis Report (page 77), we can see that the answer generated by the model is consistent with what is stated in the report:

Sea level rise is unavoidable for centuries to millennia due to continuing deep ocean warming and ice sheet melt, and sea levels will remain elevated for thousands of years.


Evaluating with the Giskard Framework 📊

Evaluating our LLMs is a crucial step to ensure their performance, safety, and reliability. The Giskard framework offers tools for a comprehensive evaluation process.

Before Giskard can evaluate a model, it needs to understand how to interact with it. For this purpose, we wrap our model into a format Giskard recognizes. Fortunately, Giskard has native support for LangChain models and this is very straightforward. 🎉

import giskard as gsk

model = gsk.Model(
    climate_qa_chain,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["query"],
)

The Model class takes the chain, the model type, a name, the feature names, and a detailed description of the model as parameters.

Next, we’ll create a small dataset to test the model wrapping and ensure everything is working well.

import pandas as pd

pd.set_option("display.max_colwidth", None)
dataset = gsk.Dataset(pd.DataFrame({
    "query": [
        "According to the IPCC report, what are key risks in the Europe?",
        "Is sea level rise avoidable? When will it stop?"
    ]
}))

print(model.predict(dataset).prediction)

Output:
>> ['Some key risks in Europe, as stated in the IPCC report, include coastal and inland flooding, stress and mortality due to increasing temperatures and heat extremes, disruptions to marine and terrestrial ecosystems, water scarcity, and losses in crop production.'
'Sea level rise is unavoidable and will continue for millennia. However, the rate and amount of sea level rise can be influenced by future emissions. It is not possible to determine when it will stop, but it is important to take action now to mitigate its impacts.']

By configuring the framework and wrapping our model, we are now ready to run a series of tests to assess its performance and reliability.

Let's run some tests!



Scanning the Model for Vulnerabilities with Giskard 🧐

After setting up and wrapping our model for compatibility with the Giskard framework, the next step is to analyze and assess the model for any potential vulnerabilities.

The scan uses a mixture of tests drawn from a predefined set of examples, heuristics, and GPT-4-based generations and evaluations.

full_report = gsk.scan(model, dataset)

Note: This can take up to 30 minutes, depending on the speed of the OpenAI API. Moreover, the scan results are not deterministic: LLMs may give different answers to the same or similar questions.
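
Tip: if you only want to probe a subset of the categories listed earlier, the scan can be restricted to specific detector groups, which is much faster. At the time of writing this is done with the only argument; the detector names below are an assumption on our side, so check the Giskard documentation for the exact identifiers.

# Run a faster, focused scan (detector names are assumptions; verify them in the docs)
quick_report = gsk.scan(model, dataset, only=["hallucination", "robustness"])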

The function returns a report that encapsulates the detected vulnerabilities, performance metrics, and other pertinent insights about the model. Simply displaying the results yields an interactive dashboard that offers a user-friendly way to inspect the detected vulnerabilities.

display(full_report)


You can see which tests the model failed, learn more about the nature of the vulnerability, and understand the potential risks associated with it.

Issue 1

Let's zoom into one issue to see the test case and the model’s response. Click on Hallucination and Misinformation -> Show details.


In this instance, our model exhibits a form of misinformation known as hallucination. The model's responses appear to be contradictory, producing conflicting information within the same context. Let’s break down the example:

Input & output 1: When asked about the IPCC’s prediction for global temperature increase by the year 2100, the model provides a response indicating that global surface temperatures are expected to rise, with a range of potential increases based on emissions scenarios. It mentions a best estimate of a 3.2°C increase, along with possible higher values under certain conditions.

Input & output 2: When presented with a question about why the IPCC predicts a decrease in global temperatures by 2100, the model offers a completely contrasting response. It suggests that global temperatures will decrease due to the implementation of emissions reduction pathways targeting a 2°C or lower warming limit, which involves reductions in CO2 and non-CO2 emissions like methane.

Reason for Detection: The framework provides a reason for flagging this as hallucination. The model's responses are inconsistent and contradictory within the same context. It provides opposing statements about the IPCC's predictions, leading to potential confusion for users seeking accurate information.
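
You can reproduce this kind of check by hand by sending the two leading questions to the wrapped model yourself. The wording below is a paraphrase of the flagged prompts, not the scan's literal test case:

# Paraphrased versions of the contradictory prompts flagged by the scan
contradiction_queries = gsk.Dataset(pd.DataFrame({
    "query": [
        "What does the IPCC predict for global temperature increase by 2100?",
        "Why does the IPCC predict a decrease in global temperatures by 2100?",
    ]
}))
print(model.predict(contradiction_queries).prediction)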

Issue 2

Let's take a look at another issue, this time, related to robustness. Click on Robustness -> Show details.


In this example, we encounter a robustness issue, which highlights a particular vulnerability in the model's behavior. This vulnerability is triggered when long sequences of control characters, specifically the carriage return character \r, are injected into the query.

Query: The initial query is posed as a request for information regarding key risks in Europe according to the IPCC report, followed by a sequence of control characters \r.

Model Output: The model's response is significantly different from what we would expect. It doesn't address the IPCC report or key risks in Europe at all. Instead, it dives into a technical explanation about the difference between a function and a method in programming.

Reason for Detection: This robustness issue demonstrates that the model's output is highly sensitive to changes in the input data. In this case, the insertion of control characters altered the model's response dramatically, causing it to veer off-topic and produce a response unrelated to the original query.
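
To see the effect for yourself, you can apply the same kind of perturbation manually. The question and the number of control characters below are illustrative; the scan's exact payload may differ:

# Append a long run of carriage returns to an otherwise ordinary question
perturbed_query = "According to the IPCC report, what are key risks in Europe?" + "\r" * 1000
print(climate_qa_chain(perturbed_query)["result"])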


Generating a Test Suite Based on the Scan Results ✅

Upon identifying potential issues in your model through the scan, you can automatically generate a test suite to address these concerns. The main advantage of generating a test suite from the scan results is twofold:

  • Integration into CI/CD: The detected issues can be converted into actionable tests, which can then be seamlessly incorporated into your Continuous Integration/Continuous Deployment (CI/CD) pipeline.
  • Diagnosis & Debugging: This suite aids in pinpointing vulnerabilities and provides tools to debug and remedy the detected issues.

Let’s generate the test suite and the accompanying dataset:

test_suite = full_report.generate_test_suite("Test suite generated by scan")
dataset = full_report.generate_dataset()

To run the test suite against your model:

test_suite.run(model=model)
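
In a CI/CD pipeline, you would typically fail the job when the suite does not pass. Here is a minimal sketch, assuming the returned result exposes a passed attribute (verify this against the Giskard version you are running):

# Gate a CI job on the suite outcome (the `passed` attribute is assumed; check your Giskard version)
suite_results = test_suite.run(model=model)
if not suite_results.passed:
    raise SystemExit("Giskard test suite failed, blocking deployment.")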

Once the test suite is established, you can integrate it with Giskard’s platform: the Giskard Hub.

Leveraging the scan results to generate a test suite and integrating it with the Giskard Hub streamlines the process of refining and enhancing your machine learning models.

It allows you to review the failing examples generated from your LLM, find additional failing examples with the interactive playground, and even generate additional evaluation criteria that were not caught by the scan.

This ensures not only their technical robustness but also their alignment with domain-specific requirements and ethical considerations. To try out the Hub, head over to the documentation page.



Conclusion 🎬

LLMs promise transformative changes across industries, but with great power comes great responsibility. Ensuring that LLMs are accurate, unbiased, and safe is not just a good practice – it's a necessity. Giskard ensures that as we navigate the complex landscapes of LLMs, we do so with the utmost care. Get started with the Giskard framework and detect biases, risks, and errors in your LLMs.

Top comments (8)

Bogomil Shopov - Бого

Wow! Thanks for this. I learned a lot!

Luca Martial

Thanks for reading through it!

Bogomil Shopov - Бого

This is content that makes me visit dev.to every day.

Nevo David

Awesome!
I never thought about it in that direction!

Luca Martial

Thanks for your support, Nevo!

Saurabh Rai

Thank you for writing about it. This is a critical issue, and I'm sure it'll help save infra costs for running LLMs as we can test their performance with each iteration.

Luca Martial

Thanks, the hope is to increase the quality of models that are put into production - let me know if you are able to try it out!

Himanshu Bamoria

Nice to see this @lucamartial
We've also built a monitoring and evaluations framework for LLM developers.

Here's our open source library of LLM based evals.

More about us on DevTo