Matt Eland for Leading EDJE

Originally published at blog.leadingedje.com

Tracking AI system performance using AI Evaluation Reports

A few months ago I wrote about how the AI Evaluation Library can help automate evaluating LLM applications. That capability is tremendously helpful in measuring the quality of your AI solutions, but it's only part of the picture when it comes to representing your application's quality. In this article I'll walk through the AI Evaluation Reporting library and show how you can build interactive reports that help share model quality with your whole team, including product managers, testers, developers, and executives.

This article starts with an exploration of the final report and its capabilities, then dives into the handful of lines of C# code needed to generate the report in .NET using the Microsoft.Extensions.AI.Evaluation.Reporting library, before concluding with thoughts on where this capability fits into your day-to-day workflow.

The Extensions AI Evaluation Report

Let's start by taking a look at what we're talking about here: the AI Evaluation Report, showcasing the performance of a series of different evaluators as they grade a sample interaction produced by an LLM application:

AI Evaluation Report showing a series of evaluation results

This particular example features a single scenario where an AI agent is instructed to respond to interactions with humorous haikus related to the topic the user is mentioning:

System Prompt: You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku.

User: I'm learning about AI Evaluation and reporting

Assistant: I grade clever bots, reports spill midnight secrets, robots giggle on.

While not the best interaction, the system did roughly what it was instructed to do, and the report summarizes the strengths and weaknesses of the system in handling this interaction.

Let's talk about how it works.

This "report card" was generated by sending the conversation history to an LLM with instructions on how to evaluate it for different capabilities including coherence, English fluency / grammatical correctness, relevance, truthfulness, and completeness.

Communication diagram showing the different LLMs being used to generate and evaluate replies

This evaluation is performed by an LLM using specially prepared prompts built for grading this kind of interaction. The evaluation LLM can be the same model you used for the conversation, or it can be a different one entirely.

The results of this evaluation are persisted in a data store (such as on disk or in Azure) and are available to help show trends over time as well as to generate periodic reports in HTML format.

Because the evaluation report is an HTML document, it allows for some interactive features. For example, you can click into a particular evaluator and see the details of its evaluation, as shown here for the Fluency evaluator:

Evaluation Details for the Fluency Metric

Here we can see the fluency evaluator giving the response a middling score for English fluency, likely because the fluency evaluator is designed more for conversational English and articles than for haikus like the one our bot generates.

Note that we can also see specific metrics on the tokens used, the amount of time taken, and the specific model used for evaluation.

Implementing an AI Evaluation Report in .NET

There are a few more aspects of this evaluation report we'll highlight, and we'll talk later about the role this report plays in your organization, but for now let's talk about how to generate it.

In this section I'll walk through the C# code needed to generate the report shown here in this article.

This code is taken directly from my GitHub repository and is specifically inside of the EvaluationReportGeneration project.

Connecting to Chat and Evaluation Models

The first thing our application needs is a chat client for AI evaluation as well as one for chat completions. I'll set these up here with two OpenAIClient objects representing our chat and evaluation models:

// Load Settings
ReportGenerationDemoSettings settings = ConfigurationHelpers.LoadSettings<ReportGenerationDemoSettings>(args);

// Connect to OpenAI
OpenAIClientOptions options = new()
{
    Endpoint = new Uri(settings.OpenAIEndpoint)
};
ApiKeyCredential key = new ApiKeyCredential(settings.OpenAIKey);
IChatClient evalClient = new OpenAIClient(key, options)
    .GetChatClient(settings.EvaluationModelName)
    .AsIChatClient();
IChatClient chatClient = new OpenAIClient(key, options)
    .GetChatClient(settings.ChatModelName)
    .AsIChatClient();

You can connect chat and evaluation to any model provider with an IChatClient implementation; implementations are available or in preview for all major model providers, including OpenAI, Azure, Ollama, Anthropic, and more.
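As a quick illustration, here's a minimal sketch of what swapping in other providers could look like. The endpoint, deployment name, and model name below are placeholders, and the packages noted in the comments reflect the preview state at the time of writing:

// Ollama via the Microsoft.Extensions.AI.Ollama preview package (placeholder endpoint and model)
IChatClient localChatClient = new OllamaChatClient(
    new Uri("http://localhost:11434"),
    modelId: "llama3.1");

// Azure OpenAI via Azure.AI.OpenAI + Microsoft.Extensions.AI.OpenAI
// (placeholder resource and deployment; DefaultAzureCredential comes from Azure.Identity)
IChatClient azureChatClient = new AzureOpenAIClient(
        new Uri("https://my-resource.openai.azure.com"),
        new DefaultAzureCredential())
    .GetChatClient("gpt-4o")
    .AsIChatClient();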

In this article I'm using o3-mini as my chat model generating the responses and gpt-4o as the evaluation model (the model Microsoft currently recommends for evaluation as of this writing).
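Both of those model names come from the settings object loaded at the start of the program. For reference, here's a hypothetical sketch of what that settings class might look like, inferred purely from the properties used in the snippet above rather than copied from the repository:

// Hypothetical shape of the settings class, inferred from the properties used above;
// the real class lives in the EvaluationReportGeneration project in the GitHub repository.
public class ReportGenerationDemoSettings
{
    public required string OpenAIEndpoint { get; set; }
    public required string OpenAIKey { get; set; }
    public required string ChatModelName { get; set; }        // e.g. "o3-mini"
    public required string EvaluationModelName { get; set; }  // e.g. "gpt-4o"
}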

Creating a Report Configuration

Now that we've got our chat clients ready, our next step is to create a ReportingConfiguration which will store the raw metrics and conversations that are evaluated over time. This helps in centralizing reporting data and in building trends over time in reports.

There are currently two supported default options for this: DiskBasedReportingConfiguration, which stores data on disk in a location you specify, and the AzureStorageReportingConfiguration option found in the Microsoft.Extensions.AI.Evaluation.Reporting.Azure package.
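As a rough sketch of what the Azure-backed option might look like (the DataLakeDirectoryClient usage, parameter names, and storage URL below are assumptions on my part rather than something verified against the shipped package):

// Hypothetical sketch: store evaluation data in Azure storage instead of on disk.
// The exact Create overload and parameter names may differ in the shipped
// Microsoft.Extensions.AI.Evaluation.Reporting.Azure package, and the URL is a placeholder.
DataLakeDirectoryClient directoryClient = new(
    new Uri("https://mystorageaccount.dfs.core.windows.net/ai-evaluation/haiku-bot"),
    new DefaultAzureCredential()); // DefaultAzureCredential comes from Azure.Identity

ReportingConfiguration azureReportConfig = AzureStorageReportingConfiguration.Create(
    directoryClient,
    evaluators: [new CoherenceEvaluator(), new FluencyEvaluator()],
    chatConfiguration: new ChatConfiguration(evalClient),
    executionName: $"{DateTime.Now:yyyyMMddTHHmmss}");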

We'll go with the disk-based configuration in this sample because it's far simpler to configure:

// Set up reporting configuration to store results on disk
ReportingConfiguration reportConfig = DiskBasedReportingConfiguration.Create(
    storageRootPath: @"C:\dev\Reporting",
    chatConfiguration: new ChatConfiguration(evalClient),
    executionName: $"{DateTime.Now:yyyyMMddTHHmmss}",
    evaluators:
    [
        new RelevanceTruthAndCompletenessEvaluator(),
        new CoherenceEvaluator(),
        new FluencyEvaluator()
    ]);

Here we create our ReportingConfiguration by telling it:

  • Where to store the raw report metrics on disk (not the location for the generated report file)
  • Which chat connection it should use to evaluate the interactions
  • A unique name for the evaluation run. This is used to generate folder names, so only certain characters are allowed.
  • One or more IEvaluator objects to use in generating evaluation metrics. This is equivalent to using a CompositeEvaluator like I demonstrated in my prior article.

More on Evaluators: If you're looking for more detail on the various evaluators you can use or how they work, I go into each of these evaluators more in my article on MEAI Evaluation.

You can also specify tags that apply to your entire evaluation run here, but I'll cover tags in a future article.

Defining a Scenario Run

Evaluation reports have one or more scenario runs associated with them, each representing a specific test case.

We'll create a single "Joke Haiku Bot" scenario for our purposes here:

// Start a scenario run to capture results for this scenario
await using (ScenarioRun run = await reportConfig.CreateScenarioRunAsync(
    scenarioName: "Joke Haiku Bot"))
{
    // Contents detailed in next few snippets...
}

Note that we're using an await using around the whole context of our ScenarioRun object. This makes sure the run is properly disposed, which causes its metrics to be reported to the reporting configuration object and persisted to disk.

If we had additional scenarios, we could define each one sequentially so that we're aggregating our evaluation results into a single report. In this article we'll keep things simple and look only at a single case, but in our next article in the series I'll cover iteration, experimentation, and multiple scenarios.
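As a rough sketch of what running several scenarios sequentially could look like (the extra scenario names and user prompts below are purely illustrative, not from the sample project):

// Hypothetical sketch: run several scenarios against the same reporting configuration
// so they all aggregate into one report. Scenario names and prompts are illustrative only.
(string Name, string UserText)[] scenarios =
[
    ("Joke Haiku Bot - Evaluation", "I'm learning about AI Evaluation and reporting"),
    ("Joke Haiku Bot - Coffee", "I drank way too much coffee today")
];

foreach ((string scenarioName, string userText) in scenarios)
{
    await using (ScenarioRun run = await reportConfig.CreateScenarioRunAsync(scenarioName: scenarioName))
    {
        List<ChatMessage> messages =
        [
            new(ChatRole.System, "You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku."),
            new(ChatRole.User, userText)
        ];

        ChatResponse response = await chatClient.GetResponseAsync(messages);
        await run.EvaluateAsync(messages, response);
    }
}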

Important Note: It's important that any ScenarioRun objects you're using for your evaluation are disposed before you use their evaluation metrics to generate a report. This is why I'm using the await using syntax here, as well as explicitly declaring the scope of the object instead of using the newer "scopeless" style of declaring the object in a using statement.

Getting and Evaluating a Response

Now that we have an active ScenarioRun object we need a list of ChatMessage objects to send to the chat model:

string systemPrompt = "You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku.";
string userText = "I'm learning about AI Evaluation and reporting";

List<ChatMessage> messages = [
    new(ChatRole.System, systemPrompt),
    new(ChatRole.User, userText)
];

With that in place, we send the messages to the chat model using our chat client and get back a ChatResponse:

// Use our CHAT model to generate a response
ChatResponse response = await chatClient.GetResponseAsync(messages);

This particular example is using the IChatClient defined in the Microsoft.Extensions.AI (MEAI) package to do this, but you could use something else such as Semantic Kernel or another library, or even just hard-code a chat response you've observed in the wild.
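For example, if you wanted to evaluate a canned response you'd captured from production instead of calling a live model, a minimal sketch (assuming the ChatResponse constructor that wraps a single message) could look like this:

// Hypothetical sketch: wrap a previously observed reply in a ChatResponse so it can be
// evaluated without calling the chat model at all.
ChatResponse cannedResponse = new(new ChatMessage(
    ChatRole.Assistant,
    "I grade clever bots, reports spill midnight secrets, robots giggle on."));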

Once we have our list of messages and the model's response, we can send both of them to our ScenarioRun for evaluation with a single line of code:

// Use the EVALUATION model to grade the response using our evaluators
await run.EvaluateAsync(messages, response);

This call returns an EvaluationResult object if you want to look at the immediate output of the evaluation, but the results will also be persisted to our reporting configuration, so we don't need to take immediate action on them.
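If you do want to inspect that immediate output, a minimal sketch (assuming EvaluationResult exposes its metrics as a dictionary of named EvaluationMetric objects, with NumericMetric carrying the numeric scores) might look like this:

// Hypothetical sketch: capture and log the immediate evaluation output.
// Assumes EvaluationResult.Metrics is a dictionary of named EvaluationMetric objects.
EvaluationResult result = await run.EvaluateAsync(messages, response);

foreach (EvaluationMetric metric in result.Metrics.Values)
{
    // NumericMetric carries a numeric score; other metric types carry strings or booleans
    if (metric is NumericMetric numericMetric)
    {
        Console.WriteLine($"{metric.Name}: {numericMetric.Value}");
    }
}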

Generating an AI Evaluation Report

We've now created our reporting configuration, started a scenario, gotten a chat response, and then used our evaluators to grade it. Let's talk about actually building an HTML report from our evaluation data.

The first step of this is to identify the data that should be included in our report.

While it may seem like we already have that data, the evaluation report can show the trends of your different evaluations over time, which can be handy for seeing how your experiments impact the system's overall quality.

Trends over time showing some fluctuation in the overall metrics for a scenario

I typically include the last 5 results in my reports, and use this snippet to grab that data from my reporting configuration:

// Enumerate the last 5 executions and add them to our list we'll use for reporting
List<ScenarioRunResult> results = [];
await foreach (var name in reportConfig.ResultStore.GetLatestExecutionNamesAsync(count: 5))
{
    await foreach (var result in reportConfig.ResultStore.ReadResultsAsync(name))
    {
        results.Add(result);
    }
}

Next, we'll use these results from our scenarios to generate the output report file. Reports can be written in JSON format or in HTML. I'll typically choose the HTML option because those reports include an option to export the underlying JSON if you need it.

The code for this is fairly simple:

string reportFilePath = Path.Combine(Environment.CurrentDirectory, "Report.html");

IEvaluationReportWriter writer = new HtmlReportWriter(reportFilePath);

await writer.WriteReportAsync(results);

This generates a new report in the Report.html file we specified. You can then open up that file manually and see the results, or you can start a process to open this report in your default web browser:

Process.Start(new ProcessStartInfo
{
    FileName = reportFilePath,
    UseShellExecute = true
});

When this executes, the operating system handles the report just as if the user had double-clicked the file in their file system - potentially opening a web browser or asking them what action they'd like to take with this type of file.

Practical uses for AI Evaluation Reports

Now that we've covered AI Evaluation reports and how to generate them using C#, let's close this article with a discussion of how this technology potentially fits into your workflow.

First of all, if you're looking for a way of evaluating your AI systems, AI Evaluation reports are a fantastic option, even for a solo developer trying to understand the performance of their hobby projects. The graphical reports and the ability to click into details are much easier to work with than the raw EvaluationResult objects and their nested metric objects.

For more serious usage, AI Evaluation has some tremendous merit because it equips you to share something graphical with others to help them understand how your application behaves with different implementations. Instead of having conversations about your models being "good" or "not good enough", you can have targeted conversations about the specific interactions your system is succeeding with and those it is struggling with.

Because these HTML files are interactive and intuitive, this technology enables people in your organization to explore the examples on their own and internalize more of the system's strengths and weaknesses. In a nutshell, these reports make it easy to see and share information about the state of your AI systems.

I view AI evaluation as a vital part of integration testing and the MLOps process prior to any new deployment - or potentially even as a gate that blocks feature branches from rejoining the main product branch as part of the pull request review process. Having a graphical report to go with it can help you understand the trends and performance of your models over time and how different changes impact that performance.
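As a rough sketch of what that gating could look like in practice, here's a hypothetical xUnit integration test that fails when a key metric drops below a threshold. The _reportConfig and _chatClient fields, the "Coherence" metric name, and the 1-5 scoring scale are all assumptions on my part:

// Hypothetical sketch: fail an integration test when evaluated coherence drops below a floor.
// _reportConfig (ReportingConfiguration) and _chatClient (IChatClient) are assumed test fixture fields.
[Fact]
public async Task HaikuBotResponsesStayCoherent()
{
    List<ChatMessage> messages =
    [
        new(ChatRole.System, "You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku."),
        new(ChatRole.User, "I'm learning about AI Evaluation and reporting")
    ];

    EvaluationResult result;
    await using (ScenarioRun run = await _reportConfig.CreateScenarioRunAsync(scenarioName: "Joke Haiku Bot"))
    {
        ChatResponse response = await _chatClient.GetResponseAsync(messages);
        result = await run.EvaluateAsync(messages, response);
    }

    // Assumes the CoherenceEvaluator emits a NumericMetric named "Coherence" scored 1-5
    NumericMetric coherence = result.Metrics.Values.OfType<NumericMetric>()
        .Single(m => m.Name == "Coherence");

    Assert.True(coherence.Value >= 3, $"Coherence dropped to {coherence.Value}");
}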

Final Recommendations

AI Evaluation and evaluation reporting are important aspects of your team's success in its AI offerings.

Here are some closing recommendations I have when adopting AI evaluation tooling into your organization:

  1. AI Evaluation and evaluation reports are key parts of any significant release that updates the behavior of an AI agent and should be part of your quality assurance and product management efforts.
  2. Automating AI Evaluation as part of your integration tests is worth the effort. You can optionally have a significant degradation in evaluated quality fail your tests when they run as actual integration tests (not covered in this article, but I plan to write more on this in the future).
  3. The quality of your evaluation model matters. It's worth using a more capable model for this as it's more likely to grasp the full context of the request and the response that was generated.
  4. Having automated evaluation in place frees you up to do more experimentation around your system prompts, model selection, and other parameters and settings. Make sure you have this automation in place to collect metrics before doing serious performance tuning of your models as these metrics can help guide your decision-making and refinement process.
  5. Store your model metrics in a centralized location for tracking over time. I recommend a dedicated shared location just for release candidates as well as local metrics storage for developers during development and testing.
  6. The resulting HTML reports should be shared with your entire team, including organizational leadership, quality assurance, and product owners. This practice helps cut through the hype and fear around AI systems and allows your full team to more meaningfully understand what your system is good and bad at.
  7. Just because your metrics are high doesn't mean that your system is performing well. It just means it's performing well for the interactions you're measuring and observing.
  8. As your system grows, its evaluation suite should grow over time as well. As you find new interactions it struggles with or add new capabilities to the system, you should be adding in new scenarios to represent these capabilities.

I view AI Evaluation as a vital part of developing AI systems, and evaluation reports make those systems so much more understandable to your whole team.
