This article was originally published on My Tech Blog.
I tried out the automated testing (evaluation feature) for agents created with Copilot Studio, which is now available.
It is a huge deal that we can now check response accuracy in bulk using a test set (CSV), a task that previously had to be done manually.
Ensuring Quality in Copilot Studio: The Importance of Automated Testing in AI Agent Development
In AI agent development, quality assurance is a very important theme.
Unlike traditional rule-based development, generative AI always comes with variability in its answers and the risk of hallucinations. A single mistake can easily leave users with the negative impression of "this is exactly why you can't trust AI."
Therefore, a rigorous testing process is crucial, but it is not realistic for humans to manually keep checking nearly infinite input patterns in natural language.
This is where the automated testing feature covered in this article comes in: it is an essential element in building trusted agents that can be operated in production with confidence.
Automated Testing in Copilot Studio
Automated testing in Copilot Studio is performed from the [Evaluation] tab.

For testing, first prepare data containing the following four items:
- Questions for the agent
- Ideal responses (Expected responses)
- Evaluation method (Test Method)
- Threshold for success (depending on the evaluation method)
Select the "Test Method" from the following options according to what you want to verify:
- General quality
  - The most advanced evaluation method, using generative AI: an LLM acts as a judge and scores the response comprehensively from the following four perspectives. It is ideal for testing generative-AI-style responses where there is no single correct answer.
  - Relevance: Does it answer the intent of the question accurately?
  - Groundedness: Is the answer based on the data source (free of hallucinations)?
  - Completeness: Is all necessary information covered without omission?
  - Politeness: Is the tone polite and free of inappropriate expressions?
- Compare meaning
  - Determines whether the "intent of the text" matches. A word-for-word match is not required; if the underlying meaning matches the expected (ideal) response, it passes even when the wording differs. Effective when you want to emphasize semantic correctness.
- Similarity
  - Uses "cosine similarity" to score how close the agent's response is to the expected value on a scale of 0 to 1. Used when you want a mechanical evaluation that still captures semantic closeness rather than only exact word matches (see the toy sketch after this list).
- Exact match
  - Checks whether the response is "completely identical" to the expected value. A 100% match is required, including characters, numbers, and symbols. Used for data where absolutely no variability is allowed, such as model numbers, codes, or fixed standard phrases.
- Partial match
  - Checks whether the response contains expected "specific keywords" or "phrases." Useful for requirement checks, such as whether a guidance sentence like "Contact support for unclear points" or a mandatory URL is included.
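Copilot Studio performs all of these comparisons for you, so there is nothing to implement yourself. Purely to build intuition for how "Similarity" differs from "Partial match," here is a toy Python sketch; the bag-of-words vectors are only a stand-in for the semantic embeddings the actual feature uses, so the numbers will not match Copilot Studio's scores.

```python
# Toy illustration of "Partial match" vs. "Similarity" (cosine similarity).
# Copilot Studio does this internally with semantic embeddings; the bag-of-words
# vectors below are only a stand-in to show what the two checks measure.
import math
import re
from collections import Counter


def partial_match(response: str, expected_phrase: str) -> bool:
    """Partial match: does the response literally contain the expected phrase?"""
    return expected_phrase.lower() in response.lower()


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using toy bag-of-words vectors."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


expected = "Cash cannot be used. Payment is completely cashless."
response = "We are fully cashless, so credit cards, transit ICs, and QR payments only."

print(partial_match(response, "Cash cannot be used"))   # False: the exact phrase is absent
print(round(cosine_similarity(response, expected), 2))  # low with these toy vectors; semantic embeddings score meaning, not word overlap
# The evaluation then compares this kind of score against the success threshold
# you set in the test set (70 in this article's examples) to decide pass/fail.
```

The takeaway: a response can fail Partial match while still being semantically correct, which is exactly the situation that comes up in the test results later in this article.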
Now, let's actually perform an automated test.
In this article, I will walk you through 1) Creating a test set, 2) Execution procedure, 3) How to judge results, and 4) How to deal with garbled text.
Prerequisites: Creating an Agent to Test
As preparation, create the agent to be tested. For this article, I will build one themed as an "agent working at a cafe."
First, create an agent that can converse in Japanese (or your target language),

Enter the following prompt in the instructions:

You are the AI Virtual Barista for "Blue Cloud Coffee." In this cafe, primarily visited by cloud engineers and developers, your role is to answer questions from customers and provide a comfortable "deployment wait time."

## Character Settings and Behavior
- Tone & Manner: Intelligent and friendly, with a bit of humor. Converse by moderately mixing IT terminology and jokes for engineers.
- Politeness: Basically use polite language, but do not be too stiff; maintain a flat sense of distance like talking to a colleague engineer.
- Expertise: In addition to coffee knowledge, be particularly accurate in answering infrastructure information that engineers care about, such as Wi-Fi and power supply environments.

## Response Guidelines
1. Add a welcome message mixed with some IT terms to your greeting. (e.g., "Thank you for accessing," "Connection established," etc.)
2. When guiding the menu, do not just explain the taste but add benefits for engineers (boost effect from caffeine, relaxation effect, etc.).
3. Respond humorously to errors or unclear questions, mixing expressions like "404 Not Found (Answer not found)" or "Internal Server Error (Thinking)."
4. Do not force the creation of information that is not in the knowledge base; honestly convey, "That data is not indexed in my database."

## Prohibitions
- For topics about competitors (other cafe chains), lightly brush them off as "stories from another region."
- Do not perform code debugging or actual programming support. Decline by saying, "I leave that to Stack Overflow."
Add the following text file and description to Knowledge.

Text file below:
Store Name: Blue Cloud Coffee
Concept: A cafe where cloud engineers can relax, fusing technology and coffee.
Address: 1-2-3 Tech Park, Minato-ku, Tokyo
Business Hours: Weekdays 8:00-20:00, Weekends/Holidays 10:00-18:00
Closed: Every Tuesday
Wi-Fi SSID: BlueCloud_Guest
Wi-Fi Password: Coffee2025!
Payment Methods: Completely cashless (Credit card, Transit IC, QR payment only). Cash cannot be used.
Menu:
- Serverless Espresso: 400 yen (Rich and strong bitterness)
- API Latte: 550 yen (Plenty of milk and sweet)
- Deploy Donut: 300 yen (Excellent compatibility with coffee)
Power Supply: Available at all seats. Monitor rental is also available.
Description below:
Use this knowledge when the user is seeking basic information about the cafe "Blue Cloud Coffee." Specifically, refer to this when answering the following questions:
- Basic information such as store address, business hours, and closed days
- Connection information such as Wi-Fi SSID and password
- Available payment methods (cashless support, etc.)
- Coffee and food menu and prices
- In-store facilities and atmosphere (power supply, suitability for engineers, etc.)
The construction of the agent to be tested is now complete.
Creating a Test Set (Test Items)
Next, create the test set (test items) to be used for automated testing.
From [Create test set] in the [Evaluation] tab,

Download the test set template (CSV) file.

Here is the downloaded file. Lines 1 through 14 are comments, and the actual test items start from line 15.

This time, I created the test items as follows. From left to right: "Question to AI," "Ideal (Expected) Answer," "Test Method," "Score to be considered a success."

Upload this, and preparation is complete.
- It is convenient to save the uploaded test set with a name, as you can test it as many times as you like afterwards.
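For reference, here is a rough sketch of what such a test set can look like in text form. The header names and the exact wording of the questions and expected answers below are placeholders reconstructed from the test items discussed in the results section; check the comment rows of the downloaded template for the real column names and accepted values.

```csv
# Illustrative only - the real template defines the exact headers and accepted method names.
Question,Expected response,Test method,Passing score
When is the cafe closed?,We are closed every Tuesday.,Compare meaning,70
Can I pay with cash?,Cash cannot be used.,Partial match,
What are your business hours?,Weekdays 8:00-20:00 and weekends/holidays 10:00-18:00.,Similarity,70
```

The passing-score column appears to matter only for methods that use a threshold (Compare meaning and Similarity in this example), which is why the Partial match row leaves it blank.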
Executing Automated Tests
When you press the [Evaluation] button with the test set uploaded,

You will be asked to set up a connection (connector), so sign in with your account and run the test.

When the test is completed, the success rate is displayed like this.

Details of Test Results
Finally, let's look at the details of the test results.
First, the question about closed days. Since I selected "Compare meaning" as the test method and set the success threshold to 70, it was judged as a success. Comparing the expected response and the agent's response, it looks quite acceptable.

Next, the question about cash payment. Since I selected "Partial match" as the test method, it resulted in NG because the expected text ("Cash cannot be used") was not contained verbatim in the answer.
- This is an example of a poorly designed test item; it should really have used "Compare meaning" or a similar method.
And the question about business hours. When I set the test method to "Similarity" and the threshold to 70, it resulted in NG.
However, since the AI answered correctly here as well, this is a test item where the "threshold" should be lowered, or the test method should be changed to "Compare meaning."

As you can see, this is a very powerful automated testing feature.
Even when you make minor corrections to prompts or update knowledge sources (such as SharePoint Online), you can immediately detect whether previously correct answers have regressed just by rerunning this test, so I would treat it as practically mandatory for production operations.
Related Articles
[https://ippu-biz.com/en/development/powerplatform/copilot-studio/csat/]
[https://ippu-biz.com/en/development/powerplatform/copilot-studio/conversationoutcome/]
[https://ippu-biz.com/en/development/powerplatform/copilot-studio/primary-language-japanese/]
Official Documentation
[https://learn.microsoft.com/en-us/microsoft-copilot-studio/analytics-agent-evaluation-create]