Intro:
As copilots become more central to enterprise workflows, evaluating their responses with precision is no longer optional — it's essential. The new Agent Evaluation Preview capability in Microsoft Copilot Studio introduces a structured way to test and validate your copilots using test cases that simulate real-world prompts and expected answers. While manual testing and gut feel can help during early development, they don’t scale. Structured evaluation:
- Enables repeatable, automated testing
- Surfaces regressions early
- Helps teams align on what “good” looks like
- Supports responsible AI by making quality measurable
Let me walk you through the different test types, when to use them, and how to avoid common pitfalls, all based on hands-on experience from a few hours of testing with a couple of agents.
Setting Up the Evaluation:
- Navigate to the Analytics panel in your Copilot Studio environment.
- Look for the Evaluate button in the top-right corner and select it.
- Click + New test set to begin creating your evaluation set.
Each test case requires three key inputs:
- Prompt: The user question or input you want to test.
- Expected response: The ideal or reference answer you want the copilot to return.
- Test type: The evaluation method to apply (e.g., ExactMatch, PartialMatch, CompareMeaning).

You can author these test cases in a .csv file and import them in bulk, or create them manually within the UI; a sample file is sketched below.
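The exact column names the import expects may differ from this illustration (treat the headers and the AND syntax as assumptions and verify them against the template Copilot Studio provides), but a test set file generally boils down to something like:

```csv
Prompt,Expected response,Test type
"Is an itemized receipt required for reimbursement?","Yes",ExactMatch
"Which approvals do I need for a large expense?","itemized receipt AND manager approval",PartialMatch
"What is the travel expense policy?","Expenses up to $50 are approved automatically; larger amounts require manager approval.",CompareMeaning
```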
Test Types at a Glance:
Here’s a breakdown of the five test types currently supported in Copilot Studio’s evaluation engine:
| Test Type | Under the Hood | Purpose / Why it Matters | When to Use | Pass Rule (Suggested) | Why It Might Fail | Advantages | Best Practices |
|---|---|---|---|---|---|---|---|
| ExactMatch | Literal string equality (sensitive to whitespace, order, punctuation) | Verifies deterministic outputs exactly | Single values (yes/no, IDs, short codes) or fixed templates | Normalized actual == expected | Any rephrasing, reordering, formatting, citations, encoding artifacts | Unambiguous when it passes | Keep answers short; normalize aggressively; avoid for multi-sentence policy/FAQ; sort lists before compare |
| PartialMatch | Keyword/phrase spotting with AND/OR logic | Confirms must-have facts appear in longer answers | Numbers, names, policy terms, disclaimers | All required phrases present per AND/OR | Using full sentences as “expected”; synonyms/paraphrases not listed | Simple, predictable; good safety net for critical facts | Use short phrases; add variants with OR; 2–3 critical AND terms; case-insensitive; strip punctuation |
| CompareMeaning | Semantic/rubric scoring (e.g., 0–100) on coverage of key points | Mirrors user judgment: did it cover the important points? | Multi-part policy/FAQ answers | Score ≥ 70–75 (flag 70–80 for review if high risk) | Threshold too high; minor omissions drop score | Tolerant of paraphrase and order; penalizes substantive gaps | Keep expected concise; define critical points; pair with PartialMatch for numbers/terms |
| Similarity | Embedding cosine similarity (0–1) for meaning/wording closeness | Robust automated signal for overall alignment | Policy/FAQ, multi-sentence answers | Threshold 0.80 (review 0.75–0.85 band) | Borderline (e.g., 0.79) near miss; high score can mask a small factual error | Handles rewording and order changes well | Strip boilerplate/citations; keep expected crisp; pair with PartialMatch for must-have facts |
| GeneralQuality | Rubric on relevance, completeness, grounding (no direct compare) | Captures user-facing usefulness | Open-ended UX checks; when no canonical answer | Meets rubric (e.g., relevant, complete, grounded) | May pass answers missing subtle constraints; human inconsistency | Reflects practical usefulness | Define a clear rubric; use as supplemental signal; pair with a semantic metric for accuracy |
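To make the "Under the Hood" column more concrete, here is a rough Python sketch of how ExactMatch- and PartialMatch-style checks could work. This is not Copilot Studio's actual engine (the function names and normalization rules are my own), but it shows why aggressive normalization and short keyword phrases make such a difference:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_match(actual: str, expected: str) -> bool:
    """Literal equality after normalization; any rephrasing still fails."""
    return normalize(actual) == normalize(expected)

def partial_match(actual: str, required_all: list[str], required_any: list[str] = ()) -> bool:
    """All AND terms must appear; at least one OR variant must appear (if any are given)."""
    haystack = normalize(actual)
    ands_ok = all(normalize(term) in haystack for term in required_all)
    ors_ok = not required_any or any(normalize(term) in haystack for term in required_any)
    return ands_ok and ors_ok

answer = "Expenses are reimbursed once you attach an itemized receipt and obtain manager approval."
print(exact_match(answer, "Itemized receipt and manager approval"))     # False: rephrasing breaks it
print(partial_match(answer, ["itemized receipt", "manager approval"]))  # True: must-have facts present
print(partial_match(answer, ["receipt"], ["reimbursed", "refunded"]))   # True: one OR variant matched
```

Similarity and CompareMeaning swap the substring check for an embedding or rubric score, which is why they tolerate paraphrase and reordering but need a sensibly chosen threshold.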
Common Pitfalls & How to Fix Them:
Even well-authored test cases can fail for subtle reasons. Here are some common issues and how to resolve them:
| Symptom | Likely Cause | Fix |
|---|---|---|
| “No matching phrases” on PartialMatch | Expected text too long; synonyms used; minor wording differences | Switch to short keywords; add OR variants; normalize case/punctuation |
| ExactMatch fails on “almost identical” | Formatting, boilerplate, citations, or encoding differences | Enable normalization; strip boilerplate/citations; avoid ExactMatch for long outputs |
| Similarity score is 0.79 but seems fine | Borderline vs threshold; minor omission | Lower threshold slightly; add PartialMatch for must-haves; or accept via CompareMeaning |
| CompareMeaning ~75 but marked as Fail | Threshold set too high | Set pass at ≥70–75; flag 70–80 for review on high-risk topics |
| GeneralQuality passes but a must-have is missing | Rubric is tolerant; no hard fact check | Add PartialMatch keyword for the must-have (e.g., “itemized receipt”, “manager approval”) |
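Several of these fixes come down to where you draw the pass line. A small triage helper like the sketch below (the band values are the suggestions from the tables above, not product defaults) turns borderline scores into review items instead of hard failures:

```python
def triage_similarity(score: float, pass_at: float = 0.80,
                      review_band: tuple[float, float] = (0.75, 0.85)) -> str:
    """Classify an embedding-similarity score (0-1) into pass / review / fail."""
    low, high = review_band
    if low <= score <= high:
        return "review"                      # e.g. 0.79: have a human look instead of failing outright
    return "pass" if score >= pass_at else "fail"

def triage_compare_meaning(score: int, pass_at: int = 70, high_risk: bool = False) -> str:
    """Classify a 0-100 rubric score; on high-risk topics, flag 70-80 for human review."""
    if high_risk and pass_at <= score <= 80:
        return "review"
    return "pass" if score >= pass_at else "fail"

print(triage_similarity(0.79))                     # review, not fail
print(triage_compare_meaning(75))                  # pass with the threshold at 70
print(triage_compare_meaning(75, high_risk=True))  # review: borderline on a high-risk topic
```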
Note: Responses that include Adaptive Cards currently do not evaluate correctly. The evaluation engine appears to treat the card payload as raw JSON, which may not align with the expected text format — even if the visual output looks correct to the user. For now, avoid using Adaptive Card responses in test cases unless you're explicitly testing the raw JSON structure.
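To illustrate: a response rendered from even a simple Adaptive Card reaches the comparison as something like the payload below (a generic illustration, not output from a real agent), so a plain-text expected response has very little to match against.

```json
{
  "type": "AdaptiveCard",
  "version": "1.5",
  "body": [
    { "type": "TextBlock", "text": "Your refund limit is $50." }
  ]
}
```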
Final Thoughts:
The Agent Evaluation feature (preview) in Copilot Studio marks a significant step toward measurable quality in conversational AI. It brings much-needed structure to what was once a subjective process and empowers teams to scale their copilots with greater confidence.
That said, the feature is still maturing — both in terms of tooling flexibility (e.g., configuration levers, thresholds, normalization options) and the user enablement required to use it effectively. Without clear guidance, even experienced makers may struggle to choose the right test type or interpret borderline results.
As the ecosystem evolves, I hope to see:
- More intuitive authoring experiences
- Built-in recommendations for test design
- Deeper integration with CI/CD pipelines

