Intro:
As copilots become more central to enterprise workflows, evaluating their responses with precision is no longer optional — it's essential. The new Agent Evaluation Preview capability in Microsoft Copilot Studio introduces a structured way to test and validate your copilots using test cases that simulate real-world prompts and expected answers. While manual testing and gut feel can help during early development, they don’t scale. Structured evaluation:
- Enables repeatable, automated testing
- Surfaces regressions early
- Helps teams align on what “good” looks like
- Supports responsible AI by making quality measurable
Let me walk you through the different test types, when to use them, and how to avoid common pitfalls, all based on hands-on experience from a few hours of testing with a couple of agents.
Setting Up the Evaluation:
- Navigate to the Analytics panel in your Copilot Studio environment.
- Look for the Evaluate button in the top-right corner and select it.
- Click + New test set to begin creating your evaluation set.
Each test case requires three key inputs:
- Prompt: The user question or input you want to test.
- Expected response: The ideal or reference answer you want the copilot to return.
- Test type: The evaluation method to apply (e.g., ExactMatch, PartialMatch, CompareMeaning).

You can author these test cases in a .csv file and import them in bulk, or create them manually within the UI; a sample file is sketched below.
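The exact column names the import expects may differ from this illustration (treat the headers and the AND syntax as assumptions and verify them against the template Copilot Studio provides), but a test set file generally boils down to something like:

```csv
Prompt,Expected response,Test type
"Is an itemized receipt required for reimbursement?","Yes",ExactMatch
"Which approvals do I need for a large expense?","itemized receipt AND manager approval",PartialMatch
"What is the travel expense policy?","Expenses up to $50 are approved automatically; larger amounts require manager approval.",CompareMeaning
```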
Test Types at a Glance:
Here’s a breakdown of the five test types currently supported in Copilot Studio’s evaluation engine:
| Test Type | Under the Hood | Purpose / Why it Matters | When to Use | Pass Rule (Suggested) | Why It Might Fail | Advantages | Best Practices |
|---|---|---|---|---|---|---|---|
| ExactMatch | Literal string equality (sensitive to whitespace, order, punctuation) | Verifies deterministic outputs exactly | Single values (yes/no, IDs, short codes) or fixed templates | Normalized actual == expected | Any rephrasing, reordering, formatting, citations, encoding artifacts | Unambiguous when it passes | Keep answers short; normalize aggressively; avoid for multi-sentence policy/FAQ; sort lists before compare |
| PartialMatch | Keyword/phrase spotting with AND/OR logic | Confirms must-have facts appear in longer answers | Numbers, names, policy terms, disclaimers | All required phrases present per AND/OR | Using full sentences as “expected”; synonyms/paraphrases not listed | Simple, predictable; good safety net for critical facts | Use short phrases; add variants with OR; 2–3 critical AND terms; case-insensitive; strip punctuation |
| CompareMeaning | Semantic/rubric scoring (e.g., 0–100) on coverage of key points | Mirrors user judgment: did it cover the important points? | Multi-part policy/FAQ answers | Score ≥ 70–75 (flag 70–80 for review if high risk) | Threshold too high; minor omissions drop score | Tolerant of paraphrase and order; penalizes substantive gaps | Keep expected concise; define critical points; pair with PartialMatch for numbers/terms |
| Similarity | Embedding cosine similarity (0–1) for meaning/wording closeness | Robust automated signal for overall alignment | Policy/FAQ, multi-sentence answers | Threshold 0.80 (review 0.75–0.85 band) | Borderline (e.g., 0.79) near miss; high score can mask a small factual error | Handles rewording and order changes well | Strip boilerplate/citations; keep expected crisp; pair with PartialMatch for must-have facts |
| GeneralQuality | Rubric on relevance, completeness, grounding (no direct compare) | Captures user-facing usefulness | Open-ended UX checks; when no canonical answer | Meets rubric (e.g., relevant, complete, grounded) | May pass answers missing subtle constraints; human inconsistency | Reflects practical usefulness | Define a clear rubric; use as supplemental signal; pair with a semantic metric for accuracy |
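To make the "Under the Hood" column more concrete, here is a rough Python sketch of how ExactMatch- and PartialMatch-style checks could work. This is not Copilot Studio's actual engine (the function names and normalization rules are my own), but it shows why aggressive normalization and short keyword phrases make such a difference:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_match(actual: str, expected: str) -> bool:
    """Literal equality after normalization; any rephrasing still fails."""
    return normalize(actual) == normalize(expected)

def partial_match(actual: str, required_all: list[str], required_any: list[str] = ()) -> bool:
    """All AND terms must appear; at least one OR variant must appear (if any are given)."""
    haystack = normalize(actual)
    ands_ok = all(normalize(term) in haystack for term in required_all)
    ors_ok = not required_any or any(normalize(term) in haystack for term in required_any)
    return ands_ok and ors_ok

answer = "Expenses are reimbursed once you attach an itemized receipt and obtain manager approval."
print(exact_match(answer, "Itemized receipt and manager approval"))     # False: rephrasing breaks it
print(partial_match(answer, ["itemized receipt", "manager approval"]))  # True: must-have facts present
print(partial_match(answer, ["receipt"], ["reimbursed", "refunded"]))   # True: one OR variant matched
```

Similarity and CompareMeaning swap the substring check for an embedding or rubric score, which is why they tolerate paraphrase and reordering but need a sensibly chosen threshold.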
Common Pitfalls & How to Fix Them:
Even well-authored test cases can fail for subtle reasons. Here are some common issues and how to resolve them:
| Symptom | Likely Cause | Fix |
|---|---|---|
| “No matching phrases” on PartialMatch | Expected text too long; synonyms used; minor wording differences | Switch to short keywords; add OR variants; normalize case/punctuation |
| ExactMatch fails on “almost identical” | Formatting, boilerplate, citations, or encoding differences | Enable normalization; strip boilerplate/citations; avoid ExactMatch for long outputs |
| Similarity score is 0.79 but seems fine | Borderline vs threshold; minor omission | Lower threshold slightly; add PartialMatch for must-haves; or accept via CompareMeaning |
| CompareMeaning ~75 but marked as Fail | Threshold set too high | Set pass at ≥70–75; flag 70–80 for review on high-risk topics |
| GeneralQuality passes but a must-have is missing | Rubric is tolerant; no hard fact check | Add PartialMatch keyword for the must-have (e.g., “itemized receipt”, “manager approval”) |
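Several of these fixes come down to where you draw the pass line. A small triage helper like the sketch below (the band values are the suggestions from the tables above, not product defaults) turns borderline scores into review items instead of hard failures:

```python
def triage_similarity(score: float, pass_at: float = 0.80,
                      review_band: tuple[float, float] = (0.75, 0.85)) -> str:
    """Classify an embedding-similarity score (0-1) into pass / review / fail."""
    low, high = review_band
    if low <= score <= high:
        return "review"                      # e.g. 0.79: have a human look instead of failing outright
    return "pass" if score >= pass_at else "fail"

def triage_compare_meaning(score: int, pass_at: int = 70, high_risk: bool = False) -> str:
    """Classify a 0-100 rubric score; on high-risk topics, flag 70-80 for human review."""
    if high_risk and pass_at <= score <= 80:
        return "review"
    return "pass" if score >= pass_at else "fail"

print(triage_similarity(0.79))                     # review, not fail
print(triage_compare_meaning(75))                  # pass with the threshold at 70
print(triage_compare_meaning(75, high_risk=True))  # review: borderline on a high-risk topic
```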
Note: Responses that include Adaptive Cards currently do not evaluate correctly. The evaluation engine appears to treat the card payload as raw JSON, which may not align with the expected text format — even if the visual output looks correct to the user. For now, avoid using Adaptive Card responses in test cases unless you're explicitly testing the raw JSON structure.
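To illustrate: a response rendered from even a simple Adaptive Card reaches the comparison as something like the payload below (a generic illustration, not output from a real agent), so a plain-text expected response has very little to match against.

```json
{
  "type": "AdaptiveCard",
  "version": "1.5",
  "body": [
    { "type": "TextBlock", "text": "Your refund limit is $50." }
  ]
}
```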
Final Thoughts:
The Agent Evaluation feature (preview) in Copilot Studio marks a significant step toward measurable quality in conversational AI. It brings much-needed structure to what was once a subjective process and empowers teams to scale their copilots with greater confidence.
That said, the feature is still maturing — both in terms of tooling flexibility (e.g., configuration levers, thresholds, normalization options) and the user enablement required to use it effectively. Without clear guidance, even experienced makers may struggle to choose the right test type or interpret borderline results.
As the ecosystem evolves, I hope to see:
- More intuitive authoring experiences
- Built-in recommendations for test design
- Deeper integration with CI/CD pipelines

