One of the biggest challenges when building AI agents isn't writing the agent; it's testing it.
Traditional unit tests work great for deterministic code.
Assert.Equal(4, calculator.Add(2,2));
AI agents are different.
Even when an agent behaves correctly, two responses can be worded completely differently:
"No food today."
or
"We don't have any food today!"
Both are correct.
A traditional assertion would fail one of them.
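To make the failure concrete, here is a minimal sketch of what an exact-match check sees. It uses a plain ordinal string comparison as a stand-in for `Assert.Equal` on strings, so it runs without a test framework; the variable names are only illustrative.

```csharp
using System;

// Two responses a human reviewer would accept as the same answer.
string expected = "No food today.";
string[] responses = { "No food today.", "We don't have any food today!" };

foreach (string response in responses)
{
    // Exact, ordinal comparison: character-by-character equality, nothing more.
    bool exactMatch = string.Equals(expected, response, StringComparison.Ordinal);
    Console.WriteLine($"\"{response}\" -> exact match: {exactMatch}");
}
// "No food today." -> exact match: True
// "We don't have any food today!" -> exact match: False
```

The second response is rejected even though it says the same thing, and that is the gap a semantic check is meant to close.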
That's exactly why I built skUnit, a testing framework for .NET AI applications that lets you verify behavior instead of exact text.
In this article we'll build and test a small AI agent called Moody Chef that suggests foods based on the user's mood!
Source Code
Everything shown in this article is available on GitHub.
https://github.com/mehrandvd/skunit/tree/main/demos/Demo.MoodyChef
Clone it if you'd like to follow along:
git clone https://github.com/mehrandvd/skunit.git
cd skunit/demos/Demo.MoodyChef
The Problem
Imagine an AI chef.
It recommends food depending on the customer's mood.
| Mood | Menu |
|---|---|
| Happy | Pizza, Pasta, Salad |
| Sad | Ice Cream, Chocolate |
| Angry | Nothing |
Now imagine a user writes:
Fuck you bastard! What food do you have?
The agent should recognize the user is angry and avoid suggesting food.
How do we verify that?
Not like this:
Assert.Equal("No food", response);
The wording isn't important.
The behavior is.
Building the Agent
The Moody Chef demo intentionally contains two implementations.
Version 1: Prompt Engineering
Everything lives inside the system prompt.
The model is responsible for:
- Understanding the mood
- Choosing the correct menu
- Producing the response
This works surprisingly well…
...until it doesn't.
As prompts become more complicated, the model starts making inconsistent decisions.
Version 2: Tool-Based Agent
Instead of asking the model to make every decision, we move business logic into C#.
The model only determines the user's mood.
Everything else is deterministic.
User Message
│
▼
Determine Mood
│
▼
GetFoodMenu(UserMood)
│
▼
Return Response
Now the LLM only solves an AI problem.
The application solves the business problem.
This architecture is dramatically easier to maintain and test.
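As a rough sketch of the deterministic part, the menu table from earlier can be expressed as a plain C# function. The names `GetFoodMenu` and `UserMood` come from the diagram above; the `MoodyChefTools` class, the return type, and the way the function would be registered as a tool are assumptions made for illustration, and the actual demo code may look different.

```csharp
using System;
using System.Collections.Generic;

// Usage: in the agent, the model would supply the mood. Here every mood is tried in turn.
foreach (UserMood mood in Enum.GetValues<UserMood>())
{
    IReadOnlyList<string> menu = MoodyChefTools.GetFoodMenu(mood);
    string text = menu.Count == 0 ? "(nothing)" : string.Join(", ", menu);
    Console.WriteLine($"{mood}: {text}");
}

// The three moods from the menu table.
public enum UserMood { Happy, Sad, Angry }

// Hypothetical container class; only the mapping itself is taken from the article.
public static class MoodyChefTools
{
    // Pure business logic: same mood in, same menu out, no model involved.
    public static IReadOnlyList<string> GetFoodMenu(UserMood mood) => mood switch
    {
        UserMood.Happy => new[] { "Pizza", "Pasta", "Salad" },
        UserMood.Sad => new[] { "Ice Cream", "Chocolate" },
        UserMood.Angry => Array.Empty<string>(),
        _ => throw new ArgumentOutOfRangeException(nameof(mood), mood, "Unknown mood"),
    };
}
```

Because the function has no dependency on the model, it can also be covered by ordinary unit tests, leaving the semantic tests to focus on whether the mood was recognized.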
Writing the Test
Instead of writing assertions in C#, skUnit lets you describe conversations in Markdown.
# [USER]
Fuck you bastard! What food do you have?
# [ASSISTANT]
No food
## ASSERT SemanticCondition
It doesn't suggest any food from the menu.
Notice something interesting.
The expected assistant response isn't compared word for word against what the agent actually says.
The important part is the semantic assertion:
It doesn't suggest any food from the menu.
That means all of these responses pass:
✅ No food today.
✅ You're on a diet.
✅ Sorry, I can't recommend anything.
But this fails:
❌ Pizza, Pasta and Salad.
That's much closer to how humans evaluate AI.
Running the Scenario
After parsing the Markdown file, skUnit executes the conversation against your agent.
await agent.ExecuteScenarioAsync(...)
That's it.
Behind the scenes, skUnit:
- runs the conversation
- evaluates every semantic assertion
- reports failures
- supports multiple executions to detect flaky behavior
The Moody Chef sample runs every scenario three times.
TotalRuns = 3
RequiredSuccessRuns = 3
This reduces the chance that a test passes simply because the model got lucky once.
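The idea behind those two settings can be sketched in a few lines. This is not skUnit's implementation: `RepeatedRun.PassesAsync` and the flaky stand-in scenario are hypothetical helpers that only illustrate the "N runs, at least M successes" rule.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for one scenario execution; a real run would call the agent and its assertions.
int call = 0;
Task<bool> FlakyScenario() => Task.FromResult(++call != 2); // fails on the second run only

bool passed = await RepeatedRun.PassesAsync(FlakyScenario, totalRuns: 3, requiredSuccessRuns: 3);
Console.WriteLine($"Passed: {passed}"); // Passed: False

// Hypothetical helper, written for this sketch only.
public static class RepeatedRun
{
    // Runs the scenario totalRuns times and passes only if enough runs succeed.
    public static async Task<bool> PassesAsync(
        Func<Task<bool>> runOnce, int totalRuns, int requiredSuccessRuns)
    {
        if (totalRuns < 1)
            throw new ArgumentOutOfRangeException(nameof(totalRuns));
        if (requiredSuccessRuns < 1 || requiredSuccessRuns > totalRuns)
            throw new ArgumentOutOfRangeException(nameof(requiredSuccessRuns));

        int successes = 0;
        for (int i = 0; i < totalRuns; i++)
        {
            if (await runOnce()) successes++;
        }
        return successes >= requiredSuccessRuns;
    }
}
```

With 3 required successes out of 3 runs, a single bad response fails the test. Lowering the required count to 2 would tolerate one miss, which trades strictness for stability.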
Running the Demo
Configure your Azure OpenAI credentials using User Secrets.
Then start the console app.
cd Demo.MoodyChef.Console
dotnet run
You can chat with Moody Chef yourself.
When you're ready, execute the semantic tests.
dotnet test
Try modifying:
- the prompt
- the tool
- the Markdown scenario
and see how skUnit responds.
Why Semantic Assertions Matter
Many AI tests are brittle because they check exact text.
Users don't care about text.
They care about behavior.
Instead of asking
Did the model say these exact words?
ask
Did the model do the right thing?
That's exactly what semantic assertions verify.
Why I Prefer Tool-Based Agents
The Moody Chef sample highlights an important design principle.
The LLM shouldn't own business rules.
Instead:
- Let the LLM interpret language.
- Let your application enforce rules.
The result is:
- more reliable agents
- simpler prompts
- deterministic business logic
- tests that rarely become flaky
Final Thoughts
AI agents deserve better testing tools than string comparisons.
With skUnit you can:
- write conversations in Markdown
- verify semantic behavior
- execute scenarios repeatedly
- catch regressions before users do
If you're building AI applications with .NET or Semantic Kernel, give it a try.
⭐ GitHub
https://github.com/mehrandvd/skunit
Feedback, issues, and pull requests are always welcome.