This article is part of the #wedoAI initiative. You'll find other helpful AI articles, videos, and tutorials published by community members and experts there, so make sure to check it out every day.
As Generative AI applications and agents move from experimentation into production, one challenge becomes clear: how do we measure the metrics we actually care about, such as quality, jailbreak risk, or tool-call accuracy?
Azure AI Foundry provides built-in evaluators, such as coherence, code vulnerability, and fluency; however, real-world apps often require domain-specific or more nuanced evaluation metrics. This is where custom evaluators come in.
Why Custom Evaluators?
As we mentioned, Generative AI evaluation is not one-size-fits-all. For example:
- A customer support bot can be measured on helpfulness and clarity.
- A content summarizer may need to be judged on factual accuracy and readability.
- A healthcare assistant might prioritize completeness and professional tone.
By creating your own evaluator, you can tailor evaluation criteria to align with your business goals, compliance needs, or user expectations.
Types of Custom Evaluators in Azure AI Foundry
Azure AI Foundry supports two main styles of custom evaluators:
- Code-based evaluators: Implemented in Python, they use deterministic rules and metrics.
- Prompt-based evaluators: Defined in .prompty assets, they leverage an LLM to provide human-like judgments.
Let's explore how to build your own evaluators using the Azure AI Foundry SDK, with two practical examples:
- ClarityEvaluator – a lightweight, code-based evaluator.
- HelpfulnessEvaluator – a prompt-based evaluator powered by an LLM.
Example 1: ClarityEvaluator (Code-Based)
The ClarityEvaluator measures how clear an answer is by looking at sentence length and structure.
class ClarityEvaluator:
    def __init__(self):
        pass

    def __call__(self, *, answer: str, **kwargs):
        import re
        # Split the answer into sentences on terminal punctuation
        sentences = re.split(r'(?<=[.!?])\s+', answer.strip())
        num_sentences = len(sentences) if sentences and sentences[0] else 0
        num_words = sum(len(s.split()) for s in sentences)
        avg_sentence_len = num_words / num_sentences if num_sentences else 0
        # Sentences longer than 25 words are treated as hard to read
        long_sentences = [s for s in sentences if len(s.split()) > 25]
        long_ratio = len(long_sentences) / num_sentences if num_sentences else 0.0
        return {
            "avg_sentence_length": avg_sentence_len,
            "long_sentence_ratio": long_ratio
        }
Why it’s useful:
- Average sentence length → shorter sentences are easier to read.
- Long sentence ratio → helps detect overly complex phrasing.
This evaluator is fast, lightweight, and doesn’t require an LLM call—perfect for continuous integration pipelines.
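Because it runs deterministically and without an LLM call, it can also be wired into automated tests. Here is a minimal, hypothetical pytest sketch; it assumes pytest is installed, that the class is saved in a clarity.py module (as it will be later in this post), and that the thresholds are arbitrary examples, not Azure AI Foundry defaults:

# test_clarity.py - illustrative CI check; thresholds are arbitrary examples
import pytest
from clarity import ClarityEvaluator

SAMPLE_ANSWERS = [
    "Submit the form online. Wait for the confirmation email. Attend the appointment.",
    "Reset your password from the login page and follow the emailed instructions.",
]

@pytest.mark.parametrize("answer", SAMPLE_ANSWERS)
def test_answers_are_clear(answer):
    result = ClarityEvaluator()(answer=answer)
    # Example quality gates: tune these limits for your own content
    assert result["avg_sentence_length"] <= 20
    assert result["long_sentence_ratio"] <= 0.25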
Example 2: HelpfulnessEvaluator (Prompt-Based)
Sometimes clarity alone isn’t enough. We also want to know: did the response actually help the user?
Here's where an LLM-driven evaluator shines. We can design a .prompty file that instructs the model to score helpfulness on a 1–5 scale.
helpfulness.prompty
---
name: Helpfulness Evaluator
description: "Rates how helpful and actionable an answer is."
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: ${env:MODEL_EVALUATION_DEPLOYMENT_NAME}
    api_key: ${env:AZURE_AI_KEY}
  parameters:
    temperature: 0.1
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  explanation:
    type: string
---
system:
You are evaluating the helpfulness of an answer. Rate it 1 to 5:
1 – Not helpful
3 – Moderately helpful
5 – Extremely helpful and actionable
Return JSON like:
{"score": <1-5>, "reason": "<short explanation>"}
Here is the answer to evaluate:
generated_query: {{response}}
output:
Python wrapper:
import json
from promptflow.client import load_flow

class HelpfulnessEvaluator:
    def __init__(self, model_config):
        # Load the .prompty asset as a callable flow
        self._flow = load_flow("helpfulness.prompty", model_config=model_config)

    def __call__(self, *, response: str, **kwargs):
        llm_output = self._flow(response=response)
        try:
            return json.loads(llm_output)
        except json.JSONDecodeError:
            # If the model doesn't return valid JSON, surface the raw output
            return {"score": None, "reason": llm_output}
Why it’s useful:
- Captures nuance that code-based heuristics can’t.
- Produces a reason explanation for more transparency.
- Can be tuned for domain-specific definitions of “helpfulness.”
Putting It All Together
With Azure AI Foundry SDK, you can mix and match evaluators:
- Run ClarityEvaluator for automated readability scoring.
- Run HelpfulnessEvaluator for human-like quality judgments.
- Combine both into a composite evaluation pipeline to get a richer picture of model performance (see the sketch below).
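As a minimal sketch of this mix-and-match idea, assuming the ClarityEvaluator and HelpfulnessEvaluator classes shown above are saved in clarity.py and helpfulness.py, and a model_config dictionary like the one built later from environment variables, both evaluators can be merged into a single result per response:

from clarity import ClarityEvaluator
from helpfulness import HelpfulnessEvaluator

def evaluate_response(response: str, model_config: dict) -> dict:
    """Run both custom evaluators on one response and merge their outputs."""
    clarity_scores = ClarityEvaluator()(answer=response)
    helpfulness_scores = HelpfulnessEvaluator(model_config)(response=response)
    # Prefix keys so the two result sets don't collide
    combined = {f"clarity.{k}": v for k, v in clarity_scores.items()}
    combined.update({f"helpfulness.{k}": v for k, v in helpfulness_scores.items()})
    return combined

The evaluate() call shown later in this post does essentially the same thing at dataset scale.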
The code is available on GitHub.
Instructions
Step 1. Create an Azure AI Foundry Project
- Go to Azure AI Foundry portal Home page and click on Create new. Select Azure AI Foundry resource and click on Next.
- Fill in the required data, such as project name, Azure AI Foundry resource, Subscription, Resource group, and Region and click on Create.
NOTE: Consider choosing one of the supported regions if you would eventually like to test the built-in protected material and risk and safety evaluators, as their support is limited to certain regions at the time of writing. For this example, the East US 2 region was chosen.
- Copy the Azure AI Foundry project endpoint and key values and paste them into Notepad; we will use them later in the environment variables file as AZURE_AI_FOUNDRY_PROJECT_ENDPOINT and AZURE_AI_KEY, respectively.
- Click on Azure OpenAI and also copy the Azure OpenAI endpoint value. Paste it into Notepad; it will be used in the .env file as AZURE_OPENAI_ENDPOINT.
Step 2. Deploy Azure OpenAI Model(s)
You can certainly use any base model for evaluation of your Generative AI app. For this scenario, we will deploy a gpt-4.1 model instance.
- In Azure AI Foundry, click on Models + endpoints under My assets. Then, click on Deploy model and choose Deploy base model.
- Search and select gpt-4.1 in the Select model pop-up.
- Change the deployment details accordingly (especially the capacity in case you need more tokens per minute for your evaluations) by clicking on the Customize button. For this scenario, I am using the default values.
- Copy and paste the deployment name into Notepad; it will be used in the .env file as MODEL_EVALUATION_DEPLOYMENT_NAME.
- Your GenAI app will use a separate model deployment, for example gpt-4o. Deploy that model as well and copy its deployment name too; it corresponds to the MODEL_GENAIAPP_DEPLOYMENT_NAME environment variable that will be defined later in the .env file.
Step 3. Write code
Part 1. Prerequisites
- Open Visual Studio Code and, in a new folder, create the following files:
- requirements.txt
- .env
- clarity.py
- clarity_evaluation.py
- helpfulness.prompty
- helpfulness.py
- helpfulness_evaluation.py
- dataset.jsonl
- local_evaluation.py
- In requirements.txt, we define the libraries that our project needs. Here they are:
promptflow
azure-ai-evaluation
python-dotenv
- In .env file, set the following environment variables with the corresponding values that you have copied from the Azure AI Foundry portal:
AZURE_AI_FOUNDRY_PROJECT_ENDPOINT=
AZURE_OPENAI_ENDPOINT=
AZURE_AI_KEY=
MODEL_EVALUATION_DEPLOYMENT_NAME=gpt-4.1
MODEL_GENAIAPP_DEPLOYMENT_NAME=gpt-4o
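Optionally, before running any evaluator, you can confirm that the .env values are being picked up. The snippet below is just a convenience check (the file name check_env.py is illustrative and not part of the project files listed above):

# check_env.py - optional sanity check that the .env values are loading
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current folder

for name in (
    "AZURE_AI_FOUNDRY_PROJECT_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_AI_KEY",
    "MODEL_EVALUATION_DEPLOYMENT_NAME",
    "MODEL_GENAIAPP_DEPLOYMENT_NAME",
):
    value = os.getenv(name)
    # Avoid printing secrets; just report whether each variable is set
    print(f"{name}: {'set' if value else 'MISSING'}")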
Part 2. Code-based custom evaluator (Clarity)
- In clarity.py, we define the Clarity custom evaluator; you already know the code:
class ClarityEvaluator:
    def __init__(self):
        pass

    def __call__(self, *, answer: str, **kwargs):
        import re
        sentences = re.split(r'(?<=[.!?])\s+', answer.strip())
        num_sentences = len(sentences) if sentences and sentences[0] else 0
        num_words = sum(len(s.split()) for s in sentences)
        avg_sentence_len = num_words / num_sentences if num_sentences else 0
        long_sentences = [s for s in sentences if len(s.split()) > 25]
        long_ratio = len(long_sentences) / num_sentences if num_sentences else 0.0
        return {
            "avg_sentence_length": avg_sentence_len,
            "long_sentence_ratio": long_ratio
        }
- Now, we can evaluate the clarity of two texts (let's imagine that both were AI-generated). This is the code for clarity_evaluation.py:
from clarity import ClarityEvaluator

evaluator = ClarityEvaluator()

answer_one = (
    "The process has three steps. "
    "First, submit your form online. "
    "Second, wait for the confirmation email. "
    "Finally, attend the scheduled appointment."
)

answer_two = (
    "In order to achieve the objective, it is necessary that the applicant "
    "not only completes the form—which might contain several sections, some "
    "of which are optional but highly recommended depending on the context—"
    "but also ensures that all accompanying documents are provided at the "
    "time of submission, otherwise the process may be delayed or even rejected."
)

result_one = evaluator(answer=answer_one)
result_two = evaluator(answer=answer_two)

print("Answer Evaluation:", result_one)          # clear
print("Unclear Answer Evaluation:", result_two)  # unclear
- Test the code:
- Launch a terminal and write the following commands to create a virtual environment:
- python -m venv pf
- pf\Scripts\activate (on Windows) or source pf/bin/activate (on Linux/macOS)
- Next, install the requirements with the command pip install -r .\requirements.txt
- Now, run the following command to test the clarity evaluator: python clarity_evaluation.py. Here is the output:
As you can see, the first text is short, concise, and gets to the point, so its average sentence length is low and its long sentence ratio is 0. The second text is wordy and harder to follow, with a high average sentence length and a long sentence ratio of 1.
Part 3. Prompt-based custom evaluator (Helpfulness)
- Now, let's test how helpful a text is. First, we define the content for helpfulness.prompty, which will be used by an LLM to evaluate a given text:
---
name: Helpfulness Evaluator
description: Rates how helpful and actionable an answer is.
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: ${env:MODEL_EVALUATION_DEPLOYMENT_NAME}
    api_key: ${env:AZURE_AI_KEY}
  parameters:
    temperature: 0.1
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  explanation:
    type: string
---
system:
You are evaluating the helpfulness of an answer. Rate it 1 to 5:
1 – Not helpful
3 – Moderately helpful
5 – Extremely helpful and actionable
Return JSON like:
{"score": <1-5>, "reason": "<short explanation>"}
Here is the answer to evaluate:
generated_query: {{response}}
output:
NOTE: For better results, it is recommended to also include the original query or context, so the LLM has more information to judge the answer. It is not included in this sample, though.
- Next, here's the code for helpfulness.py where we define a custom Helpfulness Evaluator class:
import json
from promptflow.client import load_flow

class HelpfulnessEvaluator:
    def __init__(self, model_config):
        self._flow = load_flow("helpfulness.prompty", model_config=model_config)

    def __call__(self, *, response: str, **kwargs):
        llm_output = self._flow(response=response)
        try:
            return json.loads(llm_output)
        except json.JSONDecodeError:
            return {"score": None, "reason": llm_output}
NOTE: If you included the query/context in the prompty file, you'd also need to accept that argument in addition to the response and pass it to the flow; a sketch of that variant follows.
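As an illustration, a hypothetical variant that also forwards the query could look like this; it only works if helpfulness.prompty declares a query input and uses a {{query}} placeholder, which the file in this walkthrough does not:

import json
from promptflow.client import load_flow

# Hypothetical variant: valid only if helpfulness.prompty declares a "query" input
class HelpfulnessWithQueryEvaluator:
    def __init__(self, model_config):
        self._flow = load_flow("helpfulness.prompty", model_config=model_config)

    def __call__(self, *, query: str, response: str, **kwargs):
        # Pass both the user's query and the answer to the prompty flow
        llm_output = self._flow(query=query, response=response)
        try:
            return json.loads(llm_output)
        except json.JSONDecodeError:
            return {"score": None, "reason": llm_output}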
- And we can test the helpfulness of a couple of AI-generated texts. We will do it in helpfulness_evaluation.py:
from helpfulness import HelpfulnessEvaluator
import os
from dotenv import load_dotenv

load_dotenv()

model_config = {
    "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.getenv("MODEL_EVALUATION_DEPLOYMENT_NAME"),
    "api_key": os.getenv("AZURE_AI_KEY")
}

evaluator = HelpfulnessEvaluator(model_config)

answer_one = (
    "To reset your password, go to the login page, click 'Forgot Password', "
    "and follow the instructions sent to your registered email. "
    "If you don't receive an email, check your spam folder or contact support."
)
answer_two = "I don't know. Maybe try something else."

result_one = evaluator(response=answer_one)
result_two = evaluator(response=answer_two)

print("Helpful Answer Evaluation:", result_one)
print("Unhelpful Answer Evaluation:", result_two)
NOTE: If you added the query in the previous files, you'd need to include it here too and pass it to the evaluator.
- Test the code in the terminal with the command python helpfulness_evaluation.py. Here is the output:
As expected, the first text provides clear instructions (it is helpful, so the score is 5), while the second one does not (it is not helpful, so the score is 1).
Part 4. Local evaluation
Finally, we can create a script that evaluates (locally) a given dataset of AI-generated answers and combines both custom evaluators (you can include built-in evaluators here too), so you are essentially evaluating multiple metrics at once. This is helpful for batch evaluation and could be part of a DevOps pipeline/workflow where you test the quality of your agentic solution, model, or AI-based application.
- In dataset.jsonl, add the following content: a dataset with an identifier and a response for each item you want to test with your custom evaluators. The format is JSON Lines, which is required by Azure AI Evaluation.
{"id": "1", "response": "The process has three steps. First, submit your form online. Second, wait for the confirmation email. Finally, attend the scheduled appointment."}
{"id": "2", "response": "In order to achieve the objective, it is necessary that the applicant not only completes the form—which might contain several sections..."}
{"id": "3", "response": "To reset your password, go to the login page, click 'Forgot Password', and follow the instructions sent to your registered email."}
{"id": "4", "response": "I don’t know. Maybe try something else."}
- Now, let's define the code for local_evaluation.py. We import our custom evaluators, load the dataset, and prepare an evaluation that includes both custom evaluators. Finally, the local evaluation is performed, with the results written to a new file (myevalresults.json) and also printed to the console. Additionally, the results are uploaded to the Azure AI Foundry portal thanks to the azure_ai_project argument, which uses the AZURE_AI_FOUNDRY_PROJECT_ENDPOINT environment variable.
from azure.ai.evaluation import evaluate
from clarity import ClarityEvaluator
from helpfulness import HelpfulnessEvaluator
import os
from dotenv import load_dotenv

load_dotenv()

dataset = "dataset.jsonl"

clarity_eval = ClarityEvaluator()
helpfulness_eval = HelpfulnessEvaluator(
    model_config={
        "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
        "azure_deployment": os.getenv("MODEL_EVALUATION_DEPLOYMENT_NAME"),
        "api_key": os.getenv("AZURE_AI_KEY")
    }
)

project_endpoint = os.getenv("AZURE_AI_FOUNDRY_PROJECT_ENDPOINT")

results = evaluate(
    data=dataset,
    evaluators={
        "clarity": clarity_eval,
        "helpfulness": helpfulness_eval
    },
    evaluator_config={
        "clarity": {
            "column_mapping": {
                "answer": "${data.response}"
            }
        },
        "helpfulness": {
            "column_mapping": {
                "response": "${data.response}"
            }
        }
    },
    output_path="./myevalresults.json",
    azure_ai_project=project_endpoint
)

print(results)
print("Local evaluation results saved to myevalresults.json")
- Use the command python local_evaluation.py to see the output:
Notice that the results are saved in two places:
- A local file, myevalresults.json. The metrics that appear at the end are the averages for the whole dataset:
- We also get a studio_url that points to the Azure AI Foundry portal, under Protect and govern > Evaluations:
Check the details by clicking on the evaluation:
Get insights into the evaluation of each row by clicking on the Data tab:
The results are available in JSON Lines format as well; click on the Logs tab:
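If you want to consume the local results downstream, for example as a quality gate in a pipeline, the output file can be read back like any JSON document. The following sketch assumes the aggregate metrics sit under a top-level metrics key with names like helpfulness.score, which matches the averages described above; verify the exact keys against your own myevalresults.json before relying on them:

# read_results.py - illustrative sketch; key names should be checked against your file
import json

with open("myevalresults.json", encoding="utf-8") as f:
    results = json.load(f)

# Aggregate metrics (averages over the dataset) are assumed to live under "metrics"
metrics = results.get("metrics", {})
for name, value in metrics.items():
    print(f"{name}: {value}")

# Example quality gate (the threshold is an arbitrary illustration)
helpfulness = metrics.get("helpfulness.score")
if helpfulness is not None and helpfulness < 3:
    raise SystemExit("Average helpfulness below threshold - failing the pipeline")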
Best Practices for Custom Evaluators
- Keep it lightweight – Run deterministic metrics where possible (cheap, fast).
- Use LLM evaluators sparingly – They add cost and latency, but are powerful for subjective judgments.
- Align with business goals – Don’t just measure for the sake of measuring. Choose metrics that reflect what "good" means in your use case.
- Integrate into CI/CD – Automate evaluation runs when pushing new versions of your app or model.
You can read more about custom evaluators in the official documentation.
Final Thoughts
Evaluators are the compass for GenAI development. Built-in metrics are a great start, but custom evaluators let you measure what truly matters for your application.
By combining rule-based clarity checks with LLM-powered helpfulness scoring, you can create a balanced, flexible evaluation strategy in Azure AI Foundry that drives continuous improvement for your GenAI apps, agents, and models.
I hope that this post was interesting and useful for you. Enjoy the rest of the #wedoAI publications!
Thank you for reading!