<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leading EDJE</title>
    <description>The latest articles on DEV Community by Leading EDJE (@leading-edje).</description>
    <link>https://dev.to/leading-edje</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1988%2Fd7dc325f-e3e6-4445-8ec6-9353b07e365d.jpg</url>
      <title>DEV Community: Leading EDJE</title>
      <link>https://dev.to/leading-edje</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leading-edje"/>
    <language>en</language>
    <item>
      <title>Product Thinking in a Project World: Delivering Software That Actually Moves the Needle</title>
      <dc:creator>Julie Yakunich</dc:creator>
      <pubDate>Wed, 14 Jan 2026 15:12:46 +0000</pubDate>
      <link>https://dev.to/leading-edje/product-thinking-in-a-project-world-delivering-software-that-actually-moves-the-needle-3jh6</link>
      <guid>https://dev.to/leading-edje/product-thinking-in-a-project-world-delivering-software-that-actually-moves-the-needle-3jh6</guid>
      <description>&lt;p&gt;&lt;strong&gt;At Leading EDJE, we’re obsessed with one thing: measurable value.&lt;/strong&gt; Not shipped features. Not “percent complete.” Value - business outcomes your leaders can see, feel, and bank on.&lt;/p&gt;

&lt;p&gt;Many organizations still run technology work as projects with fixed scope, budget, and timelines. Constraints are real. The difference with us is &lt;strong&gt;how&lt;/strong&gt; we use those constraints: we prioritize by &lt;em&gt;value&lt;/em&gt;, measure impact continuously, and make trade-offs transparent in business terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our Approach: Value First, Always&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Start with outcomes, not features&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We clarify the business problem, define what success looks like, and agree on a small set of measurable outcomes (e.g., reduced cycle time, increased conversion, lower cost-to-serve). Features are a means; outcomes are the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Align goals from strategy to sprint&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We align strategic goals, product goals, and sprint goals so day-to-day work directly supports what leadership values most. Teams understand why every item is in progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) Order the work by value&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Backlog decisions are grounded in expected value, ensuring evidence—not opinion—guides the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Measure what matters&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We use a handful of practical metrics tied to the target outcomes: adoption/usage, throughput/lead time, error rates, NPS, or whatever best signals business impact. Then we close the loop by sharing results back to teams and stakeholders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) Make plans honest—and useful&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
 Using agile forecasting and flow metrics, we give leaders credible timelines &lt;em&gt;and&lt;/em&gt; options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Within the current budget/date, here’s the &lt;strong&gt;most value&lt;/strong&gt; we can deliver.”
&lt;/li&gt;
&lt;li&gt;“To capture &lt;strong&gt;more value&lt;/strong&gt;, here are the trade-offs.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Value Delivery Toolkit (How We Make This Real)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make “value first” practical in project-driven environments, we bring a set of lightweight, repeatable practices we call the &lt;strong&gt;Value Delivery Toolkit&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evidence-Based Management (EBM):&lt;/strong&gt; Shared language and measures for value (e.g., Current Value, Time-to-Market, Ability to Innovate) so progress is judged by outcomes, not output volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agile Forecasting:&lt;/strong&gt; Probabilistic forecasts (including Monte Carlo and flow metrics) for realistic delivery windows that still leave room to optimize for the most valuable work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backlog &amp;amp; Goal Alignment:&lt;/strong&gt; Clear product/sprint goals, value-based ordering, and explicit trade-offs that connect strategy to execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incremental, Evidence-Driven Discovery:&lt;/strong&gt; Small experiments to validate assumptions early, before big spend.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
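&lt;p&gt;To make the forecasting idea concrete, here is a minimal sketch of a Monte Carlo throughput forecast. This is an illustration only, not Leading EDJE's actual tooling, and the weekly throughput numbers are invented:&lt;/p&gt;

```python
import random

# Hypothetical flow data: items finished per week, from past sprints.
weekly_throughput = [3, 5, 2, 6, 4, 4, 7, 3]

def forecast_weeks(throughput, backlog_items, trials=10000, seed=1):
    """Resample historical weekly throughput to simulate how many
    weeks finishing the remaining backlog might take."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        remaining, weeks = backlog_items, 0
        while remaining > 0:
            remaining -= rng.choice(throughput)
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    # Report a range with confidence, not a single date.
    return {p: outcomes[(p * trials) // 100] for p in (50, 85, 95)}

print(forecast_weeks(weekly_throughput, 40))
```

&lt;p&gt;The result reads as "a 50% chance we finish within N weeks, an 85% chance within M weeks," which is the shape of answer that makes trade-offs discussable.&lt;/p&gt;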

&lt;p&gt;We don’t just “fill a seat.” We translate goals into outcomes, outcomes into measures, and measures into everyday decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Looks Like in Your Organization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If scope and date are fixed:&lt;/strong&gt; We maximize value within the constraints and show the cost/benefit of alternatives in business terms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If success criteria are fuzzy:&lt;/strong&gt; We co-define measurable outcomes and connect them to strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If product roles are unclear:&lt;/strong&gt; We act as translators—helping PMs, POs, BAs, and engineering align around outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If change feels hard:&lt;/strong&gt; We start small (add sprint goals, instrument 1–2 outcome metrics, value-order the top of the backlog) and build momentum with proof.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Wins You Can Apply This Month&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;strong&gt;sprint goals&lt;/strong&gt; that are outcome-oriented and measurable, and review them daily.
&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;product goal&lt;/strong&gt; and tie it to a strategic objective.
&lt;/li&gt;
&lt;li&gt;Find one metric that best signals the outcome you want, and report it at sprint review.
&lt;/li&gt;
&lt;li&gt;Switch to &lt;strong&gt;probabilistic forecasts&lt;/strong&gt; (ranges with confidence) instead of single-date promises (we love using Actionable Agile for this).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Clients Choose Leading EDJE&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Value-Obsessed:&lt;/strong&gt; Positive business impact = success.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pragmatic:&lt;/strong&gt; We blend methods to fit your constraints and culture.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent:&lt;/strong&gt; Forecasts and metrics make trade-offs clear before money is spent.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partner Mindset:&lt;/strong&gt; We ask the right questions, care deeply about outcomes, and stay accountable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; We help you move from “was it delivered?” to “what value did it create?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to Turn Projects into Outcomes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want your next initiative to &lt;em&gt;prove&lt;/em&gt; its value—not just deliver scope—we’d love to help. Our Value Delivery Toolkit meets you where you are and raises the bar on what your technology delivers.&lt;/p&gt;

</description>
      <category>agile</category>
      <category>product</category>
      <category>ebm</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Maestro: A Single Framework for Mobile and Web E2E Testing</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Fri, 26 Dec 2025 13:26:20 +0000</pubDate>
      <link>https://dev.to/leading-edje/maestro-a-single-framework-for-mobile-and-web-e2e-testing-b98</link>
      <guid>https://dev.to/leading-edje/maestro-a-single-framework-for-mobile-and-web-e2e-testing-b98</guid>
      <description>&lt;p&gt;I've recently been working on a personal project that has both mobile and web frontends. I wanted to include E2E tests, but I didn't want to spend a bunch of time getting all of that setup for web, iOS, and Android.&lt;/p&gt;

&lt;p&gt;I just wanted a handful of happy-path E2E tests for an app that could run on a desktop browser, mobile browser, and native mobile.&lt;/p&gt;

&lt;p&gt;Most importantly, I wanted to get this running quickly so I could focus on actually building the app.  That's when I found an open source tool called &lt;a href="https://maestro.dev/" rel="noopener noreferrer"&gt;Maestro&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;What immediately caught my attention with Maestro is that it's so easy to get set up, and it handles both web and mobile with the same tool and syntax. &lt;/p&gt;

&lt;h2&gt;
  
  
  Here's What a Test Looks Like
&lt;/h2&gt;

&lt;p&gt;Maestro tests are written in YAML. Here's a simple desktop browser example that searches DuckDuckGo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://duckduckgo.com&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;without&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;being&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tracked'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;e2e&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;testing'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;open-source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;framework.*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward, right? It opens DuckDuckGo, taps the search box, searches for "Maestro e2e testing", and verifies that the results contain "Maestro is an open-source framework". Note that for partial text matching, Maestro uses regex—the &lt;code&gt;.*&lt;/code&gt; pattern means "any characters", so &lt;code&gt;".*text.*"&lt;/code&gt; effectively does a "contains" match.&lt;/p&gt;
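&lt;p&gt;If you want to convince yourself of the regex semantics, the same contains-style pattern can be checked with any regex engine. Here's a quick Python illustration (plain Python, not Maestro code):&lt;/p&gt;

```python
import re

# Maestro-style partial match: ".*text.*" behaves like "contains".
pattern = re.compile(r".*Maestro is an open-source framework.*")

page_text = "Search results: Maestro is an open-source framework for UI testing."

# fullmatch mirrors asserting against the whole visible text.
assert pattern.fullmatch(page_text) is not None

# Without the leading/trailing .*, only an exact match would pass.
exact = re.compile(r"Maestro is an open-source framework")
assert exact.fullmatch(page_text) is None
```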

&lt;p&gt;To be honest, I was not super excited to work with a tool that uses YAML to define the tests.  In my regular job I spend a lot of time building out code-based automation suites, and that usually feels like the "right" way to do it.  But is that always the case?&lt;/p&gt;

&lt;p&gt;My personal project is not super complex, and I don't have a team of test automation folks.  I have one dev and one QA, and they are both me.  I want E2E tests, but I want to focus the majority of my time on building the app, not building fancy-pants automation frameworks.&lt;/p&gt;

&lt;p&gt;Let's run this test!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;I am not assuming that everyone uses a Mac, but that's what I'm using, so keep that in mind if you're reading this as a Windows or Unix person.  Maestro is cross-platform, but some of the install steps will be different.  See their &lt;a href="https://docs.maestro.dev/getting-started/installing-maestro" rel="noopener noreferrer"&gt;setup documentation&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;First, let's install Maestro.  Open your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="s2"&gt;"https://get.maestro.mobile.dev"&lt;/span&gt; | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can use Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap mobile-dev-inc/tap
brew &lt;span class="nb"&gt;install &lt;/span&gt;maestro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify it worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK so what do I need to install next?  Huh, that's it??  Well then... let's run the test!&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/duckduckgo-search-desktop.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maestro will open a browser, run through the test steps, and show you the results. If something fails, the output helps you figure out what went wrong, and you'll also get some detailed log files.  Hopefully your run will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0i83z23yirqxrjru2c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0i83z23yirqxrjru2c7.png" alt="Maestro console output from desktop browser test" width="522" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Same Test on Mobile Browser
&lt;/h2&gt;

&lt;p&gt;You can run a similar test on a mobile browser. Here's the mobile version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;appId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.android.chrome&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://duckduckgo.com"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchbox_input"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;e2e&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;testing"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;open-source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;framework.*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the syntax is almost identical.  The main difference is using &lt;code&gt;url:&lt;/code&gt; for desktop browsers and &lt;code&gt;appId:&lt;/code&gt; for mobile browsers. Other than that, Maestro uses the same commands for both.&lt;/p&gt;

&lt;p&gt;To run this, you'll need an Android emulator. If you have Android Studio installed, you can use the AVD Manager to create one. Make sure Chrome is installed on the emulator (it usually is by default).&lt;/p&gt;

&lt;p&gt;Once your emulator is running, just run the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/duckduckgo-search-mobile.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully you'll see the same interactions that you saw with the desktop browser test, and the same green results, like this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forondtneq605eu1cugd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forondtneq605eu1cugd0.png" alt="Maestro console output from mobile browser test" width="541" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You now have a taste for Maestro testing in both desktop and mobile browsers.  Next, let's move away from the browser and use Maestro to test a native mobile app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing a Native Mobile App
&lt;/h2&gt;

&lt;p&gt;The built-in Android Contacts app is perfect for this because it's available on every Android device and works great in an emulator. Notice how the syntax below matches the web tests: Maestro uses the same commands whether you're testing web or native mobile.&lt;/p&gt;

&lt;p&gt;Here's a test that creates a new contact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;appId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.google.android.contacts&lt;/span&gt;
&lt;span class="na"&gt;jsEngine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graaljs&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName = faker.name().firstName()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.lastName = faker.name().lastName()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.phoneNumber = faker.phoneNumber().phoneNumber()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;contact"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.lastName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;longPressOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(Mobile)"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Select&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;All'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;eraseText&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.phoneNumber}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Save"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName + " " + output.lastName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scrollUntilVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;element&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;contact&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deleted"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test is a bit more advanced as it demonstrates Maestro's ability to generate dynamic test data using Faker. The &lt;code&gt;jsEngine: graaljs&lt;/code&gt; setting enables JavaScript execution, and the &lt;code&gt;evalScript&lt;/code&gt; commands at the top use Faker to generate random first names, last names, and phone numbers. These values are stored in the &lt;code&gt;output&lt;/code&gt; object and referenced throughout the test using &lt;code&gt;${output.variableName}&lt;/code&gt; syntax. &lt;/p&gt;

&lt;p&gt;This is just one example of integrating JavaScript with Maestro scripts.  More detail can be found &lt;a href="https://docs.maestro.dev/advanced/javascript" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running It
&lt;/h3&gt;

&lt;p&gt;With your emulator running, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/contacts-app-android.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test will run, and you'll see the emulator actually perform the actions. If it passes, you'll see a nice success message. If it fails, Maestro will tell you what went wrong and where.  Here's what I see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetx42gck7kceo2k5glx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetx42gck7kceo2k5glx.png" alt="Maestro console output from Android Contacts app test" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maestro MCP
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is a standardized protocol that bridges tools (like Maestro) to LLMs (like Claude or ChatGPT). Think of it as a universal connector that lets these AI models access and interact with your development tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; If you're using these LLMs in your development workflow, Maestro includes an MCP that lets them interact with Maestro directly. They can read your test files, understand your test structure, suggest improvements, or even generate tests based on your app's behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to use it:&lt;/strong&gt; The MCP server comes bundled with Maestro. To use it in Cursor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Cursor Settings&lt;/li&gt;
&lt;li&gt;Navigate to the MCP section&lt;/li&gt;
&lt;li&gt;Click "Add new MCP Server"&lt;/li&gt;
&lt;li&gt;Configure it with:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"maestro"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"maestro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save and restart Cursor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Similar functionality is available in other tools like VS Code through MCP extensions. Once connected, the AI assistant can discover your Maestro flows, understand your test structure, and help you write better tests.&lt;/p&gt;

&lt;p&gt;More details can be found &lt;a href="https://docs.maestro.dev/getting-started/maestro-mcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few things I didn't cover but want to mention
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maestro is easy to run on your CI platform, and also has a Cloud plan.  More info &lt;a href="https://docs.maestro.dev/cloud/ci-integration" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maestro has a ton of sample flows to help you learn more &lt;a href="https://docs.maestro.dev/getting-started/run-a-sample-flow" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maestro has an IDE to help with identifying UI elements, generating code, and running commands.  &lt;a href="https://docs.maestro.dev/getting-started/maestro-studio-cli" rel="noopener noreferrer"&gt;Check it out&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Take a look at &lt;a href="https://docs.maestro.dev" rel="noopener noreferrer"&gt;docs.maestro.dev&lt;/a&gt; for more examples, advanced features like nested flows and conditions, page objects, and tips for structuring larger test suites.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy building and testing.  Peace out! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>mobile</category>
      <category>qa</category>
    </item>
    <item>
      <title>From Dev to DevOps</title>
      <dc:creator>Victor Frye</dc:creator>
      <pubDate>Wed, 22 Oct 2025 16:10:28 +0000</pubDate>
      <link>https://dev.to/leading-edje/from-dev-to-devops-ib2</link>
      <guid>https://dev.to/leading-edje/from-dev-to-devops-ib2</guid>
      <description>&lt;p&gt;DevOps is more than a role; it's a culture and mindset that bridges the gap between development and operations. Any member of an IT organization or software company can embrace DevOps principles to improve collaboration, streamline processes, and enhance software delivery. Any person can carry more than one role. However, the literature for DevOps often starts with operations: system administrators, infrastructure engineers, and site reliability engineers (SREs). One of the best books on the topic, &lt;a href="https://itrevolution.com/product/the-phoenix-project/" rel="noopener noreferrer"&gt;The Phoenix Project&lt;/a&gt;, is written from the perspective of an operations manager (and I highly recommend reading it). DevOps is about operations, but it is also about development. In truth, DevOps is about the entire software lifecycle and thus any person involved in it can learn and grow into a DevOps role. One such path is from developer to a DevOps engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The common guidance
&lt;/h2&gt;

&lt;p&gt;The most common guidance for learning DevOps is to start with tooling from the operations perspective: begin with Linux, or containers, or Kubernetes. Some may find success this way, but I find it misleading. DevOps is hard enough to learn on its own, and these technologies are complex. And to practice DevOps, it does not matter whether your code runs in a container, in a virtual machine, or bare-metal on a Windows server. A well-informed DevOps engineer does know why containerization is used and why the choice of operating system matters. But instead of leading with tooling, I recommend starting with what you know and building on that. If you want to learn DevOps, start with the various roles that practice it: developer, tester, operations engineer, or project manager. Here, I focus on the developer role because that is my background and what I know best.&lt;/p&gt;

&lt;h2&gt;
  
  
  The developer role
&lt;/h2&gt;

&lt;p&gt;A developer is responsible for writing, testing, and maintaining code that forms the basis of software applications. They work with team members in various roles to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand requirements of what to build and translate them into functional software.&lt;/li&gt;
&lt;li&gt;Write clean, efficient, and quality code that is testable and maintainable.&lt;/li&gt;
&lt;li&gt;Ensure the software is buildable, deployable, and operational in installed environments.&lt;/li&gt;
&lt;li&gt;Deliver software that meets user needs and business goals in a timely manner.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One distinction is that a person may perform more than one role. For example, a developer may also be acting in the role of a manager, a designer, or a network engineer. The role of a developer is focused on developing software, but a person is often responsible for more than just writing code. Commonly, people in the developer role are also responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Troubleshooting business applications and triaging why behavior is not as expected.&lt;/li&gt;
&lt;li&gt;Understanding legacy software and how it operates critical business logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In an enterprise and during a production incident, someone in the role of developer may be called to explain why insurance claims are still pending or an appointment booking failed. At the intersection of operations and development, a developer may be the first to know when a database failure or network outage is causing business disruption. In this way, developers are already acting outside of the limited scope of writing code. This is where DevOps comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DevOps shift
&lt;/h2&gt;

&lt;p&gt;DevOps is about the entire software lifecycle and the interrelationships between traditional developer and operations roles. A person in the role of both developer and DevOps engineer is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding the entire software lifecycle, from planning feature requirements and writing code to deploying and maintaining applications in production.&lt;/li&gt;
&lt;li&gt;Developing the solutions that support the software lifecycle, such as CI/CD pipelines, infrastructure as code, and automated tests.&lt;/li&gt;
&lt;li&gt;Knowing the difference between code written and value delivered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Being a DevOps engineer may never come with a direct title change. However, it may represent the growth in responsibilities commonly required for promotion. A developer who can implement DevOps practices in tooling is one who understands the architecture, processes, and business value of their application and knows how to drive change with teams. These are the qualities that lead to senior and principal engineer roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning DevOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakhg9myj9n9dr8ugtfi7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakhg9myj9n9dr8ugtfi7.jpg" alt="The DevOps learning path for developers" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Build systems
&lt;/h3&gt;

&lt;p&gt;As a developer, you are already working with various technologies used for DevOps. The first is your build system. Today, software is built constantly: you build your application locally multiple times to test changes, and you may use pipelines to build it in another environment to verify changes in a pull request. If you want to move from developer to DevOps engineer, the first place to start is understanding how your code is built and how it is run in all the different environments.&lt;/p&gt;

&lt;p&gt;With .NET, this means understanding the differences between the .NET SDK and runtime and the &lt;a href="https://learn.microsoft.com/en-us/dotnet/core/tools/" rel="noopener noreferrer"&gt;dotnet CLI&lt;/a&gt; used to build, run, and publish code. For JavaScript, this means understanding the differences between development servers, bundling, and how static files are served in browsers. Every language has its own build tools and execution environments. For .NET, the common language runtime (CLR) runs code on Windows, Linux, and macOS. For JavaScript, the runtime is the browser or Node.js. Understanding how your code is built and executed is critical to automation and maintenance. When you know this, you can begin to optimize and automate the process.&lt;/p&gt;
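&lt;p&gt;To make that concrete, here is a sketch of the dotnet CLI commands hiding behind the IDE's build button. The flags and the &lt;code&gt;./out&lt;/code&gt; folder are illustrative, and the script simply reports if no .NET SDK is installed:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: the dotnet CLI steps behind an IDE build. The flags and ./out
# path are illustrative; a pipeline would run these same commands.
set -eu

if command -v dotnet >/dev/null 2>&1; then
  dotnet restore                                   # download NuGet dependencies
  dotnet build --configuration Release             # compile the application
  dotnet publish --configuration Release -o ./out  # produce runtime-ready artifacts
else
  echo "dotnet SDK not found; install it to try these steps"
fi
```

&lt;p&gt;Once you can run these commands yourself, you can run them anywhere: locally, in a container, or in a pipeline.&lt;/p&gt;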

&lt;h3&gt;
  
  
  Source control concepts
&lt;/h3&gt;

&lt;p&gt;Most developers are already using source control, such as Git, to store and collaborate on code. However, it is an underappreciated tool that is critical to developers and DevOps engineers alike. Source control systems are the foundation of collaboration and change management. GitOps is a practice that uses Git repositories as the source of truth for all kinds of code, including application code, infrastructure as code, configuration files, and CI/CD pipelines. Your branching strategies and pull request processes are key aspects of how you audit and manage change. Git is the tool, but GitOps is the adoption of DevOps practices for automation of operational concerns. Turns out this developer tool is also a DevOps tool.&lt;/p&gt;
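&lt;p&gt;A minimal sketch of the GitOps idea: an infrastructure definition committed to Git gets the same audit trail as application code. The file name and its contents here are hypothetical placeholders:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: Git as the source of truth for infrastructure. The .tf file and
# its contents are hypothetical placeholders.
set -eu
command -v git >/dev/null 2>&1 || { echo "git not installed"; exit 0; }

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

# Commit an infrastructure definition just like application code.
mkdir -p infra
printf 'instance_size = "small"\n' > infra/web-server.tf
git add infra/web-server.tf
git commit -q -m "Provision web server"

# A later change is a new commit: who, what, and when, for free.
printf 'instance_size = "medium"\n' > infra/web-server.tf
git commit -q -am "Scale up web server"

git log --oneline -- infra/web-server.tf  # the audit trail for the infrastructure
```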

&lt;h3&gt;
  
  
  Command-line and scripting
&lt;/h3&gt;

&lt;p&gt;Most developers can avoid the command line these days; IDEs and graphical interfaces often abstract away the need for a command-line interface (CLI). However, CLIs are essential for DevOps automation. You may know that F5 runs your code in the IDE, but when authoring a pipeline you need to know the commands that do the same thing. Sometimes it becomes a series of commands, at which point you transition from simple commands to scripting. The commonly recommended scripting language is Bash, as it is the native shell on Linux, but any scripting language will help you as you learn DevOps: you can learn PowerShell or Python and still accomplish much of what you need to do. Bash, PowerShell, and Python are all cross-platform choices. The key is to learn to automate tasks that you would otherwise do manually with a mouse. Practice navigating your file system, managing installed applications, and running your build commands from the command line.&lt;/p&gt;
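&lt;p&gt;For example, here is a small POSIX shell script that automates a chore you might otherwise do by clicking through a file explorer: deleting build output folders before a clean build. The &lt;code&gt;bin&lt;/code&gt; and &lt;code&gt;obj&lt;/code&gt; names are the common .NET ones; adjust them for your stack:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: automating a manual chore. Creates a throwaway project tree, then
# removes every bin/ and obj/ directory in one command.
set -eu

# A fake project tree to demonstrate against.
mkdir -p demo/src/App/bin demo/src/App/obj demo/src/Lib/bin
touch demo/src/App/bin/App.dll demo/src/Lib/bin/Lib.dll

# The actual automation: find and delete all build output directories.
find demo -type d \( -name bin -o -name obj \) -prune -exec rm -rf {} +

find demo -type d  # only the source folders remain
```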

&lt;h3&gt;
  
  
  Continuous integration and delivery
&lt;/h3&gt;

&lt;p&gt;The best-known acronym in DevOps is CI/CD, which stands for continuous integration and continuous delivery (or deployment). As a developer, you may already be using CI/CD pipelines to build and test your code. They may be tied to your source control platform, such as GitHub Actions, Azure DevOps Pipelines, or GitLab CI/CD, or run on a standalone system like Jenkins. This is likely the first tooling primarily associated with DevOps that you will start authoring as you learn the role of DevOps engineer.&lt;/p&gt;

&lt;p&gt;However, a pipeline in and of itself is not CI/CD. You can write a pipeline that copies source code to a server, but that does not give you continuous integration, delivery, or deployment. Continuous integration is improved through pipelines that compile code consistently, run tests to verify changes, and enforce quality through additional checks like linters and static code analyzers. Continuous delivery is achieved when your pipelines produce deployment-ready artifacts that are reusable and ready to deploy to any environment. Continuous deployment is achieved when your pipelines automatically deploy code to your environments without human intervention. A pipeline is a tool, but CI/CD is a practice and an outcome. Learn pipeline tooling, but learn it with the goal of automating the steps needed for CI/CD.&lt;/p&gt;
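&lt;p&gt;Whatever pipeline platform you use, the YAML ultimately runs an ordered, fail-fast series of commands. This sketch models that shape in plain shell; the stage bodies are placeholders where a real pipeline would call your build tool:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: the shape of a CI pipeline as a fail-fast script. Each function is a
# placeholder; swap the echo for your real build tool commands.
set -eu  # any failing stage stops the pipeline immediately

restore()   { echo "restoring dependencies"; }          # e.g. dotnet restore
build()     { echo "compiling in Release mode"; }       # e.g. dotnet build -c Release
run_tests() { echo "running automated tests"; }         # e.g. dotnet test
package()   { echo "publishing deployable artifact"; }  # e.g. dotnet publish -o ./artifact

restore
build
run_tests
package
echo "pipeline succeeded"
```

&lt;p&gt;Because of &lt;code&gt;set -eu&lt;/code&gt;, a failure in any stage stops everything after it, which is exactly the behavior you want from continuous integration: fast, unambiguous feedback.&lt;/p&gt;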

&lt;h3&gt;
  
  
  Hosting and runtime environments
&lt;/h3&gt;

&lt;p&gt;As you learn pipelines and the concepts of CI/CD, you will also need to understand where your code is run. This can vary widely depending on your organization or application. You may be running on bare-metal servers, virtual machines, containers, or serverless environments. You may be running on-premises or in the cloud. You may be using a platform-as-a-service (PaaS) or infrastructure-as-a-service (IaaS). The key is to understand where your code is run, the benefits and trade-offs of each environment, and how to get your code there. Learning Kubernetes in depth may help if your organization is using it, but it is overkill for a static website or hobby project, and it doesn't help if your organization isn't using containers. Instead, focus on learning the environment your code already runs in. What operating system is used? Which cloud provider? Are there differences between the platforms used in development and production?&lt;/p&gt;
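&lt;p&gt;You can start answering those questions from the shell. This sketch reports the operating system and architecture, and uses the Docker-specific &lt;code&gt;/.dockerenv&lt;/code&gt; marker as one (imperfect) container check; other runtimes need other checks:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: inspecting where code actually runs. The /.dockerenv check only
# detects Docker specifically, not containers in general.
set -eu

echo "Kernel / OS:   $(uname -s) $(uname -r)"
echo "Architecture:  $(uname -m)"

if [ -f /.dockerenv ]; then
  echo "Environment:   inside a Docker container"
else
  echo "Environment:   not a Docker container (VM, bare metal, or another runtime)"
fi
```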

&lt;h3&gt;
  
  
  Infrastructure as code
&lt;/h3&gt;

&lt;p&gt;As you learn where your code is run, you will also need to learn how that environment is created and configured. This is where infrastructure as code (IaC) comes in and the developer skills you already possess can shine. IaC is the practice of defining your hosting and runtime environments through code. Various languages and tools exist for this, such as Terraform, Azure Bicep, Ansible, Pulumi, and PowerShell DSC. The value in IaC is the same as traditional source code: it is versioned, readable, and traceable. If you write something to create a virtual machine and never commit it to a central repository, it is lost. However, if you write a Terraform file to create a VM and commit it to source control, you can track changes, review history, and implement CI/CD practices to validate changes and achieve infrastructure automation. As a developer, you already know how to write code. You can learn IaC and apply your existing skills to an operations domain.&lt;/p&gt;
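&lt;p&gt;As a sketch of that workflow, here is a deliberately tiny Terraform configuration and the CLI commands that validate and preview it. The configuration is a placeholder with no real resources, the script reports if Terraform is not installed, and &lt;code&gt;terraform apply&lt;/code&gt; is left commented out because it is the step that actually changes infrastructure:&lt;/p&gt;

```shell
#!/usr/bin/env sh
# Sketch: the IaC review loop with Terraform. The configuration below is a
# placeholder with no real resources.
set -eu

mkdir -p infra && cd infra
cat > main.tf <<'EOF'
# Hypothetical configuration: a single output, no real resources.
output "greeting" {
  value = "hello from IaC"
}
EOF

if command -v terraform >/dev/null 2>&1; then
  terraform init -backend=false  # download providers; no remote state
  terraform fmt -check           # formatting check, like a linter for infra
  terraform validate             # catch syntax and reference errors early
  terraform plan                 # preview the change; reviewable in a pull request
  # terraform apply              # the deployment step, typically run by a pipeline
else
  echo "terraform not found; install it to try this workflow"
fi
```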

&lt;h2&gt;
  
  
  Continuous learning
&lt;/h2&gt;

&lt;p&gt;The journey from developer to DevOps engineer is a surprisingly natural evolution. Developers already know their application and the value it delivers. They already know how to write code and collaborate with others. They already know the software lifecycle and the pains of delivering software. Learning DevOps is about expanding their existing knowledge and skills to automate and optimize the concerns outside of developing new features. The best way to learn DevOps is not necessarily learning Linux or Kubernetes, but instead mastering the tools you are already using and expanding your knowledge of the whole system. Learn how your code is built, where it is run, and how it gets there. Automate the friction in the process. When you start there, the mindset of DevOps falls into place:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb00nvgodwsqg3nkwhmse.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb00nvgodwsqg3nkwhmse.jpg" alt="Continuous learning and applications of DevOps knowledge for developers" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you understand your build system, you can optimize your code runtime and &lt;a href="https://victorfrye.com/blog/posts/multi-stage-docker-dotnet-guide" rel="noopener noreferrer"&gt;apply containerization efficiently&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;When you know source control concepts, you can apply them to infrastructure and pipelines for version control, collaboration, and traceability.&lt;/li&gt;
&lt;li&gt;When you possess command-line knowledge, you can automate tasks for test quality and CI/CD pipelines.&lt;/li&gt;
&lt;li&gt;When you control your pipelines, you can automate for faster feedback and software delivery.&lt;/li&gt;
&lt;li&gt;When you understand your hosting environment, you can optimize for scalability and apply effective deployment strategies.&lt;/li&gt;
&lt;li&gt;When you write high quality code, you can apply the same principles to infrastructure, pipeline, and test code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DevOps is not a set of tools or a team, but a fuzzier concept: a mindset and shared responsibility. The path to learning DevOps is likewise inexact. The concepts and tooling mentioned here are how I started to learn DevOps as a developer. Your path may be different, but the key is to start with what you know and use today. From there, you learn the adjacent concepts, the tooling, and the why behind it all. And then, you keep learning.&lt;/p&gt;

</description>
      <category>devops</category>
    </item>
    <item>
      <title>Tracking AI system performance using AI Evaluation Reports</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 09 Sep 2025 20:08:51 +0000</pubDate>
      <link>https://dev.to/leading-edje/tracking-ai-system-performance-using-ai-evaluation-reports-376n</link>
      <guid>https://dev.to/leading-edje/tracking-ai-system-performance-using-ai-evaluation-reports-376n</guid>
      <description>&lt;p&gt;A few months ago I wrote about &lt;a href="https://blog.leadingedje.com/post/ai/evaluation.html" rel="noopener noreferrer"&gt;how the AI Evaluation Library can help automate evaluating LLM applications&lt;/a&gt;. This capability is tremendously helpful in measuring the quality of your AI solutions, but it's only a part of the picture in terms of representing your application quality. In this article I'll walk through the AI Evaluation Reporting library and show how you can build interactive reports that help share model quality with your whole team, including product managers, testers, developers, and executives.&lt;/p&gt;

&lt;p&gt;This article will start with an exploration of the final report and its capabilities, then dive into the handful of lines of C# code needed to generate the report in .NET using the &lt;a href="https://learn.microsoft.com/en-us/dotnet/ai/tutorials/evaluate-with-reporting" rel="noopener noreferrer"&gt;Microsoft.Extensions.AI.Evaluation.Reporting&lt;/a&gt; library before concluding with thoughts on where this capability fits into your day-to-day workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Extensions AI Evaluation Report
&lt;/h2&gt;

&lt;p&gt;Let's start by taking a look at what we're talking about here: The AI Evaluation Report showcasing the performance of a series of different &lt;strong&gt;evaluators&lt;/strong&gt; as they grade a sample interaction produced by an LLM application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvdxans5xsweeiz5qzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvdxans5xsweeiz5qzw.png" alt="AI Evaluation Report showing a series of evaluation results" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This particular example features a single scenario where an AI agent is instructed to respond to interactions with humorous haikus related to the topic the user is mentioning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;System Prompt&lt;/strong&gt;: You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User&lt;/strong&gt;: I'm learning about AI Evaluation and reporting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assistant&lt;/strong&gt;: I grade clever bots, reports spill midnight secrets, robots giggle on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While not the best interaction, the system technically did close to what it was instructed to do, and the report summarizes the strengths and weaknesses of the system in handling this interaction.&lt;/p&gt;

&lt;p&gt;Let's talk about how it works.&lt;/p&gt;

&lt;p&gt;This "report card" was generated by sending the conversation history to an LLM with instructions on how to evaluate it for different capabilities including coherence, English fluency / grammatical correctness, relevance, truthfulness, and completeness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo2qlaxi0hahib6l8ice.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo2qlaxi0hahib6l8ice.png" alt="Communication diagram showing the different LLMs being used to generate and evaluate replies" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This evaluation is performed using an LLM and specially prepared prompts built to evaluate the performance of this interaction. The evaluation LLM can be the same model you used for the conversation, or a different one entirely.&lt;/p&gt;

&lt;p&gt;The results of this evaluation are persisted in a data store (such as on disk or in Azure) and are available to help show trends over time as well as to generate periodic reports in HTML format.&lt;/p&gt;

&lt;p&gt;Because the evaluation report is an HTML document, it supports some interactive features. For example, you can click into a particular evaluator and see details of its evaluation, as shown here for the Fluency evaluator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbw9y4pue7rghy7f9jjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbw9y4pue7rghy7f9jjj.png" alt="Evaluation Details for the Fluency Metric" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see the fluency evaluator giving the response middling reviews for English fluency, which is likely because the fluency evaluator is designed more for conversational English and articles than for haikus like the one our bot is generating.&lt;/p&gt;

&lt;p&gt;Note that we can see some specific metrics on the tokens that were used, the amount of time taken, and the specific model being used for evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing an AI Evaluation Report in .NET
&lt;/h2&gt;

&lt;p&gt;There are a few more aspects of this evaluation report we'll highlight, and we'll talk later about how this report fits into your organization's overall workflow, but for now let's talk about how to generate it.&lt;/p&gt;

&lt;p&gt;In this section I'll walk through the C# code needed to generate the report shown here in this article.&lt;/p&gt;

&lt;p&gt;This code is taken directly from &lt;a href="https://github.com/IntegerMan/AIEvaluationSamples" rel="noopener noreferrer"&gt;my GitHub repository&lt;/a&gt; and is specifically inside of the &lt;code&gt;EvaluationReportGeneration&lt;/code&gt; project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting to Chat and Evaluation Models
&lt;/h3&gt;

&lt;p&gt;The first thing we need to do in our application is create a chat client for our AI evaluation as well as for our chat completions. I'll do this here with two &lt;code&gt;OpenAIClient&lt;/code&gt; objects representing our chat and evaluation models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Load Settings&lt;/span&gt;
&lt;span class="n"&gt;ReportGenerationDemoSettings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConfigurationHelpers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReportGenerationDemoSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Connect to OpenAI&lt;/span&gt;
&lt;span class="n"&gt;OpenAIClientOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIEndpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;ApiKeyCredential&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ApiKeyCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;IChatClient&lt;/span&gt; &lt;span class="n"&gt;evalClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OpenAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EvaluationModelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsIChatClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;IChatClient&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OpenAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatModelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsIChatClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can connect chat and evaluation to any model provider with an &lt;code&gt;IChatClient&lt;/code&gt; implementation, which are either available or in preview for all major model providers such as OpenAI, Azure, Ollama, Anthropic, and more.&lt;/p&gt;

&lt;p&gt;In this article I'm using &lt;code&gt;o3-mini&lt;/code&gt; as my chat model generating the responses and &lt;code&gt;gpt-4o&lt;/code&gt; as the evaluation model (the current recommended model by Microsoft as of this article).&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Report Configuration
&lt;/h3&gt;

&lt;p&gt;Now that we've got our chat clients ready, our next step is to create a &lt;code&gt;ReportingConfiguration&lt;/code&gt; which will store the raw metrics and conversations that are evaluated over time. This helps in centralizing reporting data and in building trends over time in reports.&lt;/p&gt;

&lt;p&gt;There are currently two supported default options for this: &lt;code&gt;DiskBasedReportingConfiguration&lt;/code&gt;, which stores data on disk in a location you specify, and the &lt;code&gt;AzureStorageReportingConfiguration&lt;/code&gt; option present in the &lt;code&gt;Microsoft.Extensions.AI.Evaluation.Reporting.Azure&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;We'll go with the disk-based configuration in this sample because it's far simpler to configure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set up reporting configuration to store results on disk&lt;/span&gt;
&lt;span class="n"&gt;ReportingConfiguration&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DiskBasedReportingConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;storageRootPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;@"C:\dev\Reporting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;chatConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evalClient&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;executionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$"&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;yyyyMMddTHHmmss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RelevanceTruthAndCompletenessEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CoherenceEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FluencyEvaluator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we create our &lt;code&gt;ReportingConfiguration&lt;/code&gt; by telling it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where to store the raw report metrics on disk (not the location for the generated report file)&lt;/li&gt;
&lt;li&gt;Which chat connection it should use to evaluate the interactions&lt;/li&gt;
&lt;li&gt;A unique name for the evaluation run. This will be used to generate folder names so only certain characters are allowed.&lt;/li&gt;
&lt;li&gt;One or more &lt;code&gt;IEvaluator&lt;/code&gt; objects to use in generating evaluation metrics. This is equivalent to using a &lt;code&gt;CompositeEvaluator&lt;/code&gt; like I demonstrated in my prior article.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More on Evaluators:&lt;/em&gt; If you're looking for more detail on the various evaluators you can use or how they work, I go into each of these evaluators more in my &lt;a href="https://blog.leadingedje.com/post/ai/evaluation.html" rel="noopener noreferrer"&gt;article on MEAI Evaluation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also specify tags that apply to your entire evaluation run here, but I'll cover tags in a future article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining a Scenario Run
&lt;/h3&gt;

&lt;p&gt;Evaluation reports have one or more scenario runs associated with them, representing a specific test case.&lt;/p&gt;

&lt;p&gt;We'll create a single "Joke Haiku Bot" scenario for our purposes here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Start a scenario run to capture results for this scenario&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ScenarioRun&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateScenarioRunAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;scenarioName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Joke Haiku Bot"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="c1"&gt;// Contents detailed in next few snippets...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we're using an &lt;code&gt;await using&lt;/code&gt; around the whole context of our &lt;code&gt;ScenarioRun&lt;/code&gt; object. This makes sure the run is properly disposed, which causes its metrics to be reported to the reporting configuration object and persisted to disk.&lt;/p&gt;

&lt;p&gt;If we had additional scenarios, we could define each one sequentially so that we're aggregating our evaluation results into a single report. In this article we'll keep things simple and look only at a single case, but in our next article in the series I'll cover iteration, experimentation, and multiple scenarios.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Important Note:&lt;/em&gt; It's important that any &lt;code&gt;ScenarioRun&lt;/code&gt; objects you're using for your evaluation are disposed before you use their evaluation metrics to generate a report. This is why I'm using the &lt;code&gt;await using&lt;/code&gt; syntax here as well as explicitly declaring the scope of the object instead of using the newer "scopeless" style of defining the object in a &lt;code&gt;using&lt;/code&gt; statement.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Getting and Evaluating a Response
&lt;/h3&gt;

&lt;p&gt;Now that we have an active &lt;code&gt;ScenarioRun&lt;/code&gt; object we need a list of &lt;code&gt;ChatMessage&lt;/code&gt; objects to send to the chat model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I'm learning about AI Evaluation and reporting"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that in place, we can send the messages to the chat model using our chat client and get back a &lt;code&gt;ChatResponse&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use our CHAT model to generate a response&lt;/span&gt;
&lt;span class="n"&gt;ChatResponse&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This particular example is using the &lt;code&gt;IChatClient&lt;/code&gt; defined in the &lt;code&gt;Microsoft.Extensions.AI&lt;/code&gt; (MEAI) package to do this, but you could use something else such as Semantic Kernel or another library, or even just hard-code a chat response you've observed in the wild.&lt;/p&gt;
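&lt;p&gt;For instance, hard-coding an observed response is just a matter of constructing the message yourself. This is a sketch against the MEAI types used above - exact constructor overloads may vary by package version, and the haiku text here is made up:&lt;/p&gt;

```csharp
// Sketch: substitute a canned reply for the live model call, which is
// handy for evaluating transcripts you've captured from production.
string observedReply = "Learning how to grade / my chatbot with a rubric / reports bloom like spring";
ChatResponse response = new(new ChatMessage(ChatRole.Assistant, observedReply));
```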

&lt;p&gt;Once we have our list of messages and the model's response, we can send both of them to our &lt;code&gt;ScenarioRun&lt;/code&gt; for evaluation with a single line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use the EVALUATION model to grade the response using our evaluators&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EvaluateAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This call returns an &lt;code&gt;EvaluationResult&lt;/code&gt; object if you want to look at the immediate output of the evaluation, but the results are also persisted by our reporting configuration, so we don't &lt;em&gt;need&lt;/em&gt; to take immediate action on them.&lt;/p&gt;
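&lt;p&gt;If you do want to act on the result right away, you can capture it and loop over its metrics. This is a sketch assuming the &lt;code&gt;Metrics&lt;/code&gt; dictionary and &lt;code&gt;Interpretation&lt;/code&gt; members exposed by the evaluation library; check your package version for exact member names:&lt;/p&gt;

```csharp
// Sketch: inspect the evaluation output immediately instead of (or in
// addition to) waiting for the HTML report. Member names are assumptions.
EvaluationResult result = await run.EvaluateAsync(messages, response);

foreach (var metric in result.Metrics.Values)
{
    Console.WriteLine($"{metric.Name}: {metric.Interpretation?.Rating}");
}
```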

&lt;h3&gt;
  
  
  Generating an AI Evaluation Report
&lt;/h3&gt;

&lt;p&gt;We've now created our reporting configuration, started a scenario, gotten a chat response, and then used our evaluators to grade it. Let's talk about actually building an HTML report from our evaluation data.&lt;/p&gt;

&lt;p&gt;The first step of this is to identify the data that should be included in our report.&lt;/p&gt;

&lt;p&gt;While it may seem like we already have that data in hand, the evaluation report can show trends across your different evaluations over time, which is handy for seeing how your experiments are impacting overall quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc488jjv47dpjv1itwvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc488jjv47dpjv1itwvj.png" alt="Trends over time showing some fluctuation in the overall metrics for a scenario" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I typically include the last 5 results in my reports, and use this snippet to grab that data from my reporting configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enumerate the last 5 executions and add them to our list we'll use for reporting&lt;/span&gt;
&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ScenarioRunResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResultStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetLatestExecutionNamesAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResultStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadResultsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we'll use these results from our scenarios to generate the output report file. Reports can be written in JSON or HTML format. I typically choose the HTML option because these reports include an option to export the underlying JSON if you need it.&lt;/p&gt;

&lt;p&gt;The code for this is fairly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;reportFilePath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CurrentDirectory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Report.html"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;IEvaluationReportWriter&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;HtmlReportWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reportFilePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteReportAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a new report in the &lt;code&gt;Report.html&lt;/code&gt; file we specified. You can then open up that file manually and see the results, or you can start a process to open this report in your default web browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ProcessStartInfo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;FileName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reportFilePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;UseShellExecute&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this executes, the user's operating system will handle the report just as if the user had double-clicked the file in their file system - potentially opening a web browser or asking what action they'd like to take with this file or type of file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical uses for AI Evaluation Reports
&lt;/h2&gt;

&lt;p&gt;Now that we've covered AI Evaluation reports and how to generate them using C#, let's close this article with a discussion of how this technology potentially fits into your workflow.&lt;/p&gt;

&lt;p&gt;First of all, if you're looking for a way of evaluating your AI systems, AI Evaluation reports are a fantastic option, even for a solo developer trying to understand the performance of their hobby projects. The graphical reports, with the ability to click into details, are easier to work with than the raw &lt;code&gt;EvaluationResult&lt;/code&gt; objects and their nested metric objects.&lt;/p&gt;

&lt;p&gt;For more serious usage, AI Evaluation has tremendous merit because it equips you to share something graphical with others to help them understand how your application behaves with different implementations. Instead of having conversations about your models being "good" or "not good enough", you can have targeted conversations about the specific interactions your system is succeeding with and those it is struggling with.&lt;/p&gt;

&lt;p&gt;Because these HTML files are interactive and intuitive, this technology enables people in your organization to explore the examples on their own and internalize more of the system's strengths and weaknesses. In a nutshell, these reports make it easy to see and share information about the state of your AI systems.&lt;/p&gt;

&lt;p&gt;I view AI evaluation as a vital part of integration testing and the MLOps process prior to any new deployment - or potentially even as a gate that blocks feature branches from rejoining the &lt;code&gt;main&lt;/code&gt; product branch as part of the pull request review process in development. Having a graphical report to go with it can help you understand the trends and performance of your models over time and how different changes impact that performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Recommendations
&lt;/h2&gt;

&lt;p&gt;AI Evaluation and evaluation reporting are important aspects of your team's success with its AI offerings.&lt;/p&gt;

&lt;p&gt;Here are some closing recommendations I have when adopting AI evaluation tooling into your organization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI Evaluation and evaluation reports are key parts of any significant release that updates the behavior of an AI agent and should be part of your quality assurance and product management efforts.&lt;/li&gt;
&lt;li&gt;Automating AI Evaluation as part of your integration tests is worth the effort. You can also optionally have significant degradation of evaluated quality fail your tests when run as an actual integration test (not covered in this article, but I plan on writing more on this in the future).&lt;/li&gt;
&lt;li&gt;The quality of your evaluation model matters. It's worth using a more capable model for this as it's more likely to grasp the full context of the request and the response that was generated.&lt;/li&gt;
&lt;li&gt;Having automated evaluation in place frees you up to do more experimentation around your system prompts, model selection, and other parameters and settings. Make sure you have this automation in place to collect metrics before doing serious performance tuning of your models as these metrics can help guide your decision-making and refinement process.&lt;/li&gt;
&lt;li&gt;Store your model metrics in a centralized location for tracking over time. I recommend a dedicated shared location just for release candidates as well as local metrics storage for developers during development and testing.&lt;/li&gt;
&lt;li&gt;The resulting HTML reports should be shared with your entire team, including organizational leadership, quality assurance, and product owners. This practice helps cut through the hype and fear around AI systems and allows your full team to more meaningfully understand what your system is good and bad at.&lt;/li&gt;
&lt;li&gt;Just because your metrics are high doesn't mean that your system is performing well. It just means it's performing well for those interactions you're measuring and observing.&lt;/li&gt;
&lt;li&gt;As your system grows, its evaluation suite should grow as well. As you find new interactions it struggles with or add new capabilities to the system, you should add new scenarios to represent them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I view AI Evaluation as a vital part of the development of AI systems and evaluation reports make these systems so much more understandable to your whole team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>csharp</category>
      <category>testing</category>
      <category>llm</category>
    </item>
    <item>
      <title>Add Structured Testing to Your AI Vibe - with promptfoo</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Thu, 04 Sep 2025 11:46:23 +0000</pubDate>
      <link>https://dev.to/leading-edje/add-structured-testing-to-your-ai-vibe-with-promptfoo-5h3o</link>
      <guid>https://dev.to/leading-edje/add-structured-testing-to-your-ai-vibe-with-promptfoo-5h3o</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038"&gt;previous promptfoo post&lt;/a&gt;, we covered the basics of testing LLM prompts with simple examples using &lt;a href="https://www.promptfoo.dev/docs/intro/" rel="noopener noreferrer"&gt;promptfoo&lt;/a&gt;. But when you're building an actual application that processes user-generated content at scale, you might discover that your carefully crafted prompt needs to handle far more complexity than you initially anticipated.&lt;/p&gt;

&lt;p&gt;Many teams are still doing "vibe testing" - manually checking a few examples, tweaking prompts based on gut feel, and hoping everything works in production. While this might get you started, a systematic evaluation framework puts you significantly ahead of the curve when it comes to building and maintaining reliable AI systems, and provides a mechanism to build a set of repeatable automated regression tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Assignment
&lt;/h2&gt;

&lt;p&gt;Let's consider an example.  You're working with a major ecommerce client, and your team is building a feature that will analyze user-submitted product reviews. Your application needs to evaluate the product reviews, classify sentiment, extract key product features mentioned, detect potentially fake reviews, and make moderation decisions.  This will help customers find trustworthy reviews and help your business maintain review quality.&lt;/p&gt;

&lt;p&gt;The core of this system is a prompt that takes each incoming review and returns structured data, such as sentiment classification, confidence scores, extracted features, fake review indicators, and moderation recommendations. &lt;/p&gt;

&lt;p&gt;This prompt might work well during development, but once deployed, it needs to handle the messy reality of real user reviews. Your prompt will definitely need to be able to handle things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mixed sentiment reviews (loved the product, hated the shipping)&lt;/li&gt;
&lt;li&gt;Fake or suspicious reviews&lt;/li&gt;
&lt;li&gt;Reviews with profanity or inappropriate content&lt;/li&gt;
&lt;li&gt;Sarcastic or nuanced language&lt;/li&gt;
&lt;li&gt;Reviews that mention competitors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where a systematic process with multiple scenarios becomes crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Requirements
&lt;/h2&gt;

&lt;p&gt;Speaking of systematic processes, before we dive into building our prompt and setting up the promptfoo tests, let's outline what the requirements would look like.  We'll use our old friend gherkin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Product Review Analysis Prompt

  &lt;span class="kn"&gt;Scenario Outline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Prompt analyzes product reviews correctly
    &lt;span class="nf"&gt;Given &lt;/span&gt;a product review analysis prompt
    &lt;span class="nf"&gt;And &lt;/span&gt;a &lt;span class="s"&gt;"&amp;lt;review_type&amp;gt;"&lt;/span&gt; product review
    &lt;span class="nf"&gt;When &lt;/span&gt;the prompt processes the review
    &lt;span class="nf"&gt;Then &lt;/span&gt;the sentiment should be classified as &lt;span class="s"&gt;"&amp;lt;expected_sentiment&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;fake review indicators should be &lt;span class="s"&gt;"&amp;lt;fake_indicators&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the recommendation should be &lt;span class="s"&gt;"&amp;lt;expected_recommendation&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;key features should be extracted

    &lt;span class="nn"&gt;Examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;review_type&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_sentiment&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_fake_indicators&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_recommendation&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;approve&lt;/span&gt;                 &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;approve&lt;/span&gt;                 &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;mixed&lt;/span&gt;       &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;mixed&lt;/span&gt;              &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flag_for_review&lt;/span&gt;         &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;suspicious&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;present&lt;/span&gt;                  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flag_for_review&lt;/span&gt;         &lt;span class="p"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gherkin is just a way to describe requirements in plain language.  In this case, we have four main test scenarios: positive reviews, negative reviews, mixed sentiment reviews, and suspicious/fake reviews.&lt;/p&gt;

&lt;p&gt;Promptfoo doesn't use gherkin, but I do, and it helps me think through the scenarios we need to cover.  We'll translate these scenarios into actual promptfoo tests next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving Beyond Inline YAML: File-Based Organization
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038"&gt;last post&lt;/a&gt; we defined the entire test in YAML.  Before diving into complex scenarios, let's improve our testing structure by moving prompts into separate files. This makes them easier to maintain, version control, and collaborate on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;promptfoo-product-reviews/
├── prompts/
│   └── analyze-review.txt
├── test-data/
│   ├── positive-review.txt
│   ├── negative-review.txt
│   ├── mixed-review.txt
│   └── suspicious-review.txt
├── analyze-review-spec.yaml
└── package.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating Our Review Analysis Prompt
&lt;/h3&gt;

&lt;p&gt;Let's first create a prompt specifically designed for ecommerce product review analysis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;prompts/analyze-review.txt&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert product review analyzer for an ecommerce platform. Analyze the following product review and provide a structured assessment.

Product Review:
{{review_text}}

Provide your analysis in the following JSON format. Return ONLY the JSON object, no markdown code blocks, no explanations, no additional text:
{
  "sentiment": "positive|negative|mixed",
  "confidence": 0.0-1.0,
  "key_features_mentioned": ["feature1", "feature2"],
  "main_complaints": ["complaint1", "complaint2"],
  "main_praise": ["praise1", "praise2"],
  "suspected_fake": boolean,
  "fake_indicators": ["indicator1", "indicator2"],
  "recommendation": "approve|flag_for_review|reject",
  "summary": "Brief 1-2 sentence summary"
}

Focus on:
- Accurate sentiment classification, especially for mixed reviews
- Extracting specific product features mentioned
- Identifying potential fake review indicators such as generic language without specific details, suspicious patterns, extreme superlatives, and overly positive or negative language
- Providing actionable moderation recommendations

IMPORTANT: Return ONLY valid JSON. Do not wrap in markdown code blocks or add any other text.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test Scenarios: Real-World Product Reviews
&lt;/h2&gt;

&lt;p&gt;So that's the prompt we're going to test. Now let's create diverse test scenarios that represent what you'd actually encounter in production.  You might make these up, or you might use some actual production reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Genuine Positive Review Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/positive-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've been using these wireless earbuds for 3 months now and I'm really impressed. The battery life is excellent - I get about 6-7 hours of continuous listening, and the case gives me 2-3 full charges. The sound quality is crisp and clear, with good bass response for the price point. They stay comfortable in my ears during workouts and haven't fallen out once. The touch controls take some getting used to but work reliably once you learn them. Only minor complaint is that the case is a bit bulky for my small pockets, but that's a trade-off for the extra battery. Would definitely recommend for anyone looking for reliable wireless earbuds under $100.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: Detailed Negative Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/negative-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Very disappointed with these earbuds. The connection constantly drops out, especially when my phone is in my pocket or more than a few feet away. The battery life is nowhere near the advertised 8 hours - I'm lucky to get 4 hours before they die. The sound quality is muddy and lacks clarity, particularly in the mid-range frequencies. They're also uncomfortable for extended wear - my ears start hurting after about an hour. The touch controls are oversensitive and constantly trigger accidentally when I adjust them. For the price, I expected much better quality. I've had $20 earbuds that performed better than these. Returning them and looking for alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: Mixed Sentiment Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/mixed-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These earbuds are a mixed bag. On the positive side, the sound quality is really good - clear highs, decent bass, and good overall balance. The build quality feels solid and they look premium. The battery life meets expectations at around 6 hours. However, there are some significant issues. The Bluetooth connection is unreliable - frequent dropouts and sometimes one earbud stops working randomly. The fit is also problematic for me - they tend to slip out during exercise despite trying all the included ear tips. Customer service was helpful when I contacted them about the connection issues, but the firmware update they suggested didn't solve the problem. Overall, great sound quality let down by connectivity and fit issues. Might work better for others but not ideal for my use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 4: Suspicious/Fake Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/suspicious-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazing product! These earbuds are the best I have ever used in my entire life. The sound quality is absolutely perfect and the battery life is incredible. They are so comfortable and never fall out. The connection is always stable and strong. I love everything about these earbuds and they exceeded all my expectations. Everyone should buy these right now because they are the greatest earbuds ever made. Five stars without any doubt! Highly recommend to all people who want amazing earbuds with perfect quality and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comprehensive Test Configuration
&lt;/h2&gt;

&lt;p&gt;Now let's create a promptfoo configuration that tests all these scenarios with appropriate assertions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;analyze-review-spec.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;description: Product Review Analysis Testing

prompts:
  - file://prompts/analyze-review.txt

providers:
  - openai:chat:gpt-4o-mini

tests:
  # Test 1: Genuine Positive Review
  - vars:
      review_text: file://test-data/positive-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          return response.sentiment === 'positive' &amp;amp;&amp;amp; response.confidence &amp;gt; 0.7;
      - type: contains-json
        value:
          suspected_fake: false
      - type: llm-rubric
        value: "Should identify key positive features like battery life, sound quality, and comfort. Should not flag as fake since it contains specific details and minor complaints."

  # Test 2: Detailed Negative Review  
  - vars:
      review_text: file://test-data/negative-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          return response.sentiment === 'negative' &amp;amp;&amp;amp; response.confidence &amp;gt; 0.7;
      - type: contains-json
        value:
          suspected_fake: false
      - type: llm-rubric
        value: "Should identify specific complaints about connection, battery, sound quality, and comfort. Should extract main issues for product team review."

  # Test 3: Mixed Sentiment Review
  - vars:
      review_text: file://test-data/mixed-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          return response.sentiment === 'mixed';
      - type: llm-rubric
        value: "Should correctly identify mixed sentiment, extracting both positive aspects (sound quality, build) and negative aspects (connectivity, fit). This is the most challenging scenario for sentiment analysis."

  # Test 4: Suspicious/Fake Review
  - vars:
      review_text: file://test-data/suspicious-review.txt
    assert:
      - type: is-json
      - type: contains-json
        value:
          suspected_fake: true
      - type: javascript
        value: |
          const response = JSON.parse(output);
          return response.fake_indicators &amp;amp;&amp;amp; response.fake_indicators.length &amp;gt; 0;
      - type: llm-rubric
        value: "Should detect fake review indicators: overly positive language, lack of specific details, generic praise, and extreme superlatives."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Understanding the Test Specification
&lt;/h2&gt;

&lt;p&gt;Let's break down what this test configuration accomplishes. We have &lt;strong&gt;four distinct tests&lt;/strong&gt; that correspond to the four key scenarios mentioned above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test 1: Genuine Positive Review&lt;/strong&gt; - References &lt;code&gt;positive-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 2: Detailed Negative Review&lt;/strong&gt; - References &lt;code&gt;negative-review.txt&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 3: Mixed Sentiment Review&lt;/strong&gt; - References &lt;code&gt;mixed-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 4: Suspicious/Fake Review&lt;/strong&gt; - References &lt;code&gt;suspicious-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each test loads its respective product review using the &lt;code&gt;file://&lt;/code&gt; syntax, which tells promptfoo to read the content from the specified file and inject it into the &lt;code&gt;review_text&lt;/code&gt; variable in our prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Layered Assertions
&lt;/h3&gt;

&lt;p&gt;Notice that we're using multiple types of assertions for comprehensive validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is-json&lt;/code&gt;&lt;/strong&gt; - Ensures the output is valid JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;contains-json&lt;/code&gt;&lt;/strong&gt; - Checks for specific key-value pairs in the response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;javascript&lt;/code&gt;&lt;/strong&gt; - Uses inline JavaScript for custom validation logic (like checking sentiment and confidence scores)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llm-rubric&lt;/code&gt;&lt;/strong&gt; - Uses an LLM to evaluate whether the output meets human-readable criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inline JavaScript assertions are particularly powerful for complex validation. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This validates both the sentiment classification AND the model's confidence in its assessment, helping us catch edge cases where the model might be uncertain.&lt;/p&gt;
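&lt;p&gt;For context on where such a snippet lives, here is a sketch of how an inline JavaScript assertion can be embedded in a promptfoo test entry. The file name and the 0.7 threshold are illustrative assumptions, not necessarily the exact config from this project:&lt;/p&gt;

```yaml
# hypothetical excerpt from a promptfoo test spec
tests:
  - vars:
      review_text: file://positive-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          return response.sentiment === 'positive' &amp;&amp; response.confidence &gt; 0.7;
```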

&lt;h2&gt;
  
  
  Installation &amp;amp; Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as a dev dependency in your project&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; promptfoo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run the tests&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; promptfoo-product-reviews/analyze-review-spec.yaml &lt;span class="nt"&gt;--no-cache&lt;/span&gt;
&lt;span class="c"&gt;# View the results in web viewer&lt;/span&gt;
npx promptfoo view &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Understanding the Results
&lt;/h2&gt;

&lt;p&gt;The web viewer has a lot going on, and I could do an entire walkthrough of its features. For now, let's focus on the key insights it provides into the test and evaluation results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmp8rt9sa4sq7d36brt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmp8rt9sa4sq7d36brt.png" alt="promptfoo web viewer" width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
The results are displayed in a grid, and you can see our prompt in the first row. The second row shows the results of our first scenario, the positive review.&lt;/p&gt;

&lt;p&gt;Note that the prompt did a pretty good job of analyzing the review against our requirements. The viewer also displays the actual response from the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key_features_mentioned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"battery life"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sound quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comfort"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"touch controls"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main_complaints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"case is bulky"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main_praise"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"excellent battery life"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crisp and clear sound quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comfortable during workouts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reliable touch controls"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"suspected_fake"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fake_indicators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"approve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The reviewer expresses high satisfaction with the wireless earbuds, highlighting their excellent battery life and sound quality while noting a minor complaint about the case size."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding the tests to CI
&lt;/h2&gt;

&lt;p&gt;This is a great start, but we can take this a step further.  Since promptfoo just runs from the command line, we can include it as a regression test in our CI pipeline and ensure that future prompt changes don't break these tests. &lt;/p&gt;

&lt;p&gt;If we make changes to the prompt, or change the LLM provider, we can re-run this test and see if the results change.  If they do, we can investigate why.  &lt;/p&gt;

&lt;p&gt;As requirements change and morph, we can adapt the tests accordingly.&lt;/p&gt;
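&lt;p&gt;As one possible shape for this, a GitHub Actions job could run the same eval command on every pull request. This is a sketch under assumptions: the workflow file name, Node version, and secret name are illustrative, and you would adapt them to your own pipeline:&lt;/p&gt;

```yaml
# .github/workflows/prompt-tests.yml (hypothetical)
name: Prompt regression tests
on: [pull_request]
jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # This step fails (and fails the build) if any assertion fails
      - run: npx promptfoo eval -c promptfoo-product-reviews/analyze-review-spec.yaml --no-cache
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```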

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;In this post, we've explored how to set up a comprehensive testing framework for AI-generated product reviews using promptfoo. By defining clear test scenarios and leveraging multi-layered assertions, we can ensure our AI behaves as expected across a range of inputs.&lt;/p&gt;

&lt;p&gt;It might not surprise you to learn that my prompt was not perfect the first time.  Since I set up my automated tests first, it was easy to iterate on the prompt.  Sounds like test-driven development, huh?&lt;/p&gt;

&lt;p&gt;That's it for now.  Stay tuned for another promptfoo post before too long!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Navigating the Unexpected: How to Get Your Project Back on Track After a Setback</title>
      <dc:creator>Julie Yakunich</dc:creator>
      <pubDate>Tue, 26 Aug 2025 18:27:53 +0000</pubDate>
      <link>https://dev.to/leading-edje/navigating-the-unexpected-how-to-get-your-project-back-on-track-after-a-setback-3ebl</link>
      <guid>https://dev.to/leading-edje/navigating-the-unexpected-how-to-get-your-project-back-on-track-after-a-setback-3ebl</guid>
      <description>&lt;p&gt;We've all been there: your project is humming along nicely when suddenly, an unexpected interruption brings everything to a halt. Recently, our team faced a two-week break in the middle of a client project. When we reconvened, we encountered several challenges but also discovered valuable strategies for regaining momentum. Here's what we learned about getting back on track after an unexpected blip.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge of Resuming Work
&lt;/h2&gt;

&lt;p&gt;Returning to a paused project is rarely as simple as picking up where you left off. Our team immediately faced several obstacles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access roadblocks:&lt;/strong&gt; Regaining entry to necessary systems required navigating multiple layers of security and approval processes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeline concerns:&lt;/strong&gt; Stakeholders had legitimate questions about how the lost time would impact deliverables and deadlines.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Momentum loss:&lt;/strong&gt; The team's rhythm and flow had been disrupted, requiring intentional effort to rebuild.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7 Effective Steps to Regain Project Momentum
&lt;/h2&gt;

&lt;p&gt;Based on our experience, here are proven steps to help your team bounce back from an unexpected project interruption:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rally Strong Leadership
&lt;/h3&gt;

&lt;p&gt;Our project manager, scrum master, and product owner immediately aligned to advocate for the team's needs. This leadership triad created a protective buffer that allowed team members to focus on getting back to productivity while they handled administrative hurdles.  &lt;/p&gt;

&lt;p&gt;Part of this leadership alignment included &lt;strong&gt;rebaselining our project plan&lt;/strong&gt;—a meticulous process of adjusting timelines, renegotiating commitments, and communicating changes transparently. While the team initially felt anxious about how the new baseline might affect delivery, seeing a clear, updated path gave them reassurance that we could move forward with confidence.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Identify key leadership roles and ensure they're communicating frequently during the recovery period.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cultivate Patience Deliberately
&lt;/h3&gt;

&lt;p&gt;Frustration is natural when facing unexpected barriers. We made it a point to remind each other regularly that the process would take time and that patience would serve us better than impatience.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Acknowledge frustrations openly but pair them with reminders about the temporary nature of the challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Leverage Available Tools
&lt;/h3&gt;

&lt;p&gt;We were fortunate to have access to an internal, secure AI assistant that helped us review code and write tests. This technological support accelerated our ability to get back up to speed.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Audit what tools and resources might help your team recover more quickly, even if they weren't part of your original workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Intensify Team Connection
&lt;/h3&gt;

&lt;p&gt;Our scrum master made a conscious effort to check in with team members individually and frequently. We also increased team-building activities to rebuild the connection that had been temporarily lost.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Schedule additional informal check-ins and create opportunities for the team to reconnect socially as well as professionally.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Create Psychological Safety
&lt;/h3&gt;

&lt;p&gt;We established a safe space where team members could voice concerns without fear. This open dialogue led to creative solutions we might not have discovered otherwise. Even before the furlough, we had cultivated an environment of safety and trust where people could voice their concerns and opinions. This went a long way when we had an unexpected outage.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Host a dedicated session specifically for airing concerns and brainstorming recovery strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Reconnect with Purpose
&lt;/h3&gt;

&lt;p&gt;Reminding ourselves why we valued this client and project rekindled our motivation. This connection to the work proved powerful in overcoming obstacles.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Take time to explicitly discuss what team members find meaningful about the project to reignite intrinsic motivation.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Embrace Autonomy with Accountability
&lt;/h3&gt;

&lt;p&gt;Having the freedom to solve problems creatively, backed by supportive stakeholders and mutual trust, allowed us to find the best path forward rather than the most obvious one.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action tip:&lt;/strong&gt; Give team members space to determine their own best recovery strategies while maintaining clear accountability for outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation for Resilience
&lt;/h2&gt;

&lt;p&gt;Our experience highlighted that teams bounce back most effectively when they have:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomy to solve problems creatively
&lt;/li&gt;
&lt;li&gt;Strong, supportive stakeholders who trust the team
&lt;/li&gt;
&lt;li&gt;Psychological safety to voice concerns and ideas
&lt;/li&gt;
&lt;li&gt;A foundation of trust among all parties
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The unexpected pause in our project could have derailed our momentum permanently. Instead, by implementing these strategies, we not only recovered but ultimately delivered successfully.  &lt;/p&gt;

&lt;p&gt;When your team faces an unexpected interruption—whether it's two weeks or two months—remember that the path back to productivity is paved with intentional leadership, strengthened connections, and a renewed sense of purpose. The resilience you build through this process will serve your team well beyond the current project.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you experienced an unexpected project interruption? What strategies helped your team recover? I'd love to hear your stories in the comments below.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>scrum</category>
      <category>agile</category>
    </item>
    <item>
      <title>Automate the Testing of Your LLM Prompts</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 24 Aug 2025 23:06:14 +0000</pubDate>
      <link>https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038</link>
      <guid>https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;On a recent client engagement, we needed a mechanism to validate LLM responses for an application that used AI to summarize customer service call transcripts.&lt;/p&gt;

&lt;p&gt;The requirements were clear: each summary had to capture specific details (customer names, account numbers, actions taken, resolution details, etc.), and our validation process needed to be automated and repeatable. We needed to test our custom summarization prompts with the same rigor we apply to traditional software: pass/fail assertions, regression baselines, and systematic tracking.&lt;/p&gt;

&lt;p&gt;That's where &lt;a href="https://www.promptfoo.dev/" rel="noopener noreferrer"&gt;promptfoo&lt;/a&gt; came in.  Promptfoo let us codify these requirements into automated tests and iterate on prompt improvements with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Testing LLM Responses Is Different (And Why You Should Care)
&lt;/h2&gt;

&lt;p&gt;As software engineers and quality professionals, we're used to deterministic systems where the same input always produces the same output. LLM responses break that assumption: the same prompt can yield different valid answers, so traditional assertion patterns are often insufficient.&lt;/p&gt;

&lt;p&gt;Here's the challenge: How can you verify a prompt's response is contextually accurate when the response can vary with every request?&lt;/p&gt;

&lt;p&gt;The solution is to shift from testing exact outputs to testing output quality, accuracy, and safety. You need assertions that can evaluate whether a response contains required information, follows guidelines, and avoids harmful content, regardless of the exact wording.&lt;/p&gt;
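&lt;p&gt;Concretely, that shift means asserting on properties of the response rather than on its exact text. The assertion values below are illustrative, not taken from the client project:&lt;/p&gt;

```yaml
# check qualities of the output, not exact wording
assert:
  - type: contains          # a required detail must appear somewhere
    value: "account number"
  - type: llm-rubric        # an LLM grades against human-readable criteria
    value: Accurately captures the resolution without inventing details
  - type: not-contains      # a basic content check
    value: "as an AI language model"
```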

&lt;p&gt;&lt;strong&gt;Traditional testing falls short with LLM prompt responses because:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic responses&lt;/strong&gt;: Same input, different valid outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-dependent behavior&lt;/strong&gt;: Quality depends on conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety concerns&lt;/strong&gt;: Content filtering and moderation requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance variability&lt;/strong&gt;: Response times and costs fluctuate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been struggling with manual testing of AI features or relying on trial-and-error for prompt engineering, this guide will show you how promptfoo brings systematic testing to AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Promptfoo?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo&lt;/strong&gt; is an open-source testing framework specifically designed to enable test-driven development for LLM applications with structured, automated evaluation of prompts, models, and outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assertion-based validation&lt;/strong&gt; with pass/fail criteria familiar to QA engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side-by-side prompt comparison&lt;/strong&gt; for A/B testing different prompts and approaches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated regression testing&lt;/strong&gt; to catch quality degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration&lt;/strong&gt; for your existing pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model support&lt;/strong&gt; (OpenAI, Anthropic, Google, Azure, local models)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Promptfoo brings familiar testing methodologies to AI development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test-driven development&lt;/strong&gt; instead of trial and error and hoping for the best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression testing&lt;/strong&gt; to catch quality degradation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance monitoring&lt;/strong&gt; (latency, cost, accuracy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started: Hands-On Examples
&lt;/h2&gt;

&lt;p&gt;The best way to understand promptfoo is to see it in action. Let's start with installation and work through practical examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation &amp;amp; Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as a dev dependency in your project&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; promptfoo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration: YAML-Driven Testing
&lt;/h3&gt;

&lt;p&gt;Promptfoo uses YAML configuration files to define your tests. This approach will feel familiar if you've worked with other testing frameworks or CI/CD tools. The YAML file specifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: The actual prompts you want to test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Providers&lt;/strong&gt;: Which AI models to use (OpenAI, Anthropic, Azure, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt;: Input variables and assertions used to validate responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scenarios&lt;/strong&gt;: Different inputs and expected behaviors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This declarative approach makes it easy to version control your AI tests and collaborate with your team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: Simple Dataset Generation
&lt;/h3&gt;

&lt;p&gt;Let's start with a simple example. We want to test a prompt that generates a list of random numbers.  Of course, an LLM is not really the right tool for this job, but it works for demonstration purposes.  &lt;/p&gt;

&lt;p&gt;We're going to test this prompt against two different models: Claude and GPT-5-mini.  (FYI, you will need API keys for any paid model you reference.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# examples-for-blog/ten_numbers.yaml&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generating a random list of integers between a range&lt;/span&gt;

&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-only&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;responder.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OUTPUT&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EXACTLY&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;one&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;valid&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;array&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;NOTHING&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ELSE.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Example:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[10,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;20,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30].&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Generate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ordered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;list&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ten&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;random&lt;/span&gt;&lt;span class="nv"&gt; 
&lt;/span&gt;&lt;span class="s"&gt;integers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;between&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{start}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{end}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(inclusive).&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;numeric&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;quotes),&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sorted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ascending&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;do&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;commentary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fences."&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic:messages:claude-3-haiku-20240307&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai:chat:gpt-5-mini&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;is-json&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"type": "array",&lt;/span&gt;
            &lt;span class="s"&gt;"minItems": 10,&lt;/span&gt;
            &lt;span class="s"&gt;"maxItems": 10,&lt;/span&gt;
            &lt;span class="s"&gt;"items": {&lt;/span&gt;
              &lt;span class="s"&gt;"type": "integer",&lt;/span&gt;
              &lt;span class="s"&gt;"minimum": 10,&lt;/span&gt;
              &lt;span class="s"&gt;"maximum": 1000&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
          &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How this works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When promptfoo runs this test, it substitutes the variables (&lt;code&gt;start: 10&lt;/code&gt; and &lt;code&gt;end: 1000&lt;/code&gt;) into the prompt and sends it to both Claude and GPT-5-mini. Each model generates a response.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;is-json&lt;/code&gt; assertion is evaluated by promptfoo after it parses the model output as JSON. In other words, promptfoo performs the JSON parsing and schema validation (not the model). If the model returns something that isn't valid JSON or doesn't match the schema, the assertion will fail and promptfoo will report the parsing error or the schema mismatch.&lt;/p&gt;

&lt;p&gt;This example demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variable substitution&lt;/strong&gt; with &lt;code&gt;{{start}}&lt;/code&gt; and &lt;code&gt;{{end}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple model comparison&lt;/strong&gt; (Claude vs GPT-5-mini)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic validation&lt;/strong&gt; using &lt;code&gt;is-json&lt;/code&gt; so validation happens in promptfoo, not in the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running the test is easy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# run the test&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; examples-for-blog/ten_numbers.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see a side-by-side comparison showing how each model performed and whether they passed the validation criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# open the web report for the last run&lt;/span&gt;
npx promptfoo view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is our web view of the test results.  Note you can see variables, prompts, model responses, validation outcomes, and even performance and cost metrics, all in one place.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qoffb63yzn6ww5ip6lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qoffb63yzn6ww5ip6lq.png" alt="Web view of promptfoo results" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Example 2: Call Summary Validation (Real-World Use Case)
&lt;/h3&gt;

&lt;p&gt;So Example 1 was interesting, but let's look at how we can validate the output of a prompt by using an LLM to grade that output.&lt;/p&gt;

&lt;p&gt;Here's a more complex example based on the actual client engagement I described earlier, testing an AI system that summarizes customer service calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# examples-for-blog/customer-call-summary.yaml&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Call Summary Quality Testing&lt;/span&gt;

&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;Summarize this customer service call. Keep the summary succinct without unnecessary details. Pay special attention to include the agent's demeanor and indicate if they ever seemed unprofessional. Include:&lt;/span&gt;
    &lt;span class="s"&gt;- Customer name and account number&lt;/span&gt;
    &lt;span class="s"&gt;- Issue description&lt;/span&gt;
    &lt;span class="s"&gt;- Actions taken by agent&lt;/span&gt;
    &lt;span class="s"&gt;- Any order number that is mentioned&lt;/span&gt;
    &lt;span class="s"&gt;- Resolution status&lt;/span&gt;

    &lt;span class="s"&gt;Call transcript: {{transcript}}&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai:chat:gpt-5-mini&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;transcript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Good morning, thank you for calling customer service. This is Maria, how can I help you today?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Hi Maria, I'm calling about an order I placed last week that was supposed to be delivered two days ago, but it still hasn't arrived.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: I'm sorry to hear about the delay with your order. I'd be happy to help you track that down. Can I start by getting your first and last name please?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Yes, it's David Rodriguez.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Thank you Mr. Rodriguez. And can I also get your account number to verify your account?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Sure, it's account number 78942.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Perfect, thank you. Now, can you provide me with the order number for the package you're expecting?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Yes, the order number is ORD-2024-5583.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Great, and when did you place this order?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: I placed it last Tuesday, January 16th.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Thank you for that information. Let me pull up your order details here... Okay, I can see order ORD-2024-5583 placed on January 16th, and you're absolutely right - it was originally scheduled for delivery on January 22nd. I sincerely apologize for this delay, Mr. Rodriguez.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: So what happened? Why didn't it arrive when it was supposed to?&lt;/span&gt;
        &lt;span class="s"&gt;Agent: It looks like there was a sorting delay at our distribution center that affected several shipments in your area. Your package is currently in transit and I can see it's now scheduled to be delivered this Friday, January 26th, by end of day.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Friday? That's three days later than promised. This is really inconvenient.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: I completely understand your frustration, and I apologize again for the inconvenience this has caused. To make up for the delay, I'm going to issue a $15 credit to your account, and I'll also send you tracking information via email so you can monitor the package's progress.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Okay, well I appreciate that. Will I get a notification when it's actually delivered?&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Absolutely. You'll receive both an email and text notification once the package is delivered, and the tracking information will show real-time updates. Is there anything else I can help you with today?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: No, that covers it. Thank you for your help, Maria.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: You're very welcome to never ever call me again, Mr. Rodriguez. Again, I apologize for the delay, and thank you for your patience. Have a great day!&lt;/span&gt;
  &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;David&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Rodriguez"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;  
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-2024-5583"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;indicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;whether&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;seemed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;professional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;details,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;including&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;taken&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span 
class="s"&gt;resolution,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;compensation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;offered."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt embeds a long customer-service phone transcript that the model is asked to summarize succinctly while preserving key facts. To verify correctness we include a couple of deterministic assertions (exact-match checks) for the customer's name and the order number so those values must appear in the summary.&lt;/p&gt;

&lt;p&gt;We also include an &lt;code&gt;llm-rubric&lt;/code&gt; assertion: promptfoo will call an LLM to grade the generated summary against the supplied rubric text, allowing us to assert on higher-level quality attributes such as professionalism, completeness, and whether the agent's actions and compensation were described.&lt;/p&gt;
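&lt;p&gt;By default promptfoo picks a grading model for &lt;code&gt;llm-rubric&lt;/code&gt; automatically, but you can pin it in the config so grading stays consistent across runs. A minimal sketch (the provider name here is illustrative; use whichever model you prefer as the grader):&lt;/p&gt;

```yaml
# Pin the model promptfoo uses to grade llm-rubric assertions.
# Provider name below is an example, not a requirement.
defaultTest:
  options:
    provider: openai:chat:gpt-5-mini
```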

&lt;p&gt;Now I can run that test and see how we do!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# run the test&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; examples-for-blog/customer-call-summary.yaml
&lt;span class="c"&gt;# View results&lt;/span&gt;
npx promptfoo view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here are our results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdbbvq2en4mymu8saops.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdbbvq2en4mymu8saops.png" alt="Web view of call summary results" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the prompt specifically asks the summary to indicate the agent's demeanor, and we use the rubric to verify the output contains it. Since I never trust a test unless I can see it fail, I'm going to temporarily remove the mention of demeanor from the prompt, but leave the assert alone, so we should get a failure. Drumroll, please…&lt;/p&gt;

&lt;p&gt;And we do! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge2be458rkpig2yx2bri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge2be458rkpig2yx2bri.png" alt="Web view of call summary failure results" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the test caught the error with our prompt:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8vj6zvzubkg7mnr4pjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8vj6zvzubkg7mnr4pjp.png" alt="failure message" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I got a little long-winded with this post, but I hope someone out there finds it useful. Promptfoo represents a paradigm shift from manual AI testing to systematic, automated evaluation. By bringing familiar testing methodologies to AI development, it enables teams to build reliable, secure, and high-quality AI applications.&lt;/p&gt;

&lt;p&gt;I'll be back soon with some more promptfoo content, and you should certainly check out the documentation at &lt;a href="https://www.promptfoo.dev/" rel="noopener noreferrer"&gt;promptfoo.dev&lt;/a&gt;, which has excellent resources for getting started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>testing</category>
      <category>testdev</category>
    </item>
    <item>
      <title>Automating Browser-Based Performance Testing</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 17 Aug 2025 15:01:57 +0000</pubDate>
      <link>https://dev.to/leading-edje/automating-browser-based-performance-testing-1n6</link>
      <guid>https://dev.to/leading-edje/automating-browser-based-performance-testing-1n6</guid>
      <description>&lt;p&gt;Website performance directly affects what users feel and what your business earns.&lt;/p&gt;

&lt;p&gt;One way of identifying performance issues is via API-based load testing tools such as &lt;a href="https://k6.io" rel="noopener noreferrer"&gt;k6&lt;/a&gt;. API load tests tell you whether your services scale and how quickly they respond under load, but they don’t measure the full user experience.&lt;/p&gt;

&lt;p&gt;If you focus &lt;em&gt;only&lt;/em&gt; on load testing your backend, you might still ship a &lt;strong&gt;slow&lt;/strong&gt; or &lt;strong&gt;jittery&lt;/strong&gt; site because of render‑blocking CSS/JavaScript, heavy images/fonts, main‑thread work, layout shifts, and other front-end issues.&lt;br&gt;&lt;br&gt;
Ultimately, users don't care where a performance issue resides; they just know your site is "slow".&lt;/p&gt;

&lt;p&gt;This slow performance can cost you customers, revenue, search visibility, and trust.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Lighthouse?
&lt;/h2&gt;

&lt;p&gt;Lighthouse is an automated auditor built by Google and is part of the Chrome DevTools experience. While this post focuses on performance, Lighthouse also audits and provides actionable recommendations for accessibility, best practices, and SEO.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Lighthouse works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Launches Chrome and navigates to your page using the Chrome DevTools Protocol.&lt;/li&gt;
&lt;li&gt;Emulates device, network, and CPU to keep runs comparable.&lt;/li&gt;
&lt;li&gt;Records a performance trace and analyzes it against a set of audits.&lt;/li&gt;
&lt;li&gt;Outputs scores and detailed metrics with fix ideas.&lt;/li&gt;
&lt;li&gt;Can be included in your CI pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Core Web Vitals: what they mean and why they matter
&lt;/h2&gt;

&lt;p&gt;These user‑focused metrics map to how fast content shows up, how responsive the page feels, and how stable it looks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Core Web Vitals at a glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Plain meaning&lt;/th&gt;
&lt;th&gt;Good target&lt;/th&gt;
&lt;th&gt;What you’ll see in Lighthouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LCP (Largest Contentful Paint)&lt;/td&gt;
&lt;td&gt;Time to show the largest thing in the initial viewport (often the primary image or a big text block).&lt;/td&gt;
&lt;td&gt;≤ 2.5 s&lt;/td&gt;
&lt;td&gt;LCP value in the Metrics section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FID (First Input Delay; superseded in 2024 by INP, Interaction to Next Paint)&lt;/td&gt;
&lt;td&gt;Delay from a user’s first tap/click to when the page can start handling it. In Lighthouse runs, use Total Blocking Time (TBT) as the responsiveness indicator.&lt;/td&gt;
&lt;td&gt;FID ≤ 100 ms; aim for low TBT&lt;/td&gt;
&lt;td&gt;TBT value in the Metrics section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLS (Cumulative Layout Shift)&lt;/td&gt;
&lt;td&gt;How much content unexpectedly moves while the page loads (visual stability).&lt;/td&gt;
&lt;td&gt;≤ 0.1&lt;/td&gt;
&lt;td&gt;CLS score in Metrics/Diagnostics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
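&lt;p&gt;The "good target" column above maps to Google's published thresholds, and it's handy to encode them when post-processing results in scripts. A small sketch (the function name and return shape are my own, not part of Lighthouse):&lt;/p&gt;

```javascript
// Classify Core Web Vitals against the "good" thresholds in the table above.
// Units: lcp and fid in milliseconds, cls is unitless.
function classifyWebVitals({ lcp, fid, cls }) {
  return {
    lcp: lcp <= 2500 ? 'good' : 'needs work',
    fid: fid <= 100 ? 'good' : 'needs work',
    cls: cls <= 0.1 ? 'good' : 'needs work',
  };
}

// Example: a page with a slow LCP but otherwise healthy metrics
const verdict = classifyWebVitals({ lcp: 3200, fid: 40, cls: 0.05 });
console.log(verdict); // { lcp: 'needs work', fid: 'good', cls: 'good' }
```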
&lt;h2&gt;
  
  
  Sample Lighthouse Report
&lt;/h2&gt;

&lt;p&gt;Regardless of how you run Lighthouse, you get a detailed report with scores, metrics, and prioritized suggestions.&lt;/p&gt;

&lt;p&gt;Overall scores:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo60wh9t3a441la40txv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo60wh9t3a441la40txv.png" alt="Scores" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What went wrong?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyde8xgzyh19oso4pxa0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyde8xgzyh19oso4pxa0w.png" alt="Diagnostics" width="676" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What looks good?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pazpjoce7wzrmk8sobd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pazpjoce7wzrmk8sobd.png" alt="Passed audits" width="628" height="730"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Running Lighthouse
&lt;/h2&gt;

&lt;p&gt;Lighthouse can be run in a number of ways, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chrome DevTools (UI)&lt;/li&gt;
&lt;li&gt;Command line (CLI)&lt;/li&gt;
&lt;li&gt;Node module (programmatic)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Run Lighthouse from Chrome DevTools
&lt;/h3&gt;

&lt;p&gt;Open your site in Chrome → Right‑click Inspect → Lighthouse tab → Set your analysis options → Analyze. This generates a full HTML report inside DevTools.&lt;/p&gt;
&lt;h3&gt;
  
  
  Run Lighthouse from the command line
&lt;/h3&gt;

&lt;p&gt;Install Lighthouse (requires Node.js):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; lighthouse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic mobile audit and open the HTML report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/lighthouse.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export JSON for automation or tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/lighthouse.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chrome-flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"--headless"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
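&lt;p&gt;Once you have the JSON report, pulling out the headline numbers is straightforward. A sketch of the idea, assuming the report shape I've seen from Lighthouse (category scores are 0–1 under &lt;code&gt;categories&lt;/code&gt;, per-metric values under &lt;code&gt;audits&lt;/code&gt;):&lt;/p&gt;

```javascript
// Extract the headline numbers from a parsed Lighthouse JSON report.
function extractMetrics(report) {
  return {
    performanceScore: Math.round(report.categories.performance.score * 100),
    lcpMs: report.audits['largest-contentful-paint'].numericValue,
    cls: report.audits['cumulative-layout-shift'].numericValue,
  };
}

// With a real report file you would do something like:
// const report = JSON.parse(require('fs').readFileSync('./reports/lighthouse.json', 'utf8'));
// console.log(extractMetrics(report));

// Abbreviated sample report, just to show the shape:
const sample = {
  categories: { performance: { score: 0.92 } },
  audits: {
    'largest-contentful-paint': { numericValue: 2100 },
    'cumulative-layout-shift': { numericValue: 0.04 },
  },
};
console.log(extractMetrics(sample)); // { performanceScore: 92, lcpMs: 2100, cls: 0.04 }
```

&lt;p&gt;From there it's easy to fail a CI job whenever a score drops below a budget you've set.&lt;/p&gt;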



&lt;p&gt;Desktop profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="nt"&gt;--preset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;desktop &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/desktop.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use throttling to simulate slower networks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling-method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;simulate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.rttMs&lt;span class="o"&gt;=&lt;/span&gt;150 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.throughputKbps&lt;span class="o"&gt;=&lt;/span&gt;1638.4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.cpuSlowdownMultiplier&lt;span class="o"&gt;=&lt;/span&gt;4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/consistent.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Focus audits on key performance metrics with a config (&lt;code&gt;lighthouse-config.js&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;extends&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lighthouse:default&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;onlyAudits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;first-contentful-paint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;largest-contentful-paint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cumulative-layout-shift&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;total-blocking-time&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;throttlingMethod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simulate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;throttling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;rttMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;throughputKbps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1638.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cpuSlowdownMultiplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run with the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="nt"&gt;--config-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./lighthouse-config.js &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/focused.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Programmatic usage (Node)
&lt;/h3&gt;

&lt;p&gt;Why use this? Programmatic runs let you script real user interactions and measure performance along a flow (navigations, clicks, route changes). With Puppeteer + Lighthouse User Flows you can drive the browser, capture metrics per step, and generate a single report—perfect for CI, regression checks, and measuring critical journeys like signup or checkout.&lt;/p&gt;

&lt;p&gt;Note: Lighthouse currently only supports Puppeteer for programmatic user flows.&lt;/p&gt;

&lt;p&gt;Install packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i lighthouse puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save as &lt;code&gt;user-flow.mjs&lt;/code&gt; and run with &lt;code&gt;node user-flow.mjs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;startFlow&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lighthouse&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;startFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Navigate to Demoblaze&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;navigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.demoblaze.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Interaction-initiated navigation via a callback function&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;navigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[href="index.html"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Start/End a navigation around a user action&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startNavigation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a#cartur&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// open Cart&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endNavigation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./reports&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./reports/lh-flow-report.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saved ./reports/lh-flow-report.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrap‑up
&lt;/h2&gt;

&lt;p&gt;Start by running Lighthouse in DevTools (fast feedback) or the CLI (repeatable results). Focus on three things: LCP (how fast the main content shows), TBT (how responsive it feels), and CLS (how stable it looks).&lt;/p&gt;

&lt;p&gt;What’s next in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerize Lighthouse runs with Docker for consistent local and CI environments&lt;/li&gt;
&lt;li&gt;Add Lighthouse checks to a GitHub Actions workflow with performance budgets and PR comments&lt;/li&gt;
&lt;li&gt;Export key metrics to Prometheus for time‑series storage&lt;/li&gt;
&lt;li&gt;Visualize trends and budgets in a Grafana dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>performance</category>
      <category>testing</category>
      <category>testdev</category>
    </item>
    <item>
      <title>Coding Agents are here: Is your team ready for AI devs?</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 05 Aug 2025 18:47:15 +0000</pubDate>
      <link>https://dev.to/leading-edje/coding-agents-are-here-is-your-team-ready-for-ai-devs-3dk2</link>
      <guid>https://dev.to/leading-edje/coding-agents-are-here-is-your-team-ready-for-ai-devs-3dk2</guid>
<description>&lt;p&gt;In this post we'll explore the concept of AI agents as software engineers on your development team. The idea that you could write up an enhancement or bug fix, assign it to an AI team member, and review the result a short while later would have sounded fantastical only a few years ago. Yet with &lt;a href="https://blog.leadingedje.com/post/microsoft-build-2025-wrapped.html#github-copilot-coding-agent" rel="noopener noreferrer"&gt;the announcement and preview of GitHub Copilot Agents&lt;/a&gt;, this is real technology you can use today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing GitHub Copilot Coding Agents
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent" rel="noopener noreferrer"&gt;GitHub Copilot Coding Agent&lt;/a&gt; is a new technology, currently in preview, associated with GitHub pro and enterprise accounts. With Coding Agents you can assign individual issues in a GitHub repository to GitHub Copilot, just like you were assigning it to a team member.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sa0bv2xuyfbzrhixrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sa0bv2xuyfbzrhixrt.png" alt="Assigning an issue to GitHub Copilot" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copilot will then create a branch for your issue, just like a developer would, and begin to plan its approach to carrying out the work item.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65ny8xz66xjfa93j73ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65ny8xz66xjfa93j73ho.png" alt="The newly created branch" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quota Usage Note:&lt;/strong&gt; GitHub Copilot Coding Agent consumes part of your account's monthly allocation of &lt;strong&gt;premium requests&lt;/strong&gt;. Check out &lt;a href="https://docs.github.com/en/copilot/concepts/copilot-billing/understanding-and-managing-requests-in-copilot" rel="noopener noreferrer"&gt;GitHub's documentation on current quota and billing information&lt;/a&gt; for this evolving product.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Coding agents at work
&lt;/h3&gt;

&lt;p&gt;After analyzing the issue and your repository, Copilot forms a plan of action to accomplish the work you've assigned. As it works, it updates a progress comment on the branch to reflect completed tasks and remaining work. This helps you monitor its progress and acts as a reference that grounds the AI agent as it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bs54qo7ica9fxs6ooy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bs54qo7ica9fxs6ooy6.png" alt="Copilot tracking and communicating its progress on a branch" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copilot will analyze your code as it works and can also make use of additional resources such as Model Context Protocol (MCP) servers you have &lt;a href="https://docs.github.com/en/copilot/tutorials/enhancing-copilot-agent-mode-with-mcp" rel="noopener noreferrer"&gt;configured on GitHub&lt;/a&gt; and &lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent/best-practices-for-using-copilot-to-work-on-tasks#adding-custom-instructions-to-your-repository" rel="noopener noreferrer"&gt;optional additional documentation you provide for Copilot on the structure of your repository&lt;/a&gt;.&lt;/p&gt;
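&lt;p&gt;The repository documentation mentioned above typically lives in a &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; file. A minimal sketch of what such a file might contain (the layout and commands shown are hypothetical examples, not from a real repository):&lt;/p&gt;

```markdown
# Copilot instructions

## Repository layout
- `src/`: application code (hypothetical example layout)
- `tests/`: unit and integration tests

## Conventions
- Run `npm test` before opening a pull request.
- Follow the existing lint rules; do not reformat unrelated files.
```

&lt;p&gt;A few lines of structural context like this can meaningfully reduce the time the agent spends orienting itself.&lt;/p&gt;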

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Interested in MCP Servers or AI Architectures?&lt;/strong&gt; Check out Leading EDJE's reference architecture articles on &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/codingassistant.html" rel="noopener noreferrer"&gt;augmenting development teams with MCP servers&lt;/a&gt; and &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/webchatassistant.html" rel="noopener noreferrer"&gt;team-productivity solutions with MCP servers&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've also noticed Copilot using command-line tools to find relevant strings in files in your repository, which helps it orient itself. Copilot is also capable of executing commands to build and test applications, and can even resolve build issues, such as missing dependencies, that it encounters along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and GitHub Copilot
&lt;/h3&gt;

&lt;p&gt;While all of these capabilities are interesting, they also raise important security questions. By default, Copilot has &lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent/customizing-or-disabling-the-firewall-for-copilot-coding-agent" rel="noopener noreferrer"&gt;firewall rules in place&lt;/a&gt; that prevent it from working outside of GitHub's sandboxed ecosystem. Additionally, any violations of these policies will be logged for your review later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7wjrcyzw00olyvtn0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7wjrcyzw00olyvtn0k.png" alt="GitHub highlighting firewall rules that triggered during agent execution" width="672" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that these features are currently only available if your code is on GitHub and you have a paid plan that supports it, so you're also going to benefit from all of GitHub's standard and enterprise security features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Completing a pull request
&lt;/h3&gt;

&lt;p&gt;When Copilot believes the work is complete, it notifies the person who assigned the task, who can then review the pull request. The reviewer can either approve the pull request or request changes. If changes are requested, Copilot will respond to the comments and notify you when the work item is ready for review again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tk9sf41qrjf2jew0sav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tk9sf41qrjf2jew0sav.png" alt="Copilot summarizing its finished results" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you're satisfied with Copilot's work, you can mark the pull request as ready for review. This triggers the rest of your workflow, such as additional tests or having the standard GitHub Copilot system review the pull request, summarize it, and make suggestions. When everything checks out, you can approve and merge the pull request; the change becomes part of your codebase and GitHub closes the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can AI agents really serve as team members?
&lt;/h2&gt;

&lt;p&gt;So, how good is Copilot? Is this going to replace members of your team?&lt;/p&gt;

&lt;p&gt;Well, probably not, but it might change how you work or how you hire.&lt;/p&gt;

&lt;h3&gt;
  
  
  How coding agents change how I write code
&lt;/h3&gt;

&lt;p&gt;I'm early in my journey working with Copilot, but I'm already impressed. As an experienced developer I can write out what I'm trying to accomplish in technical terms, assign it to Copilot, and see it produce a result that's close to what I envisioned. This does require me to think about how I would solve the problem, the type of solution I'd like to see, and any areas of concern I have with potential implementations. Interestingly, this is exactly what I'd normally communicate to a technical team member through a direct message or a comment on a work item.&lt;/p&gt;

&lt;p&gt;Because AI agents work quickly, I've found myself able to jot down thoughts on relatively simple changes, send them over to Copilot, and come back to the topic later when I have more focus. I can easily see senior engineers writing up a request, sending it to Copilot, attending a meeting, then reviewing and improving Copilot's result once they're free.&lt;/p&gt;

&lt;p&gt;I've also started development sessions by reviewing what Copilot sent me between work sessions, which can be a great way of getting into a flow. It also takes care of the more tedious aspects of software engineering while letting me focus on strategic direction and the specific concerns I care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  How coding agents might impact hiring
&lt;/h3&gt;

&lt;p&gt;If you look at the prior section, you'll see that a lot of what I'm doing is more senior or supervisory in nature rather than directly authoring code. While I'm still writing code and enjoying it, I'm finding that the code I write is less boilerplate or trivial and more specialized and strategic.&lt;/p&gt;

&lt;p&gt;This is good, but not every developer can do it. You need a certain level of experience to guide other developers and evaluate their code effectively, and the same holds true when guiding AI.&lt;/p&gt;

&lt;p&gt;Because of this, I view AI as filling similar roles to more junior team members: executing on well-defined and standardized tasks that can be easily communicated.&lt;/p&gt;

&lt;p&gt;While AI doesn't &lt;em&gt;replace&lt;/em&gt; the more junior team members in your organization, it does rival them in some ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI agents are becoming increasingly able to get unstuck on their own&lt;/li&gt;
&lt;li&gt;Copilot has the breadth of its training data available, so it likely knows libraries junior devs haven't encountered yet&lt;/li&gt;
&lt;li&gt;Copilot works fast and can produce a great deal of code in a short time, outpacing even senior developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, junior team members have a lot of value as well, and offer things that AI can't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deeper domain knowledge of your organization, its products, its data, and its business context&lt;/li&gt;
&lt;li&gt;The ability to effectively consider the end user and the context in which code operates&lt;/li&gt;
&lt;li&gt;Human-level problem-solving, common sense, and decision-making&lt;/li&gt;
&lt;li&gt;The ability to debug problems and scenarios tied to specific data situations&lt;/li&gt;
&lt;li&gt;The tendency to grow into senior engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good junior developer is going to provide far more value to an organization than a solid AI agent is able to, but I can see the temptation for organizations to rely on AI agents instead of junior developers, and this scares me for our industry and the many talented people already struggling to get a foot in the door.&lt;/p&gt;

&lt;p&gt;Ultimately, if your organization has busy senior engineers and code already on GitHub, Coding Agents are something you should try to see how they impact your workflow and productivity. Just be cautious: although Coding Agent is usually cheaper than a junior engineer's salary, it's not a replacement for talented, growing, flexible, and human engineers on your team.&lt;/p&gt;

&lt;p&gt;I think at this point, it's best to talk about where AI agents fall short in more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI agents fall short
&lt;/h2&gt;

&lt;p&gt;In conversations about AI and particularly about AI productivity there's a central truth that is often overlooked: &lt;strong&gt;Most of software engineering isn't about writing code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I've been writing code for the vast majority of my life, and over two decades professionally. While a lot of my job is around writing code, that code only comes after:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understanding the business need or current inadequate behavior&lt;/li&gt;
&lt;li&gt;Determining what an ideal solution should do&lt;/li&gt;
&lt;li&gt;Identifying several different ways of achieving this goal&lt;/li&gt;
&lt;li&gt;Selecting a leading candidate for implementation - often with collaboration from others who understand other areas and other needs&lt;/li&gt;
&lt;li&gt;Identifying places in code that will need to be adjusted to support the change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Making code changes to support the new behavior&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Ensuring the code works&lt;/li&gt;
&lt;li&gt;Thinking of ways the code might break and edge cases we might not have considered, then making sure the code handles those as well&lt;/li&gt;
&lt;li&gt;Ensuring the code is as secure, testable, and performant as we expected it to be when we selected our candidate solution&lt;/li&gt;
&lt;li&gt;Communicating the change in documentation and to others.&lt;/li&gt;
&lt;li&gt;Ensuring the change flows through the processes for feedback, testing, and deployment to various environments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, code changes are only a small portion of software engineering, yet we pay an inordinate amount of attention to them when we think about AI productivity solutions and even when we consider using offshore development resources.&lt;/p&gt;

&lt;p&gt;While I believe AI systems can already perform some of these steps to one degree or another, we tend to evaluate their effectiveness mostly on authoring new content, which is a strength of AI systems. Humans, however, bring strong skills across all of these areas. Knowing what to change, understanding the implications of different approaches, and seeing how a change fits into the existing data and application architecture are critically important pieces of software engineering.&lt;/p&gt;

&lt;p&gt;Also keep in mind that in modern software engineering a change is often needed across multiple services and databases. While an AI agent might handle the change in one place, it may be less equipped to identify every service that needs to change and make the requisite changes across all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI and Human Partnership
&lt;/h3&gt;

&lt;p&gt;Because of the complexity of software engineering and the relative strengths and weaknesses of AI and humans, I think that AI agents and AI tooling are best deployed for targeted tasks that have been thought through by an experienced engineer.&lt;/p&gt;

&lt;p&gt;An ideal flow might be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Engineers vet an organization's needs and determine a series of technical changes needed to support the new goals&lt;/li&gt;
&lt;li&gt;These individual changes are written up as work items and either assigned to other engineers or assigned to AI agents&lt;/li&gt;
&lt;li&gt;AI agents or engineers work on the change and send a draft pull request on for review&lt;/li&gt;
&lt;li&gt;The change is manually tested and verified by another engineer who uses it as a starting point for the final pull request&lt;/li&gt;
&lt;li&gt;The developer makes additional improvements, changes, and tests to support the pull request and ensure it fully meets the organization's needs&lt;/li&gt;
&lt;li&gt;The PR is marked ready for review&lt;/li&gt;
&lt;li&gt;Other developers review the PR, familiarize themselves with the changes, and leave comments&lt;/li&gt;
&lt;li&gt;The change eventually merges into the main branch and reaches production, where it will be supported by a team that understands the changes and designed the approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where software engineering may be headed with AI Copilots
&lt;/h2&gt;

&lt;p&gt;AI agents like GitHub Copilot are powerful and destined to change how organizations and engineers hire and work.&lt;/p&gt;

&lt;p&gt;I believe software engineers can focus on the big picture and stay oriented around the technical changes happening in their systems, while using AI to do the majority of the work on well-defined tasks they define, then customizing the final behavior of those changes.&lt;/p&gt;

&lt;p&gt;Not every change will benefit from AI, and some more sensitive pieces of work reveal new things to think about with each line of code that needs to be modified or added. Still, a strategic deployment of AI can help busy engineers remain productive between meetings and make the most of a packed schedule.&lt;/p&gt;

&lt;p&gt;I also hope that the emergence of AI as skilled solution implementers will help focus experienced and new software engineers on the core competencies that are uniquely theirs: domain knowledge, communication skills, past experience, empathy for users and business stakeholders, and the ability to evaluate a number of possible plans and implementations and select the one that is right for the business today and where it's going tomorrow.&lt;/p&gt;

&lt;p&gt;AI is advancing at a tremendous rate and being an efficient, experienced, well-rounded, and adaptable engineer is more important than ever, but I'm glad to have copilots along for the ride as we build new things together.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Aspire Roadmap 2025: Code-first DevOps, polyglot, and AI</title>
      <dc:creator>Victor Frye</dc:creator>
      <pubDate>Fri, 01 Aug 2025 12:43:33 +0000</pubDate>
      <link>https://dev.to/leading-edje/aspire-roadmap-2025-code-first-devops-polyglot-and-ai-4h8g</link>
      <guid>https://dev.to/leading-edje/aspire-roadmap-2025-code-first-devops-polyglot-and-ai-4h8g</guid>
      <description>&lt;p&gt;The Aspire team has recently published their &lt;a href="https://github.com/dotnet/aspire/discussions/10644" rel="noopener noreferrer"&gt;2025 roadmap&lt;/a&gt;, revealing an exciting evolution from local development orchestration to a comprehensive framework for DevOps concerns. &lt;a href="https://victorfrye.com/blog/posts/hello-aspire-breaking-down-key-features" rel="noopener noreferrer"&gt;Aspire&lt;/a&gt; launched with a code-first application model and instantaneous run experience, then expanded into deploy scenarios with publishers. This roadmap shows how it's becoming a complete code-first alternative to YAML-heavy DevOps toolchains while embracing polyglot development and AI workload orchestration.&lt;/p&gt;

&lt;p&gt;While these are aspirational goals rather than firm commitments, they provide valuable insight into Aspire's direction. Let's explore the most compelling features and why they position Aspire as a game-changing DevOps framework for .NET, polyglot, and AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code-first DevOps
&lt;/h2&gt;

&lt;p&gt;DevOps combines development (Dev) and operations (Ops) to deliver software faster and with higher quality. While DevOps is fundamentally about people and processes, the technology and tooling often involve tedious YAML configuration files for CI/CD pipelines and infrastructure management. Aspire is changing this by providing a code-first approach to local development, testing, and deployment, replacing configuration complexity with familiar programming languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local development
&lt;/h3&gt;

&lt;p&gt;Aspire already excels at code-first application modeling: you express your entire architecture in C#—databases, services, .NET projects, and polyglot components—then spin it all up locally with &lt;code&gt;aspire run&lt;/code&gt;. No YAML configuration files, just standard .NET code that ideally mirrors your production architecture. The roadmap expands this with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improved container support&lt;/strong&gt;: Shell commands, scripts, and interactive debugging inside containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-repo support&lt;/strong&gt;: Native orchestration across multiple repositories
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in runtime acquisition&lt;/strong&gt;: Automatic installation of Node.js, .NET, and other required runtimes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aspire local development is already a mature feature set. These improvements focus on further simplifying the developer experience and tackling complex orchestration scenarios. Multi-repo support has been a long-standing pain point, as many teams keep components like a frontend and backend in separate repositories; removing the monorepo requirement and the need for custom cross-repo orchestration makes Aspire accessible to many more teams. You can already run polyglot applications in containers with Aspire, but continued improvements will allow more robust debugging and faster feedback loops with local containerized applications.&lt;/p&gt;

&lt;p&gt;The built-in runtime acquisition is both the most exciting and most daunting feature here. It may simplify the first-run experience, which helps with onboarding and CI/CD pipelines, an area of Aspire I adore. Depending on its implementation, however, it could also add local machine complexity by mixing Aspire-managed runtimes with system-wide runtimes. The local development experience is already fantastic and delivers the code-first developer experience Aspire promises, so I am optimistic these improvements will build on that foundation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;Aspire's code-first model and instant run experience create ideal conditions for integration and end-to-end testing. You can spin up your entire application stack locally, creating an instant integration test environment with minimal friction. The &lt;code&gt;Aspire.Hosting.Testing&lt;/code&gt; package provides this test host for xUnit and other testing frameworks and allows you to benefit from Aspire features like intelligent resource state notifications that eliminate arbitrary sleep times in tests. The roadmap adds advanced testing capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial app host execution&lt;/strong&gt;: Run only specific components in tests to reduce overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request redirection and mocking&lt;/strong&gt;: Control traffic between components for chaos engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code coverage support&lt;/strong&gt;: Coverage collection and reporting for integration tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where local development is the mature foundation of Aspire, testing is currently a secondary benefit that often surprises users by revealing the true value of the framework. These improvements take the Aspire testing story to the next level: Aspire goes from being the startup tooling that manages your integration testing components to a chaos engineering and middleware validation powerhouse. Partial app host execution isn't limited to testing; it also reduces overhead in local development when certain components aren't needed. In tests, it lets each test start only what it needs, so API integrations can be isolated without spinning up the frontend, or narrowed further to just the individual microservices that matter. Coupled with request redirection and mocking, you could create test scenarios that simulate real-world failures between integrations and validate chaos behavior. Imagine chaos testing your application before you even deploy it from your machine, with the same ease as unit testing. The code coverage support is the bonus reward: coverage metrics for your integration and chaos tests, something usually limited to unit tests? Yes, please! The roadmap suggests the current Aspire testing story is only in its infancy, and if these improvements materialize as envisioned, they will make testing a reason to adopt Aspire on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Deployment bridges development and customer value delivery. Aspire's local orchestration model naturally extends to cloud deployment scenarios. Aspire has been expanding to include deployment targets and publishers, simplifying the process of getting your application into production.&lt;/p&gt;

&lt;p&gt;Currently, Aspire publishes artifacts like Bicep, Docker Compose, and Kubernetes manifests. You can deploy any Aspire resource the same way you would without Aspire, but with it you get seamless delivery to deployment targets like Azure Container Apps. While deployment targets are limited and opinionated, the roadmap addresses key enterprise needs that are still missing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Additional deployment targets&lt;/strong&gt;: Support for Azure App Service, Azure Functions, and improved Docker/Kubernetes workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment support&lt;/strong&gt;: Define dev/stage/prod environments with specific configurations and secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipeline generation&lt;/strong&gt;: Auto-generate GitHub Actions, Azure DevOps, and GitLab pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployment is an emerging focus in the Aspire story. Azure Container Apps is the first deployment target and a flexible hosting platform, but it isn't flexible enough for every enterprise scenario, even within corporate environments invested in Azure. As expected, the roadmap promises more common Azure deployment targets for traditional .NET workloads like Azure App Service and Azure Functions, but it is still lacking &lt;a href="https://victorfrye.com/blog/posts/reviewing-aspirejs#deployment-targets" rel="noopener noreferrer"&gt;polyglot deployment targets like Azure Static Web Apps&lt;/a&gt;. Environment support is critical for enterprise adoption, as the majority of enterprises host multiple environments; DevOps practices may push us toward consistency between environments, but there are always differences in configuration and secrets that isolate them. CI/CD pipeline generation, combined with environment support, delivers on the idea of code-first DevOps: define your environments and application model in code, then generate the pipelines needed to deploy it from that model. The overall deployment story is still evolving, and the question that will persist is whether Aspire can provide enough flexibility to meet the diverse needs of enterprises' existing applications. These features are a step in that direction. I hope the Aspire team delivers, and we see Aspire become a code-first framework for continuous delivery and deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Polyglot aspirations
&lt;/h2&gt;

&lt;p&gt;Aspire is not just a .NET framework; it is a polyglot orchestration framework that lets you model and run composite applications across languages. .NET, JavaScript, Python, and more are all supported, but the only first-class experience is for .NET projects. With the app host authored in C#, the service defaults project providing .NET best practices, and NuGet client integrations simplifying configuration in your application code, Aspire is an amazing .NET developer experience. You can &lt;a href="https://victorfrye.com/blog/posts/reviewing-aspirejs" rel="noopener noreferrer"&gt;host JavaScript&lt;/a&gt; and Python applications, but you don't get the same level of integration and tooling. The roadmap reveals the Aspire team's ambition to provide a first-class polyglot experience beyond .NET:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniform client integrations&lt;/strong&gt;: Connection strings, configuration, and telemetry work consistently with new language support via npm (JavaScript) and pip (Python) packages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templates and samples&lt;/strong&gt;: Quickstarts and documentation for C#, JavaScript, and Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language app host&lt;/strong&gt;: Experimental WebAssembly support for multiple runtimes in a single process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aspire's polyglot aspirations focus on JavaScript and Python support first. Uniform client integrations, with npm packages for JavaScript and equivalents for other languages, will get us closer to parity with the .NET experience. Improved documentation and more polyglot samples will also help; today, figuring out how to use Aspire often means translating between C# and your language yourself. It's technically a hosting integration, but if Aspire brings the &lt;code&gt;Aspire.Hosting.Testing&lt;/code&gt; experience to JavaScript, I would be ecstatic. Documentation and packages together could elevate the polyglot experience and make Aspire stand out beyond traditional .NET developers. It may invite more developers to experiment with the .NET platform beyond Aspire as well.&lt;/p&gt;
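&lt;p&gt;To make the parity gap concrete: today, when an Aspire app host references a Node.js service, connection information arrives as plain environment variables using the .NET-style &lt;code&gt;ConnectionStrings__&lt;/code&gt; key naming, and you wire it up yourself. A minimal sketch of that manual pattern, the kind of boilerplate a uniform npm client integration could absorb (the &lt;code&gt;cache&lt;/code&gt; resource name below is a hypothetical example):&lt;/p&gt;

```javascript
// Manual sketch of reading Aspire-provided configuration in a Node.js
// service. For a referenced resource named "cache", Aspire passes the
// connection info as the env var ConnectionStrings__cache (the double
// underscore mirrors .NET's configuration key convention). The resource
// name is a hypothetical example.
function getConnectionString(name, env = process.env) {
  return env[`ConnectionStrings__${name}`] ?? null;
}

// Simulated environment, as the Aspire app host would populate it:
const env = { ConnectionStrings__cache: 'localhost:6379' };

console.log(getConnectionString('cache', env)); // → localhost:6379
console.log(getConnectionString('db', env));    // → null
```

&lt;p&gt;A first-class npm integration could replace this hand-rolled lookup with configured clients, telemetry, and health checks out of the box, which is exactly the parity the roadmap gestures at.&lt;/p&gt;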

&lt;p&gt;The cross-language app host is a fascinating item and the one I find hardest to envision myself. Will this be a way to write the app host without .NET? Will it wrap all the runtimes in a single process on your computer? What will it actually look like? The roadmap tells us it is experimental, so it may never materialize, or it may be something we start to see soon. I will be watching this closely as it starts to take shape and the value becomes clearer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Artificial intelligence
&lt;/h2&gt;

&lt;p&gt;While AI dominates software conversations, Aspire has focused on fundamental developer experience improvements rather than AI-first features. As AI applications go mainstream, Aspire is positioned to apply its orchestration strengths to AI workloads. The roadmap outlines several AI-specific features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token usage visualization&lt;/strong&gt;: Real-time token counts, latency, and evaluation metadata in the dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-specific metrics&lt;/strong&gt;: Native support for generative AI telemetry, including model name, temperature, and function call traces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure AI Foundry&lt;/strong&gt;: Integration for building agent-based applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aspire MCP server&lt;/strong&gt;: Optional runtime endpoint exposing the Aspire model as an MCP server for AI agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building AI applications is itself a nascent discipline. The Aspire team appears to be taking a measured approach to AI integration rather than branding Aspire as yet another set of AI-native tools. These AI features focus on two key areas: observability and agents. Observability is an area Aspire already excels at with its dashboard, and token usage and LLM-specific metric visualizations will be a wonderful addition to the existing telemetry and observability features. It stays true to Aspire's natural value while extending it to the needs of local AI development.&lt;/p&gt;

&lt;p&gt;On the agentic side, Aspire works today but with plenty of limitations. Existing AI integrations, like Azure OpenAI and Ollama, provide some options for local and cloud-hosted LLMs. The integration with Azure AI Foundry may extend the catalog and options for LLMs. It will be exceptionally interesting if the integration supports Azure AI Foundry Local to provide a unified catalog of models both locally and in the cloud. The Aspire MCP server likewise adds agentic capabilities to Aspire. Model Context Protocol (MCP) is becoming an industry standard for how AI agents communicate with, understand, and interact with outside systems. An Aspire MCP server could provide development tools like GitHub Copilot with deep context on your application model and all the resources Aspire manages. I am all for more intelligent development workflows. Like so many other technologies, Aspire is targeting AI trends and trying to provide its own value in the space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aspire tooling
&lt;/h2&gt;

&lt;p&gt;As Aspire evolves into a mature framework, its tooling ecosystem continues expanding beyond the core .NET SDK. The roadmap includes several improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aspire CLI&lt;/strong&gt;: Continued improvements and unified commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WinGet and Homebrew installers&lt;/strong&gt;: Standard install support for Windows and macOS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VS Code extension&lt;/strong&gt;: Run, debug, and orchestrate polyglot Aspire applications in VS Code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tooling of Aspire is a meta story, and so are its roadmap items. The code-first DevOps value and the polyglot aspirations all deliver on a core premise of Aspire: a simplified developer experience. When the tooling to set up Aspire or interact with it isn't easy, that core premise is lost. The Aspire CLI has already started this meta story with my favorite command, &lt;code&gt;aspire run&lt;/code&gt;, which provides a consistent way to run your Aspire-hosted applications locally. Continued improvements to the CLI and its commands will make Aspire easier to adopt and use. The WinGet and Homebrew installers are similar in value and may simplify installing the Aspire CLI, which is already more complex than it should be. Finally, the VS Code extension may help deliver on the polyglot aspirations of Aspire by bringing Aspire development into the tools JavaScript and Python developers already use, without relying on CLI knowledge. Sure, CLI commands mean you can do it today, but installing the Aspire CLI and generating projects requires a &lt;a href="https://victorfrye.com/blog/posts/adding-aspire-cli-guide" rel="noopener noreferrer"&gt;guide of the right CLI commands&lt;/a&gt;. Overall, the meta story of these tools is to simplify using Aspire so that Aspire can simplify your developer experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/dotnet/aspire/discussions/10644" rel="noopener noreferrer"&gt;2025 roadmap&lt;/a&gt; that the Aspire team published is an exciting glimpse into a rapidly evolving framework. Nothing is a commitment, but the vision tells a story of what Aspire is developing into: a code-first DevOps framework that simplifies local development, testing, and deployment while embracing polyglot development and AI orchestration. I am incredibly excited by this roadmap as it aligns with my own dreams for Aspire. I love what it is today and recommend it to every .NET developer and some polyglot developers. If the Aspire team can deliver on half of these features, it will only continue to be a game-changer for developing distributed applications.&lt;/p&gt;

&lt;p&gt;Let me know what you think of Aspire and where it is going. Are you excited about the roadmap? Do you think Aspire can deliver on these promises? I would love to hear your thoughts and experiences with Aspire so far.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>javascript</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Reviewing Aspire.JS: Current state of Aspire for JavaScript</title>
      <dc:creator>Victor Frye</dc:creator>
      <pubDate>Fri, 01 Aug 2025 12:42:00 +0000</pubDate>
      <link>https://dev.to/leading-edje/reviewing-aspirejs-current-state-of-aspire-for-javascript-cpm</link>
      <guid>https://dev.to/leading-edje/reviewing-aspirejs-current-state-of-aspire-for-javascript-cpm</guid>
      <description>&lt;p&gt;Aspire is the coolest thing in software development right now. That's a statement I frequently make, but it comes from a place of genuine excitement for this nascent framework that is transforming how we can model, run, and deploy applications. Local development with Aspire is effortless regardless of the complexity of your architecture. Aspire is a part of the .NET platform, but it extends past .NET to provide polyglot orchestration for JavaScript, Python, and other languages.&lt;/p&gt;

&lt;p&gt;One reason I refer to Aspire as the coolest thing in software development is my frequent use of it in JavaScript projects. Whether it's a simple static site or a full-stack application, Aspire has become my go-to tool for local development.&lt;/p&gt;

&lt;p&gt;This post is a review of Aspire for JavaScript including current state, my personal experiences, and future aspirations. If you are a JavaScript, .NET, or polyglot developer interested in Aspire, this analysis is for you. If you are not familiar with Aspire, you may want to read a &lt;a href="https://victorfrye.com/blog/posts/hello-aspire-breaking-down-key-features" rel="noopener noreferrer"&gt;breakdown of its key features&lt;/a&gt; first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current state
&lt;/h2&gt;

&lt;p&gt;Aspire in its current state is a code-first orchestration framework written in C# but enabling local development and hosting of polyglot applications. JavaScript and .NET exist in harmony with Aspire. Given the common stack of a JavaScript web frontend, a .NET web API backend, and a containerized database or other backing services, Aspire hosts the entire stack and abstracts away the different mechanisms for running and connecting each component.&lt;/p&gt;

&lt;p&gt;The above stack is the apparent assumption Aspire makes for JavaScript. Aspire allows for modeling and running JavaScript backends, full-stack JavaScript applications, and scripts, but given the .NET-first nature of Aspire, you will be writing some C# code if you use it. Setting aside the app host itself, the orchestrator and C# model, Aspire provides two extension points of note for JavaScript development: integration packages and deployment targets.&lt;/p&gt;
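&lt;p&gt;To make that stack concrete, here is a minimal sketch of what the Aspire app host for it could look like; the resource names, the &lt;code&gt;Projects.Api&lt;/code&gt; reference, and the frontend path are illustrative assumptions, not code from a real project:&lt;/p&gt;

```csharp
// AppHost/Program.cs - minimal sketch; names and paths are hypothetical.
var builder = DistributedApplication.CreateBuilder(args);

// Containerized database via the Aspire.Hosting.PostgreSQL integration.
var db = builder.AddPostgres("postgres").AddDatabase("appdb");

// .NET web API backend; Aspire injects the database connection string.
var api = builder.AddProject<Projects.Api>("api")
    .WithReference(db);

// JavaScript frontend run as an npm script via Aspire.Hosting.NodeJs.
builder.AddNpmApp("frontend", "../frontend", "dev")
    .WithReference(api)
    .WithHttpEndpoint(env: "PORT");

builder.Build().Run();
```

&lt;p&gt;With a model like this, &lt;code&gt;aspire run&lt;/code&gt; (or running the app host project) starts the database container, the API, and the npm dev server together and wires up their endpoints.&lt;/p&gt;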

&lt;h3&gt;
  
  
  Integration packages
&lt;/h3&gt;

&lt;p&gt;Integration packages are the libraries that extend Aspire to support the various projects, executables, containers, and services that make up your application. These packages are distributed as NuGet packages and can be either hosting or client integrations. Hosting integrations extend the Aspire app host to model and run components like a JavaScript web app. Client integrations are libraries that allow you to consume the Aspire configurations and defaults to connect to the hosted components. However, client integrations are exclusive to .NET projects given they are packaged as NuGet packages.&lt;/p&gt;
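&lt;p&gt;For contrast, here is a sketch of what a client integration buys a .NET service; the &lt;code&gt;Aspire.Npgsql&lt;/code&gt; package and the &lt;code&gt;appdb&lt;/code&gt; resource name are assumptions matching a hypothetical app host model:&lt;/p&gt;

```csharp
// Program.cs of a .NET service - sketch using the Aspire.Npgsql client integration.
// "appdb" is an assumed resource name defined in the app host.
var builder = WebApplication.CreateBuilder(args);

// One line registers an NpgsqlDataSource using the connection string,
// health checks, and telemetry defaults Aspire configured for this resource.
builder.AddNpgsqlDataSource("appdb");

var app = builder.Build();
app.Run();
```

&lt;p&gt;A JavaScript component gets no equivalent package today; it has to read the connection string Aspire injects into its environment (for example, an environment variable like &lt;code&gt;ConnectionStrings__appdb&lt;/code&gt;) by hand.&lt;/p&gt;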

&lt;p&gt;This still leaves a variety of hosting integrations for JavaScript development to benefit from. The following are some of the most relevant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.NodeJs" rel="noopener noreferrer"&gt;Aspire.Hosting.NodeJs&lt;/a&gt;&lt;/strong&gt;: Provides hosting for Node.js applications via node or npm scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/CommunityToolkit.Aspire.Hosting.Bun" rel="noopener noreferrer"&gt;CommunityToolkit.Aspire.Hosting.Bun&lt;/a&gt;&lt;/strong&gt;: Provides hosting for Bun applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/CommunityToolkit.Aspire.Hosting.Deno" rel="noopener noreferrer"&gt;CommunityToolkit.Aspire.Hosting.Deno&lt;/a&gt;&lt;/strong&gt;: Provides hosting for Deno applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.PostgreSQL" rel="noopener noreferrer"&gt;Aspire.Hosting.PostgreSQL&lt;/a&gt;&lt;/strong&gt;: Provides hosting for PostgreSQL database via &lt;a href="https://hub.docker.com/_/postgres" rel="noopener noreferrer"&gt;Docker Hub registry postgres images&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.MongoDB" rel="noopener noreferrer"&gt;Aspire.Hosting.MongoDB&lt;/a&gt;&lt;/strong&gt;: Provides hosting for MongoDB database via &lt;a href="https://hub.docker.com/_/mongo" rel="noopener noreferrer"&gt;Docker Hub registry mongo images&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.Redis" rel="noopener noreferrer"&gt;Aspire.Hosting.Redis&lt;/a&gt;&lt;/strong&gt;: Provides hosting for Redis via &lt;a href="https://hub.docker.com/_/redis" rel="noopener noreferrer"&gt;Docker Hub registry redis images&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.Azure.Storage" rel="noopener noreferrer"&gt;Aspire.Hosting.Azure.Storage&lt;/a&gt;&lt;/strong&gt;: Provides hosting for Microsoft Azure cloud storage services, including Blob, Queue, Table, and Azurite emulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.nuget.org/packages/Aspire.Hosting.Testing" rel="noopener noreferrer"&gt;Aspire.Hosting.Testing&lt;/a&gt;&lt;/strong&gt;: Provides a test host for .NET unit testing frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, the packages provide the flexibility of hosting JavaScript apps on the Node.js, Bun, or Deno runtimes. Database and cloud service hosting covers popular JavaScript data solutions like PostgreSQL, MongoDB, Redis, and Azure Blob Storage. Together, Aspire still provides the same local development experience for JavaScript applications as it does for .NET, with instantaneous runs and abstractions over other configuration files. The major deficits are that you must write at least some C# for the Aspire app host, and that consuming the Aspire host for testing requires integration tests written with .NET testing frameworks like xUnit. For true polyglot developers familiar with both JavaScript and .NET, this is a non-issue: you get all the benefits of Aspire with the flexibility of using JavaScript for what JavaScript is best at. For a JavaScript-only developer, however, these are extra barriers to entry that Aspire has yet to solve.&lt;/p&gt;
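&lt;p&gt;As an illustration of the testing point, an integration test against the Aspire host currently looks something like the following xUnit sketch; &lt;code&gt;Projects.AppHost&lt;/code&gt; and the &lt;code&gt;frontend&lt;/code&gt; resource name are assumptions:&lt;/p&gt;

```csharp
// Sketch of an xUnit integration test using Aspire.Hosting.Testing.
using System.Net;
using Aspire.Hosting.Testing;
using Xunit;

public class FrontendTests
{
    [Fact]
    public async Task Frontend_Responds_With_Ok()
    {
        // Build and start the full Aspire application model.
        var appHost = await DistributedApplicationTestingBuilder
            .CreateAsync<Projects.AppHost>();

        await using var app = await appHost.BuildAsync();
        await app.StartAsync();

        // Aspire resolves the endpoint of the named resource for us.
        using var client = app.CreateHttpClient("frontend");
        var response = await client.GetAsync("/");

        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
    }
}
```

&lt;p&gt;Even when the system under test is pure JavaScript, the test itself must live in a .NET test project.&lt;/p&gt;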

&lt;h3&gt;
  
  
  Deployment targets
&lt;/h3&gt;

&lt;p&gt;Deployment is an emerging focus in the Aspire story. Applications orchestrated with Aspire can be deployed anywhere, the same way you would deploy without Aspire; however, the focus here is deployment through Aspire. Aspire is expanding to include publishers and deployment targets, taking your modeled application and using it to generate artifacts like Bicep and container images. Given Aspire's origins in .NET and Microsoft solutions, the initial deployment targets are opinionated and limited. By default, the easiest deployment target is &lt;a href="https://learn.microsoft.com/en-us/azure/container-apps/overview" rel="noopener noreferrer"&gt;Azure Container Apps&lt;/a&gt;, a serverless platform for running containerized applications. However, there is a fundamental flaw here for JavaScript developers: hosting with Azure Container Apps assumes you need a server.&lt;/p&gt;

&lt;p&gt;JavaScript developers are accustomed to a true serverless experience, one in which the web browser is the host environment. Frameworks like Next.js allow for server-side computation, but many JavaScript frameworks and applications are designed to run entirely in the browser using a bundle of JavaScript, HTML, and CSS. This has a lot of benefits for developers, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No server management&lt;/strong&gt;: No need to manage servers or containers, just a static file host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant scaling&lt;/strong&gt;: Static files can be served from a CDN, scaling automatically with demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower costs&lt;/strong&gt;: Static file hosting is often cheaper than running containers or VMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so much more. This is antithetical to traditional .NET development and represents a fundamental difference between JavaScript and .NET. Some JavaScript developers may opt for containerized hosting due to enterprise infrastructure or for self-managed static web servers like Nginx, but Azure already provides a first-class static web hosting solution with &lt;a href="https://learn.microsoft.com/en-us/azure/static-web-apps/overview" rel="noopener noreferrer"&gt;Azure Static Web Apps&lt;/a&gt;. Azure Static Web Apps are nowhere to be found in the Aspire deployment story, which is a major gap for Aspire for JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Aspire.JS story
&lt;/h2&gt;

&lt;p&gt;To understand how Aspire fits JavaScript development currently and potentially in the future, it is helpful to understand how a developer who has adopted Aspire already uses it. I am a full-stack developer who currently favors .NET for backend development, React for frontend development, and Azure for cloud hosting. I started using Aspire for a sample .NET web API that I wanted to run on macOS and Windows, so that anyone could pull down the code and run it with minimal configuration. Aspire was perfect for this, so I started using it for all my .NET projects. This in turn led me to use Aspire to host a React frontend alongside my web API and database, which also proved to be effortless. Finally, I asked the question: &lt;em&gt;Why not use Aspire for my JavaScript only projects?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have three static sites that I maintain, including my personal blog, and I wanted to use Aspire to provide a consistent local development experience across all my projects. It works. Every personal project, including live websites and demo applications, is locally orchestrated with Aspire by default. I also recommend it for any enterprise .NET project I work on. However, my recommendation currently stops at projects that do not include .NET components. Aspire is an excellent choice for .NET and for polyglot projects that include .NET, but the benefits of Aspire for JavaScript-only projects, or polyglot projects without .NET, are not an easy sell. The C# app host, the NuGet-only client integrations, and the lack of aligned deployment targets are all barriers to entry for JavaScript developers.&lt;/p&gt;

&lt;p&gt;The C# app host is a non-issue for me as a .NET developer, but for any project not already using .NET, it means extra SDKs to install and a new language to learn. Admittedly, the app host is not complex C# code until you start creating your own custom components. It is the download of the .NET SDK that is the high barrier. The NuGet client integrations are less a barrier and more a missing feature needed to sell the value story. Finally, deployment targets are a nascent feature in a nascent framework. I started using Aspire without its deployment features due to their immaturity. To this day, I favor Azure Container Apps or Azure Functions for .NET workloads and Azure Static Web Apps for JavaScript workloads, and I handle deployment separately from Aspire. Together, this means the Aspire story for non-.NET applications adds .NET as a development dependency and is missing the client integrations and deployment flexibility I would expect before recommending it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future aspirations
&lt;/h2&gt;

&lt;p&gt;Aspire recently published their &lt;a href="https://victorfrye.com/blog/posts/aspire-roadmap-2025" rel="noopener noreferrer"&gt;2025 roadmap&lt;/a&gt;, which includes several features that may solve the current limitations of Aspire for JavaScript. The most exciting are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Polyglot client integrations&lt;/strong&gt;: Connection strings, configuration, and telemetry work consistently via npm packages as they do with existing NuGet packages for .NET projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templates and samples&lt;/strong&gt;: More documentation and quickstart examples for JavaScript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-language app host&lt;/strong&gt;: An experimental WebAssembly app host that may reduce .NET friction for JavaScript developers authoring the Aspire app host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These features may make Aspire a more accessible choice for JavaScript developers and extend some of the .NET-exclusive benefits to the JavaScript components in your applications. The npm client integration packages excite me the most as a polyglot developer, because they would allow me to integrate databases and cloud services like Azure Storage into my JavaScript components with the same reduced configuration Aspire provides for .NET projects. This adds parity in developer experience and closes the gap for recommending and adopting Aspire for JavaScript development. Documentation improvements are also always welcome and ease adoption. The cross-language app host is interesting, but I am still unsure of what it may amount to or if it will even materialize. If it does, maybe it removes the .NET SDK download as a barrier. These features are directional, not commitments, but they offer hope of increased parity with the Aspire .NET developer experience.&lt;/p&gt;

&lt;p&gt;The remaining gap is deployment. This is an emerging area and the .NET story itself is still evolving for deploying with Aspire. However, I am actively watching for how this matures and the targets that get first-class support. If static hosting targets like Azure Static Web Apps are added, Aspire for JavaScript becomes a much more compelling recommendation. If Aspire only provides first-class support for traditionally .NET hosting targets like Azure App Service, Azure Functions, and Azure Container Apps, then the polyglot story remains incomplete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final assessment
&lt;/h2&gt;

&lt;p&gt;Aspire is the coolest thing in software development right now and is actively evolving. For polyglot developers familiar with .NET, Aspire is a game-changer and you should experiment with adding it yourself. However, for JavaScript development and polyglot applications without .NET, there are still barriers to entry that prevent Aspire from being a compelling recommendation. Can you do it? Absolutely. Do I use Aspire for JavaScript development? Yes. Do I recommend it for JavaScript only projects? Not yet. But maybe in the future. Maybe soon.&lt;/p&gt;

</description>
      <category>aspire</category>
      <category>dotnet</category>
      <category>javascript</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>dotnet run file.cs: The new file-based application model</title>
      <dc:creator>Victor Frye</dc:creator>
      <pubDate>Wed, 02 Jul 2025 13:00:00 +0000</pubDate>
      <link>https://dev.to/leading-edje/dotnet-run-filecs-the-new-file-based-application-model-1d24</link>
      <guid>https://dev.to/leading-edje/dotnet-run-filecs-the-new-file-based-application-model-1d24</guid>
      <description>&lt;p&gt;I missed something at &lt;a href="https://victorfrye.com/blog/posts/microsoft-build-2025-wrapped" rel="noopener noreferrer"&gt;Microsoft Build 2025&lt;/a&gt;: the announcement of the new &lt;code&gt;dotnet run file.cs&lt;/code&gt; model in &lt;a href="https://devblogs.microsoft.com/dotnet/dotnet-10-preview-4/" rel="noopener noreferrer"&gt;.NET 10 Preview 4&lt;/a&gt;. This is a new paradigm for running and writing .NET applications and if you are reading this, you might not be the target of this feature. However, you will probably meet or read C# code that is written this way.&lt;/p&gt;

&lt;p&gt;This article will explore the new feature of &lt;code&gt;dotnet run file.cs&lt;/code&gt; and the value it brings to the .NET ecosystem. Run it!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current Project-Based Model
&lt;/h2&gt;

&lt;p&gt;Today, if I wanted to write a simple C# console application that outputs "Hello, World!", I would need to do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the .NET SDK.&lt;/li&gt;
&lt;li&gt;Install an IDE or text editor like Visual Studio or Visual Studio Code.&lt;/li&gt;
&lt;li&gt;Create a new .NET project using the IDE or the &lt;code&gt;dotnet new&lt;/code&gt; CLI command.&lt;/li&gt;
&lt;li&gt;Write my code in the &lt;code&gt;Program.cs&lt;/code&gt; file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these steps is changing. However, consider the output today given the command &lt;code&gt;dotnet new console --name HelloWorld&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File actions would have been taken:
  Create: ./HelloWorld.csproj
  Create: ./Program.cs

Processing post-creation actions...
Action would have been taken automatically:
   Restore NuGet packages required by this project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above is the dry run output of the &lt;code&gt;dotnet new&lt;/code&gt; command. Notice two files are created: &lt;code&gt;HelloWorld.csproj&lt;/code&gt; and &lt;code&gt;Program.cs&lt;/code&gt;. The &lt;code&gt;csproj&lt;/code&gt; file is an XML file that contains information any .NET developer is all too familiar with. The &lt;code&gt;Program.cs&lt;/code&gt; file is where I write my code. Additionally, you will quickly see &lt;code&gt;obj&lt;/code&gt; and &lt;code&gt;bin&lt;/code&gt; directories created and start populating as you write and publish your application. Do you know what both directories are for, even today? Do you find XML friendly to read? Microsoft asked a new question: Is this all overwhelming for someone new?&lt;/p&gt;

&lt;p&gt;The keyword above was &lt;strong&gt;new&lt;/strong&gt;. I invite you to recall your days learning to code and suppress your experienced instincts. When I do, I remember sitting in a classroom feeling like I might never understand programming and would fail. C# was my first language. We have bootcamps, universities, and online courses in excess to help new developers. That is working, but they are learning JavaScript or Python. Why? Because the onboarding experience is easier. The barrier to entry lower.&lt;/p&gt;

&lt;p&gt;What if this changed? Introducing the new &lt;code&gt;dotnet run file.cs&lt;/code&gt; paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New File-Based Model
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;dotnet run&lt;/code&gt; we keep discussing is the Dotnet CLI command any .NET command-line user is familiar with. The &lt;code&gt;file.cs&lt;/code&gt;, however, refers to a new single-file application model. That means our steps from earlier change to the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the .NET SDK.&lt;/li&gt;
&lt;li&gt;Install an IDE or text editor like Visual Studio Code.&lt;/li&gt;
&lt;li&gt;Create a new C# file, e.g. &lt;code&gt;hello.cs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Write my code in the &lt;code&gt;hello.cs&lt;/code&gt; file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The steps are incredibly similar but also simplified. You need the .NET SDK and a tool for writing code still, but you no longer need to understand a complex project generation process and only have one file to manage. Let's review it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="p"&gt;!/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dotnet&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="n"&gt;Microsoft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Web&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;property&lt;/span&gt; &lt;span class="n"&gt;AssemblyName&lt;/span&gt; &lt;span class="n"&gt;VictorFrye&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HelloWorld&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WebApplication&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;UseHttpsRedirection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MapGet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/hello"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s"&gt;"Hello World!"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RunAsync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. I could link to a repository, but if you copy and paste this you get a complete .NET application you can run. There is no &lt;code&gt;csproj&lt;/code&gt; file, and &lt;code&gt;obj&lt;/code&gt; and &lt;code&gt;bin&lt;/code&gt; directories are not created in your working directory. And if you run the command &lt;code&gt;dotnet run hello.cs&lt;/code&gt;, you get an active Kestrel web server that responds with "Hello World!". The latter half of the code is top-level statements, a feature not so new. However, the first three lines are special.&lt;/p&gt;

&lt;p&gt;The first line is a shebang: a Unix convention that tells the system how to execute the file. In this case, it tells the system to use the &lt;code&gt;dotnet run&lt;/code&gt; command to execute the file. With this new paradigm, you must have the .NET SDK installed and Dotnet CLI available still. A shebang is not required, but it does enable running the file without explicitly calling &lt;code&gt;dotnet run&lt;/code&gt; on Unix-like systems. This is cool, but mostly just a convenience.&lt;/p&gt;

&lt;p&gt;The second and third lines are new directives. You may be using directives in your code today, such as &lt;code&gt;#if DEBUG&lt;/code&gt; or &lt;code&gt;#region Feature X&lt;/code&gt;. However, the new &lt;code&gt;#:&lt;/code&gt; directives are unique to the run file paradigm. The &lt;code&gt;.csproj&lt;/code&gt; file normally tells our .NET application critical information like SDKs, MSBuild properties, or NuGet packages to use. The run file paradigm still supports these, but instead you use directives such as &lt;code&gt;#:sdk&lt;/code&gt;, &lt;code&gt;#:property&lt;/code&gt;, and &lt;code&gt;#:package&lt;/code&gt;. In this case, I'm using the &lt;code&gt;Microsoft.NET.Sdk.Web&lt;/code&gt; SDK to pull in ASP.NET Core features for web APIs and setting the assembly name to &lt;code&gt;VictorFrye.HelloWorld&lt;/code&gt; because I like my name. These new directives are only for the run file paradigm, and you will get warnings if you try to use them in a traditional project model.&lt;/p&gt;
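&lt;p&gt;To sketch the NuGet side of the directives, a &lt;code&gt;#:package&lt;/code&gt; directive pulls a package into a file-based app; the choice of Humanizer here is an illustrative assumption:&lt;/p&gt;

```csharp
#:package Humanizer@2.14.1

using Humanizer;

// The Dotnet CLI restores Humanizer behind the scenes before running,
// with no csproj and no visible obj or bin directories.
Console.WriteLine("PascalCaseSentence".Humanize());
```

&lt;p&gt;Saved as, say, &lt;code&gt;greet.cs&lt;/code&gt;, this runs with &lt;code&gt;dotnet run greet.cs&lt;/code&gt; just like the web example above.&lt;/p&gt;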

&lt;p&gt;Behind the scenes, everything is still there. The project file still exists but is virtual and interpreted by the Dotnet CLI. The &lt;code&gt;obj&lt;/code&gt; and &lt;code&gt;bin&lt;/code&gt; directories are created, but in a temporary location that is abstracted away. The application is still built and run like any other .NET application. The difference is in the simplicity of authoring C#. However, when the project reaches maturity or someone is ready to take it to the next level, they can convert the file-based application to a traditional project-based application. All you must do is run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dotnet project convert hello.cs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Value Added
&lt;/h2&gt;

&lt;p&gt;I am really excited about &lt;code&gt;dotnet run file.cs&lt;/code&gt;. The primary users targeted are new developers. If Microsoft succeeds and more developers embrace modern .NET applications, that is a win. Some might be concerned about newcomers not learning all the details of the full project-based application model, but new developers learning .NET mean a larger .NET community, new libraries, and more innovation in the ecosystem. This is a huge win for the .NET developer community.&lt;/p&gt;

&lt;p&gt;However, the value added doesn't stop there. File-based applications are also great for scripts and small utility apps. You don't need a folder structure or a csproj file. You can now write a couple of C# scripts to help you maintain your existing codebase or automate tasks. This is a huge win for scripting capabilities and reducing project overhead.&lt;/p&gt;

&lt;p&gt;Another use case is one you may have read yourself: .NET samples. Sample applications are used by libraries to showcase how to use specific features or APIs. They are also used by conference speakers and at meetups to illustrate concepts or provide live demos of features. For this article itself, I would normally have to create a full project to demonstrate the feature and link the repository so a reader could copy it exactly, reference it, or run it themselves. Now, I can provide the entire sample in a code block that is easy to copy and paste. This is a huge win for documentation and sample authors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Limitations So Far
&lt;/h2&gt;

&lt;p&gt;Right now, file-based applications are limited to a single file. They are also unsupported in Visual Studio, favoring Visual Studio Code as the more likely editor for the targeted users. Finally, the feature exists only in .NET 10 preview versions at the moment. It will not be until November 2025 that we see the first generally available release of file-based applications, and likely some time after that before we see new developers learning in this form or a C# scripting revolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding Remarks
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;dotnet run file.cs&lt;/code&gt; paradigm is a new way to write and run .NET applications. It may or may not be for you, but the goal is a more inclusive and accessible .NET ecosystem. The best outcome is more developers learning and using .NET. Maybe C# scripts take off and we see C# become the new Python. Maybe documentation and sample applications get less verbose. The future is hard to predict, but I am hopeful for a future where I see file-based C# applications in the wild.&lt;/p&gt;

</description>
      <category>dotnet</category>
    </item>
  </channel>
</rss>
