<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dennis Whalen</title>
    <description>The latest articles on DEV Community by Dennis Whalen (@dwwhalen).</description>
    <link>https://dev.to/dwwhalen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F197670%2F7e801472-bab5-4500-986d-b9c9629d7c96.jpeg</url>
      <title>DEV Community: Dennis Whalen</title>
      <link>https://dev.to/dwwhalen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dwwhalen"/>
    <language>en</language>
    <item>
      <title>Maestro: A Single Framework for Mobile and Web E2E Testing</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Fri, 26 Dec 2025 13:26:20 +0000</pubDate>
      <link>https://dev.to/leading-edje/maestro-a-single-framework-for-mobile-and-web-e2e-testing-b98</link>
      <guid>https://dev.to/leading-edje/maestro-a-single-framework-for-mobile-and-web-e2e-testing-b98</guid>
      <description>&lt;p&gt;I've recently been working on a personal project that has both mobile and web frontends. I wanted to include E2E tests, but I didn't want to spend a bunch of time getting all of that setup for web, iOS, and Android.&lt;/p&gt;

&lt;p&gt;I just wanted a handful of happy-path E2E tests for an app that could run on a desktop browser, mobile browser, and native mobile.&lt;/p&gt;

&lt;p&gt;Most importantly, I wanted to get this running quickly so I could focus on actually building the app.  That's when I found an open source tool called &lt;a href="https://maestro.dev/" rel="noopener noreferrer"&gt;Maestro&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;What immediately caught my attention with Maestro is that it's so easy to get set up, and it handles both web and mobile with the same tool and syntax. &lt;/p&gt;

&lt;h2&gt;
  
  
  Here's What a Test Looks Like
&lt;/h2&gt;

&lt;p&gt;Maestro tests are written in YAML. Here's a simple desktop browser example that searches DuckDuckGo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://duckduckgo.com&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;without&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;being&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tracked'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;e2e&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;testing'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;open-source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;framework.*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward, right? It opens DuckDuckGo, taps the search box, searches for "Maestro e2e testing", and verifies that the results contain "Maestro is an open-source framework". Note that for partial text matching, Maestro uses regex—the &lt;code&gt;.*&lt;/code&gt; pattern means "any characters", so &lt;code&gt;".*text.*"&lt;/code&gt; effectively does a "contains" match.&lt;/p&gt;
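&lt;p&gt;If you want to pin down the matching behavior, the same assertion can be written a few ways. This is a sketch based on the flow above; the exact-match variants assume the text appears verbatim on the page:&lt;/p&gt;

```yaml
# Regex "contains" match, as used in the flow above
- assertVisible: ".*Maestro is an open-source framework.*"

# Without the .* wildcards, the element's text must match exactly
- assertVisible: "Maestro e2e testing"

# tapOn uses the same text-matching rules as assertVisible
- tapOn:
    text: "Search without being tracked"
```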

&lt;p&gt;To be honest, I was not super excited to work with a tool that uses YAML to define the tests.  In my regular job I spend a lot of time building out code-based automation suites, and that usually feels like the "right" way to do it.  But is that always the case?&lt;/p&gt;

&lt;p&gt;My personal project is not super complex, and I don't have a team of test automation folks.  I have one dev and one QA, and they are both me.  I want E2E tests, but I want to focus the majority of my time on building the app, not building fancy-pants automation frameworks.&lt;/p&gt;

&lt;p&gt;Let's run this test!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;I am not assuming that everyone uses a Mac, but that's what I'm using so keep that in mind if you're reading this as a Windows or Unix person.  Maestro is cross-platform, but some of the install steps will be different.  See their &lt;a href="https://docs.maestro.dev/getting-started/installing-maestro" rel="noopener noreferrer"&gt;setup documentation&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;First, let's install Maestro.  Open your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="s2"&gt;"https://get.maestro.mobile.dev"&lt;/span&gt; | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can use Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap mobile-dev-inc/tap
brew &lt;span class="nb"&gt;install &lt;/span&gt;maestro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify it worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OK so what do I need to install next?  Huh, that's it??  Well then... let's run the test!&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/duckduckgo-search-desktop.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maestro will open a browser, run through the test steps, and show you the results. If something fails, the output helps you figure out what went wrong, and you'll also get some detailed log files.  Hopefully your run will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0i83z23yirqxrjru2c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0i83z23yirqxrjru2c7.png" alt="Maestro console output from desktop browser test" width="522" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Same Test on Mobile Browser
&lt;/h2&gt;

&lt;p&gt;You can run a similar test on a mobile browser. Here's the mobile version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;appId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.android.chrome&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://duckduckgo.com"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;searchbox_input"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;e2e&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;testing"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pressKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enter&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.*Maestro&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;open-source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;framework.*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the syntax is almost identical.  The main difference is using &lt;code&gt;url:&lt;/code&gt; for desktop browsers and &lt;code&gt;appId:&lt;/code&gt; for mobile browsers. Other than that, Maestro uses the same commands for both.&lt;/p&gt;

&lt;p&gt;To run this, you'll need an Android emulator. If you have Android Studio installed, you can use the AVD Manager to create one. Make sure Chrome is installed on the emulator (it usually is by default).&lt;/p&gt;
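&lt;p&gt;If you prefer the command line to Android Studio, you can also create an emulator with the Android SDK tools. A sketch, assuming the command-line tools are installed and on your PATH; the system image and AVD name here are just examples:&lt;/p&gt;

```shell
# Download an example system image (pick one that matches your setup)
sdkmanager "system-images;android-34;google_apis;x86_64"

# Create an AVD named "maestro-test" from that image
avdmanager create avd -n maestro-test -k "system-images;android-34;google_apis;x86_64"

# Boot it, and leave it running while Maestro tests execute
emulator -avd maestro-test
```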

&lt;p&gt;Once your emulator is running, just run the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/duckduckgo-search-mobile.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hopefully you'll see the same interactions that you saw with the desktop browser test, and the same green results, like this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forondtneq605eu1cugd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forondtneq605eu1cugd0.png" alt="Maestro console output from mobile browser test" width="541" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You now have a taste for browser-based Maestro testing on a desktop browser and a mobile browser.  Let's move away from the browser and use Maestro to test a mobile app. &lt;/p&gt;

&lt;h2&gt;
  
  
  Testing a Native Mobile App
&lt;/h2&gt;

&lt;p&gt;The built-in Android Contacts app is perfect for this because it's available on every Android device and works great in an emulator. Notice how the syntax is the same as the web test.  Maestro uses the same commands whether you're testing web or native mobile.&lt;/p&gt;

&lt;p&gt;Here's a test that creates a new contact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;appId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.google.android.contacts&lt;/span&gt;
&lt;span class="na"&gt;jsEngine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graaljs&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName = faker.name().firstName()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.lastName = faker.name().lastName()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evalScript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.phoneNumber = faker.phoneNumber().phoneNumber()}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;launchApp&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;contact"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;name"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.lastName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;longPressOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(Mobile)"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Select&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;All'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;eraseText&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;inputText&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.phoneNumber}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Save"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${output.firstName + " " + output.lastName}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;scrollUntilVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;element&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tapOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delete"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assertVisible&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;contact&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deleted"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test is a bit more advanced as it demonstrates Maestro's ability to generate dynamic test data using Faker. The &lt;code&gt;jsEngine: graaljs&lt;/code&gt; setting enables JavaScript execution, and the &lt;code&gt;evalScript&lt;/code&gt; commands at the top use Faker to generate random first names, last names, and phone numbers. These values are stored in the &lt;code&gt;output&lt;/code&gt; object and referenced throughout the test using &lt;code&gt;${output.variableName}&lt;/code&gt; syntax. &lt;/p&gt;

&lt;p&gt;This is just one example of integrating JavaScript with Maestro scripts.  More detail can be found &lt;a href="https://docs.maestro.dev/advanced/javascript" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running It
&lt;/h3&gt;

&lt;p&gt;With your emulator running, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;maestro &lt;span class="nb"&gt;test &lt;/span&gt;flows/contacts-app-android.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test will run, and you'll see the emulator actually perform the actions. If it passes, you'll see a nice success message. If it fails, Maestro will tell you what went wrong and where.  Here's what I see:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetx42gck7kceo2k5glx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetx42gck7kceo2k5glx.png" alt="Maestro console output from Android Contacts app test" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maestro MCP
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is a standardized protocol that bridges tools (like Maestro) to LLMs (like Claude or ChatGPT). Think of it as a universal connector that lets these AI models access and interact with your development tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; If you're using these LLMs in your development workflow, Maestro includes an MCP server that lets them interact with Maestro directly. They can read your test files, understand your test structure, suggest improvements, or even generate tests based on your app's behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to use it:&lt;/strong&gt; The MCP server comes bundled with Maestro. To use it in Cursor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Cursor Settings&lt;/li&gt;
&lt;li&gt;Navigate to the MCP section&lt;/li&gt;
&lt;li&gt;Click "Add new MCP Server"&lt;/li&gt;
&lt;li&gt;Configure it with:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"maestro"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"maestro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save and restart Cursor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Similar functionality is available in other tools like VS Code through MCP extensions. Once connected, the AI assistant can discover your Maestro flows, understand your test structure, and help you write better tests.&lt;/p&gt;
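&lt;p&gt;For example, in VS Code the equivalent configuration lives in an &lt;code&gt;mcp.json&lt;/code&gt; file. A sketch, assuming VS Code's MCP support is enabled (note the &lt;code&gt;servers&lt;/code&gt; key instead of &lt;code&gt;mcpServers&lt;/code&gt;; check the VS Code docs for the current schema):&lt;/p&gt;

```json
{
  "servers": {
    "maestro": {
      "command": "maestro",
      "args": ["mcp"]
    }
  }
}
```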

&lt;p&gt;More details can be found &lt;a href="https://docs.maestro.dev/getting-started/maestro-mcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few things I didn't cover but want to mention
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maestro is easy to run on your CI platform, and also has a Cloud plan.  More info &lt;a href="https://docs.maestro.dev/cloud/ci-integration" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maestro has a ton of sample flows to help you learn more &lt;a href="https://docs.maestro.dev/getting-started/run-a-sample-flow" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maestro has an IDE to help with identifying UI elements, generating code, and running commands.  &lt;a href="https://docs.maestro.dev/getting-started/maestro-studio-cli" rel="noopener noreferrer"&gt;Check it out&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Take a look at &lt;a href="https://docs.maestro.dev" rel="noopener noreferrer"&gt;docs.maestro.dev&lt;/a&gt; for more examples, advanced features like nested flows and conditions, page objects, and tips for structuring larger test suites.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy building and testing.  Peace out! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>testing</category>
      <category>mobile</category>
      <category>qa</category>
    </item>
    <item>
      <title>Add Structured Testing to Your AI Vibe - with promptfoo</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Thu, 04 Sep 2025 11:46:23 +0000</pubDate>
      <link>https://dev.to/leading-edje/add-structured-testing-to-your-ai-vibe-with-promptfoo-5h3o</link>
      <guid>https://dev.to/leading-edje/add-structured-testing-to-your-ai-vibe-with-promptfoo-5h3o</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038"&gt;previous promptfoo post&lt;/a&gt;, we covered the basics of testing LLM prompts with simple examples using &lt;a href="https://www.promptfoo.dev/docs/intro/" rel="noopener noreferrer"&gt;promptfoo&lt;/a&gt;. But when you're building an actual application that processes user-generated content at scale, you might discover that your carefully crafted prompt needs to handle far more complexity than you initially anticipated.&lt;/p&gt;

&lt;p&gt;Many teams are still doing "vibe testing" - manually checking a few examples, tweaking prompts based on gut feel, and hoping everything works in production. While this might get you started, a systematic evaluation framework puts you significantly ahead of the curve when it comes to building and maintaining reliable AI systems, and provides a mechanism to build a set of repeatable automated regression tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Assignment
&lt;/h2&gt;

&lt;p&gt;Let's consider an example.  You're working with a major ecommerce client, and your team is building a feature that will analyze user-submitted product reviews. Your application needs to evaluate the product reviews, classify sentiment, extract key product features mentioned, detect potentially fake reviews, and make moderation decisions.  This will help customers find trustworthy reviews and help your business maintain review quality.&lt;/p&gt;

&lt;p&gt;The core of this system is a prompt that takes each incoming review and returns structured data, such as sentiment classification, confidence scores, extracted features, fake review indicators, and moderation recommendations. &lt;/p&gt;

&lt;p&gt;This prompt might work well during development, but once deployed, it needs to handle the messy reality of real user reviews. Your prompt will definitely need to be able to handle things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mixed sentiment reviews (loved the product, hated the shipping)&lt;/li&gt;
&lt;li&gt;Fake or suspicious reviews&lt;/li&gt;
&lt;li&gt;Reviews with profanity or inappropriate content&lt;/li&gt;
&lt;li&gt;Sarcastic or nuanced language&lt;/li&gt;
&lt;li&gt;Reviews that mention competitors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where a systematic process with multiple scenarios becomes crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Requirements
&lt;/h2&gt;

&lt;p&gt;Speaking of systematic processes, before we dive into building our prompt and setting up the promptfoo tests, let's outline what the requirements would look like.  We'll use our old friend gherkin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="kd"&gt;Feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Product Review Analysis Prompt

  &lt;span class="kn"&gt;Scenario Outline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; Prompt analyzes product reviews correctly
    &lt;span class="nf"&gt;Given &lt;/span&gt;a product review analysis prompt
    &lt;span class="nf"&gt;And &lt;/span&gt;a &lt;span class="s"&gt;"&amp;lt;review_type&amp;gt;"&lt;/span&gt; product review
    &lt;span class="nf"&gt;When &lt;/span&gt;the prompt processes the review
    &lt;span class="nf"&gt;Then &lt;/span&gt;the sentiment should be classified as &lt;span class="s"&gt;"&amp;lt;expected_sentiment&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;fake review indicators should be &lt;span class="s"&gt;"&amp;lt;fake_indicators&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;the recommendation should be &lt;span class="s"&gt;"&amp;lt;expected_recommendation&amp;gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;And &lt;/span&gt;key features should be extracted

    &lt;span class="nn"&gt;Examples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;review_type&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_sentiment&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_fake_indicators&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nv"&gt;expected_recommendation&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;approve&lt;/span&gt;                 &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;    &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;approve&lt;/span&gt;                 &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;mixed&lt;/span&gt;       &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;mixed&lt;/span&gt;              &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;absent&lt;/span&gt;                   &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flag_for_review&lt;/span&gt;         &lt;span class="p"&gt;|&lt;/span&gt;
      &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;suspicious&lt;/span&gt;  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;positive&lt;/span&gt;           &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;present&lt;/span&gt;                  &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;flag_for_review&lt;/span&gt;         &lt;span class="p"&gt;|&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gherkin is just a way to describe requirements in plain language.  In this case, we have four main test scenarios: positive reviews, negative reviews, mixed sentiment reviews, and suspicious/fake reviews.&lt;/p&gt;

&lt;p&gt;Promptfoo doesn't use gherkin, but I do, and it helps me think through the scenarios we need to cover.  We'll translate these scenarios into actual promptfoo tests next.&lt;/p&gt;
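&lt;p&gt;To give a feel for the translation, here's how the first Examples row might map to a promptfoo test case. This is just a sketch, assuming the prompt and test-data files laid out in the next section; the provider is a placeholder for whichever one you've configured:&lt;/p&gt;

```yaml
# One gherkin Examples row expressed as a promptfoo test case
prompts:
  - file://prompts/analyze-review.txt
providers:
  - openai:gpt-4o-mini   # placeholder provider
tests:
  - vars:
      review_text: file://test-data/positive-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: JSON.parse(output).sentiment === 'positive'
```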

&lt;h2&gt;
  
  
  Moving Beyond Inline YAML: File-Based Organization
&lt;/h2&gt;

&lt;p&gt;In my &lt;a href="//../promptfoo-1-testing-custom-LLM-prompts"&gt;last post&lt;/a&gt; we defined the entire test in YAML.  Before diving into complex scenarios, let's improve our testing structure by moving prompts into separate files. This makes them easier to maintain, version control, and collaborate on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;promptfoo-product-reviews/
├── prompts/
│   └── analyze-review.txt
├── test-data/
│   ├── positive-review.txt
│   ├── negative-review.txt
│   ├── mixed-review.txt
│   └── suspicious-review.txt
├── analyze-review-spec.yaml
└── package.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating Our Review Analysis Prompt
&lt;/h3&gt;

&lt;p&gt;Let's first create a prompt specifically designed for ecommerce product review analysis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;prompts/analyze-review.txt&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert product review analyzer for an ecommerce platform. Analyze the following product review and provide a structured assessment.

Product Review:
{{review_text}}

Provide your analysis in the following JSON format. Return ONLY the JSON object, no markdown code blocks, no explanations, no additional text:
{
  "sentiment": "positive|negative|mixed",
  "confidence": 0.0-1.0,
  "key_features_mentioned": ["feature1", "feature2"],
  "main_complaints": ["complaint1", "complaint2"],
  "main_praise": ["praise1", "praise2"],
  "suspected_fake": boolean,
  "fake_indicators": ["indicator1", "indicator2"],
  "recommendation": "approve|flag_for_review|reject",
  "summary": "Brief 1-2 sentence summary"
}

Focus on:
- Accurate sentiment classification, especially for mixed reviews
- Extracting specific product features mentioned
- Identifying potential fake review indicators such as generic language without specific details, suspicious patterns, extreme superlatives, and overly positive or negative language
- Providing actionable moderation recommendations

IMPORTANT: Return ONLY valid JSON. Do not wrap in markdown code blocks or add any other text.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test Scenarios: Real-World Product Reviews
&lt;/h2&gt;

&lt;p&gt;So that's the prompt we're going to test. Now let's create diverse test scenarios that represent what you'd actually encounter in production.  You might make these up, or you might use some actual production reviews.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Genuine Positive Review Example
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/positive-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've been using these wireless earbuds for 3 months now and I'm really impressed. The battery life is excellent - I get about 6-7 hours of continuous listening, and the case gives me 2-3 full charges. The sound quality is crisp and clear, with good bass response for the price point. They stay comfortable in my ears during workouts and haven't fallen out once. The touch controls take some getting used to but work reliably once you learn them. Only minor complaint is that the case is a bit bulky for my small pockets, but that's a trade-off for the extra battery. Would definitely recommend for anyone looking for reliable wireless earbuds under $100.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: Detailed Negative Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/negative-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Very disappointed with these earbuds. The connection constantly drops out, especially when my phone is in my pocket or more than a few feet away. The battery life is nowhere near the advertised 8 hours - I'm lucky to get 4 hours before they die. The sound quality is muddy and lacks clarity, particularly in the mid-range frequencies. They're also uncomfortable for extended wear - my ears start hurting after about an hour. The touch controls are oversensitive and constantly trigger accidentally when I adjust them. For the price, I expected much better quality. I've had $20 earbuds that performed better than these. Returning them and looking for alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: Mixed Sentiment Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/mixed-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These earbuds are a mixed bag. On the positive side, the sound quality is really good - clear highs, decent bass, and good overall balance. The build quality feels solid and they look premium. The battery life meets expectations at around 6 hours. However, there are some significant issues. The Bluetooth connection is unreliable - frequent dropouts and sometimes one earbud stops working randomly. The fit is also problematic for me - they tend to slip out during exercise despite trying all the included ear tips. Customer service was helpful when I contacted them about the connection issues, but the firmware update they suggested didn't solve the problem. Overall, great sound quality let down by connectivity and fit issues. Might work better for others but not ideal for my use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 4: Suspicious/Fake Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;test-data/suspicious-review.txt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazing product! These earbuds are the best I have ever used in my entire life. The sound quality is absolutely perfect and the battery life is incredible. They are so comfortable and never fall out. The connection is always stable and strong. I love everything about these earbuds and they exceeded all my expectations. Everyone should buy these right now because they are the greatest earbuds ever made. Five stars without any doubt! Highly recommend to all people who want amazing earbuds with perfect quality and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comprehensive Test Configuration
&lt;/h2&gt;

&lt;p&gt;Now let's create a promptfoo configuration that tests all these scenarios with appropriate assertions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;analyze-review-spec.yaml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;description: Product Review Analysis Testing

prompts:
  - file://prompts/analyze-review.txt

providers:
  - openai:chat:gpt-4o-mini

tests:
  # Test 1: Genuine Positive Review
  - vars:
      review_text: file://test-data/positive-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          response.sentiment === 'positive' &amp;amp;&amp;amp; response.confidence &amp;gt; 0.7
      - type: contains-json
        value:
          suspected_fake: false
      - type: llm-rubric
        value: "Should identify key positive features like battery life, sound quality, and comfort. Should not flag as fake since it contains specific details and minor complaints."

  # Test 2: Detailed Negative Review  
  - vars:
      review_text: file://test-data/negative-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          response.sentiment === 'negative' &amp;amp;&amp;amp; response.confidence &amp;gt; 0.7
      - type: contains-json
        value:
          suspected_fake: false
      - type: llm-rubric
        value: "Should identify specific complaints about connection, battery, sound quality, and comfort. Should extract main issues for product team review."

  # Test 3: Mixed Sentiment Review
  - vars:
      review_text: file://test-data/mixed-review.txt
    assert:
      - type: is-json
      - type: javascript
        value: |
          const response = JSON.parse(output);
          response.sentiment === 'mixed'
      - type: llm-rubric
        value: "Should correctly identify mixed sentiment, extracting both positive aspects (sound quality, build) and negative aspects (connectivity, fit). This is the most challenging scenario for sentiment analysis."

  # Test 4: Suspicious/Fake Review
  - vars:
      review_text: file://test-data/suspicious-review.txt
    assert:
      - type: is-json
      - type: contains-json
        value:
          suspected_fake: true
      - type: javascript
        value: |
          const response = JSON.parse(output);
          response.fake_indicators &amp;amp;&amp;amp; response.fake_indicators.length &amp;gt; 0
      - type: llm-rubric
        value: "Should detect fake review indicators: overly positive language, lack of specific details, generic praise, and extreme superlatives."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Understanding the Test Specification
&lt;/h2&gt;

&lt;p&gt;Let's break down what this test configuration accomplishes. We have &lt;strong&gt;four distinct tests&lt;/strong&gt; that correspond to the four key scenarios mentioned above:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test 1: Genuine Positive Review&lt;/strong&gt; - References &lt;code&gt;positive-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 2: Detailed Negative Review&lt;/strong&gt; - References &lt;code&gt;negative-review.txt&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 3: Mixed Sentiment Review&lt;/strong&gt; - References &lt;code&gt;mixed-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test 4: Suspicious/Fake Review&lt;/strong&gt; - References &lt;code&gt;suspicious-review.txt&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each test loads its respective product review using the &lt;code&gt;file://&lt;/code&gt; syntax, which tells promptfoo to read the content from the specified file and inject it into the &lt;code&gt;review_text&lt;/code&gt; variable in our prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Layered Assertions
&lt;/h3&gt;

&lt;p&gt;Notice that we're using multiple types of assertions for comprehensive validation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is-json&lt;/code&gt;&lt;/strong&gt; - Ensures the output is valid JSON format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;contains-json&lt;/code&gt;&lt;/strong&gt; - Checks for specific key-value pairs in the response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;javascript&lt;/code&gt;&lt;/strong&gt; - Uses inline JavaScript for custom validation logic (like checking sentiment and confidence scores)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llm-rubric&lt;/code&gt;&lt;/strong&gt; - Uses an LLM to evaluate whether the output meets human-readable criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inline JavaScript assertions are particularly powerful for complex validation. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This validates both the sentiment classification AND ensures the AI is confident in its assessment, helping us catch edge cases where the model might be uncertain.&lt;/p&gt;
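
&lt;p&gt;One caveat with this pattern: &lt;code&gt;JSON.parse&lt;/code&gt; throws if the model returns anything other than clean JSON, so the assertion errors out rather than failing gracefully. Here's a sketch of a more defensive standalone version of the same check (my own helper for illustration, not something built into promptfoo):&lt;/p&gt;

```javascript
// Hypothetical standalone version of the inline assertion: return false on
// malformed JSON instead of letting JSON.parse throw.
function checkPositiveSentiment(output) {
  let response;
  try {
    response = JSON.parse(output);
  } catch (e) {
    return false;                 // not valid JSON -> fail cleanly
  }
  return response.sentiment === "positive" &&
         typeof response.confidence === "number" &&
         response.confidence > 0.7;
}

console.log(checkPositiveSentiment('{"sentiment": "positive", "confidence": 0.9}')); // true
console.log(checkPositiveSentiment('{"sentiment": "positive"}'));                    // false (no confidence score)
console.log(checkPositiveSentiment("plain text, not JSON"));                         // false
```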

&lt;h2&gt;
  
  
  Installation &amp;amp; Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as a dev dependency in your project&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; promptfoo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run the tests&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; promptfoo-product-reviews/analyze-review-spec.yaml &lt;span class="nt"&gt;--no-cache&lt;/span&gt;
&lt;span class="c"&gt;# View the results in web viewer&lt;/span&gt;
npx promptfoo view &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Understanding the Results
&lt;/h2&gt;

&lt;p&gt;The web viewer has a lot going on, and I could do an entire walkthrough of its features. For now, let's focus on the key insights it provides into the test and evaluation results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmp8rt9sa4sq7d36brt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdmp8rt9sa4sq7d36brt.png" alt="promptfoo web viewer" width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
The results are displayed in a grid, and you can see our prompt in the first row.  The second row shows the results of our first scenario, the positive review.&lt;/p&gt;

&lt;p&gt;Note that the prompt did a pretty good job of analyzing the review against our requirements. The viewer displays the actual response from the test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key_features_mentioned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"battery life"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sound quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comfort"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"touch controls"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main_complaints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"case is bulky"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main_praise"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"excellent battery life"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crisp and clear sound quality"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comfortable during workouts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reliable touch controls"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"suspected_fake"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fake_indicators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"approve"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The reviewer expresses high satisfaction with the wireless earbuds, highlighting their excellent battery life and sound quality while noting a minor complaint about the case size."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding the tests to CI
&lt;/h2&gt;

&lt;p&gt;This is a great start, but we can take this a step further.  Since promptfoo just runs from the command line, we can include it as a regression test in our CI pipeline and ensure that future prompt changes don't break these tests. &lt;/p&gt;

&lt;p&gt;If we make changes to the prompt, or change the LLM provider, we can re-run this test and see if the results change.  If they do, we can investigate why.  &lt;/p&gt;

&lt;p&gt;As requirements change and morph, we can adapt the tests accordingly.&lt;/p&gt;
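
&lt;p&gt;As a rough sketch, a GitHub Actions job for this might look like the following. The workflow name, Node version, and the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; secret name are my assumptions, not something defined by the project:&lt;/p&gt;

```yaml
# .github/workflows/prompt-tests.yml (illustrative sketch)
name: prompt-tests
on: [pull_request]

jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # --no-cache so prompt changes are always re-evaluated against the live model
      - run: npx promptfoo eval -c promptfoo-product-reviews/analyze-review-spec.yaml --no-cache
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

&lt;p&gt;A failing assertion makes &lt;code&gt;promptfoo eval&lt;/code&gt; exit non-zero, which fails the build and gives us the regression signal we want.&lt;/p&gt;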

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;In this post, we've explored how to set up a comprehensive testing framework for AI-powered product review analysis using promptfoo. By defining clear test scenarios and leveraging multi-layered assertions, we can ensure our AI behaves as expected across a range of inputs.&lt;/p&gt;

&lt;p&gt;It might not surprise you to learn that my prompt was not perfect the first time.  Since I set up my automated tests first, it was easy to iterate on the prompt.  Sounds like test-driven development, huh?&lt;/p&gt;

&lt;p&gt;That's it for now.  Stay tuned for another promptfoo post before too long!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Automate the Testing of Your LLM Prompts</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 24 Aug 2025 23:06:14 +0000</pubDate>
      <link>https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038</link>
      <guid>https://dev.to/leading-edje/automate-the-testing-of-your-llm-prompts-5038</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;On a recent client engagement, we needed a mechanism to validate LLM responses for an application that used AI to summarize customer service call transcripts.&lt;/p&gt;

&lt;p&gt;The requirements were clear: each summary had to capture specific details (customer names, account numbers, actions taken, resolution details, etc.), and our validation process needed to be automated and repeatable. We needed to test our custom summarization prompts with the same rigor we apply to traditional software: pass/fail assertions, regression baselines, and systematic tracking.&lt;/p&gt;

&lt;p&gt;That's where &lt;a href="https://www.promptfoo.dev/" rel="noopener noreferrer"&gt;promptfoo&lt;/a&gt; came in.  Promptfoo let us codify these requirements into automated tests and iterate on prompt improvements with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Testing LLM Responses Is Different (And Why You Should Care)
&lt;/h2&gt;

&lt;p&gt;As software engineers and quality professionals, we're used to deterministic systems where the same input always produces the same output. LLM responses break that assumption: the same prompt can yield different valid answers, so traditional assertion patterns are often insufficient.&lt;/p&gt;

&lt;p&gt;Here's the challenge: How can you verify a prompt's response is contextually accurate when the response can vary with every request?&lt;/p&gt;

&lt;p&gt;The solution is to shift from testing exact outputs to testing output quality, accuracy, and safety. You need assertions that can evaluate whether a response contains required information, follows guidelines, and avoids harmful content, regardless of the exact wording.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional testing falls short with LLM prompt responses because:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic responses&lt;/strong&gt;: Same input, different valid outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-dependent behavior&lt;/strong&gt;: Quality depends on conversation history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety concerns&lt;/strong&gt;: Content filtering and moderation requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance variability&lt;/strong&gt;: Response times and costs fluctuate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've been struggling with manual testing of AI features or relying on trial-and-error for prompt engineering, this guide will show you how promptfoo brings systematic testing to AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Promptfoo?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Promptfoo&lt;/strong&gt; is an open-source testing framework specifically designed to enable test-driven development for LLM applications with structured, automated evaluation of prompts, models, and outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assertion-based validation&lt;/strong&gt; with pass/fail criteria familiar to QA engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side-by-side prompt comparison&lt;/strong&gt; for A/B testing different prompts and approaches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated regression testing&lt;/strong&gt; to catch quality degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration&lt;/strong&gt; for your existing pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model support&lt;/strong&gt; (OpenAI, Anthropic, Google, Azure, local models)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Promptfoo brings familiar testing methodologies to AI development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test-driven development&lt;/strong&gt; instead of trial-and-error and/or hoping for the best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression testing&lt;/strong&gt; to catch quality degradation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance monitoring&lt;/strong&gt; (latency, cost, accuracy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started: Hands-On Examples
&lt;/h2&gt;

&lt;p&gt;The best way to understand promptfoo is to see it in action. Let's start with installation and work through practical examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation &amp;amp; Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install as a dev dependency in your project&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; promptfoo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration: YAML-Driven Testing
&lt;/h3&gt;

&lt;p&gt;Promptfoo uses YAML configuration files to define your tests. This approach will feel familiar if you've worked with other testing frameworks or CI/CD tools. The YAML file specifies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: The actual prompts you want to test&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Providers&lt;/strong&gt;: Which AI models to use (OpenAI, Anthropic, Azure, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt;: Input variables and assertions used to validate responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test scenarios&lt;/strong&gt;: Different inputs and expected behaviors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This declarative approach makes it easy to version control your AI tests and collaborate with your team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: Simple Dataset Generation
&lt;/h3&gt;

&lt;p&gt;Let's start with a simple example. We want to test a prompt that generates a list of random numbers.  Of course, an LLM is really not the right tool for this, but it works for example purposes.&lt;/p&gt;

&lt;p&gt;We're going to test this prompt against two different models: Claude and GPT-5-mini.  (FYI, you will need API tokens for any paid model you are referencing.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# examples-for-blog/ten_numbers.yaml&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generating a random list of integers between a range&lt;/span&gt;

&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-only&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;responder.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OUTPUT&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;EXACTLY&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;one&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;valid&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;array&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;NOTHING&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ELSE.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Example:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[10,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;20,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30].&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Generate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;an&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ordered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;list&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ten&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;random&lt;/span&gt;&lt;span class="nv"&gt; 
&lt;/span&gt;&lt;span class="s"&gt;integers&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;between&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{start}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{end}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(inclusive).&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;numeric&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(no&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;quotes),&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sorted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ascending&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;order,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;do&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;commentary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fences."&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic:messages:claude-3-haiku-20240307&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai:chat:gpt-5-mini&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;start&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;end&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;is-json&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"type": "array",&lt;/span&gt;
            &lt;span class="s"&gt;"minItems": 10,&lt;/span&gt;
            &lt;span class="s"&gt;"maxItems": 10,&lt;/span&gt;
            &lt;span class="s"&gt;"items": {&lt;/span&gt;
              &lt;span class="s"&gt;"type": "integer",&lt;/span&gt;
              &lt;span class="s"&gt;"minimum": 10,&lt;/span&gt;
              &lt;span class="s"&gt;"maximum": 1000&lt;/span&gt;
            &lt;span class="s"&gt;}&lt;/span&gt;
          &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How this works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When promptfoo runs this test, it substitutes the variables (&lt;code&gt;start: 10&lt;/code&gt; and &lt;code&gt;end: 1000&lt;/code&gt;) into the prompt and sends it to both Claude and GPT-5-mini. Each model generates a response.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;is-json&lt;/code&gt; assertion is evaluated by promptfoo after it parses the model output as JSON. In other words, promptfoo performs the JSON parsing and schema validation (not the model). If the model returns something that isn't valid JSON or doesn't match the schema, the assertion will fail and promptfoo will report the parsing error and the schema mismatch.&lt;/p&gt;
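
&lt;p&gt;Conceptually, the check promptfoo performs for this schema boils down to something like the following hand-rolled sketch (simplified for illustration; promptfoo's real implementation validates against the full JSON Schema):&lt;/p&gt;

```javascript
// Simplified sketch of what the is-json assertion checks for this schema:
// a JSON array of exactly 10 integers, each between min and max (inclusive).
function validateNumberList(output, min, max) {
  let parsed;
  try {
    parsed = JSON.parse(output);    // promptfoo, not the model, does the parsing
  } catch (e) {
    return false;                   // invalid JSON -> assertion fails
  }
  return Array.isArray(parsed) &&
         parsed.length === 10 &&
         parsed.every(n => Number.isInteger(n) && n >= min && n <= max);
}

console.log(validateNumberList("[10, 20, 30]", 10, 1000));                      // false (only 3 items)
console.log(validateNumberList("[10,20,30,40,50,60,70,80,90,100]", 10, 1000)); // true
```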

&lt;p&gt;This example demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variable substitution&lt;/strong&gt; with &lt;code&gt;{{start}}&lt;/code&gt; and &lt;code&gt;{{end}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple model comparison&lt;/strong&gt; (Claude vs GPT-5-mini)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic validation&lt;/strong&gt; using &lt;code&gt;is-json&lt;/code&gt; so validation happens in promptfoo, not in the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running the test is easy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# run the test&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; examples-for-blog/ten_numbers.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see a side-by-side comparison showing how each model performed and whether they passed the validation criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# open the web report for the last run&lt;/span&gt;
npx promptfoo view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the web view of the test results. Note that you can see variables, prompts, model responses, validation outcomes, and even performance and cost metrics, all in one place.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qoffb63yzn6ww5ip6lq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qoffb63yzn6ww5ip6lq.png" alt="Web view of promptfoo results" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Example 2: Call Summary Validation (Real-World Use Case)
&lt;/h3&gt;

&lt;p&gt;Example 1 was interesting, but now let's look at how we can validate a prompt's output by using an LLM to grade it.&lt;/p&gt;

&lt;p&gt;Here's a more complex example, based on the actual client engagement I described earlier, that tests an AI system for summarizing customer service calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# examples-for-blog/customer-call-summary.yaml&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Call Summary Quality Testing&lt;/span&gt;

&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;Summarize this customer service call. Keep the summary succinct without unnecessary details. Pay special attention to include the agent's demeanor and indicate if they ever seemed unprofessional. Include:&lt;/span&gt;
    &lt;span class="s"&gt;- Customer name and account number&lt;/span&gt;
    &lt;span class="s"&gt;- Issue description&lt;/span&gt;
    &lt;span class="s"&gt;- Actions taken by agent&lt;/span&gt;
    &lt;span class="s"&gt;- Any order number that is mentioned&lt;/span&gt;
    &lt;span class="s"&gt;- Resolution status&lt;/span&gt;

    &lt;span class="s"&gt;Call transcript: {{transcript}}&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai:chat:gpt-5-mini&lt;/span&gt;

&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;transcript&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Good morning, thank you for calling customer service. This is Maria, how can I help you today?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Hi Maria, I'm calling about an order I placed last week that was supposed to be delivered two days ago, but it still hasn't arrived.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: I'm sorry to hear about the delay with your order. I'd be happy to help you track that down. Can I start by getting your first and last name please?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Yes, it's David Rodriguez.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Thank you Mr. Rodriguez. And can I also get your account number to verify your account?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Sure, it's account number 78942.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Perfect, thank you. Now, can you provide me with the order number for the package you're expecting?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Yes, the order number is ORD-2024-5583.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Great, and when did you place this order?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: I placed it last Tuesday, January 16th.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Thank you for that information. Let me pull up your order details here... Okay, I can see order ORD-2024-5583 placed on January 16th, and you're absolutely right - it was originally scheduled for delivery on January 22nd. I sincerely apologize for this delay, Mr. Rodriguez.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: So what happened? Why didn't it arrive when it was supposed to?&lt;/span&gt;
        &lt;span class="s"&gt;Agent: It looks like there was a sorting delay at our distribution center that affected several shipments in your area. Your package is currently in transit and I can see it's now scheduled to be delivered this Friday, January 26th, by end of day.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Friday? That's three days later than promised. This is really inconvenient.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: I completely understand your frustration, and I apologize again for the inconvenience this has caused. To make up for the delay, I'm going to issue a $15 credit to your account, and I'll also send you tracking information via email so you can monitor the package's progress.&lt;/span&gt;
        &lt;span class="s"&gt;Customer: Okay, well I appreciate that. Will I get a notification when it's actually delivered?&lt;/span&gt;
        &lt;span class="s"&gt;Agent: Absolutely. You'll receive both an email and text notification once the package is delivered, and the tracking information will show real-time updates. Is there anything else I can help you with today?&lt;/span&gt;
        &lt;span class="s"&gt;Customer: No, that covers it. Thank you for your help, Maria.&lt;/span&gt;
        &lt;span class="s"&gt;Agent: You're very welcome to never ever call me again, Mr. Rodriguez. Again, I apologize for the delay, and thank you for your patience. Have a great day!&lt;/span&gt;
  &lt;span class="na"&gt;assert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;David&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Rodriguez"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;  
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORD-2024-5583"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-rubric&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;indicate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;whether&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;seemed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;professional&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;include&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;details,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;including&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;taken&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span 
class="s"&gt;resolution,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;any&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;compensation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;offered."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt embeds a long customer-service phone transcript that the model is asked to summarize succinctly while preserving key facts. To verify correctness we include a couple of deterministic assertions (substring checks) for the customer's name and the order number, so those values must appear in the summary.&lt;/p&gt;

&lt;p&gt;We also include an &lt;code&gt;llm-rubric&lt;/code&gt; assertion: promptfoo will call an LLM to grade the generated summary against the supplied rubric text, allowing us to assert on higher-level quality attributes such as professionalism, completeness, and whether the agent's actions and compensation were described.&lt;/p&gt;

&lt;p&gt;Now I can run that test and see how we do!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# run the test&lt;/span&gt;
npx promptfoo &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; examples-for-blog/customer-call-summary.yaml
&lt;span class="c"&gt;# View results&lt;/span&gt;
npx promptfoo view
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here are our results:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdbbvq2en4mymu8saops.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdbbvq2en4mymu8saops.png" alt="Web view of call summary results" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the prompt specifically asks for the agent's demeanor, and we use the rubric to verify the output includes it. Since I never trust a test unless I can see it fail, I'm going to temporarily remove the mention of demeanor from the prompt, but leave the assert alone, so we should get a failure. Drumroll, please…&lt;/p&gt;

&lt;p&gt;And we do! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge2be458rkpig2yx2bri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge2be458rkpig2yx2bri.png" alt="Web view of call summary failure results" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the test caught the error with our prompt:&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8vj6zvzubkg7mnr4pjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8vj6zvzubkg7mnr4pjp.png" alt="failure message" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I got a little long-winded with this post, but I hope someone out there finds it useful. Promptfoo shifts AI testing from manual spot checks to systematic, automated evaluation. By bringing familiar testing methodologies to AI development, it enables teams to build reliable, secure, and high-quality AI applications.&lt;/p&gt;

&lt;p&gt;I'll be back soon with more promptfoo content. In the meantime, check out the documentation at &lt;a href="https://www.promptfoo.dev/" rel="noopener noreferrer"&gt;promptfoo.dev&lt;/a&gt; for excellent getting-started resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>testing</category>
      <category>testdev</category>
    </item>
    <item>
      <title>Automating Browser-Based Performance Testing</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 17 Aug 2025 15:01:57 +0000</pubDate>
      <link>https://dev.to/leading-edje/automating-browser-based-performance-testing-1n6</link>
      <guid>https://dev.to/leading-edje/automating-browser-based-performance-testing-1n6</guid>
      <description>&lt;p&gt;Website performance directly affects what users feel and what your business earns.&lt;/p&gt;

&lt;p&gt;One way of identifying performance issues is via API-based load testing tools such as &lt;a href="https://k6.io" rel="noopener noreferrer"&gt;k6&lt;/a&gt;. API load tests tell you whether your services scale and how quickly they respond under load, but they don’t measure the full user experience.&lt;/p&gt;

&lt;p&gt;If you focus &lt;em&gt;only&lt;/em&gt; on load testing your backend, you might still ship a &lt;strong&gt;slow&lt;/strong&gt; or &lt;strong&gt;jittery&lt;/strong&gt; site because of render‑blocking CSS/JavaScript, heavy images/fonts, main‑thread work, layout shifts, and other front-end issues.&lt;br&gt;&lt;br&gt;
Ultimately users don't care where the performance issue resides; they just know your site is "slow".&lt;/p&gt;

&lt;p&gt;This slow performance can cost you customers, revenue, search visibility, and trust.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Lighthouse?
&lt;/h2&gt;

&lt;p&gt;Lighthouse is an automated auditor built by Google and is part of the Chrome DevTools experience. While this post focuses on performance, Lighthouse also audits and provides actionable recommendations for accessibility, best practices, and SEO.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Lighthouse works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Launches Chrome and navigates to your page using the Chrome DevTools Protocol.&lt;/li&gt;
&lt;li&gt;Emulates device, network, and CPU to keep runs comparable.&lt;/li&gt;
&lt;li&gt;Records a performance trace and analyzes it against a set of audits.&lt;/li&gt;
&lt;li&gt;Outputs scores and detailed metrics with fix ideas.&lt;/li&gt;
&lt;li&gt;Can be included in your CI pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Core Web Vitals: what they mean and why they matter
&lt;/h2&gt;

&lt;p&gt;These user‑focused metrics map to how fast content shows up, how responsive the page feels, and how stable it looks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Core Web Vitals at a glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Plain meaning&lt;/th&gt;
&lt;th&gt;Good target&lt;/th&gt;
&lt;th&gt;What you’ll see in Lighthouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LCP (Largest Contentful Paint)&lt;/td&gt;
&lt;td&gt;Time to show the largest thing in the initial viewport (often the primary image or a big text block).&lt;/td&gt;
&lt;td&gt;≤ 2.5 s&lt;/td&gt;
&lt;td&gt;LCP value in the Metrics section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FID (First Input Delay)&lt;/td&gt;
&lt;td&gt;Delay from a user’s first tap/click to when the page can start handling it. (Google has since replaced FID with INP as a Core Web Vital.) In Lighthouse lab runs, use Total Blocking Time (TBT) as the responsiveness indicator.&lt;/td&gt;
&lt;td&gt;FID ≤ 100 ms; aim for low TBT&lt;/td&gt;
&lt;td&gt;TBT value in the Metrics section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLS (Cumulative Layout Shift)&lt;/td&gt;
&lt;td&gt;How much content unexpectedly moves while the page loads (visual stability).&lt;/td&gt;
&lt;td&gt;≤ 0.1&lt;/td&gt;
&lt;td&gt;CLS score in Metrics/Diagnostics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
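&lt;p&gt;The "good" targets in the table lend themselves to a quick scripted check. Here's a minimal sketch; the sample readings are made up for illustration, and this is not part of Lighthouse itself:&lt;/p&gt;

```javascript
// Sketch: rate Core Web Vitals readings against the "good" targets above.
const thresholds = { lcpMs: 2500, fidMs: 100, cls: 0.1 };

function rate(value, limit) {
  return value <= limit ? 'good' : 'needs work';
}

// Made-up sample readings for illustration
const sample = { lcpMs: 2100, fidMs: 80, cls: 0.15 };
console.log('LCP:', rate(sample.lcpMs, thresholds.lcpMs)); // good
console.log('FID:', rate(sample.fidMs, thresholds.fidMs)); // good
console.log('CLS:', rate(sample.cls, thresholds.cls));     // needs work
```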
&lt;h2&gt;
  
  
  Sample Lighthouse Report
&lt;/h2&gt;

&lt;p&gt;Regardless of how you run Lighthouse, you get a detailed report with scores, metrics, and prioritized suggestions.&lt;/p&gt;

&lt;p&gt;Overall scores:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo60wh9t3a441la40txv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo60wh9t3a441la40txv.png" alt="Scores" width="800" height="649"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What went wrong?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyde8xgzyh19oso4pxa0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyde8xgzyh19oso4pxa0w.png" alt="Diagnostics" width="676" height="694"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What looks good?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pazpjoce7wzrmk8sobd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5pazpjoce7wzrmk8sobd.png" alt="Passed audits" width="628" height="730"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Running Lighthouse
&lt;/h2&gt;

&lt;p&gt;Lighthouse can be run in a number of ways, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chrome DevTools (UI)&lt;/li&gt;
&lt;li&gt;Command line (CLI)&lt;/li&gt;
&lt;li&gt;Node module (programmatic)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Run Lighthouse from Chrome DevTools
&lt;/h3&gt;

&lt;p&gt;Open your site in Chrome → Right‑click Inspect → Lighthouse tab → Set your analysis options → Analyze. This generates a full HTML report inside DevTools.&lt;/p&gt;
&lt;h3&gt;
  
  
  Run Lighthouse from the command line
&lt;/h3&gt;

&lt;p&gt;Install Lighthouse (requires Node.js):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; lighthouse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic mobile audit and open the HTML report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/lighthouse.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export JSON for automation or tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/lighthouse.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chrome-flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"--headless"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
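&lt;p&gt;Once you have the JSON report, a short Node script can pull out the headline numbers. The field names below (&lt;code&gt;categories.performance.score&lt;/code&gt;, &lt;code&gt;audits[...].numericValue&lt;/code&gt;) follow the Lighthouse report format, but the inline object is a tiny made-up excerpt; in practice you'd parse the file written above:&lt;/p&gt;

```javascript
// Sketch: extract headline metrics from a Lighthouse JSON report.
// The inline object is a tiny hypothetical excerpt of the real report shape;
// in practice: const report = JSON.parse(fs.readFileSync('./reports/lighthouse.json', 'utf8'));
const report = {
  categories: { performance: { score: 0.92 } },
  audits: {
    'largest-contentful-paint': { numericValue: 2100 },
    'total-blocking-time': { numericValue: 180 },
    'cumulative-layout-shift': { numericValue: 0.05 },
  },
};

// Category scores are 0 to 1; convert to the familiar 0-100 scale
const perfScore = Math.round(report.categories.performance.score * 100);
const lcpSeconds = report.audits['largest-contentful-paint'].numericValue / 1000;
console.log(`Performance: ${perfScore}/100, LCP: ${lcpSeconds.toFixed(1)}s`);
```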



&lt;p&gt;Desktop profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="nt"&gt;--preset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;desktop &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/desktop.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use throttling to simulate slower networks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling-method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;simulate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.rttMs&lt;span class="o"&gt;=&lt;/span&gt;150 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.throughputKbps&lt;span class="o"&gt;=&lt;/span&gt;1638.4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throttling&lt;/span&gt;.cpuSlowdownMultiplier&lt;span class="o"&gt;=&lt;/span&gt;4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/consistent.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Focus audits on key performance metrics with a config (&lt;code&gt;lighthouse-config.js&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exports&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;extends&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lighthouse:default&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;onlyAudits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;first-contentful-paint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;largest-contentful-paint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cumulative-layout-shift&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;total-blocking-time&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;throttlingMethod&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;simulate&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;throttling&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;rttMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;throughputKbps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1638.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cpuSlowdownMultiplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run with the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lighthouse https://www.demoblaze.com &lt;span class="nt"&gt;--config-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./lighthouse-config.js &lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;html &lt;span class="nt"&gt;--output-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./reports/focused.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Programmatic usage (Node)
&lt;/h3&gt;

&lt;p&gt;Why use this? Programmatic runs let you script real user interactions and measure performance along a flow (navigations, clicks, route changes). With Puppeteer + Lighthouse User Flows you can drive the browser, capture metrics per step, and generate a single report—perfect for CI, regression checks, and measuring critical journeys like signup or checkout.&lt;/p&gt;

&lt;p&gt;Note: Lighthouse currently only supports Puppeteer for programmatic user flows.&lt;/p&gt;

&lt;p&gt;Install packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i lighthouse puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save as &lt;code&gt;user-flow.mjs&lt;/code&gt; and run with &lt;code&gt;node user-flow.mjs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;startFlow&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;lighthouse&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;startFlow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Navigate to Demoblaze&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;navigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.demoblaze.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Interaction-initiated navigation via a callback function&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;navigate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[href="index.html"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Start/End a navigation around a user action&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startNavigation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a#cartur&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// open Cart&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endNavigation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;mkdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./reports&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./reports/lh-flow-report.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generateReport&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saved ./reports/lh-flow-report.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrap‑up
&lt;/h2&gt;

&lt;p&gt;Start by running Lighthouse in DevTools (fast feedback) or the CLI (repeatable results). Focus on three things: LCP (how fast the main content shows), TBT (how responsive it feels), and CLS (how stable it looks).&lt;/p&gt;

&lt;p&gt;What’s next in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containerize Lighthouse runs with Docker for consistent local and CI environments&lt;/li&gt;
&lt;li&gt;Add Lighthouse checks to a GitHub Actions workflow with performance budgets and PR comments&lt;/li&gt;
&lt;li&gt;Export key metrics to Prometheus for time‑series storage&lt;/li&gt;
&lt;li&gt;Visualize trends and budgets in a Grafana dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>performance</category>
      <category>testing</category>
      <category>testdev</category>
    </item>
    <item>
      <title>Open Source Load Testing with k6, Docker, Prometheus, and Grafana</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 18 May 2025 16:13:44 +0000</pubDate>
      <link>https://dev.to/leading-edje/open-source-load-testing-with-k6-docker-prometheus-and-grafana-5ej6</link>
      <guid>https://dev.to/leading-edje/open-source-load-testing-with-k6-docker-prometheus-and-grafana-5ej6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Load testing is crucial for ensuring your applications can handle expected load volumes. In this guide, we'll set up a complete load testing environment using k6 for testing, Prometheus for metrics collection, and Grafana for visualization—all orchestrated with Docker.&lt;/p&gt;

&lt;p&gt;Although there are paid versions of these products, this guide will focus exclusively on a basic setup with their open source Docker images. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker and Docker Compose installed&lt;/li&gt;
&lt;li&gt;Basic understanding of load testing concepts&lt;/li&gt;
&lt;li&gt;Familiarity with Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Our setup consists of four main components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k6&lt;/strong&gt;: An open-source load testing tool that enables you to write test scripts in JavaScript to simulate real user traffic, measure application performance, and export detailed metrics for analysis.&lt;br&gt;
&lt;strong&gt;Application&lt;/strong&gt;: A simple Node.js API that serves as the system under test&lt;br&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt;: An open-source monitoring and alerting toolkit that collects, stores, and queries time-series metrics from k6 and other sources, making them available for analysis and visualization.&lt;br&gt;
&lt;strong&gt;Grafana&lt;/strong&gt;: An open-source analytics and visualization platform that lets you create interactive dashboards and graphs from a wide variety of data sources—including Prometheus, InfluxDB, Elasticsearch, MySQL, PostgreSQL, and many others.&lt;/p&gt;

&lt;p&gt;These components run as four Docker containers. Here's how they interact:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw4ewsyvufil8p0umt43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw4ewsyvufil8p0umt43.png" alt="k6 architecture diagram" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Flow:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Load generation&lt;/strong&gt;: our k6 script sends HTTP requests to the Sample API to simulate user traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Export&lt;/strong&gt;: as the test runs, performance metrics from k6 are exported to Prometheus via remote write&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Query&lt;/strong&gt;: Grafana uses PromQL to query Prometheus for metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All components run within the same Docker network, enabling seamless communication between services.&lt;/p&gt;
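That shared network comes for free with Docker Compose: services declared in the same compose file can reach each other by service name (which is why the k6 script later targets http://sample-api:3000). A minimal sketch with assumed ports and with volume mounts omitted; this is not the project's final docker-compose.yml:

```yaml
# Illustrative sketch only: service names match the architecture above,
# but the images, ports, and k6 script path are assumptions.
services:
  sample-api:
    build: ./sample-api
    ports:
      - "3000:3000"
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
  k6:
    image: grafana/k6
    command: run /scripts/script.js
```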
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k6-prometheus-grafana/
├── docker-compose.yml
├── prometheus/
│   └── prometheus.yml
├── grafana/
│   └── dashboards/
│       └── k6-dashboard.json
├── k6/
│   └── script.js
└── sample-api/
    └── Dockerfile
    └── server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 1: Create the Sample API
&lt;/h2&gt;

&lt;p&gt;First, let's create a simple Node.js API to test against:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sample-api/server.js&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/health&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;healthy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/users/:id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Simulate some processing delay&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`User &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Simulate user creation&lt;/span&gt;
  &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;User created successfully&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Server running on port 3000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
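The setTimeout calls above give each endpoint a uniformly random pause (up to 100 ms for GET, up to 200 ms for POST) so the load test has realistic, variable response times. In isolation, the pattern looks like this (a standalone sketch, not part of server.js):

```javascript
// Sketch of the random-delay pattern used by the handlers above.
function randomDelayMs(maxMs) {
  return Math.random() * maxMs; // uniform in [0, maxMs)
}

// Resolve with a payload after a random pause, mimicking processing time.
function replyAfterDelay(buildPayload, maxMs) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(buildPayload()), randomDelayMs(maxMs));
  });
}

// Example: resolves with the payload after at most ~100 ms.
replyAfterDelay(() => ({ id: 42, name: 'User 42' }), 100)
  .then((user) => console.log(user.name));
```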



&lt;p&gt;Next, add a Dockerfile that will build and start the app:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sample-api/Dockerfile&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:16-alpine&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm init &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install &lt;/span&gt;express
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 3000&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node", "server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Create k6 Test Script
&lt;/h2&gt;

&lt;p&gt;This JavaScript test script defines how k6 will interact with our sample API during the load test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;k6/script.js&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;k6/http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;check&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sleep&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;k6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Trend&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;k6/metrics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Custom metrics - these allow us to track specific aspects of our test&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errorRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;errors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;           &lt;span class="c1"&gt;// Tracks percentage of errors&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;myCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my_counter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// Simple incrementing counter&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Trend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response_time&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Tracks response time distribution&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;30s&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Ramp up to 5 virtual users over 30 seconds&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;90s&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Ramp to from 5 to 20 virtual users over 90 seconds&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;3m&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Stay at 20 virtual users for 3 minutes&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;30s&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;// Gradually ramp down to 0 over 30 seconds&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;http_req_duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;p(95)&amp;lt;500&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;// 95% of requests must complete in less than 500ms for the test to pass&lt;/span&gt;
    &lt;span class="na"&gt;http_req_failed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate&amp;lt;0.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;// Test fails if more than 10% of requests fail&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://sample-api:3000&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Test GET endpoint - fetches a random user&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;getResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/users/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GET status is 200&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GET response time &amp;lt; 500ms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Track custom metrics for this request&lt;/span&gt;
  &lt;span class="nx"&gt;errorRate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;responseTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;getResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;myCounter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Pause for 1 second between requests&lt;/span&gt;

  &lt;span class="c1"&gt;// Test POST endpoint - creates a new user&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;postResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/api/users`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`TestUser_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`test_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;@example.com`&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST status is 201&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST response time &amp;lt; 1000ms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;errorRate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;postResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;myCounter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
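To make the http_req_duration threshold concrete: k6 computes the 95th percentile of all request durations across the run and fails the test when that value reaches 500 ms. Here is what that percentile check looks like in plain JavaScript (a simplified nearest-rank calculation, not k6's exact implementation):

```javascript
// Nearest-rank percentile: sort the samples, then pick the value at
// the rank covering p percent of them.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Ten sample durations in ms: the slowest requests decide pass/fail.
const durationsMs = [120, 180, 210, 250, 300, 320, 410, 450, 470, 490];
const p95 = percentile(durationsMs, 95);
const verdict = p95 >= 500 ? 'fails' : 'passes';
console.log(`p(95) = ${p95} ms, so the threshold ${verdict}`);
```

Note that a single slow outlier (say, one 900 ms response in this sample) is enough to push p(95) past the limit, which is exactly why percentile thresholds catch tail latency that averages hide.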



&lt;h2&gt;
  
  
  Step 3: Configure Prometheus
&lt;/h2&gt;

&lt;p&gt;Prometheus is an open-source monitoring and alerting toolkit that collects and stores time-series metrics. The configuration below sets up Prometheus to scrape metrics from both itself and the k6 load testing tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;prometheus/prometheus.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;global&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;      &lt;span class="c1"&gt;# How frequently to scrape targets by default&lt;/span&gt;
  &lt;span class="na"&gt;evaluation_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15s&lt;/span&gt;  &lt;span class="c1"&gt;# How frequently to evaluate rules&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prometheus'&lt;/span&gt;  &lt;span class="c1"&gt;# Self-monitoring configuration&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9090'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Prometheus's own metrics endpoint&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k6'&lt;/span&gt;          &lt;span class="c1"&gt;# Configuration to scrape k6 metrics&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;k6:6565'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# k6's metrics endpoint (using Docker service name)&lt;/span&gt;
    &lt;span class="na"&gt;scrape_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;     &lt;span class="c1"&gt;# More frequent scraping for k6 during tests&lt;/span&gt;
    &lt;span class="na"&gt;metrics_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;  &lt;span class="c1"&gt;# Path where metrics are exposed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Prometheus is collecting metrics, we'll be able to query this data directly or visualize it through Grafana in the next steps.&lt;/p&gt;
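&lt;p&gt;Once the stack is up, a quick way to confirm metrics are flowing is to open the Prometheus UI at &lt;a href="http://localhost:9090" rel="noopener noreferrer"&gt;http://localhost:9090&lt;/a&gt; and run a query. A couple of illustrative examples follow; the exact &lt;code&gt;k6_*&lt;/code&gt; metric names are assumptions and can vary by k6 version, so browse the metric explorer to see what your setup actually emits:&lt;/p&gt;

```
# Assumed metric names -- check Prometheus's metric explorer for your k6_* series
k6_http_reqs_total            # total HTTP requests issued by k6
rate(k6_http_reqs_total[1m])  # request rate over the last minute
```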

&lt;h2&gt;
  
  
  Step 4: Grafana Dashboard Configuration
&lt;/h2&gt;

&lt;p&gt;Create a dashboard provisioning file for automatic setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;grafana/dashboards/dashboard.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default'&lt;/span&gt;
    &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
    &lt;span class="na"&gt;disableDeletion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;editable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/grafana/provisioning/dashboards&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Docker Compose Configuration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Sample API service to be load tested by k6&lt;/span&gt;
  &lt;span class="na"&gt;sample-api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./sample-api&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt; &lt;span class="c1"&gt;# Exposes API on localhost:3000&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k6-net&lt;/span&gt;

  &lt;span class="c1"&gt;# Prometheus for metrics collection&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090:9090"&lt;/span&gt; &lt;span class="c1"&gt;# Prometheus UI available at localhost:9090&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml&lt;/span&gt; &lt;span class="c1"&gt;# Custom config&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--config.file=/etc/prometheus/prometheus.yml'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--storage.tsdb.path=/prometheus'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.console.libraries=/etc/prometheus/console_libraries'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.console.templates=/etc/prometheus/consoles'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.enable-lifecycle'&lt;/span&gt; &lt;span class="c1"&gt;# Allows config reloads without restart&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--web.enable-remote-write-receiver'&lt;/span&gt; &lt;span class="c1"&gt;# Enables remote write endpoint for k6&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k6-net&lt;/span&gt;

  &lt;span class="c1"&gt;# Grafana for dashboarding and visualization&lt;/span&gt;
  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3001:3000"&lt;/span&gt; &lt;span class="c1"&gt;# Grafana UI available at localhost:3001&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GF_SECURITY_ADMIN_PASSWORD=admin&lt;/span&gt; &lt;span class="c1"&gt;# Default admin password&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;grafana-storage:/var/lib/grafana&lt;/span&gt; &lt;span class="c1"&gt;# Persistent storage for Grafana data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/dashboards:/etc/grafana/provisioning/dashboards&lt;/span&gt; &lt;span class="c1"&gt;# Pre-provisioned dashboards&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k6-net&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt; &lt;span class="c1"&gt;# Waits for Prometheus to be ready&lt;/span&gt;

  &lt;span class="c1"&gt;# k6 load testing tool with Prometheus remote write output&lt;/span&gt;
  &lt;span class="na"&gt;k6&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/k6:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k6&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6565:6565"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;K6_PROMETHEUS_RW_SERVER_URL=http://prometheus:9090/api/v1/write&lt;/span&gt; &lt;span class="c1"&gt;# Prometheus remote write endpoint&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;K6_PROMETHEUS_RW_TREND_STATS=p(95),p(99),min,max&lt;/span&gt; &lt;span class="c1"&gt;# Custom trend stats&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./k6:/scripts&lt;/span&gt; &lt;span class="c1"&gt;# Mounts local k6 scripts&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;run --out experimental-prometheus-rw /scripts/script.js&lt;/span&gt; &lt;span class="c1"&gt;# Runs the main k6 script&lt;/span&gt;
    &lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;k6-net&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sample-api&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;grafana-storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Named volume for Grafana data&lt;/span&gt;

&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;k6-net&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridge&lt;/span&gt; &lt;span class="c1"&gt;# Isolated network for all services&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Start the Stack
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start all services:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Setting Up a Pre-built K6 Dashboard
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Access Grafana&lt;/strong&gt;: Navigate to &lt;a href="http://localhost:3001" rel="noopener noreferrer"&gt;http://localhost:3001&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Login&lt;/strong&gt;: Use admin/admin (you'll be prompted to change the password)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Prometheus Data Source First&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Go to Configuration → Data Sources&lt;/li&gt;
&lt;li&gt;Click "Add data source"&lt;/li&gt;
&lt;li&gt;Select "Prometheus"&lt;/li&gt;
&lt;li&gt;Set URL to: &lt;code&gt;http://prometheus:9090&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click "Save &amp;amp; Test"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import K6 Dashboard&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Click the "+" icon in the left sidebar&lt;/li&gt;
&lt;li&gt;Select "Import"&lt;/li&gt;
&lt;li&gt;Use one of these dashboard IDs for Prometheus:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;19665&lt;/strong&gt; - K6 Prometheus (recommended)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10660&lt;/strong&gt; - K6 Load Testing Results (Prometheus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;19634&lt;/strong&gt; - K6 Performance Test Dashboard&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click "Load"&lt;/li&gt;
&lt;li&gt;Select your Prometheus data source&lt;/li&gt;
&lt;li&gt;Click "Import"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 8: Run the Load Test
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run the k6 test:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose run &lt;span class="nt"&gt;--rm&lt;/span&gt; k6 run &lt;span class="nt"&gt;--out&lt;/span&gt; experimental-prometheus-rw /scripts/script.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the test runs, k6 will send API requests to the sample API, and metrics will be collected and sent to Prometheus.  You can monitor the test progress in the terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Monitor Your Test Run in Grafana
&lt;/h2&gt;

&lt;p&gt;The Grafana UI is available at &lt;a href="http://localhost:3001" rel="noopener noreferrer"&gt;http://localhost:3001&lt;/a&gt;.  Select your dashboard from the left nav and you can monitor your test in real time with the Grafana dashboard, which should look something like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyzh6fbej5ta7zyjcr43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyzh6fbej5ta7zyjcr43.png" alt="Grafana dashboard screenshot" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Cleanup
&lt;/h2&gt;

&lt;p&gt;Stop and remove all containers and volumes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose down &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The point of this post was to raise awareness of the open source options available to you as you consider k6 for load testing.  I skimmed over a lot of detail and explanation about k6, Prometheus, and Grafana, and will likely fill that in with future posts.  Until then, this setup provides a complete observability stack for k6 load testing.&lt;/p&gt;

&lt;p&gt;The Docker-based approach ensures consistency across environments and makes it easy to integrate into CI/CD pipelines. And FYI, you can find all the code from this blog post &lt;a href="https://github.com/dwwhalen/k6-prometheus-grafana" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
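&lt;p&gt;As a rough sketch of that CI/CD integration, a GitHub Actions job could spin up the stack and run the test on demand. This is an illustrative example only, assuming the repo layout from this post (the workflow file name and trigger are placeholders); tune it before relying on it:&lt;/p&gt;

```yaml
# .github/workflows/load-test.yml (illustrative sketch, not battle-tested)
name: load-test
on: [workflow_dispatch]

jobs:
  k6:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Start the API and monitoring stack
        run: docker compose up -d sample-api prometheus
      - name: Run the k6 test
        run: docker compose run --rm k6 run --out experimental-prometheus-rw /scripts/script.js
      - name: Tear down
        if: always()
        run: docker compose down -v
```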

&lt;p&gt;Thanks for reading and let me know if you have any questions or suggestions for future posts!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webperf</category>
      <category>performance</category>
      <category>k6</category>
      <category>grafana</category>
    </item>
    <item>
      <title>Automating Accessibility Testing With Playwright</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sat, 07 Dec 2024 23:23:24 +0000</pubDate>
      <link>https://dev.to/leading-edje/automating-accessibility-testing-with-playwright-3el7</link>
      <guid>https://dev.to/leading-edje/automating-accessibility-testing-with-playwright-3el7</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/leading-edje/accessibility-testing-with-chrome-extension-53kn"&gt;previous post&lt;/a&gt;, I showed you how to use the axe DevTools chrome extension to test for accessibility issues on a webpage.  Today, I'm going to show you how to do the same thing with Playwright, enabling you to automate accessibility testing in your CI/CD pipeline.   Everything I'm going to demo can also be found in my &lt;a href="https://github.com/dwwhalen/cypress-accessibility" rel="noopener noreferrer"&gt;sample repo&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The axe-core package
&lt;/h1&gt;

&lt;p&gt;The &lt;a href="https://www.npmjs.com/package/axe-core" rel="noopener noreferrer"&gt;axe-core&lt;/a&gt; package is a JavaScript library that can be used to run accessibility tests on a webpage.  It's the same library that powers the axe DevTools extension that we looked at in my &lt;a href="https://dev.to/leading-edje/accessibility-testing-with-chrome-extension-53kn"&gt;previous post&lt;/a&gt;, but it can be run programmatically in a variety of environments.  &lt;/p&gt;

&lt;p&gt;We're going to use it with Playwright, but axe-core packages exist for a number of automated testing frameworks, including Cypress, Selenium, WebdriverIO, and more.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our first Playwright accessibility test
&lt;/h1&gt;

&lt;p&gt;Including accessibility tests in your Playwright suite is as simple as adding a few lines of code.  We've got a &lt;a href="https://www.washington.edu/accesscomputing/AU/before.html" rel="noopener noreferrer"&gt;sample website&lt;/a&gt; that we're going to test, and we're going to use Playwright to navigate to the page and run an accessibility check.&lt;/p&gt;

&lt;p&gt;Here's an example test that navigates to the webpage and runs an accessibility check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AxeBuilder&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@axe-core/playwright&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Accessibility University testing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Full page scan should not find accessibility issues&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.washington.edu/accesscomputing/AU/before.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForLoadState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;networkidle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accessibilityScanResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AxeBuilder&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;accessibilityScanResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toEqual&lt;/span&gt;&lt;span class="p"&gt;([]);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;waitForLoadState('networkidle')&lt;/code&gt; waits for the page to finish loading before running the accessibility check.  This is important because we want to make sure the page is fully rendered before we check for accessibility issues.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;new AxeBuilder({ page }).analyze()&lt;/code&gt; runs the accessibility check on the page and returns the results.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;expect(accessibilityScanResults.violations).toEqual([]);&lt;/code&gt; indicates that we are expecting 0 accessibility violations.  If violations are found, this test will fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since our &lt;a href="https://www.washington.edu/accesscomputing/AU/before.html" rel="noopener noreferrer"&gt;sample website&lt;/a&gt; is specifically designed to have lots of accessibility issues, we're expecting this test to fail, and it does!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3muet1wdn1ib60a2yru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3muet1wdn1ib60a2yru.png" alt="Failed test results in CLI" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Reporting
&lt;/h1&gt;

&lt;p&gt;OK, so we know our test is failing, but which accessibility issues are being found?  The &lt;code&gt;accessibilityScanResults&lt;/code&gt; object contains detailed information about every issue found, and it's easy to generate an HTML report with a few tweaks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AxeBuilder&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@axe-core/playwright&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHtmlReport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axe-html-reporter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Accessibility University testing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Full page scan should of BEFORE page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.washington.edu/accesscomputing/AU/before.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForLoadState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;networkidle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;accessibilityScanResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AxeBuilder&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nf"&gt;createHtmlReport&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;accessibilityScanResults&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;e2e&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test-results&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;accessibility-results&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="na"&gt;reportFileName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`my-report.html`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;accessibilityScanResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toEqual&lt;/span&gt;&lt;span class="p"&gt;([]);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;createHtmlReport&lt;/code&gt; is imported from the axe-html-reporter package, and it creates an HTML report of the accessibility issues found.  The report is saved to the &lt;code&gt;e2e/test-results/accessibility-results&lt;/code&gt; directory with the name &lt;code&gt;my-report.html&lt;/code&gt;.  Here's a snippet of what the report for our test looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquu316mkwjrxp6pj1jps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquu316mkwjrxp6pj1jps.png" alt="sample accessibility report" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is just the first page of the report, and there are many more details on the following pages.  You'll notice that there are 50 total violations, which matches what we saw when we used the axe DevTools extension in my &lt;a href="https://dev.to/leading-edje/accessibility-testing-with-chrome-extension-53kn"&gt;previous post&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;The report also provides a detailed breakdown of the violations, including the rule that was violated, the impact of the violation, and a description of the issue.&lt;/p&gt;
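&lt;p&gt;If you want a quick custom summary alongside the HTML report, the &lt;code&gt;violations&lt;/code&gt; array is plain data you can crunch yourself. Here's a small sketch using hypothetical sample data shaped like axe-core's results (the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;impact&lt;/code&gt;, and &lt;code&gt;nodes&lt;/code&gt; fields mirror axe-core's result format):&lt;/p&gt;

```javascript
// Hypothetical sample data shaped like axe-core's violations array;
// in a real test it comes from: await new AxeBuilder({ page }).analyze()
const violations = [
  { id: 'image-alt', impact: 'critical', nodes: [{}, {}] },
  { id: 'color-contrast', impact: 'serious', nodes: [{}] },
];

// Count affected elements per impact level for a quick console summary
const summary = violations.reduce((acc, v) => {
  acc[v.impact] = (acc[v.impact] || 0) + v.nodes.length;
  return acc;
}, {});

console.log(summary); // { critical: 2, serious: 1 }
```

&lt;p&gt;You could run something like this right after &lt;code&gt;analyze()&lt;/code&gt; to log a one-line summary next to the generated report.&lt;/p&gt;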

&lt;p&gt;Just as we did with the Chrome extension, we can use this report to identify and fix the accessibility issues on our website.&lt;/p&gt;

&lt;h1&gt;
  
  
  Adding your tests to the CI/CD pipeline
&lt;/h1&gt;

&lt;p&gt;If you have some general familiarity with Playwright, you probably already know how to run your tests in your CI/CD pipeline.  If you don't, you can check out the &lt;a href="https://playwright.dev/docs/intro" rel="noopener noreferrer"&gt;Playwright documentation&lt;/a&gt; for more information.  Since these accessibility tests are just Playwright tests, you can run them in the same way you run your other Playwright tests.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this post I showed you a basic example of how to include accessibility tests in your Playwright suite.  Don't forget that  axe-core is not limited to Playwright, and can be used with a variety of automated testing frameworks, such as Cypress and Selenium.&lt;/p&gt;

&lt;p&gt;Some additional things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Although I was only testing in Chrome, axe-core supports all major browsers.  You should consider testing in multiple browsers and viewports to ensure that your website is accessible to all users on all devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The scan can be configured to include or exclude certain rules, for example &lt;code&gt;color-contrast&lt;/code&gt; rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can limit the scan to a subset of the page by using the include and exclude options.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can limit the scan to a subset of the WCAG guidelines by using the rules option.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
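&lt;p&gt;The options above are configured by chaining calls on &lt;code&gt;AxeBuilder&lt;/code&gt; before &lt;code&gt;analyze()&lt;/code&gt;. Here's a rough sketch using builder methods from @axe-core/playwright (the CSS selectors are hypothetical placeholders; check the package docs for the full API):&lt;/p&gt;

```javascript
import AxeBuilder from '@axe-core/playwright';

// Inside a Playwright test, after page.goto(...):
const results = await new AxeBuilder({ page })
  .withTags(['wcag2a', 'wcag2aa'])   // limit the scan to specific WCAG levels
  .include('#main-content')          // hypothetical selector: scan only this region
  .exclude('#third-party-widget')    // hypothetical selector: skip this region
  .disableRules(['color-contrast'])  // skip specific rules, e.g. color-contrast
  .analyze();
```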

&lt;p&gt;Finally, it's worth mentioning that a clean accessibility scan is a great &lt;em&gt;first&lt;/em&gt; step in making your website accessible, but it's not a silver bullet.  For example, consider the check that verifies images have alt-text: the tool can tell you if an image is missing alt-text, but it can't tell you if the alt-text is meaningful. &lt;/p&gt;

&lt;p&gt;Manual validation with a screen reader is an additional step you can take to ensure that your website is accessible.  Tooling also exists to automate parts of that screen reader testing process, but that's a topic for another post.  Stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>qa</category>
      <category>a11y</category>
      <category>ui</category>
      <category>playwright</category>
    </item>
    <item>
      <title>Accessibility Testing with a Chrome Extension</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Mon, 02 Dec 2024 11:00:00 +0000</pubDate>
      <link>https://dev.to/leading-edje/accessibility-testing-with-chrome-extension-53kn</link>
      <guid>https://dev.to/leading-edje/accessibility-testing-with-chrome-extension-53kn</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;An accessible website ensures that all users, regardless of their physical or cognitive limitations, can navigate and interact with its content. &lt;/p&gt;

&lt;p&gt;Accessibility addresses issues faced by people with impairments, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visual impairments (requiring screen readers or high contrast)&lt;/li&gt;
&lt;li&gt;hearing impairments (requiring captions for audio content)&lt;/li&gt;
&lt;li&gt;mobility challenges (navigating without a mouse)&lt;/li&gt;
&lt;li&gt;cognitive disabilities (requiring clear and consistent layouts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.w3.org/TR/WCAG22/" rel="noopener noreferrer"&gt;Web Content Accessibility Guidelines&lt;/a&gt; are the gold standard for web accessibility, offering a comprehensive set of standards to ensure websites are usable by everyone.&lt;/p&gt;

&lt;p&gt;Even with these standards, testing for accessibility can be challenging, particularly when done manually. The good news? Many aspects of accessibility testing can be automated.&lt;/p&gt;

&lt;p&gt;Let's get started with a Chrome browser extension, and a website that could really use some help with accessibility.  &lt;/p&gt;

&lt;h2&gt;
  
  
  axe DevTools extension for Chrome
&lt;/h2&gt;

&lt;p&gt;A number of browser extensions are available to help you test for accessibility issues. One of the most popular is the "axe DevTools" extension for Chrome. This tool allows you to run accessibility checks right in the browser and view the results in an easy-to-understand format.&lt;/p&gt;

&lt;p&gt;axe DevTools unlocks more features with a Pro subscription, but the free version is still quite powerful.  Everything we'll be doing here can be done with the free version.&lt;/p&gt;

&lt;p&gt;Enough chatter, let's see how this works!&lt;/p&gt;

&lt;h2&gt;
  
  
  The website
&lt;/h2&gt;

&lt;p&gt;I want to take a look at &lt;a href="https://www.washington.edu/accesscomputing/AU/before.html" rel="noopener noreferrer"&gt;this page&lt;/a&gt; and see what kind of accessibility issues I can find.  The page is specifically designed to have lots of accessibility issues, so it's a great place to start.  It looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobnbe50qwp4uabr154gv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobnbe50qwp4uabr154gv.png" alt="website to test" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the extension
&lt;/h2&gt;

&lt;p&gt;Install the &lt;a href="https://chromewebstore.google.com/detail/axe-devtools-web-accessib/lhdoppojpmngadmnindnejefpokejbdd" rel="noopener noreferrer"&gt;axe DevTools extension&lt;/a&gt; from the Chrome Web Store. Once installed, you can start it by selecting axe DevTools from the Chrome Developer Tools menu:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabjnebwetp775gv98bfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabjnebwetp775gv98bfn.png" alt="starting the extension" width="650" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it will load, like this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpehyz0gx24d70pdop3z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpehyz0gx24d70pdop3z1.png" alt="axe extension" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a scan
&lt;/h2&gt;

&lt;p&gt;Once you have the extension loaded, you can click the "&lt;code&gt;Full Page Scan&lt;/code&gt;" button to run a scan on the entire page.  This will check for a variety of accessibility issues and display the results in the panel, like this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr1u83ynzu0ee52261cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr1u83ynzu0ee52261cq.png" alt="full scan results" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see above there are a total of 50 issues, with 12 of them being critical. I'm going to click on the hyperlinked "12" to filter just the 12 critical issues:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1185fpebbm07y5uszr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1185fpebbm07y5uszr.png" alt="12 critical issues" width="608" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alt-text issues
&lt;/h2&gt;

&lt;p&gt;OK, now it's starting to get interesting.  Our first problem is that we have two images with no alt-text.  This is a big no-no for accessibility, as screen readers rely on alt-text to describe images to users who can't see them.  &lt;/p&gt;

&lt;p&gt;When I click on the "Images must have alt-text" link above, I see this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F782posaxvbfdsfr39iub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F782posaxvbfdsfr39iub.png" alt="alt text issues" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is great info here.  You can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the flagged element location in the DOM&lt;/li&gt;
&lt;li&gt;actionable steps to address the issue&lt;/li&gt;
&lt;li&gt;highlighted element on the page (pink border)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Form field issues
&lt;/h2&gt;

&lt;p&gt;Let's look at our other critical issues.  The next issue is "form fields without labels".  This is a big deal because users who rely on screen readers need to know what each form field is for.  If the form field is missing a label, or the label is not descriptive, it can be very confusing.  Looks like we have 10 of those issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F684fjrtnpbuuqqi1ko2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F684fjrtnpbuuqqi1ko2a.png" alt="Form field issues" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our first example, we have the Name field, and just looking at it visually, it does appear to have a label.  But the label is not connected to the input field in the DOM.  This is a common issue with forms, and it's easy to fix.  &lt;/p&gt;

&lt;p&gt;You can see the problem element in the DOM, and the highlighted element on the page.  The actionable steps tell you to connect the label to the input field.&lt;/p&gt;
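
&lt;p&gt;The fix is usually as simple as giving the label a &lt;code&gt;for&lt;/code&gt; attribute that matches the input's &lt;code&gt;id&lt;/code&gt; (the element names in this sketch are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&amp;lt;!-- Before: the label is only visually associated --&amp;gt;
&amp;lt;label&amp;gt;Name:&amp;lt;/label&amp;gt;
&amp;lt;input type="text" name="name"&amp;gt;

&amp;lt;!-- After: the label is programmatically connected --&amp;gt;
&amp;lt;label for="name"&amp;gt;Name:&amp;lt;/label&amp;gt;
&amp;lt;input type="text" id="name" name="name"&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;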

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;So there you have it, some basics about how to use the axe DevTools extension to identify and fix accessibility issues.  &lt;/p&gt;

&lt;p&gt;Also, along the top of the page you will see an "After" button that shows a version of the page with all of the issues addressed; a scan of that version finds 0 accessibility issues.&lt;/p&gt;

&lt;p&gt;Finally, we looked at the 12 critical issues, but remember there are 50 total issues, and this is just for one page!  Imagine how many issues you might find on a large site, especially one that has not been designed with accessibility in mind. &lt;/p&gt;

&lt;p&gt;In addition to the browser extension, we can do accessibility testing in the CI pipeline with tools like Playwright.  This approach ensures we catch these issues before they reach our codebase or get deployed to production.  I'll cover that in my next blog post, so stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>ux</category>
      <category>accessibilty</category>
      <category>qa</category>
    </item>
    <item>
      <title>Playwright Visual Testing - Dynamic Data</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Mon, 11 Nov 2024 13:00:00 +0000</pubDate>
      <link>https://dev.to/leading-edje/playwright-visual-testing-dynamic-data-34ia</link>
      <guid>https://dev.to/leading-edje/playwright-visual-testing-dynamic-data-34ia</guid>
      <description>&lt;h1&gt;
  
  
  Intro
&lt;/h1&gt;

&lt;p&gt;In my &lt;a href="https://dev.to/leading-edje/visual-testing-with-playwright-3fhh"&gt;last post&lt;/a&gt; I talked about how to get started with visual testing using Playwright. In this post, I’m going to cover how to handle dynamic data in your visual tests.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is dynamic data?
&lt;/h1&gt;

&lt;p&gt;For the purposes of this post, I’m defining dynamic data as anything on your web page that could change between test runs. This could be data that is generated randomly, data that is pulled from an API, data that is based on the current date or time, etc.&lt;/p&gt;

&lt;h1&gt;
  
  
  Options for dealing with dynamic data
&lt;/h1&gt;

&lt;p&gt;In visual testing, it's essential to ensure that tests are consistent and reliable. When applications contain dynamic data, a decision is needed on how to handle it within the visual tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Mock your dynamic data
&lt;/h2&gt;

&lt;p&gt;One option for handling dynamic data in your visual tests is to mock the data so that it is consistent each time you run the test. For example, if you have a list of items that can change over time (and break your test), you could mock the list so it's the same each time you run the test.&lt;/p&gt;
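
&lt;p&gt;In Playwright you can do this kind of mocking with network interception via &lt;code&gt;page.route()&lt;/code&gt;. Here's a minimal sketch (the endpoint and payload are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { test } from '@playwright/test';

test.beforeEach(async function ({ page }) {
  // Fulfill the API call with a fixed payload so the page renders
  // the same list on every test run.
  await page.route('**/api/todos', function (route) {
    return route.fulfill({
      json: [{ id: 1, title: 'Buy milk', done: false }],
    });
  });
  await page.goto('/');
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;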

&lt;h2&gt;
  
  
  Option 2: Hide the dynamic data
&lt;/h2&gt;

&lt;p&gt;Another option for handling dynamic data in your visual tests is to hide the data so that it is not visible in the screenshot. This is the option I’m going to focus on in this post, as it can be a simple and effective way to handle dynamic data in your visual tests.&lt;/p&gt;

&lt;h1&gt;
  
  
  Back to my sample application
&lt;/h1&gt;

&lt;p&gt;In my last post I had a sample ToDo app with a single Playwright visual test.  I've tweaked the application to include the current date and time in the footer.  So now when I run the test it fails, because it doesn't match my baseline image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8wn1n8eoltjcgr5f8gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8wn1n8eoltjcgr5f8gt.png" alt="side-by-side view" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the Actual page has the current date and time in the footer, but the Baseline image does not.  You can also look at the Diff image to see the highlighted differences between the two:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ybpvxhz70req1pgd78t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ybpvxhz70req1pgd78t.png" alt="diff view" width="460" height="502"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Hiding dynamic data
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;toHaveScreenshot()&lt;/code&gt; function in Playwright is what we use to do the screenshot comparison, and it accepts a &lt;code&gt;stylePath&lt;/code&gt; parameter that applies a custom stylesheet before taking the screenshot, which you can use to hide elements on the page.  This is a great way to handle dynamic data in your visual tests.  In my sample application I created &lt;code&gt;screenshot.css&lt;/code&gt;, which looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;  &lt;span class="nf"&gt;#datetime&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then pass this file to the &lt;code&gt;toHaveScreenshot()&lt;/code&gt; function like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveScreenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;landing.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;stylePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;screenshot.css&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when I run the test, the dynamic data is hidden, and the test passes.  The Baseline image does not need to change, as the dynamic data is not visible in the screenshot.&lt;/p&gt;

&lt;h1&gt;
  
  
  Masking dynamic data
&lt;/h1&gt;

&lt;p&gt;Another option for dealing with dynamic data is to use Playwright's &lt;code&gt;mask&lt;/code&gt; parameter, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveScreenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;platform&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;landing.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#datetime&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to go this route, you will need to update the Baseline image to include the mask.  This is a little more work, but it's a good option if you want a placeholder for the dynamic data in the Baseline image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstchttiwilw89brljyx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstchttiwilw89brljyx7.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  But what if hiding dynamic data is also hiding a bug?
&lt;/h2&gt;

&lt;p&gt;Hey, great question!  The good news is you still have the ability to do functional validation of that dynamic data, just like you always have.&lt;/p&gt;

&lt;p&gt;For example, if I want to verify the datetime is displayed and matches the current date, I could do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;datetimeValue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#datetime-value&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Verify that the datetime value element is visible&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;datetimeValue&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeVisible&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Get the text content of the datetime value element&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;datetimeText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;datetimeValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if datetimeText is a valid date string&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;datetimeText&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isNaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;datetimeText&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;datetimeDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;datetimeText&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

      &lt;span class="c1"&gt;// Assert that the date part of the datetime value is equal to the current date&lt;/span&gt;
      &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;datetimeDate&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentDate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Invalid datetime text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code verifies that the datetime is displayed and that it matches the current date.  I can then let my visual comparison do the rest of the heavy lifting.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;That's about it for this post.  You now know how to hide dynamic data in your visual tests while continuing to use functional validation to ensure the dynamic data is correct.&lt;/p&gt;

&lt;p&gt;So far these blog posts have been focused on how Playwright supports visual testing, without considering cloud-based solutions like Applitools or Percy.  In my next post I'll cover how to integrate Playwright with Applitools (or Percy) for visual testing.  Stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>devops</category>
      <category>qa</category>
    </item>
    <item>
      <title>Playwright Visual Testing - Getting Started</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Fri, 25 Oct 2024 17:25:42 +0000</pubDate>
      <link>https://dev.to/leading-edje/visual-testing-with-playwright-3fhh</link>
      <guid>https://dev.to/leading-edje/visual-testing-with-playwright-3fhh</guid>
      <description>&lt;h1&gt;
  
  
  Intro
&lt;/h1&gt;

&lt;p&gt;So, what’s the big deal with visual testing, and how is it different from functional testing?&lt;/p&gt;

&lt;p&gt;Visual testing is all about comparing an &lt;em&gt;actual&lt;/em&gt; web page to an &lt;em&gt;expected&lt;/em&gt; (baseline) image. Unlike functional testing, which checks if your application &lt;em&gt;works&lt;/em&gt; as expected, visual testing focuses on how your web pages &lt;em&gt;look&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;Many times, visual testing is done manually, but you can (and should) automate it where appropriate using tools like Playwright.&lt;/p&gt;

&lt;p&gt;While Playwright is known for functional testing, it also comes with a straightforward API that lets you take screenshots and compare them to baseline images automatically. This way, you can easily include visual testing in your automated suite to ensure that as your application evolves, it continues to look great.&lt;/p&gt;

&lt;p&gt;A number of third-party cloud solutions support visual testing and can be integrated into your Playwright functional tests, but for now I'm going to focus solely on Playwright's core standalone functionality.&lt;/p&gt;

&lt;h1&gt;
  
  
  Manual Visual Testing Challenges
&lt;/h1&gt;

&lt;p&gt;Manual visual testing can be super useful, but, like anything else, it has its challenges. Here are a couple that come to mind:&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge #1: Manual validation is time consuming
&lt;/h2&gt;

&lt;p&gt;One of the trickiest parts of visual testing is ensuring your application looks good across different browsers, operating systems, and screen resolutions. What looks perfect in Chrome on your MacBook might not look so great in Safari on an iPhone 12.&lt;/p&gt;

&lt;p&gt;If you want to test your application across 3 browsers and 3 viewports, that’s 9 different combinations you need to test. And if you want to test across multiple operating systems, that’s even more combinations to test.  So even if you're able to manually test all these combinations without pulling your hair out, it’s time-consuming.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge #2: Manual validation is error-prone
&lt;/h2&gt;

&lt;p&gt;Manual testing is done by humans, which means human errors come with it. It’s easy to miss a visual bug, especially if you’re testing on multiple browsers and devices.&lt;/p&gt;

&lt;p&gt;Luckily, Playwright has some great features to help with these challenges, so let’s dig into how it works.&lt;/p&gt;

&lt;h1&gt;
  
  
  A Basic Example
&lt;/h1&gt;

&lt;p&gt;Let’s start with a simple example to show how easy it is to set up visual testing with Playwright. If you want to see all the details, I’ve got a &lt;a href="https://github.com/dwwhalen/visual-testing-sandbox" rel="noopener noreferrer"&gt;sample repo&lt;/a&gt; with a basic task-tracking application and some Playwright tests to go with it. Here’s the first test I wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;New Todo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@visual should allow me to add todo items&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toHaveScreenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;landing.png&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test is pretty basic. It just navigates to the application’s home page and uses Playwright’s &lt;code&gt;toHaveScreenshot()&lt;/code&gt; function to grab a screenshot of the page and compare it to the baseline image. When I run this test in the Chromium browser, I get this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Error: A snapshot doesn't exist at /Users/denniswhalen/visual-testing-sandbox/e2e-tests/blog.spec.ts-snapshots/landing-chromium-desktop-darwin.png, writing actual.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test failed because Playwright didn’t find a baseline image, which makes sense since I don’t have one yet! Playwright created a baseline for me based on what the page looked like during the test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqi2mb1nkywb9ln4uraf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqi2mb1nkywb9ln4uraf.png" alt="Image description" width="776" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After manually reviewing the baseline image and confirming it looked good, I ran the test again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;      ✓  1 [chromium-desktop] › blog.spec.ts:10:9 › New Todo › @visual should allow me to add todo items (1.5s)

  1 passed (3.0s)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, Playwright found the baseline screenshot, compared it to the actual screenshot in the test, and verified they match. All good!&lt;/p&gt;
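&lt;p&gt;By default, any pixel that differs beyond Playwright's built-in color threshold fails the comparison. If minor rendering noise becomes a problem, &lt;code&gt;toHaveScreenshot()&lt;/code&gt; accepts comparison options. Here's a minimal sketch; the threshold values are illustrative only, and the &lt;code&gt;'/'&lt;/code&gt; path assumes a &lt;code&gt;baseURL&lt;/code&gt; is set in the config:&lt;/p&gt;

```typescript
import { test, expect } from '@playwright/test';

// A sketch of toHaveScreenshot() with comparison options. The
// values below are illustrative, not recommendations, and '/'
// assumes a baseURL is configured in playwright.config.ts.
test('@visual landing page with tolerance', async ({ page }) => {
  await page.goto('/');
  await expect(page).toHaveScreenshot('landing.png', {
    maxDiffPixels: 100, // up to 100 differing pixels still pass
    threshold: 0.2,     // relaxed per-pixel color comparison
  });
});
```

&lt;p&gt;Loosening these values trades sensitivity for stability, so it's worth keeping them as tight as your environment allows.&lt;/p&gt;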

&lt;h1&gt;
  
  
  Dealing with Challenge #1: Different Browsers and Viewports
&lt;/h1&gt;

&lt;p&gt;That test ran in the Chromium desktop browser, but you can also run it in other browsers like Firefox and WebKit, and even on mobile viewports. To do that, I updated my &lt;code&gt;playwright.config.ts&lt;/code&gt; file to include the browsers and viewports I wanted to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;projects&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chrome-desktop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Chrome&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox-desktop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;webkit-desktop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Desktop Safari&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chrome-mobile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Pixel 5&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;safari-mobile&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;use&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;iPhone 12&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these updates, I can run the tests again, and they’ll run across all the browsers and viewports I specified.&lt;/p&gt;

&lt;p&gt;When the test runs, Playwright will look for a baseline image that matches the name of the project for which the test is running. Since I don’t yet have baseline images for the new projects, Playwright will generate them for me. After reviewing the images and rerunning the test, I’m all set.&lt;/p&gt;

&lt;p&gt;With that done, I have one simple test that runs in five browser/viewport combinations, verifying the screens against the baseline files. The baseline filenames include the page name (&lt;code&gt;landing&lt;/code&gt;), browser name, and OS (&lt;code&gt;-darwin&lt;/code&gt;), so it’s easy to track which baseline images are being used.&lt;/p&gt;
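&lt;p&gt;If you want more control over where these baselines live and how they're named, Playwright's config exposes a &lt;code&gt;snapshotPathTemplate&lt;/code&gt; option. A minimal sketch; the directory layout shown is just one possible choice:&lt;/p&gt;

```typescript
import { defineConfig } from '@playwright/test';

// playwright.config.ts fragment. Tokens like {projectName},
// {platform}, and {arg} are filled in by Playwright, so each
// project/OS combination still gets its own baseline file.
export default defineConfig({
  snapshotPathTemplate:
    '{testDir}/__screenshots__/{projectName}/{platform}/{arg}{ext}',
});
```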

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffibrkunselocxbz9tsbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffibrkunselocxbz9tsbk.png" alt="Image description" width="684" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Boom! My first visual test is working locally. Next, I could write more tests, but first, I want to push this to GitHub and run it in the CI workflow, just to be sure it works. But really, what could go wrong?&lt;/p&gt;

&lt;h1&gt;
  
  
  Dealing with Challenge #2: Different Operating Systems
&lt;/h1&gt;

&lt;p&gt;Hmmm, when I ran it in CI, I got this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Error: A snapshot doesn't exist at /workspace/e2e-tests/blog.spec.ts-snapshots/landing-chromium-desktop-linux.png, writing actual.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happened because the baseline image Playwright is looking for in the CI environment doesn’t exist. The baseline images I created earlier had names ending in &lt;code&gt;-darwin.png&lt;/code&gt; (Mac), but the CI system is running Linux, so it’s expecting a &lt;code&gt;-linux.png&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;As with the browser name in Challenge #1, Playwright includes the operating system in the baseline filename, and for the same reason: pages can render differently from one OS to the next.&lt;/p&gt;

&lt;p&gt;So, now I need to create a Linux baseline image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker to the Rescue!
&lt;/h2&gt;

&lt;p&gt;To solve this, I used Docker on my MacBook to generate the Linux baseline images. Here’s the command I ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it --rm \
  --ipc=host \
  -v $(pwd):/workspace \
  -w /workspace \
  -e HOME=/tmp \
  mcr.microsoft.com/playwright:latest \
  /bin/bash -c "npx playwright install --with-deps &amp;amp;&amp;amp; npm run e2e:visual"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command runs the visual tests inside a Docker container and generates the baseline images for Linux, so now I have all the baseline images I need:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1h34qf9lxwjddt10snc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1h34qf9lxwjddt10snc.png" alt="Docker command" width="650" height="538"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;I also tweaked my GitHub workflow to run the visual tests with the same Docker image I used locally, so if the tests pass in Docker locally, they should also pass in the CI environment.&lt;/p&gt;

&lt;p&gt;After committing the changes to the repo, the tests now pass both locally on my Mac and in the CI workflow. Pretty slick, right?&lt;/p&gt;

&lt;h1&gt;
  
  
  Dealing with a Failed Test
&lt;/h1&gt;

&lt;p&gt;Let’s see how Playwright helps you when a test fails. I’m going to change the application so the ToDo textbox doesn’t have placeholder text. When I run the test, I get this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Error: Screenshot comparison failed:
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the specific issue, I can check the HTML report that Playwright generates:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe35465y45yftvp1mw4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffe35465y45yftvp1mw4k.png" alt="HTML report with side-by-side view" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I selected the side-by-side view, which shows the actual image on the left and the expected image on the right. You can easily spot differences and decide if they’re acceptable. If they are expected, you can update the baseline image. If not, you can update the page to fix the issue.&lt;/p&gt;
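&lt;p&gt;If the difference is expected, rerunning with the &lt;code&gt;--update-snapshots&lt;/code&gt; flag regenerates the baselines rather than deleting them by hand. And for page regions that legitimately change on every run, &lt;code&gt;toHaveScreenshot()&lt;/code&gt; can mask them out. A sketch, where the &lt;code&gt;.timestamp&lt;/code&gt; locator is a hypothetical placeholder:&lt;/p&gt;

```typescript
import { test, expect } from '@playwright/test';

// Masked elements are painted over with a solid box in both the
// actual and expected images, so volatile content (dates, counters)
// can't cause false failures. The '.timestamp' selector is a
// hypothetical placeholder for such an element.
test('@visual landing page ignoring dynamic regions', async ({ page }) => {
  await page.goto('/');
  await expect(page).toHaveScreenshot('landing.png', {
    mask: [page.locator('.timestamp')],
  });
});
```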

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;So that’s just a little taste of how you can use Playwright for visual testing. Hopefully, you can see the value of visual testing and how it helps catch visual bugs before your users do!&lt;/p&gt;

&lt;p&gt;One benefit that might not be as obvious is how visual testing can simplify your functional testing. In future posts, I’ll talk about that and cover how to handle dynamic data in automated visual testing. &lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>automation</category>
      <category>playwright</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why I Blog</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Thu, 13 Jul 2023 20:27:54 +0000</pubDate>
      <link>https://dev.to/leading-edje/why-i-blog-47bm</link>
      <guid>https://dev.to/leading-edje/why-i-blog-47bm</guid>
      <description>&lt;p&gt;I've been blogging off-and-on for over 4 years.  It's really helped me grow, and I usually recommend it to others as great way to keep growing skills.&lt;/p&gt;

&lt;p&gt;Recently I was giving peer review feedback to a fellow employee.  As I was describing all of the reasons THEY might want to consider blogging, I realized I should create my own blog post that talks about the value of blogging.&lt;/p&gt;

&lt;h1&gt;
  
  
  Focused learning, for me!
&lt;/h1&gt;

&lt;p&gt;It probably sounds weird, but a lot of the things I blog about are things I don't know much about when I start the blog.  &lt;/p&gt;

&lt;p&gt;When I identify a skill or technology that's new to me and I want to learn more, many times I will write a blog as I work through my learning. &lt;/p&gt;

&lt;p&gt;For example, let's say I want to get some expertise on using Postman for API testing, and I also want to demonstrate how to include those Postman tests in a CI pipeline.  &lt;/p&gt;

&lt;p&gt;For something like that, I will want to build a working prototype that demonstrates the key pieces of the tech I'm learning.  As I build that prototype, I will also write a blog post that describes what I'm doing and why.  &lt;/p&gt;

&lt;p&gt;Writing the blog as I work through the prototype reinforces my learning, and it forces me to make sure I really understand how things work.  I don't want to share incomplete or incorrect info with others.  &lt;/p&gt;

&lt;p&gt;I'll also have screenshots and a repo I can share in the blog.&lt;/p&gt;

&lt;p&gt;Feel free to look at the &lt;a href="https://dev.to/dwwhalen/series/17772"&gt;blog&lt;/a&gt; I wrote for Postman.  The blog actually started as a single post, but turned into a 2-part series.  Once I got into it, I realized it was probably too big for one post, so I split it up.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introducing myself to a future client
&lt;/h1&gt;

&lt;p&gt;I'm an IT consultant.  That means I may be interviewing with new clients relatively frequently.  I've found that making my blog posts available to potential clients is an &lt;strong&gt;AWESOME&lt;/strong&gt; marketing tool!  &lt;/p&gt;

&lt;p&gt;If you have good content, and the person interviewing you has looked at it, that will get you a long way towards securing that next project or client or job.&lt;/p&gt;

&lt;h1&gt;
  
  
  For future reference
&lt;/h1&gt;

&lt;p&gt;Like many of us, I work with a lot of different technologies over time, and I need to be constantly growing my skills.&lt;/p&gt;

&lt;p&gt;Along with all this learning comes a LOT of forgetting.  The stuff that I was working with last year?  That stuff that I knew like the back of my hand??  Well, it might be a distant memory to me now.&lt;/p&gt;

&lt;p&gt;If I create a clear and concise blog about a topic as I work with it, I will have a future reference point that I can go back to as needed.  Even if no one else reads my blog post, &lt;strong&gt;I&lt;/strong&gt; can reference it.  &lt;/p&gt;

&lt;p&gt;Since it's written in my voice, about a topic that I've struggled through, my memory will get refreshed a lot quicker than googling "help me with postman".&lt;/p&gt;

&lt;h1&gt;
  
  
  Helping others
&lt;/h1&gt;

&lt;p&gt;I guess this is an obvious benefit to blogging, and a bit less selfish than the previous ones I mentioned.  &lt;/p&gt;

&lt;p&gt;We've all been helped by so many nameless and faceless folks on our journey to improve our skills and marketability.  Paying that back is a good thing and will make you feel good.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some suggestions
&lt;/h1&gt;

&lt;p&gt;These are just some random blogging suggestions that I thought of as I composed this.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Don't compose in the blog provider's editor
&lt;/h2&gt;

&lt;p&gt;Avoid composing your blog directly in the blog platform's editor.  Instead, consider composing your blog locally and pushing it to a repo.  If you have a POC to support the blog, keep the blog with that code.  From there you can just copy/paste it to the blog platform editor.&lt;/p&gt;

&lt;p&gt;In the past I have run into issues where I've lost work with the blog platform editor, so I do everything locally.  I just use VS Code for authoring markdown files and store them in GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't get too hung up on visit and like counts
&lt;/h2&gt;

&lt;p&gt;I have yet to break the internet with any of my posts.  I've had 1 or 2 that have done ok, but most have 10 likes or fewer.  That's ok.  Remember, blogging is for the benefit of the author also!&lt;/p&gt;

&lt;p&gt;I just remember that I'm good enough, I'm smart enough, and doggone it, people like me!&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3diasw29uylm0wei9f10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3diasw29uylm0wei9f10.png" alt="Image description" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep your post to a 5-minute read, or less
&lt;/h2&gt;

&lt;p&gt;Like everything in this post, this is just my opinion.  I prefer shorter blog posts when reading them, so I try to do the same when writing them.  If I find I have too much content, I can just split it into multiple posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference your repo and use screenshots
&lt;/h2&gt;

&lt;p&gt;This is pretty self-explanatory.  If you are building something as you write your blog, include a link to your repo and don't forget to add some screenshots to the post.  Pictures are good!&lt;/p&gt;

&lt;h2&gt;
  
  
  Hmm, that Visual Basic 6 post may be too old...
&lt;/h2&gt;

&lt;p&gt;Blogs get stale, and they need to be updated or pruned.  I have not done a good job with that, and I need to do better.  &lt;/p&gt;

&lt;p&gt;(No, I don't really have a VB 6 post.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Blogging will give you more ideas
&lt;/h2&gt;

&lt;p&gt;Blogging will give you new ideas for more posts.  Write that sh*t down so you don't forget!&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;So, there you go.  That's a decent overview of why I blog.  What do you think?  Does this give you any ideas or motivation for blogging?  &lt;/p&gt;

&lt;p&gt;What are some other good reasons to blog?&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts in the comments!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>blog</category>
      <category>beginners</category>
      <category>writing</category>
      <category>learning</category>
    </item>
    <item>
      <title>Gherkin and Robot Framework</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 12 Feb 2023 17:21:13 +0000</pubDate>
      <link>https://dev.to/leading-edje/gherkin-and-robot-framework-5oe</link>
      <guid>https://dev.to/leading-edje/gherkin-and-robot-framework-5oe</guid>
      <description>&lt;p&gt;Greetings!  They say all good things must come to an end, and with this post, so it is with my series of posts covering &lt;a href="http://robotframework.org/" rel="noopener noreferrer"&gt;Robot Framework&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;This post builds on what was covered in previous posts.  If you haven't checked out the &lt;a href="https://dev.to/dwwhalen/series/21110"&gt;other posts in the series&lt;/a&gt;, please do.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Robot Framework and Gherkin
&lt;/h1&gt;

&lt;p&gt;In the last post we built a Robot test to validate some functionality of the ToDo app.  The test accessed the ToDo page, added a ToDo, and verified it was successfully added.  (The repo for the app and the tests can be found &lt;a href="https://github.com/dwwhalen/cypress-robot-todomvc" rel="noopener noreferrer"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This is what our current Robot test looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;***&lt;/span&gt;&lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Case&lt;/span&gt;&lt;span class="o"&gt;***&lt;/span&gt;
&lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ToDo&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
  &lt;span class="n"&gt;Open&lt;/span&gt; &lt;span class="n"&gt;Browser&lt;/span&gt;    &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BROWSER&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;Page&lt;/span&gt; &lt;span class="n"&gt;Should&lt;/span&gt; &lt;span class="n"&gt;Not&lt;/span&gt; &lt;span class="n"&gt;Contain&lt;/span&gt; &lt;span class="n"&gt;Element&lt;/span&gt;  &lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nd"&gt;@class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;  &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;  &lt;span class="n"&gt;Finish&lt;/span&gt; &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;Blog&lt;/span&gt; &lt;span class="n"&gt;Post&lt;/span&gt;
  &lt;span class="n"&gt;Press&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;  &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;  &lt;span class="n"&gt;RETURN&lt;/span&gt;
  &lt;span class="n"&gt;Page&lt;/span&gt; &lt;span class="n"&gt;Should&lt;/span&gt; &lt;span class="n"&gt;Contain&lt;/span&gt; &lt;span class="n"&gt;Element&lt;/span&gt;  &lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;section&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nd"&gt;@class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_count&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;SeleniumLibrary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt; &lt;span class="n"&gt;Element&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;  &lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nd"&gt;@class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;todo-list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;li&lt;/span&gt;
  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_count_number&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;Convert&lt;/span&gt; &lt;span class="n"&gt;To&lt;/span&gt; &lt;span class="n"&gt;Number&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="n"&gt;Should&lt;/span&gt; &lt;span class="n"&gt;Be&lt;/span&gt; &lt;span class="n"&gt;Equal&lt;/span&gt;  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_count_number&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;actual_count&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are at least a couple of issues with this test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Page locators are not reusable.  Following this pattern, if I wanted to create another test (or 100 tests) that needs the count of ToDo items, I would probably just copy/paste that locator, &lt;code&gt;//ul[@class='todo-list']/li&lt;/code&gt; every time I needed it.  A better strategy would be to define the locator in a single place.  If the locator ever needs to change, I have one place to go to make my update.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To my eyes, the test is hard to read.  This is a pretty basic test, but it's not super clear what's going on.  Also, this pattern requires knowledge of the Robot syntax.  Product owners and BAs are not going to be interested in reading this, and probably no one else will be either.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's address these 2 issues.&lt;/p&gt;

&lt;p&gt;As a reminder, this is what that test is actually doing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- open the ToDo page
- verify there are no ToDos
- add a ToDo
- verify the ToDo text matches what was added
- verify there is 1 ToDo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I want my test to look a lot like this, without all the implementation details in my way. &lt;/p&gt;

&lt;h2&gt;
  
  
  Gherkin and Robot Framework
&lt;/h2&gt;

&lt;p&gt;In the real world, the above is a good example of an acceptance test.  It defines some basic functionality and describes how the application should react.  Automating the testing of the requirements is useful as a component of &lt;a href="https://en.wikipedia.org/wiki/Acceptance_test-driven_development" rel="noopener noreferrer"&gt;Acceptance Test Driven Development (ATDD)&lt;/a&gt;.  I usually see this go hand-in-hand with &lt;a href="https://cucumber.io/docs/bdd/" rel="noopener noreferrer"&gt;BDD&lt;/a&gt; and the Gherkin syntax.  For our example, the Gherkin-syntax test could look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="nf"&gt;When &lt;/span&gt;the User accesses the Home page
&lt;span class="nf"&gt;Then &lt;/span&gt;the ToDo count is  0
&lt;span class="nf"&gt;When &lt;/span&gt;the user enters new ToDo  learn Robot
&lt;span class="nf"&gt;Then &lt;/span&gt;the ToDo item is added to the to the list  learn Robot  1
&lt;span class="nf"&gt;And &lt;/span&gt;the ToDo count is  1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So how does Robot Framework facilitate our ability to write tests like this?&lt;/p&gt;

&lt;p&gt;First of all, remember Robot is a keyword-driven framework.  The first line of our test &lt;code&gt;When the User accesses the Home page&lt;/code&gt; is just a keyword to Robot.  As discussed in &lt;a href="https://dev.to/dwwhalen/series/21110"&gt;previous posts&lt;/a&gt;, we can easily hide the implementation details of this keyword in a Robot resource file.  Defining this keyword in a resource file could be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;***&lt;/span&gt; &lt;span class="n"&gt;Keywords&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;
&lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;accesses&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Home&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;
    &lt;span class="n"&gt;Open&lt;/span&gt; &lt;span class="n"&gt;Browser&lt;/span&gt;  &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8888&lt;/span&gt;  &lt;span class="n"&gt;chrome&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here I have created a custom keyword &lt;code&gt;the User accesses the Home page&lt;/code&gt;, which uses the built-in keyword &lt;code&gt;Open Browser&lt;/code&gt;.  Note that my custom keyword does NOT begin with the word &lt;code&gt;When&lt;/code&gt;.  That's because if no match is found for the full keyword, Robot will ignore the prefixes "Given", "When", "Then", "And", and "But".  This makes it easy to build tests using the Gherkin syntax with Robot.&lt;/p&gt;

&lt;p&gt;Also, since we can pass parameters with Robot keywords, we're doing that with the step &lt;code&gt;When the user enters new ToDo  learn Robot&lt;/code&gt;.  We are passing the text of the ToDo, "learn Robot".  The implementation of that step in the resource file can look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;***&lt;/span&gt; &lt;span class="n"&gt;Keywords&lt;/span&gt; &lt;span class="o"&gt;***&lt;/span&gt;
&lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;enters&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ToDo&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Arguments&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;todo_to_enter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;  &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;  &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;todo_to_enter&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;Press&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;  &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;  &lt;span class="n"&gt;RETURN&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full resource file, including other examples, can be found &lt;a href="https://github.com/dwwhalen/cypress-robot-todomvc/blob/master/robot-framework/e2e/resources/ToDos_keywords.resource" rel="noopener noreferrer"&gt;here&lt;/a&gt;, with the gherkin test &lt;a href="https://github.com/dwwhalen/cypress-robot-todomvc/blob/master/robot-framework/e2e/suites/ToDos_basic_gherkin.robot" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To run these tests, just be sure to first start the app locally (&lt;code&gt;npm start&lt;/code&gt;).  If you can't open the app manually, the Robot test won't be able to either.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap-up
&lt;/h1&gt;

&lt;p&gt;So that's it.  We had 2 goals for cleaning up that test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;facilitate reuse&lt;/li&gt;
&lt;li&gt;make it readable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With reusable and parameterized gherkin steps, I feel we've accomplished both goals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="nf"&gt;When &lt;/span&gt;the User accesses the Home page
&lt;span class="nf"&gt;Then &lt;/span&gt;the ToDo count is  0
&lt;span class="nf"&gt;When &lt;/span&gt;the user enters new ToDo  learn Robot
&lt;span class="nf"&gt;Then &lt;/span&gt;the ToDo item is added to the to the list  learn Robot  1
&lt;span class="nf"&gt;And &lt;/span&gt;the ToDo count is  1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope this series of posts has helped someone learn more about Robot Framework.  Of course I have barely scratched the surface, and the links below should help you continue your Robot journey!&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/robotframework/QuickStartGuide/blob/master/QuickStart.rst" rel="noopener noreferrer"&gt;https://github.com/robotframework/QuickStartGuide/blob/master/QuickStart.rst&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/robotframework/RobotDemo" rel="noopener noreferrer"&gt;https://github.com/robotframework/RobotDemo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html" rel="noopener noreferrer"&gt;https://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://robotframework.org/robotframework/latest/libraries/BuiltIn.html" rel="noopener noreferrer"&gt;https://robotframework.org/robotframework/latest/libraries/BuiltIn.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/leading-edje"&gt;&lt;br&gt;
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5uo60qforg9yqdpgzncq.png" alt="Smart EDJE Image" width="800" height="280"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>gherkin</category>
      <category>robotframework</category>
      <category>python</category>
    </item>
    <item>
      <title>Web Testing With Robot Framework</title>
      <dc:creator>Dennis Whalen</dc:creator>
      <pubDate>Sun, 29 Jan 2023 21:32:44 +0000</pubDate>
      <link>https://dev.to/leading-edje/web-testing-with-robot-framework-3ngi</link>
      <guid>https://dev.to/leading-edje/web-testing-with-robot-framework-3ngi</guid>
      <description>&lt;p&gt;Liquid syntax error: 'raw' tag was never closed&lt;/p&gt;
</description>
      <category>workplace</category>
      <category>discuss</category>
      <category>career</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
