Wanda
Posted on • Originally published at apidog.com

How to Test LLM Applications: The Complete Guide to Promptfoo (2026)

TL;DR

Promptfoo is an open-source LLM evaluation and red-teaming framework that enables systematic, automated testing of AI applications. It supports 90+ model providers, offers 67+ security attack plugins, and runs entirely locally for privacy. With over 1.6 million npm downloads and adoption by companies serving 10M+ users, it’s become the standard for LLM testing. Get started with:

npm install -g promptfoo
promptfoo init --example getting-started



Introduction

After building an AI-powered customer support chatbot, you might find that users can make it leak sensitive data, bypass guardrails, or provide inconsistent answers. This is common: manual and gut-feel testing often misses vulnerabilities and quality issues, which are expensive to fix post-launch.

Promptfoo brings systematic, automated testing to LLM applications—enabling you to evaluate prompts across multiple models, run red-team security assessments, and catch regressions before they impact users.

This guide will walk you through setting up evaluations, running security scans, integrating with CI/CD, and building a robust test suite for your LLM application.

💡 Tip: If you need to test APIs alongside LLMs, Apidog provides a unified platform for API design, testing, and docs. Use promptfoo for LLMs and Apidog for API validation.

What Is Promptfoo and Why You Need It

Promptfoo is a CLI tool and Node.js library for evaluating and red-teaming LLM applications. Unlike traditional test frameworks, promptfoo is designed for the non-determinism and security challenges in AI development.

Promptfoo output comparison

Promptfoo addresses LLM-specific testing needs with:

  • Semantic assertions (meaning-based, not string-based)
  • LLM-graded evals (one model grades another’s output)
  • Multi-model comparisons (test prompts across GPT-4, Claude, etc.)
  • Security plugins (auto-detect vulnerabilities)

Promptfoo runs fully locally. Your prompts and test data stay on your machine unless you opt into cloud features—ideal for privacy and sensitive data.

The Problem Promptfoo Solves

Manual LLM testing misses regressions and edge cases, and it produces no metrics. Promptfoo replaces this with automated, repeatable evals you can run on every code change, delivering pass/fail rates, cost, and latency metrics.

Who Uses Promptfoo

With 1.6M+ npm downloads, promptfoo is used by:

  • Customer support chatbots
  • Content generation pipelines
  • Healthcare/fintech with compliance needs
  • Security-sensitive systems

Promptfoo is now maintained by OpenAI (as of March 2026), remains open source, and continues to evolve.

Getting Started: Install and Run Your First Eval

Install promptfoo globally or use npx for zero-install runs.

Installation

# Global install (recommended)
npm install -g promptfoo

# Or run with npx (no install)
npx promptfoo@latest

# macOS (Homebrew)
brew install promptfoo

# Python
pip install promptfoo

Set your API keys as environment variables:

export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xxx

Create Your First Eval

Initialize an example project:

promptfoo init --example getting-started
cd getting-started

This creates a promptfooconfig.yaml file with sample prompts, providers, and tests.
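The referenced prompt files are plain text with variable placeholders that promptfoo fills from each test case's vars. As an illustrative example (the file name and wording here are assumptions, not necessarily what the starter project generates), prompts/greeting.txt could contain:

```text
You are a friendly assistant. Reply to this message in one sentence: {{input}}
```

The {{input}} placeholder is substituted with the input variable defined under each test.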

Run the evaluation:

promptfoo eval

View results in the web UI:

promptfoo view

The UI opens at localhost:3000 with a side-by-side comparison of model outputs and assertion results.

Understanding the Config File

promptfooconfig.yaml defines your eval suite:

description: "My First Eval Suite"

prompts:
  - prompts/greeting.txt
  - prompts/farewell.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-5

tests:
  - vars:
      input: "Hello"
    assert:
      - type: contains
        value: "Hi"
      - type: latency
        threshold: 3000
  • prompts: Files or inline text to test
  • providers: List of models (supports 90+)
  • tests: Test cases with variables and assertions

Teams keep eval configs in version control and run them in CI for every pull request.
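Assertions that should apply to every test case can be declared once under defaultTest instead of being repeated per case. A minimal sketch based on promptfoo's config schema (the test values here are illustrative):

```yaml
description: "Eval suite with shared assertions"

prompts:
  - prompts/greeting.txt

providers:
  - openai:gpt-4o

defaultTest:
  assert:
    - type: latency
      threshold: 3000   # every test must respond within 3 seconds

tests:
  - vars:
      input: "Hello"
  - vars:
      input: "Goodbye"
```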

Core Features: What Promptfoo Can Do

1. Automated Evaluations

Define test cases with expected outcomes and run them against any supported model.

Assertion Types

Promptfoo provides 30+ built-in assertions:

  • contains: Output includes a substring
  • equals: Exact string match
  • regex: Match against a regex pattern
  • json-schema: Validate JSON structure
  • javascript: Custom JS function returns pass/fail
  • python: Custom Python function
  • llm-rubric: Use an LLM to grade output
  • similar: Semantic similarity score
  • latency: Response time under a threshold
  • cost: Cost per request under a threshold

Example:

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: javascript
        value: output.length < 100
      - type: latency
        threshold: 2000
      - type: cost
        threshold: 0.001

This checks that the response contains "Paris", keeps the output under 100 characters, requires a response within 2 seconds, and caps cost at $0.001 per request.
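The python assertion type can likewise reference a file. Here is a minimal sketch of a file-based check, assuming promptfoo's convention of a get_assert function that returns a pass/score/reason dict (the expected_city variable is a hypothetical example, not part of the starter project):

```python
# assertions/check_city.py
def get_assert(output: str, context: dict) -> dict:
    """Pass if the output mentions the expected city from the test vars."""
    expected = context.get("vars", {}).get("expected_city", "Paris")
    passed = expected.lower() in output.lower()
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Found '{expected}'" if passed else f"Missing '{expected}'",
    }
```

Reference it from the config with an assertion of type python and value file://assertions/check_city.py.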

LLM-Graded Evals

Use one LLM to grade another:

assert:
  - type: llm-rubric
    value: "Response should be helpful, harmless, and honest"

The grader can be a cheaper model to reduce costs.
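One way to do this is to point grading at a specific provider through defaultTest options. A hedged sketch (the provider choice and question are illustrative):

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini   # cheaper model performs the grading

tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response should be helpful, harmless, and honest"
```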

2. Red Teaming and Security Testing

Promptfoo includes red team modules to automatically generate adversarial probes.

Promptfoo red team report

Supported Attack Vectors

  • Prompt Injection: Direct, indirect, and context injection attacks
  • Jailbreaks: DAN, persona switching, role-play bypasses
  • Data Exfiltration: SSRF, system prompt extraction, prompt leakage
  • Harmful Content: Hate speech, dangerous activities, self-harm requests
  • Compliance: PII leakage, HIPAA violations, financial data exposure
  • Audio/Visual: Audio injection and image-based attacks

Running a Red Team Scan

promptfoo redteam init
promptfoo redteam run
promptfoo redteam report [directory]

The tool generates dynamic attack probes, evaluates them, and reports vulnerabilities with severity levels and remediation suggestions.

Example output:

Vulnerability Summary:
- Critical: 2 (PII leakage, prompt extraction)
- High: 5 (jailbreaks, injection attacks)
- Medium: 12 (bias, inconsistent responses)
- Low: 23 (minor policy violations)

Fix critical/high issues and re-scan after changes.
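Scans are driven by a redteam block that promptfoo redteam init generates in your config. A hedged sketch (the purpose text is illustrative, and plugin/strategy names should be checked against the promptfoo red team docs):

```yaml
targets:
  - openai:gpt-4o

redteam:
  purpose: "Customer support chatbot for a retail store"
  plugins:
    - pii        # probe for personal-data leakage
    - harmful    # probe for harmful-content generation
  strategies:
    - jailbreak
    - prompt-injection
```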

3. Code Scanning for Pull Requests

Integrate with GitHub Actions to block LLM-related security issues:

# .github/workflows/promptfoo-scan.yml
name: Promptfoo Code Scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptfoo/promptfoo/code-scan-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

Detects hardcoded API keys, insecure prompt patterns, missing validation, and potential injection vectors.

4. Model Comparison

Compare multiple models side by side:

promptfoo eval
promptfoo view

Web UI displays pass/fail rates, cost, latency, and qualitative differences—helping you select the most effective and cost-efficient model.

Supported Providers: 90+ LLM Integrations

Promptfoo supports 90+ LLM providers. Test prompts across OpenAI, Anthropic, Google, Amazon, Meta, and local models.

  • OpenAI: GPT-4, GPT-4o, GPT-4o-mini, o1, o3
  • Anthropic: Claude 3.5/3.7/4.5/4.6, Thinking models
  • Google: Gemini 1.5/2.0, Vertex AI
  • Microsoft: Azure OpenAI, Phi
  • Amazon: Bedrock (Claude, Llama, Titan)
  • Meta: Llama 3, 3.1, 3.2 (via multiple providers)
  • Ollama: Local models (Llama, Mistral, Phi)

Custom Providers

Write custom providers in Python or JS if your model isn’t supported.

Python:

# custom_provider.py
# `my_async_api` is a placeholder for your own async model client.
import my_async_api

class CustomProvider:
    async def call_api(self, prompt: str, options: dict, context: dict) -> dict:
        """Called by promptfoo for each test case; returns output plus token usage."""
        response = await my_async_api.generate(prompt)
        return {
            "output": response.text,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens,
            },
        }

JavaScript:

// customProvider.js
// `myApi` is a placeholder for your own model client module.
import myApi from './myApi.js';

export default class CustomProvider {
  async callApi(prompt) {
    // promptfoo expects an object with `output` and optional `tokenUsage`
    return {
      output: await myApi.generate(prompt),
      tokenUsage: { total: 50, prompt: 20, completion: 30 }
    };
  }
}

Register in promptfooconfig.yaml:

providers:
  - id: file://custom_provider.py
    config:
      api_key: ${MY_API_KEY}

Command-Line Interface: Essential Commands

Promptfoo CLI covers all daily workflows.

Core Commands

# Run evaluations
promptfoo eval -c promptfooconfig.yaml

# Open web UI
promptfoo view

# Share results online
promptfoo share

# Red team testing
promptfoo redteam init
promptfoo redteam run

# Configuration
promptfoo init
promptfoo validate [config]

# Results management
promptfoo list
promptfoo show <id>
promptfoo delete <id>
promptfoo export <id>

# Utilities
promptfoo cache clear
promptfoo retry <id>

Useful Flags

--no-cache              # Disable caching for fresh results
--max-concurrency <n>   # Limit parallel API calls
--output <file>         # Write results to JSON file
--verbose               # Enable debug logging
--env-file <path>       # Load environment variables from file
--filter <pattern>      # Run specific test cases

Example: Run Eval with Custom Settings

promptfoo eval \
  -c promptfooconfig.yaml \
  --no-cache \
  --max-concurrency 3 \
  --output results.json \
  --env-file .env

CI/CD Integration: Automate LLM Testing

Integrate promptfoo into your CI/CD pipeline to automate regression detection.

GitHub Actions Example

name: LLM Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm install -g promptfoo
      - run: promptfoo eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Quality Gates

Set pass thresholds in your config:

commandLineOptions:
  threshold: 0.8  # Require 80% pass rate

This fails CI if tests don’t meet the threshold.

Caching in CI

Speed up runs with caching:

- uses: actions/cache@v4
  with:
    path: ~/.cache/promptfoo
    key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

Cached results skip API calls for unchanged tests.

Web UI: Visualize and Share Results

promptfoo view opens an interactive UI at localhost:3000:

  • Eval matrix: Side-by-side comparison
  • Filtering: By status or provider
  • Diff view: See changes between runs
  • Sharing: Generate shareable links
  • Live updates: Watch evals run in real-time

Security: the UI includes CSRF protection, but don't expose it to untrusted networks. Use promptfoo share for cloud sharing, or self-host behind authentication.

Database and Caching

Cache Location

  • macOS/Linux: ~/.cache/promptfoo
  • Windows: %LOCALAPPDATA%\promptfoo

Use --no-cache during development for fresh runs.

Database Location

  • All platforms: ~/.promptfoo/promptfoo.db (SQLite)

Stores historical eval results; deleting it permanently removes your run history.

Security Model: What You Can Trust

Promptfoo separates trusted (code-executed) and untrusted (data-only) inputs.

Trusted Inputs

  • Config files (promptfooconfig.yaml)
  • Custom JS/Python/Ruby assertions
  • Provider configs
  • Transform functions

Only load these from sources you trust, since they can execute code.

Untrusted Inputs

  • Prompt text
  • Test case variables
  • Model outputs
  • Remote content fetched during evals

These are treated strictly as data and are never executed.

Hardening Recommendations

  • Run in containers/VMs with minimal privileges
  • Use least-privileged API keys
  • Don’t put secrets in prompts/configs
  • Restrict network egress for third-party code
  • Don’t expose local web UI to untrusted networks

Performance: Optimize Your Evals

Optimization Tips

  1. Use caching (default) for speed
  2. Tune concurrency with --max-concurrency for speed vs. API limits
  3. Filter tests with --filter during development
  4. Run each test multiple times with --repeat to gauge output consistency

Scaling for Large Evals

  • Use the scheduler (src/scheduler/) for distributed runs
  • Use remote generation for heavy compute
  • Export to Google Sheets for team visibility

Extensibility: Build Custom Features

Custom Assertions

Write your own assertion logic:

// assertions/customCheck.js
export default function customCheck(output, context) {
  const pass = output.includes('expected');
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matched' : 'Missing expected content'
  };
}

Use in config:

assert:
  - type: file://assertions/customCheck.js

MCP Server

Run the MCP server for AI assistant integration:

promptfoo mcp

Enables agents to:

  • Run evals from chat
  • Access red team scans
  • Query results
  • Generate test cases

Real-World Use Cases

Customer Support Chatbot

  • 500+ test cases
  • Multi-model evals (GPT-4, Claude)
  • Red team for PII/jailbreak
  • CI blocks deploys on failures
  • Result: 90% fewer customer-reported issues

Content Generation Pipeline

  • LLM-graded evals for tone/style
  • Latency/cost thresholds
  • Model comparison for best value
  • Result: Consistent voice, 40% lower API costs

Healthcare Application

  • Red team for HIPAA
  • Custom assertions for medical accuracy
  • Local evals for privacy
  • Auditable trails
  • Result: Passed SOC 2 audit using promptfoo

Conclusion

Promptfoo brings data-driven, automated testing to LLM applications—catching regressions, security issues, and quality problems before production.

Key steps:

  • Install: npm install -g promptfoo then promptfoo init
  • Use semantic assertions for robust validation
  • Run red team scans for security
  • Integrate with CI/CD
  • Compare models objectively
  • Extend with custom providers/assertions

With promptfoo, you can confidently build, test, and secure LLM applications at scale.

If you work with APIs, use Apidog with promptfoo. Apidog handles API design, testing, and docs; promptfoo covers LLM evaluation. Together, you get full-stack testing for modern apps.

FAQ

What is promptfoo used for?

Testing and evaluating LLM applications: automated tests against prompts, cross-model output comparisons, and security red-team assessments.

Is promptfoo free?

Yes, it’s open source (MIT). Free for personal and commercial use. Cloud/enterprise features may require paid plans.

How do I install promptfoo?

npm install -g promptfoo

Or use npx promptfoo@latest, brew install promptfoo (macOS), or pip install promptfoo (Python).

What models does promptfoo support?

90+ providers: OpenAI (GPT-4/o1), Anthropic (Claude 3.5/4/4.5), Google (Gemini), Microsoft (Azure OpenAI), Amazon Bedrock, Ollama local models, and more.

How do I run a red team scan?

promptfoo redteam init
promptfoo redteam run
promptfoo redteam report

Can I use promptfoo in CI/CD?

Yes—install promptfoo in your CI pipeline and run promptfoo eval with your config. Set threshold to enforce quality gates.

Does promptfoo send my data to external servers?

No. All runs are local unless you opt into cloud features. Cache and DB files are local.

How do I compare models with promptfoo?

List multiple providers in your config, run promptfoo eval, then promptfoo view to compare pass rates, cost, and latency in the web UI.
