Wanda
Posted on • Originally published at apidog.com

How to Test LLM Applications: The Complete Guide to Promptfoo (2026)

TL;DR

Promptfoo is an open-source LLM evaluation and red-teaming framework that enables systematic, automated testing of AI applications. It supports 90+ model providers, offers 67+ security attack plugins, and runs entirely locally for privacy. With over 1.6 million npm downloads and adoption by companies serving 10M+ users, it’s become the standard for LLM testing. Get started with:

npm install -g promptfoo
promptfoo init --example getting-started



Introduction

After building an AI-powered customer support chatbot, you might find that users can make it leak sensitive data, bypass guardrails, or provide inconsistent answers. This is common: manual and gut-feel testing often misses vulnerabilities and quality issues, which are expensive to fix post-launch.

Promptfoo brings systematic, automated testing to LLM applications—enabling you to evaluate prompts across multiple models, run red-team security assessments, and catch regressions before they impact users.

This guide will walk you through setting up evaluations, running security scans, integrating with CI/CD, and building a robust test suite for your LLM application.

💡 Tip: If you need to test APIs alongside LLMs, Apidog provides a unified platform for API design, testing, and docs. Use promptfoo for LLMs and Apidog for API validation.

What Is Promptfoo and Why You Need It

Promptfoo is a CLI tool and Node.js library for evaluating and red-teaming LLM applications. Unlike traditional test frameworks, promptfoo is designed for the non-determinism and security challenges in AI development.

Promptfoo output comparison

Promptfoo addresses LLM-specific testing needs with:

  • Semantic assertions (meaning-based, not string-based)
  • LLM-graded evals (one model grades another’s output)
  • Multi-model comparisons (test prompts across GPT-4, Claude, etc.)
  • Security plugins (auto-detect vulnerabilities)

Promptfoo runs fully locally. Your prompts and test data stay on your machine unless you opt into cloud features—ideal for privacy and sensitive data.

The Problem Promptfoo Solves

Manual LLM testing misses regressions and edge cases, and it produces no metrics. Promptfoo replaces this with automated, repeatable evals you can run on every code change, delivering pass/fail rates, cost, and latency metrics.

Who Uses Promptfoo

With 1.6M+ npm downloads, promptfoo is used by:

  • Customer support chatbots
  • Content generation pipelines
  • Healthcare/fintech with compliance needs
  • Security-sensitive systems

Promptfoo is now maintained by OpenAI (as of March 2026), remains open source, and continues to evolve.

Getting Started: Install and Run Your First Eval

Install promptfoo globally or use npx for zero-install runs.

Installation

# Global install (recommended)
npm install -g promptfoo

# Or run with npx (no install)
npx promptfoo@latest

# macOS (Homebrew)
brew install promptfoo

# Python
pip install promptfoo

Set your API keys as environment variables:

export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xxx

Create Your First Eval

Initialize an example project:

promptfoo init --example getting-started
cd getting-started

This creates a promptfooconfig.yaml file with sample prompts, providers, and tests.
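The referenced prompt files are plain text with variable placeholders that promptfoo fills from each test case's vars. As an illustrative example (the file name and wording here are assumptions, not necessarily what the starter project generates), prompts/greeting.txt could contain:

```text
You are a friendly assistant. Reply to this message in one sentence: {{input}}
```

The {{input}} placeholder is substituted with the input variable defined under each test.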

Run the evaluation:

promptfoo eval

View results in the web UI:

promptfoo view

The UI opens at localhost:3000 with a side-by-side comparison of model outputs and assertion results.

Understanding the Config File

promptfooconfig.yaml defines your eval suite:

description: "My First Eval Suite"

prompts:
  - prompts/greeting.txt
  - prompts/farewell.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-5

tests:
  - vars:
      input: "Hello"
    assert:
      - type: contains
        value: "Hi"
      - type: latency
        threshold: 3000
  • prompts: Files or inline text to test
  • providers: List of models (supports 90+)
  • tests: Test cases with variables and assertions

Teams keep eval configs in version control and run them in CI for every pull request.
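Assertions that should apply to every test case can be declared once under defaultTest instead of being repeated per case. A minimal sketch based on promptfoo's config schema (the test values here are illustrative):

```yaml
description: "Eval suite with shared assertions"

prompts:
  - prompts/greeting.txt

providers:
  - openai:gpt-4o

defaultTest:
  assert:
    - type: latency
      threshold: 3000   # every test must respond within 3 seconds

tests:
  - vars:
      input: "Hello"
  - vars:
      input: "Goodbye"
```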

Core Features: What Promptfoo Can Do

1. Automated Evaluations

Define test cases with expected outcomes and run them against any supported model.

Assertion Types

Promptfoo provides 30+ built-in assertions:

  • contains: Output includes a substring
  • equals: Exact string match
  • regex: Match against a regex pattern
  • json-schema: Validate JSON structure
  • javascript: Custom JS function returns pass/fail
  • python: Custom Python function
  • llm-rubric: Use an LLM to grade output
  • similar: Semantic similarity score
  • latency: Response time under a threshold
  • cost: Cost per request under a threshold

Example:

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: javascript
        value: output.length < 100
      - type: latency
        threshold: 2000
      - type: cost
        threshold: 0.001

This checks that the response contains "Paris", keeps the output under 100 characters, requires a response within 2 seconds, and caps cost at $0.001 per request.
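The python assertion type can likewise reference a file. Here is a minimal sketch of a file-based check, assuming promptfoo's convention of a get_assert function that returns a pass/score/reason dict (the expected_city variable is a hypothetical example, not part of the starter project):

```python
# assertions/check_city.py
def get_assert(output: str, context: dict) -> dict:
    """Pass if the output mentions the expected city from the test vars."""
    expected = context.get("vars", {}).get("expected_city", "Paris")
    passed = expected.lower() in output.lower()
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"Found '{expected}'" if passed else f"Missing '{expected}'",
    }
```

Reference it from the config with an assertion of type python and value file://assertions/check_city.py.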

LLM-Graded Evals

Use one LLM to grade another:

assert:
  - type: llm-rubric
    value: "Response should be helpful, harmless, and honest"

The grader can be a cheaper model to reduce costs.
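One way to do this is to point grading at a specific provider through defaultTest options. A hedged sketch (the provider choice and question are illustrative):

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini   # cheaper model performs the grading

tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      - type: llm-rubric
        value: "Response should be helpful, harmless, and honest"
```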

2. Red Teaming and Security Testing

Promptfoo includes red team modules to automatically generate adversarial probes.

Promptfoo red team report

Supported Attack Vectors

  • Prompt Injection: Direct, indirect, and context injection attacks
  • Jailbreaks: DAN, persona switching, role-play bypasses
  • Data Exfiltration: SSRF, system prompt extraction, prompt leakage
  • Harmful Content: Hate speech, dangerous activities, self-harm requests
  • Compliance: PII leakage, HIPAA violations, financial data exposure
  • Audio/Visual: Audio injection and image-based attacks

Running a Red Team Scan

promptfoo redteam init
promptfoo redteam run
promptfoo redteam report [directory]

The tool generates dynamic attack probes, evaluates them, and reports vulnerabilities with severity levels and remediation suggestions.

Example output:

Vulnerability Summary:
- Critical: 2 (PII leakage, prompt extraction)
- High: 5 (jailbreaks, injection attacks)
- Medium: 12 (bias, inconsistent responses)
- Low: 23 (minor policy violations)

Fix critical/high issues and re-scan after changes.
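Scans are driven by a redteam block that promptfoo redteam init generates in your config. A hedged sketch (the purpose text is illustrative, and plugin/strategy names should be checked against the promptfoo red team docs):

```yaml
targets:
  - openai:gpt-4o

redteam:
  purpose: "Customer support chatbot for a retail store"
  plugins:
    - pii        # probe for personal-data leakage
    - harmful    # probe for harmful-content generation
  strategies:
    - jailbreak
    - prompt-injection
```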

3. Code Scanning for Pull Requests

Integrate with GitHub Actions to block LLM-related security issues:

# .github/workflows/promptfoo-scan.yml
name: Promptfoo Code Scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptfoo/promptfoo/code-scan-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

Detects hardcoded API keys, insecure prompt patterns, missing validation, and potential injection vectors.

4. Model Comparison

Compare multiple models side by side:

promptfoo eval
promptfoo view

Web UI displays pass/fail rates, cost, latency, and qualitative differences—helping you select the most effective and cost-efficient model.

Supported Providers: 90+ LLM Integrations

Promptfoo supports 90+ LLM providers. Test prompts across OpenAI, Anthropic, Google, Amazon, Meta, and local models.

  • OpenAI: GPT-4, GPT-4o, GPT-4o-mini, o1, o3
  • Anthropic: Claude 3.5/3.7/4.5/4.6, Thinking models
  • Google: Gemini 1.5/2.0, Vertex AI
  • Microsoft: Azure OpenAI, Phi
  • Amazon: Bedrock (Claude, Llama, Titan)
  • Meta: Llama 3, 3.1, 3.2 (via multiple providers)
  • Ollama: Local models (Llama, Mistral, Phi)

Custom Providers

Write custom providers in Python or JS if your model isn’t supported.

Python:

# custom_provider.py
# `my_async_api` is a placeholder for your own async model client.
import my_async_api

class CustomProvider:
    async def call_api(self, prompt: str, options: dict, context: dict) -> dict:
        """Called by promptfoo for each test case; returns output plus token usage."""
        response = await my_async_api.generate(prompt)
        return {
            "output": response.text,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens,
            },
        }

JavaScript:

// customProvider.js
// `myApi` is a placeholder for your own model client module.
import myApi from './myApi.js';

export default class CustomProvider {
  async callApi(prompt) {
    // promptfoo expects an object with `output` and optional `tokenUsage`
    return {
      output: await myApi.generate(prompt),
      tokenUsage: { total: 50, prompt: 20, completion: 30 }
    };
  }
}

Register in promptfooconfig.yaml:

providers:
  - id: file://custom_provider.py
    config:
      api_key: ${MY_API_KEY}

Command-Line Interface: Essential Commands

Promptfoo CLI covers all daily workflows.

Core Commands

# Run evaluations
promptfoo eval -c promptfooconfig.yaml

# Open web UI
promptfoo view

# Share results online
promptfoo share

# Red team testing
promptfoo redteam init
promptfoo redteam run

# Configuration
promptfoo init
promptfoo validate [config]

# Results management
promptfoo list
promptfoo show <id>
promptfoo delete <id>
promptfoo export <id>

# Utilities
promptfoo cache clear
promptfoo retry <id>

Useful Flags

--no-cache              # Disable caching for fresh results
--max-concurrency <n>   # Limit parallel API calls
--output <file>         # Write results to JSON file
--verbose               # Enable debug logging
--env-file <path>       # Load environment variables from file
--filter <pattern>      # Run specific test cases

Example: Run Eval with Custom Settings

promptfoo eval \
  -c promptfooconfig.yaml \
  --no-cache \
  --max-concurrency 3 \
  --output results.json \
  --env-file .env

CI/CD Integration: Automate LLM Testing

Integrate promptfoo into your CI/CD pipeline to automate regression detection.

GitHub Actions Example

name: LLM Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm install -g promptfoo
      - run: promptfoo eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Quality Gates

Set pass thresholds in your config:

commandLineOptions:
  threshold: 0.8  # Require 80% pass rate

This fails CI if tests don’t meet the threshold.

Caching in CI

Speed up runs with caching:

- uses: actions/cache@v4
  with:
    path: ~/.cache/promptfoo
    key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

Cached results skip API calls for unchanged tests.

Web UI: Visualize and Share Results

promptfoo view opens an interactive UI at localhost:3000:

  • Eval matrix: Side-by-side comparison
  • Filtering: By status or provider
  • Diff view: See changes between runs
  • Sharing: Generate shareable links
  • Live updates: Watch evals run in real-time

Security: the UI includes CSRF protection, but don't expose it to untrusted networks. Use promptfoo share for cloud sharing, or self-host behind authentication.

Database and Caching

Cache Location

  • macOS/Linux: ~/.cache/promptfoo
  • Windows: %LOCALAPPDATA%\promptfoo

Use --no-cache during development for fresh runs.

Database Location

  • All platforms: ~/.promptfoo/promptfoo.db (SQLite)

Stores historical eval results; deleting it permanently removes your run history.

Security Model: What You Can Trust

Promptfoo separates trusted (code-executed) and untrusted (data-only) inputs.

Trusted Inputs

  • Config files (promptfooconfig.yaml)
  • Custom JS/Python/Ruby assertions
  • Provider configs
  • Transform functions

Only load these from sources you trust, since they can execute code.

Untrusted Inputs

  • Prompt text
  • Test case variables
  • Model outputs
  • Remote content fetched during evals

These are treated strictly as data and are never executed.

Hardening Recommendations

  • Run in containers/VMs with minimal privileges
  • Use least-privileged API keys
  • Don’t put secrets in prompts/configs
  • Restrict network egress for third-party code
  • Don’t expose local web UI to untrusted networks

Performance: Optimize Your Evals

Optimization Tips

  1. Use caching (default) for speed
  2. Tune concurrency with --max-concurrency for speed vs. API limits
  3. Filter tests with --filter during development
  4. Run each test multiple times with --repeat to gauge output consistency

Scaling for Large Evals

  • Use the scheduler (src/scheduler/) for distributed runs
  • Use remote generation for heavy compute
  • Export to Google Sheets for team visibility

Extensibility: Build Custom Features

Custom Assertions

Write your own assertion logic:

// assertions/customCheck.js
export default function customCheck(output, context) {
  const pass = output.includes('expected');
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matched' : 'Missing expected content'
  };
}

Use in config:

assert:
  - type: file://assertions/customCheck.js

MCP Server

Run the MCP server for AI assistant integration:

promptfoo mcp

Enables agents to:

  • Run evals from chat
  • Access red team scans
  • Query results
  • Generate test cases

Real-World Use Cases

Customer Support Chatbot

  • 500+ test cases
  • Multi-model evals (GPT-4, Claude)
  • Red team for PII/jailbreak
  • CI blocks deploys on failures
  • Result: 90% fewer customer-reported issues

Content Generation Pipeline

  • LLM-graded evals for tone/style
  • Latency/cost thresholds
  • Model comparison for best value
  • Result: Consistent voice, 40% lower API costs

Healthcare Application

  • Red team for HIPAA
  • Custom assertions for medical accuracy
  • Local evals for privacy
  • Auditable trails
  • Result: Passed SOC 2 audit using promptfoo

Conclusion

Promptfoo brings data-driven, automated testing to LLM applications—catching regressions, security issues, and quality problems before production.

Key steps:

  • Install: npm install -g promptfoo then promptfoo init
  • Use semantic assertions for robust validation
  • Run red team scans for security
  • Integrate with CI/CD
  • Compare models objectively
  • Extend with custom providers/assertions

With promptfoo, you can confidently build, test, and secure LLM applications at scale.

If you work with APIs, use Apidog with promptfoo. Apidog handles API design, testing, and docs; promptfoo covers LLM evaluation. Together, you get full-stack testing for modern apps.

FAQ

What is promptfoo used for?

Testing and evaluating LLM applications: automated tests against prompts, cross-model output comparisons, and security red-team assessments.

Is promptfoo free?

Yes, it’s open source (MIT). Free for personal and commercial use. Cloud/enterprise features may require paid plans.

How do I install promptfoo?

npm install -g promptfoo

Or use npx promptfoo@latest, brew install promptfoo (macOS), or pip install promptfoo (Python).

What models does promptfoo support?

90+ providers: OpenAI (GPT-4/o1), Anthropic (Claude 3.5/4/4.5), Google (Gemini), Microsoft (Azure OpenAI), Amazon Bedrock, Ollama local models, and more.

How do I run a red team scan?

promptfoo redteam init
promptfoo redteam run
promptfoo redteam report

Can I use promptfoo in CI/CD?

Yes—install promptfoo in your CI pipeline and run promptfoo eval with your config. Set threshold to enforce quality gates.

Does promptfoo send my data to external servers?

No. All runs are local unless you opt into cloud features. Cache and DB files are local.

How do I compare models with promptfoo?

List multiple providers in your config, run promptfoo eval, then promptfoo view to compare pass rates, cost, and latency in the web UI.
