TL;DR
Promptfoo is an open-source LLM evaluation and red-teaming framework that enables systematic, automated testing of AI applications. It supports 90+ model providers, offers 67+ security attack plugins, and runs entirely locally for privacy. With over 1.6 million npm downloads and adoption by companies serving 10M+ users, it has become one of the standard tools for LLM testing. Get started with:
npm install -g promptfoo
promptfoo init --example getting-started
Introduction
After building an AI-powered customer support chatbot, you might find that users can make it leak sensitive data, bypass guardrails, or provide inconsistent answers. This is common: manual and gut-feel testing often misses vulnerabilities and quality issues, which are expensive to fix post-launch.
Promptfoo brings systematic, automated testing to LLM applications—enabling you to evaluate prompts across multiple models, run red-team security assessments, and catch regressions before they impact users.
This guide will walk you through setting up evaluations, running security scans, integrating with CI/CD, and building a robust test suite for your LLM application.
💡 Tip: If you need to test APIs alongside LLMs, Apidog provides a unified platform for API design, testing, and docs. Use promptfoo for LLMs and Apidog for API validation.
What Is Promptfoo and Why You Need It
Promptfoo is a CLI tool and Node.js library for evaluating and red-teaming LLM applications. Unlike traditional test frameworks, promptfoo is designed for the non-determinism and security challenges in AI development.
Promptfoo addresses LLM-specific testing needs with:
- Semantic assertions (meaning-based, not string-based)
- LLM-graded evals (one model grades another’s output)
- Multi-model comparisons (test prompts across GPT-4, Claude, etc.)
- Security plugins (auto-detect vulnerabilities)
Promptfoo runs fully locally. Your prompts and test data stay on your machine unless you opt into cloud features—ideal for privacy and sensitive data.
The Problem Promptfoo Solves
Manual LLM testing misses regressions and edge cases and produces no metrics. Promptfoo replaces it with automated, repeatable evals you can run on every code change, delivering pass/fail rates, cost, and latency metrics.
Who Uses Promptfoo
With 1.6M+ npm downloads, promptfoo is used by:
- Customer support chatbots
- Content generation pipelines
- Healthcare/fintech with compliance needs
- Security-sensitive systems
Promptfoo remains open source, is actively maintained, and continues to evolve.
Getting Started: Install and Run Your First Eval
Install promptfoo globally or use npx for zero-install runs.
Installation
# Global install (recommended)
npm install -g promptfoo
# Or run with npx (no install)
npx promptfoo@latest
# macOS (Homebrew)
brew install promptfoo
# Python
pip install promptfoo
Set your API keys as environment variables:
export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xxx
Create Your First Eval
Initialize an example project:
promptfoo init --example getting-started
cd getting-started
This creates a promptfooconfig.yaml file with sample prompts, providers, and tests.
Run the evaluation:
promptfoo eval
View results in the web UI:
promptfoo view
The UI opens at localhost:15500 by default, with a side-by-side comparison of model outputs and assertion results.
Understanding the Config File
promptfooconfig.yaml defines your eval suite:
description: "My First Eval Suite"
prompts:
  - prompts/greeting.txt
  - prompts/farewell.txt
providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-5
tests:
  - vars:
      input: "Hello"
    assert:
      - type: contains
        value: "Hi"
      - type: latency
        threshold: 3000
- prompts: Files or inline text to test
- providers: List of models (supports 90+)
- tests: Test cases with variables and assertions
Teams keep eval configs in version control and run them in CI for every pull request.
Core Features: What Promptfoo Can Do
1. Automated Evaluations
Define test cases with expected outcomes and run them against any supported model.
Assertion Types
Promptfoo provides 30+ built-in assertions:
| Assertion | Purpose |
|---|---|
| contains | Output includes a substring |
| equals | Exact string match |
| regex | Match against a regex pattern |
| is-json | Validate JSON structure (optionally against a schema) |
| javascript | Custom JS function returns pass/fail |
| python | Custom Python function |
| llm-rubric | Use an LLM to grade output |
| similar | Semantic similarity score |
| latency | Response time under threshold |
| cost | Cost per request under threshold |
Example:
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: javascript
        value: output.length < 100
      - type: latency
        threshold: 2000
      - type: cost
        threshold: 0.001
This checks that the response contains "Paris", keeps output under 100 characters, requires a response within 2 seconds, and caps cost at $0.001 per request.
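Semantic checks slot in the same way; a minimal sketch using the similar assertion (the 0.85 threshold is illustrative and should be tuned per use case):

```yaml
assert:
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.85  # minimum semantic similarity score to pass
```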
LLM-Graded Evals
Use one LLM to grade another:
assert:
  - type: llm-rubric
    value: "Response should be helpful, harmless, and honest"
The grader can be a cheaper model to reduce costs.
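To control grading cost, recent promptfoo versions let you override the grading provider through defaultTest in the config; the model name below is illustrative:

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini  # cheaper grader model; swap for any provider you use
```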
2. Red Teaming and Security Testing
Promptfoo includes red team modules to automatically generate adversarial probes.
Supported Attack Vectors
| Category | What It Tests |
|---|---|
| Prompt Injection | Direct, indirect, and context injection attacks |
| Jailbreaks | DAN, persona switching, role-play bypasses |
| Data Exfiltration | SSRF, system prompt extraction, prompt leakage |
| Harmful Content | Hate speech, dangerous activities, self-harm requests |
| Compliance | PII leakage, HIPAA violations, financial data exposure |
| Audio/Visual | Audio injection and image-based attacks |
Running a Red Team Scan
promptfoo redteam init
promptfoo redteam run
promptfoo redteam report [directory]
The tool generates dynamic attack probes, evaluates them, and reports vulnerabilities with severity levels and remediation suggestions.
Example output:
Vulnerability Summary:
- Critical: 2 (PII leakage, prompt extraction)
- High: 5 (jailbreaks, injection attacks)
- Medium: 12 (bias, inconsistent responses)
- Low: 23 (minor policy violations)
Fix critical/high issues and re-scan after changes.
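Scans can also be scoped in promptfooconfig.yaml. This is a hedged sketch: the purpose text is hypothetical, and plugin/strategy identifiers vary by version, so check the output of promptfoo redteam init for the exact names your install supports:

```yaml
redteam:
  purpose: "Customer support assistant for a retail store"  # guides probe generation
  plugins:
    - pii        # personal-data leakage probes
    - harmful    # harmful-content probes
  strategies:
    - jailbreak  # iterative jailbreak attempts
```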
3. Code Scanning for Pull Requests
Integrate with GitHub Actions to block LLM-related security issues:
# .github/workflows/promptfoo-scan.yml
name: Promptfoo Code Scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptfoo/promptfoo/code-scan-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
Detects hardcoded API keys, insecure prompt patterns, missing validation, and potential injection vectors.
4. Model Comparison
Compare multiple models side by side:
promptfoo eval
promptfoo view
Web UI displays pass/fail rates, cost, latency, and qualitative differences—helping you select the most effective and cost-efficient model.
Supported Providers: 90+ LLM Integrations
Promptfoo supports 90+ LLM providers. Test prompts across OpenAI, Anthropic, Google, Amazon, Meta, and local models.
| Provider | Models |
|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-4o-mini, o1, o3 |
| Anthropic | Claude 3.5/3.7/4/4.5, extended thinking models |
| Google | Gemini 1.5/2.0, Vertex AI |
| Microsoft | Azure OpenAI, Phi |
| Amazon | Bedrock (Claude, Llama, Titan) |
| Meta | Llama 3, 3.1, 3.2 (via multiple providers) |
| Ollama | Local models (Llama, Mistral, Phi) |
Custom Providers
Write custom providers in Python or JS if your model isn’t supported.
Python:
# custom_provider.py
class CustomProvider:
    async def call_api(self, prompt: str, options: dict, context: dict) -> dict:
        # my_async_api is a placeholder for your model client
        response = await my_async_api.generate(prompt)
        return {
            "output": response.text,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens
            }
        }
JavaScript:
// customProvider.js
export default class CustomProvider {
  async callApi(prompt) {
    return {
      output: await myApi.generate(prompt),
      tokenUsage: { total: 50, prompt: 20, completion: 30 }
    };
  }
}
Register in promptfooconfig.yaml:
providers:
  - id: file://custom_provider.py
    config:
      api_key: ${MY_API_KEY}
Command-Line Interface: Essential Commands
Promptfoo CLI covers all daily workflows.
Core Commands
# Run evaluations
promptfoo eval -c promptfooconfig.yaml
# Open web UI
promptfoo view
# Share results online
promptfoo share
# Red team testing
promptfoo redteam init
promptfoo redteam run
# Configuration
promptfoo init
promptfoo validate [config]
# Results management
promptfoo list
promptfoo show <id>
promptfoo delete <id>
promptfoo export <id>
# Utilities
promptfoo cache clear
promptfoo retry <id>
Useful Flags
--no-cache # Disable caching for fresh results
--max-concurrency <n> # Limit parallel API calls
--output <file> # Write results to JSON file
--verbose # Enable debug logging
--env-file <path> # Load environment variables from file
--filter <pattern> # Run specific test cases
Example: Run Eval with Custom Settings
promptfoo eval \
-c promptfooconfig.yaml \
--no-cache \
--max-concurrency 3 \
--output results.json \
--env-file .env
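Once results land in results.json, you can post-process them, for example to compute a pass rate for dashboards or CI logs. The exact export schema varies across promptfoo versions, so the nested results.results layout and success field below are assumptions to adapt:

```python
import json

def summarize(path: str) -> float:
    """Return the fraction of passing test cases in a promptfoo JSON export.

    Assumes a layout like {"results": {"results": [{"success": bool, ...}]}};
    adjust the keys to whatever schema your promptfoo version emits.
    """
    with open(path) as f:
        data = json.load(f)
    rows = data["results"]["results"]
    passed = sum(1 for row in rows if row.get("success"))
    return passed / len(rows) if rows else 0.0

# Usage: print(f"pass rate: {summarize('results.json'):.0%}")
```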
CI/CD Integration: Automate LLM Testing
Integrate promptfoo into your CI/CD pipeline to automate regression detection.
GitHub Actions Example
name: LLM Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm install -g promptfoo
      - run: promptfoo eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Quality Gates
Set pass thresholds in your config:
commandLineOptions:
  threshold: 0.8  # Require 80% pass rate
This fails CI if tests don’t meet the threshold.
Caching in CI
Speed up runs with caching:
- uses: actions/cache@v4
  with:
    path: ~/.cache/promptfoo
    key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
Cached results skip API calls for unchanged tests.
Web UI: Visualize and Share Results
promptfoo view opens an interactive UI at localhost:15500 by default:
- Eval matrix: Side-by-side comparison
- Filtering: By status or provider
- Diff view: See changes between runs
- Sharing: Generate shareable links
- Live updates: Watch evals run in real-time
Security: UI has CSRF protection. Don’t expose it to untrusted networks. Use promptfoo share for cloud sharing or self-host with auth.
Database and Caching
Cache Location
- macOS/Linux: ~/.cache/promptfoo
- Windows: %LOCALAPPDATA%\promptfoo
Use --no-cache during development for fresh runs.
Database Location
- All platforms: ~/.promptfoo/promptfoo.db (SQLite)
Stores historical evals; don’t delete unless you want to lose data.
Security Model: What You Can Trust
Promptfoo separates trusted (code-executed) and untrusted (data-only) inputs.
Trusted Inputs
- Config files (promptfooconfig.yaml)
- Custom JS/Python/Ruby assertions
- Provider configs
- Transform functions
Only use from trusted sources.
Untrusted Inputs
- Prompt text
- Test case variables
- Model outputs
- Remote content fetched during evals
Treated strictly as data.
Hardening Recommendations
- Run in containers/VMs with minimal privileges
- Use least-privileged API keys
- Don’t put secrets in prompts/configs
- Restrict network egress for third-party code
- Don’t expose local web UI to untrusted networks
Performance: Optimize Your Evals
Optimization Tips
- Use caching (default) for speed
- Tune concurrency with --max-concurrency to balance speed against API rate limits
- Filter tests with --filter during development
- Repeat tests with --repeat to check output consistency while iterating
Scaling for Large Evals
- Use the scheduler (src/scheduler/) for distributed runs
- Use remote generation for heavy compute
- Export to Google Sheets for team visibility
Extensibility: Build Custom Features
Custom Assertions
Write your own assertion logic:
// assertions/customCheck.js
export default function customCheck(output, context) {
  const pass = output.includes('expected');
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matched' : 'Missing expected content'
  };
}
Use in config:
assert:
  - type: file://assertions/customCheck.js
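Python assertions follow the same contract; promptfoo's docs describe a get_assert(output, context) entry point that returns a bool, a score, or a grading dict (signature details may vary by version):

```python
# assertions/custom_check.py
def get_assert(output, context):
    """Custom promptfoo-style assertion: pass when the output mentions 'expected'.

    context carries test vars and prompt metadata; it is unused in this sketch.
    """
    passed = "expected" in output
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "Output matched" if passed else "Missing expected content",
    }
```

Reference it with type: file://assertions/custom_check.py, mirroring the JavaScript registration above.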
MCP Server
Run the MCP server for AI assistant integration:
promptfoo mcp
Enables agents to:
- Run evals from chat
- Access red team scans
- Query results
- Generate test cases
Real-World Use Cases
Customer Support Chatbot
- 500+ test cases
- Multi-model evals (GPT-4, Claude)
- Red team for PII/jailbreak
- CI blocks deploys on failures
- Result: 90% fewer customer-reported issues
Content Generation Pipeline
- LLM-graded evals for tone/style
- Latency/cost thresholds
- Model comparison for best value
- Result: Consistent voice, 40% lower API costs
Healthcare Application
- Red team for HIPAA
- Custom assertions for medical accuracy
- Local evals for privacy
- Auditable trails
- Result: Passed SOC 2 audit using promptfoo
Conclusion
Promptfoo brings data-driven, automated testing to LLM applications—catching regressions, security issues, and quality problems before production.
Key steps:
- Install: npm install -g promptfoo, then promptfoo init
- Use semantic assertions for robust validation
- Run red team scans for security
- Integrate with CI/CD
- Compare models objectively
- Extend with custom providers/assertions
With promptfoo, you can confidently build, test, and secure LLM applications at scale.
If you work with APIs, use Apidog with promptfoo. Apidog handles API design, testing, and docs; promptfoo covers LLM evaluation. Together, you get full-stack testing for modern apps.
FAQ
What is promptfoo used for?
Testing and evaluating LLM applications: automated tests against prompts, cross-model output comparisons, and security red-team assessments.
Is promptfoo free?
Yes, it’s open source (MIT). Free for personal and commercial use. Cloud/enterprise features may require paid plans.
How do I install promptfoo?
npm install -g promptfoo
Or use npx promptfoo@latest, brew install promptfoo (macOS), or pip install promptfoo (Python).
What models does promptfoo support?
90+ providers: OpenAI (GPT-4/o1), Anthropic (Claude 3.5/4/4.5), Google (Gemini), Microsoft (Azure OpenAI), Amazon Bedrock, Ollama local models, and more.
How do I run a red team scan?
promptfoo redteam init
promptfoo redteam run
promptfoo redteam report
Can I use promptfoo in CI/CD?
Yes—install promptfoo in your CI pipeline and run promptfoo eval with your config. Set threshold to enforce quality gates.
Does promptfoo send my data to external servers?
No. All runs are local unless you opt into cloud features. Cache and DB files are local.
How do I compare models with promptfoo?
List multiple providers in your config, run promptfoo eval, then promptfoo view to compare pass rates, cost, and latency in the web UI.

