DEV Community

Tang Weigang
Tang Weigang

Posted on

Do not add LLM evals after launch: use promptfoo to define the failure boundary first

A common LLM workflow mistake is to tune the prompt first and add evaluation later. That order feels fast, but it leaves the team with a weak release boundary. When the workflow fails in production, nobody can say exactly which behavior should have blocked the release.

The category claim matters: Doramagic is not a prompt library and not a README summary. It turns an open-source project into a portable AI agent capability asset: source map, host instructions, prompt preview, pitfall log, eval or smoke check, boundary card, human manual, test log, and feedback path.

The useful way to read promptfoo is not as a leaderboard tool. It is a contract layer for LLM workflows: before changing prompts, switching models, adding tools, or giving an agent more authority, define what must pass, what must fail, and what evidence is enough.

The Doramagic promptfoo manual highlights four surfaces that matter in practice.

1. An eval is not a score. It is a reproducible judgment set.

Promptfoo's core engine ties together configuration, provider calls, assertion grading, and result aggregation. YAML or JSON config can place prompts, test cases, providers, and assertions in one run.

That changes the question from "did the model sound good?" to a more useful checklist:

  • Which inputs must pass every time?
  • Which outputs must be rejected?
  • Which tool-call structures must be valid?
  • Which checks should be deterministic, and which deserve an LLM-as-judge rubric?
  • After a prompt, provider, or model change, which cases must run again?

If those answers are not in the eval config, the final score is mostly decoration.

2. Agents can run evals, but the scope must be explicit.

One important detail in the manual is promptfoo's MCP tool surface. Tools such as list_evaluations, get_evaluation_details, run_evaluation, and share_evaluation allow AI agents to drive evaluations programmatically.

That is powerful, but it needs a boundary. run_evaluation accepts a config path, optional test-case filters, prompt and provider filters, concurrency settings, timeout limits, and pagination controls. An agent should not treat this as a casual "run everything" button.

Before running an eval, the agent should state which config it will use, which cases it will run, how much concurrency it needs, what timeout applies, and why this run is the right evidence for the current change.

3. Many providers do not mean automatic portability.

Promptfoo supports a broad provider ecosystem: OpenAI, Anthropic, Google Vertex, xAI, Bedrock, Cerebras, agent SDKs, MCP tools, and custom gateways. That breadth is useful because one test set can compare different execution surfaces.

But every provider has its own behavior around structured outputs, tool calls, timeouts, caching, permissions, and cost. A test that passes with one provider should not be treated as proof that another provider is safe.

For a first run, I would not start with a giant model comparison. I would fix one provider and one real workflow, run a small smoke eval, then add the second provider only after the assertions are stable.

4. Red teaming should become a negative-case release contract.

Promptfoo's redteam layer reuses the evaluation engine and adds adversarial providers, target invocation, judging, and iterative feedback. The manual notes presets around prompt injection, harmful content, PII, and related risks.

The mistake is treating red teaming as a one-off report. The useful version is a release gate: which prompt-injection attempts must fail, which PII outputs must be blocked, and which tool permissions must be denied before the workflow ships.

If the redteam output does not feed CI, a release checklist, or a human review step, it is only a demo.

A safer first-run path

For an AI coding agent, I would use promptfoo like this:

  1. Read the Doramagic manual and identify the runtime boundary.
  2. Create a minimal promptfoo config in a temporary directory.
  3. Use one provider and one real workflow first.
  4. Run 3 to 5 high-value test cases before expanding the suite.
  5. Start with deterministic assertions, then add LLM rubrics only where judgment is unavoidable.
  6. Report passing examples, failing examples, cost, latency, and what changed.
  7. Do not claim the workflow is ready until failed cases are reviewed.

The Doramagic promptfoo pack does not replace the upstream docs. Its job is to make promptfoo loadable by an AI coding host as an operating contract: read the manual, check the pitfalls, run the evals, and only then recommend a prompt or model change.

The feedback loop is part of the asset. If the first run exposes a new failure case, it should update the pitfall log, eval suite, boundary card, or human manual. Otherwise the article is just content; the capability asset is what lets the next agent start from better evidence.

Doramagic manual: https://doramagic.ai/en/projects/promptfoo/manual/
Doramagic project page: https://doramagic.ai/en/projects/promptfoo/
GitHub pack: https://github.com/tangweigang-jpg/doramagic-promptfoo-pack
Upstream project: https://github.com/promptfoo/promptfoo

Disclosure: this is an independent Doramagic project asset, not an official upstream release.

Top comments (0)