Launch week day 1: drive the full AI testing workflow from inside any AI tool

Nicolai Bohn — Mon, 04 May 2026 17:17:03 +0000

Building an AI agent uses one tool. Testing it uses another. Every iteration cycle ends with you switching to your test platform UI, running tests, inspecting results, then switching back to your editor to fix what broke. Each context switch is small, but they add up.

Today we shipped the fix: the Rhesis Agent Skill, day 1 of Rhesis Launch Week.

If you build LLM agents and use Claude Code, Cursor, Codex, Gemini CLI, or any of 40+ other AI tools, you can now drive the full Rhesis testing workflow from inside the chat where you write the code.

What it does

The Agent Skill packages our domain knowledge into a portable skill file that any compatible AI tool can load. Once installed, your AI assistant gains:

Endpoint discovery in Quick or Comprehensive mode
Test suite design with behaviors, test sets, and metrics
Confirmation guards that wait for approval before anything is created
Test execution against your endpoints
Failure analysis with pass/fail summaries and links back to runs

All of this is powered by the Rhesis MCP server (27 tools covering test sets, behaviors, metrics, runs, and OData queries) and driven entirely in natural language.

Install

A single command covers all your AI tools:

npx skills add rhesis-ai/rhesis -g

The CLI detects which AI tools you have installed and configures the skill for each one. Then set your API token:

export RHESIS_API_KEY=rhs_your_token_here

Get a token at app.rhesis.ai/tokens.
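
If the skill cannot authenticate, the first thing worth ruling out is the environment variable itself. The following is a minimal preflight sketch, not part of the Rhesis tooling: the function name check_rhesis_token is hypothetical, and the rhs_ prefix check is only an assumption based on the placeholder shown above.

```python
import os


def check_rhesis_token(env=None):
    """Return (ok, message) describing whether RHESIS_API_KEY looks usable."""
    env = os.environ if env is None else env
    token = env.get("RHESIS_API_KEY", "").strip()
    if not token:
        return False, "RHESIS_API_KEY is not set"
    if token == "rhs_your_token_here":
        return False, "RHESIS_API_KEY still holds the placeholder value"
    if not token.startswith("rhs_"):
        # Assumption: real tokens share the rhs_ prefix seen in the example,
        # so this is reported as a warning rather than treated as an error.
        return True, "RHESIS_API_KEY is set but lacks the rhs_ prefix"
    return True, "RHESIS_API_KEY is set"


if __name__ == "__main__":
    ok, message = check_rhesis_token()
    print(("OK: " if ok else "ERROR: ") + message)
```

The check only inspects the local environment; it does not contact the Rhesis API, so a passing result says nothing about whether the token is still valid.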

Claude Code

Claude Code uses a plugin system that bundles the skill and MCP server config together:

/plugin marketplace add rhesis-ai/rhesis
/plugin install rhesis@rhesis-ai

Cursor

Add the MCP server to your .cursor/mcp.json:

{
  "mcpServers": {
    "rhesis": {
      "url": "https://api.rhesis.ai/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_RHESIS_API_KEY"
      }
    }
  }
}

For self-hosted backends, swap https://api.rhesis.ai/mcp for http://localhost:8080/mcp.
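
If you switch between the hosted and self-hosted backends, generating the config can be less error-prone than editing it by hand. Here is a small sketch that builds the same JSON shape shown above from the two URLs in this post. The helper name build_cursor_config and the self_hosted flag are illustrative choices, not part of any Rhesis or Cursor API.

```python
import json
import os

# Both URLs are taken from the post above.
HOSTED_MCP_URL = "https://api.rhesis.ai/mcp"
SELF_HOSTED_MCP_URL = "http://localhost:8080/mcp"


def build_cursor_config(api_key, self_hosted=False):
    """Return the mcp.json structure for the Rhesis MCP server as a dict."""
    url = SELF_HOSTED_MCP_URL if self_hosted else HOSTED_MCP_URL
    return {
        "mcpServers": {
            "rhesis": {
                "url": url,
                "headers": {"Authorization": f"Bearer {api_key}"},
            }
        }
    }


if __name__ == "__main__":
    # Falls back to the placeholder when the variable is not exported.
    key = os.environ.get("RHESIS_API_KEY", "YOUR_RHESIS_API_KEY")
    print(json.dumps(build_cursor_config(key), indent=2))
```

The script prints the config instead of writing the file, so an existing .cursor/mcp.json with other servers is not overwritten. Note that the printed output contains your token.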

Use it

Type something like:

"Test my support agent on billing scenarios, run it, and rank the failures by severity."

The skill walks the conversation through a 6-step loop:

Discover: explores what your endpoint can do
Plan: proposes a test suite with behaviors and metrics
Review: waits for your approval before creating anything
Create: builds entities on the platform following the approved plan
Execute: runs tests once you confirm
Analyze: surfaces a pass/fail summary, failure patterns, and links back to results
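
To make the two approval points in that loop concrete, here is a toy model of the control flow. It is a sketch of the sequence described above, not the skill's actual implementation: run_loop and the confirm callback are hypothetical names, and stopping outright on a declined prompt is an assumption (the real skill may return to planning instead).

```python
from typing import Callable, List

STEPS = ["Discover", "Plan", "Review", "Create", "Execute", "Analyze"]


def run_loop(confirm: Callable[[str], bool]) -> List[str]:
    """Walk the six steps and return the ones that completed.

    `confirm` is a hypothetical callback that receives a prompt and
    returns True (approved) or False (declined).
    """
    completed = []
    for step in STEPS:
        if step == "Review" and not confirm("Approve the proposed test plan?"):
            break  # declined plan: nothing is created on the platform
        if step == "Execute" and not confirm("Run the tests now?"):
            break  # entities exist, but no test run is started
        completed.append(step)
    return completed


if __name__ == "__main__":
    # Approve the plan but decline the run.
    print(run_loop(lambda prompt: prompt.startswith("Approve")))
```

In this model, declining at Review leaves only Discover and Plan completed, which mirrors the guarantee that nothing is created without your approval.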

For ad-hoc operations:

"List my existing test sets."
"Improve the Safety Compliance metric. Make the threshold stricter."
"Compare my last two test runs."

The host agent's native confirmation handles the safety guard, so destructive actions never happen without your sign-off.

Issues in conversational AI apps are so obvious right now that even John Oliver felt the need to dedicate a whole episode to the topic: https://www.youtube.com/watch?v=Ykvf3MunGf8 #AISafety #AIEvals #Testing #QA

Nicolai Bohn — Tue, 28 Apr 2026 09:24:37 +0000
