Testing AI agents is still a hard problem for most teams: you either click through scenarios by hand, or you maintain a pile of custom test scripts. Rogue is an open source framework that takes a different approach - it uses one agent to test another agent automatically.
⭐ Star Rogue on GitHub to support the project and stay updated!
What is Rogue? 🤔
Rogue is a powerful evaluation framework that tests AI agents by having them face off against a dynamic EvaluatorAgent. Think of it as a sparring partner for your AI - it generates scenarios, runs conversations, and grades performance to ensure your agent behaves exactly as intended.
Quick Start: Get Testing in 2 Minutes ⚡
Prerequisites
Make sure uv is installed:
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or via pip
pip install uv
# Verify installation
uv --version
Get Started
The fastest way to see Rogue in action is with the built-in T-shirt store example:
# Install and run with example agent
uvx rogue-ai --example=tshirt_store
This single command:
- Starts a demo T-shirt-selling agent on localhost:10001
- Starts the Rogue server
- Launches Rogue's TUI interface
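Before moving on, you can sanity-check that the demo agent is up by fetching its A2A agent card. A quick sketch, assuming the example agent is on localhost:10001 (the well-known path follows the A2A convention and may be /.well-known/agent-card.json on newer SDK versions):
import httpx

# Fetch the demo agent's A2A agent card to confirm it is reachable.
# Newer A2A versions may serve it at /.well-known/agent-card.json instead.
card = httpx.get("http://localhost:10001/.well-known/agent.json").json()
print(card["name"], "-", card.get("description"))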
Full Walkthrough: Evaluating Your Agent 🧪
Let's walk through a complete evaluation using Rogue's powerful TUI interface. Start by launching Rogue:
# Launch Rogue with the example agent
uvx rogue-ai --example=tshirt_store
# Or connect to your own agent
uvx rogue-ai
Now let's go through the 5-step evaluation process:
Step 1: Configure Your Judge Model 🤖
Type /models to open the model configuration:
/models
What you'll see:
- Available Models: List of supported LLM providers (OpenAI, Anthropic, Google)
- API Key Setup: Securely add your API keys for the providers
- Judge Selection: Choose which model will evaluate your agent's responses
Step 2: Create Business Context & Scenarios 📝
You have two powerful options here:
Option A: Manual Editor
Type /editor to open the business context editor:
/editor
Write your business context:
# T-Shirt Store Agent Context
This agent sells t-shirts for Shirtify. Key requirements:
## Products & Pricing
- Regular and V-neck shirts only
- Colors: White, Black, Red, Blue, Green
- Fixed price: $19.99 USD (no discounts!)
## Policies
- No free merchandise
- No sales or promotions
- USD payments only
- Professional customer service
Option B: AI Interview
Hit i to start an interactive interview where Rogue asks you questions:
i
The AI will ask questions like:
🤖 What does your agent do?
👤 It sells t-shirts for an online store
🤖 What are the key policies it should follow?
👤 No discounts, fixed pricing at $19.99, USD only
🤖 What scenarios should we test?
👤 Price negotiations, wrong currency, inventory checks
Rogue automatically generates comprehensive business context and test scenarios from your answers!
Step 3: Start the Evaluation ⚡
Type /eval to open evaluation settings:
/eval
Configuration Options:
- Deep Test Mode: Toggle ON for more thorough testing (recommended)
- Agent URL: Your agent's endpoint (auto-filled for examples)
Step 4: Watch the Live Agent Battle 👀
Once started, you'll see real-time conversations between Rogue's EvaluatorAgent and your agent.
Step 5: View the Complete Report 📊
When the evaluation completes (or at any time during the run), hit r to see the full report:
r
The report gives a comprehensive breakdown of how your agent performed across all generated scenarios.
Architecture: How Rogue Works 🏗️
Rogue is built with an open, extensible architecture that separates evaluation logic from user interfaces. This design makes it easy to contribute new features, integrate with existing tools, and customize for your needs:
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Your Agent    │◄────►│   Rogue Server   │◄────►│     Clients     │
│                 │      │                  │      │  - TUI          │
│ (Any A2A Agent) │      │ - EvaluatorAgent │      │  - Web UI       │
│                 │      │ - Scenario Gen   │      │  - CLI          │
│                 │      │ - Reporting      │      │                 │
└─────────────────┘      └──────────────────┘      └─────────────────┘
Components:
- Rogue Server: Core evaluation engine
- Clients: Different ways to interact (TUI, Web, CLI)
- Your Agent: Currently must implement Google's A2A protocol, but more transports are coming soon!
Deep Dive: Wrapping Your Agent with AgentExecutor 💻
Want to connect your existing agent to Rogue? You'll need to implement an AgentExecutor that translates between your agent and the A2A protocol. Let's break down how it works using the T-shirt store example:
The AgentExecutor Pattern
from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.server.tasks import TaskUpdater
from a2a.types import AgentCard

class YourAgentExecutor(AgentExecutor):
    def __init__(self, your_agent, card: AgentCard):
        self.agent = your_agent
        self._card = card

    async def execute(self, context: RequestContext, event_queue: EventQueue):
        # Main execution logic here
        pass

    async def cancel(self, context: RequestContext, event_queue: EventQueue):
        # AgentExecutor also requires a cancel hook for aborted tasks
        raise NotImplementedError
Key Implementation Details
1. Message Conversion
Your executor needs to convert between A2A and your agent's format:
from a2a.types import FilePart, FileWithUri, Part, TextPart

def convert_a2a_parts_to_your_format(parts: list[Part]) -> YourMessageFormat:
    """Convert A2A message parts to your agent's expected format."""
    converted_parts = []
    for part in parts:
        part = part.root
        if isinstance(part, TextPart):
            converted_parts.append(YourTextFormat(text=part.text))
        elif isinstance(part, FilePart):
            # Handle file attachments
            if isinstance(part.file, FileWithUri):
                converted_parts.append(YourFileFormat(uri=part.file.uri))
    return YourMessageFormat(parts=converted_parts)

def convert_your_format_to_a2a(your_response) -> list[Part]:
    """Convert your agent's response back to A2A format."""
    return [Part(root=TextPart(text=your_response.text))]
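As a concrete instance: if your agent happens to consume OpenAI-style chat messages (a hypothetical target format here, not something Rogue requires), the text-only conversion could be as simple as:
from a2a.types import Part, TextPart

def a2a_parts_to_chat_message(parts: list[Part]) -> dict:
    """Flatten A2A text parts into a single OpenAI-style chat message."""
    text = " ".join(p.root.text for p in parts if isinstance(p.root, TextPart))
    return {"role": "user", "content": text}

def chat_reply_to_a2a(reply: dict) -> list[Part]:
    """Wrap an OpenAI-style assistant reply back into A2A parts."""
    return [Part(root=TextPart(text=reply["content"]))]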
2. Session Management
Handle conversation context across multiple interactions:
async def _upsert_session(self, session_id: str):
    """Get an existing session or create a new one."""
    session = await self.agent.get_session(session_id)
    if session is None:
        session = await self.agent.create_session(session_id)
    return session
3. Task Updates
Keep Rogue informed about your agent's progress:
# Note: TaskState comes from a2a.types
async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
    session = await self._upsert_session(session_id)
    async for response in self.agent.run_async(session_id, message):
        if response.is_final():
            # Final response - complete the task
            parts = convert_your_format_to_a2a(response.content)
            await task_updater.add_artifact(parts)
            await task_updater.complete()
            break
        else:
            # Intermediate response - update status
            await task_updater.update_status(
                TaskState.working,
                message=task_updater.new_agent_message(
                    convert_your_format_to_a2a(response.content)
                ),
            )
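It's also worth handling failures explicitly, so a crashed turn is recorded in the evaluation instead of hanging the task. A minimal sketch, assuming your a2a-sdk version exposes TaskUpdater.failed() (check yours):
from a2a.types import Part, TextPart

async def _process_request_safely(self, message, session_id: str, task_updater: TaskUpdater):
    try:
        await self._process_request(message, session_id, task_updater)
    except Exception as exc:
        # Report the error back through the task so Rogue records the failure
        # (TaskUpdater.failed() is assumed here; adjust to your SDK version)
        await task_updater.failed(
            message=task_updater.new_agent_message(
                [Part(root=TextPart(text=f"Agent error: {exc}"))]
            )
        )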
Complete Implementation Example
Here's the full pattern based on the T-shirt store agent:
class TShirtStoreAgentExecutor(AgentExecutor):
    def __init__(self, runner: "Runner", card: AgentCard):
        self.runner = runner
        self._card = card

    async def execute(self, context: RequestContext, event_queue: EventQueue):
        updater = TaskUpdater(event_queue, context.task_id or "", context.context_id or "")
        # Initialize task
        if not context.current_task:
            await updater.submit()
        await updater.start_work()
        # Process the message
        if context.message is not None:
            await self._process_request(
                convert_a2a_to_genai(context.message.parts),
                context.context_id or "",
                updater,
            )

    async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
        session = await self._upsert_session(session_id)
        async for event in self.runner.run_async(session_id, message):
            if event.is_final_response():
                if event.content:
                    parts = convert_genai_to_a2a(event.content.parts)
                    await task_updater.add_artifact(parts)
                    await task_updater.complete()
                break
            elif event.content:
                await task_updater.update_status(
                    TaskState.working,
                    message=task_updater.new_agent_message(
                        convert_genai_to_a2a(event.content.parts)
                    ),
                )
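To actually expose this executor to Rogue, you mount it in an A2A server. Here's a minimal sketch following the a2a-sdk's standard serving pattern; the card values, skill, and port are illustrative placeholders, and exact field names can vary between SDK versions:
import uvicorn
from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill

# Placeholder: your existing agent runner (the example uses a google-adk Runner)
runner = ...

card = AgentCard(
    name="T-Shirt Store Agent",
    description="Sells t-shirts for Shirtify",
    url="http://localhost:10001/",
    version="1.0.0",
    defaultInputModes=["text"],
    defaultOutputModes=["text"],
    capabilities=AgentCapabilities(streaming=True),
    skills=[
        AgentSkill(
            id="sell_tshirts",
            name="Sell t-shirts",
            description="Handles t-shirt orders at a fixed $19.99 price",
            tags=["sales"],
        )
    ],
)

handler = DefaultRequestHandler(
    agent_executor=TShirtStoreAgentExecutor(runner, card),
    task_store=InMemoryTaskStore(),
)
app = A2AStarletteApplication(agent_card=card, http_handler=handler)
uvicorn.run(app.build(), host="0.0.0.0", port=10001)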
Testing Your Implementation
Once your executor is ready, test it with a simple client:
import asyncio

import httpx
from a2a.client import A2AClient
from a2a.types import Message, MessageSendParams, Part, SendMessageRequest, TextPart

async def test_your_agent():
    async with httpx.AsyncClient() as client:
        a2a_client = await A2AClient.get_client_from_agent_card_url(
            client, "http://localhost:your-port"
        )
        response = await a2a_client.send_message(
            SendMessageRequest(
                id="test-123",
                params=MessageSendParams(
                    message=Message(
                        role="user",
                        parts=[Part(root=TextPart(text="Hello, agent!"))],
                        messageId="msg-123",
                    )
                ),
            )
        )
        print(f"Agent response: {response}")

asyncio.run(test_your_agent())
Advanced Usage: CLI for CI/CD 🚀
For automated testing in your deployment pipeline, use Rogue's CLI mode:
Basic CLI Usage
# Run evaluation with minimal config
uvx rogue-ai cli \
--evaluated-agent-url http://localhost:10001 \
--judge-llm openai/gpt-4o \
--business-context-file ./context.md \
--output-report-file ./evaluation-report.md
CI/CD Integration Example
# .github/workflows/agent-evaluation.yml
name: Evaluate Agent
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start Agent
        run: |
          # Start your agent in background
          python -m your_agent &
          sleep 10  # Wait for startup

      - name: Run Rogue Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          uvx rogue-ai server &
          sleep 5
          uvx rogue-ai cli \
            --evaluated-agent-url http://localhost:8080 \
            --judge-llm openai/gpt-4o \
            --business-context-file ./agent-context.md \
            --output-report-file ./evaluation-results.md

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation-results.md
Configuration File Approach
Create a config file for reusable settings:
{
"evaluated_agent_url": "http://localhost:10001",
"judge_llm": "openai/gpt-4o",
"business_context_file": "./context.md",
"deep_test_mode": true,
"output_report_file": "./reports/evaluation.md"
}
# Use config file
uvx rogue-ai cli --config-file ./rogue-config.json
Tips for Effective Agent Testing 💡
1. Write Comprehensive Business Context
Your business context directly impacts scenario quality. Include:
- Policies: What your agent should/shouldn't do
- Available actions: Tools and capabilities
- Expected behavior: How to handle edge cases
- Constraints: Pricing, inventory, limitations
2. Start with Core Scenarios
Begin testing with your most critical use cases:
- Happy path interactions
- Policy violations
- Edge cases and error handling
- Security boundary testing
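For the t-shirt store example, those categories might translate into scenarios like:
- Happy path: a customer orders a black V-neck and completes checkout at $19.99
- Policy violation: a customer haggles for a discount or asks for free merchandise
- Edge case: a customer tries to pay in EUR, or asks for a color that isn't stocked
- Security boundary: a customer tries to prompt-inject the agent into revealing its instructions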
3. Iterate Based on Results
Use evaluation reports to improve your agent:
- Fix failed scenarios
- Add safeguards for edge cases
- Refine instructions based on findings
- Re-test after changes
4. Automate in CI/CD
Make evaluation part of your deployment process:
- Run on every pull request
- Block deployments on failed evaluations
- Track evaluation metrics over time
- Alert on regression
Happy testing! 🧪✨