Testing AI agents is still a big problem for most teams. You either test manually through scenarios, or you've built custom scripts. Rogue is an open source framework that works differently - it uses one agent to test another agent automatically.
β Star Rogue on GitHub to support the project and stay updated!
What is Rogue? π€
Rogue is a powerful evaluation framework that tests AI agents by having them face off against a dynamic EvaluatorAgent. Think of it as a sparring partner for your AI - it generates scenarios, runs conversations, and grades performance to ensure your agent behaves exactly as intended.
Quick Start: Get Testing in 2 Minutes β‘
Prerequisites
Make sure UV is installed
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or via pip
pip install uv
# Verify installation
uv --version
Get Started
The fastest way to see Rogue in action is with the built-in T-shirt store example:
# Install and run with example agent
uvx rogue-ai --example=tshirt_store
This single command:
- Starts a demo T-shirt selling agent on
localhost:10001 - Starts the Rogue server
- Launches Rogue's TUI interface
Full Walkthrough: Evaluating Your Agent π§ͺ
Let's walk through a complete evaluation using Rogue's powerful TUI interface. Start by launching Rogue:
# Launch Rogue with the example agent
uvx rogue-ai --example=tshirt_store
# Or connect to your own agent
uvx rogue-ai
Now let's go through the 5-step evaluation process:
Step 1: Configure Your Judge Model π€
Type /models to open the model configuration:
/models
What you'll see:
- Available Models: List of supported LLM providers (OpenAI, Anthropic, Google)
- API Key Setup: Securely add your API keys for the providers
- Judge Selection: Choose which model will evaluate your agent's responses
Step 2: Create Business Context & Scenarios π
You have two powerful options here:
Option A: Manual Editor
Type /editor to open the business context editor:
/editor
Hit i to start an interactive interview where Rogue asks you questions:
The AI will ask questions like:
π€ What does your agent do?
π€ It sells t-shirts for an online store
π€ What are the key policies it should follow?
π€ No discounts, fixed pricing at $19.99, USD only
π€ What scenarios should we test?
π€ Price negotiations, wrong currency, inventory checks
Rogue automatically generates comprehensive business context and test scenarios from your answers!
option B
Write your business context:
# T-Shirt Store Agent Context
This agent sells t-shirts for Shirtify. Key requirements:
## Products & Pricing
- Regular and V-neck shirts only
- Colors: White, Black, Red, Blue, Green
- Fixed price: $19.99 USD (no discounts!)
## Policies
- No free merchandise
- No sales or promotions
- USD payments only
- Professional customer service
Step 3: Start the Evaluation β‘
Type /eval to open evaluation settings:
/eval
Configuration Options:
- Deep Test Mode: Toggle ON for more thorough testing (recommended) this will enable a multi turn conversation per scenario.
- Agent URL: Your agent's endpoint (auto-filled for examples)
Step 4: Watch the Live Agent Battle π
Once started, you'll see real-time conversations between Rogue's EvaluatorAgent and your agent:
Step 5: View the Complete Report π
When evaluation completes (or anytime during), hit r to see the full report:
Comprehensive Report includes:
Architecture: How Rogue Works ποΈ
Rogue is built with an open, extensible architecture that separates evaluation logic from user interfaces. This design makes it easy to contribute new features, integrate with existing tools, and customize for your needs:
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
β Your Agent βββββΊβ Rogue Server βββββΊβ Clients β
β β β β β β
β (Any A2A Agent) β β - EvaluatorAgent β β - TUI β
β β β - Scenario Gen β β - Web UI β
β β β - Reporting β β - CLI β
βββββββββββββββββββ ββββββββββββββββββββ βββββββββββββββββββ
Components:
- Rogue Server: Core evaluation engine
- Clients: Different ways to interact (TUI, Web, CLI)
- Your Agent: Must implement Google's A2A protocol currently but more transports are incoming soon!
Deep Dive: Wrapping Your Agent with AgentExecutor π»
Want to connect your existing agent to Rogue? You'll need to implement an AgentExecutor that translates between your agent and the A2A protocol. Let's break down how using the T-shirt store example:
The AgentExecutor Pattern
from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.server.tasks import TaskUpdater
class YourAgentExecutor(AgentExecutor):
def __init__(self, your_agent, card: AgentCard):
self.agent = your_agent
self._card = card
async def execute(self, context: RequestContext, event_queue: EventQueue):
# Main execution logic here
pass
async def cancel(
self, context: RequestContext, event_queue: EventQueue
) -> None:
pass
Key Implementation Details
1. Message Conversion
Your executor needs to convert between A2A and your agent's format:
def convert_a2a_parts_to_your_format(parts: list[Part]) -> YourMessageFormat:
"""Convert A2A message parts to your agent's expected format."""
converted_parts = []
for part in parts:
part = part.root
if isinstance(part, TextPart):
converted_parts.append(YourTextFormat(text=part.text))
elif isinstance(part, FilePart):
# Handle file attachments
if isinstance(part.file, FileWithUri):
converted_parts.append(YourFileFormat(uri=part.file.uri))
return YourMessageFormat(parts=converted_parts)
def convert_your_format_to_a2a(your_response) -> list[Part]:
"""Convert your agent's response back to A2A format."""
return [Part(root=TextPart(text=your_response.text))]
2. Session Management
Handle conversation context across multiple interactions:
async def _upsert_session(self, session_id: str):
"""Get existing session or create new one."""
session = await self.agent.get_session(session_id)
if session is None:
session = await self.agent.create_session(session_id)
return session
3. Task Updates
Keep Rogue informed about your agent's progress:
async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
session = await self._upsert_session(session_id)
async for response in self.agent.run_async(session_id, message):
if response.is_final():
# Final response - complete the task
parts = convert_your_format_to_a2a(response.content)
await task_updater.add_artifact(parts)
await task_updater.complete()
break
else:
# Intermediate response - update status
await task_updater.update_status(
TaskState.working,
message=task_updater.new_agent_message(
convert_your_format_to_a2a(response.content)
)
)
Complete Implementation Example
Here's the full pattern based on the T-shirt store agent:
import base64
from typing import TYPE_CHECKING, AsyncGenerator
from a2a.server.agent_execution import AgentExecutor
from a2a.server.tasks import TaskUpdater
from a2a.types import (
AgentCard,
FilePart,
FileWithBytes,
FileWithUri,
Part,
TaskState,
TextPart,
UnsupportedOperationError,
)
from a2a.utils.errors import ServerError
from google.genai.types import UserContent, Blob, FileData, Part as GenAIPart
from loguru import logger
if TYPE_CHECKING:
from a2a.server.agent_execution import RequestContext
from a2a.server.events import EventQueue
from google.adk import Runner
from google.adk.events import Event
from google.genai.types import Content
from google.genai.types import Part as GenAIPart
class TShirtStoreAgentExecutor(AgentExecutor):
"""An AgentExecutor that runs an ADK-based Agent."""
def __init__(self, runner: "Runner", card: AgentCard):
self.runner = runner
self._card = card
self._running_sessions = {} # type: ignore
def _run_agent(
self,
session_id: str,
new_message: "Content",
) -> AsyncGenerator["Event", None]:
return self.runner.run_async(
session_id=session_id,
user_id="self",
new_message=new_message,
)
async def _process_request(
self,
new_message: "Content",
session_id: str,
task_updater: TaskUpdater,
) -> None:
# The call to self._upsert_session was returning a coroutine object,
# leading to an AttributeError when trying to access .id on it directly.
# We need to await the coroutine to get the actual session object.
session_obj = await self._upsert_session(
session_id,
)
# Update session_id with the ID from the resolved session object
# to be used in self._run_agent.
session_id = session_obj.id
async for event in self._run_agent(session_id, new_message):
if event.is_final_response():
if event.content:
parts = convert_genai_parts_to_a2a(event.content.parts)
logger.debug("Yielding final response", extra={"parts": parts})
await task_updater.add_artifact(parts)
await task_updater.complete()
break
if not event.get_function_calls() and event.content:
logger.debug("Yielding update response")
await task_updater.update_status(
TaskState.working,
message=task_updater.new_agent_message(
convert_genai_parts_to_a2a(event.content.parts),
),
)
else:
logger.debug("Skipping event")
async def execute(
self,
context: "RequestContext",
event_queue: "EventQueue",
):
# Run the agent until either complete or the task is suspended.
updater = TaskUpdater(
event_queue,
context.task_id or "",
context.context_id or "",
)
# Immediately notify that the task is submitted.
if not context.current_task:
await updater.submit()
await updater.start_work()
if context.message is not None:
await self._process_request(
UserContent(
parts=convert_a2a_parts_to_genai(context.message.parts),
),
context.context_id or "",
updater,
)
logger.debug("TShirtStoreAgentExecutor execute exiting")
async def cancel(self, context: "RequestContext", event_queue: "EventQueue"):
# Ideally: kill any ongoing tasks.
raise ServerError(error=UnsupportedOperationError())
async def _upsert_session(self, session_id: str):
"""
Retrieves a session if it exists, otherwise creates a new one.
Ensures that async session service methods are properly awaited.
"""
session = await self.runner.session_service.get_session(
app_name=self.runner.app_name,
user_id="self",
session_id=session_id,
)
if session is None:
session = await self.runner.session_service.create_session(
app_name=self.runner.app_name,
user_id="self",
session_id=session_id,
)
# According to ADK InMemorySessionService,
# create_session should always return a Session object.
if session is None:
logger.error(
f"Critical error: Session is None even after "
f"create_session for session_id: {session_id}",
)
raise RuntimeError(
f"Failed to get or create session: {session_id}",
)
return session
def convert_a2a_parts_to_genai(parts: list[Part]) -> list["GenAIPart"]:
"""Convert a list of A2A Part types into a list of Google Gen AI Part types."""
return [convert_a2a_part_to_genai(part) for part in parts]
def convert_a2a_part_to_genai(part: Part) -> "GenAIPart":
"""Convert a single A2A Part type into a Google Gen AI Part type."""
part = part.root # type: ignore
if isinstance(part, TextPart):
return GenAIPart(text=part.text)
if isinstance(part, FilePart):
if isinstance(part.file, FileWithUri):
return GenAIPart(
file_data=FileData(
file_uri=part.file.uri,
mime_type=part.file.mimeType,
),
)
if isinstance(part.file, FileWithBytes):
return GenAIPart(
inline_data=Blob(
data=base64.b64decode(part.file.bytes),
mime_type=part.file.mimeType,
),
)
raise ValueError(f"Unsupported file type: {type(part.file)}")
raise ValueError(f"Unsupported part type: {type(part)}")
def convert_genai_parts_to_a2a(parts: list["GenAIPart"] | None) -> list[Part]:
"""Convert a list of Google Gen AI Part types into a list of A2A Part types."""
parts = parts or []
return [
convert_genai_part_to_a2a(part)
for part in parts
if (part.text or part.file_data or part.inline_data)
]
def convert_genai_part_to_a2a(part: "GenAIPart") -> Part:
"""Convert a single Google Gen AI Part type into an A2A Part type."""
if part.text:
return Part(root=TextPart(text=part.text))
if part.file_data:
return Part(
root=FilePart(
file=FileWithUri(
uri=part.file_data.file_uri or "",
mimeType=part.file_data.mime_type,
),
),
)
if part.inline_data:
return Part(
root=FilePart(
file=FileWithBytes(
bytes=base64.b64encode(
part.inline_data.data, # type: ignore
).decode(),
mimeType=part.inline_data.mime_type,
),
),
)
raise ValueError(f"Unsupported part type: {part}")
Testing Your Implementation
Once your executor is ready, test it with a simple client:
from a2a.client import A2AClient
async def test_your_agent():
async with httpx.AsyncClient() as client:
a2a_client = await A2AClient.get_client_from_agent_card_url(
client, "http://localhost:your-port"
)
response = await a2a_client.send_message(
SendMessageRequest(
id="test-123",
params=MessageSendParams(
message=Message(
role="user",
parts=[TextPart(text="Hello, agent!")],
messageId="msg-123"
)
)
)
)
print(f"Agent response: {response}")
Advanced Usage: CLI for CI/CD π
For automated testing in your deployment pipeline, use Rogue's CLI mode:
Basic CLI Usage
after creating the scenario and business context rogue will create a
.roguefolder with the scenarios.json file that contains the scenarios
# Run evaluation with minimal config
uvx rogue-ai cli \
--evaluated-agent-url http://localhost:10001 \
--judge-llm openai/o4-mini \
--workdir './examples/tshirt_store_agent/.rogue'
CI/CD Integration Example
# .github/workflows/agent-evaluation.yml
name: Rogue
on:
pull_request:
push:
branches:
- main
workflow_dispatch:
jobs:
run_rogue:
runs-on: ubuntu-latest
timeout-minutes: 15
steps:
- name: Run rogue
shell: bash
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
echo "π Starting rogue server and example agent..."
uvx rogue-ai server --host 0.0.0.0 --port 8000 --example tshirt_store &
ROGUE_PID=$!
echo "Rogue server started with PID: $ROGUE_PID"
trap 'echo "π Stopping rogue server..."; kill $ROGUE_PID' EXIT
echo "β³ Waiting for example agent to be ready..."
curl --retry 10 --retry-delay 5 --retry-connrefused -s --fail -o /dev/null \
http://localhost:10001/.well-known/agent.json
echo "β³ Waiting for rogue server to be ready..."
curl --retry 10 --retry-delay 5 --retry-connrefused -s --fail -o /dev/null \
http://localhost:8000/api/v1/health
echo "π Rogue server ready"
echo "π Running rogue..."
uvx rogue-ai cli \
--evaluated-agent-url http://localhost:10001 \
--judge-llm openai/o4-mini \
--workdir './examples/tshirt_store_agent/.rogue'
Tips for Effective Agent Testing π‘
1. Write Comprehensive Business Context
Your business context directly impacts scenario quality. Include:
- Policies: What your agent should/shouldn't do
- Available actions: Tools and capabilities
- Expected behavior: How to handle edge cases
- Constraints: Pricing, inventory, limitations
2. Start with Core Scenarios
Begin testing with your most critical use cases:
- Happy path interactions
- Policy violations
- Edge cases and error handling
- Security boundary testing
3. Iterate Based on Results
Use evaluation reports to improve your agent:
- Fix failed scenarios
- Add safeguards for edge cases
- Refine instructions based on findings
- Re-test after changes
4. Automate in CI/CD
Make evaluation part of your deployment process:
- Run on every pull request
- Block deployments on failed evaluations
- Track evaluation metrics over time
- Alert on regression
Happy testing! π§ͺβ¨









Top comments (0)