Testing AI Agents Like a Pro: A Complete Guide to Rogue πŸ”

Testing AI agents is still a big problem for most teams: you either run through scenarios manually or maintain a pile of custom scripts. Rogue is an open source framework that works differently - it uses one agent to test another agent automatically.

⭐ Star Rogue on GitHub to support the project and stay updated!

What is Rogue? πŸ€”

Rogue is a powerful evaluation framework that tests AI agents by having them face off against a dynamic EvaluatorAgent. Think of it as a sparring partner for your AI - it generates scenarios, runs conversations, and grades performance to ensure your agent behaves exactly as intended.

Quick Start: Get Testing in 2 Minutes ⚑

Prerequisites

Make sure uv is installed:

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip
pip install uv

# Verify installation
uv --version

Get Started

The fastest way to see Rogue in action is with the built-in T-shirt store example:

# Install and run with example agent
uvx rogue-ai --example=tshirt_store

This single command:

  • Starts a demo T-shirt selling agent on localhost:10001
  • Start the Rogue server
  • Launches Rogue's TUI interface

Full Walkthrough: Evaluating Your Agent πŸ§ͺ

Let's walk through a complete evaluation using Rogue's TUI. Start by launching Rogue:

# Launch Rogue with the example agent
uvx rogue-ai --example=tshirt_store

# Or connect to your own agent
uvx rogue-ai

Now let's go through the 5-step evaluation process:

Step 1: Configure Your Judge Model πŸ€–

Type /models to open the model configuration:

/models

What you'll see:

  • Available Models: List of supported LLM providers (OpenAI, Anthropic, Google)
  • API Key Setup: Securely add your API keys for the providers
  • Judge Selection: Choose which model will evaluate your agent's responses

Step 2: Create Business Context & Scenarios πŸ“

You have two options here:

Option A: Manual Editor

Type /editor to open the business context editor:

/editor

Write your business context:

# T-Shirt Store Agent Context

This agent sells t-shirts for Shirtify. Key requirements:

## Products & Pricing

- Regular and V-neck shirts only
- Colors: White, Black, Red, Blue, Green
- Fixed price: $19.99 USD (no discounts!)

## Policies

- No free merchandise
- No sales or promotions
- USD payments only
- Professional customer service


Option B: AI Interview

Hit i to start an interactive interview where Rogue asks you questions:

i

The AI will ask questions like:

πŸ€– What does your agent do?
πŸ‘€ It sells t-shirts for an online store

πŸ€– What are the key policies it should follow?
πŸ‘€ No discounts, fixed pricing at $19.99, USD only

πŸ€– What scenarios should we test?
πŸ‘€ Price negotiations, wrong currency, inventory checks

Rogue automatically generates comprehensive business context and test scenarios from your answers!

Step 3: Start the Evaluation ⚑

Type /eval to open evaluation settings:

/eval

Configuration Options:

  • Deep Test Mode: Toggle ON for more thorough testing (recommended)
  • Agent URL: Your agent's endpoint (auto-filled for examples)

Step 4: Watch the Live Agent Battle πŸ‘€

Once started, you'll see real-time conversations between Rogue's EvaluatorAgent and your agent unfold in the TUI.

Step 5: View the Complete Report πŸ“Š

When the evaluation completes (or at any point while it runs), hit r to see the full report:

r

The report gives a comprehensive breakdown of the results: every scenario that was run and how your agent performed against it.

Architecture: How Rogue Works πŸ—οΈ

Rogue is built with an open, extensible architecture that separates evaluation logic from user interfaces. This design makes it easy to contribute new features, integrate with existing tools, and customize for your needs:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Your Agent    │◄──►│   Rogue Server   │◄──►│    Clients      β”‚
β”‚                 β”‚    β”‚                  β”‚    β”‚  - TUI          β”‚
β”‚ (Any A2A Agent) β”‚    β”‚ - EvaluatorAgent β”‚    β”‚  - Web UI       β”‚
β”‚                 β”‚    β”‚ - Scenario Gen   β”‚    β”‚  - CLI          β”‚
β”‚                 β”‚    β”‚ - Reporting      β”‚    β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components:

  • Rogue Server: Core evaluation engine
  • Clients: Different ways to interact (TUI, Web, CLI)
  • Your Agent: Must implement Google's A2A protocol currently but more transports are incoming soon!

Deep Dive: Wrapping Your Agent with AgentExecutor πŸ’»

Want to connect your existing agent to Rogue? You'll need to implement an AgentExecutor that translates between your agent and the A2A protocol. Let's break it down using the T-shirt store example:

The AgentExecutor Pattern

from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.server.tasks import TaskUpdater
from a2a.types import AgentCard

class YourAgentExecutor(AgentExecutor):
    def __init__(self, your_agent, card: AgentCard):
        self.agent = your_agent
        self._card = card

    async def execute(self, context: RequestContext, event_queue: EventQueue):
        # Main execution logic: read the incoming message from `context`,
        # run your agent, and publish results through `event_queue`
        pass

    async def cancel(self, context: RequestContext, event_queue: EventQueue):
        # AgentExecutor also expects a cancel hook for aborting in-flight tasks
        raise NotImplementedError()

Key Implementation Details

1. Message Conversion

Your executor needs to convert between A2A and your agent's format:


from a2a.types import FilePart, FileWithUri, Part, TextPart

# YourMessageFormat / YourTextFormat / YourFileFormat are placeholders for
# whatever message types your own agent framework uses.

def convert_a2a_parts_to_your_format(parts: list[Part]) -> YourMessageFormat:
    """Convert A2A message parts to your agent's expected format."""
    converted_parts = []
    for part in parts:
        part = part.root
        if isinstance(part, TextPart):
            converted_parts.append(YourTextFormat(text=part.text))
        elif isinstance(part, FilePart):
            # Handle file attachments delivered by URI
            if isinstance(part.file, FileWithUri):
                converted_parts.append(YourFileFormat(uri=part.file.uri))
    return YourMessageFormat(parts=converted_parts)

def convert_your_format_to_a2a(your_response) -> list[Part]:
    """Convert your agent's response back to A2A format."""
    return [Part(root=TextPart(text=your_response.text))]

2. Session Management

Handle conversation context across multiple interactions:

async def _upsert_session(self, session_id: str):
    """Get existing session or create new one."""
    session = await self.agent.get_session(session_id)
    if session is None:
        session = await self.agent.create_session(session_id)
    return session

3. Task Updates

Keep Rogue informed about your agent's progress:

async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
    session = await self._upsert_session(session_id)

    async for response in self.agent.run_async(session_id, message):
        if response.is_final():
            # Final response - complete the task
            parts = convert_your_format_to_a2a(response.content)
            await task_updater.add_artifact(parts)
            await task_updater.complete()
            break
        else:
            # Intermediate response - update status
            await task_updater.update_status(
                TaskState.working,
                message=task_updater.new_agent_message(
                    convert_your_format_to_a2a(response.content)
                )
            )

Complete Implementation Example

Here's the full pattern based on the T-shirt store agent:

class TShirtStoreAgentExecutor(AgentExecutor):
    def __init__(self, runner: "Runner", card: AgentCard):
        self.runner = runner
        self._card = card

    async def execute(self, context: RequestContext, event_queue: EventQueue):
        updater = TaskUpdater(event_queue, context.task_id or "", context.context_id or "")

        # Initialize task
        if not context.current_task:
            await updater.submit()
        await updater.start_work()

        # Process the message
        if context.message is not None:
            await self._process_request(
                convert_a2a_to_genai(context.message.parts),
                context.context_id or "",
                updater,
            )

    async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
        session = await self._upsert_session(session_id)

        async for event in self.runner.run_async(session_id, message):
            if event.is_final_response():
                if event.content:
                    parts = convert_genai_to_a2a(event.content.parts)
                    await task_updater.add_artifact(parts)
                await task_updater.complete()
                break
            elif event.content:
                await task_updater.update_status(
                    TaskState.working,
                    message=task_updater.new_agent_message(
                        convert_genai_to_a2a(event.content.parts)
                    )
                )
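
The executor on its own isn't reachable over HTTP, and Rogue needs a URL to talk to. Here's a rough sketch of serving it with the a2a-sdk's Starlette helpers - A2AStarletteApplication, DefaultRequestHandler, and InMemoryTaskStore come from that SDK, while the AgentCard values, the runner, and port 10001 are illustrative placeholders mirroring the example agent:

# Rough sketch: expose the executor over HTTP so Rogue can reach it.
# Card values and the port are illustrative; `runner` is your own agent runtime.
import uvicorn
from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill

card = AgentCard(
    name="T-Shirt Store Agent",
    description="Sells t-shirts for Shirtify",
    url="http://localhost:10001/",
    version="1.0.0",
    defaultInputModes=["text"],
    defaultOutputModes=["text"],
    capabilities=AgentCapabilities(streaming=True),
    skills=[
        AgentSkill(
            id="sell_tshirts",
            name="Sell t-shirts",
            description="Answers questions and takes t-shirt orders",
            tags=["sales"],
        )
    ],
)

handler = DefaultRequestHandler(
    agent_executor=TShirtStoreAgentExecutor(runner, card),
    task_store=InMemoryTaskStore(),
)

app = A2AStarletteApplication(agent_card=card, http_handler=handler)
uvicorn.run(app.build(), host="0.0.0.0", port=10001)

With something like this running, the Agent URL you enter in /eval (or pass via --evaluated-agent-url) is simply http://localhost:10001.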

Testing Your Implementation

Once your executor is ready, test it with a simple client:

import asyncio

import httpx
from a2a.client import A2AClient
from a2a.types import Message, MessageSendParams, Part, SendMessageRequest, TextPart

async def test_your_agent():
    async with httpx.AsyncClient() as client:
        a2a_client = await A2AClient.get_client_from_agent_card_url(
            client, "http://localhost:your-port"
        )

        response = await a2a_client.send_message(
            SendMessageRequest(
                id="test-123",
                params=MessageSendParams(
                    message=Message(
                        role="user",
                        parts=[Part(root=TextPart(text="Hello, agent!"))],
                        messageId="msg-123"
                    )
                )
            )
        )
        print(f"Agent response: {response}")

if __name__ == "__main__":
    asyncio.run(test_your_agent())

Advanced Usage: CLI for CI/CD πŸš€

For automated testing in your deployment pipeline, use Rogue's CLI mode:

Basic CLI Usage

# Run evaluation with minimal config
uvx rogue-ai cli \
    --evaluated-agent-url http://localhost:10001 \
    --judge-llm openai/gpt-4o \
    --business-context-file ./context.md \
    --output-report-file ./evaluation-report.md

CI/CD Integration Example

# .github/workflows/agent-evaluation.yml
name: Evaluate Agent
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start Agent
        run: |
          # Start your agent in background
          python -m your_agent &
          sleep 10  # Wait for startup

      - name: Run Rogue Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          uvx rogue-ai server &
          sleep 5

          uvx rogue-ai cli \
            --evaluated-agent-url http://localhost:8080 \
            --judge-llm openai/gpt-4o \
            --business-context-file ./agent-context.md \
            --output-report-file ./evaluation-results.md

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation-results.md

Configuration File Approach

Create a config file for reusable settings:

{
  "evaluated_agent_url": "http://localhost:10001",
  "judge_llm": "openai/gpt-4o",
  "business_context_file": "./context.md",
  "deep_test_mode": true,
  "output_report_file": "./reports/evaluation.md"
}
# Use config file
uvx rogue-ai cli --config-file ./rogue-config.json

Tips for Effective Agent Testing πŸ’‘

1. Write Comprehensive Business Context

Your business context directly impacts scenario quality. Include:

  • Policies: What your agent should/shouldn't do
  • Available actions: Tools and capabilities
  • Expected behavior: How to handle edge cases
  • Constraints: Pricing, inventory, limitations

2. Start with Core Scenarios

Begin testing with your most critical use cases:

  • Happy path interactions
  • Policy violations
  • Edge cases and error handling
  • Security boundary testing

3. Iterate Based on Results

Use evaluation reports to improve your agent:

  • Fix failed scenarios
  • Add safeguards for edge cases
  • Refine instructions based on findings
  • Re-test after changes

4. Automate in CI/CD

Make evaluation part of your deployment process:

  • Run on every pull request
  • Block deployments on failed evaluations
  • Track evaluation metrics over time
  • Alert on regression

Happy testing! πŸ§ͺ✨
