
drorIvry

Testing AI Agents Like a Pro: A Complete Guide to Rogue

Testing AI agents is still a pain point for most teams: you either walk through scenarios by hand or maintain custom test scripts. Rogue is an open-source framework that takes a different approach - it uses one agent to test another agent automatically.

⭐ Star Rogue on GitHub to support the project and stay updated!

What is Rogue?

Rogue is a powerful evaluation framework that tests AI agents by having them face off against a dynamic EvaluatorAgent. Think of it as a sparring partner for your AI - it generates scenarios, runs conversations, and grades performance to ensure your agent behaves exactly as intended.

Quick Start: Get Testing in 2 Minutes

Prerequisites

Make sure uv is installed:

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip
pip install uv

# Verify installation
uv --version

Get Started

The fastest way to see Rogue in action is with the built-in T-shirt store example:

# Install and run with example agent
uvx rogue-ai --example=tshirt_store

This single command:

  • Starts a demo T-shirt selling agent on localhost:10001
  • Starts the Rogue server
  • Launches Rogue's TUI

Full Walkthrough: Evaluating Your Agent

Let's walk through a complete evaluation using Rogue's TUI. Start by launching Rogue:

# Launch Rogue with the example agent
uvx rogue-ai --example=tshirt_store

# Or connect to your own agent
uvx rogue-ai

Now let's go through the 5-step evaluation process:

Step 1: Configure Your Judge Model

Type /models to open the model configuration:

/models

What you'll see:

  • Available Models: List of supported LLM providers (OpenAI, Anthropic, Google)
  • API Key Setup: Securely add your API keys for the providers
  • Judge Selection: Choose which model will evaluate your agent's responses

Step 2: Create Business Context & Scenarios

You have two powerful options here:

Option A: Manual Editor

Type /editor to open the business context editor:

/editor

Press i to start an interactive interview in which Rogue asks questions like:

Rogue: What does your agent do?
You:   It sells t-shirts for an online store

Rogue: What are the key policies it should follow?
You:   No discounts, fixed pricing at $19.99, USD only

Rogue: What scenarios should we test?
You:   Price negotiations, wrong currency, inventory checks

Rogue automatically generates comprehensive business context and test scenarios from your answers!

Option B: Write the Context Yourself

Skip the interview and write your business context by hand:

# T-Shirt Store Agent Context

This agent sells t-shirts for Shirtify. Key requirements:

## Products & Pricing

- Regular and V-neck shirts only
- Colors: White, Black, Red, Blue, Green
- Fixed price: $19.99 USD (no discounts!)

## Policies

- No free merchandise
- No sales or promotions
- USD payments only
- Professional customer service


Step 3: Start the Evaluation

Type /eval to open evaluation settings:

/eval

Configuration Options:

  • Deep Test Mode: Toggle ON for more thorough testing (recommended). This enables a multi-turn conversation per scenario.
  • Agent URL: Your agent's endpoint (auto-filled for examples)
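To see why multi-turn conversations matter, here is a toy, self-contained sketch comparing a single-turn probe with a multi-turn one. It is not Rogue's implementation: ToyAgent, the scripted customer messages, and the violates_policy check are all hypothetical stand-ins, with the $19.99 fixed-price policy borrowed from the example context above.

```python
from dataclasses import dataclass

FIXED_PRICE = "$19.99"


@dataclass
class ToyAgent:
    """Hypothetical agent that holds the price at first but caves under pressure."""

    turns_seen: int = 0

    def reply(self, message: str) -> str:
        self.turns_seen += 1
        if "discount" in message.lower() and self.turns_seen >= 3:
            return "Okay, okay, I can do $15.00 just for you."
        return f"Our shirts cost {FIXED_PRICE}. We do not offer discounts."


def violates_policy(reply: str) -> bool:
    """Toy judge: any quoted dollar amount other than the fixed price is a failure."""
    prices = [word.strip(".,!") for word in reply.split() if word.startswith("$")]
    return any(price != FIXED_PRICE for price in prices)


def run_scenario(agent: ToyAgent, messages: list[str]) -> bool:
    """Return True if the agent stayed within policy on every turn."""
    return not any(violates_policy(agent.reply(m)) for m in messages)


pushy_customer = [
    "Can I get a discount?",
    "Come on, just a small discount?",
    "Last chance: discount or I walk.",
]

single_turn_passed = run_scenario(ToyAgent(), pushy_customer[:1])
multi_turn_passed = run_scenario(ToyAgent(), pushy_customer)
print(single_turn_passed, multi_turn_passed)  # True False
```

The single-turn probe passes because the toy agent only gives in on the third push, which is exactly the kind of failure a one-message test would miss.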

Step 4: Watch the Live Agent Battle

Once the evaluation starts, you'll see real-time conversations between Rogue's EvaluatorAgent and your agent.

Step 5: View the Complete Report

When the evaluation completes (or at any point while it runs), press r to open the full report.

Architecture: How Rogue Works

Rogue is built with an open, extensible architecture that separates evaluation logic from user interfaces. This design makes it easy to contribute new features, integrate with existing tools, and customize for your needs:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Your Agent    │◄──►│   Rogue Server   │◄──►│    Clients      │
│                 │    │                  │    │                 │
│ (Any A2A Agent) │    │ - EvaluatorAgent │    │  - TUI          │
│                 │    │ - Scenario Gen   │    │  - Web UI       │
│                 │    │ - Reporting      │    │  - CLI          │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Components:

  • Rogue Server: Core evaluation engine
  • Clients: Different ways to interact (TUI, Web, CLI)
  • Your Agent: Must currently implement Google's A2A protocol; more transports are coming soon.

Deep Dive: Wrapping Your Agent with AgentExecutor

Want to connect your existing agent to Rogue? You'll need to implement an AgentExecutor that translates between your agent and the A2A protocol. Let's break it down using the T-shirt store example:

The AgentExecutor Pattern

from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.server.tasks import TaskUpdater
from a2a.types import AgentCard

class YourAgentExecutor(AgentExecutor):
    def __init__(self, your_agent, card: AgentCard):
        self.agent = your_agent
        self._card = card

    async def execute(self, context: RequestContext, event_queue: EventQueue):
        # Main execution logic here
        pass

    async def cancel(
        self, context: RequestContext, event_queue: EventQueue
    ) -> None:
        pass

Key Implementation Details

1. Message Conversion

Your executor needs to convert between A2A and your agent's format:


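from a2a.types import FilePart, FileWithUri, Part, TextPart

# YourMessageFormat, YourTextFormat and YourFileFormat are placeholders
# for your own agent's message types.
def convert_a2a_parts_to_your_format(parts: list[Part]) -> YourMessageFormat: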
    """Convert A2A message parts to your agent's expected format."""
    converted_parts = []
    for part in parts:
        part = part.root
        if isinstance(part, TextPart):
            converted_parts.append(YourTextFormat(text=part.text))
        elif isinstance(part, FilePart):
            # Handle file attachments
            if isinstance(part.file, FileWithUri):
                converted_parts.append(YourFileFormat(uri=part.file.uri))
    return YourMessageFormat(parts=converted_parts)

def convert_your_format_to_a2a(your_response) -> list[Part]:
    """Convert your agent's response back to A2A format."""
    return [Part(root=TextPart(text=your_response.text))]

2. Session Management

Handle conversation context across multiple interactions:

async def _upsert_session(self, session_id: str):
    """Get existing session or create new one."""
    session = await self.agent.get_session(session_id)
    if session is None:
        session = await self.agent.create_session(session_id)
    return session

3. Task Updates

Keep Rogue informed about your agent's progress:

from a2a.types import TaskState

async def _process_request(self, message, session_id: str, task_updater: TaskUpdater):
    session = await self._upsert_session(session_id)

    async for response in self.agent.run_async(session_id, message):
        if response.is_final():
            # Final response - complete the task
            parts = convert_your_format_to_a2a(response.content)
            await task_updater.add_artifact(parts)
            await task_updater.complete()
            break
        else:
            # Intermediate response - update status
            await task_updater.update_status(
                TaskState.working,
                message=task_updater.new_agent_message(
                    convert_your_format_to_a2a(response.content)
                )
            )
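To check the control flow of that loop without running a real agent, you can swap in recording stubs. This is a sketch under stated assumptions: RecordingUpdater, FakeResponse, and fake_agent_stream are hypothetical stand-ins, not the a2a TaskUpdater or a real agent stream, and they mirror only the three calls used above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResponse:
    text: str
    final: bool


class RecordingUpdater:
    """Hypothetical stand-in for TaskUpdater that records calls instead of emitting events."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    async def update_status(self, text: str) -> None:
        self.calls.append(("working", text))

    async def add_artifact(self, text: str) -> None:
        self.calls.append(("artifact", text))

    async def complete(self) -> None:
        self.calls.append(("completed", ""))


async def fake_agent_stream():
    yield FakeResponse("Checking inventory...", final=False)
    yield FakeResponse("We have blue V-necks for $19.99.", final=True)
    yield FakeResponse("never reached", final=False)


async def process(updater: RecordingUpdater) -> None:
    async for response in fake_agent_stream():
        if response.final:
            # Final response: publish the artifact, mark the task done, stop reading
            await updater.add_artifact(response.text)
            await updater.complete()
            break
        # Intermediate response: report progress
        await updater.update_status(response.text)


updater = RecordingUpdater()
asyncio.run(process(updater))
print([state for state, _ in updater.calls])  # ['working', 'artifact', 'completed']
```

The break after complete() matters: anything the agent yields after its final response is ignored, so the task is completed exactly once.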

Complete Implementation Example

Here's the full pattern based on the T-shirt store agent:


import base64
from typing import TYPE_CHECKING, AsyncGenerator

from a2a.server.agent_execution import AgentExecutor
from a2a.server.tasks import TaskUpdater
from a2a.types import (
    AgentCard,
    FilePart,
    FileWithBytes,
    FileWithUri,
    Part,
    TaskState,
    TextPart,
    UnsupportedOperationError,
)
from a2a.utils.errors import ServerError
from google.genai.types import UserContent, Blob, FileData, Part as GenAIPart
from loguru import logger


if TYPE_CHECKING:
    from a2a.server.agent_execution import RequestContext
    from a2a.server.events import EventQueue
    from google.adk import Runner
    from google.adk.events import Event
    from google.genai.types import Content


class TShirtStoreAgentExecutor(AgentExecutor):
    """An AgentExecutor that runs an ADK-based Agent."""

    def __init__(self, runner: "Runner", card: AgentCard):
        self.runner = runner
        self._card = card

        self._running_sessions = {}  # type: ignore

    def _run_agent(
        self,
        session_id: str,
        new_message: "Content",
    ) -> AsyncGenerator["Event", None]:
        return self.runner.run_async(
            session_id=session_id,
            user_id="self",
            new_message=new_message,
        )

    async def _process_request(
        self,
        new_message: "Content",
        session_id: str,
        task_updater: TaskUpdater,
    ) -> None:
        # Resolve (or create) the session first. _upsert_session is a coroutine,
        # so it must be awaited to get the actual session object.
        session_obj = await self._upsert_session(
            session_id,
        )
        # Use the ID of the resolved session when running the agent.
        session_id = session_obj.id

        async for event in self._run_agent(session_id, new_message):

            if event.is_final_response():
                if event.content:
                    parts = convert_genai_parts_to_a2a(event.content.parts)
                    logger.debug("Yielding final response", extra={"parts": parts})
                    await task_updater.add_artifact(parts)
                await task_updater.complete()
                break
            if not event.get_function_calls() and event.content:
                logger.debug("Yielding update response")
                await task_updater.update_status(
                    TaskState.working,
                    message=task_updater.new_agent_message(
                        convert_genai_parts_to_a2a(event.content.parts),
                    ),
                )
            else:
                logger.debug("Skipping event")

    async def execute(
        self,
        context: "RequestContext",
        event_queue: "EventQueue",
    ):
        # Run the agent until either complete or the task is suspended.
        updater = TaskUpdater(
            event_queue,
            context.task_id or "",
            context.context_id or "",
        )
        # Immediately notify that the task is submitted.
        if not context.current_task:
            await updater.submit()
        await updater.start_work()

        if context.message is not None:
            await self._process_request(
                UserContent(
                    parts=convert_a2a_parts_to_genai(context.message.parts),
                ),
                context.context_id or "",
                updater,
            )
        logger.debug("TShirtStoreAgentExecutor execute exiting")

    async def cancel(self, context: "RequestContext", event_queue: "EventQueue"):
        # Ideally: kill any ongoing tasks.
        raise ServerError(error=UnsupportedOperationError())

    async def _upsert_session(self, session_id: str):
        """
        Retrieves a session if it exists, otherwise creates a new one.
        Ensures that async session service methods are properly awaited.
        """
        session = await self.runner.session_service.get_session(
            app_name=self.runner.app_name,
            user_id="self",
            session_id=session_id,
        )
        if session is None:
            session = await self.runner.session_service.create_session(
                app_name=self.runner.app_name,
                user_id="self",
                session_id=session_id,
            )
        # According to ADK InMemorySessionService,
        # create_session should always return a Session object.
        if session is None:
            logger.error(
                f"Critical error: Session is None even after "
                f"create_session for session_id: {session_id}",
            )
            raise RuntimeError(
                f"Failed to get or create session: {session_id}",
            )
        return session


def convert_a2a_parts_to_genai(parts: list[Part]) -> list["GenAIPart"]:
    """Convert a list of A2A Part types into a list of Google Gen AI Part types."""
    return [convert_a2a_part_to_genai(part) for part in parts]


def convert_a2a_part_to_genai(part: Part) -> "GenAIPart":
    """Convert a single A2A Part type into a Google Gen AI Part type."""
    part = part.root  # type: ignore
    if isinstance(part, TextPart):
        return GenAIPart(text=part.text)
    if isinstance(part, FilePart):
        if isinstance(part.file, FileWithUri):
            return GenAIPart(
                file_data=FileData(
                    file_uri=part.file.uri,
                    mime_type=part.file.mimeType,
                ),
            )
        if isinstance(part.file, FileWithBytes):
            return GenAIPart(
                inline_data=Blob(
                    data=base64.b64decode(part.file.bytes),
                    mime_type=part.file.mimeType,
                ),
            )
        raise ValueError(f"Unsupported file type: {type(part.file)}")
    raise ValueError(f"Unsupported part type: {type(part)}")


def convert_genai_parts_to_a2a(parts: list["GenAIPart"] | None) -> list[Part]:
    """Convert a list of Google Gen AI Part types into a list of A2A Part types."""
    parts = parts or []
    return [
        convert_genai_part_to_a2a(part)
        for part in parts
        if (part.text or part.file_data or part.inline_data)
    ]


def convert_genai_part_to_a2a(part: "GenAIPart") -> Part:
    """Convert a single Google Gen AI Part type into an A2A Part type."""
    if part.text:
        return Part(root=TextPart(text=part.text))
    if part.file_data:
        return Part(
            root=FilePart(
                file=FileWithUri(
                    uri=part.file_data.file_uri or "",
                    mimeType=part.file_data.mime_type,
                ),
            ),
        )
    if part.inline_data:
        return Part(
            root=FilePart(
                file=FileWithBytes(
                    bytes=base64.b64encode(
                        part.inline_data.data,  # type: ignore
                    ).decode(),
                    mimeType=part.inline_data.mime_type,
                ),
            ),
        )
    raise ValueError(f"Unsupported part type: {part}")
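One detail in the converters above is easy to get wrong: the A2A side carries file contents as a base64 string, while the Gen AI side expects raw bytes. The round trip can be checked with the standard library alone; the sample bytes below are made up for illustration.

```python
import base64

raw = b"\x89PNG fake image bytes"

# Gen AI -> A2A: raw bytes become a base64-encoded string
encoded = base64.b64encode(raw).decode()

# A2A -> Gen AI: the string is decoded back into raw bytes
decoded = base64.b64decode(encoded)

assert isinstance(encoded, str)
assert decoded == raw
```

Forgetting the .decode() on the way out leaves you with bytes where a string is expected, which is why the example calls it explicitly.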

Testing Your Implementation

Once your executor is ready, test it with a simple client:

import asyncio

import httpx
from a2a.client import A2AClient
from a2a.types import (
    Message,
    MessageSendParams,
    Part,
    SendMessageRequest,
    TextPart,
)


async def test_your_agent():
    async with httpx.AsyncClient() as client:
        # Replace your-port with the port your agent listens on
        a2a_client = await A2AClient.get_client_from_agent_card_url(
            client, "http://localhost:your-port"
        )

        response = await a2a_client.send_message(
            SendMessageRequest(
                id="test-123",
                params=MessageSendParams(
                    message=Message(
                        role="user",
                        parts=[Part(root=TextPart(text="Hello, agent!"))],
                        messageId="msg-123"
                    )
                )
            )
        )
        print(f"Agent response: {response}")


asyncio.run(test_your_agent())

Advanced Usage: CLI for CI/CD

For automated testing in your deployment pipeline, use Rogue's CLI mode:

Basic CLI Usage

After you create the scenarios and business context, Rogue creates a .rogue folder containing a scenarios.json file with the scenarios. Point --workdir at that folder:

# Run evaluation with minimal config
uvx rogue-ai cli \
    --evaluated-agent-url http://localhost:10001 \
    --judge-llm openai/o4-mini \
    --workdir './examples/tshirt_store_agent/.rogue'
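If your pipeline is driven from Python rather than a shell script, the same invocation can be wrapped in a small helper. This is a minimal sketch that uses only the flags shown above; the function names build_rogue_command and run_rogue are hypothetical, and what a particular exit code means is an assumption you should check against Rogue's own documentation.

```python
import subprocess


def build_rogue_command(agent_url: str, judge_llm: str, workdir: str) -> list[str]:
    """Assemble the CLI invocation shown above as an argument list."""
    return [
        "uvx", "rogue-ai", "cli",
        "--evaluated-agent-url", agent_url,
        "--judge-llm", judge_llm,
        "--workdir", workdir,
    ]


def run_rogue(agent_url: str, judge_llm: str, workdir: str) -> int:
    """Run the evaluation and return the process exit code (requires uvx on PATH)."""
    command = build_rogue_command(agent_url, judge_llm, workdir)
    return subprocess.run(command, check=False).returncode


# Print the command instead of running it, so the sketch works without uvx installed
print(" ".join(build_rogue_command(
    "http://localhost:10001",
    "openai/o4-mini",
    "./examples/tshirt_store_agent/.rogue",
)))
```

Passing the arguments as a list avoids shell quoting issues, which is handy when the workdir path contains spaces.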

CI/CD Integration Example

# .github/workflows/agent-evaluation.yml
name: Rogue

on:
  pull_request:
  push:
    branches:
      - main
  workflow_dispatch:

jobs:
  run_rogue:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - name: Run rogue
        shell: bash
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          echo "Starting rogue server and example agent..."

          uvx rogue-ai server --host 0.0.0.0 --port 8000 --example tshirt_store &
          ROGUE_PID=$!
          echo "Rogue server started with PID: $ROGUE_PID"
          trap 'echo "Stopping rogue server..."; kill $ROGUE_PID' EXIT

          echo "Waiting for example agent to be ready..."
          curl --retry 10 --retry-delay 5 --retry-connrefused -s --fail -o /dev/null \
            http://localhost:10001/.well-known/agent.json

          echo "Waiting for rogue server to be ready..."
          curl --retry 10 --retry-delay 5 --retry-connrefused -s --fail -o /dev/null \
            http://localhost:8000/api/v1/health

          echo "Rogue server ready"

          echo "Running rogue..."
          uvx rogue-ai cli \
            --evaluated-agent-url http://localhost:10001 \
            --judge-llm openai/o4-mini \
            --workdir './examples/tshirt_store_agent/.rogue'

Tips for Effective Agent Testing

1. Write Comprehensive Business Context

Your business context directly impacts scenario quality. Include:

  • Policies: What your agent should/shouldn't do
  • Available actions: Tools and capabilities
  • Expected behavior: How to handle edge cases
  • Constraints: Pricing, inventory, limitations

2. Start with Core Scenarios

Begin testing with your most critical use cases:

  • Happy path interactions
  • Policy violations
  • Edge cases and error handling
  • Security boundary testing

3. Iterate Based on Results

Use evaluation reports to improve your agent:

  • Fix failed scenarios
  • Add safeguards for edge cases
  • Refine instructions based on findings
  • Re-test after changes

4. Automate in CI/CD

Make evaluation part of your deployment process:

  • Run on every pull request
  • Block deployments on failed evaluations
  • Track evaluation metrics over time
  • Alert on regression

Happy testing!
