How I Monitor AI Agents: CloudWatch for Infra, Arize Phoenix for Traces and OpenTelemetry, LLM-as-Judge for Quality

AI agents are not regular software. They reason, they call tools, they make decisions — and they can fail in ways that a simple health check will never catch. The response was technically successful, but was it actually helpful? The agent called the right tool, but did it interpret the result correctly? Traditional monitoring doesn't answer these questions.

That's why I built a three-layer observability stack for my AI agents, and today I'm walking you through exactly how it works.

📓 Full working notebook: all the code in this post is validated and executable in the companion Jupyter notebook, including setup, tracing, evals, and cleanup: https://github.com/breakingthecloud/observability-ai-agents-phoenix-otel-strands

The Problem with Monitoring AI Agents

Here's the thing: when your agent answers "I don't have weather data for Paris" — is that a failure? Technically no, the agent ran fine. But from a user perspective, it's a miss. Traditional monitoring would show 200 OK, low latency, zero errors. Everything looks green. But the user didn't get what they needed.

You need three layers of observability to actually understand what's happening:

User Query → Strands Agent → Tool Calls → Bedrock (Claude)
     ↓              ↓              ↓
  Phoenix      CloudWatch      Phoenix Evals
 (AI traces)  (infra metrics)  (quality scores)
| Layer | Tool | What it answers |
| --- | --- | --- |
| AI Traces | Arize Phoenix | What did the agent think? Which tools did it call? What was the full LLM input/output? |
| Infrastructure | Amazon CloudWatch | Is the system healthy? How fast? How much is it costing me? |
| Quality Evals | Phoenix + LLM-as-Judge | Was the response actually good? Helpful? Accurate? |

The Stack

  • Strands Agents SDK — AWS's open-source framework for building agents
  • Amazon Bedrock — Claude Sonnet 4.6 as the foundation model
  • Arize Phoenix — Open-source AI observability, runs locally, zero accounts needed
  • Amazon CloudWatch — Metrics, alarms, dashboards
  • OpenTelemetry — The glue that connects everything

What's interesting is that Phoenix runs entirely on your machine — localhost:6006. No cloud accounts, no API keys for the observability layer. You get a full tracing UI for free.

Layer 1: AI Traces with Arize Phoenix

The first thing you need is visibility into what your agent is actually doing. Not just "it responded in 2 seconds" but the full reasoning chain: what the LLM received, what it decided, which tools it called, and what it returned.

Setting Up the Tracing Pipeline

Three steps: launch Phoenix, configure OpenTelemetry, instrument Bedrock.

import phoenix as px
from opentelemetry import trace as trace_api
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.bedrock import BedrockInstrumentor

# 1. Launch Phoenix locally
session = px.launch_app()  # UI at http://localhost:6006

# 2. Configure OTel to send traces to Phoenix
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# 3. Auto-instrument all Bedrock API calls
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)

That's it. Every Bedrock call your agent makes is now traced automatically. No decorators on your business logic, no manual span creation. OpenInference handles it.

Building the Agent

The agent itself is straightforward with Strands:

import boto3
from strands import Agent
from strands.models.bedrock import BedrockModel

session = boto3.Session(profile_name="cc", region_name="us-east-1")

def get_weather(city: str) -> str:
    """Get current weather for a city."""
    weather_data = {
        "Lima": "☀️ 22°C, clear skies",
        "New York": "🌧️ 15°C, rainy",
        "Tokyo": "⛅ 18°C, partly cloudy",
    }
    return weather_data.get(city, f"Weather data not available for {city}")

model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-6", boto_session=session)

agent = Agent(
    model=model,
    tools=[get_weather],
    system_prompt="You are a helpful weather assistant. Use the get_weather tool."
)

Now when you run agent("What's the weather in Lima and Tokyo?"), Phoenix captures the entire trace tree: the agent span, the LLM calls, the tool invocations, the final response. All visible in the UI at localhost:6006.

Exploring Traces Programmatically

You don't have to use the UI. Phoenix exposes everything as DataFrames:

from phoenix.client import Client

traces_df = Client().spans.get_spans_dataframe()
traces_df["latency_ms"] = (traces_df["end_time"] - traces_df["start_time"]).dt.total_seconds() * 1000

print(f"Total spans captured: {len(traces_df)}")
traces_df[["name", "span_kind", "latency_ms", "status_code"]].head(10)

This gives you every span — agent, LLM, tool — with timing, status, and the full input/output attributes. Perfect for building custom analytics or feeding into your own dashboards.
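For example, a quick latency breakdown by span kind takes a couple of lines of pandas. A sketch on a synthetic dataframe shaped like the one above, so it runs without a live Phoenix session (the span names and timings are invented):

```python
import pandas as pd

# Synthetic stand-in for the Phoenix spans dataframe built above
traces_df = pd.DataFrame({
    "name": ["agent", "chat", "get_weather", "chat"],
    "span_kind": ["AGENT", "LLM", "TOOL", "LLM"],
    "latency_ms": [2100.0, 950.0, 12.0, 880.0],
})

# Count, mean, and worst-case latency per span kind
summary = (
    traces_df.groupby("span_kind")["latency_ms"]
    .agg(["count", "mean", "max"])
    .sort_values("mean", ascending=False)
)
print(summary)
```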

Layer 2: Infrastructure Monitoring with CloudWatch

Phoenix tells you what the agent is thinking. CloudWatch tells you if the system is healthy. Different questions, both critical.

The AgentMonitor Class

I built a simple wrapper that publishes four metrics per agent invocation:

cloudwatch = session.client("cloudwatch", region_name="us-east-1")

class AgentMonitor:
    def __init__(self, namespace="AI/Agents"):
        self.namespace = namespace
        self.cw = cloudwatch

    def track(self, agent_name: str, latency_ms: float, tokens: int,
              success: bool, tool_calls: int = 0):
        metrics = [
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "TokensUsed", "Value": tokens, "Unit": "Count"},
            {"MetricName": "Success", "Value": 1 if success else 0, "Unit": "Count"},
            {"MetricName": "ToolCalls", "Value": tool_calls, "Unit": "Count"},
        ]
        dims = [{"Name": "AgentName", "Value": agent_name}]
        for m in metrics:
            m["Dimensions"] = dims
        self.cw.put_metric_data(Namespace=self.namespace, MetricData=metrics)

Usage is clean — wrap your agent call:

import time

monitor = AgentMonitor()
start = time.time()
try:
    result = agent("What's the weather in Lima?")
    latency = (time.time() - start) * 1000
    # tokens=150 is a placeholder; in practice, read the count from the model's usage metadata
    monitor.track("weather-agent", latency, tokens=150, success=True, tool_calls=1)
except Exception:
    latency = (time.time() - start) * 1000
    monitor.track("weather-agent", latency, tokens=0, success=False, tool_calls=0)
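If you'd rather not repeat the try/except timing boilerplate at every call site, the same pattern fits in a context manager. A sketch with an injectable `publish` callback, which is my addition standing in for `monitor.track` (here a plain list append, so the snippet runs without AWS):

```python
import time
from contextlib import contextmanager

@contextmanager
def tracked(agent_name, publish):
    """Time the wrapped block and publish one record, whether it succeeds or fails."""
    record = {"agent": agent_name, "tokens": 0, "tool_calls": 0, "success": True}
    start = time.time()
    try:
        yield record  # the block fills in tokens / tool_calls as it learns them
    except Exception:
        record["success"] = False
        raise
    finally:
        record["latency_ms"] = (time.time() - start) * 1000
        publish(record)

# Stub publisher; in the setup above you would instead forward the record
# to monitor.track(record["agent"], record["latency_ms"], ...)
published = []
with tracked("weather-agent", published.append) as rec:
    rec["tokens"], rec["tool_calls"] = 150, 1

print(published[0]["success"], published[0]["tokens"])
```

Because the publish happens in `finally`, a failed invocation still emits a record with `success: False`, which is exactly what the error-rate alarm below needs.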

Smart Alarms

Two alarms that catch the most common issues:

# Alert when response time is consistently high
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Latency",
    MetricName="Latency", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=3,
    Threshold=10000.0,  # 10 seconds
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

# Alert when error rate exceeds 5%
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Error-Rate",
    MetricName="Success", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=2,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

The concept is straightforward: latency catches performance degradation, error rate catches reliability issues. These two alarms alone will catch 80% of production problems.
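To sanity-check the thresholds before deploying, it helps to simulate the evaluation rule: CloudWatch fires when the statistic breaches the threshold for N consecutive evaluation periods. A simplified pure-Python sketch (it ignores CloudWatch's missing-data handling and datapoint alignment):

```python
def alarm_breached(period_averages, threshold, evaluation_periods, below=False):
    """True if the last `evaluation_periods` period averages all breach the threshold."""
    recent = period_averages[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data yet
    if below:  # e.g. Success average < 0.95
        return all(v < threshold for v in recent)
    return all(v > threshold for v in recent)  # e.g. Latency average > 10000 ms

# Success averages per 5-minute period: two consecutive periods under 95% fire the alarm
print(alarm_breached([1.0, 0.93, 0.90], threshold=0.95, evaluation_periods=2, below=True))  # True
# A single bad period surrounded by good ones does not
print(alarm_breached([0.93, 1.0], threshold=0.95, evaluation_periods=2, below=True))        # False
```

Requiring consecutive breaches is what keeps a single slow or failed invocation from paging you at 3 a.m.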

Layer 3: LLM-as-Judge Evals

This is the layer most people skip — and it's the most important one. Your agent can be fast, reliable, and still give terrible answers. You need automated quality evaluation.

The idea: use another LLM to judge the quality of your agent's responses. It's not perfect, but it's infinitely better than no evaluation at all.

Setting Up the Evaluator

Phoenix evals v3 uses a provider-based LLM wrapper. For Bedrock, it goes through litellm:

from phoenix.evals import LLM, create_evaluator, evaluate_dataframe
from phoenix.client import Client
import pandas as pd

# Get LLM spans from Phoenix
spans_df = Client().spans.get_spans_dataframe()
llm_spans = spans_df[spans_df["span_kind"] == "LLM"].copy()

# Build eval dataframe
eval_data = pd.DataFrame({
    "input": llm_spans["attributes.input.value"].fillna("").values,
    "output": llm_spans["attributes.output.value"].fillna("").values,
})
eval_data = eval_data[eval_data["output"].str.len() > 0].reset_index(drop=True)

# Create the judge
eval_model = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")

@create_evaluator(name="helpfulness", source="llm")
def helpfulness(input: str, output: str) -> float:
    """Rate how helpful the agent response is on a scale of 0 to 1."""
    prompt = (
        f"Rate the helpfulness of this AI response on a scale of 0.0 to 1.0.\n"
        f"User asked: {input}\n"
        f"AI responded: {output}\n"
        f"Return ONLY a number between 0.0 and 1.0."
    )
    result = eval_model.generate_text(prompt=prompt)
    try:
        return float(result.strip())
    except ValueError:
        return 0.5

# Run evaluation
results = evaluate_dataframe(dataframe=eval_data, evaluators=[helpfulness])
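One fragile spot in the evaluator above: `float(result.strip())` falls over whenever the judge replies with extra words like "Score: 0.8", even though the prompt asks for a bare number. A hedged helper you could swap in, which extracts the first number and clamps it to [0, 1] (the 0.5 fallback mirrors the code above):

```python
import re

def parse_score(text, default=0.5):
    """Pull the first number out of a judge reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d*\.?\d+", text)
    if not match:
        return default  # judge returned no number at all
    return min(1.0, max(0.0, float(match.group())))

print(parse_score("0.8"))                # 0.8
print(parse_score("Score: 0.9 (good)"))  # 0.9
print(parse_score("no number here"))     # 0.5
```

Inside the `helpfulness` evaluator, the `try/except ValueError` block would then shrink to `return parse_score(result)`.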

The cool part here is the @create_evaluator decorator — it turns a simple function into a full evaluator that Phoenix understands. You can create as many as you need: helpfulness, accuracy, safety, tone, whatever matters for your use case.

Pushing Scores Back to Phoenix

The evaluation results are useful in a DataFrame, but they're even more useful when attached to the actual traces in Phoenix:

import json as _json

score_col = [c for c in results.columns if "_score" in c][0]
scores = results[score_col].apply(
    lambda x: _json.loads(x).get("value", 0) if isinstance(x, str) else 0
)

annotations = pd.DataFrame({
    "span_id": llm_spans["context.span_id"].values[:len(scores)],
    "score": scores.values,
    "label": scores.apply(lambda s: "good" if s >= 0.7 else "needs_review").values,
    "explanation": [f"Helpfulness score: {s:.2f}" for s in scores.values],
})

Client().spans.log_span_annotations_dataframe(
    dataframe=annotations,
    annotation_name="helpfulness",
    annotator_kind="LLM",
)

Now when you open Phoenix UI and click on any LLM span, you see the helpfulness score right there in the Annotations tab. Traces + quality scores in one place.

What This Costs

One question I always get: what does this cost to run?

| Component | Where | Cost |
| --- | --- | --- |
| Phoenix | Your machine (localhost) | $0 |
| Bedrock (agent calls) | AWS, pay-per-request | ~$0.003 per query |
| Bedrock (eval judge) | AWS, pay-per-request | ~$0.003 per eval |
| CloudWatch alarms | AWS | ~$0.20/month |
| CloudWatch custom metrics | AWS | ~$0.30/month |

For development and testing, you're looking at less than $1/month for the AWS side. Phoenix is completely free and local.

The Main Takeaway

Observability for AI agents requires thinking in three layers:

  1. Traces (Phoenix) — What is the agent doing? What's the full reasoning chain?
  2. Infra metrics (CloudWatch) — Is the system healthy? Fast? Within budget?
  3. Quality evals (LLM-as-Judge) — Are the responses actually good?

Most teams only do layer 2. Some add layer 1. Almost nobody does layer 3 — and that's where the real insights are. A fast, reliable agent that gives bad answers is worse than a slow one that gives good answers, because you won't even know there's a problem.

My advice: start with Phoenix traces (it's free and local), add CloudWatch for the basics (latency, errors, tokens), and then build at least one LLM-as-Judge evaluator for whatever quality dimension matters most to your users. You can set this up in an afternoon and it will save you weeks of debugging blind.


Connect with me:

  • LinkedIn - Let's discuss AI observability and agent architectures
  • X/Twitter - Follow for AWS, GenAI, and agentic AI updates
  • GitHub - Check out the full notebook and more
  • Dev.to - More technical deep-dives
  • AWS Community - Join the conversation

I'm Carlos Cortez, this is Breaking the Cloud, and today we made our agents observable. See you in the next one!
