How I Monitor AI Agents: CloudWatch for Infra, Arize Phoenix for Traces and OpenTelemetry, LLM-as-Judge for Quality

AI agents are not regular software. They reason, they call tools, they make decisions — and they can fail in ways that a simple health check will never catch. The response was technically successful, but was it actually helpful? The agent called the right tool, but did it interpret the result correctly? Traditional monitoring doesn't answer these questions.

That's why I built a three-layer observability stack for my AI agents, and today I'm walking you through exactly how it works.

📓 Full working notebook: all the code in this post is validated and executable in the companion Jupyter notebook, including setup, tracing, evals, and cleanup: https://github.com/breakingthecloud/observability-ai-agents-phoenix-otel-strands

The Problem with Monitoring AI Agents

Here's the thing: when your agent answers "I don't have weather data for Paris" — is that a failure? Technically no, the agent ran fine. But from a user perspective, it's a miss. Traditional monitoring would show 200 OK, low latency, zero errors. Everything looks green. But the user didn't get what they needed.

You need three layers of observability to actually understand what's happening:

User Query → Strands Agent → Tool Calls → Bedrock (Claude)
     ↓              ↓              ↓
  Phoenix      CloudWatch      Phoenix Evals
 (AI traces)  (infra metrics)  (quality scores)
| Layer | Tool | What it answers |
| --- | --- | --- |
| AI Traces | Arize Phoenix | What did the agent think? Which tools did it call? What was the full LLM input/output? |
| Infrastructure | Amazon CloudWatch | Is the system healthy? How fast? How much is it costing me? |
| Quality Evals | Phoenix + LLM-as-Judge | Was the response actually good? Helpful? Accurate? |

The Stack

  • Strands Agents SDK — AWS's open-source framework for building agents
  • Amazon Bedrock — Claude Sonnet 4.6 as the foundation model
  • Arize Phoenix — Open-source AI observability, runs locally, zero accounts needed
  • Amazon CloudWatch — Metrics, alarms, dashboards
  • OpenTelemetry — The glue that connects everything

What's interesting is that Phoenix runs entirely on your machine — localhost:6006. No cloud accounts, no API keys for the observability layer. You get a full tracing UI for free.

Layer 1: AI Traces with Arize Phoenix

The first thing you need is visibility into what your agent is actually doing. Not just "it responded in 2 seconds" but the full reasoning chain: what the LLM received, what it decided, which tools it called, and what it returned.

Setting Up the Tracing Pipeline

Three steps: launch Phoenix, configure OpenTelemetry, instrument Bedrock.

import phoenix as px
from opentelemetry import trace as trace_api
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.bedrock import BedrockInstrumentor

# 1. Launch Phoenix locally
session = px.launch_app()  # UI at http://localhost:6006

# 2. Configure OTel to send traces to Phoenix
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)

# 3. Auto-instrument all Bedrock API calls
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)

That's it. Every Bedrock call your agent makes is now traced automatically. No decorators on your business logic, no manual span creation. OpenInference handles it.

Building the Agent

The agent itself is straightforward with Strands:

import boto3
from strands import Agent
from strands.models.bedrock import BedrockModel

session = boto3.Session(profile_name="cc", region_name="us-east-1")

def get_weather(city: str) -> str:
    """Get current weather for a city."""
    weather_data = {
        "Lima": "☀️ 22°C, clear skies",
        "New York": "🌧️ 15°C, rainy",
        "Tokyo": "⛅ 18°C, partly cloudy",
    }
    return weather_data.get(city, f"Weather data not available for {city}")

model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-6", boto_session=session)

agent = Agent(
    model=model,
    tools=[get_weather],
    system_prompt="You are a helpful weather assistant. Use the get_weather tool."
)

Now when you run agent("What's the weather in Lima and Tokyo?"), Phoenix captures the entire trace tree: the agent span, the LLM calls, the tool invocations, the final response. All visible in the UI at localhost:6006.

Exploring Traces Programmatically

You don't have to use the UI. Phoenix exposes everything as DataFrames:

from phoenix.client import Client

traces_df = Client().spans.get_spans_dataframe()
traces_df["latency_ms"] = (traces_df["end_time"] - traces_df["start_time"]).dt.total_seconds() * 1000

print(f"Total spans captured: {len(traces_df)}")
traces_df[["name", "span_kind", "latency_ms", "status_code"]].head(10)

This gives you every span — agent, LLM, tool — with timing, status, and the full input/output attributes. Perfect for building custom analytics or feeding into your own dashboards.
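For example, a quick latency breakdown by span kind takes a couple of lines of pandas. A sketch on a synthetic dataframe shaped like the one above, so it runs without a live Phoenix session (the span names and timings are invented):

```python
import pandas as pd

# Synthetic stand-in for the Phoenix spans dataframe built above
traces_df = pd.DataFrame({
    "name": ["agent", "chat", "get_weather", "chat"],
    "span_kind": ["AGENT", "LLM", "TOOL", "LLM"],
    "latency_ms": [2100.0, 950.0, 12.0, 880.0],
})

# Count, mean, and worst-case latency per span kind
summary = (
    traces_df.groupby("span_kind")["latency_ms"]
    .agg(["count", "mean", "max"])
    .sort_values("mean", ascending=False)
)
print(summary)
```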

Layer 2: Infrastructure Monitoring with CloudWatch

Phoenix tells you what the agent is thinking. CloudWatch tells you if the system is healthy. Different questions, both critical.

The AgentMonitor Class

I built a simple wrapper that publishes four metrics per agent invocation:

cloudwatch = session.client("cloudwatch", region_name="us-east-1")

class AgentMonitor:
    def __init__(self, namespace="AI/Agents"):
        self.namespace = namespace
        self.cw = cloudwatch

    def track(self, agent_name: str, latency_ms: float, tokens: int,
              success: bool, tool_calls: int = 0):
        metrics = [
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "TokensUsed", "Value": tokens, "Unit": "Count"},
            {"MetricName": "Success", "Value": 1 if success else 0, "Unit": "Count"},
            {"MetricName": "ToolCalls", "Value": tool_calls, "Unit": "Count"},
        ]
        dims = [{"Name": "AgentName", "Value": agent_name}]
        for m in metrics:
            m["Dimensions"] = dims
        self.cw.put_metric_data(Namespace=self.namespace, MetricData=metrics)

Usage is clean — wrap your agent call:

import time

monitor = AgentMonitor()
start = time.time()
try:
    result = agent("What's the weather in Lima?")
    latency = (time.time() - start) * 1000
    # tokens=150 is a placeholder; in practice, read the count from the model's usage metadata
    monitor.track("weather-agent", latency, tokens=150, success=True, tool_calls=1)
except Exception:
    latency = (time.time() - start) * 1000
    monitor.track("weather-agent", latency, tokens=0, success=False, tool_calls=0)
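If you'd rather not repeat the try/except timing boilerplate at every call site, the same pattern fits in a context manager. A sketch with an injectable `publish` callback, which is my addition standing in for `monitor.track` (here a plain list append, so the snippet runs without AWS):

```python
import time
from contextlib import contextmanager

@contextmanager
def tracked(agent_name, publish):
    """Time the wrapped block and publish one record, whether it succeeds or fails."""
    record = {"agent": agent_name, "tokens": 0, "tool_calls": 0, "success": True}
    start = time.time()
    try:
        yield record  # the block fills in tokens / tool_calls as it learns them
    except Exception:
        record["success"] = False
        raise
    finally:
        record["latency_ms"] = (time.time() - start) * 1000
        publish(record)

# Stub publisher; in the setup above you would instead forward the record
# to monitor.track(record["agent"], record["latency_ms"], ...)
published = []
with tracked("weather-agent", published.append) as rec:
    rec["tokens"], rec["tool_calls"] = 150, 1

print(published[0]["success"], published[0]["tokens"])
```

Because the publish happens in `finally`, a failed invocation still emits a record with `success: False`, which is exactly what the error-rate alarm below needs.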

Smart Alarms

Two alarms that catch the most common issues:

# Alert when response time is consistently high
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Latency",
    MetricName="Latency", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=3,
    Threshold=10000.0,  # 10 seconds
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

# Alert when error rate exceeds 5%
cloudwatch.put_metric_alarm(
    AlarmName="Agent-High-Error-Rate",
    MetricName="Success", Namespace="AI/Agents",
    Statistic="Average", Period=300, EvaluationPeriods=2,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)

The concept is straightforward: latency catches performance degradation, error rate catches reliability issues. These two alarms alone will catch 80% of production problems.
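To sanity-check the thresholds before deploying, it helps to simulate the evaluation rule: CloudWatch fires when the statistic breaches the threshold for N consecutive evaluation periods. A simplified pure-Python sketch (it ignores CloudWatch's missing-data handling and datapoint alignment):

```python
def alarm_breached(period_averages, threshold, evaluation_periods, below=False):
    """True if the last `evaluation_periods` period averages all breach the threshold."""
    recent = period_averages[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough data yet
    if below:  # e.g. Success average < 0.95
        return all(v < threshold for v in recent)
    return all(v > threshold for v in recent)  # e.g. Latency average > 10000 ms

# Success averages per 5-minute period: two consecutive periods under 95% fire the alarm
print(alarm_breached([1.0, 0.93, 0.90], threshold=0.95, evaluation_periods=2, below=True))  # True
# A single bad period surrounded by good ones does not
print(alarm_breached([0.93, 1.0], threshold=0.95, evaluation_periods=2, below=True))        # False
```

Requiring consecutive breaches is what keeps a single slow or failed invocation from paging you at 3 a.m.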

Layer 3: LLM-as-Judge Evals

This is the layer most people skip — and it's the most important one. Your agent can be fast, reliable, and still give terrible answers. You need automated quality evaluation.

The idea: use another LLM to judge the quality of your agent's responses. It's not perfect, but it's infinitely better than no evaluation at all.

Setting Up the Evaluator

Phoenix evals v3 uses a provider-based LLM wrapper. For Bedrock, it goes through litellm:

from phoenix.evals import LLM, create_evaluator, evaluate_dataframe
from phoenix.client import Client
import pandas as pd

# Get LLM spans from Phoenix
spans_df = Client().spans.get_spans_dataframe()
llm_spans = spans_df[spans_df["span_kind"] == "LLM"].copy()

# Build eval dataframe
eval_data = pd.DataFrame({
    "input": llm_spans["attributes.input.value"].fillna("").values,
    "output": llm_spans["attributes.output.value"].fillna("").values,
})
eval_data = eval_data[eval_data["output"].str.len() > 0].reset_index(drop=True)

# Create the judge
eval_model = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")

@create_evaluator(name="helpfulness", source="llm")
def helpfulness(input: str, output: str) -> float:
    """Rate how helpful the agent response is on a scale of 0 to 1."""
    prompt = (
        f"Rate the helpfulness of this AI response on a scale of 0.0 to 1.0.\n"
        f"User asked: {input}\n"
        f"AI responded: {output}\n"
        f"Return ONLY a number between 0.0 and 1.0."
    )
    result = eval_model.generate_text(prompt=prompt)
    try:
        return float(result.strip())
    except ValueError:
        return 0.5

# Run evaluation
results = evaluate_dataframe(dataframe=eval_data, evaluators=[helpfulness])
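One fragile spot in the evaluator above: `float(result.strip())` falls over whenever the judge replies with extra words like "Score: 0.8", even though the prompt asks for a bare number. A hedged helper you could swap in, which extracts the first number and clamps it to [0, 1] (the 0.5 fallback mirrors the code above):

```python
import re

def parse_score(text, default=0.5):
    """Pull the first number out of a judge reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d*\.?\d+", text)
    if not match:
        return default  # judge returned no number at all
    return min(1.0, max(0.0, float(match.group())))

print(parse_score("0.8"))                # 0.8
print(parse_score("Score: 0.9 (good)"))  # 0.9
print(parse_score("no number here"))     # 0.5
```

Inside the `helpfulness` evaluator, the `try/except ValueError` block would then shrink to `return parse_score(result)`.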

The cool part here is the @create_evaluator decorator — it turns a simple function into a full evaluator that Phoenix understands. You can create as many as you need: helpfulness, accuracy, safety, tone, whatever matters for your use case.

Pushing Scores Back to Phoenix

The evaluation results are useful in a DataFrame, but they're even more useful when attached to the actual traces in Phoenix:

import json as _json

score_col = [c for c in results.columns if "_score" in c][0]
scores = results[score_col].apply(
    lambda x: _json.loads(x).get("value", 0) if isinstance(x, str) else 0
)

annotations = pd.DataFrame({
    "span_id": llm_spans["context.span_id"].values[:len(scores)],
    "score": scores.values,
    "label": scores.apply(lambda s: "good" if s >= 0.7 else "needs_review").values,
    "explanation": [f"Helpfulness score: {s:.2f}" for s in scores.values],
})

Client().spans.log_span_annotations_dataframe(
    dataframe=annotations,
    annotation_name="helpfulness",
    annotator_kind="LLM",
)

Now when you open Phoenix UI and click on any LLM span, you see the helpfulness score right there in the Annotations tab. Traces + quality scores in one place.

What This Costs

One question I always get: what does this cost to run?

| Component | Where | Cost |
| --- | --- | --- |
| Phoenix | Your machine (localhost) | $0 |
| Bedrock (agent calls) | AWS, pay-per-request | ~$0.003 per query |
| Bedrock (eval judge) | AWS, pay-per-request | ~$0.003 per eval |
| CloudWatch alarms | AWS | ~$0.20/month |
| CloudWatch custom metrics | AWS | ~$0.30/month |

For development and testing, you're looking at less than $1/month for the AWS side. Phoenix is completely free and local.

The Main Takeaway

Observability for AI agents requires thinking in three layers:

  1. Traces (Phoenix) — What is the agent doing? What's the full reasoning chain?
  2. Infra metrics (CloudWatch) — Is the system healthy? Fast? Within budget?
  3. Quality evals (LLM-as-Judge) — Are the responses actually good?

Most teams only do layer 2. Some add layer 1. Almost nobody does layer 3 — and that's where the real insights are. A fast, reliable agent that gives bad answers is worse than a slow one that gives good answers, because you won't even know there's a problem.

My advice: start with Phoenix traces (it's free and local), add CloudWatch for the basics (latency, errors, tokens), and then build at least one LLM-as-Judge evaluator for whatever quality dimension matters most to your users. You can set this up in an afternoon and it will save you weeks of debugging blind.


Connect with me:

  • LinkedIn - Let's discuss AI observability and agent architectures
  • X/Twitter - Follow for AWS, GenAI, and agentic AI updates
  • GitHub - Check out the full notebook and more
  • Dev.to - More technical deep-dives
  • AWS Community - Join the conversation

I'm Carlos Cortez, this is Breaking the Cloud, and today we made our agents observable. See you in the next one!
