How I Monitor My AI Agents: CloudWatch for Infra, Arize Phoenix for Traces, LLM-as-Judge for Quality
AI agents are not regular software. They reason, they call tools, they make decisions — and they can fail in ways that a simple health check will never catch. The response was technically successful, but was it actually helpful? The agent called the right tool, but did it interpret the result correctly? Traditional monitoring doesn't answer these questions.
That's why I built a three-layer observability stack for my AI agents, and today I'm walking you through exactly how it works.
📓 Full working notebook: All the code in this post is validated and executable in the companion Jupyter notebook — including setup, tracing, evals, and cleanup. You can find it here: https://github.com/breakingthecloud/observability-ai-agents-phoenix-otel-strands
The Problem with Monitoring AI Agents
Here's the thing: when your agent answers "I don't have weather data for Paris" — is that a failure? Technically no, the agent ran fine. But from a user perspective, it's a miss. Traditional monitoring would show 200 OK, low latency, zero errors. Everything looks green. But the user didn't get what they needed.
You need three layers of observability to actually understand what's happening:
User Query → Strands Agent → Tool Calls → Bedrock (Claude)
                   ↓              ↓               ↓
                Phoenix       CloudWatch     Phoenix Evals
              (AI traces)  (infra metrics)  (quality scores)
| Layer | Tool | What it answers |
|---|---|---|
| AI Traces | Arize Phoenix | What did the agent think? Which tools did it call? What was the full LLM input/output? |
| Infrastructure | Amazon CloudWatch | Is the system healthy? How fast? How much is it costing me? |
| Quality Evals | Phoenix + LLM-as-Judge | Was the response actually good? Helpful? Accurate? |
The Stack
- Strands Agents SDK — AWS's open-source framework for building agents
- Amazon Bedrock — Claude Sonnet 4.6 as the foundation model
- Arize Phoenix — Open-source AI observability, runs locally, zero accounts needed
- Amazon CloudWatch — Metrics, alarms, dashboards
- OpenTelemetry — The glue that connects everything
What's interesting is that Phoenix runs entirely on your machine — localhost:6006. No cloud accounts, no API keys for the observability layer. You get a full tracing UI for free.
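If you want to reproduce this outside the notebook, the dependency list is roughly the following. The package names are my best guess at the current ones; the companion notebook has the exact pins.
# Rough dependency list (assumed package names; see the notebook for exact versions)
# pip install arize-phoenix arize-phoenix-evals arize-phoenix-client \
#     openinference-instrumentation-bedrock opentelemetry-sdk \
#     opentelemetry-exporter-otlp strands-agents boto3 pandas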
Layer 1: AI Traces with Arize Phoenix
The first thing you need is visibility into what your agent is actually doing. Not just "it responded in 2 seconds" but the full reasoning chain: what the LLM received, what it decided, which tools it called, and what it returned.
Setting Up the Tracing Pipeline
Three steps: launch Phoenix, configure OpenTelemetry, instrument Bedrock.
import phoenix as px
from opentelemetry import trace as trace_api
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.bedrock import BedrockInstrumentor
# 1. Launch Phoenix locally
phoenix_session = px.launch_app()  # UI at http://localhost:6006
# 2. Configure OTel to send traces to Phoenix
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)
trace_api.set_tracer_provider(tracer_provider)
# 3. Auto-instrument all Bedrock API calls
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)
That's it. Every Bedrock call your agent makes is now traced automatically. No decorators on your business logic, no manual span creation. OpenInference handles it.
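If you ever do want a span around your own pre- or post-processing, you can still open one manually with the same tracer provider. This is plain OpenTelemetry, nothing the pipeline above requires; the span and attribute names below are just illustrative.
# Optional: a manual span for your own logic, exported to Phoenix via the provider above.
# The span name and attribute are illustrative, not part of the post's pipeline.
tracer = trace_api.get_tracer("weather-agent-demo")
with tracer.start_as_current_span("preprocess-query") as span:
    span.set_attribute("user.city", "Lima")
    # ... your own pre-processing would go here ...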
Building the Agent
The agent itself is straightforward with Strands:
import boto3
from strands import Agent, tool
from strands.models.bedrock import BedrockModel
session = boto3.Session(profile_name="cc", region_name="us-east-1")
@tool
def get_weather(city: str) -> str:
"""Get current weather for a city."""
weather_data = {
"Lima": "☀️ 22°C, clear skies",
"New York": "🌧️ 15°C, rainy",
"Tokyo": "⛅ 18°C, partly cloudy",
}
return weather_data.get(city, f"Weather data not available for {city}")
model = BedrockModel(model_id="us.anthropic.claude-sonnet-4-6", boto_session=session)
agent = Agent(
model=model,
tools=[get_weather],
system_prompt="You are a helpful weather assistant. Use the get_weather tool."
)
Now when you run agent("What's the weather in Lima and Tokyo?"), Phoenix captures the entire trace tree: the agent span, the LLM calls, the tool invocations, the final response. All visible in the UI at localhost:6006.
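A single traced invocation looks like this (a quick sketch; how the Strands result prints may vary slightly between SDK versions):
# One end-to-end, fully traced invocation
response = agent("What's the weather in Lima and Tokyo?")
print(response)  # final answer; the full trace tree is at http://localhost:6006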
Exploring Traces Programmatically
You don't have to use the UI. Phoenix exposes everything as DataFrames:
from phoenix.client import Client
traces_df = Client().spans.get_spans_dataframe()
traces_df["latency_ms"] = (traces_df["end_time"] - traces_df["start_time"]).dt.total_seconds() * 1000
print(f"Total spans captured: {len(traces_df)}")
traces_df[["name", "span_kind", "latency_ms", "status_code"]].head(10)
This gives you every span — agent, LLM, tool — with timing, status, and the full input/output attributes. Perfect for building custom analytics or feeding into your own dashboards.
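For instance, a quick latency breakdown by span kind, built on the same DataFrame as above:
# Per-kind latency summary from the spans DataFrame built above
summary = (
    traces_df.groupby("span_kind")["latency_ms"]
    .agg(["count", "mean", "max"])
    .round(1)
)
print(summary)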
Layer 2: Infrastructure Monitoring with CloudWatch
Phoenix tells you what the agent is thinking. CloudWatch tells you if the system is healthy. Different questions, both critical.
The AgentMonitor Class
I built a simple wrapper that publishes four metrics per agent invocation:
cloudwatch = session.client("cloudwatch", region_name="us-east-1")
class AgentMonitor:
def __init__(self, namespace="AI/Agents"):
self.namespace = namespace
self.cw = cloudwatch
def track(self, agent_name: str, latency_ms: float, tokens: int,
success: bool, tool_calls: int = 0):
metrics = [
{"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
{"MetricName": "TokensUsed", "Value": tokens, "Unit": "Count"},
{"MetricName": "Success", "Value": 1 if success else 0, "Unit": "Count"},
{"MetricName": "ToolCalls", "Value": tool_calls, "Unit": "Count"},
]
dims = [{"Name": "AgentName", "Value": agent_name}]
for m in metrics:
m["Dimensions"] = dims
self.cw.put_metric_data(Namespace=self.namespace, MetricData=metrics)
Usage is clean — wrap your agent call:
import time
monitor = AgentMonitor()
start = time.time()
try:
result = agent("What's the weather in Lima?")
latency = (time.time() - start) * 1000
monitor.track("weather-agent", latency, tokens=150, success=True, tool_calls=1)
except Exception as e:
latency = (time.time() - start) * 1000
monitor.track("weather-agent", latency, tokens=0, success=False, tool_calls=0)
Smart Alarms
Two alarms that catch the most common issues:
# Alert when response time is consistently high
cloudwatch.put_metric_alarm(
AlarmName="Agent-High-Latency",
MetricName="Latency", Namespace="AI/Agents",
Statistic="Average", Period=300, EvaluationPeriods=3,
Threshold=10000.0, # 10 seconds
ComparisonOperator="GreaterThanThreshold",
Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)
# Alert when error rate exceeds 5%
cloudwatch.put_metric_alarm(
AlarmName="Agent-High-Error-Rate",
MetricName="Success", Namespace="AI/Agents",
Statistic="Average", Period=300, EvaluationPeriods=2,
Threshold=0.95,
ComparisonOperator="LessThanThreshold",
Dimensions=[{"Name": "AgentName", "Value": "weather-agent"}],
)
The concept is straightforward: latency catches performance degradation, error rate catches reliability issues. These two alarms alone will catch 80% of production problems.
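If you also want these metrics on a dashboard (the stack lists dashboards as part of CloudWatch), a minimal put_dashboard call looks roughly like this. The widget layout values are just a starting point:
import json
# Minimal dashboard sketch: average latency and success rate for the weather agent.
# Layout values (x, y, width, height) are arbitrary starting points.
dashboard_body = {
    "widgets": [
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Agent latency (avg)",
                "metrics": [["AI/Agents", "Latency", "AgentName", "weather-agent"]],
                "stat": "Average", "period": 300, "region": "us-east-1",
            },
        },
        {
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Success rate (avg of 0/1 metric)",
                "metrics": [["AI/Agents", "Success", "AgentName", "weather-agent"]],
                "stat": "Average", "period": 300, "region": "us-east-1",
            },
        },
    ]
}
cloudwatch.put_dashboard(DashboardName="ai-agents", DashboardBody=json.dumps(dashboard_body))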
Layer 3: LLM-as-Judge Evals
This is the layer most people skip — and it's the most important one. Your agent can be fast, reliable, and still give terrible answers. You need automated quality evaluation.
The idea: use another LLM to judge the quality of your agent's responses. It's not perfect, but it's infinitely better than no evaluation at all.
Setting Up the Evaluator
Phoenix evals v3 uses a provider-based LLM wrapper. For Bedrock, it goes through litellm:
from phoenix.evals import LLM, create_evaluator, evaluate_dataframe
from phoenix.client import Client
import pandas as pd
# Get LLM spans from Phoenix
spans_df = Client().spans.get_spans_dataframe()
llm_spans = spans_df[spans_df["span_kind"] == "LLM"].copy()
# Build eval dataframe
eval_data = pd.DataFrame({
"input": llm_spans["attributes.input.value"].fillna("").values,
"output": llm_spans["attributes.output.value"].fillna("").values,
})
eval_data = eval_data[eval_data["output"].str.len() > 0].reset_index(drop=True)
# Create the judge
eval_model = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")
@create_evaluator(name="helpfulness", source="llm")
def helpfulness(input: str, output: str) -> float:
"""Rate how helpful the agent response is on a scale of 0 to 1."""
prompt = (
f"Rate the helpfulness of this AI response on a scale of 0.0 to 1.0.\n"
f"User asked: {input}\n"
f"AI responded: {output}\n"
f"Return ONLY a number between 0.0 and 1.0."
)
result = eval_model.generate_text(prompt=prompt)
try:
return float(result.strip())
except ValueError:
return 0.5
# Run evaluation
results = evaluate_dataframe(dataframe=eval_data, evaluators=[helpfulness])
The cool part here is the @create_evaluator decorator — it turns a simple function into a full evaluator that Phoenix understands. You can create as many as you need: helpfulness, accuracy, safety, tone, whatever matters for your use case.
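For example, here is a second judge in the same style, reusing the same Bedrock model wrapper (the prompt and the 0.5 fallback are illustrative, just like above):
# A second evaluator in the same pattern: a simple conciseness judge.
@create_evaluator(name="conciseness", source="llm")
def conciseness(output: str) -> float:
    """Rate how concise the response is, from 0.0 (rambling) to 1.0 (tight)."""
    prompt = (
        "Rate how concise this AI response is on a scale of 0.0 to 1.0.\n"
        f"Response: {output}\n"
        "Return ONLY a number between 0.0 and 1.0."
    )
    result = eval_model.generate_text(prompt=prompt)
    try:
        return float(result.strip())
    except ValueError:
        return 0.5
# Running both judges over the same eval dataframe (kept separate from `results` above)
multi_results = evaluate_dataframe(dataframe=eval_data, evaluators=[helpfulness, conciseness])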
Pushing Scores Back to Phoenix
The evaluation results are useful in a DataFrame, but they're even more useful when attached to the actual traces in Phoenix:
import json as _json
score_col = [c for c in results.columns if "_score" in c][0]
scores = results[score_col].apply(
lambda x: _json.loads(x).get("value", 0) if isinstance(x, str) else 0
)
annotations = pd.DataFrame({
"span_id": llm_spans["context.span_id"].values[:len(scores)],
"score": scores.values,
"label": scores.apply(lambda s: "good" if s >= 0.7 else "needs_review").values,
"explanation": [f"Helpfulness score: {s:.2f}" for s in scores.values],
})
Client().spans.log_span_annotations_dataframe(
dataframe=annotations,
annotation_name="helpfulness",
annotator_kind="LLM",
)
Now when you open Phoenix UI and click on any LLM span, you see the helpfulness score right there in the Annotations tab. Traces + quality scores in one place.
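And since the annotations DataFrame is still in memory, a quick triage list of what to review first:
# Quick triage: which responses scored below the 0.7 threshold
needs_review = annotations[annotations["label"] == "needs_review"]
print(f"{len(needs_review)} of {len(annotations)} responses flagged for review")
print(needs_review[["span_id", "score"]])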
What This Costs
One question I always get: what does this cost to run?
| Component | Where | Cost |
|---|---|---|
| Phoenix | Your machine (localhost) | $0 |
| Bedrock (agent calls) | AWS, pay-per-request | ~$0.003 per query |
| Bedrock (eval judge) | AWS, pay-per-request | ~$0.003 per eval |
| CloudWatch alarms | AWS | ~$0.20/month |
| CloudWatch custom metrics | AWS | ~$0.30/month |
For development and testing, you're looking at less than $1/month for the AWS side. Phoenix is completely free and local.
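As a rough sanity check on that figure (assumed dev volumes, using the per-unit numbers above):
# Back-of-the-envelope monthly estimate for a small dev workload (assumed volumes)
queries, evals = 50, 50
bedrock = (queries + evals) * 0.003   # ~$0.30 in model calls
cloudwatch = 0.20 + 0.30              # alarms + custom metrics
print(f"~${bedrock + cloudwatch:.2f}/month")  # roughly $0.80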
The Main Takeaway
Observability for AI agents requires thinking in three layers:
- Traces (Phoenix) — What is the agent doing? What's the full reasoning chain?
- Infra metrics (CloudWatch) — Is the system healthy? Fast? Within budget?
- Quality evals (LLM-as-Judge) — Are the responses actually good?
Most teams only do layer 2. Some add layer 1. Almost nobody does layer 3 — and that's where the real insights are. A fast, reliable agent that gives bad answers is worse than a slow one that gives good answers, because you won't even know there's a problem.
My advice: start with Phoenix traces (it's free and local), add CloudWatch for the basics (latency, errors, tokens), and then build at least one LLM-as-Judge evaluator for whatever quality dimension matters most to your users. You can set this up in an afternoon and it will save you weeks of debugging blind.
Connect with me:
- LinkedIn - Let's discuss AI observability and agent architectures
- X/Twitter - Follow for AWS, GenAI, and agentic AI updates
- GitHub - Check out the full notebook and more
- Dev.to - More technical deep-dives
- AWS Community - Join the conversation
I'm Carlos Cortez, this is Breaking the Cloud, and today we made our agents observable. See you in the next one!