Dipayan Das

Posted on Jun 8

From 0.91 to 0.30: Evaluating AI Agents with Bedrock AgentCore and OpenTelemetry

#ai #observability #opentelemetry #aws

A field guide to turning a subjective "feels ready" into a 0–1 score you can alert on — and the OpenTelemetry architecture that keeps that signal portable across backends.

This post is not about whether LLM-as-a-judge is perfect. It's about making AI-agent quality observable: score the trace, alert on regressions, and keep the telemetry portable with OpenTelemetry.

TL;DR

Two evaluation surfaces matter on AWS, sharing one idea — an LLM scoring another model's output:

Bedrock model evaluation — offline, for picking a model. Generator + judge, a prompt dataset, scores and rationales in the console and S3.
AgentCore Evaluations — online and on-demand, for trusting an agent in production. Built-in LLM-as-a-judge evaluators scoring real traces on a 0–1 scale, published to CloudWatch.

The upgrade is not the score by itself. It is the score attached to the trace that produced it — the difference between a thermometer and a diagnosis. And because the telemetry is OpenTelemetry-compatible, the same spans can fan out through an OTel Collector to Langfuse, Arize Phoenix, Datadog, or Grafana.

The judge gives you the number.
The trace tells you why.
OpenTelemetry makes it portable.
Calibration keeps it honest.

The 0.91 → 0.30 Failure Mode

Here's a production failure that no latency or error-rate dashboard would catch. In its re:Invent 2025 session on the feature (AIM3348), AWS showed a travel agent whose tool-selection accuracy fell from 0.91 to 0.30 in production. The agent was still calling tools successfully; it was just calling the wrong ones. Latency was fine. Errors were fine. What broke was judgment, and only an evaluation layer surfaced it. (I'm treating this as an illustrative production-regression pattern, not a benchmark or an expected drop.)

That's the gap. A classic service has objective metrics — p99 under 200 ms, error rate under 0.1% — that pass or fail with no argument. An agent raises subjective questions: was the answer useful, did it pick the right tool, did the conversation reach its goal, was the output safe? You can't regex those, and BLEU/ROUGE only match word patterns, not understanding.

The honest version of the old workflow was: ask 20 questions, read the answers, decide by feel, ship, cross fingers. It doesn't scale and it collapses when someone asks "how do we know it works?" So "better judgment" doesn't mean trust the agent more — it means convert each subjective question into a consistent, comparable number a human or a pipeline can act on. That's what LLM-as-a-judge does, and AWS turned what used to be months of in-house plumbing into a managed service.

Model Evaluation vs AgentCore Evaluations

Get the two products straight, because people conflate them constantly.

Amazon Bedrock model evaluation is for choosing a model. A generator answers prompts from your dataset; an evaluator (judge) model scores each pair and explains why. AWS introduced LLM-as-a-judge for model evaluation in December 2024, alongside automatic metrics (exact match, BLEU, ROUGE) and human evaluation. You pick the judge, select metrics (correctness, completeness, tone, plus responsible-AI metrics like harmfulness and refusal), bring your own prompts, and compare across jobs. Offline, pre-deployment: which model wins on my data.

AgentCore Evaluations is for trusting an agent already running. Announced at re:Invent on December 2, 2025 and generally available since March 31, 2026, it scores real agent traces (multi-turn sessions, tool calls) against managed evaluators, continuously or on demand. It publishes results to CloudWatch Logs and scores to CloudWatch Metrics, next to your latency and token telemetry.

It ships more than a dozen built-in evaluators (AWS's materials cite thirteen), all scored 0–1 and reference-free — no pre-labeled golden answer required, which is what makes them usable on live traffic. Four families:

Response quality — correctness, faithfulness, helpfulness, relevance, conciseness, coherence, instruction-following, refusal.
Safety — harmfulness, stereotyping.
Task completion — goal success rate (session-level): did the conversation accomplish what the user came for?
Component level — tool selection accuracy, tool parameter accuracy. (This is the family that caught the 0.91 → 0.30 drop.)

Custom evaluators let you bring your own judge model, prompt, and scoring schema (per trace, session, or tool call). In my experience, built-ins cover the common starting cases; reach for custom only when you have a domain-specific rubric — compliance, brand voice — the generic ones can't express.

Verdict — model evaluation decides what to build on; AgentCore Evaluations decides whether to keep trusting what you shipped. Use the first once per model decision; run the second forever. Start with three evaluators — helpfulness, tool-selection accuracy, goal success rate — not thirteen.

Better Judgment Is the Number Plus the Trace

A score alone is a thermometer: helpfulness is 0.71 and now you're worried but you don't know why. The leverage is the score being bound to the trace that produced it. When goal success drops, you drill from the CloudWatch Metrics aggregate into the sessions scored "No," read the span timeline — system prompt, user turn, each LLM call, each tool invocation — and see the failure directly: "these forty sessions failed, and they all called the wrong tool after the same ambiguous phrasing."
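The drill-down described above is, at heart, a grouping step: take the low-scoring sessions and look for what they share. Here is a minimal sketch, assuming you have already exported scored sessions into plain dictionaries. The field names (`session_id`, `goal_success`, `tools_called`), the sample tool names, and the 0.5 cut-off are all hypothetical, not the AgentCore result schema.

```python
from collections import Counter

# Hypothetical export of scored sessions; field names are illustrative only.
sessions = [
    {"session_id": "s1", "goal_success": 0.0, "tools_called": ("search_flights", "book_hotel")},
    {"session_id": "s2", "goal_success": 1.0, "tools_called": ("search_flights", "book_flight")},
    {"session_id": "s3", "goal_success": 0.0, "tools_called": ("search_flights", "book_hotel")},
    {"session_id": "s4", "goal_success": 0.0, "tools_called": ("lookup_weather",)},
]


def failure_patterns(scored_sessions, threshold=0.5):
    """Count tool-call sequences among sessions scoring below `threshold`."""
    failed = [s for s in scored_sessions if s["goal_success"] < threshold]
    return Counter(s["tools_called"] for s in failed).most_common()


patterns = failure_patterns(sessions)
for tools, count in patterns:
    print(f"{count} failed session(s): {' -> '.join(tools)}")
```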

That's the judgment upgrade, stated plainly:

Comparable — 0.71 today vs 0.78 last week is a real comparison, not two vibes.
Alertable — page when helpfulness drops 10% over eight hours, instead of hearing it from a support ticket.
Defensible — "how do we know it works" gets a trend line and traces, not a spreadsheet of gut calls.
Diagnosable — the trace tells you what to change, so the next deploy is informed.
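To make "alertable" concrete, here is a minimal sketch of the relative-drop check behind a rule like "page when helpfulness drops 10% over eight hours." In practice this would more likely live in a CloudWatch alarm than in your own code; the function names, window contents, and sample scores are assumptions for illustration.

```python
from statistics import mean


def relative_drop(baseline_scores, recent_scores):
    """Fractional drop of the recent mean relative to the baseline mean."""
    baseline = mean(baseline_scores)
    if baseline == 0:
        return 0.0  # nothing to drop from; avoid dividing by zero
    return (baseline - mean(recent_scores)) / baseline


def should_page(baseline_scores, recent_scores, max_drop=0.10):
    """True when the recent window has fallen by at least `max_drop`."""
    return relative_drop(baseline_scores, recent_scores) >= max_drop


last_week = [0.78, 0.80, 0.76, 0.79]   # baseline window (made-up scores)
last_eight_hours = [0.70, 0.68, 0.72]  # recent window (made-up scores)

drop = relative_drop(last_week, last_eight_hours)
print(f"drop={drop:.1%} page={should_page(last_week, last_eight_hours)}")
```

Comparing against a rolling baseline rather than a fixed floor is what lets the same rule catch a fall from 0.91 as readily as a fall from 0.75.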

Two modes map onto two kinds of judgment. Online evaluation samples a configurable percentage of production traces, aggregates in real time, and feeds alerts — your standing early-warning system. On-demand evaluation scores specific spans by ID, ad hoc — for debugging a regression or gating a release in CI/CD.
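If you ever need to reproduce percentage sampling yourself (say, to pre-select sessions before shipping traces), one common approach is deterministic hashing on the session ID, so every span of a session lands on the same side of the cut. This is a sketch of that general technique, not a description of how AgentCore samples internally; `in_sample` and the session ID format are hypothetical.

```python
import hashlib


def in_sample(session_id: str, rate: float) -> bool:
    """Deterministically decide whether a session falls in the sampled fraction."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


session_ids = [f"session-{i}" for i in range(10_000)]
sampled = [s for s in session_ids if in_sample(s, rate=0.10)]
print(f"sampled {len(sampled)} of {len(session_ids)}")
```

Because the decision depends only on the ID, re-running the selection gives the same sample, which makes a regression easier to reproduce.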

Verdict — online eval is monitoring; on-demand eval is testing. Wiring both is what moves you from "we evaluate sometimes" to "evaluation is how we ship."

The Telemetry Is OpenTelemetry, So the Architecture Is Yours

This is the most DEV-relevant part: evaluation doesn't have to live in one vendor's dashboard, because the telemetry is OpenTelemetry.

AgentCore Evaluations consumes instrumented agent telemetry through standard libraries — OpenTelemetry (opentelemetry-instrumentation-langchain), OpenInference (openinference-instrumentation-langchain), and ADOT (AWS Distro for OpenTelemetry). Strands Agents and LangGraph are natively supported; anything else you instrument yourself. AgentCore Observability emits OpenTelemetry-compatible telemetry, which is why it integrates not just with CloudWatch but with Datadog, LangSmith, and Langfuse.

The shared vocabulary is the OpenTelemetry GenAI semantic conventions — still in Development, but the convergence point the industry is moving toward. They give you a predictable span tree:

invoke_agent                         (the top-level agent run)
  ├── chat            gen_ai.request.model, gen_ai.usage.input_tokens,
  │                   gen_ai.usage.output_tokens, gen_ai.response.finish_reasons
  ├── execute_tool    the tool name, parameters, result
  ├── chat            (next reasoning step)
  └── execute_tool    ...

Privacy default worth knowing: instrumentation does not capture prompt content or tool arguments by default — gen_ai.input.messages, gen_ai.output.messages, gen_ai.system_instructions are opt-in. Your evaluators need some content to judge, so enable capture selectively, not globally.

Two data sources define your options: an AgentCore Runtime agent endpoint, or a CloudWatch log group (agent runs anywhere, you ship traces to CloudWatch). The second means evaluation isn't contingent on hosting on AgentCore.

Put an OpenTelemetry Collector in the middle and one trace stream serves both AWS-native evaluation and whatever you already run, with no double instrumentation.

Which downstream backend? An honest mid-2026 read (nobody has won outright and per-span pricing is still unsettled, so optimize for portability):

Inside LangChain/LangGraph          → LangSmith
OTel neutrality + platform eng      → Arize Phoenix (OSS)
Evals should drive deploys          → Braintrust
Already run Datadog/New Relic/etc.  → use it if it ingests OTLP / supports GenAI semconv well enough
Self-hosted, own the data           → Langfuse / Grafana Tempo+Loki
AWS-native, no new vendor           → CloudWatch + AgentCore Evaluations

Verdict — make the OTel Collector your fan-out point, and instrument to the GenAI semantic conventions, not a vendor SDK. That keeps you portable: swap the downstream backend tomorrow without touching agent code.

A Minimal (Conceptual) Setup

Setup order, any backend: add the OpenTelemetry SDK and an OTLP exporter early in the process lifecycle; add auto-instrumentation (OpenLLMetry from Traceloop, or OpenInference from Arize) to get LLM client spans for free; then wrap your top-level agent function in an invoke_agent span.

export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="claims-triage-agent"
# Enable GenAI content capture ONLY where evaluators need it (sensitive)
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT="true"

The Collector is where you sample, redact sensitive attributes, and route. Conceptual shape (exporter/endpoint names vary by deployment):

receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  attributes:
    actions:
      - key: gen_ai.input.messages    # redact before fan-out if not needed
        action: delete
      - key: gen_ai.output.messages
        action: delete

exporters:
  otlp/cloudwatch:
    endpoint: ${CLOUDWATCH_OTLP_ENDPOINT}
  otlp/secondary:
    endpoint: ${OBSERVABILITY_BACKEND_OTLP_ENDPOINT}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/cloudwatch, otlp/secondary]

A privacy or sampling change is now one config edit, not a re-instrumentation project.

Wiring Evaluation Into CI/CD

On-demand evaluation is what turns "we have evals" into a quality gate. Shape only — confirm the exact API surface against current docs, since this moved preview → GA and names shift:

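import json

import boto3


def load_trace_ids(path):
    # Hypothetical helper: assumes the file holds a JSON list of span IDs.
    with open(path) as f:
        return json.load(f)


# Illustrative shape only: method and parameter names may not match the current SDK.
client = boto3.client("bedrock-agentcore-control")

resp = client.create_on_demand_evaluation(
    spanIds=load_trace_ids("staging_run.json"),
    evaluators=[
        "Builtin.Helpfulness",
        "Builtin.ToolSelectionAccuracy",
        "Builtin.GoalSuccessRate",
        "custom-compliance-check",   # only if your domain needs it
    ],
)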
# Poll for results, then fail the build if any metric regresses

Dropped into CI/CD, it blocks a deploy when quality drops:

# .github/workflows/agent-quality-gate.yml
name: Agent Quality Gate
on:
  pull_request:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # the scripts below live in the repo
      - run: ./deploy_staging.sh
      - run: python test_scenarios.py --output staging_run.json
      - run: python evaluate_on_demand.py --input staging_run.json
      - run: python quality_gate.py --min-score 0.7 --fail-on-regression

Same move as a unit-test suite — except the "test" is a probabilistic judge, so your threshold is a policy decision, not a hard truth. 0.7 is a reasonable starting line for many teams, but calibrate it against human review before enforcing it hard. Fail on regression, not just on an absolute floor — the 0.91 → 0.30 collapse was a drop, not a low number.
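As a rough illustration of "fail on regression, not just on an absolute floor," here is a minimal sketch of the comparison a script like quality_gate.py might run. The `gate` function, the metric keys, the 0.05 regression tolerance, and the baseline dictionary are assumptions for this example; how you store and load the previous run's scores is up to you.

```python
def gate(current, baseline, min_score=0.7, max_regression=0.05):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for metric, score in current.items():
        # Absolute floor: the policy threshold discussed above.
        if score < min_score:
            failures.append(f"{metric}: {score:.2f} is below floor {min_score:.2f}")
        # Regression: compare with the last accepted run, if one exists.
        previous = baseline.get(metric)
        if previous is not None and previous - score > max_regression:
            failures.append(f"{metric}: regressed from {previous:.2f} to {score:.2f}")
    return failures


# Made-up scores: the last accepted run versus the candidate build.
baseline = {"helpfulness": 0.82, "tool_selection_accuracy": 0.91, "goal_success_rate": 0.80}
current = {"helpfulness": 0.81, "tool_selection_accuracy": 0.30, "goal_success_rate": 0.79}

failures = gate(current, baseline)
for failure in failures:
    print("FAIL", failure)
exit_code = 1 if failures else 0  # hand this to sys.exit() in a real CI step
```

Note that a drop from 0.95 to 0.85 clears the 0.7 floor and would still fail this gate, which is the behavior you want for catching regressions early.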

Where the Score Is Lying to You

A judge is an opinion from a model, not a measurement from an instrument. Hold these next to every number:

Judge bias. LLM judges can favor longer answers, their own family's style, or the first option compared. A 0.8 is the judge's opinion, not objective truth.
Reference-free is a feature and a limit. It enables online eval, but it assesses plausibility against a rubric, not correctness against verified ground truth. For factual-critical domains, pair with a real source of truth.
The "circular" objection has a core. "Senior reviewing a junior, not a circle" holds only if the judge is genuinely stronger or differently-trained, with a good rubric. Judging a model with a weaker sibling of itself is closer to circular than you'd like.
Sampling hides things. At 10% you see trends, not every incident. A rare catastrophic failure can live in the 90% you never scored.
CRIS caveat. Built-ins use Cross-Region Inference: your data stays in-Region, but prompts/results may be processed (encrypted) in neighboring Regions. Single-Region regulatory constraints → use custom evaluators pinned to your Region.
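The sampling point in the list above is easy to quantify. Assuming each session is sampled independently at the same rate (a simplifying assumption), the chance that none of k failing sessions gets scored is (1 - rate) raised to the power k. A short sketch, with a hypothetical helper name:

```python
def miss_probability(sample_rate, failing_sessions):
    """Chance that no failing session is sampled, assuming independent sampling."""
    return (1 - sample_rate) ** failing_sessions


for k in (1, 5, 20):
    p = miss_probability(0.10, k)
    print(f"{k} failing session(s) at a 10% sample: missed with p={p:.2f}")
```

A single bad session is missed nine times out of ten at a 10% sample; it takes a couple of dozen failures before the sample is likely to contain one.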

Verdict — calibrate the judge before you trust it. Periodically spot-check its scores against human judgment on your own data. If they correlate, the number means something. If not, you're automating a bias at scale.
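One simple way to run that spot-check is to score the same traces with the judge and with a human reviewer, then compute a correlation. The sketch below uses a hand-rolled Pearson correlation so it has no dependencies; the scores are made up, and what counts as "correlated enough" is a judgment call for your team rather than something this code decides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up scores for the same six traces.
judge_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6]
human_scores = [1.0, 0.8, 0.2, 0.6, 0.4, 0.5]

r = pearson(judge_scores, human_scores)
print(f"judge-vs-human correlation: {r:.2f}")
```

A handful of traces is only enough to show the mechanics; a real calibration set should be large enough, and varied enough, that the correlation is not driven by one or two outliers.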

Cost, Sampling, and Where the Approach Flips

Evaluation cost scales with what you score. AgentCore Evaluations bills per 1,000 tokens (built-ins) or per 1,000 runs (custom), and it adds up — AWS's own example reaches roughly $1,800/month for 45,000 runs. As I noted in my earlier piece on AgentCore cost, a 5% eval sample can outweigh Gateway, Policy, and long-term memory combined. Sampling is the dial: dev/staging 50–100%, production 10–20%, high volume (>100k sessions/day) 2–5%.
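For a back-of-envelope feel for how the sampling dial moves the bill, the AWS example figures quoted above imply about $0.04 per run. The sketch below scales that linearly, assuming one billed run per sampled session per evaluator. That is a simplification (built-ins are billed by tokens, so real cost depends on trace length), and the function name and the 5,000 sessions/day volume are hypothetical.

```python
# Figures quoted in the AWS example above; illustrative, not a price list.
EXAMPLE_MONTHLY_COST_USD = 1_800
EXAMPLE_MONTHLY_RUNS = 45_000
COST_PER_RUN_USD = EXAMPLE_MONTHLY_COST_USD / EXAMPLE_MONTHLY_RUNS


def monthly_eval_cost(sessions_per_day, sample_rate, evaluators=3, days=30):
    """Rough monthly cost, assuming one billed run per sampled session per evaluator."""
    runs = sessions_per_day * sample_rate * evaluators * days
    return runs * COST_PER_RUN_USD


for rate in (0.05, 0.10, 0.20):
    cost = monthly_eval_cost(sessions_per_day=5_000, sample_rate=rate)
    print(f"{rate:.0%} sample of 5,000 sessions/day: ~${cost:,.0f}/month")
```

Under these assumptions, three evaluators on a 10% sample of 5,000 sessions a day works out to 45,000 runs a month, the same volume as the AWS example.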

The downstream tracing bill is a separate, uncapped meter — sample at the Collector before you scale. Online eval flips to on-demand-only below a few hundred sessions/day, where a sparse sample isn't representative. Built-ins flip to custom when "quality" has a domain-specific definition, or when a single-Region rule forces you off CRIS.

What I'd wire in on day one: instrument once to the GenAI semconv via OpenInference/ADOT; OTel Collector as the fan-out and redaction point; three built-ins at 10% production sampling; an on-demand quality gate in CI/CD with fail-on-regression; CloudWatch alarms on score drops; and a periodic judge-vs-human calibration check.

The Takeaway

"Is this agent good enough to ship, and is it still good enough today" stops being a feeling and becomes a decision you can defend — with a trend line, a threshold, and the specific traces behind a regression. The judge gives you the number, the trace tells you why, OpenTelemetry makes it portable, and calibration keeps it honest. Get those four right and evaluation stops being a thing you do before launch and becomes how you decide — every deploy, on evidence instead of nerve.

AgentCore Evaluations was announced at re:Invent in December 2025 and reached general availability on March 31, 2026; at preview it launched in four Regions (US East N. Virginia, US West Oregon, Asia Pacific Sydney, Europe Frankfurt) — verify the current Region list, evaluator catalog, limits, and pricing against the official AWS docs, as these change often. The code is illustrative of API shape, not guaranteed to match the current SDK; confirm method and parameter names against the current bedrock-agentcore / bedrock-agentcore-control references. The OpenTelemetry GenAI semantic conventions are still in Development, so attribute and span names may shift. The 0.91 → 0.30 tool-selection figure is from AWS's re:Invent 2025 session (AIM3348), presented as a demonstration.

DEV Community