mgbec

Posted on Jun 25 • Originally published at Medium on Jun 25

I Like Criticism, It Makes You Strong

#agents #awsbedrockguardrails #amazonbedrockagentco #agentevaluation

“I like criticism, it makes you strong.” - LeBron James

In my last project (https://github.com/mgbec/AgentCoreObservabilityInterAgent), I built an AgentCore multi-agent system with some enhanced observability. The end user would ask a question and, depending on the Orchestrator’s judgement of its complexity, the question would either be answered immediately or routed to the Specialist. There was also a Fact Checker agent that could verify specific claims or statements.

Building the Critic Loop

This time around (https://github.com/mgbec/multi-agent-runtime-with-evals), I took the same project and added a Critic agent to check the quality of my workflow’s response to users’ questions. If the Critic decided the answer was not very good, it would trigger a retry:

Critic Agent (runtime, during the loop):

-Runs as part of the live request

-The Orchestrator calls it before returning a response to the user

-Acts as a quality gate: “Is this answer good enough?”

-Can trigger retries in real-time

-Adds latency and cost to every request that uses it

Evaluates based on:

Accuracy — Are the facts correct?
Completeness — Does it fully address the question?
Structure — Is it well-organized and clear?
Depth — Does it provide sufficient detail and examples?
Sources — Does it cite references when making specific claims?

Scoring criteria:

9–10: Comprehensive, well-structured, accurate, with examples and sources

7–8: Good coverage, mostly accurate, but missing some depth or examples

5–6: Addresses the question but lacks detail, structure, or accuracy

3–4: Partially relevant, significant gaps or inaccuracies

1–2: Off-topic, incorrect, or unhelpful
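As a minimal sketch, the scoring bands above can be written down as data so that tests and dashboards share one definition. The band labels and the `band_for` helper are illustrative names of my own, not something taken from the project.

```python
# Lower bound of each band from the rubric above, highest first.
# The short labels are illustrative; the rubric itself only gives descriptions.
SCORE_BANDS = [
    (9, "comprehensive"),
    (7, "good"),
    (5, "adequate"),
    (3, "partial"),
    (1, "unhelpful"),
]


def band_for(score: int) -> str:
    """Map a 1-10 Critic score to its rubric band."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    for floor, label in SCORE_BANDS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: every valid score falls in a band")
```

Keeping the bands in one list means a change to the rubric only has to be made in one place.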

The Orchestrator system prompt ends up like this:

"content": "You are an orchestrator agent that coordinates between specialized agents.
You have three tools available:

1. call_specialist_agent - For detailed analysis, explanations, and complex research tasks
2. call_factchecker_agent - For verifying claims, checking facts, and assessing truthfulness
3. call_critic_agent - For evaluating the quality of responses from other agents

Routing guidelines:
- For questions requiring detailed analysis or explanation → use call_specialist_agent
- For verifying specific claims or statements → use call_factchecker_agent
- For complex queries that involve both analysis AND fact verification → use BOTH specialist and factchecker
- For simple greetings or basic questions → handle directly yourself

Quality feedback loop (use for important questions):
- After getting a response from the specialist, use call_critic_agent to evaluate it
- Pass the critic both the original question AND the specialist's response
- If the critic scores below 7/10, call the specialist again with the critic's feedback
- Include the critic's suggestion in your retry prompt to the specialist
- Present the final (improved) response to the user

When using multiple tools, synthesize their responses into a coherent answer.
Always mention if you used the critic to improve a response."
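In the project the Orchestrator model drives this loop itself through tool calls, so there is no hand-written loop to show. Purely as an illustration of the logic the prompt describes, here is a deterministic sketch. `Verdict`, `answer_with_critic`, and the `max_retries` cap are my assumptions; the 7/10 threshold comes from the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    score: int       # 1-10, per the Critic's rubric
    suggestion: str  # what the Critic wants improved


def answer_with_critic(
    question: str,
    specialist: Callable[[str], str],
    critic: Callable[[str, str], Verdict],
    threshold: int = 7,    # the prompt retries when the score is below 7/10
    max_retries: int = 2,  # assumption: the prompt itself sets no cap
) -> Tuple[str, Verdict, int]:
    """Return (final answer, last verdict, number of specialist calls)."""
    answer = specialist(question)
    verdict = critic(question, answer)
    calls = 1
    while verdict.score < threshold and calls <= max_retries:
        # Feed the Critic's suggestion back into the retry prompt.
        retry_prompt = (
            f"{question}\n\nReviewer feedback to address: {verdict.suggestion}"
        )
        answer = specialist(retry_prompt)
        verdict = critic(question, answer)
        calls += 1
    return answer, verdict, calls
```

A hard cap on retries is worth considering even when the model runs the loop, since every Critic pass adds latency and cost.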

Since I already worked out the process in my last project, deployment to AgentCore was very easy.

Testing the Critic Loop

The test script covers five separate scenarios for the critic loop:

python test_critic_loop.py # Quick test (first 2 scenarios)

python test_critic_loop.py --all # All 5 scenarios

python test_critic_loop.py --scenario 4 # Run a specific one
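For readers who want to build a similar runner, this is one way the three invocations could be parsed with `argparse`. It is a sketch, not the code in the repository: `select_scenarios` is a hypothetical helper, and the scenario names are taken from the list that follows.

```python
import argparse
from typing import List, Sequence

# Scenario numbers mapped to the names used in this article.
SCENARIOS = {
    1: "Basic Critic Evaluation",
    2: "Feedback Loop",
    3: "Critic on Fact Checker",
    4: "Full Pipeline",
    5: "Critic Disagreement",
}


def select_scenarios(argv: Sequence[str]) -> List[int]:
    """Return the scenario numbers to run for the given CLI arguments."""
    parser = argparse.ArgumentParser(prog="test_critic_loop.py")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--all", action="store_true", help="run all scenarios")
    group.add_argument(
        "--scenario", type=int, choices=sorted(SCENARIOS), help="run one scenario"
    )
    args = parser.parse_args(argv)
    if args.all:
        return sorted(SCENARIOS)
    if args.scenario is not None:
        return [args.scenario]
    return [1, 2]  # quick test: first two scenarios
```

In a script you would call `select_scenarios(sys.argv[1:])` and loop over the result.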

Scenarios:

-Basic Critic Evaluation: Specialist → Critic scores it

-Feedback Loop: Specialist → Critic scores low → Specialist retries

-Critic on Fact Checker: Fact Checker → Critic evaluates the verdict

-Full Pipeline: Specialist + Fact Checker + Critic all in one request

-Critic Disagreement: Forces a deliberately weak answer → Critic catches it

Each scenario shows the elapsed time, which agents were used, and whether Critic evaluation content appeared in the response.
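A rough sketch of how such a per-scenario report could be produced is below. I am assuming the agents are detected by scanning the response text for keywords, which may not be how the real script does it; `run_scenario`, `KNOWN_AGENTS`, and `CRITIC_MARKERS` are hypothetical names.

```python
import time
from typing import Callable, Dict

KNOWN_AGENTS = ("specialist", "factchecker", "critic")
CRITIC_MARKERS = ("critic", "/10", "score")  # crude hints of Critic output


def run_scenario(name: str, prompt: str, invoke: Callable[[str], str]) -> Dict:
    """Time one request and summarise what the response text reveals."""
    start = time.perf_counter()
    response = invoke(prompt)
    elapsed = time.perf_counter() - start
    # Lowercase and drop spaces so "Fact Checker" matches "factchecker".
    compact = response.lower().replace(" ", "")
    return {
        "scenario": name,
        "elapsed_s": round(elapsed, 2),
        "agents_mentioned": [a for a in KNOWN_AGENTS if a in compact],
        "critic_content": any(m in compact for m in CRITIC_MARKERS),
    }
```

Keyword scanning is brittle; reading the tool calls from the trace would be more reliable where the trace is available.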

Testing the New Workflow — Orchestrator, Specialist, Fact Checker, and Critic

I adjusted the test script from my previous project (test_multi_agent.py) to add tests that include the Critic agent.

The script produces JSON or CSV output.
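If you are writing your own version, emitting both formats from the same list of result rows takes only the standard library. The field names in the usage example are hypothetical, not the script's actual schema.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(rows: List[Dict]) -> str:
    """Serialise result rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows: List[Dict]) -> str:
    """Serialise result rows as CSV, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_csv([{"test": "critic_basic", "passed": True, "elapsed_s": 4.2}])` yields a header line followed by one data line.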

AgentCore Evaluations

As I mentioned above, the Critic agent runs inside the workflow, at runtime. For a little further testing, I also added an evaluation that uses the built-in AgentCore evals: “End-to-End Goal Attainment.”

AgentCore Evaluations run after the loop, not during the workflow. They run asynchronously after the request is complete and look at the traces/spans that were already recorded. They can be used for monitoring trends, regression testing, A/B comparisons, and more.

In the script eval_goal_attainment.py, we run three AgentCore built-in evaluators:

Builtin.Helpfulness — Was the response useful?

Builtin.GoalSuccessRate — Did the agent achieve the user’s goal?

Builtin.ToolSelectionAccuracy — Were the right tools (Specialist/FactChecker/Critic) selected?

The result is an eval that broadly looks at End-to-End Goal Attainment: “Did the user get what they asked for?” The overall judgement doesn’t depend on which agents were called or how; it only asks whether the final response satisfied the user’s intent.

The End-to-End Goal Attainment eval takes very little effort to set up here, since it relies on the evaluators that are already baked into AgentCore. We could definitely set up more evals, but they would require a little more effort. That might be a future project. Some more potential evals:

Routing Quality

“Did the Orchestrator pick the right agent(s)?”

You’d give it a set of questions with expected routing and check if the Orchestrator’s tool selection matches:

“What is Kubernetes?” (should call Specialist only)

“Is it true that…” (should call Fact Checker only)

“Explain X and verify Y” (should call both)

“Hello” (should answer directly, no tools)

This catches regressions where the routing logic silently degrades after you change the system prompt or swap models.
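A routing check like this could be sketched as a comparison of expected and observed tool sets. The tool names come from the Orchestrator prompt; `evaluate_routing` and the `route` callable (which would need to extract tool calls from a trace or response) are hypothetical, and ignoring Critic calls is my assumption, since the feedback loop may add them to any request.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

# Questions from the list above, paired with the tools the Orchestrator should call.
ROUTING_CASES: List[Tuple[str, Set[str]]] = [
    ("What is Kubernetes?", {"call_specialist_agent"}),
    ("Is it true that…", {"call_factchecker_agent"}),
    ("Explain X and verify Y", {"call_specialist_agent", "call_factchecker_agent"}),
    ("Hello", set()),
]

# Assumption: Critic calls belong to the quality loop, not to routing.
IGNORED_TOOLS = {"call_critic_agent"}


def evaluate_routing(
    cases: Sequence[Tuple[str, Set[str]]],
    route: Callable[[str], Iterable[str]],
) -> Tuple[float, List[Tuple[str, Set[str], Set[str]]]]:
    """Return (accuracy, mismatches) where each mismatch is (question, expected, actual)."""
    mismatches = []
    for question, expected in cases:
        actual = set(route(question)) - IGNORED_TOOLS
        if actual != expected:
            mismatches.append((question, expected, actual))
    accuracy = 1 - len(mismatches) / len(cases)
    return accuracy, mismatches
```

Running this before and after a prompt or model change gives a simple regression signal: the accuracy should not drop, and the mismatch list shows which questions moved.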

Web Search Utilization

“Did the agent search when it should have?”

The Specialist and Fact Checker both have web search. You’d evaluate:

-Question about something recent, e.g. OpenClaw, a 2026 product (should have searched)

-Question about a well-known fact, e.g. water boils at 100°C (searching is fine but not required)

-Agent said “I don’t have information about that” (should have searched but didn’t, so it is a fail)

This catches the case where agents fall back to “I don’t know” instead of using their tools.
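The three cases above reduce to a small rule set. The sketch below assumes you can already tell, from the trace, whether a web search happened and that each test question is labelled as needing fresh information or not. `grade_search_use` and the refusal phrases are illustrative.

```python
# Phrases that suggest the agent gave up instead of searching (illustrative list).
REFUSAL_MARKERS = (
    "i don't have information",
    "i do not have information",
)


def grade_search_use(response: str, searched: bool, needs_fresh_info: bool) -> str:
    """Return "pass" or "fail" for one request under the rules listed above."""
    # Normalise curly apostrophes so both spellings of "don't" match.
    text = response.lower().replace("\u2019", "'")
    refused = any(marker in text for marker in REFUSAL_MARKERS)
    if refused and not searched:
        return "fail"  # fell back to "I don't know" without using the tool
    if needs_fresh_info and not searched:
        return "fail"  # recent topic answered from model memory alone
    return "pass"      # searching on a well-known fact is fine but not required
```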

Critic Calibration

“Is the Critic scoring consistently and accurately?”

Compare the Critic’s live scores against an independent offline judge, and also check:

Does the Critic give the same score for the same quality of response across different topics?

Does the Critic’s “suggestion” field actually identify real weaknesses?

Does a retry based on Critic feedback actually improve the score?
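Two of these checks are easy to express as numbers once the scores are collected. This is a sketch with hypothetical helper names: the first measures how far the live Critic drifts from the offline judge, and the second measures how often a retry actually raised the score.

```python
from statistics import mean
from typing import Sequence, Tuple


def mean_abs_gap(live: Sequence[float], offline: Sequence[float]) -> float:
    """Average absolute difference between live Critic and offline judge scores."""
    if not live or len(live) != len(offline):
        raise ValueError("need two non-empty score lists of equal length")
    return mean(abs(a - b) for a, b in zip(live, offline))


def retry_improvement_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (score_before, score_after) pairs where the retry scored higher."""
    if not pairs:
        raise ValueError("need at least one (before, after) pair")
    improved = sum(1 for before, after in pairs if after > before)
    return improved / len(pairs)
```

A large gap suggests the Critic and the judge disagree on the rubric, and a low improvement rate suggests the retries are costing latency without buying quality.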

Response Faithfulness

“Did the agent make things up?”

When the Specialist uses web search, does the response accurately reflect what the search results said? Or does it hallucinate details not in the search results? An evaluator would compare the web_search tool output against the final response and flag invented claims.
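A real faithfulness evaluator would most likely use an LLM judge, but a crude proxy can catch one common failure: numbers in the answer that never appeared in the search results. The helper below is only that proxy, under a name I made up, and it will miss invented claims that contain no digits.

```python
import re
from typing import List

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(search_output: str, response: str) -> List[str]:
    """Return numbers that appear in the response but not in the search output."""
    supported = set(_NUMBER.findall(search_output))
    claimed = set(_NUMBER.findall(response))
    return sorted(claimed - supported)
```

An empty list does not prove the response is faithful; a non-empty one is a cheap signal that the response deserves a closer look.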

Observability

I talked about this more in my last article — https://dev.to/aws-builders/somethings-going-onobservability-in-your-agentic-workflow-2eh8. Again, AgentCore and application logs give us some really awesome detail.

Evaluation Observability

In this project, evaluation results from eval_goal_attainment.py are saved locally to eval_results.json and printed to the console.
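Once results are on disk, a short script can average them per evaluator. The record shape used here (an `evaluator` name and a numeric `score` per entry) is an assumption for illustration; check the actual structure of eval_results.json before relying on it.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def summarise(records: Iterable[Mapping]) -> Dict[str, float]:
    """Average score per evaluator, assuming {"evaluator": str, "score": number} records."""
    by_evaluator = defaultdict(list)
    for record in records:
        by_evaluator[record["evaluator"]].append(float(record["score"]))
    return {
        name: round(mean(scores), 3)
        for name, scores in sorted(by_evaluator.items())
    }
```

With a file in that shape, `summarise(json.load(open("eval_results.json")))` would give one number per evaluator to track between runs.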

In the AgentCore console (if using online evaluations), look under AgentCore > Evaluations > your project name. This should show scores over time and per-evaluator trends.

In CloudWatch (for online/batch evaluations configured via the console or SDK), evaluation results are written to CloudWatch Logs under the delivery destination you configure. If they are part of an online evaluation, they are also visible in the GenAI Observability dashboard.

Security Observability

This is a big topic in multi-agent workflows and something I need to work on more. Right now I have a simple guardrail that tries to catch:

-Harmful content in user input

-Harmful content in LLM output

-Prompt injection attempts in user input

-Sensitive data (AWS keys, SSNs, credit cards) in either direction

-Web search results containing harmful content

I know this is definitely not sufficient, but I think that is a project for another day. Inter-agent payloads in this project bypass the guardrail because they are boto3 API calls, not LLM calls. There are also many other guardrails and security measures that we could apply here.
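One possible way to close the inter-agent gap is to screen each payload with the Bedrock `ApplyGuardrail` API before passing it on. The project does not do this today; the sketch below is an idea, with `screen_payload` as a hypothetical helper. The client is passed in so the function can be exercised without AWS credentials.

```python
from typing import Any


def screen_payload(
    client: Any,              # e.g. boto3.client("bedrock-runtime")
    guardrail_id: str,
    guardrail_version: str,
    payload: str,
    source: str = "INPUT",    # "INPUT" or "OUTPUT"
) -> bool:
    """Return True if the guardrail let the payload through, False if it intervened."""
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,
        content=[{"text": {"text": payload}}],
    )
    return response["action"] != "GUARDRAIL_INTERVENED"
```

The Orchestrator could call this on a Specialist or Fact Checker response before using it. The trade-off is one more API call, with its own latency and cost, per hop.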

If we want to look at our current guardrail’s activity, we can look in CloudWatch:

CloudWatch Metrics:

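Metrics available: Bedrock Guardrails publishes metrics such as Invocations and InvocationsIntervened in the AWS/Bedrock/Guardrails namespace, and the GuardrailContentSource dimension separates input from output interventions. Verify the exact names in your CloudWatch console.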

CloudWatch Logs (if you enable Guardrail logging):

This is not enabled by default, but when it is, blocked requests go to a log group you specify, showing what was blocked and why.

In-agent visibility:

When the guardrail blocks something, the agent receives an error or filtered response from the ConverseStream call. The agent’s response will contain the blocked_input_messaging or blocked_outputs_messaging text you configured.

This shows up in the agent’s CloudWatch log group as part of the response, but not in your span since it is not part of the OTEL trace.

Wrap Up

Critical analysis and evaluation are absolutely necessary with the non-deterministic and autonomous nature of agentic workflows. We need to look at the full trajectory to make sure we are complying with all of our intended constraints and functionality.

Thorough evals and observability can help us uncover hidden failures and make sure we can trust our agents’ decisions, reasoning, and final results. The data we collect will be even more valuable as our workflows have more and more agents, tools, and protocols interacting with one another.

For more on agentic security and governance, OWASP has a number of resources available here: https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/. Anthropic has an article that makes agentic evals easier to understand: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents.

Thanks for reading!
