
Make 'em behave! Don't let your AI agents hallucinate

I built a multi-agent project that lets users ask questions about their AWS infrastructure (3 AWS accounts managed by AWS Organizations) and get answers in a human-readable way.

The system connects to the users' AWS infrastructure and provides the answer by reading various log types and making API calls to multiple AWS resources.

Project repo
Part 1: I built a multi-agent project on AWS, with Strands AI and AgentCore
Part 2: Give 'em something to read! Building a data pipeline for your agentic AI project
Part 3: Make 'em safe! Security for your agentic AI project
Part 4: Make 'em remember! Memory in the agentic AI project
Part 5: Make 'em visible! See what is happening inside your agentic workflow
Part 6: When shebangs party hard with your MAC path on OpenTelemetry
Part 7: Make 'em behave! Don't let your AI agents hallucinate

 

No matter what, they will try!

This article is about hallucinations, or to be more precise: how I tried to make hallucinations harder to happen, easier to detect, and less dangerous when they happen anyway.

Because let's face the truth:

  1. You cannot just tell an AI agent "Do not hallucinate" and expect it won't.

  2. An LLM's only purpose is to generate text. If there is nothing to generate, or not enough data to generate from, guess what it does.


The problem

At the beginning I thought the main challenge would be something like: can the agent answer questions about my AWS accounts?
It turned out my main challenge actually was: Can I trust the answer?

If a user asks...

./alexandra.sh --new "Give me last CloudTrail row from today"

...and if the agent invents one row, drops one important finding, accesses the wrong account, or queries the wrong date, the final answer still looks nice and professional but it's worth nothing.


Multi-agent makes it worse

With the multi-agent pattern known as agents as tools, this can get even worse.

SCENARIO 1:

  1. The supervisor agent receives the question "Give me last CloudTrail row from today".

  2. The supervisor agent correctly understands it should invoke the CloudTrail subagent, so it does.

  3. Despite its instructions, the CloudTrail subagent incorrectly creates an SQL query with yesterday's date. This is not the truth, this is pure hallucination.

  4. The SQL query is syntactically valid, so Athena retrieves the rows from the data lake (for the wrong date) and sends the data back to the CloudTrail subagent.

  5. The response is sent back to the supervisor agent, which doesn't care if it is right. It got its rows, so it summarizes.
    The hallucination of one agent became the hard truth of the other (read here).

  6. The response seems legit, so the user has no doubt.

    hallucination 1

SCENARIO 2:

  1. The supervisor agent receives the question "Give me last CloudTrail row from today".

  2. The supervisor agent correctly understands it should invoke the CloudTrail subagent, so it does.

  3. The CloudTrail subagent correctly creates an SQL query with today's date.

  4. The SQL query is syntactically valid, so Athena retrieves the rows from the data lake and sends the data back to the CloudTrail subagent.

  5. The CloudTrail subagent, despite its instructions not to summarize, actually summarizes the output and sends it to the supervisor agent.

  6. The summarized response is received by the supervisor agent, which doesn't care if it is right. It got its data, so it summarizes. It is actually summarizing a summary.
    When two agents are summarizing, the danger of hallucination doubles. Even if the sub-agent's summary is correct, it should not have summarized at all - that is the supervisor's job.
    And if the sub-agent fabricated just a single fact, the supervisor's summary becomes invalid. Same pattern as before about hallucination and ground truth.

  7. The response seems legit, so the user has no doubt.

    hallucination 2


Hallucination patterns

During testing I observed nine hallucination patterns and sorted them into categories (H1 - H9) for easier mitigation:

  • H1: Supervisor says "no results" even though a tool returned data.
  • H2: Supervisor agent drops rows from the tool result.
  • H3: Supervisor agent fabricates rows or fields that were not returned.
  • H4: Supervisor agent picks the wrong subagent.
  • H5: Supervisor agent passes the wrong account or time range.
  • H6: Subagent creates an incorrect or overly broad SQL query.
  • H7: Subagent returns a summary instead of raw evidence.
  • H8: Supervisor asks a follow-up question instead of answering with the data it already has.
  • H9: The supervisor agent's summary does not line up with the user's question.

Layers of mitigation

There are several layers I use to deal with these hallucination patterns, from prompts to hooks.
 

It all starts with the prompt

A bulletproof prompt is an absolute must.
Every agent in the project uses a structured prompt (RISEN - Role, Instructions, Steps, Expectation, Narrowing).

For example, the CloudTrail subagent's prompt does not say:

You are a helpful assistant, answer questions about AWS.

Instead, it says exactly what that particular agent is:

You are a CloudTrail log analyst.
You translate natural language questions about AWS API activity into Athena SQL.
Use lttm_logs.cloudtrail_logs.
Always include partition keys.
Return raw result rows.
Do not summarize or paraphrase the data.
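
For completeness, this system prompt is simply passed when the agent is constructed. A minimal sketch using the Strands SDK; the tool body is elided and run_athena_query stands in for my real Athena tool:

from strands import Agent, tool

@tool
def run_athena_query(sql: str) -> str:
    """Run the SQL against Athena and return the raw rows as JSON."""
    ...  # the real tool wraps boto3 Athena calls

cloudtrail_agent = Agent(
    system_prompt=(
        "You are a CloudTrail log analyst.\n"
        "You translate natural language questions about AWS API activity into Athena SQL.\n"
        "Use lttm_logs.cloudtrail_logs.\n"
        "Always include partition keys.\n"
        "Return raw result rows.\n"
        "Do not summarize or paraphrase the data."
    ),
    tools=[run_athena_query],
)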

A narrow prompt reduces the chance that the agent starts doing creative writing instead of serious log analysis.

However, prompt instructions are not enforced: the model may still ignore them, misunderstand them, or do something almost right but still wrong.

The prompt is just the first layer, but not the only one.


Layer 2: One summarizer only

This was already mentioned before - I want my subagents not to summarize at all.
But this is a problem - generating text is what an LLM was created for, so no matter how many times I tell it in the prompt not to summarize, it will.

So I let it summarize and gratefully ignore it.

hallucination 2

Whatever the subagent creates, the raw tool result (the Athena response) is the only part of the data I want the supervisor to receive, so that is exactly what gets extracted.

  • sub-agent returns result (sub-agent summary and raw rows)

  • raw rows are extracted as raw_json

# Inside the supervisor's query_cloudtrail tool (json is imported at module level).
# Ask the subagent; its reply contains a summary we are going to ignore.
result = cloudtrail_agent(question)

# Pull the raw Athena rows out of the subagent's tool result.
raw_json = _extract_raw_result(cloudtrail_agent)

if not raw_json:
    # Nothing to extract: fall back to the subagent's textual answer.
    return str(result)

rows = json.loads(raw_json)
if isinstance(rows, list):
    return format_athena_rows(rows)

return str(result)
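
The _extract_raw_result helper is nothing fancy. Roughly, it walks the subagent's message history backwards and pulls out the last raw tool result. A simplified sketch, under the assumption that Strands keeps Converse-style messages on agent.messages; the real helper handles more edge cases:

def _extract_raw_result(agent) -> str | None:
    """Return the text of the last tool result in the agent's message history."""
    for message in reversed(agent.messages):
        for block in message.get("content", []):
            tool_result = block.get("toolResult")
            if not tool_result:
                continue
            for part in tool_result.get("content", []):
                if "text" in part:
                    return part["text"]
    return None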

The raw rows look something like this:

[
{"eventtime": "2026-04-25T10:30:00Z", "eventname": "CreateBucket", "eventsource": "s3.amazonaws.com", "useridentity": "arn:aws:iam::123:user/admin"},
{"eventtime": "2026-04-25T09:15:00Z", "eventname": "TerminateInstances", "eventsource": "ec2.amazonaws.com", "useridentity": "arn:aws:iam::123:role/deploy"}
]

The rows are then deterministically formatted by another function, so the supervisor receives data in the shape it expects:

Results: 2 rows returned.

Row 1:
  eventtime: 2026-04-25T10:30:00Z
  eventname: CreateBucket
  eventsource: s3.amazonaws.com
  useridentity: arn:aws:iam::<account-id>:user/admin

Row 2:
  eventtime: 2026-04-25T09:15:00Z
  eventname: TerminateInstances
  eventsource: ec2.amazonaws.com
  useridentity: arn:aws:iam::<account-id>:role/deploy

This is the data the supervisor agent works with and summarizes. It receives deterministically formatted data, and the subagent's summary is no longer a source of truth.
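
The formatter itself is deliberately boring. A simplified sketch of what format_athena_rows does:

def format_athena_rows(rows: list[dict]) -> str:
    """Render raw Athena rows in the fixed layout the supervisor expects."""
    lines = [f"Results: {len(rows)} rows returned.", ""]
    for index, row in enumerate(rows, start=1):
        lines.append(f"Row {index}:")
        for key, value in row.items():
            lines.append(f"  {key}: {value}")
        lines.append("")
    return "\n".join(lines).rstrip()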


Layer 3: The hooks

Deterministic validations are an essential part of my anti-hallucination layers.
Here I am using three hooks:

  • SQLValidatorHook - is the SQL query correct?
  • SQLRewriteHook - might the SQL result be too big?
  • OutputIntegrityHook - did the supervisor agent actually answer from the data it received?

Those hooks run on different Strands events.
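
Attaching them is just a matter of passing the hook providers when each agent is constructed. A sketch continuing the one from the prompt section; I'm assuming hooks are registered via the hooks argument of the Strands Agent, and CLOUDTRAIL_PROMPT / SUPERVISOR_PROMPT are placeholders for my real prompts:

# Subagents that generate SQL get the SQL hooks...
cloudtrail_agent = Agent(
    system_prompt=CLOUDTRAIL_PROMPT,
    tools=[run_athena_query],
    hooks=[SQLValidatorHook(), SQLRewriteHook()],
)

# ...the supervisor gets the output integrity check.
supervisor_agent = Agent(
    system_prompt=SUPERVISOR_PROMPT,
    tools=[query_cloudtrail, query_cloudwatch, query_config],
    hooks=[OutputIntegrityHook()],
)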
 

SQLValidatorHook

Because the subagent generates SQL, there is always a chance the SQL goes bad.
This hook runs on every subagent that creates SQL queries and is invoked before the query is sent to Athena...

class SQLValidatorHook(HookProvider):
    def register_hooks(self, registry: HookRegistry, **kwargs: Any) -> None:
        registry.add_callback(BeforeToolCallEvent, self.on_before_tool_call)

    def on_before_tool_call(self, event: BeforeToolCallEvent) -> None:
        if event.tool_use.get("name") != "run_athena_query":
            return

        sql = event.tool_use.get("input", {}).get("sql", "")
        if not sql:
            return

        errors = validate_sql(sql)
        if errors:
            msg = f"SQL validation failed: {'; '.join(errors)}. Fix and retry."
            event.cancel_tool = msg

...and calls the validate_sql function, which checks for patterns like:

  • awsdatacatalog. prefix in SQL
  • Blocked keywords: DROP, DELETE, UPDATE, INSERT, ALTER, TRUNCATE
  • wrong table
  • wrong partition keys (must match the glue table)
  • SELECT * is used
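
A trimmed-down sketch of what validate_sql checks (the real function has more rules and better parsing):

import re

REQUIRED_TABLE = "lttm_logs.cloudtrail_logs"
REQUIRED_PARTITIONS = ("account_id", "region", "year", "month", "day")
BLOCKED_KEYWORDS = ("DROP", "DELETE", "UPDATE", "INSERT", "ALTER", "TRUNCATE")

def validate_sql(sql: str) -> list[str]:
    """Return a list of problems; an empty list means the query may run."""
    errors = []
    lowered = sql.lower()

    if "awsdatacatalog." in lowered:
        errors.append("Do not prefix tables with 'awsdatacatalog.'")
    for keyword in BLOCKED_KEYWORDS:
        if re.search(rf"\b{keyword}\b", sql, re.IGNORECASE):
            errors.append(f"Blocked keyword: {keyword}")
    if REQUIRED_TABLE not in lowered:
        errors.append(f"Use fully qualified table name: '{REQUIRED_TABLE}'")
    missing = [key for key in REQUIRED_PARTITIONS if key not in lowered]
    if missing:
        errors.append(f"Missing required partition keys in WHERE: {', '.join(missing)}")
    if re.search(r"select\s+\*", lowered):
        errors.append("Use explicit column names instead of SELECT *")

    return errors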

This hook is a mix of anti-hallucination and security measures and is also described here.

Example problem:
Sub-agent creates SQL like this:

SELECT *
FROM cloudtrail_logs
WHERE eventname = 'CreateBucket'

That looks innocent, but it's actually wrong. It should use the real Glue table name, explicit columns, and the required partition keys.

The hook rejects it and sends feedback back into the agent loop, so the model can retry and fix it:

SQL validation failed: Use fully qualified table name: 'lttm_logs.cloudtrail_logs'; Missing required partition keys in WHERE: account_id, region, year, month, day; Use explicit column names instead of SELECT *.
Fix and retry.

 

SQLRewriteHook

This hook also runs on every subagent that creates SQL queries and caps the number of rows if the user asked for too many.

Why is this a problem?
If a user asks:

./alexandra.sh --new "show me last 1000 CloudTrail events"

The agent actually gets too much data back and the model may:

  • truncate the answer
  • summarize too aggressively
  • drop rows
  • retry again and again
  • confidently produce a partial answer
  • or the context window simply hits its token limit

None of that is good, which is why SQLRewriteHook adds LIMIT 20 to the SQL query.

current_limit = self._get_current_limit(sql)
target_limit = self._default_limit

if current_limit is None:
    sql = self._set_limit(sql, target_limit)
    emit_status(f"Added LIMIT {target_limit} to prevent oversized results")

elif current_limit > target_limit:
    sql = self._set_limit(sql, target_limit)
    emit_status(
        f"Requested {current_limit} lines, but due to context limitations stripping to {target_limit}"
    )
    self._limit_was_capped = True

if sql != original_sql:
    event.tool_use["input"]["sql"] = sql
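
The two helpers behind it are plain string surgery. Standalone sketches of what _get_current_limit and _set_limit do (the real methods live on the hook and are more defensive):

import re

_LIMIT_RE = re.compile(r"\blimit\s+(\d+)\s*$", re.IGNORECASE)

def get_current_limit(sql: str) -> int | None:
    """Return the trailing LIMIT value, or None when the query has no LIMIT."""
    match = _LIMIT_RE.search(sql.strip().rstrip(";").strip())
    return int(match.group(1)) if match else None

def set_limit(sql: str, limit: int) -> str:
    """Cap an existing trailing LIMIT or append a new one."""
    stripped = sql.strip().rstrip(";").strip()
    if _LIMIT_RE.search(stripped):
        return _LIMIT_RE.sub(f"LIMIT {limit}", stripped)
    return f"{stripped} LIMIT {limit}"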

The user sees this behavior in the streaming output:

⏳ CloudTrail agent processing...
⏳ Added LIMIT 20 to prevent oversized results
⏳ Athena query executing (QueryExecutionId: 43a72cbd-39a7-4c5f-8dba-8be31aa2e45c)

But models are smart! During testing I realized that if I cap it like that, the model retries the query for 100 rows (or whatever the initial request was) instead of accepting the 20.
That actually makes sense: the model sees it was asked for 100 but created an SQL query for 20, so it tries to correct itself.

Therefore the hook also blocks the retry from happening and actually explains who is the boss here.

if self._limit_was_capped and self._last_query_returned_rows:
    event.cancel_tool = (
        "Your previous query already returned data with the maximum allowed rows. "
        "Do NOT retry for more rows. Return the results you already have to the user."
    )


king

The same hook is invoked one more time, when the results from Athena are returned, to check that Athena did not return an empty response.
 

OutputIntegrityHook

From time to time even the supervisor agent joined the dope party and started to hallucinate in its own way: actually receiving the data but outputting No results found instead and going for a retry. Well, at least it tried, until I played the better cards.

OutputIntegrityHook runs on the supervisor agent and checks which sub-agent (which query_* tool) returned data,

QUERY_TOOLS = {
    "query_cloudtrail", "query_cloudwatch", "query_config", "..."
}

remembers that, and after the response is generated it checks for "contradiction" and "follow-up question" patterns.

CONTRADICTION_PATTERNS = [
    "no results found", "no results were found", "didn't return any", "..."
]

FOLLOWUP_PATTERNS = [
    "would you like me to", "shall i", "should i check", "..."
]

This catches two stupid but dangerous behaviors:

  • Tool returned data, but model says no data.
  • Tool returned data, but model asks whether it should check something.
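
Boiled down, the check is just this kind of comparison. A simplified sketch of the idea, reusing the pattern lists above; the real hook wires this into the Strands event and tracks which query_* tools actually returned rows:

def check_final_response(response_text: str, tools_returned_data: bool) -> str | None:
    """Return corrective feedback for the supervisor, or None if the answer is fine."""
    if not tools_returned_data:
        return None

    lowered = response_text.lower()
    if any(pattern in lowered for pattern in CONTRADICTION_PATTERNS):
        return (
            "A tool already returned data. Do not claim there are no results - "
            "answer using the rows you received."
        )
    if any(pattern in lowered for pattern in FOLLOWUP_PATTERNS):
        return (
            "You already have the data needed to answer. "
            "Answer directly instead of asking a follow-up question."
        )
    return None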

Nice try buddy. Now do your job!

agentcore deploy


LLM-as-judge

Some problems are easy to catch with deterministic or regex-ish checks like the ones above, but others need a more sophisticated touch.
Especially when the problem needs some kind of judgement to be solved.

Example:

./alexandra.sh --new "Give me last CloudTrail row from today"

If the supervisor agent invokes the GuardDuty agent, this is wrong.

Therefore I added the SupervisorSteeringHandler plugin, an LLM-as-judge layer.

This is both the first and the last check running on the supervisor agent, because it runs on two different Strands events:

On BeforeToolCallEvent - the routing check

  • The plugin checks whether the supervisor agent called the right sub-agent, using the right AWS account and the right time range.

On AfterModelResponse - the response validation

  • It checks if the final response faithfully represents the tool result.

Neither of these is a deterministic check; it actually calls another LLM, in my case Claude Haiku 4.5.
 

The routing check

Before the supervisor agent calls a subagent as its tool, the judge receives:

  • The user's original question
  • Which subagent is about to be called
  • The prompt that is about to be passed to the tool

The judge validates it and returns either VALID or GUIDE with some guidance on what to do, such as:

GUIDE: use the cloudtrail instead, because the user asked about cloudtrail rows
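
Under the hood this is just one more model call. A rough sketch of how the routing judge can be invoked via the Bedrock Converse API; the prompt wording is illustrative and the model ID is left as a placeholder, this is not my exact plugin code:

import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_routing(question: str, tool_name: str, tool_prompt: str) -> str:
    """Ask the judge model whether the supervisor picked the right subagent."""
    judge_prompt = (
        "You validate the routing decisions of a supervisor agent.\n"
        f"User question: {question}\n"
        f"Subagent about to be called: {tool_name}\n"
        f"Prompt passed to the subagent: {tool_prompt}\n"
        "Reply with VALID if the subagent, AWS account and time range are correct, "
        "otherwise reply with GUIDE: <what to do instead>."
    )
    response = bedrock.converse(
        modelId="...",  # placeholder for the Claude Haiku model id
        messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]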

The plugin then returns corrective feedback to the supervisor, which knows what to do with it - either proceed with the call to the subagent or correct course:

if verdict.upper().startswith("GUIDE"):
    reason = verdict.split(":", 1)[1].strip()
    return Guide(reason=reason)
else:
    return Proceed(reason=f"Routing validated for {tool_name}")

 

The response validation

The second time the judge runs is after the supervisor generates the final response. It compares the subagent result with the supervisor agent's response.

It checks whether the supervisor is:

  • Skipping rows or summarizing too much - the subagent returned 17 rows, the supervisor showed 9.
  • Fabricating results - the supervisor mentions parameters that are not present in any subagent result.
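
The prompt for this second check follows the same shape as the routing check (again, illustrative wording rather than my exact prompt):

def build_faithfulness_prompt(tool_output: str, final_response: str) -> str:
    """Prompt asking the judge to compare the supervisor's answer with the tool output."""
    return (
        "You compare an agent's final answer against the raw tool output it received.\n"
        f"Tool output:\n{tool_output}\n\n"
        f"Final answer:\n{final_response}\n\n"
        "Reply VALID if the answer faithfully represents the tool output "
        "(no dropped rows, no invented fields or parameters). "
        "Otherwise reply GUIDE: <what is wrong and how to fix it>."
    )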

Yes, that's AI checking AI

agentcore deploy


Conclusion

Here are some things I learned while building and testing this project:

  1. Do not rely only on the prompt - just because the LLM has one doesn't mean it will follow it 100% of the time.

  2. Use deterministic hooks where possible - even if the code looks big and ugly with huge lists of values, code is code, and once it's written, it's followed.

  3. If the check needs judgement, use it - LLM-as-judge is your friend.

    agentcore deploy


What's next

This article covered the anti-hallucination patterns of this project.

The rest of the articles in this series are listed at the top of this post.


Additional reading

Multi-Agent AI Production Requirements Beyond the Demo

Writing System Prompts That Actually Work: The RISEN Framework for AI Agents

Agents as Tools with Strands Agents SDK

The Agent Buddy System: When Prompt Engineering Isn't Enough

5 Techniques to Stop AI Agent Hallucinations in Production

AI Agent Guardrails: Rules That LLMs Cannot Bypass

Runtime Guardrails for AI Agents — Steer, Don't Block

How Steering Hooks Achieved 100% Agent Accuracy Where Prompts and Workflows Failed
