Building an AWS GuardDuty Alert Triage Agent

Overview

I've been interested in exploring the application of AI Agents to security automation use cases for a while, so I built one.

AWS GuardDuty is a threat detection service in AWS that produces alerts (called findings) for hundreds of behaviors in AWS that could be considered malicious. From a detection perspective, Cloud APIs are essentially a giant LOLBin-as-a-Service: pretty much everything has a legitimate use case, and most alerts require additional contextualization and investigation to be worth acting on.

What makes an AI Agent?

AI Agents have been getting a lot of hype lately, but what actually are they? In short, most of the foundation models support capabilities beyond just sending text and getting a text response back, which helps with building applications around them, such as:

Structured outputs - this lets you ask the LLM to return its response matching a provided JSON Schema, which is very useful for programmatically consuming the outputs.

Tool usage - you can provide the LLM with a set of "tools", which allow it to request that external code or functionality be executed. This is typically implemented in the form of functions in your agent's code that are called upon request, and the response is returned to the LLM.

Model Context Protocol (MCP) Integrations - arguably this counts as tool-calling, but MCP provides a standardized way for services/external systems to advertise capabilities to LLM agents, instead of having to write tool functions in the agent's code itself.
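
To make the tool-usage idea concrete, this is roughly the loop that these frameworks implement for you, sketched against the Anthropic Messages API. The tool definition and the run_my_tool dispatcher are illustrative stand-ins, not code from the actual agent:

import anthropic

client = anthropic.Anthropic()

# A tool is just a JSON Schema description the model can "call".
tools = [{
    "name": "get_guardduty_alert",        # hypothetical tool for illustration
    "description": "Fetch a GuardDuty finding by its ID.",
    "input_schema": {
        "type": "object",
        "properties": {"finding_id": {"type": "string"}},
        "required": ["finding_id"],
    },
}]

messages = [{"role": "user", "content": "Triage GuardDuty finding abc123"}]

while True:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model produced its final answer

    # Run each requested tool and feed the results back to the model
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = run_my_tool(block.name, block.input)  # your own dispatcher
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })
    messages.append({"role": "user", "content": tool_results})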

Building The Agent!

Tech I used:

  • PydanticAI: Library for building the agent, though there are multiple frameworks out there.
  • Pydantic Logfire: Observability offering for monitoring your agent's interactions, with a generous free tier; powered by OpenTelemetry (OTel)
  • Discord: Somewhere nice to send the output to
  • AWS GuardDuty: Nice but noisy alert generation service 😄
  • Foundation model APIs (GPT, Claude, etc)
  • Stratus Red Team: Detonate different TTPs to trigger GuardDuty findings

Getting started with PydanticAI is pretty straightforward. I wanted to define my structured outputs first, so I created an AlertAssessment class representing the format I want the agent to return its output in:

from typing import List, Optional

from pydantic import BaseModel, Field

class AlertAssessment(BaseModel):
    GuardDutyAlertTitle: str
    GuardDutyAlertDescription: str
    Conclusion: AlertConclusion
    Description: str
    AlertTimeline: List[AlertTimelineEvent] = Field(
        description="A chronological list of events that occurred during the alert. This should include resource creation, activities taken by the calling identity, etc"
    )
    ActionLog: List[InvestigationAction] = Field(
        description="A chronological log of investigation actions taken to assess this alert. Include guardduty searches, cloudtrail searches, etc"
    )
    RelatedAlerts: List[RelatedAlert] = Field(
        description="Other GuardDuty alerts that may be related to this alert based on timing, resources, or actors involved"
    )
    Evidence: Optional[List[EvidenceItem]] = Field(
        default=None,
        description="A list of evidence items that support the conclusion of the alert. This should include guardduty searches, cloudtrail events IDs, resource arns, etc"
    )
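
The supporting models referenced above (AlertTimelineEvent, InvestigationAction, RelatedAlert, EvidenceItem) aren't shown here; a minimal sketch of what they could look like, with hypothetical field names:

from datetime import datetime

# Hypothetical sketches of the supporting models; the real field names may differ.
class AlertTimelineEvent(BaseModel):
    Timestamp: datetime
    Description: str

class InvestigationAction(BaseModel):
    Action: str          # e.g. "Searched CloudTrail for events by role X"
    Result: str

class RelatedAlert(BaseModel):
    FindingId: str
    Title: str
    Reason: str          # why it is believed to be related

class EvidenceItem(BaseModel):
    Source: str          # e.g. "CloudTrail", "GuardDuty"
    Reference: str       # event ID, resource ARN, etc.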

The field descriptions are actually included in the JSON Schema provided to the agent, and are useful for giving context about what actually belongs in those fields. I also defined an AlertConclusion enum with a set of allowed values, since I want the agent to classify the alert as one of the following:

from enum import StrEnum

class AlertConclusion(StrEnum):
    MALICIOUS = "malicious"
    RED_TEAM_ACTIVITY = "red_team_activity"
    NON_MALICIOUS = "non_malicious"
    GENERATED_FINDING = "generated_finding"
    INCONCLUSIVE = "inconclusive"

Instantiating an agent is relatively easy and consistent across most of these LLM agent frameworks. Note that we reference our defined model for AlertAssessment in the output_type field:

from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModel

model: AnthropicModel = AnthropicModel("claude-3-5-sonnet-20241022")

agent = Agent(
    model=model,
    instrument=True,
    output_type=[AlertAssessment, UserInquiry],
    allow_arbitrary_tools=True,
    system_prompt=(
        'You are a cloud security expert triaging AWS GuardDuty alerts. '
        'Retrieve and assess the specified alert using provided tools. '
        'Skip generated findings (fields starting with "Generated"). '
     ...
     ...
     ...
    ))
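
Kicking off a run then looks something like this, a sketch assuming a recent PydanticAI version where run results expose .output:

# Prompt text is illustrative; in practice it comes from the Discord command.
result = agent.run_sync(
    "Triage GuardDuty finding 362af2710dce4ce294c09e2034092ae4"
)

assessment = result.output   # an AlertAssessment or UserInquiry instance
if isinstance(assessment, AlertAssessment):
    print(assessment.Conclusion, assessment.Description)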

I want to give the LLM some tools to help with triaging the alerts. The agent should be able to retrieve the GuardDuty alert in full, make queries to CloudTrail, obtain metadata about resources, and search GuardDuty for related findings. PydanticAI allows you to define Python functions as tools by just adding a decorator on top of the functions you want to expose:

@agent.tool_plain
def get_guardduty_alert(finding_id: str) -> dict:
  ...

@agent.tool_plain
def get_cloudtrail_events_for_resource_name(resource_name: str, startTime: datetime = None, endTime: datetime = None) -> List[dict]: 
  ...

@agent.tool_plain
def get_cloudtrail_events_for_identity(userName: str, startTime: datetime = None, endTime: datetime = None) -> List[dict]:
  ...

@agent.tool_plain
def search_guardduty_findings(criteria_field: str = None, criteria_values: List[str] = None, since: datetime = None) -> List[dict]:
  ...

Not shown are the function docstrings, which, like the field descriptions, are also included in the tool definitions sent to the LLM and help it better decide when and how to use each tool.
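
As an illustration of what one of these looks like fleshed out, here is a sketch of get_guardduty_alert using boto3; the detector lookup and error handling in the real code may differ:

import boto3

guardduty = boto3.client("guardduty")

@agent.tool_plain
def get_guardduty_alert(finding_id: str) -> dict:
    """Retrieve the full GuardDuty finding for the given finding ID."""
    # Assumes a single detector in the account/region
    detector_id = guardduty.list_detectors()["DetectorIds"][0]
    response = guardduty.get_findings(
        DetectorId=detector_id,
        FindingIds=[finding_id],
    )
    return response["Findings"][0]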

While developing, I iterated locally by calling my bot from Discord to triage selected GuardDuty finding types and leaned on Pydantic Logfire to dive into problematic runs. This was particularly useful because I could see the tool calls, arguments, results, and token usage without having to manually instrument the code.
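
For reference, the Logfire wiring itself is minimal; a sketch assuming the logfire package is installed and authenticated against a project:

import logfire

# Sends traces (model calls, tool calls, token usage) to Pydantic Logfire.
# The agent was created with instrument=True above, so its runs get traced.
logfire.configure()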

Summarizing what I have built:

A user in a Discord channel can invoke the bot using !triage with a command describing which GuardDuty finding to triage; both "Triage the latest GuardDuty alert" and "Triage GuardDuty finding 2139f904fk902kf302" work.

The Agent will retrieve the specific GuardDuty finding using its get_guardduty_alert and search_guardduty_findings tools.

The Agent will use its tools to search CloudTrail, search GuardDuty for related findings, and retrieve metadata about resources (IAM Roles, Users, EC2 Instances) to gain additional context around the alert. A quick note: I used the CloudTrail LookupEvents API here instead of a SIEM to keep it simple.

Finally, the agent will reply with a halfway decent looking structured response in a Discord Embed and give the user some clickable buttons to escalate or discard the alert.
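
The escalate/discard buttons are ordinary Discord UI components; a rough sketch assuming discord.py 2.x (class and label names are mine, not necessarily what the bot uses):

import discord

class TriageActions(discord.ui.View):
    """Buttons attached to the triage embed so a human makes the final call."""

    @discord.ui.button(label="Escalate", style=discord.ButtonStyle.danger)
    async def escalate(self, interaction: discord.Interaction, button: discord.ui.Button):
        await interaction.response.send_message("Alert escalated.")

    @discord.ui.button(label="Discard", style=discord.ButtonStyle.secondary)
    async def discard(self, interaction: discord.Interaction, button: discord.ui.Button):
        await interaction.response.send_message("Alert discarded.")

# Later, when replying in the channel:
# await channel.send(embed=assessment_embed, view=TriageActions())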

Putting it to the test

First, I grabbed one of the IAM Anomalous Behavior GuardDuty findings, generated by myself during normal activity (pretty sure I triggered this browsing around one of my AWS accounts...)

Input: !triage the latest anomalous access finding involving the IAM User dakota-macbook-aquia

Result:

Okay, so not terrible. The assessment of non_malicious was correct, and there really wasn't much else to indicate anything about this session being malicious. We also told the agent in the system prompt to bias towards a non_malicious assessment, unless it can prove otherwise.

Next, I used the aws guardduty create-sample-findings CLI command to generate a fake finding for an EC2 Instance communicating with a Tor entry node:

Input: !triage 362af2710dce4ce294c09e2034092ae4 (The direct finding ID to make it easy)

Result:

The agent here did follow its instructions not to attempt further triage via tool calls on generated findings (thereby saving tokens and money). Also, it was right about the particular IP range appearing all over the AWS documentation, neat!

Finally, I used Stratus Red Team to detonate the Steal EC2 Credentials technique, which triggers UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration.OutsideAWS. I figured this one would be fun because it could technically be classified as either Malicious or Red Team Activity. I asked our GuardDuty agent friend to triage it:

Input: !triage aacbb621e81baa25d482ac989736d09f

Output:

There is a lot to take in here. The agent classified it as red_team_activity, which is correct in this scenario. It explicitly calls out the Terraform user agent (Stratus uses Terraform under the hood for infrastructure provisioning), the naming of the Instance Profile stratus-red-team-ec2-steal-credentials-role, and the StratusRedTeam: true tag applied to the instance.

The most impressive observation was that the IP address that created the EC2 instance was the same one that used the stolen credentials.

One thing I did notice: I got a consistent classification for GuardDuty alerts on subsequent runs, but the reasoning and supporting information would vary from session to session.

Design thoughts and iterations

Generic Tool Functions vs Specific

At first, I tried to use very generic tool definitions. For example, search_cloudtrail was a tool I implemented that wrapped the AWS CloudTrail LookupEvents API 1:1. The agent really struggled with using the LookupEvents API, passing incorrect fields as arguments. I instead abstracted this away into more specific query functions, get_cloudtrail_events_for_resource_name and get_cloudtrail_events_for_identity, and it had much more success.
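
For illustration, the identity-scoped wrapper ends up being a thin, opinionated layer over LookupEvents; a sketch using boto3 (the real implementation also prunes the results, as discussed below):

import boto3
from datetime import datetime, timedelta, timezone

cloudtrail = boto3.client("cloudtrail")

# Registered with @agent.tool_plain in the agent code
def get_cloudtrail_events_for_identity(
    userName: str,
    startTime: datetime = None,
    endTime: datetime = None,
) -> list[dict]:
    """Return CloudTrail events performed by the given IAM user name."""
    # Default lookback window is an arbitrary choice for this sketch
    endTime = endTime or datetime.now(timezone.utc)
    startTime = startTime or endTime - timedelta(days=1)
    response = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": userName}],
        StartTime=startTime,
        EndTime=endTime,
        MaxResults=50,
    )
    return response["Events"]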

There are several AWS MCP offerings that expose AWS APIs to LLMs and agents, but this is a reason I avoided them here: I didn't want to overload the context more than needed, and most of the MCP server implementations had the same issue as my search_cloudtrail tool above, being too generic.

Part of me wonders if my usage of the LookupEvents API itself was a cause of this, as its arguments are a bit unintuitive and most security folks out there use some sort of SIEM rather than this API directly (meaning there probably wasn't much training data on it).

Dynamically selecting a markdown playbook and adding that to the context

I had this idea of storing alert playbooks as code (markdown), accessible to the LLM, giving it a function to list all available playbooks, and allowing it to select the appropriate one based on the alert. One could even dynamically provide different sets of tools based on the alert/playbook to help limit the amount of unneeded context. Sadly, this landed in the "for next time" category.
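
A sketch of what those playbook tools might look like (entirely hypothetical, since this landed in the "for next time" pile):

from pathlib import Path

PLAYBOOK_DIR = Path("playbooks")   # hypothetical directory of markdown playbooks

@agent.tool_plain
def list_playbooks() -> list[str]:
    """List the names of the available triage playbooks."""
    return [p.stem for p in PLAYBOOK_DIR.glob("*.md")]

@agent.tool_plain
def get_playbook(name: str) -> str:
    """Return the markdown body of the named playbook."""
    return (PLAYBOOK_DIR / f"{name}.md").read_text()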

Deterministic vs letting the LLM handle things

In my code, I have an ActionLog field that allows the LLM to report the actions it has taken in triaging the alert. If I were to write this again, I would instead make this deterministic via a decorator on all tool functions, with the resulting log returned alongside the LLM's response. I also had some challenges with dates extracted by the LLM missing timezone information, which made for confusing responses all around.
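
A minimal sketch of that idea: wrap every tool so each call is appended to a log that code, not the model, controls.

import functools
from datetime import datetime, timezone

# Deterministic action log, populated by code rather than the LLM.
action_log: list[dict] = []

def logged_tool(func):
    """Record every tool invocation (name, args, timestamp) as it happens."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        action_log.append({
            "tool": func.__name__,
            "arguments": kwargs or args,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return func(*args, **kwargs)
    return wrapper

# Usage: stack it under the agent decorator on each tool
# @agent.tool_plain
# @logged_tool
# def get_guardduty_alert(finding_id: str) -> dict: ...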

I would also likely include the raw outputs of log searches in the response to allow the end user to QA/easily dig deeper if needed.

I didn't really come away with a great answer, but I think recognizing that not everything has to be handled by the LLM and deterministic code still works just fine for a lot of things is a solid takeaway. Engineering is but a series of tradeoffs, right?

Pruning unneeded data

I had a fun troubleshooting session where I gave the LLM a tool for looking up CloudTrail events, and it would eat up all of its context with the raw results of those queries across a large date range. I ended up adding logic to the event lookup tools to trim the data down to only the fields that were critical (UserName, UserAgent, EventName, Time, and EventId). If I had been using an actual SIEM, this probably would have been easier, but the challenge of sending only the relevant data to the LLM while preserving valuable context would absolutely still exist.
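
The pruning itself is simple; a sketch of trimming lookup_events results down to the essentials (field names taken from the CloudTrail LookupEvents response and the embedded CloudTrailEvent JSON):

import json

def prune_events(events: list[dict]) -> list[dict]:
    """Keep only the fields the LLM actually needs from raw CloudTrail events."""
    pruned = []
    for event in events:
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        pruned.append({
            "UserName": event.get("Username"),
            "UserAgent": detail.get("userAgent"),
            "EventName": event.get("EventName"),
            "Time": str(event.get("EventTime")),
            "EventId": event.get("EventId"),
        })
    return pruned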

Final thoughts:

  • No, I don't think we will be replacing skilled analysts anytime soon with an army of LLM agents. Human in the loop is going to be here for a long time. Even then, who will build the plumbing for all of that?

  • Real environments are dirty, undocumented, and disorganized snowflakes. This was my personal AWS Lab environment, which is much more simplistic.

  • On the more positive side: I do feel LLMs have a lot of utility for those building SOAR automations. Tons of effort is spent on parsing out specific fields and using them to select the correct actions to take. Instead of that, we could build playbooks that can handle multiple classes of alerts, are resistant to upstream schema modifications (when data source formats change), and can even handle scenarios not previously identified. Using LLMs to route and tie together the correct deterministic actions for an alert, to better help a human make an escalation decision, is where I see this going.

  • There is LOTS of engineering to be had: managing what should be deterministic vs what the LLM handles, context windows, data engineering, cost management, etc. Turns out your AI Agent is just.... GASP.... software. Oh, and you can even write tests for your agent using evals.

  • I am excited and optimistic about the future here. I also had a blast building this, and am going to continue to iterate on it.

  • Some people say LLMs/GenAI is overhyped, and others say it is an incredibly useful technology. What if it is both?
