AWS DevOps Agent: From Setup to First Real Investigation (And the Gotchas in Between)

When an alert fires at 2am, the first 15 minutes aren't spent fixing anything. They're spent gathering context — opening CloudWatch, checking New Relic, scanning recent deploys, pulling up the runbook. By the time you actually understand what's happening, you've burned the fastest part of your response window.

AWS DevOps Agent is designed to do that first pass for you. It's an agentic AI system that investigates incidents autonomously — querying your telemetry, correlating deploy events, and posting findings to Slack before you've finished reading the page notification. I set it up as a PoC for our team recently and wanted to share what the setup actually looks like, including the things that tripped me up.


What the Agent Actually Does

The DevOps Agent sits between your alerting layer and your engineers. When a P1/P2 alert fires, it automatically:

  • Reads CloudWatch logs and metrics
  • Queries New Relic for transaction data, error rates, span-level telemetry
  • Correlates recent deploy events with the incident timeline
  • Posts an investigation summary to Slack with observations ranked by relevance

Engineers arrive with context rather than a blank slate. That's the core value proposition.


Setup Overview

The setup has a few distinct pieces: creating the Agent Space, wiring up New Relic, connecting GitHub, connecting Slack, and uploading a custom Skill. Here's how each went.

Creating the Agent Space

The first thing I hit: us-west-1 is not a supported region. The DevOps Agent is GA but only available in a subset of regions — us-east-1, us-west-2, ap-southeast-2, ap-northeast-1, eu-central-1, and eu-west-1 as of this writing.

The good news is you don't have to move your infrastructure. Creating the Agent Space in us-west-2 still gives it access to resources in us-west-1 (and other regions) through cross-region discovery. During initial setup it mapped over 4,000 entity relationships automatically — services, ECS clusters, ALBs, Lambda functions — without any manual configuration.

One thing worth knowing: there are two interfaces. The AWS console is for admin and setup only. The actual investigation experience — running manual investigations, viewing topology, reading findings — happens in a separate web app, accessed via an IAM auth link from the console. I spent longer than I'd like to admit looking for things in the wrong place.

New Relic Integration: Telemetry Source vs Capability Webhook

The New Relic integration is really two separate things, and it's easy to confuse them:

  • Telemetry Source — the agent can read your New Relic data during an investigation (transaction metrics, spans, etc.)
  • Capability Webhook — New Relic can trigger investigations automatically when an alert fires

You need both. The Telemetry Source requires a Full Platform User API key — a regular API key won't work and you'll get a 403 with no helpful error message. If you don't have admin access to New Relic, flag this upfront so you're not blocked waiting on someone else.
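
If you're not sure what kind of key you've been handed, a quick NerdGraph call will at least confirm it's a valid user key before you wire it into the Telemetry Source. It won't prove Full Platform access, but it rules out the obvious failure mode (EU accounts use api.eu.newrelic.com):

curl -s https://api.newrelic.com/graphql \
  -H 'Content-Type: application/json' \
  -H "API-Key: $NEW_RELIC_USER_KEY" \
  -d '{"query": "{ actor { user { name email } } }"}'

A valid user key returns your name and email; an error response here almost certainly means the key won't work as a Telemetry Source credential either.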

Once connected, I created a New Relic workflow pointing at the Agent Space webhook with a custom payload template. This is where I spent most of my debugging time.

Fixing the Payload Template

New Relic's alert payloads don't match what the DevOps Agent expects out of the box. Two fixes were required:

Action field values. New Relic sends ACTIVATED and CLOSED, while the DevOps Agent expects created and closed (note this isn't just a casing difference, since ACTIVATED maps to created). Fix with conditionals in the template:

{{#eq state "ACTIVATED"}}created{{/eq}}
{{#eq state "CLOSED"}}closed{{/eq}}

Service field type. New Relic sends the entity name as an array ["My Service"]. The agent expects a string. Fix:

{{ json entitiesData.names.[0] }}

Neither of these is documented anywhere obvious. I found them by running a test investigation and reading the agent's error observations.
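
Putting both fixes together, the template body ends up looking roughly like this. The action and service fields are the ones the agent actually cares about here; the rest are standard New Relic issue variables included for illustration, and the exact set your Agent Space webhook expects may differ, so treat this as a starting sketch rather than a drop-in:

{
  "action": "{{#eq state "ACTIVATED"}}created{{/eq}}{{#eq state "CLOSED"}}closed{{/eq}}",
  "service": {{ json entitiesData.names.[0] }},
  "title": {{ json annotations.title.[0] }},
  "priority": {{ json priority }},
  "issueUrl": {{ json issuePageUrl }},
  "createdAt": {{ json createdAt }}
}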

Scoping with Alert Policies

Rather than exposing every alert to the agent, I scoped it to specific New Relic alert policies covering the services my team owns, filtered to P1/P2 priority only. This keeps investigation volume (and cost) predictable. Each team can have their own New Relic workflow pointing to the same webhook destination — clean separation, shared infrastructure.

GitHub Integration

The agent supports GitHub integration for correlating deploy events with incident timelines — useful for the most common first question in any incident: "was this caused by a recent deploy?" Setup is a two-step GitHub App install (account-level, then repo-level) with read-only access.

Once connected, the agent can pull recent commit and deploy activity and factor it into its investigation. If your team uses New Relic deployment markers already, there's some overlap — but the GitHub integration adds code-level context that telemetry alone doesn't give you.

Slack Integration

Straightforward — register Slack as a capability provider at the account level, then enable it per Agent Space. I'd recommend creating a dedicated channel for investigations rather than posting to a general engineering channel. The agent posts detailed findings with multiple observations; it's noisy in context, but useful in its own space.

Custom Skill: What to Put In and What to Leave Out

This is the part most writeups skip over, but it's where you get the most leverage.

The agent already knows a lot from topology mapping — your service names, ECS cluster layout, log group names, ALB configuration. Don't repeat any of that in the Skill. What the agent can't discover on its own is everything outside your AWS account. That's what the Skill is for.

Here's the structure I used (a consolidated sketch of the whole Skill follows at the end of this section):

1. Investigation priority order

Tell the agent explicitly what to check first. Left to its own judgment, it will start with internal metrics. But in practice, a large percentage of incidents are caused by third-party outages — and finding that early saves everyone time.

1. Check external dependency status pages first
2. Check for recent deployments
3. Analyze internal metrics (CloudWatch, New Relic error rates, traces)
4. Correlate across sources and build a timeline

2. External dependency status pages

Organize these by category so the agent knows which ones are relevant to which type of incident:

Infrastructure: your CDN, DNS, cloud provider health dashboard
Payments: payment processor, app store IAP status pages
Monitoring: your APM tool, analytics platforms
Content/third-party: any external APIs or services your product depends on

The agent will check the ones relevant to the alert context and include their status in its findings. This alone has caught third-party outages that would have taken significantly longer to identify manually.

3. Runbook pointers

The agent can't access Confluence or internal wikis directly. But you can give it the URLs and it will surface the right one in its Slack output so the on-call engineer can jump straight to it. Map them by service or incident type.

4. Communication guidelines

Tell the agent how to structure its Slack findings:

  • Lead with the root cause or most likely hypothesis
  • Include a timeline of key events with timestamps
  • Reference specific metrics and thresholds
  • If an external dependency is down, state it clearly
  • End with recommended next steps for the on-call engineer

This makes the agent's output immediately actionable rather than a wall of observations to interpret.
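
Pulled together, the Skill content itself is just structured text. Here's a condensed, anonymized sketch of the shape described above; the service names, status page URLs, and runbook links are placeholders you'd swap for your own:

# Incident Investigation Skill

## Investigation priority
1. Check external dependency status pages first
2. Check for recent deployments (GitHub activity, deployment markers)
3. Analyze internal metrics (CloudWatch, New Relic error rates, traces)
4. Correlate across sources and build a timeline

## External dependency status pages
- Infrastructure: https://health.aws.amazon.com, https://www.cloudflarestatus.com
- Payments: https://status.example-payments.com (your processor's page)
- Monitoring: https://status.newrelic.com

## Runbooks
- checkout-service: https://wiki.example.com/runbooks/checkout
- payments-api: https://wiki.example.com/runbooks/payments

## Communication guidelines
- Lead with the most likely root cause or hypothesis
- Include a timeline of key events with timestamps
- Reference specific metrics and thresholds
- State clearly when an external dependency is down
- End with recommended next steps for the on-call engineer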


End-to-End Test

To validate the setup before real alerts, I created a test alert policy with a condition that would fire reliably (p95 response time > 10ms on a service where baseline is ~60ms).
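
If you want a similar always-firing test, a NRQL alert condition along these lines does the job; the appName is a placeholder, and the static threshold in the condition settings would be "above 10" for a few minutes, which a ~60ms baseline clears on every evaluation window:

SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'my-test-service'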

The first triggered investigation ran for about 15 minutes and produced 23 observations. The agent correctly identified it as a false positive (threshold was too low). What it also surfaced — unprompted — was a real issue: ECS task churn on a different service, with recurring task stops every few hours and a 2-hour replacement delay. That had nothing to do with the test alert. It just noticed it while reviewing CloudWatch events.

A second test a few hours later went more smoothly: the agent recognized the same condition from the previous investigation and completed its analysis more quickly.

Incident deduplication also worked as expected: when two alerts for the same condition fired 80 seconds apart, the agent linked the second to the first investigation rather than starting a duplicate. This matters for high-severity incidents where multiple conditions often trigger simultaneously.


Cost

Pricing is $0.0083/second (~$0.50/minute), billed only during active investigations. At roughly 6–7 incidents per month averaging 10–12 minutes each, that comes to around $40–50/month. There's a 2-month free trial (20 hours of investigation time) — useful for validating the PoC without any spend pressure.
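
As a sanity check on the math: 7 investigations × 12 minutes × $0.50/minute works out to about $42, so the estimate above leaves a little headroom for manual and test investigations on top of the alert-triggered ones.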

One gap: at the time of setup, the DevOps Agent service didn't appear as a filterable dimension in AWS Budgets or Cost Anomaly Detection. I have the CLI commands ready for when charges start showing up in billing. If you're cost-conscious (and you should be), wire up AWS Budgets with a monthly threshold as soon as the service appears in your billing dimensions.
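
For reference, a monthly cost budget via the CLI looks roughly like this; the account ID, amount, email, and especially the Service filter value are placeholders, since the DevOps Agent hadn't shown up as a billing dimension yet when I set this up:

# account ID, budget amount, email, and the Service filter value are placeholders
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "devops-agent-monthly",
    "BudgetLimit": {"Amount": "50", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {"Service": ["<DevOps Agent service name, once it appears in billing>"]}
  }' \
  --notifications-with-subscribers '[{
    "Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN", "Threshold": 80},
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "oncall@example.com"}]
  }]'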


What I'd Tell Someone Setting This Up

  • Check your region first. Don't create the Agent Space in an unsupported region and wonder why nothing works.
  • Get the New Relic Full Platform User API key sorted before you start. It's a blocker and it requires admin access.
  • Test the payload template early. Run a manual investigation from the web app, read the error observations, and fix the action casing and service field before connecting real alert policies.
  • Scope narrowly to start. P1/P2 only, your team's services only. Expand once you've seen a few real investigations.
  • Build the Skill around what the agent can't discover. Status pages, runbook links, investigation priority order, communication guidelines — not infrastructure details it already knows from topology mapping.

Is It Worth It?

For a team that handles production incidents regularly, yes. The setup is half a day of work. The ongoing maintenance is minimal. And the value isn't the AI itself — it's that the first 15 minutes of an incident are done by the time you open your laptop.

The agent isn't perfect. It sometimes surfaces obvious things you'd have found in 30 seconds. But on a bad night with a pager, "obvious with context already in Slack" is more useful than "obvious after you've dug through four different tools."


AWS Community Builder — Serverless category. Questions or feedback welcome in the comments.
