AWS DevOps Agent: 10 best practices to get the most out of it

One of the key announcements at AWS re:Invent 2025 was the launch of new frontier autonomous agents from AWS:

  • AWS DevOps Agent
  • AWS Security Agent
  • Kiro Autonomous Agent

Out of these, the AWS DevOps Agent is going to revolutionize the way DevOps and SRE teams work. In this guide, I'm going to cover the key best practices you need to consider to get the most out of your AWS DevOps Agent.

1. AWS DevOps Agent is not a tool; it’s a capability.

You read that correctly. The DevOps Agent is not a magic bullet that will solve all your problems while you sip your cup of tea. It's a capability, and the results will, of course, depend on how you use it.

Example: You can't just install an AIOps agent and expect MTTR (Mean Time to Repair) to drop. If nothing else changes, alerts will still fire the same way, runbooks still won't be executable, and service ownership and SLOs (Service Level Objectives) will remain undefined.

To get the most out of the DevOps agent, you need to define SLOs for each service, convert runbooks into executable processes, provide observability, ensure change visibility, and enable other capabilities so the agent can correlate deployments, suggest resolutions, and execute with humans in the loop.

Remember, capabilities involve people, processes, and tools, not just software.

2. Observability is the key; the agent needs context.

Observability is as crucial as ever. If you thought you could park the observability discussion, you're in for a rude shock. The agent needs context to act, and context comes from your telemetry data (metrics, logs, and traces).

It’s best to aggregate all your telemetry sources. If CloudWatch isn’t your cup of tea, integrations are available for all top observability tools, such as Datadog, Dynatrace, New Relic, and Splunk.

The idea is to ensure the agent can see the blast radius of an incident with the help of telemetry data so it can understand your system’s internal state and act on any changes with the correct intent.

Example: A load balancer encounters 5xx errors. Without observability context, the DevOps agent only sees the 5xx error count and would likely suggest scaling the load balancer or services. However, with full telemetry, the agent can identify that traces are slow due to SQL queries, and logs show that the RDS connection pool is exhausted with high CPU saturation in the database.

Now, the DevOps agent can conclude that the root cause is the RDS issue, which is causing the upstream 5xx errors. Scaling the ALB (Application Load Balancer) won’t resolve the problem.

We need to enable agents to understand the blast radius, not just the symptoms. Observability is key to providing this context.
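
As a minimal sketch of what that correlation looks like in practice (assuming CloudWatch as the telemetry source; the load balancer and database identifiers below are placeholders), the agent's tooling needs to be able to pull the symptom and the suspected root-cause signals side by side:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Hypothetical resource identifiers; replace with your own ALB and RDS names.
ALB_DIMENSION = [{"Name": "LoadBalancer", "Value": "app/checkout-alb/1234567890abcdef"}]
RDS_DIMENSION = [{"Name": "DBInstanceIdentifier", "Value": "checkout-db"}]

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

# Pull the ALB 5xx count next to the RDS connection and CPU metrics, so the
# blast radius (ALB symptom -> database root cause) is visible in one response.
response = cloudwatch.get_metric_data(
    StartTime=start,
    EndTime=end,
    MetricDataQueries=[
        {
            "Id": "alb_5xx",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": "HTTPCode_ELB_5XX_Count",
                    "Dimensions": ALB_DIMENSION,
                },
                "Period": 60,
                "Stat": "Sum",
            },
        },
        {
            "Id": "rds_connections",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "DatabaseConnections",
                    "Dimensions": RDS_DIMENSION,
                },
                "Period": 60,
                "Stat": "Average",
            },
        },
        {
            "Id": "rds_cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": RDS_DIMENSION,
                },
                "Period": 60,
                "Stat": "Average",
            },
        },
    ],
)

for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))
```

Whether you feed this through CloudWatch or a third-party observability platform, the point is the same: the agent needs correlated signals, not isolated alerts.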

3. Define golden signals (latency, error rate, saturation, and traffic) so the agent can work with symptoms.

Agents reason better over symptoms and effects than over raw alerts, which usually generate a lot of noise. The more symptoms the agent has access to, the better it can act on them.

Example: Instead of defining alerts based on infrastructure metrics like CPU > 80% or memory > 75%, you define thresholds such as checkout latency P95 > 2s or error rate > 1%. Alerts are then triggered due to increased latency or rising error rates.

In this case, the agent is able to reason about user experience, even when infrastructure metrics are not in an alarming state. This leads to better detection of end-user–impacting issues and more effective root cause analysis.
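
As a rough sketch of a symptom-based alert (assuming an ALB-fronted checkout service and CloudWatch alarms; the alarm name and load balancer value are placeholders), you alarm on user-facing latency rather than CPU:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the user-facing symptom (checkout latency P95 > 2s), not on CPU or memory.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-latency-p95-high",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout-alb/1234567890abcdef"}],
    ExtendedStatistic="p95",          # percentile, not average
    Period=60,
    EvaluationPeriods=5,
    Threshold=2.0,                    # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Checkout P95 latency above 2s for 5 minutes",
)
```

A matching alarm on error rate (for example, 5xx count as a percentage of requests) completes the golden-signal picture for that service.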

4. Agents need guidance; instead of wikis, provide agents with tools.

It’s common to provide runbooks that offer investigation guidance to agents. But remember, unless you provide real capabilities to your agent, a runbook is just documentation. While documentation is useful, you should aim to provide the agent with actionable solutions.

For example, provide Lambda functions that can pull telemetry data or execute remediation actions. Step Functions or other automated workflows that are part of the runbook can be made easily executable by the agent.

Just remember: your newest team member can't repost failed orders simply by reading how to do it. But if there's a Lambda function available, they can use it. For instance, one Lambda function can identify the root cause, a second can determine the correct recovery or reporting action, and a third can execute it.

Guidance must be clearly defined, with preconditions, safe actions, and always with rollback steps. This approach enables your agent to evolve from a recommendation-only agent into one that can actively remediate issues.

Example: Documentation may state that if the SQS backlog increases, you should check consumer health and restart pods. However, the agent cannot perform these actions on its own. Instead, you need to provide Lambda functions that can fetch queue depth and consumer lag, another Lambda function that can analyze failure patterns, and another function that can safely restart the consumer deployment. A Step Functions workflow can be used to orchestrate all these steps, including rollback.

During an incident, the agent can invoke these Lambda functions, identify stalled consumers, recommend execution of the Step Functions workflow, and carry it out after approval. In this scenario, the agent acts as an active operator, not just a passive observer.
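
A minimal sketch of two such capabilities (the queue URL, cluster, and service names are assumptions, and the consumer is shown as an ECS service purely for illustration): one Lambda the agent can call to check queue health, and one that safely rolls the consumer.

```python
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

# Hypothetical identifiers; adapt to your own queue and consumer service.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue"
CLUSTER = "prod-cluster"
CONSUMER_SERVICE = "orders-consumer"


def get_queue_health(event, context):
    """Diagnostic tool: return backlog depth and in-flight messages for the agent."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    return {
        "backlog": int(attrs["ApproximateNumberOfMessages"]),
        "in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"]),
    }


def restart_consumer(event, context):
    """Remediation tool: roll the consumer by forcing a new deployment."""
    response = ecs.update_service(
        cluster=CLUSTER,
        service=CONSUMER_SERVICE,
        forceNewDeployment=True,
    )
    return {"service_status": response["service"]["status"]}
```

Wrapping these functions in a Step Functions workflow, with a rollback branch, turns the written runbook into something the agent can actually run.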

5. An agent is like a human: focus on guardrails instead of denying permissions.

Just giving your agent administrative access is as bad as denying it the permissions it actually needs. An agent requires a reasonable level of access to do its magic.

While least-privilege IAM roles are important, it’s often more effective to focus on guardrails—clearly defining what the agent can and cannot do. For example, you might allow broad access for diagnostics while tightly controlling or restricting remediation actions.

With agents, you need to start becoming comfortable with autonomy that operates within well-defined rails, rather than blocking everything by default. This balance enables the agent to be effective while still keeping your environment safe.

Example of a bad approach: Giving an agent admin access is risky—one bad prompt could cause a production outage. On the other hand, if the agent only has read access to metrics, it delivers zero remediation value.

A better approach is to provide read-only access across all services, while allowing remediation only through approved capabilities such as Lambda functions or Step Functions. Direct delete, terminate, or drop permissions should never be allowed.

This model enables the agent to diagnose issues freely, while remediation can occur only through safe, audited paths with built-in guardrails. Autonomy within guardrails is the way forward.
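
A rough sketch of that permission model (policy, account, and function names are placeholders): broad read access for diagnostics, with write access limited to invoking approved remediation functions and workflows.

```python
import json
import boto3

iam = boto3.client("iam")

# Read everything, remediate only through approved, audited Lambda/Step Functions paths.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DiagnoseFreely",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData",
                "cloudwatch:DescribeAlarms",
                "logs:FilterLogEvents",
                "xray:GetTraceSummaries",
            ],
            "Resource": "*",
        },
        {
            "Sid": "RemediateOnlyViaApprovedTools",
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction", "states:StartExecution"],
            "Resource": [
                "arn:aws:lambda:us-east-1:123456789012:function:remediate-*",
                "arn:aws:states:us-east-1:123456789012:stateMachine:remediate-*",
            ],
        },
    ],
}

iam.create_policy(
    PolicyName="devops-agent-guardrails",
    PolicyDocument=json.dumps(policy_document),
)
```

Note that there is no delete, terminate, or drop permission anywhere in the policy; destructive actions simply aren't on the rails.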

6. Have a KT plan for the agent. Your team member needs some babysitting.

Treat the DevOps agent as your new team member. It may be a superhero when it comes to AWS, but it’s still a novice when it comes to your specific cloud implementation.

You need to train the agent with detailed information so it can develop a full understanding of your architecture, implementations, and even business context. Treat it like an expert Solution Architect who has just joined the team—don’t assume prior knowledge. Share everything you have and onboard it properly, rather than letting it jump straight into firefighting.

Example: When you onboard a new solution architect, you provide proper knowledge transfer (KT), share architecture diagrams, explain why things exist, outline the business rationale, and discuss past failures. You need to do the same with your DevOps agent.

Provide architecture diagrams, documentation, service mappings, business context, and known failure patterns. This enables the agent to prioritize a payment API over reporting jobs when managing alerts and to avoid repeating known bad remediations.

Always remember: context reduces incorrect automation actions.
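
One lightweight way to make that context machine-readable (the structure below is an illustrative sketch, not an official agent schema) is a small service catalog the agent can consult when it prioritizes and remediates:

```python
# Illustrative service-context entries; adapt the fields to your own environment.
SERVICE_CATALOG = {
    "payment-api": {
        "tier": "critical",              # prioritize over batch/reporting workloads
        "owner": "payments-team",
        "slo": {"latency_p95_ms": 300, "error_rate_pct": 0.5},
        "dependencies": ["checkout-db", "fraud-service"],
        "known_failure_patterns": [
            "RDS connection pool exhaustion under flash-sale traffic",
        ],
        "bad_remediations": [
            "Restarting RDS during business hours",  # known bad fix, do not repeat
        ],
    },
    "reporting-jobs": {
        "tier": "low",
        "owner": "analytics-team",
        "slo": {"completion_by": "06:00 UTC"},
    },
}
```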

7. Let agents know what your developers are doing.

Yes, it’s a DevOps agent—but it still needs visibility into what your developers are working on. It’s essential to connect your CI/CD pipelines and provide this visibility to the agent.

This allows the agent to correlate operational issues with recent code changes and deployments. As a result, the agent can identify specific commits or pipeline executions and isolate them to better understand the root cause of issues.

Let’s be frank: most incidents today are code-related or deployment-related. The old saying still holds true—if you don’t touch it, it won’t break on its own. So when something isn’t working, let your agent answer the critical question: what changed?

This significantly accelerates the agent’s ability to isolate root causes and reduce mean time to resolution (MTTR).

Example: Let’s say there is a latency spike at a certain time. The DevOps agent checks the CI/CD pipeline and identifies that a deployment occurred shortly before the spike. The commit included changes to payment-related files.

The agent then pulls additional metrics and correlates them with high confidence, concluding that the alert is caused by the recent deployment and recommending a rollback. Without this CI/CD context, the agent would waste time investigating infrastructure issues, increasing MTTR.
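
A minimal sketch of that "what changed?" check (assuming CodePipeline as the CI/CD system; the pipeline name is hypothetical): list recent executions and flag any that completed shortly before the alert.

```python
import boto3
from datetime import datetime, timedelta, timezone

codepipeline = boto3.client("codepipeline")

PIPELINE = "payment-service-pipeline"  # hypothetical pipeline name


def deployments_before(alert_time, window_minutes=30):
    """Return pipeline executions that finished within the window before an alert."""
    executions = codepipeline.list_pipeline_executions(
        pipelineName=PIPELINE, maxResults=20
    )["pipelineExecutionSummaries"]
    cutoff = alert_time - timedelta(minutes=window_minutes)
    return [
        e for e in executions
        if e["status"] == "Succeeded" and cutoff <= e["lastUpdateTime"] <= alert_time
    ]


suspects = deployments_before(datetime.now(timezone.utc))
for e in suspects:
    print(e["pipelineExecutionId"], e["lastUpdateTime"])
```

From a suspect execution, the agent can walk back to the commit and the files it touched, which is exactly the correlation described above.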

8. Hold your agent’s hand until it grows up. Start with a human in the loop and actively steer the work.

Of course, initially you need to be heavily involved—you can’t realistically expect a fully autonomous agent from day one. You need to observe its behavior, explain context, and provide detailed recommendations that the agent can act on.

Any remediation action should go through an approval process at the beginning. This is how the journey starts. Over time, you can gradually increase autonomy by putting the right guardrails in place. Remember, even a DevOps agent has to earn your trust.

Steering the agent is equally important. It’s your environment, so you need to stay actively involved. Use chat features to provide details, discuss failure scenarios, and plan responses in real time. If you notice false alarms or incorrect root-cause analysis, correct the agent. Explain why you disagree so it can learn effectively.

The idea is not to wait and see until the agent fails, but to proactively take action to ensure the agent succeeds.

Example: The agent recommends restarting RDS, but a human rejects the action and explains that an RDS restart could cause data loss or customer impact during peak hours. The agent learns about time windows, business constraints, and safer alternatives.

In later phases, the agent can automatically restart stateless services, while still requiring approval for any data-layer changes. Trust is built through guided autonomy.
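
One way to wire that approval gate (a sketch, assuming a Step Functions task using the waitForTaskToken callback pattern and an SNS topic for notifying humans; all names are placeholders):

```python
import json
import boto3

sns = boto3.client("sns")
sfn = boto3.client("stepfunctions")

APPROVAL_TOPIC = "arn:aws:sns:us-east-1:123456789012:remediation-approvals"  # placeholder


def request_approval(event, context):
    """Invoked by a Step Functions task with .waitForTaskToken:
    pause the remediation and ask a human to approve it."""
    sns.publish(
        TopicArn=APPROVAL_TOPIC,
        Subject=f"Approve remediation: {event['action']}",
        Message=json.dumps({"action": event["action"], "taskToken": event["taskToken"]}),
    )


def handle_approval(event, context):
    """Called by your approval UI or chat integration with the human's decision."""
    if event["approved"]:
        sfn.send_task_success(taskToken=event["taskToken"], output=json.dumps({"approved": True}))
    else:
        sfn.send_task_failure(
            taskToken=event["taskToken"], error="Rejected", cause=event.get("reason", "")
        )
```

As trust grows, you can remove the approval step for low-risk, stateless actions while keeping it mandatory for anything touching the data layer.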

9. Measure agent performance using business metrics.

An agent is not a shiny object that you deploy and forget about. It’s actually useless if it doesn’t positively improve outcomes. That means you need to start measuring the right metrics.

Track indicators such as Mean Time to Resolve (MTTR), noise reduction, the percentage of root causes identified automatically, and the percentage of remediations executed by the agent. These metrics help you understand whether the agent is delivering real value.

Unless you measure performance and take the necessary actions based on those insights, there will be no meaningful improvement.

Example: Before introducing the agent, MTTR was 45 minutes and 120 alerts were generated per service-impacting incident. After configuring the DevOps agent, MTTR dropped to 18 minutes, alert noise reduced to 35 alerts per incident, 40% of incidents were auto-diagnosed, and 20% were auto-remediated.

These are the real business benefits you should strive to achieve. If you can’t demonstrate measurable impact, the agent is just a shiny demo.
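
A small sketch of how you might track these numbers (publishing per-incident outcomes as custom CloudWatch metrics; the namespace and fields are assumptions for illustration):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def publish_incident_outcome(incident):
    """Publish per-incident outcomes so MTTR and automation rates can be trended."""
    cloudwatch.put_metric_data(
        Namespace="DevOpsAgent/Outcomes",  # hypothetical custom namespace
        MetricData=[
            {"MetricName": "MTTRMinutes", "Value": incident["mttr_minutes"], "Unit": "Count"},
            {"MetricName": "AlertsPerIncident", "Value": incident["alert_count"], "Unit": "Count"},
            {"MetricName": "AutoDiagnosed", "Value": 1 if incident["auto_diagnosed"] else 0, "Unit": "Count"},
            {"MetricName": "AutoRemediated", "Value": 1 if incident["auto_remediated"] else 0, "Unit": "Count"},
        ],
    )


publish_incident_outcome(
    {"mttr_minutes": 18, "alert_count": 35, "auto_diagnosed": True, "auto_remediated": False}
)
```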

10. Actively look into agent investigation gaps and work to resolve them.

A DevOps agent will not be right on the first attempt, especially in the early stages. There will be many investigations it cannot continue due to implementation gaps, missing context, lack of telemetry data, missing capabilities, or permission issues.

You need to regularly review these investigation gaps and provide the necessary inputs to the agent. Over time, this will enable the agent to become more effective and smarter in the long run.

Example: The agent stops investigating and reports that it is unable to determine the root cause due to missing database query metrics. In response, you enable RDS Performance Insights, add slow query logs, and create a Lambda function to fetch query statistics.

With this additional context, the agent is able to identify long-running queries and suggest actions such as index creation or query throttling.
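
A sketch of the kind of Lambda you might add to close that gap (assuming RDS Performance Insights is enabled; the database resource identifier is a placeholder): fetch the SQL statements contributing most to database load.

```python
import boto3
from datetime import datetime, timedelta, timezone

pi = boto3.client("pi")

# Performance Insights resource identifier of the DB instance (placeholder).
DB_RESOURCE_ID = "db-ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def top_queries(event, context):
    """Diagnostic tool: return the SQL statements contributing most to DB load."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=30)
    response = pi.get_resource_metrics(
        ServiceType="RDS",
        Identifier=DB_RESOURCE_ID,
        StartTime=start,
        EndTime=end,
        PeriodInSeconds=60,
        MetricQueries=[
            {
                "Metric": "db.load.avg",
                "GroupBy": {"Group": "db.sql_tokenized", "Limit": 5},
            }
        ],
    )
    return response["MetricList"]
```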

Every failure is a training data point for your agent, not a reason to abandon it or point fingers when it falls short.

Finally, keep evolving alongside the AWS DevOps Agent and take it with you on the journey.
