DEV Community

kartik manimuthu

I Built an Autonomous AI DevOps Agent Using LangGraph and AWS Bedrock — Here's Everything I Learned

How a reflection-based AI architecture solved the practical problems that simple LLM chatbots can't.


I have a confession. At the peak of the AI hype cycle, I tried the obvious thing: I wired up a ChatGPT-like assistant to our AWS CLI and called it a "DevOps co-pilot." Engineers loved it for one week. Then reality set in. It confidently ran the wrong commands. It forgot context mid-session. It had no ability to recover from its own mistakes. It was, in the bluntest engineering terms, a glorified Google search with a subprocess.run() wrapper.

So I started over. I built Nucleus Cloud Ops, an AI Ops platform that doesn't just suggest commands, but plans, executes, reflects, and self-corrects across multi-account AWS environments. This article is everything I learned building it.

🎬 See it in action first. Before we dive into the architecture, here's the agent running a real multi-step AWS investigation end-to-end:


The Real Problem With AI in DevOps

Before we talk about the solution, let's be honest about the actual problems that Agentic AI needs to solve in a real DevOps context. It's not "how do I query an LLM?" The hard problems are:

1. Long-horizon task planning. A request like "Analyze my Lambda functions, identify cold-start issues, and generate a cost optimization report" is not a single operation. It's 8–12 sequential and parallel steps. Vanilla LLM prompting breaks down at this complexity.

2. State persistence. A DevOps session is not a one-shot interaction. An engineer may context-switch between accounts, revisit earlier tool outputs, and build upon prior findings across multiple sessions. State must persist.

3. Cross-account security. Your AI agent will have access to production AWS accounts. You cannot cut corners here. The architecture must enforce least-privilege and zero-trust by design, not by policy.

4. Failure recovery. AWS CLI commands fail. API rate limits hit. IAM policies block unexpected actions. A practical agent must detect these failures and attempt autonomous recovery, not just print an error and give up.

5. Trust. Engineers will not trust a black box. They need to see the agent's plan before execution, understand why each step is being taken, and retain the ability to intervene.

The architecture I'm about to describe was designed to solve all five of these problems specifically.


The Architecture: A Reflection Pattern Built on LangGraph

The core AI engine of Nucleus Cloud Ops is a stateful, cyclic graph built on LangGraph. If you're not familiar with LangGraph, think of it as a way to define AI agents as persistent, directed graphs where nodes are functions and edges define conditional control flow.

The key architectural decision was adopting a Reflection Pattern — not because it's trendy, but because it directly addresses the failure recovery problem.

Here is the graph:


User Request
    │
    ▼
[Planner Node]  ──▶  [Executor Node]  ──▶  [Reflector Node]
                           ▲   │                   │
                           │   ▼                   │ (isComplete = false)
                       [Tool Node]           [Reviser Node]
                           │                       │
                       AWS / Grafana               └─────────▶ [Executor Node] (retry)
                       / K8s MCP

                                                   │ (isComplete = true)
                                                   ▼
                                           [Final Summary Node]

And here is how this maps to the full deployed platform — the High-Level Architecture Diagram of Nucleus Cloud Ops:

[Figure: Nucleus Cloud Ops High-Level Architecture Diagram]

Node-by-Node Breakdown

🔷 Planner Node

The first thing the agent does is not run a tool. It generates a structured execution plan. Given the user's natural-language request, the Planner (using Claude Sonnet 4.5 via Amazon Bedrock) produces a numbered, step-by-step sequence of actions.

json

{
  "steps": [
    "1. Assume cross-account role for account STX-EKYC-PROD",
    "2. Run `aws lambda list-functions` to enumerate all functions",
    "3. For each function, check CloudWatch metrics for p99 initialization duration",
    "4. Query Cost Explorer for last 30 days Lambda spend by function",
    "5. Correlate cold-start frequency with cost per invocation",
    "6. Generate a markdown report with top-10 highest cost offenders"
  ]
}

This plan is rendered in the UI before execution begins. The engineer can review, edit, or abort before a single AWS API call is made.
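As a rough sketch of that review gate, the plan JSON can be parsed into typed steps that refuse to run until someone approves them. The names here (`PlanStep`, `Plan`, `parse_plan`, `run_next_step`) are illustrative assumptions, not taken from the Nucleus codebase, and the "execution" is only a placeholder.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanStep:
    index: int
    description: str
    status: str = "pending"  # becomes "done" once the step has run


@dataclass
class Plan:
    steps: List[PlanStep] = field(default_factory=list)
    approved: bool = False  # flipped by the engineer in the UI (assumed flow)

    def next_pending(self) -> Optional[PlanStep]:
        """Return the first step that has not run yet, or None when finished."""
        return next((s for s in self.steps if s.status == "pending"), None)


def parse_plan(raw_json: str) -> Plan:
    """Turn the Planner's JSON output into typed steps, rejecting malformed plans."""
    payload = json.loads(raw_json)
    steps = payload.get("steps") if isinstance(payload, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")
    return Plan(steps=[PlanStep(i, str(text)) for i, text in enumerate(steps, start=1)])


def run_next_step(plan: Plan) -> PlanStep:
    """Refuse to execute anything until a human has approved the plan."""
    if not plan.approved:
        raise PermissionError("plan has not been approved by the engineer")
    step = plan.next_pending()
    if step is None:
        raise RuntimeError("no pending steps left in the plan")
    step.status = "done"  # placeholder: a real Executor would call tools here
    return step
```

Keeping the approval flag on the plan object itself means the Executor cannot be handed an unapproved plan by accident.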

⚙️ Executor Node (Generate)

The Executor takes the next pending step from the plan and translates it into concrete tool calls. It uses the LLM to determine the exact combination of tools needed for this step.

Crucially, it has access to the full conversation history, prior tool outputs, and LangGraph state — so it never loses context, even across multiple turns or retries.

🔧 Tool Node

This is where things get interesting. Our Tool Node isn't a static set of functions. It's a dynamically-loaded set of capabilities based on the selected Agent Skill:

  • Read-Only Skill: AWS CLI (describe-*, list-*, and get-* actions only), Web Search, File read
  • DevOps Mutation Skill: All of the above + EC2 start/stop, RDS scaling, ECS deployment triggers
  • MCP Servers: Grafana (for metrics and dashboards) and Kubernetes (via the MCP protocol)

This skill-isolation pattern ensures the agent operates with the minimum permissions required for the task at hand.
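One way to picture the skill boundary is a small allowlist check that runs before any AWS CLI command is executed. This is a minimal sketch under assumptions: the skill names, the `MUTATION_ACTIONS` set, and `is_allowed` are hypothetical, and the real platform may enforce this differently.

```python
import shlex

# Operation prefixes the Read-Only Skill permits, as listed above.
READ_ONLY_PREFIXES = ("describe-", "list-", "get-")

# Hypothetical (service, operation) pairs for the DevOps Mutation Skill.
MUTATION_ACTIONS = {
    ("ec2", "start-instances"),
    ("ec2", "stop-instances"),
    ("rds", "modify-db-instance"),
    ("ecs", "update-service"),
}


def is_allowed(skill: str, command: str) -> bool:
    """Return True only if the AWS CLI command fits the selected skill."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar malformed input
        return False
    if len(tokens) < 3 or tokens[0] != "aws":
        return False
    service, operation = tokens[1], tokens[2]
    # Global options placed before the service are rejected for simplicity.
    if service.startswith("-") or operation.startswith("-"):
        return False
    read_only = operation.startswith(READ_ONLY_PREFIXES)
    if skill == "read_only":
        return read_only
    if skill == "devops_mutation":
        return read_only or (service, operation) in MUTATION_ACTIONS
    return False  # unknown skills get nothing
```

A prefix check like this is only a first filter: some `get-*` calls return sensitive data, so the IAM role behind the session still has to carry the real restriction.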

🔍 Reflector Node

This is the key innovation. After every Executor cycle, the output doesn't go directly back to the user. It goes to the Reflector — a secondary LLM loop that acts as a critical reviewer.

The Reflector's prompt is deliberately adversarial:

"You are a quality controller. Review the following tool execution output. Identify: (1) incomplete steps, (2) error messages that were dismissed, (3) logical inconsistencies, (4) security violations. Return isComplete: true only if you are confident the step was fully and correctly resolved."

If the Reflector returns isComplete: false, it also returns a structured critique:

json

{
  "isComplete": false,
  "critique": "The Lambda list was retrieved, but the function 'payment-processor-prod' threw an AccessDeniedException which was not handled. The cold-start analysis is incomplete without this function's metrics.",
  "suggestedFix": "Retry with expanded IAM permissions in the assume-role, or flag this function as excluded and note it in the report."
}
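Because the Reflector's verdict is LLM output, it helps to parse it defensively. The sketch below assumes the three field names shown in the critique JSON; the `Verdict` class and `parse_verdict` function are hypothetical. It fails closed, so anything malformed counts as "not complete".

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    is_complete: bool
    critique: str = ""
    suggested_fix: str = ""


def parse_verdict(raw: str) -> Verdict:
    """Parse the Reflector's JSON; any malformed output is treated as incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict(False, critique="Reflector returned invalid JSON")
    if not isinstance(data, dict):
        return Verdict(False, critique="Reflector output was not a JSON object")
    # Only a literal JSON `true` counts; strings like "true" do not.
    is_complete = data.get("isComplete") is True
    return Verdict(
        is_complete,
        critique=str(data.get("critique", "")),
        suggested_fix=str(data.get("suggestedFix", "")),
    )
```

Failing closed costs an extra retry when the Reflector misformats its answer, which is usually cheaper than accepting a step that was never verified.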

🔄 Reviser Node

The Reviser receives the Reflector's critique and updates the agent's state, injecting the suggested fix back into the execution context. It then re-routes to the Executor for a retry. This loop runs up to a configurable MAX_ITERATIONS (default: 5) before the agent concedes and surfaces the issue to the user.

In practice, I've observed this loop self-correcting ~80% of transient failures without human intervention.
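Stripped of LangGraph, the Executor, Reflector, and Reviser cycle reduces to a bounded retry loop. The following is a plain-Python sketch of that control flow, not the actual graph code: `StepState`, `run_step`, and the callable signatures are assumptions, with the LLM and tool calls passed in as functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_ITERATIONS = 5  # default cap mentioned above


@dataclass
class StepState:
    step: str
    hints: List[str] = field(default_factory=list)  # fixes injected by the Reviser
    attempts: int = 0
    output: str = ""
    status: str = "pending"


def run_step(
    step: str,
    execute: Callable[[str, List[str]], str],
    reflect: Callable[[str], Dict[str, object]],
    max_iterations: int = MAX_ITERATIONS,
) -> StepState:
    """Run one plan step through execute, reflect, and revise until accepted or capped."""
    state = StepState(step=step)
    for attempt in range(1, max_iterations + 1):
        state.attempts = attempt
        state.output = execute(step, state.hints)       # Executor + Tool Node
        verdict = reflect(state.output)                 # Reflector
        if verdict.get("isComplete") is True:
            state.status = "complete"
            return state
        # Reviser: carry the suggested fix into the next attempt's context.
        state.hints.append(str(verdict.get("suggestedFix", "")))
    state.status = "needs_human"  # cap reached, surface the issue to the user
    return state
```

With stub functions in place of the LLM, a step that fails once and then succeeds finishes with `attempts == 2`; one that never passes review stops at the cap with `status == "needs_human"`.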


The AWS Infrastructure Stack

The AI logic runs somewhere, and that somewhere matters for cost, latency, and security.

Compute: ECS Fargate

Both the Next.js 15 web frontend and the LangGraph agent backend run on AWS ECS Fargate: serverless containers, with no EC2 instances to manage. This was a deliberate choice. We wanted to avoid the maintenance burden of long-running servers while still supporting long-lived connections for streaming agent outputs.

The LangGraph agent streams its execution state via server-sent events (SSE) to the Next.js API routes, which then push updates to the browser in real time. Users watch their agent think, plan, and execute — step by step.
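For readers who have not worked with SSE, each update on the wire is just a small text frame. Here is a minimal sketch of how per-node updates could be framed; the event names and the `sse_event` and `stream_updates` helpers are illustrative assumptions rather than the platform's actual code.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def sse_event(event: str, data: Dict[str, object]) -> str:
    """Format one server-sent event frame: an event name, a data line, a blank line."""
    if "\n" in event or "\r" in event:
        raise ValueError("event name must be a single line")
    # Compact JSON never contains raw newlines, so one data line is enough.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_updates(updates: Iterable[Tuple[str, Dict[str, object]]]) -> Iterator[str]:
    """Turn (node_name, state_update) pairs into SSE frames as they arrive."""
    for node, update in updates:
        yield sse_event(node, update)
```

Because `stream_updates` is a generator, frames can be written to the response as each node finishes instead of being buffered until the run ends.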

AI: Amazon Bedrock

We use Claude Sonnet 4.5 for reasoning and planning, called through LangChain's ChatBedrockConverse class, which wraps the Amazon Bedrock Converse API. The Bedrock integration was remarkably straightforward with LangChain:

python

from langchain_aws import ChatBedrockConverse

# Model ID as written in this article; confirm the exact model ID or
# inference profile available in your region in the Bedrock console.
llm = ChatBedrockConverse(
    model="anthropic.claude-sonnet-4-5",
    region_name="ap-south-1",
    temperature=0,
)

One practical lesson: set temperature=0 for DevOps agents. When running AWS commands, you want reasoning that is as deterministic and reproducible as the model allows, not creative variation.

For vector search, we use Amazon Titan Embeddings v2 to power semantic search over the resource inventory.

State: DynamoDB (Single Table Design)

Agent state persistence is critical. We use DynamoDB with a Single Table Design to store:

  • LangGraph Checkpoints: The full agent state at every node. If a session drops, the agent resumes from its last checkpoint, not from scratch.
  • Agent Conversations: Full history of every conversation, accessible for audit and re-engagement.
  • App Table: User RBAC, account configurations, schedule definitions.
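To make the single-table idea concrete, here is a small sketch with an invented key scheme. The `pk`/`sk` layout, the `THREAD#` and `CHECKPOINT#` prefixes, and every helper name are assumptions for illustration only; an in-memory list stands in for DynamoDB so the access pattern is easy to see.

```python
from typing import Dict, List, Optional

Item = Dict[str, str]


def checkpoint_item(thread_id: str, checkpoint_id: str, state_json: str) -> Item:
    """Checkpoint rows share the thread's partition key (hypothetical layout)."""
    return {"pk": f"THREAD#{thread_id}", "sk": f"CHECKPOINT#{checkpoint_id}", "state": state_json}


def message_item(thread_id: str, timestamp: str, text: str) -> Item:
    """Conversation rows live in the same partition under a different sort-key prefix."""
    return {"pk": f"THREAD#{thread_id}", "sk": f"MESSAGE#{timestamp}", "text": text}


def query(items: List[Item], pk: str, sk_prefix: str) -> List[Item]:
    """In-memory stand-in for a DynamoDB Query using begins_with on the sort key."""
    matches = [i for i in items if i["pk"] == pk and i["sk"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["sk"])


def latest_checkpoint(items: List[Item], thread_id: str) -> Optional[Item]:
    """Resume point for a dropped session: the highest-sorting checkpoint, if any."""
    found = query(items, f"THREAD#{thread_id}", "CHECKPOINT#")
    return found[-1] if found else None
```

This only works if checkpoint IDs sort lexicographically in creation order, for example zero-padded counters or ISO 8601 timestamps.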

The LangGraph DynamoDB checkpoint saver integrates cleanly:

python

# Import path as given in this article; it depends on which DynamoDB
# checkpointer package you install, so check that package's docs.
from langgraph.checkpoint.dynamodb import DynamoDBSaver

checkpointer = DynamoDBSaver(table_name="langraph-checkpoints")
graph = agent_workflow.compile(checkpointer=checkpointer)

Data Lake: Amazon S3 Tables (Apache Iceberg)

We perform continuous resource discovery across all connected AWS accounts — EC2 instances, RDS databases, ECS services, Lambda functions, VPCs. This inventory is stored in Amazon S3 Tables using Apache Iceberg format, giving us time-travel queries and schema evolution for free.

When an engineer asks "what's the oldest RDS instance across all our accounts?", the agent doesn't need to make live API calls — it queries the pre-built Iceberg inventory, which is refreshed by a scheduled Discovery Lambda.


Security Architecture: Hub-and-Spoke Cross-Account Access

This section deserves its own article, but the short version is: no permanent credentials, ever.


                  ┌──────────────────────────────┐
                  │   Nucleus Hub Account        │
                  │   ┌──────────────────────┐   │
                  │   │  Agent (ECS)         │   │
                  │   │  Task Role           │   │
                  └───┼──────────────────────┼───┘
                      │                      │
            sts:AssumeRole              sts:AssumeRole
                      │                      │
           ┌──────────▼───┐       ┌──────────▼───┐
           │ Account A    │       │ Account B    │
           │ (Non-Prod)   │       │ (Production) │
           │ ReadOnlyRole │       │ DevOpsRole   │
           └──────────────┘       └──────────────┘

When the engineer selects an account in the UI, the get_aws_credentials tool performs an STS AssumeRole call at execution time. The resulting temporary credentials (valid for 1 hour) are injected into the subprocess environment for that execution context only.
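A sketch of that flow using boto3's STS client is shown below. The helper names (`assume_role_credentials`, `build_subprocess_env`) are assumptions, not the real `get_aws_credentials` implementation; only the `assume_role` call and the standard AWS environment variable names come from AWS itself.

```python
import os
from typing import Dict, Mapping, Optional


def assume_role_credentials(role_arn: str, session_name: str) -> Mapping[str, object]:
    """Fetch one-hour temporary credentials for the target account's role."""
    import boto3  # imported lazily so the helper below works without boto3 installed

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,  # matches the 1-hour validity described above
    )
    return response["Credentials"]


def build_subprocess_env(
    credentials: Mapping[str, object],
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build an environment for one CLI invocation that carries only the temporary keys."""
    env = dict(os.environ if base_env is None else base_env)
    # Drop any inherited profile or keys so the subprocess cannot fall back to them.
    for name in ("AWS_PROFILE", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
        env.pop(name, None)
    env["AWS_ACCESS_KEY_ID"] = str(credentials["AccessKeyId"])
    env["AWS_SECRET_ACCESS_KEY"] = str(credentials["SecretAccessKey"])
    env["AWS_SESSION_TOKEN"] = str(credentials["SessionToken"])
    return env
```

The resulting dictionary would be passed as `env=` to `subprocess.run` for a single command, so the credentials never touch disk or the parent process environment.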

The cross-account roles are provisioned from auto-generated CloudFormation templates and enforce service-specific, least-privilege IAM policies. A "Read-Only" session cannot invoke ec2:StopInstances, even if the Executor tries to call it.


Real-World Results

Here's what this system actually does in production for our team:

| Task | Before (Manual) | After (Nucleus AI Agent) |
| --- | --- | --- |
| Monthly cost analysis report (4 accounts) | 2.5 hours | 4 minutes |
| Lambda function audit + timeout diagnosis | 45 minutes | 6 minutes |
| S3 public access security sweep | 1 hour | 3 minutes |
| Nightly non-prod environment shutdown | Cron script (fragile) | AI-governed schedule |
| ElastiCache scaling investigation | 90 minutes of log diving | 8 minutes |

The agent doesn't replace the engineer — it eliminates the toil so the engineer can focus on the decision, not the data gathering.


What I'd Do Differently

Honesty section. Not everything was smooth:

  1. Start with tool boundaries. We spent weeks debugging agent behavior before realizing the issue was too many overlapping tools. Define crisp, single-responsibility tools and the LLM's planning quality improves dramatically.
  2. Streaming is non-negotiable. A DevOps agent running a 4-minute task with no output is terrifying. I cannot overstate how much better the user experience is when you stream every thought, tool call, and intermediate result to the UI in real time.
  3. The Reflector prompt is your most important asset. The quality of your self-correction loop is 100% determined by how adversarially you craft the Reflector's system prompt. Be vicious. It will surface more real issues than you expect.
  4. Rate limiting is a real operational concern. AWS API rate limits will hit you during large-scale discovery operations. Build exponential backoff into your tool implementations from day one.
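For the backoff advice in point 4, a generic retry wrapper with exponential backoff and full jitter might look like this. It is a sketch under assumptions: `ThrottledError` stands in for whatever throttling signal your tool layer raises, and the delay values are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Placeholder for a rate-limit error raised by a tool implementation."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a throttled operation, doubling the delay ceiling after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller (or the Reflector) see it
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait up to the ceiling spreads out retries.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

The AWS SDKs and CLI also ship their own configurable retry modes, so a wrapper like this is mainly useful around tools that do not already retry.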

Getting Started

The full source code for Nucleus Cloud Ops is open source. Everything described in this article — the LangGraph agent, the Next.js interface, the AWS CDK infrastructure, the cross-account security model — is available to explore, fork, and contribute to:

🐙 github.com/kartikmanimuthu/nucleus-cloud-ops

If this architecture interests you, here's your starter kit reading list alongside the repo:


🌍 Open Source — Contribute & Build Together

Nucleus Cloud Ops is fully open source and available on GitHub:

github.com/kartikmanimuthu/nucleus-cloud-ops

Whether you want to:

  • ⭐ Star the repo to follow progress
  • 🐛 Open an issue to report a bug or suggest a feature
  • 🔧 Submit a PR to add a new Agent Skill, MCP server integration, or AWS tool
  • 📖 Improve the docs to help others get started faster

...your contribution is welcome. The cloud operations problem space is vast, and the best solutions will be built by a community, not any single team. Let's build it together.


Final Thoughts

The shift from "LLM wrapper" to "autonomous AI agent" is not a prompt engineering problem. It's an architecture problem. The reflection pattern, stateful graph execution, persistent checkpointing, and ironclad cross-account security are what separate a production-ready AI Ops platform from a polished demo.

We're at the very beginning of what Agentic AI will do for infrastructure management. The teams that invest in the right architecture now will be the ones operating at 10x productivity when the next generation of models arrives.

If you have questions about any part of this system, drop them in the comments. I read every one.


Kartik Manimuthu is an architect building GenAI DevOps tooling on AWS. Follow for more articles on Cloud Architecture, LangGraph, and enterprise AI systems.

#AWS #LangGraph #AgenticAI #DevOps #CloudArchitecture #AmazonBedrock #GenAI
