DEV Community

POTHURAJU JAYAKRISHNA YADAV

I Replaced 47 DevOps Scripts With One AI Agent — Here’s What Happened

The Hook: I Was Wrong About Automation

I thought I automated DevOps.

I had 47 deployment scripts.

Then I started replacing them with an AI agent —
and most scripts became unnecessary.

Not by following instructions.
By making decisions.

And the 2 AM debugging stopped.


Note: The code shown here is simplified for clarity.
The GitHub repo contains a more modular implementation.

That's when I realized:

I wasn't automating.
I was hardcoding decisions.


I gave Claude a list of AWS tools and one instruction: "Deploy this app."

No hardcoded logic.
No decision trees.

Just: describe the goal, Claude figures out how.

Deployment time: 3 hours → minutes for most cases.

(Built this while deploying real workloads on AWS: Docker, ECS, EC2, IAM.)

🎯 Is This For You?

This post is for:

  • DevOps engineers with 10+ deployment scripts
  • Platform engineers building internal developer platforms
  • Anyone exploring AI agents beyond chatbots
  • Teams on AWS (Docker, EC2, ECS, IAM)

If manual deployment takes > 30 minutes, read this.


✨ What's Possible

Old way (manual):

aws ecs create-cluster --cluster-name prod
aws ecs register-task-definition --family myapp ...
# [50+ commands, 3 hours, manual debugging]

New way (agent):

agent.run("Deploy FastAPI with PostgreSQL, auto-scale to 20, "
          "minimal IAM perms, CloudWatch monitoring")

# ✅ Done in minutes for most cases, with significantly reduced debugging

No scripts.

Just Claude thinking out loud about your infrastructure.


🚀 How It Actually Works

You describe the goal (natural language):

"Deploy my app on 5 ECS tasks, auto-scale to 20 on high CPU"

Claude gets these tools:

dispatcher = {
    "ecs__create_cluster": ecs.create_cluster,
    "ecs__register_task": ecs.register_task_definition,
    "ecs__create_service": ecs.create_service,
    "iam__create_role": iam.create_role,  # For permissions
}

Claude reasons autonomously:

"User wants ECS deployment.
 I need to:
 1. Check if cluster exists
 2. Register task definition
 3. Create service with 5 tasks
 4. Setup auto-scaling
 5. Verify it's running"

It executes:

  1. Call ecs__create_cluster() → Get cluster ARN
  2. Call ecs__register_task() → Get task definition
  3. Call ecs__create_service() → Get running tasks
  4. Call ecs__setup_autoscaling() → Confirmed
  5. Return: "All 5 tasks running, auto-scaling 5-20"

**Minimal manual intervention in most cases: just reasoning, execution, and feedback loops.**


⚠️ The Problem I Started With

I had:

  • deploy_docker.py — 150 lines of Docker logic
  • deploy_ec2.py — 200 lines of EC2 logic
  • deploy_ecs.py — 300 lines of ECS logic
  • 3 routers trying to chain them together
  • 0 ways to handle "deploy to both EC2 AND ECS"

Every new service = new script. Every new workflow = rewrite everything.


Solution: Reduce rigid scripts — let the agent handle orchestration logic dynamically.

Instead of routing to specific tools, give Claude all available tools and let it decide which ones to use, in which order, adapting as it goes.

# All tools in one place
tools = {
    "docker__run": run_container,
    "ec2__create": create_instance,
    "ecs__deploy": create_service,
    "iam__create_role": create_role,
    # ... more tools
}

# Let Claude orchestrate
agent.run("Deploy with Docker locally, then scale to ECS production")
# Claude figures out: Docker first, then ECS, then IAM for permissions

🏗️ The Architecture (3 Layers)

┌──────────────────────────────┐
│ User Goal (natural language) │  
│ "Deploy scalable production  │
│  stack with auto-scaling"    │
└──────────┬───────────────────┘
           │
           ▼
    ┌─────────────────────────────────────────────┐
    │  Claude (Bedrock)                           │
    │  ← Reads goal + available tools             │
    │  ← Decides sequence of actions              │
    │  ← Adapts when things fail                  │
    └──────────────┬────────────────────────────┘
                   │
    ┌──────────────┴──────────────┐
    │                             │
    ▼                             ▼
┌─────────────┐  ┌──────────────────┐
│ AWS APIs    │  │ Conversation     │
│ (via boto3) │  │ Memory (DynamoDB)│
│ ← Executes  │  │ ← Recalls setup  │
│   decisions │  │   from last week │
└─────────────┘  └──────────────────┘
    │                    │
    └────────┬───────────┘
             │
    ┌────────▼─────────┐
    │ Actual Resources │
    │ EC2, ECS, Docker │
    │      IAM, etc    │
    └──────────────────┘

3 moving parts:

  1. Claude — Reasons about the task
  2. Tools — Execute AWS API calls
  3. Memory — Remember past deployments for coherence

🔧 The Setup (Code Foundation)

Here's the entire base system in ~100 lines. Full code on GitHub.

import boto3
from agents.memory import load_history, save_message

MODEL = "apac.amazon.nova-lite-v1:0"  # Bedrock model ID (fast, cheap; swap in a Claude model ID if you prefer)

class BaseAgent:
    """Foundation for all agents (Docker, EC2, ECS, IAM)"""
    AGENT_KEY = None  # "docker", "ec2", etc.
    SYSTEM_PROMPT = ""  # Agent's personality
    CAPABILITIES = []  # What this agent can do

    def __init__(self, name: str, region: str = "us-east-1"):
        self.name = name
        self.session = boto3.Session(region_name=region)
        self._bedrock = None

    @property
    def bedrock(self):
        """Lazy-load the Bedrock client (only when needed)"""
        if not self._bedrock:
            self._bedrock = self.session.client("bedrock-runtime")
        return self._bedrock

    def run(self, task: str, session_id: str = None) -> dict:
        """The agentic loop: think → decide → act → repeat"""
        tools = self.get_tools()
        dispatcher = self.get_dispatcher()
        messages = load_history(session_id) + [{"role": "user", "content": [{"text": task}]}]

        for _ in range(10):  # Max 10 iterations to prevent infinite loops
            # Ask the model what to do
            response = self.bedrock.converse(
                modelId=MODEL,
                system=[{"text": self.SYSTEM_PROMPT}],
                messages=messages,
                toolConfig={"tools": tools},
            )
            output_message = response["output"]["message"]

            # Is the model done?
            if response["stopReason"] == "end_turn":
                return {"status": "SUCCESS", "message": output_message["content"]}

            # Execute the tools the model wants to call
            tool_results = []
            for block in output_message["content"]:
                if "toolUse" not in block:
                    continue
                tool_use = block["toolUse"]
                try:
                    result = dispatcher[tool_use["name"]](tool_use["input"])
                except Exception as e:
                    result = {"error": str(e), "status": "failed"}

                tool_results.append({"toolUseId": tool_use["toolUseId"],
                                     "content": [{"json": result}]})

            # Add the model's decision + results to the conversation
            messages.append({"role": "assistant", "content": output_message["content"]})
            messages.append({"role": "user", "content": [{"toolResult": tr} for tr in tool_results]})

            # Save for future reference
            if session_id:
                save_message(session_id, messages[-1])

        return {"status": "FAILED", "reason": "Max iterations reached"}

What's happening:

  • Lazy loading (@property bedrock): Don't connect to Claude until needed
  • Tool loop: Call tools, capture results, show Claude the output
  • Error handling: Don't crash—tell Claude what failed, let it adapt
  • Memory: Save each turn so Claude remembers next week

🚀 Agents in Action: 3 Real Examples

DockerAgent: Deploy Locally

class DockerAgent(BaseAgent):
    AGENT_KEY = "docker"
    SYSTEM_PROMPT = """You manage Docker containers.
Rules: Pull image first, check if container exists, use sensible defaults."""

    CAPABILITIES = [
        {"name": "list_containers", "description": "List running containers"},
        {"name": "run_container", "description": "Pull and run image"},
        {"name": "stop_container", "description": "Stop a container"},
    ]

Usage:

docker_agent = DockerAgent()
docker_agent.run("Deploy FastAPI on port 8000 with health check")
# Claude calls: list_containers → run_container → confirms it's running
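
The base class calls `self.get_tools()`, which isn't shown here. One way to derive the Bedrock Converse tool specs from `CAPABILITIES` is a plain data transformation (a sketch; the helper name is mine, and the repo's version should carry real input schemas per tool):

```python
def build_tool_specs(agent_key: str, capabilities: list) -> list:
    """Turn a CAPABILITIES list into Bedrock Converse toolSpec entries.

    Sketch only: the placeholder inputSchema accepts anything; real tools
    should declare the exact JSON schema of their parameters.
    """
    return [
        {
            "toolSpec": {
                "name": f"{agent_key}__{cap['name']}",
                "description": cap["description"],
                "inputSchema": {"json": {"type": "object", "properties": {}}},
            }
        }
        for cap in capabilities
    ]
```

A subclass's `get_tools()` could then simply return `build_tool_specs(self.AGENT_KEY, self.CAPABILITIES)`.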

EC2Agent: Scale to Cloud

class EC2Agent(BaseAgent):
    AGENT_KEY = "ec2"
    SYSTEM_PROMPT = """You manage EC2 instances.
Rules: Tag for organization, verify security groups, create new→test→retire old."""

    CAPABILITIES = [
        {"name": "describe_instances", "description": "List instances"},
        {"name": "create_instance", "description": "Launch new instance"},
        {"name": "stop_instance", "description": "Stop instance"},
    ]

Usage:

ec2_agent = EC2Agent()
ec2_agent.run("Create 2 t2.micro instances, tag as app-server")
# Claude: checks if instances exist → creates 2 → returns IPs → confirms running

ECSAgent: Production Auto-Scaling

class ECSAgent(BaseAgent):
    AGENT_KEY = "ecs"
    SYSTEM_PROMPT = """You manage ECS services at production scale.
Rules: Task definition first, use Fargate, always configure auto-scaling."""

    CAPABILITIES = [
        {"name": "create_cluster", "description": "Create ECS cluster"},
        {"name": "register_task_definition", "description": "Register blueprint"},
        {"name": "create_service", "description": "Deploy service"},
        {"name": "setup_autoscaling", "description": "Configure scaling rules"},
    ]

Usage:

ecs_agent = ECSAgent()
ecs_agent.run("Deploy with 5 tasks, auto-scale to 20 on high CPU, health monitoring")
# Claude orchestrates: cluster → task definition → service → autoscaling → verification

Real Usage: From Local to Production

Scenario: Deploy FastAPI from laptop to production in minutes for most cases, with significantly reduced debugging.

Stage 1: Test locally

docker_agent.run("Run myapp:latest on port 8000", session_id="prod_001")
# ✅ Docker container running

Stage 2: Scale to cloud

ec2_agent.run("Create 2 instances for load balancing", session_id="prod_001")
# ✅ 2 EC2 instances up (same session = Claude remembers port 8000)

Stage 3: Auto-scaling production

ecs_agent.run("Deploy with 5-20 auto-scaling, CloudWatch monitoring", session_id="prod_001")
# ✅ 5 ECS tasks running, auto-scaling 5-20 based on CPU

All with the same session_id: Claude remembers the image name, port, and configuration from Stage 1. Everything connects seamlessly.


📚 Memory: Agents Remember

# Day 1
agent.run("Deploy on Docker with port 8000", session_id="user_123")

# Week later
agent.run("Move this to ECS for production", session_id="user_123")
# Claude reads history from DynamoDB:
# "I remember this app used port 8000. I'll keep that for ECS too."

DynamoDB stores every conversation turn. Claude recalls past decisions and uses them for coherence.


🛡️ Production-Ready Code Patterns

This is what reader feedback emphasized: real code needs safety.

Pattern 1: Error Handling (Don't Crash, Recover)

# ❌ Wrong: crashes on error
result = dispatcher[tool_name](tool_input)

# ✅ Right: tell Claude about the error
try:
    result = dispatcher[tool_name](tool_input)
except Exception as e:
    result = {
        "error": str(e),
        "status": "failed",
        "request_id": request_id,  # For debugging
    }
# Claude sees the error and tries a different approach

Why this matters: If Docker pull fails, don't give up. Tell Claude: "Pull failed, but let me check if the image is locally cached." Claude adapts.
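
Concretely, the failure has to travel back to the model as a `toolResult` block so the next `converse` call can see it. A minimal shape, per the Converse API's message format (the helper name is mine):

```python
def error_tool_result(tool_use_id: str, exc: Exception) -> dict:
    """Package a tool failure as a Converse toolResult block so the
    model sees what went wrong and can try another approach."""
    return {
        "toolResult": {
            "toolUseId": tool_use_id,
            "content": [{"json": {"error": str(exc), "status": "failed"}}],
            "status": "error",  # Signals that this tool call did not succeed
        }
    }
```

The block goes into the next `user` message; the model reads the error text and decides its next step.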


Pattern 2: Timeouts (Prevent Hanging)

from functools import wraps
import signal

import boto3

def with_timeout(seconds=30):
    """Prevent tools from running forever.
    Note: signal.alarm only works on Unix, in the main thread."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"Tool execution exceeded {seconds}s")

            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            except TimeoutError as e:
                return {"error": str(e), "status": "timeout"}
            finally:
                signal.alarm(0)  # Always cancel the pending alarm
        return wrapper
    return decorator

@with_timeout(seconds=60)
def create_instance(params):
    """If EC2 creation hangs, time out after 60 seconds"""
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**params)

Pattern 3: Input Validation (Prevent Bad Requests)

def create_instance(params):
    """Validate before executing"""
    # Check required fields
    required = ["ImageId", "InstanceType"]
    for field in required:
        if field not in params:
            return {"error": f"Missing required field: {field}", "status": "failed"}

    # Validate instance type
    valid_types = ["t2.micro", "t2.small", "t3.medium"]
    if params["InstanceType"] not in valid_types:
        return {
            "error": f"InstanceType must be one of {valid_types}",
            "status": "failed"
        }

    # Now safe to execute
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**params)

Pattern 4: Audit Logging (Proof for Compliance)

import json
from datetime import datetime, timezone

import boto3

def log_action(request_id: str, user: str, action: str, params: dict, result: dict):
    """Log everything for audits"""
    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "params": json.dumps(params),  # What was requested
        "result_status": result.get("status"),
        "result_error": result.get("error"),
    }

    # Save to CloudWatch Logs or DynamoDB
    table = boto3.resource("dynamodb").Table("agent_audit")
    table.put_item(Item=log_entry)

# In the tool dispatcher:
result = dispatcher[tool_name](tool_input)
log_action(request_id, user_id, tool_name, tool_input, result)

Pattern 5: Cost Guards (Don't Deploy Expensive Mistakes)

MONTHLY_BUDGET = 1000  # $1000/month limit

def estimate_cost(action: str, params: dict) -> float:
    """Estimate AWS cost before executing"""
    if action == "create_instance":
        instance_type = params.get("InstanceType", "t2.micro")
        hourly_rates = {
            "t2.micro": 0.012,
            "t2.small": 0.023,
            "t3.medium": 0.042,
        }
        months_running = 1
        return hourly_rates.get(instance_type, 0) * 730 * months_running

    elif action == "create_rds":
        # RDS: ~$400/month for 20GB
        return 400

    return 0

# In the tool loop:
estimated = estimate_cost(tool_name, tool_input)
if estimated > MONTHLY_BUDGET:
    return {
        "error": f"Cost ${estimated:.2f} exceeds budget ${MONTHLY_BUDGET}",
        "status": "rejected",
    }
# Only execute if under budget
result = dispatcher[tool_name](tool_input)

Pattern 6: Role-Based Access (Security Boundaries)

# Define who can do what
USER_PERMISSIONS = {
    "admin": ["create_instance", "delete_instance", "create_role", "delete_role"],
    "developer": ["create_instance", "stop_instance", "deploy_to_ecs"],
    "readonly": ["describe_instances", "list_containers", "get_logs"],
}

def check_permission(user_role: str, action: str) -> bool:
    """Prevent unauthorized actions"""
    return action in USER_PERMISSIONS.get(user_role, [])

# Before executing:
if not check_permission(user_role, tool_name):
    return {
        "error": f"User role '{user_role}' cannot perform '{tool_name}'",
        "status": "permission_denied",
    }
result = dispatcher[tool_name](tool_input)

✅ Why Agents > Scripts

|                | Scripts          | Agents                |
| -------------- | ---------------- | --------------------- |
| New workflow   | Rewrite code     | Claude adapts         |
| Error recovery | Crashes          | Tries alternatives    |
| Reasoning      | None (hardcoded) | Full decision log     |
| Maintenance    | Grows linearly   | One framework         |
| Learning       | Manual proof     | Automatic audit trail |

🎯 Getting Started

Just 3 commands:

git clone https://github.com/jayakrishnayadav24/ai-agents
cd ai-agents
pip install -r requirements.txt

Then try:

from agents.docker_agent import DockerAgent

agent = DockerAgent()
response = agent.run(
    "Deploy FastAPI app on port 8000 with health check"
)
print(response)

That's it. Claude handles the rest.


📚 What's Next (Part 2)

This post covered the concept. The GitHub repo includes:

  • ✅ Full working code (all agents)
  • ✅ DynamoDB setup for memory
  • ✅ Deploying agents to Lambda
  • ✅ Real production patterns (request tracing, cost estimation)
  • ✅ Demo with actual deployments

⚠️ Where This Breaks

  • Complex compliance environments (manual approvals still needed)
  • Cost estimation is approximate
  • Requires well-defined tools (garbage in → garbage out)

The Bottom Line

Before: 47 scripts, 3 hours/deployment, constant debugging

After: 1 agent framework, minutes per deployment for most cases, with significantly reduced debugging.

Why? Because Claude doesn't follow scripts. Claude plans and executes the steps based on the goal.

It adapts. It learns. It remembers.

That's not automation anymore. That's the future of infrastructure.


How many deployment scripts are you still maintaining?

10+? 20+? 50+?

I want to see how bad this problem is. Drop a comment 👇

Your feedback shapes Part 2.
