I Was Wrong About Automation
I thought I automated DevOps.
I had 47 deployment scripts.
Then I started replacing them with an AI agent —
and most scripts became unnecessary.
Not by following instructions.
By making decisions.
And the 2 AM debugging stopped.
Note: The code shown here is simplified for clarity.
The GitHub repo contains a more modular implementation.
That's when I realized:
I wasn't automating.
I was hardcoding decisions.
I gave Claude a list of AWS tools and one instruction: "Deploy this app."
No hardcoded logic.
No decision trees.
Just: describe the goal, Claude figures out how.
Deployment time: 3 hours → minutes in most cases.
(Built this while deploying real workloads on AWS: Docker, ECS, EC2, IAM.)
🎯 Is This For You?
This post is for:
- DevOps engineers with 10+ deployment scripts
- Platform engineers building internal developer platforms
- Anyone exploring AI agents beyond chatbots
- Teams on AWS (Docker, EC2, ECS, IAM)
If manual deployment takes > 30 minutes, read this.
✨ What's Possible
Old way (manual):
```bash
aws ecs create-cluster --cluster-name prod
aws ecs register-task-definition --family myapp ...
# [50+ commands, 3 hours, manual debugging]
```
New way (agent):
```python
agent.run("Deploy FastAPI with PostgreSQL, auto-scale to 20, "
          "minimal IAM perms, CloudWatch monitoring")
# ✅ Done in minutes for most cases, with significantly reduced debugging
```
No scripts.
Just Claude thinking out loud about your infrastructure.
🚀 How It Actually Works
You describe the goal (natural language):
"Deploy my app on 5 ECS tasks, auto-scale to 20 on high CPU"
Claude gets these tools:
```python
dispatcher = {
    "ecs__create_cluster": ecs.create_cluster,
    "ecs__register_task": ecs.register_task_definition,
    "ecs__create_service": ecs.create_service,
    "iam__create_role": iam.create_role,  # For permissions
}
```
Claude reasons autonomously:
"User wants ECS deployment.
I need to:
1. Check if cluster exists
2. Register task definition
3. Create service with 5 tasks
4. Setup auto-scaling
5. Verify it's running"
It executes:
- Call `ecs__create_cluster()` → Get cluster ARN
- Call `ecs__register_task()` → Get task definition
- Call `ecs__create_service()` → Get running tasks
- Call `ecs__setup_autoscaling()` → Confirmed
- Return: "All 5 tasks running, auto-scaling 5-20"

**Minimal manual intervention in most cases. Just reasoning + execution + feedback loops.**
⚠️ The Problem I Started With
I had:
- `deploy_docker.py`: 150 lines of Docker logic
- `deploy_ec2.py`: 200 lines of EC2 logic
- `deploy_ecs.py`: 300 lines of ECS logic
- 3 routers trying to chain them together
- No way to handle "deploy to both EC2 AND ECS"
Every new service = new script. Every new workflow = rewrite everything.
Solution: Reduce rigid scripts — let the agent handle orchestration logic dynamically.
Instead of routing to specific tools, give Claude all available tools and let it decide which ones to use, in which order, adapting as it goes.
```python
# All tools in one place
tools = {
    "docker__run": run_container,
    "ec2__create": create_instance,
    "ecs__deploy": create_service,
    "iam__create_role": create_role,
    # ... more tools
}

# Let Claude orchestrate
agent.run("Deploy with Docker locally, then scale to ECS production")
# Claude figures out: Docker first, then ECS, then IAM for permissions
```
🏗️ The Architecture (3 Layers)
```
┌──────────────────────────────┐
│ User Goal (natural language) │
│ "Deploy scalable production  │
│  stack with auto-scaling"    │
└──────────┬───────────────────┘
           │
           ▼
┌─────────────────────────────────────────────┐
│ Claude (Bedrock)                            │
│ ← Reads goal + available tools              │
│ ← Decides sequence of actions               │
│ ← Adapts when things fail                   │
└──────────────┬──────────────────────────────┘
               │
┌──────────────┴──────────────┐
│                             │
▼                             ▼
┌─────────────┐     ┌──────────────────┐
│ AWS APIs    │     │ Conversation     │
│ (via boto3) │     │ Memory (DynamoDB)│
│ ← Executes  │     │ ← Recalls setup  │
│   decisions │     │   from last week │
└─────────────┘     └──────────────────┘
        │                    │
        └────────┬───────────┘
                 │
       ┌─────────▼────────┐
       │ Actual Resources │
       │ EC2, ECS, Docker │
       │ IAM, etc         │
       └──────────────────┘
```
3 moving parts:
- Claude — Reasons about the task
- Tools — Execute AWS API calls
- Memory — Remember past deployments for coherence
🔧 The Setup (Code Foundation)
Here's the entire base system in ~100 lines. Full code on GitHub.
```python
import boto3

from agents.memory import load_history, save_message

MODEL = "apac.amazon.nova-lite-v1:0"  # Fast, cheap


class BaseAgent:
    """Foundation for all agents (Docker, EC2, ECS, IAM)"""

    AGENT_KEY = None      # "docker", "ec2", etc.
    SYSTEM_PROMPT = ""    # Agent's personality
    CAPABILITIES = []     # What this agent can do

    def __init__(self, name: str, region: str = "us-east-1"):
        self.name = name
        self.session = boto3.Session(region_name=region)
        self._bedrock = None

    @property
    def bedrock(self):
        """Lazy load Bedrock connection (only when needed)"""
        if not self._bedrock:
            self._bedrock = self.session.client("bedrock-runtime")
        return self._bedrock

    def run(self, task: str, session_id: str = None) -> dict:
        """The agentic loop: think → decide → act → repeat"""
        tools = self.get_tools()
        dispatcher = self.get_dispatcher()
        messages = load_history(session_id) + [{"role": "user", "content": [{"text": task}]}]

        for _ in range(10):  # Max 10 iterations to prevent infinite loops
            # Ask Claude what to do (Converse expects system as a block list
            # and tools wrapped in toolConfig)
            response = self.bedrock.converse(
                modelId=MODEL,
                system=[{"text": self.SYSTEM_PROMPT}],
                messages=messages,
                toolConfig={"tools": tools},
            )
            output = response["output"]["message"]

            # Is Claude done?
            if response["stopReason"] == "end_turn":
                return {"status": "SUCCESS", "message": output["content"]}

            # Execute tools Claude wants to call
            tool_results = []
            for block in output["content"]:
                if "toolUse" not in block:
                    continue  # Skip plain-text blocks
                tool_use = block["toolUse"]
                try:
                    result = dispatcher[tool_use["name"]](tool_use["input"])
                except Exception as e:
                    result = {"error": str(e), "status": "failed"}
                tool_results.append({"toolUseId": tool_use["toolUseId"],
                                     "content": [{"json": result}]})

            # Add Claude's decision + results to the conversation
            messages.append(output)
            messages.append({"role": "user",
                             "content": [{"toolResult": tr} for tr in tool_results]})

            # Save for future reference
            if session_id:
                save_message(session_id, messages[-1])

        return {"status": "FAILED", "reason": "Max iterations reached"}
```
What's happening:
- Lazy loading (`@property bedrock`): don't connect to Bedrock until needed
- Tool loop: call tools, capture results, show Claude the output
- Error handling: don't crash; tell Claude what failed and let it adapt
- Memory: save each turn so Claude remembers next week
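`run()` relies on `get_tools()` and `get_dispatcher()`, which the snippet above doesn't show. Here's a plausible sketch of both: convert each `CAPABILITIES` entry into the `toolSpec` shape the Bedrock Converse API expects, and map tool names onto same-named agent methods. The permissive input schema and the mixin name are my assumptions, not the repo's code.

```python
class BaseAgentToolsMixin:
    """Sketch of the two helpers BaseAgent.run() relies on (illustrative)."""

    CAPABILITIES = []

    def get_tools(self) -> list:
        """Bedrock Converse wants tools as toolSpec entries with a JSON schema."""
        return [
            {
                "toolSpec": {
                    "name": cap["name"],
                    "description": cap["description"],
                    # Permissive schema; a real agent would spell out parameters
                    "inputSchema": {"json": {"type": "object"}},
                }
            }
            for cap in self.CAPABILITIES
        ]

    def get_dispatcher(self) -> dict:
        """Map each capability name to a same-named method on the agent."""
        return {cap["name"]: getattr(self, cap["name"]) for cap in self.CAPABILITIES}
```

With this convention, adding a new capability is just a dict entry plus a method; no routing code changes.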
🚀 Agents in Action: 3 Real Examples
DockerAgent: Deploy Locally
```python
class DockerAgent(BaseAgent):
    AGENT_KEY = "docker"
    SYSTEM_PROMPT = """You manage Docker containers.
    Rules: Pull image first, check if container exists, use sensible defaults."""
    CAPABILITIES = [
        {"name": "list_containers", "description": "List running containers"},
        {"name": "run_container", "description": "Pull and run image"},
        {"name": "stop_container", "description": "Stop a container"},
    ]
```
Usage:
```python
docker_agent = DockerAgent()
docker_agent.run("Deploy FastAPI on port 8000 with health check")
# Claude calls: list_containers → run_container → confirms it's running
```
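The tools themselves are thin wrappers. A hedged sketch of what `run_container` might look like if it shells out to the `docker` CLI (the parameter names and the `build_run_command` helper are mine, kept pure so the command assembly is easy to test):

```python
import subprocess


def build_run_command(image: str, port: int = None, name: str = None) -> list:
    """Assemble a `docker run` invocation from the tool's inputs."""
    cmd = ["docker", "run", "-d"]
    if name:
        cmd += ["--name", name]
    if port:
        cmd += ["-p", f"{port}:{port}"]
    cmd.append(image)
    return cmd


def run_container(params: dict) -> dict:
    """Tool entry point: the agent passes a dict of inputs."""
    cmd = build_run_command(params["image"], params.get("port"), params.get("name"))
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the error to Claude instead of raising
        return {"error": proc.stderr.strip(), "status": "failed"}
    return {"status": "running", "container_id": proc.stdout.strip()}
```

Returning an error dict rather than raising keeps the agentic loop alive: Claude sees the failure and can retry with different inputs.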
EC2Agent: Scale to Cloud
```python
class EC2Agent(BaseAgent):
    AGENT_KEY = "ec2"
    SYSTEM_PROMPT = """You manage EC2 instances.
    Rules: Tag for organization, verify security groups, create new→test→retire old."""
    CAPABILITIES = [
        {"name": "describe_instances", "description": "List instances"},
        {"name": "create_instance", "description": "Launch new instance"},
        {"name": "stop_instance", "description": "Stop instance"},
    ]
```
Usage:
```python
ec2_agent = EC2Agent()
ec2_agent.run("Create 2 t2.micro instances, tag as app-server")
# Claude: checks if instances exist → creates 2 → returns IPs → confirms running
```
ECSAgent: Production Auto-Scaling
```python
class ECSAgent(BaseAgent):
    AGENT_KEY = "ecs"
    SYSTEM_PROMPT = """You manage ECS services at production scale.
    Rules: Task definition first, use Fargate, always configure auto-scaling."""
    CAPABILITIES = [
        {"name": "create_cluster", "description": "Create ECS cluster"},
        {"name": "register_task_definition", "description": "Register blueprint"},
        {"name": "create_service", "description": "Deploy service"},
        {"name": "setup_autoscaling", "description": "Configure scaling rules"},
    ]
```
Usage:
```python
ecs_agent = ECSAgent()
ecs_agent.run("Deploy with 5 tasks, auto-scale to 20 on high CPU, health monitoring")
# Claude orchestrates: cluster → task definition → service → autoscaling → verification
```
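Under the hood, `setup_autoscaling` maps onto two Application Auto Scaling calls: register the ECS service as a scalable target, then attach a target-tracking policy on average CPU. A sketch under my assumptions (the tool-side parameter names, table of defaults, and the 70% CPU target are illustrative; the AWS API calls are real):

```python
def autoscaling_params(cluster: str, service: str, min_tasks: int, max_tasks: int) -> dict:
    """Build the scalable-target request for an ECS service (pure, testable)."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def setup_autoscaling(params: dict) -> dict:
    """Tool entry point: register the target, then track average CPU at 70%."""
    import boto3  # deferred so the pure builder needs no AWS setup

    aas = boto3.client("application-autoscaling")
    target = autoscaling_params(params["cluster"], params["service"],
                                params.get("min", 5), params.get("max", 20))
    aas.register_scalable_target(**target)
    aas.put_scaling_policy(
        PolicyName=f"{params['service']}-cpu",
        PolicyType="TargetTrackingScaling",
        ServiceNamespace=target["ServiceNamespace"],
        ResourceId=target["ResourceId"],
        ScalableDimension=target["ScalableDimension"],
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
        },
    )
    return {"status": "configured",
            "range": f"{target['MinCapacity']}-{target['MaxCapacity']}"}
```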
Real Usage: From Local to Production
Scenario: deploy FastAPI from laptop to production, in minutes for most cases and with far less debugging.
Stage 1: Test locally
```python
docker_agent.run("Run myapp:latest on port 8000", session_id="prod_001")
# ✅ Docker container running
```
Stage 2: Scale to cloud
```python
ec2_agent.run("Create 2 instances for load balancing", session_id="prod_001")
# ✅ 2 EC2 instances up (same session = Claude remembers port 8000)
```
Stage 3: Auto-scaling production
```python
ecs_agent.run("Deploy with 5-20 auto-scaling, CloudWatch monitoring", session_id="prod_001")
# ✅ 5 ECS tasks running, auto-scaling 5-20 based on CPU
```
All with the same session_id: Claude remembers the image name, port, and configuration from Stage 1. Everything connects seamlessly.
📚 Memory: Agents Remember
```python
# Day 1
agent.run("Deploy on Docker with port 8000", session_id="user_123")

# Week later
agent.run("Move this to ECS for production", session_id="user_123")

# Claude reads history from DynamoDB:
# "I remember this app used port 8000. I'll keep that for ECS too."
```
DynamoDB stores every conversation turn. Claude recalls past decisions and uses them for coherence.
🛡️ Production-Ready Code Patterns
This is the part reader feedback emphasized most: real code needs safety.
Pattern 1: Error Handling (Don't Crash, Recover)
```python
# ❌ Wrong: crashes on error
result = dispatcher[tool_name](tool_input)

# ✅ Right: tell Claude about the error
try:
    result = dispatcher[tool_name](tool_input)
except Exception as e:
    result = {
        "error": str(e),
        "status": "failed",
        "request_id": request_id,  # For debugging
    }
# Claude sees the error and tries a different approach
```
Why this matters: If Docker pull fails, don't give up. Tell Claude: "Pull failed, but let me check if the image is locally cached." Claude adapts.
Pattern 2: Timeouts (Prevent Hanging)
```python
import signal
from functools import wraps

import boto3


def with_timeout(seconds=30):
    """Prevent tools from running forever (SIGALRM: Unix main thread only)"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"Tool execution exceeded {seconds}s")
            signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            except TimeoutError as e:
                return {"error": str(e), "status": "timeout"}
            finally:
                signal.alarm(0)  # Cancel the pending alarm either way
        return wrapper
    return decorator


@with_timeout(seconds=60)
def create_instance(params):
    """If EC2 creation hangs, timeout after 60 seconds"""
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**params)
```
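One caveat: `signal.SIGALRM` only exists on Unix and only works on the main thread. A portable alternative (my sketch, not from the repo) waits on a worker thread instead; the trade-off is that a hung tool thread keeps running in the background after we stop waiting:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from functools import wraps


def with_thread_timeout(seconds=30):
    """Run the tool in a worker thread and bail out if it takes too long."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FutureTimeout:
                # The worker thread keeps running; we just stop waiting for it
                return {"error": f"Tool execution exceeded {seconds}s",
                        "status": "timeout"}
            finally:
                pool.shutdown(wait=False)  # Don't block on a hung thread
        return wrapper
    return decorator
```

This version works on Windows and inside multi-threaded servers, at the cost of not actually killing the slow call.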
Pattern 3: Input Validation (Prevent Bad Requests)
```python
def create_instance(params):
    """Validate before executing"""
    # Check required fields
    required = ["ImageId", "InstanceType"]
    for field in required:
        if field not in params:
            return {"error": f"Missing required field: {field}", "status": "failed"}

    # Validate instance type against an allowlist
    valid_types = ["t2.micro", "t2.small", "t3.medium"]
    if params["InstanceType"] not in valid_types:
        return {
            "error": f"InstanceType must be one of {valid_types}",
            "status": "failed",
        }

    # Now safe to execute
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**params)
```
Pattern 4: Audit Logging (Proof for Compliance)
```python
import json
from datetime import datetime, timezone

import boto3

audit_table = boto3.resource("dynamodb").Table("agent_audit")


def log_action(request_id: str, user: str, action: str, params: dict, result: dict):
    """Log everything for audits"""
    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "params": json.dumps(params),  # What was requested
        "result_status": result.get("status"),
        "result_error": result.get("error"),
    }
    # Save to DynamoDB (or ship to CloudWatch Logs instead)
    audit_table.put_item(Item=log_entry)


# In the tool dispatcher:
result = dispatcher[tool_name](tool_input)
log_action(request_id, user_id, tool_name, tool_input, result)
```
Pattern 5: Cost Guards (Don't Deploy Expensive Mistakes)
```python
MONTHLY_BUDGET = 1000  # $1000/month limit


def estimate_cost(action: str, params: dict) -> float:
    """Estimate AWS cost before executing"""
    if action == "create_instance":
        instance_type = params.get("InstanceType", "t2.micro")
        hourly_rates = {
            "t2.micro": 0.012,
            "t2.small": 0.023,
            "t3.medium": 0.042,
        }
        months_running = 1
        return hourly_rates.get(instance_type, 0) * 730 * months_running
    elif action == "create_rds":
        # RDS: ~$400/month for 20GB
        return 400
    return 0


# In the dispatcher:
estimated = estimate_cost(tool_name, tool_input)
if estimated > MONTHLY_BUDGET:
    return {
        "error": f"Cost ${estimated:.2f} exceeds budget ${MONTHLY_BUDGET}",
        "status": "rejected",
    }

# Only execute if under budget
result = dispatcher[tool_name](tool_input)
```
Pattern 6: Role-Based Access (Security Boundaries)
```python
# Define who can do what
USER_PERMISSIONS = {
    "admin": ["create_instance", "delete_instance", "create_role", "delete_role"],
    "developer": ["create_instance", "stop_instance", "deploy_to_ecs"],
    "readonly": ["describe_instances", "list_containers", "get_logs"],
}


def check_permission(user_role: str, action: str) -> bool:
    """Prevent unauthorized actions"""
    return action in USER_PERMISSIONS.get(user_role, [])


# Before executing:
if not check_permission(user_role, tool_name):
    return {
        "error": f"User role '{user_role}' cannot perform '{tool_name}'",
        "status": "permission_denied",
    }
result = dispatcher[tool_name](tool_input)
```
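Putting the patterns together: each guard is a check that can veto a tool call before it runs. A generic sketch (the `guarded_call` helper and its signature are mine, not the repo's) where guards return `None` to allow or an error dict to reject:

```python
def guarded_call(dispatcher: dict, tool_name: str, tool_input: dict,
                 guards: list, audit_log: list) -> dict:
    """Run each guard in order; first rejection wins. Then execute and log."""
    for guard in guards:
        rejection = guard(tool_name, tool_input)
        if rejection:
            return rejection  # e.g. permission_denied or budget rejected

    # Execute with error capture, as in Pattern 1
    try:
        result = dispatcher[tool_name](tool_input)
    except Exception as e:
        result = {"error": str(e), "status": "failed"}

    # Audit trail, as in Pattern 4 (appending to a list stands in for DynamoDB)
    audit_log.append({"action": tool_name, "status": result.get("status")})
    return result
```

Permission checks, validation, and cost guards all fit this shape, so the agentic loop only needs one call site.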
✅ Why Agents > Scripts
| | Scripts | Agents |
|---|---|---|
| New workflow | Rewrite code | Claude adapts |
| Error recovery | Crashes | Tries alternatives |
| Reasoning | None (hardcoded) | Full decision log |
| Maintenance | Grows linearly | One framework |
| Auditability | Manual proof | Automatic audit trail |
🎯 Getting Started
Just 3 commands:
```bash
git clone https://github.com/jayakrishnayadav24/ai-agents
cd ai-agents
pip install -r requirements.txt
```
Then try:
```python
from agents.docker_agent import DockerAgent

agent = DockerAgent()
response = agent.run(
    "Deploy FastAPI app on port 8000 with health check"
)
print(response)
```
That's it. Claude handles the rest.
📚 What's Next (Part 2)
This post covered the concept. The GitHub repo has:
- ✅ Full working code (all agents)
- ✅ DynamoDB setup for memory
- ✅ Deploying agents to Lambda
- ✅ Real production patterns (request tracing, cost estimation)
- ✅ Demo with actual deployments
⚠️ Where This Breaks
- Complex compliance environments (manual approvals still needed)
- Cost estimation is approximate
- Requires well-defined tools (garbage in → garbage out)
The Bottom Line
Before: 47 scripts, 3 hours per deployment, constant debugging.
After: 1 agent framework, minutes per deployment in most cases, far less debugging.
Why? Because Claude doesn't follow scripts. Claude plans and executes the steps based on the goal.
It adapts. It learns. It remembers.
That's not automation anymore. That's the future of infrastructure.
How many deployment scripts are you still maintaining?
10+? 20+? 50+?
I want to see how bad this problem is. Drop a comment 👇
Your feedback shapes Part 2.