The DevOps Agent started with a simple question: what if you could talk to your infrastructure like you talk to an AI — and have it actually execute safely?
Not plan generation. Not policy explanations. Real, audited, reversible infrastructure changes. Stop an EC2 instance. Resize it for cost. Tighten a security group. All from chat. All with human approval gates.
This post walks through how we built it, why the safety model matters, and what took us by surprise.
## The Problem We Solved
DevOps teams live in a purgatory:
- Terraform is powerful but requires code review, planning cycles, apply ceremonies
- Cloud consoles are click-heavy and error-prone at scale
- Scripts work but lack audit trails
- LLMs can suggest infrastructure but can't execute
The gap: between recommendation and action. A model can tell you "your t3.large is underutilized, downsize to t3.medium, save $X/month." But you still have to:
- Log into the console
- Stop the instance
- Modify the instance type
- Start it
- Monitor for issues
- Log the change somewhere
That's 10 minutes of manual work. For one instance. In a fleet of 200.
We built an agent that bridges this gap. You say "downsize my underutilized instances," it:
- Lists them
- Queries CloudWatch metrics
- Generates a change plan
- Waits for your approval
- Executes (stop → modify → start → monitor)
- Creates an audit record
- Offers rollback if something goes wrong
One message. One approval. Done.
## The Architecture: Three Layers
The agent isn't a single model. It's a three-layer stack with safety as the load-bearing wall.
### Layer 1: The Orchestrator (`src/agent/core.py`)
The `DevOpsAgent` class is the conductor. It:
- Takes a user message
- Decides which tools to call (from a registry of 728 tools across AWS, K8s, Azure, GCP, Terraform, Docker, Git, etc.)
- Streams the response back to the user
- Logs every decision
Here's the critical part: **it doesn't make decisions in a vacuum.** Every tool call is validated before execution.
```python
class DevOpsAgent:
    def _call_tool(self, tool_name, tool_input):
        # Before execution: validate
        validation = self.safety_validator.validate_tool_call(
            tool_name, tool_input
        )
        if not validation.is_safe:
            # Return to user with explanation
            return f"Cannot execute {tool_name}: {validation.reason}"
        if validation.requires_confirmation:
            # Queue for manual approval
            return self._create_change_request(tool_name, tool_input)
        # Safe to execute immediately
        result = execute_tool(tool_name, tool_input)
        self._audit_execution(tool_name, tool_input, result)
        return result
```
This is the whole game. Safety is checked before, not after.
### Layer 2: The Safety Validator (`src/agent/safety.py`)
This is where we say no.
Every tool is classified into three risk levels:
- LOW: Read operations, creates that don't touch production. No confirmation needed.
- MEDIUM: Updates, restarts, IAM changes, scale operations. Requires confirmation.
- HIGH: Deletes, terminates, destructive operations. Requires confirmation + escalates to CRITICAL on production workloads.
The validator does pattern matching on tool parameters. Same tool, different inputs, different risk:
```python
# manage_ec2_instance with action='start'     → LOW    (safe)
# manage_ec2_instance with action='stop'      → MEDIUM (requires confirmation)
# manage_ec2_instance with action='terminate' → HIGH   (critical)
```
This action-aware classification is the difference between a helpful tool and a footgun. Early versions classified the entire tool as HIGH risk, so every operation needed approval. Useless. Now, start/stop are gated appropriately.
Current classification:
- 728 total tools registered
- 42 tools explicitly classified as HIGH risk (delete/terminate operations)
- 36 tools explicitly classified as MEDIUM risk (updates/IAM/scale)
- 650 tools default to LOW risk (reads/creates)
- Auto-detection fallback: any tool whose name starts with `delete_`, `terminate_`, `destroy_`, `remove_`, `drop_`, or `purge_` is auto-classified as HIGH, even if not explicitly listed
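The classification rules above can be sketched as a small lookup: action-aware overrides first, explicit per-tool entries second, prefix auto-detection last. This is a minimal illustration — the table names and entries here are invented for the example, not the actual `src/agent/safety.py` internals:

```python
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Illustrative tables; the real registry covers 728 tools.
HIGH_PREFIXES = ("delete_", "terminate_", "destroy_", "remove_", "drop_", "purge_")
ACTION_RISK = {  # same tool, different inputs, different risk
    ("manage_ec2_instance", "start"): Risk.LOW,
    ("manage_ec2_instance", "stop"): Risk.MEDIUM,
    ("manage_ec2_instance", "terminate"): Risk.HIGH,
}
EXPLICIT_RISK = {"resize_ec2_instance": Risk.HIGH}

def classify(tool_name: str, tool_input: dict) -> Risk:
    # 1. Action-aware rules win.
    action = tool_input.get("action")
    if (tool_name, action) in ACTION_RISK:
        return ACTION_RISK[(tool_name, action)]
    # 2. Explicit per-tool classification.
    if tool_name in EXPLICIT_RISK:
        return EXPLICIT_RISK[tool_name]
    # 3. Fallback: destructive-sounding prefixes are always HIGH.
    if tool_name.startswith(HIGH_PREFIXES):
        return Risk.HIGH
    # 4. Everything else defaults to LOW (reads/creates).
    return Risk.LOW
```

The ordering matters: the prefix fallback is a safety net, so it runs last and can only escalate a tool that nothing else claimed.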
## The Tool Lifecycle: From Chat to Execution
When you type "downsize my ec2 instances that are under 20% CPU," here's what happens:
### Step 1: Message Validation (20ms)
Three gatekeepers run in parallel before any credit is deducted:
- Input Sanitizer — 10 regex rules checking for injection patterns; blocks roughly 5% of messages, mostly probing attempts.
- Haiku Pre-Screen — Claude Haiku in structured mode, classifies: prompt injection, malware request, data exfiltration, etc. ~99% accuracy, fails open on error.
- System Prompt Hardening — Fixed identity, no self-disclosure, refuse malware generation.
Violate any of these three times in an hour and your account locks for 60 minutes.
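The sanitizer stage is the cheapest of the three and can be sketched in a few lines. These patterns are an illustrative subset, not the production rule set:

```python
import re

# Illustrative subset of the ~10 sanitizer rules.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
    re.compile(r"<\s*script\b", re.I),
]

def sanitize(message: str) -> tuple:
    """Return (is_clean, matched_pattern). Runs before any credit is deducted."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(message):
            return False, pattern.pattern
    return True, None
```

Because it's pure regex, it adds single-digit milliseconds; the Haiku pre-screen only runs on messages that pass it.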
### Step 2: Tool Selection (100-500ms)
The agent (Claude Sonnet or Claude Opus, depending on tier) reads:
- Your message
- Your infrastructure context (what you have deployed, learned from prior syncs)
- The full tool registry (728 tools with descriptions + input schemas)
- Active skills (domain expertise injected based on message keywords)
It decides: which tools are relevant? In what order?
This is where the agent is creative and where it can hallucinate.
We catch this with tool schema validation. Every parameter is validated against the registered input schema before execution. Mismatches are returned to the user with a correction prompt.
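A hand-rolled sketch of that check, assuming a JSON-Schema-shaped `schema` dict per tool (the real registry uses full JSON Schema validation; `validate_params` is a name invented for this example):

```python
def validate_params(tool_input: dict, schema: dict) -> list:
    """Check a tool call against its registered input schema.
    Returns a list of error strings; empty means the call is well-formed."""
    errors = []
    props = schema.get("properties", {})
    # Required parameters the model forgot to supply.
    for name in schema.get("required", []):
        if name not in tool_input:
            errors.append(f"missing required parameter: {name}")
    for name, value in tool_input.items():
        if name not in props:
            # Likely hallucinated: the parameter doesn't exist on this tool.
            errors.append(f"unknown parameter: {name}")
        elif props[name].get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
    return errors
```

Any non-empty result is sent back to the model as a correction prompt instead of being executed.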
### Step 3: Safety Classification (10ms per tool)
For each tool call:
```python
tool_name = 'resize_ec2_instance'
tool_input = {'instance_id': 'i-0123456789abcdef0', 'new_instance_type': 't3.medium'}

# → SafetyValidator.validate_tool_call(tool_name, tool_input)
# → Classification: HIGH RISK (stops and modifies a running instance)
# → Action: queue for approval + create InfrastructureChange record
```
### Step 4: Change Request Creation (if needed)
For MEDIUM or HIGH risk operations, we don't execute. We create an InfrastructureChange record:
```python
change = InfrastructureChange(
    user_id=current_user.id,
    resource_type='ec2',
    resource_id='i-0123...',
    action_type='resize',
    status='pending_approval',
    change_details={
        'current_type': 't3.large',
        'new_type': 't3.medium',
        'estimated_savings': '$45/month'
    },
    created_by='agent'
)
db.session.add(change)
db.session.commit()
```
This record is immutable (audit trail), has rollback data captured before execution, and awaits human approval.
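"Immutable" here means status moves through a fixed state machine and history is append-only; nothing is ever updated in place. A minimal sketch of that lifecycle, with `ChangeRequest` standing in for the real SQLAlchemy model:

```python
from dataclasses import dataclass, field

# Legal status transitions; anything else raises.
VALID_TRANSITIONS = {
    "pending_approval": {"approved", "rejected"},
    "approved": {"executing"},
    "executing": {"completed", "failed"},
}

@dataclass
class ChangeRequest:  # simplified stand-in for InfrastructureChange
    action_type: str
    status: str = "pending_approval"
    history: list = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        # Append-only journal: an executed change can never "un-happen",
        # it can only be reversed by a new change request.
        if new_status not in VALID_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.history.append((self.status, new_status))
        self.status = new_status
```

A change can't jump from `pending_approval` straight to `completed`, which is exactly the property the approval gate depends on.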
### Step 5: Execution & Audit
When approved:
```python
ec2_client = get_user_boto_client(user_id, 'ec2', region)
ec2_client.stop_instances(InstanceIds=[instance_id])
ec2_client.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
ec2_client.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={'Value': 't3.medium'}
)
ec2_client.start_instances(InstanceIds=[instance_id])
# Wait until the instance is back up before reporting success
ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])
```
And we log every step:
```
[2026-04-15 14:32] EXECUTION_START - resize_ec2_instance
[2026-04-15 14:32] CHECKPOINT: instance stopped (i-0123...)
[2026-04-15 14:33] CHECKPOINT: instance type modified (t3.large → t3.medium)
[2026-04-15 14:35] CHECKPOINT: instance started
[2026-04-15 14:35] EXECUTION_SUCCESS - savings: $45/month
```
### Step 6: Rollback
Rollback data is captured before execution:
```python
rollback_data = {
    'instance_id': 'i-0123...',
    'original_type': 't3.large',
    'original_state': 'running'
}
```
One click reverts everything. All logged.
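Under the hood, "revert" just means turning the captured pre-change state into a brand-new change request that goes through the same approval and audit pipeline. A sketch, assuming the field names from the `rollback_data` example (the helper name is invented for illustration):

```python
def build_rollback_request(rollback_data: dict) -> dict:
    """Turn captured pre-change state into a new, reversing change request."""
    return {
        "resource_type": "ec2",
        "resource_id": rollback_data["instance_id"],
        "action_type": "resize",
        # Rollback is itself a change: it gets approved, executed, audited.
        "status": "pending_approval",
        "change_details": {
            "new_type": rollback_data["original_type"],
            "target_state": rollback_data["original_state"],
        },
        "created_by": "rollback",
    }
```

Because rollback is a first-class change request rather than a special code path, it inherits the audit trail and approval gates for free.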
## What We Got Wrong (And Fixed)
### v1: `manage_ec2_instance` Was LOW Risk Across All Actions
The bug: All actions (start, stop, terminate) were classified LOW risk. User could terminate a prod instance without confirmation.
The fix: Action-aware validation. Same tool, different risk based on parameters.
### v2: Sandbox Mode Fabricated Resources
The bug: Users in sandbox could see fake resources that didn't match reality. Confusion when they tried to execute the same operation in production.
The fix: Sandbox returns mock results with realistic data from their actual infrastructure. When in sandbox, real API calls still happen for reads, but writes are simulated.
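The read/write split can be sketched as a thin routing layer in front of the cloud client. This is a simplification under the assumption of boto3-style method names; `execute_in_sandbox` and `READ_PREFIXES` are names made up for the example:

```python
READ_PREFIXES = ("describe_", "list_", "get_")

def execute_in_sandbox(client, api_call: str, **kwargs):
    """Route reads to the real API so sandbox data matches reality;
    simulate writes so nothing actually changes."""
    if api_call.startswith(READ_PREFIXES):
        return getattr(client, api_call)(**kwargs)  # real read
    # Write: return a simulated result instead of calling the API.
    return {"simulated": True, "call": api_call, "params": kwargs}
```

This is why sandbox stopped fabricating resources: reads always reflect the user's actual infrastructure, so the only fiction is the write result.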
### v3: No Rollback Data Captured
The bug: After execution, if something went wrong, we had no way to revert.
The fix: Every change request captures rollback data before execution. Rollback creates a new change request that reverses the previous one.
### v4: Model Hallucinating Tool Results
The bug: Free-tier model (Kimi K2.5) would occasionally write [Executing tool...] in plain text instead of using function calling — fabricating results.
The fix: Two-tier hallucination detection. Strong patterns (fake execution markers) always blocked. Streaming buffer: accumulate response, check before yielding. Dedicated regex to detect fake markers vs real tool output.
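The streaming-buffer half of that fix can be sketched in a few lines. The patterns here are illustrative stand-ins for the production regexes:

```python
import re

# "Strong" patterns: text that looks like a fake tool execution marker.
FAKE_MARKERS = [
    re.compile(r"\[Executing tool[^\]]*\]", re.I),
    re.compile(r"\[Tool (result|output)[^\]]*\]", re.I),
]

def screen_chunk(buffer: str, chunk: str) -> tuple:
    """Accumulate streamed model output and flag fake execution markers
    before yielding the buffer to the user."""
    buffer += chunk
    blocked = any(p.search(buffer) for p in FAKE_MARKERS)
    return buffer, blocked
```

Checking the accumulated buffer rather than each chunk matters: a marker like `[Executing tool...]` can be split across two streaming chunks and would slip past a per-chunk check.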
## The Numbers
- 728 tools across 17 cloud/DevOps platforms
- 248 tests covering auth, billing, tools, safety, user isolation
- 3-layer guardrails — input sanitizer, Haiku pre-screen, system prompt hardening
- 0 unintended deletions in production use
- 0 credential leaks (Fernet encryption, per-user keys)
- 100% change audit trail (immutable journal entries)
## The Moment It Clicked
Six months in, a user executed this from chat:
Me: "My database is at 85% CPU and costs $400/month. What can I do?"
Agent: "I found 3 options:
1. Upgrade from db.t3.xlarge to db.r5.2xlarge (better performance, higher cost)
2. Reduce IOPS from 3000 to 1000 (saves $120/month, risk of throttling)
3. Enable read replicas and route read traffic there (saves $200/month, requires app changes)"
Me: "Let's do #3 and also downsize the original instance"
Agent: [Creates change request #1: Create read replica]
[Creates change request #2: Reduce primary IOPS]
[Creates change request #3: Downsize to db.t3.large]
Me: [Approves all 3]
Agent: "Complete. Your database now costs $180/month (was $400)."
That user went from "we need a DBA to optimize this" to "I did it in 5 minutes from chat."
That's the win.
## Try It
DevOps Agent is live at devopsagent.io.
Free tier: 20 credits/month, full agent access. No credit card needed.
You get:
- Chat with your infrastructure
- Change request approval workflow
- Full audit trail
- Sandbox mode for learning
- Rollback for any change
Questions or feedback? Open an issue or email hello@devopsagent.io.