The DevOps Agent started with a simple question: what if you could talk to your infrastructure like you talk to an AI — and have it actually execute safely?
Not plan generation. Not policy explanations. Real, audited, reversible infrastructure changes. Stop an EC2 instance. Resize it for cost. Tighten a security group. All from chat. All with human approval gates.
This post walks through how we built it, why the safety model matters, and what took us by surprise.
## The Problem We Solved
DevOps teams live in a purgatory:
- Terraform is powerful but requires code review, planning cycles, apply ceremonies
- Cloud consoles are click-heavy and error-prone at scale
- Scripts work but lack audit trails
- LLMs can suggest infrastructure but can't execute
The gap: between recommendation and action. A model can tell you "your t3.large is underutilized, downsize to t3.medium, save $X/month." But you still have to:
- Log into the console
- Stop the instance
- Modify the instance type
- Start it
- Monitor for issues
- Log the change somewhere
That's 10 minutes of manual work. For one instance. In a fleet of 200.
We built an agent that bridges this gap. You say "downsize my underutilized instances," it:
- Lists them
- Queries CloudWatch metrics
- Generates a change plan
- Waits for your approval
- Executes (stop → modify → start → monitor)
- Creates an audit record
- Offers rollback if something goes wrong
One message. One approval. Done.
## The Architecture: Three Layers
The agent isn't a single model. It's a three-layer stack with safety as the load-bearing wall.
### Layer 1: The Orchestrator (`src/agent/core.py`)
The `DevOpsAgent` class is the conductor. It:
- Takes a user message
- Decides which tools to call (from a registry of 728 tools across AWS, K8s, Azure, GCP, Terraform, Docker, Git, etc.)
- Streams the response back to the user
- Logs every decision
Here's the critical part: **it doesn't make decisions in a vacuum.** Every tool call is validated before execution.
```python
class DevOpsAgent:
    def _call_tool(self, tool_name, tool_input):
        # Before execution: validate
        validation = self.safety_validator.validate_tool_call(
            tool_name, tool_input
        )
        if not validation.is_safe:
            # Return to user with explanation
            return f"Cannot execute {tool_name}: {validation.reason}"
        if validation.requires_confirmation:
            # Queue for manual approval
            return self._create_change_request(tool_name, tool_input)
        # Safe to execute immediately
        result = execute_tool(tool_name, tool_input)
        self._audit_execution(tool_name, tool_input, result)
        return result
```
This is the whole game. Safety is checked before, not after.
### Layer 2: The Safety Validator (`src/agent/safety.py`)
This is where we say no.
Every tool is classified into three risk levels:
- LOW: Read operations, creates that don't touch production. No confirmation needed.
- MEDIUM: Updates, restarts, IAM changes, scale operations. Requires confirmation.
- HIGH: Deletes, terminates, destructive operations. Requires confirmation + escalates to CRITICAL on production workloads.
The validator does pattern matching on tool parameters. Same tool, different inputs, different risk:
```python
# manage_ec2_instance with action='start'     → LOW    (safe)
# manage_ec2_instance with action='stop'      → MEDIUM (requires confirmation)
# manage_ec2_instance with action='terminate' → HIGH   (critical)
```
This action-aware classification is the difference between a helpful tool and a footgun. Early versions classified the entire tool as HIGH risk, so every operation needed approval. Useless. Now, start/stop are gated appropriately.
Current classification:
- 728 total tools registered
- 42 tools explicitly classified as HIGH risk (delete/terminate operations)
- 36 tools explicitly classified as MEDIUM risk (updates/IAM/scale)
- 650 tools default to LOW risk (reads/creates)
- Auto-detection fallback: any tool whose name starts with `delete_`, `terminate_`, `destroy_`, `remove_`, `drop_`, or `purge_` is auto-classified as HIGH, even if not explicitly listed
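The classification rules above can be sketched as a small lookup: action-aware overrides first, explicit per-tool entries second, prefix auto-detection last. This is a minimal illustration — the table names and entries here are invented for the example, not the actual `src/agent/safety.py` internals:

```python
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Illustrative tables; the real registry covers 728 tools.
HIGH_PREFIXES = ("delete_", "terminate_", "destroy_", "remove_", "drop_", "purge_")
ACTION_RISK = {  # same tool, different inputs, different risk
    ("manage_ec2_instance", "start"): Risk.LOW,
    ("manage_ec2_instance", "stop"): Risk.MEDIUM,
    ("manage_ec2_instance", "terminate"): Risk.HIGH,
}
EXPLICIT_RISK = {"resize_ec2_instance": Risk.HIGH}

def classify(tool_name: str, tool_input: dict) -> Risk:
    # 1. Action-aware rules win.
    action = tool_input.get("action")
    if (tool_name, action) in ACTION_RISK:
        return ACTION_RISK[(tool_name, action)]
    # 2. Explicit per-tool classification.
    if tool_name in EXPLICIT_RISK:
        return EXPLICIT_RISK[tool_name]
    # 3. Fallback: destructive-sounding prefixes are always HIGH.
    if tool_name.startswith(HIGH_PREFIXES):
        return Risk.HIGH
    # 4. Everything else defaults to LOW (reads/creates).
    return Risk.LOW
```

The ordering matters: the prefix fallback is a safety net, so it runs last and can only escalate a tool that nothing else claimed.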
## The Tool Lifecycle: From Chat to Execution
When you type "downsize my ec2 instances that are under 20% CPU," here's what happens:
### Step 1: Message Validation (20ms)
Three gatekeepers run in parallel before any credit is deducted:
- Input Sanitizer — 10 regex rules checking for injection patterns; blocks roughly 5% of messages, mostly probing attempts.
- Haiku Pre-Screen — Claude Haiku in structured mode, classifies: prompt injection, malware request, data exfiltration, etc. ~99% accuracy, fails open on error.
- System Prompt Hardening — Fixed identity, no self-disclosure, refuse malware generation.
Violate any of these three times in an hour and your account locks for 60 minutes.
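The sanitizer stage is the cheapest of the three and can be sketched in a few lines. These patterns are an illustrative subset, not the production rule set:

```python
import re

# Illustrative subset of the ~10 sanitizer rules.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
    re.compile(r"<\s*script\b", re.I),
]

def sanitize(message: str) -> tuple:
    """Return (is_clean, matched_pattern). Runs before any credit is deducted."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(message):
            return False, pattern.pattern
    return True, None
```

Because it's pure regex, it adds single-digit milliseconds; the Haiku pre-screen only runs on messages that pass it.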
### Step 2: Tool Selection (100-500ms)
The agent (Claude Sonnet or Claude Opus, depending on tier) reads:
- Your message
- Your infrastructure context (what you have deployed, learned from prior syncs)
- The full tool registry (728 tools with descriptions + input schemas)
- Active skills (domain expertise injected based on message keywords)
It decides: which tools are relevant? In what order?
This is where the agent is creative and where it can hallucinate.
We catch this with tool schema validation. Every parameter is validated against the registered input schema before execution. Mismatches are returned to the user with a correction prompt.
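A hand-rolled sketch of that check, assuming a JSON-Schema-shaped `schema` dict per tool (the real registry uses full JSON Schema validation; `validate_params` is a name invented for this example):

```python
def validate_params(tool_input: dict, schema: dict) -> list:
    """Check a tool call against its registered input schema.
    Returns a list of error strings; empty means the call is well-formed."""
    errors = []
    props = schema.get("properties", {})
    # Required parameters the model forgot to supply.
    for name in schema.get("required", []):
        if name not in tool_input:
            errors.append(f"missing required parameter: {name}")
    for name, value in tool_input.items():
        if name not in props:
            # Likely hallucinated: the parameter doesn't exist on this tool.
            errors.append(f"unknown parameter: {name}")
        elif props[name].get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
    return errors
```

Any non-empty result is sent back to the model as a correction prompt instead of being executed.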
### Step 3: Safety Classification (10ms per tool)
For each tool call:
```python
tool_name = 'resize_ec2_instance'
tool_input = {'instance_id': 'i-0123456789abcdef0', 'new_instance_type': 't3.medium'}

# → SafetyValidator.validate_tool_call(tool_name, tool_input)
# → Classification: HIGH RISK (stops and modifies a running instance)
# → Action: queue for approval + create InfrastructureChange record
```
### Step 4: Change Request Creation (if needed)
For MEDIUM or HIGH risk operations, we don't execute. We create an InfrastructureChange record:
```python
change = InfrastructureChange(
    user_id=current_user.id,
    resource_type='ec2',
    resource_id='i-0123...',
    action_type='resize',
    status='pending_approval',
    change_details={
        'current_type': 't3.large',
        'new_type': 't3.medium',
        'estimated_savings': '$45/month'
    },
    created_by='agent'
)
db.session.add(change)
db.session.commit()
```
This record is immutable (audit trail), has rollback data captured before execution, and awaits human approval.
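"Immutable" here means status moves through a fixed state machine and history is append-only; nothing is ever updated in place. A minimal sketch of that lifecycle, with `ChangeRequest` standing in for the real SQLAlchemy model:

```python
from dataclasses import dataclass, field

# Legal status transitions; anything else raises.
VALID_TRANSITIONS = {
    "pending_approval": {"approved", "rejected"},
    "approved": {"executing"},
    "executing": {"completed", "failed"},
}

@dataclass
class ChangeRequest:  # simplified stand-in for InfrastructureChange
    action_type: str
    status: str = "pending_approval"
    history: list = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        # Append-only journal: an executed change can never "un-happen",
        # it can only be reversed by a new change request.
        if new_status not in VALID_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.history.append((self.status, new_status))
        self.status = new_status
```

A change can't jump from `pending_approval` straight to `completed`, which is exactly the property the approval gate depends on.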
### Step 5: Execution & Audit
When approved:
```python
ec2_client = get_user_boto_client(user_id, 'ec2', region)
ec2_client.stop_instances(InstanceIds=[instance_id])
ec2_client.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
ec2_client.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={'Value': 't3.medium'}
)
ec2_client.start_instances(InstanceIds=[instance_id])
# Wait until the instance is back up before reporting success
ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])
```
And we log every step:
```
[2026-04-15 14:32] EXECUTION_START - resize_ec2_instance
[2026-04-15 14:32] CHECKPOINT: instance stopped (i-0123...)
[2026-04-15 14:33] CHECKPOINT: instance type modified (t3.large → t3.medium)
[2026-04-15 14:35] CHECKPOINT: instance started
[2026-04-15 14:35] EXECUTION_SUCCESS - savings: $45/month
```
### Step 6: Rollback
Rollback data is captured before execution:
```python
rollback_data = {
    'instance_id': 'i-0123...',
    'original_type': 't3.large',
    'original_state': 'running'
}
```
One click reverts everything. All logged.
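Under the hood, "revert" just means turning the captured pre-change state into a brand-new change request that goes through the same approval and audit pipeline. A sketch, assuming the field names from the `rollback_data` example (the helper name is invented for illustration):

```python
def build_rollback_request(rollback_data: dict) -> dict:
    """Turn captured pre-change state into a new, reversing change request."""
    return {
        "resource_type": "ec2",
        "resource_id": rollback_data["instance_id"],
        "action_type": "resize",
        # Rollback is itself a change: it gets approved, executed, audited.
        "status": "pending_approval",
        "change_details": {
            "new_type": rollback_data["original_type"],
            "target_state": rollback_data["original_state"],
        },
        "created_by": "rollback",
    }
```

Because rollback is a first-class change request rather than a special code path, it inherits the audit trail and approval gates for free.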
## What We Got Wrong (And Fixed)
### v1: `manage_ec2_instance` Was LOW Risk Across All Actions
The bug: All actions (start, stop, terminate) were classified LOW risk. User could terminate a prod instance without confirmation.
The fix: Action-aware validation. Same tool, different risk based on parameters.
### v2: Sandbox Mode Fabricated Resources
The bug: Users in sandbox could see fake resources that didn't match reality. Confusion when they tried to execute the same operation in production.
The fix: Sandbox returns mock results with realistic data from their actual infrastructure. When in sandbox, real API calls still happen for reads, but writes are simulated.
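The read/write split can be sketched as a thin routing layer in front of the cloud client. This is a simplification under the assumption of boto3-style method names; `execute_in_sandbox` and `READ_PREFIXES` are names made up for the example:

```python
READ_PREFIXES = ("describe_", "list_", "get_")

def execute_in_sandbox(client, api_call: str, **kwargs):
    """Route reads to the real API so sandbox data matches reality;
    simulate writes so nothing actually changes."""
    if api_call.startswith(READ_PREFIXES):
        return getattr(client, api_call)(**kwargs)  # real read
    # Write: return a simulated result instead of calling the API.
    return {"simulated": True, "call": api_call, "params": kwargs}
```

This is why sandbox stopped fabricating resources: reads always reflect the user's actual infrastructure, so the only fiction is the write result.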
### v3: No Rollback Data Captured
The bug: After execution, if something went wrong, we had no way to revert.
The fix: Every change request captures rollback data before execution. Rollback creates a new change request that reverses the previous one.
### v4: Model Hallucinating Tool Results
The bug: Free-tier model (Kimi K2.5) would occasionally write [Executing tool...] in plain text instead of using function calling — fabricating results.
The fix: Two-tier hallucination detection. Strong patterns (fake execution markers) always blocked. Streaming buffer: accumulate response, check before yielding. Dedicated regex to detect fake markers vs real tool output.
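The streaming-buffer half of that fix can be sketched in a few lines. The patterns here are illustrative stand-ins for the production regexes:

```python
import re

# "Strong" patterns: text that looks like a fake tool execution marker.
FAKE_MARKERS = [
    re.compile(r"\[Executing tool[^\]]*\]", re.I),
    re.compile(r"\[Tool (result|output)[^\]]*\]", re.I),
]

def screen_chunk(buffer: str, chunk: str) -> tuple:
    """Accumulate streamed model output and flag fake execution markers
    before yielding the buffer to the user."""
    buffer += chunk
    blocked = any(p.search(buffer) for p in FAKE_MARKERS)
    return buffer, blocked
```

Checking the accumulated buffer rather than each chunk matters: a marker like `[Executing tool...]` can be split across two streaming chunks and would slip past a per-chunk check.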
## The Numbers
- 728 tools across 17 cloud/DevOps platforms
- 248 tests covering auth, billing, tools, safety, user isolation
- 3-layer guardrails — input sanitizer, Haiku pre-screen, system prompt hardening
- 0 unintended deletions in production use
- 0 credential leaks (Fernet encryption, per-user keys)
- 100% change audit trail (immutable journal entries)
## The Moment It Clicked
Six months in, a user executed this from chat:
Me: "My database is at 85% CPU and costs $400/month. What can I do?"
Agent: "I found 3 options:
1. Upgrade from db.t3.xlarge to db.r5.2xlarge (better performance, higher cost)
2. Reduce IOPS from 3000 to 1000 (saves $120/month, risk of throttling)
3. Enable read replicas and route read traffic there (saves $200/month, requires app changes)"
Me: "Let's do #3 and also downsize the original instance"
Agent: [Creates change request #1: Create read replica]
[Creates change request #2: Reduce primary IOPS]
[Creates change request #3: Downsize to db.t3.large]
Me: [Approves all 3]
Agent: "Complete. Your database now costs $180/month (was $400)."
That user went from "we need a DBA to optimize this" to "I did it in 5 minutes from chat."
That's the win.
## Try It
DevOps Agent is live at devopsagent.io.
Free tier: 20 credits/month, full agent access. No credit card needed.
You get:
- Chat with your infrastructure
- Change request approval workflow
- Full audit trail
- Sandbox mode for learning
- Rollback for any change
Questions or feedback? Open an issue or email hello@devopsagent.io.