
Bo Bo Han

Autonomous AWS SysAdmin Agent (AI/MCP)

The 3 AM Wake-Up Call
We’ve all been there. It’s 3:14 AM. PagerDuty screams. You open your laptop, squinting at the brightness, and `ssh` into a server. You run `htop`. You run `tail -f /var/log/syslog`. You realize the disk is full because a log rotation script failed. You run `rm -rf /tmp/junk`. You go back to sleep.

Why are we still doing this in 2026?

For the last decade, “DevOps” meant writing scripts to automate these tasks. But scripts are fragile. They follow rules: “If disk > 90%, delete folder X.” But what if the problem is folder Y? The script fails. The human wakes up.

I decided I was done writing rules. I wanted to build a system that understands Context.

Enter “Agentic” Infrastructure
I spent the last two weeks building the AWS SysAdmin MCP Agent.

It is not a script. It is an AI application running on AWS Fargate that acts as a Level 3 Autonomous Operator. It uses the Model Context Protocol (MCP) — the new standard for connecting LLMs to tools — to safely control my infrastructure.

Here is why this changes everything.

The Problem with “Chatbots” in Ops
Most people use ChatGPT for coding. They copy-paste an error log, get a fix, and paste it back into the terminal. This is Level 1 Automation. It’s still manual.

To get to Level 3 (Autonomy), the AI needs “Hands.” It needs to be able to:

See: Read logs directly.
Think: Analyze the root cause.
Act: Execute the fix.
But giving an AI sudo access terrifies every Security Engineer on the planet (including me).
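The See/Think/Act split can be sketched as a tiny control loop. Everything below is illustrative: `llm_propose_action` is a hypothetical stub standing in for a real model call, and the approval gate is the part that keeps the Security Engineer sleeping at night.

```python
# Minimal sketch of a See -> Think -> Act loop.
# llm_propose_action is a hypothetical stand-in for an LLM call.

def see(logs: dict, service: str) -> str:
    """Observe: fetch the latest log lines for a service."""
    return logs.get(service, "")

def llm_propose_action(observation: str) -> dict:
    """Think: a stub that stands in for the model's root-cause reasoning."""
    if "worker_connections are not enough" in observation:
        return {"tool": "edit_config", "why": "connection limit reached"}
    return {"tool": "none", "why": "no known issue found"}

def act(action: dict, approved: bool) -> str:
    """Act: execute only if a human (or policy) approved the proposal."""
    if action["tool"] == "none" or not approved:
        return "no-op"
    return f"executed {action['tool']}"

logs = {"nginx": "[error] 23#23: *1201 worker_connections are not enough"}
action = llm_propose_action(see(logs, "nginx"))
print(act(action, approved=True))  # -> executed edit_config
```

The point of the structure: the model never calls anything directly; it only proposes, and `act` is where authority lives.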

The Architecture: “The Secure Cell”
To solve the security problem, I didn’t give the AI my keys. I built a Control Plane.

The Stack:

The Brain (Any LLM): Claude 3.5 Sonnet or Gemini Pro. It runs outside the infrastructure. It has no credentials.
The Body (MCP Server): A Docker container running on AWS Fargate in a private subnet.
The Hands (Tools): Python functions (`read_log`, `restart_service`, `check_disk`) exposed via MCP.
The Guardrails (AWS Secrets Manager): The SSH keys to my target servers are locked in Secrets Manager. The AI never sees them. It just asks the Agent: “Please run command X on Server Y.” The Agent authenticates and executes using the stored key.
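The "Secure Cell" idea fits in a few lines of plain Python. The tool names match the stack above, but the `SECRETS` dict and the `dispatch` helper are hypothetical stand-ins for Secrets Manager and the MCP server; the key property is that the model only ever supplies a tool name plus arguments, and credentials never cross that boundary.

```python
# Sketch of the control plane: tools are registered by name, and
# credentials live in a private store the model never sees.
# SECRETS and the return strings are illustrative placeholders.

SECRETS = {"web-01/ssh_key": "<locked away in Secrets Manager>"}

def read_log(server: str, path: str) -> str:
    _key = SECRETS[f"{server}/ssh_key"]  # fetched internally, never returned
    return f"(last lines of {path} on {server})"

def check_disk(server: str) -> str:
    _key = SECRETS[f"{server}/ssh_key"]
    return f"(df -h output for {server})"

TOOLS = {"read_log": read_log, "check_disk": check_disk}

def dispatch(tool: str, **kwargs) -> str:
    """The only surface the LLM touches: a tool name and arguments."""
    if tool not in TOOLS:
        return f"error: unknown tool '{tool}'"
    return TOOLS[tool](**kwargs)

print(dispatch("read_log", server="web-01", path="/var/log/nginx/error.log"))
```

Note the failure mode: an unknown tool name returns an error string instead of raising, so a hallucinated tool call degrades into a harmless refusal rather than an exception.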
Real-World Scenario: “The Nginx Crash”
Here is what happens when a server fails now:

Trigger: I ask the Agent: “Why is the web server returning 500 errors?”
Diagnosis: The Agent doesn’t guess. It calls `read_log_file(path='/var/log/nginx/error.log')`.
Reasoning: The AI reads the messy log line: `[error] 23#23: *1201 worker_connections are not enough`.
Action: The AI understands this is a config limit. It proposes: “I need to increase worker_connections in nginx.conf via `execute_command('sed -i ...')` and restart the service.”
Execution: I approve (or set it to auto-approve). The Agent fixes it.
Total time: 45 seconds.
Human effort: zero mental load.
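The five-step flow above can be sketched as a diagnose-then-gate pipeline. The regex, the plan strings, and the `auto_approve` flag are my illustrative assumptions, not the agent's actual code:

```python
import re

# Illustrative sketch of the "Nginx Crash" flow: map a raw error
# line to a structured root cause, then gate execution on approval.

ERROR = "[error] 23#23: *1201 worker_connections are not enough"

def diagnose(line: str):
    """Turn a messy log line into a structured finding, or None."""
    if re.search(r"worker_connections are not enough", line):
        return {
            "cause": "worker_connections limit reached",
            "plan": ["increase worker_connections in nginx.conf",
                     "reload nginx"],
        }
    return None

def execute(plan, auto_approve=False):
    """Run the plan only if approved; here we just echo the steps."""
    if not auto_approve:
        return ["awaiting human approval"]
    return [f"ran: {step}" for step in plan]

finding = diagnose(ERROR)
print(finding["cause"])
print(execute(finding["plan"], auto_approve=True))
```

The approve/auto-approve switch from step 5 is just a boolean here, but it is the dial that moves the system between Level 2 (human in the loop) and Level 3 (autonomy).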

Why “MCP” Matters
Before MCP, building this required complex, brittle API wrappers for every single tool. With Model Context Protocol, the “Brain” and the “Tools” are decoupled. I can swap Claude for Gemini, or Fargate for local Docker, and the protocol remains the same. It is the USB-C for AI.
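A rough illustration of the decoupling (the schema below is simplified, not the exact MCP wire format): the tool is described once, as data, and any client (Claude, Gemini, a local model) consumes the same description.

```python
import json

# The tool is described as data, independent of any particular model.
# This spec is a simplified illustration, not the real MCP schema.

TOOL_SPEC = {
    "name": "read_log_file",
    "description": "Read the tail of a log file on a managed server",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

def list_tools() -> str:
    """What every client sees, regardless of which model is the Brain."""
    return json.dumps([TOOL_SPEC])

spec = json.loads(list_tools())[0]
print(spec["name"])  # read_log_file
```

Swapping the Brain means pointing a different client at the same server; the tool descriptions, and the Python behind them, do not change.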

Conclusion: The End of “If-Then-Else”
We are moving from “Imperative Ops” (telling the computer exactly what to do) to “Declarative Intent” (telling the computer what we want fixed).

My Bash scripts were great. They served me well. But they can’t read. They can’t think. It’s time to let them retire.
