It’s 3:00 AM. Your phone is buzzing furiously. Your Grafana dashboard looks like a Jackson Pollock painting done entirely in red. A CPU on server-04 is screaming at 99%.
Cool graph, you think, rubbing your eyes. But what do I actually do about this?
We don’t have a data problem in modern DevOps. We have an Actionable Intelligence problem. We've built massive pipelines to funnel petabytes of Redfish server telemetry into time-series databases... just so we can set up Slack alerts that everyone inevitably mutes.
What if we put an AI in the loop? Not just a chatbot that spits out generic stack-overflow tips, but an Agentic AI ... a digital colleague that can reach out, inspect the infrastructure, and say: "Hey, Server 3 is melting down due to a runaway memory leak. I suggest a graceful reboot. Want me to pull the trigger?"
But there was a catch. To test a server-healing AI, I needed broken servers. And I really didn't want to explain to my hosting provider why I intentionally deep-fried my bare-metal rig.
So, I built NeurOps: half infrastructure intelligence, half intentional sabotage.
Here is the story of how I built an AI agent to monitor my servers, and a Chaos Proxy designed specifically to lie to it.
😈 Meet the Chaos Proxy: My Digital Gremlin
In the enterprise world, servers talk via the Redfish API. It's the standard RESTful way to ask a motherboard, "Hey, are you on fire?"
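If you've never poked a baseboard management controller before, that question is literally one GET request. Here's a generic Redfish health check, not NeurOps code; the BMC address and credentials are placeholders:

```python
import requests

# Standard Redfish system resource; host and credentials are illustrative.
resp = requests.get(
    "https://bmc-server-01/redfish/v1/Systems/1",
    auth=("admin", "password"),
    verify=False,  # lab BMCs often ship with self-signed certs
)
print(resp.json()["Status"])  # e.g. {"Health": "OK", "State": "Enabled"}
```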
Instead of hooking my AI monitoring tool directly to the servers, I built a FastAPI middleware called the Chaos Management Proxy.
Normally, this proxy is a model citizen. It intercepts the Redfish request, grabs the real JSON payload from the server, and passes it along. But hit the right endpoint, and it turns into an absolute gremlin. With a simple POST request, it intercepts the payload mid-flight and injects a "Deep Merge" override.
Take a look at this snippet from the proxy router:
@app.post("/simulate/{server_id}/memory/leak")
def memory_leak(server_id: ServerEnum):
# Deep merge this dict into the actual live Redfish API response!
overrides[server_id.value]["Memory"] = {
"UsagePercent": 92,
"Status": {"Health": "Critical"}
}
return {"message": f"Memory leak injected for {server_id.value}"}
With one API call, the proxy alters reality. The monitoring system thinks the server is dying. The actual hardware is sipping a digital piña colada. We can simulate thermal spikes, disk failures, or even a slow, torturous CPU degradation ... all safely in software.
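For the curious, here is a minimal sketch of that passthrough-plus-merge step. The `deep_merge` helper, the `UPSTREAM` address map, and the route shape are my illustrative assumptions, not the project's actual internals:

```python
from collections import defaultdict

import requests
from fastapi import FastAPI

app = FastAPI()
overrides: dict[str, dict] = defaultdict(dict)       # per-server injected faults
UPSTREAM = {"server-01": "http://emulator-01:8000"}  # hypothetical Redfish targets

def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively overlay `patch` onto `base` without dropping sibling keys."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

@app.get("/redfish/v1/Systems/{server_id}")
def get_system(server_id: str):
    # Grab the real JSON payload from the server...
    real = requests.get(f"{UPSTREAM[server_id]}/redfish/v1/Systems/1", timeout=5).json()
    # ...then overlay any injected faults before the monitor ever sees it.
    return deep_merge(real, overrides[server_id])
```

The deep merge matters: a naive `dict.update()` would replace the whole `Memory` object and silently drop sibling keys the monitor expects to see.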
🧠 The LLM is a Routing Engine (Wait, That's Clever)
So the servers are (virtually) melting. How does the AI step in?
I used the Google Agent Development Kit (ADK) and Gemini to build NeuroTalk. Here’s the secret sauce: a good AI agent isn’t just a clever prompt. It’s about giving the AI the right tools and explicitly teaching it when to use them.
Here is the actual configuration of my AI Agent:
```python
agent = Agent(
    name="NeuroTalk",
    model=Gemini(model="gemini-3-flash-preview"),
    tools=[
        get_live_status,  # Hits the live Redfish API via the Chaos Proxy
        get_past_issues,  # Queries BigQuery for historical telemetry
    ],
    instruction="""
    Tool selection strategy:
    1. Real-time status: when asked about "current status", ALWAYS use get_live_status().
    2. Historical analysis: only use get_past_issues() when explicitly asked for trends.
    3. Combined analysis: use both if you need to compare live data with history.
    """,
)
```
The LLM doesn't just guess; it acts as an intelligent router.
- Ask it: "Why is server-02 acting weird right now?" ➡️ It writes a Python script to hit the live Chaos Proxy API.
- Ask it: "Has server-02 been running hot all week?" ➡️ It writes a SQL query to hit BigQuery.
It investigates before it speaks.
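Here's roughly what those two tools might look like under the hood. The proxy URL, dataset, and table names are my illustrative guesses, not the project's actual schema:

```python
import requests
from google.cloud import bigquery

CHAOS_PROXY = "http://localhost:8000"  # hypothetical proxy address

def get_live_status(server_id: str) -> dict:
    """Fetch the current Redfish payload for one server via the Chaos Proxy."""
    resp = requests.get(f"{CHAOS_PROXY}/redfish/v1/Systems/{server_id}", timeout=5)
    resp.raise_for_status()
    return resp.json()

def get_past_issues(server_id: str, hours: int = 24) -> list[dict]:
    """Pull recent anomaly rows for one server from the BigQuery telemetry table."""
    client = bigquery.Client()
    query = """
        SELECT ts, metric, value, anomaly_tag
        FROM `neurops.telemetry.anomalies`
        WHERE server_id = @server_id
          AND ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @hours HOUR)
        ORDER BY ts DESC
    """
    job = client.query(query, job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("server_id", "STRING", server_id),
            bigquery.ScalarQueryParameter("hours", "INT64", hours),
        ]
    ))
    return [dict(row) for row in job]
```

Note that both tools return structured data, not prose. The LLM's job is to interpret, not to parse.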
🚧 The Statefulness Trap
It wasn't all smooth sailing. I quickly ran into a major problem: State.
If a CPU hits 90%, is it a 2-second spike because a cron job started, or is the server entering a death spiral? LLMs are notoriously bad at analyzing high-frequency time-series data on the fly.
To solve this, I had to build a fast, localized deque-based ring buffer into the polling collector (Neurosight) just to track the last 5 intervals.
```python
TREND_WINDOW = 5  # number of samples held in the deque ring buffer

# A simple ring buffer check for trend detection!
def is_increasing(arr):
    values = list(arr)
    return len(values) == TREND_WINDOW and all(x < y for x, y in zip(values, values[1:]))
```
If the temperature goes up 5 times in a row, the collector flags a `TEMP_TREND_UP` anomaly before the server actually hits the critical threshold. It attaches this tag to the payload sent to BigQuery. The AI simply reads this tag, bypassing the need to do any complex math.
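In the collector, that looks something like the sketch below, building on the snippet above. `record_temperature` is a hypothetical wrapper; the real Neurosight loop surely does more bookkeeping:

```python
from collections import deque

temps = deque(maxlen=TREND_WINDOW)  # ring buffer: old samples fall off automatically

def record_temperature(reading_c: float) -> str | None:
    """Push one poll result; return an anomaly tag on a sustained climb."""
    temps.append(reading_c)
    if is_increasing(temps):
        return "TEMP_TREND_UP"  # tag attached to the payload sent to BigQuery
    return None
```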
🎭 The 5-Step Dance of Destruction and Salvation
When you boot up NeurOps, here is the wild sequence of events that happens in seconds:
- The Target: We spin up Redfish emulators (or connect to real servers).
- The Sabotage: We hit the Chaos Proxy and inject a fake `95°C` thermal event on `server-01`.
- The Detection: The Neurosight Collector polls the proxy, sees the 95°C spike, flags a `TEMP_CRITICAL` anomaly, and fires the data via Google Pub/Sub into BigQuery.
- The Investigation: An engineer opens the Streamlit UI and asks NeuroTalk: "What just happened to server-01?"
- The Salvation: The AI Agent queries BigQuery, sees the thermal spike, reads the Redfish status, and responds: "Server-01 has experienced a critical thermal event. I recommend triggering the `/heal/server-01/reboot` webhook to attempt a recovery."
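That healing webhook is the one piece not shown above, so here is a hedged sketch. It reuses the hypothetical `overrides` dict and `UPSTREAM` map from the proxy sketch earlier; the `ComputerSystem.Reset` action is standard Redfish, but the route shape is my assumption:

```python
@app.post("/heal/{server_id}/reboot")
def heal_reboot(server_id: str):
    overrides.pop(server_id, None)  # clear any injected faults first
    # Ask the BMC for a graceful restart via the standard Redfish reset action.
    requests.post(
        f"{UPSTREAM[server_id]}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
        json={"ResetType": "GracefulRestart"},
        timeout=10,
    )
    return {"message": f"Reboot triggered for {server_id}"}
```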
🛠️ If You Want to Build This...
If you are looking to build agentic AI into your own DevOps workflows, here are my biggest takeaways:
- Don't let the AI guess. Give it strict tools. An LLM without access to a live API or a database is just a very confident hallucinator. Treat it like a junior dev ... give it read-only API keys and watch what it does.
- Chaos Engineering is mandatory. You cannot trust your AI if you have never watched it panic. Build a proxy, intercept payloads, and break things on purpose.
- Start stupid simple. You don't need a massive Kubernetes cluster to test this. A simple FastAPI proxy and a Python polling script will get you 90% of the way there.
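To make that last point concrete, here is about the smallest polling script that qualifies. The server names and proxy address are placeholders:

```python
import time

import requests

PROXY = "http://localhost:8000"       # the Chaos Proxy (placeholder address)
SERVERS = ["server-01", "server-02"]  # placeholder fleet

# Poll every server and shout whenever anything looks unhealthy.
while True:
    for server in SERVERS:
        payload = requests.get(f"{PROXY}/redfish/v1/Systems/{server}", timeout=5).json()
        health = payload.get("Status", {}).get("Health", "OK")
        if health != "OK":
            print(f"[{server}] health={health}: {payload}")
    time.sleep(10)
```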
🏁 Wrapping Up
We are entering a wildly exciting era where AI doesn't just help us write code; it actively manages the infrastructure the code runs on. By combining standard protocols (Redfish), robust data pipelines (BigQuery), and Agentic AI, we can stop staring at dashboards at 3 AM and start actually fixing problems.
If you thought this was interesting, drop a comment! How are you using AI in your DevOps workflows? Or better yet... what is the most creative way you've ever broken a server on purpose?
Let me know below! 👇