
Tom

Originally published at bubobot.com

Building an AI-Agent Decision Engine for Self-Healing To Protect Uptime (Part 1)

Building AI-Powered Self-Healing Infrastructure

What if your infrastructure could monitor, analyze, and heal itself before you even wake up? Let's explore how AI-driven decision making transforms traditional monitoring from reactive firefighting into proactive uptime protection.

The Evolution Beyond Traditional Monitoring

Traditional monitoring tells you what happened after downtime occurs. AI-powered intelligent infrastructure tells you what happened, why it happened, and automatically fixes it to maintain uptime.

This is the shift from "alert and pray" to "analyze and heal."

How AI-Driven Self-Healing Works

The AI Agent Decision Engine operates on a simple principle: Uptime First, Human Intervention When Necessary.

Here's how it categorizes issues:

EMERGENCY_HEALING scenarios (immediate action):

  • Disk usage > 65% (service failure imminent)

  • Memory usage > 65% (OOM kill risk)

  • Single process consuming > 30% CPU for > 5 minutes

  • Critical services down (nginx, database, PM2 apps)

NOTIFY_ONLY scenarios (human review):

  • Performance degraded but services functional

  • Resource usage elevated but not threatening availability

  • Temporary spikes that may self-resolve

  • Issues during business hours unless critical

The system doesn't just react to alerts—it analyzes current system state versus the original alert to make intelligent decisions.
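The categorization above can be sketched as a single predicate. This is a minimal illustration of the thresholds listed, not the actual engine; the `metrics` object shape and field names are hypothetical:

```javascript
// Sketch of the triage split. Thresholds mirror the criteria above;
// the metrics object shape is a hypothetical example.
function categorizeAlert(metrics) {
  const emergency =
    metrics.diskUsagePercent > 65 ||    // service failure imminent
    metrics.memoryUsagePercent > 65 ||  // OOM kill risk
    (metrics.topProcessCpuPercent > 30 && metrics.topProcessMinutes > 5) ||
    metrics.criticalServiceDown;        // nginx, database, PM2 apps

  return emergency ? 'EMERGENCY_HEALING' : 'NOTIFY_ONLY';
}
```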

Building Your Self-Healing Workflow

Here's how to implement this using n8n, creating infrastructure that handles PM2 applications, Node.js services, and traditional server monitoring.

Step 1: Alert Reception and Enrichment

Start with a webhook that receives Prometheus alerts, then enrich with context:

// n8n Code node: flatten AlertManager's webhook payload into one item
// per alert, enriched with timing context for the triage agent.
const alerts = items[0].json.body.alerts || [];

return alerts.map(alert => {
  const startsAt = new Date(alert.startsAt);
  const hour = startsAt.getUTCHours(); // note: UTC; adjust for your timezone
  const isBusinessHours = hour >= 9 && hour < 17;
  const durationMinutes = (Date.now() - startsAt.getTime()) / 1000 / 60;

  return {
    json: {
      alertname: alert.labels.alertname,
      severity: alert.labels.severity,
      instance: alert.labels.instance,
      description: alert.annotations.description,
      isBusinessHours,
      durationMinutes
    }
  };
});


Step 2: AI-Powered Triage Decision

The first AI agent analyzes whether this requires emergency healing or just notification:

Analyze this alert and decide: EMERGENCY_HEALING or NOTIFY_ONLY

Decision Criteria:
EMERGENCY_HEALING:
- Disk usage > 65% (service failure imminent)
- Memory usage > 65% (OOM kill risk)
- Critical services down
- Any condition threatening availability within 30 minutes

NOTIFY_ONLY:
- Performance degraded but services functional
- Resource usage elevated but not critical
- Temporary spikes that may self-resolve

Respond with JSON:
{
  "decision": "EMERGENCY_HEALING|NOTIFY_ONLY",
  "threat_level": "CRITICAL|HIGH|MEDIUM|LOW",
  "immediate_actions": [{"command": "...", "purpose": "..."}],
  "reasoning": "Why this decision ensures system survival"
}

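The agent's reply should be treated as untrusted input. Here's a minimal parsing sketch (the function name and fallback shape are illustrative) that defaults to NOTIFY_ONLY whenever the JSON is malformed or the decision field is unexpected, so a garbled model response can never trigger automated healing:

```javascript
// Defensive parsing of the triage agent's JSON reply.
// Any failure falls back to the safest decision: NOTIFY_ONLY.
function parseTriageDecision(raw) {
  const fallback = {
    decision: 'NOTIFY_ONLY',
    threat_level: 'LOW',
    immediate_actions: [],
    reasoning: 'Model output could not be parsed'
  };
  try {
    const parsed = JSON.parse(raw);
    const valid = ['EMERGENCY_HEALING', 'NOTIFY_ONLY'].includes(parsed.decision);
    return valid ? parsed : fallback;
  } catch {
    return fallback;
  }
}
```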

Step 3: System Analysis and Remediation Planning

For critical alerts, the system SSHes into the affected servers to run diagnostic scripts:

# System health analysis
bash /opt/system-doctor.sh --report-json --check-only


A second AI agent compares the original alert with current system state:

Example AI response during high CPU:

{
  "situation_assessment": {
    "alert_vs_reality": "CPU usage critically high at 85%",
    "issue_status": "ONGOING",
    "action_required": "CORRECTIVE"
  },
  "targeted_actions": [
    {
      "action": "Terminate stress-ng processes",
      "command": "kill -9 245136 245137",
      "justification": "Processes consuming 82.3% CPU",
      "risk_level": "SAFE",
      "execution_order": 1
    }
  ]
}

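Before anything runs, the `targeted_actions` list can be filtered by risk level and ordered by `execution_order`. A sketch against the JSON shape above (the function name is illustrative):

```javascript
// Sketch: drop anything the agent rated RISKY, then execute the
// remaining commands in the order the agent specified.
function planExecution(targetedActions) {
  return targetedActions
    .filter(a => a.risk_level !== 'RISKY')
    .sort((a, b) => a.execution_order - b.execution_order)
    .map(a => a.command);
}
```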

Step 4: Safe Command Execution

Safety validation ensures only approved commands execute:

// Safety gate: block commands matching destructive patterns, plus
// anything the remediation agent itself rated RISKY.
function validateCommand(command, riskLevel) {
  const dangerousPatterns = [
    'rm -rf /',
    'shutdown',
    'reboot',
    'mkfs'
  ];

  const isDangerous = dangerousPatterns.some(pattern =>
    command.toLowerCase().includes(pattern)
  );

  if (isDangerous || riskLevel === 'RISKY') {
    return { safe: false, reason: `Blocked: ${command}` };
  }
  return { safe: true };
}


Only SAFE and MODERATE risk commands execute automatically. RISKY commands require manual approval.

Safety Mechanisms

The system implements comprehensive safety layers:

  1. Command Pattern Blocking: Prevents destructive operations

  2. Risk Level Assessment: SAFE/MODERATE/RISKY classification

  3. Business Hours Consideration: Reduced automation during work hours

  4. Execution Ordering: Prioritized command sequences

  5. Audit Trails: Complete logging of decisions and actions

Real-World Results

Teams implementing AI-driven self-healing report:

  • Faster incident resolution: Issues fixed in seconds vs minutes

  • Reduced alert fatigue: Only genuine emergencies escalate to humans

  • Improved uptime: Proactive healing prevents user-facing outages

  • Better sleep: Critical issues resolved automatically outside business hours

Implementation Workflow

Prometheus Alert
  → AI Triage (Emergency vs Notify)
    → System Analysis (SSH diagnostics)
      → AI Remediation Planning
        → Safe Command Execution
          → Discord Notification

Getting Started

  1. Set up monitoring: Configure Prometheus + AlertManager

  2. Install diagnostics: Deploy system health scripts on servers

  3. Import workflow: Use the n8n template from our GitHub

  4. Configure AI: Add OpenAI API key and SSH credentials

  5. Test safely: Start with non-critical alerts in staging

Considerations and Limitations

While powerful, AI-driven automation has important considerations:

Benefits:

  • Intelligent decision making

  • Adapts to unique environments

  • Handles edge cases creatively

Limitations:

  • Non-deterministic behavior

  • Data privacy concerns (cloud APIs)

  • Complex audit trails

  • Potential for "hallucinated" commands

What's Next

Part 2 of this series will cover deterministic alternatives for teams who prefer predictable, rule-based automation while maintaining intelligent analysis capabilities.

We'll explore:

  • Rule-based decision trees

  • Hybrid approaches (AI analysis + deterministic execution)

  • Production-hardened workflows for enterprise environments


The future of infrastructure management isn't just about monitoring—it's about building systems that can think, analyze, and heal themselves proactively.


This is Part 1 of our DevOps automation series. For the complete implementation guide with detailed code examples and safety considerations, check out our full blog post.

#DevOpsAutomation #AIInfrastructure #ProactiveMonitoring #SelfHealing #IntelligentInfrastructure

Read more at https://bubobot.com/blog/building-an-ai-agent-decision-engine-for-self-healing-to-protect-uptime-part-1?utm_source=dev.to
