DEV Community

Soon Seah Toh

Inside the Agentic Loop: How 5 AI Agents Autonomously Investigate IT Incidents

ζ­ε–œε‘θ΄’! Happy Year of the Horse 🐴

As we celebrate new beginnings this CNY, I want to share a technical deep dive into something that represents a genuine paradigm shift in IT operations: the Dark NOC.

Not a marketing term. A working system.

The Problem

For 30 years, NOC teams have operated the same way: humans staring at dashboards, waiting for alerts, then scrambling through runbooks to figure out what happened.

The workflow is always the same:

  1. Alert fires
  2. Check dashboard
  3. Compare against known patterns
  4. Follow runbook
  5. Escalate if unknown

That's pattern matching. And AI does pattern matching better than humans, orders of magnitude better.

The Architecture: 5 Specialized Agents

We built Astra AI with five domain-specific agents, each with its own tool set and reasoning capabilities:

| Agent | Domain | Key Tools |
| --- | --- | --- |
| Infrastructure | Device health, topology | status/summary, alarms/list, metrics/top |
| Network | Traffic, connectivity | netflow/summary, netflow/top-sources, netflow/top-destinations |
| Application | Services, performance | apm/services, apm/top-slow-transactions, services/list |
| Security | Threats, anomalies | logs/query, alarms/list |
| RCA Orchestrator | Correlation, prediction | rca/analyze, rca/service-health, rca/forecast |
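As a rough sketch, the agent-to-tool mapping in the table could be captured as a simple registry. The agent names and tool identifiers come straight from the table; the dict structure itself is an illustrative assumption, not the actual implementation:

```python
# Hypothetical registry mapping each agent to the tools it may call.
# Tool identifiers are taken from the table above; the shape of this
# structure is an assumption for illustration only.
AGENT_TOOLS = {
    "infrastructure": ["status/summary", "alarms/list", "metrics/top"],
    "network": ["netflow/summary", "netflow/top-sources", "netflow/top-destinations"],
    "application": ["apm/services", "apm/top-slow-transactions", "services/list"],
    "security": ["logs/query", "alarms/list"],
    "rca_orchestrator": ["rca/analyze", "rca/service-health", "rca/forecast"],
}

def tools_for(agent: str) -> list[str]:
    """Return the tool identifiers an agent is allowed to call."""
    return AGENT_TOOLS.get(agent, [])
```

Scoping each agent to its own tool set keeps the LLM's decision space small, which matters once it starts choosing tools autonomously.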

The Agentic Loop

This is the core innovation. Each agent doesn't just run a single prompt; it runs a tool-calling loop:

Agent receives incident context
  → Analyzes the situation
  → Decides which tool to call next
  → Executes the tool (alarms, metrics, logs, NetFlow, APM traces)
  → Evaluates results
  → Decides next action
  → Loops until root cause is identified

The agent autonomously decides its investigation path. No predefined runbook. No hardcoded decision tree. The LLM reasons about what data it needs, calls the appropriate tool, evaluates the results, and decides what to investigate next.
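A minimal sketch of such a loop, assuming two hypothetical callables: `llm_decide` (the LLM reasoning step) and `execute_tool` (the tool dispatcher). Neither name is from the real system:

```python
def investigate(incident: dict, llm_decide, execute_tool, max_steps: int = 10):
    """Run a tool-calling loop until the LLM declares a root cause.

    llm_decide(context) -> {"action": "call_tool", "tool": ..., "args": ...}
                        or {"action": "done", "root_cause": ...}
    execute_tool(tool, args) -> result dict
    Both callables are hypothetical stand-ins for the real LLM/tool layer.
    """
    context = {"incident": incident, "observations": []}
    for _ in range(max_steps):                # safety cap on loop length
        decision = llm_decide(context)        # agent reasons about next step
        if decision["action"] == "done":
            return decision["root_cause"]     # investigation complete
        result = execute_tool(decision["tool"], decision.get("args", {}))
        context["observations"].append(
            {"tool": decision["tool"], "result": result}
        )
    return None                               # budget exhausted: escalate to a human
```

The `max_steps` cap is the one piece of hardcoding worth keeping: it bounds cost and guarantees the loop terminates even when the LLM keeps asking for more data.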

Root Cause Analysis: The Correlation Engine

When the RCA Orchestrator's rca/analyze tool executes, here's what happens:

Phase 1: Alarm Aggregation

  • Collects all active alarms within the time window
  • Filters by severity, device, and time range

Phase 2: Root Cause Identification

Multiple heuristics are applied, each with a confidence score:

Root cause criteria:
  First alarm in time window     β†’ 0.9 confidence
  Critical severity alarm        β†’ 0.85 confidence
  Device with most alarms        β†’ 0.7 confidence
  Upstream device in hierarchy   β†’ weighted by position
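Those heuristics could be scored roughly as follows. The confidence values are the ones listed above; the alarm fields (`device`, `severity`, `timestamp`) are assumed, and the upstream-hierarchy weighting is omitted for brevity:

```python
def score_root_cause_candidates(alarms: list[dict]) -> list[tuple[dict, float]]:
    """Rank alarms as root-cause candidates; each gets its highest matching confidence.

    Assumes each alarm dict has 'device', 'severity', and 'timestamp' keys.
    The upstream-device weighting from the post is not modeled here.
    """
    if not alarms:
        return []
    first = min(alarms, key=lambda a: a["timestamp"])
    per_device: dict[str, int] = {}
    for a in alarms:
        per_device[a["device"]] = per_device.get(a["device"], 0) + 1
    noisiest = max(per_device, key=per_device.get)

    scored = []
    for a in alarms:
        conf = 0.0
        if a is first:
            conf = max(conf, 0.9)      # first alarm in the time window
        if a["severity"] == "CRITICAL":
            conf = max(conf, 0.85)     # critical severity alarm
        if a["device"] == noisiest:
            conf = max(conf, 0.7)      # device with the most alarms
        scored.append((a, conf))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

Taking the maximum rather than summing keeps confidences bounded in [0, 1]; a real engine might combine evidence differently.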

Phase 3: Impact Mapping

{
  "impactChain": {
    "rootDevice": "core-switch-01",
    "affectedDevices": ["web-server-01", "web-server-02", "db-primary"],
    "affectedServices": ["user-auth", "payment-api"],
    "blastRadius": { "devices": 5, "services": 2, "users": 500 }
  }
}
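One way to derive such a chain is a breadth-first walk over a dependency graph rooted at the suspected device. The topology map and device-to-service mapping below are illustrative, not the product's data model:

```python
from collections import deque

def impact_chain(root: str, downstream: dict, services_on: dict) -> dict:
    """BFS from the root device over a 'depends-on-me' adjacency map.

    downstream:  device -> list of devices that depend on it (assumed topology)
    services_on: device -> list of services hosted there (assumed mapping)
    """
    seen, order = {root}, []
    queue = deque(downstream.get(root, []))
    while queue:
        dev = queue.popleft()
        if dev in seen:
            continue
        seen.add(dev)
        order.append(dev)
        queue.extend(downstream.get(dev, []))   # fan out to further dependents
    affected_services = sorted({s for d in order for s in services_on.get(d, [])})
    return {
        "rootDevice": root,
        "affectedDevices": order,
        "affectedServices": affected_services,
        "blastRadius": {"devices": len(seen), "services": len(affected_services)},
    }
```

The `seen` set guards against cycles, which real network topologies often contain (redundant links, meshed cores).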

Phase 4: Predictive Escalation

For each WARNING alarm:
  → Predict escalation to CRITICAL in (timeWindow / 2)
  → Confidence: 0.65

For each MINOR alarm:
  → Predict escalation to CRITICAL in (timeWindow / 4)
  → Confidence: 0.8
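Expressed as code, those rules might look like this. The time-window arithmetic and confidences are from the rules above; the alarm shape (`id`, `severity`) is an assumption:

```python
def predict_escalations(alarms: list[dict], time_window_min: int) -> list[dict]:
    """Apply the escalation rules above to WARNING and MINOR alarms.

    Assumes each alarm dict has 'id' and 'severity' keys; CRITICAL alarms
    are already at the top severity, so they produce no prediction.
    """
    rules = {
        "WARNING": (time_window_min // 2, 0.65),  # escalates in half the window
        "MINOR": (time_window_min // 4, 0.8),     # escalates in a quarter of it
    }
    predictions = []
    for a in alarms:
        if a["severity"] in rules:
            eta, conf = rules[a["severity"]]
            predictions.append({
                "alarm": a["id"],
                "predicted": "CRITICAL",
                "eta_minutes": eta,
                "confidence": conf,
            })
    return predictions
```

Note the asymmetry in the source rules: MINOR alarms get a shorter predicted fuse but a higher confidence than WARNING alarms.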

Multi-Agent Delegation

The agents don't work in isolation; they delegate to each other:

  1. Infrastructure Agent detects service-level alarm
  2. → Delegates to Application Agent
  3. Application Agent finds suspicious log patterns
  4. → Delegates to Security Agent
  5. Security Agent identifies anomaly
  6. → All findings flow to RCA Orchestrator
  7. RCA Orchestrator synthesizes root cause + blast radius + recommendations
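The hand-off chain above can be modeled as each agent returning a finding plus an optional next agent. The agent callables here are fakes for illustration; in the real system each one would be a full tool-calling loop:

```python
def run_delegation(start_agent: str, agents: dict, incident: dict) -> list[dict]:
    """Follow a delegation chain, collecting findings for the RCA Orchestrator.

    agents: name -> callable(incident) returning
            {"finding": ..., "delegate_to": <next agent name or None>}
    The callables are hypothetical stand-ins for real agent loops.
    """
    findings, current, visited = [], start_agent, set()
    while current and current not in visited:   # guard against delegation cycles
        visited.add(current)
        result = agents[current](incident)
        findings.append({"agent": current, "finding": result["finding"]})
        current = result.get("delegate_to")
    return findings
```

The `visited` set matters: two agents that keep delegating to each other would otherwise loop forever.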

The Memory Layer

This is the breakthrough that makes it a true "Dark NOC."

Every investigation gets persisted:

  • Investigation steps and tool calls
  • Root cause findings
  • Correlation patterns
  • Resolution actions

When a new incident matches a pattern from months ago, the agent recalls the previous investigation. It doesn't start from scratch; it references what worked before.
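A toy version of that recall step: fingerprint the incident and look up prior investigations with the same signature. The fingerprint scheme and field names are assumptions; a production system would more likely use embeddings and similarity search:

```python
def fingerprint(incident: dict) -> tuple:
    """Reduce an incident to a comparable signature (illustrative scheme only)."""
    return (incident["device_type"], incident["alarm_type"], incident["service"])

class InvestigationMemory:
    """Persist completed investigations and recall them by matching fingerprint."""

    def __init__(self):
        self._store: dict[tuple, list[dict]] = {}

    def persist(self, incident: dict, investigation: dict) -> None:
        # Store the full investigation record under the incident's signature.
        self._store.setdefault(fingerprint(incident), []).append(investigation)

    def recall(self, incident: dict) -> list[dict]:
        # Exact-match lookup; no result means the agent investigates from scratch.
        return self._store.get(fingerprint(incident), [])
```

Exact-match recall is brittle on purpose here: it shows the mechanism (persist, fingerprint, recall) without pretending to solve the harder similarity problem.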

Think of it as institutional memory that never quits, never forgets, and never takes a sick day.

What a Dark NOC Actually Looks Like

  • 500TB of telemetry correlated in 1-3 seconds
  • Multi-agent correlation running simultaneously across infrastructure, network, application, and security
  • Predictive escalation: "This WARNING will become CRITICAL in 2 hours"
  • Auto-action workflows: ServiceNow tickets, PagerDuty alerts, remediation scripts, all triggered before a human sees the problem
  • Self-learning: every incident makes the system smarter

The human team doesn't disappear. They evolve into architects and strategists, handling the 10% of incidents that genuinely need human judgment.

The 90% that's pattern-matching? The agents handle that now.

The Bottom Line

The Dark NOC isn't science fiction. It's running: five specialized AI agents, each with its own reasoning loop, its own tool chain, and permanent memory, investigating incidents autonomously, 24/7.

Happy New Year, and may the Year of the Horse bring speed, strength, and autonomous operations to your NOC. 🐴


This is Astra AI in Cloud Vista v15. If you're building autonomous operations or working on agentic AI for IT, I'd love to hear your approach in the comments.
