DEV Community

Soon Seah Toh

Inside the Agentic Loop: How 5 AI Agents Autonomously Investigate IT Incidents

ζ­ε–œε‘θ΄’! Happy Year of the Horse 🐴

As we celebrate new beginnings this CNY, I want to share a technical deep dive into something that represents a genuine paradigm shift in IT operations: the Dark NOC.

Not a marketing term. A working system.

The Problem

For 30 years, NOC teams have operated the same way: humans staring at dashboards, waiting for alerts, then scrambling through runbooks to figure out what happened.

The workflow is always the same:

  1. Alert fires
  2. Check dashboard
  3. Compare against known patterns
  4. Follow runbook
  5. Escalate if unknown

That's pattern matching. And AI does pattern matching better than humans, orders of magnitude better.

The Architecture: 5 Specialized Agents

We built Astra AI with five domain-specific agents, each with its own tool set and reasoning capabilities:

| Agent | Domain | Key Tools |
| --- | --- | --- |
| Infrastructure | Device health, topology | status/summary, alarms/list, metrics/top |
| Network | Traffic, connectivity | netflow/summary, netflow/top-sources, netflow/top-destinations |
| Application | Services, performance | apm/services, apm/top-slow-transactions, services/list |
| Security | Threats, anomalies | logs/query, alarms/list |
| RCA Orchestrator | Correlation, prediction | rca/analyze, rca/service-health, rca/forecast |
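As a rough sketch, the agent-to-tool mapping in the table could be captured as a simple registry. The agent names and tool identifiers come straight from the table; the dict structure itself is an illustrative assumption, not the actual implementation:

```python
# Hypothetical registry mapping each agent to the tools it may call.
# Tool identifiers are taken from the table above; the shape of this
# structure is an assumption for illustration only.
AGENT_TOOLS = {
    "infrastructure": ["status/summary", "alarms/list", "metrics/top"],
    "network": ["netflow/summary", "netflow/top-sources", "netflow/top-destinations"],
    "application": ["apm/services", "apm/top-slow-transactions", "services/list"],
    "security": ["logs/query", "alarms/list"],
    "rca_orchestrator": ["rca/analyze", "rca/service-health", "rca/forecast"],
}

def tools_for(agent: str) -> list[str]:
    """Return the tool identifiers an agent is allowed to call."""
    return AGENT_TOOLS.get(agent, [])
```

Scoping each agent to its own tool set keeps the LLM's decision space small, which matters once it starts choosing tools autonomously.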

The Agentic Loop

This is the core innovation. Each agent doesn't just run a single prompt; it runs a tool-calling loop:

Agent receives incident context
  → Analyzes the situation
  → Decides which tool to call next
  → Executes the tool (alarms, metrics, logs, NetFlow, APM traces)
  → Evaluates results
  → Decides next action
  → Loops until root cause is identified

The agent autonomously decides its investigation path. No predefined runbook. No hardcoded decision tree. The LLM reasons about what data it needs, calls the appropriate tool, evaluates the results, and decides what to investigate next.
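A minimal sketch of such a loop, assuming two hypothetical callables: `llm_decide` (the LLM reasoning step) and `execute_tool` (the tool dispatcher). Neither name is from the real system:

```python
def investigate(incident: dict, llm_decide, execute_tool, max_steps: int = 10):
    """Run a tool-calling loop until the LLM declares a root cause.

    llm_decide(context) -> {"action": "call_tool", "tool": ..., "args": ...}
                        or {"action": "done", "root_cause": ...}
    execute_tool(tool, args) -> result dict
    Both callables are hypothetical stand-ins for the real LLM/tool layer.
    """
    context = {"incident": incident, "observations": []}
    for _ in range(max_steps):                # safety cap on loop length
        decision = llm_decide(context)        # agent reasons about next step
        if decision["action"] == "done":
            return decision["root_cause"]     # investigation complete
        result = execute_tool(decision["tool"], decision.get("args", {}))
        context["observations"].append(
            {"tool": decision["tool"], "result": result}
        )
    return None                               # budget exhausted: escalate to a human
```

The `max_steps` cap is the one piece of hardcoding worth keeping: it bounds cost and guarantees the loop terminates even when the LLM keeps asking for more data.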

Root Cause Analysis: The Correlation Engine

When the RCA Orchestrator's rca/analyze tool executes, here's what happens:

Phase 1: Alarm Aggregation

  • Collects all active alarms within the time window
  • Filters by severity, device, and time range

Phase 2: Root Cause Identification

Multiple heuristics are applied, each with a confidence score:

Root cause criteria:
  First alarm in time window     β†’ 0.9 confidence
  Critical severity alarm        β†’ 0.85 confidence
  Device with most alarms        β†’ 0.7 confidence
  Upstream device in hierarchy   β†’ weighted by position
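Those heuristics could be scored roughly as follows. The confidence values are the ones listed above; the alarm fields (`device`, `severity`, `timestamp`) are assumed, and the upstream-hierarchy weighting is omitted for brevity:

```python
def score_root_cause_candidates(alarms: list[dict]) -> list[tuple[dict, float]]:
    """Rank alarms as root-cause candidates; each gets its highest matching confidence.

    Assumes each alarm dict has 'device', 'severity', and 'timestamp' keys.
    The upstream-device weighting from the post is not modeled here.
    """
    if not alarms:
        return []
    first = min(alarms, key=lambda a: a["timestamp"])
    per_device: dict[str, int] = {}
    for a in alarms:
        per_device[a["device"]] = per_device.get(a["device"], 0) + 1
    noisiest = max(per_device, key=per_device.get)

    scored = []
    for a in alarms:
        conf = 0.0
        if a is first:
            conf = max(conf, 0.9)      # first alarm in the time window
        if a["severity"] == "CRITICAL":
            conf = max(conf, 0.85)     # critical severity alarm
        if a["device"] == noisiest:
            conf = max(conf, 0.7)      # device with the most alarms
        scored.append((a, conf))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

Taking the maximum rather than summing keeps confidences bounded in [0, 1]; a real engine might combine evidence differently.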

Phase 3: Impact Mapping

{
  "impactChain": {
    "rootDevice": "core-switch-01",
    "affectedDevices": ["web-server-01", "web-server-02", "db-primary"],
    "affectedServices": ["user-auth", "payment-api"],
    "blastRadius": { "devices": 5, "services": 2, "users": 500 }
  }
}
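One way to derive such a chain is a breadth-first walk over a dependency graph rooted at the suspected device. The topology map and device-to-service mapping below are illustrative, not the product's data model:

```python
from collections import deque

def impact_chain(root: str, downstream: dict, services_on: dict) -> dict:
    """BFS from the root device over a 'depends-on-me' adjacency map.

    downstream:  device -> list of devices that depend on it (assumed topology)
    services_on: device -> list of services hosted there (assumed mapping)
    """
    seen, order = {root}, []
    queue = deque(downstream.get(root, []))
    while queue:
        dev = queue.popleft()
        if dev in seen:
            continue
        seen.add(dev)
        order.append(dev)
        queue.extend(downstream.get(dev, []))   # fan out to further dependents
    affected_services = sorted({s for d in order for s in services_on.get(d, [])})
    return {
        "rootDevice": root,
        "affectedDevices": order,
        "affectedServices": affected_services,
        "blastRadius": {"devices": len(seen), "services": len(affected_services)},
    }
```

The `seen` set guards against cycles, which real network topologies often contain (redundant links, meshed cores).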

Phase 4: Predictive Escalation

For each WARNING alarm:
  → Predict escalation to CRITICAL in (timeWindow / 2)
  → Confidence: 0.65

For each MINOR alarm:
  → Predict escalation to CRITICAL in (timeWindow / 4)
  → Confidence: 0.8
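Expressed as code, those rules might look like this. The time-window arithmetic and confidences are from the rules above; the alarm shape (`id`, `severity`) is an assumption:

```python
def predict_escalations(alarms: list[dict], time_window_min: int) -> list[dict]:
    """Apply the escalation rules above to WARNING and MINOR alarms.

    Assumes each alarm dict has 'id' and 'severity' keys; CRITICAL alarms
    are already at the top severity, so they produce no prediction.
    """
    rules = {
        "WARNING": (time_window_min // 2, 0.65),  # escalates in half the window
        "MINOR": (time_window_min // 4, 0.8),     # escalates in a quarter of it
    }
    predictions = []
    for a in alarms:
        if a["severity"] in rules:
            eta, conf = rules[a["severity"]]
            predictions.append({
                "alarm": a["id"],
                "predicted": "CRITICAL",
                "eta_minutes": eta,
                "confidence": conf,
            })
    return predictions
```

Note the asymmetry in the source rules: MINOR alarms get a shorter predicted fuse but a higher confidence than WARNING alarms.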

Multi-Agent Delegation

The agents don't work in isolation; they delegate to each other:

  1. Infrastructure Agent detects service-level alarm
  2. → Delegates to Application Agent
  3. Application Agent finds suspicious log patterns
  4. → Delegates to Security Agent
  5. Security Agent identifies anomaly
  6. → All findings flow to RCA Orchestrator
  7. RCA Orchestrator synthesizes root cause + blast radius + recommendations
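The hand-off chain above can be modeled as each agent returning a finding plus an optional next agent. The agent callables here are fakes for illustration; in the real system each one would be a full tool-calling loop:

```python
def run_delegation(start_agent: str, agents: dict, incident: dict) -> list[dict]:
    """Follow a delegation chain, collecting findings for the RCA Orchestrator.

    agents: name -> callable(incident) returning
            {"finding": ..., "delegate_to": <next agent name or None>}
    The callables are hypothetical stand-ins for real agent loops.
    """
    findings, current, visited = [], start_agent, set()
    while current and current not in visited:   # guard against delegation cycles
        visited.add(current)
        result = agents[current](incident)
        findings.append({"agent": current, "finding": result["finding"]})
        current = result.get("delegate_to")
    return findings
```

The `visited` set matters: two agents that keep delegating to each other would otherwise loop forever.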

The Memory Layer

This is the breakthrough that makes it a true "Dark NOC."

Every investigation gets persisted:

  • Investigation steps and tool calls
  • Root cause findings
  • Correlation patterns
  • Resolution actions

When a new incident matches a pattern from months ago, the agent recalls the previous investigation. It doesn't start from scratch; it references what worked before.
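A toy version of that recall step: fingerprint the incident and look up prior investigations with the same signature. The fingerprint scheme and field names are assumptions; a production system would more likely use embeddings and similarity search:

```python
def fingerprint(incident: dict) -> tuple:
    """Reduce an incident to a comparable signature (illustrative scheme only)."""
    return (incident["device_type"], incident["alarm_type"], incident["service"])

class InvestigationMemory:
    """Persist completed investigations and recall them by matching fingerprint."""

    def __init__(self):
        self._store: dict[tuple, list[dict]] = {}

    def persist(self, incident: dict, investigation: dict) -> None:
        # Store the full investigation record under the incident's signature.
        self._store.setdefault(fingerprint(incident), []).append(investigation)

    def recall(self, incident: dict) -> list[dict]:
        # Exact-match lookup; no result means the agent investigates from scratch.
        return self._store.get(fingerprint(incident), [])
```

Exact-match recall is brittle on purpose here: it shows the mechanism (persist, fingerprint, recall) without pretending to solve the harder similarity problem.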

Think of it as institutional memory that never quits, never forgets, and never takes a sick day.

What a Dark NOC Actually Looks Like

  • 500TB of telemetry correlated in 1-3 seconds
  • Multi-agent correlation running simultaneously across infrastructure, network, application, and security
  • Predictive escalation: "This WARNING will become CRITICAL in 2 hours"
  • Auto-action workflows: ServiceNow tickets, PagerDuty alerts, remediation scripts, all triggered before a human sees the problem
  • Self-learning: every incident makes the system smarter

The human team doesn't disappear. They evolve into architects and strategists, handling the 10% of incidents that genuinely need human judgment.

The 90% that's pattern-matching? The agents handle that now.

The Bottom Line

The Dark NOC isn't science fiction. It's running: five specialized AI agents, each with its own reasoning loop, its own tool chain, and permanent memory, investigating incidents autonomously, 24/7.

Happy New Year, and may the Year of the Horse bring speed, strength, and autonomous operations to your NOC. 🐴


This is Astra AI in Cloud Vista v15. If you're building autonomous operations or working on agentic AI for IT, I'd love to hear your approach in the comments.
