
Slavyan Donev


How I Built an AI-Powered Alert Triage System — Now with MCP Architecture


#agents #ai #automation #cybersecurity

The Problem

Every SOC analyst and MSP team I've talked to has the same complaint:

"We get 200 alerts a day. Maybe 10 are real. But someone has to check all 200."

That's alert fatigue. And it's not a small problem — the average analyst spends 3-5 hours daily on manual triage. Most of that time is wasted on false positives.

I decided to build something to fix this. Two weeks later, I had a working MVP. Then I went a step further and refactored it with Model Context Protocol (MCP) — Anthropic's open standard for connecting AI agents to external tools. Here's exactly how I built it.


The Architecture (v2 — MCP Edition)

The original system had the agent calling tools directly. The new architecture introduces an MCP server as a modular tool layer:

Alert Input (Defender/SentinelOne/JSON)
        ↓
Alert Normalizer
        ↓
LangGraph Triage Agent
  ├── Enrich Node  ──► MCP Client
  ├── Analyze Node        ↓
  └── Human-in-the-Loop  MCP Server
        ↓             ├── virustotal_check()
Output (Risk Score    ├── mitre_lookup()
+ Slack + Audit Log)  └── slack_notify()

Why MCP? Instead of hardcoding tool calls inside the agent, MCP separates them into a dedicated server. The agent doesn't care how VirusTotal works — it just calls a tool by name and gets a result. This makes the system modular, testable, and easy to extend.


Step 1: Alert Normalizer

The first challenge: every security tool outputs alerts in a different format. Defender looks different from SentinelOne, which looks different from a generic SIEM.

I built a normalizer that takes any alert format and converts it to a single internal structure:

from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str          # defender / sentinelone / generic
    severity: str        # Low / Medium / High / Critical
    title: str
    timestamp: str
    mitre_technique: Optional[str]
    hostname: Optional[str]
    username: Optional[str]
    source_ip: Optional[str]
    raw: dict            # Original alert for audit

This means the rest of the system doesn't care where the alert came from. It always works with the same format.
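To make the idea concrete, here is a minimal sketch of how such a normalizer could work with a per-source field map. The function name normalize_alert_sketch, the FIELD_MAPS table, and every vendor field name in it (AlertId, Sev, threatId, and so on) are placeholders invented for illustration, not the real Defender or SentinelOne schemas and not necessarily how the project does it.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class NormalizedAlert:
    alert_id: str
    source: str
    severity: str
    title: str
    timestamp: str
    mitre_technique: Optional[str]
    hostname: Optional[str]
    username: Optional[str]
    source_ip: Optional[str]
    raw: dict


# HYPOTHETICAL field names: internal field -> key in the vendor payload.
# Real payloads will use different keys; "generic" maps every field to itself.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "defender": {"alert_id": "AlertId", "severity": "Sev",
                 "title": "Name", "hostname": "Host"},
    "sentinelone": {"alert_id": "threatId", "severity": "level",
                    "title": "threatName", "hostname": "agentName"},
    "generic": {},
}

SEVERITY_ALIASES = {"informational": "Low", "low": "Low", "medium": "Medium",
                    "high": "High", "critical": "Critical"}


def normalize_alert_sketch(raw: dict[str, Any], source: str) -> NormalizedAlert:
    """Map a vendor-specific alert dict onto the shared structure."""
    known_source = source if source in FIELD_MAPS else "generic"
    field_map = FIELD_MAPS[known_source]

    def pick(field: str) -> Optional[str]:
        # Fall back to the internal field name when the map has no entry.
        value = raw.get(field_map.get(field, field))
        return None if value is None else str(value)

    # Assumption: unknown or missing severities default to Medium.
    severity = SEVERITY_ALIASES.get((pick("severity") or "").lower(), "Medium")
    return NormalizedAlert(
        alert_id=pick("alert_id") or "unknown",
        source=known_source,
        severity=severity,
        title=pick("title") or "",
        timestamp=pick("timestamp") or "",
        mitre_technique=pick("mitre_technique"),
        hostname=pick("hostname"),
        username=pick("username"),
        source_ip=pick("source_ip"),
        raw=raw,  # keep the original payload for the audit log
    )
```

Adding a new alert source then only means adding one more entry to the field map, which is the main appeal of this design.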


Step 2: LangGraph State Machine

I used LangGraph to build the agent as a state machine. Each step in the triage process is a separate node:

from typing import Optional, TypedDict

class TriageState(TypedDict):
    alert: dict
    enrichment: Optional[dict]
    risk_score: Optional[int]
    risk_level: Optional[str]
    explanation: Optional[str]
    recommendation: Optional[str]
    needs_human: Optional[bool]
    error: Optional[str]

The graph flows like this:

enrich → analyze → [human_review if score >= 70] → format_output

Why LangGraph instead of a simple chain? Because real triage isn't linear. You need conditional routing — a Critical alert should follow a different path than a Low one. LangGraph makes this explicit and debuggable.
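As a rough illustration of that conditional routing, the decision after the analyze step can be written as a plain function that returns the name of the next node. The function name route_after_analyze and the choice to send errors and missing scores to a human are my assumptions for this sketch; only the threshold of 70 and the node names come from the flow above.

```python
from typing import Optional, TypedDict

HUMAN_REVIEW_THRESHOLD = 70  # from the flow above: human_review if score >= 70


class TriageState(TypedDict, total=False):
    # Only the fields the router needs; the full state has more.
    risk_score: Optional[int]
    error: Optional[str]


def route_after_analyze(state: TriageState) -> str:
    """Return the name of the node that should run after analyze."""
    score = state.get("risk_score")
    # Assumption: if analysis failed or produced no score, fail safe to a human.
    if state.get("error") or score is None:
        return "human_review"
    if score >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "format_output"


# In LangGraph, a router like this is registered on the graph builder, e.g.:
#   graph.add_conditional_edges(
#       "analyze",
#       route_after_analyze,
#       {"human_review": "human_review", "format_output": "format_output"},
#   )
```

Keeping the router a pure function of the state makes the branching easy to unit test without running the whole graph.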


Step 3: The MCP Server (New in v2)

This is the biggest architectural change. All three tools (the two enrichment tools plus the Slack notifier) are now exposed via a FastMCP server:

# mcp-server/server.py
from mcp.server.fastmcp import FastMCP
from tools.virustotal import check_ip
from tools.mitre import get_technique_summary
from tools.slack_notifier import send_alert_notification

mcp = FastMCP("Alert Triage MCP Server")

@mcp.tool()
def virustotal_check(ip: str) -> str:
    """Check an IP address in VirusTotal and return reputation data."""
    result = check_ip(ip)
    return f"IP: {result.ip} | Malicious: {result.malicious_votes} | Known bad: {result.is_known_bad}"

@mcp.tool()
def mitre_lookup(technique_id: str) -> str:
    """Look up a MITRE ATT&CK technique by ID (e.g. T1059.001)."""
    return get_technique_summary(technique_id)

@mcp.tool()
def slack_notify(alert_id: str, risk_score: int, ...) -> str:  # remaining parameters elided
    """Send a Slack notification for a critical alert."""
    # Building triage_result from the arguments is not shown in this snippet.
    success = send_alert_notification(triage_result)
    return "Sent" if success else "Failed"

if __name__ == "__main__":
    mcp.run()

The agent connects to this server via an MCP client wrapper:

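# agents/mcp_tools.py
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Note: each call starts a fresh server subprocess over stdio and closes it afterwards.
async def _call_tool(tool_name: str, args: dict) -> str: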
    server_params = StdioServerParameters(
        command="python", args=["mcp-server/server.py"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(tool_name, args)
            return result.content[0].text

def virustotal_check(ip: str) -> str:
    return asyncio.run(_call_tool("virustotal_check", {"ip": ip}))

def mitre_lookup(technique_id: str) -> str:
    return asyncio.run(_call_tool("mitre_lookup", {"technique_id": technique_id}))

The result: The LangGraph agent no longer imports tools directly. It goes through MCP — clean separation of concerns.


Step 4: Enrichment Tools

Before the LLM sees the alert, two tools run automatically via MCP:

VirusTotal IP Lookup

Why this matters: An alert marked "Low severity" came in for SSH login attempts. The source IP had 4 malicious votes on VirusTotal. The system automatically escalated it to High. Without enrichment, that alert would have been ignored.
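The post does not say whether that escalation comes from a fixed rule or from the LLM's judgment. If you wanted a deterministic safety net before the model runs, a rule could look like the sketch below. The function name escalate_severity, the vote threshold of 3, and the "High" floor are all assumptions chosen to match the SSH example, not values taken from the project.

```python
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]


def escalate_severity(
    severity: str,
    malicious_votes: int,
    vote_threshold: int = 3,   # assumption: not a value from the original system
    floor: str = "High",       # assumption: flagged IPs are raised to at least High
) -> str:
    """Raise severity to at least `floor` when enough engines flag the IP."""
    if malicious_votes < vote_threshold:
        return severity
    # Pick whichever of the two ranks higher, so Critical is never downgraded.
    return max(severity, floor, key=SEVERITY_ORDER.index)
```

With these defaults, escalate_severity("Low", 4) yields "High", mirroring the SSH case, while an alert that is already Critical stays Critical.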

MITRE ATT&CK Context

Instead of hitting an API for every request, I built a local database of the most common techniques:

MITRE_DB = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands, often with encoded payloads...",
        "high"
    ),
    "T1486": MitreTechnique(
        "T1486", "Data Encrypted for Impact (Ransomware)", "Impact",
        "Adversary encrypts data to disrupt availability...",
        "high"
    ),
}

This context goes directly into the LLM prompt — giving the model real knowledge about what each technique means and how dangerous it is.
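Here is one way the lookup result could be turned into a prompt snippet. The MitreTechnique field names (technique_id, name, tactic, description, severity) are guessed from the positional arguments in the database above, and mitre_context is a hypothetical helper, so treat this as a sketch rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MitreTechnique:
    # Field names are assumptions inferred from the positional arguments.
    technique_id: str
    name: str
    tactic: str
    description: str
    severity: str


MITRE_DB = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands, often with encoded payloads...",
        "high",
    ),
}

NO_ENTRY = "MITRE ATT&CK: no local entry for this alert."


def mitre_context(technique_id: Optional[str]) -> str:
    """Build a short text block describing the technique for the LLM prompt."""
    technique = MITRE_DB.get(technique_id) if technique_id else None
    if technique is None:
        # Say so explicitly, so the model does not invent technique details.
        return NO_ENTRY
    return (
        f"MITRE ATT&CK {technique.technique_id} ({technique.name}), "
        f"tactic: {technique.tactic}, baseline severity: {technique.severity}. "
        f"{technique.description}"
    )
```

Returning an explicit "no local entry" line for unknown IDs keeps the prompt honest about what the local database does and does not cover.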


Step 5: The LLM Analysis

The Triage Agent sends the enriched alert to Groq (Llama 3.3 70B) with a structured prompt that asks the model to reply in JSON:

{
  "risk_score": 95,
  "risk_level": "Critical",
  "explanation": "The source IP is flagged as MALICIOUS by 17 VirusTotal engines...",
  "recommendation": "Block IP immediately and isolate the device.",
  "needs_human": true
}

Key design decision: temperature 0.1. Security analysis needs consistency, not creativity.
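A low temperature reduces variation but does not guarantee well-formed output, so it helps to validate the reply before it enters the state. The sketch below is one possible approach. The function name parse_llm_verdict, the clamping to 0-100, and the fail-safe to human review are my assumptions, not details from the project.

```python
import json

VALID_LEVELS = {"Low", "Medium", "High", "Critical"}
HUMAN_REVIEW_THRESHOLD = 70  # same threshold the graph uses for human review


def parse_llm_verdict(raw_text: str) -> dict:
    """Parse the model's JSON reply; fall back to human review on any problem."""
    try:
        data = json.loads(raw_text)
        score = int(data["risk_score"])
        level = str(data["risk_level"])
        if level not in VALID_LEVELS:
            raise ValueError(f"unexpected risk_level: {level}")
    except (ValueError, KeyError, TypeError):
        # Assumption: an unreadable verdict is never silently dropped.
        return {
            "risk_score": None,
            "risk_level": None,
            "explanation": "",
            "recommendation": "",
            "needs_human": True,
            "error": "unparseable LLM response",
        }

    score = max(0, min(100, score))  # clamp out-of-range scores
    return {
        "risk_score": score,
        "risk_level": level,
        "explanation": str(data.get("explanation", "")),
        "recommendation": str(data.get("recommendation", "")),
        # Force review above the threshold even if the model said otherwise.
        "needs_human": bool(data.get("needs_human", False))
        or score >= HUMAN_REVIEW_THRESHOLD,
        "error": None,
    }
```

The returned keys line up with the TriageState fields, so the result can be merged into the graph state directly.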


Step 6: Human-in-the-Loop

For any alert with risk score >= 70, the MCP slack_notify tool fires a formatted Slack notification. AI assists — humans decide on critical actions.


Step 7: REST API with FastAPI

from fastapi import APIRouter

router = APIRouter()

@router.post("/triage", response_model=TriageResponse)
def triage_alert(alert_request: AlertRequest):
    normalized = normalize_alert(alert_request.model_dump(exclude_none=True))
    result = run_triage(normalized)
    return TriageResponse(...)

Microsoft Defender can now send a webhook to POST /triage and get back a full analysis in ~3 seconds.


Real Results

Running 6 sample alerts through the system:

  • A "Low severity" SSH alert was escalated to High because VirusTotal flagged the source IP (4 malicious votes)
  • A data exfiltration alert scored 95/100 (Critical): the destination IP had 17 malicious VirusTotal votes and was a known Tor exit node used for C2

Tech Stack

Component        | Technology
Agent framework  | LangGraph
LLM              | Groq, Llama 3.3 70B (free tier)
Tool layer       | MCP (Model Context Protocol)
Threat intel     | VirusTotal API (free tier)
ATT&CK mapping   | Local MITRE database
Notifications    | Slack Webhooks
API              | FastAPI

Total cost for MVP: $0


Key Lessons

  1. MCP separates tools from agents — your agent becomes a thin client, tools become reusable services
  2. Enrich before you analyze — LLM without real threat intel is just guessing
  3. LangGraph over simple chains — conditional routing requires a proper state machine
  4. Human-in-the-Loop is not optional — never automate critical security decisions
  5. Start with the data — understanding real alerts before coding saved hours

Currently looking for MSP and SOC teams for a free 2-week pilot.

If your team deals with alert fatigue — comment below or DM me.

GitHub: [alert-triage-mvp] | Built with LangGraph + MCP + Groq
