How I Built an AI-Powered Alert Triage System — Now with MCP Architecture
#agents #ai #automation #cybersecurity
The Problem
Every SOC analyst and MSP team I've talked to has the same complaint:
"We get 200 alerts a day. Maybe 10 are real. But someone has to check all 200."
That's alert fatigue. And it's not a small problem — the average analyst spends 3-5 hours daily on manual triage. Most of that time is wasted on false positives.
I decided to build something to fix this. Two weeks later, I had a working MVP. Then I went a step further and refactored it with Model Context Protocol (MCP) — Anthropic's open standard for connecting AI agents to external tools. Here's exactly how I built it.
The Architecture (v2 — MCP Edition)
The original system had the agent calling tools directly. The new architecture introduces an MCP server as a modular tool layer:
Alert Input (Defender/SentinelOne/JSON)
        ↓
Alert Normalizer
        ↓
LangGraph Triage Agent
  ├── Enrich Node ──────────► MCP Client
  ├── Analyze Node                ↓
  └── Human-in-the-Loop       MCP Server
        ↓                       ├── virustotal_check()
Output (Risk Score              ├── mitre_lookup()
+ Slack + Audit Log)            └── slack_notify()
Why MCP? Instead of hardcoding tool calls inside the agent, MCP separates them into a dedicated server. The agent doesn't care how VirusTotal works — it just calls a tool by name and gets a result. This makes the system modular, testable, and easy to extend.
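To make the "testable" claim concrete, here is a minimal sketch, not the project's actual code. It assumes the enrich step receives a generic `call_tool(name, args)` callable; `enrich`, `ToolCaller`, and `fake_call_tool` are hypothetical names. Only the tool names and their argument keys come from the MCP server described in this post.

```python
from typing import Callable, Dict

# A tool caller takes a tool name plus arguments and returns the tool's text output.
ToolCaller = Callable[[str, Dict[str, str]], str]


def enrich(alert: Dict[str, str], call_tool: ToolCaller) -> Dict[str, str]:
    """Collect enrichment by calling tools by name only (hypothetical helper)."""
    enrichment: Dict[str, str] = {}
    if alert.get("source_ip"):
        enrichment["virustotal"] = call_tool(
            "virustotal_check", {"ip": alert["source_ip"]}
        )
    if alert.get("mitre_technique"):
        enrichment["mitre"] = call_tool(
            "mitre_lookup", {"technique_id": alert["mitre_technique"]}
        )
    return enrichment


def fake_call_tool(name: str, args: Dict[str, str]) -> str:
    """Test double standing in for the real MCP client."""
    return f"{name} called with {sorted(args.items())}"


if __name__ == "__main__":
    print(enrich({"source_ip": "203.0.113.7"}, fake_call_tool))
```

Because the agent only knows tool names, swapping the fake for a real MCP client should not require touching the enrichment logic.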
Step 1: Alert Normalizer
The first challenge: every security tool outputs alerts in a different format. Defender looks different from SentinelOne, which looks different from a generic SIEM.
I built a normalizer that takes any alert format and converts it to a single internal structure:
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str                     # defender / sentinelone / generic
    severity: str                   # Low / Medium / High / Critical
    title: str
    timestamp: str
    mitre_technique: Optional[str]
    hostname: Optional[str]
    username: Optional[str]
    source_ip: Optional[str]
    raw: dict                       # Original alert, kept for audit
This means the rest of the system doesn't care where the alert came from. It always works with the same format.
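A minimal sketch of what the normalizing step could look like. The raw field names (`alertId`, `deviceName`, `remoteIp`, and so on) and the severity map are illustrative assumptions, not the vendors' actual schemas, and the real `normalize_alert` in the project may work differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizedAlert:
    alert_id: str
    source: str
    severity: str
    title: str
    timestamp: str
    mitre_technique: Optional[str]
    hostname: Optional[str]
    username: Optional[str]
    source_ip: Optional[str]
    raw: dict


# Assumed severity spellings mapped onto the four internal levels.
SEVERITY_MAP = {
    "informational": "Low",
    "low": "Low",
    "medium": "Medium",
    "high": "High",
    "critical": "Critical",
}


def normalize_alert(raw: dict) -> NormalizedAlert:
    """Map a raw alert dict onto NormalizedAlert (all raw keys are hypothetical)."""
    if "alertId" in raw:  # assumed marker of a Defender-style payload
        source = "defender"
        fields = {
            "alert_id": raw["alertId"],
            "title": raw.get("title", ""),
            "timestamp": raw.get("creationTime", ""),
            "hostname": raw.get("deviceName"),
            "username": raw.get("userName"),
            "source_ip": raw.get("remoteIp"),
        }
    else:  # generic fallback for anything unrecognized
        source = "generic"
        fields = {
            "alert_id": str(raw.get("id", "unknown")),
            "title": raw.get("title", ""),
            "timestamp": raw.get("timestamp", ""),
            "hostname": raw.get("hostname"),
            "username": raw.get("username"),
            "source_ip": raw.get("source_ip"),
        }
    # Unknown severity strings fall back to Medium rather than being dropped.
    severity = SEVERITY_MAP.get(str(raw.get("severity", "")).lower(), "Medium")
    return NormalizedAlert(
        source=source,
        severity=severity,
        mitre_technique=raw.get("mitre_technique"),
        raw=raw,
        **fields,
    )
```

Keeping the untouched payload in `raw` means the audit log can always show exactly what the upstream tool sent.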
Step 2: LangGraph State Machine
I used LangGraph to build the agent as a state machine. Each step in the triage process is a separate node:
from typing import Optional, TypedDict

class TriageState(TypedDict):
    alert: dict
    enrichment: Optional[dict]
    risk_score: Optional[int]
    risk_level: Optional[str]
    explanation: Optional[str]
    recommendation: Optional[str]
    needs_human: Optional[bool]
    error: Optional[str]
The graph flows like this:
enrich → analyze → [human_review if score >= 70] → format_output
Why LangGraph instead of a simple chain? Because real triage isn't linear. You need conditional routing — a Critical alert should follow a different path than a Low one. LangGraph makes this explicit and debuggable.
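The conditional step can be sketched as a plain routing function. This is an assumption about how the branch might be written, not the project's code: `route_after_analyze` is a hypothetical name, and the node names and the threshold of 70 are taken from the flow above.

```python
from typing import Optional, TypedDict

HUMAN_REVIEW_THRESHOLD = 70  # score >= 70 goes to human review


class TriageState(TypedDict, total=False):
    # Trimmed to the fields the router needs.
    alert: dict
    risk_score: Optional[int]
    needs_human: Optional[bool]


def route_after_analyze(state: TriageState) -> str:
    """Return the name of the next node based on the risk score."""
    score = state.get("risk_score") or 0  # a missing or None score counts as 0
    if score >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "format_output"
```

In LangGraph, a function like this would typically be registered on the graph with `add_conditional_edges("analyze", route_after_analyze)`, so the branch is visible in the graph definition instead of being buried inside a node.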
Step 3: The MCP Server (New in v2)
This is the biggest architectural change. All three enrichment tools are now exposed via a FastMCP server:
# mcp-server/server.py
from mcp.server.fastmcp import FastMCP

from tools.virustotal import check_ip
from tools.mitre import get_technique_summary
from tools.slack_notifier import send_alert_notification

mcp = FastMCP("Alert Triage MCP Server")

@mcp.tool()
def virustotal_check(ip: str) -> str:
    """Check an IP address in VirusTotal and return reputation data."""
    result = check_ip(ip)
    return f"IP: {result.ip} | Malicious: {result.malicious_votes} | Known bad: {result.is_known_bad}"

@mcp.tool()
def mitre_lookup(technique_id: str) -> str:
    """Look up a MITRE ATT&CK technique by ID (e.g. T1059.001)."""
    return get_technique_summary(technique_id)

@mcp.tool()
def slack_notify(alert_id: str, risk_score: int, ...) -> str:  # remaining parameters elided
    """Send a Slack notification for a critical alert."""
    # triage_result is not defined in this excerpt; presumably it is built from the arguments above
    success = send_alert_notification(triage_result)
    return "Sent" if success else "Failed"

if __name__ == "__main__":
    mcp.run()
The agent connects to this server via an MCP client wrapper:
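# agents/mcp_tools.py
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def _call_tool(tool_name: str, args: dict) -> str:
    # Start the MCP server as a subprocess and talk to it over stdio
    server_params = StdioServerParameters(
        command="python", args=["mcp-server/server.py"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(tool_name, args)
            # The first content item carries the tool's text result
            return result.content[0].text

# Synchronous wrappers so the LangGraph nodes can call the tools directly
def virustotal_check(ip: str) -> str:
    return asyncio.run(_call_tool("virustotal_check", {"ip": ip}))

def mitre_lookup(technique_id: str) -> str:
    return asyncio.run(_call_tool("mitre_lookup", {"technique_id": technique_id}))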
The result: The LangGraph agent no longer imports tools directly. It goes through MCP — clean separation of concerns.
Step 4: Enrichment Tools
Before the LLM sees the alert, two tools run automatically via MCP:
VirusTotal IP Lookup
Why this matters: An alert marked "Low severity" came in for SSH login attempts. The source IP had 4 malicious votes on VirusTotal. The system automatically escalated it to High. Without enrichment, that alert would have been ignored.
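A rough sketch of how the vote count could be pulled out of a VirusTotal v3 IP report and turned into an escalation. The `IPReputation` fields mirror the attributes used in the MCP server (`ip`, `malicious_votes`, `is_known_bad`). The threshold of 3 and the `escalate` rule are assumptions made for illustration; in the real system the final call may be left to the LLM.

```python
from dataclasses import dataclass


@dataclass
class IPReputation:
    ip: str
    malicious_votes: int
    is_known_bad: bool


# Assumption: treat an IP as known bad once this many engines flag it.
KNOWN_BAD_THRESHOLD = 3

SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]


def parse_vt_ip_report(ip: str, report: dict) -> IPReputation:
    """Extract the malicious-engine count from a VirusTotal v3 IP report dict."""
    stats = (
        report.get("data", {})
        .get("attributes", {})
        .get("last_analysis_stats", {})
    )
    malicious = int(stats.get("malicious", 0))
    return IPReputation(ip, malicious, malicious >= KNOWN_BAD_THRESHOLD)


def escalate(severity: str, reputation: IPReputation) -> str:
    """Raise severity to at least High when the IP is known bad (hypothetical rule)."""
    if not reputation.is_known_bad:
        return severity
    floor = SEVERITY_ORDER.index("High")
    current = SEVERITY_ORDER.index(severity)
    return SEVERITY_ORDER[max(current, floor)]
```

With a report showing 4 malicious votes, `escalate("Low", ...)` returns `"High"`, which matches the SSH example above. An alert that is already Critical stays Critical.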
MITRE ATT&CK Context
Instead of hitting an API for every request, I built a local database of the most common techniques:
MITRE_DB = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands, often with encoded payloads...",
        "high"
    ),
    "T1486": MitreTechnique(
        "T1486", "Data Encrypted for Impact (Ransomware)", "Impact",
        "Adversary encrypts data to disrupt availability...",
        "high"
    ),
}
This context goes directly into the LLM prompt — giving the model real knowledge about what each technique means and how dangerous it is.
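For illustration, here is one way the lookup behind `get_technique_summary` might be shaped. The `MitreTechnique` field names are assumptions inferred from the positional arguments in the database above, and the summary format is invented for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MitreTechnique:
    # Field names are assumptions inferred from the positional arguments.
    technique_id: str
    name: str
    tactic: str
    description: str
    severity: str


MITRE_DB: Dict[str, MitreTechnique] = {
    "T1059.001": MitreTechnique(
        "T1059.001", "PowerShell", "Execution",
        "Adversaries use PowerShell to execute commands, often with encoded payloads...",
        "high",
    ),
}


def get_technique_summary(technique_id: str) -> str:
    """Return a one-line summary for the prompt, or a fallback for unknown IDs."""
    technique = MITRE_DB.get(technique_id)
    if technique is None:
        return f"{technique_id}: not in local database"
    return (
        f"{technique.technique_id} {technique.name} "
        f"(tactic: {technique.tactic}, severity: {technique.severity}): "
        f"{technique.description}"
    )
```

Returning an explicit fallback string for unknown IDs keeps the prompt honest: the model sees that no context was available instead of silently getting nothing.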
Step 5: The LLM Analysis
The Triage Agent sends the enriched alert to Groq (Llama 3.3 70B) with a structured prompt that asks the model to reply in JSON:
{
    "risk_score": 95,
    "risk_level": "Critical",
    "explanation": "The source IP is flagged as MALICIOUS by 17 VirusTotal engines...",
    "recommendation": "Block IP immediately and isolate the device.",
    "needs_human": true
}
Key design decision: temperature 0.1. Security analysis needs consistency, not creativity.
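Even at a low temperature, the model can return malformed or out-of-range output, so the reply is worth validating before it reaches the state. A minimal sketch, assuming the JSON shape shown above. `parse_llm_verdict` is a hypothetical helper, and sending failures to a human is my assumption about a sensible default.

```python
import json

VALID_LEVELS = {"Low", "Medium", "High", "Critical"}


def parse_llm_verdict(text: str) -> dict:
    """Parse and sanity-check the model's JSON reply; flag for review on any problem."""
    try:
        data = json.loads(text)
        score = int(data["risk_score"])
        level = data["risk_level"]
        if level not in VALID_LEVELS:
            raise ValueError(f"unexpected risk_level: {level!r}")
    except (ValueError, KeyError, TypeError) as exc:
        # Fail safe: an unparseable verdict is routed to a human.
        return {
            "risk_score": None,
            "risk_level": None,
            "needs_human": True,
            "error": str(exc),
        }
    return {
        "risk_score": max(0, min(100, score)),  # clamp to the 0-100 range
        "risk_level": level,
        "explanation": data.get("explanation", ""),
        "recommendation": data.get("recommendation", ""),
        "needs_human": bool(data.get("needs_human", False)),
        "error": None,
    }
```

The returned keys line up with `TriageState`, so the result can be merged into the graph state as-is.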
Step 6: Human-in-the-Loop
For any alert with risk score >= 70, the MCP slack_notify tool fires a formatted Slack notification. AI assists — humans decide on critical actions.
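As a sketch of the notification side: Slack incoming webhooks accept a JSON body with a `text` field, so the message can be built and posted with the standard library alone. The function names and the message wording here are hypothetical. The project's own `send_alert_notification` may format things differently.

```python
import json
import urllib.request


def build_slack_payload(
    alert_id: str, risk_score: int, risk_level: str, recommendation: str
) -> dict:
    """Build the JSON body for a Slack incoming webhook (wording is illustrative)."""
    headline = f":rotating_light: *{risk_level}* alert `{alert_id}` (risk {risk_score}/100)"
    return {"text": f"{headline}\n{recommendation}"}


def send_to_slack(webhook_url: str, payload: dict) -> bool:
    """POST the payload to the webhook; True when Slack answers with HTTP 200."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Splitting the payload builder from the sender keeps the formatting testable without a real webhook URL.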
Step 7: REST API with FastAPI
from fastapi import APIRouter

router = APIRouter()

@router.post("/triage", response_model=TriageResponse)
def triage_alert(alert_request: AlertRequest):
    normalized = normalize_alert(alert_request.model_dump(exclude_none=True))
    result = run_triage(normalized)
    return TriageResponse(...)
Microsoft Defender can now send a webhook to POST /triage and get back a full analysis in ~3 seconds.
Real Results
Running 6 sample alerts through the system:
- A "Low severity" SSH alert was escalated to High because VirusTotal flagged the source IP (4 malicious votes)
- A data exfiltration alert scored 95/100 (Critical): the destination IP had 17 malicious votes on VirusTotal and was a known Tor exit node used for C2
Tech Stack
| Component | Technology |
|---|---|
| Agent framework | LangGraph |
| LLM | Groq — Llama 3.3 70B (free tier) |
| Tool layer | MCP — Model Context Protocol |
| Threat intel | VirusTotal API (free tier) |
| ATT&CK mapping | Local MITRE database |
| Notifications | Slack Webhooks |
| API | FastAPI |
Total cost for MVP: $0
Key Lessons
- MCP separates tools from agents — your agent becomes a thin client, tools become reusable services
- Enrich before you analyze — LLM without real threat intel is just guessing
- LangGraph over simple chains — conditional routing requires a proper state machine
- Human-in-the-Loop is not optional — never automate critical security decisions
- Start with the data — understanding real alerts before coding saved hours
Currently looking for MSP and SOC teams for a free 2-week pilot.
If your team deals with alert fatigue — comment below or DM me.
GitHub: [alert-triage-mvp] | Built with LangGraph + MCP + Groq