
Nguyen Dong

How 5 AI Agents Run Our SOC Autonomously — Architecture Deep Dive

We replaced a 6-person SOC team with 5 AI agents running 24/7 for $5/month in API costs. Here's the architecture.


The Problem: Alert Fatigue is Killing SMB Security

The average SOC receives 11,000 alerts per day. Enterprise teams with 10+ analysts struggle to keep up. Now imagine an SMB with zero security staff.

That was our starting point. We built VRadar — a cloud SOC platform for SMBs — and quickly realized that collecting alerts is useless if nobody's reading them. A dashboard with 1,000 unread alerts is the same as having no dashboard at all.

So we did something unconventional: we built 5 specialized AI agents, each handling a different aspect of SOC operations. Not one monolithic AI — five focused agents that collaborate.


The 5 Agents

Here's what each agent does and how they interact:

                        ┌─────────────────┐
                        │   AI Operator   │ ← Alert triage (GPT-4o-mini)
                        │   Every 5 min   │   Batch 100 alerts
                        └────────┬────────┘
                                 │ escalate / create incident
                        ┌────────▼────────┐
                        │   AI Monitor    │ ← Infrastructure health
                        │   Every 10 min  │   10 health checks
                        └────────┬────────┘
                                 │ degraded service alert
                        ┌────────▼────────┐
                        │  AI Optimizer   │ ← Resource + threat defense
                        │  & Firewall     │   Auto-block attackers
                        └────────┬────────┘
                                 │ knowledge for responses
                        ┌────────▼────────┐
                        │   AI Care       │ ← Customer support (RAG)
                        │   Real-time     │   Auto-reply chat + social
                        └────────┬────────┘
                                 │ content from knowledge base
                        ┌────────▼────────┐
                        │  AI Marketing   │ ← Content + social media
                        │   On-demand     │   Auto-reply FB comments
                        └─────────────────┘

Agent 1: AI Operator — The Autonomous SOC Analyst

Job: Triage every security alert and decide what to do.

How it works:

  • Cron job runs every 5 minutes
  • Pulls up to 100 unprocessed alerts from PostgreSQL
  • Each alert goes through GPT-4o-mini with function calling
  • LLM chooses from 5 actions: block_ip, create_incident, acknowledge, escalate, notify_customer
  • Each action executes real consequences (Wazuh Active Response, incident creation, notifications)
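The shape of this is easy to sketch. Below is a minimal, illustrative version of the function-calling setup (not VRadar's actual code — the tool names come from the article, everything else is assumed): the 5 actions are exposed to the model as tools, and a dispatcher converts whichever tool call the LLM returns into a typed action to execute.

```typescript
// The 5 triage actions as a discriminated union — one variant per tool.
type TriageAction =
  | { type: "block_ip"; ip: string }
  | { type: "create_incident"; title: string; severity: string }
  | { type: "acknowledge"; alertId: string }
  | { type: "escalate"; alertId: string; reason: string }
  | { type: "notify_customer"; alertId: string; message: string };

// Tool list in the OpenAI "tools" format. Real schemas would declare each
// tool's arguments; this sketch accepts any properties.
const triageTools = ["block_ip", "create_incident", "acknowledge", "escalate", "notify_customer"].map(
  (name) => ({
    type: "function" as const,
    function: { name, parameters: { type: "object", additionalProperties: true } },
  })
);

// Turn the model's chosen tool call into a concrete, typed action.
function dispatchToolCall(name: string, args: Record<string, unknown>): TriageAction {
  switch (name) {
    case "block_ip":
      return { type: "block_ip", ip: String(args.ip) };
    case "create_incident":
      return { type: "create_incident", title: String(args.title), severity: String(args.severity) };
    case "acknowledge":
      return { type: "acknowledge", alertId: String(args.alertId) };
    case "escalate":
      return { type: "escalate", alertId: String(args.alertId), reason: String(args.reason) };
    case "notify_customer":
      return { type: "notify_customer", alertId: String(args.alertId), message: String(args.message) };
    default:
      throw new Error(`Unknown triage tool: ${name}`);
  }
}
```

The win over free-text output: the `switch` either produces a valid action or throws, so there's no JSON-parsing failure mode in the execution path.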

The economics trick — Hybrid AI mode:

90%+ of security alerts are LOW or MEDIUM severity (Windows Event 4624 "successful login", Sysmon process creation, etc.). Sending these to GPT-4o costs $0.002/alert. At 1,000 alerts/day, that's $60/month per tenant.

Our Hybrid mode: LOW + MEDIUM → rule-based auto-acknowledge ($0), HIGH + CRITICAL → LLM triage ($0.0002/alert). Total: ~$2-5/month per tenant. 94% cost savings.

if (processingMode === 'hybrid_ai') {
  const lowMedAlerts = alerts.filter(a => 
    ['LOW', 'MEDIUM'].includes(a.severity));
  // Rule-based: auto-ack, $0
  await autoAcknowledgeLow(lowMedAlerts);

  const highCritAlerts = alerts.filter(a => 
    ['HIGH', 'CRITICAL'].includes(a.severity));
  // LLM: function calling, ~$0.0002/alert
  await processWithLLM(highCritAlerts);
}

Human-in-the-loop: Every AI decision is logged in AiOperatorDecision with confidence score. Admin can override any decision. There's an evaluation system (6 mock scenarios) for testing AI accuracy without executing real actions.

Agent 2: AI Monitor — Infrastructure Watchdog

Job: Ensure all 12 Docker containers and security services are healthy.

10 health checks (6 infra + 4 security):

| Check | What it does |
| --- | --- |
| Docker containers | Auto-discover + verify all 12 containers running |
| PostgreSQL | Connection + query latency |
| ClickHouse | Connection + table accessibility |
| Redis | Connection + memory usage |
| Disk usage | Alert if > 85% |
| Memory usage | Alert if > 90% |
| Anomaly detection | ML service health (IsolationForest + LSTM) |
| Agent heartbeat | Wazuh agent connectivity check |
| Failed logins | Brute force detection (suspicious patterns) |
| SSL certificate | TLS handshake to vradar.io:443, check expiry |

Runs every 10 minutes. Results stored in SystemConfig. Uptime trend visualization in the dashboard.
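A check runner like this is simple to structure. Here's a hedged sketch (the 85%/90% thresholds are from the table above; the interface and sample checks are my own stand-ins — real checks would query Docker, PostgreSQL, and so on):

```typescript
// Each check returns a pass/fail plus a human-readable detail string.
interface HealthCheck {
  name: string;
  run: () => Promise<{ ok: boolean; detail: string }>;
}

// Two illustrative checks with hard-coded readings; real ones would
// read actual disk/memory stats.
const checks: HealthCheck[] = [
  {
    name: "disk",
    run: async () => {
      const usedPct = 72;
      return { ok: usedPct <= 85, detail: `${usedPct}% used` }; // alert if > 85%
    },
  },
  {
    name: "memory",
    run: async () => {
      const usedPct = 93;
      return { ok: usedPct <= 90, detail: `${usedPct}% used` }; // alert if > 90%
    },
  },
];

// Run all checks concurrently; the system is healthy only if every check passes.
async function runHealthChecks(cs: HealthCheck[]) {
  const results = await Promise.all(cs.map(async (c) => ({ name: c.name, ...(await c.run()) })));
  return { healthy: results.every((r) => r.ok), results };
}
```

Hanging this off a 10-minute cron and persisting the `results` array gives you the uptime trend data for the dashboard.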

Agent 3: AI Optimizer & Firewall — Self-Defense System

Job: Optimize resources and auto-block attackers.

This agent is unique because it runs on every single HTTP request via middleware:

// threat-defense.middleware.ts — runs on EVERY request
app.use(threatDefenseMiddleware);

What it tracks per IP:

  • Request rate (Redis counter, 60-second window)
  • 4xx error rate (scanning detection)
  • Known-bad User-Agent patterns (nmap, sqlmap, nuclei, etc.)

Auto-response: IP exceeds threshold → blocked in Redis → all future requests return 403.
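The core of that loop fits in a few lines. This sketch swaps Redis for an in-memory `Map` so it runs standalone — the 60-second window is from the article, while the request threshold and the exact User-Agent patterns are illustrative:

```typescript
const WINDOW_MS = 60_000; // 60-second counting window, per the article
const MAX_REQUESTS = 300; // illustrative threshold
const BAD_AGENTS = [/nmap/i, /sqlmap/i, /nuclei/i]; // known scanner signatures

// In-memory stand-ins for the Redis counter and block set.
const hits = new Map<string, { count: number; windowStart: number }>();
const blocked = new Set<string>();

// Returns the HTTP status the middleware would respond with.
function checkRequest(ip: string, userAgent: string, now = Date.now()): number {
  if (blocked.has(ip)) return 403; // already blocked: reject immediately

  // Known-bad tooling gets blocked on first sight.
  if (BAD_AGENTS.some((re) => re.test(userAgent))) {
    blocked.add(ip);
    return 403;
  }

  // Sliding-window rate count per IP.
  const h = hits.get(ip);
  if (!h || now - h.windowStart >= WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return 200;
  }
  h.count++;
  if (h.count > MAX_REQUESTS) {
    blocked.add(ip); // threshold exceeded: block all future requests
    return 403;
  }
  return 200;
}
```

In the real system the `blocked` set lives in Redis, so the block survives restarts and is shared across processes.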

Resource monitoring (8 sub-checks):

  • OS disk/RAM usage
  • Redis memory consumption
  • ClickHouse table sizes across all tenants
  • AI cost tracking (LLM API calls in last 24h)
  • Expired session detection
  • Device capacity per tenant
  • Cross-service degradation alerts

Result: We've auto-blocked 2,197 malicious IPs on our VPS without human intervention. Current active blocks: 209.

Agent 4: AI Care — RAG-Powered Customer Support

Job: Auto-reply to customer chat messages using Retrieval-Augmented Generation.

Architecture:

Customer sends message → Chat API
    ↓
triggerAICareReply() — async
    ↓
ChromaDB semantic search (all-MiniLM-L6-v2 embeddings)
    ↓
Top 3 relevant knowledge chunks retrieved
    ↓
GPT-4o-mini generates reply with context
    ↓
Confidence check (threshold: 0.7)
    ↓
If confident → auto-reply as AI_CARE bot
If not → escalate to human agent

Knowledge base: Kreuzberg (document extraction + OCR) processes uploaded PDFs/DOCX → chunks → ChromaDB vector store. We pre-loaded 820 lines of VRadar product knowledge.
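The retrieval and confidence-gate steps can be sketched without the real stack. This toy version replaces ChromaDB + all-MiniLM-L6-v2 with plain cosine similarity over hand-made vectors — only the top-3 retrieval and the 0.7 threshold come from the pipeline above; everything else is assumed:

```typescript
// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Chunk { text: string; vector: number[] }

// Rank knowledge chunks by similarity to the query and keep the top k (3 in AI Care).
function retrieve(query: number[], kb: Chunk[], k = 3): { chunk: Chunk; score: number }[] {
  return kb
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Confidence gate: auto-reply only above the 0.7 threshold, else hand off.
function decide(confidence: number, threshold = 0.7): "auto_reply" | "escalate_to_human" {
  return confidence >= threshold ? "auto_reply" : "escalate_to_human";
}
```

The key design point is the escalation branch: a below-threshold answer never reaches the customer, it reaches a human.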

Bonus: Works on Facebook Messenger and Zalo OA too. Same RAG pipeline, different input channels.

Agent 5: AI Marketing — Content & Social Manager

Job: Generate marketing content and manage social media interactions.

  • Content generation: 5 channels (Facebook, LinkedIn, Zalo, Blog, Email) with distinct tones
  • DALL-E 3 image generation: Branded cybersecurity visuals ($0.04/image)
  • Facebook comment auto-reply: Webhook receives mentions → RAG-enriched AI response → posts via Graph API
  • Smart scheduling: Platform-specific optimal posting times

The Shared Brain: Unified LLM Service

All 5 agents share one llm.service.ts:

const response = await callLLMText(prompt, {
  model: 'gpt-4o-mini',     // Default for all agents
  maxTokens: 500,            // Cost control
  systemPrompt: agentPrompt, // Agent-specific context
});

Key design decisions:

  1. Single model everywhere: GPT-4o-mini ($0.15/1M tokens) instead of GPT-4o ($2.50/1M). 94% savings, negligible quality difference for SOC tasks.
  2. Redis caching: Knowledge search results (1h TTL) + AI replies (30min TTL). Same question = cached answer = $0.
  3. Graceful degradation: If OpenAI is down, agents log the failure but don't crash. Security monitoring continues without AI triage.
  4. Cost tracking: Every LLM call logged with token count. Dashboard shows daily/weekly AI spend.
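Decision #2 is worth a sketch. Here's an in-memory TTL cache standing in for Redis so it runs anywhere — the 30-minute reply TTL is from the article, the class and key format are illustrative:

```typescript
// Tiny TTL cache; in production this would be Redis SET with EX.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  set(key: string, value: V, ttlMs: number, now = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + ttlMs });
  }

  get(key: string, now = Date.now()): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.store.delete(key); // lazy expiry on read
      return undefined;
    }
    return entry.value;
  }
}

const REPLY_TTL_MS = 30 * 60 * 1000; // 30-min TTL for AI replies, per the article
const replyCache = new TtlCache<string>();

// Usage: check the cache before calling the LLM; a hit costs $0.
replyCache.set("q:what-is-vradar", "VRadar is a cloud SOC platform for SMBs.", REPLY_TTL_MS);
```

Keying on a normalized form of the question is what makes the "same question 50 times" case free after the first answer.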

Running Cost Breakdown

| Agent | Frequency | Cost/month (per tenant) |
| --- | --- | --- |
| AI Operator (Hybrid) | Every 5 min | ~$2-5 |
| AI Monitor | Every 10 min | ~$0.50 |
| AI Optimizer | Continuous (middleware) | $0 (rule-based) |
| AI Care | On-demand (chat) | ~$0.10-1.00 |
| AI Marketing | On-demand | ~$0.50-2.00 |
| **Total** | | **~$3-8/month** |

Compare this to a 6-person SOC team: $300K-600K/year in salaries alone.


Lessons From Building AI Agents for Production

  1. Don't build one mega-agent. Specialized agents with clear boundaries are easier to debug, test, and iterate. Our AI Operator went through 4 rewrites — without touching the other agents.

  2. Hybrid AI is mandatory. Sending every LOW-severity alert to an LLM is burning money. Rule-based filtering for the 90% + LLM for the 10% = same security, 94% less cost.

  3. Function calling > text parsing. GPT-4o-mini with structured function calling (block_ip, create_incident) is dramatically more reliable than asking it to output JSON or parse text responses.

  4. Cache aggressively. The same customer asks "What is VRadar?" 50 times. Redis cache = $0 after the first answer.

  5. Log everything. Every AI decision, every confidence score, every action taken. When a customer asks "why did AI block this IP?", you need the audit trail.

  6. Build an evaluation mode. Our AI Operator has 6 test scenarios that run through the full LLM pipeline without executing real actions. Test before you deploy.
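An evaluation harness of that kind is mostly bookkeeping. In this sketch the scenario contents and the stand-in triage rules are invented (the real system runs the full LLM pipeline with execution stubbed out); the point is the shape — expected vs. predicted action, no side effects:

```typescript
interface Scenario {
  alert: { severity: string; rule: string };
  expected: string; // the action a correct triage should choose
}

// Trivial rule-based stand-in for the triage decision; the real evaluator
// would call the LLM function-calling pipeline in dry-run mode instead.
function triage(alert: { severity: string; rule: string }): string {
  if (alert.rule.includes("brute")) return "block_ip";
  if (alert.severity === "CRITICAL") return "create_incident";
  return "acknowledge";
}

// Run every scenario, compare predicted vs. expected, report accuracy.
function evaluate(scenarios: Scenario[]) {
  const results = scenarios.map((s) => {
    const got = triage(s.alert);
    return { ...s, got, pass: got === s.expected };
  });
  return { accuracy: results.filter((r) => r.pass).length / results.length, results };
}
```

Because nothing in `evaluate` executes an action, you can run it on every prompt change before anything touches production.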


Try VRadar

VRadar is live at vradar.io — AI-powered SOC from $25/device/month.

All 5 agents are running in production right now, monitoring real customer networks across ASEAN.


I'm Dong, solo dev from Vietnam. Built VRadar's 5-agent SOC system over 3 months. Happy to deep-dive on any architectural question in the comments.


Tags: #ai #cybersecurity #soc #llm #gpt4 #architecture #startup #buildinpublic #agents #aisecurity #devto #opensource
