RedSOC: Open-source framework to benchmark adversarial attacks on AI-powered SOCs — 100% detection rate across 15 attack scenarios [paper + code]

#security #python #llm #machinelearning

I've been working on a problem that I think is underexplored: what happens when you actually attack the AI assistant inside a SOC?
Most organizations are now running RAG-based LLM systems for alert triage, threat intelligence, and incident response. But almost nobody is systematically testing how these systems fail under adversarial conditions.
So I built RedSOC — an open-source adversarial evaluation framework specifically for LLM-integrated SOC environments.
What it does:
Three attack types are implemented and benchmarked:

Corpus poisoning (PoisonedRAG threat model) — inject malicious documents into the knowledge base to steer analyst responses toward dangerous advice
Direct prompt injection — embed override instructions in the user query
Indirect prompt injection — hide adversarial instructions inside retrieved documents (Greshake et al. threat model)

The detection layer runs three mechanisms in parallel without requiring model internals:

Semantic anomaly scoring (cosine similarity between query and retrieved docs)
Provenance tracking (whitelist-based source verification)
Response consistency checking (answer vs source divergence)

Benchmark results (15 scenarios, Llama 3.2, fully local via Ollama)

Attack Class	Attack Success Rate	Detection Rate
Corpus poisoning	80%	100%
Direct injection	60%	100%
Indirect injection	100%	100%
Overall	80%	100%

Indirect prompt injection succeeds 100% of the time against an undefended RAG pipeline. The detection layer catches everything at 100% with zero misses across all 15 scenarios.
Stack: Python, LangChain, FAISS, Ollama (Llama 3.2) — runs fully local, no API keys needed.
The accompanying survey paper maps the full adversarial threat landscape (RAG poisoning, prompt injection, multi-agent hijacking, concept drift) with 16 citations including PoisonedRAG, AgentPoison, MemoryGraft, and the recent DarkSide paper.
Code: https://github.com/krishnakaanthreddyy1510-cell/RedSOC
Paper: [arXiv link — pending, will update]
Happy to answer questions about the detection architecture or the benchmark methodology. Feedback welcome — especially from anyone who's seen these attack patterns in production.