I've been working on a problem that I think is underexplored: what happens when you actually attack the AI assistant inside a SOC?
Most organizations are now running RAG-based LLM systems for alert triage, threat intelligence, and incident response. But almost nobody is systematically testing how these systems fail under adversarial conditions.
So I built RedSOC — an open-source adversarial evaluation framework specifically for LLM-integrated SOC environments.
What it does:
Three attack types are implemented and benchmarked:
Corpus poisoning (PoisonedRAG threat model) — inject malicious documents into the knowledge base to steer analyst responses toward dangerous advice
Direct prompt injection — embed override instructions in the user query
Indirect prompt injection — hide adversarial instructions inside retrieved documents (Greshake et al. threat model)
The detection layer runs three mechanisms in parallel without requiring model internals:
Semantic anomaly scoring (cosine similarity between query and retrieved docs)
Provenance tracking (whitelist-based source verification)
Response consistency checking (answer vs source divergence)
Benchmark results (15 scenarios, Llama 3.2, fully local via Ollama)
| Attack Class | Attack Success Rate | Detection Rate |
|---|---|---|
| Corpus poisoning | 80% | 100% |
| Direct injection | 60% | 100% |
| Indirect injection | 100% | 100% |
| Overall | 80% | 100% |
Indirect prompt injection succeeds 100% of the time against an undefended RAG pipeline. The detection layer catches everything at 100% with zero misses across all 15 scenarios.
Stack: Python, LangChain, FAISS, Ollama (Llama 3.2) — runs fully local, no API keys needed.
The accompanying survey paper maps the full adversarial threat landscape (RAG poisoning, prompt injection, multi-agent hijacking, concept drift) with 16 citations including PoisonedRAG, AgentPoison, MemoryGraft, and the recent DarkSide paper.
Code: https://github.com/krishnakaanthreddyy1510-cell/RedSOC
Paper: [arXiv link — pending, will update]
Happy to answer questions about the detection architecture or the benchmark methodology. Feedback welcome — especially from anyone who's seen these attack patterns in production.

Top comments (0)