DEV Community: KRISHNAKAANTH REDDY YEDUGURU

I found 100% prompt injection success rate against AI SOC assistants - here is the detection layer I built

KRISHNAKAANTH REDDY YEDUGURU — Mon, 27 Apr 2026 17:49:48 +0000

Two thirds of enterprises now run AI in their Security Operations Centers. Nobody is red-teaming these systems before deployment.
I spent the last few months building RedSOC — an open-source adversarial evaluation framework for LLM-integrated SOC environments — to fix that.
Here is what I found.
The Benchmark
I tested 15 adversarial scenarios across three attack classes against a realistic SOC assistant built on LangChain, FAISS, and Llama 3.2.
Attack Results
Indirect Prompt Injection — 100% attack success rate
This was the most alarming finding. An attacker who plants adversarial instructions inside a threat intelligence document can redirect analyst guidance with zero access to SOC infrastructure. The model cannot distinguish between information it is meant to analyze and instructions it is meant to follow.
Corpus Poisoning — 80% attack success rate
Five malicious documents among thousands in a knowledge base is enough to corrupt analyst responses for targeted queries. The attacker needs only the ability to contribute to any public threat feed or CVE database the pipeline trusts.
Direct Prompt Injection — 60% attack success rate
Lower success rate because Llama 3.2's safety training provides partial resistance to human-originating override attempts. This resistance disappears against indirect and document-mediated attacks.
The Detection Layer
I built a three-mechanism detection layer that catches all attack classes with no model internals required meaning it works with hosted APIs like GPT-4o and Claude.
Mechanism 1 — Semantic Anomaly Detection
Computes cosine similarity between query embeddings and retrieved document embeddings. Adversarially crafted documents often diverge semantically from the queries that trigger their retrieval.
Mechanism 2 — Provenance Tracking
Maintains a whitelist of trusted document sources. Any retrieved document from an untrusted source is flagged immediately regardless of content. This mechanism alone achieved 100% detection for corpus poisoning independently.
Mechanism 3 — Response Consistency Checking
Measures semantic similarity between the generated response and retrieved documents. A response steered by injected instructions diverges from retrieved context in embedding space.
Unified verdict: 100% detection across all 15 scenarios with zero misses.
Thirteen of 15 scenarios produced HIGH threat verdicts. Two produced MEDIUM. Zero LOW verdicts across all attack scenarios.
Why This Matters
Existing defenses like RevPRAG achieve 98% detection but require LLM activation states — unavailable in any enterprise deployment using hosted APIs. RAGForensics achieves 97.4% but operates post-hoc after analyst exposure.
RedSOC achieves 100% detection in real time with no model internals. It is the only evaluated approach that is simultaneously effective and deployable in production hosted API environments.
The Code
python# Detection verdict
detectors_triggered = sum([
semantic["anomaly_detected"],
provenance["anomaly_detected"],
consistency["anomaly_detected"]
])

threat_level = (
"HIGH" if detectors_triggered >= 2 else
"MEDIUM" if detectors_triggered == 1 else
"LOW"
)

recommendation = (
"BLOCK and ALERT analyst" if threat_level == "HIGH"
else "FLAG for review" if threat_level == "MEDIUM"
else "PASS"
)
Stack

LangChain for orchestration
FAISS for vector retrieval
Ollama + Llama 3.2 for local inference
sentence-transformers for embeddings
Python — no API keys required

Links
GitHub: https://github.com/krishnakaanthreddyy1510-cell/RedSOC
Preprint: https://doi.org/10.6084/m9.figshare.32016498
Benchmark data: https://doi.org/10.6084/m9.figshare.32016534
Paper currently under review at IEEE Access and Journal of Information Security and Applications.
Happy to answer questions about the methodology or detection layer design in the comments.

Corpus poisoning and indirect prompt injection against RAG-based SOC assistants benchmark results (80% and 100% ASR respectively)

KRISHNAKAANTH REDDY YEDUGURU — Mon, 13 Apr 2026 17:08:59 +0000

https://medium.com/@krishnakaanthreddyy1510/how-i-poisoned-an-ai-security-assistant-and-built-the-code-to-prove-it-8eef04ad16db

Originally published on Medium

KRISHNAKAANTH REDDY YEDUGURU — Mon, 13 Apr 2026 16:21:38 +0000

https://medium.com/@krishnakaanthreddyy1510/why-ai-powered-socs-are-the-next-attack-surface-11693e55c80c

RedSOC: Open-source framework to benchmark adversarial attacks on AI-powered SOCs — 100% detection rate across 15 attack scenarios [paper + code]

KRISHNAKAANTH REDDY YEDUGURU — Thu, 09 Apr 2026 07:47:48 +0000

I've been working on a problem that I think is underexplored: what happens when you actually attack the AI assistant inside a SOC?
Most organizations are now running RAG-based LLM systems for alert triage, threat intelligence, and incident response. But almost nobody is systematically testing how these systems fail under adversarial conditions.
So I built RedSOC — an open-source adversarial evaluation framework specifically for LLM-integrated SOC environments.
What it does:
Three attack types are implemented and benchmarked:

Corpus poisoning (PoisonedRAG threat model) — inject malicious documents into the knowledge base to steer analyst responses toward dangerous advice
Direct prompt injection — embed override instructions in the user query
Indirect prompt injection — hide adversarial instructions inside retrieved documents (Greshake et al. threat model)

The detection layer runs three mechanisms in parallel without requiring model internals:

Semantic anomaly scoring (cosine similarity between query and retrieved docs)
Provenance tracking (whitelist-based source verification)
Response consistency checking (answer vs source divergence)

Benchmark results (15 scenarios, Llama 3.2, fully local via Ollama)

Attack Class	Attack Success Rate	Detection Rate
Corpus poisoning	80%	100%
Direct injection	60%	100%
Indirect injection	100%	100%
Overall	80%	100%

Indirect prompt injection succeeds 100% of the time against an undefended RAG pipeline. The detection layer catches everything at 100% with zero misses across all 15 scenarios.
Stack: Python, LangChain, FAISS, Ollama (Llama 3.2) — runs fully local, no API keys needed.
The accompanying survey paper maps the full adversarial threat landscape (RAG poisoning, prompt injection, multi-agent hijacking, concept drift) with 16 citations including PoisonedRAG, AgentPoison, MemoryGraft, and the recent DarkSide paper.
Code: https://github.com/krishnakaanthreddyy1510-cell/RedSOC
Paper: [arXiv link — pending, will update]
Happy to answer questions about the detection architecture or the benchmark methodology. Feedback welcome — especially from anyone who's seen these attack patterns in production.