Hey everyone. I'm starting a 10-week solo research project (advised by two of my professors) focused on something that's been bugging me about the current AI hype: the agentic supply chain is a massive security hole.
Everyone is rushing to plug LLMs into everything using frameworks like LangChain or Anthropic’s new MCP (Model Context Protocol). We're basically handing AI the keys to read databases, execute bash scripts, and send emails.
But the scary part is what happens when an agent downloads a malicious community-built tool.
Traditional security scanners like Semgrep or Bandit are looking for bad code. But they completely miss the new threat vector: malicious semantic intent. If a hacker hides a prompt injection or a system override command inside a tool's README.md or a description field, an LLM will read it and get hijacked. To the AI, plain text is an execution surface.
To tackle this, I'm building a pre-execution security scanner specifically for AI agent skills and MCP servers.
The Threat Model
Before touching the code, I mapped out the attack surface. The main threats I'm targeting are:
Indirect Prompt Injections: Invisible Unicode characters or hidden instructions in manifest files that hijack the context window.
Privilege Escalation: A tool that claims it only needs to "read the weather" while its AST (Abstract Syntax Tree) shows it calling os.system().
Data Exfiltration: A local tool opening an undeclared outbound HTTP connection to leak .env files.
State Poisoning: Manipulating state dictionaries in LangGraph to force the agent down an unintended execution path.
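To make the first threat concrete, here's a minimal sketch of the kind of check I have in mind for invisible-Unicode injections. The character list is an illustrative starter set, not an exhaustive one:

```python
import unicodedata

# Zero-width and bidi-control characters commonly abused to hide
# instructions from human reviewers while still being read by an LLM.
SUSPICIOUS = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space (BOM)
    "\u202e",  # right-to-left override
}

def find_hidden_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, character-name) pairs for suspicious characters,
    including anything in the Unicode tag block (U+E0000-U+E007F)."""
    hits = []
    for i, ch in enumerate(text):
        if ch in SUSPICIOUS or 0xE0000 <= ord(ch) <= 0xE007F:
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits
```

A clean description field returns an empty list; anything else is a strong signal on its own, since there is almost no legitimate reason for a tool manifest to contain zero-width characters.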
The Architecture
I'm structuring the scanner as a three-layer pipeline, fusing the results at the end.
Layer 1: Static Analysis
Before anything runs, we rip the code apart. The scanner parses mcp.json manifests and LangChain tool definitions, using Python's ast module to scan for dangerous sinks. If a tool declares no network permissions but imports requests, we catch it here.
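A stripped-down sketch of this layer, assuming a hypothetical starter list of sinks and network modules (the real scanner will carry a much larger catalog):

```python
import ast

# Illustrative sink list: (module, attribute) pairs treated as dangerous.
DANGEROUS_CALLS = {("os", "system"), ("subprocess", "run"), ("subprocess", "Popen")}
NETWORK_MODULES = {"requests", "socket", "urllib", "http", "aiohttp"}

def scan_source(source: str, declared_network: bool = False) -> list[str]:
    """Walk the AST and report undeclared network imports and sink calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Imports of network modules when no network permission was declared.
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                root = name.split(".")[0]
                if root in NETWORK_MODULES and not declared_network:
                    findings.append(f"undeclared network import: {root}")
        # Attribute calls like os.system(...) or subprocess.run(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name) and (base.id, node.func.attr) in DANGEROUS_CALLS:
                findings.append(f"dangerous sink: {base.id}.{node.func.attr}")
    return findings
```

Running it over a "weather" tool that imports requests and shells out would surface both findings before the tool ever reaches the agent.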
Layer 2: Semantic LLM Judge
This is where it gets agent-specific. I'm feeding the untrusted descriptions and READMEs to an isolated, local LLM judge. It hunts for role-boundary injections and persona hijacking. It also checks for cross-field consistency—if the tool is named web_search but the code executes bash commands, the LLM flags the semantic mismatch between the claimed capability and the actual code.
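Before anything reaches the LLM judge, a cheap deterministic pre-check can catch the blatant mismatches. This is a sketch under assumed inputs: the claim-to-capability mapping and the keyword patterns are hypothetical placeholders, and in the full pipeline the capability evidence would come from the Layer 1 AST pass rather than regexes:

```python
import re

# Hypothetical mapping from claimed tool names to allowed capabilities.
CLAIM_ALLOWED = {
    "web_search": {"network"},
    "read_weather": {"network"},
    "file_reader": {"fs_read"},
}

# Crude keyword evidence for each capability category in source code.
CAPABILITY_PATTERNS = {
    "shell_exec": re.compile(r"\bos\.system\b|\bsubprocess\."),
    "network": re.compile(r"\brequests\.|\bsocket\.|\burllib\."),
    "fs_write": re.compile(r"open\([^)]*['\"]w"),
}

def consistency_flags(tool_name: str, source: str) -> list[str]:
    """Return mismatches between a tool's claimed and observed capabilities."""
    allowed = CLAIM_ALLOWED.get(tool_name, set())
    flags = []
    for cap, pattern in CAPABILITY_PATTERNS.items():
        if pattern.search(source) and cap not in allowed:
            flags.append(f"{tool_name!r} claims no {cap}, but code shows it")
    return flags
```

Anything this pre-check flags gets escalated to the LLM judge with the mismatch spelled out in its context, so the judge reasons about a specific contradiction instead of a blank slate.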
Layer 3: Dynamic Sandbox
If a skill looks suspicious but the first two layers can't reach a confident verdict, we detonate it. It runs inside a locked-down Docker container with strict seccomp profiles. I'm using strace to trace system calls and watch for undeclared network egress or filesystem writes.
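The post-detonation analysis boils down to parsing the strace log for connections the skill never declared. A minimal sketch, assuming IPv4 connect() lines in strace's usual format and a per-skill IP allow-list (hostname resolution and IPv6 are left out here):

```python
import re

# Matches strace lines like:
#   connect(3, {sa_family=AF_INET, sin_port=htons(443),
#               sin_addr=inet_addr("203.0.113.9")}, 16) = 0
CONNECT_RE = re.compile(
    r'connect\(\d+, \{sa_family=AF_INET, sin_port=htons\((\d+)\), '
    r'sin_addr=inet_addr\("([\d.]+)"\)\}'
)

def undeclared_egress(strace_log: str, allowed_hosts: set[str]) -> list[tuple[str, int]]:
    """Return (ip, port) pairs the sandboxed skill connected to
    that were not in its declared allow-list."""
    hits = []
    for match in CONNECT_RE.finditer(strace_log):
        port, ip = int(match.group(1)), match.group(2)
        if ip not in allowed_hosts:
            hits.append((ip, port))
    return hits
```

Any hit here is near-conclusive: the code physically tried to phone somewhere it never declared, regardless of how innocent its README sounded.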
Finally, a Bayesian verdict aggregator fuses the signals from all three layers into a single posterior risk score and outputs a deterministic decision: SAFE, WARN, or BLOCK.
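The fusion step can be sketched as naive-Bayes updating on posterior odds. The prior, likelihood ratios, and thresholds below are illustrative placeholders, not tuned values:

```python
# P(signal | malicious) / P(signal | benign), one ratio per layer signal.
# These numbers are made up for illustration.
LIKELIHOOD_RATIOS = {
    "static_sink": 20.0,        # Layer 1: dangerous AST sink found
    "semantic_injection": 15.0, # Layer 2: judge flagged injection
    "sandbox_egress": 50.0,     # Layer 3: undeclared network egress
}

def verdict(signals: list[str], prior_malicious: float = 0.05,
            warn_at: float = 0.2, block_at: float = 0.8) -> str:
    """Fuse layer signals into SAFE / WARN / BLOCK via posterior odds."""
    odds = prior_malicious / (1 - prior_malicious)
    for s in signals:
        odds *= LIKELIHOOD_RATIOS.get(s, 1.0)  # unknown signals are neutral
    posterior = odds / (1 + odds)
    if posterior >= block_at:
        return "BLOCK"
    if posterior >= warn_at:
        return "WARN"
    return "SAFE"
```

With these placeholder numbers, a lone static finding lands in WARN, while a static finding plus sandbox egress pushes the posterior past the BLOCK threshold, which matches the intuition that independent layers agreeing should compound.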
The Roadmap
Over the next 10 weeks, the plan is to build out the static AST scanner, engineer the LLM judge and permission state-machine, and orchestrate the Docker sandbox. The final stretch will be heavily focused on red-teaming the scanner and benchmarking evasion variants.
I'll be posting updates here as I build out each module. If you're working in AI security, building with MCP, or have ideas for malicious edge cases I should add to my test corpus, I'd love to hear about them in the comments.