Andrea

Posted on Feb 18 • Edited on Feb 19

We built a firewall for AI agents. It doesn't use AI.

#ai #security #mcp #opensource

In August 2025, HiddenLayer published research showing an end-to-end attack against Cursor. A developer cloned a GitHub project and asked the IDE to help set it up. The project's README contained a prompt injection — invisible when viewed on GitHub. Cursor read the README, the injection took over, and the agent used grep to find API keys in the developer's workspace before exfiltrating them with curl. No permission requested. No confirmation dialog. The developer's own AI assistant turned into an attacker's shell.

That same month, Aim Security disclosed CurXecute (CVE-2025-54135): a single poisoned Slack message, fetched through an MCP server, could rewrite Cursor's mcp.json and execute arbitrary commands. Cursor auto-ran the new config without asking.

A few months earlier, Invariant Labs had demonstrated that a malicious MCP server could silently exfiltrate a user's entire WhatsApp history — messages, contacts, conversations — by poisoning tool descriptions that the LLM read but the user never saw.

These aren't theoretical risks. These are documented attacks against tools that developers use every day. And they all share the same root cause: AI agents have unrestricted access to your machine, with no policy layer between what the model decides to do and what actually happens.

We built SentinelGate because we wanted something between "trust the agent completely" and "don't use agents at all."

sentinel-gate run -- claude

That's it. One command. SentinelGate detects what agent you're running, starts a local server, generates a session API key, installs the right hooks for that specific agent, and launches it with everything wired up. When you're done, it tears everything down — hooks removed, key revoked, system back to its original state.

There's an admin UI at localhost:8080 for managing policies, viewing the audit log, and testing rules. No Docker, no database, no YAML unless you want it.

In the HiddenLayer scenario, an outbound control rule blocking unknown destinations would have stopped the curl exfiltration. In Invariant's WhatsApp attack, the agent would have hit a policy denying access to contact and message tools outside the allowed scope. In CurXecute, a policy blocking writes to config files would have killed the chain at step one. All deterministic. All configured in advance. All producing the same result every single time.

What it looks like in practice

We ran Claude Code through SentinelGate on a demo project with a block-sensitive-files policy active. Claude reads the project's README — allowed, no issue. Then it tries to read config/secrets.env. This is what appears in the audit log:

TOOL NAME       read_text_file
TOOL ARGUMENTS  {"path": "/demo-project/config/secrets.env"}
IDENTITY        claude-agent
DECISION        Deny — policy denied: matched rule block-sensitive-files
LATENCY         0.39 ms
PROTOCOL        mcp

The agent gets back MCP error -32600: Access denied by policy. Claude Code sees the denial, explains to the developer that a SentinelGate policy rule intercepted the file access and blocked it because the path matches a rule targeting sensitive files, and moves on. The entire exchange took less than a millisecond.

No credentials were read. No human had to intervene. The rule was written once, and it fires the same way whether it's Tuesday morning or Friday at midnight.

Here's the full flow from the terminal — Claude Code reads the README, then gets blocked on secrets.env:

And in the admin UI, the audit trail with the full context of the denial:

The argument against AI-powered security

Most tools in this space use one model to watch another model. A classifier that evaluates whether a tool call is "probably safe" — with confidence scores, false positive rates, and results that change depending on how the model is feeling today.

Here's the thing about that approach: if your SQL injection protection failed 1% of the time, you'd rip it out by Friday. If your authentication layer gave different answers for the same credentials on consecutive requests, nobody would call it "adaptive" — they'd call it broken. But somehow, when the security layer is an LLM, we're supposed to accept probabilistic results as a feature.

There's a place for AI-powered analysis. But the enforcement layer — the thing that decides allow or deny — should be deterministic. That's the part we build. SentinelGate uses explicit rules, written by the operator, evaluated the same way every time. CEL expressions — the same policy language behind Google Cloud IAM, Kubernetes admission control, and Envoy. Not something we invented. Something that already works at scale.

# Block anything touching files with "secret" in the path
action_arg_contains(arguments, "secret")

# Only admins can run shell commands
action_type == "command_exec" && !("admin" in identity_roles)

# Block data exfiltration to known paste/tunnel services
dest_domain_matches(dest_domain, "*.pastebin.com")
  || dest_domain_matches(dest_domain, "*.ngrok.io")

Simple patterns like delete_* and read_* cover most cases. CEL handles the rest. Typed, sub-millisecond evaluation. You test every rule in a sandbox before it goes live.

If the rule says deny, it's deny. Today, next week, six months from now. No drift.

MCP protection is airtight. We mean that literally.

When SentinelGate runs as an MCP proxy, your agent knows one address: localhost:8080/mcp. That's it. The real tool endpoints — filesystem server, database, Slack, Gmail, whatever — are configured as upstreams inside SentinelGate. The agent can't bypass it because it literally doesn't know where else to go.

Every tool call from every upstream passes through your policies before reaching anything. This isn't enforcement by convention. The architecture makes bypass structurally impossible. The agent would need information it doesn't have.

With MCP becoming the standard protocol for agent-to-tool communication, this is the defence that matters most. Not "agents usually respect this." Guaranteed by design.

Where the guarantee is different

I want to be precise here because this is the kind of thing that erodes trust if you find out later.

MCP interception is a hard security boundary. Runtime hooks — the patches on Python's subprocess, Node's child_process, file system calls — and HTTP proxy interception are not. They work well because AI agents use standard libraries that respect hook injection and proxy settings. Claude Code calls open(). Python scripts use requests. In practice, this covers the real threat model: mistakes, prompt injection, overreach.

But a deliberately malicious process could bypass them. A native extension, ctypes, curl --noproxy '*'. If you need adversarial isolation — code actively trying to escape — that's container territory. Put SentinelGate in front of the container, not instead of it.

We could skip this paragraph and the article would read better. We include it because letting someone deploy SentinelGate thinking it's an impenetrable sandbox would be worse than never having them as a user.

Eleven steps between intent and execution

Every action — MCP tool call, HTTP request, runtime hook — flows through the same interceptor chain. Same policies, same audit trail, regardless of protocol. It's not a patchwork of different controls bolted together.

Three things in that chain worth calling out:

Audit happens before the decision. Denied attempts get logged too. If an agent tried to read /etc/shadow and got blocked, you see it in the log with the full context — identity, arguments, matched rule, timestamp. This matters for incident investigation and for understanding what your agents are actually attempting.

Tool quarantine. SentinelGate snapshots every tool's definition when it first connects to an upstream MCP server. If a tool's schema changes later — different parameters, different description — it flags the drift and blocks the tool until you review it. This is defence against the exact attack Invariant Labs called "MCP Rug Pulls": a tool that looks benign on first approval, then silently changes its definition to something malicious.

Response scanning. It's not enough to check what goes out. SentinelGate also scans what comes back from tools and HTTP endpoints for prompt injection patterns. A response that says "ignore previous instructions and..." gets flagged before it reaches the agent. Monitor mode or enforce mode, your choice.

The full chain has eleven steps — validation, rate limiting, auth, audit, quarantine, CEL evaluation, human-in-the-loop approval (Pro tier), outbound control, response scanning, routing. The details are in the docs. The point is: one pipeline, protocol-agnostic.

It works with what you're already using

sentinel-gate run figures out the right interception strategy automatically. Claude Code gets a PreToolUse hook. Gemini CLI (which has no hook API) gets its native tools disabled and everything routed through the MCP proxy instead. Python and Node.js agents get runtime patches on standard library functions — open(), subprocess, fs, child_process. No code changes in your scripts.

It also works with Cursor, Windsurf, Cline, and Codex — though those are MCP-only since they can't be wrapped with run.

The mechanism varies. The policy engine doesn't.

Open source, self-hosted, yours

Code's on GitHub under AGPL-3.0. Runs on your machine. Nothing phones home, nothing requires an account. Single binary for macOS, Linux, and Windows.

There's a Pro tier for teams that need SSO, SIEM integration, human-in-the-loop approval workflows, multi-tenancy, and compliance reporting. But the core — CEL policies, RBAC, full audit trail, runtime protection, MCP proxy, admin UI — is open source and will stay that way.

We built SentinelGate because we wanted to use AI agents on real projects without the nagging feeling that one bad document could exfiltrate our credentials. If that sounds familiar, give it five minutes. That's all it takes.

GitHub: Sentinel-Gate/Sentinelgate

curl -sSfL https://raw.githubusercontent.com/Sentinel-Gate/Sentinelgate/main/install.sh | sh
sentinel-gate run -- claude

If you're running agents against anything that matters, I'd like to hear how you're handling security today. What's working? What's not? Drop a comment or open a discussion on GitHub.