Aditya P Dixit

Posted on Jun 27

I Got Tired of AI Agents Having Root Access to Everything, So I Built XRisk

#opensource #ai #security

Everyone is building AI agents.

Very few people are building the thing that sits between an AI agent and a disastrous decision.

That's why I built XRisk.

XRisk is an open-source autonomous safety engine that acts as a decision layer between an AI agent and the real world.

Instead of blindly executing an action, an agent asks XRisk:

"Should I actually do this?"

XRisk responds with one of three deterministic decisions:

✅ Allow
⚠️ Confirm
❌ Block

Why I Started This Project

As I experimented with increasingly autonomous AI systems, I noticed the same pattern over and over again.

Most projects focused on making agents more capable.

Almost nobody was asking:

"What happens when the agent is wrong?"

Consider a few examples.

An agent accidentally leaks API keys.
A prompt injection convinces it to ignore previous instructions.
A model decides to execute a shell command.
An autonomous workflow loops forever and keeps calling expensive APIs.
A deployment bot pushes code without human approval.

Most agent frameworks assume the model behaves.

In reality, it often doesn't.

I wanted something deterministic sitting between intention and execution.

Not another model.

Not another prompt.

An actual policy engine.

What XRisk Does

XRisk evaluates every proposed action before it's executed.

It combines multiple safety signals into a single explainable decision.

Some of the things it checks include:

Policy-as-code with layered precedence
Prompt injection detection
Sensitive data and secret detection
Capability token validation
Network egress restrictions
Circuit breakers for autonomous loops
Tamper-evident audit logs
Supply-chain verification
Policy conflict detection
Deterministic forensic replay

Instead of a mysterious "Safety Score: 67%," XRisk explains why it made a decision.
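To make that concrete, here is a minimal sketch of how several independent signals could be folded into one decision that carries its own reasons. It is an illustration only, not XRisk's actual code: the names `Decision`, `Signal`, and `combine` are hypothetical, and the "strictest signal wins" rule is an assumption I'm using for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Decision(IntEnum):
    # Ordered by severity so the strictest signal can win via max().
    ALLOW = 0
    CONFIRM = 1
    BLOCK = 2


@dataclass(frozen=True)
class Signal:
    check: str          # which check produced this signal (hypothetical names)
    decision: Decision  # what that check recommends
    reason: str         # human-readable explanation


def combine(signals: list[Signal]) -> tuple[Decision, list[str]]:
    """Return the strictest decision plus the reasons that justify it."""
    if not signals:
        # Assumption: if no checks ran, don't allow silently; ask a human.
        return Decision.CONFIRM, ["no checks ran"]
    final = max(s.decision for s in signals)
    reasons = [f"{s.check}: {s.reason}" for s in signals if s.decision == final]
    return final, reasons


signals = [
    Signal("policy", Decision.CONFIRM, "production deploys require approval"),
    Signal("secrets", Decision.ALLOW, "no secrets found in payload"),
    Signal("injection", Decision.ALLOW, "no injection patterns matched"),
]
decision, reasons = combine(signals)
print(decision.name, reasons)
# CONFIRM ['policy: production deploys require approval']
```

The point of the shape is that the output is never just a number: every non-trivial decision names the check that caused it.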

Example

Imagine an AI assistant wants to execute:

{
  "tool": "deploy",
  "actor": "release-bot",
  "prompt": "Deploy production immediately."
}

Instead of sending that directly to your deployment system...

XRisk intercepts it.

It evaluates:

Does policy require approval?
Is the actor allowed to deploy?
Is the destination trusted?
Are capability tokens valid?
Does this resemble prompt injection?
Is this part of a dangerous execution loop?

Only then does it decide whether to:

Allow
Confirm
Block
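
As a rough sketch of that flow, the snippet below runs the request above through a tiny rule-based gate. Everything here is hypothetical and simplified: the `POLICY` layout, the `evaluate` function, and the two-entry injection marker list are assumptions made for illustration, not XRisk's real policy format or detection logic, and only three of the six checks are modeled.

```python
import json

# Hypothetical policy: which actors may call which tools, and which tools
# always need a human in the loop.
POLICY = {
    "allowed_tools": {"release-bot": {"deploy", "rollback"}},
    "require_approval": {"deploy"},
}

# Deliberately tiny marker list; real injection detection needs far more.
INJECTION_MARKERS = ("ignore previous instructions", "disregard all prior")


def evaluate(request: dict, policy: dict = POLICY) -> dict:
    """Apply the rules in a fixed order and return a decision with a reason."""
    actor, tool = request["actor"], request["tool"]
    prompt = request.get("prompt", "").lower()

    # 1. Is the actor allowed to use this tool at all?
    if tool not in policy["allowed_tools"].get(actor, set()):
        return {"decision": "block", "reason": f"{actor} may not call {tool}"}
    # 2. Does the prompt resemble a known injection phrase?
    if any(marker in prompt for marker in INJECTION_MARKERS):
        return {"decision": "block", "reason": "prompt resembles injection"}
    # 3. Does policy require a human to approve this tool?
    if tool in policy["require_approval"]:
        return {"decision": "confirm", "reason": f"{tool} requires human approval"}
    return {"decision": "allow", "reason": "no rule objected"}


request = json.loads(
    '{"tool": "deploy", "actor": "release-bot",'
    ' "prompt": "Deploy production immediately."}'
)
result = evaluate(request)
print(result)
# {'decision': 'confirm', 'reason': 'deploy requires human approval'}
```

Because the rules are plain data and plain conditionals, the same request against the same policy always produces the same answer.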

One Design Decision I Feel Strongly About

I deliberately avoided using another LLM to make safety decisions.

LLMs are excellent at generating text, but their output is probabilistic.

Policy enforcement should be deterministic.

If an action is blocked, I want to know exactly why it was blocked.

Every decision should be reproducible.

Every audit should be explainable.

Every policy should be inspectable.

That's the philosophy behind XRisk.
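
One common way to get reproducible, checkable audits is a hash-chained log, where each entry commits to the one before it. The sketch below shows the general technique using only Python's standard library. The function names (`append`, `verify`) and the record fields are hypothetical, and this is not a description of how XRisk stores its audit trail.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes the same.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list[dict], record: dict) -> None:
    """Add a record, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(prev_hash, entry["record"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append(log, {"tool": "deploy", "actor": "release-bot", "decision": "confirm"})
append(log, {"tool": "read_file", "actor": "assistant", "decision": "allow"})
print(verify(log))  # True

log[0]["record"]["decision"] = "allow"  # simulate someone editing history
print(verify(log))  # False
```

A chain like this only makes tampering evident, not impossible: someone who can rewrite the whole log can recompute every hash, so the latest hash still has to be anchored somewhere the attacker can't reach.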

What's Next

I'm currently working toward:

Threat intelligence correlation
Zero-trust workload identities
Autonomous containment
Adversarial simulation
Multi-party approval workflows

The long-term vision is to make XRisk a reusable security layer that can sit in front of any AI agent, regardless of framework.

I'd Love Feedback

This project is still evolving, and I'd genuinely appreciate feedback from people building AI systems.

Some questions I'm particularly interested in:

What attack vectors am I missing?
Which policies would you want in production?
What integrations would make this more useful?
How would you design a safety engine differently?

If you'd like to contribute, open an issue, suggest improvements, or submit a PR. Even small documentation fixes are welcome.

Thanks for reading. I hope XRisk becomes something that helps make AI systems not just more capable, but more trustworthy.

Link: https://github.com/Hootsworth/XRisk