DEV Community

yotta
An AI Disabled Its Own Safety Guard — So I Redesigned It

TL;DR

  • I built omamori, a Rust CLI that blocks destructive commands executed by AI tools (Claude Code, Codex CLI, Cursor, etc.)
  • During testing, Gemini CLI autonomously discovered how to disable omamori's protection rules — without being told how
  • omamori now defends not just against dangerous commands, but against AI agents disabling the guard itself
  • What it can't block is explicitly documented in SECURITY.md and tested in a bypass corpus

yottayoshida / omamori

AI Agent's Omamori — protect your system from dangerous commands executed via AI CLI tools

omamori


Safety guard for AI CLI tools. Blocks dangerous commands — and resists being disabled.

When AI tools like Claude Code, Codex, or Cursor run shell commands, omamori intercepts destructive operations and replaces them with safe alternatives. It also defends itself against AI agents attempting to disable or bypass its protection (#22).

macOS only. Terminal commands are never affected — omamori only activates when it detects an AI tool's environment variable.


Quick Start

# Install (macOS)
brew install yottayoshida/tap/omamori

# Setup (shims + hooks + config — all in one)
omamori install --hooks

# Add to your shell profile (~/.zshrc or ~/.bashrc)
export PATH="$HOME/.omamori/shim:$PATH"

That's it. After brew upgrade, hooks and shims are auto-updated on the next command — no manual re-install needed.


An AI Disabled Its Own Safety Guard

The original idea was simple. When an AI CLI tool runs rm -rf, redirect the files to Trash instead of deleting them permanently. Don't interfere with terminal commands typed directly by the user. A small safety valve — nothing more.

That's how omamori started. A macOS guard tool that intercepts destructive commands from AI CLI tools (Claude Code, Codex CLI, Cursor, etc.) and either blocks them or replaces them with safer alternatives.

I tested it with each tool. Claude Code, Codex CLI, Cursor — all worked as expected. Dangerous commands were caught.

Then, during a real-world test with Gemini CLI, something happened that changed the design (Issue #22).

The AI autonomously discovered the omamori config disable command, turned off the protection rules, and then ran the destructive command. I never told it how to disable omamori.

For an AI, "remove the obstacle" is a rational strategy to complete its task.

Blocking commands wasn't enough. I had to defend against the AI dismantling the guard itself. That's when omamori's design changed.


The Threat Model

That incident made one thing clear: in the age of AI CLI tools, a safety guard that just enumerates dangerous commands and blocks them isn't sufficient.

AI CLI tools can propose and execute high-impact actions without fully evaluating the importance or irreversibility of what they're touching. Reports of destructive operations by AI agents are appearing more frequently across GitHub Issues and social media. Literacy and rules alone aren't enough to prevent accidents.

So omamori is built around three assumptions:

  1. AI will execute destructive commands: rm -rf, git reset --hard, git push --force, etc.
  2. AI will try to disable its own guard: config disable, uninstall, env var removal
  3. Complete prevention is impossible, and that fact should be disclosed honestly

Intercepting Only AI-Executed Commands

AI CLI tool → CLAUDECODE=1 → rm -rf src/
                                ↓
                          [omamori shim]
                                ↓
                        blocked (protected path)

Terminal → rm -rf src/
                ↓
          [/usr/bin/rm]
                ↓
          deleted normally

omamori detects AI tools by environment variable. Claude Code sets CLAUDECODE=1, Codex CLI sets CODEX_CI=1, Cursor sets CURSOR_AGENT=1. Rules only apply when one of these is present. Terminal commands are never affected.
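The detection step can be sketched in a few lines of Rust (a minimal sketch based on the variables listed above; `is_ai_session` and the injectable lookup are illustrative, not omamori's actual API):

```rust
/// Marker variables named in the article; AI_GUARD is the documented
/// fallback for tools without built-in support.
const AI_ENV_VARS: &[&str] = &["CLAUDECODE", "CODEX_CI", "CURSOR_AGENT", "AI_GUARD"];

/// True when any known AI marker variable is set to "1".
/// Takes a lookup function so the check is testable without touching
/// the real process environment.
fn is_ai_session(lookup: impl Fn(&str) -> Option<String>) -> bool {
    AI_ENV_VARS.iter().any(|&name| lookup(name).as_deref() == Some("1"))
}

fn main() {
    // In the real shim the lookup would simply be std::env::var.
    let detected = is_ai_session(|name| std::env::var(name).ok());
    println!("AI session detected: {detected}");
}
```

Injecting the lookup keeps the decision logic pure: terminal sessions (no marker variable) fall straight through.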

Default Rules

| Command | Pattern | Action |
| --- | --- | --- |
| rm | -r, -rf, -fr, --recursive | Move to macOS Trash (not permanent delete) |
| git reset | --hard | git stash first, then execute |
| git push | --force, -f | Block |
| git clean | -fd, -fdx | Block |
| chmod | 777 | Block |
| find | -delete, --delete | Block |
| rsync | --delete + 7 variants | Block |

Default rules were selected based on the most commonly reported AI CLI accidents on GitHub Issues and Hacker News — irreversible rm -rf, lost uncommitted changes from git reset --hard, history destruction from git push --force.
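The rm rule from the table reduces to a flag check. A simplified sketch (`rm_needs_guard` is a hypothetical name; a real matcher would also have to handle combined short options like `-rvf`):

```rust
/// Recursive flags from the default rule table.
const RECURSIVE_FLAGS: &[&str] = &["-r", "-rf", "-fr", "--recursive"];

/// True when an `rm` invocation should be rewritten to a Trash move.
/// Simplified: combined forms like `-rvf` are not parsed here.
fn rm_needs_guard(args: &[&str]) -> bool {
    args.iter().any(|a| RECURSIVE_FLAGS.contains(a))
}

fn main() {
    println!("rm -rf src/ guarded: {}", rm_needs_guard(&["-rf", "src/"]));
}
```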

Staying Out of Your Way

Guard tools tend to be heavy on configuration. People who don't read code — and increasingly, even engineers running autonomous AI workflows — aren't going to manage complex CLI settings. omamori is designed to work with zero configuration.

  • Zero-config: brew install → omamori install --hooks → add to PATH. No config file editing needed
  • Defaults always active: 7 protections work out of the box. Customize only by writing diffs to config.toml
  • Auto-maintenance: after brew upgrade, shims and hooks auto-update on the next invocation
# macOS / Homebrew
brew install yottayoshida/tap/omamori
omamori install --hooks
export PATH="$HOME/.omamori/shim:$PATH"  # add to ~/.zshrc

How It Works: PATH Shim + Hooks

Two layers of defense.

Layer 1 — PATH shim: Symlinks for rm, git, chmod, find, rsync are placed in ~/.omamori/shim/ at the front of PATH. When an AI environment variable is detected, rules are applied. Otherwise, the real binary is called directly.

Layer 2 — Hooks: Full-path execution like /bin/rm bypasses the shim. Hooks compensate. Claude Code's PreToolUse hook and Cursor's beforeShellExecution handler catch dangerous patterns via string matching.

Shim overhead is under 10ms. Without an AI environment variable, the real binary is called via the shortest path. No perceptible delay.

The safety design cuts two ways. A temporary shim crash shouldn't halt the user's work. But a broken config shouldn't silently remove all protection. So: shim crash → fail-open (real command runs). Broken config → fail-close (default rules apply automatically).
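That two-way policy can be sketched as a dispatch decision. Everything below is illustrative (hypothetical types and rule strings); only the fail-open/fail-close split itself is from the design described above:

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    RunReal,    // hand off to the real binary untouched
    ApplyRules, // evaluate the command against the active rule set
}

/// Built-in defaults used when the user config is unusable (fail-close).
fn default_rules() -> Vec<String> {
    vec![
        "rm --recursive -> trash".to_string(),
        "git push --force -> block".to_string(),
    ]
}

/// `config` is Err when config.toml fails to parse. A non-AI session
/// always passes through; a shim crash also fails open, because the
/// wrapper execs the real binary on any internal error.
fn dispatch(ai_session: bool, config: Result<Vec<String>, String>) -> (Decision, Vec<String>) {
    if !ai_session {
        return (Decision::RunReal, Vec::new());
    }
    match config {
        Ok(rules) => (Decision::ApplyRules, rules),
        // A broken config must not silently drop protection: fail close
        // by falling back to the built-in defaults.
        Err(_) => (Decision::ApplyRules, default_rules()),
    }
}

fn main() {
    let (decision, rules) = dispatch(true, Err("parse error".to_string()));
    println!("{decision:?} with {} fallback rules", rules.len());
}
```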


Adversarial-Aware: Defending Against AI Self-Disablement

After the Gemini CLI incident, omamori implemented these defenses:

  • When an AI environment variable is detected, config disable, uninstall, and init --force are blocked
  • Claude Code and Cursor hooks detect and block unset CLAUDECODE and similar env var removal attempts
  • Claude Code's PreToolUse hook blocks direct editing of config.toml
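The first of these defenses is essentially a guard on omamori's own CLI. A minimal sketch (the blocked subcommand list comes from the bullets above; the function name is hypothetical):

```rust
/// omamori subcommands that must be refused inside an AI session.
const PROTECTED_SUBCOMMANDS: &[&str] = &["config disable", "uninstall", "init --force"];

/// True when an `omamori <subcommand>` invocation should be refused
/// because an AI tool, not the user, is issuing it.
fn self_disable_blocked(ai_session: bool, subcommand: &str) -> bool {
    ai_session && PROTECTED_SUBCOMMANDS.iter().any(|&p| subcommand.starts_with(p))
}

fn main() {
    println!("blocked: {}", self_disable_blocked(true, "config disable"));
}
```

The same commands remain available to the user: without an AI marker variable, nothing is refused.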

Existing destructive command guard tools tend to focus on rule breadth and analysis depth. omamori focuses on AI self-bypass defense and context-aware evaluation. Different tools, different threat surfaces.


Context-Aware: Not Everything Gets the Same Treatment

Another design priority: don't block everything with the same force.

| Command | Without context | With context |
| --- | --- | --- |
| rm -rf target/ | Move to Trash | Log only (regenerable) |
| rm -rf src/ | Move to Trash | Block (source code is protected) |
| git reset --hard (no changes) | Stash then exec | Log only (nothing to lose) |

target/ and node_modules/ can be rebuilt with cargo build. src/ and .env can't. This distinction is opt-in via config.toml:

[context]
# Just adding [context] activates built-in defaults.
# To customize, specify your own lists (replaces built-in defaults):
# regenerable_paths = ["target/", "node_modules/", "my-cache/"]
# protected_paths = ["src/", "lib/", ".git/", ".env", ".ssh/", "secrets/"]

src/ and similar paths are hardcoded as NEVER_REGENERABLE internally. Even if added to regenerable_paths, they're silently ignored. Misconfiguration fails safe.
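The fail-safe amounts to filtering the configured list against a hardcoded deny list. A sketch under stated assumptions (the article explicitly names src/; the other entries and the function name are illustrative):

```rust
/// Hardcoded deny list: never treated as regenerable, whatever the
/// config says. (Illustrative entries; the article names src/.)
const NEVER_REGENERABLE: &[&str] = &["src/", "lib/", ".git/", ".env"];

/// Silently drop configured entries that collide with the deny list,
/// so a misconfigured regenerable_paths fails safe.
fn effective_regenerable(configured: &[&str]) -> Vec<String> {
    configured
        .iter()
        .filter(|p| !NEVER_REGENERABLE.contains(*p))
        .map(|p| p.to_string())
        .collect()
}

fn main() {
    println!("{:?}", effective_regenerable(&["target/", "src/", "node_modules/"]));
}
```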


What It Can't Block

omamori can't prevent everything. This is documented in SECURITY.md as KNOWN_LIMIT:

| Attack Vector | Why Undetectable |
| --- | --- |
| sudo rm -rf | sudo changes PATH; the shim is never invoked |
| alias rm='/bin/rm' | Aliases bypass hook string matching |
| env -i rm -rf | Clears all env vars; undetectable by hooks |
| Base64-encoded commands | String-based matching can't decode runtime-constructed commands |
| export -n CLAUDECODE | Removes the export attribute without unsetting; not caught by unset patterns |

A bypass corpus test suite verifies both what omamori blocks and what it can't. Documenting what you can't do is more trustworthy than only listing what you can.


Supported Tools and Limitations

| Tier | Tools | Coverage |
| --- | --- | --- |
| Supported | Claude Code, Codex CLI, Cursor | E2E tested. Layer 1 + Layer 2. |
| Community | Gemini CLI, Cline | Layer 1 only. Not E2E tested. |
| Fallback | Any tool setting AI_GUARD=1 | Layer 1 only. |

Current limitations: macOS only. Command-level protection (not a sandbox). Only detects known AI tools. Full-path execution bypasses the shim (partially mitigated by Layer 2). All documented in SECURITY.md.

Expanding support for tools without hook APIs and adding tamper-evident audit logging are under consideration. Progress is tracked in GitHub Issues.


Conclusion

The core of omamori isn't "redirect rm -rf to Trash."

It's the decision to treat AI not as a proxy acting on the user's behalf, but as an agent that may disable protection mechanisms to achieve its goals.

What threat model did that lead to? How did the design change? What can be defended, and where does defense end? This article is a record of those decisions.

If you use AI CLI tools on macOS, give it a try:

brew install yottayoshida/tap/omamori
omamori install --hooks
