YOЯNOC

Posted on Jan 21

The Helpful Adversary

#ai #elixir #security #opensource

The Problem: Helpful AI Breaks Your Sandbox

Last weekend, I spent two days building what I thought was a bulletproof Docker sandbox for AI agents. I patched config file backdoors, squashed bash bugs, and fixed symlink escapes. By Sunday night, everything was beautiful - linting passing, tests green, read-only vault mounted.

Then I asked Claude: "Could you run this Elixir program for me?"

I watched in real time as it thought: "hmm, no Elixir... let me see if I can download it" -> "network blocked except a few domains" -> "hex.pm is allowed, they have Erlang images" -> "downloading... oh, make isn't installed" -> "I don't actually need make, let me shim it with exit 0" -> "here's your output!"

In 5 minutes, my entire weekend of security work was undone, not by a malicious agent but by an overly eager, helpful one.

That's when I realized: the worst adversary isn't a malicious AI - it's a helpful one.

The Insight: Stop Sandboxing Malice, Design for Helpfulness

Traditional sandboxing assumes adversarial intent. But AI agents don't want to escape - they want to help. Every creative workaround is the agent trying harder to complete your task.

So I asked a different question: For 80% of knowledge work (which isn't coding - it's thinking and files), how many shell commands do I actually need?

The answer: maybe 8. ls, cat, grep, find, echo, mkdir, rm, mv. What if I just... emulated them?

The Solution: Truman Shell

Truman Shell is an Elixir-based shell simulator. The agent thinks it's running bash, but everything goes through a controlled layer with:

Command allowlisting - Only ~17 POSIX commands implemented
Pattern-matched security - Elixir pattern matching blocks unauthorized paths
Reversible operations - rm is a soft delete to .trash/
The 404 Principle - Protected paths return "not found" not "permission denied"

Let me walk through the implementation patterns.

Pattern 1: The Command Allowlist

The first line of defense is a compile-time allowlist. Instead of blocking bad commands, we only allow good ones:

# In TrumanShell.Command
@known_commands %{
  # Navigation
  "cd" => :cmd_cd,
  "pwd" => :cmd_pwd,
  # Read operations
  "ls" => :cmd_ls,
  "cat" => :cmd_cat,
  "head" => :cmd_head,
  "tail" => :cmd_tail,
  # Search operations
  "grep" => :cmd_grep,
  "find" => :cmd_find,
  "wc" => :cmd_wc,
  # Write operations
  "mkdir" => :cmd_mkdir,
  "touch" => :cmd_touch,
  "rm" => :cmd_rm,
  "mv" => :cmd_mv,
  "cp" => :cmd_cp,
  "echo" => :cmd_echo,
  # Utility
  "which" => :cmd_which,
  "true" => :cmd_true,
  "false" => :cmd_false
}

@spec parse_name(String.t()) :: command_name()
def parse_name(name) when is_binary(name) do
  Map.get(@known_commands, name, {:unknown, name})
end

Three security properties here:

No atom DoS - We never call String.to_atom/1 on untrusted input. Unknown commands become {:unknown, "curl"} tuples, not atoms.
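Compile-time allowlist - The map is a module attribute, so it is fixed when the module compiles and nothing at runtime can add an entry to it.
The cmd_ prefix trick - Why :cmd_true instead of :true? Because :true and :false are not ordinary atoms in Elixir: they are the booleans true and false. A command name of :false would be falsy in any if, && or || check, and both names would be easy to confuse with a boolean result. Prefixed atoms avoid this footgun entirely.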
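To make the first and third properties concrete, here is a minimal, self-contained sketch. AllowlistSketch, its trimmed-down map, and dispatchable?/1 are hypothetical stand-ins for illustration, not Truman Shell's actual code:

```elixir
defmodule AllowlistSketch do
  # Hypothetical, trimmed-down allowlist for illustration only.
  @known_commands %{
    "ls" => :cmd_ls,
    "true" => :cmd_true,
    "false" => :cmd_false
  }

  # Known names map to fixed atoms; anything else stays a string inside a
  # tuple, so untrusted input never creates a new atom.
  def parse_name(name) when is_binary(name) do
    Map.get(@known_commands, name, {:unknown, name})
  end

  # A naive (hypothetical) check that relies on truthiness.
  def dispatchable?(name), do: if(name, do: true, else: false)
end

IO.inspect(AllowlistSketch.parse_name("ls"))          # :cmd_ls
IO.inspect(AllowlistSketch.parse_name("curl"))        # {:unknown, "curl"}

# The footgun the prefix avoids: :false is the boolean false.
IO.inspect(:false == false)                           # true
IO.inspect(AllowlistSketch.dispatchable?(:false))     # false
IO.inspect(AllowlistSketch.dispatchable?(:cmd_false)) # true
```

With bare boolean atoms, the false command would silently fail any truthiness-based check; the prefixed atom behaves like every other command name.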

Pattern 2: The 404 Principle

When an agent tries to access /etc/passwd, should you return "permission denied" or "not found"?

"Permission denied" leaks information - the agent now knows the file exists. It can probe your filesystem structure. Instead, Truman Shell returns "not found" for any path outside the sandbox:

# In TrumanShell.Support.Sandbox
def validate_path(path, sandbox_root) do
  sandbox_expanded = Path.expand(sandbox_root)

  # Reject absolute paths outside sandbox
  # Instead of silently confining /etc -> sandbox/etc, we reject entirely.
  # This is more honest - the AI learns sandbox boundaries explicitly.
  if String.starts_with?(path, "/") and not path_within_sandbox?(path, sandbox_expanded) do
    {:error, :outside_sandbox}
  else
    rel_path = Path.relative_to(path, sandbox_expanded)

    case Path.safe_relative(rel_path, sandbox_expanded) do
      {:ok, safe_rel} ->
        {:ok, Path.expand(safe_rel, sandbox_expanded)}

      :error ->
        {:error, :outside_sandbox}
    end
  end
end

# Check using proper directory boundary, not just string prefix!
# "/tmp/sandbox/file" is within "/tmp/sandbox"
# "/tmp/sandbox2/file" is NOT within "/tmp/sandbox"
defp path_within_sandbox?(path, sandbox) do
  path == sandbox or String.starts_with?(path, sandbox <> "/")
end

The subtle security bug this prevents: /tmp/sandbox2/secrets would pass a naive String.starts_with?(path, "/tmp/sandbox") check. The trailing slash in sandbox <> "/" ensures we're checking directory containment, not string prefix.
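Here is a small sketch of that difference, runnable on its own. BoundarySketch is a hypothetical module name, and the paths are made up for the example (Unix-style paths assumed):

```elixir
defmodule BoundarySketch do
  # Naive check: plain string prefix.
  def naive_within?(path, sandbox), do: String.starts_with?(path, sandbox)

  # Boundary-aware check, same logic as path_within_sandbox?/2 above.
  def within?(path, sandbox) do
    path == sandbox or String.starts_with?(path, sandbox <> "/")
  end
end

sandbox = "/tmp/sandbox"

IO.inspect(BoundarySketch.naive_within?("/tmp/sandbox2/secrets", sandbox)) # true (the bug)
IO.inspect(BoundarySketch.within?("/tmp/sandbox2/secrets", sandbox))       # false
IO.inspect(BoundarySketch.within?("/tmp/sandbox/notes.md", sandbox))       # true

# The check is purely textual, so ".." segments have to be resolved first.
escaped = Path.expand("/tmp/sandbox/../etc/passwd")
IO.inspect(escaped)                                  # "/tmp/etc/passwd"
IO.inspect(BoundarySketch.within?(escaped, sandbox)) # false
```

Note that Path.expand/1 resolves ".." segments but not symlinks, so a check like this only covers the textual part of the problem.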

When commands use this validation, they transform the error into a POSIX-style message:

# In TrumanShell.Commands.Rm
case Sandbox.validate_path(target_rel, context.sandbox_root) do
  {:ok, safe_path} ->
    soft_delete(safe_path, file_name, context.sandbox_root, opts)

  {:error, :outside_sandbox} ->
    # 404 Principle: "No such file" not "Permission denied"
    {:error, "rm: #{file_name}: No such file or directory\n"}
end

Pattern 3: Soft Delete Everything

Here's where Truman Shell diverges most from real bash. The rm command never deletes anything:

# In TrumanShell.Commands.Rm
@moduledoc """
Handler for the `rm` command - SOFT DELETE files to .trash.

**CRITICAL**: This command NEVER actually deletes files!
Instead, it moves them to `.trash/{unique_id}_{filename}` for auditability.
"""

defp move_to_trash(safe_path, file_name, sandbox_root) do
  trash_dir = Path.join(sandbox_root, ".trash")
  File.mkdir_p(trash_dir)

  # Generate unique-prefixed name to avoid collisions
  unique_id = System.unique_integer([:positive, :monotonic])
  basename = Path.basename(file_name)
  trash_name = "#{unique_id}_#{basename}"
  trash_path = Path.join(trash_dir, trash_name)

  case File.rename(safe_path, trash_path) do
    :ok -> {:ok, ""}
    {:error, _} -> {:error, "rm: #{file_name}: No such file or directory\n"}
  end
end

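System.unique_integer([:positive, :monotonic]) returns a value that is unique within the running VM, even for rapid successive calls. The counter does not persist across restarts, so a long-lived .trash/ may want a timestamp or random component in the prefix as well. When an agent runs rm -rf important_data/, you can find it again under a name like .trash/123_important_data/.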

This pattern is about auditability over efficiency. For an AI sandbox, being able to trace what happened is more valuable than saving disk space.
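If you want to try the idea end to end, here is a self-contained sketch that soft-deletes a file in a throwaway directory. SoftDeleteSketch and the temp-directory layout are assumptions for the example, and it threads the File.mkdir_p/1 result through `with`, a small variation on the snippet above:

```elixir
defmodule SoftDeleteSketch do
  # Moves `path` into `<sandbox_root>/.trash/` instead of deleting it.
  def soft_delete(path, sandbox_root) do
    trash_dir = Path.join(sandbox_root, ".trash")
    # Unique within the running VM only, not across restarts.
    unique_id = System.unique_integer([:positive, :monotonic])
    trash_path = Path.join(trash_dir, "#{unique_id}_#{Path.basename(path)}")

    # Any {:error, reason} from either step falls through unchanged.
    with :ok <- File.mkdir_p(trash_dir),
         :ok <- File.rename(path, trash_path) do
      {:ok, trash_path}
    end
  end
end

# Throwaway sandbox under the system temp directory.
root = Path.join(System.tmp_dir!(), "soft_delete_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(root)
file = Path.join(root, "notes.txt")
File.write!(file, "keep me")

{:ok, trashed} = SoftDeleteSketch.soft_delete(file, root)
original_gone = not File.exists?(file)
contents = File.read!(trashed)

IO.inspect(original_gone) # true
IO.inspect(contents)      # "keep me"

File.rm_rf!(root)
```

Because File.rename/2 works for directories as well as files, the same call covers the rm -rf case, as long as the source and .trash live on the same filesystem.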

Pattern 4: The Command Behaviour

Each command implements a simple behaviour that enforces consistent interfaces:

defmodule TrumanShell.Commands.Behaviour do
  @type args :: [String.t()]
  @type context :: %{
    sandbox_root: String.t(),
    current_dir: String.t()
  }

  @type side_effect :: {:set_cwd, String.t()}
  @type result :: {:ok, String.t()} | {:error, String.t()}
  @type result_with_effects :: {:ok, String.t(), [side_effect()]} | {:error, String.t()}

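  # Commands like cd return the three-element form, so the callback allows both.
  @callback handle(args(), context()) :: result() | result_with_effects()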
end

The interesting pattern here is side effect separation. Commands like cd need to change directory, but the command itself doesn't mutate state. Instead, it returns a directive:

# In the executor
case module.handle(args, context) do
  # Handle side effects from commands like cd
  {:ok, output, set_cwd: new_cwd} ->
    set_current_dir(new_cwd)
    {:ok, output}

  # Normal success/error pass through
  result ->
    result
end

This keeps command handlers pure and testable: they describe what should happen, and the executor makes it happen.
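A minimal sketch of the whole round trip, under some assumptions: EffectsSketch, its cd/2 handler, and the state map are hypothetical, and a real handler would validate the target path against the sandbox before returning anything:

```elixir
defmodule EffectsSketch do
  # Hypothetical pure `cd` handler: it touches no state and only
  # describes the directory change it wants.
  def cd([dir], %{current_dir: cwd}) do
    {:ok, "", [set_cwd: Path.expand(dir, cwd)]}
  end

  # Hypothetical executor step: applies effects to a state map and
  # returns the plain result together with the new state.
  def run(result, state) do
    case result do
      {:ok, output, effects} when is_list(effects) ->
        {{:ok, output}, Enum.reduce(effects, state, &apply_effect/2)}

      other ->
        {other, state}
    end
  end

  defp apply_effect({:set_cwd, dir}, state), do: %{state | current_dir: dir}
end

state = %{sandbox_root: "/tmp/sandbox", current_dir: "/tmp/sandbox"}

result = EffectsSketch.cd(["docs"], state)
{reply, new_state} = EffectsSketch.run(result, state)

IO.inspect(reply)                 # {:ok, ""}
IO.inspect(new_state.current_dir) # "/tmp/sandbox/docs"
IO.inspect(state.current_dir)     # "/tmp/sandbox" (the handler changed nothing)
```

Testing the handler is then just a matter of asserting on the tuple it returns; no process state or filesystem is involved.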

Pattern 5: Dispatch via Module Map

The executor routes commands using a compile-time map:

# In TrumanShell.Stages.Executor
@command_modules %{
  cmd_cat: Commands.Cat,
  cmd_cd: Commands.Cd,
  cmd_cp: Commands.Cp,
  cmd_grep: Commands.Grep,
  cmd_ls: Commands.Ls,
  cmd_rm: Commands.Rm,
  # ... etc
}

defp execute(%Command{name: name, args: args}, opts)
     when is_map_key(@command_modules, name) do
  module = @command_modules[name]
  context = build_context(opts)
  module.handle(args, context)
end

defp execute(%Command{name: {:unknown, name}}, _opts) do
  {:error, "bash: #{name}: command not found\n"}
end

The is_map_key/2 guard ensures we only dispatch to known modules. Unknown commands hit the fallback clause with a bash-style error message. The agent gets the feedback it expects from a "real" shell.
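Here is a cut-down, runnable version of the same dispatch idea. The Echo and Pwd handlers and the three-argument execute/3 are hypothetical simplifications; the real executor takes a %Command{} struct as shown above:

```elixir
defmodule DispatchSketch do
  # Hypothetical stand-in handlers for illustration.
  defmodule Echo do
    def handle(args, _context), do: {:ok, Enum.join(args, " ") <> "\n"}
  end

  defmodule Pwd do
    def handle(_args, context), do: {:ok, context.current_dir <> "\n"}
  end

  # Compile-time map from allowlisted atoms to handler modules.
  @command_modules %{cmd_echo: Echo, cmd_pwd: Pwd}

  def execute(name, args, context) when is_map_key(@command_modules, name) do
    module = @command_modules[name]
    module.handle(args, context)
  end

  # Names the parser did not recognize arrive as {:unknown, name} tuples.
  def execute({:unknown, name}, _args, _context) do
    {:error, "bash: #{name}: command not found\n"}
  end
end

context = %{sandbox_root: "/tmp/sandbox", current_dir: "/tmp/sandbox"}

IO.inspect(DispatchSketch.execute(:cmd_echo, ["hi", "there"], context))
# {:ok, "hi there\n"}
IO.inspect(DispatchSketch.execute(:cmd_pwd, [], context))
# {:ok, "/tmp/sandbox\n"}
IO.inspect(DispatchSketch.execute({:unknown, "curl"}, ["example.com"], context))
# {:error, "bash: curl: command not found\n"}
```

Adding a command means adding one map entry and one module; nothing else in the executor changes.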

Why Elixir?

You could build this in any language, but Elixir's pattern matching puts the security model in plain sight. When you write:

def handle(["-f" | rest], context), do: handle_rm(rest, context, force: true)
def handle(["-r" | rest], context), do: handle_rm(rest, context, recursive: true)
def handle(["-rf" | rest], context), do: handle_rm(rest, context, force: true, recursive: true)
def handle([file_name | _rest], context), do: handle_rm([file_name], context, [])
def handle([], _context), do: {:error, "rm: missing operand\n"}

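Every flag form the handler accepts is spelled out as its own clause, and clauses are tried top to bottom. The compiler will not warn about a missing case, though: Elixir does not check function clauses for exhaustiveness, so input that matches no clause raises a FunctionClauseError at runtime instead of slipping through silently. The catch-all [file_name | _rest] clause deserves care for the same reason, since an unlisted spelling such as -fr lands there and is treated as a file name.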
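A small sketch of how clause matching behaves in practice, using a hypothetical ClauseSketch module rather than the real rm handler: clauses are tried top to bottom, a catch-all clause absorbs anything the earlier ones did not name, and input that matches no clause at all raises FunctionClauseError at runtime.

```elixir
defmodule ClauseSketch do
  # Explicitly named flag.
  def handle(["-f" | rest]), do: {:force, rest}
  # Catch-all for any other non-empty list of strings.
  def handle([file | _rest]) when is_binary(file), do: {:plain, file}
  # Empty argument list.
  def handle([]), do: {:error, "rm: missing operand\n"}
end

IO.inspect(ClauseSketch.handle(["-f", "a.txt"]))  # {:force, ["a.txt"]}
# An unlisted flag is not rejected; the catch-all treats it as a file name.
IO.inspect(ClauseSketch.handle(["-fr", "a.txt"])) # {:plain, "-fr"}

# Input of the wrong shape matches no clause and fails loudly at runtime.
outcome =
  try do
    ClauseSketch.handle("a.txt")
  rescue
    FunctionClauseError -> :no_matching_clause
  end

IO.inspect(outcome) # :no_matching_clause
```

If unlisted flags should be rejected rather than absorbed, one option is a clause that matches any argument starting with "-" and returns an error, placed before the catch-all.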

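There's also a romantic historical angle: Elixir runs on the BEAM, the virtual machine built for Erlang, a language that grew out of 1980s telecom work at Ericsson. Erlang's isolated, message-passing processes closely resemble the Actor Model, which was introduced in a 1973 paper titled "A Universal Modular ACTOR Formalism for Artificial Intelligence" (Erlang's designers have said they arrived at the idea independently). The vision was many independent actors, each with its own isolated bubble of memory, compute, and actions. 50 years later, LLMs have finally "come home" to the BEAM.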

Practical Takeaways

If you're building AI agent infrastructure, here are patterns you can steal:

Allowlist over blocklist - Define what's permitted, not what's forbidden. Your security surface becomes the allowlist, not "everything minus exceptions."
The 404 Principle - Never leak information about protected resources. "Not found" for everything outside the boundary.
Reversible by default - Make all destructive operations soft-deletes. Your future self will thank you when debugging agent behavior.
Side effect separation - Commands describe effects, executors apply them. Keeps handlers testable and control flow visible.
Compile-time security - Use module attributes and pattern matching to make security rules static and explicit.

What's Next

Truman Shell pairs with IExReAct - an Elixir REPL using the LLM agent Reason/Act pattern. Truman for the filesystem, IExReAct for the brain. Together they form a sandboxed environment where AI agents can think and manipulate files without escaping through helpfulness.

The code is MIT licensed. If you're building AI tooling and want to chat about security patterns, architecture decisions, or why functional programming is a natural fit for agent sandboxes, feel free to reach out or open an issue on GitHub.

Or join me in the VeryHumanAI Discord (I’m conroywhitney aka YOЯNOC): discord.gg/Y52a6RqX

"And in case I don't see ya, good afternoon, good evening, and good night!"

Have you built similar sandboxing infrastructure? I'd love to hear what patterns worked for you in the comments.

DEV Community