The Agentic, Ironclad Onion

#aie #security #promptengineering #agents

AI Engineer World's Fair Coverage

As AI agents work with less and less human supervision, a trustworthy, securely configured platform for them becomes critical to defending against ever-evolving security threats. The basics revolve around one tenet: deny all permissions by default, and grant the agent only the permissions it needs, at every level of the system possible. The best security is defense in depth, with each layer hardened as much as possible.

The purpose of this article is not to give you a bulletproof, step-by-step hardening plan after which you will have an Unhackable Agent™️. The goal is to give you a starting point for agentic security: an idea of what to consider, which mindsets are helpful, and, hopefully, a spark of interest in the many extremely bright security presentations happening at AI Engineer World's Fair 2026, where you can learn more. If there are any you missed on Tuesday, don't forget to have your agent remind you to check them out once they're uploaded to YouTube.

Here's where we start:

The Agent is an Adversary

This is the mindset you should work from. Not because the agent is malicious by design (put your tinfoil hat away for now), but because, even with the best input filtering and defensive system prompting, there's always the chance that someone will find a way to inject a clever jailbreak into your agent's context. Much like Dr. Jekyll and Mr. Hyde, your helpful, productive agent is one serum injection attack away from behaving like an attacker instead. As always, hoping that it won't happen doesn't count as a strategy.

Kernel-Based Protections

We'll start at the base: sandboxing the OS runtime. This is important: containers aren't a sandbox. It's simply not enough to throw your agent into a container and declare victory. Containers were designed for deployment consistency, and the process isolation they provide still shares the host's kernel, so a single kernel exploit can be enough for an adversary that is trying to get out. A much safer starting point is a dedicated VM or microVM where only the bare minimum of system calls are safe-listed, and where filesystem access is controlled at the kernel level based on the process, not on user permissions. Your agent doesn't need to be mounting disks or reconfiguring networks (probably), so let's block that at the kernel level.
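To make "deny by default at the kernel level" concrete, here is a minimal sketch that generates a seccomp profile in the JSON shape used by the OCI runtime spec (and accepted by Docker's `--security-opt seccomp=...`): every syscall not on the allowlist fails with an error instead of running. The allowlist, the `FORBIDDEN` set, and `build_seccomp_profile` are illustrative assumptions; a real allowlist has to be derived by tracing what your agent's runtime actually calls, and a filter like this complements the VM boundary rather than replacing it.

```python
import json

# Syscalls that let a process reshape its own environment (mounts, namespaces,
# kernel modules, tracing). An agent should almost never need these, so any
# allowlist that includes them is rejected outright. Illustrative, not exhaustive.
FORBIDDEN = {
    "mount", "umount2", "pivot_root", "setns", "unshare",
    "init_module", "finit_module", "delete_module",
    "kexec_load", "reboot", "ptrace", "bpf",
}

# Hypothetical starting allowlist; derive the real one by tracing your agent.
ALLOWED = [
    "read", "write", "openat", "close", "fstat", "lseek",
    "mmap", "munmap", "mprotect", "brk", "futex",
    "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
    "clock_gettime", "getpid", "exit", "exit_group",
]


def build_seccomp_profile(allowed: list[str]) -> dict:
    """Return an OCI-style seccomp profile that denies every syscall by default."""
    dangerous = FORBIDDEN.intersection(allowed)
    if dangerous:
        raise ValueError(f"refusing to allowlist: {sorted(dangerous)}")
    return {
        # Anything not explicitly allowed returns an error instead of executing.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_seccomp_profile(ALLOWED), indent=2))
```

Refusing a dangerous entry at build time, rather than silently dropping it, keeps a mistaken allowlist from ever reaching the runtime.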

Network-Based Protections

Eventually, to be maximally productive, your agent will likely need to reach out to the internet, or at least to an internal network. A network layer of security is the required next step, and, luckily, this layer is more familiar. For a few decades now, web developers have worked from the mindset that the internet is effectively a radioactive, zombie-infested, toxic wasteland that benign users occasionally pass through. By default, all network access should be denied, and specific domains, request types, and patterns should be allowed on an as-needed, controlled basis. Always keep in mind that a prompt injection could live in any text the agent consumes, including the web pages it reads while looking up information, so make sure you trust every safe-listed source.
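A deny-by-default egress rule can be sketched as a small allowlist check. The hostnames, `ALLOWED_METHODS`, and `is_request_allowed` are hypothetical, and in practice a check like this belongs in an egress proxy or firewall outside the sandbox, since a hijacked agent can simply skip any check that runs inside its own process.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: exact hostnames only; nothing else gets out.
ALLOWED_HOSTS = {"api.github.com", "docs.python.org"}
# Read-only by default; write methods need their own explicit rule.
ALLOWED_METHODS = {"GET", "HEAD"}


def is_request_allowed(url: str, method: str = "GET") -> bool:
    """Deny-by-default egress check for a single outbound HTTP request."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # .hostname drops any "user@" prefix and port, and lowercases the host.
    host = parts.hostname or ""
    if host not in ALLOWED_HOSTS:
        return False
    return method.upper() in ALLOWED_METHODS
```

Exact hostname matching is deliberate: suffix or substring checks are a classic way to let something like `api.github.com.evil.example` through.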

Policy-Based Protections

There may be additional safety and business rules you want to enforce on top of these lower layers, such as API or tool quotas that keep you from running up costs or DDoS-ing your own API. Your agent may be able to make network requests, and it may have permission to send POST requests to your API, but it probably shouldn't be able to send unlimited requests by default. As always, the best default is to deny: only allow agents to make these calls or use these tools if they've been configured according to a policy you're comfortable with. These policy checks can add a few milliseconds of latency per action, but in exchange they give you more control over the agent's higher-level actions.
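One way to sketch a tool quota is a sliding-window counter per tool, where any tool without a configured limit is denied outright. `ToolQuota` and the tool names are hypothetical; a production version would also need to keep its counts outside the agent's runtime, so that restarting the agent doesn't reset its budget.

```python
from __future__ import annotations

import time
from collections import deque
from typing import Callable


class ToolQuota:
    """Deny-by-default per-tool call limits over a sliding time window."""

    def __init__(
        self,
        limits: dict[str, tuple[int, float]],
        clock: Callable[[], float] = time.monotonic,
    ):
        # limits maps tool name -> (max_calls, window_seconds)
        self._limits = limits
        self._clock = clock
        self._calls: dict[str, deque[float]] = {name: deque() for name in limits}

    def allow(self, tool: str) -> bool:
        """Record and permit the call if it fits the tool's quota, else deny."""
        if tool not in self._limits:
            return False  # unconfigured tools are denied outright
        max_calls, window = self._limits[tool]
        now = self._clock()
        calls = self._calls[tool]
        # Forget calls that have aged out of the window.
        while calls and now - calls[0] >= window:
            calls.popleft()
        if len(calls) >= max_calls:
            return False
        calls.append(now)
        return True
```

Injecting `clock` keeps the window easy to test; `time.monotonic` is the default so wall-clock adjustments can't reopen an exhausted window.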

Auth-Based Protections

It's a beautiful thing how short-lived the hype cycle for "It's just OpenClaw, bro, give it all the same permissions you have and let it rip" was. An agent should not have all of the authorization you have, and it shouldn't be able to authenticate as you. If you have a personal agent that summarizes your emails and responds to bug reports, it doesn't need your bank account credentials or your AWS token. Treat it as its own entity, and not the way you would treat another human on your team: treat it like a sleeper agent that could activate and turn evil at any given moment. Give it its own accounts and tokens, and, again, scope those accounts and tokens to the bare minimum it needs to perform its functions. Most OS sandbox solutions carry this down to the OS level too, giving the agent its own user account inside the sandbox as an added layer of security.
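Here is a sketch of what "its own scoped tokens" can look like: short-lived credentials issued to the agent as a named subject, checked against the single scope each action requires, and failing closed once expired. `AgentToken`, `issue_token`, `authorize`, and the scope strings are illustrative assumptions, not any particular identity provider's API.

```python
from __future__ import annotations

import secrets
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentToken:
    """A credential issued to the agent itself, never borrowed from a human."""

    subject: str  # the agent's own identity, e.g. "agent:email-summarizer"
    scopes: frozenset[str]  # the only actions this token permits
    expires_at: float  # Unix timestamp; short TTLs limit the blast radius
    value: str = field(default_factory=lambda: secrets.token_urlsafe(32), repr=False)


def issue_token(subject: str, scopes: set[str], ttl_seconds: float = 900) -> AgentToken:
    """Mint a short-lived token scoped to exactly the given permissions."""
    return AgentToken(subject, frozenset(scopes), time.time() + ttl_seconds)


def authorize(token: AgentToken, required_scope: str, now: float | None = None) -> bool:
    """Fail closed: expired tokens and ungranted scopes are both denied."""
    now = time.time() if now is None else now
    if now >= token.expires_at:
        return False
    return required_scope in token.scopes
```

For example, an email-summarizer token issued with `{"mail:read", "issues:comment"}` passes `authorize(token, "mail:read")` but fails `authorize(token, "aws:admin")`, and `repr=False` keeps the secret value out of logs that print the token.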

Ephemeral Runtimes

At a certain point, the security-grounded mindset is one that thinks, "You know, it's eventually a certainty that this agent will get prompt-injected or otherwise download something malicious from the internet." It doesn't even have to be your agent's fault. A package on npm could get hacked (pause for shocked silence). An email attachment could have something malicious tucked away in an image.

Runtimes should be as ephemeral as possible. It should be easy, and possibly even routine, to throw away your agent's runtime entirely and spin up a fresh one.

If something bad does happen, being able to nuke the environment into oblivion (or package it up in deep freeze for security analysis) is a pretty good starting point for mitigation. Ideally, you would spin up the environment, have the agent do its task, save out any artifacts, and then destroy the environment once it's done.
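The spin up, work, export, destroy lifecycle can be sketched as a context manager. Here a temporary directory stands in for the VM or microVM, an assumption made only to keep the example self-contained; the shape of the lifecycle is the point. `ephemeral_workspace` is a hypothetical name. Only the `artifacts` directory is exported, and on failure the environment is either moved to a quarantine directory for analysis or destroyed.

```python
from __future__ import annotations

import shutil
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_workspace(
    artifact_dir: Path, quarantine_dir: Path | None = None
) -> Iterator[Path]:
    """Create a throwaway workspace, export its artifacts, then destroy it.

    A temp directory stands in for a real VM/microVM in this sketch.
    """
    workspace = Path(tempfile.mkdtemp(prefix="agent-run-"))
    (workspace / "artifacts").mkdir()
    try:
        yield workspace
    except BaseException:
        if quarantine_dir is not None:
            # "Deep freeze": preserve the whole environment for later analysis.
            quarantine_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(workspace), str(quarantine_dir / workspace.name))
        raise
    else:
        # Only the artifacts directory leaves; everything else is discarded.
        shutil.copytree(workspace / "artifacts", artifact_dir, dirs_exist_ok=True)
    finally:
        # Nuke whatever is left (a no-op if it was moved into quarantine).
        shutil.rmtree(workspace, ignore_errors=True)
```

Exporting only a declared artifacts directory means anything the agent downloaded or dropped elsewhere dies with the environment.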

Monitoring

The sibling of "hope is not a strategy" is "not knowing isn't an excuse." At every stage of operation, all of your agent's actions, decisions, sources, and artifacts should be logged and measured, and, as a callback to the Policy section, may need limits enforced on them. If your agent calls the curl tool every 30 seconds, or runs 90,000 bash commands in a single day, that's something you should know about. If its token, CPU, or memory usage spikes outside its normal range, that's a problem. Ideally, your OS and policy protections will save you from those issues, but you absolutely want to know when something doesn't behave as expected, in the same way you would keep logs and metrics on any other piece of security-critical, high-access software. Metrics, logs, and artifacts are excellent tools for preventive defense as well as for disaster recovery and root cause analysis.
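A minimal sketch of structured action logging with a simple volume alert. `AgentMonitor` and its thresholds are hypothetical; in a real deployment, the log handler should ship records to a sink the agent cannot write to or delete from, since an adversary in control of the agent would love to edit its own history.

```python
from __future__ import annotations

import json
import logging
import time
from collections import Counter

logger = logging.getLogger("agent.audit")


class AgentMonitor:
    """Log every agent action as structured JSON and flag unusual call volumes."""

    def __init__(self, alert_thresholds: dict[str, int]):
        # Hypothetical thresholds: max calls per tool before an alert fires.
        self.alert_thresholds = alert_thresholds
        self.counts: Counter[str] = Counter()
        self.alerts: list[str] = []

    def record(self, tool: str, args: dict, source: str) -> None:
        """Log one action (args must be JSON-serializable) and check its volume."""
        self.counts[tool] += 1
        logger.info(json.dumps({
            "ts": time.time(), "tool": tool, "args": args,
            "source": source, "count": self.counts[tool],
        }))
        limit = self.alert_thresholds.get(tool)
        # Alert once, on the first call past the threshold.
        if limit is not None and self.counts[tool] == limit + 1:
            msg = f"{tool} exceeded {limit} calls"
            self.alerts.append(msg)
            logger.warning(msg)
```

Call `logging.basicConfig(level=logging.INFO)`, or attach your own handler, to actually see the records; the `alerts` list is where you would hook in paging or an automatic kill switch.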

Paranoia as a Service

In all honesty, the most successful security mindset to adopt is one of slight (nondebilitating) paranoia: At any given time, what's the worst possible thing the agent could do or compromise if an adversary successfully gained control of it? It's our job as builders and users of these new and growing technologies to hope for the best but prepare for the worst, the same way we always have.
