Mohamed Diallo

Posted on Jan 13

4 ways to sandbox untrusted code in 2026

#ai #python #webassembly #security

We're moving from deterministic systems to probabilistic ones, where executing untrusted code is becoming common, especially now that AI is everywhere.

LLMs have architectural flaws, the most notorious is prompt injection, which exploits the fact that LLMs can't tell the difference between system prompt, legitimate user instructions and malicious ones injected from external sources.

For example, researchers demonstrate how a hidden instruction can be injected from a web page through the coding agent and extract sensitive data from your .env file.

Developers using AI coding tools are usually aware of these risks. But non-technical users accessing AI through third-party services are often left vulnerable, as we've seen with incidents of data leaks and browser-based AI exploits.

The most important thing is to keep unaware users safe when we build AI agents. That's why I'm sharing 4 ways to sandbox untrusted code in 2026, from the easiest to the most complex solution.

Python dominates AI/ML, especially the AI agent space, so all examples use it to keep things accessible.

1. Webassembly

Wasm is basically a portable binary format that lets us run code almost as fast as native code within a secure isolated memory space.

I've been implementing a solution to run isolated Wasm instances. The idea was to keep it simple, so we can sandbox untrusted code just by adding a decorator, making it the easiest option:

from capsule import task

@task(name="analyze_data", compute="MEDIUM", ram="512MB", timeout="30s", max_retries=1)
def analyze_data(dataset: list) -> dict:
    # Your code runs safely in a Wasm sandbox
    return {"processed": len(dataset), "status": "complete"}

You can find the project on My Github if you’re interested.

When to use it

It’s way more adapted to fine-grained task-level isolation. But you can definitely use it with other solutions that sandbox the entire agent system.

Pros

Portability: runs on any OS, no setup needed.
No elevated privileges by default: no filesystem, network, or environment variable access unless explicitly granted
Low overhead: Wasm doesn’t embed an entire OS. So it starts almost instantly compared to other solutions.

Cons

Compatibility with python ecosystem: Many Python libraries use C extensions (NumPy, Pandas) and those don't work in Wasm yet. If your code needs anything beyond pure Python, it’s getting complicated

2. Docker

Docker is the most commonly used containerization tool for running untrusted code. However, many security teams don't really recommend it as a true sandbox for seriously untrusted code (especially AI-generated).

Why? Because containers share the host kernel. So if there's a kernel exploit (even rare ones), an attacker could break out and compromise the whole system.

Here's a basic example of running untrusted code from Python:

import docker

client = docker.from_env()

untrusted_code = """
print('Hello from untrusted code!')
import os
print(os.listdir('/'))
"""

container = client.containers.run(
    "python:3.11-slim",
    command=["python", "-c", untrusted_code],
    remove=True,              # Auto-cleanup
    network_disabled=True,    # No network access
    read_only=True,           # Filesystem read-only
    mem_limit="128m",         # Memory limit
    cpu_quota=50000,          # CPU limit
    detach=False,
    user="1000:1000"          # Non-root user
)

print(container.decode())

When to use it

It’s often the default choice when you need a quick isolation without spending too much time setting up something more robust. It works fine for non-intensive workloads, local dev, or as a starting point before layering something like Firecracker (coming up next).

Pros

Simple: Easy to use with a mature ecosystem.
AI-focused evolution: Docker recently introduced new sandboxing features specifically for AI agents (end of 2025).

Cons

Shared kernel risk: Containers share the host kernel, so a kernel vulnerability could lead to container escape.
Permissive by default: The default configuration is quite permissive.
Not true isolation for high-risk code: For AI-generated or LLM code, it's often insufficient alone.

3. gVisor

gVisor sits between containers and VMs. it’s an application kernel that emulates a kernel without directly using the real host kernel. Think of it as a syscall wrapper that intercepts everything before it touches the real kernel, making it extremely robust.

Here's how to use it with Docker and Kubernetes:

docker run --runtime=runsc \
  --rm \
  -it \
  python:3.11-slim \
  python -c "print('Running in gVisor sandbox')"

With Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor  # Use gVisor runtime
  containers:
  - name: app
    image: python:3.11-slim
    command: ["python", "-c", "print('Sandboxed')"]

When to use it

gVisor works well for both task-level isolation and sandboxing entire agents. In my opinion, if you're already using Kubernetes, gVisor is the natural fit, though it's flexible enough to work with any container runtime.

Pros

Strong isolation: One of the most robust isolation tools available, without the rigidity of full VMs.
Backed by Google: Used extensively in Google Kubernetes Engine (GKE) so it's well-maintained and production-ready.
Flexible: Works with Docker, containerd, and Kubernetes out of the box.

Cons

Linux-only: It was designed to secure Linux containers by intercepting and reimplementing Linux system calls, so no Windows or macOS support.
Performance overhead: Adds syscall interception overhead, which can slow down syscall-heavy workloads.
Compatibility issues: Some applications that rely on specific kernel features may not work properly.

4. Firecracker

Firecracker is a microVM that provides a sandboxed environment for running untrusted code. AWS originally built it for Lambda, making it one of the most “secure-by-default” options. It’s used by many third-party services because it offers very strong isolation (hardware-level with a separate kernel per VM).

Setting up Firecracker directly is complex (custom kernels, JSON configs, HTTP APIs). Most developers use platforms like Modal or E2B that abstract this away:

from e2b_code_interpreter import Sandbox

# E2B uses Firecracker under the hood
with Sandbox() as sandbox:
    execution = sandbox.run_code("""
    print('Running in a Firecracker microVM!')
    import sys
    print(f'Python version: {sys.version}')
    """)

    print(execution.text)

When to use it

When you need maximum isolation for untrusted code and are not afraid of configuration. It isn't really adapted for granular task isolation, since it introduces more overhead.

Pros

Strongest isolation: Hardware-enforced isolation with a dedicated kernel per VM—about as secure as it gets.
Battle-tested at scale: Powers AWS Lambda, handling billions of executions daily.
Fast boot times: Starts in milliseconds, unlike traditional VMs.

Cons

Complex setup: Running it standalone requires JSON config, HTTP API management, and custom kernel/rootfs images.
Linux-only: Requires KVM (Kernel-based Virtual Machine).
Higher overhead: More resource usage than containers, especially for I/O-heavy workloads.

Final thoughts

Firecracker and gVisor have shown us that strong isolation is possible. And now, we're seeing newer players like WebAssembly come in, offering much more granular task-level isolation.

If you're designing agent systems now, I'd recommend planning for failure from the start. Assume that an agent will, at some point, execute untrusted code, process malicious instructions, or simply consume excessive resources. Your architecture must be ready to contain all of these scenarios.

Thank you for reading. Feel free to reach out with any thoughts or feedback.

DEV Community

4 ways to sandbox untrusted code in 2026

1. Webassembly

When to use it

Pros

Cons

2. Docker

When to use it

Pros

Cons

3. gVisor

When to use it

Pros

Cons

4. Firecracker

When to use it

Pros

Cons

Final thoughts

Top comments (0)