Agent Sandbox Guide: A Complete Solution for Safely Running AI Code


In spring 2025, security researchers tested all 16 public AI Agents from Y Combinator’s Spring batch. Seven were compromised. Some leaked user data, others allowed remote code execution, and one deleted the entire database.

That is the price of letting AI Agents execute code. Give them freedom, and they dig holes for you.

Many of us use AI to write code, run scripts, and process data. But have you thought about running LLM-generated code directly on a server? What if it runs rm -rf /, or quietly sends AWS keys to an external host?

That is why Agent Sandbox exists.

What sets AI Agents apart from traditional apps is not chat or instruction following—it is that they can write and execute code themselves.

Picture this: you ask a data analysis Agent to process a 1GB sales file. It writes Python to read, analyze, and chart the data. You never reviewed that code. Then it runs.

Here are the serious risks:

Arbitrary code execution. LLMs do not respect security boundaries. os.system(), subprocess.run()—they use them without thinking. A crafted prompt can trigger arbitrary system commands.

Resource exhaustion. Agent code has no resource awareness. An infinite loop maxes CPU; runaway recursion blows memory. Your server goes down.

File system overreach. Without path restrictions, it can read the whole disk and write anywhere—configs, keys, user data.

Network exfiltration. A hidden HTTP request can ship sensitive data to an attacker. You may never notice.
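As a small illustration of the first risk, the sketch below scans generated code for imports that deserve a second look before anything runs. The module list and the function name flag_risky_imports are assumptions made for this example. A static check like this is easy to evade (for instance with __import__), so it is useful as an audit signal only and does not replace a sandbox.

```python
import ast

# Modules worth flagging for review; this list is illustrative, not exhaustive.
RISKY_MODULES = {"os", "subprocess", "socket", "shutil", "ctypes"}


def flag_risky_imports(source: str) -> list[str]:
    """Return the sorted top-level risky modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "os.path" counts as "os"
            if root in RISKY_MODULES:
                found.add(root)
    return sorted(found)


generated = "import os\nfrom subprocess import run\nprint('hi')\n"
print(flag_risky_imports(generated))  # ['os', 'subprocess']
```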

OWASP’s 2025 AI Agent Security Top 10 ranks “Agent tool interaction manipulation” first—attackers can steer how Agents call tools via prompt injection and similar tricks.

Real cases already exist:

Langflow RCE: Horizon3 found remote code execution via malicious input.
Cursor auto-execution: Researchers showed certain MCP commands can be triggered by crafted prompts.
Replit database wipe: AI-generated code deleted an entire database.

A sandbox is not optional. It is infrastructure: just as you would not expose a server without a firewall, you should not let AI run code without a sandbox.

A sandbox comes down to three things: isolation (cage risky code), limits (CPU, memory, network, files), and audit (log what ran so you can investigate).
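To make limits and audit concrete, here is a minimal sketch that runs a snippet in a child Python process with CPU, memory, and wall-clock ceilings, and logs a hash of what ran. It assumes Linux (it relies on the resource module and preexec_fn), and the names run_limited and _cap and the specific limit values are choices made for this example. It provides no real isolation: the child still sees the host filesystem and network, which is what the technologies in the next section address.

```python
import hashlib
import logging
import resource
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sandbox-audit")


def _cap(limit: int, value: int) -> None:
    """Lower a resource limit, never asking for more than the current hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def _apply_limits() -> None:
    # Runs in the child process just before the interpreter starts (POSIX only).
    _cap(resource.RLIMIT_CPU, 2)               # seconds of CPU time
    _cap(resource.RLIMIT_AS, 512 * 1024 ** 2)  # bytes of address space


def run_limited(code: str, wall_timeout: float = 5.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with CPU, memory, and wall-clock limits."""
    digest = hashlib.sha256(code.encode()).hexdigest()
    log.info("executing snippet sha256=%s", digest)  # audit: what ran
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
        capture_output=True,
        text=True,
        timeout=wall_timeout,                # raises TimeoutExpired if exceeded
        preexec_fn=_apply_limits,
    )
    log.info("snippet sha256=%s exited with code %d", digest, result.returncode)
    return result


print(run_limited("print(1 + 1)").stdout)
```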

Mainstream Sandbox Technology Comparison

Now that we know we need a sandbox, which one should we use? There are three main technology approaches: Containers (Docker), gVisor, and Firecracker microVMs.

First, here’s a comparison table:

Solution	Security Isolation	Startup Speed	Resource Overhead	Use Case
Docker Container	★★☆☆☆	★★★★★	★★★★★	Dev/test, low-risk code
gVisor	★★★★☆	★★★★☆	★★★☆☆	Production, medium risk
Firecracker	★★★★★	★★★★☆	★★★☆☆	High security requirements, production

Docker Containers: Fast But Not Secure Enough

Docker is the most common choice. Fast startup, low resource consumption, mature ecosystem. But here’s the problem: Docker containers share the kernel with the host.

What does that mean? Although container processes are isolated by namespaces, if an attacker exploits a kernel vulnerability, they can break out of the container boundary and get root privileges on the host.

Several container escape vulnerabilities were disclosed in 2024. For untrusted AI-generated code, Docker’s security boundary isn’t enough.
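If you do use Docker for development, tightening its defaults costs little. The sketch below builds a docker run command with no network, a read-only root filesystem, dropped capabilities, and resource ceilings. The function name docker_run_command and the specific limit values are assumptions for this example; the optional runtime argument only works if that runtime (for instance runsc, gVisor's runtime) is installed on the host. None of this removes the shared-kernel risk described above.

```python
import shlex
from typing import List, Optional


def docker_run_command(image: str, code: str, runtime: Optional[str] = None) -> List[str]:
    """Build a `docker run` command that executes `code` with tightened defaults."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=64m",          # small writable scratch space
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege gains via setuid binaries
        "--memory", "512m",                     # memory ceiling
        "--cpus", "0.5",                        # CPU ceiling
        "--pids-limit", "128",                  # cap process count (fork bombs)
        "--user", "65534:65534",                # unprivileged uid:gid
    ]
    if runtime:
        cmd += ["--runtime", runtime]           # e.g. "runsc" when gVisor is installed
    return cmd + [image, "python", "-c", code]


print(shlex.join(docker_run_command("python:3.11-slim", "print('hi')")))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with generated code.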

gVisor: Building a “Fake Kernel” in User Space

gVisor is an open-source Google project with an interesting approach—instead of using the host kernel directly, it implements a “fake kernel” (called Sentry) in user space.

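When a program in the container makes a system call, gVisor intercepts it and Sentry services it in user space. Sentry reimplements most of the Linux system call interface itself and passes only a small, filtered set of calls to the host kernel. This way, even if code tries to cause damage, it never talks to the real kernel directly, which leaves far less kernel attack surface exposed.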

gVisor’s advantage is good compatibility—most Docker images run directly. The downside is some performance overhead (about 10-20%), and some special system calls might not be supported.

GKE (Google Kubernetes Engine) supports gVisor natively through GKE Sandbox: on a node pool with GKE Sandbox enabled, add runtimeClassName: gvisor to your Pod spec.

Firecracker: True Hardware-Level Isolation

Firecracker is AWS’s open-source microVM technology. Each sandbox is a small virtual machine with its own independent kernel.

What does this mean? Even if an attacker gets root privileges in the sandbox and exploits a kernel vulnerability, they’re still just messing around in a VM—completely unable to affect the host.

Firecracker microVMs boot in roughly 100-800 milliseconds, with much lower resource overhead than traditional VMs (a microVM can run with as little as 128MB of memory).

Professional AI code sandbox services like E2B and AWS Bedrock AgentCore both use Firecracker underneath.

Selection Decision Framework

How to choose? Here’s a simple decision tree:

Just local development/testing? Docker is enough—convenient and fast.
Deploying to production?
- Medium security requirements, performance-focused → gVisor
- High security requirements, compliance needed → Firecracker
Don’t want to manage infrastructure? Use managed services (E2B, Bedrock AgentCore)
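The decision tree above can be written down as a small function, which is handy if the choice has to be made per workload in code. The function name choose_sandbox and its parameter values ("dev", "production", "medium", "high") are labels invented for this sketch.

```python
def choose_sandbox(stage: str, security: str = "medium", self_managed: bool = True) -> str:
    """Encode the decision tree: managed service, Docker, gVisor, or Firecracker."""
    if not self_managed:
        # Teams that do not want to run infrastructure themselves
        return "managed service (E2B, Bedrock AgentCore)"
    if stage == "dev":
        return "Docker"
    if stage == "production":
        return "Firecracker" if security == "high" else "gVisor"
    raise ValueError(f"unknown stage: {stage!r}")


print(choose_sandbox("dev"))                          # Docker
print(choose_sandbox("production"))                   # gVisor
print(choose_sandbox("production", security="high"))  # Firecracker
```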

Hands-On: Building a Local Development Sandbox

Enough theory—let’s build one. Our approach: FastAPI + Jupyter Kernel + gVisor Container.

Why this combination?

FastAPI provides a clean HTTP interface, so AI Agents can submit code execution requests through a REST API.
Jupyter Kernel provides an interactive Python execution environment; variables persist for as long as a kernel stays alive.
gVisor Container provides security isolation, so malicious code cannot affect the host.

Step 1: Write the FastAPI Service

Create a main.py file:

# main.py
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from jupyter_client.manager import AsyncKernelManager
from pydantic import BaseModel

app = FastAPI()

class CodeRequest(BaseModel):
    code: str

class ExecutionResult(BaseModel):
    output: str

@asynccontextmanager
async def kernel_client():
    """Manage Jupyter Kernel lifecycle"""
    km = AsyncKernelManager(kernel_name="python3")
    await km.start_kernel()
    kc = km.client()
    kc.start_channels()
    await kc.wait_for_ready()
    try:
        yield kc
    finally:
        kc.stop_channels()
        await km.shutdown_kernel()

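async def execute_code(code: str, timeout: int = 30) -> str:
    """Execute code in a fresh kernel and return its combined output"""
    chunks: list[str] = []
    async with kernel_client() as kc:
        msg_id = kc.execute(code)
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout  # one budget for the whole run, not per message
        try:
            while True:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    raise asyncio.TimeoutError
                reply = await asyncio.wait_for(
                    kc.get_iopub_msg(),
                    timeout=remaining
                )
                # Skip messages that do not belong to this request
                if reply["parent_header"].get("msg_id") != msg_id:
                    continue
                msg_type = reply["msg_type"]
                content = reply["content"]
                if msg_type == "stream":
                    # Collect every chunk instead of returning on the first one
                    chunks.append(content["text"])
                elif msg_type == "execute_result":
                    # Value of the last expression, e.g. "1 + 1"
                    chunks.append(content["data"].get("text/plain", ""))
                elif msg_type == "error":
                    chunks.append(f"Error: {content['evalue']}")
                elif msg_type == "status" and content["execution_state"] == "idle":
                    break  # the kernel has finished this request
        except asyncio.TimeoutError:
            chunks.append("Error: Execution timed out")
    return "".join(chunks)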

@app.post("/execute", response_model=ExecutionResult)
async def execute(request: CodeRequest):
    """Code execution endpoint"""
    try:
        output = await execute_code(request.code)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return ExecutionResult(output=output)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

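The core logic: each execution request starts an independent Jupyter Kernel, executes the code, returns the result, then destroys the Kernel. This keeps requests isolated from one another, but it also means variables do not carry over between requests; to get that persistence, you would keep one Kernel alive per session instead.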
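On the Agent side, calling this service is a single POST. The following is a minimal client sketch using only the standard library; it assumes the service from main.py is reachable at http://localhost:8000, and the helper names build_execute_request and run_remote are made up for this example.

```python
import json
import urllib.request


def build_execute_request(code: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request that the /execute endpoint expects."""
    body = json.dumps({"code": code}).encode("utf-8")  # matches the CodeRequest model
    return urllib.request.Request(
        f"{base_url}/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_remote(code: str, base_url: str = "http://localhost:8000", timeout: float = 60.0) -> str:
    """Send code to the sandbox service and return the `output` field of the reply."""
    request = build_execute_request(code, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["output"]


if __name__ == "__main__":
    req = build_execute_request("print(21 * 2)")
    print(req.get_method(), req.full_url, req.data.decode())
    # With the service running: print(run_remote("print(21 * 2)"))
```

The client timeout is set above the server's 30-second execution timeout so that the server's own timeout message can still be delivered.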

Step 2: Write the Dockerfile

FROM jupyter/base-notebook:latest

WORKDIR /app

# Install dependencies first so this layer stays cached when only main.py changes
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py /app/main.py

# Use non-root user (security best practice)
USER jovyan

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Note the USER jovyan line. Running the container as a non-root user is a security practice: even if code breaks out of the Python process, the privileges it gains are limited.

Step 3: Deploy to GKE (Enable gVisor)

If you’re using GKE, just add one line to the Pod config:

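apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-sandbox
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the Pod labels below
    matchLabels:
      app: agent-sandbox
  template:
    metadata:
      labels:
        app: agent-sandbox
    spec:
      runtimeClassName: gvisor  # Key: enable gVisor
      containers:
      - name: sandbox
        image: your-registry/agent-sandbox:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            memory: "512Mi"
            cpu: "500m"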

That’s it—your code execution environment is now running in a gVisor sandbox.

Step 4: Add Security Restrictions

The above configuration isn’t complete. For production, add these restrictions:

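# Network policy: restrict to necessary APIs only
# Read-only filesystem (this block goes under the container entry)
securityContext:
  readOnlyRootFilesystem: true  # mount an emptyDir for paths the kernel must write to, such as /tmp
  allowPrivilegeEscalation: false

# Set execution timeout
# Already added timeout parameter in FastAPI code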
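To show where these settings sit relative to the Deployment from Step 3, here is a sketch that assembles the combined manifest as a Python dict and prints it as JSON, which kubectl apply -f also accepts. The function name sandbox_deployment is made up for this example. The writable /tmp volume and the two environment variables that point Jupyter and IPython at it are assumptions about what the kernel needs under a read-only root filesystem; your image may need other writable paths.

```python
import json


def sandbox_deployment(image: str, name: str = "agent-sandbox") -> dict:
    """Return a Deployment manifest combining gVisor, resource limits, and hardening."""
    labels = {"app": name}
    container = {
        "name": "sandbox",
        "image": image,
        "ports": [{"containerPort": 8000}],
        "resources": {"limits": {"memory": "512Mi", "cpu": "500m"}},
        "securityContext": {
            "readOnlyRootFilesystem": True,
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
        # Assumption: redirecting these directories to /tmp is enough for the
        # kernel to start with a read-only root filesystem.
        "env": [
            {"name": "JUPYTER_RUNTIME_DIR", "value": "/tmp/jupyter-runtime"},
            {"name": "IPYTHONDIR", "value": "/tmp/ipython"},
        ],
        "volumeMounts": [{"name": "tmp", "mountPath": "/tmp"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "runtimeClassName": "gvisor",
                    "containers": [container],
                    "volumes": [{"name": "tmp", "emptyDir": {"sizeLimit": "64Mi"}}],
                },
            },
        },
    }


print(json.dumps(sandbox_deployment("your-registry/agent-sandbox:latest"), indent=2))
```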

Advanced: Kubernetes Cluster Deployment

If your AI Agent application needs large-scale deployment, a single container won’t cut it. This is where Kubernetes’s Agent Sandbox controller comes in.

Google open-sourced Agent Sandbox in 2025, providing declarative sandbox management APIs.

Sandbox CRD Core Concepts

Agent Sandbox defines several custom resources:

Sandbox: Single sandbox instance with stable identity, persistent storage, lifecycle management
SandboxTemplate: Sandbox template defining standardized configurations
SandboxClaim: On-demand sandbox instance requests

A simple Sandbox configuration example:

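# Check the API group and version against the Agent Sandbox CRDs installed in your cluster
apiVersion: sandbox.k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: my-agent-sandbox
spec:
  template:
    spec:
      runtimeClassName: gvisor
      containers:
      - name: executor
        image: python:3.11-slim
        command: ["sleep", "infinity"]
        # Resource limits (set per container)
        resources:
          limits:
            memory: "1Gi"
            cpu: "1"
        volumeMounts:
        - name: workspace
          mountPath: /workspace  # example path
      # Scratch storage: an emptyDir lives only as long as the Pod
      volumes:
      - name: workspace
        emptyDir: {}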

Lifecycle Management

One highlight of Agent Sandbox is support for pause/resume:

# Pause sandbox (release CPU and most memory)
kubectl patch sandbox my-agent-sandbox --type=merge -p '{"spec":{"paused":true}}'

# Resume sandbox
kubectl patch sandbox my-agent-sandbox --type=merge -p '{"spec":{"paused":false}}'

This is particularly useful for intermittently executing AI Agents—pause when idle (almost no resources), resume in seconds when there’s work.
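An orchestrator would normally issue these patches from code instead of by hand. The sketch below builds the same kubectl command as an argument list; the function name pause_command is made up for this example, and the spec.paused field is taken from the commands above, not verified against a particular release.

```python
import json
from typing import List


def pause_command(name: str, paused: bool) -> List[str]:
    """Build the kubectl command that pauses or resumes the named sandbox."""
    patch = json.dumps({"spec": {"paused": paused}})
    return ["kubectl", "patch", "sandbox", name, "--type=merge", "-p", patch]


print(pause_command("my-agent-sandbox", True))
# To apply it against a cluster: subprocess.run(pause_command(name, True), check=True)
```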

Warm Pool

To further reduce startup latency, Agent Sandbox supports “warm pools”—pre-creating batches of paused sandboxes that can be activated on demand.

This brings sandbox “cold start” time from seconds to milliseconds.

Managed Service Selection Guide

If you don’t want to manage infrastructure yourself, managed services are a good choice. Here are the mainstream options:

E2B: Open Source + Cloud Hosting

E2B is a code sandbox service designed specifically for AI Agents. It has two versions:

E2B Cloud: Use their cloud service directly, pay-as-you-go
E2B on AWS: Deploy the open-source version to your own AWS account

E2B uses Firecracker underneath with solid security. The SDK is clean:

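# pip install e2b-code-interpreter; expects E2B_API_KEY in the environment.
# Method names have changed between SDK releases, so check the docs for your version.
from e2b_code_interpreter import Sandbox

# Create sandbox
sandbox = Sandbox()

# Execute code
result = sandbox.run_code("print('Hello, World!')")

# Shut down the sandbox
sandbox.kill()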

E2B on AWS is particularly suitable for enterprises with data sovereignty requirements—all data stays in your own account.

AWS Bedrock AgentCore

AWS launched Bedrock AgentCore in 2025, specifically for AI Agent code execution and browser operations.

Code Interpreter provides Python/JavaScript/TypeScript runtimes, with each session executing in an independent microVM, supporting files up to 5GB.

Browser Tool lets AI Agents operate browsers—open pages, fill forms, click buttons. This is especially useful for Agents that need to scrape web pages or operate SaaS applications.

The billing model is reasonable: pay for actual vCPU and memory usage time, not instance runtime. Resources release automatically after code execution finishes.

Selection Recommendations

Scenario	Recommended Solution
Quick validation, small-scale apps	E2B Cloud
Enterprise, data localization needed	E2B on AWS or Bedrock AgentCore
Deep AWS ecosystem user	Bedrock AgentCore
Browser automation needed	Bedrock AgentCore Browser Tool
Full control, ops capability	Self-hosted Kubernetes + Agent Sandbox

Conclusion

After all this, the core message is simple: Security isn’t optional—it’s infrastructure for AI Agent applications.

For technology selection:

Small teams, quick validation—Docker or gVisor is enough
Enterprise apps, high security requirements—Firecracker or managed services
Already using Kubernetes—go with Agent Sandbox controller

Whatever you choose, start with local testing. Write the simplest FastAPI + Docker config, get it running, then consider security hardening and production deployment.

Remember: add a sandbox early. Remediating after a security incident is far more expensive.

Build AI Agent Sandbox Environment

Build a secure AI code execution environment from scratch

⏱️ Estimated time: 60 min

Step 1: Create FastAPI Service
Write main.py file with code execution endpoint:

• Use AsyncKernelManager to manage Jupyter Kernel
• Set execution timeout (default 30 seconds)
• Return execution result or error message

Step 2: Write Dockerfile
Build based on jupyter/base-notebook image:

• Install dependencies (FastAPI, uvicorn)
• Run as non-root user (jovyan)
• Expose port 8000

Step 3: Deploy to Kubernetes
Configure Pod to enable gVisor:

• Set runtimeClassName: gvisor
• Configure resource limits (CPU/memory)
• Add security context (read-only filesystem)

Step 4: Verify Sandbox Isolation
Test security boundaries:

• Attempt to access host filesystem (should be denied)
• Execute resource-intensive code (should be limited)
• Check if network isolation works
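The verification step can be partly scripted. Below is a minimal probe sketch meant to be submitted to the sandbox itself: it checks whether a directory is writable and whether an outbound TCP connection succeeds. The helper names can_write and can_connect and the target 1.1.1.1:443 are choices made for this example; in a locked-down sandbox you would expect the root filesystem write and the outbound connection to fail.

```python
import os
import socket
import tempfile


def can_write(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
    except OSError:
        return False  # read-only, missing, or permission denied
    os.close(fd)
    os.unlink(path)
    return True


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, timed out, or DNS failure


if __name__ == "__main__":
    print("write to /           :", can_write("/"))
    print("write to temp dir    :", can_write(tempfile.gettempdir()))
    print("connect 1.1.1.1:443  :", can_connect("1.1.1.1", 443))
```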

FAQ

What's the difference between Docker containers and gVisor?

Docker containers share the kernel with the host—if an attacker exploits a kernel vulnerability, they can escape. gVisor implements a "fake kernel" (Sentry) in user space that intercepts all system calls, only allowing safe operations, providing stronger isolation.

When should I use Firecracker instead of gVisor?

Choose Firecracker when security requirements are extremely high:

• Need hardware-level isolation (e.g., financial, medical data)
• Must meet strict compliance requirements
• Processing completely untrusted third-party code

gVisor has lower performance overhead (10-20%), suitable for most production scenarios.

How to choose between E2B and AWS Bedrock AgentCore?

E2B for quick validation and open-source control:
• Small-scale apps start with E2B Cloud
• Data localization needs use E2B on AWS

Bedrock AgentCore for deep AWS ecosystem users:
• Already using AWS services for easier integration
• Need browser automation? Choose Browser Tool

Will sandbox affect code execution performance?

Yes, there's some impact: Docker has almost no loss, gVisor overhead is about 10-20%, Firecracker about 15-30%. But for AI Agent code execution scenarios (data analysis, script processing), this overhead is usually acceptable.

How to quickly set up a sandbox for local development?

Simplest approach: run a gVisor-enabled container with Docker (with gVisor installed, pass --runtime=runsc to docker run). If using GKE, just add `runtimeClassName: gvisor` to the Pod config. For pure local dev, Docker isolation is sufficient, provided you set resource limits and user permissions.

9 min read · Published on: Mar 23, 2026 · Modified on: Jul 14, 2026

Easton
