Agent Sandbox Guide: A Complete Solution for Safely Running AI Code
In spring 2025, security researchers tested all 16 public AI Agents from Y Combinator’s Spring batch. Seven were compromised. Some leaked user data, others allowed remote code execution, and one deleted the entire database.
That is the price of letting AI Agents execute code. Give them freedom, and they dig holes for you.
Many of us use AI to write code, run scripts, and process data. But have you thought about running LLM-generated code directly on a server? What if it runs rm -rf /, or quietly sends AWS keys to an external host?
That is why Agent Sandbox exists.
What sets AI Agents apart from traditional apps is not chat or instruction following—it is that they can write and execute code themselves.
Picture this: you ask a data analysis Agent to process a 1GB sales file. It writes Python to read, analyze, and chart the data. You never reviewed that code. Then it runs.
Here are the serious risks:
Arbitrary code execution. LLMs do not respect security boundaries. os.system(), subprocess.run()—they use them without thinking. A crafted prompt can trigger arbitrary system commands.
Resource exhaustion. Agent code has no resource awareness. An infinite loop maxes CPU; runaway recursion blows memory. Your server goes down.
File system overreach. Without path restrictions, it can read the whole disk and write anywhere—configs, keys, user data.
Network exfiltration. A hidden HTTP request can ship sensitive data to an attacker. You may never notice.
OWASP’s 2025 AI Agent Security Top 10 ranks “Agent tool interaction manipulation” first—attackers can steer how Agents call tools via prompt injection and similar tricks.
Real cases already exist:
- Langflow RCE: Horizon3 found remote code execution via malicious input.
- Cursor auto-execution: Researchers showed certain MCP commands can be triggered by crafted prompts.
- Replit database wipe: AI-generated code deleted an entire database.
Sandbox is not optional. It is infrastructure: just as you would not expose a server without a firewall, you should not let AI run code without a sandbox.
Sandbox comes down to three things: isolation (cage risky code), limits (CPU, memory, network, files), and audit (log what ran so you can investigate).
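The limits and audit parts can be tried in plain Python before any container is involved. The sketch below is a minimal illustration for Linux and is not a real sandbox, since it provides no isolation at all. The function name run_limited, the default limits, and the logger name are assumptions made for this example.

```python
import logging
import resource
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("sandbox.audit")  # hypothetical logger name


def _apply_limits(cpu_seconds: int, memory_bytes: int):
    """Return a function that sets rlimits in the child just before exec."""
    def set_limits() -> None:
        # Cap CPU time: the kernel kills the child once it is exceeded.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        # Cap the address space so runaway allocations fail.
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
    return set_limits


def run_limited(code: str, cpu_seconds: int = 2,
                memory_bytes: int = 1024 ** 3,
                wall_timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run code in a child interpreter with CPU/memory limits and an audit trail."""
    audit_log.info("executing %d characters of code", len(code))
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],   # -I: isolated mode
        capture_output=True,
        text=True,
        timeout=wall_timeout,                 # wall-clock limit
        preexec_fn=_apply_limits(cpu_seconds, memory_bytes),
    )
    audit_log.info("finished with exit code %d", result.returncode)
    return result
```

A call such as run_limited("while True: pass", cpu_seconds=1) ends with a non-zero exit code once the CPU limit is hit. File system and network restrictions still have to come from the sandbox technologies discussed next.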
Mainstream Sandbox Technology Comparison
Now that we know we need a sandbox, which one should we use? There are three main technology approaches: Containers (Docker), gVisor, and Firecracker microVMs.
First, here’s a comparison table:
| Solution | Security Isolation | Startup Speed | Resource Efficiency | Use Case |
|---|---|---|---|---|
| Docker Container | ★★☆☆☆ | ★★★★★ | ★★★★★ | Dev/test, low-risk code |
| gVisor | ★★★★☆ | ★★★★☆ | ★★★☆☆ | Production, medium risk |
| Firecracker | ★★★★★ | ★★★★☆ | ★★★☆☆ | High security requirements, production |
Docker Containers: Fast But Not Secure Enough
Docker is the most common choice. Fast startup, low resource consumption, mature ecosystem. But here’s the problem: Docker containers share the kernel with the host.
What does that mean? Although container processes are isolated by namespaces, if an attacker exploits a kernel vulnerability, they can break out of the container boundary and get root privileges on the host.
Several container escape vulnerabilities were disclosed in 2024, including the runc flaw CVE-2024-21626. For untrusted AI-generated code, Docker’s security boundary isn’t enough.
gVisor: Building a “Fake Kernel” in User Space
gVisor is an open-source Google project with an interesting approach—instead of using the host kernel directly, it implements a “fake kernel” (called Sentry) in user space.
When a program in the container makes system calls, gVisor intercepts them and Sentry services them itself, passing only a small, filtered set of calls on to the host kernel. This way, even if code tries to cause damage, it never talks to the real kernel directly.
gVisor’s advantage is good compatibility—most Docker images run directly. The downside is some performance overhead (about 10-20%), and some special system calls might not be supported.
GKE (Google Kubernetes Engine) natively supports gVisor through GKE Sandbox: once it is enabled on a node pool, add runtimeClassName: gvisor to your Pod config.
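Outside GKE, gVisor can also be selected per container with plain Docker, provided runsc has been installed and registered as a Docker runtime on the host. The helper below is a sketch that only builds the command line. The function name, the image name, and the chosen limits are assumptions for illustration.

```python
import shlex
from typing import List


def gvisor_docker_command(image: str, code: str,
                          memory: str = "512m", cpus: str = "0.5") -> List[str]:
    """Build a `docker run` command that executes code under gVisor (runsc)."""
    return [
        "docker", "run", "--rm",
        "--runtime=runsc",      # use gVisor instead of the default runc
        "--network=none",       # no network access from inside the container
        "--read-only",          # root filesystem is read-only
        "--memory", memory,     # memory cap
        "--cpus", cpus,         # CPU cap
        "--pids-limit", "64",   # guard against fork bombs
        image, "python", "-c", code,
    ]


command = gvisor_docker_command("python:3.11-slim", "print('hi')")
print(shlex.join(command))
```

Pass the list to subprocess.run to execute it. Keeping the command as a list avoids shell quoting problems with the submitted code.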
Firecracker: True Hardware-Level Isolation
Firecracker is AWS’s open-source microVM technology. Each sandbox is a small virtual machine with its own independent kernel.
What does this mean? Even if an attacker gets root privileges in the sandbox and exploits a kernel vulnerability, they’re still just messing around in a VM—completely unable to affect the host.
Firecracker microVMs start in 100-800 milliseconds and carry much lower resource overhead than traditional VMs (a microVM needs as little as 128MB of memory).
E2B and AWS Bedrock AgentCore, two professional AI code sandbox services, both use Firecracker underneath.
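To give a feel for how small a microVM definition is, the sketch below builds a JSON configuration in the shape Firecracker accepts through its --config-file option. The kernel and rootfs paths are placeholders, and the function name and boot arguments are assumptions for this example.

```python
import json


def firecracker_config(kernel_path: str, rootfs_path: str,
                       vcpus: int = 1, mem_mib: int = 128) -> dict:
    """Describe a single microVM: kernel, root drive, and machine size."""
    return {
        "boot-source": {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,
                "is_root_device": True,
                "is_read_only": True,   # guest cannot modify the image
            }
        ],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


# Placeholder paths: supply your own kernel and root filesystem images.
config = firecracker_config("/path/to/vmlinux", "/path/to/rootfs.ext4")
print(json.dumps(config, indent=2))
```

Saved to a file, this would be passed to the firecracker binary together with an API socket path. Guest networking and the jailer are left out to keep the sketch short.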
Selection Decision Framework
How to choose? Here’s a simple decision tree:
- Just local development/testing? Docker is enough—convenient and fast.
- Deploying to production?
  - Medium security requirements, performance-focused → gVisor
  - High security requirements, compliance needed → Firecracker
- Don’t want to manage infrastructure? Use managed services (E2B, Bedrock AgentCore)
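The decision tree above can be written down as a small function, which is handy when the choice has to be made per workload in code. This is only a sketch of the tree as stated; the function and parameter names are made up for illustration.

```python
def choose_sandbox(production: bool, high_security: bool = False,
                   self_managed: bool = True) -> str:
    """Map the decision tree to a recommended sandbox technology."""
    if not self_managed:
        # Team does not want to run infrastructure itself.
        return "managed service (E2B, Bedrock AgentCore)"
    if not production:
        # Local development and testing.
        return "Docker"
    # Production: pick by security requirements.
    return "Firecracker" if high_security else "gVisor"


print(choose_sandbox(production=False))
print(choose_sandbox(production=True, high_security=True))
```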
Hands-On: Building a Local Development Sandbox
Enough theory—let’s build one. Our approach: FastAPI + Jupyter Kernel + gVisor Container.
Why this combination?
- FastAPI provides a clean HTTP interface so AI Agents can submit code execution requests via a REST API
- Jupyter Kernel provides an interactive Python execution environment in which variables persist for as long as the kernel stays alive
- gVisor Container provides security isolation to prevent malicious code from affecting the host
Step 1: Write the FastAPI Service
Create a main.py file:
    # main.py
    import asyncio
    from contextlib import asynccontextmanager

    from fastapi import FastAPI, HTTPException
    from jupyter_client.manager import AsyncKernelManager
    from pydantic import BaseModel

    app = FastAPI()


    class CodeRequest(BaseModel):
        code: str


    class ExecutionResult(BaseModel):
        output: str


    @asynccontextmanager
    async def kernel_client():
        """Manage Jupyter Kernel lifecycle: start a fresh kernel, always shut it down"""
        km = AsyncKernelManager(kernel_name="python3")
        await km.start_kernel()
        kc = km.client()
        kc.start_channels()
        try:
            await kc.wait_for_ready()
            yield kc
        finally:
            kc.stop_channels()
            await km.shutdown_kernel(now=True)


    async def execute_code(code: str, timeout: int = 30) -> str:
        """Execute code and return the collected output"""
        async with kernel_client() as kc:
            msg_id = kc.execute(code)
            chunks = []
            loop = asyncio.get_running_loop()
            # One deadline for the whole execution, not per message
            deadline = loop.time() + timeout
            try:
                while True:
                    remaining = deadline - loop.time()
                    if remaining <= 0:
                        raise asyncio.TimeoutError
                    reply = await asyncio.wait_for(
                        kc.get_iopub_msg(),
                        timeout=remaining
                    )
                    # Skip messages that belong to other requests (or to none)
                    if reply["parent_header"].get("msg_id") != msg_id:
                        continue
                    msg_type = reply["msg_type"]
                    if msg_type == "stream":
                        # Collect every chunk instead of returning the first one
                        chunks.append(reply["content"]["text"])
                    elif msg_type == "execute_result":
                        chunks.append(reply["content"]["data"].get("text/plain", ""))
                    elif msg_type == "error":
                        return f"Error: {reply['content']['evalue']}"
                    elif msg_type == "status" and reply["content"]["execution_state"] == "idle":
                        break
            except asyncio.TimeoutError:
                return "Error: Execution timed out"
        return "".join(chunks)


    @app.post("/execute", response_model=ExecutionResult)
    async def execute(request: CodeRequest):
        """Code execution endpoint"""
        try:
            output = await execute_code(request.code)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
        return ExecutionResult(output=output)


    if __name__ == "__main__":
        import uvicorn
        uvicorn.run(app, host="0.0.0.0", port=8000)
The core logic: each execution request starts an independent Jupyter Kernel, executes the code, returns results, then destroys the Kernel.
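From the Agent's side, calling this service is a single POST request. The client below is a minimal sketch using only the standard library. It assumes the service is reachable at a base URL such as http://localhost:8000, and the helper names are invented for this example.

```python
import json
import urllib.request


def build_execute_request(base_url: str, code: str) -> urllib.request.Request:
    """Build the POST /execute request carrying the code as JSON."""
    payload = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_code(base_url: str, code: str, timeout: float = 60.0) -> str:
    """Send code to the sandbox service and return its `output` field."""
    request = build_execute_request(base_url, code)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["output"]


# Example, with the service running locally:
#   print(submit_code("http://localhost:8000", "print(1 + 1)"))
```

The client timeout is set higher than the server's 30-second execution timeout so that a timed-out execution is reported by the service rather than cut off by the client.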
Step 2: Write the Dockerfile
    FROM jupyter/base-notebook:latest
    WORKDIR /app
    COPY main.py /app/main.py
    COPY requirements.txt /app/requirements.txt
    RUN pip install --no-cache-dir -r requirements.txt
    # Use non-root user (security best practice)
    USER jovyan
    EXPOSE 8000
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Note the USER jovyan line: this is a security practice. Running the container as a non-root user means that even if code escapes, its privileges are limited.
Step 3: Deploy to GKE (Enable gVisor)
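If you’re using GKE with GKE Sandbox enabled on the node pool, add one line to the Pod spec:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: agent-sandbox
    spec:
      # A Deployment needs a selector that matches the template labels
      selector:
        matchLabels:
          app: agent-sandbox
      template:
        metadata:
          labels:
            app: agent-sandbox
        spec:
          runtimeClassName: gvisor  # Key: enable gVisor
          containers:
            - name: sandbox
              image: your-registry/agent-sandbox:latest
              ports:
                - containerPort: 8000
              resources:
                limits:
                  memory: "512Mi"
                  cpu: "500m"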
That’s it—your code execution environment is now running in a gVisor sandbox.
Step 4: Add Security Restrictions
The above configuration isn’t complete. For production, add these restrictions:
    # Network policy: restrict to necessary APIs only
    # Read-only filesystem (mount an emptyDir for any path the kernel must write to,
    # such as its runtime directory)
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
    # Set execution timeout
    # Already added timeout parameter in FastAPI code
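The network policy is only mentioned in a comment above, so here is one way it might look. The sketch generates a Kubernetes NetworkPolicy as JSON, which kubectl apply accepts just like YAML. The pod label app: agent-sandbox, the policy name, and the example CIDR are assumptions, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json
from typing import List


def egress_allowlist_policy(name: str, app_label: str,
                            allowed_cidrs: List[str],
                            namespace: str = "default") -> dict:
    """Build a NetworkPolicy that blocks all egress except the listed CIDRs."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Applies to pods carrying this label (assumed label).
            "podSelector": {"matchLabels": {"app": app_label}},
            "policyTypes": ["Egress"],
            # An empty list here means "deny all outbound traffic".
            "egress": [
                {"to": [{"ipBlock": {"cidr": cidr}}]} for cidr in allowed_cidrs
            ],
        },
    }


# 203.0.113.10/32 is a documentation address standing in for a real API endpoint.
policy = egress_allowlist_policy("sandbox-egress", "agent-sandbox", ["203.0.113.10/32"])
print(json.dumps(policy, indent=2))
```

Passing an empty list produces a deny-all egress policy, which is a reasonable default for code that has no need to reach the network. Note that this also blocks DNS unless it is explicitly allowed.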
Advanced: Kubernetes Cluster Deployment
If your AI Agent application needs large-scale deployment, a single container won’t cut it. This is where Kubernetes’s Agent Sandbox controller comes in.
Google open-sourced Agent Sandbox in 2025, providing declarative sandbox management APIs.
Sandbox CRD Core Concepts
Agent Sandbox defines several custom resources:
- Sandbox: Single sandbox instance with stable identity, persistent storage, lifecycle management
- SandboxTemplate: Sandbox template defining standardized configurations
- SandboxClaim: On-demand sandbox instance requests
A simple Sandbox configuration example:
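    # API group and field names follow the kubernetes-sigs/agent-sandbox project;
    # check them against the CRD version installed in your cluster
    apiVersion: agents.x-k8s.io/v1alpha1
    kind: Sandbox
    metadata:
      name: my-agent-sandbox
    spec:
      podTemplate:
        spec:
          runtimeClassName: gvisor
          containers:
            - name: executor
              image: python:3.11-slim
              command: ["sleep", "infinity"]
              # Resource limits
              resources:
                limits:
                  memory: "1Gi"
                  cpu: "1"
              volumeMounts:
                - name: workspace
                  mountPath: /workspace
          # Workspace volume (emptyDir lasts only as long as the pod;
          # use a PersistentVolumeClaim for data that must survive restarts)
          volumes:
            - name: workspace
              emptyDir: {}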
Lifecycle Management
One highlight of Agent Sandbox is support for pause/resume:
    # Pause sandbox (release CPU and most memory)
    kubectl patch sandbox my-agent-sandbox --type=merge -p '{"spec":{"paused":true}}'
    # Resume sandbox
    kubectl patch sandbox my-agent-sandbox --type=merge -p '{"spec":{"paused":false}}'
This is particularly useful for intermittently executing AI Agents—pause when idle (almost no resources), resume in seconds when there’s work.
Warm Pool
To further reduce startup latency, Agent Sandbox supports “warm pools”—pre-creating batches of paused sandboxes that can be activated on demand.
This brings sandbox “cold start” time from seconds to milliseconds.
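The warm pool idea itself is independent of Kubernetes and fits in a few lines. The class below is a conceptual sketch, not the Agent Sandbox implementation: the factory stands in for whatever creates a paused sandbox, and all names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Keep `size` pre-created sandboxes ready so acquire() rarely cold-starts."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        self._factory = factory
        self._size = size
        # Pay the creation cost up front, before any request arrives.
        self._idle: Deque[T] = deque(factory() for _ in range(size))

    @property
    def idle_count(self) -> int:
        return len(self._idle)

    def acquire(self) -> T:
        """Hand out a warm sandbox, or cold-start one if the pool is empty."""
        if self._idle:
            return self._idle.popleft()
        return self._factory()

    def refill(self) -> int:
        """Top the pool back up; returns how many sandboxes were created."""
        created = 0
        while len(self._idle) < self._size:
            self._idle.append(self._factory())
            created += 1
        return created
```

In a real controller, refill would run in the background so that callers of acquire never wait for it. Here it is a separate method to keep the sketch synchronous and easy to follow.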
Managed Service Selection Guide
If you don’t want to manage infrastructure yourself, managed services are a good choice. Here are the mainstream options:
E2B: Open Source + Cloud Hosting
E2B is a code sandbox service designed specifically for AI Agents. It has two versions:
- E2B Cloud: Use their cloud service directly, pay-as-you-go
- E2B on AWS: Deploy the open-source version to your own AWS account
E2B uses Firecracker underneath with solid security. The SDK is clean:
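    # Requires the e2b-code-interpreter package and an E2B_API_KEY environment variable
    from e2b_code_interpreter import Sandbox
    # Create sandbox (newer SDK releases use Sandbox.create() instead)
    sandbox = Sandbox()
    # Execute code
    result = sandbox.run_code("print('Hello, World!')")
    # Shut down sandbox
    sandbox.kill()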
E2B on AWS is particularly suitable for enterprises with data sovereignty requirements—all data stays in your own account.
AWS Bedrock AgentCore
AWS launched Bedrock AgentCore in 2025, specifically for AI Agent code execution and browser operations.
Code Interpreter provides Python/JavaScript/TypeScript runtimes, with each session executing in an independent microVM, supporting files up to 5GB.
Browser Tool lets AI Agents operate browsers—open pages, fill forms, click buttons. This is especially useful for Agents that need to scrape web pages or operate SaaS applications.
The billing model is reasonable: pay for actual vCPU and memory usage time, not instance runtime. Resources release automatically after code execution finishes.
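A quick calculation shows why usage-based billing matters for bursty Agent workloads. The rates below are placeholders chosen to make the arithmetic easy to follow, not AWS prices, and the function names are invented for this sketch.

```python
def usage_cost(vcpu_seconds: float, gb_seconds: float,
               vcpu_rate: float, gb_rate: float) -> float:
    """Cost when only consumed vCPU-seconds and GB-seconds are billed."""
    return vcpu_seconds * vcpu_rate + gb_seconds * gb_rate


def always_on_cost(wall_seconds: float, vcpus: float, memory_gb: float,
                   vcpu_rate: float, gb_rate: float) -> float:
    """Cost when the full instance is billed for its whole lifetime."""
    return wall_seconds * (vcpus * vcpu_rate + memory_gb * gb_rate)


# Placeholder rates (NOT real pricing): per vCPU-second and per GB-second.
VCPU_RATE = 0.0001
GB_RATE = 0.00001

# A 10-minute session that actually computes for 30 seconds using 1 GB.
active = usage_cost(vcpu_seconds=30, gb_seconds=30,
                    vcpu_rate=VCPU_RATE, gb_rate=GB_RATE)
reserved = always_on_cost(wall_seconds=600, vcpus=1, memory_gb=1,
                          vcpu_rate=VCPU_RATE, gb_rate=GB_RATE)
print(f"usage-based: {active:.4f}, always-on: {reserved:.4f}")
```

With these made-up numbers, the session that idles most of the time costs a small fraction of what an always-on instance of the same size would.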
Selection Recommendations
| Scenario | Recommended Solution |
|---|---|
| Quick validation, small-scale apps | E2B Cloud |
| Enterprise, data localization needed | E2B on AWS or Bedrock AgentCore |
| Deep AWS ecosystem user | Bedrock AgentCore |
| Browser automation needed | Bedrock AgentCore Browser Tool |
| Full control, ops capability | Self-hosted Kubernetes + Agent Sandbox |
Conclusion
After all this, the core message is simple: Security isn’t optional—it’s infrastructure for AI Agent applications.
For technology selection:
- Small teams, quick validation—Docker or gVisor is enough
- Enterprise apps, high security requirements—Firecracker or managed services
- Already using Kubernetes—go with Agent Sandbox controller
Whatever you choose, start with local testing. Write the simplest FastAPI + Docker config, get it running, then consider security hardening and production deployment.
Remember: add sandbox early. Don’t wait for a security incident to remediate—that’s much more expensive.
Build AI Agent Sandbox Environment
Build a secure AI code execution environment from scratch
⏱️ Estimated time: 60 min
Step 1: Create FastAPI Service
Write main.py file with code execution endpoint:
• Use AsyncKernelManager to manage Jupyter Kernel
• Set execution timeout (default 30 seconds)
• Return execution result or error message
Step 2: Write Dockerfile
Build based on jupyter/base-notebook image:
• Install dependencies (FastAPI, uvicorn)
• Run as non-root user (jovyan)
• Expose port 8000
Step 3: Deploy to Kubernetes
Configure Pod to enable gVisor:
• Set runtimeClassName: gvisor
• Configure resource limits (CPU/memory)
• Add security context (read-only filesystem)
Step 4: Verify Sandbox Isolation
Test security boundaries:
• Attempt to access host filesystem (should be denied)
• Execute resource-intensive code (should be limited)
• Check if network isolation works
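For the verification step, a few probes submitted through the /execute endpoint show quickly whether the boundaries hold. The script below is a sketch: the probed paths and the 1.1.1.1:443 target are arbitrary choices for illustration, and a hardened sandbox should ideally report False for all three.

```python
import os
import socket


def can_read(path: str) -> bool:
    """True if the file can be opened and read from inside the sandbox."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return True
    except OSError:
        return False


def can_write(directory: str) -> bool:
    """True if a probe file can be created (and removed) in the directory."""
    probe = os.path.join(directory, ".sandbox_probe")
    try:
        with open(probe, "w") as handle:
            handle.write("x")
        os.remove(probe)
        return True
    except OSError:
        return False


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if an outbound TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_probes() -> dict:
    """Collect the probe results; every True deserves a closer look."""
    return {
        "read /etc/shadow": can_read("/etc/shadow"),
        "write /": can_write("/"),
        "connect 1.1.1.1:443": can_connect("1.1.1.1", 443),
    }
```

Resource limits are easiest to check separately, for example by submitting an infinite loop and confirming that the service returns its timeout error.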
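FAQ
What's the difference between Docker containers and gVisor?
Docker containers share the host kernel, so a kernel vulnerability can lead to container escape. gVisor handles the container's system calls in a user-space kernel (Sentry), so untrusted code never talks to the host kernel directly.
When should I use Firecracker instead of gVisor?
Choose Firecracker when you:
• Need hardware-level isolation (e.g., financial, medical data)
• Must meet strict compliance requirements
• Process completely untrusted third-party code
gVisor, with roughly 10-20% performance overhead, is suitable for most other production scenarios.
How to choose between E2B and AWS Bedrock AgentCore?
E2B for quick starts and data control:
• Small-scale apps start with E2B Cloud
• Data localization needs use E2B on AWS
Bedrock AgentCore for deep AWS ecosystem users:
• Already using AWS services for easier integration
• Need browser automation? Choose Browser Tool
Will sandbox affect code execution performance?
Yes, to a degree. gVisor adds about 10-20% overhead, and a Firecracker microVM takes 100-800 milliseconds to start. Pausing idle sandboxes and keeping a warm pool reduce the startup cost.
How to quickly set up a sandbox for local development?
Start with the FastAPI + Jupyter Kernel service from the hands-on section in a Docker container, then add gVisor and the security restrictions before moving to production.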
7 min read · Published on: Mar 23, 2026 · Modified on: Jun 1, 2026