This prompt is for teams that want to bolt an agent-native sandbox onto an existing monorepo.
The goal is not to create another E2E suite or a prettier docker-compose up. The goal is to give coding agents a disposable, inspectable, worktree-local environment where they can behave more like a careful human doing manual QA: open the app in a browser, click through real flows, inspect logs, read local side effects, check emails, review analytics/error events, adjust seeded scenarios, and produce visual proof without waiting for a human to run the project.
Think of this as giving agents vision and x-ray access to the app. Each agent/worktree should be able to spin up its own realistic runtime, keep working longer, avoid port fights with other agents, and leave behind evidence artifacts that make reviews easier.
The shape you are aiming for is this:
agent-a worktree
  https://web.agent-a.acme.local
  https://admin.agent-a.acme.local
  https://api.agent-a.acme.local
  /tmp/acme-sandbox/workspaces/agent-a/
  /tmp/acme-sandbox/runs/run_abc123/
agent-b worktree
  https://web.agent-b.acme.local
  https://admin.agent-b.acme.local
  https://api.agent-b.acme.local
  /tmp/acme-sandbox/workspaces/agent-b/
  /tmp/acme-sandbox/runs/run_def456/
human-fix-payment worktree
  https://web.human-fix-payment.acme.local
  https://admin.human-fix-payment.acme.local
  https://api.human-fix-payment.acme.local
  /tmp/acme-sandbox/workspaces/human-fix-payment/
  /tmp/acme-sandbox/runs/run_ghi789/
Instead of one shared localhost:3000 dev server that every agent collides with, each worktree gets its own little universe: app URLs, backend URLs, local DB, mock providers, logs, screenshots, email outbox, analytics ledger, error sink, and teardown. Agents can run in parallel because they are no longer fighting over ports, global env files, browser sessions, or a single mutable local database.
This is the sizzle:
- Agent A can fix checkout while Agent B investigates admin reporting.
- Both can click through real browser flows at the same time.
- Each gets clean logs scoped to its own services.
- Each gets its own seeded users, orders, fixtures, webhooks, and artifacts.
- Agents can choose a tiny seed radius for quick UI inspection or a larger scenario bundle for full workflow proof.
- Each can prove its result with screenshots plus durable side effects.
- The human reviewer gets a compact evidence trail instead of "I ran it locally".
The point is not that the URLs must literally be acme.local. The point is that developers and agents should be able to look at the workspace and immediately know: this branch has its own working app, its own services, and its own proof folder.
Use this as the first prompt in your repository. It should start a discovery and design process before implementation.
Reference projects and docs to inspect while designing this:
- Procpane: https://github.com/jokull/procpane
- Use as the process-supervision model: healthcheck-gated services, addressable tasks, queryable status, tail/grep, process signaling, and agent-readable logs.
- The repository is jokull/procpane, not procpane/jokull.
- Urgentry: https://github.com/urgentry/urgentry
- Use as a concrete example of a local Sentry-compatible error sink. If the app already uses Sentry SDKs, a local Sentry-compatible target can give agents an inspectable issue list without sending sandbox errors to production.
- Superset: https://superset.sh/ and https://docs.superset.sh/workspaces
- Use as the worktree/workspace companion model. Superset workspaces are isolated git worktrees with their own directory, terminal sessions, ports, and agent workflows.
- OrbStack: https://orbstack.dev/
- Use as the preferred macOS Docker/runtime/networking layer, especially for local container domains and fast bind-mounted development.
- Docker docs: https://docs.docker.com/
- Use for the container boundary and image/build mechanics.
- Turborepo docs: https://turborepo.dev/docs
- Use when the monorepo already has Turbo; Procpane is especially natural when services map onto Turbo tasks.
- Playwright docs: https://playwright.dev/docs/intro
- Use for browser automation primitives, screenshots, console/page error capture, auth state, and headed/headless browser control. Do not let this turn the sandbox into a brittle E2E suite by default.
You are an expert developer-experience engineer working inside an existing monorepo.
Your task is to design, then implement, an agent sandbox for this project.
An agent sandbox is a disposable, per-worktree runtime for coding agents. It is a level above a normal manual dev environment. It should let agents create realistic scenarios, run the app in a browser, inspect all relevant services and side effects, and produce proof artifacts while staying isolated from production writes and from other parallel worktrees.
This is not primarily an E2E test suite. Do not frame the work around brittle committed browser tests. Frame it around an ergonomic agent playground and workspace companion with measurable green lights, x-ray visibility, safe integration fidelity, and repeatable teardown.
Build two modes on the same substrate:
Disposable proof runs:
- Start a fresh sandbox run.
- Seed a scenario.
- Open one or more app surfaces in a real browser.
- Capture screenshots, logs, network requests, local side effects, and summary evidence.
- Tear the run down cleanly.
Worktree workspace mode:
- Each worktree gets its own long-lived sandbox runtime.
- Ports and browser URLs are stable for that worktree.
- Parallel agents do not share one global dev server.
- Workspace setup emits machine-readable metadata that agents and tools can consume.
- The sandbox becomes the runtime companion for every agent workspace.
The first public green light should be:
From a fresh worktree, the setup/run command can start a working sandbox with the right services, stable ports, routed browser URLs, process visibility, and an artifact directory, then print a clear readiness summary.
Use this stack unless the repo or host environment gives you a concrete reason not to:
- One disposable Docker container per sandbox run or workspace.
- Procpane (https://github.com/jokull/procpane) as the in-container process supervisor.
- The repo mounted into the container, with generated sandbox env files.
- The monorepo's native package manager and task runner for builds. Prefer Turbo when the repo uses it; otherwise adapt to pnpm/npm/yarn/bun/nx/rush/etc.
- OrbStack (https://orbstack.dev/) on macOS for fast Docker and ergonomic local container networking.
- Stable pretty URLs for browser-facing services where possible, especially per-worktree URLs.
- Superset (https://superset.sh/) or equivalent worktree orchestration as the primary parallel-agent workspace flow.
- Plain localhost ports as a fallback for Linux, VPS, Docker Desktop, CI, or hosts without OrbStack-style local domains.
This pattern is local-workstation first. It also has strong upside on a dedicated Mac mini, workstation box, or beefy VPS that can host multiple long-running agent worktrees. It is not optimized for tiny stateless CI containers. It wants a real machine with Docker, persistent caches, browser automation, enough RAM/CPU, and stable networking.
Before editing files, inspect the monorepo and produce a short design plan. Do not guess the topology.
Discover:
- Package manager and workspace layout.
- Browser app surfaces: web app, admin app, docs app, dashboard, embedded widget, etc.
- Backend services: API, workers, queues, cron, webhooks, RPC, background jobs.
- Data stores: SQL, NoSQL, cache, search, object storage, queues.
- Existing dev commands and build graph.
- Existing Docker/devcontainer/docker-compose setup, if any.
- Existing browser automation or preview tooling.
- Existing environment variables and secret loading.
- Existing production, staging, and local integration boundaries.
- Existing observability: logs, analytics, error tracking, tracing, email, webhooks.
- Existing worktree, agent, or Superset conventions.
Then ask concise questions only for decisions that cannot be inferred safely from the codebase. Good questions include:
- Which app surfaces should be browser-addressable first?
- Which scenarios matter most for manual QA?
- Which integrations may use real read-only credentials?
- Which vendors have safe test-mode credentials?
- Which systems must never receive writes from the sandbox?
- Which production data sources may agents read through existing tools, and which are forbidden?
- Which remote-only features should be disabled and explicitly marked as caveats?
- What is the intended local host environment: macOS with OrbStack, Linux, VPS, or mixed?
After discovery, produce a design plan and wait for approval before implementing. The plan should include the command surface, services, ports, artifact paths, integration policy, and first green-light milestone.
Design the sandbox around these concepts.
Run registry:
- Every sandbox run has a run id.
- Every run writes a run.json with services, ports, URLs, artifact paths, process names, seed summary, and fidelity caveats.
- Store run artifacts under a stable host directory such as /tmp/<project>-sandbox/runs/<run-id>/.
- Write an env.sh or equivalent so agents can source useful URLs and paths.
- Provide status, summary, and down commands.
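As a rough illustration of the run registry, the sketch below writes a run.json and an env.sh for one run. The function name, the SANDBOX_* variable names, and the exact JSON shape are assumptions made for this example; a real run.json would also carry ports, process names, and the seed summary.

```python
import json
import shlex
from pathlib import Path


def write_run_registry(root: Path, run_id: str, services: dict[str, str],
                       caveats: list[str]) -> Path:
    """Write run.json and env.sh for one sandbox run and return the run directory."""
    run_dir = root / "runs" / run_id
    artifacts = run_dir / "artifacts"
    artifacts.mkdir(parents=True, exist_ok=True)

    manifest = {
        "run_id": run_id,
        "services": {name: {"url": url} for name, url in services.items()},
        "artifacts": str(artifacts),
        "caveats": caveats,
    }
    (run_dir / "run.json").write_text(json.dumps(manifest, indent=2) + "\n")

    # env.sh lets an agent `source` URLs and paths instead of parsing JSON.
    # The SANDBOX_* names are hypothetical.
    lines = [
        f"export SANDBOX_RUN_ID={shlex.quote(run_id)}",
        f"export SANDBOX_ARTIFACTS={shlex.quote(str(artifacts))}",
    ]
    for name, url in sorted(services.items()):
        lines.append(f"export SANDBOX_{name.upper()}_URL={shlex.quote(url)}")
    (run_dir / "env.sh").write_text("\n".join(lines) + "\n")
    return run_dir
```

An agent could then run `source env.sh` inside the run directory and reach every service URL from plain shell variables.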
Workspace registry:
- Every worktree/workspace has a stable workspace id.
- Derive the workspace id from worktree path, branch, git common dir, and/or workspace name.
- Preserve port reservations across teardown unless explicitly released.
- Store workspace state under a stable host directory such as /tmp/<project>-sandbox/workspaces/<workspace-id>/.
- Emit metadata for Superset or equivalent tooling, such as .superset/ports.json.
- No-argument commands should prefer the current workspace run when available.
- Aim for readable workspace-scoped URLs when possible:
https://web.<workspace>.<project>.local
https://admin.<workspace>.<project>.local
https://api.<workspace>.<project>.local
- If the project has more surfaces, extend the pattern: docs, storybook, worker, mail, errors, queue, analytics.
- If pretty local domains are not available, still keep the same conceptual model with direct localhost ports and metadata:
{
  "workspace": "agent-a",
  "services": {
    "web": { "url": "http://127.0.0.1:43100" },
    "admin": { "url": "http://127.0.0.1:43101" },
    "api": { "url": "http://127.0.0.1:43102" }
  },
  "artifacts": "/tmp/acme-sandbox/runs/run_abc123"
}
Container boundary:
- Use one container per run/workspace.
- Mount the repo into the container.
- Mask host .env, .env.*, .dev.vars, and similar secret files inside the container by default.
- Generate sandbox env files from a curated allowlist.
- Reject live/production credentials where prefix or metadata checks are possible.
- Copy only explicit read-only, public, fake, or vendor-test-mode values.
- Keep Docker image build context small.
- Cache package installs and build outputs where safe.
- Make image rebuilds explicit when refreshing base tools.
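The allowlist idea above can be sketched as follows. The key names in ALLOWED_KEYS and the helper names are hypothetical, and the parser only handles simple KEY=VALUE lines; it is an illustration of the filtering step, not a full dotenv implementation.

```python
# Hypothetical allowlist: only these host keys may cross into the sandbox.
ALLOWED_KEYS = {"NODE_ENV", "PUBLIC_CMS_TOKEN", "STRIPE_SECRET_KEY"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_sandbox_env(host_env: dict[str, str],
                      overrides: dict[str, str]) -> dict[str, str]:
    """Keep allowlisted host values, then apply sandbox-owned overrides."""
    env = {k: v for k, v in host_env.items() if k in ALLOWED_KEYS}
    env.update(overrides)  # e.g. local DB URL, mock proxy base URLs
    return dict(sorted(env.items()))


def render_env(env: dict[str, str]) -> str:
    """Render the generated sandbox env file contents."""
    return "".join(f"{key}={value}\n" for key, value in env.items())
```

Anything not on the allowlist is dropped unless the sandbox sets it itself, so a production DATABASE_URL in the host .env never reaches the container.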
Process supervision:
- Run all app services through Procpane.
- Give every service an addressable process name.
- Add health checks for each service.
- Expose commands for status, tail, grep, restart/signal, and log collection.
- On startup failure, print the failing process and relevant logs.
- Do not require agents to shell into the container to inspect routine failures.
- Leave breadcrumbs in comments/docs pointing maintainers to the exact process-supervisor project you modeled this on: https://github.com/jokull/procpane.
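To make the health-check gate concrete, here is a generic polling sketch. It does not use Procpane's API (which this document does not describe); the function names and the idea of passing one probe callable per service are assumptions for illustration.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        return False


def wait_for_healthy(checks: dict[str, Callable[[], bool]],
                     timeout: float = 60.0,
                     interval: float = 0.5) -> list[str]:
    """Poll each probe until it passes; return services still failing at the deadline."""
    deadline = time.monotonic() + timeout
    pending = dict(checks)
    while pending:
        pending = {name: probe for name, probe in pending.items() if not probe()}
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(interval)
    return sorted(pending)
```

An empty result means every service is ready. A non-empty result names the failing processes, which is the point at which the startup command would print their logs.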
Browser access:
- Provide a browser helper that can open a target app surface and path.
- Capture screenshots.
- Capture console errors, page errors, failed requests, and route state.
- Support seeded auth state for common roles.
- Support both direct localhost ports and pretty per-workspace URLs.
- Prefer OrbStack local domains on macOS when available.
- Avoid maintaining a hand-written route registry; infer route targets from app source and scenario metadata where possible.
Artifacts:
- Write all important evidence into the run artifact directory.
- Include screenshots, browser result JSON, process logs, external request logs, analytics events, email outbox, error-tracking events/issues, media request logs, webhook logs, seed summary, and final summary.
- Make artifacts easy for agents to quote in PRs.
- The final summary should combine browser evidence with durable invariants: DB rows, URL state, session/cookie state, request logs, email artifacts, analytics events, error issues, queue entries, or process health.
Scenario seeds:
- Build small, named scenario bundles instead of one giant fixture.
- Provide seed radii or tiers such as z, s, m, l, and optionally xl.
- Radii should expand to ordered named bundles. For example:
z -> baseline reference data only
s -> baseline + admin user + customer user
m -> s + ordinary domain spine, such as account/order/project/booking
l -> m + checkout/provider/content/workflow bundles
xl -> l + expensive or broad scenario data, only if truly useful
- Keep exact bundle names project-specific. Examples of generic bundle names: reference-data, admin-user, customer-user, basic-account, order, booking, checkout-ready, content-page, provider-fixture, workflow-with-webhook, queue-backlog, analytics-demo, error-demo.
- Seeds should create real local data in the local DB or stores.
- Scenario metadata should describe useful app paths, relevant ids, critical UI/media selectors, expected side effects, and auth roles.
- Prefer product-real behavior over sandbox-only branches.
- Bundles should be composable and idempotent where practical.
- Bundles may share a seed context object so later bundles can build on earlier ids. For example, a booking bundle can reuse the customer/order created by an order bundle; a webhook bundle can reuse the payment id created by checkout-ready.
- Every seed run should write both machine-readable and human-readable summaries, such as seed-summary.json and seed-summary.md.
- The summary should include selected bundles, generated ids, useful app paths, auth roles, expected side effects, and scenario caveats.
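The radius-to-bundle expansion described above could look like this minimal sketch. The RADII table reuses the generic bundle names from this section as placeholders; real bundle names are project-specific, and a real implementation would probably reject unknown names instead of passing them through.

```python
# Each radius lists earlier radii and/or bundle names, in order.
# Contents are illustrative placeholders.
RADII: dict[str, list[str]] = {
    "z": ["reference-data"],
    "s": ["z", "admin-user", "customer-user"],
    "m": ["s", "order", "booking"],
    "l": ["m", "checkout-ready", "workflow-with-webhook"],
}


def expand_seed(spec: str) -> list[str]:
    """Expand "m" or "admin-user,order" into an ordered, deduplicated bundle list."""
    ordered: list[str] = []

    def visit(name: str) -> None:
        if name in RADII:
            for child in RADII[name]:
                visit(child)
        elif name not in ordered:
            ordered.append(name)

    for part in spec.split(","):
        part = part.strip()
        if part:
            visit(part)
    return ordered


print(expand_seed("m"))
# ['reference-data', 'admin-user', 'customer-user', 'order', 'booking']
```

Because a radius is only an alias for an ordered list, `--seed m` and `--seed s,order,booking` resolve to the same bundles and run through the same code path.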
For every integration, classify it before implementation. Create a short policy table in the design plan with:
- Integration name.
- Local strategy.
- Whether writes are possible.
- Secret source and safety rule.
- Fidelity caveat.
- Evidence surface.
Use these decision patterns.
Run real locally:
- Use for services that are part of the app stack and safe to own locally.
- Examples: web app, admin app, API, worker, local DB, Redis, queue worker, search index, object-store emulator.
- Evidence: process health, logs, DB rows, queue state, browser output.
Use read-only real service:
- Use when realism matters and writes are impossible or safely scoped.
- Examples: public CMS token, read-only product catalog, maps/browser public key, read-only feature metadata.
- Never copy write-capable tokens into the sandbox.
- Evidence: rendered content, request logs, explicit caveat that writes are not covered.
Use vendor test mode:
- Use when the vendor has a real sandbox/test environment.
- Examples: Stripe or other payment processors in test mode, email provider sandbox domains, SMS test numbers, webhook test endpoints.
- Reject live keys by prefix or vendor metadata when possible.
- Evidence: webhook logs, signed fixture events, DB state changes, local analytics events, provider test ids.
Mock with request recording:
- Use when writes would affect partners, cost money, create bookings, mutate CRM records, send messages, or trigger fulfillment.
- Examples: booking providers, shipping providers, CRM, enrichment APIs, supplier APIs, irreversible fulfillment, most AI providers by default, partner inventory APIs, tax engines, address verification, fraud scoring.
- Route traffic through a local mock proxy.
- Record every request as JSONL with sensitive headers/body fields redacted.
- Validate that requests match seeded scenario ids where possible.
- Evidence: request log counts, latest payloads, matched fixture names, unregistered request list.
Shim in browser:
- Use when a third-party browser script is not the thing under inspection.
- Examples: payment elements, analytics scripts, CAPTCHA script, ads, tag managers, heatmaps, support widgets.
- Make the shim explicit in readiness/status output.
- Evidence: page loads, console stays clean, app receives expected callbacks.
Route to local sink:
- Use for observability and side effects that agents need to inspect.
- Email should go to a local outbox with HTML, text, headers, attachments, and metadata artifacts.
- Error tracking should go to a local Sentry-compatible sink or envelope capture.
- Analytics should go to a local endpoint and JSONL event ledger. For a PostHog-style SDK, point the host/API URL at the sandbox mock proxy, accept capture/batch/identify/group calls, redact or hash identifiers, and write every event to analytics-events.jsonl.
- Webhooks can go to a local receiver or signed fixture sender.
- Queues should expose local queue state and manual trigger commands.
- Evidence: outbox index, analytics counts, error issue list, envelope logs, queue entries.
Disable explicitly:
- Use when local fidelity would be fake, dangerous, or too expensive for the first phase.
- Examples: anti-bot/CAPTCHA, fraud/risk engines, production-only remote bindings, irreversible fulfillment, high-cost AI agents, live compliance systems.
- Disable only in local sandbox mode.
- Fail closed outside local sandbox mode.
- Surface the disabled feature in status/summary as a fidelity caveat.
- Do not pretend the scenario covers it.
Add an app-code seam:
- Use when production code is too rigid to support safe local fidelity.
- Good seams include env-configured base URLs, local transports, feature flags, seeded auth helpers, provider routing, explicit local disable switches, and graceful degradation for optional read-only content.
- Avoid bad sandbox bleed.
- Do not add product branches like "if sandbox then fake success".
- Do not bypass business logic just to make a flow green.
- Do not teach production code about a specific harness command.
- Do not silently disable risky systems outside local sandbox mode.
Escalate to a human:
- Escalate when a choice changes business behavior, security posture, payment state, booking state, fulfillment, production data access, or compliance guarantees.
Seeding:
- Treat seeding as a first-class product surface for agents, not as hidden fixture setup.
- Use seed radii for speed and intent:
z: zero-ish baseline data. Reference tables, currencies, feature metadata, or anything required for the app to boot realistically.
s: small. Baseline plus common users/roles and empty-state browsing.
m: medium. Small plus the ordinary domain spine agents need for most work, such as an account/order/project/booking.
l: large. Medium plus checkout, provider, content, webhook, queue, or analytics scenario data.
xl: optional. Broad or expensive seeds for demos and complex workflows.
- Let agents request exact bundles when that is clearer than a radius.
- Keep bundle order deterministic and deduplicate expanded bundles.
- Keep a shared seed context so bundles can compose without hardcoded ids.
- Write seed-summary.json for tools and seed-summary.md for humans.
- Include app paths and browser hints in the seed result, not only database ids.
- Good seed metadata includes:
- user ids and roles
- login/auth state hints
- primary app paths
- relevant entity ids
- provider fixture ids
- expected webhook ids
- expected analytics events
- visible selectors
- critical media selectors
- known caveats
- Bad seed design creates one giant kitchen-sink fixture that every scenario depends on, or hides important generated ids in logs that agents cannot parse.
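One way to picture the shared seed context and the two summaries is the sketch below. The bundle functions, id values, and app paths are placeholders standing in for real local DB inserts; only the file names seed-summary.json and seed-summary.md come from this document.

```python
import json
from pathlib import Path


def seed_order(ctx: dict) -> None:
    # Stand-in for real inserts; ids would normally come back from the local DB.
    ctx["ids"]["customer"] = "cus_123"
    ctx["ids"]["order"] = "ord_456"
    ctx["paths"].append(f"/orders/{ctx['ids']['order']}")


def seed_booking(ctx: dict) -> None:
    # Builds on the order created earlier instead of hardcoding its id.
    order_id = ctx["ids"]["order"]
    ctx["ids"]["booking"] = "bkg_789"
    ctx["paths"].append(f"/orders/{order_id}/bookings/{ctx['ids']['booking']}")


BUNDLES = {"order": seed_order, "booking": seed_booking}


def run_seed(bundle_names: list[str], out_dir: Path) -> dict:
    """Run bundles in order against one shared context and write both summaries."""
    ctx = {"bundles": [], "ids": {}, "paths": [], "caveats": []}
    for name in bundle_names:
        BUNDLES[name](ctx)
        ctx["bundles"].append(name)

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "seed-summary.json").write_text(json.dumps(ctx, indent=2) + "\n")
    md = ["Seed summary", "", "Bundles: " + ", ".join(ctx["bundles"]), "", "Ids:"]
    md += [f"- {key}: {value}" for key, value in ctx["ids"].items()]
    md += ["", "Paths:"] + [f"- {path}" for path in ctx["paths"]]
    (out_dir / "seed-summary.md").write_text("\n".join(md) + "\n")
    return ctx
```

The JSON file is what tools and browser helpers read; the Markdown file is what a reviewer skims or an agent quotes in a PR.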
Payments:
- Prefer vendor test mode, such as Stripe test mode or the equivalent for the project's provider.
- Require test keys.
- Reject live keys.
- Support a long-running local webhook listener when useful.
- Also provide a deterministic signed fixture sender for proof runs.
- Browser payment UI may be shimmed if payment-element fidelity is not under inspection.
- Evidence: webhook log, local DB row, analytics event, payment id, signed fixture result.
Payments are a good example of mixed fidelity:
- Real enough: server talks to vendor test mode, webhook signatures are valid, local DB mutations are real.
- Shimmed: embedded browser payment UI can be replaced with a deterministic browser shim when the test card iframe is not the thing being evaluated.
- Recorded: all webhook attempts and outgoing provider calls should be captured in artifacts.
- Guarded: live keys should fail startup before the sandbox reaches the browser.
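The startup guard for live keys can be as small as a prefix check. The sketch below assumes Stripe-style key prefixes (sk_live_, rk_live_, pk_live_ versus their _test_ counterparts); the function and exception names are hypothetical, and other vendors need their own rules.

```python
# Stripe-style prefixes; adapt per vendor.
LIVE_PREFIXES = ("sk_live_", "rk_live_", "pk_live_")
TEST_PREFIXES = ("sk_test_", "rk_test_", "pk_test_")


class LiveCredentialError(RuntimeError):
    """Raised before any service starts when a key is not clearly test-mode."""


def assert_test_keys(env: dict[str, str], key_names: list[str]) -> None:
    """Fail closed: allow only empty values or recognized test-mode keys."""
    for name in key_names:
        value = env.get(name, "")
        if not value:
            continue
        # Never echo the key itself into logs or artifacts.
        if value.startswith(LIVE_PREFIXES):
            raise LiveCredentialError(f"{name} looks like a live key; refusing to start")
        if not value.startswith(TEST_PREFIXES):
            raise LiveCredentialError(f"{name} is not a recognized test-mode key")
```

Running this check during env generation means a misconfigured worktree fails before the sandbox reaches the browser.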
Email:
- Never send real email by default.
- Route to local outbox.
- Save HTML, text, metadata, and attachments.
- Provide commands to list latest email, open HTML, and grep content.
- Evidence: outbox index and rendered email artifact.
Email outbox examples:
- Transactional email provider SDK writes to email-outbox/.
- SMTP traffic is redirected to a local SMTP capture service.
- Worker/platform email bindings write .eml, .html, .txt, and metadata JSON files.
- Commands expose emails, email latest --html, and email latest --grep <text>.
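A local outbox writer could be sketched with the standard library's email package. The sequential file naming and the metadata fields are assumptions; the transport that hands messages to `capture_email` (SMTP capture, SDK shim, platform binding) is left out.

```python
import json
from email.message import EmailMessage
from pathlib import Path


def capture_email(outbox: Path, msg: EmailMessage) -> Path:
    """Write .eml, .txt, .html, and metadata JSON for one message; return the metadata path."""
    outbox.mkdir(parents=True, exist_ok=True)
    stem = f"{len(list(outbox.glob('*.eml'))) + 1:04d}"  # simple sequential id

    (outbox / f"{stem}.eml").write_bytes(msg.as_bytes())
    for subtype, suffix in (("plain", ".txt"), ("html", ".html")):
        part = msg.get_body(preferencelist=(subtype,))
        if part is not None:
            (outbox / f"{stem}{suffix}").write_text(part.get_content(), encoding="utf-8")

    meta = {
        "id": stem,
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "subject": str(msg.get("Subject", "")),
        "attachments": [a.get_filename() for a in msg.iter_attachments()],
    }
    meta_path = outbox / f"{stem}.json"
    meta_path.write_text(json.dumps(meta, indent=2) + "\n")
    return meta_path
```

With files laid out this way, listing the latest email or grepping its content reduces to ordinary file operations on the outbox directory.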
Analytics:
- Do not send agent QA traffic to production analytics by default.
- Route to local JSONL sink.
- Preserve event names and safe properties.
- Hash or redact identifiers.
- Provide a command to require expected events.
- Evidence: event counts and latest payloads.
PostHog-style analytics are especially useful to record locally:
- Point POSTHOG_HOST, NEXT_PUBLIC_POSTHOG_HOST, or equivalent SDK host config at the sandbox mock proxy.
- Accept common analytics routes such as capture, batch, identify, group, decide, flags, and static script requests as needed by the SDK.
- Return harmless feature-flag/bootstrap responses.
- Write every accepted event to analytics-events.jsonl.
- Keep event names, timestamps, distinct ids after hashing, safe properties, and request context.
- Redact emails, names, tokens, cookies, authorization headers, and long free-text fields by default.
- Provide sandbox analytics --require "Checkout Started" and sandbox analytics --latest style commands.
- Evidence: event counts, missing required events, latest payloads, and the artifact path.
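The ledger and the require check might be sketched like this. The redacted property names, the salt, and the helper names are assumptions, and the incoming event shape (event, distinct_id, properties) is a simplified PostHog-style payload, not a full SDK contract.

```python
import hashlib
import json
from pathlib import Path

# Illustrative list of properties to drop before writing the ledger.
REDACTED_PROPS = {"email", "name", "token", "$ip"}


def hash_id(distinct_id: str, salt: str = "sandbox") -> str:
    """Replace an identifier with a short, stable hash."""
    return hashlib.sha256(f"{salt}:{distinct_id}".encode()).hexdigest()[:16]


def record_event(ledger: Path, event: dict) -> dict:
    """Append a sanitized event to the JSONL ledger and return what was written."""
    safe = {
        "event": event["event"],
        "distinct_id": hash_id(str(event.get("distinct_id", ""))),
        "properties": {k: v for k, v in event.get("properties", {}).items()
                       if k not in REDACTED_PROPS},
    }
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(safe, sort_keys=True) + "\n")
    return safe


def missing_events(ledger: Path, required: list[str]) -> list[str]:
    """Return required event names that never appeared in the ledger."""
    seen = set()
    if ledger.exists():
        for line in ledger.read_text(encoding="utf-8").splitlines():
            if line.strip():
                seen.add(json.loads(line)["event"])
    return [name for name in required if name not in seen]
```

A `--require` style command would exit non-zero when `missing_events` returns anything, which turns "the event fired" into a checkable invariant.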
Error tracking:
- Route SDK DSNs to a local compatible sink or envelope recorder.
- Consider Urgentry (https://github.com/urgentry/urgentry) when the app already emits Sentry-compatible events and you want a real local issue UI instead of a raw envelope log.
- Preserve enough grouping/context to debug.
- Provide issue list/detail commands.
- Evidence: local issue count, envelope logs, stack trace.
Sentry-style error tracking examples:
- Use a fake DSN whose host points at the local container.
- Route browser and server SDK envelopes to Urgentry or a small envelope recorder.
- Keep issue title, stack frames, tags, release/environment, request URL, and breadcrumbs.
- Provide sandbox errors, sandbox errors <issue-id>, and sandbox logs errors commands.
- Treat an empty error sink as evidence too: if the page showed a 500 but no issue was recorded, the sandbox should make that observability gap visible.
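For the "small envelope recorder" option, the sketch below parses a Sentry-style envelope and extracts issue titles. It reflects my understanding of the envelope layout (a JSON header line, then item header and payload pairs, with an optional length field) and of the exception.values event shape; treat both as assumptions to verify against the SDK actually in use.

```python
import json


def parse_envelope(raw: bytes) -> tuple[dict, list[tuple[dict, bytes]]]:
    """Split an envelope into its header and a list of (item_header, payload) pairs."""
    header_line, _, rest = raw.partition(b"\n")
    envelope_header = json.loads(header_line)
    items: list[tuple[dict, bytes]] = []
    while rest.strip():
        item_line, _, rest = rest.partition(b"\n")
        if not item_line.strip():
            continue
        item_header = json.loads(item_line)
        length = item_header.get("length")
        if length is None:
            # Without a length, the payload runs to the next newline.
            payload, _, rest = rest.partition(b"\n")
        else:
            payload, rest = rest[:length], rest[length:]
            if rest.startswith(b"\n"):
                rest = rest[1:]
        items.append((item_header, payload))
    return envelope_header, items


def issue_titles(raw: bytes) -> list[str]:
    """Return one readable title per event item in the envelope."""
    _, items = parse_envelope(raw)
    titles = []
    for item_header, payload in items:
        if item_header.get("type") != "event":
            continue
        event = json.loads(payload)
        values = (event.get("exception") or {}).get("values") or []
        if values:
            last = values[-1]
            titles.append(f"{last.get('type', 'Error')}: {last.get('value', '')}")
        else:
            titles.append(str(event.get("message", "(no title)")))
    return titles
```

Storing the raw envelope next to the extracted titles keeps stack frames and breadcrumbs available for an issue detail command.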
CMS/content:
- Public/read-only tokens are often acceptable if product realism depends on them.
- Never copy write-capable tokens.
- Provide optional empty-content mock mode for degradation scenarios.
- Evidence: rendered content and explicit read-only caveat.
CMS and content examples:
- Use a public/read-only token for product pages where editorial content matters.
- Use local DB fixtures for commerce or workflow state that should not depend on editable CMS content.
- Add a mock-empty-content mode to prove graceful degradation.
- Record CMS request URLs and cache hits when useful.
- Never give the sandbox a token that can publish, mutate, or delete content.
Provider/partner APIs:
- Mock and record by default.
- Validate seeded ids.
- Return deterministic fixtures.
- Log unregistered requests loudly.
- Evidence: request ledger and matched fixture names.
Provider examples:
- Booking/reservation providers: mock availability, quote, reserve, cancel, and voucher endpoints; record every attempted booking write.
- Shipping/fulfillment: mock label creation and tracking; never buy a real label.
- CRM/sales tools: mock contact/deal writes; record payloads for review.
- Tax/address/risk vendors: fixture common outcomes and expose which fixture matched.
- AI providers: default to fixtures or low-cost explicit test mode; record prompt metadata with redaction.
- Supplier APIs: validate seeded supplier/product ids so accidental real ids are obvious.
Object storage/media:
- Prefer local fixture media or read-only public media.
- Record media requests.
- Assert important images decode in the browser when the scenario cares about them.
- Evidence: media request log and browser natural dimensions.
Storage and media examples:
- S3/R2/GCS-style uploads go to a local object-store emulator or filesystem bucket.
- Public image CDN reads may use real read-only URLs if visual fidelity matters.
- Upload flows should write local artifacts and expose links in the summary.
- Media proxy failures should be visible in browser results and request logs.
Auth:
- Seed users and roles.
- Generate browser auth state for common roles.
- Avoid weakening production auth.
- Evidence: session state, protected page access, role-specific UI.
Feature flags:
- Pin local sandbox flag state in generated config.
- Document flags in status.
- Evidence: flag file and visible behavior.
Feature-flag examples:
- Local static flag file generated per run.
- Mock flag service endpoint for LaunchDarkly/Statsig/PostHog-style SDKs.
- Explicit scenario metadata that says which flags are on/off.
- Status output lists non-default flags so agents do not misread behavior.
CAPTCHA/anti-bot:
- Disable or bypass only in local sandbox mode.
- Fail closed elsewhere.
- Evidence: status caveat and absence of production token use.
AI/LLM providers:
- Mock by default.
- Optionally allow explicit low-cost test provider mode with recording.
- Redact prompts/results where needed.
- Evidence: recorded request/response metadata or fixture name.
Webhooks:
- Support both listener mode and signed fixture mode.
- Capture endpoint secret material only into sandbox artifacts.
- Evidence: webhook request log and app side effects.
Webhook examples:
- Payment succeeded/failed/refunded events.
- Subscription lifecycle events.
- Provider booking confirmation/cancellation callbacks.
- CMS publish webhooks.
- GitHub/GitLab app callbacks.
- Slack/Discord interaction payloads.
- Signed fixtures should use the same verification path as production code.
Cron and queues:
- Expose manual trigger commands.
- Run workers locally when feasible.
- Record job attempts and failures.
- Evidence: queue state, logs, side effects.
Cron and queue examples:
- sandbox queue list, sandbox queue drain, sandbox queue retry <id>.
- sandbox cron run nightly-settlement.
- Local dead-letter queue artifact.
- Worker logs and job payloads stored with redaction.
- Scenario summary includes whether expected jobs were enqueued and processed.
Remote-only infrastructure:
- If a feature depends on remote bindings or platform-only services, choose one:
- Build a local emulator.
- Use explicit remote test-mode with guardrails.
- Disable and label it as not covered.
- Do not fake coverage.
Adapt names to the repo, but preserve these capabilities:
- sandbox up --seed <scenario>: start a disposable proof run.
- sandbox up --seed z|s|m|l: start from a seed radius.
- sandbox up --seed admin-user,order,checkout-ready: start from explicit named bundles.
- sandbox seeds: list known seed radii and bundles.
- sandbox workspace up --seed <scenario> --reuse: start or reuse the current worktree sandbox.
- sandbox status [run-id]: print readiness, URLs, processes, artifacts, caveats.
- sandbox browse [run-id] <app> <path> [--auth <role>]: open browser and capture evidence.
- sandbox smoke [run-id] <scenario>: run a scenario-defined browser check and write artifacts.
- sandbox logs <service> or sandbox pane <service>: tail/grep/status/restart supervised services.
- sandbox emails, sandbox email latest --html: inspect outbox.
- sandbox analytics --require <event>: inspect local analytics ledger.
- sandbox errors: inspect local error sink.
- sandbox requests: inspect external request recorder.
- sandbox webhook <fixture>: send deterministic signed webhook fixture.
- sandbox summary [run-id]: write final evidence summary.
- sandbox down [run-id]: stop and clean up.
For Superset or worktree setup, provide scripts or commands like:
- workspace setup: generate env, copy safe local state, allocate ports, write metadata.
- workspace run: start/reuse the sandbox.
- workspace teardown: stop the sandbox, optionally release ports.
The workspace setup/run output should include:
- Workspace name and id.
- Browser URLs for each app surface.
- Direct localhost ports.
- Selected seed radius or bundle list.
- Seed summary path and important generated ids.
- Artifact directory.
- Process supervisor status.
- Health/readiness summary.
- Clear failure output with log pointers.
It should feel good in a terminal. A successful startup should look more like a launch panel than a wall of logs:
Sandbox workspace ready: agent-a
Apps
  web        https://web.agent-a.acme.local      http://127.0.0.1:43100
  admin      https://admin.agent-a.acme.local    http://127.0.0.1:43101
  api        https://api.agent-a.acme.local      http://127.0.0.1:43102
Seed
  radius     m
  bundles    reference-data, admin-user, customer-user, order, booking
  summary    /tmp/acme-sandbox/runs/run_abc123/seed-summary.md
  ids        customer=cus_123 order=ord_456 booking=bkg_789
X-ray
  logs       sandbox logs api
  browser    sandbox browse web /
  email      sandbox emails
  analytics  sandbox analytics
  errors     sandbox errors
  summary    sandbox summary
Artifacts
  /tmp/acme-sandbox/runs/run_abc123
This output is part of the product. It teaches agents and humans how to use the sandbox without reading the entire README.
Implement in phases. Keep the full architecture in mind, but get a real green light early.
Phase 1: Discovery and design
- Inspect the repo.
- Classify services and integrations.
- Propose command names and artifact paths.
- Identify app-code seams needed for fidelity.
- Ask only necessary questions.
- Wait for approval.
Phase 2: Minimal container runtime
- Add sandbox package or scripts.
- Build one Docker container.
- Run the core app services under Procpane: https://github.com/jokull/procpane.
- Generate env safely.
- Mask host secrets.
- Write run registry and artifact directory.
- Implement up, status, and down.
Phase 3: Measurable green light
- From a fresh worktree, start the sandbox.
- Confirm browser-facing URLs and direct ports.
- Confirm process health.
- Confirm logs are tail/grep accessible.
- Confirm artifact metadata is written.
- Make failures actionable.
Phase 4: Browser vision
- Add browser helper.
- Capture screenshot, console errors, page errors, failed requests, and result JSON.
- Add seeded auth state if needed.
- Add a basic scenario smoke command.
Phase 5: Scenario seeds
- Add small named seed bundles.
- Add seed radii such as z, s, m, l, and optionally xl.
- Make radii expand to ordered bundle lists, not separate hardcoded seed paths.
- Keep bundles composable: later bundles can depend on context from earlier bundles.
- Write seed summary JSON/Markdown.
- Attach scenario metadata: paths, ids, auth role, expected side effects.
Phase 6: Local observability and side effects
- Add email outbox.
- Add analytics ledger.
- Add error sink or envelope recorder.
- Add external request recorder.
- Add webhook fixture sender.
- Add media request logging if relevant.
Phase 7: Workspace mode
- Add stable per-worktree ids.
- Add stable port reservations.
- Add pretty URL routing where possible.
- Emit Superset/workspace metadata. Use Superset's worktree model as a reference: https://docs.superset.sh/workspaces.
- Make no-argument commands prefer the current workspace run.
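Stable per-worktree ids and port reservations could be derived roughly as follows. The id format, the port base of 43100, the block size, and the registry file layout are all assumptions for this sketch, and it skips the file locking a real multi-agent host would need.

```python
import hashlib
import json
import re
from pathlib import Path

PORT_BASE = 43100          # hypothetical start of the sandbox port range
PORTS_PER_WORKSPACE = 10   # hypothetical block size per workspace


def workspace_id(worktree_path: str, branch: str) -> str:
    """Readable slug from the directory name plus a short hash for uniqueness."""
    slug = re.sub(r"[^a-z0-9]+", "-", Path(worktree_path).name.lower()).strip("-")
    digest = hashlib.sha256(f"{worktree_path}\n{branch}".encode()).hexdigest()[:6]
    return f"{slug or 'workspace'}-{digest}"


def reserve_ports(registry_file: Path, ws_id: str,
                  services: list[str]) -> dict[str, int]:
    """Give each workspace a persistent port block; repeat calls return the same ports."""
    registry = json.loads(registry_file.read_text()) if registry_file.exists() else {}
    if ws_id not in registry:
        used = {entry["slot"] for entry in registry.values()}
        slot = next(i for i in range(1000) if i not in used)
        registry[ws_id] = {"slot": slot}
        registry_file.parent.mkdir(parents=True, exist_ok=True)
        registry_file.write_text(json.dumps(registry, indent=2) + "\n")
    base = PORT_BASE + registry[ws_id]["slot"] * PORTS_PER_WORKSPACE
    return {name: base + offset for offset, name in enumerate(services)}
```

Because the reservation lives in a registry file and not in the running container, ports survive teardown until the entry is explicitly removed.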
Phase 8: Documentation and proof workflow
- Document integration policy and caveats.
- Document common commands.
- Document how to produce PR evidence.
- Document how to add new scenarios.
- Include examples of good evidence: screenshot plus DB row, email, analytics event, request log, error issue, or queue state.
Prefer harness code, but allow narrow app-code changes when they improve real configurability or local fidelity.
Good app-code seams:
- Env-configured service base URLs.
- Local-safe email transport.
- Analytics/error DSNs configurable by env.
- Provider clients routed by base URL.
- Seeded local auth/session support.
- Feature flags controlled by local config.
- Explicit local disable switches for remote-only features.
- Graceful degradation when optional read-only content is absent.
Bad sandbox bleed:
- Product branches that only exist to fake success in the sandbox.
- Business logic bypasses.
- Silent disabling outside local sandbox mode.
- Hardcoding harness-specific assumptions into app logic.
- Hiding missing fidelity instead of reporting it as a caveat.
If a feature cannot be represented honestly, mark it disabled or partially covered in status and summary.
The sandbox is successful when an agent can:
- Start a worktree-local environment without human babysitting.
- See which services are healthy and which are not.
- Open the browser app at stable URLs.
- Exercise a seeded scenario.
- Inspect logs and side effects without entering the container.
- See external calls and local observability output.
- Produce screenshot-backed evidence plus at least one durable invariant.
- Tear down cleanly.
- Run multiple agents in parallel worktrees without port conflicts.
Do not stop at a theoretical design. After approval, implement the first useful vertical slice and prove the green light with actual command output.