github.com-cloudflare-sandbox-sdk
all · 10 devs · built 2026-06-08
Repository snapshot
Monthly reports
Highlights
- Introduced comprehensive *Cloudflare Tunnels* support, including quick tunnels [c6bf7dc4 · Aron] and persistent named tunnels [pr/722], significantly enhancing external connectivity for sandbox environments.
- Enhanced *R2 integration* with new capabilities like credential-less binding mount support [3ca24fc3 · Archie Ferguson] and the ability to override R2 backup bucket URLs [pr/733].
- Improved *sandbox execution* flexibility by implementing a sessionless execution mode [ae5f9a10 · Archie Ferguson] and enabling efficient binary data streaming via `readFile({ encoding: 'none' })` [718d4e77 · Aron].
- Refactored the *preview port management* system, removing the container-local registry [d05c6c7b · Naresh] and aligning APIs with runtime forwarding [986dbf87 · Naresh] for improved reliability.
- Implemented a critical fix for *RPC transport recovery* after OOM errors [1e311821 · Aron], preventing the Sandbox Durable Object from becoming unresponsive.
- Enhanced the *sandbox Docker build process* by introducing a `NODE_VERSION` build argument and updating the default Node.js runtime to Node 24 LTS [7c09e874 · Matt Van Horn].
Observations
- Maintenance activity saw a significant 63% increase compared to the 2-month average (current: 11, average: 7), indicating a strong focus on refactoring, architectural improvements, and dependency management.
- The waste score decreased by 41% from the 2-month average (current: 2, average: 4), suggesting improved code quality and fewer reworks, despite several bug fixes related to *RPC transport* and *tunnels*.
- Overall commit volume decreased by 38% (59 commits this month vs 94-commit 2-month average), yet the total output (Grow + Maintenance) increased by 23% (current: 22, average: 18), suggesting more impactful changes per commit.
- Multiple commits addressed issues related to *RPC transport* and *tunnels*, including fixes for inconsistent interfaces [68c8b713 · Aron], `getProcess()` errors [06996cf3 · Aron], and specific problems with *tunnels RPC* when using `vite-plugin` [412ed987 · Aron], indicating that these areas were under active development and stabilization.
- Several documentation updates were made, clarifying *Sandbox workspace boundaries* [a2181337 · whoiskatrin] and `RemoteMountBucketOptions` [7759a044 · Aron], improving user understanding and preventing misconfigurations.
- A series of version bumps and dependency updates (e.g., `capnweb` [453b5771 · Aron], `qs`, `turbo`, `ws`) were performed, contributing to the maintenance score and ensuring the project uses up-to-date libraries.
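The percentages in these observations are relative changes against the 2-month average. A minimal sketch of that arithmetic, assuming the usual (current - average) / average formula; the `percentChange` helper is hypothetical and not part of the report tooling. It uses the rounded figures quoted above, so the recomputed values can differ from the quoted percentages by a point or more if the report works from unrounded data.

```typescript
// Relative change of a current value against a baseline average, in percent.
function percentChange(current: number, average: number): number {
  if (average === 0) {
    throw new RangeError("average must be non-zero");
  }
  return ((current - average) / average) * 100;
}

// Rounded figures quoted in the observations above.
const commitChange = percentChange(59, 94); // commits this month vs 2-month average
const outputChange = percentChange(22, 18); // Grow + Maintenance output

console.log(commitChange.toFixed(1)); // "-37.2"
console.log(outputChange.toFixed(1)); // "22.2"
```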
Performance over time
ETV stacked by Growth, Maintenance and Fixes — 90-day moving average, normalized to ETV / month.
Average performance per developer
ETV per active developer per month — 30-day moving average.
Active developers over time
Unique developers committing each day — 90-day moving average.
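The charts above are described as moving averages, with ETV normalized to a monthly rate. A minimal sketch of one way to compute such a series, assuming a trailing window and a 30-day month; the report does not state its exact windowing or normalization, and `movingAverage`, `DAYS_PER_MONTH`, and the sample data are hypothetical.

```typescript
// Trailing moving average over `window` days.
// Early days average over however many days are available so far.
function movingAverage(daily: number[], window: number): number[] {
  const out: number[] = [];
  let sum = 0;
  for (let i = 0; i < daily.length; i++) {
    sum += daily[i];
    if (i >= window) sum -= daily[i - window]; // drop the day that left the window
    out.push(sum / Math.min(i + 1, window));
  }
  return out;
}

const DAYS_PER_MONTH = 30; // assumed normalization factor
const dailyEtv = [0, 3, 0, 0, 6, 0]; // hypothetical daily ETV values

// A 3-day window keeps the example small; the report uses 30 or 90 days.
const monthlyRate = movingAverage(dailyEtv, 3).map((v) => v * DAYS_PER_MONTH);
console.log(monthlyRate); // [ 0, 45, 30, 30, 60, 60 ]
```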
Knowledge concentration
How dependent is this repo on a small number of contributors? Higher top-1 share = higher key-person risk.
Naresh owns 46.2% of commits.
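The top-1 share is the most active contributor's commit count divided by the total. A minimal sketch, assuming the share is computed from raw per-author commit counts; the `topOneShare` helper, the author names, and the counts are invented for illustration and are not repository data.

```typescript
// Share of commits held by the single most active contributor, in percent.
function topOneShare(commitsByAuthor: Record<string, number>): number {
  const counts = Object.values(commitsByAuthor);
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0;
  return (Math.max(...counts) / total) * 100;
}

// Hypothetical counts, chosen only to illustrate the calculation.
const share = topOneShare({ alice: 6, bob: 4, carol: 3 });
console.log(share.toFixed(1)); // "46.2"
```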
Top contributors
Most impactful commits
Top 20 by ETV in the all-time window.
- 2.9 ETV · Implement sessionless execution mode (#706) · Archie Ferguson · ae5f9a10 · 2026-05-27
  * implement sessionless exec path
  * patch test issues
  * fix sessionless inconsistencies
  * fix issues + add e2e tests
  * fix execStream bug
  * biome fix
  * fix e2e test
  * remove configurable session passing
  * patch e2e test worker
  * dont persist in DO + change sentinel value
  * fix sentinel leaking
  * rebase
  * fix exec options merging bugs
  * fix rpc origin bug
  * address bonk comments
  * fixed proxy routing bug
  * update clunky method name
  * CI update
  Co-authored-by: scuffi <aferguson@cloudflare.com>
- 2.8 ETV · Add desktop environment container runtime (#422) · Naresh · dc706497 · 2026-03-03
  * Add desktop environment container runtime
    Enables running a full Linux desktop inside sandbox containers with programmatic screenshot and input control via native FFI.
  * Add desktop environment SDK client (#423)
  * Add desktop environment SDK client
    Sandboxes need a public API for desktop environments so Workers can manage desktops, capture screenshots, and stream VNC. The Dockerfile gains a desktop build stage with the required system dependencies.
  * Add desktop environment tests (#424)
  * Add desktop environment tests
    Unit tests for the container handler, service, and SDK client, plus an E2E test that exercises the full desktop lifecycle through a real deployed worker.
  * Add desktop viewer example (#425)
    React + Vite + Tailwind v4 app that demonstrates the desktop environment API with noVNC streaming and viewport-aware resolution.
  * Fix lint errors in desktop example and biome config
  * Add desktop E2E test Dockerfile and config generation
  * Add desktop image to CI build, push, and cleanup workflows
  * Fix Go build: pin golang.org/x/net to Go 1.24-compatible version
    go mod tidy resolved golang.org/x/net@v0.51.0 which requires go >= 1.25, breaking the go-builder stage using golang:1.24-bookworm. Pin to v0.50.0 (last Go 1.24-compatible release) and update go directive to match builder.
  * Fix FFI type mismatch and clickCount handling
    The Click FFI binding declared 'bool' (1-byte C _Bool) but the Go function expects C.int (4 bytes), causing undefined ABI behavior. Changed to 'int' and pass clickCount through directly so tripleClick emits three rapid single clicks instead of silently degrading to doubleClick.
  * Guard desktop stop in destroy() on container state
    desktop.stop() goes through containerFetch which auto-starts sleeping containers. Check ctx.container.running first so destroy() does not wake a container just to immediately tear it down.
  * Revert accidental backup and token doc changes
    The desktop branch commit inadvertently changed backup curl from streaming -T to --data-binary (loads full archive into memory), reduced timeouts from 1800s to 300s, removed the local-dev mismatch diagnostic, and changed token docs to show hyphens which the validation regex rejects. Restore all to match main.
  * Add error resilience to desktop worker and manager
    Catch stop() failures during start() error recovery so the original error propagates. Add onerror handler to the worker thread so pending promises reject instead of hanging if the worker crashes.
  * Fix FFI out-pointer semantics and skip stream-url in CI
    koffi requires koffi.out() annotation on pointer parameters to copy values back from C to JS after the call. Without it, GetScreenSize and GetMousePos always returned zeros because koffi treated int* as input-only. The stream-url E2E test requires preview URL infrastructure (custom domain with wildcard DNS) that CI workers.dev doesn't provide, so skip it with the same pattern used by other port-exposure tests.
  * Reset manager state on start failure
    DesktopManager.start() sets state to 'starting' but the catch block relied solely on stop() to reset it to 'inactive'. When stop() itself fails, state remains 'starting' permanently, blocking all subsequent start attempts. Explicitly set state to 'inactive' after cleanup.
  * Use pure-Go xgb path for GetScreenSize
    robotgo.GetScreenSize() delegates to C-based XGetMainDisplay() which holds an unsynchronized static Display pointer. In Go's c-shared build mode CGo dispatches from varying OS threads, causing the singleton to silently return zero dimensions. Switch to robotgo.GetDisplayBounds(0) which uses the github.com/kbinani/screenshot pure-Go xgb implementation, matching the existing workaround for the SaveCapture segfault.
  * Upgrade robotgo to v1.0.1 with uniform error handling
    Use dedicated v1.0.1 APIs (MouseDown/Up, KeyDown/Up, Type, MultiClick) instead of Toggle/KeyToggle/TypeStr. All Go FFI exports now return error strings via *C.char, and the koffi bindings use HeapStr with a checkError() helper for uniform error propagation. Rename TypeStr→TypeText and SaveCapture→Screenshot to match v1.0.1 naming. Click now takes a count parameter; single, double, and multi-click are handled in Go. The worker-side triple-click loop is removed since Go handles it natively via robotgo.MultiClick.
- 2.5 ETV · Enforce preview URL runtime activation (#708) · Naresh · 287ec04b · 2026-05-28
  * Enforce preview URL runtime activation
  * Raise E2E sandbox instance cap
    The default E2E sandbox app can hit its 50-instance cap when the three transport jobs run file-parallel Vitest suites. Raise the cap modestly to match observed CI demand while keeping capacity bounded.
  * Clean up preview URL E2E tests
    Make lifecycle synchronization wait for terminal container states and keep preview URL tests closer to public SDK flows. This avoids relying on transitional stop states or hand-edited preview hostnames.
  * Configure E2E warm pool in wrangler
    Declare warm pool sizing in the test worker config instead of mutating every container app after deploy. This keeps variant images from reserving unused warm capacity during stacked PR CI.
  * Reconcile tunnel lifecycle with runtime stops
  * Restore R2 egress handler registration
    ContainerProxy resolves outbound handlers through the containers registry populated by the static setter. Keep the test mock aligned with that lookup path so R2 egress mounts fail if the handler is not registered for runtime dispatch.
  * Remove preview containers dependency
    Replace the temporary containers package dependency with the latest published release and keep preview URL forwarding non-waking through an SDK-owned helper. The Sandbox DO remains responsible for preview auth and runtime activation decisions, while the helper handles TCP response forwarding and lifecycle settling.
- 2.5 ETV · Add process isolation and persistent sessions for all commands (#59) · Naresh · b6757f73 · 2025-08-15
  * Add process isolation for sandbox commands
    Implements PID namespace isolation to protect control plane processes (Jupyter, Bun) from sandboxed code. Commands executed via exec() now run in isolated namespaces.
    Key changes:
    - Sandboxed commands can no longer see or kill control plane processes
    - Platform secrets in /proc/1/environ are inaccessible
    - Ports 8888 (Jupyter) and 3000 (Bun) are protected from hijacking
    - Commands within sessions now maintain state (pwd, env vars)
    - Graceful fallback when CAP_SYS_ADMIN not available (dev environments)
    BREAKING CHANGE: Commands within the same session now share state. Previously each command was stateless. Use createSession() for isolated command execution.
  * Stop information exposure through stack trace
  * Implement secure streaming execution with ExecutionSession support
    - Fix streaming security hole by routing through SessionManager instead of direct spawn()
    - Add ExecutionSession.execStream() method for secure real-time command streaming
    - Maintain backward compatibility by bridging sessionId API to ExecutionSessions
    - Extend SessionManager with streaming capabilities using isolated control processes
  * Remove sessionId
  * Make file ops session-aware too
  * Remove duplicate code paths
  * Fix streaming and corresponding abort
  * Fix log fetch endpoint
  * Minor fixes
  * Rename back to sessionId
  * Fix pending name references
  * Move control script into separate file
  * Fix type errors
  * fix biome lint errors
  * Prevent shell command injection
  * Move code around
  * Reorganise code
  * Update changeset
- 2.2 ETV · Add WebSocket transport (#253) · deathbyknowledge · 4b4ab483 · 2025-12-18
  * add ws transport + e2e
  * remove unused
  * fix types
  * chore: remove ws-transport e2e test (will use existing e2e with WS header)
  * fix: WebSocket transport for DO environment + dual transport tests
    - WSTransport: Add fetch-based WebSocket connection for Workers/DO context
    - Uses containerFetch with upgrade headers instead of raw new WebSocket()
    - Required because DOs cannot use direct WebSocket() connections to containers
    - Transport: Pass stub and port to WSTransport for proper routing
    - CommandClient: Use doStreamFetch for streaming (supports both HTTP and WS)
    - comprehensive-workflow.test.ts: Run all tests with both HTTP and WebSocket transport
  * feat: add dual transport (HTTP + WebSocket) testing to e2e tests
    Updated test files to run with both HTTP and WebSocket transport modes:
    - comprehensive-workflow.test.ts
    - file-operations-workflow.test.ts
    - streaming-operations-workflow.test.ts
    - environment-workflow.test.ts
    - git-clone-workflow.test.ts
    - process-lifecycle-workflow.test.ts
    - process-readiness-workflow.test.ts
    - keepalive-workflow.test.ts
    - code-interpreter-workflow.test.ts
    - build-test-workflow.test.ts
    Each test suite now runs twice - once with HTTP transport (default) and once with WebSocket transport (X-Use-WebSocket header). This validates that the WebSocket transport works identically to HTTP for all SDK operations.
  * fix: use doStreamFetch for WebSocket streaming in ProcessClient and FileClient
    - ProcessClient.streamProcessLogs: use doStreamFetch instead of doFetch
    - FileClient.readFileStream: use doStreamFetch instead of doFetch
    This ensures proper streaming over WebSocket transport.
  * fix: complete WebSocket transport support for all streaming operations
    - base-client.ts: Add method parameter to doStreamFetch for GET/POST support
    - transport.ts: Update requestStream and httpRequestStream for GET/POST
    - process-client.ts: Use doStreamFetch with GET for process log streaming
    - file-client.ts: Use doStreamFetch for file streaming
    - interpreter-client.ts: Use doStreamFetch for code execution streaming
    - interpreter.ts: Use doStreamFetch for runCodeStream
    - process-lifecycle-workflow.test.ts: Accept 'already exposed' in port test
    All 105 e2e tests now pass with both HTTP and WebSocket transport!
  * Fix incorrect merge
  * Fix WebSocket transport issues
    Resolve SSE parsing data loss, connection race conditions, send failure handling, stream cleanup on close, module-level state, and type safety.
  * Add changeset for WebSocket transport
  * Integrate WebSocket handler into server module
    The WebSocket handler was in a separate index.ts that wasn't included in the build. This moves the handler integration into server.ts where the actual server is started, and removes the unused index.ts.
  * Refactor transport configuration and add CI matrix
    Replace useWebSocket boolean with transport string union type for future extensibility. Add SANDBOX_TRANSPORT env var support following the existing pattern of SANDBOX_LOG_LEVEL. Configure CI to run E2E tests with both HTTP and WebSocket transports in parallel via matrix.
  * Fix resource leaks and type safety in WebSocket transport
    Add client.disconnect() calls before client replacement and in destroy() to prevent WebSocket connection leaks. Add public streamCode() method to InterpreterClient to eliminate unsafe any cast. Add error boundaries in server WebSocket handlers and fix timeout cleanup in ws-transport.
  * Fix null check for stored transport in constructor
  * Improve WebSocket transport robustness and documentation
    Fix race condition in connection promise handling, add 503 retry logic for WebSocket mode to match HTTP behavior, improve error handling for stream operations, and add WSTransport unit tests. Extract client creation to helper method and organize exports.
  * Unify Transport abstraction and clean up API surface
    BaseHttpClient now always uses Transport for all requests, eliminating the previous split where HTTP mode bypassed Transport entirely. This ensures both HTTP and WebSocket modes share the same code paths for retry logic and error handling.
    Removed the legacy TransportResponse-based API (request, requestStream) in favor of the standard Fetch API (fetch, fetchStream). This reduces bundle size by ~2.4 kB and provides a cleaner, more familiar interface.
  * Refactor transport layer for cleaner separation of concerns
    Extract HTTP and WebSocket transports into dedicated classes with a shared base class for retry logic. This creates symmetric file structure and clear interfaces for both transport modes.
    SDK changes:
    - Create transport/ directory with ITransport interface
    - Add BaseTransport with shared 503 retry logic
    - Add HttpTransport and WebSocketTransport implementations
    - Update consumers to use ITransport interface
    Container changes:
    - Rename ws-handler to ws-adapter (reflects protocol adapter role)
    - Use 'container' log component instead of dedicated entry
  * Cleanup tests
  * Remove working documents
  * Simplify transport config to env var only
  * Use stub.fetch() for WebSocket transport connection
    stub.fetch() routes WebSocket upgrade requests through the parent Container class that supports the WebSocket protocol.
  * Remove redundant plugin config from workflow
    Plugins are configured via .claude/settings.json.
  * Revert "Remove redundant plugin config from workflow"
    This reverts commit 469489e5243c7af5458a7833dada6c861e4f1b9e.
  * Optimize CI: build Docker once, fix cleanup for transport variants
    pullrequest.yml:
    - Extract Docker build into separate job (build-docker)
    - Both HTTP and WebSocket E2E jobs download pre-built images
    - Uses docker/build-push-action with GHA layer caching per image variant
    cleanup.yml:
    - Clean up all transport variants (-http, -websocket) on PR close
    cleanup-stale.yml:
    - Update regex to match transport-suffixed worker names
    - Fix PR number extraction to handle suffix
  * Fix WebSocket transport error handling
    Wrap send() calls in try-catch to properly reject promises and error stream controllers when the WebSocket is disconnected. Also fail early with a clear error when request body JSON parsing fails.
  * Simplify CI Docker builds
    Build all images once and share via artifacts to both E2E jobs.
  * Optimize CI with buildx cache and CF registry
    Reduces CI runtime by using GHA buildx cache for consistent layer digests and pushing images to Cloudflare registry before deploy. Deploy then references pre-pushed images instead of building and pushing during the slow wrangler deploy step. Cache scopes are shared between e2e-tests and publish-release jobs in the release workflow, so publish-release benefits from cache warmed by e2e-tests.
  * Parallelize Docker image builds in CI
    Build base, python, and opencode images concurrently using bash background processes. Standalone builds after base is pushed since it references base from the registry. Pushes to Cloudflare registry are also parallelized where possible.
  Co-authored-by: Naresh <naresh@cloudflare.com>
- 2.2 ETV · feat: add file watching capabilities with inotify support (#324) · whoiskatrin · 2af3c283 · 2026-03-03
  * feat: add file watching capabilities with inotify support
  * Create empty-poets-serve.md
  * Potential fix for code scanning alert no. 42: Incomplete string escaping or encoding
    Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
  * fixes for claude review
  * update tests to verify regex format for default and custom excludes
  * Fix error handling and type safety in WatchService and FileWatch classes. Update tests to validate new event parsing logic and ensure proper handling of inotifywait output.
  * Refactor WatchService tests to validate combined regex patterns for default and custom excludes
  * Added timeouts for event propagation
  * Refactored and cleaned
  * Small ws transport related fixes
  * Timing changes help account for the additional buffering
  * Fix WebSocket blocking issue for SSE streaming responses
    Streaming responses (like file watch events) were blocking the WebSocket message handler because handleStreamingResponse was awaited. This prevented other messages from being processed while a stream was active.
    Run streaming response handlers in the background with error logging, allowing the message handler to return immediately and process subsequent messages.
  * Add debug logging and stream tracking for WebSocket streaming
    Add detailed logging to trace streaming response handling and track active streams to prevent potential garbage collection of Response objects.
  * Acquire stream reader synchronously before async execution
    The WebSocket message handler needs to capture the Response body reader before any await points. When getReader() was called inside the async handleStreamingResponse method, Bun's WebSocket handler would return before the reader was acquired, potentially invalidating the Response body stream.
    By getting the reader synchronously in handleRequest before the promise starts executing, we ensure the stream remains valid throughout the async streaming loop.
  * Wait for inotifywait watches to be established before signaling ready
    The watching SSE event was sent immediately when the stream started, before inotifywait finished setting up watches. This caused flaky tests because file operations could occur before the watch was truly ready. Now we read stderr and wait for the 'Watches established' message from inotifywait before sending the watching event to clients.
  * Add timeout to waitForWatchesEstablished to prevent hanging
    If inotifywait fails to output 'Watches established' within 10 seconds, the function will return and allow the stream to proceed. This prevents indefinite hangs if inotifywait behaves unexpectedly.
  * Wait for first message before returning WebSocket stream
    For WebSocket streaming, errors were deferred until stream consumption. This caused issues where watchStream() would return successfully even when the server returned an error response.
    Now requestStream() waits for the first message before returning:
    - If it's a stream chunk, return the stream (success case)
    - If it's an error response, throw immediately (error case)
    This makes WebSocket streaming behavior match HTTP streaming, where errors are thrown immediately rather than deferred.
  * Address code review issues in file watching
    Replace empty catch blocks with debug/warn logging throughout WatchService and FileWatch to make failures visible. Fix FileWatch.established() to reject on AbortSignal during establishment, preventing indefinite hangs. Strengthen the SSE event type guard to validate required fields per event type. Add WATCH_STOP_ERROR code, integrate watch cleanup into server shutdown, and rewrite changeset for end users.
  * Fix file watch E2E test timeout race condition
    The watchWithActions timeout started at stream creation but blocked for ~6.5s inside the event handler (pre-action delay + file ops + post-action delay), leaving insufficient time to read events in CI. Reset the timeout after actions complete so the full window is available for event collection.
  * Fix exclude test timeout in file watch E2E
    The combined wait time (~18s) approached the 30s Vitest timeout. With excludes filtering most events, only 2-4 arrive, so the high stopAfterEvents threshold was never reached and the test always fell through to the full 12s reader timeout.
  * Remove low-level watch API to simplify public interface
    Remove watchStream(), stopWatch(), and listWatches() methods from Sandbox class. The handle-based API via watch() is sufficient for all use cases and prevents resource management confusion. Keep the internal WatchService methods for container use but don't expose them through the SDK public API.
  * Harden file watch stream lifecycle
  * Stabilize explicit watch stop e2e test
  * Isolate file-watch e2e sessions per test
  * Stabilize file-watch error and stop e2e cases
  * Stabilize websocket e2e flake handling
    Treat watch stop as idempotent when a watcher is already gone and make OpenCode proxy health checks resilient to transient startup failures. Relax the foreground timing threshold to reduce transport-related CI jitter without masking blocking behavior.
  * Treat ESRCH as success when stopping already-gone watch process
    Handle the race where a watch process exits before stopWatch() is called. When process.kill() throws ESRCH (no such process), clean up the watch entry and return success instead of an error.
  * Isolate file watch e2e sandbox
  * Route watch E2E through SDK surface
  * Use SDK watch bridge without new public API
  * Remove unused watch stop endpoint and reduce stream logging verbosity
    Remove /api/watch/stop endpoint and WatchClient.stop() method as watch lifecycle is now managed through handle-based API. Clean up verbose debug logging in WebSocket stream handler, keeping only completion summary. Clear setup timeout in watch service to prevent leak.
  * Scope stream cancellation to owning WebSocket connection
    Track connectionId for each active stream and only allow cancellation from the connection that initiated it. Prevent cross-connection cancel messages and ensure onClose only cancels streams owned by the closing connection.
  * Address review: parseSSEStream abort, watch readiness, include/exclude validation
    - Fix parseSSEStream to register abort listener that calls reader.cancel(), unblocking idle streams instead of polling signal.aborted between reads
    - Make watch() block until the watching event is received so callers can immediately perform actions that depend on the watcher being active
    - Reject requests that specify both include and exclude with a clear validation error since inotifywait does not support both simultaneously
    - Add timeout-race tests for WebSocket transport stream requests
    - Simplify E2E watch helper: use parseSSEStream + AbortSignal.timeout, remove all hardcoded delays
    - Remove test worker workaround that silently dropped exclude when include was present
  * Simplify non-existent path error test
    watch() now throws on establishment failure, so the test worker returns a clear error response. The 3-path defensive logic is no longer necessary.
  * Simplify stop-watch test: remove manual watching detection
    watch() now blocks until established, so the manual loop scanning for the watching event is redundant. Read one chunk to confirm the stream is live, then cancel.
  * chore: retrigger CI
  * Update .changeset/empty-poets-serve.md
    Co-authored-by: Naresh <ghostwriternr@gmail.com>
  * FIxed small review comments
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
  Co-authored-by: opencode-agent[bot] <opencode-agent[bot]@users.noreply.github.com>
  Co-authored-by: Naresh <ghostwriternr@gmail.com>
- 1.9 ETV · Implement no credential R2 binding mount support (#691) · Archie Ferguson · 3ca24fc3 · 2026-05-20
  * Support credential-less R2 mounting
  * fix dynamic outbound intercept handlers
  * remove inline egress handler
  * fix various egress bugs
  * fix regex issues
  * address bonk comments
  * address PR comments
  * export containerproxy from e2e worker
  * address PR comments
  * change r2 upload path to fixedlengthstream
  * export from containers
  * update e2e worker wrangler
  * remove arrayBuffer from multipart upload
  * fix 0 len uploads + dual mount bug
  * set durable_object_offset_instances in wrangler
  * set durable_object_offset_instances in wrangler
  Co-authored-by: scuffi <aferguson@cloudflare.com>
- 1.8 ETV · feat: Add backup and restore API (#396) · whoiskatrin · 76284f02 · 2026-02-20
  * Add backup and restore API for directory snapshots
    Introduces createBackup() and restoreBackup() methods to the Sandbox class, enabling point-in-time snapshots of directories. Backups are stored as compressed squashfs archives in R2, with instant restore via FUSE overlayfs (copy-on-write).
    Key features:
    - Compression with mksquashfs for efficient storage
    - Instant restore without extracting archives
    - Copy-on-write semantics preserve original backup
    - TTL-based automatic cleanup
    - Includes time-machine example and comprehensive E2E tests
  * Fix backup restore and stale E2E test expectations
    The doRestoreBackup finally block deleted the .sqsh archive after every restore, but squashfuse needs it as backing storage for the mount lifetime. Changed to catch-only cleanup so the file persists on success. Also use lazy unmount (-uz) in the container to handle stale FUSE mounts from previous restores.
    Updated E2E tests to expect correct HTTP status codes from the centralized error handling system (404 for not found, 408 for timeout, 400 for bad request) instead of blanket 500s.
  * Unmount FUSE before re-downloading backup archive
    Restoring the same backup a second time overwrote the .sqsh file while squashfuse still held it open, corrupting the backing store. Tear down existing FUSE mounts in the DO before writing the new archive.
    Also revert two E2E status expectations to 500 since IS_DIRECTORY and NOT_DIRECTORY errors are indistinguishable from generic FileSystemError across the RPC boundary.
  * Ensure clean upper layer on repeated backup restores
    When rm -rf on the mount base fails due to a lingering squashfuse mount point, the overlayfs upper directory survives with stale writes. Explicitly remove the upper and work directories first (never mount points) so the new overlay starts with a clean writable layer.
  * Add diagnostic logging to backup COW test
  * Use unique mount base per restore to avoid stale upper layer
    Each overlayfs restore now creates a fresh mount base with a unique suffix so writes from a previous overlay session cannot leak into subsequent restores. Old mount bases are torn down best-effort.
  * Skip archive re-download when file already exists
    A lazily-unmounted squashfuse may still hold the .sqsh file open. Overwriting it corrupts the backing store for the new mount. Compare the existing file size with the R2 object size and skip the download when they match.
  * Add targeted diagnostic for backup COW test
  * Remove diagnostic logging from backup COW test
  * Retrigger CI with clean container state
  * Use id in restore backup result
  * Revert chunked-read plumbing from file API
    The offset/length parameters on readFile were added for chunked base64 backup reads. With presigned URLs handling all transfers, chunked reads are unnecessary.
  * Replace base64 backup transfer with presigned R2 URLs
    Presigned URLs are the sole transfer mechanism for backup archives. The container curls directly to/from R2, bypassing the DO. This achieves 40-155x faster transfers (~24 MB/s up, ~93 MB/s down).
    Remove all base64 transfer code (uploadBackupSingle, Chunked, downloadBackupSingle, Chunked, uint8ArrayToBase64, base64ToUint8Array), the 500MB size cap, and the 50MB chunk constant.
    R2 credentials (R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, CLOUDFLARE_ACCOUNT_ID, BACKUP_BUCKET_NAME) are now required for backup to function.
  * Add presigned URL config to test and example wrangler configs
    Add BACKUP_BUCKET_NAME var and R2 credential types for presigned URL backup transfers. Update time-machine example with required vars and secret instructions.
  * Update changeset to reflect presigned URL transfer approach
  * Harden presigned URL backup implementation
    Isolate backup shell ops in a dedicated session to prevent interference with user exec() calls. Encode URL path segments, use -sSf and --retry for curl, download to .tmp with atomic mv, and change TTL default to 3 days with no upper cap.
  * Use backup session for container API calls too
  * Add R2 credentials as wrangler secrets for backup E2E
  Co-authored-by: Naresh <naresh@cloudflare.com>
- 1.7ETVFix memory leaks in process management (#224) * Update default formatter * Fix memory leaks in process management Memory leaks occurred from three sources: event listeners accumulated when log streaming was cancelled, completed processes remained in memory indefinitely, and Durable Object state retained invalid references after container restarts. The ProcessStore now persists completed processes to disk while keeping active processes in memory. ReadableStream cancel handlers properly remove listeners when log streaming stops. The Sandbox onStop lifecycle hook clears container-specific state to prevent stale references. * Add changeset * Fix async cleanup and listener lifecycle The onStop lifecycle hook now properly awaits storage deletions to ensure cleanup completes before shutdown. Processes are deleted from memory before persisting to disk to prevent race conditions during concurrent access. Stream listeners are cleaned up on all close paths, not just cancellation. The list method now scans disk to include completed processes, maintaining backward compatibility. * Use structured logging for disk write failures * Fix data loss when persisting completed processes Write to disk before deleting from memory to prevent data loss if write fails. Always delete from memory regardless of write success to prevent memory leaks. * Fix list() directory check that prevented disk scanning Bun.file().exists() returns false for directories, causing list() to skip disk scanning and miss completed processes. Remove the directory existence check and rely on error handling instead. * Add comprehensive unit tests for ProcessStore Test coverage for memory/disk hybrid storage, terminal state transitions, disk persistence, write failure handling, and list() filtering. * Add E2E tests for completed process retrieval and listing Verify completed processes can be retrieved via get() with exit codes and timestamps, and that list() includes both running and completed processes. 
* Remove dot-prefix from process storage directory Change from /tmp/.sandbox-internal to /tmp/sandbox-internal for better debuggability. Makes it easier to discover and inspect process files during development and troubleshooting. * Fix race condition in process status updates Event handler called store.update() fire-and-forget, causing API queries to return stale data before disk writes completed. Changed SessionManager to accept async callbacks and await them, ensuring store updates complete before processing next event. This provides strong consistency while maintaining the memory leak fix that moves completed processes to disk. * Update lockfile * Clear activeMounts on container stop Applies memory leak prevention from this PR to the activeMounts feature that was merged from main. When a container stops or restarts, mount points become invalid but the Map still held stale references. Clearing it maintains consistency with portTokens and defaultSession. * Enable TypeScript checking for test files (#238) * Enable TypeScript checking for test files * Enable TypeScript checking for e2e tests Added tests/e2e/ to typecheck pipeline. Created test-worker/types.ts defining response types for test infrastructure endpoints. All tests now use proper SDK types from @repo/shared for SDK operations and test-worker types for test infrastructure. Removed non-null assertions with proper type guards. Found and fixed bugs where tests used wrong field names (contextId vs id, processId vs id) that were masked by inline types. * Fix incorrect assertions * Use public Process interface correctly * Revert enum usage * Enrich interpreter error * Fix TypeScript build output paths Build output was going to dist/src/ instead of dist/ due to rootDir: "." in tsconfig, breaking Docker container startup which expects dist/index.js. Split tsconfig into separate build and typecheck configs. Build emits src only to dist/, typecheck includes tests. 
* Fix type errors * Update lockfile * Fix code context e2e response checks * Prevent just-completed process appearing twice · Naresh · 71e86f42 · 2025-11-24
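The ordering this commit settles on (write to disk first, then always drop the in-memory entry) can be sketched as below. This is a simplified, synchronous illustration under stated assumptions: `HybridProcessStore`, `ProcessRecord`, and the injected `writeToDisk` callback are hypothetical names, and the real ProcessStore is asynchronous.

```typescript
// Hypothetical record shape; the SDK's actual process type is richer.
interface ProcessRecord {
  id: string;
  status: "running" | "completed";
  exitCode?: number;
}

type DiskWriter = (record: ProcessRecord) => void;

class HybridProcessStore {
  private active = new Map<string, ProcessRecord>();
  private writeToDisk: DiskWriter;

  constructor(writeToDisk: DiskWriter) {
    this.writeToDisk = writeToDisk;
  }

  add(record: ProcessRecord): void {
    this.active.set(record.id, record);
  }

  // Returns true if the completed record reached disk.
  complete(id: string, exitCode: number): boolean {
    const record = this.active.get(id);
    if (!record) return false;
    const finished: ProcessRecord = { ...record, status: "completed", exitCode };
    try {
      // Write first, so a failed write cannot lose a record that
      // has already been removed from memory.
      this.writeToDisk(finished);
      return true;
    } catch {
      // A real store would log the failure here.
      return false;
    } finally {
      // Always evict, so completed processes never accumulate in memory.
      this.active.delete(id);
    }
  }

  activeCount(): number {
    return this.active.size;
  }
}
```

The `finally` block is what separates the two guarantees: the write result decides whether the record is retrievable later, while eviction happens on every path.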
- 1.7 ETV · Add PTY terminal passthrough for browser clients (#310) * Add PTY types to shared package * Add PtyManager for container PTY lifecycle * Add PTY handler and route registration * Add PTY message handling to WebSocket adapter * Add PTY methods to transport interface * Add PtyClient for SDK PTY operations * Add pty namespace to Sandbox class * Add E2E tests for PTY workflow * Add changeset for PTY support * Skip PTY tests when PTY allocation fails * Fix pty manager tests * fix any types, logger * fix silent logging * fix pty tests for resizing * update claude review yml * revert review change * update http tests * more test updates * remove the plugin for review * Add error handling to PTY callbacks and terminal operations * Improve PTY error handling based on code review * add structured exit codes * change fire and forget strategy * add more e2e tests * more fixes and tests * fix ws * update error propagation * update resizing tests * add collab terminal example * Potential fix for code scanning alert no. 40: Insecure randomness Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Potential fix for code scanning alert no. 
41: Insecure randomness Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Update dependency in examples * Add PTY listeners cleanup * minor nits * update tests setup * Add logging for PTY listener registration errors and improve error handling * Enhance error handling and logging in WebSocketTransport and PtyHandler; add tests for PTY listener registration and cleanup behavior * Add connection-specific PTY listener cleanup on WebSocket close * Remove outdated comment regarding connection cleanup functions in WebSocketAdapter * Fix error handling in PTY management by updating kill method to return success status and error messages * implement signal handling for Ctrl+C, Ctrl+Z, and Ctrl+\ in the PTY manager * Potential fix for code scanning alert no. 43: Insecure randomness Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Changes based on review comments * Update dependencies and improve PTY handling in collaborative terminal example * Update PTY workflow tests to expect correct HTTP status codes for error responses * extractPtyId method to retrieve PTY IDs from responses, and update handleRegularResponse to return parsed body for further processing * Update PTY workflow tests to expect 'message' field in error responses instead of 'error' * Fixed handlers for tests * Use PR-specific Docker build cache scope to avoid cross-PR cache pollution * Add debug logging to router for route registration and matching * Add INFO-level route logging to debug container caching issues * Add retry logic for WebSocket server readiness in e2e tests The WebSocket connect tests were flaky because they didn't wait for the echo server to be ready after /api/init. Added a helper function that retries WebSocket connection with backoff before running tests. 
* Remove debug logging added during PTY route investigation The 404 issues were caused by stale container instances, not route registration problems. Reverting the debug logging changes: - Remove INFO-level route logging from router - Remove logRegisteredRoutes() method - Revert PR-specific Docker cache scope (not needed) * Fix sync-docs workflow to handle PR bodies with special characters Use quoted heredoc and printf to safely handle PR description content that may contain backticks, code blocks, or other shell-sensitive characters. Pass PR body via environment variable to prevent shell interpretation during prompt construction. * Fix sync-docs workflow shell escaping for opencode run Use environment variable to pass prompt to opencode run, avoiding shell interpretation of special characters like parentheses, backticks, and dollar signs that appear in PR descriptions with code examples. The prompt is stored in OPENCODE_PROMPT env var which GitHub Actions sets safely, then referenced with double quotes in the shell command. * Fix lint errors and align env type signatures The recent env var changes in 7da85c0 introduced Record<string, string | undefined> but missed updating getInitialEnv return type and getSessionInfo. Also aligns vite-plugin versions across examples. * send heartbeat events to keep container alive * Add PTY terminal passthrough for browser clients Enables browser-based terminal UIs to connect to sandbox shells via WebSocket. The terminal() method proxies connections to the container's PTY endpoint with output buffering for replay on reconnect. * Add tests and infrastructure for PTY terminal (#375) * Add tests and infrastructure for PTY terminal Unit tests for ring buffer, PTY spawning, and WebSocket handler. E2E tests for PTY workflow and browser terminal addon integration. Updates CI workflows and documentation for new test patterns. 
* Add collaborative terminal example and refine xterm addon (#376) * Refactor SandboxAddon to use connect(target) * Add collaborative terminal example Demonstrates the SandboxAddon connect() API with real-time room switching, presence tracking, and session isolation across multiple terminal rooms. Co-authored-by: Naresh <naresh@cloudflare.com> * Clear bash startup warning in container PTY Bash emits 'cannot set terminal process group' warnings in containers where the shell isn't a session leader. This clutters the first lines of terminal output for users. * Add build/** to turbo output cache React Router outputs to build/ instead of dist/, causing turbo to warn about missing outputs for the collaborative-terminal example. * Update collaborative-terminal README for new API The previous README documented the old PTY API (sandbox.pty.create() with JSON WebSocket protocol) that no longer exists. Updated to reflect the current implementation using session.terminal(request) with direct WebSocket passthrough and SandboxAddon for terminal integration. --------- Co-authored-by: katereznykova <kreznykova@cloudflare.com> --------- Co-authored-by: katereznykova <kreznykova@cloudflare.com> --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: Naresh <naresh@cloudflare.com> Co-authored-by: Steve James <sjames@cloudflare.com> Co-authored-by: Naresh <ghostwriternr@gmail.com> · whoiskatrin · 3c035872 · 2026-02-06
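The "output buffering for replay on reconnect" described in this entry is commonly built on a bounded ring buffer. The sketch below shows one way that could look; `OutputRingBuffer` is a hypothetical name, and it counts string length rather than raw bytes, which is a simplification compared with real PTY output.

```typescript
// Hypothetical bounded buffer: keeps only the most recent `capacity`
// characters of terminal output so a reconnecting client can replay them.
class OutputRingBuffer {
  private chunks: string[] = [];
  private size = 0;
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(chunk: string): void {
    this.chunks.push(chunk);
    this.size += chunk.length;
    // Trim from the oldest end until the buffer fits again.
    while (this.size > this.capacity && this.chunks.length > 0) {
      const overflow = this.size - this.capacity;
      const head = this.chunks[0];
      if (head.length <= overflow) {
        this.chunks.shift();
        this.size -= head.length;
      } else {
        this.chunks[0] = head.slice(overflow);
        this.size -= overflow;
      }
    }
  }

  // Everything a newly connected client should see first.
  snapshot(): string {
    return this.chunks.join("");
  }
}

const replay = new OutputRingBuffer(5);
replay.push("abc");
replay.push("defg");
console.log(replay.snapshot()); // prints "cdefg"
```

On reconnect, a server would send `snapshot()` before resuming the live stream, so the terminal shows recent context instead of a blank screen.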
- 1.7 ETV · Implement egress interception for S3 mounts (#727) * Implement egress interception for S3 mounts * Implement missing egress handler * Fix proxy handler errors * patch zero len bugs * patch gcs writes * address PR comments * fix password file issue * fix stale proxy issue --------- Co-authored-by: scuffi <aferguson@cloudflare.com> · Archie Ferguson · 2acbd243 · 2026-06-05
- 1.6 ETV · Add performance test suite (#298) * Add performance test suite * Improve CI workflow maintainability - Use env variables for worker names and account subdomain across all workflows (pullrequest, release, performance) instead of hardcoding - Include typecheck:perf in check/fix npm scripts - Add debug logging for sandbox cleanup failures - Clear timeout when promises finish to prevent dangling timers - Use PASS_THRESHOLD constant for consistent threshold checks * Optimize perf workflow and improve test robustness Create perf-specific wrangler config with only the default Sandbox container, reducing CI build time by skipping python/opencode/standalone images that perf tests don't use. Add explicit try-finally cleanup in concurrent creation test for crash safety. Fix MetricsCollector API documentation to reflect actual method signatures. * Add typed metadata interfaces for metrics collection Replace Record<string, unknown> with MeasurementMetadata and RunMetadata interfaces. Provides type safety for known fields (success, error, sandboxId, etc.) while allowing extension via index signature. · Naresh · 785fada7 · 2025-12-12
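The "known fields plus index signature" pattern from the last bullet can be illustrated as follows. The interface name and the three field names come from the commit message; the field types and the `summarizeFailures` helper are assumptions made for this sketch.

```typescript
// Known fields are typed; the index signature still allows extra keys.
// Field types here are assumed, not copied from the repository.
interface MeasurementMetadata {
  success?: boolean;
  error?: string;
  sandboxId?: string;
  [key: string]: unknown;
}

// Hypothetical consumer: typed access to known fields needs no casts.
function summarizeFailures(items: MeasurementMetadata[]): string[] {
  return items
    .filter((m) => m.success === false)
    .map((m) => `${m.sandboxId ?? "unknown"}: ${m.error ?? "no error message"}`);
}

const failures = summarizeFailures([
  { success: true, sandboxId: "a" },
  { success: false, sandboxId: "b", error: "timeout", region: "weur" },
]);
console.log(failures);
```

Compared with `Record<string, unknown>`, a misspelled known field such as `sucess` would still compile because of the index signature, but reads of `success`, `error`, and `sandboxId` are now checked against their declared types.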
- 1.6 ETV · Cleanup debug logs and implement unified logger (#109) * Implement unified logging system - Create shared logger package with structured logging - Integrate logger into sandbox container and DO layer - Remove duplicate and unnecessary console.log statements - Add trace context support for request tracking - Update logging configuration and middleware * Delete dead code · Naresh · 9c484250 · 2025-10-17
- 1.5 ETV · Implement code interpreter API (#49) * feat: implement code interpreter with Jupyter kernel integration - Add Jupyter (Python) and TSLab (JavaScript/TypeScript) kernel support - Implement streaming code execution with rich MIME type outputs - Support charts, tables, images, and structured data rendering - Add CodeInterpreter class with context management - Create notebook-style UI in React example app * Fix example * Use better kernel for JS * Remove comment * Use crypto to generate session ID * Add changeset * Remove unused types * Fix lint errors * Fix build failures · Naresh · d81d2a56 · 2025-08-04
- 1.5 ETV · Fix null pointer in shellExitedPromise during concurrent session destroy (#452) * Fix null pointer in shellExitedPromise during concurrent session destroy Session.destroy() sets shellExitedPromise to null, but execStream's polling loop accesses it via a non-null assertion after resuming from sleep. When deleteSession() runs concurrently without holding the per-session mutex, this causes a TypeError crash. Guard the nullable field, capture mutable instance fields into locals before any await points, acquire the session lock in deleteSession, and add a SESSION_DESTROYED error code (HTTP 410) so callers get a typed error instead of an unhandled crash. * Add changeset for shellExitedPromise fix * Add tests for destroy-during-streaming race condition Test the SESSION_DESTROYED error path when deleteSession races with background streaming and foreground exec. Also pass sessionDir as a parameter to buildFIFOScript rather than having it reach through this.sessionDir, and document why shellExitedPromise is intentionally read fresh in the polling loop rather than captured into a local. * Refine session-destroyed detection and race tests * Preserve shell-termination error propagation Keep shell-termination failures from being rewritten as session teardown errors so exit-command behavior is reported accurately. * Handle explicit exit commands as shell termination Treat direct exit commands as shell termination errors so they keep the expected shell-terminated message and exit code details. * Trigger CI rerun * Add typed errors for shell exit classification * Fix remaining race conditions in session destruction shellExitedPromise could hang forever if destroy() ran while a streaming command was awaiting it, because the callback returned without settling the promise. Session destruction also bypassed per-session locks in the graceful shutdown path. 
Error classification now uses typed errors instead of regex on message strings, and the labelers-done loop short-circuits on destroy. * Update changeset to reflect actual user-facing behavior --------- Co-authored-by: Naresh <naresh@cloudflare.com> · whoiskatrin · 5cce0343 · 2026-03-05
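The move from regex matching on messages to typed errors can be sketched as below. The `SESSION_DESTROYED` code and its HTTP 410 mapping come from the commit message; the class names, the `SHELL_TERMINATED` code, and the `toHttpStatus` helper are hypothetical and exist only for illustration.

```typescript
// Typed error for a session torn down while a command was in flight.
class SessionDestroyedError extends Error {
  readonly code = "SESSION_DESTROYED";
  readonly httpStatus = 410;

  constructor(sessionId: string) {
    super(`Session ${sessionId} was destroyed`);
    this.name = "SessionDestroyedError";
  }
}

// Hypothetical sibling error: the shell itself exited (e.g. `exit 3`).
class ShellTerminatedError extends Error {
  readonly code = "SHELL_TERMINATED";
  readonly exitCode: number | null;

  constructor(exitCode: number | null) {
    super(`Shell terminated with exit code ${exitCode}`);
    this.name = "ShellTerminatedError";
    this.exitCode = exitCode;
  }
}

// Classify on the stable `code` field, never on message text.
function toHttpStatus(err: unknown): number {
  if (err instanceof Error) {
    const code = (err as { code?: unknown }).code;
    if (code === "SESSION_DESTROYED") return 410;
  }
  return 500;
}
```

Because classification depends on `code`, rewording an error message cannot silently change how callers handle it, which is the failure mode regex matching is prone to.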
- 1.3 ETV · Preview URLs survive container restarts (#600) * Preview URLs survive container restarts Preview URLs were invalidated on every container stop. Two problems: onStop() unconditionally deleted portTokens from DO storage, and even after the container came back up the runtime had no memory of which ports had been exposed, so validatePortToken() would 404 every request. Persist portTokens across onStop() (cleared only on explicit unexposePort() or full destroy()), and re-expose saved ports on container start via a new restoreExposedPorts() called from onStart() under blockConcurrencyWhile so concurrent requests queue behind restore. Storage shape changes from { portStr: token } to { portStr: { token, name? } } so the friendly name passed to exposePort() also survives restart. readPortTokens() migrates the legacy string format on read, so existing stored state is honored without an explicit migration step. All other consumers (exposePort, unexposePort, getExposedPorts, validatePortToken, desktop fallback path) go through the same helper. Added seven focused unit tests covering restore with names, legacy migration on restore, skip-if-exposed, per-port failure isolation, empty-storage no-op, onStop() not deleting portTokens, and exposePort() persisting the name. * Add E2E test and exposePort() JSDoc for restart survival Documents the new behavior on the SDK's authoritative reference point (exposePort JSDoc) instead of bloating the README, keeping the README as the minimal quick-start surface the project prefers. Adds tests/e2e/preview-url-restart-workflow.test.ts that exposes a port with a custom token, stops the container via a new test-only /api/container/stop endpoint, restarts the user process in a fresh container, and asserts the preview URL still responds 200 without re-exposing. Covers the end-to-end story (storage persistence + onStart restore + blockConcurrencyWhile race guard) that the unit tests exercise in isolation. 
* destroy() clears portTokens from storage onStop() was changed in this PR to preserve portTokens so preview URLs survive transient restarts. The old cleanup relied on destroy() flowing through onStop() to clear tokens; with onStop() preserving them, destroy() must delete portTokens itself or a reused DO id would let restoreExposedPorts() bring back tokens and ports from a previously destroyed sandbox. Delete portTokens explicitly in destroy() before super.destroy(). Add a regression test that stubs Container.destroy and asserts the storage key is deleted on teardown. * Fix changeset scope and trim release-note copy Beta SDK bugfix — `patch`, not `minor`. Drop the implementation details (blockConcurrencyWhile, validatePortToken) from the body; they belong in code comments, not release notes. * Simplify onStart restore and batch the exposed-port check The base @cloudflare/containers class already wraps onStart() in blockConcurrencyWhile, so nesting another one is redundant. Make onStart async and await restoreExposedPorts() directly. Fetch getExposedPorts once per restore instead of calling isPortExposed() per port. Drops cold-start container round trips from N+1 to 1 and removes a duplicate ensureDefaultSession() on the first iteration. * Test onStart error handling and snapshot restore Lock in two behaviors from the previous commit: - onStart swallows restoreExposedPorts rejections; an unhandled error inside the base class's blockConcurrencyWhile would reset the DO. - getExposedPorts is called once per restore; a rejection falls back to attempting exposePort for every saved port. * Lock in destroy() ordering against stale-read race super.destroy() is not serialized, so other DO RPCs can run during the await. Deleting portTokens first closes the window where a concurrent validatePortToken() would read stale tokens or a concurrent start path would rehydrate them. See the expanded comment at the call site for the full rationale. 
Also switch the existing destroy() test from direct prototype mutation to vi.spyOn — the old form leaked between tests. * Trim redundant port-restoration tests and docblock Remove the empty-storage no-op test — trivial cover for a one-line early return that other tests exercise implicitly. Rewrite the E2E docblock to describe what the test asserts (preview URL works after a full restart) without referencing internal mechanisms (blockConcurrencyWhile, queueing behind restore) the test does not actually exercise. --------- Co-authored-by: Naresh <naresh@cloudflare.com> · whoiskatrin · 63e6a898 · 2026-04-21
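The migrate-on-read approach described for the storage shape change (`{ portStr: token }` to `{ portStr: { token, name? } }`) can be sketched as a pure normalization step. `normalizePortTokens` and the type names are hypothetical; the commit's own helper is `readPortTokens()`, whose actual signature is not shown in this report.

```typescript
// New storage shape for one exposed port.
interface PortTokenEntry {
  token: string;
  name?: string;
}

// Stored values may still be legacy plain-string tokens.
type StoredPortTokens = Record<string, string | PortTokenEntry>;

// Hypothetical normalizer: upgrades legacy entries while reading, so no
// separate migration step is needed and new-format entries pass through.
function normalizePortTokens(
  stored: StoredPortTokens | undefined,
): Record<string, PortTokenEntry> {
  const result: Record<string, PortTokenEntry> = {};
  for (const [port, value] of Object.entries(stored ?? {})) {
    result[port] = typeof value === "string" ? { token: value } : value;
  }
  return result;
}

const tokens = normalizePortTokens({
  "8080": "legacy-token",
  "3000": { token: "new-token", name: "web" },
});
console.log(tokens["8080"].token, tokens["3000"].name);
```

Routing every consumer through one normalizing read keeps the legacy branch in a single place instead of scattering `typeof` checks across the codebase.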
- 1.3 ETV · Improve container startup resiliency (#223) * Retry container startup failures in SDK SDK previously only retried 503 errors (container provisioning delays) but not 500 errors from container startup timeouts. This caused immediate failures for production users during cold starts. Now retries both 503 and 500 errors when they match known transient container error patterns (port not found, not listening, network lost, etc). Uses fail-safe detection that only retries known-good patterns, preventing retry storms on user application errors. Increases retry budget from 60s to 120s and uses longer exponential backoff (3s, 6s, 12s, 24s, 30s) to align with platform reality that containers can take several minutes to provision. * Increase container startup timeouts Timeouts increased to 30s instance + 90s ports (was 8s + 20s). Override containerFetch to pass production-friendly defaults and provide better error messages for preview URLs. * Add user-configurable container timeouts Users can now configure timeouts via getSandbox options or env vars. Supports instanceGetTimeoutMS, portReadyTimeoutMS, and waitIntervalMS. Configuration precedence: options > env vars > SDK defaults. * Fix configuration system bugs - Use configured timeouts instead of hardcoded defaults - Add parseInt safety for 0ms values - Add env var validation with min/max bounds * Add comprehensive unit tests for retry logic * Remove fetchWithStartup helper (SDK handles retries) * Extract environment access utility for type safety Create shared getEnvString utility to safely extract string values from environment objects with proper type narrowing. * Add input validation to setContainerTimeouts Validate timeout values to prevent invalid configurations (NaN, Infinity, negative numbers, out of range). Add validation helper method and tests to ensure the public RPC method rejects malformed input. Also fix unit test mock to include getState() method from Container base class. 
* Simplify tests * Update bucket-mounting test to use new fetch pattern * Add bidirectional R2 verification to bucket mounting test Add R2 bucket binding to test worker with endpoints for put, get, list, and delete operations. Update test to verify bidirectional sync between R2 and mounted filesystem. Remove vi.waitFor wrapper since BaseHttpClient now handles container startup retries. · Naresh · b1a86c89 · 2025-11-17
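The retry behavior in this entry (capped exponential backoff of 3s, 6s, 12s, 24s, 30s; retries only for known transient patterns; options > env vars > defaults) can be sketched as three small functions. All names and the exact pattern strings are assumptions for illustration, and the min/max bounds validation mentioned in the commit is left out.

```typescript
const BASE_DELAY_MS = 3000;
const MAX_DELAY_MS = 30000;

// Capped exponential backoff: 3s, 6s, 12s, 24s, then 30s thereafter.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Assumed pattern strings, loosely based on the commit's examples.
const TRANSIENT_PATTERNS = ["port not found", "not listening", "network lost"];

// Fail-safe detection: retry only 500/503 responses whose message
// matches a known transient pattern; anything else fails immediately.
function isTransientStartupError(status: number, message: string): boolean {
  if (status !== 500 && status !== 503) return false;
  const lower = message.toLowerCase();
  return TRANSIENT_PATTERNS.some((pattern) => lower.includes(pattern));
}

// Precedence: explicit option > env var > default. Checking for
// `undefined` (not truthiness) keeps a configured 0 from being dropped.
function resolveTimeoutMs(
  option: number | undefined,
  envValue: string | undefined,
  fallback: number,
): number {
  if (option !== undefined) return option;
  if (envValue !== undefined) {
    const parsed = Number.parseInt(envValue, 10);
    if (Number.isFinite(parsed) && parsed >= 0) return parsed;
  }
  return fallback;
}

console.log([0, 1, 2, 3, 4].map(backoffDelayMs));
// prints [ 3000, 6000, 12000, 24000, 30000 ]
```

Allow-listing transient patterns, rather than retrying every 5xx, is what prevents retry storms when the failure comes from the user's own application.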
- 1.2 ETV · Add disconnected file change checks (#493) * Add persistent file watch state Keep file watches usable for hibernating Durable Objects by separating live SSE delivery from retained watch state in the container. Add owner-scoped acknowledgement and idle expiry so background consumers can reconcile safely without sharing a global dirty bit. * Forward persistent watch owner IDs Pass ownerId through Sandbox.ensureWatch so persistent watches keep their ownership metadata and the reconnect workflow can validate the same consumer across ack and stop operations. * Refine persistent watch leases Replace the initial ownership-flavoured watch API with a cleaner checkpoint and lease model for background consumers. Use `changed`, `checkpointWatch()`, and returned lease tokens for the public flow, while `resumeToken` keeps `ensureWatch()` retryable without exposing another consumer's lease. * Polish persistent watch compatibility Clarify stopWatch token validation, remove redundant key normalization work, and normalize legacy watch responses so clients still see `changed` while older paths return `dirty`. * Refocus file watches on change checks Background consumers only need to know whether a path changed while disconnected. Replace the lease-based persistent watch API with checkChanges() so callers store one version token and choose whether to skip work, sync incrementally, or rescan. * Use canonical log events in WatchService --------- Co-authored-by: Naresh <naresh@cloudflare.com> · whoiskatrin · fdd3efa4 · 2026-04-01
- 1.1 ETV · Replace Jupyter with lightweight interpreter system (#76) * Replace Jupyter with lightweight interpreter system Jupyter's 7-second startup time was unacceptable for LLM use cases where users expect immediate code execution. The heavy dependency chain and complex kernel management created unnecessary overhead. This replaces the entire Jupyter infrastructure with a direct process pool architecture. Each language (Python, JavaScript, TypeScript) now runs in dedicated executor processes that communicate via JSON over stdin/stdout. The new system eliminates Jupyter startup entirely. Cold start times are now ~175ms for JavaScript, ~215ms for TypeScript, and ~1800ms for Python. Process pools pre-warm executors at container startup, reducing process acquisition to 2-6ms for immediate availability. The Python executor uses IPython for rich output capture, JavaScript uses Node.js VM for persistent context, and TypeScript uses esbuild for fast transpilation. Build-time compilation optimizes startup performance. All existing APIs remain unchanged - clients continue to work without modification while benefiting from the architectural improvements under the hood. * Fix linting and TypeScript issues - Use node: protocol for Node.js built-in imports - Replace string concatenation with template literals - Organize imports consistently - Remove unused private class members - Add missing output types to RichOutput interface - Fix react-markdown component types (remove non-existent inline property) - Add @types/react-katex for proper LaTeX component typing * Add backward compatibility exports for renamed error types Exports JupyterNotReadyError and isJupyterNotReadyError as deprecated aliases to maintain API compatibility while encouraging migration to InterpreterNotReadyError and isInterpreterNotReadyError. 
* Add Jupyter notebook integration guide for SDK users Shows how to extend sandbox containers with Jupyter server for users who need full notebook interface and traditional .ipynb file support alongside the built-in lightweight code interpreters. * Create changeset * Fix linting issues while preserving deprecation warnings Uses biome-ignore blocks to suppress organizeImports rule around deprecated exports, ensuring JSDoc deprecation warnings work properly in IDEs while maintaining code style compliance. · Naresh · ef9e320d · 2025-09-18
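The executor processes in this entry "communicate via JSON over stdin/stdout". The commit message does not state the framing, so the sketch below assumes one JSON document per line (newline-delimited JSON) and shows how a reader could reassemble messages that arrive split across chunks. `JsonLineDecoder` is a hypothetical name.

```typescript
// Assumed framing: one JSON message per "\n"-terminated line.
// Stdout data arrives in arbitrary chunks, so a partial trailing line
// is kept until the rest of it shows up.
class JsonLineDecoder {
  private pending = "";

  // Feed a raw chunk; returns every complete message it finished.
  feed(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" or an incomplete line.
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }
}

const decoder = new JsonLineDecoder();
console.log(decoder.feed('{"id":1,"ok"').length); // prints 0
console.log(decoder.feed(':true}\n{"id":2}\n').length); // prints 2
```

A parent process would feed each stdout `data` event into `feed()` and dispatch the returned messages by an identifier such as the `id` field used in this example.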
- 1.1 ETV · Handle intermittent interpreter failures and decouple jupyter startup (#51) * Make jupyter startup non-blocking * Handle cascading failures * Add changeset · Naresh · 4aceb321 · 2025-08-05