Recovery: e2e + unit coverage, and fix runFiber recovery starvation/backoff (#1729)
* test(ai-chat,think): fix racy fiber-cleanup checks and add continue-path e2e
The recovery e2e tests asserted `hasFiberRows() === false` the instant
recovery was detected, racing the continuation/retry turn that recovery
legitimately re-runs in a fresh fiber. Poll until the fiber rows settle
instead.
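
A minimal sketch of the polling approach described above, not the helper the tests actually use. `pollUntil`, its timeout, and its interval are hypothetical; only `hasFiberRows()` is named in this change, and treating it as async is an assumption.

```typescript
// Hypothetical helper: re-check an async condition until it holds or a
// deadline passes, instead of asserting it once at a racy instant.
async function pollUntil(
  check: () => Promise<boolean>,
  timeoutMs = 10_000, // assumed default, not taken from the test suite
  intervalMs = 100 // assumed default, not taken from the test suite
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return true; // condition settled
    if (Date.now() >= deadline) return false; // gave up without settling
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Under these assumptions a test would wait with something like `await pollUntil(async () => !(await hasFiberRows()))` and assert on the returned boolean, so the continuation turn has time to finish in its fresh fiber.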
Also fixes the ai-chat e2e worker, which emitted the legacy `0:{json}`
stream framing that AIChatAgent never parses (it reads `data:` SSE
frames), so no chunk was ever persisted and recovery only ever saw an
empty partial. Emit proper `data:` frames and stream enough chunks to
cross the ResumableStream flush threshold, enabling a new continue-path
test (non-empty partial -> resume the same assistant message).
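
To illustrate why the legacy framing produced no persisted chunks, here is a hedged sketch of the two framings and a reader that only accepts `data:` lines. All three function names are hypothetical, the trailing newline on the legacy frame is an assumption, and the reader is a simplification rather than AIChatAgent's actual parser.

```typescript
// Frame one JSON-serializable chunk as a Server-Sent Events `data:` frame.
function toSseFrame(chunk: unknown): string {
  return `data: ${JSON.stringify(chunk)}\n\n`;
}

// The legacy `0:{json}` framing, shown only for contrast.
function toLegacyFrame(chunk: unknown): string {
  return `0:${JSON.stringify(chunk)}\n`;
}

// Simplified reader: collects the JSON payload of every `data:` line and
// ignores everything else, which is how legacy frames end up dropped.
function parseSseData(body: string): unknown[] {
  const chunks: unknown[] = [];
  for (const line of body.split("\n")) {
    if (!line.startsWith("data:")) continue; // `0:` lines fall through here
    const payload = line.slice("data:".length).trim();
    if (payload.length > 0) chunks.push(JSON.parse(payload));
  }
  return chunks;
}
```

With this reader, a body built from `toLegacyFrame` yields zero chunks while the same payload sent through `toSseFrame` yields one, which matches the empty-partial symptom described above.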
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(ai-chat): e2e coverage for chat recovery budget exhaustion
Adds a deterministic exhaustion harness: agents whose turn hangs and
produces no recovery progress, so repeated SIGKILLs drive the recovery
budget without racing real streamed content. Covers onExhausted firing
with reason no_progress_timeout, recovery_aborted, and
work_budget_exceeded, plus the persisted terminal banner (#1645).
Extracts the shared wrangler/WebSocket e2e plumbing into harness.ts.
max_attempts (alarm-debounce forces >30s spacing) and stable_timeout
(not feasibly deterministic in-process) are left to unit coverage.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(ai-chat): e2e coverage for continue:false / persist:false recovery outcomes
Interrupts a turn after a non-empty partial has flushed, then asserts the
two onChatRecovery branches that suppress the default behavior:
- { continue: false } persists the partial as a durable assistant
message but does not re-run the turn (onChatMessage invoked once).
- { persist: false, continue: false } drops a plain-text partial (no
settled tool results) and does not re-run.
Adds an onChatMessage invocation counter + assistant-text accessor to
the test agent to distinguish "persisted partial" from a continuation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(think): e2e for context-overflow compaction recovery
Add ThinkContextOverflowE2EAgent plus an in-process (no process kill)
e2e covering the opt-in contextOverflow recovery paths: reactive
compact-and-retry that recovers a turn, reactive budget exhaustion that
surfaces a terminal context_overflow error, and the proactive guard that
compacts pre-step when reported usage crosses the headroom budget.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(ai-chat): e2e coverage for stream-buffer cleanup alarm (#1706) and recovering-status broadcast (#1620)
Add two deterministic e2e tests in @cloudflare/ai-chat:
- #1706 stream-buffer cleanup alarm: new ChatBufferCleanupAgent exposes
@callable inspectors (buffer/chunk row counts, _cleanupStreamBuffers
schedule count, forced future sweep, hasReclaimableStreams). Asserts a
completed turn arms exactly one cleanup alarm, a second turn does not
stack a duplicate, and a forced future-now sweep reclaims all buffers so
a fully-swept DO reports no reclaimable streams.
- #1620 recovering-status broadcast: drives a SIGKILL/restart recovery of a
slow-stream turn and asserts the durable cf:chat:recovering flag
transitions active -> cleared (via a new getRecoveringFlag @callable), and that
a live WS frame collector observes the cf_agent_chat_recovering clear
broadcast. The durable flag is the deterministic source of truth because
the live frame is not replayed on connect.
Adds a createFrameCollector helper to harness.ts and registers
ChatBufferCleanupAgent under a new v5 migration tag.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(think): e2e for durable-submission recovery on start
Add ThinkSubmissionRecoveryE2EAgent plus an e2e covering the three
_recoverSubmissionsOnStart transitions: messages-not-applied re-enqueues
as pending, applied-but-unrecoverable surfaces as error, and a recoverable
in-flight submission (real mid-stream SIGKILL) is left running and driven
to completion by the scheduled continuation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(agents): arm follow-up alarm for pending runFiber recovery
`_scheduleNextAlarm()` only rescheduled for active keepAlive leases, due
schedules, and facet runs — never for orphaned `cf_agents_runs` rows or
interrupted/pending managed ledger fibers still awaiting recovery. Because
orphaned fibers hold no keepAlive ref, a scan that yielded on
`fiberRecoveryScanDeadlineMs` (or a pass that retained a repeatedly-throwing
unmanaged recovery hook) never got another alarm, so the remaining fibers
starved. Add `_hasPendingFiberRecovery()` and arm a follow-up alarm whenever
recovery work is outstanding, so multi-pass recovery resumes and eventually
drains every fiber (and ages out poison rows via `fiberRecoveryMaxAgeMs`).
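
The rescheduling decision can be sketched as a pure function. This is an illustration of the logic described above, not the code in `_scheduleNextAlarm()`: the `AlarmInputs` shape, its field names, and both function names are hypothetical.

```typescript
// Hypothetical snapshot of the state the alarm scheduler looks at.
interface AlarmInputs {
  activeKeepAliveLeases: number;
  dueSchedules: number;
  facetRuns: number;
  orphanedRunRows: number; // orphaned rows in cf_agents_runs
  pendingManagedFibers: number; // interrupted/pending managed ledger fibers
}

// Recovery work is outstanding if any fiber is still waiting to be recovered.
function hasPendingFiberRecovery(s: AlarmInputs): boolean {
  return s.orphanedRunRows > 0 || s.pendingManagedFibers > 0;
}

// Before the fix only the first three conditions armed an alarm, so a
// Durable Object holding nothing but orphaned fibers was never woken again.
function shouldArmAlarm(s: AlarmInputs): boolean {
  const previouslyCovered =
    s.activeKeepAliveLeases > 0 || s.dueSchedules > 0 || s.facetRuns > 0;
  return previouslyCovered || hasPendingFiberRecovery(s);
}
```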
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(agents): e2e coverage for poison-row aging, scan-deadline yield, and concurrent fiber recovery
Add three runFiber recovery e2e tests (real `wrangler dev` + SIGKILL/restart
against `--persist-to`):
- poison-row aging: an unmanaged fiber whose `onFiberRecovered` always throws
is retained for retry across alarm passes, then dropped with a
`max_age_exceeded` skip once it exceeds `fiberRecoveryMaxAgeMs`.
- scan-deadline yield: a tiny `fiberRecoveryScanDeadlineMs` forces a single
alarm pass to yield (`scan_deadline_exceeded`) partway through 20 orphaned
fibers; subsequent passes drain the rest with no starvation.
- concurrent fibers: N concurrent fibers (mixed managed + unmanaged) are all
recovered after a kill, closing the gap left by prior tests, which only
recovered a single fiber.
New DO test agents record recovery signals (hook invocations + skip reasons)
into durable SQL so assertions survive DO eviction between polls. Shared
spawn/kill/RPC harness lives in `recovery-helpers.ts`.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(think): e2e for messenger reply-fiber recovery
Add ThinkMessengerRecoveryE2EAgent plus an e2e covering MESSENGER_REPLY_FIBER_NAME
recovery via _handleInternalFiberRecovery: a streaming-stage interruption posts
the apology (apologize mode), and an accepted-stage interruption recovers in
answer mode and re-drives reply delivery. Uses an in-memory fake chat adapter
that records posts into agent SQL; full streamed-answer rendering is deferred
(needs a complete adapter/real transport).
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(think): e2e for workflow-turn recovery + notification drain replay
Add ThinkWorkflowRecoveryE2EAgent (reuses STEP_PROMPT_WORKFLOW with a
deterministic mock structured model). Covers the happy path (structured
workflow turn completes, notification drains, workflow resumes with the
validated output) and the recovery path (mid-stream SIGKILL): on restart the
turn is reconciled to a terminal submission and the workflow-notification drain
replays it so the workflow is unblocked. Documents the deferred gap that an
interrupted structured turn is recovered as skipped rather than completed.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(think): rename e2e callable to avoid Think.getWorkflow collision
The workflow-recovery e2e agent's @callable shadowed the inherited
Think.getWorkflow(workflowId) with an incompatible signature, failing typecheck.
Rename it to inspectWorkflowRun.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(agents): exponential backoff for the runFiber-recovery follow-up alarm
The follow-up alarm added for pending fiber recovery fired every
keepAliveIntervalMs with no backoff, so a repeatedly-throwing recovery
hook — or a `fiberRecoveryMaxAgeMs: 0` ("retain forever") row whose hook
keeps throwing — would wake the DO on every tick indefinitely (the
perpetual-heartbeat hazard #1707 guards against). Track consecutive
no-progress recovery scans and back the alarm off exponentially (capped
at 5 min); any scan that recovers a fiber (including a scan-deadline
yield that drained part of a batch) resets it, so legitimate multi-pass
draining stays prompt.
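
A rough sketch of the backoff arithmetic, under stated assumptions: the delay doubles per consecutive no-progress scan (the exact growth factor is not given here), and both function names are hypothetical. Only the 5 min cap and the reset-on-progress rule come from the description above.

```typescript
const MAX_BACKOFF_MS = 5 * 60 * 1000; // 5 min cap

// Delay before the next follow-up alarm. Doubling is an assumed factor.
function followUpDelayMs(baseIntervalMs: number, noProgressScans: number): number {
  const exponent = Math.max(0, noProgressScans);
  return Math.min(baseIntervalMs * 2 ** exponent, MAX_BACKOFF_MS);
}

// Any scan that recovers at least one fiber resets the counter, including a
// scan-deadline yield that drained only part of a batch.
function nextNoProgressScans(current: number, recoveredAny: boolean): number {
  return recoveredAny ? 0 : current + 1;
}
```

With an assumed base interval of 30 s, successive no-progress scans would wait 30 s, 60 s, 120 s, 240 s, and then stay at the 300 s cap until a scan makes progress.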
Adds e2e coverage: retain-forever poison-row backoff cadence, and
multi-pass recovery for a sub-agent (facet) child driven by the parent
alarm.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(agents): fast unit coverage for runFiber recovery alarm re-arm + backoff
Adds deterministic, in-process unit tests (no process kills or timers) that
drive `_checkRunFibers` + `_scheduleNextAlarm` directly and inspect the
physical alarm: the starvation re-arm (alarm armed while a retained
recovery row is pending), exponential backoff across no-progress scans,
backoff reset on forward progress, and no alarm once recovery drains.
Previously this behavior was only covered by the nightly e2e suite.
Adds getCurrentAlarm/getRecoveryNoProgressScans/simulateAlarmCycle test
helpers to the run-fiber test agent.
Co-authored-by: Cursor <cursoragent@cursor.com>
* docs(agents): note the fiberRecoveryMaxAgeMs:0 warm-DO trade-off
A repeatedly-throwing recovery hook with fiberRecoveryMaxAgeMs:0 ("retain
forever") is retried on the capped backoff indefinitely, so the Durable
Object never idle-evicts while the unrecoverable row exists. Document
this in the option JSDoc and docs/durable-execution.md, and recommend a
finite age. Bounding recovery by attempts is tracked in #1728.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Sunil Pai · github.com-cloudflare-agents · 1c8fdf58 · 2026-06-10