feat(sdk,core): preserve chat.agent context after cancel / OOM / crash#3671

Open: ericallam wants to merge 4 commits into main from feat/chat-agent-onrecoveryboot

@ericallam
Member

Summary

When a chat.agent run dies mid-stream (the user cancels, the worker OOMs, an unhandled exception kills the process), the next continuation run now reconstructs the conversation context automatically. Follow-ups like "keep going" continue the cut-off response; fresh follow-ups like "actually, what's 7+8?" abandon it and answer the new question. No customer code required.

A new opt-in onRecoveryBoot hook is the override path for advanced policies: dropping the partial entirely, synthesizing tool results for an interrupted tool call, or emitting a recovery banner via the writer.

Design

Boot now reads BOTH durable stream tails on a continuation run:

  • session.out past the snapshot cursor → settled assistant turns plus an optional partialAssistant (the trailing message whose stream never received a finish chunk; cleanupAbortedParts strips streaming-in-progress fragments)
  • session.in past the last turn-complete cursor → user messages the dead run never acknowledged

When both partialAssistant and inFlightUsers are non-empty, the smart default splices [firstInFlightUser, partialAssistant] onto the chain. The model sees full prior context and responds to whatever the latest user message asks. Modern instruction-following models prioritize the latest message, so the same default works whether the follow-up is "keep going", "do X instead", or unrelated.
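
The smart default can be sketched roughly as follows. This is an illustration only: the `ChatMessage` and `RecoveryState` types and the `buildDefaultChain` name are hypothetical, not the SDK's real types, and the fallback branch is a placeholder because the description above only covers the case where both inputs are non-empty.

```typescript
// Hypothetical message shape for illustration; the SDK's real types may differ.
type ChatMessage = { id: string; role: "user" | "assistant"; text: string };

// Hypothetical view of what boot recovers from the two stream tails.
type RecoveryState = {
  settledMessages: ChatMessage[];
  inFlightUsers: ChatMessage[];
  partialAssistant?: ChatMessage;
};

function buildDefaultChain(state: RecoveryState): ChatMessage[] {
  const { settledMessages, inFlightUsers, partialAssistant } = state;
  const firstInFlightUser = inFlightUsers[0];

  if (!partialAssistant || !firstInFlightUser) {
    // Placeholder: the PR text does not spell out this case, so the sketch
    // leaves the settled chain unchanged.
    return [...settledMessages];
  }

  // Splice [firstInFlightUser, partialAssistant] onto the settled chain so the
  // model sees the request that was being answered and the cut-off answer.
  return [...settledMessages, firstInFlightUser, partialAssistant];
}
```

Under these assumptions the function returns a new array and never mutates the recovered state, so the same inputs can also be handed to a custom policy.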

onRecoveryBoot?: (event: RecoveryBootEvent) => Promise<RecoveryBootResult | void> | RecoveryBootResult | void

The hook receives settledMessages, inFlightUsers, partialAssistant, pendingToolCalls, previousRunId, cause, and a lazy writer. It can optionally return { chain, recoveredTurns, beforeBoot }. Agents using hydrateMessages skip the hook, since customer-owned persistence is the source of truth. The hook does NOT fire when there's nothing to recover (clean continuation after chat.endRun(), fresh chat, OOM retry on a complete snapshot).
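
As a rough illustration of the "drop the partial entirely" policy, a hook might look like the sketch below. The field names come from the list above, but every type here is a hypothetical stand-in (the real RecoveryBootEvent and RecoveryBootResult may differ), the writer is omitted, and treating `chain` as the full replacement message list is an assumption.

```typescript
// Hypothetical stand-ins for the SDK types; shapes are assumptions.
type ChatMessage = { id: string; role: "user" | "assistant"; text: string };

type RecoveryBootEvent = {
  settledMessages: ChatMessage[];
  inFlightUsers: ChatMessage[];
  partialAssistant?: ChatMessage;
  pendingToolCalls: { toolCallId: string; toolName: string }[];
  previousRunId: string;
  cause: string;
};

// Only `chain` is modeled here; recoveredTurns and beforeBoot are left out.
type RecoveryBootResult = { chain?: ChatMessage[] };

function dropPartialOnRecovery(event: RecoveryBootEvent): RecoveryBootResult | undefined {
  if (!event.partialAssistant) {
    // Nothing to drop: return nothing and defer to the default behavior.
    return undefined;
  }
  // Assumption: `chain` replaces the default chain. Keep settled turns and the
  // unacknowledged user messages, but leave out the cut-off assistant message.
  return { chain: [...event.settledMessages, ...event.inFlightUsers] };
}
```

If the real API matches these assumptions, the function would be passed as `onRecoveryBoot` in the agent options; returning nothing keeps the smart default in place.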

Also fixes the records-endpoint schema: data is now z.unknown(). The wire has been carrying objects since the Sessions primitive landed, so the previous z.string() declaration was wrong. The change is transparent to existing consumers and unblocks the session.in tail read this PR adds.
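
With data typed as unknown, a consumer has to narrow the value before reading fields from it. A minimal sketch of that narrowing, where `StreamRecord`, `isPlainObject`, and `readChunk` are hypothetical names rather than SDK exports:

```typescript
// Hypothetical record shape: only the `data` field matters for this sketch.
type StreamRecord = { data: unknown };

// Type guard: true for non-null, non-array objects.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the chunk object, or undefined when the payload is not an object.
function readChunk(record: StreamRecord): Record<string, unknown> | undefined {
  return isPlainObject(record.data) ? record.data : undefined;
}
```

Callers can then check individual fields (for example a `type` discriminator, if the chunk has one) before acting on the chunk.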

Test plan

  • Build clean: pnpm run build --filter @trigger.dev/sdk --filter @trigger.dev/core
  • SDK unit tests pass: pnpm --filter @trigger.dev/sdk exec vitest run (199 tests across 16 files, includes new recovery-boot.test.ts and replay-session-in.test.ts)
  • Webapp integration test passes: pnpm --filter webapp exec vitest run test/replay-after-crash.test.ts
  • Webapp typecheck: pnpm run typecheck --filter webapp
  • Live smoke: T18 (cancel + continue, "what's 7+8?") — chain seeds with [user-essay-request, partial-essay, "7+8?"] → model answers 15
  • Live smoke: cancel + "keep going" follow-up — model continues the cut-off response from where it stopped
  • Live smoke: cancel + unrelated question after the dead run wrote tool-call chunks — model handles the new question without re-running tools

Docs land separately on the chat-prerelease docs branch (PR #3226).

@changeset-bot

changeset-bot Bot commented May 19, 2026

⚠️ No Changeset found

Latest commit: 98f2903

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR implements a chat session recovery boot for chat.agent, refactors the stream records API so record.data carries parsed chunk objects, and refactors replay to return { settled, partial } for session.out. It adds extraction of pending tool calls, an optional onRecoveryBoot hook (with writer and beforeBoot), advances cursors to avoid double-delivery, injects recovered turns with priority, extends the test harness to seed recovery state, and adds comprehensive tests for recovery and replay behaviors.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main feature: preserving chat.agent context after cancellation, OOM, or crashes during continuation runs. |
| Description check | ✅ Passed | The description covers the summary, design rationale, the new onRecoveryBoot hook, schema fixes, and a comprehensive test plan; all required sections are substantially complete. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai[bot]

This comment was marked as resolved.

@ericallam ericallam marked this pull request as ready for review May 19, 2026 21:03
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 potential issues.

View 3 additional findings in Devin Review.

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 0 new potential issues.

View 5 additional findings in Devin Review.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/trigger-sdk/test/recovery-boot.test.ts (1)

86-86: ⚡ Quick win

Replace fixed sleeps with condition-based waits to avoid flaky tests.

These hardcoded delays can intermittently fail on slower CI runners. Prefer vi.waitFor/condition assertions so tests wait for behavior, not time.

Suggested pattern
-      await new Promise((r) => setTimeout(r, 20));
-      expect(onRecoveryBoot).not.toHaveBeenCalled();
+      await vi.waitFor(() => {
+        expect(onRecoveryBoot).not.toHaveBeenCalled();
+      });

Also applies to: 116-116, 184-184, 220-220, 250-250, 291-291, 335-335, 370-370, 401-401, 438-438, 474-474
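
For context, a condition-based wait boils down to polling a check until it holds or a deadline passes. The sketch below is a standalone illustration, not vitest's implementation; `waitForCondition` is a hypothetical helper. One caveat when applying the pattern: a negative check such as "the hook was not called" already holds on the first poll, so a polling wait returns immediately for it. Polling fits positive conditions the test expects to become true.

```typescript
// Hypothetical helper: poll `check` until it returns true or the timeout elapses.
async function waitForCondition(
  check: () => boolean,
  timeoutMs = 1000,
  intervalMs = 10
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (check()) {
      return true; // condition met; stop waiting early
    }
    // Yield to the event loop so pending timers and promises can run.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // One final check at the deadline.
  return check();
}
```

A test would await a positive signal first (for example, that the boot sequence finished) and only then assert that the hook was never called.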

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/trigger-sdk/test/recovery-boot.test.ts` at line 86, replace the
fixed-time sleeps (calls like await new Promise((r) => setTimeout(r, 20))) with
condition-based waits so tests don't flake: locate each occurrence of the sleep
expression in packages/trigger-sdk/test/recovery-boot.test.ts and replace it
with a vi.waitFor or an await expect(...) assertion that waits for the specific
condition (e.g., a state change, message, or stub call) relevant to that test;
for example use await vi.waitFor(() => expect(myMock).toHaveBeenCalled()) or
await vi.waitUntil(() => myService.isReady === true) so the test waits for
observable behavior rather than a fixed timeout. Ensure each replacement checks
the exact condition the original sleep was meant to cover (for all occurrences
previously at lines with the pattern await new Promise((r) => setTimeout(r,
...))).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/trigger-sdk/test/recovery-boot.test.ts`:
- Around line 456-457: The test spies on console.warn via warnSpy but never
asserts it was called; update the "hook-throws fallback" test (the block that
creates warnSpy and instantiates agent via chat.agent) to assert the fallback
warning path by expecting warnSpy to have been called (e.g., toHaveBeenCalled or
toHaveBeenCalledWith an expected substring/message), and repeat the same
assertion for the similar case around the code referenced at 474-481; ensure you
clean up the spy (restore/mockReset) after the assertion to avoid test
pollution.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c90b708e-eb96-4d21-b3f6-86a66bc0965a

📥 Commits

Reviewing files that changed from the base of the PR and between bf473ab and 98f2903.

📒 Files selected for processing (2)
  • packages/trigger-sdk/src/v3/ai.ts
  • packages/trigger-sdk/test/recovery-boot.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/trigger-sdk/src/v3/ai.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: audit
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (9)
packages/trigger-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

In the Trigger.dev SDK (packages/trigger-sdk), prefer isomorphic code like fetch and ReadableStream instead of Node.js-specific code

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Always import tasks from @trigger.dev/sdk. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob.

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: In packages/core (@trigger.dev/core), import subpaths only, never import from root.
Add crumbs as you write code using `// @crumbs` comments or `// #region @crumbs` blocks for debug tracing. They should be stripped by agentcrumbs strip before merge.

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
packages/trigger-sdk/**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (packages/trigger-sdk/CLAUDE.md)

Always import from @trigger.dev/sdk. Never use @trigger.dev/sdk/v3 (deprecated path alias)

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.test.ts

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.ts: Use vitest exclusively for testing. Never mock anything - use testcontainers instead.
Place test files next to source files with the naming convention SourceFile.ts -> SourceFile.test.ts

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.{js,jsx,ts,tsx,json,md,yml,yaml}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting must be enforced using Prettier before committing

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Unit tests should use vitest framework
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
🧠 Learnings (6)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In this repo’s trigger.dev codebase, the “never mock — use testcontainers” guideline should only be applied to integration tests that talk to real external services (e.g., Redis, Postgres, S2). For unit tests that validate in-memory logic (e.g., deduplication/cache behavior in StandardRealtimeStreamsManager and similar module-boundary call counting), it is allowed to use Vitest mocks like `vi.fn()` and to stub/mock `ApiClient` objects to count calls or simulate in-process collaborators. Do not flag `vi.fn()`-based mocks as policy violations in these unit-test scenarios; reserve the rule for true external-service integration tests.

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

  • packages/trigger-sdk/test/recovery-boot.test.ts

Comment thread packages/trigger-sdk/test/recovery-boot.test.ts
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.

Comment thread packages/core/src/v3/schemas/api.ts