Skip to content

performance: Make sure all alarms are being set with allowUnconfirmed…#228

Open
gabivlj wants to merge 4 commits into
mainfrom
gv/no-storage
Open

performance: Make sure all alarms are being set with allowUnconfirmed…#228
gabivlj wants to merge 4 commits into
mainfrom
gv/no-storage

Conversation

@gabivlj

@gabivlj gabivlj commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

… so we do not block on storage, and all states don't reach to storage on hot-paths

Reaching storage would make container startups slower; let's give in on a more greedy approach where if storage were to fail we get an error later after a container has been potentially started.

@gabivlj gabivlj requested a review from a team as a code owner June 18, 2026 15:05
@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

APIError: Invalid Anthropic API Key

github run

@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

@gabivlj Bonk workflow was cancelled.

View workflow run · To retry, trigger Bonk again.

@pkg-pr-new

pkg-pr-new Bot commented Jun 18, 2026

Copy link
Copy Markdown

Open in StackBlitz

npm i https://pkg.pr.new/@cloudflare/containers@228

commit: 0c4b4ed

… so we do not block on storage, and all states don't reach to storage on hot-paths

Reaching storage would make container startups slower; let's give in on
a more greedy approach where if storage were to fail we get an error
later after a container has been potentially started.
@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

APIError: Invalid Anthropic API Key

github run

@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

APIError: Invalid Anthropic API Key

github run

@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

OK renewActivityTimeout only sets sleepAfterMs in memory. The alarm is what actually checks against this. So if the constructor runs and container isn't running, we don't need an alarm. But if container IS running, scheduleNextAlarm is needed to enforce sleepAfter expiry. Good.

But wait — the new code does call this.renewActivityTimeout() unconditionally even when container isn't running. That's a minor inefficiency (the value is set but never used). Not a bug though.

There IS one concern though: When start() is called externally and container wasn't running, after this.container.start() the path now does await this.scheduleNextAlarm() — that's correct.

Overall, the PR looks reasonable. Let me identify the actionable issues:

  1. Potential issue with start() retries leaving stale pendingStoppedEvent — if doStartContainer loops through retries (tries > 0), capturePendingStoppedEvent could overwrite the pendingStoppedEvent on each iteration. But it only captures when !containerWasRunning. On retry iterations after handleError, the state should already be 'stopped' or 'stopped_with_code', so it'd capture stopped_with_code. This might cause duplicate onStop calls if onStop already fired for a prior monitor death. Actually, callOnStop has onStopCalled guard but it's reset after the call. Hmm.

Let me check: on retry, would there be a previous monitor that had .catch'd and set state to stopped? If so, when monitor sets state to stopped (clean), capture would NOT trigger (state is 'stopped', not 'healthy'/'running'/'stopped_with_code'). If monitor's exit code path called setStoppedWithCode, capture WOULD trigger. So onStop would fire after retry start - this could be correct (we want onStop for the prior exit). But what if onStop was already going to fire via syncPendingStoppedEvents in next alarm? Now drain fires it, and the alarm's syncPendingStoppedEvents would skip (state has been reset by setStoppedIfUnchanged).

This seems OK, but it's complex enough that I think there could be edge cases. However, I don't see a clear bug to point to.

  1. Removed await this.ctx.storage.sync() after setAlarm: Combined with allowUnconfirmed: true, if the runtime crashes mid-write the alarm is lost. This is documented in the PR description as an explicit tradeoff.

  2. The alarm() override on Container itself is NOT in scope of the "no override" rule for subclasses, it's the internal handler. The PR doesn't change subclass override rules.

Overall, I don't see clear bugs. The PR has a clear changeset, tests are updated, and the optimization is sound. The behavior changes (deferred onStop, conditional initial scheduleNextAlarm) are subtle but defensible.

LGTM

github run

@ask-bonk ask-bonk Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

Defers storage/alarm writes and outbound-interception setup until after this.container.start() so cold starts stop blocking on storage round-trips, and switches the persisted-state writes to allowUnconfirmed: true.

Few things worth a look before this lands:

  1. Persisted outbound configuration is no longer restored on DO hydration when the container is already running. The old constructor unconditionally called restoreOutboundConfiguration() and then (if this.container.running) ran applyOutboundInterception() inside blockConcurrencyWhile. The new constructor calls updateUsingInterception() without the persisted config and only calls restoreOutboundConfigurationOnce() inside doStartContainer -- and only when !containerWasRunning. So when a DO is evicted/rehydrated while its container is still running, this.outboundHandlerOverride / this.outboundByHostOverrides are never repopulated from storage. A subsequent setOutboundByHost("example.com", ...) call will then update only that one host and applyOutboundInterception will rebuild the props from the (empty) instance fields, effectively wiping every other runtime override that was previously persisted. This is a behavior regression that isn't covered by the new tests.

  2. onStart() no longer runs under blockConcurrencyWhile. Both start() and startAndWaitForPorts() previously wrapped await this.onStart() in blockConcurrencyWhile, which queued all incoming requests until onStart resolved. After this PR, the input gate is released on every await inside user-supplied onStart() handlers, so other requests can start being served before onStart() completes. This is a public lifecycle-hook semantics change that subclasses likely rely on -- worth either preserving the old behavior or calling it out explicitly in the changeset.

  3. onStop for the prior run now fires after onStart for the new run. Previously syncPendingStoppedEvents() ran before start() inside startAndWaitForPorts. The PR removes that call and instead drains the pending stopped event in the finally of startContainerIfNotRunning, which runs after onStart() has already fired. So a subclass that observed onStop -> onStart ordering between restarts will now see onStart -> onStop. Also worth calling out in the changeset, since the changeset only mentions storage/alarm deferral.

  4. Tests don't cover the rehydrate-with-running-container path. The new cold start should not touch storage before container.start test is good, but there's no test for "DO wakes, this.container.running === true, persisted outbound overrides should still be honored." That's exactly the regression in #1 above. Worth adding before this ships.

No blocking objections beyond #1, which I do think is a real bug; the rest are behavior changes that should be documented in the changeset or guarded behind an opt-in.

Comment thread src/lib/container.ts
)
`;
this.renewActivityTimeout();
this.updateUsingInterception();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor path no longer calls restoreOutboundConfiguration() and no longer calls applyOutboundInterception() when this.container.running is true on hydration. The old code did both inside blockConcurrencyWhile.

Concrete consequence: when the DO is evicted while the container is still running, on the next hydration the instance fields this.outboundHandlerOverride and this.outboundByHostOverrides are not repopulated from OUTBOUND_CONFIGURATION_KEY. A subsequent setOutboundByHost(host, ...) call then sees an empty outboundByHostOverrides, sets just that one host, and applyOutboundInterception rebuilds the proxy props from the (now-truncated) instance state. Every other persisted runtime override is silently dropped.

Suggested fix: restore the persisted config in the constructor (it's a sync kv.get, so it doesn't reintroduce the cold-start storage round-trip you're trying to avoid), and pass it into updateUsingInterception:

Suggested change
this.updateUsingInterception();
const persistedOutboundConfiguration = this.restoreOutboundConfigurationOnce();
this.renewActivityTimeout();
this.updateUsingInterception(persistedOutboundConfiguration);

Comment thread src/lib/container.ts
await this.ctx.blockConcurrencyWhile(async () => {
await this.onStart();
});
await this.onStart();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping blockConcurrencyWhile around onStart() changes the public lifecycle contract: the input gate is now released on every await inside a user-supplied onStart handler, so other requests can begin executing before onStart resolves. Previously they were queued.

If this is intentional, please call it out in the changeset (currently it only mentions storage/alarm deferral). If not, restoring the wrapper here costs nothing -- the storage writes you're trying to avoid already happened before this point:

Suggested change
await this.onStart();
await this.ctx.blockConcurrencyWhile(async () => {
await this.onStart();
});

Comment thread src/lib/container.ts
});
// All ports are ready
await this.state.setHealthy();
await this.onStart();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same blockConcurrencyWhile removal here as in start() -- worth keeping consistent with whatever you decide for the other call site, and worth mentioning in the changeset. Note also that this code path now fires onStart() before drainPendingStoppedEvent() (which runs in the finally of startContainerIfNotRunning), so subclasses that previously saw onStop -> onStart ordering across a restart will now see onStart -> onStop.

Comment thread src/lib/container.ts
}
await this.capturePendingStoppedEvent(containerWasRunning);
await this.state.setRunning();
await this.scheduleNextAlarm();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two related observations on this block:

  1. restoreOutboundConfigurationOnce() is the only place the persisted outbound config is read post-PR. Because it's gated by !containerWasRunning, a hydrated DO that finds an already-running container never restores the persisted overrides into instance fields (see the constructor comment).
  2. capturePendingStoppedEvent(containerWasRunning) is always called with false here (we're inside the if (!containerWasRunning) branch), so the parameter is effectively dead. Minor, but consider dropping the argument or inlining the check for clarity.

@ask-bonk

ask-bonk Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review posted on PR #228. Key findings:

  1. Likely regression: persisted outbound config isn't restored when DO hydrates with an already-running container — runtime overrides like setOutboundByHost will silently wipe other persisted overrides.
  2. Lifecycle semantics change: onStart() no longer runs under blockConcurrencyWhile, so other requests can interleave with user-supplied onStart handlers.
  3. Hook ordering change: onStop for a previous run now fires after onStart for the new run.
  4. Test gap: no coverage for the rehydrate-with-running-container path.

Suggestions were posted as inline GitHub suggestion blocks.

github run

…ent new trade-offs

this means we need to bump to 0.4.0 as there is some breaking changes in
behaviour.
@ask-bonk

ask-bonk Bot commented Jun 19, 2026

Copy link
Copy Markdown

LGTM

github run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant