performance: Make sure all alarms are being set with allowUnconfirmed…#228
performance: Make sure all alarms are being set with allowUnconfirmed…#228gabivlj wants to merge 4 commits into
Conversation
|
APIError: Invalid Anthropic API Key |
|
@gabivlj Bonk workflow was cancelled. View workflow run · To retry, trigger Bonk again. |
commit: |
… so we do not block on storage, and all states don't reach to storage on hot-paths Reaching storage would make container startups slower; let's give in on a more greedy approach where if storage were to fail we get an error later after a container has been potentially started.
|
APIError: Invalid Anthropic API Key |
|
APIError: Invalid Anthropic API Key |
|
OK But wait — the new code does call There IS one concern though: When Overall, the PR looks reasonable. Let me identify the actionable issues:
Let me check: on retry, would there be a previous monitor that had .catch'd and set state to stopped? If so, when monitor sets state to stopped (clean), capture would NOT trigger (state is 'stopped', not 'healthy'/'running'/'stopped_with_code'). If monitor's exit code path called setStoppedWithCode, capture WOULD trigger. So onStop would fire after retry start - this could be correct (we want onStop for the prior exit). But what if onStop was already going to fire via syncPendingStoppedEvents in next alarm? Now drain fires it, and the alarm's syncPendingStoppedEvents would skip (state has been reset by This seems OK, but it's complex enough that I think there could be edge cases. However, I don't see a clear bug to point to.
Overall, I don't see clear bugs. The PR has a clear changeset, tests are updated, and the optimization is sound. The behavior changes (deferred onStop, conditional initial scheduleNextAlarm) are subtle but defensible. LGTM |
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
Defers storage/alarm writes and outbound-interception setup until after this.container.start() so cold starts stop blocking on storage round-trips, and switches the persisted-state writes to allowUnconfirmed: true.
Few things worth a look before this lands:
-
Persisted outbound configuration is no longer restored on DO hydration when the container is already running. The old constructor unconditionally called
restoreOutboundConfiguration()and then (ifthis.container.running) ranapplyOutboundInterception()insideblockConcurrencyWhile. The new constructor callsupdateUsingInterception()without the persisted config and only callsrestoreOutboundConfigurationOnce()insidedoStartContainer-- and only when!containerWasRunning. So when a DO is evicted/rehydrated while its container is still running,this.outboundHandlerOverride/this.outboundByHostOverridesare never repopulated from storage. A subsequentsetOutboundByHost("example.com", ...)call will then update only that one host andapplyOutboundInterceptionwill rebuild the props from the (empty) instance fields, effectively wiping every other runtime override that was previously persisted. This is a behavior regression that isn't covered by the new tests. -
onStart()no longer runs underblockConcurrencyWhile. Bothstart()andstartAndWaitForPorts()previously wrappedawait this.onStart()inblockConcurrencyWhile, which queued all incoming requests untilonStartresolved. After this PR, the input gate is released on everyawaitinside user-suppliedonStart()handlers, so other requests can start being served beforeonStart()completes. This is a public lifecycle-hook semantics change that subclasses likely rely on -- worth either preserving the old behavior or calling it out explicitly in the changeset. -
onStopfor the prior run now fires afteronStartfor the new run. PreviouslysyncPendingStoppedEvents()ran beforestart()insidestartAndWaitForPorts. The PR removes that call and instead drains the pending stopped event in thefinallyofstartContainerIfNotRunning, which runs afteronStart()has already fired. So a subclass that observedonStop -> onStartordering between restarts will now seeonStart -> onStop. Also worth calling out in the changeset, since the changeset only mentions storage/alarm deferral. -
Tests don't cover the rehydrate-with-running-container path. The new
cold start should not touch storage before container.starttest is good, but there's no test for "DO wakes,this.container.running === true, persisted outbound overrides should still be honored." That's exactly the regression in #1 above. Worth adding before this ships.
No blocking objections beyond #1, which I do think is a real bug; the rest are behavior changes that should be documented in the changeset or guarded behind an opt-in.
| ) | ||
| `; | ||
| this.renewActivityTimeout(); | ||
| this.updateUsingInterception(); |
There was a problem hiding this comment.
This constructor path no longer calls restoreOutboundConfiguration() and no longer calls applyOutboundInterception() when this.container.running is true on hydration. The old code did both inside blockConcurrencyWhile.
Concrete consequence: when the DO is evicted while the container is still running, on the next hydration the instance fields this.outboundHandlerOverride and this.outboundByHostOverrides are not repopulated from OUTBOUND_CONFIGURATION_KEY. A subsequent setOutboundByHost(host, ...) call then sees an empty outboundByHostOverrides, sets just that one host, and applyOutboundInterception rebuilds the proxy props from the (now-truncated) instance state. Every other persisted runtime override is silently dropped.
Suggested fix: restore the persisted config in the constructor (it's a sync kv.get, so it doesn't reintroduce the cold-start storage round-trip you're trying to avoid), and pass it into updateUsingInterception:
| this.updateUsingInterception(); | |
| const persistedOutboundConfiguration = this.restoreOutboundConfigurationOnce(); | |
| this.renewActivityTimeout(); | |
| this.updateUsingInterception(persistedOutboundConfiguration); |
| await this.ctx.blockConcurrencyWhile(async () => { | ||
| await this.onStart(); | ||
| }); | ||
| await this.onStart(); |
There was a problem hiding this comment.
Dropping blockConcurrencyWhile around onStart() changes the public lifecycle contract: the input gate is now released on every await inside a user-supplied onStart handler, so other requests can begin executing before onStart resolves. Previously they were queued.
If this is intentional, please call it out in the changeset (currently it only mentions storage/alarm deferral). If not, restoring the wrapper here costs nothing -- the storage writes you're trying to avoid already happened before this point:
| await this.onStart(); | |
| await this.ctx.blockConcurrencyWhile(async () => { | |
| await this.onStart(); | |
| }); |
| }); | ||
| // All ports are ready | ||
| await this.state.setHealthy(); | ||
| await this.onStart(); |
There was a problem hiding this comment.
Same blockConcurrencyWhile removal here as in start() -- worth keeping consistent with whatever you decide for the other call site, and worth mentioning in the changeset. Note also that this code path now fires onStart() before drainPendingStoppedEvent() (which runs in the finally of startContainerIfNotRunning), so subclasses that previously saw onStop -> onStart ordering across a restart will now see onStart -> onStop.
| } | ||
| await this.capturePendingStoppedEvent(containerWasRunning); | ||
| await this.state.setRunning(); | ||
| await this.scheduleNextAlarm(); |
There was a problem hiding this comment.
Two related observations on this block:
restoreOutboundConfigurationOnce()is the only place the persisted outbound config is read post-PR. Because it's gated by!containerWasRunning, a hydrated DO that finds an already-running container never restores the persisted overrides into instance fields (see the constructor comment).capturePendingStoppedEvent(containerWasRunning)is always called withfalsehere (we're inside theif (!containerWasRunning)branch), so the parameter is effectively dead. Minor, but consider dropping the argument or inlining the check for clarity.
|
Review posted on PR #228. Key findings:
Suggestions were posted as inline GitHub |
…ent new trade-offs this means we need to bump to 0.4.0 as there is some breaking changes in behaviour.
|
LGTM |
… so we do not block on storage, and all states don't reach to storage on hot-paths
Reaching storage would make container startups slower; let's give in on a more greedy approach where if storage were to fail we get an error later after a container has been potentially started.