fix(planner): resolve state-sync witnesses for s3-bootstrap nodes #424
An s3-restore node (fullNode.snapshot.s3, no stateSync) was stuck: the planner schedules ConfigureStateSync for any snapshot bootstrap, but the gate resolved canonical-syncer witnesses (Status.ResolvedStateSyncers) only for stateSync nodes, so s3 nodes ran that task witness-less and fell back to unreachable in-cluster peers. The staged snapshot never applied (seid came up at height 0 block-syncing from genesis).
Root cause confirmed against the sidecar: TaskSnapshotRestore only stages the snapshot under data/snapshots; statesync.Configure applies it via CometBFT state-sync (use-local-snapshot=true) and still requires >=2 reachable rpc-server witnesses to verify the trust point. s3 genuinely needs witnesses.
Fix: a single needsStateSyncWitnesses(snap)=snap!=nil predicate now drives BOTH the ConfigureStateSync insertion (buildSidecarProgression) and the fail-closed plan blocker (stateSyncBlocksPlan), and reconcileStateSyncGate resolves syncers for any snapshot bootstrap (gates on snap==nil). Plan-insertion and witness-resolution are now provably in lockstep; an s3 node with <2 syncers blocks its plan rather than proceeding witness-less.
No change to the stateSync path; no CRD field/enum changes. Tests cover the s3 resolve + fail-closed paths.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fixtureSyncers added the 10th occurrence of the "atlantic-2" literal, tripping goconst on the new line. Define + use atlantic2ChainID for that occurrence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
8c537a4 to 145b32e
Problem
An s3-restore node (fullNode.snapshot.s3, no stateSync) gets stuck during bootstrap. The planner schedules ConfigureStateSync for any snapshot bootstrap, but the gate resolved canonical-syncer witnesses (Status.ResolvedStateSyncers) only for stateSync nodes. So an s3 node ran ConfigureStateSync witness-less, the sidecar fell back to the in-cluster tenant peers (unreachable from eng namespaces), and the staged snapshot never applied: seid came up at height 0, block-syncing from genesis.
Root cause (confirmed against the sidecar)
TaskSnapshotRestore only stages the snapshot under data/snapshots; statesync.Configure is what applies it, via CometBFT state-sync with use-local-snapshot=true, and it still requires ≥2 reachable rpc-server witnesses to verify the trust point (and queries the trust hash from one). So s3 genuinely needs witnesses; the gate just wasn't resolving them.
Fix
A single predicate needsStateSyncWitnesses(snap) = snap != nil now drives both the ConfigureStateSync task insertion (buildSidecarProgression) and the fail-closed plan blocker (stateSyncBlocksPlan), and reconcileStateSyncGate resolves canonical syncers for any snapshot bootstrap (it now gates on snap == nil). Plan-insertion and witness-resolution are now provably in lockstep: the original mismatch (task planned, witnesses nil) is structurally impossible. An s3 node with fewer than minCanonicalSyncers syncers now blocks its plan rather than proceeding witness-less.
No change to the stateSync path.
NotApplicable now means "no snapshot (genesis)".
XValidation makes s3 and stateSync mutually exclusive at admission, so every SnapshotSource shape routes correctly (verified: s3 / stateSync / genesis / bootstrapImage / archive).
Why it matters
Unblocks s3-bootstrap shadow replayers, and is the concrete prereq for the sharded full-history validation design (which bootstraps replayers from S3 snapshots at arbitrary heights).
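For readers skimming the Fix section, the single-predicate design can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the SnapshotSource fields, the function signatures, and the task names other than ConfigureStateSync are assumptions made for the example, and minCanonicalSyncers is set to 2 only to mirror the ≥2 witness requirement described above.

```go
package main

import "fmt"

// SnapshotSource is a hypothetical stand-in for the CRD snapshot spec.
// A nil *SnapshotSource means a genesis bootstrap (no snapshot).
type SnapshotSource struct {
	S3        bool // restore from an S3 snapshot (assumed field)
	StateSync bool // plain state-sync bootstrap (assumed field)
}

// minCanonicalSyncers mirrors the ">=2 reachable witnesses" requirement;
// the real constant's value is an assumption here.
const minCanonicalSyncers = 2

// needsStateSyncWitnesses is the single predicate: any snapshot bootstrap
// needs witnesses, regardless of where the snapshot comes from.
func needsStateSyncWitnesses(snap *SnapshotSource) bool {
	return snap != nil
}

// buildSidecarProgression returns the planned task names. Only the
// ConfigureStateSync insertion is taken from the PR text; the other
// task names are placeholders.
func buildSidecarProgression(snap *SnapshotSource) []string {
	var tasks []string
	if snap != nil && snap.S3 {
		tasks = append(tasks, "SnapshotRestore")
	}
	if needsStateSyncWitnesses(snap) {
		tasks = append(tasks, "ConfigureStateSync")
	}
	return append(tasks, "StartNode")
}

// stateSyncBlocksPlan fails closed: if the plan will contain
// ConfigureStateSync but too few syncers were resolved, block the plan.
func stateSyncBlocksPlan(snap *SnapshotSource, resolved []string) bool {
	return needsStateSyncWitnesses(snap) && len(resolved) < minCanonicalSyncers
}

func main() {
	s3 := &SnapshotSource{S3: true}
	fmt.Println(buildSidecarProgression(s3))
	fmt.Println(stateSyncBlocksPlan(s3, []string{"syncer-0"}))
	fmt.Println(stateSyncBlocksPlan(s3, []string{"syncer-0", "syncer-1"}))
	fmt.Println(stateSyncBlocksPlan(nil, nil))
}
```

Because both functions consult the same predicate, a plan cannot contain ConfigureStateSync unless the blocker also checks for witnesses, which is the lockstep property the description claims.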
Tests
TestStateSyncGate_S3Restore_ResolvesSyncers, TestStateSyncGate_S3Restore_OneSyncer_FailsClosed, the renamed genesis regression guard, and fixture-syncer wiring for the broadened gate. go build / go test green for internal/planner + internal/controller/node; doc.go invariant added with its guarding test named.
Review
Cross-reviewed (systems-engineer: ship — Hypothesis A verified end-to-end, fail-closed has no hole, no stateSync regression, predicate complete across all shapes; idiomatic: reads native). Deploy to the shared controller is gated on explicit sign-off.
🤖 Generated with Claude Code