
fix(planner): resolve state-sync witnesses for s3-bootstrap nodes #424

Merged
bdchatham merged 2 commits into main from fix/s3-bootstrap-statesync-witnesses
Jun 22, 2026


@bdchatham


Problem

An s3-restore node (fullNode.snapshot.s3, no stateSync) gets stuck during bootstrap. The planner schedules ConfigureStateSync for any snapshot bootstrap, but the gate resolved canonical-syncer witnesses (Status.ResolvedStateSyncers) only for stateSync nodes. So an s3 node ran ConfigureStateSync witness-less, the sidecar fell back to the in-cluster tenant peers (unreachable from eng namespaces), and the staged snapshot never applied — seid came up at height 0 block-syncing from genesis.

Root cause (confirmed against the sidecar)

TaskSnapshotRestore only stages the snapshot under data/snapshots; statesync.Configure is what applies it, via CometBFT state-sync with use-local-snapshot=true, and it still requires ≥2 reachable rpc-server witnesses to verify the trust point (and queries the trust hash from one). So s3 genuinely needs witnesses — the gate just wasn't resolving them.

Fix

A single predicate, needsStateSyncWitnesses(snap) = snap != nil, now drives both the ConfigureStateSync task insertion (buildSidecarProgression) and the fail-closed plan blocker (stateSyncBlocksPlan). reconcileStateSyncGate uses the same condition: it resolves canonical syncers for any snapshot bootstrap and returns NotApplicable only when snap == nil. Because plan insertion and witness resolution share one predicate, the original mismatch (task planned, witnesses nil) can no longer occur. An s3 node with fewer than minCanonicalSyncers resolved syncers now blocks its plan rather than proceeding witness-less.

  • No behavior change to the stateSync path. NotApplicable now means "no snapshot (genesis)".
  • No CRD field/enum/value changes (docstrings only) — not a one-way door.
  • The CRD's XValidation makes s3 and stateSync mutually exclusive at admission, so every SnapshotSource shape routes correctly (verified: s3 / stateSync / genesis / bootstrapImage / archive).
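
As a rough illustration of the fix, the sketch below shows one predicate feeding both the task insertion and the plan blocker. The function names and minCanonicalSyncers come from the description above; the `SnapshotSource` struct, the string task names, and the function signatures are simplified assumptions, not the planner's real types.

```go
package main

import "fmt"

// SnapshotSource is a hypothetical, simplified stand-in for the CRD's
// snapshot spec. A nil pointer means a genesis bootstrap.
type SnapshotSource struct {
	S3        bool
	StateSync bool
}

// minCanonicalSyncers is assumed to be 2, matching the ">=2 witnesses" rule.
const minCanonicalSyncers = 2

// needsStateSyncWitnesses is the single predicate: every snapshot bootstrap
// needs witnesses, genesis does not.
func needsStateSyncWitnesses(snap *SnapshotSource) bool {
	return snap != nil
}

// buildSidecarProgression inserts ConfigureStateSync exactly when the
// predicate holds. Task names are plain strings in this sketch.
func buildSidecarProgression(snap *SnapshotSource) []string {
	var tasks []string
	if snap != nil && snap.S3 {
		tasks = append(tasks, "SnapshotRestore")
	}
	if needsStateSyncWitnesses(snap) {
		tasks = append(tasks, "ConfigureStateSync")
	}
	return tasks
}

// stateSyncBlocksPlan fails closed: if witnesses are needed but too few are
// resolved, the plan is blocked rather than run witness-less.
func stateSyncBlocksPlan(snap *SnapshotSource, resolved []string) bool {
	return needsStateSyncWitnesses(snap) && len(resolved) < minCanonicalSyncers
}

func main() {
	s3 := &SnapshotSource{S3: true}
	fmt.Println(buildSidecarProgression(s3))
	fmt.Println(stateSyncBlocksPlan(s3, []string{"syncer-0"}))
	fmt.Println(stateSyncBlocksPlan(nil, nil))
}
```

Since both functions call the same predicate, a plan that contains ConfigureStateSync is always subject to the blocker.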

Why it matters

Unblocks s3-bootstrap shadow replayers, and is the concrete prereq for the sharded full-history validation design (which bootstraps replayers from S3 snapshots at arbitrary heights).

Tests

TestStateSyncGate_S3Restore_ResolvesSyncers, TestStateSyncGate_S3Restore_OneSyncer_FailsClosed, the renamed genesis regression guard, and fixture-syncer wiring for the broadened gate. go build/go test green for internal/planner + internal/controller/node; doc.go invariant added with its guarding test named.

Review

Cross-reviewed (systems-engineer: ship — Hypothesis A verified end-to-end, fail-closed has no hole, no stateSync regression, predicate complete across all shapes; idiomatic: reads native). Deploy to the shared controller is gated on explicit sign-off.

🤖 Generated with Claude Code

An s3-restore node (fullNode.snapshot.s3, no stateSync) was stuck: the planner
schedules ConfigureStateSync for any snapshot bootstrap, but the gate resolved
canonical-syncer witnesses (Status.ResolvedStateSyncers) only for stateSync
nodes — so s3 nodes ran that task witness-less and fell back to unreachable
in-cluster peers. The staged snapshot never applied (seid came up at height 0
block-syncing from genesis).

Root cause confirmed against the sidecar: TaskSnapshotRestore only stages the
snapshot under data/snapshots; statesync.Configure applies it via CometBFT
state-sync (use-local-snapshot=true) and still requires >=2 reachable rpc-server
witnesses to verify the trust point. s3 genuinely needs witnesses.

Fix: a single needsStateSyncWitnesses(snap)=snap!=nil predicate now drives BOTH
the ConfigureStateSync insertion (buildSidecarProgression) and the fail-closed
plan blocker (stateSyncBlocksPlan), and reconcileStateSyncGate resolves syncers
for any snapshot bootstrap (gates on snap==nil). Plan-insertion and
witness-resolution are now provably in lockstep; an s3 node with <2 syncers
blocks its plan rather than proceeding witness-less. No change to the stateSync
path; no CRD field/enum changes. Tests cover the s3 resolve + fail-closed paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 22, 2026


PR Summary

Medium Risk
Changes init/bootstrap gating for all snapshot-backed nodes (including S3), which can block plans until syncers are configured but prevents witness-less ConfigureStateSync that previously caused stuck bootstraps.

Overview
S3 snapshot bootstrap nodes were planning ConfigureStateSync but the StateSyncReady gate only resolved canonical RPC witnesses for stateSync specs, so S3 restores could run witness-less and stall bootstrap.

The fix introduces needsStateSyncWitnesses(snap) (snap != nil) as the single predicate for ConfigureStateSync task insertion, the fail-closed plan blocker, and reconcileStateSyncGate. Any snapshot bootstrap (S3 or stateSync) now requires ≥2 configured canonical syncers before the init plan proceeds; genesis nodes (snap == nil) stay NotApplicable. CRD comments are updated to match; tests cover S3 ready/fail-closed paths and default fixture syncer wiring.
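
The gate behavior summarized here could look roughly like the following. Only the NotApplicable state and the name reconcileStateSyncGate come from this PR; the `Snapshot` type, the "Ready" and "Blocked" status strings, and the signature are illustrative assumptions.

```go
package main

import "fmt"

// Snapshot is a hypothetical placeholder for the node's snapshot spec.
type Snapshot struct{ Kind string }

// minCanonicalSyncers is assumed to be 2 per the ">=2 syncers" requirement.
const minCanonicalSyncers = 2

// reconcileStateSyncGate sketches the broadened gate: genesis nodes are
// NotApplicable, and any snapshot bootstrap resolves syncers or stays blocked.
// "Ready" and "Blocked" are illustrative status names.
func reconcileStateSyncGate(snap *Snapshot, canonical []string) (status string, resolved []string) {
	if snap == nil {
		return "NotApplicable", nil
	}
	if len(canonical) < minCanonicalSyncers {
		return "Blocked", nil
	}
	// Copy so callers cannot mutate the configured list through the status.
	return "Ready", append([]string(nil), canonical...)
}

func main() {
	for _, snap := range []*Snapshot{nil, {Kind: "s3"}, {Kind: "stateSync"}} {
		status, resolved := reconcileStateSyncGate(snap, []string{"syncer-0", "syncer-1"})
		fmt.Println(status, len(resolved))
	}
}
```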

Reviewed by Cursor Bugbot for commit 8c537a4. Bugbot is set up for automated code reviews on this repo.

fixtureSyncers added the 10th occurrence of the "atlantic-2" literal, tripping
goconst on the new line. Define + use atlantic2ChainID for that occurrence.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the fix/s3-bootstrap-statesync-witnesses branch from 8c537a4 to 145b32e Compare June 22, 2026 00:49
@bdchatham bdchatham merged commit 6facb7d into main Jun 22, 2026
5 checks passed
@bdchatham bdchatham deleted the fix/s3-bootstrap-statesync-witnesses branch June 22, 2026 15:20