fix(database): progress-aware stall accessor for external restart logic#120
Merged
Merged
Conversation
Follow-up to #119. The monitor added there is progress-aware internally (it warns only when the write lock is held > WARN_SEC AND no operation was applied in that window), but the public accessor it exposed, push_block_lock_held_ms(), reports raw hold time only. Its doc comment and #119's suggested follow-up invite an orchestrator to auto-restart on hold time alone — which would restart on a legitimately heavy, DETERMINISTIC block (hardfork batch, large reorg) at the same height on every node simultaneously. That is the same synchronized network-wide self-destruct #119 deliberately avoided by rejecting std::_Exit, just moved one level out to the orchestrator. Add push_block_appears_stalled(): a lock-free, progress-aware verdict (held long AND no progress) published by the monitor each poll and backed by a new _push_block_stalled atomic. External healthcheck / restart logic should key off this, never off raw hold time. Correct the push_block_lock_held_ms() doc comment to say so explicitly. Also documents two review notes as comments (no behavior change): the single-caller assumption behind the lazy start/stop CAS, and that a future hardfork block running long in pure C++ without apply_operation could trip the verdict network-wide (log-only, so noise not restart, provided consumers gate on a SUSTAINED verdict).
Addresses review of #120: _push_block_stalled retained its last value across a close()->open() recovery cycle, so push_block_appears_stalled() could return a stale `true` for up to the ~1s poll interval before the lazily-restarted monitor re-evaluated. Reset the verdict to false in both start_push_block_monitor() (before launching the new thread) and stop_push_block_monitor() (a stopped monitor reports not-stalled), fully closing the window.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #119 (review findings). Stacks on #119 — merge after it (this branch is based on #119's head; once #119 lands as a merge commit the diff here collapses to just these changes).
#119's monitor is progress-aware internally — it warns only when the write lock is held past
WARN_SECand no operation was applied in that window. But the public accessor it added,push_block_lock_held_ms(), reports raw hold time only, and both its doc comment and #119's proposed follow-up ("surface it so an orchestrator can auto-restart on a sustained stall") invite keying a restart off hold time alone.That reintroduces the exact hazard #119 rejected for
std::_Exit: a legitimately heavy, deterministic block (hardfork batch, large reorg) holds the lock a long time at the same height on every node, so a hold-time-only restart would restart the whole network at once — just moved one level out, from the process to the orchestrator.Changes
bool push_block_appears_stalled() const— a lock-free, progress-aware verdict (held > WARN_SEC && no progress in that window) published by the monitor each poll and backed by a new_push_block_stalledatomic. This is the signal external healthcheck/restart logic should use; it cannot false-positive on a heavy deterministic block the way raw hold time can.push_block_lock_held_ms()doc comment to state plainly that hold time alone is not a wedge signal and restart logic must usepush_block_appears_stalled()._push_block_stalledtofalsein bothstart_push_block_monitor()andstop_push_block_monitor(), so after aclose()→open()recovery cycle a reader can't observe a staletrueduring the ~1s before the lazily-restarted monitor's first poll.apply_operationcould trip the verdict network-wide — log-only, so synchronized noise, not a synchronized restart, provided consumers gate on a sustained verdict rather than the first true reading.Guidance for wiring auto-restart (for whoever builds the healthcheck)
push_block_appears_stalled()is built for alerting first — page a human on a sustainedtrue. It is safe against the heavy-deterministic-block false positive that raw hold time is not, but it still cannot see a pure-C++ hardfork migration that runs long without callingapply_operation; such a block would flip the verdicttrueon every node at the same height simultaneously. The verdict is log-only, so that is synchronized noise — it becomes a synchronized restart only if someone automates restart off it without gating.So, if you automate restart at all:
truefor N minutes) ANDhead_block_numnot advancing over that window. Both conditions together distinguish a genuine wedge from a long-but-progressing block.truereading, and never onpush_block_lock_held_ms()(raw hold time) alone.Deliberately not changed (from the review, with rationale)
fc::time_pointfor consistency with the existing read-lockwriter_held_msmachinery and the accessor's comparability. NTP can momentarily inflate the elapsed math, but it is log-only and the dominant direction (a forward step) only risks a spurious line gated behind the 60s threshold. Switching tosteady_clockwould add parallel timestamp state for marginal benefit._push_block_lock_since_usclear. fix(database): log-only stall monitor for the push_block write lock #119's two explicit stores (normal +catch) are already correct on every path; a guard would be tidier but is out of scope here.Testing
push_block_appears_stalled()stays false through a block that applies many operations over >60s (the heavy-but-healthy case), and flips true only for a hold with noapply_operation.