fix(database): progress-aware stall accessor for external restart logic by chiliec · Pull Request #120 · VIZ-Blockchain/viz-cpp-node

chiliec · 2026-06-16T06:50:47Z

Summary

Follow-up to #119 (review findings). Stacks on #119 — merge after it (this branch is based on #119's head; once #119 lands as a merge commit the diff here collapses to just these changes).

#119's monitor is progress-aware internally — it warns only when the write lock is held past WARN_SEC and no operation was applied in that window. But the public accessor it added, push_block_lock_held_ms(), reports raw hold time only, and both its doc comment and #119's proposed follow-up ("surface it so an orchestrator can auto-restart on a sustained stall") invite keying a restart off hold time alone.

That reintroduces the exact hazard #119 rejected for std::_Exit: a legitimately heavy, deterministic block (hardfork batch, large reorg) holds the lock a long time at the same height on every node, so a hold-time-only restart would restart the whole network at once — just moved one level out, from the process to the orchestrator.

Changes

Add bool push_block_appears_stalled() const — a lock-free, progress-aware verdict (held > WARN_SEC && no progress in that window) published by the monitor each poll and backed by a new _push_block_stalled atomic. This is the signal external healthcheck/restart logic should use; it cannot false-positive on a heavy deterministic block the way raw hold time can.
Correct the push_block_lock_held_ms() doc comment to state plainly that hold time alone is not a wedge signal and restart logic must use push_block_appears_stalled().
Clear the verdict on monitor start/stop — reset _push_block_stalled to false in both start_push_block_monitor() and stop_push_block_monitor(), so after a close()→open() recovery cycle a reader can't observe a stale true during the ~1s before the lazily-restarted monitor's first poll.
Two documentation-only notes (no behavior change): the single-caller assumption behind the lazy start/stop CAS + thread assignment, and that a future hardfork block running long in pure C++ without apply_operation could trip the verdict network-wide — log-only, so synchronized noise, not a synchronized restart, provided consumers gate on a sustained verdict rather than the first true reading.

Guidance for wiring auto-restart (for whoever builds the healthcheck)

push_block_appears_stalled() is built for alerting first — page a human on a sustained true. It is safe against the heavy-deterministic-block false positive that raw hold time is not, but it still cannot see a pure-C++ hardfork migration that runs long without calling apply_operation; such a block would flip the verdict true on every node at the same height simultaneously. The verdict is log-only, so that is synchronized noise — it becomes a synchronized restart only if someone automates restart off it without gating.

So, if you automate restart at all:

Gate on a sustained verdict (held true for N minutes) AND head_block_num not advancing over that window. Both conditions together distinguish a genuine wedge from a long-but-progressing block.
Never restart on a single true reading, and never on push_block_lock_held_ms() (raw hold time) alone.
Prefer paging a human over auto-restart wherever an operator can respond in time — a wedged node is locally recoverable; a synchronized network-wide restart is not.

Deliberately not changed (from the review, with rationale)

Monotonic clock for elapsed time. Kept fc::time_point for consistency with the existing read-lock writer_held_ms machinery and the accessor's comparability. NTP can momentarily inflate the elapsed math, but it is log-only and the dominant direction (a forward step) only risks a spurious line gated behind the 60s threshold. Switching to steady_clock would add parallel timestamp state for marginal benefit.
RAII scope-guard for the _push_block_lock_since_us clear. fix(database): log-only stall monitor for the push_block write lock #119's two explicit stores (normal + catch) are already correct on every path; a guard would be tidier but is out of scope here.

Testing

⚠️ Not built locally (no Boost/OpenSSL in this environment); brace balance verified, net-zero brace delta, whitespace clean. Run normal CI + chain tests before merge. Suggested addition to #119's test plan: assert push_block_appears_stalled() stays false through a block that applies many operations over >60s (the heavy-but-healthy case), and flips true only for a hold with no apply_operation.

Follow-up to #119. The monitor added there is progress-aware internally (it warns only when the write lock is held > WARN_SEC AND no operation was applied in that window), but the public accessor it exposed, push_block_lock_held_ms(), reports raw hold time only. Its doc comment and #119's suggested follow-up invite an orchestrator to auto-restart on hold time alone — which would restart on a legitimately heavy, DETERMINISTIC block (hardfork batch, large reorg) at the same height on every node simultaneously. That is the same synchronized network-wide self-destruct #119 deliberately avoided by rejecting std::_Exit, just moved one level out to the orchestrator. Add push_block_appears_stalled(): a lock-free, progress-aware verdict (held long AND no progress) published by the monitor each poll and backed by a new _push_block_stalled atomic. External healthcheck / restart logic should key off this, never off raw hold time. Correct the push_block_lock_held_ms() doc comment to say so explicitly. Also documents two review notes as comments (no behavior change): the single-caller assumption behind the lazy start/stop CAS, and that a future hardfork block running long in pure C++ without apply_operation could trip the verdict network-wide (log-only, so noise not restart, provided consumers gate on a SUSTAINED verdict).

Addresses review of #120: _push_block_stalled retained its last value across a close()->open() recovery cycle, so push_block_appears_stalled() could return a stale `true` for up to the ~1s poll interval before the lazily-restarted monitor re-evaluated. Reset the verdict to false in both start_push_block_monitor() (before launching the new thread) and stop_push_block_monitor() (a stopped monitor reports not-stalled), fully closing the window.

On1x

ok

chiliec marked this pull request as draft June 16, 2026 06:53

chiliec marked this pull request as ready for review June 16, 2026 07:28

chiliec assigned chiliec and On1x and unassigned chiliec Jun 16, 2026

On1x approved these changes Jun 16, 2026

View reviewed changes

On1x merged commit 1bf813c into master Jun 16, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(database): progress-aware stall accessor for external restart logic#120

fix(database): progress-aware stall accessor for external restart logic#120
On1x merged 2 commits into
masterfrom
fix/push-block-stall-progress-aware-accessor

chiliec commented Jun 16, 2026 •

edited

Loading

Uh oh!

On1x left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chiliec commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Guidance for wiring auto-restart (for whoever builds the healthcheck)

Deliberately not changed (from the review, with rationale)

Testing

Uh oh!

On1x left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chiliec commented Jun 16, 2026 •

edited

Loading