Skip to content

fix(database): progress-aware stall accessor for external restart logic#120

Merged
On1x merged 2 commits into
masterfrom
fix/push-block-stall-progress-aware-accessor
Jun 16, 2026
Merged

fix(database): progress-aware stall accessor for external restart logic#120
On1x merged 2 commits into
masterfrom
fix/push-block-stall-progress-aware-accessor

Conversation

@chiliec

@chiliec chiliec commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Follow-up to #119 (review findings). Stacks on #119 — merge after it (this branch is based on #119's head; once #119 lands as a merge commit the diff here collapses to just these changes).

#119's monitor is progress-aware internally — it warns only when the write lock is held past WARN_SEC and no operation was applied in that window. But the public accessor it added, push_block_lock_held_ms(), reports raw hold time only, and both its doc comment and #119's proposed follow-up ("surface it so an orchestrator can auto-restart on a sustained stall") invite keying a restart off hold time alone.

That reintroduces the exact hazard #119 rejected for std::_Exit: a legitimately heavy, deterministic block (hardfork batch, large reorg) holds the lock a long time at the same height on every node, so a hold-time-only restart would restart the whole network at once — just moved one level out, from the process to the orchestrator.

Changes

  • Add bool push_block_appears_stalled() const — a lock-free, progress-aware verdict (held > WARN_SEC && no progress in that window) published by the monitor each poll and backed by a new _push_block_stalled atomic. This is the signal external healthcheck/restart logic should use; it cannot false-positive on a heavy deterministic block the way raw hold time can.
  • Correct the push_block_lock_held_ms() doc comment to state plainly that hold time alone is not a wedge signal and restart logic must use push_block_appears_stalled().
  • Clear the verdict on monitor start/stop — reset _push_block_stalled to false in both start_push_block_monitor() and stop_push_block_monitor(), so after a close()→open() recovery cycle a reader can't observe a stale true during the ~1s before the lazily-restarted monitor's first poll.
  • Two documentation-only notes (no behavior change): the single-caller assumption behind the lazy start/stop CAS + thread assignment, and that a future hardfork block running long in pure C++ without apply_operation could trip the verdict network-wide — log-only, so synchronized noise, not a synchronized restart, provided consumers gate on a sustained verdict rather than the first true reading.

Guidance for wiring auto-restart (for whoever builds the healthcheck)

push_block_appears_stalled() is built for alerting first — page a human on a sustained true. It is safe against the heavy-deterministic-block false positive that raw hold time is not, but it still cannot see a pure-C++ hardfork migration that runs long without calling apply_operation; such a block would flip the verdict true on every node at the same height simultaneously. The verdict is log-only, so that is synchronized noise — it becomes a synchronized restart only if someone automates restart off it without gating.

So, if you automate restart at all:

  • Gate on a sustained verdict (held true for N minutes) AND head_block_num not advancing over that window. Both conditions together distinguish a genuine wedge from a long-but-progressing block.
  • Never restart on a single true reading, and never on push_block_lock_held_ms() (raw hold time) alone.
  • Prefer paging a human over auto-restart wherever an operator can respond in time — a wedged node is locally recoverable; a synchronized network-wide restart is not.

Deliberately not changed (from the review, with rationale)

  • Monotonic clock for elapsed time. Kept fc::time_point for consistency with the existing read-lock writer_held_ms machinery and the accessor's comparability. NTP can momentarily inflate the elapsed math, but it is log-only and the dominant direction (a forward step) only risks a spurious line gated behind the 60s threshold. Switching to steady_clock would add parallel timestamp state for marginal benefit.
  • RAII scope-guard for the _push_block_lock_since_us clear. fix(database): log-only stall monitor for the push_block write lock #119's two explicit stores (normal + catch) are already correct on every path; a guard would be tidier but is out of scope here.

Testing

⚠️ Not built locally (no Boost/OpenSSL in this environment); brace balance verified, net-zero brace delta, whitespace clean. Run normal CI + chain tests before merge. Suggested addition to #119's test plan: assert push_block_appears_stalled() stays false through a block that applies many operations over >60s (the heavy-but-healthy case), and flips true only for a hold with no apply_operation.

Follow-up to #119. The monitor added there is progress-aware internally
(it warns only when the write lock is held > WARN_SEC AND no operation
was applied in that window), but the public accessor it exposed,
push_block_lock_held_ms(), reports raw hold time only. Its doc comment
and #119's suggested follow-up invite an orchestrator to auto-restart on
hold time alone — which would restart on a legitimately heavy,
DETERMINISTIC block (hardfork batch, large reorg) at the same height on
every node simultaneously. That is the same synchronized network-wide
self-destruct #119 deliberately avoided by rejecting std::_Exit, just
moved one level out to the orchestrator.

Add push_block_appears_stalled(): a lock-free, progress-aware verdict
(held long AND no progress) published by the monitor each poll and
backed by a new _push_block_stalled atomic. External healthcheck /
restart logic should key off this, never off raw hold time. Correct the
push_block_lock_held_ms() doc comment to say so explicitly.

Also documents two review notes as comments (no behavior change): the
single-caller assumption behind the lazy start/stop CAS, and that a
future hardfork block running long in pure C++ without apply_operation
could trip the verdict network-wide (log-only, so noise not restart,
provided consumers gate on a SUSTAINED verdict).
@chiliec chiliec marked this pull request as draft June 16, 2026 06:53
Addresses review of #120: _push_block_stalled retained its last value
across a close()->open() recovery cycle, so push_block_appears_stalled()
could return a stale `true` for up to the ~1s poll interval before the
lazily-restarted monitor re-evaluated. Reset the verdict to false in
both start_push_block_monitor() (before launching the new thread) and
stop_push_block_monitor() (a stopped monitor reports not-stalled), fully
closing the window.
@chiliec chiliec marked this pull request as ready for review June 16, 2026 07:28
@chiliec chiliec assigned chiliec and On1x and unassigned chiliec Jun 16, 2026

@On1x On1x left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@On1x On1x merged commit 1bf813c into master Jun 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants