fix(database): log-only stall monitor for the push_block write lock#119
Merged
Conversation
Observability backstop for the global chainbase write lock. A wedged push_block() holding the write lock takes the whole node down (every reader + the validator's own block production block on it; observed: writer_held_ms climbing for hours during a minority-fork reorg). A monitor thread watches the per-push_block hold time and, when a single push_block holds the write lock past PUSH_BLOCK_STALL_WARN_SEC (60s) with no operation applied in that window, emits a loud, throttled stderr diagnostic. It also exposes push_block_lock_held_ms() for external healthchecks. LOG-ONLY by design: it never calls std::_Exit and never mutates state, so a false positive can only add a log line, never a node death. A force-exit watchdog was deliberately rejected: on a deterministic heavy block (e.g. a future hardfork) it would self-destruct the whole network at once -- far worse than the localized hang it would guard against. Restart decisions stay with the operator / orchestrator, driven by push_block_lock_held_ms(). Complements the pop-loop no-progress guards (#117/#118): those fix the known infinite loops; this only makes any remaining/future write-lock stall visible.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(database): log-only stall monitor for the push_block write lock
What & why
Observability backstop for the global chainbase write lock. A wedged
push_block()that holds the write lock and never releases it takes the whole node down — every reader and the validator's own block production block on that lock (in the incident:writer_held_msclimbing for hours during a minority-fork reorg, every scheduled slot missed). The only existing signal was the per-readerUnable to acquire READ lock … writer_held_mstimeout; there was no single, clear "this node is wedged" alert.PRs #117 / #118 fixed the known infinite loops (the four fork-switch
pop_block()loops). This change does not fix a bug — it makes any remaining or future write-lock stall visible.What it does
std::thread(started lazily on the firstpush_block, stopped+joined inclose()and~database()).push_block()records, inside the write-lock critical section, when it took the lock and which block it is processing; it clears that marker when the lock is released (normal and exception paths).push_blockhas held the lock pastPUSH_BLOCK_STALL_WARN_SEC(60s) and no operation has been applied in that window, the monitor prints a loud, throttled (every 30s) diagnostic to stderr, and prints a "recovered" line when the lock is finally released.int64_t database::push_block_lock_held_ms() const— lock-free, callable from any thread — so an external healthcheck / RPC / orchestrator can observe the stall and decide whether to restart.Why log-only (and why NOT a force-exit watchdog)
An earlier draft force-exited the process (
std::_Exit) on stall. That was rejected as unsafe and replaced with this log-only design, because:apply_operation). A deterministic heavy block — most dangerously a future hardfork block — would exceed the threshold on every node at the same height at the same time, so a force-exit would take the entire network down at once. That is far worse than the localized two-node hang it would guard against (a hung node is recoverable by an operator; a synchronized consensus-height self-destruct is not).std::_Exitmid-write leaves the mmap'd shared-memory segment inconsistent, pushing an otherwise-healthy node into the corruption-recovery / snapshot-reload path.With log-only, a false positive can only produce an extra log line — never a node death, never a consensus event. The restart decision is external (operator / orchestrator / healthcheck via
push_block_lock_held_ms()), where it belongs.Safety properties
stderr. It never takes a lock, never mutates chain state, never exits the process. It cannot itself deadlock or block consensus, and it usesstderr(not fc logging) so it can never block on a lock the wedged main thread holds.close()and~database(), so thestd::threadmember is never destroyed while joinable.relaxedatomic increment inapply_operation.Testing / limitations
g++ -std=c++17 -pthread -fsyntax-only). Run normal CI + chain tests before merge.push_block(hold the lock, noapply_operation) and confirm the WARNING appears after ~60s and the "recovered" line on release; and that a normal block run produces neither.Lifecycle
Permanent, dormant observability aid. Keep it as long as a single thread can hold the global write lock across complex logic. If that architectural precondition is ever removed (all write-lock holders provably bounded / non-blocking), it becomes redundant and can be dropped.
Follow-ups (optional, not in this PR)
PUSH_BLOCK_STALL_WARN_SECoperator-configurable (currently a compile-time constant).push_block_lock_held_ms()in a healthcheck/RPC endpoint so an orchestrator can auto-restart on a sustained stall.stderrdiagnostic to the node logger if operators prefer it in the main log.