fix(database): log-only stall monitor for the push_block write lock by chiliec · Pull Request #119 · VIZ-Blockchain/viz-cpp-node

chiliec · 2026-06-16T06:29:57Z

fix(database): log-only stall monitor for the push_block write lock

What & why

Observability backstop for the global chainbase write lock. A wedged push_block() that holds the write lock and never releases it takes the whole node down — every reader and the validator's own block production block on that lock (in the incident: writer_held_ms climbing for hours during a minority-fork reorg, every scheduled slot missed). The only existing signal was the per-reader Unable to acquire READ lock … writer_held_ms timeout; there was no single, clear "this node is wedged" alert.

PRs #117 / #118 fixed the known infinite loops (the four fork-switch pop_block() loops). This change does not fix a bug — it makes any remaining or future write-lock stall visible.

What it does

A monitor std::thread (started lazily on the first push_block, stopped+joined in close() and ~database()).
push_block() records, inside the write-lock critical section, when it took the lock and which block it is processing; it clears that marker when the lock is released (normal and exception paths).
When a single push_block has held the lock past PUSH_BLOCK_STALL_WARN_SEC (60s) and no operation has been applied in that window, the monitor prints a loud, throttled (every 30s) diagnostic to stderr, and prints a "recovered" line when the lock is finally released.
Adds int64_t database::push_block_lock_held_ms() const — lock-free, callable from any thread — so an external healthcheck / RPC / orchestrator can observe the stall and decide whether to restart.

libraries/chain/database.cpp                          | 77 +++++++++++++++
libraries/chain/include/graphene/chain/database.hpp   | 29 ++++++

Why log-only (and why NOT a force-exit watchdog)

An earlier draft force-exited the process (std::_Exit) on stall. That was rejected as unsafe and replaced with this log-only design, because:

Synchronized network-wide self-destruct. The progress signal cannot cover every kind of legitimate long work under the write lock (hardfork batch processing over all accounts, schedule recomputation, large reorg pop phases, shared-memory resize — none necessarily call apply_operation). A deterministic heavy block — most dangerously a future hardfork block — would exceed the threshold on every node at the same height at the same time, so a force-exit would take the entire network down at once. That is far worse than the localized two-node hang it would guard against (a hung node is recoverable by an operator; a synchronized consensus-height self-destruct is not).
Corruption risk. std::_Exit mid-write leaves the mmap'd shared-memory segment inconsistent, pushing an otherwise-healthy node into the corruption-recovery / snapshot-reload path.

With log-only, a false positive can only produce an extra log line — never a node death, never a consensus event. The restart decision is external (operator / orchestrator / healthcheck via push_block_lock_held_ms()), where it belongs.

Safety properties

The monitor thread only sleeps, reads atomics, and writes to stderr. It never takes a lock, never mutates chain state, never exits the process. It cannot itself deadlock or block consensus, and it uses stderr (not fc logging) so it can never block on a lock the wedged main thread holds.
Thread lifecycle is closed on every path: stopped+joined in both close() and ~database(), so the std::thread member is never destroyed while joinable.
Hot-path cost: one relaxed atomic increment in apply_operation.

Testing / limitations

No full project build in my environment (no Boost/OpenSSL). Brace balance verified on both files; the monitor + accessor compile in isolation (g++ -std=c++17 -pthread -fsyntax-only). Run normal CI + chain tests before merge.
Because it is log-only, the deployment risk is low. Still worth a quick canary check that the WARNING does not appear during normal sync/production (it shouldn't — a single block never holds the lock for 60s in healthy operation).
Suggested test: inject a debug stall in push_block (hold the lock, no apply_operation) and confirm the WARNING appears after ~60s and the "recovered" line on release; and that a normal block run produces neither.

Lifecycle

Permanent, dormant observability aid. Keep it as long as a single thread can hold the global write lock across complex logic. If that architectural precondition is ever removed (all write-lock holders provably bounded / non-blocking), it becomes redundant and can be dropped.

Follow-ups (optional, not in this PR)

Make PUSH_BLOCK_STALL_WARN_SEC operator-configurable (currently a compile-time constant).
Surface push_block_lock_held_ms() in a healthcheck/RPC endpoint so an orchestrator can auto-restart on a sustained stall.
Mirror the same stderr diagnostic to the node logger if operators prefer it in the main log.

Observability backstop for the global chainbase write lock. A wedged push_block() holding the write lock takes the whole node down (every reader + the validator's own block production block on it; observed: writer_held_ms climbing for hours during a minority-fork reorg). A monitor thread watches the per-push_block hold time and, when a single push_block holds the write lock past PUSH_BLOCK_STALL_WARN_SEC (60s) with no operation applied in that window, emits a loud, throttled stderr diagnostic. It also exposes push_block_lock_held_ms() for external healthchecks. LOG-ONLY by design: it never calls std::_Exit and never mutates state, so a false positive can only add a log line, never a node death. A force-exit watchdog was deliberately rejected: on a deterministic heavy block (e.g. a future hardfork) it would self-destruct the whole network at once -- far worse than the localized hang it would guard against. Restart decisions stay with the operator / orchestrator, driven by push_block_lock_held_ms(). Complements the pop-loop no-progress guards (#117/#118): those fix the known infinite loops; this only makes any remaining/future write-lock stall visible.

On1x

ok

On1x approved these changes Jun 16, 2026

View reviewed changes

On1x merged commit 45d7de1 into master Jun 16, 2026
1 check passed

chiliec mentioned this pull request Jun 16, 2026

fix(database): progress-aware stall accessor for external restart logic #120

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(database): log-only stall monitor for the push_block write lock#119

fix(database): log-only stall monitor for the push_block write lock#119
On1x merged 1 commit into
masterfrom
fix/push-block-writelock-stall-monitor

chiliec commented Jun 16, 2026

Uh oh!

On1x left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chiliec commented Jun 16, 2026

fix(database): log-only stall monitor for the push_block write lock

What & why

What it does

Why log-only (and why NOT a force-exit watchdog)

Safety properties

Testing / limitations

Lifecycle

Follow-ups (optional, not in this PR)

Uh oh!

On1x left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants