Skip to content

fix(database): log-only stall monitor for the push_block write lock#119

Merged
On1x merged 1 commit into
masterfrom
fix/push-block-writelock-stall-monitor
Jun 16, 2026
Merged

fix(database): log-only stall monitor for the push_block write lock#119
On1x merged 1 commit into
masterfrom
fix/push-block-writelock-stall-monitor

Conversation

@chiliec

@chiliec chiliec commented Jun 16, 2026

Copy link
Copy Markdown
Member

fix(database): log-only stall monitor for the push_block write lock

What & why

Observability backstop for the global chainbase write lock. A wedged push_block() that holds the write lock and never releases it takes the whole node down — every reader and the validator's own block production block on that lock (in the incident: writer_held_ms climbing for hours during a minority-fork reorg, every scheduled slot missed). The only existing signal was the per-reader Unable to acquire READ lock … writer_held_ms timeout; there was no single, clear "this node is wedged" alert.

PRs #117 / #118 fixed the known infinite loops (the four fork-switch pop_block() loops). This change does not fix a bug — it makes any remaining or future write-lock stall visible.

What it does

  • A monitor std::thread (started lazily on the first push_block, stopped+joined in close() and ~database()).
  • push_block() records, inside the write-lock critical section, when it took the lock and which block it is processing; it clears that marker when the lock is released (normal and exception paths).
  • When a single push_block has held the lock past PUSH_BLOCK_STALL_WARN_SEC (60s) and no operation has been applied in that window, the monitor prints a loud, throttled (every 30s) diagnostic to stderr, and prints a "recovered" line when the lock is finally released.
  • Adds int64_t database::push_block_lock_held_ms() const — lock-free, callable from any thread — so an external healthcheck / RPC / orchestrator can observe the stall and decide whether to restart.
libraries/chain/database.cpp                          | 77 +++++++++++++++
libraries/chain/include/graphene/chain/database.hpp   | 29 ++++++

Why log-only (and why NOT a force-exit watchdog)

An earlier draft force-exited the process (std::_Exit) on stall. That was rejected as unsafe and replaced with this log-only design, because:

  1. Synchronized network-wide self-destruct. The progress signal cannot cover every kind of legitimate long work under the write lock (hardfork batch processing over all accounts, schedule recomputation, large reorg pop phases, shared-memory resize — none necessarily call apply_operation). A deterministic heavy block — most dangerously a future hardfork block — would exceed the threshold on every node at the same height at the same time, so a force-exit would take the entire network down at once. That is far worse than the localized two-node hang it would guard against (a hung node is recoverable by an operator; a synchronized consensus-height self-destruct is not).
  2. Corruption risk. std::_Exit mid-write leaves the mmap'd shared-memory segment inconsistent, pushing an otherwise-healthy node into the corruption-recovery / snapshot-reload path.

With log-only, a false positive can only produce an extra log line — never a node death, never a consensus event. The restart decision is external (operator / orchestrator / healthcheck via push_block_lock_held_ms()), where it belongs.

Safety properties

  • The monitor thread only sleeps, reads atomics, and writes to stderr. It never takes a lock, never mutates chain state, never exits the process. It cannot itself deadlock or block consensus, and it uses stderr (not fc logging) so it can never block on a lock the wedged main thread holds.
  • Thread lifecycle is closed on every path: stopped+joined in both close() and ~database(), so the std::thread member is never destroyed while joinable.
  • Hot-path cost: one relaxed atomic increment in apply_operation.

Testing / limitations

  • No full project build in my environment (no Boost/OpenSSL). Brace balance verified on both files; the monitor + accessor compile in isolation (g++ -std=c++17 -pthread -fsyntax-only). Run normal CI + chain tests before merge.
  • Because it is log-only, the deployment risk is low. Still worth a quick canary check that the WARNING does not appear during normal sync/production (it shouldn't — a single block never holds the lock for 60s in healthy operation).
  • Suggested test: inject a debug stall in push_block (hold the lock, no apply_operation) and confirm the WARNING appears after ~60s and the "recovered" line on release; and that a normal block run produces neither.

Lifecycle

Permanent, dormant observability aid. Keep it as long as a single thread can hold the global write lock across complex logic. If that architectural precondition is ever removed (all write-lock holders provably bounded / non-blocking), it becomes redundant and can be dropped.

Follow-ups (optional, not in this PR)

  • Make PUSH_BLOCK_STALL_WARN_SEC operator-configurable (currently a compile-time constant).
  • Surface push_block_lock_held_ms() in a healthcheck/RPC endpoint so an orchestrator can auto-restart on a sustained stall.
  • Mirror the same stderr diagnostic to the node logger if operators prefer it in the main log.

Observability backstop for the global chainbase write lock. A wedged push_block()
holding the write lock takes the whole node down (every reader + the validator's
own block production block on it; observed: writer_held_ms climbing for hours
during a minority-fork reorg).

A monitor thread watches the per-push_block hold time and, when a single
push_block holds the write lock past PUSH_BLOCK_STALL_WARN_SEC (60s) with no
operation applied in that window, emits a loud, throttled stderr diagnostic. It
also exposes push_block_lock_held_ms() for external healthchecks.

LOG-ONLY by design: it never calls std::_Exit and never mutates state, so a false
positive can only add a log line, never a node death. A force-exit watchdog was
deliberately rejected: on a deterministic heavy block (e.g. a future hardfork) it
would self-destruct the whole network at once -- far worse than the localized hang
it would guard against. Restart decisions stay with the operator / orchestrator,
driven by push_block_lock_held_ms().

Complements the pop-loop no-progress guards (#117/#118): those fix the known
infinite loops; this only makes any remaining/future write-lock stall visible.

@On1x On1x left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@On1x On1x merged commit 45d7de1 into master Jun 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants