Skip to content

spock_failover_slots: drop EXCLUSIVE locks around slot xmin compute #485

Merged
mason-sharp merged 2 commits into
mainfrom
fix/failover-slot-stuck-cherry
May 27, 2026
Merged

spock_failover_slots: drop EXCLUSIVE locks around slot xmin compute #485
mason-sharp merged 2 commits into
mainfrom
fix/failover-slot-stuck-cherry

Conversation

@mason-sharp
Copy link
Copy Markdown
Member

synchronize_one_slot held ReplicationSlotControlLock LW_EXCLUSIVE and
ProcArrayLock LW_EXCLUSIVE across the slot xmin seed, then later in the
same function ReplicationSlotRelease() / ReplicationSlotDropPtr() would
call ReplicationSlotsComputeRequiredXmin(false) which needs the SHARED
ReplicationSlotControlLock. On PG17+ standbys this contention wedged
the spock_failover_slots worker inside LWLockAcquire and queued every
backend trying to initialise behind it.

Take ProcArrayLock LW_SHARED for GetOldestSafeDecodingTransactionId,
update the slot fields under the slot's SpinLock, then let
ReplicationSlotsComputeRequiredXmin(false) acquire its own short
SHARED RSCL the way every PG core call site does.

Ibrar Ahmed added 2 commits May 27, 2026 11:41
synchronize_one_slot held ReplicationSlotControlLock LW_EXCLUSIVE and
ProcArrayLock LW_EXCLUSIVE across the slot xmin seed, then later in the
same function ReplicationSlotRelease() / ReplicationSlotDropPtr() would
call ReplicationSlotsComputeRequiredXmin(false) which needs the SHARED
ReplicationSlotControlLock. On PG17+ standbys this contention wedged
the spock_failover_slots worker inside LWLockAcquire and queued every
backend trying to initialise behind it.

Take ProcArrayLock LW_SHARED for GetOldestSafeDecodingTransactionId,
update the slot fields under the slot's SpinLock, then let
ReplicationSlotsComputeRequiredXmin(false) acquire its own short
SHARED RSCL the way every PG core call site does.

(cherry picked from commit fb8c2de)
qport() now sets PGCONNECT_TIMEOUT=10 so a hung standby fails in 10 s
instead of stalling prove until the harness kills it. Add qport_status()
which returns (output, psql_rc) so callers can assert connect succeeded
within the timeout, and use it in the read-only-standby regression
assertions for explicit "psql rc=N" failure messages.
@mason-sharp mason-sharp requested a review from ibrarahmad May 27, 2026 18:43
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

The PR refactors slot synchronization locking in synchronize_one_slot() to avoid holding exclusive locks while computing required transaction minimums, then hardens test query helpers with connection timeouts and exit-code verification to catch hung operations.

Changes

Slot synchronization lock pattern and test coverage

Layer / File(s) Summary
Slot seeding lock refactoring
src/spock_failover_slots.c
Replaces exclusive ReplicationSlotControlLock and ProcArrayLock acquisition with shared ProcArrayLock hold to read safe transaction ID, slot field updates under SpinLock, and deferred ReplicationSlotsComputeRequiredXmin(false) call outside the prior exclusive-lock window.
Test timeout infrastructure and assertions
tests/tap/t/018_failover_slots.pl
qport() helper adds PGCONNECT_TIMEOUT=10 to fail fast on hung connections. New qport_status() helper captures query output and psql exit code. Regression assertions use qport_status() with explicit rc==0 checks for pg_is_in_recovery(), user tables, spock.node catalog, and replication slots.

Poem

🐰 Locks unwind in careful measure,
Shared where once were chains of treasure,
Tests catch hangs with timeout's pleasure,
Slots spin smooth, a fuzzy treasure! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: dropping EXCLUSIVE locks around slot xmin computation in spock_failover_slots, which directly addresses the locking contention issue described in the PR.
Description check ✅ Passed The description provides relevant context about the locking contention issue, the specific problem on PostgreSQL 17+ standbys, and the solution approach of using SHARED locks instead of EXCLUSIVE ones.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/failover-slot-stuck-cherry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/tap/t/018_failover_slots.pl`:
- Around line 439-474: The write-rejection check still invokes psql directly
without a connection timeout, so a wedged standby can hang the test; change that
write-probe to use the same qport_status(...) wrapper used for the reads (or, if
keeping the direct psql call, set PGCONNECT_TIMEOUT to the same timeout value)
so the write probe fails fast; update the code that performs the
"write-rejection" psql invocation to either call qport_status($pg_bin, $host,
$standby_port, $dbname, $db_user, "<write SQL>") or export/prepend
PGCONNECT_TIMEOUT=10 for that psql command to match the other probes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a0f282cc-6f41-4b10-80cb-dd3be1a609b3

📥 Commits

Reviewing files that changed from the base of the PR and between bcd6160 and 86422b4.

📒 Files selected for processing (2)
  • src/spock_failover_slots.c
  • tests/tap/t/018_failover_slots.pl

Comment on lines +439 to +474
#
# Use qport_status so a hung backend (initializing / futex_wait_queue) is
# reported as an explicit assertion failure with a clear message, instead
# of stalling the whole test until prove kills it.
my ($still_in_recovery, $rc_recov) = qport_status($pg_bin, $host, $standby_port,
$dbname, $db_user, "SELECT pg_is_in_recovery()");
$still_in_recovery =~ s/\s+//g;
ok($rc_recov == 0,
"Standby answered pg_is_in_recovery() within 10s (psql rc=$rc_recov)");
is($still_in_recovery, 't',
'Read-only standby is still in recovery (hot_standby mode)');

# 1) User-table SELECT against the standby returns the committed row.
my $val_on_standby = qport($pg_bin, $host, $standby_port,
my ($val_on_standby, $rc_user) = qport_status($pg_bin, $host, $standby_port,
$dbname, $db_user, "SELECT val FROM failover_test WHERE id = 1");
$val_on_standby =~ s/\s+//g;
ok($rc_user == 0,
"Standby answered user SELECT within 10s (psql rc=$rc_user)");
is($val_on_standby, 'before_failover',
'Read-only standby returns committed user data (SELECT works)');

# 2) Spock catalog SELECT against the standby — the original customer
# failure mode was that spock.* reads errored out on a recovery backend.
my $standby_node_count = qport($pg_bin, $host, $standby_port,
# 2) Spock catalog SELECT against the standby — guards against the
# failure mode where backends hang in "initializing" state on the
# standby with futex_wait_queue. An explicit timeout on the connect
# makes this a fast assertion failure, not a hang.
my ($standby_node_count, $rc_node) = qport_status($pg_bin, $host, $standby_port,
$dbname, $db_user, "SELECT count(*) FROM spock.node");
$standby_node_count =~ s/\s+//g;
ok($rc_node == 0,
"Standby answered spock.node SELECT within 10s (psql rc=$rc_node)");
ok(($standby_node_count =~ /^\d+$/) && $standby_node_count >= 1,
"Read-only standby returns spock.node ($standby_node_count rows)");

# 3) The synced logical slot is visible on the standby.
my $standby_slot_count = qport($pg_bin, $host, $standby_port,
my ($standby_slot_count, $rc_slot) = qport_status($pg_bin, $host, $standby_port,
$dbname, $db_user,
"SELECT count(*) FROM pg_replication_slots WHERE slot_name = '$slot_name'");
$standby_slot_count =~ s/\s+//g;
ok($rc_slot == 0,
"Standby answered pg_replication_slots SELECT within 10s (psql rc=$rc_slot)");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The regression still leaves one standby probe unbounded.

These reads now fail fast, but the later write-rejection check still shells out to psql without PGCONNECT_TIMEOUT. If the standby wedges in the same initialization path, the suite can still hang after these assertions start failing. Please route that write probe through qport_status() too, or add the same timeout there.

Suggested follow-up
-my $write_rc = system(
-    "$pg_bin/psql -X -h $host -p $standby_port -d $dbname -U $db_user "
-  . "-v ON_ERROR_STOP=1 "
-  . "-c \"INSERT INTO failover_test VALUES (999, 'must_fail')\" "
-  . ">/dev/null 2>&1");
-isnt($write_rc, 0,
+my (undef, $write_rc) = qport_status($pg_bin, $host, $standby_port,
+    $dbname, $db_user,
+    "INSERT INTO failover_test VALUES (999, 'must_fail')");
+ok($write_rc != 0,
     'Write against read-only standby is rejected (read-only enforced)');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/tap/t/018_failover_slots.pl` around lines 439 - 474, The
write-rejection check still invokes psql directly without a connection timeout,
so a wedged standby can hang the test; change that write-probe to use the same
qport_status(...) wrapper used for the reads (or, if keeping the direct psql
call, set PGCONNECT_TIMEOUT to the same timeout value) so the write probe fails
fast; update the code that performs the "write-rejection" psql invocation to
either call qport_status($pg_bin, $host, $standby_port, $dbname, $db_user,
"<write SQL>") or export/prepend PGCONNECT_TIMEOUT=10 for that psql command to
match the other probes.

@mason-sharp mason-sharp merged commit cab39df into main May 27, 2026
14 checks passed
@mason-sharp mason-sharp deleted the fix/failover-slot-stuck-cherry branch May 27, 2026 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants